How Solana Geyser Streams Data: A Deep Dive

Not the hot spring kind of geyser. I mean Solana Geyser.

Solana Geyser is an internal plugin system (or interface) that Solana validators expose to emit events as they process blocks. It gives apps like trading bots and data analytics tools low-latency access to blockchain data.

Its main use case is serving high-volume, data-intensive applications that need to keep up with the chain as they scale.

Validators under heavy RPC load can fall behind the network, which hurts data-critical systems by slowing them down or leaving them to work with outdated information. The Geyser plugin mechanism was introduced to solve this problem: it transmits information about accounts, slots, blocks and transactions to external data stores such as SQL databases (PostgreSQL, ClickHouse) and NoSQL databases (MongoDB, Apache Cassandra). This makes indexing and caching more flexible, reduces the JSON-RPC load on validators, and lets them focus on processing transactions.

Geyser Plugins in the Wild

Since the Geyser plugin mechanism was introduced, the Solana ecosystem has produced a solid list of implementations:

PostgreSQL Plugin — The original reference implementation by the Solana/Anza team. Streams account and transaction data directly into a PostgreSQL database with connection pooling, batch inserts, and SSL support. → Docs
Kafka Plugin — Publishes Geyser data to Kafka topics, built by the Blockdaemon team. Suited for teams already running Kafka-based event pipelines. → GitHub
RabbitMQ Plugin — A RabbitMQ writer built by the Holaplex team, originally part of their NFT indexer and later extracted as a standalone plugin. → GitHub
Google Bigtable Plugin — Streams Solana blockchain data directly into Google Cloud Bigtable, integrating real-time data with Google's scalable NoSQL database. Built by the Solana team but still marked as non-production-ready. → GitHub
gRPC Connector (Mango) — A generic gRPC interface with a sample connector for writing the Geyser stream to Postgres, built by the Mango team. A simpler precursor to Yellowstone. → GitHub
Yellowstone gRPC (Dragon's Mouth) — A fully functional gRPC interface for Solana, built and maintained by Triton One. Provides slots, blocks, transactions, and account update notifications over a standardised path. The most widely adopted Geyser implementation in the ecosystem. → GitHub | Docs
Old Faithful — A comprehensive open-source archive of all Solana blocks and transactions from genesis to the current chain tip, built by Triton One as part of Project Yellowstone. The historical data layer that live Geyser streams can't provide. → Docs
Jetstreamer — A high-throughput Solana backfilling and research toolkit designed to stream historical chain data live over the network from Old Faithful, built by Anza. It can stream data at over 2.7M TPS to a local Geyser plugin and natively supports ClickHouse-friendly batching. → GitHub
Richat — Compatible with the gRPC endpoints provided by most commercial Solana RPC providers; acts as a fan-out multiplexer on top of Yellowstone, letting a small number of incoming streams serve a large number of clients. Built by lamports-dev. → GitHub

Focus: Yellowstone gRPC

This article focuses on Yellowstone gRPC (commonly referred to as "Yellowstone", or sometimes "Yellowstone Dragon's Mouth"), the most widely used Geyser plugin implementation.

It leverages gRPC, Google's high-performance framework that combines Protocol Buffers for serialization with HTTP/2 for transport, enabling fast and type-safe communication between distributed systems.

Yellowstone provides real-time streaming of:

Account updates
Transactions
Entries
Block notifications
Slot notifications

Compared to traditional WebSocket implementations, Yellowstone offers lower latency and higher stability. It also includes unary RPCs (a single request that returns a single response, as opposed to a long-lived stream) for quick one-time lookups such as the current slot or the latest blockhash. The combination of gRPC efficiency and type safety makes Yellowstone well-suited for cloud-based services and database updates.

Data Flow: From the Validator to Your Client

To understand how data moves, you need to understand what a validator is actually doing at any given moment.

A Solana validator runs multiple internal stages in a pipeline. The most relevant one here is the banking stage, where the validator executes transactions while it is the leader: accounts are read, instructions run, state changes get committed. (When another validator is leader, the replay stage does the same execution work on the blocks it receives.) It's high throughput, low tolerance for anything slow. The validator doesn't care about your downstream database. It needs to keep up with the network, full stop.

Geyser slots into this pipeline without interrupting it.

How the plugin loads

Geyser plugins are compiled as shared libraries (.so files on Linux) and loaded into the validator process at startup via a config flag:

--geyser-plugin-config /path/to/plugin-config.json

The config points to the .so file and passes any plugin-specific settings (connection strings, buffer sizes, etc.). Once loaded, the plugin runs in-process — same memory space as the validator. There's no IPC, no socket, no network hop at this stage. The validator calls directly into your plugin's exported functions.
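
As a rough illustration, the config file can be generated with a few lines of Python. The libpath key is what the validator reads to locate the shared library. The library path and every other setting below are hypothetical placeholders, since each plugin defines its own settings.

```python
import json
import tempfile
from pathlib import Path


def build_plugin_config(libpath: str, plugin_settings: dict) -> dict:
    """Return a Geyser plugin config as a plain dict.

    `libpath` tells the validator which shared library to load. The remaining
    keys are plugin-specific and are interpreted by the plugin itself.
    """
    return {"libpath": libpath, **plugin_settings}


# Hypothetical values for illustration only; check your plugin's docs for real keys.
config = build_plugin_config(
    libpath="/path/to/libmy_geyser_plugin.so",
    plugin_settings={
        "connection_string": "postgres://user@localhost/solana",
        "buffer_size": 10000,
    },
)

# Write the file you would pass to --geyser-plugin-config.
config_path = Path(tempfile.mkdtemp()) / "plugin-config.json"
config_path.write_text(json.dumps(config, indent=2))
print(config_path.read_text())
```

The validator only cares about where the library lives. Everything else in the file is handed to the plugin, which is why the settings differ so much between the PostgreSQL, Kafka and Yellowstone plugins.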

This is a deliberate design choice. Out-of-process plugins would add latency on every callback. In-process keeps it fast, but it also means a buggy plugin can crash your validator. Most production operators run a hardened, well-tested plugin like Yellowstone and nothing else.

The callback interface

The Geyser interface exposes a set of callbacks the validator fires as it processes each block. The main ones are:

update_account — fires for every account that was written to during transaction execution. This includes program accounts, token accounts, PDAs — anything with a state change. On a busy slot this can be thousands of calls.
notify_transaction — fires once per transaction with the full result: instructions, logs, inner instructions, compute units consumed, success or failure.
notify_entry — fires for each entry in the block (an entry is a batch of non-conflicting transactions that can be processed in parallel).
update_slot_status — fires as the slot moves through commitment levels: Processed → Confirmed → Finalized.
notify_block_metadata — fires once when the full block is complete with its metadata (blockhash, parent slot, timestamp, etc.).
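
To make the shape of these callbacks concrete, here is a toy Python stand-in. The real interface is a Rust trait with richer argument types, so the argument lists below are simplified assumptions, and the order in which simulate_slot fires the callbacks is made up for the demo.

```python
from dataclasses import dataclass, field


@dataclass
class RecordingPlugin:
    """Toy stand-in for a Geyser plugin: each callback only records what it saw."""
    events: list = field(default_factory=list)

    def update_account(self, slot: int, pubkey: str, lamports: int) -> None:
        self.events.append(("account", slot, pubkey))

    def notify_transaction(self, slot: int, signature: str, success: bool) -> None:
        self.events.append(("transaction", slot, signature))

    def notify_entry(self, slot: int, index: int, tx_count: int) -> None:
        self.events.append(("entry", slot, index))

    def update_slot_status(self, slot: int, status: str) -> None:
        self.events.append(("slot", slot, status))

    def notify_block_metadata(self, slot: int, blockhash: str, parent_slot: int) -> None:
        self.events.append(("block", slot, blockhash))


def simulate_slot(plugin: RecordingPlugin, slot: int) -> None:
    """Drive the callbacks for one made-up slot containing a single transaction."""
    plugin.update_account(slot, "AccountA", lamports=1_000)
    plugin.update_account(slot, "AccountB", lamports=2_000)
    plugin.notify_transaction(slot, "sig1", success=True)
    plugin.notify_entry(slot, index=0, tx_count=1)
    plugin.update_slot_status(slot, "processed")
    plugin.notify_block_metadata(slot, blockhash="hash1", parent_slot=slot - 1)
    plugin.update_slot_status(slot, "confirmed")
    plugin.update_slot_status(slot, "finalized")


plugin = RecordingPlugin()
simulate_slot(plugin, slot=100)
for event in plugin.events:
    print(event)
```

Even this single-transaction slot produces eight callbacks, which gives a feel for how quickly the call volume grows on a real validator.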

The validator calls these synchronously on its own threads, and your plugin code runs on those threads. This is the most important thing to understand about Geyser: you cannot block here. If your plugin does anything slow (a database write, a network call, even a contended mutex), you stall the validator. Yellowstone's entire design exists to solve this.

Yellowstone's internal queue

When Yellowstone's plugin implementation receives a callback, it does one thing: serialize the data and push it onto a bounded async channel. That's it. The callback returns immediately, the validator continues, and a separate thread pool handles everything downstream.

The channel is bounded intentionally. If the consumer side (the gRPC server) can't keep up with the producer side (the validator callbacks), the channel fills up. At that point Yellowstone has two choices — block the validator thread (unacceptable) or drop the update. It drops. This is documented behavior, not a bug. If your subscriber is slow or disconnected, you will miss data. That's why the from_slot reconnect parameter exists.
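
The drop-when-full policy described above can be sketched with Python's standard queue module. This is a simulation of the pattern, not Yellowstone's actual code, and DroppingChannel is a name I made up for the example.

```python
import queue


class DroppingChannel:
    """Bounded queue whose producer side never blocks: when full, the update is dropped."""

    def __init__(self, capacity: int) -> None:
        self._queue: queue.Queue = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def try_send(self, update: dict) -> bool:
        """Enqueue without waiting. Returns False (and counts a drop) if the queue is full."""
        try:
            self._queue.put_nowait(update)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self) -> list:
        """Remove and return everything currently queued."""
        items = []
        while True:
            try:
                items.append(self._queue.get_nowait())
            except queue.Empty:
                return items


channel = DroppingChannel(capacity=3)
results = [channel.try_send({"slot": 100, "seq": i}) for i in range(5)]
print(results)                        # [True, True, True, False, False]
print(channel.dropped)                # 2
received = channel.drain()
print([u["seq"] for u in received])   # [0, 1, 2]
```

The producer never waits: the fourth and fifth updates are lost because nothing drained the queue in time. That is the trade being made, where the validator stays fast and a slow consumer pays with missing data.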

The internal architecture looks roughly like this:

Banking stage thread
  → plugin callback (sync, must be fast)
    → serialize to protobuf
      → push onto bounded channel (non-blocking)

Separate thread pool
  → drain channel
    → apply subscriber filters
      → route to matching subscribers
        → write to per-subscriber send buffer
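
The two halves of that diagram can be simulated with a queue and a worker thread. This is a simplified sketch using a single worker and plain lists as send buffers; the subscriber names and the owner-based matching rule are assumptions for the demo.

```python
import queue
import threading

STOP = object()  # sentinel telling the worker to exit


def callback(channel: queue.Queue, update: dict) -> bool:
    """Producer side: hand the update off without blocking the caller."""
    try:
        channel.put_nowait(update)
        return True
    except queue.Full:
        return False


def worker(channel: queue.Queue, subscribers: dict) -> None:
    """Consumer side: drain the channel and route each update to matching subscribers."""
    while True:
        update = channel.get()
        if update is STOP:
            break
        for sub in subscribers.values():
            if update["owner"] in sub["owners"]:
                sub["buffer"].append(update)


channel: queue.Queue = queue.Queue(maxsize=1024)
subscribers = {
    "token_indexer": {"owners": {"TokenProgram"}, "buffer": []},
    "everything": {"owners": {"TokenProgram", "SystemProgram"}, "buffer": []},
}

thread = threading.Thread(target=worker, args=(channel, subscribers))
thread.start()

# Simulated validator callbacks: three account updates with different owners.
for seq, owner in enumerate(["TokenProgram", "SystemProgram", "TokenProgram"]):
    callback(channel, {"seq": seq, "owner": owner})

channel.put(STOP)
thread.join()

for name, sub in subscribers.items():
    print(name, [u["seq"] for u in sub["buffer"]])
```

The callback side does nothing but enqueue. All matching and buffering happens on the worker thread, so a slow subscriber can only ever back up the queue, never the caller.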

Server-side filtering

Before any data goes on the wire, Yellowstone evaluates each update against every active subscriber's filter set. If you subscribed with an accounts filter of owner: ["TokenkegQfeZyiNwAJbNbGKPFXCWuBvf9Ss623VQ5DA"] (the SPL Token program), you only receive account updates for accounts owned by that program. (account_include is the corresponding field on transaction filters, where it matches transactions that reference any of the listed accounts.) Everything else is dropped before serialization for your connection.

This filtering is what makes Yellowstone practical at scale. The raw Geyser stream on a busy validator is enormous: thousands of account updates and thousands of transactions per slot. Without server-side filtering, every client would need the bandwidth and CPU to process all of it and discard what they don't need. Filtering at the server means your client only pays for what it asked for.
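
Here is a small sketch of how an account update might be matched against subscriber filters. The account and owner field names follow my reading of Yellowstone's accounts filter, and the matching rule (AND across fields, OR within a list, empty filter matches everything) is a simplified assumption, so check it against the version you run. The subscription names and non-program pubkeys are made up.

```python
TOKEN_PROGRAM = "TokenkegQfeZyiNwAJbNbGKPFXCWuBvf9Ss623VQ5DA"
SYSTEM_PROGRAM = "11111111111111111111111111111111"


def matches_account_filter(update: dict, account_filter: dict) -> bool:
    """Return True if an account update passes one filter.

    An empty or missing list means "no constraint" for that field. When both
    lists are set, the update must satisfy both.
    """
    pubkeys = account_filter.get("account", [])
    owners = account_filter.get("owner", [])
    if pubkeys and update["pubkey"] not in pubkeys:
        return False
    if owners and update["owner"] not in owners:
        return False
    return True


def route(update: dict, subscriptions: dict) -> list:
    """Names of the subscriptions that should receive this update."""
    return [name for name, f in subscriptions.items() if matches_account_filter(update, f)]


subscriptions = {
    "spl_token": {"owner": [TOKEN_PROGRAM]},   # every account owned by SPL Token
    "one_wallet": {"account": ["WalletA"]},    # one specific (made-up) account
    "firehose": {},                            # no constraints: receives everything
}

token_update = {"pubkey": "TokenAcct1", "owner": TOKEN_PROGRAM}
wallet_update = {"pubkey": "WalletA", "owner": SYSTEM_PROGRAM}

print(route(token_update, subscriptions))    # ['spl_token', 'firehose']
print(route(wallet_update, subscriptions))   # ['one_wallet', 'firehose']
```

An unconstrained filter is the expensive case: it receives the full stream, which is exactly the load the other two subscriptions avoid.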

HTTP/2 and the wire format

Once an update passes filtering, Yellowstone serializes it into a Protocol Buffer binary and writes it to your HTTP/2 stream. This is where gRPC earns its place over WebSockets.

HTTP/2 multiplexes multiple logical streams over a single TCP connection, each with its own flow control. A client can run several subscriptions and unary calls over one connection, and a slow stream is throttled by its own flow-control window rather than stalling the others. This removes head-of-line blocking at the HTTP layer, although packet loss on the underlying TCP connection still delays every stream on it. A WebSocket gives you one message stream per connection with no built-in multiplexing or per-stream flow control, so backpressure has to be handled by the application.

Protocol Buffers keep the payload small. A transaction update that might be 2–3KB as JSON is typically under 500 bytes as a protobuf binary. At the throughput Solana runs at (thousands of transactions per slot, 400ms slots), that size difference adds up fast on your egress bill and on your client's deserialization CPU.
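
To get a feel for the size gap, the sketch below encodes the same made-up transaction summary as JSON and as a fixed-layout binary record. It uses struct packing as a stand-in for a binary encoding, not real protobuf, and hex instead of base58, so the absolute numbers are only indicative. The 3,000 transactions per slot in the bandwidth estimate is an assumed figure.

```python
import json
import struct

# Made-up transaction summary; real updates carry many more fields.
signature = bytes(range(64))                          # 64-byte signature
account_keys = [bytes([i]) * 32 for i in range(3)]    # three 32-byte public keys
slot, compute_units, success = 250_000_000, 48_213, True

as_json = json.dumps({
    "signature": signature.hex(),
    "accountKeys": [key.hex() for key in account_keys],
    "slot": slot,
    "computeUnitsConsumed": compute_units,
    "success": success,
}).encode("utf-8")

# Fixed layout: 64-byte signature, two u64 values, one bool, then the raw keys.
as_binary = struct.pack("<64sQQ?", signature, slot, compute_units, success) + b"".join(account_keys)

print("JSON bytes:  ", len(as_json))
print("binary bytes:", len(as_binary))


def egress_bytes_per_second(tx_per_slot: int, bytes_per_tx: int, slot_ms: int = 400) -> int:
    """Sustained egress for one subscriber receiving every transaction."""
    return tx_per_slot * bytes_per_tx * 1000 // slot_ms


# Using the article's rough sizes (2.5 KB JSON vs 500 B binary) and an assumed 3,000 tx/slot.
print(egress_bytes_per_second(3000, 2500))   # 18750000
print(egress_bytes_per_second(3000, 500))    # 3750000
```

Most of the JSON overhead here comes from field names and text-encoding the binary keys, which is the same reason real protobuf payloads come out smaller.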

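The complete path: validator callback → serialize and push onto the bounded channel → drain and filter → per-subscriber send buffer → protobuf over HTTP/2 → your client.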

Every step after the initial callback is async and decoupled from the validator. The validator doesn't know or care what happens after it fires the callback — it's already processing the next transaction.

Conclusion

If you've built anything on top of Solana's JSON-RPC and hit a wall — stale data, slow responses, validators falling behind — Geyser is what you move to next. It's not an upgrade, it's a different mental model: instead of asking a validator for data, the validator tells you.

Yellowstone makes that practical. The gRPC interface is typed, the filters keep your bandwidth sane, and the reconnect primitives are straightforward enough that you're not writing infrastructure from scratch. You're writing application logic.
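
As a sketch of what resuming with from_slot can look like on the client side, the simulation below tracks the last slot it handled and resubscribes from there after a disconnect. fake_subscribe is a hypothetical stand-in for the real gRPC subscribe call, and treating a whole slot as one update is a simplification. A real server can only replay slots it still retains.

```python
class StreamDisconnected(Exception):
    """Raised by the fake stream to simulate a dropped connection."""


def fake_subscribe(chain, from_slot=None, fail_after=None):
    """Hypothetical stand-in for a gRPC subscription.

    Yields slot numbers in order, starting at `from_slot` (or the first slot),
    and raises StreamDisconnected after `fail_after` updates if it is set.
    """
    start = chain[0] if from_slot is None else from_slot
    sent = 0
    for slot in chain:
        if slot < start:
            continue
        if fail_after is not None and sent == fail_after:
            raise StreamDisconnected()
        sent += 1
        yield slot


def consume_with_resume(chain, failures):
    """Process every slot once, resubscribing from the last seen slot after each disconnect."""
    last_slot = None
    processed = []
    attempts = list(failures) + [None]  # the final attempt never fails
    for fail_after in attempts:
        try:
            for slot in fake_subscribe(chain, from_slot=last_slot, fail_after=fail_after):
                if slot == last_slot:
                    continue  # replayed slot we already handled, skip the duplicate
                processed.append(slot)
                last_slot = slot
            break  # stream ended cleanly
        except StreamDisconnected:
            continue  # reconnect, resuming from last_slot
    return processed


chain = list(range(100, 106))
processed = consume_with_resume(chain, failures=[2, 3])
print(processed)  # [100, 101, 102, 103, 104, 105]
```

Resuming from the last seen slot means that slot is replayed, so the handler needs to tolerate or skip duplicates. In a real client you would deduplicate per update rather than per slot.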

The code in this article is a starting point. What you actually do inside the handler is where the real work is — decoding instruction data, normalising into your schema, batching writes into ClickHouse or Postgres without blocking the stream. That's where most of the design decisions live, and I'll cover that in the next one.

For now, get the connection working. Subscribe to a program you care about at Processed commitment and just print the signatures. Once you see the data flowing in almost as soon as the validator processes it, the rest becomes obvious.