Behind the Magic: How Readyset Speeds Up Queries with Streaming Dataflow

Database performance is often the bottleneck in high-traffic web applications. Whether you're serving user dashboards, product catalogs, or personalized content, the same SQL queries get executed repeatedly. Caching sounds like the natural solution. However, traditional caching is hard to implement correctly: you need to wire a cache such as Redis into your application logic, keep cache entries in sync with the database, and handle edge cases like stale data.

Readyset takes a different approach. It offers the speed of a key-value cache but with the transparency and consistency of your database. You don’t need to rewrite queries or business logic. Just point your application to Readyset, declare the queries you want cached, and it handles the rest — including staying up to date with every insert, update, and delete. But how exactly does this work?

To understand Readyset’s performance and consistency guarantees, we need to look beneath the surface. At its core, Readyset operates as a read replica: it connects to MySQL over the standard asynchronous replication stream and maintains a local copy of the data in RocksDB as base tables that back the cached queries. The real differentiator, though, is the architecture it’s built on: a model known as partially-stateful streaming dataflow.

What Is a Dataflow Graph?

Readyset transforms each cached SQL query into a dataflow graph. Think of it as a long-running computation where each node represents a relational operator — a JOIN, GROUP BY, or COUNT. These nodes are connected, and changes to base table rows flow through this graph in real time, updating cached results incrementally.

This model is inspired by research at MIT (check out the Noria project) and forms the backbone of Readyset’s query engine. The advantage is that it avoids the overhead of re-running entire SQL queries every time the data changes. Instead, only the affected nodes of the graph are recomputed.

Let’s look at a minimal example of how Readyset builds and executes a dataflow graph for a query that aggregates over a single table:

SELECT status, COUNT(*) AS total FROM orders GROUP BY status;

This query returns the total number of orders for each status (e.g., pending, shipped, cancelled). Below is a simplified text sketch of the corresponding dataflow graph:
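
orders base table (replicated copy in RocksDB)
                 |
                 v
GroupBy(status): maintains COUNT(*) per status group
                 |
                 v
Reader node: in-memory results served to clients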

What Happens at Runtime

1. Data Ingest

Readyset maintains a local, continuously replicated copy of the orders table in RocksDB. Changes are streamed in via MySQL’s binlog.
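
Because Readyset consumes the binlog as a replica, the upstream MySQL server needs binary logging enabled in row format (exact requirements may vary by Readyset version). A quick sanity check on the primary:

SHOW VARIABLES LIKE 'log_bin';        -- binary logging must be ON
SHOW VARIABLES LIKE 'binlog_format';  -- row-based replication: ROW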

2. GroupBy Node

  • On initial execution or during cache warming, the orders table is scanned, and rows are grouped by status.
  • The GroupBy node calculates COUNT(*) for each group.
  • On future inserts, updates, or deletes, Readyset incrementally adjusts only the affected group count; there’s no need to reprocess the entire table or re-run the full query (see the sketch below).
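
As a minimal illustration (the row values here are hypothetical), a single write touches exactly one group’s running count:

-- cached state before: ('pending', 41), ('shipped', 17)
INSERT INTO orders (id, status) VALUES (1001, 'pending');
-- Readyset applies a +1 delta to the 'pending' group only:
-- cached state after:  ('pending', 42), ('shipped', 17)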

3. Reader Node

  • The results of the aggregation are stored in memory by the Reader node.
  • When the query is run again, results are served through in-memory key-value lookups, keeping response times extremely fast.
  • If memory pressure occurs, Readyset can evict entries and reload them as needed via an upquery from the underlying data.

Parsing and Planning

When a query arrives at Readyset, it is first parsed and normalized by the internal SQL engine (the SQL module). The query is then analyzed and transformed into a database-agnostic intermediate representation (IR), which is compiled into a dataflow graph where each logical operation corresponds to a specific node type in the engine.

Once the graph is created, it becomes a “live” component of the Readyset runtime. Updates from the database flow through it continuously, and query results are maintained in memory in a format optimized for fast lookups. This compilation process happens only once, on first use. After that, Readyset uses a digest (a normalized identifier for the query’s logical structure) to recognize and route repeated executions of the same query, even when parameter values differ.
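
For instance, these two executions (borrowing the stories table from a later example) share the same normalized structure, so both are routed to the same graph:

SELECT * FROM stories WHERE id = 1;
SELECT * FROM stories WHERE id = 2;
-- both normalize to the parameterized form:
--   SELECT * FROM stories WHERE id = ?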

Keeping the Cache Fresh: MySQL Binlog and Replication

Traditional caches often struggle with stale data. For example, ProxySQL supports query caching using digests, but relies solely on a time-to-live (TTL) setting for invalidation. In contrast, Readyset avoids these issues by integrating directly with the database’s replication stream, keeping cached results in sync in real time.

For MySQL, it acts as a replica that reads the binary log (binlog) — a real-time stream of changes. Each insert, update, and delete operation is represented as a compact binary event. Readyset parses these events and maps them to affected query results.

Instead of invalidating a cache entry, it updates it incrementally. For example, if a new vote is inserted into a table, only the affected COUNT or JOIN nodes in the graph are updated. No re-querying is needed.
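
To make this concrete with the earlier orders example: in row format, a binlog UPDATE event carries the row’s before and after images (assuming the default binlog_row_image = FULL), which is exactly what an incremental GROUP BY needs:

UPDATE orders SET status = 'shipped' WHERE id = 1001;
-- binlog row event: before image has status = 'pending',
--                   after image has status = 'shipped'
-- Readyset applies -1 to the 'pending' count and +1 to the
-- 'shipped' count in the GroupBy node; no query is re-executed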

This leads to extremely low-latency updates and consistent cache results, typically within a few milliseconds of the write.

Partial Materialization: Memory-Efficient Caching

One challenge with materializing query results in memory is the potential for a memory blowup. If every possible query result were stored, the graph could consume an enormous amount of RAM.

Readyset addresses this with partial materialization. Instead of storing every possible result up front, it starts with an empty cache.

When an application issues a SQL query, for example:

SELECT * FROM stories WHERE id = 1;

Readyset checks whether the result for id = 1 is already available in its internal cache, the Reader node described earlier. If it is, the result is returned immediately from the cache.

Now, suppose the next query is:

SELECT * FROM stories WHERE id = 2;

If id = 2 hasn’t been queried before and isn’t cached, Readyset triggers an upquery — a backward traversal through its dataflow graph that recomputes the result by reading from the base table stored in RocksDB. This base table is kept up to date through replication from the upstream MySQL database. Once computed, the result is returned and cached in the Reader node for future use.

This notion of on-demand filling applies not only to the Reader node, but also to any interior nodes of the dataflow graph that need to maintain state. This includes aggregations, joins, and so on.

This approach allows Readyset to incrementally populate its cache on demand, avoiding the overhead of preloading everything while still delivering low-latency results for repeated queries.

Serving Queries

Cached queries in Readyset behave like prepared statements. When a query comes in, Readyset pattern-matches it against the existing caches. If a match is found, the query is resolved with a fast in-memory lookup from the Reader node. If not, it is proxied to the primary database.

Because the caches are explicitly declared with CREATE CACHE FROM ..., developers retain control over what’s cached. 
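
For example, to cache the aggregation from earlier (the cache name here is arbitrary, and management statements may vary slightly across Readyset versions):

CREATE CACHE orders_by_status FROM
    SELECT status, COUNT(*) AS total FROM orders GROUP BY status;

SHOW CACHES;                  -- list the currently declared caches
DROP CACHE orders_by_status;  -- stop caching this query when no longer needed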

Why Is This Better Than Traditional Caching?

Readyset avoids many of the pitfalls of common cache architectures:

  • No manual cache invalidation logic
  • No tight coupling between app and cache
  • No risk of thundering herd issues
  • Compatible with existing MySQL/Postgres clients

Where conventional caches offer fast lookups but weaker data freshness, and read replicas offer consistency but limited scalability, Readyset provides both. It does this by rethinking how query results are cached, updated, and served — using a streaming dataflow model that enables low-latency, consistent, and incremental updates.

The result is a system that feels like magic, but is built on serious systems research and a production-grade Rust codebase.

Looking Ahead 

In future posts, we’ll explore more ways Readyset helps with database scaling, including the recently launched QueryPilot, now in beta for MySQL, which automatically identifies and caches the right queries for you. It builds on Readyset's cache engine to deliver optimized performance without any manual tuning. 

Learn More

Blog Post: When and Why to Cache a Query — Addressing Developer Pain Points