How Readyset Rewrites Your SQL: Inside the Query Transformation Pipeline

Why Query Rewriting Matters for Dataflow Most databases work the same way: a query arrives, the engine builds an execution plan, scans tables, joins rows, filters, aggregates, and returns the result. Every time the query runs, the work repeats from scratch. This pull-based model has served relational databases for decades, but it carries an inherent cost: read latency is proportional to the complexity of the query and the size of the data it touches. Readyset takes a fundamentally different ap

Vassili Zarouba

2026-04-28 · 17 min read

Why Query Rewriting Matters for Dataflow

Most databases work the same way: a query arrives, the engine builds an execution plan, scans tables, joins rows, filters, aggregates, and returns the result. Every time the query runs, the work repeats from scratch. This pull-based model has served relational databases for decades, but it carries an inherent cost: read latency is proportional to the complexity of the query and the size of the data it touches.

Readyset takes a fundamentally different approach. Instead of re-executing queries on demand, Readyset compiles each query into a dataflow graph — a network of operators (joins, filters, aggregations, projections) that continuously maintains the query's result as the underlying data changes. When a row is inserted, updated, or deleted in the upstream database, the change propagates through the graph, and the cached result is incrementally updated. Reads become lookups into a pre-computed materialized view, not full query executions.

This is the key difference: traditional engines optimize how to execute a query each time it runs. Readyset optimizes once, at cache-creation time, and then maintains the result incrementally forever. The tradeoff is that the query must be expressed in a form the dataflow engine can compile — and that form is more restrictive than what SQL allows.

What the Dataflow Engine Requires

SQL is a declarative language. The same logical query can be written in many equivalent ways: correlated subqueries, derived tables, CTEs, LATERAL joins, nested aggregations. A traditional optimizer treats these as interchangeable representations and picks the best execution plan regardless of syntax.

Readyset's dataflow compiler is not a traditional optimizer. It translates SQL into a directed acyclic graph of streaming operators, where each operator receives changes from its inputs and emits changes to its outputs. This architecture imposes structural constraints that SQL syntax doesn't:

Binary joins with equality predicates. Each join in the dataflow graph connects exactly two inputs via column-equality predicates (a.id = b.id). The engine uses these equalities to maintain hash-based join state. Range predicates, expression-based join keys, or multi-table ON conditions are not directly supported.

No correlated execution. In a traditional engine, a correlated subquery runs once per outer row — a nested-loop pattern. The dataflow graph has no concept of "per outer row." Every operator sees the full stream of changes from its inputs. Correlated subqueries must be rewritten into equivalent joins that the dataflow can maintain incrementally.

Flat join structure preferred. Derived tables (subqueries in FROM) are supported by the dataflow engine, but with a cost: a derived table compiles into a fully materialized intermediate node that cannot be parameterized. Every distinct combination of input data produces a stored result, regardless of whether the outer query needs it. Inlining the derived table — absorbing its FROM items, WHERE filters, and projections into the outer query — eliminates this intermediate materialization and lets the engine build parameterized lookups directly against the base tables. The rewrite pipeline aggressively inlines derived tables where semantically safe, reserving the materialized form for cases where inlining would change the query's meaning.

Supported join types. The engine supports INNER JOIN, LEFT OUTER JOIN , and CROSS JOIN. RIGHT JOIN and FULL OUTER JOIN are not supported because their incremental maintenance in a streaming context requires tracking absence of matches on both sides — a significantly harder problem.

Aggregation boundaries. GROUP BY and aggregate functions (COUNT, SUM, etc.) compile into stateful operators that maintain running totals. The engine requires that each aggregated query projects at least one aggregate-derived expression, and that GROUP BY keys are explicit column references (not positional numbers or aliases).

These constraints mean that a syntactically valid SQL query — one that PostgreSQL or MySQL would execute without complaint — may not be directly compilable by Readyset, or may compile into a less efficient dataflow graph than necessary. The query rewrite pipeline exists to bridge this gap.

The Rewrite Pipeline: Bridging SQL to Dataflow

When a user issues CREATE CACHE for a query, Readyset runs the query through a multi-pass rewrite pipeline that transforms arbitrary SQL into the canonical form the dataflow engine expects. The pipeline is organized into three blocks:

Block	Purpose	Example
A — Normalization	Desugar syntax, resolve schemas, qualify columns	SELECT * becomes SELECT t.id, t.name, ...
B — Deep Rewrites	Decorrelate subqueries, flatten derived tables, optimize	WHERE id IN (SELECT ...) becomes a join
C — Cleanup	Remove redundant clauses, parameterize literals	ORDER BY id LIMIT 10 removed when result is provably single-row

Each pass is semantics-preserving: the rewritten query returns the same result as the original for all possible data. The passes build on each other — Block A normalizes the SQL into a canonical form that Block B's transformations can reliably operate on, and Block C cleans up artifacts left by Block B.

Block A: Making SQL Canonical

Before any deep transformation can happen, the query must be in a predictable shape. Block A handles this:

Schema resolution binds table and column names to the actual schema metadata Readyset has replicated from the upstream database. This is essential for later passes that need to know primary keys, unique constraints, and column types.
Star expansion replaces SELECT * with the explicit column list. Every downstream pass expects to see named columns, not wildcards.
Column qualification ensures every column reference is prefixed with its table name (id becomes t.id). This prevents ambiguity when multiple tables have columns with the same name.
USING desugaring converts JOIN ... USING(id) into JOIN ... ON (a.id = b.id). The dataflow engine works with ON predicates, not USING clauses.

After Block A, the query is fully resolved, qualified, and desugared — a clean foundation for the transformations that follow.

Block B: The Heavy Lifting

Block B is where the real work happens. These passes transform SQL constructs that the dataflow engine cannot handle into equivalent constructs that it can. The ordering matters — each pass prepares the ground for the next.

Array Constructor Rewrite

The first Block B pass, and PostgreSQL-specific. PostgreSQL's ARRAY(SELECT ...) constructor produces an array value from a subquery's rows — a per-outer-row scalar aggregation shape the dataflow engine has no direct operator for. The pipeline rewrites each occurrence into a LATERAL LEFT JOIN whose body wraps the original subquery in array_agg(...), with a COALESCE(..., ARRAY[]) on the outer side so empty subqueries yield an empty array rather than NULL. ORDER BY and DISTINCT inside the constructor are copied into the array_agg call (required for correctness when the subquery also has LIMIT/Top-K). By running first, this pass turns a SQL construct the later passes wouldn't recognize into a LATERAL join they already know how to handle.

-- Before: ARRAY(SELECT ...) constructor — no direct dataflow operator
SELECT u.name,
       ARRAY(SELECT p.title FROM posts p WHERE p.user_id = u.id) AS post_titles
FROM users u

-- After: LATERAL + array_agg + COALESCE — shapes that the rest of the pipeline understands
SELECT u.name,
       COALESCE(array_subq.agg_result, ARRAY[]) AS post_titles
FROM users u
LEFT JOIN LATERAL (
    SELECT array_agg(inner_subq.title) AS agg_result
    FROM (SELECT p.title FROM posts p WHERE p.user_id = u.id) inner_subq
) array_subq ON TRUE

Redundant Join Elimination

Queries generated by ORMs often contain redundant self-joins — the same table joined to itself on its primary key, with all projected columns coming from one side. The pipeline detects and eliminates these early, reducing the join graph complexity before the heavier transformations that follow.

Left-Spine Hoisting

Before decorrelating subqueries, the pipeline attempts to inline the leftmost derived table in FROM — but only at the top level of the query. This is the one position where a Top-K pattern (ORDER BY ... LIMIT) is most valuable: at the top level, the LIMIT can be parameterized (e.g., LIMIT ?) and the dataflow compiler can deploy a native Top-K node — a streaming operator purpose-built for maintaining the top N rows incrementally. If the subquery were left nested and processed later by the general decorrelation pass, the Top-K would be replaced with a ROW_NUMBER() based filter — functionally correct but less efficient, and with the LIMIT no longer parameterizable.

-- Before: Top-K nested in a derived table — dataflow would emit a ROW_NUMBER filter
SELECT sq.id, sq.name, sq.score
FROM (SELECT id, name, score FROM products ORDER BY score DESC LIMIT ?) AS sq

-- After left-spine hoist: Top-K at the top level — native, parameterized Top-K node
SELECT id, name, score
FROM products
ORDER BY score DESC
LIMIT ?

With the ORDER BY and LIMIT at the top level, the dataflow compiler recognizes the Top-K shape and deploys a native streaming operator. The LIMIT remains a parameter (?), so a single cached dataflow graph serves every LIMIT 10, LIMIT 50, LIMIT 1000 variant. Column rebinding, ORDER BY rewriting, and LIMIT/OFFSET composition are all handled by a shared inlining API reused by every inlining site in the pipeline.

Subquery Decorrelation

The most complex transformation in the pipeline, and the reason the preceding passes exist — they prepare the query structure so decorrelation can see and operate on all correlated references. Consider:

SELECT o.id, o.total
FROM orders o
WHERE o.total > (SELECT AVG(total) FROM orders WHERE region = o.region)

The subquery references o.region from the outer query — it's correlated. A traditional engine would execute the subquery once per outer row. Readyset's dataflow has no per-row execution model, so the pipeline rewrites this into a join:

SELECT o.id, o.total
FROM orders o
INNER JOIN (
    SELECT region, AVG(total) AS avg_total
    FROM orders
    GROUP BY region
) AS sq ON o.region = sq.region
WHERE o.total > sq.avg_total

The correlated predicate region = o.region becomes a GROUP BY key and a join condition. The subquery is now a derived table that can be further flattened. The dataflow engine compiles this into a join operator that maintains the per-region average incrementally.

This decorrelation handles EXISTS, NOT EXISTS, IN, NOT IN, scalar subqueries, and LATERAL joins. Each has its own semantics — LATERAL with COUNT requires LEFT JOIN with COALESCE to preserve zero-count semantics; nested LATERALs correlating with grandparent scopes require wrapper flattening to eliminate scope boundaries; and IN / NOT IN / scalar subqueries against potentially-NULL operands require three-valued logic probes whose applicability depends on the operator and where the predicate appears, covered next.

Three-Valued Logic for `IN`, `NOT IN`, and Scalar Subqueries

SQL boolean expressions evaluate to TRUE, FALSE, or NULL — three-valued logic (3VL). Most operators make this transparent, but IN / NOT IN against a subquery that can produce NULL do not, and the right handling depends on where the predicate appears:

NOT IN in WHERE requires 3VL guards. If the RHS produces even a single NULL, a NOT IN predicate that would otherwise be FALSE becomes NULL — and in WHERE, NULL filters the row out while FALSE was meant to keep it (or vice versa, depending on surrounding logic). A naive decorrelation into a LEFT ANTI JOIN loses this distinction and returns wrong results.
IN in WHERE does not need 3VL guards. Both FALSE (no match, RHS has no NULLs) and NULL (no match, RHS has NULLs) cause WHERE to discard the row, so the distinction is immaterial — a plain decorrelation to a semi-join is already correct.
IN or NOT IN in the SELECT list always needs 3VL guards, regardless of operator. The predicate's value is projected out of the query, and callers may test it with IS NULL, pass it to another expression, or compare it to another boolean — the TRUE/FALSE/NULL distinction must be preserved exactly.
Scalar subqueries in the SELECT list sit alongside these cases: their NULL-vs-empty-result semantics require the same probe machinery.

Readyset handles all of these by installing probe joins alongside the decorrelated subquery. Two probes span the full truth table:

NP (null-present): EXISTS(rhs WHERE first_field IS NULL) — does the RHS produce a NULL in this correlation partition?
EP (existence): EXISTS(rhs) — is the RHS non-empty for this correlation partition?

Both are materialized as LEFT LATERAL joins projecting a sentinel present_ column that downstream predicates test with IS [NOT] NULL. A small boolean formula over lhs IS NULL, NP.present_, EP.present_, and the anti-join match yields the exact 3VL result SQL semantics require.

Probes are installed only when null inference can't prove them unnecessary. Before materializing either probe, the pipeline runs a nullability analysis on both sides of the predicate:

If the RHS comparison column is provably null-free (from NOT NULL constraints, non-nullable expressions, or grouping/aggregation that guarantees non-NULL outputs), the NP probe is skipped — a NULL can never occur, so the 3VL branch it guards is unreachable.
If the LHS expression is provably null-free, the EP probe is skipped — the "LHS is NULL" branch of the 3VL formula cannot fire.
Only the branches that nullability analysis cannot statically eliminate produce runtime probes.

This matters because every probe is an additional LEFT LATERAL join: extra nodes, extra state, extra work on every upstream change. Suppressing probes that are provably unnecessary keeps the dataflow graph minimal and aligned with what the query actually needs.

Identical probes are materialized once and reused. Real queries often contain multiple subqueries with the same RHS and correlation — two NOT IN clauses against the same subselect, a scalar subquery referenced in both SELECT and WHERE, repeated predicates generated by an ORM, and so on. A structural ProbeRegistry keys probes by a hash of the normalized RHS body and the correlation predicate (post-normalization, so cosmetic differences don't cause misses). A second occurrence finds the existing entry and reuses its LATERAL joins instead of emitting duplicates. The registry also supports lazy upgrade: if the first occurrence needed only NP but a later occurrence also needs EP, the EP probe is added to the same registry entry, avoiding a parallel set of joins for what is the same underlying subquery.

The net effect: 3VL is exact, the dataflow graph carries only the probe machinery the query actually requires, and repeated subqueries share a single compiled form.

General Derived Table Inlining

After decorrelation, the query may contain derived tables — both original ones that the left-spine pass didn't handle, and new ones introduced by decorrelation itself (as join targets for unnested subqueries). This second inlining pass flattens them into the outer join structure, eliminating unnecessary intermediate materializations.

-- Before: aggregate computed in a derived table, then filtered by the outer query
SELECT sq.name, sq.total
FROM (SELECT name, SUM(amount) AS total FROM orders GROUP BY name) AS sq
WHERE sq.total > 100

-- After inlining: derived table absorbed; outer WHERE on the aggregate migrates to HAVING
SELECT name, SUM(amount) AS total
FROM orders
GROUP BY name
HAVING SUM(amount) > 100

The outer WHERE referencing the aggregate (sq.total > 100) migrates to HAVING, since after inlining it applies to a grouped result. The GROUP BY keys, aggregate projections, and column rebinding are handled by the same shared inlining API used by the left-spine pass.

Not every derived table can be inlined. The pipeline checks for window functions that would change behavior (their partition sizes depend on the row set, which changes after inlining), nested aggregation conflicts, and self-join introduction. When inlining would change semantics, the query is left as-is and handled as a derived table in the dataflow graph — or rejected if the engine cannot support it.

Join Reordering

Two semantically identical queries that list joins in a different order — FROM a JOIN b ON ... JOIN c ON ... vs FROM a JOIN c ON ... JOIN b ON ... — should compile into the same dataflow graph and share the same cache. But the dataflow compiler produces a different graph for each join ordering, so structural differences in the SQL text lead to redundant caches for the same logical query.

It's worth stressing that this pass is not a cost-based join optimizer. Its purpose is purely canonicalization: producing a deterministic, syntax-independent join sequence so that logically equivalent queries share the same cached dataflow graph. Cost, cardinality, and statistics play no role here — the scoring is a structural tiebreaker, not a performance heuristic. True cost-based join reordering is a separate upcoming pass that will operate on this canonical form: once in place, the downstream optimization step will be configurable to run cost-based join reordering, filter hoisting, or neither — and in every case it will start from the same canonicalized shape produced here.

The pipeline normalizes join order using a deterministic structural algorithm: at each step it scores candidate joins by predicate proximity (preferring joins whose ON conditions reference the most recently joined tables) combined with the number of cross-table equalities the candidate brings into scope, and breaks ties lexicographically by relation name. This produces a canonical join sequence regardless of how the user originally wrote the query. The same approach normalizes comma-separated FROM items into explicit CROSS JOINs, ensuring consistent structure.

Clause Normalization for Semantic Fingerprinting

Readyset identifies structurally identical queries so they can share a single cached dataflow graph. Two queries that differ only in superficial ways — GROUP BY 1, 2 vs GROUP BY name, region, or ORDER BY total DESC vs ORDER BY 3 DESC — are semantically identical and should hit the same cache.

The pipeline normalizes these clauses by resolving positional references (GROUP BY 1) and alias references (ORDER BY total) to their underlying expressions (GROUP BY t.name, ORDER BY SUM(t.amount)). After normalization, semantically equivalent queries produce identical AST structures, enabling reliable fingerprint-based cache matching.

Filter Hoisting

The final Block B pass. After all structural transformations, join reordering, and clause normalization have produced the canonical query shape, the pipeline hoists parameterized filters (WHERE id = ?) from inside derived tables up to the outermost WHERE clause. These filters are most efficient at the top level, where the dataflow engine can use them for key-based lookups into materialized state. By running last, filter hoisting operates on the final semantic fingerprint shape — the same canonical form that will be used for cache matching — ensuring the hoisted filters land in the right structural position.

Today this pass runs unconditionally as the last step of Block B. Once cost-based join reordering lands, this slot becomes a configurable optimization step: the pipeline will choose between filter hoisting, cost-based join reordering, or neither — each starting from the canonical shape produced by the earlier passes.

Block C: Final Cleanup

After Block B's transformations, the query may have redundant clauses. Block C strips ORDER BY and LIMIT from queries that provably return at most one row (e.g., filtering on a unique key), and auto-parameterizes literal values so that structurally identical queries with different constants share the same cached dataflow graph.

The Result

By the time a query exits the rewrite pipeline, it is in a form the dataflow engine can directly compile:

All subqueries are decorrelated into joins
All derived tables are inlined (or left as supported derived-table operators)
All column references are fully qualified
All join predicates are column-equality pairs
Joins are reordered into the canonical semantic-fingerprint shape
GROUP BY keys and ORDER BY expressions are normalized to explicit column references (no positional or alias indirection)
Redundant joins and clauses are eliminated
Parameterizable filters are at the top level
All structural invariants required by the dataflow engine are satisfied — the query is guaranteed to be compilable without further transformation

The dataflow compiler then translates this canonical SQL into a graph of streaming operators — and from that point on, every data change flows through the graph, keeping the cached result up to date without ever re-executing the original query.

This is how Readyset turns SQL queries into live, incrementally-maintained caches: not by executing faster, but by never needing to execute again.

Future Work

Today's pipeline is built on general SQL semantics and relational-algebra principles. Every pass is a broadly-applicable rewrite: a heuristic that works for any query matching a structural shape, backed by a soundness argument that holds for any data. This generality is a feature — it keeps the pipeline compact, composable, and easy to reason about — but it is also where the next round of work lives.

Cost-based join reordering. The current Join Reordering pass is a pure canonicalization step; it doesn't use statistics or try to minimize work. A real cost-based join optimizer (CBJO) is the next major addition. Once it lands, the last Block B slot becomes a configurable optimization step — the pipeline will choose between filter hoisting, cost-based join reordering, or neither — and every option will start from the canonical shape produced by the earlier passes.

Relaxing decorrelation and inlining guardrails. Several of the current guards are deliberately cautious: they reject transformations that a more targeted analysis could safely admit. Examples include tighter window-function partition-stability analysis, composite-key redundant-join elimination, and per-column decomposition of mixed aggregate/non-aggregate expressions after inlining. Each of these is a straightforward relaxation of an existing guard — a narrower check in place of a broader veto.

From generalized heuristics to pattern-specific transformations. The passes that exist today were intentionally built as generalized rewrites — structural patterns that apply broadly. The next phase of work is the inverse: identifying specific query shapes that the generalized heuristics don't recognize, and adding targeted rewrites for them. Real workloads are full of patterns — produced by ORMs, BI tools, or common reporting idioms — that are semantically simple but don't fit any of the current structural templates. Each such pattern becomes a new pass or a new branch in an existing one: a narrow match on a concrete shape, with its own soundness argument, composed into the pipeline alongside the general rewrites.

The pipeline's architecture — a sequence of independent, composable, semantics-preserving passes — makes all of this incremental: each capability is a new pass or a relaxed guard, tested against the full suite of regression tests and verified for semantic preservation.

Want to see Readyset in action?

Book a demo and see how Readyset can accelerate your database.

Still scaling the hard way?

Modern applications demand instant performance, even under unpredictable load. Readyset helps you eliminate slow queries, stabilize latency, and scale confidently.

Revolutionize your database performance with Readyset

Serve requests at sub-millisecond latencies with the modern database scaling and query caching system for MySQL and PostgreSQL.

Product

Readyset Cloud Readyset Private Readyset Community QueryPilot rdst CLI Pricing

Resources

Blog Case Studies Company Documentation

Legal

Join our newsletter

Stay updated with the latest news, insights, and developments from Readyset — straight to your inbox.

How Readyset Rewrites Your SQL: Inside the Query Transformation Pipeline

From parser combinators to hand-written TDOP: On adopting sqlparser-rs at Readyset

/brutal-review and the new 80/20 rule

Readyset Adds New Query Capabilities With the Bucket Function

Still scaling the hard way?