How Readyset Rewrites Your SQL: Inside the Query Transformation Pipeline
14 min read
•
10 days ago

Why Query Rewriting Matters for Dataflow
Most databases work the same way: a query arrives, the engine builds an execution plan, scans tables, joins rows, filters, aggregates, and returns the result. Every time the query runs, the work repeats from scratch. This pull-based model has served relational databases for decades, but it carries an inherent cost: read latency is proportional to the complexity of the query and the size of the data it touches.
Readyset takes a fundamentally different approach. Instead of re-executing queries on demand, Readyset compiles each query into a dataflow graph — a network of operators (joins, filters, aggregations, projections) that continuously maintains the query's result as the underlying data changes. When a row is inserted, updated, or deleted in the upstream database, the change propagates through the graph, and the cached result is incrementally updated. Reads become lookups into a pre-computed materialized view, not full query executions.
This is the key difference: traditional engines optimize how to execute a query each time it runs. Readyset optimizes once, at cache-creation time, and then maintains the result incrementally forever. The tradeoff is that the query must be expressed in a form the dataflow engine can compile — and that form is more restrictive than what SQL allows.
What the Dataflow Engine Requires
SQL is a declarative language. The same logical query can be written in many equivalent ways: correlated subqueries, derived tables, CTEs, LATERAL joins, nested aggregations. A traditional optimizer treats these as interchangeable representations and picks the best execution plan regardless of syntax.
Readyset's dataflow compiler is not a traditional optimizer. It translates SQL into a directed acyclic graph of streaming operators, where each operator receives changes from its inputs and emits changes to its outputs. This architecture imposes structural constraints that SQL syntax doesn't:
Binary joins with equality predicates. Each join in the dataflow graph connects exactly two inputs via column-equality predicates (a.id = b.id). The engine uses these equalities to maintain hash-based join state. Range predicates, expression-based join keys, or multi-table ON conditions are not directly supported.
No correlated execution. In a traditional engine, a correlated subquery runs once per outer row — a nested-loop pattern. The dataflow graph has no concept of "per outer row." Every operator sees the full stream of changes from its inputs. Correlated subqueries must be rewritten into equivalent joins that the dataflow can maintain incrementally.
Flat join structure preferred. Derived tables (subqueries in FROM) are supported by the dataflow engine, but with a cost: a derived table compiles into a fully materialized intermediate node that cannot be parameterized. Every distinct combination of input data produces a stored result, regardless of whether the outer query needs it. Inlining the derived table — absorbing its FROM items, WHERE filters, and projections into the outer query — eliminates this intermediate materialization and lets the engine build parameterized lookups directly against the base tables. The rewrite pipeline aggressively inlines derived tables where semantically safe, reserving the materialized form for cases where inlining would change the query's meaning.
Supported join types. The engine supports INNER JOIN, LEFT OUTER JOIN , and CROSS JOIN. RIGHT JOIN and FULL OUTER JOIN are not supported because their incremental maintenance in a streaming context requires tracking absence of matches on both sides — a significantly harder problem.
Aggregation boundaries. GROUP BY and aggregate functions (COUNT, SUM, etc.) compile into stateful operators that maintain running totals. The engine requires that each aggregated query projects at least one aggregate-derived expression, and that GROUP BY keys are explicit column references (not positional numbers or aliases).
These constraints mean that a syntactically valid SQL query — one that PostgreSQL or MySQL would execute without complaint — may not be directly compilable by Readyset, or may compile into a less efficient dataflow graph than necessary. The query rewrite pipeline exists to bridge this gap.
The Rewrite Pipeline: Bridging SQL to Dataflow
When a user issues CREATE CACHE for a query, Readyset runs the query through a multi-pass rewrite pipeline that transforms arbitrary SQL into the canonical form the dataflow engine expects. The pipeline is organized into three blocks:
| Block | Purpose | Example |
|---|---|---|
| A — Normalization | Desugar syntax, resolve schemas, qualify columns | SELECT * becomes SELECT t.id, t.name, ... |
| B — Deep Rewrites | Decorrelate subqueries, flatten derived tables, optimize | WHERE id IN (SELECT ...) becomes a join |
| C — Cleanup | Remove redundant clauses, parameterize literals | ORDER BY id LIMIT 10 removed when result is provably single-row |
Each pass is semantics-preserving: the rewritten query returns the same result as the original for all possible data. The passes build on each other — Block A normalizes the SQL into a canonical form that Block B's transformations can reliably operate on, and Block C cleans up artifacts left by Block B.
Block A: Making SQL Canonical
Before any deep transformation can happen, the query must be in a predictable shape. Block A handles this:
- Schema resolution binds table and column names to the actual schema metadata Readyset has replicated from the upstream database. This is essential for later passes that need to know primary keys, unique constraints, and column types.
- Star expansion replaces
SELECT *with the explicit column list. Every downstream pass expects to see named columns, not wildcards. - Column qualification ensures every column reference is prefixed with its table name (
idbecomest.id). This prevents ambiguity when multiple tables have columns with the same name. - USING desugaring converts
JOIN ... USING(id)intoJOIN ... ON (a.id = b.id). The dataflow engine works withONpredicates, notUSINGclauses.
After Block A, the query is fully resolved, qualified, and desugared — a clean foundation for the transformations that follow.
Block B: The Heavy Lifting
Block B is where the real work happens. These passes transform SQL constructs that the dataflow engine cannot handle into equivalent constructs that it can. The ordering matters — each pass prepares the ground for the next.
Array Constructor Rewrite
The first Block B pass, and PostgreSQL-specific. PostgreSQL's ARRAY(SELECT ...) constructor produces an array value from a subquery's rows — a per-outer-row scalar aggregation shape the dataflow engine has no direct operator for. The pipeline rewrites each occurrence into a LATERAL LEFT JOIN whose body wraps the original subquery in array_agg(...), with a COALESCE(..., ARRAY[]) on the outer side so empty subqueries yield an empty array rather than NULL. ORDER BY and DISTINCT inside the constructor are copied into the array_agg call (required for correctness when the subquery also has LIMIT/Top-K). By running first, this pass turns a SQL construct the later passes wouldn't recognize into a LATERAL join they already know how to handle.
Redundant Join Elimination
Queries generated by ORMs often contain redundant self-joins — the same table joined to itself on its primary key, with all projected columns coming from one side. The pipeline detects and eliminates these early, reducing the join graph complexity before the heavier transformations that follow.
Left-Spine Hoisting
Before decorrelating subqueries, the pipeline attempts to inline the leftmost derived table in FROM — but only at the top level of the query. This is the one position where a Top-K pattern (ORDER BY ... LIMIT) is most valuable: at the top level, the LIMIT can be parameterized (e.g., LIMIT ?) and the dataflow compiler can deploy a native Top-K node — a streaming operator purpose-built for maintaining the top N rows incrementally. If the subquery were left nested and processed later by the general decorrelation pass, the Top-K would be replaced with a ROW_NUMBER() based filter — functionally correct but less efficient, and with the LIMIT no longer parameterizable.
With the ORDER BY and LIMIT at the top level, the dataflow compiler recognizes the Top-K shape and deploys a native streaming operator. The LIMIT remains a parameter (?), so a single cached dataflow graph serves every LIMIT 10, LIMIT 50, LIMIT 1000 variant. Column rebinding, ORDER BY rewriting, and LIMIT/OFFSET composition are all handled by a shared inlining API reused by every inlining site in the pipeline.
Subquery Decorrelation
The most complex transformation in the pipeline, and the reason the preceding passes exist — they prepare the query structure so decorrelation can see and operate on all correlated references. Consider:
The subquery references o.region from the outer query — it's correlated. A traditional engine would execute the subquery once per outer row. Readyset's dataflow has no per-row execution model, so the pipeline rewrites this into a join:
The correlated predicate region = o.region becomes a GROUP BY key and a join condition. The subquery is now a derived table that can be further flattened. The dataflow engine compiles this into a join operator that maintains the per-region average incrementally.
This decorrelation handles EXISTS, NOT EXISTS, IN, NOT IN, scalar subqueries, and LATERAL joins. Each has its own semantics — LATERAL with COUNT requires LEFT JOIN with COALESCE to preserve zero-count semantics; nested LATERALs correlating with grandparent scopes require wrapper flattening to eliminate scope boundaries; and IN / NOT IN / scalar subqueries against potentially-NULL operands require three-valued logic probes whose applicability depends on the operator and where the predicate appears, covered next.
Three-Valued Logic for IN, NOT IN, and Scalar Subqueries
SQL boolean expressions evaluate to TRUE, FALSE, or NULL — three-valued logic (3VL). Most operators make this transparent, but IN / NOT IN against a subquery that can produce NULL do not, and the right handling depends on where the predicate appears:
NOT INinWHERErequires 3VL guards. If the RHS produces even a singleNULL, aNOT INpredicate that would otherwise beFALSEbecomesNULL— and inWHERE,NULLfilters the row out whileFALSEwas meant to keep it (or vice versa, depending on surrounding logic). A naive decorrelation into aLEFT ANTI JOINloses this distinction and returns wrong results.INinWHEREdoes not need 3VL guards. BothFALSE(no match, RHS has noNULLs) andNULL(no match, RHS hasNULLs) causeWHEREto discard the row, so the distinction is immaterial — a plain decorrelation to a semi-join is already correct.INorNOT INin theSELECTlist always needs 3VL guards, regardless of operator. The predicate's value is projected out of the query, and callers may test it withIS NULL, pass it to another expression, or compare it to another boolean — theTRUE/FALSE/NULLdistinction must be preserved exactly.- Scalar subqueries in the SELECT list sit alongside these cases: their
NULL-vs-empty-result semantics require the same probe machinery.
Readyset handles all of these by installing probe joins alongside the decorrelated subquery. Two probes span the full truth table:
- NP (null-present):
EXISTS(rhs WHERE first_field IS NULL)— does the RHS produce aNULLin this correlation partition? - EP (existence):
EXISTS(rhs)— is the RHS non-empty for this correlation partition?
Both are materialized as LEFT LATERAL joins projecting a sentinel present_ column that downstream predicates test with IS [NOT] NULL. A small boolean formula over lhs IS NULL, NP.present_, EP.present_, and the anti-join match yields the exact 3VL result SQL semantics require.
Probes are installed only when null inference can't prove them unnecessary. Before materializing either probe, the pipeline runs a nullability analysis on both sides of the predicate:
- If the RHS comparison column is provably null-free (from
NOT NULLconstraints, non-nullable expressions, or grouping/aggregation that guarantees non-NULLoutputs), the NP probe is skipped — aNULLcan never occur, so the 3VL branch it guards is unreachable. - If the LHS expression is provably null-free, the EP probe is skipped — the "
LHSisNULL" branch of the 3VL formula cannot fire. - Only the branches that nullability analysis cannot statically eliminate produce runtime probes.
This matters because every probe is an additional LEFT LATERAL join: extra nodes, extra state, extra work on every upstream change. Suppressing probes that are provably unnecessary keeps the dataflow graph minimal and aligned with what the query actually needs.
Identical probes are materialized once and reused. Real queries often contain multiple subqueries with the same RHS and correlation — two NOT IN clauses against the same subselect, a scalar subquery referenced in both SELECT and WHERE, repeated predicates generated by an ORM, and so on. A structural ProbeRegistry keys probes by a hash of the normalized RHS body and the correlation predicate (post-normalization, so cosmetic differences don't cause misses). A second occurrence finds the existing entry and reuses its LATERAL joins instead of emitting duplicates. The registry also supports lazy upgrade: if the first occurrence needed only NP but a later occurrence also needs EP, the EP probe is added to the same registry entry, avoiding a parallel set of joins for what is the same underlying subquery.
The net effect: 3VL is exact, the dataflow graph carries only the probe machinery the query actually requires, and repeated subqueries share a single compiled form.
General Derived Table Inlining
After decorrelation, the query may contain derived tables — both original ones that the left-spine pass didn't handle, and new ones introduced by decorrelation itself (as join targets for unnested subqueries). This second inlining pass flattens them into the outer join structure, eliminating unnecessary intermediate materializations.
The outer WHERE referencing the aggregate (sq.total > 100) migrates to HAVING, since after inlining it applies to a grouped result. The GROUP BY keys, aggregate projections, and column rebinding are handled by the same shared inlining API used by the left-spine pass.
Not every derived table can be inlined. The pipeline checks for window functions that would change behavior (their partition sizes depend on the row set, which changes after inlining), nested aggregation conflicts, and self-join introduction. When inlining would change semantics, the query is left as-is and handled as a derived table in the dataflow graph — or rejected if the engine cannot support it.
Join Reordering
Two semantically identical queries that list joins in a different order — FROM a JOIN b ON ... JOIN c ON ... vs FROM a JOIN c ON ... JOIN b ON ... — should compile into the same dataflow graph and share the same cache. But the dataflow compiler produces a different graph for each join ordering, so structural differences in the SQL text lead to redundant caches for the same logical query.
It's worth stressing that this pass is not a cost-based join optimizer. Its purpose is purely canonicalization: producing a deterministic, syntax-independent join sequence so that logically equivalent queries share the same cached dataflow graph. Cost, cardinality, and statistics play no role here — the scoring is a structural tiebreaker, not a performance heuristic. True cost-based join reordering is a separate upcoming pass that will operate on this canonical form: once in place, the downstream optimization step will be configurable to run cost-based join reordering, filter hoisting, or neither — and in every case it will start from the same canonicalized shape produced here.
The pipeline normalizes join order using a deterministic structural algorithm: at each step it scores candidate joins by predicate proximity (preferring joins whose ON conditions reference the most recently joined tables) combined with the number of cross-table equalities the candidate brings into scope, and breaks ties lexicographically by relation name. This produces a canonical join sequence regardless of how the user originally wrote the query. The same approach normalizes comma-separated FROM items into explicit CROSS JOINs, ensuring consistent structure.
Clause Normalization for Semantic Fingerprinting
Readyset identifies structurally identical queries so they can share a single cached dataflow graph. Two queries that differ only in superficial ways — GROUP BY 1, 2 vs GROUP BY name, region, or ORDER BY total DESC vs ORDER BY 3 DESC — are semantically identical and should hit the same cache.
The pipeline normalizes these clauses by resolving positional references (GROUP BY 1) and alias references (ORDER BY total) to their underlying expressions (GROUP BY t.name, ORDER BY SUM(t.amount)). After normalization, semantically equivalent queries produce identical AST structures, enabling reliable fingerprint-based cache matching.
Filter Hoisting
The final Block B pass. After all structural transformations, join reordering, and clause normalization have produced the canonical query shape, the pipeline hoists parameterized filters (WHERE id = ?) from inside derived tables up to the outermost WHERE clause. These filters are most efficient at the top level, where the dataflow engine can use them for key-based lookups into materialized state. By running last, filter hoisting operates on the final semantic fingerprint shape — the same canonical form that will be used for cache matching — ensuring the hoisted filters land in the right structural position.
Today this pass runs unconditionally as the last step of Block B. Once cost-based join reordering lands, this slot becomes a configurable optimization step: the pipeline will choose between filter hoisting, cost-based join reordering, or neither — each starting from the canonical shape produced by the earlier passes.
Block C: Final Cleanup
After Block B's transformations, the query may have redundant clauses. Block C strips ORDER BY and LIMIT from queries that provably return at most one row (e.g., filtering on a unique key), and auto-parameterizes literal values so that structurally identical queries with different constants share the same cached dataflow graph.
The Result
By the time a query exits the rewrite pipeline, it is in a form the dataflow engine can directly compile:
- All subqueries are decorrelated into joins
- All derived tables are inlined (or left as supported derived-table operators)
- All column references are fully qualified
- All join predicates are column-equality pairs
- Joins are reordered into the canonical semantic-fingerprint shape
GROUP BYkeys andORDER BYexpressions are normalized to explicit column references (no positional or alias indirection)- Redundant joins and clauses are eliminated
- Parameterizable filters are at the top level
- All structural invariants required by the dataflow engine are satisfied — the query is guaranteed to be compilable without further transformation
The dataflow compiler then translates this canonical SQL into a graph of streaming operators — and from that point on, every data change flows through the graph, keeping the cached result up to date without ever re-executing the original query.
This is how Readyset turns SQL queries into live, incrementally-maintained caches: not by executing faster, but by never needing to execute again.
Future Work
Today's pipeline is built on general SQL semantics and relational-algebra principles. Every pass is a broadly-applicable rewrite: a heuristic that works for any query matching a structural shape, backed by a soundness argument that holds for any data. This generality is a feature — it keeps the pipeline compact, composable, and easy to reason about — but it is also where the next round of work lives.
Cost-based join reordering. The current Join Reordering pass is a pure canonicalization step; it doesn't use statistics or try to minimize work. A real cost-based join optimizer (CBJO) is the next major addition. Once it lands, the last Block B slot becomes a configurable optimization step — the pipeline will choose between filter hoisting, cost-based join reordering, or neither — and every option will start from the canonical shape produced by the earlier passes.
Relaxing decorrelation and inlining guardrails. Several of the current guards are deliberately cautious: they reject transformations that a more targeted analysis could safely admit. Examples include tighter window-function partition-stability analysis, composite-key redundant-join elimination, and per-column decomposition of mixed aggregate/non-aggregate expressions after inlining. Each of these is a straightforward relaxation of an existing guard — a narrower check in place of a broader veto.
From generalized heuristics to pattern-specific transformations. The passes that exist today were intentionally built as generalized rewrites — structural patterns that apply broadly. The next phase of work is the inverse: identifying specific query shapes that the generalized heuristics don't recognize, and adding targeted rewrites for them. Real workloads are full of patterns — produced by ORMs, BI tools, or common reporting idioms — that are semantically simple but don't fit any of the current structural templates. Each such pattern becomes a new pass or a new branch in an existing one: a narrow match on a concrete shape, with its own soundness argument, composed into the pipeline alongside the general rewrites.
The pipeline's architecture — a sequence of independent, composable, semantics-preserving passes — makes all of this incremental: each capability is a new pass or a relaxed guard, tested against the full suite of regression tests and verified for semantic preservation.
Authors