Optimizing Straddled Joins in Readyset: From Hash Joins to Index Condition Pushdown

Introduction

Readyset is designed to serve queries from cached views with sub-millisecond latency. This post focuses on the cold path: cases where a cache miss forces execution against the base tables. In these scenarios, Readyset must evaluate the query from scratch, including materializing intermediate results. Specifically, we look at straddled joins, where filtering predicates apply to both sides of the join in addition to the ON clause.

Example:

SELECT u.id, u.name, o.order_id, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.email = 'some@provide.com'
  AND o.status = 'SHIPPED';

In a previous optimization, we transitioned from nested loop joins to hash joins (see https://readyset.io/blog/introducing-hash-join-algorithm), eliminating the need to repeatedly scan one side of the join. This brought significant performance improvements for many workloads; however, some queries remained problematic.

Straddled joins are common in production workloads; for example, filtering users by some property on the left table and orders by a status flag on the right table, then joining on user_id. While our hash join algorithm improved over nested loops, it was still inefficient in these cases, especially when one side’s predicate had low cardinality (like a boolean flag or status column).

This post explains why the old execution strategy struggled, and how our new optimization using Index Condition Pushdown (ICP) changes the execution model to make straddled joins much more efficient.

Previous algorithm

During evaluation of straddled joins on cache misses, we observed significant latency in up-queries. To identify the underlying bottleneck, we profiled query execution and analyzed the resulting flamegraphs.

Approximately 30% of the query execution time was attributed to data decompression. We initially suspected the RocksDB compression algorithm as the bottleneck and proceeded to benchmark alternative compression methods to validate this hypothesis.

Switching to ZSTD did not improve performance; decompression remained a dominant contributor to query execution time. As the next step, we disabled compression in RocksDB entirely to isolate its impact. This came with a space tradeoff but was necessary to confirm whether compression was the root cause.
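For illustration, here is roughly what that experiment looks like with the Rust rocksdb crate. The compression options are the crate's own; the surrounding setup is a simplified sketch, not Readyset's actual storage configuration:

use rocksdb::{DBCompressionType, Options, DB};

fn open_db(path: &str, compression: DBCompressionType) -> Result<DB, rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);
    // Applies to every level of the LSM tree. We benchmarked Zstd first,
    // then disabled compression entirely to isolate its cost.
    opts.set_compression_type(compression);
    DB::open(&opts, path)
}

fn main() -> Result<(), rocksdb::Error> {
    // Variant 1: Zstd (no improvement over the default in our tests).
    let _zstd = open_db("/tmp/bench_zstd", DBCompressionType::Zstd)?;
    // Variant 2: no compression (larger on disk; shifts cost to raw I/O).
    let _none = open_db("/tmp/bench_none", DBCompressionType::None)?;
    Ok(())
}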

Disabling compression didn't eliminate the problem; it merely shifted the bottleneck. The system began spending the majority of execution time on disk reads, as evidenced by increased I/O activity on ext4.

This made it clear that compression wasn’t the issue; rather, the excessive amount of data being read from disk was the primary cause. Isolated iostat output for the query confirmed this:

Device    r/s       rkB/s     rrqm/s  %rrqm  r_await  rareq-sz  w/s   wkB/s  wrqm/s  %wrqm  w_await  wareq-sz  d/s   dkB/s  drqm/s  %drqm  d_await  dareq-sz  f/s   f_await  aqu-sz  %util
nvme1n1   10068.00  55728.00  0.00    0.00   0.08     5.54      0.00  0.00   0.00    0.00   0.00     0.00      0.00  0.00   0.00    0.00   0.00     0.00      0.00  0.00     0.79    81.00

The NVMe device was handling approximately 10K read IOPS at ~5.5 KB per request (about 55 MB/s in total). Despite the small request size, the device reached ~81% utilization while executing a single join query, indicating I/O saturation.

Upon examining the query pattern, we identified that one side of the join was indexed on a low-cardinality column, causing the engine to scan nearly the entire table on each execution. Due to internal constraints (such as maintaining key provenance and supporting incremental view maintenance), the join engine was independently evaluating both sides of the join, regardless of selectivity.

Execution example

SELECT u.id, u.name, o.order_id, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.email = 'some@provide.com'
  AND o.status = 'SHIPPED';

Scenario:

  • Users table: The predicate u.email = 'some@provide.com' is highly selective, returning exactly one row.
  • Orders table: The predicate o.status = 'SHIPPED' has very low selectivity: ~99% of all rows match.

Old Execution Strategy: Hash Joins

The old algorithm relied on the hash join approach, sketched in code after the steps below:

  1. Lookup both sides independently:
    1. users: filter by email → 1 row.
    2. orders: filter by status → ~99% of the table.
  2. Materialize both result sets:
    1. Millions of orders rows, plus 1 users row.
  3. Build and probe a hash table: 
    1. Hash on o.user_id.
    2. Probe with the user row. 
    3. Discard nearly all rows after the join.
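
In code terms, the old flow looks roughly like the following sketch. The row types and function are hypothetical stand-ins for Readyset's dataflow operators, shown only to make the build-and-probe cost visible:

use std::collections::HashMap;

struct User { id: u64, name: String }
struct Order { user_id: u64, order_id: u64, total: f64 }

// Old strategy: materialize both filtered sides, build a hash table on
// the larger side, then probe it with the smaller one.
fn hash_join(users: &[User], orders: &[Order]) -> Vec<(u64, String, u64, f64)> {
    // Build: hash every filtered order row on o.user_id. With a
    // low-selectivity predicate like status = 'SHIPPED', this holds
    // ~99% of the orders table in memory.
    let mut by_user: HashMap<u64, Vec<&Order>> = HashMap::new();
    for o in orders {
        by_user.entry(o.user_id).or_default().push(o);
    }
    // Probe: only one user row survives its predicate, so nearly all
    // the rows just read, decompressed, and hashed are discarded.
    let mut out = Vec::new();
    for u in users {
        if let Some(matches) = by_user.get(&u.id) {
            for o in matches {
                out.push((u.id, u.name.clone(), o.order_id, o.total));
            }
        }
    }
    out
}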

Why This Is Inefficient:

  • High I/O: We had to read essentially the entire orders table.
  • High memory usage: Millions of rows materialized only to discard them.
  • Wasteful CPU work: The join considered far more rows than necessary.

New Execution Strategy: Index Condition Pushdown

The new algorithm issues an initial upquery to one side of the join, then combines the resulting join key with the original predicates on the other side. This composite condition is pushed down to RocksDB, allowing index-based retrieval of only the rows required to satisfy the join. This eliminates unnecessary data reads and avoids full-table scans.

Step‑by‑Step:

  1. Apply the left predicate first: u.email = 'some@provide.com' → 1 row (u.id = 123).
  2. Group rows by join key: collect distinct values {123}.
  3. For each join key, build a right-side lookup:
WHERE o.status = 'SHIPPED' AND o.user_id = 123;
  4. Leverage storage engine indexes: with an index on (user_id, status), fetch only the matching rows.
  5. Build the result set incrementally: combine left row(s) with right rows per join key, with no full materialization needed (see the sketch after this list).
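
The same example under the new strategy, again as a hypothetical sketch (the OrderIndex trait stands in for the RocksDB-backed index; it is not Readyset's real API):

use std::collections::HashMap;

struct User { id: u64, name: String }
struct Order { user_id: u64, order_id: u64, total: f64 }

// Stand-in for an index on orders(user_id, status): returns only rows
// matching both the join key and the pushed-down predicate.
trait OrderIndex {
    fn lookup(&self, user_id: u64, status: &str) -> Vec<Order>;
}

fn icp_join(left: &[User], idx: &dyn OrderIndex) -> Vec<(u64, String, u64, f64)> {
    // Step 2: group the filtered left rows by join key, so each distinct
    // key triggers exactly one right-side lookup.
    let mut by_key: HashMap<u64, Vec<&User>> = HashMap::new();
    for u in left {
        by_key.entry(u.id).or_default().push(u);
    }
    let mut out = Vec::new();
    // Steps 3-5: the composite condition (user_id = ? AND status =
    // 'SHIPPED') is evaluated inside the storage engine, so nothing
    // beyond the matching rows is read or materialized.
    for (key, users) in &by_key {
        for o in idx.lookup(*key, "SHIPPED") {
            for u in users {
                out.push((u.id, u.name.clone(), o.order_id, o.total));
            }
        }
    }
    out
}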

Benchmark: Measuring the Impact

To quantify the performance improvement brought by the new Index Condition Pushdown strategy, we ran a controlled benchmark comparing the old hash join algorithm against the new execution model on the same hardware and dataset.

Old Algorithm: Hash Join

Under the previous strategy, Readyset executed straddled joins using independent filtering on both sides, followed by full materialization and hash-based probing. This approach was highly inefficient when one side had low-cardinality filters.

Results:

  • Throughput: 7.0 events/s
  • Latency (avg): 2,284 ms
  • 95th percentile latency: 4,129 ms
  • Max latency: 6,831 ms
  • Total events processed: 4,203 over 10 minutes

The system was CPU- and I/O-bound, spending excessive time reading and decompressing unnecessary rows from disk, most of which were discarded after join evaluation.


New Algorithm: Index Condition Pushdown

With the ICP-enabled join model, Readyset instead defers right-side lookups until left-side predicates are evaluated, allowing the use of compound indexes (e.g. (user_id, status)) to fetch only relevant rows. This minimizes materialization and disk reads.
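
At the storage layer, such a compound-index lookup reduces to a prefix seek over a composite key. Below is a minimal sketch with the Rust rocksdb crate, assuming a hypothetical big-endian (user_id, status) key encoding rather than Readyset's actual key layout:

use rocksdb::{Direction, IteratorMode, DB};

// Hypothetical composite key: 8-byte big-endian user_id followed by the
// status bytes. Big-endian keeps byte order aligned with numeric order;
// a real encoding would also delimit the string to avoid prefix clashes.
fn index_key(user_id: u64, status: &str) -> Vec<u8> {
    let mut key = user_id.to_be_bytes().to_vec();
    key.extend_from_slice(status.as_bytes());
    key
}

// Seek to the first entry for (user_id, status) and read forward only
// while the key still carries that prefix; rows for other users or
// statuses are never touched.
fn lookup(db: &DB, user_id: u64, status: &str) -> Result<Vec<Vec<u8>>, rocksdb::Error> {
    let prefix = index_key(user_id, status);
    let mut rows = Vec::new();
    for item in db.iterator(IteratorMode::From(prefix.as_slice(), Direction::Forward)) {
        let (key, value) = item?;
        if !key.starts_with(&prefix) {
            break; // past the end of the matching range
        }
        rows.push(value.to_vec());
    }
    Ok(rows)
}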

Results:

  • Throughput: 3,214.4 events/s
  • Latency (avg): 4.98 ms
  • 95th percentile latency: 11.87 ms
  • Max latency: 3,467.2 ms
  • Total events processed: 1,928,645 over 10 minutes

This represents a roughly 460x improvement in both throughput (7.0 → 3,214.4 events/s) and average latency (2,284 ms → 4.98 ms). The join strategy is now highly cache- and index-efficient, with the storage engine returning only the rows that match precise key lookups.

Conclusion

While Readyset excels in delivering low-latency results via caching, optimizing the cold path is critical for consistent performance during cache misses. By rethinking how straddled joins are executed and leveraging Index Condition Pushdown, we’ve significantly improved real-world performance for these workloads.