When Query Optimization Isn’t Enough: Solving MySQL Overload with Caching

9 min read

about 2 months ago

When Query Optimization Isn’t Enough: Solving MySQL Overload with Caching

Consider the perfect scenario: the database is running on reliable hardware, MySQL has been finely tuned, and queries have been reviewed and optimized. Despite this, sustained business growth inevitably leads to an increase in workload over time. When this happens, even optimized queries can begin to overwhelm the system, not because they are slow, but because they run frequently and consume cumulative resources over time.

At this stage, optimizing performance goes beyond analyzing execution plans and index usage. To further reduce query latency and system load, it’s essential to understand how each query interacts with underlying system resources—specifically CPU cycles, memory access patterns, and the storage stack. Low-level tools such as perf and eBPF offer deep insight into these interactions. For those interested in using eBPF to trace database performance, let us know—we’re considering a dedicated post on that topic. In the meantime, Marcelo has written a practical example of using BPFtrace to investigate memory behavior in Rust applications, which is applicable to many database workloads as well.

This post uses perf stat to measure performance across three scenarios: (1) reading data from disk (cold cache), (2) accessing data already in memory (warm cache), and (3) serving queries directly from an external cache layer (Readyset). The goal is to evaluate not only response time but also system-level efficiency and resource utilization under each condition. To preview what we’ll explore in detail, the table below shows how Readyset dramatically reduces query latency and CPU usage, even compared to a fully warm cache. The rest of this post walks through how we achieved these results and what they mean at the system level.

Metric

Cold Cache 

Warm Cache

Readyset

Wall-Clock Latency

10.44s

5.40s

0.12s

Instructions Executed

163M

154M

93M

Estimated CPU Time

~112ms

~55ms

~41.7ms

Even with well-optimized queries, high-frequency workloads can still saturate CPU and memory subsystems, leading to performance bottlenecks. As the table above shows, this is where a caching layer like Readyset becomes essential. Readyset offloads repeated query execution from the database engine entirely, delivering consistent performance gains while reducing pressure on the database, even after extensive tuning.

With these insights, you can make informed decisions to improve resource efficiency and minimize overall system load.

Testing

In this section, we will examine how the same SQL query behaves when executed against MySQL under three different execution environments. Each test uses perf stat to capture CPU cycles, instructions executed, cache misses, context switches, and block I/O events. This allows us to estimate not only how fast the query runs, but how much actual CPU work it performs. The database used is the publicly available MySQL Employees dataset, and the query being tested is:

SELECT e.emp_no, e.first_name, e.last_name, s1.salary AS initial_salary, s2.salary AS recent_salary FROM employees e JOIN salaries s1 ON e.emp_no = s1.emp_no JOIN salaries s2 ON e.emp_no = s2.emp_no WHERE s1.to_date < s2.from_date AND s1.salary > s2.salary AND s2.to_date = '9999-01-01';

The CPU in this setup operates at 2.7 GHz, which translates to approximately 2.7 billion cycles per second. By dividing 1 second by 2.7 billion, we find that each cycle takes about 0.370 nanoseconds. This value helps translate the cycle counts reported by perf into actual CPU time spent on query execution.

Run 1 – Cold Cache

Before running the first test, we use the command echo 3 > /proc/sys/vm/drop_caches to clear the Linux page cache, along with inode and dentry caches. This step ensures that neither the operating system nor MySQL has any relevant data already loaded into memory. By flushing these caches, we simulate a true “cold cache” scenario, where all data must be fetched from disk. This allows us to measure the full impact of I/O latency and observe how the system behaves when starting from scratch, without any memory-based optimizations.

To capture low-level system performance during query execution, we run the query using perf stat. This tool collects metrics such as CPU cycles, executed instructions, cache misses, context switches, and block I/O events. It provides detailed insight into how efficiently the CPU is utilized and whether the workload is compute-bound or I/O-bound.

$ sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" $ sudo perf stat -e cycles,instructions,cache-misses,context-switches,block:block_rq_issue,block:block_rq_complete -g \ -- mysql -uroot -pmsandbox -h 127.0.0.1 -P 22233 < query.sql

Result:

180,690,705 cycles 163,222,649 instructions # 0.90 insn per cycle 336,622 cache-misses 54 context-switches 80 block:block_rq_issue 4 block:block_rq_complete 10.435519092 seconds time elapsed 0.052447000 seconds user 0.059771000 seconds sys

This output shows that the query required 180 million cycles and executed 163 million instructions, with an IPC (instructions per cycle) of 0.90. A value below 1.0 generally indicates that the CPU spent time stalled, likely due to waiting on I/O or memory. Cache misses were high (336,622), confirming that most data was not already in memory. Block-level I/O activity was recorded, with 80 I/O requests issued and only 4 completed during the observed window. Although the query used only ~0.11 seconds of CPU time, it took 10.44 seconds to complete, which reflects the time spent waiting on disk access.

Run 2 – Warm Cache

The second run tests the same query after the required data has been cached in memory, either in MySQL’s InnoDB buffer pool or the OS page cache. The goal is to observe how performance changes when I/O is no longer a factor and the CPU can operate at full efficiency.

148,842,336 cycles 153,940,609 instructions # 1.03 insn per cycle 264,918 cache-misses 27 context-switches 0 block:block_rq_issue 0 block:block_rq_complete 5.404173004 seconds time elapsed 0.029532000 seconds user 0.033550000 seconds sys

This run executed ~154 million instructions in ~149 million cycles, with an IPC of 1.03, indicating more efficient CPU usage. The query ran in 5.4 seconds, roughly half the time of the cold cache run. Cache misses dropped to 264k, and there were no I/O operations recorded. The estimated active CPU time was ~55 milliseconds, and context switching dropped to 27. This shows the benefits of serving the query from memory, where data access is much faster than disk and CPU resources are used more effectively. It’s important to note that while no disk I/O occurred, cache misses still happened. These reflect misses in the CPU’s internal caches (such as L1, L2, or L3), requiring data to be fetched from main memory (RAM). Since the working set was already resident in memory (either the OS page cache or MySQL buffer pool), no block-level I/O was necessary. This highlights the benefit of warm cache execution: data access is significantly faster than from disk, and CPU resources are used more efficiently.

Run 3 – Readyset Cache

The final run tests the same query via Readyset, a caching layer that proxies MySQL and offloads read queries to its own memory-resident store. For this test, I'm using the Readyset Docker version to spin and validate this setup quickly (you can find more details in the official documentation): 

You can start a Readyset container using the following command:

$ sudo docker run -d -p 44008:3307 -p 44009:6034 \ --name Readyset \ -e UPSTREAM_DB_URL="mysql://ReadysetUser:ReadysetPWD@172.17.0.1:22233/employees" \ -e LISTEN_ADDRESS="0.0.0.0:3307" \ Readysettech/Readyset:latest

Once Readyset is running and connected to the upstream MySQL instance, we verify that the query is being proxied through Readyset:

mysql> SHOW PROXIED QUERIES; +--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+-------+ | query id | proxied query | Readyset supported | count | +--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+-------+ | q_cc3975f2c1fa2371 | SELECT `e`.`emp_no`, `e`.`first_name`, `e`.`last_name`, `s1`.`salary` AS `initial_salary`, `s2`.`salary` AS `recent_salary` FROM `employees` AS `e` JOIN `salaries` AS `s1` ON (`e`.`emp_no` = `s1`.`emp_no`) JOIN `salaries` AS `s2` ON (`e`.`emp_no` = `s2`.`emp_no`) WHERE ((`s1`.`to_date` < `s2`.`from_date`) AND ((`s1`.`salary` > `s2`.`salary`) AND (`s2`.`to_date` = $1))) | yes | 0 | +--------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+-------+

If Readyset recognizes the query and marks it as supported, we can load it into the cache:

mysql> CREATE CACHE FROM q_cc3975f2c1fa2371; Query OK, 0 rows affected (15.06 sec)

With the cache populated, we rerun the query and measure system performance using the same perf stat command, now targeting Readyset’s SQL proxy (port 44008):

$ sudo /usr/libexec/perf.5.15.0-208.159.3.el8uek.x86_64 stat \ -e cycles,instructions,cache-misses,context-switches,block:block_rq_issue,block:block_rq_complete \ -g -- mysql -uReadysetUser -pReadysetPWD -h 127.0.0.1 -P 44008 < query.sql

Result:

112,732,769 cycles 93,673,121 instructions # 0.83 insn per cycle 194,694 cache-misses 24 context-switches 0 block:block_rq_issue 0 block:block_rq_complete 0.116779798 seconds time elapsed 0.031629000 seconds user 0.062280000 seconds sys

This run shows a significantly lighter workload. The query executed only 93.6 million instructions — the lowest among all runs — and completed in just 0.12 seconds of wall-clock time. The IPC (instructions per cycle) was 0.83, slightly lower than in previous tests, which indicates more idle cycles relative to instruction execution. However, this doesn’t imply inefficiency — rather, it reflects the fact that much less work was needed overall. Total cache misses dropped to 194,694, showing tight memory locality. 

Importantly, there were no I/O requests issued or completed, confirming that the query was fully served from Readyset’s in-memory cache without ever reaching the upstream MySQL database

The contrast across scenarios is captured below, highlighting Readyset’s ability to eliminate I/O without requiring any manual tuning.

Metric

Cold Cache

Warm Cache

Readyset

Disk I/O Requests Issued

80

0

0

Disk I/O Requests Completed

4

0

0

Manual Tuning Required

No

Yes

No

As a result, the database remained completely free to handle other queries, which is especially valuable when dealing with high-frequency read workloads that would otherwise create contention even after query tuning. The context switch count (24) was lower than both the cold and warm cache runs (54 and 27 respectively), supporting that the process ran with minimal interruptions. Estimated CPU active time was about 41.7 milliseconds (based on 2.7 GHz frequency), further validating that Readyset reduced not only latency, but also CPU and memory subsystem pressure.

Conclusion

The three perf stat runs illustrate distinct execution profiles driven by data locality and execution path. In the cold cache scenario, the query incurred 181 million CPU cycles, executed 163 million instructions, and triggered over 330k cache misses, along with multiple block-level I/O operations. This led to the highest observed latency at 10.44 seconds, reflecting the full cost of reading from disk when neither the OS nor MySQL holds relevant data in memory.

In the warm cache scenario, the query benefited from memory residency — execution became more CPU-efficient with IPC rising to 1.03, cache misses dropping to 264k, and zero I/O activity. As a result, total wall-clock time dropped to 5.4 seconds, and CPU cycles reduced to 149 million, showing that memory access and caching mechanisms significantly improve query performance.

The third run, using Readyset, served the query from a dedicated memory-resident cache layer that completely bypassed the upstream MySQL database. This test recorded the lowest instruction count (93 million), lowest cache misses (194k), and shortest execution time (0.12 seconds). Although IPC dropped slightly to 0.83, this was due to fewer total instructions and a lighter workload, not inefficiency. Crucially, this execution path required no I/O and very little CPU time (~41.7ms), leaving MySQL free to handle other operations.

To summarize the observed differences across all scenarios, the following table consolidates the performance metrics and highlights Readyset’s impact relative to both cold and warm cache executions:

MetricCold CacheWarm CacheReadysetΔ vs. Cold CacheΔ vs. Warm Cache
Wall-Clock Latency10.44s5.40s0.12sReadyset is ~87× fasterReadyset is ~45× faster
Instructions Executed163M154M93M43% fewer instructions with Readyset40% fewer instructions with Readyset
CPU Cycles181M149M113M38% fewer cycles with Readyset24% fewer cycles with Readyset
Instructions per Cycle (IPC)0.901.030.8318% lower IPC with Readyset due to reduced workload119% lower IPC with Readyset due to reduced workload
Cache Misses336,622264,918194,69442% fewer cache misses with Readyset27% fewer cache misses with Readyset
Context Switches54272456% fewer context switches with Readyset11% fewer context switches with Readyset
Block I/O Requests Issued8000Readyset eliminates disk I/O entirely2No change (disk I/O eliminated in both)
Block I/O Requests Completed400Readyset eliminates disk I/O entirely2No change (disk I/O eliminated in both)
Estimated CPU Time~112ms~55ms~41.7ms63% less active CPU time with Readyset24% less active CPU time with Readyset

Notes: 1 IPC is lower with Readyset because Readyset completes the entire workload with fewer instructions, fewer cycles, and less time. CPU isn’t being underutilized, less work is required to serve the same query. 

2 Disk I/O is fully eliminated with both warm cache and Readyset; however, only Readyset eliminates it with zero tuning.

These results reinforce an important point: while traditional query tuning and memory optimization are essential, they can eventually reach their limits in high-frequency workloads. In such cases, offloading repetitive reads to a purpose-built caching layer like Readyset can substantially reduce pressure on the database engine. This approach not only lowers latency but also minimizes CPU cycles, memory pressure, and context switching, enabling better scalability and resource utilization.

For a deeper look into cost-efficiency considerations and architectural trade-offs between vertical and horizontal scaling strategies, check out our full analysis here:

 Vertical Scaling of MySQL vs Horizontal Scaling with Readyset