Readyset Adds New Query Capabilities With the Bucket Function

Readyset is a cache layer that sits between the application and the database, acting as a cache for supported queries and proxying unsupported queries upstream. This means that any query Readyset supports is theoretically supported by the upstream database, but not necessarily the other way around.

What if that is no longer the case? What if Readyset can support queries that the upstream database can’t? What if you could keep your relational database and add extra functionality without any plugins or extensions?

Readyset’s streaming Dataflow engine works well for systems that periodically poll for results or require quick updates, such as monitoring and near-real-time time-series workloads. While these queries can technically be served by upstream relational databases, their execution time is often too long to be practical.

A New Query Pattern Enabled by Readyset’s Bucket Function

This month, we introduced Bucket, a Readyset-specific built-in function that doesn’t exist in upstream databases.

The Bucket function groups time-based data into regular intervals, for example by hour, day, or month, so you can easily see trends over time, such as daily sales totals or hourly user activity.

It takes two arguments: the first is a timestamp or datetime column, and the second is an interval. The interval is a string that follows the format <positive integer> <unit>[s], where the unit can be year, month, day, hour, minute, or second. You can then use the result in a GROUP BY clause with an aggregate function to summarize your data, similar to how you might analyze events in a time-series database.

SELECT BUCKET(event_time, '1 hour') AS bucket_hour, SUM(value) FROM events GROUP BY bucket_hour
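For instance, with a '1 hour' interval, an event at 2023-09-01 10:47:12 falls into the bucket 2023-09-01 10:00:00: Bucket truncates each timestamp down to the start of its interval, matching the PostgreSQL simulation shown later in this post.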

You can also take it a step further and take advantage of Readyset’s parameterization, like so:

CREATE CACHE ALWAYS FROM SELECT BUCKET(event_time, '30 minutes') AS bucket_30min, MIN(value) FROM events WHERE event_time >= ? AND event_time < ? GROUP BY bucket_30min

then query it by binding literals to the placeholders:

SELECT BUCKET(event_time, '30 minutes') AS bucket_30min, MIN(value) FROM events WHERE event_time >= '2023-09-01 23:59:59' AND event_time < '2023-09-03 00:00:00' GROUP BY bucket_30min
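To confirm that a cache was created, you can use Readyset’s SHOW CACHES command, which lists every cached query:

SHOW CACHES;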

Examples and Considerations

Let’s take a look at the NYC Yellow Taxi Trips dataset for the four years from 2020 through 2023: roughly 134 million records after some cleaning (such as removing any rows dated after 31 Dec 2023). We will import only a few selected columns into a PostgreSQL 15 database table.

CREATE TABLE IF NOT EXISTS nyc_taxi_trips ( pickup_datetime TIMESTAMP, dropoff_datetime TIMESTAMP, passenger_count BIGINT, trip_distance DOUBLE PRECISION, fare_amount DOUBLE PRECISION );
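The dataset is published as monthly files; assuming they have been converted to CSVs whose columns match the table above, a hypothetical psql \copy invocation to load one of them might look like this (the file name is illustrative):

\copy nyc_taxi_trips FROM 'yellow_tripdata_2020-01.csv' WITH (FORMAT csv, HEADER true)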

While we will only use two of these columns, we still want the rows to be reasonably sized for the purposes of this demo. We run this demo on a local machine with 55 GB of free memory.

Keep in mind that Bucket is not supported in PostgreSQL, so we will try to simulate it using PostgreSQL built-in functions:

SELECT to_timestamp(floor(extract(epoch FROM pickup_datetime) / 1800) * 1800) AS bucket_30min, MAX(fare_amount) FROM nyc_taxi_trips GROUP BY bucket_30min ORDER BY bucket_30min DESC LIMIT 10;

Giving us the output:

 bucket_30min           |  max
------------------------+--------
 2023-12-31 23:30:00+00 |    250
 2023-12-31 23:00:00+00 |  295.6
 2023-12-31 22:30:00+00 |    185
 2023-12-31 22:00:00+00 |    400
 2023-12-31 21:30:00+00 | 1087.3
 2023-12-31 21:00:00+00 |    260
 2023-12-31 20:30:00+00 |    220
 2023-12-31 20:00:00+00 |  155.6
 2023-12-31 19:30:00+00 |    109
 2023-12-31 19:00:00+00 | 282.24

This takes 20.46 seconds to execute on average.

Now, let’s create a cache in Readyset with the new Bucket function and query it:

CREATE CACHE ALWAYS FROM SELECT BUCKET(pickup_datetime, '30 minutes') as bucket_30min, MAX(fare_amount) FROM nyc_taxi_trips GROUP BY bucket_30min ORDER BY bucket_30min DESC LIMIT 10;
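To read from the cache, we then run the cached query itself (the statement after CREATE CACHE ALWAYS FROM) through Readyset:

SELECT BUCKET(pickup_datetime, '30 minutes') as bucket_30min, MAX(fare_amount) FROM nyc_taxi_trips GROUP BY bucket_30min ORDER BY bucket_30min DESC LIMIT 10;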

This gives us the same results (note that Bucket returns plain timestamps, while the to_timestamp workaround returns timezone-aware ones, hence the missing +00 suffix):

 bucket_30min        | max("public"."nyc_taxi_trips"."fare_amount")
---------------------+----------------------------------------------
 2023-12-31 23:30:00 |    250
 2023-12-31 23:00:00 |  295.6
 2023-12-31 22:30:00 |    185
 2023-12-31 22:00:00 |    400
 2023-12-31 21:30:00 | 1087.3
 2023-12-31 21:00:00 |    260
 2023-12-31 20:30:00 |    220
 2023-12-31 20:00:00 |  155.6
 2023-12-31 19:30:00 |    109
 2023-12-31 19:00:00 | 282.24

The cache takes only a few minutes to build. Querying it returns results in 5.6 milliseconds, with the cache read itself taking just 0.396 milliseconds! Compared with the 20.46-second PostgreSQL query, that’s roughly 3,650× faster, and without the expression nesting the PostgreSQL workaround requires.

Let’s also insert twenty thousand rows into the top two buckets and see how fast Readyset can keep the cache updated. We’ll use a simple script to generate the data (a sketch follows below), making sure the generated fare_amount exceeds the previous maximum. After doing this, we see that Readyset replicates the changes almost instantly and still returns results in the familiar 5.6 milliseconds, with the cache read taking 0.468 milliseconds.
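A minimal SQL sketch of such a script, assuming direct write access to the upstream PostgreSQL database (all generated values are illustrative):

-- Spread 20,000 rows across the top two 30-minute buckets (23:00 and 23:30
-- on 2023-12-31), with fares above the old maxima of 295.6 and 250.
INSERT INTO nyc_taxi_trips (pickup_datetime, dropoff_datetime, passenger_count, trip_distance, fare_amount)
SELECT ts,
       ts + interval '10 minutes', -- placeholder dropoff time
       1,                          -- placeholder passenger count
       2.5,                        -- placeholder trip distance
       400 + random() * 100        -- fares between 400 and 500
FROM (SELECT timestamp '2023-12-31 23:00:00' + random() * interval '59 minutes' AS ts
      FROM generate_series(1, 20000)) AS gen;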

 bucket_30min        | max("public"."nyc_taxi_trips"."fare_amount")
---------------------+----------------------------------------------
 2023-12-31 23:30:00 | 499.99
 2023-12-31 23:00:00 | 499.91
 2023-12-31 22:30:00 |    185
 2023-12-31 22:00:00 |    400
 2023-12-31 21:30:00 | 1087.3
 2023-12-31 21:00:00 |    260
 2023-12-31 20:30:00 |    220
 2023-12-31 20:00:00 |  155.6
 2023-12-31 19:30:00 |    109
 2023-12-31 19:00:00 | 282.24

When working with this query pattern, keep in mind that the cache will be fully materialized, meaning all rows are stored in memory. Depending on the size of your dataset and the predicates used, this can increase memory usage during cache creation.

To optimize memory, you can create a filtered view that includes only the most relevant time period or subset of data. Using this view as the input to the cache helps Readyset focus on active data, reducing memory requirements and improving efficiency.
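As a sketch, using the taxi demo above (the view name and cutoff date are illustrative), you would create the view on the upstream database and then build the cache on top of it through Readyset:

CREATE VIEW recent_taxi_trips AS SELECT pickup_datetime, fare_amount FROM nyc_taxi_trips WHERE pickup_datetime >= '2023-01-01';

CREATE CACHE ALWAYS FROM SELECT BUCKET(pickup_datetime, '30 minutes') AS bucket_30min, MAX(fare_amount) FROM recent_taxi_trips GROUP BY bucket_30min;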

Readyset’s Approach to Time-Series–Like Workloads

We’re not building a full-fledged time-series database, and that’s by design. While this pattern might look familiar if you’ve worked with time-series systems, we’re not aiming to compete with purpose-built solutions that handle both storage and execution.

Instead, our goal with this new pattern is to give you a practical alternative for handling time-series–like workloads without needing to introduce a new database into your infrastructure. If you’ve ever avoided building certain features because they seemed too complex or required a specialized database, Readyset can now help you unlock that functionality using your existing relational database.

You manage the storage. We’ll take care of making the queries fast. Let us know what other time-series functionality you’d like to see in Readyset.

Conclusion

With features like Bucket, Readyset isn’t just speeding up your existing queries—it’s enabling new ones your database can’t easily handle. It’s more than a cache layer; it’s a tool that extends your application’s capabilities. More power, better performance, zero database changes—just plug it in and go.