Introducing Hash Join Algorithm

MySQL

5 min read

•

12 months ago

Joins are a concept in relational databases, allowing the combination of data from two or more tables. Joins establish a relationship between two tables using a common key, which is used to lookup data from one table to another.

There are different types of joins and different algorithms that can be used to retrieve the data. In this article we will be focusing on discussing the differences between each algorithm.

Nested Loop Joins

A Nested Loop Join algorithm is a straightforward method for performing JOIN operations between two tables in a database.

How Nested Loop Join Works:

Initialization: The algorithm starts by selecting one of the tables as the outer table and the other as the inner table. Typically, the smaller table is chosen as the outer table to minimize the number of iterations.
Outer Loop: The algorithm iterates over each row of the outer table. For each row in the outer table:
- Read a row from the outer table.
Inner Loop: For each row from the outer table, the algorithm iterates over every row of the inner table to find matching rows based on the join condition.
- For the current outer row, iterate over each row in the inner table.
- Check if the join condition is satisfied for the current pair of outer and inner rows.
- If the join condition is satisfied, combine the outer row and the inner row to form a joined row and add it to the result set.
Completion: This process continues until all rows from the outer table have been processed and compared with all rows from the inner table.

Pseudo code example:

FOR each row in OuterTable DO
    FOR each row in InnerTable DO
        IF OuterTable.row matches InnerTable.row based on join condition THEN
            Add the combined row to the result set
        END IF
    END FOR
END FOR

Usage

When joining small tables.
When no indexes are available.

Advantages:

Simplicity: The algorithm is easy to understand and implement.
No Index Requirement: Does not require indexes on the join columns, making it flexible for use in various situations.

Disadvantages:

Inefficiency: Can be very slow for large tables due to its O(M*N) time complexity, where M and N are the sizes of the two tables.
Resource Intensive: High computational and I/O overhead as it requires a full scan of the inner table for each row in the outer table.

Variations

There are some variations of Nested Loop, such as:

Block Nested Loop - Where some rows from OuterTable are buffered in memory, making it a batch lookup in the InnerTable, reducing the amount of times we need to scan rows from the InnerTable

Index Nested Loop - If there are indexes available in the InnerTable to satisfy the ON condition, the inner loop can be adjusted to use an index loop-up. This reduces the complexity to O(M*log N).

Hash Join

A Hash Join is a more efficient algorithm for performing JOIN operations between two tables, especially when dealing with large datasets. It is particularly effective when there are no indexes on the join columns and when the join condition is an equality condition. Here’s how it works:

How Hash Join Works

Build Phase:
- The algorithm starts by selecting one of the tables as the build table (typically the smaller table).
- A hash table is constructed in memory, where each entry corresponds to a row in the build table.
- The rows are inserted into the hash table based on the hash value of the join key. This allows for quick lookup during the next phase.
Probe Phase:
- The algorithm then iterates over each row in the other table, known as the probe table.
- For each row in the probe table, it computes the hash value of the join key and uses this hash value to quickly locate matching rows in the hash table.
- If matching rows are found in the hash table, the rows from the probe table and the build table are combined to form the joined rows, which are then added to the result set.

Pseudocode Example

// Build phase
FOR each row in BuildTable DO
    Compute hash value for the join key
    Insert row into HashTable based on hash value
END FOR

// Probe phase
FOR each row in ProbeTable DO
    Compute hash value for the join key
    IF hash value exists in HashTable THEN
        Retrieve matching rows from HashTable
        FOR each matching row DO
            Combine rows from ProbeTable and BuildTable
            Add the combined row to the result set
        END FOR
    END IF
END FOR

Usage

Hash Join is typically used:

When the join condition is an equality condition.
When dealing with large tables without indexes.
When there is enough memory available to efficiently build and probe the hash table.

Advantages

Efficiency: Generally more efficient than Nested Loop Join ( O(M*N) ) for large tables, with an average time complexity of O(M + N), where M and N are the sizes of the two tables
Scalability: Works well even when the tables are large and do not fit entirely in memory, as the hash table can be partitioned and processed in chunks.
Index Independence: Does not require indexes on the join columns, making it versatile for various scenarios.

Disadvantages

Memory Usage: Requires sufficient memory to store the hash table, which can be a limitation for very large tables.
Hash Collisions: The effectiveness of the algorithm depends on a good hash function. Poor hash functions can lead to collisions, reducing performance.
Overhead: Involves additional overhead for computing hash values and managing the hash table.

Join’s at Readyset

Readyset is a fast in-memory cache for your queries, with incremental view maintenance that does not require invalidation, instead as new data is ingested in your database, the cache entries are automatically updated. The latency for a query that hits a cache hit (it’s found in cache) is in the figures of sub milliseconds. However, when we hit a cache miss we need to perform what we call up-query or replay, which basically means traverse the dataflow graph upwards to find a node that has materialized the data we are requesting. In certain cases we need to go back all the way up to our base node, which consists in the RocksDB table containing a copy of all the records.

A Normal JOIN or single table look-up is done via Index Lookup which is fast. Some types of joins that have predicates in the WHERE clause coming from different sides of the join are not fully indexed and require extra steps to ensure we are returning just the records matching the ON condition and also the predicates in the WHERE clause. We call those straddle joins and they are disabled by default.

The current algorithm used to resolve straddle joins is Nested Loop Join Algorithm.

We are thrilled to announce that starting at Readyset release stable-240627 we have implemented the Hash Join algorithm used in those joins operations to take advantage of the algorithm efficiency. The performance benefits from Hash Join are orders of magnitude faster if compared to the previous release.

To demonstrate such performance improvements, we will use a sample pair of tables and run some batch data into it and check how many people named X have an active salary:

Schema:

CREATE TABLE `person` (
  `ID` int DEFAULT NULL,
  `name` char(11) DEFAULT NULL
);

CREATE TABLE `salary` (
  `salary` int DEFAULT NULL,
  `person_id` int DEFAULT NULL,
  `active` tinyint DEFAULT '1'
);

Data load - For each person, we insert a salary of 100k as inactive and 200k as active.

for f in $(seq 1 1000)
do
  mysql -e "set sql_log_bin=0; INSERT INTO person VALUES (${f}, 'person_name'); INSERT INTO salary VALUES (100000, '${f}', 0), (120000, ${f}, 1);" test
done

Query:

SELECT COUNT(*)
FROM person
JOIN salary ON person.ID = salary.person_id
WHERE person.name = 'person_name'
  AND salary.active=1;

Results:

Batch Size	Readyset + Nested Loop (Cache Miss)	Readyset + Hash Join (Cache Miss)	Readyset (Cache Hit)	MySQL
1k	2,01sec	0,03sec	0,00sec	0,03sec
10k	0,05sec	0,01sec	0,00sec	0,01sec
100k	3 min 12,11sec	0,20sec	0,00sec	0,20sec

Summary

Readyset version stable-240627 hash join algorithm delivers a significantly higher speed for handling cache misses during join operations that involve predicates from both sides.

The increase in performance becomes more evident as the volume of data grows, highlighting the advantages of this version with larger datasets.

Queries resulting in a cache hit do not require a replay, they are served directly from cache and continue to experience sub millisecond latency delivered by Readyset.

Authors

Marcelo Altmann