Most users of SQL databases have a good understanding of the join algorithms single-box databases employ. They understand the trade-offs and uses for nested loop joins, merge joins, and hash joins. Distributed join algorithms, on the other hand, tend not to be as well understood. Distributed databases need to make a different set of trade-offs to account for table data that is spread across a cluster of machines instead of stored on a single machine, as in a traditional database. Because these data movement trade-offs are often misunderstood, many people have made broad claims that distributed joins don’t scale. It is true that some distributed join algorithms involve more data movement than others, but just like single-box joins, it’s possible to optimize your table schema or queries to use more scalable distributed join algorithms for queries with higher throughput requirements.

Types of Distributed Joins

For the purpose of this blog, let’s consider any algorithm for joining two tables in a distributed SQL database a distributed join. Using that definition, there are five ways distributed SQL databases can run a distributed join. The way table data is distributed in the cluster is also a key aspect of how join queries are executed. The most common practice is using sharding to distribute data among nodes in the cluster (as MemSQL does). So, let’s assume the data is sharded using a hash function on some columns of the table (a shard key) to make the descriptions more concrete.
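To make the sharding idea concrete, here is a minimal Python sketch of hash sharding. The function and table names are illustrative, not MemSQL's actual implementation: each row is routed to a node by hashing its shard key columns.

```python
# Minimal sketch of hash sharding (illustrative names, not MemSQL internals).
def shard_for(row, shard_key, num_nodes):
    """Pick the node that owns a row by hashing its shard key columns."""
    key = tuple(row[col] for col in shard_key)
    return hash(key) % num_nodes

rows = [{"user_id": 1, "city": "NYC"}, {"user_id": 2, "city": "SF"}]
nodes = [[] for _ in range(4)]  # four nodes, each holding one shard
for row in rows:
    nodes[shard_for(row, ("user_id",), len(nodes))].append(row)
```

The key property is determinism: every row with the same shard key values always lands on the same node, which is what makes the collocated joins below possible.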

Local/Collocated Reference Table Join

Most distributed databases support tables that are copied to every node in the cluster. In MemSQL, we call these reference tables. Reference tables are usually used for dimension tables in a star schema that are small and rarely updated. Joining a distributed table (a table whose data is sharded around the cluster) against a reference table is almost the same as doing a local single-box join at each node in the cluster. Each node in the cluster can join its part of the distributed table with the reference table locally, then the results can be merged together by another node in the cluster before being sent to the user (the merging logic is a bit trickier for left joins).
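The per-node step can be sketched in a few lines of Python. This is a toy simulation with made-up table names, assuming an inner join; each "node" joins its fact-table shard against its local copy of the reference table, and a coordinator simply concatenates the partial results.

```python
# Sketch of a collocated reference-table join (illustrative, not MemSQL internals).
# Every node holds a full copy of the reference (dimension) table
# and one shard of the distributed fact table.
ref_table = {1: "Books", 2: "Games"}           # category_id -> name, on every node
fact_shards = [                                 # one fact-table shard per node
    [{"order_id": 10, "category_id": 1}],
    [{"order_id": 11, "category_id": 2}],
]

def local_join(shard, ref):
    # Inner join a node's shard against its local reference-table copy.
    return [{**row, "category": ref[row["category_id"]]}
            for row in shard if row["category_id"] in ref]

# Each node joins locally; a coordinator merges the partial results.
result = [r for shard in fact_shards for r in local_join(shard, ref_table)]
```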

Pros:
Very scalable. No data movement.

Cons:
Requires schema tuning. Tables need to be marked as reference tables ahead of time.
Reference tables are copied to every node in the cluster, consuming some disk and memory capacity on each node.

Local/Collocated Distributed Table Join

If two distributed tables are both sharded on the columns being joined, then you can join them locally on each node the same way as a reference table join, with a final merging step to produce the complete result. This is the most scalable join algorithm for joining two distributed tables, as it involves no data movement before doing the join. It’s also the most restrictive type of distributed join, as the tables need to be sharded on the columns involved in the join condition before a collocated join is possible. Most distributed databases only allow a table to be sharded on one key, so this limits the flexibility of this join type. As a result, it’s important to have columns that are regularly joined on as shard keys. These columns would often be foreign key columns in a single-box database.
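Here is a toy Python sketch of why sharing a shard key makes the join local. The tables and shard counts are invented for illustration: because both tables hash the same `user_id` column with the same function, matching rows are guaranteed to sit on the same node, so each node can join its two shards independently.

```python
# Sketch of a collocated distributed-table join (illustrative, not MemSQL internals).
def shard(rows, key, n):
    """Distribute rows across n nodes by hashing the shard key column."""
    shards = [[] for _ in range(n)]
    for row in rows:
        shards[hash(row[key]) % n].append(row)
    return shards

users  = [{"user_id": 1, "name": "Ada"}, {"user_id": 2, "name": "Lin"}]
orders = [{"user_id": 1, "total": 30},   {"user_id": 2, "total": 50}]

N = 4
user_shards  = shard(users,  "user_id", N)
order_shards = shard(orders, "user_id", N)

# Both tables use the same shard key, so matching rows are on the same node
# and each node can join its shards with no data movement.
result = []
for u_shard, o_shard in zip(user_shards, order_shards):
    for u in u_shard:
        result += [{**u, **o} for o in o_shard if o["user_id"] == u["user_id"]]
```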

Pros:
Very scalable. No data movement.

Cons:
Requires schema tuning. Shard keys need to be chosen ahead of time.
Inflexible. Most databases only allow a single shard key, which limits the number of joins that can be run collocated.

Remote Distributed Table Join

If the distributed tables involved in a join have filters that reduce the number of rows that need to be joined to a small subset of both tables, then the rows for both sides of the join can be pulled to a single node in the cluster, which can do a single-box join. This is the first join type we’ve described so far that involves data movement before joining. In this case, all nodes in the cluster send the data they have for the two sides of the join to a single node so it can run the join. This type of join only performs well when the number of rows involved in the join is small. The node doing the join will be overwhelmed if the data sizes involved grow too large. This type of join is also limited to the parallelism available on a single node, whereas other join algorithms are able to use the entire cluster to run the join.
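The mechanics can be sketched as follows. This is an illustrative simulation with invented data and filters: each node applies the query's filters first, ships only the surviving rows to one node, and that node runs an ordinary single-box hash join.

```python
# Sketch of a remote distributed-table join (illustrative, not MemSQL internals).
left_shards  = [[{"id": 1, "v": "a"}], [{"id": 2, "v": "b"}]]
right_shards = [[{"id": 2, "w": "x"}], [{"id": 1, "w": "y"}]]

# Each node applies the query's selective filters, then ships its few
# surviving rows to a single node.  (Here the filter is a stand-in.)
left  = [r for shard in left_shards  for r in shard if r["id"] <= 2]
right = [r for shard in right_shards for r in shard if r["id"] <= 2]

# The chosen node runs an ordinary single-box hash join on the gathered rows.
by_id = {}
for r in right:
    by_id.setdefault(r["id"], []).append(r)
result = [{**l, **r} for l in left for r in by_id.get(l["id"], [])]
```

Note that the gather step and the join both funnel through one node, which is exactly why this plan only works when the filters are highly selective.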

Pros:
No schema tuning needed (shard keys or reference tables).
Scalable when applied to joins involving a small number of rows. Very little data movement.

Cons:
Only applicable to joins with a small number of rows being joined.
Worst-case performance is extremely poor if this join algorithm is chosen for a join involving a large number of rows, since a single node does all the work to run the join.

Broadcast Join

Broadcast joins work well when only one side of the join involves a small set of rows (has a selective filter). To do a broadcast join, the side of the join with a small set of rows is sent to every node in the cluster, then joined locally with the larger table on the other side of the join. You can think of a broadcast join as creating a temporary reference table for the join at every node, then running a reference table join as described in the Reference Table Join section.
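The "temporary reference table" framing can be sketched directly. In this toy Python simulation (invented tables, not MemSQL internals), every node receives a full copy of the small side and joins it against its local shard of the big table, just as in the reference table join above.

```python
# Sketch of a broadcast join (illustrative, not MemSQL internals).
small = [{"sku": 1, "name": "pen"}]            # small side, after its selective filter
big_shards = [                                 # one shard of the big table per node
    [{"sku": 1, "qty": 5}],
    [{"sku": 1, "qty": 7}, {"sku": 2, "qty": 1}],
]

result = []
for shard in big_shards:
    # Every node receives a full copy of the small side (a temporary
    # reference table), then joins it against its local big-table shard.
    lookup = {row["sku"]: row for row in small}
    result += [{**b, **lookup[b["sku"]]} for b in shard if b["sku"] in lookup]
```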

Pros:
No schema tuning needed (shard keys or reference tables).
Flexible. Applicable to more query shapes than the previous types of joins.

Cons:
Only feasible when one side of the join contains a small number of rows.  
Not as scalable. The broadcast rows are copied to every node in the cluster, requiring a lot of inter-node communication before the join is executed.

Reshuffle Join

If both sides of the join involve a large number of rows, then doing a broadcast join will send too much data around the cluster and may exhaust the memory or disk of nodes. In this case, it’s best instead to repartition the data, sending part of each table to each node in the cluster so that a local distributed join can run on each node exactly as described in the Local/Collocated Distributed Table Join section. The typical way this happens is that one side of the join is reshuffled (has its shard key recalculated over a different set of key columns) to match the shard key of the table on the other side of the join. Some joins could result in both tables getting reshuffled if neither shard key is involved in the join condition. This is the most expensive, but most flexible, way of running a distributed join.
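The reshuffle step can be sketched as follows. In this illustrative Python simulation (invented tables, not MemSQL internals), `orders` is assumed to be sharded on `order_id` but joined on `user_id`, so its rows are re-hashed on `user_id` and moved to their new owning nodes; after that, the join proceeds exactly like a collocated join.

```python
# Sketch of a reshuffle join (illustrative, not MemSQL internals).
def reshuffle(shards, key, n):
    """Recompute each row's target node by hashing the join key column."""
    out = [[] for _ in range(n)]
    for shard in shards:
        for row in shard:
            out[hash(row[key]) % n].append(row)  # row moves to its new owner
    return out

N = 2
# `users` is already sharded on user_id; `orders` is sharded on order_id,
# but the join is on user_id, so its rows must be reshuffled.
user_shards  = [[{"user_id": 0, "name": "u0"}, {"user_id": 2, "name": "u2"}],
                [{"user_id": 1, "name": "u1"}]]
order_shards = [[{"order_id": 9, "user_id": 1}],
                [{"order_id": 8, "user_id": 2}]]

order_resharded = reshuffle(order_shards, "user_id", N)

# After the shuffle, matching rows are collocated and each node joins locally.
result = []
for u_shard, o_shard in zip(user_shards, order_resharded):
    result += [{**u, **o} for u in u_shard for o in o_shard
               if u["user_id"] == o["user_id"]]
```

The data movement cost is the shuffle itself: in the worst case nearly every row of the reshuffled table crosses the network, which is why this is the least scalable plan.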

Pros:
No schema tuning needed (shard keys or reference tables).
Very flexible. Applicable to any query shape.                                                                

Cons:
The least scalable type of join. A lot of data movement is needed as many of the rows involved in the join are copied to other nodes to execute the join.

Optimizing Distributed Joins

To make distributed joins scalable for high throughput workloads, it’s best to avoid data movement as much as possible. Some options for doing this are:

  • Make small and rarely updated tables that you regularly join against into reference tables. This avoids the need to broadcast those small tables around the cluster to join against them.
  • As much as possible, choose shard key columns that are commonly joined on. This will allow local distributed joins to be used more often; these involve no data movement and are still extremely scalable (every node in the cluster runs the join in parallel).
  • Joins that need to reshuffle both tables involved in the join are hard to scale for high throughput workloads. At lower levels of concurrency, reshuffle joins perform fine. These queries can also be run as a remote distributed join if the number of rows involved in the join is small enough, so consider restricting the rows involved in the join if possible.

For more detailed examples and tuning advice see the MemSQL documentation here.
