The world is moving toward highly-parallel, distributed systems up and down the stack. It started with the app servers on the front lines. Now it’s happening at the data layer. This is generally a good thing, but these kinds of systems can sometimes feel like a foreign country. In this article we’ll talk about how MemSQL spreads the work around a cluster of servers without sacrificing ease of use.

There are two kinds of machines in a MemSQL cluster: aggregators and leaves. The same software runs on both kinds of machines, and they all communicate using the same MySQL-compliant protocol. The only difference is in what role each one is told to play.

Aggregators aggregate

An aggregator is like a SQL proxy sitting in front of your cluster. Under normal circumstances, you only ever talk to the cluster through an aggregator. You connect to it with your favorite SQL client and insert rows, run selects, and manipulate data any way you want. From your perspective you are simply talking to “the” database server. The work of coordinating the parallel operations and combining everything into a single set of results is what an aggregator does. There could be any number of leaves acting in concert to answer your query, or only one. To you it’s completely transparent.

Leaves… are where you leave data

A “leaf” server is where your data gets stored. There’s nothing very complicated about it. Each leaf is an instance of MemSQL that acts independently. In fact, a leaf is barely aware that it’s part of a larger cluster of machines. All it knows is that some user named “root” logs into it from time to time and tells it to store or retrieve some data, and that it sometimes has to replicate data to and from its buddies. If you were to connect to a leaf with a SQL client and poke around, you’d see something like this:

 

Each of those memsql_demo_N databases is a partition, created and managed by the aggregator. When you query “select count(1) from lineitem” from an aggregator, in the background it gets translated into parallel queries across the cluster.

So where does the data go?

As much as the aggregators try to maintain the illusion of a single consistent database, it helps to know a bit about how things work under the hood. A distributed system is fundamentally different from a single-box database. There are a few ways you can arrange your cluster, each with its tradeoffs.

The simplest MemSQL cluster has one aggregator and one leaf. Let’s ignore the aggregator and focus on the leaf. When you create a database, by default it’s split into 8 partitions. Think of a donkey loaded up with a bunch of packages.

This works just fine. Modern multi-core servers, running an in-memory database like MemSQL, can handle an enormous number of concurrent reads and writes very quickly. Because the data is partitioned even on a single leaf it’s easy to achieve parallelism and use all those expensive CPU cores to their fullest. But there’s only so much load a single donkey can carry. To expand the size of your cluster, you add another one and rebalance.

 

This also works fine. As data gets inserted through an aggregator, the aggregator will make sure that both donkeys get equal amounts of data. In non-donkey terms, this configuration is like RAID-0. There is only one copy of your data, split up and distributed over multiple devices. As with RAID, you should use donkeys of similar sizes to achieve good balance. This configuration can take you pretty far:

However, there is a potential problem. The more donkeys you add, the more likely it is that any one donkey will stumble or get sick, making part of the dataset unavailable. Since there is only one copy this increases the chances of the entire cluster failing. For best results you want two copies of each partition.

This configuration operates more like RAID 1+0. You can expand your cluster’s capacity more or less infinitely by adding more pairs of donkeys. In the case of a failure, the backup donkey automatically takes up the slack until full capacity is restored.

This doesn’t mean that the backup donkeys are doing nothing. MemSQL high-availability is more fine-grained than that. Remember that each database is split into lots of partitions which are spread across the leaves. The aggregator makes sure each leaf machine gets some of the active partitions and some of the backups, so you can use all of your cluster’s CPUs during normal operation. In other words, your data is striped.

 

Building a SQL-compliant database that can scale over thousands of machines is bigger deal that it might sound. Most of the hard scaling and performance problems I’ve observed with traditional single-box SQL databases come down to two causes: I/O bottlenecks on the disk, and the fact that it runs on a single box. There’s only so much RAM and CPU power you can cram into a single machine. If the success of Google, Facebook, and Amazon have taught us anything, it’s the importance of scaling horizontally over lots of commodity computers.

But it’s been too easy to throw the baby out with the bath water, to give up a powerful query language in exchange for scale. A distributed “No SQL” system like MongoDB is a relatively easier programming problem than the one we’ve tackled. MongoDB, for instance, doesn’t yet support very high rates of concurrent reads and writes. For years Mongo ignored the hard bits of supporting relational set operations. Even now it has only partial support for joins and aggregations, and of course you have to learn a new query language to use it… a language that seems to be edging closer and closer to SQL itself. Our approach is to cut to the chase and give you a highly-concurrent, distributed SQL database, built out of parts that are straightforward to understand.

A couple of details

There are two kinds of tables in MemSQL: reference tables and distributed. Distributed tables are laid out as described here. Reference tables are intended for smaller data that tends to get joined frequently along many dimensions. To make this more efficient, reference tables are replicated on each leaf and aggregator. In data warehouse terms, the tiny dimension tables are reference tables, and gigantic fact tables are distributed.

There are two kinds of aggregators: the master and the children. There is no real difference between the software running on them. Master is just a role assigned to one (and only one) aggregator at a time. The master is responsible for DDL commands like adding or dropping tables, rearranging partitions, and making sure all the other aggregators have consistent metadata about the cluster. All the normal stuff like inserts and deletes can be handled by any aggregator, even when the master is down. In the event of a failure, you can easily promote any aggregator to be the master, and the cluster marches on.