The most popular data structure used for indexing in relational databases is the Btree (or its variant the B+tree). Btrees rose to popularity because they do fewer disk I/O operations to run a lookup compared to other balanced trees. To the best of my knowledge, MemSQL is the first commercial relational database in production today to use a skiplist as its primary index backing data structure. A lot of research and prototyping went into the decision to use a skiplist. I hope to provide some of the rationale for this choice and to demonstrate the power and relative simplicity of MemSQL’s skiplist implementation. I’ll show some very simple single-threaded table scans that that run more than eight times faster on MemSQL compared to MySQL as a very basic demonstration (MemSQL performs much better than this on more aggressive and complex workloads). This article will stick to high-level design choices and leaves most of the nitty-gritty implementation details for other posts.
What is a Skiplist
Btrees made a lot of sense for databases when the data lived most of its life on disk and was only pulled into memory as needed during a query. Btrees do extra work to reduce disk I/O that is needless overhead if your data fits into memory. MemSQL is a memory optimized database so it is free to use other data structures, such as skiplists, that are not well suited to having data stored on disk.
Skiplists are a relatively recent invention. The seminal skiplist paper was published in 1990 by William Pugh (Skiplists: a probabilistic alternative to balanced trees). This makes it about 20 years younger than the Btree, which was first proposed in the 1970s. A skiplist is an ordered data structure providing expected O(Log(n)) lookup, insertion and deletion complexity. It provides this level of efficiency without the need for complex tree balancing or page splitting like that required by Btrees, redblack trees or AVL trees. As a result, it’s a much simpler and more concise data structure to implement. Lockfree skiplist implementations have recently been developed (Lockfree Linked Lists and Skiplists) that provide thread safety with better parallelism under a concurrent read/write workload than threadsafe balanced trees that require locking. I won’t dig into the details of how to implement a skiplist lockfree here, but to get an idea of how it might be done see this blog post about common pitfalls in writing lockfree algorithms.
A skiplist is made up of elements attached to towers. Each tower in a skiplist is linked at each level of the tower to the next tower at the same height forming a group of linked lists, one for each level of the skiplist. When an element is inserted into the skiplist, its tower height is determined randomly via successive coin flips (a tower with height n occurs once in 2^n times). The element is linked into the linked lists at each level of the skiplist once its height has been determined. The towers support binary searching by starting at the highest tower and working towards the bottom, using the tower links to check when one should move forward in the list or down the tower to a lower level.
Why a Skiplist Index for MemSQL
1) Memory Optimized
MemSQL is a memory-optimized database, so we can assume that data always resides in memory. Being memory-optimized means indexes are free to use pointers to rows directly without the need for indirection. In a traditional database, rows need to be addressable by some other means than a pointer as their primary storage location is on disk. This indirection usually takes the form of a cache of memory resident pages (often called a buffer pool) that is consulted in order to find a particular row’s in-memory address or to read it into memory from disk if needed. This indirection is expensive and usually done at the page level (e.g., 8K at a time in SQL Server). MemSQL doesn’t have to worry about this overhead. This makes data structures that refer to rows arbitrarily, like a skiplist does, feasible. Dereferencing a pointer is much less expensive than looking up a page in the buffer pool.
MemSQL’s skiplist implementation is about 1500 lines of code including comments. Having recently spent some time in both SQL Server’s and Innodb’s Btree implementations, I can tell you they are both close to 50 times larger in terms of lines of code and both have many more moving parts. For example, a Btree has to deal with page splitting and page compaction while a skiplist has no equivalent operations. The first generally available build of MemSQL’s took a little over a year to build and stabilize. This feat wouldn’t have been possible with a more complex indexing data structure.
3) Lock free
A lockfree or non-blocking algorithm is one in which some thread is always able to make progress, no matter how all the threads’ executions are interleaved by the OS. MemSQL is designed to support highly concurrent workloads running on hardware with many cores. These goals makes lockfree algorithms desirable for MemSQL. The algorithms for writing a thread safe skiplist lockfree are now a solved problem in academia. A number of papers have been published on the subject in the past decade. It’s much harder to make a lockfree skiplist perform well when there is low contention (ie., a single thread iterating over the entire skiplist with no other concurrent operations executing). Optimizing this case is a more active area of research. Our approach to solving this particular problem is a topic for another time.
Btrees, on the other hand, have historically needed to use a complex locking scheme to achieve thread safety. Some newer lockfree Btree-like data structures such as the BWtree have recently been proposed that avoid this problem. Again, the complexity of the BWTree data structure far outpaces that of a skiplist or even a traditional Btree (it requires more complex compaction algorithms then a Btree and depends on a log-structured storage system to persist its pages). The simplicity of the skiplist is what makes it well suited for a lockfree implementation.
The speed of a skiplist comes mostly from its simplicity. MemSQL is executing fewer instructions to insert, delete, search or iterate compared to other databases.
Skiplists also support some extra operations that are useful for query processing and that aren’t readily implementable in a balance tree.
For example, a skiplist is able to estimate the number of elements between two elements in the list in logarithmic time. The general idea is to use the towers to estimate how many rows are between two elements linked together at the same height in the tree. If we know the tower height at which the nodes are linked, we can estimate how many elements are expected to be between these elements because we know the expected distribution of towers at that height. Knowing how many elements are expected to be in an arbitrary range of the list is very useful for query optimization when calculating how selective different filters in a select statement are. Traditional databases need to build separate histograms to support this type of estimation in the query optimizer.
Addressing Some Common Concerns About Skiplists
The most well known disadvantage of a skiplist is its memory overhead. The skiplist towers require storing a pointer at each level of each tower. On average each element will have a tower height of 2 (we flip a coin in succession to determine the tower height such that a tower of height n occurs 1 in 2^n times). This means on average each element will have 16 bytes of overhead for the skiplist towers. The significance of this overhead depends on the size of the elements being stored in the list. In MemSQL, the elements stored are rows of some users table. The average row size in a relational database tends to be 100s of bytes in size, dwarfing the skiplist’s memory overhead.
BTrees have their own memory overhead issues that make them hard to compare directly to skiplists. After a BTree does a page split, both split pages are usually only 50% full (some databases have other heuristics, but the result of a split is pages with empty space on them). Depending on a workload’s write patterns, BTree’s can end up with fragmented pages all over the BTree due to this splitting. Compaction algorithms to reclaim this wasted space are required, but they often need to be triggered manually by the user. A fully compacted BTree will be more memory efficient than a skiplist.
Another way MemSQL is able to improve memory use compared to a traditional database is in how it implements secondary indexes. Secondary indexes need only contain pointers to the primary key row. There is no need to duplicate the data in key columns like secondary Btree indexes do.
CPU Cache efficiency
Skiplists do not provide very good memory locality because traversing pointers during a search results in execution jumping somewhat randomly around memory. The impact of this effect is very workload specific and hard to accurately measure. The cost of executing the rest of a query (sorting, executing expressions, protocol overhead to return the queries result) tends to dominate the cost of traversing the tower pointers during a search for most queries.
Most skiplist implementations use backwards pointers (double linking of list elements) to support iterating backwards in the list. The backwards pointers add extra memory overhead and extra implementation complexity (lockfree doubly linked lists are difficult to implement). MemSQL’s skiplist employs a novel reverse iterator that uses the towers to iterate backwards without the need for reverse links. The idea is to track the last tower link that is visited at each level of the skiplist while seeking to the end, or to a particular node, of the skiplist. These links can be used to find the element behind each successive element (each time the reverse iterator moves backwards, it updates the links at each level that are used). This iterator saves the memory required for backwards pointers, but does result in higher reverse iteration execution cost. Reverse iteration is important for a SQL database because it allows ORDER BY queries to run without sorting even if the ORDER BY wants the opposite sort order (i.e., ascending instead of descending) of the order the index provides.
Quick Performance Comparison
Performance benchmarking of databases and data structures is very difficult. I’m not going to provide a comprehensive benchmark here. Instead, I’ll show a very simple demonstration of our skiplists single-threaded scan performance compared to innodb’s Btree. I’m going to run “SELECT SUM(score) FROM users” over a 50 million row users table. There is no concurrent write workload in this demonstration (where MemSQL really shines) and MemSQL is running with query parallelism disabled (both MySQL and MemSQL are scanning using only a single thread). Innodb is running with a big enough buffer pool to fit the entire table in memory, so there is no disk I/O going on.
CREATE TABLE `users` (
`user_id` bigint(20) NOT NULL AUTO_INCREMENT,
`first_name` varchar(100) CHARACTER SET utf8,
`last_name` varchar(100) CHARACTER SET utf8,
PRIMARY KEY (`user_id`)
MemSQL’s single-threaded scan performance is 5 times faster in the first case and 8 times faster in the second case. There is no black magic involved here. MemSQL needs to run far fewer CPU instructions to read a row for the reasons discussed above.
Because MemSQL is a memory optimized database, it is able to take advantage of the simplicity of a lockfree skiplist for its indexing. The result is a more modern indexing design based on recent developments in data structure and lockfree/non-blocking algorithms research. The simplicity of the skiplist is the source of a lot of MemSQL’s speed and scalability.