Loading half a billion records in 40 minutes


Nikita Shamgunov

Co-founder and CTO


Adam Derewecki wrote a cool post about his experience loading half a billion records into MySQL. SingleStore is a MySQL-compatible drop-in replacement that is built from the ground up to run really fast in memory.

Disclaimer: this will not be an apples-to-apples comparison with derwiki, since we obviously don't have the same dataset, and we need a much bigger machine to load everything into memory. But I believe this experiment will get the point across. So without further ado, let's go through the steps.

Schema

Here is the schema for the table. Note that the table has three indexes; on top of those, SingleStore will automatically generate a primary index under the covers. We won't be disabling keys during the load.

drop table if exists store_sales_fact;
create table store_sales_fact(
       date_key                smallint not null,
       pos_transaction_number  integer not null,
       sales_quantity          smallint not null,
       sales_dollar_amount     smallint not null,
       cost_dollar_amount      smallint not null,
       gross_profit_dollar_amount smallint not null,
       transaction_type        varchar(16) not null,
       transaction_time        time not null,
       tender_type             char(6) not null,
       product_description     varchar(128) not null,
       sku_number              char(12) not null,
       store_name              char(10) not null,
       store_number            smallint not null,
       store_city              varchar(64) not null,
       store_state             char(2) not null,
       store_region            varchar(64) not null,
       key date_key(date_key),
       key store_region(store_region),
       key store_state(store_state)
);

Hardware

We have a very cool machine from Peak Hosting, with 64 cores and 512 GB of RAM. You can rent one for yourself for a little under two grand a month; they were kind enough to give us this one to use for free. Here is the spec of a single core.

vendor_id       : AuthenticAMD
cpu family      : 21
model           : 1
model name      : AMD Opteron(TM) Processor 6276
stepping        : 2
cpu MHz         : 2300.254
cache size      : 2048 KB

You read that correctly: this machine has sixty-four 2.3 GHz cores and 512 GB of RAM, almost 8 times the largest memory footprint available in the cloud today, all on dedicated hardware with no virtualization overhead or resource contention with unknown third parties.

Loading efficiently

Loading data efficiently is actually not that trivial. The best way of doing it with SingleStore is to use as much CPU as you can get. Here are a few tricks that can be applied.

1. Multi-inserts

We can batch inserts into 100-row multi-insert statements. This reduces the number of network roundtrips, since each roundtrip now carries 100 rows instead of one. Here is what a multi-insert looks like:

insert into store_sales_fact values('1','1719','4','280','97','183','purchase','14:09:10','Cash','Brand #6 chicken noodle soup','SKU-#6','Store71','71','Lancaster','CA','West'),
('1','1719','4','280','97','183','purchase','14:09:10','Cash','Brand #5 golf clubs','SKU-#5','Store71','71','Lancaster','CA','West'),
('1','1719','4','280','97','183','purchase','14:09:10','Cash','Brand #4 brandy','SKU-#4','Store71','71','Lancaster','CA','West'),
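The original post doesn't show how such statements are produced, so here is a minimal sketch that groups rows into batches of 100 and renders each batch as one INSERT. The row source and the naive quoting are assumptions; real data would need proper escaping (for example via a client library).

# Sketch: turn an iterable of row tuples into 100-row multi-insert statements.
# Assumes every value can be rendered as a quoted string, as in the example above.
def multi_inserts(rows, table='store_sales_fact', batch_size=100):
    batch = []
    for row in rows:
        batch.append('(' + ','.join("'%s'" % v for v in row) + ')')
        if len(batch) == batch_size:
            yield 'insert into %s values %s;' % (table, ','.join(batch))
            batch = []
    if batch:  # final, possibly smaller batch
        yield 'insert into %s values %s;' % (table, ','.join(batch))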

2. Load in parallel

Our customer has a sample file of 510,593,334 records. We could use the command-line mysql client to pipe this file into SingleStore, but that would not leverage all the cores available in the system. Ideally we should split the file into at least as many chunks as there are CPUs in the system.

3. Increase granularity

Splitting the file into 64 big chunks would introduce data skew: the total load time is determined by how long the slowest thread takes. To address this we split the file into thousands of chunks, and every time a thread frees up it starts loading another chunk. Here we split the file into 2000 chunks; a sketch of one way to do the split follows the listing below.

-rw-r--r--  1  35662844 2012-06-06 11:42 data10.sql
-rw-r--r--  1  35651723 2012-06-06 11:42 data11.sql
-rw-r--r--  1  35658433 2012-06-06 11:42 data12.sql
-rw-r--r--  1  35663665 2012-06-06 11:42 data13.sql
-rw-r--r--  1  35667480 2012-06-06 11:42 data14.sql
-rw-r--r--  1  35659549 2012-06-06 11:42 data15.sql
-rw-r--r--  1  35661617 2012-06-06 11:42 data16.sql
-rw-r--r--  1  35650414 2012-06-06 11:42 data17.sql
-rw-r--r--  1  35661625 2012-06-06 11:42 data18.sql
-rw-r--r--  1  35667634 2012-06-06 11:42 data19.sql
-rw-r--r--  1  35662989 2012-06-06 11:42 data1.sql
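One way to produce chunks like these (a sketch of my own, not the authors' actual tooling) is to split on line boundaries rather than raw bytes, so no multi-insert statement gets cut in half. It assumes the source file data.sql holds one complete INSERT statement per line; the chunk count and file names match the load script below.

import os

total_files = 2000
target_bytes = os.path.getsize('data.sql') // total_files  # roughly 35 MB per chunk

# assumes the data/ directory already exists
chunk, written, out = 0, 0, open('data/data0.sql', 'w')
with open('data.sql') as src:
    for statement in src:
        out.write(statement)
        written += len(statement)
        # roll over to the next chunk once this one is big enough
        if written >= target_bytes and chunk < total_files - 1:
            out.close()
            chunk, written = chunk + 1, 0
            out = open('data/data%d.sql' % chunk, 'w')
out.close()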

4. Load script

The load script uses the Python multiprocessing library to load the data efficiently.

import re
import sys
import os
import multiprocessing
import optparse

parser = optparse.OptionParser()
parser.add_option("-D", "--database", help="database name")
parser.add_option("-P", "--port", help="port to connect. use 3306 to connect to memsql and 3307 to connect to mysql", type="int")
(options, args) = parser.parse_args()

if not options.database or not options.port:
    parser.print_help()
    exit(1)

total_files = 2000

def load_file(filename):
    try:
        print "loading from cpu: %d" % os.getpid()
        # pipe one chunk file into the server through the mysql command line client
        query = 'mysql -h 127.0.0.1 -D %s -u root -P %d < %s' % (options.database, options.port, filename)
        print query
        os.system(query)
        print "done loading from cpu: %d" % os.getpid()
    except Exception as e:
        print e

# start from an empty table, then load every chunk on a pool of worker processes
os.system('echo "delete from store_sales_fact" | mysql -h 127.0.0.1 -u root -P %d' % options.port)
p = multiprocessing.Pool(processes = 2*multiprocessing.cpu_count())
for j in range(0, total_files):
    p.apply_async(load_file, ['data/data%d.sql' % j])

p.close()
p.join()

5. Running it

I started loading by issuing the following command.

time python load.py -D test -P 3306

With the load running, let's start htop to check processor saturation. It looks pretty busy.

Processor saturation

SingleStore uses lock-free data structures that eliminate a lot of contention.

It took a bit of time, but the data has been loaded.

real    39m21.465s
user    33m53.210s
sys     5m24.470s

The result is over 200,000 inserts a second (510,593,334 rows in just over 39 minutes) for a table with four indexes. The memory footprint of the SingleStore process is 267 GB.
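As a quick sanity check (my addition, not part of the original run), the row count can be confirmed through the same command-line client the load script shells out to; it should come back as 510,593,334.

import os

# Hypothetical verification step; database name and port match the load command above.
os.system('echo "select count(*) from store_sales_fact" | mysql -h 127.0.0.1 -D test -u root -P 3306')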

Running the same test against MySQL

It would not be fair to skip a comparison with MySQL, particularly since it's so easy to do: SingleStore speaks the MySQL wire protocol and supports MySQL syntax. I used the my.cnf settings from http://derwiki.tumblr.com/post/24490758395/loading-half-a-billion-rows-into-mysql and fired up the same command:

time python load.py -D test -P 3307

real    536m0.376s
user    27m17.130s
sys     5m36.780s

I did not disable indexes for this run, to keep the comparison with SingleStore fair.

Conclusion

With the Peak Hosting offering of 512 GB of RAM and memory-optimized software like SingleStore, you can save full business days of data loading and get immediate, valuable insight into your data.

