One of the most common tasks with any database is loading large amounts of data into it from an external data store. Both MemSQL and MySQL provide the LOAD DATA command for this task; this command is very powerful, but by itself, it has a number of restrictions:
It can only read from the local filesystem, so loading data from a remote store like Amazon S3 requires first downloading the files you need.
Since it can only read from a single file at a time, loading from multiple files requires multiple LOAD DATA commands. If you want to perform this work in parallel, you have to write your own scripts.
If you are loading multiple files, it’s up to you to make sure that you’ve deduplicated the files and their contents.
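To make these restrictions concrete, here is a minimal sketch of the kind of script you would otherwise have to write yourself: it skips files whose contents are byte-for-byte identical, builds one LOAD DATA statement per remaining file, and runs the statements in parallel. The function names and the `run_sql` callable are hypothetical, chosen for illustration only; `run_sql` stands in for whatever your database driver uses to execute a single SQL string on its own connection, and the comma delimiter is an assumption about the file format.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def file_digest(path):
    """Return a SHA-256 hex digest of the file's bytes, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def dedupe_files(paths):
    """Keep the first path seen for each distinct file content."""
    seen = set()
    unique = []
    for path in paths:
        digest = file_digest(path)
        if digest not in seen:
            seen.add(digest)
            unique.append(path)
    return unique


def build_load_statement(path, table):
    """Build one LOAD DATA statement for a single local file."""
    # Escape backslashes and single quotes so the path is a valid SQL string.
    escaped = path.replace("\\", "\\\\").replace("'", "\\'")
    return (
        f"LOAD DATA INFILE '{escaped}' INTO TABLE `{table}` "
        "FIELDS TERMINATED BY ','"  # assumes comma-separated input
    )


def load_files(paths, table, run_sql, workers=4):
    """Run one LOAD DATA per unique file, several at a time.

    run_sql is a hypothetical callable that executes one SQL string
    on its own database connection.
    """
    statements = [build_load_statement(p, table) for p in dedupe_files(paths)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every statement and re-raises any worker exception.
        list(pool.map(run_sql, statements))
    return statements
```

Note that this sketch only catches whole files that are identical. It does not detect duplicate rows spread across different files, and it does not retry a statement that fails, so both of those remain your responsibility.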
Why We Built the MemSQL Loader
At MemSQL, we’ve acutely felt all of these limitations. That’s why we developed MemSQL Loader, which solves all of the above problems and more. MemSQL Loader lets you load files from Amazon S3, the Hadoop Distributed File System (HDFS), and the local filesystem. You can specify all of the files you want to load with one command, and MemSQL Loader will take care of deduplicating files, parallelizing the workload, retrying files if they fail to load, and more.
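As an illustration of what "retrying files if they fail to load" involves, the following is a minimal retry helper with exponential backoff. It is a generic sketch, not MemSQL Loader's actual implementation; the function name, the default attempt count, and the delays are all assumptions made for the example.

```python
import time


def run_with_retries(task, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call task() until it succeeds or the attempts run out.

    Waits base_delay, then 2x, then 4x, and so on between failed
    attempts, and re-raises the last error if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Here `task` would be a zero-argument callable that loads one file, so a transient failure on one file does not abort the rest of the job.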
Use a load command to load a set of files
View the progress of a job using the ps command
We have been using MemSQL Loader here at MemSQL for quite a while now, and have provided a binary version on our website for anyone to use. We are also proud of the code we produced (or at least proud enough), so we have decided to open source the MemSQL Loader project.
Give MemSQL Loader a Try – Download Now on GitHub
The project uses several open source libraries, such as the Voluptuous data validation library and our own MemSQL Python connector. You can find the project here. Check it out, and let us know what you think!