MemSQL Streamliner, an open source tool available on GitHub, is an integrated solution for building real-time data pipelines using Apache Spark. With Streamliner, you can stream data from real-time data sources (e.g. Apache Kafka), perform data transformations within Apache Spark, and ultimately load data into MemSQL for persistence and application serving.

Streamliner is a great tool for developers and data scientists because little to no code is required: users can build their pipelines instantly.

For instance, here is a non-trivial use case that still requires no code: pull data in comma-separated value (CSV) format from a real-time data source, parse it, then create and populate a MemSQL table. You can do all of this within the MemSQL Ops web UI, depicted in the image below.

As you can see, we have simulated the real-time data source with a “Test” that feeds in static CSV values. You can easily replace that with Kafka or a custom data source. The static data is then loaded into the hr.employees table in MemSQL.

[Image: a Streamliner pipeline in the MemSQL Ops web UI]
Sometimes, however, you need to perform more complex operations on your data before inserting it into MemSQL. Streamliner supports uploading JAR files containing your own Extractor and Transformer classes, allowing you to run extracts from custom data sources and arbitrary transformations on your data.

Here is an example of how to create a custom Transformer class. It assumes we will receive strings from an input source (e.g. Kafka) that represent user IDs and their country of origin. For each user, we are going to look up the capital of their country and write out their user ID, country, and capital to MemSQL.

The first thing you should do is check out the Streamliner starter repo on GitHub. This repo is a barebones template that contains everything you need to start writing code for your own Streamliner pipelines.

Our example Transformer is going to have a map from ISO country code to that country’s capital. It takes in a CSV string containing a user ID (which is an integer) and a country code. It will return a DataFrame containing rows with three columns: user ID, country code, and country capital.
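Before wiring this into Spark, it helps to see the per-record logic on its own. The sketch below is self-contained plain Scala; the object name, the map entries, and the "unknown" fallback are illustrative choices, not part of the Streamliner API:

```scala
object CapitalLookup {
  // Illustrative subset of an ISO-country-code -> capital map.
  val capitals: Map[String, String] = Map(
    "US" -> "Washington, D.C.",
    "FR" -> "Paris",
    "JP" -> "Tokyo"
  )

  // Parse one CSV record like "42,FR" into (userId, country, capital).
  def parseLine(line: String): (Int, String, String) = {
    val fields = line.split(",").map(_.trim)
    val (userId, country) = (fields(0).toInt, fields(1))
    (userId, country, capitals.getOrElse(country, "unknown"))
  }
}
```

For example, `CapitalLookup.parseLine("42,FR")` yields `(42, "FR", "Paris")`; a code not in the map falls through to `"unknown"`. The real transformer applies exactly this per-line logic across an RDD.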

Open up Transformers.scala and edit the BasicTransformer code so that it looks like this:
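The exact listing depends on your Streamliner version; as a sketch, assuming the SimpleByteArrayTransformer base class and config/logger types from the starter repo (check the repo for the exact signatures), it might look like this:

```scala
// Sketch only: base class, config, and logger types are assumptions
// drawn from the Streamliner starter repo; verify against your version.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types._

import com.memsql.spark.etl.api.{SimpleByteArrayTransformer, UserTransformConfig}
import com.memsql.spark.etl.utils.PhaseLogger

class BasicTransformer extends SimpleByteArrayTransformer {
  // Illustrative subset of an ISO-country-code -> capital map.
  val capitals = Map("US" -> "Washington, D.C.", "FR" -> "Paris", "JP" -> "Tokyo")

  override def transform(sqlContext: SQLContext, rdd: RDD[Array[Byte]],
                         config: UserTransformConfig, logger: PhaseLogger): DataFrame = {
    // Each record is a UTF-8 CSV string like "42,FR".
    val rowRDD = rdd
      .map(bytes => new String(bytes, "UTF-8").trim)
      .map(_.split(","))
      .map(fields => Row(fields(0).toInt, fields(1),
                         capitals.getOrElse(fields(1), "unknown")))

    // Three output columns: user ID, country code, country capital.
    val schema = StructType(Seq(
      StructField("user_id", IntegerType, nullable = false),
      StructField("country", StringType, nullable = false),
      StructField("capital", StringType, nullable = true)))

    sqlContext.createDataFrame(rowRDD, schema)
  }
}
```

The returned DataFrame's schema determines the columns Streamliner writes to the target MemSQL table.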

Compile the project using the ‘make build’ command, and upload the resulting JAR to MemSQL Ops:

[Images: uploading the transformer JAR in MemSQL Ops]

Then create a new pipeline. For now, we will use the “Test” extractor, which sends a constant stream of strings to our transformer; this is very useful for testing your code.

[Image: creating a new pipeline with the “Test” extractor]

Then, choose your BasicTransformer class.

[Image: selecting the BasicTransformer class]

Finally, choose the database and table name in MemSQL where you would like your data to go.

[Image: choosing the target database and table in MemSQL]

Once you save your pipeline, it will automatically start running and populating your table with data.

[Image: the running pipeline in MemSQL Ops]

You can select rows from the table created in MemSQL to see that it is being populated with your transformed data:

[Image: SELECT query output showing the transformed rows]

Writing your own transformers is an extremely powerful way to manipulate data in real-time pipelines. The code shown above is just one demonstration of how Streamliner lets you easily pre-process data while leveraging the power of Spark, a fast, general-purpose, distributed framework. Check out the MemSQL documentation and the Streamliner examples repo for more examples of what you can do with Streamliner.