This month’s MemSQL Ops release includes performance features for Streamliner, our integrated Apache Spark solution that simplifies the creation of real-time data pipelines. Specific features include the ability to run Spark SQL inside the MemSQL database, in-browser Python programming, and NUMA-aware MemSQL deployments.
We sat down with Carl Sverre, MemSQL architect and technical lead for Ops development, to talk about the latest release.
Q: What’s the coolest thing about this release for users?
I think the coolest thing for users is that we now support Python as a programming language for building real-time data pipelines with Streamliner. Previously, users needed to code in Scala, which is less popular, more constrained, and harder to use. In contrast, Python syntax is widely used among developers, and Python has a broad set of programming libraries providing extensibility beyond Spark. Users can import libraries like NumPy, SciPy, and pandas, which are easy to use and feature-rich compared to the corresponding Java / Scala libraries. Python also lets users prototype a data pipeline much faster than Scala does. To allow users to code in Python, we built MemSQL infrastructure on top of PySpark and also implemented a ‘pip’ command that installs any Python package across the machines in a MemSQL cluster.
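As a rough illustration of what a Python transform stage in such a pipeline might look like (the function name and record format here are hypothetical, not Streamliner’s actual API), consider a stage that parses incoming JSON records and filters out malformed or empty events using only the standard library:

```python
import json

def transform(batch):
    """Hypothetical per-batch transform: parse JSON records and
    keep only well-formed events with a positive amount."""
    rows = []
    for raw in batch:
        try:
            event = json.loads(raw)
        except ValueError:
            continue  # drop malformed records
        if event.get("amount", 0) > 0:
            rows.append((event["customer_id"], event["amount"]))
    return rows

batch = [
    '{"customer_id": 1, "amount": 9.5}',
    '{"customer_id": 2, "amount": -1}',
    'not json',
]
print(transform(batch))  # → [(1, 9.5)]
```

The point is less the specific logic than the ergonomics: this kind of parse-and-filter step is a few readable lines in Python, and any package it needs can be rolled out cluster-wide with the ‘pip’ command mentioned above.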
Q: Why focus on Python, the “language of data science”?
We believe that Python is the language most generalist developers know how to write code in. In fact, here at MemSQL, much of the software we write outside of the core database is written in Python, including Ops, our interactive management tool; Psyduck, our code testing tool; and much more. We know that developers experimenting with their own data pipelines (many using Streamliner) want to get up and running quickly. Naturally, they seek the most intuitive programming language to do so: Python!
Q: Can you dig into SQL pushdown on a technical level?
Spark SQL is the Apache Spark module for working with structured data. As an example, with the DataFrame API you can write a command like:

customers.groupBy("zip_code").count()

It counts customers in a Spark data table (DataFrame) grouped by ZIP code. The same query can also be run through Spark SQL, for example:

sqlContext.sql("select count(*) from customers group by zip_code")
Now we can boost Spark SQL performance by pushing the majority of the computation into MemSQL. How do we do it? First, we process the Spark SQL operator tree and translate it into SQL syntax that MemSQL understands, then execute that query directly against the MemSQL database, where queries on structured data run strictly faster than in Spark. By leveraging the optimized MemSQL engine, we are able to process individual Spark SQL queries much faster than Spark can on its own.
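The effect of pushdown can be sketched without a Spark cluster. Below, the same group-by count is computed twice: once in application code (roughly what Spark does itself), and once by generating SQL and handing it to an in-process database engine. Here sqlite3 from the Python standard library stands in for MemSQL purely for illustration; pushdown means the second path runs, with the aggregation executed inside the database:

```python
import sqlite3
from collections import Counter

rows = [(1, "94107"), (2, "94107"), (3, "10001")]

# Path 1: aggregate in application code.
app_counts = sorted(Counter(z for _, z in rows).items())

# Path 2: "push down" the aggregation by generating SQL and letting
# the database engine execute it.
db = sqlite3.connect(":memory:")
db.execute("create table customers (id integer, zip_code text)")
db.executemany("insert into customers values (?, ?)", rows)
sql = "select zip_code, count(*) from customers group by zip_code order by zip_code"
db_counts = db.execute(sql).fetchall()

print(app_counts)  # [('10001', 1), ('94107', 2)]
print(db_counts)   # [('10001', 1), ('94107', 2)]
```

Both paths agree on the result; the win from pushdown is that the database engine does the heavy lifting over data it already stores, instead of shipping rows out to be aggregated elsewhere.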
Q: What are you most proud of in this release?
I am most proud of the performance boost from running Spark SQL in MemSQL. It is awesome that we can leverage the MemSQL in-memory, distributed database to boost Spark performance without users having to change any of their application code. With MemSQL, Spark SQL is just faster.