
Apache Spark is one of the most powerful distributed computing frameworks available today. Its combination of fast, in-memory computing with an architecture that’s easy to understand has made it popular for users working with huge amounts of data.

While Spark shines at operating on large datasets, it still requires a solution for data persistence. HDFS is a common choice, and it integrates well with Spark, but its disk-based nature can slow down real-time applications (e.g., applications built with the Spark Streaming libraries). Spark also has no native capability to commit transactions.

Making Spark Even Better

That’s why MemSQL is releasing the MemSQL Spark Connector, which lets users read and write data between MemSQL and Spark. MemSQL is a natural fit for Spark because it can easily handle the high rate of inserts and reads that Spark often requires, while also providing enough storage capacity for the data that Spark produces.

MemSQL Spark Connector


The MemSQL Spark Connector provides everything you need to start using Spark and MemSQL together. It comes with a number of optimizations, such as reading data out of MemSQL in parallel and making sure that Spark colocates the data in its cluster with MemSQL nodes when MemSQL and Spark are running on the same physical machines. It also provides two main components: a MemSQLRDD class for loading data from a MemSQL query and a saveToMemsql function for persisting results to a MemSQL table.
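To make the load-in-parallel, transform, and save pattern concrete, here is a minimal sketch in plain Python. It does not use the connector or Spark at all: the in-memory PARTITIONS list, and the functions read_partition, load_in_parallel, and save_to_table, are hypothetical stand-ins invented for illustration, not the MemSQLRDD or saveToMemsql API. It only shows the general shape of reading each partition concurrently and persisting the transformed rows.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a partitioned database: each inner list plays the
# role of one partition holding (id, value) rows. Not real connector data.
PARTITIONS = [
    [(1, 10), (2, 20)],
    [(3, 30)],
    [(4, 40), (5, 50)],
]


def read_partition(index):
    """Return all rows of one partition (stand-in for a per-partition query)."""
    return list(PARTITIONS[index])


def load_in_parallel(num_partitions):
    """Read every partition concurrently, one task per partition.

    pool.map returns results in input order, so chunk i holds partition i.
    """
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        return list(pool.map(read_partition, range(num_partitions)))


def save_to_table(table, rows):
    """Append rows to an in-memory 'table' (stand-in for a save call)."""
    table.extend(rows)
    return len(rows)


def run_pipeline():
    """Load all partitions, double each value, and persist the results."""
    chunks = load_in_parallel(len(PARTITIONS))
    # Transform each row independently, as a Spark map step would.
    doubled = [(row_id, value * 2) for chunk in chunks for (row_id, value) in chunk]
    results_table = []
    save_to_table(results_table, doubled)
    return results_table
```

Calling run_pipeline() returns the transformed rows in partition order. In a real deployment the connector and Spark would handle partitioning, data locality, and the write path; consult the project's own documentation for the actual class and function signatures.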

We’ve made our connector open source; you can find the project on GitHub. Check it out and let us know how it works for you.

Try the MemSQL Spark Connector – Download Now on GitHub