Top Spark Summit Questions

The MemSQL team enjoyed sponsoring and attending Spark Summit last week, where we spoke with hundreds of developers, data scientists, and architects all getting a better handle on modern data processing technologies like Spark and MemSQL. After a couple of days on the expo floor, I noticed several common questions. Below are some of the most frequent questions and answers exchanged in the MemSQL booth.

1. When should I use MemSQL?

MemSQL shines in use cases requiring analytics on a changing dataset. The legacy data processing model, which creates separate silos for transactions and analytics, prevents updated data from propagating to reports and dashboards until the nightly or weekly ETL job runs. Serving analytics from a real-time operational database means reports and dashboards are accurate up to the last event, not last week.

That said, MemSQL is a relational database and you can use it to build whatever application you want! In practice, many customers choose MemSQL because it is the only solution able to handle concurrent ingest and query execution for analyzing changing datasets in real time.

2. What does MemSQL have to do with Spark?

Short answer: you need to persist Spark data somewhere, whether in MemSQL or in another data store. Choosing MemSQL provides several benefits including:

  • In-memory storage and data serving for maximum performance
  • Structured database schema and indexes for fast lookups and query execution
  • A connector that parallelizes data transfer and processing for high throughput

Longer answer: There are two main use cases for Spark and MemSQL:

  1. Load data through Spark into MemSQL, transforming and enriching data on the fly in Spark

    In this scenario, data is structured and ready to be queried as soon as it lands in MemSQL, enabling applications like dashboards and interactive analytics on real-time data. We demonstrated this “real-time pipeline” at Spark Summit, processing and analyzing real-time energy consumption data from tens of millions of devices and appliances.

  2. Leverage the Spark DataFrame API for analytics beyond SQL using data from MemSQL

    One of the best features of Spark is its expressive but concise programming interface. In addition to enabling MemSQL users to express iterative computations, it gives them access to the many libraries that run on the Spark execution engine. The MemSQL Spark connector is optimized to push computation into MemSQL to minimize data transfer and to take advantage of the MemSQL optimizer and indexing.
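To make "pushing computation into the database" concrete, here is a minimal sketch of the idea. It does not use the MemSQL Spark connector: Python's built-in sqlite3 module stands in for MemSQL, plain Python stands in for Spark, and the readings table, its columns, and the function names are all hypothetical.

```python
import sqlite3


def make_db():
    # sqlite3 is only a stand-in for MemSQL; table and column names are made up.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (device_id INTEGER, kwh REAL)")
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?)",
        [(1, 0.5), (1, 1.5), (2, 4.0), (3, 0.25)],
    )
    return conn


def total_without_pushdown(conn, device_id):
    # Pull every row out of the database, then filter and sum on the client.
    rows = conn.execute("SELECT device_id, kwh FROM readings").fetchall()
    total = sum(kwh for dev, kwh in rows if dev == device_id)
    return total, len(rows)  # (answer, rows transferred)


def total_with_pushdown(conn, device_id):
    # Let the database filter and aggregate; only one row is transferred.
    rows = conn.execute(
        "SELECT SUM(kwh) FROM readings WHERE device_id = ?", (device_id,)
    ).fetchall()
    return rows[0][0], len(rows)  # (answer, rows transferred)
```

Both functions return the same total, but the second one moves a single row instead of the whole table, and it lets the database use its own optimizer and indexes to find the matching rows.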

3. What’s the difference between MemSQL and Spark SQL?

There are several differences:

  • Spark is a data processing framework, not a database, and does not natively support persistent storage. MemSQL is a database that stores data in memory and writes logs and full database snapshots to disk for durability.
  • Spark treats datasets (RDDs) as immutable – there is currently no concept of an INSERT, UPDATE, or DELETE. You could express these concepts as a transformation, but this operation returns a new RDD rather than updating the dataset in place. In contrast, MemSQL is an operational database with full transactional semantics.
  • MemSQL supports updatable relational database indexes. The closest analogue in Spark is IndexedRDD, which is currently under development, and provides updatable key/value indexes within a single thread.
  • Spark SQL is more than a SQL server: the Spark DataFrame library is also a general-purpose library for manipulating structured data.
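The contrast between immutable datasets and in-place updates can be sketched in a few lines. This is an illustration only, not Spark or MemSQL code: a Python tuple stands in for an immutable dataset, sqlite3 stands in for an operational database, and the counters table and function names are hypothetical.

```python
import sqlite3


def incremented(dataset, delta):
    # Immutable style: the "transformation" returns a new dataset
    # and leaves the input untouched.
    return tuple((key, value + delta) for key, value in dataset)


def make_counters_db(dataset):
    # sqlite3 is only a stand-in for an operational database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE counters (key TEXT PRIMARY KEY, value INTEGER)")
    conn.executemany("INSERT INTO counters VALUES (?, ?)", dataset)
    return conn


def increment_in_place(conn, delta):
    # Database style: UPDATE modifies the stored rows inside a transaction.
    with conn:  # commits on success, rolls back on error
        conn.execute("UPDATE counters SET value = value + ?", (delta,))


def read_counters(conn):
    return conn.execute("SELECT key, value FROM counters ORDER BY key").fetchall()
```

After calling incremented(), two datasets exist: the original and the new one. After calling increment_in_place(), there is still one table, and every later query sees the updated values.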

4. How do MemSQL and Spark interact with one another?

The MemSQL Spark Connector is an open source tool available on the MemSQL GitHub page. Under the hood, the connector creates a mapping between MemSQL database partitions and Spark RDD partitions. It also takes advantage of both systems’ distributed architectures to load data in parallel. The connector comes with a small library that includes the MemSQLRDD class, allowing the user to create an RDD from the result of a SQL query in MemSQL. MemSQLRDD also comes with a method called saveToMemSQL(), which makes it easy to write data to MemSQL after processing.
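The partition mapping can be pictured with a small sketch. This is not the connector's code or API: the DB_PARTITIONS dictionary, read_partition, and load_in_parallel are hypothetical stand-ins, and a thread pool plays the role of Spark's parallel tasks.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for database partitions: partition id -> rows stored there.
DB_PARTITIONS = {
    0: [("dev-1", 0.5), ("dev-4", 1.0)],
    1: [("dev-2", 2.0)],
    2: [("dev-3", 0.25), ("dev-5", 3.0)],
}


def read_partition(partition_id):
    # In a real connector this would be a query against one database partition.
    return list(DB_PARTITIONS[partition_id])


def load_in_parallel(partition_ids):
    # One task per database partition. Each result becomes one partition of
    # the loaded dataset, so the two layouts map one-to-one.
    with ThreadPoolExecutor(max_workers=len(partition_ids)) as pool:
        return list(pool.map(read_partition, partition_ids))
```

Calling load_in_parallel(sorted(DB_PARTITIONS)) returns three lists, one per source partition, in partition order. Because no single reader has to pull the whole table, throughput can scale with the number of partitions.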

5. Can I have one of those cool t-shirts? (Of course!) What does the design mean?

The design is a graphical representation of Hybrid Transactional/Analytical Processing (HTAP), a term coined by Gartner. It refers to the convergence of transactional and analytical processing in a single database, usually for real-time analytics.

Circling back to the first question, MemSQL excels at this kind of hybrid workload. In addition to reducing latency and consolidating hardware, HTAP powers tight operational feedback loops that can create opportunities for net new revenue and bottom-line cost savings. For more information on HTAP, read the Gartner Market Guide for In-Memory Databases.