Today we launched MemSQL 5.5 featuring MemSQL Pipelines, a new way to achieve maximum performance for real-time data ingestion at scale. This implementation enables exactly-once semantics when streaming from message brokers such as Apache Kafka.

An end-to-end real-time analytics data platform requires real-time analytical queries and real-time ingestion. However, it is rare to find a data platform that satisfies both of these requirements. With the launch of MemSQL Pipelines as a native feature of our database, we now deliver an end-to-end solution from real-time ingest to analytics.

Real-Time Analytical Queries and Data Ingestion

Let’s define real-time analytical queries and real-time data ingestion separately.

A data platform that supports real-time analytical queries quickly returns results for sophisticated analytical queries, which are usually written in SQL with many complex JOINs. Execution of real-time analytical queries differentiates MemSQL from competitors. In the past year, Gartner Research recognized MemSQL as the number one operational data warehouse and awarded the company Visionary placements in the Operational Database and Data Warehouse Magic Quadrants.

A data platform that supports real-time ingestion can instantly store streaming data from sources like web traffic, sensors on machines, or edge devices. MemSQL Pipelines ingests data at scale in three steps. First, Extract: pulling data from sources efficiently. Second, Transform: mapping and enriching the data. Finally, Load: loading the data into MemSQL. All three steps occur within one database, as a single pipeline. The transactional nature of Pipelines sets it apart from other solutions. Streaming data is atomically committed in MemSQL, and exactly-once semantics are ensured by storing metadata about each pipeline in the database.

[Figure: MemSQL Pipelines architecture]

At the core of MemSQL Pipelines is a database object that is new to MemSQL: a PIPELINE, a top-level database element similar to a TABLE, INDEX, or VIEW. Let’s explore the different properties of MemSQL Pipelines.

Pipelines allow extraction from data sources (e.g. Apache Kafka) using a robust, database-native mechanism.
The system that pulls data from data sources IS the MemSQL Database. Other solutions that claim to pull streaming data typically leverage separate “middleware” solutions to extract data. That not only decreases performance, it also requires additional provisioning and management.

MemSQL ensures true exactly-once semantics for Kafka messages.
MemSQL stores and manages the latest loaded Kafka offsets within the MemSQL Database. The offsets are incremented only when a Kafka message has been reliably extracted, transformed, and loaded into MemSQL. In the event of any error, such as Kafka connectivity problems, improper transforms, or malformed data, MemSQL still ensures that each Kafka message is processed exactly once.
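
The offset bookkeeping described above can be illustrated with a small sketch. This is not MemSQL's implementation; it uses Python's built-in sqlite3 module as a stand-in database, and the table names (`events`, `pipeline_offsets`) and the functions `open_store` and `load_batch` are hypothetical. The point it shows is that writing the rows and the next offset in the same transaction makes a replayed or failed batch harmless.

```python
import sqlite3


def open_store():
    """Create an in-memory stand-in database with a data table and an offsets table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (payload TEXT)")
    db.execute(
        "CREATE TABLE pipeline_offsets ("
        "source_partition INTEGER PRIMARY KEY, next_offset INTEGER)"
    )
    return db


def load_batch(db, partition, messages, transform):
    """Load (offset, payload) pairs and advance the stored offset atomically.

    Messages below the stored offset are skipped, so replaying a batch after
    a failure does not insert duplicate rows.
    """
    row = db.execute(
        "SELECT next_offset FROM pipeline_offsets WHERE source_partition = ?",
        (partition,),
    ).fetchone()
    next_offset = row[0] if row else 0

    # One transaction: commits if the block succeeds, rolls back on any exception.
    with db:
        for offset, payload in messages:
            if offset < next_offset:
                continue  # already loaded by an earlier, committed batch
            db.execute("INSERT INTO events VALUES (?)", (transform(payload),))
            next_offset = offset + 1
        db.execute(
            "INSERT OR REPLACE INTO pipeline_offsets VALUES (?, ?)",
            (partition, next_offset),
        )
```

If `transform` raises partway through a batch, the rows inserted so far are rolled back together with the offset update, so the next attempt starts from the last committed offset.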

Data enrichment and transformation in Pipelines can be implemented using any programming language.
MemSQL Pipelines introduces the concept of a “transform” specified as part of the Pipeline DDL. Transforms are user-defined scripts that enrich and map external data for loading into MemSQL, written in any programming language for familiarity and flexibility.
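
To give a feel for what a transform might look like, here is a minimal sketch in Python. The input and output formats are assumptions made for illustration, not something this post specifies: it assumes one JSON record per line on input and one tab-separated row per line on output. The field names `user_id` and `url` and the functions `transform` and `run` are hypothetical.

```python
import io
import json
from urllib.parse import urlsplit


def transform(lines):
    """Map each JSON record to one tab-separated row, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Enrichment step (hypothetical): derive the host name from the URL.
        host = urlsplit(record["url"]).hostname
        yield f'{record["user_id"]}\t{record["url"]}\t{host}\n'


def run(src, dst):
    """Stream transformed rows from a readable source to a writable sink."""
    for row in transform(src):
        dst.write(row)


# Demo with in-memory streams; a standalone script would instead call
# run(sys.stdin, sys.stdout).
out = io.StringIO()
run(io.StringIO('{"user_id": 7, "url": "https://example.com/pricing"}\n'), out)
```

Keeping the mapping logic in a generator separate from the I/O makes the script easy to test without a running pipeline.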

Data loading into MemSQL happens efficiently and in parallel between MemSQL data partitions and Kafka brokers.
MemSQL is a distributed system, and MemSQL Pipelines was implemented with this attribute in mind. Pipelines loads data between distributed systems and streams data in parallel from individual Kafka brokers directly into MemSQL data partitions. Moreover, MemSQL performs distributed data loading optimizations such as reducing the total number of threads used, sharing data buffers, and minimizing intra-cluster connections.
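
The idea of parallel loading between two partitioned systems can be sketched as follows. This is an illustration only and does not describe MemSQL's actual scheduling: the round-robin assignment, the functions `assign` and `load_parallel`, and the `fetch` callback are all assumptions. It shows each database partition pulling only from its own share of source partitions, with one worker per database partition.

```python
from concurrent.futures import ThreadPoolExecutor


def assign(source_partitions, num_db_partitions):
    """Round-robin plan: source partition i goes to database partition i % n."""
    plan = {p: [] for p in range(num_db_partitions)}
    for i, src in enumerate(source_partitions):
        plan[i % num_db_partitions].append(src)
    return plan


def load_parallel(plan, fetch):
    """Run one worker per database partition; each pulls only its own sources.

    fetch(source) is a caller-supplied function returning that source's rows.
    """
    def worker(item):
        db_partition, sources = item
        rows = [row for src in sources for row in fetch(src)]
        return db_partition, rows

    with ThreadPoolExecutor(max_workers=len(plan)) as pool:
        return dict(pool.map(worker, plan.items()))
```

Because every worker writes to a different database partition, no coordination between workers is needed while a batch is being loaded.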

Learn more about MemSQL Pipelines in our technical documentation.

Try MemSQL Pipelines Today

Build your own pipeline today: Download MemSQL 5.5