Creating an IoT Kafka Pipeline in Under Five Minutes


Seth Luersen

Former Head of Training, Curriculum, and Certification Programs

In a recent SingleStore webcast, we discussed how modern enterprises can easily adopt new data management tools to manage data size, growth, and complexity. Then we demonstrated how to use Apache Kafka and SingleStore to build interactive, real-time data pipelines. Data pipelines capture, process, and serve massive amounts of data to millions of users.

During the webcast we also shared how to:

Build new data pipelines with modern tools

Enable data workflows to support machine learning and predictive analytics with data pipelines

Deploy a real-time data pipeline using Apache Kafka and SingleStore Pipelines in under five minutes

We received additional questions from the webcast attendees, and wanted to share the answers publicly.

Question: Is SingleStore a NoSQL database or a Relational Database Management System (RDBMS)?
SingleStore is a modern, in-memory optimized, massively parallel processing, shared-nothing, real-time database. SingleStore stores data in tables, and supports standard SQL data types. Geospatial and JSON data types are first-class citizens in SingleStore. With SingleStore, you can store and query structured, semi-structured, and unstructured data.
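
For example, here is a minimal sketch (the table and column names are illustrative, not from the webcast) showing how standard SQL, geospatial, and JSON types can coexist in a single SingleStore table definition:

CREATE TABLE sensor_readings (
    sensor_id BIGINT NOT NULL,
    reading_time DATETIME NOT NULL,
    location GEOGRAPHYPOINT,   -- geospatial type, a first-class citizen
    payload JSON,              -- JSON type, validated on insert
    temperature DOUBLE,
    PRIMARY KEY (sensor_id, reading_time)
);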

Question: What are the minimum memory requirements for SingleStore?
SingleStore is a distributed system consisting of one or more nodes. You can find out more about system and hardware requirements in our documentation.

Question: Is loading JSON into SingleStore similar to MongoDB?
Behind the scenes, MongoDB represents JSON documents in a binary-encoded format called BSON. BSON is also the wire format for MongoDB. JSON is a first-class citizen in SingleStore. The JSON data type validates JSON. Behind the scenes, SingleStore stores validated JSON as text. For those who want to preserve BSON in SingleStore, the supported SQL data type is VARBINARY or any variation thereof: LONGBLOB, MEDIUMBLOB, BLOB, or TINYBLOB.
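
As a rough illustration (the table, column, and key names are hypothetical), a table can pair a validated JSON column with a LONGBLOB column that preserves the original BSON bytes, and built-in JSON functions can then query keys directly with SQL:

CREATE TABLE documents (
    id BIGINT PRIMARY KEY,
    doc JSON NOT NULL,     -- validated JSON, stored as text behind the scenes
    raw_bson LONGBLOB      -- optional: preserve the original BSON bytes
);

INSERT INTO documents (id, doc) VALUES (1, '{"id":1,"item":"cherry","quantity":1}');

SELECT JSON_EXTRACT_STRING(doc, 'item') AS item FROM documents;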

Question: What is the infrastructure requirement for running SingleStore and Apache Kafka?
Like Apache Kafka, SingleStore is a distributed system consisting of one or more nodes that run as a cluster. At a minimum, the infrastructure requirement is a stand-alone Apache Kafka producer and broker, plus a stand-alone SingleStore cluster consisting of one master aggregator and one leaf node. In a production environment, there are two clusters: an Apache Kafka cluster and a SingleStore cluster.

A SingleStore Pipeline ingests streaming data from an Apache Kafka topic into SingleStore leaf nodes by default. SingleStore leaf nodes contain individual database partitions. Each database partition stores the data from the Kafka stream into its own destination table. Parallelism between the number of SingleStore database partitions and Kafka broker partitions for the given topic determines optimal performance.
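
As a minimal sketch of the DDL involved (the broker endpoint, topic, pipeline, and table names are placeholders), creating and starting a Kafka pipeline looks roughly like this:

CREATE PIPELINE adtech_pipeline AS
LOAD DATA KAFKA 'kafka-broker.example.com:9092/adtech'  -- placeholder broker endpoint and topic
INTO TABLE adtech_events
FIELDS TERMINATED BY '\t';

START PIPELINE adtech_pipeline;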

Question: In order to ingest data from Apache Kafka, are there any consumer concepts in SingleStore?
Apache Kafka follows a more traditional design, shared by most messaging systems, where data is pushed to the broker from the producer and pulled from the broker by the consumer. In a pull-based system, the consumer can catch up when the system falls behind. A SingleStore Pipeline for Apache Kafka uses a pipeline extractor for Kafka. This extractor is a Kafka consumer.

Question: Does a SingleStore Pipeline using the Apache Kafka extractor only support real-time ingestion into a SingleStore “rowstore” table?
With a SingleStore Pipeline, you can ingest data massively in parallel into distributed tables. In SingleStore, a table is either distributed or non-distributed (a reference table). There are two storage types for tables: in-memory rowstore and columnstore. Every columnstore table has an unexposed, in-memory rowstore table. SingleStore automatically spills rows from the in-memory rowstore to the columnstore. All data, including rows still in the hidden rowstore, is queryable through the columnstore table.
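
A rough sketch of the two storage types (table and column names are illustrative): the first table is an in-memory rowstore, the default; the second is declared as a columnstore with a clustered columnstore key:

CREATE TABLE events_rowstore (
    event_id BIGINT,
    event_time DATETIME,
    payload JSON
);

CREATE TABLE events_columnstore (
    event_id BIGINT,
    event_time DATETIME,
    payload JSON,
    KEY (event_time) USING CLUSTERED COLUMNSTORE  -- declares the columnstore storage type
);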

Question: Is it possible to move data from an in-memory rowstore table to a columnstore table?
Yes. It is as simple as:

INSERT INTO columnstore_table SELECT * FROM rowstore_table;

Question: In this webinar demonstration, where is this data being ingested from and is it pre-generated data?
The webinar demo showcases a SingleStore Pipeline for Apache Kafka. SingleStore hosts the Apache Kafka cluster. A Python program generates the data and writes it to a Kafka producer for the adtech topic. The SingleStore Pipeline consumes this topic from the Kafka broker endpoint.

Question: What happens if you need to adjust or change your data schema?
You can modify a table in SingleStore with a data definition language (DDL) ALTER TABLE… statement. By default, SingleStore supports an online ALTER statement. Since SingleStore Pipelines are written with DDL, you can also alter a SingleStore Pipeline with an ALTER PIPELINE… statement. Typically, the procedure to handle schema changes is simply:

STOP PIPELINE mypipeline;
ALTER TABLE mytable… ;
ALTER PIPELINE mypipeline…;
TEST PIPELINE mypipeline;
START PIPELINE mypipeline;

Question: Will you please provide a code example of how a SingleStore Pipeline for Apache Kafka transforms a JSON message?
Here is a simple example of JSON in a Kafka message:

{"id":1,"item":"cherry","quantity":1}

Written in Python, here is a basic transform script that extracts the id field from JSON:

#!/usr/bin/env python
import struct
import sys
import json

def transform_records():
    # The pipeline extractor writes each record to stdin as an 8-byte
    # length prefix followed by the record payload.
    while True:
        byte_len = sys.stdin.read(8)
        if len(byte_len) == 8:
            byte_len = struct.unpack("L", byte_len)[0]
            result = sys.stdin.read(byte_len)
            yield result
        else:
            # End of the input stream
            return

for l in transform_records():
    # Parse the JSON message and emit a tab-separated line: the id, then the raw JSON
    parsed_json = json.loads(l)
    sys.stdout.write("%s\t%s\n" % (parsed_json["id"], l))
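
To attach a transform like this to a pipeline, the script is referenced in the WITH TRANSFORM clause of the pipeline DDL. The following is only a sketch: the pipeline, topic, and table names and the URL hosting the script are placeholders, and the exact argument conventions for WITH TRANSFORM may vary by version:

CREATE PIPELINE json_id_pipeline AS
LOAD DATA KAFKA 'kafka-broker.example.com:9092/adtech'
WITH TRANSFORM ('http://transforms.example.com/transform.py', '', '')  -- placeholder URL to the script above
INTO TABLE adtech_events
FIELDS TERMINATED BY '\t';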

Question: How is it possible to persist complex master detail records with foreign keys with a SingleStore Pipeline?
SingleStoreDB Self-Managed 6 does not enforce foreign key constraints, does not support triggers, and a SingleStore Pipeline only supports loading data into a single table. Recently, a SingleStore engineer demoed a beta release of SingleStoreDB Self-Managed 6.5 in which a SingleStore Pipeline can load data into a stored procedure! This architecture allows for conditional logic within the stored procedure to manage complex scenarios such as data ingest into related tables. To learn more about streaming data into stored procedures with SingleStore Pipelines, see Recapping An Evening with SingleStore Engineering.
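
As a hedged sketch of what that architecture could look like (the procedure, pipeline, table, and column names are hypothetical, and the syntax reflects the 6.5 beta as described above), a stored procedure receives each pipeline batch as a query parameter and fans the rows out to related tables:

DELIMITER //
CREATE OR REPLACE PROCEDURE ingest_orders(batch QUERY(order_id BIGINT, item TEXT, quantity INT)) AS
BEGIN
    -- Conditional logic can live here; this sketch simply writes to two related tables
    INSERT INTO orders (order_id, item) SELECT order_id, item FROM batch;
    INSERT INTO order_lines (order_id, quantity) SELECT order_id, quantity FROM batch;
END //
DELIMITER ;

CREATE PIPELINE orders_pipeline AS
LOAD DATA KAFKA 'kafka-broker.example.com:9092/orders'  -- placeholder broker endpoint and topic
INTO PROCEDURE ingest_orders
FIELDS TERMINATED BY '\t';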

Question: Usually, data in Apache Kafka is in a binary format such as Apache Avro, so how does SingleStore support user-defined decoding?
SingleStore Pipelines support data ingest in either CSV or TSV format. One way to ingest compressed Avro data from a Kafka topic is to create a data pipeline with Apache Spark. Spark Streaming allows Spark to consume a Kafka topic directly. Using the SingleStore Spark Connector, it is possible to decode the binary formats and save data directly into SingleStore. You can learn more about working with Spark in our SingleStore Spark Connector Guide.

Another approach is to use an Avro to JSON converter. Once converted, the Kafka message is essentially a JSON blob. In a SingleStore Pipeline, you can easily transform JSON using any Linux-supported API that efficiently parses JSON. SingleStore Pipelines also support jq, a lightweight and flexible command-line JSON processor. To learn more, read JSON Streaming and the Future of Data Ingest.

JSON is a first-class citizen in SingleStore. With built-in JSON functions, it is possible to parse JSON key-values into persisted, computed columns. SingleStore supports indexing of computed columns. With SingleStore, you can easily index and parse JSON with standard SQL.
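
As a brief sketch (the table, column, and JSON key names are illustrative), a persisted computed column can extract a JSON key and carry a standard index:

CREATE TABLE ad_events (
    event JSON NOT NULL,
    campaign_id AS event::$campaign_id PERSISTED TEXT,  -- computed column extracted from a JSON key
    KEY (campaign_id)                                   -- standard index on the computed column
);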

Question: Does SingleStore handle back-pressure automatically?
A large Apache Kafka cluster in production can deliver messages at a very fast rate, on the order of millions of messages per second. In high-volume, high-rate use cases, many API-driven consumers reach a point where they cannot keep pace and fall behind, a behavior known as back-pressure.

Because SingleStore is a modern, in-memory optimized, massively parallel processing, shared-nothing, real-time database, a SingleStore Pipeline for Apache Kafka can easily consume and ingest messages at very high volume and high rates. A SingleStore Pipeline ingests streaming data from an Apache Kafka topic into SingleStore leaf nodes by default. SingleStore leaf nodes contain individual database partitions. Each database partition consumes the Kafka stream into the designated destination table. By default, tables in SingleStore are in-memory, rowstore tables.

Parallelism between the number of SingleStore database partitions and Kafka broker partitions for the given topic determines optimal performance, as this parallelism dictates the total batch size. SingleStore keeps track of the earliest and latest Kafka offsets in the information_schema.PIPELINES_BATCHES table.
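
For example, assuming a pipeline named adtech_pipeline (a placeholder), batch metadata can be inspected with a plain SQL query; the exact columns available depend on your version:

SELECT * FROM information_schema.PIPELINES_BATCHES
WHERE PIPELINE_NAME = 'adtech_pipeline';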

Question: What are the advantages to using Apache Kafka versus Amazon S3?
Apache Kafka is a modern, distributed messaging system. Amazon S3 is cloud object storage built to store and retrieve files. SingleStore Pipelines have extractors for both Apache Kafka and Amazon S3. For both extractors, the number of database partitions in SingleStore determines the degree of parallelism for data ingest.

For S3, the number of database partitions in SingleStore equates to the number of files in a pipeline batch. Each database partition ingests a specific S3 file from a folder in an S3 bucket. The files can be compressed. Amazon S3 has known rate limits for GET requests, which start at just over 100 requests per second. In addition, the pricing model for S3 includes charges based on the volume of data egress. To learn more about SingleStore S3 Pipelines at scale, check out 1.3 Billion NYC Taxi Rows.
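
A minimal sketch of an S3 pipeline (the bucket, folder, pipeline and table names, region, and credentials are placeholders):

CREATE PIPELINE taxi_pipeline AS
LOAD DATA S3 'my-bucket/taxi-data/'  -- placeholder bucket and folder
CONFIG '{"region": "us-east-1"}'
CREDENTIALS '{"aws_access_key_id": "<your_key>", "aws_secret_access_key": "<your_secret>"}'
INTO TABLE taxi_rides
FIELDS TERMINATED BY ',';

START PIPELINE taxi_pipeline;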

An Apache Kafka cluster is able to support reads and writes in the millions per second. Of course, this is very dependent on message size. As message sizes increase, messaging throughput decreases. That said, as a distributed system, you can scale out Apache Kafka to meet your requirements.

Question: How can I get the O’Reilly eBook trilogy from SingleStore?
You can download our O’Reilly eBooks covering topics that range from Predictive Analytics to Artificial Intelligence at SingleStore O’Reilly Trilogy.

Question: What do I need to get started with Apache Kafka and SingleStore?
You can deploy a SingleStore cluster on-premises, with Docker, with Amazon Web Services, or with Microsoft Azure. To learn more, check out our Quick Start Guides.

To learn how to quickly set up a SingleStore Pipeline for Apache Kafka, review our Quick Start for Kafka Pipelines. You can watch the recording of the Pipelines in Five Minutes webcast here. If you prefer a quick demo of SingleStore, SingleStore Pipelines, and Apache Kafka, sign up at SingleStore Demo.
