JSON Streaming And The Future Of Data Ingest

Jeremy Wright

As businesses continue to become technology focused, data is more prevalent than ever. In response, companies have adopted a handful of common formats to help manage this explosive growth in data.

Data Formats Today

For a long time, XML has been the giant in terms of data interchange formats. Recently, JSON has become popular, catching a wave of interest due to its lightweight streaming support, and general ease of use. JSON is a common format for web applications, logging, and geographical data.

For example, the CloudTrail service, part of Amazon Web Services (AWS), uses JSON for its log records. Every API request to any AWS region is stored as a JSON record in a designated S3 bucket. The amount of businesses and developers using AWS, whether directly or indirectly, is staggering at over one million enterprise customers in 2015. Even though only some of them are using CloudTrail, that’s a mountain of JSON data sitting in the cloud.

However, JSON is by nature unstructured, which presents a challenge when ingested in a relational database. This is where a tool called jq, developed by Stephen Dolan, comes in handy. For those that work with UNIX tools a lot, it’s a bit like sed or awk for JSON. In fact, one of the MemSQL developers, after using jq, said: “it felt like the good parts of an unholy union between awk, bash, and web hipsters.”

More generally, jq can take in a stream of JSON and apply transformations to each record, extract fields, filter arrays, and even change the structure of the JSON data completely. Additionally, jq has support for loops, conditionals, reduce, and a lot of other common programming features.

You can find a jq tutorial here: https://stedolan.github.io/jq/tutorial/.

Ingesting Data with MemSQL

MemSQL is built around the idea of getting instant access to information. When framed against the backdrop of the ongoing data explosion, that means finding a solution for fast and reliable ingest.

For this purpose, MemSQL built Pipelines, which natively integrates together the Extract, Transform, and Load elements of most ingest problems. Pipelines supports streaming data from platforms such as Kafka and S3 without the need to connect extra tools.

A user operating MemSQL could implement their own custom transformer to unpack JSON and transform it to the appropriate form, but this would not be very extensible and wastes developer time.

MemSQL is always trying to release features that keep everything simple and fast, so for MemSQL 6, we shipped native support for jq as a JSON transformer. Users can now leverage features in jq to handle JSON data out of the box. This provides a performant and easy-to-use solution to handle one of the most common data formats today.

Consuming CloudTrail

MemSQL uses CloudTrail for auditing our AWS usage. An example CloudTrail record looks like this:

    "Records": [
            "eventVersion": "1.01",
            "userIdentity": {
                "type": "IAMUser",
                "principalId": "AIDAJDPLRKLG7UEXAMPLE",
                "arn": "arn:aws:iam::123456789012:user/Alice",
                "accountId": "123456789012",
                "accessKeyId": "AKIAIOSFODNN7EXAMPLE",
                "userName": "Alice",
 "sessionContext": { "attributes": { "mfaAuthenticated": "false", "creationDate": "2014-03-18T14:29:23Z" } } }, "eventTime": "2014-03-18T14:30:07Z", "eventSource": "cloudtrail.amazonaws.com", "eventName": "StartLogging", "awsRegion": "us-west-2", "sourceIPAddress": "", "userAgent": "signin.amazonaws.com", "requestParameters": { // arbitrary json }, "responseElements": { // arbitrary json }, "requestID": "cdc73f9d-aea9-11e3-9d5a-835b769c0d9c", "eventID": "3074414d-c626-42aa-984b-68ff152d6ab7" }, // additional records ] }

We originally hand wrote a transform in Go to parse the records and pull individual fields into their own columns. MemSQL also has a JSON datatype so we can store and query arbitrary values without having a sparse table with lots of columns that are usually null. The table we stored the results in had this schema:

CREATE TABLE cloudtrail (
       eventTime DATETIME,
       eventSource TEXT,
       eventName TEXT,
       awsRegion TEXT,
       sourceIPAddress TEXT,
       userAgent TEXT,
       requestID TEXT,
       eventID TEXT,
       userType TEXT,
       userARN TEXT,
       userAccountID TEXT,
       userAccessKeyID TEXT,
       userName TEXT,
       requestParameters JSON,
       responseElements JSON

The Go transform was around 300 lines of streaming logic, JSON parsing and error handling, and actually mapping the JSON values to columns. It does its job, but it’s highly specific to the expected structure of the incoming JSON. In contrast, here’s the same Pipeline (with the same table) using jq:

AS LOAD DATA S3 'my_company_logs_bucket/CloudTrailLogs'
CREDENTIALS '{"aws_access_key_id": "your_access_key_id", "aws_secret_access_key": "your_secret_access_key"}'
WITH TRANSFORM ('memsql://json', '', '-r ".Records[] | [.eventTime, .eventSource, .eventName, .awsRegion, .sourceIPAddress, .userAgent, .requestID, .eventID] + ((if .userIdentity != null then .userIdentity else {} end) | [.type, .arn, .accountId, .accessKeyId, .userName]) + [(.requestParameters|tostring), (.responseElements|tostring)] | @tsv"')
INTO TABLE cloudtrail

Here, the transform gathers the fields we want into a single JSON array, performs a couple conditional checks, then outputs a tab separated line for each log record. But we’re not just reading fields. We have to convert certain types (JSON to string, for example) and use conditionals to avoid trying to read properties of a null object. And while this looks like an eyeful, jq actually is very easy to learn for anyone familiar with basic UNIX-like tools.

The tech industry is moving really fast in a lot of directions right now. On the one hand, there’s been a big shift back to relational databases and structured data. On the other, the web is ever growing and constantly producing endless unstructured JSON data. With tools such as jq integrated into MemSQL, we can see that these two seemingly disparate paradigms can definitely coexist, and allow our users to reap the benefits of both.

json blog