JSON as Column Type
This post originally appeared on the Novus Tech Blog

More and more, companies are migrating from NoSQL databases back to relational, ACID-compliant databases that also offer high availability and horizontal scalability – aka “NewSQL.” Novus Partners is no exception, having recently completed our migration from MongoDB to MemSQL.

Of course, NoSQL developers often come to depend on rapid development cycles and “schema-less” data models, which can make the transition back to relational – including the prospect of devising a relational model for your schemaless data – daunting. It turns out, however, that MemSQL offers a feature that enables you to keep many of those benefits and ease the NoSQL-to-NewSQL transition: the JSON column type.

In this article, we’ll get up and running quickly with MemSQL, and then immediately learn about its native JSON support and what it can do.

Getting Started with MemSQL

Although MemSQL is a proprietary technology (binary download only), they recently released a free Community Edition that is perfect for developer testing. All you need is a 64-bit Linux environment and the MySQL client (MemSQL made a strategic decision to implement the MySQL client protocol – “bug-for-bug”).

After downloading and installing the Community Edition and the MySQL client, you will want to alias memsql to the MySQL client command as shown below (I keep this alias in my ~/.local.bash):
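A one-line alias along these lines does the trick; the host, port, and user here are assumptions that will vary with your setup:

```bash
# ~/.local.bash: MemSQL speaks the MySQL wire protocol,
# so the stock mysql client connects to it directly.
alias memsql='mysql -h 127.0.0.1 -P 3306 -u root'
```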

If you already know MySQL, you already know most of MemSQL’s commands and syntax. MemSQL adds some proprietary commands and syntax, mostly related to its replication and sharding functionality, which we won’t discuss here. If you want to learn more, you can find all of their documentation online.

Let’s start out by inspecting the default databases, and then define our own “test” database where we can explore MemSQL’s JSON support:
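A session along these lines (the database name test is just a convention) gets us a scratch database to work in:

```sql
SHOW DATABASES;        -- inspect the built-in databases
CREATE DATABASE test;
USE test;
```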

Prototyping a Schema

Perhaps in our existing NoSQL database, we had a document collection named tasks that contained, among other things, unstructured data related to our distributed compute application. Our developers added and removed fields from their Task objects as needed for different compute jobs, which was great from a development perspective because it meant rapid development cycles without the need for frequent database schema changes and migrations.

Fortunately, we don’t have to leave that agility behind when we transition from NoSQL back to MemSQL’s relational model.

The simplest version of our new tasks table has two columns: a bigint primary key and the column for JSON data, which has the type… JSON
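A minimal sketch of such a table; the column names task_id and task are assumptions carried through the examples below:

```sql
CREATE TABLE tasks (
    task_id BIGINT PRIMARY KEY,  -- surrogate key for each task
    task JSON NOT NULL           -- the document itself
);
```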

Now, let’s insert some data into this table. The task documents themselves are just JSON literals inside single-quotes, like any other data type:
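Some illustrative documents (the field names and values here are invented for the examples that follow):

```sql
INSERT INTO tasks (task_id, task) VALUES
  (1, '{"user": "nzucker", "uid": {"clientId": 101, "which": "ingest"}}'),
  (2, '{"user": "nzucker", "uid": {"clientId": 102, "which": "compute"}}'),
  (3, '{"user": "jsmith",  "uid": {"clientId": 101, "which": "report"}}');
```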

An important note: MemSQL will not let you insert invalid JSON documents. Attempting to do so yields an error:
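For example, a document with an unbalanced brace is rejected outright (the exact error message varies by version):

```sql
-- Missing the closing brace: MemSQL rejects the row
-- with an "invalid JSON" error instead of storing it.
INSERT INTO tasks (task_id, task) VALUES (4, '{"user": "nzucker"');
```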

JSON Field Predicates

Suppose we want to select rows based on a specific JSON document field. Even though the MemSQL JSON column type is something like a LONGTEXT column in terms of data format and limitations, the database query engine understands the JSON specification and how to navigate the document tree.

For example, we can select all the tasks that I previously inserted having a user of ‘nzucker’:
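With the sample documents from earlier, the query looks like this:

```sql
SELECT * FROM tasks t
WHERE t.task::$user = 'nzucker';
```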

The clause t.task::$user navigates to the user field of the JSON document, with the $ prefix ensuring that the value is evaluated as a STRING data type (it is actually short-hand for the JSON_EXTRACT_STRING function).

Naturally, MemSQL supports predicates that use nested JSON fields. For example, we can also query by the uid.clientId field:
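For instance, with the sample documents above:

```sql
SELECT * FROM tasks t
WHERE t.task::uid::%clientId = 101;
```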

The % prefix on ::%clientId ensures that the field is interpreted as a double (the only numeric data type supported by the JSON standard), which is important if you are doing numeric comparisons in your queries.

You can also select “sub-documents” from JSON columns using a similar syntax:
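For example, extracting the entire uid sub-document:

```sql
SELECT t.task::uid FROM tasks t;  -- returns each uid object as JSON
```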

Because we’re extracting JSON documents from our JSON column, no $ or % prefix is required in the task::uid clause.

Persisted Columns

MemSQL also supports persisted columns (also known as “computed columns”) extracted from JSON document fields. This feature is very convenient if you find yourself repeatedly querying the same deeply-nested JSON document fields.
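A sketch of the DDL, reusing the tasks table from earlier (the computed-column syntax may vary slightly between MemSQL versions):

```sql
CREATE TABLE tasks (
    task_id BIGINT PRIMARY KEY,
    task JSON NOT NULL,
    -- client_id is computed from the document and stored with the row
    client_id AS task::uid::%clientId PERSISTED DOUBLE
);
```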

Here we persisted the client_id field, extracted from the task object. Note that if our task documents have a suitable primary key field (say, a field named _id), we could extract that field to populate the task_id column.

Now let’s select using this persisted column:
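The persisted column behaves like any ordinary column in a predicate:

```sql
SELECT * FROM tasks WHERE client_id = 101;
```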

If you plan your persisted columns in advance, you will make life easier for developers or analysts who may not be familiar with the JSON structure. Also, you can create indexes on these JSON-derived persisted columns, which can dramatically improve query performance.
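For instance, an ordinary secondary index over the derived column (a sketch; the exact index DDL depends on your table type):

```sql
CREATE INDEX idx_tasks_client_id ON tasks (client_id);
```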

Updating JSON Document Fields “In-Place”

If you have used JSON persistence from an object-oriented language, you might have written code like the following:

  1. Fetch the entire JSON document out of the store.
  2. Deserialize the JSON document into an object.
  3. Update a single field on the object.
  4. Serialize the entire object back to the store as JSON.

That’s quite a bit of data transfer just to manipulate a single field. Well, since we can select database rows by JSON fields, why not update individual fields of JSON documents as well? This too is possible:
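For example, reusing the uid.which field from our sample documents, a single-field update is one statement over the wire:

```sql
UPDATE tasks
SET task::uid::$which = 'recompute'  -- rewrite one field inside the document
WHERE task_id = 1;
```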

It’s worth mentioning that MemSQL doesn’t support JSON field-level validation, so “in-place” updates such as this run the risk of generating data that violates your domain model. For example, if the task::uid::which field is required, but you set task::uid::$which = NULL, your application may encounter errors due to the missing field. So, use this feature with caution (and perhaps a robust set of integration tests).

Manipulating JSON Arrays

Another great feature of MemSQL’s JSON support is the ability to manipulate JSON arrays in-place, using the JSON_ARRAY_PUSH_<type> family of functions (for example, JSON_ARRAY_PUSH_JSON).

Let’s continue with our “task” documents example by defining their domain model in Scala as follows:
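A plausible reconstruction of that model, based on the fields used throughout this article (all names are assumptions):

```scala
// Hypothetical domain model for the "task" documents.
case class TaskUid(clientId: Double, which: String)

// One entry in the task's "history" array.
case class StatusUpdate(status: String, at: Long)

case class Task(
  user: String,
  uid: TaskUid,
  history: List[StatusUpdate] = Nil
)
```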

As with any other JSON field, MemSQL array manipulation functions enable us to add entries to a JSON array “in-place.” Using our example data set, we first update one of our documents to have a field called “history,” initialized with an empty JSON array:
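Initializing the array is itself an in-place update; with no $ or % prefix, the right-hand side is treated as JSON:

```sql
UPDATE tasks
SET task::history = '[]'  -- add an empty "history" array to the document
WHERE task_id = 1;
```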

We can then push into the array and observe the results:
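A sketch of the push and a follow-up select, using a hypothetical status entry:

```sql
UPDATE tasks
SET task::history = JSON_ARRAY_PUSH_JSON(task::history,
      '{"status": "RUNNING", "at": 1428000000}')
WHERE task_id = 1;

SELECT task::history FROM tasks WHERE task_id = 1;
```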

Finding the Last Element of the Array

Perhaps we want to find the most recent update in our task document’s history array. One (inefficient) approach is to extract the entire array, deserialize it into an array object, and find the last element:
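That is, pull the whole array back to the client and take the last element application-side:

```sql
-- Transfers the entire history array over the wire,
-- even though we only want its final entry.
SELECT task::history FROM tasks WHERE task_id = 1;
```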

This certainly works in a pinch, but a better solution is to obtain the last element – representing the most recent status – directly in the SQL. Thank you to Justin Fu at MemSQL Support, who provided this solution:
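The idea (a sketch; I’m assuming MemSQL’s JSON_LENGTH and index-based JSON_EXTRACT_JSON behave as in recent versions) is to index the array at its length minus one:

```sql
SELECT JSON_EXTRACT_JSON(task::history,
       JSON_LENGTH(task::history) - 1) AS last_entry
FROM tasks
WHERE task_id = 1;
```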

Performance Considerations

A quick word about performance: JSON is a plain-text-based data format and therefore will always be slower than an optimized, binary format (see Thrift and Protocol Buffers). At Novus, our migration path was to continue using Salat for Scala case class-to-JSON serialization, which in turn uses json4s for JSON AST operations. For most cases, this was painless and Just Worked.

However, in some cases, we encountered human-perceptible performance degradation after moving from MongoDB to MemSQL. Typically, this occurred when attempting to deserialize thousands of JSON documents into Scala objects while processing a SELECT result. Although many developers know Mongo as “just JavaScript,” it’s actually not: the storage and wire format is BSON, a carefully planned and optimized binary encoding. Given that our application was now transferring and parsing JSON text rather than BSON, this slow-down was completely understandable.

So, the trade-off was to sacrifice human-readability for performance, reverting to a binary format for the performance-sensitive functionality. We successfully used Twitter Chill for this purpose, storing the objects in a BLOB database column. Another option is to bite the bullet and devise a relational model for your object model, particularly if the design is stable.

Either way, be sure to focus on functionality and domain modeling first, before turning to performance optimizations.

Thank you to Carlos Bueno (@archivd) and the rest of the MemSQL team for feedback on earlier drafts of this article.