Engineering

Scaling Distributed Joins

Most users of SQL databases have a good understanding of the join algorithms single-box databases employ. They understand the trade-offs and uses for nested loop joins, merge joins, and hash joins. Distributed join algorithms, on the other hand, tend not to be as well understood. Distributed databases need to make a different set of trade-offs to account for table data that is spread across a cluster of machines instead of stored on a single machine, as in a traditional database. Because these...


JSON Streaming And The Future Of Data Ingest

As businesses continue to become technology-focused, data is more prevalent than ever. In response, companies have adopted a handful of common formats to help manage this explosive growth in data. Data Formats Today: For a long time, XML has been the giant in terms of data interchange formats. Recently, JSON has become popular, catching a wave of interest due to its lightweight streaming support and general ease of use. JSON is a common format for web applications, logging, and geographical...


Running Stored Procedures on Distributed Systems with MemSQL 6

Today we’re announcing the general availability of MemSQL 6. This is a big milestone for the product, which comes with new features to help customers get even more value out of MemSQL. The latest release includes breakthrough query performance, enhanced online operations, and extensibility. In this blog, we’ll take a deeper look at the new Extensibility features. Why did you add Extensibility to MemSQL 6? The Extensibility feature was built based on market demand, and enables people to move...


Psyduck: The MemSQL Journey to Containers

One of the main themes at DockerCon 2017 was the challenge of migrating legacy applications to containers. At MemSQL, we’re early adopters. We are already into our third year of running Docker at scale in production for our distributed software testing regime, where the performance, isolation, and cost benefits of containers are very attractive. The Challenge: Before I take you through our journey to containers, let me start by outlining some of the general challenges of testing a distributed...


The Curious Case of Thread Group Identifiers

At MemSQL, we are out to build awesome software and we’re always trying to solve hard problems. A few days ago, I uncovered a cool Linux mystery with some colleagues and fixed it. We thought sharing that experience might benefit others. The scene of the crime: While developing an internal tool to get stack traces, we decided to use the SYS_tgkill Linux system call to send signals to specific threads. The tgkill syscall sends a signal to a specific thread based on its “thread group”...


Arrays - A Hidden Gem in MemSQL

Released this March, MemSQL 6 Beta 1 introduced MemSQL Procedural SQL (MPSQL). MPSQL supports the creation of User-Defined Functions (UDFs), Stored Procedures (SPs), Table-Valued Functions (TVFs), and User-Defined Aggregate Functions (UDAFs). A Hidden Gem: Array Types. There’s a hidden gem in MemSQL 6 Beta 1 that we didn’t document at first: array types! These make programming much more convenient. Since we compile your extensions to machine code, the performance is fantastic. And you...


ArcGIS, Spark & MemSQL Integration

This is a guest post by Mansour Raad of Esri. We were fortunate to catch up with him at Strata+Hadoop World San Jose. This post is replicated from Mansour’s Thunderhead Explorer blog. ArcGIS, Spark & MemSQL Integration: Just got back from the fantastic Strata + Hadoop 2017 conference, where the topics ranged from BigData and Spark to lots of AI/ML, and not so much on Hadoop explicitly, at least not in the sessions that I attended. I think that is why the conference is renamed Strata + Data from...


Everything We’ve Known About Data Movement Has Been Wrong

Data movement remains a perennial obstacle in systems design. Many talented architects and engineers spend significant amounts of time working on data movement, often in the form of batch Extract, Transform, and Load (ETL). In general, batch ETL is the process everyone loves to hate, or put another way, I’ve never met an engineer happy with their batch ETL setup. In this post, we’ll look at the shift from batch to real time, the new topologies required to keep up with data flows, and the...


MemSQL Opens New Office in Second Tech Hub: Seattle, WA

Behind the scenes of the world’s leading companies in finance, retail, media, and energy, sits MemSQL – the operational data warehouse powering real-time data ingest and analytics. At MemSQL, hiring exceptional talent drives innovation in real-time technology and enables us to advance the state of the art in databases. We hire top engineers from prestigious universities such as MIT, Stanford, and Carnegie Mellon University, as well as companies like Facebook, Microsoft, Oracle and...


MemSQL Pipelines: Real-Time Data Ingestion with Exactly-Once Semantics

Today we launched MemSQL 5.5 featuring MemSQL Pipelines, a new way to achieve maximum performance for real-time data ingestion at scale. This implementation enables exactly-once semantics when streaming from message brokers such as Apache Kafka. An end-to-end real-time analytics data platform requires real-time analytical queries and real-time ingestion. However, it is rare to find a data platform that satisfies both of these requirements. With the launch of MemSQL Pipelines as a native feature...