July 24, 2018

Introducing SingleStore Pipelines to Stored Procedures

Bryan Offutt

Product Manager at SingleStore

Introducing SingleStore Pipelines to Stored Procedures

With the release of SingleStoreDB Self-Managed 6.5, the SingleStore team is announcing our latest innovation in data ingestion: Pipelines to stored procedures. By marrying the speed and exactly once semantics of SingleStore Pipelines with the flexibility and transactional guarantees of stored procedures, we’ve eliminated the need for complex transforms and unlocked the doors to a variety of new, groundbreaking use cases. This blog post will walk you through how to use Pipelines to stored procedures to break data streams up into multiple tables, join them with existing data, and update existing rows, but this is just the tip of the iceberg of what is possible with this exciting new functionality.

How Does it Work?

Extract → Transform (Optional) → Stored Procedure OR Traditional Single Table Load

Pipelines to stored procedures augments the existing SingleStore Pipelines data flow by providing the option to replace the default Pipelines load phase with a stored procedure. The default Pipelines load phase only supports simple insertions into a single table, with the data being loaded either directly after extraction or following an optional transform. Replacing this default loading phase with a stored procedure opens up the possibility for much more complex processing, providing the ability to insert into multiple tables, enrich incoming streams using existing data, and leverage the full power of SingleStore Extensibility.

Pipelines Transforms vs. Stored Procedures

Those familiar with SingleStore Pipelines might be wondering, “I already have a transform in my pipeline….should I use a stored procedure instead?” Leveraging both a transform and a stored procedure in the same Pipeline allows you to combine the third-party library support of a traditional transform alongside the multi-insert and data-enrichment capabilities of a stored procedure.

Traditional Transform	Stored Procedure
Pros - Can be written in any language (BASH, python, Go, etc.) - Can leverage third-party libraries - Easily portable to systems outside of SingleStore Cons - More difficult to debug and manage - Less performant for most use cases - Only allows inserting into a single table - No transactional guarantees	Pros - Insert into multiple tables - SingleStore native - High performance querying of existing SingleStore tables - Transactional guarantees Cons - SingleStore-specific language - No access to third-party libraries for more complex transformations

Example Use Cases

Insert Into Multiple Tables

Pipelines to stored procedures now enables you to insert data from a single stream into multiple tables in SingleStore. Consider the following stored procedure:

CREATE PROCEDURE proc(batch query(tweet json))
AS
BEGIN
    INSERT INTO tweets(tweet_id, user_id, text)
      SELECT tweet::tweet_id, tweet::user_id, tweet::text
      FROM batch;
    INSERT INTO users(user_id, user_name)
      SELECT tweet::user_id, tweet::user_name
      FROM batch;
END

This procedure takes in tweet data in JSON format, separates out the tweet text from the user information, and inserts them into their respective tables in SingleStore. The entire stored procedure is wrapped in a single transaction, ensuring that data is never inserted into one table but not the other.

Load Into Table and Update Aggregation

There are many use cases where maintaining an aggregation table is a more performant and sensible alternative to running aggregation queries across raw data. Pipelines to stored procedures allows you to both insert raw data and update aggregation counters using data streamed from any Pipelines source. Consider the following stored procedure:

CREATE PROCEDURE proc(batch query(tweet json))
AS
BEGIN
    INSERT INTO tweets(tweet_id, user_id, text)
      SELECT tweet::tweet_id, tweet::user_id, tweet::text
      FROM batch;
    INSERT INTO retweets_counter(user_id, num_retweets)
      SELECT tweet::retweeted_user_id, 1
      FROM batch
      ON DUPLICATE KEY UPDATE num_retweets = num_retweets + 1
      WHERE tweet::retweeted_user_id is not null;
END

This procedure takes in tweet data as JSON, inserts the raw tweets into a “tweets” table, and updates a second table which tracks the number of retweets per user. Again, the transactional boundaries of the stored procedure ensures that the aggregations in retweets_counter are always in sync with the raw data in the tweets table.

Use Existing Data to Enrich a Stream

It’s also possible to use a stored procedure to enrich an incoming stream using data that already exists in SingleStore. Consider the following stored procedure, which uses an existing SingleStore table to join an incoming IP address batch with existing geo data about its location:

CREATE PROCEDURE proc(batch query(ip varchar, ...))
AS
BEGIN
    INSERT INTO t
      SELECT batch.*, ip_to_point_table.geopoint
      FROM batch
      JOIN ip_to_point_table
      ON ip_prefix(ip) = ip_to_point_table.ip;
END

These use cases only scratch the surface of what is possible using Pipelines to stored procedures. By joining the speed of Pipelines with the flexibility of stored procedures, SingleStoreDB Self-Managed 6.5 gives you total control over all of your streaming data. For more information on the full capabilities of SingleStore Extensibility, please see our documentation.

Product