Delivering Scalable Self-Service Analytics

Eric Hanson
Eric Hanson

Within 48 hours of launching Google Analytics as a free product, virtually all of Google’s servers crashed. Eric Schmidt called this Google’s “most successful disaster.” Why would a free product, whose hardware requirements melted down a datacenter, be worth it? Wesley Chan, the creator of Google Analytics, later said that, “Google Analytics generates about three billion dollars in extra revenue,” as noted in Steven Levy’s book, In The Plex. Google Analytics allowed Google’s customers to measure how good AdWords actually were, and showed them, with their own data, exactly how high they could increase bids and still make money. As Chan said, “know more, spend more.”

The full potential of such an offering comes when customers are allowed to arbitrarily segment and calculate flexible aggregates. To do that, they need to be able to query raw unaggregated data. This way your company does not have to guess what the customer wants to aggregate, as all choices remain available. If raw data access is not provided, then data must be precomputed, at least on some dimensions, which limits flexibility and the extent of the insights users can get from data.

Cost and technology constraints have led most companies to build analytics with this precompute approach for customers, because they need to serve analytics to many customers concurrently. The scale required to offer raw data access remained untenable. It was unthinkable to perform computations on raw data points on the scale of billions of rows per request concurrently for thousands of customers.

Today, MemSQL is changing that conventional wisdom and offering companies the ability to serve raw unaggregated data performance to a range of customers.

To explain this capability further, there are three major pieces of technology in this use case:

  • Scale-out
  • Columnstore query execution
  • Efficient data isolation

Scale-Out

Knowing system performance characteristics on a per-core basis, users can calculate how much compute and storage is needed to serve analytics at scale. Once that calculation is done, the key is to utilize a distributed system allowing enough dedicated compute power to meet demand. MemSQL can be used to run one to hundreds of nodes, which lets users scale the performance appropriately.

For example, if you have a million customers with one million data points each, you can say that you have one trillion data points. Imagine that at the peak, one thousand of those customers are looking at the dashboard simultaneously – essentially firing off one thousand concurrent queries against the database. Columnstore compression can store these trillion rows on a relatively small MemSQL cluster with approximately 20 nodes. Conservatively, MemSQL can scan 100 million rows per second per core, which mean that just one core can service 100 concurrent queries scanning one million rows each, and deliver sub-second results for analytical queries over raw data – below we will provide a benchmark for a columnstore query execution performance.

Columnstore Query Execution

A simple query over a columnstore table, such as a GROUP BY, can run at a rate of hundreds of millions to over a billion data points per second per core.

To demonstrate this, we loaded a public dataset about every airline flight in the United States from 1987 until 2015. As the goal was to understand performance per core, we loaded this into a single node MemSQL cluster running on a 4 core, 8 thread Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz.

To repeat this experiment, download the data using the following bash script:

mkdir csv
for s in `seq 1987 2015`
do
for m in `seq 1 12`
do
wget http://www.transtats.bts.gov/Download/On_Time_On_Time_Performance_${s}_${m}.zip
done
done

Create this table:

CREATE TABLE ontime (
 Year INT,
 Quarter INT,
 Month INT,
 DayofMonth INT,
 DayOfWeek INT,
 FlightDate Date,
 UniqueCarrier Varchar(100),
 AirlineID INT,
 Carrier Varchar(100),
 TailNum Varchar(100),
 FlightNum Varchar(100),
 OriginAirportID INT,
 OriginAirportSeqID INT,
 OriginCityMarketID INT,
 Origin Varchar(100),
 OriginCityName Varchar(100),
 OriginState Varchar(100),
 OriginStateFips Varchar(100),
 OriginStateName Varchar(100),
 OriginWac INT,
 DestAirportID INT,
 DestAirportSeqID INT,
 DestCityMarketID INT,
 Dest Varchar(100),
 DestCityName Varchar(100),
 DestState Varchar(100),
 DestStateFips Varchar(100),
 DestStateName Varchar(100),
 DestWac INT,
 CRSDepTime INT,
 DepTime INT,
 DepDelay INT,
 DepDelayMinutes INT,
 DepDel15 INT,
 DepartureDelayGroups Varchar(100),
 DepTimeBlk Varchar(100),
 TaxiOut INT,
 WheelsOff INT,
 WheelsOn INT,
 TaxiIn INT,
 CRSArrTime INT,
 ArrTime INT,
 ArrDelay INT,
 ArrDelayMinutes INT,
 ArrDel15 INT,
 ArrivalDelayGroups INT,
 ArrTimeBlk Varchar(100),
 Cancelled INT,
 CancellationCode Varchar(100),
 Diverted INT,
 CRSElapsedTime INT,
 ActualElapsedTime INT,
 AirTime INT,
 Flights INT,
 Distance INT,
 DistanceGroup INT,
 CarrierDelay INT,
 WeatherDelay INT,
 NASDelay INT,
 SecurityDelay INT,
 LateAircraftDelay INT,
 FirstDepTime Varchar(100),
 TotalAddGTime Varchar(100),
 LongestAddGTime Varchar(100),
 DivAirportLandings Varchar(100),
 DivReachedDest Varchar(100),
 DivActualElapsedTime Varchar(100),
 DivArrDelay Varchar(100),
 DivDistance Varchar(100),
 Div1Airport Varchar(100),
 Div1AirportID INT,
 Div1AirportSeqID INT,
 Div1WheelsOn Varchar(100),
 Div1TotalGTime Varchar(100),
 Div1LongestGTime Varchar(100),
 Div1WheelsOff Varchar(100),
 Div1TailNum Varchar(100),
 Div2Airport Varchar(100),
 Div2AirportID INT,
 Div2AirportSeqID INT,
 Div2WheelsOn Varchar(100),
 Div2TotalGTime Varchar(100),
 Div2LongestGTime Varchar(100),
 Div2WheelsOff Varchar(100),
 Div2TailNum Varchar(100),
 Div3Airport Varchar(100),
 Div3AirportID INT,
 Div3AirportSeqID INT,
 Div3WheelsOn Varchar(100),
 Div3TotalGTime Varchar(100),
 Div3LongestGTime Varchar(100),
 Div3WheelsOff Varchar(100),
 Div3TailNum Varchar(100),
 Div4Airport Varchar(100),
 Div4AirportID INT,
 Div4AirportSeqID INT,
 Div4WheelsOn Varchar(100),
 Div4TotalGTime Varchar(100),
 Div4LongestGTime Varchar(100),
 Div4WheelsOff Varchar(100),
 Div4TailNum Varchar(100),
 Div5Airport Varchar(100),
 Div5AirportID INT,
 Div5AirportSeqID INT,
 Div5WheelsOn Varchar(100),
 Div5TotalGTime Varchar(100),
 Div5LongestGTime Varchar(100),
 Div5WheelsOff Varchar(100),
 Div5TailNum Varchar(100),
 key (AirlineID) using clustered columnstore
);

Then load data into the table:

load data infile '/home/memsql/csv/*' into table ontime fields terminated by ',' enclosed by '"' lines terminated by ',\n' ignore 1 lines;

Once the data is loaded, run a simple group by command. The following query performs a full table scan:

SELECT OriginCityName, count(*) AS flights
FROM ontime GROUP BY OriginCityName ORDER BY flights DESC LIMIT 20;

On a machine with 4 cores, a 164 million row dataset query runs in 0.04 seconds which is 1 billion rows per second per core. No, that’s not a typo. That’s a billion rows per second per core. More complex queries will consume more CPU cycles, but with this level of baseline performance there is a lot of room across a cluster of 8, 16, or even hundreds of machines to handle multi-billion row datasets with response times under a quarter of a second. At that speed, queries appear to be instantaneous to users, leading to great user satisfaction.

Try this example using MemSQL 6. New vectorized query execution techniques in MemSQL 6, using SIMD and operations directly on encoded (compressed) data, make this speed possible.

Efficient Data Isolation Per Customer

Data warehouses such as Redshift and Big Query support large scale, but may not sufficiently isolate different queries in highly concurrent workloads. On top of that, both have a substantial fixed overhead on a per query basis. Redshift in particular does not support many concurrent queries: http://docs.aws.amazon.com/redshift/latest/dg/cm-c-defining-query-queues.html.

Depending on the analytical requirements, MemSQL allows for an ordered and partitioned physical data layout to ensure only scanned data belongs to a single customer. In our example, the columnstore was clustered on AirlineID.

MemSQL supports clustered columnstore keys that allow global sorting of columnstore tables. In this case, if you have a predicate on AirlineID a user will only scan the subset of data belonging to that airline. This allows MemSQL to deliver on very high concurrency (in the thousands of concurrent queries) with each query scanning and aggregating millions of data points.

More on Query Execution

At MemSQL, we are continuously innovating with new query processing capabilities. This is a list of recent innovations in our shipping product: https://docs.memsql.com/release-notes/latest/60-release-notes/.

Bringing it All Together

Going back to our original example, though our dataset is one trillion rows, because of the clustered columnstore key, each customer only needs to scan through one million rows. For a simple query like the above, scanning 500 million rows per second per core means that a single CPU core could support 500 concurrent queries and deliver sub-second performance.

To recreate the work mentioned in this blog, try out MemSQL 6: memsql.com/download

 

Practical example of what's new with MemSQL 6 beta 2.