Themes from Big Data East

We spent last week at the Big Data Innovation Summit in Boston. Big data trade shows, particularly those that bring together sophisticated practitioners and people seeking new solutions, are always a perfect opportunity to take the pulse of the market.

Here are the five big data themes we encountered over the course of two days.

Real-Time Over Resuscitated Data

The action is in real time, and trade show discussions often gravitate to deriving immediate value from real-time data. All of the megatrends apply: social, mobile, IoT, and cloud are pushing startups and global companies alike to operate instantly in a digital, connected world.

While there has been some interest in resuscitating data from Hadoop with MapReduce or SQL on Hadoop, those directions are changing. For example, Cloudera recently announced the One Data Platform Initiative, indicating a shift away from MapReduce:

this initiative will enable [Spark] to become the successor to Hadoop’s original MapReduce framework for general Hadoop data processing

With Spark’s capabilities for streaming and in-memory processing, we are likely to see a focus on those real-time workflows. This is not to say that Spark won’t be used to explore expansive historical data throughout Hadoop clusters.

But judge your own predilection for real-time and historical data. Yes, both are important, but human beings tend to have an insatiable desire for the now.

Data Warehousing is Poised for Refresh

When the last wave of data warehousing innovation hit the mainstream, there was a data M&A spree that started with SAP’s acquisition of Sybase in May 2010. Within 10 months, Greenplum was acquired by EMC, Netezza by IBM, Vertica by HP, and Aster by Teradata.

Today, customers are suffering economically with these systems, which have become expensive to maintain and do not deliver the instant results companies now expect.

Applications like real-time dashboards push conventional data warehousing systems beyond their comfort zone, and companies are seeking alternatives.

Getting to ETL Zero

If there is a common enemy in the data market, it is ETL, or the Extract, Transform, and Load process. We were reminded of this when Riley Newman from Airbnb mentioned that

ETL was like extracting teeth…no one wanted to do it.

Ultimately, Riley did find a way to get it done by shifting ETL from a data science to a data engineering function (see final theme below), but I have yet to meet a person who is happy with ETL in their data pipeline.

ETL pain is driving new solution categories like Hybrid Transactional and Analytical Processing, or HTAP for short. In HTAP solutions, transactions and analytics converge on a single data set, often enabled by in-memory computing. HTAP capabilities are at the forefront of new digital applications with situational awareness and real-time interaction.
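To make the idea of transactions and analytics converging on a single data set concrete, here is a minimal sketch in Python. It uses the standard library's sqlite3 module with an in-memory database purely as a stand-in; SQLite is not an HTAP system, and the orders table and the record_order and revenue_by_region helpers are hypothetical names chosen for illustration. The point is only that the analytical query reads the same live table the transactions write to, with no extract or load step in between.

```python
import sqlite3

# Assumption: a single in-memory SQLite database stands in for an HTAP store.
# The same table serves transactional writes and analytical reads, so there
# is no separate warehouse and no ETL hop between the two.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)


def record_order(region, amount):
    """Transactional path (hypothetical helper): commit a single order."""
    with conn:  # commits on success, rolls back if the insert raises
        conn.execute(
            "INSERT INTO orders (region, amount) VALUES (?, ?)",
            (region, amount),
        )


def revenue_by_region():
    """Analytical path (hypothetical helper): aggregate the live table."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    )
    return dict(rows.fetchall())


record_order("east", 120.0)
record_order("west", 80.0)
print(revenue_by_region())  # {'east': 120.0, 'west': 80.0}

# A new transaction is visible to the very next analytical query.
record_order("east", 30.0)
print(revenue_by_region())  # {'east': 150.0, 'west': 80.0}
```

In a conventional pipeline, the second query would not see the new order until the next ETL batch had copied it into the warehouse.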

The Matrix Dashboard is Coming

Of course, all of these real-time solutions need dashboards, and dashboards need to be seen. Hiperwall makes a helpful solution for tying multiple monitors together into a single, highly configurable screen. The dashboards of the future are here!

Hiperwall Dashboards

Emerging Data Science Organizational Structures

Organizational structures for data science are still emerging. Riley Newman from Airbnb shared his view of the data stack:

The Data Stack

Visualization | Sustained narrative around execution

Experimentation | Disentangling causality

Data Products | Machine learning, recommendation algorithms

Analysis | Understanding user behaviors and business drivers

ETL | Curation of client data for analysis / reporting

Infrastructure | Stability of warehouse systems and tools

Then he identified where the data science team should be spending the bulk of its time, and defined an organizational structure to fill in additional areas with complementary resources. Infrastructure is largely handled in the cloud, and ETL was assigned to a dedicated data engineering team.

The Data Stack Resources

Next Up: Strata + Hadoop World New York 2015

Next on our fall show circuit is Strata + Hadoop World in New York, September 29th to October 1st. Visit us to catch the latest MemSQL details. Hope to see you there!