apache spark

Durable Storage for Real-Time Analytics with MemSQL and Spark

Apache Spark has made a name for itself as a powerful data processing engine for transforming large datasets in a swift, distributed manner. After using Spark to complete such transformations, you often want to store your data in a persistent and efficient format for long-term access. The common solution of storing data in HDFS solves the issue of persistence, but suffers from efficiency issues as a result of the HDFS disk-based architecture. The MemSQL Spark Connector solves both of these issues by...


machine learning at scale

Video: Scoring Machine Learning Models at Scale

At Strata+Hadoop World, MemSQL Software Engineer John Bowler shared two ways of building production data pipelines in MemSQL: 1) using Spark for general-purpose computation, and 2) through a transform defined in a MemSQL pipeline for general-purpose computation. In the video below, John runs a live demonstration of MemSQL and Apache Spark for entity resolution and fraud detection across a dataset composed of a hundred thousand employees and fifty million customers. John uses MemSQL and writes a Spark job...


Video: Building the Ideal Stack for Real-Time Analytics

Building a real-time application starts with connecting the pieces of your data pipeline. To make fast and informed decisions, organizations need to rapidly ingest application data, transform it into a digestible format, store it, and make it easily accessible, all at sub-second speed. A typical real-time data pipeline is architected as follows: Application data is ingested through a distributed messaging system to capture and publish feeds. A transformation tier is called to distill...


Manage Case Study

How Manage Accelerated Data Freshness by 10x

Success in the mobile advertising industry is achieved by delivering contextual ads in the moment. The faster and more personalized a display ad, the better. Any delay in ad delivery means lost bids, revenue, and ultimately, customers. Manage, a technology company specializing in programmatic mobile marketing and advertising, helps drive mobile application adoption for companies like Uber, Wish, and Amazon. In a single day, Manage generates more than a terabyte of data and processes more than 30...


PowerStream

Using MemSQL and Spark for Machine Learning

At Spark Summit in San Francisco, we highlighted our PowerStream showcase application, which processes and analyzes data from over 2 million sensors on 200,000 wind turbines installed around the world. We sat down with one of our PowerStream engineers, John Bowler, to discuss his work on our integrated MemSQL and Apache Spark solutions. What is the relationship between MemSQL and Spark? At its core, MemSQL is a database engine, and Spark is a powerful option for writing code to transform data....


MemSQL Guide to Spark Summit 2016

Spark Summit 2016 kicks off this week with more than 90 sessions and five tracks to choose from in the heart of San Francisco. The three-day marathon of learning, which includes an entire day dedicated to Spark Training, attracts more than 2,500 engineers, business professionals, scientists, and analytics enthusiasts from across the country. Website: https://spark-summit.org/2016. Venue: Hilton San Francisco, 333 O’Farrell St, San Francisco, CA 94102 USA. Throughout the show, speakers will address the different...


Spark Summit East 2016

Real-Time Solutions Take Center Stage at Spark Summit East 2016

We spent last week in New York at Spark Summit East talking with the visionaries and data architects using Apache Spark. PowerStream Demo: At the show, we introduced PowerStream, an Internet of Things (IoT) showcase application with visualizations and alerts based on data from 2 million sensors across global wind farms. PowerStream ingests that data and provides actionable insights in real time, giving users a glimpse of how the future of sustainability can be fully realized by adapting data to...


Streamliner Python

Introducing a Performance Boost for Spark SQL, Plus Python Support

This month’s MemSQL Ops release includes performance features for Streamliner, our integrated Apache Spark solution that simplifies creation of real-time data pipelines. Specific features in this release include the ability to run Spark SQL inside of the MemSQL database, in-browser Python programming, and NUMA-aware deployments for MemSQL. We sat down with Carl Sverre, MemSQL architect and technical lead for Ops development, to talk about the latest release. Q: What’s the coolest thing...


Top Spark Summit Questions

Top 5 Questions Answered at Spark Summit

The MemSQL team enjoyed sponsoring and attending Spark Summit last week, where we spoke with hundreds of developers, data scientists, and architects all getting a better handle on modern data processing technologies like Spark and MemSQL. After a couple of days on the expo floor, I noticed several common questions. Below are some of the most frequent questions and answers exchanged in the MemSQL booth. 1. When should I use MemSQL? MemSQL shines in use cases requiring analytics on a changing...


Enterprise Apache Spark

Harnessing the Enterprise Capabilities of Spark

As more developers and data scientists try Apache Spark, they ask questions about persistence, transactions and mutable data, and how to deploy statistical models in production. To address some of these questions, our CEO Eric Frenkiel recently wrote an article for Data Informed explaining key use cases that integrate MemSQL and Spark to drive concrete business value. The article explains how you can combine MemSQL and Spark for applications like stream processing, advanced analytics, and...