What is Machine Learning?

Machine learning (ML) is a method of analyzing data using an analytical model that is built automatically, or “learned”, from training data. The idea is that the model improves as you feed it more data points, so your algorithm gets better over time.

Machine learning has two distinct steps: training and operationalization. Training takes a data set you know a lot about (known as a training set), explores it to find patterns, and develops your model. Once you have developed your model, you move on to operationalization. This is where you deploy the model to a production system, where it scores new data and the system returns the results to the user.
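To make the two steps concrete, here is a minimal sketch in Python. It is an illustration only: the `train` and `score` functions, the toy data, and the choice of a one-variable least-squares fit are all assumptions made for this example, not part of any MemSQL workflow.

```python
from statistics import mean

def train(xs, ys):
    """Training step: fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"slope": slope, "intercept": y_bar - slope * x_bar}

def score(model, x):
    """Operationalization step: apply the learned model to a new data point."""
    return model["slope"] * x + model["intercept"]

# Hypothetical training set: inputs paired with known outcomes.
training_x = [1.0, 2.0, 3.0, 4.0]
training_y = [3.0, 5.0, 7.0, 9.0]

model = train(training_x, training_y)  # done once, offline
prediction = score(model, 10.0)        # done repeatedly, on new data
```

In a real system the model would be far richer, but the split is the same: training runs occasionally over known data, while scoring runs continuously over incoming data.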

How to Get Started with Machine Learning

To accomplish these steps, you will commonly make use of several tools. You need a tool to bring the data in, a tool to cleanse the data, libraries to develop the calculations, and a platform for testing the algorithm. Once you are ready to operationalize the model, you need a compatible platform to run your model and an application to process and/or display the results.

Using MemSQL for Machine Learning Operationalization

MemSQL is a distributed database platform that excels at the kind of calculations typically found in a machine learning model. MemSQL is a great environment for storing training data, as the user can run it in a small configuration, such as a single node on a laptop. Because MemSQL is compatible with MySQL, a data scientist could also use a MySQL instance for the algorithmic development.

Where MemSQL really shines is the operationalization of the model. The key requirements for effectively operationalizing an algorithm are the following:

  • Fast data ingest
  • Fast calculations
  • Scale-out to handle growth
  • Compatibility with existing libraries
  • A powerful programming language to express the algorithm
  • Operational management capabilities to ensure data durability, availability, and reliability

MemSQL is a perfect fit for these requirements and can be used in an ML solution in three different ways.

Three Ways to Operationalize ML with MemSQL

Calculations Outside the Database

MemSQL can be a fast service layer that both stores the raw data and serves the results to the customer. This is useful when the creation of the model is done with existing infrastructure, such as a Spark cluster.

A real-world example of this is a large energy company that is using MemSQL for upstream production processing. The company has a set of oil drills all around the world. The drills are expensive to fix, because of the cost of parts and the cost of labor (as the drills are often in remote locations). Keeping a drill from breaking results in dramatic cost savings. The drills are equipped with numerous sensors (collecting heat, vibration, direction, etc.) that continuously send data back to a Kafka queue. Data is pulled from this queue into a Spark cluster, where a PMML (Predictive Model Markup Language) model calculates the health of the drill. The scored data then lands in MemSQL and is served to the drill operators in real time. This allows the operators to slow down or reposition the drill if it is in danger of damage. Having a data platform that can continuously ingest the scored data at high throughput, while still allowing the models to be run, is critical to delivering this scenario. Because MemSQL has a modern scale-out architecture and a sophisticated query processor, it can handle data processing better than any other database in the industry.

Calculations on Ingest

Some customers don’t want to maintain a separate calculation cluster, but still want to make use of existing statistical or ML libraries. In this case, they can use the MemSQL Pipelines feature to easily ingest data into the database. Customers can then execute the ML scoring algorithm as data arrives, using the transform capability of Pipelines. A transform allows customers to execute any code on the data prior to its insertion in the database. This code can easily integrate with or call out to existing libraries, such as TensorFlow. The results of the calculations are then inserted in the database. Because MemSQL is a distributed system and MemSQL Pipelines run in parallel, the workload is evenly distributed over the resources of the cluster.
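The shape of such a scoring step can be sketched in Python. This is a hedged illustration, not MemSQL's actual transform interface: it assumes a transform behaves like a filter that reads raw records from an input stream and writes scored records to an output stream. The CSV record layout, the `transform` function, and the `score` function (a stand-in for a call into a real library such as TensorFlow) are all hypothetical.

```python
import csv
import io

def score(temperature, vibration):
    """Hypothetical stand-in for a real model call (e.g. a loaded TensorFlow model)."""
    return 0.5 * temperature + 0.5 * vibration

def transform(source, sink):
    """Read CSV records from source, append a score to each, write CSV to sink."""
    reader = csv.reader(source)
    writer = csv.writer(sink, lineterminator="\n")
    for row in reader:
        if not row:
            continue  # skip blank lines
        sensor_id, temperature, vibration = row
        result = score(float(temperature), float(vibration))
        writer.writerow([sensor_id, temperature, vibration, result])

# Demonstration with in-memory streams. A deployed script would instead
# be wired to the streams the pipeline provides (assumed: stdin and stdout).
incoming = io.StringIO("s1,10.0,20.0\ns2,30.0,40.0\n")
scored = io.StringIO()
transform(incoming, scored)
```

Keeping the scoring logic in a plain function like this means the same code can be tested on a laptop before it is attached to a pipeline.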

Calculations in the Database

Sometimes it is more efficient to do the scoring calculations as close to the data as possible, especially when the new data needs to be compared with a larger historical set of data. In this case, you need a language to encode the algorithm in the database. It is important that the language is expressive enough to encode the algorithm, that its core operations are fast enough to allow efficient querying over the existing data, and that it can be composed with other functionality.

One example of an organization that has successfully used this approach is Thorn, a non-profit that uses image recognition to find missing and exploited children. The application keeps a set of pictures of exploited children in its system, and matches the faces of those children to new pictures that are continuously culled from websites around the country. The new pictures are reduced to vectors using a deep learning-based approach, and are matched against vectors representing the base pictures.

Prior to using MemSQL, the matching process would take hours or days. By using the MemSQL high-performance vector DOT_PRODUCT built-in function, processing the incoming pictures can be done in minutes or seconds. Another image recognition example is Nyris.io, which uses a similar technique to match product photos using deep learning coupled with fast in-database DOT_PRODUCT calculations. The application quickly matches user-provided images with reference product images to enable ecommerce transactions.
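The arithmetic behind this kind of matching can be sketched in a few lines of Python. This only illustrates the dot-product comparison itself, not how Thorn or Nyris.io implement it or how MemSQL executes DOT_PRODUCT. The vectors, the names `image_a` through `image_c`, and the choice to normalize vectors to unit length (so that a larger dot product means a closer match) are assumptions made for the example.

```python
import math

def normalize(vector):
    """Scale a vector to unit length so dot products are comparable."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def dot_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def best_match(query, references):
    """Return the name of the reference vector most similar to the query."""
    q = normalize(query)
    return max(
        references,
        key=lambda name: dot_product(q, normalize(references[name])),
    )

# Hypothetical feature vectors; real ones would come from a deep learning
# model and have hundreds or thousands of dimensions.
references = {
    "image_a": [1.0, 0.0, 0.0],
    "image_b": [0.0, 1.0, 0.0],
    "image_c": [0.6, 0.8, 0.0],
}
match = best_match([0.5, 0.9, 0.1], references)
```

Running the same comparison inside the database avoids moving the full reference set to the application for every incoming picture, which is where the speedup described above comes from.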

To build an operational ML application with MemSQL, please visit memsql.com.