As industries latch on to the rise of machine learning and artificial intelligence, we see firsthand that the key to success is often the data itself. While discussions on the latest algorithms and learning models capture mainstream attention, the real determining factor of a company’s success is its ability to leverage a corpus of data.

Recently our CEO Eric Frenkiel delivered a talk on this topic, Machines and the Magic of Fast Learning. The premise, inspired by a discussion with Alistair Croll, chair of O’Reilly’s Strata + Hadoop World conference, was simple: collecting and managing a large corpus of data leads to definitive business benefits.

To understand the implications of the corpus, consider that computing started with actors, or more specifically, applications. These applications initially generated moderate amounts of data and supported a fixed set of interactions between the data and the application. As applications and devices drove larger volumes of data, we added operators to apply data science and enhance experiences across everything from enterprise software to mobile apps. When actors and operators are brought together, real-time machine learning can be applied to drive new knowledge back into the business. But the real magic takes place when a feedback loop is developed to enrich the experience.

Rise of the Data Corpus
The data corpus theme stretches to the world’s largest companies. Here the rapid rise of mass data capture quickly shifts into rich analytics. Some examples of data collected to drive multibillion-dollar industries include:

  • App stores from Apple and Google
  • Online music, video, and books from Apple, Google, and Amazon
  • Seller marketplaces from Amazon.com
  • Social networks from Facebook

Apple, for example, recently stated that over the life of the App Store it has paid developers $70 billion. With so much at stake, these companies use similar approaches to drive their businesses. We took a closer look at this phenomenon in a post on the analytics race amongst the world’s most valuable companies. We concluded that data is fueling their success.

The data corpus also represents a strategic lever for image-focused business models. In a New Yorker article, What’s Wrong With Twitter’s Live-Video Strategy, Om Malik discussed the importance of assembling a large volume of face-recognized photos.

…but there is an even bigger potential payoff. A large corpus of images is needed to train computer vision algorithms to distinguish between cats, dogs, and houseplants. The better its technology is at identifying the information inside the photos, the more opportunities the company will have to target advertising at specific users. A photo of your baby might offer an opportunity to place an advertisement for diapers or baby-food formula.

Additionally, prominent industry data scientists like Peter Skomoroch have made the corpus a focal point, albeit with a dash of humor.

Across all industries, the data corpus, not the algorithms, will be the secret weapon for machine learning and artificial intelligence. We have already seen this model create massive value with some of the companies mentioned earlier.

The current question is which new businesses will emerge to create the next great data corpus.