Shifting from Big Data to Machine Learning: Lessons Learned
The old adage is as true as ever in the world of open source technology: “Those who do not learn history are doomed to repeat it.”
Numerous surveys, articles, listicles and case studies address best practices for companies wishing to implement a Big Data project and make a return on their investment. Machine learning is the new buzzword to take hold among executives (and in the marketing materials of enterprise consultants). Before plunging into the world of machine learning, firms should pause and learn from the mistakes made in the implementation of Big Data projects over the last five years.
The chart here shows the percentage growth in search popularity of “machine learning” and “big data” since 2012, as tracked by Google Trends. Since 2015, “big data” appears to be leveling off in growth, while “machine learning” shows an inflection point with a high rate of growth. Anecdotally, conversations behind the closed doors of executive offices are shifting focus as well.
According to a survey cited by Jeff Kelly, Big Data projects returned 55 cents for every dollar spent, and the three primary reasons for the underwhelming ROI are a lack of skilled practitioners, immature technology and a lack of compelling business cases.
Clearly, buzz about machine learning is growing as companies aspire to interpret and apply their data to drive incremental revenue. Using the three reasons listed in the survey cited by Jeff Kelly as a guide, let’s address the next wave in open source.
Commoditization Does Not Improve ROI
People who write “data scientist” on their resume are rarely left alone by recruiters and are only a few phone calls away from setting up an interview with an enthusiastic employer. The Harvard Business Review famously proclaimed data scientist “the sexiest job of the 21st century.” The role of a data scientist includes the technical proficiency to manipulate data using SQL, NoSQL and ETL tools, broad knowledge of statistical techniques and predictive modeling (the core of machine learning), plus softer skills to visualize and present the data and output. The key role of a data scientist, though, is using that broad skill set to find insights in the noise and make a meaningful impact on the business.
As noted above, Big Data projects fail due to a lack of skilled practitioners. Cloud solutions and developer shops have attempted to bridge the gap with talent and services by addressing the “how-to” of building a data operation. The problem, however, is that the talent and services stop at the data layer, before any insight or value can be generated. The gap still exists, though not from a lack of talent; rather, the focus has been exclusively on the data itself. Executives are left unsatisfied because Big Data services have been commoditized, with most of the focus on the data and not enough room for adaptability.
Multitudes of consultants with deep skills in the latest technologies are now available to hire and build the bridge, but the assumption that every firm has the same gap is incorrect. Big Data from the perspective of the firm cannot be commoditized. Data science is now the proposed solution to finally complete the bridge and generate ROI from Big Data projects.
Training the Data Scientist and Closing the Gap
There is now a risk that data scientists will become commoditized as well. If the data scientist focuses exclusively on the “how-to” skills without a keen sense for business and for discovering relevant insights, I expect history will repeat itself.
A growing number of universities and online educational services now offer graduate degrees in data science. Although there is no universal standard across programs yet, the most common disciplines that data science master's students will encounter are computer science and statistics. There is also typically exposure to common languages and tools like SQL, NoSQL, Python, R, and even GitHub. The missing piece from this mix of skills is training in business operations and basic economics. Without understanding how the real world works, the data scientist is at risk of building more bridges but never closing the gap.
The Technical Utility Belt
Machine learning will be closely tied to the technologies, languages and tools used for data storage and manipulation. Since it depends on a functioning data environment, machine learning may suffer from the same criticism of “immature technology” as Big Data. If the underlying data framework is still not a stable production environment, the machine learning algorithm will have to sit waiting until data and ETL are under control.
Therefore, prior to undertaking a machine learning project, a conscious choice of systems and infrastructure should be made to ensure a robust and production-ready operation.
There are plenty of competitors to choose from as well. The technology continues to mature as Microsoft’s Azure and Google’s BigQuery gain traction against Amazon and Hadoop. Spark also appears to have grown up and made a name for itself beyond Hadoop. Most recently, notebooks like Jupyter, Zeppelin and Beaker are meeting the needs of data scientists who must collaborate and test various machine learning algorithms.
Interestingly, the actual science behind machine learning is more accessible than ever. Unless the data scientist will be building a process from scratch, countless types of machine learning libraries can be found by checking GitHub. A quick search returned nearly 40K repositories across multiple languages, including Google’s well-known TensorFlow system. The technology is becoming more mature, but it is still relatively new so the risk still exists.
Use Cases, Use Cases, Use Cases!
What else needs to be said?
Machine learning can be seen as a reaction to the underwhelming return from Big Data. Executives and practitioners are realizing that ingesting every data point imaginable and firing it from a fire hose between unstructured and structured data warehouses is not what they actually want from Big Data.
In hindsight, maybe “Big Data” was not the right term to use for these types of projects. The term is a giant umbrella for a broad set of skills and services, but the focus of the term is just data. “Big ROI” sounds more relevant and immediately shifts focus from data operation to business case. Maybe the term “data science” should be scrapped as well.
The biggest mistake executives and firms can make now is thinking that machine learning is the silver bullet that will solve every problem. Based on my experience, there are no silver bullets. At least we now have a blueprint for what worked and failed in Big Data. History can be our guide.