How Open Source Software enabled Big Data for the masses
The power of the collective
In organizational theory, there are two driving models for innovation: the private investment model, which allows inventors to claim financial benefits through intellectual property rights, and the collective action model, which holds that where market options aren’t optimal, innovators will collaborate toward a public good without expecting material return. Open-source software (OSS) has both the benefits and the challenges of the latter. The strongest benefit of the collective action model is that it harvests the intelligence of a community to create a product that is less exposed to the flaws or special interests of any one contributor. Contrast that with the rigid controls and special purpose of a commercially distributed proprietary product.
Nowhere is this comparison more obvious than in the mass media’s speculation about the server operating system of the future. Will it be Linux? Will it be Microsoft Windows? Most now realize the folly of those speculations, because it was never a zero-sum game. Both are as relevant today as they were five years ago, and I don’t see that changing anytime soon. Beyond cost, the technology selected for a specific task will continue to be driven by backward compatibility with existing systems, and sometimes by the skills of the software development team. The exception is data aggregation and processing architecture, which is, hands down, dominated by the open-source community.
“By 2016, at least 95% of IT organizations will leverage nontrivial elements of OSS technology within their mission-critical IT portfolios, including cases where they might not be aware of it — an increase from 75% in 2010.”-Gartner
With even the U.S. government creating policy for the use of open-source tools, there is no longer a question about OSS going mainstream. The biggest modern companies are not only running open-source software, they have also become the biggest patrons of the source projects being fed into the community. Companies like Facebook and Google use publicly available source code to power their critical infrastructure, and also make products they build internally available for free distribution. Some very successful examples are the Hadoop ecosystem, much of which was developed at Yahoo, and the Cassandra NoSQL database developed by Facebook. The most forward-looking companies are not content just to release source code into the wild, but are determined to continually out-innovate existing open-source solutions. One innovative open-source tool my group recently started working with is Presto. Originally conceived by Facebook, Presto is a near-real-time distributed SQL query engine created to outperform Apache Hive. Because it distributes query processing and pipelines data in memory, it avoids much of the latency and disk I/O common with traditional MapReduce jobs.
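To make the difference in disk I/O concrete, here is a toy sketch in Python. It is not how Hadoop or Presto is actually implemented: the function names staged_count and pipelined_count are my own, a word count stands in for a real query, and everything runs on one machine. The first function writes its intermediate results to a temporary file before the next stage reads them, roughly in the spirit of a MapReduce job; the second streams records through memory, in the spirit of a pipelined query engine.

```python
import json
import tempfile
from collections import Counter
from typing import Iterable, Iterator


def staged_count(lines: Iterable[str]) -> Counter:
    """Toy MapReduce-style count: the map stage spills every record to disk,
    and the reduce stage only starts once the map stage has finished."""
    with tempfile.TemporaryFile(mode="w+") as spill:
        for line in lines:                      # map stage
            for word in line.split():
                spill.write(json.dumps([word.lower(), 1]) + "\n")
        spill.seek(0)                           # reduce stage reads it all back
        counts: Counter = Counter()
        for record in spill:
            word, n = json.loads(record)
            counts[word] += n
    return counts


def pipelined_count(lines: Iterable[str]) -> Counter:
    """Toy pipelined count: records flow from one step to the next in memory,
    with no intermediate file."""
    def words() -> Iterator[str]:
        for line in lines:
            for word in line.split():
                yield word.lower()
    return Counter(words())


# Hypothetical sample input, for illustration only.
sample = ["big data for the masses", "Big Data on commodity hardware"]
assert staged_count(sample) == pipelined_count(sample)
print(pipelined_count(sample).most_common(2))
```

Both functions return the same counts; the only difference is whether the intermediate records touch the disk on the way. On a real cluster that round trip happens between every stage of a job, which is where much of the latency comes from.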
How big data came to the masses
Mining insight from massive amounts of data has always been challenging, but it was never impossible. All you had to do was transform your data into a specific format and inject it into an enterprise relational database. This model worked well; IBM and Oracle are masters of building sophisticated software to serve this purpose. The problem came when there were massive amounts of data, in different formats, stored in different proprietary database products. For anyone other than the very well heeled, transforming and connecting data from disparate systems in any meaningful way was nearly impossible. Under this model, the technology architecture required a lot of moving parts and a lot of processing time. In addition, database software licensing was largely based on the number of cores running on the server, and sometimes on how the data would be used (internal vs. external). Scaling this model to multiple servers rapidly became not only expensive, but a logistical licensing nightmare. For an IT leader in a large organization, there are times when hiring a full-time software license administrator looks worth considering.
Consider the scenario above, and add that in order to do big data we need to connect historical data streams that exist in multiple places and in different formats. The optimal big data architecture is distributed, for scalability, performance and fault tolerance. It becomes clear that without the freedoms brought by freely distributed open-source software, big data would still be outside the financial reach of most. The notion of big data has reached the fever pitch we see today partly because of the accessibility of free software like Linux and Apache Hadoop, which scales horizontally on inexpensive commodity hardware.
Not just another arrow in the quiver
Even with all this modern technology, leaders know better than to assume that open-source projects will automatically translate to a lower total cost of ownership (TCO). While there are typically savings on licensing, there are often greater costs for skilled support. Compared with closed-source software, OSS tends to be heavier in operating cost, since OSS staff will likely need a different set of skills than those required by packaged proprietary software. With the need for specialized skills comes the need for regular training. Managing source code under an OSS license such as the GPL requires its own process, separate from internal and proprietary software development efforts, particularly if you are going to distribute OSS to third parties. It is also important to remember that among OSS projects, the maturity of the product and its documentation is often driven by the size of the developer community. Therefore, you must assess the feasibility of each OSS project on its own merits. Even with all this, the OSS ecosystem will continue to drive the promise of big data, and indeed is the very reason the buzz exists in the first place.
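The trade-off between licensing savings and support costs can be sketched with a little arithmetic. The model below is a deliberate simplification, and every figure in it is hypothetical, chosen only to show the shape of the comparison: licensing is charged per core and so grows with the cluster, while support staff and training are treated as fixed annual costs. The CostModel class and its field names are my own, not taken from any vendor's price list.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    license_per_core: float  # annual license fee per CPU core (0 for OSS)
    support_staff: float     # annual cost of skilled support staff
    training: float          # annual training budget

    def annual_tco(self, servers: int, cores_per_server: int) -> float:
        """Annual cost: per-core licensing plus fixed support and training."""
        licensing = self.license_per_core * servers * cores_per_server
        return licensing + self.support_staff + self.training


# Hypothetical figures, for illustration only.
proprietary = CostModel(license_per_core=1_000, support_staff=100_000, training=5_000)
open_source = CostModel(license_per_core=0, support_staff=150_000, training=15_000)

for servers in (2, 20):
    p = proprietary.annual_tco(servers, cores_per_server=8)
    o = open_source.annual_tco(servers, cores_per_server=8)
    print(f"{servers:>2} servers: proprietary={p:,.0f}  open source={o:,.0f}")
```

With these made-up numbers the proprietary stack is cheaper on two servers and the open-source stack is cheaper on twenty. The crossover point depends entirely on the inputs, which is why each project has to be assessed on its own merits.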