At the start of the new millennium, the business world was witnessing a change. Unstructured data was being generated at unprecedented volume and velocity, and it demanded a revolution in data storage and processing. This data seemed to hold tremendous value, but traditional relational databases and business intelligence software were unable to handle it. So what could? In 2003 and 2004, Google happened to publish a pair of research papers about its search engine technology. These papers described the Google File System (GFS), a means of storing data across distributed machines, and Google MapReduce, a distributed number-crunching platform that runs atop GFS. They paved the way for some techies to invent a massive data storage and processing system.
Eventually, when the idea of a data lake took shape, the cry of joy that erupted was YAHOO! That’s right! It was Yahoo that bootstrapped Hadoop, one of the most influential software technologies of the last five years. Hadoop could make the idea of a data lake materialize: a data repository that could hold all kinds of data in its native format. Yahoo formed a team to work on Hadoop. But midway, a founding engineer broke off and started to build his own company, Cloudera, along with another engineer from Oracle.
Rob Bearden, a serial entrepreneur from Atlanta, Georgia, was keeping a hawk’s eye on big data technology. He saw the rapid rise of Cloudera and felt this technology could reshape the way big businesses operated. He wanted to start a new company, but with an experienced team to give it a head start. So he sent an email to a Hadoop software lead at Yahoo known as Eric 14, because his fourteen-letter last name was a mouthful. Together, within six months, they convinced the Yahoo board to spin off Eric and about 24 other engineers.
Dubbed Hortonworks, this new venture indeed had its hands on the right technology. Moreover, at this point Rob fueled the enterprise with $100 million in venture capital. A data lake built on Hadoop, an open-source platform designed to crunch vast amounts of data using an army of dirt-cheap servers, was taking shape.
Today, Hadoop underpins not only Yahoo, but Facebook, Twitter, eBay, and dozens of other high-profile web outfits. Due to its success on the web, the data lake technology is primed for use in the corporate data world. In today’s internet-driven world, more and more data is hitting big businesses. A data lake is a way of dealing with that data.
“Change is hardest at the beginning, messiest in the middle and best at the end.”
Robin Sharma, Bestselling author & leadership speaker
What we are talking about here is a shift in enterprise data management. If bringing about a change in technology is hard, changing people’s mindset to embrace that technology is grueling. That made the road to the data lake fraught with risk. The risks became evident when some data lakes failed, and companies became wary of this new technology.
Hadoop has a long history with elephants. It was named after the yellow stuffed elephant that once belonged to the son of one of its founders, Doug Cutting. The elephant has rubbed off not only on its name but also on the way it works. A storage system as massive as a data lake has every chance of becoming unwieldy. And it does.
If we want to extract value from all the data brought together in one place, we have to take a few more steps. We have to organize and catalog the data so that we can find it when needed. It took some companies nearly a year to do that. This dampened the enthusiasm of businesses that were looking for quick returns.
We can liken the various applications that generate data to mom-and-pop stores. You have to go to several such outlets to get all the things you need.
A data lake is a massive repository, like Walmart, which physically brings in all kinds of products from various manufacturers. The challenge with this approach is how to organize the data so that it can be easily searched and accessed.
Amazon is a virtual platform that lets you buy things from various companies. It is a virtual marketplace. What sits in a central place is the information about the merchandise, not the merchandise itself. Moreover, that information is well organized, searchable, and comes with recommendations. A virtual data lake works on a similar concept. You bring only the metadata to a central place and NOT the data. Whenever the data is needed, you access it from where it is natively stored.
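To make the idea concrete, here is a minimal sketch in Python of a metadata-only catalog. Everything in it is an assumption made for illustration: the names `CatalogEntry` and `VirtualCatalog`, the two in-memory dictionaries standing in for separate source systems, and the tags. It is not how any particular product works. It only shows the principle of searching central metadata and fetching the data from its native location on demand.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CatalogEntry:
    """Metadata about one dataset; the data itself stays in its source system."""
    name: str
    source: str      # hypothetical source-system label, e.g. "crm"
    location: str    # where the data natively lives (illustrative identifier)
    tags: List[str] = field(default_factory=list)


class VirtualCatalog:
    """Central, searchable store of metadata only (hypothetical illustration)."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}
        self._readers: Dict[str, Callable[[str], list]] = {}

    def register_source(self, source: str, reader: Callable[[str], list]) -> None:
        # A reader knows how to pull data out of one source system.
        self._readers[source] = reader

    def add(self, entry: CatalogEntry) -> None:
        # Only the metadata is copied to the central place.
        self._entries[entry.name] = entry

    def search(self, tag: str) -> List[str]:
        # Search runs against the metadata, never against the data.
        return sorted(e.name for e in self._entries.values() if tag in e.tags)

    def fetch(self, name: str) -> list:
        # Data is read from where it natively lives, only when asked for.
        entry = self._entries[name]
        return self._readers[entry.source](entry.location)


# Stand-ins for two separate applications (the "mom-and-pop stores").
CRM_TABLES = {"customers": [{"id": 1, "name": "Ada"}]}
SALES_FILES = {"orders_2020.csv": [{"order": 7, "customer_id": 1}]}

catalog = VirtualCatalog()
catalog.register_source("crm", lambda loc: CRM_TABLES[loc])
catalog.register_source("sales", lambda loc: SALES_FILES[loc])
catalog.add(CatalogEntry("customers", "crm", "customers", ["customer", "pii"]))
catalog.add(CatalogEntry("orders", "sales", "orders_2020.csv", ["customer", "revenue"]))

print(catalog.search("customer"))  # names of datasets tagged "customer"
print(catalog.fetch("orders"))     # rows read from the sales system on demand
```

Notice that the catalog never holds a copy of the customer or order rows. It holds a pointer and a way to follow it, just as the marketplace holds a product listing rather than the product.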