Before leaving home for the airport, you type your flight number, DL28, into Google search. It shows you the current status of the flight from Atlanta to London. Another time, you want to know the chances of this flight being delayed. Google won’t give you a relevant answer, although it has all the flight data. Imagine a third scenario: you are boarding the flight to London and have updated your status on Facebook. A school friend you had lost touch with over the years lives in London and sees your post. You end up having dinner with her that evening.
These are three possibilities of how far you can go with data. In the first, the question is highly relevant and has been thought of beforehand by R&D / IT (in the above case, Google), and the system is designed to answer it. This system design is referred to as Business Intelligence. In the second, the system is not designed to answer a crucial question, and the things we do to get an answer are called Data Analytics. In the third, the precise question itself is not known; this is termed Discovery.
Existing Dimension Models
For better Business Intelligence, organizations are building data warehouses and organizing data in very formal dimension models or cubes so that the desired questions can be answered rapidly. But as the data needs of organizations become more dynamic, these dimension models cannot be built fast enough. Creating a platform for Analytics and Discovery is a better approach.
Hadoop technology is making it possible to create a platform for Data Analytics and Discovery. Data Lake, Data Reservoir and Enterprise Data Hub are some of the new terms for using Hadoop to store all the data. Organizations are hiring Data Scientists to think of questions and find answers using various algorithms, machine learning, etc. The problem here is that a few individuals who don’t fully know the nitty-gritty of the business have to come up with all the possible questions.
What is a Data Lake?
What if there was a repository which could store all the data in its native format until it was needed? Could business users query that data in the way they wanted and get answers quickly? These questions lead to a Data Lake.
In a practical sense, a Data Lake is characterized by three key attributes:
Collect anything and everything:
A Data Lake contains all data: both raw sources kept over extended periods of time and any processed data.
Let everyone dive in:
A Data Lake enables users across multiple business units to refine, explore and enrich data on their terms.
Use your own engine:
A Data Lake enables multiple data access patterns across a shared infrastructure – batch, interactive, online, search, in-memory and other processing engines.
Make a Data Lake a Discovery Engine
Here are some tips to convert a Data Lake into a Discovery engine. By using these tips, organizations can create a culture of data-driven decision making. It is important to give access to data not only to your business analysts but to all employees. This self-service discovery tool should be part of the employee portal so that employees can get answers to their own questions. This ultimately improves the efficiency of the organization.
1. Eliminate data modeling
Star schema, snowflake schema, etc. are 20th-century concepts, designed when data storage was very expensive. Now that storage is cheap, you can keep your data in its original format. Use machine learning to find facts, dimensions, and the relationships between data. If you store your data in its original form, you can reconstruct it as of any point in time you need. All you need to do is store all the inserts, updates and deletes made to the data.
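The idea of keeping every insert, update and delete can be illustrated with a small sketch. It assumes a hypothetical in-memory change log (the `Change` record and `state_as_of` function are names invented for this example, not part of Hadoop or any specific tool) and simply replays the log up to a chosen moment to rebuild the data as it looked then.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Change:
    """One entry of a hypothetical change log (illustrative only)."""
    ts: int                               # event time, any sortable number
    op: str                               # "insert", "update" or "delete"
    key: str                              # primary key of the affected record
    row: Optional[Dict[str, Any]] = None  # new column values; None for deletes


def state_as_of(log: List[Change], ts: int) -> Dict[str, Dict[str, Any]]:
    """Rebuild the table as it looked at time `ts` by replaying the log."""
    table: Dict[str, Dict[str, Any]] = {}
    # sorted() is stable, so changes with the same timestamp keep log order.
    for change in sorted(log, key=lambda c: c.ts):
        if change.ts > ts:
            break
        if change.op == "insert":
            table[change.key] = dict(change.row or {})
        elif change.op == "update":
            table.setdefault(change.key, {}).update(change.row or {})
        elif change.op == "delete":
            table.pop(change.key, None)
        else:
            raise ValueError(f"unknown operation: {change.op}")
    return table


# Made-up sample data, loosely based on the flight example above.
log = [
    Change(1, "insert", "DL28", {"origin": "ATL", "dest": "LHR", "status": "scheduled"}),
    Change(5, "update", "DL28", {"status": "delayed"}),
    Change(9, "delete", "DL28"),
]

print(state_as_of(log, 6))
```

Because nothing is overwritten, the same log answers "what did we know at time 1?" and "what did we know at time 6?" without a separately modeled time dimension. A real data lake would keep the log in files or tables rather than a Python list.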
If you enter a library and there is no catalog available, it is nearly impossible to find a book. The same applies to data. To find data, it is necessary to build a catalog of all the data in your data lake. Advanced algorithms or machine learning techniques can be used to build this catalog. It is important to have all the information cataloged and searchable at your fingertips.
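As a rough illustration of what "cataloged and searchable" means, the sketch below builds a plain keyword index over dataset descriptions and column names. This is a deliberately simple stand-in for the advanced algorithms mentioned above, not a machine learning catalog. The `Catalog` class, its methods and the dataset paths are all hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class Catalog:
    """A toy catalog: maps each keyword to the datasets that mention it."""

    def __init__(self) -> None:
        self._index: Dict[str, Set[str]] = defaultdict(set)

    def register(self, path: str, description: str, columns: List[str]) -> None:
        """Index a dataset by the words in its description and column names."""
        for token in tokenize(description + " " + " ".join(columns)):
            self._index[token].add(path)

    def search(self, query: str) -> List[str]:
        """Return the datasets that match every word in the query."""
        tokens = tokenize(query)
        if not tokens:
            return []
        hits = [self._index.get(t, set()) for t in tokens]
        return sorted(set.intersection(*hits))


# Made-up datasets for illustration.
catalog = Catalog()
catalog.register("/lake/raw/flights/status", "Flight status events",
                 ["flight_number", "origin", "status"])
catalog.register("/lake/raw/social/posts", "Employee social posts",
                 ["user_id", "text", "flight_number"])

print(catalog.search("flight status"))
```

Even this naive index lets a user who knows only a keyword find where the data lives. A production catalog would add profiling, lineage and richer matching on top of the same lookup idea.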
It’s hard to do Discovery alone (there is only one Einstein), so it’s important for organizations to pair access to data with a collaboration tool. In our example, when you shared your flight information with your network, you got a nice evening with your friend.
Data is an asset to an organization. As organizations open data up to many employees, it is also important that only the right eyes see it. It is therefore very important to design a security model so that employees can collaborate securely.
In the last decade, many companies have built their fortunes largely on their recommendation engines. Whether it’s Netflix, Facebook, LinkedIn or Google, companies are able to find new customers, and customers are able to find new products, using these recommendation engines. Every piece of data has some correlation with other data. These correlations are then converted into recommendations. Data discovery can become far more powerful with these recommendation engines.
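One simple way to turn correlations into recommendations is to count how often items appear together. The sketch below assumes each "session" is the set of datasets one user queried together, and recommends the datasets most often seen alongside a given one. The function names and sample data are invented for this example. Commercial recommendation engines use far more sophisticated models.

```python
from collections import Counter, defaultdict
from itertools import permutations
from typing import Dict, Iterable, List, Set


def build_cooccurrence(sessions: Iterable[Set[str]]) -> Dict[str, Counter]:
    """Count, for every item, how often each other item shares a session with it."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for items in sessions:
        for a, b in permutations(sorted(items), 2):
            counts[a][b] += 1
    return counts


def recommend(counts: Dict[str, Counter], item: str, k: int = 3) -> List[str]:
    """Return up to k items most often seen with `item` (ties broken by name)."""
    ranked = sorted(counts.get(item, Counter()).items(),
                    key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]


# Made-up usage data: datasets queried together by different users.
sessions = [
    {"flights", "weather", "airports"},
    {"flights", "weather"},
    {"flights", "crew"},
    {"weather", "airports"},
]

counts = build_cooccurrence(sessions)
print(recommend(counts, "flights", k=2))
```

In a discovery tool, suggestions like these can point a user who starts from flight data toward related data they did not know to ask for.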