What is Unstructured Data and How to Process it on Hadoop?
Hadoop was built to process unstructured data. Its design follows the MapReduce and distributed file system work that Google published for its own search infrastructure, and Yahoo then used Hadoop to build search indexes and rank pages based on keywords from the text on the pages. To date, however, most practical Hadoop use cases only offload ETL from proprietary databases to Hadoop or create new ETL. The reason is that few people know how to process unstructured data, or what kind of insights it can provide.
What is the business value in processing unstructured data?
Here are some use cases where you can derive value from analyzing unstructured data:
Compliance (Secure your customer data)
Compliance is one area where processing unstructured data is crucial. Take GDPR (General Data Protection Regulation, a data regulation by the EU): to comply, you need to know where all the personal data you hold about your customers lives. Besides the CRM database and other databases, customer information also sits in emails, chat messages, and log files. In the wake of recent data breaches, companies want to detect the exact location of PII (Personally Identifiable Information) so they can secure it. For example, you can check whether any SSNs or credit card numbers are stored in chat files or emails.
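The SSN and credit card check mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not a complete PII scanner: the two patterns and the names `find_pii` and `luhn_ok` are assumptions made for this example, and the SSN pattern only covers the US `NNN-NN-NNNN` layout. Card candidates are filtered with the Luhn checksum so that arbitrary long numbers (order IDs, for instance) are less likely to be flagged.

```python
import re

# Illustrative patterns only; a real rule set would cover more formats.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, each optionally followed by one space or hyphen.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pii(text: str) -> list:
    """Return (kind, matched_text) pairs for SSN-like and card-like strings."""
    hits = [("ssn", m.group()) for m in SSN_RE.finditer(text)]
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if luhn_ok(digits):
            hits.append(("card", m.group()))
    return hits


if __name__ == "__main__":
    chat = "My SSN is 123-45-6789 and my card is 4111 1111 1111 1111."
    for kind, value in find_pii(chat):
        print(kind, value)
```

On a cluster, the same function could run inside a mapper over each chat or email file, emitting the file name alongside every hit so you know where the PII lives.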
Sentiment Analysis (understanding customers)
The key to a successful business lies in understanding customer behavior. If you ask customers to rate their experience on a scale of 1 to 5, you only learn whether they are happy, unsatisfied, or somewhere in the middle. What exactly they like or hate, they write in the comment section. Analyzing that text lets you step into their shoes and then improve the services you provide to them.
Classification of customer problems
A company’s customer support team works with many customers and resolves their problems via chat messages, emails, phone calls, etc. We can analyze this text to find patterns and solve a recurring problem for a broad audience.
These use cases give only a whiff of the tremendous value unstructured data can unleash. Now let’s look at the various kinds of unstructured data that exist.
Types of Unstructured Data
Fully Unstructured Data
These are video files, audio files, and pictures. Not many techniques are available on Hadoop to gather intelligence from fully unstructured data. However, data scientists can use the approach described on this page (https://www.tensorflow.org/deploy/hadoop) to process fully unstructured data and extract some intelligence. TensorFlow requires a vast amount of processing power, typically from GPUs, which most Hadoop clusters built on commodity CPU machines don’t have, so Hadoop may not be the right technology to process fully unstructured data.
Unstructured Text Data
This is text written in various forms: web pages, emails, chat messages, PDF files, Word documents, etc. Hadoop was first designed to process this kind of data, and with some programming we can extract insights from it. The rest of this article focuses mainly on handling unstructured text data.
Semi-structured data sits mostly in log files or IoT logs, where we can see a structure but need some rules to extract the details. For example, a log line may look like:
2017-11-01 14:27:57,944-INFO : com.ovaledge.oasis.dao.DomainDaoImpl – RUNNING QUERY: Select * from domain where DOMAINTYPE=’DATAAPP_CATEGORY’;
The line above starts with a timestamp and a log level, followed by a class name and a message from that class. We can write rules to extract this information.
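One way to write such a rule is a regular expression with named groups. The sketch below is fitted to the single sample line above, so the field names (`timestamp`, `level`, `logger`, `message`) and the function `parse_log_line` are assumptions for illustration; a real log format may need a different pattern.

```python
import re
from datetime import datetime

# Pattern fitted to the sample line; the dash before the message may be
# a plain hyphen or an en dash, so both are accepted.
LOG_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})"
    r"-(?P<level>[A-Z]+)\s*:\s*"
    r"(?P<logger>[\w.]+)\s+[-\u2013]\s+"
    r"(?P<message>.*)$"
)

SAMPLE = (
    "2017-11-01 14:27:57,944-INFO : com.ovaledge.oasis.dao.DomainDaoImpl "
    "– RUNNING QUERY: Select * from domain where DOMAINTYPE=’DATAAPP_CATEGORY’;"
)


def parse_log_line(line: str):
    """Return a dict of fields, or None if the line does not fit the pattern."""
    match = LOG_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # "%f" accepts the three millisecond digits and stores them as microseconds.
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%Y-%m-%d %H:%M:%S,%f"
    )
    return fields


if __name__ == "__main__":
    print(parse_log_line(SAMPLE))
```

Lines that do not match return `None` rather than raising, which is usually what you want when scanning large, messy log files: count the misses and refine the rule.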
Incompatibly Structured Data (But they call it Unstructured)
Avro, JSON, and XML files hold structured data, but many vendors call them unstructured because they are files; they only treat data sitting in a database as structured. Hadoop has an abstraction layer called Hive, which we use to process this structured data.
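To see why such files really are structured, consider how mechanically a nested JSON record maps onto table columns. The Python sketch below is only an illustration of that mapping, not how Hive itself reads JSON; the function name `flatten` and the dotted column naming are assumptions for this example.

```python
import json


def flatten(record: dict, prefix: str = "") -> dict:
    """Turn nested objects into one flat row with dotted column names.

    Lists and scalar values are kept as-is; only nested objects are expanded.
    """
    row = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=name + "."))
        else:
            row[name] = value
    return row


if __name__ == "__main__":
    doc = '{"id": 7, "customer": {"name": "Ann", "address": {"city": "Austin"}}}'
    print(flatten(json.loads(doc)))
```

Every record with the same shape yields the same set of columns, which is exactly what a table definition needs.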
Now that we have categorized the data, the next step is to process this massive amount of unstructured data. Let’s talk about how to do this technically on Hadoop.
Structuring Unstructured Data
Text Extraction (Different File Formats)
Out of the box, Hadoop’s input formats read plain text files and a few Hadoop-native formats. To process other kinds of files, for example HTML, PDF, Word, PPT, etc., you have to write a custom input format. Numerous open source solutions are available to extract the text from these file formats.
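As a small illustration of the extraction step itself (not of a Hadoop input format, which would be written in Java), here is a sketch that pulls visible text out of an HTML page using only Python's standard library. The names `TextExtractor` and `html_to_text` are made up for this example. Binary formats such as PDF or Word need a dedicated extraction library; Apache Tika is one open source option.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = "<html><body><h1>Hello</h1><p>Unstructured <b>text</b> here.</p></body></html>"
    print(html_to_text(page))
```

The output is plain text, which the later parsing steps can consume regardless of what the original file format was.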
Parsing / Tokenization
Once you extract the text, you need to split paragraphs into sentences and sentences into words. Doing this well takes more than splitting on periods and spaces, so parsers are usually backed by a trained machine learning model. You can use Java-based open source libraries to parse the text. We commonly use two libraries: Stanford and Open Text API.
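To make the two steps concrete, here is a deliberately naive, rule-based sketch in Python. It is not what the libraries mentioned above do internally; it swaps in two regular expressions for a trained model, and the names `split_sentences` and `tokenize` are assumptions for this example. It will mis-split on abbreviations such as "Dr. Smith", which is exactly the kind of case a trained model handles better.

```python
import re

# A sentence ends at . ! or ? when followed by whitespace and then an
# uppercase letter or an opening quote. Naive by design.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z\"'])")

# A token is a word (optionally with inner apostrophes, as in "doesn't")
# or a single punctuation character.
TOKEN = re.compile(r"\w+(?:['’]\w+)*|[^\w\s]")


def split_sentences(text: str) -> list:
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def tokenize(sentence: str) -> list:
    return TOKEN.findall(sentence)


if __name__ == "__main__":
    paragraph = "Hadoop stores files. It doesn't parse them! Why not?"
    for sentence in split_sentences(paragraph):
        print(tokenize(sentence))
```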
Phrase Recognition (Corpus Building)
Next, you want to pick out phrases from the text. You can either use rules that check word combinations against a dictionary, or use machine learning models.
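The rule-based option can be sketched as a greedy longest-match lookup of word n-grams in a phrase dictionary. The dictionary contents, the `max_len` limit, and the function name `find_phrases` are all assumptions for illustration; in practice the dictionary would come from your own corpus.

```python
def find_phrases(tokens: list, phrase_dict: set, max_len: int = 4) -> list:
    """Return multi-word phrases found in tokens, preferring the longest match.

    phrase_dict holds phrases as tuples of lowercase words.
    """
    lowered = [t.lower() for t in tokens]
    found = []
    i = 0
    while i < len(lowered):
        match_len = 0
        # Try the longest window first, down to two words.
        for n in range(min(max_len, len(lowered) - i), 1, -1):
            if tuple(lowered[i:i + n]) in phrase_dict:
                match_len = n
                break
        if match_len:
            found.append(" ".join(tokens[i:i + match_len]))
            i += match_len  # skip past the matched phrase
        else:
            i += 1
    return found


if __name__ == "__main__":
    phrases = {
        ("credit", "card"),
        ("credit", "card", "number"),
        ("customer", "support"),
    }
    words = "Please send your Credit Card number to customer support today".split()
    print(find_phrases(words, phrases))
```

Matching longest-first means "Credit Card number" is reported as one phrase rather than as "Credit Card" followed by a stray "number".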
Named Entity Recognition
Now you want to pick out nouns, proper nouns, addresses, and cities from the text, which means deciding whether a particular word is a city, an address, or a state. For this, you need a machine learning model that determines whether a word falls within a specific category. For names and cities, you can find open source models. To identify anything else, you first have to create a model and train it with your data; after that, it can recognize those entities in the text.
At OvalEdge we have a robust algorithm that creates a model by learning from structured data and then applies the model automatically. You only have to point it to the ‘Payment Term Table’ to train a model with payment terms, or to the ‘Company Table’ to identify company names. Once you identify names, you can index them for searching or analytics.
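The general idea of using a structured table as the source of known entity names can be illustrated with a plain dictionary lookup. To be clear, this is not OvalEdge's algorithm and it involves no machine learning: it is a minimal sketch that reads names from a table and tags exact, case-insensitive matches in free text. The table name `company`, the column `name`, and all function names are hypothetical.

```python
import re
import sqlite3


def load_names(conn, table: str, column: str) -> list:
    """Read known entity names from one column of a table.

    The identifiers are interpolated into the SQL, so pass only trusted values.
    """
    rows = conn.execute(f'SELECT "{column}" FROM "{table}"').fetchall()
    return [row[0] for row in rows if row[0]]


def build_matcher(names: list):
    """Compile one case-insensitive pattern that matches any known name."""
    if not names:
        raise ValueError("need at least one name to build a matcher")
    # Longest names first so a longer name wins over a shorter prefix of it.
    ordered = sorted(set(names), key=len, reverse=True)
    pattern = "|".join(re.escape(name) for name in ordered)
    return re.compile(rf"\b(?:{pattern})\b", re.IGNORECASE)


def tag_entities(text: str, matcher, label: str) -> list:
    """Return (matched_text, label, start_offset) for every hit."""
    return [(m.group(), label, m.start()) for m in matcher.finditer(text)]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE company (name TEXT)")
    conn.executemany(
        "INSERT INTO company VALUES (?)", [("Acme Corp",), ("Globex",)]
    )
    matcher = build_matcher(load_names(conn, "company", "name"))
    print(tag_entities("Invoice from acme corp, cc Globex billing.", matcher, "COMPANY"))
```

The tagged offsets are what you would feed into a search index, so that a query for a company name can jump straight to the emails or chat logs that mention it.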