A data catalog is a metadata repository that helps companies organize and find the data stored across their many systems. It works like a library catalog, but instead of detailing books and journals, it holds information about tables, files, and databases. This information comes from a company’s ERP, HR, finance, and e-commerce systems, as well as social media feeds. The catalog also shows where each data entity is located. For every piece of data, it records critical information such as the data’s profile (statistics or informative summaries about the data), lineage (how the data is generated), and what others say about it. The catalog is the go-to spot for data scientists, business analysts, data engineers, and others who are trying to find data to build insights, discover trends, and identify new products for the company.

A data catalog works differently than a data lake. While both are central repositories, a data lake requires you to move all the data into one technology; for example, if the data lake is in S3, all the data must be moved to S3. This can become very expensive and suits only certain use cases. A data catalog, by contrast, holds just the metadata and the data’s whereabouts, pointing users to the right place to go.

A company collects and stores an abundance of data, but much of it is inaccessible and unavailable for analysis. Plans to make data-driven business decisions are thus hindered at the very start. Here are some numbers about the current use of data:
  • Only 0.5% of all data is currently analyzed. [1]
  • Only 14% of business stakeholders make thorough use of customer insights. [2]
  • Organizations that leverage customer behavioral insights outperform peers by 85% in sales growth and more than 25% in gross margin. [3]
  1. Source: Datanami
  2. Source: Forrester’s Q2 2016 Intelligence Enterprise Self-Assessment Scorecard
  3. Source: McKinsey and Company

Top 3 reasons for Data Inaccessibility

1. Complex Data Stack
In a company, data originates from many sources that cannot be easily mined. Just consider a typical data stack: relational databases, data warehouses, NoSQL stores, cloud platforms, BI tools, and custom applications, each with its own interfaces and formats.
2. Dispersed Knowledge
3. Lack of Governance

A data catalog typically has the following features.

1. Collects and Organizes All Metadata
The first step for building a data catalog is collecting the data’s metadata. Data catalogs use metadata to identify the data tables, files, and databases. The catalog crawls the company’s databases and brings the metadata (not the actual data) to the data catalog.
A data catalog can typically crawl:
Data Management Platforms
  • Relational Databases – Oracle, SQL Server, MySQL, DB2, etc.
  • Data Warehouses – Teradata, Vertica, etc.
  • Object Storage
  • Cloud Platforms – Google BigQuery, MS Azure Data Lake, AWS Athena & Redshift
  • Non-Relational / NoSQL Databases – Cassandra, MongoDB
  • Hadoop Distributions
Analytics and Business Intelligence Platforms
  • Modern Business Intelligence Platforms
  • Analytic Applications
Custom Applications
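To make the crawling step concrete, here is a minimal sketch of a metadata crawler. It uses SQLite (which ships with Python) purely as a stand-in; a real catalog would run the equivalent queries against each engine’s information schema, and the `crawl_metadata` function and table names are hypothetical.

```python
import sqlite3

def crawl_metadata(conn):
    """Collect table and column metadata (not the actual data) from a database.

    SQLite is used here only because it ships with Python; a production
    crawler would query Oracle, SQL Server, etc. via their catalogs.
    """
    catalog = {}
    cur = conn.cursor()
    cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    for (table,) in cur.fetchall():
        # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk)
        cur.execute(f"PRAGMA table_info({table})")
        catalog[table] = [{"column": row[1], "type": row[2]} for row in cur.fetchall()]
    return catalog

# Example: crawl an in-memory database with one table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, email TEXT)")
print(crawl_metadata(conn))
```

Note that only schema information crosses the wire; the rows themselves stay in the source system, which is what keeps a catalog cheap compared to a data lake.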
2. Shows Data Profile
A data profile lets data consumers view and understand the data quickly. These profiles are informative summaries that explain the data. For example, the profile of a database often includes the number of tables, files, row counts, etc. For a table, the profile may include column descriptions, the top values in a column, a column’s null count, distinct count, maximum value, minimum value, and much more.
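A sketch of how such column statistics might be computed; the `profile_column` function is hypothetical and covers just the measures named above (row count, null count, distinct count, min/max).

```python
def profile_column(values):
    """Summarize one column: the kind of statistics a catalog shows in a profile."""
    non_null = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }

print(profile_column([10, 25, None, 25, 3]))
# {'row_count': 5, 'null_count': 1, 'distinct_count': 3, 'min': 3, 'max': 25}
```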
3. Builds Data Lineage
Data lineage is a visual representation of where the data comes from, where it moves, and what transformations it undergoes over time. It provides the ability to track, manage, and view data transformations along the path from source to destination. Hence, it enables an analyst to trace an error in the analytics back to its root cause.
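Under the hood, lineage is naturally modeled as a graph. The sketch below (dataset names and the `trace_to_sources` helper are illustrative, not a real catalog API) shows how an analyst could walk backward from a report to its raw origin:

```python
# Minimal lineage graph: each dataset maps to its upstream sources.
lineage = {
    "sales_report": ["sales_clean"],
    "sales_clean": ["sales_raw"],
    "sales_raw": [],
}

def trace_to_sources(dataset, graph):
    """Return every upstream dataset, i.e. the path back to the root."""
    upstream = []
    stack = list(graph.get(dataset, []))
    while stack:
        node = stack.pop()
        upstream.append(node)
        stack.extend(graph.get(node, []))
    return upstream

print(trace_to_sources("sales_report", lineage))  # ['sales_clean', 'sales_raw']
```

This is exactly the walk an analyst performs when an error shows up in a report: follow the edges upstream until the faulty transformation or source is found.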
4. Marks Relationships Amongst Data
Through this feature, data consumers can discover related data across multiple databases. For example, an analyst may need consolidated customer information. Through the data catalog, she finds that five files in five different systems contain customer data. With the data catalog and the help of IT, she can set up an experimental area where all that data is joined and cleaned, and then use the consolidated customer data to achieve her business goals.
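One simple way a catalog can surface such candidate relationships is to flag columns in different systems whose names match after normalization (real catalogs also compare value distributions; the `find_related_columns` helper and system names here are illustrative):

```python
def find_related_columns(systems):
    """Group columns across systems whose normalized names match."""
    seen = {}
    for system, columns in systems.items():
        for col in columns:
            key = col.lower().replace("_", "")  # CustomerID ~ customer_id
            seen.setdefault(key, []).append((system, col))
    # Keep only names that appear in more than one place
    return {k: v for k, v in seen.items() if len(v) > 1}

systems = {
    "crm": ["CustomerID", "Name"],
    "billing": ["customer_id", "amount"],
}
print(find_related_columns(systems))
# {'customerid': [('crm', 'CustomerID'), ('billing', 'customer_id')]}
```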
5. Houses a Business Glossary
A data catalog is an apt platform to host a business glossary and make it available across an organization. A business glossary is a document that enables data stewards to build and manage a common business vocabulary. This vocabulary can be linked to the underlying technical metadata to provide a direct association between business terms and objects.
6. Tags Data Through AI
Machine learning algorithms can tag data automatically, helping organize large volumes of data without manual curation.

The primary benefit of a data catalog is that it acts as a single source of reference for all of an organization’s data needs. OvalEdge catalogs the data in an organization’s various databases, file systems, and visualization software. It builds a knowledge base of that data through human curation combined with data-processing, machine-learning, and code-parsing algorithms.
1. Data Democratization
  • Access to data for everyone
  • Communication and collaboration on data
2. Knowledge base of entire data portfolio
  • Profiling algorithms to quickly answer the most common questions
  • Documentation on data
3. Combined toolsets
  • Query tools
  • Documentation tools
  • Governance tools
4. Algorithms to organize data
  • Machine Learning algorithms to tag data automatically
  • Relationship algorithms to find relationships
  • Code parsing algorithms to understand relationships and lineage
  • SQL query logs parsing to understand relationships and lineage

Here is the step-by-step process of building a data catalog.
Accessing and Indexing Metadata of Databases
The first step for building a data catalog is collecting the data’s metadata. The catalog crawls the company’s databases and brings the metadata (not the actual data) to the data catalog. Data catalogs then use this metadata to identify the data tables, the columns of the tables, files, and databases.
Profiling to See the Data Statistics
The next step is to profile the data so data consumers can view and understand it quickly. These profiles are informative summaries that explain the data. For example, the profile of a database often includes the number of tables, files, row counts, etc. For a table, the profile may include column descriptions, the top values in a column, a column’s null count, distinct count, maximum value, minimum value, and much more.
Building or Loading Existing Business Glossary
The third step is to build a business glossary or upload an existing one into the data catalog. A business glossary is an enterprise-wide document created to improve business understanding of the data. It enables data stewards to build and manage a common business vocabulary, which can be linked to the underlying technical metadata to associate business terms with objects. A business glossary can have multiple data dictionaries attached to it. A data dictionary is more technical in nature and tends to be system-specific; it contains the description and wiki of every table or file and all their metadata entities. Employees can collaborate on a business glossary through web-based software or an Excel spreadsheet.
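The linkage between business terms and technical objects can be pictured as a simple mapping. The glossary entry, definition, and `table.column` identifiers below are hypothetical; the point is that a business user searching for "Customer" lands on the right physical columns:

```python
# Hypothetical glossary entry linking a business term to technical metadata.
glossary = {
    "Customer": {
        "definition": "A person or company that purchases goods or services.",
        "linked_objects": ["crm.customers.id", "billing.invoices.customer_id"],
    }
}

def objects_for_term(term):
    """Return the technical objects associated with a business term."""
    entry = glossary.get(term)
    return entry["linked_objects"] if entry else []

print(objects_for_term("Customer"))
# ['crm.customers.id', 'billing.invoices.customer_id']
```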
Marking Relationships Amongst Data
Marking relationships is the next vital step, which lets data consumers discover related data across multiple databases. For example, an analyst may need consolidated customer information. Through the data catalog, she finds that five files in five different systems contain customer data. With the data catalog and the help of IT, she can set up an experimental area where all that data is joined and cleaned, and then use the consolidated customer data to achieve her business goals.
Building Lineage
After marking relationships, a data catalog builds lineage. A visual representation of data lineage helps track data from its origin to its destination and explains the different processes involved in the data flow. Hence, it enables an analyst to trace an error in the analytics back to its root cause. Generally, ETL (Extract, Transform, Load) tools are used to extract data from source databases, transform and cleanse the data, and load it into a target database. A data catalog parses the artifacts of these tools to create the lineage. Sources that can be parsed include:
  • SQL query logs and scripts
  • Alteryx
  • Informatica
  • Talend
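As a toy illustration of SQL parsing for lineage, the sketch below pulls the target and source tables out of an `INSERT INTO ... SELECT` statement with a regular expression. Production catalogs use full SQL grammars; the `lineage_from_sql` function and table names are assumptions for the example.

```python
import re

def lineage_from_sql(sql):
    """Extract a crude target/sources lineage edge from one SQL statement."""
    target = re.search(r"INSERT\s+INTO\s+(\w+)", sql, re.IGNORECASE)
    sources = re.findall(r"(?:FROM|JOIN)\s+(\w+)", sql, re.IGNORECASE)
    return {"target": target.group(1) if target else None, "sources": sources}

sql = ("INSERT INTO sales_clean "
       "SELECT * FROM sales_raw JOIN regions ON sales_raw.region_id = regions.id")
print(lineage_from_sql(sql))
# {'target': 'sales_clean', 'sources': ['sales_raw', 'regions']}
```

Run over a full query log, edges like this one accumulate into the lineage graph shown to analysts.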
Organizing Data
In a table or file, data is arranged in a technical format, not in the way that makes the most sense to a business user. Human collaboration on data assets is therefore needed so that business users can discover, access, and trust them. Below are a few techniques for arranging data for easy discovery:
  1. Tagging
  2. Organizing by amount of usage
  3. Organizing by specific users’ usage
  4. Through automation – when there is a large amount of data, advanced algorithms can be used to organize it.
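A minimal sketch of automated tagging: here simple rule-based patterns stand in for the machine-learning tagging mentioned earlier, labeling columns by name so they surface in search before any human curation. The `RULES` patterns and column names are illustrative assumptions.

```python
import re

# Illustrative name-based tagging rules; a real catalog would learn these.
RULES = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
    "pii":   re.compile(r"ssn|social", re.IGNORECASE),
}

def tag_columns(columns):
    """Attach every matching tag to each column name."""
    return {
        col: [tag for tag, pattern in RULES.items() if pattern.search(col)]
        for col in columns
    }

print(tag_columns(["customer_email", "mobile_number", "order_total"]))
# {'customer_email': ['email'], 'mobile_number': ['phone'], 'order_total': []}
```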

Find your edge now. See how OvalEdge works.
