A step-by-step guide to build a data catalog
What is a Data Catalog?
A data catalog helps companies organize and find data that’s stored in their many systems. It works a lot like a fashion catalog. But instead of detailing swimsuits or shoes, it has information about tables, files, and databases from a company’s ERP, HR, Finance, and E-commerce systems (as well as social media feeds). The catalog also shows where all the data entities are located.
A data catalog contains lots of critical information about each piece of data, such as the data’s profile (statistics or informative summaries about the data), lineage (how the data is generated), and what others say about it.
The catalog is the go-to spot for analysts and others, who are trying to find data to build insights, discover trends, and identify new products for the company.
Why does a company need a data catalog?
According to Forrester research, only 14% of business stakeholders make thorough use of customer insights. That’s because most companies don’t have access to their data.
Often, that happens due to the existence of many sources of data which cannot be easily mined. Just consider a typical data stack for a company:
A Typical Data Stack of a Company
A data catalog solves this problem, as well as others. It makes data easy to find. It gives a comprehensive view of each piece of data across the databases and identifies relationships. Plus, it puts guardrails on the data and governs who can access it.
The step-by-step process of building a data catalog
Accessing Metadata of all databases
The first step for building a data catalog is collecting the data’s metadata. Data catalogs use metadata to identify the data tables, files, and databases. The catalog crawls the company’s databases and brings the metadata (not the actual data) to the data catalog.
- Transactional databases (RDBMS) – Oracle, SQL Server, MySQL, DB2, etc.
- Data warehouse – Teradata, Vertica etc.
- NoSQL databases – Cassandra, MongoDB
- Cloud Storage – Google Big Query, MS Azure Data Lake, AWS – Athena & Red Shift
- Tableau and Power BI
Building a Data dictionary
The second step is to build a data dictionary or upload an existing one into the data catalog. A data dictionary contains the description and Wiki of every table or file and all their metadata entities.
Employees can collaborate to create a data dictionary through web-based software or use an excel spreadsheet.
An example of a data dictionary
Profiling to See the Data Statistics
The next step is to profile the data to help data consumers view and understand the data quickly. These profiles are informative summaries that explain the data.
For example, the profile of a database often includes the number of tables, files, row counts, etc.. For a table, the profile may include column description, top values in a column, null count of a column, distinct count, maximum value, minimum value and much more.
An example of a table profile in the data catalog, OvalEdge
Marking Relationship Amongst Data
Marking relationships is the next vital step. Through this step, data consumers can discover related data across multiple databases. For example, an analyst may need consolidated customer information. Through the data catalog, she finds that five files in five different systems have customer data. With OvalEdge data catalog and the help of IT, one can have an experimental area where you can join all the data, clean it. Then use that consolidated customer data to achieve your business goals.
An example of relationships marked for the Table ‘Accounts’
The many ways relationships can be marked
Data relationships can be identified and marked amongst data in several ways:
1. Through human knowledge
2. An advanced algorithm can find out the connection. For example, in a table under the column name ‘vendor,’ you have an entity ‘Amazon.’ In another table, for column V1 you also have an entity ‘Amazon.’ The algorithm will figure out that there is a relationship between these tables.
3. Gathering intelligence from queries – Some developers know the connection between different data sets from experience. Their query logs can be examined and then parsed to mark relationships.
After marking relationships, a Data Catalog builds lineage. A visual representation of data lineage helps to track data from its origin to its destination. It explains the different processes involved in the data flow. Hence, it enables the analyst to trace errors back to the root cause in the analytics.
Generally, ETL ( Extract, Transfer, Load) tools are used to extract data from source databases, transform and cleanse the data and load it into a target database. A data catalog parses these tools to create the lineage. Some of the ETL tools which can be parsed are –
- SQL Parsing
In a table/file data is arranged in a technical format and not in a way to make the most sense to a business user. So we need human collaboration on data assets so that they can be discovered, accessed and trusted by business users. Below are a few techniques by which we can arrange data for easy discovery –
- Organizing by an amount of usage
- Organizing by specific users usage
- Through automation – Sometimes when there is a large amount of data we can use advanced algorithms to organize data.
Making the catalog easily accessible
For high usage by data consumers, a data catalog should be easily accessible by the web app, mobile app, Android and IOS apps. It should have chat features.
Security is critical
Since a data catalog houses all the data, it’s critical that there are guardrails protecting the most sensitive files. A data catalog should have the following features to ensure that only the right eyes see the data –
- Role-based security
- Information that who accessed what data at what time
Estimated time to build a data catalog
The time it takes to build a data catalog depends on the number of databases to be cataloged.
Crawling and profiling can be done in a day or two. Uploading an existing data dictionary, building lineage and marking relationships for a database can be done in one week with the help of the database’s stakeholder. If an organization has to catalog ten databases, it may take four to five weeks. A big corporation can build its data catalog in about three months while a medium-sized company can build one in two to four weeks.
Data catalogs are the new dynamic and agile tools needed by today’s data-driven organizations. They serve as a single source of reference for all your data needs. Static metadata repositories requiring expert IT users and loads of manual curation are a thing of the past.