Best Practices for Improving Data Quality

Written by Sharad Varshney | Mar 25, 2021 7:12:06 AM

If you’ve read our other data quality blogs in this series, you’ll understand the importance of high-quality data for accurate decision making, compliance, and more. In this article, we explain the best practices you should follow to ensure data quality at the source.

Data is often described as the new oil, but ensuring data quality is better compared to growing a delicate fruit. Just as with a fruit tree, data quality needs to be ensured at the source. This is why data quality issues can't be remedied in a data warehouse.

Data scientists spend the majority of their time cleaning data sets that have been neglected at this crucial stage. Not only is this a waste of precious time, but it also creates another problem too.

When data is cleaned later on many assumptions are made that may distort the outcome. Yet, data scientists have no choice but to make these assumptions. This is why data governance is so important for data quality improvement.

When it comes to quality, data is not like oil but is like a fruit. Its quality needs to be nurtured at the source and also when it is in transit. Data quality cannot be fixed in the data warehouse.

The trouble with independent users is that they tend to focus their energy on areas they are most affected by. For example, a project manager might be more concerned with inefficiencies in the IT asset management process, while a CFO might present a report to the board, or shareholders, and find an important piece of data is missing.

It is important to think of Data Quality Management as a process including various methodologies, frameworks, and tools working together to improve data quality.

In this blog, we will not only present a strategic approach to improving data quality but a complete execution strategy too.

Are you struggling to fully understand how data flows from one place to another within your cloud-based or on-prem infrastructure? Download our Whitepaper How to Build Data Lineage to Improve Quality and Enhance Trust

Why does data quality suffer?

The quality of data can be determined using several interconnected parameters. These parameters include the consistency of the data, its timeliness or relevance, accuracy, and completeness.

Data quality improvement is the process of enhancing the accuracy, completeness, timeliness, and consistency of data to ensure that it is fit for its intended use.

There are two key reasons for bad quality data. The first is related to source systems and the second occurs at the analysis phase.

Source systems

When organizations collect data with no proper controls or standardization processes in place issues can arise. These issues occur in four core areas:

During capture: Data capture is an important part of the quality control process. This initial step can set the course for a bad quality data set.
For example, if a telephone number is entered incorrectly at this stage, later along the data journey, this information could conflict with records in other systems making it very difficult to confirm the customer’s identity.
During transformation: As data passes from user to user and system to system, it is transformed. For example, when a process isn’t documented correctly, it’s impossible to track the lineage of this data efficiently, and as a result, the quality of the data suffers.
Imagine a scenario where an accounting record passes from one staff member in the finance department to another. If the first staff member fails to update the record before transferring it, they could inadvertently enable a customer to skip a due payment.
Due to timeliness: Even if the data capture stage produces high-quality data, over time, it may diminish. For example, someone might provide the correct address or job title when the data is captured, but if the same individual changes their job or address these fields must be updated.
Due to inconsistent processes and standards: This occurs when you capture data from different systems using different standards. For example, when you capture a unit of measurement in one system, you might be using codes like EA or LB. In another system, different standards might be used, like EACH or POUND.

Let’s take the country code analogy to explain some of these issues in greater detail. Many systems require users to enter a country code in order to complete registration documents, make bookings, and more. In some cases, users are required to enter these codes manually instead of selecting an option from a pre-established list.

The trouble is, there is no guarantee that each user will enter the same information. In fact, it’s almost impossible. When you ask people to type this information independently, you will inadvertently create many codes for the same country, and the system will be full of conflicting data points.

User one:	USA
User two:	US
User three:	UNITED STATES

Analysis phase

Data quality can be impacted during the analysis phase for many reasons. For example, fields could be mapped incorrectly or users could make the wrong assumptions based on the data.

This lack of coherence and absence of standards also affects digital transformation. When companies are merged—bad data quality makes these mergers difficult. When there are no standards or common problems defined, data quality becomes a big issue.

When data quality isn’t perfect, it becomes untrustworthy, making it difficult to convince employees to use it for data-driven initiatives.

Related: Building an Effective Data Governance Framework

Why do you need an independent data quality manager?

As we mentioned at the start of this blog, data quality is a core outcome of a data governance initiative. As a result, a key concern for data governance teams, groups, and departments is to improve the overall quality of data. But there is a problem: coordination.

If you talk to different people from different departments about data quality you will always get different responses. For example, if you ask an ETL developer how they measure data quality, they will probably rely on a certain set of parameters or rules that ensure that the data they enter is up to scratch.

If the quality at the source is bad, they are unlikely to flag it up, or even see it as their concern. Alternatively, if you talk to someone who deals with a CRM system, their focus will be on the consistency of data because they are unable to match conflicting terms in the system. In short, every individual sees data quality from a different perspective.

As most data quality problems occur because of issues with integrations and data transformation across multiple applications, it’s important to have an independent data quality manager, or data governance manager, to take charge of improving data quality across an organization.

Because there are so many conflicting opinions, you need an independent body to mediate and implement data quality improvement efforts company-wide, without bias, and based on a hierarchy of importance. This body can be a data governance manager or group.

To work efficiently on fixing data quality issues requires prioritization. These issues should be prioritized based on parameters like business impact, prevalence, and the complexity of data quality issues.

How to improve Data Quality

Everyone's individual data quality problem is highly important to that individual. However, to avoid getting lost in the sea of issues, you need to prioritize. Data quality issues should be prioritized based on parameters like business impact, prevalence, and the complexity of data quality issues. This then enables us to work on fixing those issues efficiently.

The following is a tried and tested strategy for improving data quality: the data quality improvement lifecycle.

Identify

The first step in tackling data quality issues is establishing a central hub where problems can be reported and tracked. Providing a platform and necessary tools is crucial in this endeavor. By capturing the context and understanding the potential business ramifications of not solving the problem, the team can prioritize and tackle the issues with a sense of urgency and purpose.

Related: Data Governance and Data Quality: Working Together

When it comes to identifying data quality problems, several approaches should be taken. Some of these include:

Business User Reporting: By providing a system where business users can easily report problems with context, teams can stay on top of issues as they arise. This is particularly important when business users are working with data and making decisions based on that data.
Data Engineer Collaboration: Another way to identify problems is through collaboration among data engineers and analysts. By looking at data at every stage of the pipeline, these professionals can report issues and have access to the tools they need to easily query the data and report problems.
Anomaly Detection: Automated tools can monitor data for anomalies, such as a sudden spike in null values. These tools can then report the problem, making it easier for teams to stay on top of data quality issues.
AI-based Tools: For critical data elements, AI-based tools can be used to identify problems. For example, by pointing an AI tool at master data, such as customer data, teams can detect data quality issues that may not be obvious otherwise. While these tools can identify problems, they may not provide context.
Log Observation: The technique of data observability, which came from the DevOps field, is gaining popularity. By observing log files of various ETL pipelines, teams can identify and report problems in the pipeline to a central depository.

Overall, it's important to have a combination of these methods to identify and report data quality problems. The more ways to detect and report issues, the more comprehensive and effective the data governance will be.

Once you create widespread data literacy within an organization, you can then put in place a reporting mechanism where users can go and communicate their data quality issues

Prioritize

The next step is to develop a mechanism that helps you understand the business impact of these data quality issues. This is the most important task data governance/qaulity managers are required to do. They must consider the following in their evaluation:

Business value
Primary root cause analysis
Approximate effort to fix the problem
Change management

This process enables the governance team to prioritize the issues efficiently. This prioritization process usually creates a bottleneck as it can be difficult to make a unanimous decision.

When a data governance/quality manager is unable to prioritize, s/he must take these prioritization to the regular data governance committee for guidance. They will weigh up the problem based on many factors including cost/benefit ratio and business impact.

When critical data quality decisions are made, a change to the business process is required. This essentially results in extra work and expenditure, so it needs to be decided at a cross-departmental, impartial, committee level.

Analyze the Root Cause

Once issues are identified and prioritized, the person responsible for approving and fixing the problem needs to conduct a further root cause analysis. This process involves asking questions, such as where each individual problem stems from. What is the real cause of the problem? There must be an internal discussion about who is best placed to conduct this analysis and whether this person is from the Data or Business team.

Generally, self-service tools for all the data quality team members are very helpful in this process. These tools are Lineage and Impact Analysis, Querying tool and Collaboration tools. Once the root cause is identified, analyst must report the root cause in the central issue tracking system and should alert all the stakeholders associated with the data assets or data objects.

Related: Data Literacy: What it is and How it Benefits Your Business

Improve

When it comes to solving data quality problems, the solution will depend on the root cause of the issue. In some cases, a temporary fix may be all that's needed, such as cleaning data. However, in other cases, a more permanent solution is required. These permanent solutions include:

Code Change: This involves making changes to the code causing the problem. Changing code can be a complex process that requires technical expertise, but it can be a powerful way to fix issues.
Business Process Change: In some cases, the problem may be rooted in how the business operates. Changing the business process can be an effective way to address these issues, but it can also be a major undertaking. For example, as you are collecting vendor data via a form, adding a layer of approval in the Vendor registration program may reduce the number of duplicate vendors.
Implementation of an MDM program: A Master Data Management (MDM) program is an effective way to ensure that data is accurate, consistent, and complete. This solution is mostly achieved for master data like Customer, Vendor, inventory, etc. For this, you make a central API-based repository of master data, and then every business process calls the master data through the API. A complete project and budget are required to achieve this.
Reference Data Management Program: Certain data elements do not change often, but they do change. For that, you need to have a Reference Data Management (RDM) program. For example, failure codes can be limited, but when you add a failure code to one system, you need to ensure that all the systems have this failure code and its description. A reference data management program can help to ensure that data is accurate and consistent by providing a centralized repository for reference data.
Third-Party Data Quality Error Mitigation Program: Third-party data is one of the major sources of data quality problems. As you get the data from a third-party, their operation is not under your control. In this case, you need a clear specification and communication plan. What is the expectation of data quality and how the third-party provider can work with you to solve this problem.
Controlled solution to known problems: This includes having a pre-defined process to solve known problems like using a data dictionary, data lineage, data mapping, etc.

Remember, a temporary fix is the cleaning of data. This can help to improve the quality of the data in the short term, but it may not address the root cause of the problem.

Beyond this, new data technologies are enabling companies to onboard data products that allow various teams to use standardized definitions and avoid the confusion that can lead to data quality issues.

Related: Data Observability: What it is and Why it is important for your Business

Control

The final step in the process is control so that the identified issue occurred again, right people knows how to fix it. This is generally done by writing a set of data quality rules, problems are automatically identified. These rules will ensure that if this issue arises again, a notification or ticket is created to address the problem.

A notification makes it much easier to deal with the problem quickly rather than having to consult multiple people and conduct complex analyses. While control can be used to make temprary improvements, data quality rules can also be incorporated into a permanent solution.

At the heart of the entire data quality improvement lifecycle is governance. Every stage requires an effective governance process to ensure quality issues don’t return. OvalEdge is a comprehensive data governance platform that provides any organization with tools it needs to maximize high-quality data assets.

View full post