Data Quality

Good-quality, clean data is a prerequisite to implementing business processes that meet expected business outcomes. Data quality is defined by how effectively the available data supports the business transactions and decisions needed to meet an organization’s strategic goals and objectives. The level of data quality required to effectively support operations will naturally vary by information system or business unit. For example, financial systems require a high degree of data quality because of how important the data is and how heavily it is used, whereas a marketing system may have the latitude to operate with a lower level of data quality without significantly compromising its strategic objectives. Since the “purpose” varies, so does the bar used to measure fitness for purpose.

While the degree of compliance required may vary across project environments, we feel that the basis for making these evaluations should stay consistent across the board. In this light, we propose a 6-point data quality assessment framework that can be used to objectively assess the quality of data and then devise appropriate remedial measures to bring it into compliance with quality benchmarks.

Why is a framework needed?

A framework helps to create business rules and checks that can be implemented in data quality software for automated profiling of datasets. This automation goes a long way towards reducing the time, effort, and error involved in generating clean datasets that pass business acceptability norms. In cases where profiling is done manually, a framework helps to develop robust project plans that accurately account for the time and effort involved in the all-important data cleaning phase of ETL/BI deliveries.

The 6 elements of a data quality assessment framework

At a technical level, data assessment must be conducted across the following key dimensions:

Valid data - This criterion refers largely to the physical format of a data element. Examples of invalid data include numeric values in customer name fields, emails without the @ sign, invalid date formats, currency values containing letters, and so on. Assessing the validity of the various data elements requires constructing business rules for the possible values/formats and then running the data through them, as in the sketch below.
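As an illustration only, a minimal set of validity rules might look like the following Python/pandas sketch. The customer_name, email, and signup_date columns and the rules themselves are hypothetical; in practice they would come from the business.

```python
import re
import pandas as pd

# Hypothetical customer dataset used purely for illustration.
customers = pd.DataFrame({
    "customer_name": ["Alice Smith", "B0b 99", "Carol Jones"],
    "email": ["alice@example.com", "bob-at-example.com", "carol@example.com"],
    "signup_date": ["2023-01-15", "2023-02-30", "2023-03-01"],
})

EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Rule 1: customer names should not contain digits.
invalid_names = customers["customer_name"].str.contains(r"\d")

# Rule 2: emails must match a basic address pattern (must include an @ sign).
invalid_emails = ~customers["email"].apply(lambda e: bool(EMAIL_PATTERN.match(e)))

# Rule 3: dates must parse in the expected YYYY-MM-DD format.
parsed_dates = pd.to_datetime(customers["signup_date"], format="%Y-%m-%d", errors="coerce")
invalid_dates = parsed_dates.isna()

# Flag every record that violates at least one validity rule.
customers["is_valid"] = ~(invalid_names | invalid_emails | invalid_dates)
print(customers[~customers["is_valid"]])
```

The same pattern generalizes: each format rule becomes one boolean check, and the record fails validity if any rule is violated.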

Unique data - In many cases, data elements need to be unique. E.g. it is not normally possible to have 2 customer records with the same email or social security number. A uniqueness test is typically conducted by identifying duplicate data elements.
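A uniqueness check can usually be reduced to a duplicate scan over the key columns. The sketch below assumes a hypothetical customer table in which email is treated as the business key.

```python
import pandas as pd

# Hypothetical customer records; "email" is assumed to be the business key.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "email": ["alice@example.com", "bob@example.com", "alice@example.com", "dana@example.com"],
})

# keep=False marks every row that shares an email with another row,
# so both members of a duplicate pair are surfaced for review.
duplicate_mask = customers.duplicated(subset=["email"], keep=False)
duplicates = customers[duplicate_mask]

print(f"{duplicate_mask.sum()} records fail the uniqueness check")
print(duplicates.sort_values("email"))
```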

Complete data - This test checks whether a) a particular data element always has some value, or b) has a value that depends on the value of another data element. E.g. a delivery order cannot be missing line items, and a mandatory question that requires a y/n answer cannot have a null value. Automating checks for data completeness requires crafting business rules for both conditions a) and b), as sketched below.
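Both flavours of completeness can be expressed as simple rules. This sketch uses a hypothetical orders table: line_items must always be present (condition a), and answer must be present whenever question_mandatory is "y" (condition b).

```python
import pandas as pd

# Hypothetical delivery orders; column names are illustrative only.
orders = pd.DataFrame({
    "order_id": [101, 102, 103],
    "line_items": [3, None, 5],
    "question_mandatory": ["y", "n", "y"],
    "answer": ["yes", None, None],
})

# Condition a): line_items must always hold a value.
missing_line_items = orders["line_items"].isna()

# Condition b): answer is required only when the question is mandatory.
missing_mandatory_answer = (orders["question_mandatory"] == "y") & orders["answer"].isna()

incomplete = orders[missing_line_items | missing_mandatory_answer]
print(incomplete[["order_id"]])  # orders 102 and 103 fail completeness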

Precise data - This condition requires that a data element be free from variation and contradiction across datasets. For example, the time of a particular call in a customer call center record cannot differ between reports, and a purchase order cannot have date X in the orders table and date Y in the delivery or invoice tables. Automating these checks requires ensuring consistency of related data elements’ attributes and dimensions across datasets.
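One way to automate such a check is to join the related datasets on their shared key and compare the attribute in question. The sketch below uses hypothetical orders and invoices extracts joined on order_id.

```python
import pandas as pd

# Hypothetical orders and invoices extracts sharing order_id as the key.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "order_date": ["2023-05-01", "2023-05-02", "2023-05-03"],
})
invoices = pd.DataFrame({
    "order_id": [1, 2, 3],
    "order_date": ["2023-05-01", "2023-05-09", "2023-05-03"],
})

# Join on the shared key and compare the attribute across the two sources.
merged = orders.merge(invoices, on="order_id", suffixes=("_orders", "_invoices"))
contradictions = merged[merged["order_date_orders"] != merged["order_date_invoices"]]

print(contradictions)  # order 2 carries different dates in the two tables
```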

Timely data - This condition requires that data elements represent the most current information resulting from a set of business events. For example, a customer record must not hold address information that is no longer valid.
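Timeliness is harder to verify from the data alone, but a common proxy is to compare a last-updated timestamp against an agreed freshness window. The sketch below assumes a hypothetical address_updated_at column and an arbitrary 365-day threshold; the actual window is a business decision.

```python
import pandas as pd

# Hypothetical customer addresses with a last-updated timestamp.
addresses = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "address_updated_at": ["2024-11-01", "2019-03-15", "2025-01-20"],
})

# The freshness threshold is set by the business; 365 days is only an example.
MAX_AGE = pd.Timedelta(days=365)
updated = pd.to_datetime(addresses["address_updated_at"])

# Records whose address has not been confirmed within the window.
stale = addresses[pd.Timestamp.now() - updated > MAX_AGE]
print(stale)
```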

Consistent data - This condition requires that data elements be used only for their intended purpose across datasets. For example, the same product code may not be used for multiple product types, and a campaign code used to identify a lead generation campaign in one table may not be used for a branding campaign in another.
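A consistency rule of this kind can be checked by verifying that each code maps to exactly one meaning across the combined datasets. The sketch below uses hypothetical product extracts from two source systems.

```python
import pandas as pd

# Hypothetical product extracts from two source systems.
products_a = pd.DataFrame({
    "product_code": ["P100", "P200"],
    "product_type": ["laptop", "monitor"],
})
products_b = pd.DataFrame({
    "product_code": ["P100", "P300"],
    "product_type": ["tablet", "keyboard"],
})

# Combine both sources and count the distinct types each code is used for.
combined = pd.concat([products_a, products_b], ignore_index=True)
types_per_code = combined.groupby("product_code")["product_type"].nunique()

# Any code mapped to more than one product type violates the consistency rule.
inconsistent_codes = types_per_code[types_per_code > 1]
print(inconsistent_codes)  # P100 appears as both "laptop" and "tablet"
```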

Summary

Having a clear view of the common dimensions along which data quality assessments must be made helps companies at both strategic and tactical levels. At a strategic level, the data quality gaps identified across these dimensions feed into significantly streamlined data management practices, which in turn drive higher-level changes around people, processes, and technologies. At a tactical level, companies can quickly put together profiling rules and automation that drive higher implementation efficiency in overall ETL delivery projects.

About this sample

Ensuring acceptable data quality is a prerequisite to any ETL deployment. Without it, data integration projects result in largely wasted investments, given the discrepancy between the actual data and the data the business expects. This article provides a conceptual framework for building automated data quality checks that can save deployment teams significant amounts of time and resources.

About the Client

UK-based specialist ETL and ELT consultancy with technology expertise covering Talend, Informatica, and Kettle tools.

Article purpose

Demonstrate subject matter expertise in automating data quality assessments by building data processing rules based on high-level patterns.