Data Quality Evolves

August 16, 2005

By Fern Halper and Marcia Kaufman

Data quality is an integral part of any company’s information infrastructure, and poor data quality therefore impairs the organization in a number of ways. It affects operational efficiencies, decision making, and reporting – to name only a few of the more obvious areas. And data quality is at the heart of new regulatory compliance mandates.

Definitions of data quality put forth by academicians, government agencies, industry practitioners, and consortiums are evolving along multiple fronts. This evolution is driven by changes in the complexity of a company’s information infrastructure. Traditional measures of data quality include accuracy/reliability, completeness, consistency, timeliness, reasonableness, and validity. Today, metrics such as interpretability, understandability, usability, and accessibility are being added to the mix.
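As a rough illustration of how some of these traditional metrics can be quantified, the short Python sketch below computes simple completeness, validity, and timeliness scores for a hypothetical customer table. The column names, email pattern, and 90-day freshness threshold are assumptions made for this example, not measures prescribed by any particular methodology or product.

    import pandas as pd

    # Hypothetical customer records; the columns and rules below are illustrative only.
    customers = pd.DataFrame({
        "customer_id": [101, 102, 103, 104],
        "email": ["a@example.com", None, "not-an-email", "d@example.com"],
        "last_updated": ["2005-08-01", "2005-07-15", "2004-12-31", "2005-08-10"],
    })

    # Completeness: share of non-missing values in each column.
    completeness = customers.notna().mean()

    # Validity: share of email values that match a simple pattern.
    validity = customers["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False).mean()

    # Timeliness: share of records updated within 90 days of a reference date.
    reference = pd.Timestamp("2005-08-16")
    age_days = (reference - pd.to_datetime(customers["last_updated"])).dt.days
    timeliness = (age_days <= 90).mean()

    print("Completeness by column:\n", completeness)
    print("Email validity:", validity)
    print("Timeliness:", timeliness)

In practice, scores like these would be tracked per data source and over time rather than computed on a single snapshot.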

New Metrics Speak to Changes in Information Infrastructure

These newer metrics tend to be more user-focused and speak to the changes in a company’s information infrastructure, which include:

    •    More data sources
    •    More movement of data
    •    More integration
    •    More transformations
    •    The need to link data from disparate sources
    •    More sophisticated ways of interacting with data and information

They also speak to the notion of data in context. Data from multiple sources needs to be linked together so that the resulting “information” is relevant, usable, and understandable.
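As a simple sketch of what linking data for context can look like, the Python fragment below joins customer records from two hypothetical source systems on a normalized key and flags records that appear in only one source. The source names, fields, and normalization rule are illustrative assumptions, not a reference to any specific product.

    import pandas as pd

    # Hypothetical extracts from two systems brought together after an acquisition.
    crm = pd.DataFrame({
        "cust_name": ["ACME Corp.", "Globex, Inc."],
        "region": ["East", "West"],
    })
    billing = pd.DataFrame({
        "customer": ["acme corp", "globex inc"],
        "open_balance": [12500.00, 0.00],
    })

    # Normalize the join key so the same customer lines up across sources.
    def normalize(name: str) -> str:
        return "".join(ch for ch in name.lower() if ch.isalnum() or ch == " ").strip()

    crm["key"] = crm["cust_name"].map(normalize)
    billing["key"] = billing["customer"].map(normalize)

    # An outer join with an indicator column exposes records present in only one source.
    linked = crm.merge(billing, on="key", how="outer", indicator=True)
    print(linked[["cust_name", "region", "open_balance", "_merge"]])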

An Example

Consider, for example, a scenario in which a business that has just undergone a major acquisition must bring information together from multiple data sources for use in an executive dashboard. The dashboard may include financial and accounting information as well as sales and HR information. There may be ten or more data sources feeding this dashboard. Some of the data may appear in more than one data source.

In addition to the traditional concerns about data quality – accuracy of the data feeding the dashboard, completeness of the data, timeliness of the data, and so on – emerging issues such as usability and understandability are now part of the data quality equation. Regarding these latter concerns, if the executive using the dashboard does not share her colleagues’ understanding of how the financial terms in the dashboard are derived or calculated, or of how that data is being used, this undermines the interpretability and, ultimately, the quality of the data and information presented.

Even traditional metrics such as consistency are taking on new meaning. Consistency was previously thought of primarily as consistency of format (e.g., within a database). In a more complex environment, where data may be transferred, transformed, and aggregated, consistency also refers to the ability to understand what has happened to the data and to verify, step by step, that it has maintained its integrity through the transformation and transfer process.
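One way to make this broader notion of consistency concrete is a reconciliation check that compares simple control totals before and after data is moved or transformed. The Python sketch below uses a row count, an amount total, and an order-insensitive checksum of the record keys; the specific totals and tolerance are assumptions chosen for illustration, not a prescribed method.

    import hashlib

    # Hypothetical (key, amount) rows before an ETL step and the rows that arrived after it.
    source_rows = [("101", "2500.00"), ("102", "130.50"), ("103", "0.00")]
    loaded_rows = [("101", "2500.00"), ("102", "130.50"), ("103", "0.00")]

    def control_totals(rows):
        """Return a row count, an amount total, and an order-insensitive checksum of the keys."""
        count = len(rows)
        amount_total = sum(float(amount) for _, amount in rows)
        key_digest = hashlib.sha256("".join(sorted(key for key, _ in rows)).encode()).hexdigest()
        return count, amount_total, key_digest

    before = control_totals(source_rows)
    after = control_totals(loaded_rows)

    # If any control total drifts, the data did not survive the transfer or transformation intact.
    print("Counts match:  ", before[0] == after[0])
    print("Amounts match: ", abs(before[1] - after[1]) < 0.01)
    print("Keys match:    ", before[2] == after[2])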

The Next Generation of Data Quality

Vendors, such as IBM, are laying the foundation for the next generation of data quality. IBM’s goal is to simplify information integration and to deliver accurate, consistent, timely, and coherent information to the business. From a technology standpoint, IBM is now looking at both structured and unstructured data and dealing with such issues as consistency and understandability.

IBM has recently announced the beta release of a number of products that leverage a new underlying metadata infrastructure and common services. First and foremost, the new software aims to make it easier to integrate data, but it will also help identify key measures such as understandability and consistency. With this release, IBM combines technology acquired from Ascential Software with its existing information integration portfolio. The software includes:

    •    Rational Data Architect. This product helps data architects model, visualize, and relate data across multiple information sources. It combines traditional data modeling with metadata discovery and analysis. It can also create an abstracted view of data and deploy it directly to WebSphere Information Analyzer, and it can use the information in the WebSphere Business Glossary (see below).
    •    WebSphere Business Glossary. This web-based application provides different categories of users (business users, analysts, data stewards) with editable business and technical definitions of data.
    •    WebSphere Information Analyzer. This product helps business users analyze, profile, and audit data. It includes quality rules, such as completeness and validation checks, as well as data rules and metrics. It shares a central repository with WebSphere DataStage and WebSphere QualityStage and automates metadata sharing across those products.

These three products – with their underlying common metadata repository – help companies build, integrate, and monitor well-defined data assets. Data integrated from multiple sources needs to be complete and accurate, but it also needs to be consistent and understandable to support enterprise-wide information requirements. The new release incorporates a metadata infrastructure that provides coordinated and consistent information about the data used by all of the products. The result should be more understandable information.
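To make the idea of a shared metadata layer more concrete, the fragment below sketches what a single, simplified metadata record might carry: a business definition alongside technical lineage and the quality rules applied. The structure is purely illustrative and is not intended to represent IBM’s actual repository model or APIs.

    from dataclasses import dataclass, field

    # Purely illustrative metadata record; this is not IBM's repository schema.
    @dataclass
    class DataAssetMetadata:
        name: str                    # technical name of the data element
        business_definition: str     # glossary definition shared with business users
        source_systems: list = field(default_factory=list)   # where the data originates
        transformations: list = field(default_factory=list)  # what has been done to it along the way
        quality_checks: list = field(default_factory=list)   # rules applied (completeness, validity, ...)

    revenue = DataAssetMetadata(
        name="NET_REVENUE",
        business_definition="Gross revenue less returns and discounts, reported monthly.",
        source_systems=["ERP ledger", "acquired-company billing system"],
        transformations=["currency conversion to USD", "monthly aggregation"],
        quality_checks=["completeness >= 99%", "reconciles to general-ledger control totals"],
    )

    # The same record can answer both technical questions (lineage) and business questions
    # (what does this measure mean?), which is what makes the data understandable.
    print(revenue.business_definition)
    print(revenue.transformations)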

What Does All of This Mean for Companies?

The reality is that companies are still struggling with some of the traditional measures, and taking the next step in data quality is still in the early-adopter stage. However, as companies continue to deal with challenges – mergers and acquisitions that require bringing disparate data sources together, consolidation of the information infrastructure to stay competitive, and compliance mandates in general – they will have little choice but to address both the traditional and the next-generation data quality measures. Taking this next step will require time, and some companies may regard it as yet another rude awakening, but it is a step companies must take in order to use data from multiple sources for decision making and other important corporate activities.

 
