One of the biggest benefits of big data analysis is the ability to find answers to new types of questions and then quickly take action to improve business results. For example, a retailer can leverage social media data to analyze customer reaction to a marketing campaign and make immediate changes rather than waiting several weeks to review results from traditional marketing data sources. When big data sources such as Twitter feeds, audio files, or emails are analyzed, the patterns that emerge can provide important business value for companies.
Increasingly, leaders understand there is enormous value in integrating the results of big data analysis with traditional systems of record. But this value comes with an overriding concern: what is the quality of these big data sources? And at what point do you need to worry about the quality of big data? It is well understood that big data sources are generally not clean and have not been vetted for data quality. Large volumes of high-velocity data are valued for their immediacy and for the potential patterns that can be identified through analytics. But at some point before making a business decision, you need to make sure that the data you are analyzing is correct and that the correlations between data sources make sense. The context has to be right to ensure that you are making informed decisions from that data.
What does it mean to understand the context of big data? How can you take the results of a big data analysis and understand where that data came from and whether it has the same meaning as data elements managed in your systems of record? Building confidence in big data is just beginning to get a lot of attention from companies that want to ensure that their big data pilot projects can be successfully integrated into company-wide decision-making processes. Software vendors focusing on big data and data quality are actively working on new solutions to support their customers in this area.
I was recently at an IBM big data customer meeting where executives were discussing a product that will likely be introduced in the fourth quarter of this year. This Big Data Catalog, part of the IBM InfoSphere Information Integration and Governance (IIG) portfolio, is intended to help companies ensure that big data sources can be more easily navigated and can be trusted. In essence, the product will ingest data from these sources and discover the lineage of the data: where it came from and how it was changed over time. A data analyst will therefore be able to determine whether the data comes from a legitimate source and to trust that data selected from the Big Data Catalog has not been manipulated to skew results. For example, a social media site may have comments about a new product introduction. The Big Data Catalog will include both the original stream and the trusted stream, which has been cleansed of misleading data such as a series of automated comments submitted by a competitor to make it look like customers are complaining about quality.
The Big Data Catalog will initially be aimed at the data scientist, who is responsible for identifying which raw data sources should be ingested and for analyzing that data to identify patterns or trends. Releases next year are intended to provide support for business analysts, who will be able to search the Big Data Catalog for big data sources that have been vetted according to company policies and rules. Ultimately, if organizations are going to improve business outcomes based on the analysis of big data sources, it will become increasingly important to understand the context of those sources. Having access to a big data catalog will enable users to select data sources with an understanding of the origin, lineage, and potential value of that data.