As the era of Industry 4.0, with its promises of predictive analytics, integrated business planning, and increased operational efficiencies, continues to heat up, Big Data is a topic on everyone’s mind. Yet, with so much discussion of the potential value to be leveraged from the growing quantities of data generated by all manner of sensors and devices, less attention is being paid to the all-too-necessary precursor of effective analytics—data quality.
In this domain, the age-old maxim “garbage in, garbage out” still reigns supreme. Even the most advanced machine learning algorithms are useless when fed poor quality data.
“Data quality is everything,” says Tom Redman, president at Data Quality Solutions. “The first thing is that if you’re using existing data to train a model and you don’t do a really good job cleaning it up, you’re going to get a bad model. Even if the model [you construct] is good, if you put bad data into it, you’re just going to get a bad result. If you stack these things up, it’s like a cascade, and the problem will quickly get out of control.”
So, how does one define what is or isn’t quality data? This is a challenging question because much of the answer depends on the particular problem you’re looking to solve. Generally speaking, the quality of data can be measured in accordance with four primary dimensions: accuracy, consistency, completeness, and timeliness.
If values gathered from across a network have accuracy, they properly reflect the information produced by each device. For example, if several devices within a single space are all reporting the ambient temperature in that area, data analysts should expect those values to be either identical or within a reasonable deviation of one another. Consistency is similar: when data is consistent, multiple events reported under similar conditions do not exhibit irreconcilable variances. Completeness, meanwhile, is attained when there are no substantial gaps in a time-series of reported events or values captured from sensors. Finally, data possesses timeliness when it can pass from its point of creation, through the various communication protocols and layers of integration, into a data management platform where it can be synchronized with data from other sources quickly enough to be acted on effectively.
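To make those dimensions concrete, here is a minimal sketch in Python of how each one might be checked for a handful of temperature readings. The sensor names, values, and thresholds are hypothetical illustrations, not any particular vendor’s rules:

```python
from datetime import datetime, timedelta, timezone
from statistics import median

# Hypothetical readings from temperature sensors that share a single space.
readings = [
    {"sensor": "A", "value": 21.4, "ts": datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)},
    {"sensor": "B", "value": 21.7, "ts": datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)},
    {"sensor": "C", "value": 28.9, "ts": datetime(2024, 5, 1, 11, 20, tzinfo=timezone.utc)},
]

# Accuracy and consistency: co-located sensors should agree to within a
# reasonable deviation (the 2.0-degree threshold is an arbitrary choice).
typical = median(r["value"] for r in readings)
outliers = [r["sensor"] for r in readings if abs(r["value"] - typical) > 2.0]

# Completeness: every sensor expected to report should actually appear.
expected = {"A", "B", "C", "D"}
missing = expected - {r["sensor"] for r in readings}

# Timeliness: readings older than a chosen window are too stale to act on.
now = datetime(2024, 5, 1, 12, 5, tzinfo=timezone.utc)
stale = [r["sensor"] for r in readings if now - r["ts"] > timedelta(minutes=30)]

print("Outlier sensors (accuracy/consistency):", outliers)
print("Missing sensors (completeness):", missing)
print("Stale sensors (timeliness):", stale)
```

In practice, the acceptable deviation, the roster of expected sensors, and the staleness window would all be dictated by the particular problem being solved.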
While such a whirlwind of metrics and criteria may seem convoluted, Redman says it can all be boiled down into two simple axioms. Data must be right, meaning that it is accurate, consistent, and complete. It must also be the right data, which entails not only meeting technical standards of quality, but also being unbiased and pertaining to the particular range of inputs for which one aims to develop a predictive model. Poorly calibrated equipment may be responsible for shortcomings in the former, but the latter is especially important because it calls on the insight and creativity of human analysts and their ability to communicate their needs to the operational technicians who create data further upstream.
Ensuring data quality from the outset
Redman’s approach to ensuring data quality differs from some others in that, while he acknowledges technology is important, he believes it is first and foremost a management concern. In his view, when communication between data creators and data users is made clearer, it becomes far easier to collect not only data that is right, but also the right data.
“One thing that you’ll notice is that no one ever really creates bad data if they’re going to use it themselves, but a lot of data is created the first time in one part of an organization and not used until somewhere downstream in another part of it. People go along blithely creating the data, and then the people who have to use it say, ‘Oh, this is no good,’ and so they have to clean it,” he says. “It never occurs to them that maybe they should figure out who’s creating the data and go down there and have a little chat about their requirements. The goal of data quality should be to get out of the cleaning business altogether.”
In other words, a conscious decision needs to be made to develop methods of communication among the various members of an organization so that the requirements for all data being generated can be clearly delineated. Redman sees it as management’s responsibility to impose these methods and, if necessary, to provide training for them as well.
And while Redman stresses that kinks in the communication pipeline should be fully sorted out before an organization rushes into more sophisticated technological approaches, investing in the right hardware and software is also important once management has put a strong workflow in place.
Increasing data cleaning efficiency
Given the strenuousness of a data janitor’s work, Redman’s stance isn’t surprising. According to Anil Datoo, vice president of data management at Emerson, around 70% of all data integration activities are spent validating, structuring, organizing, and cleaning data, a statistic that was echoed in an article on Big Data in The New York Times in 2014. With so much time committed to data cleaning, and little headway made in reducing it over the past half-decade, working to ensure that more data is in tip-top shape from its inception isn’t a bad strategy.
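To give a flavor of where that 70% goes, the sketch below shows the kind of validating, structuring, and cleaning pass such work typically involves. The column names, units, and rules are illustrative assumptions, not a description of any vendor’s actual pipeline:

```python
import pandas as pd

# Hypothetical raw sensor export with typical problems: stray whitespace,
# mixed units, duplicate rows, and an unparseable timestamp.
raw = pd.DataFrame({
    "sensor_id": [" A1", "A1", "B2", "C3"],
    "temp": ["21.4", "21.4", "70.1F", None],
    "timestamp": ["2024-05-01 12:00", "2024-05-01 12:00", "2024-05-01 12:01", "not a date"],
})

df = raw.copy()

# Structuring: normalize identifiers and parse timestamps, coercing failures to NaT.
df["sensor_id"] = df["sensor_id"].str.strip()
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")

# Cleaning: drop exact duplicates and rows whose timestamp could not be parsed.
df = df.drop_duplicates().dropna(subset=["timestamp"])

# Validating: convert Fahrenheit readings to Celsius and flag out-of-range values.
is_fahrenheit = df["temp"].astype(str).str.endswith("F")
df["temp"] = pd.to_numeric(df["temp"].astype(str).str.rstrip("F"), errors="coerce")
df.loc[is_fahrenheit, "temp"] = (df.loc[is_fahrenheit, "temp"] - 32) * 5 / 9
df["suspect"] = ~df["temp"].between(-40, 60)

print(df)
```

Even in this toy example, most of the code is devoted to repairing and standardizing the data rather than analyzing it, which is exactly the imbalance Redman argues better upstream communication can shrink.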