Preventing Bad Data Problems

For many managers, engineers and operations personnel focused on day-to-day production, high-level data management can seem like an abstraction reserved for data scientists, analytics teams or IT. But as more manufacturing operations and decisions come to depend on the flow of correct data, data management is fast becoming a concern at all levels of industrial organizations.

To help clarify this burgeoning reality, consider a manufacturing company where a system integrator inputs new devices into the data stream. “One single device has the potential to introduce bad data into the data stream or behaviors that can monopolize resources leading to service downtime and production stoppages,” said Michael Parisi, product marketing manager at HiveMQ, an MQTT platform provider. “This can cost the company a lot of time, money and resources to track down the faulty device and mitigate the problem. That’s why the HiveMQ MQTT Platform ensures messages are securely and reliably delivered from producers to consumers, while allowing customers to enforce data standards.”

To further the data management capabilities of its MQTT Platform, HiveMQ has released HiveMQ Data Hub, an integrated policy engine within the HiveMQ broker designed to enforce data integrity and quality. The HiveMQ Data Hub is designed to detect and manage distributed data and misbehaving MQTT clients with the ability to validate, standardize and manipulate data in motion.

“One of the primary challenges in IoT is data quality,” said Dominik Obermaier, co-founder and CTO of HiveMQ. “Overloading IT systems with bad data leads to poor decision-making and an erosion of trust for the business. With HiveMQ Data Hub, customers can create a data policy to validate data and eliminate bad information. The result is feeding cleaner, easier to manage, high-quality data into enterprise and cloud systems, which leads to better IoT applications and data-driven decision making.”

Key capabilities of the HiveMQ Data Hub include:

• Create a schema policy in JSON or Protobuf formats. “Schemas allow users to create the blueprint for how data is formatted and the relationship it has with other data systems,” said Parisi. “They are replicated across the complete HiveMQ cluster with JSON and Protobuf schema formats currently supported. An MQTT message is considered valid if it adheres to the schema policy provided and invalid if it does not conform to the schema outline provided. Schemas use declarative policies that help ensure pipeline issues are resolved early and at a high scale to deliver the right data to the right place and in the right format.”

• Define policy actions for data that fails validation. “Data policies define how the actual pipeline is handled in the broker, specifically schema validation,” explained Parisi. “They are the set of rules and guidelines that dictate how data and messages running through the broker should be expected by users. When data fails validation, policy actions define the steps that are taken next. Messages can be rerouted to another MQTT topic, forwarded, dropped or simply ignored. These policies allow you to quarantine data for further inspection, as well as provide reasons for validation failures, and define schema standards across teams. Data policies are crucial for maintaining decoupled pipelines between data producers and consumers and help streamline data across the organization, even bringing an added level of consistency that fosters reliability and ultimately higher data quality.”

• Store schema registries locally for faster access and data processing in a single system.

• Define behavioral policies to determine how devices work with the broker and log bad actors.

• Visualize the data in tools like Grafana through an API.
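
To make the schema and policy-action ideas above more concrete, here is a minimal Python sketch of the general pattern: check a JSON payload against a schema, then either pass it through, reroute it to a quarantine topic with the failure reasons attached, or drop it. It is not the HiveMQ Data Hub policy format or API. The schema layout, the `quarantine/invalid` topic name and the `validate` and `apply_policy` functions are all assumptions made for illustration.

```python
import json

# Hypothetical schema: field name -> required Python type. This is an
# illustrative stand-in, not the HiveMQ Data Hub schema or policy format.
TEMPERATURE_SCHEMA = {"device_id": str, "temperature_c": float}

QUARANTINE_TOPIC = "quarantine/invalid"  # assumed topic name


def validate(payload: bytes, schema: dict) -> list:
    """Return a list of validation failure reasons (empty if valid)."""
    try:
        message = json.loads(payload)
    except ValueError as exc:  # covers bad UTF-8 and malformed JSON
        return [f"payload is not valid JSON: {exc}"]
    if not isinstance(message, dict):
        return ["payload is not a JSON object"]
    reasons = []
    for field, expected_type in schema.items():
        if field not in message:
            reasons.append(f"missing field '{field}'")
        elif not isinstance(message[field], expected_type):
            reasons.append(f"field '{field}' is not {expected_type.__name__}")
    return reasons


def apply_policy(topic: str, payload: bytes, on_failure: str = "reroute"):
    """Return (topic, payload, reasons) to deliver, or None if dropped."""
    reasons = validate(payload, TEMPERATURE_SCHEMA)
    if not reasons:
        return topic, payload, []  # valid: deliver unchanged
    if on_failure == "reroute":
        # Quarantine the message and keep the reasons for inspection.
        return QUARANTINE_TOPIC, payload, reasons
    return None  # any other action in this sketch drops the message
```

In this sketch, a message whose `temperature_c` field arrives as a string would be sent to the quarantine topic along with the reason it failed, while a well-formed message continues to its original topic untouched.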

“Without Data Hub, data consumers have to process and validate messages on their own,” said Parisi, “and the margin for error is high as faulty clients can flood the system with bad data (or behaviors) that generate unnecessary network traffic and end up wasting computing resources. Not to mention, many clients don’t follow naming conventions or agreed-upon MQTT behaviors, which makes it difficult to identify and fix them.”
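
As a rough illustration of what catching a misbehaving client can involve, the Python sketch below counts each client's publishes inside a sliding time window and logs any client that exceeds a limit. It is a generic example, not how HiveMQ Data Hub implements behavioral policies. The `PublishRateMonitor` class, its parameters and the client ID used are hypothetical.

```python
from collections import defaultdict, deque


class PublishRateMonitor:
    """Flag clients that publish more than max_messages per time window."""

    def __init__(self, max_messages: int, window_seconds: float):
        self.max_messages = max_messages
        self.window_seconds = window_seconds
        self._timestamps = defaultdict(deque)  # client_id -> recent publish times
        self.flagged = []  # log of (client_id, timestamp) violations

    def record(self, client_id: str, now: float) -> bool:
        """Record one publish; return True if the client is within its limit."""
        stamps = self._timestamps[client_id]
        stamps.append(now)
        # Discard publishes that have aged out of the sliding window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        if len(stamps) > self.max_messages:
            self.flagged.append((client_id, now))
            return False
        return True


monitor = PublishRateMonitor(max_messages=3, window_seconds=1.0)
results = [monitor.record("sensor-a", t) for t in (0.0, 0.1, 0.2, 0.3)]
# The fourth publish inside one second exceeds the limit and is logged.
```

Keeping the log keyed by client ID is what makes the faulty device quick to find, which is the problem Parisi describes.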