HiveMQ Data Hub: Enhance the Value of IoT Data

Written by Michael Parisi

Category: HiveMQ Data Hub, MQTT

Published: September 26, 2023


Today we released HiveMQ Data Hub, an integrated data policy engine within the HiveMQ MQTT Platform. Data Hub sits on top of the HiveMQ MQTT broker and acts as a one-stop-shop for defining and enforcing standards on how MQTT data is validated and transformed across a deployment.

Data is the cornerstone of modern business initiatives, ultimately enabling better, data-driven decision making. Organizations must have a well-defined data strategy in place to ensure the data coming in is complete and actionable; however, harnessing the power of all of that data is easier said than done. Data Hub uses declarative and behavioral policies to increase organizations’ overall data quality and operational efficiency, while simultaneously reducing system and production costs.

Improve Data Hygiene

The MQTT protocol is data-agnostic: clients send and receive data whether it is valid or not, and validation logic takes time and resources to implement. The data running through MQTT platforms includes critical and time-sensitive information, and misalignments (e.g. bad data being sent downstream, messages being sent too rapidly) happen often, especially when a variety of third-party systems and devices are integrated into the data stream. It is therefore imperative to keep bad data from flowing downstream.

At the core of Data Hub is the policy engine, designed for organizations that process large amounts of data and are critically impacted by bad data. Using the APIs provided by Data Hub, companies can streamline data definitions across the business and offer a single source of data truth, all within the same platform as the HiveMQ broker. Everything is managed in a single system, allowing data to be processed faster and eliminating the need to operate another standalone system. HiveMQ Data Hub includes the following features:

  • Data schemas: Schemas are the blueprint for how data is formatted and how it relates to other data systems. They are replicated across the complete HiveMQ cluster, and both JSON and Protobuf schema formats are currently supported, with more planned (Avro, etc.). An MQTT message is considered valid if it adheres to the provided schema and invalid if it does not. Schemas use declarative policies that help ensure pipeline issues are resolved early and at scale, delivering the right data to the right place in the right format. The schema example below describes GPS coordinates.
{
   "$id": "https://example.com/geographical-location.schema.json",
   "$schema": "https://json-schema.org/draft/2020-12/schema",
   "title": "Longitude and Latitude Values",
   "description": "A geographical coordinate.",
   "required": [ "latitude", "longitude" ],
   "type": "object",
   "properties": {
      "latitude": {
         "type": "number",
         "minimum": -90,
         "maximum": 90
      },
      "longitude": {
         "type": "number",
         "minimum": -180,
         "maximum": 180
      }
   }
}
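To make the schema concrete, here is a small local check of a valid and an invalid coordinate payload, written in Python with the jsonschema package. This is only an illustration of the check Data Hub applies inside the broker, not how Data Hub itself is implemented, and the sample payload values are made up.

# Illustration only: reproduces the broker-side schema check locally with the
# jsonschema package so you can see which payloads pass and which fail.
from jsonschema import ValidationError, validate

coordinates_schema = {
    "type": "object",
    "required": ["latitude", "longitude"],
    "properties": {
        "latitude": {"type": "number", "minimum": -90, "maximum": 90},
        "longitude": {"type": "number", "minimum": -180, "maximum": 180},
    },
}

valid_payload = {"latitude": 48.137, "longitude": 11.575}   # within range
invalid_payload = {"latitude": 123.4}                       # out of range, longitude missing

for payload in (valid_payload, invalid_payload):
    try:
        validate(instance=payload, schema=coordinates_schema)
        print("valid:", payload)
    except ValidationError as error:
        print("invalid:", payload, "->", error.message)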
  • Data policies: Data policies define how the actual pipeline is handled in the broker, specifically schema validation. They are the set of rules and guidelines that dictate what the data and messages running through the broker are expected to look like. When data fails validation, policy actions define the steps that are taken next: messages can be rerouted to another MQTT topic, forwarded, dropped, or simply ignored. These policies allow you to quarantine data for further inspection, provide reasons for validation failures, and define schema standards across teams. Data policies are crucial for maintaining decoupled pipelines between data producers and consumers, and they help streamline data across the organization, adding a level of consistency that fosters reliability and ultimately higher data quality.

    Breaking it down a bit further, a policy consists of three components:

    • Matching: Policies match incoming MQTT messages against specific criteria via a hierarchically aligned topic filter.

    • Validations: A set of validations is executed for each matching incoming MQTT message.

    • Actions: Each validation has two possible outcomes, and for each outcome further actions can be taken, e.g., logging a message, incrementing a metric, or even re-routing the MQTT message.

      The policy example below validates GPS coordinates against the schema shown above.

{
  "id":"location-policy",
  "matching":{
    "topicFilter":"location"
  },
  "validation":{
    "validators":[
      {
        "type":"schema",
        "arguments":{
          "strategy":"ANY_OF",
          "schemas":[
            {
              "schemaId":"coordinates",
              "version":"latest"
            }
          ]
        }
      }
    ]
  },
  "onFailure":{
    "pipeline":[
      {
        "id":"logFailure",
        "functionId":"System.log",
        "arguments":{
          "level":"WARN",
          "message":"The client ${clientId} attempted to publish invalid coordinate data: ${validationResult}"
        }
      }
    ]
  }
}
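Both the schema and the policy are registered with the broker through the Data Hub REST API. The sketch below does this in Python with the requests package; the broker URL, port, and file name are assumptions, and the endpoint paths and field names (schemas are uploaded with a Base64-encoded definition) should be verified against the Data Hub documentation for your HiveMQ version.

# Hedged sketch: register the GPS schema and the location policy via the
# Data Hub REST API. Verify endpoint paths and payload fields against the
# documentation for your HiveMQ version before relying on them.
import base64
import json
import requests

DATA_HUB_API = "http://localhost:8888/api/v1/data-hub"  # assumption: default REST API address

gps_schema = {
    "type": "object",
    "required": ["latitude", "longitude"],
    "properties": {
        "latitude": {"type": "number", "minimum": -90, "maximum": 90},
        "longitude": {"type": "number", "minimum": -180, "maximum": 180},
    },
}

# The schema definition is uploaded Base64-encoded and referenced by its id.
schema_body = {
    "id": "coordinates",
    "type": "JSON",
    "schemaDefinition": base64.b64encode(json.dumps(gps_schema).encode()).decode(),
}
requests.post(f"{DATA_HUB_API}/schemas", json=schema_body, timeout=10).raise_for_status()

# The data policy is the JSON document shown above, stored here in a local file.
with open("location-policy.json") as policy_file:   # illustrative file name
    policy_body = json.load(policy_file)
requests.post(f"{DATA_HUB_API}/data-validation/policies", json=policy_body, timeout=10).raise_for_status()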
  • Behavior Policies: Behavior policies allow you to uncover bad-acting clients directly in the broker. They dictate agreed-upon behaviors for how devices should interact with the broker and let you log, stop, or transform deviating behavior. Flow control enables the validation of in-flight message flow patterns to avoid inefficiency (e.g. constantly repeating Connect-Publish-Disconnect), and you can enforce data ingestion limits to restrict how much data clients can send (no over-sending, even if the data is correct). These policies significantly limit performance issues because they let you uncover clients that violate good resource usage and detect and drop repetitive messages. The example below counts and drops repetitive messages.
{
  "id":"example-behavior-policy",
  "matching":{
    "clientIdRegex":".*"
  },
  "behavior":{
    "id":"Publish.duplicate"
  },
  "onTransitions":[
    {
      "fromState":"Any.*",
      "toState":"Duplicated",
      "Mqtt.OnInboundPublish":{
        "pipeline":[
          {
            "id":"count-duplicate-messages",
            "functionId":"Metrics.Counter.increment",
            "arguments":{
              "metricName":"repetitive-messages-count",
              "incrementBy":1
            }
          },
          {
            "id":"drop",
            "functionId":"Mqtt.drop",
            "arguments":{
              
            }
          }
        ]
      }
    }
  ]
}
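To see the policy in action, the sketch below simulates a misbehaving client with the Paho MQTT package: it publishes the same payload over and over. All of the detection happens in the broker; with the behavior policy above in place, the broker increments the repetitive-messages-count metric and drops the duplicates. The broker address, topic, and client id are assumptions.

# Hedged sketch: a client that repeatedly publishes an identical payload, the
# pattern the Publish.duplicate behavior model flags. Detection and dropping
# happen in the broker, not in this script.
import time
import paho.mqtt.publish as publish

PAYLOAD = '{"latitude": 48.137, "longitude": 11.575}'

for _ in range(10):
    publish.single(
        topic="location",
        payload=PAYLOAD,              # identical payload every time
        hostname="localhost",         # assumption: local HiveMQ broker
        port=1883,
        client_id="repetitive-sensor-42",
    )
    time.sleep(0.1)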
  • Data Transformation: Transformation policies will soon be available, letting you move more operations to the edge so that data can be standardized (e.g. automatically convert Fahrenheit to Celsius or convert to the metric system), undesirable data can be filtered out (e.g. temperatures below 17 degrees), and data from various sources and versions can be unified into a single data standard, all before it reaches consumers. This significantly increases efficiency, because it automates processes and replaces manual transformation logic in clients, which is time-consuming, error-prone, and hard to scale. The sketch below illustrates the kind of client-side logic this replaces.
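Since transformation policies are not yet available, the sketch below only illustrates, in plain Python, the kind of per-message normalization that clients and consumers have to implement manually today and that these policies aim to move into the broker. The field names and the 17-degree threshold are assumptions taken from the examples in the text.

# Illustration only: client-side normalization that transformation policies aim
# to replace. Converts Fahrenheit to Celsius and filters out undesirable readings.
from typing import Optional

def normalize_reading(message: dict) -> Optional[dict]:
    """Convert a Fahrenheit reading to Celsius; drop readings below 17 °C."""
    celsius = (message["temperature_f"] - 32) * 5 / 9
    if celsius < 17:
        return None                                # undesirable data is filtered out
    return {"temperature_c": round(celsius, 2), "sensor": message.get("sensor")}

print(normalize_reading({"temperature_f": 98.6, "sensor": "line-3"}))  # kept (37.0 °C)
print(normalize_reading({"temperature_f": 50.0, "sensor": "line-3"}))  # dropped (10 °C)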

  • Control Center: Easily visualize your data with a simple user interface that helps you manage schemas, data policies, and behavior policies. The dashboard provides an overview of overall quality metrics, making it easy to locate bad actors and bad data sources. You can get an even more in-depth look at your data with tools like Grafana.

Data Hub Control Center

Control IoT Data

Without Data Hub, data consumers have to process and validate messages on their own, and the margin for error is high as faulty clients can flood the system with bad data (or behaviors) that generate unnecessary network traffic and end up wasting computing resources. Not to mention, many clients don’t follow naming conventions or agreed-upon MQTT behaviors, which makes it difficult to identify and fix them.

Data Hub provides data services to customers and acts as an independent arbiter that enforces standards on how clients behave and on the data they publish. With Data Hub, users can specify how the broker should behave and control their IoT data by introducing policies. Policies written by developers, DevOps engineers, engineering managers, etc. for data producers and consumers are made available to Data Hub via the REST API in the broker. The policy engine then implements and enforces the expectations expressed in the policies in real time, and the data is sent to the consumer (Google Pub/Sub, Kafka, enterprise services, etc.) in the format it expects.

Data Hub Policy Engine

The HiveMQ MQTT Platform ensures messages are securely and reliably delivered from producers to consumers, while allowing customers to enforce data standards.

For example, consider a manufacturing company where a systems integrator inputs new devices into the data stream. One single device has the potential to introduce bad data into the data stream or behaviors that can monopolize resources leading to service downtime and production stoppages. It could cost the company a lot of time, money, and resources to track down the faulty device and mitigate the problem.

With Data Hub, we provide a way to validate data entering the data stream and to redirect or even transform that data. Doing so streamlines the development of new services as well as the onboarding of third-party data producers, by enforcing a data contract with an API-first approach that enables data development in parallel.

Try HiveMQ Data Hub Now

With HiveMQ Data Hub, it has never been easier to ensure that the IoT data moving through the HiveMQ MQTT Platform is of the highest quality while still being delivered at the scale and performance customers have come to expect. Data Hub setup is quick and easy, and creating the first policy takes just a few minutes. See our policy cookbook for how to get started quickly and take a look at the docs for a more in-depth explanation of all of the features available.


About Michael Parisi

Mike Parisi is a Product Marketing Manager at HiveMQ. He has extensive experience helping SaaS companies launch new high-impact products, and he revels in bringing together people and technology.

Follow Michael on LinkedIn
