A Guide to Distributed Tracing for IoT Systems Using HiveMQ and OpenTelemetry
The HiveMQ Distributed Tracing Extension is HiveMQ’s contribution to distributed tracing, an operational practice that gives you greater insight into the behavior of complex, multi-tier, distributed systems.
Distributed tracing is a way of tracing application activity in the form of requests as data flows from devices at the edge or front end through intermediate layers, and ultimately to backend systems, whether they are other Operational Technology (OT) systems or Information Technology (IT) systems. A common use of distributed tracing is troubleshooting and illuminating application flows and requests that have high latency or other problems.
The HiveMQ Distributed Tracing Extension enhances IoT observability by enabling you to track MQTT messages in an end-to-end manner as they pass through HiveMQ Broker. This provides deeper insights into message flows, helping to identify performance bottlenecks and pinpoint reasons for message failures, ultimately making your IoT applications more performant and resilient.
In this article, we cover the basics of distributed tracing, define common terms, and provide the steps needed for initial setup and configuration. This will provide you with a basis for troubleshooting operations with HiveMQ’s Distributed Tracing Extension in combination with your OpenTelemetry-compatible APM, such as Datadog. This article is NOT an exhaustive guide to troubleshooting, nor is it a replacement for HiveMQ’s Distributed Tracing Extension documentation. Rather, it is a starting point for further steps toward improving observability and reliable operations in your distributed MQTT-centric systems.
Setting Up HiveMQ Distributed Tracing with OpenTelemetry and APM Tools
Introduction to Distributed Tracing Concepts
Observability is the ability to understand the internal state of a complex system based on its external outputs (telemetry data). A key part of observability is instrumentation—code added to a service (like HiveMQ Broker via the extension) to collect monitoring data. In the Distributed Tracing Extension, instrumentation is what helps generate traces from HiveMQ Broker behavior.
An APM, or Application Performance Monitoring system, is what brings together all relevant data and lets users examine behaviors, visualize distributed traces, chart performance, and generally view and gain insights into application behavior. Some examples of APMs include Datadog, Azure Application Insights, and Grafana Tempo.
The extension uses OpenTelemetry (OTel), an open-source observability framework. This standards-based approach ensures that the tracing data can be sent to a variety of Application Performance Monitoring (APM) tools that support OpenTelemetry, including Datadog.
The following diagram depicts a common system architecture and shows how distributed tracing fits into the overall landscape of observability, monitoring, tracing, and logging.
This next picture, while similar to the first, shows how traces and spans cover various tiers, levels, and groups of systems making up more complex, composite systems. From this picture you can see that a distributed trace encompasses every path from device through OT systems to IT systems, and end-user applications. Rather than looking at multiple, disparate tools, dashboards, monitors, and other data, a distributed trace gives a single, threaded view of parts of application behavior and performance:
Requirements for Setting Up HiveMQ Distributed Tracing with OTel and Datadog
System-Level Components
An OpenTelemetry (OTel)-capable system, such as Datadog.
Datadog account with an active APM service.
A running HiveMQ MQTT broker instance (HiveMQ Platform).
The HiveMQ Distributed Tracing Extension installed and enabled on your HiveMQ broker. This is a commercial extension.
An OpenTelemetry Collector (OTel Collector): This component is crucial for receiving telemetry data from the HiveMQ extension (via OTLP - OpenTelemetry Protocol) and then exporting it to Datadog.
Terminology
A larger, more complete list of terms and definitions is found in the documentation. The following are several key terms used here.
General Terminology
Telemetry: data automatically transmitted by a system about its behavior, including traces, metrics, and logs.
Distributed Tracing: a way of tracking the path of a request, such as an MQTT message, as it flows through multiple components in a distributed system. It gives an end-to-end view of a message's journey.
OpenTelemetry (OTel): An open-source observability framework (tools, APIs, SDKs) used by the extension to generate, collect, and export telemetry data (like traces and spans) in a standardized way.
Trace Terminology
Trace: the complete end-to-end journey of a single request through the distributed system. A trace is composed of multiple spans.
Span: a single, named, and timed operation or unit of work within a trace, such as an API call or a message being processed by the broker. Spans have a start and end time and can have parent-child relationships (see the sketch after this list).
Root Span: the first span in a trace.
Child Span: a sub-operation triggered by a parent span.
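To make the parent-child terminology concrete, the following is a minimal sketch of creating a root span and a child span with the OpenTelemetry Java API. The extension generates broker spans for you; this only illustrates the terms, and the tracer and span names are placeholders.

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class SpanTerminologyExample {
    public static void main(String[] args) {
        // "example-app" is a placeholder instrumentation name.
        Tracer tracer = GlobalOpenTelemetry.getTracer("example-app");

        // Root span: the first span in the trace; it has no parent.
        Span root = tracer.spanBuilder("handle-request").startSpan();
        try (Scope scope = root.makeCurrent()) {
            // Child span: a sub-operation parented to "handle-request",
            // because it starts while the root span is current.
            Span child = tracer.spanBuilder("process-message").startSpan();
            child.end();
        } finally {
            root.end();
        }
    }
}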
Configuration-Related Terminology
The following terms are related to the configuration of the HiveMQ Distributed Tracing Extension:
Trace Context Propagation: how trace identifiers (like Trace ID and Span ID) are passed along with a request as it travels between services, for example in the W3C traceparent header, ensuring that all spans are correlated into a single trace.
Span Exporter: configured in the HiveMQ extension, sends collected trace data to a backend system like an OTel Collector or directly to an APM. For example, otlp-exporter is a specific type that uses gRPC or HTTP.
Batch Span Processor: batches multiple spans together before exporting them for efficiency.
Service Name: A user-defined name to identify the HiveMQ instance or service producing the traces.
Sampling: deciding which traces to collect and send to the backend to optimize data volume. Two (2) types (a sketch of head-based sampling follows this list):
Head-based: decision at the start of tracing, or
Tail-based: decision after all spans complete.
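For applications that emit their own spans, head-based sampling is typically set in the tracer provider. A minimal sketch with the OpenTelemetry Java SDK, assuming an illustrative 10% sampling ratio (the extension's own sampling is configured in XML, shown later in this article):

import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.samplers.Sampler;

public class HeadSamplingExample {
    public static void main(String[] args) {
        // Head-based: the decision is made when the root span starts.
        // parentBased(...) honors a sampling decision propagated from upstream,
        // so a single trace is either kept or dropped consistently.
        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
            .setSampler(Sampler.parentBased(Sampler.traceIdRatioBased(0.10)))
            .build();
        tracerProvider.close();
    }
}

Tail-based sampling, by contrast, is usually implemented in an OpenTelemetry Collector, which buffers all spans of a trace before deciding whether to keep it.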
Traces and Spans Visualized
Focusing on HiveMQ-specific broker spans, we can see the spans that give insight into HiveMQ operations within a distributed trace:
Supported Functionality in HiveMQ Distributed Tracing Extension
Pay special attention to these notes and details on functionality supported within the extension:
Supports PUBLISH MQTT messages only.
CONNECT, SUBSCRIBE, PUBACK, and other MQTT packet types are not presently supported.
Interceptors get their own child spans.
The HiveMQ Broker supports include/exclude filtering with topic filters and client ID patterns; filtering can be configured separately for incoming and outgoing MQTT messages.
Kafka and Amazon Kinesis extensions support tracing as well (bi-directional).
Note: The HiveMQ MQTT Java client doesn't support tracing. So, if you already have an existing trace in your application and send out an MQTT publish, you need to add the corresponding trace context data to the message manually, for example as MQTT 5 user properties.
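The following is a minimal sketch of that manual step with the OpenTelemetry Java API and the HiveMQ MQTT Java client, injecting the current trace context into MQTT 5 user properties. Client setup is omitted, and the topic and payload are illustrative; the property names written are whatever the configured propagator emits (e.g., traceparent for the W3C tracecontext propagator).

import com.hivemq.client.mqtt.mqtt5.Mqtt5BlockingClient;
import com.hivemq.client.mqtt.mqtt5.datatypes.Mqtt5UserProperties;
import com.hivemq.client.mqtt.mqtt5.datatypes.Mqtt5UserPropertiesBuilder;
import com.hivemq.client.mqtt.mqtt5.message.publish.Mqtt5Publish;
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
import io.opentelemetry.context.Context;

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class TracedPublish {
    static void publishWithTraceContext(Mqtt5BlockingClient client) {
        // Serialize the current trace context (trace ID, span ID, flags)
        // into key/value pairs such as the W3C "traceparent" header.
        Map<String, String> carrier = new HashMap<>();
        W3CTraceContextPropagator.getInstance()
            .inject(Context.current(), carrier, Map::put);

        // Copy the context entries into MQTT 5 user properties.
        Mqtt5UserPropertiesBuilder props = Mqtt5UserProperties.builder();
        carrier.forEach(props::add);

        client.publish(Mqtt5Publish.builder()
            .topic("car/command/engine-start") // illustrative traced topic
            .payload("start".getBytes(StandardCharsets.UTF_8))
            .userProperties(props.build())
            .build());
    }
}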
The Setup
For this section, we call out details about installation and configuration of the extension. Please refer to the HiveMQ Enterprise Distributed Tracing Extension documentation for more comprehensive configuration steps.
Installation
The Distributed Tracing Extension is included with HiveMQ Broker Platform distributions. With a valid license, you can enable the extension. To get started, follow these instructions to apply licenses and set up HiveMQ Distributed Tracing Extension.
Configuration
Use this documentation as a starting point for configuring the Distributed Tracing Extension.
We focus on two (2) key parts of configuration in this article: extension configuration and broker tracing configuration.
Extension Configuration
The base configuration of the HiveMQ Distributed Tracing Extension involves configuring the following four (4) items:
Service name
Trace context propagation
Batch Span Processor
Span Exporter
The HiveMQ documentation has a more comprehensive explanation of the configuration parameters and options. For the purposes of this article, we’ll take a look at a sample configuration containing the four (4) key elements:
<?xml version="1.0" encoding="UTF-8"?>
<hivemq-distributed-tracing-extension xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                                      xsi:noNamespaceSchemaLocation="config.xsd">
    <service-name>HiveMQ Broker</service-name>
    <propagators>
        <propagator>tracecontext</propagator>
    </propagators>
    <batch-span-processor>
        <schedule-delay>5000</schedule-delay>
        <max-queue-size>2048</max-queue-size>
        <max-export-batch-size>512</max-export-batch-size>
        <export-timeout>30</export-timeout>
    </batch-span-processor>
    <exporters>
        <otlp-exporter>
            <id>my-otlp-exporter</id>
            <endpoint>http://localhost:4317</endpoint>
            <protocol>grpc</protocol>
        </otlp-exporter>
    </exporters>
</hivemq-distributed-tracing-extension>
These, as the names imply, define the service name, the format of the distributed tracing headers, the batch processing parameters, and where the data is to be exported.
The <exporters/> stanza is where you identify the OTLP exporter, local or remote, with a DNS name, IP address, or hostname in <endpoint/>, along with the protocol (grpc in this case).
Broker Tracing Configuration
The other key piece of information needed is what to trace. For tracing setup, the centerpiece is selecting what MQTT clients and what MQTT topic(s) you want to trace. There are two (2) configuration items that scope what is being traced:
client-id-pattern: a regular expression matched against MQTT client IDs, e.g., iot-device-.*
topic-filter: a standard MQTT topic filter, e.g., car/command/#
Topic filters support the usual single-level (‘+’) and multi-level (trailing ‘#’) wildcards, while client ID patterns are regular expressions, as shown in the following example from HiveMQ’s broker tracing configuration:
<?xml version="1.0" encoding="UTF-8" ?>
<tracing xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:noNamespaceSchemaLocation="tracing.xsd">
    <context-propagation>
        <outbound-context-propagation>
            <enabled>true</enabled>
            <exclude>
                <client-id-patterns>
                    <client-id-pattern>iot-device-.*</client-id-pattern>
                </client-id-patterns>
            </exclude>
        </outbound-context-propagation>
    </context-propagation>
    <sampling>
        <publish-sampling>
            <enabled>true</enabled>
            <include>
                <topic-filters>
                    <topic-filter>car/command/#</topic-filter>
                </topic-filters>
            </include>
        </publish-sampling>
    </sampling>
</tracing>
Operational Best Practices for Distributed Tracing with HiveMQ and OpenTelemetry
This section discusses several good practices and considerations relating to distributed tracing and setup. Sampling is a very important aspect, especially in systems with even moderate amounts of data, and certainly in large, high-transaction-volume systems. It is not feasible to capture all data in your systems. Rather, the utility comes from sampling effectively and having the appropriate instrumentation in place before you need to trace.
Run a Local OTel Collector
In general, it is recommended to run a local OpenTelemetry Collector so applications can offload their data as reliably and quickly as possible via the OTLP exporters. The Java SDK (and likely other OTel SDKs) ships with only a few exporters; providing specialized exporters for particular APM providers is the task of the collector distributions.
Sampling
Running a collector is also beneficial because you can configure additional sampling and processing of the trace data on the collector. For example: send all erroneous and high-latency traces, but only 5% or 10% of the remaining traces.
Metadata Format – Inter-application Tracing
For inter-application tracing, for example between a customer application and HiveMQ, you have to define the format of the trace context in the message metadata. In OpenTelemetry, these format definitions are called propagators. Propagators are configured in the Distributed Tracing Extension. Note that propagators can only be configured globally, not separately for incoming/outgoing traffic or per protocol (MQTT/Kafka/Kinesis).
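Going the other direction, a consuming application can rebuild the propagated context from the user properties of a received message. A hedged sketch with the OpenTelemetry Java API and the HiveMQ MQTT Java client, assuming the tracecontext propagator configured earlier:

import com.hivemq.client.mqtt.mqtt5.message.publish.Mqtt5Publish;
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.propagation.TextMapGetter;

import java.util.HashMap;
import java.util.Map;

public class TracedConsume {
    static Context extractContext(Mqtt5Publish publish) {
        // Copy the MQTT 5 user properties into a plain map to use as carrier.
        Map<String, String> carrier = new HashMap<>();
        publish.getUserProperties().asList().forEach(p ->
            carrier.put(p.getName().toString(), p.getValue().toString()));

        // Rebuild the trace context (e.g., from the "traceparent" entry);
        // spans started under the returned context join the existing trace.
        return W3CTraceContextPropagator.getInstance().extract(
            Context.current(), carrier,
            new TextMapGetter<Map<String, String>>() {
                @Override
                public Iterable<String> keys(Map<String, String> c) {
                    return c.keySet();
                }
                @Override
                public String get(Map<String, String> c, String key) {
                    return c == null ? null : c.get(key);
                }
            });
    }
}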
As an exporter, you normally use OTLP to a local OpenTelemetry (OTel) Collector.
Transport Protocol Options:
gRPC is used as the transport protocol; HTTP would be an alternative. The defining element is the wire format, OTLP (OpenTelemetry Protocol) versus alternatives such as Zipkin, which defines how the data is actually encoded on the wire.
Since the OTel Java SDK only supports Zipkin and OTLP, an OTel Collector is needed should you require a specialized exporter for your APM vendor. The OTel Collector can also be configured to scrape HiveMQ Prometheus metrics and forward them to the APM vendor, such as Datadog.
Making Use of Distributed Traces
What Does a Distributed Trace Look Like?
A distributed trace is typically shown as a waterfall diagram. Distributed tracing collects deep, insightful information about the behavior and performance of components in distributed systems, and that expressive power comes together in the form of a graphical, often interactive display. A distributed trace is often visualized as follows:
You can see that time is represented in the horizontal direction, while the vertical dimension is made up of representations of the individual spans within the trace, coming from one or more tiers or nodes in the system.
The following image from Datadog shows a distributed trace.
How to Read and Interpret a Distributed Trace Diagram
From the previous visualization examples, it is easy to see the details of each span, the identifiers, time spent in each span, and the larger perspective of how each span contributes to flow and response times within components.
Performance
These visualizations are organized in such a way that even someone not familiar with a system can quickly understand and interpret its behavior. Slower components stand out visually as longer blocks occupying more screen real estate. Time progresses left-to-right.
When looking into performance it is helpful to understand that there are many types of latency, but they can generally be categorized into three (3) types of delay or latency:
propagation delay
transmission (network) latency
processing latency
The following diagram shows how these delays relate:
In the case of macro-scale, or component-to-component, system performance analysis, we often ignore propagation delays. Propagation delays are important and dominant at smaller scales, such as within a CPU, a memory controller, or chip-to-chip communications. This delay cannot be ignored altogether, however. Propagation delay also rises in importance when applications or systems are “chatty”. Chattiness means an application has many small, back-and-forth conversations, with less information transmitted per turn in the conversation. With such frequent, small packets, the fixed per-packet delays contribute a greater share of the total time.
For example, if an application has to send 100,000,000,000 bytes of data, the ideal way to send it is to bundle all of it, or as much as possible, together. Note that the propagation delay is negligible compared to the transmission and processing delays in this case:
If instead the 100,000,000,000 bytes must be broken up into smaller packets of, say, 1,000 bytes, the sender has to transmit 100,000,000 packets. In this case the previously negligible propagation and processing delays rise to consume a much larger percentage of the total delay. We can see this effect visually:
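A rough back-of-the-envelope sketch of the same comparison, with assumed figures only (a 1 Gbit/s link, 50 ms one-way propagation delay, 10 µs of processing per send, and a worst-case chatty exchange where each packet waits for the previous one to arrive):

public class DelayComparison {
    public static void main(String[] args) {
        double totalBytes = 100_000_000_000.0;   // 100 GB to send
        double linkBitsPerSec = 1_000_000_000.0; // assumed 1 Gbit/s link
        double propagationSec = 0.050;           // assumed 50 ms one-way propagation
        double perSendProcSec = 0.000_01;        // assumed 10 microseconds processing per send

        // Case 1: one large bundled transfer; propagation and processing are paid once.
        double bundled = propagationSec + (totalBytes * 8 / linkBitsPerSec) + perSendProcSec;

        // Case 2: 1,000-byte packets sent stop-and-wait (chatty); the per-packet
        // overheads are paid 100,000,000 times.
        double packets = totalBytes / 1_000;
        double chatty = packets * (propagationSec + (1_000 * 8 / linkBitsPerSec) + perSendProcSec);

        // Prints roughly: bundled ~800 s, chatty ~5,000,000 s.
        System.out.printf("bundled: %.1f s, chatty: %.1f s%n", bundled, chatty);
    }
}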
Quick Interpretation of Visuals in a Distributed Trace
When reading the distributed trace diagrams, consider the following general signs:
Longer horizontal lines mean that a span is taking more time (processing delay).
Whitespace (larger gaps between spans) means that:
there is increased transmission (network) delay, or
there are other components in the flow that are not instrumented and therefore not represented in the trace and display.
The number, type, and frequency of spans indicate the level of chattiness visually. That is, many smaller rectangles indicate smaller, shorter transactions.
More spans overlapping and sprawling vertically can indicate higher degrees of parallelism in the overall system.
The number of rows in the chart indicates the level of connectedness, or the number of components communicating, in the trace. Generally, the more rows in a trace, the more complex the system.
Interconnections
Another powerful benefit of this type of visualization is the synoptic view it provides of the major components of a system. In one graphic, a person can see both the interconnections and the performance of the components of complex, networked systems. In this way, the trace diagram helps people better comprehend both the architecture and the live, transactional behavior of a system.
Analytic Use of Distributed Trace Data
Beyond interactive visual inspection, all of the instrumented components report quantitative performance data to a central place, so that data can also be used for analytical purposes. You can take regular samples of, say, 0.1% to 2% of your production traffic and use that information to baseline individual systems and entire application flows.
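As a simple illustration of that analytic use, a hedged sketch that derives a latency baseline (median and 95th percentile) from a list of sampled span durations; the sample values are hypothetical, standing in for durations exported from your APM:

import java.util.Arrays;

public class LatencyBaseline {
    // Returns the value at the given percentile (0-100) of sorted samples.
    static double percentile(double[] sortedMillis, double pct) {
        int idx = (int) Math.ceil(pct / 100.0 * sortedMillis.length) - 1;
        return sortedMillis[Math.max(idx, 0)];
    }

    public static void main(String[] args) {
        // Hypothetical publish-span durations (ms) from a 1% traffic sample.
        double[] samples = {2.1, 2.4, 2.2, 9.8, 2.3, 2.5, 14.2, 2.2, 2.6, 2.4};
        Arrays.sort(samples);
        System.out.printf("p50 = %.1f ms, p95 = %.1f ms%n",
            percentile(samples, 50), percentile(samples, 95));
    }
}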
Conclusion
We have given you an overview of distributed tracing, a foundational understanding of concepts and terms to get you started on your path to gaining deeper understanding of your systems. If you’re an active customer, reach out to your Customer Success TAM or CSM to learn more. If you’re new to HiveMQ, please contact us for more information and a technical discussion.

Bill Sommers
Bill Sommers is a Technical Account Manager at HiveMQ, where he champions customer success by bridging technical expertise with IoT innovation. With a strong background in capacity planning, Kubernetes, cloud-native integration, and microservices, Bill brings extensive experience across diverse domains, including healthcare, financial services, academia, and the public sector. At HiveMQ, he guides customers in leveraging MQTT, HiveMQ, UNS, and Sparkplug to drive digital transformation and Industry 4.0 initiatives. A skilled advocate for customer needs, he ensures seamless technical support, fosters satisfaction, and contributes to the MQTT community through technical insights and code contributions.