Streaming IoT Data and MQTT Messages to Apache Kafka
Written by Dominik Obermaier and Ian Skerrett
Published: April 16, 2019
Apache Kafka is a real-time streaming platform that is gaining broad adoption within large and small organizations. Kafka’s distributed microservices architecture and publish/subscribe protocol make it ideal for moving real-time data between enterprise systems and applications. By some accounts, over one third of Fortune 500 companies are using Kafka. On Github, Kafka is one of the most popular Apache projects with over 11K stars and over 500 contributors. Without a doubt, Kafka is an open source project that is changing how organizations move data inside their cloud and data center.
The architecture of Kafka has been optimized to stream data between systems and applications as fast as possible in a scalable manner. Kafka clients/producers are tightly coupled with a Kafka cluster requiring each client to know the IP addresses of the Kafka cluster and to have direct access to all individual nodes. Inside trusted networks, this allows changes to the broker topology, which means topics and partitions can be scaled out by using multiple nodes directly from Kafka clients. Kafka topic spaces are also kept reasonably flat in most cases, as usually multiple partitions are used for scaling a single Kafka topic. It is very often undesirable to have hundreds or even thousands of topics in the Kafka system and preferred approach is using few topics for most of the data streams. Kafka is well suited for communication between systems inside the same trusted network with stable IP addresses and connections.
For IoT use cases where devices are connected to the data center or cloud over the public Internet, the Kafka architecture is not suitable out-of-the-box. If you attempt to stream data from thousands or even millions of devices using Kafka over the Internet, the Apache Kafka architecture is not suitable. There are a number of reasons Kafka by itself is not well suited for IoT use cases:
Kafka brokers need to be addressed directly by the clients, which means each clients needs to be able to connect to the Kafka brokers directly. Professional IoT deployments usually leverage a load balancer as first line of defense in their cloud, so devices just use the IP address of the load balancer for connecting to the infrastructure and the load balancer effectively acts as a proxy. If you want your devices to connect to Kafka directly, your Kafka brokers must be exposed to the public Internet.
Kafka does not support large amounts of topics. When connecting millions of IoT devices over the public Internet, individual and unique topics (often with some unique IoT device identifier included in the topic name) are commonly used, so read and write operations can be restricted based on the permissions of individual clients. You don’t want your smart thermostat to get hacked and those credentials potentially being used to eavesdrop on all the data streams in the system.
Kafka clients are reasonably complex and resource intensive compared to client libraries for IoT protocols. The Kafka APIs for most programming languages are pretty straightforward and simple, but there is a lot of complexity under the hood. The client will for example use and maintain multiple TCP connections to the Kafka brokers. IoT deployments often have constrained devices that require minimal footprint and don’t require very high throughput on the device side. Kafka clients are by default optimized for throughput.
Kafka clients require a stable TCP connection for best results. Many IoT use cases involve unreliable networks, for example connected cars or smart agriculture, so a typical IoT device would need to consistently reestablish a connection to Kafka.
It’s unusual (and very often not even possible at all) to have tens of thousands or even millions of clients connected to a single Kafka cluster. In IoT use cases, there are typically large numbers of devices, which are simultaneously connected to the backend, and are constantly producing data.
Kafka is missing some key IoT features. The Kafka protocol is missing features such as keep alive and last will and testament. These features are important to build a resilient IoT solution that can deal with devices experiencing unexpected connection loss and suffering from unreliable networks.
Kafka still brings a lot of value to IoT use cases. IoT solutions create large amounts of real-time data that is well suited for being processed by Kafka. The challenge is: How do you bridge the IoT data from the devices to the Kafka cluster?
Many companies that implement IoT use cases are looking at possibilities of integrating MQTT and Kafka to process their IoT data. MQTT is another publish/subscribe protocol that has become the standard for connecting IoT device data. The MQTT standard is designed for connecting large numbers of IoT devices over unreliable networks, addressing many of the limitations of Kafka. In particular, MQTT is a lightweight protocol requiring a small client footprint on each device. It is built to securely support millions of connections over unreliable networks and works seamlessly in high latency and low throughput environments. It includes IoT features such as keep alive, last will and testament functionality, different quality of service levels for reliable messaging, as well as client-side load balancing (Shared Subscriptions) amongst other features designed for public Internet communication. Topics are dynamic, which means an arbitrary number of MQTT topics can exist in the system, very often up to tens of millions of topics per deployment in MQTT server clusters.
While Kafka and MQTT have different design goals, both work extremely well together. The question is not Kafka vs MQTT, but how to integrate both worlds together for an IoT end-to-end data pipeline. In order to integrate MQTT messages into a Kafka cluster, you need some type of bridge that forwards MQTT messages into Kafka. There are four different architecture approaches for implementing this type of bridge:
1. Kafka Connect for MQTT
Kafka has an extension framework, called Kafka Connect, that allows Kafka to ingest data from other systems. Kafka Connect for MQTT acts as an MQTT client that subscribes to all the messages from an MQTT broker.
If you don’t have control of the MQTT broker, Kafka Connect for MQTT is a worthwhile approach to pursue. This approach allows Kafka to ingest the stream of MQTT messages.
There are performance and scalability limitations with using Kafka Connect for MQTT. As mentioned, Kafka Connect for MQTT is an MQTT client that subscribes to potentially ALL the MQTT messages passing through a broker. MQTT client libraries are not intended to process extremely large amounts of MQTT messages so IoT systems using this approach will have performance and scalability issues.
This approach centralizes business and message transformation logic and creates tight coupling, which should be avoided in distributed (microservice) architectures. The industry leading consulting company Thoughtworks called this an anti-pattern and even put Kafka into the “Hold” category in their previous Technology Radar publications.
2. MQTT Proxy
Another approach is the use of a proxy application that accepts the MQTT messages from IoT devices but does not implement the publish/subscribe or any MQTT session features and thus is not a MQTT broker. The IoT devices connect to the MQTT proxy, which then pushes the MQTT messages into the Kafka broker.
The MQTT proxy approach allows for the MQTT message processing to be done within your Kafka deployment, so management and operations can be done from a single console. An MQTT proxy is usually stateless, so it could (in theory) scale independent of the Kafka cluster by adding multiple instances of the proxy.
The limitations to the MQTT Proxy is that it is not a true MQTT implementation. An MQTT proxy is not based on pub/sub. Instead it creates a tightly coupled stream between a device and Kafka. The benefit of MQTT pub/sub is that it creates a loosely coupled system of end points (devices or backend applications) that can communicate and move data between each end point. For instance, MQTT allows communication between two devices, such as two connected cars can communicate with each other, but an MQTT proxy application would only allow data transmission from a car to a Kafka cluster, not with another car.
Some Kafka MQTT Proxy applications support features like QoS levels. It’s noteworthy that the resumption of a QoS message flow after a connection loss is only possible, if the MQTT client reconnects to the same MQTT Proxy instance, which is not possible, if load balancer is used that uses least-connection or round-robin strategies for scalability. So the main reason for using QoS levels in MQTT, which is no message loss, only works for stable connections, which is an unrealistic assumption in most IoT scenarios.
The main risks for using such an approach is the fact that a proxy is not a fully featured MQTT broker, so it is not an MQTT implementation as defined by the MQTT specification, only implementing a tiny subset, so it’s not a standardized solution. In order to use MQTT with MQTT clients properly, a fully featured MQTT broker is required.
If message loss is not an important factor and if MQTT features designed for reliable IoT communication are not used, the proxy approach might be a lightweight alternative, if you only want to send data unidirectionally to Kafka over the Internet.
3. Build Your Own Custom Bridge
Some companies build their own MQTT to Kafka bridge. The typical approach is to create an application using an open source MQTT client library and an open source Kafka client library. The custom application is responsible for transposing and routing the data between the MQTT broker and Kafka instance.
The main challenge with this approach is the custom application is typically not designed to be fault tolerant and resilient. This becomes important, if the IoT solution requires and end-to-end guaranty of at least once or exactly once message delivery. For example, an MQTT message set to quality of service level 1 or 2 that is sent to the custom application will acknowledge receipt of the message. However, if the custom application crashes before forwarding the message to Kafka, the message is lost. Similarly, if the Kafka cluster is not available, the custom application will need to buffer the MQTT messages. In case the custom application crashes before the Kafka cluster is available again, all the buffered messages will be lost. To solve these issues, the custom application will require significant development efforts, building functionality similar to technology already found in Kafka and an MQTT broker.
4. MQTT Broker Extension
A final approach is to extend the MQTT broker to create an extension that includes the native Kafka protocol. This allows the MQTT broker to act as a first-class Kafka client and stream the IoT device data to multiple Kafka clusters.
To implement this approach, you need to have access to the MQTT broker and the broker needs to have the ability to install extensions.
This approach allows the IoT solution to use a native MQTT implementation and a native Kafka implementation. IoT devices use an MQTT client to send data to a fully featured MQTT broker. The MQTT broker is extended to include a native Kafka client and transposes the MQTT message to the Kafka protocol. This allows the IoT data to be routed to multiple Kafka clusters and non-Kafka applications at the same time. Using an MQTT broker will also provide access to all the MQTT features required for IoT devices, such as last will and testament. An MQTT broker, like HiveMQ, is designed for high availability, persistence, performance and resilience, so messages can be buffered on the broker while Kafka is not writable, so important messages are never lost from IoT devices. Consequently this approach delivers true end-to-end message delivery guarantees even for unreliable networks, public Internet communication and changing network topologies (as often seen in containerized deployments e.g. with Kubernetes).
HiveMQ Enterprise Extension for Kafka
In conversations with HiveMQ customers, some operating clusters with millions of devices and very high message throughput, we saw the need for creating an MQTT broker extension for Kafka. Our customers wanted to benefit from the native implementations of both MQTT and the Kafka protocol with all delivery guarantees of both protocols. Therefore, we are excited to announce the HiveMQ Enterprise Extension for Kafka.
Our customers see enormous value in a joint MQTT and Kafka solution. They view Kafka as an excellent platform for processing and distributing real-time data in their data center or cloud environments. They want to use MQTT and HiveMQ to move data from devices to different back-end systems. The back-end systems include Kafka and also non-Kafka systems. They also know if they are trying to connect millions of devices, like connected cars, they need to be using a native and battle-tested implementation of MQTT, like HiveMQ.
HiveMQ Enterprise Extension for Kafka implements the native Kafka protocol inside the HiveMQ broker. This allows for seamless integration of MQTT message with a single Kafka cluster or multiple Kafka clusters at the same time. It supports 100% the entire MQTT 3 and MQTT 5 specification. We even make is possible to map potentially millions of MQTT topics to a limited number of Kafka topics. Finally, we have extended the HiveMQ Control Center to make it possible to monitor MQTT messages written to Kafka.
We are excited to bring this new product to our HiveMQ customers. This is the best approach for using Apache Kafka with IoT use cases.
The free trial of the HiveMQ Enterprise Extension for Apache Kafka can be downloaded in our Marketplace. It was never easier to integrate MQTT and Apache Kafka for IoT use cases.