Seamlessly Integrate MQTT Data With Data Lakes

by Nasir Qureshi
11 min read

In the rapidly evolving IoT landscape, businesses need a dependable, end-to-end way to move sensor data into data lake architectures. This is key to unlocking the full potential of IoT data for use cases like predictive maintenance, asset tracking, and connected devices. Together, HiveMQ's Enterprise Broker and Enterprise Data Lake Extension offer a reliable, secure end-to-end solution for integrating IoT sensor data with leading Data Clouds.

In this blog post, we will discuss what a data lake is, how it can be used in an IoT deployment, and how the HiveMQ Enterprise Data Lake Extension can integrate MQTT data into popular data lakes such as Snowflake, AWS, Azure, and Databricks without needing additional infrastructure.

What is a Data Lake?

Data lakes are centralized repositories that allow organizations to store vast amounts of raw and processed data in their native format. This type of storage system can handle large volumes of structured, semi-structured, and unstructured data. This is particularly useful in IoT deployments as IoT data comes in various formats, such as sensor readings, logs, images, videos, and more. Popular cloud providers like Azure, AWS, and Google provide architectures that include data lake services. Data Clouds like Snowflake and Databricks operate their services on top of these public clouds. 

How is a Data Lake Used in an IoT Deployment? 

Depending on the deployment size, IoT devices can generate petabytes of data, all of which must be turned into something the business can act on. As illustrated in the diagram below, a data lake can be more than a repository. It can be an engine to store, transform, validate, analyze, catalog, and process the data from IoT devices, which in turn enables several use cases and informs business decisions. This makes a data lake an essential part of enterprise IoT deployments.

[Image: Data Lake Reference Architecture]

What is HiveMQ’s Enterprise Data Lake Extension?

This purpose-built extension enables our customers to integrate data from IoT devices directly into any data cloud that is deployed in AWS or Azure and can read from Amazon S3 or Azure Blob Storage. More specifically, the Data Lake Extension allows customers to:

  • Forward MQTT messages directly to the data lake without the need for additional infrastructure.

  • Support any data lake infrastructure that can read data from Amazon S3 or Azure Blob storage — Databricks, Snowflake, AWS Data Lake, Azure Data Lake, etc.

  • Convert MQTT messages into Parquet table rows with column mappings for efficient storage.

  • Use mappings to store only the MQTT message elements needed. This helps optimize storage capacity and querying while saving unnecessary data storage costs.
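To make the idea of column mappings concrete, here is a minimal Python sketch of how a JSON MQTT payload could be flattened into a table row that keeps only the mapped fields. It is an illustration under stated assumptions, not the extension's actual configuration or code: the mapping format, the column names, and the `to_row` and `extract` helpers are all hypothetical, and the real extension writes Parquet files rather than Python dictionaries.

```python
import json

# Hypothetical mapping: output column name -> path into the JSON payload.
# The column names and payload layout are assumptions for illustration.
COLUMN_MAPPING = {
    "machine_id": ("machine", "id"),
    "temperature_c": ("readings", "temperature"),
    "vibration_mm_s": ("readings", "vibration"),
}


def extract(payload, path):
    """Walk a nested dict along `path`; return None if any key is missing."""
    current = payload
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def to_row(topic, raw_payload, mapping=COLUMN_MAPPING):
    """Turn one MQTT message into a flat row holding only the mapped columns."""
    payload = json.loads(raw_payload)
    row = {"topic": topic}
    for column, path in mapping.items():
        row[column] = extract(payload, path)
    return row


message = (
    b'{"machine": {"id": "press-7", "firmware": "2.1"}, '
    b'"readings": {"temperature": 71.5, "vibration": 3.2, "humidity": 40}}'
)
row = to_row("factory/line1/press-7/telemetry", message)
print(row)
```

Fields that are not mapped, such as `firmware` and `humidity` in this example, never reach the row, which is where the storage savings come from.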

The Data Lake Extension can be used to enable several use cases, including predictive maintenance in smart manufacturing, real-time asset tracking in transportation and logistics, and connected cars for automotive OEMs.

In each use case, data is generated from IoT devices like sensors that are embedded in manufacturing machinery, equipment, or cars to continuously collect and transmit data on crucial metrics like equipment health, asset location coordinates, engine oil pressure and temperature, etc. In each scenario, IoT devices communicate using the MQTT protocol. The data generated by IoT devices is published to HiveMQ’s MQTT broker, which serves as an intermediary between the IoT devices and the subsequent stages of the data pipeline. It receives, manages, and routes the MQTT messages using the pub/sub model, ensuring efficient and reliable communication between publishing devices and subscribing clients. HiveMQ Broker is 100% compliant with the MQTT protocol and purpose-built for an instant bi-directional data push between IoT devices and enterprise systems.

Once the data reaches the MQTT broker, it can be sent into a data lake architecture. This is where HiveMQ’s Data Lake Extension comes in. It integrates the MQTT data into a data lake architecture by forwarding MQTT messages directly to the data lake via the primary cloud storage service (Amazon S3 or Azure Blob Storage) without the need for additional infrastructure.

The Data Lake Extension also has a couple of other valuable features: 

  1. Parquet Conversion: The HiveMQ Data Lake Extension can save MQTT messages into Parquet table rows with column mappings. This structured format optimizes storage and enhances query performance within the data lake.

  2. Topic Filtering: Use topic filters to selectively forward only the MQTT messages published on essential topics.

    a. For instance, engineers may find that temperature changes and machine vibration patterns are vital signs of potential defects in factory equipment. By employing topic filters, only messages on the required topics are transmitted to the data lake storage, omitting less essential data to simplify storage and processing.

    b. This also helps optimize storage capacity and querying while saving unnecessary data storage costs.
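Topic filtering relies on the standard MQTT wildcard rules: `+` matches exactly one topic level and `#` matches any number of remaining levels. The sketch below shows that matching logic in plain Python. It illustrates the protocol rule only and says nothing about how the extension implements or configures filters; the `topic_matches` function and the example topic names are hypothetical.

```python
def topic_matches(topic_filter, topic):
    """Return True if `topic` matches the MQTT `topic_filter`.

    Follows the MQTT wildcard rules: '+' matches a single level,
    '#' (only valid as the last level) matches the parent and all children.
    """
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")

    # Topics starting with '$' are not matched by a leading wildcard.
    if topic.startswith("$") and filter_levels[0] in ("+", "#"):
        return False

    for i, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard must be the final level of the filter.
            return i == len(filter_levels) - 1
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


# Hypothetical filters covering the temperature and vibration example above.
FILTERS = ["factory/+/temperature", "factory/+/vibration/#"]
topics = [
    "factory/line1/temperature",
    "factory/line1/vibration/axis/x",
    "factory/line1/humidity",
    "factory/line2/vibration",
]
forwarded = [t for t in topics if any(topic_matches(f, t) for f in FILTERS)]
print(forwarded)
```

In this example the humidity topic is the only one dropped, so it never consumes storage in the data lake.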

Then, depending on the use case, the data can be stored, processed, refined, and organized in the data lake. The data can also be integrated with BI and analytics applications to create monitoring dashboards, build applications to send event-driven alerts, and more.

The HiveMQ Data Lake Extension supports Amazon S3 and Azure Blob Storage with the HiveMQ version 4.26 platform release. 

Integration with Data Cloud Platforms

The HiveMQ Enterprise Data Lake Extension can integrate data with four different data lakes: Snowflake, AWS, Azure, and Databricks.

HiveMQ Integration with Snowflake, AWS, Azure, and Databricks

[Image: HiveMQ Enterprise Data Lake Extension]

Here is how HiveMQ can integrate IoT device data with these data lakes:

  1. MQTT-compatible devices connect with the HiveMQ Platform via TCP/IP using the MQTT protocol's pub/sub model to send data.

  2. The HiveMQ Platform (Broker + Data Lake Extension) converts the sensor data into Parquet format and transmits it to the primary storage layer — Amazon S3 or Azure Blob Storage. Users can utilize topic filters to forward only the required topics. This helps optimize storage capacity and querying while saving unnecessary data storage costs.

  3. Amazon S3 or Azure Blob (object storage) serves as the storage foundation for the data lake, hosting the Parquet-formatted sensor data received from HiveMQ.

  4. The data in object storage can then be read by the data lake for analytics, application development, event-triggered notifications, reporting, statistical analysis, and AI and machine learning.
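Steps 2 and 3 can be pictured as a small buffering pipeline: rows accumulate in memory and are written to object storage as files. The Python sketch below is a conceptual illustration only. Batching by row count, the object key layout, and the `BatchingForwarder` and `InMemoryObjectStore` classes are all assumptions made for the example. The in-memory store stands in for Amazon S3 or Azure Blob Storage, and a real pipeline would serialize each batch to Parquet before uploading it.

```python
class InMemoryObjectStore:
    """Stand-in for an object store such as Amazon S3 or Azure Blob Storage."""

    def __init__(self):
        self.objects = {}

    def put(self, key, rows):
        # A real implementation would upload a serialized Parquet file here.
        self.objects[key] = list(rows)


class BatchingForwarder:
    """Buffers rows and writes them to the store in fixed-size batches."""

    def __init__(self, store, prefix, batch_size):
        self.store = store
        self.prefix = prefix
        self.batch_size = batch_size
        self._buffer = []
        self._sequence = 0

    def add(self, row):
        self._buffer.append(row)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Write any buffered rows as one object, then reset the buffer."""
        if not self._buffer:
            return
        key = f"{self.prefix}/part-{self._sequence:05d}.parquet"
        self.store.put(key, self._buffer)
        self._sequence += 1
        self._buffer = []


store = InMemoryObjectStore()
forwarder = BatchingForwarder(store, prefix="iot/telemetry", batch_size=2)
for reading in range(5):
    forwarder.add({"machine_id": "press-7", "temperature_c": 70.0 + reading})
forwarder.flush()  # write the final, partially filled batch

for key, rows in store.objects.items():
    print(key, len(rows))
```

The data lake in step 4 would then read the objects under the `iot/telemetry` prefix directly from storage.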


HiveMQ's Data Lake Extension enables direct message forwarding to the data lake without additional infrastructure. This simplifies the integration of MQTT data into various data lake infrastructures. The extension includes features like Parquet Conversion and Topic Filtering to optimize storage, reduce costs, and enhance query performance.

Find out more about the HiveMQ Enterprise Data Lake Extension.

Nasir Qureshi

Nasir Qureshi is a Senior Product Marketing Manager at HiveMQ. With a passion for working on disruptive technology products, Nasir has helped SaaS companies in their hyper-growth journey for over 3 years now. He holds an MBA from California State University with a major in Technology and Data Management. His interests include IoT devices, networking, data security, and privacy.
