Netflix


Location:

Los Gatos, California

Application:

MQTT-based device testing and certification solution for consumer electronic devices that run the Netflix application.

Key Challenge:

  • Achieving flexible, fault-tolerant, and scalable data transport over unreliable network connections to and from a broad range of geographically distributed devices.

Results:

  • Reliable transfer of MQTT data with proven ability to scale to increasing workload during periods of peak demand and accommodate the increasing number of devices Netflix is adding to its system.
  • Increased observability through the HiveMQ Control Center, logs, and metrics.


Netflix relies on HiveMQ to run the Netflix app certification process

With 222 million subscribers in over 190 countries, Netflix is not only the largest streaming platform in the world but also the industry leader in reliability and ease of use. One of the big advantages Netflix offers its members is the convenience of watching content on hundreds of different types of TV devices along with browsers and mobile devices. To ensure that all those TV devices deliver the flawless user experience that Netflix members enjoy, the Partner Infrastructure team at Netflix provides infrastructure to test and certify each device before onboarding the device to the Netflix application. After certification, continued monitoring in the field guarantees that devices remain in line with Netflix’s quality standards. HiveMQ plays a central role in helping Netflix achieve the robust and scalable bidirectional communication the device certification process requires.

"The adoption of HiveMQ has allowed us to move to a new paradigm of testing. Previously, all testing required a user to navigate to a website and manually launch tests. With the MQTT architecture that HiveMQ supports, we have made testing automatable via a modern CLI. This approach enables much more scalable testing and continuous integration that has empowered quite a few teams both inside and outside of Netflix."

Benson Ma, Senior Software Engineer, Netflix

Working together with leading consumer electronics companies around the globe also means that the Partner Infrastructure team must efficiently handle fluctuations in the certification load. For example, electronics stores in the US roll out new models around the holiday season, which makes October through November a peak certification season. In that time frame, any delays or downtime in the certification process can lead to product-release delays and loss of revenue for both Netflix and the partner.

“In some partner scenarios, multiple partners work in collaboration. There could be an OEM in Korea and an integrator in India. The device can be in one network and the trigger for the automated test on another network in another country,” explains Inder Singh, Senior Software Engineer at Netflix.

Netflix Device Management Platform Architecture

The Netflix Device Management Platform is designed to give developers a consistent way to deploy and execute automated tests on Netflix-ready devices. At the remote partner location, the Partner Infrastructure team provides a customized embedded computer called the Reference Automation Environment (RAE). Each RAE comes preinstalled with multiple Netflix services to detect, onboard, and collect information from the devices that connect to the RAE. Due to the constrained nature of the RAE and the devices being tested, MQTT is used for all communication with the RAE. Platform users can interact with the cloud services and the services on the RAEs through a web browser or command line interface (CLI).

Netflix Device Management Platform Components

Several considerations led Netflix to choose the MQTT protocol for use in the platform. Key features included MQTT’s support for hierarchical topics, client authentication/authorization, per-topic access control lists (ACL), bi-directional request/response message patterns, and the ability to handle unreliable network connections. All of these elements are crucial for the business use cases the Device Management Platform fulfills.
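To make the role of hierarchical topics concrete, the following is a minimal Python sketch of how an MQTT topic filter with the standard `+` (single-level) and `#` (multi-level) wildcards is matched against a topic name. The topic names in the usage lines are hypothetical and are not taken from the Netflix platform; the sketch also skips the special handling of topics that begin with `$`.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if an MQTT topic filter matches a concrete topic name.

    Simplified sketch: does not validate the filter and ignores the
    special rule for topics beginning with '$'.
    """
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")

    for i, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard: only valid as the last level; it matches
            # the parent level and everything below it.
            return i == len(filter_levels) - 1
        if i >= len(topic_levels):
            # The filter has more levels than the topic.
            return False
        if level != "+" and level != topic_levels[i]:
            # A literal level must match exactly; '+' matches any one level.
            return False

    # Without a trailing '#', both must have the same number of levels.
    return len(filter_levels) == len(topic_levels)


# Hypothetical topic names, for illustration only.
print(topic_matches("rae/+/status", "rae/rae-001/status"))
print(topic_matches("rae/rae-001/#", "rae/rae-001/device/d1/logs"))
print(topic_matches("rae/+/status", "rae/rae-001/status/extra"))
```

Because each level of the hierarchy can carry an identifier, a single filter can select one RAE, one device, or a whole subtree, which is what makes per-topic access control practical.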

The Device Management Platform encompasses numerous cloud services that orchestrate tests, execute commands, collect logging information, pump metrics, and more. In keeping with Netflix’s internal adoption and support of Apache Kafka as the standard tooling for message queues, the cloud services on the platform implement Kafka-based message processing. For consistency and ease of development, the Device Management Platform also utilizes a custom-built authorization system that is part of the Netflix infrastructure. In the resulting architecture, the MQTT-based services on the RAE and the Kafka-based services in the Netflix cloud need to continuously send and receive information between each other.

To establish the necessary bridge between the MQTT and Kafka protocols, the Partner Infrastructure team deploys a five-node HiveMQ Enterprise MQTT Broker with the HiveMQ Enterprise Extension for Kafka. To accommodate integration with the Netflix authorization system, the team leveraged HiveMQ’s flexible extension framework to design a bespoke HiveMQ security extension. The custom HiveMQ extension enables users to interact with the platform via web browser or CLI while using the supported Netflix authorization system. The HiveMQ bridge allows MQTT messages from the field to be directly converted to Kafka records in the cloud. Conversely, Kafka messages from the cloud are mapped to the appropriate MQTT topics on the RAE. The RAE and device session identifiers are embedded in the topic of each MQTT message, which allows the Netflix custom HiveMQ extension to apply topic ACLs to precisely control which RAEs and devices each partner can see and interact with.
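The idea of embedding identifiers in the topic can be sketched in a few lines of Python. This is an illustration only, not the Netflix extension or the HiveMQ extension API: the topic layout `rae/<rae_id>/session/<session_id>/<channel>`, the partner-to-RAE table, and the Kafka topic naming scheme are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeviceTopic:
    rae_id: str
    session_id: str
    channel: str


def parse_topic(topic: str) -> Optional[DeviceTopic]:
    """Parse the hypothetical layout rae/<rae_id>/session/<session_id>/<channel>."""
    parts = topic.split("/")
    if len(parts) != 5 or parts[0] != "rae" or parts[2] != "session":
        return None
    if not all(parts[i] for i in (1, 3, 4)):
        return None  # reject empty identifiers
    return DeviceTopic(rae_id=parts[1], session_id=parts[3], channel=parts[4])


# Hypothetical ACL data: which RAEs each partner may see and interact with.
PARTNER_RAES = {
    "partner-a": {"rae-001", "rae-002"},
    "partner-b": {"rae-101"},
}


def is_authorized(partner: str, topic: str) -> bool:
    """Allow access only if the RAE named in the topic belongs to the partner."""
    parsed = parse_topic(topic)
    return parsed is not None and parsed.rae_id in PARTNER_RAES.get(partner, set())


def to_kafka_record(topic: str, payload: bytes) -> dict:
    """Map an MQTT message to a Kafka-style record (assumed naming scheme)."""
    parsed = parse_topic(topic)
    if parsed is None:
        raise ValueError(f"unrecognized topic layout: {topic!r}")
    return {
        "topic": f"device-mgmt.{parsed.channel}",  # assumed Kafka topic name
        "key": parsed.rae_id,                      # keeps one RAE's records together
        "headers": {"session_id": parsed.session_id},
        "value": payload,
    }


mqtt_topic = "rae/rae-001/session/s-42/logs"
print(is_authorized("partner-a", mqtt_topic))
print(is_authorized("partner-b", mqtt_topic))
print(to_kafka_record(mqtt_topic, b"boot ok")["topic"])
```

Keying the record by RAE identifier is one possible design choice: it would keep messages from the same RAE in order within a Kafka partition, while the session identifier travels alongside the payload.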

Automated event-sourced testing with MQTT, HiveMQ, and Kafka

At Netflix, building the best solution is a process of continuous research and innovation. Initially, the Partner Infrastructure team used an IoT cloud platform as the transport plane backing the Device Management Platform. However, as they scaled up, difficulties arose with the cloud platform’s support for MQTT. Issues included dropped MQTT messages, limited scalability of device connection rates, and restrictions on MQTT message size.

After investigating alternative MQTT broker solutions, Netflix conducted internal stress tests with an on-premise MQTT broker, the HiveMQ Enterprise MQTT broker, and their existing IoT cloud service as the control. The month-long benchmarking tested approximately 1,000 Netflix Reference Automation Environments (RAEs) with 5,000 clients per RAE. Ramp-up time was 1 minute with traffic of 100 messages per second and message sizes ranging from 256 KB to 1 MB per message. In addition to being the only broker that provided 100% support for all features of the MQTT protocol, the HiveMQ broker was the most performant on each of the benchmarks. Based on the in-house evaluation results, Netflix selected HiveMQ as the MQTT broker for their Device Management Platform.
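For a rough sense of scale, the benchmark parameters can be combined in a short back-of-the-envelope calculation. The sketch below rests on assumptions the case study does not state: that all clients connect within the ramp-up window, that 100 messages per second is an aggregate rate, and that 1 KB means 1,024 bytes. The derived figures are therefore illustrative, not measured results.

```python
# Figures quoted in the case study.
RAE_COUNT = 1_000
CLIENTS_PER_RAE = 5_000
RAMP_UP_SECONDS = 60
MESSAGES_PER_SECOND = 100          # assumed to be an aggregate rate
MIN_MESSAGE_BYTES = 256 * 1024     # 256 KB, assuming 1 KB = 1,024 bytes
MAX_MESSAGE_BYTES = 1024 * 1024    # 1 MB

# Total simulated client connections across all RAEs.
total_clients = RAE_COUNT * CLIENTS_PER_RAE

# Average connection rate if every client connects during ramp-up (assumption).
connect_rate = total_clients / RAMP_UP_SECONDS

# Payload throughput range in MiB per second at the assumed aggregate rate.
min_throughput_mib = MESSAGES_PER_SECOND * MIN_MESSAGE_BYTES / 2**20
max_throughput_mib = MESSAGES_PER_SECOND * MAX_MESSAGE_BYTES / 2**20

print(f"total clients: {total_clients:,}")
print(f"average connect rate: {connect_rate:,.0f} connections/s")
print(f"payload throughput: {min_throughput_mib:.0f} to {max_throughput_mib:.0f} MiB/s")
```

Under these assumptions the test exercises roughly five million client connections, which helps explain why connection-rate limits and message-size restrictions were decisive in the comparison.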

Netflix relies on HiveMQ to handle the bidirectional movement of data between the MQTT-based services on the RAE, the Kafka-based services in the cloud, and the platform users (CLI or web UI). With the combination of MQTT, HiveMQ, Apache Kafka, and the custom HiveMQ security extension the team created to perform authorization-based topic routing, the Partner Infrastructure team accomplishes truly bidirectional and effective device management at scale.

Achieving scalable, reliable, and secure device management

Gracefully managing the large number of devices that manufacturers submit to Netflix for certification is a significant technical challenge. For the Partner Infrastructure team, it is absolutely critical to keep device information up to date so that testing runs properly. Since the adoption of HiveMQ, the team has been very satisfied with their results and feels that HiveMQ has met all their business requirements. The fluid horizontal scalability of the HiveMQ broker easily handles bursts in demand and HiveMQ’s Kafka integration is a perfect fit for achieving the reliable bidirectional communication the Netflix device certification process requires.

Inder Singh confirms: “Reliability is critical to us. We haven’t encountered any performance issues using the HiveMQ broker. With our new form of testing, all we need to do is provide the partners with the certs and they keep those certs on their tools and then enable remote automation on a device that could be anywhere.”

Helping 400+ consumer electronics and TV operator partners scale their Netflix certification and testing on 1,000+ devices at any given time (in locations around the world) is a complex undertaking. The use of HiveMQ has allowed the Partner Infrastructure team to streamline data transport in the Device Management Platform into a flat and reliable mesh (previously, data exchange was point-to-point and multi-point). This topology greatly simplifies device test execution, edge monitoring, and data collection from the RAE. One such benefit is that the team can now enable continuous integration workflows for both internal use and partners. Additionally, being able to extend HiveMQ functionality with a custom extension has opened the possibility for all authorized participants to securely interact with the platform, effectively scaling the platform’s evolution.

Benson Ma sums up: “The adoption of HiveMQ has allowed us to move to a new paradigm of testing. Previously, all testing required a user to navigate to a website and manually launch tests. With the MQTT architecture that HiveMQ supports, we have made testing automatable via a modern CLI. This approach enables much more scalable testing and continuous integration that has empowered quite a few teams both inside and outside of Netflix.”