
How to Compute MTBF, MTTR, and Availability in Real-Time Without a Separate Data Stack

by Dr. Ankit Chaudhary, Sven Kobow
24 min read

Why Are Reliability KPIs Like MTBF and MTTR So Often Miscomputed?

Reliability KPIs are miscomputed not because the math is hard, but because the definitions are. MTBF, MTTR, and availability are among the oldest reliability KPIs in the industry and, at the same time, among the most consistently miscomputed. The question hiding inside each of them is harder than it looks: what counts as a failure, what counts as recovery, and what counts as operational time?

The three KPIs are different views of the same equipment timeline, and each depends on a definition that the math alone cannot supply.

  • MTBF is used as an indicator of equipment reliability, i.e., how often a piece of equipment fails. It needs a failure definition.

  • MTTR is used as an indicator of maintenance responsiveness, i.e., how quickly equipment is restored after a failure. It needs a recovery definition, a clear answer to "back in service, as of when?"

  • Availability is used as an indicator of overall operational performance, i.e., the share of scheduled production time during which the equipment is actually running. It needs both, plus a definition of scheduled operational time.
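To make the three views concrete, here is a minimal Python sketch that derives all three KPIs from one labeled timeline. The "up" and "down" labels, the `Interval` type, and the choice to treat the whole timeline as scheduled time are assumptions made for illustration. The formulas are the common textbook forms (uptime per failure, downtime per repair, uptime over scheduled time), not a definition taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """One contiguous stretch of the equipment timeline, in hours."""
    state: str    # "up" or "down" (hypothetical labels)
    hours: float


def reliability_kpis(timeline: list[Interval]) -> dict[str, float]:
    """Compute MTBF, MTTR, and availability from a labeled timeline.

    Assumes every hour in the timeline is scheduled production time and
    that each "down" interval corresponds to exactly one failure.
    """
    uptime = sum(i.hours for i in timeline if i.state == "up")
    downtime = sum(i.hours for i in timeline if i.state == "down")
    failures = sum(1 for i in timeline if i.state == "down")
    if failures == 0:
        raise ValueError("no failures in the timeline; MTBF and MTTR are undefined")
    return {
        "mtbf": uptime / failures,                       # mean time between failures
        "mttr": downtime / failures,                     # mean time to repair
        "availability": uptime / (uptime + downtime),    # share of scheduled time
    }


timeline = [
    Interval("up", 40.0), Interval("down", 2.0),
    Interval("up", 50.0), Interval("down", 4.0),
    Interval("up", 4.0),
]
kpis = reliability_kpis(timeline)
print(kpis)  # mtbf 47.0 h, mttr 3.0 h, availability 0.94
```

The arithmetic is trivial. Everything that is hard about these KPIs sits in deciding which stretches of the timeline get labeled "up", "down", or excluded altogether.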

Anyone who has worked across multiple OT vendors knows the answers are never universal. The definitional gap shows up differently for each KPI.

For failure definitions, a Siemens drive signals differently from an ABB drive. Fault dictionaries run to hundreds of codes per vendor. Failure for MTBF purposes is almost always a subset of those codes, filtered by severity, sometimes gated by duration, sometimes counted only if the equipment was supposed to be running. The same fault code can mean failure on one line and expected behavior on another.

For recovery definitions, some equipment signals it cleanly. The fault clears, the status returns to running, and the down interval has a well-defined end. Other equipment remains in an error state until a reset is acknowledged or is considered recovered only after a specific reinitialization sequence completes. Reasonable engineers at the same plant will disagree on which signal counts.

For operational time, planned maintenance windows usually don't count as downtime, but tooling doesn't handle that uniformly across vendors. Mode changes, shift boundaries, and scheduled stoppages each require proper handling and equipment-specific knowledge.

In each case, the definition is equipment-specific and engineer-judgment-specific. The reliability engineer is the person who knows the answer. Existing tooling forces the reliability engineer to extract raw data from a historian or an MES platform and use a spreadsheet or similar tools to apply their custom formulas to compute the final KPIs.

Section summary: Reliability KPIs are miscomputed not because the math is hard but because their definitions vary by vendor, by equipment, and by engineer judgment. The person who knows the right definition, the reliability engineer, is forced to extract raw data from historians or MES platforms and rebuild the calculation in spreadsheets.

Why Do Current Approaches Fall Short for Reliability KPIs?

Current approaches struggle because they solve the infrastructure problem, not the definition problem. What "the current approach" looks like depends on where the manufacturer is on their digital journey. Common variations include:

  • PLC and HMI to spreadsheet. Data exported manually or via CSV, KPIs computed in Excel by a reliability engineer or shift supervisor.

  • PLC to historian or time-series DB. Data captured but reliability KPIs computed downstream in a separate analytics tool.

  • PLC to SCADA, MES, or a dedicated point solution. Each system computes its own version of the KPI, and they disagree.

  • PLC to MQTT broker to streaming stack (Kafka, Flink or Spark, time-series DB, BI layer). The modern data-platform approach, more common at digitally mature manufacturers.

What all four share is that the KPI definition lives downstream of the reliability engineer, in a system the engineer doesn't directly control. Whether the path is fragmented or modern, the definitional problem is the same.

The fragmentation pattern is worth dwelling on, because it compounds at scale. Each site typically runs its own historian, capturing data from local production equipment. KPIs are computed at the bottom of the operational hierarchy (machine, shift, line) and rolled up into broader metrics for the area, the site, and the enterprise. When each layer of that rollup uses a slightly different definition of failure or recovery, the enterprise-level number is the average of averages of averages, each built on incompatible inputs. The aggregated MTBF agrees with nothing the plant floor reports.
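The rollup distortion can be seen with a small arithmetic example. The Python sketch below uses two hypothetical lines with made-up operating hours and failure counts, and compares averaging the per-line MTBFs against pooling the raw hours and failures. It assumes both lines already share one failure definition, so it shows only the averaging effect. Incompatible definitions would add a second distortion on top.

```python
# Hypothetical per-line data: (total operating hours, number of failures).
lines = {
    "line_a": (1000.0, 2),    # MTBF 500 h
    "line_b": (1000.0, 20),   # MTBF 50 h
}

per_line_mtbf = {
    name: hours / failures for name, (hours, failures) in lines.items()
}

# Rollup 1: average the per-line averages.
mean_of_means = sum(per_line_mtbf.values()) / len(per_line_mtbf)

# Rollup 2: pool the raw hours and failures, then divide once.
total_hours = sum(hours for hours, _ in lines.values())
total_failures = sum(failures for _, failures in lines.values())
pooled_mtbf = total_hours / total_failures

print(per_line_mtbf)          # {'line_a': 500.0, 'line_b': 50.0}
print(mean_of_means)          # 275.0
print(round(pooled_mtbf, 1))  # 90.9
```

The two rollups differ by a factor of three on identical inputs, which is why the site-level number needs to be built from the underlying intervals rather than from the averages below it.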

In practice, the lack of expressive, consistent definition produces one of three outcomes:

  1. Reliability KPIs computed in spreadsheets from periodic CSV exports. The numbers arrive days late, and only the person who built the spreadsheet knows what they mean.

  2. Reliability KPIs computed generically. Every fault code counts as a failure, every state change counts as recovery, every scheduled hour counts as operational time. This produces numbers nobody trusts.

  3. Reliability KPIs computed correctly but expensively. The platform team owns a stream-processing job for each KPI definition. The reliability team cannot read or modify any of them. Adding a new KPI variant means filing a ticket.

None of these is a good outcome. The missing capability is not more infrastructure. It is letting the reliability engineer write the definition directly, in a form the system can execute, and in a vocabulary that travels consistently from the machine to the enterprise.

Section summary: The traditional stack, whether fragmented or modern, removes the reliability engineer from the definition and makes consistent rollup across the hierarchy impossible. Both effects undermine trust in the resulting KPIs.

How Does HiveMQ Pulse Compute Reliability KPIs in Real-Time?

HiveMQ Pulse computes reliability KPIs through a single expression, BetweenState. The expression captures the intervals during which a user-written Boolean predicate holds true over a stream of operational data, then applies an aggregation function to the data collected in each interval. Because these computations are pushed directly to the HiveMQ Brokers, HiveMQ Pulse simplifies KPI definition without requiring an additional infrastructure stack.

The expression has two parts:

BetweenState(<Predicate>, <AggregationFunction>)

The predicate is a user-written Boolean expression that defines a state of interest. Over an unbounded stream, a state is represented as a window. Unlike a time-based or count-based window, this one is defined by the data itself. It opens when the predicate becomes true, captures every consecutive tuple that satisfies it, and closes when the predicate becomes false again.

The aggregation function specifies what to compute over the data points within that window. For reliability KPIs, the relevant aggregation is the elapsed time between the window's open and close, which gives the raw interval for the KPI averages. The same mechanism extends to any aggregation over the captured tuples, such as event counts, total runtime, or maximum temperature.
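The window semantics described above can be sketched in a few lines of Python. This is an illustration of a data-defined window, not HiveMQ Pulse's implementation or API: the `Reading` type, the `between_state` generator, and both aggregation functions are hypothetical names. It also assumes that elapsed time is measured up to the first reading at which the predicate is false again, and that a window still open when the stream ends is not emitted.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Reading:
    ts: float      # event time in seconds
    value: dict    # payload fields


def between_state(
    stream: Iterable[Reading],
    predicate: Callable[[Reading], bool],
    aggregate: Callable[[list[Reading], float], float],
) -> Iterator[float]:
    """Yield one aggregate per maximal run of readings satisfying the predicate."""
    window: list[Reading] = []
    for reading in stream:
        if predicate(reading):
            window.append(reading)               # window opens or stays open
        elif window:
            yield aggregate(window, reading.ts)  # predicate turned false: close
            window = []


def time_elapsed(window: list[Reading], closed_at: float) -> float:
    """Seconds from the first reading in the window to the closing reading."""
    return closed_at - window[0].ts


def max_temperature(window: list[Reading], closed_at: float) -> float:
    """Highest temperature seen inside the window."""
    return max(r.value["temp"] for r in window)


def is_fault(reading: Reading) -> bool:
    return reading.value["fault"]


stream = [
    Reading(0.0, {"fault": False, "temp": 60.0}),
    Reading(10.0, {"fault": True, "temp": 71.0}),
    Reading(20.0, {"fault": True, "temp": 85.0}),
    Reading(35.0, {"fault": False, "temp": 66.0}),
    Reading(50.0, {"fault": True, "temp": 90.0}),
    Reading(55.0, {"fault": False, "temp": 62.0}),
]

durations = list(between_state(stream, is_fault, time_elapsed))
peaks = list(between_state(stream, is_fault, max_temperature))
print(durations)  # [25.0, 5.0]
print(peaks)      # [85.0, 90.0]
```

Swapping the aggregation function changes what is computed per window without touching the predicate, which is the separation the two-part expression relies on.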

For MTBF and MTTR specifically, the expression composes naturally as a windowed mean of the elapsed times of the intervals for which the predicate held true, averaged either over a fixed count of such intervals or over a fixed time horizon:

Mean(BetweenState(<Predicate>, TimeElapsed), <Duration | Count>)

A 24-hour window gives a rolling 24-hour view. A 100-interval window gives a per-100-event view for assets where wear cycles matter more than calendar time. Figure 1 (a) shows an example Unified Namespace (UNS) for a packaging factory. Figure 1 (b) and (c) show how MTBF can be calculated in HiveMQ Pulse using the BetweenState and simple moving average (SMA) functions. Reliability teams typically want short and long windows side by side. Short windows surface the bad shift. Long windows surface the real trend. The gap between them is often where the interesting story resides. To this end, teams can define MTBF over different interval counts using the same TBF tag.

Figure 1. (a) UNS topic tree for an example packaging factory. (b) A Pulse tag using BetweenState to compute the time-between-failure (TBF) intervals from the fault data points published by a line asset. (c) A Pulse tag that computes the Mean Time Between Failures (MTBF) by taking a moving average over every 100 TBF intervals from (b).
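The difference between a count-based and a duration-based mean, and the value of reading short and long windows side by side, can be illustrated with a short Python sketch. The function names and the sample TBF intervals are hypothetical, and each interval is represented as a pair of (time the interval closed, interval length), both in hours. This is not Pulse syntax.

```python
def mean_over_count(intervals: list[tuple[float, float]], count: int) -> float:
    """Mean length of the most recent `count` intervals."""
    if count <= 0:
        raise ValueError("count must be positive")
    recent = [length for _, length in intervals[-count:]]
    if not recent:
        raise ValueError("no intervals available")
    return sum(recent) / len(recent)


def mean_over_duration(
    intervals: list[tuple[float, float]], now: float, horizon: float
) -> float:
    """Mean length of intervals that closed within the last `horizon` hours."""
    recent = [
        length for closed_at, length in intervals
        if now - horizon <= closed_at <= now
    ]
    if not recent:
        raise ValueError("no intervals closed inside the horizon")
    return sum(recent) / len(recent)


# Hypothetical TBF intervals: (closed_at_hour, length_in_hours), oldest first.
tbf = [(10.0, 8.0), (30.0, 12.0), (40.0, 10.0), (44.0, 2.0), (47.0, 3.0)]

short_view = mean_over_count(tbf, 2)                # last two intervals
long_view = mean_over_count(tbf, 5)                 # last five intervals
last_24h = mean_over_duration(tbf, now=48.0, horizon=24.0)

print(short_view)  # 2.5
print(long_view)   # 7.0
print(last_24h)    # 6.75
```

In this made-up data the short view (2.5 h) sits well below the long view (7.0 h), which is the kind of gap that flags a bad shift before it moves the long-run trend.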

Section summary: BetweenState is one expression that captures data-defined intervals and aggregates over them. It is the foundation of every reliability KPI in this post.

Can One Expression Compute MTBF, MTTR, and Availability?

Yes. The same BetweenState expression produces all three reliability KPIs when paired with different predicates.

MTBF uses a failure predicate and TimeElapsed. The windowed mean of the resulting intervals is the MTBF for the chosen horizon:

Mean(BetweenState(<failure_predicate>, TimeElapsed), <Duration | Count>)

MTTR uses a recovery predicate and the same aggregation. The shape of the expression is identical to MTBF. Only the predicate changes:

Mean(BetweenState(<recovery_predicate>, TimeElapsed), <Duration | Count>)

Availability is composed from the same primitives: it is the ratio of the time spent in operational-state intervals (defined by an operational predicate) to the scheduled time over a window. It uses the same BetweenState building block. Only the surrounding composition changes.
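As a rough illustration of that composition, the Python sketch below sums the time spent in two predicate-defined states and divides one by the other. The helper `time_in_state`, the sample fields, and the choice of `scheduled_mode == "PRODUCTION"` as the definition of scheduled time are all assumptions for the example, not Pulse expressions. Timestamps are in minutes.

```python
from typing import Callable


def time_in_state(samples: list[dict], predicate: Callable[[dict], bool]) -> float:
    """Total time the predicate holds, summed over all of its intervals.

    Each interval runs from the first sample where the predicate is true to
    the first later sample where it is false.
    """
    total = 0.0
    opened_at = None
    for sample in samples:
        holds = predicate(sample)
        if holds and opened_at is None:
            opened_at = sample["ts"]
        elif not holds and opened_at is not None:
            total += sample["ts"] - opened_at
            opened_at = None
    return total


def is_scheduled(s: dict) -> bool:
    return s["scheduled_mode"] == "PRODUCTION"


def is_operational(s: dict) -> bool:
    return s["running"] and s["scheduled_mode"] == "PRODUCTION"


samples = [
    {"ts": 0.0, "running": True, "scheduled_mode": "PRODUCTION"},
    {"ts": 100.0, "running": False, "scheduled_mode": "PRODUCTION"},   # fault
    {"ts": 120.0, "running": True, "scheduled_mode": "PRODUCTION"},
    {"ts": 400.0, "running": False, "scheduled_mode": "MAINTENANCE"},  # planned
    {"ts": 460.0, "running": True, "scheduled_mode": "PRODUCTION"},
    {"ts": 480.0, "running": False, "scheduled_mode": "OFF"},          # shift end
]

operational = time_in_state(samples, is_operational)
scheduled = time_in_state(samples, is_scheduled)
availability = operational / scheduled

print(operational, scheduled)   # 400.0 420.0
print(round(availability, 3))   # 0.952
```

The 60-minute maintenance window appears in neither the numerator nor the denominator, so it does not count against availability. The 20-minute fault appears only in the denominator.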

This is the architectural point worth dwelling on. HiveMQ Pulse does not ship MTBF, MTTR, or availability features. It ships an expression vocabulary in which your reliability team defines all three using the same primitive, with the predicate as the only thing that changes from one KPI to the next.

Section summary: One expression, three predicates, three KPIs. The same approach extends to any reliability or production KPI built on state intervals.

Where Does the Engineer's Knowledge Actually Live?

It lives in the predicate. The predicate is the place where equipment-specific and vendor-specific judgment becomes executable. Below are three realistic predicate styles, one for each of the three definitions a reliability program needs.

One of these examples uses PackML. PackML (ISA-TR88.00.02) is a manufacturing standard that defines a common state machine for production equipment, with named states such as Execute, Aborted, and Clearing that have the same meaning across vendors.

Defining Failure: a Vendor-Specific Fault Code Subset

Used as the failure predicate for MTBF. Failure here is a curated list of codes the reliability team has identified as meaningful, gated on severity:

failure_predicate := (fault_code == F0023 OR fault_code == F0047 OR fault_code == F0102) AND severity >= 2

This is the most common shape of a failure predicate. The reliability engineer maintains the code subset based on root cause analysis of prior incidents. When the team learns a new fault code is operationally significant, they add it to the predicate, and the MTBF starts counting it immediately.
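For readers who think in code, the same logic can be written as a plain Python function. This is only a mirror of the predicate above for illustration: the event dictionaries and their field layout are hypothetical, and fault codes are treated as strings.

```python
# Curated subset maintained by the reliability team (codes from the predicate above).
SIGNIFICANT_FAULTS = frozenset({"F0023", "F0047", "F0102"})
MIN_SEVERITY = 2


def is_failure(event: dict) -> bool:
    """True only for a curated fault code at or above the severity gate."""
    return (
        event["fault_code"] in SIGNIFICANT_FAULTS
        and event["severity"] >= MIN_SEVERITY
    )


events = [
    {"fault_code": "F0023", "severity": 3},  # curated code, severe enough
    {"fault_code": "F0023", "severity": 1},  # curated code, below the gate
    {"fault_code": "F0999", "severity": 4},  # severe, but not in the subset
    {"fault_code": "F0102", "severity": 2},  # curated code, exactly at the gate
]

flags = [is_failure(e) for e in events]
print(flags)  # [True, False, False, True]
```

Adding a newly significant code is a one-line change to the set, which matches the maintenance workflow described above.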

Defining Recovery: a PackML State Sequence on a Pharmaceutical Filling Line

For a pharmaceutical filling line, the recovery interval starts when the machine enters Clearing, the PackML state in which the operator has acknowledged the fault and active repair is underway, and ends when the machine returns to Execute. Used as the recovery predicate for MTTR:

recovery_predicate := PackML_state == "Clearing" AND mode == "Production"

The MTTR for the line is the windowed mean of how long the machine stays in this state. The same expression works over any state vocabulary the equipment exposes. On PackML-compliant equipment, the predicate references PackML states. On custom equipment, the predicate references custom states. The expression is the same.
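Where the recovery interval closes is itself a definitional choice. The Python sketch below runs the same interval capture with two hypothetical predicates over a made-up PackML state trace: one that holds only while the machine is in Clearing, and a broader variant that keeps holding through the restart states until the machine is back in Execute. The helper names, the trace, and the `RESTART_PATH` set are assumptions for illustration. Timestamps are in seconds.

```python
from statistics import mean
from typing import Callable


def elapsed_intervals(samples: list[dict], predicate: Callable[[dict], bool]) -> list[float]:
    """Length of each interval during which the predicate holds."""
    durations = []
    opened_at = None
    for s in samples:
        holds = predicate(s)
        if holds and opened_at is None:
            opened_at = s["ts"]
        elif not holds and opened_at is not None:
            durations.append(s["ts"] - opened_at)
            opened_at = None
    return durations


def in_clearing(s: dict) -> bool:
    """Mirror of the predicate above: interval closes on leaving Clearing."""
    return s["PackML_state"] == "Clearing" and s["mode"] == "Production"


# Hypothetical broader definition: interval closes on reaching Execute.
RESTART_PATH = frozenset({"Clearing", "Stopped", "Resetting", "Idle", "Starting"})


def until_execute(s: dict) -> bool:
    return s["PackML_state"] in RESTART_PATH and s["mode"] == "Production"


trace = [
    (0.0, "Execute"), (290.0, "Aborted"), (300.0, "Clearing"), (420.0, "Stopped"),
    (430.0, "Resetting"), (440.0, "Idle"), (450.0, "Starting"), (455.0, "Execute"),
    (990.0, "Aborted"), (1000.0, "Clearing"), (1060.0, "Stopped"),
    (1070.0, "Resetting"), (1080.0, "Idle"), (1090.0, "Starting"), (1100.0, "Execute"),
]
samples = [{"ts": ts, "PackML_state": state, "mode": "Production"} for ts, state in trace]

clearing_only = elapsed_intervals(samples, in_clearing)
through_restart = elapsed_intervals(samples, until_execute)

print(clearing_only, mean(clearing_only))      # [120.0, 60.0] 90.0
print(through_restart, mean(through_restart))  # [155.0, 100.0] 127.5
```

Both numbers are defensible MTTR values for the same two incidents. Which one the plant reports depends on the predicate the reliability engineer chooses, not on the capture mechanism.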

Defining Operational Time: a Mode-Gated Production State

Used to define operational time for availability, and equally useful as a gating clause inside other predicates to exclude planned maintenance windows:

operational_predicate := running == true AND scheduled_mode == "PRODUCTION"

Embedded as a gating clause in a failure or recovery predicate, it keeps planned maintenance windows out of both MTBF and MTTR.
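A minimal Python sketch of the gating idea follows. It reuses only the mode clause of the operational predicate, on the assumption that `running` is typically false while a fault is active and would therefore suppress every failure if included in the gate. The function names and sample fields are hypothetical.

```python
SIGNIFICANT_FAULTS = frozenset({"F0023", "F0047", "F0102"})


def in_production_mode(s: dict) -> bool:
    """Mode clause of the operational predicate, used here as the gate."""
    return s["scheduled_mode"] == "PRODUCTION"


def is_fault(s: dict) -> bool:
    return s["fault_code"] in SIGNIFICANT_FAULTS and s["severity"] >= 2


def is_failure(s: dict) -> bool:
    """A fault counts as a failure only inside scheduled production."""
    return is_fault(s) and in_production_mode(s)


samples = [
    {"fault_code": "F0047", "severity": 3, "scheduled_mode": "PRODUCTION"},   # counted
    {"fault_code": "F0047", "severity": 3, "scheduled_mode": "MAINTENANCE"},  # planned window
    {"fault_code": None, "severity": 0, "scheduled_mode": "PRODUCTION"},      # no fault
]

counted = [is_failure(s) for s in samples]
print(counted)  # [True, False, False]
```

The same fault raised during a maintenance window is ignored, so a technician tripping a drive during planned work does not shorten the reported MTBF.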

Section summary: Across all three KPIs, the same pattern holds. The vendor-specific and equipment-specific knowledge lives in a predicate that the reliability engineer writes, not in code that the platform team has to maintain.

What's Next for Reliability and Production KPIs in HiveMQ Pulse?

Lighthouse customers are collapsing the traditional reliability stack into a single HiveMQ Pulse deployment alongside their existing dashboards. 

A consistent pattern shows up across discrete manufacturing, process industries, and energy. Teams that were running an MQTT broker, a streaming layer, and a time-series database to compute reliability KPIs are consolidating these into a single platform alongside their existing visualization layer. The number of systems involved drops. The latency from the event to KPI update drops with it.

The change that surprises platform teams most is time-to-definition. What used to be a multi-week cross-team project is now a configuration change that a reliability engineer makes themselves.

What this looks like depends on the role you play:

  • For reliability engineers and plant-side OT leads: You write the predicate. Your fault-code subset, your duration thresholds, your mode gates, your recovery conditions, all of that becomes a predicate you can read, edit, and version. The MTBF, MTTR, and availability numbers on the dashboard are computed from definitions you authored, not from generic interpretations inherited from a vendor or a platform team.

  • For solution architects at OT vendors and systems integrators: Reliability KPIs become part of your broker integration, not a parallel streaming project. The predicate model also means you can ship vendor-specific predicate libraries to your customers, such as a Siemens predicate set or an ABB predicate set, and let them compose and refine from there.

  • For platform and data engineers evaluating Pulse: The property worth examining is composability. The same primitive covers MTBF, MTTR, and availability. You are not adopting a single-purpose KPI engine. You are adopting a stateful operator vocabulary that applies uniformly to every KPI you build on it.

Private beta access involves working directly with the HiveMQ Pulse team on your reliability use case. To start that conversation, write to sales@hivemq.com.

Section summary: Private beta customers see fewer systems, faster time-to-definition, and reliability KPIs that come from definitions the reliability team owns.

Ready to compute reliability KPIs on your own terms? Contact sales@hivemq.com to request access to the HiveMQ Pulse private beta.


Dr. Ankit Chaudhary

Dr. Ankit Chaudhary is a Senior Software Engineer at HiveMQ, based in Berlin, where he works on HiveMQ Pulse, the company's distributed data intelligence platform. His focus is on real-time stream processing and in-flight data transformation from the edge to the cloud.

Ankit holds a PhD in data management systems from TU Berlin, with a particular focus on stream processing systems. As a founding engineer of the NebulaStream platform, he built its query optimizer and worked to keep continuous queries efficient across edge and cloud environments. His research has earned awards at top-tier database conferences, including a Best Paper Award at ICDE 2025, and has been published in venues such as VLDB, SIGMOD, EDBT and CIDR. Before his PhD, he spent more than a decade in industry building real-time data systems.


Sven Kobow

Sven Kobow is a Staff Industry Architect at HiveMQ with more than two decades of experience in IT and IIoT.

In this role, he bridges strategic vision and technical implementation, designing reference architectures and deployment patterns that help industrial enterprises build on HiveMQ's full platform stack. He works deeply in customer domains to architect Unified Namespace solutions and event-driven data platforms, with a consistent focus on extracting real business value from operational data. Before joining HiveMQ, he worked in the automotive sector at a major OEM.
