What is MTTR, and how is it calculated?

MTTR (Mean Time To Repair, sometimes called Mean Time To Recovery) is the average time required to restore a piece of equipment to operational condition following a failure. It is a measure of maintainability: the lower the MTTR, the faster an organization recovers when failures occur. The formula is total repair time divided by the number of repairs in a given period. Like MTBF, the math is straightforward - the difficulty lies in the definitions. Three boundaries need to be established explicitly: when does the repair clock start (at the moment the fault is detected, when an operator acknowledges the alarm, or when a technician arrives at the equipment?), when does it stop (when the machine restarts, when it produces its first good unit, or when it passes a qualification run?), and whether waiting time - for a spare part, for maintenance coverage, for a shift handover - counts as repair time or is tracked separately. In regulated industries such as pharmaceutical manufacturing, restart alone is typically not sufficient to end the repair interval; the asset must pass re-qualification before production resumes, and that time is properly included in MTTR.
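To make the effect of those boundaries concrete, here is a minimal Python sketch. The `Repair` record, its field names, and the sample timestamps are hypothetical and chosen only for illustration; the arithmetic is the total-repair-time-over-repair-count formula described above.

```python
from dataclasses import dataclass


@dataclass
class Repair:
    # All values are minutes since an arbitrary reference point (illustrative data).
    detected: float      # fault detected
    tech_arrived: float  # technician reaches the equipment
    restarted: float     # machine restarts
    requalified: float   # machine passes its qualification run


def mttr(repairs, start="detected", stop="restarted"):
    """Total repair time divided by the number of repairs.

    `start` and `stop` name the fields that open and close the repair clock.
    """
    if not repairs:
        raise ValueError("MTTR is undefined when there are no repairs")
    total = sum(getattr(r, stop) - getattr(r, start) for r in repairs)
    return total / len(repairs)


repairs = [
    Repair(detected=0, tech_arrived=10, restarted=40, requalified=70),
    Repair(detected=200, tech_arrived=205, restarted=230, requalified=250),
]

mttr_restart = mttr(repairs)                         # detection to restart
mttr_hands_on = mttr(repairs, start="tech_arrived")  # excludes waiting for a technician
mttr_requal = mttr(repairs, stop="requalified")      # includes re-qualification
print(mttr_restart, mttr_hands_on, mttr_requal)
```

The same two repairs give 35, 27.5, or 60 minutes depending only on where the clock starts and stops, which is why the boundaries have to be written down before the numbers are compared.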

What is MTBF, and how is it calculated?

MTBF (Mean Time Between Failures) is the average time a piece of equipment operates between successive failures. It is calculated by dividing the total operational time by the number of qualifying failures in a given period. The critical variable is the definition of failure, which is always specific to the asset, line, and operational context. A generic definition that counts every fault code as a failure produces numbers that reliability teams rightly distrust.
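A small Python sketch of that calculation, assuming a list of fault events and a caller-supplied failure definition. The event layout and the warning codes `W0001` and `W0002` are hypothetical; only the division of operational time by qualifying failures comes from the definition above.

```python
def mtbf(operating_hours, fault_events, is_failure):
    """Operational time divided by the number of qualifying failures."""
    failures = [e for e in fault_events if is_failure(e)]
    if not failures:
        return None  # no qualifying failure in the period, so MTBF is undefined
    return operating_hours / len(failures)


# Illustrative fault log for one asset over 160 operating hours.
events = [
    {"code": "F0023", "severity": 3},
    {"code": "W0001", "severity": 1},  # hypothetical warning code
    {"code": "W0002", "severity": 1},  # hypothetical warning code
    {"code": "F0047", "severity": 2},
]

generic = mtbf(160.0, events, lambda e: True)                # every fault code counts
curated = mtbf(160.0, events, lambda e: e["severity"] >= 2)  # severity-gated definition
print(generic, curated)
```

With identical data, the generic definition reports 40 hours and the severity-gated one reports 80 hours. Neither is wrong arithmetically; they answer different questions.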

What is the difference between TBF and MTBF?

Time Between Failures (TBF) is the raw interval between two successive failure events. MTBF is the mean of those intervals over a defined time window. In HiveMQ Pulse, BetweenState(failure_predicate, TimeElapsed) produces the individual TBF values. A windowed mean over those values yields MTBF. The distinction matters because a single long TBF interval can mask a cluster of short ones if only the mean is tracked.
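A quick illustration of that masking effect in Python, using made-up TBF values. The 4-hour threshold for a "short" interval is an arbitrary assumption for the example.

```python
from statistics import mean, median

# Illustrative TBF intervals in hours: four short ones and one long one.
tbf_hours = [2.0, 3.0, 1.0, 2.0, 92.0]

mtbf_hours = mean(tbf_hours)        # the headline number
median_tbf = median(tbf_hours)      # what a typical interval looks like
short_intervals = sum(1 for t in tbf_hours if t < 4.0)

print(mtbf_hours, median_tbf, short_intervals)
```

The mean comes out at 20 hours even though four of the five intervals were under 4 hours, which is the kind of cluster that only shows up when the individual TBF values are kept alongside the mean.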

What is considered a good MTBF and good MTTR?

There is no universal benchmark - a "good" MTBF for a high-speed pharmaceutical filling line is not the same as a good MTBF for a welding robot in automotive assembly, because the asset types, operational profiles, and acceptable failure rates differ fundamentally. As a directional principle: higher MTBF is better (failures occur less frequently) and lower MTTR is better (recovery is faster). In practice, the most meaningful benchmark is your own historical baseline - is MTBF trending up or down over rolling windows? - and peer comparison with sites running the same asset class under comparable operating conditions. Published targets vary widely by industry: discrete manufacturing teams often target MTTR under 45 minutes for production-critical equipment; highly automated process lines in pharma or food and beverage may set tighter thresholds because unplanned stops carry regulatory and batch-loss implications. What matters more than any specific number is that the definition of failure used to compute MTBF is consistent over time and across sites - so that an improvement in the number reflects an actual improvement in reliability, not a change in what gets counted.

Can MTBF and MTTR be used to improve OEE?

Yes, directly. OEE (Overall Equipment Effectiveness) decomposes equipment performance into three components: Availability, Performance, and Quality. The Availability component is derived from MTBF and MTTR using the formula Availability = MTBF / (MTBF + MTTR). That means improving MTBF (reducing how often failures occur) or reducing MTTR (recovering faster when they do) translates directly into higher Availability, and by extension higher OEE. The two metrics also point to different improvement levers: a low MTBF typically signals a reliability or maintenance engineering problem - failure is occurring too frequently - while a high MTTR points to a maintenance operations problem - recovery is too slow. In HiveMQ Pulse, both can be tracked in real time using the same BetweenState operator with different predicates, which means the OEE Availability picture is always current rather than a number extracted from last week's batch export. Performance and Quality - the other two OEE components - can be derived from the same operator vocabulary using Count-based aggregations, so OEE as a whole becomes a natural extension of the same reliability computation, not a separate system to run alongside it.
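As a worked example of the Availability formula, the following Python sketch compares the two improvement levers. The MTBF, MTTR, Performance, and Quality figures are illustrative, and the OEE product is the standard Availability x Performance x Quality definition.

```python
def availability(mtbf, mttr):
    """Availability = MTBF / (MTBF + MTTR); both inputs must use the same time unit."""
    return mtbf / (mtbf + mttr)


def oee(availability_value, performance, quality):
    """OEE as the product of its three components."""
    return availability_value * performance * quality


base = availability(mtbf=95.0, mttr=5.0)             # baseline, hours
fewer_failures = availability(mtbf=195.0, mttr=5.0)  # reliability lever: raise MTBF
faster_recovery = availability(mtbf=95.0, mttr=2.5)  # operations lever: halve MTTR

print(round(base, 4), round(fewer_failures, 4), round(faster_recovery, 4))
print(round(oee(base, performance=0.90, quality=0.99), 4))
```

In this example, halving MTTR recovers almost as much Availability as roughly doubling MTBF, which is one reason to track the two metrics separately rather than only their ratio.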

How does HiveMQ Pulse compute MTBF differently from traditional approaches?

HiveMQ Pulse uses a BetweenState expression that evaluates a Boolean predicate, written by the reliability engineer, to define what constitutes a failure event. The expression emits time intervals between consecutive failures. A windowed mean of those intervals yields MTBF. The computation runs inside the HiveMQ platform itself, without forwarding data to an external stream processor or time-series database.

Can the same approach compute MTTR and availability?

Yes. MTTR uses the same BetweenState expression with a recovery predicate; for example, the entry into the PackML Clearing state on a pharmaceutical filling line. Availability is derived from the ratio of accumulated operational intervals to total scheduled production time over a window. All three KPIs share the same underlying expression. Adding MTTR or availability once MTBF is running requires writing a new predicate, not standing up new infrastructure.

How does PackML help standardize MTBF definitions across vendors?

PackML (defined in ISA-TR88.00.02) provides a common state machine for manufacturing equipment. Aborted marks unplanned failures requiring operator intervention, Execute marks active production, while the Clearing, Resetting, Starting, and Execute sequence marks the repair path. Because both the failure and recovery predicates can be grounded in these standardized states, the same predicate definition works across PackML-compliant equipment from different vendors. A pharmaceutical manufacturer can apply a single MTBF definition across filling lines from multiple suppliers without a per-vendor translation layer.

What happens when PackML implementations differ across vendors?

PackML defines the state machine but leaves implementation latitude to vendors. Edge cases, such as when a machine uses Suspended rather than Aborted for certain fault conditions, can introduce asymmetry in failure counts across otherwise comparable lines. Pulse expressions handle this at the predicate level. A refinement can be added for a specific vendor's behavior without modifying the underlying pipeline. That's a configuration change the reliability engineer controls directly.
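The idea of a predicate-level refinement can be sketched in Python as follows. This is not Pulse syntax; the message fields (`vendor`, `packml_state`, `fault_active`) and the vendor names are hypothetical, and the sketch only shows how a vendor-specific clause can sit next to a shared base definition.

```python
def base_failure(msg):
    """Shared definition: an unplanned failure is the PackML Aborted state."""
    return msg["packml_state"] == "Aborted"


def vendor_b_failure(msg):
    """Hypothetical refinement for a vendor that reports some faults as Suspended."""
    suspended_fault = msg["packml_state"] == "Suspended" and msg.get("fault_active", False)
    return base_failure(msg) or suspended_fault


# Per-vendor overrides; anything not listed falls back to the base definition.
PREDICATES = {"vendor_b": vendor_b_failure}


def is_failure(msg):
    return PREDICATES.get(msg["vendor"], base_failure)(msg)


print(is_failure({"vendor": "vendor_a", "packml_state": "Aborted"}))
print(is_failure({"vendor": "vendor_b", "packml_state": "Suspended", "fault_active": True}))
```

The refinement touches only the predicate for one vendor; everything downstream of the predicate stays as it was.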

Can reliability KPIs be calculated without Kafka or Flink?

Yes. Reliability KPIs such as MTBF, MTTR, and availability do not require Kafka, Flink, or a separate analytics stack. With HiveMQ Pulse, reliability engineers can define failure, recovery, and operational states directly and compute these KPIs in real time from streaming operational data.

How do reliability KPIs support predictive maintenance?

Reliability KPIs such as MTBF, MTTR, and availability help identify degradation patterns before equipment failures occur. HiveMQ Pulse computes these metrics in real time, enabling maintenance teams to detect issues earlier and take proactive action to reduce unplanned downtime.

How to Compute MTBF, MTTR, and Availability in Real-Time Without a Separate Data Stack

by Dr. Ankit Chaudhary, Sven Kobow

Jun 9, 2026 24 min read

TL;DR

Reliability KPIs like MTBF, MTTR, and availability are hard because the definitions hiding inside them are hard, not the math. What counts as a failure, recovery or operational time? The answers are equipment-specific and engineer-judgment-specific. Traditional tooling forces those answers to be computed after the fact by extracting raw data from siloed systems like historians or MES platforms. HiveMQ Pulse takes a different approach. It introduces a single expression, BetweenState, that lets the engineer who knows the definition encode it directly as a state condition. The same primitive produces the full reliability KPI family, computed on the fly, with no additional storage or compute infrastructure required.

The hard part of MTBF, MTTR and availability is the definition, not the arithmetic.
BetweenState is one expression primitive that produces the full reliability KPI family when paired with different state conditions.
The engineer who knows what failure, recovery and operational time mean for their equipment writes the predicates directly without a separate stream-processing stack.

Who is this blog for: This post is for reliability engineers, OT leads, and platform architects in discrete manufacturing, process industries, and energy who need MTBF, MTTR, and availability computed in real time.

Why Are Reliability KPIs Like MTBF and MTTR So Often Miscomputed?

Reliability KPIs are miscomputed not because the math is hard, but because the definitions are. MTBF, MTTR, and availability are among the oldest reliability KPIs in the industry, and at the same time, among the most consistently miscomputed. The question hiding inside each of them is harder than it looks: what counts as a failure, what counts as recovery, and what counts as operational time?

The three KPIs are different views of the same equipment timeline, and each depends on a definition that the math alone cannot supply.

MTBF is used as an indicator of equipment reliability, i.e., how often a piece of equipment fails. It needs a failure definition.
MTTR is used as an indicator of maintenance responsiveness, i.e., how quickly equipment is restored after a failure. It needs a recovery definition, a clear answer to "back in service, as of when?"
Availability is used as an indicator of overall operational performance, i.e., the share of scheduled production time during which the equipment is actually running. It needs both, plus a definition of scheduled operational time.

Anyone who has worked across multiple OT vendors knows the answers are never universal. The definitional gap shows up differently for each KPI.

For failure definitions, a Siemens drive signals differently from an ABB drive. Fault dictionaries run to hundreds of codes per vendor. Failure for MTBF purposes is almost always a subset of those codes, filtered by severity, sometimes gated by duration, sometimes counted only if the equipment was supposed to be running. The same fault code can mean failure on one line and expected behavior on another.

For recovery definitions, some equipment signals it cleanly. The fault clears, the status returns to running, and the down interval has a well-defined end. Other equipment remains in an error state until a reset is acknowledged or is considered recovered only after a specific reinitialization sequence completes. Reasonable engineers at the same plant will disagree on which signal counts.

For operational time, planned maintenance windows usually don't count as downtime, but tooling doesn't handle that uniformly across vendors. Mode changes, shift boundaries, and scheduled stoppages each require proper handling and equipment-specific knowledge.

In each case, the definition is equipment-specific and engineer-judgment-specific. The reliability engineer is the person who knows the answer. Existing tooling forces the reliability engineer to extract raw data from a historian or an MES platform and use a spreadsheet or similar tools to apply their custom formulas to compute the final KPIs.

Section summary: Reliability KPIs are miscomputed not because the math is hard but because their definitions vary by vendor, by equipment, and by engineer judgment. The person who knows the right definition, the reliability engineer, is forced to extract raw data from historians or MES platforms and rebuild the calculation in spreadsheets.

Why Do Current Approaches Fall Short for Reliability KPIs?

Current approaches struggle because they solve the infrastructure problem, not the definition problem. What "the current approach" looks like depends on where the manufacturer is on their digital journey. Common variations include:

PLC and HMI to spreadsheet. Data exported manually or via CSV, KPIs computed in Excel by a reliability engineer or shift supervisor.
PLC to historian or time-series DB. Data captured but reliability KPIs computed downstream in a separate analytics tool.
PLC to SCADA, MES, or a dedicated point solution. Each system computes its own version of the KPI, and they disagree.
PLC to MQTT broker to streaming stack (Kafka, Flink or Spark, time-series DB, BI layer). The modern data-platform approach, more common at digitally mature manufacturers.

What all four share is that the KPI definition lives downstream of the reliability engineer, in a system the engineer doesn't directly control. Whether the path is fragmented or modern, the definitional problem is the same.

The fragmentation pattern is worth dwelling on, because it compounds at scale. Each site typically runs its own historian, capturing data from local production equipment. KPIs are computed at the bottom of the operational hierarchy (machine, shift, line) and rolled up into broader metrics for the area, the site, and the enterprise. When each layer of that rollup uses a slightly different definition of failure or recovery, the enterprise-level number is the average of averages of averages, each built on incompatible inputs. The aggregated MTBF agrees with nothing the plant floor reports.
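The rollup problem exists even before definitions diverge. The Python sketch below uses two hypothetical lines that share the same failure definition and still shows an average of averages drifting away from the pooled figure; incompatible definitions at each layer only widen that gap.

```python
# Illustrative per-line totals for one reporting period.
lines = [
    {"name": "line_1", "operating_hours": 900.0, "failures": 3},
    {"name": "line_2", "operating_hours": 100.0, "failures": 10},
]

# Rollup as usually done: average the per-line MTBF values.
per_line_mtbf = [l["operating_hours"] / l["failures"] for l in lines]
mean_of_means = sum(per_line_mtbf) / len(per_line_mtbf)

# Rollup from the underlying totals: pooled hours over pooled failures.
pooled_mtbf = sum(l["operating_hours"] for l in lines) / sum(l["failures"] for l in lines)

print(per_line_mtbf, mean_of_means, round(pooled_mtbf, 2))
```

The averaged figure is 155 hours while the pooled figure is about 77 hours, so the site-level number already disagrees with what either line reports.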

In practice, the lack of expressive, consistent definition produces one of three outcomes:

Reliability KPIs computed in spreadsheets from periodic CSV exports. The numbers arrive days late, and only the person who built the spreadsheet knows what they mean.
Reliability KPIs computed generically. Every fault code counts as a failure, every state change counts as recovery, every scheduled hour counts as operational time. The result is numbers nobody trusts.
Reliability KPIs computed correctly but expensively. The platform team owns a stream-processing job for each KPI definition. The reliability team cannot read or modify any of them. Adding a new KPI variant means filing a ticket.

None of these is a good outcome. The missing capability is not more infrastructure. It is letting the reliability engineer write the definition directly, in a form the system can execute, and in a vocabulary that travels consistently from the machine to the enterprise.

Section summary: The traditional stack, whether fragmented or modern, removes the reliability engineer from the definition and makes consistent rollup across the hierarchy impossible. Both effects push in the wrong direction.

How Does HiveMQ Pulse Compute Reliability KPIs in Real-Time?

HiveMQ Pulse computes reliability KPIs through a single expression, BetweenState, that evaluates a Boolean predicate the user writes to define a state of interest. The expression captures the intervals during which that predicate holds true over a stream of operational data, then applies an aggregation function to the data collected in each interval. Because HiveMQ Pulse pushes these computations directly to the HiveMQ Brokers, it simplifies KPI definition without requiring an additional infrastructure stack.

The expression has two parts:

BetweenState(<Predicate>, <AggregationFunction>)

The predicate is a user-written Boolean expression that defines a state of interest. Over an unbounded stream, a state is represented as a window. Unlike a time-based or count-based window, this one is defined by the data itself. It opens when the predicate becomes true, captures every consecutive tuple that satisfies it, and closes when the predicate becomes false again.

The aggregation function specifies what to compute over the data points within that window. For reliability KPIs, the relevant aggregation is the elapsed time between the window's open and close, which gives the raw interval for the KPI averages. The same mechanism extends to any aggregation over the captured tuples, such as event counts, total runtime, or maximum temperature.
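The window semantics described above can be simulated in a few lines of Python. This is a sketch of the idea, not the Pulse implementation: the generator name, the tuple layout, and the choice to close a window at the timestamp of the first non-matching tuple are all assumptions made for illustration.

```python
def between_state(stream, predicate, aggregate):
    """Emit one aggregated value per maximal run of tuples that satisfy `predicate`.

    The window opens at the first matching tuple and closes at the first tuple
    that no longer matches. A window still open at the end of the stream is not
    emitted (an assumption of this sketch).
    """
    window = []
    for tup in stream:
        if predicate(tup):
            window.append(tup)
        elif window:
            yield aggregate(window, tup["ts"])
            window = []


def time_elapsed(window, closed_at):
    """Elapsed time between the window's open and its close."""
    return closed_at - window[0]["ts"]


def count(window, closed_at):
    """Number of tuples captured in the window."""
    return len(window)


# Illustrative stream: timestamps in seconds and a Boolean fault flag.
stream = [
    {"ts": 0, "fault": False},
    {"ts": 10, "fault": True},
    {"ts": 20, "fault": True},
    {"ts": 35, "fault": False},
    {"ts": 50, "fault": True},
    {"ts": 55, "fault": False},
]

durations = list(between_state(stream, lambda t: t["fault"], time_elapsed))
counts = list(between_state(stream, lambda t: t["fault"], count))
print(durations, counts)
```

Swapping `time_elapsed` for `count` changes what is computed over each window without changing how the windows are found, which mirrors the split between predicate and aggregation function.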

For MTBF and MTTR specifically, the expression composes naturally as a windowed mean of the elapsed times of the intervals for which the predicate held true, averaged either over a fixed count of such intervals or over a fixed time horizon:

Mean(BetweenState(<Predicate>, TimeElapsed), <Duration | Count>)

A 24-hour window gives a rolling 24-hour view. A 100-interval window gives a per-100-event view for assets where wear cycles matter more than calendar time. Figure 1 (a) shows an example Unified Namespace (UNS) for a packaging factory. Figure 1 (b) and (c) show how MTBF can be calculated in HiveMQ Pulse using the BetweenState and simple moving average (SMA) functions. Reliability teams typically want short and long windows side by side. Short windows surface the bad shift. Long windows surface the real trend. The gap between them is often where the interesting story resides. To this end, teams can define MTBF over varying counts of intervals using the same TBF tag.

Figure 1. (a) UNS topic tree for an example packaging factory. (b) A Pulse tag using BetweenState to compute the time-between-failure (TBF) intervals from the fault data points published by a line asset. (c) A Pulse tag that computes the Mean Time Between Failures (MTBF) by taking a moving average over every 100 TBF intervals from (b).
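To show how short and long windows over the same TBF values diverge, here is a Python sketch of a count-based and a duration-based rolling mean. The function names and sample values are hypothetical, and emitting a mean on every new interval (rather than only once the window is full) is an assumption of the sketch, not a statement about how Pulse behaves.

```python
from collections import deque


def mean_over_count(values, count):
    """Rolling mean over the most recent `count` values, one output per input."""
    window = deque(maxlen=count)
    out = []
    for v in values:
        window.append(v)
        out.append(sum(window) / len(window))
    return out


def mean_over_duration(timed_values, horizon):
    """Rolling mean over values whose timestamp is within `horizon` of the newest one."""
    window = deque()
    out = []
    for ts, v in timed_values:
        window.append((ts, v))
        while window[0][0] <= ts - horizon:
            window.popleft()
        out.append(sum(x for _, x in window) / len(window))
    return out


# Illustrative TBF values in hours; the last three intervals are a bad shift.
tbf = [10.0, 12.0, 2.0, 3.0, 1.0]

short_view = mean_over_count(tbf, 2)
long_view = mean_over_count(tbf, 5)
by_time = mean_over_duration([(0, 10.0), (10, 20.0), (30, 30.0)], horizon=24)
print(short_view, long_view, by_time)
```

By the last interval the short window has dropped to 2.0 hours while the long window still reads 5.6 hours. That gap is the bad shift showing up before it moves the trend.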

Section summary: BetweenState is one expression that captures data-defined intervals and aggregates over them. It is the foundation of every reliability KPI in this post.

Can One Expression Compute MTBF, MTTR, and Availability?

Yes. The same BetweenState expression produces all three reliability KPIs when paired with different predicates.

MTBF uses a failure predicate and TimeElapsed. The windowed mean of the resulting intervals is the MTBF for the chosen horizon:

Mean(BetweenState(<failure_predicate>, TimeElapsed), <Duration | Count>)

MTTR uses a recovery predicate and the same aggregation. The shape of the expression is identical to MTBF. Only the predicate changes:

Mean(BetweenState(<recovery_predicate>, TimeElapsed), <Duration | Count>)

Availability is composed of the same primitives as a ratio of operational-state intervals (defined by an operational predicate) to scheduled time over a window. It uses the same BetweenState building block. Only the surrounding composition changes.
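The ratio itself is simple once the operational intervals exist. The Python sketch below assumes those intervals have already been produced by an operational predicate and that scheduled time is known for the window; the function name and the sample shift are hypothetical.

```python
def availability_from_intervals(operational_intervals, scheduled_time):
    """Accumulated operational time divided by scheduled production time."""
    if scheduled_time <= 0:
        raise ValueError("scheduled_time must be positive")
    return sum(operational_intervals) / scheduled_time


# Illustrative 8-hour shift (480 scheduled minutes) with three running intervals.
running_minutes = [180.0, 120.0, 132.0]
shift_availability = availability_from_intervals(running_minutes, scheduled_time=480.0)
print(round(shift_availability, 3))
```

What counts as scheduled time is a definition in its own right: if a planned maintenance window is excluded from the 480 minutes, the same running intervals produce a higher availability.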

This is the architectural point worth dwelling on. HiveMQ Pulse does not ship MTBF, MTTR, or availability features. It ships an expression vocabulary in which your reliability team defines all three using the same primitive, with the predicate as the only thing that changes from one KPI to the next.

Section summary: One expression, three predicates, three KPIs. The same approach extends to any reliability or production KPI built on state intervals.

Where Does the Engineer's Knowledge Actually Live?

It lives in the predicate. The predicate is the place where equipment-specific and vendor-specific judgment becomes executable. Three realistic predicate styles, one for each of the three definitions a reliability program needs.

One of these examples uses PackML. PackML (ISA-TR88.00.02) is a manufacturing standard that defines a common state machine for production equipment, with named states such as Execute, Aborted, and Clearing that have the same meaning across vendors.

Defining Failure: a Vendor-Specific Fault Code Subset

Used as the failure predicate for MTBF. Failure here is a curated list of codes the reliability team has identified as meaningful, gated on severity:

failure_predicate := (fault_code == F0023 OR fault_code == F0047 OR fault_code == F0102) AND severity >= 2

This is the most common shape of a failure predicate. The reliability engineer maintains the code subset based on root cause analysis of prior incidents. When the team learns a new fault code is operationally significant, they add it to the predicate, and the MTBF starts counting it immediately.
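For readers who think in code, the same predicate can be written as a plain Python function. This is an illustration of the logic only, not Pulse syntax; the message layout is assumed, and `F0200` is a hypothetical code used to show how the subset grows.

```python
SIGNIFICANT_CODES = frozenset({"F0023", "F0047", "F0102"})
MIN_SEVERITY = 2


def failure_predicate(msg, codes=SIGNIFICANT_CODES):
    """True when the fault code is in the curated subset and severity is high enough."""
    return msg["fault_code"] in codes and msg["severity"] >= MIN_SEVERITY


# Adding a newly significant code is a one-line change to the subset.
extended_codes = SIGNIFICANT_CODES | {"F0200"}  # F0200 is hypothetical

print(failure_predicate({"fault_code": "F0047", "severity": 2}))
print(failure_predicate({"fault_code": "F0200", "severity": 3}))
print(failure_predicate({"fault_code": "F0200", "severity": 3}, extended_codes))
```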

Defining Recovery: a PackML State Sequence on a Pharmaceutical Filling Line

For a pharmaceutical filling line, the recovery interval starts when the machine enters Clearing, the PackML state in which the operator has acknowledged the fault and active repair is underway, and ends when the machine returns to Execute. Used as the recovery predicate for MTTR:

recovery_predicate := PackML_state == "Clearing" AND mode == "Production"

The MTTR for the line is the windowed mean of how long the machine stays in this state. The same expression works over any state vocabulary the equipment exposes. On PackML-compliant equipment, the predicate references PackML states. On custom equipment, the predicate references custom states. The expression is the same.
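Where the recovery clock stops is itself a choice, and the Python sketch below contrasts two options on a made-up state log: time spent in Clearing only, and time spent anywhere on the path from Clearing back to Execute. The event layout and helper name are hypothetical; the state names are standard PackML states, and the wider variant is an alternative definition shown for comparison, not the predicate used in this post.

```python
def dwell_times(events, predicate):
    """Duration of each maximal run of events that satisfy `predicate`.

    A run closes at the timestamp of the first event that no longer matches.
    """
    out, opened = [], None
    for ev in events:
        if predicate(ev):
            if opened is None:
                opened = ev["ts"]
        elif opened is not None:
            out.append(ev["ts"] - opened)
            opened = None
    return out


# Illustrative state-change log in seconds, all in Production mode.
states = [(0, "Execute"), (100, "Aborted"), (130, "Clearing"), (190, "Stopped"),
          (200, "Resetting"), (210, "Idle"), (215, "Starting"), (220, "Execute")]
events = [{"ts": ts, "PackML_state": s, "mode": "Production"} for ts, s in states]

REPAIR_PATH = {"Clearing", "Stopped", "Resetting", "Idle", "Starting"}


def clearing_only(ev):
    return ev["PackML_state"] == "Clearing" and ev["mode"] == "Production"


def until_execute(ev):
    return ev["PackML_state"] in REPAIR_PATH and ev["mode"] == "Production"


print(dwell_times(events, clearing_only), dwell_times(events, until_execute))
```

The same incident yields 60 seconds under the first definition and 90 seconds under the second, so the team should agree on which one the MTTR tag represents before comparing lines.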

Defining Operational Time: a Mode-Gated Production State

Used to define operational time for availability, and equally useful as a gating clause inside other predicates to exclude planned maintenance windows:

operational_predicate := running == true AND scheduled_mode == "PRODUCTION"

The same predicate works as a gating clause inside failure or recovery predicates to exclude planned maintenance windows from both MTBF and MTTR.
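A Python sketch of the gating idea follows. It assumes that only the mode clause is reused as the gate, since a faulted machine would typically not report `running == true`; the field names mirror the predicate above, and the `gated` helper and `raw_failure` function are hypothetical.

```python
def in_production(msg):
    """Mode clause shared by the operational predicate and the gate."""
    return msg["scheduled_mode"] == "PRODUCTION"


def operational_predicate(msg):
    return msg["running"] is True and in_production(msg)


def gated(predicate):
    """Wrap a predicate so it can only fire during scheduled production."""
    def wrapped(msg):
        return in_production(msg) and predicate(msg)
    return wrapped


def raw_failure(msg):
    return msg.get("fault_code") == "F0023"


failure_predicate = gated(raw_failure)

during_production = {"running": False, "scheduled_mode": "PRODUCTION", "fault_code": "F0023"}
during_maintenance = {"running": False, "scheduled_mode": "MAINTENANCE", "fault_code": "F0023"}
print(failure_predicate(during_production), failure_predicate(during_maintenance))
```

The fault raised during the maintenance window is ignored, so it neither shortens MTBF nor opens a repair interval for MTTR.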

Section summary: Across all three KPIs, the same pattern holds. The vendor-specific and equipment-specific knowledge lives in a predicate that the reliability engineer writes, not in code that the platform team has to maintain.

What's Next for Reliability and Production KPIs in HiveMQ Pulse?

Lighthouse customers are collapsing the traditional reliability stack into a single HiveMQ Pulse deployment alongside their existing dashboards.

A consistent pattern shows up across discrete manufacturing, process industries, and energy. Teams that were running an MQTT broker, a streaming layer, and a time-series database to compute reliability KPIs are consolidating these into a single platform alongside their existing visualization layer. The number of systems involved drops. The latency from the event to KPI update drops with it.

The change that surprises platform teams most is time-to-definition. What used to be a multi-week cross-team project is now a configuration change that a reliability engineer makes themselves.

What this looks like depends on the role you play:

For reliability engineers and plant-side OT leads: You write the predicate. Your fault-code subset, your duration thresholds, your mode gates, your recovery conditions, all of that becomes a predicate you can read, edit, and version. The MTBF, MTTR, and availability numbers on the dashboard are computed from definitions you authored, not from generic interpretations inherited from a vendor or a platform team.
For solution architects at OT vendors and systems integrators: Reliability KPIs become part of your broker integration, not a parallel streaming project. The predicate model also means you can ship vendor-specific predicate libraries to your customers, such as a Siemens predicate set or an ABB predicate set, and let them compose and refine from there.
For platform and data engineers evaluating Pulse: The property worth examining is composability. The same primitive covers MTBF, MTTR, and availability. You are not adopting a single-purpose KPI engine. You are adopting a stateful operator vocabulary that applies uniformly to every KPI you build on it.

Private beta access involves working directly with the HiveMQ Pulse team on your reliability use case. To start that conversation, write to sales@hivemq.com.

Section summary: Private beta customers see fewer systems, faster time-to-definition, and reliability KPIs that come from definitions the reliability team owns.

Ready to compute reliability KPIs on your own terms? Contact sales@hivemq.com to request access to the HiveMQ Pulse private beta.

Frequently Asked Questions

Dr. Ankit Chaudhary

Dr. Ankit Chaudhary is a Senior Software Engineer at HiveMQ, based in Berlin, where he works on HiveMQ Pulse, the company's distributed data intelligence platform. His focus is on real-time stream processing and in-flight data transformation from the edge to the cloud.

Ankit holds a PhD in data management systems from TU Berlin, with a particular focus on stream processing systems. As a founding engineer of the NebulaStream platform, he built its query optimizer and worked to keep continuous queries efficient across edge and cloud environments. His research has earned awards at top-tier database conferences, including a Best Paper Award at ICDE 2025, and has been published in venues such as VLDB, SIGMOD, EDBT and CIDR. Before his PhD, he spent more than a decade in industry building real-time data systems.

Sven Kobow

Sven Kobow is a Staff Industry Architect at HiveMQ with more than two decades of experience in IT and IIoT.

In this role, he bridges strategic vision and technical implementation - designing reference architectures and deployment patterns that help industrial enterprises build on HiveMQ's full platform stack. He works deeply in customer domains to architect Unified Namespace Solutions and event-driven data platforms, with a consistent focus on extracting real business value from operational data. Before joining HiveMQ, he worked in the automotive sector at a major OEM.