DevOps Journey: Service Monitoring – Learning from Service Failures

Coming from a development background, you need to learn several concepts on your journey to becoming a DevOps engineer. In particular, you need to be familiar with the operations side of things.

As a developer, you are mostly concerned with ensuring your code works according to the acceptance criteria and the definition of done.

In ITIL, the set of features and functionality is called UTILITY.
UTILITY: Fit for purpose, or utility, means that the service must fulfill the customer's needs. It is summarized as what the system does.

The value of your software won’t be realized until it is ‘Fit for Use’.

In ITIL, the set of qualities that define ‘Fit for Use’ is called WARRANTY. On the operations side of things, IT operators are concerned not only with UTILITY but also with WARRANTY.
WARRANTY: Fit for Use is a promise or guarantee that a product or a service will meet its agreed purpose by being available when needed, with enough capacity, and dependable security.

In this article, I will discuss the reliability attribute and how each service failure can serve as an opportunity to continuously improve your skills as a DevOps engineer.

To maximize the uptime of your services and keep disruptions to a minimum, reliability metrics (failure metrics) should be tracked. Tracking the reliability of services is a challenge when you do not have an automated CMMS (Computerized Maintenance Management System). I have provided a simple Excel tracking template to track the services you have deployed. (See the Download Page: look for Service Monitoring-Sheet for IT Managers/DevOps.)

As soon as your software is deployed to production, you will encounter several error conditions that are not in your test matrix!
Deployment issues that are rarely considered by developers with minimal delivery experience will occur. It is best to track these so you can continuously improve the reliability of your services.

Availability and Reliability Metrics

In general, we speak of availability in terms of “nines.” The figures below are the downtime allowed per year.
– Two nines, 99%, allows 3.65 days of downtime: (100% - 99%) * 365 days
– Three nines, 99.9%, is about 8.8 hours of downtime
– Four nines, 99.99%, is about 53 minutes of downtime
– Five nines, 99.999%, is about 5 minutes of downtime <– Our BHAG (Big Hairy Audacious Goal)
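
As a quick check of the figures above, here is a minimal Python sketch that applies the same (100% - availability) formula to a 365-day year. The function name and the constant are illustrative and are not part of the template.

```python
# Allowed downtime per year for a given availability target.
# Assumes a 365-day year (leap years ignored), as in the list above.
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(availability_percent: float) -> float:
    """Hours per year a service may be down and still meet the target."""
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        hours = allowed_downtime_hours(target)
        print(f"{target}% -> {hours:.2f} hours ({hours * 60:.1f} minutes)")
```

Running it shows why each extra nine is so much harder to reach: the downtime budget shrinks by a factor of ten each time.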

Downtime is measured from the time of failure until the system is back in operating condition.
<In the given template, this is ‘Monitoring’ Sheet Availability Column.>

In order to increase reliability, failure metrics should be tracked and monitored as well. The common service/device failure metrics are MTTR, MTBF, and MTTF.

1. MTTR (Mean Time to Recover – the average downtime)

  • FORMULA: MTTR(Recover) = Total Downtime / Number of Failures
  • It is the goal of IT operators to quickly recover from downtime
  • Mean Time to Recover is measured from the time the failure occurs until the service resumes operations

<In the given template, this is ‘Monitoring’ Average Downtime Column>
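
If you keep failure and recovery timestamps instead of a spreadsheet, the Mean Time to Recover formula can be sketched in Python as follows. The incident timestamps below are made-up sample data, and the function name is only illustrative.

```python
from datetime import datetime

# Hypothetical incident log: (time of failure, time service resumed).
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 10, 30)),     # 1.5 h
    (datetime(2024, 2, 11, 22, 15), datetime(2024, 2, 11, 22, 45)), # 0.5 h
    (datetime(2024, 3, 2, 14, 0), datetime(2024, 3, 2, 16, 0)),     # 2.0 h
]


def mean_time_to_recover_hours(log):
    """Total Downtime / Number of Failures, in hours."""
    if not log:
        raise ValueError("Mean time to recover is undefined with no failures")
    total_seconds = sum(
        (resumed - failed).total_seconds() for failed, resumed in log
    )
    return total_seconds / len(log) / 3600


if __name__ == "__main__":
    print(round(mean_time_to_recover_hours(incidents), 2))  # prints 1.33
```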

2. MTTR (Mean Time to Repair – the average time to repair the system)

  • FORMULA: MTTR(repair) = Total Repair Time / Number of Failures
  • It is the goal of IT operators to repair systems in the shortest possible time at minimum cost
  • Mean Time to Repair includes the repair time and the testing period until return to normal operation
  • It does not include the time from the point of failure through diagnostics
    <In the given template, this is ‘Monitoring’ MTTR Column>
  • For each downtime, log an entry in the incident list that records the Downtime, Time to Repair, and Manhours to Repair
    <In the given template, this is ‘Incident-List’ sheet.>
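
The incident list described above could be modeled roughly like this in Python. The `Incident` class, its field names, and the sample values are assumptions for illustration; the actual template columns may differ.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """One row of a hypothetical incident list (all values in hours)."""
    downtime_hours: float
    repair_hours: float
    manhours: float


# Made-up sample data.
incident_list = [
    Incident(downtime_hours=1.5, repair_hours=1.0, manhours=2.0),
    Incident(downtime_hours=0.5, repair_hours=0.25, manhours=0.25),
    Incident(downtime_hours=2.0, repair_hours=1.75, manhours=3.5),
]


def mean_time_to_repair(rows):
    """Total Repair Time / Number of Failures."""
    if not rows:
        raise ValueError("Mean time to repair is undefined with no failures")
    return sum(row.repair_hours for row in rows) / len(rows)


def total_manhours(rows):
    """Total effort spent on repairs, a rough proxy for repair cost."""
    return sum(row.manhours for row in rows)


if __name__ == "__main__":
    print(mean_time_to_repair(incident_list))  # prints 1.0
    print(total_manhours(incident_list))       # prints 5.75
```

Tracking manhours next to repair time lets you see whether repairs are getting faster only because more people are being assigned to them.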

3. MTBF (Mean Time Between Failures)

  • FORMULA: MTBF = Total Operation Time / Number of Failures
  • The goal is to have longer operation hours before a failure occurs, through effective preventive maintenance
  • Mean Time Between Failures does not count the time when servers are down for maintenance
  • It is the average time between failures

<In the given template, this is ‘Monitoring’ MTBF Column.>
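
A minimal Python sketch of the MTBF formula, assuming that operation time is the tracking period minus both failure downtime and planned maintenance. The function name, parameters, and sample numbers are hypothetical.

```python
def mtbf_hours(period_hours, failure_downtime_hours, maintenance_hours, failures):
    """Total Operation Time / Number of Failures.

    Operation time is assumed to exclude failure downtime and
    planned maintenance windows.
    """
    if failures == 0:
        raise ValueError("MTBF is undefined when there are no failures")
    operating_hours = period_hours - failure_downtime_hours - maintenance_hours
    return operating_hours / failures


if __name__ == "__main__":
    quarter_hours = 90 * 24  # a 90-day tracking period
    # 4 hours of failure downtime, 6 hours of maintenance, 3 failures
    print(round(mtbf_hours(quarter_hours, 4.0, 6.0, 3), 1))  # prints 716.7
```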

4. MTTF (Mean Time to Failure)

  • This metric measures the reliability of equipment that is not repairable. Examples are light bulbs and memory modules
  • Simply put, this is a measure of the lifespan of a device (Device Starts Working -> Device Failure Occurs)
  • FORMULA: MTTF = Sum of all units' hours of operation / Total number of units
  • This metric, although not included in the template, is useful when the operators need to estimate how long a component would last as part of a larger piece of equipment that they maintain
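
Since MTTF is not in the template, here is a small Python sketch of the formula. The lifespans are made-up sample values for four identical units, not measured data.

```python
def mttf_hours(unit_lifespans_hours):
    """Sum of all units' hours of operation / Total number of units."""
    if not unit_lifespans_hours:
        raise ValueError("MTTF needs at least one unit")
    return sum(unit_lifespans_hours) / len(unit_lifespans_hours)


if __name__ == "__main__":
    # Hypothetical hours of operation before each unit failed.
    lifespans = [10_000, 12_000, 8_000, 14_000]
    print(mttf_hours(lifespans))  # prints 11000.0
```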

Logging the above metrics, together with historical information on how each incident occurred, its root cause, corrective actions, and preventive actions, will help you continuously improve the reliability of your systems. While this article is mainly about learning from your past incidents, it is recommended that you also build reliable systems upfront through resilient architecture running on reliable infrastructure.

– Agile Pinoy


Published by agilepinoy

We are Agile Pinoy. We believe that Filipinos can build globally competitive Products, one team at a time.
