Explaining downtime costs and what you can do about them
What is the “true cost” your enterprise incurs during IT downtime? How can you calculate downtime costs? Where do you even begin assessing the damage after a disaster occurs? And most importantly, what can you do to prevent it from happening?
With the rise of digital transformation and increasing reliance on IT infrastructure, downtime costs keep rising. According to Gartner, in 2014 the average global cost of IT downtime amounted to $5,600 per minute.
A later study in 2016 by the Ponemon Institute shows that it had grown to as much as $9,000 per minute, and the most recent study, published for 2020 by Statista, supports these numbers. To top that off, a few days before this article was drafted, a Facebook outage cost the company a staggering $164,000 per minute in revenue, not to mention a stock decline that wiped out $40 billion in market cap and a personal hit of roughly $6 billion to Mark Zuckerberg.
In light of that news, today we’ll take a look at how exactly any sort of IT downtime harms your enterprise. It’s a real danger to big companies like Google and Facebook just as much as it is to small ones.
According to a recent CB Insights study based on 100+ startup post-mortem analyses, a whopping 38% of the failed companies went under simply because they ran out of money. So it’s not hard to see how losing an average of $137 to $400 per minute, the range reported for SMEs worldwide, can contribute to that.
What is IT downtime cost?
“Downtime” is defined as a period of time during which equipment or a machine is not functional due to technical failure, lack of maintenance, or other factors.
Downtime in IT is slightly different. The core of the definition stays the same: malfunctions or technical failures. The distinction is that it affects multiple areas of your enterprise at once. And because the IT industry is incredibly varied, only the basic calculation templates apply.
In the IT industry, the causes vary as much as companies vary from each other. More often than not, they are unique to each business.
A downtime event can be caused by natural disasters, planned or unplanned hardware outages, electrical outages, cyberattacks such as DDoS or full-fledged APTs, human error, nation-state attacks, etc.
Regardless of the reason, any time your essential servers go down, the “costs timer” starts ticking.
There are two categories of downtime costs: direct costs and indirect costs.
Direct downtime costs:
These costs are relatively easy to calculate with in-house numbers, and usually have fixed or calculable variables.
Equipment replacement costs:
This one is quite self-explanatory, and it is a relatively rare occurrence: your hardware gets damaged beyond repair. The cost is typically a fixed price, and in the grander scope of things it doesn’t amount to much. However, it’s important to include it in your yearly budget as a possible expense. Remember: all hardware eventually fails!
Cost due to lost productivity:
This figure captures how much money was lost because your employees could not work during the downtime event, based primarily on their wages. The equation is as follows:
Cost of lost productivity = (Number of employees affected) x (Avg. hourly wage) x (Number of hours of downtime)
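To make the equation above concrete, here is a minimal Python sketch. The function name and the example figures (50 employees, $30 per hour, 4 hours) are our own assumptions for illustration, not numbers from any study.

```python
def cost_of_lost_productivity(employees_affected: int,
                              avg_hourly_wage: float,
                              downtime_hours: float) -> float:
    """Wages paid for the hours in which affected employees could not work."""
    return employees_affected * avg_hourly_wage * downtime_hours


# Hypothetical figures: 50 employees at $30/hour, idle for 4 hours.
lost = cost_of_lost_productivity(50, 30.0, 4.0)
print(f"Cost of lost productivity: ${lost:,.2f}")
```

With these made-up inputs the result is 50 x 30 x 4 = $6,000 for a single four-hour outage.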
Costs of employee recovery:
This metric stands for the money spent on catching up to normal operations. Alongside your employees’ wages, you must add any overtime pay and incorporate additional costs incurred due to possible missed deadlines. The basic equation is:
Cost of employee recovery = (Number of employees affected) x (Avg. hourly wage) x (Number of hours to catch up)
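The basic equation leaves out the overtime pay and missed-deadline costs mentioned above. One possible way to fold them in is sketched below; the `overtime_multiplier` and `missed_deadline_costs` parameters are our own illustrative additions, and with their default values the function reduces to the basic equation.

```python
def cost_of_employee_recovery(employees_affected: int,
                              avg_hourly_wage: float,
                              catch_up_hours: float,
                              overtime_multiplier: float = 1.0,
                              missed_deadline_costs: float = 0.0) -> float:
    """Cost of catching up after downtime.

    overtime_multiplier and missed_deadline_costs are hypothetical extras:
    1.0 and 0.0 give exactly the basic equation from the article.
    """
    labour = (employees_affected * avg_hourly_wage
              * overtime_multiplier * catch_up_hours)
    return labour + missed_deadline_costs


# Basic equation: 50 employees, $30/hour, 2 hours of catch-up work.
basic = cost_of_employee_recovery(50, 30.0, 2.0)

# Same scenario, paid at 1.5x overtime plus a $1,000 late-delivery penalty.
with_extras = cost_of_employee_recovery(50, 30.0, 2.0,
                                        overtime_multiplier=1.5,
                                        missed_deadline_costs=1_000.0)
print(basic, with_extras)
```

Keeping the extras as optional parameters means you can start with the basic number and refine it once you know how the catch-up hours were actually paid.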
Cost of IT recovery:
Here we calculate how much money was spent to get the IT component back into working shape. This is different from the cost of replacement components: it measures the number of hours it took the in-house IT staff or the IT provider to fix the problem. The basic equation is:
Cost of IT recovery = (Number of IT engineers affected) x (Avg. hourly wage or cost of IT staff) x (Time-to-detection + Time-to-resolution)
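In practice, time-to-detection and time-to-resolution usually come from timestamps in your monitoring or ticketing system. The sketch below assumes you have three such timestamps (when the outage started, when it was detected, and when it was resolved); the function name, parameter names, and figures are hypothetical.

```python
from datetime import datetime


def cost_of_it_recovery(engineers: int,
                        hourly_cost: float,
                        started_at: datetime,
                        detected_at: datetime,
                        resolved_at: datetime) -> float:
    """Apply the IT recovery equation to three incident timestamps."""
    # Time-to-detection: from the start of the outage until someone noticed.
    time_to_detection = (detected_at - started_at).total_seconds() / 3600
    # Time-to-resolution: from detection until the fix was in place.
    time_to_resolution = (resolved_at - detected_at).total_seconds() / 3600
    return engineers * hourly_cost * (time_to_detection + time_to_resolution)


# Hypothetical incident: starts at 02:00, is noticed at 02:30, fixed at 06:00.
cost = cost_of_it_recovery(
    engineers=3,
    hourly_cost=60.0,
    started_at=datetime(2024, 1, 15, 2, 0),
    detected_at=datetime(2024, 1, 15, 2, 30),
    resolved_at=datetime(2024, 1, 15, 6, 0),
)
print(f"Cost of IT recovery: ${cost:,.2f}")
```

Here the outage lasts four hours in total (0.5 hours to detect, 3.5 hours to resolve), so three engineers at $60 per hour come to $720.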
Indirect downtime costs:
These are slightly harder to calculate right away, or don’t have an easy equation to follow. The very first thing you must do is take a page out of product metrics and calculate the revenue lost due to the downtime event.
For simplicity, we’ll use only the basic equation, which spreads annual revenue evenly over every hour of the year:
Revenue lost = (Annual revenue/8,760 hours per year) x (Hours of downtime event)
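As a quick sketch of this equation in Python (the function name and the example revenue figure are assumptions chosen to keep the arithmetic easy to follow):

```python
HOURS_PER_YEAR = 8_760  # 365 days x 24 hours, as in the equation above


def revenue_lost(annual_revenue: float, downtime_hours: float) -> float:
    """Revenue lost, assuming revenue is earned evenly around the clock."""
    return annual_revenue / HOURS_PER_YEAR * downtime_hours


# Hypothetical business with $8.76M in annual revenue and a 4-hour outage.
lost = revenue_lost(8_760_000, 4)
print(f"Revenue lost: ${lost:,.2f}")
```

This example works out to $1,000 per hour, or $4,000 for the outage. Keep in mind that the even-spread assumption understates the loss if the outage hits your busiest hours and overstates it if it happens while nobody is buying.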
Revenue loss due to churn:
Here we calculate how much money was lost due to customers leaving because of the downtime event. A useful metric to work with here is the repeat sales rate. The equation:
Projected loss due to lost customers = (Revenue Lost) x (Average rate of repeat sales)
Revenue loss due to damaged brand reputation:
This is an estimate of how much money was lost because potential customers were scared away by the downtime event. Note, however, that the only measurable data comes from comparing your revenue with your referral data (shopping sites, social media referrals, and analysis of customer pain points).
The equation is:
Projected revenue loss due to brand damage = (Revenue Lost) x (Percentage of sales from referrals)
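Both projections, churn and brand damage, scale the revenue lost by a rate between 0 and 1. A minimal sketch follows; the function names, the range check, and the example rates (25% repeat sales, 10% of sales from referrals) are our own assumptions.

```python
def _check_rate(rate: float) -> None:
    """Rates are fractions: 0.25 means 25%."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError(f"rate must be between 0 and 1, got {rate}")


def projected_churn_loss(revenue_lost: float, repeat_sales_rate: float) -> float:
    """Projected loss due to lost customers."""
    _check_rate(repeat_sales_rate)
    return revenue_lost * repeat_sales_rate


def projected_brand_damage_loss(revenue_lost: float,
                                referral_sales_share: float) -> float:
    """Projected revenue loss due to brand damage."""
    _check_rate(referral_sales_share)
    return revenue_lost * referral_sales_share


# Hypothetical outage that cost $4,000 in direct revenue.
churn = projected_churn_loss(4_000.0, 0.25)
brand = projected_brand_damage_loss(4_000.0, 0.10)
print(f"Churn: ${churn:,.2f}, brand damage: ${brand:,.2f}")
```

The range check is there because these rates are easy to enter as 25 instead of 0.25, which would silently inflate the estimate a hundredfold.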
Though, as we’ve said, the data is very wonky on this one. For example, an organization can experience a loss in sales right after the event and for a sustained period afterwards, but other macroeconomic factors can contribute to this as well.
One of the surer ways to figure out the effect of downtime on brand image is to observe the share price following the disaster. For instance, in 2017 British Airways experienced a server outage that cost it approximately 80 million pounds and stranded roughly 75,000 passengers over the weekend.
Just a few days later, British Airways’ shares dropped by 4% in one day. In other words, BA lost 170 million pounds in value.
Human resource damages:
Another important thing to note is the damage to morale that your team accumulates through downtime events. Aside from possible lost deals, inconvenienced clients, and reputation damage, you must understand that your workers’ morale does suffer from company failures. Respect towards one’s job is crucial!
This applies especially to your IT department: not a single system administrator likes to stay after hours or get woken up at 3 AM to attend to a problem. And because your employees must make up for the work lost due to downtime, well, you get the picture…
If these scenarios happen week after week, month after month, people will get frustrated and look for an exit. And once that happens, you’re out of valuable human resources, which puts you even further behind. Sometimes it can even break SMEs.
That is why proper HRMS practices must absorb the brunt of the damage. If you don’t have an active HR department taking the heat and minimizing the damage, you will soon find that you’re out of qualified workers and, worse yet, that nobody wants to come and work for you!
What can we do to prevent downtime?
So, what can you do to prevent this disaster? There are three crucial things to do:
1. Develop a “Business Continuity and Disaster Recovery” (BCDR) plan
Obviously, the very first thing to do is create an emergency plan, built on understanding your business through and through. For this we recommend that you have a BPMN chart of your enterprise handy.
A Business Continuity and Disaster Recovery (BCDR) plan involves redirecting resources, setting up chains of command, and coordinating employee shifts in case of a disaster. This is done to prevent or minimize data loss and work interruptions during a downtime event.
Every employee must understand exactly what they are supposed to do in case of an outage, a disaster, or server downtime. They need to know whom to contact if their direct superior is unavailable, and whom to report to if they notice a problem. With your BPMN chart in front of you, you can easily discern which departments or jobs can be done remotely or shifted if your network goes down.
Understand that downtime can come in many forms, and a disaster recovery plan is crucial if you want to minimize the costs and not go under.
Another crucial thing to take care of: if people work remotely, make sure to have secure networks set up. After all, cyberattacks, especially ransomware, account for a large share of downtime incidents.
The second part of a BCDR is setting up a recovery initiative. For that…
2. Create back-up storage facilities and secondary emergency data centres, either in-house or using third-party solutions.
Creating a proper data backup strategy is as crucial as the planning phase. We’ve already mentioned the paired storage paradigm. However, it’s a bit more involved than that:
There’s something called the 3-2-1 backup strategy in the industry. It is a baseline rule among many industry leaders. It involves having:
- At least three copies of your data.
- Backed-up data on two different storage types.
- At least one copy of the data offsite.
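The three conditions of the 3-2-1 rule are easy to check against an inventory of your backups. Below is a minimal sketch; the `BackupCopy` structure, its field names, and the example inventory are hypothetical, and we assume the primary (production) copy counts as one of the three.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    storage_type: str  # e.g. "disk", "tape", "cloud"
    offsite: bool


def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """True if the inventory meets all three conditions of the 3-2-1 rule."""
    enough_copies = len(copies) >= 3
    enough_storage_types = len({c.storage_type for c in copies}) >= 2
    has_offsite_copy = any(c.offsite for c in copies)
    return enough_copies and enough_storage_types and has_offsite_copy


# Hypothetical inventory: production disk, a local tape, and a cloud copy.
inventory = [
    BackupCopy("production server", "disk", offsite=False),
    BackupCopy("weekly tape", "tape", offsite=False),
    BackupCopy("cloud backup", "cloud", offsite=True),
]
print(satisfies_3_2_1(inventory))
```

A check like this could run as part of a regular audit, so you find out that a copy has gone missing before you need it.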
That applies, however, if you’re an enterprise-level business and want to have everything done in-house. For SMEs, it’s far better to use a readily available solution on the market. The typical price ranges between $30 and $100 per year for 5 TB of cloud backup.
If you’re adamant about creating your own backup infrastructure, here’s a list of pointers:
- Your recovery data should be stored in multiple locations, separated by physical distance. Never in one datacentre, and certainly not in the same building as your office.
- Keep crucial data stored on a physical copy somewhere, such as magnetic tapes for long-term storage, or DVDs or CDs for short-term storage.
- Set up clear and strict data backup schedules.
- Constantly test your infrastructure.
3. Implement a “High severity incident management program”.
Finally, admit to yourself that downtime is inevitable, which is part of the reason it must always be accounted for in yearly budget plans. And then, start learning from these incidents.
Here’s where a “high severity incident management program” (SEV) comes into play. This is the practice of recording, tracking, and assigning business value to problems that impact critical systems.
SEV is the industry-wide term derived from *severity*, but it is commonly understood to mean “incident”.
Think of it as a DEFCON for enterprises, a mode of categorizing and assigning priority to incidents according to their severity.
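To illustrate what categorizing and prioritizing by severity can look like, here is a small sketch. The three levels, the `Incident` fields, and the thresholds (50% and 10% of users affected) are purely hypothetical; every organization defines its own SEV criteria.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # most severe, handled first
    SEV2 = 2
    SEV3 = 3  # least severe


@dataclass
class Incident:
    title: str
    users_affected_pct: float  # 0 to 100
    critical_system: bool


def classify(incident: Incident) -> Severity:
    """Assign a severity level using hypothetical thresholds."""
    if incident.critical_system and incident.users_affected_pct >= 50:
        return Severity.SEV1
    if incident.critical_system or incident.users_affected_pct >= 10:
        return Severity.SEV2
    return Severity.SEV3


def triage(incidents: list[Incident]) -> list[Incident]:
    """Order incidents so the most severe come first."""
    return sorted(incidents, key=classify)


open_incidents = [
    Incident("Slow report export", users_affected_pct=2.0, critical_system=False),
    Incident("Checkout down", users_affected_pct=80.0, critical_system=True),
    Incident("Login errors", users_affected_pct=15.0, critical_system=False),
]
for incident in triage(open_incidents):
    print(classify(incident).name, incident.title)
```

Using an ordered enum means the lowest number is the highest priority, much like DEFCON levels, so sorting by severity gives you the triage order directly.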
The core idea behind SEV management is getting everyone in the company, or in the affected teams, on the same page as fast as possible!
[Figure: a typical SEV lifecycle]
Aside from bringing everyone up to speed, the program includes SEV analysis and information gathering. The best way to prevent a future disaster is to learn from past ones and take steps to stop them from happening again.
Here’s a Google template for SEV reporting that we advise you to study and send out to your colleagues next time something goes awry!
We’ll touch on SEV reporting and how to use this data at a later date, as well as the true role of incident managers on-call (IMOC) and “technical leads on-call” (TLOC).
Which brings us to...
Unsung heroes and the importance of infrastructure
One of the worst outcomes of a hardware malfunction is the loss of crucial data. Servers can burn down due to a lack of proper maintenance, human error, disaster, etc., and losing all the data on your clients, partners, or projects can be a death sentence for your company.
But one thing is for certain: never take your system administrators for granted! They are often the ones who put out fires on a daily basis, but because no one notices, people are quick to disregard their efforts!
So go ahead and give your system administrator a hug; I’m sure they need one!
And that wraps up our take on the “true cost” of downtime in IT. Hopefully we’ve shed some light on why it’s so important to properly prepare yourself against disaster.
Oh, and remember what happens if you lose your primary storage with crucial information: an advanced form of data recovery from a single broken hard drive can cost between $700 and $2,000, while a third-party backup solution costs $45 per year.
Now, dear reader, ask yourself, how much downtime can your business tolerate?
Stay classy, business and tech nerds!