Director’s Blog: 10 Ways to eliminate the causes of unplanned downtime in your data centre

Data centre downtime is a nightmare.

But I don’t have to tell you that… We’ve all seen it first-hand.

And the costs are exorbitant. Not only are there repair/maintenance costs, there’s also the potential for huge loss of revenue. We’ve spoken before about how the average data centre downtime incident costs over £400k.

But that’s just an average: in 2013 Google had an outage that reportedly cost over half a million dollars… for just five minutes of downtime. Amazon estimate that whilst their .com domain is down, they lose $1,104 per second. And that’s not even factoring in the potential for damage to reputation, lapses in communication, and loss of vital data.
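To put that per-second figure in perspective, here is a rough back-of-the-envelope sketch. It assumes the loss rate stays constant for the whole outage, which is a simplification, and the function name is purely illustrative.

```python
# Rough downtime-cost arithmetic using the per-second figure quoted above.
AMAZON_LOSS_PER_SECOND_USD = 1_104  # estimate quoted in the text


def downtime_cost(seconds: float,
                  loss_per_second: float = AMAZON_LOSS_PER_SECOND_USD) -> float:
    """Estimated revenue lost, assuming a constant loss rate (a simplification)."""
    return seconds * loss_per_second


for label, seconds in [("1 minute", 60), ("5 minutes", 300), ("1 hour", 3600)]:
    print(f"{label}: ${downtime_cost(seconds):,.0f}")
```

At that rate a single minute comes to $66,240, and an hour to just under $4 million. Swap in your own per-second figure to get a feel for what is at stake at your site.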

Clearly this is something we want to avoid.

The causes, though, range from the routine to the ridiculous. The number of scenarios that could potentially force downtime is nearly endless.

The majority, however, can be summarised into these big four:

1) Failures (Equipment, systems, UPS, utilities, generators, etc.)
2) Weather
3) Human error
4) Cyber crime

So without further ado, here are 10 ways to eliminate these causes of unplanned downtime at your data centre.

1. Plan Your Downtime

A no-brainer, perhaps, but one which should be kept in the forefront of our minds.

The regular scheduling of planned downtime for maintenance, pre-emptive repairs, and upgrades is paramount to avoiding unplanned downtime caused by equipment failures.

When performed properly, these periods of downtime are completely invisible to your clients and the outside world. They may be time consuming and expensive to schedule regularly, but that’s a pittance compared to the costs of unplanned downtime.

2. Don’t Skimp (Remember the Costs!)

Now, more than ever, is a time for investment.

Due largely to the growth in enterprise cloud computing and M2M connectivity, not to mention content-heavy applications, Cisco predict that global IP traffic will reach 1 zettabyte this year.

That’s a billion terabytes.

And it gets better. Cisco also predict that by 2017 that figure will reach 1.4 zettabytes. 40% growth in just a couple of years.

It’s safe to say that most data centres are going to be growing in the coming years, and that new data centres will be popping up all over the place. But whilst that happens, I’d also like to venture a prediction: unplanned downtime will increase.

While rushing to meet the ever-increasing demand for server space, some companies will not take the time and care necessary to ensure near-flawless uptime. I highly recommend that you avoid this potential trend; train your staff thoroughly, invest in high quality monitoring systems, and do everything you possibly can to avoid unplanned downtime.

On that note, I’ll leave you with the words of Lee Brathwaite, Vice President of Real Estate for Verizon, a company renowned for its exceptional uptime statistics.

“In a business where an hour of downtime can result in millions of dollars of lost revenue, it is critical that customers are confident that the networks they depend on are continuously up and running.”

3. Watch Out for Squirrels…

If you followed the earlier link to datacenterknowledge.com, you will no doubt have read that in 2010 squirrels took down half of Yahoo’s Santa Clara data centre. In fact, it’s not that uncommon: a 2011 study found that 17% of Level 3 Communications’ cable damage was due to squirrel chewing.

Perhaps even more bizarrely, in 2010 Google reported that aerial fibre links to its $600m Oregon data centre were “regularly shot down by hunters”.

But I’m not here to talk to you about squirrels. Or hunters.

The takeaway from this is that it’s vital to assess all potential threats to your uptime, and take action against them. You can’t necessarily predict every eventuality, but if you really look for them you’ll find a lot more threats than are immediately apparent.

And you’ll be surprised how taking preventative measures against one risk can help protect you against a plethora of others you never considered.

Google solved their hunting problem by moving their fibre links underground. A simple step, but one which also protects against damage from other sources, including weather, vehicles, and, of course, squirrels.

4. Implement a Well Designed UPS

You didn’t think you’d get through a whole article without me talking about UPS systems, did you?

Data centre UPS systems are essential to avoiding unplanned downtime, and as I’ve written previously they don’t just protect against outages. Power ‘events’ come in a variety of forms, and they can cause significant damage to your critical systems.

And that’s why just any old UPS won’t do – even if it’s designed for data centre power supply applications. You need a UPS system designed for you.

The same is true of your backup generator. Ideally your diesel generator and UPS suppliers will have a history of working together (or indeed be one and the same), as the switchover between these vital systems is essential to avoiding unplanned downtime.

Generators and UPS systems vary tremendously in size and function, and experienced suppliers know that every system must be tailored to the specific needs of the client.

That’s the way we operate at KOHLER Uninterruptible Power, and you shouldn’t accept anything less.

5. Have a Backup for Your Backup

Unfortunately, there is such a thing as UPS failure. And it’s not pretty.

With old-fashioned online UPS systems, you could potentially have a situation where UPS failure resulted in unplanned downtime even though mains power was present.

Thankfully in modern systems this isn’t the case. In the last article, where we discussed the major components of a modern UPS, I talked briefly about the static switch. This UPS component enables switchover to mains/bypass power in the event of a UPS failure.

This is simply another example of the value of investing in modern, high quality, tailored systems. In an ideal world, any single system or piece of equipment should be able to fail without causing unplanned downtime.

That might not always be possible, but it’s a good point to aim for.
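To see why a backup for your backup pays off, here is a minimal sketch of the standard redundancy arithmetic. It assumes identical units that fail independently of one another, which real installations only approximate (shared cabling, shared cooling and human error all break that assumption), and the 99% figure is a made-up example rather than a specification for any product.

```python
def parallel_availability(unit_availability: float, units: int) -> float:
    """Availability when any one of `units` identical units can carry the load.

    Assumes failures are independent, so the whole arrangement is only down
    when every unit is down at the same time.
    """
    if units < 1:
        raise ValueError("units must be at least 1")
    return 1 - (1 - unit_availability) ** units


# Hypothetical example: each unit is available 99% of the time.
for n in (1, 2, 3):
    print(f"{n} unit(s): {parallel_availability(0.99, n):.6f}")
```

Under those assumptions, each extra unit multiplies the chance of a total failure by the unavailability of a single unit, which is why one layer of redundancy buys so much.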

6. Look After Your Batteries

I’ve written about the importance of UPS battery maintenance before, but I really can’t stress it enough. If your UPS is vital to your overall uptime (and it is), battery maintenance becomes paramount as well.

Of course, in a large data centre you’re not going to be running on battery power for long. Even a 500 kVA system is only designed to support critical loads for long enough to switch over to a backup power supply.

But those precious minutes between mains power and backup power are a scary time. If everything goes to plan, the event will be invisible to the outside world. If there’s a fault with your batteries, you’re going to experience unplanned downtime.

So look after them.

7. Don’t Rely on People

OK, this is a little facetious. You can’t avoid relying on people altogether.

But studies cite human error as the cause of between 50% and 80% of all unplanned outages in data centres. Whatever the precise figure may be, it’s time to sit up and take notice.

There are any number of electronic systems and devices designed to remove some of the human element from data centre operations. Tracking everything from environmental conditions (heat, humidity, etc.) to the clock speed of individual servers, these innovations can identify potential issues before they happen.
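At its simplest, this kind of monitoring boils down to comparing sensor readings against agreed limits and raising a warning before anything fails. The sketch below is a toy illustration only: the sensor names and limits are placeholders I have made up, not the behaviour of any particular monitoring product, and real limits should come from your equipment specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    low: float
    high: float


# Illustrative limits only; real values come from your equipment specifications.
LIMITS = {
    "temperature_c": Limit(18.0, 27.0),
    "humidity_pct": Limit(20.0, 80.0),
}


def out_of_range(readings: dict[str, float]) -> list[str]:
    """Return one warning for each reading outside its configured limits."""
    warnings = []
    for name, value in readings.items():
        limit = LIMITS.get(name)
        if limit is None:
            continue  # no limit configured for this sensor, so skip it
        if not (limit.low <= value <= limit.high):
            warnings.append(f"{name}={value} outside {limit.low}-{limit.high}")
    return warnings


# Hypothetical readings: the temperature is too high, the humidity is fine.
print(out_of_range({"temperature_c": 31.5, "humidity_pct": 45.0}))
```

The point is that the check runs every time, without anyone having to remember to walk the floor with a thermometer.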

The KOHLER Uninterruptible Power Remote Monitoring system is an excellent example of how you can cut a huge part of the human element out of the maintenance process, whilst also drastically reducing the chances of unforeseen downtime.

Remember what I said earlier about skimping? Some of these systems may be on the expensive side, but I’ll wager the costs of unforeseen downtime would be higher in the long run.

8. Consider Outsourcing

The decision of whether to perform a function in-house or not is usually informed by costs.

Can we justify the personnel and training costs necessary to meet this function, or should we simply outsource it?

But I’d like to add a second question to the discussion: Is this function essential to maintaining our uptime?

Sure, you have technically minded people on site most of the time. But are they specialists in server maintenance? How about UPS systems? Utilities? Environmental factors?

It’s certainly possible to train your people to perform these functions, but when it’s just one small part of their duties they’re never going to be experts. When you outsource vital functions to a specialist, you’re giving yourself the best possible chance of avoiding the human error factor.

9. Defend Your Digital Fortress

It’s a chilling thought.

Your systems are in perfect health. You’re regularly cycling your servers, monitoring your UPS batteries, and expertly maintaining your systems.

But even though you’ve done everything right, you’re still at risk.

Cyber crime accounts for a relatively small proportion of unplanned downtime incidents, but it can be very, very expensive to resolve. Not only is there the potential for major damage to your systems, there’s also the potential for loss of sensitive information.

Just like every other element of your operations, you can’t skimp if you want to maintain your uptime. I would strongly advise that you consider uptime to be your primary concern, above minimising costs and above improving energy efficiency.

I won’t pretend to be an expert in digital security, but I know enough to say with certainty that this is not an area in which you’ll want to be lacking.

And of course, we also shouldn’t forget about physical security. Even in this day and age there are occasionally high profile instances of physical security breaches!

Cyber crime is becoming more prevalent every year. As a data centre, choosing whether or not to invest in state of the art security should be an easy decision.

10. Get a Health MOT

An outside perspective is often useful.

Even if you’ve done your due diligence and implemented high quality, modern UPS systems, I’d still recommend that you have them thoroughly reviewed by an external source.

When it comes to your UPS, we offer a free site survey service to ensure your system is properly designed and installed to protect you against unplanned downtime. We’re experts in everything necessary to maintain a constant supply of clean, incident-free power to your data centre.

Just get in touch, and we’ll do the rest.

And who knows, if you take all these suggestions on board, maybe you’ll hit that elusive ‘six nines’ (99.9999%) availability target next year!
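Just how elusive is it? Here is a quick sketch that converts a number of “nines” into the downtime you are allowed per year. It assumes a 365-day year, and the function name is my own.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; ignores leap years


def allowed_downtime_seconds(nines: int) -> float:
    """Yearly downtime budget for an availability of `nines` nines.

    For example, nines=3 means 99.9% and nines=6 means 99.9999%.
    """
    unavailability = 10 ** -nines
    return SECONDS_PER_YEAR * unavailability


for nines in (3, 4, 5, 6):
    print(f"{nines} nines: {allowed_downtime_seconds(nines):,.1f} seconds per year")
```

Six nines works out to roughly 31.5 seconds of downtime in an entire year, which is less than it takes to read this section.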
