Is your disaster recovery plan a disaster waiting to happen?

Oct 28, 2016

Bridgeworks COO Claire Buchanan warns that your organization should be planning a strategy that sustains business and IT operations continuity.

Buchanan says, “Heads of Infrastructure and CIOs are the ones whose necks are on the line if the IT Business Continuity plan falls short. If acts of terrorism, cyber security issues, extreme weather, employee strikes, ‘acts of god’, power outages or human errors are keeping you up at night, there’s some hope…”

YOUR DISASTER RECOVERY PLAN – IS IT A DISASTER WAITING TO HAPPEN?

Resilience in the face of risk is the key to the 21st-century enterprise, but without a cast-iron strategy the cost can be severe.

Let’s be clear about this: your organization should be planning a strategy that sustains business and IT operations continuity should corporate data center operations be impacted by acts of terrorism, cyber security issues, extreme weather, employee strikes, ‘acts of god’, power outages, human errors… the list goes on and on. It’s a wonder anyone sleeps at night!

80% of Enterprises are less than equipped…

If this weren’t uncomfortable enough, let’s cut to the really scary bit. Out of 854 companies that were recently surveyed by Gartner*, it’s estimated that over 80% have an ITScore Assessment for Business Continuity Management (BCM) that shows they are either less than equipped or, worse, not equipped at all to deal with an event that could impact a business’s ability to operate. Forget acts of nature: a successful cyber-attack could take out a business for days or even weeks. So it’s imperative that an enterprise has multiple data copies, and the most up-to-date replication of its data at all times, to give it the best chance of meeting those all-important Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs), especially for its most important mission-critical applications.

We know that the reason so many enterprises are scoring so low for BCM is a continual struggle to sustain production data currency and availability. This struggle is mostly caused by an inability to transfer large volumes of data at speed. That speed is what enables cast-iron, regular, complete backups (without failures) and supports more reliable replication, no matter the distance.

Every minute is crucial to RTOs and RPOs

So, why is networked data transfer capacity so crucial to achieving these RTOs and RPOs? Everything starts from the simple backup. Well, it’s actually not so simple at all. With ever-growing volumes of data needing to be shifted to multiple off-site backup locations, enterprises are failing to get a full data backup to where it needs to be quickly. Latency is the biggest killer of network speed, and enterprises are persevering with struggling, antiquated solutions that can no longer cope with the demands.

Everyone is looking to inject speed into backup and replication strategies. I am talking, for example, about improved networked data capacity without incurring significantly greater communication services costs. Speed opens up options and significantly reduces the risk of lost data. And why would that be? Let’s be honest, without the most current, complete and available production data your Disaster Recovery strategy is about as useful as a chocolate fireguard.

Every minute and every $ counts

Getting optimal data transfer capacity is a business-critical objective, and it’s why enterprises are spending millions on trying to solve the issues that are negatively impacting the availability and agility of business operations. Every minute counts when something goes wrong.

Gartner reveal in their Tier Group Recovery Time Targets and Related Spending that:

– Organizations that are trying to achieve RTOs and RPOs of 0-2 hours are spending 6% or more of their IT budgets

– Organizations that are trying to achieve RTOs and RPOs of 2-8 hours are spending 3-5% of their IT budgets

– Organizations that are trying to achieve RTOs and RPOs of 8-24 hours are spending 1-3% of their IT budgets

Predictability and cost

Enterprises should be able to rest in the knowledge that their last backups and replications (housed in multiple locations and at a safe distance, outside the circle of disruption) are watertight and have been regularly tested in a DR scenario, which is another thorny subject. Why thorny? This is the Achilles heel: how do you define and manage RPO? How do you make RPO predictable? Then how do you drive down the cost? Before you get to this, of course, you have to test data in a live-operations type of environment, which is dicey at best, right? Then you have the cost of manual workarounds, because you have to recreate and reconcile the data loss that you have incurred. Oh, and have you tracked all those replication jobs that never actually completed?

Instead, many know deep down that they are unprotected. And why? Because (1) they don’t know how to raise the bar and (2) there are some dynamics that should be considered here.

Network and Storage guys are feeling the pressure.

Ultimately, getting this right sits with the Head of Infrastructure and the CIOs. These are the guys whose necks are on the line if the IT Business Continuity plan falls short.

With the multitude of responsibilities that these guys face with enterprise infrastructure, they rely on and trust the expertise of network and storage teams to help them find the best solutions and to keep up with the newest innovations that have the potential to solve IT challenges.

We have been into enterprises where network and storage guys have come up with the most complex and elaborate methods to try to make a difference, and yet the results are the same: an inability to harvest the regular backups that would provide that safety net.

Automation – less cost, less risk

The great news out in the market is that Transformational Technologies will make a dramatic impact. What has me scratching my head is that the guys who you’d think would be excited about solutions that make them look like rock stars are actually looking on with scepticism and suspicion.

For example, we know that self-managing Artificial Intelligence (AI) powered technologies are already on the market that can help enterprises achieve better performance over their networks and make backups less likely to time out or fail. Yet there is still trepidation among IT departments about trialling these new solutions, and we are hearing that this is for a couple of reasons that have nothing to do with what is best for the protection of an enterprise should a crisis occur.

So why is this? Well, it’s no secret that AI is ruffling feathers across sectors, with stories of the technology delivering better efficiency and automation of processes. More efficient ways of working are making people across the board feel devalued and at risk of looking deskilled. I think nothing could be further from the truth. These technologies automate the mundane and allow people to truly add value to their organizations, rather than trying to second-guess a network. Let’s face it, management automation does that in its sleep, doesn’t make mistakes and is far less costly to audit.


Here’s what the Head of Infrastructure needs to know

Tip 1 – Don’t leave yourself exposed 

Remember the story of the Emperor’s New Clothes? Everyone knows there is a problem but refuses to acknowledge the issues, instead opting to accept complexity and expense as the norm. It is time for leadership to call it out.

Tip 2 – Keep it simple

Rather than allowing inertia or apathy, shouldn’t your network and storage guys feel empowered to source and trial new solutions that are capable of saving the company’s bacon when things go wrong? If they are encouraged to move away from simply accepting the norm, they even have the chance of saving the day and reducing costs into the bargain. No-brainer, right?!

To give you one example, we had a client that was trying to back up 430GB a night and just couldn’t. Their WAN, with its latency and packet loss, meant that they had to take the decision to do 50GB incremental backups, and even that was still taking them 12 hours. No need for me to tell you the impact that had on their RPOs and RTOs. By applying some of that efficient self-managing AI technology I was talking about above (incidentally, WANrockIT and PORTrockIT by Bridgeworks), they were able to optimize their existing and expensive infrastructure to achieve full backups in under 4 hours, every time. In one fell swoop they had taken 8 hours out of their backup window and dramatically decreased their risk, not to mention a huge cost: there was no longer any need to manually move the data to a third party every night as a safeguard.
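To put rough numbers on that client example, here is a minimal back-of-the-envelope sketch. It assumes 1 GB means 10**9 bytes and treats "sub 4hrs" as exactly 4 hours, so the "after" figure is a lower bound; the function name is purely illustrative.

```python
def throughput_mbit_s(gigabytes: float, hours: float) -> float:
    """Average throughput in Mbit/s, assuming 1 GB = 10**9 bytes."""
    bits = gigabytes * 1e9 * 8
    seconds = hours * 3600
    return bits / seconds / 1e6

# Before: 50GB incremental backup taking 12 hours.
before = throughput_mbit_s(50, 12)
# After: 430GB full backup in (at most) 4 hours.
after = throughput_mbit_s(430, 4)
# How long the full 430GB would have taken at the old rate.
hours_full_at_old_rate = 430 / (50 / 12)

print(f"before: {before:.1f} Mbit/s")
print(f"after:  {after:.1f} Mbit/s (at least)")
print(f"speed-up: {after / before:.1f}x")
print(f"full backup at old rate: {hours_full_at_old_rate:.1f} hours")
```

On these assumptions the old setup was averaging under 10 Mbit/s, and a full 430GB backup at that rate would have taken more than four days, which is why a nightly full backup was out of reach.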

Tip 3 – Fix the underlying problem

Let’s be honest, we have been avoiding the underlying issues for years. Incrementals, staging, not backing up every night, failed replication, slow restores: all because we can’t get the speed into the networks. Not because it isn’t there from a capacity standpoint, but because we haven’t fixed the latency and packet loss issues that make our networks look slow.
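One way to see why latency and packet loss, rather than raw capacity, can make a link look slow is to apply two well-known rules of thumb for a single standard TCP stream: throughput cannot exceed the window size divided by the round-trip time, and under random loss it is roughly bounded by the Mathis approximation. The sketch below is illustrative only. The article does not say which protocol or link the transfers used, so the window size, round-trip time, segment size and loss rate are all assumed values.

```python
import math

def window_limited_mbit_s(window_bytes: float, rtt_ms: float) -> float:
    """Upper bound from the window: at most one window per round trip."""
    return window_bytes * 8 / (rtt_ms / 1000) / 1e6

def loss_limited_mbit_s(mss_bytes: float, rtt_ms: float, loss_rate: float) -> float:
    """Mathis approximation: (MSS / RTT) * sqrt(3/2) / sqrt(loss rate)."""
    per_rtt_bits = mss_bytes * 8 / (rtt_ms / 1000)
    return per_rtt_bits * math.sqrt(1.5) / math.sqrt(loss_rate) / 1e6

# Hypothetical WAN: 80 ms round trip, 0.1% packet loss,
# 65,535-byte window (no window scaling), 1,460-byte segments.
by_window = window_limited_mbit_s(65535, 80)
by_loss = loss_limited_mbit_s(1460, 80, 0.001)

print(f"window-limited: {by_window:.2f} Mbit/s")
print(f"loss-limited:   {by_loss:.2f} Mbit/s")
```

With these assumed values, both bounds come out at a few megabits per second regardless of how much bandwidth has been purchased, which is the sense in which the capacity is there but the network looks slow.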

Full backups and completed replication

No longer is that the case. Elegant. Simple. Full backups multiple times a day. Completed replication. All are a thing of the now, not the future. Think about it: why wouldn’t you want a scenario where you could just ship the data where it needs to go, and then simply restore when things go wrong?

In conclusion, your enterprise simply cannot afford the additional risk of ignoring Transformational Technologies that are available and on the market now. Reducing the complexity of your setup, by adding the right tech for the new challenges that large volumes of data present, should be part of your thinking if you really want to mitigate the risk. This is ultimately your decision to make as, let’s face it, it’s your head on the line should the safety net fail. It’s important to make sure your teams are making the right decisions to support you, and to help them carve a path that allows them to be the superheroes should disaster strike. This way, you are already on the way to alleviating those sleepless nights caused by the worry that you might not be as prepared as you’d like.

* Out of the 854 ITScore assessments, the number of clients who were at either Level 1 or Level 2 (i.e. low maturity) was 685, or 80%; only 169, or 20%, were at the higher (effective) levels of maturity (i.e. Levels 3 or 4).