How to avoid the next major IT outage

With CrowdStrike shining a light on the vulnerability of third-party service providers, here are five steps to help you avoid falling victim to the next business-crippling outage.

Some of the world’s most high-profile organisations were left scrambling recently after a global IT outage caused severe and widespread disruption to their businesses.

A faulty software update took out 8.5 million computers globally, affecting industries spanning transportation, banking, retail, media and healthcare. Some of those companies continued to experience problems with their systems for days following the outage, leading to contention over who is to blame for their loss of service.

Of course, if your organisation wasn’t using the affected software, you may have breathed a sigh of relief as the scale of the disruption became apparent. However, you may not be so lucky in the future, says Simon Withington, Technology Assurance Director at Forvis Mazars.

The incident has highlighted a major vulnerability in modern IT architecture and practices – the risks associated with third-party ecosystems. It demonstrated the technology concentration risk in this area and how an overreliance on a single vendor to deliver IT services, without adequate safeguards, can bring many organisations to a standstill.

“Many IT services consumed today are managed by large, sophisticated organisations, and customers benefit from the resilience that operating model offers them,” explains Withington. “The problem is there is just a single point of failure – when one of those vendors has an outage or there is an issue, all of their customers can be impacted at once. This is something we’re seeing more often.”

Leveraging technology to transform their company remains a priority for organisations, according to the C-Suite Barometer – as is reviewing supply chains, operations and processes. Organisations must therefore consider what preventative measures they can take to minimise the associated risks from their third-party ecosystem and help protect them from future events.

Step 1: Assess and Understand the Third (and Fourth)-Party Risk

First and foremost, you may have faith in the resiliency of the service your third-party vendor provides. But unless you’ve thoroughly investigated the extent of your dependency on that vendor, you can’t be sure what might happen should a problem arise.

You may not even be consuming a service from a vendor hit by problems, but one of your third-party cloud application or service providers might be, meaning your business could still suffer. This can be viewed as a fourth-party risk, so it is important to have insight into any risks associated with your ecosystem of third- and fourth-party vendors.

“Sometimes organisations are unable to identify all services they consume – or place blind faith in their providers. They should do an assessment of the potential points of failure and build mitigation plans around those; assess their architecture, assess their ecosystems and really understand the risks – and if there are risks, put in place mitigation plans,” says Withington.
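The assessment described above can start as a simple dependency map. The following is a minimal sketch, not a prescribed method: the service and vendor names (`VendorA`, `CloudX` and so on) and the `DEPENDS_ON` inventory are hypothetical placeholders that you would replace with your own records. It walks each internal service's chain of providers, labels direct vendors as third-party (depth 1) and anything further upstream as fourth-party (depth 2 or more), and flags providers whose failure would hit more than one service.

```python
from collections import defaultdict

# Hypothetical inventory: each service or vendor mapped to the providers it relies on.
# All names are illustrative placeholders, not real vendors.
DEPENDS_ON = {
    "payments": ["VendorA"],
    "email": ["VendorB"],
    "hr_portal": ["VendorC"],
    "VendorA": ["CloudX"],
    "VendorB": ["CloudX"],
    "VendorC": [],
    "CloudX": [],
}
INTERNAL_SERVICES = ["payments", "email", "hr_portal"]


def upstream_providers(service, depends_on):
    """Return every provider reachable from `service`, mapped to its depth.

    Depth 1 is a direct (third-party) vendor; depth 2 or more is a
    fourth-party (or further removed) provider.
    """
    depths = {}
    stack = [(provider, 1) for provider in depends_on.get(service, [])]
    while stack:
        provider, depth = stack.pop()
        # Skip providers already recorded at the same or a shallower depth.
        if provider in depths and depths[provider] <= depth:
            continue
        depths[provider] = depth
        stack.extend((p, depth + 1) for p in depends_on.get(provider, []))
    return depths


def shared_points_of_failure(services, depends_on):
    """Map each provider to the internal services its failure would affect,
    keeping only providers that affect more than one service."""
    impacted = defaultdict(set)
    for service in services:
        for provider in upstream_providers(service, depends_on):
            impacted[provider].add(service)
    return {p: sorted(s) for p, s in impacted.items() if len(s) > 1}


if __name__ == "__main__":
    print(upstream_providers("payments", DEPENDS_ON))
    print(shared_points_of_failure(INTERNAL_SERVICES, DEPENDS_ON))
```

In this made-up inventory, no single direct vendor serves more than one service, yet `CloudX` sits behind both payments and email. That is the kind of hidden concentration a fourth-party review is meant to surface.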

Step 2: Consider a Multi-Vendor Approach

As we’ve seen, a concentration of services with one service or vendor increases the impact of an outage should the vendor experience a failure. Organisations could potentially adopt a multi-vendor approach to diversify and reduce risk, or provision additional resilience capability.

“Adopting a multi-vendor approach means moving some services away from some vendors for certain parts of the organisation,” says Withington. “Or you can also provision additional resilience, either through the same vendor, or an alternative vendor. By methodically examining risks or threats, this is a strategic decision that organisations need to think about.”

Step 3: Failure to Prepare, Prepare to Fail

Of course, you may be happy to have all your eggs in one vendor’s basket. In which case, the emphasis must be on preparedness.

“Look at scenarios whereby that vendor isn’t available and assess what you would do,” advises Withington. “That includes IT disaster recovery planning based on a total outage of the service. You need to technically evaluate how long it would take to recover in the event of an outage. Does that align to the needs of the organisation and its customers? You can link that to more general business continuity planning – your systems aren’t working, so what do you do?”

Workshop those scenarios and test what you would do in an outage frequently, ensuring your response and recovery measures are effective. Forvis Mazars’ Continuity 360 survey can also provide insight at a high level on your readiness for responding to these disruptions.
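The recovery-time question raised above lends itself to a simple comparison. The sketch below is illustrative only: the `RecoveryScenario` structure, the service names and all the hour figures are assumptions made up for the example, not data from any assessment. For each vendor-outage scenario it compares the estimated time to recover with the downtime the organisation and its customers can tolerate, and lists the shortfalls, largest first.

```python
from dataclasses import dataclass


@dataclass
class RecoveryScenario:
    service: str
    vendor: str
    tolerable_downtime_hours: float   # what the business and its customers can accept
    estimated_recovery_hours: float   # outcome of a technical evaluation or test


def recovery_gaps(scenarios):
    """Return (service, shortfall_hours) for every scenario where recovery
    would take longer than the tolerable downtime, largest shortfall first."""
    gaps = [
        (s.service, s.estimated_recovery_hours - s.tolerable_downtime_hours)
        for s in scenarios
        if s.estimated_recovery_hours > s.tolerable_downtime_hours
    ]
    return sorted(gaps, key=lambda gap: gap[1], reverse=True)


# Hypothetical figures for illustration only.
SCENARIOS = [
    RecoveryScenario("payments", "VendorA", 4, 12),
    RecoveryScenario("email", "VendorB", 24, 8),
    RecoveryScenario("hr_portal", "VendorC", 48, 72),
]

if __name__ == "__main__":
    for service, shortfall in recovery_gaps(SCENARIOS):
        print(f"{service}: recovery exceeds tolerance by {shortfall} hours")
```

Services that appear in the output are candidates for additional resilience or a revised continuity plan. Services that do not appear already recover within tolerance on paper, which still needs confirming through testing.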

Step 4: Hold Your Vendors to Account

Although third parties are contractually responsible for managing their services safely and securely, organisations are ultimately accountable in the event of a failure. So, it’s important to take appropriate measures to ensure that controls designed to govern and monitor third parties are in place, covering areas such as their secure software development and deployment practices.

As part of your vendor onboarding, and ongoing vendor management, you need to continuously challenge the vendor around how effective their control environment is.

There are different things that vendors can do to provide assurance, including third-party assurance assessments that can be procured, and different accreditations. You need to find out whether your suppliers have those, and what the results of those assessments were. Is there anything within that control environment which may inadvertently cause an outage to occur?

“You need to assess those controls, assess the assurances that are in place, have your ongoing vendor service meetings,” says Withington. “We recommend speaking to key vendors on a quarterly basis. You’re assessing the service they’re providing and any risks that they’re seeing – such as similar outages that may have impacted other customers. Be up to date with your vendor performance. We also recommend speaking to industry peers to understand their experiences of incidents.”

Step 5: Manage Software Changes and Updates

Similarly, it is crucial to adopt a balanced approach to test and change management for vendor updates and releases. Where auto-updates are allowed, these should be subject to risk and impact assessments, and deployment should be managed to prevent simultaneous disruption of all services.

As a customer, you can challenge your vendors on how effective their software development lifecycle is.

“Sometimes software is automatically updated and pushed out as quickly as possible – particularly when addressing novel cyber security threats. Vendors want to make sure the client is protected but organisations need to think about the risks associated with pushing updates out into live environments – a bug in that code could cause a similar outage to occur again,” says Withington.

“So, what are you doing to quality assure what the vendor is doing? You can decide whether you want to auto accept updates or want to do deeper testing in development or pre-production environments to make sure you’re happy with whatever is being deployed. You can look at the deployment cadence as well, so rather than deploying updates across all endpoints/users, you can take a staged approach affecting a few at a time and see what happens.”
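The staged approach described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any vendor's update mechanism: `deploy` and `is_healthy` are hypothetical callbacks standing in for whatever your endpoint management tooling provides, and the 5% / 25% / 100% wave sizes are arbitrary example values. The update goes to a small group first, and the rollout halts as soon as any endpoint in a wave fails its health check.

```python
def plan_waves(endpoints, cumulative_percents=(5, 25, 100)):
    """Split endpoints into waves so that, after each wave, the given
    cumulative percentage of the estate has been updated."""
    waves, start = [], 0
    total = len(endpoints)
    for percent in cumulative_percents:
        # Integer ceiling of total * percent / 100, capped at the estate size.
        end = min(total, (total * percent + 99) // 100)
        if end > start:
            waves.append(endpoints[start:end])
            start = end
    return waves


def staged_rollout(endpoints, deploy, is_healthy, cumulative_percents=(5, 25, 100)):
    """Deploy wave by wave and stop after the first wave containing an
    unhealthy endpoint.

    `deploy` and `is_healthy` are hypothetical callbacks supplied by the caller.
    Returns (updated_endpoints, halted).
    """
    updated = []
    for wave in plan_waves(endpoints, cumulative_percents):
        for endpoint in wave:
            deploy(endpoint)
            updated.append(endpoint)
        if not all(is_healthy(endpoint) for endpoint in wave):
            return updated, True   # halt: later waves are never touched
    return updated, False


if __name__ == "__main__":
    estate = [f"pc-{i}" for i in range(20)]
    # Simulate an update that breaks one machine in the second wave.
    updated, halted = staged_rollout(
        estate,
        deploy=lambda endpoint: None,
        is_healthy=lambda endpoint: endpoint != "pc-3",
    )
    print(len(updated), halted)
```

With the simulated fault, only the first two waves receive the update and the remaining endpoints are left untouched, which limits the disruption to a fraction of the estate. In practice a pause between waves, long enough for problems to show up, would sit alongside the health check.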

Ultimately, this comes down to the fact that when you outsource to a vendor, you don’t outsource the risk – that still resides with you as an organisation. It is therefore critical to ensure preventative measures are in place to avoid the fallout from the next third-party outage.
