Business Resilience: Lessons from CrowdStrike Outages
The recent CrowdStrike outages served as a stark reminder of the vulnerabilities in our digital infrastructure, affecting millions and disrupting essential services.
Challenges faced with Software as a Service (SaaS) & Security as a Service
In today’s business landscape, reliance on Software as a Service (SaaS) providers is fundamental to enabling cloud-based working. Recent global outages triggered by CrowdStrike updates have underscored the need for effective risk management and mitigation strategies.
Peter Davies, Chief Information Security Officer at Solace Global, reflects on these disruptions, combining his cybersecurity expertise with a personal account of managing the chaos during his travels. His insights underscore the critical need for robust risk management and effective contingency planning.
Travelling During the CrowdStrike Outage
When news broke of massive outages affecting 8.5 million Microsoft Windows PCs and servers – disrupting crucial services like medical operations, airlines, and banking – I initially suspected a zero-day vulnerability or a state-sponsored cyberattack.
At that time, I was en route from France to England with my family. My thoughts raced through potential scenarios, such as flight cancellations and the challenge of using credit cards for emergency transportation or accommodation. This experience drove home the importance of contingency planning, such as arranging for cash withdrawals to cover unforeseen travel needs.
Thankfully, our flight went off without a hitch. However, the irony of withdrawing money at the airport only to be notified by my banking app about exorbitant fees highlighted an important lesson. Effective risk management in business mirrors this scenario: having clear, documented processes and procedures in place is essential.
These documents should detail the impacts on people and technology, ensuring swift and effective communication with all relevant stakeholders.
Just as in personal finance, robust business resilience strategies enable faster recovery and cost savings by minimising the need for high-stress, spontaneous decision-making during crises.
How a CrowdStrike Update Caused an Outage: Root Cause and Vendor Response
The root cause of the outage was traced to a CrowdStrike content update for its Endpoint Detection and Response (EDR) sensor, released via Channel File 291. The update was intended to counter command-and-control (C2) frameworks, which allow threat actors to remotely control a victim’s PC once it has been compromised with malware. These frameworks allow threat actors to exfiltrate data from the target’s network and to launch botnet and distributed denial-of-service (DDoS) attacks.
Fortunately, this issue affected only Microsoft Windows machines, as the update was not deployed to macOS or Linux hosts, which do not use Channel File 291 to evaluate named pipe execution. This limitation prevented an even larger-scale disruption.
Whilst CrowdStrike customers can control which version of the EDR sensor is installed via configuration settings, they have no control over Rapid Response Content, which is designed to respond to the ever-changing threat landscape and thereby protect customers from new threats.
However, serious errors in these platforms can jeopardise the Information Security triad – Confidentiality, Integrity, and Availability – leading to potential reputational damage and regulatory fines. For instance, if personal data has been impacted, regulators could impose fines under GDPR of up to 4% of the company’s annual global turnover or €20 million, whichever is higher.
To address these issues, the CEO of CrowdStrike has announced improvements to the software development lifecycle, along with enhancements that will in future give customers greater control over the content delivery system (CrowdStrike, 2024).
However, this introduces a new paradox for the customer: delaying content updates to avoid outages might increase the risk of vulnerability exposure. Ideally, content delivery updates would be configurable, allowing for critical security updates to be quickly implemented and enabling less critical updates to be delayed for a short period.
How Cybercriminals Took Advantage of the CrowdStrike Outage to Spread Malware
In the wake of the CrowdStrike outage, cybercriminals seized the opportunity to exploit the situation by distributing malware disguised as recovery tools. It is essential to have stringent protocols to ensure that recovery tools are downloaded only from known, reputable vendors, particularly during high-stress periods and when security privileges are elevated.
As part of their mitigation strategies, organisations should keep approved site URLs stored and readily accessible, to avoid falling foul of URL spoofing or accidentally following an incorrect link. By maintaining a secure, vetted list of URLs, organisations can better safeguard against opportunistic threats and ensure the integrity of their recovery processes.
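As an illustration only, a vetted list can be enforced with a simple check before any recovery tool is downloaded. The sketch below is a minimal example under stated assumptions: the host names are hypothetical placeholders, and the function name `is_approved_url` is invented for this example rather than taken from any vendor tooling.

```python
from urllib.parse import urlsplit

# Hypothetical vetted hosts; a real list would be maintained by the security
# team and kept somewhere reachable during an outage.
APPROVED_HOSTS = frozenset({
    "support.vendor.example",
    "downloads.vendor.example",
})


def is_approved_url(url: str, approved_hosts: frozenset = APPROVED_HOSTS) -> bool:
    """Return True only for HTTPS URLs whose exact host is on the vetted list."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    # urlsplit ignores any "user@" prefix, so look-alike tricks such as
    # https://support.vendor.example@evil.test/ resolve to host "evil.test".
    host = (parts.hostname or "").lower()
    return host in approved_hosts
```

Matching on the exact host name, rather than on a substring of the URL, is a deliberate choice: a spoofed address such as `support.vendor.example.evil.test` contains the trusted name but is rejected.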
Business Resilience Planning: Lessons Learnt from BIAs and Third-Party Supply Chain Insights
A key aspect of business resilience is understanding third-party supply chains and conducting comprehensive Business Impact Analyses (BIAs). High-level BIAs might overlook critical infrastructure dependencies, potentially leaving an organisation vulnerable to unforeseen knock-on effects. To mitigate this, mapping critical operating systems and hardware stacks provides a broader view of how accidents or targeted attacks on specific systems could impact operations, enhancing overall preparedness.
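One way to picture such a mapping is as a lookup from each business function to the components it relies on. The following is a minimal sketch, not a prescribed BIA format: the function names, component names, and the `impacted_functions` helper are all hypothetical.

```python
# Hypothetical inventory: business function -> components it depends on.
DEPENDENCIES = {
    "payments": {"windows-server", "edr-agent", "sql-database"},
    "booking-portal": {"linux-server", "sql-database"},
    "staff-laptops": {"windows-client", "edr-agent"},
}


def impacted_functions(failed_component, dependencies=DEPENDENCIES):
    """Return, sorted by name, the business functions that rely on the failed component."""
    return sorted(
        name
        for name, components in dependencies.items()
        if failed_component in components
    )


# Which functions would a faulty endpoint agent take down?
print(impacted_functions("edr-agent"))
```

Even a small table like this makes a shared dependency visible, such as a single security agent installed across otherwise unrelated services.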
While technological solutions such as Security Information and Event Management (SIEM) systems are invaluable for managing cyber incidents by offering insights into security events, they often fall short in providing context about the broader business impact of affected systems.
To address this gap, it’s important to complement SIEM data with a clear understanding of business impacts. This supports effective decision-making during disruptions, so that critical incidents are prioritised and key systems are brought back online quickly.
Consolidated signal intelligence from SIEM systems, combined with a robust grasp of critical business functions and mitigation strategies, prepares organisations for potential incidents. Conducting desktop exercises or fire drills can further refine response strategies, uncover new risks, and identify opportunities for improvement. These simulated scenarios help in testing responses to known threats, ultimately strengthening resilience and enhancing preparedness.
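To make the idea concrete, the sketch below orders incoming alerts by business criticality first and technical severity second. It is an assumption-laden illustration: the `Alert` type, the 1 to 5 scales, and the system names are hypothetical and do not reflect any particular SIEM product’s data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    system: str
    severity: int  # assumed scale: 1 (low) to 5 (critical), as reported by the SIEM


# Hypothetical BIA output: business criticality per system, 1 (low) to 5 (vital).
BUSINESS_CRITICALITY = {"payments-db": 5, "hr-portal": 2, "test-server": 1}


def prioritise(alerts, criticality=BUSINESS_CRITICALITY):
    """Order alerts by business criticality first, then by SIEM severity."""
    # Systems missing from the BIA sort last here; a real process would
    # flag them for review rather than silently deprioritise them.
    return sorted(
        alerts,
        key=lambda alert: (criticality.get(alert.system, 0), alert.severity),
        reverse=True,
    )
```

With this ordering, a moderate alert on a payments database is handled before a severe alert on a test server, which is the business context that raw SIEM severity alone does not provide.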
Future-Proofing Risk Management: Balancing Updates, Offline Recovery, and Key Security
Some key considerations for future risk management and business resilience, based on the CrowdStrike incident:
1. Managing Automatic Updates
While disabling all automatic updates is neither practical nor advisable, adopting a balanced approach to patch management is crucial. Consider implementing a hold-off period of 5 to 10 days for non-critical updates. This strategy allows time for other users to identify potential issues and for vendors to address any problems before the updates impact your organisation. Prioritise and apply patches for zero-day vulnerabilities or severe issues as soon as possible to ensure timely protection.
2. Enhancing Recovery with Offline Machines
Incorporate offline PCs that are updated only at predefined intervals into your recovery strategy. These off-grid machines provide several benefits:
- Secure Operations: Ensure a safe environment for sensitive activities like banking and IT access during outages.
- Known Clean Systems: Maintain a clean, unaffected PC for critical tasks.
- Troubleshooting Aid: Use offline machines to access resources and troubleshoot issues in the event of a core infrastructure failure.
3. Securing Key Information
During the CrowdStrike incident, the availability of BitLocker keys was essential for recovery. To avoid similar challenges in the future, store crucial encryption keys and other essential information in a secondary, secure location. Ideally, this should be on separate infrastructure to enhance resilience and ensure that you can access these keys when needed for recovery operations.
By implementing these considerations, organisations can improve their risk management strategies and enhance their overall business resilience.
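The hold-off approach described under Managing Automatic Updates can be sketched as a simple scheduling rule. This is a minimal illustration rather than a feature of any patch management product: the 7-day default is an assumed value within the 5 to 10 day range, and the function names are hypothetical.

```python
from datetime import date, timedelta

HOLD_OFF_DAYS = 7  # assumed value within the suggested 5 to 10 day range


def earliest_install_date(released: date, critical: bool,
                          hold_off_days: int = HOLD_OFF_DAYS) -> date:
    """Critical patches may be installed on release; others wait out the hold-off."""
    if critical:
        return released
    return released + timedelta(days=hold_off_days)


def is_due(released: date, critical: bool, today: date,
           hold_off_days: int = HOLD_OFF_DAYS) -> bool:
    """Return True once the update has reached its earliest install date."""
    return today >= earliest_install_date(released, critical, hold_off_days)
```

For example, a non-critical update released on 19 July 2024 would not be due until 26 July 2024 under these assumptions, whereas a patch flagged as critical would be due the same day.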
Optimising Business Recovery: Key Considerations for Impact Assessment and Resilience
Understanding the business impact of failures is important for ensuring organisational resilience. To facilitate better outcomes and faster recovery decisions, businesses must enhance visibility into risks and interdependencies. While some aspects of infrastructure may be challenging to fully mitigate due to cost or complexity, a comprehensive grasp of business impacts enables more effective decision-making and resilience planning.
Effective communication with the board of directors is essential. Providing them with visibility into current risks, along with a register of acceptable risks, ensures informed oversight and supports a secure operating environment.
The CrowdStrike outage serves as a stark reminder of the interconnectedness of modern business infrastructures and the importance of thorough planning, risk assessment, and mitigation strategies to maintain business continuity and security.
About the Author
With over 25 years of experience, Peter Davies has been a pivotal figure in advancing business resilience across various organisations. His expertise includes developing risk awareness cultures, establishing global Security Operations Centres, and implementing specialist technologies to manage travellers, workforce, and business assets in challenging environments.
As Chief Information Security Officer at Solace Global, Peter oversees the product management of the Solace Secure platform, which supports traveller safety and enterprise resilience.
Peter Davies, MSc Cyber Security & Human Factors
Chief Information Security Officer & Product Manager
Enhancing resilience and business continuity planning.
Solace Global Risk is a leading provider of comprehensive risk management solutions, serving clients globally with a commitment to excellence. With a worldwide presence and a team of seasoned experts, Solace Global Risk empowers organisations to navigate complex risk landscapes with confidence and resilience.