Learning from the CrowdStrike Outage: Enhancing Resilience and Incident Response
Overview:
In the wake of the CrowdStrike outage, businesses around the globe are focusing on restoring business continuity and bolstering their resilience for future incidents. On Friday, July 19, 2024, a faulty content update crashed approximately 8.5 million Windows devices, which displayed the infamous blue screen of death. The outage affected a range of sectors, including hospitals and airlines.
Although less than 1% of all Windows machines were impacted, the outage caused significant disruptions, particularly in healthcare. For instance, Mass General Brigham hospitals and clinics canceled all non-urgent visits on the day of the outage. Other major healthcare providers, such as Memorial Sloan Kettering Cancer Center, Cleveland Clinic, and Mount Sinai, also faced operational challenges.
This incident was not a result of a cyberattack but rather a defective content configuration update to CrowdStrike’s Falcon threat detection platform. According to the company’s preliminary post-incident review, a bug in the content validator allowed the faulty update to pass through validation despite containing errors.
“What we’re hearing is that the recovery is well underway. Most healthcare organizations I’ve been talking to are back up and running,” said David Finn, Executive Vice President of Governance, Risk, and Compliance at First Health Advisory, in an interview with TechTarget Editorial.
“The scope was much smaller than some of the other issues we’ve seen in the recent past in healthcare, but the response was healthy. Still, I think there are a lot of lessons learned.”
Health IT security experts suggest that this incident can serve as a valuable learning opportunity for improving future response and recovery strategies.
Planning for the Inevitable
“The bad thing is always going to happen,” Finn stated, drawing on his 40 years of experience in health IT security and privacy. “The trick is to plan for it, be prepared, and ensure your ability to recover and remain resilient.”
Whether it’s a large-scale cyberattack, like the one at Change Healthcare in February 2024, or a global IT outage without malicious origins, healthcare organizations of all sizes must be ready to respond to a variety of incidents that could disrupt critical systems.
Finn emphasized the importance of proactive due diligence and thorough incident response planning, particularly in identifying and addressing single points of failure. Preparing for potential operational challenges in advance can make all the difference when an incident actually occurs.
“We have to change the way we think about deploying this stuff,” Finn added. “Software, fortunately or not, is written by human beings, and human beings will always make mistakes. It’s our job to protect against those kinds of mistakes.”
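The single-point-of-failure review Finn describes can be sketched in a few lines of Python. This is a minimal illustration, not a method from the article: the inventory, the system and component names, and the `single_points_of_failure` helper are all hypothetical, and a real review would draw on an organization's own asset and vendor records.

```python
from collections import defaultdict

# Hypothetical inventory: each critical system mapped to the components it needs.
# All names are illustrative and do not describe any real organization.
DEPENDENCIES = {
    "ehr": {"endpoint-agent", "windows-fleet", "datacenter-a"},
    "scheduling": {"endpoint-agent", "windows-fleet", "datacenter-b"},
    "lab-results": {"endpoint-agent", "linux-fleet", "datacenter-a"},
}


def single_points_of_failure(dependencies, threshold=1.0):
    """Return components relied on by at least `threshold` (a fraction) of systems.

    The result maps each flagged component to the sorted list of systems
    that would be affected if that component failed.
    """
    # Invert the inventory: component -> set of systems that depend on it.
    dependents = defaultdict(set)
    for system, components in dependencies.items():
        for component in components:
            dependents[component].add(system)

    total = len(dependencies)
    return {
        component: sorted(systems)
        for component, systems in dependents.items()
        if len(systems) / total >= threshold
    }


if __name__ == "__main__":
    # With the default threshold, only components shared by every system are flagged.
    for component, systems in single_points_of_failure(DEPENDENCIES).items():
        print(f"{component}: failure would affect {', '.join(systems)}")
```

Lowering `threshold` (for example to 0.6) widens the list to components shared by most, rather than all, critical systems, which can help prioritize where redundancy or manual fallback procedures are worth planning.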
The Importance of Resilience
Cyber-resilience is what allows an organization to recover quickly and restore operations. Accepting that incidents like the CrowdStrike outage are bound to occur lets organizations focus on building the resilience and redundancy needed to manage them, a point Finn stressed.
“I still trust CrowdStrike, but that trust doesn’t mean they’re going to be perfect every time,” Finn noted.
Healthcare organizations responded quickly to the incident, despite the disruptions it caused. For instance, Mass General Brigham activated its incident command to manage its response, keeping clinics and emergency departments open for urgent cases. By Monday, July 22, it had resumed scheduled appointments and procedures.
According to Erik Weinick, co-head of the privacy and cybersecurity practice at New York-based law firm Otterbourg, the CrowdStrike incident underscores the need for organizations to reassess their legal and technical risk protocols.
“Although initial reports indicate that the incident was an accident, not an attack, organizations should use this incident as motivation to conduct information audits, penetration testing, update system mapping and software, including security patches, and remind users about best security practices like multifactor authentication and frequently changing difficult-to-guess passwords,” Weinick said.
Essentially, organizations can leverage incidents like the CrowdStrike outage to strengthen their risk management strategies and enhance their cyber-resilience.
Third-Party Risk Management Challenges
Even with strict security controls in place, organizations are still vulnerable to risks from third-party vendors. As the interconnectedness of healthcare systems grows, so does the potential for third-party risks.
The global IT outage highlighted both the importance of third-party risk management and the challenges that come with it. In 2022 and 2023, some of the largest healthcare data breaches were caused by third-party vendors.
“People probably did a lot of risk analysis around CrowdStrike, but I’ll bet no one ever asked what tools they use to produce their software,” Finn speculated.
“Until we get standards in place for software development and certifications for software sold to critical infrastructure sectors, we’re going to have to dig a little deeper.”
In response to the incident, CrowdStrike announced plans to enhance its software resilience and testing processes, including adding more validation checks to its Content Validator for Rapid Response Content to prevent the deployment of faulty content.
The company also plans to conduct multiple independent third-party security code reviews to prevent similar incidents in the future.
“On the legal front, organizations should review their vendor agreements to understand their obligations regarding privacy and data security, who their partners are working with, and what limitations exist on liability for incidents like the CrowdStrike outage,” Weinick advised.
He also recommended checking business disruption insurance coverage and conducting tabletop exercises to rehearse business continuity and recovery procedures in the event of a systems outage.
Key Takeaways
The CrowdStrike outage reinforced essential IT and security considerations for organizations worldwide, particularly in the areas of resilience, third-party risk management, and incident response and recovery. By learning from this event, organizations can better prepare for future challenges and improve their overall cyber-resilience.