Computer systems across the globe were still recovering this week from a massive meltdown Friday morning that spread rapidly, affecting hospitals, airlines, banks, emergency services and other organizations. Unlike other major outages over the past couple of decades, Friday’s chaos didn’t originate from an outside cyberattack. Rather, the call came from inside the house: a faulty Windows software update pushed by cybersecurity provider CrowdStrike.
Microsoft estimated that nearly 9 million Windows computers were disabled by the outage, which did not affect CrowdStrike customers using Mac or Linux operating systems. By some estimates, the CrowdStrike incident was the largest-ever IT systems outage.
By Monday, while some CrowdStrike customers had restored service, the company warned of hackers attempting to exploit the failed software update by distributing a malicious download disguised as a fix. Meanwhile, airports were strained by long lines and healthcare services worked to sort out disrupted records systems. The Wall Street Journal reported that in the UK, many doctors’ offices and pharmacies had reverted to old-school methods of scheduling appointments and filling prescriptions: pen and paper.
What caused the outage?
CrowdStrike acknowledged that a software update it pushed to one of its main services, a cybersecurity platform called Falcon, had caused the disruption to customers using Microsoft’s Windows operating system. A flaw in the update caused many Windows PCs to display the dreaded “blue screen of death,” rebooting in a continuous loop.
Several factors have been blamed for the flaw, including the company not sufficiently testing its update. CrowdStrike also did not roll the update out incrementally but rather pushed it to all its Windows customers at once, a decision that has been widely criticized. The timing of the update, coming on a Friday, has also drawn criticism, as it raised the risk of problems going unnoticed over the weekend in many organizations.
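To illustrate what an incremental rollout means in practice, here is a minimal Python sketch. It is not CrowdStrike's process or any vendor's actual tooling; the function name `staged_rollout`, the stage fractions, the failure threshold and the `is_healthy` check are all hypothetical, chosen only to show the idea of updating a small slice of machines first and halting if too many of them fail.

```python
def staged_rollout(hosts, stages, is_healthy, max_failure_rate=0.01):
    """Push an update to growing slices of hosts, halting on failures.

    hosts: list of host identifiers (hypothetical names).
    stages: increasing cumulative fractions, e.g. [0.01, 0.1, 0.5, 1.0].
    is_healthy: callable(host) -> bool, checked after a host is updated
        (an assumed stand-in for a real post-update health check).
    Returns (updated_hosts, completed).
    """
    updated = []
    for fraction in stages:
        # Cumulative number of hosts that should be updated by this stage.
        target = round(len(hosts) * fraction)
        batch = hosts[len(updated):target]
        if not batch:
            continue
        # "Update" the batch, then count how many hosts fail the health check.
        failures = sum(1 for host in batch if not is_healthy(host))
        updated.extend(batch)
        if failures / len(batch) > max_failure_rate:
            # Stop before the faulty update reaches the remaining hosts.
            return updated, False
    return updated, True


if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(1000)]
    stages = [0.01, 0.1, 0.5, 1.0]

    # A faulty update that breaks every machine it touches.
    touched, completed = staged_rollout(fleet, stages, lambda host: False)
    print(len(touched), completed)
```

Under these assumptions, a faulty update is stopped after the first slice of machines rather than reaching the whole fleet, which is the protection critics say was missing on Friday.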
Where Falcon sits in the software hierarchy of the computers on which it’s installed is also a factor in how disruptive the update was, said Yashin Manraj, CEO of software company Pvotal Technologies, as is the distributed nature of the IT workforce.
“[Fixing the problem] requires IT operators, if possible, to connect physically to kiosks or workstations, unmount the disks, decrypt the drives if they still have the decryption keys, erase the faulty sys file and reboot,” Manraj said. “While the fix itself is not complex or technically challenging, it does require a massive in-person workforce to access devices physically, something antithetical to the work-from-home philosophy and remote culture that many startups have been recently promoting. Some of our clients have been unable to find available IT staff to fix the impacted devices at airports, mission control centers or even their own support or ancillary staff devices that could work remotely.”
A test of resilience
Some observers expect the CrowdStrike outage to lead organizations large and small to reconsider their reliance on a single vendor, while others suggest companies should strengthen their risk management procedures to ensure they aren’t exposed to such a threat.
Michael J. Davern, professor of accounting and business information systems at the University of Melbourne, wrote for The Conversation: “… the risk of an outage like Friday’s should have been on the risk register of the affected organisations. We can choose our risk appetite and accordingly invest in risk treatments to keep the identified risks within it.”
Others pointed to business continuity planning that involves extensive training and testing.
“[Business continuity plans] should include training and education programs for team members, and senior management is essential to ensure everyone is prepared to respond effectively if and when the next outage occurs,” said Skillsoft’s chief product and technology officer, Apratim “AP” Purakayastha. “This incident serves as a stark warning for companies not to overlook the potential for similar internal disruptions in their cyber and IT planning. Proactive and thorough planning is essential to mitigate risks and ensure operational resilience.”