Lessons from the CrowdStrike Meltdown
Industry leaders reflect on system vulnerability and the need for redundancy
July 29, 2024
Wikipedia now has a dedicated page explaining the CrowdStrike IT outage that sent many Windows PCs crashing on July 19, 2024. According to CrowdStrike, “As part of regular operations, CrowdStrike released a content configuration update for the Windows sensor to gather telemetry on possible novel threat techniques. These updates are a regular part of the dynamic protection mechanisms of the Falcon platform. The problematic Rapid Response Content configuration update resulted in a Windows system crash.”
The irony is that many enterprises rely on CrowdStrike to safeguard their data and systems. Instead, it became the source of a worldwide IT meltdown that cost Fortune 500 companies roughly $5.4B, according to The Guardian. Many agreed that the event is a wake-up call for system resilience, but in what form?
Design and Test Your Systems like Airplanes
Todd Tuthill, Vice President of Aerospace and Defense, Siemens Digital Industries Software, says the meltdown echoes the disruption the world experienced four years ago during the COVID shutdown. Somehow, certain industries still seemed unprepared. “Disaster planning needs to be a priority and should be a part of any high-level planning for worst-case scenarios,” he says. “By and large, the airlines affected by the CrowdStrike IT incident recovered quickly, which inclines me to think that the answer may be to design systems to operate with limited data access for brief (or not so brief) windows while critical infrastructure is repaired.”
Most airlines recovered within a couple of days, but Delta Air Lines' recovery stretched into weeks, prompting the U.S. Department of Transportation to launch an investigation.
Capt. Sully Sullenberger, known for successfully landing US Airways Flight 1549 on the Hudson River after a bird strike on January 15, 2009, writes in a LinkedIn blog post, “Our systems should be designed more like airplanes, by avoiding single points of failure, having multiple ways of keeping critical functions operational, and with humans in the loop and in control of it all.”
Tuthill says, “The strenuous testing an aircraft needs to pass before being cleared for flight is a great example for manufacturing firms before signing off on any system that impacts daily lives to keep critical functions operational.”
Third-Party Enterprise Security Software in the Crosshairs
During the COVID shutdown in 2020, many Asian suppliers went offline first, forcing manufacturers to scramble for alternative sources. Many of them turned to on-demand manufacturing service providers like Xometry and Fictiv to make up the lost capacity and meet their regular demand.
Looking at the CrowdStrike crisis, Matt Leibel, Chief Technology Officer of on-demand manufacturing service provider Xometry, says, “In a world that is growing ever more connected in the cloud and in the integration and use of 3rd party platforms, the events underscore the need for all organizations, especially manufacturers, to have a comprehensive continuity plan in place. Manufacturers need to test their continuity plan regularly and update it as necessary. They also need to have redundancies in place to communicate in real-time with key stakeholders, from employees to partners and especially customers.”
Jim Ruga, CTO of Fictiv, says, “Always maintain a staging/testing layer between third-party software and production systems. Never assume that updates from a vendor are flawless. In the case of CrowdStrike, if the update had been tested in a staging environment, the IT departments would have detected the issue before it reached production systems. This incident has raised questions about the reliability of third-party updates and the responsibility of IT departments in managing such updates.”
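One way to picture the staging layer Ruga describes is a small gate that applies a vendor update to a few staging machines first and promotes it to production only if every one of them stays healthy. The Python sketch below is illustrative only: `Host`, `apply_update`, and `staged_rollout` are hypothetical names, not part of any CrowdStrike or Fictiv tooling, and the install and health check are simulated.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Host:
    """Hypothetical stand-in for a managed machine."""
    name: str
    applied: List[str] = field(default_factory=list)
    healthy: bool = True

    def apply_update(self, update_id: str, is_bad: Callable[[str], bool]) -> None:
        # Simulated install: a faulty update leaves the host unhealthy.
        self.applied.append(update_id)
        if is_bad(update_id):
            self.healthy = False


def staged_rollout(
    update_id: str,
    staging: List[Host],
    production: List[Host],
    is_bad: Callable[[str], bool],
) -> bool:
    """Apply the update to staging first; promote only if staging stays healthy.

    Returns True if the update reached production, False if it was blocked.
    """
    for host in staging:
        host.apply_update(update_id, is_bad)

    if not all(host.healthy for host in staging):
        return False  # Blocked: production is never touched.

    for host in production:
        host.apply_update(update_id, is_bad)
    return True
```

In this sketch a faulty update takes down only the staging hosts, which is the trade-off a staging layer makes: a small, disposable set of machines absorbs the failure so production does not.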
The Cloud Paradox
In many instances, a faulty update can be reverted, or its impact minimized, with a subsequent corrective update, often released and installed via the cloud. In other words, the cloud can be both a source of threat and a means of recovery. In the case of the CrowdStrike bug, the update disabled the physical machine itself, leaving many Windows users unable to reboot or reach the cloud to receive the fix.
Ruga says, “The solution for most Windows systems was straightforward—boot into safe mode, remove one file, restart, and done. However, the fact that IT departments had to manually log into systems one at a time significantly delayed the recovery process, highlighting the urgent need for a quicker response to such system failures.”
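The "remove one file" step Ruga mentions can be sketched in a few lines. The Python example below deletes files matching a pattern from a directory. The default `C-00000291*.sys` pattern follows CrowdStrike's publicly posted workaround, which pointed to the `C:\Windows\System32\drivers\CrowdStrike` directory; the function name `remove_faulty_channel_files` and the `dry_run` flag are hypothetical. This is a sketch under those assumptions, not the procedure IT departments actually ran, which was typically carried out by hand from Safe Mode or the Windows Recovery Environment.

```python
from pathlib import Path
from typing import List


def remove_faulty_channel_files(
    directory: Path,
    pattern: str = "C-00000291*.sys",  # Pattern from the public workaround.
    dry_run: bool = True,
) -> List[Path]:
    """Find files matching `pattern` in `directory` and optionally delete them.

    With dry_run=True (the default) nothing is deleted; the matches are
    only reported so an operator can review them first.
    """
    matches = sorted(directory.glob(pattern))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

A script like this only helps once someone can reach the machine's file system, which is the bottleneck Ruga points to: each crashed system had to be accessed individually before the file could be removed.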
Tuthill says, “The CrowdStrike issue showed a fundamental weakness in many companies' strategy of moving everything to the cloud, and exposed a liability in a part of modern society—our ability to move data securely and quickly.”
John McEleney, Cofounder of the cloud-based CAD firm Onshape (now part of PTC), says the CrowdStrike issue didn't affect Onshape users. He says, “Cloud-native has the advantage over older installed software. According to our head of development, our use of cloud (virtual infrastructure) allows us to access and replace our servers at any point in time from any location. Companies spent many days just getting access to the machines they needed to restart.”
About the Author
Kenneth Wong is Digital Engineering’s resident blogger and senior editor. Email him at kennethwong@digitaleng.news or share your thoughts on this article at digitaleng.news/facebook.