Exploring the Factors Behind the Historic CrowdStrike Windows Crisis - Insights

[Image: Blue screens of death. Credit: Harun Ozalp/Anadolu via Getty Images]

[Updated 24-July with details from CrowdStrike’s preliminary post-incident review]

Microsoft Windows powers more than a billion PCs and millions of servers worldwide, many of them playing key roles in facilities that serve customers directly. So, what happens when a trusted software provider delivers an update that causes those PCs to immediately stop working?

As of July 19, 2024, we know the answer to that question: Chaos ensues.

In this case, the trusted software developer is a firm called CrowdStrike Holdings, whose previous claim to fame was being the security firm that analyzed the 2016 hack of servers owned by the Democratic National Committee. That’s just a quaint memory now, as the firm will forever be known as The Company That Caused The Largest IT Outage In History. It grounded airplanes, cut off access to some banking systems, disrupted major healthcare networks, and threw at least one news network off the air.

Microsoft estimates that the CrowdStrike update affected 8.5 million Windows devices. That’s a tiny percentage of the worldwide installed base, but as David Weston, Microsoft’s Vice President for Enterprise and OS Security, notes, “the broad economic and societal impacts reflect the use of CrowdStrike by enterprises that run many critical services.” According to a Reuters report, “Over half of Fortune 500 companies and many government bodies such as the top US cybersecurity agency itself, the Cybersecurity and Infrastructure Security Agency, use the company’s software.”

What happened?

CrowdStrike, which sells security software designed to keep systems safe from external attacks, pushed a faulty “sensor configuration update” to the millions of PCs worldwide running its Falcon Sensor software. That update was, according to CrowdStrike, a “Channel File” whose function was to identify newly observed, malicious activity by cyberattackers.

Although the update file had a .sys extension, it was not itself a kernel driver. It communicated with other components of the Falcon sensor that run in the same space as the Windows kernel, the most privileged level on a Windows PC, where they interact directly with memory and hardware. CrowdStrike says a “logic error” in that code caused Windows PCs and servers to crash within seconds after they booted up, displaying a STOP error, more colloquially known as the Blue Screen of Death (BSOD).

In a Preliminary Post Incident Review posted on its website July 24, CrowdStrike confirmed some details about the incident that had previously been reported and added a few more. The code that failed was part of the Falcon sensor, which runs in the Windows kernel space. Version 7.11 of the sensor was released on February 28, 2024. According to CrowdStrike, this release introduced “a new [InterProcess Communication (IPC)] Template Type to detect novel attack techniques that abuse Named Pipes. This release followed all Sensor Content testing procedures…”

Three additional instances of the IPC Template Type were deployed between April 8 and April 24, without incident. On July 19, the company says, “two additional IPC Template Instances were deployed. Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data.” Those instances were deployed into production. “When received by the sensor and loaded into the Content Interpreter,” the report continues, “problematic content in Channel File 291 resulted in an out-of-bounds memory read triggering an exception. This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSOD).”
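CrowdStrike hasn’t published the interpreter’s code, but the class of failure it describes (a content file whose problematic data drives an out-of-bounds read that the interpreter cannot handle gracefully) can be sketched in a few lines of Python. Everything below, including the file layout, field names, and validator, is hypothetical and only illustrates the kind of bug involved:

```python
# Hypothetical sketch: a content file declares more entries than it actually
# carries, and an interpreter that trusts the declared count reads past the
# end of its data.

def interpret(channel_file: list[int]) -> list[int]:
    declared_count = channel_file[0]   # first field: how many entries follow
    data = channel_file[1:]
    # Trusting declared_count without checking len(data) is the bug:
    return [data[i] for i in range(declared_count)]  # IndexError if count > len(data)

def validate(channel_file: list[int]) -> bool:
    # What a content validator should catch before an update ships:
    return channel_file[0] <= len(channel_file) - 1

good_file = [3, 10, 20, 30]   # declares 3 entries, carries 3
bad_file = [5, 10, 20]        # declares 5 entries, carries only 2

assert validate(good_file) and not validate(bad_file)
```

The crucial difference is where the exception lands. In a user-space program, a failure like this IndexError can be contained; in kernel space, the equivalent out-of-bounds read takes down the whole operating system, which is exactly the BSOD behavior CrowdStrike describes.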

Repairing the damage from a flaw like this is a painfully tedious process that requires manually rebooting every affected PC into the Windows Recovery Environment and then deleting the defective file from the PC using the old-school command line interface. If the PC in question has its system drive protected by Microsoft’s BitLocker encryption software, as virtually all business PCs do, the fix requires one extra step: entering a unique 48-character BitLocker recovery key to gain access to the drive and allow the removal of the faulty CrowdStrike driver.
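For anyone doing that recovery by hand, the widely circulated manual workaround boiled down to a handful of commands once the affected machine was booted into the Windows Recovery Environment’s command prompt. The file name matches the Channel File 291 identified in CrowdStrike’s review; the exact drive letter can vary, and a BitLocker-protected drive must be unlocked with its recovery key first:

```cmd
REM From a Command Prompt in the Windows Recovery Environment or Safe Mode
REM (unlock the system drive with the 48-character BitLocker recovery key first)
cd /d C:\Windows\System32\drivers\CrowdStrike
del C-00000291*.sys
REM Then reboot normally
```

Simple enough for one machine; the pain comes from repeating it, in person, on thousands of them.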

If you know anyone whose job involves administering Windows PCs in a corporate network that uses the CrowdStrike code, you can be confident they are very busy right now, and will be for days to come.

We’ve seen this movie before

When I first heard about this catastrophe (and I am not misusing that word, I assure you), I thought it sounded familiar. On Reddit’s r/sysadmin subreddit, user u/externedguy reminded me why. Maybe you remember this story from 14 years ago:

“Defective McAfee update causes worldwide meltdown of XP PCs.”

Oops, they did it again.

At 6AM today, McAfee released an update to its antivirus definitions for corporate customers that had a slight problem. And by “slight problem,” I mean the kind that renders a PC useless until tech support shows up to repair the damage manually. As I commented on Twitter earlier today, I’m not sure any virus writer has ever developed a piece of malware that shut down as many machines as quickly as McAfee did today.

In that case, McAfee had delivered a faulty virus definition (DAT) file to PCs running Windows XP. That file falsely detected a crucial Windows system file, Svchost.exe, as a virus and deleted it. The result, according to a contemporary report, was that “affected systems will enter a reboot loop and [lose] all network access.”

The parallels between that 2010 incident and this year’s CrowdStrike outage are uncanny. At its core was a defective update, pushed to millions of PCs running a powerful software agent, causing the affected devices to stop working. Recovery required manual intervention on every single device. Plus, the flawed code was pushed out by a public security company desperately trying to grow in a brutally competitive marketplace.

The timing was particularly unfortunate for McAfee. Intel had announced its intention to acquire McAfee for $7.68 billion on April 19, 2010. The defective DAT file was released two days later, on April 21.

That 2010 McAfee screw-up was a big deal, kneecapping Fortune 500 companies (including Intel!) as well as universities and government/military deployments worldwide. It knocked 10% of the cash registers at Australia’s largest grocery chain offline, forcing the closure of 14 to 18 stores.

In the You Can’t Make This Up Department… CrowdStrike’s founder and CEO, George Kurtz, was McAfee’s Chief Technology Officer during that 2010 incident.

What makes the 2024 sequel so much worse is that it also affected Windows-based servers running in the cloud, on Microsoft Azure and on AWS. Just as with the many laptops and desktop PCs that were bricked by this faulty update, the cloud-based servers require time-consuming manual interventions to recover.

CrowdStrike’s QA failed

Surprisingly, this isn’t CrowdStrike’s first faulty Falcon sensor update this year.

Less than a month earlier, according to a report from The Stack, CrowdStrike released a detection logic update for the Falcon sensor that exposed a bug in the sensor’s Memory Scanning feature. “The result of the bug,” CrowdStrike wrote in a customer advisory, “is a logic error in the CsFalconService that can cause the Falcon sensor for Windows to consume 100% of a single CPU core.” The company rolled back the update, and customers were able to resume normal operations by rebooting.

At the time, computer security expert Will Thomas noted on X/Twitter, “[T]his just goes to show how important it is to download new updates to one machine to test it first before rolling out to the whole fleet!”

In that 2010 incident, the root cause turned out to be a complete breakdown of the QA process. It seems self-evident that a similar failure in QA is at work here. Were these two CrowdStrike updates not tested before they were pushed out to millions of devices?

Part of the problem might be a company culture that’s long on tough talk. In the most recent CrowdStrike earnings call, CEO George Kurtz boasted about the company’s ability to “ship game-changing products at a rapid pace,” taking special aim at Microsoft:

And more recently, following yet another major Microsoft breach and CISA’s Cyber Safety Review Board’s findings, we received an outpouring of requests from the market for help. We decided enough is enough, there’s a widespread crisis of confidence among security and IT teams within the Microsoft security customer base.

[…]

Feedback has been overwhelmingly positive. CISOs now have the ability to reduce monoculture risk from only using Microsoft products and cloud services. Our innovation continues at breakneck pace, multiplying the reasons for the market to consolidate on Falcon. Thousands of organizations are consolidating on the Falcon platform.

Given recent events, some of those customers might be wondering whether that “breakneck pace” is part of the problem.

As part of its initial response, CrowdStrike says it plans to take additional measures to improve “software resiliency and testing.” More importantly, it plans to implement a “staggered deployment strategy … in which updates are gradually deployed to larger portions of the sensor base, starting with a canary deployment.” The company also committed to provide customers with “greater control over the delivery of Rapid Response Content updates by allowing granular selection of when and where these updates are deployed.”
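CrowdStrike hasn’t described the mechanics of that rollout plan, but the general shape of a staggered, canary-first deployment is easy to sketch. The ring sizes and the health check below are invented for illustration, not taken from CrowdStrike’s design:

```python
# Hypothetical sketch of a staggered rollout: push an update to progressively
# larger fractions of the fleet, starting with a small canary ring, and halt
# the moment any updated host fails its health check.

RINGS = [0.001, 0.01, 0.1, 1.0]  # fraction of the fleet per stage (invented)

def staged_rollout(fleet_size: int, healthy_after_update) -> int:
    """Return how many hosts received the update before a halt (or all of them)."""
    deployed = 0
    for fraction in RINGS:
        target = int(fleet_size * fraction)
        while deployed < target:
            host = deployed
            deployed += 1                        # this host receives the update
            if not healthy_after_update(host):   # check before widening the blast radius
                return deployed                  # halt the rollout here
    return deployed

# A defect that crashes every host is caught by the very first canary:
assert staged_rollout(1_000_000, lambda host: False) == 1
# A clean update reaches the whole fleet:
assert staged_rollout(1_000_000, lambda host: True) == 1_000_000
```

The point of the canary ring is blast radius: a defect like the one in Channel File 291 would have crashed a handful of canary machines instead of 8.5 million.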

Meanwhile, the United States House of Representatives Homeland Security Committee plans to call CrowdStrike’s CEO to testify at hearings on what went wrong, and CrowdStrike’s Chief Security Officer, Shawn Henry, posted an apology on LinkedIn, admitting, “On Friday, we failed you. … The confidence we built in drips over the years was lost in buckets within hours, and it was a gut punch.”

How much fault should Microsoft shoulder?

It’s impossible to let Microsoft completely off the hook. After all, the Falcon sensor problems were unique to Windows PCs, as admins of Linux and Mac-focused shops were quick to remind us.

Partly, that’s an architectural issue. Developers of system-level apps for Windows, including security software, have historically implemented their features using kernel extensions and drivers. As this example illustrates, faulty code running in the kernel space can crash the entire operating system, whereas a crash in user space typically takes down only the offending process.

That used to be the case with macOS as well, but in 2020, with macOS 11, Apple changed the architecture of its flagship OS to strongly discourage the use of kernel extensions. Instead, developers are urged to write system extensions that run in user space rather than at the kernel level. On macOS, CrowdStrike uses Apple’s Endpoint Security Framework and says that with this design, “Falcon achieves the same levels of visibility, detection, and protection exclusively via a user space sensor.”

Could Microsoft make the same sort of change for Windows? Perhaps, but doing so would certainly bring down the wrath of antitrust regulators, especially in Europe. The problem is especially acute because Microsoft has a lucrative enterprise security business, and any architectural change that makes life more difficult for competitors like CrowdStrike would be rightly seen as anticompetitive.

Indeed, a Microsoft spokesperson told the Wall Street Journal that it can’t follow Apple’s lead because of antitrust concerns. According to the WSJ report, “In 2009, Microsoft agreed it would give makers of security software the same level of access to Windows that Microsoft gets.” That concern might be open for debate, but given Microsoft’s history with EU regulators, it’s understandable why the company hasn’t wanted to get tangled up in that argument.

Microsoft currently offers APIs for Microsoft Defender for Endpoint, but competitors aren’t likely to use them. They’d much rather argue that their software is superior, and using the “inferior” offering from Microsoft would be hard to explain to customers.

Nonetheless, this incident, which caused many billions of dollars’ worth of damage, should be a wake-up call for the entire IT community. At a minimum, CrowdStrike needs to step up its testing game, and customers need to be more cautious about allowing this sort of code to deploy on their networks without testing it themselves.


  • Title: Exploring the Factors Behind the Historic CrowdStrike Windows Crisis - Insights
  • Author: James
  • Created at: 2024-10-18 08:29:21
  • Updated at: 2024-10-25 03:43:06
  • Link: https://technical-tips.techidaily.com/exploring-the-factors-behind-the-historic-crowdstrike-windows-crisis-insights/
  • License: This work is licensed under CC BY-NC-SA 4.0.