A simple coding mistake led to the CrowdStrike outage? Well, this is not surprising!

By Miltiadis Siavvas, Information Technologies Institute (ITI) of the Centre for Research and Technology-Hellas (CERTH)

The CrowdStrike disruption is an infamous incident that led to a global IT outage in July 2024. It is regarded as one of the worst IT disruptions in history, described by analysts as “the largest outage in the history of information technology and historic in scale”. The incident caused 8.5 million Microsoft Windows devices to crash and fail to reboot properly, bringing airlines, banks, and businesses to a halt. In particular, it led to major service disruptions: hundreds of flights were cancelled, emergency call center services in several U.S. states such as Alaska, Indiana, and Ohio stopped working, and the London Stock Exchange delayed trading for 20 minutes, among others. Early estimates suggest that the disruption caused global economic losses of several billion dollars, affecting hundreds of organizations across multiple sectors.

What caused the CrowdStrike outage?

On the 19th of July 2024, CrowdStrike, a leading American cybersecurity company, released a faulty update to its Falcon Sensor security software, a suite that protects computers against cyberattacks, causing widespread failures on Microsoft Windows systems. The Falcon Sensor operates deep within the operating system, at the kernel level, with the highest privileges, in order to detect and prevent threats in real time.

According to the external Root-Cause Analysis (RCA) report of the CrowdStrike outage, the major outage was caused by a simple coding mistake, an “out-of-bounds memory read”. Surprisingly, this is a very simple programming mistake that could have been avoided by following standard security best practices, such as adding a simple array bounds check. Yes, a simple if/else statement!

To better illustrate how this simple coding mistake led to one of the greatest IT outages in history, and since we do not have access to Falcon’s actual source code, as it is a commercial and proprietary product, a representative example is provided below.

Figure 1: An example illustrating the out-of-bounds read vulnerability that caused the CrowdStrike outage.

The left snippet shows the vulnerable implementation, which lacks an array bounds check and can lead to a system crash, while the right one adds a simple array bounds check before the iteration, effectively preventing the out-of-bounds access and ensuring safe execution.

On the left side of the above figure, we can see the flawed code snippet, whereas the right side shows the fixed version (the fix is marked in red). As the figure shows, the main issue is that the program attempts to read values from a given array (i.e., the inputArray) without checking its length before accessing its elements. In the actual CrowdStrike case, the program attempted to read 21 values from an input array whose length was 20. When the program attempted to read the 21st value, it resulted in an “out of bounds read”. This out-of-bounds read triggered a system crash because the resulting exception was not handled gracefully.

This issue could have been easily avoided by adding a bounds check just before reading data from the array, verifying that the array is large enough to accommodate the number of values the program intends to read (as shown in the right part of the figure). More specifically, in the fixed version shown on the right side of the figure, the program initially checks whether the number of values that need to be read (i.e., the value of the templateArraySize) is less than or equal to the size of the inputArray and proceeds with the for loop only if this condition is met. Additionally, in case the condition is not satisfied, the event could be logged with a specific error code, in order to inform the developers about this issue.
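
The check described above can be sketched in Python as follows. This is an illustrative sketch only, not CrowdStrike's code: the function names, the logger name, and the error code E_FIELD_COUNT_MISMATCH are hypothetical, and only inputArray and templateArraySize (written here in Python naming style) come from the example discussed in the text. Note that Python raises an IndexError on an out-of-range index, whereas a kernel driver written in C or C++ would read invalid memory instead; the guard itself has the same shape in both cases.

```python
import logging
from typing import List, Optional

logger = logging.getLogger("sensor_sketch")  # hypothetical logger name


def read_values_unchecked(input_array: List[str], template_array_size: int) -> List[str]:
    """Flawed version: trusts template_array_size and never checks the input length."""
    values = []
    for i in range(template_array_size):
        # Raises IndexError as soon as i reaches len(input_array).
        values.append(input_array[i])
    return values


def read_values_checked(input_array: List[str], template_array_size: int) -> Optional[List[str]]:
    """Fixed version: verifies the array is large enough before the loop runs."""
    if not 0 <= template_array_size <= len(input_array):
        # Hypothetical error code, logged so that developers learn about the mismatch.
        logger.error(
            "E_FIELD_COUNT_MISMATCH: expected %d values, got %d",
            template_array_size,
            len(input_array),
        )
        return None
    return [input_array[i] for i in range(template_array_size)]
```

With a 20-element input and a template that expects 21 values, read_values_unchecked fails on the 21st read, while read_values_checked logs the mismatch and returns None, so the caller can skip the faulty content instead of crashing.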

Beyond the development itself, the existence of such a simple coding issue indicates weaknesses in the broader quality assurance procedure as well. With proper testing (including well-defined test cases for each code change, proper regression testing, etc.), this issue could have been detected during the testing phase and eliminated prior to the release of the software update. According to CrowdStrike’s RCA report, unit testing covered only the correct path, while manual testing focused only on valid data. Furthermore, the QA team did not conduct the regression testing that is normally carried out before each new release.
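
The gap in test coverage can be illustrated with a few unit tests that exercise the invalid path as well as the valid one. The sketch below uses Python's standard unittest module; the function under test and the test names are hypothetical and only mirror the bounds check discussed earlier.

```python
import unittest
from typing import List, Optional


def read_values_checked(input_array: List[str], template_array_size: int) -> Optional[List[str]]:
    """Hypothetical function under test: returns None when the input is too short."""
    if not 0 <= template_array_size <= len(input_array):
        return None
    return input_array[:template_array_size]


class ReadValuesTests(unittest.TestCase):
    def test_valid_input_is_read_completely(self):
        # The "correct path": the input holds exactly as many values as expected.
        data = [str(i) for i in range(21)]
        self.assertEqual(read_values_checked(data, 21), data)

    def test_input_one_value_short_is_rejected(self):
        # The mismatch from the incident: 21 values expected, only 20 supplied.
        data = [str(i) for i in range(20)]
        self.assertIsNone(read_values_checked(data, 21))

    def test_empty_input_is_rejected(self):
        # Degenerate invalid input: nothing supplied at all.
        self.assertIsNone(read_values_checked([], 21))
```

Saved to a file, these tests can be run with python -m unittest. Kept in a regression suite, the second and third tests would fail if a later change removed the bounds check, flagging the problem before release.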

Why did this “out-of-bounds read” lead to a “Blue Screen of Death”?

As already stated, the Falcon software operates directly at the operating system’s kernel level, with access to critical OS functions, in order to provide deep protection against cyberattacks in real time. In particular, it runs as a kernel-mode driver in Ring 0 of the Microsoft Windows OS, and not via a dedicated application programming interface (API), in order to gain elevated privileges and access to core system resources. However, any crash in this area triggers a Blue Screen of Death (BSOD), which forces the operating system to stop immediately, a built-in safeguard that, in this case, magnified the impact of the incident.

It is worth mentioning that Microsoft attributed part of the problem to a 2009 antitrust agreement with the European Union, which required the company to provide third-party developers with low-level kernel access. Microsoft stated that if it had been able to restrict direct kernel access for vendors like CrowdStrike, the outage could have been entirely prevented, or at least its impact could have been significantly reduced. (https://www.theverge.com/2024/7/23/24204196/crowdstrike-windows-bsod-faulty-update-microsoft-responses)

Are developers the ones to blame?

Though debatable, in my opinion, the answer to this question is “No”, or at least “not only them”. Although the mistake was indeed made by developers and stems from a lack of technical and especially security expertise, they are not solely to blame.

It is unrealistic to expect developers to remember hundreds or even thousands of vulnerability patterns that they need to avoid and code-level countermeasures that they need to implement. In addition, strict production deadlines often force developers to neglect the quality and security of the code that they write, in favor of delivering the promised functionality at the agreed time. Therefore, while the mistakes are indeed made by developers, placing the blame solely on them is neither fair nor realistic.

What should we do?

There are three key ways to alleviate this issue:

First of all, organizations need to adopt a Security-by-Design philosophy throughout their development process, thereby making security a driving factor of the overall development, instead of an “afterthought” or an “add-on feature”.
Secondly, they should invest in proper training of their developers on security awareness and especially on best coding practices. Even simple coding practices and techniques can significantly strengthen the security of the software they produce.
Finally, development teams need to be equipped with the right mechanisms and tools (e.g., static code analyzers, formal verification techniques, etc.), which could help them detect and fix security issues early enough in the production cycle, long before the software reaches the market.

How could the DOSS Platform help in avoiding such issues?

In our opinion, apart from adopting a security-by-design philosophy that is unquestionably critical for building secure software systems, equipping developers with tools able to encapsulate the required security expertise and help them detect and eliminate common security issues is an important step towards building highly secure software products and applications.

The DOSS platform, through its Component Tester, provides advanced static code analysis techniques along with sophisticated deep-learning-based vulnerability detection models that are able to detect critical vulnerabilities residing in the source code, including buffer-related flaws such as “out-of-bounds write” and “out-of-bounds read”, the latter being the root cause of the CrowdStrike incident.

The ability of these techniques to run headlessly and automatically throughout the development cycle enables their seamless integration into the developers’ workflows and their frequent use during the broader development process. This continuous use enables the early identification and elimination of vulnerabilities long before a software version is released to the public.

Hence, DOSS, through the innovative AI-enabled security testing solutions that it delivers, contributes towards the production of software that is free from critical vulnerabilities, and therefore towards the prevention of critical breaches and disruptions in real-world systems.
