
Are Large Language Models (LLMs) the key to accurate Vulnerability Detection?

By Miltiadis Siavvas, Information Technologies Institute (ITI) of the Centre for Research and Technology-Hellas (CERTH)

As our everyday lives rely increasingly on software-intensive systems, their security becomes an aspect of utmost importance. Hence, there is a strong need for advanced mechanisms that enable the early identification and elimination of software vulnerabilities, before the software is released to the public.

Large Language Models (LLMs) have demonstrated remarkable capabilities in detecting software vulnerabilities, at times even surpassing traditional and well-established vulnerability detection techniques such as static analysis. They have quickly become the core focus of the machine learning- and deep learning-based vulnerability detection research community, outperforming earlier techniques built on more traditional deep learning (DL) models such as LSTMs, which for years were considered the best-performing models.

This naturally raises the question: are LLMs at the stage where they can actually replace traditional vulnerability detection techniques and become the go-to vulnerability detection solution for practitioners?

The present article attempts to shed light on this question by analyzing the current state of the LLM-based vulnerability detection literature. The aim is to identify and report the challenges that these techniques face, and to assess whether these weaknesses undermine their reliability and practicality, thereby hindering their broader adoption in practice.

A brief historical overview

Static analysis security testing (SAST) checks the source code against a set of predefined, deterministic rules that indicate vulnerabilities. Machine learning-based vulnerability detection (also commonly referred to as vulnerability prediction), by contrast, identifies common vulnerability patterns in the source code, which are learned automatically from data during training. Hence, ML-based vulnerability detection promises to capture more complex vulnerability patterns that are either too intricate to be expressed as static analysis rules, or entirely unknown to the security experts and engineers who would otherwise have to model them.

“Contrary to SAST tools, which check the code against a predefined set of vulnerability patterns or rules, ML-based vulnerability detection automatically learns such patterns from data.”

The following table summarizes the evolution of the field, from simple ML models to the more advanced LLM-based techniques available today.

A large number of vulnerability prediction models (VPMs) have been proposed in the related literature over the years, ranging from simple ML models that represent source code via software metrics or simple text mining techniques (e.g., Bag of Words), to more advanced Deep Learning (DL) models, such as Long Short-Term Memory (LSTM) networks, which employ more complex textual representations (e.g., token sequences, word embeddings, text-rich graphs).
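To make the simpler end of this spectrum concrete, a Bag-of-Words representation reduces a code snippet to a multiset of its tokens, which a classifier can then consume as a feature vector. The following is a minimal sketch in Python; the tokenizer and the sample snippet are illustrative assumptions, not part of any specific published VPM:

```python
import re
from collections import Counter

def bag_of_words(source: str) -> Counter:
    """Represent a code snippet as token counts (Bag of Words).

    A real VPM would align these counts to a shared vocabulary
    and feed the resulting vectors into an ML classifier; here
    we only build the raw counts.
    """
    # Identifiers/keywords, or single punctuation characters.
    tokens = re.findall(r"[A-Za-z_]\w*|[^\sA-Za-z_]", source)
    return Counter(tokens)

snippet = "strcpy(buf, input);"
features = bag_of_words(snippet)
print(features["strcpy"])  # -> 1
```

The key limitation, which motivated the move to sequence models and LLMs, is that token order is discarded entirely: `if (ok) free(p);` and `free(p); if (ok)` produce identical feature vectors.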

Text mining-based techniques, and particularly LLMs, have demonstrated the highest predictive performance. As a result, current research endeavors focus almost exclusively on LLMs, which consistently outperform traditional DL-based techniques in recent studies.

What are the challenges of LLM-Based Vulnerability Detection?

Despite the promising detection accuracy they have achieved so far, LLM-based vulnerability detection models face a number of challenges that hinder their practicality. The main challenges that need to be addressed are summarized below:

  • Reduced accuracy on unseen software: The performance of LLMs in vulnerability detection drops significantly when they are applied to software components (e.g., source code files) from completely unknown repositories. This highlights the lack of generalizability of existing models, rendering them impractical, as they typically need to be adapted to the code base on which they will be utilized. In other words, these models cannot be used as off-the-shelf solutions for detecting vulnerabilities in new projects; proper training or retraining is required.
  • Lack of high-quality datasets: Existing vulnerability datasets contain mislabelled samples, which inevitably degrades the performance of the produced vulnerability detection models. However, constructing a high-quality vulnerability dataset is highly challenging, especially given that it must be sufficiently large to be suitable for fine-tuning LLMs on the downstream task of vulnerability detection. Existing datasets also have limited coverage of programming languages and application domains, further limiting the generalizability of the produced models.
  • Reduced context from the source code: LLMs treat source code purely as text (i.e., token sequences), potentially missing semantic and syntactic information that is better captured by other modalities, such as graphs: Abstract Syntax Trees (ASTs), Control-flow Graphs (CFGs), and Data-flow Graphs (DFGs). As a result, more complex vulnerability patterns that cannot easily be expressed as token sequences may be missed by the produced models.
  • Difficulty in reporting the exact location and type of the detected vulnerability: Although LLMs can accurately flag a software component that contains vulnerabilities (e.g., a function, a file, or even a code snippet), they currently cannot reliably report the exact location of the vulnerability (i.e., the exact lines of code where it resides) or its actual type (e.g., Buffer Overflow). This makes them less practical for developers, who need actionable insights for mitigation.
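The "reduced context" point above can be illustrated with Python's built-in `ast` module: parsing a snippet exposes structural facts (a call nested inside a conditional, for instance) that a flat token sequence does not make explicit. This is a minimal sketch; production systems would use language-specific parsers and richer graphs such as CFGs and DFGs, and the sample snippet is purely illustrative:

```python
import ast

def ast_summary(source: str) -> list[str]:
    """Parse a Python snippet and list the types of all AST nodes.

    A graph view like this makes nesting and control-flow structure
    explicit, whereas a token sequence leaves it implicit.
    """
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]

code = "if user_input:\n    eval(user_input)"
summary = ast_summary(code)
# The AST records both the conditional and the (dangerous) call.
print("If" in summary and "Call" in summary)  # -> True
```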

From the above analysis, it is evident that despite the promising results LLMs have demonstrated in vulnerability detection, considerable work remains before they can be considered reliable and practical enough to be widely adopted as part of real-world software development pipelines.

Can these challenges be addressed?

The aforementioned open issues, while challenging, can be addressed, and the research community has already started tackling them along several promising directions. Below is a brief overview of how current research addresses each challenge:

  • Domain adaptation for improved generalizability: Introducing even a small amount of (often unlabelled) data from the target project into the LLM has been observed to considerably increase its cross-project detection accuracy. Although the LLM is not exposed to the vulnerabilities that the target project contains, it “learns” the internal characteristics of the project (e.g., specific frameworks, unique variable names), enabling it to reason better about its code.
  • Data Augmentation and Cleansing Techniques for high-quality vulnerability dataset construction: Advanced cleansing techniques are applied to eliminate mislabelled or noisy samples from existing, widely-used vulnerability datasets, enhancing their quality and reliability. Data augmentation techniques are also adopted to increase dataset size by generating close-to-real synthetic samples with guaranteed labels.
  • Input Enrichment with code graphs for deeper understanding/reasoning of source code semantics: Enriching the input of LLM-based vulnerability detection models with additional context from the source code has been found to further improve their accuracy. Graph-based representations of the source code, such as ASTs, CFGs, and DFGs, are typically examined, as they better capture code syntax and semantics, leading to highly capable LLM-based techniques (e.g., GraphCodeBERT). Another direction in the literature combines LLMs with Graph Neural Networks (GNNs), which are better suited to processing graphical representations of source code.
  • Explainability for localization and type identification: Explainable AI (XAI) methods, such as SHAP and LIME, are used to highlight the specific code lines in which vulnerabilities actually reside, as well as to infer their type. This makes the results more actionable for developers, who can narrow their focus to the exact lines of code containing the security issue and to the mitigation actions relevant to the detected vulnerability type.
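The intuition behind perturbation-based localization can be sketched with a simple occlusion (leave-one-out) attribution: remove one line at a time, re-score the component, and attribute suspicion to lines whose removal changes the prediction most. SHAP and LIME apply the same perturbation idea with more principled weighting. The scoring model below is a hypothetical stand-in for a trained detector, used only to make the sketch runnable:

```python
def occlusion_attribution(lines, score_fn):
    """Score each line by how much removing it lowers the model's
    vulnerability score (a larger drop means a more suspicious line)."""
    base = score_fn("\n".join(lines))
    drops = []
    for i in range(len(lines)):
        reduced = "\n".join(lines[:i] + lines[i + 1:])
        drops.append(base - score_fn(reduced))
    return drops

# Hypothetical stand-in for a trained detector: flags strcpy usage.
def toy_score(code: str) -> float:
    return 1.0 if "strcpy" in code else 0.0

lines = ["int n = read_len();", "strcpy(buf, src);", "return n;"]
print(occlusion_attribution(lines, toy_score))  # -> [0.0, 1.0, 0.0]
```

Only the `strcpy` line's removal flips the toy detector's verdict, so it alone receives a non-zero attribution; a real XAI pipeline would rank lines of a flagged function the same way.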

The Role of the DOSS Project

Within the DOSS project, we are actively working to address these core challenges of LLM-based vulnerability detection models. Our goal is to render these models more practical, reliable, and usable in real-world software development environments. By combining domain adaptation, dataset refinement, explainable AI, and system-level integration, DOSS aims to push LLM-based vulnerability detection closer to a solution that can be actively adopted in practice.
