The DOSS approach for Vulnerability Prediction using Large Language Models (LLMs)

By Miltiadis Siavvas, Information Technologies Institute (ITI) of the Centre for Research and Technology-Hellas (CERTH)

Background and Existing Challenges

Software security is a major concern for software-based systems and their broader supply chain, since the exploitation of a single vulnerability can have far-reaching consequences both for the owning enterprise or manufacturer (ranging from reputational damage to severe financial losses) and for its customers (e.g., leakage of sensitive information). A recently published report[1] revealed that in the first quarter (Q1) of 2023 there were 310 security incidents that accounted for 349,171,305 breached records, representing a 12.7% increase in incidents and a threefold increase in records compared to the previous quarter. This is highly troubling considering the damage that security incidents can cause: according to the IBM Cost of a Data Breach Report 2023[2], the global average cost of a data breach in 2023 was $4.45 million, 15% more than in 2020. Hence, there is a strong need for mechanisms that enable the identification of potential security issues early in the software development cycle, so that they can be eliminated before products are released to the market.

Among the existing mechanisms for identifying and eliminating security issues in software products, vulnerability prediction has recently emerged as a promising solution[3]. Vulnerability prediction focuses on the identification of security hotspots, that is, software components that are likely to contain critical vulnerabilities[4]. This information is highly valuable during software development, as it helps software engineers and project managers allocate their limited testing resources more effectively. For instance, more thorough and advanced dynamic testing (e.g., fuzz testing or penetration testing) can be applied to high-risk components, increasing the likelihood of detecting and fixing actual security issues. Existing research in the field of vulnerability prediction emphasizes the construction of machine learning (ML)-based vulnerability prediction models (VPMs), which take attributes extracted directly from the source code of the analyzed software as input in order to predict whether a given software component is likely to contain a vulnerability.

Several VPMs have been proposed over the years, based chiefly on software metrics, text mining, and static analysis[5]. Among the existing solutions, the text mining-based VPMs have demonstrated the most promising results. Initial attempts in the field of text mining-based VPMs focused on the simple concept of Bag of Words[6], that is, using the tokens (keywords) extracted from the source code along with their frequencies (i.e., the number or ratio of their occurrences in the source code) as inputs. Subsequent attempts used the token sequence itself as the main input, feeding it to sequential models such as the Long Short-Term Memory (LSTM) network[7]. More recently, researchers have started examining whether including more context from the source code leads to improved predictive performance, with specific emphasis on incorporating graph-based representations of the source code, such as Abstract Syntax Trees (ASTs), Control Flow Graphs (CFGs), Data Flow Graphs (DFGs), and Code Property Graphs (CPGs)[8]. These studies suggest that incorporating more context from the source code could lead to more accurate vulnerability prediction.
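To make the difference between the first two input styles concrete, the following minimal Python sketch builds both a Bag of Words representation and a token sequence from the same code snippet. The regular-expression tokenizer, the function names, and the example snippet are assumptions made purely for illustration; the cited studies use their own tokenizers and vocabularies.

```python
import re
from collections import Counter

# Hypothetical tokenizer (an assumption for this sketch): identifiers,
# integer literals, and any other single non-whitespace character.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(source: str) -> list[str]:
    """Split source code into a flat list of tokens."""
    return TOKEN_PATTERN.findall(source)


def bag_of_words(source: str) -> Counter:
    """Map each token to its occurrence count; token order is discarded."""
    return Counter(tokenize(source))


def token_sequence(source: str, vocabulary: dict[str, int], unknown_id: int = 0) -> list[int]:
    """Encode the tokens, in order, as integer ids.

    This ordered form is the kind of input a sequential model such as an
    LSTM would consume.
    """
    return [vocabulary.get(token, unknown_id) for token in tokenize(source)]


# Illustrative snippet only; not taken from any dataset.
snippet = "strcpy(buf, input); strcpy(tmp, buf);"

bow = bag_of_words(snippet)
# Build a toy vocabulary from the snippet itself (id 0 is reserved for unknown tokens).
vocab = {token: index + 1 for index, token in enumerate(sorted(bow))}
seq = token_sequence(snippet, vocab)

print(bow.most_common(3))
print(seq)
```

The Bag of Words view records only how often each token occurs, whereas the sequence view preserves the order in which tokens appear, which is the additional context that sequential models can exploit.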

Recent advances in the field of Artificial Intelligence (AI), and in particular the impressive capabilities demonstrated by Transformers and by Large Language Models (LLMs) such as GPT, BERT, and BART, have created new possibilities for deriving more accurate VPMs[9]. The ability of LLMs to comprehend and process textual information makes them promising candidates for processing source code in an attempt to detect potential security issues. Using LLMs in vulnerability prediction could enhance the ability of the models to predict the existence of vulnerabilities, as more context from the source code is taken into account. In addition to syntactic information, LLM-based VPMs are expected to capture the semantic information of the source code (e.g., the actual meaning of variable, method, and class names), gaining a deeper understanding of the code than conventional text mining-based VPMs. This deeper understanding is expected to enable LLM-based VPMs to detect more complex vulnerability patterns that may exist in the source code.

DOSS Contributions and Proposed Solution

To this end, within the context of the DOSS project, we will examine whether LLMs can be used effectively for building highly accurate VPMs. More specifically, several Transformer-based models (e.g., GPT, BERT, and BART, among others) will be fine-tuned on a well-curated vulnerability dataset and evaluated on the downstream task of vulnerability prediction. The predictive performance of the LLM-based VPMs will be compared against existing text mining-based VPMs that rely on more traditional Natural Language Processing (NLP) techniques and have demonstrated promising results in the literature. In addition, we will examine the extension of the LLM-based VPMs with graph-based information from the source code, focusing mainly on Code Property Graphs (CPGs). This will help us determine whether combining structural information with textual information further enhances the predictive performance of the produced models. It will also allow us to test the common belief that the more contextual information is retrieved from the source code, the more accurate the vulnerability prediction process becomes.
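As an illustration of how such a comparison between two VPMs might be scored, the sketch below computes the F-beta score of a baseline and a candidate model on the same labelled components. The choice of F1 and F2 as metrics is an assumption of this example (the text above does not name the metrics that will be used), and the labels and predictions are made-up toy values, not results.

```python
def confusion_counts(y_true: list[int], y_pred: list[int]) -> tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) for label 1 = vulnerable."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def f_beta(y_true: list[int], y_pred: list[int], beta: float = 1.0) -> float:
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    tp, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy data, invented for this sketch: 1 = vulnerable component, 0 = clean component.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
baseline_pred = [1, 0, 0, 0, 0, 0, 0, 0]   # hypothetical text mining-based VPM
candidate_pred = [1, 1, 1, 1, 0, 0, 0, 0]  # hypothetical LLM-based VPM

for name, pred in [("baseline", baseline_pred), ("candidate", candidate_pred)]:
    print(name, round(f_beta(y_true, pred, beta=1.0), 4), round(f_beta(y_true, pred, beta=2.0), 4))
```

A recall-weighted score such as F2 is one plausible choice when missing a vulnerable component is considered costlier than inspecting a clean one; other metrics could equally be used.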

A high-level overview of the envisaged approach is illustrated in the figure below. As the figure shows, the envisaged vulnerability prediction mechanism will receive as input the source code of the system under test and, for each software component it contains, will extract both the textual and the graph-based representation of the component's source code. These representations will be provided as input to the LLM-based VPM, which will judge whether the component is likely to be vulnerable, based on the knowledge it acquired during training. A detailed report listing all the potentially vulnerable components of the analyzed software will then be produced and provided to the user.

Figure 1: The high-level overview of the Large Language Model (LLM)-based Vulnerability Prediction mechanism of the DOSS project
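The flow shown in Figure 1 can be sketched roughly as follows. Every name in this snippet (Component, Finding, extract_graph, placeholder_model, predict_vulnerable) is hypothetical and introduced only for illustration. In particular, the graph extractor is a trivial stand-in rather than a real Code Property Graph builder, and the scoring function is a keyword check standing in for a fine-tuned LLM-based VPM.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Component:
    name: str
    source: str


@dataclass
class Finding:
    component: str
    score: float


def extract_text(component: Component) -> str:
    """Textual representation: here simply the raw source code."""
    return component.source


def extract_graph(component: Component) -> list[tuple[str, str]]:
    """Stand-in for a graph representation such as a CPG (assumption).

    It only links consecutive non-empty lines; a real extractor would
    encode syntax, control flow, and data flow.
    """
    lines = [line.strip() for line in component.source.splitlines() if line.strip()]
    return list(zip(lines, lines[1:]))


def placeholder_model(text: str, graph: list[tuple[str, str]]) -> float:
    """Stand-in for the trained VPM; NOT a real predictor.

    Returns 1.0 if the text contains one of a few well-known risky C calls.
    The graph argument is accepted but ignored by this placeholder.
    """
    risky_calls = ("strcpy(", "gets(", "sprintf(")
    return 1.0 if any(call in text for call in risky_calls) else 0.0


def predict_vulnerable(
    components: list[Component],
    model: Callable[[str, list[tuple[str, str]]], float] = placeholder_model,
    threshold: float = 0.5,
) -> list[Finding]:
    """Score every component and report those at or above the threshold, highest score first."""
    report = []
    for component in components:
        score = model(extract_text(component), extract_graph(component))
        if score >= threshold:
            report.append(Finding(component.name, score))
    return sorted(report, key=lambda finding: finding.score, reverse=True)


# Illustrative input only.
system_under_test = [
    Component("copy.c", "void f(char *in) {\n  char buf[8];\n  strcpy(buf, in);\n}"),
    Component("add.c", "int add(int a, int b) {\n  return a + b;\n}"),
]

for finding in predict_vulnerable(system_under_test):
    print(finding.component, finding.score)
```

Passing the model in as a parameter keeps the extraction and reporting steps independent of whichever VPM is eventually plugged in.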

The LLM-based Vulnerability Prediction Mechanism will constitute a core element of the Component Tester module of the DOSS project. The information produced by these models will be leveraged to enable more effective and efficient detection of vulnerabilities that may reside in the software of IoT components.

[1] https://www.itgovernance.co.uk/blog/data-breaches-and-cyber-attacks-quarterly-review-q1-2023

[2] https://securityintelligence.com/articles/cost-of-a-data-breach-2023-financial-industry/

[3] https://www.sciencedirect.com/science/article/pii/S095058492300157X

[4] https://link.springer.com/chapter/10.1007/978-3-319-95189-8_13

[5] https://arxiv.org/abs/2306.11673

[6] https://dl.acm.org/doi/abs/10.1145/2372225.2372230

[7] https://www.mdpi.com/1099-4300/24/5/651

[8] https://ieeexplore.ieee.org/abstract/document/9167194

[9] https://ieeexplore.ieee.org/abstract/document/10232867