Automatic Vulnerability Categorization: Are Large Language Models (LLMs) the solution?

By Miltiadis Siavvas, Information Technologies Institute (ITI) of the Centre for Research and Technology-Hellas (CERTH)

  1. Problem Statement

The early identification and mitigation of software vulnerabilities is critical for the development of secure software. To facilitate the vulnerability identification and mitigation process, several tools and techniques have been proposed over the years that are able to detect potential source code-level mistakes that could indicate the existence of vulnerabilities, with static analysis being the most popular. However, the limitations of existing static analysis tools, namely their tendency to generate a large volume of false positives and their inability to identify more complex vulnerability patterns, have led to the proposition of more advanced machine learning (ML)-based vulnerability detection models. By analyzing the source code of a given application, these models are able to highlight software components (e.g., classes, methods, or even code snippets) that are likely to contain vulnerabilities, based on the vulnerability patterns they have learned from historical data (i.e., actual vulnerabilities found in real-world software).

While existing ML-based vulnerability detection techniques are able to identify more complex vulnerability patterns compared to traditional static analysis tools, they rarely provide information about the exact code location and the type of the identified vulnerability. This information is critical for the effective mitigation of the identified vulnerabilities, as it can be used by software engineers to better prioritize their testing and fortification activities. Although several recent research endeavors have started utilizing techniques, chiefly from the field of eXplainable AI (XAI), to locate the exact vulnerable lines of code, the identification of the type of the underlying vulnerability remains scarcely investigated.

Vulnerability categorization is usually a subsequent step that is performed manually by security experts, who inspect the identified potentially vulnerable source code snippet in order to verify the actual existence of the vulnerability, identify its type, and write a relevant report. This is a manual, time-consuming, and effort-demanding procedure, which adds significant delays to the vulnerability identification and mitigation process due to the need for human intervention. These delays may significantly increase the resolution time of the vulnerability, slowing down the production cycle of the project or leaving it exposed to potential attacks for prolonged periods of time.

Although there are research attempts that utilize ML models to automate the categorization process, they rely on the textual descriptions of the vulnerabilities provided by experts rather than on the actual source code, and thus do not fully automate the process. Models able to identify the type of a detected vulnerability directly from the flagged source code snippet, without requiring any textual description, are expected to greatly streamline the mitigation process.

  2. Evaluation, Results, and Discussion

To this end, within the DOSS project, we empirically examined whether Large Language Models (LLMs), which have already demonstrated promising results in various Software Engineering (SE) tasks, including requirements classification and vulnerability detection, can be used for accurate vulnerability categorization directly from source code. More specifically, we focused on two popular LLMs, namely BERT and its variant CodeBERT, and examined two potential ways of utilizing them for vulnerability categorization: (i) using only their embedding vectors in order to build simpler ML classifiers, and (ii) fine-tuning them on the downstream task of vulnerability categorization based on domain-specific data. Less complex models based on simpler natural language processing (NLP) techniques were also used as a baseline for comparison. The high-level overview of our experiment is presented in the figure below:

Figure 1: The high-level overview of the adopted methodology

As can be seen in this figure, a dataset with vulnerable source code snippets was utilized for (i) training Machine Learning (ML) models based on various text representation techniques (i.e., Bag of Words and word embeddings), and (ii) fine-tuning the selected LLMs (i.e., BERT and CodeBERT) on the downstream task of software vulnerability categorization. The selected dataset contains 4530 vulnerable source code snippets, which are categorized as SQL Injection, Cross-site Request Forgery (XSRF), Command Injection, Path Disclosure, Open Redirect, Remote Code Execution (RCE), and Cross-site Scripting (XSS). As can be seen, the dataset covers highly diverse and critical vulnerabilities.
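To make the two ways of utilizing the LLMs more concrete, the minimal sketch below illustrates both of them with the Hugging Face transformers library; the checkpoint name, the helper function, and the truncation length are illustrative assumptions rather than the exact configuration used in the study.

```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

MODEL_NAME = "microsoft/codebert-base"   # assumed checkpoint; plain BERT would use "bert-base-uncased"
NUM_CLASSES = 7                          # the seven vulnerability categories of the dataset

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# (i) Embedding extraction: the frozen LLM is used only to produce a feature
#     vector per snippet, which is then fed to a simpler ML classifier.
encoder = AutoModel.from_pretrained(MODEL_NAME)
encoder.eval()

def embed_snippet(code: str) -> torch.Tensor:
    """Return the [CLS] embedding of a source code snippet."""
    inputs = tokenizer(code, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze(0)  # shape: (768,)

# (ii) Fine-tuning: a classification head is attached and the whole model is
#      trained on the labelled snippets (training loop omitted for brevity).
classifier = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_CLASSES
)
```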

For the evaluation, the 10-fold stratified cross-validation approach was adopted in order to reduce bias, while popular performance metrics were used, namely Accuracy, Recall, Precision, and F1-score, with particular focus on the latter. As already stated, simpler ML-based models that utilized the BoW approach and embedding algorithms, particularly Word2Vec and fastText, were used as the baseline for the comparison. It should be noted that the Word2Vec and fastText algorithms were also trained on a large source code corpus of 11.5 million lines of code. Various ML algorithms were examined, with Random Forest (RF) demonstrating the best results in all the studied cases.
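As a rough illustration of this evaluation protocol, the sketch below wires the BoW baseline with a Random Forest classifier into scikit-learn's stratified 10-fold cross-validation; the dataset loader and the hyperparameters are hypothetical placeholders, not the exact experimental setup.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline

# Hypothetical loader returning the code snippets and their category labels.
snippets, labels = load_vulnerability_dataset()

# Bag-of-Words features feeding a Random Forest classifier.
pipeline = make_pipeline(
    CountVectorizer(lowercase=False, token_pattern=r"\S+"),  # keep code tokens as-is
    RandomForestClassifier(n_estimators=300, random_state=42),
)

# 10-fold stratified cross-validation with the four reported metrics.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_validate(
    pipeline, snippets, labels, cv=cv,
    scoring=["accuracy", "precision_macro", "recall_macro", "f1_macro"],
)
print("Mean F1-score:", scores["test_f1_macro"].mean())
```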

The results of the evaluation are presented in the table below:

Table 1: The evaluation results of the trained software vulnerability categorization models

From the results presented in the table above, several interesting observations can be made. First of all, all the examined models demonstrated sufficient results in vulnerability classification, achieving an average F1-score above 75% in all studied cases. The best-performing model was CodeBERT, followed by BERT, both fine-tuned on domain-specific data (i.e., on the selected vulnerability classification dataset), indicating that fine-tuning the LLMs tends to be the best approach for the task of vulnerability classification (at least on the studied dataset). Another interesting observation is that the models built on the more traditional BoW approach led to better (or at least comparable) predictive performance compared to those built on embedding vectors.

The above analysis highlights the superiority of LLMs, particularly CodeBERT and BERT, in vulnerability classification compared to other more traditional NLP approaches, especially when they are fine-tuned on domain-specific data.

Going one step further, we compared the best-performing models with respect to their ability to predict each one of the 7 studied vulnerability categories. The results, particularly the reported average F1-Scores, are illustrated in the table below:

Table 2: The F1-score per category of the best-performing approaches

As can be seen from the table above, the fine-tuned CodeBERT (which was found to be the best-performing model overall) demonstrated the best results in almost all the studied vulnerability categories, further supporting its superiority over the other approaches in identifying specific vulnerability types. The only exceptions were the Open Redirect and Remote Code Execution categories, in which CodeBERT demonstrated poorer (though still sufficient) performance compared to the BoW approach. This suggests that CodeBERT could be used as the main model for categorizing the detected vulnerabilities, and, whenever a detected vulnerability is classified as Open Redirect or Remote Code Execution by CodeBERT, another model with higher predictive performance in that category (particularly the BoW with RF model in our case) could be consulted supplementarily to reach safer conclusions.

The above analysis suggests that while fine-tuned LLMs are generally superior in vulnerability classification, supplementary models may be necessary for specific categories in order to reach safer conclusions.
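A minimal sketch of such a two-model scheme is given below: every snippet is first classified by the fine-tuned CodeBERT, and a secondary BoW with RF model is consulted only for the categories where CodeBERT was found to be weaker; the two prediction helpers are hypothetical wrappers around the trained models.

```python
# Categories where the fine-tuned CodeBERT was outperformed by the BoW with RF model.
WEAK_CATEGORIES = {"Open Redirect", "Remote Code Execution"}

def categorize(snippet: str) -> str:
    """Classify a snippet with CodeBERT, double-checking its weaker categories."""
    primary = predict_with_codebert(snippet)       # hypothetical wrapper: fine-tuned CodeBERT
    if primary in WEAK_CATEGORIES:
        secondary = predict_with_bow_rf(snippet)   # hypothetical wrapper: BoW + Random Forest
        if secondary != primary:
            # The two models disagree on a weak category: flag the snippet for
            # manual review instead of trusting either prediction blindly.
            return f"{primary} (needs review; secondary model predicts {secondary})"
    return primary
```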

For a more detailed description of this work, we refer the reader to the actual publication (by Ilias Kalouptsoglou et al.), which can be found here.
