Conference:
The 24th IEEE International Conference on Software Quality, Reliability and Security (QRS 2024), 1-5 July 2024, Cambridge, UK
Authors:
Kalouptsoglou I, Siavvas M, Ampatzoglou A, Kehagias D, Chatzigeorgiou A.
Abstract:
Security testing is now an integral part of testing activities throughout the software development life cycle. Over the years, various techniques have been proposed to identify security issues in source code, especially vulnerabilities, which can be exploited and cause severe damage. Recently, Machine Learning (ML) techniques capable of predicting vulnerable software components and indicating high-risk areas have appeared, among other approaches, accelerating the effort-demanding and time-consuming process of vulnerability localization. For effective subsequent vulnerability elimination, there is a need to automate the labeling of detected vulnerabilities with vulnerability categories, i.e., identifying the type of each vulnerability. Several techniques have been proposed over the years for automating this labeling process. However, the vast majority of them attempt to identify the type of a vulnerability from its textual description provided by experts, such as the description in the vulnerability report in the National Vulnerability Database, and not from its actual source code, which hinders both full automation and vulnerability categorization during the software testing phase. This work examines vulnerability classification directly from the source code during the vulnerability detection step.
In this way, a vulnerability detection method can provide complete information about, and an interpretation of, its findings. Leveraging advances in Artificial Intelligence and Natural Language Processing, we construct and compare several multi-class classification models for categorizing vulnerable code snippets. The results highlight the importance of the context-aware embeddings of pre-trained Transformer-based models, as well as the significance of transfer learning from a programming-language-related domain.
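
To make the task concrete, the following is a minimal sketch of multi-class classification of code snippets into vulnerability categories. It is not the paper's approach: instead of context-aware Transformer embeddings, it uses a simple bag-of-tokens representation with a nearest-centroid rule, which is only a context-free baseline. The category labels, training snippets, and function names are hypothetical and chosen purely for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(code):
    """Split a snippet into identifiers and single non-space characters."""
    return re.findall(r"[A-Za-z_]\w*|\S", code)


def train_centroids(samples):
    """Sum token counts per category; `samples` holds (snippet, label) pairs."""
    sums = defaultdict(Counter)
    for code, label in samples:
        sums[label].update(tokenize(code))
    return dict(sums)


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(code, centroids):
    """Return the category whose centroid is most similar to the snippet."""
    vec = Counter(tokenize(code))
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical toy training set: (vulnerable snippet, category label).
TRAIN = [
    ('query = "SELECT * FROM users WHERE id=" + user_id; cursor.execute(query)',
     "sql_injection"),
    ('cursor.execute("DELETE FROM orders WHERE name=" + name)',
     "sql_injection"),
    ("char buf[8]; strcpy(buf, input);", "buffer_overflow"),
    ("char dst[16]; gets(dst); strcat(dst, src);", "buffer_overflow"),
    ('document.write("<p>" + location.hash + "</p>");', "xss"),
    ('element.innerHTML = "<div>" + userComment + "</div>";', "xss"),
]

centroids = train_centroids(TRAIN)
print(classify("char name[4]; strcpy(name, argv[1]);", centroids))
```

A bag-of-tokens model like this ignores token order and surrounding context, which is the kind of limitation that context-aware embeddings from pre-trained models are meant to address.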