Software Requirements Classification using Large Language Models (LLMs)

Insights | April 26, 2024 (updated May 2, 2024)

By Miltiadis Siavvas, Information Technologies Institute (ITI) of the Centre for Research and Technology-Hellas (CERTH)

The definition of software requirements, both functional and non-functional, is the first step of the Software Development Lifecycle (SDLC). The correct specification of these requirements is critical for producing high-quality and dependable software products. For instance, incorrect, ambiguous, or incomplete requirements can lead to the introduction of critical software bugs or vulnerabilities in the source code. It is estimated that around 50% of the vulnerabilities found in software stem from incorrect or vague requirements and design choices, which manifest as security bugs in the source code at later stages of development [1][2].

An important aspect of requirements engineering, and of requirements specification in particular, is accurate requirements classification. Requirements classification is the task of identifying the category of the defined requirements, i.e., whether they are Functional Requirements (FRs) or Non-functional Requirements (NFRs); in the case of NFRs, it also involves identifying their specific type, e.g., security, performance, or usability requirements. Knowing the type of the defined requirements is important for software engineers (i) for determining the main strategy to follow for their implementation (as FRs and NFRs are implemented differently), and (ii) for better prioritizing development and testing activities (e.g., security requirements must be implemented first). In addition, requirements classification techniques can enable the QA team to verify and validate the definition of requirements promptly and detect potential misclassifications early, before their implementation in the source code.
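To make this two-level scheme concrete, the following minimal Python sketch collapses a fine-grained requirement class to the binary FR/NFR label; the subtype list is illustrative (it includes only the subtypes mentioned above, not the full taxonomy of any particular dataset):

```python
# Illustrative two-level label scheme for requirements classification.
# NFR_SUBTYPES lists only the subtypes named in the text; a real dataset
# (e.g., PROMISE) distinguishes more of them.
NFR_SUBTYPES = {"security", "performance", "usability"}

def binary_label(fine_label: str) -> str:
    """Collapse a fine-grained class to the binary FR/NFR setting."""
    return "NFR" if fine_label in NFR_SUBTYPES else "FR"

print(binary_label("security"))  # an NFR subtype maps to NFR
```

This mapping is what lets the same labeled dataset serve both the binary and the multi-class experiments.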

Traditionally, software requirements classification has been performed manually. However, manual classification is an effort-demanding and time-consuming process, and it is also prone to human error due to misinterpretations or lack of expertise. Recently, there has been a shift towards using Natural Language Processing (NLP) and Machine Learning (ML) techniques to automate and streamline the requirements classification process. Among the studied approaches, Large Language Models (LLMs) have started gaining the attention of the research community, mainly due to the remarkable capabilities they have demonstrated in language understanding and text generation. Particular emphasis has been given to Bidirectional Encoder Representations from Transformers (BERT) and its variants, which have shown very promising results in requirements classification.
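As a rough illustration of the simpler NLP/ML end of this spectrum, the sketch below implements a Bag-of-Words representation with a multinomial Naive Bayes classifier in plain Python. The six training requirements are made up for illustration and are not taken from any of the datasets used in the study:

```python
from collections import Counter, defaultdict
import math

# Toy training set (illustrative sentences, not real dataset samples).
TRAIN = [
    ("the system shall encrypt all stored passwords", "NFR"),
    ("access shall require two factor authentication", "NFR"),
    ("the response time shall be under two seconds", "NFR"),
    ("the user shall be able to export reports as pdf", "FR"),
    ("the system shall allow admins to create new accounts", "FR"),
    ("users shall be able to search orders by date", "FR"),
]

def tokenize(text):
    return text.lower().split()

def train(samples):
    """Build per-class token counts (the Bag-of-Words model) and class priors."""
    counts = defaultdict(Counter)   # class -> token frequency counts
    priors = Counter()              # class -> number of training documents
    vocab = set()
    for text, label in samples:
        toks = tokenize(text)
        counts[label].update(toks)
        priors[label] += 1
        vocab.update(toks)
    return counts, priors, vocab

def predict(text, counts, priors, vocab):
    """Return the class with the highest log posterior probability."""
    total_docs = sum(priors.values())
    best, best_lp = None, float("-inf")
    for label in priors:
        lp = math.log(priors[label] / total_docs)
        class_total = sum(counts[label].values())
        for tok in tokenize(text):
            # Laplace (add-one) smoothing over the shared vocabulary
            lp += math.log((counts[label][tok] + 1) / (class_total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train(TRAIN)
print(predict("the system shall respond within one second", *model))  # prints NFR
```

A fine-tuned Transformer replaces the hand-built counts with learned contextual representations, which is the gap the experiment below quantifies.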

To this end, within the DOSS project, we are planning to utilize LLMs for requirements classification and, later on, for the automatic extraction of software requirements from textual descriptions. As a first step towards this goal, we performed an empirical evaluation of various LLMs on the software requirements classification task. Two classification approaches were considered in our experiment: (i) a binary classification approach, which categorizes requirements into FRs and NFRs, and (ii) a multi-class classification approach, which identifies the specific type of NFR that a given text belongs to. A high-level overview of our experiment is presented in the figure below:

Figure 1: The high-level overview of the adopted methodology

As can be seen in this figure, a dataset of software requirements was utilized and, after appropriate preprocessing, leveraged for (i) training Machine Learning (ML) models based on simple text representation techniques (i.e., Bag of Words and word embeddings), and (ii) fine-tuning LLMs on the downstream task of software requirements classification. For the latter, we examined various popular Transformer-based models, including GPT-2, BART, T5, BERT, and BERT's variants, particularly DistilBERT and RoBERTa. It should be noted that, instead of relying solely on the PROMISE dataset (a common practice in the related literature), we constructed an extended dataset by merging 5 different requirements datasets, comprising 3471 samples in total. The results of the evaluation are presented in the table below:

Table 1: The evaluation results of the trained software requirements classification models

Several interesting observations can be made from the results presented in the table above. First of all, although all the examined models manage to predict the category of the analyzed requirements with a sufficient F1-score, Transformer-based models generally exhibit better predictive performance than ML models built on simpler NLP techniques. This highlights the superiority of LLMs in software requirements classification (both binary and multi-class) compared to simpler, more traditional techniques. Among the LLMs, BERT and its variants achieve an F1-score higher than 90%, which is in line with what has been reported in numerous research endeavors in the related literature. However, other LLMs, namely BART and GPT-2, demonstrated slightly higher F1-scores than the BERT models in the binary and multi-class classification cases respectively. In contrast to the current literature, where emphasis has been placed almost exclusively on BERT and its variants, other pre-trained models seem to provide similar (or even better) results, and therefore there is room for further experimentation and research in this direction.
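For reference, the F1-score used throughout the evaluation is the harmonic mean of precision and recall. The sketch below computes it for the binary FR/NFR setting in plain Python; the predictions shown are made-up illustrative values, not results from the experiment:

```python
def f1_score(y_true, y_pred, positive="NFR"):
    """F1 for one positive class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Made-up predictions for six requirements (not from Table 1).
y_true = ["NFR", "NFR", "FR", "FR", "NFR", "FR"]
y_pred = ["NFR", "FR",  "FR", "FR", "NFR", "NFR"]
print(round(f1_score(y_true, y_pred), 3))  # prints 0.667
```

In the multi-class setting, the same per-class computation is typically averaged across the NFR subtypes (e.g., macro- or weighted-averaged).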
