Conference:
EuroCyberSec 2024, 23 October 2024, Krakow, Poland
Authors:
Maliga D, Nagy R, Buttyán L.
Abstract:
In this paper, we present a pipeline that we designed for cleaning and processing large datasets of potentially malicious binaries using access to a rate-limited, cloud-based malware analysis platform. Our goal is to efficiently filter out and discard benign files, to extract metadata from the remaining, likely-to-be-malware samples, and to create graph-based databases containing only metadata of verified malware. The main issue that we have to solve is the limited quota for accessing the online malware analysis platforms that can be used to decide whether a binary is malicious and to obtain metadata from static and dynamic analysis of samples. Our pipeline solves the problem by reaching, with a minimal number of requests to the online platform, a state where every sample in the database is either confirmed malware (based on its VirusTotal report) or similar to a confirmed malware sample. A database in such a state is already usable in practice, while confirming the malicious nature of, and extracting metadata for, all the samples in it can continue in the background.
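The target state described in the abstract can be illustrated with a minimal sketch. It is not the pipeline from the paper: the greedy selection rule, the function name `clean_dataset`, the precomputed `neighbors` similarity map, and the `is_malicious` callback (a stand-in for one rate-limited request to the analysis platform) are all assumptions made for illustration.

```python
from typing import Callable, Dict, Set, Tuple


def clean_dataset(
    neighbors: Dict[str, Set[str]],
    is_malicious: Callable[[str], bool],
) -> Tuple[Set[str], Set[str], Set[str], int]:
    """Return (confirmed, covered, discarded, number_of_queries).

    neighbors    -- hypothetical, precomputed symmetric similarity map
    is_malicious -- hypothetical oracle; each call costs one platform request
    """
    confirmed: Set[str] = set()   # verified malware
    covered: Set[str] = set()     # similar to a confirmed sample, not yet queried
    discarded: Set[str] = set()   # reported benign and dropped
    queries = 0
    pending = set(neighbors)      # neither queried nor covered yet

    while pending:
        # Assumed heuristic: query the sample with the most pending neighbors,
        # since a positive verdict would cover the largest group at once.
        sample = max(sorted(pending), key=lambda s: len(neighbors[s] & pending))
        pending.discard(sample)
        queries += 1
        if is_malicious(sample):
            confirmed.add(sample)
            newly_covered = neighbors[sample] & pending
            covered |= newly_covered
            pending -= newly_covered
        else:
            discarded.add(sample)

    return confirmed, covered, discarded, queries


# Toy data: "a" is similar to "b" and "c"; "d" and "e" are isolated.
toy_neighbors = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}, "d": set(), "e": set()}
toy_malware = {"a", "b", "c", "e"}
result = clean_dataset(toy_neighbors, lambda s: s in toy_malware)
```

On this toy input the loop stops after three queries instead of five: "b" and "c" are kept because they are similar to the confirmed sample "a", and under the approach described in the abstract they would be verified later in the background.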