New Paper Accepted: Lightweight Behavior-Based Malware Detection
Our paper, entitled "Lightweight Behavior-Based Malware Detection", has been accepted for publication at the MEDES2023 conference.
The paper presents a lightweight approach to malware detection that relies on basic performance metrics (e.g., CPU usage) instead of invasive OS monitoring. Our custom-made dataset consists of real data augmented with a GAN, on which we train an LSTM model. In practice, we show that malware can be detected with good accuracy (85% in our experiments) without invasive monitoring.
The authors of the paper are Marco Anisetti, Claudio Ardagna, Nicola Bena (me), Gabriele Gianini, and Vincenzo Giandomenico (whose thesis is the basis for this work). I will present the paper virtually.
Below is the full (preliminary) abstract.
Modern malware detection tools rely on special permissions to collect data that can reveal the presence of suspicious software within a machine. The data they typically collect for this task are the set of system calls, the content of network traffic, file system changes, and API calls. Giving an externally created program access to these data, however, means granting the company that created that software complete control over the host machine. This is undesirable for many reasons. In this work, we propose an alternative approach to the task, which relies on easily accessible data, such as information about system performance (CPU, RAM, disk, and network usage), and does not need high-level permissions. To investigate the effectiveness of this approach, we collected the performance data in the form of a multi-valued time series and ran a number of malware programs in a suitably devised sandbox. Then, to address the fact that deep learning models need large training sets, we augmented the dataset using a deep learning generative model (a Generative Adversarial Network). Finally, we trained an LSTM (Long Short-Term Memory) network to capture the malware behavioral patterns. Our investigation found that this approach, based on easy-to-collect information, is very effective (we achieved 85% accuracy), despite the fact that the data used for training the detector are very different from the data usually collected for this purpose.
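To make the "multi-valued time series" idea concrete, here is a minimal sketch of how performance samples could be cut into fixed-length windows before being fed to a sequence model such as an LSTM. It is an illustration under stated assumptions, not the pipeline used in the paper: the feature order, the window length, the stride, and the function name `make_windows` are all hypothetical, and the values in the toy trace are made up.

```python
from typing import List, Sequence

# Hypothetical feature order; the abstract only names CPU, RAM, disk, and network usage.
FEATURES = ("cpu", "ram", "disk", "net")


def make_windows(
    samples: Sequence[Sequence[float]], length: int, stride: int = 1
) -> List[List[List[float]]]:
    """Slice a multi-valued time series into (possibly overlapping) windows.

    Each window holds `length` consecutive time steps, and each time step
    holds one value per metric. Window length and stride are assumptions
    chosen for illustration only.
    """
    if length <= 0 or stride <= 0:
        raise ValueError("length and stride must be positive")
    windows = []
    # Stop at the last start index that still leaves a full window.
    for start in range(0, len(samples) - length + 1, stride):
        windows.append([list(row) for row in samples[start:start + length]])
    return windows


# Toy trace with made-up values: 6 time steps, 4 metrics per step.
trace = [[float(t), 50.0, 1.0, 0.5] for t in range(6)]
windows = make_windows(trace, length=4, stride=2)
```

With these toy settings the trace yields two windows of four time steps each; a trace shorter than the window length yields none. In a real setup each window would carry a label (malware running or not) and be passed to the model as one training sequence.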