Information security companies Sophos and ReversingLabs have announced the release of the SoReL-20M database, which consists of 20 million Windows Portable Executable files. Of these, 10 million are malware samples.
The database, intended to advance the information security industry, provides metadata, labels and features for the files, and also allows interested parties to download the available malware samples for further research. A publicly available dataset containing carefully selected samples and relevant metadata is expected to help accelerate research into the use of machine learning for malware detection.
Although machine learning models are built on data, the information security field has no standard large-scale database that everyone can easily access, from independent researchers to information security laboratories and corporations. According to Sophos experts, the lack of such a database has impeded the development of the information security sector.
“Collecting large numbers of carefully selected, labeled samples is costly and complex, and sharing datasets is often complicated by intellectual property issues and the risk of exposing malware to unknown third parties. As a result, most malware detection research uses proprietary internal datasets, so the results cannot be compared,” Sophos said.
The industrial-scale SoReL-20M database, covering 20 million samples, including 10 million disarmed malware samples, is designed to solve this problem. For each sample, the database contains features extracted in the same way as for the EMBER 2.0 dataset, labels, and detection metadata, along with complete binaries for the malware samples.
It also provides PyTorch and LightGBM machine learning models trained on this data, as well as scripts for loading and iterating over the data, and scripts for training and testing the models.
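To give a rough idea of what loading and iterating over such records might look like, here is a minimal Python sketch. It is not the code shipped with the database: the `Sample` fields, the `iter_batches` and `malware_fraction` helpers, and the toy data are all hypothetical and chosen only for illustration, assuming each record pairs a feature vector with a label and some detection metadata.

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence


@dataclass(frozen=True)
class Sample:
    """One hypothetical record; the field names are illustrative assumptions."""
    sha256: str                # file identifier
    features: Sequence[float]  # numeric feature vector extracted from the PE file
    is_malware: bool           # binary label
    detections: int            # made-up example of detection metadata


def iter_batches(samples: Sequence[Sample], batch_size: int) -> Iterator[List[Sample]]:
    """Yield consecutive batches; the last batch may be smaller."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(samples), batch_size):
        yield list(samples[start:start + batch_size])


def malware_fraction(samples: Sequence[Sample]) -> float:
    """Return the share of records labeled as malware."""
    if not samples:
        return 0.0
    return sum(s.is_malware for s in samples) / len(samples)


# Tiny made-up dataset: half benign, half malicious, like the 10M/20M split.
toy = [
    Sample(
        sha256=f"{i:064x}",
        features=[float(i), float(i % 3)],
        is_malware=(i % 2 == 0),
        detections=(5 if i % 2 == 0 else 0),
    )
    for i in range(10)
]

batches = list(iter_batches(toy, batch_size=4))
print([len(b) for b in batches])
print(malware_fraction(toy))
```

In a real setup, each batch would be converted to tensors or arrays and passed to a model such as the PyTorch or LightGBM ones mentioned above; the sketch stops short of that to stay dependency-free.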
Sophos acknowledges that experienced hackers could use the database to their advantage and create tools for carrying out cyberattacks. However, according to its experts, attackers already have many other sources from which they can obtain information about malware.