TY - GEN
T1 - Enhancing Malware Classification
T2 - 2023 Cyber Research Conference - Ireland, Cyber-RCI 2023
AU - Syeda, Durre Zehra
AU - Asghar, Mamoona Naveed
N1 - Publisher Copyright: © 2023 IEEE.
PY - 2023
Y1 - 2023
AB - In recent times, the field of machine learning has witnessed remarkable progress, notably improving the efficiency of malware detection systems. However, the rapid surge in data dimensionality due to advanced technologies necessitates effective feature selection techniques. Feature selection is crucial in refining classifiers by identifying vital features and reducing computational complexity. The multitude of available feature selection algorithms, each with unique criteria, poses a challenge in choosing the right technique for specific datasets in various domains. To tackle this challenge, a combination of Filter and Embedded feature selection methods has been employed. This combination integrates outcomes from multiple feature selection approaches, effectively mitigating the limitations of individual methods. This paper presents a comprehensive comparison between Filter-based techniques, such as chi-squared, Information Gain, and ANOVA-F, and Embedded techniques, such as Lasso, Random Forest, XGBoost, and Extra-Tree Classifier. Additionally, it explores API categorization using novel datasets. Experimental findings consistently highlight Random Forest as the preferred choice, delivering high classification accuracy (98%), F-measure (97%), recall (95%), precision (100%), and AUC (98%), and demonstrating efficient feature reduction for malware classification datasets. Notably, all feature models exhibit a significant emphasis on Kernel and System Management, Registry Operations, File System, and System Information based APIs.
KW - API categorisation
KW - dataset generation
KW - feature scoring
KW - feature selection
KW - machine learning models
KW - malware classification
UR - http://www.scopus.com/inward/record.url?scp=85206089793&partnerID=8YFLogxK
U2 - 10.1109/Cyber-RCI59474.2023.10671445
DO - 10.1109/Cyber-RCI59474.2023.10671445
M3 - Conference Publication
AN - SCOPUS:85206089793
T3 - 2023 Cyber Research Conference - Ireland, Cyber-RCI 2023
BT - 2023 Cyber Research Conference - Ireland, Cyber-RCI 2023
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 24 November 2023
ER -