Vol. 4 No. 4 (2025): October
RESEARCH ARTICLES

Explainable Machine Learning Models For Detecting Malware in PDF Files

B Chennakasava Reddy
Department of Computer Science and Engineering, Annamacharya Institute of Technology and Sciences, Kadapa, Andhra Pradesh, India
G Jagadeesh
Department of Computer Science and Engineering, Annamacharya Institute of Technology and Sciences, Kadapa, Andhra Pradesh, India
C Maheshwar Reddy
Department of Computer Science and Engineering, Annamacharya Institute of Technology and Sciences, Kadapa, Andhra Pradesh, India
C Venkateswara Reddy
Department of Computer Science and Engineering, Annamacharya Institute of Technology and Sciences, Kadapa, Andhra Pradesh, India
S Mohammed Jabeer
Department of Computer Science and Engineering, Annamacharya Institute of Technology and Sciences, Kadapa, Andhra Pradesh, India

Published 2025-04-20

Keywords

  • malware detection
  • PDF
  • machine learning models

How to Cite

B Chennakasava Reddy, G Jagadeesh, C Maheshwar Reddy, C Venkateswara Reddy, & S Mohammed Jabeer. (2025). Explainable Machine Learning Models For Detecting Malware in PDF Files. International Journal of Computational Learning & Intelligence, 4(4), 598–607. https://doi.org/10.5281/zenodo.15250268

Abstract

The Portable Document Format (PDF) is widely used for document sharing, which makes it a common target for cyber threats. Attackers often embed malicious code in PDF files to exploit system vulnerabilities. Traditional malware detection techniques struggle to keep pace because attack methods evolve and detectors rely on predefined feature sets. This study presents an improved approach to PDF malware detection that combines machine learning with explainability analysis. A dataset of 15,958 PDF samples, comprising benign, malicious, and evasive files, is developed for this research. Three widely used PDF analysis tools (PDFiD, PDFINFO, and PDF-PARSER) are used to extract features, and additional derived features are added to improve classification accuracy. Systematic feature selection and empirical evaluation identify an optimal feature set. The proposed method is tested with several machine learning classifiers; the Random Forest model achieves approximately 2% higher accuracy than the baseline models. A decision tree is also generated to make the model interpretable by exposing its classification rules. A comparative analysis with existing studies highlights key findings and advances in PDF malware detection.
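To make the feature-extraction step concrete, the following is a minimal sketch of keyword-count features of the kind PDFiD reports, computed directly from raw PDF bytes. It is not the authors' pipeline and does not call PDFiD itself. The keyword selection, the function names `count_keywords` and `derived_features`, and the derived feature `script_per_page` are assumptions introduced for illustration only.

```python
import re

# PDF names often inspected when triaging suspicious files.
# This particular selection is an assumption, not the paper's feature set.
KEYWORDS = ["/JS", "/JavaScript", "/OpenAction", "/AA",
            "/Launch", "/EmbeddedFile", "/Page"]


def count_keywords(data: bytes) -> dict:
    """Count occurrences of each PDF name in the raw bytes of a file."""
    counts = {}
    for kw in KEYWORDS:
        # The lookahead rejects longer names, so "/Page" does not match "/Pages".
        pattern = re.escape(kw.encode("ascii")) + rb"(?![A-Za-z0-9])"
        counts[kw] = len(re.findall(pattern, data))
    return counts


def derived_features(counts: dict) -> dict:
    """Hypothetical derived feature: script-related names per page object."""
    scripts = counts["/JS"] + counts["/JavaScript"]
    pages = max(counts["/Page"], 1)  # avoid division by zero
    return {"script_per_page": scripts / pages}


# A tiny hand-written PDF fragment used only to exercise the functions.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R /OpenAction 4 0 R >>\nendobj\n"
    b"3 0 obj\n<< /Type /Page /Parent 2 0 R >>\nendobj\n"
    b"4 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
)

counts = count_keywords(sample)
print(counts)
print(derived_features(counts))
```

This sketch does not decode hex-escaped names (for example `/J#53`), which evasive files may use, so a dedicated tool remains preferable for real feature extraction. The resulting count vectors are the sort of tabular input that a Random Forest or decision tree classifier could be trained on.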
