Spam detection using linear genetic programming

Meli, Clyde; Nezval, Vítězslav; Komínková Oplatková, Zuzana; Buttigieg, Victor

dc.title	Spam detection using linear genetic programming	en
dc.contributor.author	Meli, Clyde
dc.contributor.author	Nezval, Vítězslav
dc.contributor.author	Komínková Oplatková, Zuzana
dc.contributor.author	Buttigieg, Victor
dc.relation.ispartof	Advances in Intelligent Systems and Computing
dc.identifier.issn	2194-5357
dc.identifier.isbn	978-3-319-97887-1
dc.date.issued	2019
utb.relation.volume	837
dc.citation.spage	80
dc.citation.epage	92
dc.event.title	23rd International Conference on Soft Computing, MENDEL 2017
dc.event.location	Brno
utb.event.state-en	Czech Republic
utb.event.state-cs	Česká republika
dc.event.sdate	2017-06-20
dc.event.edate	2017-06-22
dc.type	conferenceObject
dc.language.iso	en
dc.publisher	Springer Verlag
dc.identifier.doi	10.1007/978-3-319-97888-8_7
dc.relation.uri	https://link.springer.com/chapter/10.1007/978-3-319-97888-8_7
dc.subject	identification	en
dc.subject	linear genetic programming	en
dc.subject	NP-complete	en
dc.subject	security	en
dc.subject	spam detection	en
dc.description.abstract	Spam refers to unsolicited bulk email. Many algorithms have been applied to the spam detection problem and many programs have been developed. The problem is adversarial: an ongoing fight against spammers. We prove that reliable spam detection is an NP-complete problem by mapping spam emails to metamorphic viruses and applying Spinellis's [30] proof of the NP-completeness of metamorphic virus identification. Using a number of features extracted from the SpamAssassin data set, a linear genetic programming (LGP) system called Gagenes LGP (or GLGP) has been implemented. The system has been shown to give 99.83% accuracy, higher than Awad et al.'s [3] result with the Naïve Bayes algorithm. GLGP's recall and precision are higher than Awad et al.'s, and GLGP's accuracy is also higher than the results reported by Lai and Tsai [19]. © Springer Nature Switzerland AG 2019.	en
utb.faculty	Faculty of Applied Informatics
dc.identifier.uri	http://hdl.handle.net/10563/1008169
utb.identifier.obdid	43880128
utb.identifier.scopus	2-s2.0-85051789860
utb.source	d-scopus
dc.date.accessioned	2018-08-30T13:31:25Z
dc.date.available	2018-08-30T13:31:25Z
utb.ou	CEBIA-Tech
utb.contributor.internalauthor	Komínková Oplatková, Zuzana
utb.fulltext.affiliation	Clyde Meli 1 (✉) http://orcid.org/0000-0003-3551-862X, Vitezslav Nezval 1, Zuzana Kominkova Oplatkova 2, and Victor Buttigieg 3; 1 Department of Computer Information Systems, University of Malta, Msida, Malta, clyde.meli@um.edu.mt, vnez@cis.um.edu.mt; 2 Department of Informatics and Artificial Intelligence, Tomas Bata University, Zlín, Czech Republic, kominkovaoplatkova@fai.utb.cz; 3 Department of Communications and Computer Engineering, University of Malta, Msida, Malta, victor.buttigieg@um.edu.mt
utb.fulltext.dates	-
utb.fulltext.references	1. Almeida, T.A., Yamakami, A.: Advances in spam filtering techniques. In: Computational Intelligence for Privacy and Security, pp. 199–214. Springer, Heidelberg (2012) 2. Androutsopoulos, I., et al.: An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. In: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 160–167. ACM, New York (2000) 3. Awad, W.A., ELseuofi, S.M.: Machine learning methods for e-mail classification. Int. J. Comput. Appl. 16(1), 0975–8887 (2011) 4. Blickle, T., Thiele, L.: A comparison of selection schemes used in genetic algorithms. Gloriastrasse 35, CH-8092 Zurich: Swiss Federal Institute of Technology (ETH) Zurich, Computer Engineering and Communications Networks Lab (TIK) (1995) 5. Borodin, Y., et al.: Live and learn from mistakes: a lightweight system for document classification. Inf. Process. Manag. 49(1), 83–98 (2013) 6. Brameier, M.: On linear genetic programming. Fachbereich Informatik, Universität Dortmund (2004) 7. Cid, I., et al.: The impact of noise in spam filtering: a case study. In: Perner, P. (ed.) Advances in Data Mining. Medical Applications, E-Commerce, Marketing, and Theoretical Aspects, pp. 228–241. Springer, Heidelberg (2008) 8. Cormack, G.V., Lynam, T.: TREC 2005 spam track overview. In: The Fourteenth Text REtrieval Conference (TREC 2005) Proceedings (2005) 9. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, San Francisco (1979) 10. Graham, P.: Better Bayesian Filtering. http://www.paulgraham.com/better.html 11. Graham, P.: A Plan for Spam. http://www.paulgraham.com/spam.html 12. Gržinić, T., et al.: CROFlux—Passive DNS method for detecting fast-flux domains. In: 2014 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 1376–1380 (2014) 13. Harris, E.: The Next Step in the Spam Control War: Greylisting. http://projects.puremagic.com/greylisting/whitepaper.html 14. Holz, T., et al.: Measuring and detecting fast-flux service networks. In: 15th Network and Distributed System Security Symposium (NDSS) (2008) 15. Hunt, R., Carpinter, J.: Current and new developments in spam filtering. In: 2006 14th IEEE International Conference on Networks, pp. 1–6 (2006) 16. Gonçalves, I.: Controlling Overfitting in Genetic Programming. CISUG (2011) 17. Juknius, J., Čenys, A.: Intelligent botnet attacks in modern information warfare. In: 15th International Conference on Information and Software Technology, pp. 37–39 (2009) 18. Kolari, P., et al.: Detecting spam blogs: a machine learning approach. In: Proceedings of the National Conference on Artificial Intelligence, p. 1351. AAAI Press/MIT Press, Menlo Park/Cambridge 1999 (2006) 19. Lai, C.-C., Tsai, M.-C.: An empirical performance comparison of machine learning methods for spam e-mail categorization. In: Fourth International Conference on Hybrid Intelligent Systems, HIS 2004, pp. 44–48. IEEE (2004) 20. Lee, K., et al.: Uncovering social spammers: social honeypots + machine learning. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 435–442. ACM, New York (2010) 21. Sahami, M., et al.: A Bayesian approach to filtering junk e-mail. In: Proceedings of AAAI-98 Workshop on Learning for Text Categorization (1998) 22. Meli, C., Oplatkova, Z.K.: SPAM detection: Naïve Bayesian classification and RPN expression-based LGP approaches compared. In: Software Engineering Perspectives and Application in Intelligent Systems, pp. 399–411. Springer, Heidelberg (2016) 23. Meli, C.: Application and improvement of genetic algorithms and genetic programming towards the fight against spam and other internet malware. Submitted Ph.D. thesis, University of Malta, Malta (2017) 24. Miranda-García, A., Calle-Martín, J.: Yule's characteristic K revisited. Lang. Resour. Eval. 39(4), 287–294 (2005) 25. Ntoulas, A., et al.: Detecting spam web pages through content analysis. In: Proceedings of the 15th International Conference on World Wide Web, pp. 83–92. ACM, New York (2006) 26. Oltean, M., Grosan, C.: Evolving evolutionary algorithms using multi expression programming. In: ECAL, pp. 651–658 (2003) 27. Oltean, M., Dumitrescu, D.: Multi expression programming. Babes-Bolyai University (2002) 28. Rao, J.M., Reiley, D.H.: The economics of spam. J. Econ. Perspect. 26(3), 87–110 (2012) 29. Ruan, G., Tan, Y.: A three-layer back-propagation neural network for spam detection using artificial immune concentration. Soft. Comput. 14(2), 139–150 (2009) 30. Spinellis, D.: Reliable identification of bounded-length viruses is NP-complete. IEEE Trans. Inf. Theory 49(1), 280–284 (2003) 31. Stuart, I., et al.: A neural network classifier for junk e-mail. In: Document Analysis Systems VI, pp. 442–450. Springer, Heidelberg (2004) 32. Wang, Z.-Q., et al.: An efficient SVM-based spam filtering algorithm. In: 2006 International Conference on Machine Learning and Cybernetics, pp. 3682–3686. IEEE (2006) 33. Yule, G.U.: On sentence-length as a statistical characteristic of style in prose: with application to two cases of disputed authorship. Biometrika 30(3–4), 363–390 (1939) 34. Zhang, L., et al.: An evaluation of statistical spam filtering. Techniques 3(4), 243–269 (2004) 35. Zhang, M., Fogelberg, C.G.: Genetic programming for image recognition: an LGP approach. In: EvoWorkshops 2007, pp. 340–350. Springer, Heidelberg (2007) 36. RPN, An Introduction To Reverse Polish Notation. http://h41111.www4.hp.com/calculators/uk/en/articles/rpn.html 37. Symantec Internet Security Report (2016). https://resource.elq.symantec.com/LP=2899
utb.fulltext.sponsorship	This work was supported by the Ministry of Education, Youth and Sports of the Czech Republic within the National Sustainability Programme project No. LO1303 (MSMT-7778/2014), by the European Regional Development Fund under the project CEBIA-Tech No. CZ.1.05/2.1.00/03.0089, and by the Grant Agency of the Czech Republic (GACR P103/15/06700S). This research has in part been carried out using computational facilities procured through the European Regional Development Fund, Project ERDF-076 'Refurbishing the Signal Processing Laboratory within the Department of CCE', University of Malta.
utb.scopus.affiliation	Department of Computer Information Systems, University of Malta, Msida, Malta; Department of Informatics and Artificial Intelligence, Tomas Bata University, Zlín, Czech Republic; Department of Communications and Computer Engineering, University of Malta, Msida, Malta
utb.fulltext.projects	LO1303 (MSMT-7778/2014)
utb.fulltext.projects	CZ.1.05/2.1.00/03.0089
utb.fulltext.projects	GACR P103/15/06700S