Research On Spam Classification Based On Machine Learning

Posted on: 2021-03-01
Degree: Master
Type: Thesis
Country: China
Candidate: J Wang
GTID: 2437330605463099
Subject: Applied Statistics
Abstract/Summary:
Spam began to proliferate in the late 20th century, owing to its low cost, ease of transmission, and strong enticement. Some commercial organizations exploit it as a means of seeking profit and disseminate spam indiscriminately. The wide spread of spam has brought considerable inconvenience to people's work and life: the spam in our inboxes not only takes up storage space, but also costs users a great deal of time and energy to deal with. Spam takes many forms and keeps evolving as the Internet develops, so anti-spam work faces great challenges. It is therefore of great practical significance to keep updating the methods of spam classification and filtering in order to improve the current state of email use. This thesis studies the problem from two aspects, using data mining tools and machine learning; all analyses are implemented in the R programming language. First, the text content of the whole mail data set is analyzed from the two angles of normal mail and junk mail: the high-frequency words in each of the two kinds of text are extracted, the corresponding word clouds are drawn, and the semantics and parts of speech of the high-frequency words are analyzed to reach relevant conclusions. Second, the naive Bayes method, the support vector machine, and the k-nearest neighbor method are used to model and analyze 7,000 emails. The evaluation metric selected is accuracy. Comparing the classifiers built with the three algorithms shows that the naive Bayes classification model with a Laplace parameter of 2.5 gives the best classification performance in this study, with an accuracy of up to 97.1%. The innovations of this thesis lie mainly in three aspects. First, the text content is analyzed to assist in building a model that judges the nature of an email. Second, multiple classifiers are built with various methods and the optimal model is selected from among them. Third, in the k-nearest neighbor method, the best K value is selected by combining two approaches: ten-fold cross-validation and a comparison of model accuracy.
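
The abstract reports that a naive Bayes classifier with a Laplace parameter of 2.5 performed best. The role of that parameter can be illustrated with a minimal sketch. This is not the thesis's R code: it is a plain-Python multinomial naive Bayes written for illustration, the function names (`train_naive_bayes`, `predict`, `accuracy`) are hypothetical, the toy corpus is made up, and the exact way the thesis applies the Laplace parameter in R is an assumption here.

```python
import math
from collections import Counter


def train_naive_bayes(docs, labels, laplace=2.5):
    """Fit a multinomial naive Bayes model with additive (Laplace) smoothing.

    docs    : list of token lists (words already segmented)
    labels  : list of class labels, e.g. "spam" / "ham"
    laplace : constant added to every word count; 2.5 mirrors the value
              reported in the abstract (its exact use in the thesis is assumed)
    """
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)
    word_counts = {c: Counter() for c in class_docs}
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)

    model = {"vocab": vocab, "log_prior": {}, "log_lik": {}}
    for c, n_docs in class_docs.items():
        model["log_prior"][c] = math.log(n_docs / len(docs))
        # Smoothed denominator: observed tokens plus `laplace` per vocabulary word.
        total = sum(word_counts[c].values()) + laplace * len(vocab)
        model["log_lik"][c] = {
            w: math.log((word_counts[c][w] + laplace) / total) for w in vocab
        }
    return model


def predict(model, doc):
    """Return the class with the highest posterior log-probability."""
    scores = {}
    for c, log_prior in model["log_prior"].items():
        # Words never seen during training are skipped.
        scores[c] = log_prior + sum(
            model["log_lik"][c][w] for w in doc if w in model["vocab"]
        )
    return max(scores, key=scores.get)


def accuracy(model, docs, labels):
    """Fraction of documents whose predicted label matches the true label."""
    hits = sum(predict(model, d) == y for d, y in zip(docs, labels))
    return hits / len(docs)


# Toy, made-up corpus purely for illustration (not the thesis data).
train_docs = [
    ["free", "prize", "click"],
    ["cheap", "loan", "click"],
    ["meeting", "agenda", "monday"],
    ["project", "report", "monday"],
]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_naive_bayes(train_docs, train_labels, laplace=2.5)
print(predict(model, ["free", "loan"]))
print(predict(model, ["project", "meeting"]))
print(accuracy(model, train_docs, train_labels))
```

A larger `laplace` value pulls the per-class word probabilities toward a uniform distribution, which keeps a word that is rare or absent in one class from dominating the decision. Accuracy, the metric named in the abstract, is computed here as the share of correctly labeled emails.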
Keywords/Search Tags: Machine learning, Junk email, Chinese text classification, Implementation in R