Research On Spam Classification Based On Machine Learning

Posted on:2021-03-01

Degree:Master

Type:Thesis

Country:China

Candidate:J Wang

Full Text:PDF

GTID:2437330605463099

Subject:Applied Statistics

Abstract/Summary:

Spam began to proliferate in the late 20th century, owing to its low cost, ease of transmission, and strong power of inducement. Some commercial organizations exploit it as a means of seeking profit and disseminate spam indiscriminately. The wide spread of spam has brought considerable inconvenience to people's work and life: the spam in our inboxes takes up storage space, and users must spend a great deal of time and energy dealing with it. Spam takes many forms and keeps evolving as the Internet develops, so anti-spam work faces great challenges. It is therefore of great practical significance to keep updating the methods of spam classification and filtering in order to improve the current state of email use.

This paper studies two aspects using data mining tools and machine learning. All of the analysis is implemented in the R programming language. First, the text content of the whole mail data set is analyzed separately for the two classes of mail: the high-frequency words of each class are extracted, the corresponding word cloud is drawn for each class, and the semantics and parts of speech of the high-frequency words are analyzed to draw the relevant conclusions. Second, the naive Bayes, support vector machine, and k-nearest neighbor methods are used to model and analyze 7000 pieces of mail data. The evaluation index selected in this paper is accuracy. Comparing the classifiers built with the three algorithms shows that the naive Bayes classification model with a Laplace parameter of 2.5 gives the best classification performance in this paper, with an accuracy of up to 97.1%.

The innovations of this paper lie mainly in three aspects. First, the text content is analyzed to assist in building a model that judges the nature of a mail. Second, multiple classifiers are built with various methods, and the optimal model is selected from among them. Third, in the k-nearest neighbor method, the best value of K is selected in two ways: by tenfold cross validation and by comparing the accuracy of the resulting models.
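To make the role of the Laplace parameter concrete, the following is a minimal sketch of a multinomial naive Bayes classifier with additive (Laplace) smoothing. It is not the thesis's implementation: the thesis reports using R, while this sketch is in Python, and the function names `train_naive_bayes` and `predict`, the `alpha` argument, and the four toy documents are all hypothetical and chosen only for illustration. The value 2.5 is reused from the abstract, under the assumption that the Laplace parameter there plays the same role as `alpha` here.

```python
import math
from collections import Counter


def train_naive_bayes(docs, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model with additive smoothing.

    docs   : list of tokenized documents (lists of words)
    labels : list of class labels, one per document
    alpha  : smoothing constant added to every word count (assumed to
             correspond to the "Laplace parameter" in the abstract)
    """
    classes = sorted(set(labels))
    vocab = {word for doc in docs for word in doc}
    doc_counts = Counter(labels)
    word_counts = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    model = {"classes": classes, "vocab": vocab, "log_prior": {}, "log_like": {}}
    for c in classes:
        # Class prior: fraction of training documents in class c.
        model["log_prior"][c] = math.log(doc_counts[c] / len(docs))
        # Smoothed likelihood: (count + alpha) / (total + alpha * |vocab|).
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        model["log_like"][c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return model


def predict(model, doc):
    """Return the class with the highest posterior log-score for doc."""
    scores = {}
    for c in model["classes"]:
        score = model["log_prior"][c]
        for word in doc:
            # Words never seen in training are skipped.
            if word in model["vocab"]:
                score += model["log_like"][c][word]
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical toy data, not taken from the thesis's 7000-mail data set.
toy_docs = [
    ["win", "prize", "now"],
    ["cheap", "prize", "offer"],
    ["meeting", "schedule", "today"],
    ["project", "meeting", "notes"],
]
toy_labels = ["spam", "spam", "ham", "ham"]

model = train_naive_bayes(toy_docs, toy_labels, alpha=2.5)
print(predict(model, ["prize", "offer"]))
print(predict(model, ["meeting", "notes"]))
```

A larger `alpha` pulls every word's likelihood toward a uniform distribution over the vocabulary, which is why the smoothing value is typically tuned against a held-out accuracy measure, as the abstract describes.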

Keywords/Search Tags:

Machine learning, Junk emails, Chinese text classification, Realization by R program

Related items

1	Realization Of Text Classification And Recognition Based On NLP Method
2	Research On Network Public Opinion Classification Algorithm Based On Machine Learning
3	Text Analysis Of Ice And Snow Tourism Review Based On Machine Learning
4	Several Classification Algorithms And Their Applications In Statistical Learning
5	Application Of Machine Learning Classification Algorithm In Chinese Databases
6	Chinese Text Classification Based On Statistical Method
7	Research Of Classification Of The Relative Motion Problem In Elementary Mathematics Based On SVM
8	A Text Classification Based On The Recurrent Neural Networks
9	Analysis And Research Based On Multivariate Statistics And Machine Learning
10	Research Of SVM Kernel Functions In Text Classification