| A spam message is an unsolicited text message containing commercial advertising or content that does not comply with the law, sent to users without their consent. With the popularity of mobile phones, spam messages have become increasingly rampant, seriously affecting daily life and even social stability. China Mobile blocked more than 200 million spam messages in 2017, and the number continues to grow. Today, each person receives an average of 9 spam messages per month. The big data era has allowed a large amount of personal information to accumulate, and this data needs to be managed properly. Given such a huge volume of SMS data, extracting meaningful information to protect people from spam harassment, and thereby ensure a better user experience, has become an urgent problem. With the rapid development of deep learning and natural language processing, the ability of deep learning models to extract information has been further confirmed. This paper conducts in-depth research on deep learning methods for spam message classification. The research contents and results are as follows. First, during preprocessing, the spam message data was found to be very noisy, and the jieba word segmenter could not recognize new words. To address this, the data is cleaned in a pipeline that includes traditional Chinese character conversion, replacement of numbers and special symbols, and typo correction. For unrecognized new words, an improved new-word recognition tool is introduced, and the detected words are imported into jieba's custom dictionary. Second, for spam message identification, the RCM spam message recognition model, which combines Bi-LSTM and TextCNN, is proposed. The model addresses the problem of polysemous words sharing the same representation, and a histogram method is used to further extract nonlinear features from the sentence vectors. The resulting features are merged with the sensitive features extracted by TextCNN, which improves spam recognition accuracy to 96.81%. Finally, building on the original binary classification algorithm of the spam identification system, a third class, "no processing", is introduced to reduce the probability that non-spam messages are predicted as spam. Two threshold selection methods, a fixed threshold and a difference threshold, are proposed for obtaining a reasonable threshold for the "no processing" class. This increases accuracy by 1.013 percentage points, to 97.823%. |
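
The "no processing" class described above can be illustrated with a minimal Python sketch. The abstract does not define the two threshold rules, so the interpretation here is an assumption: the fixed-threshold rule accepts the top class only when its score reaches a cutoff, and the difference-threshold rule accepts it only when the gap between the two class scores is large enough. The function names, the label strings `"spam"` and `"ham"`, and the default values `0.9` and `0.5` are hypothetical and are not taken from the paper.

```python
def top_class(p_spam: float, p_ham: float) -> str:
    """Return the label with the higher score (ties go to 'spam')."""
    return "spam" if p_spam >= p_ham else "ham"


def classify_fixed(p_spam: float, p_ham: float, threshold: float = 0.9) -> str:
    """Assumed fixed-threshold rule: keep the top class only if its
    score reaches `threshold`; otherwise abstain."""
    if max(p_spam, p_ham) >= threshold:
        return top_class(p_spam, p_ham)
    return "no processing"


def classify_difference(p_spam: float, p_ham: float, min_gap: float = 0.5) -> str:
    """Assumed difference-threshold rule: keep the top class only if the
    two scores differ by at least `min_gap`; otherwise abstain."""
    if abs(p_spam - p_ham) >= min_gap:
        return top_class(p_spam, p_ham)
    return "no processing"


if __name__ == "__main__":
    # Hypothetical model scores (p_spam, p_ham) for three messages.
    for scores in [(0.97, 0.03), (0.60, 0.40), (0.20, 0.80)]:
        print(scores, classify_fixed(*scores), classify_difference(*scores))
```

In a sketch like this, the cutoff would typically be tuned on held-out data, trading the share of messages left unprocessed against the rate of non-spam messages flagged as spam. How the paper selects its thresholds is not stated in the abstract.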