
Research On Internet Spam Identification Method

Posted on: 2020-07-20
Degree: Master
Type: Thesis
Country: China
Candidate: M Yu
Full Text: PDF
GTID: 2438330602956617
Subject: Applied statistics
Abstract/Summary:
The Internet has changed how information spreads and has brought us, almost without our noticing, into a new era of mass media. Anyone online can publish information at any time, which has led to a flood of information on the network, spam included. In recent years, the rapid development of deep learning has greatly changed the field of natural language processing. This paper treats the question titles on the Quora website as text data and aims to identify spam, which can also be called insincere questions. Because of the nature of the data, many question texts are only subtly spam-like, so traditional machine learning methods based on word frequency may not perform well. This places new demands on the models and their performance.

In this paper, both machine learning and deep learning methods are used for spam detection, and their performance is compared on the Quora dataset. The two traditional machine learning methods applied are the naive Bayes model and the logistic regression model. First, we use the TF-IDF method to convert the text data into numeric data as the model input, and we improve classification performance by tuning the model hyperparameters. However, neither individual model performs well. Therefore, this paper takes the outputs of the two classifiers and uses ridge regression to build a stacking ensemble model, adjusting the regularization coefficient to improve performance while avoiding overfitting. The traditional ensemble model achieved an F1 score of 0.60436.

For the deep learning methods, we first use word embeddings to convert text input into numeric input, so that each document is represented as a matrix for the model. We used three kinds of pre-trained word embeddings and compared their effects. We then use convolutional neural networks, recurrent neural networks, and capsule neural networks as classifiers. The convolutional neural network passes the input through four groups of convolution and pooling layers, then predicts the final class through a flattening layer and a fully connected layer. The recurrent neural network consists of a word embedding layer, two bidirectional LSTM layers, and a fully connected layer that predicts the final class. The capsule neural network consists of a word embedding layer, a spatial dropout layer, a bidirectional GRU layer, and a capsule layer, with the final class predicted by a fully connected layer.

The experimental results show that all three deep learning methods are much better than the traditional machine learning methods. The best is the capsule neural network, with a test-set F1 score of 0.69782. However, the deep learning models also have shortcomings: because of their large number of parameters, training takes a long time. How to improve training efficiency without losing model accuracy will be the focus of future research.
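As a rough illustration of two ingredients named in the abstract, the sketch below computes TF-IDF weights and an F1 score in plain Python. It is not the thesis's implementation: the weighting variant (relative term frequency times log(N / df)), the whitespace tokenizer, the function names `tfidf_vectors` and `f1_score`, and the toy question strings are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df).

    Assumed variant: tf is the term count divided by document length,
    and idf is the unsmoothed natural log of N / df.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: define F1 as 0 to avoid dividing by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical toy questions, not taken from the Quora dataset.
    questions = ["why is the sky blue", "why is the ocean salty"]
    for vec in tfidf_vectors(questions):
        print(vec)
    print(f1_score([1, 0, 1, 1], [1, 0, 0, 1]))
```

Terms that occur in every document receive weight zero under this variant, which is one reason purely frequency-based features can miss subtly insincere questions whose wording is otherwise ordinary.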
Keywords/Search Tags: Natural Language Processing, Text Classification, Machine Learning, Deep Learning