
Research On Bad Short Text Recognition Based On Machine Learning

Posted on: 2019-01-02
Degree: Master
Type: Thesis
Country: China
Candidate: W Han
Full Text: PDF
GTID: 2438330548958380
Subject: Software engineering
Abstract/Summary:
In recent years, the rapid development of the mobile Internet and of big data technology has driven the growth of social networking and produced a large volume of text data. With the rise of platforms such as microblogging and live streaming, much of this text takes the form of short texts such as bullet-screen ("barrage") comments and ordinary comments. While these texts have enriched communication, the bad texts mixed in with them have harmed the health of the Internet. Bad short texts are mainly reactionary texts, insulting or indecent texts, and advertising texts. They seriously hamper people's access to useful information, and pornographic and violent content in particular has had a strong negative impact on young people. How to filter bad text effectively and purify the social network environment has therefore become an important topic in the age of social networking.

Existing filtering methods fall into two types: rule-based bad text filtering systems, and machine-learning-based bad text filtering systems, i.e., text classification systems. However, short texts on the Internet contain variant words, are very short and colloquial, have sparse features, and come with unbalanced samples, so ordinary filtering methods do not work well on them. This paper improves the filtering of bad short texts in several respects: reducing text noise, reducing the sparseness of text features, and adding text semantic features. The main research work of this paper is as follows:

(1) An improved text preprocessing method. Ordinary text preprocessing methods cannot effectively remove the noise in bad short texts. By analyzing a large amount of bad short text content, we improved the preprocessing method and preprocess the text in several ways, including text denoising, hash information normalization, and stop-word removal.

(2) Extraction of bad short text features from multiple angles. Short texts are colloquial, and variant words and typos are widespread in short texts on the Internet, which degrades the output of ordinary Chinese word segmentation methods. In this study, we therefore use features extracted with a 2-gram model of the short text as the basic text features. In addition, we added overall text features (text style features) computed over the whole short text. Finally, we found that these features lose the semantic information of the text, so we added text semantic features based on word2vec.

(3) Feature weight analysis and feature fusion. We extracted bigram features, text style features, and text semantic features for short texts, assigned weights to the different feature types, and then used feature fusion to represent each text.

(4) Recognition experiments using the extracted short text features. We crawled and labeled datasets containing bad texts from the Internet and ran bad short text recognition experiments with different classifiers. The experimental results show that the feature extraction method proposed in this research, combined with an SVM classifier, identified bad text best among the methods tested.
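
As an illustration of items (2) and (3), the following is a minimal Python sketch of character 2-gram features and weighted feature fusion. It is not the thesis's implementation. All function names, the three style features, the sample texts, and the block weights (1.0 and 0.5) are hypothetical. Weighting the 2-gram counts by TF-IDF is an assumption based only on the keyword list. The word2vec semantic features are left out because they require a trained embedding model.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Return overlapping character 2-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def tfidf_vectors(docs):
    """Build smoothed TF-IDF vectors over character 2-grams.

    Returns (vocab, vectors): vocab is a sorted list of 2-grams and each
    vector is a list of floats aligned with vocab.
    """
    tokenized = [char_bigrams(d) for d in docs]
    vocab = sorted({g for toks in tokenized for g in toks})
    n_docs = len(docs)
    doc_freq = Counter(g for toks in tokenized for g in set(toks))
    # Smoothed IDF, so a 2-gram found in every text keeps a non-zero weight.
    idf = {g: math.log((1 + n_docs) / (1 + doc_freq[g])) + 1.0 for g in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1
        vectors.append([counts[g] / total * idf[g] for g in vocab])
    return vocab, vectors


def style_features(text):
    """Hypothetical whole-text features: length, digit ratio, symbol ratio."""
    n = len(text) or 1
    digits = sum(c.isdigit() for c in text)
    symbols = sum(not c.isalnum() and not c.isspace() for c in text)
    return [float(len(text)), digits / n, symbols / n]


def fuse(blocks, weights):
    """Weighted concatenation: scale each feature block, then join them."""
    if len(blocks) != len(weights):
        raise ValueError("need one weight per feature block")
    fused = []
    for block, w in zip(blocks, weights):
        fused.extend(w * x for x in block)
    return fused


# Made-up sample texts: one advertising-like, one ordinary comment.
docs = ["加微信 8888 领红包", "今天的直播很好看"]
vocab, bigram_vecs = tfidf_vectors(docs)
fused = [
    fuse([vec, style_features(doc)], weights=[1.0, 0.5])
    for vec, doc in zip(bigram_vecs, docs)
]
```

Each entry of `fused` is one fixed-length vector per text. Vectors of this form could then be passed to a classifier such as an SVM, as in the experiments of item (4). Working at the character level avoids depending on word segmentation, which the abstract notes is unreliable for variant words and typos.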
Keywords/Search Tags:machine learning, feature extraction, text style features, 2-gram, TF-IDF