| In the era of new media, highly active social media platforms have attracted large numbers of netizens with shared interests and hobbies. “Like-minded people” gather on social networks to contact one another and exchange information, so that everyone becomes both a disseminator and a receiver of information and can take part in producing media content anytime and anywhere. As a result, the amount of information on the network is growing exponentially, but this rapidly growing body of information is mixed with a great deal of bad information containing pornographic, violent, reactionary, and other sensitive content. Such information not only causes serious harm and heavy losses to people's physical and mental health and property, but also endangers national stability and undermines social stability and national unity. Seeking profit, some criminals deform the sensitive words in a text so that bad text containing them escapes detection by regulatory platforms. At present, bad text is mainly detected with methods based on text classification and keyword-table matching, but these methods rely on accurate and effective keyword feature sets and cannot accurately identify non-standard sensitive information. Therefore, to maintain a harmonious and friendly network environment, efficiently and accurately detecting disguised bad text within massive volumes of text is an urgent problem. To address it, this thesis studies the forms in which sensitive words in bad text are deformed and proposes a bad text classification algorithm based on restoration of key information (KIR-BTCA), which improves the recognition of sensitive-word variants by restoring the key-information variants in the text. Nevertheless, this method still has some limitations. Therefore, to increase the diversity of the algorithms, this
thesis proposes an ensemble bad text classification algorithm based on ensemble learning. The algorithm integrates three bad text classification algorithms, based respectively on restoration of key information, improved KNN, and a sensitive-word decision tree. Following the bagging idea of ensemble learning, the majority voting method and the one-vote veto method are each applied, yielding an ensemble classification algorithm based on majority voting (MV-ECA) and an ensemble classification algorithm based on one-vote veto (OVV-ECA). To test the validity of the proposed algorithms, the collected text datasets are manually labeled, and grouped experiments are carried out according to the actual situation. In the experiments, the results of each method are compared and their pros and cons are analyzed. The results show that the two ensemble classification algorithms are equivalent to KIR-BTCA in accuracy, and significantly better than the three single classification algorithms in accuracy and recall. In terms of overall performance, MV-ECA outperforms OVV-ECA. In summary, research on bad text classification algorithms helps to purify the network environment and improve Internet management software, which can reduce the harm caused by bad information. |
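
The two combination rules named in the abstract can be illustrated with a minimal sketch. It assumes each of the three base classifiers returns a boolean "bad text" vote, and it reads "one-vote veto" as flagging a text whenever any single classifier flags it; the abstract does not spell out either detail, and the function names below are hypothetical.

```python
from typing import Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """Flag the text as bad when more than half of the base classifiers do."""
    return sum(votes) * 2 > len(votes)


def one_vote_veto(votes: Sequence[bool]) -> bool:
    """Flag the text as bad when any single base classifier does.

    This is an assumed reading of "one-vote veto", not a rule confirmed
    by the abstract.
    """
    return any(votes)


# Hypothetical votes for one text, in the order: key-information
# restoration, improved KNN, sensitive-word decision tree.
votes = (True, False, False)
flagged_by_majority = majority_vote(votes)
flagged_by_veto = one_vote_veto(votes)
```

Under this reading, one-vote veto flags every text that majority voting flags and possibly more, so it would be expected to favor recall at some cost in precision.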