| Network information technology is now mature, and anyone can express opinions on current affairs, politics, literature and art, or history, anytime and anywhere, through the Internet. This openness carries hidden dangers: not everyone abides by the Internet management laws and regulations formulated by the state, and some people publish harmful comments online, including pornographic, violent, and politically sensitive content, which seriously damages network security and undermines social stability. Harmful speech is composed mainly of harmful sensitive words. To purify the network environment, effective measures are urgently needed to detect and filter these sensitive words and create a healthy cyberspace. At present, most sensitive-word detection methods rely on simple string matching, such as the KMP algorithm. These methods perform exact string matching: they locate the pattern string within a given target string, which requires every character of the pattern to match the target. To evade such detection, however, users often disguise sensitive words with special transformations, such as visually similar characters, similar-sounding characters, and split characters, which makes detection harder and places higher demands on the detection algorithm. To address these problems, this paper proposes a new Chinese character matching method based on the Phonographic code to detect sensitive words. In this method, common Chinese characters are encoded with an improved phonetic code, and the similarity between Chinese characters is calculated from this code. Then, building on the traditional dictionary tree (trie), fuzzy matching is used to match the target string character by character: when the similarity of two Chinese characters is greater than the fuzzy matching parameter, the pair is considered a hit. The fuzzy matching parameter can be set and modified manually to control the detection intensity: the smaller the parameter, the greater the detection intensity, and vice versa. This method unifies several techniques that each handle a single type of deformed sensitive word, and it can handle the conversion of Chinese characters into pinyin, similar-sounding characters, visually similar characters, split characters, and combinations of these. Handling split characters requires an additional processing step on the sensitive-word lexicon. This paper also proposes a method to quantify the similarity of Chinese characters. Drawing on statistics, this method turns the subjective judgment of whether, and to what extent, two Chinese characters are similar into an objective standard, and the experimental results are analyzed under this standard. On commonly used Chinese character data sets, the accuracy is significantly higher than that of existing detection methods, which effectively improves the accuracy of harmful-speech review and strengthens filtering. |
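
The fuzzy trie matching described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the paper's actual encoding or algorithm: the five-field codes in `TOY_CODES`, the `similarity` measure (fraction of equal code fields), and the names `build_trie` and `fuzzy_find` are all hypothetical. The sketch also treats a similarity equal to the threshold as a hit, so that a threshold of 1.0 reduces to exact matching.

```python
from typing import Dict, List, Tuple

# Illustrative codes only (NOT the paper's encoding):
# (pinyin initial, pinyin final, tone, structure, radical).
TOY_CODES: Dict[str, Tuple[str, ...]] = {
    "赌": ("d", "u", "3", "lr", "贝"),
    "堵": ("d", "u", "3", "lr", "土"),
    "博": ("b", "o", "2", "lr", "十"),
    "搏": ("b", "o", "2", "lr", "扌"),
}

END = "$end"  # multi-character key, so it cannot collide with a single-character child


def similarity(a: str, b: str) -> float:
    """Assumed measure: fraction of code fields on which two characters agree."""
    if a == b:
        return 1.0
    code_a, code_b = TOY_CODES.get(a), TOY_CODES.get(b)
    if code_a is None or code_b is None:
        return 0.0  # characters without a code only match themselves
    same = sum(x == y for x, y in zip(code_a, code_b))
    return same / len(code_a)


def build_trie(words: List[str]) -> dict:
    """Build a plain dictionary tree; terminal nodes store the full word under END."""
    root: dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def fuzzy_find(text: str, trie: dict, threshold: float) -> List[Tuple[int, int, str]]:
    """Return (start, end, word) for every lexicon word fuzzily matched in text."""
    hits: List[Tuple[int, int, str]] = []

    def walk(node: dict, pos: int, start: int) -> None:
        if END in node:
            hits.append((start, pos, node[END]))
        if pos == len(text):
            return
        # Follow every child whose character is similar enough to text[pos].
        for ch, child in node.items():
            if ch != END and similarity(text[pos], ch) >= threshold:
                walk(child, pos + 1, start)

    for start in range(len(text)):
        walk(trie, start, start)
    return hits


if __name__ == "__main__":
    trie = build_trie(["赌博"])
    print(fuzzy_find("去堵搏吗", trie, 0.75))  # lenient threshold
    print(fuzzy_find("去堵搏吗", trie, 1.0))   # exact matching only
```

Under these toy codes, lowering the threshold lets the disguised spelling 堵搏 match the lexicon entry 赌博, while a threshold of 1.0 accepts only the exact word, which mirrors the statement that a smaller parameter means stronger detection.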