Research On Chinese Deformation Sensitive Word Detection

Posted on:2021-05-10

Degree:Master

Type:Thesis

Country:China

Candidate:H Zhou

GTID:2428330647950937

Subject:Circuits and Systems

Abstract/Summary:

Network information technology is now mature, and anyone can express opinions on current affairs, politics, literature and art, or history online at any time and from anywhere. This openness also brings risks. Not everyone abides by the Internet management laws and regulations formulated by the state, and some users publish harmful comments containing pornography, violence, politically sensitive material, and similar content, which damages network security and undermines social stability. Harmful speech is built mainly from harmful sensitive words, so effective measures to detect and filter these words are urgently needed to purify the network environment and create a healthy online space.

At present, most sensitive-word detection methods rely on simple string matching, such as the KMP algorithm. These methods perform exact matching: they locate the pattern string within a given target string, which requires every character of the pattern to match the target. To evade such detection, however, users often disguise sensitive words with special expressions, for example by substituting characters of similar shape or similar sound, or by splitting a character into its components. These deformations make detection harder and raise the demands on the detection algorithm.

To address these problems, this paper proposes a new Chinese character matching method based on the phonographic code to detect sensitive words. Common Chinese characters are encoded with an improved phonetic code, and the similarity between characters is computed from this code. Then, building on the traditional dictionary tree (trie), fuzzy matching is applied to the target string character by character: when the similarity of two Chinese characters exceeds the fuzzy matching parameter, the comparison counts as a hit. The fuzzy matching parameter can be set and modified manually to control the detection intensity: the smaller the parameter, the greater the detection intensity, and vice versa. The method unifies several techniques that each handle a single type of deformation, and it can handle the conversion of Chinese characters into pinyin, similar-sounding characters, similar-shaped characters, split characters, and combinations of these. Handling split characters requires an additional processing step on the sensitive-word lexicon.

This paper also proposes a method to quantify the similarity of Chinese characters. Based on statistical concepts, it converts the subjective judgment of whether, and to what extent, two Chinese characters are similar into an objective standard, and the experimental results are analyzed under this standard. On commonly used Chinese character data sets, the accuracy is significantly higher than that of existing detection methods, which improves both the accuracy of harmful-speech review and the filtering capability.
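
The threshold-based fuzzy matching over a dictionary tree described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the encoding or the similarity formula, so the `TOY_CODES` table, the `similarity` measure (fraction of agreeing code positions), and the names `build_trie` and `fuzzy_find` are all assumptions made for illustration. The sketch also uses `>=` rather than a strict `>` so that a threshold of 1.0 reduces to exact matching.

```python
from typing import Dict, List, Tuple

# Hypothetical toy codes: (initial, final, tone, radical, remaining component).
# Illustrative only; this is NOT the encoding defined in the thesis.
TOY_CODES: Dict[str, Tuple[str, ...]] = {
    "赌": ("d", "u", "3", "贝", "者"),
    "堵": ("d", "u", "3", "土", "者"),
    "博": ("b", "o", "2", "十", "尃"),
    "搏": ("b", "o", "2", "扌", "尃"),
}


def similarity(a: str, b: str) -> float:
    """Toy measure: fraction of code positions on which two characters agree."""
    if a == b:
        return 1.0
    code_a, code_b = TOY_CODES.get(a), TOY_CODES.get(b)
    if code_a is None or code_b is None:
        return 0.0  # characters without a code only match themselves
    same = sum(x == y for x, y in zip(code_a, code_b))
    return same / len(code_a)


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.word: str = ""  # non-empty when a sensitive word ends at this node


def build_trie(words: List[str]) -> TrieNode:
    """Build a plain dictionary tree from the sensitive-word list."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word
    return root


def fuzzy_find(text: str, root: TrieNode, threshold: float) -> List[Tuple[int, str, str]]:
    """Return (start, matched_text, sensitive_word) for every fuzzy hit."""
    hits = []
    for start in range(len(text)):
        frontier = [root]  # trie nodes reachable after consuming text[start:pos]
        for pos in range(start, len(text)):
            next_frontier = []
            for node in frontier:
                for ch, child in node.children.items():
                    # A character "hits" when it is similar enough to the trie edge.
                    if similarity(text[pos], ch) >= threshold:
                        next_frontier.append(child)
                        if child.word:
                            hits.append((start, text[start:pos + 1], child.word))
            if not next_frontier:
                break
            frontier = next_frontier
    return hits


trie = build_trie(["赌博"])
loose = fuzzy_find("今天去堵搏了", trie, threshold=0.7)
strict = fuzzy_find("今天去堵搏了", trie, threshold=1.0)
```

With the toy codes above, the lower threshold flags the disguised form 堵搏 as a variant of 赌博, while a threshold of 1.0 only accepts exact matches, which mirrors the statement that a smaller parameter means greater detection intensity. Pinyin substitution and split characters would need extra handling that this sketch does not attempt.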

Keywords/Search Tags:

information security, sensitive word, fuzzy matching, exact matching, similarity criteria

Related items

1	Implementation Of Website Sensitive Word Detection System Based On The Improved DFA Algorithm
2	The Research For Fast Exact String Matching Algorithms
3	Design And Implementation Of Filtering System For Security Information Sensitive Words Based On Aggregate Tree Matching
4	Research On Ontology Matching Based On Word Embedding And Structural Similarity
5	Research And Implementation Of Approximate String Matching Algorithm Supporting Swapping
6	Acceleration Of Pattern Matching Algorithms On Many-Core Hardware
7	Research Of String Matching Algorithm Optimization
8	Research And Application Of The Approach To Understand Elementary Mathematics Based On Sentence Framework And Fuzzy Matching
9	Research And Application Of Chinese Sensitive Word Filtering Technology Based On Information Source
10	The Research On Database Schema Matching System