Research On The Methods Of Sensitive Information Discovery And Desensitization In Text Documents

Posted on: 2021-03-07

Degree: Master

Type: Thesis

Country: China

Candidate: T Z Li

GTID: 2416330614971841

Subject: Information security

Abstract/Summary:

With the proliferation of the internet, the volume of information acquired and disseminated through ever broader channels has grown exponentially. The diversification of information channels has changed the model of information acquisition and dissemination from one-directional to multidirectional. In this process, the disclosure of sensitive information has become inevitable, which may adversely affect individual privacy, private asset security, corporate information security, and even national security. However, the definition of sensitive information varies across domains and sectors, and the lack of benchmarking across industries hinders precise recognition of sensitive information.

This paper takes judicial information disclosure as its application scenario and examines the detection and desensitization of sensitive information. In judicial practice, detecting and desensitizing sensitive information in published adjudication reports relies heavily on manual identification, which is time-consuming and labor-intensive, especially with large data sets. Deploying computer models is one possible way to overcome the limitations of manual detection. However, owing to the complexity of sensitive data and the difficulty of identifying entity dependency, computer-aided detection has drawbacks of its own. To address these problems, this paper uses the feature extraction capacity of neural networks, combined with contextual information, to identify sensitive information and discover entity relations, and it designs desensitization strategies based on entity sensitivity. This project received substantial support from the "Research on the Collaborative Technical Support of Integrated Trial, Enforcement and Litigation Services (2018YFC083130)" sponsored by the National Key Research & Development Program.

This paper proposes two models to facilitate data detection and desensitization.

(1) An LSTM-based personal data identification model is proposed to address the problem of data complexity and the difficulty of discovering entity dependency. Based on the Lattice input method, this approach extracts contextual features from the input by adding relative position information for the entity and combining the semantic information of words and phrases. The model deduces hypotactic relations among clauses based on an understanding of the semantics. The experimental results show that the method is effective in extracting personal attribute data.

(2) A BERT-based personal data identification model is proposed to overcome the limitations of the LSTM network. A BERT model is pre-trained with extensive external background knowledge and is thus able to construct a more accurate semantic representation. After being fine-tuned with task-specific modifications, the pre-trained model identifies sensitive data and entity dependency by integrating external knowledge with contextual information. The experiments show that the accuracy of this method is 2% higher than that of the LSTM-based model.

Subsequently, corresponding desensitization strategies are designed based on data attributes and the correlation between data sets. This paper proposes a novel approach that integrates multiple models and methods to detect and desensitize data. Experiments on adjudication report data sets are performed to verify the validity of this method. The experimental results demonstrate the efficiency and feasibility of the approach, which compares favorably with manual detection of sensitive data.
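
The abstract does not spell out the desensitization strategies themselves, so the following is only a minimal sketch of what a sensitivity-based strategy could look like once a recognition model has produced entity spans. The entity labels (`NAME`, `ID_NUMBER`, `COURT`), the three sensitivity levels, and the masking rules are all assumptions made for illustration and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    start: int  # inclusive character offset into the text
    end: int    # exclusive character offset into the text
    label: str  # hypothetical entity type, e.g. "NAME"


# Hypothetical mapping from entity type to sensitivity level (an assumption,
# not the thesis's actual scheme).
SENSITIVITY = {"NAME": "medium", "ID_NUMBER": "high", "COURT": "low"}


def mask(value: str, level: str) -> str:
    """Mask a value according to its sensitivity level."""
    if level == "high":
        # Hide every character.
        return "*" * len(value)
    if level == "medium":
        # Keep the first character, hide the rest.
        return value[:1] + "*" * (len(value) - 1)
    # Low sensitivity: leave the value unchanged.
    return value


def desensitize(text: str, entities: list[Entity]) -> str:
    """Replace each recognized entity span with its masked form."""
    pieces = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e.start):
        if ent.start < cursor:
            # Skip spans that overlap one already processed.
            continue
        pieces.append(text[cursor:ent.start])
        # Unknown labels are treated as highly sensitive (fail-safe default).
        level = SENSITIVITY.get(ent.label, "high")
        pieces.append(mask(text[ent.start:ent.end], level))
        cursor = ent.end
    pieces.append(text[cursor:])
    return "".join(pieces)


def find_entity(text: str, value: str, label: str) -> Entity:
    """Helper for the demo: locate a literal value and wrap it as an Entity."""
    start = text.index(value)
    return Entity(start, start + len(value), label)


if __name__ == "__main__":
    sample = "Defendant Li Ming, ID 110101199001011234, appeared before Haidian Court."
    spans = [
        find_entity(sample, "Li Ming", "NAME"),
        find_entity(sample, "110101199001011234", "ID_NUMBER"),
        find_entity(sample, "Haidian Court", "COURT"),
    ]
    print(desensitize(sample, spans))
```

In a real pipeline the spans would come from the recognition model rather than from `find_entity`, and the level assigned to an entity might also depend on its relations to other entities, as the abstract suggests.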

Keywords/Search Tags:

Sensitive Data, Entity Recognition, Relation Extraction, Deep Learning, Data Desensitization

Related items

1	Research And Implementation Of Text Automatic Summarization Based On Deep Learning
2	Research On Entity And Relation Extraction Algorithm For Judgment Documents
3	Research On Entity Recognition And Entity Relation Extraction Of Internet Fraud Cases Based On Semi-supervised Learning Analysis
4	Research On Construction Of Knowledge Graph Of Judicial Case Texts Based On Deep Learning
5	Research And Implementation Of Security Classification And Privacy Data Recognition Algorithm On Government Data
6	Research And Application Of Knowledge Extraction For Government Affairs
7	Research On Military Event Extraction Based On Deep Learning
8	Research On Entity Relation Extraction In The Field Of Party Building
9	Research On Entity Identification And Relationship Extraction For Legal Documents
10	Research Of Entity Relationship Extraction Of Legal Text Based On Deep Learning