
Research On The Methods Of Sensitive Information Discovery And Desensitization In Text Documents

Posted on: 2021-03-07
Degree: Master
Type: Thesis
Country: China
Candidate: T Z Li
Full Text: PDF
GTID: 2416330614971841
Subject: Information security
Abstract/Summary:
With the proliferation of the internet, the volume of information acquired and disseminated through ever-broadening channels has grown exponentially. The diversification of information channels has transformed the model of information acquisition and dissemination from one-directional to multidirectional. In the process, the disclosure of sensitive information has become inevitable, which may have a series of adverse effects on individual privacy, private asset security, corporate information security, and even national security. However, the definition of sensitive information varies across domains and sectors, and the lack of a common benchmark across industries is a barrier to precise recognition of sensitive information.

This paper takes judicial information disclosure as its application scenario and examines the detection and desensitization of sensitive information. In judicial practice, the detection and desensitization of sensitive information in published adjudication reports rely heavily on manual identification, which is both time-consuming and labor-intensive, especially when the data sets are large. Deploying computer models is one possible way to overcome the limitations of manual detection. However, owing to the complexity of sensitive data and the difficulty of identifying dependencies between entities, computer-aided detection is not without drawbacks. To address these problems, this paper leverages the feature extraction capacity of neural networks, combined with contextual information, to identify sensitive information and discover entity relations, and then designs desensitization strategies based on entity sensitivity. This project received substantial support from "Research on the Collaborative Technical Support of Integrated Trial, Enforcement and Litigation Services (2018YFC083130)", sponsored by the National Key Research & Development Program.

This paper proposes two models to facilitate data detection and desensitization.

(1) An LSTM-based personal data identification model is proposed to address the complexity of the data and the difficulty of discovering entity dependencies. Using Lattice-style input, the approach extracts contextual features by adding relative position information to each entity and combining the semantic information of words and phrases. The model infers hypotactic relations among clauses from its understanding of the semantics. The experimental results show that the method is effective in extracting personal attribute data.

(2) A BERT-based personal data identification model is proposed to overcome the limitations of the LSTM network. A BERT model is pre-trained on extensive external background knowledge and can therefore construct a more accurate semantic representation. After fine-tuning with task-specific modifications, the pre-trained model identifies sensitive data and entity dependencies by integrating external knowledge with contextual information. Experiments show that the accuracy of this method is 2% higher than that of the LSTM-based model.

Subsequently, corresponding desensitization strategies are designed based on data attributes and the correlations between data items. Overall, this paper proposes a novel approach that integrates multiple models and methods to detect and desensitize data. Experiments on adjudication report data sets are performed to verify the validity of the method. The experimental results demonstrate the efficiency and feasibility of the approach, which fares well against manual detection of sensitive data.
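The abstract does not spell out the desensitization strategies, so the following is only a minimal sketch of what a sensitivity-based strategy could look like once entities have been recognized. The entity labels (`NAME`, `ID_NUMBER`, `ADDRESS`), the masking rules, and the `Entity` and `desensitize` names are all hypothetical and are not taken from the thesis. Entity spans are assumed to be non-overlapping character offsets produced by an upstream recognizer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Entity:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # hypothetical entity type, e.g. "NAME"


def mask_all(s: str) -> str:
    """Replace every character with '*'."""
    return "*" * len(s)


def keep_first(s: str) -> str:
    """Keep the first character and mask the rest."""
    return s[:1] + "*" * (len(s) - 1)


def keep_last4(s: str) -> str:
    """Mask everything except the last four characters."""
    return "*" * max(len(s) - 4, 0) + s[-4:]


# Hypothetical policy: each entity type maps to one masking strategy.
POLICY: Dict[str, Callable[[str], str]] = {
    "NAME": keep_first,
    "ID_NUMBER": keep_last4,
    "ADDRESS": mask_all,
}


def desensitize(text: str, entities: List[Entity]) -> str:
    """Apply the policy to each recognized span; unknown labels are left as-is."""
    out = text
    # Process spans right to left so earlier offsets stay valid
    # even if a strategy were to change the span length.
    for e in sorted(entities, key=lambda e: e.start, reverse=True):
        strategy = POLICY.get(e.label)
        if strategy is None:
            continue
        out = out[:e.start] + strategy(out[e.start:e.end]) + out[e.end:]
    return out
```

Keeping the policy as a plain mapping from label to function means a stricter or looser rule for a given entity type can be swapped in without touching the span-handling logic.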
Keywords/Search Tags: Sensitive Data, Entity Recognition, Relation Extraction, Deep Learning, Data Desensitization