
Research On Extracting Threat Intelligence Information Based On Pre-trained Language Models

Posted on: 2024-06-17
Degree: Master
Type: Thesis
Country: China
Candidate: W Guan
Full Text: PDF
GTID: 2568307157951919
Subject: Computer technology
Abstract/Summary:
Automatically analyzing and extracting standardized security entities and their relations from massive volumes of unstructured threat intelligence is crucial for network security situational awareness. However, threat intelligence comes from heterogeneous sources with inconsistent naming conventions for security entities, and the texts exhibit long entity spans, nested entities, and overlapping relations. BERT, pre-trained on general-domain corpora, is not directly suitable for feature extraction in the network security domain: its large parameter count incurs high computational cost, and its semantic representations, learned from general-domain text, transfer poorly to security-specific vocabulary. This thesis therefore studies named entity recognition and relation extraction for threat intelligence on top of BERT. The primary research includes the following components:

(1) To adapt BERT to this professional domain, this study continues Masked Language Model (MLM) pre-training on a large network security corpus, strengthening BERT's ability to represent threat intelligence text. To address inefficient decoding and nested security entities, a global pointer scheme is used for unified labeling and decoding of all entity spans. Non-target entity words are additionally introduced into adversarial training to alleviate the sparsity of the labeling matrix, and, exploiting the structural regularities of security entities, expert dictionaries and heuristic rules assist entity recognition.

(2) To address overlapping relations and long distances between subjects and objects, graph attention based on syntactic dependencies is introduced at BERT's output layer, with a Highway network adaptively adjusting the attention so that BERT incorporates grammatical structure into its contextual representations. A multi-head labeling method tags entities and relations uniformly in a single matrix, enabling single-stage joint extraction, and an improved loss function alleviates the imbalance of relation categories in the labeling matrix.

(3) Combining the above methods, a threat intelligence information extraction system is designed and implemented. The system automatically extracts structured triples from input text and presents them in a knowledge graph interface with visual representations.

Experimental results show that expanding the set of entity words under the global pointer scheme enhances the model's entity classification ability, and the injected prior knowledge effectively identifies long professional vocabulary, compensating for the pre-trained model's limitations. Compared with the baseline models, the proposed model performs best without significantly increasing inference time, achieving a highest F1 score of 0.836 on a public network security dataset. In the relation extraction task, single-stage joint extraction avoids error accumulation between subtasks, and the improved loss function dynamically adjusts sample weights to mitigate data imbalance. After incorporating grammatical features, BERT exhibits stronger representation ability, as validated on multiple datasets. Moreover, the implemented threat intelligence information extraction system effectively extracts structured key information, demonstrating the practical effectiveness of the method.
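The abstract's first contribution labels and decodes nested entity spans with global pointers. Below is a minimal PyTorch sketch of a global-pointer scoring head (the class and parameter names such as `GlobalPointer` and `head_dim` are illustrative, and the published GlobalPointer formulation additionally uses rotary position embeddings, omitted here): each entity type gets its own score matrix over (start, end) token pairs, so nested spans are recognized independently.

```python
import torch
import torch.nn as nn

class GlobalPointer(nn.Module):
    """Minimal global-pointer head: scores every (start, end) token pair
    per entity type, so nested entity spans decode independently."""

    def __init__(self, hidden_size: int, num_types: int, head_dim: int = 64):
        super().__init__()
        self.num_types = num_types
        self.head_dim = head_dim
        # one (start, end) projection pair per entity type
        self.proj = nn.Linear(hidden_size, num_types * head_dim * 2)

    def forward(self, hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, hidden_size); mask: (batch, seq), 1 = real token
        b, n, _ = hidden.shape
        qk = self.proj(hidden).view(b, n, self.num_types, 2, self.head_dim)
        q, k = qk[..., 0, :], qk[..., 1, :]  # (batch, seq, types, head_dim)
        # span score = scaled dot product of start and end representations
        scores = torch.einsum("bmtd,bntd->btmn", q, k) / self.head_dim ** 0.5
        # mask out padding positions and spans with end < start
        pad = (mask[:, None, :, None] * mask[:, None, None, :]) == 0
        scores = scores.masked_fill(pad, -1e12)
        lower = torch.tril(torch.ones(n, n, device=hidden.device), -1).bool()
        return scores.masked_fill(lower, -1e12)
```

At decoding time, every cell with a positive score is emitted as an entity span of the corresponding type, which is what allows nested spans, for example a malware name inside a longer file path, to be recognized simultaneously.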
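For contribution (2), graph attention over syntactic dependencies is added on BERT's output and gated by a Highway network. The sketch below assumes PyTorch and a precomputed dependency-parse adjacency matrix `adj` (names like `SyntaxGAT` are illustrative, not the thesis's own): attention is restricted to parse edges, and a highway gate decides, per dimension, how much syntax-aware signal to mix back into the original contextual states.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SyntaxGAT(nn.Module):
    """Sketch of one graph-attention layer over a dependency-parse
    adjacency matrix, fused with the BERT states by a highway gate."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.w = nn.Linear(hidden_size, hidden_size)
        self.attn = nn.Linear(2 * hidden_size, 1)
        self.gate = nn.Linear(hidden_size, hidden_size)  # highway transform gate

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, hidden); adj: (batch, seq, seq), 1 where tokens
        # i and j share a dependency edge (assumed to include self-loops)
        n = h.size(1)
        z = self.w(h)
        # pairwise attention logits e_ij = a([z_i ; z_j])
        zi = z.unsqueeze(2).expand(-1, -1, n, -1)
        zj = z.unsqueeze(1).expand(-1, n, -1, -1)
        e = F.leaky_relu(self.attn(torch.cat([zi, zj], dim=-1))).squeeze(-1)
        e = e.masked_fill(adj == 0, -1e12)   # attend only along parse edges
        alpha = torch.softmax(e, dim=-1)
        g = torch.bmm(alpha, z)              # syntax-aware token states
        # highway gate: per-dimension mix of syntactic and contextual signal
        t = torch.sigmoid(self.gate(h))
        return t * g + (1 - t) * h
```

Restricting attention to dependency edges shortens the effective path between a subject and a distant object, which is the motivation the abstract gives for adding syntactic structure.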
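The abstract also states that an improved loss function reweights samples to counter the imbalance of relation categories in the labeling matrix, without naming the exact formulation. A focal-style reweighting is one common choice for such imbalanced labeling matrices; the following is a hypothetical sketch of that idea, not the thesis's own loss.

```python
import torch
import torch.nn.functional as F

def focal_matrix_loss(logits: torch.Tensor, targets: torch.Tensor,
                      gamma: float = 2.0) -> torch.Tensor:
    """Focal-style loss over a relation labeling matrix.
    logits/targets: (batch, num_relations, seq, seq), targets in {0, 1}.
    Down-weights easy cells so the rare positive labels dominate the
    gradient; padding masks are omitted for brevity."""
    p = torch.sigmoid(logits)
    pt = torch.where(targets == 1, p, 1 - p)   # probability of the true label
    weight = (1 - pt) ** gamma                 # focus on hard examples
    bce = F.binary_cross_entropy_with_logits(
        logits, targets.float(), reduction="none")
    return (weight * bce).mean()
```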
Keywords/Search Tags:Threat Intelligence, Pre-trained Language Model, Named Entity Recognition, Relation Extraction