
Research On Extracting Threat Intelligence Information Based On Pre-trained Language Models

Posted on: 2024-06-17
Degree: Master
Type: Thesis
Country: China
Candidate: W Guan
Full Text: PDF
GTID: 2568307157951919
Subject: Computer technology
Abstract/Summary:
Automatically analyzing and extracting standardized security entities and their relations from massive volumes of unstructured threat intelligence is crucial for network security situational awareness. However, threat intelligence comes from heterogeneous sources with inconsistent naming conventions for security entities, and the texts exhibit long entity spans, nested entities, and overlapping relations. BERT, pre-trained on general-domain corpora, is not directly suitable for feature extraction in the network security domain: its large parameter count incurs high computational cost, and its semantic representations, learned from general-domain text, transfer poorly to security-specific vocabulary. This thesis therefore studies named entity recognition and relation extraction for threat intelligence on top of BERT. The primary research includes the following components:

(1) To adapt BERT to this professional domain, this study continues Masked Language Model (MLM) pre-training on a large network security corpus, strengthening BERT's ability to represent threat intelligence text. To address inefficient decoding and nested security entities, a global pointer scheme is used for unified labeling and decoding of all entity spans. Non-target entity words are additionally introduced into adversarial training to alleviate the sparsity of the labeling matrix, and, exploiting the structural regularities of security entities, expert dictionaries and heuristic rules assist entity recognition.

(2) To address overlapping relations and long distances between subjects and objects, graph attention based on syntactic dependencies is introduced at BERT's output layer, with a Highway network adaptively adjusting the attention so that BERT incorporates grammatical structure into its contextual representations. A multi-head labeling method tags entities and relations uniformly in a single matrix, enabling single-stage joint extraction, and an improved loss function alleviates the imbalance of relation categories in the labeling matrix.

(3) Combining the above methods, a threat intelligence information extraction system is designed and implemented. The system automatically extracts structured triples from input text and presents them in a knowledge graph interface with visual representations.

Experimental results show that expanding the set of entity words under the global pointer scheme enhances the model's entity classification ability, and the injected prior knowledge effectively identifies long professional vocabulary, compensating for the pre-trained model's limitations. Compared with the baseline models, the proposed model performs best without significantly increasing inference time, achieving a highest F1 score of 0.836 on a public network security dataset. In the relation extraction task, single-stage joint extraction avoids error accumulation between subtasks, and the improved loss function dynamically adjusts sample weights to mitigate data imbalance. After incorporating grammatical features, BERT exhibits stronger representation ability, as validated on multiple datasets. Moreover, the implemented threat intelligence information extraction system effectively extracts structured key information, demonstrating the practical effectiveness of the method.
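The abstract's first contribution labels and decodes nested entity spans with global pointers. Below is a minimal PyTorch sketch of a global-pointer scoring head (the class and parameter names such as `GlobalPointer` and `head_dim` are illustrative, and the published GlobalPointer formulation additionally uses rotary position embeddings, omitted here): each entity type gets its own score matrix over (start, end) token pairs, so nested spans are recognized independently.

```python
import torch
import torch.nn as nn

class GlobalPointer(nn.Module):
    """Minimal global-pointer head: scores every (start, end) token pair
    per entity type, so nested entity spans decode independently."""

    def __init__(self, hidden_size: int, num_types: int, head_dim: int = 64):
        super().__init__()
        self.num_types = num_types
        self.head_dim = head_dim
        # one (start, end) projection pair per entity type
        self.proj = nn.Linear(hidden_size, num_types * head_dim * 2)

    def forward(self, hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, hidden_size); mask: (batch, seq), 1 = real token
        b, n, _ = hidden.shape
        qk = self.proj(hidden).view(b, n, self.num_types, 2, self.head_dim)
        q, k = qk[..., 0, :], qk[..., 1, :]  # (batch, seq, types, head_dim)
        # span score = scaled dot product of start and end representations
        scores = torch.einsum("bmtd,bntd->btmn", q, k) / self.head_dim ** 0.5
        # mask out padding positions and spans with end < start
        pad = (mask[:, None, :, None] * mask[:, None, None, :]) == 0
        scores = scores.masked_fill(pad, -1e12)
        lower = torch.tril(torch.ones(n, n, device=hidden.device), -1).bool()
        return scores.masked_fill(lower, -1e12)
```

At decoding time, every cell with a positive score is emitted as an entity span of the corresponding type, which is what allows nested spans, for example a malware name inside a longer file path, to be recognized simultaneously.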
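For contribution (2), graph attention over syntactic dependencies is added on BERT's output and gated by a Highway network. The sketch below assumes PyTorch and a precomputed dependency-parse adjacency matrix `adj` (names like `SyntaxGAT` are illustrative, not the thesis's own): attention is restricted to parse edges, and a highway gate decides, per dimension, how much syntax-aware signal to mix back into the original contextual states.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SyntaxGAT(nn.Module):
    """Sketch of one graph-attention layer over a dependency-parse
    adjacency matrix, fused with the BERT states by a highway gate."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.w = nn.Linear(hidden_size, hidden_size)
        self.attn = nn.Linear(2 * hidden_size, 1)
        self.gate = nn.Linear(hidden_size, hidden_size)  # highway transform gate

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, hidden); adj: (batch, seq, seq), 1 where tokens
        # i and j share a dependency edge (assumed to include self-loops)
        n = h.size(1)
        z = self.w(h)
        # pairwise attention logits e_ij = a([z_i ; z_j])
        zi = z.unsqueeze(2).expand(-1, -1, n, -1)
        zj = z.unsqueeze(1).expand(-1, n, -1, -1)
        e = F.leaky_relu(self.attn(torch.cat([zi, zj], dim=-1))).squeeze(-1)
        e = e.masked_fill(adj == 0, -1e12)   # attend only along parse edges
        alpha = torch.softmax(e, dim=-1)
        g = torch.bmm(alpha, z)              # syntax-aware token states
        # highway gate: per-dimension mix of syntactic and contextual signal
        t = torch.sigmoid(self.gate(h))
        return t * g + (1 - t) * h
```

Restricting attention to dependency edges shortens the effective path between a subject and a distant object, which is the motivation the abstract gives for adding syntactic structure.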
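The abstract also states that an improved loss function reweights samples to counter the imbalance of relation categories in the labeling matrix, without naming the exact formulation. A focal-style reweighting is one common choice for such imbalanced labeling matrices; the following is a hypothetical sketch of that idea, not the thesis's own loss.

```python
import torch
import torch.nn.functional as F

def focal_matrix_loss(logits: torch.Tensor, targets: torch.Tensor,
                      gamma: float = 2.0) -> torch.Tensor:
    """Focal-style loss over a relation labeling matrix.
    logits/targets: (batch, num_relations, seq, seq), targets in {0, 1}.
    Down-weights easy cells so the rare positive labels dominate the
    gradient; padding masks are omitted for brevity."""
    p = torch.sigmoid(logits)
    pt = torch.where(targets == 1, p, 1 - p)   # probability of the true label
    weight = (1 - pt) ** gamma                 # focus on hard examples
    bce = F.binary_cross_entropy_with_logits(
        logits, targets.float(), reduction="none")
    return (weight * bce).mean()
```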
Keywords/Search Tags:Threat Intelligence, Pre-trained Language Model, Named Entity Recognition, Relation Extraction