| In the face of an increasingly severe network security landscape, Cyber Threat Intelligence (CTI) is gaining significant attention from cybersecurity practitioners and major enterprises. Among the various types of threat intelligence, Advanced Persistent Threat (APT) reports have considerable analytical and practical value and attract widespread interest. APT reports are commonly disseminated in unstructured formats such as PDF. Current research on threat intelligence integration focuses primarily on extracting and using semi-structured threat intelligence data, yet the majority of threat intelligence is still shared on the internet in unstructured form. It is therefore crucial to investigate named entity recognition and entity relation extraction techniques for unstructured threat intelligence. Such techniques allow unstructured threat intelligence to be integrated and used more efficiently and accurately, which in turn strengthens network security defense capabilities. This thesis analyzes the semantic connotation and related standards of threat intelligence, and studies and implements a threat intelligence ontology together with methods for threat intelligence named entity recognition and entity relation extraction. The main work of this thesis includes: (1) A threat intelligence ontology is constructed. Based on the semantic connotation of threat intelligence and the threat intelligence standard STIX 2.1, an ontology for the threat intelligence domain is designed and constructed. The ontology covers 13 types of threat entities and 7 types of inter-entity relations, which in turn define the entity and relation types required for constructing a threat intelligence knowledge graph. (2) A threat intelligence named entity recognition method based on data augmentation and BBC is proposed. To address the insufficient semantic accuracy of existing data augmentation methods, the method fills threat intelligence domain vocabulary from a knowledge base into template sentences that conform to the threat intelligence context, which increases both the number of threat intelligence entities and the diversity of samples. The augmented data is then used with the BERT+BiLSTM+CRF (BBC) model for threat intelligence named entity recognition. In experiments on a threat intelligence named entity recognition dataset, all models trained with the proposed method outperform the corresponding models trained without data augmentation. (3) A threat intelligence entity relation extraction method that integrates multiple kinds of entity information is proposed. First, because threat intelligence sentences typically contain multiple entities and relations, the Brat annotation tool is improved with a sentence extraction algorithm so that it can be used to generate threat intelligence entity relation datasets. Then, labels are added on both sides of each entity so that entity boundary information and entity type information are fused with the entity semantic information in the word embeddings, which improves the performance of the R-BERT model. Experiments on the threat intelligence entity relation dataset show that the proposed method reaches a macro-averaged F1 score of 81.061%, better than current common entity relation extraction methods. (4) A threat intelligence knowledge graph management tool is designed and implemented. The tool automatically extracts threat entities and relations from unstructured threat intelligence text, and can store, retrieve and visualize the extracted results. |
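
The template-filling augmentation in (2) and the entity labelling in (3) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the thesis: the knowledge-base contents, the template sentence, the type names `THREAT_ACTOR` and `MALWARE`, the BIO tagging scheme, the `<e1:TYPE>` marker format, and the helper names `fill_template` and `mark_entities`.

```python
import random
import re

# Hypothetical knowledge base: entity type -> domain vocabulary (illustrative only).
KNOWLEDGE_BASE = {
    "THREAT_ACTOR": ["APT28", "Lazarus Group"],
    "MALWARE": ["PlugX", "Emotet"],
}

# Hypothetical template sentence; each {TYPE} slot is filled from the knowledge base.
TEMPLATE = "{THREAT_ACTOR} delivered {MALWARE} through spear-phishing emails ."

SLOT = re.compile(r"\{([A-Z_]+)\}")


def fill_template(template, kb, rng):
    """Fill every {TYPE} slot with a random entry of that type.

    Returns (tokens, tags), where tags follow an assumed BIO scheme.
    """
    tokens, tags = [], []
    for piece in template.split():
        match = SLOT.fullmatch(piece)
        if match is None:
            # Ordinary context word: outside any entity.
            tokens.append(piece)
            tags.append("O")
            continue
        etype = match.group(1)
        words = rng.choice(kb[etype]).split()
        tokens.extend(words)
        tags.extend([f"B-{etype}"] + [f"I-{etype}"] * (len(words) - 1))
    return tokens, tags


def mark_entities(tokens, head, tail):
    """Wrap two non-overlapping entity spans in typed boundary markers.

    head and tail are (start, end_exclusive, entity_type) token spans.
    The marker format is an assumption, not the thesis's exact labels.
    """
    opens = {head[0]: f"<e1:{head[2]}>", tail[0]: f"<e2:{tail[2]}>"}
    closes = {head[1]: f"</e1:{head[2]}>", tail[1]: f"</e2:{tail[2]}>"}
    marked = []
    for i, token in enumerate(tokens):
        if i in closes:  # close a span that ended just before this token
            marked.append(closes[i])
        if i in opens:  # open a span that starts at this token
            marked.append(opens[i])
        marked.append(token)
    if len(tokens) in closes:  # span that runs to the end of the sentence
        marked.append(closes[len(tokens)])
    return marked


if __name__ == "__main__":
    rng = random.Random(0)
    tokens, tags = fill_template(TEMPLATE, KNOWLEDGE_BASE, rng)
    print(list(zip(tokens, tags)))
    example = ["APT28", "delivered", "PlugX", "through", "email"]
    print(mark_entities(example, (0, 1, "THREAT_ACTOR"), (2, 3, "MALWARE")))
```

In a real pipeline the markers would presumably be registered as special tokens with the BERT tokenizer before the marked sentence is passed to the relation classifier; that step is omitted here.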