| With the rapid development of Internet technology, network attacks occur frequently and seriously threaten the information security of individuals, enterprises, and countries. As a form of big data for information security, cyber threat intelligence supports the analysis of existing and potential threats, helps remedy defects in the defense system, and shifts network attack and defense from traditional after-the-fact passive defense to planned active defense. Research on the content of cyber threat intelligence is therefore of great significance. Unstructured threat intelligence data contains a large amount of event intelligence, but identifying that intelligence is subject to many restrictions and much interference, and the criticality of the intelligence content is difficult to guarantee. How to collect and master large amounts of cyber threat intelligence and make efficient use of large-scale, multi-source, heterogeneous cyber threat intelligence has become an urgent problem. This thesis focuses on information extraction technology in the field of cyber threat intelligence, including named entity recognition and joint entity-relation extraction for unstructured cyber threat intelligence. The main contributions of this thesis are as follows: (1) A named entity recognition method for threat intelligence based on multi-dimensional feature fusion is proposed. To address the unclear boundaries and mixed Chinese and English text of threat intelligence entities, Word2Vec is used to extract word vectors, a convolutional neural network is used to capture character-level features, and dependency parsing is used to extract dependency syntactic features that expand the semantic features. After the multi-dimensional features are fused, the feature vectors are fed into a bidirectional long short-term memory (BiLSTM) network to capture the full-text semantic information, and a multi-head self-attention mechanism is added to extract the dependencies between words. Combined with 
the conditional random field (CRF), legal and valid entity tag predictions are obtained, and entity boundaries are determined from the predicted labels to extract the entities. The proposed method is then trained, validated, and tested on a manually self-constructed cyber threat intelligence corpus. The experimental results show that the F1 score of the entity extraction model proposed in this thesis reaches 82.1%. Compared with other publicly available entity extraction models on the same data set, the model outperforms other sequence labeling models in the overall performance of intelligence entity extraction. (2) A joint entity and relation extraction method for cyber threat intelligence based on adversarial training is proposed. To address the low efficiency of entity-relation extraction for unstructured threat intelligence, the BERT pre-trained language model is used to learn rich semantic features. On this basis, adversarial training is introduced: the original inputs and adversarial samples are mixed and trained on together to improve the robustness of the model to input perturbations. Contextual semantic information is captured by the BiLSTM encoding layer and passed to the CRF decoding layer to obtain the globally optimal tag sequence. The predicted tag sequence is then processed according to the specified entity-relation matching rules to obtain threat intelligence entity-relation triples. The proposed method is then evaluated on the self-constructed cyber threat intelligence corpus. The experimental results show that the proposed method surpasses several recent entity-relation extraction models, which demonstrates its effectiveness. (3) An automatic threat intelligence data acquisition and information extraction system is designed and implemented. Based on the named entity recognition and entity-relation extraction tasks for cyber threat intelligence, this thesis constructs 
an automatic cyber threat intelligence corpus acquisition framework and a cyber threat intelligence information extraction system that incorporates the models described above. The system includes modules for automatic crawling of threat intelligence text data, data cleaning, and information extraction from intelligence text, among others. System testing shows that the results meet the expected requirements. |
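
The step in contribution (1) where entity boundaries are determined from the predicted labels can be illustrated with a minimal sketch. The abstract does not state which tagging scheme the thesis uses, so the BIO scheme here is an assumption, as are the function name `decode_bio`, the entity type names (`ACTOR`, `MALWARE`, `SYSTEM`), the example sentence, and the choice to join tokens with a space.

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs.

    Assumes a BIO scheme: "B-X" opens an entity of type X, "I-X" continues
    it, and "O" marks tokens outside any entity.
    """
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the entity being built, if any, and reset the buffer.
        nonlocal current_tokens, current_type
        if current_tokens and current_type is not None:
            entities.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an "I-" tag that does not continue the open entity:
            # close the open entity and treat this token as outside.
            flush()
    flush()
    return entities


# Hypothetical example sentence and labels, for illustration only.
tokens = ["APT28", "used", "X-Agent", "malware", "against", "Windows", "Server"]
tags = ["B-ACTOR", "O", "B-MALWARE", "O", "O", "B-SYSTEM", "I-SYSTEM"]
print(decode_bio(tokens, tags))
```

In a model with a CRF output layer, the learned transition scores are what make ill-formed sequences (such as an `I-` tag with no preceding `B-` tag) unlikely. The fallback branch in this sketch only guards against the cases that still slip through.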