| In recent years, the number of environmental pollution incidents has risen year by year, and rapid monitoring of such incidents is urgently needed. Traditional monitoring methods, however, still suffer from poorly matched monitoring technology and unbalanced regional development, so they cannot cover all regions, all time periods and all kinds of contaminants. Because internet news texts are widely available, authentic and timely, they can often make up for the shortcomings of monitoring with physical equipment. However, environmental pollution incidents often have a "domino effect", and news texts contain a great deal of interfering information, such as mixed mentions of multiple times, places and persons, so extracting environmental pollution incident information from online news texts faces many challenges. On this basis, the main research contents and results are as follows: (1) Construction of a thesaurus of environmental pollution incidents. According to the national standard Code of Environmental Pollution Categories, the categories of environmental pollution incidents and the initial thesaurus of each category are determined. Dictionaries such as Cilin and HowNet, together with various large-scale pre-trained word vectors, are used to expand the thesaurus, which yields the final thesaurus of environmental pollution incidents. (2) Rapid labeling of a corpus of environmental pollution incidents. Because labeling a large number of internet news texts consumes considerable manpower and material resources, a clustering method, the LDA model, is proposed to generate clusters, which are then manually mapped to event categories to achieve rapid labeling of environmental pollution incidents. The accuracy of the labeled event categories is then evaluated manually to verify the effectiveness of this method. (3) Automatic detection of environmental pollution incidents. A TF-IDF vector is used to represent the global features of a document, and the frequency of words in the document is calculated to construct the document topic feature vector. The document global feature vector and the topic distribution feature vector are combined into a joint topic feature vector, which serves as the input of a supervised classification model that detects the category of environmental pollution incidents. (4) Information extraction of environmental pollution incidents. To cope with the large amount of interfering information in news texts, the input feature vectors are improved by incorporating the part-of-speech and syntactic features of the words in the text, so that they more fully represent how events are expressed in the text. A Bi-LSTM+CRF model is then introduced to extract environmental pollution incident information. (5) Statistical analysis of environmental pollution incident data. Detection and information extraction are carried out on the environmental pollution incidents contained in massive internet texts, and the distribution of the extraction results is analyzed by category, time and space, which further illustrates the practical value of the proposed method. |
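The joint feature construction in step (3) can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the smoothed IDF formula, the function names (`tfidf_vectors`, `topic_vector`, `joint_features`), the toy documents and the two-category thesaurus are all assumptions made for illustration. In particular, the topic feature here is approximated as the share of thesaurus hits per category, which is only one possible reading of "the frequency of words in the document".

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocab):
    """Return one TF-IDF vector per tokenized document, ordered by vocab."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vec = []
        for term in vocab:
            tf = counts[term] / total
            # Smoothed IDF (an assumed variant, not taken from the source).
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors


def topic_vector(doc, topic_words):
    """Share of thesaurus hits per category (hypothetical topic feature)."""
    counts = Counter(doc)
    hits = [sum(counts[w] for w in words) for words in topic_words.values()]
    total = sum(hits)
    return [h / total if total else 0.0 for h in hits]


def joint_features(docs, vocab, topic_words):
    """Concatenate global TF-IDF features with per-category topic features."""
    global_vecs = tfidf_vectors(docs, vocab)
    return [g + topic_vector(d, topic_words) for g, d in zip(global_vecs, docs)]


# Toy, pre-tokenized documents and a two-category thesaurus (illustrative only).
docs = [
    ["river", "sewage", "discharge", "river"],
    ["smog", "haze", "city"],
]
vocab = sorted({term for doc in docs for term in doc})
topic_words = {"water": ["river", "sewage"], "air": ["smog", "haze"]}

features = joint_features(docs, vocab, topic_words)
```

Each row of `features` holds the TF-IDF values for the vocabulary followed by one value per category, and could be passed to any supervised classifier. Plain concatenation keeps the two feature groups separable, so either one can be reweighted or ablated independently.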