Research On Open Text Information Extraction In Chinese Knowledge Graph

Posted on: 2019-08-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Xu
GTID: 1368330623950467
Subject: Computer Science and Technology
Abstract/Summary:
With the arrival of the big data era, the Knowledge Graph is becoming an important form of knowledge representation. Because a Knowledge Graph can provide a more complete semantic description of the objective world, it is increasingly used in semantic search, machine reading, intelligent question answering, and other intelligent cognitive applications. Automatic construction, the core technology of Knowledge Graphs, has become a research hotspot in artificial intelligence, and information extraction from massive web text is a basic technology for that construction. Unlike traditional text information extraction, which restricts the corpus domain and the semantic categories, web text is massive, open, and non-standardized. Following the components of a Knowledge Graph, this thesis focuses on key technologies for open Chinese text data: entity recognition, entity disambiguation, relation extraction, and attribute extraction. The main achievements are as follows.

1) An unsupervised entity mention recognition method based on a topic model and semantic analysis. Traditional entity recognition methods, such as heuristic rules, dictionary matching, and supervised machine learning models, suffer from strong task dependency, poor adaptability, and a limited set of entity categories. To meet the requirements of automatically constructing a large-scale Chinese Knowledge Graph, this thesis presents an unsupervised entity mention recognition method based on a topic model and semantic analysis, consisting of entity boundary detection and entity mention classification. Entity boundary detection aims to detect all named and nominal entity mentions. First, shallow and deep syntactic analysis is used to automatically acquire noun phrases with complete boundaries from the text as candidate entity mentions. Then, combining the topic model with a statistical algorithm, non-entity mentions are filtered out of the candidate set by measuring the importance of each mention to the document. Entity mention classification identifies entity categories and mention categories. A category decision algorithm based on distributional semantics identifies the entity category by measuring the semantic similarity of the contexts of entity mentions. In addition, shallow syntactic knowledge is used to formulate rules that assign the named or nominal category to the mentions in each entity category. Experiments are carried out on two well-known public natural language processing datasets, ACE and DEFT. The results show that the proposed method is effective in detecting and classifying entity mentions.

2) An unsupervised, knowledge-driven entity linking method for disambiguation. To address ambiguous entity mentions in user queries, this thesis proposes a disambiguation method based on entity linking. With the help of external knowledge, the method disambiguates by linking the entity mentions in the user query text to the correct entities in the local knowledge base. First, following the idea of incremental evidence mining, external knowledge sources are used to enrich and optimize the information related to the entity mentions and the local knowledge base. This helps with the lack of context and the non-standard descriptions in user queries, and reduces dependence on the local knowledge base. Then, an inference-based linking algorithm is proposed that draws on several kinds of entity knowledge. The algorithm makes full use of the entity name, category, context information, popularity, semantic correlation between entities, and the relationships between entities in the external knowledge source and the local knowledge base. It improves the accuracy and recall of entity linking and thereby disambiguates the entity mentions accurately. Experiments are conducted on the well-known public dataset published by NLPCC. The results verify the effectiveness of the proposed method.

3) A weakly supervised open relation extraction method based on syntactic patterns and machine learning. To overcome the closed training corpus and the limited set of relation classes of traditional relation extraction methods, we propose a weakly supervised open relation extraction method. It takes unstructured text data as input, uses a text string as the indicator of the relationship between entities, and expresses its output in the structured format <Entity 1, Relationship Indicator, Entity 2>. The relationships are flexible and the number of classes is unlimited. The method first obtains candidate relation tuples from the text through syntactic analysis and abstracts them into syntactic patterns. Next, an algorithm for distinguishing positive and negative cases uses a word vector model and a synonym dictionary to calculate the semantic similarity between syntactic patterns. Each relation tuple in the candidate set is judged positive or negative on that basis, which automatically generates the required training corpus. Finally, a classifier is trained on shallow textual features (such as part-of-speech tags) and deep textual features (such as subject-object syntactic structure) to distinguish valid entity-relation tuples. Experiments are conducted on real news datasets (from People’s Daily, Sina, etc.) and Baidu Encyclopedia datasets, and the results show the effectiveness of the proposed method.

4) A weakly supervised entity attribute value extraction method based on a bidirectional long short-term memory network. Traditional methods for extracting entity attribute values require manually crafted syntactic patterns, annotated training corpora, and hand-defined textual features. These requirements increase labor costs and make extraction performance heavily dependent on the coverage of the patterns, corpus, and features. To address these deficiencies, we propose a weakly supervised entity attribute value extraction method for open Chinese text data. The method combines syntactic analysis, a word vector model, and deep learning to transform the extraction of entity attribute values into a relation classification problem. First, a method based on category mapping is proposed to automatically generate the training corpus. It uses the attribute name to obtain the category mapping of the attribute value, and recognizes the entity attribute value with the help of syntactic knowledge and regular expressions. At the same time, it extracts the text segment related to the entity and its attribute value as the training corpus, which helps remove noise from the corpus and reduces dependence on the size of the training corpus. Then, the word vector model is used to represent the training corpus in vector form, and simple, effective textual features are incorporated to train a bidirectional long short-term memory network, a currently popular deep learning model, to distinguish the relationship among entity, attribute name, and attribute value. Experiments are conducted on the well-known public dataset published by TAC. The results show that the proposed method is effective and significantly outperforms other traditional and deep learning methods.
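
The weak labeling step in the third contribution can be illustrated with a small sketch. This is not the thesis's algorithm, only a minimal Python example under stated assumptions: the toy word vectors, the synonym dictionary, the seed pattern, the `E1`/`E2` slot notation, the threshold, and all function names are hypothetical, and cosine similarity over averaged word vectors stands in for whatever similarity measure the thesis actually uses.

```python
import math

# Hypothetical toy word vectors; a real system would load a trained embedding model.
WORD_VECTORS = {
    "founded": [0.9, 0.1, 0.0],
    "established": [0.85, 0.15, 0.05],
    "visited": [0.1, 0.9, 0.2],
}

# Hypothetical synonym dictionary mapping a variant to a canonical word.
SYNONYMS = {"set_up": "founded"}

# Hypothetical seed pattern assumed to express a valid relation.
# "E1" and "E2" are entity slots; the middle token is the relation indicator.
SEED_PATTERNS = [("E1", "founded", "E2")]


def pattern_vector(pattern):
    """Average the vectors of the non-slot words in a pattern (None if no word is known)."""
    words = [SYNONYMS.get(tok, tok) for tok in pattern if tok not in ("E1", "E2")]
    vecs = [WORD_VECTORS[w] for w in words if w in WORD_VECTORS]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label_candidate(pattern, threshold=0.9):
    """Return True if the pattern is close enough to any seed pattern."""
    vec = pattern_vector(pattern)
    if vec is None:
        return False
    seed_vecs = [pattern_vector(seed) for seed in SEED_PATTERNS]
    return any(sv is not None and cosine(vec, sv) >= threshold for sv in seed_vecs)


def build_training_corpus(candidates, threshold=0.9):
    """Attach a 1/0 label to each (triple, pattern) candidate."""
    return [
        (triple, 1 if label_candidate(pattern, threshold) else 0)
        for triple, pattern in candidates
    ]
```

Given candidates such as `(("Company A", "established", "City B"), ("E1", "established", "E2"))`, `build_training_corpus` returns each triple paired with an automatically assigned label, which a downstream classifier could then be trained on together with shallow and deep textual features.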
Keywords/Search Tags: Knowledge Graph, Chinese Text Information Extraction, Automatic Construction, Semantic Analysis, Deep Learning, Machine Learning