| During power grid design and construction, a large number of files, such as design specifications and test reports, are generated and handed over to power supply companies. Most of these documents are unstructured and stored in Word, PDF, and Excel formats. They are important raw data sources for the operation, maintenance, and asset management systems of power supply companies, and the knowledge extracted from them can be used to generate and calibrate topology, equipment, and financial data across those companies' information systems. However, because effective data extraction and transformation technology is lacking, the value of these data has not yet been mined. There is therefore an urgent need for information extraction technology that automatically analyzes the unstructured natural language text of power grid infrastructure projects, so that a Knowledge Graph (KG) can be constructed to support hierarchical storage, visual presentation, and recommendation of related information. This paper takes noisy, multi-source, heterogeneous data from infrastructure projects as its research object and, based on natural language processing (NLP) technology, studies information extraction models and methods together with KG construction technology. First, to remove noise from the original unstructured natural language text and the non-standard semi-structured table data, which are difficult to analyze automatically in the raw files, a data preprocessing procedure is proposed for data cleaning: Chinese word segmentation and removal of null, zero, and abnormal values. The skip-gram model, an efficient method for learning high-quality vector representations of words from large amounts of unstructured text, is then used to help learning algorithms perform better on NLP tasks by grouping similar words. Second, the high-value information contained in the files is located through knowledge extraction and fusion. The required named entity types are predefined, and on this basis a supervised machine learning model is constructed to produce named entity and part-of-speech tags, with named entity recognition treated as a sequence labeling task. To address entity ambiguity and redundancy after recognition, a coreference resolution model is proposed. A graph-based model is implemented to identify the semantic relationships between named entities by finding the highest-scoring combination of edges in the spanning tree composed of entity nodes and relationship edges. Finally, the power grid project KG is constructed and stored in the Neo4j graph database; it comprises a data layer based on the property graph model and a model layer for visual display. Simulation and experimental results show that the KG of power grid projects can transform natural language from multiple heterogeneous documents into nodes and relationships in a semantic knowledge base. The proposed KG model provides a novel and effective structured data source for data applications in various departments of the power grid. |
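
The graph-based relation step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which spanning-tree algorithm or scoring function is used, so this example assumes an undirected graph and Kruskal's algorithm run on edges sorted by descending score. The entity names, relation labels, and scores are hypothetical and only serve to show how the highest-scoring set of relation edges might be selected.

```python
from typing import Dict, List, Tuple

# (score, head entity, tail entity, relation label); all values are hypothetical.
Edge = Tuple[float, str, str, str]


def maximum_spanning_tree(nodes: List[str], edges: List[Edge]) -> List[Edge]:
    """Select a maximum-weight spanning tree with Kruskal's algorithm.

    Assumption: the graph is undirected and each edge carries one score.
    """
    parent: Dict[str, str] = {n: n for n in nodes}

    def find(x: str) -> str:
        # Path halving keeps the union-find trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree: List[Edge] = []
    # Visit edges from highest to lowest score.
    for edge in sorted(edges, key=lambda e: e[0], reverse=True):
        _, head, tail, _ = edge
        root_head, root_tail = find(head), find(tail)
        if root_head != root_tail:  # the edge joins two separate components
            parent[root_head] = root_tail
            tree.append(edge)
    return tree


# Hypothetical entities and candidate relations, for illustration only.
nodes = ["Substation A", "Transformer T1", "Line L1", "Contractor X"]
edges = [
    (0.92, "Substation A", "Transformer T1", "contains"),
    (0.85, "Substation A", "Line L1", "connected_to"),
    (0.40, "Transformer T1", "Line L1", "connected_to"),
    (0.77, "Contractor X", "Substation A", "built"),
    (0.30, "Contractor X", "Line L1", "built"),
]

selected = maximum_spanning_tree(nodes, edges)
for score, head, tail, relation in selected:
    print(f"{head} -[{relation} {score:.2f}]-> {tail}")
```

With these made-up scores, the two lowest-scoring candidate edges are discarded because each would close a cycle, leaving one relation path between every pair of entities. A directed formulation (for example, a maximum spanning arborescence) would need a different algorithm.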