Research On Location Information Extraction For Web Text

Posted on: 2021-04-23
Degree: Master
Type: Thesis
Country: China
Candidate: T Sun
GTID: 2480306293952479
Subject: Cartography and Geographic Information System
Abstract/Summary:
With the rapid development of the mobile Internet, the Internet has become an important source of geographic information. According to statistics, nearly 70% of Internet data is related to geographic information, and the amount of location data generated by web page text and social networks is close to the amount collected by specialized devices. Extracting location information from text quickly and accurately can greatly improve the efficiency of data collection and better meet people's needs for geographic information.

The location information in text consists of two parts: geographic named entities and relative location information. Geographic named entities are the place names and some organization names that appear in the text. Relative location information is attached to an entity and describes the spatial relationship between entities. Existing research focuses only on extracting geographic named entities, ignores the recognition and transformation of relative position relationships between entities, and lacks a corpus that covers full location information. Existing recognition methods also have shortcomings, such as low recall for complex place names and inaccurate entity boundaries. It is therefore of both theoretical and practical significance to study the automatic extraction of spatial location information from web text.

Building on existing research in China and abroad, this thesis establishes a corpus of full location information that adds annotation of relative position relationships. Based on the expanded corpus, a method for extracting and visualizing the location information in text is designed. The main contents are as follows:

(1) Building a location information annotation corpus from web text and designing the corresponding annotation scheme. A large amount of text is collected from relevant websites and processed through text extraction, preprocessing, cleaning, word segmentation, and part-of-speech tagging. Labels are designed under the BIO labeling scheme, and the text is annotated at the character level to form the corpus. This corpus addresses the insufficient data and poor timeliness of open corpora and their lack of relative location annotation.

(2) Introducing the BERT pre-trained model and designing a recognition method based on a composite BERT-BiLSTM-CRF model. The BERT model has a strong ability to represent text features, the BiLSTM model extracts context features well, and the CRF model imposes constraints on the label sequence. The validity of this method for recognizing geographic named entities and relative location information is verified through comparative experiments and standard evaluation metrics.

(3) Transforming the location information in text into structured information. The thesis analyzes relative location information and summarizes four common kinds of relationship semantics and three kinds of distribution structures among geographic named entities. Based on the Baidu Map platform, it designs a method for reasoning about and transforming the location information in text. Finally, a demonstration application that extracts a travel path from text was developed in response to the practical needs of the current epidemic outbreak, and the transformation method was verified with it.
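The character-level annotation described in item (1) can be illustrated with a minimal sketch. It assumes a plain BIO scheme with two hypothetical label names, LOC for a geographic named entity and REL for relative location information. The abstract does not list the thesis's actual label set, so these names, the function decode_bio, and the sample sentence are illustrative only. The sketch groups per-character tags, such as a sequence-labeling model might output, into labeled text spans.

```python
from typing import List, Tuple


def decode_bio(chars: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group character-level BIO tags into (label, text) spans.

    Label names such as "LOC" and "REL" are hypothetical; the thesis's
    real tag set is not given in the abstract.
    """
    spans: List[Tuple[str, str]] = []
    current_label = None
    current_chars: List[str] = []

    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span, closing any open one first.
            if current_label is not None:
                spans.append((current_label, "".join(current_chars)))
            current_label = tag[2:]
            current_chars = [ch]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag with the same label continues the open span.
            current_chars.append(ch)
        else:
            # "O", or an I- tag that does not continue the open span:
            # close the open span (if any) and skip this character.
            if current_label is not None:
                spans.append((current_label, "".join(current_chars)))
            current_label = None
            current_chars = []

    if current_label is not None:
        spans.append((current_label, "".join(current_chars)))
    return spans


# Hypothetical sample: "武汉大学东侧500米" ("500 m east of Wuhan University").
chars = list("武汉大学东侧500米")
tags = ["B-LOC", "I-LOC", "I-LOC", "I-LOC",
        "B-REL", "I-REL", "I-REL", "I-REL", "I-REL", "I-REL"]

for label, text in decode_bio(chars, tags):
    print(label, text)
```

In this sketch an I- tag that does not follow a matching B- tag is treated like O and dropped. That is one common convention, not necessarily the one used in the thesis.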
Keywords/Search Tags: BERT model, web text, location information, geographic named entity recognition