Research on the Key Technologies of Spatio-Temporal and Topic Extraction Based on Geological Report Text

Posted on: 2021-04-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q J Qiu
GTID: 1360330614973072
Subject: Surveying Science and Technology
Abstract/Summary:
For a long time, owing to the diversity of research methods and research directions, a massive amount of geological data has accumulated in the geosciences. Geological research has gradually shifted from qualitative to quantitative research and from data-sparse to data-intensive work. These data include not only traditional structured data but also a large volume of unstructured text. Data models and methods for spatial and structured data in geological big data are relatively mature, but much unstructured geological data has not been effectively retrieved or used. Studying spatiotemporal and thematic relevance across structured and unstructured geological data addresses one of the key scientific problems in making the best use of geological big data. It provides data and technical support for connecting structured and unstructured geological big data and for intelligent knowledge services, thereby improving the ability to apply geological big data.

Spatiotemporal and topic information in Chinese geological report text is described qualitatively, without structure, and with fuzzy uncertainty. In view of these characteristics, and taking unstructured geological reports as its basis, this research follows the technical main line of "geological report text description, standardized expression, structured information extraction, visualization display". The thesis addresses key scientific issues including Chinese word segmentation, spatiotemporal information extraction, and topic information extraction for geological report text. The main research contents and innovations are as follows:

(1) Standardized expression of spatiotemporal and topic information in geological report text

This thesis reviews and summarizes how spatiotemporal and topic information is described in geological report text; clarifies the description characteristics of geological objects, spatial information, temporal information, topic information, and attribute information; proposes a structured expression model for geological entity information in geological report text; and explores the feasibility of using natural language processing and deep learning to extract spatiotemporal information from geological report text. With geological report text as the data source, a tagged Chinese corpus of spatiotemporal and topic information is constructed to provide standardized training and test sets for extracting spatiotemporal and topic information from geological reports.

(2) Chinese word segmentation method for geological reports based on words and word frequencies with deep learning

By integrating the unigram language model and deep learning, we propose a weakly supervised model, DGeoSegmenter. DGeoSegmenter is trained with words and their corresponding frequencies. We built it using the bi-directional long short-term memory (Bi-LSTM) model, which randomly extracts words and combines them into sentences. Evaluation results on geoscience reports and benchmark datasets demonstrate the effectiveness of the method: DGeoSegmenter can segment both geoscience terms and general terms. Because the algorithm requires neither manually labeled datasets nor hand-crafted rules, it can easily be applied to various domains, including information retrieval and text mining.

(3) Spatiotemporal information extraction from geological reports based on a spatiotemporal convolutional neural network

Using multiple developed instances of sentences in a knowledge base, our method first interprets each new sentence by selecting and matching the most similar knowledge base sentence based on similarity (i.e., string similarity and semantic similarity), and then transforms the sentences into training data. Taking the matched sentences as inputs, we propose and train a spatial-oriented convolutional neural network (SP-CNN) to obtain the deep features of natural-language texts. More specifically, we exploit a spatial-oriented channel that combines human prior knowledge to automatically match words and comprehend the linguistic clues of spatial relationships. Finally, a softmax classifier predicts the classification results of the input data based on a group of deep features learned from the constructed sentences. We evaluate our method both qualitatively and quantitatively using a real dataset. The experimental results demonstrate that SP-CNN can effectively extract spatial relations from natural-language texts and achieves higher performance than other current state-of-the-art approaches.

(4) Topic extraction method for geological reports based on enhanced word vectors

Keyphrase extraction remains a complicated task, and the performance of state-of-the-art approaches is still low. Automatic discovery of high-quality, meaningful keyphrases requires useful knowledge and suitable techniques. Seeing both challenges and opportunities in this situation, this thesis proposes an ontology and enhanced word embedding-based (OEWE) methodology for automatic keyphrase extraction from geoscience documents. We first develop a quantitative analysis for evaluating keyphrase extraction based on conditional probability and the naive Bayesian model, which is valuable when human-annotated keyphrases are not available. The domain ontology is then built on a multiway tree to enrich the domain-specific knowledge with concepts and relationships of the domain. Simultaneously, word2vec, a model of word distributions that uses deep learning, is updated by applying the geological ontology; it links domain background information and identifies infrequent but representative keyphrases. We used two self-built geoscience datasets to evaluate the performance of OEWE. We compared our method with frequency, term frequency-inverse document frequency (TF-IDF), TextRank, and rapid automatic keyword extraction (RAKE), finding that our method achieved average F1-scores of 30.1% and 40.7% on two manually annotated datasets.
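
To illustrate the unigram language model that contribution (2) builds on, the following minimal Python sketch segments a string using only a word-frequency dictionary. It is not DGeoSegmenter itself: the Bi-LSTM component is omitted, and the function name `segment`, the `max_len` limit, the unknown-character penalty, and the toy frequencies in `toy_freq` are all assumptions made for this example.

```python
import math


def segment(text, freq, max_len=4):
    """Return the most probable split of `text` under a unigram word model.

    `freq` maps known words to counts. Single characters that are not in
    `freq` receive a small fallback probability (an assumption of this
    sketch) so that every input can still be segmented.
    """
    total = sum(freq.values())
    unk_logp = math.log(0.5 / total)

    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)            # back[i]: start of the last word in that split

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in freq:
                logp = math.log(freq[word] / total)
            elif len(word) == 1:
                logp = unk_logp
            else:
                continue  # unknown multi-character strings are not candidates
            score = best[start] + logp
            if score > best[end]:
                best[end] = score
                back[end] = start

    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


# Hypothetical frequencies: 花岗岩 = granite, 岩体 = rock body, 侵入 = intrusion.
toy_freq = {"花岗岩": 50, "花岗": 5, "岩体": 30, "岩": 20,
            "体": 10, "侵入": 40, "花": 8, "岗": 2}

print(segment("花岗岩侵入", toy_freq))
```

In this sketch, a frequent domain term outscores a split into shorter pieces because the product of several word probabilities is smaller than the probability of one common word.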
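
The F1-scores reported for contribution (4) compare extracted keyphrases with manually annotated ones. A minimal sketch of that kind of evaluation follows, assuming exact matching after lowercasing and trimming; the abstract does not state the matching rule used in the thesis, and the function names and sample phrases here are hypothetical.

```python
def keyphrase_f1(predicted, gold):
    """F1 between predicted and annotated keyphrases for one document.

    Matching is exact after lowercasing and trimming (an assumption of
    this sketch, not necessarily the rule used in the thesis).
    """
    pred = {p.lower().strip() for p in predicted}
    ref = {g.lower().strip() for g in gold}
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def average_f1(documents):
    """Mean F1 over (predicted, gold) pairs, one pair per document."""
    scores = [keyphrase_f1(p, g) for p, g in documents]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical example: two of four predictions match three annotations.
predicted = ["granite", "fault zone", "ore body", "drilling"]
gold = ["Granite", "ore body", "alteration"]
print(round(keyphrase_f1(predicted, gold), 3))
```

Averaging per-document scores, as `average_f1` does, is one common way to arrive at a single figure per dataset.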
Keywords/Search Tags: Geological big data, Geological ontology, Geological text, Geological topic, Spatiotemporal information, Word embedding