| Chinese geological texts are the main carrier of geological survey results and contain a wealth of geological information, such as mineralization prediction, mineral surveys, ore-controlling regularities, and prospecting directions. Data mining of Chinese geological texts can efficiently and precisely extract valuable geological information from massive text resources, laying a solid foundation for subsequent geological investigations. Chinese word segmentation is the basis of such data mining. By cutting Chinese geological texts into phrases with geological significance, it strengthens the computer’s ability to process text information and promotes the digitization of geological data. However, there is little research on Chinese word segmentation in the field of geology. Because of incomplete coverage of the core dictionary and other problems, the precision of word segmentation for Chinese geological texts is low and does not meet the needs of geologists. Strengthening research on word segmentation in the geological field and achieving precise segmentation of the related texts has therefore become an urgent need. To address the problem that general word segmentation methods rely heavily on the core dictionary and have difficulty identifying unregistered (out-of-vocabulary) words, this paper builds a word segmentation model based on term combination probability. Building on this model, a word segmentation model combining term combination probability and machine learning is proposed. Finally, based on the research results, a comprehensive word segmentation system covering format processing and word segmentation of Chinese geological texts is designed and developed. The main research contents of this paper are as follows: (1) Through experiments, this paper compares the word segmentation results and effects of general word segmentation methods, and summarizes the specific manifestations and main causes of the low word segmentation precision and poor word
segmentation effect of this type of method; these problems are then addressed to improve word segmentation precision. (2) By analyzing Chinese geological texts, this paper defines the term combination probability. Starting from the characteristics of Chinese geological texts and Chinese writing standards, the term combination probability is optimized, the segmentation characteristics of Chinese geological texts are strengthened, and a word segmentation model based on term combination probability is built under zero-sample conditions. The model's results are added to the core dictionary of the general word segmentation method as feature words to improve its word segmentation precision. (3) To further improve word segmentation precision, this paper constructs a geological corpus data set using the term combination probability word segmentation model and the general word segmentation method. Using the BiLSTM-CRF model, it trains a word segmentation model combining term combination probability and machine learning. Compared with the general word segmentation method, this model improves precision and effectiveness. To improve word segmentation precision within the limited geological corpus data set, this paper draws on iterative learning control theory to screen for unregistered words and add them to the geological corpus data set, further improving the model's word segmentation precision. (4) Based on the above research results, a comprehensive word segmentation system for the geological field is designed and developed. The system meets the requirements for processing Chinese geological texts and produces precise segmentation results. |