
Research On The Integrated Processing Technology Of Sentence Segmentation And Lexical Analysis Of Ancient Texts Based On Deep Learning

Posted on: 2021-02-02    Degree: Master    Type: Thesis
Country: China    Candidate: N Cheng    Full Text: PDF
GTID: 2435330647957489    Subject: Linguistics and Applied Linguistics

Abstract/Summary:
With the development of artificial intelligence technology and the growing number of digitized ancient texts, research on ancient Chinese information processing has received increasing attention. Ancient Chinese books carry a brilliant civilization and contain a wealth of linguistic knowledge, so organizing, processing, and studying them is an indispensable task. However, the volume of ancient books is enormous, and processing and analyzing them by manual means alone is time-consuming and labor-intensive. Automatically analyzing large-scale ancient Chinese texts with advanced technology can not only greatly reduce the burden of manual annotation, but also uncover textual features and regularities that are difficult to discover with the naked eye; it can further promote the machine's deep understanding of ancient texts and enable intelligent applications built on them.

In text analysis, the first problems to be solved are automatic sentence segmentation (for unpunctuated text) and automatic lexical analysis (including automatic word segmentation, part-of-speech tagging, named entity recognition, etc.). The quality of automatic segmentation and tagging determines not only whether in-depth knowledge mining of the text is possible, but also directly affects downstream natural language processing tasks (such as syntactic analysis and semantic analysis).

At present, automatic lexical analysis of modern Chinese has achieved good results, but several problems remain for ancient Chinese: (1) previous studies mostly use a single book as the experimental data set; the data sets are small and the resulting models generalize poorly, so they cannot handle large-scale, cross-era texts; (2) most research relies on traditional statistical learning methods, which depend heavily on hand-crafted features, and model accuracy still needs to be improved; (3) the vast majority of ancient texts are unpunctuated, so tasks such as lexical analysis must build on automatic sentence segmentation, yet previous research processed automatic sentence segmentation and lexical analysis in a "pipeline" manner, which ignores the deep dependencies among the tasks and is prone to multi-level error propagation and low efficiency.

To address these problems, we propose the following solutions.

(1) We expand the annotated corpus of ancient Chinese to provide data support for automatic sentence segmentation and lexical analysis. We select representative ancient texts from different eras and annotate the collected corpus with word segmentation, part-of-speech tags, and sentence boundaries through automatic tagging followed by manual proofreading. The annotated corpus contains 4.21 million words, which meets the needs of the model generalization experiments.

(2) Based on deep learning, we construct a framework for automatic sentence segmentation and lexical analysis of ancient Chinese and discuss in detail the technical design of sequence labeling models for lexical analysis of ancient Chinese. Deep learning models extract features automatically through multiple layers of non-linear transformations, avoiding the tedious feature engineering of traditional machine learning. By comparing the tagging performance of combinations of different network layers on ancient Chinese, we obtain the combination best suited to lexical analysis of ancient Chinese: a BERT vector is added at the input layer, a BiLSTM further extracts features, and a conditional random field (CRF) at the output layer produces the optimal tag sequence.
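The following minimal sketch illustrates this kind of BERT + BiLSTM + CRF character tagger in PyTorch, using the Hugging Face transformers library and the third-party pytorch-crf package. The checkpoint name bert-base-chinese, the LSTM hidden size, and the class name BertBiLSTMCRF are illustrative assumptions; the abstract does not specify the thesis's exact configuration.

    # Minimal sketch of a BERT + BiLSTM + CRF character-level tagger
    # (checkpoint name, hidden size, and class name are assumptions).
    import torch.nn as nn
    from transformers import BertModel   # pip install transformers
    from torchcrf import CRF             # pip install pytorch-crf

    class BertBiLSTMCRF(nn.Module):
        def __init__(self, num_tags, bert_name="bert-base-chinese", lstm_hidden=256):
            super().__init__()
            self.bert = BertModel.from_pretrained(bert_name)        # input layer: BERT vectors
            self.lstm = nn.LSTM(self.bert.config.hidden_size,       # middle layer: BiLSTM features
                                lstm_hidden, batch_first=True, bidirectional=True)
            self.fc = nn.Linear(2 * lstm_hidden, num_tags)          # per-character tag scores
            self.crf = CRF(num_tags, batch_first=True)              # output layer: CRF

        def forward(self, input_ids, attention_mask, tags=None):
            emb = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
            feats, _ = self.lstm(emb)
            emissions = self.fc(feats)
            mask = attention_mask.bool()
            if tags is not None:                                    # training: negative log-likelihood
                return -self.crf(emissions, tags, mask=mask, reduction="mean")
            return self.crf.decode(emissions, mask=mask)            # inference: best tag sequences

In a single-task setting the tag set encodes only one task (e.g., word boundaries); one common way to realize the integration described next is to predict composite tags covering all tasks in a single pass, as sketched after the abstract.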
(3) We adopt joint learning of sentence segmentation and lexical analysis to realize multi-task integration for ancient Chinese, and we develop an integrated lexical analysis platform. The system outputs automatic sentence segmentation, word segmentation, and part-of-speech tagging simultaneously, avoiding multi-level propagation of tagging errors and greatly improving processing efficiency. The feasibility of the integrated model's automatic sentence segmentation and lexical analysis was verified through single-task control experiments (e.g., word segmentation only) on different books. In addition, cross-text experiments comparing the mixed-corpus model with single-corpus models demonstrate the generalization ability of the mixed-corpus model. We further build a cross-era mixed-corpus model on a large-scale refined corpus and discuss the universality of its integrated annotation for texts of different eras.

In general, this paper implements an integrated method of sentence segmentation and lexical analysis for ancient Chinese and develops an integrated processing system. Based on the deep learning model, the constructed corpus is used to verify the model's effectiveness at sentence segmentation and lexical analysis and its ability to generalize to texts of different eras. The research shows that the integrated tagging method improves the performance of the sentence segmentation, word segmentation, and part-of-speech tagging tasks: their average F1 scores reach 90.71%, 92.33%, and 86.93%, which are 0.8%, 1.16%, and 0.44% higher than single-task processing, respectively. The mixed-corpus model can automatically tag texts of different eras, and its sentence segmentation performance is better overall than that of the single-corpus models. The realization of this integrated processing technology lays the foundation for knowledge mining, syntactic analysis, and semantic analysis of ancient Chinese.
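As an illustration of how one tagging pass can yield all three outputs, the sketch below decodes a hypothetical composite character-level tag scheme: each character carries a BMES word-boundary label, the part of speech of the word it belongs to, and an optional EOS marker at sentence ends. The tag format, the function decode_joint, and the example tags are assumptions made for illustration; the abstract does not give the actual tag set used in the thesis.

    # Illustrative decoder for a composite (joint) tag scheme; see assumptions above.
    def decode_joint(chars, tags):
        """Turn characters plus composite tags into sentences of (word, POS) pairs."""
        sentences, words, word, pos = [], [], "", None
        for ch, tag in zip(chars, tags):
            parts = tag.split("-")               # e.g. "B-v", "E-n", "S-r-EOS"
            boundary, pos = parts[0], parts[1]
            word += ch
            if boundary in ("E", "S"):           # word is complete (BMES scheme)
                words.append((word, pos))
                word = ""
            if parts[-1] == "EOS":               # sentence boundary reached
                sentences.append(words)
                words = []
        if word:                                 # flush any unfinished word
            words.append((word, pos))
        if words:                                # flush any unfinished sentence
            sentences.append(words)
        return sentences

    # Hypothetical example on an unpunctuated classical string:
    print(decode_joint(list("学而时习之"),
                       ["S-v", "S-c", "S-d", "S-v", "S-r-EOS"]))
    # -> [[('学', 'v'), ('而', 'c'), ('时', 'd'), ('习', 'v'), ('之', 'r')]]

A single sequence labeler trained over such composite tags lets the three tasks constrain one another, which is one way to account for the gains over pipeline processing reported above.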
Keywords/Search Tags:sentence segmentation of ancient Chinese, word segmentation, part-of-speech tagging, deep learning, ancient Chinese information processing