Research On Text Segmentation Of Language Model

Posted on:2016-11-04

Degree:Master

Type:Thesis

Country:China

Candidate:C Xu

Full Text:PDF

GTID:2298330467474747

Subject:Computer software and theory

Abstract/Summary:

The essence of text segmentation is to divide a text into several semantic paragraphs based on the similarity of its internal sub-topics, so that each paragraph has the greatest possible semantic consistency. The key issue in text segmentation is therefore to identify topic similarity and the boundaries of semantic paragraphs. In this thesis, we focus on the content and algorithm of feature extraction based on a language model, and use the Dotplotting algorithm to segment the text.

We first discuss the language model. A language model is a probabilistic model that has been widely used in character recognition, machine translation, information retrieval, and other techniques. It is well suited to estimating the probabilities of words in a text. Because words are the basic units of a text, shallow topic information can be obtained from word statistics. However, the language model considers only the probabilities of words; it does not compute the semantic correlation between words, so it cannot capture deeper topic information. To address this, we present an improved Bigram language model that uses an incidence matrix to bind words together, describe the correlation between sentences, and extract features, so that the similarity of sub-topics can be determined.

Because most words in a language are low-frequency words, using a language model for feature extraction inevitably runs into a data sparsity problem. When training on the corpus, maximum likelihood estimation assigns a probability of zero to any bigram that never occurs, so the Bigram model needs data smoothing. We discuss several commonly used smoothing algorithms and, to obtain better parameters, choose the Katz smoothing algorithm in light of the characteristics of the Bigram model.

We use the Dotplotting algorithm to segment the text. It considers the distribution of words in the document and global optimization, but it fails to take full account of the boundaries already determined when determining a new semantic paragraph, and the density obtained from a forward scan differs from that obtained from a backward scan. To address these problems, we improve the Dotplotting algorithm in three ways: adding the density value of the backward scan; introducing a penalty factor on paragraph length, since a semantic paragraph that is too short cannot describe a sub-topic well; and giving an improved density evaluation function.

In short, we study text segmentation in depth using the Bigram language model and the Dotplotting algorithm, based on the idea of word convergence. To address the defects of the original method, we propose an improved method and show, by comparing experimental results, that it raises the accuracy of text segmentation.
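
The sparsity problem mentioned in the abstract can be illustrated with a minimal sketch. The toy corpus and the function names `train_bigram`, `mle_prob`, and `add_one_prob` are assumptions made for illustration and do not come from the thesis. The sketch shows that maximum likelihood estimation gives an unseen bigram a probability of zero, and that smoothing moves some probability mass to it. Add-one (Laplace) smoothing is used here only as a simple stand-in; it is not the Katz algorithm that the thesis adopts.

```python
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left-hand contexts over tokenized sentences."""
    context_counts = Counter()  # how often each word appears as a history
    bigram_counts = Counter()   # how often each (previous, word) pair appears
    vocab = set()
    for tokens in sentences:
        vocab.update(tokens)
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts, vocab


def mle_prob(prev, word, context_counts, bigram_counts):
    """Maximum likelihood estimate: c(prev, word) / c(prev)."""
    total = context_counts[prev]
    return bigram_counts[(prev, word)] / total if total else 0.0


def add_one_prob(prev, word, context_counts, bigram_counts, vocab):
    """Add-one smoothing (a stand-in for Katz): (c(prev, word) + 1) / (c(prev) + |V|)."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))


# Hypothetical toy corpus, already tokenized.
corpus = [
    ["text", "segmentation", "finds", "topic", "boundaries"],
    ["topic", "boundaries", "split", "the", "text"],
]
context_counts, bigram_counts, vocab = train_bigram(corpus)

# ("topic", "text") never occurs in the corpus, so its MLE probability is zero.
unseen_mle = mle_prob("topic", "text", context_counts, bigram_counts)

# After smoothing, the same unseen bigram receives a small non-zero probability.
unseen_smoothed = add_one_prob("topic", "text", context_counts, bigram_counts, vocab)

print(unseen_mle, unseen_smoothed)
```

Katz smoothing differs from this stand-in in that it discounts the counts of observed bigrams and backs off to the unigram distribution for unseen ones, rather than adding a constant to every count.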

Keywords/Search Tags:

Text Segmentation, Sub-topic, Language Model, Dotplotting, Density Function

Related items

1	Study On Text Segmentation Based On Content
2	Text Segmentation Methods Based On Semantic Topic Guidance And Data Augmentation Training
3	Research On The Key Issues Of Text Segmentation And Its Application In Multi-document Summarization
4	Research On Topic Segmentation Techniques In Dialogue Text
5	Research On Bilingual Topic Model And Its Algorithm In Cross-language Information Retrieval
6	Tibetan-Chinese Cross-language Topic Detection And Tracking
7	Research And Application Of Topic Model For Short Texts Based On Part-of-Speech Feature And Semantic Enhancement
8	Research On The Method And Technique Of Chinese And Thai Cross-Language Topic Detection
9	Research On Application Of Improved Topic Segmentation Model In Teachers' Discourse Text Analysis
10	Research On Joint Learning Of Topic And Embedding Model