Research On Text Segmentation Of Language Model

Posted on: 2016-11-04
Degree: Master
Type: Thesis
Country: China
Candidate: C Xu
Full Text: PDF
GTID: 2298330467474747
Subject: Computer software and theory
Abstract/Summary:
The essence of text segmentation is to divide a text into several semantic paragraphs based on the similarity of its internal sub-topics, so that each paragraph has the greatest semantic consistency. The key issue in text segmentation is therefore to identify topic similarity and the boundaries of semantic paragraphs. In this paper, we focus on the content and algorithm of feature extraction based on the language model, and use the Dotplotting algorithm to segment the text.

We first discuss the language model. A language model is a probabilistic model that has been widely used in character recognition, machine translation, information retrieval and other techniques. It handles the probabilities of the words in a text well. Because words are the basic units of a text, we can obtain shallow topic information from word statistics. However, the language model considers only the probability of words; it does not calculate the semantic correlation between words, so it cannot capture deep topic information. For this reason, we present an improved Bigram language model that uses an incidence matrix to bind words together, describes the correlation between sentences, and extracts features, so that the similarity of sub-topics can be determined.

When the language model is used for feature extraction, a data sparsity problem inevitably arises, because the majority of words in a language are low-frequency words. During training on the corpus, maximum likelihood estimation assigns zero probability to any bigram that never occurs, so the Bigram model needs data smoothing. In this paper, we discuss several commonly used smoothing algorithms. To obtain better parameters, we consider the characteristics of the Bigram model and choose the Katz smoothing algorithm.

In this paper, we use the Dotplotting algorithm to segment the text.
This algorithm considers the distribution of words in the document and global optimization, but it fails to take full account of the boundaries already determined when it determines a new semantic paragraph. In addition, the density obtained from a forward scan differs from that obtained from a backward scan. To address these problems, we make several improvements to the Dotplotting algorithm: increasing the density value of the backward scan; adding a penalty factor on paragraph length, since a paragraph that is too short cannot describe a sub-topic well; and giving an improved density evaluation function.

In short, we study text segmentation in depth using the Bigram language model and the Dotplotting algorithm, based on the idea of word convergence. To address the defects of the original method, we propose an improved method and show, by comparing experimental results, that it raises the accuracy of text segmentation.
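The zero-probability problem that motivates smoothing can be illustrated with a small sketch. It is not the Katz algorithm chosen in the thesis: Katz derives its discounts from Good-Turing counts, whereas this example assumes a fixed discount (`DISCOUNT`, a hypothetical value) and backs off to unigram frequencies. The function names and the toy corpus are also assumptions made for illustration.

```python
from collections import Counter

# Hypothetical fixed discount. Katz smoothing would derive discounts from
# Good-Turing counts instead; this is a simplified back-off stand-in.
DISCOUNT = 0.5

def train(sentences):
    """Count histories, bigrams, and predicted words from whitespace-tokenised sentences."""
    history, bigram, word = Counter(), Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        history.update(tokens[:-1])          # tokens that have a successor
        word.update(tokens[1:])              # tokens that can be predicted
        bigram.update(zip(tokens, tokens[1:]))
    return history, bigram, word

def p_mle(prev, w, history, bigram):
    """Maximum likelihood estimate: zero for every bigram absent from training."""
    return bigram[(prev, w)] / history[prev] if history[prev] else 0.0

def p_backoff(prev, w, history, bigram, word):
    """Discounted bigram estimate that backs off to unigram frequency when unseen."""
    total = sum(word.values())
    if history[prev] == 0:
        return word[w] / total               # unknown history: plain unigram
    count = bigram[(prev, w)]
    if count > 0:
        return (count - DISCOUNT) / history[prev]
    # Probability mass freed by discounting the successors seen after `prev`.
    seen = [v for v in word if bigram[(prev, v)] > 0]
    reserved = DISCOUNT * len(seen) / history[prev]
    # Unigram mass of the unseen successors, used to renormalise the back-off.
    unseen_mass = 1.0 - sum(word[v] for v in seen) / total
    return reserved * (word[w] / total) / unseen_mass if unseen_mass > 0 else 0.0

history, bigram, word = train(["the cat sat", "the dog sat", "a cat ran"])
print(p_mle("dog", "ran", history, bigram))            # unseen bigram under MLE
print(p_backoff("dog", "ran", history, bigram, word))  # same bigram after back-off
```

In this toy corpus the bigram "dog ran" never occurs, so the maximum likelihood estimate is zero, while the back-off estimate is small but positive and the distribution over successors of "dog" still sums to one. Words that never appear in training at all still receive zero probability in this sketch.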
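The density idea behind Dotplotting can be sketched as follows. This is a rough illustration of one common formulation of a forward-scan density (word overlap between each segment and the text after it, normalised by the area of that region), not the improved density function proposed in the thesis. It assumes that positions are counted in sentences rather than words, and all function names and the toy data are hypothetical.

```python
from collections import Counter

def count_vector(units, start, end):
    """Word-count vector for the units (tokenised sentences) in [start, end)."""
    counts = Counter()
    for unit in units[start:end]:
        counts.update(unit)
    return counts

def dot(a, b):
    """Dot product of two word-count vectors."""
    return sum(n * b[w] for w, n in a.items())

def forward_density(units, boundaries):
    """Sum, over segments, of the overlap with all following text per unit area."""
    n = len(units)
    cuts = [0] + sorted(boundaries)
    total = 0.0
    for left, right in zip(cuts, cuts[1:]):
        segment = count_vector(units, left, right)
        rest = count_vector(units, right, n)
        total += dot(segment, rest) / ((right - left) * (n - right))
    return total

def best_next_boundary(units, boundaries):
    """Greedy step: the unused position whose addition gives the lowest density."""
    candidates = [i for i in range(1, len(units)) if i not in boundaries]
    return min(candidates, key=lambda i: forward_density(units, boundaries + [i]))

sentences = [
    ["cat", "purr"],
    ["cat", "milk"],
    ["stock", "market"],
    ["stock", "price"],
]
print(best_next_boundary(sentences, []))
```

For the four toy sentences, the position between the second and third sentence is selected, because the two halves share no words and the density there is zero. A penalty on very short segments or a separate backward-scan term, as described in the abstract, would be added to `forward_density`; their exact form is not given in this text, so they are left out.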
Keywords/Search Tags: Text Segmentation, Sub-topic, Language Model, Dotplotting, Density Function