Research On The Calculation Method Of Similarity Based On The Fusion Of Tibetan Text Segment

Posted on:2017-03-20

Degree:Master

Type:Thesis

Country:China

Candidate:M Q Wu

Full Text:PDF

GTID:2308330491456702

Subject:Computer software and theory

Abstract/Summary:

Similarity computation is a basic technology in information processing, underlying tasks such as data mining, machine translation, automatic question answering, and query retrieval. In Tibetan information processing, however, very few similarity calculation methods exist. Building on an analysis of existing segment-fusion similarity methods for Chinese, this thesis proposes a segment-fusion similarity method for Tibetan. The method takes the paragraph as its unit: each paragraph is treated approximately as a short text, the similarity between short texts is calculated, and the similarity between the two long texts is then derived from these values, giving the similarity value of the two Tibetan texts.

The technical route is as follows. Given two Tibetan texts, stop words are removed and the feature dimensionality is reduced for each, and the words of each paragraph are then filtered by the specified parts of speech, yielding all qualifying paragraphs of the two texts. Next, the number of feature words and the TF (term frequency) values are calculated, and the TF values are normalized. The weight of each word in each paragraph is then calculated from its TF value and some related parameters. Finally, these weights are used to calculate the paragraph-to-paragraph similarity values and the similarity matrix of the two texts, and a series of refinement steps on the matrix yields the similarity value of the two Tibetan texts.

The text similarity calculation is then extended to Tibetan sentences: sentence similarities are computed and fused into paragraph similarities, and the paragraph similarities are in turn fused into the text similarity.
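The paragraph-level pipeline described above can be illustrated with a minimal Python sketch. The abstract does not give the weighting formula, the similarity measure, or the fusion rule, so everything specific here is an assumption: weights are simply max-normalized TF values, paragraph similarity is cosine similarity, and fusion averages each paragraph's best match in both directions. The function names are hypothetical, and the inputs are assumed to be already word-segmented paragraphs (lists of tokens) with part-of-speech filtering done upstream.

```python
import math
from collections import Counter


def normalized_tf(tokens):
    """Term frequencies scaled by the count of the most frequent term.
    Assumption: the thesis may use a different normalization and extra parameters."""
    counts = Counter(tokens)
    if not counts:
        return {}
    peak = max(counts.values())
    return {word: n / peak for word, n in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse weight vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(paras_a, paras_b):
    """Entry [i][j] is the similarity of paragraph i of text A and paragraph j of text B."""
    vecs_a = [normalized_tf(p) for p in paras_a]
    vecs_b = [normalized_tf(p) for p in paras_b]
    return [[cosine(x, y) for y in vecs_b] for x in vecs_a]


def fuse(matrix):
    """Assumed fusion rule: mean of every paragraph's best match, in both directions."""
    if not matrix or not matrix[0]:
        return 0.0
    row_best = [max(row) for row in matrix]
    col_best = [max(col) for col in zip(*matrix)]
    return (sum(row_best) + sum(col_best)) / (len(row_best) + len(col_best))


def text_similarity(paras_a, paras_b, stop_words=frozenset()):
    """Remove stop words, build the paragraph similarity matrix, and fuse it."""
    clean_a = [[t for t in p if t not in stop_words] for p in paras_a]
    clean_b = [[t for t in p if t not in stop_words] for p in paras_b]
    return fuse(similarity_matrix(clean_a, clean_b))
```

Under these assumptions, `text_similarity` takes two lists of tokenized paragraphs and returns a value between 0 and 1. Applying the same functions to sentences within a paragraph would mirror the sentence-to-paragraph fusion mentioned above, again only as an illustration.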
In addition, the thesis attempts to build a more complex similarity model system that can find the similar sentences in two Tibetan texts and list exactly which sentences are similar.

The experimental results are evaluated by precision, recall, and F1 value. Because the experimental corpus is closed, the test can only give an approximate value. 150 test texts were randomly selected from a well-classified corpus, and the F1 value reached 67.86%, lying between the precision and the recall, which are roughly equal. The experimental results show that the method is effective to a certain degree.
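For reference, the evaluation measures named above can be computed as in the following Python sketch. The function name and the counts in the test are hypothetical and are not taken from the thesis. Because F1 is the harmonic mean of precision and recall, it always lies between the two, which matches the observation in the abstract.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) from raw counts.
    Counts are illustrative inputs, not results from the thesis."""
    predicted = true_pos + false_pos   # items the system marked as similar
    relevant = true_pos + false_neg    # items that are actually similar
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / relevant if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

When precision and recall are roughly equal, as the abstract reports, F1 is close to both values.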

Keywords/Search Tags:

Tibetan text, Number of feature words, Weight value, Similarity computing, Segment fusion

Related items

1	Research On Text Similarity Algorithm Based On Vector Space Model
2	Research On Calculation Of Semantic Similarity Of Short Text Based On Feature Fusion
3	Research On Tibetan Text Classification Technology Based On TWC_CNN
4	Chinese Text Clustering Based On Text Similarity
5	Research On Semantic Similarity And Feature Weight Relation In Text Classification
6	Research On Chinese Text Similarity Detection Technology Based On Word Weight Analysis
7	Sentiment Analysis Of Tibetan Weibo Based On Multi-feature Fusion
8	Chinese Text Similarity Research Based On Semantic And Text Structure
9	Research On Tibetan Text Classification Technology Based On Phrase Features And Polynomial Naive Bayes
10	A Study On The Sentiment Orientation Of Tibetan Short Texts