A Corpus-Based Study of One-Word and Three-Word Structures That Can Stand Alone as Sentences

Posted on: 2015-01-29  Degree: Master  Type: Thesis
Country: China  Candidate: H W Chu  Full Text: PDF
GTID: 2255330428465540  Subject: Linguistics and Applied Linguistics
Abstract/Summary:
One-word sentences and three-word sentences are special linguistic phenomena that occur in all kinds of texts and merit study. The main work of this thesis is a statistical analysis of one-word and three-word sentences in a large Chinese corpus, all of it built on word segmentation and part-of-speech tagging of a large number of Chinese texts.

The thesis is divided into five chapters.

Chapter one summarizes the development of corpora, corpus linguistics, and word-frequency statistics, and briefly reviews the state of corpus and word-frequency research in China and abroad. It also introduces the purpose and significance of the research and the structure of the thesis.

Chapter two briefly defines Chinese word segmentation, then discusses its difficulties and how segmentation quality is evaluated. On this basis, it argues that the segmentation tool used in this thesis performs well: its accuracy is relatively high, so some segmentation noise in the corpus texts can be ignored.

Chapters three and four are the core of the thesis; their conclusions rest on experiments. In chapter three, I extracted the one-word sentences from the test texts and used the resulting statistics to draw conclusions about single words used as sentences. I also classified these words by part of speech and calculated the proportion of each class. Then, by computing a conditional probability for each part of speech, I calculated the ratio of the number of times words of that part of speech occur independently in actual use to their total number of occurrences, and identified which words stand alone with high probability.

Chapter four follows an experimental procedure similar to that of chapter three. In this chapter, I extracted three-word sentences from a small test text, compiled the corresponding vocabulary, and calculated the t-test value of each entry. I then set a threshold to analyze whether the first or the third word is more inclined to combine with the second word, which yields two types: (A+B)+C and A+(B+C). I classified the three-word structures by structural type and, finally, used statistics to determine which types are more likely to form sentences.

Chapter five gives the conclusion and outlook: it summarizes the conclusions of the two main chapters, then describes the unfinished work and directions for future work.

After processing the large-scale Chinese corpus, we found that for high-frequency words and IF words that can form sentences, the more often a word occurs in the text, the greater its chance of forming a sentence. For low-frequency words, the chance of forming a sentence remains basically unchanged as their occurrences in the texts decrease. Most content words can form sentences, but nouns, verbs, adjectives, etc. are more likely to combine with other components, so their chance of forming a sentence on their own is relatively small. In contrast, onomatopoeia, interjections, etc. rarely occur as sentences in absolute terms, but because their total number of occurrences is also low, their chance of forming a sentence independently is relatively greater.

Three-word sentences fall into two types: (A+B)+C and A+(B+C). Within a certain threshold range, the t-test value determines which type a sentence belongs to. Beyond that range, the t-test value alone cannot determine the type, and the specific phrase has to be examined. In addition, in terms of sentence structure, three-word sentences appear mostly as subject-predicate, verb-object, and modification structures.
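The part-of-speech ratio described for chapter three (independent occurrences divided by total occurrences) could be computed roughly as follows. This is a minimal sketch, not the thesis's implementation: the input format (sentences as lists of `(word, pos)` pairs with punctuation already removed), the function name `standalone_ratio_by_pos`, and the toy data with its tag letters are all assumptions made for illustration.

```python
from collections import Counter


def standalone_ratio_by_pos(tagged_sentences):
    """Estimate P(word stands alone as a sentence | part of speech).

    tagged_sentences: iterable of sentences, each a list of (word, pos)
    pairs. This format is an assumption, not the thesis's data format.
    """
    total = Counter()       # all occurrences of each part of speech
    standalone = Counter()  # occurrences that make up a whole sentence
    for sentence in tagged_sentences:
        for _word, pos in sentence:
            total[pos] += 1
        if len(sentence) == 1:
            standalone[sentence[0][1]] += 1
    return {pos: standalone[pos] / total[pos] for pos in total}


# Hypothetical toy corpus; the tags (a, r, d, e, v, n) are illustrative only.
toy_corpus = [
    [("好", "a")],
    [("他", "r"), ("很", "d"), ("好", "a")],
    [("哎呀", "e")],
    [("我", "r"), ("吃", "v"), ("饭", "n")],
]
ratios = standalone_ratio_by_pos(toy_corpus)
```

On this toy corpus the interjection tag gets a higher ratio than the adjective tag even though both stand alone once, which mirrors the abstract's point that a low total count can make the independent-sentence ratio relatively high.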
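The abstract does not give the t-test formula or the threshold used for three-word structures. As an illustration only, the sketch below uses one formulation that appears in the Chinese word segmentation literature: the difference between P(C|B) and P(B|A), divided by the square root of the summed variance estimates. The variance estimates, the function names `t_score` and `bracket`, the threshold value, and the rule that a score of small magnitude is left undecided are all assumptions, not details taken from the thesis.

```python
import math


def t_score(count_a, count_b, count_ab, count_bc):
    """Score whether middle word B leans toward A (negative) or C (positive).

    count_a, count_b: corpus frequencies of words A and B.
    count_ab, count_bc: corpus frequencies of the bigrams "A B" and "B C".
    Variances are approximated as count_bc / count_b**2 and
    count_ab / count_a**2 (an assumed, commonly used estimate).
    """
    if count_a <= 0 or count_b <= 0:
        raise ValueError("unigram counts must be positive")
    p_c_given_b = count_bc / count_b
    p_b_given_a = count_ab / count_a
    variance = count_bc / count_b ** 2 + count_ab / count_a ** 2
    if variance == 0:
        return 0.0  # neither bigram was observed, so there is no evidence
    return (p_c_given_b - p_b_given_a) / math.sqrt(variance)


def bracket(score, threshold):
    """Map a score to a structure type; the threshold is hypothetical."""
    if score > threshold:
        return "A+(B+C)"
    if score < -threshold:
        return "(A+B)+C"
    return "undecided"  # inspect the specific phrase by hand


# Hypothetical counts: B is followed by C far more often than A precedes B.
example_type = bracket(t_score(100, 50, 5, 40), threshold=1.0)
```

With these made-up counts B binds more strongly to C, so the sketch returns the A+(B+C) type; swapping the evidence toward the "A B" bigram gives (A+B)+C, and balanced evidence is left for manual inspection.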
Keywords/Search Tags: corpus, statistics, one-word structures, three-word structures, types