
A Research On The Extraction Of The Valid String: Based On The Dynamic Circulating Corpus

Posted on: 2005-01-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Sui
Full Text: PDF
GTID: 1115360125951106
Subject: Linguistics and Applied Linguistics
Abstract/Summary:
The goal of this dissertation is to study the extraction of valid strings from natural language corpora. The study is based on the new concept of the valid string and the theory of the degree of circulation, and is supported by the Dynamic Circulating Corpus.

A valid string is not a unit of grammar but a unit of language communication and understanding. Most grammatical units, such as a word, a phrase or a chunk, may be used independently in communication and be understood as valid strings. There are also valid strings that are combinations of these basic grammatical units. On the surface, a valid string is a grammatical unit or a combination of several units. A valid string is not a static item waiting to be used but a dynamic unit in actual language use. By monitoring the use of valid strings in a large-scale, real-time natural language corpus, actual language use can be monitored indirectly, and the goal of dynamic language knowledge updating can eventually be reached.

The concept of the valid string is defined in terms of not only grammar but also cognitive psychology and the study of mass media. It is based on the curves of the frequency, distribution and circulation of valid strings.

A sentence fragment corpus was built for this study, and all potential strings were extracted using an all-round combination strategy. The combined strings were then compared with a circulation curve model to determine their validity.

The Dynamic Circulating Corpus built for this study consists of data from ten newspapers (January to June 2003), with 8,687,925 entries that have an average length of 16 characters, for a total of 8,687,925 × 16 = 139,006,800 characters. The data are stored by date.

Software for processing the Dynamic Circulating Corpus was designed for the study; it consists of several modules for identifying and combining potential valid strings.

A total of 157,661 valid strings were extracted from the corpus, and the validity rate is 80.21%.

The contributions of this dissertation are:
1. to have defined the concept of the valid string on the basis of cognition;
2. to have analyzed and posited three models of the curve for valid strings;
3. to have established a method for the extraction and evaluation of valid strings; and
4. to have built a Dynamic Circulating Corpus based on the sentence fragment corpus.
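The abstract says that all potential strings are enumerated from sentence fragments and that their use is tracked over dated newspaper data. It does not give the procedure itself. The following is only a minimal sketch of that idea, not the dissertation's software. It assumes each character is a combinable unit and that a candidate is any contiguous run of units. The function names (`candidate_strings`, `usage_by_date`, `summarize`) and the length bounds are hypothetical. The circulation curve model and the formula for the degree of circulation are not reproduced here, because the abstract does not state them.

```python
from collections import Counter, defaultdict


def candidate_strings(fragment, min_len=2, max_len=8):
    """Yield every contiguous substring of `fragment` whose length lies
    between min_len and max_len (assumed bounds, not from the source)."""
    n = len(fragment)
    for start in range(n):
        longest_end = min(n, start + max_len)
        for end in range(start + min_len, longest_end + 1):
            yield fragment[start:end]


def usage_by_date(dated_fragments, min_len=2, max_len=8):
    """Map each candidate string to a Counter of {date: occurrences}.

    `dated_fragments` is an iterable of (date, fragment) pairs, mirroring
    the abstract's statement that the corpus data are stored by date.
    """
    usage = defaultdict(Counter)
    for date, fragment in dated_fragments:
        for s in candidate_strings(fragment, min_len, max_len):
            usage[s][date] += 1
    return usage


def summarize(usage):
    """Return {string: (total frequency, number of distinct dates)}.

    These two numbers are only raw ingredients for a frequency or
    distribution curve; deciding validity would need the curve model.
    """
    return {s: (sum(c.values()), len(c)) for s, c in usage.items()}


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the corpus.
    sample = [
        ("2003-01-01", "abc"),
        ("2003-01-02", "abc"),
        ("2003-01-02", "bcd"),
    ]
    for string, (freq, days) in sorted(summarize(usage_by_date(sample, 2, 3)).items()):
        print(string, freq, days)
```

Keeping a per-date count for every candidate, rather than a single total, is what allows a usage curve over time to be drawn for each string and then compared against a model.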
Keywords/Search Tags: corpus, dynamic language knowledge updating, degree of circulation, valid string, combination