
Constructing A Tag Set For Chinese Learner's English Corpus

Posted on: 2005-06-18  Degree: Master  Type: Thesis
Country: China  Candidate: L Tang  Full Text: PDF
GTID: 2155360125458584  Subject: Foreign Linguistics and Applied Linguistics
Abstract/Summary:
Since its inception in Europe, corpus linguistics has become one of the prevailing branches of linguistics. The earliest applications of corpora in linguistics and psycholinguistics date from around the 1950s and were prompted by the needs of information science. Leech (1992) stated that linguistic corpora would play a pioneering role in linguistic research and bring about a new way of thinking about language. Svartvik (1996) acknowledged that the corpus was gaining a dominant position in linguistic research because it offered a fresh philosophy of thinking. In 2003, Jurafsky published his book entitled Probabilistic Linguistics, in which he claimed that various evidence had shown language to be probabilistic by nature. In China, linguistic corpora were introduced into research on language learning, and the CLEC (Chinese Learner English Corpus) is one of the largest corpora of this kind in the country. Gui Shichun (2004) confirmed that probability is one of the core theories for explaining linguistic facts. As a result, the approach of synthesizing and weighting has been widely used in the statistical processing of corpus research.
In this research, we conducted a corpus-based study using a "theory-then-research approach" and developed an error-identifying tag set based on a selection of more than 1,300 students' written materials, chosen at random from different grades in both in-class and after-school settings at Hunan University. The tagged materials, all written assignments, total 200,000 words. Existing corpora (mainly the BNC, with both its Basic Tag Set and its Enriched Tag Set) provide the basis for the Hunan U Tag Set, which identifies syntactic, morphological, lexical, and discoursal errors.
Applying the Hunan U Tag Set, the author examined the students' written materials and made a quantitative analysis of the identified errors. To analyze the second language learning of Chinese students at Hunan University, the research puts forward its own set of error-identifying tags, designed for the university's pedagogical purposes and compared with the tag sets of other corpora.
First, the present research builds an English learner's corpus (called the Hunan U Corpus) and investigates the inappropriate usage in it, i.e. mistakes, errors, and Chinglish, focusing on the students' written sources in order to extract related data from the samples. Second, by setting up a series of error-identifying tag sets suited to the specific subjects in the Hunan U Corpus, we obtain the error frequency of each category, identify the subjects' common weak points, and shed light on how the process of English language learning can be strengthened. The data obtained with the tag sets, for instance the error rates, the specific obstacles in language learning, and the most frequent errors, can therefore be used to evaluate pedagogical significance in second language teaching and learning.
To ensure the reliability of the statistics, the study applies "test-retest reliability" in selecting two kinds of samples at random from the Hunan U Corpus: individual samples and group samples. Synthesizing and weighting the statistics drawn from the samples explains abstract linguistic facts more objectively and more systematically than the massive, chaotic raw linguistic data can. It encourages language teachers to understand various language phenomena through the probabilistic modeling of linguistic data in teaching and learning.
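The per-category error frequency described above can be illustrated with a minimal sketch. It is not the thesis's procedure: the bracketed tag format, the tag names `morph` and `syn`, the function name `error_frequencies`, and the per-1,000-words normalization are all assumptions made for illustration, since the abstract does not give the actual form of the Hunan U Tag Set.

```python
import re
from collections import Counter

# Hypothetical tag format: an error tag in square brackets after the
# erroneous word, e.g. "buyed [morph]". The real Hunan U tags are not
# specified in the abstract.
TAG_PATTERN = re.compile(r"\[([a-z]+)\]")


def error_frequencies(tagged_text):
    """Return (error counts per tag, errors per 1,000 words per tag)."""
    counts = Counter(TAG_PATTERN.findall(tagged_text))
    # Count words with the tags removed, so tags do not inflate the total.
    total_words = len(TAG_PATTERN.sub(" ", tagged_text).split())
    if total_words == 0:
        return counts, {}
    rates = {tag: 1000 * n / total_words for tag, n in counts.items()}
    return counts, rates


# Invented sample sentence, not taken from the corpus.
sample = "He go [morph] to school yesterday and buyed [morph] a books [syn]"
counts, rates = error_frequencies(sample)
most_common_tag, most_common_count = counts.most_common(1)[0]
```

Sorting the counts, as `most_common` does here, is one simple way to surface the most frequent error categories that the abstract treats as the learners' common weak points.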
Moreover, the Hunan U Tag Set itself reflects the specific features of second language learning in writing and also provides useful linguistic statistics on where the weak points lie: both of the individ...
Keywords/Search Tags: error-identification tag set, learners' corpus, statistical calculation, error frequency