
Study On Corpus

Posted on: 2004-06-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T T He
Full Text: PDF
GTID: 1115360092493150
Subject: Linguistics and Applied Linguistics
Abstract/Summary:
The present paper is a study of the corpus proper. It draws on linguistic theory and on the principles of software engineering and databases. With the help of theories and methods from related disciplines and of previous research findings, it analyzes several well-known corpora, examines academic and practical issues in corpus construction, and discusses how to construct corpora for linguistic research.

A corpus is a representative collection of linguistic material, structured in some way for application. It is sufficiently large and machine-readable.

The core of a corpus system is the corpus itself. The system also includes hardware, software, the users of the corpus, and the rules for collecting and processing linguistic material. The parts of a corpus system affect and constrain one another, and together they determine the quality and value of the corpus.

The development of a large corpus can be regarded as a software engineering project and should therefore follow the principles and methods of software engineering. Because it also has special features of its own, it can be called Corpus Engineering. The life cycle of a corpus engineering project can be divided into seven phases: planning, needs analysis, design, collection of linguistic material, implementation, annotation, and use and maintenance.

Balanced corpora have the following characteristics: authenticity of the linguistic material, finiteness of the number of samples, representativeness of the corpus, and balance of structure. Authenticity of the linguistic material is the basis, finiteness of the sample is the reality, and representativeness of the corpus is the goal, while balance is the means of achieving that goal.

The stream of linguistic material is all the linguistic material produced continually by one or several web sites on the Internet. As the stream passes through monitor programs, the programs extract useful information from it. Whether the stream itself is stored depends on need. The mechanism of the linguistic material stream is similar to that of the human brain. Constructing a monitor corpus based on the stream of linguistic material is useful for discovering new linguistic phenomena.

The normalization of corpora is the key to making corpora sharable and thus to reducing duplicated corpus construction. Normalizing corpus metadata is the easier step and can be done first. Corpus metadata can be divided into six classes: information about copyright, information about the background of the creator of the linguistic material, information about the medium of the linguistic information, information about the content of the linguistic material, information about the collection of the linguistic material, and information about the management of the linguistic material.

The general rules for annotating a corpus are: independence of the original linguistic material from the annotation symbols, publicity of the corpus, generality, compromise, consistency, correctness of annotation symbols, and the user's right to know everything about the corpus.

In the process of annotating a corpus, the following relations are important: detailed and simple, general and specific, principled and flexible, absolute and indefinite.

HNC theory sets up a network of concepts that can be used to describe the meanings of words and the associations between different concepts. The study of a formalized definition of HNC concept expressions aims at building a system of word-level semantic knowledge for automatic semantic annotation, so that semantic annotation symbols can be formalized and word meanings made computable.

The development of application software for corpora can promote corpus-based linguistic study. It is an important aspect of corpus research and deserves more attention.
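
As an illustration of the six metadata classes named in the abstract, a minimal Python sketch of a metadata record might look as follows. Only the six class names come from the abstract; the type name `CorpusMetadata`, the helper `missing_classes`, and the keys stored inside each class (`holder`, `genre`) are hypothetical and are not taken from the dissertation.

```python
from dataclasses import dataclass, field

# The six class names follow the abstract; everything stored inside a class
# is a hypothetical placeholder chosen only for illustration.
METADATA_CLASSES = (
    "copyright",           # information about copyright
    "creator_background",  # background of the creator of the material
    "medium",              # medium of the linguistic information
    "content",             # content of the linguistic material
    "collection",          # how the material was collected
    "management",          # how the material is managed
)


@dataclass
class CorpusMetadata:
    """One metadata record for a sample, grouped into the six classes."""
    copyright: dict = field(default_factory=dict)
    creator_background: dict = field(default_factory=dict)
    medium: dict = field(default_factory=dict)
    content: dict = field(default_factory=dict)
    collection: dict = field(default_factory=dict)
    management: dict = field(default_factory=dict)

    def missing_classes(self):
        """Return the names of metadata classes that hold no information yet."""
        return [name for name in METADATA_CLASSES if not getattr(self, name)]


# Hypothetical record with only two of the six classes filled in.
record = CorpusMetadata(
    copyright={"holder": "unknown"},
    content={"genre": "news"},
)
print(record.missing_classes())
```

A check such as `missing_classes` is one simple way a shared metadata scheme could flag incomplete records before corpora are exchanged; the dissertation itself may define the normalization quite differently.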
Keywords/Search Tags: corpus, corpus system, annotation of corpus, normalization