With the growth of China’s soft power, international Chinese teaching has entered a golden age of development. Vocabulary teaching is an important part of Chinese language teaching and has long been an active research topic. However, existing word lists are compiled essentially on the basis of simple Absolute Frequency statistics. As a result, they neither reveal real pragmatic rules nor accord with the objective cognitive rules of vocabulary acquisition. Building a Basic-level Category Vocabulary (hereinafter referred to as BLCV) lexicon on the basis of cognitive theory and data analysis technology therefore has both theoretical and practical value, and is both important and urgent.

This study uses content analysis, controlled experiments, comparison, definition, mathematical statistics, and other specific research methods. It starts from large-scale corpora, an important carrier of a group’s cognitive experience, and applies programmed, practical quantitative analysis and Natural Language Processing techniques in an effort to make the study of BLCV detailed, objective, and accurate.

In constructing the lexicon, the cognitive category hierarchy is considered first. Then, based on large-scale corpora and the distinction between Absolute Frequency and Relative Frequency, the Relative Frequency Locating Method is used to locate candidate BLCV. All BLCV are then extracted through manual verification. In classifying BLCV, Absolute Frequency is no longer used on its own; classification operates in three dimensions, namely Pragmatic Load, Inverse Document Frequency, and the Term Frequency-Pragmatic Load Ratio, in order to guarantee that the classification results are common, universal and free.
This ensures that the lexicon accords with the cognitive characteristics of Chinese language learners and provides a clear hierarchy for compiling international Chinese teaching materials, designing teaching methods, language testing, and compiling reference works. In addition, this research studies the features of BLCV, including word length, structure type, entropy, self-information, pragmatic collocation, acquisition order, and word sources. The thesis closes by summarizing its content and discussing the application value of the BLCV lexicon.

The innovations of the thesis are as follows. In the BLCV extraction process, Absolute Frequency and Relative Frequency are proposed and strictly distinguished as a pair of concepts. Whereas previous word lists were compiled with Absolute Frequency as the only dimension, Relative Frequency is used here to reflect cognitive rules and so meet learners’ actual needs. In the BLCV classification process, Seeming Productivity and Actual Productivity are proposed as a pair of concepts: the former is productivity based on computer text matching, and the latter is productivity based on checking the unity between a key word and its derivative words. The latter is shown to be the true reflection of vocabulary productivity. In addition, a BLCV classification system comprising Pragmatic Load, the Inverse Document Frequency index, and the Term Frequency-Pragmatic Load Index is established, so that the BLCV classification problem can be solved systematically. Moreover, several key technologies (Web Crawler, corpus techniques, Text Clustering, PageRank, and Data Smoothing) and several essential concepts (Inverse Document Frequency, Self-information, Entropy, and Graph Theory) are applied in BLCV classification, confirmation, and feature mining, which supports the efficiency, scientific rigor, and objectivity of the work.
The relation between word semantics and grammar is also studied and used to judge productivity; compared with Seeming Productivity, this kind of judgment has a higher accuracy rate. Finally, constructing a diachronic corpus to study the sources and development of BLCV is something researchers have not done before. To some extent, therefore, this thesis is pioneering.

Corpora are an important support for this thesis, which adheres throughout to the idea of customized corpora. Three corpora at the billion-word level are established: the Large-scale Text Corpus, the Web Text Classified Corpus, and the Diachronic Corpus. In addition, one corpus at the million-word level, the Pupil’s Essay Corpus, and one lexicon at the million-word level, the Collocation Lexicon, are established. Each corpus and lexicon targets a specific demand, with its own kind of text, data type, and application method. The idea of customized corpora is therefore an essential feature of this thesis.

Through the construction of the BLCV lexicon, this thesis studies the extraction, classification, features, collocation, acquisition, and sources of BLCV at the theoretical level, which can serve as a reference for research on Chinese lexical semantics and on vocabulary teaching in international Chinese teaching. In application, the BLCV lexicon can be used directly for language testing and for compiling textbooks and reference books. In form, this study can also serve as a typical example for teaching and research, and for constructing and sharing resource platforms, in international Chinese teaching. Constructing the BLCV lexicon is therefore an innovative way to meet the actual demands of international Chinese teaching.