The New Chinese Proficiency Test (New HSK) was first launched in 2009. Following the principle of combining examination with teaching, and aiming to promote both teaching and learning through examination, the New HSK tests the ability of candidates whose first language is not Chinese to communicate in Chinese. Statistical analysis of word frequency in the New HSK is of great significance for evaluating the test and improving the HSK vocabulary syllabus. This paper mainly uses mathematical statistics and quantitative analysis to examine 30 New HSK exam papers (5 sets each for levels 1-6, about 240,000 words in total) published by Hanban/Confucius Institute Headquarters in 2018. Two of the most popular Python-based Chinese word segmentation tools, pkuseg and jieba, are used for word segmentation and word frequency statistics on the New HSK exams. Based on the New HSK vocabulary syllabus, the number of words, the word frequency at each level, the coverage rate, the super-class (out-of-syllabus) words, and other data are analyzed, leading to the following conclusions: 1. Pkuseg is more suitable for word segmentation applications in the field of TCSOL without prior model training. 2. The overall word frequency distribution of the New HSK is relatively reasonable, and statistical results within the same level are similar; however, the HSK5 and HSK6 papers show problems such as a low usage rate of syllabus words, insufficient coverage, and a large number of super-class words. 3. The word frequency statistics suggest that the New HSK vocabulary syllabus could be improved in terms of its levels, organization, and word-inclusion rules. In view of these problems, this paper puts forward suggestions such as improving the organization and inclusion method of the syllabus, adding a list of cultural terms, revising the syllabus regularly, and increasing the number of syllabus levels.
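The core statistics described above (coverage rate, syllabus usage rate, super-class words) can be sketched as follows. This is a minimal illustration, not the paper's actual code: it assumes the exam text has already been segmented into a token list (e.g. by `pkuseg.pkuseg().cut(text)` or `jieba.lcut(text)`) and that the syllabus for a given level is available as a set of words; the function name and keys are hypothetical.

```python
from collections import Counter


def coverage_stats(tokens, syllabus):
    """Compare segmented exam tokens against a syllabus word list.

    tokens   -- list of words from a segmenter such as pkuseg or jieba
    syllabus -- set of words in the HSK vocabulary syllabus for one level
    """
    freq = Counter(tokens)                      # word frequency table
    total = sum(freq.values())                  # total running words in paper
    in_syllabus = sum(c for w, c in freq.items() if w in syllabus)
    super_class = {w for w in freq if w not in syllabus}  # out-of-syllabus words
    return {
        # share of running words covered by the syllabus
        "coverage_rate": in_syllabus / total if total else 0.0,
        # share of syllabus entries that actually appear in the paper
        "syllabus_usage_rate": (sum(1 for w in syllabus if w in freq)
                                / len(syllabus)) if syllabus else 0.0,
        "super_class_words": super_class,
    }


# Toy usage with a hypothetical segmented sentence and tiny syllabus:
tokens = ["我", "喜欢", "学习", "汉语", "汉语"]
syllabus = {"我", "喜欢", "汉语", "你好"}
stats = coverage_stats(tokens, syllabus)
```

In this toy case, 4 of the 5 running words are syllabus words (coverage rate 0.8), 3 of the 4 syllabus entries are used (usage rate 0.75), and "学习" is the lone super-class word; the paper's finding for HSK5/6 corresponds to low usage rates, low coverage, and large super-class sets under this kind of computation.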