Research on a Discipline-Oriented New Word Extraction Method and Its Application

Posted on: 2012-02-17
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Liu
GTID: 2285330335969124
Subject: Education Technology
Abstract/Summary:
In recent years, as academic disciplines have developed rapidly, many new words have emerged in various fields. These words reflect the core knowledge and professional values of a discipline, and to some extent changes in them reflect the discipline's development. New word detection therefore has theoretical and practical significance, both for language information processing tasks such as information retrieval, data mining, and machine translation, and for understanding the development and future trends of a discipline. The common approach to new word detection combines statistical methods with rules. The main difficulty is that modern Chinese word formation is flexible, so it is hard to find a universal method that detects new words in all fields. Moreover, the accuracy and recall of existing approaches are not good enough and need to be improved. This thesis presents a greedy word-formation approach based on an atomic vocabulary for the education technology discipline, which provides a new way to find new vocabulary in this field. The proposed discipline-oriented method consists of text preprocessing, formation and counting of new word strings based on the greedy atomic vocabulary, rule-based filtering of word strings, screening of repeated substrings, refinement of the new vocabulary, and sorting of the results. Using this method, the thesis analyzes 100 papers from the core education technology journal 《education research in the audio-visual》.
The procedure is as follows. First, strings that are marked by " ", ’’, (), or 《》 and are no longer than 10 characters are extracted to form candidate new word table 1. Second, the full text is segmented with a generic dictionary, words that cannot take part in word formation are removed according to their part of speech, and candidate new word strings built from the atomic word-formation vocabulary are counted to form candidate new word table 2. Third, garbage word strings are filtered from the candidate list using a glossary of non-word-forming words. Fourth, repeated substrings that contain the same content are filtered by frequency subtraction, and further garbage word strings are filtered using a glossary of hot root words for the field. Finally, the TF-IDF value of every new word is calculated and the results are output in sorted order. The method was applied to the education technology field, and a set of new words was obtained through statistical analysis. According to the test results, accuracy and recall are both improved, which indicates that the method is practical and effective for domain-specific new word detection.
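Three of the steps described above (extracting delimiter-marked strings, filtering repeated substrings by frequency subtraction, and ranking by TF-IDF) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the `min_count` threshold, the smoothed IDF formula, and plain substring counting in place of dictionary-based segmentation are all assumptions made for illustration.

```python
import math
import re

# Delimiter pairs named in the abstract, plus their full-width/curly variants (assumed).
PAIRS = [('“', '”'), ('"', '"'), ('‘', '’'), ('（', '）'), ('(', ')'), ('《', '》')]
DELIMITED = re.compile('|'.join(
    f'{re.escape(o)}([^{re.escape(o)}{re.escape(c)}]+){re.escape(c)}'
    for o, c in PAIRS))


def extract_marked_strings(text, max_len=10):
    """Step 1: collect strings enclosed by the delimiters, at most max_len characters long."""
    found = []
    for match in DELIMITED.finditer(text):
        # Exactly one alternative matched, so exactly one group is not None.
        inner = next(g for g in match.groups() if g is not None)
        if len(inner) <= max_len:
            found.append(inner)
    return found


def subtract_substring_counts(counts, min_count=1):
    """Step 4 (assumed form): remove from each shorter string the occurrences
    that are already explained by a longer candidate containing it."""
    adjusted = dict(counts)
    # Longest first, so a string's own count is final before it is subtracted.
    by_length = sorted(counts, key=len, reverse=True)
    for i, longer in enumerate(by_length):
        own = max(adjusted[longer], 0)
        for shorter in by_length[i + 1:]:
            if len(shorter) < len(longer) and shorter in longer:
                adjusted[shorter] -= own
    return {s: c for s, c in adjusted.items() if c >= min_count}


def rank_by_tfidf(candidates, documents):
    """Step 5 (assumed form): score = corpus term frequency * smoothed IDF."""
    n_docs = len(documents)
    scores = {}
    for cand in candidates:
        per_doc = [doc.count(cand) for doc in documents]
        tf = sum(per_doc)
        df = sum(1 for c in per_doc if c > 0)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothing is an assumption
        scores[cand] = tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

In a full pipeline, the strings from `extract_marked_strings` and the counted word strings from segmentation would be merged into one candidate list, passed through the glossary filters, and only then given to `subtract_substring_counts` and `rank_by_tfidf`. The abstract does not state which corpus the IDF is computed over, so the choice of `documents` here is left to the caller.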
Keywords/Search Tags: New word detection, Atomic vocabulary word-formation, Word frequency, Word string filtering, Education technology