
Research On Lexical Level Knowledge Mining Based On Corpus

Posted on: 2014-07-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: P Han
Full Text: PDF
GTID: 1365330482951918
Subject: Information Science
Abstract/Summary:
With the development of digital and network technology, the objects and research missions of library and information science, natural language processing and text mining have changed considerably. In this context, knowledge acquisition from unstructured text has increasingly become a mainstream research direction. This dissertation therefore studies word frequency distribution at the macro level, mines the inner structure of language networks at the micro level, and explores word sense relationships. The lexical-level knowledge mined and extracted from corpora benefits research in library and information science, such as knowledge organization, lexicon construction and information retrieval, and also helps to solve problems in natural language processing, including ambiguity resolution, machine translation and computer-aided services. Drawing on a variety of models, techniques and corpora, this research addresses word frequency distribution, the macro-level properties of language networks, word similarity calculation and word sense induction.

At the level of word frequency distribution, this dissertation tests Zipf's law on English and Chinese words using maximum likelihood estimation. The fitting results show that the Zipf exponent is close to 1.0 for English words and close to 1.3 for Chinese words. The English word distribution conforms to Zipf's law well, whereas the Chinese word distribution does not. This part further investigates the differences between English and Chinese through statistics on high-frequency and low-frequency words.

The second part first constructs a language network of ancient Chinese poetry based on word co-occurrence relationships, and then examines both its general properties and its inner structure. At the macro level, the findings reveal that ancient Chinese poetry networks are small-world networks and exhibit typical scale-free characteristics. Nonetheless, they differ significantly from the modern Chinese character network. In terms of internal structure, the analysis suggests that the k-core can reveal characteristics of the authors, their writing style and the social environment they lived in.

Using the People's Daily corpus covering the whole of 1998, a giant language network is constructed based on word co-occurrence. Under the distributional hypothesis, this part centers on similar-word mining and word similarity calculation. It proposes a new algorithm, the Contribution Discount Similarity algorithm (CDSim), which captures not only edge weights but also global characteristics of the network. Compared with three typical node similarity measures (common neighbors, Jaccard and Salton), CDSim performs best. To assess the contribution of left and right neighbor nodes to similar-word mining, their influence is also examined separately. The results show that right neighbor nodes contribute more to mining similar nouns, while the opposite holds for mining similar verbs.

At the level of word sense induction (WSI), this dissertation investigates the feasibility of inducing Chinese word senses from a large corpus with a graph-clustering algorithm. The study first constructs the sub-network corresponding to each target polysemous word, and then groups the frequent words into clusters using the graph-clustering algorithm. The results indicate that graph-clustering-based WSI for Chinese words from a large corpus is feasible and effective. Furthermore, the findings suggest that corpus characteristics such as scale, content and domain may influence the WSI result.
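As an illustration of the three baseline node similarity measures named in the abstract, the following minimal Python sketch computes common neighbors, Jaccard and Salton scores on a small co-occurrence network. It is not the dissertation's implementation and does not reproduce CDSim, whose definition the abstract does not give. The toy sentences, the adjacent-word co-occurrence window, the unweighted undirected graph and all function names are assumptions made for this example.

```python
from collections import defaultdict
from math import sqrt


def build_cooccurrence_graph(sentences):
    """Link each word to its immediate successor (assumed window of 1, undirected)."""
    neighbors = defaultdict(set)
    for tokens in sentences:
        for left, right in zip(tokens, tokens[1:]):
            if left != right:
                neighbors[left].add(right)
                neighbors[right].add(left)
    return dict(neighbors)


def common_neighbors(graph, x, y):
    """Number of neighbors shared by x and y."""
    return len(graph.get(x, set()) & graph.get(y, set()))


def jaccard(graph, x, y):
    """Shared neighbors divided by the size of the union of both neighborhoods."""
    nx, ny = graph.get(x, set()), graph.get(y, set())
    union = nx | ny
    return len(nx & ny) / len(union) if union else 0.0


def salton(graph, x, y):
    """Shared neighbors divided by the geometric mean of the two degrees."""
    nx, ny = graph.get(x, set()), graph.get(y, set())
    denom = sqrt(len(nx) * len(ny))
    return len(nx & ny) / denom if denom else 0.0


# Hypothetical toy corpus, already tokenized.
sentences = [
    ["the", "cat", "eats", "fish"],
    ["the", "dog", "eats", "meat"],
    ["the", "cat", "sleeps"],
]
graph = build_cooccurrence_graph(sentences)

for name, measure in [("common neighbors", common_neighbors),
                      ("Jaccard", jaccard),
                      ("Salton", salton)]:
    print(f"{name}: {measure(graph, 'cat', 'dog'):.4f}")
```

In this toy network, "cat" and "dog" share the neighbors "the" and "eats", so all three measures rate them as similar. None of the three uses edge weights, which is the kind of information the abstract says CDSim additionally captures.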
Keywords/Search Tags:Corpus, Treebank, Zipf's law, Maximum likelihood estimation, Complex network, Language network, Node similarity, Word similarity, Word sense induction, Word sense disambiguation