Research on Internet Definition Extraction

Posted on: 2010-03-13
Degree: Master
Type: Thesis
Country: China
Candidate: J Yu
Full Text: PDF
GTID: 2298330452461494
Subject: Software engineering
Abstract/Summary:
With the development of society and the progress of technology, new objects and new vocabulary are continuously created. These new terms are often not defined in dictionaries and reference books, so people have to search for their definitions using internet search engines. Although the major search engines index a large number of web pages and list results as hypertext links related to the key words, most of the results are not what people are looking for, and they have to open the linked pages one by one to find definitions. Against this background, this thesis studies definition extraction based on search engines, mainly in the following aspects:

(1) Research on and construction of a corpus for definition extraction. Chinese Wikipedia and Sogo News are used as the internet corpus. A Corpus Building Module based on XML is developed; the module can build special-purpose corpora for definition extraction.

(2) Study of a statistics-based internet definition extraction method. An N-Gram language model is used to acquire statistical characteristics, and a method is provided that uses the weights of key words and grammatical relations as sources of sentence features. By investigating the different uses of words in the definition corpus and the news corpus, the thesis puts forward the concepts of sub-sentences and of the maximum membership grade of a sentence, based on the membership grades of words and sentences. A definition extraction method is provided in which the word form, the part of speech, the key word weight, the grammatical relation, the linguistic pattern of definitions, and the characteristics of wording form the feature vector of each sentence, which is marked as definition or non-definition. Several popular or classical classifiers are then compared in learning and recognizing definitions.

(3) Study of large-scale web page acquisition and web information extraction. The Web Page Acquisition Module has been designed and developed using multi-threading and the Google AJAX API. The thesis proposes an information-based method for computing paragraph weight, and the Web Information Extraction Module has been designed and developed. Using these modules, the author applied the Balanced Random Forest classifier to test the extraction of definitions from the internet.

(4) Research on an internet definition extraction model. The thesis gives a viable internet definition extraction model. It is mainly intended for E-Learning systems, definitional question-answering systems, knowledge discovery, and other areas of natural language processing.

In this thesis, some critical technologies in internet definition extraction have been investigated, a statistics-based method of definition extraction is set forth, and the author has designed and developed some of the modules of the Internet Definition Extraction Model. The author hopes that this dissertation will promote the study of definition extraction.
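The abstract does not give formulas for the N-gram statistics or for the membership grades, so the following is only a minimal sketch of one plausible reading. It assumes that a word's membership grade is the share of its occurrences that fall in the definition corpus, that a sub-sentence is a comma-delimited span, and that the maximum membership grade of a sentence is the highest average grade among its sub-sentences. All function names and the delimiter set are hypothetical and are not taken from the thesis.

```python
from collections import Counter

# Assumed sub-sentence boundaries; the thesis does not specify them.
DELIMITERS = {",", ";", ":"}

def bigrams(tokens):
    """Return adjacent token pairs, padded with sentence boundary markers."""
    padded = ["<s>", *tokens, "</s>"]
    return list(zip(padded, padded[1:]))

def count_words(corpus):
    """Count word occurrences over a corpus given as a list of token lists."""
    return Counter(tok for sent in corpus for tok in sent)

def word_membership(word, def_counts, news_counts):
    """Hypothetical grade: fraction of a word's occurrences seen in the
    definition corpus. Unseen words get a neutral 0.5."""
    d, n = def_counts[word], news_counts[word]
    return 0.5 if d + n == 0 else d / (d + n)

def sentence_membership(tokens, def_counts, news_counts):
    """Average word membership grade over a token span."""
    if not tokens:
        return 0.0
    grades = (word_membership(t, def_counts, news_counts) for t in tokens)
    return sum(grades) / len(tokens)

def split_sub_sentences(tokens):
    """Split a token list into sub-sentences at the assumed delimiters."""
    parts, current = [], []
    for tok in tokens:
        if tok in DELIMITERS:
            if current:
                parts.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        parts.append(current)
    return parts

def max_membership(tokens, def_counts, news_counts):
    """Highest average membership grade among the sub-sentences."""
    parts = split_sub_sentences(tokens)
    return max(
        (sentence_membership(p, def_counts, news_counts) for p in parts),
        default=0.0,
    )
```

Raw counts are used here, which implicitly assumes the two corpora are of comparable size; with corpora of very different sizes, relative frequencies would be a more sensible basis. A score such as `max_membership` could serve as one entry in the per-sentence feature vector alongside the other features the abstract lists.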
Keywords/Search Tags: Internet, Definition, Definition Extraction, N-Gram