
Research And Implementation Of Mining Bilingual Named Entities From Large-Scale Web Pages

Posted on: 2010-06-16
Degree: Master
Type: Thesis
Country: China
Candidate: S D Jiao
GTID: 2178360272470141
Subject: Computer application technology
Abstract/Summary:
A large-scale bilingual named entity corpus can substantially improve the performance of systems such as machine translation and cross-language information retrieval, and many methods for mining bilingual named entities have therefore been proposed. Earlier methods mainly used bilingual corpora as their resource, but such corpora are limited in scale and diversity and do not handle the out-of-vocabulary (OOV) problem well. With the rapid growth of the web, many web pages now contain bilingual named entities. The web also offers diversity and timeliness: its bilingual named entities come from a variety of domains and include many OOV terms. As a result, the question of how to mine these bilingual named entities has attracted increasing research attention.

This paper proposes a method for mining bilingual named entities from large-scale web pages. The method relies mainly on the redundant information in large-scale web pages. First, bilingual strings are extracted from the web pages. Second, Chinese word segmentation and a suffix tree are used to extract candidate bilingual pairs. Third, an SVM-based model determines which candidates are bilingual named entities. Finally, post-processing filters out noise and bad translations to obtain the correct bilingual named entities.

This paper also designs and implements a bilingual named entity mining system. The input of the system is a large-scale set of web pages, and the output is the extracted bilingual named entities. The system consists of four modules: (1) bilingual string extraction; (2) candidate bilingual pair extraction; (3) bilingual named entity alignment; (4) post-processing.

The contributions of this study can be summarized as follows: (1) an integrated mining scheme is presented to discover, extract, and verify high-quality bilingual named entities from a large-scale set of web pages; (2) previous methods are combined, and the experimental results show that the scheme achieves higher extraction performance than previous approaches.
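
The first two steps described above (bilingual string extraction and candidate pair extraction driven by redundancy) can be illustrated with a minimal Python sketch. It is not the thesis implementation. The regular expression (Chinese text followed by a parenthesized English phrase), the function names, and the sample pages are all assumptions made for illustration, and a plain frequency counter over Chinese suffixes stands in for the suffix tree. Word segmentation, the SVM classifier, and post-processing are omitted.

```python
import re
from collections import Counter

# Assumed pattern (not from the thesis): a run of Chinese characters followed
# by an English phrase in ASCII or full-width parentheses.
BILINGUAL = re.compile(
    r"([\u4e00-\u9fff]{2,20})\s*[(（]\s*([A-Za-z][A-Za-z .'\-]{1,60}?)\s*[)）]"
)


def extract_bilingual_strings(text):
    """Return (chinese_context, english) pairs found in one page's text."""
    return [(m.group(1), m.group(2)) for m in BILINGUAL.finditer(text)]


def suffix_candidates(chinese, min_len=2):
    """Return every suffix of the Chinese context with at least min_len characters."""
    return [chinese[i:] for i in range(len(chinese) - min_len + 1)]


def count_candidates(pages):
    """Count how often each (Chinese suffix, English) pair occurs across pages."""
    counts = Counter()
    for text in pages:
        for chinese, english in extract_bilingual_strings(text):
            for cand in suffix_candidates(chinese):
                counts[(cand, english.lower())] += 1
    return counts


# Hypothetical sample pages: the same entity appears in different contexts.
pages = [
    "他毕业于清华大学(Tsinghua University)。",
    "位于北京的清华大学(Tsinghua University)成立于1911年。",
    "我们参观了清华大学（Tsinghua University）。",
]
counts = count_candidates(pages)

# Prefer the most frequent candidate; break ties in favour of the longest one.
best = max(counts, key=lambda k: (counts[k], len(k[0])))
print(best, counts[best])
```

Because the surrounding context differs from page to page while the entity itself repeats, the suffix shared by all pages accumulates the highest count. In a fuller pipeline, candidates like these would then be scored by a classifier and filtered, as the abstract describes.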
Keywords/Search Tags: Bilingual Named Entity, Data Mining, Chinese Word Segmentation, Cross Language Information Retrieval, Natural Language Processing