Research And Implementation Of Mining Bilingual Named Entities From Large-Scale Web Pages

Posted on:2010-06-16

Degree:Master

Type:Thesis

Country:China

Candidate:S D Jiao

GTID:2178360272470141

Subject:Computer application technology

Abstract/Summary:

A large-scale bilingual named entity corpus can substantially improve the performance of systems such as machine translation and cross-language information retrieval, and many methods for mining bilingual named entities have been proposed. Previous methods mainly used bilingual corpora as the resource; these methods are limited in scale and diversity and do not handle the out-of-vocabulary (OOV) problem well. With the rapid growth of the web, many web pages now contain bilingual named entities. The web also offers advantages in diversity and timeliness: the bilingual named entities found on it come from a variety of domains and include many OOV terms. As a result, mining bilingual named entities from the web has attracted increasing research attention.

This paper proposes a method for mining bilingual named entities from large-scale web pages. The method mainly exploits the redundant information in large-scale web pages. First, bilingual strings are extracted from the web pages; second, Chinese word segmentation and a suffix tree are used to extract candidate bilingual pairs; third, an SVM-based model determines which candidates are bilingual named entities; finally, post-processing filters out noise and bad translations to yield the correct bilingual named entities.

This paper also designs and implements a bilingual named entity mining system. The input of the system is a large-scale set of web pages, and the output is the extracted bilingual named entities. The system is composed of four modules: (1) bilingual string extraction; (2) candidate bilingual pair extraction; (3) bilingual named entity alignment; (4) post-processing.

The contributions of this study can be summarized as follows: (1) an integrated mining scheme is presented to discover, extract, and verify high-quality bilingual named entities from a large-scale set of web pages; (2) previous methods are combined, and the experimental results show that our scheme achieves higher extraction performance than previous approaches.
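The idea of exploiting redundancy across web pages can be illustrated with a minimal Python sketch. It is not the system described in the abstract. It assumes, purely for illustration, that bilingual strings take the form of a Chinese run followed by an English phrase in parentheses, and it replaces the word segmentation, suffix tree, and SVM steps with a simple heuristic: for each English phrase, keep the longest Chinese suffix that recurs on at least `min_pages` pages. All names (`BILINGUAL`, `extract_bilingual_strings`, `candidate_pairs`, `mine`) and thresholds are hypothetical.

```python
import re
from collections import Counter

# Assumed surface pattern (not taken from the thesis): a run of Chinese
# characters followed by an English phrase in ASCII or full-width parentheses.
BILINGUAL = re.compile(
    r"([\u4e00-\u9fff·]+)\s*[(（]\s*([A-Za-z][A-Za-z .'\-]*[A-Za-z])\s*[)）]"
)


def extract_bilingual_strings(page_text):
    """Return (chinese_context, english) tuples found in one page."""
    return BILINGUAL.findall(page_text)


def candidate_pairs(chinese_context, english, max_len=8):
    """Pair the English phrase with each suffix (up to max_len chars)
    of the Chinese context; the true entity boundary is unknown."""
    start = max(0, len(chinese_context) - max_len)
    return [(chinese_context[i:], english)
            for i in range(start, len(chinese_context))]


def mine(pages, min_pages=2):
    """Keep, per English phrase, the longest Chinese suffix that is
    seen on at least min_pages distinct pages."""
    page_counts = Counter()
    for page in pages:
        seen = set()
        for zh, en in extract_bilingual_strings(page):
            seen.update(candidate_pairs(zh, en))
        page_counts.update(seen)  # each pair counts once per page

    best = {}
    for (zh, en), n in page_counts.items():
        if n >= min_pages and len(zh) > len(best.get(en, "")):
            best[en] = zh
    return best


pages = [
    "微软创始人比尔·盖茨(Bill Gates)发表演讲",
    "著名企业家比尔·盖茨 (Bill Gates) 出席会议",
    "他毕业于北京大学(Peking University)",
]
result = mine(pages)
```

In this toy input, the surrounding context differs between the first two pages, so only the shared suffix survives the frequency threshold, while the pair seen on a single page is dropped. A real system would need a learned classifier, such as the SVM model the abstract mentions, to separate translations from coincidental co-occurrences.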

Keywords/Search Tags:

Bilingual Named Entity, Data Mining, Chinese Word Segmentation, Cross Language Information Retrieval, Natural Language Processing

Related items

1	Study On Chinese Named Entity Recognition
2	Applied Research Of Chinese-Korean Cross-Language Text Similarity Calculation
3	Research On Chinese Named Entity Recognition Algorithm Based On Textual Information Perceptual Fusion
4	Natural language processing for named entities with word-internal information
5	The Methodology And Implementation Of Chinese Natural Language Query In Databases
6	A Statistics-Based Language Model Approach To Chinese Word Segmentation
7	Research On Chinese Named Entity Recognition Based On Deep Learning
8	Study On Chinese Word Segmentation Based On Recurrent Neural Network Language Model
9	Hantai Bilingual News Topic Discovery Method Research
10	Research On Chinese Named Entity Recognition Based On Deep Learning