Research On Information Extraction Of Bilingual Resource Based On The Web

Posted on: 2009-04-21
Degree: Master
Type: Thesis
Country: China
Candidate: S N Pang
Full Text: PDF
GTID: 2178360272987310
Subject: Computer application technology
Abstract/Summary:
The ocean of information continues to expand at an astonishing rate. How can anyone find just the piece of information they need? In response to this challenge, Information Retrieval (IR) and Information Extraction (IE) techniques have developed rapidly, with the aim of helping people cope with information overload. Information Retrieval returns all documents that satisfy given conditions, but the user must still read each text in full to find the exact information. Information Extraction, by contrast, extracts facts directly from natural language text and organizes them into structured data that the user can query further.

A corpus is a representative collection of linguistic material, organized with some kind of structure for a particular application. It is large and machine-readable. With the development of the Internet, digital linguistic material of all kinds, including bilingual material, is easier to obtain than ever, so the conditions for putting Information Extraction into practice have improved.

This thesis studies the extraction of valuable information from bilingual texts on the Internet. The research builds a complete process of downloading web pages, aligning the resources, and extracting information. The major work is as follows:

Material collection is the first step of extraction. The method of locating and recognizing bilingual resources on the Internet is discussed in detail. For websites with a search function, the structure of that function is analyzed; for websites without one, a spider program downloads the pages directly.

An efficient partition-based algorithm is provided that screens out the noisy parts of web pages, so that information is extracted from the clean content.

With the permission of the copyright holders, the pages are stored and their paragraphs are aligned to create a semi-finished bilingual parallel corpus, which serves as the source for knowledge extraction.

The methods of extracting bilingual vocabulary, terminology, and translation templates are discussed and analyzed in detail. Finally, the thesis is summarized and directions for future work are pointed out.
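
The abstract does not give the details of the partition-based noise filter, so the following is only a minimal sketch of the general idea, not the thesis's algorithm. It assumes a page can be partitioned at block-level tags and that noisy blocks (navigation bars, footers) are either very short or consist mostly of link text. The names `BlockPartitioner` and `extract_clean_content`, the tag set, and the thresholds `min_chars` and `max_link_ratio` are all hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

# Assumed set of tags that start or end a block; not taken from the thesis.
BLOCK_TAGS = {"div", "p", "td", "li", "h1", "h2", "h3", "ul", "table", "tr"}
SKIP_TAGS = {"script", "style"}


class BlockPartitioner(HTMLParser):
    """Partition a page into text blocks at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished blocks as (text, link_chars)
        self._text = []         # text pieces of the block being built
        self._link_chars = 0    # characters of the current block inside <a>
        self._in_link = 0
        self._skip = 0

    def flush(self):
        """Close the current block and start a new one."""
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return  # ignore script and style content
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def extract_clean_content(html, min_chars=40, max_link_ratio=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockPartitioner()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text after the last block tag
    kept = []
    for text, link_chars in parser.blocks:
        if len(text) >= min_chars and link_chars / len(text) <= max_link_ratio:
            kept.append(text)
    return kept
```

A character-count threshold such as `min_chars` depends on the language: Chinese text carries more content per character than English, so a bilingual crawler would likely need a separate threshold for each side.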
Keywords/Search Tags: natural language processing, Internet, bilingual resource, information extraction