For Internet Access To Multiple Information Technology Research

Posted on:2012-06-04

Degree:Master

Type:Thesis

Country:China

Candidate:M J Jiang

Full Text:PDF

GTID:2208330335997731

Subject:Computer application technology

Abstract/Summary:

To present information to users quickly and accurately, a variety of information collection systems have appeared on the Internet. Information collection is a way of acquiring knowledge from the web, and it consists of three stages: webpage collection, information extraction, and information de-duplication. Information extraction operates on the collected web pages, so collecting high-quality information pages improves extraction results. A de-duplication step is always needed after extraction. Duplicate information implies that the pages it came from are redundant, yet previous research has not used this redundancy to optimize webpage collection. By choosing a subset of high-quality information pages, downloading only those, and extracting information from them, we can improve both the effectiveness and the efficiency of the system.

We first introduce a fast information page collection method built on an ordinary information extraction system. We derive URL patterns from the information pages, then select a subset of those patterns based on the duplicate information and the corresponding web pages. Finally, we construct a download navigator from the selected URL patterns. In later runs, the information collection system can use the download navigator to fetch a small number of information pages very quickly while maintaining the number of extracted objects.

Because this method depends on an information de-duplication procedure, we also introduce the de-duplication method in this paper. For multi-field information, we divide the information fields into four classes, compute the similarity of each as features, and identify duplicate pairs with a binary classifier. In addition, we use an automatic expansion method to obtain more synonymous named entities, which improves multi-field information de-duplication.

We compare the fast information page collection method with a traditional crawler in terms of pages downloaded, amount of information extracted, and the effectiveness of the navigator. The results show that our method clearly reduces the cost of crawling web pages while preserving the amount of information collected. The multi-field de-duplication experiments are carried out on two different datasets; both demonstrate the effectiveness of de-duplication, and the automatic expansion of named entities improves the results significantly.
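
The abstract does not give the details of how URL patterns are derived or selected, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes a URL pattern is obtained by replacing digit runs in the path with a placeholder, and that patterns are chosen greedily by the number of not-yet-covered de-duplicated objects gained per downloaded page. The function names `url_pattern` and `select_patterns`, the `{num}` placeholder, the greedy criterion, and the example URLs are all hypothetical and are not taken from the thesis.

```python
import re
from urllib.parse import urlparse


def url_pattern(url: str) -> str:
    """Generalize a URL by replacing digit runs in its path with '{num}'.

    Assumption: this simple rule stands in for the thesis's pattern derivation.
    """
    parts = urlparse(url)
    path = re.sub(r"\d+", "{num}", parts.path)
    return f"{parts.netloc}{path}"


def select_patterns(page_objects: dict[str, set[str]]) -> list[str]:
    """Greedily pick URL patterns that cover all de-duplicated objects.

    page_objects maps each information-page URL to the set of object ids
    (after de-duplication) extracted from that page.
    """
    covered_by: dict[str, set[str]] = {}  # pattern -> objects it yields
    page_count: dict[str, int] = {}       # pattern -> pages to download
    for url, objects in page_objects.items():
        pattern = url_pattern(url)
        covered_by.setdefault(pattern, set()).update(objects)
        page_count[pattern] = page_count.get(pattern, 0) + 1

    remaining = set().union(*covered_by.values())
    chosen: list[str] = []
    while remaining:
        # Best ratio of newly covered objects to pages downloaded;
        # sorting makes tie-breaking deterministic.
        best = max(
            sorted(covered_by),
            key=lambda p: len(covered_by[p] & remaining) / page_count[p],
        )
        chosen.append(best)
        remaining -= covered_by[best]
    return chosen


# Hypothetical example: three sites publish overlapping objects x, y, z.
pages = {
    "http://a.example/item/1": {"x"},
    "http://a.example/item/2": {"y"},
    "http://b.example/list/1": {"x", "y", "z"},
    "http://c.example/p/9": {"z"},
}
print(select_patterns(pages))  # prints ['b.example/list/{num}']
```

In this toy input, one pattern already yields every distinct object, so a navigator built from it would download one page instead of four without losing any extracted object. This is the kind of saving the abstract attributes to exploiting redundancy between pages.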

Keywords/Search Tags:

Information extraction, Webpage collection, Information de-duplication

Related items

1	Design And Implementation Of Chinese Webpage Automatic Collection And Classification
2	The Research And Design Of Network Information Monitoring And Analysis System
3	Page Events Information Extraction
4	The Application And Research Of Regular Expression In Webpage Extraction
5	Design And Implementation Of Content-based Webpage Collection And Classification System
6	Research And Implementation Of Webpage Cleaning Algorithm Based On Visual Information
7	The Technology And Application On Web Information Extraction Based On Analyzing Webpage Content
8	The Personal Information Extraction Based On Webpage Understanding
9	Research And Implement Of Web Information Intelligence Collection And Personalized Service System
10	Research On Web Filtering Method Of People Information