For Internet Access To Multiple Information Technology Research

Posted on: 2012-06-04
Degree: Master
Type: Thesis
Country: China
Candidate: M J Jiang
Full Text: PDF
GTID: 2208330335997731
Subject: Computer application technology
Abstract/Summary:
To present information to users quickly and accurately, a variety of information collection systems have appeared on the Internet. Information collection is a way of acquiring knowledge from the web, and it consists of three stages: webpage collection, information extraction, and information de-duplication. Information extraction operates on the collected web pages, so collecting well-qualified information pages improves the extraction results. After extraction, a de-duplication step is always needed. When extracted records are duplicates, the pages they come from are redundant. Previous research has not used this redundancy relation to optimize webpage collection. By choosing a subset of high-quality information pages, downloading only those, and extracting information from them, we can improve both the effectiveness and the efficiency of the system.

We first introduce a fast information page collection method, which is built on an ordinary information extraction system. We derive URL patterns from the information pages, then select a subset of these patterns based on the duplicate information and the corresponding web pages. Finally, we construct a download navigator from the selected URL patterns. With this navigator, the information collection system can subsequently fetch a small number of information pages very quickly while maintaining the number of extracted objects.

This fast collection method relies on an information de-duplication procedure, which we also describe in this thesis. For multi-field information, we divide the information fields into four classes, compute the similarity of each field as a feature, and identify duplicate information pairs with a binary classifier. In addition, we use an automatic expansion method to obtain more synonymous named entities, which improves the effectiveness of multi-field information de-duplication.

We compare the fast information page collection method with a traditional crawler in terms of the pages downloaded, the number of information items extracted, and the effectiveness of the navigator. The results show that our method clearly reduces the cost of crawling web pages while preserving the amount of information obtained. The multi-field de-duplication experiments are carried out on two different datasets; both demonstrate the effectiveness of the de-duplication method, and the automatic expansion of named entities significantly improves the results.
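The URL-pattern selection and download navigator summarized in the abstract can be illustrated with a small sketch. The abstract does not say how patterns are derived or selected, so everything here is an assumption made for illustration: `url_pattern` generalizes a URL by replacing digit runs with a wildcard, `select_patterns` is a greedy set-cover heuristic that prefers patterns yielding the most new objects per page, and the example URLs and object ids are hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse


def url_pattern(url):
    """Generalize a URL: keep the host, replace digit runs in the path with '*'.
    (Assumed generalization rule, not taken from the thesis.)"""
    parts = urlparse(url)
    return parts.netloc + re.sub(r"\d+", "*", parts.path)


def select_patterns(pages):
    """Greedily choose URL patterns that together cover every extracted object.

    `pages` maps a page URL to the set of de-duplicated object ids extracted
    from it. Each round picks the pattern with the most not-yet-covered
    objects per page, so redundant patterns are left out.
    """
    objects_by_pattern = defaultdict(set)
    page_count = defaultdict(int)
    for url, objects in pages.items():
        pattern = url_pattern(url)
        objects_by_pattern[pattern] |= set(objects)
        page_count[pattern] += 1

    remaining = set().union(*objects_by_pattern.values())
    chosen = []
    while remaining:
        best = max(
            objects_by_pattern,
            key=lambda p: len(objects_by_pattern[p] & remaining) / page_count[p],
        )
        chosen.append(best)
        remaining -= objects_by_pattern[best]
    return chosen


def should_download(url, chosen):
    """Navigator check: fetch a page only if its pattern was selected."""
    return url_pattern(url) in chosen


# Hypothetical crawl result: object ids "x", "y", "z" appear on several sites.
pages = {
    "http://a.example/item/1": {"x"},
    "http://a.example/item/2": {"y"},
    "http://b.example/list/1": {"x"},
    "http://b.example/list/2": {"y"},
    "http://b.example/list/3": {"z"},
    "http://b.example/list/4": {"x"},
    "http://c.example/p/9": {"z"},
}
chosen = select_patterns(pages)
```

In this toy input the pattern for `b.example` is skipped because its objects are already covered by two cheaper patterns, which is the kind of saving the abstract attributes to exploiting redundancy between pages.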
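The de-duplication step (per-field similarities used as features for a binary classifier, with synonym expansion for named entities) can be sketched as follows. This is a minimal sketch under stated assumptions, not the thesis's method: the abstract does not name the four field classes, the similarity measures, or the classifier, so the three field types, the `SYNONYMS` table, the logistic-regression trainer, and all sample records below are hypothetical.

```python
import math
from difflib import SequenceMatcher

# Hypothetical synonym table standing in for automatically expanded named entities.
SYNONYMS = {"nyc": "new york", "new york city": "new york"}


def canon(name):
    """Map a named entity to a canonical form using the synonym table."""
    name = name.strip().lower()
    return SYNONYMS.get(name, name)


def features(a, b):
    """One similarity per field; the field types are assumed for illustration."""
    entity = 1.0 if canon(a["city"]) == canon(b["city"]) else 0.0
    text = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    largest = max(abs(a["price"]), abs(b["price"]), 1e-9)
    number = 1.0 - min(abs(a["price"] - b["price"]) / largest, 1.0)
    return [entity, text, number]


def score(w, x):
    """Linear score: bias w[0] plus the weighted feature sum."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


def train(X, y, epochs=500, lr=0.5):
    """Logistic regression fitted by stochastic gradient ascent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for x, label in zip(X, y):
            err = label - 1.0 / (1.0 + math.exp(-score(w, x)))
            w[0] += lr * err
            for i, xi in enumerate(x):
                w[i + 1] += lr * err * xi
    return w


def is_duplicate(w, a, b):
    """Classify a record pair as duplicate when the score is positive."""
    return score(w, features(a, b)) > 0


# Hypothetical labelled pairs: 1 = duplicate, 0 = distinct.
r1 = {"title": "Senior Java Developer", "city": "NYC", "price": 100.0}
r2 = {"title": "Senior Java developer", "city": "New York", "price": 100.0}
r3 = {"title": "Sr. Java Developer", "city": "new york city", "price": 98.0}
r4 = {"title": "Pastry Chef", "city": "Boston", "price": 40.0}
r5 = {"title": "Warehouse Manager", "city": "Chicago", "price": 55.0}
pairs = [(r1, r2, 1), (r1, r3, 1), (r1, r4, 0), (r2, r4, 0), (r3, r5, 0)]
w = train([features(a, b) for a, b, _ in pairs], [label for _, _, label in pairs])

new = {"title": "Senior Java Developer (m/f)", "city": "nyc", "price": 101.0}
other = {"title": "Dental Assistant", "city": "Chicago", "price": 35.0}
```

Without the synonym table, "NYC" and "New York" would contribute a named-entity similarity of zero, which gives a small-scale picture of why expanding synonymous named entities could help the classifier.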
Keywords/Search Tags: Information extraction, Webpage collection, Information de-duplication