
Research on Tibetan Web Page Deduplication Based on a Tibetan Search Engine

Posted on: 2019-11-26
Degree: Master
Type: Thesis
Country: China
Candidate: J C R Li
GTID: 2416330551961400
Subject: Tibetan information processing project
Abstract/Summary:
Because the Internet is open and shared, it gives people fast, convenient access to a large amount of information. On the other hand, much of that information is repetitive: search engine indexes in particular are filled with web pages whose content is identical or nearly identical, and Tibetan web pages are no exception. Duplicate pages waste storage, consume network bandwidth, and seriously reduce the efficiency of a search engine. Users, for their part, do not want the correct results buried among duplicate or near-duplicate pages, which increases their browsing burden. Eliminating duplicate web pages quickly and accurately is therefore one of the key technologies for improving search engine quality and the user experience.

Many web page deduplication methods exist for Chinese and English, and their efficiency and accuracy vary. Most of them, however, follow a common framework: first select from a given web document a set of feature items that represent its core content, then reduce the dimensionality of that feature set, then compute similarity on the reduced data, and finally compare the similarity results to decide whether pages are duplicates and eliminate those that are. Because Tibetan differs from Chinese and English, this general framework cannot be applied directly to Tibetan web pages. It is therefore necessary to study each functional module and design a deduplication framework that suits the characteristics of the Tibetan language.

The Tibetan web page deduplication system studied in this thesis is based on the general framework described above. Drawing on the relevant domestic and foreign literature, the thesis analyzes and studies in depth the cleaning of Tibetan web pages, the processing of web page text blocks, the selection of web page features, the calculation of feature weights, fingerprint calculation, similarity calculation, and deduplication processing. In the Tibetan information fingerprinting module, three algorithms are compared and the best-performing one is adopted. The web page cleaning module extracts the topic sentences of Tibetan web pages by improving the original algorithm and adding a position-marking function. The feature weight calculation module is improved to support position-based weighting. The deduplication module adds clustered storage of fingerprints and classified matching to the original design in order to improve the overall efficiency of the system. Finally, experiments on each module demonstrate the practicality of the deduplication framework. An overall test of the system on a large-scale collection of web pages yielded a recall of 93.8%, a precision of 97.7%, and an F value of 0.957.
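The general framework in the abstract (feature selection, position-based weighting, fingerprinting, similarity comparison) can be illustrated with a minimal Python sketch. Everything specific here is an assumption rather than the thesis's method: the abstract does not name the three fingerprint algorithms it compares, so SimHash is swapped in as a stand-in; features are taken to be Tibetan syllables split on the tsheg (U+0F0B), with sentences split on the shad (U+0F0D); the `lead_weight` and `max_distance` parameters and all function names are hypothetical. The `f_measure` helper assumes the reported F value is the harmonic mean of precision and recall.

```python
import hashlib
from collections import defaultdict

TSHEG = "\u0f0b"  # Tibetan syllable delimiter
SHAD = "\u0f0d"   # Tibetan sentence delimiter


def features(text, lead_weight=2.0):
    """Map each syllable to a weight; syllables in the first sentence
    count `lead_weight` times (an assumed form of position weighting)."""
    weights = defaultdict(float)
    sentences = [s for s in text.split(SHAD) if s.strip()]
    for i, sentence in enumerate(sentences):
        w = lead_weight if i == 0 else 1.0
        for syllable in sentence.split(TSHEG):
            syllable = syllable.strip()
            if syllable:
                weights[syllable] += w
    return dict(weights)


def simhash(weights, bits=64):
    """Weighted SimHash: each feature votes +w or -w on every bit."""
    totals = [0.0] * bits
    for feature, w in weights.items():
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for b in range(bits):
            totals[b] += w if (h >> b) & 1 else -w
    fingerprint = 0
    for b in range(bits):
        if totals[b] > 0:
            fingerprint |= 1 << b
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(pages, max_distance=3):
    """Keep the first page of each near-duplicate group.
    `pages` is an iterable of (page_id, text) pairs."""
    kept = []  # (page_id, fingerprint) of pages retained so far
    for page_id, text in pages:
        fp = simhash(features(text))
        if all(hamming(fp, other) > max_distance for _, other in kept):
            kept.append((page_id, fp))
    return [page_id for page_id, _ in kept]


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

Under that assumption, `f_measure(0.977, 0.938)` rounds to 0.957, which agrees with the figures reported above. The linear scan in `deduplicate` is only for illustration; the clustered fingerprint storage and classified matching that the abstract mentions would replace it in a system meant to scale.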
Keywords/Search Tags: Tibetan web page, text block, feature selection, information fingerprinting, similarity calculation