Cheating Sites Identification Based On The Page Structure

Posted on: 2015-03-24
Degree: Master
Type: Thesis
Country: China
Candidate: H C Yang
Full Text: PDF
GTID: 2268330431457083
Subject: Computer technology
Abstract/Summary:
With the continued growth of the Internet, the volume of online information has exploded, and search engines have become the main way users find information. A site's position in search engine rankings largely determines how many visits it receives. Some websites try to raise their ranking not by improving page quality but by exploiting the characteristics of the search engine through deception; these are cheating websites. Web cheating techniques are diverse, concealed, and constantly evolving. One representative approach reuses a single page template structure and fills it with different content, producing many cheating sites that look alike. Because a shared template reduces cost, this approach is widely used. Pages of this type are typically produced by the same owner and are often accompanied by attachment cheating, keyword stuffing, and other forms of cheating. Currently, the main detection method relies on page content: whether the page contains pornographic or gambling terms, whether it contains stuffed keywords, and so on. This approach has two problems: (1) Low accuracy. Not all pornographic or gambling pages are spam, so the method mistakenly flags legitimate pages of this kind. (2) Low efficiency. Hundreds of cheating pages may share the same template, so examining each page individually creates a heavy and repetitive workload. To detect such spam in batches, this thesis first analyzes how a web browser renders HTML and the structure of the page, and proposes two ways to define a template: Dom Based Template (DBT) and Css Based Template (CBT). It then designs a template extraction algorithm for each definition to extract the structure of a site as its fingerprint feature.
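The abstract does not spell out how a DOM-based fingerprint is computed, so the following is only a minimal sketch of the general idea, not the thesis's DBT algorithm. It assumes that a template can be approximated by the set of root-to-node tag paths in a page, with text and attributes ignored, and that hashing this set gives a usable fingerprint. The names `TagPathCollector` and `dom_fingerprint` are hypothetical.

```python
import hashlib
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, outermost first
        self.paths = set()  # e.g. "html/body/div/p"

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; unmatched closers are ignored.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def dom_fingerprint(html):
    """Hash of the sorted tag-path set (an assumed stand-in for a DOM template)."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    joined = "\n".join(sorted(collector.paths))
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


page_a = "<html><body><div><h1>Win big</h1><p>casino</p></div></body></html>"
page_b = "<html><body><div><h1>Other title</h1><p>different words</p></div></body></html>"
page_c = "<html><body><table><tr><td>x</td></tr></table></body></html>"

same_template = dom_fingerprint(page_a) == dom_fingerprint(page_b)
different_template = dom_fingerprint(page_a) != dom_fingerprint(page_c)
```

Under this assumption, `page_a` and `page_b` share a fingerprint because only their text differs, while `page_c` has a different structure. An exact hash is brittle against small structural variations, so a real system would likely need a similarity measure rather than strict equality.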
Using precision and recall as the evaluation criteria, the thesis verifies the validity of the two template definitions and compares the performance of the two algorithms: DBT achieves higher recall than CBT, while CBT is superior in precision. The DBT algorithm is then used to compute the template fingerprint feature. To identify cheating template sites, sites are first clustered by their DBT fingerprints, so that sites containing the same template fingerprint fall into the same cluster. To improve the accuracy of template site recognition, the thesis proposes a method for mining high-quality pages based on user behavior features such as user loyalty, visit depth, click-outs, and stay time. The ban rate and the false-ban rate are used to verify that these user behavior features are effective in identifying high-quality pages. Finally, pornographic and gambling vocabularies are obtained with a topic model, a porn rate and a gambling rate are defined, and a decision tree classification algorithm is used to identify cheating templates.
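The clustering and evaluation steps described above can be illustrated with a short sketch. It assumes that each site has already been reduced to a single fingerprint string and that clustering means grouping sites with identical fingerprints; the abstract does not give the actual procedure. The function names and the example data are hypothetical.

```python
from collections import defaultdict


def cluster_by_fingerprint(site_fingerprints):
    """Group sites that share a fingerprint; singleton groups are dropped."""
    clusters = defaultdict(list)
    for site, fingerprint in site_fingerprints.items():
        clusters[fingerprint].append(site)
    return {fp: sorted(sites) for fp, sites in clusters.items() if len(sites) > 1}


def precision_recall(predicted, actual):
    """Precision and recall of a predicted set against a labeled set."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical data: fingerprints would come from a template extraction step.
sites = {"a.example": "f1", "b.example": "f1", "c.example": "f2"}
clusters = cluster_by_fingerprint(sites)

# Hypothetical labels: sites flagged by the method vs. sites known to cheat.
precision, recall = precision_recall(["a.example", "b.example"],
                                     ["a.example", "d.example"])
```

Here `a.example` and `b.example` end up in one cluster, and one of the two flagged sites is truly a cheating site out of two labeled ones, so both precision and recall are 0.5.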
Keywords/Search Tags: web cheating, templates, clustering, user behavior, decision trees