
Web Spam Detection Based On Single Page Features Extraction And Hybrid Integrated Classification

Posted on: 2024-01-10
Degree: Master
Type: Thesis
Country: China
Candidate: F Gao
GTID: 2558307124972029
Subject: Computer technology
Abstract/Summary:
Web spam refers to web pages that use illegitimate means to boost their rankings. Web spam appears on the Internet because it can yield high profits at relatively low cost, so web page owners use a variety of methods to create it. Since web spam causes large losses for search engine companies and Internet users, improving the efficiency of web spam detection is important to both. The key technologies for improving detection performance are feature engineering and unbalanced data processing. In feature engineering, new features can be extracted from web page information, or feature selection can be performed on a standardized web page feature library. In unbalanced data processing, the classes of the data set can be balanced by sampling, or ensemble learning can be embedded in the model so that the optimized model suits unbalanced data.

Hundreds of effective features for detecting web spam already exist. Most of them are obtained by comprehensive extraction over the information of many web pages, so the feature algorithms are difficult to design and the features themselves are computationally expensive. Because the extraction methods for these features do not focus on analyzing the semantic similarity of web page content and links, the final trained model does not generalize well. Furthermore, web spam detection is a typical class-imbalance problem, and traditional sampling methods lead to loss of data information or to model overfitting. To address these problems in feature engineering and unbalanced data processing, this paper proposes several improved methods for web spam detection and demonstrates their effectiveness through experimental results. The main work is as follows.

Firstly, based on the existing single page statistical features used for web spam detection, new statistical features are extracted so that the feature set describes web pages more comprehensively and in finer detail, improving model classification performance. This method extracts statistical features only from the HTML script information of the current web page. Compared with the features in the standard web page feature library, these features are easier to design, cheaper to compute, and less redundant with one another. To verify the validity of the statistical features, all models used in the experiments are traditional machine learning models with default hyperparameters, and the validity of the features is confirmed by analysis of the experimental results.

Secondly, based on the existing single page semantic features used for web spam detection, new semantic features are extracted and the extraction method is improved. Semantic features are divided into text similarity features and link similarity features. Text similarity features are extracted by combining a topic model with word mover's distance (LDA-WMD). For link similarity features, given the particular structure of the underlying domain names, Depth First Segmentation (DFS) is first used for preprocessing, and the similarity is then calculated. The experimental results show that these semantic features are effective on their own and outperform the semantic features proposed by other methods, and that combining them with the statistical features further improves the classification performance and generalization ability of the model.

Finally, a web spam detection algorithm based on the Hybrid Ensemble Classification Algorithm (HEC) is proposed. HEC exploits the attribute perturbation inherent in the CART decision tree, which increases the diversity among the base classifiers. HEC combines three techniques to improve the overall classification performance of the model. First, it uses random undersampling with replacement (RUS-replacement) so that as many samples as possible participate in data balancing and information loss is avoided. Second, the Boosting algorithm is used to reduce the model's prediction bias. Third, the Bagging algorithm is used to reduce the model's prediction variance. The experimental results show that HEC can handle the data imbalance problem, and that its classification performance is better than that of a single ensemble model and of other web spam detection algorithms.
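
The three ingredients described for HEC (undersampling with replacement, Boosting, Bagging) can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not give the algorithm's details, so the way the pieces are wired together here is an assumption. In particular, one-feature decision stumps stand in for CART trees, "RUS-replacement" is read as drawing majority-class samples with replacement for each bag, and all function names (balanced_bag, fit_adaboost, fit_hybrid, hybrid_predict) are hypothetical. Labels are assumed to be +1 (spam) and -1 (normal).

```python
import math
import random

def balanced_bag(X, y, rng):
    """Keep every minority sample and draw an equal number of
    majority samples WITH replacement (assumed reading of RUS-replacement)."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == -1]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    picked = minority + [rng.choice(majority) for _ in minority]
    return [X[i] for i in picked], [y[i] for i in picked]

def stump_predict(stump, row):
    feature, threshold, polarity, _ = stump
    return polarity if row[feature] > threshold else -polarity

def fit_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error.
    A stump is a stand-in for the CART base learner named in the text."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity, 0.0)
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(stump, row) != yi)
                if best is None or err < best[3]:
                    best = (feature, threshold, polarity, err)
    return best

def fit_adaboost(X, y, rounds):
    """Plain discrete AdaBoost over stumps (the bias-reduction step)."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        err = min(max(stump[3], 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        # Up-weight misclassified samples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def boost_predict(model, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 1 if score >= 0 else -1

def fit_hybrid(X, y, n_bags=5, rounds=5, seed=0):
    """Bagging over balanced bags (the variance-reduction step):
    each bag trains its own boosted model."""
    rng = random.Random(seed)
    return [fit_adaboost(*balanced_bag(X, y, rng), rounds)
            for _ in range(n_bags)]

def hybrid_predict(bags, row):
    """Majority vote across the boosted models; ties go to spam (+1)."""
    votes = sum(boost_predict(model, row) for model in bags)
    return 1 if votes >= 0 else -1
```

Because every bag is balanced before boosting, no single learner sees the skewed class ratio, while drawing the majority samples afresh for each bag lets many different majority samples contribute across the ensemble.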
Keywords/Search Tags:feature extraction, unbalanced data processing, statistical characteristics, semantic features, hybrid ensemble classification