Machine Learning Based Hidden Hyperlink Detection For Web Pages

Posted on: 2024-01-30
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Zhang
Full Text: PDF
GTID: 2568307157450224
Subject: Master of Electronic Information (Professional Degree)
Abstract/Summary:
As society becomes increasingly information-driven, web applications have become the most commonly used channel for exchanging information, and their security cannot be neglected. It spans web server security, application platform security, and web client security. Hidden hyperlinks, as their name suggests, are concealed beneath the surface of a web page and are easily overlooked, yet they can leak personal information and even threaten property. Whereas hidden hyperlinks once took a single form, they now appear in many forms, which makes web pages considerably harder for administrators to maintain. Measures are therefore needed to identify this security hazard effectively.

In recent years, researchers have proposed detection models for web pages containing hidden hyperlinks. One class of methods relies on rule matching against blacklists and whitelists; the other relies on machine learning. Machine learning based methods detect hidden hyperlinks more efficiently than traditional rule matching, but the following problems remain:
1. Most methods consider only the text features of hidden hyperlink web pages, which weakens the performance of the detection model.
2. The extracted hidden hyperlink features are redundant, which makes model training more expensive.
3. Detection models built on a single classifier show little diversity, so their effectiveness is limited.

In response to these problems, this thesis carries out the following work:
1. To improve the overall performance of the hidden hyperlink web page detection model, the hidden structure features of a web page are extracted alongside its text features, which improves the detection ability of the model.
2. Considering the correlation among hidden hyperlink features in the sample set, a hybrid feature selection method is proposed. The sample set is first screened by filter-based feature selection, the resulting feature set is then represented numerically using TF-IDF, and finally the feature vectors are reduced again by principal component analysis, which simplifies the feature space of the training samples.
3. A comparison of the detection results of models built from individual classifiers shows that the diversity among them is low. An ensemble learning model based on the differential evolution algorithm is therefore proposed: a suitable feature selection method is first chosen for each single-classifier model, and the ensemble is then optimized through weighting. This further improves the accuracy of the detection model and balances the correlation among features against the diversity among classifiers.

Comparative experiments show that the proposed ensemble learning optimization, which combines hybrid feature selection with differential evolution, achieves better detection performance on hidden hyperlink web pages. The hybrid feature selection method performs well with decision tree, random forest, AdaBoost, and support vector machine classifiers, and the differential evolution algorithm improves the overall performance of the detection model during ensemble optimization. Comparative experiments on the platform's real data demonstrate the practicality and effectiveness of the proposed scheme.
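The weight-based ensemble optimization described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not specify the fitness function, the differential evolution variant, or how classifier outputs are combined. The sketch assumes a weighted soft vote over each classifier's positive-class probability, validation error as the fitness, and the common rand/1/bin scheme. All function names, parameter values, and the toy data are hypothetical.

```python
import random

def weighted_vote(probs, weights):
    """Combine per-classifier positive-class probabilities into 0/1 labels."""
    total = sum(weights) or 1.0  # avoid division by zero if all weights are 0
    n_samples = len(probs[0])
    return [
        1 if sum(w * p[i] for w, p in zip(weights, probs)) / total >= 0.5 else 0
        for i in range(n_samples)
    ]

def error_rate(weights, probs, labels):
    """Fraction of samples the weighted vote misclassifies (assumed fitness)."""
    preds = weighted_vote(probs, weights)
    return sum(p != y for p, y in zip(preds, labels)) / len(labels)

def de_optimize_weights(probs, labels, pop_size=20, generations=50,
                        f=0.5, cr=0.9, seed=0):
    """Search for voting weights in [0, 1] with rand/1/bin differential evolution."""
    rng = random.Random(seed)
    dim = len(probs)  # one weight per classifier
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    pop[0] = [1.0] * dim  # include plain uniform voting as a candidate
    fitness = [error_rate(ind, probs, labels) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals other than the target.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # force at least one mutated gene
            trial = []
            for j in range(dim):
                if rng.random() < cr or j == j_rand:
                    v = pop[a][j] + f * (pop[b][j] - pop[c][j])
                    trial.append(min(1.0, max(0.0, v)))  # clip to [0, 1]
                else:
                    trial.append(pop[i][j])
            trial_fit = error_rate(trial, probs, labels)
            if trial_fit <= fitness[i]:  # greedy selection
                pop[i], fitness[i] = trial, trial_fit
    best = min(range(pop_size), key=fitness.__getitem__)
    return pop[best], fitness[best]

# Toy validation data (made up): three classifiers, eight pages.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
probs = [
    [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.9, 0.2],
    [0.2, 0.8, 0.6, 0.4, 0.1, 0.9, 0.6, 0.3],
    [0.3, 0.7, 0.2, 0.6, 0.4, 0.2, 0.3, 0.8],
]
weights, err = de_optimize_weights(probs, labels)
print("weights:", weights, "validation error:", err)
```

Because uniform weights are placed in the initial population and selection is greedy, the returned weights are never worse than plain equal voting on the validation data used for the search. A real system would evaluate the chosen weights on held-out data to check for overfitting.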
Keywords/Search Tags: Hidden hyperlink, Machine learning, Mixed feature selection, Single classifier, Differential evolution