Research On News Popularity Prediction Based On Ensemble Learning Feature Selection

Posted on: 2022-05-09
Degree: Master
Type: Thesis
Country: China
Candidate: M Qu
Full Text: PDF
GTID: 2517306311959249
Subject: Statistics
Abstract/Summary:
With the continued growth in the number of Internet users and the rapid development of mobile Internet technology in China, online news has become the main carrier of online information. Widely shared, highly popular online news can bring large profits to news media, and news predicted to be highly popular can be given more promotional resources to increase those profits. It is therefore worthwhile to extract and study the features that determine the popularity of online news and to build a model that predicts, before publication, whether a news item will be popular. The data used in this thesis come from Mashable, an online news website, and include 61 news features covering words, links, digital media, keywords, time, and natural language processing. Based on the characteristics of online news, the data are divided into news without text and news containing text. Popularity prediction is framed as a binary classification problem, using the number of shares as the indicator of whether a news item is popular, and an XGBoost model is used for prediction. Because redundant features in news data affect the predictive and generalization ability of the model, this thesis approaches the problem from the perspective of feature selection. Two feature selection models are constructed and improved to screen the features that affect news popularity. First, inspired by ensemble models in machine learning, the thesis derives the feasibility of ensembling feature selection methods, uses sequential backward selection to greedily choose among the base feature selection methods, and then integrates the chosen methods. The ensemble method can select a smaller feature subset than the base feature selection methods. Second, for feature selection with a genetic algorithm, the thesis improves the fitness function. Comparison shows that the improved fitness function enables the genetic algorithm to select smaller feature subsets with better prediction and generalization performance. For model evaluation, the thesis uses the test-set AUC, the training-set AUC, and the model running time to assess the predictive ability, degree of overfitting, and running efficiency of all models. Compared with using the data without feature selection and with traditional feature selection methods, the results show that the improved genetic algorithm yields the best feature subset on the small sample of news without text, while the ensemble feature selection method works better on the large sample of news containing text. Finally, based on the empirical results, the thesis offers suggestions to news media, summarizes its shortcomings, and outlines directions for future work.
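The abstract does not state the improved fitness function itself. As a rough illustration of how a fitness function can push a genetic algorithm toward smaller feature subsets, the sketch below assumes one common form: a predictive score minus a penalty proportional to the fraction of features selected. All names here (`penalized_fitness`, `genetic_select`, `toy_score`, the `penalty` weight) are hypothetical, and `toy_score` is a stand-in for what, in the thesis's setting, would be the AUC of an XGBoost model trained on the selected features.

```python
import random


def penalized_fitness(mask, score_fn, penalty=0.1):
    """Score of the selected subset minus a penalty that grows with its size.

    Assumption: this penalty form is illustrative only; it is not taken
    from the thesis.
    """
    n_selected = sum(mask)
    if n_selected == 0:
        return float("-inf")  # an empty subset cannot train a model
    return score_fn(mask) - penalty * n_selected / len(mask)


def genetic_select(n_features, score_fn, pop_size=30, generations=40,
                   mutation_rate=0.05, penalty=0.1, seed=0):
    """Return the best 0/1 feature mask found (requires n_features >= 2)."""
    rng = random.Random(seed)

    def fit(mask):
        return penalized_fitness(mask, score_fn, penalty)

    def pick(population):
        # Binary tournament: the fitter of two random individuals wins.
        a, b = rng.sample(population, 2)
        return a if fit(a) >= fit(b) else b

    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fit)
    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            p1, p2 = pick(population), pick(population)
            cut = rng.randrange(1, n_features)      # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Flip each bit independently with probability mutation_rate.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = children
        gen_best = max(population, key=fit)
        if fit(gen_best) > fit(best):
            best = gen_best
    return best


def toy_score(mask):
    # Hypothetical stand-in for a validation AUC: only features 0-2 help.
    return 0.5 + 0.15 * sum(mask[i] for i in (0, 1, 2))


best_mask = genetic_select(n_features=12, score_fn=toy_score)
print("selected features:", [i for i, g in enumerate(best_mask) if g])
```

With `penalty=0`, extra uninformative features cost nothing and tend to survive in the final mask; a positive penalty makes a smaller subset with the same score strictly fitter, which is the behavior the abstract attributes to the improved fitness function.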
Keywords/Search Tags: Online news popularity, Feature selection, XGBoost, Ensemble model, Genetic algorithm