
Research On Nonrandom Missing Data Classification Modeling Based On TrAdaBoost Algorithm

Posted on: 2021-05-14
Degree: Master
Type: Thesis
Country: China
Candidate: Z H Li
GTID: 2480306113467244
Subject: Applied Statistics
Abstract/Summary:
Data analysis has opened up major opportunities for innovation in fields such as economic production, biological research, and medical treatment. Missing data, however, poses serious obstacles to data mining and modeling. Imputation is the most commonly used way of handling missing values, because it retains as much of the original data and its structure as possible. But when a large share of the data is missing not at random, and especially when the affected variable is important to the model, the few observed values are rarely enough to impute a data set with the same distribution as the original data, and traditional machine learning algorithms can no longer build effective models on the result.

Transfer learning can train a model even when the training data do not follow exactly the same distribution as the data to be predicted. Its core idea is to find what the source domain data and the target domain data have in common despite their different distributions, and to transfer that shared information from the source domain to the modeling of the target domain. With this in mind, this paper proposes applying TrAdaBoost, an instance-based transfer learning algorithm, to missing data modeling, in order to address the distribution shift that imputation introduces.

The procedure is as follows. First, a missing data method is used to impute the missing values in the training set. The imputed training samples are treated as source domain data whose distribution differs from the original one, the complete training samples are treated as the target domain, and the test set to be predicted consists of complete samples from the original distribution. A TrAdaBoost model with logistic regression as the base classifier is then built on the training set formed by the source and target domain data. During the iterations, the weights of misclassified target domain samples are increased and the weights of misclassified source domain samples are decreased, so that the model learns more of the information shared by the two domains. The predictions of the base classifiers obtained over the iterations are then weighted according to their predictive performance and combined into a strong classifier.

To verify the method, this study simulates a binary classification data set with five features, generates missing values in a single variable under a non-random missingness mechanism, and imputes the training set with common missing data methods. It then builds a TrAdaBoost model, a logistic regression model, AdaBoost and XGBoost models, and a logistic regression model that drops the variable with missing values, and compares the AUC of these models.

The simulation experiments show that when non-random missingness occurs at a rate above 85%, the TrAdaBoost model performs much better than traditional machine learning algorithms. This paper also studies two influencing factors: the degree of correlation between the variables, and the factor by which the class label affects the probability that the variable is missing. When the non-random missing rate is above 85%, the AUC of the TrAdaBoost-based model proposed here is higher than that of the traditional machine learning algorithms, regardless of whether the correlation among the feature variables is strong or weak and regardless of how strongly the class label affects the missing probability. In addition, the stronger the influence of the class label on missingness, the larger the improvement of the TrAdaBoost model over the other algorithms.
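The reweighting step described in the abstract can be illustrated with a minimal sketch. It assumes the commonly published form of the TrAdaBoost update for binary labels in {0, 1}, which may differ in detail from the thesis's own implementation. The function name `tradaboost_reweight`, its parameters, and the clamping of the error rate are hypothetical choices made for this illustration. Training the logistic regression base classifier is omitted; the sketch only shows how sample weights would change after one round.

```python
import math


def tradaboost_reweight(weights, y_true, y_pred, n_source, n_rounds):
    """One TrAdaBoost-style weight update (illustrative sketch).

    The first n_source entries are source-domain samples (here: imputed
    rows); the remaining entries are target-domain samples (complete rows).
    Assumes n_source >= 1 and n_rounds >= 1.
    """
    # 0/1 loss of the current base classifier on every sample
    loss = [abs(p - t) for p, t in zip(y_pred, y_true)]

    # Weighted error rate, measured on the target domain only
    target_w = weights[n_source:]
    target_loss = loss[n_source:]
    eps = sum(w * l for w, l in zip(target_w, target_loss)) / sum(target_w)
    # Assumption: clamp eps into (0, 0.5) so that beta_t stays in (0, 1)
    eps = min(max(eps, 1e-10), 0.499)

    beta_t = eps / (1.0 - eps)  # target-domain factor, changes every round
    # Source-domain factor, fixed across rounds
    beta_src = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n_source) / n_rounds))

    new_weights = []
    for i, (w, l) in enumerate(zip(weights, loss)):
        if i < n_source:
            # Misclassified source samples are down-weighted
            new_weights.append(w * beta_src ** l)
        else:
            # Misclassified target samples are up-weighted
            new_weights.append(w * beta_t ** (-l))
    return new_weights, beta_t


# Toy example: 2 source samples followed by 4 target samples
weights = [1.0] * 6
y_true = [1, 0, 1, 0, 1, 0]
y_pred = [1, 1, 1, 0, 1, 1]  # one source error (index 1), one target error (index 5)
new_w, beta_t = tradaboost_reweight(weights, y_true, y_pred, n_source=2, n_rounds=10)
```

In this toy run the misclassified source sample ends up with a weight below 1 and the misclassified target sample with a weight above 1, while correctly classified samples keep their weights. In a full loop the weights would be normalized before each base classifier is fitted.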
Keywords/Search Tags: Missing Data Modeling, TrAdaBoost, Machine Learning