| Traditional machine learning rests on two conditions: training and test data must be independent and identically distributed (i.i.d.), and sufficient data must be available. In practice, however, obtaining large amounts of labeled data is costly, and the i.i.d. condition is too strict. Transfer learning is particularly important in the face of these problems because it transfers knowledge from related domains to learning tasks in the target domain. TrAdaBoost is a widely used instance-based transfer learning algorithm with strong knowledge transfer capability. However, it suffers from slow convergence, susceptibility to negative transfer, an unreasonable initial weight distribution, and a tendency to overfit. This paper proposes a series of improvements that address these disadvantages, then proposes a weighted multi-source regression algorithm and applies it to industry problems in power communication networks. The main work of this paper is as follows: 1. Improved TrAdaBoost algorithm. This paper analyzes the shortcomings of TrAdaBoost in detail and makes several improvements in response. First, two weight initialization methods are proposed, one based on Very Fast KMM and the other on the probability output of a two-class classifier; both achieve better results. The former is simpler to use, while the latter is more widely applicable and more efficient. In addition, a sample exclusion strategy based on quantiles of the weights is proposed, which excludes some irrelevant samples and accelerates training. A second sample exclusion strategy, based on a minimum threshold, is also proposed. The improved algorithm, named VFKMM-TrAdaBoost, improves accuracy by about 2.5% over TrAdaBoost on the UCI public dataset 20 Newsgroups, cuts training time by at least half, and greatly reduces the risk of negative transfer. 2. Weighted multi-source VFKMM-TrAdaBoost
regression algorithm. Existing research on multi-source transfer learning rarely addresses regression; most of it targets symmetric two-class classification. This paper proposes an error tolerance coefficient that keeps the weights of source-domain samples from decaying too quickly, which improves the algorithm's performance. Building on the AdaBoostRegressor error function and the VFKMM-TrAdaBoost algorithm, it presents a weighted multi-source VFKMM-TrAdaBoost regression algorithm. Experiments on a modified Friedman #1 regression problem verify the effectiveness of the algorithm; the error tolerance coefficient increases the R^2 score by approximately 0.01. 3. Application of the above regression algorithm to practical problems in the power communication network industry. This paper proposes models for anomalous site detection (an anomalous site is one with a large number of missing services) and for true value prediction. In feature engineering, methods from social network analysis are introduced: centrality measures and PageRank-based features are extracted so that the importance of each site in the topology is fully considered. The weighted multi-source VFKMM-TrAdaBoost regression algorithm is used to predict the actual number of services at each site, transferring data from other provinces to the prediction tasks of provinces with too few sites. Anomalous sites come from two sources: sites flagged by iForest, and sites with large residuals between observed and predicted values. A corresponding system has been implemented, and offline verification data are collected so that the model can be further refined. Experimental results and offline validation results confirm the effectiveness of the model. |
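The quantile-based sample exclusion strategy from item 1 can be illustrated with a short sketch. This is not the paper's implementation: the function names, the default quantile `q`, the linear-interpolation quantile rule, and the renormalization step are all assumptions made for illustration. It assumes positive sample weights and drops source samples whose weight falls below the chosen quantile.

```python
from typing import List, Sequence, Tuple


def quantile_cutoff(weights: Sequence[float], q: float) -> float:
    """Return the q-quantile of `weights` using linear interpolation."""
    if not weights:
        raise ValueError("weights must be non-empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    ordered = sorted(weights)
    pos = q * (len(ordered) - 1)          # fractional rank of the quantile
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


def exclude_low_weight_samples(
    weights: Sequence[float], q: float = 0.1
) -> Tuple[List[int], List[float]]:
    """Keep samples whose weight is at or above the q-quantile.

    Returns the indices of the kept samples and their weights
    renormalized to sum to 1 (the renormalization is an assumption).
    """
    cutoff = quantile_cutoff(weights, q)
    kept = [i for i, w in enumerate(weights) if w >= cutoff]
    total = sum(weights[i] for i in kept)
    if total <= 0.0:
        raise ValueError("kept weights must have a positive sum")
    return kept, [weights[i] / total for i in kept]


# Hypothetical source-domain weights after a few boosting rounds.
source_weights = [0.05, 0.30, 0.01, 0.24, 0.40]
kept_idx, new_weights = exclude_low_weight_samples(source_weights, q=0.4)
```

Excluding by quantile rather than by a fixed value keeps the discarded fraction roughly constant as the weights shrink across boosting rounds; the minimum-threshold strategy mentioned in the abstract would instead compare each weight against a fixed lower bound.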