| Software defects have a significant impact on software quality and on software economics. To reduce the losses they cause, finding defects efficiently and accurately has become one of the most active problems in software engineering. In the 1990s, it was discovered that defects are not randomly distributed in software, and a series of models were subsequently proposed to predict defect proneness, number, severity, and distribution. In real development scenarios, however, not every software system can be guaranteed to have rich change-log data. For newly developed and small-scale systems in particular, the lack of training data leads to a "cold start" problem in defect prediction modeling, which limits the applicability of research results. Multi-source Cross-Project Defect Prediction (MCPDP) uses historical data from multiple other projects (source projects) to predict the likelihood that software modules in the target project are defective. This study addresses the cold-start problem and offers a way to build defect prediction models for new software or for systems lacking historical data. However, because projects differ in development language, programming style, and design patterns, their data are heterogeneous, so the source and target data follow different distributions. This paper proposes a solution to the heterogeneity of cross-project data. First, the source and target data are mapped into a common space, and the feature spaces of the two are then made to overlap by rotating the projection matrix, thereby achieving feature alignment. Second, to further improve the accuracy of heterogeneous cross-project defect prediction, a source data selection method is designed. On this basis, a cross-project defect prediction model is constructed. To demonstrate the effectiveness of the proposed method, experiments were carried out on four open data sets, SOFTLAB, NASA, Relink, and AEEEM. The results show that the proposed method improves the F-measure by 4% and 5%, respectively, over the baseline method, indicating that the proposed method performs well... |
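
The abstract reports its results as the F-measure, the harmonic mean of precision and recall for the defective class. As a reminder of how that metric is computed, the following is a minimal sketch. It is not the paper's evaluation code: the function names `confusion_counts` and `f_measure` and the label vectors are hypothetical, and the sketch assumes binary labels in which 1 marks a defective module.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int]:
    """Count true positives, false positives, and false negatives.

    Assumption: label 1 means "defective", label 0 means "clean".
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for eight modules of a target project.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, fp, fn = confusion_counts(y_true, y_pred)
score = f_measure(tp, fp, fn)
```

With these made-up labels there are three true positives, one false positive, and one false negative, so precision and recall are both 0.75 and the F-measure is 0.75.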