As software applications diversify, software grows in scale and complexity, and software security faces increasingly severe challenges. Defects are inevitable during software design and program development, and they seriously threaten software security. Detecting and repairing defects before release therefore helps to optimize test resources, improve software quality, and save labor and cost. Cross-project defect prediction (CPDP) uses data from other projects (source projects) to mine effective features automatically and to train a defect prediction model, which then predicts whether software samples in the target project contain defects. CPDP is a feasible way to improve the quality and reliability of software development, and it has achieved good results.

CPDP still faces challenges in practical application. A key problem is how to mine the discriminant information of defect data fully and effectively. In discriminant learning, the specific challenges are as follows: differences in data distribution between projects lead to the distribution under-adaptation problem; existing CPDP methods usually assume that the data from the source project are labeled, whereas in practice the available samples are usually unlabeled, which raises the problem of using unlabeled data effectively; and different kinds of metrics describe the same software module from different views, yet the effective information in multi-source data drawn from multiple kinds of software metrics is insufficiently mined. This paper focuses on these problems and designs corresponding solutions based on discriminant feature learning to further improve the performance of CPDP. The specific research work is as follows:

(1) To address the insufficient mining of discriminant information and the distribution under-adaptation between data from different projects in CPDP, two transfer-learning-based CPDP approaches are proposed: the Selective Pseudo-labeling based Subspace Learning (SPSL) approach and the Manifold embedded Distribution Adaptation (MDA) approach. To reduce the distribution difference between projects and make full use of the discriminant information in the data, SPSL combines subspace learning with pseudo-labeling. SPSL first learns a transfer matrix that maps the data from the source project and the target project into a common subspace, so that the distributions of the two projects become similar. In the common subspace, SPSL obtains pseudo labels for the unlabeled target-project data by nearest-neighbor prediction and structured prediction, then combines the labeled source-project data with the pseudo-labeled target-project data and uses the information they carry to update the transfer matrix. To further reduce the distribution difference between projects, MDA considers the marginal distribution difference and the conditional distribution difference at the same time. MDA first uses manifold feature learning to map the high-dimensional data into a manifold space, where latent information from different projects is easier to exploit. MDA then uses the marginal and conditional distributions jointly to perform distribution adaptation learning, which reduces the distribution gap between the data. Comprehensive experimental results show that both approaches fully exploit the information from different projects, mitigate the large distribution difference, and improve prediction performance.

(2) To study the effective use of unlabeled data, the Discriminative Adversarial Feature Learning (DAFL) approach is proposed. DAFL introduces an adversarial learning framework into semi-supervised cross-project defect prediction to better address the data distribution difference between projects. DAFL consists of two parts that compete with each other: a feature transformer and a project discriminator. The feature transformer tries to mine the discriminant information of labeled data from target project and unlabeled data from source project, and uses the intrinsic structure inferred from the data to improve the discriminability of the data. The project discriminator tries to determine, from the generated representation, whether a software sample comes from the source project or the target project, which reduces the distribution difference between data from different projects. Experimental results show that DAFL effectively exploits the discriminative information in both unlabeled and labeled data, mitigates the large distribution difference, and improves prediction performance.

(3) To explore joint data mining from multi-source data, the Deep Multi-view Cross-project Defect Prediction (DMCDP) approach is proposed. Because existing CPDP methods ignore the complementary information between product metrics and process metrics, DMCDP models defect prediction based on product metrics and process metrics as a multi-view learning problem. DMCDP designs a deep learning framework to handle the heterogeneity between software metrics from different views within a project, and it explores the complementarity and discriminability of data across views. Because the data distributions differ widely, a discrepancy constraint is designed to reduce the gap between different projects and different views. Experimental results verify that DMCDP, the CPDP model based on multi-view learning, outperforms the other defect prediction models compared.
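
As an illustration of the joint distribution adaptation idea described for MDA in (1), the following is a minimal sketch and not the method as implemented in this work. It assumes a linear-kernel maximum mean discrepancy (squared distance between feature means) as the gap measure, assumes pseudo labels are already available for the target project, and introduces a hypothetical balance weight `mu` between the marginal and conditional terms; none of these choices or function names is taken from the text above.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Column-wise mean of a non-empty collection of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def marginal_gap(source, target) -> float:
    """Marginal term: distance between the overall source and target means."""
    return squared_distance(mean_vector(source), mean_vector(target))


def conditional_gap(source, source_labels, target, target_pseudo_labels) -> float:
    """Conditional term: per-class mean distance, summed over shared classes.

    Target labels are pseudo labels (an assumption of this sketch), since
    true target labels are unavailable in CPDP.
    """
    total = 0.0
    for c in sorted(set(source_labels) & set(target_pseudo_labels)):
        src_c = [x for x, y in zip(source, source_labels) if y == c]
        tgt_c = [x for x, y in zip(target, target_pseudo_labels) if y == c]
        total += squared_distance(mean_vector(src_c), mean_vector(tgt_c))
    return total


def joint_gap(source, source_labels, target, target_pseudo_labels, mu=0.5) -> float:
    """Weighted sum of both terms; mu is a hypothetical balance factor."""
    marginal = marginal_gap(source, target)
    conditional = conditional_gap(source, source_labels, target, target_pseudo_labels)
    return (1.0 - mu) * marginal + mu * conditional


if __name__ == "__main__":
    # Toy data: label 0 = clean module, label 1 = defective module.
    source = [[0.0, 0.0], [2.0, 2.0]]
    source_labels = [0, 1]
    target = [[1.0, 1.0], [3.0, 3.0]]
    target_pseudo_labels = [0, 1]
    print(marginal_gap(source, target))
    print(conditional_gap(source, source_labels, target, target_pseudo_labels))
    print(joint_gap(source, source_labels, target, target_pseudo_labels))
```

In a full adaptation procedure, a feature mapping would be learned to make this joint gap small; the sketch only shows how the two terms can be measured and combined.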