Research On Key Technology Of Cross-Version Software Defect Prediction Based On Machine Learning

Posted on: 2023-10-16 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: Y Y Zhao
GTID: 1528306911495304 | Subject: Computer Science and Technology
Abstract/Summary:
In software quality assurance, software testing is an activity with limited resources. As software projects grow in scale and complexity, the cost and time of testing increase sharply. Learning-based defect prediction is one of the important techniques for maintaining software quality, because it helps allocate quality assurance resources effectively. According to the source of the training set, learning-based defect prediction can be divided into three scenarios: within-version defect prediction, cross-version defect prediction, and cross-project defect prediction. Cross-version defect prediction trains a model on the defect data of a historical version and uses it to predict the defect labels of modules/classes in the new version. For a project with multiple historical versions, the new version inherits and reconstructs some modules of the previous version, so the defect data distributions of the new and historical versions are more similar than those of different projects. Industry practitioners therefore also regard cross-version defect prediction as a practically valuable scenario. Most defect prediction research has provided rich technical references and important practical guidance for cross-project and within-version defect prediction, but has rarely addressed cross-version defect prediction. In recent years, cross-version defect prediction has received much attention from researchers. However, some reported results are contradictory, which makes it difficult for testers to understand the essence of cross-version data. In particular, there is disagreement over how the defect data of historical versions should be used. Following a process of investigating, mining, and solving problems, this work identifies the relevant problems of cross-version defect prediction and designs techniques to improve its performance. The specific research content is as follows:

1. First, to make up for the lack of empirical experience in cross-version software defect prediction, this dissertation performs a large, comprehensive empirical study that builds on previous defect prediction research. Large-scale experiments with 25 learning algorithms and 5 metrics investigate several important issues in cross-version defect prediction. The results show that: (a) BayesNet is the best base classifier in cross-version defect prediction (false positive rate 0.332, recall 0.657), followed by random forest (false positive rate 0.436, recall 0.676); (b) although defect data from the same version shares the same data distribution, within-version defect prediction is not always more accurate than cross-version defect prediction; (c) cross-version data problems, including changes in feature trends and in defect instances, can affect the accuracy of defect prediction. Then, based on the results of the empirical study, the impact of variations in defect data between versions on the defect prediction model is examined. The essence of cross-version defect data is clarified through three forms of concept drift, and five common learning algorithms are then used to measure the impact of cross-version data problems on defect prediction performance. Based on these results, we encourage practitioners to introduce a drift detection step into the traditional cross-version defect prediction method to address concept drift.

2. To address the high false positive rate and low recall caused by class imbalance and label drift of defect data between versions, this dissertation proposes a weight-adjustable BayesNet model for cross-version defect prediction. Its design principle is to change the sampling probability of misclassified instances according to the classification error rate on the training set. When a defect instance of the historical version is misclassified many times, the model increases its sampling probability each time a sub-classifier is learned. When a defect instance of the historical version is correctly classified several times, the model reduces its sampling probability each time a sub-classifier is learned. The weight-adjustable BayesNet model can therefore learn more knowledge from the training set and focus on the false alarm instances of defect prediction in the target version. The experimental results show that the weight-adjustable BayesNet model effectively improves the accuracy of cross-version defect prediction, with an average recall of 0.674 and an average false positive rate of 0.322.

3. Building on these research results, this dissertation proposes a two-stage cross-version defect prediction framework to reduce the impact of concept drift in defect data between versions. The framework rests on the same assumption as transfer learning, namely that the source domain (defect data of the historical version) and the target domain (defect data of the new version) have similar or identical support. The aim of our adaptive framework is to maximize this similarity between cross-version data. The first stage selects, for the target version, the training set with the smallest concept drift according to the feature space similarity of the defect data. The second stage uses the unlabeled data of the target version to build a cluster analyzer, which identifies whether the defect instances in the selected training set belong to the target domain. The framework then selects the defect instances that belong to the target data clusters to form a new training set, and a classifier trained on this new training set predicts the potentially defective modules in the target version. The experimental results show that: (a) multiple base classifiers improve under the ST-TLF framework, with the MCC of SVM improving by 49.74%; (b) in best training set matching, the ST method reaches an accuracy of 82.4%, whereas the empirically recommended method reaches only 41.2%; (c) in a comparison with 12 methods, ST-TLF with BayesNet as the base classifier increases the average MCC by 18.84% over P15-NB, the best baseline method.

The research described above has been evaluated and analyzed through a large number of experiments on real, publicly available engineering projects. The experimental results provide an important reference for industry practitioners who design new cross-version defect prediction methods and seek to improve the accuracy of cross-version defect prediction.
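
For readers unfamiliar with the measures reported in the abstract (recall, false positive rate, and MCC), the following is a minimal sketch of how they are conventionally computed from a binary confusion matrix. The function names and the toy labels are illustrative assumptions and do not come from the dissertation; label 1 is assumed to mean "defective" and 0 "clean".

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = defective, 0 = clean)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def recall(tp, fn):
    """Share of truly defective modules that the model flags."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def false_positive_rate(fp, tn):
    """Share of clean modules that the model wrongly flags as defective."""
    return fp / (fp + tn) if (fp + tn) else 0.0


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy example (hypothetical labels): 3 defective and 5 clean modules.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(recall(tp, fn), false_positive_rate(fp, tn), mcc(tp, fp, tn, fn))
```

A lower false positive rate and a higher recall are both desirable. MCC summarizes all four confusion-matrix cells in one value between -1 and 1, which makes it less sensitive to class imbalance than plain accuracy.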
Keywords/Search Tags: software test, defect prediction, cross-version defect prediction, concept drift, machine learning, transfer learning