
Software Defect Prediction Model Driven By Imbalanced Datasets

Posted on: 2017-09-24
Degree: Master
Type: Thesis
Country: China
Candidate: X H Fang
Full Text: PDF
GTID: 2348330566457315
Subject: Computer Science and Technology
Abstract/Summary:
As computer technology develops and software becomes increasingly common in daily life and work, the size and complexity of software systems keep growing. However, software defect prediction techniques, which help guarantee the quality of software systems, have developed relatively slowly, and this lag greatly limits the development of software applications. Improving the accuracy of software defect prediction models is therefore essential to developing software systems. Because software defect datasets are imbalanced, the imbalance must be taken into account when building a defect prediction model. Most existing software defect prediction models process defect data without considering the imbalance problem, even though the cost of misclassifying a high-risk module is much greater than that of misclassifying a low-risk module.

In building the software defect prediction model, we first propose a new imbalanced-data classification approach based on boundary samples, named B-oversampling, which takes high-risk modules into account. The approach increases the number of boundary samples in the positive class, which are easily misclassified. B-oversampling uses the distances between the two classes to identify the boundary samples in the positive class, and then synthesizes new positive samples from those boundary samples to balance the numbers of positive and negative samples. This not only increases the number of positive samples but also extends the boundary of the positive class as far as possible, thereby improving the recognition rate of positive samples.

We then propose SENN-Bagging, an undersampling classification approach based on safe negative samples. SENN-Bagging uses a clustering consistency coefficient to divide the negative samples into safe samples and boundary samples. It then applies the SENN undersampling rule to the safe samples and uses the Bagging algorithm to classify the resulting datasets. SENN-Bagging reduces the number of negative samples while limiting the loss of important information in the negative class.

Based on these two methods, we built a software defect prediction model named BS-Boosting. The model consists of the B-oversampling oversampling method, the SENN undersampling method, and a Boosting ensemble classifier. It first uses different methods to determine the boundary samples of the positive class and of the negative class. Once the boundary samples are determined, BS-Boosting on the one hand reduces the number of safe negative samples with the SENN undersampling rule, and on the other hand increases the number of boundary positive samples with B-oversampling. During the iterations of the Boosting algorithm, the number of positive samples increases steadily; the new samples are classified, and synthetic samples that are misclassified are removed. Finally, the individual classifiers are combined to form the BS-Boosting software defect prediction model. In terms of data processing, BS-Boosting increases the difference between the two classes of defect data so that they can be distinguished more accurately. In terms of the classifier, the Boosting algorithm improves adaptability and helps prevent each classifier from overfitting.
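
The boundary-based oversampling idea described in the abstract can be illustrated with a small sketch. The abstract does not state the distance measure, the rule for selecting boundary samples, or the synthesis formula, so everything below is an assumption made for illustration: Euclidean distance to the nearest negative sample, a hypothetical `fraction` parameter that decides how many positives count as boundary samples, and SMOTE-style linear interpolation between a boundary sample and another positive sample. It is not the thesis's B-oversampling implementation.

```python
import math
import random


def nearest_distance(point, others):
    """Euclidean distance from `point` to its closest neighbour in `others`."""
    return min(math.dist(point, other) for other in others)


def boundary_positives(positives, negatives, fraction=0.5):
    """Return the positives that lie closest to the negative class.

    `fraction` is a hypothetical knob (not from the thesis): the share of
    positive samples treated as boundary samples.
    """
    ranked = sorted(positives, key=lambda p: nearest_distance(p, negatives))
    count = max(1, round(len(ranked) * fraction))
    return ranked[:count]


def oversample_boundary(positives, negatives, fraction=0.5, seed=0):
    """Synthesize positives from boundary samples until the classes are balanced.

    Assumed synthesis rule: SMOTE-style interpolation between a boundary
    positive and a randomly chosen positive (possibly the same sample).
    """
    rng = random.Random(seed)
    boundary = boundary_positives(positives, negatives, fraction)
    synthetic = []
    while len(positives) + len(synthetic) < len(negatives):
        base = rng.choice(boundary)
        mate = rng.choice(positives)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + t * (m - b) for b, m in zip(base, mate)))
    return synthetic


# Toy data: 3 defective (positive) modules, 6 non-defective (negative) ones.
positives = [(1.0, 1.0), (1.2, 0.9), (3.0, 3.0)]
negatives = [(0.0, 0.0), (0.1, 0.2), (0.3, 0.1),
             (0.2, 0.4), (0.5, 0.5), (0.4, 0.0)]
new_points = oversample_boundary(positives, negatives)
```

In this toy setup the function returns as many synthetic positives as are needed to match the negative class size. A fuller version would also need the Boosting-stage filtering of misclassified synthetic samples that the abstract mentions, which is left out here.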
Keywords/Search Tags:Software defect prediction model, Imbalanced data, Oversampling, Undersampling, Boosting