| Software defect prediction identifies potentially defective software modules by analyzing historical software data with classification and ranking models. When building a defect prediction model, defective samples are far fewer than non-defective samples and are unevenly distributed. The resulting inter-class and intra-class imbalances degrade the prediction model. To reduce the impact of data imbalance on the classifier, corrective methods exist for each of the four stages of building a software defect prediction model: data sampling, feature extraction, classifier optimization, and evaluation criteria. Data sampling is the first stage, and correcting imbalance there directly reduces the complexity of the subsequent stages. Commonly used sampling methods for the class imbalance problem achieve inter-class balance by adjusting the number of samples, but the generated samples usually follow the original distribution, so intra-class balance is not improved. Focusing on the sample distribution, this paper proposes a data generation method for class-imbalanced software defect prediction. Samples are partitioned by clustering according to their distribution in the feature space, and different strategies are adopted to synthesize defective samples according to the distribution within each sub-region. Increasing the number of defective samples balances the defective and non-defective classes, and generating data at different densities in different regions improves the intra-class distribution of defective samples. To verify the validity of the proposed method, experiments are carried out on nine public defect prediction data sets, comparing the proposed method with existing data generation methods under different classification algorithms. The results show that, by partitioning the samples and adopting different data generation strategies in different distribution regions, the proposed method improves classifier performance and reduces the impact of data imbalance on software defect prediction results. |
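
The general idea of cluster-then-synthesize oversampling can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm or the per-region strategies, so everything specific here is an assumption: plain k-means for the partition, inverse-cluster-size weighting so that sparse regions receive more synthetic samples, and SMOTE-style linear interpolation between two defective samples of the same cluster. The function names `kmeans` and `cluster_oversample` are hypothetical and are not taken from the paper.

```python
import math
import random


def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Move each center to the mean of its members (skip empty clusters).
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_oversample(defective, n_new, k=3, seed=0):
    """Synthesize n_new defective samples, favoring sparse clusters.

    Assumption (not from the paper): clusters are weighted by the
    inverse of their size, and new points are interpolated between
    two randomly chosen members of the same cluster.
    """
    rng = random.Random(seed)
    labels = kmeans(defective, k, seed=seed)
    clusters = [
        [p for p, lab in zip(defective, labels) if lab == j] for j in range(k)
    ]
    clusters = [c for c in clusters if c]  # drop empty clusters
    weights = [1.0 / len(c) for c in clusters]

    synthetic = []
    for _ in range(n_new):
        cluster = rng.choices(clusters, weights=weights, k=1)[0]
        a, b = rng.choice(cluster), rng.choice(cluster)
        t = rng.random()
        # A point on the segment between a and b stays inside the cluster's hull.
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic
```

To balance the two classes under this sketch, `n_new` would be set to the number of non-defective samples minus the number of defective samples, and the returned points would be appended to the defective class before training. A single-member cluster yields copies of that member, which a fuller implementation would likely handle with a separate strategy.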