| Automated software defect prediction analyzes the characteristics of the code itself to predict potential defects and help developers repair software in a timely manner. It can effectively reduce testing and maintenance costs, improve user experience, and prevent potentially catastrophic consequences, and has therefore become an important way to improve software quality. Traditional machine-learning-based software defect prediction takes a large number of software instances, each containing software measurement information and actual defect data, as the data set, and trains a machine learning model to discover the relationship between software measurement information and software defects and thereby predict potential defects. However, previous approaches lack effective data preprocessing and are therefore often affected by the class imbalance problem; that is, the number of defective samples is usually far smaller than the number of non-defective samples. Moreover, ignoring the degree of correlation between metrics and its impact on defect prediction hinders the construction of an effective feature set. More importantly, most existing studies do not carry out effective feature extraction before training, so prediction results are unsatisfactory because of the impact of high-dimensional data: a large number of redundant and irrelevant software metrics interferes with the accuracy and recall of software defect prediction. To solve these problems, this paper proposes a feature extraction method based on deep reinforcement learning (a deep Q-learning network, DQN) to eliminate uncorrelated, redundant, and noisy features, and applies it to software defect prediction based on binary classification models. The method not only improves prediction performance in terms of accuracy, F-measure, AUC, and MCC, but also reduces the computational burden of the machine learning algorithm.
The specific work is as follows: (1) To address class imbalance, this paper proposes, for the data preprocessing stage, an under-sampling method based on Balance Cascade for software defect prediction, which divides the original data set into multiple smaller data subsets. (2) This paper proposes a feature extraction model based on a deep Q-learning network (DQN). In this model, the weights of all metrics are first ranked by calculating the expected cross entropy, in order to avoid over-fitting; random matrix theory (RMT) is then used to construct a relation matrix that measures the degree of correlation between metrics; finally, the reward rule for the Q value is defined from the weight ranking, the relation matrix, and the number of errors. The proposed feature extraction model is based on this Q value: a convolutional neural network (CNN) model is trained on the data set, and its parameters are optimized, to extract a sequence of metric pairs that can replace the original metrics. (3) The proposed feature extraction model is applied to binary-classification machine learning models for software defect prediction, yielding a software defect prediction method, DQN-SDP, whose effectiveness is evaluated experimentally. The experiments use 11 NASA MDP datasets and 11 PROMISE datasets. First, we compare the software defect prediction results of three binary classifiers (Decision Tree, SVM, and KNN) with and without the proposed feature extraction model; all performance indicators confirm the effectiveness of the DQN-based feature extraction model. Second, we compare DQN-SDP with three other state-of-the-art learning-based software defect prediction methods on the same data sets. The experimental results show that the proposed method performs better in accuracy, F-measure, AUC, and MCC. |
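
The evaluation measures named above can be illustrated with a minimal sketch that computes accuracy, F-measure, and MCC from the four cells of a binary confusion matrix. The function name `binary_metrics` and the example counts are hypothetical and chosen only for illustration; they are not results from the paper. AUC is omitted because it requires ranked prediction scores rather than confusion counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall, F-measure and MCC.

    tp/fp/tn/fn are confusion-matrix counts, with "defective" taken as
    the positive class. Undefined ratios (zero denominator) are
    reported as 0.0 by convention in this sketch.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f_measure = 2 * precision * recall / pr_sum if pr_sum else 0.0
    # Matthews correlation coefficient: uses all four cells, so it stays
    # informative when one class is much rarer than the other.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Hypothetical imbalanced data set: 10 defective and 90 clean modules.
# A classifier that labels every module "clean" looks accurate but
# finds no defects at all.
always_clean = binary_metrics(tp=0, fp=0, tn=90, fn=10)

# A classifier that finds 6 of the 10 defects with 4 false alarms.
useful = binary_metrics(tp=6, fp=4, tn=86, fn=4)

for name, m in [("always_clean", always_clean), ("useful", useful)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

On such an imbalanced set, the trivial classifier reaches an accuracy of 0.9 while its F-measure and MCC are both 0, which is why defect prediction studies typically report F-measure and MCC alongside accuracy.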