| Software evolution is a long and continuous process. A software system typically undergoes a series of uneven changes over time. As the system develops, its functionality grows richer and the system becomes larger and larger. Its design, however, deteriorates, becoming more complex and harder to understand, so the system becomes difficult to maintain. As a result, maintenance costs 2-100 times as much as development over the life cycle of a software system. To improve maintainability and reduce maintenance cost, the system must be refactored without affecting its external behavior. Refactoring starts from detecting code bad smells, which makes their detection particularly important. A code bad smell is a problem in software design that makes the software difficult to evolve. A growing number of researchers use machine learning to detect bad smells, because machine learning methods can construct detection rules by learning from samples and then evaluate the results on test samples. Their results show that machine learning can achieve better detection results. However, code bad smell data sets are extremely unbalanced: negative samples far outnumber positive samples, which reduces the effectiveness of traditional machine learning algorithms. This paper proposes a new method to detect bad smells. Building on the decision tree, a traditional machine learning algorithm, it introduces a cost matrix with cognitive complexity as the cost factor to reduce the impact of data imbalance on the algorithm and improve detection accuracy. The paper focuses on two bad smells, long method and feature envy. The main work of this thesis is as follows: 1) For the detection of code bad smells, a new code bad smell 
detection algorithm and a cost-sensitive integrated (ensemble) classifier algorithm are proposed, taking the imbalance of the data sets into account. Starting from the traditional decision tree algorithm, the samples are resampled with an under-sampling strategy to generate multiple balanced subsets. A base classifier is trained on each subset, and the base classifiers are then combined into an integrated classifier. Finally, a misclassification cost is added to the splitting-attribute selection of the integrated classifier; this cost is determined by cognitive complexity and biases the classifier toward classifying the minority class accurately. 2) Object-oriented metrics are calculated from the abstract syntax tree. For long method detection, the abstract syntax tree is used to calculate the number of lines of each method (Methodline), cyclomatic complexity (McCabe), LCOM, and other metrics in the project. All long methods in the project are labeled for subsequent identification. For feature envy detection, the abstract syntax tree is used to extract the class name and method name and to calculate the ATFD (Access to Foreign Data) and LAA (Locality of Attribute Accesses) of each class in the project. 3) On the same data set, the effectiveness of the cost-sensitive integrated classifier is validated by comparing its experimental results with those of the decision tree and random forest algorithms. |
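
The under-sampling and combination steps in contribution 1) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the base learner is simplified to a depth-1 decision tree (a stump), the misclassification cost is a single hypothetical constant `fn_cost` rather than a value derived from cognitive complexity, and all function names and the toy data (method lines, cyclomatic complexity) are assumptions made for illustration.

```python
import random


def fit_stump(X, y, fn_cost, fp_cost=1.0):
    """Fit a depth-1 decision tree that minimises total misclassification cost.

    Returns (feature, threshold, polarity); the stump predicts 1 when
    polarity * (row[feature] - threshold) > 0.
    fn_cost is a hypothetical stand-in for a complexity-derived cost.
    """
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for polarity in (1, -1):
                cost = 0.0
                for row, label in zip(X, y):
                    pred = 1 if polarity * (row[f] - thr) > 0 else 0
                    if pred == 0 and label == 1:
                        cost += fn_cost  # missed smell (false negative)
                    elif pred == 1 and label == 0:
                        cost += fp_cost  # false alarm (false positive)
                if best is None or cost < best[0]:
                    best = (cost, f, thr, polarity)
    return best[1:]


def predict_stump(stump, row):
    f, thr, polarity = stump
    return 1 if polarity * (row[f] - thr) > 0 else 0


def balanced_subsets(X, y, n_subsets, rng):
    """Yield subsets holding every positive plus an equal-sized random
    draw of negatives (assumes negatives outnumber positives)."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    for _ in range(n_subsets):
        idx = pos + rng.sample(neg, len(pos))
        yield [X[i] for i in idx], [y[i] for i in idx]


def fit_ensemble(X, y, n_subsets=5, fn_cost=3.0, seed=0):
    """Train one cost-sensitive stump per balanced subset."""
    rng = random.Random(seed)
    return [fit_stump(Xs, ys, fn_cost)
            for Xs, ys in balanced_subsets(X, y, n_subsets, rng)]


def predict_ensemble(stumps, row):
    """Combine the base classifiers by simple majority vote."""
    votes = sum(predict_stump(s, row) for s in stumps)
    return 1 if 2 * votes > len(stumps) else 0


# Toy data: [method lines, cyclomatic complexity]; label 1 = long method.
X = [[120, 15], [95, 12], [150, 20],
     [8, 1], [10, 2], [12, 2], [15, 3], [18, 3], [20, 4],
     [22, 4], [25, 5], [28, 5], [30, 6], [35, 6], [40, 7]]
y = [1, 1, 1] + [0] * 12

model = fit_ensemble(X, y)
long_method_flag = predict_ensemble(model, [130, 18])
short_method_flag = predict_ensemble(model, [5, 1])
```

Each subset keeps all minority-class samples and draws the same number of majority-class samples, so every base classifier sees balanced data. Setting `fn_cost` above `fp_cost` makes a missed smell more expensive than a false alarm during split selection, which is the direction of bias the cost matrix is meant to introduce.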