| Machine-learning-assisted cancer diagnosis has long been an active research direction in the medical field. Traditional cancer diagnosis usually relies on manual observation and judgment, which is subjective and error-prone. Machine learning algorithms can mine effective features and patterns from large volumes of patient data and thereby improve the accuracy and efficiency of diagnosis. However, with the explosive growth of medical data, traditional single classifiers often fail to meet the requirements of high accuracy and high robustness, and may overfit or underfit when faced with different types of tumors. Therefore, this thesis proposes an improved algorithm based on Boosting ensemble rules to combine traditional machine learning classifiers, and designs a heterogeneous Boosting ensemble algorithm built on the improved algorithm to improve the accuracy and diversity of model-assisted cancer diagnosis. The specific research contents are as follows. Firstly, the feasibility of the TCGA breast cancer data is examined using statistical methods such as Cochran's Q test. Because the feature dimension of the TCGA data set is very high, traditional machine learning models are difficult to fit; differential analysis and feature selection are therefore used to reduce the dimensionality of the data. After the low-dimensional gene data are obtained, machine learning models are built for classification, and their results are compared to select a relatively better parameter-adaptive method. Secondly, to address the poor classification of minority-class data when machine learning models are applied to the imbalanced TCGA data, a resampling method is used to balance the data. Experiments show that the SMOTE algorithm effectively improves the classification results for minority samples and yields a high recall rate. However, SMOTE still suffers from low accuracy relative to recall, because the K value must be adjusted manually and the quality of the generated minority-class data is unstable. This thesis proposes a K-SMOTE algorithm that adaptively selects the K value, which effectively improves classification accuracy. Then, a Boosting ensemble algorithm is used to improve the classification performance of the machine learning models. Because the exponential loss function is sensitive to outliers, which can easily degrade the generalization performance of the model, this thesis proposes a Huber Boost algorithm based on the Huber loss function and integrates the K-SMOTE algorithm into its framework, which improves both the classification accuracy and the F1-score. Finally, to enhance the universality of the improved model, this thesis designs the HK-SHBoost algorithm to integrate heterogeneous base classifiers. Classification experiments and universality experiments show that the improved algorithm effectively enhances the diversity and universality of the model. |
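
The abstract's remark that the exponential loss is sensitive to outliers can be illustrated with a small sketch. It compares the standard exponential loss on the margin with the textbook Huber loss on the residual for a single badly misclassified point. The function names, the threshold `delta = 1.0`, and the example point are assumptions chosen for illustration; they are not taken from the thesis's Huber Boost formulation, which may define its loss differently.

```python
import math


def exponential_loss(margin):
    """Exponential loss on the margin y * f(x), as used by AdaBoost-style boosting."""
    return math.exp(-margin)


def huber_loss(residual, delta=1.0):
    """Textbook Huber loss: quadratic for |residual| <= delta, linear beyond it.

    delta is an assumed threshold, not a value from the thesis.
    """
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


# Hypothetical outlier: true label y = +1, model score f(x) = -5.
y, score = 1.0, -5.0

exp_penalty = exponential_loss(y * score)  # grows exponentially with the error
huber_penalty = huber_loss(y - score)      # grows only linearly past delta

print(f"exponential loss: {exp_penalty:.1f}")
print(f"Huber loss:       {huber_penalty:.1f}")
```

Under these assumptions the exponential penalty for the outlier is more than an order of magnitude larger than the Huber penalty, which is the sense in which a single mislabeled or extreme sample can dominate the reweighting in an exponential-loss booster.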