| Cancer is one of the main diseases threatening human health worldwide, and breast cancer has become the “number one killer” threatening women’s health. With the development of modern technology, machine learning has become an auxiliary method for disease prevention and diagnosis; it helps reduce the risk of misdiagnosis and missed diagnosis and improves the ability of modern medicine to address cancer. This article uses breast cancer patients in the SEER database as the data source and applies big data technology and machine learning algorithms to predict and analyze the one-year survival of breast cancer patients. A variety of equalization algorithms are used to reduce the class imbalance of the original data set, and undersampling and data preprocessing techniques are used to remove interfering samples, which improves the recognition of minority samples and reduces the influence of noise samples on the experimental results. The main research contents and conclusions are as follows: (1) Data mining and pathological analysis of breast cancer patients. Data cleaning is used to delete noise samples and invalid samples from the data set, and, based on the pathological analysis of breast cancer patients, the patient population is divided along seven characteristics, including the patient’s age, gender, skin color, primary tumor stage, regional lymph node stage, and distant metastasis stage. The analysis finds that factors such as tumor spread, tumor stage, and patient age have a greater impact on the expected survival of patients. (2) The RENN-SMOTE-SVM hybrid sampling algorithm is proposed. On the one hand, the undersampling RENN algorithm removes noise samples according to the nearest neighbor rule, which improves the recognition ability of the model. On the other hand, the SMOTE algorithm performs linear interpolation between minority samples to increase their number until the class ratio is balanced. The
results of predicting the one-year survival of breast cancer patients in the SEER database show that the algorithm reached an accuracy of 90.55%, an F1-score of 91.83%, a G-mean of 76.60%, and an AUC of 0.846; compared with other common equalization algorithms, it achieves better classification and prediction results. (3) A variety of equalization algorithms are used to reduce the class imbalance of the original data set, undersampling algorithms and data preprocessing techniques are used to eliminate interfering samples, and multiple models are established to predict the one-year survival of breast cancer patients. The experiments show that balancing the class sizes of an imbalanced data set through equalization techniques improves the identification of minority samples relative to the original algorithm model. In addition, the hybrid sampling algorithm performs better than a single balanced classification algorithm that uses only oversampling or only undersampling. |
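
The hybrid sampling step described in (2) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice of `k=3`, the rule that only majority-class samples may be removed, and the "more than half of the neighbours disagree" removal criterion are all assumptions made for illustration. The SVM classifier that would be trained on the resampled data is omitted.

```python
import math
import random


def knn_indices(points, i, k):
    """Return indices of the k points closest to points[i], excluding i."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def enn_pass(X, y, majority, k=3):
    """One Edited Nearest Neighbours pass (assumed variant): drop a
    majority-class sample when more than half of its k nearest
    neighbours belong to another class."""
    keep = []
    for i in range(len(X)):
        if y[i] == majority:
            nbrs = knn_indices(X, i, k)
            agree = sum(1 for j in nbrs if y[j] == majority)
            if 2 * agree < len(nbrs):
                continue  # treated as a noise sample and removed
        keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def renn(X, y, majority, k=3):
    """Repeated ENN: apply enn_pass until no further sample is removed."""
    while True:
        X_new, y_new = enn_pass(X, y, majority, k)
        if len(X_new) == len(X):
            return X_new, y_new
        X, y = X_new, y_new


def smote(X, y, minority, majority, k=3, seed=0):
    """Add synthetic minority samples by linear interpolation between a
    minority sample and one of its k nearest minority neighbours, until
    both classes have the same size."""
    rng = random.Random(seed)
    X_min = [x for x, label in zip(X, y) if label == minority]
    n_new = sum(1 for label in y if label == majority) - len(X_min)
    X_out, y_out = list(X), list(y)
    if len(X_min) < 2:
        return X_out, y_out  # interpolation needs at least two samples
    for _ in range(max(n_new, 0)):
        i = rng.randrange(len(X_min))
        j = rng.choice(knn_indices(X_min, i, min(k, len(X_min) - 1)))
        gap = rng.random()  # position along the segment between i and j
        synthetic = tuple(a + gap * (b - a) for a, b in zip(X_min[i], X_min[j]))
        X_out.append(synthetic)
        y_out.append(minority)
    return X_out, y_out


def hybrid_resample(X, y, minority=1, majority=0, k=3, seed=0):
    """RENN cleaning of the majority class followed by SMOTE oversampling."""
    X_clean, y_clean = renn(X, y, majority, k)
    return smote(X_clean, y_clean, minority, majority, k, seed)
```

Calling `hybrid_resample(X, y)` on a list of feature tuples `X` and labels `y` returns a cleaned and balanced copy of the data; cleaning before oversampling keeps SMOTE from interpolating in regions that the noisy majority samples had occupied.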