| In recent years, information technology, big data technology, and machine learning have made great progress, and the concept of Healthy China has gradually been popularized. The large volume of rich medical data now available can provide potentially valuable information, and applying machine learning methods to medical datasets has gradually become a research hotspot; it can help medical staff improve the efficiency of disease diagnosis and relieve some of the pain that patients experience during treatment. Medical datasets may contain missing values because of operational errors by data collection personnel or the limitations of measurement technology. Therefore, this paper aims to solve the problem of missing values in medical datasets: it selects several reasonable missing value filling methods to fill the missing values, and then uses machine learning classification algorithms to build a model that helps recognize and diagnose epileptic seizures. First, this paper introduces several methods for dealing with missing values in datasets, such as the mean filling method, the mode or median filling method, and the k-nearest-neighbor filling method, and points out the advantages and disadvantages of each. Then, based on the degree of correlation between the feature attributes in the dataset, this paper proposes a new distance measure that calculates the Pearson correlation coefficient between features and adds it as a weight to the Euclidean distance calculation, improving the distance measure of the k-nearest-neighbor filling algorithm. In addition, because the value of K is uncertain, a K value selection method for k-nearest-neighbor filling is proposed: a scale coefficient is set, and the nearest samples that fall within the scale coefficient are extracted as the k neighbors. Then, after data preprocessing, which includes missing value processing, outlier processing, and normalization
processing, three combinations of feature selection method and model are used to build the epilepsy patient recognition model: univariate feature selection combined with random forest, recursive feature selection combined with random forest, and an SVM model. The results show that the SVM model outperforms the other two models on accuracy, precision, recall, F1 score, AUC, and other metrics. Finally, although this paper studies only the processing and classification of missing values in medical datasets, these methods can serve as a reference for handling missing values in other datasets. Reasonable and effective processing of missing values helps uncover the potential information in datasets and improves their utilization efficiency. |
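
The abstract does not give the exact formulas, so the following is only a minimal sketch of how a correlation-weighted k-nearest-neighbor filling step might look. Several details are assumptions made for illustration and are not taken from the paper: the function name `weighted_knn_fill`, the use of the absolute Pearson coefficient as the weight, normalization of the distance by the sum of weights, the reading of the scale coefficient as "keep every donor whose distance is at most `scale` times the smallest distance", the `max_k` cap, and filling with the neighbors' mean.

```python
import numpy as np
import pandas as pd


def weighted_knn_fill(X, scale=1.5, max_k=10):
    """Fill NaNs with a correlation-weighted kNN rule (illustrative sketch only)."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()

    # Pearson correlation between features, computed on pairwise-complete rows.
    # Assumption: |r| is used as the weight; undefined correlations become 0.
    corr = pd.DataFrame(X).corr(method="pearson").to_numpy()
    corr = np.nan_to_num(np.abs(corr), nan=0.0)

    for i, j in zip(*np.where(np.isnan(X))):
        # Donors are the rows in which feature j is observed.
        donors = np.where(~np.isnan(X[:, j]))[0]
        if donors.size == 0:
            continue  # nothing to borrow from; leave the value missing

        dists = np.full(donors.size, np.inf)
        for pos, s in enumerate(donors):
            # Features observed in both rows (feature j is excluded
            # automatically because X[i, j] is NaN).
            shared = ~np.isnan(X[i]) & ~np.isnan(X[s])
            w = corr[j, shared]
            if w.sum() > 0:
                diff = X[i, shared] - X[s, shared]
                # Weighted Euclidean distance, normalized by total weight.
                dists[pos] = np.sqrt(np.sum(w * diff ** 2) / w.sum())

        if not np.isfinite(dists).any():
            # Fallback when no usable distance exists: plain mean filling.
            filled[i, j] = X[donors, j].mean()
            continue

        # Assumed K selection: keep donors within scale * smallest distance.
        order = np.argsort(dists)
        d_min = dists[order[0]]
        keep = order[dists[order] <= scale * d_min][:max_k]
        filled[i, j] = X[donors[keep], j].mean()

    return filled
```

With this reading, the number of neighbors is not fixed in advance: a sample that sits in a dense region borrows from several donors, while a sample with one clearly closest donor borrows from that donor alone.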