
Pre-processing And Quality Analysis Of Medical Big Data Based On Machine Learning

Posted on: 2024-02-04    Degree: Master    Type: Thesis
Country: China    Candidate: D H Zeng    Full Text: PDF
GTID: 2544307157985399    Subject: Electronic information
Abstract/Summary:
With the booming development of the big data industry and the society-wide promotion of smart healthcare, more and more medical datasets are used to assist medical diagnosis, and the quality of these datasets directly affects diagnostic results. However, modern medical datasets commonly suffer from attribute redundancy, missing data, numerical errors, and duplicate samples, which seriously degrade their quality. Solving these problems requires preprocessing of the original dataset, and a good preprocessing algorithm can substantially improve dataset quality, so the study of data preprocessing algorithms is of great practical importance. Attribute redundancy and missing data are common problems in datasets and are the two main topics studied in this paper. We use machine learning algorithms to preprocess the raw data and draw conclusions by analyzing the quality of the processed datasets. The details are as follows.

(1) Research on and improvement of dimensionality reduction techniques for medical datasets. We first analyze existing big data dimensionality reduction techniques and then propose solutions tailored to the characteristics of medical datasets. A Relief-based dimensionality reduction algorithm is proposed that improves the Relief algorithm's handling of duplicate attributes and its ability to distinguish low-weight attributes (the basic Relief weighting step is sketched below). The improved algorithm is applied to medical data dimensionality reduction and compared with traditional methods such as linear regression, random forest, principal component analysis, and statistical approaches. Experimental results show that, on high-dimensional medical datasets, the improved Relief algorithm reduces computational effort while improving prediction accuracy; sample classification accuracy improves by 8.8 percentage points on average.

(2) Research on and improvement of missing-value filling techniques for medical datasets. Existing data filling algorithms are mostly built on mathematical models of the complete samples only, ignoring the information carried by incomplete samples and thereby lowering data utilization. This paper therefore proposes a multiple-confidence-based missing-value filling method for medical data and evaluates its filling quality (a sketch of the confidence-weighted training loss follows below). In the first stage, the association relationships between attributes are analyzed statistically, and each sample is assigned multiple confidence levels by a correlation algorithm. In the second stage, the transmission path of the auto-associative neural network model is optimized to avoid the trivial self-mapping from input nodes to output nodes, and the loss function is optimized through dynamic selection of confidence levels so that incomplete samples have less influence on parameter optimization during training. In the third stage, the correlation relationships between attributes and the filling accuracy of the dataset are combined to evaluate the filling quality of the dataset. Experimental results show that the neural network model with multiple confidence levels fills missing values more accurately: filling accuracy on categorical attributes increases by 10.7 percentage points on average, the absolute percentage error on continuous attributes decreases by 12.7 percentage points on average, and overall data filling quality improves by 19 percentage points on average.
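To make the dimensionality reduction work in (1) concrete, the following is a minimal sketch of the classic Relief feature-weighting step on which the thesis builds. The variable names, the Manhattan distance, and the selection threshold are illustrative assumptions; the thesis's specific improvements for duplicate attributes and low-weight attributes are not reproduced here.

    # Minimal sketch of classic (binary-class) Relief feature weighting,
    # assuming numeric features scaled to [0, 1] and at least two samples per class.
    import numpy as np

    def relief_weights(X, y, n_iter=100, seed=None):
        """Return one weight per feature; higher weight = more relevant."""
        rng = np.random.default_rng(seed)
        n_samples, n_features = X.shape
        w = np.zeros(n_features)
        for _ in range(n_iter):
            i = rng.integers(n_samples)
            xi, yi = X[i], y[i]
            d = np.abs(X - xi).sum(axis=1)        # distance to every other sample
            same = (y == yi)
            same[i] = False                        # exclude the sample itself
            near_hit = X[np.where(same)[0][np.argmin(d[same])]]
            near_miss = X[np.where(~same)[0][np.argmin(d[~same])]]
            # reward features that differ across classes more than within a class
            w += (np.abs(xi - near_miss) - np.abs(xi - near_hit)) / n_iter
        return w

    # keep only attributes whose weight exceeds a chosen threshold
    # X_reduced = X[:, relief_weights(X, y, seed=0) > 0.01]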
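Likewise, the following sketches how the confidence-weighted training of the auto-associative imputation network described in (2) could look. The network architecture, the masking of missing entries, and the per-sample confidence vector are assumptions made for illustration, not the thesis's exact formulation.

    # Sketch of a confidence-weighted reconstruction loss for an
    # auto-associative (autoencoder) missing-value imputation network.
    import torch
    import torch.nn as nn

    class AutoAssociativeNet(nn.Module):
        def __init__(self, n_features, n_hidden=16):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU())
            self.decoder = nn.Linear(n_hidden, n_features)

        def forward(self, x):
            return self.decoder(self.encoder(x))

    def confidence_weighted_loss(model, x, observed_mask, confidence):
        """x: samples with missing entries filled by an initial guess (e.g. column means);
        observed_mask: 1.0 where a value was observed, 0.0 where it was missing;
        confidence: one weight per sample, lowering the influence of incomplete samples."""
        recon = model(x)
        per_entry = (recon - x) ** 2 * observed_mask               # ignore missing entries
        per_sample = per_entry.sum(dim=1) / observed_mask.sum(dim=1).clamp(min=1.0)
        return (confidence * per_sample).mean()

After training, the missing entries of an incomplete sample would be read off the network's reconstruction of that sample.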
Keywords/Search Tags:Attribute redundancy, Missing data, Relief algorithm, Confidence level, Auto-Associative Neural Network, Data quality