
Pre-processing And Quality Analysis Of Medical Big Data Based On Machine Learning

Posted on: 2024-02-04    Degree: Master    Type: Thesis
Country: China    Candidate: D H Zeng    Full Text: PDF
GTID: 2544307157985399    Subject: Electronic information
Abstract/Summary:
With the booming development of the big data industry and the society-wide promotion of smart healthcare, more and more medical datasets are used to assist medical diagnosis, and the quality of these datasets directly affects diagnostic results. However, modern medical datasets commonly suffer from attribute redundancy, missing data, numerical errors, and duplicate samples, which seriously degrade their quality. Solving these problems requires preprocessing of the original dataset, and a good preprocessing algorithm can substantially improve dataset quality, so the study of data preprocessing algorithms is of great practical importance. Attribute redundancy and missing data are common problems in datasets and are the two main topics studied in this paper. We use machine learning algorithms to preprocess the raw data and draw conclusions by analyzing the quality of the processed datasets. The details are as follows.

(1) Research on and improvement of dimensionality reduction techniques for medical datasets. We first analyze existing big data dimensionality reduction techniques and then propose solutions tailored to the characteristics of medical datasets. A Relief-based dimensionality reduction algorithm is proposed that improves the Relief algorithm's handling of duplicate attributes and its ability to distinguish low-weight attributes (the basic Relief weighting step is sketched below). The improved algorithm is applied to medical data dimensionality reduction and compared with traditional methods such as linear regression, random forest, principal component analysis, and statistical approaches. Experimental results show that, on high-dimensional medical datasets, the improved Relief algorithm reduces computational effort while improving prediction accuracy; sample classification accuracy improves by 8.8 percentage points on average.

(2) Research on and improvement of missing-value filling techniques for medical datasets. Existing data filling algorithms are mostly built on mathematical models of the complete samples only, ignoring the information carried by incomplete samples and thereby lowering data utilization. This paper therefore proposes a multiple-confidence-based missing-value filling method for medical data and evaluates its filling quality (a sketch of the confidence-weighted training loss follows below). In the first stage, the association relationships between attributes are analyzed statistically, and each sample is assigned multiple confidence levels by a correlation algorithm. In the second stage, the transmission path of the auto-associative neural network model is optimized to avoid the trivial self-mapping from input nodes to output nodes, and the loss function is optimized through dynamic selection of confidence levels so that incomplete samples have less influence on parameter optimization during training. In the third stage, the correlation relationships between attributes and the filling accuracy of the dataset are combined to evaluate the filling quality of the dataset. Experimental results show that the neural network model with multiple confidence levels fills missing values more accurately: filling accuracy on categorical attributes increases by 10.7 percentage points on average, the absolute percentage error on continuous attributes decreases by 12.7 percentage points on average, and overall data filling quality improves by 19 percentage points on average.
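To make the dimensionality reduction work in (1) concrete, the following is a minimal sketch of the classic Relief feature-weighting step on which the thesis builds. The variable names, the Manhattan distance, and the selection threshold are illustrative assumptions; the thesis's specific improvements for duplicate attributes and low-weight attributes are not reproduced here.

    # Minimal sketch of classic (binary-class) Relief feature weighting,
    # assuming numeric features scaled to [0, 1] and at least two samples per class.
    import numpy as np

    def relief_weights(X, y, n_iter=100, seed=None):
        """Return one weight per feature; higher weight = more relevant."""
        rng = np.random.default_rng(seed)
        n_samples, n_features = X.shape
        w = np.zeros(n_features)
        for _ in range(n_iter):
            i = rng.integers(n_samples)
            xi, yi = X[i], y[i]
            d = np.abs(X - xi).sum(axis=1)        # distance to every other sample
            same = (y == yi)
            same[i] = False                        # exclude the sample itself
            near_hit = X[np.where(same)[0][np.argmin(d[same])]]
            near_miss = X[np.where(~same)[0][np.argmin(d[~same])]]
            # reward features that differ across classes more than within a class
            w += (np.abs(xi - near_miss) - np.abs(xi - near_hit)) / n_iter
        return w

    # keep only attributes whose weight exceeds a chosen threshold
    # X_reduced = X[:, relief_weights(X, y, seed=0) > 0.01]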
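Likewise, the following sketches how the confidence-weighted training of the auto-associative imputation network described in (2) could look. The network architecture, the masking of missing entries, and the per-sample confidence vector are assumptions made for illustration, not the thesis's exact formulation.

    # Sketch of a confidence-weighted reconstruction loss for an
    # auto-associative (autoencoder) missing-value imputation network.
    import torch
    import torch.nn as nn

    class AutoAssociativeNet(nn.Module):
        def __init__(self, n_features, n_hidden=16):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU())
            self.decoder = nn.Linear(n_hidden, n_features)

        def forward(self, x):
            return self.decoder(self.encoder(x))

    def confidence_weighted_loss(model, x, observed_mask, confidence):
        """x: samples with missing entries filled by an initial guess (e.g. column means);
        observed_mask: 1.0 where a value was observed, 0.0 where it was missing;
        confidence: one weight per sample, lowering the influence of incomplete samples."""
        recon = model(x)
        per_entry = (recon - x) ** 2 * observed_mask               # ignore missing entries
        per_sample = per_entry.sum(dim=1) / observed_mask.sum(dim=1).clamp(min=1.0)
        return (confidence * per_sample).mean()

After training, the missing entries of an incomplete sample would be read off the network's reconstruction of that sample.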
Keywords/Search Tags:Attribute redundancy, Missing data, Relief algorithm, Confidence level, Auto-Associative Neural Network, Data quality