Prediction Of Diabetic Complications Based On Unbalanced Data

Posted on:2022-03-15

Degree:Master

Type:Thesis

Country:China

Candidate:L Guo

Full Text:PDF

GTID:2494306722968149

Subject:Software engineering

Abstract/Summary:

Unbalanced distributions of positive and negative samples shift the classification boundary and leave the trained model under-fitted on the minority class. To address both problems, this thesis proposes an oversampling technique based on conditional entropy and TF-IDF (HTTE) and a label text classification algorithm based on BERT and convolutional neural networks (CNN). HTTE uses information theory to calculate the conditional entropy of the labels for each feature combination, fuses the result with the TF-IDF value so that the data characteristics are retained and the amount of information is captured, and then creates new minority-class samples according to the fused value. The label text classification algorithm first uses the BERT model to generate text vectors and then uses a CNN, connected hierarchically, to encode the vectorized sequence.

The experiments use a data set of 37 biochemical tests for 5694 diabetes patients provided by the national clinical science data center. The text classification of complications was carried out with the fused BERT and CNN model, the model's results were manually corrected, and the imbalanced data set was divided into training, validation and test sets. The HTTE oversampling method was then applied to the minority-class samples in the training set so that the numbers of positive and negative samples were balanced. Finally, a random forest ensemble learning model was used to classify and predict the possible complications of diabetic patients.

The results show that the training and validation accuracy of the label text classification algorithm based on BERT and CNN are 0.979 and 0.921, respectively. Accuracy, the area under the ROC curve (AUC) and the area under the PR curve are used as evaluation indexes. For the proposed HTTE oversampling method combined with the random forest ensemble learning model, the three indexes are 0.976, 0.987 and 0.959, respectively (mean over 11 complications). The proposed method provides a technical reference for research on unbalanced data and can assist doctors in clinical diagnosis, improving its accuracy and speed. The thesis contains 20 figures, 14 tables and 60 references.
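
The abstract describes HTTE as calculating the conditional entropy of the labels for each feature combination. The sketch below illustrates only that one ingredient, the standard quantity H(Y | X) over discrete observations. It is not the author's HTTE implementation: the abstract does not specify how the entropy is fused with TF-IDF or how new samples are generated, so neither step is attempted here. The function name `conditional_entropy` and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def conditional_entropy(features, labels):
    """Return H(Y | X) in bits for paired discrete observations.

    `features` holds one hashable value per sample (for example a tuple
    of binned test results, standing in for a "feature combination");
    `labels` holds the class label of the same sample.
    """
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    n = len(labels)
    if n == 0:
        raise ValueError("at least one sample is required")

    joint = Counter(zip(features, labels))  # counts of (x, y) pairs
    marginal = Counter(features)            # counts of x alone

    entropy = 0.0
    for (x, _y), count in joint.items():
        p_xy = count / n                    # p(x, y)
        p_y_given_x = count / marginal[x]   # p(y | x)
        entropy -= p_xy * math.log2(p_y_given_x)
    return entropy


# Hypothetical toy data: two binned features per patient, binary label.
combos = [("high", "low"), ("high", "low"), ("low", "low"), ("low", "low")]
has_complication = [0, 1, 0, 0]
print(conditional_entropy(combos, has_complication))  # 0.5
```

A value of 0 means the feature combination fully determines the label, while larger values mean the label remains uncertain once the combination is known. How HTTE turns this quantity into a sampling decision is not stated in the abstract.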

Keywords/Search Tags:

Unbalanced data, oversampling method, BERT model, text classification, prediction of diabetic complications

Related items

1	A Prediction Model With Machine Learning In Cardiovascular Disease Risk
2	Automatic Ultrasonic Image Classification For Small Sample And Unbalanced Data
3	Research On Nosocomial Infectious Prediction And Unbalanced Classification Based On Active Learning And Generative Adversarial Networks
4	Design And Implementation Of Intelligent Diagnosis Guidance System Based On Deep Learning
5	Research On Short Text Classification Algorithm Of Obstetric Electronic Medical Record Based On BERT And CNN
6	Survival Prediction Analysis Of Breast Cancer Patients Oriented To Unbalanced Data
7	Research On Classification Method Based On Acupuncture Text Data
8	Research On The Application Of Drug Efficacy Prediction And Content Recommendation Based On Medical Text Data
9	Classification Of Acupuncture Points In Chinese Medicine Based On Bert Model
10	Research On Data Mining Method Of Diabetes Risk Based On Electronic Medical Record Analysis