| To address the classification boundary deviation caused by the unbalanced distribution of positive and negative samples, and the insufficient learning of minority-class samples by the trained model, this paper proposes an oversampling technique based on conditional entropy and TF-IDF (HTTE) and a label text classification algorithm based on BERT and Convolutional Neural Networks (CNN). HTTE uses information theory to calculate the conditional entropy of the labels for each feature combination, fuses it with the TF-IDF value to retain the data characteristics and obtain the amount of information, and then creates new minority-class samples according to the resulting value. The label text classification algorithm first uses the BERT model to generate text vectors and then passes them to a hierarchically connected CNN model that encodes the vectorized sequence. Using a data set of 37 biochemical tests from 5,694 diabetes patients provided by the national clinical science data center, the text classification of complications was carried out with the fused BERT and CNN model. The model outputs were manually corrected, and the imbalanced data set was divided into training, validation and test sets. The HTTE oversampling method was then applied to the minority-class samples in the training set so that the numbers of positive and negative samples were balanced. Finally, a random forest ensemble learning model was used to classify and predict the possible complications of diabetic patients. The results show that the training accuracy and validation accuracy of the label text classification algorithm based on BERT and CNN are 0.979 and 0.921, respectively. Accuracy, the area under the ROC curve (AUC) and the area under the PR curve are used as evaluation indexes. With the proposed HTTE oversampling method combined with the random forest ensemble learning model, the three indexes are 0.976, 0.987 and 0.959, respectively (mean over 11 complications). The proposed method provides a technical reference for research on unbalanced data and can assist doctors in clinical diagnosis, improving its accuracy and speed. This paper has 20 figures, 14 tables and 60 references. |
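
The conditional entropy that HTTE starts from can be illustrated with a minimal sketch. It computes H(label | feature) in bits for one discrete feature. The abstract does not specify how this value is fused with TF-IDF or how new minority-class samples are generated, so those steps are not reproduced here. The function name `conditional_entropy` and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def conditional_entropy(features, labels):
    """Return H(Y | X) in bits for paired discrete features X and labels Y."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    n = len(labels)
    if n == 0:
        raise ValueError("at least one sample is required")

    # Count how often each label occurs for each feature value.
    by_feature = defaultdict(Counter)
    for x, y in zip(features, labels):
        by_feature[x][y] += 1

    h = 0.0
    for label_counts in by_feature.values():
        n_x = sum(label_counts.values())
        p_x = n_x / n
        # Entropy of the label distribution within this feature value.
        h_y_given_x = -sum(
            (c / n_x) * math.log2(c / n_x) for c in label_counts.values()
        )
        # Weight by how often this feature value occurs.
        h += p_x * h_y_given_x
    return h


# Hypothetical toy data: a binned test result and a binary complication label.
features = ["high", "high", "low", "low"]
labels = [1, 1, 0, 1]
print(conditional_entropy(features, labels))
```

A lower value means the feature value leaves less uncertainty about the label. In this toy data the "high" bin determines the label completely, while the "low" bin is split evenly between the two labels.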