
The Rare Diseases-based Prediction Model With Small-sample And Imbalanced-data

Posted on: 2024-05-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Wang
Full Text: PDF
GTID: 1524306938957929
Subject: Epidemiology and Health Statistics
Abstract/Summary:
Backgrounds: Precision medicine research is of great significance for the early diagnosis and treatment of rare diseases, and electronic medical records are an important resource for diagnosing them. Building clinical decision-support diagnostic models from Chinese electronic medical records and artificial intelligence algorithms has become an important direction in rare-disease precision medicine. Previous research has faced three main problems: (1) low accuracy of text structuring; (2) small sample size; (3) imbalanced classes. Idiopathic sporadic ataxia (ISA), an extremely rare disease, refers to a group of diseases of unknown etiology that present as progressive cerebellar syndrome. There are probably three main causes: early multiple system atrophy with predominant cerebellar features (MSA-C), sporadic rare hereditary ataxia (HA), and primary autoimmune cerebellar ataxia (PACA), which is autoimmune cerebellar ataxia with unknown antibodies. PACA is a curable cause in a small number of ISA patients. Early diagnosis of PACA, or early identification of the autoimmune (etiological) risk (hereinafter "immune risk") in ISA patients, not only supports early diagnosis and treatment and so improves patient prognosis, but also facilitates the discovery of novel anti-neuronal antibodies. In this study, an immune risk prediction model for patients with idiopathic sporadic ataxia was constructed to propose solutions to the three problems above.

Methods: First, an "ataxia" corpus was constructed independently, and a transfer conditional random field (CRF) algorithm was used to build a Chinese named entity recognition model for structuring the medical records. The accuracy of phenotypes and their times of occurrence was ensured by rule formulation and manual correction. Second, disease-phenotype networks for different onset durations (≤6 months, 7-12 months, 13-24 months, and more than 24 months) were plotted for MSA-C patients and autoimmune-mediated cerebellar ataxia (IMCA) patients to explore the natural disease course. The frequency threshold for including variables was varied (① all variables included; ② variables with frequency >1%; ③ variables with frequency >10%; ④ only symptom phenotypes included), and four supervised and unsupervised variable screening methods were used (random forest (RF), LASSO, elastic net, and Laplacian score (LS)) to explore the influence of data "sparsity" on variable screening and modeling. Finally, class-balancing methods (Tomek links, SMOTE, SMOTE-Tomek) were combined with ensemble learning algorithms (Stacking, RF, AdaBoost) to build supervised, semi-supervised, and meta-learning models.

Results: Including the self-constructed "ataxia" corpus significantly improved the performance of the Chinese named entity recognition model, but rule formulation and manual correction were still needed to complete the structuring. According to the phenotype networks, the phenotypes of patients in the two groups were similar. The number of phenotypes in the MSA-C network increased with disease duration, whereas in the IMCA network there were more symptoms at the beginning of the disease course, most of which were relieved later. In variable screening, the variable-importance ranking given by RF (Gini coefficient) was opposite to that given by the LS value, while the regression coefficients of LASSO and elastic net showed the same trend. The following 12 variables are included in each model: sleep behavior disorder, hyperlipidemia, tremor, neurogenic damage on electromyography, head trauma, weakened limb muscle strength, alcohol consumption, dysarthria, age of onset, positive finger-to-nose test, elevated cerebrospinal fluid white blood cells, brain stem atrophy, cerebellar atrophy, rapid disease progression, postural hypotension, abnormal gait, constipation, frequent urination, positive serum and/or cerebrospinal fluid anti-neuronal antibodies, disease course, dizziness, weakness, weight loss, sex, insomnia, abnormal thyroid function, positive Babinski sign, and positive Hoffmann sign. The AUC of the phenotype-based Stacking ensemble model was 87.9%, rising to 96% after neurological examination findings were added. The semi-supervised CReST model constructed with 10% and 30% of ISA samples performed better than the supervised and meta-learning models.

Conclusions: Including the "ataxia" corpus clearly improves the performance of the Chinese named entity recognition model, and combining rule formulation with manual correction ensures the accuracy of recognition. The symptom networks improve understanding of the natural course of the two diseases. The semi-supervised model framework, built on the symptom-based Stacking ensemble learning algorithm in combination with 10% and 30% unlabeled samples, assists in predicting patients' immune risk.
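The abstract names Tomek links and SMOTE as its class-balancing methods but does not describe how they were implemented. The following is only a minimal pure-Python sketch of the two general techniques, not the dissertation's code. The function names, the toy coordinates, and the parameters `k`, `n_new`, and `seed` are all assumptions made for illustration. A Tomek link is a pair of mutual nearest neighbours with different class labels. SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours.

```python
import math
import random

def nearest_neighbor(i, X):
    """Index of the point closest to X[i], excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))

def tomek_links(X, y):
    """Pairs (i, j), i < j, that are mutual nearest neighbours with different labels."""
    nn = [nearest_neighbor(i, X) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

def smote(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic minority points by interpolating between a random
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(X_min))
        neighbours = sorted((j for j in range(len(X_min)) if j != i),
                            key=lambda j: math.dist(X_min[i], X_min[j]))[:k]
        j = rng.choice(neighbours)
        gap = rng.random()  # position along the segment between the two points
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(X_min[i], X_min[j])))
    return synthetic

# Hypothetical toy data: label 0 = majority class, label 1 = minority class.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (1.2, 1.1), (4, 4), (4, 5), (5, 4)]
y = [0, 0, 0, 0, 1, 1, 1, 1]

print(tomek_links(X, y))  # [(3, 4)]: the only cross-class mutual-neighbour pair
minority = [p for p, label in zip(X, y) if label == 1]
print(len(smote(minority, n_new=3)))  # 3 synthetic minority points
```

In a SMOTE-Tomek style pipeline the two steps are typically chained: oversample the minority class first, then remove the samples that form Tomek links, so that the class boundary is cleaned after balancing.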
Keywords/Search Tags: few-shot learning, imbalanced learning, disease-phenotype complex network, idiopathic sporadic ataxia, autoimmune-mediated cerebellar ataxia