Research On The Feature Trend Clustering Method In Allergen Discrimination Algorithm And Disease Endotype Analysis

Posted on:2020-10-25

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Y Y Huang

Full Text:PDF

GTID:1484306101499114

Subject:Internal Medicine

Abstract/Summary:

Part Ⅰ: Allergen database ALLERGENIA 2.0

OBJECTIVE: To construct an allergen database for allergen consulting and discrimination. An accurate training data set underpins the accuracy and generalization performance of a discrimination algorithm. Many existing common allergen databases have obvious quality defects, which harm both the accuracy of the databases and their users. It is therefore imperative to integrate an allergen database that is as accurate and complete as possible and free of redundancy.

METHODS: (1) ALLERGENIA, COMPARE and ALLERGENONLINE were analyzed for allergen overlap. (2) Four tools, ALLERGENFP, ALLERTOP, ALLERMATCH and SORTALLER, were used to discriminate allergens outside the three allergen databases. (3) Proteins not identified as allergens by all four tools were selected for further manual verification, after excluding sequences that overlapped the three databases with more than 70% identity and 95% length similarity.

RESULTS: (1) The three allergen databases shared 1233 allergens. ALLERGENIA had 790 unique allergens, ALLERGENONLINE 84 and COMPARE 44. Pairwise, 6 allergens were shared between ALLERGENIA and COMPARE, 13 between ALLERGENIA and ALLERGENONLINE, and 747 between COMPARE and ALLERGENONLINE. (2) 395 sequences required manual validation, of which 125 were validated on the basis of the literature. (3) In total, 2647 allergens were assembled to build ALLERGENIA 2.0.

Part Ⅱ: Allergen discrimination algorithm SORTALLER 2.0

OBJECTIVE: Besides an allergen database, allergen classification and discrimination requires an appropriate algorithm. Existing common allergen discrimination software relies on hand-crafted empirical rules, which makes its predictions inevitably limited and hard to control. In addition, these algorithms use low-level sequence features that carry a great deal of sequence noise and invalid information; this blurs the effective features and leads to a high error rate and poor generalization. It is therefore necessary to develop feature engineering that fully reflects the properties and functions of allergens, together with an allergen discrimination algorithm that generalizes well.

METHODS: (1) Allergen characteristic peptides (AFFPs) were produced by high-efficiency allergen protein characterization engineering. (2) Based on the co-fluctuation principle of the feature plane, AFFPs with the same dimension, similar function and the same fluctuation trend were clustered into AFFP Modules. Stable Modules were screened according to the number of allergens and AFFPs each Module contained. (3) The SORTALLER 2.0 algorithm was developed using the enrichment and distribution of allergen proteins across the AFFP Modules, together with their multi-target characteristics, as features for training the allergen discriminator.

RESULTS: (1) SORTALLER 2.0 is more accurate than other existing allergen recognition software (SORTALLER 1.0, ALLERTOP, ALLERGENFP, ALLERMATCH), with significant advantages in sensitivity (true positive rate), specificity (true negative rate), accuracy and Matthews correlation coefficient (MCC). (2) Thanks to the AFFP Modules, SORTALLER 2.0 has strong generalization performance that other allergen prediction software lacks. It relies less on the data already in the database and is suitable for long-term use.

Part Ⅲ: Endotype analysis software LESSGEN

OBJECTIVE: WGCNA (weighted gene co-expression network analysis) is an effective and widely used method for analyzing gene chip data. Because WGCNA relies heavily on correlation, the gene module with the highest correlation is not necessarily the key module for distinguishing the genetic characteristics of the target population. Moreover, sometimes there is no significant correlation between traits and modules, or the correlations of two modules are very close. Both cases show the limits of classical WGCNA in screening important modules and analyzing disease endotypes. New methods and tools are required to solve this problem.

METHODS: (1) Recursive feature elimination was used to screen out disease-related feature genes, and random features were extracted repeatedly to construct a random forest. Normal and abnormal sample distributions were calculated by cross-validated density estimation, and abnormal samples were excluded. (2) WGCNA was used to construct the co-expression network of the characteristic genes and to cluster characteristic genes with the same expression trend into gene modules, on which functional enrichment analysis was conducted. Correlation analysis was used to screen out gene modules closely related to the disease type of the samples. (3) A recursive-feature random forest method was used to identify, from the gene expression data, the combination of characteristic genes most strongly associated with the disease endotype. The pathway changes on which the endotype depends most were then calculated.

RESULTS: (1) An endotype analysis software, LESSGEN, was developed. The application is friendly to researchers and clinicians: it requires no programming background or additional learning cost, and lets users submit transcriptome data in a web browser to obtain visualized results and conclusions for endotype analysis of the disease. (2) The software calculates the disease-related gene modules and the corresponding disease pathways. (3) By default, the software provides the optimal combination of feature genes for constructing the recursive-feature random forest model; users can also choose the combination parameters for different feature genes themselves. (4) The software screens out the characteristic genes most closely related to the disease endotype. (5) The software gives the interaction network of the disease endotype characteristic genes.

Part Ⅳ: Risk analysis of disease endotype modules and research on a precision medical evaluation method for patients

OBJECTIVE: WGCNA uses linear correlation measures or monotone dependency measures to describe relationships in biological networks. In fact, only some of the relationships between genes in biological systems are linear or monotone; most are non-linear. Research that assumes linearity hinders the accurate acquisition of network information and the identification of reasonable gene modules. In addition, WGCNA often uses principal component analysis (PCA) to extract important module characteristic genes. However, PCA considers only the variance of the data, and variance alone may not adequately reflect actual expression. Many questions therefore remain about how to translate expression information into appropriate biological understanding. New methods and tools are needed to solve this problem.

METHODS: (1) The co-expression subnetworks of the disease endotype-dependent modules were constructed as follows. A. Recursive feature elimination was used to screen disease-related feature genes, and cross-validation was used to extract random features to construct the random forest. All data were passed through the set of trees. The normal and abnormal sample distributions were calculated through random forest density estimation, and sample outlier scores were calculated to exclude abnormal samples. B. By combining MI-based (mutual information) estimation of non-linear gene association with co-expression network clustering, the non-linear subnetworks of the disease endotype correlation modules were constructed. (2) The disease endotype reference network was constructed. A. On the whole gene expression microarray data, the random forest method was used for iterative feature selection to obtain the best-performing combination of feature genes for the disease endotype model and to construct the disease endotype reference network. B. Transcription factor and microRNA enrichment in the Molecular Signatures Database (MSigDB) was used to construct the disease endotype reference regulatory network. (3) A local topology alignment method based on network motifs was used to compare each disease module subnetwork with the disease reference network and with the disease reference regulatory network. The risk score of each disease endotype module was calculated. (4) By analyzing the module disease risk of each patient, patient heterogeneity was assessed in terms of modular pathogenicity, and a precision medical evaluation was made for each patient.

RESULTS: (1) A local topological comparison method based on network motifs was established between the disease endotype module subnetworks and the disease endotype reference network. It is used to evaluate the pathogenicity of the disease endotype modules and to make precision medical evaluations of patients. (2) Taking asthma as an example, the study established disease endotype module subnetworks and the disease endotype reference network based on asthma severity. (3) Different modules were found to have different pathogenicity in asthma. (4) Patients with different asthma severity have different sensitivity to different modules and different preferences for module combinations, which leads to different disease characteristics.
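The Part Ⅳ argument that linear correlation can miss non-linear gene relationships, while a mutual-information (MI) measure can capture them, can be illustrated with a small sketch. The abstract does not say which MI estimator the thesis uses; the equal-width histogram estimator below, the bin count, and the function names `pearson` and `mutual_information` are assumptions made only for illustration, and the data are synthetic rather than gene expression values.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / math.sqrt(vx * vy)


def bin_index(value, lo, hi, bins):
    """Map a value to an equal-width bin index in [0, bins - 1]."""
    if hi == lo:
        return 0
    return min(int((value - lo) / (hi - lo) * bins), bins - 1)


def mutual_information(xs, ys, bins=10):
    """Histogram-based MI estimate in nats (assumed estimator, not the thesis's)."""
    n = len(xs)
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    pairs = [
        (bin_index(a, x_lo, x_hi, bins), bin_index(b, y_lo, y_hi, bins))
        for a, b in zip(xs, ys)
    ]
    joint = Counter(pairs)                 # joint bin counts
    px = Counter(i for i, _ in pairs)      # marginal counts for x bins
    py = Counter(j for _, j in pairs)      # marginal counts for y bins
    # sum over occupied cells of p(x,y) * log(p(x,y) / (p(x) * p(y)))
    return sum(
        (c / n) * math.log(c * n / (px[i] * py[j]))
        for (i, j), c in joint.items()
    )


# Synthetic data: a symmetric grid on [-1, 1].
xs = [(i - 500) / 500 for i in range(1001)]
ys_quadratic = [x * x for x in xs]         # non-linear, non-monotone dependence
ys_linear = [2.0 * x + 1.0 for x in xs]    # purely linear dependence

r_quad = pearson(xs, ys_quadratic)
mi_quad = mutual_information(xs, ys_quadratic)
r_lin = pearson(xs, ys_linear)

print(f"quadratic: pearson={r_quad:.3f}, MI={mi_quad:.3f} nats")
print(f"linear:    pearson={r_lin:.3f}")
```

For the quadratic relationship the Pearson coefficient is essentially zero even though one variable fully determines the other, whereas the MI estimate is clearly positive. Histogram MI estimates depend on the number of bins and on sample size, so any real analysis would need to choose and validate an estimator for the data at hand.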

Keywords/Search Tags:

allergen, database, machine learning, accuracy, manual verification, discriminant algorithm, generalization performance, disease endotype, feature gene, WGCNA, analysis software, random forest

Related items

1	Research And Application Of Auxiliary Diagnosis Algorithm For Chronic Kidney Disease Based On Machine Learning
2	Drug-Target Interaction Prediction Based On Machine Learning
3	Analysis Of Cancer Gene Data Based On Random Forest And Support Vector Machine
4	Research And Design Of Disease-aided Diagnosis Software Based On Improved RF-LR Algorithm
5	Research On Risk Prediction Of Diabetes Based On Random Forest And Support Vector
6	Research On ECG Signal Processing Method Based On Machine Learning
7	Research On The Image Classification Of Brain Glioma Based On Machine Learning
8	Selection Of Tb Susceptible Genes Based On Improved Random Forest Algorithm
9	Construction And Evaluation Of Antenatal Depression Risk Prediction Model Based On Random Forest Algorithm
10	Prediction And Analysis Of Cancer Synthetic Lethal Gene Pairs Based On Machine Learning And Statistical Inference