
Research On Feature Modeling Method Of Survival Prognosis And Tumor Staging Of Lung Adenocarcinoma Based On Machine Learning

Posted on: 2022-08-15
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhao
Full Text: PDF
GTID: 2504306332457954
Subject: Software engineering
Abstract/Summary:
The morbidity and mortality of lung cancer are the highest in the world, and the disease has diverse types and a complex pathogenesis. Lung cancer is still characterized by poor prognosis and a low survival rate, and early screening remains difficult. With the rapid development of gene chip technology and data mining technology, a growing amount of high-throughput omics data has been applied to cancer research. In this context, it is vital to use omics data to search for biomarkers highly related to lung cancer, since such biomarkers can guide early screening and targeted therapy of lung cancer patients and help explore the pathogenesis of the disease. However, most omics data are high-dimensional and noisy, with few samples. Finding biomarkers that are highly related to prognosis or to the degree of disease progression in such high-dimensional data is currently a challenge. To address this problem, this thesis proposes two feature modeling algorithms based on lung adenocarcinoma data.

The first algorithm is a feature selection algorithm based on transcriptome data that helps predict patient survival. The goal is to predict whether a patient will live longer than 3 years after treatment. First, the data were downloaded from the TCGA database, then cleaned and standardized. A fold-change test and Student's t-test were used for preliminary gene filtering, after which the SFMC method was used to select a higher-quality subset. Next, a gene correlation regulation network was introduced into the recursive feature elimination algorithm to adjust the feature weights, and this modified algorithm was used to select a more suitable gene set. This yielded a subset of 48 features with the best performance. Finally, an improved dynamic-update SFFS algorithm was applied to remove redundant features. The resulting feature set has an AUC of 0.98 and an ACC of 0.92 with only 45 genes. A linear support vector machine was selected as the prognostic model. Functional analysis, pathway analysis, survival analysis and target-gene regulation analysis were then carried out, and the results indicated that the biomarker set can both guide clinical treatment and help reveal the complex pathogenesis of lung cancer.

The second algorithm is a multi-step feature modeling algorithm based on multi-omics data for predicting tumor stage. Three data sets from patients with lung adenocarcinoma were used: the transcriptome data set, the methylation data set, and the fusion of the two. The algorithm takes into account the continuity and order of tumor stages, and it combines a regression algorithm with a classification algorithm. The process is as follows. First, the multi-class problem is divided into six binary classification problems, and for each of them the L1 regularization method is used to select the features whose sparse coefficients are non-zero after training. Then the SFMC method is carried out based on a regression model, with each subset assessed on both classification and regression performance at the same time, using R² and ACC as evaluation indexes. Next, the six best feature sets selected in the previous step are merged, and the recursive feature elimination algorithm is applied for subset selection. After the optimal feature set is obtained, the SBS algorithm is used to remove redundant features. The results showed that the algorithm can obtain a set of biomarkers closely related to tumor stage from tens of thousands of features. On the multi-omics integration data, the best subset contains 182 features, and with a logistic regression model the ACC is 0.925, the BACC is 0.86, and the Kappa is 0.80. On the transcriptome data, the subset contains 157 features, with an ACC of 0.924, a BACC of 0.79, and a Kappa of 0.75. On the methylation data, the subset contains 128 features, with an ACC of 0.9956, a BACC of 0.9708, and a Kappa of 0.9665. In addition, mutation analysis and pathway analysis were carried out on the methylation biomarkers, which confirmed that the biomarkers are involved in the complex biological processes of lung adenocarcinoma development and therefore have biological and clinical significance.

Both algorithms are used to screen markers and predict important prognostic indicators from gene expression data. In both tasks, the prediction target changes gradually as the disease worsens and is affected by multiple genes that promote or inhibit one another. Both algorithms achieved good performance in their respective tasks, which indicates that the performance of biomarker selection algorithms can be improved by making full use of the potential interactions between genes and by taking into account the continuity of the predicted indicators.
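As an illustration of the preliminary gene filtering step in the first algorithm, the sketch below combines a log2 fold-change cutoff with a two-sample Student's t statistic. It is a minimal sketch under stated assumptions, not the thesis implementation: the gene names, the toy expression values, the thresholds `min_abs_log2fc` and `min_abs_t`, and the use of a cutoff on the t statistic instead of a p-value are all hypothetical choices made for the example.

```python
import math
from statistics import mean, variance


def student_t(a, b):
    """Two-sample Student's t statistic with pooled variance.

    Assumes equal variances and at least two values per group.
    """
    na, nb = len(a), len(b)
    diff = mean(a) - mean(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled == 0:
        # No within-group spread: the statistic is 0 or unbounded.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(pooled * (1 / na + 1 / nb))


def log2_fold_change(a, b, eps=1e-9):
    """log2 ratio of group means; eps guards against division by zero."""
    return math.log2((mean(a) + eps) / (mean(b) + eps))


def prefilter(expr, labels, min_abs_log2fc=1.0, min_abs_t=2.0):
    """Keep genes that pass both the fold-change and the t-statistic cutoff.

    expr:   dict mapping gene name -> list of expression values per sample
    labels: list of 0/1 class labels, one per sample
    The two thresholds are illustrative assumptions, not values from the thesis.
    """
    kept = []
    for gene, values in expr.items():
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        if (abs(log2_fold_change(pos, neg)) >= min_abs_log2fc
                and abs(student_t(pos, neg)) >= min_abs_t):
            kept.append(gene)
    return kept


# Hypothetical toy data: 3 samples labelled 1, then 3 samples labelled 0.
labels = [1, 1, 1, 0, 0, 0]
expr = {
    "GENE_A": [8.0, 9.0, 10.0, 2.0, 3.0, 2.5],  # large, consistent difference
    "GENE_B": [5.0, 5.2, 4.9, 5.1, 5.0, 4.8],   # essentially unchanged
    "GENE_C": [1.0, 12.0, 2.0, 1.0, 1.5, 1.2],  # large fold change, but noisy
}
selected = prefilter(expr, labels)
print(selected)
```

Requiring both criteria means a gene such as the hypothetical `GENE_C`, whose mean shifts only because of a single outlying sample, is dropped even though its fold change is large. In practice the t-test would normally be applied through a p-value with multiple-testing correction, which this sketch leaves out.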
Keywords/Search Tags: Lung adenocarcinoma, prognosis, feature selection, machine learning, multi-omics data