Research And Development Of Dimension Reduction And Prediction Algorithms For Survival Time Of Female Breast Cancer Patients

Posted on: 2020-01-12
Degree: Master
Type: Thesis
Country: China
Candidate: S Liu
GTID: 2404330575977331
Subject: Computer technology
Abstract/Summary:
Breast cancer is a malignant tumor of the mammary epithelial tissue and a major disease among women. The breast is not an organ essential for sustaining life, and breast cancer in situ does not directly cause death. However, once cancer cells lose their normal cellular characteristics, they spread, and cancer that has spread throughout the body can be life-threatening. DNA methylation is an epigenetic modification that attaches a methyl group to genomic CpG sites. It regulates various biological processes, such as protein interaction, DNA stability, DNA conformation, chromatin structure changes, and gene expression. There are three major types of methylation sequencing: bisulfite sequencing, restriction endonuclease-based sequencing, and targeted enrichment sequencing of methylation sites. With the development of high-throughput sequencing technology, large-scale acquisition of methylation data has become feasible, and using DNA methylation data to diagnose breast cancer has become a very effective method in modern medicine.

However, DNA methylation data has a very high feature dimension, and because genetic testing is expensive, the number of samples is relatively small. This is the so-called "large p, small n" setting. Fitting a model directly on the original data may lead to over-fitting: the model performs relatively well on the training set but very poorly on the test set, that is, it generalizes poorly. In addition, obtaining data for all DNA methylation sites requires measuring all genes, which is costly. This thesis therefore focuses on applying feature selection to DNA methylation data in order to predict patient survival time. Feature selection reduces the risk of over-fitting and also reduces the cost of detection.

This thesis builds its feature selection procedure from three kinds of methods. First, single-feature (filter) methods, such as the t-test, variance, and the Pearson correlation coefficient, are used to select features. Second, wrapper methods such as recursive feature elimination (RFE) are used. Finally, embedded methods such as Lasso and ridge regression are used. In addition, this thesis proposes a novel feature selection algorithm that works in two steps. It first predicts whether the patient will die within five years, which is a binary classification problem. Then, based on that classification result, it performs regression to predict the specific survival time. The algorithm selects the DNA methylation sites that affect patient survival time, and survival time is predicted from these sites. From the modeling perspective, the resulting model can be used to predict patient survival time. From the bioinformatics perspective, the selected DNA methylation sites can undergo biological functional analysis to identify factors that influence patient survival.
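The single-feature selection step named in the abstract (variance and the Pearson correlation coefficient) can be illustrated with a minimal sketch. This is not the thesis's implementation. The toy data, the function names, the variance threshold, the choice of k, and the use of 60 months as the five-year cut-off are all assumptions made for illustration, and the wrapper (RFE) and embedded (Lasso, ridge regression) methods are not shown. The sketch ranks sites against two targets, the survival time itself and a binary "died within five years" label, which correspond to the regression and classification targets the abstract mentions.

```python
import random
import statistics


def column(X, j):
    """Return feature j across all samples."""
    return [row[j] for row in X]


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def variance_filter(X, min_var):
    """Keep the indices of features whose variance is at least min_var."""
    return [j for j in range(len(X[0]))
            if statistics.pvariance(column(X, j)) >= min_var]


def top_k_by_correlation(X, y, candidates, k):
    """Rank candidate features by |Pearson r| with y and keep the top k."""
    ranked = sorted(candidates,
                    key=lambda j: abs(pearson(column(X, j), y)),
                    reverse=True)
    return ranked[:k]


def make_toy_data(n=40, p=200, seed=0):
    """Hypothetical 'large p, small n' data: n patients, p methylation sites."""
    rng = random.Random(seed)
    X, months = [], []
    for _ in range(n):
        row = [rng.random() for _ in range(p)]  # made-up levels in [0, 1]
        row[5] = 0.5                            # a site that never varies
        # Assumption: survival (in months) depends only on sites 0 and 3.
        months.append(20 + 60 * row[0] + 40 * row[3] + rng.gauss(0, 2))
        X.append(row)
    return X, months


X, months = make_toy_data()

# Step 1: drop sites that carry no variation at all.
kept = variance_filter(X, min_var=1e-3)

# Step 2a: rank the remaining sites against survival time (regression target).
sites_for_time = top_k_by_correlation(X, months, kept, k=3)

# Step 2b: rank against a binary label (classification target).
# Assumption: "within five years" is taken as fewer than 60 months.
died_within_5y = [1 if m < 60 else 0 for m in months]
sites_for_label = top_k_by_correlation(X, died_within_5y, kept, k=3)

print("sites kept after variance filter:", len(kept))
print("top sites for survival time:", sites_for_time)
print("top sites for five-year label:", sites_for_label)
```

With far more sites than patients, some uninformative sites will correlate with the target by chance, so a ranking like this is only a first screening step and says nothing about how the selected sites generalize to unseen patients.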
Keywords/Search Tags: bio-information, feature selection, DNA methylation, breast cancer, survival time