
Study On Characteristic Genes Of Pancreatic Cancer Classification Based On Multiple Data Sets

Posted on: 2021-02-19
Degree: Master
Type: Thesis
Country: China
Candidate: Y M Wang
GTID: 2404330611462876
Subject: Applied statistics
Abstract/Summary:
With economic development, the number of patients with cancer and other diseases has increased dramatically since the start of the 21st century, posing a constant challenge to our country's medical capabilities. As the outbreak of 2019-nCoV showed, diagnosing and treating cases requires continual exploration of suitable methods. Cancers and other diseases that cannot be diagnosed from apparent symptoms, or that are still at an early stage, can be diagnosed and treated with the help of gene sequencing and gene expression data. With the rapid development of gene chip technology, more and more gene expression data are publicly available, so exploring cancer and disease diagnosis through these data is increasingly important. However, current research on cancer gene expression data is largely devoted to proposing better methods based on a small number of samples, which ignores the universality and uniqueness of the samples and makes the results less convincing. This thesis therefore studies four pancreatic cancer gene expression data sets and uses new samples to test the experimental results. In addition, to account for the similarity between genes and to find a more comprehensive set of differential genes for pancreatic cancer classification, genes are grouped by fuzzy clustering analysis, a soft clustering method, rather than by one of the many hard clustering methods.

The four pancreatic cancer gene expression data sets were downloaded from the GEO public database. The empirical Bayes method of the limma package in R was applied to each data set separately to select differentially expressed genes, and the intersection of the four resulting gene sets, 73 genes in total, was taken as the basis of the subsequent research. For each data set, the expression matrix of these 73 differentially expressed genes was extracted, and the four matrices were merged by the median integration method into a single expression matrix of 202 samples. Fuzzy cluster analysis grouped the 73 genes into 5 categories, and the absolute distance method was used to extract the central gene of each category, yielding 5 characteristic genes. Finally, logistic regression with leave-one-out cross-validation was used to identify the samples of the four gene expression data sets. Classification effectiveness was evaluated by drawing ROC curves and computing AUC values together with the accuracy and specificity derived from the confusion matrix.

The results showed that the five information genes correctly classified more than 80% of the samples in the four data sets. To further verify the experimental results, an external pancreatic cancer gene expression data set was used for sample identification, and the classification accuracy was 88.46%. Relevant studies have shown that three of the five genes are related to pancreatic cancer and the other two are closely related to cancer. The information gene selection method of this thesis is therefore effective, and the selected information genes can provide guidance for the diagnosis of pancreatic cancer.
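The clustering and central-gene step can be illustrated with a minimal Python sketch. The abstract does not specify the algorithm settings, so everything here is an assumption: standard fuzzy c-means with fuzzifier m = 2 and Euclidean distance, "absolute distance" read as the Manhattan distance between a gene's expression profile and its cluster center, and invented toy profiles (`a1` ... `b3`) in place of the 73 genes. The function names are hypothetical and are not taken from the thesis.

```python
import math
import random


def euclid(p, q):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=300, tol=1e-8, seed=0):
    """Standard fuzzy c-means (assumed variant); returns (centers, memberships)."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Start from randomly chosen data points as provisional centers.
    centers = [list(p) for p in rng.sample(points, n_clusters)]
    power = 2.0 / (m - 1.0)
    u = []
    for _ in range(n_iter):
        # Soft membership of each point in each cluster; every row sums to 1.
        u = []
        for p in points:
            dist = [max(euclid(p, c), 1e-12) for c in centers]
            u.append([1.0 / sum((dk / dj) ** power for dj in dist) for dk in dist])
        # Centers become means weighted by membership ** m.
        new_centers = []
        for k in range(n_clusters):
            w = [row[k] ** m for row in u]
            total = sum(w)
            new_centers.append(
                [sum(wi * p[d] for wi, p in zip(w, points)) / total for d in range(dim)]
            )
        shift = max(
            abs(a - b) for c, nc in zip(centers, new_centers) for a, b in zip(c, nc)
        )
        centers = new_centers
        if shift < tol:
            break
    return centers, u


def central_genes(names, points, centers, u):
    """Per cluster, pick the gene closest to the center in absolute (L1) distance."""
    best = {}
    for i, name in enumerate(names):
        # Assign the gene to the cluster where its membership is highest.
        k = max(range(len(centers)), key=lambda j: u[i][j])
        d = sum(abs(a - b) for a, b in zip(points[i], centers[k]))
        if k not in best or d < best[k][0]:
            best[k] = (d, name)
    return sorted(name for _, name in best.values())


# Toy expression profiles (invented): two well-separated groups of three genes.
names = ["a1", "a2", "a3", "b1", "b2", "b3"]
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 10.0), (11.0, 10.0), (12.0, 10.0)]

centers, memberships = fuzzy_c_means(points, n_clusters=2)
picked = central_genes(names, points, centers, memberships)
print(picked)
```

In the thesis setting, each point would be one gene's expression profile across the 202 merged samples and `n_clusters` would be 5, so `central_genes` would return five names. Soft memberships matter here because a gene that sits between two groups is not forced into one of them before the centers are computed.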
Keywords/Search Tags:gene expression data set, pancreatic cancer, differential gene, fuzzy clustering, classification characteristic gene