| Cancer generally refers to malignant tumors formed by the abnormal proliferation and differentiation of living cells, which often spread to or infiltrate surrounding tissues and organs. The evolution of cancer is a multivariate, multi-step process: specific symptoms appear late, and clinical manifestations vary with the tissue involved and the stage of development. The causes of cancer have not yet been clearly defined, its progression is complex and difficult to control, and its treatment lacks ideal specific means. As a result, cancer has become one of the main causes of human morbidity and mortality, and timely detection and accurate diagnosis are critical to making treatment decisions, improving cure rates, and improving patients' quality of life. Because the occurrence of cancer is usually associated with genetic mutations, pathological analysis at the genetic level is expected to yield more accurate diagnoses. With the development of high-throughput sequencing technology, gene expression data, especially RNA-seq data, has gradually become one of the main data types in cancer diagnosis research. Exploiting this high-dimensional data is both an opportunity and a challenge for data mining and machine learning methods. In recent years, machine learning methods have become increasingly advanced and efficient, and their application in cancer diagnosis research has become broader and deeper. At the same time, owing to the diversity of machine learning methods and the clinical complexity of cancer, applying machine learning faces several difficulties, including the high feature dimensionality caused by the large number of genes, the small sample size caused by the difficulty of collecting cancer data, and the limitations of the learning models themselves. Building on an understanding of gene expression data and the application of machine learning methods, this dissertation focuses on related issues in cancer diagnosis
research and explores methods and strategies that use deep learning to solve these problems more effectively, thereby improving the accuracy of cancer diagnosis. In particular, it studies the application of deep learning methods to gene expression data for cancer diagnosis. The main contributions are summarized as follows:
(1) A deep learning-based multi-model ensemble method for cancer diagnosis: Machine learning methods differ in their suitability for, and sensitivity to, particular data characteristics and application settings. Each has its own advantages and shortcomings relative to the others, so no single method applied to cancer diagnosis is significantly superior. This dissertation proposes a multi-model ensemble method based on deep learning. The method combines five commonly used machine learning models through the Stacking algorithm and applies a deep neural network in the second learning stage, which re-learns from the predictions of the five models and the complex structure behind the data so as to integrate the advantages of the different models. To avoid overfitting on the RNA-seq data, differential gene expression analysis is first used to select important features, and the cross-validation principle is applied to re-partition the dataset and the first-stage prediction results. Finally, the deep learning method is used for further weight learning and the final prediction. Through the multi-model ensemble and the ability of deep learning to capture complex nonlinear relationships, the proposed method improves the accuracy and stability of cancer diagnosis.
(2) A semi-supervised deep learning method based on a stacked sparse auto-encoder for cancer diagnosis: Most research on cancer diagnosis considers only labeled data. In clinical practice, however, there is a large amount of unlabeled data that also contains useful information. Because the labeling process is expensive, cumbersome, and error-prone, the labeled
cancer data is very limited. In addition, gene expression data has tens of thousands of feature dimensions and high redundancy. The mismatch between the small sample size and the high feature dimensionality adversely affects model training and prediction performance. To address these problems, this dissertation proposes a semi-supervised deep learning method based on the stacked sparse auto-encoder (SSAE). Greedy layer-wise pre-training and a sparsity penalty term are used in the auto-encoder, together with an improved momentum update algorithm, so as to extract the important information in the data while reducing its dimensionality for the subsequent classification. The unsupervised SSAE is then linked to a supervised neural network, and the whole semi-supervised model is fine-tuned using both unlabeled and labeled data. The proposed method exploits the ability of the SSAE to handle unlabeled data, extract sparse representations, and converge faster, making the cancer diagnosis model more efficient and accurate.
(3) A deep learning-based generative adversarial network for cancer diagnosis from imbalanced data: In clinical practice, data imbalance is a common problem because of sampling difficulty and limited sample sizes. Most conventional machine learning methods, however, assume a balanced data distribution, so imbalance can substantially degrade model performance. This dissertation proposes a deep learning model based on the Wasserstein generative adversarial network (WGAN) to address the imbalanced learning problem. The data imbalance problem and its adverse effects on cancer diagnosis models are first analyzed by examining the data distribution of three cancer RNA-seq datasets. The imbalanced training data is then processed with sampling methods. Experiments show that a model trained on balanced data outperforms a model trained on imbalanced data, and that sample
expansion is also a way to improve the training process. Based on the experimental results, an improved WGAN model is proposed, which uses the Wasserstein distance as a reliable indicator of training progress and learns the complex structure behind the data to generate new samples that conform more closely to the characteristics of the original data. By generating new minority-class samples to achieve balance and by further expanding the sample size, cancer diagnosis models can be trained more effectively and achieve better predictive performance. |
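
The two-stage Stacking procedure described in contribution (1) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the five base models, so two toy learners (`fit_centroid`, `fit_stump`) stand in for them; a plain logistic regression (`fit_meta`) is swapped in for the deep neural network used in the second stage; and the twelve two-feature samples are synthetic, not RNA-seq data. The part that corresponds to the text is `out_of_fold`, where the cross-validation principle ensures that each first-stage prediction comes from a model that did not see that sample during training.

```python
import math

def fit_centroid(X, y):
    """Toy base learner (assumption): score from distances to the two class centroids."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    c0 = mean([x for x, t in zip(X, y) if t == 0])
    c1 = mean([x for x, t in zip(X, y) if t == 1])
    def predict(x):
        d0, d1 = math.dist(x, c0), math.dist(x, c1)
        return d0 / (d0 + d1 + 1e-12)  # near 1 when x is close to the class-1 centroid
    return predict

def fit_stump(X, y):
    """Toy base learner (assumption): soft threshold on feature 0."""
    m0 = sum(x[0] for x, t in zip(X, y) if t == 0) / y.count(0)
    m1 = sum(x[0] for x, t in zip(X, y) if t == 1) / y.count(1)
    mid = (m0 + m1) / 2
    sign = 1.0 if m1 >= m0 else -1.0
    return lambda x: 1 / (1 + math.exp(-sign * (x[0] - mid)))

def out_of_fold(fitters, X, y, k=3):
    """First stage: every sample is scored only by models trained without it."""
    n = len(X)
    meta = [[0.0] * len(fitters) for _ in range(n)]
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        held = [i for i in range(n) if i % k == fold]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        for j, fit in enumerate(fitters):
            model = fit(X_tr, y_tr)
            for i in held:
                meta[i][j] = model(X[i])
    return meta

def fit_meta(Z, y, lr=0.5, epochs=500):
    """Second stage: logistic regression, a stand-in for the deep neural network."""
    w, b = [0.0] * len(Z[0]), 0.0
    def prob(z):
        return 1 / (1 + math.exp(-(sum(wi * zi for wi, zi in zip(w, z)) + b)))
    for _ in range(epochs):
        for z, t in zip(Z, y):
            g = prob(z) - t  # gradient of the log loss with respect to the logit
            w[:] = [wi - lr * g * zi for wi, zi in zip(w, z)]
            b -= lr * g
    return lambda z: 1 / (1 + math.exp(-(sum(wi * zi for wi, zi in zip(w, z)) + b)))

# Synthetic, well-separated toy samples; labels alternate so every fold has both classes.
X = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9], [0.0, 0.3], [1.0, 0.7],
     [0.3, 0.0], [0.7, 1.0], [0.2, 0.3], [0.9, 1.0], [0.1, 0.0], [0.8, 0.7]]
y = [0, 1] * 6

fitters = [fit_centroid, fit_stump]
Z = out_of_fold(fitters, X, y)           # first-stage predictions become meta-features
meta_model = fit_meta(Z, y)              # second stage learns how to weight the base models
base_models = [fit(X, y) for fit in fitters]  # refit on all data for new samples

def predict(x):
    """Probability of class 1 for a new sample, via base models then the meta-learner."""
    return meta_model([m(x) for m in base_models])
```

Calling `predict([0.15, 0.1])` yields a probability below 0.5 and `predict([0.85, 0.9])` one above 0.5 on this toy data. In a realistic setting, the feature selection step would run before the first stage, and the meta-learner would be the neural network described above rather than this linear stand-in.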