| Cancer has become a major cause of disease and death worldwide, with a serious impact on people’s health and lives. Cancer has many causes, and identifying them, together with suitable treatments, is an important task for researchers. Years of research have shown that most diseases, including cancer, are related to human genes, and an important source of data about human genes is gene expression profile data. To collect gene expression data, biologists select human tissue samples, add a specific reagent to activate gene expression in the tissue, and then use gene chips to measure RNA expression levels. On the one hand, differences in gene expression levels can be obtained by profiling the same tissue in patients and in healthy people. On the other hand, the therapeutic effect of a drug or treatment regimen can be evaluated by observing its effect on the expression of key genes, that is, the difference in expression levels before and after treatment. Using gene expression profile data to identify the key pathogenic genes of various cancers is therefore of great importance for cancer diagnosis and treatment. However, a gene expression profile data set usually contains only dozens of samples, while the number of genes measured per sample is as high as tens of thousands. This imbalance between feature dimension and sample size leads to serious over-fitting when a machine learning model is applied directly to classify gene expression profile data. Applying deep learning to the analysis of gene expression profile data has become an important topic in bioinformatics. Existing deep learning methods have succeeded in cancer diagnosis based on large gene expression profile data sets. However, previous deep learning models struggle to achieve satisfactory performance on high-dimensional, low-sample-size gene expression profile data. In
this thesis, we present a method for classifying cancer from gene expression data based on deep metric learning: Deep Metric Learning with Sparse Feature Selection (DMSFS). DMSFS introduces a new sample generation layer that generates additional samples, tailored to the high dimensionality and small sample size of gene expression profile data, in order to address the imbalance between sample size and feature dimension. DMSFS also introduces a new feature weight layer, which reflects the importance of each feature through the range over which its weight changes during model training. After ranking the feature weights, DMSFS selects the important features from the high-dimensional feature set for training the classifier, thereby reducing the number of features used in training. DMSFS connects the two networks so that they reinforce each other: the sample generation layer generates more samples from which the feature weight layer can choose important features, while the feature weight layer selects better features, which yield better discrimination through a diversity loss, and provides feedback to the generation layer so that it produces more suitable samples. On eight real gene expression profile data sets, DMSFS achieved an improvement of 5 to 10 percentage points compared with five representative current methods. |
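
The feature-ranking idea in the abstract (weights that move most during training mark the important features) can be illustrated with a minimal sketch. This is not the DMSFS implementation: the sample generation layer, the metric learning objective, and the diversity loss are all omitted, and a plain logistic classifier is swapped in as an assumption. The function name `rank_features_by_weight_change`, the hyperparameters, and the synthetic data in the usage example are hypothetical.

```python
import numpy as np


def rank_features_by_weight_change(X, y, epochs=300, lr=0.05):
    """Rank features by how far a trainable per-feature weight moves in training.

    Assumption: a per-feature scaling vector `w` (a stand-in for a feature
    weight layer) sits in front of a logistic classifier; both are trained
    jointly with full-batch gradient descent on the mean log-loss.
    Returns (feature indices sorted by decreasing change, per-feature change).
    """
    n_samples, n_features = X.shape
    w = np.ones(n_features)          # feature weights, one per input feature
    w_init = w.copy()
    theta = np.zeros(n_features)     # logistic classifier coefficients
    bias = 0.0

    for _ in range(epochs):
        z = (X * w) @ theta + bias            # logits on the re-weighted features
        p = 1.0 / (1.0 + np.exp(-z))          # predicted probability of class 1
        err = (p - y) / n_samples             # d(mean log-loss) / d(logits)
        g = X.T @ err                         # shared factor of both gradients
        grad_theta = g * w                    # chain rule: dz/dtheta_j = X_ij * w_j
        grad_w = g * theta                    # chain rule: dz/dw_j = X_ij * theta_j
        theta -= lr * grad_theta
        w -= lr * grad_w
        bias -= lr * err.sum()

    change = np.abs(w - w_init)               # how far each weight moved
    order = np.argsort(change)[::-1]          # largest change first
    return order, change


if __name__ == "__main__":
    # Synthetic high-dimension, low-sample-size data (hypothetical, not from the thesis):
    # 40 samples, 500 features, only the first 5 carry class information.
    rng = np.random.default_rng(0)
    y = np.array([0, 1] * 20)
    X = rng.normal(size=(40, 500))
    X[:, :5] += 3.0 * (2 * y[:, None] - 1)

    order, change = rank_features_by_weight_change(X, y)
    top_k = order[:5]                         # features kept for the downstream classifier
    print("selected features:", sorted(top_k.tolist()))
```

Keeping only the top-ranked columns (`X[:, top_k]`) before fitting the final classifier is one simple way to reduce the number of features used in training, which is the role the abstract assigns to the ranking step.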