
Method Of Gene Selection In Gene Expression Data And Model Optimization In Near-infrared Spectral Micro-analysis

Posted on: 2015-12-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Y Mao
GTID: 1220330467964435
Subject: Analytical Chemistry
Abstract/Summary:
With the development of analytical instruments, analytical chemists need effective methods for extracting information from vast amounts of high-dimensional data. Chemometrics has advantages in data processing, information extraction, and the qualitative and quantitative analysis of complex samples. Therefore, this dissertation studies gene selection for cancer classification using gene expression data, and model optimization for quantitative analysis of plant samples using near-infrared diffuse reflectance spectroscopy (NIRDRS). The main contents are as follows:

1. Significant genes were selected by randomization test (RT) for cancer classification using gene expression data. Gene selection is an important task in bioinformatics, because the accuracy of cancer classification generally depends on genes that are biologically relevant to the classification problem. In this work, the randomization test was used as a gene selection method for gene expression data. In the method, a statistic derived from the regression coefficients of a series of partial least squares discriminant analysis (PLSDA) models was used to evaluate the significance of each gene. When the calculation is repeated, the frequency with which a gene is selected can serve as a further criterion of its significance. Informative genes were selected for classifying four gene expression datasets (prostate cancer, lung cancer, leukemia and non-small cell lung cancer (NSCLC)), and the rationality of the results was validated by biological investigation of the selected genes, multiple linear regression (MLR) modeling and principal component analysis (PCA). Satisfactory classification results were obtained with the selected genes. Therefore, the method may be an alternative tool for classification based on gene expression data.

2. The models for rapid determination of chlorogenic acid, scopoletin and rutin in plant samples by near-infrared diffuse reflectance spectroscopy were optimized. Polyphenols in plant samples have been studied extensively because phenolic compounds are ubiquitous in plants and can act as antioxidants that promote human health. A method for rapid determination of three phenolic compounds (chlorogenic acid, scopoletin and rutin) in plant samples using NIRDRS was studied in this work. Partial least squares (PLS) regression was used to build the calibration models, and the effects of spectral preprocessing and variable selection were investigated in order to optimize the models. The results show that spectral preprocessing or variable selection applied individually has little or no influence on the models, but a combination of the techniques improves them significantly. The best models were obtained by combining continuous wavelet transform (CWT) to remove the variable background, multiplicative scatter correction (MSC) to correct the scattering effect, and randomization test (RT) to select the informative variables. To validate the models, the polyphenol contents of an independent sample set were predicted. The correlation coefficients between the predicted values and the contents determined by high performance liquid chromatography (HPLC) are as high as 0.964, 0.948 and 0.934 for chlorogenic acid, scopoletin and rutin, respectively. Therefore, NIRDRS can be used for rapid determination of polyphenols in plant samples.
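The randomization test used in both parts of the abstract can be illustrated with a small sketch. The abstract does not give the exact statistic, so the following makes several assumptions that should not be read as the dissertation's procedure: a one-component PLS1 model stands in for the PLSDA or PLS models, the response is permuted to build the null distribution, and the p-value of a variable is the share of permuted models whose coefficient magnitude reaches the observed one. The function names `center`, `pls1_coefficients` and `randomization_test` are hypothetical.

```python
import random


def center(values):
    """Subtract the mean from a list of numbers."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def pls1_coefficients(X, y):
    """Regression coefficients of a one-component PLS1 model.

    X is a list of samples (rows) and y a list of responses; both are
    mean-centered here. With class labels coded as +1/-1 this is a heavily
    reduced stand-in for PLSDA (assumption, not the source's model).
    """
    cols = [center(col) for col in zip(*X)]      # one list per variable
    yc = center(y)
    c = [sum(a * b for a, b in zip(col, yc)) for col in cols]   # X^T y
    norm = sum(v * v for v in c) ** 0.5
    w = [v / norm for v in c]                    # PLS weight vector
    # Scores t = X w, one value per sample.
    t = [sum(wj * col[i] for wj, col in zip(w, cols)) for i in range(len(yc))]
    q = sum(a * b for a, b in zip(t, yc)) / sum(a * a for a in t)
    return [wj * q for wj in w]


def randomization_test(X, y, n_permutations=200, seed=0):
    """Permutation p-value for the coefficient magnitude of each variable."""
    rng = random.Random(seed)
    observed = [abs(b) for b in pls1_coefficients(X, y)]
    exceed = [0] * len(observed)
    y_perm = list(y)
    for _ in range(n_permutations):
        rng.shuffle(y_perm)                      # break the X-y relationship
        for j, b in enumerate(pls1_coefficients(X, y_perm)):
            if abs(b) >= observed[j]:
                exceed[j] += 1
    # Add one to numerator and denominator so no p-value is exactly zero.
    return [(k + 1) / (n_permutations + 1) for k in exceed]
```

Variables (genes or wavelengths) whose p-value falls below a chosen threshold would be kept as informative. Repeating the test with different seeds and counting how often each variable passes gives a selection frequency of the kind the abstract mentions.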
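Multiplicative scatter correction, named above as the step that corrects the scattering effect, can be sketched in a few lines. This is a minimal illustration of the standard form of MSC, assuming the mean spectrum is used as the reference; the abstract does not state which reference was used, and the function names `mean_spectrum` and `msc` as well as the demo values are made up for the example.

```python
from statistics import fmean


def mean_spectrum(spectra):
    """Column-wise mean of a list of equal-length spectra."""
    return [fmean(col) for col in zip(*spectra)]


def msc(spectra, reference=None):
    """Multiplicative scatter correction.

    Each spectrum x is regressed on the reference r by ordinary least
    squares (x ~ a + b * r) and replaced by (x - a) / b. The mean spectrum
    is the reference unless another one is passed in. A spectrum with zero
    fitted slope would cause a division by zero and is not handled here.
    """
    ref = mean_spectrum(spectra) if reference is None else list(reference)
    r_mean = fmean(ref)
    r_ss = sum((r - r_mean) ** 2 for r in ref)
    corrected = []
    for x in spectra:
        x_mean = fmean(x)
        slope = sum((r - r_mean) * (v - x_mean) for r, v in zip(ref, x)) / r_ss
        intercept = x_mean - slope * r_mean
        corrected.append([(v - intercept) / slope for v in x])
    return corrected


if __name__ == "__main__":
    base = [0.10, 0.35, 0.80, 0.55, 0.20]        # made-up absorbance values
    # Three copies of the same spectrum with different offsets and scalings.
    raw = [[a + b * v for v in base] for a, b in [(0.05, 1.2), (-0.02, 0.8), (0.10, 1.0)]]
    for row in msc(raw):
        print([round(v, 4) for v in row])
```

In the demo the three raw spectra differ only by an additive offset and a multiplicative factor, which is the kind of variation MSC is meant to remove, so the corrected spectra coincide with the mean spectrum.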
Keywords/Search Tags: Gene expression data, Gene selection, Cancer classification, Near-infrared spectroscopy, Model optimization