Dimensionality Reduction Analysis Of Gene Chip Data Of Acute Myeloid Leukemia

Posted on: 2020-01-13  Degree: Master  Type: Thesis
Country: China  Candidate: X Sun  Full Text: PDF
GTID: 2404330572490724  Subject: Statistics
Abstract/Summary:
The incidence of leukemia is the sixth highest among all kinds of tumors in China. Leukemia is a malignant disease of hematopoietic stem cells. In China, about 2.76 people per 100,000 population suffer from leukemia. Acute myeloid leukemia (AML) is a malignant disease of myeloid hematopoietic stem cells; about 1.62 people per 100,000 population suffer from AML. With the implementation of the Human Genome Project, more and more gene sequences have been determined, and their number is increasing at an unprecedented rate. The advent of the gene chip supports the study of gene sequences and can be applied to the diagnosis and identification of diseases. Gene chips are suitable for early diagnosis and treatment, and they can improve diagnostic accuracy for both children and adults with AML.

Gene chip data can be regarded as a matrix. Because the data are large and high-dimensional, studying them directly without any processing is complex, time-consuming and expensive. It is therefore important to process gene chip data through dimensionality reduction so that researchers can obtain the information contained in a gene chip more efficiently. This paper discusses methods for the dimensionality reduction of AML gene chip data.

This paper combines the Bootstrap method with PCA. Bootstrap estimates a statistic by sampling with replacement from the original data, and this is the first time it has been applied to PCA-based dimensionality reduction of gene chip data. When chips are taken as variables, Bootstrap samples are drawn with replacement from the set X = [X1, X2, …, Xn]^T. We then compute the eigenvalues λb (1 ≤ b ≤ B) and the eigenvectors aijb (1 ≤ b ≤ B) of the covariance matrix of each Bootstrap sample. Repeating these steps B (B ≥ 1000) times, we take the mean of λb as the variance contribution of Bootstrap PCA and use the mean of aijb to modify the PCA coefficients. In this way we obtain the principal components of genes and improve the results of PCA when the number of samples is small.

The paper consists of six parts, as follows.

I. Preprocessing of Gene Chip Data
This paper obtains the matrix of AML gene chip data from the GEO database of the National Center for Biotechnology Information (NCBI). In order to compare the results, we choose three data sets whose significance-test P values satisfy P<0.05, P<0.01 and P<0.001 respectively.

II. Dimensionality Reduction Analysis of AML Gene Chip Data Based on Principal Component
This paper analyzes the gene chip data sets by PCA with chips as variables and selects the top three principal components, whose cumulative variance contribution rate exceeds 80%. We select differential genes according to the gene scores on the second and third principal components. In particular, the HOXA9 gene is differentially expressed with high frequency and plays an important role in AML. Because the number of samples is much smaller than the number of variables, PCA gives poor results when genes are taken as variables.

III. Dimensionality Reduction Analysis of Gene Chip Data Based on Bootstrap-PCA
According to the empirical analysis, the Bootstrap-PCA method needs fewer principal components than PCA for the cumulative variance contribution rate to exceed 80%.

IV. Linear Regression Based on Bootstrap Principal Components
This paper codes the two kinds of AML as 1 and 2 and takes them as the dependent variable y, with the Bootstrap principal components Fj as independent variables. We establish the linear regression equation y = α + β1F1 + β2F2 + … + βnFn. We draw a subset of the samples to estimate the coefficients of the regression equation and check its accuracy on the samples that were not drawn. The results indicate that the regression equation can be used to judge the category of a sample. Unfortunately, the MATLAB running time of Bootstrap-PCA is too long.

V. Dimensionality Reduction Analysis of Gene Chip Data Based on SPCA
This paper uses SPCA to set more factor loadings to 0. However, the variance contribution rate decreases as the number of nonzero coefficients gets smaller, and more information is lost compared with PCA. For the AML gene chip data in this paper, SPCA performs poorly.

VI. Clustering Analysis of Gene Chip Data
This paper compares several distance algorithms for hierarchical clustering in terms of MATLAB running time and accuracy. Generally speaking, the complete distance is better. MATLAB running time decreases significantly when K-means clustering is used. This paper also uses hierarchical clustering and K-means clustering to test the accuracy of classification. K-means clustering is suitable for the P<0.05 and P<0.01 data sets, while hierarchical clustering is suitable for the P<0.001 data set. In general, hierarchical clustering is superior to K-means clustering when the number of samples is small; otherwise, K-means clustering is better.

In conclusion, PCA can be used to select key genes of the disease if we take chips as variables. If we take genes as variables, we can use Bootstrap PCA to establish a regression equation to classify samples, which is useful for determining the type of disease, although the MATLAB running time of Bootstrap PCA is longer than that of PCA. Clustering analysis is suitable for classifying genes and samples. If the data set is large, it is better to use K-means clustering to classify the samples and genes; if the data set is small, hierarchical clustering is superior to K-means clustering for classifying the samples.
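
The Bootstrap PCA procedure summarized in the abstract (resample with replacement, eigendecompose the covariance matrix of each resample, then average the eigenvalues and eigenvectors over B repetitions) can be illustrated with a minimal Python/NumPy sketch. This is not the thesis's MATLAB implementation. The function names `sorted_eigh` and `bootstrap_pca`, the layout of `X` (rows as observations, columns as variables), and the sign-alignment step are assumptions made here for illustration; sign alignment is added because an eigenvector is only defined up to sign, so averaging unaligned vectors could cancel them out.

```python
import numpy as np


def sorted_eigh(cov):
    """Eigen-decompose a symmetric matrix, largest eigenvalue first."""
    vals, vecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]


def bootstrap_pca(X, B=1000, seed=0):
    """Average PCA eigenvalues and eigenvectors over B bootstrap resamples.

    X is assumed to have shape (n_observations, n_variables).
    Returns (mean eigenvalues, mean eigenvectors as columns,
    variance contribution rates).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape

    # Reference PCA on the full data, used only to fix eigenvector signs
    # (an assumption of this sketch, not a step stated in the abstract).
    _, ref_vecs = sorted_eigh(np.cov(X, rowvar=False))

    vals_sum = np.zeros(p)
    vecs_sum = np.zeros((p, p))
    for _ in range(B):
        idx = rng.integers(0, n, size=n)            # sample rows with replacement
        vals, vecs = sorted_eigh(np.cov(X[idx], rowvar=False))
        signs = np.sign(np.sum(vecs * ref_vecs, axis=0))
        signs[signs == 0] = 1.0
        vals_sum += vals
        vecs_sum += vecs * signs                    # flip columns to match reference

    mean_vals = vals_sum / B
    mean_vecs = vecs_sum / B
    rate = mean_vals / mean_vals.sum()              # variance contribution rate
    return mean_vals, mean_vecs, rate
```

With the returned `rate`, the number of components needed to pass the 80% threshold mentioned in the abstract could be read off as `np.searchsorted(np.cumsum(rate), 0.8) + 1`. The loop performs one full eigendecomposition per resample, which is consistent with the long running time the abstract reports for Bootstrap PCA.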
Keywords/Search Tags:AML Gene Chip Data, Dimensionality Reduction, PCA, Bootstrap, Clustering