
An Information Entropy-based Improved K-TSP Method For Classifying Human Cancers

Posted on: 2010-05-23
Degree: Master
Type: Thesis
Country: China
Candidate: C B Zhou
Full Text: PDF
GTID: 2144360272995796
Subject: Computer application technology
Abstract/Summary:
The completion of the Human Genome Project spurred the emergence and development of microarray technology. Combining microarrays with other molecular biology techniques makes high-throughput analysis of genome structure and expression levels possible. Microarrays not only meet the post-genome era's need for studying gene function but also greatly broaden the research methods available to the life sciences. Before microarray technology appeared, researchers typically selected one or a few genes or proteins as targets and studied their biological functions. This low-throughput approach yields more in-depth research and more reliable results, but overall progress is very slow. Using microarrays to analyze genome-wide gene expression, polymorphisms, and other structural characteristics will greatly accelerate medical research and ultimately guide clinical practice. For example, traditional tumor classification is based mainly on tumor cell morphology, yet tumors with the same cell morphology can respond very differently to treatment. Cell morphology is therefore a poor basis for classifying cancers and their subtypes, and relying on it leads to errors in diagnosis and treatment. Microarray expression profiling can identify subtypes that traditional diagnostic methods miss but that are important guides for cancer diagnosis and treatment.
Cancer classification and prediction with traditional histological methods is not only inaccurate but also laborious. Gene expression profiling provides a new, system-level research tool for cancer research, and it has become important in oncology for both basic research and clinical applications.
How to analyze tumor gene expression profiles effectively and mine the information and knowledge they contain is an important problem in bioinformatics. Gene expression data is a focus of biological and medical research in bioinformatics, and cancer classification and prediction based on such data is an active topic in gene expression analysis. By analyzing gene expression data, we can find correlations between changes in gene expression and pathological features, analyze pathogenesis, identify diagnostic markers and drug targets, and, in the future, diagnose disease directly from gene expression data.
Microarray expression profiling has been widely used in oncology studies. Many methods exist for cancer classification and prediction from microarray expression profiles; the k-TSP algorithm is one of them. The k-TSP algorithm first computes a score and a rank score for each gene pair from the differences in expression level between the two genes; second, it sorts the gene pairs in descending order of score and rank score; third, it determines the value of k by leave-one-out cross-validation (LOOCV); finally, it selects the first k gene pairs to form the rule for cancer classification and prediction. Each gene pair in the rule casts one vote for the class label it predicts, and the predicted class is the label with the most votes (majority vote). The advantages of this algorithm are its relatively high accuracy and its simple classification rule. However, the algorithm considers all genes in the microarray expression data, including genes that are irrelevant to the classification, which lowers accuracy.
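The k-TSP steps described above can be illustrated with a minimal sketch for the two-class case. It is not the thesis's implementation. It assumes the commonly described pair score |P(x_i < x_j | class 0) - P(x_i < x_j | class 1)|, keeps the selected pairs disjoint (an assumption not stated in this abstract), and omits both the rank score used for tie-breaking and the LOOCV choice of k. The names `pair_scores`, `fit_ktsp`, and `predict_ktsp` are hypothetical.

```python
from itertools import combinations


def pair_scores(X, y):
    """Score every gene pair (i, j) on a list-of-rows expression matrix X.

    Assumed score: |P(x_i < x_j | class 0) - P(x_i < x_j | class 1)|.
    Labels in y must be 0 or 1, and both classes must be present.
    """
    n_genes = len(X[0])
    class0 = [row for row, label in zip(X, y) if label == 0]
    class1 = [row for row, label in zip(X, y) if label == 1]
    scored = []
    for i, j in combinations(range(n_genes), 2):
        p0 = sum(r[i] < r[j] for r in class0) / len(class0)
        p1 = sum(r[i] < r[j] for r in class1) / len(class1)
        # Class that the pair votes for when x_i < x_j is observed.
        vote_if_less = 0 if p0 > p1 else 1
        scored.append((abs(p0 - p1), i, j, vote_if_less))
    return scored


def fit_ktsp(X, y, k):
    """Pick the k highest-scoring disjoint gene pairs (k is given, not tuned)."""
    ranked = sorted(pair_scores(X, y), key=lambda t: -t[0])  # stable sort
    used, pairs = set(), []
    for _, i, j, vote in ranked:
        if i in used or j in used:
            continue  # keep pairs disjoint (assumption)
        pairs.append((i, j, vote))
        used.update((i, j))
        if len(pairs) == k:
            break
    return pairs


def predict_ktsp(pairs, x):
    """Majority vote over the selected pairs; an exact tie falls back to class 0."""
    votes_for_1 = sum(v if x[i] < x[j] else 1 - v for i, j, v in pairs)
    return int(2 * votes_for_1 > len(pairs))
```

Because every gene pair is scored, the cost grows quadratically with the number of genes, and tuning k by cross-validation repeats that work once per fold.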
Moreover, because of the cross-validation process, time and space consumption are high. Reducing the number of genes can greatly reduce the time and space consumed by cross-validation.
Some attributes in a data set carry unreliable information, which degrades the quality of the prediction results and the accuracy of the classification algorithm. When dealing with classification and prediction problems, we should therefore select genes that carry a large amount of reliable information. Information entropy is used to quantify the reliable information content of attributes in classification and prediction problems. In this thesis we therefore use an information-entropy algorithm to select the genes with high reliable information content, which we call key genes, and train the k-TSP algorithm on the selected key genes to achieve better classification and prediction performance. Many gene selection algorithms exist, but different gene selection algorithms suit different classification algorithms, so the applicability of information-entropy-based selection to k-TSP must be tested on data sets.
To evaluate the improved k-TSP algorithm and to determine whether gene selection based on information entropy suits k-TSP, we compare its accuracy with that of several popular classification algorithms: the C4.5 decision tree (DT), naive Bayes (NB), k-nearest neighbors (k-NN), the support vector machine (SVM), and prediction analysis of microarrays (PAM). The comparison covers 9 binary-class data sets and 10 multi-class data sets, 19 data sets in total.
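As an illustration of entropy-based gene selection, the sketch below ranks genes by information gain, that is, the reduction in class-label entropy after splitting samples at a gene's median expression. The abstract does not state the exact entropy criterion or discretization used in the thesis, so the median split, the information-gain ranking, and the names `entropy`, `information_gain`, and `select_key_genes` are all assumptions made for this example.

```python
import math
from collections import Counter
from statistics import median


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction from splitting samples at the median of `values`.

    The median split is an assumed discretization, chosen for simplicity.
    """
    threshold = median(values)
    low = [lab for v, lab in zip(values, labels) if v <= threshold]
    high = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    conditional = sum(len(part) / n * entropy(part) for part in (low, high) if part)
    return entropy(labels) - conditional


def select_key_genes(X, y, n_keep):
    """Return indices of the n_keep genes with the highest information gain."""
    n_genes = len(X[0])
    gains = [information_gain([row[g] for row in X], y) for g in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: -gains[g])  # stable sort
    return ranked[:n_keep]
```

The selected indices could then be used to restrict the expression matrix before training k-TSP, which is how filtering would shrink the number of gene pairs to score.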
The results show that the improved k-TSP algorithm not only achieves better accuracy on some data sets but also consumes less time and space.
To illustrate the effect of key gene selection with the information-entropy algorithm, we use the leukemia data sets and analyze the genes that form the k-TSP classification rule, searching NCBI for their functions. Some of them are directly related to leukemia genes or to genes in leukemia-associated pathways; others are indirectly associated with leukemia genes or with genes in those pathways.
Keywords/Search Tags: Information Entropy, Cancer Classification, Key Gene, Gene Expression Profile, k-TSP