Gene chip technology is one of the most important breakthroughs in molecular biology in recent years. It can measure the expression levels of tens of thousands of genes in parallel, and present the gene expression data under different conditions at a genome wild scale. However, with the sprint up of gene expression data, the analysis of vast amounts of biological data by effective algorithms has become a big challenge. In this thesis, we mainly focus on the development of biclustering algorithms, which aim to find trend preserving biclusters in gene expression data. Here a trend preserving bicluster represents a set of genes which show trend preserving expression patterns under a set of conditions. The analysis of gene expression data reveals the expression information of genes and helps the study of correlations between genes, which also is important to the diagnosis and treatment of diseases, and to the evaluation of drug efficacy.Traditional clustering methods were first applied to analyze gene expression data, by treating genes and conditions separately. The results of tradition clustering algorithms are usually either a set of genes which are co-expressed under all conditions or a set of conditions under which all genes are co-expressed. But a cellular process may only affect a small set of genes which are co-regulated under certain conditions. Moreover, a single gene may participate in multiple pathways, resulting in different expression patterns under different subsets of conditions. Actually the results we need in the analysis of gene expression data are biclusters which can reflect the correlations of a subset of genes under a subset of conditions, while different biclusters are allowed to be overlapped with each other. All these hidden data characteristics are intractable by traditional clustering algorithms.Biclustering methods have been proposed to offer an effective way to analyze gene expression data, which aim to find subsets of co-expressed genes under certain conditions. Biclustering algorithm was first pioneered by Morgan et al. by partitioning a matrix into submatrices with approximately constant values. Since the biclustering idea was applied to analyze gene expression data, different algorithms for different types of biclusters have been widely developed and play important roles in the analysis of gene expression data. Biologically, trend-preserving biclusters are the most meaningful substructures hidden in gene expression data, and many algorithms have been proposed to find this type of biclusters. However, it remains a highly challenge problem to efficiently and effectively identify the general trend preserving biclusters due to its complexity nature in computation.In this thesis, we propose a novel biclustering algorithm UniBic, which is capable of discovering trend preserving biclusters hidden in matrix datasets. The perspective of the algorithm is based on the observations that there exists a column permutation of an order preserving bicluster such that the entries in each permuted row within the bicluster are non-decreasingly arranged. The basic ideas of UniBic follow the steps below:first, create the index matrix according to the original input matrix, and equally partition the index matrix into several subsets of rows based on the requirement of significance of to-be-identified biclusters; second, apply the longest common subsequence framework to each pair of rows in each subset of index matrix to locate seeds for biclusters to be identified, finally, extend the seeds into strict order preserving biclusters, and further extend the strict order preserving biclusters into approximately trend preserving biclusters with the tolerance of errors. The creation of index matrix transforms the problem of discovering trend preserving biclusters in a background matrix into a simple problem of finding the longest common subsequences between pairs of rows in the index matrix, making the original problem less intractable. Moreover, when the algorithm is applied to large scale datasets, e.g., gene expression datasets, we preprocess the data and focus only on those regulated entries so as to alleviate the adverse impacts of inactive entries and noise interference.We tested UniBic on synthetic and real datasets and compared its performance with six other currently competitive biclustering algorithms. The results on different types of synthetic datasets show that UniBic outperforms all other six tools when the implanted biclusters are supported by certain number of columns, particularly, UniBic is better at discovering trend preserving biclusters. UniBic also outperforms others when the implanted biclusters are overlapped. The results on real datasets show that UniBic can find GO enriched biclusters effectively.We observed the shortage of our algorithm which is commonly suffered by most of the existing biclustering algorithms from the fact that as seeds are selected as the longest common subsequences between pairs of rows in the index matrix, UniBic may to some extent overlook the narrow biclusters with significant number of rows but with only few columns. There have been a lot of algorithms designed specifically for narrow biclusters, but all these algorithms are time consuming and perform worse on biclusters with more columns. Considering the computational complexity of biclustering problem, it is hard to design an algorithm that performs well on all types of biclusters. We plan to develop a feasible way to overcome the shortage of UniBic as our further research topic.At last we introduce a clustering algorithm Peg and compare our algorithm with hierarchy clustering method on clostridia genomes. The results show that our algorithm reveals the data features better than HC.UniBic is implemented in C and the source code is available at: http://sourceforge.net/projects/unibic/files/?source=navbar. All the tested datasets as well as results are also available there. |