Font Size: a A A

Parallel Processing In Bioinformatics

Posted on:2008-07-08Degree:MasterType:Thesis
Country:ChinaCandidate:W LiuFull Text:PDF
GTID:2120360215474794Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Bioinformatics is a new comprehensive cross discipline involving biology, mathematis, physics, informatics and computer science. It plays an important role in the development of the life science and become the frontier of life science research. The core issue in bioinformatics is genome informatics which includes the obtaining, processing, storing, assigning and explaining of the genome information. Using computers and network as tools, based on the mathematical and physical theory, methods and technology, genome informatics studies the biopolymers include the sequences, structures and functions of DNA and protein. The key issue in genome informatics is to understand the meaning of the order in nucleotide sequences, namely, to understand the exact locations of the genes in the chromosome and the functions of the DNA segments. Furthermore, it simulates and predicts the secondary and tertiary structure of protein using the genome information it discovered. These are very important in the research on disease gene of human being, the expression and function of gene and protein and the designing of pharmacy.Alignment and splicing of biosequences, clustering of gene expressing data are the important tasks in bioinformatics. Biosequences alignment plays an essential role in sequence analysis, reconstruction of phylogenetic trees, detecting regions of significant sequence similarity in collections of primary sequences, and predicting the secondary and tertiary structure. Splicing of biosequences is the most important and essential task in the stage of genome sequencing. It assembles the small segments obtained by the genome sequencing into one or more longer and continuous objective sequences. The biclustering of the gene expressing data is to identify a subset of genes whose expressing levels rise and fall coherently with a subset of experimental conditions. By the biclustering, the genes with identical regulatory gene are classified into one cluster, it is helpful to analysis and identify the presumptive regulatory sites of the genes belongs to the same cluster.In solving those bioinformatical problems, high performance computer is required since they consumes large amount of computation time and memory space. Unlike the vector computation where the data structure and dependencies are regular which make it easy to be parallelized, those bioinformatical problems are essentially combinatory optimization problems which are difficult to be processed in parallel. In this paper, we deeply research on parallel processing of those bioinformatic problems, several parallel algorithms are presented and satisfied experimental results are obtained.First, an fast algorithm for LCS problem named FAST_LCS is presented. The algorithm first seeks the successors of the initial identical character pairs according to a successor table to obtain all the identical pairs and their levels. Then by tracing back from the identical character pair at the largest level, the result of LCS can be obtained. The algorithm can also be used in solving multiple sequences alignment. In the process of seeking identical pairs, techniques of pruning and skipping are used to speedup the process. Experimental result shows that our algorithm can get exactly correct result and is faster and more efficient than other LCS algorithms.Second, we also deeply study the problem of biosequences splicing and present an efficient parallel algorithm. In the algorithm the concept of suffix index was presented instead of suffix tree. The algorithm first constructs a suffix index for the segments of the gene sequences, then for each gene segment i searches in the suffix index of all other segments , to find the longest matching suffix in segment j . Suppose the length of such match is lij, a digraph is built with vertexes representing the segments and lij representing the weight of the edge linking segment i and j. Then the longest Hamilton circle is computed by parallel ant colony optimization. At last, the scheme of splicing is obtained according to the Hamilton circle. Since the algorithm uses suffix index instead of suffix tree,computation load is largely reduced and it is more suitable for parallel processing. Since the algorithm exploits the strong optimization ability of ant colony optimization in TSP solving to find the longest Hamilton circle, it reduces large amount of optimization time. Compared with other similar algorithms, our algorithm can obtain more accurate results and has higher computational speed and efficiency.Furthermore, after studying the problem of gene expressing data analysis, a parallel biclustering algorithm is also presented. Based on the anti-monotones property of the quality of the data sets with their sizes, the algorithm starts from the data sets containing of every two rows and every two columns of the data matrix, and gets the final biclusters by gradually adding columns and rows on the data sets. Both the theory analysis and experimental results show our algorithm has superiority our other similar algorithms in terms of processing speed and quality of clustering and efficiency.The research work involving this paper applies the technique of parallel processing on bioinformatic research, all the parallel algorithms are coded using MPI (C bonding) and tested on parallel computer Shenteng-1800. All the experiments use benchmark data sets randomly selected from the standard bioinformatics data base. Experimental results show our algorithms can not only get higher processing speed, but also higher quality of results. It demonstrates the strong computational ability of parallel computers in applications for bioinformatic problems. The development of efficient parallel algorithms for the boinformatic problems will greatly promote the research of bioinformatics.
Keywords/Search Tags:Bioinformatics
PDF Full Text Request
Related items