Study On Algorithms Of Protein Batch Homology Search And DNA Motif Discovering

Posted on: 2017-12-31
Degree: Master
Type: Thesis
Country: China
Candidate: J H Yu
GTID: 2310330488958700
Subject: Computer application technology
Abstract/Summary:
Batch homology search of protein sequences and motif discovery are two common tasks in modern bioinformatics. The aim of this thesis is to study both tasks in depth and to design effective algorithms for them: a fast batch homology search based on compression and clustering (C2-BLASTP), and a motif discovery method based on random projection and particle swarm optimization (PSORPS). The main research work consists of the following two parts.

A common task in many modern bioinformatics applications is to match a set of protein query sequences against a large sequence dataset. Protein sequence data, although on a slower growth curve than genomic data, nonetheless grow exponentially, doubling roughly every 2 years, for now just keeping pace with Moore's law for computational power. To search multiple queries against the growing database, the basic approach is either to run BLAST on each original query or to group the queries and concatenate them. Both options are inefficient because they fail to exploit common subsequences shared by the queries. Therefore, we propose a new fast batch query algorithm based on compression and clustering (C2-BLASTP), which makes full use of the joint information among the query sequences and the database. First, the queries and the database are each compressed by redundancy analysis, redundancy removal, and distinction recording. The database is then further clustered by similar subsequences, and hit finding is carried out on the clustered database. Next, a final execution database is reconstructed from the potential hits that were found, which mitigates the growing scale of the sequence database. Finally, the homology search is performed on the execution database. Experimental evaluations on the NCBI NR database demonstrate the effectiveness of the proposed C2-BLASTP for batch homology search in sequence databases.
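The compression step (redundancy removal plus distinction recording) can be illustrated with a minimal sketch. The abstract does not specify how redundancy is detected, so everything here is an assumption: the names `compress` and `restore`, the `max_diff` threshold, and the simplification that only equal-length sequences differing by a few substitutions count as redundant.

```python
from typing import Dict, List, Tuple

# (position, residue) pairs where a sequence differs from its representative
Diff = List[Tuple[int, str]]


def compress(seqs: Dict[str, str], max_diff: int = 2):
    """Greedy redundancy removal (hypothetical helper, not the thesis code).

    A sequence is stored as a link to an existing representative when it has
    the same length and at most `max_diff` substitutions; otherwise it becomes
    a new representative.
    """
    reps: List[str] = []                     # non-redundant sequences
    links: Dict[str, Tuple[int, Diff]] = {}  # id -> (representative index, differences)
    for sid, seq in seqs.items():
        for i, rep in enumerate(reps):
            if len(rep) != len(seq):
                continue
            diff = [(p, c) for p, (r, c) in enumerate(zip(rep, seq)) if r != c]
            if len(diff) <= max_diff:
                links[sid] = (i, diff)       # the "distinction record"
                break
        else:
            # no close representative found: keep the sequence itself
            links[sid] = (len(reps), [])
            reps.append(seq)
    return reps, links


def restore(reps: List[str], links: Dict[str, Tuple[int, Diff]], sid: str) -> str:
    """Rebuild the original sequence from its representative and its differences."""
    i, diff = links[sid]
    chars = list(reps[i])
    for p, c in diff:
        chars[p] = c
    return "".join(chars)
```

In this sketch a search would only need to scan `reps`, and the recorded differences allow every original sequence to be recovered exactly when hits are reported.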
C2-BLASTP achieves competitive results in terms of homology accuracy, search speed, and memory usage compared with several state-of-the-art methods.

In gene expression and gene regulation, transcription is part of the process that passes genetic information from DNA to protein. Within transcription, transcription factor binding sites can help researchers understand the biological evolutionary relationships between sequences. Identifying transcription factor binding sites, known as motif discovery, therefore plays an important role in understanding the biological significance of the sequences. Particle Swarm Optimization (PSO) addresses motif discovery by combining local and global best solutions. However, because the dataset contains many noisy subsequences, the algorithm easily falls into local optima. To solve this problem, this thesis proposes a new algorithm (PSORPS) that uses a random projection strategy to reduce the noisy subsequences. While reducing the noise, PSORPS obtains similar sequence segments that are distributed across as many sequences as possible, and then uses the positions of these segments to initialize the particles. Finally, PSORPS applies re-alignment and simultaneous shift operators to the PSO output to refine the final result. Experimental evaluations on real data sets demonstrate the effectiveness of the proposed PSORPS for motif discovery.
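The random projection idea can be sketched as follows. This is a generic, single-projection illustration and not the PSORPS implementation: the function name `best_bucket`, the parameters `l` (segment length) and `k` (number of kept positions), and the rule for picking the best bucket are all assumptions made for the example.

```python
import random
from collections import defaultdict
from typing import List, Tuple


def best_bucket(seqs: List[str], l: int, k: int, seed: int = 0):
    """Hash every l-mer by k randomly chosen positions and return the bucket
    whose l-mers come from the most distinct sequences (hypothetical helper)."""
    rng = random.Random(seed)
    kept = sorted(rng.sample(range(l), k))   # positions kept by the projection
    buckets = defaultdict(list)              # projected key -> [(seq index, start)]
    for si, s in enumerate(seqs):
        for start in range(len(s) - l + 1):
            key = "".join(s[start + p] for p in kept)
            buckets[key].append((si, start))

    def score(key: str) -> Tuple[int, int]:
        hits = buckets[key]
        # prefer buckets spread over many sequences, then larger buckets
        return len({si for si, _ in hits}), len(hits)

    return kept, buckets[max(buckets, key=score)]
```

Segments that agree on the kept positions land in the same bucket even if they differ elsewhere, so a bucket covering many sequences is a plausible motif seed. The `(sequence index, start)` pairs it returns are the kind of positions that could be used to initialize particles, while noisy segments tend to scatter over small buckets.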
Keywords/Search Tags:Homology Search, Protein Compression, Cluster, Motif, Particle Swarm Optimization