Detection Of Genomic Islands And Parallelization Of Isolation With Migration Model

Posted on: 2013-01-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C B Zhou
GTID: 1110330371482691
Subject: Computer Science and Technology
Abstract/Summary:
Streptococcus pyogenes (S. pyogenes) is a pathogenic bacterium that acquires its virulence mainly through horizontal gene transfer (HGT) events. This gram-positive human pathogen can cause severe diseases such as pharyngitis, cellulitis, streptococcal toxic shock-like syndrome and necrotizing fasciitis. S. pyogenes is responsible for at least 517,000 deaths each year due to these diseases. Treatment of the toxic and lethal invasive diseases with antibiotics is not always effective, and the mortality usually exceeds 50%. A large number of the virulence factors of S. pyogenes have been acquired through horizontally transferred bacteriophages.

A genomic island (GI) is a genomic region that is acquired from another organism through HGT. A GI can code for many functions related to symbiosis, pathogenesis and the organism's adaptation. With the invention of high-throughput sequencing, the production of bacterial genomes has sped up significantly. As of September 2, 2011, 1,606 complete and 5,140 partial bacterial genomes had been announced on the NCBI Microbial Genome Project web site. In silico characterization of GIs in pathogenic bacteria is increasingly needed because experimental techniques are time-consuming and costly. A GI can be detected computationally through DNA composition or comparative genomics techniques. Genomes of different origins are known to have different DNA compositions, and this observation has been widely used to detect recently horizontally transferred GIs in bacterial genomes. The comparative genomics technique detects recent insertions/deletions (indels) in a group of closely related bacterial genomes, and large indels supported by multiple genomes are considered to be GIs.

We proposed a k-mer frequency method to detect GIs in 13 completely sequenced strains of S. pyogenes. We first detected abnormal genome fragments based on k-mer frequency, and then narrowed our focus to GIs by considering the functional annotations of the genes within the abnormal genome fragments. Our experimental results showed that the proposed method can detect GIs effectively.

By comparing the DNA composition method and the comparative genomics method for GI detection, we identified the advantages and disadvantages of each. We proposed a hybrid method that combines the DNA composition method, the comparative genomics method and the feature restriction method. The comparative genomics method is used to detect unconserved genomic regions. Given a query genome, we first manually select genomes of the same species as reference genomes. Second, we perform multiple genome alignment with Mauve on all genomes, including the query genome and the reference genomes. Third, we extract the unconserved genomic regions based on the multiple genome alignment. The composition difference method is used to detect abnormal genomic regions; it is the same as the k-mer frequency method for genomic island detection described above. The feature restriction method restricts the GIs to those containing feature genes that are related to GIs. Given a query genome, we obtained its protein table file as genome annotation information. We then took the genes related to phage, integrase and so on as feature genes. The genomic regions that contain feature genes are considered GIs. We extracted the genomic regions that are both unconserved regions according to the multiple genome alignment and abnormal regions according to k-mer frequency, and called them candidate GIs. We then extracted the candidate GIs that contain feature genes as GIs. Our experimental results showed that the proposed method can detect GIs effectively.

The first purely computational steps of the human genome project required vast amounts of computing equipment, with large processor farms being used in both the private and public assemblies of the human genome. Bioinformatics and computational biology need high performance computing (HPC). We aim to increase the performance of methods in bioinformatics and computational biology by means of HPC.

The analysis of population divergence is a major focus in population genetics and molecular ecology. Most models of population divergence rest on one of two extreme assumptions: either migration occurs at a constant rate for an infinitely long time, or the populations descend from a common ancestral population and diverged without gene flow. Because of these assumptions, most models differ from the real world.

The aim of the Isolation with Migration (IM) model is to jointly estimate divergence times and migration rates between two populations from DNA sequence data. IM has six parameters: the population sizes of the ancestral population and the two descendant populations, two different migration rates for the two descendant populations, and the time of population splitting. With these six parameters, IM can capture many phenomena that can occur when one population splits into two. IM has been used successfully in population genetics.

Parameter inference for IM is based on the Markov chain Monte Carlo (MCMC) method. MCMC is a method for sampling from probability distributions by constructing a Markov chain that has the desired distribution as its equilibrium distribution. The state of the chain is used as a sample of the desired distribution.
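
As a minimal sketch of the sampling idea just described, the following random-walk Metropolis sampler draws from a standard normal target. The target density, step size, burn-in length and all function names are illustrative assumptions and are not taken from IM, whose state space and likelihood are far more complex.

```python
import math
import random


def metropolis(log_target, x0, n_steps, step_size, rng):
    """Random-walk Metropolis sampler for a one-dimensional target.

    log_target returns the log of the (unnormalized) target density.
    """
    x = x0
    log_p = log_target(x)
    samples = []
    for _ in range(n_steps):
        # Propose a symmetric Gaussian move around the current state.
        proposal = x + rng.gauss(0.0, step_size)
        log_p_new = log_target(proposal)
        # Accept with probability min(1, p(proposal) / p(x)).
        accept_prob = math.exp(min(0.0, log_p_new - log_p))
        if rng.random() < accept_prob:
            x, log_p = proposal, log_p_new
        # The current state of the chain is recorded as one sample.
        samples.append(x)
    return samples


def log_std_normal(x):
    """Log density of a standard normal, up to an additive constant."""
    return -0.5 * x * x


rng = random.Random(0)  # fixed seed so the run is reproducible
samples = metropolis(log_std_normal, x0=5.0, n_steps=20000,
                     step_size=1.0, rng=rng)
kept = samples[2000:]  # discard early states as burn-in (assumed length)
mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
```

The chain starts far from the mode on purpose: after the burn-in is discarded, the retained states should have mean and variance close to those of the target, which is the sense in which the chain's states serve as samples.
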
MCMC produces the correct distribution as the length of the Markov chain increases. MCMC has been used successfully to obtain the posterior probability distribution of phylogenetic trees.

The use of multiple chains incurs a significant performance cost in Metropolis-coupled MCMC. Specifically, only the posterior probability distribution of the cold chain is the desired distribution. In Metropolis-coupled MCMC, each chain requires the same amount of computation per iteration and interacts with the others only when a swap occurs, so it is ideally suited to implementation on parallel machines. The Message Passing Interface (MPI) is used for parallel processing in this work. MPI is an Application Programming Interface (API) specification for process-based parallel programs. A process has its own local memory and cannot access the memory of other processes. Processes can communicate with one another only by sending and receiving messages. The critical costs in MPI are message passing and process synchronization.

We proposed a chain-parallel scheme for the Metropolis-coupled MCMC of IM, which improves the performance of IM through parallel processing. The MCMC chains are assigned to several processors, and chains are swapped through communication between processors. Our proposed method achieved a nearly linear speedup over the sequential version of IM and opens new opportunities for applying IM in population genetics.

MCMC takes a long time to reach its equilibrium distribution, and the memory required for parameter inference may be beyond the capacity of a single computer. We therefore proposed a data-parallel scheme for the MCMC of IM based on MPI. The data are partitioned and assigned to several processors for local likelihood calculations. The global likelihood is the sum of the local likelihoods, obtained through communication between processors. Our proposed method achieved a nearly linear speedup over the sequential version of IM.
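
The data-parallel idea can be sketched as follows. This is a single-process illustration only: the per-locus Poisson model, the contiguous block partitioning and every function name are assumptions made for the example, and the values being summed are assumed to be log-likelihoods of independent data items. In an MPI program each block would reside on its own process and the final sum would be carried out by a reduction such as MPI_Allreduce.

```python
import math


def locus_log_likelihood(count, theta):
    """Hypothetical per-locus log-likelihood (Poisson model, illustration only)."""
    return count * math.log(theta) - theta - math.lgamma(count + 1)


def partition(data, n_ranks):
    """Split data into n_ranks contiguous blocks of nearly equal size."""
    base, extra = divmod(len(data), n_ranks)
    blocks, start = [], 0
    for rank in range(n_ranks):
        size = base + (1 if rank < extra else 0)
        blocks.append(data[start:start + size])
        start += size
    return blocks


def local_log_likelihood(block, theta):
    """Log-likelihood of the data items held by one process."""
    return sum(locus_log_likelihood(x, theta) for x in block)


def global_log_likelihood(data, theta, n_ranks):
    """Sum the local log-likelihoods of all blocks.

    The list comprehension stands in for work done concurrently on
    separate processes; the outer sum stands in for the reduction step.
    """
    local_values = [local_log_likelihood(block, theta)
                    for block in partition(data, n_ranks)]
    return sum(local_values)


data = [3, 0, 2, 5, 1, 4, 2, 2, 3, 1]  # made-up counts, one per locus
theta = 2.5                            # made-up parameter value
sequential = local_log_likelihood(data, theta)
parallel = global_log_likelihood(data, theta, n_ranks=4)
```

Because each process only needs its own block, both the computation and the memory footprint per process shrink as processes are added, while the only communication per likelihood evaluation is the single reduction.
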
Keywords/Search Tags: Horizontal Gene Transfer, Genomic Islands, k-mer Frequency, Isolation with Migration Model, Markov Chain Monte Carlo