The Study On Sequence Gaps In Multiple Sequence Alignment And Phylogenetic Analyses

Posted on: 2006-06-28
Degree: Master
Type: Thesis
Country: China
Candidate: W F Shi
Full Text: PDF
GTID: 2120360152992006
Subject: Entomology
Abstract/Summary:
In the past few years, more and more molecular data have been used in entomological phylogenetic analysis. Multiple sequence alignment is the basis of phylogenetic analysis, and molecular sequences are aligned by inserting gaps. The affine gap penalty scheme is widely used to constrain the number of gaps in order to obtain a meaningful alignment. However, the gap penalty cost is designated arbitrarily in this method. Previous research shows that different gap penalties may result in different alignments. Here we give statistical evidence of the effects of gaps on multiple sequence alignment and illustrate some other statistical features of gaps. We downloaded thirty-eight data matrices from GenBank and divided them into four matrix types: rDNA-based matrices, exon-DNA matrices, exon-AA matrices and ITS-based matrices. We then aligned every data matrix with the program ClustalX 1.81, setting different gap penalties each time. Statistical results show that gap opening and extension penalties significantly change the percentage of gaps. Post hoc tests indicate that q = 4, r = 1 can represent cases allowing more gaps and q = 15, r = 8 can represent cases inserting fewer gaps. Different matrix types have significantly different percentages of gaps. Curves of the overall gap percentage against gap penalty fall empirically into three types: the undee curve, the horizontal line and the step-down curve. Different matrix types tend toward different kinds of curves. Exon-AA matrices are more conservative than the corresponding DNA matrices.
Although gap penalties affect the results of multiple sequence alignment, it is not clear whether results differ when different gap coding methods and tree search methods are used. Alignments obtained with q = 4, r = 1 and with q = 15, r = 8 were selected for phylogenetic reconstruction. The reconstruction methods were maximum parsimony (MP) and Bayesian inference.
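The affine scheme and the gap percentage measured above can be sketched in a few lines of Python. This is a minimal illustration, not the ClustalX implementation: it assumes that q denotes the gap opening penalty and r the gap extension penalty, and that a gap of k columns costs q + r*k (some programs charge q + r*(k - 1) instead). The function names and the toy alignment are hypothetical.

```python
def affine_gap_cost(length, q, r):
    """Cost of one gap spanning `length` columns under an affine scheme.

    Assumed convention: pay the opening penalty q once, plus the
    extension penalty r for every gapped column.
    """
    if length <= 0:
        return 0
    return q + r * length


def gap_percentage(alignment):
    """Percentage of all cells in an alignment that are gap symbols ('-')."""
    total_cells = sum(len(row) for row in alignment)
    gap_cells = sum(row.count("-") for row in alignment)
    return 100.0 * gap_cells / total_cells if total_cells else 0.0


# Toy aligned rows (hypothetical data, not from the thesis).
toy_alignment = ["AC-T", "A--T"]
permissive = affine_gap_cost(3, q=4, r=1)    # the setting that allows more gaps
strict = affine_gap_cost(3, q=15, r=8)       # the setting that inserts fewer gaps
share = gap_percentage(toy_alignment)
```

Under this convention the same three-column gap is much more expensive with q = 15, r = 8 than with q = 4, r = 1, which is why the stricter setting yields alignments with a lower gap percentage.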
In the MP analyses, gaps were coded in three ways: as missing data, as a fifth character, and by the simple indel coding method. The MP results show that tree length and the number of informative sites do not change in the same way when gap penalties change. A decrease in gaps leads to a decrease in CI and an increase in HI. Thus, it is often necessary to use a relatively small gap penalty that allows more gaps to be inserted. Gap penalties have little effect on bootstrap support values; however, in most cases they change the topology and resolution of the strict consensus trees. Coding gaps as a fifth character increases the number of informative sites and the tree length compared with coding gaps as missing data; nevertheless, the number of MP trees does not decrease. Furthermore, these two coding methods do not significantly change CI, RI, RC or bootstrap support values. By contrast, the topology and resolution of the strict consensus trees differ in most cases. CI, RI and RC differ significantly among matrix types. Simple indel coding is more efficient than coding gaps as missing data. Compared with coding gaps as a fifth character, simple indel coding does not change the results significantly, except for the topology and resolution of the strict consensus trees and the bootstrap support values. The effect of simple indel coding on the phylogenetic results differs with the percentage of gaps. Bayesian analysis reveals that Bayesian posterior probabilities (BPPs) do not differ significantly between the 50% consensus trees obtained with q = 4, r = 1 and with q = 15, r = 8. As with MP, only a few data matrices yield 50% consensus trees with the same topology and resolution. In sum, we do not consider phylogenetic relationships inferred from limited molecular sequence data to be credible. Rather, we think such discrepancies in topology and resolution were most probably produced during alignment.
Thus, bringing more biological information to bear on obtaining a good alignment, rather than the choice of reconstruction method, may be more important to a correct phylogen...
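The gap coding methods compared in the MP analyses can be illustrated with a short Python sketch. It is a simplified reading of simple indel coding, not the procedure used in the thesis: each gap with distinct start and end columns becomes one presence/absence character, and a sequence whose longer gap spans the coded gap is scored as inapplicable ("?"). Leading and trailing gaps get no special treatment here, and all function names and the toy alignment are hypothetical.

```python
import re


def gap_runs(row):
    """Return (start, end) column spans of every run of '-' in one aligned row."""
    return [(m.start(), m.end()) for m in re.finditer(r"-+", row)]


def code_gaps_as_missing(row):
    """Missing-data coding: every gap symbol becomes the missing symbol '?'."""
    return row.replace("-", "?")


def simple_indel_coding(alignment):
    """Code each distinct gap span as one presence/absence character.

    '1' = the row has exactly this gap, '0' = it does not,
    '?' = the row has a longer gap that spans this one (inapplicable).
    """
    runs_per_row = [gap_runs(row) for row in alignment]
    indels = sorted({span for runs in runs_per_row for span in runs})
    coded = []
    for runs in runs_per_row:
        states = []
        for start, end in indels:
            if (start, end) in runs:
                states.append("1")
            elif any(s <= start and e >= end for s, e in runs):
                states.append("?")
            else:
                states.append("0")
        coded.append("".join(states))
    return indels, coded


# Toy aligned rows (hypothetical data, not from the thesis).
toy = ["ACGTAC", "A--TAC", "A---AC"]
indels, indel_matrix = simple_indel_coding(toy)
missing_rows = [code_gaps_as_missing(row) for row in toy]
```

Fifth-character coding needs no transformation of the matrix: the gap symbol is left in place and the tree search program is told to treat it as an additional state. In simple indel coding, the presence/absence characters are typically appended to the sequence matrix, with the gapped positions themselves treated as missing data.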
Keywords/Search Tags: Gap, Multiple Sequence Alignment, Phylogenetic Analyses, Indel