
On issues of singularity for confidence regions and hypothesis tests for topologies using generalized least squares

Posted on: 2008-07-10
Degree: M.Sc
Type: Thesis
University: Dalhousie University (Canada)
Candidate: Sheridan, Paul
Full Text: PDF
GTID: 2440390005464032
Subject: Mathematics
Abstract/Summary:
Recently, Susko [31] described a computationally inexpensive way to construct confidence regions (CRs) for topologies using a generalized least squares (GLS) test statistic, with a chi-square distribution, which applies to maximum likelihood (ML) distances. Software implementations for nucleotide and protein data, called glsdna and glsprot respectively, were also provided by Susko [32]. Both the GLS test statistic and the sample average approximations used for the variances and covariances of the ML distances are accurate only asymptotically in the number of sites; in practice, however, usable sequences may be only hundreds of characters long. How GLS performs under these conditions has not been tested.

In this thesis, a simulation study is undertaken to gauge the consequences of these asymptotic limitations. To this end, 4- and 7-taxon trees were used to simulate nucleotide sequence data for each of the lengths 50, 100, 250, 500, 1000, 5000, and 10000. For each tree and each sequence length, on the order of 10000 CRs were generated, and the coverage probability of the true tree, the size of each CR, the estimated ML distances, and the estimated sample average variances and covariances were recorded. The coverage probabilities agreed with what is expected asymptotically for sequence lengths of 1000 and higher. For smaller sample sizes, the coverage probabilities were generally higher than the 0.95 value. It was anticipated that, for small sample sizes, the coverage probabilities would attain the expected 0.95 value if the true covariances were used to compute the GLS test statistic. Surprisingly, the coverage probabilities were drastically underestimated.
The underlying cause can be attributed to a tendency for the ML distances to be overestimated for small sequence lengths, together with what we found to be an exponential increase in variance with distance between taxa.

The second part of this thesis is directed toward fixing a serious limitation of the GLS software: computation of the GLS test statistic requires the estimated covariance matrix of the ML distances to be invertible. If the matrix is singular, the test statistic cannot be computed and the programs crash. In molecular evolution models, the covariance matrix is a function of the substitution model and the underlying tree, but it is not generally known which types of trees and models cause singular covariance matrices. In this thesis, we show that singular covariance matrices arise if and only if some distance is exactly 0 or, equivalently, when a pair of taxa have identical sequences with probability 1. In practice, however, the covariance matrix must be estimated, and the underlying causes of singularity are more complex. A necessary condition for singularity of the estimated covariance matrix is given, as well as two sufficient conditions: (1) the number of distinct nucleotide patterns at a site is less than the number of pairs of taxa, and (2) a special type of linear dependence is constructed in the rows of the estimated covariance matrix.

Finally, two alternatives to the glsdna and glsprot routines are introduced that allow a CR to be constructed even when the covariance matrix is singular. First, the routines glsdna_eig and glsprot_eig, as described in [32], use an eigenvalue cutoff approach. The causes of singularity described in this thesis led to an alternative approach that uses a distance cutoff; in other words, groups of closely related taxa are combined before a CR is computed. This approach is implemented as glsdna_dist and glsprot_dist.
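
To make the singularity problem and the eigenvalue cutoff idea concrete, the following is a minimal NumPy sketch, not the glsdna_eig implementation. It assumes the estimated covariance is a sample covariance of per-site terms that depend only on the site pattern, and that the cutoff simply drops eigenvalues below a relative threshold; the function name `quadratic_form_eig_cutoff`, the threshold `rel_tol`, and all toy numbers are hypothetical. The fitting of edge lengths that produces the residual vector is omitted.

```python
import numpy as np


def quadratic_form_eig_cutoff(resid, cov, rel_tol=1e-8):
    """Return (statistic, n_kept) for resid' * cov^+ * resid.

    cov^+ is a pseudo-inverse built only from eigenvalues of cov that exceed
    rel_tol times the largest eigenvalue (rel_tol is a hypothetical default,
    not a value taken from the thesis or from glsdna_eig).
    """
    resid = np.asarray(resid, dtype=float)
    cov = np.asarray(cov, dtype=float)
    # Symmetrize to guard against round-off, then eigendecompose.
    evals, evecs = np.linalg.eigh((cov + cov.T) / 2.0)
    keep = evals > rel_tol * evals.max()
    # Project the residual onto the retained eigenvectors only.
    proj = evecs[:, keep].T @ resid
    stat = float(np.sum(proj ** 2 / evals[keep]))
    return stat, int(keep.sum())


# Toy illustration of a rank-deficient sample covariance (all numbers made up).
rng = np.random.default_rng(0)
n_pairs = 6      # 4 taxa give 6 pairwise distances
n_patterns = 3   # fewer distinct site patterns than pairs of taxa
pattern_terms = rng.normal(size=(n_patterns, n_pairs))  # one vector per pattern
site_patterns = rng.integers(0, n_patterns, size=200)   # pattern seen at each site
per_site = pattern_terms[site_patterns]                 # shape (200, 6)

cov_hat = np.cov(per_site, rowvar=False)          # 6 x 6 sample covariance
# np.cov centers the data, so the rank is at most n_patterns - 1 here.
rank = np.linalg.matrix_rank(cov_hat, tol=1e-8)

resid = rng.normal(size=n_pairs)                  # stand-in for observed minus fitted distances
stat, n_kept = quadratic_form_eig_cutoff(resid, cov_hat)
print(rank, n_kept, stat)
```

Because the toy covariance has rank below 6, a plain inverse is unavailable, while the cutoff version still returns a statistic. The function also returns the number of retained eigenvalues, since the appropriate reference distribution plausibly depends on it; the abstract does not say how glsdna_eig handles this.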
These approaches were compared via a simulation on two 8-taxon trees using nucleotide sequence data. Briefly, the results show that for small samples the glsdna_dist routine gives better coverage probabilities and far smaller CR sizes than glsdna_eig, while for longer sequence lengths the routines exhibit simil...
Keywords/Search Tags:Using, GLS, ML distances, Singularity, Sequence lengths, Covariance matrix, Thesis, Glsdna