Font Size: a A A

Bounds for the errors in word count distributional approximations

Posted on:2002-02-26Degree:Ph.DType:Dissertation
University:University of Southern CaliforniaCandidate:Huang, HaiyanFull Text:PDF
GTID:1465390011499289Subject:Mathematics
Abstract/Summary:
The study of the occurrences of words in sequences are of interest in many fields, and in biological sequence analysis in particular. Given a sequence S and a collection of d words, O, it is of interest in many applications to obtain information on the multivariate distribution of the vector of counts U = (N(S, w1) ,..., N(S , wd)), where N (S, w) is the number of times a word w ∈ O appears in the sequence S. I obtain explicit bounds on the error made when approximating the distribution of U by the multivariate normal, when the underlying sequence is i.i.d or first-order stationary Markov over a finite alphabet. One application of our results involves the distribution of joint occurrences of restriction enzyme sites of a DNA sequence. I also prove that in order for U to have a non-degenerate covariance matrix, it is necessary and sufficient that the counted word set O is not full, that is, that O is not the collection of all possible words of some length k over the given finite alphabet. To supply the bounds on the error, I use a version of Stein's method.
Keywords/Search Tags:Bounds, Word, Sequence, Distribution
Related items