| A heuristic probabilistic model for the evolution of string composition of biological sequences is proposed, which relates the relaxation of correlation of string compositions to sequence divergence. It explains the effectiveness of phylogenetic methods based on string compositions, and is used to calibrate the results of CVTree, and estimate the working range of the parameter, the length of strings K. It suggests that the CVTree approach can be generalized to a large family, and thereby larger Ks can be used. Two sets of independent methods for distance estimation, solely based upon the presence or absence of K-strings, are developed, which yield phylogenetic trees consisting surpris-ingly well with current taxonomy.The justification of K-string methods inspired a problem of unique recon-struction of a sequence from its constituent K-strings, which is equivalent to the uniqueness of Eulerian paths in a directed graph. Uniquely reconstructible sequences form a factorial regular language. It is thoroughly characterized by its forbidden words, and a deterministic finite automaton (DFA) accepting it is built up. It provides an efficient on-line algorithm for testing the unique reconstructibility of the sequences, which has been applied to investigate the real protein database. |