| Many problems require the characterization, clustering, or classification of sequences of characters drawn from a fixed alphabet; examples include the classification and information-entropy estimation of biological sequences such as DNA. This work describes a method for learning to classify sequences, primarily biological ones, that exploits the string-like nature of these problems. A model is constructed for each class of sequence, and the model's ability to predict each character of a test sequence serves as a measure of the similarity between that sequence and the class of sequences used to build the model. The model predicts each character by combining predictions made by many "experts," each of which bases its prediction on a set of characters from the training set that were preceded by a similar context. Different experts use different similarity criteria and different context sizes. With this method, lower, more accurate entropy estimates of DNA sequences are obtained. These estimates are then shown to lead to successful classification of DNA sequences into their three-dimensional structural groups. |
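The idea of scoring a test sequence by how well a class model predicts it can be illustrated with a minimal sketch. It is not the method of this work: the experts here match contexts exactly (the experts described above also use other similarity criteria), the mixing rule is a plain Bayesian weight update, and every name (`ContextExpert`, `bits_per_char`, `classify`, the toy training data) is hypothetical.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACGT"  # assumption: sequences contain only these characters


class ContextExpert:
    """Hypothetical expert: predicts the next character from counts of
    characters that followed the same `order` preceding characters."""

    def __init__(self, order, training_seqs):
        self.order = order
        self.counts = defaultdict(Counter)
        for seq in training_seqs:
            for i in range(order, len(seq)):
                self.counts[seq[i - order:i]][seq[i]] += 1

    def predict(self, history):
        # Order 0 uses the empty context; unseen contexts give a uniform guess.
        context = history[-self.order:] if self.order else ""
        seen = self.counts.get(context, Counter())
        total = sum(seen.values()) + len(ALPHABET)  # add-one smoothing
        return {c: (seen[c] + 1) / total for c in ALPHABET}


def bits_per_char(experts, seq):
    """Average code length (an entropy estimate, in bits) of a non-empty `seq`
    under a weighted mixture of the experts' predictions."""
    weights = [1.0 / len(experts)] * len(experts)
    total_bits = 0.0
    for i, ch in enumerate(seq):
        preds = [e.predict(seq[:i]) for e in experts]
        p = sum(w * pr[ch] for w, pr in zip(weights, preds))
        total_bits -= math.log2(p)
        # Experts that assigned high probability to `ch` gain weight.
        weights = [w * pr[ch] / p for w, pr in zip(weights, preds)]
    return total_bits / len(seq)


def classify(models, seq):
    """Pick the class whose model predicts `seq` with the fewest bits."""
    return min(models, key=lambda label: bits_per_char(models[label], seq))


# Toy training data, invented purely for illustration.
training = {
    "at_rich": ["ATATATTAATATATAT", "TATATAATTATATATA"],
    "gc_rich": ["GCGCGGCCGCGCGCGC", "CGCGCCGGCGCGCGCG"],
}
models = {
    label: [ContextExpert(k, seqs) for k in (0, 1, 2)]
    for label, seqs in training.items()
}
print(classify(models, "ATATTATA"))
```

A lower `bits_per_char` value means the class model predicts the sequence better, so the same quantity serves both as an entropy estimate and as a similarity score for classification.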