Analytic Number Theory Model For Character Sequence And Its Application In Bioinformatics

Posted on: 2004-05-11
Degree: Master
Type: Thesis
Country: China
Candidate: B G Ma
Full Text: PDF
GTID: 2120360125463107
Subject: Biophysics
Abstract/Summary:
Many tasks in bioinformatics, such as gene recognition and protein secondary structure prediction, can be abstracted as problems of character sequence analysis. A character sequence provides information of only two kinds: composition and permutation. Composition information can be represented by ordinary frequencies, so the key problem is how to capture the permutation information. Based on a review of existing algorithms, this dissertation proposes a new model, the Analytic Number Theory Model (ANTM) for character sequences, from the perspective of number theory. In this model, a character sequence is treated as the representation of a number, so that the analysis of the sequence is transformed into a number-theoretic problem that can be solved with the aid of mathematical analysis. The core concept of ANTM is the Dual Descriptor, so the model is sometimes also called the Dual Descriptor Method. A dual descriptor is composed of two parts: the Composition Weight Factor and the Position Weight Function. The composition weight factor is derived from the concept of "radix" in natural number systems and generalizes it to the real number field. The position weight function is a concept intrinsic to natural number systems, and it is likewise generalized to the real number field. To approximate the position weight function, Fourier transforms, wavelet transforms, and related theories are naturally introduced into character sequence analysis. An iterative method is proposed for training a dual descriptor on a data set. The trained dual descriptor carries the information of the original data set and can be used to recognize character sequences by a method proposed in this dissertation, the D-value threshold discriminant approach.
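The idea of reading a character sequence as a representation of a number can be illustrated with ordinary positional notation. The sketch below is only an illustration under stated assumptions: the digit assignment `DIGITS`, the function names, and the particular weights are hypothetical and are not taken from the dissertation, whose actual dual descriptor is trained from data. It shows that a base-4 numeral is a special case of a sum of per-character weights multiplied by per-position weights, which is the general shape the abstract describes.

```python
# Hypothetical illustration: the digit mapping and the weights used here are
# assumptions for demonstration, not the dissertation's trained descriptor.

DIGITS = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed digit assignment for DNA


def sequence_to_number(seq, radix=4):
    """Read seq as a base-`radix` numeral, most significant character first."""
    value = 0
    for ch in seq:
        value = value * radix + DIGITS[ch]
    return value


def weighted_sum(seq, composition_weight, position_weight):
    """Sum composition_weight[ch] * position_weight(k) over positions k.

    composition_weight maps each character to a real number, and
    position_weight maps a 0-based position to a real number.
    """
    return sum(composition_weight[ch] * position_weight(k)
               for k, ch in enumerate(seq))


# Ordinary base-4 notation is recovered with integer digits and powers of 4.
seq = "ACGT"
as_numeral = sequence_to_number(seq)
as_weighted = weighted_sum(seq, DIGITS, lambda k: 4 ** (len(seq) - 1 - k))
print(as_numeral, as_weighted)
```

Because the value depends on where each character sits, two sequences with the same composition, such as "AC" and "CA", map to different numbers. Replacing the integer digits and powers of the radix with real-valued weights gives a more flexible family of descriptors.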
At the same time, introducing the position weight function makes it possible to count with position weights; the result is Frequencies with Position Weight, also called Weighted Frequencies for short. Their advantage over ordinary frequencies is that they carry both composition and permutation information. Weighted frequencies can therefore serve as characteristic variables of character sequences, which allows the dual descriptor to be combined with other approaches, such as the Fisher discriminant algorithm, for the recognition of character sequences.

The application of the dual descriptor in bioinformatics is demonstrated with the example of DNA sequence analysis. The content includes sequence feature extraction, a demonstration of the training process of the dual descriptor, and the application of the D-value threshold discriminant approach and the weighted-frequencies Fisher discriminant algorithm to the recognition of protein-coding regions in both prokaryotic and eukaryotic species.
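Counting with position weight, as described above, can be sketched in a few lines. This is a minimal reading of the idea, assuming that each occurrence of a character contributes the weight of its position instead of 1. The function name `weighted_frequencies` and the example weight `k + 1` are hypothetical choices for illustration; the abstract does not give the dissertation's exact definition or normalization.

```python
from collections import defaultdict


def weighted_frequencies(seq, position_weight):
    """Accumulate position_weight(k) for the character found at position k.

    With position_weight(k) == 1 for all k this reduces to ordinary counts.
    """
    totals = defaultdict(float)
    for k, ch in enumerate(seq):
        totals[ch] += position_weight(k)
    return dict(totals)


# Two sequences with identical composition but different permutation.
first, second = "AACC", "CCAA"

# Ordinary counts cannot tell them apart.
plain_first = weighted_frequencies(first, lambda k: 1.0)
plain_second = weighted_frequencies(second, lambda k: 1.0)

# A non-constant position weight (assumed here: k + 1) separates them.
ramp_first = weighted_frequencies(first, lambda k: k + 1)
ramp_second = weighted_frequencies(second, lambda k: k + 1)

print(plain_first == plain_second, ramp_first == ramp_second)
```

The resulting per-character totals form a fixed-length feature vector, which is the kind of characteristic variable that could be passed to a separate classifier such as a Fisher discriminant.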
Keywords/Search Tags:Analytic Number Theory Model for character sequence, Dual Descriptor, Weighted Frequencies, D-value threshold discriminant approach