
Approaches To Feature Information Extraction For Biological Sequences And Their Applications

Posted on: 2014-01-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H J Yu
Full Text: PDF
GTID: 1220330398972849
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
With the arrival of the post-genome era, the research focus of biology has turned to how to analyze and explain the rapidly accumulating mass of data. Bioinformatics, also called computational molecular biology, has therefore emerged as a new interdisciplinary field with rich research content. Within it, sequence similarity analysis is the most important topic, and it involves several core problems, such as biological sequence representation and feature information extraction from sequences. Starting from the problem of feature information extraction from biological sequences, this dissertation carries out a series of studies on algorithm design and applications, in which six feature information extraction algorithms are proposed. Comparisons with related representative works demonstrate the efficiency of the proposed algorithms. The main work of this dissertation can be summarized as follows:
1) Graphical representations provide a tool for visual inspection of sequences. To visualize and compare different DNA sequences, a novel alignment-free method is proposed for both graphical representation and similarity analysis of sequences. We introduce a transformation that represents each DNA sequence by a neighboring nucleotide matrix (NNM). Then, based on approximate joint diagonalization (AJD) theory, we transform each DNA primary sequence into a corresponding eigenvalue vector (EVV), which can be regarded as a numerical characterization of the DNA sequence. Plotting the EVV in the 2D plane gives a graphical representation of the DNA sequence. Moreover, using k-means, we cluster these feature curves into several reasonable subclasses. In addition, similarity analyses are performed by computing the distances among the obtained vectors.
This approach contains more sequence information, and it analyzes all the involved sequence information jointly rather than separately. A typical dendrogram constructed by this method demonstrates the effectiveness of our approach.
2) To compare different genome sequences, an alignment-free method is proposed. Since the essential property of a sequence is its sequentiality, we define a compound transformation that maps a genome sequence to a sparse 16 by (L-1) matrix M based on the 16 kinds of 2-mers (dinucleotides). Furthermore, we find that this transformation is an order-preserving transformation (OPT). Based on the theory of matrix analysis, we derive a 16-dimensional vector that characterizes a genome sequence via singular value decomposition (SVD) of M. Finally, we analyze the similarities among multiple sequences from 20 eutherian species. The experimental results show that the proposed approach performs well in sequence analysis.
3) Genome sequences are too high-dimensional to be numerically characterized directly within a lower-dimensional space, so this study also proposes an alignment-free comparison model for genome sequences to solve this problem. We transform a genome sequence into a 16 by (L-1) sparse matrix M. Applying singular value decomposition (SVD) to M, we obtain a 16-D feature vector F for each genome sequence. Through principal component analysis (PCA) of all the feature vectors, the first several principal components (PCs) are derived for comparison. It is proved that: a) the transformation is distance preserving; b) the elements of the 16-D vector are related only to the number of neighboring nucleotides. Then, we obtain a dendrogram for each group of mammalian genome sequences. Using the first two PCs, we construct a 2D genome map, which illustrates the relationships among all species.
The results show that the topology agrees with the established mammalian phylogeny, which reveals that both mitochondrial and whole genome sequences can efficiently distinguish different species. The proposed approach captures the sequential property of genome sequences. Furthermore, it also performs well on the larger-scale data set (e.g., the second one).
4) Based on all kinds of adjacent amino acids (AAA), we map each protein primary sequence into a 400 by (L-1) matrix M. We derive a normalized 400-tuple mathematical descriptor D, which is extracted from the primary protein sequence via singular value decomposition of the matrix. The resulting 400-D normalized feature vectors (NFV) further facilitate quantitative analysis of protein sequences. Using this normalized representation, we analyze similarity on two datasets: a) ND5 sequences from nine species; b) transferrin sequences of 24 vertebrates. We also compare the results of this study with those of other related works. These two experiments illustrate that our approach (NFV-AAA) performs well in sequence similarity analysis.
5) Traditional multiple sequence alignment (MSA) is not appropriate for comparing genome sequences because of its computational load. In this study, as an improvement on K-mer, a novel alignment-free method is introduced, in which each primary sequence is divided into several segments and each segment is transformed into its K-mer representation. In this approach, it is critical to determine the optimal combination of distance metric, value of K, and number of segments, i.e., (d*, s*, K*). Based on the cascaded feature vectors obtained from the s* segments of each sequence, we obtain a dendrogram for the mammalian genome sequences via the proposed approach, i.e., segmented K-mer (s-K-mer).
The results demonstrate that the s-K-mer approach outperforms the traditional K-mer method in similarity analysis among different species.
6) To compare multiple genome sequences, both local and global similarities should be considered. We divide each primary genome sequence into several segments, which are simultaneously transformed into corresponding k-mer-based vectors. This operation can be regarded as mixing multiple source genomic signals via a 'virtual mixer' (VM), through which we obtain mixed vectors of equal length from genome sequences of different lengths. Subsequently, using an ICA-based transformation, we project all the vectors onto their independent components to capture the projection feature vectors via a 'projection extractor' (PE), which has been proved to be distance preserving. Furthermore, a second-layer VM-PE model has been developed to improve the performance of similarity analysis within a lower-dimensional space, whose dimension is greatly reduced by this two-layer hierarchical VM-PE model (HVMPE). We then apply the proposed HVMPE model to two real datasets of mitochondrial genome sequences to test its efficiency. The comparative analysis demonstrates that the proposed HVMPE model outperforms the published representative works.
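The segmented K-mer idea described in items 5) and 6) can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the function names (`kmer_frequencies`, `segmented_kmer_vector`, `euclidean`) are hypothetical, and several details are assumptions made only for illustration, namely the alphabet ACGT, contiguous segments of nearly equal length, per-segment frequency normalization, and Euclidean distance as one candidate metric (the abstract states only that the metric d*, the segment count s*, and K* are chosen jointly).

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumption: plain DNA alphabet; other symbols are skipped


def kmer_frequencies(segment, k):
    """Return the k-mer frequency vector of one segment (lexicographic order)."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    for i in range(len(segment) - k + 1):
        j = index.get(segment[i:i + k])
        if j is not None:  # ignore windows containing symbols such as N
            counts[j] += 1
    total = sum(counts)
    # Assumption: normalize counts to frequencies within each segment.
    return [c / total for c in counts] if total else [0.0] * len(kmers)


def segmented_kmer_vector(sequence, s, k):
    """Split a sequence into s contiguous segments and concatenate
    ("cascade") the k-mer frequency vectors of the segments."""
    sequence = sequence.upper()
    n = len(sequence)
    bounds = [i * n // s for i in range(s + 1)]  # nearly equal segment sizes
    vector = []
    for start, end in zip(bounds, bounds[1:]):
        vector.extend(kmer_frequencies(sequence[start:end], k))
    return vector


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    a = segmented_kmer_vector("ACGTACGTACGGTTAACC", s=2, k=2)
    b = segmented_kmer_vector("ACGTTGCAAGGCTAGCTAGGATC", s=2, k=2)
    print(len(a), len(b))   # both vectors have s * 4**k entries
    print(euclidean(a, b))  # pairwise distance, usable for a dendrogram
```

Under these assumptions every sequence maps to a vector of length s * 4^K regardless of its own length, which is what allows sequences of different lengths to be compared directly. K-mers that straddle a segment boundary are not counted in this sketch.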
Keywords/Search Tags: DNA sequence, protein sequence, graphical representation, numerical characterization, feature extraction, feature matrix, matrix pencil, joint diagonalization, segmented K-mer, similarity analysis, dendrogram, phylogenetic tree