Approaches To Feature Information Extraction For Biological Sequences And Their Applications

Posted on:2014-01-26

Degree:Doctor

Type:Dissertation

Country:China

Candidate:H J Yu

Full Text:PDF

GTID:1220330398972849

Subject:Pattern Recognition and Intelligent Systems

Abstract/Summary:


With the arrival of the post-genome era, the research focus of biology has turned to analyzing and explaining the rapidly accumulating mass of data. Bioinformatics, also called computational molecular biology, has thus emerged as a new interdisciplinary field with rich research content. Within it, sequence similarity analysis is the most important topic, and it involves several core problems, such as biological sequence representation and feature information extraction from sequences. Starting from the problem of feature information extraction from biological sequences, this dissertation carries out a series of studies on algorithm design and application, in which six feature information extraction algorithms are proposed. Comparisons with related representative works demonstrate the efficiency of the proposed algorithms.

The main work of this dissertation can be summarized as follows:

1) Graphical representations provide a tool for visual inspection of sequences. To visualize and compare different DNA sequences, a novel alignment-free method is proposed for both graphical representation and similarity analysis of sequences. We introduce a transformation that represents each DNA sequence by a neighboring nucleotide matrix (NNM). Then, based on approximate joint diagonalization (AJD) theory, we transform each DNA primary sequence into a corresponding eigenvalue vector (EVV), which can be regarded as a numerical characterization of the DNA sequence. Plotting the EVV in the 2D plane gives a graphical representation of the DNA sequence. Using k-means, we cluster these feature curves into several reasonable subclasses. In addition, similarity analyses are performed by computing the distances among the obtained vectors. This approach retains more sequence information, and it analyzes all the involved sequence information jointly rather than separately. A typical dendrogram constructed by this method demonstrates the effectiveness of the approach.

2) To compare different genome sequences, an alignment-free method is proposed. Since the essential property of a sequence is its sequentiality, we define a compound transformation that maps a genome sequence of length L into a sparse 16 by (L-1) matrix M based on the 16 kinds of 2-mers (dinucleotides). We further show that this transformation is an order-preserving transformation (OPT). Based on the theory of matrix analysis, we derive a 16-dimensional vector that characterizes a genome sequence via the singular value decomposition (SVD) of M. Finally, we analyze the similarities among multiple sequences from 20 eutherian species. The experimental results show that the proposed approach performs well in sequence analysis.

3) Genome sequences are too high-dimensional to be numerically characterized directly within a lower-dimensional space, so this study also proposes an alignment-free comparison model for genome sequences to solve this problem. We transform a genome sequence into a 16 by (L-1) sparse matrix M. Applying singular value decomposition (SVD) to M, we obtain a 16-D feature vector F for each genome sequence. Applying principal component analysis (PCA) to all the feature vectors, we derive the first several principal components (PCs) for comparison. It is proved that: a) the transformation is distance preserving; b) the elements of the 16-D vector depend only on the numbers of neighboring nucleotides. We then obtain a dendrogram for each group of mammalian genome sequences. Using the first two PCs, we construct a 2D genome map that illustrates the relationships among all species. The results show that the topology agrees with the established mammalian phylogeny, which reveals that both mitochondrial and whole genome sequences can efficiently distinguish different species. The proposed approach captures the sequential property of genome sequences, and it also performs well on the larger-scale data set (e.g., the second one).

4) Based on all kinds of adjacent amino acids (AAA), we map each protein primary sequence into a 400 by (L-1) matrix M. From the singular value decomposition of this matrix we derive a normalized 400-tuple of mathematical descriptors D for the primary protein sequence. The resulting 400-D normalized feature vectors (NFV) facilitate quantitative analysis of protein sequences. Using this normalized representation, we analyze similarity on two datasets: a) ND5 sequences from nine species; b) transferrin sequences of 24 vertebrates. We also compare the results of this study with those of other related works. These two experiments illustrate that our approach (NFV-AAA) performs well in sequence similarity analysis.

5) Traditional multiple sequence alignment (MSA) is not appropriate for comparing genome sequences because of its computational load. As an improvement on K-mer, a novel alignment-free method is therefore introduced, in which each primary sequence is divided into several segments and each segment is transformed into its own K-mer vector. In this approach, it is critical to determine the optimal combination of distance metric, value of K, and number of segments, i.e., (d*, s*, K*). Based on the cascaded feature vectors obtained from the s* segments of each sequence, we obtain a dendrogram for the mammalian genome sequences via the proposed approach, segmented K-mer (s-K-mer). The results demonstrate that the s-K-mer approach outperforms the traditional K-mer method on similarity analysis among different species.

6) To compare multiple genome sequences, both local and global similarities should be considered. We divide each primary genome sequence into several segments, which are simultaneously transformed into corresponding k-mer-based vectors. This operation can be regarded as mixing multiple source genomic signals via a 'virtual mixer' (VM), through which we obtain mixed vectors of equal length from genome sequences of different lengths. Subsequently, using an ICA-based transformation, we project all the vectors onto their independent components to obtain the projection feature vectors via a 'projection extractor' (PE), which is proved to be distance preserving. Furthermore, a second-layer VM-PE model is developed to improve similarity analysis within a lower-dimensional space; the resulting two-layer hierarchical VM-PE model (HVMPE) greatly reduces the dimension. We then apply the proposed HVMPE model to two real datasets of mitochondrial genome sequences to test its efficiency. The comparative results demonstrate that the proposed HVMPE model outperforms the published representative works.
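The 16 by (L-1) dinucleotide matrix and its SVD-based 16-D feature vector, described in items 2) and 3), can be illustrated with a minimal Python sketch. It assumes the simplest reading of the transformation: column j of M holds a single 1 in the row of the dinucleotide starting at position j. The dissertation's actual compound transformation may differ, and the names `dinucleotide_matrix` and `svd_feature` are hypothetical, chosen only for this illustration.

```python
import numpy as np
from itertools import product

# The 16 dinucleotides in a fixed (assumed) row order: AA, AC, ..., TT.
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]
ROW = {d: i for i, d in enumerate(DINUCLEOTIDES)}


def dinucleotide_matrix(seq):
    """Return a 16 x (L-1) 0/1 matrix M for a sequence of length L.

    Assumption: column j has a single 1 in the row of the dinucleotide
    seq[j:j+2]; columns with ambiguous symbols (e.g. N) stay all zero.
    """
    seq = seq.upper()
    M = np.zeros((16, len(seq) - 1))
    for j in range(len(seq) - 1):
        row = ROW.get(seq[j:j + 2])
        if row is not None:
            M[row, j] = 1.0
    return M


def svd_feature(seq):
    """Singular values of M in descending order (16 values when L-1 >= 16)."""
    M = dinucleotide_matrix(seq)
    return np.linalg.svd(M, compute_uv=False)


if __name__ == "__main__":
    seq = "ACGTACGTTGCAACGGTAAC"
    M = dinucleotide_matrix(seq)
    counts = M.sum(axis=1)  # occurrences of each dinucleotide
    print(svd_feature(seq))
    # Under this assumption the rows of M are mutually orthogonal, so the
    # singular values equal the square roots of the dinucleotide counts.
    print(np.sort(np.sqrt(counts))[::-1])
```

Under this indicator-matrix assumption the two printed vectors coincide, which is one way to read the statement that the elements of the 16-D vector depend only on the numbers of neighboring nucleotides. For long genomes the counts could then be computed directly, without building M.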
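The segmentation idea behind s-K-mer in item 5), which item 6) also uses as its first step, can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: segments are taken to be contiguous and nearly equal in length, each segment is described by normalized k-mer frequencies, and Euclidean distance stands in for the distance metric d*, whose optimal choice the abstract leaves to experiment. All function names are hypothetical.

```python
import numpy as np
from itertools import product


def kmer_frequencies(seq, k):
    """Normalized frequencies of all 4**k k-mers over the alphabet ACGT."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {m: i for i, m in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    for j in range(len(seq) - k + 1):
        i = index.get(seq[j:j + k])
        if i is not None:  # skip k-mers containing ambiguous symbols
            counts[i] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def segmented_kmer(seq, s, k):
    """Cascade the k-mer frequency vectors of s contiguous segments.

    Assumption: segments are contiguous and nearly equal in length. The
    result has length s * 4**k regardless of the sequence length.
    """
    seq = seq.upper()
    bounds = np.linspace(0, len(seq), s + 1).astype(int)
    parts = [kmer_frequencies(seq[bounds[i]:bounds[i + 1]], k) for i in range(s)]
    return np.concatenate(parts)


def pairwise_distances(vectors):
    """Euclidean distance matrix (an assumed stand-in for the metric d*)."""
    X = np.asarray(vectors)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


if __name__ == "__main__":
    seqs = ["ACGTACGTACGTACGT", "ACGTACGTTTTTTTTT", "AAAAAAAACCCCCCCC"]
    features = [segmented_kmer(x, s=2, k=2) for x in seqs]
    print(pairwise_distances(features))
```

The resulting distance matrix could be passed to a hierarchical clustering routine to draw a dendrogram. With s = 1 the sketch reduces to the ordinary K-mer method, so the two can be compared on the same data.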

Keywords/Search Tags:

DNA sequence, protein sequence, graphical representation, numerical characterization, feature extraction, feature matrix, matrix pencil, joint diagonalization, segmented K-mer, similarity analysis, dendrogram, phylogenetic tree


Related items

1	A Novel Graphical Representation Of DNA Based On Physico-chemical Properties Of Amino Acids And Similarity Analysis
2	The Research Of Graphical Representation Of Protein Sequences And Its Application
3	Numerical Feature Extraction Of Protein Sequences And Its Applications
4	Evolutionary Tree Algorithm Based On Similarity Analysis Of DNA Sequence 4D Study
5	Graphical Representations Of Biological Sequences And Their Applications
6	Research On Similarity Analysis Method For Gene Data
7	Protein Sequence Comparison And DNA-binding Protein Identification With Generalized PseAAC And Graphical Representation
8	Similarity Analysis Of DNA Sequences Based On Graphical Representation
9	Research On DNA, RNA And Protein Sequence Feature Extraction Method And Its Application
10	Research On Feature Recognition Of Sequence Data For Protein Interaction Prediction