Quantitative Sequence-Activity Relationship Study Based on High-Dimensional Descriptor Screening

Posted on: 2014-02-21
Degree: Master
Type: Thesis
Country: China
Candidate: N Han
Full Text: PDF
GTID: 2250330425491027
Subject: Bioinformatics
Abstract/Summary:
Starting from the primary sequence of a biomolecule, quantitative sequence-activity relationship studies quantify the intrinsic link between sequence and activity and express it as a function that can predict the activity of unknown targets and guide structural modification and transformation. Two important problems in quantitative sequence-activity model (QSAM) studies are reasonable feature representation and feature selection.
Reasonable feature representation is an important prerequisite of quantitative sequence-activity relationship studies. The structural information that determines the function of a biological sequence is usually encoded in its primary sequence; higher-order structure is very difficult to determine, whereas primary structure is simple and easy to obtain. This thesis therefore proposes two methods for extracting sequence parameters: direct characterization and Geostatistics Correlation-Multi-Scale Component (GC-MSC) characterization. Direct characterization replaces each position of a sequence, one by one, with 531 physicochemical properties (for polypeptide sequences) or 1123 topological parameters (for base sequences), but it requires all sequences in the dataset to be of equal length. GC-MSC is based on the property parameters of individual amino acids or bases. It combines statistical correlation with multi-scale components to extract context relationships and information in sequences effectively, and it has the advantages of simple calculation, applicability to sequences of different lengths, and high generalization ability.
Feature selection is a critical step of quantitative sequence-activity relationship studies. Irrelevant and redundant features reduce prediction precision and make the model harder to interpret. In theory, selecting an optimal feature subset from m features involves 2^m possible subsets, so an exhaustive search is infeasible when m is large.
Therefore, this thesis introduces the Binary Matrix Shuffling Filter (BMSF) and worst descriptor elimination multi-round (WDEM) methods and proposes a feature selection approach based on the support vector machine (SVM), which can effectively screen out features with clear meanings and has the advantage of simple calculation and filtering. Working from these two aspects, sequence representation and feature selection, our study used support vector machines to find the relationship between sequence and activity. The datasets comprised 152 HLA-A*0201-restricted CTL epitopes, a comprehensive dataset of peptides for 4 MHC II molecules from the IEDB database, and 38 E. coli promoters.
1. Identification of CTL epitopes. We characterized each residue in the restricted CTL epitopes using 531 physicochemical properties. Using BMSF and WDEM, we selected 18 descriptors with clear meanings from the 531×9 descriptors of each 9-residue peptide. We then constructed a support vector regression (SVR) based quantitative sequence-activity model (QSAM) from the 18 selected descriptors. Tests on the HLA-A*0201 data showed that our QSAM is superior to those reported in the literature in fitting accuracy (R^2), leave-one-out cross-validation (Q^2_CV), and external-sample prediction (R^2_ext, RMSE_ext). Finally, we predicted the activities of peptides for all possible combinations of 9 residues. Several peptides were found to have higher affinity than previously reported epitopes. Our study improves the understanding of the relationship between the constituent residues and the affinity of a peptide, which provides a valuable guideline for the design of high-activity peptide vaccines. The predicted high-affinity peptides are potential candidates for further experimental verification.
2. Prediction of MHC-II binding peptides.
We characterized each sequence in the comprehensive dataset, which consists of 4 subsets of HLA II binding peptides, using GC-MSC based on 531 physicochemical properties. We then selected several descriptors from the GC-MSC descriptors of each MHC-II binding peptide using BMSF and WDEM. Finally, we constructed a support vector regression based quantitative sequence-activity model from the selected descriptors. The results of 5-fold cross-validation on all tested MHC-II binding peptide subsets showed that our prediction method is superior to the CTD method, the LA method, and the 5-spectrum method. Hence, our method may have wide application prospects for the prediction of MHC class II binding peptides.
3. Prediction of E. coli promoter strength. E. coli promoter sequences were characterized by direct characterization and by GC-MSC, each based on 1123 topological structure parameters of the bases, which resulted in high-dimensional feature sets. From these, 20 and 27 descriptors, respectively, were selected by applying BMSF and then WDEM. With the 20 and 27 descriptors, the leave-one-out cross-validation accuracy (Q^2_CV) of the QSAMs built with partial least squares regression (PLSR) was 0.806 and 0.843, whereas that of the SVM-based models was 0.838 and 0.882. Our study improves the understanding of the relationship between E. coli promoter sequences and promoter strength, which provides a valuable guideline for further experimental identification of E. coli promoters.
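The direct characterization step described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names `direct_encode` and `descriptor_count` and the two-value property table `TOY_PROPERTIES` are hypothetical, and the table holds placeholder numbers rather than the 531 physicochemical properties used in the study. The sketch only shows how position-by-position substitution produces a descriptor vector whose length is sequence length times number of properties, and why equal-length sequences are required.

```python
from typing import Dict, List, Sequence

# Hypothetical toy table: two placeholder values per residue.
# The thesis uses 531 properties per residue; these numbers are illustrative only.
TOY_PROPERTIES: Dict[str, List[float]] = {
    "A": [1.8, 89.1],
    "L": [3.8, 131.2],
    "G": [-0.4, 75.1],
    "K": [-3.9, 146.2],
}


def direct_encode(
    sequences: Sequence[str], properties: Dict[str, List[float]]
) -> List[List[float]]:
    """Replace each residue, position by position, with its property values."""
    lengths = {len(seq) for seq in sequences}
    if len(lengths) != 1:
        # Rows of different length could not be stacked into one feature matrix.
        raise ValueError("direct characterization needs equal-length sequences")
    encoded = []
    for seq in sequences:
        row: List[float] = []
        for residue in seq:
            row.extend(properties[residue])
        encoded.append(row)
    return encoded


def descriptor_count(sequence_length: int, n_properties: int) -> int:
    """Number of descriptors produced for one sequence."""
    return sequence_length * n_properties


features = direct_encode(["AL", "GK"], TOY_PROPERTIES)
n_descriptors = descriptor_count(9, 531)  # 9-residue peptide, 531 properties
```

For a 9-residue peptide with 531 properties this gives 531 × 9 = 4779 descriptors per peptide, which is the high-dimensional starting point that the feature selection step then reduces to 18.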
Keywords/Search Tags: Quantitative Sequence Activity Model, Geostatistics Correlation, Feature Selection, Binary Matrix Shuffling Filter, CTL Epitopes, E. coli Promoter