
Peptide Quantitative Structure-Activity Relationship Study

Posted on: 2006-07-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Mei
Full Text: PDF
GTID: 1100360155472573
Subject: Biomedical engineering
Abstract/Summary:
Structural description and modeling techniques are the two essential components of quantitative structure-activity relationship (QSAR) studies. Building on in-depth research into both, this work presents detailed QSAR studies of six peptide datasets: 48 bitter-tasting dipeptides, 58 angiotensin-converting enzyme inhibitors, 31 bradykinin-potentiating pentapeptides, 21 oxytocin analogues, 152 HLA-A*0201-restricted CTL epitopes, and 34 antimicrobial peptides. Structural description is a key step in QSAR studies: whether the structural descriptors reflect the structural variation determines whether a study succeeds. Two sets of amino acid descriptors, VSTV and VHSE, were derived using the idea of principal component extraction. VSTV was derived from principal component analysis (PCA) of 25 structural and topological variables of the 20 coded amino acids. The VSTV descriptors are therefore easy to compute, independent of experimental measurement, and readily extended to non-coded amino acids. VHSE was derived by applying PCA separately to three families of properties (18 hydrophobic, 17 steric, and 15 electronic), which together make up 50 physicochemical properties of the 20 coded amino acids. VHSE1 and VHSE2 relate to hydrophobic properties, VHSE3 and VHSE4 to steric properties, and VHSE5 to VHSE8 to electronic properties. As a new set of amino acid descriptors, VHSE has a relatively definite physicochemical meaning, is easy to interpret, and carries more information than z-scales and other amino acid descriptors. When VSTV and VHSE were applied to the QSAR studies of the six peptide datasets mentioned above, the results were equivalent to or better than those obtained with z-scales and other 2-D or 3-D descriptors. Modeling methods and related techniques are also important for the success of QSAR studies.
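The descriptor derivation described above can be illustrated with a minimal sketch. It assumes one property table per family (hydrophobic, steric, electronic) with one row per coded amino acid, runs PCA on each table separately, and keeps 2, 2, and 4 components to give eight descriptors per residue. The property values below are random placeholders, not the properties used in the dissertation, and the names `pca_scores` and `encode_peptide` are hypothetical helpers introduced only for illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 coded amino acids


def pca_scores(X, n_components):
    """Return the leading principal component scores of the autoscaled matrix X."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the scaled matrix: the PC scores are U * S
    U, S, _ = np.linalg.svd(Xs, full_matrices=False)
    return U[:, :n_components] * S[:n_components]


# Placeholder property tables (random numbers, NOT real physicochemical data)
rng = np.random.default_rng(0)
hydrophobic = rng.normal(size=(20, 18))
steric = rng.normal(size=(20, 17))
electronic = rng.normal(size=(20, 15))

# 2 hydrophobic + 2 steric + 4 electronic components = 8 descriptors per residue
descriptors = np.hstack([
    pca_scores(hydrophobic, 2),
    pca_scores(steric, 2),
    pca_scores(electronic, 4),
])

table = dict(zip(AMINO_ACIDS, descriptors))


def encode_peptide(sequence):
    """Concatenate the per-residue descriptors of a peptide into one feature vector."""
    return np.concatenate([table[aa] for aa in sequence])


x = encode_peptide("AC")  # a dipeptide becomes a 16-element vector
```

Under this encoding every peptide of a given length maps to a fixed-length vector, which is what allows a single regression model to be fitted over a dataset of same-length peptides.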
Multiple linear regression (MLR), principal component regression (PCR), partial least squares (PLS), back-propagation artificial neural networks (BP-ANN), and support vector machines (SVM) were studied systematically in this work. Techniques for variable screening and model validation were also discussed. The results showed that MLR, as a classic modeling method, performed as well as the other methods when its application conditions were met. When the ratio of samples to variables was less than 3, or when multicollinearity among the variables was encountered, PCR and PLS were better alternatives to MLR. In most situations, PLS performed better than PCR. When the structural descriptors had a nonlinear relationship with the response variable, BP-ANN was the better choice. In BP-ANN modeling, a validation dataset was used to control over-fitting, find the optimal network topology, and train the network weights. Proper use of the validation dataset effectively enhanced the predictive power of BP-ANN. SVM, a newer modeling method, is based on the structural risk minimization principle, which incorporates capacity control to prevent over-fitting. SVM therefore generalizes better than PLS and ANN and is especially suitable for QSAR modeling on small datasets. In this work, SVM achieved good performance in QSAR modeling. However, many issues, such as the selection of kernel functions and their parameters, remain to be studied in detail. In a QSAR dataset, not all variables are relevant to biological activity. Redundant variables should therefore be removed from the model to improve its predictive capability, especially when the number of variables is very large. In this work, stepwise multiple regression (SMR) and genetic algorithm-based PLS (GA-PLS) were used to find an optimal variable subset. When the number of variables was less than 50, SMR, the classic variable selection method, was recommended.
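The remark above about preferring PCR when variables are collinear and samples are scarce can be sketched as follows. This is a plain NumPy illustration on synthetic data, not the dissertation's implementation: `pcr_fit` is a hypothetical helper, and the data are six nearly collinear variables generated from two latent factors, with 12 samples so that the sample-to-variable ratio is 2.

```python
import numpy as np


def pcr_fit(X, y, n_components):
    """Principal component regression: regress y on the leading PC scores of X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T              # loadings, shape (n_variables, k)
    T = Xc @ V                           # scores, shape (n_samples, k)
    gamma, *_ = np.linalg.lstsq(T, y - y_mean, rcond=None)
    beta = V @ gamma                     # coefficients in the original variable space
    intercept = y_mean - x_mean @ beta
    return beta, intercept


# Synthetic, strongly collinear data (an assumption made for illustration only)
rng = np.random.default_rng(0)
latent = rng.normal(size=(12, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.01 * rng.normal(size=(12, 6))
y = latent @ np.array([1.0, -1.0]) + 0.05 * rng.normal(size=12)

beta, intercept = pcr_fit(X, y, n_components=2)
y_fit = X @ beta + intercept
r2 = 1.0 - np.sum((y - y_fit) ** 2) / np.sum((y - y.mean()) ** 2)
```

Regressing on a few orthogonal score vectors instead of the raw variables avoids inverting a nearly singular matrix, which is the practical difficulty MLR runs into on data like these. PLS differs in that it chooses its components using the response as well as the descriptors.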
When the number of variables was more than 50, GA-PLS was the alternative choice. However, over-fitting should be avoided through proper validation, especially in GA-PLS modeling. Model validation is an indispensable step in QSAR modeling. In this work, all samples were first divided into a training dataset and a prediction dataset using the D-optimal design technique. The training dataset was used to build the QSAR models and to perform internal validation, including leave-one-out (LOO), leave-1/n-out (LNO), and leave-many-out (LMO) cross-validation and the Y-random permutation test. Following internal validation, external validation was performed using the prediction dataset. Several evaluation functions were used to assess the predictive power of the resulting QSAR models.
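Two of the internal validation steps named above, leave-one-out cross-validation and the Y-random permutation test, can be sketched as follows. The sketch assumes an ordinary least-squares model and the common definition of the cross-validated coefficient as 1 - PRESS/SS; the abstract does not state which evaluation functions were used. The data are synthetic and the helper names are hypothetical.

```python
import numpy as np


def fit_predict(X_train, y_train, X_test):
    """Ordinary least squares with an intercept term, then predict X_test."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef


def loo_q2(X, y):
    """Leave-one-out cross-validated q^2 = 1 - PRESS / total sum of squares."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i            # all samples except sample i
        pred = fit_predict(X[keep], y[keep], X[i:i + 1])[0]
        press += (y[i] - pred) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)


def y_randomization(X, y, n_permutations=50, seed=0):
    """q^2 values obtained after randomly permuting the response vector."""
    rng = np.random.default_rng(seed)
    return np.array([loo_q2(X, rng.permutation(y)) for _ in range(n_permutations)])


# Synthetic data with a genuine linear relationship (illustration only)
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=30)

q2 = loo_q2(X, y)
q2_permuted = y_randomization(X, y)
```

A model that captures a real structure-activity relationship should give a q² well above every permuted value. If the permuted responses score nearly as well, the apparent fit is more likely the result of chance correlation.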
Keywords/Search Tags: peptide, quantitative structure-activity relationship, VSTV, VHSE, multiple linear regression, principal component regression, partial least squares, genetic algorithms, back-propagation artificial neural network, support vector machine