The Application Of Minimal Redundancy Maximal Relevance Feature Selection Method In QSAR Based On Distance Correlation

Posted on: 2017-03-16
Degree: Master
Type: Thesis
Country: China
Candidate: X L Deng
Full Text: PDF
GTID: 2311330512469708
Subject: Bioinformatics
Abstract/Summary:
Evaluating the toxicity of environmental pollutants plays an important role in environmental protection. Traditional experiments are time-consuming and labor-intensive, and they risk releasing pollutants. Working from compound structure, quantitative structure-activity relationship (QSAR) modeling can predict the toxicity of compounds effectively and has been widely used in multiple domains because it is low-cost and simple to operate. QSAR consists of three parts: feature extraction, feature selection, and construction of the prediction model. Features (molecular descriptors) are generally obtained with quantum chemistry software. Support vector regression (SVR) is commonly used as the prediction model because it minimizes structural risk. This thesis applies two new feature selection methods to QSAR modeling for predicting the toxicity of pollutants. The work is reported as follows.

QSAR study of the toxicity of alcohols and phenols: The relationship between the toxicities and the features of compounds is generally non-linear. The compound features calculated by quantum chemistry methods include numerous irrelevant and redundant ones. Minimal redundancy maximal relevance (mRMR) is a well-known and widely applied feature selection method, but the current version is not applicable to a continuous dependent variable, and its measures of relevance and redundancy are not comparable. In QSAR, both the dependent variables (toxicities) and the independent variables (molecular descriptors) are usually continuous. Therefore, we use distance correlation (dCor) in place of the Pearson correlation coefficient (R) to make the measures of relevance and redundancy comparable, and we develop a new feature selection method, mRMR-dCor, by combining mRMR with dCor.
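
To make the idea of ranking descriptors by distance correlation more concrete, the following is a minimal sketch in plain Python, not the thesis implementation. It computes the standard sample distance correlation between two continuous variables and then ranks features greedily by relevance to the target minus mean redundancy with the features already chosen. The function names, the dictionary layout of `features`, the difference form of the criterion, and the tie-breaking by name are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def _centered_distances(v):
    """Double-centered matrix of pairwise absolute differences of a 1-D sample."""
    n = len(v)
    d = [[abs(v[i] - v[j]) for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in d]
    grand_mean = sum(row_mean) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [[d[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
            for i in range(n)]


def distance_correlation(x, y):
    """Sample distance correlation (dCor) of two equal-length 1-D samples."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    dcov2 = sum(a[i][j] * b[i][j] for i, j in pairs) / n ** 2
    dvar_x = sum(a[i][j] ** 2 for i, j in pairs) / n ** 2
    dvar_y = sum(b[i][j] ** 2 for i, j in pairs) / n ** 2
    denom = math.sqrt(dvar_x * dvar_y)
    if denom == 0.0:  # a constant variable carries no dependence information
        return 0.0
    return math.sqrt(max(dcov2, 0.0) / denom)


def mrmr_dcor_rank(features, target):
    """Greedy mRMR-style ranking with dCor for both relevance and redundancy.

    `features` maps a descriptor name to its list of values (hypothetical layout).
    Assumption: score = dCor(feature, target) - mean dCor(feature, selected).
    """
    relevance = {name: distance_correlation(col, target)
                 for name, col in features.items()}
    ranked = []

    def score(name):
        if not ranked:
            return relevance[name]
        redundancy = sum(distance_correlation(features[name], features[s])
                         for s in ranked) / len(ranked)
        return relevance[name] - redundancy

    remaining = set(features)
    while remaining:
        # Sorting makes ties deterministic (first name in alphabetical order wins).
        best = max(sorted(remaining), key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked


if __name__ == "__main__":
    lin = [-2.0, -1.0, 0.0, 1.0, 2.0]
    quad = [v * v for v in lin]
    toxicity = [l + q for l, q in zip(lin, quad)]  # made-up illustrative target
    descriptors = {"lin": lin, "lin_copy": list(lin), "quad": quad}
    print(mrmr_dcor_rank(descriptors, toxicity))
```

In this toy example the exact duplicate `lin_copy` is ranked last, because its redundancy with `lin` cancels its relevance. Note also that dCor between `lin` and `quad` is clearly above zero even though their Pearson correlation is zero, which is the kind of non-linear dependence the abstract is concerned with. The sketch builds full n-by-n distance matrices, so it is only suitable for small datasets.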
Using the in-house feature selection method with SVR, independent prediction on three datasets of phenolic and alcohol compounds showed that mRMR-dCor (Q² of 0.954, 0.941 and 0.981, respectively) outperformed the reference feature selection methods. Most of the molecular descriptors selected by mRMR-dCor have also been reported in previous literature.

QSAR prediction of bioconcentration factors and octanol-water partition coefficients of aromatic organic compounds: mRMR and mRMR-dCor yield only a ranking of the introduced features. Cross-validation on the training set is therefore needed to decide whether to retain each feature, which is very time-consuming. Combining the strategy of shared redundancy with mRMR-dCor, this thesis applies a new feature selection method, dCor-shared. It terminates feature selection automatically and greatly shortens computation time, because features do not have to be introduced step by step. After feature selection, we use SVR for modeling and prediction. The independent prediction results of the new method are significantly better than those of the reference methods.

mRMR-dCor and dCor-shared have broad application prospects in domains such as QSAR and quantitative structure-pharmacokinetics relationship (QSPR).
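
The abstract reports prediction quality as Q² without stating the formula. As a hedged illustration only, the sketch below uses one common definition, Q² = 1 - PRESS / TSS, where PRESS is the sum of squared prediction errors and TSS is the sum of squared deviations of the observed values from a reference mean. The function name `q_squared`, the optional `reference_mean` parameter, and the choice of definition are assumptions; the thesis may use a different variant (for example, one that takes the training-set mean as the reference).

```python
def q_squared(observed, predicted, reference_mean=None):
    """Predictive Q^2 = 1 - sum((y - y_hat)^2) / sum((y - y_ref)^2).

    Assumption: if `reference_mean` is not given, the mean of the observed
    values is used; pass the training-set mean to get that variant instead.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if reference_mean is None:
        reference_mean = sum(observed) / len(observed)
    press = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    total = sum((y - reference_mean) ** 2 for y in observed)
    if total == 0.0:
        raise ValueError("observed values have zero spread around the reference mean")
    return 1.0 - press / total


if __name__ == "__main__":
    # Made-up numbers purely for illustration, not data from the thesis.
    y_true = [1.0, 2.0, 3.0, 4.0]
    y_pred = [1.1, 1.9, 3.2, 3.8]
    print(round(q_squared(y_true, y_pred), 3))
```

Under this definition a perfect model scores 1, and a model that always predicts the reference mean scores 0, which gives a sense of scale for values such as those quoted above.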
Keywords/Search Tags: quantitative structure-activity relationship, minimal redundancy and maximal relevance, distance correlation, shared redundancy, support vector regression