
Study On Calculation Accuracy Of Non-covalent Interactions Based On Machine Learning

Posted on: 2020-03-06  Degree: Doctor  Type: Dissertation
Country: China  Candidate: W Z Li
GTID: 1361330620452351  Subject: Intelligent Environment Analysis and Planning
Abstract/Summary:
Non-covalent interactions (NCIs), also known as weak interactions, are ubiquitous as intermolecular interactions and play an extremely important role in many disciplines, such as environmental science, chemistry, materials science, and life science. In contrast to covalent interactions, NCIs involve no electron sharing during molecule formation. They are intrinsically diverse and complicated, and usually include hydrogen bonds, van der Waals forces, dispersion, π-π interactions, halogen bonds, etc. NCIs underlie a wide range of experimental phenomena in molecular systems of different scales, especially macromolecular systems that contain large numbers of NCIs. Knowledge of NCIs is therefore significant for research areas such as environmental pollution and protection, supramolecular chemistry, photochemistry, superconductivity, and biological molecules. Because of their intrinsic complexity, the understanding of NCIs is still quite limited, so tools that yield accurate NCI values are urgently needed.

Currently, NCIs can be obtained by experimental methods or by theoretical calculation. Experimental methods include infrared spectroscopy, nuclear magnetic resonance, and so on. These methods generally yield highly accurate NCIs, but they require sophisticated and expensive instruments, intricate procedures, and costly resources, and they are difficult to apply to large molecules. Among computational methods, quantum chemical calculations are the most accurate; they include ab initio methods, density functional theory, perturbation theory, and so on. Compared with experiments, theoretical calculations can save significant time and resources, but the cost of achieving accurate results is still quite high, especially for large molecules. In recent years, the renaissance of artificial intelligence has opened a new direction for improving theoretical methods, offering a simple and efficient solution to the problem of NCI calculation. This thesis focuses on calculation accuracy and machine learning models for NCIs. The main contents are as follows:

(1) A data partition method for small chemical databases, HSPXY, based on joint hybrid correlation and diversity distances, is proposed. The choice of data partition method has a large impact on the performance of models built from small databases. Approaches for representative subset selection fall into two categories: the cluster-based design approach and the uniform design approach. In general, the uniform design approach does not consider correlation, so it is unlikely to correctly assign samples that lie at large distances from the selected samples but are closely correlated with them. To address this issue, an improved sample set partitioning method, HSPXY, is proposed in this study; it combines a hybrid correlation measure with the diversity distance of the commonly used SPXY method. To test the effectiveness of the proposed method, it is compared with state-of-the-art data partitioning methods on small chemical databases, using partial least squares (PLS) regression to establish the regression models. The models built on HSPXY partitions achieve smaller root mean squared errors and higher correlation coefficients than those built on partitions from the other methods. This shows that HSPXY provides a new option for obtaining a representative training set.

(2) A general procedure for establishing ensemble learning models on NCI databases is proposed. Accurate NCI computation is quite demanding for first-principles methods, whereas a competent machine learning model can deliver high NCI accuracy at minimal computational cost. Multiple ensemble learning schemes are explored in this study. For the Bagging and Boosting types, the existing and representative methods random forest and gradient boosting decision tree are chosen to build the models. For Stacking, different base learners are first generated by using five different feature selection methods to select various feature subsets. The outputs of the base learners are then fed to the meta learner. Depending on the types of the selected base learners, two ensemble frameworks are obtained: the homogeneous stacking ensemble (Homo-SE) and the heterogeneous stacking ensemble (Hete-SE). Because the results are sensitive to the number and type of base learners, the optimal number and type are analyzed and selected by constructing multiple regression models. Experimental analyses show that ensemble learning is significantly better than the previous single machine learning models on the benchmark datasets; the Hete-SE method in particular performs best among all methods.

(3) To further improve the prediction accuracy of NCIs and to reduce human intervention in feature design, a mixed 3D-CNN deep learning framework for NCI modeling, DeepNCI, is proposed for the first time, and the DeepNCI toolkit is developed. DeepNCI takes both molecular electron density and quantum chemical properties as inputs. The 3D-CNN automatically abstracts features from the electron density through multi-layer convolutional neural networks, avoiding feature selection by human intervention. The abstracted features and the quantum chemical properties are then merged and passed to fully connected neural network layers to predict the final NCIs. The experimental results show that the DeepNCI model is superior to the best existing methods. A comparison of t-SNE visualizations of the original and abstracted features shows that the DeepNCI network can discriminate samples by the abstracted features; that is, the generated feature representation captures the characteristics of NCIs to some degree. The deep neural network structure with electron density input overcomes the generalization limit of NCI prediction, which makes it possible to extrapolate across molecular systems and obtain reasonable NCIs for large molecules. Additionally, an application domain of DeepNCI is defined, and all test samples are evaluated. To test the transferability of DeepNCI, a small database containing only dozens of samples, Homolysis Bond Dissociation Energy (HBDE), is modeled by transfer learning within the DeepNCI framework. Because the characteristics and features of HBDE are consistent with those of NCIs, transfer learning is easily applied in the DeepNCI framework. For this dataset, the DeepNCI model trained by transfer learning achieves prediction ability comparable to that of the other methods used. This demonstrates that, through its transferability, the DeepNCI model can also deal with small-sample problems.
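To illustrate the kind of uniform-design partitioning that item (1) builds on, the following is a minimal Python sketch of the baseline SPXY selection rule. It is not HSPXY itself, whose correlation term is not specified in this abstract. The sketch assumes Euclidean distances in feature space, absolute differences in the response, and normalization of each by its maximum; the function name `spxy_split` and its parameters are hypothetical and chosen only for this example.

```python
import math


def spxy_split(X, y, n_train):
    """Pick n_train training indices with the SPXY rule.

    SPXY is Kennard-Stone selection applied to a joint distance:
    normalised distance in feature space plus normalised distance in y.
    Returns (train_indices, test_indices).
    """
    n = len(X)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and the number of samples")

    # Pairwise distances in feature space (Euclidean) and in the response.
    dx = [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]
    dy = [[abs(y[i] - y[j]) for j in range(n)] for i in range(n)]

    # Normalise each by its maximum so that x and y contribute equally.
    # Fall back to 1.0 if all samples coincide, to avoid dividing by zero.
    max_dx = max(max(row) for row in dx) or 1.0
    max_dy = max(max(row) for row in dy) or 1.0
    d = [[dx[i][j] / max_dx + dy[i][j] / max_dy for j in range(n)]
         for i in range(n)]

    # Seed the training set with the two most distant samples.
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    i0, j0 = max(pairs, key=lambda p: d[p[0]][p[1]])
    selected = [i0, j0]
    remaining = [k for k in range(n) if k not in selected]

    # Repeatedly add the sample whose nearest selected neighbour is farthest.
    while len(selected) < n_train:
        best = max(remaining, key=lambda k: min(d[k][s] for s in selected))
        selected.append(best)
        remaining.remove(best)

    return selected, remaining
```

For example, with four one-dimensional samples `X = [[0.0], [1.0], [2.0], [10.0]]`, `y = [0.0, 1.0, 2.0, 10.0]` and `n_train = 3`, the sketch first selects the two extremes (indices 0 and 3) and then index 2, leaving index 1 for the test set. Because the rule looks only at distances, two samples that are far apart but strongly correlated are treated as unrelated, which is the limitation item (1) describes.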
Keywords/Search Tags: Non-covalent interactions (NCIs), data partition method, machine learning, ensemble learning, deep learning