Research On Chemical Process Optimization And QSAR/QSPR Of Organic Compounds Using Data Mining

Posted on: 2009-04-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S S Yang
Full Text: PDF
GTID: 1101360245999263
Subject: Materials science
Abstract/Summary:
Data mining (DM), a multidisciplinary research area, is a technology for finding unknown, hidden, and interesting knowledge in massive data sets. It is recognized as a key research topic in databases and machine learning, and its large application potential has aroused wide interest in scientific and industrial circles. Carrying out experiments, finding regularities in the data obtained, and predicting unknown phenomena form the chief mode of research in chemistry and related disciplines, including chemical engineering, materials science, and environmental science. With the progress of computer science and technology, computerized data processing, or so-called machine learning, has been widely used in chemical research and in the optimal control of chemical industrial processes. Up to now, the statistical methods used in chemistry have almost all been based on classical statistical theory. One of the basic principles of classical statistics is the law of large numbers: as the number of observations tends to infinity, the empirical distribution function converges to the actual distribution function. In other words, a training data set with an infinite number of samples would be needed to obtain a reliable mathematical model by machine learning. In practical problem solving, including machine learning tasks in chemistry and chemical engineering, it is impossible to have so many samples for training and model building; on the contrary, in most chemical data processing work the number of training samples is quite small. In recent years, a widely recognized statistical theory, statistical learning theory (SLT), has been proposed to address this small-sample problem.
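As a brief aside, the convergence statement from classical statistics cited above is usually formalized as the Glivenko-Cantelli theorem. The symbols $X_i$, $F$, and $F_n$ are notation introduced here for illustration and do not appear in the original abstract; the samples $X_1, \dots, X_n$ are assumed to be independent and identically distributed with distribution function $F$.

```latex
F_n(x) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{X_i \le x\},
\qquad
\sup_{x \in \mathbb{R}} \bigl| F_n(x) - F(x) \bigr|
\xrightarrow{\text{a.s.}} 0
\quad \text{as } n \to \infty .
```

The guarantee is asymptotic: by itself it gives no bound on how close $F_n$ is to $F$ for the small $n$ typical of chemical data sets.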
A new machine learning method, the support vector machine (SVM), has been developed in the spirit of statistical learning theory. SVM has been used in many application fields, including image recognition, text categorization, and DNA research, with rather good results. These powerful data processing techniques are now also used in chemistry and related disciplines, where SVM, as a newly proposed algorithm, has a bright future as a tool.

This thesis focuses on the application of data mining to chemical process optimization and to quantitative structure-activity/property relationships (QSAR/QSPR) of compounds. Over the last decades, process optimization and monitoring have developed into an important branch of research and development in chemical engineering. Large volumes of chemical process operation and control data are collected by modern distributed control and automatic data logging systems. Using data mining, the useful information hidden in these data can be extracted not only for fault diagnosis but also for optimal control aimed at saving energy, increasing yield, reducing pollution, and decreasing production cost. The study of quantitative structure-activity/property relationships (QSAR/QSPR) is a topic in chemistry and one of the most important steps in molecular design. In QSAR/QSPR work, known data for similar compounds are used as training samples, and the number of training samples is usually no more than several hundred. Its flexibility in classification and its ability to approximate continuous functions make SVM very suitable for QSAR/QSPR studies. The work and contributions of this thesis are as follows:

1. A comprehensive graphical software package, the Data Mining Optimization System (DMOS), has been developed for ammonia synthesis optimization and monitoring.
The DMOS integrates most modern optimization methods, including database search, pattern recognition, artificial intelligence, statistical learning, and domain knowledge. Some novel computational techniques developed in our lab are also implemented in it. The DMOS has two versions, off-line and on-line. Its features include method fusion, feature selection, automatic modeling, model validation, model updating, multi-model building, and on-line monitoring, which help solve the optimization and monitoring problems of the complex ammonia synthesis process. The DMOS has been successfully applied to the ammonia synthesis process: the main technical parameters affecting the flow of fresh synthesis gas were found, and qualitative and quantitative models relating the flow of fresh synthesis gas to some technical parameters were summarized. The DMOS is expected to have great potential for the optimization and monitoring of ammonia synthesis and of other chemical processes.

2. Chemical process optimization is an indispensable means of increasing the competitiveness and economic profit of chemical enterprises, from both technical and economic viewpoints. In this work, two chemical process optimizations based on data mining are studied: the 1,2,4-trimethylbenzene unit and the aromatic hydrocarbon unit. The SVM method, which is especially appropriate for modeling small data sets, was first applied to the optimization of these two processes. Moreover, traditional methods, including the Fisher and PCA methods, are used as complementary methods, since they have their own advantages compared with SVM: they give many linear projection figures that contain plentiful information, from which domain experts, including chemists and chemical engineers, can draw useful inspiration. From these models, the main technical parameters affecting the objective function are found.
The qualitative and quantitative models relating the objective function to some technical parameters are then summarized. The optimization results are as follows: (a) A higher bottom temperature (about 211±0.5℃) of tower C01 (T01-01) and tower C02 (T02-01) and a higher tray temperature difference (about 30.5±0.5℃) of tower C01 (dT01) help to enhance the 1,2,4-trimethylbenzene yield. The classification accuracies of the SVC model for the 1,2,4-trimethylbenzene yield on the training and prediction data sets are 100% and 96.2%, respectively. The root mean square errors (RMSE) of the SVR model for the 1,2,4-trimethylbenzene yield on the training and prediction data sets are 0.028 and 0.034, respectively. (b) A higher bottom temperature (about 203.5±0.5℃) of tower T4504 (T04-01), a lower sensitivity temperature (126±0.5℃) of tower T4503 (T03-02), and a lower reflux ratio (0.27±0.2) of tower T4503 (R) help to decrease the aromatic content of the raffinate. The classification accuracies of the SVC model for the aromatic content of the raffinate on the training and prediction data sets are both 100%. The RMSE values of the SVR model for the aromatic content of the raffinate on the training and prediction data sets are 0.072 and 0.060, respectively.

3. Quantitative structure-property relationship (QSPR) models were developed to correlate the structures of polycyclic aromatic hydrocarbons (PAHs) with their boiling point (bp), n-octanol/water partition coefficient (logKow), and retention time index (RI) for reversed-phase liquid chromatography analysis. Quantum chemical descriptors of 139 PAHs were calculated from fully optimized geometries at the B3LYP/6-311G** level of theory. The descriptors were first screened by the genetic algorithm (GA)-support vector regression (SVR) method, and the parameters of the SVR models were then optimized by leave-one-out cross-validation.
The SVR models for bp, logKow, and RI were developed from training sets of 45, 52, and 90 compounds, respectively, and then tested on external test sets of 12, 13, and 23 compounds, respectively. The good determination coefficients (R²=0.997, 0.964, and 0.950, respectively) and satisfactory external predictive ability (q²=0.999, 0.897, and 0.931, respectively) for bp, logKow, and RI show that the SVR method and DFT-based descriptors can be used to model bp, logKow, and RI for a diverse set of PAHs.

4. A quantitative structure-property relationship (QSPR) model was developed to correlate the structures of aromatic compounds with their n-octanol/water partition coefficient (logKow). The 68 molecular descriptors, derived solely from the structures of the aromatic compounds, were calculated using Gaussian 03, HyperChem 7.5, and TSAR V3.3. The descriptors were screened by the minimum Redundancy Maximum Relevance (mRMR)-genetic algorithm (GA)-support vector regression (SVR) method. The parameters of the SVR model were optimized by five-fold cross-validation. The QSPR model was developed from a training set of 300 compounds using the SVR method, with a good determination coefficient (R²=0.85), and then tested on an external test set of 50 compounds, with satisfactory external predictive ability (q²=0.84).

5. A quantitative structure-activity relationship (QSAR) study was performed to develop a model correlating the structures of 581 aromatic compounds with their aquatic toxicity to Tetrahymena pyriformis. A set of 68 molecular descriptors, derived solely from the structures of the aromatic compounds, was calculated using Gaussian 03, HyperChem 7.5, and TSAR V3.3. A comprehensive feature selection method, the mRMR-GA-SVR method, was applied to select the best descriptor subset for the QSAR analysis. The SVR method was employed to model the toxic potency from a training set of 500 compounds.
Five-fold cross-validation was used to optimize the parameters of the SVR model, which was then tested on an external test set of 81 compounds. A good coefficient of determination (R²=0.77) and external predictive ability (q²=0.67) were obtained, indicating the potential of SVR to facilitate the prediction of toxicity.
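The abstract reports its results as RMSE, determination coefficient (R²), and external predictive ability (q²) without defining them. The sketch below shows one common way such figures are computed; it is not taken from the thesis. In particular, the q² formula (test-set residuals scaled by deviations from the training-set mean) is an assumed convention, the function names are hypothetical, and the numbers are made up for illustration.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    n = len(y_true)
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot, on one data set."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def q2_external(y_test, y_pred, y_train):
    """External q2 (assumed definition): squared test-set residuals are
    scaled by squared deviations of the test values from the TRAINING mean."""
    mean_train = sum(y_train) / len(y_train)
    press = sum((t - p) ** 2 for t, p in zip(y_test, y_pred))
    ss_ext = sum((t - mean_train) ** 2 for t in y_test)
    return 1.0 - press / ss_ext


# Made-up values for illustration only; these are not data from the thesis.
y_train = [1.0, 2.0, 3.0, 4.0]          # observed property, training set
y_train_fit = [1.1, 1.9, 3.2, 3.8]      # model output on the training set
y_test = [2.0, 4.0]                     # observed property, external test set
y_test_pred = [2.5, 3.5]                # model output on the external test set

print("RMSE (train):", round(rmse(y_train, y_train_fit), 4))
print("R2 (train):", round(r_squared(y_train, y_train_fit), 4))
print("q2 (external):", round(q2_external(y_test, y_test_pred, y_train), 4))
```

Under this convention R² measures goodness of fit on the training set, while q² is evaluated only on compounds the model never saw, so the two values need not be close to each other.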
Keywords/Search Tags:data mining, pattern recognition, support vector regression (SVR), support vector classification (SVC), chemical process optimization, quantitative structure-activity/property relationship (QSAR/QSPR), ammonia synthesis, 1,2,4-trimethylbenzene