
New Methodology In Chemical Data Mining And Foundational Research On QSPR

Posted on: 2003-10-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y P Du
GTID: 1101360092970121
Subject: Analytical Chemistry
Abstract/Summary:
This work focuses on data mining of chromatographic retention index data. A retention index database containing about 50 000 records is first established. Projection pursuit is then applied to the data to uncover relationships between retention indices and structural descriptors, and a novel projection pursuit algorithm is developed for this purpose. Samples of alkanes, alkenes and cycloalkanes are investigated. With the new algorithm, these hydrocarbons are found to fall into classes that reflect specific structural features, such as the number of carbon atoms, the number of branches, the number and position of double bonds, whether the double bonds are conjugated, and the number of rings. Separate models relating topological indices to retention indices are then established for the classes obtained from the projection, and the regression improves significantly. This shows that several distinct linear models exist even for alkanes. Furthermore, when the compounds of a homologous series are used to calculate the projection direction, the projection separates all homologous series from one another at regular distances. Based on this information, a new variable, the class distance variable, is proposed to describe the difference between classes of homologs. With this variable a much better model is obtained, whose estimation and prediction errors are both very small, close to the level of measurement error.

Two indices, the similarity evaluation index and the difference evaluation index, are proposed in this work. They can be used to investigate the correlation between topological indices (TIs) quantitatively and to estimate the contribution of each TI to the regression model in QSPR. Applied to a data set of alkanes and alkenes, the two indices describe the relationships between TIs with reasonable results and show potential usefulness in variable selection. A block descriptor, which contains a series of individual TIs with similar definitions, is also proposed. After individual topological indices are combined into a few blocks, canonical correlation analysis yields a set of new one-dimensional variables without losing major information. Models built from only a few of these new variables describe the retention indices of alkanes with improved performance, giving high correlation coefficients and small residuals.

In the chromatographic analysis of complex multicomponent samples, grey analytical systems are often encountered, in which some components are known and others are unknown. A model and an algorithm are developed for calculating the dead time and the retention times of n-alkanes in a grey analytical system. From the calculated dead time and n-alkane retention times, the retention indices of unknown components can be calculated easily. Results for two samples of petroleum products show that the calculated dead times, n-alkane retention times and retention indices of unknown components agree with the experimental values with small errors.
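To make the last step concrete, the following is a minimal sketch of how a retention index follows from a dead time and two bracketing n-alkane retention times. It is not the grey-system algorithm of the dissertation, which the abstract does not specify. It assumes isothermal conditions, the standard Kovats definition, and the classical three-homolog dead time estimate, which holds when log(t - t_M) is linear in carbon number. The function names and the sample retention times are hypothetical.

```python
import math


def dead_time_from_homologs(t1, t2, t3):
    """Estimate the dead time t_M from the retention times of three
    consecutive n-alkanes (carbon numbers n, n+1, n+2).

    Assumption: log(t - t_M) is linear in carbon number, so the adjusted
    times form a geometric sequence: (t1 - t_M)(t3 - t_M) = (t2 - t_M)^2.
    """
    denom = t1 + t3 - 2.0 * t2
    if denom == 0:
        raise ValueError("retention times are equally spaced; t_M is undefined")
    return (t1 * t3 - t2 * t2) / denom


def kovats_index(t_x, t_n, t_n1, n, t_m):
    """Isothermal Kovats retention index of a component eluting at t_x
    between the n-alkanes with n and n+1 carbons (times t_n and t_n1)."""
    adj_x, adj_n, adj_n1 = t_x - t_m, t_n - t_m, t_n1 - t_m
    if min(adj_x, adj_n, adj_n1) <= 0:
        raise ValueError("adjusted retention times must be positive")
    frac = (math.log(adj_x) - math.log(adj_n)) / (math.log(adj_n1) - math.log(adj_n))
    return 100.0 * (n + frac)


# Hypothetical retention times (minutes) for hexane, heptane and octane.
t_m = dead_time_from_homologs(3.0, 5.0, 9.0)
# An unknown eluting between hexane (n = 6) and heptane.
index = kovats_index(1.0 + 2.0 * math.sqrt(2.0), 3.0, 5.0, 6, t_m)
print(t_m, index)
```

The three-homolog estimate is sensitive to small timing errors because its denominator is a difference of nearly equal quantities, which is one reason fitted models over more n-alkanes are generally preferred in practice.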
Keywords/Search Tags: Data mining, Quantitative structure-property relationship, Projection pursuit, Canonical correlation analysis, Topological index, Retention index, Database, Chemometrics