Data Mining In Chemoinformatics

Posted on: 2005-06-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q N Hu
Full Text: PDF
GTID: 1101360125958118
Subject: Applied Chemistry
Abstract/Summary:
Chemoinformatics has emerged to meet scientists' growing need to extract chemical knowledge from large-scale data sets. It is the application of informatics methods to solve chemical problems. One of its important aims is to obtain expert knowledge that explains observed phenomena. However, this knowledge is usually hidden in huge data sets, and suitable ideas and methods are needed to mine it out.

The present work covers the construction of databases, the calculation of molecular topological descriptors, the structural interpretation of topological indices, and applications in QSAR and QSPR. It consists of four main parts.

The first part, described in chapters two and three, is the construction of the Heuristic Queue Notation (H.Q.N.) system as a platform for chemoinformatics studies. Chapter two introduces the basic principles of the H.Q.N. and its application to calculating more than 320 topological indices. Chapter three constructs a database of active components from Traditional Chinese Medicines (T.C.M.), comprising the molecular structures and biological activities of more than 4000 active components.

The second part, described in chapters four to seven, characterizes topological structures to derive different descriptors. Chapter four proposes a new method for locating the graph center, together with a novel centric index. Chapter five obtains some interesting results from the mathematical characteristics of degree distributions with minimal human intervention. Based on the structural information in the degree distributions, chapter six presents a method for counting the number of branches.
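As a rough illustration of the kind of graph quantities discussed above, the following Python sketch computes the classical eccentricity-based graph center and the vertex degree distribution of a hydrogen-suppressed molecular graph. It is not the new center-location method or centric index proposed in chapter four, which the abstract does not specify; the adjacency-list encoding and all function names are assumptions made for this example.

```python
from collections import Counter, deque


def bfs_distances(adj, start):
    """Topological (bond-count) distances from `start` to every reachable atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def graph_center(adj):
    """Classical graph center: the vertices with minimum eccentricity.

    Assumes a connected graph. This is the textbook definition, not the
    method proposed in the dissertation.
    """
    ecc = {u: max(bfs_distances(adj, u).values()) for u in adj}
    radius = min(ecc.values())
    return sorted(u for u, e in ecc.items() if e == radius)


def degree_distribution(adj):
    """Map each vertex degree to the number of atoms having that degree."""
    return dict(sorted(Counter(len(nbrs) for nbrs in adj.values()).items()))


# Hydrogen-suppressed graph of 2-methylbutane (hypothetical atom numbering):
# chain 0-1-2-3 with a methyl group (atom 4) attached to atom 1.
isopentane = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}

print(graph_center(isopentane))         # [1, 2]
print(degree_distribution(isopentane))  # {1: 3, 2: 1, 3: 1}
```

In this toy encoding, a simple proxy for branching would be the number of atoms with degree three or more, but the branch-counting method of chapter six may differ.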
To better describe the modeled properties, chapter seven suggests a new strategy for defining an external factor variable connectivity index, one of the latest developments of the molecular connectivity index. The definition is extended to higher orders to improve the regression results, which are much better than those of the original molecular connectivity index and the variable connectivity index.

The third part, in chapters eight to ten, is the structural interpretation of topological indices. A frequently encountered shortcoming of topological indices is that they are difficult to interpret in terms of simple structural and physicochemical concepts. Mining out the hidden structural features in the multi-dimensional spaces spanned by several topological indices should help in interpreting both the indices and the models built from them. Chapters eight and nine describe how to search multi-dimensional point clouds so that "interesting" projections are picked automatically. We apply the projection pursuit method to mine out hidden structural features for interpreting the external factor variable connectivity index, χ, Kappa, and atom-type E-State indices; the TFWW method is used to generate uniformly distributed directions on the multivariate unit sphere, and "entropy" is introduced as the projection index. The results also indicate possible high collinearity between topological indices, so we have studied the mutual relatedness between different sets of topological indices. Canonical correlation analysis, a standard method for discovering and quantifying the mutual relatedness between sets of variables, is used to study the relationships among the χ, Kappa, and E-State indices.
This analysis shows that the three families of indices are highly correlated, and their shared variance further reveals why they are collinear.

The fourth part, described in chapters eleven and twelve, focuses on how to extract orthogonal information from topological indices to build better quantitative structure-activity models. To retain almost all the information in the original variables while reducing the number of variables, we use orthogonal block variables and canonical correlation analysis to study...
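The collinearity between topological indices noted above, and the idea of extracting orthogonal information from them, can be illustrated with a small Python sketch. It computes two classical indices, the first-order molecular connectivity (Randić) index and the Wiener index, for the n-alkanes from ethane to octane, measures their Pearson correlation, and then removes from the Wiener index the part explained by the connectivity index using least-squares residuals. This is a stand-in for illustration only: it is neither canonical correlation analysis nor the orthogonal block variables used in the dissertation, and all function names are hypothetical.

```python
import math
from collections import deque


def path_graph(n):
    """Hydrogen-suppressed graph of an n-alkane with n carbon atoms."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


def bfs_distances(adj, start):
    """Topological distances from `start` to every reachable atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def randic_index(adj):
    """First-order connectivity index: sum over bonds of (d_u * d_v)^(-1/2)."""
    return sum(1.0 / math.sqrt(len(adj[u]) * len(adj[v]))
               for u in adj for v in adj[u] if u < v)


def wiener_index(adj):
    """Wiener index: sum of topological distances over all atom pairs."""
    return sum(sum(bfs_distances(adj, u).values()) for u in adj) // 2


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def residualize(y, x):
    """Part of y that is orthogonal to x: residuals of regressing y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [(b - my) - slope * (a - mx) for a, b in zip(x, y)]


alkanes = [path_graph(n) for n in range(2, 9)]  # ethane ... octane
chi = [randic_index(g) for g in alkanes]
wiener = [wiener_index(g) for g in alkanes]
wiener_orth = residualize(wiener, chi)

print(pearson(chi, wiener))       # high: the two indices are nearly collinear here
print(pearson(chi, wiener_orth))  # essentially zero after orthogonalization
```

On a set this small and homogeneous the two indices mostly track molecular size, which is one simple reason such descriptors can end up collinear; the residual keeps only what the second index adds beyond the first.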
Keywords/Search Tags:chemical informatics data, chemoinformatics, data mining, chemometrics, topological index