
Fundamental Researches On Software Development For Implementation Of Chemometric Algorithms

Posted on: 2013-12-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z M Zhang
GTID: 1261330401479245
Subject: Applied Chemistry
Abstract/Summary:
After decades of steady improvement, chemometrics has become one of the most effective and systematic approaches for extracting useful information from instrumental analysis data of complex systems. It offers many novel ideas for the qualitative and quantitative analysis of complex systems, as well as a framework for processing signals from analytical instruments. Chemometrics will play a much larger role in the data analysis of complex systems if these ideas can be implemented in reliable, easy-to-use software. However, turning chemometric methods into software products requires a great deal of fundamental work. One should first study the foundations of chemometrics software systematically, then propose new algorithms where needed, and finally implement the product with current computer science technologies so that it is competitive in the market. This thesis aims to fill the gap between chemometric methods and software products. It covers research on fundamental chemometric methods, the construction of a chemometrics library, and the application of new computer science technologies. All of this fundamental work has been completed, including a linear algebra and statistics library, accurate and efficient preprocessing methods, acceleration of the modeling procedure through multi-core computing, and storage of models in the chemometrics modeling markup language. The main contents of this thesis are summarized as follows:

1. Chemometric algorithms essentially consist of a series of linear algebra and statistics functions. To write chemometric software easily, the first prerequisite is a well-designed, high-performance linear algebra and statistics library.
The author spent about six years building a well-designed, easy-to-use, accurate, high-performance linear algebra and statistics library using BLAS, LAPACK, CSparse, C and C++. Most chemometric methods were then implemented on top of this library. Because the library is well encapsulated and designed, a given chemometric method can be implemented with it in the same number of lines of code as in MATLAB. The library strictly follows the ISO C++ standard, so it can be used across operating systems and compiler toolchains: it is compatible with Windows, Linux and Mac OS X, and with the GCC, MSVC, LLVM-Clang and ICC compilers. Comparisons of the computation times of matrix multiplication and singular value decomposition against MATLAB 2011b and R 2.14 show that the library matches MATLAB 2011b in both accuracy and performance and is at least 4 times faster than R 2.14.

2. The general trend in personal computer processors is the move from single-core to multi-core designs, driven by rapid improvements in engineering and manufacturing. Chemometrics software that uses multi-core computing will be competitive in the market. Multi-core computing is therefore introduced into chemometrics software, with leave-one-out cross-validation taken as an example to show its capability. Comparison with traditional serial methods shows that execution time drops rapidly as the number of computing cores increases, which demonstrates that multi-core computing is a promising tool for compute-intensive and data-intensive problems in chemometrics.

3. Built models must be stored on disk for future prediction, which raises the problems of model storage and sharing. The chemometrics modeling markup language (CMML) is designed to resolve these problems.
CMML was developed in this work to hold a chemometric model within a single document, converting binary data into strings with base64 encoding and decoding, in order to solve the interoperability problem in sharing chemometric models. It provides basic functionality for storing the parameters and data of sampling, variable selection, pretreatment, outlier detection and modeling. Encoding binary data as base64 strings balances the usability of CMML against file size. Thanks to the advantages of the Extensible Markup Language (XML), models stored in CMML can easily be reused in other software and programming languages, provided the language has an XML parsing library. The XML Path Language (XPath) can also be used to select the desired data from a CMML file efficiently.

4. For the baseline problem in Raman spectroscopy and chromatography, an intelligent baseline correction algorithm named baselineWavelet was proposed first. Accurate peak positions in the Raman spectrum are detected by the continuous wavelet transform (CWT) with the Mexican Hat wavelet as the mother wavelet, and the background is fitted using penalized least squares with binary masks. To provide a baseline correction software product, the baselineWavelet method was then simplified and generalized. The result is a novel method named airPLS, which can effectively correct the baseline in Raman spectra, chromatograms and NMR spectra. More importantly, sparse matrix techniques are used so that both computation time and memory scale linearly with signal length, which makes airPLS well suited to high-throughput analytical datasets.

5. Retention time shifts badly impair the qualitative and quantitative results of chemometric analyses when entire chromatographic profiles are used. Hence, chromatograms should be aligned before further analysis.
Motivated by this need, a practical and handy peak alignment method (alignDE) for one-way chromatograms is proposed and implemented in this research. The lengths of the chromatograms are equalized using linear interpolation; accurate peak positions are detected by the continuous wavelet transform (CWT) with the Mexican Hat and Haar wavelets as mother wavelets; and differential evolution (DE) is adopted to maximize the linear correlation coefficient between the reference signal and the signal to be aligned. The method is demonstrated with both simulated and real chromatograms, for example chromatograms of Red Peony Root obtained by HPLC-DAD.

6. Chromatography has been applied extensively in many fields, such as metabolomics and the quality control of herbal medicines. Preprocessing, especially peak alignment, is a time-consuming task that precedes the extraction of useful information from the datasets by chemometrics and statistics. To align shifted peaks among one-dimensional chromatograms accurately and rapidly, multiscale peak alignment (MSPA) is presented in this research. The peaks of each chromatogram are detected with the continuous wavelet transform (CWT) and aligned against a reference chromatogram gradually, from large scale to small scale. The alignment is accelerated by fast Fourier transform (FFT) cross-correlation, which reduces the computational complexity of cross-correlation from N^2 to N log N. MSPA was compared with two widely used alignment methods on a chromatographic dataset; the comparison shows that MSPA preserves peak shapes and runs fast. Furthermore, MSPA is robust and insensitive to noise and baseline.
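The CMML storage idea described above (binary numeric data turned into base64 strings inside an XML document, then selected with an XPath query) can be illustrated with a minimal Python sketch. The element and attribute names used here (`cmml`, `model`, `coefficients`, `method`, `dtype`, `length`) are hypothetical and are not the actual CMML schema, which the abstract does not give. Python's `xml.etree.ElementTree` supports only a subset of XPath, which is enough for this illustration.

```python
import base64
import struct
import xml.etree.ElementTree as ET


def encode_doubles(values):
    """Pack floats as little-endian float64 bytes and return base64 text."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return base64.b64encode(raw).decode("ascii")


def decode_doubles(text):
    """Inverse of encode_doubles: base64 text back to a list of floats."""
    raw = base64.b64decode(text)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))


def build_document(method, coefficients):
    """Serialize one model into an XML string (hypothetical layout, not real CMML)."""
    root = ET.Element("cmml")
    model = ET.SubElement(root, "model", {"method": method})
    coef = ET.SubElement(
        model,
        "coefficients",
        {"dtype": "float64-le", "length": str(len(coefficients))},
    )
    coef.text = encode_doubles(coefficients)
    return ET.tostring(root, encoding="unicode")


def load_coefficients(xml_text, method):
    """Select the coefficient block of one model with an XPath-style query."""
    root = ET.fromstring(xml_text)
    node = root.find(f"./model[@method='{method}']/coefficients")
    if node is None:
        raise KeyError(method)
    return decode_doubles(node.text)


# Round trip: the binary float64 values survive storage as text unchanged.
xml_text = build_document("PLS", [0.5, -1.25, 3.0])
restored = load_coefficients(xml_text, "PLS")
```

Because the values are stored as raw float64 bytes rather than printed decimals, the round trip is exact, and any language with a base64 decoder and an XML parser can read the same file.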
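The penalized least squares baseline fitting with iterative reweighting described for airPLS above can be sketched as follows. This is a simplified illustration under stated assumptions, not the author's implementation: it uses a first-order difference penalty, for which the normal equations are tridiagonal and can be solved in time and memory linear in the signal length; the weight update and stopping rule are one plausible variant; and the function names, the default `lam=100.0` and `max_iter=15`, and the synthetic data are all chosen here for illustration.

```python
import math


def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm, O(n) time and memory. lower[0] and upper[-1] are unused."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    z = [0.0] * n
    z[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        z[i] = d[i] - c[i] * z[i + 1]
    return z


def whittaker_smooth(x, w, lam):
    """Minimize sum w_i (x_i - z_i)^2 + lam * sum (z_i - z_{i-1})^2 over z."""
    n = len(x)
    diag = [w[i] + lam * (1.0 if i in (0, n - 1) else 2.0) for i in range(n)]
    lower = [0.0] + [-lam] * (n - 1)
    upper = [-lam] * (n - 1) + [0.0]
    rhs = [w[i] * x[i] for i in range(n)]
    return solve_tridiagonal(lower, diag, upper, rhs)


def adaptive_baseline(x, lam=100.0, max_iter=15):
    """Iteratively reweighted baseline: points above the fit get zero weight."""
    n = len(x)
    w = [1.0] * n
    total = sum(abs(v) for v in x)
    z = list(x)
    for t in range(1, max_iter + 1):
        z = whittaker_smooth(x, w, lam)
        d = [x[i] - z[i] for i in range(n)]
        neg = [v for v in d if v < 0]
        neg_sum = -sum(neg)
        if neg_sum <= 0.001 * total:  # fit lies (almost) entirely below the signal
            break
        # Assumed update rule: ignore peaks, emphasize points under the fit.
        w = [0.0 if v >= 0 else math.exp(t * abs(v) / neg_sum) for v in d]
        # Keep the endpoints weighted so the linear system stays non-singular.
        w[0] = w[-1] = math.exp(t * max(abs(v) for v in neg) / neg_sum)
    return z


# Synthetic example (made-up data): sloping baseline plus one Gaussian peak.
n = 200
true_baseline = [5.0 + 0.01 * i for i in range(n)]
peak = [10.0 * math.exp(-0.5 * ((i - 100) / 5.0) ** 2) for i in range(n)]
signal = [b + p for b, p in zip(true_baseline, peak)]
baseline = adaptive_baseline(signal)
corrected = [s - b for s, b in zip(signal, baseline)]
```

With a higher-order difference penalty the system becomes banded rather than tridiagonal, which is where a general sparse solver such as the CSparse library named in the abstract would replace the hand-written solver.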
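The FFT cross-correlation step mentioned for MSPA above rests on the correlation theorem: the cross-correlation of two signals is the inverse transform of conj(FFT(reference)) times FFT(signal), which costs N log N operations instead of the N^2 of the direct sum. The sketch below estimates a single global shift between two signals this way. It is only an illustration of that one step, not the MSPA method itself, which aligns detected peaks segment by segment across scales. The function names and the synthetic peaks are assumptions made here, and a small radix-2 FFT is written out so the example needs only the standard library.

```python
import cmath
import math


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def estimate_shift(reference, signal):
    """Return the lag (in points) by which `signal` is delayed relative to `reference`."""
    needed = len(reference) + len(signal) - 1
    size = 1
    while size < needed:  # zero-pad to a power of two to avoid wrap-around
        size *= 2
    fr = fft([complex(v) for v in reference] + [0j] * (size - len(reference)))
    fs = fft([complex(v) for v in signal] + [0j] * (size - len(signal)))
    # Correlation theorem: corr[k] = sum_j reference[j] * signal[j + k]
    product = [r.conjugate() * s for r, s in zip(fr, fs)]
    corr = [v.real / size for v in fft(product, inverse=True)]
    best = max(range(size), key=lambda k: corr[k])
    # Indices past the positive lags hold the negative lags.
    return best if best < len(signal) else best - size


# Synthetic example (made-up data): the same Gaussian peak, delayed by 7 points.
reference = [math.exp(-0.5 * ((i - 60) / 4.0) ** 2) for i in range(128)]
shifted = [math.exp(-0.5 * ((i - 67) / 4.0) ** 2) for i in range(128)]
lag = estimate_shift(reference, shifted)
```

A positive lag means the signal elutes later than the reference, so moving it earlier by `lag` points lines the peaks up; a negative lag means the opposite.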
Keywords/Search Tags: chemometrics, software development, baseline correction, penalized least squares, continuous wavelet transform, fast Fourier transform, cross-correlation, multi-core computing, model storage, Raman spectroscopy