Developing Machine Learning Algorithm for Predicting Molecular Structural Formula Based on IR and Raman Spectra

Posted on: 2024-08-07
Degree: Master
Type: Thesis
Country: China
Candidate: B Li
Full Text: PDF
GTID: 2531307100459484
Subject: Materials and Chemical Engineering (Professional Degree)
Abstract/Summary:
As novel natural and synthetic compounds continue to emerge, understanding the composition and structure of molecules is essential to grasping their physicochemical and biological properties. Determining the structure and composition of an unknown compound still requires analysis and identification by specialists through a large number of experiments, together with trial and error via spectral database retrieval and theoretical simulation. Because infrared (IR) and Raman spectra are rich in molecular structure information, they offer great advantages for structure inversion and are now commonly used characterization techniques in basic research, drug development, medical testing, and industrial production. However, repeated experimental procedures, lengthy analytical identification, and expensive computation are major constraints on the field and have slowed progress in spectrum-based structure inversion. The rapid and accurate identification of molecular structures from spectra has therefore become a pressing problem. Although determining molecular structure from spectra is a complex and tedious task, the recent proliferation of high-performance computers and advances in artificial intelligence offer new ways to address it.

In this thesis, machine learning, a cutting-edge technology in the field of artificial intelligence, is introduced to address this challenge. A machine learning algorithm with a neural network as its basic architecture is constructed and combined with popular chemistry programs to automate the determination of molecular structural formulae from IR and Raman spectra, offering a concrete, machine-learning-based solution. The main points of this thesis are as follows:

Firstly, in order to construct a mapping between IR spectra, Raman spectra, and molecular structural formulae through machine learning algorithms, we re-optimized and re-computed the original QM9 dataset to obtain 127,468 data samples containing IR and Raman spectra. Molecular structural formulae are represented as SMILES in Kekulé form. We also analysed the chemical space described by the QM9 dataset and used it to assess the prospects for the model's transferability.

Secondly, inspired by natural language processing (NLP) models, language translation algorithms are introduced into the structure recognition problem, which can be described most intuitively and concisely as translating a "spectral language" into a "molecular structure language". A one-dimensional convolutional neural network, an attention mechanism, and the decoder of the Transformer are combined to build a new machine learning algorithm, the Tran Spec model, which is trained and tested on the self-constructed QM9 spectral dataset to map IR and Raman spectra to molecular structural formulae. The accuracy and transferability of the model are then tested. The results show that the Tran Spec model achieves high accuracy and reliability on the molecular recognition problem and has some transferability. The algorithm demonstrates the great potential of machine learning for molecular structure recognition, and it is expected to replace complex and tedious manual recognition in the future.

Thirdly, from the perspective of practical application, we propose a molecular structure inversion scheme with the Tran Spec algorithm at its core. The core idea is to combine the algorithm with popular chemistry programs such as RDKit, Open Babel, Multiwfn, and Gaussian, and to determine the final molecular structural formula from the correlation between the target spectrum and the candidate spectra. The results demonstrate the feasibility of combining multiple programs to determine the final result, but the parameters set in the individual programs are also an important factor influencing the final result and cannot be ignored.
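
The final selection step, choosing among candidate structures by comparing their spectra with the target spectrum, can be illustrated with a minimal sketch. The abstract does not say which correlation measure the thesis uses, so the Pearson coefficient here is an assumption. The function names `pearson` and `rank_candidates`, the SMILES keys, and the intensity values are hypothetical toy data. In the scheme described above, candidate spectra would come from the chemistry programs rather than being typed in by hand.

```python
import math


def pearson(a, b):
    """Pearson correlation between two intensity vectors on the same grid."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("spectra must share a grid and have at least two points")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat spectrum carries no shape information
    return cov / math.sqrt(var_a * var_b)


def rank_candidates(target, candidates):
    """Return (smiles, score) pairs sorted from best to worst match."""
    scored = [(smiles, pearson(target, spectrum))
              for smiles, spectrum in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (assumed): intensities sampled on a shared wavenumber grid.
target = [0.0, 0.2, 1.0, 0.3, 0.0, 0.6]
candidates = {
    "CCO": [0.0, 0.25, 0.9, 0.35, 0.05, 0.55],
    "COC": [0.8, 0.1, 0.0, 0.7, 0.9, 0.0],
}

ranking = rank_candidates(target, candidates)
best_smiles, best_score = ranking[0]
print(best_smiles, round(best_score, 3))
```

A ranking like this depends on how the spectra are prepared (grid, broadening, normalization), which is consistent with the abstract's remark that the parameters set in the individual programs influence the final result.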
Keywords/Search Tags: deep learning, infrared spectrum, Raman spectrum, SMILES, neural network