
Prediction Of Bioactivity Of Hepatitis C Virus Inhibitor By Machine Learning Method

Posted on: 2022-09-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z J Qin
GTID: 1481306602959079
Subject: Chemical Engineering and Technology
Abstract/Summary:
Hepatitis C is caused by infection with the hepatitis C virus (HCV). Infected patients have a high risk of developing serious liver diseases, including liver cirrhosis and liver fibrosis, which can lead to hepatocellular carcinoma. According to World Health Organization statistics, more than 70 million people are infected with HCV worldwide, and the number is increasing by 1.25 million every year. To date, there is no effective vaccine against HCV, and in recent years the multiple genotypes and mutations of HCV have made HCV infections increasingly difficult to treat. Using a computer-aided drug design approach, this thesis focuses on two HCV drug targets, the NS3/4A protease and the NS5A protein. It covers bioactivity prediction and virtual screening of inhibitors based on cheminformatics and machine learning methods, and drug resistance research on mutant proteins by molecular dynamics simulations. In addition, cheminformatics and machine learning methods were used to build classification models for the anti-inflammatory target cyclooxygenase-2. The main research contents of the thesis are as follows:

(1) Quantitative structure-activity relationship (QSAR) studies of HCV NS3/4A protease inhibitors were carried out with the multiple linear regression, support vector machine, and random forest methods. A data set of 512 inhibitors and their IC50 bioactivity values was collected. Each inhibitor was represented by CORINA global descriptors and 2D autocorrelation descriptors. Three machine learning methods, multiple linear regression (MLR), support vector machine (SVM), and random forest (RF), were used to build models. The coefficient of determination (r²) and the standard error of estimate (SEE) of the best SVM model (ModelD4) were 0.843 and 0.647 on the test set, respectively; the r² and SEE of the best RF model (ModelC6) were 0.847 and 0.635 on the test set, respectively. The applicability domain
analyses of the two best models showed that both models covered more than 97% of the training and test sets, which suggested that their predictions were reliable. In addition, the data set was split into a non-macrocyclic sub-data set and a macrocyclic sub-data set, and the same process was used to build models on each sub-data set. The results showed that all sub-data set models performed better than the whole data set models. Finally, three best SVM models were obtained: the whole data set model ModelD4, the non-macrocyclic sub-model ModelLB2, and the macrocyclic sub-model ModelMD2. Three best RF models were also obtained: the whole data set model ModelC6, the non-macrocyclic sub-model ModelLB3, and the macrocyclic sub-model ModelMD4. We believe that these models could be used as powerful virtual screening tools. Based on the analyses of the molecular descriptors, we concluded that the ?atom charges and the lone-pair atom electronegativity were important properties. The number of rotatable bonds was the bridge between the non-macrocyclic inhibitors and the macrocyclic inhibitors.

(2) Virtual screening studies for HCV NS3/4A protease inhibitors were carried out with the quantitative structure-activity relationship models and with three-dimensional shape and electrostatic similarity screening methods. The Specs and ChemDiv databases, which together contain more than 1.81 million small molecules, were screened. Two methods were used to screen the databases in parallel. The first method predicted the IC50 values of all molecules in the databases with the three pre-built SVM models; 367 non-macrocyclic molecules with an average predicted IC50 below 100 nM were retained. The second method used the ligand conformation of the marketed drug in the co-crystal structure as the template conformation and searched the databases for molecules that had high
similarity to the template in three-dimensional shape, chemical groups, and electrostatics; 119 non-macrocyclic molecules and 22 macrocyclic molecules with similarity indexes above 1.2 were retained. In total, the two screening methods retained 508 molecules. Subsequently, the 508 molecules were clustered, and the molecule with the highest predicted bioactivity in each cluster was kept, giving 85 molecules. These 85 molecules were divided into 13 categories according to their molecular scaffolds, and the molecule with the highest predicted bioactivity for each scaffold was kept, giving 13 candidate molecules. Finally, the 13 candidate molecules were further verified by molecular docking, molecular dynamics simulation, binding free energy calculation, and binding mode analyses, and three candidate molecules that could interact with the NS3/4A protease were obtained for further study.

(3) Three-dimensional quantitative structure-activity relationship (3D-QSAR) studies of the wild-type and mutant-type bioactivities of HCV NS5A protein tetracyclic inhibitors were carried out with the comparative molecular field analysis and comparative molecular similarity indices analysis methods. A total of 196 tetracyclic inhibitors and three sets of biological activities, namely the EC90 values against wild-type GT-1a, the mutant GT-1a Y93H, and the mutant GT-1a L31V, were collected to build three data sets. The program OMEGA was used to generate 600 conformations for each inhibitor, and the program ROCS was used for molecular alignment. The comparative molecular field analysis (CoMFA) and comparative molecular similarity indices analysis (CoMSIA) methods were used to build models. In the modeling process, we defined four hyper-parameter selection rules and an over-training index to choose the best parameters. For the three data sets GT-1a, GT-1a Y93H, and GT-1a L31V, the coefficients of determination (r²) of the best models were 0.682, 0.779, and 0.782 on the test set, respectively; the standard errors of estimate (SEE) of the
best models were 0.418, 0.608, and 0.560 on the test set, respectively. Based on the contour maps of the three best models, we summarized several suggestions that can simultaneously increase the bioactivity against the wild type and the two mutants: adding a relatively small, non-negative, hydrophobic, hydrogen-bond acceptor group at the para position of the Z group; changing the benzene ring of the Z group to a heterocycle; adding a small electron-withdrawing substituent to the tetracyclic core; adding a relatively small hydrogen-bond acceptor to the isopropyl group; and adding a hydrophobic, non-negative, non-hydrogen-bond-acceptor group to the proline group.

(4) Drug resistance of the mutants Y93H and L31V to the HCV NS5A protein drug elbasvir was studied by molecular docking and molecular dynamics simulation. Based on the PDB and NCBI databases, we manually rebuilt the missing N-terminal residues of the NS5A protein to obtain a complete 3D structure of the wild-type NS5A protein. Subsequently, the 3D structures of the two mutant proteins, Y93H and L31V, were built by manual mutation. Elbasvir was docked into the binding pockets of the three proteins to build the complex systems. The Amber program was used to perform molecular dynamics simulations of the three complex systems, including energy minimization, heating, constant pressure, equilibration, and an 80 ns production simulation. Based on the RMSD analyses, the MM/GBSA binding free energy calculation and decomposition, and the binding mode analyses, we summarized the main reasons for the reduced bioactivity of elbasvir against the mutants: for the mutant Y93H, electrostatic repulsion occurred between the imidazole of H93 and the imidazole of elbasvir, causing the drug molecule to shift and flip in the pocket; for the mutant L31V, V31 increased the flexibility of the protein linker, which reduced the interaction between the linker of the NS5A protein and the cap group of
elbasvir.

(5) Classification models of the bioactivities of cyclooxygenase-2 (COX-2) inhibitors were built by the support vector machine and random forest methods, and clustering analyses of the molecular scaffolds of COX-2 inhibitors were carried out with the K-means and t-distributed stochastic neighbor embedding methods. A data set of 2925 COX-2 inhibitors and their IC50 bioactivity values was collected. The inhibitors were labeled as highly active or weakly active using 1 µM as the threshold. For each molecule, the MACCS fingerprints, the ECFP4 fingerprints, and the CORINA descriptors were calculated. Two machine learning methods, the support vector machine (SVM) and the random forest (RF), were used to build models. We further divided the 2925 inhibitors into highly, intermediately, and weakly active inhibitors using 0.1 µM and 10 µM as thresholds, and removed the intermediately active inhibitors to obtain 1630 inhibitors for building classification models. The best model was the random forest model built on ECFP4 fingerprints, and its Matthews correlation coefficient was 0.68 on the external test set. In addition, the K-means clustering method and the t-distributed stochastic neighbor embedding method were used to cluster the MACCS fingerprints of the 2925 inhibitors into 8 subsets. Based on the descriptor analysis and the clustering analysis, we concluded that aromatic nitrogen atoms, halogen atoms, sulfur-containing double bonds, and oxygen-containing double bonds were favorable to bioactivity, while hydroxyl groups and non-aromatic double-bonded nitrogen atoms were unfavorable.

In summary, this thesis focuses on two drug targets, the hepatitis C virus NS3/4A protease and the NS5A protein. For the NS3/4A protease target, we use a ligand-based drug design method to establish bioactivity prediction models and use them for virtual screening to discover new compounds. For the NS5A protein target, we use ligand-based and receptor-based drug design methods to
explore the differences in the bioactivities of marketed drug molecules against the wild-type and mutant proteins and to explore the reasons why the mutations confer resistance to inhibitors. In addition, this thesis covers a third drug target, the anti-inflammatory target cyclooxygenase-2, for which cheminformatics research was conducted. We hope this thesis provides a useful reference for the design of HCV NS3/4A protease inhibitors, HCV NS5A protein inhibitors, and COX-2 inhibitors.
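
The abstract reports three evaluation metrics without defining them: r² and SEE for the regression models, and the Matthews correlation coefficient for the COX-2 classifiers. The sketch below shows one common way to compute each from observed and predicted values. It is an illustration only, not the thesis's own code: the function names are hypothetical, r² is assumed to be 1 - SS_res/SS_tot (some QSAR work uses the squared Pearson correlation instead), and the degrees-of-freedom correction in SEE is an assumption because the abstract does not state which one was used.

```python
import math
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def standard_error_of_estimate(
    y_true: Sequence[float], y_pred: Sequence[float], dof_correction: int = 2
) -> float:
    """sqrt(SS_res / (n - dof_correction)).

    dof_correction=2 is an assumption; pass 0 to get the plain RMSE.
    """
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(ss_res / (len(y_true) - dof_correction))


def matthews_corrcoef(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews correlation coefficient for binary labels (1 = highly active)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the coefficient is set to 0 when any marginal count is 0.
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Made-up values, used only to exercise the functions.
    observed = [1.0, 2.0, 3.0, 4.0]
    predicted = [1.5, 2.0, 3.0, 3.5]
    print(r_squared(observed, predicted))
    print(standard_error_of_estimate(observed, predicted))
    print(matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 0]))
```

The Matthews correlation coefficient uses all four cells of the confusion matrix, so it stays informative when the highly active and weakly active classes differ in size.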
Keywords/Search Tags:hepatitis C virus NS3/4A protease inhibitor, hepatitis C virus NS5A protein inhibitor, wild type and mutant type protein inhibitors, quantitative structure-activity relationship model, machine learning