| More than 180 million chemicals are registered in the Chemical Abstracts Service, and some of them are mutagenic. Mutations may cause genetic diseases, cancers, stem cell dysfunction and many other diseases that endanger human health. Screening chemicals for mutagenicity is therefore essential for preventing and controlling their health risks. Screening by experimental tests alone is inefficient and costly, and cannot meet the needs of chemical risk management. Quantitative structure-activity relationship (QSAR) models can be used for high-throughput screening of mutagenic chemicals and compensate for these shortcomings. In this study, a variety of machine learning algorithms were used to construct QSAR models that predict two types of mutation: gene mutation and chromosome variation. The main work is as follows: (1) QSAR models were constructed to predict gene mutation. A database containing Ames test results for 7647 compounds was established. Six machine learning algorithms and 14 types of molecular fingerprints were combined to develop 84 individual models (6 × 14). The algorithms were logistic regression, classification and regression tree, naive Bayes, k-nearest neighbor, support vector machine and random forest. The molecular fingerprints were calculated with the RDKit software package and the PaDEL-Descriptor software. The robustness and generalization ability of the models were evaluated by ten-fold cross-validation and external validation. Using the algorithms and fingerprints that performed well in the individual models, 36 ensemble models were developed. The applicability domain was characterized based on molecular similarity, and the mechanisms were analyzed by extracting structural alerts. The results show that the ensemble models perform better than the individual models. The best ensemble model has good robustness and generalization ability. Defining the applicability domain clarified the scope of the 
best ensemble model and can further enhance its generalization ability. Structures such as aromatic nitro groups played an important role in model classification. Typical compounds induce gene mutation mainly through alkylation, intercalation into DNA strands and formation of specific DNA adducts. (2) QSAR models were constructed to predict chromosome variation. A database containing micronucleus test results for 1832 compounds was established, comprising 1392 negative and 440 positive results. Based on this imbalanced data set, 84 individual models were established; the methods of model construction and evaluation were the same as for the individual models above. According to the performance of the individual models, two algorithms suitable for modeling were selected. Three methods for handling imbalanced data (threshold adjustment, under-sampling and over-sampling) were combined with these two algorithms to construct models for predicting chromosome variation. The results indicate that threshold adjustment and under-sampling are better suited to this kind of imbalanced data set. The best models constructed by these two methods have good robustness and generalization ability and can be used to predict chromosome variation. The models established in this study can efficiently screen chemicals for mutagenicity, providing a basis for health risk assessment of chemicals. |
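
The three ways of handling the imbalanced micronucleus data set named in part (2) can be illustrated with a minimal sketch. It is not the implementation used in the study: the function names, the random seed, and the use of balanced accuracy to pick a threshold are all assumptions made for illustration, and only the class counts (1392 negative, 440 positive) come from the abstract. Each function assumes that both classes are present.

```python
import random


def _split_classes(labels):
    """Return (minority_indices, majority_indices) for binary 0/1 labels."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    return (pos, neg) if len(pos) <= len(neg) else (neg, pos)


def under_sample(labels, seed=0):
    """Keep every minority sample plus an equally sized random subset of the majority."""
    rng = random.Random(seed)
    minority, majority = _split_classes(labels)
    return sorted(minority + rng.sample(majority, len(minority)))


def over_sample(labels, seed=0):
    """Keep every sample and duplicate random minority samples until the classes are equal."""
    rng = random.Random(seed)
    minority, majority = _split_classes(labels)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return sorted(list(range(len(labels))) + extra)


def classify(probabilities, threshold=0.5):
    """Turn predicted positive-class probabilities into 0/1 labels."""
    return [1 if p >= threshold else 0 for p in probabilities]


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity (assumed selection criterion)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return (tp / n_pos + tn / n_neg) / 2


def best_threshold(y_true, probabilities, candidates):
    """Threshold adjustment: pick the candidate cut-off with the highest balanced accuracy."""
    return max(
        candidates,
        key=lambda t: balanced_accuracy(y_true, classify(probabilities, t)),
    )


# Class counts taken from the abstract: 1392 negative, 440 positive.
labels = [0] * 1392 + [1] * 440
under_idx = under_sample(labels)  # indices of a balanced subset for training
over_idx = over_sample(labels)    # indices with repeated minority samples
```

Under-sampling and over-sampling change the training set before a model is fitted, whereas threshold adjustment leaves the data untouched and moves the decision cut-off of an already fitted model. Resampling should be applied only to the training folds, so that validation data keep the original class ratio.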