| Single nucleotide polymorphism (SNP) sites are an important basis for studying genetic variation in human families, animals, and plants. They are therefore widely used in population genetics, disease-related gene research, and other fields, and play an important role in pharmacogenomics, diagnostics, and biomedicine. In pharmacogenomics research, identifying SNP site-drug associations is the key to clinical precision medicine. However, traditional biological experiments are costly and inefficient, and verifying the associations between large numbers of SNP sites and drugs experimentally involves a certain degree of blindness, which prevents their wide use in practice. In recent years, bioinformatics techniques such as machine learning and data mining have opened up new and efficient strategies for predicting SNP site-drug associations. This article therefore proposes a machine learning-based SNP site-drug association prediction algorithm. The main research contents of the article are as follows. First, SNP sites and drug molecules are characterized numerically. Because SNP sites and drug molecules are stored in databases as strings, they cannot be fed directly into a classifier as feature vectors for prediction. This article therefore proposes a k-mer-based numerical characterization method for SNP sites and adopts a molecular fingerprint-based method for the numerical characterization of drug molecules. These methods describe the essential properties of the SNP site and drug data and provide a sound data basis for the subsequent feature extraction algorithms. Second, feature extraction is performed on the SNP site-drug features. After numerical characterization, the SNP site feature information contains noise and is high-dimensional. This article therefore proposes an SNP site-drug feature extraction algorithm based on Denoising Variational Auto-Encoders (DVAE), which generates compact features without losing
biological information. Next, the extracted SNP site features and the drug molecule features are fused into SNP site-drug fusion features, which are fed into a random forest classifier for training, validation, and testing. To evaluate the feature extraction ability of the denoising variational auto-encoder, a five-fold cross-validation experiment was performed on the model, and good results were obtained. Different feature extraction algorithms and different classifiers were then compared. The results show that the feature extraction algorithm proposed in this article extracts the features of SNP sites accurately and efficiently and improves the accuracy of SNP site-drug association prediction. Finally, an SNP site-drug association prediction model is constructed. To further improve prediction accuracy, this article proposes an SNP site-drug association prediction model based on Stacking ensemble learning. The first layer uses four base classifiers (support vector machine, decision tree, random forest, and XGBoost) for prediction; the second layer uses logistic regression as a meta-classifier, trained on the predicted values produced by the first layer, to build the Stacking ensemble model. The results show that, compared with the aforementioned five single classifiers, the Stacking model constructed in this article effectively improves the prediction accuracy of SNP site-drug associations and offers greater reference value for practical applications. |
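
The k-mer-based numerical characterization of SNP sites described in the abstract can be illustrated with a minimal sketch. The abstract does not state the value of k, the alphabet, or the normalization used, so the choices here (the four DNA bases, k = 3 by default, relative frequencies) and the function names `kmer_vocabulary` and `kmer_features` are assumptions made for illustration only, not the article's actual implementation.

```python
from itertools import product

# Assumption: the sequence around an SNP site is written with the four DNA bases.
ALPHABET = "ACGT"


def kmer_vocabulary(k):
    """Return all 4**k possible k-mers in lexicographic order."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]


def kmer_features(sequence, k=3):
    """Map a sequence string to a fixed-length vector of k-mer frequencies.

    Windows containing symbols outside ALPHABET (for example 'N') are skipped.
    If no valid window exists, an all-zero vector is returned.
    """
    vocab = kmer_vocabulary(k)
    index = {kmer: i for i, kmer in enumerate(vocab)}
    counts = [0] * len(vocab)
    seq = sequence.upper()
    # Slide a window of width k over the sequence, one position at a time.
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if kmer in index:
            counts[index[kmer]] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(vocab)
    # Normalize counts so sequences of different lengths are comparable.
    return [c / total for c in counts]


# Hypothetical example sequence, used only to show the call.
example_vector = kmer_features("ACGTACGTNACG", k=2)
print(len(example_vector))
```

Every sequence is mapped to a vector of length 4**k regardless of its own length, which is what allows string-valued SNP site records to be passed to a feature extractor or classifier. A drug fingerprint vector could then be concatenated with this vector to form a fused feature, although the abstract does not specify how the fusion is done.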