| Over the past two decades, influenza, especially avian influenza, has had a significant impact on the poultry industry, animal husbandry, and other agricultural sectors, causing huge economic losses while also posing a serious threat to human health. The influenza virus is the root cause of influenza and remains an active research topic. Two representative problems are the genotyping of influenza viruses and the identification of the protein interaction pairs involved in human infection. Traditional biological experiments addressing these two problems are time-consuming and labor-intensive, and their accuracy and generality leave room for improvement. With the development of information technology, machine learning has been applied in more and more fields, including research on pathogenic microorganisms. In this paper, machine learning is applied to these two problems for influenza A virus as follows: (1) For the genotyping of influenza A virus, many studies have focused on the model; we instead explore how different features affect the results of machine learning models when predicting the genotype. In contrast to the protein features widely used by existing feature extraction methods, this study uses dinucleotide features based on nucleic acid sequences together with word-vector features based on protein sequences, and applies them to four machine learning classifiers: decision tree (DT), k-nearest neighbors (KNN), naive Bayes (NB), and support vector machine (SVM). The ProtVec method obtains the better results: the accuracy reaches 100% in the prediction of the viral hemagglutinin genotype (H type) and 99.95% in the prediction of the neuraminidase genotype (N type). These results show that the method proposed in this study can effectively predict the genotype of influenza A virus. (2) For the prediction of the interaction between influenza A virus and
human proteins, we continue to use protein sequence-based word vectors as features and examine their performance on the interaction prediction problem. In this part of the study, a dataset of positive and negative samples is constructed first. Because the positive and negative samples are imbalanced, we train on three datasets with positive-to-negative ratios of 1:5, 1:8, and 1:10, again using the four classifiers DT, KNN, NB, and SVM. The final experimental results show that the ProtVec method achieves an accuracy of 90.09% and an F1-score of 90.09% on the 1:10 dataset, an accuracy of 89.13% and an F1-score of 88.89% on the 1:8 dataset, and an accuracy of 83.33% and an F1-score of 83.33% on the 1:5 dataset. These results show that the method proposed in this study can effectively predict the interaction between influenza A virus and human proteins. |
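
To make the feature construction described in the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation. It shows (a) overlapping dinucleotide frequencies for a nucleic acid sequence, (b) a ProtVec-style protein representation obtained by summing the vectors of overlapping 3-grams, and (c) building a 1:k dataset by subsampling negatives. All function names are illustrative assumptions, and the embedding table passed to `sum_trigram_vectors` is a hypothetical toy stand-in for a pretrained 3-gram embedding.

```python
from collections import Counter
from itertools import product
import random

# The 16 possible dinucleotides, in a fixed order (AA, AC, AG, AT, CA, ...).
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def dinucleotide_frequencies(seq):
    """Return the 16 overlapping-dinucleotide frequencies of a sequence."""
    seq = seq.upper().replace("U", "T")  # treat RNA as DNA
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    # Pairs containing ambiguous bases (e.g. N) are ignored.
    total = sum(counts[d] for d in DINUCLEOTIDES)
    return [counts[d] / total if total else 0.0 for d in DINUCLEOTIDES]


def sum_trigram_vectors(protein, table, dim):
    """Sum the vectors of all overlapping 3-grams of a protein sequence.

    `table` maps a 3-gram to a vector of length `dim` (hypothetical here);
    3-grams missing from the table contribute nothing.
    """
    vec = [0.0] * dim
    for i in range(len(protein) - 2):
        for j, x in enumerate(table.get(protein[i:i + 3], ())):
            vec[j] += x
    return vec


def sample_negatives(positives, negatives, ratio, seed=0):
    """Build a 1:ratio dataset of (sample, label) pairs.

    Raises ValueError if there are fewer than ratio * len(positives) negatives.
    """
    rng = random.Random(seed)
    chosen = rng.sample(negatives, ratio * len(positives))
    return [(p, 1) for p in positives] + [(n, 0) for n in chosen]


# Toy usage with made-up sequences and a made-up 2-dimensional table.
toy_table = {"MKT": [1.0, 0.0], "KTI": [0.5, 0.5]}
nucleotide_features = dinucleotide_frequencies("ACGTAC")
protein_features = sum_trigram_vectors("MKTI", toy_table, dim=2)
dataset_1_to_10 = sample_negatives(
    ["p1", "p2"], [f"n{i}" for i in range(30)], ratio=10
)
```

The resulting feature vectors could then be passed to any of the four classifiers named above (DT, KNN, NB, SVM); the choice of library and hyperparameters is left open here because the abstract does not specify them.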