| The influenza virus genome consists of eight gene segments of varying lengths, with a total length of approximately 13 kb. Because of the particular synthesis mechanism of the viral polymerase, viral genes are prone to point mutations, which further drive rapid variation of the virus through genomic reassortment, changing its biological properties and threatening human health. Two biological properties closely related to public health currently deserve attention: 1) the risk that naturally occurring avian influenza viruses spill over and infect humans, and 2) the pathogenicity of influenza B viruses transmitted between humans. Genetic and protein sequence data can be viewed as strings over specific character sets, which allows machine learning methods to be used to model and predict the biological properties of infectious agents for early surveillance and prevention. This study establishes predictive models for two scientific questions: the spillover risk of avian influenza viruses and the pathogenicity of influenza B viruses. Specifically, the study aims to: 1) Construct a deep learning-based predictive model for the spillover risk of avian influenza viruses. Genomic data of avian influenza viruses are selected, and the data set is divided into clades based on phylogenetic relationships. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are combined to represent the genomic sequences, and models are trained and tested on the clade-specific data sets and on the entire data set, respectively. Experimental results show that the clade-specific models perform well on the data sets of their own clades, with AUROC (area under the receiver operating characteristic curve) and AUPR (area under the precision-recall curve) values exceeding 0.966 and 0.876, respectively, but have limited generalization ability. The global model achieves AUROC and AUPR values of 1.000
for all clades except H9N2. Ablation experiments show that the attention mechanism and the sequence embedding method have a significant impact on model performance. Further tests of generalization ability show that the transfer model achieves AUROC and AUPR values above 0.984 and 0.941, respectively. Finally, the attention weight matrices are visualized to provide interpretability for the model. 2) Propose an ensemble learning-based model for predicting the pathogenicity of influenza B viruses. A data set of influenza B virus protein sequences was constructed, and 40 critical amino acid positions were selected using entropy-based ranking. Two types of information features, class information and probability information, were generated using the random forest method, and the optimal feature subset was selected using the Minimum Redundancy Maximum Relevance (mRMR) algorithm. Using the sequential forward search algorithm, the class information features were reduced to four dimensions, with an accuracy (ACC) of 94.2% and a Matthews correlation coefficient (MCC) of 88.4%. The probability information features were reduced to three dimensions, with an ACC of 94.1% and an MCC of 88.2%. The optimal feature subset outperformed the individual original features. Furthermore, the sequential forward search algorithm was compared with two common ensemble learning methods, and the optimal subset obtained by sequential forward search showed relatively good performance. |
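
The entropy-based ranking of amino acid positions mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact entropy definition or the alignment procedure, so the code below assumes Shannon entropy computed per column of equal-length aligned protein sequences. The function names `column_entropy` and `rank_positions_by_entropy` and the toy sequences are hypothetical and are not taken from the study.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residues observed at one alignment column."""
    counts = Counter(column)
    total = len(column)
    # sum of p * log2(1/p) over the observed residues
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def rank_positions_by_entropy(aligned_seqs, top_k):
    """Return the top_k (position, entropy) pairs, highest entropy first.

    Assumes all sequences are already aligned to the same length.
    """
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to equal length")
    entropies = [
        column_entropy([seq[i] for seq in aligned_seqs]) for i in range(length)
    ]
    # Python's sort is stable, so ties keep their original position order
    ranked = sorted(range(length), key=lambda i: entropies[i], reverse=True)
    return [(i, entropies[i]) for i in ranked[:top_k]]


# Toy aligned sequences, invented for illustration only
toy_alignment = ["MKAI", "MKTI", "MRAI", "MKSI"]
top_positions = rank_positions_by_entropy(toy_alignment, top_k=2)
```

In the study the analogous step keeps 40 positions; here `top_k=2` selects the two most variable columns of the toy alignment. Positions with higher entropy vary more across strains, which is the usual motivation for ranking them first.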