
Prediction Of Coronavirus Host Classification Based On Spike Protein Sequence And Machine Learning

Posted on: 2022-11-25
Degree: Master
Type: Thesis
Country: China
Candidate: Z B Wang
GTID: 2480306773481154
Subject: Automation Technology
Abstract/Summary:
Coronavirus disease 2019 (COVID-19) is a respiratory infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Coronaviruses pose a serious threat to human and public health because of their cross-species transmission to a variety of mammals, especially humans. Rapid and accurate prediction of coronavirus hosts is therefore of great significance for the prevention and control of future epidemics.

This study collected data from the virus database of the National Center for Biotechnology Information (NCBI). A total of 19,385 coronavirus spike protein sequences dated from January 1, 2000 to September 25, 2020 were obtained. The data were divided into human origin and non-human origin according to the species from which each sequence was isolated. The CD-HIT software was used to remove duplicate and redundant sequences. The data set was split into a training set and a test set at a ratio of 8:2, according to random order and collection time. Sequence features of the spike proteins were then extracted using protein descriptors and the natural language model Seq2Vec. Several machine learning methods, including support vector machine (SVM), logistic regression (LR), and random forest (RF), together with a deep learning method, the gated convolutional neural network (GCNN), were used to build the classification models. The training set was used for 100 repetitions of 5-fold cross-validation, and the test set was used for model evaluation. Seq2Vec-GCNN was selected as the best model, with an AUC of 0.9818, sensitivity of 90.06%, specificity of 100%, and accuracy of 94.45%.

A total of 3,216 sequences were then selected from the deduplicated data set and divided into six categories by host type: humans, swine, avians, bats, camels, and other mammals. After sorting by collection time, the data were split into a training set and a test set at a ratio of 8:2. The distribution descriptor (CTDD) and the natural language model Seq2Vec were used to extract sequence features of the spike protein, and several machine learning methods were used to build the models. As before, the training set was used for 100 repetitions of 5-fold cross-validation, and the test set was used for model evaluation. In predicting the human host, Seq2Vec-GCNN was the best model, with an accuracy of 99.37%. The CTDD-RF model performed best for the other host classes, with accuracies of 95.82% for swine, 95.96% for avians, 98.33% for bats, 92.06% for camels, and 94.01% for other mammals.

These results show that training machine learning models on spike protein sequences is a practical and effective way to classify coronavirus hosts. Our model can predict coronavirus hosts quickly and accurately and can be applied to virus prevention and control in the future. As SARS-CoV-2 variants emerge, the host range may change further, which makes host prediction for coronaviruses all the more important. In conclusion, the results of this study may have considerable reference value for the prediction, prevention, and control of future coronavirus pandemics.
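As an illustration of the evaluation protocol described in the abstract, the following minimal Python sketch shows a time-ordered 8:2 split and the calculation of sensitivity, specificity, and accuracy for the human versus non-human task. The record fields (`collected`, `host`), the function names, and the toy data are assumptions made for illustration only; the abstract does not specify the implementation used in the thesis.

```python
from datetime import date


def chronological_split(records, train_fraction=0.8):
    """Sort records by collection date; the earliest part becomes the training set."""
    ordered = sorted(records, key=lambda r: r["collected"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def binary_metrics(y_true, y_pred, positive="human"):
    """Return sensitivity, specificity and accuracy for a two-class task.

    Assumes both classes occur in y_true, so no denominator is zero.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return {
        "sensitivity": tp / (tp + fn),  # share of human sequences found
        "specificity": tn / (tn + fp),  # share of non-human sequences rejected
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Hypothetical toy records (not thesis data): one sequence per year, 2001 to 2010,
# deliberately listed out of chronological order.
records = [
    {"id": f"seq{i}", "collected": date(2000 + i, 1, 1),
     "host": "human" if i % 2 == 0 else "non-human"}
    for i in range(10, 0, -1)
]
train, test = chronological_split(records)

# Hypothetical predictions: one human sequence is missed, no false positives.
y_true = ["human", "human", "human", "non-human", "non-human"]
y_pred = ["human", "human", "non-human", "non-human", "non-human"]
metrics = binary_metrics(y_true, y_pred)
```

In this toy case no non-human sequence is labelled as human, so specificity is 1 while sensitivity is lower. This is the same pattern as the binary result reported in the abstract, where a specificity of 1 accompanies a sensitivity of 90.06%.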
Keywords/Search Tags:Machine Learning, Coronavirus, Spike Protein, Protein Classification Prediction