Study On Prediction Of The Risk Of Severe COVID-19 Based On SARS-CoV-2 Evolutionary Analysis

Posted on: 2024-01-28; Degree: Doctor; Type: Dissertation
Country: China; Candidate: M Miao; Full Text: PDF
GTID: 1524307310489704; Subject: Epidemiology and Health Statistics
Abstract/Summary:
Objective: As of April 12, 2023, SARS-CoV-2 (Severe Acute Respiratory Syndrome Coronavirus 2) had caused over 760 million confirmed COVID-19 (Coronavirus Disease 2019) cases and over 6.8 million deaths worldwide. Although its pathogenicity has weakened as the virus has evolved, it continues to circulate at a low epidemic level. During epidemic peaks, the rise in case numbers, and especially in the number of critically ill patients, places enormous strain on health systems worldwide. Continuous monitoring of the evolutionary characteristics of SARS-CoV-2 is therefore needed to provide evidence for epidemic prevention and control and for developing and updating vaccines and antiviral drugs. In addition, based on these evolutionary characteristics, this study investigates methods for accurately identifying the evolutionary branches of SARS-CoV-2, which is of great significance for monitoring its evolutionary diversity and effectively blocking its transmission. Finally, this study constructs a COVID-19 severe risk prediction model based on the SARS-CoV-2 genome sequence and typing, providing a basis for identifying patients at risk of severe COVID-19 early, intervening sooner, optimizing treatment, and effectively reducing severe illness and death among COVID-19 patients.

Methods: (1) This study was based on over two million complete SARS-CoV-2 genome sequences from the GISAID (Global Initiative on Sharing All Influenza Data) database. The SARS-CoV-2 genome sequence dataset was assembled through sequence data collection, sample screening, and multiple sequence alignment. Recombination analysis, mutation analysis, selective pressure analysis, associated mutation analysis, and phylogenetic analysis were then applied to study the evolutionary characteristics of SARS-CoV-2. A sliding time-window model was further applied to analyze key mutations, protein diversity, selective pressure, and the prevalence trends of important clades. In addition, this study used frequency switching to extract positive selection mutations and analyzed the correlation between mutations; clustering methods were then applied to obtain mutation clusters, and the adaptive evolutionary characteristics of these clusters were analyzed based on phylogenetic trees and the epidemic trends of the variants of concern.

(2) Based on the evolutionary analysis of SARS-CoV-2, this study adopted supervised learning to construct a SARS-CoV-2 genome sequence typing model. For tasks with fewer clades, a multi-layer template matching algorithm was applied to achieve accurate sequence recognition, together with a difference matrix for measuring the distance between branches. This study also applied machine learning-based methods to construct a more general genotype typing model and proposed a lightweight data structure based on nucleotide site mutations to reduce the computational cost. To further improve the classification accuracy and generalization of the model, multiple machine learning-based methods were integrated into an ensemble, and the ensemble weights were optimized through cross-validation.

(3) Based on the research on the evolutionary characteristics and genotype recognition of SARS-CoV-2, this study applied machine learning-based methods to construct a COVID-19 severe risk prediction model related to the SARS-CoV-2 genome sequence. Four machine learning methods (random forest, LightGBM, XGBoost, and GPBoost) were used to build the prediction model, and the integrated model was obtained through a weighted combination of their predicted probabilities. Finally, model interpretability analysis was used to quantify the impact of patient age, gender, SARS-CoV-2 amino acid sites, and sequence typing on the disease severity of COVID-19 patients.

Results: (1) SARS-CoV-2 had a large number of mutations throughout the entire genome, distributed across all known proteins. From December 2019 to January 2023, the mutation rate of the four structural proteins of SARS-CoV-2 showed an overall upward trend, with a large increase while the Omicron variant was prevalent. The selective pressure analysis showed significant positive selection in the coding regions of the spike glycoprotein and the nucleocapsid protein. This study identified 38 positive selection mutations, comprising 29 amino acid substitutions and 9 amino acid deletions. The spike glycoprotein had the most positive selection mutations (50% of the total), followed by the nucleocapsid protein. Clustering these 38 positive selection mutations yielded 5 associated mutation clusters, which were related to the epidemic trends of SARS-CoV-2 variants of concern and closely related to the branch nodes of the evolutionary tree.

(2) For SARS-CoV-2 genome sequence typing, the ensemble models achieved typing accuracy (measured by F-score) of 99.894%, 97.583%, and 98.436% for Nextstrain, GISAID, and Pango, respectively. For samples with low coverage (more than 1% of nucleotide sites of unknown type), the typing accuracy was 99.147%, 96.960%, and 93.416%, respectively. The lightweight data structure based on nucleotide site mutations was significantly superior to the data structure based on one-hot coding in terms of model training and computational efficiency, and it also generalized better on low-coverage test sets. The two-layer template matching algorithm proposed in this study achieved high recognition accuracy and computational efficiency in Nextstrain and GISAID typing, and it can effectively measure the differences between evolutionary branches. In addition, by introducing sub-models, extended models can be constructed quickly to cope with emerging evolutionary branches of SARS-CoV-2.

(3) For COVID-19 severe risk prediction, the ensemble model achieved an F-score of 88.842% and an area under the curve (AUC) of 0.956 on the global data. The test results on the data sets for individual continents were: Asia (F-score: 94.643%, AUC: 0.984), Europe (F-score: 89.491%, AUC: 0.957), North America (F-score: 84.566%, AUC: 0.921), South America (F-score: 91.232%, AUC: 0.963), and Africa (F-score: 86.085%, AUC: 0.939). The model interpretability analysis indicated that male gender, older age, and non-vaccination all increased the risk of severe COVID-19. In terms of clades, GK (Delta) increased the risk of severe illness, while GRY (Alpha) and GRA (Omicron) reduced it. Some amino acid mutations had a significant impact on the severity of COVID-19. Taking the spike glycoprotein as an example, mutations such as N501Y and P681H increased the risk of severe illness, while mutations such as T19I and E484K reduced it.

Conclusions: (1) SARS-CoV-2 has accumulated a large number of mutations, some of which have improved the adaptability of the virus and have been preserved. There are associated mutations within and between multiple proteins of SARS-CoV-2, and these can be grouped into distinct clusters. The associated mutation clusters are closely related to the epidemic trends of SARS-CoV-2 variants of concern and to the branch nodes of the evolutionary tree. Continuous monitoring of the evolutionary characteristics of SARS-CoV-2 is of great significance for epidemic prevention and control, as well as for the development and updating of vaccines and antiviral drugs.

(2) Based on the analysis of the evolutionary characteristics of SARS-CoV-2, this study constructed a SARS-CoV-2 sequence typing model, and the typing accuracy of the ensemble model was generally higher than that of the single models. The proposed typing system can efficiently and accurately identify the evolutionary branches of SARS-CoV-2, which supports epidemiological analysis and provides a reliable basis for classification and tracing in order to block transmission effectively.

(3) This study also constructed a disease severity prediction model for COVID-19 patients based on SARS-CoV-2 genome sequence and typing, and the ensemble model achieved the highest prediction accuracy. The interpretability analysis showed that the mutation characteristics of SARS-CoV-2, typing characteristics, patient age, gender, and vaccination all affect the disease severity of patients to varying degrees. The proposed model helps identify COVID-19 patients at high risk of severe illness early, thereby helping to reduce the rates of severe illness and death.
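The weighted combination of predicted probabilities mentioned in the Methods can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the function names (`weighted_ensemble`, `f_score`, `search_weights`), the coarse weight grid, the 0.5 decision threshold, and the toy probabilities are all assumptions made for illustration, and a single held-out set stands in for the cross-validation the abstract describes.

```python
from itertools import product


def weighted_ensemble(probas, weights):
    """Blend per-model probabilities of the positive (severe) class.

    probas[m][i] is model m's predicted probability for sample i;
    weights holds one non-negative weight per model and is normalized here.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]
    n_samples = len(probas[0])
    return [
        sum(norm[m] * probas[m][i] for m in range(len(probas)))
        for i in range(n_samples)
    ]


def f_score(y_true, y_prob, threshold=0.5):
    """F1 score of the predictions obtained by thresholding y_prob."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def search_weights(probas, y_true, grid=(0.0, 0.5, 1.0)):
    """Return the grid weight vector with the best F1 on held-out labels."""
    best_w, best_f = None, -1.0
    for w in product(grid, repeat=len(probas)):
        if sum(w) == 0:
            continue  # an all-zero weight vector cannot be normalized
        f = f_score(y_true, weighted_ensemble(probas, w))
        if f > best_f:
            best_w, best_f = w, f
    return best_w, best_f


# Toy held-out data (hypothetical): each model alone misclassifies two samples.
y_val = [1, 0, 1, 0]
model_a = [0.9, 0.6, 0.4, 0.2]
model_b = [0.3, 0.1, 0.8, 0.6]

weights, score = search_weights([model_a, model_b], y_val)
blended = weighted_ensemble([model_a, model_b], weights)
```

In this toy case each model alone reaches an F-score of 0.5, while the equally weighted blend classifies all four samples correctly. In practice the per-model probabilities would come from the fitted classifiers, and the weights would be selected on cross-validation folds, not on the test data.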
Keywords/Search Tags:SARS-CoV-2, COVID-19, Evolution, Clades typing, Risk of severe disease, Machine learning, Ensemble model