The Study And Application Of Metagenomics-based Taxonomic Database And A Rapid Pathogen Identification System

Posted on: 2022-02-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: P H Li
Full Text: PDF
GTID: 1484306566991949
Subject: Pathogen Biology
Abstract/Summary:
In recent years, frequent outbreaks of infectious diseases, especially emerging ones, have posed serious challenges to public health worldwide. Rapid and accurate identification of pathogenic agents is an important prerequisite for addressing emerging infectious diseases. Traditional pathogen detection methods remain inadequate for this task: isolation and culture are time-consuming and applicable to few microorganisms, serological diagnosis has low specificity, and PCR assays can only detect known species, which makes unknown and highly variable pathogens difficult to handle. As an emerging technology, high-throughput sequencing-based metagenomic pathogen identification can detect nucleic acid sequences of all species in a sample and support analysis of pathogen drug resistance, virulence, molecular typing and phylogeny, so it has broad application prospects in infectious agent detection, surveillance and traceability.

Metagenome sequencing is now widely used for pathogen identification. It often generates large amounts of data, and the database used for sequence alignment is the key to pathogen identification. The existing nt database, which covers all species, is large and redundant and consumes substantial computational resources and time, while the RefSeq reference sequence database and other marker databases do not cover all species and are prone to missed detections. Metagenomic identification of pathogens includes sequence alignment, species classification, genome assembly and other steps, each of which requires additional bioinformatic analysis tools. The local alignment method BLAST is inefficient; Burrows-Wheeler transform-based methods improve alignment efficiency but depend heavily on the reference sequence. Existing classification software only counts the number of sequences per species, which makes it difficult to determine the pathogen directly or to type it precisely. In addition, for unknown pathogens, current methods use de novo (ab initio) assembly to obtain the genome and then BLAST comparison to identify it, but the large volume of metagenomic data and the high abundance of background sequences such as host sequences make de novo assembly difficult. There is an urgent need for tools that can identify pathogens quickly, efficiently and accurately, and that can systematically analyze key pathogenic features such as pathogen type, virulence and drug resistance. Based on metagenomics, this study relies on a laboratory sequencing platform and a high-performance server to conduct research in the following three areas.

1. Metagenomic classification database construction

Nucleic acid sequence data were obtained from various public databases, indexed for different software tools, and classified and graded by database type to meet different detection needs. The first level is the NCBI nt database, which has a large data volume and is cut into sub-data modules. The second level is the reference genome database, in which the NCBI RefSeq database is divided into five categories according to species type. The third level consists of sub-databases for specific purposes, including a database of key pathogens and databases of pathogen characteristic sequences such as drug resistance, virulence and typing sequences. Non-redundant databases were constructed for common bacterial pathogens by removing conserved sequences of the same species and retaining unique sequences, in order to reduce the database size and improve matching efficiency while limiting the loss of matching accuracy. Compared with the whole-genome data, the non-redundant pathogen database reduced the data size to 50.57% and the matching time to 46.48% of the original database, at the cost of losing 2.11% of correctly classified sequence data. Homology analysis of important disease-causing viruses was performed to calculate the homology of viruses between different families and genera according to the virus classification, and the threshold for pathogen identification was determined from the homology range to provide reference data for the subsequent identification of unknown pathogens.

2. Establishment, evaluation and validation of the metagenomic pathogen identification process (MPIP)

The metagenomic pathogen identification process MPIP is divided into four modules: metagenomic classification, pathogen characterization, misclassification error correction and iterative assembly of unknown pathogens. The metagenomic classification module builds on existing tools for downstream analysis; it mainly removes irrelevant background data such as host and synthetic sequences, classifies reads by family, and generates species classification profiles and visualization results based on the number of specific sequences. The pathogen characterization module determines pathogen identity, drug resistance genes, virulence factors and other information based on the hierarchical sub-databases. The misclassification error correction module targets species with a low number of matching reads or mis-matched species. MPIP is written in Python, runs on the Linux operating system and provides a visual operator interface; it writes metagenomic species classification results, visual charts and other outputs to folders organized by module function.

Simulated samples were used to analyze the feasibility of MPIP. Simulated samples constructed with different concentration gradients of viruses, bacteria and fungi were subjected to metagenome sequencing, and pathogen detection and pathogen characterization were performed with MPIP to determine the identity of the known species in the simulated samples. Metagenomic sequencing was then performed on 41 SARI (severe acute respiratory infection) samples from Beijing to validate the detection capability of MPIP on real samples, and the results were compared with those of the metagenomic classification tool Centrifuge. Compared with the Centrifuge species classification method, MPIP accurately identified human herpesvirus type 4 and also reported genome mapping coverage information.

For the unknown-pathogen scenario, simulated data were analyzed against a database from which SARS-CoV-2 had been excluded, and the simulated data were assembled iteratively to determine the closely related species and obtain the whole genome of SARS-CoV-2. In addition, six real SARS-CoV-2 samples were subjected to metagenomic sequencing to further validate MPIP, and the simulated data and the real samples were used to identify herpesvirus 4 using MEGAHIT, which follows the traditional de novo assembly method. The results were compared: with real samples, iterative assembly reached a genome completeness of 94.01% to 96.91%, compared with 66.51% to 88.62% for de novo assembly, so iterative assembly outperformed the traditional de novo assembly method.

3. Metagenomic pathogen identification applications

MPIP was applied in practice to infectious disease prevention and control and to clinical diagnosis. First, in an outbreak investigation, metagenome sequencing determined that the causative pathogen was human adenovirus type 55, and traceability analysis was performed, providing a reference for epidemic prevention and control. For clinical samples from unexplained lung infections, MPIP was used for metagenomic pathogen identification and for pathogen characterization such as drug resistance gene detection; Klebsiella pneumoniae and Pseudomonas aeruginosa were identified, and drug resistance genes such as blaSHV-12, aac(3)-IIa and blaKPC-2 were detected. Pseudomonas aeruginosa was also isolated by traditional isolation and culture methods. The clinicians adjusted the antibiotic treatment based on the test results, and the patient improved rapidly. In a case of unexplained death, BALF, sputum and blood samples were taken for pathogen confirmation. MPIP detected the causative pathogen Chlamydia psittaci in all three samples, whereas cell culture and PCR assays were positive for the BALF sample and negative for the sputum and blood samples, indicating that metagenomic sequencing combined with MPIP has better pathogen identification ability.

The main contributions of this study are as follows:
(1) Constructing a highly applicable local classified and graded database and establishing a non-redundant database for common pathogenic bacteria, which compresses the data volume and shortens alignment time without significantly reducing identification accuracy.
(2) Conducting homology analysis and proposing homology metric thresholds for viruses, which provides a theoretical basis for identifying new viruses.
(3) Establishing MPIP, a metagenomic pathogen identification process, which overcomes the difficulty that traditional metagenomic classification software, although fast, has in accurately typing pathogens and identifying drug resistance and virulence information, and which provides important technical support for epidemic prevention and control and for clinical diagnosis and treatment.
(4) Proposing, for the first time, an iterative assembly method for identifying unknown pathogens, which is superior to the traditional de novo assembly method in terms of the completeness of assembled genomes and provides a new technical means of dealing with newly emerging pathogens.
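The abstract reports genome assembly completeness as a percentage of the target genome but does not say how the figure was computed. One common reading is breadth of coverage: the share of reference positions covered by at least one assembled contig. The sketch below illustrates that reading only. The function names, the half-open interval convention and the toy coordinates are assumptions made for illustration and are not taken from MPIP.

```python
from typing import Iterable, List, Tuple

# Assumed convention: half-open [start, end) intervals, 0-based, on the reference.
Interval = Tuple[int, int]


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or adjacent intervals so no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if end <= start:
            continue  # skip empty or reversed intervals
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def genome_completeness(intervals: Iterable[Interval], genome_length: int) -> float:
    """Return the percentage of reference positions covered by any interval."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    covered = 0
    for start, end in merge_intervals(intervals):
        # Clip each merged interval to the reference bounds before counting.
        start, end = max(start, 0), min(end, genome_length)
        covered += max(end - start, 0)
    return 100.0 * covered / genome_length


if __name__ == "__main__":
    # Hypothetical contig alignments on a 1000-base toy reference.
    toy_alignments = [(0, 400), (300, 700), (800, 1000)]
    print(f"{genome_completeness(toy_alignments, 1000):.2f}%")
```

In practice the intervals would come from aligning assembled contigs to the reference genome. Under this definition, completeness figures for iterative and de novo assemblies are comparable only when both are measured against the same reference.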
Keywords/Search Tags: metagenomics, high-throughput sequencing, database construction, pathogen identification process, unknown pathogen detection