| A virus is a type of microbe that can replicate only in living host cells. Viruses are very small and structurally simple, consisting of one type of nucleic acid (DNA or RNA) and proteins, with the exception of prions (which contain only protein). Viruses are highly diverse and infect a wide range of host cells. As the carrier of genetic information, the viral sequence is the core data for viral research. With the spread of high-throughput sequencing technology, deep sequencing has become a main method for studying the heredity and evolution of viruses. Faced with the large amount of data produced by high-throughput sequencing, bioinformatics analysis is required to extract as much useful information about the viral genome as possible. The purpose of this study is to explore a series of bioinformatics methods for viral genome analysis from the different types of data produced by high-throughput sequencing. The methods were developed on high-throughput sequencing data generated by our research group and were chosen according to the purpose of each analysis. The thesis is divided into two parts: 1. Analysis of active prophages from bacterial high-throughput sequencing data; 2. Discovery and genome analysis of viruses from high-throughput sequencing data of complex origin. 1. Analysis of active prophages from bacterial high-throughput sequencing data. A lysogenic phage is a virus that can integrate its genome into the genome of its host bacterium. It can also excise from the host genome and give rise to new phage particles. This replication mechanism gives lysogenic phages the ability to mediate gene transfer. Lysogenic phages have an important impact on host pathogenicity: in EHEC O104:H4, for example, the main toxin genes are encoded by a prophage. In this research, genome sequencing data from 72 bacterial strains were analyzed; the bacteria were isolated from the
patient suffering from foot ulcers. Based on the genomic characteristics of lysogenic phages, we identified several novel lysogenic phages and their integration sites in the host bacterial genomes. Data analysis was carried out with established commercial or free software and with in-house programs. NGS QC Toolkit v2.3.3 was used for quality control of the raw sequencing data, filtering out short and low-quality reads. Given the characteristics of Ion Torrent data, the commercial software Newbler v3.0 was selected for assembly. A prophage prediction tool written in Perl was then applied to the assembled contigs. To obtain the active prophage genome sequences, the ContigScape plugin was used to display the relationships between the assembled contigs, and CLC Genomics Workbench 9 was used to adjust the sequences and check the results. The contigs were joined with in-house software to obtain the phage and host genomes. The active prophage genomes were annotated with RAST. Finally, we analyzed the genome information, integration sites, and evolutionary relationships of the active prophage genomes to extract further information. Among the 72 selected bacterial strains, 11 were found to carry active prophages in their genomes. Through assembly and contig joining, the whole-genome sequences of 14 active prophages were obtained, 11 of which were novel. The results showed that our method can accurately predict active prophages, which can increase our knowledge of lysogenic phages. Our findings were consistent with the observation that the phage integrase gene lies close to the integration site. The integration site sequences differed considerably from one another but correlated with the integrase: lysogenic phages with similar integrases can integrate at the same site, which suggests a new approach to prophage prediction. Phages whose bacterial hosts belong to the
same genus had similar genomic structures. 2. Discovery and genome analysis of viruses from high-throughput sequencing data of complex origin. Because virus isolation takes a long time and has a low success rate, high-throughput sequencing often has to be performed directly on complex samples. Extracting the useful viral information from such data is challenging. In recent years, our research group has explored pathogen detection in clinical samples by high-throughput sequencing. The nature of clinical samples requires that data analysis detect pathogens quickly and accurately. At present, no single bioinformatics tool meets the needs of analyzing sequencing data from complex samples. To address this, we developed a data analysis program named "Pathogen Classification Software using high-throughput sequencing data v1.0". The software can detect four types of pathogens: bacteria, fungi, protozoa, and viruses. It also performed well on sequencing data from samples of unknown infection. The imported case of Rift Valley fever virus identified in Beijing in July 2016 is a successful example of discovering a known virus in a complex sample. Analysis of the sequencing data with our software found a large number of Rift Valley fever virus reads, confirming that Rift Valley fever virus was the pathogen, and the whole-genome sequence of the strain was obtained at the same time. This strain had the highest homology with the Kakamas strain found in South Africa in 2009. Phylogenetic analysis showed that no rearrangement had occurred in this strain. The discovery of Menghai rhabdovirus is a successful example of discovering an unknown virus in a complex sample. The virus was isolated from Aedes albopictus captured in
Menghai County, Yunnan Province. The resulting supernatant was blind-passaged in C6/36 cells. Reverse transcription-polymerase chain reaction with universal primers for common viruses yielded no positive results. By analyzing the high-throughput sequencing data and excluding reads from host cells, bacteria, and other interfering sources, the whole-genome sequence of Menghai rhabdovirus was obtained. Genomic analyses demonstrated that Menghai rhabdovirus is a novel species of the family Rhabdoviridae. It was most similar to two other rhabdoviruses isolated from mosquitoes in Peru. To characterize Menghai rhabdovirus, a terminal sequence analysis of 93 representative rhabdoviruses was also performed. Of these, 45 had short inverted repeat termini, distributed across all 11 genera and the unassigned groups. Lyssaviruses had a highly consistent terminal sequence, “ACGCTTAAC”, while “ACGAAGA” termini were found in four genera: Ephemerovirus, Vesiculovirus, Tibrovirus, and Sprivivirus. Viral termini are usually related to the replication mechanism and therefore tend to be functionally critical, which suggests that short inverted repeat termini are likely a feature of Rhabdoviridae genomes. In summary, building on existing methods of viral genome analysis, this thesis established a new analysis method that obtains the whole-genome sequences of active prophages and their integration sites from bacterial high-throughput sequencing data. This method can be used to discover novel lysogenic phages and provides new knowledge about them. “Pathogen Classification Software using high-throughput sequencing data v1.0” was developed and performed well in detecting unknown pathogens. Through data analysis, a novel rhabdovirus was found, and a terminal sequence analysis was carried out across the family Rhabdoviridae. Methods for viral genome analysis still need to be designed
according to the research object and the analysis needs. It is hoped that the methods and conclusions of this thesis can serve as a reference for other researchers. |
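
The terminal sequence analysis described above asks whether the 5′ end of a genome is the reverse complement of its 3′ end. The following is a minimal sketch of that check, not the procedure used in the thesis: the function names, the `max_len` cap, and the toy genome are hypothetical, and only perfect complementarity is detected, whereas real rhabdovirus termini may be only partially complementary.

```python
# Complement table for DNA bases; other symbols (e.g. N) are left unchanged.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def terminal_inverted_repeat(genome: str, max_len: int = 30) -> str:
    """Return the longest 5' prefix that is perfectly complementary to the 3' end.

    max_len is an illustrative cap on the repeat length, not a value from the thesis.
    """
    seq = genome.upper()
    rc = reverse_complement(seq)
    limit = min(max_len, len(seq) // 2)
    n = 0
    # rc[n] is the complement of the n-th base counted from the 3' end,
    # so equality means 5' base n pairs with 3' base n.
    while n < limit and seq[n] == rc[n]:
        n += 1
    return seq[:n]


# Toy genome (hypothetical): "ACGAAGA" at the 5' end, its reverse complement at the 3' end.
toy = "ACGAAGA" + "TTTTGGGGCC" + reverse_complement("ACGAAGA")
print(terminal_inverted_repeat(toy))  # ACGAAGA
```

Applied to a set of genomes, the returned prefixes could be tallied per genus to see which termini recur; a tolerance for mismatches would be needed before drawing conclusions from real data.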
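
The quality-control step in the first part of the abstract removes short and low-quality reads before assembly. The sketch below illustrates that kind of filter on FASTQ records; it is not the NGS QC Toolkit implementation. The thresholds `min_len` and `min_mean_q`, the Phred+33 offset, and all function names are assumptions chosen for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality string)


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield records from four-line FASTQ text (header, sequence, '+', quality)."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # skip blank lines between or after records
        seq = next(it).strip()
        next(it)  # separator line starting with '+'
        qual = next(it).strip()
        yield header, seq, qual


def mean_phred(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string, assuming Phred+33 encoding."""
    return sum(ord(c) - offset for c in qual) / len(qual) if qual else 0.0


def filter_reads(records: Iterable[Record],
                 min_len: int = 50,
                 min_mean_q: float = 20.0) -> List[Record]:
    """Keep reads at least min_len long with mean quality of at least min_mean_q.

    Both thresholds are hypothetical defaults, not settings from the thesis.
    """
    return [r for r in records
            if len(r[1]) >= min_len and mean_phred(r[2]) >= min_mean_q]


# Three toy reads: one good, one too short, one with low quality.
fastq_text = "\n".join([
    "@r1", "A" * 60, "+", "I" * 60,
    "@r2", "A" * 10, "+", "I" * 10,
    "@r3", "A" * 60, "+", "#" * 60,
])
kept = filter_reads(parse_fastq(fastq_text.splitlines()))
print([header for header, _, _ in kept])  # ['@r1']
```

Mean quality is only one possible criterion; a filter based on the fraction of bases above a quality cutoff would be an equally reasonable choice.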