Fast Virus Identification Tool Based On Next-generation Sequencing Data And Its Application In Mobile Computing Platform

Posted on: 2019-08-12
Degree: Master
Type: Thesis
Country: China
Candidate: Y N Su
GTID: 2370330542997300
Subject: Bioinformatics

Abstract/Summary:
Diseases and large-scale outbreaks caused by pathogenic microorganisms are a major threat to human health. Accurate identification of pathogenic microorganisms is a prerequisite for clinical treatment and for disease prevention and control. In recent decades, culture methods, microbe-specific polymerase chain reaction (PCR) methods, enzyme-linked immunosorbent assay (ELISA) methods, and DNA microarray methods have played a major role. However, these traditional methods share a limitation: they require some prior judgment about the pathogenic microorganism in order to select a suitable kit or experimental material. In recent years, new outbreaks, clinical microbiological infections, and difficult cases of infectious disease have often involved inaccurate prior knowledge of the pathogen, which places higher demands on pathogen identification methods.

Second-generation sequencing technology, also known as high-throughput sequencing technology, provides a viable solution to this problem. Without prior knowledge of the pathogenic species, nucleic acids from specimens can be sequenced directly, and bioinformatics methods can then be used to compare the reads against a large microbial nucleic acid database to obtain information about the microorganisms in the specimens. This information can then be verified and confirmed by traditional methods, enabling identification in complicated samples. One of the key steps of this strategy is the analysis of the very large volume of sequencing data. At present, some software can perform microbiological analysis on second-generation sequencing data from non-cultured samples, such as VirusSeq/VERSE. However, such software usually requires relatively large computing and storage resources, and some of it needs to be deployed in the cloud, which makes it difficult to deploy at an epidemic site or on the clinical frontline. In addition, existing software focuses on bacteria, while viruses account for a large proportion of new infectious diseases. Therefore, there is still room to improve and promote pathogen analysis methods based on high-throughput sequencing so that they better suit the needs of clinical and disease control applications.

This paper describes a lightweight bioinformatics tool for rapid virus identification. The tool can be installed on consumer-grade personal computers as well as on mobile computing workstations and computing clusters. It has a user-friendly Chinese-language graphical interface and analyzes high-throughput sequencing data within minutes to obtain species-level virus information. This paper first introduces the development details of the tool, including the construction and reduction of the viral nucleic acid database, the design of the virus analysis workflow, and the software development framework. Because the virus sequences in the full nucleic acid library are highly redundant, we streamlined the viral nucleic acid database to increase analysis speed and reduce the demand for computing hardware. With 95% homology as the threshold, we used sequence homology comparison, clustering software, and in-house scripts to select single-stranded viral nucleic acid sequences and to delete sequences homologous to the human genome. The database was reduced from 1,914,294 sequences (3,447,426,279 bases) to 112,694 sequences (721,193,979 bases). Using the streamlined viral nucleic acid database, the tool performs fast mapping of the high-throughput sequencing data (short reads), assembly, and mapping of the assembled sequences, and then analyzes and integrates the results for display.

The tool is based on web forms and covers data submission, analysis program selection, and results display. The results display includes the alignment of short reads against the viral nucleic acid database, the coverage of the reference sequence, the genome assembly results, and the BLAST results of the assembled sequences against the viral nucleic acid database. Development follows the model-view-controller design pattern of the Django framework. The tool is operated with a mouse and has a Chinese-language interface. It can be installed from source code or as a virtual machine. In addition, the number of threads was optimized for mapping and assembly.

On a personal laptop, we compared the analysis speed and results obtained with the streamlined viral nucleic acid database and with the original viral nucleic acid database. High-throughput sequencing data from throat swab samples of two patients infected with adenovirus were used as a test. The overall analysis time with the reduced database was 2.16 minutes, about 9 times faster than the 19.76 minutes required with the original database. The acceleration is mainly in the short-read alignment step. With the reduced database, 8,537 adenoviral sequences were found, accounting for 0.55% of all sequences in the dataset and aligning to 77% of the viral sequence; with the original database, 8,500 adenoviral sequences were found, and the corresponding proportions were 0.24% and 71.4%. Sequence assembly yielded a near-complete adenovirus genome sequence (34,776 bp). These results show that the system analyzes data faster and can accurately identify the pathogenic virus in this sample data set.

Next, we tested more high-throughput sequencing data sets, including five cases of adenovirus infection (891 Mbp) and five cases of Ebola virus infection in West Africa (465 Mbp). Multiple data sets were uploaded at the same time and analyzed with one click. On the personal laptop, the analysis of the adenovirus data took 4.07 minutes, and adenovirus sequences were found in all 5 cases. The analysis of the Ebola virus data set took 118.75 minutes; Ebola virus sequences were identified in these 5 samples at proportions of 6.39%-63.49% and were successfully assembled into the Ebola genome map. Most of the Ebola analysis time was spent on assembly: because much of the data set aligns to the viral genome sequence, assembly takes a long time. However, because human-homologous sequences had been effectively removed from the streamlined database, no memory overflow caused by incomplete removal of human sequences occurred.

Overall, we have developed a lightweight bioinformatics tool for virus identification that uses limited computational resources to achieve rapid and accurate species-level identification and analysis of high-throughput sequencing data from non-cultured samples. It may have good application prospects in frontline clinical pathogen emergency response and rapid assessment. The tool also has a user-friendly Chinese-language interface, supports one-click analysis, and presents results as charts suitable for personnel without bioinformatics training.
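
The database reduction and speed-up figures reported in the abstract can be checked with a few lines of arithmetic. The following is a minimal sketch: the numeric values are taken directly from the abstract, while the variable and function names are illustrative assumptions and are not part of the thesis or its tool.

```python
# Figures as reported in the abstract; all names here are illustrative only.
ORIGINAL_DB = {"sequences": 1_914_294, "bases": 3_447_426_279}
REDUCED_DB = {"sequences": 112_694, "bases": 721_193_979}
TIME_ORIGINAL_MIN = 19.76  # analysis time with the original database
TIME_REDUCED_MIN = 2.16    # analysis time with the reduced database


def retained_fraction(reduced: int, original: int) -> float:
    """Fraction of the original database kept after reduction."""
    return reduced / original


def speedup(time_original: float, time_reduced: float) -> float:
    """How many times faster the reduced-database run is."""
    return time_original / time_reduced


seq_fraction = retained_fraction(REDUCED_DB["sequences"], ORIGINAL_DB["sequences"])
base_fraction = retained_fraction(REDUCED_DB["bases"], ORIGINAL_DB["bases"])
speedup_factor = speedup(TIME_ORIGINAL_MIN, TIME_REDUCED_MIN)

print(f"Sequences retained: {seq_fraction:.2%}")
print(f"Bases retained:     {base_fraction:.2%}")
print(f"Speed-up factor:    {speedup_factor:.2f}x")
```

With these inputs, the reduced database keeps roughly 5.9% of the sequences and 20.9% of the bases, and the run on the reduced database is about 9.1 times faster, consistent with the "9 times" figure stated in the abstract.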
Keywords/Search Tags: virus identification, next-generation sequencing, virus database, virus analysis flow