Predicting The Host Of Influenza Viruses And Identifying Sequences Of Viruses Based On Word Vector

Posted on: 2019-04-05
Degree: Master
Type: Thesis
Country: China
Candidate: B B Xu
Full Text: PDF
GTID: 2370330545473831
Subject: Software engineering
Abstract/Summary:
In recent years, emerging infectious diseases have become increasingly threatening to human society. With the rapid development of genomics, information technology and artificial intelligence, bioinformatics methods that integrate multiple disciplines are playing an increasingly important role in the prevention and control of infectious diseases. In this thesis, by analogy between natural language and biological sequences, the word vector representation used in natural language processing was applied to the feature extraction and representation of biological sequences. Based on these word vectors, the prediction of the host of influenza A viruses and the identification of viral sequences were studied. The main work of this thesis is as follows:

(1) A computational method for predicting the host of influenza A viruses based on word vectors was proposed. Influenza viruses not only pose a great threat to human health but also cause huge economic losses. Rapid determination of the host of an influenza virus would assist in assessing the potential risk of newly discovered influenza viruses. Specifically, this thesis adopted a simple biological sequence segmentation method. The DNA sequences and protein sequences of influenza A viruses were expressed as real-valued vectors using the natural language processing tool word2vec, and a classification model was then constructed from the feature vector representation of these sequences. The classification model was used to predict avian, human, and swine hosts of influenza A viruses. The experimental results show that the method predicts the host of influenza A viruses well, and that the model performs better on the surface proteins HA and NA (or their genes) than on the internal proteins (or their genes). The highest prediction accuracy for avian, human, and swine influenza viruses reached 98.9%, 97.9%, and 91.9%, respectively. Host prediction based on word vectors, k-mers and homology search was also compared. The results show that the word vector method performs comparably overall to the k-mer method, and better overall than the homology search method.

(2) A method for identifying viral sequences based on word vectors was proposed. Viruses are the most diverse biological entities on Earth. The first step in viral metagenomics research is to identify viral genome sequences. The classical computational method for identifying viral sequences is mainly homology search, which relies on the sequence similarity between the sequence to be identified and a database of known sequences. When a virus is highly mutated or novel, this method cannot identify viral sequences effectively. In this thesis, the intrinsic features of complete genome sequences were extracted by the word vector method, and the viral sequences were then identified using a classification algorithm and, for comparison, homology search. Because the genome sequences generated by high-throughput sequencing in viral metagenomics are usually not complete but are parts of genomes, fragments of the complete genome sequences were also randomly selected and identified with both the word vector method and homology search. The experimental results show that viral sequence identification based on word vectors outperformed homology search on both complete genome sequences and genomic sequence fragments.

This thesis is a useful attempt to represent biological sequences with word vectors. The results show that word vectors can serve as a useful representation of biological sequences for bioinformatics research. The work also assists in the prevention and control of newly emerging influenza viruses and in the rapid identification of viral genome sequences.
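
The representation step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the segmentation method or how word vectors are combined into a sequence-level feature, so overlapping 3-mers as "words" and simple averaging of word vectors are assumptions made here for illustration. The names `segment`, `average_vector` and `toy_embeddings` are hypothetical, and the toy embedding table stands in for vectors that a tool such as word2vec would learn from a corpus of sequences.

```python
from typing import Dict, List, Sequence


def segment(sequence: str, k: int = 3, step: int = 1) -> List[str]:
    """Split a sequence into k-mer "words" (overlapping when step < k).

    Assumption: the thesis only says "a simple segmentation method";
    overlapping k-mers are one common choice, not necessarily the author's.
    """
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


def average_vector(words: Sequence[str],
                   embeddings: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of all words that have an embedding.

    Assumption: averaging is used here as one simple way to turn
    per-word vectors into a fixed-length sequence feature.
    """
    known = [embeddings[w] for w in words if w in embeddings]
    if not known:
        raise ValueError("no word in the sequence has an embedding")
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]


# Toy 2-dimensional embeddings, invented for illustration only.
# In practice these would be learned from many sequences.
toy_embeddings = {
    "ATG": [1.0, 0.0],
    "TGC": [0.0, 1.0],
    "GCA": [1.0, 1.0],
}

words = segment("ATGCA", k=3)          # three overlapping 3-mers
features = average_vector(words, toy_embeddings)
print(words, features)
```

The resulting fixed-length vector is the kind of input a standard classifier could take for host prediction or viral sequence identification; the choice of classifier is not stated in the abstract.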
Keywords/Search Tags:Host, Word vector, Classification algorithm, Influenza virus, Viral sequence