Identifying Viral Sequences And Phage Lifestyles From Metagenomes Based On Deep Learning

Posted on: 2022-12-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Miao
GTID: 1480306758979269
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
With the maturation of second-generation sequencing technology, metagenomics has become a focus of microbial research. Viruses are the most abundant biological entities on Earth and an important component of the human body. They replicate inside host cells and, through their interactions with those cells, play a major role in controlling bacterial population size and altering host metabolism. Viruses have also caused tremendous morbidity and mortality in human society. Because metagenomics can recover the genetic information of all microorganisms in a given environment, studying viruses from metagenomes is of great significance. However, viral genomes make up only a small fraction of the genomic entities in the vast microbial world, and in practice information from other microbes can mask the viral signal. Because viruses lack the fixed, conserved evolutionary marker genes that prokaryotes have, and some viruses mutate rapidly, identifying viral sequences in metagenomes is the first and crucial step of subsequent viral analysis.

Phages are the most numerous group of viruses and can be found in any environment where bacterial hosts are present. Although phages may damage bacteria, in some cases they also benefit bacterial populations, which strongly shapes the composition of the microbiome. Accurately classifying phage lifestyle helps in understanding the population dynamics, genomics and microbiology of phages, and is important for studying the interactions between phages and their bacterial hosts and their different roles in regulating microbial communities. Identifying virulent phages also has practical value in phage therapy and biological control. It is difficult to accurately pick out viral sequences from the mixture of viral and non-viral sequences in a metagenome, and few bioinformatic methods exist for identifying phage lifestyle.

Deep learning algorithms have developed rapidly in computer vision and natural language processing and are able to learn the distribution of large data sets. Since metagenomic data produced by next-generation sequencing contain a large number of DNA sequences, deep learning methods are well suited to processing them. This dissertation studies the identification of viral sequences and phage lifestyles using deep learning. The specific research content is as follows:

(1) A DNA sequence encoding model based on enhanced codon correlation is built. Short viral sequences carry little genetic information, and most deep learning methods use one-hot vectors to encode single bases or k-mer fragments. However, one-hot vectors are mutually orthogonal, so the encoded parts of a sequence are unrelated to one another, and the encoding becomes sparser as the vector dimension grows. Neither property helps enrich the features of short sequences. To improve the feature representation of short sequences, this dissertation constructs a DNA sequence encoding method that strengthens codon correlation. A neural network learns, without supervision, the relationships among the parts of the viral data itself, so that DNA sequences are encoded into meaningful vector representations and the correlation features among the parts of a short sequence are enhanced.

(2) A novel short viral sequence recognition method based on a codon strong-correlation long short-term memory (LSTM) network is proposed. Most existing deep learning methods use CNN models, whose sliding-window convolution and pooling operations tend to extract sequence feature information inadequately. In addition, the sliding-window mechanism of a CNN attends only to the local information of the sequence fragment inside the current window and ignores the global information of the sequence, which hinders accurate recognition of short viral sequences. To make full use of the sequential characteristics of short sequences, this dissertation proposes a short viral sequence recognition method based on a codon strong-correlation LSTM. The trained codon embedding matrix encodes the codons of a DNA sequence, the recurrent time steps of the LSTM capture the sequential features of the DNA sequence, the long short-term memory property of the LSTM captures its global features, and an attention layer strengthens the acquisition of local information. The method obtained AUC values of 0.9129 and 0.9354 on the 300 bp and 500 bp test sets, with accuracies of 87.60% and 91.80%, respectively.

(3) A long viral sequence recognition method based on a graph convolutional network (GCN) is proposed. Existing deep learning methods must segment long sequences before identifying them. This truncation discards the relative positions and potential interrelations of the short segments within the original long sequence, so the correlation among the parts of the long sequence is lost and the final classification result suffers. To solve this problem, this dissertation proposes a long viral sequence recognition method based on a sequence cross-level linkage GCN. By constructing "direct edges", "local edges" and "intersegment edges" among the nodes of the graph, the method restores the relationships between truncated segments, and it strengthens the relationships between parts by embedding the words of the long sequence. The method obtained an AUC of 0.9604 and an accuracy of 0.9413 on the test sets of each length.

(4) A phage lifestyle prediction method based on protein features embedded in a multi-layer self-attention network is proposed. Few bioinformatic methods currently exist for identifying phage lifestyle in metagenomes, and the features used to distinguish virulent phages from temperate phages are simple and limited. To address this, the method introduces a local self-attention mechanism with sliding windows into the multi-layer self-attention network and applies max pooling to the key vectors and value vectors of each layer. Residual connections are added between the networks to enrich the transfer of information between layers, and protein sequence position-specific matrix features are incorporated to enrich the lifestyle characteristics of phages. The average accuracy of phage lifestyle identification on the <300 bp, 300-500 bp, 500-1000 bp, 1000-2000 bp and >2000 bp test sets was 0.7899, 0.8283, 0.8416, 0.8583 and 0.8681, respectively.

In summary, this dissertation systematically studies virus recognition and phage lifestyle prediction for metagenomic data, completes both tasks, and achieves better recognition results than existing methods.
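
The point made in (1) about one-hot encoding can be illustrated with a small sketch. It is not the dissertation's encoder: the function names (`split_codons`, `one_hot`, `dot`), the fixed codon order, and the restriction to the bases A, C, G and T are assumptions made only for illustration. It shows that one-hot codon vectors are mutually orthogonal and that each 64-dimensional vector has a single non-zero entry, which is the sparsity a learned codon embedding is meant to avoid.

```python
from itertools import product

BASES = "ACGT"
# All 64 codons in a fixed (assumed) order; a codon's position is its token id.
CODONS = ["".join(p) for p in product(BASES, repeat=3)]
CODON_INDEX = {codon: i for i, codon in enumerate(CODONS)}


def split_codons(seq, frame=0):
    """Split a DNA string into non-overlapping codons starting at `frame`.

    A trailing partial codon (1 or 2 bases) is dropped.
    Assumes the sequence contains only A, C, G and T.
    """
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def one_hot(codon):
    """Return the 64-dimensional one-hot vector of a codon."""
    vec = [0] * len(CODONS)
    vec[CODON_INDEX[codon]] = 1
    return vec


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    codons = split_codons("ATGGCCTAAG")        # trailing "G" is dropped
    vectors = [one_hot(c) for c in codons]
    # Distinct codons never overlap: their dot product is 0.
    print(codons, dot(vectors[0], vectors[1]))
    # Only 1 of 64 entries is non-zero in each vector.
    print(sum(vectors[0]), len(vectors[0]))
```

In an embedding-based encoder, the token ids in `CODON_INDEX` would instead index rows of a trained dense matrix, so that related codons can receive similar vectors; how that matrix is trained is described only at a high level in the abstract and is not reproduced here.
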
Keywords/Search Tags: Metagenomes, deep learning, viral identification, phage lifestyle identification, codon embedding, graph convolutional network, multi-layer self-attention network