
Detecting The Horizontal Gene Transfer For Microbial Genomes And Clustering The Biological Sequences

Posted on: 2021-12-07
Degree: Master
Type: Thesis
Country: China
Candidate: S T Liu
Full Text: PDF
GTID: 2480306515993739
Subject: Software engineering
Abstract/Summary:
With the development of sequencing technology, vast numbers of biological sequences are being generated. A major challenge in bioinformatics is to mine useful information from this massive data. To increase calculation speed, we find that it is viable to represent sequences with low-dimensional key features. This idea is applied to the detection of horizontal gene transfer in microbial genomes and to the clustering of biological sequences. Feature extraction and feature selection techniques for biological sequences are studied. The three major points of this thesis are as follows:

1. To increase prediction accuracy, the DAE-HGT model is proposed, which detects horizontal gene transfer events with different strategies. First, features are extracted using the PCA algorithm, a denoising auto-encoder model, and a combination of the two, respectively. Then, the features are fed into different classification models. Finally, a discriminator is used to detect the true horizontally transferred genes. Compared with four state-of-the-art methods (the Manhattan distance, the Euclidean distance, the d2 distance and the context-based adaptive d2*(4,1) distance), the DAE-HGT model obtains better prediction results and is much faster than the other methods, which verifies its effectiveness.

2. A novel feature selection model, Kmer Rank, is proposed, which is based on a bipartite graph. The graph is built by taking sequences and kmers as the two types of nodes, with the frequency of a kmer regarded as the weight of the corresponding edge. The weights of the kmer nodes are then calculated on the bipartite graph: the greater the weight, the more important the kmer. Because the original algorithm is computationally inefficient, an improvement is made that reduces the time complexity from O(M*N^2) to O(M*N*log N).

3. A novel sequence clustering model is proposed based on the Kmer Rank algorithm. First, the weights of the kmers are calculated and those with larger values are selected as important features. Then, redundant features are filtered out based on the relevance of the important kmers. Finally, each biological sequence is reconstructed from the candidate kmers and the Kmeans algorithm is applied to cluster the sequences. Compared with existing methods, the sequence clustering model based on the Kmer Rank algorithm extracts features effectively, which greatly improves clustering accuracy and calculation efficiency.

In the study of horizontal gene transfer detection, three feature extraction methods are proposed to reduce calculation time and improve prediction accuracy, and the extracted features prove effective for detecting horizontal gene transfer. In the study of biological sequence clustering, the important features are selected based on the Kmer Rank model, which ensures clustering accuracy and improves calculation efficiency.
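The abstract does not give the Kmer Rank weight formula, so the following is only a minimal sketch of the general idea: sequences and kmers form the two sides of a bipartite graph, each edge is weighted by how often the kmer occurs in the sequence, and kmer weights are obtained from the graph. The HITS-style mutual-reinforcement update, the function names (`kmer_counts`, `rank_kmers`, `feature_vectors`) and the `iterations` parameter are all assumptions made for illustration, not the thesis's actual method, and the sketch does not reproduce the O(M*N*log N) improvement.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers in one sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def rank_kmers(seqs, k, iterations=20):
    """Score k-mers on a sequence/k-mer bipartite graph.

    Edge weight = count of the k-mer in the sequence (our reading of the
    abstract). The HITS-style update is an ASSUMPTION, not the thesis's formula.
    """
    edges = [kmer_counts(s, k) for s in seqs]   # one Counter per sequence node
    kmers = sorted(set().union(*edges))         # all k-mer nodes
    seq_w = [1.0] * len(seqs)
    kmer_w = {}
    for _ in range(iterations):
        # k-mer weight: edge-weighted sum of adjacent sequence weights
        kmer_w = {m: sum(e[m] * w for e, w in zip(edges, seq_w)) for m in kmers}
        norm = sum(kmer_w.values()) or 1.0
        kmer_w = {m: v / norm for m, v in kmer_w.items()}
        # sequence weight: edge-weighted sum of adjacent k-mer weights
        seq_w = [sum(c * kmer_w[m] for m, c in e.items()) for e in edges]
        norm = sum(seq_w) or 1.0
        seq_w = [v / norm for v in seq_w]
    # highest weight first; ties broken alphabetically for determinism
    return sorted(kmer_w.items(), key=lambda kv: (-kv[1], kv[0]))


def feature_vectors(seqs, selected, k):
    """Represent each sequence by the counts of the selected k-mers only."""
    counts = [kmer_counts(s, k) for s in seqs]
    return [[c[m] for m in selected] for c in counts]


if __name__ == "__main__":
    seqs = ["ACGTACGT", "ACGTTTTT", "GGGGACGT"]  # toy data, not from the thesis
    ranked = rank_kmers(seqs, k=2)
    top = [m for m, _ in ranked[:3]]
    print(top, feature_vectors(seqs, top, k=2))
```

The reduced vectors returned by `feature_vectors` are the kind of low-dimensional input that a clustering step such as Kmeans could consume. The redundancy filtering described in point 3 is not sketched here because the abstract does not say how relevance is measured.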
Keywords/Search Tags:Feature Extraction, Feature Selection, Denoising Auto-Encoder, Weight Calculation, Clustering