
Virus Sequence Alignment And Classification Based On Hybrid Machine Learning

Posted on: 2024-04-24
Degree: Master
Type: Thesis
Country: China
Candidate: H Zhang
GTID: 2530307100467294
Subject: Electronic information
Abstract/Summary:
Sequence alignment and classification play a crucial role in bioinformatics and medicine. By comparing and classifying DNA, RNA, and protein sequences from various organisms, scientists can gain insights into their evolution, structure, and function. Sequence alignment and classification also apply to areas such as disease diagnosis and drug development, helping doctors and researchers diagnose diseases quickly and accurately and pinpoint potential drug targets. With the advancement of high-throughput sequencing technologies, however, challenges have arisen from the explosive growth in sequence length and increased genetic variability. The significance of sequence alignment and classification in research areas such as biodiversity, genomics, and drug design is becoming increasingly apparent. Owing to the extensive applications of sequence alignment, the complexity of the computations, and the immense volume and high dimensionality of the data, greater demands are being placed on computational performance, necessitating the support of high-performance computing. In light of these challenges, this thesis investigates the use of hybrid machine learning algorithms to tackle the accuracy issues related to viral sequence alignment and classification. The main research objectives are as follows:

(1) To address the long comparison times of traditional sequence alignment algorithms, a novel alignment algorithm for virus sequences based on a hybrid strategy, called C3 AA, is proposed. Virus gene sequences are long and of unequal length, which makes it difficult for traditional algorithms to align them efficiently. C3 AA first segments sequences into 3-mers, then assigns a value to each of the four nucleotides based on its properties and calculates the weight of each codon and amino acid from the nucleotide composition. Using amino acid frequency to optimize a 20-dimensional sequence feature representation, C3 AA performs virus sequence divergence analysis and phylogenetic tree construction based on these feature vectors. In comparisons with Clustal Omega, MAFFT, MUSCLE, and the 2D visualizations of the Squiggle library, C3 AA showed a Robinson-Foulds distance of zero from the traditional methods in tree construction and a correlation coefficient of 0.96 with Clustal Omega’s pairwise alignment matrix. C3 AA further improves the efficiency of sequence alignment while maintaining high classification accuracy compared to traditional methods. These results suggest that the proposed method is simple and fast for phylogenetic analysis of full-genome virus sequences.

(2) To address the low accuracy of sequence classification algorithms, a dual-encoding viral sequence classification model that combines multiple machine learning algorithms, called the Ensemble Machine Learning Algorithm (EMLA), is proposed. EMLA first preprocesses the dataset, which includes applying the Synthetic Minority Over-sampling Technique (SMOTE) for sequence sampling and splitting the dataset into training and testing sets in an 8:2 ratio. EMLA adopts two feature encoding methods, label encoding and k-mer encoding, and combines them with CNN, LSTM, BiLSTM, GRU, and BiGRU to establish classification models. Through multiple experiments and comparisons with previous models, EMLA was determined to be the best model, with an accuracy of up to 94.45%.

Research on virus sequence comparison and classification based on hybrid machine learning has extensive value in fields such as bioinformatics and computer science. These methods not only contribute to the study of the structure and function of viral sequences but also benefit other sequence comparison tasks.
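The 20-dimensional amino acid frequency representation described in (1) can be illustrated with a minimal sketch. This is not the C3 AA implementation: the abstract does not give the nucleotide values or the codon and amino acid weights, so they are omitted here. The sketch assumes non-overlapping 3-mers read from the first base, the standard genetic code, and Euclidean distance between feature vectors; the function names are hypothetical.

```python
from itertools import product
import math

# Standard genetic code, built in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # fixed order of the 20 dimensions


def amino_acid_frequencies(seq):
    """Return a 20-dimensional vector of amino acid frequencies.

    Assumption: the sequence is cut into non-overlapping 3-mers starting
    at position 0; stop codons and ambiguous codons are skipped.
    """
    seq = seq.upper().replace("U", "T")
    counts = dict.fromkeys(AMINO_ACIDS, 0)
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3])
        if aa in counts:
            counts[aa] += 1
    total = sum(counts.values())
    return [counts[a] / total if total else 0.0 for a in AMINO_ACIDS]


def euclidean(u, v):
    """Euclidean distance between two feature vectors (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Because every sequence maps to a vector of the same length, sequences of unequal length can be compared directly, and a pairwise distance matrix over these vectors could then be passed to a standard tree-building method.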
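The two feature encodings and the 8:2 split mentioned in (2) can be sketched as follows. This is an illustration under stated assumptions, not EMLA itself: the integer codes, the padding scheme, the choice of k, and the function names are all hypothetical, and the SMOTE step and the neural network models are left out.

```python
from itertools import product
import random

# Assumed integer codes; 0 is reserved for padding and unknown bases.
NUC_TO_INT = {"A": 1, "C": 2, "G": 3, "T": 4}


def label_encode(seq, length):
    """Map each nucleotide to an integer, truncating or zero-padding to `length`."""
    codes = [NUC_TO_INT.get(b, 0) for b in seq.upper()[:length]]
    return codes + [0] * (length - len(codes))


def kmer_counts(seq, k=3):
    """Count overlapping k-mers into a vector of size 4**k (k=3 is an assumption)."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    vec = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers containing ambiguous bases
            vec[j] += 1
    return vec


def split_80_20(items, seed=0):
    """Shuffle and split items into training and testing sets in an 8:2 ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = len(items) * 8 // 10
    return items[:cut], items[cut:]
```

Label encoding preserves the order of bases, which suits sequence models such as LSTM or GRU, while k-mer counts give a fixed-length summary of local composition.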
Keywords/Search Tags:sequence alignment, sequence classification, Machine Learning, hybrid strategy