
Design And Implementation Of Bacterial Biosynthetic Gene Cluster Prediction Algorithm Based On Deep Learning

Posted on: 2022-04-09
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Liu
GTID: 2480306572977749
Subject: Information and Communication Engineering
Abstract/Summary:
Natural products produced by bacteria during secondary metabolism have rich chemical structures and biological activities and include antibiotics, anticancer drugs, antiviral drugs, and other small-molecule drug candidates. They are important resources for the development of new drugs. In bacterial genomes, the genes that encode the synthesis of natural products are organized as biosynthetic gene clusters (BGCs), which lays a theoretical foundation for mining natural products from sequence to phenotype. In recent years, advances in sequencing technology have led to rapid growth in bacterial genome data, which has promoted the development of BGC prediction tools. However, limited by traditional algorithms, the prediction accuracy and generalization ability of existing tools still need improvement, and these tools cannot effectively support the growing demand for natural product research.

After analyzing the shortcomings of existing tools and the difficulties of bacterial BGC prediction, this thesis proposes two models based on deep learning and natural language processing methods: BGC-Deep Finder, a two-class prediction model that discovers BGCs at the protein domain level, and BGC-Deep Classifier, a multi-class prediction model that directly identifies BGC product categories. First, the BGCs and genomes are each serialized as sequences of protein domains, and positive and negative training sets are generated. Second, a domain joint embedding algorithm is designed based on the word2vec algorithm; it embeds the domain numbers into low-dimensional dense joint vectors according to context and superfamily information, giving a distributed numerical representation of sequence semantics. Third, a BGC data augmentation algorithm is designed based on the idea of synonym replacement: it defines synonym relationships between domains according to sequence similarity and randomly replaces a small number of domains in the original BGC sequence with their synonymous domains to generate simulated BGC sequences, alleviating the shortage of positive samples. Finally, the prediction performance of models with different network connection structures is compared using grid search and cross-validation, and a densely connected stacked bidirectional long short-term memory (LSTM) network is selected as the core structure to make feature extraction more robust. Both models are implemented on this basis.

In a performance test on a standard bacterial genome data set with 341 annotated BGCs, the two BGC prediction models designed in this thesis outperform the current leading tools, ClusterFinder and DeepBGC, under different evaluation metrics. In particular, BGC-Deep Finder consistently performs best. Relative to ClusterFinder and DeepBGC, respectively, its F1 score for domain-level prediction increases by 12.1% and 5.8%, its F1 score for BGC position prediction under the highest overlap threshold increases by 19.5% and 7.7%, and its average AUC for new BGC prediction increases by 9.3% and 3.1%. Moreover, among 4,000 bacterial genomes from the NCBI database, BGC-Deep Finder and BGC-Deep Classifier jointly identified 167 high-quality candidate BGCs that were not captured by the benchmark tools. Functional annotation of the candidate BGC with the highest antibacterial activity score consistently indicated that this sequence has the biosynthetic potential to encode a new antibiotic.

Taken together, these results verify the leading performance of BGC-Deep Finder and BGC-Deep Classifier in bacterial BGC prediction, confirm the application value of this work for natural product drug development, and show the feasibility of using deep learning for broader exploration in the field of natural product mining.
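
The synonym-replacement augmentation step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The function name `augment_bgc`, the `replace_fraction` parameter, the placeholder domain tokens, and the synonym table are all hypothetical. In the thesis the synonym relationships are derived from sequence similarity, whereas here they are simply supplied as a dictionary.

```python
import random


def augment_bgc(domains, synonyms, replace_fraction=0.1, rng=None):
    """Return a copy of `domains` with a few entries swapped for synonyms.

    domains:          list of domain tokens for one BGC (hypothetical IDs)
    synonyms:         dict mapping a domain token to a list of synonym tokens
                      (assumed to be precomputed, e.g. from sequence similarity)
    replace_fraction: assumed fraction of positions to replace (at least one)
    rng:              optional random.Random instance for reproducibility
    """
    rng = rng or random.Random()
    # Only positions whose domain has at least one synonym can be replaced.
    candidates = [i for i, d in enumerate(domains) if synonyms.get(d)]
    if candidates:
        wanted = max(1, round(replace_fraction * len(domains)))
        n_replace = min(len(candidates), wanted)
    else:
        n_replace = 0
    augmented = list(domains)  # copy so the original sequence is untouched
    for i in rng.sample(candidates, n_replace):
        augmented[i] = rng.choice(synonyms[domains[i]])
    return augmented


# Usage with placeholder tokens (not real domain identifiers).
bgc = ["DOM_A", "DOM_B", "DOM_C", "DOM_D", "DOM_E"]
table = {"DOM_A": ["DOM_A2"], "DOM_C": ["DOM_C2", "DOM_C3"]}
print(augment_bgc(bgc, table, replace_fraction=0.2, rng=random.Random(0)))
```

Each call yields one simulated sequence. Calling the function repeatedly with different random states would produce several simulated positives from a single original BGC.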
Keywords/Search Tags: Biosynthetic gene cluster, Deep learning, Natural language processing, Word embedding, Data augmentation