Research On Identification Method Of Transcription Factor Binding Site Based On DNase-Seq

Posted on: 2022-02-25
Degree: Master
Type: Thesis
Country: China
Candidate: C C Luo
GTID: 2480306353981989
Subject: Control Science and Engineering
Abstract/Summary:
With the continuous development of bioinformatics and sequencing technology, the study of transcription has become increasingly important. A transcription factor is a protein that binds to DNA sequences to regulate gene expression. Accurately predicting and identifying transcription factor binding sites is therefore of great significance for identifying the target genes of transcription factors, studying their specific roles in upstream regulatory regions, and ultimately constructing regulatory networks.

As the mainstream technology for detecting protein binding sites, ChIP-Seq can be used effectively to detect protein binding sites across the whole genome. However, because of its experimental principle, each experiment relies on a protein-specific antibody, and the binding sites of only one protein can be detected at a time, which greatly increases the cost of the experiment. In contrast, DNase-Seq can detect the binding sites of all proteins across the whole genome in a single experiment, and it has higher detection accuracy.

This thesis first introduces the commonly used methods for studying protein binding sites. It then uses the GEM prediction software and position weight matrices (PWMs) to extract transcription factor binding sites, and extracts the DNase-Seq digestion data near each site. After bias correction and general data filtering, an initial data set is constructed. This data set is strongly heterogeneous, that is, the activity of DNase I changes drastically across the binding site, and it also contains zero-inflation noise introduced by "dropout" events.

To filter the zero-inflation noise in DNase-Seq data, we use a deep count autoencoder (DCA), which performs denoising based on an assumed probability distribution of the input data. First, in view of the overdispersion of DNase-Seq data, the zero-inflated negative binomial (ZINB) distribution is chosen as the assumed distribution. A DCA is then constructed on this distribution for training and data reconstruction, and the effect of the model is verified on both simulated data and DNase-Seq data.

After the zero-inflation noise has been removed, the classifier model is designed. To uncover the latent characteristics of different transcription factors in the highly heterogeneous DNase-Seq data, we use a convolutional neural network with a Convolutional Block Attention Module (CBAM) as the base classifier, and the Stacking ensemble algorithm integrates the base classifiers to improve the robustness and classification performance of the model. Model performance is evaluated with macro-recall, macro-precision, accuracy, and the Kappa coefficient. Finally, the stacked ensemble is compared with other common models, which demonstrates the effectiveness of the designed model.
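The zero-inflated negative binomial assumption mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation. It uses the standard ZINB parameterization with mean `mu`, dispersion `theta`, and zero-inflation ("dropout") probability `pi`; the function names are hypothetical and chosen only for illustration. A DCA-style model would typically predict these three parameters per position and minimize a negative log-likelihood of this form.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial.

    Assumed parameterization: mean mu, dispersion theta, so that the
    variance is mu + mu**2 / theta (larger than the mean: overdispersion).
    """
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_pmf(x, mu, theta, pi):
    """Probability of count x under a zero-inflated negative binomial.

    With probability pi the observation is a "dropout" zero; otherwise
    the count is drawn from the negative binomial component.
    """
    nb = math.exp(nb_log_pmf(x, mu, theta))
    if x == 0:
        return pi + (1.0 - pi) * nb
    return (1.0 - pi) * nb


def zinb_nll(counts, mu, theta, pi):
    """Negative log-likelihood of a list of counts sharing one parameter set."""
    return -sum(math.log(zinb_pmf(x, mu, theta, pi)) for x in counts)


if __name__ == "__main__":
    # Hypothetical cut counts around a site, with several zeros.
    counts = [0, 0, 3, 5, 0, 2, 7, 0]
    print(zinb_nll(counts, mu=3.0, theta=2.0, pi=0.3))
```

The extra mass that `pi` places on zero is what lets the model separate dropout zeros from genuinely low DNase I cut counts, which is the motivation the abstract gives for the denoising step.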
Keywords/Search Tags: Transcription factor binding sites, DNase-Seq, Deep count autoencoder network, Convolutional neural network