DeepCNN-based TFBS Prediction Model For Arabidopsis And The Cross-species Application Of Transfer Learning In Plant

Posted on: 2020-12-07
Degree: Master
Type: Thesis
Country: China
Candidate: G Zhang
GTID: 2370330572484761
Subject: Bioinformatics
Abstract/Summary:
The types and numbers of genes expressed in living organisms differ across developmental stages and tissues. This leads to large differences in cell differentiation during the development of an individual and underlies complex physiological functions. The selective expression of genes is regulated by genomic regulatory elements. Transcription factors (TFs) are important regulatory elements that specifically recognize and bind to regulatory sequences in non-coding regions. By activating or inhibiting transcription-initiation complexes, transcription factors affect the intensity of gene expression. Identifying the DNA binding sites of transcription factors is the basis of transcriptional regulation research and also has potential for research on plant diseases and crop improvement. However, in contrast to the rapid progress of transcription factor research in animals, plants have relatively few TF-DNA binding data because ChIP-seq is difficult to perform in plants.

In the first part of the current study, we used a deep learning technique to build a prediction model of transcription factor binding sites (TFBSs) in Arabidopsis. Using high-throughput DAP-seq data of Arabidopsis and a deep convolutional neural network framework, we built a binding site prediction model (DeepCNN) for each of 265 transcription factors in Arabidopsis. The AUCs on the test datasets of all DeepCNN models range from 0.826 to 0.999, with an average of 0.913. For comparison, the test AUCs of the gkm-SVM models range from 0.779 to 0.910 with an average of 0.845, and those of the MEME-ChIP models range from 0.605 to 0.801 with an average of 0.703, which demonstrates that the DeepCNN model performs better for TFBS prediction. Next, we used the established DeepCNN models for genome-wide prediction of TFBSs in Arabidopsis. In this step, we adopted two different strategies to determine a TFBS: (1) for a given DNA fragment, the DeepCNN prediction is used directly as the criterion for deciding whether it is a TFBS; (2) in a strategy similar to ChIP-seq, the DNA fragments predicted by scanning the genome are treated as reads, and all TFBSs are then identified by peak calling. We collected experimental binding site datasets from non-DAP-seq data for 22 TFs to test the two strategies. For evaluation, a correlation test of positive predictive value (PPV) was performed. The Pearson correlation coefficient (PCC) of the first strategy is 0.104, whereas the PCC of the second strategy is 0.686, which shows that the second strategy of genome-wide scanning generalizes better for determining TFBSs. We also found that the PPVs of the 265 TFs show differential enrichment across 31 TF families of Arabidopsis. Based on the automatic feature extraction ability of the convolutional layer, we further analyzed the learned features for the transcription factor ABI5. We found that the learned features not only correspond to its known TF motifs but also reflect the cooperative regulatory mechanism between ABI5 and ABF2, ABF3, RAV1 and HY5, which provides some biological explanation for the excellent performance of the DeepCNN model. Finally, we used the cooperative regulatory relationships and open chromatin information to control the false-positive rate of the DeepCNN model, reducing false-positive sites by at least 70%. This part of the study demonstrates the feasibility of deep learning for the genome-wide prediction of TFBSs in Arabidopsis.

In the second part of the current study, based on the predictions of our model, we explored the influence of functional non-coding SNPs on regulatory mechanisms. First, for 28 GWAS SNPs associated with the bacterial disease resistance phenotype (avrRpm1), we calculated the change in the predicted binding intensity of DNA fragments before and after mutation. The results show that the SNPs affect the binding intensity of the 265 TFs differently. Subsequently, based on the changes in predicted binding scores for all reported non-coding SNPs, we trained a random forest classifier for prioritizing functional non-coding SNPs. In 10 repeated experiments with different negative samples, the AUCs range from 0.637 to 0.671, with an average of 0.654. This indicates that the DeepCNN model has some reference value for the prioritization of functional non-coding mutations.

In the third part of the current study, we carried out cross-species prediction of TFBSs in plants and attempted to compensate for the lack of experimental data with computational approaches. Based on the similarity of the protein sequences and DNA motifs of transcription factors, we applied the idea of transfer learning to transfer the TFBS prediction models for Arabidopsis to the prediction of TFBSs in rice, Zea mays and Glycine max. First, for the rice transcription factor MADS29, when the overlapping length between the predicted peak and the real peak was set at less than or equal to 200 bp, a good balance of prediction accuracy between positive and negative samples was achieved. The PPV is 0.816 and the negative predictive value (NPV) is 0.189, which indicates that both the false negative rate and the false positive rate are controlled below 0.2 for MADS29. Second, for three other rice TFs, the PPV and NPV are 0.752 and 0.108 for BZIP23, 0.951 and 0.234 for ERF48, and 0.317 and 0.156 for NAC6, respectively. Third, when the models were transferred to the prediction of TFBSs in Zea mays and Glycine max, the PPVs of ARF5, O2, P1 and KN1 in Zea mays are 0.201, 0.550, 0.400 and 0.381, respectively, and the PPVs of 06G314400 and 13G317000 in Glycine max are 0.452 and 0.413, respectively. These results indicate the feasibility of transfer learning for the cross-species prediction of TFBSs in plants. In particular, the prediction performance for rice is better than that for Zea mays and Glycine max, which shows that future research needs to focus on finding the most suitable transfer learning model for each species according to actual conditions.
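The SNP analysis in the second part rests on comparing a predicted binding score before and after a single-base mutation. A minimal Python sketch of that comparison follows. It is an illustration under stated assumptions, not the thesis's implementation: `toy_binding_score` is a hypothetical stand-in for a trained per-TF model (it only measures the best match to a made-up motif), and the sequence, motif and function names are invented for the example.

```python
from typing import Callable


def toy_binding_score(fragment: str, motif: str = "ACGTG") -> float:
    """Hypothetical scorer standing in for a trained per-TF model.

    Returns the best fraction of matching bases between the motif and
    any window of the fragment. This is NOT the DeepCNN of the thesis.
    """
    best = 0
    for start in range(len(fragment) - len(motif) + 1):
        window = fragment[start:start + len(motif)]
        matches = sum(a == b for a, b in zip(window, motif))
        best = max(best, matches)
    return best / len(motif)


def mutate(fragment: str, position: int, alt: str) -> str:
    """Return the fragment with the base at `position` (0-based) replaced by `alt`."""
    return fragment[:position] + alt + fragment[position + 1:]


def snp_score_change(
    fragment: str,
    position: int,
    alt: str,
    score_fn: Callable[[str], float] = toy_binding_score,
) -> float:
    """Predicted score after mutation minus predicted score before mutation."""
    return score_fn(mutate(fragment, position, alt)) - score_fn(fragment)


# Assumed example: a fragment containing the toy motif, with a C>A change inside it.
reference = "TTGACGTGTCAA"
delta = snp_score_change(reference, position=4, alt="A")
print(f"score change: {delta:+.3f}")  # negative: the mutation weakens the toy match
```

Repeating this calculation with one scoring function per transcription factor would give each SNP a vector of score changes, which is the kind of feature the abstract describes feeding to a random forest classifier.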
Keywords/Search Tags:Transcription factor, Transcription factor binding site, Deep convolutional neural network, Functional non-coding SNP, Transfer learning, Cross-species prediction