Machine Learning Approaches to Gene Duplication and Transcription Regulation

Posted on:2011-09-30

Degree:Ph.D

Type:Dissertation

University:New York University

Candidate:Chen, Huang-Wen

GTID:1440390002450463

Subject:Biology

Abstract/Summary:

Gene duplication can lead to genetic redundancy, or to functional divergence when the duplicated genes evolve independently or partition the original function. In this dissertation, we employed machine learning approaches to study two different views of this problem: 1) Redundome, which explored the redundancy of gene pairs in the genome of Arabidopsis thaliana, and 2) ContactBind, which focused on functional divergence of transcription factors by mutating contact residues to change binding affinity.

In the Redundome project, we used machine learning techniques to classify gene family members into redundant and non-redundant gene pairs in Arabidopsis thaliana, for which sufficient genetic and genomic data are available. We showed that Support Vector Machines were twice as precise as single-attribute classifiers and performed among the best of the machine learning algorithms we compared. The machine learning methods predicted that about half of all genes in Arabidopsis show the signature of redundancy with at least one, but typically fewer than three, other family members. Interestingly, a large proportion of predicted redundant gene pairs were relatively old duplications (e.g., Ks>1), suggesting that redundancy is stable over long evolutionary periods. The genome-wide predictions were plotted with similarity trees based on ClustalW alignment scores, and can be accessed at http://redundome.bio.nyu.edu.

In the ContactBind project, we used Bayesian networks to model dependences between contact residues in transcription factors and binding site sequences. Based on the models learned from various binding experiments, we predicted binding motifs and their locations on promoters for three families of transcription factors in three species. The predictions are publicly available at http://contactbind.bio.nyu.edu. The website also provides tools to predict binding motifs and their locations for novel protein sequences of transcription factors. Users can also construct their own Bayesian networks for new families once binding data for those families become available.
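
As a rough illustration of the ContactBind idea, the sketch below learns a very small Bayesian network in which each binding-site position depends on exactly one contact residue, then predicts a motif for a residue combination that was not observed together in training. It is not the dissertation's model: the network structure (TOY_EDGES), the training pairs (TOY_TRAINING), the function names, and the add-one smoothing are all assumptions made for this example.

```python
from collections import Counter, defaultdict

BASES = "ACGT"

# Hypothetical structure: binding-site position -> index of the contact
# residue it is assumed to depend on (one parent per site position).
TOY_EDGES = {0: 0, 1: 1, 2: 1}

# Hypothetical training data: (contact residues, bound site sequence).
TOY_TRAINING = [("RK", "GAC"), ("RK", "GAC"), ("RN", "GCT"), ("QK", "TAC")]


def learn_network(training, edges, pseudocount=1.0):
    """Estimate P(base at site position | parent residue) from counts."""
    counts = {pos: defaultdict(Counter) for pos in edges}
    for residues, site in training:
        for pos, res_idx in edges.items():
            counts[pos][residues[res_idx]][site[pos]] += 1

    network = {}
    for pos, by_residue in counts.items():
        network[pos] = {}
        for residue, base_counts in by_residue.items():
            # Smoothed estimate so unseen bases keep non-zero probability.
            total = sum(base_counts.values()) + pseudocount * len(BASES)
            network[pos][residue] = {
                b: (base_counts[b] + pseudocount) / total for b in BASES
            }
    return network


def predict_motif(network, edges, residues):
    """Return one base distribution per site position for a protein."""
    uniform = {b: 1.0 / len(BASES) for b in BASES}
    motif = []
    for pos in sorted(edges):
        parent = residues[edges[pos]]
        # A residue never seen at this parent falls back to uniform.
        motif.append(network[pos].get(parent, uniform))
    return motif


def consensus(motif):
    """Most probable base at each position of a predicted motif."""
    return "".join(max(BASES, key=lambda b: column[b]) for column in motif)


if __name__ == "__main__":
    net = learn_network(TOY_TRAINING, TOY_EDGES)
    # "QN" never appears as a pair in the toy training data.
    print(consensus(predict_motif(net, TOY_EDGES, "QN")))
```

Because each site position has its own conditional table, evidence about a residue seen in one protein can be reused for a new protein that combines residues differently, which is the property that makes motif prediction for novel sequences possible in this toy setting.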

Keywords/Search Tags:

Machine learning, Gene, Transcription, Binding, Redundancy

Related items

1	Analysis Methods Of The Transcription Factor Binding Sites Based On Chromatin Accessibility Sequencing Data
2	Computational Analysis Of The Specificity Of DNA Recognition By AtERFs And In Silico Identification Of The Target Gene Candidates Of DREBs In Arabidopsis Genome
3	Computational Prediction Of Sigma-54 Promoters In Bacterial Genomes By Integrating Motif Finding And Machine Learning Strategies
4	Physically interpretable machine learning methods for transcription factor binding site identification using principled energy thresholds and occupancy
5	Research On RNA Related Function Sites Based On Machine Learning
6	Research On Identifying Specific Gene Sequence And Its Association Based On Deep Learning
7	Construction Of Recombinant Plasmids And The Bioinformative Analysis Of Binding Sites Of Transcription Factors Of POLH, I, K Gene In Mammalian Cells
8	Identification Of DNA-binding Proteins Based On Sequence Information
9	Based On The Information Of Sequences To Predict The Transcription Factor Binding Sites And Promoter
10	The Evolution of a Transcription Factor: Divergence in DNA Binding Behavior of the Sex-Determination Gene hermaphrodite in the Genus Drosophila