Protein domain superfamilies: An evolutionary perspective towards a non-redundant classification of protein domains

Posted on: 2008-08-30
Degree: Ph.D.
Type: Thesis
University: Boston University
Candidate: Cherukuri, Praveen Frazer
Full Text: PDF
GTID: 2440390005977922
Subject: Biology
Abstract/Summary:
Protein domains are units of compact three-dimensional structure as well as units of molecular evolution. Many globular proteins use domains as building blocks to give rise to different physiological and cellular functions in living organisms. The detection of these conserved domain signatures is often the first lead towards identifying a protein's molecular function, and frequently it is the only computational functional annotation available for data generated in genomic sequencing efforts. Sequence comparison approaches, in particular profile-based methods, have proven to be very sensitive in identifying such signatures, and have spawned collections of domain- and protein-alignment models, such as Pfam, SMART and COG. The Conserved Domain Database (CDD) attempts to collate and re-organize these collections into superfamilies. A superfamily is understood as a set of protein domains that are homologous, as judged from the results of sequence comparison and sometimes structure comparison. Sensitive and specific algorithms for classifying and clustering related domain models into superfamilies are a prerequisite for well-defined and correctly classified superfamilies. In this thesis, reverse position-specific BLAST (RPS-BLAST) is compared with other standard annotation resources, IMPALA and HMMer, to study the trade-off between speed and sensitivity of a fast database-search heuristic. RPS-BLAST was found to be about 140 times faster than HMMer and about 25 times faster than IMPALA, while maintaining very high sensitivity. Using the heuristics-based RPS-BLAST as an annotation resource enabled rapid analysis, filtering, and robust classification of domains in CDD. A novel set of algorithms was developed for the automated detection of lineage-specific, multiple domains and of homologous relationships between single domains.
The clustering of protein domains into superfamilies uses mutual taxonomic coverage as an additional classification parameter, via a list of sentinel taxonomic nodes that have a minimum apparent 'age' of ∼500 million years. This approach increased the sensitivity of classification by 10 percentage points (from ∼50% to ∼60%) at 99% specificity, as judged against an external standard of truth, the SCOP classification. The following questions were addressed with this revised set of algorithms: how many ancient domain superfamilies are contained within the CDD collection, how many are associated with at least one representative three-dimensional structure, and what are the phylogenetic distributions of these domain superfamilies? About 900 superfamilies (of 5494) were shared by all three kingdoms of life (Archaea, Bacteria, and Eukaryota), corresponding to about 53%, 27%, and 25% of the archaeal, bacterial, and eukaryotic protein-superfamily repertoires, respectively. About 43% of the domains in CDD, but only about 31% of the superfamilies, are associated with at least one 3D protein structure. Finally, a web-based tool was developed to evaluate the 3D-structural and phylogenetic coverage of current domain superfamilies in the Conserved Domain Database (CDD) and to study the impact and contribution of Structural Genomics Initiatives (SGI).
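The abstract reports sensitivity and specificity against SCOP but does not say how they were computed. One common way to score a clustering against a reference classification is pairwise: every pair of domains counts as a positive if the reference places both in the same superfamily, and as a negative otherwise. The sketch below illustrates that convention only; the function name, the domain identifiers, and the toy labels are all hypothetical and are not taken from the thesis.

```python
from itertools import combinations


def pairwise_sensitivity_specificity(predicted, reference):
    """Score a clustering against a reference, pair by pair.

    Both arguments map a domain identifier to a cluster label.
    Only domains present in both mappings are evaluated.
    This is an assumed evaluation scheme, not the thesis's documented one.
    """
    shared = sorted(set(predicted) & set(reference))
    tp = fp = tn = fn = 0
    for a, b in combinations(shared, 2):
        same_pred = predicted[a] == predicted[b]
        same_ref = reference[a] == reference[b]
        if same_pred and same_ref:
            tp += 1  # correctly grouped together
        elif same_pred:
            fp += 1  # grouped together, but the reference separates them
        elif same_ref:
            fn += 1  # separated, but the reference groups them
        else:
            tn += 1  # correctly kept apart
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


# Hypothetical toy data: five domains, two reference superfamilies.
reference = {"d1": "A", "d2": "A", "d3": "A", "d4": "B", "d5": "B"}
# The predicted clustering splits d3 away from its reference superfamily.
predicted = {"d1": "X", "d2": "X", "d3": "Y", "d4": "Z", "d5": "Z"}

sens, spec = pairwise_sensitivity_specificity(predicted, reference)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

In this toy case the clustering never merges unrelated domains, so specificity is perfect, but it misses two of the four related pairs. That is the same qualitative trade-off the abstract describes: high specificity with moderate sensitivity.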
Keywords/Search Tags:Domain, Protein, Classification, CDD