
Statistical Inference For Protein Families And Folds Based On Database

Posted on:2010-03-27
Degree:Master
Type:Thesis
Country:China
Candidate:B Lv
Full Text:PDF
GTID:2120330338476524
Subject:Probability theory and mathematical statistics
Abstract/Summary:
Statistical inference on protein families, structures, and new functions is a frontier research field in applied statistics. Based on SCOP (the Structural Classification of Proteins database) and Pfam (a sequence classification database), this thesis discusses statistical inference on protein families and folds.

First, using the mapping between the two databases, Chapter 2 studies the size distributions of protein families (SDPFs) belonging to different kinds of folds separately. Three kinds of protein families are considered: Pfam families, SCOP families, and the Pfam families to which the SCOP families map. The results show that the sizes of protein families and their distributions are independent of the sizes of the folds that cover them, and that the size distributions of protein families within different types of folds all obey a similar power law. The models also suggest that the SCOP families as a whole constitute a random sample from the Pfam family database.

Based on the results of Chapter 2 and the dynamic information of the SCOP database, Chapter 3 first estimates the total number of folds covering the current Pfam database. It then constructs a Bayesian model whose prior information is whether the folds covering the Pfam families mapped by newly appearing SCOP families were previously known. From this model, the thesis estimates the probability that a Pfam family of a given size contributes a new fold.

Finally, based on the size distribution of folds observed in the latest version (1.73) of SCOP, the thesis re-estimates the total number of folds and their distribution over families in nature, using the maximum probability principle and the method of moments.
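As a rough illustration of what it means for family sizes to obey a power law, the sketch below estimates a power-law exponent from a list of sizes. It is not the thesis's method: it assumes a continuous power law p(x) ∝ x^(-alpha) for x >= x_min and uses the standard maximum-likelihood estimator for that model. The function name `power_law_exponent` and the parameter `x_min` are hypothetical choices made for this example.

```python
import math


def power_law_exponent(sizes, x_min=1.0):
    """Estimate alpha for p(x) ~ x**(-alpha), x >= x_min.

    Assumption (not from the thesis): a continuous power law, fitted with
    the closed-form maximum-likelihood estimator
        alpha = 1 + n / sum(ln(x_i / x_min)).
    """
    # Only sizes at or above the cutoff take part in the fit.
    tail = [s for s in sizes if s >= x_min]
    if not tail:
        raise ValueError("no sizes at or above x_min")

    log_sum = sum(math.log(s / x_min) for s in tail)
    if log_sum == 0:
        raise ValueError("all sizes equal x_min; exponent is undefined")

    return 1.0 + len(tail) / log_sum
```

Applied separately to the family sizes within each type of fold, similar estimates of alpha across fold types would be consistent with the claim that the distributions share a similar power law. A careful analysis would also need a goodness-of-fit check, which this sketch omits.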
Keywords/Search Tags:Proteins, databases, protein families, folds, size distributions, statistical inference