
Some probability and statistics problems in proteomics research

Posted on: 2008-01-25
Degree: Ph.D
Type: Dissertation
University: The Johns Hopkins University
Candidate: Feng, Jian
Full Text: PDF
GTID: 1444390005976298
Subject: Biology
Abstract/Summary:
The goal of proteomics is to characterize all the proteins in a cell, tissue or organism grown under some particular condition. Tandem mass spectrometry provides a high-throughput and sensitive way to identify proteins from complex mixtures acquired from cells or tissues. In shotgun proteomics, tandem mass spectrometry is commonly used to identify peptides derived from proteins. After the peptides are detected, proteins are reassembled via a reference database of protein or gene information.

Here, a probability model is introduced for determining the likelihood that peptides are correctly assigned to proteins. This model derives consistent and rigorous probability estimates for assembled proteins. The probability scores make it easier to confidently identify proteins in complex samples and to accurately estimate false-positive rates. The algorithm based on this model is shown to be robust in creating protein complements from peptides from bovine protein standards, yeast cell lysates and Arabidopsis thaliana leaves. The software that runs the algorithm, called PANORAMICS, provides a tool to help analyze the data based on a researcher's knowledge about the sample software platforms.

It is important to measure and control the false-positive rates for peptide and protein identifications when using database search algorithms to analyze tandem mass spectrometry datasets. Estimating and controlling the frequency of false matches between a peptide tandem mass spectrum and candidate peptide sequences is an issue that pervades proteomics research. To solve this problem, I designed an unsupervised pattern recognition algorithm for detecting patterns of various lengths in large sequence datasets. The patterns found by this algorithm in a protein sequence database were used to create a decoy database using a Monte Carlo sampling algorithm. Searching this decoy database led to the estimation of false-positive rates for spectrum/peptide sequence matches. This method, which is independent of instrumentation, database-search software and samples, is shown to provide better estimation of false-positive identification rates than the prevailing reverse database searching method. The pattern detection algorithm, called PTTRNFNDR, can also be used to analyze large sequence datasets for other biological studies. The algorithm may also have applications to non-biological datasets.
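As a rough illustration of how a decoy database search yields a false-positive estimate, the sketch below shows the prevailing reverse-database baseline that the abstract mentions, not the pattern-based Monte Carlo decoy construction of PTTRNFNDR, whose details are not given here. The names `Match`, `reverse_decoy` and `estimate_false_positive_rate` are hypothetical, and the estimator (decoy hits divided by target hits above a score threshold) is one common convention assumed for this example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Match:
    """One spectrum/peptide match (hypothetical record for this sketch)."""
    score: float      # search-engine score; higher is assumed to be better
    is_decoy: bool    # True if the matched peptide came from the decoy database


def reverse_decoy(protein_sequence: str) -> str:
    """Build a decoy entry by reversing a target sequence (the reverse-database baseline)."""
    return protein_sequence[::-1]


def estimate_false_positive_rate(matches: Iterable[Match], threshold: float) -> float:
    """Estimate the false-positive rate among target matches scoring >= threshold.

    Assumption: a random match is equally likely to hit the target or the decoy
    database, so the number of accepted decoy hits approximates the number of
    false target hits.
    """
    accepted = [m for m in matches if m.score >= threshold]
    decoy_hits = sum(1 for m in accepted if m.is_decoy)
    target_hits = len(accepted) - decoy_hits
    if target_hits == 0:
        return 0.0  # nothing accepted from the target database, so nothing to estimate
    return min(1.0, decoy_hits / target_hits)


# Made-up scores, for illustration only.
example_matches = [
    Match(9.0, False),
    Match(8.0, False),
    Match(7.0, True),
    Match(6.0, False),
    Match(5.0, True),
    Match(4.0, False),
]
rate_at_6 = estimate_false_positive_rate(example_matches, threshold=6.0)
```

With the made-up scores above, a threshold of 6.0 accepts three target matches and one decoy match, giving an estimate of one third. Replacing `reverse_decoy` with a different decoy generator changes only how the decoy database is built; the counting step stays the same.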
Keywords/Search Tags:Proteomics, Proteins, Algorithm, Probability, Tandem mass spectrometry, Datasets, Sequence