Study On Protein Identification Based On Tandem Mass Spectrometry And Database Search Algorithm

Posted on: 2013-01-08
Degree: Master
Type: Thesis
Country: China
Candidate: Z C Huang
GTID: 2210330374968864
Subject: Biochemistry and Molecular Biology
Abstract/Summary:
Since the concept of the proteome emerged in the last century, mass spectrometry has become one of the most important proteomic technologies for the high-throughput analysis of complex protein samples. Tandem mass spectrometry is an effective way to identify the components of a complex protein sample in a short time. Because mass spectrometers generate large amounts of complex, high-quality data, the data analysis that follows an experiment relies on robust bioinformatics tools.

Currently, searching a sequence database with mass spectrometry data is the most popular approach to protein identification. Many database search algorithms are available for analyzing tandem mass spectrometry data. Some, such as SEQUEST and Mascot, require a purchased license; others are open source and free to use, for instance OMSSA, developed by NCBI, and X! Tandem. In general, many of the peaks in a tandem mass spectrum are noise, and the process of collision-induced dissociation (CID) is not yet fully understood, so data analysis remains one of the challenges of proteomic research.

The complexity of the spectra and the limitations of protein sequence databases can lead to false positive identifications. It is therefore worthwhile to design more powerful algorithms that increase sensitivity without increasing the false positive rate. The scoring function determines the quality of the results and usually involves a standard two-step procedure. In the first step, a score is computed by comparing the experimental spectrum with theoretical spectra. The score indicates which of the candidate peptides is the best match, and only the highest-scoring peptide is considered as the potentially correct result. In the second step, a random distribution is used to calculate the probability of the top-hit peptide in the database search.
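The two-step procedure can be illustrated with a minimal Python sketch. It is not the scoring function of SEQUEST, Mascot, OMSSA, or X! Tandem: the shared-peak count, the shuffled-sequence null distribution, the 0.5 Da tolerance, and all function names are assumptions chosen for illustration only.

```python
import random

# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def theoretical_peaks(peptide):
    """Return singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    peaks = []
    for i in range(1, len(masses)):
        peaks.append(sum(masses[:i]) + PROTON)          # b ion (prefix)
        peaks.append(sum(masses[i:]) + WATER + PROTON)  # y ion (suffix)
    return sorted(peaks)


def match_score(observed, peptide, tol=0.5):
    """Step 1: count theoretical peaks that have an observed peak within tol."""
    return sum(
        any(abs(mz - obs) <= tol for obs in observed)
        for mz in theoretical_peaks(peptide)
    )


def empirical_p_value(observed, candidates, n_random=200, tol=0.5, seed=0):
    """Step 2: compare the top hit's score with scores of shuffled sequences."""
    rng = random.Random(seed)
    best = max(candidates, key=lambda p: match_score(observed, p, tol))
    best_score = match_score(observed, best, tol)
    hits = 0
    for _ in range(n_random):
        shuffled = list(best)
        rng.shuffle(shuffled)
        if match_score(observed, "".join(shuffled), tol) >= best_score:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return best, best_score, (hits + 1) / (n_random + 1)


observed = theoretical_peaks("PEPTIDEK")  # idealized, noise-free spectrum
best, score, p = empirical_p_value(observed, ["PEPTIDEK", "KEDITPEP", "GASPVTCL"])
```

A real search engine would also use peak intensities and a more principled null model; the sketch only shows how the ranking step and the significance step are separate.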
Previous algorithms often focus on only one of these steps. For example, SEQUEST reports only a score between the experimental and theoretical spectra and does not calculate a probability value, whereas OMSSA and Mascot give a probability value but make less use of the intensity information in the spectra.

As mass spectrometry experimental data accumulate, a growing number of experimental spectra are annotated and used to build spectrum libraries. In addition to searching protein sequence databases, identifying tandem mass spectrometry data against a spectrum library has become a new analytical method. Mass spectrometry-based proteomic analysis is, however, more than spectrum annotation: identifying post-translational modifications provides useful information for further revealing the mechanisms of protein regulation. Mass spectrometry-based proteomic analysis has thus developed into a set of analysis processes, including spectrum preprocessing, spectrum identification, identification of post-translational modifications, and quantitative protein analysis. Software suites that cover the whole process, such as Mascot, are now available. These suites are easy to use, but they do not necessarily give the best results because of the limitations of their algorithms.

This study analyzed the pipeline to find out what affects the identification results. The analysis covers the preprocessing of tandem mass spectrometry data, an analysis of several identification algorithms, and the selection of software for identifying post-translational modifications. The purpose is to explore how to use existing tools to extract as much meaningful information as possible from the experimental data. The results show that both X! Tandem and ProteinPilot benefit from deisotoping. When using X! Tandem, performance can be improved by adding decoy sequences if the target database contains few sequences. In addition, the combined use of a sequence database and a spectrum library is an efficient way to interpret spectra.
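The abstract does not say how the decoy sequences were generated. As a rough sketch, assuming the common practice of appending reversed protein sequences to the target database, the idea might look like this in Python. The `DECOY_` prefix, the function names, and the decoy-based false discovery rate estimate are illustrative assumptions, not part of the thesis or of X! Tandem.

```python
def add_reversed_decoys(proteins, prefix="DECOY_"):
    """Return a copy of the database with one reversed decoy per target."""
    combined = dict(proteins)
    for name, sequence in proteins.items():
        combined[prefix + name] = sequence[::-1]
    return combined


def estimate_fdr(matches, threshold, prefix="DECOY_"):
    """Estimate FDR as decoy hits / target hits at or above a score threshold.

    `matches` is a list of (protein_name, score) pairs (hypothetical format).
    """
    accepted = [name for name, score in matches if score >= threshold]
    decoys = sum(name.startswith(prefix) for name in accepted)
    targets = len(accepted) - decoys
    return decoys / targets if targets else 0.0


database = add_reversed_decoys({"P1": "MKTAY"})
fdr = estimate_fdr(
    [("P1", 50), ("P2", 40), ("DECOY_P1", 35), ("P3", 10)],
    threshold=30,
)
```

Adding decoys enlarges a small database, which gives the search engine more candidate sequences for its score statistics; the decoy hits can also serve as a check on the false positive rate.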
Keywords/Search Tags: protein identification, tandem mass spectrometry, database search, post-translational modification