The Recognition Of Subcellular Localization Of Proteins

Posted on: 2008-10-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y Jia
Full Text: PDF
GTID: 2120360218462699
Subject: Computational Mathematics
Abstract/Summary:
Functional annotation of unknown proteins is a major goal in proteomics. The subcellular location of a protein is one of its key functional characteristics, because a protein can perform its normal biological function only after it has been translocated to the correct subcellular location. Starting from the amino acid sequence, this thesis uses N-terminal sorting signal information and amino acid composition to recognize the subcellular localization of proteins in two datasets, plant and non-plant.

First, the thesis analyzes the features of the protein sequences in each subcellular localization class, including the amino acid frequencies, the adjacent residue-pair frequencies, and the N-terminal sorting signals. The single amino acid composition differs somewhat between the localization classes, but not markedly. The 400 adjacent residue-pair frequencies also differ between the classes. The N-terminal signal features differ markedly between the classes in both the plant and the non-plant datasets, and these differences are concentrated mainly in the first 30 positions from the N-terminus.

Next, we use the Increment of Diversity (ID) method for recognition, with five choices of features:
(1) the N-terminal signal features, taking the amino acid distribution over the first 20 N-terminal positions to form 400 information parameters;
(2) the adjacent residue-pair frequencies, forming 400 information parameters;
(3) the amino acid frequencies, forming 20 information parameters;
(4) the N-terminal signal features combined with the adjacent residue-pair frequencies, forming 800 information parameters;
(5) the N-terminal signal features, the amino acid frequencies, and the adjacent residue-pair frequencies combined, forming 820 information parameters.
We trained and evaluated the method with 5-fold cross-validation on the plant and non-plant proteins. The results show that the quality of recognition with the ID method depends chiefly on whether the chosen feature parameters are appropriate. Directly combining several kinds of information linearly within the ID algorithm does not necessarily raise the recognition accuracy in proportion to the amount of information added.

Finally, we apply the Increment of Diversity with Quadratic Discriminant analysis (IDQD), taking the adjacent residue-pair frequencies and the N-terminal sorting signal features as the ID information parameters and integrating the resulting ID values through the quadratic discriminant. In the self-consistency test, the total prediction accuracy reaches 96.8% for four subcellular locations in plant proteins and 92.7% for three subcellular locations in non-plant proteins. With 5-fold cross-validation, the total prediction accuracy reaches 87.4% for four subcellular locations in plant proteins and 91.2% for three subcellular locations in non-plant proteins. The results indicate that the IDQD algorithm achieves high recognition accuracy and is an effective classifier.
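To make the feature sets and the ID step concrete, here is a minimal Python sketch. It is not the thesis's implementation. It assumes the commonly used definition of the diversity measure, D(X) = N log N - sum(n_i log n_i), and of the increment ID(X, S) = D(X + S) - D(X) - D(S), and it assumes that a query is assigned to the class whose pooled source has the smallest ID. The function names, the one-count-per-position encoding of the N-terminal features, and the toy sequences and labels are all hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def composition_counts(seq):
    """20 parameters: occurrence count of each amino acid."""
    counts = Counter(seq)
    return [counts.get(a, 0) for a in AMINO_ACIDS]

def pair_counts(seq):
    """400 parameters: occurrence count of each adjacent residue pair."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    return [counts.get(a + b, 0) for a in AMINO_ACIDS for b in AMINO_ACIDS]

def nterm_counts(seq, n_positions=20):
    """n_positions x 20 parameters: which residue sits at each N-terminal
    position (assumed encoding; non-standard residues are skipped)."""
    vec = [0] * (n_positions * len(AMINO_ACIDS))
    for pos, aa in enumerate(seq[:n_positions]):
        idx = AMINO_ACIDS.find(aa)
        if idx >= 0:
            vec[pos * len(AMINO_ACIDS) + idx] = 1
    return vec

def diversity(counts):
    """D(X) = N log N - sum(n_i log n_i), with 0 log 0 taken as 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(n * math.log(n) for n in counts if n > 0)

def increment_of_diversity(x, s):
    """ID(X, S) = D(X + S) - D(X) - D(S); smaller means more similar."""
    merged = [a + b for a, b in zip(x, s)]
    return diversity(merged) - diversity(x) - diversity(s)

def build_source(seqs, feature):
    """Pool the feature counts of all training sequences of one class."""
    return [sum(col) for col in zip(*(feature(s) for s in seqs))]

def predict(seq, sources, feature):
    """Assign the class whose pooled source gives the smallest ID."""
    x = feature(seq)
    return min(sources, key=lambda label: increment_of_diversity(x, sources[label]))

# Toy data, purely illustrative (not from the thesis datasets).
toy_sources = {
    "class_a": build_source(["AAAAAG", "AAAGAA"], composition_counts),
    "class_b": build_source(["LLLLKL", "LLKLLL"], composition_counts),
}
toy_label = predict("AAGAAA", toy_sources, composition_counts)
```

Concatenating the vectors from `nterm_counts` and `pair_counts` would give an 800-parameter set like item (4), and adding `composition_counts` an 820-parameter set like item (5). The quadratic discriminant stage of IDQD is not sketched here, since the abstract does not describe how it is set up.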
Keywords/Search Tags:Subcellular localization, Increment of Diversity, Quadratic Discriminant analysis, Amino acid composition, N-terminal targeting signal