
Bioinformatics Studies On The Relationship Between Disulfide Structural Feature And Sequence In Proteins

Posted on: 2006-05-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J N Song
GTID: 1100360155452446
Subject: Fermentation engineering
Abstract/Summary:
Disulfide bonds are the primary covalent crosslinks between cysteine side chains and can exist either within a single polypeptide chain or between different polypeptide chains. For many proteins, disulfide bonds are a permanent feature of the final folded product. The correct formation of disulfide bonds is a crucial step in the folding pathway, and the kinetics of disulfide formation can dominate the rate and pathway of protein folding. Mispairing of disulfide bonds is an important cause of incorrect folding of polypeptide chains. Such bonds play important roles in stabilizing the spatial conformation of a protein and ensuring that it can perform its biochemical function.

Systematic bioinformatics studies of the relationship between disulfide structural features and protein sequences have potentially important applications in both protein engineering and rational drug design, such as introducing engineered disulfide bonds to increase the conformational stability of proteins and locating disulfide bridges to aid three-dimensional structure prediction. In this dissertation, disulfide bonds in proteins were studied using common bioinformatics algorithms and tools together with knowledge from mathematics, physics, biology and computer science. After a high-quality protein disulfide structural database and a high-quality database of Escherichia coli gene sequences and their corresponding protein structures had been constructed, systematic bioinformatics studies were carried out to explore the relationship between the structural features of disulfide bond formation and protein sequences at three levels: the gene coding sequence, the amino acid sequence and the three-dimensional spatial structure.
The main contents of this dissertation are as follows:

(1) Constructing a high-quality protein disulfide structural database was the basis for the statistical analysis and computation of disulfide bonds in proteins. Protein structures with a resolution better than 0.25 nm and sequence identity below 30% were selected from the PISCES Culled PDB to form the raw database. From this dataset, a large, high-quality disulfide bond database was built by applying a strict structural data file format test, a sequence consistency test, an SSBOND record accuracy test and SSBOND record correction, thereby eliminating inaccurate and questionable data. A high-quality database of Escherichia coli gene sequences and corresponding protein structures was essential for investigating the relationship between protein folding and protein coding sequence. By querying SWISS-PROT for Escherichia coli proteins, a cross-reference table linking the protein structures to their corresponding gene sequences in different databases was obtained. After a large amount of redundant and uncertain data had been removed, a high-quality dataset, EcoPDB, was finally constructed as a fundamental dataset for understanding the relationship between protein spatial structure data and nucleic acid sequence data.

(2) The formation features and sequence distribution features of disulfide bonds in proteins are important for further investigation of disulfide bond formation, the relationship between disulfide bonds and amino acid sequences, the prediction of the disulfide-bonding states of cysteines, and the folding dynamics assisted by disulfide bonds.
The results indicated that the oxidation states of cysteines showed clear cooperativity: almost all cysteines in a protein were oxidized if that protein contained disulfide bonds. The distribution of disulfide bonds along protein sequences was rather uneven; most bonds were formed between two cysteines separated by a sequence distance of less than 70. The results also indicated a strong preference for certain sequence distances, such as 11, 6, 16, 5 and 13. Disulfide bonds tended to form in the front part of the amino acid sequence, which is favorable for ensuring the elongation and formation of the nascent polypeptide chain and for reducing misfolding during protein folding. The amino acid distributions of the sequences flanking oxidized and reduced cysteines differed distinctly: around oxidized cysteines, hydrophobic and polar residues occurred more frequently, whereas around reduced cysteines, the content of strongly hydrophobic and charged residues was higher. Amino acid residues at different positions flanking a central oxidized cysteine made different contributions to disulfide bond formation. Certain residues at certain positions had a strongly positive effect on disulfide bond formation, while other residues at certain positions showed a strongly negative tendency.

(3) A novel approach was introduced to predict the disulfide-bonding states of cysteines in proteins by means of a two-class linear discriminator based on amino acid and dipeptide composition. The results demonstrated that the cooperativity exhibited by cysteine oxidation could be described well by the composition of the 20 amino acids and 400 dipeptides in proteins.
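The composition features named in item (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the dissertation's implementation: the abstract does not say how the compositions were normalized or how non-standard residues were handled, so the sketch assumes simple frequencies over the 20 standard residues and skips anything else. The function names are hypothetical.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered dipeptides built from the 20 standard residues.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(sequence):
    """Return a 20-element list of residue frequencies (assumed normalization)."""
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    counts = Counter(residues)
    return [counts[a] / len(residues) for a in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Return a 400-element list of frequencies of adjacent residue pairs."""
    seq = sequence.upper()
    # Overlapping windows of length 2; pairs with a non-standard residue are skipped.
    pairs = [
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in AMINO_ACIDS and seq[i + 1] in AMINO_ACIDS
    ]
    if not pairs:
        return [0.0] * len(DIPEPTIDES)
    counts = Counter(pairs)
    return [counts[d] / len(pairs) for d in DIPEPTIDES]
```

Either vector could then be passed to any two-class linear discriminator; the choice of discriminator and its training procedure are outside what this abstract specifies.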
Based on the composition of the 20 amino acids, the prediction accuracy for the oxidation state of cysteines reached 85.2% on a per-cysteine basis and 81.2% on a per-protein basis, as evaluated by a rigorous jack-knife procedure. The prediction accuracies for oxidized and reduced cysteines were Qoxi = 89.9% and Qred = 71.0%, respectively. The Matthews correlation coefficient (MCC) was 60.6%. Based on the 400 dipeptide compositions, the prediction accuracy of the classifier reached Q2 = 89.1% at the cysteine level and Q2prot = 85.2% at the protein level, as evaluated by the rigorous jack-knife test. The accuracies for oxidized and reduced cysteines were Qoxi = 92.2% and Qred = 79.3%, respectively. The MCC was 70.7%. These results showed that whether cysteines form disulfide bonds depends not only on the global structural features of proteins but also on their local sequence environment. They also demonstrated that this method, based on amino acid and dipeptide compositions, provides prediction performance comparable to that of existing methods for predicting the oxidation states of cysteines in proteins.

(4) A novel approach was proposed to predict the disulfide-bonding states of cysteines in proteins by constructing a two-stage classifier that combines a first-stage global linear discriminator based on amino acid composition with a second-stage local support vector machine classifier. The results indicated that the new hybrid classifier had relatively higher prediction accuracy for the disulfide-bonding states of cysteines. With Qc = -0.1, the overall prediction accuracy could be improved to Q2 = 84.1% at the cysteine level and Q2prot = 80.1% at the protein level, as evaluated by the jack-knife procedure. The accuracies for oxidized and reduced cysteines were Qoxi = 87.8% and Qred = 77.8%, respectively. The Matthews correlation coefficient (MCC) was 62.2%.
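The accuracy measures quoted above (Q2, Q2prot, Qoxi, Qred and MCC) are not defined in this abstract. The sketch below assumes the definitions conventionally used for two-class cysteine state prediction: Q2 is the fraction of cysteines predicted correctly, Qoxi and Qred are the fractions of oxidized and reduced cysteines predicted correctly, and Q2prot is the fraction of proteins in which every cysteine is predicted correctly. These definitions and all function names are assumptions, and the jack-knife resampling itself is not shown.

```python
import math


def _ratio(numerator, denominator):
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def cysteine_level_metrics(tp, fn, tn, fp):
    """Compute assumed Q2, Qoxi, Qred and MCC from confusion counts.

    tp: oxidized cysteines predicted as oxidized
    fn: oxidized cysteines predicted as reduced
    tn: reduced cysteines predicted as reduced
    fp: reduced cysteines predicted as oxidized
    """
    total = tp + fn + tn + fp
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Q2": _ratio(tp + tn, total),
        "Qoxi": _ratio(tp, tp + fn),
        "Qred": _ratio(tn, tn + fp),
        "MCC": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


def protein_level_accuracy(proteins):
    """Assumed Q2prot: fraction of proteins with all cysteine states correct.

    proteins: list of (true_states, predicted_states) pairs, one per protein,
    where each element is a sequence of per-cysteine labels.
    """
    correct = sum(1 for true, pred in proteins if list(true) == list(pred))
    return _ratio(correct, len(proteins))
```

Under these assumed definitions a classifier can score well on Q2 while Qred lags behind Qoxi, which is why the abstract reports the two per-class accuracies and the MCC alongside the overall figure.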
These results indicated that the formation of disulfide bonds by cysteines is determined by the global structural features of proteins as well as by the local sequence environment of cysteines.

(5) The correlation between cysteine synonymous codon usage and the flanking amino acid residues, and the correlation between cysteine synonymous codon usage and disulfide bond formation, were investigated across the whole E. coli genome using a novel method based on information theory and statistical learning theory. By computing the I_m(cys|a) values of the twenty amino acid residues flanking cysteines on both the N-terminal and C-terminal sides in E. coli genome sequences, it was found that lysine at position -7, tryptophan at position -6, tryptophan at position -1, and methionine and glutamic acid at position +1 had a great influence on cysteine synonymous codon usage. By computing the Shannon entropy values of cysteine synonymous codons in EcoPDB, the high-quality database of E. coli gene sequences and corresponding protein structures, it was found that cysteine synonymous codons do contain some factors that influence disulfide bond formation. As far as the E. coli genome is concerned, the correlation between cysteine synonymous codon usage and disulfide bond formation may be a form of regulation of protein structure at the gene sequence level. The bias in cysteine synonymous codon usage should be regarded as a reflection of the biological functional constraints resulting from disulfide bond formation.

(6) A method was developed for the classification and prediction of protein spatial structures and for protein structure homology search based on disulfide bonding patterns. Disulfide bonding patterns could be used to determine a target protein's structural classification and to search for related proteins with similar disulfide information.
By computing the disulfide bonding patterns in the protein disulfide structural database and analyzing the correlation between protein disulfide patterns and related homologous protein structures, six detailed cases were presented to demonstrate that disulfide bonding patterns alone, instead of complete amino acid sequences, can be used to discriminate and classify protein structural folds. The results also indicated that proteins with the same disulfide bonding pattern usually belong to the same structural family or superfamily in the Structural Classification of Proteins (SCOP) database and commonly have similar biological functions.
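The abstract does not specify how a disulfide bonding pattern is encoded. A minimal sketch, assuming the commonly used connectivity notation, ranks the bonded cysteines of one chain by sequence position and writes each bond as a pair of ranks; proteins can then be grouped by identical pattern strings. The function names, the intrachain-only restriction and the example residue numbers are hypothetical.

```python
from collections import defaultdict


def bonding_pattern(bonds):
    """Encode intrachain disulfide bonds as a connectivity string.

    bonds: iterable of (position_a, position_b) residue-number pairs,
    one pair per disulfide bond, all within a single chain (assumption).
    """
    # Rank every bonded cysteine by its position along the sequence.
    positions = sorted({pos for bond in bonds for pos in bond})
    rank = {pos: index + 1 for index, pos in enumerate(positions)}
    # Express each bond as an ordered pair of ranks, then sort the pairs
    # so that the result does not depend on the input order.
    pairs = sorted(tuple(sorted((rank[a], rank[b]))) for a, b in bonds)
    return ",".join(f"{i}-{j}" for i, j in pairs)


def group_by_pattern(proteins):
    """Group protein identifiers by their disulfide connectivity string.

    proteins: dict mapping an identifier to its list of bonded pairs.
    """
    groups = defaultdict(list)
    for name, bonds in proteins.items():
        groups[bonding_pattern(bonds)].append(name)
    return dict(groups)


# Made-up residue numbers: three bonds linking six cysteines.
print(bonding_pattern([(5, 55), (14, 38), (30, 51)]))  # prints 1-6,2-4,3-5
```

Because the encoding discards the actual residue numbers, two chains of different length with the same connectivity map to the same string, which is the property a pattern-based search relies on.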
Keywords/Search Tags: bioinformatics, database, disulfide bond, cysteine, disulfide bonding state prediction, machine learning method, statistical correlation, support vector machine, synonymous codon usage, Escherichia coli, disulfide bonding pattern