Title page for 952211005



Student Number 952211005
Author Man-chien Kuo
Author's Email Address Not public.
Statistics This thesis has been viewed 2087 times and downloaded 308 times.
Department Graduate Institute of Systems Biology and Bioinformatics
Year 2007
Semester 2
Degree Master
Type of Document Master's Thesis
Language English
Title Prediction of protein glycosylation sites by using support vector machines
Date of Defense 2008-07-03
Page Count 55
Keyword
  • glycosylation
  • post-translational modification
  • prediction
  • support vector machines
Abstract Protein glycosylation is an important post-translational modification (PTM) that affects various molecular properties, such as protein structure, biological activity and protein-protein interaction. Because experimental identification is difficult and the number of sites to identify is large, several computational approaches have been proposed in recent years to identify protein glycosylation sites. The features used in these models were mainly the amino acids surrounding the glycosylation sites, and each of the previous prediction tools targets particular types of glycosylation. We therefore developed prediction methods that identify O-linked, N-linked and C-linked glycosylation sites using support vector machines (SVMs) based on three feature sets: dipeptide combined with accessible surface area (ASA), region combined with amino acid, and dipeptide alone. The accuracies for O-linked glycosylation on serine, O-linked glycosylation on threonine, N-linked glycosylation on asparagine and C-linked glycosylation on tryptophan are 95%, 91%, 96% and 95%, respectively. The methods are implemented in GSI, a web server for identifying O-linked, N-linked and C-linked glycosylation sites.
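The abstract names sequence windows around candidate residues and dipeptide features as inputs to the SVM. The following is a minimal sketch of how such features might be built, not the thesis's actual implementation: the function names `extract_windows` and `dipeptide_vector`, the flank size, the `X` padding character, and the normalisation by pair count are all assumptions made for illustration.

```python
from itertools import product

# The 20 standard amino acids and the 400 ordered pairs (dipeptides) they form.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
DIPEPTIDE_INDEX = {dp: i for i, dp in enumerate(DIPEPTIDES)}


def extract_windows(sequence, residues="STNW", flank=4, pad="X"):
    """Yield (1-based position, residue, window) for each candidate site.

    Candidate residues are S/T (O-linked), N (N-linked) and W (C-linked).
    The window is symmetric, with `flank` residues on each side; sequence
    ends are padded with `pad` (a hypothetical choice, not from the thesis).
    """
    padded = pad * flank + sequence + pad * flank
    for i, aa in enumerate(sequence):
        if aa in residues:
            yield i + 1, aa, padded[i : i + 2 * flank + 1]


def dipeptide_vector(window):
    """Return a 400-dimensional dipeptide frequency vector for a window.

    Pairs containing a non-standard symbol (such as the padding) are skipped.
    """
    counts = [0.0] * len(DIPEPTIDES)
    pairs = 0
    for a, b in zip(window, window[1:]):
        idx = DIPEPTIDE_INDEX.get(a + b)
        if idx is not None:
            counts[idx] += 1.0
            pairs += 1
    if pairs:
        counts = [c / pairs for c in counts]
    return counts


if __name__ == "__main__":
    for pos, aa, window in extract_windows("MKNGTWA", flank=2):
        vec = dipeptide_vector(window)
        print(pos, aa, window, sum(1 for v in vec if v > 0))
```

Each resulting vector, labelled as glycosylated or non-glycosylated, could then be passed to an SVM trainer. The ASA and region features mentioned in the abstract would need additional inputs that are not shown here.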
    Table of Contents
    Chapter 1 Introduction
    1.1 Background
    1.1.1 Post-translational modification (PTM)
    1.1.2 Glycosylation
    Chapter 2 Related works
    2.1 Prediction of glycosylation tools
    2.1.1 Prediction of O-linked glycosylation tool
    2.1.2 Prediction of N-linked glycosylation tool
    2.1.3 Prediction of C-linked glycosylation tool
    2.1.4 Other prediction of glycosylation tool
    2.2 Comparison of current prediction tools
    Chapter 3 Materials and Methods
    3.1 System Flow
    3.2 dbPTM dataset
    3.3 Data construction
    3.4 Feature construction
    3.4.1 0/1 system
    3.4.2 Dipeptide encoding
    3.4.3 Tripeptide encoding
    3.4.4 Secondary structure encoding
    3.4.5 ASA encoding
    3.4.6 Region encoding
    3.5 Support Vector Machine (SVM)
    3.6 Performance evaluation
    Chapter 4 Results
    4.1 Prediction performance
    4.2 Comparison with previous work
    4.3 Independent test set in previous prediction tools and ours
    4.4 Web interface
    Chapter 5 Discussion
    References

    List of Figures
    Figure 1. The structure of O-linked glycosylation. The oligosaccharides are attached to the hydroxyl group of the amino acids serine and threonine.
    Figure 2. The structure of N-linked glycosylation. The oligosaccharides are attached to asparagine.
    Figure 3. The structure of C-linked glycosylation. The α-mannopyranosyl residue is attached to the indole C2 of tryptophan via a C-C link.
    Figure 4. The structure of GPI anchors. The hydrophobic phosphatidylinositol group is linked to a residue at or near the C-terminus of a protein through a carbohydrate-containing linker.
    Figure 5. The system flow of constructing prediction models
    Figure 6. The process of truncating the protein sequence into region windows with a glycosylation or non-glycosylation site in the middle.
    Figure 7. The process of dipeptide encoding
    Figure 8. The process of tripeptide encoding
    Figure 9. The process of secondary structure encoding
    Figure 10. The calculation of ASA scores combined with dipeptide.
    Figure 11. Comparison of the difference between ASA scores of positive and negative datasets on serine residues in O-linked glycosylation
    Figure 12. Comparison of the difference between ASA scores of positive and negative datasets on threonine residues in O-linked glycosylation
    Figure 13. Comparison of the difference between ASA scores of positive and negative datasets in N-linked glycosylation
    Figure 14. Comparison of the difference between ASA scores of positive and negative datasets in C-linked glycosylation
    Figure 15. The performance of serine residue in O-linked glycosylation prediction models
    Figure 16. The performance of threonine residue in O-linked glycosylation prediction models
    Figure 17. The performance of N-linked glycosylation prediction models
    Figure 18. The performance of C-linked glycosylation prediction models
    Figure 19. The interface of the GSI web server, which is available at http://bioinfo.gene.idv.tw/.
    Figure 20. The web interface with an example of inputs on GSI
    Figure 21. The results of each type of potentially glycosylated amino acid sites and the distribution of ASA scores surrounding them
    Figure 22. The list of protein sequence prediction results and the ASA scores of each site.

    List of Tables
    Table 1. Comparison of current prediction tools
    Table 2. Number of positive and negative datasets in our study for O-linked, N-linked and C-linked glycosylation considered
    Table 3. The number of positive and negative datasets for serine in O-linked glycosylation for different symmetrical window sizes and ratios of positive and negative datasets
    Table 4. The number of positive and negative datasets for serine in O-linked glycosylation for different symmetrical window sizes and ratios of positive and negative datasets
    Table 5. The number of positive and negative datasets for C-linked glycosylation for different symmetrical window sizes
    Table 6. The number of positive and negative datasets for N-linked glycosylation for different symmetrical window sizes
    Table 7. Various ratios of positive and negative datasets on serine residues in O-linked glycosylation based on 0/1 system encoding
    Table 8. Various ratios of positive and negative datasets on threonine residues in O-linked glycosylation based on 0/1 system encoding
    Table 9. The results of serine residue in O-linked glycosylation using different features
    Table 10. The results of threonine residue in O-linked glycosylation using different features
    Table 11. The results in N-linked glycosylation from different models
    Table 12. The performance of C-linked glycosylation from different models
    Table 13. Best models of four types of glycosylation
    Table 14. Comparison of using our training datasets on serine in O-linked glycosylation to test previous prediction tools
    Table 15. Comparison of using our training datasets on threonine residues in O-linked glycosylation to test the other prediction tools
    Table 16. Comparison of the training datasets for serine within other prediction tools to test our and their own prediction models
    Table 17. Comparison of the training datasets for threonine within other prediction tools to test our and their own prediction models
    Table 18. Comparison of proposed accuracy with other prediction tools on N-linked glycosylation
    Table 19. Comparison of the training datasets for asparagine residue within other prediction tools to test our and their prediction tools
    Table 20. Comparison of proposed accuracy with other prediction tools on C-linked glycosylation
    Table 21. Comparison of the training datasets for tryptophan residues within other prediction tools to test our and their own prediction models
    Table 22. Comparison of using independent test sets with current prediction tools and ours on serine residues in O-linked glycosylation
    Table 23. Comparison of using independent test sets with current prediction tools and ours on threonine residues in O-linked glycosylation
    Table 24. Comparison of using independent test sets with current prediction tools and ours on asparagine residues in N-linked glycosylation
    Table 25. Comparison of using independent test sets with current prediction tools and ours on tryptophan residues in C-linked glycosylation
    Table 26. Using datasets of serine in O-linked glycosylation of CKSAAP_OGlySite, EnsembleGly and NetOGlyc with our prediction method to cross test previous prediction tools
    Table 27. Using datasets of threonine residues in O-linked glycosylation of CKSAAP_OGlySite, EnsembleGly and NetOGlyc with our prediction method to cross test previous prediction tools
    Table 28. Using datasets of asparagine residues in N-linked glycosylation of CKSAAP_OGlySite, EnsembleGly and NetOGlyc with our prediction method to cross test previous prediction tools
    Table 29. Using datasets of tryptophan residues in C-linked glycosylation of CKSAAP_OGlySite, EnsembleGly and NetOGlyc with our prediction method to cross test previous prediction tools
    Table 30. The comparison of different glycosylation datasets between previous prediction tools and ours
    Advisor
  • Li-ching Wu
  • Jorng-tzong Horng
    Files
  • 952211005.pdf
  • approve in 1 year
    Date of Submission 2008-07-09
