Student Number 952211005
Author Man-chien Kuo
Author's Email Address Not public
Department Graduate Institute of Systems Biology and Bioinformatics
Year 2007
Semester 2
Degree Master
Type of Document Master's Thesis
Language English
Title Prediction of protein glycosylation sites by using support vector machines
Date of Defense 2008-07-03
Page Count 55
Keyword glycosylation, post-translational modification, prediction, support vector machines
Abstract Protein glycosylation is an important post-translational modification (PTM) that affects various molecular properties such as structure, biological activity and protein-protein interaction. Because experimental identification is difficult and the number of sites to be identified is large, several computational approaches have been proposed in recent years to identify protein glycosylation sites. The features used in these identification models were mainly the amino acids surrounding the glycosylation sites. Each of the previous prediction tools targets a single type of glycosylation. Therefore, we develop prediction methods that identify O-linked, N-linked and C-linked glycosylation sites using support vector machines (SVM) based on dipeptide combined with accessible surface area, region combined with amino acid, and dipeptide features. The accuracies for O-linked glycosylation on serine and threonine, N-linked glycosylation on asparagine and C-linked glycosylation on tryptophan are 95%, 91%, 96% and 95%, respectively. We implemented these methods in GSI, a web server that identifies O-linked, N-linked and C-linked glycosylation sites.
Table of Contents
Chapter 1 Introduction
1.1.1 Post-translational modification (PTM)
Chapter 2 Related works
2.1 Prediction of glycosylation tools
2.1.1 Prediction of O-linked glycosylation tool
2.1.2 Prediction of N-linked glycosylation tool
2.1.3 Prediction of C-linked glycosylation tool
2.1.4 Other prediction of glycosylation tool
2.2 Comparison of current prediction tools
Chapter 3 Materials and Methods
3.4.4 Secondary structure encoding
3.5 Support Vector Machine (SVM)
Chapter 4 Results
4.2 Comparison with previous work
4.3 Independent test set in previous prediction tools and ours
Chapter 5 Discussion
List of Figures
Figure 1. The structure of O-linked glycosylation. The oligosaccharides are attached to the hydroxyl group of the amino acids serine and threonine.
Figure 2. The structure of N-linked glycosylation. The oligosaccharides are attached to asparagine.
Figure 3. The structure of C-linked glycosylation. The α-mannopyranosyl residue is attached to the indole C2 of tryptophan via a C-C link.
Figure 4. The structure of GPI anchors. The hydrophobic phosphatidylinositol group is linked to a residue at or near the C-terminus of a protein through a carbohydrate-containing linker.
Figure 5. The system flow for constructing prediction models.
Figure 6. The process of truncating the protein sequence into region windows with a glycosylation or non-glycosylation site in the middle.
Figure 7. The process of dipeptide encoding.
Figure 8. The process of tripeptide encoding.
Figure 9. The process of secondary structure encoding.
Figure 10. The calculation of ASA scores combined with dipeptide.
Figure 11. Comparison of the difference between ASA scores of positive and negative datasets on serine residues in O-linked glycosylation.
Figure 12. Comparison of the difference between ASA scores of positive and negative datasets on threonine residues in O-linked glycosylation.
Figure 13. Comparison of the difference between ASA scores of positive and negative datasets in N-linked glycosylation.
Figure 14. Comparison of the difference between ASA scores of positive and negative datasets in C-linked glycosylation.
Figure 15. The performance of the prediction models for serine residues in O-linked glycosylation.
Figure 16. The performance of the prediction models for threonine residues in O-linked glycosylation.
Figure 17. The performance of the N-linked glycosylation prediction models.
Figure 18. The performance of the C-linked glycosylation prediction models.
Figure 19. The interface of the GSI web server, which is available at http://bioinfo.gene.idv.tw/.
Figure 20. The GSI web interface with an example input.
Figure 21. The results for each type of potentially glycosylated amino acid site and the distribution of ASA scores surrounding them.
Figure 22. The list of prediction results for the protein sequences and the ASA scores of each site.
List of Tables
Table 1. Comparison of current prediction tools
Table 2. Number of positive and negative datasets considered in our study for O-linked, N-linked and C-linked glycosylation
Table 3. The number of positive and negative datasets for serine in O-linked glycosylation for different symmetrical window sizes and ratios of positive to negative datasets
Table 4. The number of positive and negative datasets for serine in O-linked glycosylation for different symmetrical window sizes and ratios of positive to negative datasets
Table 5. The number of positive and negative datasets for C-linked glycosylation for different symmetrical window sizes
Table 6. The number of positive and negative datasets for N-linked glycosylation for different symmetrical window sizes
Table 7. The various ratios of positive to negative datasets on serine residues in O-linked glycosylation based on 0/1 system encoding
Table 8. The various ratios of positive to negative datasets on threonine residues in O-linked glycosylation based on 0/1 system encoding
Table 9. The results for serine residues in O-linked glycosylation using different features
Table 10. The results for threonine residues in O-linked glycosylation using different features
Table 11. The results for N-linked glycosylation from different models
Table 12. The performance of C-linked glycosylation from different models
Table 13. Best models for the four types of glycosylation
Table 14. Comparison of using our training datasets on serine residues in O-linked glycosylation to test previous prediction tools
Table 15. Comparison of using our training datasets on threonine residues in O-linked glycosylation to test the other prediction tools
Table 16. Comparison of using the training datasets for serine from other prediction tools to test our and their own prediction models
Table 17. Comparison of using the training datasets for threonine from other prediction tools to test our and their own prediction models
Table 18. Comparison of the proposed accuracy with other prediction tools on N-linked glycosylation
Table 19. Comparison of using the training datasets for asparagine residues from other prediction tools to test our and their own prediction models
Table 20. Comparison of the proposed accuracy with other prediction tools on C-linked glycosylation
Table 21. Comparison of using the training datasets for tryptophan residues from other prediction tools to test our and their own prediction models
Table 22. Comparison of current prediction tools and ours using independent test sets on serine residues in O-linked glycosylation
Table 23. Comparison of current prediction tools and ours using independent test sets on threonine residues in O-linked glycosylation
Table 24. Comparison of current prediction tools and ours using independent test sets on asparagine residues in N-linked glycosylation
Table 25. Comparison of current prediction tools and ours using independent test sets on tryptophan residues in C-linked glycosylation
Table 26. Using datasets of serine residues in O-linked glycosylation from CKSAAP_OGlySite, EnsembleGly and NetOGlyc with our prediction method to cross-test previous prediction tools
Table 27. Using datasets of threonine residues in O-linked glycosylation from CKSAAP_OGlySite, EnsembleGly and NetOGlyc with our prediction method to cross-test previous prediction tools
Table 28. Using datasets of asparagine residues in N-linked glycosylation from CKSAAP_OGlySite, EnsembleGly and NetOGlyc with our prediction method to cross-test previous prediction tools
Table 29. Using datasets of tryptophan residues in C-linked glycosylation from CKSAAP_OGlySite, EnsembleGly and NetOGlyc with our prediction method to cross-test previous prediction tools
Table 30. The comparison of different glycosylation datasets between previous prediction tools and ours
References
1.Hart GW: Glycosylation. Current opinion in cell biology 1992, 4(6):1017-1023.
2.Hounsell EF, Davies MJ, Renouf DV: O-linked protein glycosylation structure and function. Glycoconjugate journal 1996, 13(1):19-26.
3.Stanley P: Glycosylation engineering. Glycobiology 1992, 2(2):99-107.
4.Jenkins N, Parekh RB, James DC: Getting the glycosylation right: implications for the biotechnology industry. Nature biotechnology 1996, 14(8):975-981.
5.Mann M, Jensen ON: Proteomic analysis of post-translational modifications. Nature biotechnology 2003, 21(3):255-261.
6.Walsh CT, Garneau-Tsodikova S, Gatto GJ, Jr.: Protein posttranslational modifications: the chemistry of proteome diversifications. Angewandte Chemie (International ed.) 2005, 44(45):7342-7372.
7.Blom N, Sicheritz-Ponten T, Gupta R, Gammeltoft S, Brunak S: Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics 2004, 4(6):1633-1649.
8.Asker N, Baeckstrom D, Axelsson MA, Carlstedt I, Hansson GC: The human MUC2 mucin apoprotein appears to dimerize before O-glycosylation and shares epitopes with the 'insoluble' mucin of rat small intestine. The Biochemical journal 1995, 308 ( Pt 3):873-880.
9.Peters BP, Krzesicki RF, Perini F, Ruddon RW: O-glycosylation of the alpha-subunit does not limit the assembly of chorionic gonadotropin alpha beta dimer in human malignant and nonmalignant trophoblast cells. Endocrinology 1989, 124(4):1602-1612.
10.Chen YZ, Tang YR, Sheng ZY, Zhang Z: Prediction of mucin-type O-glycosylation sites in mammalian proteins using the composition of k-spaced amino acid pairs. BMC bioinformatics 2008, 9:101.
11.Hanisch FG: O-glycosylation of the mucin type. Biological chemistry 2001, 382(2):143-149.
12.Helenius A, Aebi M: Roles of N-linked glycans in the endoplasmic reticulum. Annual review of biochemistry 2004, 73:1019-1049.
13.Gavel Y, von Heijne G: Sequence differences between glycosylated and non-glycosylated Asn-X-Thr/Ser acceptor sites: implications for protein engineering. Protein engineering 1990, 3(5):433-442.
14.Rudd PM, Elliott T, Cresswell P, Wilson IA, Dwek RA: Glycosylation and the immune system. Science (New York, NY) 2001, 291(5512):2370-2376.
15.Hofsteenge J, Blommers M, Hess D, Furmanek A, Miroshnichenko O: The four terminal components of the complement system are C-mannosylated on multiple tryptophan residues. The Journal of biological chemistry 1999, 274(46):32786-32794.
16.Doucey MA, Hess D, Cacan R, Hofsteenge J: Protein C-mannosylation is enzyme-catalysed and uses dolichyl-phosphate-mannose as a precursor. Molecular biology of the cell 1998, 9(2):291-300.
17.Perez-Vilar J, Randell SH, Boucher RC: C-Mannosylation of MUC5AC and MUC5B Cys subdomains. Glycobiology 2004, 14(4):325-337.
18.Ihara Y, Manabe S, Kanda M, Kawano H, Nakayama T, Sekine I, Kondo T, Ito Y: Increased expression of protein C-mannosylation in the aortic vessels of diabetic Zucker rats. Glycobiology 2005, 15(4):383-392.
19.Julenius K: NetCGlyc 1.0: prediction of mammalian C-mannosylation sites. Glycobiology 2007, 17(8):868-876.
20.Kinoshita T, Ohishi K, Takeda J: GPI-anchor synthesis in mammalian cells: genes, their products, and a deficiency. Journal of biochemistry 1997, 122(2):251-257.
21.Caragea C, Sinapov J, Silvescu A, Dobbs D, Honavar V: Glycosylation site prediction using ensembles of Support Vector Machine classifiers. BMC bioinformatics 2007, 8:438.
22.Eisenhaber B, Bork P, Eisenhaber F: Prediction of potential GPI-modification sites in proprotein sequences. Journal of molecular biology 1999, 292(3):741-758.
23.Presnell SR, Cohen FE: Artificial neural networks for pattern recognition in biochemical sequences. Annual review of biophysics and biomolecular structure 1993, 22:283-298.
24.Burges CJC: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 1998, 2(2):121-167.
25.Hansen JE, Lund O, Tolstrup N, Gooley AA, Williams KL, Brunak S: NetOglyc: prediction of mucin type O-glycosylation sites based on sequence context and surface accessibility. Glycoconjugate journal 1998, 15(2):115-130.
26.Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A: The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Human mutation 2004, 23(5):464-470.
27.Gupta R, Birch H, Rapacki K, Brunak S, Hansen JE: O-GLYCBASE version 4.0: a revised database of O-glycosylated proteins. Nucleic acids research 1999, 27(1):370-372.
28.Julenius K, Molgaard A, Gupta R, Brunak S: Prediction, conservation analysis, and structural characterization of mammalian mucin-type O-glycosylation sites. Glycobiology 2005, 15(2):153-164.
29.Li S, Liu B, Zeng R, Cai Y, Li Y: Predicting O-glycosylation sites in mammalian proteins by using SVMs. Computational biology and chemistry 2006, 30(3):203-208.
30.Gupta R, Jung E: NetNGlyc: Prediction of N-glycosylation sites in human proteins. In.: Accessed; 2005.
31.Ahmad S, Gromiha MM, Sarai A: RVP-net: online prediction of real valued accessible surface area of proteins from single sequences. Bioinformatics (Oxford, England) 2003, 19(14):1849-1851.
32.Lee TY, Huang HD, Hung JH, Huang HY, Yang YS, Wang TH: dbPTM: an information repository of protein post-translational modification. Nucleic acids research 2006, 34(Database issue):D622-627.
33.McGuffin LJ, Bryson K, Jones DT: The PSIPRED protein structure prediction server. Bioinformatics (Oxford, England) 2000, 16(4):404-405.
34.Grzymislawski M, Derc K, Sobieska M, Wiktorowicz K: Microheterogeneity of acute phase proteins in patients with ulcerative colitis. World J Gastroenterol 2006, 12(32):5191-5195.
35.Bhasin M, Raghava GP: ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic acids research 2004, 32(Web Server issue):W414-419.
36.Gould SJ, Keller GA, Hosken N, Wilkinson J, Subramani S: A conserved tripeptide sorts proteins to peroxisomes. The Journal of cell biology 1989, 108(5):1657-1664.
37.Richmond TJ: Solvent accessible surface area and excluded volume in proteins. Analytical equations for overlapping spheres and implications for the hydrophobic effect. Journal of molecular biology 1984, 178(1):63-89.
38.Hua S, Sun Z: Support vector machine approach for protein subcellular localization prediction. Bioinformatics (Oxford, England) 2001, 17(8):721-728.
39.Chang CC, Lin CJ: LIBSVM: a library for support vector machines. In., vol. 80; 2001: 604-611.
40.Song J, Burrage K, Yuan Z, Huber T: Prediction of cis/trans isomerization in proteins using PSI-BLAST profiles and secondary structure information. BMC bioinformatics 2006, 7:124.
41.Witten I, Frank E: Data Mining: Practical Machine Learning Tools and Techniques: Morgan Kaufmann; 2005.
Advisor Li-ching Wu
approve in 1 year
952211005.pdf Date of Submission 2008-07-09