Title page for 90522059



Student Number 90522059
Author Shih-Chien Kuo(郭釋謙)
Author's Email Address bruce@db.csie.ncu.edu.tw
Statistics This thesis has been viewed 1540 times and downloaded 1090 times.
Department Computer Science and Information Engineering
Year 2002
Semester 2
Degree Master
Type of Document Master's Thesis
Language zh-TW.Big5 Chinese
Title On-Line Extraction Rule Analysis
Date of Defense 2003-06-25
Page Count 39
Keyword
  • Information Extraction
  • Information Integration
  • Information Retrieval
    Abstract The vast amount of online information available has led to renewed interest in information extraction (IE) systems, which analyze input documents to produce a structured representation of selected information. However, the design of an IE system varies greatly with its input, which ranges from unrestricted free text to semi-structured Web documents. This thesis extends an automatic pattern discovery approach called IEPAD to the rapid generation of IE systems that extract structured data from semi-structured Web documents. In this framework, extraction rules can be trained not only from a multiple-record Web page but also from multiple single-record Web pages (called singular pages). Most importantly, the framework requires none of the annotation labor that many machine-learning-based approaches depend on. Evaluation results show a high level of system performance.
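    Table of Contents Chapter 1 Introduction 1
    Chapter 2 Related Work 4
    2.1 Information Extraction Systems Requiring User Annotation 4
    2.2 Annotation-Free Information Extraction Systems 6
    2.3 WYSIWYG Information Extraction Systems 9
    Chapter 3 System Architecture 14
    3.1 Example 14
    3.2 Target Region Selection (Enclosing) 16
    3.3 Generalization 20
    3.4 Detailed Data Specification 24
    3.5 Multiple Enclosing 25
    3.6 Extraction Rules 26
    Chapter 4 Extractor 27
    Chapter 5 Experimental Results and Discussion 29
    5.1 Extracting Multiple-Record Pages 29
    5.2 Extracting Singular Pages 32
    Chapter 6 Conclusions and Future Work 36
    References 37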
    Reference [1] N. Ashish and C. Knoblock. Wrapper generation for semi-structured internet sources. SIGMOD Record, 26(4):8–15, 1997.
    [2] R. Baumgartner, S. Flesca, and G. Gottlob. Supervised wrapper generation with Lixto. In Proceedings of VLDB Demo, 2001.
    [3] C.-H. Chang and S.-C. Lui. IEPAD: Information extraction based on pattern discovery. In Proceedings of the 10th International Conference on World Wide Web, pages 681–688, Hong Kong, May 2–6, 2001.
    [4] B. Chidlovskii, J. Ragetli, and M. Rijke. Automatic wrapper generation for web search engines. In Proceedings of the 1st International Conference on Web-Age Information Management (WAIM’2000), LNCS Series, Shanghai, China, 2000.
    [5] D. Embley, Y. Jiang, and Y.-K. Ng. Record-boundary discovery in web documents. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’99), pages 467–478, Philadelphia, PA, 1999.
    [6] D. Freitag. Information extraction from HTML: Application of a general machine learning approach. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pages 517–523, 1998.
    [7] C.-N. Hsu and C.-C. Chang. Finite-state transducers for semi-structured text mining. In Proceedings of IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications, pages 38–49, Stockholm, Sweden, 1999.
    [8] C.-N. Hsu and M.-T. Dung. Generating finite-state transducers for semi-structured data extraction from the web. Information Systems, 23(8):521–538, 1998.
    [9] I. Muslea, S. Minton, and C. Knoblock. A hierarchical information from semi-structured documents. In Proceedings of the 2000 ACM CIKM International Conference on Information and Knowledge Management, pages 250–257, VA, USA, 2000.
    [10] G. Huck, P. Fankhauser, K. Aberer, and E.J. Neuhold. JEDI: Extracting and synthesizing information from the web. In Proc. of COOPIS, 1998.
    [11] C. Knoblock, S. Minton, J. Ambite, et al. Modeling web sources for information integration. In Proceedings of the 15th National Conference on Artificial Intelligence and Tenth Innovative Applications of Artificial Intelligence Conference, pages 211–218, Wisconsin, USA, 1998.
    [12] N. Kushmerick, D. Weld, and R. Doorenbos. Wrapper induction for information extraction. In Proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI), pages 729–737, Japan, 1997.
    [13] W.-Y. Lin and W. Lam. Learning to extract hierarchical information from semi-structured documents. In Proceedings of the 2000 ACM CIKM International Conference on Information and Knowledge Management, pages 250–257, VA, USA, 2000.
    [14] L. Liu, C. Pu, and W. Han. XWRAP: An XML-enabled wrapper construction system for web information sources. In Proceedings of ICDE, 2000.
    [15] W. May, R. Himmeröder, G. Lausen, and B. Ludäscher. A unified framework for wrapping, mediating and restructuring information from the web. In Proc. of WWWCM, 1999.
    [16] A. Sahuguet and F. Azavant. Building light-weight wrappers for legacy web data-sources using W4F. In Proceedings of VLDB, 1999.
    [17] A. Sahuguet and F. Azavant. Building intelligent web applications using lightweight wrappers. Data and Knowledge Engineering, 36(3):283–316, 2001.
    [18] S. Soderland. Learning to extract text-based information from the world wide web. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, pages 233–272, CA, USA, 1997.
    [19] S. Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1-3):233–272, 1999.
    [20] G. Gonnet, R. Baeza-Yates, and T. Snider. New indices for text: PAT trees and PAT arrays. In W. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms, chapter 5, pages 66–82. Prentice Hall, Englewood Cliffs, NJ, USA, 1992.
    [21] World Wide Web Consortium (W3C), http://www.w3c.org
    Advisor
  • Chia-Hui Chang(張嘉惠)
    Files
  • 90522059.pdf
  • approve in 2 years
    Date of Submission 2003-07-15

