
Attribute match discovery in information integration: exploiting multiple facets of metadata

Abstract

Automating semantic matching of attributes for the purpose of information integration is challenging, and the dynamics of the Web further exacerbate this problem. Believing that many facets of metadata can contribute to a resolution, we present a framework for multifaceted exploitation of metadata in which we gather information about potential matches from various facets of metadata and combine this information to generate and place confidence values on potential attribute matches. To make the framework apply in the highly dynamic Web environment, we base our process on machine learning when sufficient applicable data is available and base it otherwise on empirically observed rules. Experiments we have conducted are encouraging, showing that when the combination of facets converges as expected, the results are highly reliable.

Keywords: semantic attribute matching; information integration; exploitation of metadata
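The abstract describes the framework only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes each facet yields a confidence in [0, 1] for a candidate attribute pair, that a learned scorer is preferred when one is available and a hand-written rule is used otherwise, and that the per-facet confidences are merged by a weighted average. The names `name_rule`, `pick_scorer`, and `combine`, the facet labels, and the weighted-average combination are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, Optional

# Hypothetical facet scorer: maps (source attribute, target attribute)
# to a confidence value in [0, 1].
FacetScorer = Callable[[str, str], float]


def name_rule(source: str, target: str) -> float:
    """Illustrative hand-written rule: Jaccard overlap of the
    underscore-separated tokens in the two attribute names."""
    a = set(source.lower().split("_"))
    b = set(target.lower().split("_"))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_scorer(learned: Optional[FacetScorer], rule: FacetScorer) -> FacetScorer:
    """Prefer a learned scorer when one exists; otherwise fall back to the rule.
    (Assumption: 'sufficient applicable data' is modeled as 'a learned
    scorer was supplied'.)"""
    return learned if learned is not None else rule


def combine(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Merge per-facet confidences with a weighted average (an assumed
    combination scheme, not one taken from the paper)."""
    total = sum(weights[facet] for facet in scores)
    if total == 0:
        return 0.0
    return sum(weights[facet] * score for facet, score in scores.items()) / total


# Usage: no learned model is available for the name facet, so the rule is used.
scorer = pick_scorer(None, name_rule)
facet_scores = {
    "names": scorer("phone_number", "phone"),
    "values": 1.0,  # placeholder confidence from a hypothetical data-value facet
}
confidence = combine(facet_scores, {"names": 1.0, "values": 1.0})
```

With equal weights the combined confidence is simply the mean of the facet scores. Unequal weights would let a facet that is known to be more reliable in a given domain dominate the result.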


Full text available only in PDF format

ARTICLES


David W. Embley; David Jackman; Li Xu*

* Supported in part by the National Science Foundation under grants IIS-0083127

Department of Computer Science, Brigham Young University, Provo, Utah 84602, U.S.A. Voice: (801) 422-3027; Fax: (801) 422-0169. embley@cs.byu.edu, djackman@nextpage.com, lx@cs.byu.edu


  • Publication Dates

    • Publication in this collection
      14 Sept 2004
    • Date of issue
      Nov 2002
    Sociedade Brasileira de Computação - UFRGS, Av. Bento Gonçalves 9500, B. Agronomia, Caixa Postal 15064, 91501-970 Porto Alegre, RS - Brazil, Tel. / Fax: (55 51) 316.6835 - Campinas - SP - Brazil
    E-mail: jbcs@icmc.sc.usp.br