
Providing Context to Web Searches:

The Use of Ontologies to Enhance Search Engine's Accuracy

Flávia A. Barros, Pedro F. Gonçalves and Thiago L.V.L. Santos

Departamento de Informática

Universidade Federal de Pernambuco - Caixa Postal 7851

50.732-970 Recife (PE) – BRAZIL

fab@di.ufpe.br , pfg@di.ufpe.br , tlvls@di.ufpe.br

Abstract This paper presents the design and state of development of a framework for the construction and use of ontologies to guide searches in the Web or in document repositories. The aim is to enhance precision and recall in information retrieval sessions through the use of a context associated with each session. For transparency and flexibility, these contexts are dynamically built by the user from the system's available ontologies. This way, the user controls the conceptual structure underlying the search process, which should mirror his/her information needs.

Via the Ontologies Manager Framework, the user is able to access an incrementally built public ontology, as well as to create private ontologies, kept in the user's local area. Concepts to compose a session's context can be selected from both public and private ontologies. Private ontologies may be proposed by the user for integration into the public ontology, which is periodically upgraded by a maintenance module.

This framework is a plug-in that can be connected to a number of search engines. The initial experiments use the BRight! (BRazilian Internet Guide in Hypertext) search engine as a testbed. The prototype is implemented in Java, for portability and reusability.

Keywords: Web search, Ontologies for Web search, Information Retrieval.

1 Introduction

The growth of electronic text collections (e.g., Digital Libraries [15, 6], the Web [30], Intranets) has strongly increased the difficulty of finding relevant documents. The size of the World-Wide Web (Web), for instance, has grown exponentially, posing new demands on current Information Retrieval (IR) techniques [1, 17, 31]. The crucial problem is to efficiently locate the best documents for a user's information need.

The majority of existing Web search engines adopt keyword-based indexing systems (e.g., AltaVista - http://www.altavista.com/, Excite! - http://www.excite.com/, HotBot - http://www.hotbot.com/). These systems are based on indexing robots which continuously retrieve Web pages to build and update a centralized Index Base (IB), which can be queried by users via keyword lists, wildcards, boolean expressions and the like. These facilities, however, do not always provide an acceptable balance between precision and recall, often responding with a good deal of irrelevant documents [7].

In addition to these basic search functionalities, several search engines have adopted Yahoo!-like concept hierarchies, which can be browsed by the user while searching (e.g., Yahoo! - http://www.yahoo.com/, HotBot, Lycos - http://www.lycos.com/, Infoseek - http://www.infoseek.com/). The IB pages are classified within one or more 'classes' in the hierarchy, and this is usually done by hand. Clearly, this approach does not scale to the current Web size and growth rate. Furthermore, the classification is static and occurs prior to the search, which may compromise the system's transparency, since the user cannot foresee in which class the relevant documents lie.

Our approach, in contrast, does not impose a fixed classification on Web pages prior to the search process. Instead, we allow for dynamic classification of pages, since it may vary according to the current user's information need. We propose the use of ontologies from which the user can select concepts to build up a context for each query session. This feature favors flexibility and transparency, two central issues in the use of software systems in general, and of interfaces in particular. The initial experiments use the BRight! (BRazilian Internet Guide in Hypertext - http://www.bright.org.br/) search engine as a testbed.

The next section presents an overview of the Ontologies Manager Framework (OMF), followed by a detailed description of the system's ontologies. Section 4 presents the OMF's architecture. Section 5 describes the prototype and illustrates its use with an example. Section 6 evaluates the precision and recall of the prototype, and section 7 briefly describes the OMF's maintenance module. Section 8 discusses related work, and section 9 presents final remarks and directions for future work.

2 The Ontologies Manager Framework: Overview

As discussed above, existing approaches to querying search engines do not seem to satisfactorily fulfill users' expectations and demands. In our approach, searches are guided by a context, associated with each query session by the user to explicitly convey the session's underlying semantics. Contexts consist of sets of words semantically related to the current information need, and are used to expand the original query in an attempt to improve its precision and recall. According to Mauldin [20], "recall is the proportion of relevant documents that are actually retrieved, and precision is the proportion of retrieved documents that are actually relevant."

Some search engines provide conceptual searches via an internal mechanism of query expansion which is opaque to the user (e.g., Excite - http://www.excite.com/). In some cases, the suggested words for expanding the query may mislead the search process. For instance, when looking for courses on cello, Excite suggested the words violin and cords, which, although related to cello, are inappropriate for that query.

To avoid this kind of misleading expansion, in our system all words added to the query are explicitly chosen by the user. Contexts are extracted from the system's ontologies: hierarchies of (general and specific) concepts. Users can browse the ontologies and select one or more concepts (represented by keywords and example URLs) to compose the current context. Once a concept has been selected, the user can either expand the query with all the keywords in the concept, or choose only some of them.

In fact, our ontologies function as providers of 'possibly' related words, suggesting (not imposing) words that may help in the search. This is particularly useful when dealing with more than one domain in the same query. For example, suppose the user wants to find post-graduate courses on cello. The user can expand the initial query 'cello AND studies' by selecting from the concept 'music' the words 'performer' and 'cellist', and from the concept 'higher education' words/expressions like 'master of arts' and 'post-graduate' (cf. Figure 1). This way, more relevant documents may be retrieved, while unrelated documents referring to other musical instruments will not be returned (a more detailed example is presented in section 5.2).
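As a rough sketch of this expansion policy (the class and method names are ours, not the actual OMF API), each selected concept contributes a group of keywords that is OR-ed internally, and the groups are then AND-ed with each other, as in the extended query of section 5.2:

    import java.util.List;

    // A sketch of context-based query expansion (illustrative names only).
    // Each inner list holds the keywords the user picked from one concept,
    // including the original query term for that concept.
    public class ContextExpansion {

        // Joins the keywords chosen from one concept into a disjunction.
        static String orGroup(List<String> keywords) {
            return "(" + String.join(" OR ", keywords) + ")";
        }

        // AND-combines one OR-group per selected concept into the final query.
        static String expand(List<List<String>> conceptGroups) {
            StringBuilder query = new StringBuilder();
            for (List<String> group : conceptGroups) {
                if (query.length() > 0) query.append(" AND ");
                query.append(orGroup(group));
            }
            return query.toString();
        }

        public static void main(String[] args) {
            // Keywords selected for 'higher education' and 'music' in the cello example.
            System.out.println(expand(List.of(
                    List.of("studies", "post-graduate", "\"master of arts\""),
                    List.of("cello", "cellist", "performer"))));
            // -> (studies OR post-graduate OR "master of arts") AND (cello OR cellist OR performer)
        }
    }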

Figure 1:
The OMF User Interface.

The context formation is supported by the Ontologies Manager Framework (OMF), a tool for the construction, upgrade, and use of ontologies. OMF maintains an incrementally built public ontology, which is shared by all of the system's users. The framework also allows the construction of private ontologies, kept by each particular user in a local area. When desired, the user can propose his/her private ontologies for addition to the public ontology, which is periodically upgraded by a maintenance module (section 7).

Via the OMF’s interface, users can browse the available ontologies and select concepts related to the current session in order to form the context. The ontologies can be constructed in any natural language, since this tool is totally independent of domain.

OMF is a plug-in that can be connected to a number of search engines (e.g., Excite, AltaVista, Yahoo!), since its final delivery is a query extended by the inclusion of keywords. Through the User Interface (section 5.1), it is possible to select the search engine to be queried (Figure 1). The default connection is set to BRight! [8, 9], a highly modular and decentralized search engine under development at the Departamento de Informática of the Universidade Federal de Pernambuco, which is being used as the testbed for the OMF.
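Since the only coupling between the OMF and a search engine is the extended query string, the plug-in arrangement can be pictured as a single interface behind which engines are registered and selected. The sketch below uses hypothetical names, not the actual OMF classes:

    import java.util.*;

    // Engines differ only in how they answer an extended query,
    // so one interface suffices.
    interface SearchEngine {
        String name();
        List<String> search(String extendedQuery, int maxHits); // result URLs
    }

    // Registry through which the user interface selects the engine to be queried.
    class EngineRegistry {
        private final Map<String, SearchEngine> engines = new LinkedHashMap<>();
        private SearchEngine current;

        void register(SearchEngine engine) {
            engines.put(engine.name(), engine);
            if (current == null) current = engine; // first registered is the default (BRight!)
        }

        void select(String name) {
            current = engines.getOrDefault(name, current);
        }

        List<String> run(String extendedQuery, int maxHits) {
            return current.search(extendedQuery, maxHits);
        }
    }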

The next section presents a detailed description of the system’s ontologies.

3 The Ontologies

The literature contrasts two research areas in ontology building: (1) the empirical work on Ontological Engineering, which defines categories and relationships in the domain being represented; and (2) the more abstract work on Ontological Theorizing, which aims to classify all existing objects in the world, as well as aspects (e.g., time, space, causality) [12, 27].

Following the work on Ontological Engineering, we focus on domain ontologies. Our domain, however, is unrestricted (the Web) and would require a broad-coverage general ontology (such as SENSUS [24] or Cyc [14]).

Nevertheless, our aim here is not to construct such an ontology; instead, we present the user with a high-level broad hierarchy of concepts to guide searches, and allow the construction of new private ontologies, which will be a specialization of some higher concept appearing in the public ontology.

3.1 Ontologies Structure

The system's ontologies (hierarchies of concepts) consist of directed (possibly cyclic) graphs where nodes represent concepts and arcs establish loose semantic relationships between these concepts. By 'loose' we mean that child nodes can hold different kinds of relationships to the parent node (e.g., specialization, part-of, group, etc.).

There may be cases where an arc holds no type - for instance, in Figure 1, the arc connecting the parent node 'education' to the node 'employment' has no associated type. In fact, what we have here is a compound concept: 'employment in education'. For this reason, arc labels are not indicated in the ontologies.

Nodes in the ontologies are named, and the concept represented by each node is characterized by keywords and example URLs, the node label being also considered as a keyword (cf. Figure 1).

The public ontology has a single root node (representing the 'all' concept), which may have as many children as desired. The first level of nodes contains general category concepts, which serve as a classification for the other concepts in the ontology. The deeper levels can hold either category nodes or nodes representing basic concepts (concrete or abstract). A leaf node usually (but not necessarily) represents a basic concept.

Private ontologies may have any concept as root node. Apart from that, they bear the same structure as the public ontology.

Any concept can be related to more than one other concept (e.g., the concept 'employment' can relate to several different areas - education, music, management, etc.). In such cases, there is no need to create a different node for each parent, since our nodes do not hold Web page addresses, but keywords and example URLs used as hints for query expansion.

Instead of creating different nodes for the same concept, all parent nodes can hold a link to the node representing that concept. This treatment introduces cycles in the structure, which will not create loops because the links are directed. This situation is represented in the implementation by the sign #, as can be seen in Figure 1.
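This structure can be captured by a small node type. The sketch below is our own naming, not the prototype's actual Java classes; multi-parent concepts are represented by shared references rather than copies:

    import java.util.*;

    // A node of an ontology: a named concept described by keywords and example
    // URLs. Children are held by reference, so a concept such as 'employment'
    // can hang under several parents (the '#' links of Figure 1) without being
    // duplicated - which is precisely what introduces cycles in the graph.
    class OntologyNode {
        final String label;                                  // the label also counts as a keyword
        final Set<String> keywords = new LinkedHashSet<>();
        final Set<String> exampleUrls = new LinkedHashSet<>();
        final List<OntologyNode> children = new ArrayList<>();

        OntologyNode(String label) {
            this.label = label;
            keywords.add(label);
        }

        // Linking an existing node (instead of copying it) gives it one more parent.
        void addChild(OntologyNode child) {
            children.add(child);
        }
    }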

3.2 Rationale

In our approach, concepts represented by ontologies’ nodes are described by ‘semantically’ related words. Conceptually speaking, we can think of three classes of keywords composing a concept:

(1.) Synonyms (e.g., clock and watch).

(2.) Words which restrict (or are a specialization of) the node’s concept - including sub-concepts of the node or words which are particular to that concept (e.g., football in relation to field sports).

(3.) Words which expand (generalize) the concept - including super-concepts of the node's concept (e.g., tennis championships in relation to Roland Garros). A concept's child nodes can be viewed in the same way as its keywords.

Words of class (1) are expected to improve the query's recall. However, perfect synonyms are very hard to find (what would be a proper synonym for stock exchange?). This is why simply using a thesaurus to expand queries does not always improve their effectiveness. Words of class (2) improve a query's precision when used in conjunction with the original query. Finally, words of class (3) improve recall when used in disjunction.
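As a constructed illustration (ours, not drawn from the system's ontology), consider the query 'football AND results':

    (football OR soccer) AND results             -- class (1): synonym OR-ed in; recall rises
    football AND results AND championship        -- class (2): restriction AND-ed in; precision rises
    (football OR "field sports") AND results     -- class (3): generalization OR-ed in; recall rises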

Allowing words of these three classes in the ontologies’ nodes does not introduce contradictions in the concept’s definition, since the user has complete freedom to select words individually from the node’s list of keywords (as seen in section 2).

By offering the possibility of invoking one concept via several (not necessarily synonymous) individually accessible words, OMF addresses two linguistic phenomena that strongly degrade a query's precision and recall: polysemy (one word with different meanings, e.g., bank - an establishment for keeping money, or a river side) and synonymy (several terms - words or phrases - designating the same concept, e.g., disable, handicap, incapacitate).

Polysemy reduces precision, since the words chosen to compose the query may have more than one meaning, which may cause the retrieval of many irrelevant documents. Synonymy reduces recall, since the term chosen by the user to represent the desired concept may not be the one appearing in Web pages (e.g., disable standing for handicap). Synonymy can be addressed by the use of a thesaurus (e.g., WordNet - http://www.cogsci.princeton.edu/~wn/). This solution improves recall, but certainly degrades precision (since a great quantity of words will take part in the query).

4 OMF’s Architecture

4.1 Defining Baseline Criteria

In the construction and use of Artificial Intelligence systems, efforts have been made in the direction of knowledge sharing through the use of libraries of ontologies (e.g., Ontolingua [11]). As will be seen in section 8, IR systems (including search engines) benefit from these ontologies as well (e.g., SHOE [18] and GDA [26] systems). Two central goals here are modularity and reusability of components to lower the construction costs of such systems.

Despite the benefits offered, the growth of such ontology libraries adds difficulty to the automatic selection of the appropriate ontologies (and concepts) to describe the domain of a knowledge-based system, or to guide a particular Web search (as commented in section 2).

One possible way out is to leave this choice in the users' hands, conferring transparency to the search process as a whole. That is, the user is responsible for selecting, from the available ontologies, the concepts (and keywords) which characterize his/her information need and will therefore form the query's context. This is a step beyond the mechanisms available in the majority of existing search engines.

Still, we must consider situations where none of the existing ontologies fits the user’s needs. In such cases, a flexible system must provide a mechanism for the user to create his/her own ontologies.

4.2 OMF’s Basic Components

In an attempt to bring together desirable features in systems development, we designed a modular, portable framework which can be plugged into different search engines and other IR systems (such as Digital Libraries), since its basic delivery is a list of keywords (the query's context).

We provide a reusable repository of (public and private) ontologies, which are used in a transparent way, since the user is able to visualize the existing ontologies and select keywords to guide a particular search. Flexibility is guaranteed by the possibility of creating new ontologies and by providing an easy way to extend queries with new concepts (keywords) extracted from ontologies.

The Ontologies Manager Framework consists of the OMF User Interface, the Ontologies Maintainer Module (OMM) and the public and private ontologies (Figure 2).

Figure 2:
The Ontologies Manager Framework

The user interface offers access to the public ontology and to the existing private ontologies for the construction of the session's context. Through this interface, the query is passed on to the selected search engine, and the list of retrieved URLs is presented to the user for subsequent browsing (section 5.1).

The OMM manages the system’s ontologies, accounting for their initial creation and subsequent upgrades (section 7).

5 The Prototype

The first OMF prototype is already available, comprising the public ontology (section 3) and the user interface (described below). The OMM is still under development, and is our current focus of work (section 7).

The system is being implemented in Java, for portability and reusability. The prototype runs both on Windows NT and Unix platforms, and it uses the ODBC [23] interface to provide for database independence.

5.1 The OMF User Interface

The facilities offered by the OMF are accessible via the user interface (cf. Figure 3). Before starting a query session, the user must choose one search engine to be queried (the default system is BRight!).

Figure 3:
Example of the Use of Ontologies.

The interface offers the following options:

(1) selection of which public and/or private ontology will be used in the current query;

(2) selection of keywords to form the current context (using the boolean operators AND or OR);

(3) browsing the retrieved URLs, which can be viewed up to a user-defined limit of between 10 and 100;

(4) modification of the current context in order to improve the session’s precision and recall.

Each context remains active until a new one is selected by the user.

The decision to provide only the two boolean operators AND and OR in the interface is based on the assumption that ordinary users do not make use of more sophisticated facilities. Studies of users' behavior when dealing with IR systems (including search engines) point in this direction [21].

5.2 Example of the System in Use

Figure 3 shows an example of the use of ontologies to enhance the precision and recall of a query session. Here we have a query in Portuguese, illustrating the system's flexibility in dealing with different natural languages. As said before, to query a search engine in any chosen language, the user just has to select the appropriate ontologies (since OMF is independent of domain and language).

The search engine selected in the example in Figure 3 is BRight!, which maintains two different IBs: one that indexes only pages within Brazil (which are therefore likely to be written in Portuguese), and another that indexes the whole Web.

In this example, the user is looking for information about a master's degree in Artificial Intelligence. The original query was 'mestrado AND (inteligência AND artificial)'. The system returned 22 URLs, of which 18 were relevant. The user then browsed and selected keywords from two different ontology branches.

From the node 'Educação' (Education), the user selected the concept 'pós-graduação' (post-graduate course) and the keywords 'pós-graduação', 'mestrado' (masters) and 'pesquisa' (research). These words were added to the query using the operator OR.

Again, from the node ‘Educação’, the user selected the concept ‘Ciência da Computação’ (Computer Science), and from there the keywords/terms ‘Inteligência Computacional’ (Computational Intelligence) and ‘Sistemas Especialistas’ (Expert Systems).

The extended query was then:

‘(mestrado OR pesquisa OR pós-graduação) AND

((inteligência AND artificial) OR (inteligência AND computacional) OR (sistemas AND especialistas))’

The extended query retrieved 56 pages, of which 50 were relevant.

The achieved results reveal an improvement in the efficacy of the query as a whole: the overall number of documents retrieved more than doubled (from 22 to 56), while the proportion of relevant documents grew by about 8 percentage points.
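In figures, from the counts above:

$$precision_{plain} = \frac{18}{22} \approx 0.818 \qquad precision_{extended} = \frac{50}{56} \approx 0.893$$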

6 Evaluation of Precision and Recall

This section presents a statistical evaluation of the performance of the OMF using the Brazilian BRight! IB.

Recall and precision are two inherently conflicting metrics. That is, when an IR system is enhanced to increase precision, recall generally degrades, and vice-versa. A more convenient metric to evaluate the overall performance of an IR system is the F-Measure [28, 10], given by:

$$F = \frac{2 \times precision \times recall}{precision + recall}$$

Our goal is to improve the system's F-measure, bringing it as close as possible to 1. Note that the F-measure approaches 1 only when precision and recall approach 1 simultaneously; when only one of these indicators goes to 1 (and the other goes to 0), the F-measure goes to 0.

The size of the Web was recently estimated at about 320 million pages, occupying 1.5 Terabytes [17]. With these figures, the Web exceeds by orders of magnitude the document collections used for research in Information Retrieval, "which have recently reached 7.5 million documents in 20 Gb for the Very Large Corpus track" [29].

Due to the Web's size and dynamics, it is infeasible to assess retrieval quality by the standard recall measures since, for each query, every Web document would have to be inspected by a human and marked relevant or irrelevant. Instead, relative recall can be used [25, 10]. The idea behind relative recall is that a set of queries about a given topic is considered as a whole, and all the relevant documents retrieved for this set of queries are regarded as "the set of existing relevant documents in the collection". This way, it is only necessary to inspect the documents retrieved for the referred set, instead of the whole collection.

6.1 The Evaluation Framework

The evaluation framework consists of the following steps:

(1) Write Information Needs: Have a group of Web users individually write a set of Information Needs (IN). Each individual information need in in the IN set is a natural-language description of the contents of documents that should be considered relevant for that in.

(2) Write queries: For each in, have a user write a set of queries Q(in) to be submitted to the given search engine, in order to try to find relevant documents for that in. We call these the 'plain queries'.

(3) Associate contexts to queries: For each information need in, have a user browse the available ontologies and select the node(s) that best represent the context of the in. All queries Q(in) are then automatically expanded on the basis of this context. We call these the 'extended queries'.

(4) Perform two batch runs: (a) First, run all the plain queries Q(in) for each in ∈ IN, producing a data structure which relates queries to Web pages: Hits(q,p,run) means that page p is a hit for query q in the given run. This is the control run. (b) Then, run all the extended queries, producing the same data structure as above. This is the test run.

(5) Limit the number of hits per query: The number of hits per query is limited to L, under the assumptions that users do not usually check many hits, and that the search engine ranks hits, showing the most relevant first. In our initial experiment, L is set to 20.

(6) Evaluate hits for relevance: Over both runs, for each information need in, produce a data structure Eval(in,p,rel) of all the pages p returned by the search engine in Hits(q,p,run) for all the queries q associated to in. For each entry in Eval(in,p,rel), have a user inspect the associated page p and set its rel attribute to true or false, according to his/her evaluation of whether page p is relevant for the associated in.

(7) Compute the relative recall for each query in each run: For each Information Need in, take all the queries q in Q(in). Two values of relative recall are to be computed for each of these queries q, one for each run. We call these values rr(q,in,run) and compute them as follows:

$$rr(q,in,run) = \frac{|\{p : Hits(q,p,run) \wedge Eval(in,p,true)\}|}{|\{p : Eval(in,p,true)\}|}$$

The numerator (based on Hits) counts all the hits for the query, Information Need and run at hand that have been judged relevant (i.e., the number of relevant documents for in that were retrieved by the system). The denominator (based on Eval) counts all the known pages in the database that have been considered relevant to the Information Need at hand, regardless of which query or which run retrieved them.

rr(q,in,run) thus computes relative recall: the number of relevant documents retrieved by this query, over the number of all known relevant documents for this Information Need.

(8) Compute the cut-off precision for each query in each run: As in (7) above, compute

$$cp(q,in,run) = \frac{|\{p : Hits(q,p,run) \wedge Eval(in,p,true)\}|}{|\{p : Hits(q,p,run)\}|}$$

cp(q,in,run) computes the cut-off precision: the number of relevant documents retrieved by query q, over the number of all documents retrieved by the same query. The expression 'cut-off precision' emphasizes that we compute precision not over all the documents retrieved by each query, but over these documents up to the limit L, as stated in (5).

(9) Compute the F-measure for each query in each run: As in (7) and (8) above, compute

$$fm = \frac{2 \times rr \times cp}{rr + cp}$$

For clarity, we have omitted the parameter list (q,in,run) on the right-hand side of the formula. (A sketch of the computations in steps (7)-(9) appears right after this list.)

(10) Compute overall statistical measures for the experiment: For each of the control variables used (i.e., rr, cp and fm), compute their mean, median, maximum, minimum, and standard deviation over all queries, for each run, and compare the results across the runs.
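The computations of steps (7)-(9) reduce to simple set arithmetic over the Hits and Eval structures. Below is a minimal sketch (our own data layout and names, not the experiment's actual code), where hits holds the pages retrieved by one query in one run (already cut off at L) and relevant holds every page judged relevant for the corresponding Information Need:

    import java.util.*;

    class EvalMetrics {

        // rr: relevant pages retrieved by this query / all known relevant pages for the in.
        static double relativeRecall(Set<String> hits, Set<String> relevant) {
            return relevant.isEmpty() ? 0.0
                    : (double) intersection(hits, relevant).size() / relevant.size();
        }

        // cp: relevant pages retrieved by this query / all pages it retrieved (cut off at L).
        static double cutOffPrecision(Set<String> hits, Set<String> relevant) {
            return hits.isEmpty() ? 0.0
                    : (double) intersection(hits, relevant).size() / hits.size();
        }

        // fm: harmonic mean of relative recall and cut-off precision.
        static double fMeasure(double rr, double cp) {
            return (rr + cp == 0.0) ? 0.0 : 2 * rr * cp / (rr + cp);
        }

        private static Set<String> intersection(Set<String> a, Set<String> b) {
            Set<String> result = new HashSet<>(a);
            result.retainAll(b);
            return result;
        }

        public static void main(String[] args) {
            Set<String> hits = Set.of("p1", "p2", "p3", "p4");   // retrieved by one query, one run
            Set<String> relevant = Set.of("p1", "p2", "p5");     // judged relevant for the in
            double rr = relativeRecall(hits, relevant);          // 2/3
            double cp = cutOffPrecision(hits, relevant);         // 2/4
            System.out.printf("rr=%.3f cp=%.3f fm=%.3f%n", rr, cp, fMeasure(rr, cp));
        }
    }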

6.2 Experimental Results

Following the experimental framework described above, we had 7 users write 42 INs comprising a total of 254 plain queries. Each of these INs was then associated with one or more ontology nodes to generate the corresponding expanded queries. Both query sets were run against a database of 108,000 pages arbitrarily sampled from the Brazilian Web by the BRight! crawler.

Table 1 depicts the results for Precision, Relative Recall and F-Measure, for both plain and extended queries.

                        Avg. Precision   Avg. Rel. Recall   Avg. F-Measure   Std. Dev. F-Measure
  Plain queries              0.230             0.190             0.168              0.197
  Expanded queries           0.265             0.216             0.198              0.203
  Expanded vs. plain       +15.17%           +13.57%           +17.84%               ---

Table 1: Experimental results for Precision, Relative Recall, F-Measure, and the Standard Deviation of the F-Measure.

The measures presented in Table 1 show a consistent performance improvement of the system from the use of ontologies to expand queries. Although Recall and Precision are generally conflicting metrics, as discussed at the beginning of this section, the application of the proposed technique caused them to grow simultaneously on average, which is also reflected in the 17.84% increase in F-Measure.

There was wide variability in the results for individual queries, which produced considerably high Standard Deviation (SD) figures for all the measures. To illustrate this fact, Table 1 includes the SD results for the F-Measure. In the worst case of this variability, two of the INs had both Precision and Relative Recall equal to zero for the plain queries, which might not happen with a larger database. We decided to keep these INs in the sample for methodological soundness.

To investigate the source of this variability, we created two partitioned versions of the IN set, according to 'information content sought' and 'query/IN author'. The former partition grouped together INs that sought information on the same broad subject category, while the latter grouped the INs by author.

The motivation for the first partition was to look for evidence that certain kinds of information sought were not represented in the database, which would produce result discrepancies that should not be interpreted as inherent to the method being evaluated. For the second partition, we intended to check whether there were 'good query writers' versus 'not so good ones', which could also introduce discrepancies not necessarily correlated to the method under evaluation.

These computations, however, produced results very similar to the ones presented in Table 1, which means that the high variability attested by the SD is, in fact, spread across both partitions. Therefore, we conclude that the results presented are independent of the specific behaviour of the subjects involved in the experiment, and also independent of specific information content classes.

7 The Ontologies Maintainer Module

The Ontologies Maintainer Module (OMM) manages both the private ontologies and the shared public ontology. Concerning the public ontology, the OMM is responsible for:

(1) Allowing the creation of this ontology by the system’s administrator through the OMM Interface.

(2) The upgrade of this structure (inclusion and exclusion of nodes and sub-ontologies): Ontologies proposed by users for integration into the public ontology will be evaluated against existing nodes, in order to judge their adequacy. This is a semi-automatic process, subdivided into two steps: (a) automatically searching for similar concepts in the public ontology (see the sketch after this list), and (b) proposing the inclusion or merging of the private (sub-)ontology to the system's administrator.

(3) Keeping it balanced by periodical upgrades: The public ontology will be periodically evaluated on the basis of access rates associated with each of its nodes. These rates will be used to signal possible distortions in the ontology's structure.

(4) Broadcasting to all of the system's users any upgrade undertaken in the Public Ontology.
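The similarity search of step (2a) is not fully specified here, so the sketch below rests on an assumption of ours: that the similarity between a proposed private node and a public node can be approximated by the Jaccard overlap of their keyword sets, with high-scoring pairs flagged for the administrator's review. It reuses the OntologyNode sketch of section 3.1:

    import java.util.*;

    // A sketch of step (2a) under an assumed similarity measure (Jaccard
    // overlap of keyword sets); the actual OMM matching algorithm may differ.
    class ConceptMatcher {

        // Size of the intersection over the size of the union of the keyword sets.
        static double jaccard(Set<String> a, Set<String> b) {
            Set<String> union = new HashSet<>(a);
            union.addAll(b);
            if (union.isEmpty()) return 0.0;
            Set<String> inter = new HashSet<>(a);
            inter.retainAll(b);
            return (double) inter.size() / union.size();
        }

        // Public nodes similar enough to the proposed node to merit the
        // administrator's review for inclusion or merging.
        static List<OntologyNode> candidates(OntologyNode proposed,
                                             Collection<OntologyNode> publicNodes,
                                             double threshold) {
            List<OntologyNode> out = new ArrayList<>();
            for (OntologyNode node : publicNodes) {
                if (jaccard(proposed.keywords, node.keywords) >= threshold) {
                    out.add(node);
                }
            }
            return out;
        }
    }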

The OMM offers access to the private ontologies as well. Via its interface, the user can manipulate (modify, delete or create) his/her own private ontologies. To facilitate the creation process, the user will be able to select parts of the public ontology to be copied into his/her private area for further modification.

8 Related Work

In section 1, we discussed some of the drawbacks of adopting static classification of Web pages according to some concepts hierarchy. In this section, we present alternative attempts to enhance precision where the users themselves classify their pages, as well as work done in automatic classification of information sources.

A more ambitious way of classifying pages by hand is to annotate Web pages with special HTML tags which convey the page's classification - e.g., the SHOE (Simple HTML Ontology Extension) [18, 19] and GDA (Global Document Annotation) [26] systems. The user determines the page's classification based on some available ontology, indicated in the page's heading.

Two major drawbacks can be identified here: (1) the index bases are constrained to pages marked with each system's special tags; and (2) so far, there is no agreement upon one universally accepted set of tags. Yet, even if the use of a consensus set of tags were to become common practice, restrictions would remain. In the SHOE system, for instance, the user must select a concept from some available ontology to guide the search. Only documents classified under the chosen ontology and concept can be retrieved, limiting the space of searchable documents even further.

Limitations (1) and (2) above could be overcome, for instance, with the automatic classification of Web pages. The investigation of automatic ways to categorize text within a given (hierarchical or flat) scheme has been an active field of research both in the IR and in the Artificial Intelligence communities. A wide variety of learning techniques has been deployed in attempts to generate classifiers based on a given training corpus [22, 3, 4, 16].

Convectis (Context Vector Technology) SelectCast for Content is one example of such systems [22]. Adaptive neural network algorithms are used in the learning and classification process. This tool is commercialized by Aptex Software [2], and has been plugged into the Infoseek search engine [13].

The Pharos system [3, 4] is a distributed architecture for locating online heterogeneous information sources (traditional DBs, and semi-structured or unstructured collections such as Web sites, FTP files, etc.), which are automatically classified within several (independent) hierarchical classification schemes. The system uses Latent Semantic Indexing (LSI) [5] in the classification process.

Although it works with heterogeneous sources of information, Pharos does not aim to index the whole Web, as our application demands. Furthermore, it classifies and selects document sources, instead of actual documents. Our approach, in contrast, searches for documents in the entire Web.

Finally, Koller and Sahami [16] advocate the construction of a hierarchy of classifiers, instead of a single huge one. Each node of the hierarchy runs a classifier, which allows the classification problem to be partitioned into smaller, more manageable ones. Their results show that this yields faster and more accurate classification than the single flat classifier counterpart.

This is thus a good example of an approach to classification that profits from an existing conceptual hierarchy, provided there is an adequately sized corpus classified according to that hierarchy (such that classifiers can be successfully trained). However, this approach is not directly applicable to the problem of searching for information on the Web, nor is this the objective of its authors. We refer to it here as a promising application of automatic classification techniques to concept hierarchy structures, such as the ontologies we deploy in our framework.

Our hierarchies, in contrast, are not intended to categorize pages beforehand, regardless of the classification method used. Instead, our ontologies provide users with flexible ways to expand their queries dynamically through the addition of the query context. Our approach is less restrictive, since it recognizes that user consultations are directed at the whole Web content, and it provides a mechanism to enhance recall/precision performance by adding context to queries.

9 Final Remarks and Future Work

We presented here the design and state of development of the Ontologies Manager Framework, a tool for the construction and use of ontologies to guide searches in the Web and in document repositories. The aim is to improve the precision and recall of consultations through the use of a context associated with each query session. This approach contrasts with the practice of classifying Web pages into hierarchies beforehand, which disregards the dynamic nature of Web searches.

We showed the flexibility offered by the system in a number of ways: (1) the possibility of selecting different search engines (or other IR systems) to be queried; (2) the ease of building contexts dynamically by selecting keywords from the system's ontologies; (3) the possibility of building private ontologies to conform to the user's particular needs; (4) the possibility of adding/merging private ontologies into the public ontology; and (5) independence of the natural language used, since swapping languages consists only in accessing a public ontology written in the desired language. Another strong point of the system's architecture is its modularity, which favors portability and reusability. The system's transparency is guaranteed by the decision to leave in the user's hands the control over the conceptual basis of the consultation (i.e., the session's context).

The experimental results revealed an improvement of 17.84% in the F-measure from the expansion of queries with ontology nodes. Our tool works efficiently as a provider of keywords for query expansion, which is very useful for casual users who do not have a clear idea of what to look for and have difficulty expanding the basic original query.

Our current work focuses on the completion of the OMM (section 7). The OMM interface already allows for the creation and manual upgrade of the public ontology. Our next steps are the development of: (1) a mechanism to allow the creation and manipulation of private ontologies by users; (2) the algorithms for the automatic search for similar nodes between the public ontology and some proposed sub-ontology; and (3) the algorithms to evaluate the balancing of this ontology.

References

[1] M. Bowman, M. Schwartz. Report of the Distributed Indexing/Searching Workshop. Sponsored by the World Wide Web Consortium. Cambridge, MA, 1996. http://www.w3.org/Search/9605-Indexing-Workshop/

[2] W.R. Caid, P. Oing. System and method of context vector generation and retrieval. United States Patent 5,619,709, US Patent and Trademark Office, Apr. 8, 1997. http://patents.uspto.gov/cgibin/ifetch4?INDEX+PATBBALL+0+15314+0+5+30748+OF+1+1+1+5%2c619%2c709

[3] R. Dolin, D. Agrawal, A. El Abbadi, L. Dillon. Pharos: A Scalable Distributed Architecture for Locating Heterogeneous Information Sources. In Proceedings of the Sixth International Conference on Information and Knowledge Management (CIKM'97). Nevada, pages 165-169, 1997. http://pharos.alexandria.ucsb.edu/publications/

[4] R. Dolin, D. Agrawal, A. El Abbadi, J. Pearlman. Using Automated Classification for Summarizing and Selecting Heterogeneous Information Sources. In D-Lib Magazine, 1998. http://www.dlib.org/dlib/january98/dolin/01dolin.html

[5] S. Dumais. Improving the retrieval of information from external sources. Behavior Research Methods, Instruments & Computers. 23(2): 229-236, 1991.

[6] E.A. Fox. Rethinking Libraries in the Information Age: Lessons Learned with Five Digital Library Projects. School of Information & Library Science, UNC Chapel Hill, 1996. http://fox.cs.vt.edu/talks/UNC96/

[7] W.B. Frakes. Introduction to Information Storage and Retrieval Systems. In Frakes, W.B. and Baeza-Yates, R. (eds.), Introduction to Information Retrieval - Data Structures and Algorithms. Prentice-Hall, New Jersey, pages 1-12, 1992.

[8] P.F. Gonçalves, A.C. Salgado, S.L. Meira. Digital Neighbourhoods: Partitioning the Web for Information Indexing and Searching. In Olivè, A., Pastor,J.A. (Eds.) Advanced Information Systems Engineering. 9th International Conference (CAiSE’97). Barcelona, Spain, June 1997. Springer Verlag, Lecture Notes in Computer Science 1250: 289-302. http://www.bright.org.br/ .

[9] P.F. Gonçalves, S. Meira, A.C. Salgado. A Distributed Mobile-Code based Architecture for Information Indexing, Searching and Retrieval in the World-Wide Web. In Proceedings of the 7th Annual Conference of the Internet Society (INET’97). Malaysia, pages 24-27, 1997. http://www.isoc.org/inet97/proceedings/A7/A7_2.HTM

[10] P.F. Gonçalves, J. Robin, T.L.V.L. Santos, O. Miranda, S.L. Meira. Measuring the Effect of Centroid Size on Web Search Precision and Recall. In Proceedings 8th Annual Conference of the Internet Society (INET’98). Geneva, Switzerland, July, 1998. http://www.isoc.org/inet98/proceedings/1x/1x_8.htm.

[11] T.R. Gruber. A Translation Approach to Portable Ontology Specifications. In Knowledge Acquisition, 5(2): 199-220, 1993. http://www-ksl.stanford.edu/people/gruber/publications.html

[12] T.R. Gruber. Toward principles for the Design of Ontologies Used for Knowledge Sharing. In N. Guarino and R. Poli (eds) Formal Ontology in Conceptual Analysis and Knowledge Representation. Kluwer, 1994. http://www-ksl.stanford.edu/people/gruber/publications.html

[13] Infoseek. Aptex categorizes more than 700,000 web sites for infoseek. Press Release. 1996. http://www.infoseek.com/doc/PressReleases/hnc.html

[14] D. Lenat, R. Guha. Building Large Knowledge-Based Systems: Representation and Inference in the Cyc Project. Addison-Wesley, Reading, MA, 1990.

[15] C. Kacmar (Ed.) SIGLINK Newsletter Electronic Supplement. Special Issue on Digital Libraries, September, 4(2), 1995.

[16] D. Koller, M. Sahami. Hierarchically Classifying Documents Using Very Few Words. In Proceedings of the 14th International Conference on Machine Learning (ICML-97). Vanderbilt University, Nashville, TN, pages 170-178, 1997. http://robotics.stanford.edu/~koller/papers/ml97.html

[17] S. Lawrence, C.L. Giles. Searching the World Wide Web. In Science. 280: 98-100. 1998. http://www.sciencemag.org

[18] S. Luke, L. Spector, D. Rager. Ontology-based Knowledge Discovery on the World-Wide Web. In Proceedings of the Workshop on Internet-based Information Systems/AAAI-96. Portland, Oregon, USA. 1996. http://www.cs.umd.edu/projects/plus/SHOE

[19] S. Luke, L. Spector, D. Rager, J. Hendler. Ontology-based Web Agents. In Proceedings of the First International Conference on Autonomous Agents (AA-97). 1997.

[20] M. Mauldin. Retrieval Performance in FERRET: A Conceptual Information Retrieval System. In 14th Conference on Research and Development in Information Retrieval (ACM-SIGIR). Chicago, USA. October 1991.

[21] A. Pollock, A. Hockley. What’s Wrong with Internet Searching. In D-Lib Magazine. March 1997. http://www.dlib.org/dlib/march97/bt/03pollock.html

[22] SelectCast for Content (Convectis) Homepage. 1998. http://www.aptex.com/products-convectis.htm http://www.aptex.com/product_sc_content_brief.htm

[23] R. Signore, J. Creamer, M.O. Stegman. The ODBC Solution - Open Database Connectivity in Distributed Environments. McGraw-Hill, New York, 1994.

[24] B. Swartout, R. Patil, K. Knight, T. Russ. Towards Distributed Use of Large-Scale Ontologies. In Tenth Knowledge Acquisition for Knowledge-Based Systems Workshop (KAW'96). Alberta, Canada, November 1996.

[25] J. Tague-Sutcliffe. The pragmatics of information retrieval experimentation, revisited. In Information Processing and Management. 28: 467-490. Elsevier, Oxford, UK, 1992.

[26] M. Utiyama, K. Hasida. Bottom-Up Alignment of Ontologies. In International Joint Conference on Artificial Intelligence (IJCAI-97); Workshop EP24 - Ontologies and Multilingual NLP. Nagoya, Japan. August 23, 1997. http://www.etl.go.jp/etl/nl/GDA

[27] A. Valente, J. Breuker. Towards Principled Core Ontologies. In Tenth Knowledge Acquisition for Knowledge-Based Systems Workshop by Track(KAW’96). Alberta, Canada. November, 1996.

[28] K. Van Rijsbergen. Information Retrieval (2nd edition). Butterworths, London, 1979. http://www.dcs.gla.ac.uk/Keith/Preface.htm

[29] E. Voorhees, D. Harman. Overview of the Sixth Text Retrieval Conference (TREC-6). In Proceedings of TREC-6. National Institute of Standards and Technology (NIST), Gaithersburg, MD, November 1997. http://trec.nist.gov/pubs/trec6/t6_proceedings.html

[30] W3C Homepage. http://www.w3.org/

[31] B. Yuwono, D.L. Lee. Searching and Ranking Algorithms for Locating Resources on the World Wide Web. In Proceedings of the 12th International Conference on Data Engineering. New Orleans, pages 164-171, 1996. http://www.cs.ust.hk/faculty/dlee/Papers/www/icde96-www.ps.gz
