SciELO - Scientific Electronic Library Online

 


Journal of the Brazilian Computer Society

Print version ISSN 0104-6500
On-line version ISSN 1678-4804

J. Braz. Comp. Soc. vol. 5 n. 2 Campinas Nov. 1998

https://doi.org/10.1590/S0104-65001998000300006 

Signature Cache:
A Light Weight Web Cache Indexing Structure

 

Yuping Yang and Mukesh Singhal
Department of Computer and Information Science
The Ohio State University
2015 Neil Avenue,
Columbus, OH, 43210-1277, USA
yangy@cis.ohio-state.edu , singhal@cis.ohio-state.edu

 

 

Abstract: The current trend in Web cache research is to have Web caches share their contents to improve the hit ratio. High-performance Web cache sharing requires access indexes that let Web caches reference each other. The challenges in designing access indexes for Web cache sharing are the huge size and dynamic nature of Web cache contents and the need for high access speed.
A recently proposed summary cache scheme [14] uses relatively small indexes for sharing Web caches to reference each other. We improve on the summary cache scheme and propose a signature cache scheme. Instead of "repairing" existing access indexes, the signature cache scheme builds new indexes to accommodate changes of Web cache contents. This scheme simplifies the maintenance of the indexes and significantly reduces the size of counters. Optionally, the size of the index can be further reduced by a semi-distributed index sharing mode at the cost of slightly increased response time. These improvements result in an orders-of-magnitude reduction in the index size compared to the summary cache scheme.
Keywords: Web cache, signature, distributed, index, sharing.

 

1 Introduction

Web accesses often cross countries or even continents. A way to reduce network traffic and shorten the perceived response time of Web accesses is to cache frequently used Web pages in Web caches that are physically close to users. Large Web caches within Web proxies have been effectively used to reduce the network traffic [11]. The hit ratio of Web caches is critical to the Web access response time seen by end users and sharing Web cache contents is a way to boost the overall hit ratio for a group of Web caches.

Web caches can be shared in a straightforward way. When a Web page requested by a query does not exist in a Web cache, the query processor searches the other Web caches in the group until the page is found in one of them. If the page is not in any cache of the group, a message is sent to the original server of the Web page for a fresh copy. This type of Web cache sharing was proposed in the Harvest project [7,9] and in the Internet Cache Protocol (ICP) [15]. Many institutions in many countries have created proxy caches using ICP to reduce network traffic over the Internet [14].

However, ICP causes heavy network traffic among cooperating Web caches because, on a cache miss, messages are sent to all other Web caches to find out whether the requested Web page is present in any of them. Access indexes can shorten the response time of Web cache accesses and reduce the number of messages sent among Web caches. One recent index design for Web cache sharing is the summary cache scheme [14]. We make three important improvements over the summary cache scheme, namely, a semi-distributed index sharing mode, a simplified index update operation, and a simplified counter design. We call the improved Web cache index design the "signature cache" scheme.

A summary cache [14] uses 0/1 bit-strings (Bloom filters; for short, we call them filters) to encode summary information about URLs. Each URL is hashed to a fixed-length 0/1 bit-string, and different URLs generally have different combinations of 0's and 1's in their hashed codings. All hashed codings in a particular Web cache are superimposed (bitwise-ORed) into a 0/1 bit-string, the Bloom filter, which represents (summarizes) all URLs in the Web cache. A query looking for a Web page (keyed by a URL) hashes the URL into a 0/1 bit-string, the query filter, and uses it to search all Bloom filters. If every "1" bit in the query filter is matched by a "1" bit in the same bit position of a Bloom filter, then it is possible (the probability can be made very high if the filter is sufficiently long) that the queried Web page is present in the Web cache. Otherwise, the queried Web page is definitely not in the Web cache. Using Bloom filters in this way speeds up the search because the total size of all Bloom filters is much smaller than the total size of all URLs.

As an example, suppose Hash(URL1)=001100, Hash(URL2)=101000, Hash(URL3)=100100, and a Web cache contains only three URLs: W = {URL1, URL2, URL3}. Then the Bloom filter for the Web cache is 101100. If a query URLq has hashed coding Hash(URLq)=101000, then it is possible that URLq is in the Web cache because the first and the third bits of the Bloom filter are '1'. If Hash(URLq)=000101, then comparing this coding with the Bloom filter shows that URLq is not in the Web cache because the sixth bit of the Bloom filter is '0'. It may happen that Hash(URLq) matches the Bloom filter (that is, every '1' bit in the coding is matched by a '1' bit in the same position of the Bloom filter) although the Web cache does not contain URLq. This is called a false match. For historical reasons, it is also called a false drop. The possibility of false drops lowers the effectiveness of using a Bloom filter to identify queried URLs. The probability that a queried Web page is not in the Web cache but is identified by the Bloom filter as present is called the false drop probability. The false drop probability is an important measure of the effectiveness of Bloom filters, and it can be lowered by using longer Bloom filters.
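The worked example above can be reproduced in a few lines of Python. This is only an illustrative sketch: the helper names `superimpose` and `may_contain` are ours, and the hashed codings are the ones given in the text rather than the output of a real hash function.

```python
def superimpose(codings):
    """Bitwise-OR a list of equal-length 0/1 strings into one filter string."""
    width = len(codings[0])
    bits = 0
    for coding in codings:
        bits |= int(coding, 2)
    return format(bits, "0{}b".format(width))


def may_contain(bloom, query):
    """True if every '1' bit of the query is also '1' in the filter."""
    return int(query, 2) & ~int(bloom, 2) == 0


# Hashed codings of URL1, URL2, URL3 as given in the text.
cache_codings = ["001100", "101000", "100100"]
bloom = superimpose(cache_codings)

print(bloom)                         # 101100
print(may_contain(bloom, "101000"))  # True: possibly present (could be a false drop)
print(may_contain(bloom, "000101"))  # False: definitely not in the cache
```

A `True` answer only says the page may be cached; a `False` answer is always reliable.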

The Bloom filter of a particular Web cache is sent to all other Web caches. If there are m Web caches, then each Web cache holds m - 1 Bloom filters from the m - 1 other Web caches. A query to a particular Web cache first searches all Bloom filters in that Web cache. This search produces a list of Web caches (the candidate list) that possibly contain the desired Web page. If the Web page cannot be found locally and there are Web caches on the candidate list, then messages are sent to those Web caches to search them. This is sketched in Figure 1.
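A minimal sketch of building the candidate list from the filters a cache holds for its peers. The cache names and filter values are made up for illustration, and filters are represented as Python integers.

```python
def candidate_list(remote_filters, query_bits):
    """Return the names of remote caches whose filter matches the query."""
    return [name for name, bloom in sorted(remote_filters.items())
            if query_bits & ~bloom == 0]


# Hypothetical filters received from three other cooperating caches.
remote_filters = {"cacheB": 0b101100, "cacheC": 0b010011, "cacheD": 0b111000}

# Only the caches on the candidate list receive a signal message.
print(candidate_list(remote_filters, 0b101000))  # ['cacheB', 'cacheD']
```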

 


Figure 1: The fully distributed index sharing mode.

 

We divide the messages over the network into two categories. A message that contains request for searching a Web cache is called a signal message and a message that contains found Web pages is called a data message. Obviously, a data message is usually much larger than a signal message.

The main contribution of the summary cache scheme [14] is that it reduces the number of signal messages by using Bloom filters to identify Web caches that are very likely (typically with a chance over 99.9% [14]) to contain the desired Web pages and by sending signal messages only to these identified Web caches.

The summary cache scheme copes with the dynamic nature of the Web contents by using counters for each bit in the Bloom filter. A Bloom filter represents all URLs in a Web cache. Each bit of the Bloom filter may be set to 1 many times by different hash codings from different URLs. When a new Web page is added to a Web cache, a new coding is hashed from the URL of the Web page and is bitwise-ORed with the existing Bloom filter to obtain a new Bloom filter. The counters corresponding to the positions of 1's in newly hashed coding are incremented by 1 (recall that a counter is associated with an individual bit in the Bloom filter). When an existing Web page is deleted from the Web cache, the counters corresponding to the positions of 1's of the hash coding from the deleted Web page are decremented by 1. When a counter is changed from 0 to 1, the corresponding bit in the Bloom filter is set to 1 and when a counter is changed from 1 to 0, the corresponding bit in the Bloom filter is set to 0. By using counters, the updates of Web pages in the Web cache can be reflected into the Bloom filter.
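The counter mechanism described above can be sketched as follows. The class name `CountingFilter` and the hash construction (slices of a SHA-256 digest) are assumptions made for illustration; they are not taken from [14]. The sketch also uses unbounded Python integers, so the 4-bit counter overflow discussed later is not modeled.

```python
import hashlib


class CountingFilter:
    """Bloom filter with one counter per bit position."""

    def __init__(self, num_bits, num_hashes):
        if not 1 <= num_hashes <= 8:
            raise ValueError("this sketch supports 1 to 8 hash positions")
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.counters = [0] * num_bits

    def _positions(self, url):
        # Assumed hash choice: 4-byte slices of one SHA-256 digest.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        return {int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.num_bits
                for i in range(self.num_hashes)}

    def add(self, url):
        for position in self._positions(url):
            self.counters[position] += 1

    def remove(self, url):
        # The caller must only remove URLs that were added before.
        for position in self._positions(url):
            self.counters[position] -= 1

    def bit(self, position):
        # A filter bit is 1 exactly when its counter is non-zero.
        return 1 if self.counters[position] > 0 else 0

    def may_contain(self, url):
        return all(self.counters[p] > 0 for p in self._positions(url))
```

Every addition and every deletion touches up to `num_hashes` counters, which is the maintenance cost the signature cache scheme aims to avoid.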

In most of the research literature, Bloom filters are called signatures, and signatures as access indexes to datasets and databases have been studied extensively [2, 3, 4, 6, 10]. Only recently have signatures been proposed for indexing Web caches [12,14]. Many of the research results and much of the experience with signature indexing in other fields, such as databases, can be applied to Web cache indexing as well. For example, in summary cache [14], only one Bloom filter is constructed for each Web cache. Usually, in signature indexing, the content of the dataset to be indexed is partitioned for better efficiency in accessing the index and for ease of maintaining the signature index when the dataset is updated. Also, each bit in the Bloom filter is associated with a counter. To reduce the probability of overflow, each counter must be at least 4 bits long [14]. This means that locally, for a Web cache, the counters of its Bloom filter take much more space than the local Bloom filter itself. Finally, the space overhead of the fully distributed Bloom filter index proposed in [14] is very large, especially when the number of cooperating Web caches is large. If n Web caches participate in the cache sharing, then the number of Bloom filters is of the order $O(n^2)$.

We propose improvements over the summary cache scheme and name the improved cache indexing scheme the signature cache scheme. The signature cache index is much smaller and easier to maintain than the summary cache index, yet it delivers performance comparable to that of the summary cache.

This paper is organized as follows. Section 2 describes the proposed signature cache scheme. Section 3 analyzes the index size reduction of the signature cache scheme over the summary cache scheme. Section 4 analyzes the filtering effect of using signature files through both probability analysis and experiments, and also analyzes the search response time of the signature cache scheme. Section 5 discusses the implications of the signature cache scheme for ICP. Finally, Section 6 concludes the paper.

 

2 Signature Cache Scheme

We improve on the summary cache scheme in several respects. First, we note that putting every Web cache's Bloom filter into every other Web cache results in a large space overhead for the index structure. We propose a semi-distributed index sharing mode to rectify this problem. Second, we note that partitioning the content of a Web cache can reduce the maintenance effort for the index structure. There is no inherent reason why URLs of Web pages being added to a Web cache have to go into any particular partition. The only requirement for partitioning is that partitions should have approximately equal size. So, precise tracking of the changes of Bloom filters is not really necessary. Based on this understanding, we propose a simplified update operation for signatures. Third, with the simplified update operation, the one counter per bit position used in the summary cache scheme is not needed, and we propose a simplified counter design. The Web cache index design with these improvements is called the signature cache scheme.

A signature index can also be used as a content-based access index. Because of space limitations, we restrict the discussion to searching the URLs of Web pages. In the following discussion, all partitioning and indexing refer to the URLs of Web caches, and we assume pointers are used to link the URLs to their actual Web caches. The Bloom filter is just another term for superimposed signature. So, in the rest of this paper we use the terms signature and filter interchangeably.

Partitioning and Input Partition: Most of the research literature on signature indexing favors partitioning the content of the dataset. It can be shown [10] that, theoretically, the space overhead for the signature index is the same whether or not a Web cache is partitioned, as long as the filtering effect (measured by the false drop probability) is the same. For example, if the content of a Web cache is partitioned into 100 partitions, and each partition has a signature constructed by bitwise-ORing all hashed codings from members of the partition, then the size, in total bits, of the 100 partition signatures will be the same as that of one long signature obtained by bitwise-ORing the hashed codings of all members of the cache, provided that the same false drop probability is to be achieved.

However, one advantage of partitioning the content of the Web cache is that it makes the maintenance of the signature index much easier. This is a particularly welcome feature for Web caches since their content changes frequently. To maintain consistency between the signature index and the content of the Web cache, URLs of newly added Web pages can go into one particular partition, which we call the input partition. By directing all incoming URLs of a Web cache into the input partition, we avoid updating any existing partition signature except one, namely the partition signature of the input partition.

Suppose the URLs of a Web cache are partitioned into m partitions. We can pick any partition as the input partition and call the other m - 1 partitions the "stable" partitions. Once constructed, the partition signature of a stable partition never changes. A stable partition only allows deletion of URLs and does not allow addition of new URLs. All new URLs are added to the input partition. Once the number of URLs in the input partition reaches the average size (which can be pre-determined) of a stable partition, the input partition becomes one of the stable partitions and a new input partition is established. Note that the deletion of URLs from a partition does not affect the correctness of the partition signature and only affects its efficiency as an access index. This is because the deletion of URLs from a partition raises the false drop probability of the partition signature. The loss of efficiency of the partition signature can be limited to a pre-determined extent by using counters, which will be discussed later in this paper.

Keep in mind that the Bloom filter is a very long 0/1 bit-string. For example, if a Web cache has 1M Web pages, and the load factor (ratio of bits in the Bloom filter to the number of URLs) is 16, the length of the Bloom filter will be 16M bits, i.e., 2M bytes. If the content of the Web cache is partitioned into 20,000 partitions, then on average each partition has only 50 URLs. For the same load factor of 16 (so the false drop probability will be the same), the partition signature is only 800 bits, i.e., 100 bytes. Maintaining the small partition signature of the input partition (in this case, only 100 bytes) requires far fewer CPU operations (presumably because the operations work on much smaller in-memory datasets) or far fewer random disk accesses (if the filters are disk resident) than updating the Bloom filter in the summary cache scheme.
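The sizes quoted above follow from simple arithmetic, reproduced here as a quick check. The function name `filter_bytes` is ours, and 1M is taken as 1,000,000.

```python
def filter_bytes(num_urls, load_factor):
    """Size in bytes of a filter holding num_urls URLs at load_factor bits per URL."""
    return num_urls * load_factor // 8


total_urls = 1_000_000
num_partitions = 20_000
load_factor = 16

whole_cache_filter = filter_bytes(total_urls, load_factor)        # one filter per cache
urls_per_partition = total_urls // num_partitions
partition_signature = filter_bytes(urls_per_partition, load_factor)

print(whole_cache_filter)    # 2000000 bytes (2M bytes)
print(urls_per_partition)    # 50 URLs
print(partition_signature)   # 100 bytes (800 bits)
```

The total index size is unchanged (20,000 signatures of 100 bytes each), but an update only touches one 100-byte signature.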

Simplified Design of Counters: By using an input partition, counters are not needed for recording the addition of new Web pages to a Web cache. However, counters are still needed for recording the deletion of existing Web pages from a stable partition of a Web cache. Since we do not modify the partition signatures of stable partitions, counters are only used to signal whether "enough" URLs have been deleted from a partition that the corresponding partition signature has to be remade. For this purpose, one counter per stable partition is sufficient.

In a signature cache, the counter for a partition is initialized to zero when the partition becomes a stable partition and its partition signature is made. The counter is incremented by 1 each time a URL is deleted from the partition. When the counter for the partition is large enough, we know that a sufficient number of URLs have been deleted from the partition. So, we discard the partition signature and put all remaining URLs of this partition into the input partition. This is equivalent to treating the Web pages of these URLs as new Web pages in the Web cache. This approach greatly reduces the total size of counters.

As a simple numerical example, suppose the Bloom filter has 5,000 bits and each bit of the Bloom filter has a counter of four bits [14]. Then the total size of the counters for the Bloom filter in a summary cache is 20,000 bits. Also suppose that the URLs of the same Web cache are divided into 100 partitions and a two-bit counter is used for recording the deletion of URLs from each partition. Then the total size of the counters for all stable partitions in the signature cache is only 99 x 2 = 198 bits. This is a dramatic reduction in counter size.
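The input partition and the one-counter-per-partition rule can be sketched together as below. This is a simplified, single-cache illustration under several assumptions of ours: the class and method names are hypothetical, URLs are hashed with slices of a SHA-256 digest, deletions from the input partition simply leave stale bits in its signature, and a partition is discarded once its counter reaches `max_deletions`.

```python
import hashlib


def url_signature(url, num_bits, num_ones):
    """Hash a URL to an integer bit mask with up to num_ones (<= 8) '1' bits."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    mask = 0
    for i in range(num_ones):
        mask |= 1 << (int.from_bytes(digest[4 * i:4 * i + 4], "big") % num_bits)
    return mask


class Partition:
    def __init__(self):
        self.urls = set()
        self.signature = 0
        self.deletions = 0  # the single counter kept per partition


class SignatureCacheIndex:
    def __init__(self, num_bits=800, num_ones=4, partition_size=50, max_deletions=4):
        self.num_bits = num_bits
        self.num_ones = num_ones
        self.partition_size = partition_size
        self.max_deletions = max_deletions
        self.stable = []          # signatures here are never modified
        self.input = Partition()  # the only partition that accepts new URLs

    def add(self, url):
        self.input.urls.add(url)
        self.input.signature |= url_signature(url, self.num_bits, self.num_ones)
        if len(self.input.urls) >= self.partition_size:
            self.stable.append(self.input)  # input partition becomes stable
            self.input = Partition()

    def delete(self, url):
        if url in self.input.urls:
            self.input.urls.discard(url)    # assumption: signature left as is
            return
        for part in self.stable:
            if url in part.urls:
                part.urls.discard(url)
                part.deletions += 1
                if part.deletions >= self.max_deletions:
                    # Discard the signature; survivors re-enter as new URLs.
                    self.stable.remove(part)
                    for remaining in sorted(part.urls):
                        self.add(remaining)
                return

    def may_contain(self, url):
        sig = url_signature(url, self.num_bits, self.num_ones)
        return any(sig & ~p.signature == 0 for p in self.stable + [self.input])
```

Note that `add` never touches a stable partition, and `delete` only increments one small counter until the partition is rebuilt.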

Note that our simplified counter design does not lose any effectiveness as compared to the summary cache. The two bit counter used in the above example means that we only allow up to four URLs to be deleted from a partition. In general, the loss of efficiency of the partition signatures can be precisely controlled by a very small counter. Our new counter design can even reduce the amount of updating because counters are only updated for the deletions of URLs and not for the additions of URLs. This represents, in general, a 50% reduction of the maintenance effort for counters as compared to the summary cache approach.

Semi-Distributed Sharing Mode: The fully distributed index sharing mode in the summary cache scheme uses a large amount of storage, especially when the number of Web caches is large. For example, suppose there are 400 cooperating Web caches, each Web cache has 8 GB of Web page contents, and on average each Web page is about 8 KB [14]. Then each Web cache has 1M Web pages. Suppose each Bloom filter has a length of 16M bits (as the result of using a load factor of 16 [14]). Then the total size of all Bloom filters from all other cooperating Web caches in a particular Web cache is $16 \times 10^6 \times 400 / 8 = 8 \times 10^8$ bytes, i.e., 800 MB, which is about 10% of the space needed to store all Web pages in the Web cache. The scalability of fully distributed index sharing is severely limited because the overall index size is of the order $O(n^2)$ when the number of cooperating Web caches is of the order $O(n)$.

To reduce the signature index size, we propose a semi-distributed index sharing mode, in which one or several Web caches act as coordinating Web caches that store the signatures from all cooperating Web caches. For simplicity of discussion, let us assume that there is only one coordinating Web cache. When a cache miss happens in a Web cache that is not the coordinating Web cache, a signal message is sent from the Web cache to the coordinating Web cache to search the signature index. This search yields a list of candidate Web caches, which are then searched for the Web page requested by the query. This is sketched in Figure 2. Because there is only one copy of each partition signature in the system, for the above scenario, the total space (in memory or on disk) needed for storing all signatures in the semi-distributed sharing mode is only 400*16M/8 = 800 MB. This is much smaller than the total of 400*800 MB = 320 GB needed across all the cooperating Web caches in the summary cache scheme.
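The storage figures above can be recomputed with a short script. The function names are ours; decimal units are assumed (8 GB = 8 x 10^9 bytes), and, as in the text, each cache is counted as holding 400 filters rather than 399.

```python
def filter_bytes_per_cache(cache_bytes, page_bytes, load_factor):
    """Bytes of filter needed to index one cache at load_factor bits per page."""
    pages = cache_bytes // page_bytes
    return pages * load_factor // 8


def fully_distributed_total(num_caches, per_cache):
    """Every cache stores a filter for every cache: O(n^2) growth."""
    return num_caches * num_caches * per_cache


def semi_distributed_total(num_caches, per_cache, copies=1):
    """Only the coordinating cache(s) store the filters: O(n) growth."""
    return copies * num_caches * per_cache


per_cache = filter_bytes_per_cache(8_000_000_000, 8_000, 16)

print(per_cache)                                # 2000000 (2 MB)
print(400 * per_cache)                          # 800000000 (800 MB held by one cache)
print(fully_distributed_total(400, per_cache))  # 320000000000 (320 GB system-wide)
print(semi_distributed_total(400, per_cache))   # 800000000 (800 MB system-wide)
```

With a single copy, the ratio between the two totals equals the number of cooperating caches.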

 


Figure 2: The semi-distributed index sharing mode.

 

Using a coordinating Web cache introduces contention for access to the coordinating Web cache (a bottleneck problem) and a single point of failure (a reliability problem). To avoid these problems, several coordinating Web caches may be used. The number of coordinating Web caches can be adjusted to strike a balance between space and access efficiency, which favors the use of only one coordinating Web cache, and the easing of the reliability and bottleneck problems, which favors the use of more coordinating Web caches.

A salient feature of the signature index, as compared to other access indexes, is that it is easily divisible. Suppose reliability is not a problem, i.e., we do not have to worry about crashes of coordinating Web caches. Then all signatures can be divided into k equal portions, and each portion is stored in one of the k coordinating Web caches. Different coordinating Web caches store different signatures. There is no redundancy (if so desired) in the signature index. Suppose k = 2; then for each cache miss, two signal messages are sent to the two coordinating Web caches to search for a list of candidate caches.

In practical implementation, to avoid processing bottleneck and to enhance reliability, signatures may be divided and some redundancy may be desired. For example, four coordinating Web caches C1 through C4 of equal capability may be employed. The entire signature index can be divided into two equal portions and stored in C1 and C2; C3 is the redundant copy of C1 and C4 is the redundant copy of C2. For each cache miss, four signal messages are sent to C1 through C4 and these caches are searched in parallel. This design will achieve a very high reliability with much less storage overhead than fully-distributed sharing mode used in the summary cache scheme.
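As a rough illustration of the C1 through C4 arrangement, the sketch below maps coordinators to index portions and checks whether the full index is still reachable after some coordinators fail. The function names and the availability check are our own additions, not part of the scheme's specification.

```python
def layout(num_portions, copies):
    """Map coordinator names C1..Cn to the index portion each one stores."""
    return {"C%d" % (i + 1): i % num_portions
            for i in range(num_portions * copies)}


def index_available(layout_map, failed):
    """True if every portion is still held by at least one live coordinator."""
    live = {portion for name, portion in layout_map.items() if name not in failed}
    return live == set(layout_map.values())


coordinators = layout(num_portions=2, copies=2)

print(coordinators)                                # {'C1': 0, 'C2': 1, 'C3': 0, 'C4': 1}
print(index_available(coordinators, {"C1"}))       # True: C3 still holds portion 0
print(index_available(coordinators, {"C1", "C3"})) # False: portion 0 is lost
```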

Main differences between the two schemes: The main differences between the signature cache scheme and the summary cache scheme are the different signature schemes (partitioning of the URLs and the use of input partition; updating only the partition signature for the input partition vs. updating the Bloom filter for the entire Web cache), different counter designs (one counter per partition vs. one counter per bit), and the different index sharing mode (semi-distributed vs. fully distributed). Together, these improvements result in a much smaller Web cache index which requires much less and simpler maintenance work.

 

3 Benefits of the Signature Cache

The main advantages of the signature cache scheme over the summary cache scheme are the small index size and the ease of maintaining the index structures. ICP does not need any index structure. However, when the contents of a large number of Web caches are shared, the network traffic overhead generated by ICP is unbearable. The summary cache scheme [14] drastically reduces the network traffic compared to ICP at the cost of maintaining an index structure in each of the cooperating Web caches. The main advantages of using Bloom filters, i.e., signatures, in the summary cache scheme are that signature indexes are small and easy to maintain. However, there is much room for improvement in this scheme. We make several improvements over the summary cache scheme to further reduce the size of the signature index and further simplify its maintenance.

The ease of index maintenance is an important issue in practical Web cache sharing implementations because the content of Web caches changes frequently. In our signature cache design, all but one partition are stable partitions. So, the software that updates the index structure when the content of Web caches changes can concentrate on the input partition. This means that, compared to the summary cache design, the signature cache needs much less work, both in CPU operations and in disk accesses, to maintain the index structure.

Compared to the summary cache scheme, two factors contribute to the smaller size of the signature cache index: the one counter per partition design and the semi-distributed signature index sharing mode. Bloom filters in summary caches are duplicated and distributed over all cooperating Web caches, whereas counters are not. So, our two improvements work in a complementary way: the size reduction of the one counter per partition design is more evident when the number of cooperating Web caches is small, while the size reduction of the semi-distributed index sharing mode is more evident when the number of cooperating Web caches is large. Together, these two improvements achieve a balanced size reduction over a wide range of numbers of Web caches for the index structure, which includes both filters and counters.

Figures 3 and 4 show the size reductions (signatures only, not including the counters) of the semi-distributed index sharing mode vs. the fully distributed mode as the number of Web caches changes. The parameters for Figures 3 through 8 are listed in Table 1. Note that the vertical axis is in logarithmic scale and the size reduction is very effective. The semi-distributed sharing mode has a drastically lower space overhead because only one copy (or several copies) of each signature is kept in the system. Naturally, we are concerned about the cost of the semi-distributed index sharing mode in terms of the additional network traffic caused by sending one (or several) extra signal messages to the coordinating Web cache(s). It turns out that the network traffic overhead of the extra signal message(s) is quite low, as discussed in Section 4.

 


Figure 3: Filter sizes of different index sharing modes.

 

 

 


Figure 4: Filter sizes of different index sharing modes.

 

 

Parameter                                        Value
size of each Web cache                           8 GB
size of each Web page                            8 KB
load factor (number of bits per Web page)        16
size of filters contributed by each Web cache    2 MB
size of each counter                             4 bits
length of each filter                            800 bits

Table 1: Parameters used in figures 3 - 8

 

The size reductions (counters only, not including the signatures) due to the one counter per partition design vs. the one counter per bit design as the number of cooperating Web caches varies are plotted in Figures 5 and 6.

 


Figure 5: Counter sizes of different counter designs.

 


Figure 6: Counter sizes of different counter designs.

 

The total index size is the sum of the filter size and the counter size. To assess the contribution of the counter size reduction to the overall index size, we compare the signature cache scheme to an improved summary cache scheme, i.e., a summary cache scheme that uses the semi-distributed index sharing mode but keeps its one counter per bit design. The resulting size reductions are shown in Figures 7 and 8 and are also substantial.

 

Figure 7: Total index sizes of signature and summary cache schemes.

 


Figure 8: Total index sizes of signature and summary cache schemes.

 

4 Effectiveness of Signature Indexing

Filtering Effect: We analyze the filtering effect of using signatures for Web cache indexing in this section. Our discussion is based on the results of previous research on signature files [1, 2, 3, 5, 8, 10, 13].

Assume each URL is hashed into a signature of b bits with exactly k "1" bits and b - k "0" bits, and each partition has m URLs representing m Web pages. A query URL, i.e., the URL of the Web page that the query is looking for, is also hashed into a query signature of b bits with exactly k "1" bits and b - k "0" bits. A query signature matches a partition signature if every "1" bit of the query signature is matched by a "1" bit in the same position of the partition signature.

The probability of false drops, i.e., the probability that a query signature matches a partition signature although the query URL is not present in the partition, can be computed as follows. For each "1" bit in the query signature, the probability that the signature of a URL in the partition has no "1" bit in this position is $1 - k/b$. The probability that none of the signatures of the m URLs in the partition has a "1" bit in this position is $(1 - k/b)^m$. So, the probability that at least one of the signatures of the m URLs in the partition has a "1" bit in this position is $1 - (1 - k/b)^m$. The query signature has k "1" bits. So, the false drop probability for an unsuccessful search is [1]:

$$F = \left(1 - (1 - k/b)^m\right)^k \qquad (1)$$

This probability can also accurately approximate the false drop probability for a successful search [2]. From this equation, we derive the required signature length if we fix k and the false drop probability F:

$$b = \frac{k}{1 - \left(1 - F^{1/k}\right)^{1/m}} \qquad (2)$$
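For quick numerical checks, Eq. (1) and its inverse with respect to b can be coded directly. The inverse used here, b = k / (1 - (1 - F^(1/k))^(1/m)), is obtained by solving Eq. (1) for b; the function names are ours.

```python
def false_drop_probability(b, k, m):
    """Eq. (1): F = (1 - (1 - k/b)^m)^k."""
    return (1.0 - (1.0 - k / b) ** m) ** k


def required_length(F, k, m):
    """Signature length b that yields false drop probability F (Eq. (1) solved for b)."""
    return k / (1.0 - (1.0 - F ** (1.0 / k)) ** (1.0 / m))


# Parameters used in the experiments of this section: k = 4, m = 50.
print(false_drop_probability(600, 4, 50))  # below 0.01
print(required_length(0.01, 4, 50))        # length needed for F = 1%
```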

We conducted experiments using a file of 125,000 signatures of length b (b varies from 400 to 1,000), each with k = 4 "1" bits. The four "1" bits are randomly distributed over the b bits. The grouping factor m is fixed at 50. So, there are a total of 125,000/50 = 2,500 partition signatures. For each experiment, 20 of the 125,000 signatures are randomly selected as query signatures, and the file of 125,000 signatures is scanned to find true drops. A true drop is a match of a query signature with a partition signature where the partition represented by the partition signature does contain a URL whose signature is the query signature. A match means that every "1" bit in the query signature is matched by a "1" bit in the partition signature. The number of true drops found in this scan is recorded. For each query signature, no more than one true drop is recorded for each partition. Each of the 20 query signatures is then checked against the 2,500 partition signatures, and the total number of drops is recorded. The number of false drops is the difference between the total number of drops and the number of true drops. This experiment is repeated 30 times for each b, using different random seeds, and the result is plotted in Figure 9.
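A scaled-down version of this experiment might look as follows. It is a sketch only: the default file size is reduced to 5,000 signatures to keep the run short, and the normalization of the false drop count (dividing by the number of query/partition comparisons that are not true drops) is our assumption, since the text does not state the denominator.

```python
import random


def run_experiment(b, k=4, m=50, num_signatures=5000, num_queries=20, seed=0):
    """Estimate the false drop rate for signature length b by simulation."""
    rng = random.Random(seed)

    # One signature per URL: exactly k '1' bits placed at random among b bits.
    signatures = []
    for _ in range(num_signatures):
        mask = 0
        for position in rng.sample(range(b), k):
            mask |= 1 << position
        signatures.append(mask)

    # Group m signatures per partition and superimpose them.
    partitions = [signatures[i:i + m] for i in range(0, num_signatures, m)]
    partition_signatures = []
    for part in partitions:
        superimposed = 0
        for sig in part:
            superimposed |= sig
        partition_signatures.append(superimposed)

    queries = rng.sample(signatures, num_queries)
    total_drops = 0
    true_drops = 0
    for query in queries:
        for part, part_sig in zip(partitions, partition_signatures):
            if query & ~part_sig == 0:      # every '1' bit of the query matched
                total_drops += 1
                if query in part:           # at most one true drop per partition
                    true_drops += 1

    false_drops = total_drops - true_drops
    comparisons = num_queries * len(partition_signatures) - true_drops
    return false_drops / comparisons


for length in (400, 600, 800, 1000):
    print(length, run_experiment(length, seed=1))
```

Averaging over several seeds, as the paper does with 30 repetitions, smooths out the noise of a single run.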

 

Figure 9: The false drop probabilities.

 

Figure 9 plots the variation of the false drop probability F with respect to the signature length b. The analytical result is obtained from Eq.(1) and the experimental result from the experiments described above. The experimental result is slightly better than the analytical result. This is due to the approximate nature of Eq.(1). In Eq.(1), $1 - (1 - k/b)^m$ is the probability that at least one "1" bit of the partition signature is in the position of a particular "1" bit of the query signature. Because the query signature has k "1" bits, F is approximated as $(1 - (1 - k/b)^m)^k$. This is an over-estimate because if we already know that one "1" bit of the partition signature is in the position of a "1" bit of the query signature, then the probability that at least one "1" bit of the partition signature is in the position of a (second) particular "1" bit of the query signature is between (1 - (1-(0-1)/b))m-1 and (1 - (1-k/b))m-1 * (1 - (1-(k-1)/b)). If k > 2, an even smaller probability will be computed for the third "1" bit of the query.

For all practical purposes, we can use Eq.(1) [1] as a simple approximation for the estimation of the filtering effect of signature files.

Obviously, the choice of k affects the value of F for the same b and m pair. The optimal k can be determined as follows. A partition signature is superimposed from the m signatures of m URLs. The average number of "1" bits in the partition signature is:

$$w = b\left(1 - (1 - k/b)^m\right) \qquad (3)$$

Sacks-Davis [1] shows that the best filtering effect of using signature achieves at approximately w = b/2. Substituting w = b/2 into Eq.(3), we obtain the optimal k as:

$k = b\,(1 - 2^{-1/m})$ (4)

Figure 10 plots the variation of the optimal k with respect to b and m by using Eq.(4). Substituting the k of Eq.(4) into Eq.(1), we obtain F as a function of b and m, i.e., F = F(b,m):

$F(b,m) = 2^{-b\,(1 - 2^{-1/m})}$ (5)
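The steps from Eq.(3) to Eq.(4) and Eq.(5) can be written out as follows. This is a sketch that assumes Eq.(1) has the form F = (1 - (1 - k/b)^m)^k, as described in the discussion of Figure 9.

```latex
\begin{align*}
\frac{b}{2} &= b\left(1 - \left(1 - \frac{k}{b}\right)^{m}\right)
  && \text{set } w = b/2 \text{ in Eq.(3)} \\
\left(1 - \frac{k}{b}\right)^{m} &= \frac{1}{2} \\
k &= b\left(1 - 2^{-1/m}\right)
  && \text{Eq.(4)} \\
F(b,m) &= \left(1 - \left(1 - \frac{k}{b}\right)^{m}\right)^{k}
        = \left(\frac{1}{2}\right)^{k}
        = 2^{-b\left(1 - 2^{-1/m}\right)}
  && \text{Eq.(5)}
\end{align*}
```

For large m, $1 - 2^{-1/m} \approx (\ln 2)/m$, so the optimal k is approximately $(b/m)\ln 2$ and F depends on b and m essentially only through the ratio b/m.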

 

 

Figure 10: The change of optimal k values.

 

Since b is the length of the signatures and m is the size of each partition, the load factor b/m (the number of bits per indexed Web page) is proportional to the size of the signature files in the signature caches. For the same Web cache contents, if b/m doubles, then the size of all signature files doubles.

From Figure 9 we see that signatures of length 600 bits are enough to reduce the false drop probability to 1%, which is sufficient for most practical applications. With the grouping factor m = 50, (600/8)/50 = 1.5, so the overhead of signatures is only 1.5 bytes per Web page (which averages 8K bytes [14]). This is a very lightweight index structure.
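The overhead arithmetic can be checked with a few lines of Python. The helper names and the cache size of one million pages are hypothetical and serve only to illustrate how the index size scales with the load factor b/m.

```python
AVG_PAGE_BYTES = 8 * 1024  # "8K bytes" average Web page size, as quoted from [14]

def signature_bytes_per_page(b, m):
    """Index overhead per Web page: b bits shared by the m pages of one partition."""
    return (b / 8) / m

def index_size_bytes(num_pages, b, m):
    """Total signature file size for num_pages cached pages (hypothetical helper)."""
    return num_pages * signature_bytes_per_page(b, m)

if __name__ == "__main__":
    per_page = signature_bytes_per_page(600, 50)
    print(per_page)                                # 1.5 bytes per page
    print(per_page / AVG_PAGE_BYTES)               # index size relative to page size
    print(index_size_bytes(1_000_000, 600, 50))    # hypothetical cache of 10^6 pages
    print(index_size_bytes(1_000_000, 1200, 50))   # doubling b/m doubles the index
```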

Search Response Time: In the semi-distributed index sharing mode, if a query for a Web page cannot be satisfied locally in a Web cache, a signal message is sent to the coordinating Web cache (assuming only one coordinating Web cache is employed), which searches the signature files, instead of signal messages being sent directly to the Web caches that may contain the desired Web page. This means that one additional signal message may be sent compared to the fully distributed index sharing mode. However, this additional signal message has a limited effect on the network traffic. The total network traffic for answering a query in the fully distributed index sharing mode consists of:

1. the initial signal message sent by a user to one of the Web caches.

2. in case a cache miss happens, signature files are searched and a list of Web caches is identified as possibly containing the desired Web page. Signal messages are sent to these Web caches.

3. if the Web page is found, a data message is sent to the user who initiated the query. Otherwise, a signal message is sent to the original server of the desired Web page for a fresh copy.

4. a data message is sent from the original server directly to the user who initiated the query. Alternatively, the Web cache is first updated with this fresh copy of the Web page and then a data message is sent to the user; this incurs a longer delay.

Since a signal message is usually much smaller than a data message, the additional network traffic caused by the semi-distributed index sharing mode is no more than one signal message per query beyond the traffic of the fully distributed index sharing mode. Also, the additional signal message travels between cooperating Web caches, which are presumably close to each other. As for the time delay, if every message takes the same time (assumed to be one unit) to travel, then the summary cache with the fully distributed index sharing mode needs three time units for a cache hit (in one of the cooperating Web caches) and four time units for a cache miss, while the signature cache with the semi-distributed index sharing mode needs four time units for a cache hit and five time units for a cache miss. In practice, the slightly longer response time may be well justified by the very significant size reduction of the signature index.
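The time-unit counts above can be reproduced with a small model. This is a sketch under the stated assumption that every message takes one time unit and that messages to several caches are sent in parallel; the function name and the mode labels "full" and "semi" are hypothetical.

```python
def response_time_units(mode, hit_in_peer):
    """Count unit-time messages on the path of a query that misses in the local cache.

    mode: "full" for fully distributed, "semi" for semi-distributed index sharing.
    hit_in_peer: True if some cooperating Web cache holds the page.
    """
    if mode not in ("full", "semi"):
        raise ValueError("mode must be 'full' or 'semi'")
    units = 1              # signal: user -> local Web cache
    if mode == "semi":
        units += 1         # extra signal: local cache -> coordinating Web cache
    units += 1             # signal(s) to the candidate Web caches, sent in parallel
    if not hit_in_peer:
        units += 1         # signal to the original server for a fresh copy
    units += 1             # data message back to the user
    return units

if __name__ == "__main__":
    for mode in ("full", "semi"):
        for hit in (True, False):
            print(mode, "hit" if hit else "miss", response_time_units(mode, hit))
```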

We emphasize that the three improvements of the signature cache scheme (the simplified update operation, the simplified design of counters, and the semi-distributed index sharing mode) can be applied independently. If only the simplified update operation and the simplified design of counters are adopted, then there is no increase in response time, i.e., the signature cache scheme has the same response time as the summary cache scheme.

 

5 The Signature Cache Scheme and ICP

The Internet Cache Protocol (ICP) [16] is mainly used to exchange hints about the existence of URLs in neighboring Web caches. So far, ICP has not become an Internet standard. ICP can clearly be improved, since it sends messages to all cooperating Web caches in case of a cache miss. The large number of messages causes heavy network traffic among cooperating Web caches. In this regard, both the summary cache scheme [14] and our signature cache scheme are proposed remedies for reducing this network traffic overhead.

Fan et al. [14] suggested adding a new opcode, ICP_OP_DIRUPDATE, to ICP version 2 [16] to update the summary index (directory). Our signature cache scheme has a similar need to update its signature index, so a similar modification to ICP is needed. As far as the communication protocol is concerned, the main difference between the summary cache scheme (which uses the fully distributed index sharing mode) and the signature cache scheme (which may optionally use the semi-distributed index sharing mode) is that, if the semi-distributed index sharing mode is applied, all messages for updating the signature index are sent to the coordinating Web cache.

 

6 Concluding Remarks

The use of Web caches has become increasingly popular. We recognized that many of the challenges facing the design of large Web caches are similar to those of large databases. Signature indexes have been extensively studied in the fields of text retrieval [2, 3, 4, 6, 10], relational databases [5, 13], and object-oriented databases [8]. The strengths of the signature index are its small size, simple structure, ease of maintenance, and ability to index text documents. These are precisely what is needed to index large Web caches.

The summary cache scheme [14] is the first effort to apply the signature technique to indexing cooperating Web caches. We identified aspects of the summary cache scheme that can be improved and made three improvements: the simplified update operation, the simplified design of counters, and the semi-distributed index sharing mode. We call the improved scheme the signature cache scheme. These improvements achieve an orders-of-magnitude reduction in index size over the summary cache scheme. Each of our improvements may be applied independently. If only the simplified update operation and the simplified counter design are used, the signature cache has about the same cache response time as the summary cache. The semi-distributed index sharing mode can further reduce the total size of the signature index in the system at the cost of a slightly longer cache response time, due to one additional signal message sent to the coordinating Web cache (or several signal messages sent simultaneously if there are several coordinating Web caches). In the worst case, the cache response time is 30% higher; this occurs when a signal message takes the same time to travel as a data message and a local cache miss is followed by a cache hit in another cooperating Web cache. Compared to the significant size reduction of the index, the slight increase in cache response time caused by the semi-distributed index sharing mode may be well justified.

Our improved schemes (with two or all three of the improvements) have much better scalability than the summary cache scheme. Because of the explosive growth of the World Wide Web, the size of Web caches will also grow drastically and our signature cache scheme will have its impact on the practical design and implementation of Web caches.

Besides indexing URLs, the signature index is very well suited for indexing text documents, and the content of most Web pages is text, so the signature index has good potential to be developed into a content-based index for Web caches. As for future research, we believe that investigating the use of signatures as content-based indexes will be a fruitful direction and that the results of this research will influence practical Web cache designs.

 

References

[1] R. Sacks-Davis. A Two Level Superimposed Coding Scheme For Partial Match Retrieval. Information Systems, 8(4):273-280, 1983.

[2] C. Faloutsos, S. Christodoulakis. Signature files: an access method for documents and its analytical performance evaluation. ACM Transactions on Office Information Systems, 2(4):267-288, 1984.

[3] U. Deppisch. S-Tree: A Dynamic Balanced Signature Index for Office Retrieval. In Proceedings of the ACM Conference on Research and Development in Information Retrieval, Pisa, Italy, pages 77-87, Sept. 1986.

[4] C. Faloutsos, S. Christodoulakis. Description and Performance analysis of signature file methods for office filing. ACM Transactions on Office Information Systems, 5(3):237-257, 1987.

[5] W. W. Chang, H. J. Schek. A Signature Access Method for the Starburst Database System. In Proceedings of the 15th International Conference on Very Large Data Bases, pages 145-153, 1989.

[6] C. Faloutsos. Signature-based text retrieval methods: A survey. Data Engineering Bulletin, 13(1):25-32, 1990.

[7] P. B. Danzig, R. S. Hall, M. F. Schwartz. A case for caching file objects inside internetworks. In Proceedings of SIGCOMM'93, pages 239-248, 1993.

[8] Y. Ishikawa, H. Kitagawa, N. Ohbo. Evaluation of Signature Files as Set Access Facilities in OODBs. In Proceedings of the 1993 SIGMOD Conference, Washington, DC, pages 247-256, June 1993.

[9] The Harvest Group. Harvest information discovery and access system. http://excalibur.usc.edu/, 1994.

[10] D. L. Lee, Y. M. Kim, G. Patel. Efficient Signature File Methods for Text Retrieval. IEEE Transactions on Knowledge and Data Engineering, 7(3):423-435, 1995.

[11] B. M. Duska, D. Marwood, M. J. Feeley. The measured access characteristics of world-wide-web client proxy caches. In Proceedings of USENIX Symposium on Internet Technology and Systems, December 1997.

[12] J. Marais and K. Bharat. Supporting cooperative and personal surfing with a desktop assistant. In Proceedings of ACM UIST'97, ftp://ftp.digital.com/pub/DEC/SRC/publications/marais/uist97paper.pdf, October 1997.

[13] Y. Yang, M. Singhal. Summary Databases as Indexing Structures. Technical Report TR03, Department of Computer and Information Science, January 1998.

[14] L. Fan, P. Cao, J. Almeida. Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol. In Proceedings of ACM SIGCOMM'98, September 1998.

[15] National Lab for Applied Network Research. ICP working group. http://ircache.nlanr.net/Cache/ICP/, 1998.

[16] The Internet Engineering Task Force. Request for Comments: 2186, Category: Informational. http://info.internet.isi.edu:80/in-notes/rfc/files/rfc2186.txt, 1998.

Creative Commons License All the contents of this journal, except where otherwise noted, is licensed under a Creative Commons Attribution License