On the Design of the Seljuk-Amoeba Operating Environment

Brasileiro, Francisco Vilar; Vasconcelos, Sheila Regine Almeida; Gallindo, Érica de Lima; Catão, Vladimir Soares

doi:10.1590/S0104-65001997000300005

Abstract

Building dependable distributed applications is not an easy task. Designers of such systems have followed two complementary approaches to reduce design complexity, namely: i) the use of appropriate developing tools; and ii) the choice of the most restrictive failure semantics possible for the components that form the system’s underlying execution layer. The Seljuk model uses these two approaches to specify a structured way of providing fault tolerance services in the context of distributed operating environments, thus facilitating the construction and execution of dependable distributed applications. In this paper we present the design of the Seljuk-Amoeba operating environment, which follows the Seljuk model to enhance the Amoeba distributed operating system with the provision of fault tolerance services.

fault tolerance; Byzantine failures; replicated processing; dependable distributed applications; distributed operating systems


Francisco Vilar Brasileiro
Universidade Federal da Paraíba - UFPb/Campus II
Av. Aprígio Veloso, 882, Bodocongó
58109-970, Campina Grande, Paraíba, Brazil
fubica@dsc.ufpb.br

Sheila Regine Almeida Vasconcelos
Departamento de Sistemas e Computação - DSC
Av. Aprígio Veloso, 882, Bodocongó
58109-970, Campina Grande, Paraíba, Brazil
sheila@dsc.ufpb.br

Érica de Lima Gallindo
Centro de Ciências e Tecnologia - CCT
Av. Aprígio Veloso, 882, Bodocongó
58109-970, Campina Grande, Paraíba, Brazil
erica@dsc.ufpb.br

Vladimir Soares Catão
Laboratório de Sistemas Distribuídos - LSD
Av. Aprígio Veloso, 882, Bodocongó
58109-970, Campina Grande, Paraíba, Brazil
vlad@dsc.ufpb.br


1 Introduction

Due to its inherent redundancy, a distributed system provides a service that is potentially more dependable than that provided by a centralised one. However, for a particular distributed application to attain high dependability levels, it is necessary for both the application and its execution environment to be carefully designed. In particular, the failure of an isolated processing site or a network connection should have little impact, if any, on the operation of the system as a whole. This leads to the necessity of introducing fault tolerance mechanisms into the system.

When fault tolerance mechanisms are introduced at the application level, the complexity of the application development process increases. In fact, one of the main problems faced by those designing and implementing dependable distributed applications lies in the difficulties introduced by the necessity of tolerating faults [1]. In order to reduce this complexity, designers have followed two complementary approaches: i) the use of appropriate developing tools to support the implementation of the applications (e.g. libraries of functions for fault tolerance [2], programming toolkits [3, 4], special constructions of programming languages [5], specialised services of operating systems [6, 7], etc.); and ii) the choice of the most restrictive failure semantics possible for the components that form the system's underlying execution layer [1].

Brasileiro [8] describes a model that can be used to structure a distributed operating environment using the two approaches presented above, so that it can provide services for fault tolerance, and facilitate the development and execution of dependable distributed applications. The model proposes the provision of two levels of services: at the lower level, the operating environment offers services that allow higher level services and applications to restrict the failure semantics of the components that form their execution infrastructure (processors and communication links); and at a higher level, the operating environment offers services that implement or help in the implementation of the main fault tolerance mechanisms required by dependable distributed applications.

The main characteristic of the extra services provided by the operating environment is their flexibility. This allows applications with different requirements to coexist in the same distributed system, each paying only for the services it uses. Further, applications may define, at activation time, their dependability requirements as well as the assumptions made on the failure semantics of the underlying components. Thus, it is possible, among other things, to balance the desired levels of performance and dependability when starting applications with dependability requirements.

In this paper we follow the Seljuk model to specify services and extra functionality that should be added to the Amoeba distributed operating system [7] so as to attain an operating environment, named Seljuk-Amoeba, which supports the implementation and execution of dependable distributed applications. The remainder of the paper is structured as follows. To make the paper self-contained, Section 2 gives an overview of the basic structure and the main concepts upon which Amoeba has been developed. In Section 3 we discuss the design of Seljuk-Amoeba's lower level reliable processing service, whilst Section 4 presents the specification of Seljuk-Amoeba's higher level fault tolerance services. In Section 5 we discuss related work; finally, Section 6 finishes the paper with our concluding remarks.

2 Basic Concepts and Structure of Amoeba

Amoeba is a micro-kernel based distributed operating system. An Amoeba system is composed of a number of processors, each with its own private memory and network adapter, connected by some communication infrastructure. Provision of operating system functionality is divided between the micro-kernel itself and a number of servers that execute in user mode. The micro-kernel, which runs on every processor of the system, is responsible only for basic operations such as memory management, thread scheduling, I/O support and inter-process communication. All the remaining functionality (e.g. directory and file servers) is provided by user level servers.

All Amoeba servers work based on the object concept. An object is an encapsulated piece of data upon which certain well defined operations may be performed. Each object is managed by an object server process. Operations on an object are performed by sending messages to the object's server. When an object is created, the server returns a capability to the process which creates it. This capability is used to address and protect the object [7].

2.1 Process Management

The key element of Amoeba's process management mechanisms is the process descriptor. It is a portable structure that describes the state of a running process. A process descriptor consists of four parts. The first part is the host descriptor. It describes the properties of the host on which the process may run (e.g., CPU architecture, memory availability, etc.). A process can only run on a host that has the properties matching those in its process descriptor. The second part, the memory descriptor, describes the layout - but not the contents - of a process's virtual address space. The third part of the process descriptor is the thread descriptor. Amoeba has kernel-space threads, that is, the kernel manages the individual threads of a process. For each thread, the thread descriptor contains: program counter, stack pointer, processor status word, registers and the state of blocking system calls. Finally, the last part of the process descriptor contains two capabilities needed for process management. The first one is the capability for the process; a process is viewed as an object which is managed by the kernel of the host on which it runs. Requests for operations on the object (e.g., suspend, migrate, kill) are sent to its managing server, i.e. the kernel. The second one is a capability for the exception handler of a process. When an exception occurs, the kernel builds a process descriptor for the process and sends it to the process's exception handler.
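The four-part structure described above can be pictured as a set of plain data classes. The following is only a minimal sketch under stated assumptions: the field names, the types, and the `can_run_on` helper are hypothetical and do not reproduce Amoeba's actual descriptor layout.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HostDescriptor:
    """Properties a host must have to run the process (hypothetical fields)."""
    architecture: str
    min_memory_kb: int


@dataclass
class SegmentLayout:
    """Layout of one region of the virtual address space, not its contents."""
    start: int
    length: int


@dataclass
class ThreadDescriptor:
    """Per-thread state kept by the kernel."""
    program_counter: int
    stack_pointer: int
    status_word: int
    registers: List[int] = field(default_factory=list)
    blocked_syscall: str = ""  # state of a blocking system call, if any


@dataclass
class ProcessDescriptor:
    host: HostDescriptor
    memory: List[SegmentLayout]
    threads: List[ThreadDescriptor]
    process_capability: bytes  # capability for the process object
    handler_capability: bytes  # capability for the exception handler


def can_run_on(pd: ProcessDescriptor, architecture: str, free_memory_kb: int) -> bool:
    """A process may only run on a host matching its host descriptor."""
    return (pd.host.architecture == architecture
            and free_memory_kb >= pd.host.min_memory_kb)
```

Keeping the descriptor free of memory contents is what makes it portable: it can be shipped to a scheduler or an exception handler without copying the address space.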

2.2 Communication

Amoeba supports both Remote Procedure Call (RPC) and Group Communication (GC). In order to support them, Amoeba provides a network protocol called FLIP - Fast Local Internet Protocol [9]. The communication protocols are organised in two layers: an upper layer implementing RPC and GC and a lower layer that implements FLIP.

FLIP is a datagram-based protocol, hence connectionless. Packets may be lost, corrupted, or arrive in an order different from the one in which they were sent. As a network layer protocol, FLIP can also fragment packets. This has to be done when the size of the message that is to be sent over FLIP is bigger than the maximum size defined for a particular kind of network (for instance, the maximum size of an Ethernet packet is 1526 bytes [10]).

RPC and GC are session layer protocols, and since there is no transport layer in Amoeba, the reassembly of fragments to restore messages and the retransmission of lost messages are performed at the RPC and GC levels.
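The division of work described above, fragmentation at the network layer and reassembly at the RPC and GC levels, can be illustrated with a small sketch. It does not reproduce FLIP's real packet format: the `(index, total, data)` tuples and the function names `fragment` and `reassemble` are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple

# Hypothetical fragment representation: (index, total, data).
Fragment = Tuple[int, int, bytes]


def fragment(message: bytes, max_payload: int) -> List[Fragment]:
    """Split a message into fragments no larger than max_payload bytes."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    # Ceiling division; an empty message still yields one (empty) fragment.
    total = max(1, -(-len(message) // max_payload))
    return [(i, total, message[i * max_payload:(i + 1) * max_payload])
            for i in range(total)]


def reassemble(fragments: List[Fragment]) -> Optional[bytes]:
    """Rebuild a message from fragments that may arrive in any order.

    Returns None when a fragment is missing, in which case the upper
    layer would have to arrange for a retransmission.
    """
    if not fragments:
        return None
    total = fragments[0][1]
    by_index = {index: data for index, _total, data in fragments}
    if any(i not in by_index for i in range(total)):
        return None
    return b"".join(by_index[i] for i in range(total))
```

Because datagrams may be reordered, the sketch indexes fragments rather than relying on arrival order, and it treats an incomplete set as a loss to be handled by the layer above.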

2.3 Amoeba's Servers

Server processes are important components of the Amoeba system. Apart from the basic functions provided by the micro-kernel that executes on each processor of the system, all the system's functionality is provided by user level servers; these include servers for providing file, directory, boot and execution services. We will pay special attention to Amoeba's processing service, which is implemented by a user level server - the Run Server, responsible for higher level scheduling and load balancing, and a kernel level server - the Process Server, responsible for basic process management.

Under Amoeba, tasks are submitted to the Run Server through its command interpreter (the shell). When a user types in a command on the terminal, the shell extracts the first word of the command line, assumes that it is the name of an executable program, looks for this program in Amoeba's File Server, and if it finds it, takes steps to execute it. First of all, the shell tries to find the architectures for which the program is available. To do so, it looks into the /bin directory. If that program is available for multiple architectures, then it will be present not as a regular file, but as a directory. This directory contains executable programs for each architecture available. The shell then does an RPC with the Run Server, sending it all possible process descriptors and asking it to choose an architecture and a specific CPU to run the program on.

Hosts in an Amoeba system are grouped into pools. There may be different pools for different architectures. A host may also be present in more than one pool at a time. The Run Server is responsible for maintaining status information about all hosts in the pools registered with it.

Based on the status information of all hosts in a user's pool, and all available process descriptors, the Run Server selects the most suitable host to run the given process. To implement this, the Run Server maintains a number of attributes for each host under its control. Some of these attributes are: CPU architecture (e.g. i80386, SPARC, etc.), CPU speed, amount of free physical memory currently available, average number of executable threads in the last few seconds, etc. After choosing such a host, it returns an exec capability for the Process Server on the selected host. This capability is then used to start the new process.
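To make the selection step concrete, the sketch below filters hosts by architecture and free memory and then ranks the survivors. The text does not state how the Run Server weighs its attributes, so the scoring rule used here (CPU speed divided by one plus the recent number of runnable threads) is purely an assumption, as are all class, field and function names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class HostStatus:
    """Attributes kept per host (names are hypothetical)."""
    name: str
    architecture: str
    cpu_speed: float
    free_memory_kb: int
    avg_runnable_threads: float


def select_host(hosts: List[HostStatus],
                memory_needed_kb: Dict[str, int]) -> Optional[Tuple[str, str]]:
    """Pick a host and architecture for a program.

    memory_needed_kb maps each architecture for which a process
    descriptor is available to the memory that descriptor requires.
    """
    candidates = [h for h in hosts
                  if h.architecture in memory_needed_kb
                  and h.free_memory_kb >= memory_needed_kb[h.architecture]]
    if not candidates:
        return None
    # Assumed policy: prefer fast, lightly loaded hosts.
    best = max(candidates,
               key=lambda h: h.cpu_speed / (1.0 + h.avg_runnable_threads))
    return best.name, best.architecture
```

In the real system the result would be an exec capability for the Process Server of the chosen host rather than a pair of names.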

When the Run Server returns the capability of the Process Server of the selected host, the shell then does an RPC with that server, asking it to effectively create the process. Figure 1 shows all steps involved in creating a process under Amoeba.

Figure 1: Amoeba's process creation

3 Reliable Processing in Seljuk-Amoeba

As seen in the previous section, Amoeba's processing service does not provide any support for fault tolerance. Redundancy is imperative for achieving fault tolerance; thus, for attaining a reliable processing service, it is necessary to replicate processes and execute them in different processors. The collection of processors that execute the replicated processes is normally referred to as a fail-controlled node [11].

The basic approach for implementing a fail-controlled node is to replicate the computation on a sufficiently large number of independent processors, which can fail in some less restrictive way (e.g. arbitrarily). Processors are driven by some synchronisation mechanism which guarantees that non-faulty processors will produce the same output stream. Outputs generated by replicated processors are then evaluated by a filtering mechanism, such as a comparator or a voter, which prevents incorrect values from appearing at the application level, ensuring the controlled failure semantics of the node.

Speirs et al. [12] and Brasileiro et al. [13] describe software implementations of fail-controlled nodes at the application level, which use the efficient replication management protocols presented in [11]. The Seljuk-Amoeba fault-tolerant processing service will provide both failure-masking and fail-silent processing services, by introducing the protocols studied in [11] at the micro-kernel level of the Amoeba distributed operating system. This is achieved by developing a new Run Server, named Fault Tolerance Run Server (FT Run Server), which is responsible for higher level replicated scheduling, and by introducing a new kernel level server, named the Node Server, which is responsible for managing the replicated nodes formed by the FT Run Server. Furthermore, Amoeba's communication layer must be modified to accommodate the intra-node communication introduced by the replica management protocols.

It is assumed that (non-replicated) distributed applications are composed of a number of processes that do not share memory, and interact only via messages. (This is the usual way Amoeba applications are structured.) In this paper we also assume that application processes present a deterministic behaviour. (In the next section we will discuss ways of dealing with replicated applications which incorporate non-deterministic behaviour; although the mechanisms presented are used at a higher level of process replication, they may also be used to handle non-determinism at a lower level of replication.) Thus, synchronisation among replicas is attained by simply ensuring that each application process replica receives the same stream of input messages in the same order, so that non-faulty replicas produce identical streams of output messages.

We consider failure-masking nodes and fail-silent nodes comprised of N processors, where N = 2p + 1 in the case of failure-masking nodes; N = p + 1 in the case of fail-silent nodes; and p (p > 0) is the node's resilience degree, i.e. the upper bound on the number of processors of a node that may fail. The restrictions N = 2p + 1 and N = p + 1 are only necessary to assure that the validation techniques work well. These techniques operate on the output messages produced by the replicas of the application. All valid messages produced by a node possess p + 1 signatures, thus ensuring that at least one non-faulty processor has participated in the validation process. We assume that mechanisms exist for generating and validating digital signatures, which provide an authentication facility with arbitrarily high probability [14].
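The arithmetic behind these node sizes can be written down directly. The sketch below uses hypothetical function names; the formulas themselves are those given above. With at most p faulty processors, any set of p + 1 distinct signers contains at least one non-faulty processor, and in a failure-masking node of 2p + 1 processors at least p + 1 non-faulty processors remain available to produce those signatures.

```python
def node_size(p: int, node_semantics: str) -> int:
    """Number of processors N for a node with resilience degree p."""
    if p <= 0:
        raise ValueError("resilience degree p must be positive")
    if node_semantics == "failure-masking":
        return 2 * p + 1
    if node_semantics == "fail-silent":
        return p + 1
    raise ValueError("unknown node semantics: " + node_semantics)


def signatures_required(p: int) -> int:
    """Signatures a message must carry to count as valid."""
    return p + 1


def min_non_faulty_signers(p: int) -> int:
    """Lower bound on non-faulty signers among p + 1 distinct signers
    when at most p processors of the node are faulty."""
    return signatures_required(p) - p
```

For example, a resilience degree of 2 calls for five processors in a failure-masking node and three in a fail-silent node, with three signatures per valid message in both cases.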

We also assume that the communication between any two processors is synchronous, i.e. there is a maximum bound d for message processing and transmission between any two non-faulty processors. (Catão and Brasileiro [15] show how this assumption can be achieved on an asynchronous network.)

3.1 The FT Run Server

Applications executed by the FT Run Server may choose, at activation time, the failure semantics of the node where they will be executed, as well as the effective failure semantics of the underlying processors. The FT Run Server provides the required failure semantics, in a transparent way, by replicating processing on enough processors and ensuring that a suitable validation function will be applied to the outputs generated by them. The choice of the number and types of processors to use in a node depends on the application's reliability requirements. It is worth noting that if the effective failure semantics of the underlying processor is at least as restrictive as the node's failure semantics, no replication is needed, and processing will be carried out in the usual way.

Besides tolerating physical faults, the nodes formed by the FT Run Server are also able to tolerate hardware and software design faults. Since Amoeba supports heterogeneous processors, it is possible to achieve fault tolerance at hardware design level by executing a process on a node formed by processors with different architectures. Further, if different versions of a program are available, it is also possible to tolerate software design faults by executing a different version of the process on each processor forming a node.

In order to allow software design fault tolerance, the executable program organisation under Amoeba was adapted. In its original version, Amoeba only deals with different versions of a program for different architectures. We have made changes to Amoeba's File Server organization so that a particular program may have several versions for each architecture. The different versions are independently designed to satisfy the specification of the program. Figure 2 shows the /bin organization after the introduction of these changes.

Figure 2: Seljuk-Amoeba's File System organization

In this example, the /bin directory contains two commands: dir and sort. The dir command is available only for the VAX architecture, whilst the sort command has three implementations available for the 80386 architecture. This arrangement allows application designers to ask the FT Run Server to execute their applications in replicated nodes, with each replica executing a different implementation of that program. Note that with the diversity of processor architectures and the availability of different versions for each replica executing on them, the degree of fault tolerance attained can be very high.

Now let us see how applications can take advantage of the service provided by the FT Run Server. Suppose a shell wishes to execute processes in fail-controlled nodes. Its first step is to do an RPC with the FT Run Server. In addition to all available process descriptors, a number of extra parameters must be provided. First, it is necessary to specify the failure semantics of the system's processors, over which fail-controlled nodes will be built. Further, since application programmers are able to choose between different failure semantics, one of the parameters provided to the FT Run Server is the failure semantics of the node where the application will run. The node resilience degree is also provided by the shell as a parameter. The FT Run Server uses it for deciding how many independent processors will form a node. Another parameter passed by the shell is a list which contains the identification of processors that the application does not want to use (this parameter may be used, for example, if an application designer thinks that a given processor is not reliable enough and does not want to submit any task to it, or to prevent processors from taking part in two or more nodes where processes involved in the same distributed application execute). Finally, a last parameter added to the RPC is a flag that indicates whether the application will be executed with design diversity or not. Figure 3 shows a process creation scenario using the FT Run Server.

Figure 3: Seljuk-Amoeba's replicated process creation

After doing an RPC with the FT Run Server, the shell stays blocked waiting for a reply. The FT Run Server returns to the shell a descriptor for the created node. This node descriptor contains the following information: a node identifier, a list containing all processors in this node (the first one being the node co-ordinator) together with the corresponding process descriptors that have been chosen for each processor, and a capability for the node. Then, the shell does an RPC with the Node Server of the co-ordinator processor, asking it for process creation (see Figure 3).
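The exchange between the shell and the FT Run Server can be sketched as a request carrying the extra parameters and a reply carrying the node descriptor. This is an illustrative sketch only: the class and field names, the first-fit choice of processors, and the placeholder capability bytes are all assumptions, and the design diversity flag is carried in the request but not acted upon here.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass
class NodeRequest:
    """Extra parameters sent by the shell (hypothetical names)."""
    processor_semantics: str   # assumed failure semantics of the processors
    node_semantics: str        # "failure-masking" or "fail-silent"
    resilience: int            # resilience degree p
    excluded: FrozenSet[str]   # processors the application refuses to use
    diversity: bool            # design diversity flag (unused in this sketch)


@dataclass
class NodeDescriptor:
    node_id: int
    # (processor, process descriptor) pairs; the first is the co-ordinator.
    members: List[Tuple[str, str]]
    capability: bytes


def form_node(req: NodeRequest, available: List[Tuple[str, str]],
              node_id: int) -> NodeDescriptor:
    """Choose processors for a node, skipping the excluded ones."""
    if req.node_semantics == "failure-masking":
        n = 2 * req.resilience + 1
    elif req.node_semantics == "fail-silent":
        n = req.resilience + 1
    else:
        raise ValueError("unknown node semantics")
    usable = [(proc, pd) for proc, pd in available if proc not in req.excluded]
    if len(usable) < n:
        raise RuntimeError("not enough processors to form the node")
    # Placeholder capability; a real one would be issued by the system.
    return NodeDescriptor(node_id, usable[:n], capability=b"\x00" * 16)
```

The shell would then contact the Node Server on the first processor of `members`, which plays the co-ordinator role.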

The Node Server of the co-ordinator processor asks the local Process Server to create a process and sends the same request to all other processors of the node. The Node Servers of these processors, in turn, ask their respective Process Servers to create the replicas. The Node Server, as well as the Process Server, is implemented within Seljuk-Amoeba's micro-kernel.

3.2 Micro-kernel Level Processing Management

Nodes formed by the FT Run Server use the mechanisms described in [11] for replica control. Thus, in addition to the introduction of the FT Run Server, it is necessary to adjust Amoeba's micro-kernel to implement these protocols. The functionality of a replicated node can be implemented by the following micro-kernel threads that will execute on each processor forming the replicated node (in order to communicate, threads in the same processor use message passing over shared queues and lists, whilst threads in different processors use message passing over internal communication links):

Sender thread: this thread takes messages deposited into the Processed Message Queue (PMQ) that have been produced by the application processes of that processor, signs them and sends them to the other processors of the node for validation, i.e. voting in failure-masking nodes, and comparison in fail-silent nodes. Further, it deposits a copy of the messages into the Internal Candidate List (ICL).

Validator thread: the function of this thread depends on the type of the node. In failure-masking nodes, the Validator thread is a Voter thread. It compares authentic messages deposited into the External Candidate List (ECL), which have been signed and sent by other processors of the node, with their counterparts produced locally and deposited into the ICL by the Sender thread. If the comparison is not successful, the message is discarded. Otherwise, the message is countersigned, and if there are now p + 1 signatures, the message, termed a valid message, is handed over to the local Transmitter thread for network delivery to destination nodes. If there are fewer than p + 1 signatures, then the message is sent to the other processors of the node that have not signed the message yet. In fail-silent nodes the Validator thread is a Comparator thread. Its functioning is similar to that of the Voter thread, with only one difference: once a failure is detected, instead of simply discarding the received message, the Validator thread terminates its activities, and so does the Sender thread. This guarantees that the node will remain silent after a failure.

Transmitter thread: this thread is responsible for retrieving the messages carrying p + 1 signatures that have been deposited into the Validated Message Queue (VMQ) and sending them over the network to the destination nodes.

Receiver thread: this thread authenticates messages received from the network or from the internal links and discards any message which fails authentication or any duplicated message received. Authenticated and valid messages received from the network are deposited into the Received Message Queue (RMQ) for ordering, whilst authenticated messages received from other processors of the node, which carry fewer than p + 1 signatures, are deposited in the ECL for validation.

Order thread: this thread executes an order protocol with its counterparts in the other processors of the node. The function of the order protocol is to construct identical queues of valid messages received from the network for processing by the local application processes of all non-faulty processors of the node. Ordered messages are deposited into the appropriate Delivered Message Queue (DMQ).
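The Validator thread's decision rule, as described above, can be sketched as follows. The sketch makes several simplifying assumptions that are not in the original: signatures are modelled as a set of signer names rather than real digital signatures, the queues are plain Python lists, and a candidate that arrives before its local counterpart is dropped instead of waiting in the ECL. All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Candidate:
    msg_id: int
    payload: bytes
    signers: Set[str] = field(default_factory=set)


class Validator:
    """Voter (failure-masking) or Comparator (fail-silent) for one processor."""

    def __init__(self, me: str, p: int, fail_silent: bool) -> None:
        self.me = me
        self.p = p
        self.fail_silent = fail_silent
        self.icl: Dict[int, bytes] = {}      # locally produced messages
        self.vmq: List[Candidate] = []       # valid messages for the Transmitter
        self.to_peers: List[Candidate] = []  # still need more signatures
        self.halted = False

    def validate(self, ext: Candidate) -> None:
        """Process one authentic candidate received from another processor."""
        if self.halted or ext.msg_id not in self.icl:
            # Halted, or no local counterpart yet. This sketch drops the
            # candidate; the ECL described in the text would keep it.
            return
        if self.icl[ext.msg_id] != ext.payload:
            if self.fail_silent:
                self.halted = True  # Comparator: stop, the node stays silent
            return                  # Voter: simply discard the message
        ext.signers.add(self.me)    # countersign
        if len(ext.signers) >= self.p + 1:
            self.vmq.append(ext)
        else:
            self.to_peers.append(ext)
```

The only difference between the two node types in this sketch is the reaction to a mismatch, which mirrors the single difference stated in the text.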

Figure 4 shows how the threads, queues and lists discussed above, as well as the information stream between them, are logically merged within Seljuk-Amoeba's micro-kernel, in order to implement a reliable processing service.

Figure 4: View of a replica implementing the reliable processing service

When a message arrives at a network adapter of a processor, it is sent to the Receiver thread, which authenticates it, discarding duplicated or non-valid messages. Then, its destination must be considered, in order to decide whether the message is going to be ordered or not. If the message has been sent to a replicated process, it must be ordered before being relayed to the part of the kernel that implements message receiving in the higher level protocols (represented in Figure 4 by RPC_in/GC_in). In this way, all application replicas (Application) are going to receive the same messages in the same order. Otherwise the message must be immediately relayed to RPC_in/GC_in. Two other types of messages may be received by a processor: messages sent by the other Order threads (of the other processors of a replicated node), which are relayed to the local Order thread; and messages sent by the other Validator threads (of the other processors of a replicated node), which are delivered to the local Validator thread for validation.

Messages produced by the application processes (Application) are deposited into the PMQ. After this, the message is delivered to the Sender thread. If the message has been generated by a replicated process, it is deposited into the ICL and stays there waiting for a message in the ECL that matches it; a copy of the message is also signed and sent to the other processors of the node. Otherwise, the message is immediately relayed to the part of the kernel that implements message delivery in the higher level communication protocols (represented in Figure 4 by RPC_out/GC_out). After being processed by RPC_out/GC_out, the message is delivered to the Transmitter thread, which sends it to its destination.

4 Fault Tolerance Services in Seljuk-Amoeba

Three higher level fault tolerance services are provided by the Seljuk-Amoeba operating environment, namely: process replication, fault diagnosis and system reconfiguration. These services operate on software components. Software components can be replicated actively, passively or semi-actively. The fault diagnosis service is able to diagnose replicas of software components that have failed, whilst the system reconfiguration service can be used to create new replicas of software components and re-establish the resilience degree of the application.

Transparency is the key design principle for these services. Thus, whenever possible the services should be usable without the need to change legacy non-replicated applications. When this is not possible, the system must provide adequate support to facilitate the adaptation of existing applications and the construction of new ones.

In order to achieve the desired level of transparency, Seljuk-Amoeba's fault tolerance services rely on a set of special group management primitives which are not provided by Amoeba's standard GC service; thus we start this section by studying Amoeba's GC service and discussing the new features provided by Seljuk-Amoeba's GC service. After that we present the specification of Seljuk-Amoeba's higher level fault tolerance services.

The software servers that implement the fault tolerance services are assumed to execute over a fault-free processing service. Also, these services assume that the underlying processing service, over which software components execute, presents fail-silent semantics. Since Seljuk-Amoeba's lower level processing service offers both fail-silent and failure-masking services, these assumptions can be attained with arbitrarily high probability.

4.1 Group Communication in Seljuk-Amoeba

Amoeba offers GC primitives for atomically broadcasting messages to a group of processes in the presence of communication and processor failures. The GC primitives provide an abstraction that enables programmers to design applications consisting of one or more processes running on different machines of the distributed system. All members of a single group see all events concerning this group in the same order, including those events of a new member joining the group, a member leaving the group and the detection of a crashed member [16].

Amoeba's GC service provides primitives for both group management and group communication. The following management primitives are available: CreateGroup, JoinGroup, LeaveGroup, ResetGroup, and GetInfoGroup. The CreateGroup primitive creates a new empty group and automatically makes the caller a member of the group; the JoinGroup primitive makes the caller a member of a specified group, whilst the LeaveGroup primitive removes the caller from the specified group; the ResetGroup primitive initiates recovery after a processor failure; whilst the GetInfoGroup primitive returns state information about the group. There are also primitives for allowing communication within a group. The SendToGroup primitive atomically sends a message to all members of the destination group, whilst the ReceiveFromGroup primitive blocks until a message arrives from the specified group; the received message is then returned to the caller.

In the context of Seljuk-Amoeba's fault tolerance services, a group consists of one or more software component replicas that are co-operating to provide a single fault-tolerant service (a replicated server). A unique identifier (a port) is associated with each group. All replicas of a group listen to the same port. The port concept provides the desired transparency to a process which is requesting the service provided by the replicated server (a client).

However, group structure in Amoeba is closed, i.e. only processes belonging to the group can send messages to the group. Therefore, the usual way for a client to access a service provided by a group is to do an RPC with one of its members and rely on that member to use the SendToGroup primitive to disseminate the request within the group. Using Amoeba's group communication, based on closed groups, the client-server interaction will happen in the following way: the client performs an RPC, indicating the group port; the RPC layer then looks up its tables to find out which servers are listening to that port; if there is more than one server, the RPC layer chooses one of them at random to receive the client's request. The replica of the server that receives the request disseminates the request within the group by calling SendToGroup. Clients need not be aware of whether a service is implemented by replicated servers or by a single server. On the other hand, our aim of obtaining transparency at the server side as well is not attained, since servers need to be aware of the existence of the group.
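The closed-group interaction just described can be simulated in a few lines. This is a toy model, not Amoeba's API: the classes `ClosedGroup` and `Server`, the `rpc` function, and the in-memory inboxes are hypothetical, and atomicity and total ordering of SendToGroup are not modelled.

```python
import random
from typing import Dict, List


class ClosedGroup:
    """Only members may send to the group."""

    def __init__(self, port: str) -> None:
        self.port = port
        self.members: List["Server"] = []

    def send_to_group(self, sender: "Server", message: str) -> None:
        if sender not in self.members:
            raise PermissionError("closed group: sender is not a member")
        for member in self.members:
            member.inbox.append(message)


class Server:
    """A replica that is aware of its group (no server-side transparency)."""

    def __init__(self, name: str, group: ClosedGroup) -> None:
        self.name = name
        self.group = group
        self.inbox: List[str] = []
        group.members.append(self)

    def handle_rpc(self, request: str) -> None:
        # The receiving replica disseminates the request within the group.
        self.group.send_to_group(self, request)


def rpc(port_table: Dict[str, List[Server]], port: str, request: str) -> str:
    """Client side: the RPC layer picks one listener on the port at random."""
    chosen = random.choice(port_table[port])
    chosen.handle_rpc(request)
    return chosen.name
```

The client only names the port, so it cannot tell one server from several; the server, in contrast, must call `send_to_group` itself, which is exactly the lack of server-side transparency noted above.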

Transparency at the server side is attained by the provision of three new GC primitives, namely: CreateRepGroup, JoinRepGroup and ResetRepGroup. Unlike Amoeba's original GC primitives, these primitives are called by a process that is not a member of the group. In fact, these primitives are usually called by Seljuk-Amoeba's fault tolerance servers.

The CreateRepGroup primitive creates a group of servers. It receives two parameters: rep-type, which defines the kind of replication to be used within the group and can take the values active, passive, and semi-active; and node-list, which indicates the nodes on which the various replicas have been allocated. CreateRepGroup returns a group descriptor that identifies the group in subsequent calls. This primitive sets up the micro-kernel tables of all member nodes with the appropriate information, so as to allow messages sent to and received by the group to be disseminated in a transparent way.

The JoinRepGroup primitive adds a new replica to an existing group. It also receives two parameters: group-id, which indicates the group to which the new replica will be added; and node, which indicates the node on which the new replica has been allocated. The ResetRepGroup primitive, on the other hand, removes faulty replicas from a group. It receives the group-id and node-list parameters, which indicate the group and the replicas that should be removed, respectively. These primitives are normally used for group reconfiguration purposes.

The use of Seljuk-Amoeba's GC primitives is illustrated in the following sections.

4.2 Replication Service

The replication service in Seljuk-Amoeba is provided by the Replicator Server. To create a replicated software component, named a resilient process, a process (typically a shell) invokes one of the services provided by the Replicator Server. The generic format of such a service is presented below:

Replicate (rep-type, program, diversity, resilience, semantics, reconfiguration)

Replicate is a stub procedure that is responsible for starting an RPC with the Replicator Server to request the creation of a resilient process. The rep-type parameter indicates which kind of replication is going to be used. The program parameter is a capability for a file-like object which contains the code of the software component to be executed; the diversity parameter indicates whether design diversity should be used (in order to tolerate design faults); the resilience degree is defined by the resilience parameter, whilst the assumption about the actual failure semantics of the system's processors is indicated by the semantics parameter; finally, the reconfiguration parameter indicates whether the environment must carry out automatic reconfiguration when faulty replicas are diagnosed.

When the Replicator Server receives such a request, it first discovers all the architectures and versions that are available for program. Next, the Replicator Server makes resilience+1 requests to the FT Run Server, allocating the appropriate number of fail-silent nodes to execute program. In each request made, the Replicator Server passes the following parameters to the FT Run Server: a list with all process descriptors available for program, the actual failure semantics of the processors (the semantics parameter received by the Replicator Server), the failure semantics required (i.e. fail-silent), the number of processors to form a fail-silent node (1 if semantics is at least as restrictive as fail-silent, or 2 otherwise), a list with the processors that have already been allocated in the previous requests, and the diversity flag.

If all nodes are successfully created, the Replicator Server can continue; otherwise, an error is returned to the process that has initiated the RPC with the Replicator Server. In the case where all nodes are successfully created, it is necessary to create a group formed by all process replicas just created. This group is created by invoking the CreateRepGroup primitive providing the node descriptors that have been returned by the various calls to the FT Run Server and the replication type as parameters.
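
The allocation step can be sketched as follows. This is a simplified illustration, not the Replicator Server's actual logic: the RESTRICTIVENESS ranking, the function names and the first-fit choice of processors are assumptions, and the FT Run Server is reduced to picking processors that were not handed out in previous requests.

```python
# Hypothetical ranking: a higher value means more restrictive semantics.
RESTRICTIVENESS = {"fail-arbitrary": 0, "fail-silent": 1}


def processors_per_node(semantics):
    """1 if the processors are at least fail-silent, 2 otherwise."""
    if RESTRICTIVENESS[semantics] >= RESTRICTIVENESS["fail-silent"]:
        return 1
    return 2


def plan_allocation(resilience, semantics, free_processors):
    """Make resilience+1 requests, each yielding one fail-silent node."""
    per_node = processors_per_node(semantics)
    allocated = []  # processors already allocated in previous requests
    nodes = []
    for _ in range(resilience + 1):
        candidates = [p for p in free_processors if p not in allocated]
        if len(candidates) < per_node:
            # Corresponds to the error returned to the RPC initiator.
            raise RuntimeError("not enough processors to create all nodes")
        node = candidates[:per_node]
        allocated.extend(node)
        nodes.append(node)
    return nodes
```

For example, a resilience degree of 1 over fail-arbitrary processors needs two nodes of two processors each, whereas the same degree over fail-silent processors needs only two processors in total.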

The CreateRepGroup call initialises the appropriate tables in the micro-kernel of all processors forming the nodes used for replicating the user process, such that all messages sent to any replica in the group will be received by all replicas; the order in which messages are received will depend on the type of replication used; further, output messages from replicas are treated by a filtering function that will also depend on the replication type.

Finally, a configuration object is created to store information about the replicated server. This includes the list of node descriptors used and all parameters received in the service call. As we will show later, the information contained in the configuration object is used by the fault diagnosis and recovery services.

In the above discussion we have used the Replicate service as a generic name to represent the various services offered by the Replicator Server. In fact, instead of offering a generic replication service, the Replicator Server offers one service for each type of replication it supports. This obviates the need to indicate the replication type via a parameter; also, parameters that are specific to a particular type of replication can be introduced in a more appropriate way. In the next three sub-sections we discuss each of these services.

4.2.1 Active Replication Service

Two requirements must be met to actively replicate software components on fail-silent nodes: i) software components must have a deterministic behaviour; and ii) all replicas must receive the same input messages in the same order. In Seljuk-Amoeba, the first requirement is guaranteed by allowing only deterministic processes to be actively replicated; software components with non-deterministic behaviour can be replicated passively, or semi-actively, instead. As discussed previously, the second requirement is transparently tackled by Seljuk-Amoeba's GC layer.

When active replication is carried out over fail-silent nodes, there are two approaches to dealing with output messages: i) choose one replica to send the messages produced; or ii) let all replicas send the messages produced. The first approach has the advantage of reducing the traffic over the network. However, when the replica in charge of sending output messages fails, output messages may experience a large delay, until the faulty replica is detected and a new one takes over. The second approach eliminates this recovery delay at the expense of generating extra traffic. Applications using Seljuk-Amoeba's active replication service can choose at activation time which approach to use.

The Replicator Server provides active replication services via the following primitive:

ActiveReplicate (program, diversity, resilience, semantics, reconfiguration, continuity)

The service offered by the Replicator Server for requesting active replication has an extra parameter (continuity). It indicates whether the service continuity degree must be maximised or not. In the former case, the group communication layer of every replica sends replies to the client; in the latter case, the group communication layer of only one replica sends replies to the client.
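
The effect of the continuity parameter on output messages can be pictured as a filter over the replies produced by the replicas. The sketch below is only an illustration: the function name, the dictionary of replies and the rule that the replica with the smallest identifier is the designated sender are assumptions, not details given by the design.

```python
def replies_to_send(replica_replies, continuity, designated=None):
    """Return the (replica, reply) pairs that actually leave the group.

    replica_replies maps a replica identifier to the reply it produced.
    """
    if not replica_replies:
        return []
    if continuity:
        # Continuity maximised: every replica sends its reply.
        return sorted(replica_replies.items())
    if designated is None:
        # Assumption: the smallest identifier is the designated sender.
        designated = min(replica_replies)
    return [(designated, replica_replies[designated])]
```

With continuity maximised the client receives one reply per replica, which is the extra traffic that buys the elimination of the recovery delay discussed above.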

In Seljuk-Amoeba, any distributed application that communicates only by message passing and presents a deterministic behaviour may be transparently replicated by the active replication service of the Replicator Server. No modifications to the source code are necessary. In fact, the application does not even need to be recompiled.

4.2.2 Passive Replication Service

Passive replication involves the provision of code to be executed when the replica is the primary, i.e. the replica is carrying out the processing and sending checkpoints to the backup replicas; and code to be executed when the replica is a backup, i.e. the replica is just collecting checkpoints generated by the primary replica. A simple way to implement a process to be passively replicated is to divide the process functionality into two threads: one to execute in the active mode (primary functionality) and another to execute in the passive mode (backup functionality).

The first action of the active thread is to verify whether the replica is currently a primary or a backup. If the replica is a backup, it must block until it eventually becomes a primary. The fault tolerance library provided by Seljuk-Amoeba offers the function OnPrimaryUnblock, which can be used for this purpose. A thread calling this function will block if its replica is not a primary; otherwise it will continue its execution. If a process ever becomes a primary, any thread blocked on the OnPrimaryUnblock function will be unblocked. After returning from the OnPrimaryUnblock call, the active thread must check whether there is any state to be restored and, if this is the case, carry out the restoration actions required (e.g. process the last checkpoint received from the previous primary replica). Next, the thread starts to process client requests and send the respective replies.

From time to time, the active thread of the primary replica must checkpoint its local state with the passive threads of the backup replicas. Seljuk-Amoeba's fault tolerance library offers a Checkpoint function that can be used to disseminate checkpoints. The only function of the passive thread of the backup replicas is to receive the checkpoints issued by the active thread of the primary replica. This task can be simplified by using the GetCheckpoint function, also provided by Seljuk-Amoeba's fault tolerance library. Both Checkpoint and GetCheckpoint receive application-dependent functions as parameters; these functions are responsible for packing the checkpoint messages sent by the primary replica through Checkpoint and for unpacking those received by the backup replicas through GetCheckpoint, respectively.

Whilst the active threads block in the backup replicas, the passive thread blocks in the primary replica, since it does not receive any checkpoint messages. This guarantees that each replica operates in only one mode (active or passive). Client requests are received by the communication layer of all replicas, but they are delivered only to the primary replica. Input messages received by the communication layer of the backup replicas are buffered until a checkpoint is received; the checkpoint contains an indication of which messages have been processed by the primary replica, and therefore can be discarded. Also, only one output message is generated, since there is only one active thread executing at each time.
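
The two-thread structure described above can be sketched with standard Python threading primitives. This is a minimal illustration under several assumptions: the Replica class and its method names merely imitate OnPrimaryUnblock, Checkpoint and GetCheckpoint, checkpoints travel over an in-process queue rather than the group communication layer, and the passive thread of the backup is played by the main thread to keep the example short.

```python
import queue
import threading


class Replica:
    def __init__(self, name, is_primary=False):
        self.name = name
        self.state = 0
        self.checkpoints = queue.Queue()
        self._primary = threading.Event()
        if is_primary:
            self._primary.set()

    def on_primary_unblock(self):
        self._primary.wait()  # blocks while the replica is a backup

    def promote(self):
        self._primary.set()  # unblocks any thread waiting above

    def checkpoint(self, backups, pack):
        message = pack(self.state)
        for backup in backups:
            backup.checkpoints.put(message)

    def get_checkpoint(self, unpack):
        self.state = unpack(self.checkpoints.get())


def run_demo():
    primary, backup = Replica("r1", is_primary=True), Replica("r2")
    log = []

    def active(replica, requests, backups):
        replica.on_primary_unblock()
        for amount in requests:
            replica.state += amount
            replica.checkpoint(backups, pack=str)
        log.append((replica.name, replica.state))

    waiting = threading.Thread(target=active, args=(backup, [10], []))
    waiting.start()                      # blocks: r2 is a backup
    active(primary, [1, 2], [backup])    # r1 serves two requests
    backup.get_checkpoint(unpack=int)    # passive role: state becomes 1
    backup.get_checkpoint(unpack=int)    # passive role: state becomes 3
    backup.promote()                     # r1 is assumed to have failed
    waiting.join(timeout=5)
    return log
```

In this run the backup resumes from the last checkpointed state of the failed primary before serving its own request, which is the restoration step performed after returning from OnPrimaryUnblock.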

When a faulty primary replica is diagnosed by the diagnosis service (see Section 4.3), the micro-kernel elects a new primary. The active thread of the elected replica is unblocked and this replica starts to receive requests from the clients and to provide the respective replies. Seljuk-Amoeba provides a standard function to elect a new primary replica: the replica that has the smallest identifier inside the group is elected. However, the service is flexible in the sense that programmers can supply their own election function. In this case, the function is passed as a parameter in the invocation of the service. The following service provides passive replication in Seljuk-Amoeba:

PassiveReplicate (program, diversity, resilience, semantics, reconfiguration, pack, unpack, election)

The pack, unpack, and election parameters define application-dependent functions for, respectively, packing checkpoints into messages that can be sent to backup replicas through the Checkpoint function, unpacking checkpoint messages received by the backup replicas through the GetCheckpoint function, and electing a new primary replica after the failure of the current primary.
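
A pluggable election function of this kind might look like the sketch below. The default mirrors the standard rule stated above (smallest identifier wins); the function names and the representation of replicas as plain identifiers are assumptions made for the example.

```python
def elect_smallest_id(replica_ids):
    """Standard election rule: the smallest identifier in the group wins."""
    return min(replica_ids)


def elect_primary(replica_ids, faulty, election=elect_smallest_id):
    """Elect a new primary among the replicas not diagnosed as faulty."""
    survivors = [r for r in replica_ids if r not in faulty]
    if not survivors:
        raise RuntimeError("no correct replica left in the group")
    return election(survivors)
```

For instance, elect_primary([3, 1, 2], faulty={1}) applies the standard rule to the survivors, whilst passing election=max would model an application that prefers the largest identifier.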

4.2.3 Semi-Active Replication Service

Semi-active replication tries to combine the fast fault recovery of active replication schemes with the ability of passive replication schemes to replicate applications with non-deterministic behaviour. One possibility is to have all replicas process the same stream of input messages, with a single replica, usually named the leader, executing non-deterministic operations and passing the results to the other replicas, usually named followers. In case of leader failure, an election must be carried out to choose a new leader. As is the case in active replication, semi-actively replicated software components can follow two approaches in disseminating output messages: i) only the leader sends output messages; and ii) all replicas send output messages.

Seljuk-Amoeba's fault tolerance library offers three functions to support the co-operation of leader and follower replicas when dealing with non-deterministic behaviour of software components. The WhatsMyRole function returns the current role of the caller, i.e. whether the caller is the leader or a follower; the Notify function can be used to disseminate a message to all follower replicas, whilst the GetNotify function can be used to receive notification messages from a leader replica.

Using these functions, non-deterministic behaviour can be dealt with in the following way: whenever a non-deterministic computation must be executed, all replicas call the function WhatsMyRole; if the replica is the leader, it executes the non-deterministic computation and passes the result to the other replicas of the group through a call to Notify. If the replica is a follower, instead of executing the computation, it uses the result calculated by the leader, which it obtains through a call to GetNotify.
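
The leader/follower pattern just described can be sketched as follows. The code is a single-threaded illustration in which the leader is assumed to run first; in a real system GetNotify would block until the notification arrives. The SemiActiveGroup class, the choice of the smallest identifier as leader and the per-follower mailboxes are assumptions introduced for the example.

```python
import random
from collections import deque


class SemiActiveGroup:
    def __init__(self, replica_ids):
        self.leader = min(replica_ids)  # assumption: smallest id leads
        self.mailboxes = {r: deque() for r in replica_ids if r != self.leader}

    def whats_my_role(self, replica_id):
        return "leader" if replica_id == self.leader else "follower"

    def notify(self, value):
        # Disseminate the leader's result to every follower.
        for mailbox in self.mailboxes.values():
            mailbox.append(value)

    def get_notify(self, replica_id):
        return self.mailboxes[replica_id].popleft()


def nondeterministic_step(group, replica_id, compute):
    """Run `compute` on the leader only; followers reuse its result."""
    if group.whats_my_role(replica_id) == "leader":
        result = compute()
        group.notify(result)
        return result
    return group.get_notify(replica_id)
```

Calling nondeterministic_step for replicas 1, 2 and 3 with random.random as the computation yields the same value on all three, even though the computation itself is non-deterministic.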

Semi-active replication is provided by the following service (the continuity and election parameters have the same functionality discussed in the previous sub-sections):

SemiActiveReplicate (program, diversity, resilience, semantics, reconfiguration, continuity, election)

4.3 Fault Diagnosis Service

Diagnosis services are provided in Seljuk-Amoeba by the Diagnoser Server. It continually monitors the operation of the replicas of a replicated software component, verifying whether the node where they are executing has failed or not. It also checks whether the individual replica processes are alive. The Replicator Server is responsible for issuing diagnosis requests to the Diagnoser Server on behalf of every replicated software component that has requested a replication service and has set the reconfiguration parameter to true. Requests are sent to the Diagnoser Server through a call to the following service (where configuration is the configuration object associated with the replicated software component):

Diagnose (configuration)

After determining the system state and discovering all faulty replicas and nodes, the Diagnoser Server takes actions to reorganise groups with faulty replicas. This is achieved by calling the primitive ResetRepGroup with appropriate parameters. This call removes faulty members from the group, and updates the configuration object of this group in each node where there is a correct replica executing. Furthermore, if the replication type is passive or semi-active and the primary/leader replica has been diagnosed as being faulty, then the ResetRepGroup call leads the corresponding micro-kernel to execute the election protocol and take the necessary recovery actions discussed in sub-sections 4.2.2 and 4.2.3. For example, in the case of passive replication, any thread blocked on a call to the OnPrimaryUnblock function must be unblocked if it is executing on the new primary node.
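
The diagnosis and reorganisation steps can be summarised in a short sketch. It assumes that the configuration object is a plain dictionary with the hypothetical keys rep_type, nodes and primary, that liveness is reported by a caller-supplied predicate, and that the standard smallest-identifier election is used.

```python
def reset_rep_group(config, faulty_nodes):
    """Remove faulty members and, if needed, elect a new primary/leader."""
    faulty = set(faulty_nodes)
    config["nodes"] = [n for n in config["nodes"] if n not in faulty]
    needs_election = (
        config["rep_type"] in ("passive", "semi-active")
        and config.get("primary") in faulty
    )
    if needs_election:
        # Assumption: standard election, smallest identifier wins.
        config["primary"] = min(config["nodes"]) if config["nodes"] else None
    return config


def diagnose(config, is_alive):
    """Find the faulty nodes of a group and reorganise it if necessary."""
    faulty = [n for n in config["nodes"] if not is_alive(n)]
    if faulty:
        reset_rep_group(config, faulty)
    return faulty
```

An actively replicated group never triggers the election branch, since it has no distinguished primary or leader.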

4.4 System Reconfiguration Service

Whenever a replica (or the node where it is executing) fails and the ResetRepGroup primitive is called, the number of group members decreases and, consequently, the capacity to tolerate new faults decreases too. In order to maintain the required resilience degree of applications throughout their mission time, reconfiguration actions must be carried out. Reconfiguration implies that new nodes must be allocated and new replicas of the resilient process must be started in these nodes.

This activity is performed by the Replicator Server. After all recovery procedures have finished, the Diagnoser Server calls the following service of the Replicator Server:

RestoreReplicate (configuration)

The Replicator Server then consults the information about the group identified by the specified configuration object to determine how many new members must join the group. It must then create the new replicas and join them to the group.
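
Since a group is created with resilience+1 replicas (Section 4.2), the number of members to be added follows directly from the configuration object. The sketch below assumes a dictionary-based configuration with the hypothetical keys resilience and nodes, and a caller-supplied allocate_node function standing in for the request to the FT Run Server and the subsequent JoinRepGroup call.

```python
def replicas_missing(config):
    """Number of replicas needed to restore the required resilience."""
    return max(0, config["resilience"] + 1 - len(config["nodes"]))


def restore_replicate(config, allocate_node):
    """Allocate new nodes and join them until the group is complete."""
    for _ in range(replicas_missing(config)):
        # allocate_node receives the nodes in use so it can avoid them.
        new_node = allocate_node(config["nodes"])
        config["nodes"].append(new_node)
    return config
```

For a group with resilience 2 that has been reduced to a single correct replica, two new nodes are allocated and joined.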

Note that, in order to provide their service correctly, new replicas have to update their state and synchronise themselves with the old replicas before accepting requests from clients. This is normally an application-dependent procedure, and has to be introduced into the application's original code. (Previously we have shown how the Checkpoint and GetCheckpoint functions can be used to facilitate this procedure.) Note that this is only necessary when replicas need to restore their state; a significant number of distributed applications are based on stateless servers and, for these applications, reconfiguration can be achieved transparently.

4.5 Summary of Fault Tolerance Services

As presented in the previous sub-sections, Seljuk-Amoeba supports fault tolerance services in two ways: it offers two specialised servers that accept requests for replication, diagnosis, and system reconfiguration services; and it also offers functions, collected in a fault tolerance library, that help in the implementation of dependable applications. Tables 1 and 2 summarise the primitives and functions available.

| Primitive | Server | Description |
|---|---|---|
| ActiveReplicate | Replicator | Replicates deterministic software components actively |
| PassiveReplicate | Replicator | Replicates software components passively |
| SemiActiveReplicate | Replicator | Replicates software components semi-actively |
| Diagnose | Diagnoser | Monitors the execution of software components |
| RestoreReplicate | Replicator | Reconfigures replicated applications |

Table 1: Seljuk-Amoeba's fault tolerance primitives

| Function | Description |
|---|---|
| OnPrimaryUnblock | If the caller is not a primary replica it will block. If a replica process ever becomes a primary, any blocked thread will be unblocked. |
| Checkpoint | Used by primary replicas to generate checkpoints. |
| GetCheckpoint | Used by backup replicas to receive and deal with checkpoints sent by primary replicas. |
| WhatsMyRole | Returns to the caller an indication of its role, i.e. if the caller is a leader or a follower replica. |
| Notify | Used by leader replicas to send the results of non-deterministic computations. |
| GetNotify | Used by follower replicas to receive the results of non-deterministic computations. |

Table 2: Functions of Seljuk-Amoeba's fault tolerance library

5 Related Work

Most fault-tolerant architectures reported in the literature are ad-hoc solutions to specific problems (see [17], for example). Others, like the Sequoia [18] and the Stratus [19] systems, are suitable for a broader range of applications (on-line transaction processing, in the aforementioned cases). However, in both cases, some problems arise: firstly, their reliable processing service is heavily dependent on a proprietary hardware architecture; secondly, applications with no dependability requirements have to pay, normally with a reduction in their performance, for unwanted reliable services; finally, technological advances (e.g. faster processors) can only be incorporated into these systems after substantial redesign has been carried out.

The Delta-4 architecture [20] is one of the first general purpose fault-tolerant architectures proposed in the literature. Its approach is to offer an open dependable distributed computing system that attempts to use as much as possible off-the-shelf components, accommodating heterogeneity of the underlying hardware and software, and providing application portability across many platforms. The architecture offers specifiable dependability levels and provides support for replication of software components, fault diagnosis and system reconfiguration.

The Seljuk model follows many ideas put forward by the Delta-4 project, but applies them to a different setting. Its approach is to enhance the services provided by a distributed operating system (Amoeba, in the case of Seljuk-Amoeba) with the provision of reliable processing and fault tolerance services. Further, unlike Delta-4, which requires special hardware to provide the fail-silent semantics required for the execution of communication software and replica co-ordination entities, Seljuk-Amoeba incorporates the necessary replica management protocols to provide reliable processing using fail-arbitrary off-the-shelf processors.

The distributed operating system ROSE [6] follows Seljuk-Amoeba's philosophy of offering fault tolerance services at the operating system level. It provides some abstractions that facilitate the construction of reliable applications. A failure detection server allows processor and communication failures to be detected. ROSE also provides a Replicated Address Space object abstraction that makes portions of an address space highly available through replication. Based on such objects, a Resilient Process abstraction is provided to application processes transparently. However, ROSE presents two restrictions when compared to Seljuk-Amoeba: i) it assumes that the system's underlying components have fail-silent semantics; and ii) it tolerates only physical faults.

Huang and Kintala [2] discuss three technologies to increase application-level fault tolerance, based on reusable software components. These technologies can be used in addition to design diversity techniques such as N-version programming [21] and recovery blocks [22] to tolerate software design faults. The first technology is a daemon process able to detect process and processor failures, recover processes and reconfigure the system. The second technology is a user-level library of C functions that can be used in the development of application programs to specify and checkpoint critical data, log events, locate and reconnect a server, do exception handling, do N-version programming and use recovery block techniques. Finally, a multi-dimensional file system allows users to specify and replicate critical files on backup file systems. Like ROSE, these technologies assume fail-silent semantics for the system's components. Furthermore, programming effort is still considerable. Seljuk-Amoeba aims to reduce these disadvantages by allowing fault tolerance mechanisms to be used in a much more transparent way.

6 Conclusions

The collection of fault tolerance services of the Seljuk-Amoeba operating environment forms an important tool for the construction of robust distributed applications. The reliable processing service reduces application development complexity by allowing programmers to assume that the node failure semantics is more restrictive than that provided by the underlying off-the-shelf processors. Seljuk-Amoeba's micro-kernel is in charge of managing the redundant processors needed to ensure the node failure semantics assumed. Further, the Replicator and Diagnoser servers, together with Seljuk-Amoeba's fault tolerance library, provide very simple ways of enhancing legacy software with fault tolerance properties or building new dependable distributed applications, as well as controlling their execution.

Seljuk-Amoeba's services offer flexibility to applications by allowing them to choose their failure semantics at activation time. In this way, if the dependability requirements of the application change, it is possible to provide the required service without recompiling the application, since the reliable processing service is implemented at the level of Seljuk-Amoeba's micro-kernel.

Finally, another advantage of the services provided is that their cost (in terms of performance reduction) is paid only by those applications with dependability requirements. Note that when the failure semantics of the available processors is at least as restrictive as the failure semantics required by applications, the processing service of the Seljuk-Amoeba system behaves exactly as the original processing service of the Amoeba system, and does not impose any performance reduction on applications.

We are currently developing an implementation of the system. We intend to construct dependable distributed applications using different combinations of the various fault tolerance services provided by the system, in order to better assess the performance reductions imposed by them.

References

[1] F. Cristian. Understanding Fault-Tolerant Distributed Systems. Communications of the ACM, 34(2):56-78, 1991.
[2] Y. Huang and C. Kintala. Software Implemented Fault Tolerance: Technologies and Experience. In Digest of Papers, FTCS-23, Toulouse, France, pp. 2-9, 1993.
[3] K.P. Birman and T.A. Joseph. Reliable Communication in the Presence of Failures. ACM Transactions on Computer Systems, 5(1):47-76, 1987.
[4] S.K. Shrivastava, G.N. Dixon, and G.D. Parrington. An Overview of the Arjuna: A Programming System for Reliable Distributed Computing. IEEE Software, 8(1):63-73, 1991.
[5] P. Maes. Concepts and Experiments in Computational Reflection. In Proceedings of OOPSLA'87, ACM SIGPLAN Notices, 22(12):147-155, 1987.
[6] P. Ng. The Design and Implementation of a Reliable Distributed Operating System - ROSE. In Proceedings of the 9th International Symposium on Reliable Distributed Systems, pp. 2-11, Huntsville, USA, 1990.
[7] S.J. Mullender, G. van Rossum, A.S. Tanenbaum, R. van Renesse, and H. van Staveren. Amoeba: A Distributed Operating System for the 1990s. IEEE Computer, 23(5):44-53, 1990.
[8] F.V. Brasileiro. Seljuk: Um Ambiente para Suporte ao Desenvolvimento e à Execução de Aplicações Distribuídas Robustas (in Portuguese). In Proceedings of VII Simpósio de Computadores Tolerantes a Falhas, Campina Grande, Brazil, pp. 45-59, 1997.
[9] M.F. Kaashoek, R. van Renesse, H. van Staveren, and A.S. Tanenbaum. FLIP: An Internetwork Protocol for Supporting Distributed Systems. ACM Transactions on Computer Systems, 11(2):73-106, 1993.
[10] L.F.G. Soares, G. Lemos, and S. Colcher. Das LANs, MANs e WANs às Redes ATM (in Portuguese), 2nd edition, Ed. Campus, 1995.
[11] F.V. Brasileiro. Constructing Fail-Controlled Nodes for Distributed Systems: A Software Approach. Ph.D. Thesis, University of Newcastle upon Tyne, May 1995.
[12] N.A. Speirs, S. Tao, F.V. Brasileiro, P.D. Ezhilchelvan, and S.K. Shrivastava. The Design and Implementation of VOLTAN Fault-Tolerant Nodes for Distributed Systems. Transputer Communications, 1(2):93-109, 1993.
[13] F.V. Brasileiro, P.D. Ezhilchelvan, S.K. Shrivastava, N.A. Speirs, and S. Tao. Efficient Protocols for Fail-Silent Nodes in Distributed Systems. IEEE Transactions on Computers, 45(11):1226-1238, 1996.
[14] R.L. Rivest, A. Shamir, and L. Adleman. A Method of Obtaining Digital Signatures and Public-key Cryptosystems. Communications of the ACM, 21(2):120-126, 1978.
[15] V.S. Catão and F.V. Brasileiro. Serviço de Comunicação Síncrona para Nodos Replicados (in Portuguese). In Proceedings of VII Simpósio de Computadores Tolerantes a Falhas, Campina Grande, Brazil, pp. 305-319, 1997.
[16] M.F. Kaashoek, A.S. Tanenbaum, and K. Verstoep. Group Communication in Amoeba and its Applications. Distributed Systems Engineering Journal, 1:48-58, 1993.
[17] R.E. Harper, J.H. Lala, and J.J. Deyst. Fault Tolerant Processor Architecture Overview. In Digest of Papers, FTCS-18, Tokyo, Japan, pp. 252-257, 1988.
[18] P.A. Bernstein. Sequoia: A Fault-Tolerant Tightly Coupled Multiprocessor for Transaction Processing. IEEE Computer, 21(2):37-45, 1988.
[19] S. Webber and J. Beirne. The Stratus Architecture. In Digest of Papers, FTCS-21, Montréal, Canada, pp. 79-85, 1991.
[20] D. Powell (Ed.). Delta-4 - A Generic Architecture for Dependable Distributed Computing. Springer-Verlag, 1992.
[21] A. Avizienis. The N-Version Approach to Fault Tolerant Software. IEEE Transactions on Software Engineering, 11(12):1491-1501, 1985.
[22] B. Randell. System Structure for Software Fault Tolerance. IEEE Transactions on Software Engineering, 1(2):220-232, 1975.

Publication Dates

Publication in this collection
04 Feb 1999
Date of issue
Nov 1997

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.