Survivability in Object Services Architectures

David L. Wells and David E. Langworthy

Object Services and Consulting, Inc.

A "survivable" application can continue to function despite the loss or degradation of some of its components, will maintain its functionality and performance for as long as possible, and will degrade gracefully when this is no longer possible. Survivability relies on redundancy to allow normal operations to continue as long as possible, the ability to reconfigure to correct problems, and policies defining acceptable (but less desirable) functionality or performance should it prove impossible to maintain the desired behavior.

We are developing an architecture and Survivability Service to make OSA-based distributed systems far more survivable in the face of component failure and degradation than is currently possible. The architecture unifies a number of existing robustness mechanisms and adds several new ones to provide a variety of tools that can be applied in different situations. Because of the complexity of system-wide survivability, it is impossible to have a "master plan" for assuring survivability. Instead, we use market mechanisms to create global survivability as an emergent behavior resulting from a large number of small, local decisions.

Our approach preserves the simplicity of OSA application development, which has been largely responsible for the popularity of OSAs, by not requiring individual applications or services to handle the details of their own survivability. This is necessary for three reasons: survivability is difficult to program, so its development costs should be amortized across many applications; the survivability needs of different applications or services often conflict; and survivability requires more accurate knowledge of the eventual deployment environment(s) than is reasonable to expect at development time. To achieve this, we make survivability orthogonal to conventional OSA application semantics; in other words, survivability is "added" to an application rather than built into it from the start. This is done by a "Survivability Service" that handles the survivability needs of applications collectively, responding to changes in workload, resource requirements, resource availability, and threats based on a number of environment models that can be specified independently. A consequence of making survivability orthogonal to application functionality is that changing the models (not the applications or services) allows applications to be deployed into dynamically changing or unanticipated environments. This approach also supports the use of COTS and GOTS components that were not constructed for survivability.

The key to constructing survivable systems is to configure them in such a way that they can be easily reconfigured when needed to survive loss of system resources. We have extended and clarified the standard OSA object model to create a survivable object abstraction that makes it possible to define a set of "survivable configurations" that are able to withstand component loss and are also capable of being systematically evolved into new configurations should component loss become severe. The abstraction provides ways to change both the physical configuration (different service placement or resource allocation) and the logical configuration (service alternatives or changed levels of service quality) of an application. Developers use the abstraction to specify, implement, and connect services. The OSA Survivability Service manages configurations defined in this abstraction to keep them running as well as possible given the currently available resources. The object abstraction:

makes a clean distinction between the abstraction of a service instance and its implementation(s) in order to support replication, instance migration, change of implementation class for a given service instance, and multiple simultaneous implementation classes for a given instance;
abstracts the bindings of clients to services and implementations to resources in order to allow an OSA Survivability Service to determine which service instance best meets the needs of a client, and how and where that service instance should be instantiated;
defines useful patterns of object configurations that have desirable survivability properties;
uses the concept of quality of service (QoS) to allow alternatives to both service bindings and implementation instantiations in the event resource limitations prevent optimal behavior; and
defines legal transformations between legitimate configurations.
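
As an illustration only, the separation between a service instance and its implementations, and the QoS-aware resolution of a client binding, might be sketched as follows. The class names, the numeric QoS scale, and the `bind` function are hypothetical and are not part of the Survivability Service design.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Implementation:
    """One way to realize a service instance (hypothetical fields)."""
    impl_class: str  # implementation class name
    host: str        # resource the implementation is bound to
    qos: float       # assumed scale: 0.0 (worst) to 1.0 (best)


@dataclass
class ServiceInstance:
    """The abstract service a client sees, independent of its implementations."""
    name: str
    implementations: list = field(default_factory=list)

    def live(self, failed_hosts):
        """Implementations whose host has not failed."""
        return [i for i in self.implementations if i.host not in failed_hosts]


def bind(service, min_qos, failed_hosts=frozenset()):
    """Resolve a client binding: the best live implementation meeting min_qos, else None."""
    candidates = [i for i in service.live(failed_hosts) if i.qos >= min_qos]
    return max(candidates, key=lambda i: i.qos, default=None)


# A replicated map service with a lower-quality alternative implementation class.
maps = ServiceInstance("MapServer", [
    Implementation("FullResolutionMaps", "hostA", 1.0),
    Implementation("FullResolutionMaps", "hostB", 1.0),
    Implementation("CachedMaps", "hostC", 0.6),
])
```

In this sketch the client names only the service instance and a minimum QoS. Losing hostA and hostB changes which implementation the binding resolves to without changing the client.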

We believe that a key to adding any kind of "extra-functional behavior" such as security, persistence, survivability, etc., is to have an object abstraction with the right kind of "translucent joints" where systems can either be mediated or taken apart and reassembled dynamically in different ways. A joint is a well defined place where a binding between system components may be made. In general, more information about the binding than is common in programming languages is maintained; this could be a statement of requirements of any object that can satisfy the binding, the provenance required, information flow restrictions, QoS, etc. Translucence means that the joint is visible if desired in order to use its special properties, but otherwise is invisible except possibly for a small performance penalty. In fact, it is often possible to reduce or completely eliminate the performance penalty at the cost of more complexity in changing the binding. Prior examples of the use of such joints to add behavior are persistence and transaction control in Open OODB and the security in the OMG Object Security Service.
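
A minimal sketch of a translucent joint, under our own assumptions, is given below. The `Joint` class, its `mediate` and `rebind` methods, and the example services are hypothetical names introduced for illustration; they are not taken from Open OODB or the OMG Object Security Service.

```python
import functools


class Joint:
    """A binding point that carries extra information and can be mediated or rebound."""

    def __init__(self, target, requirements=None):
        self._target = target
        self.requirements = dict(requirements or {})  # e.g. QoS, provenance
        self._mediators = []  # callables of the form (method_name, proceed) -> result

    def mediate(self, mediator):
        """Attach extra behavior that wraps every invocation through the joint."""
        self._mediators.append(mediator)

    def rebind(self, new_target):
        """Reattach the joint to a different component without touching the client."""
        self._target = new_target

    def __getattr__(self, name):
        # Reached only for names not defined on the Joint itself, so ordinary
        # clients see the joint as if it were the target (translucence).
        def call(*args, **kwargs):
            def proceed():
                return getattr(self._target, name)(*args, **kwargs)
            chain = proceed
            for mediator in reversed(self._mediators):
                chain = functools.partial(mediator, name, chain)
            return chain()
        return call


class PrimaryMaps:
    def lookup(self, area):
        return f"primary:{area}"


class BackupMaps:
    def lookup(self, area):
        return f"backup:{area}"


log = []


def audit(name, proceed):
    """Example mediator: record the call, then let it proceed unchanged."""
    log.append(name)
    return proceed()


maps = Joint(PrimaryMaps(), {"min_qos": 0.5})
maps.mediate(audit)
first = maps.lookup("dallas")
maps.rebind(BackupMaps())
second = maps.lookup("dallas")
```

The client code calls `maps.lookup` in the same way before and after the rebinding. The extra indirection on each call corresponds to the small performance penalty mentioned above.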

We are specifying the architecture of an OSA Survivability Service to manage applications defined using the survivable object abstraction. The architecture supports a wide variety of survivability actions (below), is compatible with existing OSAs and projected trends (including the various repositories and the CORBA Security Service), and encompasses a wide variety of existing research in fault tolerant systems, failure detectors, system models, etc. We currently have an overall architecture for the Survivability Service that covers the "big picture" of how the components relate, including an internal partitioning that allows major subsystems to be replaced or refined, possibly by third parties. Survivability actions supported by the OSA Survivability Service are:

Basic Process Control gives the ability to start, stop and restart processes, to clean up after failed or aborted processes, and to restore processes to known states. Most of this is provided by ORBs.
Fault Tolerant Services are services designed to (usually) fail in known "good" ways. Their failure modes become part of the service specification. This must be provided by the service developers.
Failure Detection & Classification are mechanisms to detect the symptoms of failures and attacks, and classify the events into likely failure categories. This can be done through probes, wrappers, or exception reports from well-behaved services. We will obtain these mechanisms from elsewhere.
High Service Availability mechanisms use replication or hierarchical masking (i.e., error handling in the client) to make individual service instances much more highly available than they would otherwise be. We concentrate on replication-based policies since they do not rely on the semantics of the services and are therefore more widely applicable. Many replication-based policies exist and some are integrated with ORBs. These mechanisms make it possible to physically reconfigure an application by changing the way individual services are implemented; the logical organization remains fixed in that clients still interact with the same services after any reconfiguration.
Availability Management determines the appropriate fault tolerance mechanism to use for a given service based on service failure modes and perceived threats, and determines the resource pool needed to achieve desired availability. This is where much of our design and development work has been done.
Service Renegotiation makes it possible to change the logical organization of an application by binding clients to alternate services if the desired service becomes unavailable or degrades in performance. The rebinding can be to an equivalent but distinct service (e.g., a different server having the same maps), or to a similar but acceptable service (e.g., a different server with maps of the same area but at lower resolution). Alternatively, the same service connection can be maintained but at a lower quality of service (e.g., more errors or slower). In addition to allowing rebinding to service alternatives when services fail, service renegotiation can represent a fallback position if the costs of assuring service availability become unacceptably high. Service renegotiation requires specifications of client-service connections well beyond those currently used in OSAs, and will be a main focus of our project in the next year.
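
The interplay between Availability Management and Service Renegotiation can be illustrated with a small calculation. The sketch below assumes that replicas fail independently, so that n replicas of availability a together give availability 1 - (1 - a)^n. This independence assumption, the cost model, and the function names are ours and are used only for illustration.

```python
def replicas_needed(replica_availability, target_availability):
    """Smallest replica count n with 1 - (1 - a)**n >= target (independent failures assumed)."""
    if not 0.0 < replica_availability < 1.0:
        raise ValueError("replica_availability must be strictly between 0 and 1")
    if not 0.0 < target_availability < 1.0:
        raise ValueError("target_availability must be strictly between 0 and 1")
    n = 1
    while 1.0 - (1.0 - replica_availability) ** n < target_availability:
        n += 1
    return n


def plan(replica_availability, target_availability, cost_per_replica, budget):
    """Replicate if affordable; otherwise signal that the binding should be renegotiated.

    Returns (action, replicas), where replicas is the count the target would require.
    """
    n = replicas_needed(replica_availability, target_availability)
    if n * cost_per_replica <= budget:
        return ("replicate", n)
    return ("renegotiate", n)
```

For example, with replicas that are each available 80% of the time, a 99% target needs three replicas, while a 99.9% target needs five. If five replicas exceed the budget, the sketch falls back to renegotiation, mirroring the fallback role described above.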

The OSA Survivability Service configures and reconfigures applications using currently available resources in an attempt to avoid known threats. In deciding what to do, it uses a collection of environment models describing resources, threats, and the overall situation. At present these models are only roughly defined.

We are building a Survivability Service prototype, including a market mechanism for resource allocation, simple models and model evolution to drive survivability decisions under changing conditions, specifications of how to rebind logically equivalent or similar services, and some visualization. This will allow demonstration of a cohesive part of the Survivability Service by the middle of 1998. A concept demonstration of part of this currently exists.
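
To give a flavor of how a market can turn local decisions into a global allocation, the following sketch grants a fixed pool of resource units to the highest bidders first. The greedy sealed-bid rule, the `Bid` fields, and the example services are assumptions made for illustration; they do not describe the mechanism used in the prototype.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    """A service's local decision: how much it wants and what it will pay (hypothetical)."""
    service: str
    units: int
    price_per_unit: float


def allocate(capacity, bids):
    """Grant units to the highest-priced bids first until capacity runs out."""
    allocation = {}
    remaining = capacity
    for bid in sorted(bids, key=lambda b: b.price_per_unit, reverse=True):
        granted = min(bid.units, remaining)
        if granted > 0:
            allocation[bid.service] = granted
            remaining -= granted
    return allocation


bids = [
    Bid("maps", 4, 5.0),
    Bid("logging", 3, 1.0),
    Bid("tracking", 3, 8.0),
]
```

When capacity shrinks, for instance after a host is lost, rerunning `allocate` with the smaller capacity squeezes out the lowest bidders first, with no global plan needed.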

We are interested in attending this workshop in order to trade ideas about object abstractions and joints, and to contribute to a discussion of how different behaviors applied at the same joint should be allowed to interact.