Survivability in Object Services Architectures
David L. Wells and David E. Langworthy
Object Services and Consulting, Inc.
A "survivable" application can continue to function despite the loss
or degradation of some of its components, will maintain its functionality
and performance for as long as possible, and will degrade gracefully when
this is no longer possible. Survivability relies on redundancy to
allow normal operations to continue as long as possible, the ability to
reconfigure to correct problems, and policies defining acceptable (but
less desirable) functionality or performance should it prove impossible
to maintain the desired behavior.
We are developing an architecture and Survivability Service to make
OSA-based distributed systems far more survivable in the face of component
failure and degradation than is currently possible. The architecture
unifies a number of existing robustness mechanisms and adds several new
ones to provide a variety of tools that can be applied in different situations.
Because of the complexity of system-wide survivability, it is impossible
to have a "master plan" for assuring survivability. Instead, we use
market mechanisms to create global survivability as an emergent behavior
resulting from a large number of small, local decisions.
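As an illustration only, the following sketch shows how a global allocation can fall out of local decisions. It is not our mechanism: it assumes a simple sealed-bid auction in which each service decides locally what a resource is worth to it, and every name in it (Bid, allocate, the services, and the hosts) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    service: str    # hypothetical service name
    resource: str   # hypothetical resource name, e.g. a host
    price: float    # what the service offers, decided locally by that service


def allocate(bids):
    """Award each resource to its highest bidder (ties go to the earliest bid)."""
    winners = {}
    for bid in bids:
        best = winners.get(bid.resource)
        if best is None or bid.price > best.price:
            winners[bid.resource] = bid
    return {resource: bid.service for resource, bid in winners.items()}


# Each service submits bids independently; no component holds a master plan.
bids = [
    Bid("map_server", "host_a", 5.0),
    Bid("logger", "host_a", 2.0),
    Bid("logger", "host_b", 2.0),
]
allocation = allocate(bids)
print(allocation)
```

In this toy run the logger loses host_a to the higher bid and ends up on host_b, so the overall placement results from individual bids rather than from a central planner.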
Our approach maintains the simplicity of OSA application development
that has been largely responsible for the popularity of OSAs by not requiring
individual applications or services to be responsible for the details of
ensuring their own survivability. This is necessary for three reasons:
survivability is difficult to program, so its development costs should be
amortized across many applications; the survivability needs of different
applications or services often conflict; and survivability requires more
accurate knowledge of the eventual deployment environment(s) than is
reasonable to expect at development time. To achieve this, we make
survivability orthogonal to conventional OSA application semantics; in
other words, survivability is "added" to an application rather than built
into it from the start.
This is done by a "Survivability Service" that handles the survivability
needs of applications collectively, responding to changes in workload,
resource requirements, resource availability, and threats based on a number
of environment models that can be specified independently. A consequence
of making survivability orthogonal to application functionality is that
changing the models (not the applications or services) allows applications
to be deployed into dynamically changing or unanticipated environments.
This approach also supports the use of COTS and GOTS components that were
not constructed for survivability.
The key to constructing survivable systems is to configure them so that
they can be easily reconfigured to survive the loss of system
resources. We have extended and clarified the standard
OSA object model to create a survivable object abstraction that
makes it possible to define a set of "survivable configurations" that are
able to withstand component loss and are also capable of being systematically
evolved into new configurations should component loss become severe.
The abstraction provides ways to change both the physical configuration
(different service placement or resource allocation) and the logical configuration
(service alternatives or changed levels of service quality) of an application.
Developers use the abstraction to specify, implement, and connect services.
The OSA Survivability Service manages configurations defined in this abstraction
to keep them running as well as possible given the currently available
resources. The object abstraction:
-
makes a clean distinction between the abstraction of a service instance
and its implementation(s) in order to support replication, instance migration,
change of implementation class for a given service instance, and multiple
simultaneous implementation classes for a given instance;
-
abstracts the bindings of clients to services and implementations to resources
in order to allow an OSA Survivability Service to determine which service
instance best meets the needs of a client, and how and where that service
instance should be instantiated;
-
defines useful patterns of object configurations that have desirable survivability
properties;
-
uses the concept of quality of service (QoS) to allow alternatives to both
service bindings and implementation instantiations in the event resource
limitations prevent optimal behavior; and
-
defines legal transformations between legitimate configurations.
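The first two points of the abstraction can be sketched as follows. This is a minimal illustration under our own assumptions, not the actual object model: ServiceInstance, Implementation, and their methods are hypothetical names, and "failure" is simulated with a flag.

```python
from dataclasses import dataclass, field


@dataclass
class Implementation:
    impl_class: str     # hypothetical implementation class name
    host: str           # the resource this implementation is bound to
    alive: bool = True  # simulated health flag

    def handle(self, request):
        if not self.alive:
            raise RuntimeError(f"{self.impl_class} on {self.host} is down")
        return f"{self.impl_class}@{self.host} handled {request}"


@dataclass
class ServiceInstance:
    """What a client holds: the service abstraction, not one implementation."""
    name: str
    implementations: list = field(default_factory=list)

    def add_implementation(self, impl):
        # Covers both replication and a second implementation class.
        self.implementations.append(impl)

    def migrate(self, impl, new_host):
        # Rebind an implementation to a different resource and mark it usable.
        impl.host = new_host
        impl.alive = True

    def invoke(self, request):
        # Use the first live implementation; the client never chooses one.
        for impl in self.implementations:
            if impl.alive:
                return impl.handle(request)
        raise RuntimeError(f"no live implementation of {self.name}")


maps = ServiceInstance("maps")
primary = Implementation("FastMapServer", "host_a")
backup = Implementation("SimpleMapServer", "host_b")
maps.add_implementation(primary)
maps.add_implementation(backup)

first = maps.invoke("tile 7")    # served by the primary
primary.alive = False            # simulate loss of host_a
second = maps.invoke("tile 7")   # same service instance, different implementation
maps.migrate(primary, "host_c")  # physical reconfiguration
third = maps.invoke("tile 7")    # primary again, now on host_c
```

The client code calls maps.invoke throughout; only the binding of the instance to implementations and of implementations to resources changes.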
We believe that a key to adding any kind of "extra-functional behavior"
such as security, persistence, survivability, etc., is to have an object
abstraction with the right kind of "translucent joints" where systems can
either be mediated or taken apart and reassembled dynamically in different
ways. A joint is a well defined place where a binding between system
components may be made. In general, a joint maintains more information
about the binding than is common in programming languages; this could include
a statement of the requirements on any object that can satisfy the binding, the
provenance required, information flow restrictions, QoS, etc. Translucence
means that the joint is visible if desired in order to use its special
properties, but otherwise is invisible except possibly for a small performance
penalty. In fact, it is often possible to reduce or completely eliminate
the performance penalty at the cost of more complexity in changing the
binding. Prior examples of the use of such joints to add behavior
are persistence and transaction control in Open OODB and security in
the OMG Object Security Service.
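One way to picture a translucent joint is as a forwarding object that clients use as if it were the bound component, but that also carries binding information and can be rebound. The sketch below is an assumption-laden illustration, not the mechanism used in Open OODB or the OMG service: the Joint class, its requirements list, and the two store classes are hypothetical.

```python
class Joint:
    """A binding point that forwards calls but can be inspected and rebound."""

    def __init__(self, target, requirements):
        self._target = target
        self.requirements = requirements  # operation names any target must offer

    def __getattr__(self, name):
        # Invoked only for attributes not found on the joint itself, so
        # ordinary client calls pass straight through to the current target.
        if name == "_target":
            raise AttributeError(name)  # guard against recursion before init
        return getattr(self._target, name)

    def rebind(self, new_target):
        # The visible side of the joint: check the stated requirements,
        # then swap the component behind the binding.
        missing = [op for op in self.requirements if not hasattr(new_target, op)]
        if missing:
            raise TypeError(f"candidate lacks required operations: {missing}")
        self._target = new_target


class FileStore:
    def read(self, key):
        return f"file:{key}"


class MemoryStore:
    def read(self, key):
        return f"mem:{key}"


store = Joint(FileStore(), requirements=["read"])
before = store.read("x")      # client is unaware of the joint
store.rebind(MemoryStore())   # a mediator reassembles the system
after = store.read("x")       # same client code, different component
```

The extra attribute lookup in __getattr__ corresponds to the small performance penalty mentioned above.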
We are specifying the architecture of an OSA Survivability Service
to manage applications defined using the survivable object abstraction.
The architecture supports a wide variety of survivability actions (below),
is compatible with existing OSAs and projected trends (including the various
repositories and the CORBA Security Service), and encompasses a wide variety
of existing research in fault tolerant systems, failure detectors, system
models, etc. We currently have an overall architecture for the Survivability
Service that covers the "big picture" of how the components relate, including
an internal partitioning that allows major subsystems to be replaced or
refined, possibly by third parties. Survivability actions supported by
the OSA Survivability Service are:
-
Basic Process Control gives the ability to start, stop and
restart processes, to clean up after failed or aborted processes, and to
restore processes to known states. Most of this is provided by ORBs.
-
Fault Tolerant Services are services designed to (usually) fail
in known "good" ways. Their failure modes become part of the service
specification. This must be provided by the service developers.
-
Failure Detection & Classification are mechanisms to
detect the symptoms of failures and attacks, and classify the events into
likely failure categories. This can be done through probes, wrappers,
or exception reports from well-behaved services. We will obtain
these mechanisms from elsewhere.
-
High Service Availability mechanisms use replication or hierarchical
masking (i.e., error handling in the client) to make individual service
instances much more highly available than they would otherwise be.
We concentrate on replication-based policies since they do not rely on
the semantics of the services and are therefore more widely applicable.
Many replication-based policies exist and some are integrated with ORBs.
These mechanisms make it possible to physically reconfigure an application
by changing the way individual services are implemented; the logical organization
remains fixed in that clients still interact with the same services after
any reconfiguration.
-
Availability Management determines the appropriate fault tolerance
mechanism to use for a given service based on service failure modes and
perceived threats, and determines the resource pool needed to achieve desired
availability. This is where much of our design and development work has
been done.
-
Service Renegotiation makes it possible to change the logical
organization of an application by binding clients to alternate services
if the desired service becomes unavailable or degrades in performance.
The rebinding can be to an equivalent, but distinct service (e.g., a different
server having the same maps), or to a similar, but acceptable service (e.g.,
a different server with maps of the same area but at lower resolution).
Alternatively, the same service connection can be maintained but at a lower
quality of service (e.g., more errors or slower). In addition to
allowing rebinding to service alternatives when services fail, service
renegotiation can represent a fallback position if the costs of assuring
service availability become unacceptably high. Service renegotiation
requires specifications of client-service connections well beyond those
currently used in OSAs, and will be a main focus of our project in the
next year.
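The map-server example under Service Renegotiation can be sketched as a choice among candidate services constrained by what the client will accept. This is a simplified illustration under our own assumptions: the Candidate record, the single QoS attribute (map resolution in metres per pixel), and the renegotiate function are hypothetical and are not the connection specifications the project will define.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    resolution_m: float     # metres per pixel; smaller means finer maps
    available: bool = True  # simulated availability flag


def renegotiate(candidates, max_resolution_m):
    """Return the best available candidate the client still accepts, or None."""
    acceptable = [
        c for c in candidates
        if c.available and c.resolution_m <= max_resolution_m
    ]
    # Prefer the finest resolution; on ties, min keeps the earliest candidate.
    return min(acceptable, key=lambda c: c.resolution_m, default=None)


candidates = [
    Candidate("maps_primary", 1.0),
    Candidate("maps_mirror", 1.0),   # equivalent but distinct service
    Candidate("maps_lowres", 10.0),  # similar but lower-quality service
]

choice_1 = renegotiate(candidates, max_resolution_m=10.0).name
candidates[0].available = False      # primary is lost
choice_2 = renegotiate(candidates, max_resolution_m=10.0).name
candidates[1].available = False      # mirror is lost as well
choice_3 = renegotiate(candidates, max_resolution_m=10.0).name
strict = renegotiate(candidates, max_resolution_m=5.0)  # client refuses low-res
```

Here the client falls back first to the equivalent mirror and then to the lower-resolution server; a client whose stated limit excludes the low-resolution maps gets no binding at all.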
The OSA Survivability Service configures and reconfigures applications
using currently available resources in an attempt to avoid known threats.
It uses a collection of environment models describing resources,
threats, and the overall situation to determine what to do. These models
are only roughly defined at present.
We are building a Survivability Service prototype, including a
market mechanism for resource allocation, simple models and model evolution
to drive survivability decisions under changing conditions, specifications
of how to rebind logically equivalent or similar services, and some visualization.
This will allow demonstration of a cohesive part of the Survivability Service
by mid-1998. A concept demonstration of part of this
currently exists.
We are interested in attending this workshop in order to trade ideas
about object abstractions and joints, and to contribute to a discussion
of how different behaviors applied at the same joint should be allowed
to interact.