QoS & Survivability
David Wells
Object Services and Consulting, Inc.
-
March 1998
-
Revised August 1998
-
This research is sponsored by the Defense Advanced Research
Projects Agency and managed by Rome Laboratory under contract F30602-96-C-0330.
The views and conclusions contained in this document are those of the authors
and should not be interpreted as necessarily representing the official
policies, either expressed or implied, of the Defense Advanced Research
Projects Agency, Rome Laboratory, or the United States Government.
© Copyright 1997, 1998 Object Services and Consulting,
Inc. Permission is granted to copy this document provided this copyright
statement is retained in all copies. Disclaimer: OBJS does not warrant
the accuracy or completeness of the information in this document.
1 - Introduction
In the past several years, there has been considerable research in the
areas of quality of service (QoS) and survivability in an
attempt to facilitate the construction of large software systems that behave
properly under a wide range of operating conditions and degrade gracefully
when outside this range. From the 10,000 foot level, quality of service
addresses the goodness and timeliness of results delivered to a client,
while survivability addresses how to repair or gracefully degrade when
things go awry and the desired behavior cannot be maintained. These
two areas are obviously related, because QoS forms at least a part of the
definition of the "desired" behavior of a system that survivability techniques
are attempting to preserve or gracefully degrade.
This paper explores the relationship between quality of service and
survivability. Section 2 discusses the concepts of quality of service
and survivability. Section 3 identifies and presents highlights of important
QoS research efforts. Section 4 discusses these projects in more detail,
particularly efforts whose approach to QoS is compatible with our approach
to survivability. Section 5 identifies technical "points of intersection"
between the QoS and survivability work that could eventually lead to a
confluence. Section 6 identifies some issues that arise when QoS
and survivability are combined and points out some weaknesses in the way
the existing projects add and measure survivability.
2 - Quality of Service & Survivability
The concept of quality of service has traditionally been applied only at
the network (and sometimes operating system) level. At that level,
QoS deals with issues such as time to delivery, bandwidth, jitter, and
error rates. Network-level QoS is important because many applications will
not function in an acceptable or useful manner unless the network they
use can provide some minimal service guarantees. It has been observed
that just as all services and applications rely on networks, they also
rely on other applications and services and these must also make some QoS
guarantees to allow the application to perform correctly.
A pair of short papers from Rome Laboratory describe
service-level QoS as a function of precision (how much), accuracy
(how correct), and timeliness (does it come when needed).
For example, a map may be insufficiently precise (100m instead of 10m resolution),
inaccurate (things in the wrong places), or untimely (delivered too late
to be useful). Unless all three requirements are met, a client is
not getting what it needs and therefore the result is lacking in QoS.
A benefit function is defined for each point in this 3-D space stating
the value to the client of receiving that particular QoS. Distance
metrics for the argument spaces are application dependent. For example,
the distance between "red" and "orange" will be less than the distance
between "red" and "blue" in a spectral dimension, but not in a textual
dimension. The benefit function is similarly application-specific and may
be situation dependent as well.
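To make the preceding description concrete, the following Python sketch shows
one way such a benefit function over the (precision, accuracy, timeliness)
space might be written. The map example follows the text above; the specific
thresholds, weights, and the multiplicative combination are illustrative
assumptions, not taken from the Rome Laboratory papers.
    # A minimal sketch of a service-level QoS benefit function over the
    # (precision, accuracy, timeliness) space described above.  The map example
    # is from the text; all numeric thresholds below are illustrative only.

    def map_benefit(resolution_m, position_error_m, delivery_delay_s):
        """Value (0.0 - 1.0) to a hypothetical client of receiving a map
        with the given precision, accuracy, and timeliness."""
        # Precision: a 10m map is fully useful, a 100m map much less so.
        precision = 1.0 if resolution_m <= 10 else 0.2 if resolution_m <= 100 else 0.0
        # Accuracy: benefit falls off linearly as positional error grows.
        accuracy = max(0.0, 1.0 - position_error_m / 50.0)
        # Timeliness: a late map is worthless to this client.
        timeliness = 1.0 if delivery_delay_s <= 60 else 0.0
        # Unless all three requirements are met the result lacks QoS, so the
        # combined benefit is a product rather than, say, a weighted sum.
        return precision * accuracy * timeliness

    print(map_benefit(10, 5, 30))    # -> 0.9   (precise, accurate, timely)
    print(map_benefit(100, 5, 30))   # -> ~0.18 (insufficiently precise)
    print(map_benefit(10, 5, 300))   # -> 0.0   (delivered too late)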
QoS at any given level of abstraction places QoS requirements on the
components used to provide it. In the example, delivering a 10m map
requires the retrieval of a certain number of bits of map data.
A survivable system is one that can repair itself or degrade gracefully
to preserve as much critical functionality as possible in the face of attacks
and failures. A survivable system needs to be able to switch compatible
services in an established connection and substitute acceptable alternatives.
It must also be able to dynamically adapt to the threats in its environment
to reallocate essential processing to the most robust resources.
QoS and survivability are intricately linked; they are not the same,
but neither makes sense without the other.
-
Survivability without some notion of what is supposed to be surviving
is pointless; the what is provided by QoS metrics.
-
QoS "guarantees" that can't be made to survive or adapt under changing
conditions are not very useful as guarantees, and could in fact lead to
denial of service attacks as opponents bring a system to its knees by degrading
QoS and causing the QoS management system to continually add superfluous
resources.
3 - Overview of QoS Projects
Recent work, much of it funded by DARPA-ITO and administered by Rome Laboratory
through the Quorum
program, is extending QoS concepts and mechanisms to higher semantic levels
to allow the definition, measurement, and control of the quality of service
delivered by services and complete applications. There are three
major groupings of projects. SRI and BBN each have architectural frameworks
and are developing or adapting multiple pieces of technology to fit their
frameworks. The BBN and SRI frameworks address different types of QoS needs
and do not appear compatible. Several independent projects are developing
individual pieces of technology. All projects are administered by Rome
Laboratory, which also does some technology development. The groupings
and relationships of projects (shown in QoS
Projects Map) are given below. The individual projects are described
in more detail in Section 4.
BBN Cluster: The BBN cluster consists of three architecture/infrastructure
efforts based on a CORBA client-server model:
-
QuO - Quality Objects
team: BBN - see: papers
-
AQuA - Adaptive
Quality of Service Availability
team: BBN, Illinois, Cornell - see: papers
-
OIT - Open Implementation
Toolkit for Creating Adaptable Distributed Applications
team: BBN, Illinois - see: papers,
and one application demonstration project:
-
DIRM - Dynamic
Integrated Resource Management
team: BBN, Columbia, SMARTS - see: papers
These projects are closely related, and in many ways it is useful to think
of them as one large project. QuO developed a general framework for
QoS in CORBA that is being refined by AQuA and extended by OIT to address
service availability management. There is to be a "production QuO"
done under the DIRM project.
SRI: SRI is developing an architecture and scheduling algorithms
for the delivery of end-to-end QoS for a data streaming model.
Independent Projects: These projects are developing modeling and
analysis/simulation tools that could be used by a QoS management system
to model resources and QoS requirements and to schedule resources. Several
of the tools were developed for another purpose and are being adapted to
the QoS domain. All have had some relationship with the BBN cluster of
projects. Projects are:
-
UltraSAN
team: Illinois - see papers
-
QoSME
team: Columbia
-
InCharge/MODEL
team: SMARTS, Inc. - see papers,
-
Horus
team: Cornell
4 - QoS Project Details
BBN Related Projects - QuO, AQuA, OIT, DIRM
The QuO project is developing Quality Objects
that can manage their own QoS. QuO is integrated with the CORBA architecture,
in that most of the work is done by extending client-side and server-side
ORB stubs. These "smarter" stubs are generated from an extension to IDL
called QDL that allows specifying things about service and connection quality.
QuO assumes a CORBA-like processing model in which there are client-server
and peer-peer relationships and in which the exact processing loads are
unknown and can be quite variable. This distinguishes QuO from the
SRI work, where the information flow of the applications takes the shape
of a DAG and where processing and QoS requirements are assumed to be well
understood a priori, as is the case in multi-media delivery.
Using QDL, an object can specify the QoS it expects from each service
that it uses and can specify what its usage patterns will be (e.g., invocations/sec).
An object will similarly use QDL to specify the QoS it knows how to provide
(which can be different for different implementations or resource levels).
These specifications are used to create client-server bindings called connections.
Connections are first-class objects as they are in our survivability model
defined in Composition Model for OSAs.
To make the writing of QoS specifications and the creation and maintenance
of connections tractable, QoS is partitioned into regions of normative
behavior. Within each region it is assumed that every QoS is equally
usable. Region definitions look like predicates in a language like
C. Some of these regions (e.g., Normal, Overload, InsufficientResources,
Idle) are predefined, but others can be defined using QDL. Clients
and services parameterize these regions to define when they are in a particular
one. For example, a client may say that when in Normal mode, it will
make between 2 and 10 calls/sec to a particular service; anything over
10 calls/sec puts the client into Overload mode.
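A minimal Python sketch of such region predicates follows. The region names
and the 2-10 calls/sec thresholds come from the example above; the
representation itself (plain predicates over an observed call rate) is an
assumption, not QDL.
    # Sketch of QoS region predicates in the spirit of QDL's negotiated regions.
    # Region names and the 2-10 calls/sec thresholds follow the example above;
    # the Python representation itself is an assumption.

    REGIONS = {
        "Idle":     lambda calls_per_sec: calls_per_sec == 0,
        "Normal":   lambda calls_per_sec: 2 <= calls_per_sec <= 10,
        "Overload": lambda calls_per_sec: calls_per_sec > 10,
    }

    def current_region(calls_per_sec):
        """Classify observed client behavior into a named QoS region."""
        for name, predicate in REGIONS.items():
            if predicate(calls_per_sec):
                return name
        return "Undefined"   # e.g., 1 call/sec falls between Idle and Normal

    print(current_region(5))    # Normal
    print(current_region(25))   # Overload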
The use of regions means that minor (insignificant) deviations in the
QoS delivered do not require changes to the service or connection, which
substantially simplifies runtime processing. Similarly, if clients and
servers agree on common meanings for named regions, matching client and
server specifications is simplified, which is important since this activity
must often be done on the fly. The use of QoS regions is a significant
difference between the BBN and SRI approaches; in the SRI approach, the
benefit function is allowed to be continuous.
The regions discussed so far are called negotiated regions since
they represent where the client and server try to operate and form the
basis for the connection contract. As long as both the client and server
operate in their negotiated regions, all is well. However, it is possible
for either the client or the server (which for these purposes also includes
the connection between the proxy and the remote server implementation)
to deviate from the negotiated region, either by overloading the server
or failing to deliver results as required. Because of the potential for
operating outside the negotiated regions, QuO defines reality regions
to represent the actual QoS-relevant behaviors of client and servers. Reality
regions are defined in the same way as negotiated regions; it appears that
for any given connection, the same set of specifications will be used for
both kinds of regions. If the observed reality region differs from
the negotiated region, the connection is no longer valid and remedial action
must be taken.
Various monitors determine the reality region a connection is actually
in. Monitors are presumably predefined to monitor the kinds of things that
QuO cares about; others can presumably be defined, and there is a claim
that they can be deployed selectively to only monitor those QoS items of
interest. Types of monitors include counting invocations, timers,
and heartbeats. It seems that there could be a whole subsystem for inserting
probes in useful places.
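The following hypothetical sketch (QuO's actual monitor interfaces are not
published at this level of detail) shows the flavor of an invocation-rate
monitor that a proxy could consult when classifying the reality region:
    import time
    from collections import deque

    # Hypothetical invocation-counting monitor of the kind a QuO-style proxy
    # might attach to a connection to determine its reality region.

    class InvocationRateMonitor:
        def __init__(self, window_s=1.0):
            self.window_s = window_s
            self.timestamps = deque()

        def record_invocation(self):
            """Called by the proxy each time a request passes through it."""
            self.timestamps.append(time.time())

        def calls_per_sec(self):
            """Observed rate over a sliding window; drives region classification."""
            now = time.time()
            while self.timestamps and now - self.timestamps[0] > self.window_s:
                self.timestamps.popleft()
            return len(self.timestamps) / self.window_s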
Operationally, a QuO object is a distributed object where part
of it lies on the server side and part(s) lie with its various clients.
It is appropriate to think of the client side as being a very smart proxy
object that knows how to do QoS related actions. (It appears that
this stub can be further tailored to do just about anything, but that doesn't
look like a good idea unless the architecture specifies what kinds of things
should be done in the proxy.) A QuO proxy keeps track of the
negotiated QoS mode of the connection; interacts with monitors; provides
feedback to the client through interfaces generated from the QDL; and takes
certain actions to maintain the negotiated QoS.
Client-side proxies can take actions to maintain a QoS mode or change
the negotiated QoS region. A client makes a service request
through a client-side proxy which decides what to do with method invocations.
This could include sending the requests to a server, servicing them from
a cache, routing to replicas, routing to a varying set of implementation
objects, or ignoring them and signaling an exception. How many
of these they actually plan to provide is unknown, but this is definitely
where they fit. It fits very well with our survivability model, except
that we would extract this from the purview of the client and move it to
a Survivability Service to handle competing service demands. In an
ideal world, there would be a large number of generic actions that could
be taken by any proxy to maintain QoS or shift regions gracefully
when it detects a change in client behavior. Our Evolution
Model for OSAs defines many ways to evolve configurations. Some
of the actions a client-side proxy can take to maintain a region or change
regions are listed below (a minimal dispatch sketch follows the list):
-
A server may be able to shift its implementation in order to stay in the
same region without disturbing the client. An example given is to
change implementations to trade bandwidth for processing power as resource
availability changes. Since the server stays in the same negotiated
region, the client doesn't need to see any change.
-
The client may request a different negotiated region. For example,
it may go idle and negotiate a lower QoS region. It is also possible
that a client is detected to have entered a different region. For
example, the regions may define Idle as no messages of any kind for 5 minutes.
If the reality region enters Idle, it can be treated as if the client signaled
that it was going Idle. Thus a client can change modes without having
to be implemented for QoS.
-
A server may have multiple implementation strategies that will allow it
to enter different regions. For example, if the client goes Idle,
the server may scale back its own resources, while if a client enters an
Overload mode, the server may add resources or shift to a different implementation.
-
The proxy may make an up-call to the client to find out what to do.
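To make the dispatch alternatives above concrete, here is a minimal,
hypothetical Python sketch of a client-side proxy that chooses among a few of
them (local cache, replica routing, signaling an exception). It is a sketch of
the idea only, not the QuO proxy implementation.
    # Hypothetical client-side proxy illustrating the dispatch alternatives
    # discussed above (serve from cache, route to a replica, raise an exception).

    class QoSViolation(Exception):
        pass

    class SmartProxy:
        def __init__(self, replicas, monitor, cache=None):
            self.replicas = replicas    # callable server stand-ins
            self.monitor = monitor      # any object with record_invocation()
            self.cache = cache or {}
            self.next_replica = 0

        def invoke(self, method, *args):
            self.monitor.record_invocation()
            key = (method, args)
            # 1. Serve from a local cache when possible.
            if key in self.cache:
                return self.cache[key]
            # 2. Otherwise round-robin over replicas, skipping ones that fail.
            for _ in range(len(self.replicas)):
                replica = self.replicas[self.next_replica]
                self.next_replica = (self.next_replica + 1) % len(self.replicas)
                try:
                    result = replica(method, *args)
                    self.cache[key] = result
                    return result
                except Exception:
                    continue
            # 3. No replica could service the request: signal a QoS exception
            #    so the client (or a Survivability Service) can react.
            raise QoSViolation("no replica available for %s" % method)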
Feedback to the client is done by callbacks generated from the QDL.
These form an additional interface to the client that is used only for
QoS purposes. The proxy can notify the client that the reality region
has deviated from the negotiated region. The client must provide
handlers for these callbacks to either begin a negotiation process
with the proxy or to change its own behavior in some way (e.g., slow down,
accept lower precision, etc.). Again, it appears that there is a
wealth of opportunity to do good things here, but it is not clear how many
of them are actually defined.
The QuO papers give a fairly detailed figure of the form of the proxy
objects. The partitioning of function seems good and would allow
us to extend it if we needed to. This looks like where a lot of their
work went.
A weakness of the QuO project seems to be that they appear to have a
sort of "one-level" service model, in which clients call services which
are leaves. Also, the allocation of resources appears to be under
the control of the proxies, without much regard for other demands. They
briefly mention a language, Resource Definition Language (RDL), for specifying
resources, and it would be reasonable for there to be some sort of auction
or scheduler that gives resources out in a non-object-centric way, but this
gets little attention. It may be possible to handle both multiple
levels of objects and sensible allocation for competing resources, but
they don't seem to do it. It appears that the architecture and models
do make this possible.
AQuA is improving the QuO proxy architecture, attaching RSVP
to allow QuO proxies to control QoS over CORBA, and adding regions to deal
with availability as well as QoS. Not much has been published, so what follows
is my supposition as to how this will work. It appears that they
plan to predefine a number of negotiated regions for availability, where
the region predicate has to do with things like how many replicas will
be required in order to stay in that region. Hooks down to a replication
manager (probably via Horus or using Electra/Ensemble) are used as the
monitors to detect whether the reality region for availability has changed
(e.g., a replica died). Actions in response to an availability change
would be to start another replica or inform the client of the change and
let the client decide what to do in the same way it decides what to do
about QoS region changes. UltraSAN is being used (or at least considered)
as a way to determine the availability region predicates. Using UltraSAN
(and possibly SMARTS/Model), they will model various configurations and
use the UltraSAN simulation and analysis tools to determine how long a
configuration is likely to stay alive or how long it will take to reconfigure
(I haven't actually seen the latter discussed at all, so maybe they do
not plan to do this). They will define and analyze configurations
until they find one that has the right predicted availability for a particular
client need. That will then form a contract for a specific availability
region.
It is not clear that they maintain a connection between QoS and availability.
It looks like region predicates express both QoS and availability concerns
in the same predicate. Since increased availability can degrade QoS
(slower with more replicas depending on the operation), this needs to be
addressed. It does look like they will be able to determine that
the combined reality region does not match the combined negotiated region,
but that will not help with the act of finding a good combined region definition
that can actually be instantiated.
OIT is just getting started and not much has been written about
it. It looks like they will be cleaning up some QuO internal architecture
and perhaps developing a toolkit to help write and manage contracts.
This may integrate with the use of UltraSAN; the published material does not say so, but that
would not be an unreasonable part of such a toolkit. It also appears that
OIT will be using these extensions to provide some support for survivability
along the lines of AQuA; it is not clear what the relationship is between
them.
DIRM has much more the feel of a technology demonstration than
the other projects. Little is written about it, but it appears that
the idea is to show off some QoS concepts in a collaborative decision process
in some sort of military command post setting.
SRI
SRI is working on end-to-end QoS, with particular emphasis on managing
data streaming within a DAG of processes. The strengths of the work
are a very clear modeling methodology and language (including a
pictorial form) for specifying alternative implementations, natural handling
of information and processing flows that encompass arbitrary numbers of
steps, and a scheduling algorithm for allocating processes and communication
to available resources as described in a system model.
Resources are allocated by a scheduler based on a modified Dijkstra shortest-path
algorithm for finding a least-cost path (with respect to a definable cost function
such as least time, lowest cost, or highest throughput) through a graph representing
the required processing steps and communication. There is a delta form
of the algorithm that gives delta-suboptimal decisions much faster.
Concerns are the algorithm's speed and the fact that it appears to require a
centralized scheduler.
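As an illustration of the underlying idea only (not SRI's algorithm, which
also handles delta-suboptimal updates and resource constraints), a standard
Dijkstra search over a graph of processing and communication steps with a
pluggable cost function might look like the following; the example graph and
costs are invented.
    import heapq

    # Illustrative least-cost path search over a graph whose edges represent
    # processing or communication steps, with a pluggable edge-cost function.
    # Plain Dijkstra; treat it only as a sketch of the underlying idea.

    def least_cost_path(graph, source, sink, cost):
        """graph: {node: [(neighbor, edge_data), ...]}
        cost: edge_data -> non-negative number (e.g., latency or 1/throughput)."""
        dist = {source: 0.0}
        prev = {}
        heap = [(0.0, source)]
        while heap:
            d, node = heapq.heappop(heap)
            if node == sink:
                break
            if d > dist.get(node, float("inf")):
                continue
            for neighbor, edge in graph.get(node, []):
                nd = d + cost(edge)
                if nd < dist.get(neighbor, float("inf")):
                    dist[neighbor] = nd
                    prev[neighbor] = node
                    heapq.heappush(heap, (nd, neighbor))
        if sink not in dist:
            return None, float("inf")
        path, node = [sink], sink
        while node != source:
            node = prev[node]
            path.append(node)
        return list(reversed(path)), dist[sink]

    # Hypothetical example: choose between a compressed and an uncompressed route.
    graph = {
        "capture":    [("compress", {"ms": 40}), ("send_raw", {"ms": 10})],
        "compress":   [("send_small", {"ms": 20})],
        "send_raw":   [("display", {"ms": 120})],
        "send_small": [("display", {"ms": 30})],
        "display":    [],
    }
    print(least_cost_path(graph, "capture", "display", cost=lambda e: e["ms"]))
    # -> (['capture', 'compress', 'send_small', 'display'], 90.0)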
It appears that the scheduling algorithm precludes the use of this work
in a peer-peer or client-server setting; if so, this is a major differentiator
between the SRI and BBN work. There are several reasons for this. It is
hard to see how a shortest path algorithm can be adapted to handle arbitrary
numbers of loop iterations as can be found in general client-server or
peer-peer systems. The algorithms also require that processing steps
and flows be identifiable on a time-step basis; this doesn't seem
to match an environment in which service requests are either random or
have high variance. All their examples deal with media delivery.
The papers mention the ability to separate feedback paths from feed-forward
paths, but this is not explained.
They define more completely than anyone else surveyed the meaning of
precision, accuracy, and timeliness, specifying explicit parameters for
these concepts. Each parameter has several components, including
absolute values, relationships between values, expected values, bounds,
and variance. It looks like a QoS specification could become quite
complex. They also define a benefit function in the obvious
way. It is not clear how their scheduling algorithm deals with relative
values or variance that their QoS specifications allow.
This work appears most suitable for rather tightly controlled situations
where a high level of QoS is required. Unlike the BBN work, their
solution does not appear to be open.
UltraSAN
UltraSAN is a system for modeling and analyzing networks in which events
such as workload and failures are probabilistic. A system is modeled as
a stochastic activity network (SAN), which is an extension of a Petri net.
A SAN extends Petri nets by allowing transitions to be probabilistic, multiple
tokens to reside at any given place, and "gates" to act as predicates
to define which transitions are allowed at any given time. The result
looks like a stylized dataflow graph of the activity under analysis.
Reward functions are given for transitions to designated states and for
remaining in a state for a duration. The point of framing the system
as a SAN is that a SAN can be converted to a Markov model, to which they
know how to apply analytical and simulation techniques. They have
tools to define the SAN and to convert it to a Markov model and analyze
& simulate (by means of fault injection) the resultant Markov model.
They also have some partitioning and replication constructs to aid in the
construction of SANs. Basically, you can "join" SANs together so
that they share "places". This allows replicas to be composed and
for large SANs to be created from small SANs. They also have a technique
based on these replication tricks to reduce the state space of the Markov
model, which otherwise would quickly become unmanageable for reasonably
sized SANs. The software is currently distributed to 129 universities and
several businesses.
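A greatly simplified illustration of why the conversion to a Markov model is
useful: assuming n independent replicas, each failing at rate lam and repaired
at rate mu (a simple birth-death chain, far cruder than anything UltraSAN
would actually model), the steady-state availability of "at least one replica
up" can be computed directly. The rates below are invented.
    # Toy availability model in the spirit of the SAN-to-Markov approach above.
    # Assumes n independent replicas, each failing at rate lam and being
    # repaired at rate mu, and that the service is "available" while at least
    # one replica is up.  The numbers are illustrative only.

    def steady_state_availability(n_replicas, lam, mu):
        p_replica_down = lam / (lam + mu)          # per-replica steady state
        return 1.0 - p_replica_down ** n_replicas  # P(at least one replica up)

    # e.g., failure every 100 hours, repair in 1 hour:
    for n in (1, 2, 3):
        print(n, "replicas ->", steady_state_availability(n, lam=1/100, mu=1.0))
    # availability climbs from ~0.99 to ~0.9999 to ~0.999999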
AQuA is attempting to apply UltraSAN to availability management by building
a SAN for a projected configuration (along with the reward functions),
converting it to a Markov model, and solving to determine what level of
availability it can provide. I don't see how this could be done on
the fly, so I presume they intend to do off-line evaluations of various
configurations they think they can construct, rate each as being survivable
or available to some degree, and then define "regions" around these precomputed
configurations. For example, UltraSAN predicts that under a given set of
transition assumptions, 3 replicas gives availability of 0.9999 for T seconds,
so a region of high availability would be that 3 replicas are maintained.
Failure to do so constitutes a "reality" region change, which requires
reconfiguration or change to a different "negotiated" region. I don't
think this can deal well with configurations that were not previously planned as
desirable, nor with changes to the transition parameters, as would happen
if the system came under attack. This might allow a degree of "preplanned"
survivability, but does not appear highly adaptive.
None of the papers directly talk about evaluating configurations for
survivability under exceptional conditions. All deal with expected
behavior of a configuration to determine the "reward" from a particular
configuration.
Limitations on the use of UltraSAN in AQuA or OIT appear to be:
-
Modeling in UltraSAN looks quite complex, so not many alternative configurations
can be tried. Hence, they will not be able to employ
as many survivability techniques as they might actually have available.
-
UltraSAN analysis techniques are heavyweight, since they require repeated
simulation or solution of huge Markov models. This means that all
analysis will have to be done in advance; i.e., when contract regions are
established, not when something breaks. This has the same disadvantage
as above.
-
UltraSAN models do not appear to be easily modified if the interconnect
topology changes. This is not a big problem in the original target
environment of UltraSAN, where topology was based on physical interconnects,
but is more of an issue in a service model, where the topology changes
as easily as starting a new process.
-
It is not clear how UltraSAN deals with a time-varying mix of tasks and
loads.
Columbia - QoSME
Columbia University has done work on QoS for stream data. It is not clear
how this work differs from RSVP, which is more mainstream (I haven't investigated
enough to judge at any deep level). Possibly the emphasis in this project
is on scheduling and resource allocation rather than mechanisms.
SMARTS - MODEL & inCharge
MODEL uses the NetMate resource model to describe systems. MODEL does fault
isolation by treating symptoms of faults as bit errors in a code and then
using error-correcting techniques to isolate the "code" (the original fault)
that caused the "received" symptom code. It is an elegant approach and
is claimed to be fast. A problem is that the connection topology is rigid,
and since this determines the code book, component reconfiguration (software
or hardware) will force reconstruction of the code book. Also, since symptoms
appear over time, it is not clear how to assemble them into a "code" that
can be decoded. The difficulty is figuring out an appropriate time window,
particularly when symptom reports may be delayed or lost. The NetMate
model could be a basis for our resource and application models.
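A hypothetical sketch of codebook-style fault isolation of the kind described
above: each fault has a binary "code" of symptoms, and an observed symptom
vector is decoded to the nearest code by Hamming distance, which tolerates a
few lost or spurious symptom reports. The codebook below is invented; the real
inCharge system derives its codebook from the NetMate topology model.
    # Sketch of codebook-based fault isolation: each fault has a binary code
    # of symptoms it should produce, and an observed symptom vector is decoded
    # to the nearest code.  The codebook contents are invented.

    CODEBOOK = {
        #             linkA_down, linkB_down, srv1_slow, srv2_slow
        "router_R1": (1, 1, 1, 1),
        "link_A":    (1, 0, 1, 0),
        "link_B":    (0, 1, 0, 1),
        "server_S1": (0, 0, 1, 0),
    }

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    def isolate_fault(observed_symptoms):
        """Return the fault whose symptom code is closest to what was observed."""
        return min(CODEBOOK, key=lambda fault: hamming(CODEBOOK[fault], observed_symptoms))

    # One symptom report was lost, but the fault is still identified correctly.
    print(isolate_fault((1, 0, 0, 0)))   # -> link_A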
Cornell - Horus
Horus is a group message delivery and synchronization system for networking.
It is a direct descendant of the ISIS system. Horus has been used to manage
replica groups as part of making individual services highly available through
controlled redundancy. Horus is not currently licensable.
5 - Commonalities Between QoS and Survivability Techniques
As should be obvious from the previous sections, there are a number of
similarities in architecture, mechanisms, and metrics between service-level
QoS efforts and our survivability work. In this section, we examine
these commonalities. The next section discusses differences.
The terminology we use comes from CORBA, although the observations are
also applicable in other object-based frameworks.
QoS and Survivability Specified by Client; Managed by System
All work in QoS and survivability assumes that while clients are able to
specify the value they place on receiving a given QoS or having a service
remain available, they should not manage it themselves. There are four
reasons for this:
-
While managing either of these is hard for a service developer to implement
on a per-service basis, the techniques used are generic enough that they
can be developed well by QoS and survivability experts and applied to a
wide variety of services and situations.
-
QoS and survivability both require resource reservation, which in turn
requires substantial knowledge about the resource environment. The environment
is the same for all services in it, so it makes more sense to model it
outside of any of the individual services. This is especially true since the
environment will be expected to evolve (for good or ill) over the lifetime
of the services.
-
Services cannot make unilateral decisions on resource reservation because
they have to compete against other services, whose existence may be unknown
to them. Every service will naturally attempt to maximize its own behavior,
but in resource-constrained conditions, this will be impossible. This will
require at minimum a non-service-centric allocation mechanism.
-
The relative values of the services themselves will change depending on
the situation. Services that are valuable in peacetime may become considerably
less valuable during combat. Unless a service is programmed to understand
all the operational contexts in which it may find itself and adjust its
demands accordingly, this is impossible to achieve. This is unreasonable
to expect, especially for services that may be used for many years and
be adapted to new contexts. It also does not handle greedy services that
choose not to play by the rules, perhaps out of malevolent intent.
Moving the locus of QoS and survivability management out of the client
requires significant extensions to both binding specifications and the
CORBA proxy architecture.
Binding Specifications
A binding specification used in a system managing QoS or survivability
has to specify both more and less than current, OID-based bindings do:
more because things such as performance, abstract state, trust, availability,
and cost of use must be specified in addition to type; and less because
if the specification is too specific (e.g., identifies a single object
or server) QoS and survivability management will not have enough choices
available to do a good job. Once the binding specification becomes more
sophisticated, services will be required to advertise their capabilities
to a far greater extent than they do now. Binding specifications and advertisements
will need to be matched to determine a (possibly ranked) set of (complete
or partial) matches. In an OMG context, this would be the job of a Trader
Service, although one far more sophisticated than any that yet exists, even in
prototype form. Because trading probably requires domain knowledge to achieve
good matches, we will probably see a family of domain-specific Traders.
If this is true, then matching a complete binding specification will require
consulting several Traders and composing their responses. If this approach
is taken, it will be possible for the composition function to weight the
various responses (e.g., to care more about the speed of response than
the security of the service under some circumstances).
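A minimal sketch of this composition idea, assuming hypothetical property
names, weights, and per-facet "Traders"; it is not an OMG Trader interface,
only an illustration of weighting and composing the responses.
    # Hypothetical sketch of composing responses from several domain-specific
    # "Traders", each scoring service advertisements against one facet of a
    # binding specification.  Property names and weights are invented.

    ADVERTISEMENTS = [
        {"name": "MapSvcA", "latency_ms": 50,  "trust": 0.90, "cost": 10},
        {"name": "MapSvcB", "latency_ms": 200, "trust": 0.99, "cost": 2},
    ]

    def speed_trader(spec, ad):
        return 1.0 if ad["latency_ms"] <= spec["max_latency_ms"] else 0.0

    def security_trader(spec, ad):
        return ad["trust"] if ad["trust"] >= spec["min_trust"] else 0.0

    def cost_trader(spec, ad):
        return max(0.0, 1.0 - ad["cost"] / spec["max_cost"])

    def rank_matches(spec, weights):
        """Compose the per-facet scores into a ranked list of candidate bindings."""
        traders = {"speed": speed_trader, "security": security_trader, "cost": cost_trader}
        scored = []
        for ad in ADVERTISEMENTS:
            score = sum(weights[f] * traders[f](spec, ad) for f in traders)
            scored.append((score, ad["name"]))
        return sorted(scored, reverse=True)

    spec = {"max_latency_ms": 100, "min_trust": 0.8, "max_cost": 20}
    # Caring more about speed than security in this situation:
    print(rank_matches(spec, weights={"speed": 0.5, "security": 0.3, "cost": 0.2}))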
Proxy Architecture Extensions
Since a client should be unconcerned with how QoS or survivability is provided,
these must be provided by some other part of the system. Both BBN and OBJS
have chosen independently to make the control locus of this the client-side
proxy (the local stand-in for a remote server object) and the
server-side stub (which handles communication on the server end of the
connection). CORBA proxies and server stubs are generated from IDL specifications
and are responsible for message passing between clients and remote server
objects via the ORB. This requires argument marshaling and interaction
with the underlying communication layer. Standard CORBA proxies hide the
location and implementation of services from the client, but do little
else.
There are several advantages to extending CORBA proxies and server stubs:
-
All communications between clients and servers must, by definition, pass
through the proxies and server stubs. Arguments and return results are
exposed and packaged for movement at this point as part of the normal argument
marshaling. Together, this means that all communications can be mediated
and that what is passing through the connection is easily accessible.
-
A proxy and the server-side stub reside in different address spaces (and
often on different machines). This gives some flexibility as to where a
given extension should be placed. This can be an advantage for performance
reasons. In addition, since functionality can only be provided reliably
if the monitor is incorruptible, having the ability to place monitors and
mediators out of harm's way is advantageous. This is especially true for
things like security monitoring, where a client or service may wish to
avoid the mediation and could attempt to corrupt local monitors; the OMG
security Service does this. Finally, because proxies and stubs can fail
independently, it is possible to place monitors in each to monitor the
health of the other.
-
The CORBA specification allows proxies and stubs to be extended to do other
tasks besides message passing. Alternative proxies have been used to manage
local caches, to distribute processing load among replicas, and to manage
security (in the OMG Security Service). Commercial ORBs often provide server
stub alternatives that wrap the server implementation via inheritance or
delegation.
-
Smart proxies can be generated automatically by an improved IDL compiler
that either replaces or extends the IDL compiler that is provided with
every ORB to generate proxies and server stubs. [Note: replacement is more
likely than extension because, although they could be, these compilers
are in general not open to extension.] IDL can be extended to allow definition
of additional properties as in the case of BBN's QDL for defining QoS attributes.
There is considerable latitude for the internal architecture of extended
proxies and server stubs. The primary questions are: what belongs on the
client side and what belongs on the server side, whether existing proxies
should be extended or used as-is by a more abstract connection object,
and the definition of interfaces to monitors, up-calls to clients, and
external services such as the Survivability Service.
Adopting BBN terminology, we call the entire collection of mechanisms
between a client and a service a connection.
Connections
For both QoS and survivability, a connection between a client and a service
is far more abstract and far more active than in CORBA.
The need for increased abstraction is because the enhanced binding specifications
allow considerably more binding alternatives and consequently more flexibility
in the way the connections are established and managed. This additional
flexibility makes it important to hide not only the location and implementation
of the object providing the service from the client, but also which
object is providing the service. Because of this, the client must not be
allowed to act as if it were connected to a specific object providing the service
or to expect anything about the service other than what it requests in
the binding specification it provides.
The increased activity is because the connection is attempting to guarantee
far more things about the interaction than simply "best effort" to deliver
requests and return results. It does this by maintaining information about
the desired behavior of the connection in the form of a connection contract,
measuring the actual behavior through a collection of monitors attached
to the proxy and server stubs, and taking remedial action if the two do
not match. Remedial actions include changing the implementation of the
object providing service, changing which object provides the service, renegotiating
the connection contract, or terminating the connection gracefully.
Because of the above, the connection itself becomes a first-class object
and should have interfaces through which its activity can be monitored
as well as the interfaces used by the client and server.
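The following hypothetical sketch shows a connection as a first-class object
along the lines just described: it holds a contract, polls monitors, and tries
an ordered list of remedial actions when reality diverges from the contract.
The contract fields, monitor interface, and remedies are illustrative
assumptions, not a defined interface.
    # Hypothetical first-class connection object: desired behavior is held in
    # a contract, actual behavior comes from monitors, and remedial actions
    # are tried in order when the two do not match.

    class Connection:
        def __init__(self, client_proxy, server_stub, contract, monitors, remedies):
            self.client_proxy = client_proxy
            self.server_stub = server_stub
            self.contract = contract    # e.g., {"max_latency_s": 10}
            self.monitors = monitors    # {name: callable returning observed value}
            self.remedies = remedies    # ordered list of remedial callables

        def observed_behavior(self):
            return {name: monitor() for name, monitor in self.monitors.items()}

        def check(self):
            """Management interface: compare reality to contract, remediate if needed."""
            reality = self.observed_behavior()
            violations = {k: v for k, v in reality.items()
                          if k in self.contract and v > self.contract[k]}
            if not violations:
                return "ok"
            for remedy in self.remedies:      # e.g., switch implementation, rebind,
                if remedy(self, violations):  # renegotiate, or terminate gracefully
                    return "remediated"
            return "terminated"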
Monitors
It is important that monitoring be defined and performed by the QoS or
Survivability Service rather than either the client or the server. Part
of this is a trust issue; both clients and servers have to adhere to the
connection contract and it seems unreasonable to trust either to do so.
Another part is that many kinds of monitoring (heartbeats, traffic counting,
etc.) can be independent of application semantics and should be factored
out. The relative importance of different monitors depends on what properties
are defined as important by the connection contract, allowing monitors
to be placed only when needed. Finally, monitoring on the connection itself
allows "QoS-unaware" and "survivability-unaware" services to have a certain
degree of these "added" to them without requiring reimplementation; an
important consideration given the number of existing services that are
unaware.
6 - Issues in Merging QoS & Survivability
It is tempting to think that QoS and survivability can simply be composed.
However, it is not quite so simple, as can be seen from the following.
The later BBN work attempts to add "availability" onto QoS by controlling
a replication factor for the service implementation. This does not
seem to capture a number of key points and seems somewhat redundant.
Specifically:
-
If a service fails to be available, and hence doesn't respond when needed,
isn't that a QoS failure? How is the failure to deliver a timely
response because of service crash different, from the client perspective,
from failure to deliver a timely response due to any other cause that would
be covered by QoS considerations? Why should they be
specified independently?
-
A service that is up 100% of the time but provides low QoS isn't really
"available". Just the fact that a service is running is irrelevant.
-
To meet a given QoS requirement, there is no requirement that a service
be continually available; only that it do what it is supposed to do at
the right time. Assume that a service has a QoS requirement
that 99% of its responses be within a 10-second time window from method
invocation. As a QoS metric, that is pretty clear. However,
when mapped to availability, it is less obvious and there is not a clear
correlation. If a client sends 100 messages/hour and each takes 1
second to process, there is no requirement that the service be "up" 99%
of the time. The service will actually be doing useful work for 100 seconds
out of every 3600 (<3% of the time). So, as long as it can be brought up
fast enough when needed, there is really no 99% "availability" requirement.
This false issue does not arise if the entire matter is couched as a QoS
requirement.
-
If the time to start a service were zero and all services could persistently
save their current state for free, a service could be activated only and
exactly when it had work to do. In this case, there would never
be a need to have a service running until it was needed, which would obviate
the need for replication except for the persistent state. Again,
replication and availability are not exactly the right measures.
-
High availability can be obtained by the use of many replicas. This
is not without overhead for replica coordination, so the result is likely
to be reduced performance leading to possibly reduced QoS. Treating
QoS and availability separately makes this analysis harder.
I think the underlying causes of the above puzzles are these:
-
availability as a client concern
-
too few reconfiguration strategies
-
QoS not pushed through all levels of abstraction
-
inadequate treatment of time
-
configuration and QoS "fragility" not considered
-
inadequate metrics
Problem: Availability as a Client Concern
The BBN QoS work models the "availability" of a service bound to a connection;
in the examples this is done by requiring a particular replication factor
for the service as part of the connection contract. This is problematic
because availability properly applies to a service, while QoS applies
to a connection. It is reasonable that a service be replicated
in order to provide the connection the desired QoS, but the coupling should
be much looser than implied in the BBN work. To see why treating
service availability as a property of a connection is not correct, consider
the following examples, in which the client is satisfied if 99% of requests
are handled in a timely fashion with suitable precision and accuracy.
-
Case #1 - Service A determines that it must be available 99.99% of the
time to satisfy 99% of the requests by a client.
-
Case #2 - Service B determines that it can be booted quickly enough that
it can be idle until needed and never has to be "available".
-
Case #3 - Service C determines that it has enough resources that if available
99.9% of the time it can satisfy 99% of the requests of 10 clients.
It is not clear how any of these reasonable cases can be addressed by a
model where service availability is a direct concern of the client.
Problem: Too Few Reconfiguration Strategies
The QoS work surveyed has a limited number of ways to reconfigure
a service or a connection. This causes a tendency to view QoS problems
as client problems (demands too high) or a server problem (too slow or
unavailable) rather than more abstractly as a connection problem that can
be addressed in a variety of ways to maintain the contracted QoS.
Our Evolution Model for OSAs
paper enumerates rebinding points where things like platform, platform
type, implementation code class, replication factor, replication policy,
and bound service instance can be changed. When QoS is introduced,
a number of other techniques can be used. For example, consider a
QoS contract that requires responses within 30 seconds, but that allows
a null response if the last non-null response was within the last 5 minutes.
In this case, the proxy is free to substitute a null response to meet the
timing constraint in some situations. It is instructive to note that
if precision or accuracy is allowed to go to zero (even if only occasionally),
arbitrarily good timeliness can be achieved by generating null responses
in the proxy. This will allow a synchronous client to continue to
function (at least for a while) even if the server doesn't respond.
An enumeration of a wider range of QoS-preserving responses would help
not only by providing a laundry list of useful techniques, but also by helping
to steer the community away from such "solution-oriented" metrics as availability.
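A sketch of the null-response technique just described; the 30-second and
5-minute figures follow the example, while the proxy and server-call machinery
are hypothetical simplifications.
    import time

    # Sketch of the QoS-preserving trick described above: if the real server
    # cannot respond within the 30-second bound, the proxy substitutes a null
    # response, provided the last non-null response was within the last 5 minutes.

    class NullSubstitutingProxy:
        RESPONSE_BOUND_S = 30
        NULL_ALLOWED_WINDOW_S = 300   # 5 minutes

        def __init__(self, call_server_with_timeout):
            self.call_server = call_server_with_timeout   # returns result or None
            self.last_non_null_time = None

        def invoke(self, request):
            result = self.call_server(request, timeout_s=self.RESPONSE_BOUND_S)
            if result is not None:
                self.last_non_null_time = time.time()
                return result
            # Server missed the bound: a null response keeps the (synchronous)
            # client running, but only if precision is allowed to drop to zero
            # and a real response was seen recently enough.
            recent = (self.last_non_null_time is not None and
                      time.time() - self.last_non_null_time <= self.NULL_ALLOWED_WINDOW_S)
            if recent:
                return None     # null response within the contract's allowance
            raise TimeoutError("QoS contract violated: no timely response available")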
Problem: QoS Requirements Not Pushed Through All Levels of Abstraction
Both QoS and survivability must deal with multiple levels of abstraction.
A service can only provide a given level of QoS if the services it relies
upon can themselves provide the QoS required of them. Similarly,
a service is survivable if the services it relies upon survive or if the
service can reconfigure to require different base services. Although both
communities realize this, at present, neither handles it particularly well.
In particular, problems revolve around mapping requirements at one level
of abstraction down to requirements at lower levels of abstraction, and
efficiently reserving resources through possibly several layers of lower-level
services.
A related difficulty is that the QoS policies at higher layers need
to somehow be reflected down to lower layers in meaningful form.
Consider this example. A client requires high availability (in the
BBN model) and fast response. If availability was the only QoS requirement,
the existence of the requisite number of replicas would be sufficient to
maintain the contract. However, this is not adequate, since response
time also matters. Different replication policies (primary copy,
voting, etc.) give different QoS behaviors. Further, the messaging
policy within a replica group may vary depending on the QoS needs of the
client. For example, normally reads are sent only to a single replica
to conserve bandwidth and processing. However, if timely delivery
is crucial, it makes sense to attempt reads from multiple replicas to ensure
that at least one responds in time (assuming of course that the multiple
reads will not compete for the same bandwidth and become even slower).
The policy to follow could vary based on a number of factors, including
load on the resources, criticality of timely response, client value, and
previous behavior of the connection (e.g., is it barely making the timing
bound?). To achieve this, it would appear that the high-level QoS
goal must somehow be pushed down to the replication and messaging subsystems.
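A hypothetical sketch of such policy reflection, choosing a replica read
policy from a few of the factors listed above; the factor names and thresholds
are invented for illustration.
    # Hypothetical sketch of reflecting a high-level QoS goal down to a replica
    # read policy, as discussed above.  Factor names and thresholds are invented.

    def choose_read_policy(timeliness_critical, resource_load, timing_margin_s):
        """Decide how many replicas a read should be sent to."""
        if not timeliness_critical:
            return "read-one"     # conserve bandwidth and processing
        if resource_load > 0.8:
            return "read-one"     # extra reads would just add contention
        if timing_margin_s < 2.0:
            return "read-all"     # barely making the bound: hedge widely
        return "read-two"         # modest redundancy for timeliness

    print(choose_read_policy(timeliness_critical=True, resource_load=0.3, timing_margin_s=1.0))
    # -> read-all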
Problem: Inadequate Treatment of Time
The QoS work surveyed seems to treat all configurations as having present
state only. This neglects the fact that configurations will change
state (for better or worse) either on demand or because of some random
event, and that these changes take some amount of time to occur.
In some of these new states, the service will be able to provide the required
QoS and in others it will not. Clients promise certain invocation
rates as part of the QoS contract, which is part of the treatment of time,
but nowhere is the time for a service to change configurations addressed.
As noted, this means that services must be kept in a state where they can
respond in the QoS bound, which can be very wasteful.
Problem: Configuration & QoS Fragility
QuO treats all responses that fall inside a negotiated region as being
of equal worth. That is fine as far as the client is concerned, but
is misleading when the ability to meet future QoS goals is considered.
Consider the following two contracts whose timeliness components are:
Contract #1
-
10 sec - 30 sec response
-
15 sec average response over any 10 invocation intervals
-
no violations allowed
Contract #2
-
10 sec - 30 sec response
-
15 sec average response over any 10 invocation intervals
-
no more than 1 timeliness failure per 100 invocations AND no more than
2 average responses over 20 sec per 100
intervals
With these two contract fragments:
-
in Contract #1, if the average response time is 14.9 seconds, that is not
as safe as if it were 11 seconds, since a response time of 30 seconds on
the next request would cause a violation in the former case but not the
latter
-
in Contract #2, if a timing goal has been missed, another timing goal cannot
be missed until a certain number of responses have been made
Further, regardless of the previous behavior of the connection:
-
if the service configuration fails, there is an increased chance that the
next QoS target will not be met; this is influenced by how long it takes
to restart the service (if possible)
-
if the threat environment becomes more hostile, the probability of missing
a QoS target increases, even if no targets have been missed so far
-
if load increases, missing QoS targets becomes more likely, even though
none have been missed so far
There needs to be some treatment of how likely it is that a negotiated
region will be violated in the future. Otherwise, the only time a
reconfiguration takes place is when the region has been violated,
by which time a QoS failure has occurred. Because the time to reconfigure
is not treated by the present body of QoS work, it is not even known how
long the connection will be unusable. To make matters worse, some
failures cannot be recovered from. As part of evaluating how brittle
a connection is, there are definitely gradations of "badness": the
probability of a violation, how severe the violation is likely to be, how
long it will last, and whether alternative levels of QoS can be reached
from the new configuration.
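One way to quantify this kind of headroom, sketched against Contract #1 above
(the "headroom" measure itself is our own illustration, not something in the
QuO work):
    # Sketch of measuring how close a connection is to violating Contract #1
    # above (10-30 sec responses, 15 sec average over any 10 invocations,
    # no violations allowed).

    WINDOW = 10
    AVG_BOUND_S = 15.0
    MAX_RESPONSE_S = 30.0

    def headroom(last_nine_responses_s):
        """Largest next response time that keeps the 10-invocation average legal."""
        assert len(last_nine_responses_s) == WINDOW - 1
        allowed = AVG_BOUND_S * WINDOW - sum(last_nine_responses_s)
        return min(allowed, MAX_RESPONSE_S)

    # Both histories are inside the negotiated region, but the first is brittle:
    print(headroom([14.9] * 9))   # ~15.9 s of slack; a 30 s response violates
    print(headroom([11.0] * 9))   # 30.0 s of slack; even a worst-case response is legal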
The BBN work partially addresses this problem of brittleness by requiring
a given availability for services based on a precomputed replication factor,
which is made part of the QoS region contract. By maintaining replicas,
the service never becomes too brittle. While this approach suffers
from the limitations discussed above, it brings out an interesting point
that we will exploit more fully in our work combining QoS and survivability.
In the base QuO work, all parts of the region contract were relevant to
the client and all violations were seen by the client (although often they
could be fixed by the proxy). However, the replication factor is
not relevant to the client except as a rough measure of the ability to
deliver future QoS. When a contract is violated due to replica
failures, only the proxy is concerned. We are working on extending
and formalizing this notion that a contract should have parts relevant
to both the client and the proxy so that QoS can be delivered and brittle
states avoided.
Problem: Inadequate Metrics
Both survivability and QoS attempt to say something about the "goodness"
of a connection. To this end, they each define metrics. However,
in our opinion, they say very different things. Specifically, QoS (in its
broadest sense) addresses current properties of the connection,
whereas survivability is concerned with the future of the connection.
In other words, survivability addresses whether the connection is likely
to maintain a desired QoS. Both QoS work and survivability address restoring
a connection to a good QoS; the difference is that when doing so, survivability
considers the ability of the new configuration to survive whereas QoS does
not. This is discussed in a paper Survivability
is Utility.
This has two principal effects.
-
The survivability metrics must differ from the QoS metrics, since the latter
measure only current state. To give an extreme example of the difference,
consider that placing a service on a lightly loaded, but very vulnerable
machine will score highly on the QoS metric, but low on the survivability
metric. These metrics, and their relationship, are discussed later.
-
A Survivability Service will need to take expected future behavior into
account when allocating resources. This necessitates some sort of model
of likely future events that could cause a configuration to change.
Bibliography of QoS Papers
Papers are organized by project.
Rome Laboratory
-
Quality of Service for AWACS Tracking, Patrick Hurley, Tom Lawrence,
Tom Wheeler, Ray Clark, to appear in the 4th International Command
and Control Research and Technology Symposium, 14-16 September 1998, Nasby
Park (Stockholm), Sweden
-
Anomaly Management, Thomas F. Lawrence, AFRL, internal report, 1997
BBN Cluster Papers
QuO
DIRM
AQuA
OIT
SRI Papers
-
QoS Taxonomy
(1997)
good definitions of QoS parameters and benefit function
-
Modeling
for Adaptive QoS (1997)
describes a multi-level model for application-specific QoS specification
that supports implementation alternatives and variable QoS
-
QoS-Based
Allocation (1997)
describes a scheduler for resource allocation in a network that is
QoS aware
Illinois Papers
SMARTS Papers
-
"High Speed & Robust Event Correlation" - Yemini, Kiger, Mozes,
Yemini, Ohsie.
Description of the SMARTS inCharge fault analysis system.