QoS & Survivability
David Wells
Object Services and Consulting, Inc.
-
March 1998
-
Revised August 1998
-
This research is sponsored by the Defense Advanced Research
Projects Agency and managed by Rome Laboratory under contract F30602-96-C-0330.
The views and conclusions contained in this document are those of the authors
and should not be interpreted as necessarily representing the official
policies, either expressed or implied, of the Defense Advanced Research
Projects Agency, Rome Laboratory, or the United States Government.
© Copyright 1997, 1998 Object Services and Consulting,
Inc. Permission is granted to copy this document provided this copyright
statement is retained in all copies. Disclaimer: OBJS does not warrant
the accuracy or completeness of the information in this document.
1 - Introduction
In the past several years, there has been considerable research in the
areas of quality of service (QoS) and survivability in an
attempt to facilitate the construction of large software systems that behave
properly under a wide range of operating conditions and degrade gracefully
when outside this range. From the 10,000 foot level, quality of service
addresses the goodness and timeliness of results delivered to a client,
while survivability addresses how to repair or gracefully degrade when
things go awry and the desired behavior cannot be maintained. These
two areas are obviously related, because QoS forms at least a part of the
definition of the "desired" behavior of a system that survivability techniques
are attempting to preserve or gracefully degrade.
This paper explores the relationship between quality of service and
survivability. Section 2 discusses the concepts of quality of service
and survivability. Section 3 identifies and presents highlights of important
QoS research efforts. Section 4 discusses these projects in more detail,
particularly efforts whose approach to QoS is compatible with our approach
to survivability. Section 5 identifies technical "points of intersection"
between the QoS and survivability work that could eventually lead to a
confluence. Section 6 identifies some issues that arise when QoS
and survivability are combined and points out some weaknesses in the way
the existing projects add and measure survivability.
2 - Quality of Service & Survivability
The concept of quality of service has traditionally been applied only at
the network (and sometimes operating system) level. At that level,
QoS deals with issues such as time to delivery, bandwidth, jitter, and
error rates. Network-level QoS is important because many applications will
not function in an acceptable or useful manner unless the network they
use can provide some minimal service guarantees. It has been observed
that just as all services and applications rely on networks, they also
rely on other applications and services and these must also make some QoS
guarantees to allow the application to perform correctly.
A pair of short papers from Rome Laboratory describe
service-level QoS as a function of precision (how much), accuracy
(how correct), and timeliness (does it come when needed).
For example, a map may be insufficiently precise (100m instead of 10m resolution),
inaccurate (things in the wrong places), or untimely (delivered too late
to be useful). Unless all three requirements are met, a client is
not getting what it needs and therefore the result is lacking in QoS.
A benefit function is defined for each point in this 3-D space stating
the value to the client of receiving that particular QoS. Distance
metrics for the argument spaces are application dependent. For example,
the distance between "red" and "orange" will be less than the distance
between "red" and "blue" in a spectral dimension, but not in a textual
dimension. The benefit function is similarly application-specific and may
be situation dependent as well.
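To make the preceding description concrete, the following Python sketch shows
one way such a benefit function over the (precision, accuracy, timeliness)
space might be written. The map example follows the text above; the specific
thresholds, weights, and the multiplicative combination are illustrative
assumptions, not taken from the Rome Laboratory papers.
    # A minimal sketch of a service-level QoS benefit function over the
    # (precision, accuracy, timeliness) space described above.  The map example
    # is from the text; all numeric thresholds below are illustrative only.

    def map_benefit(resolution_m, position_error_m, delivery_delay_s):
        """Value (0.0 - 1.0) to a hypothetical client of receiving a map
        with the given precision, accuracy, and timeliness."""
        # Precision: a 10m map is fully useful, a 100m map much less so.
        precision = 1.0 if resolution_m <= 10 else 0.2 if resolution_m <= 100 else 0.0
        # Accuracy: benefit falls off linearly as positional error grows.
        accuracy = max(0.0, 1.0 - position_error_m / 50.0)
        # Timeliness: a late map is worthless to this client.
        timeliness = 1.0 if delivery_delay_s <= 60 else 0.0
        # Unless all three requirements are met the result lacks QoS, so the
        # combined benefit is a product rather than, say, a weighted sum.
        return precision * accuracy * timeliness

    print(map_benefit(10, 5, 30))    # -> 0.9   (precise, accurate, timely)
    print(map_benefit(100, 5, 30))   # -> ~0.18 (insufficiently precise)
    print(map_benefit(10, 5, 300))   # -> 0.0   (delivered too late)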
QoS at any given level of abstraction places QoS requirements on the
components used to provide it. In the example, delivering a 10m map
requires the retrieval of a certain number of bits of map data.
A survivable system is one that can repair itself or degrade gracefully
to preserve as much critical functionality as possible in the face of attacks
and failures. A survivable system needs to be able to switch compatible
services in an established connection and substitute acceptable alternatives.
It must also be able to dynamically adapt to the threats in its environment
to reallocate essential processing to the most robust resources.
QoS and survivability are intricately linked; they are not the same,
but neither makes sense without the other.
-
Survivability without some notion of what is supposed to be surviving
is pointless; the what is provided by QoS metrics.
-
QoS "guarantees" that can't be made to survive or adapt under changing
conditions are not very useful as guarantees, and could in fact lead to
denial of service attacks as opponents bring a system to its knees by degrading
QoS and causing the QoS management system to continually add superfluous
resources.
3 - Overview of QoS Projects
Recent work, much of it funded by DARPA-ITO and administered by Rome Laboratory
through the Quorum
program, is extending QoS concepts and mechanisms to higher semantic levels
to allow the definition, measurement, and control of the quality of service
delivered by services and complete applications. There are three
major groupings of projects. SRI and BBN each have architectural frameworks
and are developing or adapting multiple pieces of technology to fit their
frameworks. The BBN and SRI frameworks address different types of QoS needs
and do not appear compatible. Several independent projects are developing
individual pieces of technology. All projects are administered by Rome
Laboratory, which also does some technology development. The groupings
and relationships of projects (shown in QoS
Projects Map) are given below. The individual projects are described
in more detail in Section 4.
BBN Cluster: The BBN cluster consists of three architecture/infrastructure
efforts based on a CORBA client-server model:
-
QuO - Quality Objects
team: BBN - see: papers
-
AQuA - Adaptive
Quality of Service Availability
team: BBN, Illinois, Cornell - see: papers
-
OIT - Open Implementation
Toolkit for Creating Adaptable Distributed Applications
team: BBN, Illinois - see: papers,
and one application demonstration project:
-
DIRM - Dynamic
Integrated Resource Management
team: BBN, Columbia, SMARTS - see: papers
These projects are closely related, and in many ways it is useful to think
of them as one large project. QuO developed a general framework for
QoS in CORBA that is being refined by AQuA and extended by OIT to address
service availability management. There is to be a "production QuO"
done under the DIRM project.
SRI: SRI is developing an architecture and scheduling algorithms
for the delivery of end-to-end QoS for a data streaming model.
Independent Projects: These projects are developing modeling and
analysis/simulation tools that could be used by a QoS management system
to model resources and QoS requirements and to schedule resources. Several
of the tools were developed for another purpose and are being adapted to
the QoS domain. All have had some relationship with the BBN cluster of
projects. Projects are:
-
UltraSAN
team: Illinois - see papers
-
QoSME
team: Columbia
-
InCharge/MODEL
team: SMARTS, Inc. - see papers,
-
Horus
team: Cornell
4 - QoS Project Details
BBN Related Projects - QuO, AQuA, OIT, DIRM
The QuO project is developing Quality Objects
that can manage their own QoS. QuO is integrated with the CORBA architecture,
in that most of the work is done by extending client-side and server-side
ORB stubs. These "smarter" stubs are generated from an extension to IDL
called QDL that allows specifying things about service and connection quality.
QuO assumes a CORBA-like processing model in which there are client-server
and peer-peer relationships and in which the exact processing loads are
unknown and can be quite variable. This distinguishes QuO from the
SRI work, where the information flow of the applications takes the shape
of a DAG and where processing and QoS requirements are assumed to be well
understood a priori, as is the case in multi-media delivery.
Using QDL, an object can specify the QoS it expects from each service
that it uses and can specify what its usage patterns will be (e.g., invocations/sec).
An object will similarly use QDL to specify the QoS it knows how to provide
(which can be different for different implementations or resource levels).
These specifications are used to create client-server bindings called connections.
Connections are first-class objects as they are in our survivability model
defined in Composition Model for OSAs.
To make the writing of QoS specifications and the creation and maintenance
of connections tractable, QoS is partitioned into regions of normative
behavior. Within each region it is assumed that every QoS is equally
usable. Region definitions look like predicates in a language like
C. Some of these regions (e.g., Normal, Overload, InsufficientResources,
Idle) are predefined, but others can be defined using QDL. Clients
and services parameterize these regions to define when they are in a particular
one. For example, a client may say that when in Normal mode, it will
make between 2 and 10 calls/sec to a particular service; anything over
10 calls/sec puts the client into Overload mode.
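A minimal Python sketch of such region predicates follows. The region names
and the 2-10 calls/sec thresholds come from the example above; the
representation itself (plain predicates over an observed call rate) is an
assumption, not QDL.
    # Sketch of QoS region predicates in the spirit of QDL's negotiated regions.
    # Region names and the 2-10 calls/sec thresholds follow the example above;
    # the Python representation itself is an assumption.

    REGIONS = {
        "Idle":     lambda calls_per_sec: calls_per_sec == 0,
        "Normal":   lambda calls_per_sec: 2 <= calls_per_sec <= 10,
        "Overload": lambda calls_per_sec: calls_per_sec > 10,
    }

    def current_region(calls_per_sec):
        """Classify observed client behavior into a named QoS region."""
        for name, predicate in REGIONS.items():
            if predicate(calls_per_sec):
                return name
        return "Undefined"   # e.g., 1 call/sec falls between Idle and Normal

    print(current_region(5))    # Normal
    print(current_region(25))   # Overload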
The use of regions means that minor (insignificant) deviations in the
QoS delivered do not require changes to the service or connection, which
substantially simplifies runtime processing. Similarly, if clients and
servers agree on common meanings for named regions, matching client and
server specifications is simplified, which is important since this activity
must often be done on the fly. The use of QoS regions is a significant
difference between the BBN and SRI approaches; in the SRI approach, the
benefit function is allowed to be continuous.
The regions discussed so far are called negotiated regions since
they represent where the client and server try to operate and form the
basis for the connection contract. As long as both the client and server
operate in their negotiated regions, all is well. However, it is possible
for either the client or the server (which for these purposes also includes
the connection between the proxy and the remote server implementation)
to deviate from the negotiated region, either by overloading the server
or failing to deliver results as required. Because of the potential for
operating outside the negotiated regions, QuO defines reality regions
to represent the actual QoS-relevant behaviors of client and servers. Reality
regions are defined in the same way as negotiated regions; it appears that
for any given connection, the same set of specifications will be used for
both kinds of regions. If the observed reality region differs from
the negotiated region, the connection is no longer valid and remedial action
must be taken.
Various monitors determine the reality region a connection is actually
in. Monitors are presumably predefined to monitor the kinds of things that
QuO cares about; others can presumably be defined, and there is a claim
that they can be deployed selectively to only monitor those QoS items of
interest. Types of monitors include counting invocations, timers,
and heartbeats. It seems that there could be a whole subsystem for inserting
probes in useful places.
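The following hypothetical sketch (QuO's actual monitor interfaces are not
published at this level of detail) shows the flavor of an invocation-rate
monitor that a proxy could consult when classifying the reality region:
    import time
    from collections import deque

    # Hypothetical invocation-counting monitor of the kind a QuO-style proxy
    # might attach to a connection to determine its reality region.

    class InvocationRateMonitor:
        def __init__(self, window_s=1.0):
            self.window_s = window_s
            self.timestamps = deque()

        def record_invocation(self):
            """Called by the proxy each time a request passes through it."""
            self.timestamps.append(time.time())

        def calls_per_sec(self):
            """Observed rate over a sliding window; drives region classification."""
            now = time.time()
            while self.timestamps and now - self.timestamps[0] > self.window_s:
                self.timestamps.popleft()
            return len(self.timestamps) / self.window_s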
Operationally, a QuO object is a distributed object where part
of it lies on the server side and part(s) lie with its various clients.
It is appropriate to think of the client side as being a very smart proxy
object that knows how to do QoS related actions. (It appears that
this stub can be further tailored to do just about anything, but that doesn't
look like a good idea unless the architecture specifies what kinds of things
should be done in the proxy.) A QuO proxy keeps track of the
negotiated QoS mode of the connection; interacts with monitors; provides
feedback to the client through interfaces generated from the QDL; and takes
certain actions to maintain the negotiated QoS.
Client-side proxies can take actions to maintain a QoS mode or change
the negotiated QoS region. A client makes a service request
through a client-side proxy which decides what to do with method invocations.
This could include sending the requests to a server, servicing them from
a cache, routing to replicas, routing to a varying set of implementation
objects, or ignoring them and signaling an exception. How many
of these they actually plan to provide is unknown, but this is definitely
where they fit. It fits very well with our survivability model, except
that we would extract this from the purview of the client and move it to
a Survivability Service to handle competing service demands. In an
ideal world, there would be a large number of generic actions that could
be taken by any proxy to maintain QoS or shift regions gracefully
when it detects a change in client behavior. Our Evolution
Model for OSAs defines many ways to evolve configurations. Some
of the actions a client-side proxy can take to maintain a region or change
regions are listed below (a minimal dispatch sketch follows the list):
-
A server may be able to shift its implementation in order to stay in the
same region without disturbing the client. An example given is to
change implementations to trade bandwidth for processing power as resource
availability changes. Since the server stays in the same negotiated
region, the client doesn't need to see any change.
-
The client may request a different negotiated region. For example,
it may go idle and negotiate a lower QoS region. It is also possible
that a client is detected to have entered a different region. For
example, the regions may define Idle as no messages of any kind for 5 minutes.
If the reality region enters Idle, it can be treated as if the client signaled
that it was going Idle. Thus a client can change modes without having
to be implemented for QoS.
-
A server may have multiple implementation strategies that will allow it
to enter different regions. For example, if the client goes Idle,
the server may scale back its own resources, while if a client enters an
Overload mode, the server may add resources or shift to a different implementation.
-
The proxy may make an up-call to the client to find out what to do.
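To make the dispatch alternatives above concrete, here is a minimal,
hypothetical Python sketch of a client-side proxy that chooses among a few of
them (local cache, replica routing, signaling an exception). It is a sketch of
the idea only, not the QuO proxy implementation.
    # Hypothetical client-side proxy illustrating the dispatch alternatives
    # discussed above (serve from cache, route to a replica, raise an exception).

    class QoSViolation(Exception):
        pass

    class SmartProxy:
        def __init__(self, replicas, monitor, cache=None):
            self.replicas = replicas    # callable server stand-ins
            self.monitor = monitor      # any object with record_invocation()
            self.cache = cache or {}
            self.next_replica = 0

        def invoke(self, method, *args):
            self.monitor.record_invocation()
            key = (method, args)
            # 1. Serve from a local cache when possible.
            if key in self.cache:
                return self.cache[key]
            # 2. Otherwise round-robin over replicas, skipping ones that fail.
            for _ in range(len(self.replicas)):
                replica = self.replicas[self.next_replica]
                self.next_replica = (self.next_replica + 1) % len(self.replicas)
                try:
                    result = replica(method, *args)
                    self.cache[key] = result
                    return result
                except Exception:
                    continue
            # 3. No replica could service the request: signal a QoS exception
            #    so the client (or a Survivability Service) can react.
            raise QoSViolation("no replica available for %s" % method)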
Feedback to the client is done by callbacks generated from the QDL.
These form an additional interface to the client that is used only for
QoS purposes. The proxy can notify the client that the reality region
has deviated from the negotiated region. The client must provide
handlers for these callbacks to either begin a negotiation process
with the proxy or to change its own behavior in some way (e.g., slow down,
accept lower precision, etc.). Again, it appears that there is a
wealth of opportunity to do good things here, but it is not clear how many
of them are actually defined.
The QuO papers give a fairly detailed figure of the form of the proxy
objects. The partitioning of function seems good and would allow
us to extend it if we needed to. This looks like where a lot of their
work went.
A weakness of the QuO project seems to be that they appear to have a
sort of "one-level" service model, in which clients call services which
are leaves. Also, the allocation of resources appears to be under
the control of the proxies, without much regard for other demands. They
briefly mention a language, Resource Definition Language (RDL), for specifying
resources, and it would be reasonable for there to be some sort of auction
or scheduler that gives resources out in a non-object-centric way, but this
gets little attention. It may be possible to handle both multiple
levels of objects and sensible allocation for competing resources, but
they don't seem to do it. It appears that the architecture and models
do make this possible.
AQuA is improving the QuO proxy architecture, attaching RSVP
to allow QuO proxies to control QoS over CORBA, and adding regions to deal
with availability as well as QoS. Not much has been published, so what follows
is my supposition as to how this will work. It appears that they
plan to predefine a number of negotiated regions for availability, where
the region predicate has to do with things like how many replicas will
be required in order to stay in that region. Hooks down to a replication
manager (probably via Horus or using Electra/Ensemble) are used as the
monitors to detect whether the reality region for availability has changed
(e.g., a replica died). Actions in response to an availability change
would be to start another replica or inform the client of the change and
let the client decide what to do in the same way it decides what to do
about QoS region changes. UltraSAN is being used (or at least considered)
as a way to determine the availability region predicates. Using UltraSAN
(and possibly SMARTS/Model), they will model various configurations and
use the UltraSAN simulation and analysis tools to determine how long a
configuration is likely to stay alive or how long it will take to reconfigure
(I haven't actually seen the latter discussed at all, so maybe they do
not plan to do this). They will define and analyze configurations
until they find one that has the right predicted availability for a particular
client need. That will then form a contract for a specific availability
region.
It is not clear that they maintain a connection between QoS and availability.
It looks like region predicates express both QoS and availability concerns
in the same predicate. Since increased availability can degrade QoS
(slower with more replicas depending on the operation), this needs to be
addressed. It does look like they will be able to determine that
the combined reality region does not match the combined negotiated region,
but that will not help with the act of finding a good combined region definition
that can actually be instantiated.
OIT is just getting started and not much has been written about
it. It looks like they will be cleaning up some QuO internal architecture
and perhaps developing a toolkit to help write and manage contracts.
This may integrate with the use of UltraSAN; the published material does not say so, but that
would not be an unreasonable part of such a toolkit. It also appears that
OIT will be using these extensions to provide some support for survivability
along the lines of AQuA; it is not clear what the relationship is between
them.
DIRM has much more the feel of a technology demonstration than
the other projects. Little is written about it, but it appears that
the idea is to show off some QoS concepts in a collaborative decision process
in some sort of military command post setting.
SRI
SRI is working on end-to-end QoS, with particular emphasis on managing
data streaming within a DAG of processes. The strengths of the work
are a very clear modeling methodology and language (including a
pictorial form) for specifying alternative implementations, natural handling
of information and processing flows that encompass arbitrary numbers of
steps, and a scheduling algorithm for allocating processes and communication
to available resources as described in a system model.
Resources are allocated by a scheduler based on a modified Dijkstra shortest-path
algorithm for finding a least-cost path (with respect to a definable cost function
such as least time, lowest cost, or highest throughput) through a graph representing
the required processing steps and communication. There is a delta form
of the algorithm that gives delta-suboptimal decisions much faster.
Concerns are the algorithm's speed and the fact that it appears to require a
centralized scheduler.
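As an illustration of the underlying idea only (not SRI's algorithm, which
also handles delta-suboptimal updates and resource constraints), a standard
Dijkstra search over a graph of processing and communication steps with a
pluggable cost function might look like the following; the example graph and
costs are invented.
    import heapq

    # Illustrative least-cost path search over a graph whose edges represent
    # processing or communication steps, with a pluggable edge-cost function.
    # Plain Dijkstra; treat it only as a sketch of the underlying idea.

    def least_cost_path(graph, source, sink, cost):
        """graph: {node: [(neighbor, edge_data), ...]}
        cost: edge_data -> non-negative number (e.g., latency or 1/throughput)."""
        dist = {source: 0.0}
        prev = {}
        heap = [(0.0, source)]
        while heap:
            d, node = heapq.heappop(heap)
            if node == sink:
                break
            if d > dist.get(node, float("inf")):
                continue
            for neighbor, edge in graph.get(node, []):
                nd = d + cost(edge)
                if nd < dist.get(neighbor, float("inf")):
                    dist[neighbor] = nd
                    prev[neighbor] = node
                    heapq.heappush(heap, (nd, neighbor))
        if sink not in dist:
            return None, float("inf")
        path, node = [sink], sink
        while node != source:
            node = prev[node]
            path.append(node)
        return list(reversed(path)), dist[sink]

    # Hypothetical example: choose between a compressed and an uncompressed route.
    graph = {
        "capture":    [("compress", {"ms": 40}), ("send_raw", {"ms": 10})],
        "compress":   [("send_small", {"ms": 20})],
        "send_raw":   [("display", {"ms": 120})],
        "send_small": [("display", {"ms": 30})],
        "display":    [],
    }
    print(least_cost_path(graph, "capture", "display", cost=lambda e: e["ms"]))
    # -> (['capture', 'compress', 'send_small', 'display'], 90.0)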
It appears that the scheduling algorithm precludes the use of this work
in a peer-peer or client-server setting; if so, this is a major differentiator
between the SRI and BBN work. There are several reasons for this. It is
hard to see how a shortest path algorithm can be adapted to handle arbitrary
numbers of loop iterations as can be found in general client-server or
peer-peer systems. The algorithms also require that processing steps
and flows be identifiable on a time-step basis; this doesn't seem
to match an environment in which service requests are either random or
have high variance. All their examples deal with media delivery.
The papers mention the ability to separate feedback paths from feed-forward
paths, but this is not explained.
They define more completely than anyone else surveyed the meaning of
precision, accuracy, and timeliness, specifying explicit parameters for
these concepts. Each parameter has several components, including
absolute values, relationships between values, expected values, bounds,
and variance. It looks like a QoS specification could become quite
complex. They also define a benefit function in the obvious
way. It is not clear how their scheduling algorithm deals with relative
values or variance that their QoS specifications allow.
This work appears most suitable for rather tightly controlled situations
where a high level of QoS is required. Unlike the BBN work, their
solution does not appear to be open.
UltraSAN
UltraSAN is a system for modeling and analyzing networks in which events
such as workload and failures are probabilistic. A system is modeled as
a stochastic activity network (SAN), which is an extension of a Petri net.
A SAN extends Petri nets by allowing transitions to be probabilistic, multiple
tokens to reside at any given place, and "gates" to act as predicates
to define which transitions are allowed at any given time. The result
looks like a stylized dataflow graph of the activity under analysis.
Reward functions are given for transitions to designated states and for
remaining in a state for a duration. The point of framing the system
as a SAN is that a SAN can be converted to a Markov model, to which they
know how to apply analytical and simulation techniques. They have
tools to define the SAN and to convert it to a Markov model and analyze
& simulate (by means of fault injection) the resultant Markov model.
They also have some partitioning and replication constructs to aid in the
construction of SANs. Basically, you can "join" SANs together so
that they share "places". This allows replicas to be composed and
for large SANs to be created from small SANs. They also have a technique
based on these replication tricks to reduce the state space of the Markov
model, which otherwise would quickly become unmanageable for reasonably
sized SANs. The software is currently distributed to 129 universities and
several businesses.
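A greatly simplified illustration of why the conversion to a Markov model is
useful: assuming n independent replicas, each failing at rate lam and repaired
at rate mu (a simple birth-death chain, far cruder than anything UltraSAN
would actually model), the steady-state availability of "at least one replica
up" can be computed directly. The rates below are invented.
    # Toy availability model in the spirit of the SAN-to-Markov approach above.
    # Assumes n independent replicas, each failing at rate lam and being
    # repaired at rate mu, and that the service is "available" while at least
    # one replica is up.  The numbers are illustrative only.

    def steady_state_availability(n_replicas, lam, mu):
        p_replica_down = lam / (lam + mu)          # per-replica steady state
        return 1.0 - p_replica_down ** n_replicas  # P(at least one replica up)

    # e.g., failure every 100 hours, repair in 1 hour:
    for n in (1, 2, 3):
        print(n, "replicas ->", steady_state_availability(n, lam=1/100, mu=1.0))
    # availability climbs from ~0.99 to ~0.9999 to ~0.999999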
AQuA is attempting to apply UltraSAN to availability management by building
a SAN for a projected configuration (along with the reward functions),
converting it to a Markov model, and solving to determine what level of
availability it can provide. I don't see how this could be done on
the fly, so I presume they intend to do off-line evaluations of various
configurations they think they can construct, rate each as being survivable
or available to some degree, and then define "regions" around these precomputed
configurations. For example, UltraSAN predicts that under a given set of
transition assumptions, 3 replicas gives availability of 0.9999 for T seconds,
so a region of high availability would be that 3 replicas are maintained.
Failure to do so constitutes a "reality" region change, which requires
reconfiguration or change to a different "negotiated" region. I don't
think this can deal well with configurations that were not previously planned as
desirable, nor with changes to the transition parameters, as would happen
if the system came under attack. This might allow a degree of "preplanned"
survivability, but does not appear highly adaptive.
None of the papers directly talk about evaluating configurations for
survivability under exceptional conditions. All deal with expected
behavior of a configuration to determine the "reward" from a particular
configuration.
Limitations on the use of UltraSAN in AQuA or OIT appear to be:
-
Modeling in UltraSAN looks quite complex, so not many alternative configurations
can be tried. Hence, they will not be able to employ
as many survivability techniques as they might actually have available.
-
UltraSAN analysis techniques are heavyweight, since they require repeated
simulation or solution of huge Markov models. This means that all
analysis will have to be done in advance; i.e., when contract regions are
established, not when something breaks. This has the same disadvantage
as above.
-
UltraSAN models do not appear to be easily modified if the interconnect
topology changes. This is not a big problem in the original target
environment of UltraSAN, where topology was based on physical interconnects,
but is more of an issue in a service model, where the topology changes
as easily as starting a new process.
-
It is not clear how UltraSAN deals with a time-varying mix of tasks and
loads.
Columbia - QoSME
Columbia University has done work on QoS for stream data. It is not clear
how this work differs from RSVP, which is more mainstream (I haven't investigated
enough to judge at any deep level). Possibly the emphasis in this project
is on scheduling and resource allocation rather than mechanisms.
SMARTS - MODEL & inCharge
MODEL uses the NetMate resource model to describe systems. MODEL does fault
isolation by treating symptoms of faults as bit errors in a code and then
using error-correcting techniques to isolate the "code" (the original fault)
that caused the "received" symptom code. It is an elegant approach and
is claimed to be fast. A problem is that the connection topology is rigid,
and since this determines the code book, component reconfiguration (software
or hardware) will force reconstruction of the code book. Also, since symptoms
appear over time, it is not clear how to assemble them into a "code" that
can be decoded. The difficulty is figuring out an appropriate time window,
particularly when symptom reports may be delayed or lost. The NetMate
model could be a basis for our resource and application models.
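A hypothetical sketch of codebook-style fault isolation of the kind described
above: each fault has a binary "code" of symptoms, and an observed symptom
vector is decoded to the nearest code by Hamming distance, which tolerates a
few lost or spurious symptom reports. The codebook below is invented; the real
inCharge system derives its codebook from the NetMate topology model.
    # Sketch of codebook-based fault isolation: each fault has a binary code
    # of symptoms it should produce, and an observed symptom vector is decoded
    # to the nearest code.  The codebook contents are invented.

    CODEBOOK = {
        #             linkA_down, linkB_down, srv1_slow, srv2_slow
        "router_R1": (1, 1, 1, 1),
        "link_A":    (1, 0, 1, 0),
        "link_B":    (0, 1, 0, 1),
        "server_S1": (0, 0, 1, 0),
    }

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    def isolate_fault(observed_symptoms):
        """Return the fault whose symptom code is closest to what was observed."""
        return min(CODEBOOK, key=lambda fault: hamming(CODEBOOK[fault], observed_symptoms))

    # One symptom report was lost, but the fault is still identified correctly.
    print(isolate_fault((1, 0, 0, 0)))   # -> link_A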
Cornell - Horus
Horus is a group message delivery and synchronization system for networking.
It is a direct descendant of the ISIS system. Horus has been used to manage
replica groups as part of making individual services highly available through
controlled redundancy. Horus is not currently licensable.
5 - Commonalities Between QoS and Survivability Techniques
As should be obvious from the previous sections, there are a number of
similarities in architecture, mechanisms, and metrics between service-level
QoS efforts and our survivability work. In this section, we examine
these commonalities. The next section discusses differences.
The terminology we use comes from CORBA, although the observations are
also applicable in other object-based frameworks.
QoS and Survivability Specified by Client; Managed by System
All work in QoS and survivability assumes that while clients are able to
specify the value they place on receiving a given QoS or having a service
remain available, they should not manage it themselves. There are four
reasons for this:
-
While managing either of these is hard for a service developer to implement
on a per-service basis, the techniques used are generic enough that they
can be developed well by QoS and survivability experts and applied to a
wide variety of services and situations.
-
QoS and survivability both require resource reservation, which in turn
requires substantial knowledge about the resource environment. The environment
is the same for all services in it, so it makes more sense to model it
outside of any of the individual services. This is especially true since the
environment will be expected to evolve (for good or ill) over the lifetime
of the services.
-
Services cannot make unilateral decisions on resource reservation because
they have to compete against other services, whose existence may be unknown
to them. Every service will naturally attempt to maximize its own behavior,
but in resource-constrained conditions, this will be impossible. This will
require at minimum a non-service-centric allocation mechanism.
-
The relative values of the services themselves will change depending on
the situation. Services that are valuable in peacetime may become considerably
less valuable during combat. Unless a service is programmed to understand
all the operational contexts in which it may find itself and adjust its
demands accordingly, this is impossible to achieve. This is unreasonable
to expect, especially for services that may be used for many years and
be adapted to new contexts. It also does not handle greedy services that
choose not to play by the rules, perhaps out of malevolent intent.
Moving the locus of QoS and survivability management out of the client
requires significant extensions to both binding specifications and the
CORBA proxy architecture.
Binding Specifications
A binding specification used in a system managing QoS or survivability
has to specify both more and less than current, OID-based bindings do:
more because things such as performance, abstract state, trust, availability,
and cost of use must be specified in addition to type; and less because
if the specification is too specific (e.g., identifies a single object
or server) QoS and survivability management will not have enough choices
available to do a good job. Once the binding specification becomes more
sophisticated, services will be required to advertise their capabilities
to a far greater extent than they do now. Binding specifications and advertisements
will need to be matched to determine a (possibly ranked) set of (complete
or partial) matches. In an OMG context, this would be the job of a Trader
Service, although one far more sophisticated than any that yet exists, even in
prototype form. Because trading probably requires domain knowledge to achieve
good matches, we will probably see a family of domain-specific Traders.
If this is true, then matching a complete binding specification will require
consulting several Traders and composing their responses. If this approach
is taken, it will be possible for the composition function to weight the
various responses (e.g., to care more about the speed of response than
the security of the service under some circumstances).
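A minimal sketch of this composition idea, assuming hypothetical property
names, weights, and per-facet "Traders"; it is not an OMG Trader interface,
only an illustration of weighting and composing the responses.
    # Hypothetical sketch of composing responses from several domain-specific
    # "Traders", each scoring service advertisements against one facet of a
    # binding specification.  Property names and weights are invented.

    ADVERTISEMENTS = [
        {"name": "MapSvcA", "latency_ms": 50,  "trust": 0.90, "cost": 10},
        {"name": "MapSvcB", "latency_ms": 200, "trust": 0.99, "cost": 2},
    ]

    def speed_trader(spec, ad):
        return 1.0 if ad["latency_ms"] <= spec["max_latency_ms"] else 0.0

    def security_trader(spec, ad):
        return ad["trust"] if ad["trust"] >= spec["min_trust"] else 0.0

    def cost_trader(spec, ad):
        return max(0.0, 1.0 - ad["cost"] / spec["max_cost"])

    def rank_matches(spec, weights):
        """Compose the per-facet scores into a ranked list of candidate bindings."""
        traders = {"speed": speed_trader, "security": security_trader, "cost": cost_trader}
        scored = []
        for ad in ADVERTISEMENTS:
            score = sum(weights[f] * traders[f](spec, ad) for f in traders)
            scored.append((score, ad["name"]))
        return sorted(scored, reverse=True)

    spec = {"max_latency_ms": 100, "min_trust": 0.8, "max_cost": 20}
    # Caring more about speed than security in this situation:
    print(rank_matches(spec, weights={"speed": 0.5, "security": 0.3, "cost": 0.2}))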
Proxy Architecture Extensions
Since a client should be unconcerned with how QoS or survivability is provided,
these must be provided by some other part of the system. Both BBN and OBJS
have chosen independently to make the control locus of this the client-side
proxy (the local stand-in for a remote server object) and the
server-side stub (which handles communication on the server end of the
connection). CORBA proxies and server stubs are generated from IDL specifications
and are responsible for message passing between clients and remote server
objects via the ORB. This requires argument marshaling and interaction
with the underlying communication layer. Standard CORBA proxies hide the
location and implementation of services from the client, but do little
else.
There are several advantages to extending CORBA proxies and server stubs:
-
All communications between clients and servers must, by definition, pass
through the proxies and server stubs. Arguments and return results are
exposed and packaged for movement at this point as part of the normal argument
marshaling. Together, this means that all communications can be mediated
and that what is passing through the connection is easily accessible.
-
A proxy and the server-side stub reside in different address spaces (and
often on different machines). This gives some flexibility as to where a
given extension should be placed. This can be an advantage for performance
reasons. In addition, since functionality can only be provided reliably
if the monitor is incorruptible, having the ability to place monitors and
mediators out of harm's way is advantageous. This is especially true for
things like security monitoring, where a client or service may wish to
avoid the mediation and could attempt to corrupt local monitors; the OMG
security Service does this. Finally, because proxies and stubs can fail
independently, it is possible to place monitors in each to monitor the
health of the other.
-
The CORBA specification allows proxies and stubs to be extended to do other
tasks besides message passing. Alternative proxies have been used to manage
local caches, to distribute processing load among replicas, and to manage
security (in the OMG Security Service). Commercial ORBs often provide server
stub alternatives that wrap the server implementation via inheritance or
delegation.
-
Smart proxies can be generated automatically by an improved IDL compiler
that either replaces or extends the IDL compiler that is provided with
every ORB to generate proxies and server stubs. [Note: replacement is more
likely than extension because, although they could be, these compilers
are in general not open to extension.] IDL can be extended to allow definition
of additional properties as in the case of BBN's QDL for defining QoS attributes.
There is considerable latitude for the internal architecture of extended
proxies and server stubs. The primary questions are: what belongs on the
client side and what belongs on the server side, whether existing proxies
should be extended or used as-is by a more abstract connection object,
and the definition of interfaces to monitors, up-calls to clients, and
external services such as the Survivability Service.
Adopting BBN terminology, we call the entire collection of mechanisms
between a client and a service a connection.
Connections
For both QoS and survivability, a connection between a client and a service
is far more abstract and far more active than in CORBA.
The need for increased abstraction is because the enhanced binding specifications
allow considerably more binding alternatives and consequently more flexibility
in the way the connections are established and managed. This additional
flexibility makes it important to hide not only the location and implementation
of the object providing the service from the client, but also which
object is providing the service. Because of this, the client must not be
allowed to act as if it were connected to a specific object providing the service
or to expect anything about the service other than what it requests in
the binding specification it provides.
The increased activity is because the connection is attempting to guarantee
far more things about the interaction than simply "best effort" to deliver
requests and return results. It does this by maintaining information about
the desired behavior of the connection in the form of a connection contract,
measuring the actual behavior through a collection of monitors attached
to the proxy and server stubs, and taking remedial action if the two do
not match. Remedial actions include changing the implementation of the
object providing service, changing which object provides the service, renegotiating
the connection contract, or terminating the connection gracefully.
Because of the above, the connection itself becomes a first-class object
and should have interfaces through which its activity can be monitored
as well as the interfaces used by the client and server.
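The following hypothetical sketch shows a connection as a first-class object
along the lines just described: it holds a contract, polls monitors, and tries
an ordered list of remedial actions when reality diverges from the contract.
The contract fields, monitor interface, and remedies are illustrative
assumptions, not a defined interface.
    # Hypothetical first-class connection object: desired behavior is held in
    # a contract, actual behavior comes from monitors, and remedial actions
    # are tried in order when the two do not match.

    class Connection:
        def __init__(self, client_proxy, server_stub, contract, monitors, remedies):
            self.client_proxy = client_proxy
            self.server_stub = server_stub
            self.contract = contract    # e.g., {"max_latency_s": 10}
            self.monitors = monitors    # {name: callable returning observed value}
            self.remedies = remedies    # ordered list of remedial callables

        def observed_behavior(self):
            return {name: monitor() for name, monitor in self.monitors.items()}

        def check(self):
            """Management interface: compare reality to contract, remediate if needed."""
            reality = self.observed_behavior()
            violations = {k: v for k, v in reality.items()
                          if k in self.contract and v > self.contract[k]}
            if not violations:
                return "ok"
            for remedy in self.remedies:      # e.g., switch implementation, rebind,
                if remedy(self, violations):  # renegotiate, or terminate gracefully
                    return "remediated"
            return "terminated"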
Monitors
It is important that monitoring be defined and performed by the QoS or
Survivability Service rather than either the client or the server. Part
of this is a trust issue; both clients and servers have to adhere to the
connection contract and it seems unreasonable to trust either to do so.
Another part is that many kinds of monitoring (heartbeats, traffic counting,
etc.) can be independent of application semantics and should be factored
out. The relative importance of different monitors depends on what properties
are defined as important by the connection contract, allowing monitors
to be placed only when needed. Finally, monitoring on the connection itself
allows "QoS-unaware" and "survivability-unaware" services to have a certain
degree of these "added" to them without requiring reimplementation; an
important consideration given the number of existing services that are
unaware.
6 - Issues in Merging QoS & Survivability
It is tempting to think that QoS and survivability can simply be composed.
However, it is not quite so simple, as can be seen from the following.
The later BBN work attempts to add "availability" onto QoS by controlling
a replication factor for the service implementation. This does not
seem to capture a number of key points and seems somewhat redundant.
Specifically:
-
If a service fails to be available, and hence doesn't respond when needed,
isn't that a QoS failure? How is the failure to deliver a timely
response because of service crash different, from the client perspective,
from failure to deliver a timely response due to any other cause that would
be covered by QoS considerations? Why should they be
specified independently?
-
A service that is up 100% of the time but provides low QoS isn't really
"available". Just the fact that a service is running is irrelevant.
-
To meet a given QoS requirement, there is no requirement that a service
be continually available; only that it do what it is supposed to do at
the right time. Assume that a service has a QoS requirement
that 99% of its responses be within a 10-second time window from method
invocation. As a QoS metric, that is pretty clear. However,
when mapped to availability, it is less obvious and there is not a clear
correlation. If a client sends 100 messages/hour and each takes 1
second to process, there is no requirement that the service be "up" 99%
of the time. The service will actually be doing useful work for 100 seconds
out of every 3600 (<3% of the time). So, as long as it can be brought up
fast enough when needed, there is really no 99% "availability" requirement.
This false issue does not arise if the entire matter is couched as a QoS
requirement.
-
If the time to start a service were zero and all services could persistently
save their current state for free, a service could be activated only and
exactly when it had work to do. In this case, there would never
be a need to have a service running until it was needed, which would obviate
the need for replication except for the persistent state. Again,
replication and availability are not exactly the right measures.
-
High availability can be obtained by the use of many replicas. This
is not without overhead for replica coordination, so the result is likely
to be reduced performance leading to possibly reduced QoS. Treating
QoS and availability separately makes this analysis harder.
I think the underlying causes of the above puzzles are these:
-
availability as a client concern
-
too few reconfiguration strategies
-
QoS not pushed through all levels of abstraction
-
inadequate treatment of time
-
configuration and QoS "fragility" not considered
-
inadequate metrics
Problem: Availability as a Client Concern
The BBN QoS work models the "availability" of a service bound to a connection;
in the examples this is done by requiring a particular replication factor
for the service as part of the connection contract. This is problematic
because availability properly applies to a service, while QoS applies
to a connection. It is reasonable that a service be replicated
in order to provide the connection the desired QoS, but the coupling should
be much looser than implied in the BBN work. To see why treating
service availability as a property of a connection is not correct, consider
the following examples, in which the client is satisfied if 99% of requests
are handled in a timely fashion with suitable precision and accuracy.
-
Case #1 - Service A determines that it must be available 99.99% of the
time to satisfy 99% of the requests by a client.
-
Case #2 - Service B determines that it can be booted quickly enough that
it can be idle until needed and never has to be "available".
-
Case #3 - Service C determines that it has enough resources that if available
99.9% of the time it can satisfy 99% of the requests of 10 clients.
It is not clear how any of these reasonable cases can be addressed by a
model where service availability is a direct concern of the client.
Problem: Too Few Reconfiguration Strategies
The QoS work surveyed has a limited number of ways to reconfigure
a service or a connection. This causes a tendency to view QoS problems
as client problems (demands too high) or a server problem (too slow or
unavailable) rather than more abstractly as a connection problem that can
be addressed in a variety of ways to maintain the contracted QoS.
Our Evolution Model for OSAs
paper enumerates rebinding points where things like platform, platform
type, implementation code class, replication factor, replication policy,
and bound service instance can be changed. When QoS is introduced,
a number of other techniques can be used. For example, consider a
QoS contract that requires responses within 30 seconds, but that allows
a null response if the last non-null response was within the last 5 minutes.
In this case, the proxy is free to substitute a null response to meet the
timing constraint in some situations. It is instructive to note that
if precision or accuracy is allowed to go to zero (even if only occasionally),
arbitrarily good timeliness can be achieved by generating null responses
in the proxy. This will allow a synchronous client to continue to
function (at least for a while) even if the server doesn't respond.
An enumeration of a wider range of QoS-preserving responses would help
not only by providing a laundry list of useful techniques, but also by helping
to steer the community away from such "solution-oriented" metrics as availability.
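A sketch of the null-response technique just described; the 30-second and
5-minute figures follow the example, while the proxy and server-call machinery
are hypothetical simplifications.
    import time

    # Sketch of the QoS-preserving trick described above: if the real server
    # cannot respond within the 30-second bound, the proxy substitutes a null
    # response, provided the last non-null response was within the last 5 minutes.

    class NullSubstitutingProxy:
        RESPONSE_BOUND_S = 30
        NULL_ALLOWED_WINDOW_S = 300   # 5 minutes

        def __init__(self, call_server_with_timeout):
            self.call_server = call_server_with_timeout   # returns result or None
            self.last_non_null_time = None

        def invoke(self, request):
            result = self.call_server(request, timeout_s=self.RESPONSE_BOUND_S)
            if result is not None:
                self.last_non_null_time = time.time()
                return result
            # Server missed the bound: a null response keeps the (synchronous)
            # client running, but only if precision is allowed to drop to zero
            # and a real response was seen recently enough.
            recent = (self.last_non_null_time is not None and
                      time.time() - self.last_non_null_time <= self.NULL_ALLOWED_WINDOW_S)
            if recent:
                return None     # null response within the contract's allowance
            raise TimeoutError("QoS contract violated: no timely response available")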
Problem: QoS Requirements Not Pushed Through All Levels of Abstraction
Both QoS and survivability must deal with multiple levels of abstraction.
A service can only provide a given level of QoS if the services it relies
upon can themselves provide the QoS required of them. Similarly,
a service is survivable if the services it relies upon survive or if the
service can reconfigure to require different base services. Although both
communities realize this, at present, neither handles it particularly well.
In particular, problems revolve around mapping requirements at one level
of abstraction down to requirements at lower levels of abstraction, and
efficiently reserving resources through possibly several layers of lower-level
services.
A related difficulty is that the QoS policies at higher layers need
to somehow be reflected down to lower layers in meaningful form.
Consider this example. A client requires high availability (in the
BBN model) and fast response. If availability was the only QoS requirement,
the existence of the requisite number of replicas would be sufficient to
maintain the contract. However, this is not adequate, since response
time also matters. Different replication policies (primary copy,
voting, etc.) give different QoS behaviors. Further, the messaging
policy within a replica group may vary depending on the QoS needs of the
client. For example, normally reads are sent only to a single replica
to conserve bandwidth and processing. However, if timely delivery
is crucial, it makes sense to attempt reads from multiple replicas to ensure
that at least one responds in time (assuming of course that the multiple
reads will not compete for the same bandwidth and become even slower).
The policy to follow could vary based on a number of factors, including
load on the resources, criticality of timely response, client value, and
previous behavior of the connection (e.g., is it barely making the timing
bound?). To achieve this, it would appear that the high-level QoS
goal must somehow be pushed down to the replication and messaging subsystems.
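A hypothetical sketch of such policy reflection, choosing a replica read
policy from a few of the factors listed above; the factor names and thresholds
are invented for illustration.
    # Hypothetical sketch of reflecting a high-level QoS goal down to a replica
    # read policy, as discussed above.  Factor names and thresholds are invented.

    def choose_read_policy(timeliness_critical, resource_load, timing_margin_s):
        """Decide how many replicas a read should be sent to."""
        if not timeliness_critical:
            return "read-one"     # conserve bandwidth and processing
        if resource_load > 0.8:
            return "read-one"     # extra reads would just add contention
        if timing_margin_s < 2.0:
            return "read-all"     # barely making the bound: hedge widely
        return "read-two"         # modest redundancy for timeliness

    print(choose_read_policy(timeliness_critical=True, resource_load=0.3, timing_margin_s=1.0))
    # -> read-all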
Problem: Inadequate Treatment of Time
The QoS work surveyed seems to treat all configurations as having present
state only. This neglects the fact that configurations will change
state (for better or worse) either on demand or because of some random
event, and that these changes take some amount of time to occur.
In some of these new states, the service will be able to provide the required
QoS and in others it will not. Clients promise certain invocation
rates as part of the QoS contract, which is part of the treatment of time,
but nowhere is the time for a service to change configurations addressed.
As noted, this means that services must be kept in a state where they can
respond in the QoS bound, which can be very wasteful.
Problem: Configuration & QoS Fragility
QuO treats all responses that fall inside a negotiated region as being
of equal worth. That is fine as far as the client is concerned, but
is misleading when the ability to meet future QoS goals is considered.
Consider the following two contracts whose timeliness components are:
Contract #1
-
10 sec - 30 sec response
-
15 sec average response over any 10 invocation intervals
-
no violations allowed
Contract #2
-
10 sec - 30 sec response
-
15 sec average response over any 10 invocation intervals
-
no more than 1 timeliness failure per 100 invocations AND no more than
2 average responses over 20 sec per 100
intervals
With these two contract fragments:
-
in Contract #1, if the average response time is 14.9 seconds, that is not
as safe as if it were 11 seconds, since a response time of 30 seconds on
the next request would cause a violation in the former case but not the
latter
-
in Contract #2, if a timing goal has been missed, another timing goal cannot
be missed until a certain number of responses have been made
Further, regardless of the previous behavior of the connection:
-
if the service configuration fails, there is an increased chance that the
next QoS target will not be met; this is influenced by how long it takes
to restart the service (if possible)
-
if the threat environment becomes more hostile, the probability of missing
a QoS target increases, even if no targets have been missed so far
-
if load increases, missing QoS targets becomes more likely, even though
none have been missed so far
There needs to be some treatment of how likely it is that a negotiated
region will be violated in the future. Otherwise, the only time a
reconfiguration takes place is when the region has been violated,
by which time a QoS failure has occurred. Because the time to reconfigure
is not treated by the present body of QoS work, it is not even known how
long the connection will be unusable. To make matters worse, some
failures cannot be recovered from. As part of evaluating how brittle
a connection is, there are definitely gradations of "badness": the
probability of a violation, how severe the violation is likely to be, how
long it will last, and whether alternative levels of QoS can be reached
from the new configuration.
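One way to quantify this kind of headroom, sketched against Contract #1 above
(the "headroom" measure itself is our own illustration, not something in the
QuO work):
    # Sketch of measuring how close a connection is to violating Contract #1
    # above (10-30 sec responses, 15 sec average over any 10 invocations,
    # no violations allowed).

    WINDOW = 10
    AVG_BOUND_S = 15.0
    MAX_RESPONSE_S = 30.0

    def headroom(last_nine_responses_s):
        """Largest next response time that keeps the 10-invocation average legal."""
        assert len(last_nine_responses_s) == WINDOW - 1
        allowed = AVG_BOUND_S * WINDOW - sum(last_nine_responses_s)
        return min(allowed, MAX_RESPONSE_S)

    # Both histories are inside the negotiated region, but the first is brittle:
    print(headroom([14.9] * 9))   # ~15.9 s of slack; a 30 s response violates
    print(headroom([11.0] * 9))   # 30.0 s of slack; even a worst-case response is legal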
The BBN work partially addresses this problem of brittleness by requiring
a given availability for services based on a precomputed replication factor,
which is made part of the QoS region contract. By maintaining replicas,
the service never becomes too brittle. While this approach suffers
from the limitations discussed above, it brings out an interesting point
that we will exploit more fully in our work combining QoS and survivability.
In the base QuO work, all parts of the region contract were relevant to
the client and all violations were seen by the client (although often they
could be fixed by the proxy). However, the replication factor is
not relevant to the client except as a rough measure of the ability to
deliver future QoS. When a contract is violated due to replica
failures, only the proxy is concerned. We are working on extending
and formalizing this notion that a contract should have parts relevant
to both the client and the proxy so that QoS can be delivered and brittle
states avoided.
Problem: Inadequate Metrics
Both survivability and QoS attempt to say something about the "goodness"
of a connection. To this end, they each define metrics. However,
in our opinion, they say very different things. Specifically, QoS (in its
broadest sense) addresses current properties of the connection,
whereas survivability is concerned with the future of the connection.
In other words, survivability addresses whether the connection is likely
to maintain a desired QoS. Both QoS work and survivability address restoring
a connection to a good QoS; the difference is that when doing so, survivability
considers the ability of the new configuration to survive whereas QoS does
not. This is discussed in a paper Survivability
is Utility.
This has two principal effects.
-
The survivability metrics must differ from the QoS metrics, since the latter
measure only current state. To give an extreme example of the difference,
consider that placing a service on a lightly loaded, but very vulnerable
machine will score highly on the QoS metric, but low on the survivability
metric. These metrics, and their relationship, are discussed later.
-
A Survivability Service will need to take expected future behavior into
account when allocating resources. This necessitates some sort of model
of likely future events that could cause a configuration to change.
Bibliography of QoS Papers
Papers are organized by project.
Rome Laboratory
-
Quality of Service for AWACS Tracking, Patrick Hurley, Tom Lawrence,
Tom Wheeler, Ray Clark, to appear in the 4th International Command
and Control Research and Technology Symposium, 14-16 September 1998, Nasby
Park (Stockholm), Sweden
-
Anomaly Management, Thomas F. Lawrence, AFRL, internal report, 1997
BBN Cluster Papers
QuO
DIRM
AQuA
OIT
SRI Papers
-
QoS Taxonomy
(1997)
good definitions of QoS parameters and benefit function
-
Modeling
for Adaptive QoS (1997)
describes a multi-level model for application-specific QoS specification
that supports implementation alternatives and variable QoS
-
QoS-Based
Allocation (1997)
describes a scheduler for resource allocation in a network that is
QoS aware
Illinois Papers
SMARTS Papers
-
"High Speed & Robust Event Correlation" - Yemini, Kiger, Mozes,
Yemini, Ohsie.
Description of the SMARTS inCharge fault analysis system.