Survivability in Object Services Architectures

David L. Wells and David E. Langworthy
Object Services and Consulting, Inc.

A "survivable" application can continue to function despite the loss or degradation of some of its components, will maintain its functionality and performance for as long as possible, and will degrade gracefully when this is no longer possible.  Survivability relies on redundancy to allow normal operations to continue as long as possible, the ability to reconfigure to correct problems, and policies defining acceptable (but less desirable) functionality or performance should it prove impossible to maintain the desired behavior.

We are developing an architecture and Survivability Service to make OSA-based distributed systems far more survivable in the face of component failure and degradation than is currently possible.  The architecture unifies a number of existing robustness mechanisms and adds several new ones to provide a variety of tools that can be applied in different situations.  Because of the complexity of system-wide survivability, it is impossible to have a "master plan" for assuring survivability.  Instead, we use market mechanisms to create global survivability as an emergent behavior resulting from a large number of small, local decisions.
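The idea of global behavior emerging from small, local market decisions can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the `Host` class, the pricing rule (price rises as free capacity shrinks), and the `place` function are hypothetical and are not taken from the Survivability Service itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    """A hypothetical resource seller; not part of any real OSA API."""
    name: str
    capacity: int
    used: int = 0
    alive: bool = True

    def price(self) -> float:
        """Ask price for one more unit; it rises as free capacity shrinks."""
        if not self.alive or self.used >= self.capacity:
            return float("inf")  # failed or full hosts cannot sell
        return 1.0 / (self.capacity - self.used)


def place(budget: float, hosts: List[Host]) -> Optional[str]:
    """One local decision: buy a unit from the cheapest affordable host."""
    best = min(hosts, key=lambda h: h.price())
    if best.price() > budget:
        return None  # nothing affordable; the caller must degrade or wait
    best.used += 1
    return best.name


if __name__ == "__main__":
    hosts = [Host("a", capacity=4), Host("b", capacity=2)]
    # Each service decides independently, yet load spreads across hosts.
    print([place(1.0, hosts) for _ in range(7)])
```

No component in this sketch holds a system-wide plan. Placement spreads across hosts, and avoids failed ones, only because each buyer reacts to the prices it sees.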

Our approach preserves the simplicity of OSA application development, which has been largely responsible for the popularity of OSAs, by not requiring individual applications or services to handle the details of their own survivability.  This is necessary for three reasons: survivability is difficult to program, so its development costs should be amortized across many applications; the survivability needs of different applications or services often conflict; and survivability requires more accurate knowledge of the eventual deployment environment(s) than is reasonable to expect at development time.  To achieve this, we make survivability orthogonal to conventional OSA application semantics; in other words, survivability is "added" to an application rather than built into it from the start.  This is done by a "Survivability Service" that handles the survivability needs of applications collectively, responding to changes in workload, resource requirements, resource availability, and threats based on a number of environment models that can be specified independently.  A consequence of making survivability orthogonal to application functionality is that changing the models (not the applications or services) allows applications to be deployed into dynamically changing or unanticipated environments.  This approach also supports the use of commercial off-the-shelf (COTS) and government off-the-shelf (GOTS) components that were not constructed for survivability.

The key to constructing survivable systems is to configure them in such a way that they can be easily reconfigured when needed to survive loss of system resources.  We have extended and clarified the standard OSA object model to create a survivable object abstraction that makes it possible to define a set of "survivable configurations" that are able to withstand component loss and are also capable of being systematically evolved into new configurations should component loss become severe.  The abstraction provides ways to change both the physical configuration (different service placement or resource allocation) and the logical configuration (service alternatives or changed levels of service quality) of an application. Developers use the abstraction to specify, implement, and connect services.  The OSA Survivability Service manages configurations defined in this abstraction to keep them running as well as possible given the currently available resources.  The object abstraction:

We believe that a key to adding any kind of "extra-functional behavior", such as security, persistence, or survivability, is an object abstraction with the right kind of "translucent joints", where systems can either be mediated or taken apart and reassembled dynamically in different ways.  A joint is a well-defined place where a binding between system components may be made.  In general, a joint maintains more information about the binding than is common in programming languages; this could include a statement of the requirements that any object must meet to satisfy the binding, the provenance required, information flow restrictions, QoS, etc.  Translucence means that the joint is visible when its special properties are needed, but is otherwise invisible except possibly for a small performance penalty.  In fact, it is often possible to reduce or completely eliminate the performance penalty at the cost of more complexity in changing the binding.  Prior examples of the use of such joints to add behavior are persistence and transaction control in Open OODB and security in the OMG Object Security Service.
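A minimal sketch may help make the notion of a joint concrete. It assumes a joint can be modeled as an object that holds a binding, a requirement any bound object must satisfy, and an optional chain of mediators. The names `Joint`, `recording`, `PrimaryStore`, `BackupStore`, and `has_get` are hypothetical and do not come from Open OODB or any OMG specification.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class Joint:
    """A hypothetical binding point between a client and a service."""
    requirement: Callable[[Any], bool]  # what any bound object must satisfy
    target: Optional[Any] = None
    mediators: List[Callable[[Callable], Callable]] = field(default_factory=list)

    def bind(self, candidate: Any) -> None:
        """Rebind the joint, checking the candidate against the requirement."""
        if not self.requirement(candidate):
            raise ValueError("candidate does not satisfy the joint's requirement")
        self.target = candidate

    def call(self, method: str, *args: Any) -> Any:
        """Invoke through the joint; clients never see what is bound."""
        if self.target is None:
            raise RuntimeError("joint is unbound")
        invoke = getattr(self.target, method)
        for mediator in self.mediators:  # each mediator wraps the invocation
            invoke = mediator(invoke)
        return invoke(*args)


def recording(log: list) -> Callable[[Callable], Callable]:
    """Example mediator: records the arguments of every call."""
    def mediator(invoke: Callable) -> Callable:
        def wrapped(*args: Any) -> Any:
            log.append(args)
            return invoke(*args)
        return wrapped
    return mediator


class PrimaryStore:
    def get(self, key: str) -> str:
        return "primary:" + key


class BackupStore:
    def get(self, key: str) -> str:
        return "backup:" + key


def has_get(obj: Any) -> bool:
    """Requirement used in the demo: the object offers a callable `get`."""
    return callable(getattr(obj, "get", None))


if __name__ == "__main__":
    joint = Joint(requirement=has_get)
    joint.bind(PrimaryStore())
    print(joint.call("get", "x"))
    joint.bind(BackupStore())  # reassembled without changing the client
    print(joint.call("get", "x"))
```

With an empty mediator list the joint costs one extra indirection per call, which corresponds to the small performance penalty mentioned above. Adding a mediator or rebinding changes behavior without touching client code.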

We are specifying the architecture of an OSA Survivability Service to manage applications defined using the survivable object abstraction. The architecture supports a wide variety of survivability actions (below), is compatible with existing OSAs and projected trends (including the various repositories and the CORBA Security Service), and encompasses a wide variety of existing research in fault-tolerant systems, failure detectors, system models, etc.  We currently have an overall architecture for the Survivability Service that covers the "big picture" of how the components relate, including an internal partitioning that allows major subsystems to be replaced or refined, possibly by third parties. Survivability actions supported by the OSA Survivability Service are:

The OSA Survivability Service configures and reconfigures applications using currently available resources in an attempt to avoid known threats.  It uses a collection of environment models describing resources, threats, and the overall situation to determine what to do.  At present, these models are only roughly defined.
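As a rough illustration of model-driven reconfiguration, the sketch below assumes that the resource model reduces to a set of available hosts, the threat model to a set of threatened hosts, and that each candidate configuration carries a utility reflecting its level of service. These simplifications, and the names `Configuration` and `choose_configuration`, are assumptions made for this example rather than part of the service's design.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Set


@dataclass(frozen=True)
class Configuration:
    """A hypothetical survivable configuration of one application."""
    name: str
    hosts: FrozenSet[str]  # hosts this configuration needs
    utility: float         # desirability of the service level it delivers


def choose_configuration(
    configs: Iterable[Configuration],
    available: Set[str],
    threatened: Set[str],
) -> Optional[Configuration]:
    """Pick the most desirable configuration that fits on safe hosts."""
    safe = available - threatened  # avoid hosts under known threat
    feasible = [c for c in configs if c.hosts <= safe]
    return max(feasible, key=lambda c: c.utility, default=None)


if __name__ == "__main__":
    configs = [
        Configuration("full", frozenset({"h1", "h2"}), utility=1.0),
        Configuration("degraded", frozenset({"h3"}), utility=0.4),
    ]
    # A threat to h2 forces a fallback to the degraded configuration.
    chosen = choose_configuration(configs, {"h1", "h2", "h3"}, {"h2"})
    print(chosen.name if chosen else None)
```

Because the models are passed in as data, changing them alters the chosen configuration without modifying the application.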

We are building a Survivability Service prototype, including a market mechanism for resource allocation, simple models and model evolution to drive survivability decisions under changing conditions, specifications of how to rebind logically equivalent or similar services, and some visualization.  This will allow demonstration of a cohesive part of the Survivability Service by mid-1998.  A concept demonstration of part of this currently exists.

We are interested in attending this workshop in order to trade ideas about object abstractions and joints, and to contribute to a discussion of how different behaviors applied at the same joint should be allowed to interact.