A Compiler-Enabled Model- and Measurement-Driven Adaptation Environment for Dependability and Performance

NSF CNS-0406351

 

About the Project

Next-generation parallel and distributed computing must be dependable and have predictable performance in order to meet the requirements of increasingly complex scientific and commercial applications. The large-scale nature and changing user requirements of such applications, coupled with the changing fault environment and workloads in which they must operate, dictate that their dependability and performance must be managed online, reacting to changes in anticipated and observed faults, demands placed on the system, and changes in specified dependability, performance, and/or functional requirements. We are creating a compiler-enabled model- and measurement-driven adaptation environment that allows distributed applications to perform as expected despite faults that may occur. Achieving the desired capabilities will require fundamental advances in, and synergistic combinations of, 1) compiler-based flexible dependability mechanisms, 2) efficient online model-based prediction and control, and 3) measurement-driven and compiler-enabled early error detection. We will validate the adaptation environment by using it for two important applications from the scientific and commercial domains: the CARMA (Combined Array for Research in Millimeter-Wave Astronomy) data pipeline, which is a data-intensive Grid application for radio astronomy, and iMobile, which is an enterprise-scale mobile services platform.

Static and dynamic compiler transformations will be used to create novel dependability mechanisms that require fewer system resources than traditional mechanisms (minimizing their performance impact), and that can detect classes of errors (such as programming errors) that cannot be detected by traditional replication mechanisms. The new mechanisms will be used by the online adaptation engine to achieve a specified dependability and performance objective. Examples of the new mechanisms include variant replicas of several kinds (viz., replicas that differ from the original process to provide better error detection, lower overhead, or both), and combinations of variant replicas with traditional checkpointing (for more efficient rollback recovery). To minimize static code expansion, replicas will be generated on demand under the control of the adaptation engine, via a transparent dynamic compilation framework. We will also explore compiler-based techniques that allow the middleware to coordinate distributed adaptations between processes more intelligently and more efficiently.
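
As a rough illustration of why a variant replica can catch errors that an identical replica cannot, the following minimal Python sketch runs a computation and a differently implemented variant on the same input and flags any disagreement. Everything in it is an assumption made for illustration: the names `checked_call`, `ReplicaMismatch`, and the three `sum_to_n_*` functions are hypothetical, and the variant is written by hand here, whereas the project proposes generating variants through compiler transformations.

```python
class ReplicaMismatch(Exception):
    """Raised when the primary and the variant replica disagree (hypothetical name)."""


def checked_call(primary, variant, *args):
    """Run both versions on the same input and compare their results."""
    expected = primary(*args)
    observed = variant(*args)
    if expected != observed:
        raise ReplicaMismatch(f"primary={expected!r}, variant={observed!r}")
    return expected


def sum_to_n_loop(n):
    """Primary version: iterative sum of 1..n."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_to_n_formula(n):
    """Variant version: same result computed by a different method."""
    return n * (n + 1) // 2


def sum_to_n_buggy(n):
    """Primary with a programming error: the loop stops one step early."""
    total = 0
    for i in range(1, n):
        total += i
    return total
```

Two identical copies of `sum_to_n_buggy` would agree with each other and hide the off-by-one error, while `checked_call(sum_to_n_buggy, sum_to_n_formula, 10)` raises `ReplicaMismatch`.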

An online model-based prediction and control engine will make use of compiler-assisted deterministic and stochastic models, together with input from compiler-guided performance and error measurements, to adapt a system's configuration to achieve the best combination of performance and dependability under existing conditions. Use of different types of models will allow the configuration changes to be either 1) algorithm changes that choose among the available mechanisms, including possible dynamically generated versions, or 2) parameter changes that tune a particular algorithm to work more efficiently. The models will incorporate compiler-synthesized model components that capture properties of existing and potential versions of generated code. To provide such prediction and control capabilities, we will use a combination of reactive feedback control techniques, along with predictive, state-space-based stochastic modeling techniques (e.g., Markov decision processes). To ensure rapid decision-making and quick solution times for the models, we will use a combination of approximation techniques (such as state-space reduction through decomposition and finite horizon computations) along with partial offline generation of controllers via symbolic solution methods.
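
To make the idea of a finite-horizon Markov decision process concrete, here is a minimal Python sketch of backward induction that picks a dependability mechanism for each system state. It is not the project's controller: the states, actions, transition probabilities, and rewards below are invented for illustration, and the function name `finite_horizon_policy` is hypothetical.

```python
# Hypothetical system states and selectable mechanisms (illustrative only).
STATES = ["healthy", "degraded"]
ACTIONS = ["checkpoint_only", "variant_replica"]

# P[action][state] -> list of (next_state, probability); made-up numbers.
P = {
    "checkpoint_only": {
        "healthy": [("healthy", 0.9), ("degraded", 0.1)],
        "degraded": [("healthy", 0.3), ("degraded", 0.7)],
    },
    "variant_replica": {
        "healthy": [("healthy", 0.95), ("degraded", 0.05)],
        "degraded": [("healthy", 0.8), ("degraded", 0.2)],
    },
}

# R[action][state] -> one-step reward (useful work minus overhead); made-up numbers.
R = {
    "checkpoint_only": {"healthy": 10.0, "degraded": 2.0},
    "variant_replica": {"healthy": 7.0, "degraded": 4.0},
}


def finite_horizon_policy(horizon):
    """Backward induction over `horizon` decision steps.

    Returns (value, policy): value[s] is the best expected total reward
    starting in state s, and policy[t][s] is the action to take in state s
    at step t (t = 0 is the first decision).
    """
    value = {s: 0.0 for s in STATES}  # nothing left to earn after the last step
    policy = []
    for _ in range(horizon):
        new_value, step_policy = {}, {}
        for s in STATES:
            best_action, best_q = None, float("-inf")
            for a in ACTIONS:
                q = R[a][s] + sum(p * value[s2] for s2, p in P[a][s])
                if q > best_q:
                    best_action, best_q = a, q
            new_value[s] = best_q
            step_policy[s] = best_action
        value = new_value
        policy.append(step_policy)
    policy.reverse()  # steps were computed last-to-first
    return value, policy
```

Bounding the horizon keeps the cost of each decision proportional to horizon × states × actions × successor states, which is one reason finite-horizon computation suits online use. With these invented numbers, the cheaper mechanism is chosen in the healthy state and the variant replica in the degraded state.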

Sophisticated error and performance measurement techniques will be used to characterize system error behavior, enable early error detection, guide online adaptation models, and work with the compiler to improve error detection and tolerance. In particular, measurements on operational systems will help characterize real issues in the field, including correlated errors and error propagation, that often escape current detection mechanisms. Online analysis will help extract error symptoms for early error detection and thus minimize the performance impact of failures. Such analysis will also provide the parameters and distribution characteristics needed to adapt system models and thus ensure effective online control. We will use compiler support to develop application-specific, preemptive detection techniques to improve coverage and minimize error propagation while maintaining performance.

The larger impact of this research will be to produce and distribute a practical, integrated compiler and middleware system that uses online models and measurement techniques to achieve performance and dependability in a scalable manner under a wide variety of changing conditions. The techniques we develop could ultimately impact many diverse and critical applications, including those in the electric power distribution, aerospace, healthcare, and financial services sectors.

People

Publications