About the Project
Next-generation parallel and distributed computing must be dependable and have predictable performance in order to meet the requirements of increasingly complex scientific and commercial applications. The large-scale nature and changing user requirements of such applications, coupled with the changing fault environment and workloads in which they must operate, dictate that their dependability and performance must be managed in an on-line fashion, reacting to changes in anticipated and observed faults, demands placed on the system, and changes in specified dependability, performance, and/or functional requirements. We are in the process of creating a compiler-enabled model- and measurement-driven adaptation environment that allows distributed applications to perform as expected despite faults that may occur. Achieving the desired capabilities will require fundamental advances in, and synergistic combinations of, 1) compiler-based flexible dependability mechanisms, 2) efficient online model-based prediction and control, and 3) measurement-driven and compiler-enabled early error detection. We will validate the adaptation environment by using it for two important applications from the scientific and commercial domains: the CARMA (Combined Array for Research in Millimeter-Wave Astronomy) data pipeline, which is a data-intensive Grid application for radio astronomy, and iMobile, which is an enterprise-scale mobile services platform.
Static and dynamic compiler transformations will be used to create novel dependability mechanisms that require fewer system resources than traditional mechanisms (minimizing their performance impact), and that can detect classes of errors (such as programming errors) that cannot be detected by traditional replication mechanisms. The new mechanisms will be used by the online adaptation engine to achieve a specified dependability and performance objective. Examples of the new mechanisms include variant replicas of several kinds (viz., replicas that differ from the original process to provide better error detection, lower overhead, or both), and combinations of variant replicas with traditional checkpointing (for more efficient rollback recovery). To minimize static code expansion, replicas will be generated on demand under the control of the adaptation engine, via a transparent dynamic compilation framework. We will also explore compiler-based techniques that allow the middleware to coordinate distributed adaptations between processes more intelligently and more efficiently.
An online model-based prediction and control engine will make use of compiler-assisted deterministic and stochastic models, together with input from compiler-guided performance and error measurements, to adapt a system's configuration to achieve the best combination of performance and dependability under existing conditions. Use of different types of models will allow the configuration changes to be either 1) algorithm changes that choose among the available mechanisms, including possible dynamically generated versions, or 2) parameter changes that tune a particular algorithm to work more efficiently. The models will incorporate compiler-synthesized model components that capture properties of existing and potential versions of generated code. To provide such prediction and control capabilities, we will use a combination of reactive feedback control techniques, along with predictive, state-space-based stochastic modeling techniques (e.g., Markov decision processes). To ensure rapid decision-making and quick solution times for the models, we will use a combination of approximation techniques (such as state-space reduction through decomposition and finite horizon computations) along with partial offline generation of controllers via symbolic solution methods.
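As a rough illustration of the finite-horizon Markov decision process computations mentioned above, the following minimal sketch runs backward induction on a toy model. The two states ("healthy", "degraded"), the two configurations ("light", "heavy"), and all transition probabilities and rewards are hypothetical values chosen for illustration only; they are not taken from the project's models.

```python
# Toy finite-horizon MDP solved by backward induction.
# All states, actions, probabilities, and rewards below are hypothetical.
STATES = ["healthy", "degraded"]
ACTIONS = ["light", "heavy"]  # e.g., a low-overhead vs. a heavier dependability mechanism

# P[action][state] -> list of (next_state, probability)
P = {
    "light": {"healthy": [("healthy", 0.9), ("degraded", 0.1)],
              "degraded": [("healthy", 0.2), ("degraded", 0.8)]},
    "heavy": {"healthy": [("healthy", 0.99), ("degraded", 0.01)],
              "degraded": [("healthy", 0.7), ("degraded", 0.3)]},
}

# R[action][state] -> immediate reward (higher means better delivered performance)
R = {
    "light": {"healthy": 10.0, "degraded": 2.0},
    "heavy": {"healthy": 7.0, "degraded": 1.0},
}


def finite_horizon_policy(horizon):
    """Return (value, policy), where policy[t][state] is the best action
    when (horizon - t) decision steps remain."""
    value = {s: 0.0 for s in STATES}  # value with zero steps remaining
    policy = []
    for _ in range(horizon):
        new_value, step_policy = {}, {}
        for s in STATES:
            best_action, best_q = None, float("-inf")
            for a in ACTIONS:
                # Expected reward of taking action a now, then acting optimally.
                q = R[a][s] + sum(p * value[s2] for s2, p in P[a][s])
                if q > best_q:
                    best_action, best_q = a, q
            new_value[s] = best_q
            step_policy[s] = best_action
        value = new_value
        policy.insert(0, step_policy)  # earlier decision steps go in front
    return value, policy


if __name__ == "__main__":
    value, policy = finite_horizon_policy(2)
    print(value)
    print(policy)
```

In this toy model, a two-step horizon already leads the controller to pick the heavier mechanism in the degraded state while keeping the lighter one when healthy. Bounding the horizon keeps the computation small, which is one way to obtain the rapid decision-making the text calls for.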
Sophisticated error and performance measurement techniques will be used to characterize system error behavior, enable early error detection, guide online adaptation models, and work with the compiler to improve error detection and tolerance. In particular, measurements on operational systems will help characterize real issues in the field, including correlated errors and error propagation, that often escape current detection mechanisms. On-line analysis will help extract error symptoms for early error detection and thus minimize the performance impact of failures. Such analysis will also provide the parameters and distribution characteristics to adapt system models and thus ensure effective on-line control. We will use compiler support to develop application-specific, preemptive detection techniques to improve coverage and minimize error propagation while maintaining performance.
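To make the idea of feeding measured parameters into system models more concrete, here is a minimal sketch that estimates an error rate online from observed error timestamps using an exponentially weighted moving average of the gaps between errors. The class name `ErrorRateEstimator`, the smoothing factor `alpha`, and the choice of an exponentially weighted average are assumptions made for illustration, not the project's actual measurement technique.

```python
class ErrorRateEstimator:
    """Hypothetical online estimator of an error rate (errors per time unit)."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha      # weight given to the most recent gap (assumed value)
        self.mean_gap = None    # smoothed time between consecutive errors
        self.last_time = None   # timestamp of the previous error

    def observe(self, timestamp):
        """Record one observed error at the given timestamp."""
        if self.last_time is not None:
            gap = timestamp - self.last_time
            if self.mean_gap is None:
                self.mean_gap = gap
            else:
                self.mean_gap = self.alpha * gap + (1 - self.alpha) * self.mean_gap
        self.last_time = timestamp

    def rate(self):
        """Return the estimated error rate, or None if not enough data yet."""
        if self.mean_gap is None or self.mean_gap <= 0:
            return None
        return 1.0 / self.mean_gap


if __name__ == "__main__":
    estimator = ErrorRateEstimator(alpha=0.2)
    for t in [0.0, 10.0, 20.0, 22.0]:
        estimator.observe(t)
    print(estimator.rate())
```

An estimate of this kind could, for example, be used to update a failure-rate parameter in a stochastic model; a rising rate might also serve as a simple symptom that triggers closer inspection.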
The larger impact of this research will be to produce and distribute a practical, integrated compiler and middleware system that uses online models and measurement techniques to achieve performance and dependability in a scalable manner under a wide variety of changing conditions. The techniques we develop could ultimately impact many diverse and critical applications, including those in the electric power distribution, aerospace, healthcare, and financial services sectors.
People
Publications
- V. S. Adve, A. Agbaria, M. A. Hiltunen, R. K. Iyer, K. R. Joshi, Z. Kalbarczyk, R. M. Lefever, R. Plante, W. H. Sanders, and R. D. Schlichting, "A Compiler-Enabled Model- and Measurement-Driven Adaptation Environment for Dependability and Performance," Proceedings of the Next Generation Software (NGS) Workshop at the International Parallel & Distributed Processing Symposium (IPDPS), Denver, Colorado, April 4, 2005 (CD-ROM).
- A. Agbaria and W. H. Sanders, "Application-Driven Coordination-Free Distributed Checkpointing," Proceedings of the 25th IEEE International Conference on Distributed Computing Systems, Columbus, Ohio, June 6-10, 2005, pp. 177-186.
- D. Dhurjati and V. Adve, "Backwards-Compatible Array Bounds Checking for C with Very Low Overhead," In Proceedings of the International Conference on Software Engineering (ICSE), Shanghai, China, May 2006.
- D. Dhurjati and V. Adve, "Efficiently Detecting All Dangling Pointer Uses in Production Servers," In Proceedings of the International Conference on Dependable Systems and Networks (DSN), Philadelphia, USA, June 2006.
- D. Dhurjati, S. Kowshik, and V. Adve, "Enforcing Alias Analysis for Weakly Typed Languages," In Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Ottawa, Canada, June 2006, pp. 144-157.
- S. Gaonkar, E. Rozier, A. Tong, and W. H. Sanders, "Scaling File Systems to Support Petascale Clusters: A Dependability Analysis to Support Informed Design Choices," Proceedings of the 38th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2008), Anchorage, Alaska, June 24-27, 2008, to appear.
- S. Gaonkar and W. H. Sanders, "Analysis of the Reliability/Availability of Distributed File Systems in Large-Scale Systems: A Case Study Using Simultaneous Simulation," Proceedings of the 8th International Workshop on Performability Modeling of Computer and Communication Systems, Edinburgh, UK, Sept. 20-21, 2007.
- K. R. Joshi, Stochastic-Model-Driven Adaptation and Recovery in Distributed Systems. Doctoral Dissertation, University of Illinois, 2007.
- K. R. Joshi, M. A. Hiltunen, and W. H. Sanders, "Performability Optimization Using Linear Bounds of Partially Observable Markov Decision Processes," Proceedings of the 7th International Workshop on Performability Modeling of Computer and Communication Systems (PMCCS-7), Turin, Italy, September 23-24, 2005, pp. 73-76.
- K. R. Joshi, M. Hiltunen, W. H. Sanders, and R. Schlichting, "Automatic Model-Driven Recovery in Distributed Systems," Proceedings of the 24th IEEE Symposium on Reliable Distributed Systems (SRDS 2005), Orlando, Florida, October 26-28, 2005, pp. 25-36.
- K. R. Joshi, M. A. Hiltunen, W. H. Sanders, and R. D. Schlichting, "Automatic Recovery Using Bounded Partially Observable Markov Decision Processes," Proceedings of the International Conference on Dependable Systems and Networks (DSN-2006), Philadelphia, PA, USA, June 25-28, 2006, pp. 445-456.
- K. R. Joshi, M. Hiltunen, R. Schlichting, W. H. Sanders, and A. Agbaria, "Online Model-Based Adaptation for Optimizing Performance and Dependability," Proceedings of the Workshop on Self-Managed Systems (WOSS 2004), Newport Beach, CA, October 31-November 1, 2004 (CD-ROM).
- V. V. Lam, P. Buchholz, and W. H. Sanders, "A Component-Level Path Composition Approach for Efficient Transient Analysis of Large CTMCs," Proceedings of the International Conference on Dependable Systems and Networks (DSN-2006), Philadelphia, PA, USA, June 25-28, 2006, pp. 485-494.
- K. Pattabiraman, Z. Kalbarczyk, and R. K. Iyer, "Application-Based Metrics for Strategic Placement of Detectors," in Proc. of Pacific Rim Int'l Symposium on Dependable Computing, PRDC'05, 2005.
- K. Pattabiraman, Z. Kalbarczyk, and R. Iyer, "Automated Derivation of Application-aware Error Detectors using Static Analysis," Fast Abstract at the International Conference on Dependable Systems and Networks, DSN-06, June 2006.
- K. Pattabiraman, G. P. Saggese, D. Chen, Z. Kalbarczyk, and R. K. Iyer, "Automated Derivation and Hardware Implementation of Application-Specific Error Detectors," Proc. HPCRI: 2nd Workshop on High Performance Computing: Reliability Issues; held in conjunction with the 12th International Symposium on High Performance Computer Architecture (HPCA-12), Austin, 2006.
- K. Pattabiraman, G.-P. Saggese, D. Chen, Z. Kalbarczyk, and R. Iyer, "Dynamic Derivation of Application-Specific Error Detectors and Their Implementation in Hardware," submitted for publication.
- H. V. Ramasamy, Parsimonious Service Replication for Tolerating Malicious Attacks in Asynchronous Environments, Ph.D. thesis, University of Illinois at Urbana-Champaign, 2005.
- H. V. Ramasamy, A. Agbaria, and W. H. Sanders, "A Parsimonious Approach for Obtaining Resource-Efficient and Trustworthy Execution," IEEE Transactions on Dependable and Secure Computing, vol. 4, no. 1, January-March 2007, pp. 1-17.
- H. V. Ramasamy, A. Agbaria, and W. H. Sanders, "Parsimony-Based Approach for Obtaining Resource-Efficient and Trustworthy Execution," Dependable Computing: Proceedings of the 2nd Latin-American Symposium (LADC 2005), Salvador, Brazil, October 25-28, 2005, LNCS vol. 3747, Springer-Verlag, pp. 206-225.
- H. V. Ramasamy and C. Cachin, "Parsimonious Asynchronous Byzantine-Fault-Tolerant Atomic Broadcast," Proceedings of the 9th International Conference on Principles of Distributed Systems (OPODIS), Pisa, Italy, Dec. 12-14, 2005.
- H. Ramasamy, M. Seri, and W. H. Sanders, "The CoBFIT Toolkit," Proceedings of the 26th Annual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC 2007), Portland, Oregon, Aug. 12-15, 2007, pp. 350-351.
- P. Sousa, N. F. Neves, P. Veríssimo, and W. H. Sanders, "Proactive Resilience Revisited: The Delicate Balance Between Resisting Intrusions and Remaining Available," Proceedings of the 25th IEEE Symposium on Reliable Distributed Systems (SRDS 2006), Leeds, UK, October 2-4, 2006, pp. 71-82.
- L. Wang, K. Pattabiraman, L. Votta, C. Vick, A. Wood, Z. Kalbarczyk, and R. K. Iyer, "Modeling Coordinated Checkpointing for Large-Scale Supercomputers," Proceedings of the International Conference on Dependable Systems and Networks (DSN), Yokohama, Japan, 2005.