Operating Systems Research Group
Operating Systems Research on Energy, Reliability and Autonomy

ARTS: Available, Robust and Trusted Software
--Making software to be like arts

As rapid advances in computing hardware have led to dramatic improvements in computer performance, reliability, availability, maintainability, and cost of ownership are becoming increasingly important. Unfortunately, software bugs remain frequent, accounting for as much as 40% of computer system failures. Software bugs can crash the system, making the service unavailable. Moreover, "silent" bugs that go undetected can corrupt data, generate wrong outputs or control commands, and destroy valuable information. According to the National Institute of Standards and Technology, software bugs cost the U.S. economy an estimated $59.5 billion annually, or approximately 0.6% of the gross domestic product!

Identifying and fixing software bugs requires enormous human labor. Despite this effort, software released to end users still contains numerous bugs. These bugs continue to consume human time in the form of bug reporting at the user site, user-vendor communication, and subsequent "bug-fix" software releases. Above all, we need techniques that automate the debugging process as much as possible.

Debugging parallel applications is especially difficult because parallel programs suffer not only from the bugs common in sequential programs but also from special types of bugs such as deadlocks and data races. Many of these bugs are non-deterministic, which makes interactive debugging time-consuming and significantly reduces the productivity of parallel application development. The problem is becoming more severe as the demand for ever more computational capability drives the creation of terascale parallel systems. Most existing parallel debugging tools cannot meet this challenge: it is prohibitively difficult to debug a parallel program running on thousands of nodes with an interactive debugger that offers only basic functionality. Moreover, many timing-related bugs surface only on terascale systems and are not exposed on a small-scale testbed.
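
To make the non-determinism concrete, here is a minimal Python sketch of a data race. It is purely illustrative and is not one of our tools: the names `increment` and `run` are hypothetical, and the two "threads" are simulated with generators so that the interleaving can be chosen explicitly instead of being left to a real scheduler.

```python
def increment(shared):
    """One non-atomic increment, split into a read step and a write step."""
    value = shared["counter"]        # step 1: read the shared counter
    yield                            # another thread may be scheduled here
    shared["counter"] = value + 1    # step 2: write back a possibly stale value


def run(schedule):
    """Run two simulated threads; `schedule` lists which thread takes each step."""
    shared = {"counter": 0}
    threads = {0: increment(shared), 1: increment(shared)}
    for tid in schedule:
        next(threads[tid], None)     # advance that thread by one step
    return shared["counter"]


serial = run([0, 0, 1, 1])       # thread 0 finishes before thread 1 starts
interleaved = run([0, 1, 0, 1])  # both threads read before either writes
print(serial, interleaved)
```

Under the serial schedule both increments take effect, while under the interleaved schedule one update is lost. On a real system the schedule is chosen by the hardware and the operating system, so the same program may pass many runs and fail only occasionally, which is what makes such bugs hard to reproduce in an interactive debugger.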

To address the above problems, the goal of our project is to detect software bugs efficiently and effectively in order to improve software robustness and security. In addition, we explore techniques that let software survive bugs without restarting.

Our project consists of the following research thrusts: