Cascading failures occur when a failure in one system component triggers failures in successive components, potentially leading to system-wide failure. Although commonly used fault-tolerance techniques can reduce the severity and frequency of such failures, they continue to occur in practice. To better understand how failures cascade, we conducted a qualitative analysis of 55 cascading failures described in 26 publicly available incident reports. Through this analysis we identified 16 types of cascading mechanisms (organized into eight categories) that capture the nature of the system interactions that contribute to cascading failures. We also discuss three themes, based on the observation that the cascading failures we analyzed occurred in one of three ways: when a component was unable to tolerate a failure in another component, through the actions of support or automation systems as they responded to an initial failure, or during system recovery. We believe that the 16 cascading mechanisms we present and the three themes we discuss provide important insights into some of the challenges associated with engineering a truly resilient and well-supported system.
College and Department
Physical and Mathematical Sciences; Computer Science
BYU ScholarsArchive Citation
Chamberlin, Barbara W., "How Failures Cascade in Software Systems" (2022). Theses and Dissertations. 9474.
Keywords
software design, cascading failure, graceful degradation, fault tolerance, graceful recovery