Abstract

The steady and uninterrupted availability of systems is essential for the mission of many companies and other organizations. This responsibility falls mostly on support engineers, who are responsible for responding to incidents. Incident response is a unique type of task in software engineering, given that it carries distinguishing characteristics such as risks, pressure, incomplete information and urgency. Despite the importance of this task for many organizations, little can be found in the literature about the incident response task and model. To fill this gap, we created a theoretical foundation to foster research on incident response. We conducted an interview study, asking 12 support engineers about their experiences dealing with outages, service degradation, and other incidents that demanded an urgent response. We used our 22 collected cases to identify important concepts of incidents and their dimensions, and created an ontology of incidents and a model of incident response. To validate the usefulness of our results, we analyzed our incidents based on our ontology and model, providing some insights related to the detection of incidents, investigation and the handover process. We also provide analytical insights related to the prevention of resource limitation incidents. Finally, we validate the usefulness of our research by proposing an improvement to the monitoring tools used by support engineers.

Degree

MS

College and Department

Physical and Mathematical Sciences; Computer Science

Rights

https://lib.byu.edu/about/copyright/

Date Submitted

2020-04-13

Document Type

Thesis

Handle

http://hdl.lib.byu.edu/1877/etd11183

Keywords

software engineering, incident response, troubleshooting, investigation

Language

English
