TY - GEN
T1 - Identifying linked incidents in large-scale online service systems
AU - Chen, Yujun
AU - Yang, Xian
AU - Dong, Hang
AU - He, Xiaoting
AU - Zhang, Hongyu
AU - Lin, Qingwei
AU - Chen, Junjie
AU - Zhao, Pu
AU - Kang, Yu
AU - Gao, Feng
AU - Xu, Zhangwei
AU - Zhang, Dongmei
N1 - Funding Information:
We thank our colleagues at Microsoft Azure groups who developed the incident management system and helped us learn the system: Feng Gao, Jeffery Sun, Pochian Lee, Li Yang, Zhangwei Xu. This work is supported by the National Natural Science Foundation of China (NSFC) under Grant No. 61702107. Hongyu Zhang’s work is supported by the Australian Research Council (ARC) under Grant No. DP200102940.
PY - 2020/11/8
Y1 - 2020/11/8
N2 - In large-scale online service systems, incidents occur frequently due to a variety of causes, from updates of software and hardware to changes in the operating environment. These incidents could significantly degrade system availability and customer satisfaction. Some incidents are linked because they are duplicates or inter-related. Linked incidents can greatly help on-call engineers find mitigation solutions and identify the root causes. In this work, we investigate the incidents and their links in a representative real-world incident management (IcM) system. Based on the identified indicators of linked incidents, we further propose LiDAR (Linked Incident identification with DAta-driven Representation), a deep-learning-based approach to incident linking. More specifically, we incorporate the textual descriptions of incidents and structural information extracted from historical linked incidents to identify possible links among a large number of incidents. To show its effectiveness, we apply our method to a real-world IcM system and find that it outperforms other state-of-the-art methods.
AB - In large-scale online service systems, incidents occur frequently due to a variety of causes, from updates of software and hardware to changes in the operating environment. These incidents could significantly degrade system availability and customer satisfaction. Some incidents are linked because they are duplicates or inter-related. Linked incidents can greatly help on-call engineers find mitigation solutions and identify the root causes. In this work, we investigate the incidents and their links in a representative real-world incident management (IcM) system. Based on the identified indicators of linked incidents, we further propose LiDAR (Linked Incident identification with DAta-driven Representation), a deep-learning-based approach to incident linking. More specifically, we incorporate the textual descriptions of incidents and structural information extracted from historical linked incidents to identify possible links among a large number of incidents. To show its effectiveness, we apply our method to a real-world IcM system and find that it outperforms other state-of-the-art methods.
KW - Incident management
KW - Link prediction
KW - Linked incidents
KW - Online service system
UR - http://www.scopus.com/inward/record.url?scp=85097187876&partnerID=8YFLogxK
U2 - 10.1145/3368089.3409768
DO - 10.1145/3368089.3409768
M3 - Conference contribution
AN - SCOPUS:85097187876
T3 - ESEC/FSE 2020 - Proceedings of the 28th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering
SP - 304
EP - 314
BT - ESEC/FSE 2020 - Proceedings of the 28th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering
A2 - Devanbu, Prem
A2 - Cohen, Myra
A2 - Zimmermann, Thomas
PB - Association for Computing Machinery, Inc
T2 - 28th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2020
Y2 - 8 November 2020 through 13 November 2020
ER -