Identifying linked incidents in large-scale online service systems

Yujun Chen, Xian YANG, Hang Dong, Xiaoting He, Hongyu Zhang, Qingwei Lin, Junjie Chen, Pu Zhao, Yu Kang, Feng Gao*, Zhangwei Xu, Dongmei Zhang

*Corresponding author for this work

Research output: Chapter in book/report/conference proceedingConference contributionpeer-review

1 Citation (Scopus)

Abstract

In large-scale online service systems, incidents occur frequently due to a variety of causes, from updates of software and hardware to changes in operation environment. These incidents could significantly degrade system's availability and customers' satisfaction. Some incidents are linked because they are duplicate or inter-related. The linked incidents can greatly help on-call engineers find mitigation solutions and identify the root causes. In this work, we investigate the incidents and their links in a representative real-world incident management (IcM) system. Based on the identified indicators of linked incidents, we further propose LiDAR (Linked Incident identification with DAta-driven Representation), a deep learning based approach to incident linking. More specifically, we incorporate the textual description of incidents and structural information extracted from historical linked incidents to identify possible links among a large number of incidents. To show the effectiveness of our method, we apply our method to a real-world IcM system and find that our method outperforms other state-of-the-art methods.

Original languageEnglish
Title of host publicationESEC/FSE 2020 - Proceedings of the 28th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering
EditorsPrem Devanbu, Myra Cohen, Thomas Zimmermann
PublisherAssociation for Computing Machinery, Inc
Pages304-314
Number of pages11
ISBN (Electronic)9781450370431
DOIs
Publication statusPublished - 8 Nov 2020
Event28th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2020 - Virtual, Online, United States
Duration: 8 Nov 202013 Nov 2020

Publication series

NameESEC/FSE 2020 - Proceedings of the 28th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering

Conference

Conference28th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2020
Country/TerritoryUnited States
CityVirtual, Online
Period8/11/2013/11/20

Scopus Subject Areas

  • Software

User-Defined Keywords

  • Incident management
  • Link prediction
  • Linked incidents
  • Online service system

Fingerprint

Dive into the research topics of 'Identifying linked incidents in large-scale online service systems'. Together they form a unique fingerprint.

Cite this