TY - GEN
T1 - Top-k Graph Summarization on Hierarchical DAGs
AU - Zhu, Xuliang
AU - Huang, Xin
AU - Choi, Koon Kau
AU - Xu, Jianliang
N1 - Funding Information:
This paper is supported by NSFC 61702435, HK RGC GRF 12200917, 12201518, 12232716, 12201119, HK RGC CRF C6030-18G, and Guangdong Basic and Applied Basic Research Foundation (Project No. 2019B1515130001). Xin Huang is the corresponding author.
PY - 2020/10/19
Y1 - 2020/10/19
N2 - Directed acyclic graph (DAG) is an essentially important model to represent terminologies and their hierarchical relationships, such as Disease Ontology. Due to massive terminologies and complex structures in a large DAG, it is challenging to summarize the whole hierarchical DAG. In this paper, we study a new problem of finding k representative vertices to summarize a hierarchical DAG. To depict diverse summarization and important vertices, we design a summary score function for capturing vertices' diversity coverage and structure correlation. The studied problem is theoretically proven to be NP-hard. To efficiently tackle it, we propose a greedy algorithm with an approximation guarantee, which iteratively adds vertices with the large summary contributions into answers. To further improve answer quality, we propose a subtree extraction based method, which is proven to guarantee achieving higher-quality answers. In addition, we develop a scalable algorithm k-PCGS based on candidate pruning and DAG compression for large-scale hierarchical DAGs. Extensive experiments on large real-world datasets demonstrate both the effectiveness and efficiency of proposed algorithms.
AB - Directed acyclic graph (DAG) is an essentially important model to represent terminologies and their hierarchical relationships, such as Disease Ontology. Due to massive terminologies and complex structures in a large DAG, it is challenging to summarize the whole hierarchical DAG. In this paper, we study a new problem of finding k representative vertices to summarize a hierarchical DAG. To depict diverse summarization and important vertices, we design a summary score function for capturing vertices' diversity coverage and structure correlation. The studied problem is theoretically proven to be NP-hard. To efficiently tackle it, we propose a greedy algorithm with an approximation guarantee, which iteratively adds vertices with the large summary contributions into answers. To further improve answer quality, we propose a subtree extraction based method, which is proven to guarantee achieving higher-quality answers. In addition, we develop a scalable algorithm k-PCGS based on candidate pruning and DAG compression for large-scale hierarchical DAGs. Extensive experiments on large real-world datasets demonstrate both the effectiveness and efficiency of proposed algorithms.
KW - data summarization
KW - directed acyclic graph
KW - hierarchy
UR - http://www.scopus.com/inward/record.url?scp=85095866530&partnerID=8YFLogxK
U2 - 10.1145/3340531.3411899
DO - 10.1145/3340531.3411899
M3 - Conference proceeding
AN - SCOPUS:85095866530
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 1903
EP - 1912
BT - CIKM 2020 - Proceedings of the 29th ACM International Conference on Information and Knowledge Management
PB - Association for Computing Machinery (ACM)
T2 - 29th ACM International Conference on Information and Knowledge Management, CIKM 2020
Y2 - 19 October 2020 through 23 October 2020
ER -