TY - JOUR
T1 - From Global to Query-Dependent: Summarization of Large Hierarchical DAGs
AU - Zhu, Xuliang
AU - Huang, Xin
AU - Wang, Kai
AU - Xu, Jianliang
AU - Lin, Xuemin
N1 - Funding information:
This work was supported by Hong Kong RGC Projects Nos. 12200424 and C2003-23Y.
Publisher copyright:
© 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM
PY - 2025/9/20
Y1 - 2025/9/20
N2 - Hierarchical directed acyclic graph (DAG) is an essential model for representing terminologies and their hierarchical relationships, such as Disease Ontology and ImageNet categories. Due to the vast number of terminologies and complex structures in a large DAG, it becomes challenging for humans to effectively analyze and explore the hierarchical relationships they encode. Therefore, summarizing hierarchical DAGs is essential for enhancing the interpretability and visualization of the underlying hierarchy. Beyond visual data exploration, hierarchical DAG summarization also supports a range of applications, such as biomedical ontology analytics, snippet generation for information search, and summarized recommendation. In this paper, we address a new problem of finding k representative vertices to summarize a hierarchical DAG. To capture diverse summarization and identify important vertices, we design a summary score function that reflects vertices diversity coverage and structure correlation. The studied problem is theoretically proven to be NP-hard. To tackle it efficiently, we propose a greedy algorithm with an approximation guarantee that iteratively adds vertices with significant summary contributions to the answers. To further enhance the answer quality, we introduce a subtree extraction-based method that is proven to achieve higher-quality answers. Additionally, we develop a scalable algorithm, k-PCGS, which employs candidate pruning and DAG compression for large-scale hierarchical DAGs. For the query-dependent problem, we propose an index-based method and several optimization techniques to improve efficiency. Extensive experiments on large real-world datasets demonstrate the effectiveness and efficiency of our proposed algorithms.
AB - Hierarchical directed acyclic graph (DAG) is an essential model for representing terminologies and their hierarchical relationships, such as Disease Ontology and ImageNet categories. Due to the vast number of terminologies and complex structures in a large DAG, it becomes challenging for humans to effectively analyze and explore the hierarchical relationships they encode. Therefore, summarizing hierarchical DAGs is essential for enhancing the interpretability and visualization of the underlying hierarchy. Beyond visual data exploration, hierarchical DAG summarization also supports a range of applications, such as biomedical ontology analytics, snippet generation for information search, and summarized recommendation. In this paper, we address a new problem of finding k representative vertices to summarize a hierarchical DAG. To capture diverse summarization and identify important vertices, we design a summary score function that reflects vertices diversity coverage and structure correlation. The studied problem is theoretically proven to be NP-hard. To tackle it efficiently, we propose a greedy algorithm with an approximation guarantee that iteratively adds vertices with significant summary contributions to the answers. To further enhance the answer quality, we introduce a subtree extraction-based method that is proven to achieve higher-quality answers. Additionally, we develop a scalable algorithm, k-PCGS, which employs candidate pruning and DAG compression for large-scale hierarchical DAGs. For the query-dependent problem, we propose an index-based method and several optimization techniques to improve efficiency. Extensive experiments on large real-world datasets demonstrate the effectiveness and efficiency of our proposed algorithms.
KW - Hierarchy
KW - Data summarization
KW - Graph algorithm
U2 - 10.1145/3769079
DO - 10.1145/3769079
M3 - Journal article
SN - 0362-5915
JO - ACM Transactions on Database Systems
JF - ACM Transactions on Database Systems
ER -