TY - JOUR
T1 - Top-K Representative Search for Comparative Tree Summarization
AU - Chen, Yuqi
AU - Huang, Xin
AU - Chen, Bilian
N1 - This work is supported by Hong Kong RGC Projects Nos. 12200021, C2003-23Y, and National Natural Science Foundation of China under grant No. 12371515. Xin Huang and Bilian Chen are the corresponding authors.
Publisher Copyright:
© 2025 IEEE.
PY - 2025/5/9
Y1 - 2025/5/9
N2 - Data summarization aims at utilizing a small-scale summary to represent massive datasets as a whole, which is useful for visualization and information sipped generation. However, most existing studies of hierarchical summarization only work on one single tree by selecting k representative nodes, which neglects an important problem of comparative summarization on two trees. In this paper, given two trees with the same topology structure and different node weights, we aim at finding k representative nodes, where k1 nodes summarize the common relationship between them and k2 nodes highlight significantly different subtrees meanwhile satisfying k1 + k2 = k. To optimize summarization results, we introduce a scaling coefficient for balancing the summary view between two subtrees in terms of similarity and difference. Additionally, we propose a novel definition based on the Hellinger distance to quantify the node distribution difference between two subtrees. We present a greedy algorithm SVDT to find high-quality results with approximation guaranteed in an efficient way. Furthermore, we explore an extension of our comparative summarization to handle two trees with different structures. Extensive experiments demonstrate the effectiveness and efficiency of our SVDT algorithm against existing summarization competitors.
AB - Data summarization aims at utilizing a small-scale summary to represent massive datasets as a whole, which is useful for visualization and information sipped generation. However, most existing studies of hierarchical summarization only work on one single tree by selecting k representative nodes, which neglects an important problem of comparative summarization on two trees. In this paper, given two trees with the same topology structure and different node weights, we aim at finding k representative nodes, where k1 nodes summarize the common relationship between them and k2 nodes highlight significantly different subtrees meanwhile satisfying k1 + k2 = k. To optimize summarization results, we introduce a scaling coefficient for balancing the summary view between two subtrees in terms of similarity and difference. Additionally, we propose a novel definition based on the Hellinger distance to quantify the node distribution difference between two subtrees. We present a greedy algorithm SVDT to find high-quality results with approximation guaranteed in an efficient way. Furthermore, we explore an extension of our comparative summarization to handle two trees with different structures. Extensive experiments demonstrate the effectiveness and efficiency of our SVDT algorithm against existing summarization competitors.
KW - hellinger distance
KW - top-K diversification
KW - Tree summarization
UR - http://www.scopus.com/inward/record.url?scp=105004930873&partnerID=8YFLogxK
U2 - 10.1109/TKDE.2025.3565845
DO - 10.1109/TKDE.2025.3565845
M3 - Journal article
AN - SCOPUS:105004930873
SN - 1041-4347
SP - 1
EP - 7
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
ER -