TY - JOUR
T1 - Efficient and Optimal Algorithms for Tree Summarization with Weighted Terminologies
AU - Zhu, Xuliang
AU - Huang, Xin
AU - Choi, Byron Koon Kau
AU - Xu, Jianliang
AU - Cheung, William Kwok Wai
AU - Zhang, Yanchun
AU - Liu, Jiming
N1 - Funding information:
This work was supported by HK RGC under Grants 12200021, 12202221, 12201520, 22200320, 12201119, and 12201518.
Publisher Copyright:
© 1989-2012 IEEE.
PY - 2023/3/1
Y1 - 2023/3/1
N2 - Data summarization that presents a small subset of a dataset to users has been widely applied in numerous applications and systems. Many datasets are coded with hierarchical terminologies, e.g., gene ontology, disease ontology, to name a few. In this paper, we study the weighted tree summarization. We motivate and formulate our kWTS- problem as selecting a diverse set of k nodes to summarize a hierarchical tree T with weighted terminologies. We first propose an efficient greedy tree summarization algorithm GTS. It solves the problem with (1-1/e)-approximation guarantee. Although GTS achieves quality-guaranteed answers approximately, but it is still not optimal. To tackle the problem optimally, we further develop a dynamic programming algorithm OTS to obtain optimal answers for kWTS problem in O(nhk3) time, where n,h are the node size and height in tree T. The algorithm complexity and correctness of OTS are theoretically analyzed. In addition, we propose a useful optimization technique of tree reduction to remove useless nodes with zero weights and shrink the tree into a smaller one, which ensures the efficiency acceleration of both GTS and OTS in real-world datasets. Moreover, we illustrate one useful application of graph visualization based on the answer of k-sized tree summarization and show it in a novel case study. Extensive experimental results on real-world datasets show the effectiveness and efficiency of our proposed approximate and optimal algorithms for tree summarization. Furthermore, we conduct a usability evaluation of attractive topic recommendation on ACM Computing Classification System dataset to validate the usefulness of our model and algorithms.
AB - Data summarization that presents a small subset of a dataset to users has been widely applied in numerous applications and systems. Many datasets are coded with hierarchical terminologies, e.g., gene ontology, disease ontology, to name a few. In this paper, we study the weighted tree summarization. We motivate and formulate our kWTS- problem as selecting a diverse set of k nodes to summarize a hierarchical tree T with weighted terminologies. We first propose an efficient greedy tree summarization algorithm GTS. It solves the problem with (1-1/e)-approximation guarantee. Although GTS achieves quality-guaranteed answers approximately, but it is still not optimal. To tackle the problem optimally, we further develop a dynamic programming algorithm OTS to obtain optimal answers for kWTS problem in O(nhk3) time, where n,h are the node size and height in tree T. The algorithm complexity and correctness of OTS are theoretically analyzed. In addition, we propose a useful optimization technique of tree reduction to remove useless nodes with zero weights and shrink the tree into a smaller one, which ensures the efficiency acceleration of both GTS and OTS in real-world datasets. Moreover, we illustrate one useful application of graph visualization based on the answer of k-sized tree summarization and show it in a novel case study. Extensive experimental results on real-world datasets show the effectiveness and efficiency of our proposed approximate and optimal algorithms for tree summarization. Furthermore, we conduct a usability evaluation of attractive topic recommendation on ACM Computing Classification System dataset to validate the usefulness of our model and algorithms.
KW - Data Summarization
KW - Hierarchy
KW - Optimal Algorithm
KW - Top-k
KW - Tree
UR - http://www.scopus.com/inward/record.url?scp=85118280054&partnerID=8YFLogxK
U2 - 10.1109/TKDE.2021.3120722
DO - 10.1109/TKDE.2021.3120722
M3 - Journal article
AN - SCOPUS:85118280054
SN - 1041-4347
VL - 35
SP - 2500
EP - 2514
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
IS - 3
ER -