Efficient and Optimal Algorithms for Tree Summarization with Weighted Terminologies

Xuliang Zhu, Xin Huang*, Byron Koon Kau Choi, Jianliang Xu, William Kwok Wai Cheung, Yanchun Zhang, Jiming Liu

*Corresponding author for this work

Research output: Contribution to journalJournal articlepeer-review

Abstract

Data summarization that presents a small subset of a dataset to users has been widely applied in numerous applications and systems. Many datasets are coded with hierarchical terminologies, e.g., gene ontology, disease ontology, to name a few. In this paper, we study the weighted tree summarization. We motivate and formulate our kWTS- problem as selecting a diverse set of k nodes to summarize a hierarchical tree T with weighted terminologies. We first propose an efficient greedy tree summarization algorithm GTS. It solves the problem with (1-1/e)-approximation guarantee. Although GTS achieves quality-guaranteed answers approximately, but it is still not optimal. To tackle the problem optimally, we further develop a dynamic programming algorithm OTS to obtain optimal answers for kWTS problem in O(nhk3) time, where n,h are the node size and height in tree T. The algorithm complexity and correctness of OTS are theoretically analyzed. In addition, we propose a useful optimization technique of tree reduction to remove useless nodes with zero weights and shrink the tree into a smaller one, which ensures the efficiency acceleration of both GTS and OTS in real-world datasets. Moreover, we illustrate one useful application of graph visualization based on the answer of k-sized tree summarization and show it in a novel case study. Extensive experimental results on real-world datasets show the effectiveness and efficiency of our proposed approximate and optimal algorithms for tree summarization. Furthermore, we conduct a usability evaluation of attractive topic recommendation on ACM Computing Classification System dataset to validate the usefulness of our model and algorithms.

Original languageEnglish
Pages (from-to)2500-2514
Number of pages15
JournalIEEE Transactions on Knowledge and Data Engineering
Volume35
Issue number3
Early online date19 Oct 2021
DOIs
Publication statusPublished - 1 Mar 2023

Scopus Subject Areas

  • Information Systems
  • Computer Science Applications
  • Computational Theory and Mathematics

User-Defined Keywords

  • Data Summarization
  • Hierarchy
  • Optimal Algorithm
  • Top-k
  • Tree

Fingerprint

Dive into the research topics of 'Efficient and Optimal Algorithms for Tree Summarization with Weighted Terminologies'. Together they form a unique fingerprint.

Cite this