Efficient and Optimal Algorithms for Tree Summarization with Weighted Terminologies

Research output: Contribution to journalArticlepeer-review

Abstract

Data summarization that presents a small subset of a dataset to users has been widely applied in numerous applications and systems. Many datasets are coded with hierarchical terminologies, e.g., gene ontology, disease ontology, to name a few. In this paper, we study the weighted tree summarization. We motivate and formulate our kWTS-problem as selecting a diverse set of k nodes to summarize a hierarchical tree T with weighted terminologies. We first propose an efficient greedy tree summarization algorithm GTS. It solves the problem with (1-1/e)-approximation guarantee. Although GTS achieves quality-guaranteed answers approximately, but it is still not optimal. To tackle the problem optimally, we further develop a dynamic programming algorithm OTS to obtain optimal answers for kWTS-problem in O(nhk^3) time, where n, h are the node size and height in tree T. The algorithm complexity and correctness of OTS are theoretically analyzed. In addition, we propose a useful optimization technique of tree reduction to remove useless nodes with zero weights and shrink the tree into a smaller one, which ensures the efficiency acceleration of both GTS and OTS in real-world datasets. Moreover, we illustrate one useful application of graph visualization based on the answer of k-sized tree summarization and show it in a novel case study.

Original languageEnglish
JournalIEEE Transactions on Knowledge and Data Engineering
DOIs
Publication statusE-pub ahead of print - 19 Oct 2021

Scopus Subject Areas

  • Information Systems
  • Computer Science Applications
  • Computational Theory and Mathematics

User-Defined Keywords

  • Approximation algorithms
  • Data Summarization
  • Diseases
  • Dynamic programming
  • Heuristic algorithms
  • Hierarchy
  • Ontologies
  • Optimal Algorithm
  • Optimization
  • Terminology
  • Top-k
  • Tree

Fingerprint

Dive into the research topics of 'Efficient and Optimal Algorithms for Tree Summarization with Weighted Terminologies'. Together they form a unique fingerprint.

Cite this