Abstract
Data summarization that presents a small subset of a dataset to users has been widely applied in numerous applications and systems. Many datasets are coded with hierarchical terminologies, e.g., gene ontology, disease ontology, to name a few. In this paper, we study the weighted tree summarization. We motivate and formulate our kWTS-problem as selecting a diverse set of k nodes to summarize a hierarchical tree T with weighted terminologies. We first propose an efficient greedy tree summarization algorithm GTS. It solves the problem with (1-1/e)-approximation guarantee. Although GTS achieves quality-guaranteed answers approximately, but it is still not optimal. To tackle the problem optimally, we further develop a dynamic programming algorithm OTS to obtain optimal answers for kWTS-problem in O(nhk^3) time, where n, h are the node size and height in tree T. The algorithm complexity and correctness of OTS are theoretically analyzed. In addition, we propose a useful optimization technique of tree reduction to remove useless nodes with zero weights and shrink the tree into a smaller one, which ensures the efficiency acceleration of both GTS and OTS in real-world datasets. Moreover, we illustrate one useful application of graph visualization based on the answer of k-sized tree summarization and show it in a novel case study.
Original language | English |
---|---|
Pages (from-to) | 2500-2514 |
Number of pages | 15 |
Journal | IEEE Transactions on Knowledge and Data Engineering |
Volume | 35 |
Issue number | 3 |
Early online date | 19 Oct 2021 |
DOIs | |
Publication status | Published - 1 Mar 2023 |
Scopus Subject Areas
- Information Systems
- Computer Science Applications
- Computational Theory and Mathematics
User-Defined Keywords
- Approximation algorithms
- Data Summarization
- Diseases
- Dynamic programming
- Heuristic algorithms
- Hierarchy
- Ontologies
- Optimal Algorithm
- Optimization
- Terminology
- Top-k
- Tree