Abstract
In stream processing systems, Key Grouping is a commonly employed partitioning scheme for distributing input tuples among parallel instances of stateful operators. With key grouping, tuples shared public keys in the stream are designated to the specific instance responsible for that key. Typically, the implementation of key grouping involves the use of a hash function. While it is convenient and deterministic, it is also known to cause load imbalance between parallel instances, especially in the presence of skewed data streams. Key-Splitting is an effective technique that distributes tasks associated with keys to downstream operators, facilitating load balancing at a relatively low cost. However, overly increasing parallel instances can lead to excessive aggregation costs, becoming a system bottleneck. In this paper, we show the high aggregation cost brought by the Key-Splitting partitioner at different levels of key separation. To address this challenge, we introduce an adaptive Key-Splitting method which controlling the degree of key separation. We propose a partitioner named FlexD, which aims to achieve dynamic adaptation of key separation limits for streaming data. The partitioner employs key grouping to distribute rare keys and dynamic expansion of processing instances to distribute hot keys. We implemented our method on Apache Storm and evaluated it by using real-world and synthetic datasets. Experimental results show that our method achieves a good balance between load balancing and aggregation cost. Moreover, it outperforms existing methods, achieving higher throughput.
Original language | English |
---|---|
Pages (from-to) | 164-178 |
Number of pages | 15 |
Journal | CCF Transactions on High Performance Computing |
Volume | 6 |
Issue number | 2 |
Early online date | 12 Jan 2024 |
DOIs | |
Publication status | Published - Apr 2024 |
Scopus Subject Areas
- Computer Science (miscellaneous)
- Information Systems
- Hardware and Architecture
- Computer Science Applications
User-Defined Keywords
- Key grouping
- Load balancing
- Stream partitioning
- Stream processing