PAW: Data Partitioning Meets Workload Variance

Zhe Li, Man Lung Yiu, Tsz Nam Chan

Research output: Chapter in book/report/conference proceeding › Conference proceeding › peer-review

2 Citations (Scopus)


In distributed storage systems (e.g., HDFS, Amazon S3, Databricks), a dataset is partitioned to improve performance and availability. Recent partitioning methods optimize the query performance of partitions with respect to the historical query workload. In practice, however, future query workloads may deviate from the historical query workload, which degrades the performance of existing partitioning methods.
To fill this research gap, we model the variance of future query workloads relative to the historical query workload, then exploit this model to produce partitions that perform well for future query workloads. In addition, we explore the space of irregularly shaped partition regions to further optimize query performance. Experimental results on TPC-H and real datasets show that our proposal is up to 70× more efficient than the state-of-the-art method.
Original language: English
Title of host publication: Proceedings - 2022 IEEE 38th International Conference on Data Engineering, ICDE 2022
Number of pages: 13
ISBN (Electronic): 9781665408837
ISBN (Print): 9781665408844
Publication status: Published - May 2022
Event: 38th IEEE International Conference on Data Engineering, ICDE 2022 - Virtual, Kuala Lumpur, Malaysia
Duration: 9 May 2022 – 12 May 2022

Publication series

Name: Proceedings of IEEE International Conference on Data Engineering (ICDE)
ISSN (Print): 1063-6382
ISSN (Electronic): 2375-026X


Conference: 38th IEEE International Conference on Data Engineering, ICDE 2022
City: Kuala Lumpur

Scopus Subject Areas

  • Software
  • Information Systems
  • Signal Processing

