Distributed and Adaptive Partitioning for Large Graphs in Geo-Distributed Data Centers

Haobin Tan, Yao Xiao, Amelie Chi Zhou*, Kezhong Lu, Xuan Yang

*Corresponding author for this work

Research output: Contribution to journalJournal articlepeer-review

Abstract

Graph partitioning is of great importance to optimizing the performance and cost of geo-distributed graph analytics applications. However, it is non-trivial to obtain efficient and effective partitioning due to the challenges brought by the large graph scales, dynamic graph changes and the network heterogeneity in geo-distributed data centers (DCs). Existing studies usually adopt heuristic-based methods to achieve fast and balanced partitioning for large graphs, which are not powerful enough to address the complexity in our problem. Further, graph structures of many applications can change at various frequencies. Dynamic partitioning methods usually focus on achieving low latency to quickly adapt to changes, which unfortunately sacrifices partitioning effectiveness. Also, such methods are not aware of the dynamicity of graphs and can over sacrifice effectiveness for unnecessarily low latency. To address the limitations of existing studies, we propose DistRLCut, a novel graph partitioner which leverages Multi-Agent Reinforcement Learning (MARL) to solve the complexity of the partitioning problem. To achieve fast partitioning for large graphs, DistRLCut adapts MARL to a distributed implementation which significantly accelerates the learning process. Further, DistRLCut incorporates two techniques to trade-off between partitioning effectiveness and efficiency, including local training and agent sampling. By adaptively tuning the number of local training iterations and the agent sampling rate, DistRLCut is able to achieve good partitioning results within an overhead constraint required by graph dynamicity. Experiments using real cloud DCs and real-world graphs show that, compared to state-of-the-art static partitioning methods, DistRLCut improves the performance of geo-distributed graph analytics by 11%-95%. DistRLCut can partition over 28 million edges per second, showcasing its scalability for large graphs. With varying graph changing frequencies, DistRLCut can improve the performance by up to 71% compared to state-of-the-art dynamic partitioning.
Original languageEnglish
Pages (from-to)1161-1174
Number of pages14
JournalIEEE Transactions on Parallel and Distributed Systems
Volume36
Issue number6
Early online date3 Apr 2025
DOIs
Publication statusE-pub ahead of print - 3 Apr 2025

User-Defined Keywords

  • Graph partitioning
  • geo-distributed data centers
  • reinforcement learning

Fingerprint

Dive into the research topics of 'Distributed and Adaptive Partitioning for Large Graphs in Geo-Distributed Data Centers'. Together they form a unique fingerprint.

Cite this