TY - JOUR
T1 - Distributed and Adaptive Partitioning for Large Graphs in Geo-Distributed Data Centers
AU - Tan, Haobin
AU - Xiao, Yao
AU - Zhou, Amelie Chi
AU - Lu, Kezhong
AU - Yang, Xuan
N1 - Funding Information:
This work is supported by the National Natural Science Foundation of China 62172282, a CCF-Ant research grant, and a startup grant of HKBU.
PY - 2025/4/3
Y1 - 2025/4/3
N2 - Graph partitioning is of great importance to optimizing the performance and cost of geo-distributed graph analytics applications. However, it is non-trivial to obtain efficient and effective partitioning due to the challenges brought by the large graph scales, dynamic graph changes and the network heterogeneity in geo-distributed data centers (DCs). Existing studies usually adopt heuristic-based methods to achieve fast and balanced partitioning for large graphs, which are not powerful enough to address the complexity in our problem. Further, graph structures of many applications can change at various frequencies. Dynamic partitioning methods usually focus on achieving low latency to quickly adapt to changes, which unfortunately sacrifices partitioning effectiveness. Also, such methods are not aware of the dynamicity of graphs and can over sacrifice effectiveness for unnecessarily low latency. To address the limitations of existing studies, we propose DistRLCut, a novel graph partitioner which leverages Multi-Agent Reinforcement Learning (MARL) to solve the complexity of the partitioning problem. To achieve fast partitioning for large graphs, DistRLCut adapts MARL to a distributed implementation which significantly accelerates the learning process. Further, DistRLCut incorporates two techniques to trade-off between partitioning effectiveness and efficiency, including local training and agent sampling. By adaptively tuning the number of local training iterations and the agent sampling rate, DistRLCut is able to achieve good partitioning results within an overhead constraint required by graph dynamicity. Experiments using real cloud DCs and real-world graphs show that, compared to state-of-the-art static partitioning methods, DistRLCut improves the performance of geo-distributed graph analytics by 11%-95%. DistRLCut can partition over 28 million edges per second, showcasing its scalability for large graphs. With varying graph changing frequencies, DistRLCut can improve the performance by up to 71% compared to state-of-the-art dynamic partitioning.
AB - Graph partitioning is of great importance to optimizing the performance and cost of geo-distributed graph analytics applications. However, it is non-trivial to obtain efficient and effective partitioning due to the challenges brought by the large graph scales, dynamic graph changes and the network heterogeneity in geo-distributed data centers (DCs). Existing studies usually adopt heuristic-based methods to achieve fast and balanced partitioning for large graphs, which are not powerful enough to address the complexity in our problem. Further, graph structures of many applications can change at various frequencies. Dynamic partitioning methods usually focus on achieving low latency to quickly adapt to changes, which unfortunately sacrifices partitioning effectiveness. Also, such methods are not aware of the dynamicity of graphs and can over sacrifice effectiveness for unnecessarily low latency. To address the limitations of existing studies, we propose DistRLCut, a novel graph partitioner which leverages Multi-Agent Reinforcement Learning (MARL) to solve the complexity of the partitioning problem. To achieve fast partitioning for large graphs, DistRLCut adapts MARL to a distributed implementation which significantly accelerates the learning process. Further, DistRLCut incorporates two techniques to trade-off between partitioning effectiveness and efficiency, including local training and agent sampling. By adaptively tuning the number of local training iterations and the agent sampling rate, DistRLCut is able to achieve good partitioning results within an overhead constraint required by graph dynamicity. Experiments using real cloud DCs and real-world graphs show that, compared to state-of-the-art static partitioning methods, DistRLCut improves the performance of geo-distributed graph analytics by 11%-95%. DistRLCut can partition over 28 million edges per second, showcasing its scalability for large graphs. With varying graph changing frequencies, DistRLCut can improve the performance by up to 71% compared to state-of-the-art dynamic partitioning.
KW - Graph partitioning
KW - geo-distributed data centers
KW - reinforcement learning
UR - http://www.scopus.com/inward/record.url?scp=105002129134&partnerID=8YFLogxK
U2 - 10.1109/TPDS.2025.3557610
DO - 10.1109/TPDS.2025.3557610
M3 - Journal article
SN - 2161-9883
VL - 36
SP - 1161
EP - 1174
JO - IEEE Transactions on Parallel and Distributed Systems
JF - IEEE Transactions on Parallel and Distributed Systems
IS - 6
ER -