TY - JOUR
T1 - Online Management for Edge-Cloud Collaborative Continuous Learning
T2 - A Two-timescale Approach
AU - Lin, Shaohui
AU - Zhang, Xiaoxi
AU - Li, Yupeng
AU - Joe-Wong, Carlee
AU - Duan, Jingpu
AU - Yu, Dongxiao
AU - Wu, Yu
AU - Chen, Xu
N1 - This work was supported in part by NSFC under Grant 62472460, Grant 62102460, and Grant 62202402, in part by the Guangdong Basic and Applied Basic Research Foundation under Grant 2024A1515010161, Grant 2022A1515011583, Grant 2023A1515011562, Grant 2023A1515012982, Grant 2023B1515120058, and Grant 2021B151520008, in part by the Young Outstanding Award under the Zhujiang Talent Plan of Guangdong Province, in part by the Guangzhou Basic and Applied Basic Research Program under Grant 2024A04J6367, in part by the Hong Kong RGC Early Career Scheme under Grant 22202423, in part by the Initiation Grant for Faculty Niche Research Areas 2023/24 under Grant RC-FNRA-IG/23-24/COMM/01, in part by the Germany/Hong Kong Joint Research Scheme sponsored by the Research Grants Council of Hong Kong and the German Academic Exchange Service of Germany under Grant G-HKBU203/22, in part by NSF under Grant CNS-2106891 and Grant CNS-1751075, and in part by the Startup Grant (Tier 1) for New Academics AY2020/21 of Hong Kong Baptist University.
PY - 2024/12
Y1 - 2024/12
N2 - Deep learning (DL) powered real-time applications usually require continuous training on data streams generated over time and across different geographical locations. Enabling data offloading among computation nodes during model training is a promising way to mitigate the problem that devices generating large datasets may have low computation capability. However, offloading can compromise model convergence and incurs communication costs, which must be balanced against the long-term costs of computation and model synchronization. This paper therefore proposes EdgeC3, a novel framework that jointly optimizes the model aggregation frequency and dynamic offloading for continuously generated data streams, navigating the trade-off between long-term accuracy and cost. We first derive a new error bound that captures the impact of data dynamics, which vary over time and are heterogeneous across devices, and that quantifies the varying data heterogeneity between the local models and the global one. Based on this bound, we design a two-timescale online optimization framework. On the longer timescale, we periodically learn the synchronization frequency to adapt to uncertain future offloading and network changes. On the finer timescale, we manage online offloading by extending Lyapunov optimization techniques to an unconventional setting in which the long-term global constraint is subject to abrupt changes in the aggregation frequency decided on the longer timescale. Finally, we theoretically prove the convergence of EdgeC3 by integrating the coupled effects of the two-timescale decisions, and we demonstrate its advantage through extensive experiments on distributed DL training across different domains.
KW - Collaborative federated learning
KW - continuous learning
KW - edge-cloud collaboration
KW - two-timescale
UR - http://www.scopus.com/inward/record.url?scp=85203444304&partnerID=8YFLogxK
DO - 10.1109/TMC.2024.3451715
M3 - Journal article
AN - SCOPUS:85203444304
SN - 1536-1233
VL - 23
SP - 14561
EP - 14574
JO - IEEE Transactions on Mobile Computing
JF - IEEE Transactions on Mobile Computing
IS - 12
ER -