TY - GEN
T1 - Real-time in-memory checkpointing for future hybrid memory systems
AU - Gao, Shen
AU - He, Bingsheng
AU - Xu, Jianliang
N1 - Publisher Copyright: © Copyright 2015 ACM.
PY - 2015/6/8
Y1 - 2015/6/8
N2 - In this paper, we study real-time in-memory checkpointing as an effective means to improve the reliability of future large-scale parallel processing systems. Under this context, the checkpoint overhead can become a significant performance bottleneck. Novel memory system designs with upcoming non-volatile random access memory (NVRAM) technologies are emerging to address this performance issue. However, we find that those designs can still have prohibitively high checkpoint overhead and system downtime, especially when checkpoints are taken frequently to implement a reliable system. In this paper, we propose a novel in-memory checkpointing system, named Mona, for reducing the checkpoint overhead of hybrid memory systems with NVRAM and DRAM. To minimize the in-memory checkpoint overhead, Mona dynamically writes partial checkpoints from DRAM to NVRAM during application execution. To reduce the interference of partial checkpointing, Mona utilizes runtime idle periods and leverages a cost model to guide partial checkpointing decisions for individual DRAM ranks. We further develop load-balancing mechanisms to balance checkpoint overheads across different DRAM ranks. Simulation results demonstrate the efficiency and effectiveness of Mona in reducing the checkpoint overhead, downtime and restarting time.
AB - In this paper, we study real-time in-memory checkpointing as an effective means to improve the reliability of future large-scale parallel processing systems. Under this context, the checkpoint overhead can become a significant performance bottleneck. Novel memory system designs with upcoming non-volatile random access memory (NVRAM) technologies are emerging to address this performance issue. However, we find that those designs can still have prohibitively high checkpoint overhead and system downtime, especially when checkpoints are taken frequently to implement a reliable system. In this paper, we propose a novel in-memory checkpointing system, named Mona, for reducing the checkpoint overhead of hybrid memory systems with NVRAM and DRAM. To minimize the in-memory checkpoint overhead, Mona dynamically writes partial checkpoints from DRAM to NVRAM during application execution. To reduce the interference of partial checkpointing, Mona utilizes runtime idle periods and leverages a cost model to guide partial checkpointing decisions for individual DRAM ranks. We further develop load-balancing mechanisms to balance checkpoint overheads across different DRAM ranks. Simulation results demonstrate the efficiency and effectiveness of Mona in reducing the checkpoint overhead, downtime and restarting time.
KW - Checkpointing
KW - NVRAM
KW - Parallel computing
KW - Phase change memory
UR - http://www.scopus.com/inward/record.url?scp=84957574666&partnerID=8YFLogxK
U2 - 10.1145/2751205.2751212
DO - 10.1145/2751205.2751212
M3 - Conference proceeding
AN - SCOPUS:84957574666
T3 - Proceedings of the International Conference on Supercomputing
SP - 263
EP - 272
BT - ICS 2015 - Proceedings of the 29th ACM International Conference on Supercomputing
PB - Association for Computing Machinery (ACM)
T2 - 29th ACM International Conference on Supercomputing, ICS 2015
Y2 - 8 June 2015 through 11 June 2015
ER -