Multi-level checkpointing and silent error detection for linear workflows

Benoit, Anne and Cavelan, Aurélien and Robert, Yves and Sun, Hongyang. (2018) Multi-level checkpointing and silent error detection for linear workflows. Journal of Computational Science, 28. pp. 398-415.

Full text not available from this repository.

Official URL: https://edoc.unibas.ch/68676/

Abstract

We focus on High Performance Computing (HPC) workflows whose dependency graph forms a linear chain, and we extend single-level checkpointing in two important directions. Our first contribution targets silent errors, and combines in-memory checkpoints with both partial and guaranteed verifications. Our second contribution deals with multi-level checkpointing for fail-stop errors. We present sophisticated dynamic programming algorithms that return the optimal solution for each problem in polynomial time. We also show how to combine all these techniques and solve the problem with both fail-stop and silent errors. Simulation results demonstrate that these extensions lead to significantly improved performance compared to the standard single-level checkpointing algorithm.
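
To make the baseline concrete, the following is a minimal sketch of single-level checkpoint placement on a linear chain by dynamic programming. It is not the multi-level or silent-error algorithm from the article. It assumes exponentially distributed fail-stop errors with rate `lam`, zero downtime, and uniform checkpoint and recovery costs, none of which are stated in the abstract. The function names `expected_segment_time` and `place_checkpoints` are hypothetical and chosen for illustration only.

```python
import math
from typing import List, Sequence, Tuple


def expected_segment_time(work: float, ckpt: float, rec: float, lam: float) -> float:
    """Expected time to complete `work` followed by a checkpoint of cost `ckpt`.

    Assumption (not from the abstract): fail-stop errors arrive as a Poisson
    process of rate `lam`, there is no downtime, and each failure is followed
    by a recovery of cost `rec` that can itself be hit by a failure.
    """
    if lam == 0.0:
        return work + ckpt  # error-free execution
    return math.exp(lam * rec) * (math.exp(lam * (work + ckpt)) - 1.0) / lam


def place_checkpoints(
    weights: Sequence[float], ckpt: float, rec: float, lam: float
) -> Tuple[float, List[int]]:
    """Return (expected makespan, 1-based indices of tasks followed by a checkpoint).

    best[j] is the minimal expected time to finish tasks 1..j with a
    checkpoint taken right after task j. The last task is always checkpointed.
    """
    n = len(weights)
    prefix = [0.0]
    for w in weights:
        prefix.append(prefix[-1] + w)

    best = [0.0] + [math.inf] * n
    prev = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):  # the previous checkpoint sits after task i
            segment = expected_segment_time(prefix[j] - prefix[i], ckpt, rec, lam)
            candidate = best[i] + segment
            if candidate < best[j]:
                best[j] = candidate
                prev[j] = i

    # Walk the predecessor links backwards to recover the checkpoint positions.
    positions: List[int] = []
    j = n
    while j > 0:
        positions.append(j)
        j = prev[j]
    positions.reverse()
    return best[n], positions
```

The double loop runs in O(n^2) time for a chain of n tasks, which illustrates the polynomial-time flavor of such algorithms. Handling several checkpoint levels or verifications for silent errors would require additional state in the recurrence.
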
Faculties and Departments: 05 Faculty of Science > Departement Mathematik und Informatik > Informatik > High Performance Computing (Ciorba)
UniBasel Contributors: Cavelan, Aurélien
Item Type: Article, refereed
Article Subtype: Research Article
Publisher: Elsevier
ISSN: 1877-7503
Note: Publication type according to Uni Basel Research Database: Journal article
Identification Number:
Last Modified: 16 Nov 2020 16:32
Deposited On: 16 Nov 2020 16:32
