A generic approach to scheduling and checkpointing workflows
Title | A generic approach to scheduling and checkpointing workflows |
Publication Type | Journal Article |
Year of Publication | 2019 |
Authors | Han L, Le Fevre V, Canon L-C, Robert Y, Vivien F |
Journal | INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS |
Volume | 33 |
Pagination | 1094342019866891 |
Date Published | NOV |
Type of Article | Article |
ISSN | 1094-3420 |
Keywords | checkpoint, fail-stop error, resilience, Workflow |
Abstract | This work deals with scheduling and checkpointing strategies to execute scientific workflows on failure-prone large-scale platforms. To the best of our knowledge, this work is the first to target fail-stop errors for arbitrary workflows. Most previous work addresses soft errors, which, unlike fail-stop errors, corrupt the task being executed by a processor but do not cause the entire memory of that processor to be lost. We revisit classical mapping heuristics such as Heterogeneous Earliest Finish Time and MinMin and complement them with several checkpointing strategies. The objective is to derive an efficient trade-off between checkpointing every task (CkptAll), which is overkill when failures are rare events, and checkpointing no task (CkptNone), which induces dramatic re-execution overhead even when only a few failures strike during execution. Unlike previous work, our approach applies to arbitrary workflows, not just special classes of dependence graphs such as minimal series-parallel graphs. Extensive experiments report significant gains over both CkptAll and CkptNone for a wide variety of workflows. |
DOI | 10.1177/1094342019866891 |
Early Access Date | AUG 2019 |