A generic approach to scheduling and checkpointing workflows

Title: A generic approach to scheduling and checkpointing workflows
Publication type: Journal Article
Year of Publication: 2019
Authors: Han L, Le Fevre V, Canon L-C, Robert Y, Vivien F
Journal: INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS
Volume: 33
Pagination: 1094342019866891
Date Published: NOV
Type of Article: Article
ISSN: 1094-3420
Keywords: checkpoint, fail-stop error, resilience, Workflow
Abstract

This work deals with scheduling and checkpointing strategies to execute scientific workflows on failure-prone large-scale platforms. To the best of our knowledge, this work is the first to target fail-stop errors for arbitrary workflows. Most previous work addresses soft errors, which corrupt the task being executed by a processor but, unlike fail-stop errors, do not cause the entire memory of that processor to be lost. We revisit classical mapping heuristics such as Heterogeneous Earliest Finish Time and MinMin and complement them with several checkpointing strategies. The objective is to derive an efficient trade-off between checkpointing every task (CkptAll), which is overkill when failures are rare events, and checkpointing no task (CkptNone), which induces dramatic re-execution overhead even when only a few failures strike during execution. Unlike previous work, our approach applies to arbitrary workflows, not just special classes of dependence graphs such as minimal series-parallel graphs. Extensive experiments report significant gains over both CkptAll and CkptNone for a wide variety of workflows.

DOI: 10.1177/1094342019866891