Description

Developed by the KAUST Supercomputing Laboratory (KSL), Decimate is a SLURM extension written in Python, designed to handle dependent jobs more easily and efficiently. Decimate transparently adds parameters to the SLURM sbatch command to check the correctness of jobs, and automatically reschedules jobs found to be faulty. Using Decimate, one can submit, run, monitor, or terminate a workflow composed of dependent jobs. On request, the user is informed by mail of the progress of their workflow on the system, through standardized or customized messages.

If one part of the workflow fails, Decimate automatically detects the failure, signals it to the user, and relaunches the failed part after fixing the job dependencies. By default, if the same failure happens three consecutive times, Decimate cancels the whole workflow and removes all dependent jobs from the schedule. Decimate also allows users to define their own mail alerts, which can be sent at any point of the workflow through a call to a Python method. This feature will also be available from bash in a future version.

Users can also design customized checking functions, whose purpose is to validate whether a step of the workflow was successful. This could involve checking for the presence of result files, grepping them for error or success messages, or computing a ratio or checksum. These intermediate results can easily be passed to Decimate to confirm or reject the correctness of any step. They can also be forwarded by mail to the user while the workflow is executing.
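The checking and retry behaviour described above can be illustrated with a minimal, standalone Python sketch. It does not use Decimate's actual API: the names `check_step`, `run_with_retries`, and `MAX_ATTEMPTS`, as well as the `DONE` and `ERROR` markers, are hypothetical and chosen only for illustration. The sketch assumes a step writes a result file, that a check passes when the file exists and contains a success marker but no error marker, and that a step is abandoned after three consecutive failed attempts.

```python
import os

# Assumption: mirrors the default of three consecutive failures described above.
MAX_ATTEMPTS = 3


def check_step(result_path, success_marker="DONE", error_marker="ERROR"):
    """Hypothetical check for one workflow step; returns (ok, message)."""
    # The result file must exist.
    if not os.path.isfile(result_path):
        return False, "missing result file: " + result_path
    with open(result_path) as handle:
        text = handle.read()
    # Equivalent of grepping the file for an error or success message.
    if error_marker in text:
        return False, "error marker found in " + result_path
    if success_marker not in text:
        return False, "success marker not found in " + result_path
    return True, "step validated"


def run_with_retries(run_step, result_path, max_attempts=MAX_ATTEMPTS):
    """Rerun a step until its check passes or max_attempts failures occur.

    run_step is any callable taking the attempt number (1-based).
    Returns (ok, attempts_used, last_message).
    """
    message = "not run"
    for attempt in range(1, max_attempts + 1):
        run_step(attempt)
        ok, message = check_step(result_path)
        if ok:
            return True, attempt, message
    # Every attempt failed: a real workflow would cancel dependent jobs here.
    return False, max_attempts, message
```

In this sketch the message returned by the check is the kind of intermediate result that could be reported back or mailed to the user. In a real workflow, `run_step` would submit a job to the scheduler rather than run a local callable.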

Repository

https://github.com/samkos/decimate.git

Project Slug

decimate

Last Built

3 years, 6 months ago (passed)

Home Page

https://samkos.github.io/decimate/

Tags

python, slurm, fault-tolerant

Short URLs

decimate.readthedocs.io
decimate.rtfd.io

Default Version

latest

'latest' Version

dist