Failure Injection¶
The failures
application randomly injects task failure scenarios into another TaPS application.
This is useful to understanding how an application or execution engine handles certain forms of task failures.
Warning
Some of the error scenarios may have unexpected consequences to other actively running programs. For example:
MANAGER_KILLED
will kill the parent process that a task is executing within.MEMORY
will continually consume memory until an error is raised. This can cause other applications to crash.NODE_KILLED
attempt to kill other processes on the node to simulate failures.RANDOM
will select a random, potentially dangerous, failure mode.WORKER_KILLED
will kill the process that a task is executing within.
Please be careful when using this application, and run the application in an isolated environment (e.g., a container or ephemeral node).
Installation¶
The failures
application requires executing a base application (the application into which failures are injected), and some base applications have additional requirements.
For example, to use failures
with the cholesky
application, install TaPS using:
Warning
In older versions of TaPS, the failures
app was not compatible with dill==0.3.6
which, as of writing, is the pinned version installed by globus-compute-sdk
.
If you encounter serialization issues with Globus Compute/Parsl when using the failures
app, manually upgrade dill
:
dill
versions but you must ensure the same version of dill
is installed on all endpoints.
See Issue #155 for more information and PR #163 which addresses this issue.
Data¶
Data requirements depend on the base application that failures are injected into.
Example¶
The base application name is specified using --app.base
and the corresponding configuration must be provided as a JSON string to --app.config
.
python -m taps.run --app failures \
--app.base cholesky \
--app.config '{"matrix_size": 100, "block_size": 50}' \
--app.failure-rate 0.5 --app.failure-type dependency \
--engine.executor process-pool --engine.executor.max-processes 4
Alternatively, the app can be configured using a TOML file.