Failure Injection¶
The failures application randomly injects task failure scenarios into another TaPS application.
This is useful for understanding how an application or execution engine handles certain forms of task failure.
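To make the idea of random failure injection concrete, here is a minimal sketch that wraps a task so each call fails with a fixed probability. It is an illustration only, not the TaPS implementation: `inject_failures`, `InjectedFailure`, and `square` are hypothetical names, and the only failure mode shown is a raised exception.

```python
import functools
import random


class InjectedFailure(RuntimeError):
    """Hypothetical error raised in place of running the wrapped task."""


def inject_failures(failure_rate, rng=None):
    """Wrap a task so each call fails with probability ``failure_rate``.

    This is an assumption-laden sketch, not how TaPS injects failures.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if rng is None:
        rng = random.Random()

    def decorator(task):
        @functools.wraps(task)
        def wrapper(*args, **kwargs):
            # Draw once per call and fail before the real task runs.
            if rng.random() < failure_rate:
                raise InjectedFailure(f"injected failure in {task.__name__}")
            return task(*args, **kwargs)

        return wrapper

    return decorator


@inject_failures(failure_rate=0.5, rng=random.Random(0))
def square(x):
    """Example task; roughly half of all calls raise InjectedFailure."""
    return x * x
```

Passing a seeded `random.Random` makes the sequence of injected failures reproducible, which helps when comparing how different executors react to the same failure pattern.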
Warning
Some of the error scenarios may have unexpected consequences for other actively running programs. For example:
- MANAGER_KILLED will kill the parent process that a task is executing within.
- MEMORY will continually consume memory until an error is raised. This can cause other applications to crash.
- NODE_KILLED will attempt to kill other processes on the node to simulate failures.
- RANDOM will select a random, potentially dangerous, failure mode.
- WORKER_KILLED will kill the process that a task is executing within.
Please be careful when using this application, and run the application in an isolated environment (e.g., a container or ephemeral node).
Installation¶
The failures application requires executing a base application (the application into which failures are injected), and some base applications have additional requirements.
For example, to use failures with the cholesky application, install TaPS using:
Warning
In older versions of TaPS, the failures app was not compatible with dill==0.3.6, which, as of this writing, is the version pinned by globus-compute-sdk.
If you encounter serialization issues with Globus Compute/Parsl when using the failures app, manually upgrade dill:
Whichever dill version you use, you must ensure the same version of dill is installed on all endpoints.
See Issue #155 for more information and PR #163, which addresses this issue.
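Because the same dill version must be present everywhere, it can help to print the installed version on each machine and compare the results. The following is a small sketch using only the standard library; the helper names `installed_version` and `is_affected_dill` are hypothetical and not part of TaPS, Parsl, or Globus Compute.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of ``package``, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def is_affected_dill(dill_version):
    """Return True for the dill release named in the warning (0.3.6)."""
    return dill_version == "0.3.6"


if __name__ == "__main__":
    found = installed_version("dill")
    print(f"dill version: {found}")
    if is_affected_dill(found):
        print("dill 0.3.6 detected; older TaPS versions may hit serialization errors.")
```

Running this script on the submitting host and on every endpoint gives a quick way to spot a version mismatch before launching the failures app.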
Data¶
Data requirements depend on the base application that failures are injected into.
Example¶
The base application name is specified using --app.base and the corresponding configuration must be provided as a JSON string to --app.config.
python -m taps.run --app failures \
--app.base cholesky \
--app.config '{"matrix_size": 100, "block_size": 50}' \
--app.failure-rate 0.5 --app.failure-type dependency \
--engine.executor process-pool --engine.executor.max-processes 4
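Hand-writing a JSON string inside shell quotes is easy to get wrong, so one option is to build the argument list programmatically. The sketch below reuses only the flags and values from the command above; the function `build_failures_command` is a hypothetical helper, not part of TaPS.

```python
import json
import shlex


def build_failures_command(base, base_config, failure_rate, failure_type):
    """Build the argument list for the failures app shown above."""
    return [
        "python", "-m", "taps.run",
        "--app", "failures",
        "--app.base", base,
        # json.dumps guarantees the base app config is valid JSON.
        "--app.config", json.dumps(base_config),
        "--app.failure-rate", str(failure_rate),
        "--app.failure-type", failure_type,
        "--engine.executor", "process-pool",
        "--engine.executor.max-processes", "4",
    ]


command = build_failures_command(
    base="cholesky",
    base_config={"matrix_size": 100, "block_size": 50},
    failure_rate=0.5,
    failure_type="dependency",
)

# shlex.join quotes each argument so the line can be pasted into a POSIX shell.
print(shlex.join(command))
```

The resulting list can also be passed directly to `subprocess.run(command)`, which avoids shell quoting altogether.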
Alternatively, the app can be configured using a TOML file.