FINJ is an open-source Python tool for fault injection targeted at HPC systems. It runs on Python 3.4 and above, without any platform restriction. FINJ can be integrated with other injection tools targeted at specific fault types, enabling users to coordinate faults from different sources and different system levels. FINJ also provides workload support, permitting users to specify lists of applications to be executed and faults to be triggered on multiple nodes at specific times and with specific durations. FINJ is a high-level, flexible tool that enables users to perform complex and reproducible experiments aimed at revealing the relations that may exist between faults, application behavior and the system itself.

Fault injection in FINJ is achieved through tasks that are executed on target nodes: each task corresponds to a particular application, which can be either a benchmark program or a fault-triggering program. This approach allows for great flexibility, as FINJ can be integrated with any type of low-level fault injection tool.

The process of fault injection in FINJ is orchestrated by two entities, which communicate through a simple message protocol via TCP sockets. A fault injection engine runs on each host that is the target of injection, and manages the execution of all tasks related to faults and benchmarks. A fault injection controller runs on a separate orchestrator host, and instructs engines on which tasks should be run, and when. Controllers also collect and store all output produced by engines.

Workloads in FINJ are structured as CSV files containing entries for tasks that must be executed at specific times and with specific durations. A particular execution of a FINJ workload constitutes an injection session.
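To make the workload idea concrete, the following is a minimal sketch of how a CSV workload of timed tasks could be read and queried. It is not FINJ's actual file format or API: the column names (timestamp, duration, is_fault, command), the sample commands, and the helper functions load_workload and active_at are all hypothetical and chosen only for illustration.

```python
import csv
import io
from collections import namedtuple

# Hypothetical task record; field names are illustrative, not FINJ's real schema.
Task = namedtuple("Task", ["timestamp", "duration", "is_fault", "command"])

# Hypothetical workload: one benchmark and two fault-triggering programs.
# Times and durations are in seconds relative to the start of the session.
SAMPLE_WORKLOAD = """timestamp,duration,is_fault,command
0,120,0,./benchmark_a
30,20,1,./cpu_fault
60,15,1,./mem_fault
"""


def load_workload(stream):
    """Parse a CSV workload into Task records sorted by start time."""
    reader = csv.DictReader(stream)
    tasks = [
        Task(
            timestamp=int(row["timestamp"]),
            duration=int(row["duration"]),
            is_fault=row["is_fault"] == "1",
            command=row["command"],
        )
        for row in reader
    ]
    return sorted(tasks, key=lambda task: task.timestamp)


def active_at(tasks, t):
    """Return tasks whose [start, start + duration) interval contains t."""
    return [
        task for task in tasks
        if task.timestamp <= t < task.timestamp + task.duration
    ]


tasks = load_workload(io.StringIO(SAMPLE_WORKLOAD))
faults = [task for task in tasks if task.is_fault]
```

Under these assumptions, active_at(tasks, 35) returns the benchmark together with the first fault program, which is the kind of overlap between application and fault that an injection session is meant to produce.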

