Caching and retries¶
Reflow uses content-addressed hashing to skip tasks whose inputs haven't changed.
How it works¶
Each task instance gets an identity hash built from:
- the task name and version string
- the input parameters
- the output hashes of all upstream tasks
If a previous run produced a successful result with the same identity, reflow reuses it instead of submitting a new job.
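Conceptually, the identity hash behaves like the sketch below. This is illustrative only; reflow's actual hashing scheme and field names may differ.
import hashlib
import json

def identity_hash(name, version, params, upstream_output_hashes):
    # Illustrative only: combine the task name, version, parameters, and
    # the upstream output hashes into one stable digest.
    payload = json.dumps(
        {"name": name, "version": version, "params": params,
         "upstream": sorted(upstream_output_hashes)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()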
Cache invalidation¶
Bump the version field when your task logic changes:
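For example, assuming the version is passed as a keyword to the job decorator (the exact keyword name here is illustrative):
from pathlib import Path

@wf.job(version="2")  # bumped from "1": cached results for the old version no longer match
def process(source: str) -> Path:
    ...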
Everything downstream of the changed task is recomputed as well: its output hash changes, and that hash feeds into the identity hash of every downstream task.
Forcing recomputation¶
From the CLI:
$ python pipeline.py submit --run-dir /tmp/r --source data.csv --force
$ python pipeline.py submit --run-dir /tmp/r --source data.csv --force-tasks process
From Python:
# Force everything
run = wf.submit(run_dir="/tmp/r", force=True, source="data.csv")
# Force only specific tasks
run = wf.submit(run_dir="/tmp/r", force_tasks=["process"], source="data.csv")
The same works with run_local:
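For example, assuming run_local accepts the same force-related keyword arguments as submit:
run = wf.run_local(run_dir="/tmp/r", force_tasks=["process"], source="data.csv")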
Disabling the cache per task¶
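If a task should never be cached (for example, one that pulls live data), caching can typically be switched off on the task itself. The cache keyword below is an assumption; check the API reference for the exact parameter name.
@wf.job(cache=False)  # assumed keyword: always rerun this task, never reuse a cached result
def fetch_latest() -> str:
    # hypothetical body: output changes independently of the task's inputs
    return "/tmp/latest_snapshot.csv"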
Output verification¶
For tasks that return file paths, reflow can verify that cached files still exist on disk before reusing them:
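A minimal sketch, assuming verification is enabled with a verify flag on the decorator (the boolean form is an assumption; the callable form is shown further below):
from pathlib import Path

@wf.job(verify=True)  # assumed flag: re-check the cached output on disk before reusing it
def extract(source: str) -> Path:
    ...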
This checks Path.exists() for Path and list[Path] return types.
You can also supply a custom verification function:
from pathlib import Path

# Custom check: the cached file must be non-empty to count as a cache hit.
@wf.job(verify=lambda output: Path(output).stat().st_size > 0)
def create_file() -> str:
    return "/tmp/output.csv"
Retries¶
Failed or cancelled tasks can be retried without rerunning the whole pipeline:
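A sketch, assuming the handle returned by submit exposes a retry method (the method name is hypothetical):
run = wf.submit(run_dir="/tmp/r", source="data.csv")
# ... some tasks fail or are cancelled ...
run.retry()  # hypothetical: resubmits only the failed or cancelled tasks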
Or from the CLI:
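Assuming a retry subcommand alongside submit (hypothetical):
$ python pipeline.py retry --run-dir /tmp/r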
On retry, upstream cache hits are verified by default so stale intermediate files are detected.
The manifest store¶
Reflow keeps run metadata, task states, and cached outputs in a shared SQLite database at ~/.cache/reflow/manifest.db. This gives you cross-run cache reuse and a single place to inspect workflow history.
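Because it is a plain SQLite file, you can inspect it with the standard sqlite3 CLI; the table layout is a reflow internal, so treat whatever you find as an implementation detail:
$ sqlite3 ~/.cache/reflow/manifest.db ".tables"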
The path can be overridden:
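For example via an environment variable (the REFLOW_MANIFEST_DB name is an assumption; check the configuration reference for the real variable):
$ REFLOW_MANIFEST_DB=/data/shared/manifest.db python pipeline.py submit --run-dir /tmp/r --source data.csv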
Or in the config file:
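A sketch, assuming a TOML-style config with a manifest_db key (both the file format and the key name are assumptions):
# reflow config file (hypothetical format and key)
manifest_db = "/data/shared/manifest.db"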