feat: make cleanup production-safe by separating invalid backups from infra errors

- Delete only truly invalid backups (dirval rc=1)
- Treat timeouts and missing dirval as infrastructure errors
- Never auto-delete backups affected by timeouts
- Return exit code 1 on infrastructure problems
- Update unit and E2E tests to reflect new safety semantics
- Align README with new deletion and exit-code behavior

https://chatgpt.com/share/695d36f6-0000-800f-98e7-f88a798d6e91
This commit is contained in:
2026-01-06 17:23:05 +01:00
parent 9e67392bd6
commit 838286c54e
4 changed files with 131 additions and 60 deletions

105
README.md
View File

@@ -7,21 +7,24 @@
**Repository:** https://github.com/kevinveenbirkenbach/cleanup-failed-backups
`cleanback` validates and (optionally) cleans up **failed Docker backup directories**.
It scans backup folders under a configurable backups root (e.g. `/Backups`), uses `dirval` to validate each subdirectory, and lets you delete the ones that fail validation.
`cleanback` validates and (optionally) cleans up **failed Docker backup directories** in a **production-safe** way.
Validation runs **in parallel** for performance; deletions are controlled and can be **interactive** or **fully automatic**.
It scans backup folders under a configurable backups root (for example `/Backups`), uses `dirval` to validate each backup subdirectory, and removes **only those backups that are confirmed to be invalid**.
Validation runs **in parallel** for performance; deletions are **explicitly controlled** and can be interactive or fully automated.
---
## ✨ Highlights
- **Parallel validation** of backup subdirectories
- Uses **`dirval`** (directory-validator) via CLI
- **Interactive** or **non-interactive** deletion flow (`--yes`)
- Uses **`dirval`** (directory validator) via CLI
- **Safe deletion model**: only truly invalid backups are removed
- **Interactive** or **non-interactive** cleanup (`--yes`)
- Supports validating a single backup **ID** or **all** backups
- Clear **exit code semantics** for CI and system services
- Clean **Python package** with `pyproject.toml`
- **Unit + Docker-based E2E tests**
- **Unit tests** and **Docker-based E2E tests**
---
@@ -31,7 +34,7 @@ Validation runs **in parallel** for performance; deletions are controlled and ca
```bash
pip install cleanback
````
```
This installs:
@@ -51,7 +54,7 @@ pip install -e .
## 🔧 Requirements
* Python **3.8+**
* Access to the backups root directory tree (e.g. `/Backups`)
* Read/write access to the backups root directory tree (e.g. `/Backups`)
* `dirval` (installed automatically via pip dependency)
---
@@ -66,6 +69,8 @@ After installation, the command is:
cleanback
```
---
### Validate a single backup ID
```bash
@@ -78,6 +83,8 @@ Validates directories under:
/Backups/<ID>/backup-docker-to-local/*
```
---
### Validate all backups
```bash
@@ -92,50 +99,78 @@ Scans:
---
### Common options
## ⚙️ Common options
| Option | Description |
| -------------------- | ------------------------------------------------------------------ |
| `--dirval-cmd <cmd>` | Path or name of `dirval` executable (default: `dirval`) |
| `--workers <n>` | Parallel workers (default: CPU count, min 2) |
| `--timeout <sec>` | Per-directory validation timeout (float supported, default: 300.0) |
| `--yes` | Non-interactive mode: delete failures automatically |
| `--force-keep <n>` | In `--all` mode: skip the last *n* backup folders (default: 0) |
| Option | Description |
| -------------------- | ------------------------------------------------------------------------------------- |
| `--dirval-cmd <cmd>` | Path or name of `dirval` executable (default: `dirval`) |
| `--workers <n>` | Number of parallel validator workers (default: CPU count, minimum 2) |
| `--timeout <sec>` | Per-directory validation timeout in seconds (float supported, default: `300.0`) |
| `--yes` | Non-interactive mode: automatically delete **invalid** backups (dirval rc=1 only) |
| `--force-keep <n>` | In `--all` mode: skip the last *n* timestamp subdirectories inside each backup folder |
> **Note:** Backups affected by timeouts or infrastructure errors are **never deleted automatically**, even when `--yes` is used.
---
### Examples
## 🧪 Examples
```bash
# Validate a single backup and prompt on failures
# Validate a single backup and prompt before deleting invalid ones
cleanback --backups-root /Backups --id 2024-09-01T12-00-00
# Validate everything with 8 workers and auto-delete failures
cleanback --backups-root /Backups --all --workers 8 --yes
# Use a custom dirval binary and short timeout
cleanback --backups-root /Backups --all --dirval-cmd /usr/local/bin/dirval --timeout 5.0
```
---
## 🧪 Tests
### Run all tests
```bash
# Validate all backups and automatically delete invalid ones
cleanback --backups-root /Backups --all --workers 8 --yes
```
```bash
make test
# Use a custom dirval binary and a short timeout (testing only)
cleanback \
--backups-root /Backups \
--all \
--dirval-cmd /usr/local/bin/dirval \
--timeout 5.0
```
---
## 🔒 Safety & Design Notes
* **No host filesystem is modified** during tests
(E2E tests run exclusively inside Docker)
* Deletions are **explicitly confirmed** unless `--yes` is used
* Timeouts are treated as **validation failures**
* Validation and deletion phases are **clearly separated**
* **Validation and deletion are strictly separated**
* Only backups explicitly marked **invalid by `dirval`** are eligible for deletion
* **Timeouts and infrastructure errors are NOT treated as invalid backups**
* Backups affected by timeouts are **never deleted automatically**
* Infrastructure problems (timeouts, missing `dirval`) cause a **non-zero exit code**
* Deletions require confirmation unless `--yes` is specified
* Tests never touch the host filesystem (E2E tests run inside Docker only)
This design makes `cleanback` safe for unattended operation on production systems.
---
## 🚦 Exit codes
`cleanback` uses exit codes to clearly distinguish between backup issues and infrastructure problems:
| Exit code | Meaning |
| --------- | ------------------------------------------------------------------ |
| `0` | All backups valid, or invalid backups were successfully removed |
| `1` | Validation infrastructure problem (e.g. timeout, missing `dirval`) |
| `2` | CLI usage or configuration error |
This makes the tool suitable for **CI pipelines**, **systemd services**, and other automation.
---
## 🧪 Tests
Run all tests (unit + Docker-based E2E):
```bash
make test
```
---

View File

@@ -277,6 +277,19 @@ def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
return parser.parse_args(argv)
def _is_timeout(res: ValidationResult) -> bool:
return res.returncode == 124 or "timed out" in (res.stderr or "").lower()
def _is_dirval_missing(res: ValidationResult) -> bool:
return res.returncode == 127 or "not found" in (res.stderr or "").lower()
def _is_invalid(res: ValidationResult) -> bool:
# dirval: 0 = ok, 1 = invalid, others = infra errors (timeout/missing/etc.)
return res.returncode == 1
def main(argv: Optional[List[str]] = None) -> int:
args = parse_args(argv)
@@ -296,18 +309,43 @@ def main(argv: Optional[List[str]] = None) -> int:
return 0
results = parallel_validate(subdirs, args.dirval_cmd, args.workers, args.timeout)
failures = [r for r in results if not r.ok]
if not failures:
invalids = [r for r in results if _is_invalid(r)]
timeouts = [r for r in results if _is_timeout(r)]
missing = [r for r in results if _is_dirval_missing(r)]
deleted = 0
if invalids:
print(f"\n{len(invalids)} directory(ies) are invalid (dirval rc=1).")
deleted = process_deletions(invalids, assume_yes=args.yes)
ok_count = sum(1 for r in results if r.ok)
if timeouts or missing:
print("\nERROR: validation infrastructure problem detected.")
if timeouts:
print(f"- timeouts: {len(timeouts)} (will NOT delete these)")
for r in timeouts[:10]:
print(f" timeout: {r.subdir}")
if len(timeouts) > 10:
print(f" ... (+{len(timeouts) - 10} more)")
if missing:
print(f"- dirval missing: {len(missing)} (will NOT delete these)")
for r in missing[:10]:
print(f" missing: {r.subdir}")
if len(missing) > 10:
print(f" ... (+{len(missing) - 10} more)")
print(
f"\nSummary: deleted={deleted}, invalid={len(invalids)}, ok={ok_count}, timeouts={len(timeouts)}, missing={len(missing)}"
)
return 1
if not invalids:
print("\nAll directories validated successfully. No action required.")
return 0
print(f"\n{len(failures)} directory(ies) failed validation.")
deleted = process_deletions(failures, assume_yes=args.yes)
kept = len(failures) - deleted
print(
f"\nSummary: deleted={deleted}, kept={kept}, ok={len(results) - len(failures)}"
)
print(f"\nSummary: deleted={deleted}, invalid={len(invalids)}, ok={ok_count}")
return 0

View File

@@ -121,13 +121,6 @@ class CleanbackE2EDockerTests(unittest.TestCase):
env["PATH"] = f"{self.bin_dir}:{env.get('PATH', '')}"
# Run: python -m cleanback --id <ID> --yes
# We must point BACKUPS_ROOT to our run_root. Easiest: set /Backups = run_root
# But code currently has BACKUPS_ROOT = /Backups constant.
#
# Therefore, we create our test tree under /Backups (done above) and pass --id
# relative to that structure by using run_root/<ID>. To do that, we make
# run_root the direct child under /Backups, then we pass the composite id:
# "<run-folder>/<ID>".
composite_id = f"{self.run_root.name}/{self.backup_id}"
cmd = [
@@ -148,14 +141,19 @@ class CleanbackE2EDockerTests(unittest.TestCase):
]
proc = subprocess.run(cmd, text=True, capture_output=True, env=env)
self.assertEqual(proc.returncode, 0, msg=proc.stderr or proc.stdout)
# New behavior:
# - invalid dirs are deleted and do NOT cause failure
# - timeouts are treated as infrastructure problems -> exit code 1 and NOT deleted
self.assertEqual(proc.returncode, 1, msg=proc.stderr or proc.stdout)
self.assertTrue(self.good.exists(), "good should remain")
self.assertFalse(self.bad.exists(), "bad should be deleted")
self.assertFalse(
self.assertTrue(
self.timeout.exists(),
"timeout should be deleted (timeout treated as failure)",
"timeout should NOT be deleted (timeouts are infrastructure problems)",
)
self.assertIn("Summary:", proc.stdout)
self.assertIn("validation infrastructure problem", proc.stdout.lower())
if __name__ == "__main__":

View File

@@ -123,12 +123,12 @@ class CleanupBackupsUsingDirvalTests(unittest.TestCase):
"--yes",
]
)
self.assertEqual(rc, 0, msg=err or out)
self.assertEqual(rc, 1, msg=err or out)
self.assertTrue(self.goodA.exists(), "goodA should remain")
self.assertFalse(self.badB.exists(), "badB should be deleted")
self.assertFalse(
self.assertTrue(
self.timeoutC.exists(),
"timeoutC should be deleted (timeout treated as failure)",
"timeoutC should NOT be deleted (timeout is infra error)",
)
self.assertIn("Summary:", out)
@@ -147,10 +147,10 @@ class CleanupBackupsUsingDirvalTests(unittest.TestCase):
"--yes",
]
)
self.assertEqual(rc, 0, msg=err or out)
self.assertEqual(rc, 1, msg=err or out)
self.assertTrue(self.goodA.exists())
self.assertFalse(self.badB.exists())
self.assertFalse(self.timeoutC.exists())
self.assertTrue(self.timeoutC.exists())
self.assertTrue(self.goodX.exists())
self.assertFalse(self.badY.exists())
@@ -198,8 +198,8 @@ class CleanupBackupsUsingDirvalTests(unittest.TestCase):
"--yes",
]
)
self.assertEqual(rc, 0, msg=err or out)
self.assertIn("dirval not found", out + err)
self.assertEqual(rc, 1, msg=err or out)
self.assertIn("dirval missing", out + err)
def test_no_targets_message(self):
empty = self.backups_root / "EMPTY" / "backup-docker-to-local"