Manual Test Runbook — L4: Automated Restore Drill
Owner: Sagar | Time: ~6 min (Parts A + B offline) / +20–40 min (optional Part C live drill) | Sandbox: snowops-sandbox-01
Overview
L4 proves recoverability end-to-end: it restores an L1 backup (or fails over an
L2 SQL failover group) into an ephemeral sandbox RG, validates the recovered
resource, tears the sandbox down, and records a versioned pass/fail
RestoreDrillReport. The report lands in compliance/restore-drills/ and the S2
dashboard surfaces it as a "DR restore drills" panel.
Three pieces:
apps/restore-drill/— the TS tool. Pure offline logic + a thin executor seam (DryRunExecutorfor tests/demos,AzureCliExecutorfor the liveazpath).- S2 additive panel —
apps/compliance-dashboardgains--restore-drills-dir; the drill's pass/fail reaches the dashboard through the evidence store. - L4 workflow (
.github/workflows/restore-drill.yml) — monthly cron; dispatch defaults to dry-run, schedule runs the live drill into the sandbox.
Parts A + B verify outcome classification, orchestration, and rendering offline, plus the S2 panel wiring. Part C runs a real drill in the sandbox.
Part A — Offline (no cloud, ~5 min)
A1. Build + typecheck + unit tests (L4)
Expect: 3 suites, 17 tests pass — outcome classification (pass/fail/partial,
RTO met/missed/not-evaluated), runDrill orchestration (validate skipped on
restore failure, teardown always runs), the diff/regression signal, and the
markdown/status renderers.
A2. Dry-run CLI — passing drill
node dist/index.js --spec examples/spec.vm.json \
--out-dir /tmp/l4pass --now 2026-06-01T00:00:00.000Z
cat /tmp/l4pass/status.json
Expect: stderr ... PASSED — recovered in 12.0m (RTO met — target 30m);
status.json has "outcome": "passed", "passed": true, "rtoMet": true.
A3. Dry-run CLI — failed restore is a gate
node dist/index.js --spec examples/spec.vm.json --script examples/script.fail.json \
--out-dir /tmp/l4fail --now 2026-06-01T00:00:00.000Z --fail-on-drill-failure true
echo "exit=$?"
Expect: exit 2; /tmp/l4fail/summary.md shows ❌ FAILED, validate
skipped, teardown still passed, and the RTO line reads "not evaluated".
Part B — S2 dashboard panel (offline, ~1 min)
B1. S2 still passes with the additive panel
Expect: 4 suites, 34 tests pass — including the unchanged HTML golden
file (the DR panel is gated on --restore-drills-dir, so the compliance-only
output is byte-for-byte identical) and the new restore-drills suite.
B2. Render a combined dashboard
node dist/index.js \
--snapshots-dir examples/snapshots \
--restore-drills-dir /tmp/l4pass \
--out-dir /tmp/s2dr --now 2026-05-30T00:00:00.000Z
grep -A4 "## DR restore drills" /tmp/s2dr/dashboard.md
grep -c restoreDrills /tmp/s2dr/dashboard.json
Expect: a "DR restore drills" table listing payments-prod-vm-monthly;
dashboard.json contains restoreDrills (count 1). Re-run without
--restore-drills-dir → no DR section, no restoreDrills key.
B3. Workflow lint
Part C — Live drill (sandbox, optional, ~20–40 min)
Prerequisites: an L1-backed protected item with at least one recovery point (or an L2 failover group), and an empty ephemeral sandbox RG tagged
ephemeral=true. PIM activated; identity has restore + RG-delete rights.
- Copy
apps/restore-drill/examples/spec.vm.jsonand editsourceVaultId,protectedItemId,sandboxResourceGroup(the ephemeral RG), andregionto point at real sandbox resources. Confirm the RG tag:
- Run the live drill (writes evidence + a dated report):
cd apps/restore-drill
node dist/index.js --executor azure --spec /path/to/your-spec.json \
--out-dir /tmp/l4live \
--evidence-dir "$PWD/../../compliance/restore-drills"
cat /tmp/l4live/summary.md
Expect: restore, validate, teardown steps with real durations; outcome
passed (or partial/failed with the captured az error). A dated report
appears under compliance/restore-drills/.
- Confirm teardown removed the sandbox resources:
For a sql-failover-group drill, confirm the primary role is back on the
original server (az sql failover-group show ... --query replicationRole).
Pass criteria
- Part A — L4 builds, typechecks; 17 tests pass; passing + failing dry runs behave (exit 2 on failure)
- Part B — S2 builds; 34 tests pass (golden unchanged); DR panel renders only with
--restore-drills-dir; workflow YAML parses - (Part C) live drill recovers + validates + tears down; dated report written to the evidence store; sandbox RG empty afterward
Teardown
If a live drill's teardown was interrupted, delete the ephemeral RG (it is
tagged ephemeral=true, so X7 would also reap it on the next nightly run):
Sign-off
- Tester: _ | Date: _ | Result: PASS / FAIL / N/A
- Notes: