Manual Test Runbook — L4: Automated Restore Drill

Owner: Sagar  |  Time: ~6 min (Parts A + B offline) / +20–40 min (optional Part C live drill)  |  Sandbox: snowops-sandbox-01

Overview

L4 proves recoverability end-to-end: it restores an L1 backup (or fails over an L2 SQL failover group) into an ephemeral sandbox RG, validates the recovered resource, tears the sandbox down, and records a versioned pass/fail RestoreDrillReport. The report lands in compliance/restore-drills/ and the S2 dashboard surfaces it as a "DR restore drills" panel.

Three pieces:

  • apps/restore-drill/ — the TS tool. Pure offline logic + a thin executor seam (DryRunExecutor for tests/demos, AzureCliExecutor for the live az path).
  • S2 additive panel — apps/compliance-dashboard gains --restore-drills-dir; the drill's pass/fail reaches the dashboard through the evidence store.
  • L4 workflow (.github/workflows/restore-drill.yml) — monthly cron; dispatch defaults to dry-run, schedule runs the live drill into the sandbox.
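The executor seam in the first bullet can be pictured roughly as follows. This is a minimal sketch, not the tool's actual code: the names StepName, StepResult, Executor and ScriptedExecutor are hypothetical, and the interface is synchronous here only for brevity (a live az-backed executor would more plausibly be asynchronous).

```typescript
// Hypothetical shapes for illustration; the real tool's types may differ.
type StepName = "restore" | "validate" | "teardown";

interface StepResult {
  ok: boolean;
  durationMs: number;
  error?: string;
}

// The seam: orchestration code depends only on this interface, so a
// scripted executor and a cloud-backed executor are interchangeable.
interface Executor {
  run(step: StepName): StepResult;
}

// Stand-in for a dry-run executor: replays scripted results and lets
// every unscripted step succeed instantly, so no cloud call is made.
class ScriptedExecutor implements Executor {
  private readonly script: Partial<Record<StepName, StepResult>>;

  constructor(script: Partial<Record<StepName, StepResult>> = {}) {
    this.script = script;
  }

  run(step: StepName): StepResult {
    return this.script[step] ?? { ok: true, durationMs: 0 };
  }
}
```

Constructing the executor with a failing restore entry plays the same role as a failure script: the orchestration sees a failed step without anything being restored.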

Parts A + B verify outcome classification, orchestration, and rendering offline, plus the S2 panel wiring. Part C runs a real drill in the sandbox.

Part A — Offline (no cloud, ~5 min)

A1. Build + typecheck + unit tests (L4)

cd apps/restore-drill
npm ci
npm run typecheck
npm run build
npm test

Expect: 3 suites, 17 tests pass — outcome classification (pass/fail/partial, RTO met/missed/not-evaluated), runDrill orchestration (validate skipped on restore failure, teardown always runs), the diff/regression signal, and the markdown/status renderers.
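For orientation, the outcome classification those suites exercise might look like the sketch below. All names are hypothetical, and the rule that "partial" means recovery succeeded but teardown failed is an assumption; the runbook only names the three outcomes and the three RTO states.

```typescript
// Hypothetical names; a sketch of the classification, not the tool's code.
type Outcome = "passed" | "partial" | "failed";
type RtoStatus = "met" | "missed" | "not-evaluated";

interface StepFlags {
  restoreOk: boolean;
  validateOk: boolean; // false when validation failed or was skipped
  teardownOk: boolean;
}

function classifyOutcome(s: StepFlags): Outcome {
  if (!s.restoreOk || !s.validateOk) return "failed";
  // ASSUMPTION: recovery worked but cleanup did not counts as "partial".
  return s.teardownOk ? "passed" : "partial";
}

function classifyRto(
  s: StepFlags,
  recoveryMinutes: number,
  targetMinutes?: number,
): RtoStatus {
  // Without a successful restore or a target there is nothing to compare.
  if (!s.restoreOk || targetMinutes === undefined) return "not-evaluated";
  return recoveryMinutes <= targetMinutes ? "met" : "missed";
}
```

Under these assumptions, a 12-minute recovery against a 30-minute target classifies as "met", and a failed restore yields "not-evaluated" regardless of the timings.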

A2. Dry-run CLI — passing drill

node dist/index.js --spec examples/spec.vm.json \
  --out-dir /tmp/l4pass --now 2026-06-01T00:00:00.000Z
cat /tmp/l4pass/status.json

Expect: stderr ... PASSED — recovered in 12.0m (RTO met — target 30m); status.json has "outcome": "passed", "passed": true, "rtoMet": true.
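If the three status.json fields need to be checked in a script rather than by eye, a small helper along these lines would do. It is a sketch: isCleanPass is a hypothetical name, and only the outcome, passed and rtoMet fields quoted above are assumed to exist in the file.

```typescript
// Only the three fields named in this step are assumed; any other
// content of status.json is ignored.
interface DrillStatusFields {
  outcome: string;
  passed: boolean;
  rtoMet: boolean | null;
}

// Returns true only when all three fields carry the expected values.
function isCleanPass(rawJson: string): boolean {
  const s = JSON.parse(rawJson) as Partial<DrillStatusFields>;
  return s.outcome === "passed" && s.passed === true && s.rtoMet === true;
}
```

The helper takes the file contents as a string so that it stays free of file-system calls; the caller reads /tmp/l4pass/status.json however it prefers.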

A3. Dry-run CLI — failed restore is a gate

node dist/index.js --spec examples/spec.vm.json --script examples/script.fail.json \
  --out-dir /tmp/l4fail --now 2026-06-01T00:00:00.000Z --fail-on-drill-failure true
echo "exit=$?"

Expect: exit 2; /tmp/l4fail/summary.md shows ❌ FAILED, validate skipped, teardown still passed, and the RTO line reads "not evaluated".
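The behaviour expected here (validate skipped after a failed restore, teardown still run, exit 2 when gating is on) can be sketched as below. The function and type names are hypothetical, and treating every non-gated run as exit 0 is an assumption; the runbook only states the exit 2 case.

```typescript
type StepName = "restore" | "validate" | "teardown";
type StepStatus = "passed" | "failed" | "skipped";
type RunStep = (step: StepName) => boolean; // true means the step succeeded

function runDrill(run: RunStep): Record<StepName, StepStatus> {
  const status: Record<StepName, StepStatus> = {
    restore: "skipped",
    validate: "skipped",
    teardown: "skipped",
  };
  try {
    status.restore = run("restore") ? "passed" : "failed";
    // Validation only makes sense against a recovered resource.
    if (status.restore === "passed") {
      status.validate = run("validate") ? "passed" : "failed";
    }
  } finally {
    // Teardown runs whatever happened above, so the sandbox is not left behind.
    status.teardown = run("teardown") ? "passed" : "failed";
  }
  return status;
}

function exitCode(
  status: Record<StepName, StepStatus>,
  failOnDrillFailure: boolean,
): number {
  const drillFailed = status.restore !== "passed" || status.validate !== "passed";
  // Exit 2 is taken from this runbook; 0 for all other cases is an ASSUMPTION.
  return drillFailed && failOnDrillFailure ? 2 : 0;
}
```

Putting teardown in a finally block is one simple way to get the "teardown always runs" guarantee, including when a step throws instead of returning false.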

Part B — S2 dashboard panel (offline, ~1 min)

B1. S2 still passes with the additive panel

cd ../compliance-dashboard
npm ci && npm run build && npm test

Expect: 4 suites, 34 tests pass — including the unchanged HTML golden file (the DR panel is gated on --restore-drills-dir, so the compliance-only output is byte-for-byte identical) and the new restore-drills suite.
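The golden file can stay byte-for-byte identical because the panel is purely additive. A sketch of that gating, with hypothetical names and a hypothetical two-column table layout (only the "## DR restore drills" heading is taken from this runbook):

```typescript
interface DrillRow {
  name: string;
  outcome: string;
}

interface RenderInput {
  complianceMarkdown: string;
  // Left undefined when --restore-drills-dir is not passed.
  restoreDrills?: DrillRow[];
}

function renderDashboard(input: RenderInput): string {
  if (input.restoreDrills === undefined) {
    // No flag: return the compliance output untouched so golden files still match.
    return input.complianceMarkdown;
  }
  const rows = input.restoreDrills.map((d) => `| ${d.name} | ${d.outcome} |`);
  return [
    input.complianceMarkdown,
    "",
    "## DR restore drills",
    "",
    "| Drill | Outcome |", // column names are an ASSUMPTION
    "| --- | --- |",
    ...rows,
  ].join("\n");
}
```

The design point is that the no-flag branch returns the input string itself, so nothing, not even trailing whitespace, differs from the compliance-only output.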

B2. Render a combined dashboard

node dist/index.js \
  --snapshots-dir examples/snapshots \
  --restore-drills-dir /tmp/l4pass \
  --out-dir /tmp/s2dr --now 2026-05-30T00:00:00.000Z
grep -A4 "## DR restore drills" /tmp/s2dr/dashboard.md
grep -c restoreDrills /tmp/s2dr/dashboard.json

Expect: a "DR restore drills" table listing payments-prod-vm-monthly; dashboard.json contains restoreDrills (count 1). Re-run without --restore-drills-dir → no DR section, no restoreDrills key.

B3. Workflow lint

cd ../..
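ruby -ryaml -e "YAML.load_file('.github/workflows/restore-drill.yml')" && echo OK

Expect: OK is printed. A YAML syntax error makes ruby exit non-zero, so OK is not printed. This only confirms that the file parses; it does not validate the workflow schema.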

Part C — Live drill (sandbox, optional, ~20–40 min)

Prerequisites: an L1-backed protected item with at least one recovery point (or an L2 failover group), and an empty ephemeral sandbox RG tagged ephemeral=true. PIM activated; identity has restore + RG-delete rights.

  1. Copy apps/restore-drill/examples/spec.vm.json and edit sourceVaultId, protectedItemId, sandboxResourceGroup (the ephemeral RG), and region to point at real sandbox resources. Confirm the RG tag:
az group show -n <sandbox-rg> --query tags -o json   # expect {"ephemeral":"true"}
  2. Run the live drill (writes evidence + a dated report):
cd apps/restore-drill
node dist/index.js --executor azure --spec /path/to/your-spec.json \
  --out-dir /tmp/l4live \
  --evidence-dir "$PWD/../../compliance/restore-drills"
cat /tmp/l4live/summary.md

Expect: restore, validate, teardown steps with real durations; outcome passed (or partial/failed with the captured az error). A dated report appears under compliance/restore-drills/.

  3. Confirm teardown removed the sandbox resources:
az resource list --resource-group <sandbox-rg> --query "length(@)" -o tsv   # expect 0

For a sql-failover-group drill, confirm the primary role is back on the original server (az sql failover-group show ... --query replicationRole).
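The tag check and the empty-RG check in the steps above can also be expressed as pure functions over the captured az output, which is handy if the checks are ever automated. This is a sketch with hypothetical function names; it assumes the commands are run exactly as written in this runbook, with -o json and -o tsv respectively.

```typescript
// Input: stdout of `az group show -n <sandbox-rg> --query tags -o json`.
// An untagged resource group yields the JSON literal null.
function isEphemeral(tagsJson: string): boolean {
  const tags = JSON.parse(tagsJson) as Record<string, string> | null;
  return tags !== null && tags["ephemeral"] === "true";
}

// Input: stdout of `az resource list ... --query "length(@)" -o tsv`.
function isEmptyResourceGroup(countTsv: string): boolean {
  return countTsv.trim() === "0";
}
```

Both functions take the command output as a string, so they can be unit-tested offline without an Azure login.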

Pass criteria

  • Part A — L4 builds, typechecks; 17 tests pass; passing + failing dry runs behave (exit 2 on failure)
  • Part B — S2 builds; 34 tests pass (golden unchanged); DR panel renders only with --restore-drills-dir; workflow YAML parses
  • (Part C) live drill recovers + validates + tears down; dated report written to the evidence store; sandbox RG empty afterward

Teardown

If a live drill's teardown was interrupted, delete the ephemeral RG (it is tagged ephemeral=true, so X7 would also reap it on the next nightly run):

az group delete --name <sandbox-rg> --yes --no-wait

Sign-off

  • Tester: _  |  Date: _  |  Result: PASS / FAIL / N/A
  • Notes: