Manual Test Runbook — S1: Scheduled Terraform Drift Detection

Owner: Sagar  |  Time: ~6 min (offline) / ~20 min (with live plan)  |  Sandbox: snowops-sandbox-01

Overview

S1 detects configuration drift: a scheduled terraform plan against a stack whose committed config hasn't changed should be a no-op, so any proposed action means the live infrastructure has drifted. Two pieces:

  • apps/drift-detector/ — the TS tool (read-only: it plans, never applies).
  • S1 workflow (.github/workflows/drift-detection.yml) — daily cron that plans each stack, runs the tool, and opens/updates one GitHub Issue per drifted stack.

It builds on E0's substrate: a versioned JSON artifact (drift.json) plus a structured diff between two reports as the signal (diffReports, mirroring E0's diffSnapshots). Part A verifies the classifier, diff, and ticket logic offline; Part B lints the workflow file; Part C is the live scheduled plan.

Part A — Offline (no cloud, ~5 min)

A1. Build + typecheck

cd apps/drift-detector
npm ci
npm run typecheck
npm run build

A2. Unit tests

npm test

Expect: 3 suites, 18 tests pass (action classifier + report build + diff; ticket construction + GitHub upsert idempotency + dry-run; markdown rendering).

A3. Offline CLI — clean vs. drifted, plus the diff

# Clean plan → no drift, exit 0, no ticket:
node dist/index.js --stack sandbox --input examples/plan.clean.json \
  --out-dir /tmp/drift-clean

# Drifted plan → 3 resources, exit 2 with --fail-on-drift, dry-run ticket:
node dist/index.js --stack payments-prod --input examples/plan.drifted.json \
  --out-dir /tmp/drift-1 --fail-on-drift true ; echo "exit=$?"

# Re-run diffed against the first report (no new drift this time):
node dist/index.js --stack payments-prod --input examples/plan.drifted.json \
  --baseline /tmp/drift-1/drift.json --out-dir /tmp/drift-2

cat /tmp/drift-1/summary.md

Confirm:

  • /tmp/drift-clean/drift.json has drifted: false, summary.total: 0.
  • The drifted run prints DRIFT — 3 resource(s) and exits 2.
  • /tmp/drift-1/summary.md lists the storage-account update, the NSG-rule create, and the storage-container replace (the data-source read is excluded).
  • /tmp/drift-1/issue.md carries the <!-- snowops-drift:stack=payments-prod --> dedupe marker.
  • schemaVersion: "1.0" in every drift.json.
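
The classification these checks exercise can be sketched as follows. This is a minimal illustration, not the tool's actual code: the type and function names (ResourceChange, classify, summarize) are hypothetical, and every address except azurerm_storage_account.state is made up. It assumes the standard Terraform plan JSON convention that a replace appears as both "delete" and "create" in change.actions.

```typescript
// Sketch only: names here are hypothetical, not the drift-detector's real API.
type ResourceChange = {
  address: string;
  mode: "managed" | "data";
  change: { actions: string[] };
};

type DriftKind = "create" | "update" | "delete" | "replace";

// Returns the drift kind for one resource_changes entry, or null when it is not drift.
function classify(rc: ResourceChange): DriftKind | null {
  if (rc.mode !== "managed") return null; // data-source reads are excluded
  const has = (a: string): boolean => rc.change.actions.indexOf(a) !== -1;
  if (has("create") && has("delete")) return "replace"; // replace = delete + create
  if (has("create")) return "create";
  if (has("update")) return "update";
  if (has("delete")) return "delete";
  return null; // "no-op" and "read" propose no change
}

// Builds the drifted flag and total that the A3 checks look for.
function summarize(changes: ResourceChange[]) {
  const resources = changes
    .map((rc) => ({ address: rc.address, kind: classify(rc) }))
    .filter((r): r is { address: string; kind: DriftKind } => r.kind !== null);
  return { drifted: resources.length > 0, total: resources.length, resources };
}

// Mirrors the shape of the drifted example: update, create, replace, plus a data read.
const sample: ResourceChange[] = [
  { address: "azurerm_storage_account.state", mode: "managed", change: { actions: ["update"] } },
  { address: "azurerm_network_security_rule.example", mode: "managed", change: { actions: ["create"] } },
  { address: "azurerm_storage_container.example", mode: "managed", change: { actions: ["delete", "create"] } },
  { address: "data.azurerm_client_config.current", mode: "data", change: { actions: ["read"] } },
];
const result = summarize(sample);
const cleanResult = summarize([
  { address: "azurerm_storage_account.state", mode: "managed", change: { actions: ["no-op"] } },
]);
```

Here result counts three drifted resources and skips the data-source read, while cleanResult reports no drift, matching the two outcomes A3 asks you to confirm.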

A4. Diff-as-signal (newly-drifted detection)

# Baseline: only the storage-account update drifts.
cat > /tmp/base-plan.json <<'JSON'
{ "resource_changes": [
  { "address": "azurerm_storage_account.state", "type": "azurerm_storage_account",
    "name": "state", "mode": "managed", "change": { "actions": ["update"] } } ] }
JSON
node dist/index.js --stack sandbox --input /tmp/base-plan.json --out-dir /tmp/drift-base

# Current: the storage-account drift plus a NEW NSG rule and a NEW storage-container replace.
node dist/index.js --stack sandbox --input examples/plan.drifted.json \
  --baseline /tmp/drift-base/drift.json --out-dir /tmp/drift-cur
grep -A4 "Newly drifted" /tmp/drift-cur/summary.md

Confirm the ## Change since … section shows Drift worsened and the ### Newly drifted list includes the NSG rule + storage container.
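
The diff-as-signal idea in A4 can be sketched as a set comparison of drifted addresses between the baseline and current reports. This is an illustration under assumptions, not the real diffReports: the report fields other than schemaVersion and drifted, the function name diffReportsSketch, and every verdict label except "worsened" are hypothetical.

```typescript
// Sketch only: field and function names are assumptions, not the tool's real schema.
type DriftReport = {
  schemaVersion: string;
  stack: string;
  drifted: boolean;
  resources: { address: string; action: string }[];
};

type ReportDiff = {
  newlyDrifted: string[];
  resolved: string[];
  unchanged: string[];
  verdict: "worsened" | "improved" | "mixed" | "unchanged";
};

function diffReportsSketch(baseline: DriftReport, current: DriftReport): ReportDiff {
  const before = baseline.resources.map((r) => r.address);
  const after = current.resources.map((r) => r.address);
  // Present now but not in the baseline: new drift.
  const newlyDrifted = after.filter((a) => before.indexOf(a) === -1);
  // Present in the baseline but gone now: drift that was reconciled.
  const resolved = before.filter((a) => after.indexOf(a) === -1);
  const unchanged = after.filter((a) => before.indexOf(a) !== -1);
  let verdict: ReportDiff["verdict"] = "unchanged";
  if (newlyDrifted.length > 0 && resolved.length > 0) verdict = "mixed";
  else if (newlyDrifted.length > 0) verdict = "worsened";
  else if (resolved.length > 0) verdict = "improved";
  return { newlyDrifted, resolved, unchanged, verdict };
}

// Baseline: one drifted resource. Current: the same one plus two new ones (addresses made up).
const baseReport: DriftReport = {
  schemaVersion: "1.0", stack: "sandbox", drifted: true,
  resources: [{ address: "azurerm_storage_account.state", action: "update" }],
};
const curReport: DriftReport = {
  schemaVersion: "1.0", stack: "sandbox", drifted: true,
  resources: [
    { address: "azurerm_storage_account.state", action: "update" },
    { address: "azurerm_network_security_rule.example", action: "create" },
    { address: "azurerm_storage_container.example", action: "replace" },
  ],
};
const worsened = diffReportsSketch(baseReport, curReport);
const same = diffReportsSketch(curReport, curReport);
```

Diffing a report against itself yields no newly drifted resources, which is the A3 re-run case; diffing against the smaller baseline yields two, which is the A4 case.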

Part B — Workflow lint (~1 min)

# Run from the repository root (Part A left the shell in apps/drift-detector).
ruby -ryaml -e "YAML.load_file('.github/workflows/drift-detection.yml')"

Confirm the workflow carries: schedule cron + workflow_dispatch; permissions.issues: write; the detect job's stack matrix; and the Detect drift + file/update ticket step invoking dist/index.js --open-issue github.
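
The manual confirmation above could also be expressed as a small check over the parsed workflow. The sketch below assumes the YAML has already been parsed into a plain object by some YAML parser, and that permissions sits at the top level of the workflow; the function name lintDriftWorkflow and the sample object are hypothetical, and only the job name detect and the checked keys come from this runbook.

```typescript
// Sketch only: operates on an already-parsed workflow object; not part of the tool.
type Json = { [key: string]: any };

function lintDriftWorkflow(wf: Json): string[] {
  const problems: string[] = [];
  const on: Json = wf["on"] !== null && typeof wf["on"] === "object" ? wf["on"] : {};
  const schedule = on["schedule"];
  if (!Array.isArray(schedule) || !schedule.some((s) => s && typeof s.cron === "string")) {
    problems.push("missing schedule cron");
  }
  // workflow_dispatch usually has no value, so test for the key rather than its value.
  if (!("workflow_dispatch" in on)) problems.push("missing workflow_dispatch");
  if (!wf["permissions"] || wf["permissions"].issues !== "write") {
    problems.push("permissions.issues is not write");
  }
  const detect: Json | undefined = wf["jobs"] ? wf["jobs"].detect : undefined;
  if (!detect || !detect["strategy"] || !detect["strategy"].matrix) {
    problems.push("detect job has no stack matrix");
  }
  const steps: Json[] = detect && Array.isArray(detect["steps"]) ? detect["steps"] : [];
  const invokesTool = steps.some(
    (s) =>
      typeof s["run"] === "string" &&
      s["run"].indexOf("dist/index.js") !== -1 &&
      s["run"].indexOf("--open-issue github") !== -1
  );
  if (!invokesTool) problems.push("no step invokes dist/index.js --open-issue github");
  return problems;
}

// Hypothetical parsed workflow; the cron value and matrix contents are placeholders.
const goodWorkflow: Json = {
  on: { schedule: [{ cron: "0 6 * * *" }], workflow_dispatch: null },
  permissions: { issues: "write" },
  jobs: {
    detect: {
      strategy: { matrix: { stack: ["sandbox"] } },
      steps: [{ name: "Detect drift + file/update ticket", run: "node dist/index.js --open-issue github" }],
    },
  },
};
const okProblems = lintDriftWorkflow(goodWorkflow);
const badProblems = lintDriftWorkflow({ on: { workflow_dispatch: null }, jobs: {} });
```

An empty result means all four conditions hold; each missing piece adds one message, so the check reports every gap in a single pass.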

Part C — Live scheduled plan (~15 min, ~$0)

Prerequisite: the OIDC SP for the sandbox environment has Reader on the sandbox subscription, and the stack matrix backend_* values match the real sandbox backend.

  1. Trigger the workflow via workflow_dispatch (Actions → drift-detection → Run). With no drift, the run is green and files no issue.
  2. Seed drift out-of-band: in the Azure Portal, change one sandbox resource that Terraform manages (e.g. add a tag, or bump a storage account's min_tls_version).
  3. Re-run the workflow. Confirm:
       • The Detect drift + file/update ticket step reports DRIFT — N resource(s).
       • A GitHub Issue labelled drift is opened for sandbox with the resource table.
       • A drift-sandbox-<run_id> artifact is attached.
  4. Re-run again without fixing the drift → the same issue is updated (not duplicated). Verify only one open drift issue exists for the stack.
  5. Reconcile the drift (revert the portal change or re-apply via C1), re-run, and close the issue by hand.
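
The "updated, not duplicated" behaviour checked above is typically achieved by searching open issues for the per-stack dedupe marker before creating a new one. The sketch below shows that idea against an in-memory stand-in. IssueStore, memoryStore, and upsertDriftIssue are hypothetical names, not the tool's code or the GitHub API; only the marker format comes from this runbook.

```typescript
// Sketch only: IssueStore is a hypothetical stand-in for the GitHub Issues API.
type Issue = { number: number; state: "open" | "closed"; body: string };

interface IssueStore {
  listOpen(): Issue[];
  create(body: string): Issue;
  update(number: number, body: string): void;
}

// Marker format as it appears in issue.md.
const markerFor = (stack: string): string => "<!-- snowops-drift:stack=" + stack + " -->";

function upsertDriftIssue(
  store: IssueStore,
  stack: string,
  table: string
): { action: "created" | "updated"; number: number } {
  const marker = markerFor(stack);
  const body = marker + "\n" + table;
  // An open issue carrying this stack's marker is the one to update.
  const existing = store.listOpen().filter((i) => i.body.indexOf(marker) !== -1)[0];
  if (existing) {
    store.update(existing.number, body);
    return { action: "updated", number: existing.number };
  }
  return { action: "created", number: store.create(body).number };
}

// In-memory store so the sketch can run without any network access.
function memoryStore(): IssueStore & { issues: Issue[] } {
  const issues: Issue[] = [];
  return {
    issues,
    listOpen: (): Issue[] => issues.filter((i) => i.state === "open"),
    create: (body: string): Issue => {
      const issue: Issue = { number: issues.length + 1, state: "open", body };
      issues.push(issue);
      return issue;
    },
    update: (number: number, body: string): void => {
      for (const i of issues) if (i.number === number) i.body = body;
    },
  };
}

const store = memoryStore();
const first = upsertDriftIssue(store, "sandbox", "| resource | action |");
const second = upsertDriftIssue(store, "sandbox", "| resource | action | (re-run)");
const other = upsertDriftIssue(store, "payments-prod", "| resource | action |");
```

Because the lookup only considers open issues, closing the issue by hand after reconciliation means the next drift on that stack opens a fresh one under this sketch.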

Pass criteria

  • npm test green (3 suites, 18 tests)
  • Offline clean run: drifted: false, exit 0; drifted run: 3 resources, exit 2 (A3)
  • Diff shows the newly-drifted resources against a baseline (A4)
  • drift-detection.yml parses with issues: write + the matrix + the detect step (B)
  • (Live) seeded drift opens one drift issue; re-run updates it in place (C)

Teardown

  • Offline: rm -rf /tmp/drift-clean /tmp/drift-1 /tmp/drift-2 /tmp/drift-base /tmp/drift-cur /tmp/base-plan.json.
  • Live: revert the seeded portal change; close the drift issue; artifacts auto-expire (30-day retention).

Sign-off

  • Tester: Sagar Chhabra  |  Date: 3/6/2026  |  Result: PASS
  • Notes: Only the offline parts were run; Part C (live) was not executed.