Runbook Testing Sequence & Sign-Off Order
Last Updated: 2026-06-04
Status: 5 of 44 runbooks signed off (D1, D3, X3, C4, R1)
Overview
This document defines the canonical order for executing manual test runbooks across all SnowOps assets. The sequence respects dependency flows: foundational tools before consumers, detection before visibility, and setup before ops.
Each runbook takes ~5–15 min to complete. Total time to full sign-off: ~6–8 hours over multiple sessions.
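The dependency rule above can be checked mechanically. The following is a minimal sketch, not project tooling: `SEQUENCE` and `order_violations` are hypothetical names, and the data is transcribed by hand from the Dependencies columns of Phases 1 to 3 below.

```python
# Hypothetical encoding of the first eight rows of the master sequence:
# (asset, direct dependencies), listed in execution order.
SEQUENCE = [
    ("V2", []),
    ("V3", []),
    ("E0", ["V2", "V3"]),
    ("S1", ["E0"]),
    ("S2", ["E0"]),
    ("L1", []),
    ("L2", ["L1"]),
    ("L4", ["L1", "L2"]),
]


def order_violations(sequence):
    """Return (asset, dependency) pairs where the dependency is not tested earlier."""
    seen = set()
    violations = []
    for asset, deps in sequence:
        for dep in deps:
            if dep not in seen:
                violations.append((asset, dep))
        seen.add(asset)
    return violations


if __name__ == "__main__":
    # An empty list means every asset comes after all of its dependencies.
    print(order_violations(SEQUENCE))
```

Extending `SEQUENCE` with the remaining rows would give a quick consistency check whenever the order is edited.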
Master Sequence
Phase 1: Visualization & Detection (25 min)
Foundation: outputs generation, visualization, and anomaly detection.
| Order | Asset | Name | Dependencies | Time | Notes |
|---|---|---|---|---|---|
| 1 | V2 | Architecture Diagrams | None (offline) | 10m | Golden-file test; no cloud auth required |
| 2 | V3 | Runbook Generator | None (offline) | 5m | Markdown generation from TF outputs |
| 3 | E0 | Evidence Collector | V2, V3 (use outputs) | 10m | Snapshot + compliance evidence |
Rationale: V2 and V3 are 100% offline; they generate client-facing artifacts from TF outputs (F0 contracts). E0 consumes their outputs, and evidence collection must be working before the compliance dashboard (S2) can be validated.
Phase 2: Detection & Visibility (15 min)
Anomaly detection and compliance visibility.
| Order | Asset | Name | Dependencies | Time | Notes |
|---|---|---|---|---|---|
| 4 | S1 | Drift Detector | E0 (evidence format) | 10m | Detects delta between plan + state → ticket |
| 5 | S2 | Compliance Dashboard | E0 (snapshots) | 5m | Displays evidence history + L4 DR panel |
Rationale: S1 detects drift; S2 visualizes compliance. Both depend on E0 generating snapshots.
Phase 3: Reliability & Disaster Recovery (20 min)
Backup, replication, and failover.
| Order | Asset | Name | Dependencies | Time | Notes |
|---|---|---|---|---|---|
| 6 | L1 | Backup Policy | None | 5m | Defines backup schedule, retention |
| 7 | L2 | Cross-Region Replication | L1 (backup foundation) | 8m | Replicates backed-up state across regions |
| 8 | L4 | Restore Drill Automation | L1, L2 (backups in place) | 10m | Restore → validate → teardown → report |
Rationale: L1 establishes backup baseline; L2 adds geographic redundancy; L4 validates the entire restore chain end-to-end.
Phase 4: Bootstrap & Registry (10 min)
Client setup and module management.
| Order | Asset | Name | Dependencies | Time | Notes |
|---|---|---|---|---|---|
| 9 | B6 | Client Bootstrap | None | 5m | Prerequisite checker + permission validator |
| 10 | F11 | Module Registry | B6 (client env validated) | 5m | Version manifest + pin audit |
Rationale: B6 validates the client environment before we attempt any module operations. F11 depends on a clean environment.
Phase 5: Sandbox Cleanup & Utilities (5 min)
| Order | Asset | Name | Dependencies | Time | Notes |
|---|---|---|---|---|---|
| 11 | X7 | Ephemeral RG Cleanup | None | 5m | Nightly cleanup of sandbox resources |
Rationale: Sandbox hygiene; can run in parallel with other work.
Phase 6: Cost & Identity Utilities (10 min)
| Order | Asset | Name | Dependencies | Time | Notes |
|---|---|---|---|---|---|
| 12 | U1 | Utility: Cost export | B6 (bootstrap) | 5m | Pull cost data for billing |
| 13 | U2 | Utility: Identity policies | B6 (bootstrap) | 5m | Generate AAD policy reports |
Rationale: U1 and U2 both depend on B6 having validated the client environment.
Phase 7: Azure Networking (20 min)
Foundational cloud infrastructure modules.
| Order | Asset | Name | Dependencies | Time | Notes |
|---|---|---|---|---|---|
| 14 | N5 | Network Security Groups | None | 8m | F-module: NSG rules + test import block |
| 15 | N6 | Route Tables | N5 (NSG foundation) | 8m | F-module: routes + test import block |
Rationale: Basic network foundation before cluster/workload modules.
Phase 8: Core Infrastructure Modules (54 min)
F-series (Terraform modules) + monitoring.
| Order | Asset | Name | Dependencies | Time | Notes |
|---|---|---|---|---|---|
| 16 | M1 | Resource Group | None | 5m | F-module: RG + tags |
| 17 | M2 | Monitoring (Log Analytics) | M1 (RG) | 8m | F-module: LA workspace + retention |
| 18 | M3 | Alert Rules | M2 (LA workspace) | 8m | F-module: metric + log alerts |
| 19 | M6 | Managed Grafana | M2, M3 (LA, alerts) | 10m | F-module: Grafana instance + datasources |
| 20 | J1 | Identity: Service Principal Registry | M1 (RG) | 5m | F-module: SP credential rotation automation |
| 21 | J2 | Identity: AAD Roles | M1 (RG) | 8m | F-module: custom roles + assignments |
| 22 | J6 | Identity: Workload Identity Bindings | M1, J1, J2 (identity foundation) | 10m | F-module: pod identity → storage/KV bindings |
Rationale: M1 creates the RG; M2/M3/M6 build the observability stack; J1/J2/J6 establish the identity foundation for workloads.
Phase 9: Advanced Infrastructure (30 min)
Credential rotation, automation, and GitOps.
| Order | Asset | Name | Dependencies | Time | Notes |
|---|---|---|---|---|---|
| 23 | H5 | Service Principal Rotation | J1, J2 (identity) | 8m | F-module: automated SP credential rotation |
| 24 | H7 | Azure Automation Account | M1 (RG) | 5m | F-module: automation account for runbooks |
| 25 | F8 | ArgoCD GitOps | J1, J2, J6 (identity) + M1, M2 (monitoring) | 15m | Helm + kyverno policies + app-of-apps |
Rationale: H5 and H7 are identity and automation ops. F8 (GitOps) depends on identity and monitoring being in place.
Phase 10: Compute & Storage Modules (63 min)
| Order | Asset | Name | Dependencies | Time | Notes |
|---|---|---|---|---|---|
| 26 | B5 | AKS Private Cluster | M1, N5, N6, J1, J2, J6 (all foundation) | 15m | F-module: AKS + private link + workload identity |
| 27 | B2 | Container Registry | M1, B5 (RG + AKS) | 8m | F-module: ACR + network rules + purge policy |
| 28 | C3 | Azure Container Insights | M2, B5 (LA + AKS) | 8m | F-module: AKS monitoring → LA workspace |
| 29 | C2 | Key Vault | M1, J1, J2, J6 (RG + identity) | 8m | F-module: KV + RBAC + network rules |
| 30 | H1 | Storage Account | M1, C2 (RG + KV) | 8m | F-module: storage + encryption + access tiers |
| 31 | H2 | Cosmos DB | M1, C2 (RG + KV) | 8m | F-module: Cosmos + encryption + network rules |
| 32 | H3 | SQL Database | M1, C2 (RG + KV) | 8m | F-module: SQL + encryption + audit logging |
Rationale: B5 is the compute foundation; B2 stores its images and C3 monitors it. C2 (KV) protects secrets. H1/H2/H3 are stateful data stores that all depend on KV.
Phase 11: Data, Networking & Foundational Policies (35 min)
| Order | Asset | Name | Dependencies | Time | Notes |
|---|---|---|---|---|---|
| 33 | F3 | Storage Firewall Rules | H1, N5, N6 (storage + NSG) | 5m | F-module: fine-grained NSG + storage rules |
| 34 | F5 | Network Peering | N5, N6 (networks) | 5m | F-module: hub ↔ spoke peering |
| 35 | F4 | Private Endpoints | H1, H2, H3, C2, B2 (services) | 8m | F-module: private endpoints for services |
| 36 | D4 | Kyverno Policies | B5 (AKS) | 8m | Policy: pod security + image verification + labels |
| 37 | F2 | RBAC Role Definitions | J1, J2, J6 (identity) | 5m | F-module: custom Azure RBAC roles |
| 38 | F0 | Cloud-Agnostic Contracts | All F-modules (definition) | 5m | Contract: validates all F0 outputs across modules |
Rationale: F3–F5 refine networking. D4 enforces pod security on AKS. F2 defines the RBAC contract. F0 is the schema; validate once all modules exist.
Phase 12: CI/CD Pipelines (10 min)
| Order | Asset | Name | Dependencies | Time | Notes |
|---|---|---|---|---|---|
| 39 | B1 | GitHub Onboarder | B6 (bootstrap), C1–C3 (pipelines exist) | 10m | GitHub App + repo provisioning on Closed Won |
Rationale: B1 consumes C1–C3 workflows; must come after C2/C3 are tested.
Phase 13: Legacy Runbooks (Optional, ~20 min)
| Order | Asset | Name | Dependencies | Time | Notes |
|---|---|---|---|---|---|
| 40+ | M1–M3 era | Legacy modules | N/A | — | Deprecated; sign-off optional. Archive or migrate to new naming |
Rationale: Only if you have legacy M1-era assets that haven't been migrated.
Quick Reference: Current Sign-Off Status
✅ SHIPPED (5): D1, D3, X3, C4, R1
🟦 CODE-COMPLETE: E0, V2, V3, S1, S2, K1, K2, L1, L2, L4, F12, B6, F11, D5, and the remaining unshipped assets (44 total minus shipped)
⏳ PENDING: All phases above
Execution Tips
Before You Start
- Verify sandbox subscription access (PIM activation if needed)
- Clone repo:
  git clone https://github.com/snowopsdev/snowops-automation.git
- Install tooling:
During Execution
- Work in phases — don't jump around. Phase 1 runbooks must all pass before Phase 2.
- Fill sign-off blocks — every runbook has a "Sign-off" section with tester/date/result.
- Keep a log — maintain a simple checklist (below) in your local notes.
- Link PRs — if you discover a bug, create an issue + PR; reference in the runbook notes.
- Reusable workflows — C1–C3 pipeline tests can be batched; verify once, reference in C2/C3.
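The phase gate in the first bullet can be sketched as a small check over recorded results. This is an illustrative sketch only: `PHASE_ASSETS`, `ACCEPTED`, `phase_complete`, and `may_start` are hypothetical names, and treating "N/A" as an accepted result is an assumption based on the FAQ entry about skipping runbooks.

```python
# Assets per phase, copied from the Phase 1 and Phase 2 tables.
PHASE_ASSETS = {1: ["V2", "V3", "E0"], 2: ["S1", "S2"]}

# Assumption: a sign-off marked N/A does not block the next phase.
ACCEPTED = {"PASS", "N/A"}


def phase_complete(phase, results):
    """True when every asset in the phase has an accepted sign-off result."""
    return all(results.get(asset) in ACCEPTED for asset in PHASE_ASSETS[phase])


def may_start(phase, results):
    """True when all earlier phases are complete."""
    return all(phase_complete(p, results) for p in PHASE_ASSETS if p < phase)


if __name__ == "__main__":
    results = {"V2": "PASS", "V3": "PASS", "E0": "FAIL"}
    print(may_start(2, results))  # E0 has not passed, so Phase 2 is gated
```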
Sample Execution Log
## Execution Log — June 4, 2026
| Phase | Asset | Tester | Date | Result | Notes |
|-------|-------|--------|------|--------|-------|
| 1 | V2 | Sagar | 2026-06-04 | PASS | Golden-file matches; d2 render skipped (no d2 binary) |
| 1 | V3 | Sagar | 2026-06-04 | PASS | All markdown templates render correctly |
| 1 | E0 | Sagar | 2026-06-04 | FAIL | Vanta adapter missing; opened issue #XXX |
| — | — | — | — | — | (continued next session) |
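If the log is kept as structured data rather than typed by hand, the table above could be generated. A minimal sketch, assuming each entry is a plain dict keyed by the column names (`COLUMNS` and `render_log` are hypothetical names, not part of the repo):

```python
COLUMNS = ["Phase", "Asset", "Tester", "Date", "Result", "Notes"]


def render_log(entries):
    """Render a list of dicts as a Markdown table using the log's column order."""
    header = "| " + " | ".join(COLUMNS) + " |"
    divider = "|" + "|".join("---" for _ in COLUMNS) + "|"
    rows = []
    for entry in entries:
        # Escape pipes so free-text notes cannot break the table layout.
        cells = [str(entry.get(col, "")).replace("|", "\\|") for col in COLUMNS]
        rows.append("| " + " | ".join(cells) + " |")
    return "\n".join([header, divider, *rows])


if __name__ == "__main__":
    print(render_log([
        {"Phase": 1, "Asset": "V3", "Tester": "Sagar", "Date": "2026-06-04",
         "Result": "PASS", "Notes": "All markdown templates render correctly"},
    ]))
```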
Dependencies Map (Graphical Reference)
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 1: Visualization & Detection │
│ V2 (diagrams) → V3 (runbooks) → E0 (evidence) │
└──────────────────────┬──────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 2: Detection & Visibility │
│ S1 (drift) → S2 (dashboard) [depends: E0] │
└──────────────────────┬──────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 3: Reliability (L1 → L2 → L4) │
└──────────────────────┬──────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 4–6: Bootstrap, Registry, Utilities │
│ B6 → F11, X7, U1, U2 │
└──────────────────────┬──────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 7: Networking Foundation │
│ N5 → N6 │
└──────────────────────┬──────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 8: Core Infrastructure │
│ M1 → M2 → M3 → M6 │
│ M1 → J1 → J2 → J6 │
└──────────────────────┬──────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 9: Advanced Ops │
│ J1, J2 → H5 (SP rotation) │
│ J1, J2, J6 + M1, M2 → F8 (GitOps) │
└──────────────────────┬──────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 10: Compute & Data │
│ (M1 + N5, N6 + J1, J2, J6) → B5 (AKS) │
│ M1, B5 → B2 (ACR) │
│ M2, B5 → C3 (Container Insights) │
│ M1, J1, J2, J6 → C2 (KV) │
│ M1, C2 → H1, H2, H3 (storage/DB) │
└──────────────────────┬──────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 11: Data & Policy │
│ H1, N5, N6 → F3 (firewall) │
│ N5, N6 → F5 (peering) │
│ H1, H2, H3, C2, B2 → F4 (private endpoints) │
│ B5 → D4 (Kyverno) │
│ J1, J2, J6 → F2 (RBAC) │
│ (All F-modules) → F0 (contracts) │
└──────────────────────┬──────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 12: CI/CD │
│ (C1, C2, C3 ready) + B6 → B1 (GitHub Onboarder) │
└─────────────────────────────────────────────────────────────────┘
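The map above is effectively a directed graph, so the full prerequisite set of any asset can be derived from it. The sketch below is illustrative: `DEPENDS_ON` and `prerequisites` are hypothetical names, and the edges are copied from the Dependencies columns of the Phase 7, 8, and 10 tables for a subset of assets only.

```python
# Direct dependencies, transcribed from the phase tables (subset only).
DEPENDS_ON = {
    "M1": [],
    "N5": [],
    "N6": ["N5"],
    "J1": ["M1"],
    "J2": ["M1"],
    "J6": ["M1", "J1", "J2"],
    "B5": ["M1", "N5", "N6", "J1", "J2", "J6"],
    "B2": ["M1", "B5"],
}


def prerequisites(asset, graph):
    """Return every direct and indirect dependency of the given asset."""
    found = set()
    stack = list(graph[asset])
    while stack:
        dep = stack.pop()
        if dep not in found:
            found.add(dep)
            stack.extend(graph.get(dep, []))
    return found


if __name__ == "__main__":
    # Everything that must be signed off before B2 (ACR) can be tested.
    print(sorted(prerequisites("B2", DEPENDS_ON)))
```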
FAQs
Q: Can I skip a runbook?
A: No. Every asset must have its sign-off block filled. If the asset is not applicable (N/A), mark result as "N/A" with a note.
Q: What if a runbook fails?
A: Create a bug issue, link it in the runbook notes, and move to the next asset. Return to the failed asset after the bug is fixed.
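When choosing which asset to move to, it may help to know which later assets depend on the failed one. A minimal sketch, assuming the dependency data is available as a dict (`DEPENDS_ON` and `blocked_by` are hypothetical names; the edges come from the Phase 1 to 3 tables):

```python
# Direct dependencies for Phases 1 to 3, transcribed from the tables.
DEPENDS_ON = {
    "V2": [],
    "V3": [],
    "E0": ["V2", "V3"],
    "S1": ["E0"],
    "S2": ["E0"],
    "L1": [],
    "L2": ["L1"],
    "L4": ["L1", "L2"],
}


def blocked_by(failed, graph):
    """Return assets that directly or indirectly depend on the failed asset."""
    blocked = set()
    changed = True
    while changed:  # repeat until no new blocked asset is discovered
        changed = False
        for asset, deps in graph.items():
            if asset in blocked:
                continue
            if any(dep == failed or dep in blocked for dep in deps):
                blocked.add(asset)
                changed = True
    return blocked


if __name__ == "__main__":
    print(sorted(blocked_by("E0", DEPENDS_ON)))
```

Assets outside the returned set, such as L1 when E0 fails, have no recorded dependency on the failure.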
Q: How do I run runbooks in parallel?
A: Within a phase, assets with no inter-dependencies can run in parallel. For example, in Phase 1, you could theoretically run V2 + V3 in parallel, but they're short enough to do sequentially. In Phase 10, B2 and C3 are independent; you could test both at once if you have two terminals.
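Parallel groups within a phase can be derived from the in-phase dependencies. The sketch below is an illustration for Phase 10 only: `PHASE_10` and `parallel_batches` are hypothetical names, and dependencies on earlier phases (M1, N5, J1, and so on) are assumed to be already signed off, so they are omitted.

```python
# In-phase dependencies for Phase 10, taken from its table.
PHASE_10 = {
    "B5": [],
    "B2": ["B5"],
    "C3": ["B5"],
    "C2": [],
    "H1": ["C2"],
    "H2": ["C2"],
    "H3": ["C2"],
}


def parallel_batches(graph):
    """Group assets into batches; each batch depends only on earlier batches."""
    remaining = dict(graph)
    done = set()
    batches = []
    while remaining:
        ready = sorted(
            asset for asset, deps in remaining.items()
            if all(dep in done for dep in deps)
        )
        if not ready:
            raise ValueError("dependency cycle detected")
        batches.append(ready)
        done.update(ready)
        for asset in ready:
            del remaining[asset]
    return batches


if __name__ == "__main__":
    for number, batch in enumerate(parallel_batches(PHASE_10), start=1):
        print(number, batch)
```

Under these assumptions B5 and C2 form the first batch, and B2, C3, H1, H2, and H3 can all follow in the second.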
Q: Who approves the sign-offs?
A: For now, Sagar. Once we ship, community contributors can propose sign-offs; Sagar reviews.
Q: How often do I re-run a signed-off runbook?
A: Only if the asset's code changes. If the code is stable, you don't need to re-run it. The sign-off is a watermark for "this works in sandbox as of this date."
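This document does not say how a code change is detected. One possible approach, sketched here under that caveat, is to store a content fingerprint alongside the sign-off and compare it later; a git commit SHA for the asset's directory would serve the same purpose. `fingerprint` and `needs_rerun` are hypothetical names.

```python
import hashlib


def fingerprint(files):
    """Hash a mapping of relative path -> file bytes into one stable digest."""
    digest = hashlib.sha256()
    for path in sorted(files):  # sorted so the result ignores dict ordering
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(files[path])
        digest.update(b"\0")
    return digest.hexdigest()


def needs_rerun(signed_off_fingerprint, current_files):
    """True when the asset's files differ from those recorded at sign-off."""
    return fingerprint(current_files) != signed_off_fingerprint


if __name__ == "__main__":
    at_sign_off = fingerprint({"main.tf": b"resource {}"})
    print(needs_rerun(at_sign_off, {"main.tf": b"resource {}"}))
    print(needs_rerun(at_sign_off, {"main.tf": b"resource { changed }"}))
```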
Next Steps
- Print or bookmark this page.
- Start Phase 1 with V2 (offline, ~10 min).
- Update your sign-off as you go — copy the table above into a spreadsheet or Markdown file.
- After each phase, update docs/context/06-project-state.md with the runbook backlog progress.
- When all 44 are complete, update the CLAUDE.md § Machine State block and commit.
Document version: 1.0
Sync with: CLAUDE.md (§ 0. Session Handoff), docs/context/06-project-state.md, docs/context/09-testing-dod.md