Compliance Monitoring & Full Baseline — Handover

Package: Baseline "Cloud Secure" [B] (part 2 of 2)
Milestone: M2b
Scenario: Same client as M2a; adds continuous compliance monitoring, backup/DR, an evidence floor, and auto-generated documentation.
Ownership: Mostly [CO] Client-Owned, with [SH] Shared monitoring outputs (you consume the reports; SnowOps helps operate the monitoring that produces them).

This is the compliance-monitoring and documentation half of the Baseline package. It assumes the M2a Greenfield Baseline is in place.

What You Now Get

The M2a baseline made your platform compliant by construction. M2b makes it audit-defensible over time — continuous proof that it stays compliant, recovers from disaster, and is documented.

Capability	Asset	Where it lives
Compliance evidence snapshot	E0	`apps/evidence-collector/` → `compliance/snapshots/`
Drift detection	S1	`apps/drift-detector/` + `.github/workflows/drift-detection.yml`
Compliance dashboard	S2	`apps/compliance-dashboard/` + `.github/workflows/compliance-dashboard.yml`
Policy waivers (exceptions)	D5	`waivers/exceptions.yaml` + OPA waiver engine
Architecture diagram (auto)	V2	`apps/diagram-generator/`
Operational runbook (auto)	V3	`apps/runbook-generator/`
Backup policies	L1	`modules/azure/backup-policy/`
Cross-region replication / DR	L2	`modules/azure/cross-region-replication/`
Automated restore drill	L4	`apps/restore-drill/` + `.github/workflows/restore-drill.yml`
Incident response runbooks	K1	`docs/runbooks/incident/`
On-call integration	K2	`modules/azure/oncall-integration/` (PagerDuty/Opsgenie + Slack)
Brownfield import library	F12	`modules/azure/import-blocks/`
Self-service prerequisite checker	B6	`apps/client-bootstrap/`
Module versioning / private registry	F11	`apps/module-registry/` + `modules/registry.json`

The Continuous-Compliance Loop

These assets form a closed feedback loop that runs without you having to remember to look:

        ┌──────────────────────────────────────────────────┐
        │  E0  evidence-collector (scheduled + post-apply) │
        │  → queries Azure Policy + Defender secure score  │
        │  → writes a versioned snapshot to                │
        │     compliance/snapshots/                        │
        └───────────────┬──────────────────────────────────┘
                        │
          ┌─────────────┴──────────────┐
          ▼                            ▼
  S2 compliance-dashboard       diffSnapshots regression signal
  → trend + framework rollup    → flags when posture degrades
  → static HTML + markdown
                        ▲
        ┌───────────────┴──────────────────┐
        │  S1 drift-detector (daily cron)  │
        │  → terraform plan per stack      │
        │  → files ONE ticket per stack    │
        │     of drift via TicketPlatform  │
        └──────────────────────────────────┘

E0 is your evidence floor — a machine-generated compliance snapshot on every apply and on schedule. This is what you show an auditor when they ask "prove your controls were operating on date X."
S1 catches anyone making manual ("click-ops") changes to managed infrastructure and opens a ticket. It only plans, never applies.
S2 renders the snapshot history into a dashboard with a SOC 2 / ISO 27001 / CIS Azure / HIPAA rollup and a trend line. It also surfaces the L4 restore-drill results in a "DR restore drills" panel.
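
The regression signal in the loop above can be pictured as a comparison of two consecutive snapshots. The following is a minimal Python sketch of that idea, not the shipped `diffSnapshots` implementation: the snapshot shape (`secure_score`, a `policies` map of name to `Compliant`/`NonCompliant`) and every name in the code are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SnapshotDiff:
    """Differences between two compliance snapshots (hypothetical shape)."""
    newly_non_compliant: list = field(default_factory=list)
    newly_compliant: list = field(default_factory=list)
    secure_score_delta: float = 0.0

    @property
    def regressed(self) -> bool:
        # Posture degraded if any control flipped to non-compliant
        # or the secure score dropped.
        return bool(self.newly_non_compliant) or self.secure_score_delta < 0


def diff_snapshots(previous: dict, current: dict) -> SnapshotDiff:
    """Compare two snapshots and report which controls changed state."""
    prev_states = previous.get("policies", {})
    curr_states = current.get("policies", {})
    diff = SnapshotDiff(
        secure_score_delta=current.get("secure_score", 0.0)
        - previous.get("secure_score", 0.0)
    )
    for name in sorted(curr_states):
        before = prev_states.get(name)
        after = curr_states[name]
        if after == "NonCompliant" and before != "NonCompliant":
            diff.newly_non_compliant.append(name)
        elif after == "Compliant" and before == "NonCompliant":
            diff.newly_compliant.append(name)
    return diff
```

A scheduled job could call `diff_snapshots` on the two most recent files in `compliance/snapshots/` and raise an alert when `regressed` is true.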

Disaster Recovery — The Three Legs

DR is delivered as three complementary assets. Understand which does what:

Leg	Asset	What it provides
Recoverability	L1 backup-policy	GeoRedundant Recovery Services + Data Protection vaults; per-env retention policies for VM / Files / SQL / AKS.
Active replication	L2 cross-region-replication	Blob object replication + SQL failover group across regions.
Proof	L4 restore-drill	Monthly automated drill: restore/failover into an ephemeral sandbox RG → validate → tear down → record a dated `RestoreDrillReport`.

The L4 drill is your RTO evidence. It runs monthly via `restore-drill.yml`, classifies each outcome as passed, partial, or failed, measures the actual RTO, and commits the report to `compliance/restore-drills/`, which S2 then displays. An auditor asking "do you test your backups?" gets a dated, machine-generated answer.
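
To make the outcome classification concrete, here is a small Python sketch. It is an illustration under assumptions, not the logic of `apps/restore-drill/`: the `DrillSteps` type and its fields are hypothetical, and the rules are inferred (a failed restore or validation counts as `failed`; a drill that restored and validated but could not tear down its sandbox counts as `partial`, which matches the failure-modes table in this document).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class DrillSteps:
    """Result of each drill phase (hypothetical structure)."""
    restored: bool
    validated: bool
    torn_down: bool
    started_at: datetime
    validated_at: Optional[datetime]


def classify(steps: DrillSteps) -> str:
    """Map phase results to passed / partial / failed."""
    if not (steps.restored and steps.validated):
        return "failed"   # recovery itself did not work
    if not steps.torn_down:
        return "partial"  # recovery worked, cleanup did not
    return "passed"


def actual_rto_seconds(steps: DrillSteps) -> Optional[float]:
    """Elapsed time from drill start to successful validation, if any."""
    if steps.validated_at is None:
        return None
    return (steps.validated_at - steps.started_at).total_seconds()
```

Measuring RTO up to the validation step, rather than to the end of teardown, is a design choice of this sketch: cleanup time is not part of recovery.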

Backup policies (L1) define retention only. Binding a specific VM, database, or file share to a policy is a per-instance action your team owns; the vault managed identities are exported so your team can reference them when doing so.

Documentation That Generates Itself

Tool	Input	Output
V2 diagram-generator	`terraform output -json`	A `d2lang` architecture diagram (SVG/PNG/PDF) of what was actually deployed
V3 runbook-generator	`terraform output -json`	Per-domain operational runbooks (identity/network/compute/registry/secrets/storage/observability) with key facts, Day-Zero hardening posture, and failure modes

Re-run these after any significant change so your architecture diagram and operational runbook never drift from reality. Both are zero-cloud — they read Terraform outputs, not your live environment.
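
Because both generators work from `terraform output -json`, their input is a plain JSON object in which each output name maps to an entry with `value`, `type`, and `sensitive` keys. The Python sketch below shows one way to read that format. The helper names and the convention of grouping outputs by a `<domain>_` name prefix are assumptions for illustration, not how V2 or V3 are implemented.

```python
import json


def load_outputs(raw_json: str) -> dict:
    """Flatten `terraform output -json` into {name: value}, skipping sensitive outputs."""
    parsed = json.loads(raw_json)
    return {
        name: entry["value"]
        for name, entry in parsed.items()
        if not entry.get("sensitive", False)
    }


def key_facts(outputs: dict, domain: str) -> list:
    """Render outputs whose name starts with '<domain>_' as 'name: value' lines."""
    prefix = f"{domain}_"
    return [
        f"{name}: {value}"
        for name, value in sorted(outputs.items())
        if name.startswith(prefix)
    ]
```

For example, after `terraform output -json > outputs.json`, passing the file contents to `load_outputs` and then calling `key_facts(outputs, "network")` would list the network-related outputs without touching the live environment.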

Policy Waivers — Handling Exceptions Correctly

When a real, justified exception to a D3 OPA rule is unavoidable (common during brownfield migration), do not disable the rule. File a time-boxed waiver:

```yaml
# waivers/exceptions.yaml
- rule_prefix: snowops.network
  resource_address: azurerm_storage_account.legacy_public
  expiry_date: "2026-09-01"
  owner: platform-team@client.example
  justification: "Legacy public endpoint; migration to private endpoint tracked in JIRA-1234"
```

The waiver suppresses the matching finding until expiry_date, then hard-fails CI once expired (snowops.waiver_expired). This gives you a PR-linked, auditable exception trail with a built-in deadline — exactly what an auditor wants to see instead of a silently-disabled control.
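
The suppress-then-fail behaviour can be sketched in a few lines of Python. This is an illustration only: the real engine is written for OPA, and the rule id `snowops.network.public_access`, the finding shape, and the choice to treat the waiver as expired from `expiry_date` onward are all assumptions of this sketch.

```python
from datetime import date


def is_expired(waiver: dict, today: date) -> bool:
    # Assumption: the waiver stops applying on expiry_date itself.
    return today >= date.fromisoformat(waiver["expiry_date"])


def matches(waiver: dict, finding: dict) -> bool:
    """A waiver covers a finding on the same resource whose rule starts with rule_prefix."""
    return (
        finding["rule"].startswith(waiver["rule_prefix"])
        and finding["resource_address"] == waiver["resource_address"]
    )


def evaluate(findings: list, waivers: list, today: date) -> list:
    """Return the failures CI should report; an empty list means the check passes."""
    failures = []
    active = []
    for waiver in waivers:
        if is_expired(waiver, today):
            # An expired waiver is itself a hard failure.
            failures.append(f"snowops.waiver_expired: {waiver['resource_address']}")
        else:
            active.append(waiver)
    for finding in findings:
        if not any(matches(w, finding) for w in active):
            failures.append(f"{finding['rule']}: {finding['resource_address']}")
    return failures
```

With the example waiver above, a matching finding evaluated on 2026-08-31 yields no failures, while the same finding on 2026-09-01 yields both the expiry failure and the original finding.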

Incident Response

K1 ships a runbook library in docs/runbooks/incident/ covering compromise, ransomware, data leak, DDoS, and vendor breach. Review these with your team and adapt the contact/escalation details.
K2 wires Sentinel/Defender incidents to your on-call tool (PagerDuty/Opsgenie) and Slack via modules/azure/oncall-integration/. Configure your action groups and test an alert end-to-end at handover.

What This Delivers for Compliance

Control theme	Framework reference	Asset / Evidence
Monitoring of controls	SOC 2 CC4.1 · ISO 27001 A.8.16	E0 snapshots, S2 dashboard trend
Change detection / unauthorized change	SOC 2 CC7.1/CC8.1 · ISO 27001 A.8.32	S1 drift tickets
Availability / backup	SOC 2 A1.2/A1.3 · ISO 27001 A.8.13/A.8.14	L1 policies, L4 monthly restore reports
Incident response	SOC 2 CC7.3/CC7.4 · ISO 27001 A.5.24–A.5.26	K1 runbooks, K2 on-call
Exception management	SOC 2 CC3.4	D5 waiver records with expiry

Verification at Handover

E0 produces a snapshot in compliance/snapshots/ (scheduled run + post-apply).
S2 renders a dashboard HTML from the snapshot history with a framework rollup.
S1 opens a drift ticket after you deliberately make a manual change to a managed resource.
L4 dry-run drill completes and writes a RestoreDrillReport; the live monthly run restores into a sandbox RG and tears it down.
An expired waiver fails CI; an unexpired one suppresses its finding.
V2/V3 generate a diagram and runbook from terraform output -json.
A test incident routes to on-call (K2) and Slack.
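
For the S1 check in the list above, it helps to know how a plan-only run can signal drift. Terraform's `plan -detailed-exitcode` exits with 0 when there are no changes, 1 on error, and 2 when changes are present. Whether the shipped drift-detector relies on this flag is an assumption; the Python sketch below only illustrates the technique, and its function names are hypothetical.

```python
import subprocess


def classify_plan_exit(code: int) -> str:
    """Interpret the exit code of `terraform plan -detailed-exitcode`."""
    if code == 0:
        return "in_sync"  # live state matches the configuration
    if code == 2:
        return "drift"    # plan succeeded and found a non-empty diff
    return "error"        # exit code 1, or anything unexpected


def plan_stack(stack_dir: str) -> str:
    """Run a plan (never an apply) in one stack directory and classify the result."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=stack_dir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

After a deliberate manual change to a managed resource, `plan_stack` on that stack would be expected to return `"drift"`, which is the condition under which a ticket should appear.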

Failure Modes You Should Know

Symptom	Cause	Response
S1 opens duplicate drift tickets	Marker mismatch / multiple stacks sharing a key	Each stack uses an embedded `<!-- snowops-drift:stack=… -->` marker for idempotent upsert; verify each matrix entry has a unique stack name.
E0 snapshot empty / missing fields	Collector lacks Reader + Security Reader	Grant the read-only roles E0 requires; it never needs write access.
L4 drill leaves a sandbox RG behind	Teardown step failed	X7 nightly cleanup backstops it (the drill RG is tagged `ephemeral=true`); the report is classified `partial`.
Dashboard framework rollup shows "Unmapped"	Policy names don't match the name-based matcher	Expected and honest — unmatched controls are bucketed explicitly, not dropped. Refine names or accept the bucket.
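
The marker-based upsert in the first row of the table can be sketched as follows. This Python example uses an in-memory ticket list as a stand-in for the real ticketing backend; the ticket shape, the function names, and the assumption that the stack name is substituted directly into the marker are all hypothetical.

```python
def marker_for(stack: str) -> str:
    """Hidden HTML comment that identifies which stack a ticket belongs to."""
    return f"<!-- snowops-drift:stack={stack} -->"


def upsert_drift_ticket(tickets: list, stack: str, summary: str) -> dict:
    """Update the open ticket carrying this stack's marker, or create one."""
    marker = marker_for(stack)
    body = f"{summary}\n\n{marker}"
    for ticket in tickets:
        if ticket["open"] and marker in ticket["body"]:
            ticket["body"] = body  # refresh in place: no duplicate ticket
            return ticket
    ticket = {"id": len(tickets) + 1, "open": True, "body": body}
    tickets.append(ticket)
    return ticket
```

Because lookup depends entirely on the marker, two matrix entries that share a stack name would share one ticket, and a ticket whose marker was edited or removed would no longer be found, so the next run would open a new one.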

Removal / Offboarding Path

The monitoring apps (E0/S1/S2/L4/V2/V3) are read-only and standalone — deleting the app directories and scheduled workflows removes them with zero residual cloud cost.
Backup/DR modules (L1/L2) have verified terraform destroy paths. Decide retention deliberately — destroying a backup vault destroys recovery points.
Evidence already collected in compliance/snapshots/ and compliance/restore-drills/ is yours to keep for your audit record.

Have existing infrastructure to bring under management, or running Azure DevOps? See Brownfield Adoption.
Pursuing formal SOC 2 / ISO 27001 / HIPAA certification? That is the Advanced "Certification-Ready" package (M4) — automated evidence platform (Vanta/Drata), SIEM, trust center, vendor/HR controls. Ask SnowOps for the Advanced engagement scope.