SnowOps — Decisions Log
| # | Decision | Resolution | Date |
|---|---|---|---|
| D1 | CRM | HubSpot | 2026-05-25 |
| D2 | First compliance platform | Vanta (Drata via adapter) | 2026-05-25 |
| D3 | Terraform remote state | Azure Storage blob (RA-GZRS, immutability, blob-lease) | 2026-05-25 |
| D4 | Container registry | ACR Premium (Notation v2) | 2026-05-25 |
| D5 | K8s policy engine | Kyverno | 2026-05-25 |
| D6 | Primary cloud | Azure-first, cloud-agnostic by contracts | 2026-05-25 |
| D7 | Pre-sales offering | Free automated Azure posture audit (G) | 2026-05-25 |
| D8 | Service packaging | Two tiers: Baseline (Cloud Secure) + Advanced (Certification-Ready) | 2026-05-25 |
| D9 | Test framework (IaC) | Terratest (Go) + Conftest + Kyverno test | 2026-05-25 |
| D10 | Sandbox env | Dedicated SnowOps Azure sandbox subscription (X1) — prerequisite to ship anything | 2026-05-25 |
| D11 | SIEM | Microsoft Sentinel (native Azure-aligned; deferred to Advanced tier) | 2026-05-25 |
| D12 | Milestone structure | 6 milestones (M1, M2a, M2b, M3, M4, M5); each independently shippable; client search begins after M1 | 2026-05-26 |
| D13 | Brownfield support | F-modules must ship with terraform import blocks (F12 library, M3); brownfield is a first-class client profile | 2026-05-26 |
| D14 | Secondary CI/CD | Azure DevOps Pipelines (C5) in M3; M1+M2 are GitHub Actions only — explicit constraint | 2026-05-26 |
| D15 | Ticketing abstraction | TicketPlatform adapter interface (E7, M3) mirroring EvidencePlatform pattern | 2026-05-26 |
| D16 | Evidence floor | E0 lightweight compliance snapshot moved to Baseline [B] and M2b | 2026-05-26 |
| D17 | Documentation tier | V2 (architecture diagrams) and V3 (runbook generator) promoted from Advanced to Baseline | 2026-05-26 |
| D18 | F0 contract count | 7 contracts shipped (network, identity, cluster, registry, kv, observability + object_store). object_store added so F6 has a conforming shape. | 2026-05-27 |
| D19 | F0 conformance pattern | variable "candidate" typed + output "candidate" echoing — pipe candidate through contract module to gate via terraform validate. Strict on literals; permissive on inter-module unknowns. | 2026-05-27 |
| D20 | B-series ↔ F-series relationship | B-modules are per-client compositions that call F-modules as building blocks; F-modules never call B-modules. B3 wraps F1; B4 wraps F6. F-modules = reusable + cloud-agnostic; B-modules = client-opinionated (Defender ON, MCSB assigned, group RBAC wired). | 2026-05-28 |
| D21 | Regulatory-compliance posture | B3 assigns MCSB at subscription scope with system-assigned identity but NO remediation roles → audit-only by default. Active DINE remediation is opt-in out-of-band. Distinct assignment name (snowops-mcsb) coexists with Defender's auto-assignment. | 2026-05-28 |
| D22 | State-backend data-plane access | F6 disables shared keys; backends use use_azuread_auth = true. B4 owns Blob Data role grants. Network lockdown is OPT-IN (default off) because GitHub-hosted runners need the public endpoint. B4 cannot toggle public_network_access_enabled (lives on F6's resource) — network_rules Deny is the lever. | 2026-05-28 |
| D23 | PIM split: Azure resources vs Entra roles | H3 = Entra directory roles (azuread, Graph PATCH for activation rules). B5 = Azure resource roles (azurerm native azurerm_pim_eligible_role_assignment + azurerm_role_management_policy). Both validate-test-only (PIM activation is live/human/P2). B5 composes with B3's standing Owner. | 2026-05-28 |
| D24 | F8 GitOps bundle shape + placement | F8 = K8s manifests under top-level gitops/ (ArgoCD app-of-apps). D4 Kyverno policies REUSED via Application pointing at policy/kyverno/rules — never forked. Hardened: ArgoCD internal+TLS, ingress-nginx internal Azure LB, ESO via Workload Identity. Offline gate = gitops/validate.sh + check-yaml + kyverno test. | 2026-05-29 |
| D25 | H7 break-glass: accounts as input, not created | H7 does NOT create accounts or passwords — hardware-FIDO2 + split-knowledge password-to-safe is a manual ceremony (Identity > Secrets). H7 takes existing account object IDs as input and owns: role-assignable group + permanent Global Admin + severity-0 sign-in alert. H7 is the PRODUCER of the break-glass group that H2/B3/B5 consume. | 2026-05-29 |
| D26 | H5 SP rotation: TS app, PR not auto-rotate | H5 is a TS app (read-only, Application.Read.All) that opens a rotation PR rather than rotating itself — the tool can't know where a secret is consumed. PR commits inventory snapshot as evidence. Steers toward federated-OIDC migration (secretless SPs never go stale). Cron lives in the caller because a reusable workflow can't carry on: schedule. | 2026-05-29 |
| D27 | GTM pivot — sales-readiness re-plan | Y-series = content/collateral layer feeding the existing A-series automation (not a parallel stack). Z-series = vertical reference architectures. Motion = founder-led cold outbound; free G-series Discovery Audit is the wedge. GTM assets need zero cloud → Claude builds in parallel with Sagar's runbooks (Track A vs Track B). No technical asset deleted or de-scoped. | 2026-05-29 |
| D28 | GTM content home (gtm/) + 🟦 status semantics for Y/Z | Y/Z assets live under gtm/ (peer to compliance/, distinct from docs/). Status: 🟦 = drafted + pre-commit passes + DoD spot-checks; 🟩 = human sign-off complete (Nidhi compliance review / Sagar pricing+seed-list). No cloud/runbook gate for content assets. | 2026-05-29 |
| D29 | Y-series fidelity to existing automation + honest-claim enforcement | Y-series is bound to the real codebase: Y13 properties reconciled against apps/crm-automations/ field names; Y7 coverage matrix reuses G2 rule-pack frameworks mappings; Y9 uses real G2 rule IDs over fully-synthetic data. Roadmap assets flagged (roadmap — Mx) inline. No "guaranteed certification" language — audit-ready only. | 2026-05-29 |
| D30 | J1 standalone LAW vs F1 bundled; "immutability" = delete lock | J1 is separate from F1's bundled workspace. Both emit the identical F0 observability_contract. J1's "immutability" = CanNotDelete management lock (not WORM payload storage, which is J6). Delete lock defaults ON in module, OFF in fixtures/examples for unattended teardown. | 2026-05-29 |
| D31 | J2 = DINE-not-Deny + GUID-agnostic; J6 WORM details; lock files committed | J2 uses DINE (not Deny) — diagnostic settings are added post-creation. GUIDs are caller-supplied (not hardcoded). J6 uses account-level time-based immutability with allow_protected_append_writes = true (append-blob log writes work; existing data immutable). .terraform.lock.hcl files ARE committed in this repo. | 2026-05-29 |
| D32 | M-series: three separate Deny-initiative modules + a distinct CMK module | M1/M3/M6 = three separate modules (matches 1:1 asset↔module convention). Effect = Deny (not DINE) — encryption/TLS/region are create-time properties. No system-assigned identity. GUIDs are curated defaults but caller-overridable. M2 (cmk) = real resources: HSM-backed key + rotation_policy in an EXISTING F5 vault (not inline to avoid trivy AZU-0013). | 2026-05-29 |
| D33 | N-series: two separate modules (policy + real-resource); N6 standalone vs F2 | N5 = Deny initiative (public-network-access). N6 = real-resource NSG module with optional flow logs. N6 overlaps F2 intentionally — N6 applies a hardened NSG to any subnet without standing up F2. Same standalone-vs-bundled stance as J1↔F1 and M6↔F1. | 2026-05-29 |
| D34 | U-series: two separate modules; U1 composes with action groups; U2 literal tagName | U1 = real-resource budget + action group (reuses H7's dynamic-block pattern). U2 = tag Deny initiative (reuses M6 skeleton). U2's per-reference literal tagName differs from M6's shared parameter — each tag needs a distinct value so the initiative is parameter-less. | 2026-05-29 |
| D35 | X7 = RG-scoped + protected-name guard + min-age guard; W-series postponed to last | X7 deletes at RG level (cascades to children). Three guards: (1) ephemeral=true tag only, (2) protected-name globs (baseline's own snowops-sandbox-observability-rg is tagged ephemeral=true so protected by NAME), (3) --min-age-hours (default 6) for in-flight tests. W-series (W1–W3) POSTPONED to last per Sagar 2026-05-29 — not part of M2a-complete bar. | 2026-05-29 |
| D36 | M2b/M3 re-prioritization: 14-asset core, postpone all others | Sagar selected 14 focus assets (E0, V2, V3, S1, S2, K1, K2, L1, L2, L4, D5, F12, B6, F11). ALL other M2b/M3 assets postponed until this core is code-complete + signed off. Depth before breadth — mature M2a + complete the drift/evidence/DR/versioning core first. | 2026-05-30 |
| D37 | E0 shape: standalone TS tool, versioned snapshot, C1 best-effort post-apply | E0 = standalone TS app (pure logic + thin ARM REST shell + jest). Snapshot is versioned (schemaVersion 1.0) so downstream adapters depend on a stable shape. diffSnapshots is the centerpiece — regressed = more non-compliant resources OR lower score OR newly non-compliant assignment. C1 integration is continue-on-error (apply already succeeded; CLI exits non-zero only under --fail-on-regression). | 2026-05-30 |
| D38 | V2 = d2lang (not Python diagrams); shape-based detection; golden-file test | d2 chosen over Python diagrams lib: single static binary, plain-text diffable source (enables golden-file test), one-step SVG/PNG/PDF. Detection is shape-based (not name-based) because a client's root module can re-export F0 contracts under any output name. Renderer is deterministic (stable slug IDs + sorted emission). Offline-generate / live-render split (same as D4/F8/C3 precedent). | 2026-05-30 |
| D39 | S1 shape: standalone TS tool on E0 substrate; consumes terraform show -json; seeds the E7 TicketPlatform early | S1 = standalone TS app mirroring E0 (pure logic + thin shell + jest). Consumes terraform show -json (not stdout scraping). Versioned DriftReport schemaVersion 1.0; diffReports is the change signal (parallels diffSnapshots). Plans only, never applies. Rather than block on E7 (postponed, M3), S1 ships a minimal TicketPlatform interface (D15) + GitHub Issues + dry-run adapters; E7 later absorbs/extends it without changing S1's call site. Idempotent upsert via an embedded <!-- snowops-drift:stack=… --> marker → one open issue per stack. Scheduled standalone workflow with a per-stack matrix (read-only + cron-driven, so not a C1 add-on). | 2026-05-30 |
| D40 | S2 shape: fully-offline TS dashboard over E0's snapshot store; vendors diffSnapshots; best-effort name-based framework rollup | S2 = standalone TS app (apps/compliance-dashboard/) mirroring E0/S1 (pure logic + thin shell + jest). FULLY OFFLINE — it never calls Azure; E0 already owns live collection, so S2 is a pure presentation/aggregation layer over E0's versioned ComplianceSnapshot (re-declares the schemaVersion 1.0 input contract rather than importing E0, per D37 — no cross-package build coupling). Renders a versioned ComplianceDashboard (schemaVersion 1.0) + a self-contained static HTML page (inline CSS + inline SVG sparkline, no JS/network → golden-file test like V2/D38) + a markdown summary. The latest-vs-previous regression delta vendors E0's diffSnapshots semantics; the trend series across all snapshots is the value a single E0 snapshot can't give. Framework attribution is best-effort NAME matching (reuses the G2 vocab soc2/iso27001/cis_azure + hipaa; MCSB fans out to all four it underpins per D21; unmatched → explicit "Unmapped" bucket — honest, not silently dropped). Canonical store = compliance/snapshots/ (fed by E0 post-apply via C1 or by the S2 workflow's scheduled collect). Workflow uploads a PRIVATE artifact by default; GitHub Pages publish is OPT-IN (publish_pages) because posture data is sensitive. No new gates (D36). | 2026-05-30 |
| D41 | L1 shape: backup policy module (two vault families) + per-env retention; defines policies, not instance bindings | L1 = modules/azure/backup-policy/. Spans Azure's two backup vault families: an azurerm_recovery_services_vault (VM / Azure Files / SQL-in-VM workloads) and an azurerm_data_protection_backup_vault (AKS operational store) — each created only when one of its workload toggles is on. Four policies map to the catalog's "VM, AKS, SQL, Storage": azurerm_backup_policy_vm, azurerm_backup_policy_file_share, azurerm_backup_policy_vm_workload (SQLDataBase, Full), azurerm_data_protection_backup_policy_kubernetes_cluster. Retention is per-environment (dev 7d / staging 14d+5w / prod 30d+12w+12m+7y), overridable via retention_profile; tiers expand through dynamic blocks and plan-time preconditions enforce CRR⇒GeoRedundant + yearly⇒monthly⇒weekly nesting. The module defines REUSABLE POLICIES, not per-instance protection bindings — binding a VM/share/DB/cluster is per-instance and owned by the consumer/B-module (vault system-assigned identities are exported for that). GeoRedundant + cross_region_restore_enabled default = the on-ramp to L2; the live restore drill is L4. Offline terraform validate Terratest gate (no build-tagged integration test — a real backup/restore is hours-long and is L4's job). Same standalone-reusable stance as the cmk/budget-alert precedent. | 2026-05-30 |
| D42 | L2 shape: cross-region replication wiring module (consumes, doesn't create); storage object replication + SQL failover group; per-env failover posture | L2 = modules/azure/cross-region-replication/. The active-replication half of DR (L1 = recoverability half; L4 = drill). Two independently-toggleable workloads: blob object replication (azurerm_storage_object_replication, one rule per container_mappings pair; optionally creates the destination containers) and a geo-redundant SQL failover group (azurerm_mssql_failover_group, primary↔partner server). CONSUMES existing storage accounts + SQL servers by ARM ID — it wires replication, it does NOT create the underlying resources (brownfield-safe + composable; same "define the link, not the instance" stance as L1's policy module, and avoids overlapping F6/B4 storage ownership + the Identity>Secrets tension of standing up a SQL server). Rejected the provisioning-module alternative for those reasons. Per-env SQL failover posture: dev Manual / staging Automatic-60m / prod Automatic-120m, overridable; plan-time preconditions enforce (a) ≥1 workload on, (b) primary_location != secondary_location (cross-region intent explicit + testable offline even though resources are consumed by ID), (c) distinct accounts + distinct servers, (d) Automatic⇒grace set / Manual⇒grace unset (Azure rejects the wrong combination). Offline terraform validate Terratest gate over a fixture that stands up both GZRS accounts + a source container + two AAD-only SQL servers + a DB and wires both workloads at the prod profile (no build-tagged integration test — a real failover/restore is L4's job). No new gates (D36). | 2026-05-30 |
| D43 | L4 shape: standalone TS drill app (executor seam) + evidence store; S2 panel is additive/gated | L4 = apps/restore-drill/, the third DR leg (L1 backup policies + L2 replication links + L4 proof-of-recovery), built on the E0/S1/S2 mold: pure offline logic + jest, with all I/O behind a DrillExecutor seam (mirrors S1's TicketPlatform). Two adapters: DryRunExecutor (deterministic — tests/demos/workflow rehearsal) and AzureCliExecutor (live az). A drill = restore an L1 backup / fail over an L2 failover group into an ephemeral sandbox RG (tagged ephemeral=true so X7 backstops a failed teardown) → validate → tear down; runDrill enforces validate-skipped-on-restore-failure + teardown-always-runs (no leaked sandbox cost). Versioned RestoreDrillReport schemaVersion 1.0; outcome ∈ passed/partial/failed (partial = recovered but RTO missed OR teardown failed); measured RTO = restore+validate duration (teardown excluded); diffReports is the recoverability-regression signal (parallels diffSnapshots/diffReports). "Pass/fail to the S2 dashboard" is delivered the same producer/consumer way E0→S2 works: L4 writes evidence to compliance/restore-drills/, and S2 gains an OPTIONAL --restore-drills-dir that renders an additive "DR restore drills" panel + a gated restoreDrills field — gated so the compliance-only HTML golden file is byte-for-byte unchanged (re-declares the report subset rather than importing L4, per D37/D40). Scheduled via .github/workflows/restore-drill.yml (monthly cron; dispatch defaults to dry-run rehearsal, schedule runs the live drill + commits the dated report). No new gates (D36); no build-tagged Terratest — the live path is the runbook's Part C. | 2026-05-30 |
| D44 | D5 handled externally; F12 = self-validating import-block library at the README-promised paths | D5 (policy waiver engine) is handled EXTERNALLY by Sagar (like K1/K2) — not pulled in-repo. The shipped D3/X3 OPA bundle is left untouched. M2b 14-asset core thus has three external items (K1, K2, D5); in-repo work skips straight from L4 to F12. F12 = modules/azure/import-blocks/ with one <module>.tf per module, at the exact paths seven module READMEs already promised. Coverage = F1–F6 + J1/J2/J6 (9 modules; F7 postponed, F8 has no TF resources, other real-resource modules deferred). Design: each file pairs the config-driven import {} blocks (the deliverable consumers copy) with a placeholder module call (the rest of the dir = variables.tf + provider) so the whole directory is ONE valid config — terraform validate (TestImportBlocksValidate) proves every to = address resolves, incl. count[0] + for_each["key"] instances, with no duplicated blocks. Chose this over (a) a wrapper/structural-grep gate (weaker) and (b) instantiating via fixtures (duplicates blocks). Caveat captured in the runbook: validate does not evaluate for_each/count, so it proves the base address resolves but not key validity — key schemes are derived from module source and documented per file. for_each import blocks (a 1.7+ feature) are avoided for 1.6 compatibility: one literal-key block per instance. Each covered module's README brownfield section now points at its real file. No new gates (D36). | 2026-05-31 |
| D45 | B6 = pure-evaluator TS app over a snapshot, behind a Collector seam; readiness = all required checks pass | B6 = apps/client-bootstrap/, the prerequisite checker + permission validator a prospective client runs in their OWN tenant pre-engagement. Built on the E0/S1/L4 mold: a pure evaluator over an EnvironmentSnapshot with all I/O behind a Collector seam (FixtureCollector for tests/offline --snapshot; AzureCliCollector best-effort live az/Graph probes — the untested boundary, like S1's terraform shell). Chose this over a pure-bash script (the literal "single-script" framing) so the two acceptance scenarios are unit-testable fixtures fed to the same evaluator; bootstrap.sh is kept as the thin single client entrypoint that builds-on-first-run and wraps the CLI (exit 0=READY / 2=NOT READY). Check catalogue (declarative, pure): tooling (az ≥ 2.50, terraform ≥ 1.6 required; git recommended), auth (signed in), permissions (assign RBAC = Owner OR User Access Administrator — Contributor deliberately excluded since it cannot write roleAssignments; create Entra apps+SPs = Global/Application/Cloud Application Administrator; required resource providers Registered), licensing (Entra ID P2 for PIM/B5 — recommended/warn, not a blocker since PIM isn't day-zero). Required perms derived from what B2/B3/B4 actually do on day zero. Readiness rule: READY only when every REQUIRED check passes (a required fail OR skip = not ready); recommended checks only warn and never block. Read-only by design — no resource is created, so the removal path is just deleting the dir. Permission checks skip (not fail) when not signed in to avoid double-faulting on the auth blocker. No new gates (D36). | 2026-05-31 |
| D46 | F11 = git-tag private registry (no hosted service); manifest + CHANGELOG are the source of truth, enforced by a TS tool | F11 = apps/module-registry/ + modules/registry.json + per-module CHANGELOG.md + .github/workflows/module-release.yml. Chose a git-tag-based "private registry" (modules published as tags <module>/v<version>, consumers pin via source = "git::…//<path>?ref=<module>/vX.Y.Z") over standing up a hosted Terraform Module Registry HTTP service — zero infra, works with private GitHub repos via the consumer's existing creds, honest about the mechanism. modules/registry.json is the single source of truth for which modules are published + at what semver; each module's CHANGELOG.md records history and its top ## vX.Y.Z heading MUST equal the manifest version (the validator errors on drift). Tool built on the B6/L4 mold: pure core over a RegistrySnapshot behind a Collector seam (FsRegistryCollector reads manifest + CHANGELOGs + git tag; FixtureCollector for tests) → validate / buildIndex / planReleases / auditPins. Validation = unique names+paths, strict X.Y.Z semver (no ranges/pre-release), CHANGELOG sync, and no version-regression below an already-published tag (can't unpublish). The pin audit scans a consumer tree and flags unpinned (floats on a branch), ref-mismatch, unknown-version — enforcing the consumer pin strategy in CI. Release is automated on merge to main: validate (gate), then create each pending tag + a GitHub Release whose body is the CHANGELOG section (idempotent). Seeded the 10 core published modules at the versions they already ship (F0 0.1.0; F1–F6 + J1/J2/J6 1.0.0) — brownfield adoption = add a manifest row + a CHANGELOG, no code change. A jest guard loads the REAL committed manifest + CHANGELOGs so the registry can't silently drift. No new Terraform gate (F11 is metadata, not a TF module); the jest suite + the release-workflow validation step are the gates. Completes the 14-asset M2b core (D36). | 2026-05-31 |
| D47 | Three externally-authored assets (K1, K2, D5) merged → 14-asset M2b core 100% in-repo | The three assets Sagar built externally (per D36/D44: K1 IR runbook library, K2 on-call integration, D5 policy waiver engine) have all landed on main via gemini-work PRs #12 (K1+K2) and #13 (D5) and are now marked code_complete. The 14-asset M2b core (D36) is fully in-repo (14/14); m2b_core_external is empty. D5 = a waiver layer on the D3/X3 OPA bundle: D3 rules now emit raw_violation, main.rego filters them through has_active_waiver and hard-denies expired waivers; active records in waivers/exceptions.yaml, wired into terraform-plan-apply.yml via conftest --data. Noted for cleanup (not yet actioned): a stray debug.rego was committed at the repo root in PR #13 and should be removed. Focus now shifts to runbook sign-offs. | 2026-05-31 |
| D48 | Post-core next-5 selection; C5 reconciled as already-built; I-series CI security suite shipped | With the 14-asset M2b core code-complete, the next 5 most-important unbuilt/postponed items were selected: I3 (CodeQL SAST), I2 (dependency scanning), I1 (container image scanning), E7 (TicketPlatform adapters), F7 (Terragrunt live-infra reference). Reconciliation first: C5 (ADO pipeline templates) was discovered already built in commit 51c7fc4 (1,257 lines — terraform-plan-apply/container-build-sign/aks-deploy/quality-gates ADO mirrors + examples) but left ⏸️ postponed in the docs; marked 🟦 code-complete and dropped from the candidate set. Rationale for the 5: the M2a CI baseline had three real holes — no SAST (I3), no PR-time SCA gate or alert digest (I2, only dependabot.yml existed), no reusable image scan (I1, closes G6 for non-K8s clients, distinct from C2's build-time grype) — and the two highest-leverage forward items are E7 (closes G8, unblocks E6/I5/K4/P3, generalizes the S1 TicketPlatform seed) and F7 (the missing per-env/region module-wiring repo). v0.53 ships I1+I2+I3 as a coherent security-scanning suite (pure CI/config, fully offline-lintable): codeql.yml (js/ts + go, security-extended), dependency-review.yml (fail-on High + licence deny-list) + dependency-digest.yml (idempotent weekly alert issue, S1-marker pattern), image-scan.yml (reusable workflow_call Trivy image gate, SARIF + optional registry login). Runbooks I1/I2/I3 added. Next code: E7, then F7. | 2026-06-04 |
| D49 | E7 shape: standalone multi-adapter package, one shared marker-upsert behind a uniform seam; S1 repoint is interface-compatibility, not a build import | E7 = apps/ticket-platform/ (@snowops/ticket-platform), the generalized home of S1's TicketPlatform seed (D15/D39); closes G8. Built on the standalone-package convention (D37/D40): apps don't import each other at build time, so E7 is its own package and S1 keeps an interface-compatible inline copy — the "repoint without changing the call site" promised in the S1 seed is realized as shape-compatibility + a comment pointer, NOT a cross-package import (the repo has no npm workspace; E0→S1/S2 already re-declare contracts the same way). Design: ONE shared idempotent upsertByMarker (list open tickets → match an HTML-comment dedupe marker in the body → update-or-create) over a single low-level MarkerUpsertApi seam (listOpen/create/update); the four platforms (GitHub Issues REST, Jira REST v2, Linear GraphQL, Azure DevOps Boards WIQL+JSON-Patch) differ ONLY in their Fetch implementation of those three calls, each taking an injectable fetch so every adapter is unit-tested by stubbing HTTP (no live account). Platform-specific calls: Jira v2 (not v3) so description is a plain string and the marker round-trips (v3 ADF would mangle it); Linear labels need UUIDs so draft.labels isn't mapped (marker + team scope are enough for dedupe); ADO description is HTML so the HTML-comment marker round-trips, and listOpen is a two-step WIQL→batch-get. Generic marker namespace snowops-ticket: (S1 keeps its historical snowops-drift:; both are opaque body text). Ships DryRunTicketPlatform, a selectPlatform(name, env, overrides) factory with clear missing-credential errors, and the snowops-ticket CLI (default dry-run) so any asset/workflow files an idempotent ticket from a rendered body without embedding adapter code. 6 jest suites / 26 tests. Next-5 #4 (D48); next code F7. | 2026-06-04 |
| D50 | F7 shape: Terragrunt live/ reference — DRY include hierarchy, real dependency DAG, bootstrap on local state; completes the next-5 | F7 = live/, a Terragrunt reference (not a deployable estate — GUIDs/acme are placeholders) wiring F1–F6 into a multi-env/multi-region Azure estate. Layering: root.hcl (included by every unit — remote state in the F6 backend keyed by env-container + unit path, generated azurerm provider with federated OIDC + storage_use_azuread so there are zero long-lived creds, and the §3 tag set) → _envcommon/<module>.hcl (one DRY template per module: source + estate-constant inputs) → <env>/env.hcl (subscription/tenant/retention/allowed-regions/state-container) → <env>/<region>/region.hcl (location + non-overlapping CIDRs) → <env>/<region>/<module>/terragrunt.hcl (a 3-line include of root + _envcommon + a dependency block). Units are byte-identical across env/region BY DESIGN — all variance flows through env.hcl/region.hcl + _envcommon, which is the whole point of the DRY demo. Real dependency DAG (baseline → network-hub/key-vault/acr, consuming dependency.baseline.outputs.log_analytics_workspace_id with mock_outputs so plan/validate run pre-apply). The bootstrap unit (F6 state account) deliberately does NOT include root.hcl and runs on local state — the only honest way to break the chicken-and-egg of storing remote state in itself; it creates one SA with a container per env. Module source uses get_repo_root() (validates in-monorepo) with the F11 registry-pin form documented for external use. Inputs were written against the modules' REAL variable shapes (read from variables.tf — baseline policy_initiative/log_analytics, network-hub hub/spokes, KV/ACR), not invented. Offline gate = live/validate.sh (structural assertions, portable bash — no find -printf) + terragrunt hcl validate/hclfmt when installed; no Terratest (the live path is the runbook's Part C — a run-all plan/apply against sandbox). Completes the next-5 (D48): I1/I2/I3 (v0.53) + E7 (v0.54) + F7 (v0.55). | 2026-06-04 |
| D51 | Post-next-5 batch selection (M2b additional); J4 shape = curated scheduled-query pack consuming workspace + action groups | With the next-5 (D48) complete, the next 5 most-important postponed items were selected by depth-before-breadth (D36) + milestone order — finish the remaining M2b "additional" assets before M3 tail / M4: J4 (alert rule pack), I5 (Defender→ticket via E7), X5 (pipeline integration tests), X8 (synthetic monitoring), R2 (production change log). The heavier network items (N3 WAF, N4 DDoS — CO-owned, cloud-cost) are deferred to a later batch. J4 = modules/azure/alert-rule-pack/: a curated pack of azurerm_monitor_scheduled_query_rules_alert_v2 rules across four threat domains (identity / privilege escalation / network / data exfiltration), 10 curated rules. Design choices: each rule's KQL carries the substantive logic and returns one row per match → the rule fires on row-count > 0 (threshold default 0), so the meaningful numeric thresholds live visibly inside each KQL where and rule_overrides.threshold raises the row count required; no explicit TimeGenerated filter (the rule applies window_duration as the query range). Per-domain enable toggles (drop a whole domain whose source table isn't flowing), per-rule overrides (severity/threshold/operator/frequency/window/enabled, null-inherits curated default), and a freeform custom_rules escape hatch (a lifecycle precondition rejects custom keys that collide with curated keys — curated rules are tuned via overrides, not re-declared). CONSUMES the J1 workspace + K2 action groups by ARM ID (does not create them — same "define the link, not the instance" stance as L2); the action block is omitted when no action groups are supplied (dry-run rollout — rules still surface in the Alerts blade). Offline terraform validate Terratest gate (TestAlertRulePackValidate) over a fixture exercising all four domains + override + disabled rule + custom rule; no build-tagged integration test — a real scheduled-query evaluation needs log data flowing and is the J4 runbook Part C. Same standalone-module convention (versions/variables/main/outputs/README/CHANGELOG/examples) as the log-analytics/nsg-baseline precedents. Not in the F12 import-block library (detection content, not foundational infra). Next code: I5. | 2026-06-05 |
| D52 | X-series completion (X5 + X8); X5 = offline contract gate + live consumers; X8 = standard availability tests + alerts | Completing the X series (the two remaining postponed X assets) ahead of the rest of the D51 batch, per the user. X5 (tests/pipeline-integration/): the pre-existing files were only C2/C3 runbook fixtures — the real deliverable was missing. Built (a) an OFFLINE reusable-workflow contract gate — contract_check.py parses every workflow_call workflow's declared inputs/secrets, finds every caller across .github/workflows/ + templates/client-repo/.github/workflows/, and asserts required inputs supplied / no unknown inputs / required secrets supplied (or secrets: inherit) / no unknown secrets; missing-coverage is a WARNING (image-scan/I1 is client-only). Pure stdlib + PyYAML (handles PyYAML parsing the on: key as boolean True). 11 stdlib-unittest cases (synthetic pass/fail per violation class + uses-resolution + a real-repo zero-violations assertion). validate.sh wrapper (matches F7), CI .github/workflows/pipeline-integration.yml runs it on PRs touching workflows/callers. (b) LIVE test-consumer workflows it-{container-build-sign,aks-deploy,terraform-plan-apply}.yml — dispatch-only callers of the LOCAL reusable workflows (uses: ./.github/workflows/…) that drive the existing fixtures against the X1 sandbox (C2 clean/vulnerable Dockerfiles; C3 nginx/ArgoCD app; C1 plan-only). This split (offline gate always-on + live consumers opt-in) mirrors the repo's offline/live convention; the consumers are themselves checked by the contract gate. Chose a Python checker over a Go/jest one because the content is workflow YAML and Python+PyYAML is the established lint idiom in the I-series runbooks. Removed the now-redundant .gitkeep. X8 (modules/azure/synthetic-monitoring/): outside-in synthetic monitoring as the black-box complement to J4's log-signal detection. azurerm_application_insights_standard_web_test per web_tests entry (classic URL-ping tests are deprecated) + a per-test azurerm_monitor_metric_alert using the application_insights_web_test_location_availability_criteria (scoped to BOTH the web test and the AI component, as Azure requires), notifying action groups by ARM ID. Optionally creates a workspace-based App Insights component (consume-or-create, like J4's consume stance for the workspace/action-groups). Per-test config (GET/POST + body + headers, expected status, content match pass-if-found/not-found, TLS cert check + remaining-lifetime alert, geo-locations, frequency/timeout, failed-location-count, severity) with pack-wide defaults; plan-time preconditions enforce (a) an AI component is provided and (b) failed_location_count ≤ geo_locations count. Both modules follow the standalone-module convention (versions/variables/main/outputs/README/CHANGELOG/examples + fixture + offline terraform validate Terratest gate); no build-tagged integration tests — live availability/build/deploy runs are the runbooks' Part C. Runbooks X5/X8 added. Next code: I5. | 2026-06-05 |
| D53 | I5 = Defender→ticket app; first E7 consumer; consumes E7 at RUN time via its CLI (no build import) | I5 = apps/defender-ticketer/ (@snowops/defender-ticketer) — the first real consumer of E7 (D49), closing the G8 loop. Built on the E0/S1/B6 mold: a pure core (normalize → filter → ticket draft) behind an I/O seam. alerts.ts (pure) normalizes raw Defender for Cloud alerts defensively across BOTH the ARM-REST properties shape and the az security alert list flattened shape, filters by a severity floor (default Medium) + status (default Active/InProgress — Dismissed/Resolved excluded), and computes a stable per-alert dedupe key defender-alert=<alertId> so re-runs update one ticket instead of duplicating. collector.ts = the collection seam (FixtureCollector offline/--input; AzureCliAlertCollector shelling az security alert list, injectable runner — the untested live boundary). Key decision — how I5 consumes E7: per the standalone-package convention (D37/D40/D49) apps don't import each other at build time (no npm workspace), so I5 re-declares the tiny TicketPlatform contract and consumes E7 at RUN time by invoking its snowops-ticket CLI via E7CliTicketPlatform (injectable TicketCliRunner; the runner owns the body/output temp files; I5 builds the --platform/--title/--labels/--dedupe-key baseArgs and parses E7's ticket_id/ticket_url/ticket_updated output block). This is the same shell-out pattern S1 uses for terraform, keeps E7's four adapters (GitHub/Jira/Linear/ADO) the single source of platform code, and genuinely proves the adapter end-to-end. Chose this over (a) a build-time import of @snowops/ticket-platform (violates the no-workspace convention) and (b) re-implementing a GitHub adapter inline like S1's seed (defeats the purpose of E7). I5 supplies body + dedupe-key; E7 owns marker insertion + the idempotent upsert. CLI writes a versioned DefenderTicketReport (schemaVersion 1.0) + per-alert issue-<id>.md; --platform defaults to dry-run (own DryRunTicketPlatform, no tracker); --fail-on-alerts for gate use. 21 jest tests (3 suites). Verified: offline dry-run selects 2/4 sample alerts at the Medium floor, 1/4 at High; E7's CLI --output block confirmed to match what parseCliOutput reads. No new gate (an app; the app-tests CI workflow auto-discovers it). Completes D51 #2; last in the batch is R2. | 2026-06-05 |
| D54 | R2 = production change log app; completes the D51 batch | R2 = apps/change-log/ (@snowops/change-log) — the production change-management evidence asset (SOC2 CC8 / ISO A.12.1.2): what merged to the default branch, categorized into a changelog. Same E0/S1/I5 mold: pure core behind a collector seam. categorize.ts (pure) classifies each merged change by its conventional-commit prefix (the §3 standard: feat/fix/sec/compliance/infra/test + common extras), falling back to PR labels, then other; strips the prefix + trailing (#N). render.ts (pure) groups into ordered non-empty sections and renders Keep-a-Changelog markdown; prependRelease splits an existing changelog on the first ## heading to insert a new release section while preserving the whole preamble (header + blurb). collect.ts = the seam: GitLogCollector (default — squash-merge commits on the branch are what reached production; --no-merges, %H/%an/%cI/%s pretty format with US/RS separators, parser exposed + unit-tested), GitHubPrCollector (gh pr list --json for richer labels/authors), FixtureCollector (offline/tests); shell runners injected so parsers test without git/gh. Optional E7 change-record ticket per release via ticket.ts — the SAME run-time snowops-ticket CLI bridge I5 introduced (D53; re-declared TicketPlatform, E7CliTicketPlatform with injectable runner, parseCliOutput), dedupe key change-record=<release> so a re-run updates one ticket. Chose the git-log default (works offline against the repo itself, no API/auth, honest about squash-merge being the production signal) over an API-only design, with --source github as the richer opt-in. 32 jest tests (4 suites). Verified: offline sample → Features/Security/Tests/Other; live git log on this repo categorized the real commits; prepend preserved the preamble; change-record dry-run path runs. No new gate (the app-tests CI workflow auto-discovers it). Completes the D51 M2b-additional batch (J4/X5/X8/I5/R2). Next: open backlog — runbook sign-offs in parallel; M4 advanced; deferred net items N3/N4; W-series remains last (D35). | 2026-06-05 |
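
Illustration for D52 (X5): the offline reusable-workflow contract check can be sketched roughly as below. This is a minimal, stdlib-only sketch that works on already-parsed workflow dictionaries, not the actual contract_check.py, whose code is not recorded in this log. The function names (trigger_block, declared_contract, check_call) and the sample workflow data are hypothetical. It shows the four violation classes named in D52 and the fallback for PyYAML reading the on: key as boolean True.

```python
def trigger_block(workflow):
    """Return the workflow's trigger mapping.

    PyYAML follows YAML 1.1, where a bare `on` key is read as boolean True,
    so the trigger block may sit under either key.
    """
    block = workflow.get("on", workflow.get(True))
    return block if isinstance(block, dict) else {}


def declared_contract(workflow):
    """Collect the inputs/secrets a workflow_call workflow declares."""
    call = trigger_block(workflow).get("workflow_call") or {}
    inputs = call.get("inputs") or {}
    secrets = call.get("secrets") or {}
    return {
        "inputs": set(inputs),
        "required_inputs": {k for k, v in inputs.items() if (v or {}).get("required")},
        "secrets": set(secrets),
        "required_secrets": {k for k, v in secrets.items() if (v or {}).get("required")},
    }


def check_call(job, contract):
    """Return the contract violations for one calling job (empty list = clean)."""
    violations = []
    supplied_inputs = set(job.get("with") or {})
    for name in sorted(contract["required_inputs"] - supplied_inputs):
        violations.append(f"missing required input: {name}")
    for name in sorted(supplied_inputs - contract["inputs"]):
        violations.append(f"unknown input: {name}")

    secrets = job.get("secrets")
    if secrets != "inherit":  # `secrets: inherit` satisfies every declared secret
        supplied_secrets = set(secrets or {})
        for name in sorted(contract["required_secrets"] - supplied_secrets):
            violations.append(f"missing required secret: {name}")
        for name in sorted(supplied_secrets - contract["secrets"]):
            violations.append(f"unknown secret: {name}")
    return violations


# Hypothetical reusable workflow, keyed under True as PyYAML would produce it.
reusable = {
    True: {
        "workflow_call": {
            "inputs": {"image": {"required": True, "type": "string"},
                       "tag": {"type": "string"}},
            "secrets": {"REGISTRY_TOKEN": {"required": True}},
        }
    }
}
contract = declared_contract(reusable)

good_caller = {"with": {"image": "app"}, "secrets": "inherit"}
bad_caller = {"with": {"tga": "v1"}, "secrets": {}}
```

Here check_call(good_caller, contract) returns an empty list, while bad_caller yields one violation each for the missing required input, the misspelled input, and the missing required secret. Keeping the check a pure function over parsed dictionaries is one way to get the synthetic pass/fail cases D52 mentions without touching the filesystem.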
Open Decisions
(none — re-open as new questions arise)