Manual Test Runbook — J4: Detection Alert Rule Pack
Owner: Sagar | Time: ~8 min (Parts A + B offline) · +15 min (optional Part C integration apply) | Sandbox: snowops-sandbox-01
Promotes J4 (
modules/azure/alert-rule-pack/) from 🟦 Code Complete → 🟩 Shipped. Parts A + B are offline ($0). Part C applies the curated pack against a sandbox Log Analytics workspace (~$0 — scheduled-query rules are a few cents/month; the apply lasts minutes) and destroys.
Prerequisites
- Sandbox subscription access active (PIM activated if required)
-
az logindone; sandbox subscription selected - Identity has Contributor on the sandbox sub (workspace + alert-rule create)
-
SNOWOPS_SANDBOX_SUBSCRIPTION_ID+SNOWOPS_SANDBOX_TENANT_IDexported - Local tooling:
terraform >= 1.6,go >= 1.22,az CLI >= 2.50 - Working directory: repo root
Steps
Part A — terraform fmt + validate (offline, ~3 min)
- Module + example:
terraform -chdir=modules/azure/alert-rule-pack fmt -recursive -check
terraform -chdir=modules/azure/alert-rule-pack init -backend=false -input=false
terraform -chdir=modules/azure/alert-rule-pack validate
terraform -chdir=modules/azure/alert-rule-pack/examples/basic init -backend=false -input=false
terraform -chdir=modules/azure/alert-rule-pack/examples/basic validate
Expected: Success! for both.
- Offline Terratest case:
Expected: PASS — exercises the full curated pack across all four domains, a per-rule override (severity + threshold + frequency), one disabled rule, a custom rule, and the action-group wiring, offline.
Part B — full Terratest suite (offline, ~5 min)
bash cd tests/terratest && go test -count=1 -timeout 15m ./...
Expected: the full suite green (the new TestAlertRulePackValidate included).
Part C — integration apply (sandbox, ~15 min, ~$0)
No build-tagged integration test ships for J4 — a real scheduled-query evaluation needs log data flowing and the "alert fires within N min of a planted signal" criterion belongs here, not in a teardown-on-completion test.
- Stand up a throwaway workspace + the curated pack. Reuse the J1 fixture for
the workspace, then apply the J4 example against it (or hand-write a small
root module wiring
log-analytics→alert-rule-pack). Minimal path:
cd tests/terratest/fixtures/alert-rule-pack
terraform init -input=false
# Point workspace_id at a REAL sandbox workspace ARM ID first (edit main.tf
# or wrap with a workspace module); action_group_ids may stay synthetic or be
# set to a real K2 action group.
terraform apply -auto-approve \
-var "subscription_id=$SNOWOPS_SANDBOX_SUBSCRIPTION_ID" \
-var "tenant_id=$SNOWOPS_SANDBOX_TENANT_ID"
- Confirm the rules exist with the expected shape:
RG=snowops-j4-test-rg
az monitor scheduled-query list --resource-group "$RG" \
--query "[].{name:name, severity:severity, enabled:enabled, freq:evaluationFrequency}" -o table
Expected: 9 enabled rules + 1 disabled (*-network-nsg-deny-spike, disabled
via the fixture override) + the custom *-app-5xx-spike. The
*-identity-failed-signin-spike rule shows severity 1 and frequency PT5M
(override applied).
- (Optional, signal test) Plant a matching signal and confirm an alert fires.
Easiest: temporarily lower a rule's threshold to 0 against a table that has
data (e.g. point a custom rule at
Heartbeat | take 1), wait one evaluation window, and confirm an alert appears:
Expected: the rule transitions to fired and (if a real action group was supplied) the action group receives a notification.
- Destroy:
terraform destroy -auto-approve \
-var "subscription_id=$SNOWOPS_SANDBOX_SUBSCRIPTION_ID" \
-var "tenant_id=$SNOWOPS_SANDBOX_TENANT_ID"
Expected: clean destroy — rules are stateless detection content; the RG and every rule are removed. (The workspace/action group, if owned elsewhere, are untouched.)
Pass criteria
- Part A — module + example validate;
TestAlertRulePackValidatepasses - Part B — full offline suite passes
- (Part C) fixture applies the pack; rule count + severities + the disabled rule + the override match; destroys clean
- (Part C, optional) a planted signal fires the rule and notifies the action group
- All test resources removed
Failure mode
A rule that can never fire because its source table isn't flowing into the
workspace (most common operational failure). Documented in the module README
data-source matrix; mitigation is the per-domain enable toggle. A noisy rule is
tuned via rule_overrides (raise threshold) or a custom_rules replacement.
Cost impact
A few cents per rule per month plus log-query cost per evaluation (data scanned ×
frequency). Full pack at the default PT15M < $1/day on a typical Baseline
workspace. No standing compute. Part C apply is ~$0.
Removal path
terraform destroy (Part C step 7) removes every rule and the RG if J4 created
it. Verified clean in Part C.
Sign-Off
| Field | Value |
|---|---|
| Part A (validate) | ☐ PASS |
| Part B (offline suite) | ☐ PASS |
| Part C (integration apply) | ☐ PASS / ☐ skipped |
| Part C signal test | ☐ PASS / ☐ skipped |
| Tester | |
| Date | |
| Result | ☐ PASS |