Manual Test Runbook — F5: Azure Key Vault

Owner: Sagar  |  Time: ~30 min (Parts A–C) · +25 min (optional Part D secret-CRUD probe)  |  Sandbox: snowops-sandbox-01

Promotes F5 (modules/azure/key-vault/) from 🟦 Code Complete → 🟩 Shipped. Part C costs ~$2 (PE hourly + a few vault ops). Skip Part D if not iterating on the secret data-plane / Workload Identity wiring.


Prerequisites

  • Sandbox subscription access active (PIM activated if required)
  • az login done; az account show confirms the sandbox subscription is selected
  • Identity has Contributor + User Access Administrator on the sandbox subscription (UAA only needed if kv_*_principal_ids is non-empty)
  • Local tooling: terraform >= 1.6, go >= 1.22, az CLI >= 2.50, jq
  • SNOWOPS_SANDBOX_SUBSCRIPTION_ID and SNOWOPS_SANDBOX_TENANT_ID env vars set
  • Working directory: repo root
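
A quick preflight can catch a missing variable or a stale tool before any step runs. The sketch below is one way to do it. missing_env and version_ge are hypothetical helper names, not part of the repo, and version_ge assumes a sort that supports -V (GNU coreutils and recent macOS).

```shell
# Hypothetical preflight helpers; names are illustrative only.

# Print the names of any listed variables that are unset or empty.
missing_env() (
  missing=""
  for name in "$@"; do
    eval "value=\${$name:-}"
    [ -n "$value" ] || missing="$missing $name"
  done
  echo "${missing# }"
)

# Succeed when version $1 is at least the minimum $2.
# Sorts the two versions and checks that the minimum comes first.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example use (not executed here):
#   missing_env SNOWOPS_SANDBOX_SUBSCRIPTION_ID SNOWOPS_SANDBOX_TENANT_ID
#   version_ge "$(terraform version -json | jq -r .terraform_version)" 1.6
```

missing_env prints nothing when every variable is set, so an empty result means the environment is ready.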

Steps

Part A — terraform fmt + validate (offline, ~1 min)

  1. Confirm formatting + structural validity of the module on its own:
terraform -chdir=modules/azure/key-vault fmt -check
terraform -chdir=modules/azure/key-vault init -backend=false -input=false
terraform -chdir=modules/azure/key-vault validate

Expected: Success! The configuration is valid.

  2. Run the F5-relevant offline Terratest cases:
cd tests/terratest
go test -v -timeout 5m ./modules/azure/... \
  -run 'TestKeyVaultValidate|TestF5KVContractConformance|TestContractsRejectBadLiterals'

Expected: 3 top-level tests pass; the kv-missing-uri sub-test under TestContractsRejectBadLiterals is the F5-relevant negative case.
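
The -run flag takes a regular expression, so the alternation above selects the three named tests out of the full offline suite. A rough way to sanity-check a filter before running it is sketched below. It uses grep -E as a stand-in for Go's regexp engine, an assumption that holds for a plain alternation of literal names, and count_selected is a hypothetical helper.

```shell
# Count how many test names a -run style pattern would select.
# $1 = pattern; test names arrive on stdin, one per line.
count_selected() {
  grep -E -c -- "$1"
}

# Top-level offline test names as listed in Part B of this runbook.
all_tests='TestNoopHarness
TestBaselineValidate
TestStateBackendValidate
TestSandboxValidate
TestF1ContractConformance
TestF6ObjectStoreContractConformance
TestF2NetworkContractConformance
TestNetworkHubValidate
TestACRValidate
TestF4RegistryContractConformance
TestKeyVaultValidate
TestF5KVContractConformance
TestContractsRejectBadLiterals'

printf '%s\n' "$all_tests" |
  count_selected 'TestKeyVaultValidate|TestF5KVContractConformance|TestContractsRejectBadLiterals'
# prints 3
```

The match is unanchored, so a pattern such as TestKeyVaultValidate would also select any longer test name that contains it.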


Part B — full Terratest suite (offline, ~3 min)

  1. Run the whole offline suite to confirm F5 hasn't regressed F1/F2/F4/F6:
cd tests/terratest
go test -v -timeout 10m ./...

Expected: 13 top-level tests pass (TestNoopHarness, TestBaselineValidate, TestStateBackendValidate, TestSandboxValidate, TestF1ContractConformance, TestF6ObjectStoreContractConformance, TestF2NetworkContractConformance, TestNetworkHubValidate, TestACRValidate, TestF4RegistryContractConformance, TestKeyVaultValidate, TestF5KVContractConformance, TestContractsRejectBadLiterals with 7 sub-tests).
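
Counting the top-level results by eye is error-prone with -v output. A minimal sketch, assuming the usual go test -v layout where sub-test result lines are indented and top-level ones start at column 0. count_top_level_pass is a hypothetical helper and the log path in the comment is only an example.

```shell
# Count top-level PASS results in `go test -v` output read from stdin.
# Sub-test result lines are indented, so anchoring at column 0 skips them.
count_top_level_pass() {
  grep -c '^--- PASS: '
}

# Example use (not executed here):
#   go test -v -timeout 10m ./... 2>&1 | tee /tmp/f5-offline.log
#   count_top_level_pass < /tmp/f5-offline.log
```

A result of 13 matches the expectation above. A lower number means a test failed or was skipped, so check the log for --- FAIL or --- SKIP lines.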


Part C — integration test (real Azure apply + destroy, ~25 min, ~$2)

Skip if iterating on offline changes only. Cost is dominated by the F2 hub-spoke + Private Endpoint, not the vault itself.

  1. Export sandbox env vars (same as F1 / F2 / F4 / F6):
export SNOWOPS_SANDBOX_SUBSCRIPTION_ID="<sandbox-subscription-guid>"
export SNOWOPS_SANDBOX_TENANT_ID="<sandbox-tenant-guid>"
  2. Run the F5 integration test:
cd tests/terratest
go test -v -tags integration -timeout 60m ./modules/azure/... -run TestKeyVaultModule
  3. Watch for key milestones:
     • Plan: ~14 to add, 0 to change, 0 to destroy. Treat the add count as approximate: the fixture composes F2 (RG + hub vnet + 1 hub subnet + 1 spoke vnet + 2 spoke subnets + 2 NSGs + 2 NSG associations + 2 peerings + 1 Private DNS zone + 2 vnet links) with F5 (RG + KV + PE + PE psc + PE dns zone group), and how many of these Terraform counts as separate resources depends on which are declared inline. The numbers that must hold are 0 to change and 0 to destroy.
     • azurerm_key_vault.this: Creation complete after ~30s (typically 30–60 seconds depending on region).
     • azurerm_private_endpoint.this: Still creating... (usually ~2 min).
     • All output assertions PASS, including the kv_contract shape check (rbac_mode = true, purge_protection = true).
     • Destroy complete! (clean teardown).
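
The plan milestone comes down to one invariant: on a fresh sandbox the plan only adds resources. A minimal sketch of that check against Terraform's summary line. plan_is_additive is a hypothetical helper, and the line format is assumed to be the standard "Plan: N to add, N to change, N to destroy." summary.

```shell
# Succeed only if a Terraform plan summary line reports additions only.
# $1 = summary line, e.g. "Plan: 14 to add, 0 to change, 0 to destroy."
plan_is_additive() {
  printf '%s\n' "$1" |
    grep -Eq 'Plan: [0-9]+ to add, 0 to change, 0 to destroy\.'
}

# Example use (not executed here):
#   plan_is_additive "$(grep -m1 'Plan:' /tmp/f5-integration.log)" || echo "unexpected plan"
```

A non-zero change or destroy count on a fresh apply suggests leftover state from an earlier run.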

Vault-name reuse caution. The fixture provider config sets purge_soft_delete_on_destroy = false and recover_soft_deleted_key_vaults = true, so destroy leaves the vault in soft-deleted state. The integration test uses random.UniqueId() so this never collides. If you re-run with a fixed name within the 90-day window, Azure will refuse the apply.
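
Before re-running with a fixed vault name, check whether that name is still held by a soft-deleted record. A minimal sketch, assuming the names come from az keyvault list-deleted with a --query "[].name" -o tsv projection. name_is_soft_deleted is a hypothetical helper.

```shell
# Succeed if vault name $1 appears in a newline-separated list on stdin.
# -F treats the name literally and -x requires a whole-line match, so a
# name that is only a prefix of another vault's name does not count.
name_is_soft_deleted() {
  grep -Fqx -- "$1"
}

# Example use (not executed here):
#   az keyvault list-deleted --query "[].name" -o tsv |
#     name_is_soft_deleted "$VAULT_NAME" && echo "name still reserved"
```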


Part D — secret CRUD via Workload Identity (optional, ~25 min)

Verifies the SnowOps end-to-end secret-fetch story: a workload identity (proxy for AKS Workload Identity / F3) writes + reads a secret over the Private Endpoint, while public access stays disabled.

  1. After Part C apply but before destroy, capture the vault name + outputs from the test state:
cd tests/terratest/fixtures/key-vault
VAULT_NAME=$(terraform output -raw vault_name)
KV_RG=$(terraform output -raw kv_resource_group_name)
NET_RG=$(terraform output -raw net_resource_group_name)
echo "vault=$VAULT_NAME kv_rg=$KV_RG net_rg=$NET_RG"
  2. Create a User-Assigned Managed Identity to play the Workload Identity role, then grant it Key Vault Secrets Officer on the vault scope:
az identity create \
  --resource-group "$KV_RG" \
  --name "f5-probe-mi" \
  --location "eastus"

PROBE_MI_OID=$(az identity show --resource-group "$KV_RG" --name "f5-probe-mi" --query principalId -o tsv)
VAULT_ID=$(az keyvault show --name "$VAULT_NAME" --query id -o tsv)

az role assignment create \
  --assignee-object-id "$PROBE_MI_OID" \
  --assignee-principal-type ServicePrincipal \
  --role "Key Vault Secrets Officer" \
  --scope "$VAULT_ID"
  3. Deploy a tiny Linux VM into the F2 spoke workload subnet and attach the MI to it. The VM is the only way to reach the vault: public access is off, so a control-plane shell on your laptop cannot.
SPOKE_VNET=$(terraform output -json spoke_subnet_ids | jq -r '."apps/workload"' | awk -F/ '{print $(NF-2)}')

az vm create --resource-group "$NET_RG" \
  --name "f5-probe-vm" \
  --image "Ubuntu2204" \
  --vnet-name "$SPOKE_VNET" \
  --subnet "workload" \
  --public-ip-address "" \
  --admin-username "snowops" \
  --generate-ssh-keys \
  --assign-identity "/subscriptions/$SNOWOPS_SANDBOX_SUBSCRIPTION_ID/resourceGroups/$KV_RG/providers/Microsoft.ManagedIdentity/userAssignedIdentities/f5-probe-mi" \
  --size "Standard_B2s"
  4. From the VM, fetch a token using the attached MI and write + read a test secret. The DNS lookup must resolve to the PE private IP (not the public CNAME).

    az vm run-command invoke --resource-group "$NET_RG" --name "f5-probe-vm" \
      --command-id "RunShellScript" \
      --scripts "
        # No -x here: tracing would echo the bearer token into the run-command output.
        set -euo pipefail
        # jq parses the IMDS response below; install it in case the image lacks it.
        sudo apt-get update -y && sudo apt-get install -y curl dnsutils jq
        # Confirm DNS resolves to PE (10.10.2.x range from fixture).
        dig +short $VAULT_NAME.vault.azure.net
        # Token via IMDS.
        TOKEN=\$(curl -s 'http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fvault.azure.net' \
          -H 'Metadata: true' | jq -r .access_token)
        # Write.
        curl -s -X PUT \"https://$VAULT_NAME.vault.azure.net/secrets/f5-probe-secret?api-version=7.4\" \
          -H \"Authorization: Bearer \$TOKEN\" \
          -H 'Content-Type: application/json' \
          -d '{\"value\":\"f5-probe-value-001\"}'
        # Read back.
        curl -s \"https://$VAULT_NAME.vault.azure.net/secrets/f5-probe-secret?api-version=7.4\" \
          -H \"Authorization: Bearer \$TOKEN\"
      "
    

    Expected: dig returns a 10.10.2.x address (the PE private IP). The write returns the secret manifest. The read returns {"value":"f5-probe-value-001",...}.

  5. Confirm public access is denied: run the same read from outside the spoke (your laptop) and assert a 403:

    TOKEN=$(az account get-access-token --resource https://vault.azure.net --query accessToken -o tsv)
    curl -s -o /dev/null -w "%{http_code}\n" \
      "https://$VAULT_NAME.vault.azure.net/secrets/f5-probe-secret?api-version=7.4" \
      -H "Authorization: Bearer $TOKEN"
    

    Expected: 403 (PublicNetworkAccessDenied). If you get 200, F5 has silently allowed public access — fail the runbook and investigate.

  6. Clean up before the integration test's deferred terraform destroy fires: delete the probe VM and MI. The role assignment is scoped to the vault and goes when the vault does, but terraform destroy does not know about these out-of-band resources, so the VM and MI must be removed manually:

    az vm delete --resource-group "$NET_RG" --name "f5-probe-vm" --yes
    az identity delete --resource-group "$KV_RG" --name "f5-probe-mi"
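
The Part D checks can be expressed as small pure functions, which makes them easy to dry-run before touching Azure. The sketch below is illustrative only. All three helper names are hypothetical, the 10.10.2.x range is the fixture range named above, and resolves_to_pe assumes the A record is the last line that dig +short prints.

```shell
# Extract the vnet name from a subnet resource ID
# (the third path segment from the end, as in the awk call in step 3).
vnet_from_subnet_id() {
  printf '%s\n' "$1" | awk -F/ '{print $(NF-2)}'
}

# Succeed if the last line of `dig +short` output (stdin) is a 10.10.2.x address.
resolves_to_pe() {
  tail -n 1 | grep -Eq '^10\.10\.2\.[0-9]{1,3}$'
}

# Map the laptop-side HTTP status code to a verdict.
public_read_verdict() {
  case "$1" in
    403) echo "PASS: public access denied" ;;
    200) echo "FAIL: vault is reachable from the public internet" ;;
    *)   echo "INCONCLUSIVE: unexpected status $1" ;;
  esac
}
```

Any status other than 403 or 200 (a timeout or a 401 from an expired token, for example) says nothing about network exposure, so it is reported as inconclusive and not as a pass.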
    

Pass criteria

  • Part A — terraform validate passes for the module
  • Part B — full offline Terratest suite passes (13+ top-level tests)
  • Part C — TestKeyVaultModule integration test passes end-to-end
  • Vault created at premium SKU
  • public_network_access_enabled = false, confirmed via the portal or by az keyvault show --name $VAULT_NAME --query properties.publicNetworkAccess returning Disabled
  • RBAC mode on — az keyvault show --name $VAULT_NAME --query properties.enableRbacAuthorization returns true
  • Purge protection on — ... --query properties.enablePurgeProtection returns true
  • Soft-delete retention = 90 — ... --query properties.softDeleteRetentionInDays returns 90
  • Private Endpoint reachable; private IP non-empty; A-record present in F2's privatelink.vaultcore.azure.net zone
  • (Part D) VM in spoke resolves <vault>.vault.azure.net to a 10.10.2.x address
  • (Part D) Workload Identity writes + reads a secret over the PE
  • (Part D) External (laptop) read returns 403 PublicNetworkAccessDenied
  • All Destroy calls complete without error
  • No orphaned RGs / soft-deleted vault collisions remain (verify with az group list -o table and az keyvault list-deleted -o table)
  • All test resources tagged ephemeral = true (X7 cleanup safety net)
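
The vault-property criteria above can be checked in one pass. A minimal sketch, assuming the five values have already been read with az keyvault show. check_vault_posture is a hypothetical helper, and the properties.sku.name query path in the comment is an assumption that does not appear elsewhere in this runbook.

```shell
# Compare five vault properties against the pass criteria.
# Arguments: sku, publicNetworkAccess, enableRbacAuthorization,
#            enablePurgeProtection, softDeleteRetentionInDays
# Prints one line per mismatch and returns non-zero if any were found.
check_vault_posture() (
  lower() { printf '%s' "$1" | tr '[:upper:]' '[:lower:]'; }
  rc=0
  expect() {
    # $1 = label, $2 = wanted value (lowercase), $3 = actual value
    if [ "$(lower "$3")" != "$2" ]; then
      echo "$1: want $2, got $3"
      rc=1
    fi
  }
  expect sku premium "$1"
  expect publicNetworkAccess disabled "$2"
  expect enableRbacAuthorization true "$3"
  expect enablePurgeProtection true "$4"
  expect softDeleteRetentionInDays 90 "$5"
  exit "$rc"
)

# Example use (not executed here):
#   q() { az keyvault show --name "$VAULT_NAME" --query "$1" -o tsv; }
#   check_vault_posture "$(q properties.sku.name)" \
#     "$(q properties.publicNetworkAccess)" \
#     "$(q properties.enableRbacAuthorization)" \
#     "$(q properties.enablePurgeProtection)" \
#     "$(q properties.softDeleteRetentionInDays)"
```

Values are compared case-insensitively so that differences in how the CLI capitalises output do not cause false failures.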

Teardown

The integration test runs terraform destroy automatically. If a failure mid-run orphans resources, clean up manually:

# Two RGs: <name_prefix>-net-rg (F2) and <name_prefix>-kv-rg (F5).
az group delete --name "<name_prefix>-net-rg" --yes --no-wait
az group delete --name "<name_prefix>-kv-rg" --yes --no-wait

# The vault soft-delete record outlives its RG. List the records to confirm what
# remains. Azure rejects a purge while purge protection is on (this module sets
# purge_protection = true), so a protected vault's record ages out on its own
# after the 90-day retention window. The purge command below only succeeds for
# a vault created without purge protection; production NEVER purges.
az keyvault list-deleted -o table
az keyvault purge --name "<vault_name>" --location "<region>"

Vault destroy takes ~10 seconds; the PE detach can stretch to ~2 minutes.
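
When cleaning up by hand it helps to print the commands for a given prefix and review them before running anything. A small dry-run sketch. print_teardown is a hypothetical helper, and the RG naming follows the <name_prefix>-net-rg / <name_prefix>-kv-rg pattern shown above.

```shell
# Print (do not run) the manual RG cleanup commands for one name prefix.
# $1 = name prefix used by the failed test run
print_teardown() {
  for suffix in net-rg kv-rg; do
    printf 'az group delete --name "%s-%s" --yes --no-wait\n' "$1" "$suffix"
  done
}

# Example use (not executed here):
#   print_teardown "<name_prefix>"          # review the output
#   print_teardown "<name_prefix>" | sh     # then run it
```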


Sign-off

  • Tester: _  |  Date: _  |  Result: PASS / FAIL / N/A
  • Notes: