Manual Test Runbook — F3: AKS Secure
Owner: Sagar | Time: ~5 min (Parts A + B, offline) · +30 min Part C (sandbox apply, ~$10) · +30 min Part D (kubectl + Workload Identity probe, optional) | Sandbox: snowops-sandbox-01
Promotes F3 (modules/azure/aks-secure/) from 🟦 Code Complete → 🟩 Shipped. Part C costs ~$10 (4 D4ds_v5 VMs + public LB for ~30 min). Skip Part D if you don't need to verify the in-cluster auth path end-to-end.
Prerequisites
- Sandbox subscription access active (PIM activated if required)
- az login done; az account show confirms the sandbox subscription is selected
- Identity has Contributor + Network Contributor + User Access Administrator on the sandbox subscription
- Local tooling: terraform >= 1.6, go >= 1.22, az CLI >= 2.50, kubectl >= 1.28, jq
- SNOWOPS_SANDBOX_SUBSCRIPTION_ID and SNOWOPS_SANDBOX_TENANT_ID env vars set
- (Part C only) SNOWOPS_AAD_BREAKGLASS_GROUP_OBJECT_ID env var set: an AAD group object ID that should receive cluster-admin via AAD-RBAC. Create one in the sandbox tenant if needed: az ad group create --display-name "snowops-aks-admins" --mail-nickname "snowops-aks-admins"
- Working directory: repo root
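The environment-variable and tooling prerequisites above can be checked mechanically before starting. The following is a minimal sketch, not part of the repo: the function names missing_env_vars and missing_tools are hypothetical, and the helpers only check that variables are non-empty and that tools are on PATH. They do not check tool versions or Azure role assignments.

```shell
# Hypothetical preflight helpers; names are illustrative, not from the repo.

# Print the name of every listed environment variable that is unset or empty.
missing_env_vars() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
    fi
  done
}

# Print every listed command that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Usage (empty output means nothing is missing):
#   missing_env_vars SNOWOPS_SANDBOX_SUBSCRIPTION_ID SNOWOPS_SANDBOX_TENANT_ID
#   missing_tools terraform go az kubectl jq
```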
Steps
Part A — terraform fmt + validate (offline, ~2 min)
- Confirm formatting + structural validity of the module on its own:
terraform -chdir=modules/azure/aks-secure fmt -check
terraform -chdir=modules/azure/aks-secure init -backend=false -input=false
terraform -chdir=modules/azure/aks-secure validate
Expected: Success! The configuration is valid.
- Confirm the basic example also passes:
terraform -chdir=modules/azure/aks-secure/examples/basic fmt -check
terraform -chdir=modules/azure/aks-secure/examples/basic init -backend=false -input=false
terraform -chdir=modules/azure/aks-secure/examples/basic validate
Expected: Success! The configuration is valid.
- Run the F3-relevant offline Terratest cases:
cd tests/terratest
go test -v -timeout 5m ./modules/azure/... \
-run 'TestAKSSecureValidate|TestF3ClusterContractConformance|TestContractsRejectBadLiterals'
Expected: 3 top-level tests pass; the cluster-missing-endpoint sub-test
under TestContractsRejectBadLiterals is the F3-relevant negative case.
Part B — full Terratest suite (offline, ~3 min)
- Run the whole offline suite to confirm F3 hasn't regressed any earlier module:
Expected: 15 top-level tests pass (TestNoopHarness,
TestBaselineValidate, TestStateBackendValidate, TestSandboxValidate,
TestF1ContractConformance, TestF6ObjectStoreContractConformance,
TestF2NetworkContractConformance, TestNetworkHubValidate,
TestACRValidate, TestF4RegistryContractConformance,
TestKeyVaultValidate, TestF5KVContractConformance,
TestAKSSecureValidate, TestF3ClusterContractConformance,
TestContractsRejectBadLiterals with 8 sub-tests).
- Sanity-check the integration test compiles even without running it:
Expected: no errors.
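The commands for both Part B steps do not appear in the text above, so the following is only a sketch under stated assumptions: that the suite runs from tests/terratest with the same go test invocation as Part A minus the -run filter, and that the integration test sits behind the integration build tag used in Part C. The helper names are hypothetical. The first helper counts top-level passes in go test -v output (top-level result lines start at column 0, sub-test results are indented). The second compiles the tagged tests without running any of them by passing a -run pattern that matches no test name.

```shell
# Hypothetical helpers; the exact commands are assumptions, not taken from the runbook.

# Count top-level passing tests in `go test -v` output read from stdin.
# Sub-test result lines are indented, so anchoring at column 0 skips them.
count_top_level_passes() {
  grep -c '^--- PASS: ' || true
}

# Compile tests behind the `integration` build tag without running any:
# the pattern '^$' matches no test name.
compile_integration_tests() {
  go test -tags integration -run '^$' "${1:-./modules/azure/...}"
}

# Usage, from tests/terratest:
#   go test -v -timeout 5m ./modules/azure/... | count_top_level_passes   # runbook expects 15
#   compile_integration_tests
```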
Part C — integration test (real Azure apply + destroy, ~30 min, ~$10)
Skip if iterating on offline changes only. Cost is dominated by the four Standard_D4ds_v5 VMs that the system + apps pools spin up.
- Export sandbox env vars + the break-glass group object ID:
export SNOWOPS_SANDBOX_SUBSCRIPTION_ID="<sandbox-subscription-guid>"
export SNOWOPS_SANDBOX_TENANT_ID="<sandbox-tenant-guid>"
export SNOWOPS_AAD_BREAKGLASS_GROUP_OBJECT_ID="<aad-group-object-id>"
- Run the F3 integration test:
cd tests/terratest
go test -v -tags integration -timeout 90m ./modules/azure/... -run TestAKSSecureModule
- Watch for key milestones:
  - Plan: ~17 to add, 0 to change, 0 to destroy. This covers F2 (RG + hub vnet + 1 hub subnet + 1 spoke vnet + 2 spoke subnets + 2 NSGs + 2 NSG associations + 2 peerings + 1 Private DNS zone + 2 vnet links) + F3 (RG + AKS cluster + 1 user node pool) = ~17 resources.
  - azurerm_kubernetes_cluster.this: Still creating... [5m]. AKS provisioning typically takes 8-12 minutes.
  - azurerm_kubernetes_cluster_node_pool.user["apps"]: Creation complete after ~4m. The user pool comes up after the cluster.
  - All output assertions PASS, including the cluster_contract shape check (endpoint_is_private = true, endpoint ends in azmk8s.io).
  - Destroy complete! This is a clean teardown of the cluster + the AKS-owned node RG + the control-plane RG + F2.
Destroy timing. AKS destroy stretches to ~10 minutes (AKS cleans the node RG, then itself). The integration test uses a 90-minute timeout precisely so destroy has headroom.
Part D — kubectl + Workload Identity probe (optional, ~30 min)
Verifies the SnowOps end-to-end auth story: AAD-RBAC works, the cluster rejects local accounts, Workload Identity federates an in-cluster service account to an AAD app, and that pod can hit a token endpoint. Requires a jump host in the same vNet — the private cluster's API server is not reachable from your laptop.
- After Part C apply but before destroy, capture the cluster name + RG + private FQDN:
cd tests/terratest/fixtures/aks-secure
CLUSTER_NAME=$(terraform output -raw cluster_name)
AKS_RG=$(terraform output -raw cluster_resource_group_name)
NET_RG=$(terraform output -raw net_resource_group_name)
PRIVATE_FQDN=$(terraform output -raw private_fqdn)
OIDC_ISSUER=$(terraform output -raw oidc_issuer_url)
echo "cluster=$CLUSTER_NAME aks_rg=$AKS_RG net_rg=$NET_RG"
echo "private_fqdn=$PRIVATE_FQDN"
echo "oidc_issuer=$OIDC_ISSUER"
- Add your own AAD user / group to the cluster-admin group used at apply time (if not already a member). Re-activate PIM if required.
- Deploy a tiny Linux jump host into the F2 spoke workload subnet so you can reach the private API server:
SPOKE_VNET=$(terraform output -json spoke_subnet_ids | jq -r '."apps/workload"' | awk -F/ '{print $(NF-2)}')
az vm create --resource-group "$NET_RG" \
  --name "f3-jump-vm" \
  --image "Ubuntu2204" \
  --vnet-name "$SPOKE_VNET" \
  --subnet "workload" \
  --public-ip-address "" \
  --admin-username "snowops" \
  --generate-ssh-keys \
  --size "Standard_B2s"
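The awk expression in the jump-host step relies on the shape of an Azure subnet resource ID: when the ID is split on /, the VNet name is the third field from the end. A small illustration, using a made-up subscription ID, resource group, and VNet name (all placeholders, not values from the sandbox):

```shell
# Extract the VNet name from a subnet resource ID read from stdin.
# .../virtualNetworks/<vnet>/subnets/<subnet>  ->  <vnet> is field NF-2.
vnet_name_from_subnet_id() {
  awk -F/ '{print $(NF-2)}'
}

# Placeholder ID for illustration only.
sample_id="/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/example-net-rg/providers/Microsoft.Network/virtualNetworks/example-spoke-vnet/subnets/workload"
echo "$sample_id" | vnet_name_from_subnet_id
# prints: example-spoke-vnet
```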
- Install kubectl + the kubelogin AAD plugin on the jump host, then fetch the cluster's AAD-only kubeconfig and confirm kubectl get nodes returns the system + apps pool nodes:
az vm run-command invoke --resource-group "$NET_RG" --name "f3-jump-vm" \
  --command-id "RunShellScript" \
  --scripts "
set -euxo pipefail
curl -sLO 'https://dl.k8s.io/release/v1.29.4/bin/linux/amd64/kubectl'
chmod +x kubectl && sudo mv kubectl /usr/local/bin/
curl -sLO 'https://aka.ms/install-azure-kubelogin.sh' || true
# Install az CLI (skip if pre-installed)
sudo apt-get update -y && sudo apt-get install -y azure-cli || true
az aks install-cli >/dev/null 2>&1 || true
az login --identity || az login --use-device-code
az aks get-credentials --resource-group $AKS_RG --name $CLUSTER_NAME --overwrite-existing
kubelogin convert-kubeconfig -l azurecli
kubectl get nodes -o wide
"
Expected: 4 nodes (2 system + 2 user) Ready. Node OS column shows AzureLinux.
- Confirm that AAD-only auth is enforced. Fetch the legacy admin kubeconfig and assert the API server rejects it:
az aks get-credentials --admin --resource-group "$AKS_RG" --name "$CLUSTER_NAME" --overwrite-existing 2>&1 | head
Expected: a clear error like Cannot get admin credentials. ... local accounts are disabled on this cluster. If you get a kubeconfig back, F3's local_account_disabled precondition was bypassed; fail the runbook.
- Confirm a default-deny NetworkPolicy is in place. D4 will install one via Kyverno once the cluster is in scope, but the Calico CNI is what actually enforces it. Spot-check:
az vm run-command invoke --resource-group "$NET_RG" --name "f3-jump-vm" \
  --command-id "RunShellScript" \
  --scripts "
kubectl create namespace probe-ns
kubectl run nginx --image=nginx:1.27.0 -n probe-ns
kubectl get networkpolicy -n probe-ns
"
Expected: the Kyverno-generated default-deny NetworkPolicy is listed in probe-ns (after D4 is installed; if running before D4, manually kubectl apply a default-deny NetPol to confirm Calico enforces it).
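For the pre-D4 case, a default-deny policy can be applied by hand. The sketch below prints a standard Kubernetes NetworkPolicy that selects every pod in a namespace and allows no ingress or egress. The function name default_deny_manifest and the policy name default-deny are illustrative, and the policy that D4 generates via Kyverno may differ.

```shell
# Hypothetical helper: print a default-deny NetworkPolicy for the given namespace.
# An empty podSelector matches all pods; listing both policy types with no
# allow rules blocks all ingress and egress for those pods.
default_deny_manifest() {
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: ${1:-probe-ns}
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
EOF
}

# Usage, on the jump host:
#   default_deny_manifest probe-ns | kubectl apply -f -
```

With the policy applied, a connection to the nginx pod from another pod in probe-ns should time out if Calico is enforcing it.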
- Cleanup before the integration test's deferred terraform destroy fires:
Pass criteria
- Part A: terraform validate passes for the module + example
- Part B: full offline Terratest suite passes (15 top-level tests)
- Part C: TestAKSSecureModule integration test passes end-to-end
- Cluster created at SKU Standard, private endpoint ON
- az aks show --resource-group $AKS_RG --name $CLUSTER_NAME --query "apiServerAccessProfile.enablePrivateCluster" returns true
- ... --query "disableLocalAccounts" returns true
- ... --query "oidcIssuerProfile.enabled" returns true
- ... --query "securityProfile.workloadIdentity.enabled" returns true
- ... --query "networkProfile.networkPolicy" returns calico
- ... --query "networkProfile.networkPluginMode" returns overlay
- ... --query "aadProfile.managed" returns true
- ... --query "aadProfile.enableAzureRbac" returns true
- Diagnostic settings show 7 enabled categories + AllMetrics forwarding to the F1 workspace (when LAW supplied)
- Defender for Containers profile present (when LAW supplied)
- (Part D) kubectl get nodes returns 4 Ready nodes via AAD-only auth
- (Part D) Local admin kubeconfig fetch fails with local accounts are disabled
- (Part D) Calico NetworkPolicy CRD installed + default-deny enforced
- All Destroy calls complete without error
- No orphaned node RG remains: az group list -o table | grep f3-test returns nothing
- All test resources tagged ephemeral = true (X7 cleanup safety net)
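The az aks show checks in the list above can be run as one loop. This is a minimal sketch, assuming az is logged in to the sandbox subscription; the function name check_cluster_flags is hypothetical. It reads the private-cluster flag from apiServerAccessProfile.enablePrivateCluster, which is where az aks show reports it, and lowercases each value before comparing.

```shell
# Hypothetical helper: compare each `az aks show` query against its expected value.
# Returns non-zero if any check fails.
check_cluster_flags() {
  rg="$1"
  cluster="$2"
  failures=0
  while read -r query expected; do
    # -o tsv prints the bare value; lowercase it so True and true both match.
    actual=$(az aks show --resource-group "$rg" --name "$cluster" \
      --query "$query" -o tsv </dev/null | tr '[:upper:]' '[:lower:]')
    if [ "$actual" = "$expected" ]; then
      echo "PASS $query = $actual"
    else
      echo "FAIL $query: expected $expected, got '$actual'"
      failures=$((failures + 1))
    fi
  done <<'EOF'
apiServerAccessProfile.enablePrivateCluster true
disableLocalAccounts true
oidcIssuerProfile.enabled true
securityProfile.workloadIdentity.enabled true
networkProfile.networkPolicy calico
networkProfile.networkPluginMode overlay
aadProfile.managed true
aadProfile.enableAzureRbac true
EOF
  [ "$failures" -eq 0 ]
}

# Usage, while the cluster from Part C still exists:
#   check_cluster_flags "$AKS_RG" "$CLUSTER_NAME"
```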
Teardown
The integration test runs terraform destroy automatically. AKS destroy is
slow (~10 minutes — AKS cleans the node RG first, then itself). If a failure
mid-run orphans resources, clean up manually:
# Three RGs: <name_prefix>-net-rg (F2), <name_prefix>-aks-rg (F3 control
# plane), <name_prefix>-aks-rg-nodes (F3 node RG, AKS-managed).
az group delete --name "<name_prefix>-net-rg" --yes --no-wait
az group delete --name "<name_prefix>-aks-rg" --yes --no-wait
az group delete --name "<name_prefix>-aks-rg-nodes" --yes --no-wait
Node RG cleanup caveat. AKS normally cleans the node RG when the cluster is destroyed. If the cluster is force-deleted via az group delete before AKS gets a chance, the node RG is orphaned and the underlying VMSS/LB/disks remain, costing $0.40/hour until you delete it explicitly.
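Because az group delete with --no-wait returns immediately and the deletion cannot be undone, it can help to print the three commands for a given prefix and review them before running anything. A small sketch; print_rg_cleanup is a hypothetical helper and f3-test-example is a placeholder prefix, not a value from the sandbox.

```shell
# Hypothetical dry-run helper: print (do not run) the manual cleanup commands
# for the three resource groups derived from a name prefix.
print_rg_cleanup() {
  prefix="$1"
  for suffix in net-rg aks-rg aks-rg-nodes; do
    echo "az group delete --name \"${prefix}-${suffix}\" --yes --no-wait"
  done
}

# Placeholder prefix for illustration only.
print_rg_cleanup f3-test-example
```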
Sign-off
- Tester: _ | Date: _ | Result: PASS / FAIL / N/A
- Notes: