Manual Test Runbook — F3: AKS Secure

Owner: Sagar  |  Time: ~5 min (Parts A + B, offline) · +30 min Part C (sandbox apply, ~$10) · +30 min Part D (kubectl + Workload Identity probe, optional)  |  Sandbox: snowops-sandbox-01

Promotes F3 (modules/azure/aks-secure/) from 🟦 Code Complete → 🟩 Shipped. Part C costs ~$10 (4 D4ds_v5 VMs + public LB for ~30 min). Skip Part D if you don't need to verify the in-cluster auth path end-to-end.


Prerequisites

  • Sandbox subscription access active (PIM activated if required)
  • az login done; az account show confirms the sandbox subscription is selected
  • Identity has Contributor + Network Contributor + User Access Administrator on the sandbox subscription
  • Local tooling: terraform >= 1.6, go >= 1.22, az CLI >= 2.50, kubectl >= 1.28, jq
  • SNOWOPS_SANDBOX_SUBSCRIPTION_ID and SNOWOPS_SANDBOX_TENANT_ID env vars set
  • (Part C only) SNOWOPS_AAD_BREAKGLASS_GROUP_OBJECT_ID env var set — an AAD group object ID that should receive cluster-admin via AAD-RBAC. Create one in the sandbox tenant if needed: az ad group create --display-name "snowops-aks-admins" --mail-nickname "snowops-aks-admins".
  • Working directory: repo root
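The tooling and env-var prerequisites above can be checked before starting. The following is a minimal sketch, not part of the repo: version_ge and missing_env are hypothetical helper names, and the version comparison assumes a sort that supports -V (GNU coreutils, or a recent BSD/macOS sort).

```shell
# version_ge HAVE WANT: succeed when dotted version HAVE is >= WANT.
# Assumption: sort -V (version sort) is available on this machine.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# missing_env NAME...: print each named variable that is unset or empty.
missing_env() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}

# Example wiring (left commented out so the block has no side effects):
#   tf="$(terraform version -json | jq -r .terraform_version)"
#   version_ge "$tf" 1.6 || echo "terraform $tf is older than 1.6"
#   missing_env SNOWOPS_SANDBOX_SUBSCRIPTION_ID SNOWOPS_SANDBOX_TENANT_ID
```

An empty result from missing_env means every listed variable is set. Add SNOWOPS_AAD_BREAKGLASS_GROUP_OBJECT_ID to the list before Part C.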

Steps

Part A — terraform fmt + validate (offline, ~2 min)

  1. Confirm formatting + structural validity of the module on its own:
terraform -chdir=modules/azure/aks-secure fmt -check
terraform -chdir=modules/azure/aks-secure init -backend=false -input=false
terraform -chdir=modules/azure/aks-secure validate

Expected: Success! The configuration is valid.

  2. Confirm the basic example also passes:
terraform -chdir=modules/azure/aks-secure/examples/basic fmt -check
terraform -chdir=modules/azure/aks-secure/examples/basic init -backend=false -input=false
terraform -chdir=modules/azure/aks-secure/examples/basic validate

Expected: Success! The configuration is valid.

  3. Run the F3-relevant offline Terratest cases (the subshell leaves your shell at the repo root for the next step):
(cd tests/terratest && go test -v -timeout 5m ./modules/azure/... \
  -run 'TestAKSSecureValidate|TestF3ClusterContractConformance|TestContractsRejectBadLiterals')

Expected: 3 top-level tests pass; the cluster-missing-endpoint sub-test under TestContractsRejectBadLiterals is the F3-relevant negative case.
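The fmt, init, and validate sequence from the first two steps is identical for the module and its example, so it can be wrapped once. This is a sketch only; tf_offline_check is a hypothetical helper name, and it uses exactly the terraform flags shown above.

```shell
# tf_offline_check DIR: run the offline fmt/init/validate sequence for one
# directory, stopping at the first command that fails.
tf_offline_check() {
  dir="$1"
  terraform -chdir="$dir" fmt -check &&
    terraform -chdir="$dir" init -backend=false -input=false &&
    terraform -chdir="$dir" validate
}

# Example wiring (commented out so the block has no side effects):
#   for d in modules/azure/aks-secure modules/azure/aks-secure/examples/basic; do
#     tf_offline_check "$d" || { echo "offline check failed in $d"; break; }
#   done
```

Chaining with && means a formatting failure stops the run before init or validate is attempted for that directory.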


Part B — full Terratest suite (offline, ~3 min)

  1. Run the whole offline suite to confirm F3 hasn't regressed any earlier module:
(cd tests/terratest && go test -v -timeout 10m ./...)

Expected: 15 top-level tests pass (TestNoopHarness, TestBaselineValidate, TestStateBackendValidate, TestSandboxValidate, TestF1ContractConformance, TestF6ObjectStoreContractConformance, TestF2NetworkContractConformance, TestNetworkHubValidate, TestACRValidate, TestF4RegistryContractConformance, TestKeyVaultValidate, TestF5KVContractConformance, TestAKSSecureValidate, TestF3ClusterContractConformance, TestContractsRejectBadLiterals with 8 sub-tests).

  2. Sanity-check the integration test compiles even without running it:
(cd tests/terratest && go vet -tags integration ./... && go build -tags integration ./...)

Expected: no errors.
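The expected count of 15 top-level tests can be checked mechanically rather than by eye. The sketch below assumes the usual go test -v layout, where top-level results start at column 0 and sub-test results are indented; count_top_level is a hypothetical helper name.

```shell
# count_top_level STATUS: read go test -v output on stdin and count the
# top-level result lines with the given status (PASS or FAIL).
# Indented sub-test result lines are deliberately not matched.
count_top_level() {
  grep -c "^--- $1: " || true
}

# Example wiring (commented out so the block has no side effects):
#   (cd tests/terratest && go test -v -timeout 10m ./...) > /tmp/offline.log 2>&1
#   count_top_level PASS < /tmp/offline.log   # compare against 15
#   count_top_level FAIL < /tmp/offline.log   # should be 0
```

The log path /tmp/offline.log is only an example location.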


Part C — integration test (real Azure apply + destroy, ~30 min, ~$10)

Skip if iterating on offline changes only. Cost is dominated by the four Standard_D4ds_v5 VMs that the system + apps pools spin up.

  1. Export sandbox env vars + the break-glass group object ID:
export SNOWOPS_SANDBOX_SUBSCRIPTION_ID="<sandbox-subscription-guid>"
export SNOWOPS_SANDBOX_TENANT_ID="<sandbox-tenant-guid>"
export SNOWOPS_AAD_BREAKGLASS_GROUP_OBJECT_ID="<aad-group-object-id>"
  2. Run the F3 integration test:
(cd tests/terratest && go test -v -tags integration -timeout 90m ./modules/azure/... -run TestAKSSecureModule)
  3. Watch for key milestones:
      • Plan: ~17 to add, 0 to change, 0 to destroy. This covers F2 (RG + hub vnet + 1 hub subnet + 1 spoke vnet + 2 spoke subnets + 2 NSGs + 2 NSG associations + 2 peerings + 1 Private DNS zone + 2 vnet links) plus F3 (RG + AKS cluster + 1 user node pool). Itemized, that is 15 + 3 = 18 resources, so treat the plan count as approximate.
      • azurerm_kubernetes_cluster.this: Still creating... [5m]. AKS provisioning typically takes 8-12 minutes.
      • azurerm_kubernetes_cluster_node_pool.user["apps"]: Creation complete after ~4m. The user pool comes up after the cluster.
      • All output assertions PASS, including the cluster_contract shape check (endpoint_is_private = true, endpoint ends in azmk8s.io).
      • Destroy complete! Clean teardown of the cluster + the AKS-owned node RG + the control-plane RG + F2.
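The Plan milestone can be pulled out of a saved test log instead of being spotted while scrolling. This is a sketch under two assumptions: the log has no ANSI colour codes, and the Plan summary may be preceded by a logger prefix, which is why the pattern is not anchored to the start of the line. plan_counts and plan_is_create_only are hypothetical helper names.

```shell
# plan_counts: read log text on stdin and print "ADD CHANGE DESTROY" taken
# from the first Terraform plan summary line found.
plan_counts() {
  sed -n 's/.*Plan: \([0-9][0-9]*\) to add, \([0-9][0-9]*\) to change, \([0-9][0-9]*\) to destroy.*/\1 \2 \3/p' |
    head -n 1
}

# plan_is_create_only: succeed only when a plan summary was found and it
# reports 0 to change and 0 to destroy.
plan_is_create_only() {
  set -- $(plan_counts)
  [ "$#" -eq 3 ] && [ "$2" -eq 0 ] && [ "$3" -eq 0 ]
}

# Example wiring (commented out; /tmp/f3-integration.log is an example path):
#   plan_counts < /tmp/f3-integration.log
#   plan_is_create_only < /tmp/f3-integration.log || echo "unexpected plan shape"
```

A fresh sandbox apply should only add resources, so a non-zero change or destroy count suggests leftover state from an earlier run.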

Destroy timing. AKS destroy stretches to ~10 minutes (AKS cleans the node RG, then itself). The integration test uses a 90-minute timeout precisely so destroy has headroom.


Part D — kubectl + Workload Identity probe (optional, ~30 min)

Verifies the SnowOps end-to-end auth story: AAD-RBAC works, the cluster rejects local accounts, Workload Identity federates an in-cluster service account to an AAD app, and a pod running as that service account can reach a token endpoint. Requires a jump host in the same VNet, because the private cluster's API server is not reachable from your laptop.

  1. After Part C apply but before destroy, capture the cluster name + RG + private FQDN:
cd tests/terratest/fixtures/aks-secure
CLUSTER_NAME=$(terraform output -raw cluster_name)
AKS_RG=$(terraform output -raw cluster_resource_group_name)
NET_RG=$(terraform output -raw net_resource_group_name)
PRIVATE_FQDN=$(terraform output -raw private_fqdn)
OIDC_ISSUER=$(terraform output -raw oidc_issuer_url)
echo "cluster=$CLUSTER_NAME aks_rg=$AKS_RG net_rg=$NET_RG"
echo "private_fqdn=$PRIVATE_FQDN"
echo "oidc_issuer=$OIDC_ISSUER"
  2. Add your own AAD user / group to the cluster-admin group used at apply time (if not already a member). Re-activate PIM if required.

  3. Deploy a tiny Linux jump host into the F2 spoke workload subnet so you can reach the private API server:

    SPOKE_VNET=$(terraform output -json spoke_subnet_ids | jq -r '."apps/workload"' | awk -F/ '{print $(NF-2)}')
    
    az vm create --resource-group "$NET_RG" \
      --name "f3-jump-vm" \
      --image "Ubuntu2204" \
      --vnet-name "$SPOKE_VNET" \
      --subnet "workload" \
      --public-ip-address "" \
      --admin-username "snowops" \
      --generate-ssh-keys \
      --size "Standard_B2s"
    
  4. Install kubectl + the kubelogin AAD plugin on the jump host, then fetch the cluster's AAD-only kubeconfig and confirm kubectl get nodes returns the system + apps pool nodes:

    az vm run-command invoke --resource-group "$NET_RG" --name "f3-jump-vm" \
      --command-id "RunShellScript" \
      --scripts "
        set -euxo pipefail
        # Install az CLI (skip if pre-installed)
        sudo apt-get update -y && sudo apt-get install -y azure-cli || true
        # az aks install-cli fetches both kubectl (pinned to the version used here) and kubelogin
        az aks install-cli --client-version 1.29.4
        az login --identity || az login --use-device-code
        az aks get-credentials --resource-group $AKS_RG --name $CLUSTER_NAME --overwrite-existing
        kubelogin convert-kubeconfig -l azurecli
        kubectl get nodes -o wide
      "
    

    Expected: 4 nodes (2 system + 2 user) Ready. Node OS column shows AzureLinux.

  5. Confirm AAD-only auth is enforced: request the legacy admin kubeconfig and assert that the request is refused:

    az aks get-credentials --admin --resource-group "$AKS_RG" --name "$CLUSTER_NAME" --overwrite-existing 2>&1 | head
    

    Expected: a clear error like Cannot get admin credentials. ... local accounts are disabled on this cluster. If you get a kubeconfig back, F3's local_account_disabled precondition was bypassed — fail the runbook.

  6. Confirm a default-deny NetworkPolicy is in place. D4 will install one via Kyverno once the cluster is in scope, but the Calico network policy engine is what actually enforces it. Spot-check:

    az vm run-command invoke --resource-group "$NET_RG" --name "f3-jump-vm" \
      --command-id "RunShellScript" \
      --scripts "
        kubectl create namespace probe-ns
        kubectl run nginx --image=nginx:1.27.0 -n probe-ns
        kubectl get networkpolicy -n probe-ns
      "
    

    Expected: the Kyverno-generated default-deny NetworkPolicy is listed in probe-ns (after D4 is installed; if running before D4, manually kubectl apply a default-deny NetPol to confirm Calico enforces it).

  7. Clean up before the integration test's deferred terraform destroy fires:

    az vm delete --resource-group "$NET_RG" --name "f3-jump-vm" --yes
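The jump-host step derives the spoke VNet name by counting fields from the end of the subnet resource ID. A slightly more defensive variant looks the segment up by its key instead of its position. This is a sketch; arm_id_segment is a hypothetical helper name, and it assumes the key appears in the ID with the exact casing you pass in.

```shell
# arm_id_segment ID KEY: print the path segment that follows KEY in an Azure
# resource ID, e.g. the VNet name that follows "virtualNetworks".
arm_id_segment() {
  printf '%s\n' "$1" |
    awk -F/ -v key="$2" '{ for (i = 1; i < NF; i++) if ($i == key) { print $(i + 1); exit } }'
}

# Example wiring (commented out so the block has no side effects):
#   SUBNET_ID=$(terraform output -json spoke_subnet_ids | jq -r '."apps/workload"')
#   SPOKE_VNET=$(arm_id_segment "$SUBNET_ID" virtualNetworks)
```

The same helper can read the subnet name (key subnets) or the resource group (key resourceGroups) from the same ID.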
    

Pass criteria

  • Part A — terraform validate passes for the module + example
  • Part B — full offline Terratest suite passes (15 top-level tests)
  • Part C — TestAKSSecureModule integration test passes end-to-end
  • Cluster created at SKU Standard, private endpoint ON
  • az aks show --resource-group $AKS_RG --name $CLUSTER_NAME --query "apiServerAccessProfile.enablePrivateCluster" returns true
  • ... --query "disableLocalAccounts" returns true
  • ... --query "oidcIssuerProfile.enabled" returns true
  • ... --query "securityProfile.workloadIdentity.enabled" returns true
  • ... --query "networkProfile.networkPolicy" returns calico
  • ... --query "networkProfile.networkPluginMode" returns overlay
  • ... --query "aadProfile.managed" returns true
  • ... --query "aadProfile.enableAzureRbac" returns true
  • Diagnostic settings show 7 enabled categories + AllMetrics forwarding to the F1 workspace (when LAW supplied)
  • Defender for Containers profile present (when LAW supplied)
  • (Part D) kubectl get nodes returns 4 Ready nodes via AAD-only auth
  • (Part D) Local admin kubeconfig fetch fails with local accounts are disabled
  • (Part D) Calico NetworkPolicy CRD installed + default-deny enforced
  • All Destroy calls complete without error
  • No orphaned node RG remains — az group list -o table | grep f3-test returns nothing
  • All test resources tagged ephemeral = true (X7 cleanup safety net)
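The abbreviated az aks show queries in the list above can be run in one pass. The sketch below covers the seven queries from disableLocalAccounts through aadProfile.enableAzureRbac. It relies on AKS_RG and CLUSTER_NAME being set as in Part D and assumes -o tsv renders booleans as lowercase true/false. check_setting, aks_query, and verify_cluster_settings are hypothetical helper names.

```shell
# check_setting NAME EXPECTED ACTUAL: report one comparison; non-zero on mismatch.
check_setting() {
  if [ "$3" = "$2" ]; then
    printf 'PASS %s = %s\n' "$1" "$3"
  else
    printf 'FAIL %s: expected %s, got %s\n' "$1" "$2" "$3"
    return 1
  fi
}

# aks_query PATH: read a single field from az aks show as plain text.
# stdin is redirected so az cannot consume the here-document below.
aks_query() {
  az aks show --resource-group "$AKS_RG" --name "$CLUSTER_NAME" \
    --query "$1" -o tsv </dev/null
}

# verify_cluster_settings: check every "query expected" pair listed below and
# return non-zero if any of them mismatches.
verify_cluster_settings() {
  rc=0
  while read -r query expected; do
    check_setting "$query" "$expected" "$(aks_query "$query")" || rc=1
  done <<'EOF'
disableLocalAccounts true
oidcIssuerProfile.enabled true
securityProfile.workloadIdentity.enabled true
networkProfile.networkPolicy calico
networkProfile.networkPluginMode overlay
aadProfile.managed true
aadProfile.enableAzureRbac true
EOF
  return "$rc"
}
```

Run verify_cluster_settings after the Part C apply and before destroy. Every line should start with PASS.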

Teardown

The integration test runs terraform destroy automatically. AKS destroy is slow (~10 minutes — AKS cleans the node RG first, then itself). If a failure mid-run orphans resources, clean up manually:

# Three RGs: <name_prefix>-net-rg (F2), <name_prefix>-aks-rg (F3 control
# plane), <name_prefix>-aks-rg-nodes (F3 node RG, AKS-managed).
az group delete --name "<name_prefix>-net-rg" --yes --no-wait
az group delete --name "<name_prefix>-aks-rg" --yes --no-wait
az group delete --name "<name_prefix>-aks-rg-nodes" --yes --no-wait

Node RG cleanup caveat. AKS normally cleans the node RG when the cluster is destroyed. If the cluster is force-deleted via az group delete before AKS gets a chance, the node RG is orphaned and the underlying VMSS/LB/disks remain — costing $0.40/hour until you delete it explicitly.
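One cautious way to sequence the manual cleanup is to delete the control-plane RG first and wait for it, then remove the node RG only if it still exists, and the network RG last. This is a sketch under those assumptions, not the repo's tooling. teardown_order and delete_if_present are hypothetical helper names, and the RG suffixes are the ones listed in the comment above.

```shell
# teardown_order PREFIX: print the three RG names in a cautious deletion
# order: control plane first, then the AKS-managed node RG, then network.
teardown_order() {
  printf '%s\n' "$1-aks-rg" "$1-aks-rg-nodes" "$1-net-rg"
}

# delete_if_present RG: delete a resource group only if it still exists.
# Runs without --no-wait so each deletion finishes before the next starts.
delete_if_present() {
  if [ "$(az group exists --name "$1")" = "true" ]; then
    az group delete --name "$1" --yes
  else
    echo "skip $1 (already gone)"
  fi
}

# Example wiring (commented out; replace <name_prefix> with the real prefix):
#   for rg in $(teardown_order "<name_prefix>"); do
#     delete_if_present "$rg"
#   done
```

Waiting on each deletion is slower than --no-wait, but it gives AKS the chance to clean the node RG itself before the explicit check runs.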


Sign-off

  • Tester: _  |  Date: _  |  Result: PASS / FAIL / N/A
  • Notes: