Manual Test Runbook — C3: aks-deploy reusable workflow
Owner: Sagar | Time: ~5 min (Parts A + B offline) · ~75 min (Parts C–F sandbox) | Sandbox: X1 (requires F3 AKS already up)
Promotes C3 (.github/workflows/aks-deploy.yml) from 🟦 Code Complete → 🟩 Shipped. Parts C–F require a sandbox AKS cluster (F3) with ArgoCD installed. F8 (the SnowOps ArgoCD bundle) is still ⬜, so this runbook includes an ad-hoc ArgoCD install via the upstream Helm chart; replace it with F8 once that lands. Cost: ~$0 incremental over the running F3 cluster.
Purpose
Validate that the C3 reusable workflow correctly:
- Parses cleanly and exposes the documented input/secret/output contract.
- Refuses to run when both image-set modes are set, or when neither is set.
- Logs into ArgoCD with a project-scoped token, applies a Kustomize image override, syncs the app, waits for Healthy.
- Probes a smoke URL until success or timeout.
- Performs a rollback drill: rolls back, re-smokes, rolls forward, re-smokes.
Prerequisites
- Sandbox subscription access (PIM activated).
- Sandbox F3 AKS cluster up; kubectl context pointed at it (az aks get-credentials …).
- Sandbox ArgoCD reachable from the GH Actions runner used for the test. Two viable paths:
  - Hosted runner: ArgoCD exposed via an internet-reachable LoadBalancer / ingress (sandbox only; DO NOT do this in client production).
  - Self-hosted runner: runner inside the F2 spoke network, ArgoCD on ClusterIP or an internal LB.
- An ArgoCD project-scoped API token captured into the test repo's ARGOCD_AUTH_TOKEN secret.
- Local tooling: kubectl, helm >= 3.14, argocd CLI matching the version pinned in the workflow (2.11.4 by default), gh, jq.
- Working directory: repo root.
- Env vars set:
export SANDBOX_AKS_RG="snowops-sandbox-aks-rg"
export SANDBOX_AKS_NAME="snowops-sandbox-aks-01"
export ARGOCD_NS="argocd"
export ARGOCD_SERVER="argocd-sandbox.example.com" # whatever your sandbox ingress is
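A quick pre-flight check can catch a missing tool or an outdated Helm before any sandbox time is spent. The helpers below are only a sketch of how the prerequisites above might be verified; the names version_ge and require_tools are hypothetical, and the version comparison assumes a sort that supports -V (GNU coreutils).

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers; not part of the C3 workflow itself.

# version_ge A B: succeeds when version A >= version B (relies on `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# require_tools NAME...: prints each tool missing from PATH and fails if any is absent.
require_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "${tool}" >/dev/null 2>&1; then
      echo "missing: ${tool}"
      missing=1
    fi
  done
  return "${missing}"
}

# Possible usage on the tester's machine:
#   require_tools kubectl helm argocd gh jq
#   helm_ver="$(helm version --template '{{.Version}}')"   # prints something like v3.x.y
#   version_ge "${helm_ver#v}" "3.14" || echo "helm is older than 3.14"
```

Keeping the checks as small functions makes it easy to rerun them after switching shells or machines.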
Steps
Part A — YAML + workflow lint (offline, ~2 min)
- Confirm the reusable workflow + caller template parse:
ruby -ryaml -e "YAML.load_file('.github/workflows/aks-deploy.yml')"
ruby -ryaml -e "YAML.load_file('templates/client-repo/.github/workflows/aks-deploy.yml')"
- Both commands exit 0 with no output.
- If actionlint is installed locally, run it:
actionlint .github/workflows/aks-deploy.yml \
templates/client-repo/.github/workflows/aks-deploy.yml
- No error or warning. (Optional; skip if actionlint is not on PATH.)
- Confirm pre-commit gates pass on the touched files:
pre-commit run --files \
.github/workflows/aks-deploy.yml \
templates/client-repo/.github/workflows/aks-deploy.yml \
pipelines/README.md \
tests/pipeline-integration/aks-deploy/README.md \
tests/pipeline-integration/aks-deploy/argocd-app.yaml \
tests/pipeline-integration/aks-deploy/app/base/configmap.yaml \
tests/pipeline-integration/aks-deploy/app/base/deployment.yaml \
tests/pipeline-integration/aks-deploy/app/base/kustomization.yaml \
tests/pipeline-integration/aks-deploy/app/base/service.yaml \
docs/runbooks/test/C3.md
- All hooks PASS.
Part B — contract self-check (offline, ~3 min)
- Re-read the C3 contract section and confirm:
  - Every documented input is present in .github/workflows/aks-deploy.yml's on.workflow_call.inputs block.
  - synced_revision + app_health outputs are declared at workflow level and wired to the deploy job outputs.
  - The "Validate image-set inputs" step is the first thing the job does (before the CLI download, before any auth).
  - Rollback drill steps are all gated on inputs.rollback_drill && steps.pre.outputs.prev_id != ''.
- Quick offline rejection probe. Exercise the workflow's validation logic locally against an ambiguous payload:
KUSTOMIZE_NAME="app" HELM_SET="image.fullName" bash -c '
if [[ -n "${KUSTOMIZE_NAME}" && -n "${HELM_SET}" ]]; then
echo "Both set — would fail"
exit 1
fi'
- Exits 1 and prints the "Both set" failure message (matches the workflow's behavior).
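The probe above covers only the ambiguous case, while the workflow is also expected to refuse when neither mode is set. A fuller sketch, assuming the validation reduces to an exactly-one-of check on the two mode inputs, exercises all three cases. The function name validate_image_set and its messages are illustrative; the workflow's real wording may differ.

```shell
#!/usr/bin/env bash
# Illustrative stand-in for the workflow's "Validate image-set inputs" step.
# Args: $1 = Kustomize image name, $2 = Helm set key (either may be empty).
validate_image_set() {
  local kustomize_name="${1:-}" helm_set="${2:-}"
  if [ -n "${kustomize_name}" ] && [ -n "${helm_set}" ]; then
    echo "both image-set modes set: would fail"
    return 1
  fi
  if [ -z "${kustomize_name}" ] && [ -z "${helm_set}" ]; then
    echo "no image-set mode set: would fail"
    return 1
  fi
  echo "exactly one mode set: would pass"
}

# Run all three cases; `|| true` keeps going after an expected failure.
validate_image_set "app" "image.fullName" || true
validate_image_set "" "" || true
validate_image_set "app" ""
```

Only the last call should return 0; the first two mirror the negative paths exercised against the real workflow in Part F.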
Part C — ad-hoc ArgoCD bootstrap (sandbox, ~20 min) — skip once F8 ships
- Install ArgoCD into the sandbox cluster. Replace this entire step with F8 once it lands.
az aks get-credentials --resource-group "${SANDBOX_AKS_RG}" --name "${SANDBOX_AKS_NAME}" --overwrite-existing
kubectl create namespace "${ARGOCD_NS}" --dry-run=client -o yaml | kubectl apply -f -
helm repo add argo https://argoproj.github.io/argo-helm
helm repo update
helm upgrade --install argocd argo/argo-cd \
--namespace "${ARGOCD_NS}" \
--version 7.6.10 \
--set server.service.type=LoadBalancer \
--set configs.params."server\.insecure"=true \
--wait
- All ArgoCD pods Ready (kubectl -n argocd get pods).
- kubectl -n argocd get svc argocd-server shows an external IP.
- Capture the admin password and mint a project-scoped API token:
ARGOCD_LB_IP="$(kubectl -n argocd get svc argocd-server -o jsonpath='{.status.loadBalancer.ingress[0].ip}')"
export ARGOCD_SERVER="${ARGOCD_LB_IP}"
admin_pw="$(kubectl -n argocd get secret argocd-initial-admin-secret \
-o jsonpath='{.data.password}' | base64 -d)"
argocd login "${ARGOCD_SERVER}" --insecure --grpc-web \
--username admin --password "${admin_pw}"
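# Register the snowops-deployer local account (apiKey + login) that C3 will
# authenticate as. Note: this patch only creates the account; any permissions
# it needs to sync Applications must come from the RBAC config (argocd-rbac-cm),
# which this step does not change.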
kubectl -n argocd patch configmap argocd-cm --type merge -p '{
"data": {
"accounts.snowops-deployer": "apiKey, login"
}
}'
kubectl -n argocd rollout restart deployment argocd-server
kubectl -n argocd rollout status deployment argocd-server --timeout=120s
argocd login "${ARGOCD_SERVER}" --insecure --grpc-web \
--username admin --password "${admin_pw}"
token="$(argocd account generate-token --account snowops-deployer)"
echo "${token}" # paste into the test repo's ARGOCD_AUTH_TOKEN secret
- Token printed; copy it into the test repo's ARGOCD_AUTH_TOKEN GitHub secret.
- argocd account get-user-info against the new token returns loggedIn: true.
Part D — happy-path deploy + smoke (sandbox, ~20 min)
- Install the reference Application:
# Substitute SNOWOPS_ORG with your actual GH org so ArgoCD can pull the fixture.
sed "s|SNOWOPS_ORG|<your-org>|" tests/pipeline-integration/aks-deploy/argocd-app.yaml \
| kubectl apply -f -
# Wait for the initial sync (will pull nginx:1.27-alpine since the
# kustomization.yaml override hasn't been touched yet).
argocd app wait c3-runbook --health --sync --timeout 300
- argocd app get c3-runbook reports Synced + Healthy.
- kubectl -n c3-runbook get pods shows 2 running replicas.
- Expose the service via port-forward for the smoke probe:
kubectl -n c3-runbook port-forward svc/c3-runbook-app 8888:80 &
PF_PID=$!
sleep 2
curl -sS http://127.0.0.1:8888/ -w 'status=%{http_code}\n'
- Body reads c3-runbook ok, status 200.
- kill ${PF_PID} when done.
- Configure a test caller workflow in a throwaway repo (or on a test branch). Use the templates/client-repo/.github/workflows/aks-deploy.yml shape but invoke C3 directly (skip the build job; we're using a public nginx, so no ACR push is needed). Trigger and watch:
  - "Validate image-set inputs" passes (Kustomize mode).
  - "Install ArgoCD CLI" step downloads the pinned 2.11.4.
  - "Verify ArgoCD auth" prints the deployer account info.
  - "Apply image override" succeeds (e.g., promoting to nginx:1.27.1-alpine if you want a real diff, or keep nginx:1.27-alpine for a no-op).
  - "Sync application" + "Wait for Healthy" both succeed within their timeouts.
  - "Smoke probe" step succeeds (status 200) when smoke_url points at the exposed service.
  - Step summary table renders with the synced revision + Healthy state.
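The fixed sleep 2 before the local curl can race the port-forward, and the workflow's smoke step is described as probing until success or timeout. A minimal retry loop in that spirit is sketched below. It is not the workflow's actual implementation; the function name smoke_probe and its defaults (60 s timeout, 2 s interval, 5 s per-request cap) are assumptions.

```shell
#!/usr/bin/env bash
# smoke_probe URL [TIMEOUT_S] [INTERVAL_S]
# Polls URL until it returns HTTP 200 or roughly TIMEOUT_S seconds of waiting
# have accumulated (time spent inside curl itself is not counted).
smoke_probe() {
  local url="$1" timeout="${2:-60}" interval="${3:-2}"
  local elapsed=0 code
  while :; do
    # -o /dev/null drops the body; -w prints only the status code.
    # On a connection failure curl prints 000, so the loop simply retries.
    code="$(curl -sS -o /dev/null -w '%{http_code}' --max-time 5 "${url}" 2>/dev/null || true)"
    if [ "${code}" = "200" ]; then
      echo "smoke ok after ${elapsed}s"
      return 0
    fi
    if [ "${elapsed}" -ge "${timeout}" ]; then
      echo "smoke failed: last status=${code:-none} after ${elapsed}s" >&2
      return 1
    fi
    sleep "${interval}"
    elapsed=$((elapsed + interval))
  done
}

# Possible local usage after starting the port-forward:
#   smoke_probe "http://127.0.0.1:8888/" 30 2
```

Polling instead of sleeping a fixed interval keeps the local check fast when the port-forward comes up quickly and tolerant when it does not.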
Part E — rollback drill (sandbox, ~15 min)
- Re-trigger the workflow with rollback_drill: true and a different image tag than the current revision (so there's a meaningful version to roll back from):
# In the caller workflow:
with:
  image_reference: "docker.io/library/nginx:1.27.1-alpine"
  image_kustomize_name: "app"
  rollback_drill: true
  smoke_url: "http://127.0.0.1:8888/" # or your forwarded URL
- Re-port-forward, dispatch, watch:
  - "Record pre-deploy revision for rollback drill" captures a non-empty prev_id.
  - After the initial deploy + smoke succeed, "Rollback drill — revert" runs argocd app rollback.
  - "Rollback drill — smoke prior version" probe returns 200 against the rolled-back deployment.
  - "Rollback drill — roll forward" re-applies the new image via sync.
  - "Rollback drill — smoke forward version" probe returns 200 against the rolled-forward deployment.
  - Final job status: success.
- Confirm history in ArgoCD:
- At least 3 revisions visible (initial install + first deploy + roll-forward).
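One way to check the revision count from the CLI is to count the IDs printed by argocd app history. This is a sketch that assumes the CLI is already logged in to the sandbox and that the -o id output format prints one history ID per line; the helper names count_history_ids and assert_min_history are hypothetical.

```shell
#!/usr/bin/env bash
# count_history_ids: reads one history ID per line on stdin and prints how many
# non-blank lines it saw.
count_history_ids() {
  # grep -c exits 1 when the count is 0; `|| true` keeps the function's status at 0.
  grep -c '[^[:space:]]' || true
}

# assert_min_history COUNT MIN: fails when COUNT is below MIN.
assert_min_history() {
  local count="$1" min="$2"
  if [ "${count}" -lt "${min}" ]; then
    echo "only ${count} history entries, expected at least ${min}" >&2
    return 1
  fi
  echo "history entries: ${count}"
}

# Possible usage against the sandbox (requires a logged-in argocd CLI):
#   n="$(argocd app history c3-runbook -o id | count_history_ids)"
#   assert_min_history "${n}" 3
```

The ArgoCD UI's history view shows the same information if the CLI is not at hand.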
Part F — negative paths + cleanup (~10 min)
- Failure-mode probe (both set). Set BOTH image-set modes in the caller and trigger:
  - Workflow fails at the "Validate image-set inputs" step with the documented error message.
- Failure-mode probe (neither set). Leave both empty and trigger:
  - Same step fails with the "Set exactly one of …" message.
- Cleanup:
Pass criteria
- Parts A + B pass cleanly (YAML, pre-commit, contract self-check, offline rejection probe).
- Part C: ArgoCD bootstrapped; project-scoped token issued.
- Part D: happy-path Kustomize deploy succeeds, smoke probe green.
- Part E: rollback drill exercises full revert + roll-forward with smoke at each stage.
- Part F: ambiguous + empty image-set inputs both fail at validation step with clear errors.
Sign-off
- Tester: ____ | Date: ____ | Result: PASS / FAIL / N/A
- Notes: