Manual Test Runbook — C3: aks-deploy reusable workflow

Owner: Sagar | Time: ~5 min (Parts A + B offline) · ~75 min (Parts C–F sandbox) | Sandbox: X1 (requires F3 AKS already up)

Promotes C3 (.github/workflows/aks-deploy.yml) from 🟦 Code Complete → 🟩 Shipped. Parts C–F require a sandbox AKS cluster (F3) with ArgoCD installed. F8 (the SnowOps ArgoCD bundle) is still ⬜, so this runbook includes an ad-hoc ArgoCD install via the upstream Helm chart — replace with F8 once it lands. Cost: ~$0 incremental over the running F3 cluster.

Purpose

Validate that the C3 reusable workflow correctly:

Parses cleanly and exposes the documented input/secret/output contract.
Refuses to run when both image-set modes are set, or when neither is set.
Logs into ArgoCD with a project-scoped token, applies a Kustomize image override, syncs the app, waits for Healthy.
Probes a smoke URL until success or timeout.
Performs a rollback drill: rolls back, re-smokes, rolls forward, re-smokes.

Prerequisites

Sandbox subscription access (PIM activated).
Sandbox F3 AKS cluster up; kubectl context pointed at it (az aks get-credentials …).
Sandbox ArgoCD reachable from the GH Actions runner used for the test. Two viable paths:
Hosted runner: ArgoCD exposed via an internet-reachable LoadBalancer / ingress (sandbox-only — DO NOT do this in client production).
Self-hosted runner: Runner inside the F2 spoke network, ArgoCD on ClusterIP or internal LB.
An ArgoCD project-scoped API token captured into the test repo's ARGOCD_AUTH_TOKEN secret.
Local tooling: kubectl, helm >= 3.14, argocd CLI matching the version pinned in the workflow (2.11.4 by default), gh, jq.
Working directory: repo root.
Env vars set:

export SANDBOX_AKS_RG="snowops-sandbox-aks-rg"
export SANDBOX_AKS_NAME="snowops-sandbox-aks-01"
export ARGOCD_NS="argocd"
export ARGOCD_SERVER="argocd-sandbox.example.com"   # whatever your sandbox ingress is
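
The tooling and env var prerequisites above can be checked mechanically before starting. This is a minimal sketch, not something shipped in the repo: the helper names `require_tools` and `require_env` are hypothetical, and the version constraints (helm >= 3.14, the pinned argocd CLI) still need a manual check.

```shell
# Hypothetical preflight helpers; the names are illustrative, not from the repo.

# require_tools TOOL...: report every tool that is not on PATH.
require_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "${tool}" >/dev/null 2>&1; then
      echo "missing tool: ${tool}"
      missing=1
    fi
  done
  return "${missing}"
}

# require_env VAR...: report every variable that is not exported or is empty.
require_env() {
  local missing=0 var
  for var in "$@"; do
    if [ -z "$(printenv "${var}")" ]; then
      echo "unset env var: ${var}"
      missing=1
    fi
  done
  return "${missing}"
}

# Example (run from the repo root after the exports above):
#   require_tools kubectl helm argocd gh jq
#   require_env SANDBOX_AKS_RG SANDBOX_AKS_NAME ARGOCD_NS ARGOCD_SERVER
```

Both helpers print every problem they find and return non-zero if anything is missing, so a single run surfaces the full list.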

Steps

Part A — YAML + workflow lint (offline, ~2 min)

Confirm the reusable workflow + caller template parse:

ruby -ryaml -e "YAML.load_file('.github/workflows/aks-deploy.yml')"
ruby -ryaml -e "YAML.load_file('templates/client-repo/.github/workflows/aks-deploy.yml')"

Both commands exit 0 with no output.
If actionlint is installed locally, run it:

actionlint .github/workflows/aks-deploy.yml \
           templates/client-repo/.github/workflows/aks-deploy.yml

No errors or warnings are reported. (Optional; skip if actionlint is not on PATH.)
Confirm pre-commit gates pass on the touched files:

pre-commit run --files \
  .github/workflows/aks-deploy.yml \
  templates/client-repo/.github/workflows/aks-deploy.yml \
  pipelines/README.md \
  tests/pipeline-integration/aks-deploy/README.md \
  tests/pipeline-integration/aks-deploy/argocd-app.yaml \
  tests/pipeline-integration/aks-deploy/app/base/configmap.yaml \
  tests/pipeline-integration/aks-deploy/app/base/deployment.yaml \
  tests/pipeline-integration/aks-deploy/app/base/kustomization.yaml \
  tests/pipeline-integration/aks-deploy/app/base/service.yaml \
  docs/runbooks/test/C3.md

All hooks PASS.

Part B — contract self-check (offline, ~3 min)

Re-read the C3 contract section and confirm:
Every documented input is present in .github/workflows/aks-deploy.yml's on.workflow_call.inputs block.
The synced_revision and app_health outputs are declared at workflow level and wired to the deploy job's outputs.
The Validate image-set inputs step is the first step of the job (before the CLI download and before any auth).
All rollback drill steps are gated on inputs.rollback_drill && steps.pre.outputs.prev_id != ''.
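
To make the first check less error-prone, the declared input names can be listed and compared against the contract section. The sketch below is a text heuristic, not a YAML parser: `list_workflow_inputs` is a hypothetical helper, and it assumes conventional two-space indentation (`inputs:` at 4 spaces, each input name at 6). It will also pick up any other `inputs:` block at that depth.

```shell
# list_workflow_inputs FILE: print the names declared under an "inputs:"
# block at 4-space depth (e.g. on.workflow_call.inputs), one per line.
# Assumes two-space YAML indentation; this is a heuristic, not a parser.
list_workflow_inputs() {
  awk '
    /^ *(#|$)/ { next }                  # skip comments and blank lines
    {
      match($0, /^ */)                   # measure leading spaces
      indent = RLENGTH
    }
    indent == 4 && $1 == "inputs:" { in_inputs = 1; next }
    in_inputs && indent <= 4       { in_inputs = 0 }
    in_inputs && indent == 6 && $1 ~ /:$/ {
      name = $1
      sub(/:$/, "", name)
      print name
    }
  ' "$1"
}

# Example:
#   list_workflow_inputs .github/workflows/aks-deploy.yml
```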
Quick offline rejection probe: locally replay the workflow's validation logic against an ambiguous payload (both image-set modes set):

KUSTOMIZE_NAME="app" HELM_SET="image.fullName" bash -c '
  if [[ -n "${KUSTOMIZE_NAME}" && -n "${HELM_SET}" ]]; then
    echo "Both set — would fail"
    exit 1
  fi'

Exits 1 with Both set — would fail (matches the workflow's behavior).
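
The probe above only exercises the "both set" case, while the workflow is also expected to reject the "neither set" case. A minimal sketch covering both, assuming the validation is a plain exactly-one-of check: `validate_image_set` is a hypothetical name and the messages are placeholders, not the workflow's actual error text.

```shell
# validate_image_set KUSTOMIZE_NAME HELM_SET
# Returns 0 only when exactly one of the two values is non-empty.
# Messages are placeholders, not the workflow's real error strings.
validate_image_set() {
  local kustomize_name="${1:-}" helm_set="${2:-}"
  if [ -n "${kustomize_name}" ] && [ -n "${helm_set}" ]; then
    echo "Both set: would fail"
    return 1
  fi
  if [ -z "${kustomize_name}" ] && [ -z "${helm_set}" ]; then
    echo "Neither set: would fail"
    return 1
  fi
  echo "Exactly one set: would pass"
}

# Examples:
#   validate_image_set "app" "image.fullName"   # rejected (both set)
#   validate_image_set "" ""                    # rejected (neither set)
#   validate_image_set "app" ""                 # accepted (Kustomize mode)
```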

Part C — ad-hoc ArgoCD bootstrap (sandbox, ~20 min) — skip once F8 ships

Install ArgoCD into the sandbox cluster. Replace this entire part with F8 once it lands.

az aks get-credentials --resource-group "${SANDBOX_AKS_RG}" --name "${SANDBOX_AKS_NAME}" --overwrite-existing

kubectl create namespace "${ARGOCD_NS}" --dry-run=client -o yaml | kubectl apply -f -

helm repo add argo https://argoproj.github.io/argo-helm
helm repo update
helm upgrade --install argocd argo/argo-cd \
  --namespace "${ARGOCD_NS}" \
  --version 7.6.10 \
  --set server.service.type=LoadBalancer \
  --set configs.params."server\.insecure"=true \
  --wait

All ArgoCD pods are Ready (kubectl -n "${ARGOCD_NS}" get pods).
kubectl -n "${ARGOCD_NS}" get svc argocd-server shows an external IP.
Capture the admin password and mint a project-scoped API token:

ARGOCD_LB_IP="$(kubectl -n "${ARGOCD_NS}" get svc argocd-server -o jsonpath='{.status.loadBalancer.ingress[0].ip}')"
export ARGOCD_SERVER="${ARGOCD_LB_IP}"

admin_pw="$(kubectl -n "${ARGOCD_NS}" get secret argocd-initial-admin-secret \
  -o jsonpath='{.data.password}' | base64 -d)"

argocd login "${ARGOCD_SERVER}" --insecure --grpc-web \
  --username admin --password "${admin_pw}"

# Create the local `snowops-deployer` account that C3 will authenticate as,
# with the apiKey (token) and login capabilities. This patch only registers
# the account; it does not grant it any RBAC permissions.
kubectl -n "${ARGOCD_NS}" patch configmap argocd-cm --type merge -p '{
  "data": {
    "accounts.snowops-deployer": "apiKey, login"
  }
}'
kubectl -n "${ARGOCD_NS}" rollout restart deployment argocd-server
kubectl -n "${ARGOCD_NS}" rollout status deployment argocd-server --timeout=120s

argocd login "${ARGOCD_SERVER}" --insecure --grpc-web \
  --username admin --password "${admin_pw}"

token="$(argocd account generate-token --account snowops-deployer)"
echo "${token}"   # paste into the test repo's ARGOCD_AUTH_TOKEN secret

Token printed; copy it into the test repo's ARGOCD_AUTH_TOKEN GitHub secret.
argocd account get-user-info against the new token returns loggedIn: true.
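
Registering the account in argocd-cm does not by itself let it manage Applications: a local account only receives what the policy.csv key of the argocd-rbac-cm ConfigMap (or policy.default) allows. The sketch below renders policy lines that scope snowops-deployer to applications in the default project. The helper `render_rbac_policy` and the role name `role:c3-deployer` are hypothetical, and the action list (get, update, sync) is an assumption about what C3's steps need; adjust it against the workflow before relying on it.

```shell
# render_rbac_policy ACCOUNT PROJECT: print policy.csv lines that let
# ACCOUNT get, update, and sync applications in PROJECT only.
# "role:c3-deployer" is a hypothetical placeholder role name.
render_rbac_policy() {
  local account="$1" project="$2" action
  for action in get update sync; do
    echo "p, role:c3-deployer, applications, ${action}, ${project}/*, allow"
  done
  echo "g, ${account}, role:c3-deployer"
}

# Example (not run automatically): print the policy, then merge it into the
# policy.csv key of argocd-rbac-cm by whatever mechanism manages that
# ConfigMap in your sandbox.
#   render_rbac_policy snowops-deployer default
```

If the happy-path run in Part D fails with a permission-denied error from ArgoCD, a missing or too-narrow policy here is the first thing to check.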

Part D — happy-path deploy + smoke (sandbox, ~20 min)

Install the reference Application:

# Substitute SNOWOPS_ORG with your actual GH org so ArgoCD can pull the fixture.
sed "s|SNOWOPS_ORG|<your-org>|" tests/pipeline-integration/aks-deploy/argocd-app.yaml \
  | kubectl apply -f -

# Wait for the initial sync (will pull nginx:1.27-alpine since the
# kustomization.yaml override hasn't been touched yet).
argocd app wait c3-runbook --health --sync --timeout 300

argocd app get c3-runbook reports Synced + Healthy.
kubectl -n c3-runbook get pods shows 2 running replicas.
Expose the service via port-forward for the smoke probe:

kubectl -n c3-runbook port-forward svc/c3-runbook-app 8888:80 &
PF_PID=$!
sleep 2
curl -sS http://127.0.0.1:8888/ -w 'status=%{http_code}\n'

Body reads c3-runbook ok, status 200.
kill ${PF_PID} when done.
Configure a test caller workflow in a throwaway repo (or on a test branch). Use the shape of templates/client-repo/.github/workflows/aks-deploy.yml, but invoke C3 directly and skip the build job: the fixture uses the public nginx image, so no ACR push is needed. Trigger the run and watch it:
```
gh workflow run deploy.yml
gh run watch
```
- Validate image-set inputs passes (Kustomize mode).
- Install ArgoCD CLI step downloads the pinned 2.11.4.
- Verify ArgoCD auth prints the deployer account info.
- Apply image override succeeds (e.g., promoting to nginx:1.27.1-alpine if you want a real diff — or keep nginx:1.27-alpine for a no-op).
- Sync application + Wait for Healthy both succeed within their timeouts.
- Smoke probe step succeeds (status 200) when smoke_url points at an address for the exposed service that the runner can reach.
- Step summary table renders with the synced revision + Healthy state.
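
The smoke probe's "retry until success or timeout" behavior can be reproduced locally when debugging a failing run. This is a generic sketch under stated assumptions, not the workflow's actual step: `probe_until_ok` is a hypothetical helper, and the 60s/5s values in the example are arbitrary.

```shell
# probe_until_ok TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND...
# Re-runs COMMAND until it exits 0 or TIMEOUT_SECONDS have elapsed.
probe_until_ok() {
  local timeout="$1" interval="$2" start now
  shift 2
  start="$(date +%s)"
  while true; do
    if "$@"; then
      return 0
    fi
    now="$(date +%s)"
    if [ $((now - start)) -ge "${timeout}" ]; then
      echo "probe timed out after ${timeout}s"
      return 1
    fi
    sleep "${interval}"
  done
}

# Example: probe the port-forwarded service for up to 60s, every 5s.
# curl -f makes HTTP error statuses count as failures.
#   probe_until_ok 60 5 curl -fsS -o /dev/null http://127.0.0.1:8888/
```

Taking the probe as a command rather than a URL keeps the loop reusable, for example to wait for the port-forward to come up instead of relying on a fixed sleep.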

Part E — rollback drill (sandbox, ~15 min)

Re-trigger the workflow with rollback_drill: true and a different image tag than the current revision (so there's a meaningful version to roll back from):
```
# In the caller workflow:
with:
  image_reference: "docker.io/library/nginx:1.27.1-alpine"
  image_kustomize_name: "app"
  rollback_drill: true
  smoke_url: "http://127.0.0.1:8888/"   # or your forwarded URL
```
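Re-establish the port-forward, dispatch the workflow, and watch the run. A 127.0.0.1 smoke_url resolves on the runner itself, so it only works when the port-forward runs on the same machine as a self-hosted runner; a hosted runner needs an address it can reach.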
- Record pre-deploy revision for rollback drill captures a non-empty prev_id.
- After the initial deploy + smoke succeed, Rollback drill — revert runs argocd app rollback.
- Rollback drill — smoke prior version probe returns 200 against the rolled-back deployment.
- Rollback drill — roll forward re-applies the new image via sync.
- Rollback drill — smoke forward version probe returns 200 against the rolled-forward deployment.
- Final job status: success.
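
For debugging a failed drill outside the workflow, the same sequence can be replayed by hand. The sketch below is an approximation of the steps listed above, not the workflow's implementation: `rollback_drill` and the `ARGOCD_BIN` override are hypothetical, and the 300s timeouts are assumptions. ArgoCD has historically refused rollbacks for applications with automated sync enabled, so check the fixture's syncPolicy if the revert step fails.

```shell
# rollback_drill APP PREV_ID SMOKE_COMMAND...
# Reverts APP to history ID PREV_ID, smokes it, rolls forward via sync,
# and smokes again. Skips everything when PREV_ID is empty, mirroring the
# prev_id != '' gate. Set ARGOCD_BIN to a stub to dry-run the sequence.
rollback_drill() {
  local app="$1" prev_id="$2"
  shift 2
  if [ -z "${prev_id}" ]; then
    echo "No prior revision recorded; skipping drill"
    return 0
  fi
  "${ARGOCD_BIN:-argocd}" app rollback "${app}" "${prev_id}" || return 1
  "${ARGOCD_BIN:-argocd}" app wait "${app}" --health --timeout 300 || return 1
  "$@" || return 1    # smoke the prior version
  "${ARGOCD_BIN:-argocd}" app sync "${app}" || return 1
  "${ARGOCD_BIN:-argocd}" app wait "${app}" --health --sync --timeout 300 || return 1
  "$@"                # smoke the forward version
}

# Example (assumes a logged-in argocd CLI and an active port-forward;
# the history ID 3 is a placeholder for the ID recorded before the deploy):
#   rollback_drill c3-runbook 3 curl -fsS -o /dev/null http://127.0.0.1:8888/
```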
Confirm history in ArgoCD:
```
argocd app history c3-runbook
```
- At least 3 revisions visible (initial install + first deploy + roll-forward).
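
The revision count can be checked without reading the table by eye. A small sketch, assuming the first column of `argocd app history` output is the numeric history ID; `count_history_rows` is a hypothetical helper.

```shell
# count_history_rows: read `argocd app history` output on stdin and print
# the number of rows whose first column is a numeric history ID.
# The header row and any non-table lines are ignored.
count_history_rows() {
  awk '$1 ~ /^[0-9]+$/ { n++ } END { print n + 0 }'
}

# Example (requires a logged-in argocd CLI):
#   rows="$(argocd app history c3-runbook | count_history_rows)"
#   [ "${rows}" -ge 3 ] && echo "history OK (${rows} rows)"
```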

Part F — negative paths + cleanup (~10 min)

Failure-mode probe — set BOTH image-set modes in the caller and trigger:
```
with:
  image_kustomize_name: "app"
  image_helm_set_image: "image.fullName"
```
- Workflow fails at the Validate image-set inputs step with the documented error message.
Failure-mode probe — leave both empty:
- Same step fails with the "Set exactly one of …" message.

Cleanup:

kubectl delete -f tests/pipeline-integration/aks-deploy/argocd-app.yaml
kubectl delete namespace c3-runbook
# If you installed ArgoCD ad-hoc in Part C and don't need it any more:
helm uninstall argocd -n "${ARGOCD_NS}"
kubectl delete namespace "${ARGOCD_NS}"

Pass criteria

Parts A + B pass cleanly (YAML, pre-commit, contract self-check, offline rejection probe).
Part C: ArgoCD bootstrapped; project-scoped token issued.
Part D: happy-path Kustomize deploy succeeds, smoke probe green.
Part E: rollback drill exercises full revert + roll-forward with smoke at each stage.
Part F: ambiguous + empty image-set inputs both fail at validation step with clear errors.

Sign-off

Tester: ____ | Date: ____ | Result: PASS / FAIL / N/A
Notes: