
Manual Test Runbook — C3: aks-deploy reusable workflow

Owner: Sagar  |  Time: ~5 min (Parts A + B offline) · ~75 min (Parts C–F sandbox)  |  Sandbox: X1 (requires F3 AKS already up)

Promotes C3 (.github/workflows/aks-deploy.yml) from 🟦 Code Complete → 🟩 Shipped. Parts C–F require a sandbox AKS cluster (F3) with ArgoCD installed. F8 (the SnowOps ArgoCD bundle) is still ⬜, so this runbook includes an ad-hoc ArgoCD install via the upstream Helm chart — replace with F8 once it lands. Cost: ~$0 incremental over the running F3 cluster.


Purpose

Validate that the C3 reusable workflow correctly:

  1. Parses cleanly and exposes the documented input/secret/output contract.
  2. Refuses to run when both image-set modes are set, or when neither is set.
  3. Logs into ArgoCD with a project-scoped token, applies a Kustomize image override, syncs the app, waits for Healthy.
  4. Probes a smoke URL until success or timeout.
  5. Performs a rollback drill: rolls back, re-smokes, rolls forward, re-smokes.

Prerequisites

  • Sandbox subscription access (PIM activated).
  • Sandbox F3 AKS cluster up; kubectl context pointed at it (az aks get-credentials …).
  • Sandbox ArgoCD reachable from the GH Actions runner used for the test. Two viable paths:
    • Hosted runner: ArgoCD exposed via an internet-reachable LoadBalancer / ingress (sandbox only; DO NOT do this in client production).
    • Self-hosted runner: runner inside the F2 spoke network, ArgoCD on ClusterIP or an internal LB.
  • An ArgoCD project-scoped API token captured into the test repo's ARGOCD_AUTH_TOKEN secret.
  • Local tooling: kubectl, helm >= 3.14, argocd CLI matching the version pinned in the workflow (2.11.4 by default), gh, jq.
  • Working directory: repo root.
  • Env vars set:
export SANDBOX_AKS_RG="snowops-sandbox-aks-rg"
export SANDBOX_AKS_NAME="snowops-sandbox-aks-01"
export ARGOCD_NS="argocd"
export ARGOCD_SERVER="argocd-sandbox.example.com"   # whatever your sandbox ingress is
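The prerequisites above can be checked in one pass before starting. The sketch below is a hypothetical helper (preflight is not part of the repo) that only confirms the listed tools are on PATH and the listed env vars are non-empty; it does not verify tool versions or cluster access.

```shell
# Hypothetical preflight check for this runbook (bash).
# Verifies tool presence and env vars only; versions are NOT checked.
preflight() {
  local missing=0 tool var
  for tool in kubectl helm argocd gh jq; do
    if ! command -v "${tool}" >/dev/null 2>&1; then
      echo "missing tool: ${tool}" >&2
      missing=1
    fi
  done
  for var in SANDBOX_AKS_RG SANDBOX_AKS_NAME ARGOCD_NS ARGOCD_SERVER; do
    # ${!var} is bash indirect expansion: the value of the variable named by $var.
    if [[ -z "${!var:-}" ]]; then
      echo "unset env var: ${var}" >&2
      missing=1
    fi
  done
  return "${missing}"
}
```

Run preflight after exporting the variables; a non-zero exit means at least one item printed on stderr needs attention.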

Steps

Part A — YAML + workflow lint (offline, ~2 min)

  1. Confirm the reusable workflow + caller template parse:
ruby -ryaml -e "YAML.load_file('.github/workflows/aks-deploy.yml')"
ruby -ryaml -e "YAML.load_file('templates/client-repo/.github/workflows/aks-deploy.yml')"
  • Both commands exit 0 with no output.

  2. If actionlint is installed locally, run it:

actionlint .github/workflows/aks-deploy.yml \
           templates/client-repo/.github/workflows/aks-deploy.yml
  • No error or warning. (Optional; skip if actionlint is not on PATH.)

  3. Confirm pre-commit gates pass on the touched files:

pre-commit run --files \
  .github/workflows/aks-deploy.yml \
  templates/client-repo/.github/workflows/aks-deploy.yml \
  pipelines/README.md \
  tests/pipeline-integration/aks-deploy/README.md \
  tests/pipeline-integration/aks-deploy/argocd-app.yaml \
  tests/pipeline-integration/aks-deploy/app/base/configmap.yaml \
  tests/pipeline-integration/aks-deploy/app/base/deployment.yaml \
  tests/pipeline-integration/aks-deploy/app/base/kustomization.yaml \
  tests/pipeline-integration/aks-deploy/app/base/service.yaml \
  docs/runbooks/test/C3.md
  • All hooks PASS.
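The parse and actionlint checks in Part A can be wrapped into a single call. This is a minimal sketch, assuming ruby is available as in step 1; lint_workflows is a hypothetical helper name, and it treats actionlint as optional in the same way the runbook does.

```shell
# Hypothetical helper: parse each workflow file as YAML, then run actionlint
# over all of them if (and only if) it is on PATH.
lint_workflows() {
  local file failed=0
  for file in "$@"; do
    if ruby -ryaml -e 'YAML.load_file(ARGV[0])' "${file}"; then
      echo "parse ok: ${file}"
    else
      echo "parse FAILED: ${file}" >&2
      failed=1
    fi
  done
  if command -v actionlint >/dev/null 2>&1; then
    actionlint "$@" || failed=1
  else
    echo "actionlint not found; skipping"
  fi
  return "${failed}"
}
```

Call it with both workflow paths from step 1 as arguments. It does not replace the pre-commit run, which covers more files.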

Part B — contract self-check (offline, ~3 min)

  1. Re-read the C3 contract section and confirm:

    • Every documented input is present in the on.workflow_call.inputs block of .github/workflows/aks-deploy.yml.
    • The synced_revision and app_health outputs are declared at workflow level and wired to the deploy job outputs.
    • The Validate image-set inputs step is the first thing the job does (before the CLI download, before any auth).
    • The rollback drill steps are all gated on inputs.rollback_drill && steps.pre.outputs.prev_id != ''.

  2. Quick offline rejection probe: locally render the workflow's validation logic against an ambiguous payload:

KUSTOMIZE_NAME="app" HELM_SET="image.fullName" bash -c '
  if [[ -n "${KUSTOMIZE_NAME}" && -n "${HELM_SET}" ]]; then
    echo "Both set — would fail"
    exit 1
  fi'
  • Exits 1 with Both set — would fail (matches the workflow's behavior).
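The probe above only covers the "both set" case. The sketch below extends it to the "neither set" case as well, so both rejection paths from Part F can be rehearsed offline. It is a hypothetical re-implementation: the function name and messages are assumptions, and the real step in aks-deploy.yml may use different variable names and wording.

```shell
# Hypothetical stand-in for the workflow's image-set validation (bash).
# Passes only when exactly one of the two modes is non-empty.
validate_image_set() {
  local kustomize_name="${1:-}" helm_set="${2:-}"
  if [[ -n "${kustomize_name}" && -n "${helm_set}" ]]; then
    echo "Both set: would fail" >&2
    return 1
  fi
  if [[ -z "${kustomize_name}" && -z "${helm_set}" ]]; then
    echo "Neither set: would fail" >&2
    return 1
  fi
  echo "Exactly one set: would pass"
}

validate_image_set "app" "image.fullName" || echo "rejected (both)"
validate_image_set "" ""                  || echo "rejected (neither)"
validate_image_set "app" ""
```

The first two calls should be rejected and the third should pass, mirroring the expected outcomes of Parts F and D respectively.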

Part C — ad-hoc ArgoCD bootstrap (sandbox, ~20 min) — skip once F8 ships

  1. Install ArgoCD into the sandbox cluster. Replace this entire step with F8 once it lands.
az aks get-credentials --resource-group "${SANDBOX_AKS_RG}" --name "${SANDBOX_AKS_NAME}" --overwrite-existing

kubectl create namespace "${ARGOCD_NS}" --dry-run=client -o yaml | kubectl apply -f -

helm repo add argo https://argoproj.github.io/argo-helm
helm repo update
helm upgrade --install argocd argo/argo-cd \
  --namespace "${ARGOCD_NS}" \
  --version 7.6.10 \
  --set server.service.type=LoadBalancer \
  --set configs.params."server\.insecure"=true \
  --wait
  • All ArgoCD pods Ready (kubectl -n argocd get pods).
  • kubectl -n argocd get svc argocd-server shows an external IP.

  2. Capture the admin password and mint a project-scoped API token:

ARGOCD_LB_IP="$(kubectl -n argocd get svc argocd-server -o jsonpath='{.status.loadBalancer.ingress[0].ip}')"
export ARGOCD_SERVER="${ARGOCD_LB_IP}"

admin_pw="$(kubectl -n argocd get secret argocd-initial-admin-secret \
  -o jsonpath='{.data.password}' | base64 -d)"

argocd login "${ARGOCD_SERVER}" --insecure --grpc-web \
  --username admin --password "${admin_pw}"

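# Create the local `snowops-deployer` account that C3 will authenticate as.
# `apiKey` lets it hold API tokens; `login` lets it log in interactively.
# This patch only creates the account. `argocd account generate-token` below
# issues an account-level token whose reach is limited by the account's RBAC
# grants in argocd-rbac-cm.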
kubectl -n argocd patch configmap argocd-cm --type merge -p '{
  "data": {
    "accounts.snowops-deployer": "apiKey, login"
  }
}'
kubectl -n argocd rollout restart deployment argocd-server
kubectl -n argocd rollout status deployment argocd-server --timeout=120s

argocd login "${ARGOCD_SERVER}" --insecure --grpc-web \
  --username admin --password "${admin_pw}"

token="$(argocd account generate-token --account snowops-deployer)"
echo "${token}"   # paste into the test repo's ARGOCD_AUTH_TOKEN secret
  • Token printed; copy it into the test repo's ARGOCD_AUTH_TOKEN GitHub secret.
  • argocd account get-user-info, run with the new token (for example via --auth-token "${token}"), reports loggedIn: true.
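A newly created local account has no permissions until RBAC rules are added, so the deployer token may authenticate but still be denied when it tries to sync. The sketch below shows one way to grant access, assuming the Application lives in the default project and that C3 needs the get, update, and sync actions (both are assumptions; adjust to the workflow). The helper names are hypothetical.

```shell
# Hypothetical helper: print policy.csv lines that let ACCOUNT get, update,
# and sync every Application in PROJECT. The action list is an assumption.
deployer_policy_csv() {
  local account="$1" project="$2" action
  for action in get update sync; do
    printf 'p, %s, applications, %s, %s/*, allow\n' \
      "${account}" "${action}" "${project}"
  done
}

# Hypothetical helper: merge the policy into argocd-rbac-cm.
# WARNING: this REPLACES any existing policy.csv in that ConfigMap.
grant_deployer_rbac() {
  local ns="$1" account="$2" project="$3" patch
  patch="$(jq -n --arg csv "$(deployer_policy_csv "${account}" "${project}")" \
    '{data: {"policy.csv": $csv}}')"
  kubectl -n "${ns}" patch configmap argocd-rbac-cm --type merge -p "${patch}"
}
```

For this runbook the call would be grant_deployer_rbac "${ARGOCD_NS}" snowops-deployer default. Because the ConfigMap is managed by the Helm release, a later helm upgrade may revert the patch.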

Part D — happy-path deploy + smoke (sandbox, ~20 min)

  1. Install the reference Application:
# Substitute SNOWOPS_ORG with your actual GH org so ArgoCD can pull the fixture.
sed "s|SNOWOPS_ORG|<your-org>|" tests/pipeline-integration/aks-deploy/argocd-app.yaml \
  | kubectl apply -f -

# Wait for the initial sync (will pull nginx:1.27-alpine since the
# kustomization.yaml override hasn't been touched yet).
argocd app wait c3-runbook --health --sync --timeout 300
  • argocd app get c3-runbook reports Synced + Healthy.
  • kubectl -n c3-runbook get pods shows 2 running replicas.

  2. Expose the service via port-forward for the smoke probe:

kubectl -n c3-runbook port-forward svc/c3-runbook-app 8888:80 &
PF_PID=$!
sleep 2
curl -sS http://127.0.0.1:8888/ -w 'status=%{http_code}\n'
  • Body reads c3-runbook ok, status 200.
  • kill ${PF_PID} when done.

  3. Configure a test caller workflow in a throwaway repo (or on a test branch). Use the templates/client-repo/.github/workflows/aks-deploy.yml shape but invoke C3 directly (skip the build job: the fixture uses public nginx, so no ACR push is needed). Trigger and watch:

    gh workflow run deploy.yml
    gh run watch
    
    • Validate image-set inputs passes (Kustomize mode).
    • Install ArgoCD CLI step downloads the pinned 2.11.4.
    • Verify ArgoCD auth prints the deployer account info.
    • Apply image override succeeds (e.g., promoting to nginx:1.27.1-alpine if you want a real diff — or keep nginx:1.27-alpine for a no-op).
    • Sync application + Wait for Healthy both succeed within their timeouts.
    • Smoke probe step succeeds (status 200) when smoke_url points at the exposed service.
    • Step summary table renders with the synced revision + Healthy state.
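To rehearse the smoke probe locally before dispatching the workflow, a polling loop along these lines can be used. It is a sketch of "probe until success or timeout" under stated assumptions: smoke_probe is a hypothetical name, success is defined as HTTP 200, and the timeout and interval defaults are illustrative rather than the workflow's actual values.

```shell
# Hypothetical smoke probe (bash): poll URL until it returns HTTP 200 or
# until TIMEOUT seconds have elapsed. Defaults are illustrative only.
smoke_probe() {
  local url="$1" timeout="${2:-60}" interval="${3:-5}"
  local deadline=$(( SECONDS + timeout )) status
  while :; do
    # On connection errors curl exits non-zero; "|| true" keeps the loop alive.
    status="$(curl -sS -o /dev/null -w '%{http_code}' --max-time 10 "${url}" 2>/dev/null || true)"
    if [[ "${status}" == "200" ]]; then
      echo "smoke ok: ${url}"
      return 0
    fi
    if (( SECONDS >= deadline )); then
      echo "smoke FAILED: last status '${status}' from ${url}" >&2
      return 1
    fi
    sleep "${interval}"
  done
}
```

With the port-forward from step 2 running, smoke_probe "http://127.0.0.1:8888/" 30 2 should return quickly with exit code 0.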

Part E — rollback drill (sandbox, ~15 min)

  1. Re-trigger the workflow with rollback_drill: true and a different image tag than the current revision (so there's a meaningful version to roll back from):

    # In the caller workflow:
    with:
      image_reference: "docker.io/library/nginx:1.27.1-alpine"
      image_kustomize_name: "app"
      rollback_drill: true
      smoke_url: "http://127.0.0.1:8888/"   # or your forwarded URL
    

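    Restart the port-forward, dispatch the workflow, and watch the run. Note that 127.0.0.1 resolves on the runner itself, so the port-forward must run on the same machine as the runner (self-hosted); for a hosted runner, point smoke_url at an address the runner can reach.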

    • Record pre-deploy revision for rollback drill captures a non-empty prev_id.
    • After the initial deploy + smoke succeed, Rollback drill — revert runs argocd app rollback.
    • Rollback drill — smoke prior version probe returns 200 against the rolled-back deployment.
    • Rollback drill — roll forward re-applies the new image via sync.
    • Rollback drill — smoke forward version probe returns 200 against the rolled-forward deployment.
    • Final job status: success.
  2. Confirm history in ArgoCD:

    argocd app history c3-runbook
    
    • At least 3 revisions visible (initial install + first deploy + roll-forward).
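The history check can also be scripted rather than read by eye. The sketch below counts history entries using the id output format of argocd app history; assert_min_history is a hypothetical helper, and the threshold of 3 comes from the expectation above.

```shell
# Hypothetical check (bash): succeed only if APP has at least MIN history
# entries. Relies on `argocd app history -o id` printing one ID per line.
assert_min_history() {
  local app="$1" min="${2:-3}" count
  count="$(argocd app history "${app}" -o id | grep -c . || true)"
  if (( count >= min )); then
    echo "history ok: ${count} entries for ${app}"
  else
    echo "history too short: ${count} < ${min} for ${app}" >&2
    return 1
  fi
}
```

For this runbook the call would be assert_min_history c3-runbook 3, run while still logged in to the sandbox ArgoCD.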

Part F — negative paths + cleanup (~10 min)

  1. Failure-mode probe — set BOTH image-set modes in the caller and trigger:

    with:
      image_kustomize_name: "app"
      image_helm_set_image: "image.fullName"
    
    • Workflow fails at the Validate image-set inputs step with the documented error message.
  2. Failure-mode probe — leave both empty:

    • Same step fails with the "Set exactly one of …" message.
  3. Cleanup:

    kubectl delete -f tests/pipeline-integration/aks-deploy/argocd-app.yaml
    kubectl delete namespace c3-runbook
    # If you installed ArgoCD ad-hoc in Part C and don't need it any more:
    helm uninstall argocd -n "${ARGOCD_NS}"
    kubectl delete namespace "${ARGOCD_NS}"
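For the two failure-mode probes in this part, the expected outcome is a failed run. A small helper can assert that from the command line; expect_run_conclusion is a hypothetical name, and the run ID is whatever gh reports for the dispatched run.

```shell
# Hypothetical helper (bash): succeed only if workflow run RUN_ID finished
# with the EXPECTED conclusion (for example "failure" or "success").
expect_run_conclusion() {
  local run_id="$1" expected="$2" actual
  actual="$(gh run view "${run_id}" --json conclusion --jq '.conclusion')"
  if [[ "${actual}" == "${expected}" ]]; then
    echo "run ${run_id}: ${actual} (as expected)"
  else
    echo "run ${run_id}: expected ${expected}, got ${actual}" >&2
    return 1
  fi
}
```

This only confirms the overall conclusion. Which step failed, and with what message, still has to be read from the run log, for example with gh run view and its --log-failed flag.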
    

Pass criteria

  • Parts A + B pass cleanly (YAML, pre-commit, contract self-check, offline rejection probe).
  • Part C: ArgoCD bootstrapped; project-scoped token issued.
  • Part D: happy-path Kustomize deploy succeeds, smoke probe green.
  • Part E: rollback drill exercises full revert + roll-forward with smoke at each stage.
  • Part F: ambiguous + empty image-set inputs both fail at validation step with clear errors.

Sign-off

  • Tester: ____  |  Date: ____  |  Result: PASS / FAIL / N/A
  • Notes: