Plan: AKS Step 1 — Verify minimal working cluster end-to-end
IMPLEMENTATION RULES: Before implementing this plan, read and follow:
- WORKFLOW.md - The implementation process
- PLANS.md - Plan structure and best practices
Status: ✅ Completed (2026-05-11)
Note on path: this PLAN scoped a manual-walkthrough verification (run the four lifecycle scripts directly, then ./uis deploy nginx, then 03-destroy.sh). The actual verification happened differently — by the time the cold cycle ran end-to-end (talk46 R3), the four-PLAN AKS novice-onboarding sequence (INVESTIGATE-platform-aks-novice-onboarding.md) had also shipped, so verification ran through the new uis platform up azure-aks + uis platform down azure-aks wrappers instead of the raw scripts. The verification bar (./uis deploy nginx succeeds with the in-cluster connectivity tests passing, cluster cleanly torn down afterward) was met regardless. See the "Findings from first-run verification" subsection in INVESTIGATE-system-platform-provisioning-layer.md for the full chronology.
Goal: Take the unverified platforms/azure-aks/ OpenTofu drafts (merged 2026-04-09 via PR #120) through their first real end-to-end run against an Azure subscription, fix any gaps that surface, and earn the right to call AKS Step 1 actually shipped — by deploying ./uis deploy nginx against the resulting cluster and watching its built-in connectivity tests pass.
Last Updated: 2026-05-11
Investigation: INVESTIGATE-system-platform-provisioning-layer.md — Step 1 scope, verification bar, and gap-analysis findings.
Problem Summary
platforms/azure-aks/ contains four scripts (00-bootstrap-state.sh, 01-apply.sh, 02-post-apply.sh, 03-destroy.sh) plus an OpenTofu module (tofu/main.tf etc.) that have never been run against a real Azure subscription. One known issue from the 2026-05-07 gap analysis must be fixed before the first run can succeed: the uis-provision-host container has no tofu binary. There's an installer for azure-cli at provision-host/uis/tools/install-azure-cli.sh but no equivalent for OpenTofu.
Beyond that, the dominant risk is un-run risk: bugs and small omissions only surface when scripts execute against real Azure resources for the first time. This plan reserves a phase for that iterative discovery.
The other 2026-05-07 gap (missing kubernetes-secrets.yml apply step in 02-post-apply.sh) is split out into PLAN-platform-aks-002-secrets-apply-parity.md. The verification bar here is ./uis deploy nginx, and nginx doesn't need cluster secrets — so the secrets-parity gap doesn't block this PLAN. It just means the cluster after this PLAN ships is verified for nginx but won't yet be ready for secret-using UIS services like postgresql. PLAN-002 closes that.
Phase 1: OpenTofu installer
Add tofu to the on-demand ./uis tools install system, mirroring the existing install-azure-cli.sh.
Tasks
-
1.1 Create
provision-host/uis/tools/install-opentofu.sh. Match the shape ofinstall-azure-cli.sh:TOOL_ID="opentofu"TOOL_NAME="OpenTofu"TOOL_DESCRIPTION="Open-source infrastructure-as-code tool (Terraform fork)"TOOL_CATEGORY="CLOUD_TOOLS"TOOL_CHECK_COMMAND="command -v tofu"TOOL_SIZE="~30MB"TOOL_WEBSITE="https://opentofu.org/"do_install()runs the official installer with the apt-method (creates apt repo + installs system-wide), so the install pattern matches azure-cli'ssudo bashflow:curl --proto '=https' --tlsv1.2 -fsSL https://get.opentofu.org/install-opentofu.sh -o /tmp/install-opentofu.sh
chmod +x /tmp/install-opentofu.sh
sudo /tmp/install-opentofu.sh --install-method deb
rm /tmp/install-opentofu.shdo_uninstall()removes the apt package and the apt source list (parallel to azure-cli's uninstall).
-
1.2 Confirm
./uis tools listdiscovers the new entry — the listing logic inprovision-host/uis/lib/tool-installation.sh:get_all_tool_idsreads anyinstall-*.shscript'sTOOL_ID=line, so no registry edit is needed.
Validation
Tester runs:
docker exec -it uis-provision-host bash -lc "./uis tools list"
…and sees opentofu listed with status ❌ Not installed. Then:
./uis tools install opentofu
tofu --version
…installs cleanly and reports a version >= 1.6.0 (the minimum the existing tofu/main.tf requires).
Phase 2: First-run verification by tester (against real Azure)
The tester runs the full platforms/azure-aks/ flow against an Azure subscription and reports results. This is where un-run risk surfaces.
Tasks
-
2.1 Install tooling:
./uis tools install azure-cli && ./uis tools install opentofu. Verifyaz --versionandtofu --versionboth work. — superseded by PLAN-uis-tools-install-azure-aks.md (PR #154): the meta-installeruis tools install azure-aksbundles both, used in talk46. -
2.2
Create. Superseded by PLAN-aks-config-cloud-accounts.md (PR #146): config now lives atplatforms/azure-aks/azure-aks-config.shfrom the template.uis.secrets/cloud-accounts/azure-default.env, written by theuis platform init azure-akswizard (PLAN-uis-platform-init-azure-aks.md, PR #155). -
2.3 Log into Azure — handled by the wizard's
az_login_if_needed/ device-code flow. The PIM-activation retry loop fromhosts/azure-microk8s/was ported intocheck_owner_or_contributor(PLAN-uis-platform-init-azure-aks.md Phase 1.4). -
2.4 Run
platforms/azure-aks/scripts/00-bootstrap-state.sh. — ran successfully in talk46 R3 as step▶ 1/3ofuis platform up azure-aks. State RG + storage account (derived from the subscription ID) + container + versioning all created. -
2.5 Run
platforms/azure-aks/scripts/01-apply.sh. — ran successfully in talk46 R3 as step▶ 2/3.tofu applycreated RG + Log Analytics workspace + AKS cluster (Standard_B2s_v2 × 1, k8s 1.34). Subscription-quota check now ported into the wizard (stronger than the originally-deferred plan). -
2.6 Run
platforms/azure-aks/scripts/02-post-apply.sh. — ran successfully in talk46 R3 as step▶ 3/3. Kubeconfig merged, storage class aliases applied, Traefik installed, external IP provisioned.kubernetes-secrets.ymlapply gap closed separately by PLAN-platform-aks-002-secrets-apply-parity.md (PR #149). -
2.7 Verify with
./uis deploy nginx. — verified in talk46 R3: pod scheduled, service reachable, IngressRoute applied, in-cluster connectivity tests (steps 13 + 15 of020-setup-nginx.yml) both succeeded. Public smokecurl http://<external-ip>/returned the catch-all page end-to-end. The F7 cluster-aware banner (PR #157) also rendered the correct LB IP + curl --resolve hint for hostname routes. -
2.8 Tear-down via
03-destroy.sh. — verified in talk46 R3 viauis platform down azure-aks. Cluster RG and Log Analytics destroyed; state RGrg-urbalurba-tfstatepreserved as designed.az aks list -o tablereturned empty afterward. -
2.9 Capture every observation — what worked, what failed, what was confusing — in a tester report. — captured across
testing/uis1/talk/talk43.mdthroughtalk46.md. F1–F13 findings list lives in INVESTIGATE-system-platform-provisioning-layer.md "Findings from first-run verification".
Validation
./uis deploy nginx succeeds end-to-end and the 020-setup-nginx.yml connectivity tests print the expected response. Tester confirms cluster is destroyed afterward (cost gate).
Phase 3: Address gaps surfaced in Phase 2
This phase is intentionally open-ended — until Phase 2 runs, the catalogue of bugs is unknown. Each gap surfaces as a small commit that re-enables a previously-failing Phase 2 task. Re-runs continue until Phase 2 passes cleanly.
Tasks (placeholder — to be enumerated as gaps surface)
-
3.x (per gap) Make the smallest fix that closes the gap. — surfaced as F1–F13 across PRs #155 (init), #156 (up + down + F1–F5), #157 (F7/F8/F9 + cosmetic), #158 (F10/F11/F12/F13 + workflow path-filter fix). Each fix targeted
platforms/azure-aks/source rather than working around in config, per the original principle. -
3.y (per gap) Re-run the affected Phase 2 step. — every fix was followed by either a fresh tester round (talk44 → talk45 → talk46 → talk47-in-the-making) or, for trivially-verifiable ones, in-container smoke tests. The end-to-end cycle in talk46 R3 ran cleanly with all surfaced fixes in place.
Validation
Phase 2 tasks all pass on a fresh clean run (i.e., starting from 00-bootstrap-state.sh).
Phase 4: Status sync
Once Phase 2 passes cleanly, mark Step 1 actually shipped.
Tasks
-
4.1 Update
INVESTIGATE-system-platform-provisioning-layer.mdStep 1 heading + add "Findings from first-run verification" subsection. — done 2026-05-11 as part of this cleanup PR. -
4.2 Update
1PRIORITY.mdTier 0 row forplatform-provisioning-layer. — done 2026-05-11 as part of this cleanup PR. -
4.3 Move this PLAN file to
completed/. — done 2026-05-11 as part of this cleanup PR.
Validation
User confirms Step 1 status accurately reflects the verified state.
Acceptance Criteria
-
provision-host/uis/tools/install-opentofu.shexists and follows the shape ofinstall-azure-cli.sh../uis tools install opentofuinstalls cleanly andtofu --versionreports>= 1.6.0. - Tester has run
00 → 01 → 02 → ./uis deploy nginx → 03end-to-end against a real Azure subscription, with the020-setup-nginx.ymlconnectivity tests succeeding. — talk46 R3, via theuis platform up/down azure-akswrappers. - Cluster has been cleanly destroyed at the end of the verification run (cost gate). — talk46 R3 close,
az aks list -o tableempty afterward, ~€0.02 total cost for the cycle. -
INVESTIGATE-system-platform-provisioning-layer.mdand1PRIORITY.mdreflect Step 1 as shipped (not "drafts merged"). - This plan is in
completed/.
Files to Modify
provision-host/uis/tools/install-opentofu.sh(new)platforms/azure-aks/azure-aks-config.sh(new, git-ignored, tester sets up in Phase 2.2)website/docs/ai-developer/plans/backlog/INVESTIGATE-system-platform-provisioning-layer.md(Phase 4)website/docs/ai-developer/plans/backlog/1PRIORITY.md(Phase 4)website/docs/ai-developer/plans/active/PLAN-001-aks-step1-verification.md→completed/(Phase 4)- Additional
platforms/azure-aks/files only as Phase 3 gaps require.
Implementation Notes
- Cost gate. AKS clusters bill while running. The Phase 2 verification is expected to be a same-session run-through ending in
03-destroy.sh. If a run is interrupted, run03-destroy.shbefore stepping away from the keyboard. - State backend persists across destroys.
00-bootstrap-state.shcreates a state RG / storage account that is not torn down by03-destroy.sh. That's by design (state survives cluster recreate cycles). It costs a few cents a month, not a real concern. - No Key Vault, no ACR. Step 1 deliberately matches the Rancher Desktop secrets workflow (
kubernetes-secrets.yml). Azure-specific add-ons are deferred to "Future hardening" in the investigation. - Secrets parity is split out. PLAN-002 closes the
kubernetes-secrets.ymlapply gap on02-post-apply.sh. nginx doesn't need it; everything else does. Sequencing is loose — PLAN-002 can run before or after this plan ships, but the natural order is verify first (this plan), then extend (PLAN-002). - Tester reports against this plan. Phase 2's tester report is the highest-signal artifact. Capture surprises in detail — they shape Phase 3 and inform PLAN-002 and Step 2's eventual PLAN.