Plan: clean up merged kubeconfig entries when destroying an AKS cluster
IMPLEMENTATION RULES: Before implementing this plan, read and follow:
- WORKFLOW.md - The implementation process
- PLANS.md - Plan structure and best practices
Status: Completed (2026-05-16)
Implementation landed in platforms/azure-aks/scripts/03-destroy.sh via the shared pf_remove_context + pf_lockstep_flip helpers in provision-host/uis/lib/platform-switching.sh. All four acceptance-criteria items satisfied:
- pf_remove_context "$AZURE_AKS_CLUSTER_NAME" deletes context + cluster + user refs from kubeconf-all and syncs to the legacy bind-mount path (delete-cluster/-context/-user + cp, lines 211-241 of platform-switching.sh).
- pf_lockstep_flip "rancher-desktop" re-points current-context and writes cluster-config.sh in one shared writer (line 209 of 03-destroy.sh).
- The host-kubeconfig defensive cleanup (kubectl config delete-context) runs first at line 181.
- Per-cluster $KUBECONFIG_FILE removal at line 194.
The verbatim code in this PLAN was not used — the equivalent work landed as higher-level helpers shared with 02-post-apply.sh and cmd_platform_use, which converges on the lockstep-flip / context-removal pattern. Net outcome identical.
Goal: When platforms/azure-aks/scripts/03-destroy.sh tears down an AKS cluster, also remove that cluster's stale clusters: / contexts: / users: entries from the merged kubeconf-all, and re-point current-context to rancher-desktop. Symmetric counterpart to 02-post-apply.sh's flip-on-apply.
Last Updated: 2026-05-10
Source: tester's Round 3 Tier A retry №4 result in testing/uis1/talk/talk.md — flagged as a real bug after PR #149's merge gate was already met. Deferred from #149 to ship the verification-loop fixes; tracked here.
Problem Summary
After 03-destroy.sh runs, cluster-config.sh correctly resets to rancher-desktop (PR #149 added this), but the merged kubeconf-all files still contain the destroyed cluster's three entries, plus a current-context that still points at it:
- clusters: azure-aks, pointing at an API server that no longer resolves (azure-aks-XXXX.hcp.westeurope.azmk8s.io … no such host)
- contexts: azure-aks
- users: clusterUser_*
- current-context: azure-aks
Symptoms operators have hit:
- Bare kubectl … (no explicit --context) fails with confusing DNS-lookup errors instead of "you destroyed this cluster".
- Multiple destroy/recreate cycles accumulate stale entries forever; kubectl config get-contexts becomes a graveyard.
- Atlas-side port-forwards die silently after destroy because the tester's local kubectl port-forward is rooted at a now-dead context.
The destroy already removes the per-cluster ${cluster}-kubeconf file and the kubectl context binding — but it does not edit the merged file that ~100 consumers across the repo read from.
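One way to see the leftover state is to ask kubectl which sections of the merged file still mention the cluster. The following is a minimal sketch, not code from the repo: the function name list_stale_entries is hypothetical, and it assumes the stale cluster, context, and user entries all contain the cluster name (as azure-aks and clusterUser_* do above).

```shell
# Hypothetical helper (not part of the repo): report which sections of a
# kubeconfig still hold entries whose name contains the given cluster name.
list_stale_entries() {
  local kubeconf="$1" cluster="$2" section name
  for section in clusters contexts users; do
    # For a [*] jsonpath, config view prints the entry names space-separated.
    for name in $(kubectl --kubeconfig "$kubeconf" config view \
                    -o jsonpath="{.${section}[*].name}"); do
      case "$name" in
        *"$cluster"*) echo "${section}: ${name}" ;;
      esac
    done
  done
}

# Usage (path taken from this plan):
#   list_stale_entries /mnt/urbalurbadisk/kubeconfig/kubeconf-all azure-aks
```

An empty result after a destroy would indicate the merged file no longer references the cluster; any printed line names a section that still needs cleaning.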
Phase 1: Implement the cleanup section in 03-destroy.sh
Tasks
- 1.1 In platforms/azure-aks/scripts/03-destroy.sh, add a new section between the existing "Cleaning up kubeconfig" block (per-cluster file removal) and "Reset UIS target to rancher-desktop" (cluster-config flip):

      print_status "Removing $AZURE_AKS_CLUSTER_NAME entries from merged kubeconfig..."

      # Operate on the in-container path (kubectl flock-safe); sync to the
      # bind-mount path after, mirroring 04-merge-kubeconf.yml's tasks 29-31.
      KUBECONF_PRIMARY="/mnt/urbalurbadisk/kubeconfig/kubeconf-all"
      KUBECONF_LEGACY="/mnt/urbalurbadisk/.uis.secrets/generated/kubeconfig/kubeconf-all"

      if [[ -f "$KUBECONF_PRIMARY" ]]; then
        kubectl --kubeconfig "$KUBECONF_PRIMARY" config delete-context "$AZURE_AKS_CLUSTER_NAME" 2>/dev/null || true
        kubectl --kubeconfig "$KUBECONF_PRIMARY" config delete-cluster "$AZURE_AKS_CLUSTER_NAME" 2>/dev/null || true
        kubectl --kubeconfig "$KUBECONF_PRIMARY" config delete-user \
          "clusterUser_${AZURE_AKS_RESOURCE_GROUP}_${AZURE_AKS_CLUSTER_NAME}" 2>/dev/null || true

        # Re-point current-context at rancher-desktop if it's available
        if kubectl --kubeconfig "$KUBECONF_PRIMARY" config get-contexts -o name | grep -qx 'rancher-desktop'; then
          kubectl --kubeconfig "$KUBECONF_PRIMARY" config use-context rancher-desktop >/dev/null
          print_success "current-context switched to rancher-desktop"
        else
          print_warning "rancher-desktop context not in merged kubeconfig — current-context left dangling"
        fi

        # Sync to legacy path that ~100 consumer playbooks read from
        cp "$KUBECONF_PRIMARY" "$KUBECONF_LEGACY"
        print_success "Cleaned merged kubeconfig synced to legacy path"
      else
        print_warning "Merged kubeconfig not found — nothing to clean"
      fi

- 1.2 Verify on rancher-desktop + AKS hot-patched container: bring up an AKS cluster, run a ./uis deploy against it, then run 03-destroy.sh. After destroy:
  - kubectl --kubeconfig $KUBECONF_PRIMARY config get-contexts shows no azure-aks entry.
  - kubectl --kubeconfig $KUBECONF_PRIMARY config current-context returns rancher-desktop.
  - cmp -s $KUBECONF_PRIMARY $KUBECONF_LEGACY exits 0.
  - Bare kubectl get nodes succeeds (it would have failed with DNS error before this fix).
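The first three checks in task 1.2 could be bundled into one repeatable function. This is a sketch under stated assumptions rather than a script that exists in the repo: verify_destroy_cleanup is a hypothetical name, and the fourth check (bare kubectl get nodes) is left out because it needs a live rancher-desktop cluster.

```shell
# Hypothetical check for task 1.2 (illustrative name, not in the repo).
# Returns 0 only if all three post-destroy conditions hold.
verify_destroy_cleanup() {
  local primary="$1" legacy="$2" cluster="$3" rc=0

  # 1. No context named after the destroyed cluster remains.
  if kubectl --kubeconfig "$primary" config get-contexts -o name | grep -qxF "$cluster"; then
    echo "FAIL: context $cluster still present"
    rc=1
  fi

  # 2. current-context points at rancher-desktop.
  if [ "$(kubectl --kubeconfig "$primary" config current-context 2>/dev/null)" != "rancher-desktop" ]; then
    echo "FAIL: current-context is not rancher-desktop"
    rc=1
  fi

  # 3. Primary and legacy copies are byte-identical.
  if ! cmp -s "$primary" "$legacy"; then
    echo "FAIL: $primary and $legacy differ"
    rc=1
  fi

  [ "$rc" -eq 0 ] && echo "OK: merged kubeconfig is clean"
  return "$rc"
}

# Usage (paths and cluster name taken from this plan):
#   verify_destroy_cleanup "$KUBECONF_PRIMARY" "$KUBECONF_LEGACY" azure-aks
```

Collecting failures in rc instead of returning on the first one means a single run reports every condition that is still broken.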
Validation
The verbatim acceptance check from the tester's reply:
Edge case to handle: if the operator destroys an AKS cluster without a rancher-desktop context in their merged kubeconfig (some future remote-only setup), the else branch warns and leaves current-context untouched rather than picking a random context. Conservative; fail loud rather than silent.
Implementation Notes
- The delete-cluster / delete-context / delete-user operations all run against the in-container path (/mnt/urbalurbadisk/kubeconfig/kubeconf-all), which kubectl already proved write-safe during PR #149's 02-post-apply.sh apply flow. The cp to the legacy path is a plain file copy: no kubectl, no flock, safe across the lima/9P bind mount.
- The user name clusterUser_${AZURE_AKS_RESOURCE_GROUP}_${AZURE_AKS_CLUSTER_NAME} is the convention emitted by az aks get-credentials (and by extension the kubeconfig that OpenTofu's kube_config_raw output produces). If a future deployment uses a different user-name convention, we'd need to discover the user dynamically (kubectl --kubeconfig … config view -o jsonpath for users referencing the destroyed context).
- This is the symmetric counterpart to 04-merge-kubeconf.yml's tasks 29-31. Apply adds the cluster + sets context; destroy removes the cluster + resets context. Cleaner mental model for the next contributor.
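The dynamic user lookup mentioned above might look like the following. It is a sketch, not the landed implementation: context_user and remove_context_entries are hypothetical names, and it assumes the cluster entry shares its name with the context, as azure-aks does in this plan. The lookup has to run before delete-context, because once the context is gone there is nothing left to read the user reference from.

```shell
# Hypothetical helper: print the user entry that a context references.
# Must run BEFORE the context is deleted, otherwise the lookup finds nothing.
context_user() {
  local kubeconf="$1" context="$2"
  kubectl --kubeconfig "$kubeconf" config view \
    -o jsonpath="{.contexts[?(@.name==\"${context}\")].context.user}"
}

# Sketch of the removal order using the discovered user name.
# Assumption: the cluster entry has the same name as the context.
remove_context_entries() {
  local kubeconf="$1" context="$2" user
  user="$(context_user "$kubeconf" "$context")"
  kubectl --kubeconfig "$kubeconf" config delete-context "$context" 2>/dev/null || true
  kubectl --kubeconfig "$kubeconf" config delete-cluster "$context" 2>/dev/null || true
  # Skip delete-user when the context was already gone and no user was found.
  if [ -n "$user" ]; then
    kubectl --kubeconfig "$kubeconf" config delete-user "$user" 2>/dev/null || true
  fi
}

# Usage (path and name taken from this plan):
#   remove_context_entries /mnt/urbalurbadisk/kubeconfig/kubeconf-all azure-aks
```

This trades the hard-coded clusterUser_ convention for one extra read of the file, which keeps the cleanup correct if the credential naming ever changes.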
Files to Modify
- platforms/azure-aks/scripts/03-destroy.sh (add the cleanup section)
- website/docs/ai-developer/plans/active/PLAN-platform-aks-destroy-kubeconfig-cleanup.md ← move from backlog/ to active/ when work starts
Acceptance Criteria
- After 03-destroy.sh succeeds, the merged kubeconfig contains no entries referencing the destroyed cluster.
- current-context is rancher-desktop (or the warning fires if no rancher-desktop context exists).
- Both kubeconf-all paths (/mnt/urbalurbadisk/kubeconfig/ and .uis.secrets/.../kubeconfig/) are byte-identical after destroy.
- Tester confirms via ./uis deploy <something> after destroy: targets rancher-desktop cleanly with no manual cluster-config or kubectl-context fix-up.