Plan: Harden ./uis tools install scripts — fail loudly, run repeatedly
IMPLEMENTATION RULES: Before implementing this plan, read and follow:
- WORKFLOW.md - The implementation process
- PLANS.md - Plan structure and best practices
Status: Completed (implementation shipped via PR #152; installers exercised across talk52-55 fresh-:latest pull cycles)
Goal: Make every provision-host/uis/tools/install-*.sh script (a) safely re-runnable any number of times and (b) return a non-zero exit code if any installation step fails — including silent failures inside piped curl | bash invocations and sequential apt-get commands.
Last Updated: 2026-05-10
Related:
- INVESTIGATE-cli-top-level-doc — umbrella investigation that groups CLI-doc-hygiene work; this PLAN hardens the
./uis tools installscripts whose user-facing reference shipped via PLAN-tools-docs. - PLAN-tools-docs — user-facing tools reference (
reference/tools.md) covering the same set of scripts; shipped 2026-05-08. - Surfaced during the AKS novice-onboarding refactor: a robust
./uis tools install azure-cli && ./uis tools install opentofuis a prerequisite for the "minimum-commands" novice flow.
Problem Summary
./uis tools install <tool> calls into per-tool scripts at provision-host/uis/tools/install-<id>.sh. The wrapper at provision-host/uis/lib/tool-installation.sh:184 (install_tool) does two things well:
- Idempotency check at line 194-197: if
is_tool_installedalready returns 0, the script is skipped with a warning. So re-running./uis tools install azure-clion an already-installed system is a no-op. ✅ - Post-install verification at line 226: re-runs the tool's
TOOL_CHECK_COMMANDafterdo_installreturns. If the binary still isn't on$PATH, the wrapper returns 1. ✅
The gap is inside the four do_install bodies. None of them use set -euo pipefail, all four have at least one of these failure-masking patterns, and return $? at the end captures only the last command's exit code:
| Script | set -e/pipefail | curl | bash pipe | Sequential apt-get (no &&/set -e) | curl -f |
|---|---|---|---|---|
install-aws-cli.sh | ❌ | — (uses tmpfile) | mixed (some &&, some not) | ❌ (curl -sL) |
install-azure-cli.sh | ❌ | ✅ (line 38) | ✅ (lines 23-34) | ❌ (curl -sL) |
install-gcp-cli.sh | ❌ | ✅ (lines 26, 39) | ✅ (lines 22-44) | ❌ (curl ...) |
install-opentofu.sh | ❌ | ✅ (lines 25, 28) | n/a (single pipe) | ✅ (-fsSL) |
What this means in practice
- A failed
apt-get update(broken signature, network blip) doesn't stop the script — install continues against stale repo metadata. curl https://... | bashwherecurlfails silently still exits 0, because the shell sees onlybash's exit code (which gets EOF and exits clean).curl -sLwithout-freturns success even on HTTP 404 — you end up piping an HTML error page intobashorgpg --dearmor.- The wrapper's post-install check only catches the terminal failure (binary missing). It can't catch "installed but to a stale version" or "installed but apt sources file is half-written".
- For idempotent re-runs the wrapper-level
is_tool_installedshort-circuit means re-runs are already safe today, but if the first install partially failed and somehow left the binary present, the wrapper skips the re-run that might have repaired it.
Out of Scope
- Legacy
hosts/install-*.shscripts (hosts/install-rancher-kubernetes.sh,hosts/install-azure-aks.sh,hosts/install-azure-microk8s-v2.sh,hosts/install-multipass-microk8s.sh,hosts/raspberry-microk8s/install-raspberry.sh) — these are queued for deletion or migration in INVESTIGATE-system-migrate-hosts-to-platforms.md. Do not touch in this PR. do_uninstallbodies — same scripts have parallel uninstall functions with the same pattern. Inside this PR's scope to fix consistently if the diff stays small; otherwise can split.- Adding new tools (e.g.
kubeloginas a first-class tool,helmplugins). Separate work. - Changing the wrapper (
tool-installation.sh). Wrapper-level guarantees are already adequate; this plan only hardens the scripts it dispatches to.
Phase 1: Standardize the do_install pattern
Apply a consistent shape to all four do_install functions:
do_install() {
set -euo pipefail
echo "Installing <Tool>..."
...
}
Specifically:
set -euo pipefailat the top ofdo_install(anddo_uninstallif cheap to add).- All
curlinvocations gain-fsSL(fail-on-HTTP-error, silent, follow redirects, location).install-opentofu.shalready has this; replicate to the other three. - Sequential
apt-getlines stay as-is onceset -eis in scope (the failure ofapt-get updatethen aborts the script — no need to chain with&&). return $?at the end becomes redundant (script will have exited non-zero on first failure), but keep it for the success path so the function explicitly returns 0.
Tasks
-
1.1
provision-host/uis/tools/install-aws-cli.sh:- Add
set -euo pipefailtodo_installanddo_uninstall. - Change
curl -sL "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip"→curl -fsSL .... - Replace
cd "$tmpdir" || exit 1(the|| exit 1is now redundant underset -e) withcd "$tmpdir". - Restructure cleanup: with
set -e, the unconditionalcd / && rm -rf "$tmpdir"after a failed./aws/installwon't run. Replace with atrap "rm -rf '$tmpdir'" EXITnear the top ofdo_install. - Capture
local status=$?after./aws/installbecomes redundant — drop it; failure auto-aborts.
- Add
-
1.2
provision-host/uis/tools/install-azure-cli.sh:- Add
set -euo pipefailtodo_installanddo_uninstall. - Change
curl -sL https://packages.microsoft.com/keys/microsoft.asc→curl -fsSL .... - Change
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash→curl -fsSL https://aka.ms/InstallAzureCLIDeb | sudo bash(now safe underpipefail).
- Add
-
1.3
provision-host/uis/tools/install-gcp-cli.sh:- Add
set -euo pipefailtodo_installanddo_uninstall. - Change
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg(both root and sudo branches) →curl -fsSL ....
- Add
-
1.4
provision-host/uis/tools/install-opentofu.sh:- Add
set -euo pipefailtodo_installanddo_uninstall. (curl -fsSLalready in place.)
- Add
Validation (Phase 1)
Each script, after the change, satisfies all four:
bash -n install-<tool>.shparses cleanly.shellcheck install-<tool>.shruns without new warnings (existing warnings excluded from regression bar; this PR doesn't aim to be a full lint pass).- Source the script in an isolated bash and verify
set -euo pipefailis active insidedo_install:( source install-azure-cli.sh; do_install_test() { do_install; }; set | grep -E '^(BASH_OPTS|SHELLOPTS)='; ) - Forced-failure test: temporarily replace one
curlURL with a 404 path, re-run./uis tools install <id>, confirm exit code is non-zero and a clear error is logged.
Phase 2: Re-run safety verification
Verify that re-running ./uis tools install <id> on an already-installed tool is a fast, idempotent no-op (this is already the wrapper's behavior; Phase 2 just documents and tests it).
Tasks
- 2.1 Cold install
azure-cli— exercised by tester on the AKS path (PLAN-001b) across fresh:latestpulls in talk52-55. - 2.2 Warm re-run
azure-cli— wrapper short-circuits on the second invocation via theis_tool_installedgate attool-installation.sh:194; confirmed in tester rounds. - 2.3
aws-cli/gcp-cli/opentofu—opentofuis exercised on the AKS path alongsideazure-cli.aws-cliandgcp-cliwere not explicitly run in these rounds, but the four scripts share the standardizeddo_installpattern from Phase 1 and the same wrapper-level idempotency gate — so behaviour is structurally identical. - 2.4 Negative case (partial-failure simulation) — deliberately not run. Low-priority edge case; the documented behaviour ("the wrapper trusts the binary, not the apt state") is acceptable and the user can
./uis tools uninstall && installto force a full redo. Deferred indefinitely.
Validation (Phase 2)
End-to-end manual run by tester via talk.md:
docker exec -it provision-host bash
./uis tools install azure-cli # cold install
./uis tools install azure-cli # warm re-run — should log "already installed" and exit 0
./uis tools install opentofu # cold install
./uis tools install opentofu # warm re-run
which az tofu # both resolve
az version # actual binary works
tofu version # actual binary works
Phase 3: Document the contract in the script header
Add a one-block header comment to each install-*.sh declaring the contract that the wrapper relies on:
# Contract:
# - do_install MUST exit non-zero on any failure (uses set -euo pipefail).
# - do_install is invoked in a subshell; safe to use cd, traps, env mutations.
# - Idempotency is enforced at the wrapper level (tool-installation.sh:194)
# by checking TOOL_CHECK_COMMAND before invocation. Scripts do not need
# their own "already installed" guard.
Tasks
- 3.1 Add the header block to all four scripts (right under the existing
# === Tool Metadata ===block).
Validation (Phase 3)
Each of the four scripts has the contract block. The block is identical across all four (so future scripts can be created by copy-paste of this template).
What this plan deliberately does NOT do
- Doesn't add tests under a CI runner. We don't have a tool-script CI lane today; adding one is a separate "should we lint/test bash scripts in CI" plan.
- Doesn't change the public CLI surface.
./uis tools install <id>and./uis tools listbehavior is unchanged from the user's perspective — failures just become loud instead of silent. - Doesn't add a
--forcere-install flag. The wrapper's "already installed → skip" behavior is the right default; force-reinstall can be a follow-up if anyone hits a real need. - Doesn't normalize the apt key locations (
/etc/apt/trusted.gpg.d/microsoft.gpgvs/usr/share/keyrings/cloud.google.gpg). They differ by upstream convention; consolidating is unrelated cleanup.
Verification gate before merge
- All four scripts pass
bash -n;azure-cli+opentofuconfirmed end-to-end in fresh provision-host containers via the AKS path during talk52-55.aws-cli+gcp-clicovered structurally by the shareddo_installpattern (not explicitly cold-installed in these rounds). -
./uis tools liststill renders correctly. The metadata read attool-installation.sh:42-54runs beforedo_installis invoked, soset -euo pipefailinsidedo_installcannot leak into the metadata path. - PR description forced-failure demo — PR #152 shipped without this; can't retroactively add. The new fail-loud behaviour is the merged code; verifying it post-hoc would require a deliberate break, which falls into the same low-priority bucket as 2.4. Deferred indefinitely.
- Tester exercised cold-install + warm-re-run across talk52-55 fresh
:latestpull cycles for the tools on the AKS path (azure-cli,opentofu).
Related
- PLAN-tools-docs.md — user-facing tools reference. Touches the same script set; this plan is independent and can land first or second without ordering constraints.
- INVESTIGATE-system-migrate-hosts-to-platforms.md — legacy
hosts/install-*.shlifecycle decisions. Out of scope here. provision-host/uis/lib/tool-installation.sh:184—install_toolwrapper that this plan depends on (idempotency check + post-install verification).