INVESTIGATE: Rancher Reset and Full Service Verification
Related: STATUS-service-migration, INVESTIGATE-unity-catalog-crashloop
Created: 2026-02-19
Status: COMPLETE
Completed: 2026-02-20
Goal
Determine the exact procedure to factory-reset Rancher Desktop, reprovision from scratch, and verify that all 26 services deploy and undeploy correctly.
Factory Reset
Rancher Desktop → Troubleshooting → Factory Reset:
- Wipes: K3s cluster, all pods, PersistentVolumes, Docker containers, Docker images, Rancher Desktop settings
- Survives: Only host filesystem files (`.uis.extend/`, `.uis.secrets/`, repo files)
The `uis-provision-host` container and image are wiped by factory reset and must be rebuilt locally before use. The tester must delete any previous `.uis.extend/` and `.uis.secrets/` directories so the test starts from a clean slate.
Recovery Procedure
After factory reset and re-enabling Kubernetes:
Contributor (builds image and prepares tester)
```
./uis build           # rebuild the container image locally
cp uis <tester-dir>/  # copy the uis wrapper to the tester directory
```
The contributor verifies the image builds successfully before handing off to the tester.
Tester (clean slate verification)
```
rm -rf .uis.extend .uis.secrets  # ensure no leftover config
./uis start   # uis wrapper creates .uis.extend/ and .uis.secrets/, starts container
./uis deploy  # calls ensure_secrets_applied() automatically, deploys enabled services
```
For individual services:
```
./uis deploy <service>
```
The `ensure_secrets_applied()` function in `first-run.sh` handles re-applying secrets to a fresh cluster. The `uis deploy` command calls it before every deployment.
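A rough sketch of what that function might look like, assuming the secrets live as manifests under `.uis.secrets/` (the directory layout and the `kubectl apply` loop here are assumptions, not confirmed behavior):

```shell
# Hypothetical sketch of ensure_secrets_applied(); the function name and
# its role come from first-run.sh, but this body is an assumption.
ensure_secrets_applied() {
  secrets_dir="${1:-.uis.secrets}"
  # Create the directory tree on first run (a missing `mkdir -p` for
  # secrets-config/ was one of the bugs found during testing).
  mkdir -p "$secrets_dir/secrets-config"
  # Re-apply every secret manifest; `kubectl apply` is idempotent, so
  # running this before every deployment is safe on a live cluster too.
  for f in "$secrets_dir"/*.yaml; do
    [ -e "$f" ] || continue
    kubectl apply -f "$f" || return 1
  done
}
```

Because `kubectl apply` is declarative, the same function covers both the fresh-cluster case and the no-op case on an already-configured cluster.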
Final Verification Status
26/26 services defined, 23/26 verified. All testable services deploy and undeploy cleanly from a factory-reset clean slate.
| Status | Services |
|---|---|
| Verified (23) | nginx, whoami, postgresql, redis, mysql, mongodb, qdrant, elasticsearch, rabbitmq, authentik, openwebui, litellm, prometheus, grafana, loki, tempo, otel-collector, argocd, jupyterhub, spark, unity-catalog, pgadmin, redisinsight |
| Skipped (3) | gravitee (broken before migration), tailscale-tunnel (requires auth key), cloudflare-tunnel (requires token) |
Tested across 8 rounds in talk9.md + 8 rounds in talk10.md. Bugs found and fixed during testing:
- Lazy initialization (config files not created on `./uis start`)
- Schema regex too strict for `removePlaybook` with parameters
- Shell arithmetic bug (`wc -l` whitespace) in Redis/RabbitMQ removal playbooks
- RabbitMQ health check wrong namespace
- Unity Catalog: wrong image, wrong security context, wrong API version, no curl in container (see INVESTIGATE-unity-catalog-crashloop)
- pgAdmin: `admin@localhost` rejected by email validator, changed to `admin@example.com`
- pgAdmin: OOM on login with 256Mi memory limit, increased to 512Mi
- pgAdmin/RedisInsight removal playbooks: same `grep -c` "Illegal number" bug
- Secrets generation: missing `mkdir -p` for `secrets-config/` directory
- Default secrets duplication: `first-run.sh` hardcoded values instead of reading from `default-secrets.env`
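The two shell arithmetic bugs above are the same failure mode: a command substitution yields a value with non-digit characters, and `$((...))` or a numeric `[ -gt ]` test then rejects it ("Illegal number"). A minimal reproduction and fix for the `wc -l` case:

```shell
# On BSD/macOS, `wc -l` pads its output with leading spaces, so the raw
# value can break arithmetic and numeric tests in stricter shells.
raw=$(printf 'a\nb\nc\n' | wc -l)        # may be "       3"
# Fix: strip all whitespace before treating the value as a number.
count=$(printf 'a\nb\nc\n' | wc -l | tr -d '[:space:]')
echo $((count - 1))                      # prints 2
```

The `grep -c` variant in the pgAdmin/RedisInsight playbooks is analogous: any non-numeric (or empty) output fed into a numeric comparison triggers the same error.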
Proposed Test Strategy
Phase 1: Reset and Bootstrap
- Factory reset Rancher Desktop
- Re-enable Kubernetes, wait for ready
- `./uis build` — rebuild the container image locally
- `./uis start` — creates container from local image
- `./uis secrets apply`
- Verify cluster is healthy: `kubectl get nodes`, `kubectl get pods -A`
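The health check at the end of Phase 1 can be wrapped so later phases only start once the cluster is actually ready. A sketch; the `wait_for_cluster` helper name and the 180s default timeout are illustrative, not part of the tooling:

```shell
# Block until every node reports Ready, then dump cluster-wide pod state.
wait_for_cluster() {
  kubectl wait --for=condition=Ready node --all --timeout="${1:-180s}" || return 1
  kubectl get pods -A
}
# Usage: wait_for_cluster         # or, e.g., wait_for_cluster 300s
```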
Phase 2: Deploy and Verify Each Service
Test each service: deploy → verify pods running → verify connectivity → undeploy → verify clean removal.
Suggested order (dependencies first):
- nginx — used by automatic install to verify the system is started
- whoami — simplest service, baseline test
- postgresql — required by authentik, openwebui, litellm, unity-catalog
- redis — required by authentik
- mysql — standalone database
- mongodb — standalone database
- qdrant — standalone vector database
- elasticsearch — standalone search
- rabbitmq — standalone queue
- authentik — depends on postgresql + redis
- openwebui — depends on postgresql
- litellm — depends on postgresql
- prometheus — monitoring, standalone
- grafana — monitoring, standalone
- loki — monitoring, standalone
- tempo — monitoring, standalone
- otel-collector — monitoring, standalone
- argocd — management
- jupyterhub — data science
- spark — data science
- unity-catalog — data science, depends on postgresql
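The per-service cycle above (deploy → verify pods running → undeploy → verify clean removal) can be scripted end to end. A sketch, assuming an `undeploy` subcommand and one namespace per service; both are assumptions, since the notes only show `./uis deploy`:

```shell
# Sketch of one verify cycle; the `undeploy` subcommand name and the
# service-name-equals-namespace convention are assumptions.
verify_service() {
  svc="$1"
  ./uis deploy "$svc" || return 1
  # All pods in the service's namespace must become Ready.
  kubectl wait --for=condition=Ready pod --all -n "$svc" --timeout=300s || return 1
  ./uis undeploy "$svc" || return 1
  # Clean removal: no pods may remain in the namespace.
  [ -z "$(kubectl get pods -n "$svc" --no-headers 2>/dev/null)" ]
}
# Usage: for svc in nginx whoami postgresql; do verify_service "$svc" || break; done
```

Connectivity checks are service-specific (e.g. a `psql` ping for postgresql) and would slot in between the readiness wait and the undeploy.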
Skip for now (require external accounts or broken):
- tailscale-tunnel — requires Tailscale auth key
- cloudflare-tunnel — requires Cloudflare token
- gravitee — was broken before migration
Phase 3: Stack Tests
After individual verification, test deploying full stacks:
- Observability stack: prometheus + grafana + loki + tempo + otel-collector
- AI stack: openwebui + litellm
- Data science stack: jupyterhub + spark + unity-catalog
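Assuming `./uis deploy` takes one service at a time, a stack test is just the members deployed in dependency order and then checked together (the per-service namespace convention is again an assumption):

```shell
# Deploy a stack's members in order, then wait for all of them at once.
deploy_stack() {
  for svc in "$@"; do
    ./uis deploy "$svc" || return 1
  done
  for svc in "$@"; do
    kubectl wait --for=condition=Ready pod --all -n "$svc" --timeout=300s || return 1
  done
}
# Usage: deploy_stack prometheus grafana loki tempo otel-collector
```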
Resolved Questions
- Tester workflow: The tester runs on the same Rancher Desktop, so factory reset wipes the tester's containers too. The contributor builds the image locally and copies the `uis` file to the tester directory.
- How long does a full cycle take? TBD; will be measured during testing.
- Data persistence: For this test we wipe everything. Services like PostgreSQL start fresh. That's expected.
- `enabled-services.conf` already has nginx as the only enabled service by default. The old system used nginx to verify that the system is started. No changes needed.
- `./uis` is the standard path. Decided in PLAN-004/PLAN-005. The legacy `install-rancher.sh` is no longer used.