INVESTIGATE: Rancher Reset and Full Service Verification

Related: STATUS-service-migration, INVESTIGATE-unity-catalog-crashloop
Created: 2026-02-19
Status: COMPLETE
Completed: 2026-02-20

Goal

Determine the exact procedure to factory-reset Rancher Desktop, reprovision from scratch, and verify all 24 services deploy and undeploy correctly.

Factory Reset

Rancher Desktop → Troubleshooting → Factory Reset:

Wipes: K3s cluster, all pods, PersistentVolumes, Docker containers, Docker images, Rancher Desktop settings
Survives: Only host filesystem files (.uis.extend/, .uis.secrets/, repo files)

The uis-provision-host container and image are wiped by factory reset and must be rebuilt locally before use.

Because these folders survive the reset, the tester must delete any existing .uis.extend/ and .uis.secrets/ folders before starting, so that the test runs from a clean state.

Recovery Procedure

After factory reset and re-enabling Kubernetes:

Contributor (builds image and prepares tester)

./uis build            # rebuild the container image locally
cp uis <tester-dir>/   # copy the uis wrapper to the tester directory

The contributor verifies the image builds successfully before handing off to the tester.

Tester (clean slate verification)

rm -rf .uis.extend .uis.secrets   # ensure no leftover config
./uis start            # uis wrapper creates .uis.extend/ and .uis.secrets/, starts container
./uis deploy           # calls ensure_secrets_applied() automatically, deploys enabled services

For individual services:

./uis deploy <service>

The ensure_secrets_applied() function in first-run.sh handles re-applying secrets to a fresh cluster. The uis deploy command calls it before every deployment.

Final Verification Status

26 services are defined; 23 of them are verified and 3 were skipped. All testable services deploy and undeploy cleanly from a factory-reset clean slate.

Status	Services
Verified (23)	nginx, whoami, postgresql, redis, mysql, mongodb, qdrant, elasticsearch, rabbitmq, authentik, openwebui, litellm, prometheus, grafana, loki, tempo, otel-collector, argocd, jupyterhub, spark, unity-catalog, pgadmin, redisinsight
Skipped (3)	gravitee (broken before migration), tailscale-tunnel (requires auth key), cloudflare-tunnel (requires token)

Tested across 8 rounds in talk9.md and 8 rounds in talk10.md. Bugs found and fixed during testing:

Lazy initialization (config files not created on ./uis start)
Schema regex too strict for removePlaybook with parameters
Shell arithmetic bug (wc -l whitespace) in Redis/RabbitMQ removal playbooks
RabbitMQ health check used the wrong namespace
Unity Catalog: wrong image, wrong security context, wrong API version, no curl in container (see INVESTIGATE-unity-catalog-crashloop)
pgAdmin: admin@localhost rejected by email validator, changed to admin@example.com
pgAdmin: OOM on login with 256Mi memory limit, increased to 512Mi
pgAdmin/RedisInsight removal playbooks: same grep -c "Illegal number" bug
Secrets generation: missing mkdir -p for secrets-config/ directory
Default secrets duplication: first-run.sh hardcoded values instead of reading from default-secrets.env
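The counting bugs in the removal playbooks follow a common shell pattern. The sketch below shows one plausible shape of it and a defensive fix; it is not the code from the playbooks. The count_matches helper is hypothetical. Two facts it relies on: grep -c prints 0 and also exits with status 1 when nothing matches, so appending `|| echo 0` yields two lines, and some wc implementations pad their output with spaces. Either can make a later integer test fail with a message such as "Illegal number" in shells like dash.

```shell
# Hypothetical helper (not from the playbooks): count lines on stdin that
# match a pattern and always print exactly one clean integer.
count_matches() {
  pattern=$1
  # Fragile form: n=$(grep -c "$pattern" || echo 0)
  # When nothing matches, grep prints "0" AND exits 1, so echo adds a second "0".
  n=$(grep -c -- "$pattern" || true)
  # Strip any whitespace padding before the value is used in arithmetic.
  n=$(printf '%s' "$n" | tr -d '[:space:]')
  printf '%s\n' "${n:-0}"
}
```

The result can then be used safely in an integer test, for example `[ "$(kubectl get pods -A | count_matches redis)" -eq 0 ]`.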

Proposed Test Strategy

Phase 1: Reset and Bootstrap

Factory reset Rancher Desktop
Re-enable Kubernetes, wait for ready
./uis build — rebuild the container image locally
./uis start — creates container from local image
./uis secrets apply
Verify cluster is healthy: kubectl get nodes, kubectl get pods -A
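The final health check can be scripted so that later steps do not start against a half-ready cluster. A minimal sketch, assuming kubectl is on the PATH and points at the Rancher Desktop cluster. The wait_for_cluster name, the KUBECTL override, and the 120s default timeout are illustrative and not part of uis.

```shell
# Hypothetical helper, not part of uis. KUBECTL is overridable for dry runs.
KUBECTL=${KUBECTL:-kubectl}

wait_for_cluster() {
  timeout=${1:-120s}   # assumed default; tune for the host
  # Block until every node reports Ready, or fail once the timeout expires.
  "$KUBECTL" wait --for=condition=Ready node --all --timeout="$timeout" || return 1
  # Then list pods in all namespaces for a manual look.
  "$KUBECTL" get pods -A
}
```

Usage: `wait_for_cluster 180s && ./uis deploy`.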

Phase 2: Deploy and Verify Each Service

Test each service: deploy → verify pods running → verify connectivity → undeploy → verify clean removal.
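The per-service cycle can be wrapped in a small function so that every service goes through the same steps. This is a sketch under assumptions: `./uis deploy <service>` is taken from this document, while the `undeploy` subcommand name, the cycle_service function, the UIS_CMD override, and the two verify hooks are hypothetical placeholders to be replaced with real pod and connectivity checks.

```shell
# UIS_CMD is overridable so the cycle can be dry-run against a stub.
UIS_CMD=${UIS_CMD:-./uis}

# Placeholder hooks: replace with real pod and connectivity probes per service.
verify_running() { return 0; }
verify_removed() { return 0; }

# Run one deploy/verify/undeploy/verify cycle; stop at the first failing step.
cycle_service() {
  svc=$1
  "$UIS_CMD" deploy "$svc"   || { echo "FAIL deploy $svc"; return 1; }
  verify_running "$svc"      || { echo "FAIL verify $svc"; return 1; }
  # "undeploy" is an assumed subcommand name; check the uis help output.
  "$UIS_CMD" undeploy "$svc" || { echo "FAIL undeploy $svc"; return 1; }
  verify_removed "$svc"      || { echo "FAIL cleanup $svc"; return 1; }
  echo "OK $svc"
}
```

Usage: `for s in nginx whoami postgresql; do cycle_service "$s" || break; done`.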

Suggested order (dependencies first):

nginx — used by automatic install to verify the system is started
whoami — simplest service, baseline test
postgresql — required by authentik, openwebui, litellm, unity-catalog
redis — required by authentik
mysql — standalone database
mongodb — standalone database
qdrant — standalone vector database
elasticsearch — standalone search
rabbitmq — standalone queue
authentik — depends on postgresql + redis
openwebui — depends on postgresql
litellm — depends on postgresql
prometheus — monitoring, standalone
grafana — monitoring, standalone
loki — monitoring, standalone
tempo — monitoring, standalone
otel-collector — monitoring, standalone
argocd — management
jupyterhub — data science
spark — data science
unity-catalog — data science, depends on postgresql
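If the order is edited later, a quick check can confirm that dependencies still come first. The sketch below encodes only the dependencies named in the list above and treats every other service as standalone. The deps_of and check_order names are hypothetical.

```shell
# Dependencies as stated in the suggested order; everything else is standalone.
deps_of() {
  case $1 in
    authentik) echo "postgresql redis" ;;
    openwebui|litellm|unity-catalog) echo "postgresql" ;;
    *) echo "" ;;
  esac
}

# Succeeds only if every service appears after all of its dependencies.
check_order() {
  seen=" "
  for svc in "$@"; do
    for dep in $(deps_of "$svc"); do
      case $seen in
        *" $dep "*) ;;   # dependency already listed earlier
        *) echo "$svc listed before its dependency $dep"; return 1 ;;
      esac
    done
    seen="$seen$svc "
  done
}
```

Usage: `check_order nginx whoami postgresql redis authentik` succeeds, while `check_order authentik postgresql redis` reports the first violation and returns non-zero.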

Skip for now (require external accounts or broken):

tailscale-tunnel — requires Tailscale auth key
cloudflare-tunnel — requires Cloudflare token
gravitee — was broken before migration

Phase 3: Stack Tests

After individual verification, test deploying full stacks:

Observability stack: prometheus + grafana + loki + tempo + otel-collector
AI stack: openwebui + litellm
Data science stack: jupyterhub + spark + unity-catalog
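The stacks above can be expressed as named groups so that a whole stack deploys with one call. A minimal sketch: the stack names, the stack_members and deploy_stack functions, and the UIS_CMD override are hypothetical, and only `./uis deploy <service>` comes from this document.

```shell
# UIS_CMD is overridable so the loop can be dry-run against a stub.
UIS_CMD=${UIS_CMD:-./uis}

# Map an (assumed) stack name to the services listed for it.
stack_members() {
  case $1 in
    observability) echo "prometheus grafana loki tempo otel-collector" ;;
    ai)            echo "openwebui litellm" ;;
    data-science)  echo "jupyterhub spark unity-catalog" ;;
    *) return 1 ;;
  esac
}

# Deploy every member of a stack in order; stop at the first failure.
deploy_stack() {
  members=$(stack_members "$1") || { echo "unknown stack: $1" >&2; return 2; }
  for svc in $members; do
    "$UIS_CMD" deploy "$svc" || return 1
  done
}
```

Usage: `deploy_stack observability`. The AI and data science stacks contain services that depend on postgresql (see Phase 2), so it should be deployed before calling `deploy_stack ai` or `deploy_stack data-science`.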

Resolved Questions

Tester workflow: The tester runs on the same Rancher Desktop, so factory reset wipes the tester's containers too. The contributor builds the image locally and copies the uis file to the tester directory.
How long does a full cycle take? TBD — will measure during testing.
Data persistence: For this test we wipe everything. Services like PostgreSQL start fresh. That's expected.
enabled-services.conf already has nginx as the only enabled service by default. The old system used nginx to verify that the system is started. No changes needed.
./uis is the standard path. Decided in PLAN-004/PLAN-005. The legacy install-rancher.sh is no longer used.
