# Troubleshooting Guide

**File:** docs/troubleshooting-readme.md
**Purpose:** Comprehensive troubleshooting guide for common issues and solutions
**Target Audience:** All users, developers, and administrators
**Last Updated:** September 22, 2024
## 📋 Overview
This guide covers common issues encountered in the Urbalurba Infrastructure platform and their solutions. Problems are organized by category to help you quickly find relevant troubleshooting steps.
## 🚀 Quick Diagnostic Commands

Before diving into specific issues, these commands help identify the problem area:

### Automated Debugging (Recommended)

```bash
# Complete cluster analysis (from provision-host)
./troubleshooting/debug-cluster.sh

# Service-specific debugging
./troubleshooting/debug-traefik.sh      # Ingress issues
./troubleshooting/debug-ai-litellm.sh   # AI platform issues
```
### Manual Commands

```bash
# Check overall cluster health
kubectl get nodes
kubectl get pods -A --field-selector=status.phase!=Running

# Check provision host
docker ps | grep provision-host
docker logs provision-host --tail=50

# Check ingress and services
kubectl get ingressroute -A
kubectl get svc -A

# Check storage
kubectl get pvc -A
kubectl get pv
```
## 🔧 Installation & Setup Issues

### Rancher Desktop Not Starting

**Symptoms:** Rancher Desktop fails to start or Kubernetes is not available

**Solutions:**

1. **Reset Rancher Desktop:**
   - Settings → Troubleshooting → Reset Kubernetes
   - Wait for the reset to complete, then restart

2. **Check system resources:**

   ```bash
   # Ensure sufficient memory (minimum 8GB recommended)
   free -h

   # Check disk space
   df -h
   ```

3. **Verify Docker context:**

   ```bash
   docker context list
   docker context use rancher-desktop
   ```
### Provision Host Container Issues

**Symptoms:** Cannot access provision-host or container not running

**Solutions:**

1. **Check container status:**

   ```bash
   docker ps | grep provision-host
   docker logs provision-host --tail=20
   ```

2. **Restart provision host:**

   ```bash
   docker stop provision-host
   docker start provision-host
   ```

3. **Check volume mounts:**

   ```bash
   # Verify mount point exists
   ls -la /mnt/urbalurbadisk/

   # Check Docker volume mounts
   docker inspect provision-host | grep Mounts -A 20
   ```
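If the grep output is hard to read, Docker's Go-template formatting can print just the mount entries as JSON (a small convenience sketch; `python3 -m json.tool` is used only for pretty-printing):

```bash
# Print only the mount entries, pretty-printed
docker inspect provision-host --format '{{json .Mounts}}' | python3 -m json.tool
```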
### Kubeconfig Issues

**Symptoms:** kubectl commands fail with connection errors

**Solutions:**

1. **Verify kubeconfig location:**

   ```bash
   echo $KUBECONFIG
   ls -la ~/.kube/config
   ```

2. **Test connection:**

   ```bash
   kubectl cluster-info
   kubectl get nodes
   ```

3. **Reset kubeconfig (from provision-host):**

   ```bash
   cp ~/.kube/config ~/.kube/config.backup
   # Re-run cluster connection setup
   ```
## 🏗️ Service Deployment Issues

### Pod Stuck in Pending State

**Symptoms:** Pods remain in Pending status

**Diagnosis:**

```bash
kubectl describe pod -n <namespace> <pod-name>
kubectl get events -n <namespace> --sort-by=.metadata.creationTimestamp
```

**Common Causes & Solutions:**

1. **Resource constraints:**

   ```bash
   kubectl top nodes
   kubectl describe node
   ```

2. **Storage issues:**

   ```bash
   kubectl get pvc -A
   kubectl describe pvc -n <namespace> <pvc-name>
   ```

3. **Image pull problems:**

   ```bash
   # Check image availability
   docker pull <image-name>

   # Check image pull secrets
   kubectl get secrets -n <namespace>
   ```
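To surface scheduling failures across the whole cluster in one pass, events can be filtered by reason (a minimal sketch; `FailedScheduling` is the scheduler's standard event reason, and the event message usually names the blocking condition, such as insufficient memory, an unbound PVC, or node taints):

```bash
# List recent scheduling failures cluster-wide
kubectl get events -A --field-selector reason=FailedScheduling \
  --sort-by=.metadata.creationTimestamp
```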
### Pod CrashLoopBackOff

**Symptoms:** Pods continuously restart

**Diagnosis:**

```bash
kubectl logs -n <namespace> <pod-name> --previous
kubectl describe pod -n <namespace> <pod-name>
```

**Solutions:**

1. **Check application logs:**

   ```bash
   kubectl logs -n <namespace> <pod-name> -f
   ```

2. **Review resource limits:**

   ```bash
   kubectl get pod -n <namespace> <pod-name> -o yaml | grep -A 5 resources
   ```

3. **Inspect configuration:**

   ```bash
   kubectl get configmap -n <namespace>
   kubectl get secrets -n <namespace>
   ```
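When you don't yet know which pod is failing, it helps to enumerate every crash-looping container first (a quick sketch using the container status waiting reason):

```bash
# List every pod whose container is currently in CrashLoopBackOff
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' \
  | grep CrashLoopBackOff
```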
### Service Connection Issues

**Symptoms:** Services unreachable or timing out

**Diagnosis:**

```bash
kubectl get svc -A
kubectl get endpoints -n <namespace>
kubectl describe svc -n <namespace> <service-name>
```

**Solutions:**

1. **Test internal connectivity:**

   ```bash
   kubectl run test-pod --image=curlimages/curl -it --rm -- sh
   # From inside the pod:
   curl http://<service-name>.<namespace>:<port>
   ```

2. **Check service selectors:**

   ```bash
   kubectl get pod -n <namespace> --show-labels
   kubectl describe svc -n <namespace> <service-name>
   ```
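A common root cause is a selector that matches no pods, which leaves the endpoints list empty. To check this directly, compare the service's selector with the labels on the pods it should target (a sketch; substitute the key/value pairs printed by the first command):

```bash
# Print the service's selector as JSON
kubectl get svc -n <namespace> <service-name> -o jsonpath='{.spec.selector}'

# List the pods that actually carry those labels; an empty result
# means the service has no endpoints
kubectl get pods -n <namespace> -l <key>=<value>
```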
## 🌐 Networking & Ingress Issues

### Traefik Ingress Not Working

**Symptoms:** Services not accessible via ingress URLs

**Diagnosis:**

```bash
kubectl get ingressroute -A
kubectl logs -n kube-system -l app.kubernetes.io/name=traefik
kubectl describe ingressroute -n <namespace> <route-name>
```

**Solutions:**

1. **Verify IngressRoute configuration:**

   ```bash
   # Check host patterns and rules
   kubectl get ingressroute -n <namespace> <route-name> -o yaml
   ```

2. **Test Traefik dashboard:**

   ```bash
   kubectl port-forward -n kube-system svc/traefik 8080:8080
   # Access http://localhost:8080
   ```

3. **Check DNS resolution:**

   ```bash
   nslookup <service>.localhost
   # Should resolve to 127.0.0.1
   ```
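With the dashboard port-forward from step 2 still active, Traefik's API can confirm which routers it actually loaded, which quickly shows whether an IngressRoute was picked up at all (a sketch that assumes the API/dashboard is enabled, as the port-forward above implies):

```bash
# List the HTTP routers Traefik has loaded, including their rules and status
curl -s http://localhost:8080/api/http/routers
```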
### localhost Domain Issues

**Symptoms:** http://service.localhost not accessible

**Solutions:**

1. **Check /etc/hosts (macOS/Linux):**

   ```bash
   cat /etc/hosts | grep localhost
   # Should include: 127.0.0.1 localhost
   ```

2. **Verify port forwarding:**

   ```bash
   kubectl get svc -n kube-system traefik
   # Should show ports 80:xxxxx and 443:xxxxx
   ```

3. **Test with IP directly:**

   ```bash
   curl -H "Host: service.localhost" http://127.0.0.1/
   ```
## 🔐 Authentication Issues

### Authentik SSO Problems

**Symptoms:** Cannot access Authentik or authentication loops

**Diagnosis:**

```bash
kubectl logs -n authentik -l app.kubernetes.io/name=authentik
kubectl get pod -n authentik
kubectl describe ingressroute -n authentik authentik
```

**Solutions:**

1. **Regain admin access:**

   ```bash
   # Generate a one-time recovery link for the akadmin user
   kubectl exec -n authentik <authentik-pod> -- ak create_recovery_key 10 akadmin
   ```

2. **Check Authentik configuration:**

   ```bash
   kubectl get configmap -n authentik authentik-config -o yaml
   ```

3. **Verify database connectivity:**

   ```bash
   kubectl logs -n authentik <authentik-pod> | grep -i database
   ```
### Forward Auth Middleware Issues

**Symptoms:** Protected services show authentication errors

**Diagnosis:**

```bash
kubectl get middleware -A
kubectl logs -n kube-system -l app.kubernetes.io/name=traefik | grep -i auth
```

**Solutions:**

1. **Check middleware configuration:**

   ```bash
   kubectl describe middleware -n <namespace> authentik-forward-auth
   ```

2. **Test the whoami service:**

   ```bash
   # Should work: http://whoami-public.localhost
   # Should redirect: http://whoami.localhost
   ```
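The redirect behavior can also be verified from the command line: a protected route should answer with an HTTP redirect to Authentik rather than serving the application directly (a quick check, assuming the whoami test services above are deployed):

```bash
# Expect a 302/307 redirect with a Location header pointing at Authentik
curl -sI http://whoami.localhost | head -5

# Expect a 200 from the unprotected service
curl -sI http://whoami-public.localhost | head -1
```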
## 💾 Database Issues

### PostgreSQL Connection Problems

**Symptoms:** Applications cannot connect to PostgreSQL

**Diagnosis:**

```bash
kubectl logs -n postgresql -l app=postgresql
kubectl get svc -n postgresql
kubectl exec -n postgresql <postgres-pod> -- pg_isready
```

**Solutions:**

1. **Test database connection:**

   ```bash
   kubectl exec -n postgresql <postgres-pod> -- psql -U postgres -c "SELECT version();"
   ```

2. **Check connection limits:**

   ```bash
   kubectl exec -n postgresql <postgres-pod> -- psql -U postgres -c "SHOW max_connections;"
   kubectl exec -n postgresql <postgres-pod> -- psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"
   ```

3. **Review configuration:**

   ```bash
   kubectl get configmap -n postgresql postgresql-config -o yaml
   ```
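For interactive debugging from your workstation, the database port can be forwarded locally (a sketch; the service name `postgresql` is an assumption, so adjust it to whatever `kubectl get svc -n postgresql` reports):

```bash
# Forward PostgreSQL to your local machine
kubectl port-forward -n postgresql svc/postgresql 5432:5432

# In another terminal, connect with a local psql client
psql -h 127.0.0.1 -U postgres -c "SELECT version();"
```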
### Database Migration Issues

**Symptoms:** Applications report database schema problems

**Solutions:**

1. **Check migration logs:**

   ```bash
   kubectl logs -n <namespace> <app-pod> | grep -i migration
   ```

2. **Manual migration (if needed):**

   ```bash
   kubectl exec -it -n <namespace> <app-pod> -- <migration-command>
   ```
## 🤖 AI Platform Issues

### OpenWebUI Not Loading

**Symptoms:** OpenWebUI interface shows errors or won't load

**Diagnosis:**

```bash
kubectl logs -n openwebui -l app=openwebui
kubectl get svc -n openwebui
kubectl describe pod -n openwebui
```

**Solutions:**

1. **Check OpenWebUI configuration:**

   ```bash
   kubectl get configmap -n openwebui openwebui-config -o yaml
   ```

2. **Verify LiteLLM connectivity:**

   ```bash
   kubectl exec -n openwebui <openwebui-pod> -- curl http://litellm.litellm:4000/health
   ```
### LiteLLM API Issues

**Symptoms:** AI models not responding or API errors

**Diagnosis:**

```bash
kubectl logs -n litellm -l app=litellm
kubectl exec -n litellm <litellm-pod> -- curl http://localhost:4000/health
```

**Solutions:**

1. **Check API keys:**

   ```bash
   kubectl get secrets -n litellm litellm-secrets
   ```

2. **Test model availability:**

   ```bash
   # /v1/models follows the OpenAI convention and is a GET endpoint
   kubectl exec -n litellm <litellm-pod> -- curl http://localhost:4000/v1/models
   ```
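If the endpoint rejects unauthenticated requests, pass the proxy's master key as a bearer token (a sketch; `LITELLM_MASTER_KEY` is the environment variable LiteLLM deployments commonly use, but verify how the key is provided in this cluster's secrets):

```bash
# Run curl inside the pod so the container's own environment
# variable supplies the key
kubectl exec -n litellm <litellm-pod> -- sh -c \
  'curl -s -H "Authorization: Bearer $LITELLM_MASTER_KEY" http://localhost:4000/v1/models'
```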
## 📊 Monitoring Issues

### Grafana Dashboard Problems

**Symptoms:** Grafana not accessible or missing data

**Diagnosis:**

```bash
kubectl logs -n grafana -l app.kubernetes.io/name=grafana
kubectl get svc -n grafana
```

**Solutions:**

1. **Retrieve the Grafana admin password:**

   ```bash
   kubectl get secrets -n grafana grafana-admin-secret
   ```

2. **Check data sources:**

   ```bash
   # Access Grafana and verify the Prometheus connection
   # URL: http://grafana.localhost
   ```
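To read the password itself, decode it from the secret (a sketch; the data key `admin-password` is an assumption, so list the actual keys first with `kubectl describe secret -n grafana grafana-admin-secret`):

```bash
# Decode the admin password from the secret
kubectl get secret -n grafana grafana-admin-secret \
  -o jsonpath='{.data.admin-password}' | base64 -d; echo
```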
## ☁️ Cloud Deployment Issues

### Azure AKS Connection Problems

**Symptoms:** Cannot connect to AKS cluster

**Solutions:**

1. **Update kubeconfig:**

   ```bash
   az aks get-credentials --resource-group <rg> --name <cluster>
   ```

2. **Check Azure CLI authentication:**

   ```bash
   az account show
   az aks list
   ```
### Tailscale VPN Issues

**Symptoms:** Cannot access remote hosts via Tailscale

**Solutions:**

1. **Check Tailscale status:**

   ```bash
   tailscale status
   tailscale ping <host>
   ```

2. **Restart Tailscale:**

   ```bash
   sudo systemctl restart tailscaled   # Linux
   # Or restart the Tailscale app on macOS/Windows
   ```
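If the daemon is running but peers stay unreachable, Tailscale's built-in connectivity probe can show whether the local network blocks the coordination or relay infrastructure:

```bash
# Report UDP reachability, nearest DERP relays, and NAT behavior
tailscale netcheck
```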
## 🔄 Recovery Procedures

### Complete Cluster Reset

When multiple issues persist, a complete reset may be needed:

1. **Backup important data:**

   ```bash
   # Export important configurations
   kubectl get secrets -A -o yaml > secrets-backup.yaml
   kubectl get configmap -A -o yaml > configmaps-backup.yaml
   ```

2. **Reset Rancher Desktop:**
   - Settings → Troubleshooting → Reset Kubernetes
   - Wait for the reset to complete

3. **Restore provision-host:**

   ```bash
   # Restart provision-host container
   docker stop provision-host
   docker start provision-host
   ```

4. **Re-provision services:**

   ```bash
   docker exec -it provision-host bash
   cd /mnt/urbalurbadisk/
   ./provision-host/kubernetes/provision-kubernetes.sh
   ```
### Data Recovery

If persistent data is lost:

1. **Check persistent volumes:**

   ```bash
   kubectl get pv
   kubectl describe pv <volume-name>
   ```

2. **Restore from backups (if available):**

   ```bash
   # Restore database backups
   kubectl exec -i -n postgresql <postgres-pod> -- psql -U postgres < backup.sql
   ```
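To make future recoveries possible, it is worth taking periodic dumps so a restore point exists (a minimal sketch; `<database>` stands for whichever databases your applications use):

```bash
# Dump a database to a local file that the restore command above can consume
kubectl exec -n postgresql <postgres-pod> -- pg_dump -U postgres <database> > backup.sql
```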
## 🆘 Getting Additional Help

### Log Collection

When reporting issues, include these logs:

```bash
# Cluster overview
kubectl get all -A > cluster-overview.txt

# Pod issues
kubectl describe pod -n <namespace> <pod-name> > pod-details.txt
kubectl logs -n <namespace> <pod-name> --previous > pod-logs.txt

# Events
kubectl get events -A --sort-by=.metadata.creationTimestamp > events.txt

# Provision host logs
docker logs provision-host > provision-host.log
```
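Bundling the collected files into a single archive makes them easier to attach to an issue report:

```bash
# Package everything gathered above into one timestamped archive
tar czf support-bundle-$(date +%Y%m%d-%H%M%S).tar.gz \
  cluster-overview.txt pod-details.txt pod-logs.txt events.txt provision-host.log
```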
### System Information

Include system details when requesting help:

```bash
# Kubernetes version
kubectl version

# Node information
kubectl get nodes -o wide

# Docker information
docker version
docker system info

# Host system
uname -a
df -h
free -h
```
## 🤖 Automated Debugging Scripts

The platform includes comprehensive debugging scripts in the `troubleshooting/` folder:

### Cluster-Wide Debugging

**debug-cluster.sh** - Complete cluster health analysis:

```bash
# Run from provision-host
./troubleshooting/debug-cluster.sh [namespace]
```

Features:
- Collects all resource information across namespaces
- Identifies unhealthy pods and retrieves their logs
- Analyzes resource usage and storage issues
- Generates timestamped output files with cleanup
- Provides actionable recommendations

**export-cluster-status.sh** - Full cluster snapshot:

```bash
# Export complete cluster configuration
./troubleshooting/export-cluster-status.sh [cluster-name]
```

Creates:
- Individual files for each Kubernetes resource type
- Compressed archive for easy sharing with support
- Version information for key services
- Complete cluster configuration snapshot
### Service-Specific Debugging

**Traefik Ingress** (debug-traefik.sh):

```bash
./troubleshooting/debug-traefik.sh
```

- IngressRoute and Middleware analysis
- Traefik pod and service diagnostics
- Custom resource validation
- Network connectivity checks

**AI Platform** (debug-ai-litellm.sh):

```bash
./troubleshooting/debug-ai-litellm.sh [namespace]
```

- LiteLLM configuration and API health
- Model availability and routing
- Secret and ConfigMap validation
- API connectivity testing

**Other Service Scripts:**
- debug-ai-openwebui.sh - OpenWebUI diagnostics
- debug-ai-ollama-cluster.sh - Ollama cluster debugging
- debug-ai-qdrant.sh - Vector database diagnostics
- debug-redis.sh - Redis connectivity and performance
- debug-mongodb.sh - MongoDB cluster analysis
- debug-elasticsearch.sh - Elasticsearch cluster health
### Using the Debug Scripts

1. **Access provision-host:**

   ```bash
   docker exec -it provision-host bash
   cd /mnt/urbalurbadisk/
   ```

2. **Run the appropriate debug script:**

   ```bash
   # For general cluster issues
   ./troubleshooting/debug-cluster.sh

   # For specific service issues
   ./troubleshooting/debug-traefik.sh
   ./troubleshooting/debug-ai-litellm.sh
   ```

3. **Review generated output:**

   ```bash
   # Output saved to troubleshooting/output/
   ls troubleshooting/output/
   cat troubleshooting/output/debug-cluster-*.txt
   ```
### Debug Output Features

- **Timestamped files** - Each run creates uniquely named output
- **Automatic cleanup** - Keeps only the 3 most recent debug files
- **Structured sections** - Organized by problem area
- **Status tracking** - Success/failure indicators for each check
- **Log extraction** - Automatic collection from problematic pods
- **Recommendations** - Specific next steps based on findings
## Contact and Resources

- 📖 **Documentation:** Start with doc/README.md
- 🏗️ **Architecture:** Review doc/overview-system-architecture.md
- 🔧 **Commands:** See doc/provision-host-commands.md
- 🤖 **Debug Scripts:** Use the automated tools in the `troubleshooting/` folder
- 🐛 **Issues:** Report at the GitHub repository issues page
> 💡 **Remember:** Most issues can be resolved by checking logs, verifying configuration, and ensuring services are running. When in doubt, start with the automated debugging scripts or the basic diagnostic commands at the top of this guide.