Provisioning Rules and Standards
File: docs/rules-provisioning.md
Purpose: Define the IMPLEMENTATION LAYER - how to write individual deployment scripts and playbooks
Target Audience: Developers, DevOps engineers, and LLMs creating deployment scripts
Scope: Script/playbook patterns, testing standards, error handling, and implementation best practices
Relationship to Other Rules
This document covers how to implement individual deployment scripts:
- Shell script + Ansible playbook pattern
- Testing and verification standards
- Error handling and progress feedback
- Implementation best practices
For how scripts are organized and executed automatically, see: Rules for Automated Kubernetes Deployment (the orchestration and automation framework).
Overview
This document establishes mandatory patterns for writing deployment scripts and playbooks in the Urbalurba Infrastructure. These patterns ensure reliability, consistency, and maintainability.
Core Deployment Architecture
Rule 1: Script + Ansible Pattern
All deployments MUST follow the Script + Ansible pattern:
scripts/packages/[service].sh  →  ansible/playbooks/[nnn]-setup-[service].yml
(minimal orchestration)           (heavy lifting implementation)
Script Responsibilities (Keep Minimal):
- ✅ Check prerequisites (kubectl access, basic dependencies)
- ✅ Call Ansible playbook with proper parameters
- ✅ Display final success/failure message
- ❌ NO business logic - delegate to Ansible
- ❌ NO complex operations - keep scripts simple
Ansible Playbook Responsibilities (Heavy Lifting):
- ✅ All deployment logic and verification
- ✅ Resource creation and configuration
- ✅ Comprehensive testing and validation
- ✅ Error handling with proper retry mechanisms
- ✅ Status reporting and troubleshooting information
Example Structure:
# scripts/packages/litellm.sh
#!/bin/bash
set -e

echo "Deploying LiteLLM AI Gateway..."
ansible-playbook ansible/playbooks/210-setup-litellm.yml
echo "LiteLLM deployment complete"

# ansible/playbooks/210-setup-litellm.yml
- name: Deploy LiteLLM with comprehensive validation
  # ... all the actual deployment logic
Script Template Pattern
Rule 1B: Script Naming Convention
⚠️ See docs/rules-naming-conventions.md for complete naming patterns.
Quick Reference:
- Setup Script: `[NN]-setup-[service-name].sh` (e.g., `05-setup-postgres.sh`)
- Remove Script: `[NN]-remove-[service-name].sh` (same number prefix)
MANDATORY: Every setup script MUST have a corresponding remove script for clean uninstallation.
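A minimal remove script sketch that follows this convention (the playbook number and filename are illustrative assumptions; point it at your service's actual removal playbook):

# scripts/packages/05-remove-postgres.sh (illustrative)
#!/bin/bash
set -e

echo "Removing PostgreSQL..."
ansible-playbook ansible/playbooks/050-remove-postgres.yml  # illustrative playbook path
echo "PostgreSQL removal complete"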
Rule 1C: Check Existing Playbooks First
MANDATORY: Before creating any new Ansible playbook, you MUST:
1. Search existing playbooks: Check `ansible/playbooks/` for existing implementations:
   # Search for similar service names
   find ansible/playbooks -name "*[service-name]*" -type f
   # Search for functionality in playbook content
   grep -r "service-functionality" ansible/playbooks/
2. Review existing playbook capabilities: Many existing playbooks support multiple operations via parameters.
   - Look for an `operation` parameter (e.g., `deploy`, `delete`, `verify`)
   - Check variable definitions and supported modes
   - Review task blocks for conditional logic
3. Extend existing playbooks rather than create new ones when possible:
   - Add new `operation` modes to existing playbooks
   - Add conditional blocks for new functionality
   - Maintain consistency with existing patterns
4. Create new playbooks ONLY when:
   - No existing playbook handles the service
   - Functionality is completely different from existing patterns
   - Combining would make the existing playbook overly complex
Example: The whoami service already has 025-setup-whoami-testpod.yml with both deploy and delete operations. Use this instead of creating new playbooks.
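For instance, a hedged invocation sketch, assuming the playbook exposes these modes through an `operation` extra variable (check the playbook's vars for its exact interface):

# Reuse the existing playbook instead of writing a new one
ansible-playbook ansible/playbooks/025-setup-whoami-testpod.yml -e operation=deploy
ansible-playbook ansible/playbooks/025-setup-whoami-testpod.yml -e operation=delete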
Rule 1D: Standard Script Structure
All deployment scripts MUST follow this template pattern:
#!/bin/bash
# filename: [NN]-setup-[service].sh
# description: Deploy [service] to Kubernetes cluster
TARGET_HOST=${1:-"rancher-desktop"}
STATUS=()
ERROR=0
echo "Starting [service] setup on $TARGET_HOST"
echo "---------------------------------------------------"
# Step 1: Verify prerequisites
# Step 2: Apply configurations
# Step 3: Deploy via Helm/manifests
# Step 4: Verify deployment
print_summary() {
echo "---------- Installation Summary ----------"
for step in "${STATUS[@]}"; do
echo "$step"
done
if [ $ERROR -eq 0 ]; then
echo "All steps completed successfully."
else
echo "Some steps failed. Please check the logs."
fi
}
main() {
# Implementation here
print_summary
}
main "$@"
exit $ERROR
Key Requirements:
- Accept `TARGET_HOST` as first parameter
- Use `STATUS` array to track step completion
- Use `ERROR` variable for exit code
- Include `print_summary()` function
- Call `main "$@"` and `exit $ERROR`
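As a sketch of how a step inside `main()` can feed `STATUS` and `ERROR` (the step function and its check are illustrative, not part of the mandatory template):

# Illustrative step implementation for the template above
verify_prerequisites() {
    if kubectl cluster-info >/dev/null 2>&1; then
        STATUS+=("Step 1 - Verify prerequisites: OK")
    else
        STATUS+=("Step 1 - Verify prerequisites: FAILED")
        ERROR=1
    fi
}

main() {
    verify_prerequisites
    # ... remaining steps follow the same STATUS/ERROR pattern ...
    print_summary
}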
Testing Requirements
Rule 2: No .localhost Testing from Host Context
❌ CRITICAL ERROR - Never Do This:
# WRONG: Testing .localhost from Ansible (host context)
- name: Test service
ansible.builtin.uri:
url: "http://service.localhost/health" # Will fail!
Problem: Ansible runs on the host machine where .localhost domains resolve to 127.0.0.1 (the host itself), not to the Traefik ingress controller running in the cluster.
Background: The cluster uses a dual-context DNS architecture (detailed in docs/rules-ingress-traefik.md):
- External/Browser Context: `service.localhost` → `127.0.0.1` → Traefik → Service ✅
- Internal/Pod Context: `service.localhost` → CoreDNS rewrite → ClusterIP → Service ✅
- Host/Ansible Context: `service.localhost` → `127.0.0.1` (host machine) ❌
Rule 3: Mandatory Cluster-Internal Testing
✅ CORRECT: Use kubectl run for all service tests:
# CORRECT: Test from within cluster using temporary pod
- name: Test service connectivity from within cluster
ansible.builtin.shell: |
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never -n {{ namespace }} -- \
curl -s -w "HTTP_CODE:%{http_code}" http://{{ service }}:{{ port }}/health
register: service_test
retries: 5
delay: 5
until: service_test.rc == 0 and (service_test.stdout.find('HTTP_CODE:200') != -1 or service_test.stdout.find('HTTP_CODE:401') != -1)
Why This Works:
- ✅ True Cluster Context: Test pod runs inside cluster with proper DNS resolution
- ✅ Service-to-Service Testing: Tests actual communication paths other pods will use
- ✅ Temporary & Clean: `--rm` automatically removes test pod
- ✅ No Dependencies: Doesn't require existing pods to have curl/python
- ✅ Reliable: Uses proven pattern from working playbooks
Note: the `until` condition above also accepts HTTP 401, because an authentication challenge still proves the service is reachable.
Testing Sequence Requirements:
1. Internal Service Test: Verify service responds within cluster
2. API Functionality Test: Test actual API endpoints with authentication
3. IngressRoute Verification: Confirm Traefik routing is configured
4. Integration Test: Verify service integrates with dependencies
Error Handling Rules
Rule 4: No Error Ignoring for Critical Dependencies
❌ WRONG: Ignoring errors when next steps depend on success:
- name: Deploy database
command: helm install postgres ...
ignore_errors: true # WRONG! Next steps need this to succeed
- name: Create application tables # Will fail if database not deployed
command: kubectl exec postgres -- psql ...
✅ CORRECT: Fail fast for critical dependencies:
- name: Deploy database
command: helm install postgres ...
# No ignore_errors - let it fail if database can't deploy
- name: Wait for database to be ready
kubernetes.core.k8s_info:
kind: Pod
namespace: "{{ namespace }}"
label_selectors:
- app=postgresql
register: db_pods
retries: 20
delay: 15
until: db_pods.resources | length > 0 and db_pods.resources[0].status.phase == "Running"
- name: Create application tables
command: kubectl exec postgres -- psql ...
# Now safe to run because database is verified ready
Rule 5: When to Use ignore_errors
✅ Safe to ignore errors:
- Cleanup operations (`pkill kubectl proxy`)
- Optional optimizations (cache warmup)
- Non-critical status reporting
- Tests that don't block deployment progress
❌ Never ignore errors for:
- Service deployment steps
- Database/storage setup
- Required secret creation
- Network/ingress configuration
- Any step that subsequent steps depend on
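A sketch of an acceptable use, on a cleanup task whose failure is harmless (the process pattern is illustrative):

# Safe: the proxy may simply not be running, and nothing downstream depends on this
- name: Stop any leftover kubectl proxy from a previous run
  ansible.builtin.shell: pkill -f "kubectl proxy"
  ignore_errors: true
  changed_when: false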
Verification Standards
Rule 6: Comprehensive Verification Required
Every deployment MUST include:
- Pod Readiness Check with Progress Feedback (Two-Stage Pattern):
RECOMMENDED: Two-Stage Pod Readiness Verification
For robust deployment verification, use the two-stage pattern:
Stage 1: Wait for Pod Running
- name: Wait for service pods to be ready (with progress indicators)
kubernetes.core.k8s_info:
kind: Pod
namespace: "{{ namespace }}"
label_selectors:
- app.kubernetes.io/name={{ service_name }}
register: service_pods
retries: 20
delay: 15
until: >
service_pods.resources | length > 0 and
service_pods.resources[0].status.phase == "Running"
Stage 2: Wait for Container Ready
- name: Wait for service pods to be fully ready (1/1)
kubernetes.core.k8s_info:
kind: Pod
namespace: "{{ namespace }}"
label_selectors:
- app.kubernetes.io/name={{ service_name }}
register: service_pods_ready
retries: 30
delay: 10
until: >
service_pods_ready.resources | length > 0 and
service_pods_ready.resources[0].status.containerStatuses[0].ready == true
Why Two Stages?
- Stage 1 (Running): Pod scheduled, containers started, image pulled
- Stage 2 (Ready): Application initialized, readiness probes passing, ready for traffic
- Benefits: Prevents false positives where pod exists but application isn't ready
- Use Cases: Databases, message queues, complex applications with startup sequences
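Note that `containerStatuses[0]` checks only the first container. For multi-container pods (for example, with sidecars), a hedged variant of Stage 2 can require every container to be ready:

# Variant (assumption: multi-container pod): all containers must report ready
- name: Wait for all containers in the service pod to be ready
  kubernetes.core.k8s_info:
    kind: Pod
    namespace: "{{ namespace }}"
    label_selectors:
      - app.kubernetes.io/name={{ service_name }}
  register: service_pods_ready
  retries: 30
  delay: 10
  until: >
    service_pods_ready.resources | length > 0 and
    service_pods_ready.resources[0].status.containerStatuses is defined and
    service_pods_ready.resources[0].status.containerStatuses | rejectattr('ready') | list | length == 0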
Alternative: Single-Stage Pattern (Minimum Requirement)
- name: Wait for service pods to be ready (with progress indicators)
kubernetes.core.k8s_info:
kind: Pod
namespace: "{{ namespace }}"
label_selectors:
- app.kubernetes.io/name={{ service_name }}
register: service_pods
retries: 20
delay: 15
until: >
service_pods.resources | length > 0 and
service_pods.resources[0].status.phase == "Running"
- Service Connectivity Test:
- name: Test service connectivity from within cluster
ansible.builtin.shell: |
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never -n {{ namespace }} -- \
curl -s -w "HTTP_CODE:%{http_code}" http://{{ service }}:{{ port }}/health
- IngressRoute Verification:
- name: Verify IngressRoute is created
kubernetes.core.k8s_info:
api_version: traefik.io/v1alpha1
kind: IngressRoute
namespace: "{{ namespace }}"
name: "{{ service_name }}"
register: ingress_check
retries: 5
delay: 2
until: ingress_check.resources | length > 0
- Functional API Test:
- name: Test API functionality
# Use kubectl proxy or port-forward for API-specific tests
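One hedged way to implement such a test, reusing the in-cluster curl pattern (the endpoint path and the `api_key` variable are illustrative assumptions; substitute the service's real API and credentials):

# Illustrative API functionality test
- name: Test API functionality with authentication
  ansible.builtin.shell: |
    kubectl run api-test --image=curlimages/curl --rm -i --restart=Never -n {{ namespace }} -- \
      curl -s -w "HTTP_CODE:%{http_code}" -H "Authorization: Bearer {{ api_key }}" \
      http://{{ service }}:{{ port }}/api/status
  register: api_test
  retries: 5
  delay: 5
  until: api_test.rc == 0 and api_test.stdout.find('HTTP_CODE:200') != -1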
Rule 7: Progress Feedback for Long-Running Tasks
All tasks that may take longer than 30 seconds MUST provide progress feedback to prevent the appearance of hanging.
❌ WRONG: Silent long-running tasks:
# WRONG: 10 minutes of silence - appears to hang
- name: Wait for pods to be ready
shell: kubectl wait --timeout=600s ...
✅ CORRECT: Ansible retries with progress indicators:
# CORRECT: Progress every 15 seconds with retry counters
- name: Wait for pods to be ready (with progress indicators)
kubernetes.core.k8s_info:
kind: Pod
label_selectors: [...]
retries: 40 # Clear total attempt count
delay: 15 # Regular progress intervals
until: condition_met
Expected User Experience:
FAILED - RETRYING: [localhost]: Wait for pods (40 retries left).
FAILED - RETRYING: [localhost]: Wait for pods (39 retries left).
FAILED - RETRYING: [localhost]: Wait for pods (38 retries left).
...
ok: [localhost]
Benefits:
- ✅ User sees system is active, not hanging
- ✅ Clear countdown shows progress and remaining time
- ✅ Predictable feedback rhythm (every 15 seconds)
- ✅ Transparent about retry attempts and timeouts
Rule 8: Task Naming and Numbering Standards
All Ansible tasks MUST follow consistent naming and numbering conventions.
⚠️ See docs/rules-naming-conventions.md for complete patterns.
Mandatory Requirements:
- ✅ Sequential numbering: Tasks numbered 1, 2, 3... in execution order
- ✅ Descriptive names: Clear action description after the number
- ✅ Consistent format: `- name: N. Action description`
✅ CORRECT Examples:
- name: 1. Deploy database via Helm
- name: 2. Wait for database pods to be ready
- name: 3. Test database connectivity from within cluster
- name: 4. Apply database ingress configuration
- name: 5. Display database deployment status
❌ WRONG Examples:
- name: Deploy database # Missing number
- name: 3. Deploy database # Wrong sequence (should be 1)
- name: 2. Wait # Not descriptive enough
- name: Step 2 - Deploy # Wrong format
Benefits:
- ✅ Easy debugging: Error messages show exact task sequence
- ✅ Clear progress: Users see how far the run has progressed
- ✅ Maintainability: Easy to reference specific tasks in documentation
- ✅ Troubleshooting: "Failed at task 7" immediately identifies the problem
Refactoring Rule: When adding/removing tasks, renumber all subsequent tasks to maintain sequence.
Rule 8B: First Task MUST Display Deployment Information
MANDATORY: Every Ansible playbook MUST start with Task 1 that displays deployment context information.
Required Format:
- name: 1. Display deployment information
ansible.builtin.debug:
msg:
- "======================================"
- "[Service Name] Deployment"
- "File: ansible/playbooks/[nnn]-setup-[service].yml"
- "======================================"
- "Target Host: {{ target_host }}"
- "Namespace: {{ namespace }}"
- "Component: {{ component_name }}"
- "[Additional context as needed]"
Why This Matters:
- ✅ Immediate Context: User sees what playbook is running and where
- ✅ Debugging: Log files clearly show which playbook generated output
- ✅ Parameter Verification: Confirms correct target host and namespace before deployment
- ✅ Documentation: File path shows exact source for troubleshooting
- ✅ Consistency: Uniform format across all playbooks
Real Example from Grafana Setup:
tasks:
- name: 1. Display deployment information
ansible.builtin.debug:
msg:
- "======================================"
- "Grafana Deployment"
- "File: ansible/playbooks/034-setup-grafana.yml"
- "======================================"
- "Target Host: {{ target_host }}"
- "Namespace: {{ namespace }}"
- "Component: {{ component_name }}"
- "Config File: {{ grafana_config_file }}"
Output Example:
TASK [1. Display deployment information] ***********************
ok: [localhost] => {
"msg": [
"======================================",
"Grafana Deployment",
"File: ansible/playbooks/034-setup-grafana.yml",
"======================================",
"Target Host: rancher-desktop",
"Namespace: monitoring",
"Component: grafana",
"Config File: /mnt/urbalurbadisk/manifests/034-grafana-config.yaml"
]
}
Removal Playbooks: Use the same format but with "[Service Name] Removal" as the title and file path pointing to the remove playbook.
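A hedged sketch of such a removal header (the playbook number and path are illustrative assumptions):

- name: 1. Display removal information
  ansible.builtin.debug:
    msg:
      - "======================================"
      - "Grafana Removal"
      - "File: ansible/playbooks/034-remove-grafana.yml"
      - "======================================"
      - "Target Host: {{ target_host }}"
      - "Namespace: {{ namespace }}"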
Rule 9: Status Reporting Standards
Every playbook MUST end with a comprehensive status report:
- name: Display final deployment status
  ansible.builtin.debug:
    msg:
      - "==============================================="
      - "{{ service_name | title }} Deployment Status"
      - "==============================================="
      - ""
      - "✅ SUCCESS - All components verified and running"
      - ""
      - "Status:"
      - "• Service connectivity: ✅ Internal cluster communication verified"
      - "• API responding: ✅ Functional tests passed"
      - "• IngressRoute: ✅ Traefik routing configured"
      - ""
      - "Access Instructions:"
      - "• Port-forward: kubectl port-forward svc/{{ service_name }} {{ port }}:{{ port }} -n {{ namespace }}"
      - "• Ingress: http://{{ service_name }}.localhost"
      - ""
      - "Troubleshooting:"
      - "• Check pod status: kubectl get pods -n {{ namespace }}"
      - "• View logs: kubectl logs -f <pod-name> -n {{ namespace }}"
      - "==============================================="
File Organization Rules
Rule 10: Utility Playbook Structure
All files in `ansible/playbooks/utility/` MUST be complete playbooks, not just task lists.
❌ WRONG: Task list format:
# utility/u06-database-setup.yml - WRONG!
- name: Create database user
postgresql_user: ...
- name: Create database
postgresql_db: ...
✅ CORRECT: Complete playbook format:
# utility/u06-database-setup.yml - CORRECT!
---
- name: Database setup utility
hosts: localhost
gather_facts: false
vars:
database_name: "{{ database_name | default('myapp') }}"
database_user: "{{ database_user | default('myuser') }}"
tasks:
- name: 1. Create database user
postgresql_user: ...
- name: 2. Create database
postgresql_db: ...
Usage in Main Playbooks:
# Main playbook imports utility
- import_playbook: utility/u06-database-setup.yml
vars:
database_name: "openwebui"
database_user: "openwebui"
Benefits:
- ✅ Reusable: Can be run standalone or imported
- ✅ Testable: Can be tested independently
- ✅ Parameterized: Accepts variables for different use cases
- ✅ Complete: Has proper playbook structure with hosts, vars, tasks
- ✅ Maintainable: Clear separation of concerns
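Because utility playbooks are complete playbooks, they can also be run standalone, for example when testing them in isolation (the variable values here are illustrative):

# Run the utility independently with explicit parameters
ansible-playbook ansible/playbooks/utility/u06-database-setup.yml \
  -e database_name=testdb -e database_user=testuser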
Rule 11: Helm Repository Management
Every playbook that uses Helm charts MUST be responsible for managing its required Helm repositories.
Mandatory Requirements:
- ✅ Check existing repositories: Verify what's already configured
- ✅ Add missing repositories: Add only repositories that are needed and missing
- ✅ Update repositories: Refresh repository indexes before installation
- ✅ Self-contained: Never assume repositories are pre-configured
✅ CORRECT Pattern:
- name: N. Check existing Helm repositories
ansible.builtin.command: helm repo list
register: helm_repo_list
changed_when: false
- name: N+1. Add required Helm repositories if missing
kubernetes.core.helm_repository:
name: "{{ item.name }}"
repo_url: "{{ item.url }}"
loop:
- { name: 'bitnami', url: 'https://charts.bitnami.com/bitnami' }
- { name: 'open-webui', url: 'https://helm.openwebui.com/' }
when: item.name not in helm_repo_list.stdout
- name: N+2. Update Helm repositories
ansible.builtin.command: helm repo update
changed_when: false
- name: N+3. Deploy service via Helm
ansible.builtin.command: >
helm upgrade --install {{ service_name }} {{ chart_name }}
-f {{ config_file }}
--namespace {{ namespace }}
Benefits:
- ✅ Self-contained: Playbook doesn't depend on external setup
- ✅ Idempotent: Safe to run multiple times
- ✅ Efficient: Only adds missing repositories
- ✅ Reliable: Fresh repository indexes before deployment
- ✅ Debuggable: Clear separation of repository and deployment steps
❌ WRONG: Assuming pre-configured repositories:
# WRONG: Assumes repositories are already configured
- name: Deploy service via Helm
helm: chart=some-chart/service-name ... # May fail if repo missing
Rule 12: Utility Playbook Error Handling
When calling utility playbooks from main playbooks, you MUST implement the "quiet success, verbose failure" pattern.
Mandatory Requirements:
- ✅ Capture output: Always `register` the result of utility playbook calls
- ✅ Silent success: No output display when utility playbook succeeds
- ✅ Verbose failure: Show full utility playbook output when it fails
- ✅ Proper error handling: Use `failed_when` to catch non-zero exit codes
❌ WRONG: No error diagnostics:
# WRONG: Utility failure provides no diagnostic information
- name: 1. Set up database
ansible.builtin.shell: |
ansible-playbook utility/database-setup.yml
register: db_result
failed_when: db_result.rc != 0
✅ CORRECT: Error diagnostics on failure:
# CORRECT: Shows utility output only when debugging is needed
- name: 1. Set up database
ansible.builtin.shell: |
ansible-playbook utility/database-setup.yml -e operation=create
args:
chdir: /path/to/playbooks
register: db_result
changed_when: db_result.rc == 0
failed_when: db_result.rc != 0
- name: 1.1. Display utility playbook output on failure
ansible.builtin.debug:
msg:
- "โ Database setup failed!"
- "Full output from utility playbook:"
- "{{ db_result.stdout_lines }}"
when: db_result.rc != 0
Benefits:
- ✅ Clean output during normal operations (quiet success)
- ✅ Full diagnostic information when troubleshooting is needed (verbose failure)
- ✅ No subprocess output buffering issues
- ✅ Maintains utility playbook independence
Rule 13: Consistent File Naming and Numbering
TODO: we need to revise numbering (someday)
scripts/packages/[service-name].sh
ansible/playbooks/[nnn]-setup-[service-name].yml
ansible/playbooks/utility/[unn]-[purpose].yml
manifests/[nnn]-[service-name]-[component].yaml
provision-host/kubernetes/[nn]-[category]/[nn]-setup-[service].sh
Numbering Convention:
- 000-099: Core infrastructure (storage, networking, DNS)
- 030-039: Observability (monitoring, logging, tracing)
- 040-099: Data services (databases, caches, message queues)
- 070-079: Authentication (Authentik, OAuth providers)
- 200-229: AI services (OpenWebUI, LiteLLM, Ollama)
- 600-799: Management tools (admin interfaces, development tools)
Rule 14: Retry and Timeout Patterns
All deployment tasks MUST use appropriate retry patterns with visible progress indicators instead of silent long-running operations.
Retry Patterns by Use Case:
1. Pod Startup (Standard Services):
# Most services: 20 retries × 15s = 5 minutes
- name: Wait for service pods to be ready
kubernetes.core.k8s_info:
kind: Pod
namespace: "{{ namespace }}"
label_selectors:
- app={{ service_name }}
register: service_pods
retries: 20
delay: 15
until: >
service_pods.resources | length > 0 and
service_pods.resources[0].status.phase == "Running"
2. Pod Startup (Heavy Container Images):
# OpenWebUI with large container: 80 retries × 15s = 20 minutes
- name: Wait for OpenWebUI pods (large container download)
kubernetes.core.k8s_info:
kind: Pod
namespace: "{{ ai_namespace }}"
label_selectors:
- app=open-webui
register: openwebui_pods
retries: 80 # Extra time for container image download
delay: 15
until: >
openwebui_pods.resources | length > 0 and
openwebui_pods.resources[0].status.phase == "Running"
3. Service Connectivity Tests:
# HTTP health checks: 15 retries × 15s = ~4 minutes
- name: Test OpenWebUI HTTP response
ansible.builtin.shell: |
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never -n {{ namespace }} -- \
curl -s -w "HTTP_CODE:%{http_code}" http://open-webui/health
register: openwebui_http_response
retries: 15
delay: 15
until: openwebui_http_response.rc == 0 and openwebui_http_response.stdout.find('HTTP_CODE:200') != -1
4. Resource Creation Checks:
# Quick resource checks: 5 retries × 2s = 10 seconds
- name: Verify IngressRoute is created
kubernetes.core.k8s_info:
api_version: traefik.io/v1alpha1
kind: IngressRoute
namespace: "{{ namespace }}"
name: "{{ service_name }}"
register: ingress_check
retries: 5
delay: 2
until: ingress_check.resources | length > 0
Timeout Guidelines:
- Lightweight services: 20 retries × 15s = 5 minutes
- Heavy container images: 80 retries × 15s = 20 minutes (OpenWebUI pattern)
- HTTP connectivity tests: 15 retries × 15s = ~4 minutes
- Resource existence checks: 5 retries × 2s = 10 seconds
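When sizing a new wait, a simple rule of thumb (a suggestion, not an enforced standard) is retries = target timeout / delay:

# Sizing sketch: target a 10-minute ceiling with progress every 15 seconds
# retries = 600s / 15s = 40
- name: Wait for slow service pods to be ready
  kubernetes.core.k8s_info:
    kind: Pod
    namespace: "{{ namespace }}"
    label_selectors:
      - app={{ service_name }}
  register: slow_pods
  retries: 40  # 40 × 15s = 10 minutes maximum wait
  delay: 15
  until: >
    slow_pods.resources | length > 0 and
    slow_pods.resources[0].status.phase == "Running"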
Key Benefits of This Pattern:
- ✅ Visible Progress: User sees "RETRYING (N retries left)" messages every 15 seconds
- ✅ Predictable Timing: Clear expectation of maximum wait time
- ✅ No Silent Hangs: Never appears frozen or unresponsive
- ✅ Appropriate Timeouts: Different timeouts for different complexity levels
❌ What NOT to do:
# WRONG: Silent long-running operations
- name: Wait for deployment
shell: kubectl wait --timeout=600s --for=condition=ready pod/service-pod
# Problem: 10 minutes of silence - appears to hang
Rule 15: Ingress Standards
All services MUST follow the Traefik Ingress Standards defined in docs/rules-ingress-traefik.md.
Mandatory Requirements:
- ✅ Use only Traefik IngressRoute CRDs (never standard Kubernetes Ingress)
- ✅ Follow HostRegexp patterns for multi-domain support
- ✅ Use correct API version and proper labeling
- ✅ Apply authentication middleware for protected services
See docs/rules-ingress-traefik.md for:
- Complete IngressRoute templates and examples
- HostRegexp pattern explanations
- Authentication integration patterns
- API version requirements
- Working examples from the codebase
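A minimal sketch of what a compliant route can look like under these requirements (the service name, namespace, and entry point are illustrative; treat docs/rules-ingress-traefik.md as the canonical template source):

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: myservice
  namespace: default
  labels:
    app.kubernetes.io/name: myservice
spec:
  entryPoints:
    - web
  routes:
    - match: HostRegexp(`myservice\..+`)
      kind: Rule
      services:
        - name: myservice
          port: 80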
Common Anti-Patterns to Avoid
❌ Anti-Pattern 1: Shell Script Logic
# WRONG: Business logic in shell script
if helm list | grep -q postgres; then
echo "Postgres exists, upgrading..."
helm upgrade postgres ...
else
echo "Installing postgres..."
helm install postgres ...
fi
Solution: Move all logic to Ansible playbooks.
❌ Anti-Pattern 2: localhost Testing
# WRONG: Testing localhost from host context
- name: Test service
uri:
url: "http://service.localhost/api"
Solution: Use kubectl run with curl container.
❌ Anti-Pattern 3: Missing Retry Logic
# WRONG: No retry for potentially slow operations
- name: Wait for pod
shell: kubectl get pod service-pod
Solution: Use retries/delay/until pattern.
❌ Anti-Pattern 4: Ignoring Critical Errors
# WRONG: Ignoring deployment failures
- name: Deploy service
command: helm install service ...
ignore_errors: true
Solution: Let critical failures fail fast.
❌ Anti-Pattern 5: Silent Long-Running Tasks
# WRONG: No progress feedback for long operations
- name: Wait for deployment
shell: kubectl wait --timeout=600s ...
Solution: Use Ansible retries with progress indicators (see Rule 14: Retry and Timeout Patterns).
❌ Anti-Pattern 6: Utility Files as Task Lists
# WRONG: utility/database-setup.yml as task list
- name: Create user
postgresql_user: ...
- name: Create database
postgresql_db: ...
Solution: Write complete playbooks with hosts, vars, and tasks sections.
❌ Anti-Pattern 7: Assuming Pre-configured Helm Repositories
# WRONG: Assuming repositories are already configured
- name: Deploy service
helm: chart=some-repo/service-name ... # May fail if repo missing
Solution: Manage Helm repositories within the playbook (check, add, update).
Reference Documentation
Related Cluster Documentation:
- Ingress Standards: `docs/rules-ingress-traefik.md` - Comprehensive Traefik IngressRoute patterns
- Networking Overview: `docs/networking-readme.md` - Cluster networking architecture
- Infrastructure Guide: `docs/infrastructure-readme.md` - Overall cluster architecture
- AI Environment: `docs/package-ai-environment-management.md` - AI-specific deployment patterns
Key Concepts from Traefik Documentation:
- HostRegexp Patterns: Multi-domain routing with HostRegexp(`service\..+`)
- API Version: Use `traefik.io/v1alpha1` (not `traefik.io/v1`)
- DNS Resolution: Dual-context architecture for localhost routing
- Authentication: Forward auth middleware patterns for protected services
- Priority System: Route matching order and conflict resolution
Working Examples in Codebase:
- ✅ Good: `ansible/playbooks/020-setup-nginx.yml` - Proper testing with kubectl run
- ✅ Good: `ansible/playbooks/210-setup-litellm.yml` - Comprehensive verification
- ✅ Good: `manifests/071-whoami-public-ingressroute.yaml` - Proper IngressRoute pattern
- ✅ Good: `manifests/078-whoami-protected-ingressroute.yaml` - Authentication integration
Testing Pattern Reference:
# From ansible/playbooks/020-setup-nginx.yml (lines 123-131)
- name: Test connectivity from within the cluster
ansible.builtin.shell: |
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never --context {{ kube_context }} -- \
curl -s http://nginx.default.svc.cluster.local:{{ port }}/{{ file }}
register: curl_test
ignore_errors: true
changed_when: false
IngressRoute Patterns:
See docs/rules-ingress-traefik.md for complete IngressRoute examples and patterns.
Enforcement
For Human Developers:
- All pull requests must follow these rules
- Peer review must verify compliance
- No exceptions without documented justification
For LLMs and AI Assistants:
- These rules are MANDATORY and override any default behavior
- Never suggest .localhost testing from host context
- Always use kubectl run pattern for cluster testing
- Never ignore errors for dependencies
- Always follow the Script + Ansible pattern
Validation Checklist:
- Uses Script + Ansible pattern
- Tests using kubectl run (not .localhost from host)
- Includes comprehensive verification steps
- Does not ignore errors for critical dependencies
- Provides progress feedback for long-running tasks (>30s)
- Uses sequential task numbering (1, 2, 3...)
- Utility files are complete playbooks (not task lists)
- Manages required Helm repositories within playbook
- Understands auto-execution system (active vs not-in-use placement)
- Uses Traefik IngressRoute (not standard Ingress)
- Follows file naming conventions
- Includes proper status reporting
Continuous Improvement
These rules are living standards based on:
- Proven patterns from working deployments
- Lessons learned from debugging failures
- Cluster architecture requirements
- Team experience and best practices
Update Process:
- Propose rule changes via pull request
- Test changes with actual deployments
- Update documentation with examples
- Train team on new patterns
This ensures our deployment standards evolve while maintaining reliability and consistency across all cluster services.