Tempo - Distributed Tracing Backend
Key Features: Distributed Tracing • OTLP Protocol • gRPC & HTTP Endpoints • TraceQL Queries • Cost-Effective Storage • Jaeger/Zipkin Compatible • Multi-Tenancy
File: docs/package-monitoring-tempo.md
Purpose: Complete guide to Tempo deployment and configuration for distributed tracing in Urbalurba infrastructure
Target Audience: DevOps engineers, platform administrators, SREs, developers
Last Updated: October 3, 2025
Deployed Version: Tempo v2.8.2 (Helm Chart: tempo-1.23.3)
Official Documentation: https://grafana.com/docs/tempo/v2.8.x/
📋 Overview
Tempo is a high-performance distributed tracing backend designed for cloud-native applications. It provides cost-effective trace storage and powerful querying capabilities through TraceQL. Unlike traditional tracing backends, Tempo only indexes a small set of metadata, dramatically reducing storage and operational costs.
As part of the unified observability stack, Tempo works alongside Prometheus (metrics) and Loki (logs), with all data visualized in Grafana. Applications instrumented with OpenTelemetry send traces to the OTLP Collector, which forwards them to Tempo for storage and querying.
Key Capabilities:
- OTLP Native: Primary ingestion via OpenTelemetry Collector (gRPC 4317, HTTP 4318)
- TraceQL: Purpose-built query language (PromQL/LogQL-inspired) for powerful trace filtering and analysis
- Low Storage Cost: Indexes only metadata, stores traces in efficient object storage format
- Multi-Protocol: Supports OTLP, Jaeger, and Zipkin protocols
- Scalable Architecture: Designed for high-volume trace ingestion
- Grafana Integration: Native datasource for trace visualization and exploration
Architecture Type: Append-only trace storage with metadata indexing
🏗️ Architecture
Deployment Components
┌──────────────────────────────────────────────────────┐
│ Tempo Stack (namespace: monitoring) │
├──────────────────────────────────────────────────────┤
│ │
│ ┌────────────────────────────────────────────┐ │
│ │ Tempo Server │ │
│ │ │ │
│ │ ┌──────────────────────────────────────┐ │ │
│ │ │ OTLP Receivers │ │ │
│ │ │ - gRPC: 4317 │ │ │
│ │ │ - HTTP: 4318 │ │ │
│ │ └──────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌──────────────────────────────────────┐ │ │
│ │ │ Trace Storage │ │ │
│ │ │ - Metadata Index │ │ │
│ │ │ - Trace Blocks (10Gi PVC) │ │ │
│ │ │ - 24h Retention │ │ │
│ │ └──────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌──────────────────────────────────────┐ │ │
│ │ │ Query APIs │ │ │
│ │ │ - HTTP API: 3200 │ │ │
│ │ │ - TraceQL Engine │ │ │
│ │ │ - Search API │ │ │
│ │ └──────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────┘
▲ │
│ ▼
┌──────────────────┐ ┌──────────────────────┐
│ OTLP Collector │ │ Grafana Query │
│ (Trace Export) │ │ (TraceQL/HTTP) │
└──────────────────┘ └──────────────────────┘
Data Flow
Application (OTLP instrumented)
│
│ OTLP/HTTP or OTLP/gRPC
▼
┌──────────────────────┐
│ OTLP Collector │
│ (Trace Receiver) │
└──────────────────────┘
│
│ OTLP Export
▼
┌──────────────────────┐
│ Tempo Backend │
│ (4317/4318) │
├──────────────────────┤
│ - Receive traces │
│ - Extract metadata │
│ - Store trace blocks │
│ - Index for search │
└──────────────────────┘
│
├─► Persistent Storage (10Gi)
└─► Query API (3200)
│
▼
┌──────────────────┐
│ Grafana Explore │
│ (TraceQL Query) │
└──────────────────┘
File Structure
manifests/
└── 031-tempo-config.yaml # Tempo Helm values
ansible/playbooks/
├── 031-setup-tempo.yml # Deployment automation
└── 031-remove-tempo.yml # Removal automation
provision-host/kubernetes/11-monitoring/not-in-use/
├── 02-setup-tempo.sh # Shell script wrapper
└── 02-remove-tempo.sh # Removal script
Storage:
└── PersistentVolumeClaim
└── tempo (10Gi) # Trace blocks storage
🚀 Deployment
Automated Deployment
Via Monitoring Stack (Recommended):
# Deploy entire monitoring stack (includes Tempo)
docker exec -it provision-host bash
cd /mnt/urbalurbadisk/provision-host/kubernetes/11-monitoring/not-in-use
./00-setup-all-monitoring.sh rancher-desktop
Individual Deployment:
# Deploy Tempo only
docker exec -it provision-host bash
cd /mnt/urbalurbadisk/provision-host/kubernetes/11-monitoring/not-in-use
./02-setup-tempo.sh rancher-desktop
Manual Deployment
Prerequisites:
- Kubernetes cluster running (Rancher Desktop)
- monitoring namespace exists
- Helm installed in provision-host container
- Manifest file: manifests/031-tempo-config.yaml
Deployment Steps:
# 1. Enter provision-host container
docker exec -it provision-host bash
# 2. Add Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# 3. Deploy Tempo
helm upgrade --install tempo grafana/tempo \
-f /mnt/urbalurbadisk/manifests/031-tempo-config.yaml \
--namespace monitoring \
--create-namespace \
--timeout 600s \
--kube-context rancher-desktop
# 4. Wait for pods to be ready
kubectl wait --for=condition=ready pod \
-l app.kubernetes.io/name=tempo \
-n monitoring --timeout=300s
Deployment Time: ~2-3 minutes
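Post-Deployment Check (optional): a quick way to confirm the release and the values actually applied, using standard Helm commands (output shown by Helm will vary):
# Confirm the Helm release and chart version
helm list -n monitoring --kube-context rancher-desktop --filter '^tempo$'
# Inspect the values applied to the release
helm get values tempo -n monitoring --kube-context rancher-desktop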
⚙️ Configuration
Tempo Configuration (manifests/031-tempo-config.yaml)
Core Settings:
tempo:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317 # OTLP gRPC receiver
http:
endpoint: 0.0.0.0:4318 # OTLP HTTP receiver
retention: 24h # Trace retention period
persistence:
enabled: true
size: 10Gi # Trace block storage
service:
type: ClusterIP # Internal cluster access only
Key Configuration Sections:
1. OTLP Receivers (Primary Ingestion):
tempo:
receivers:
otlp:
protocols:
# Recommended for production (lower overhead)
grpc:
endpoint: 0.0.0.0:4317
# Alternative for HTTP-only environments
http:
endpoint: 0.0.0.0:4318
2. Retention Policy:
tempo:
retention: 24h # Traces older than 24h are deleted
3. Storage Backend:
persistence:
enabled: true
size: 10Gi # Adjust based on trace volume
4. Metrics Generator (Automatic Service Graphs):
tempo:
metricsGenerator:
enabled: true
remoteWriteUrl: "http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/write"
structuredConfig:
metrics_generator:
registry:
external_labels:
source: tempo
storage:
path: /var/tempo/generator/wal
remote_write:
- url: http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/write
send_exemplars: true
traces_storage:
path: /var/tempo/generator/traces
processor:
service_graphs:
dimensions:
- service.name
- peer.service
histogram_buckets: [0.1, 0.2, 0.5, 1, 2, 5, 10]
span_metrics:
dimensions:
- service.name
- peer.service
- log.type
overrides:
defaults:
metrics_generator:
processors: [service-graphs, span-metrics]
Key Features:
- Service Graphs: Automatically generate service dependency metrics from traces
- Span Metrics: Create Prometheus metrics for trace calls, latency, and errors
- Remote Write: Send generated metrics to Prometheus for visualization
- Dimensions: Track service.name, peer.service, and log.type for detailed filtering
Generated Prometheus Metrics:
- traces_spanmetrics_calls_total - Total calls between services
- traces_spanmetrics_latency_bucket - Latency histogram distribution
- traces_spanmetrics_size_total - Span size tracking
Example Prometheus Queries:
# Service dependency graph
traces_spanmetrics_calls_total{service_name="my-service"}
# Average latency between services
rate(traces_spanmetrics_latency_sum[1m]) / rate(traces_spanmetrics_latency_count[1m])
# Error rate by service
rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[1m])
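To confirm the generated span metrics are actually reaching Prometheus, the queries above can also be run against the Prometheus HTTP API from inside the cluster. This is a sketch: it assumes the prometheus-server service from the remote-write URL above, and the label names follow the examples above.
# Query Prometheus for span metrics produced by Tempo's metrics generator
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never \
  -n monitoring -- \
  curl -s -G "http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/query" \
  --data-urlencode 'query=sum(rate(traces_spanmetrics_calls_total[5m])) by (service_name)'
# Expected: JSON with a non-empty "result" array once traces are flowing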
Resource Configuration
Storage Requirements:
- Tempo PVC: 10Gi persistent volume (24-hour retention)
- Estimated Usage: ~400-500MB per million spans (varies by trace size)
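Actual on-disk usage can be checked directly in the pod; this assumes the Tempo container image ships the usual busybox/coreutils tools:
# Check how much of the 10Gi volume the trace blocks currently occupy
kubectl exec -n monitoring tempo-0 -- du -sh /var/tempo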
Service Endpoints:
- OTLP gRPC: tempo.monitoring.svc.cluster.local:4317
- OTLP HTTP: tempo.monitoring.svc.cluster.local:4318
- HTTP API: tempo.monitoring.svc.cluster.local:3200
- Ready Check: tempo.monitoring.svc.cluster.local:3200/ready
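For quick in-cluster testing, an OpenTelemetry SDK can be pointed straight at these endpoints using the standard OTel environment variables (a sketch; the normal path in this stack is still the OTLP Collector, and my-app is a placeholder service name):
# Standard OpenTelemetry SDK environment variables for an in-cluster workload
OTEL_EXPORTER_OTLP_ENDPOINT=http://tempo.monitoring.svc.cluster.local:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_SERVICE_NAME=my-app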
Security Configuration
Network Access:
service:
type: ClusterIP # Internal cluster access only
No External Access: Tempo is internal-only. Traces are sent via OTLP Collector, and queries are performed through Grafana.
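For local debugging, the query API can still be reached from a workstation without exposing the service, using a standard port-forward (the same approach is used in the troubleshooting section below):
# Temporary local access to the query API (no ingress required)
kubectl port-forward -n monitoring svc/tempo 3200:3200
# In a second terminal:
curl -s http://localhost:3200/ready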
🔍 Monitoring & Verification
Health Checks
Check Pod Status:
# Tempo pods
kubectl get pods -n monitoring -l app.kubernetes.io/name=tempo
# Expected output:
NAME READY STATUS RESTARTS AGE
tempo-0 1/1 Running 0 5m
Check Service Endpoints:
# Verify services are accessible
kubectl get svc -n monitoring -l app.kubernetes.io/name=tempo
# Expected services:
tempo ClusterIP 10.43.x.x 3200/TCP,4317/TCP,4318/TCP
Service Verification
Test HTTP API:
# Test API echo endpoint
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never \
-n monitoring -- \
curl -s http://tempo.monitoring.svc.cluster.local:3200/api/echo
# Expected: HTTP 200 response
Test Ready Endpoint:
# Check if Tempo is ready to receive traces
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never \
-n monitoring -- \
curl -s http://tempo.monitoring.svc.cluster.local:3200/ready
# Expected: HTTP 200 response
Test OTLP Endpoints:
# Test gRPC port accessibility
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never \
-n monitoring -- \
curl -v telnet://tempo.monitoring.svc.cluster.local:4317
# Test HTTP port accessibility
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never \
-n monitoring -- \
curl -s -o /dev/null -w "%{http_code}" \
http://tempo.monitoring.svc.cluster.local:4318/
Search API Testing
Query Traces:
# Test search API (returns trace metadata)
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never \
-n monitoring -- \
curl -s "http://tempo.monitoring.svc.cluster.local:3200/api/search"
# Expected: JSON response with traces/metrics
Automated Verification
The deployment playbook (031-setup-tempo.yml) performs automated tests:
- ✅ HTTP API endpoint connectivity
- ✅ Ready endpoint verification
- ✅ Metrics endpoint check
- ✅ Search API validation
- ✅ OTLP gRPC port accessibility
- ✅ OTLP HTTP port accessibility
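The same HTTP checks can be run by hand in one pass with a small loop; this sketch uses a local port-forward and the endpoints listed above:
# One-shot smoke check of Tempo's HTTP endpoints
kubectl port-forward -n monitoring svc/tempo 3200:3200 &
PF_PID=$!
sleep 2
for path in /ready /api/echo /metrics /api/search; do
  printf '%s -> HTTP %s\n' "$path" "$(curl -s -o /dev/null -w '%{http_code}' "http://localhost:3200$path")"
done
kill $PF_PID  # stop the background port-forward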
🛠️ Management Operations
Query Traces in Grafana
Access Grafana:
# Open Grafana UI
http://grafana.localhost
Explore Traces:
- Navigate to Explore → Select Tempo datasource
- Choose query type:
- Search: Find traces by service/operation
- TraceQL: Write queries in the TraceQL language (see examples below)
- Trace ID: Lookup specific trace
TraceQL Examples:
# Find traces with errors
{ status = error }
# Find slow traces (>1s duration)
{ duration > 1s }
# Find traces by service name
{ resource.service.name = "sovdev-test-company-lookup-typescript" }
# Complex query: slow traces with errors from specific service
{ resource.service.name =~ "sovdev.*" && status = error && duration > 1s }
Official TraceQL Documentation: https://grafana.com/docs/tempo/v2.8.x/traceql/
HTTP API Queries
Search for Traces:
# Search by service name
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never \
-n monitoring -- \
curl -s "http://tempo.monitoring.svc.cluster.local:3200/api/search?q=service.name%3D%22my-service%22"
Retrieve Trace by ID:
# Get specific trace (replace TRACE_ID)
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never \
-n monitoring -- \
curl -s "http://tempo.monitoring.svc.cluster.local:3200/api/traces/TRACE_ID"
Metrics Monitoring
Tempo Self-Monitoring:
# Get Tempo internal metrics
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never \
-n monitoring -- \
curl -s http://tempo.monitoring.svc.cluster.local:3200/metrics | grep tempo
Key Metrics (via Prometheus):
# Ingested spans per second
rate(tempo_ingester_spans_ingested_total[5m])
# Trace queries per second
rate(tempo_query_frontend_queries_total[5m])
# Storage bytes used
tempo_ingester_bytes_total
Service Removal
Automated Removal:
docker exec -it provision-host bash
cd /mnt/urbalurbadisk/provision-host/kubernetes/11-monitoring/not-in-use
./02-remove-tempo.sh rancher-desktop
Manual Removal:
# Remove Helm chart
helm uninstall tempo -n monitoring --kube-context rancher-desktop
# Remove PVC (optional - preserves data if omitted)
kubectl delete pvc -n monitoring -l app.kubernetes.io/name=tempo
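Verify Removal: a quick check that the release and its resources are gone (standard Helm/kubectl commands):
# Confirm nothing is left behind
helm list -n monitoring --kube-context rancher-desktop | grep tempo || echo "release removed"
kubectl get pods,svc,pvc -n monitoring -l app.kubernetes.io/name=tempo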
🔧 Troubleshooting
Common Issues
Pods Not Starting:
# Check pod events
kubectl describe pod -n monitoring -l app.kubernetes.io/name=tempo
# Common causes:
# - PVC binding issues (check PV availability)
# - Insufficient resources (check node capacity)
# - Image pull errors (check network)
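Recent namespace events and the PVC state usually point at the cause directly:
# Recent events in the namespace (scheduling, volume binding, image pulls)
kubectl get events -n monitoring --sort-by=.lastTimestamp | tail -20
# The PVC must be Bound before tempo-0 can start
kubectl get pvc -n monitoring -l app.kubernetes.io/name=tempo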
No Traces Appearing:
# 1. Check OTLP Collector is sending traces to Tempo
kubectl logs -n monitoring -l app.kubernetes.io/name=opentelemetry-collector | grep tempo
# 2. Check Tempo ingestion logs
kubectl logs -n monitoring -l app.kubernetes.io/name=tempo | grep -i "trace\|span"
# 3. Verify OTLP Collector configuration
kubectl get configmap -n monitoring otel-collector-opentelemetry-collector -o yaml | grep -A 10 "tempo"
# Expected: Tempo endpoint at tempo.monitoring.svc.cluster.local:4317
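If the Collector looks healthy but nothing shows up, sending one hand-crafted OTLP/JSON span straight to Tempo isolates the ingestion side. A sketch: the trace/span IDs and the tempo-smoke-test service name are made up, and the payload is a minimal OTLP/HTTP JSON export request that bypasses the Collector.
# Send one synthetic span to Tempo's OTLP HTTP receiver, then search for it
kubectl port-forward -n monitoring svc/tempo 4318:4318 3200:3200 &
PF_PID=$!
sleep 2
NOW=$(date +%s)
curl -s -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans":[{"resource":{"attributes":[{"key":"service.name","value":{"stringValue":"tempo-smoke-test"}}]},"scopeSpans":[{"spans":[{"traceId":"5b8aa5a2d2c872e8321cf37308d69df2","spanId":"051581bf3cb55c13","name":"smoke-test","kind":2,"startTimeUnixNano":"'"${NOW}"'000000000","endTimeUnixNano":"'"${NOW}"'500000000"}]}]}]}'
# Wait a few seconds, then search for the synthetic service
sleep 5
curl -s "http://localhost:3200/api/search?q=%7Bresource.service.name%3D%22tempo-smoke-test%22%7D"
kill $PF_PID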
Trace Query Failures:
# Check Tempo query logs
kubectl logs -n monitoring -l app.kubernetes.io/name=tempo | grep -i error
# Test search API directly
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never \
-n monitoring -- \
curl -v "http://tempo.monitoring.svc.cluster.local:3200/api/search"
Storage Full:
# Check PVC usage
kubectl get pvc -n monitoring
# Check trace block size via metrics
kubectl port-forward -n monitoring svc/tempo 3200:3200
curl -s http://localhost:3200/metrics | grep tempo_ingester_bytes
# Solutions:
# 1. Reduce retention period in manifests/031-tempo-config.yaml
# 2. Increase PVC size
# 3. Reduce trace sampling rate at application level
OTLP Ingestion Errors:
# Check if Tempo is accepting OTLP traces
kubectl logs -n monitoring -l app.kubernetes.io/name=tempo | grep -i "otlp\|grpc\|http"
# Test OTLP HTTP endpoint
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never \
-n monitoring -- \
curl -X POST -v http://tempo.monitoring.svc.cluster.local:4318/v1/traces \
-H "Content-Type: application/json" \
-d '{}'
# Expected: 400 or 405 (endpoint is reachable, empty payload rejected)
📋 Maintenance
Regular Tasks
Monitor Storage Usage:
# Check PVC status
kubectl get pvc -n monitoring -l app.kubernetes.io/name=tempo
# Check storage metrics
kubectl port-forward -n monitoring svc/tempo 3200:3200
curl -s http://localhost:3200/metrics | grep tempo_ingester_blocks_total
Update Tempo:
# Update Helm chart to latest version
helm repo update
helm upgrade tempo grafana/tempo \
-f /mnt/urbalurbadisk/manifests/031-tempo-config.yaml \
-n monitoring \
--kube-context rancher-desktop
Cleanup Old Traces (automatic):
# Retention handled automatically via tempo.retention setting
tempo:
retention: 24h # Traces older than 24 hours are purged
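Whether the value made it into the running configuration can be checked against the rendered ConfigMap; looking it up by label avoids guessing the exact ConfigMap name:
# Confirm the retention setting in the rendered Tempo configuration
kubectl get configmap -n monitoring -l app.kubernetes.io/name=tempo -o yaml | grep -i -B2 -A2 retention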
Backup Procedures
Snapshot Trace Blocks:
# Export PVC data
kubectl exec -n monitoring tempo-0 -- \
tar czf /tmp/tempo-backup.tar.gz /var/tempo
# Copy to local machine
kubectl cp monitoring/tempo-0:/tmp/tempo-backup.tar.gz \
./tempo-backup.tar.gz
Note: Tempo is designed as ephemeral storage with short retention. Long-term trace archival is not a primary use case.
Disaster Recovery
Restore from Backup:
# 1. Remove existing deployment
./02-remove-tempo.sh rancher-desktop
# 2. Restore PVC data (requires direct PV access; see the sketch below)
# 3. Redeploy Tempo
./02-setup-tempo.sh rancher-desktop
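If direct PV access is not available, one workable variant of step 2 is to run step 3 first and then restore into the new pod, which mounts the same PVC (a sketch using the archive produced in the backup procedure above):
# Copy the backup into the redeployed pod and unpack it onto the PVC
kubectl cp ./tempo-backup.tar.gz monitoring/tempo-0:/tmp/tempo-backup.tar.gz
kubectl exec -n monitoring tempo-0 -- tar xzf /tmp/tempo-backup.tar.gz -C /
# Restart the pod so Tempo re-scans the restored blocks
kubectl delete pod -n monitoring tempo-0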
Data Loss Scenarios:
- PVC deleted: Traces are lost (not critical - 24h retention means limited impact)
- Corruption: Tempo auto-repairs blocks on startup
- Retention expired: Expected behavior, adjust retention if needed
🚀 Use Cases
1. Application Tracing with OpenTelemetry
Instrument Application (example with Go):
import (
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
)
// Configure OTLP exporter to send traces to the OTLP Collector
exporter, _ := otlptracehttp.New(ctx,
	otlptracehttp.WithEndpoint("otel.localhost"),
	otlptracehttp.WithURLPath("/v1/traces"),
	otlptracehttp.WithInsecure(), // the collector endpoint is plain HTTP
	otlptracehttp.WithHeaders(map[string]string{
		"Host": "otel.localhost",
	}),
)
Environment Configuration:
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://127.0.0.1/v1/traces
OTEL_EXPORTER_OTLP_HEADERS=Host=otel.localhost
Query in Grafana:
{ resource.service.name = "my-app" }
2. Correlate Logs and Traces
sovdev-logger Integration:
// TypeScript application with sovdev-logger
import { initializeSovdevLogger } from '@sovdev/logger';
// Logs include trace_id and span_id for correlation
logger.info("Processing request", {
trace_id: context.traceId,
span_id: context.spanId
});
Grafana Workflow:
- Find trace in Tempo with TraceQL
- Note trace_id from trace details
- Switch to Loki datasource
- Query logs: {service_name="my-app"} | json | trace_id="TRACE_ID"
- View correlated logs and traces together
3. Performance Analysis
Find Slow Requests:
# Traces slower than 2 seconds
{ duration > 2s }
# Count slow traces per service (TraceQL metrics query; requires the local-blocks metrics-generator processor)
{ duration > 2s } | count_over_time() by (resource.service.name)
Analyze Bottlenecks:
- Query slow traces in Grafana Explore
- View trace waterfall/flamegraph
- Identify slow spans (database queries, API calls, etc.)
- Optimize identified bottlenecks
4. Error Investigation
Find Failed Requests:
# Traces with errors
{ status = error }
# Errors from a specific service (narrow the time window with Grafana's time picker)
{ resource.service.name = "api-service" && status = error }
Debug Workflow:
- Find error traces with TraceQL
- Examine trace details and span attributes
- Correlate with logs (trace_id)
- Identify root cause from span data
💡 Key Insight: Tempo's design philosophy is "store everything, index nothing (except metadata)". This approach dramatically reduces costs while enabling powerful trace analysis through TraceQL. When integrated with OTLP Collector for ingestion and Grafana for visualization, Tempo provides complete distributed tracing capabilities for microservices architectures without the operational complexity of traditional tracing backends.
🔗 Related Documentation
Monitoring Stack:
- Monitoring Overview - Complete observability stack
- Prometheus Metrics - Metrics collection
- Loki Logs - Log aggregation
- OTLP Collector - Telemetry pipeline (trace ingestion)
- Grafana Visualization - Dashboards and trace exploration
Configuration & Rules:
- Naming Conventions - Manifest numbering (031)
- Development Workflow - Configuration management
- Automated Deployment - Orchestration
External Resources:
- TraceQL Language: https://grafana.com/docs/tempo/v2.8.x/traceql/
- OTLP Specification: https://opentelemetry.io/docs/specs/otlp/
- Tempo Configuration: https://grafana.com/docs/tempo/v2.8.x/configuration/