Tempo - Distributed Tracing Backend
Key Features: Distributed Tracing • OTLP Protocol • gRPC & HTTP Endpoints • TraceQL Queries • Cost-Effective Storage • Jaeger/Zipkin Compatible • Multi-Tenancy
File: docs/package-monitoring-tempo.md
Purpose: Complete guide to Tempo deployment and configuration for distributed tracing in Urbalurba infrastructure
Target Audience: DevOps engineers, platform administrators, SREs, developers
Last Updated: October 3, 2025
Deployed Version: Tempo v2.8.2 (Helm Chart: tempo-1.23.3)
Official Documentation: https://grafana.com/docs/tempo/v2.8.x/
📋 Overview
Tempo is a high-performance distributed tracing backend designed for cloud-native applications. It provides cost-effective trace storage and powerful querying capabilities through TraceQL. Unlike traditional tracing backends, Tempo only indexes a small set of metadata, dramatically reducing storage and operational costs.
As part of the unified observability stack, Tempo works alongside Prometheus (metrics) and Loki (logs), with all data visualized in Grafana. Applications instrumented with OpenTelemetry send traces to the OTLP Collector, which forwards them to Tempo for storage and querying.
Key Capabilities:
- OTLP Native: Primary ingestion via OpenTelemetry Collector (gRPC 4317, HTTP 4318)
- TraceQL: Purpose-built query language (PromQL/LogQL-inspired) for powerful trace filtering and analysis
- Low Storage Cost: Indexes only metadata, stores traces in efficient object storage format
- Multi-Protocol: Supports OTLP, Jaeger, and Zipkin protocols
- Scalable Architecture: Designed for high-volume trace ingestion
- Grafana Integration: Native datasource for trace visualization and exploration
Architecture Type: Append-only trace storage with metadata indexing
🏗️ Architecture
Deployment Components
┌──────────────────────────────────────────────────────┐
│ Tempo Stack (namespace: monitoring) │
├──────────────────────────────────────────────────────┤
│ │
│ ┌────────────────────────────────────────────┐ │
│ │ Tempo Server │ │
│ │ │ │
│ │ ┌──────────────────────────────────────┐ │ │
│ │ │ OTLP Receivers │ │ │
│ │ │ - gRPC: 4317 │ │ │
│ │ │ - HTTP: 4318 │ │ │
│ │ └──────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌──────────────────────────────────────┐ │ │
│ │ │ Trace Storage │ │ │
│ │ │ - Metadata Index │ │ │
│ │ │ - Trace Blocks (10Gi PVC) │ │ │
│ │ │ - 24h Retention │ │ │
│ │ └──────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌──────────────────────────────────────┐ │ │
│ │ │ Query APIs │ │ │
│ │ │ - HTTP API: 3200 │ │ │
│ │ │ - TraceQL Engine │ │ │
│ │ │ - Search API │ │ │
│ │ └──────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────┘
▲ │
│ ▼
┌──────────────────┐ ┌──────────────────────┐
│ OTLP Collector │ │ Grafana Query │
│ (Trace Export) │ │ (TraceQL/HTTP) │
└──────────────────┘ └──────────────────────┘
Data Flow
Application (OTLP instrumented)
│
│ OTLP/HTTP or OTLP/gRPC
▼
┌──────────────────────┐
│ OTLP Collector │
│ (Trace Receiver) │
└──────────────────────┘
│
│ OTLP Export
▼
┌──────────────────────┐
│ Tempo Backend │
│ (4317/4318) │
├──────────────────────┤
│ - Receive traces │
│ - Extract metadata │
│ - Store trace blocks │
│ - Index for search │
└──────────────────────┘
│
├─► Persistent Storage (10Gi)
└─► Query API (3200)
│
▼
┌──────────────────┐
│ Grafana Explore │
│ (TraceQL Query) │
└──────────────────┘
File Structure
manifests/
└── 031-tempo-config.yaml # Tempo Helm values
ansible/playbooks/
├── 031-setup-tempo.yml # Deployment automation
└── 031-remove-tempo.yml # Removal automation
provision-host/kubernetes/11-monitoring/not-in-use/
├── 02-setup-tempo.sh # Shell script wrapper
└── 02-remove-tempo.sh # Removal script
Storage:
└── PersistentVolumeClaim
└── tempo (10Gi) # Trace blocks storage
🚀 Deployment
Automated Deployment
Via Monitoring Stack (Recommended):
# Deploy entire monitoring stack (includes Tempo)
docker exec -it provision-host bash
cd /mnt/urbalurbadisk/provision-host/kubernetes/11-monitoring/not-in-use
./00-setup-all-monitoring.sh rancher-desktop
Individual Deployment:
# Deploy Tempo only
docker exec -it provision-host bash
cd /mnt/urbalurbadisk/provision-host/kubernetes/11-monitoring/not-in-use
./02-setup-tempo.sh rancher-desktop
Manual Deployment
Prerequisites:
- Kubernetes cluster running (Rancher Desktop)
- monitoring namespace exists
- Helm installed in provision-host container
- Manifest file: manifests/031-tempo-config.yaml
Deployment Steps:
# 1. Enter provision-host container
docker exec -it provision-host bash
# 2. Add Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# 3. Deploy Tempo
helm upgrade --install tempo grafana/tempo \
-f /mnt/urbalurbadisk/manifests/031-tempo-config.yaml \
--namespace monitoring \
--create-namespace \
--timeout 600s \
--kube-context rancher-desktop
# 4. Wait for pods to be ready
kubectl wait --for=condition=ready pod \
-l app.kubernetes.io/name=tempo \
-n monitoring --timeout=300s
Deployment Time: ~2-3 minutes
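Post-Deployment Check (optional): a quick way to confirm the release and the values actually applied, using standard Helm commands (output shown by Helm will vary):
# Confirm the Helm release and chart version
helm list -n monitoring --kube-context rancher-desktop --filter '^tempo$'
# Inspect the values applied to the release
helm get values tempo -n monitoring --kube-context rancher-desktop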
⚙️ Configuration
Tempo Configuration (manifests/031-tempo-config.yaml)
Core Settings:
tempo:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317 # OTLP gRPC receiver
http:
endpoint: 0.0.0.0:4318 # OTLP HTTP receiver
retention: 24h # Trace retention period
persistence:
enabled: true
size: 10Gi # Trace block storage
service:
type: ClusterIP # Internal cluster access only
Key Configuration Sections:
1. OTLP Receivers (Primary Ingestion):
tempo:
receivers:
otlp:
protocols:
# Recommended for production (lower overhead)
grpc:
endpoint: 0.0.0.0:4317
# Alternative for HTTP-only environments
http:
endpoint: 0.0.0.0:4318
2. Retention Policy:
tempo:
retention: 24h # Traces older than 24h are deleted
3. Storage Backend:
persistence:
enabled: true
size: 10Gi # Adjust based on trace volume
4. Metrics Generator (Automatic Service Graphs):
tempo:
metricsGenerator:
enabled: true
remoteWriteUrl: "http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/write"
structuredConfig:
metrics_generator:
registry:
external_labels:
source: tempo
storage:
path: /var/tempo/generator/wal
remote_write:
- url: http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/write
send_exemplars: true
traces_storage:
path: /var/tempo/generator/traces
processor:
service_graphs:
dimensions:
- service.name
- peer.service
histogram_buckets: [0.1, 0.2, 0.5, 1, 2, 5, 10]
span_metrics:
dimensions:
- service.name
- peer.service
- log.type
overrides:
defaults:
metrics_generator:
processors: [service-graphs, span-metrics]
Key Features:
- Service Graphs: Automatically generate service dependency metrics from traces
- Span Metrics: Create Prometheus metrics for trace calls, latency, and errors
- Remote Write: Send generated metrics to Prometheus for visualization
- Dimensions: Track service.name, peer.service, and log.type for detailed filtering
Generated Prometheus Metrics:
- traces_spanmetrics_calls_total - Total calls between services
- traces_spanmetrics_latency_bucket - Latency histogram distribution
- traces_spanmetrics_size_total - Span size tracking
Example Prometheus Queries:
# Service dependency graph
traces_spanmetrics_calls_total{service_name="my-service"}
# Average latency between services
rate(traces_spanmetrics_latency_sum[1m]) / rate(traces_spanmetrics_latency_count[1m])
# Error rate by service
rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[1m])
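To confirm the generated span metrics are actually reaching Prometheus, the queries above can also be run against the Prometheus HTTP API from inside the cluster. This is a sketch: it assumes the prometheus-server service from the remote-write URL above, and the label names follow the examples above.
# Query Prometheus for span metrics produced by Tempo's metrics generator
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never \
  -n monitoring -- \
  curl -s -G "http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/query" \
  --data-urlencode 'query=sum(rate(traces_spanmetrics_calls_total[5m])) by (service_name)'
# Expected: JSON with a non-empty "result" array once traces are flowing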
Resource Configuration
Storage Requirements:
- Tempo PVC: 10Gi persistent volume (24-hour retention)
- Estimated Usage: ~400-500MB per million spans (varies by trace size)
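Actual on-disk usage can be checked directly in the pod; this assumes the Tempo container image ships the usual busybox/coreutils tools:
# Check how much of the 10Gi volume the trace blocks currently occupy
kubectl exec -n monitoring tempo-0 -- du -sh /var/tempo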
Service Endpoints:
- OTLP gRPC: tempo.monitoring.svc.cluster.local:4317
- OTLP HTTP: tempo.monitoring.svc.cluster.local:4318
- HTTP API: tempo.monitoring.svc.cluster.local:3200
- Ready Check: tempo.monitoring.svc.cluster.local:3200/ready
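For quick in-cluster testing, an OpenTelemetry SDK can be pointed straight at these endpoints using the standard OTel environment variables (a sketch; the normal path in this stack is still the OTLP Collector, and my-app is a placeholder service name):
# Standard OpenTelemetry SDK environment variables for an in-cluster workload
OTEL_EXPORTER_OTLP_ENDPOINT=http://tempo.monitoring.svc.cluster.local:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_SERVICE_NAME=my-app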
Security Configuration
Network Access:
service:
type: ClusterIP # Internal cluster access only
No External Access: Tempo is internal-only. Traces are sent via OTLP Collector, and queries are performed through Grafana.
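For local debugging, the query API can still be reached from a workstation without exposing the service, using a standard port-forward (the same approach is used in the troubleshooting section below):
# Temporary local access to the query API (no ingress required)
kubectl port-forward -n monitoring svc/tempo 3200:3200
# In a second terminal:
curl -s http://localhost:3200/ready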
🔍 Monitoring & Verification
Health Checks
Check Pod Status:
# Tempo pods
kubectl get pods -n monitoring -l app.kubernetes.io/name=tempo
# Expected output:
NAME READY STATUS RESTARTS AGE
tempo-0 1/1 Running 0 5m
Check Service Endpoints:
# Verify services are accessible
kubectl get svc -n monitoring -l app.kubernetes.io/name=tempo
# Expected services:
tempo ClusterIP 10.43.x.x 3200/TCP,4317/TCP,4318/TCP
Service Verification
Test HTTP API:
# Test API echo endpoint
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never \
-n monitoring -- \
curl -s http://tempo.monitoring.svc.cluster.local:3200/api/echo
# Expected: HTTP 200 response
Test Ready Endpoint:
# Check if Tempo is ready to receive traces
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never \
-n monitoring -- \
curl -s http://tempo.monitoring.svc.cluster.local:3200/ready
# Expected: HTTP 200 response
Test OTLP Endpoints:
# Test gRPC port accessibility
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never \
-n monitoring -- \
curl -v telnet://tempo.monitoring.svc.cluster.local:4317
# Test HTTP port accessibility
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never \
-n monitoring -- \
curl -s -o /dev/null -w "%{http_code}" \
http://tempo.monitoring.svc.cluster.local:4318/
Search API Testing
Query Traces:
# Test search API (returns trace metadata)
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never \
-n monitoring -- \
curl -s "http://tempo.monitoring.svc.cluster.local:3200/api/search"
# Expected: JSON response with traces/metrics
Automated Verification
The deployment playbook (031-setup-tempo.yml) performs automated tests:
- ✅ HTTP API endpoint connectivity
- ✅ Ready endpoint verification
- ✅ Metrics endpoint check
- ✅ Search API validation
- ✅ OTLP gRPC port accessibility
- ✅ OTLP HTTP port accessibility
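The same HTTP checks can be run by hand in one pass with a small loop; this sketch uses a local port-forward and the endpoints listed above:
# One-shot smoke check of Tempo's HTTP endpoints
kubectl port-forward -n monitoring svc/tempo 3200:3200 &
PF_PID=$!
sleep 2
for path in /ready /api/echo /metrics /api/search; do
  printf '%s -> HTTP %s\n' "$path" "$(curl -s -o /dev/null -w '%{http_code}' "http://localhost:3200$path")"
done
kill $PF_PID  # stop the background port-forward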
🛠️ Management Operations
Query Traces in Grafana
Access Grafana:
# Open Grafana UI
http://grafana.localhost
Explore Traces:
- Navigate to Explore → Select Tempo datasource
- Choose query type:
- Search: Find traces by service/operation
- TraceQL: Write queries in the TraceQL language (see examples below)
- Trace ID: Lookup specific trace
TraceQL Examples:
# Find traces with errors
{ status = error }
# Find slow traces (>1s duration)
{ duration > 1s }
# Find traces by service name
{ resource.service.name = "sovdev-test-company-lookup-typescript" }
# Complex query: slow traces with errors from specific service
{ resource.service.name =~ "sovdev.*" && status = error && duration > 1s }
Official TraceQL Documentation: https://grafana.com/docs/tempo/v2.8.x/traceql/
HTTP API Queries
Search for Traces:
# Search by service name
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never \
-n monitoring -- \
curl -s "http://tempo.monitoring.svc.cluster.local:3200/api/search?q=service.name%3D%22my-service%22"
Retrieve Trace by ID:
# Get specific trace (replace TRACE_ID)
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never \
-n monitoring -- \
curl -s "http://tempo.monitoring.svc.cluster.local:3200/api/traces/TRACE_ID"
Metrics Monitoring
Tempo Self-Monitoring:
# Get Tempo internal metrics
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never \
-n monitoring -- \
curl -s http://tempo.monitoring.svc.cluster.local:3200/metrics | grep tempo
Key Metrics (via Prometheus):
# Ingested spans per second
rate(tempo_ingester_spans_ingested_total[5m])
# Trace queries per second
rate(tempo_query_frontend_queries_total[5m])
# Storage bytes used
tempo_ingester_bytes_total
Service Removal
Automated Removal:
docker exec -it provision-host bash
cd /mnt/urbalurbadisk/provision-host/kubernetes/11-monitoring/not-in-use
./02-remove-tempo.sh rancher-desktop
Manual Removal:
# Remove Helm chart
helm uninstall tempo -n monitoring --kube-context rancher-desktop
# Remove PVC (optional - preserves data if omitted)
kubectl delete pvc -n monitoring -l app.kubernetes.io/name=tempo
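Verify Removal: a quick check that the release and its resources are gone (standard Helm/kubectl commands):
# Confirm nothing is left behind
helm list -n monitoring --kube-context rancher-desktop | grep tempo || echo "release removed"
kubectl get pods,svc,pvc -n monitoring -l app.kubernetes.io/name=tempo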
🔧 Troubleshooting
Common Issues
Pods Not Starting:
# Check pod events
kubectl describe pod -n monitoring -l app.kubernetes.io/name=tempo
# Common causes:
# - PVC binding issues (check PV availability)
# - Insufficient resources (check node capacity)
# - Image pull errors (check network)
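Recent namespace events and the PVC state usually point at the cause directly:
# Recent events in the namespace (scheduling, volume binding, image pulls)
kubectl get events -n monitoring --sort-by=.lastTimestamp | tail -20
# The PVC must be Bound before tempo-0 can start
kubectl get pvc -n monitoring -l app.kubernetes.io/name=tempo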
No Traces Appearing:
# 1. Check OTLP Collector is sending traces to Tempo
kubectl logs -n monitoring -l app.kubernetes.io/name=opentelemetry-collector | grep tempo
# 2. Check Tempo ingestion logs
kubectl logs -n monitoring -l app.kubernetes.io/name=tempo | grep -i "trace\|span"
# 3. Verify OTLP Collector configuration
kubectl get configmap -n monitoring otel-collector-opentelemetry-collector -o yaml | grep -A 10 "tempo"
# Expected: Tempo endpoint at tempo.monitoring.svc.cluster.local:4317
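If the Collector looks healthy but nothing shows up, sending one hand-crafted OTLP/JSON span straight to Tempo isolates the ingestion side. A sketch: the trace/span IDs and the tempo-smoke-test service name are made up, and the payload is a minimal OTLP/HTTP JSON export request that bypasses the Collector.
# Send one synthetic span to Tempo's OTLP HTTP receiver, then search for it
kubectl port-forward -n monitoring svc/tempo 4318:4318 3200:3200 &
PF_PID=$!
sleep 2
NOW=$(date +%s)
curl -s -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans":[{"resource":{"attributes":[{"key":"service.name","value":{"stringValue":"tempo-smoke-test"}}]},"scopeSpans":[{"spans":[{"traceId":"5b8aa5a2d2c872e8321cf37308d69df2","spanId":"051581bf3cb55c13","name":"smoke-test","kind":2,"startTimeUnixNano":"'"${NOW}"'000000000","endTimeUnixNano":"'"${NOW}"'500000000"}]}]}]}'
# Wait a few seconds, then search for the synthetic service
sleep 5
curl -s "http://localhost:3200/api/search?q=%7Bresource.service.name%3D%22tempo-smoke-test%22%7D"
kill $PF_PID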
Trace Query Failures:
# Check Tempo query logs
kubectl logs -n monitoring -l app.kubernetes.io/name=tempo | grep -i error
# Test search API directly
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never \
-n monitoring -- \
curl -v "http://tempo.monitoring.svc.cluster.local:3200/api/search"
Storage Full:
# Check PVC usage
kubectl get pvc -n monitoring
# Check trace block size via metrics
kubectl port-forward -n monitoring svc/tempo 3200:3200
curl -s http://localhost:3200/metrics | grep tempo_ingester_bytes
# Solutions:
# 1. Reduce retention period in manifests/031-tempo-config.yaml
# 2. Increase PVC size
# 3. Reduce trace sampling rate at application level
OTLP Ingestion Errors:
# Check if Tempo is accepting OTLP traces
kubectl logs -n monitoring -l app.kubernetes.io/name=tempo | grep -i "otlp\|grpc\|http"
# Test OTLP HTTP endpoint
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never \
-n monitoring -- \
curl -X POST -v http://tempo.monitoring.svc.cluster.local:4318/v1/traces \
-H "Content-Type: application/json" \
-d '{}'
# Expected: 400 or 405 (endpoint is reachable, empty payload rejected)
📋 Maintenance
Regular Tasks
Monitor Storage Usage:
# Check PVC status
kubectl get pvc -n monitoring -l app.kubernetes.io/name=tempo
# Check storage metrics
kubectl port-forward -n monitoring svc/tempo 3200:3200
curl -s http://localhost:3200/metrics | grep tempo_ingester_blocks_total
Update Tempo:
# Update Helm chart to latest version
helm repo update
helm upgrade tempo grafana/tempo \
-f /mnt/urbalurbadisk/manifests/031-tempo-config.yaml \
-n monitoring \
--kube-context rancher-desktop
Cleanup Old Traces (automatic):
# Retention handled automatically via tempo.retention setting
tempo:
retention: 24h # Traces older than 24 hours are purged
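Whether the value made it into the running configuration can be checked against the rendered ConfigMap; looking it up by label avoids guessing the exact ConfigMap name:
# Confirm the retention setting in the rendered Tempo configuration
kubectl get configmap -n monitoring -l app.kubernetes.io/name=tempo -o yaml | grep -i -B2 -A2 retention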
Backup Procedures
Snapshot Trace Blocks:
# Export PVC data
kubectl exec -n monitoring tempo-0 -- \
tar czf /tmp/tempo-backup.tar.gz /var/tempo
# Copy to local machine
kubectl cp monitoring/tempo-0:/tmp/tempo-backup.tar.gz \
./tempo-backup.tar.gz
Note: Tempo is designed as ephemeral storage with short retention. Long-term trace archival is not a primary use case.
Disaster Recovery
Restore from Backup:
# 1. Remove existing deployment
./02-remove-tempo.sh rancher-desktop
# 2. Restore PVC data (requires direct PV access; see the sketch below)
# 3. Redeploy Tempo
./02-setup-tempo.sh rancher-desktop
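If direct PV access is not available, one workable variant of step 2 is to run step 3 first and then restore into the new pod, which mounts the same PVC (a sketch using the archive produced in the backup procedure above):
# Copy the backup into the redeployed pod and unpack it onto the PVC
kubectl cp ./tempo-backup.tar.gz monitoring/tempo-0:/tmp/tempo-backup.tar.gz
kubectl exec -n monitoring tempo-0 -- tar xzf /tmp/tempo-backup.tar.gz -C /
# Restart the pod so Tempo re-scans the restored blocks
kubectl delete pod -n monitoring tempo-0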
Data Loss Scenarios:
- PVC deleted: Traces are lost (not critical - 24h retention means limited impact)
- Corruption: Tempo auto-repairs blocks on startup
- Retention expired: Expected behavior, adjust retention if needed
🚀 Use Cases
1. Application Tracing with OpenTelemetry
Instrument Application (example with Go):
import (
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
)
// Configure OTLP exporter to send traces to the OTLP Collector
exporter, _ := otlptracehttp.New(ctx,
	otlptracehttp.WithEndpoint("otel.localhost"),
	otlptracehttp.WithURLPath("/v1/traces"),
	otlptracehttp.WithInsecure(), // the collector endpoint is plain HTTP
	otlptracehttp.WithHeaders(map[string]string{
		"Host": "otel.localhost",
	}),
)
Environment Configuration:
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://127.0.0.1/v1/traces
OTEL_EXPORTER_OTLP_HEADERS=Host=otel.localhost
Query in Grafana:
{ resource.service.name = "my-app" }
2. Correlate Logs and Traces
sovdev-logger Integration:
// TypeScript application with sovdev-logger
import { initializeSovdevLogger } from '@sovdev/logger';
// Logs include trace_id and span_id for correlation
logger.info("Processing request", {
trace_id: context.traceId,
span_id: context.spanId
});
Grafana Workflow:
- Find trace in Tempo with TraceQL
- Note trace_id from trace details
- Switch to Loki datasource
- Query logs: {service_name="my-app"} | json | trace_id="TRACE_ID"
- View correlated logs and traces together
3. Performance Analysis
Find Slow Requests:
# Traces slower than 2 seconds
{ duration > 2s }
# Count slow traces per service (TraceQL metrics query; requires the local-blocks metrics-generator processor)
{ duration > 2s } | count_over_time() by (resource.service.name)
Analyze Bottlenecks:
- Query slow traces in Grafana Explore
- View trace waterfall/flamegraph
- Identify slow spans (database queries, API calls, etc.)
- Optimize identified bottlenecks
4. Error Investigation
Find Failed Requests:
# Traces with errors
{ status = error }
# Errors from a specific service (narrow the time window with Grafana's time picker)
{ resource.service.name = "api-service" && status = error }
Debug Workflow:
- Find error traces with TraceQL
- Examine trace details and span attributes
- Correlate with logs (trace_id)
- Identify root cause from span data
💡 Key Insight: Tempo's design philosophy is "store everything, index nothing (except metadata)". This approach dramatically reduces costs while enabling powerful trace analysis through TraceQL. When integrated with OTLP Collector for ingestion and Grafana for visualization, Tempo provides complete distributed tracing capabilities for microservices architectures without the operational complexity of traditional tracing backends.
🔗 Related Documentation
Monitoring Stack:
- Monitoring Overview - Complete observability stack
- Prometheus Metrics - Metrics collection
- Loki Logs - Log aggregation
- OTLP Collector - Telemetry pipeline (trace ingestion)
- Grafana Visualization - Dashboards and trace exploration
Configuration & Rules:
- Naming Conventions - Manifest numbering (031)
- Development Workflow - Configuration management
- Automated Deployment - Orchestration
External Resources:
- TraceQL Language: https://grafana.com/docs/tempo/v2.8.x/traceql/
- OTLP Specification: https://opentelemetry.io/docs/specs/otlp/
- Tempo Configuration: https://grafana.com/docs/tempo/v2.8.x/configuration/