Prometheus - Metrics Collection & Alerting
Key Features: Time-Series Database • PromQL Query Language • Service Discovery • Multi-Dimensional Data • Alertmanager • Pushgateway • Node Exporter • Kube-State-Metrics
File: docs/package-monitoring-prometheus.md
Purpose: Complete guide to Prometheus deployment and configuration for metrics monitoring in Urbalurba infrastructure
Target Audience: DevOps engineers, platform administrators, SREs, developers
Last Updated: October 3, 2025
Deployed Version: Prometheus v3.6.0 (Helm Chart: prometheus-27.39.0)
Official Documentation: https://prometheus.io/docs/prometheus/3.6/
📋 Overview
Prometheus is the primary metrics backend in the Urbalurba monitoring stack. It provides time-series data storage, powerful querying capabilities, and automated service discovery for Kubernetes environments. Prometheus implements a pull-based model, actively scraping metrics from instrumented applications and exporters.
As part of the unified observability stack, Prometheus works alongside Tempo (traces) and Loki (logs), with all data visualized in Grafana.
Key Capabilities:
- Time-Series Database: Efficient storage of metrics with configurable retention (15 days default)
- PromQL: Powerful query language for metrics analysis and alerting
- Service Discovery: Automatic discovery of Kubernetes services via ServiceMonitor CRDs
- Multi-Dimensional Data: Label-based data model for flexible querying
- Remote Write: Accepts metrics from OpenTelemetry Collector
- Built-in Exporters: Node Exporter for host metrics, kube-state-metrics for Kubernetes object state, Pushgateway for batch jobs
Architecture Type: Pull-based metrics collector with time-series database and alerting
🏗️ Architecture
Deployment Components
┌─────────────────────────────────────────────────────────┐
│ Prometheus Stack (namespace: monitoring) │
├─────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Prometheus Server│ │ Alertmanager │ │
│ │ │ │ │ │
│ │ - Metrics Storage│ │ - Alert Routing │ │
│ │ - PromQL Engine │ │ - Deduplication │ │
│ │ - Scraping │◄───┤ - Notifications │ │
│ │ - Remote Write │ │ │ │
│ └────────┬─────────┘ └──────────────────┘ │
│ │ │
│ │ Scrapes Metrics │
│ ▼ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Node Exporter │ │ Kube-State-Metrics│ │
│ │ │ │ │ │
│ │ - Host Metrics │ │ - K8s Objects │ │
│ │ - CPU/Memory │ │ - Deployments │ │
│ │ - Disk I/O │ │ - Pods/Services │ │
│ │ - Network │ │ - ConfigMaps │ │
│ └──────────────────┘ └──────────────────┘ │
│ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Pushgateway │ │ ServiceMonitor │ │
│ │ │ │ Discovery │ │
│ │ - Batch Jobs │ │ │ │
│ │ - Ephemeral │ │ - Auto Scraping │ │
│ │ - Push Metrics │ │ - Label Config │ │
│ └──────────────────┘ └──────────────────┘ │
└─────────────────────────────────────────────────────────┘
│ │
▼ ▼
┌──────────────────┐ ┌──────────────────────┐
│ Grafana Query │ │ OTLP Collector Push │
│ │ │ (Remote Write API) │
└──────────────────┘ └──────────────────────┘
Data Flow
Application Metrics (Prometheus format)
│
│ HTTP Scrape (Pull)
▼
┌──────────────┐
│ Prometheus │
│ Server │
└──────────────┘
│
├─► Time-Series Storage (15d retention)
├─► PromQL Evaluation
├─► Alerting Rules
└─► Grafana Datasource
OTLP Collector (Metrics)
│
│ Remote Write (Push)
▼
┌──────────────┐
│ Prometheus │
│ Server │
└──────────────┘
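To see what the pull side of this flow looks like in practice, the sketch below scrapes the Node Exporter endpoint directly (service name and port as listed in the verification section later in this guide) and prints a few lines of the Prometheus exposition format that the server ingests:
# Fetch raw metrics in Prometheus exposition format from Node Exporter
kubectl run expo-test --image=curlimages/curl --rm -i --restart=Never \
  -n monitoring -- \
  curl -s http://prometheus-prometheus-node-exporter.monitoring.svc.cluster.local:9100/metrics | head -10
# Output lines look like (values illustrative):
#   # HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
#   # TYPE node_cpu_seconds_total counter
#   node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67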
File Structure
manifests/
└── 030-prometheus-config.yaml # Prometheus Helm values
ansible/playbooks/
├── 030-setup-prometheus.yml # Deployment automation
└── 030-remove-prometheus.yml # Removal automation
provision-host/kubernetes/11-monitoring/not-in-use/
├── 01-setup-prometheus.sh # Shell script wrapper
└── 01-remove-prometheus.sh # Removal script
Storage:
└── PersistentVolumeClaim
├── prometheus-server (8Gi) # Metrics storage
└── prometheus-alertmanager (2Gi) # Alert state
🚀 Deployment
Automated Deployment
Via Monitoring Stack (Recommended):
# Deploy entire monitoring stack (includes Prometheus)
docker exec -it provision-host bash
cd /mnt/urbalurbadisk/provision-host/kubernetes/11-monitoring/not-in-use
./00-setup-all-monitoring.sh rancher-desktop
Individual Deployment:
# Deploy Prometheus only
docker exec -it provision-host bash
cd /mnt/urbalurbadisk/provision-host/kubernetes/11-monitoring/not-in-use
./01-setup-prometheus.sh rancher-desktop
Manual Deployment
Prerequisites:
- Kubernetes cluster running (Rancher Desktop)
- monitoring namespace exists
- Helm installed in provision-host container
- Manifest file: manifests/030-prometheus-config.yaml
Deployment Steps:
# 1. Enter provision-host container
docker exec -it provision-host bash
# 2. Add Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# 3. Deploy Prometheus
helm upgrade --install prometheus prometheus-community/prometheus \
-f /mnt/urbalurbadisk/manifests/030-prometheus-config.yaml \
--namespace monitoring \
--create-namespace \
--timeout 600s \
--kube-context rancher-desktop
# 4. Wait for pods to be ready
kubectl wait --for=condition=ready pod \
-l app.kubernetes.io/name=prometheus \
-l app.kubernetes.io/component=server \
-n monitoring --timeout=300s
Deployment Time: ~2-3 minutes
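Optional follow-up check (a small sketch; app.kubernetes.io/instance is the standard Helm release label):
# Confirm the Helm release and list the resulting pods
helm status prometheus -n monitoring --kube-context rancher-desktop
kubectl get pods -n monitoring -l app.kubernetes.io/instance=prometheus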
⚙️ Configuration
Prometheus Configuration (manifests/030-prometheus-config.yaml)
Core Settings:
server:
retention: 15d # Metrics retention period
persistentVolume:
enabled: true
size: 8Gi # Storage for time-series data
resources:
requests:
cpu: 200m
memory: 512Mi
limits:
cpu: 500m
memory: 1Gi
extraArgs:
web.enable-remote-write-receiver: "" # REQUIRED: Enables /api/v1/write endpoint for OTLP Collector
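To confirm these values reach the running server, the sketch below inspects the container's startup flags (container name prometheus-server, as used elsewhere in this guide); the exact flag names come from the chart, so treat the expected output as indicative:
# Check that retention and remote-write settings became startup flags
kubectl get pod -n monitoring -l app.kubernetes.io/name=prometheus,app.kubernetes.io/component=server \
  -o jsonpath='{.items[0].spec.containers[?(@.name=="prometheus-server")].args}' \
  | tr ',' '\n' | grep -E 'retention|remote-write'
# Expected to include flags like --storage.tsdb.retention.time=15d and --web.enable-remote-write-receiver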
Key Configuration Sections:
1. Alertmanager (Alert Processing):
alertmanager:
enabled: true
persistentVolume:
enabled: true
size: 2Gi
resources:
requests:
cpu: 100m
memory: 128Mi
2. Node Exporter (Host Metrics):
nodeExporter:
enabled: true
hostNetwork: false # Use pod network
3. Pushgateway (Batch Job Metrics):
pushgateway:
enabled: true
persistentVolume:
enabled: false # Ephemeral storage
resources:
requests:
cpu: 50m
memory: 64Mi
4. Kube-State-Metrics (Kubernetes Objects):
kubeStateMetrics:
enabled: true # Pod/Deployment/Service metrics
5. ServiceMonitor (Auto-Discovery):
serviceMonitors:
enabled: true # Automatic service discovery
Resource Configuration
Storage Requirements:
- Prometheus Server: 8Gi persistent volume (15-day retention)
- Alertmanager: 2Gi persistent volume (alert state)
- Pushgateway: No persistence (ephemeral metrics)
Memory & CPU:
- Server: 512Mi request, 1Gi limit / 200m CPU request, 500m limit
- Alertmanager: 128Mi request, 256Mi limit / 100m CPU request, 200m limit
- Pushgateway: 64Mi request, 128Mi limit / 50m CPU request, 100m limit
Security Configuration
Network Access:
# Internal cluster access only (no IngressRoute by default)
Service: prometheus-server.monitoring.svc.cluster.local:80
Optional External Access:
# Port forwarding for local development
kubectl port-forward -n monitoring svc/prometheus-server 9090:80
# Access at: http://localhost:9090
🔍 Monitoring & Verification
Health Checks
Check Pod Status:
# All Prometheus components
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus
# Expected output (5 pods total):
NAME READY STATUS
prometheus-server-xxx 2/2 Running # Main Prometheus server + config reloader
prometheus-alertmanager-xxx 1/1 Running # Alert processing and routing
prometheus-kube-state-metrics-xxx 1/1 Running # Kubernetes object metrics
prometheus-prometheus-node-exporter-xxx 1/1 Running # Host/node metrics (CPU, memory, disk)
prometheus-prometheus-pushgateway-xxx 1/1 Running # Batch job metrics receiver
Pod Descriptions:
- prometheus-server: Main Prometheus server (scraping, storage, querying) + config-reload sidecar
- prometheus-alertmanager: Processes and routes alerts to notification channels
- prometheus-kube-state-metrics: Exposes Kubernetes object state as Prometheus metrics (pods, deployments, services)
- prometheus-prometheus-node-exporter: DaemonSet that collects host-level metrics from the node (CPU, memory, disk I/O, network)
- prometheus-prometheus-pushgateway: Allows ephemeral/batch jobs to push metrics to Prometheus
Check Service Endpoints:
# Verify services are accessible
kubectl get svc -n monitoring -l app.kubernetes.io/name=prometheus
# Expected services:
prometheus-server ClusterIP 10.43.x.x 80/TCP
prometheus-alertmanager ClusterIP 10.43.x.x 9093/TCP
prometheus-prometheus-pushgateway ClusterIP 10.43.x.x 9091/TCP
prometheus-prometheus-node-exporter ClusterIP 10.43.x.x 9100/TCP
Service Verification
Test Prometheus API:
# Runtime info endpoint
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never \
-n monitoring -- \
curl -s http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/status/runtimeinfo
# Expected: JSON response with runtime information
Test Metrics Endpoint:
# Prometheus self-monitoring metrics
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never \
-n monitoring -- \
curl -s http://prometheus-server.monitoring.svc.cluster.local:80/metrics | head -20
Data Ingestion Testing
Push Test Metric:
# Push metric to Pushgateway
kubectl run prometheus-data-test --image=curlimages/curl --rm -i --restart=Never \
-n monitoring -- \
/bin/sh -c 'echo "test_metric 42" | curl -X POST --data-binary @- \
http://prometheus-prometheus-pushgateway.monitoring.svc.cluster.local:9091/metrics/job/test/instance/test'
Query Test Metric:
# Wait 15 seconds for scrape interval
sleep 15
# Query Prometheus for the test metric
kubectl run prometheus-query --image=curlimages/curl --rm -i --restart=Never \
-n monitoring -- \
curl -s --data-urlencode 'query=test_metric' \
"http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/query"
# Expected: JSON response with "status":"success" and test_metric value
Automated Verification
The deployment playbook (030-setup-prometheus.yml) performs automated tests:
- ✅ Server API connectivity test
- ✅ Metrics endpoint test
- ✅ Pushgateway ingestion test
- ✅ Query test metric verification
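The playbook can also be invoked directly from inside the provision-host container; a hedged sketch, since inventory and variables are defined by the Urbalurba Ansible setup:
# Run the deployment playbook directly (exact inventory/extra-vars depend on the Ansible setup)
docker exec -it provision-host bash
ansible-playbook /mnt/urbalurbadisk/ansible/playbooks/030-setup-prometheus.yml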
🛠️ Management Operations
Prometheus UI Access (Development)
Port Forwarding:
# Forward Prometheus UI to localhost
kubectl port-forward -n monitoring svc/prometheus-server 9090:80
# Open browser
http://localhost:9090
UI Features:
- Graph: PromQL query and visualization
- Alerts: Active alerts and rules
- Status: Targets, service discovery, configuration
- Metrics Explorer: Browse available metrics
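The same queries can be issued against the HTTP API instead of the UI, which is handy for scripting; a small sketch, assuming the port-forward above is running and jq is installed locally:
# Run a PromQL query via the HTTP API
curl -s --data-urlencode 'query=up' http://localhost:9090/api/v1/query | jq '.data.result[0:3]'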
Common PromQL Queries
Node Metrics:
# CPU usage by node
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage (percentage of memory in use)
100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)
# Disk usage
100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100)
Kubernetes Metrics:
# Pod count by namespace
count by (namespace) (kube_pod_info)
# Deployment replicas
kube_deployment_status_replicas{deployment="prometheus-server"}
# Container restarts
kube_pod_container_status_restarts_total
Prometheus Self-Monitoring:
# Scrape duration
prometheus_target_interval_length_seconds
# Active time series
prometheus_tsdb_head_series
# Storage size
prometheus_tsdb_storage_blocks_bytes
Service Removal
Automated Removal:
docker exec -it provision-host bash
cd /mnt/urbalurbadisk/provision-host/kubernetes/11-monitoring/not-in-use
./01-remove-prometheus.sh rancher-desktop
Manual Removal:
# Remove Helm chart
helm uninstall prometheus -n monitoring --kube-context rancher-desktop
# Remove PVCs (optional - preserves data if omitted)
kubectl delete pvc -n monitoring -l app.kubernetes.io/name=prometheus
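Optional post-removal check (a sketch using the standard Helm instance label):
# Confirm no Prometheus resources remain
kubectl get all,pvc -n monitoring -l app.kubernetes.io/instance=prometheus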
🔧 Troubleshooting
Common Issues
Pods Not Starting:
# Check pod events
kubectl describe pod -n monitoring -l app.kubernetes.io/name=prometheus
# Common causes:
# - Insufficient resources (check node capacity)
# - PVC binding issues (check PV availability)
# - Image pull errors (check network connectivity)
High Memory Usage:
# Check Prometheus memory usage
kubectl top pod -n monitoring -l app.kubernetes.io/name=prometheus
# Solutions:
# 1. Reduce retention period in manifests/030-prometheus-config.yaml
# 2. Increase memory limits
# 3. Check for cardinality explosion (too many unique label combinations)
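For the cardinality case, the TSDB status endpoint reports the metric names with the most series; a sketch assuming jq is installed locally:
# List the highest-cardinality metric names
kubectl port-forward -n monitoring svc/prometheus-server 9090:80 &
sleep 2
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName'
kill %1   # stop the background port-forward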
Metrics Not Appearing:
# Check scrape targets
kubectl port-forward -n monitoring svc/prometheus-server 9090:80
# Visit http://localhost:9090/targets
# Check ServiceMonitor configuration
kubectl get servicemonitor -n monitoring
# Verify application metrics endpoint
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never \
-n <app-namespace> -- \
curl -s http://<service>:<port>/metrics
Remote Write Failures (from OTLP Collector):
# Check Prometheus server logs
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus -l app.kubernetes.io/component=server
# Check OTLP Collector logs for remote write errors
kubectl logs -n monitoring -l app.kubernetes.io/name=opentelemetry-collector | grep -i prometheus
# Verify remote write endpoint is accessible
kubectl run curl-test --image=curlimages/curl --rm -i --restart=Never \
-n monitoring -- \
curl -v http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/write
IMPORTANT: If Prometheus returns 404 or error "remote write receiver needs to be enabled", check that the remote-write-receiver flag is enabled:
# Check Prometheus startup flags
kubectl get pod -n monitoring -l app.kubernetes.io/name=prometheus,app.kubernetes.io/component=server \
-o jsonpath='{.items[0].spec.containers[?(@.name=="prometheus-server")].args}' | jq -r '.[]' | grep "remote-write"
# Should see: --web.enable-remote-write-receiver
# If missing, add to manifests/030-prometheus-config.yaml:
# server:
# extraArgs:
# web.enable-remote-write-receiver: ""
Alertmanager Not Firing Alerts:
# Check Alertmanager logs
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus -l app.kubernetes.io/component=alertmanager
# Check alert rules in Prometheus UI
kubectl port-forward -n monitoring svc/prometheus-server 9090:80
# Visit http://localhost:9090/alerts
# Verify Alertmanager configuration
kubectl get configmap -n monitoring prometheus-alertmanager -o yaml
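To check routing independently of Prometheus, a synthetic alert can be posted straight to Alertmanager's v2 API (service name and port from the service list earlier in this guide); a sketch:
# Post a test alert directly to Alertmanager
kubectl run alert-test --image=curlimages/curl --rm -i --restart=Never \
  -n monitoring -- \
  curl -s -X POST http://prometheus-alertmanager.monitoring.svc.cluster.local:9093/api/v2/alerts \
    -H 'Content-Type: application/json' \
    -d '[{"labels":{"alertname":"ManualTest","severity":"warning"},"annotations":{"summary":"Manual test alert"}}]'
# The alert should then show up in the Alertmanager UI and any configured receivers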
📋 Maintenance
Regular Tasks
Monitor Storage Usage:
# Check PVC usage
kubectl get pvc -n monitoring
# Check Prometheus TSDB size via PromQL
kubectl port-forward -n monitoring svc/prometheus-server 9090:80
# Query: prometheus_tsdb_storage_blocks_bytes
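The same value can be read via the API for scripting; a sketch assuming the port-forward above is active and jq is installed:
# Current TSDB block size in bytes
curl -s --data-urlencode 'query=prometheus_tsdb_storage_blocks_bytes' \
  http://localhost:9090/api/v1/query | jq '.data.result[0].value[1]'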
Update Prometheus:
# Update Helm chart to latest version
helm repo update
helm upgrade prometheus prometheus-community/prometheus \
-f /mnt/urbalurbadisk/manifests/030-prometheus-config.yaml \
-n monitoring \
--kube-context rancher-desktop
Cleanup Old Metrics (automatic):
# Retention handled automatically via server.retention setting
server:
retention: 15d # Metrics older than 15 days are purged
Backup Procedures
Snapshot Time-Series Data:
# Create snapshot via API
kubectl run curl-snap --image=curlimages/curl --rm -i --restart=Never \
-n monitoring -- \
curl -X POST http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/admin/tsdb/snapshot
# Snapshot stored in: /prometheus/snapshots/
# Note: the snapshot endpoint requires the --web.enable-admin-api flag on the Prometheus server
Backup PVC:
# Export PVC data (requires read/write access)
kubectl exec -n monitoring deployment/prometheus-server -- \
tar czf /tmp/prometheus-backup.tar.gz /prometheus
# Copy to local machine
kubectl cp monitoring/prometheus-server-xxx:/tmp/prometheus-backup.tar.gz \
./prometheus-backup.tar.gz
Disaster Recovery
Restore from Backup:
# 1. Remove existing deployment
./01-remove-prometheus.sh rancher-desktop
# 2. Restore PVC data (manual process, requires direct PV access)
# 3. Redeploy Prometheus
./01-setup-prometheus.sh rancher-desktop
Data Loss Scenarios:
- PVC deleted: Metrics are lost, redeploy and start fresh collection
- Corruption: Prometheus auto-repairs TSDB on startup (check logs)
- Retention expired: Expected behavior, increase retention if needed
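For the corruption scenario above, the server logs show whether a WAL/TSDB repair ran on startup; a sketch (container name prometheus-server, as used in the backup section):
# Look for repair/corruption messages after a restart
kubectl logs -n monitoring deployment/prometheus-server -c prometheus-server \
  | grep -iE 'corrupt|repair|wal' | tail -20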
🚀 Use Cases
1. Application Metrics Monitoring
Instrument Application:
// Go example with the official Prometheus client library
import "github.com/prometheus/client_golang/prometheus/promhttp"
http.Handle("/metrics", promhttp.Handler())
http.ListenAndServe(":8080", nil) // port matches the prometheus.io/port annotation below
Expose via Scrape Annotations (auto-discovered by Prometheus):
apiVersion: v1
kind: Service
metadata:
name: my-app
labels:
app: my-app
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
Query in Grafana:
rate(http_requests_total{job="my-app"}[5m])
2. Alert on High Resource Usage
Create Alert Rule:
# Add to Prometheus configuration
groups:
- name: resource_alerts
rules:
- alert: HighMemoryUsage
expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage detected"
3. Integrate with OTLP Collector
Receive Metrics from OTLP:
# OTLP Collector configuration (manifests/033-otel-collector-config.yaml)
exporters:
prometheusremotewrite:
endpoint: http://prometheus-server.monitoring.svc.cluster.local:80/api/v1/write
Verify Metrics Ingestion:
# Query OTLP-sourced metrics
kubectl port-forward -n monitoring svc/prometheus-server 9090:80
# Query: {job="otel-collector"}
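A scripted check is also possible via the series API; a sketch assuming the port-forward above is active and jq is installed locally (adjust the job label to whatever your collector attaches):
# List a few series carrying the collector's job label
curl -s -G http://localhost:9090/api/v1/series \
  --data-urlencode 'match[]={job="otel-collector"}' | jq '.data[0:5]'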
4. Dashboard Creation in Grafana
Add Prometheus Datasource (pre-configured):
datasources:
- name: Prometheus
type: prometheus
url: http://prometheus-server.monitoring.svc.cluster.local:80
access: proxy
Create Dashboard Panel:
{
"targets": [{
"expr": "rate(prometheus_http_requests_total[5m])",
"legendFormat": "{{handler}}"
}]
}
💡 Key Insight: Prometheus serves as the metrics foundation for the entire Urbalurba observability stack. Its pull-based architecture, combined with ServiceMonitor auto-discovery and PromQL's powerful query capabilities, provides comprehensive visibility into infrastructure and application health. When integrated with OTLP Collector, Loki, and Tempo, it forms a complete observability solution visualized in Grafana.
🔗 Related Documentation
Monitoring Stack:
- Monitoring Overview - Complete observability stack
- Tempo Tracing - Distributed tracing backend
- Loki Logs - Log aggregation
- OTLP Collector - Telemetry pipeline
- Grafana Visualization - Dashboards
Configuration & Rules:
- Naming Conventions - Manifest numbering (030)
- Development Workflow - Configuration management
- Automated Deployment - Orchestration