# Build a Production Monitoring Stack with Prometheus and Grafana
Monitoring is the backbone of reliable infrastructure. In this tutorial, you'll build a complete observability stack using Prometheus for metrics collection and Grafana for visualization — all orchestrated with Docker Compose.
## What You'll Build
- Prometheus server collecting metrics from multiple targets
- Grafana dashboards with real-time visualizations
- Node Exporter for system metrics (CPU, memory, disk)
- A sample Python app with custom metrics
- Alertmanager with Slack notifications
## Prerequisites

- Docker and Docker Compose installed
- Basic understanding of YAML configuration
- A Slack webhook URL (optional, for alerts)
---
## Step 1: Project Structure

Create the project directory:

```bash
mkdir monitoring-stack && cd monitoring-stack
mkdir -p prometheus grafana/provisioning/datasources grafana/provisioning/dashboards alertmanager app
```
Final structure:

```text
monitoring-stack/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   └── alert.rules.yml
├── grafana/
│   └── provisioning/
│       ├── datasources/
│       │   └── prometheus.yml
│       └── dashboards/
│           ├── dashboard.yml
│           └── node-exporter.json
├── alertmanager/
│   └── alertmanager.yml
└── app/
    ├── app.py
    ├── requirements.txt
    └── Dockerfile
```
---
## Step 2: Docker Compose Configuration

Create `docker-compose.yml`:

```yaml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alert.rules.yml:/etc/prometheus/alert.rules.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: unless-stopped
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:10.3.1
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    restart: unless-stopped
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    restart: unless-stopped
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped
    networks:
      - monitoring

  sample-app:
    build: ./app
    container_name: sample-app
    ports:
      - "8000:8000"
    restart: unless-stopped
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:

networks:
  monitoring:
    driver: bridge
```
---
## Step 3: Prometheus Configuration

Create `prometheus/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "alert.rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'sample-app'
    static_configs:
      - targets: ['sample-app:8000']
```
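The `scrape_interval` above, together with the 30-day retention set in Step 2, determines how much data the TSDB accumulates. A rough back-of-envelope sketch (the function names are illustrative, not part of any Prometheus API; the bytes-per-sample figure is an approximation, as compression varies with data shape):

```python
def samples_per_series(scrape_interval_s: int, retention_days: int) -> int:
    """Estimate how many samples one time series accumulates over the retention window."""
    return retention_days * 24 * 3600 // scrape_interval_s

def approx_disk_bytes(num_series: int, scrape_interval_s: int, retention_days: int,
                      bytes_per_sample: float = 2.0) -> float:
    """Very rough TSDB size estimate; compressed samples average on the order of 1-2 bytes."""
    return num_series * samples_per_series(scrape_interval_s, retention_days) * bytes_per_sample

# 15s scrape interval over 30 days -> 172,800 samples per series
print(samples_per_series(15, 30))  # 172800
```

Numbers like this are why Step 9 suggests tuning `--storage.tsdb.retention.time` to your storage capacity.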
Create `prometheus/alert.rules.yml`:

```yaml
groups:
  - name: system_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 5 minutes (current: {{ $value }}%)"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% (current: {{ $value }}%)"

      - alert: DiskSpaceRunningLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Disk space running low on {{ $labels.instance }}"
          description: "Available disk space is below 15% (current: {{ $value }}%)"

      - alert: TargetDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.instance }} is down"
          description: "{{ $labels.job }} target {{ $labels.instance }} has been down for more than 1 minute."

      - alert: HighRequestLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High request latency on {{ $labels.instance }}"
          description: "95th percentile latency is above 1s (current: {{ $value }}s)"
```
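To see what the `HighCPUUsage` expression actually computes: `irate(node_cpu_seconds_total{mode="idle"}[5m])` yields, per CPU core, the fraction of time recently spent idle; averaging per instance and subtracting from 100 gives busy percentage. The same arithmetic in plain Python (a sketch with made-up function names, not anything Prometheus exposes):

```python
def cpu_busy_percent(idle_fractions):
    """Mirror of: 100 - (avg by(instance)(irate(...{mode="idle"}[5m])) * 100).
    idle_fractions: per-core idle-time fractions (0.0-1.0) over the rate window."""
    avg_idle = sum(idle_fractions) / len(idle_fractions)
    return 100 - avg_idle * 100

def high_cpu_alert_fires(idle_fractions, threshold=80.0):
    """The > 80 comparison from the alert rule (before the 'for: 5m' hold period)."""
    return cpu_busy_percent(idle_fractions) > threshold

# Four cores averaging 10% idle -> 90% busy, condition is true
print(high_cpu_alert_fires([0.10, 0.05, 0.15, 0.10]))  # True
```

Remember that `for: 5m` means the condition must hold continuously for five minutes before the alert actually fires.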
---
## Step 4: Sample Python Application with Custom Metrics

Create `app/requirements.txt`:

```text
prometheus-client==0.20.0
flask==3.0.0
```
Create `app/app.py`:

```python
import time
import random
from functools import wraps

from flask import Flask, Response
from prometheus_client import (
    Counter, Histogram, Gauge,
    generate_latest, CONTENT_TYPE_LATEST
)

app = Flask(__name__)

# Define custom metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)

ACTIVE_REQUESTS = Gauge(
    'http_active_requests',
    'Number of active HTTP requests'
)

ITEMS_IN_QUEUE = Gauge(
    'app_items_in_queue',
    'Number of items waiting in the processing queue'
)


def track_request(method, endpoint):
    """Decorator to track request metrics."""
    def decorator(f):
        @wraps(f)  # preserve f's name so Flask's route registration stays unique
        def wrapper(*args, **kwargs):
            ACTIVE_REQUESTS.inc()
            start_time = time.time()
            try:
                result = f(*args, **kwargs)
                status = 200
                return result
            except Exception:
                status = 500
                raise
            finally:
                duration = time.time() - start_time
                REQUEST_COUNT.labels(method, endpoint, status).inc()
                REQUEST_DURATION.labels(method, endpoint).observe(duration)
                ACTIVE_REQUESTS.dec()
        return wrapper
    return decorator


@app.route('/')
@track_request('GET', '/')
def home():
    # Simulate variable processing time
    time.sleep(random.uniform(0.01, 0.1))
    return {'status': 'ok', 'message': 'Hello from the monitored app!'}


@app.route('/api/process')
@track_request('GET', '/api/process')
def process():
    # Simulate heavier processing
    time.sleep(random.uniform(0.05, 0.5))
    queue_size = random.randint(0, 100)
    ITEMS_IN_QUEUE.set(queue_size)
    return {'status': 'processed', 'queue_size': queue_size}


@app.route('/api/slow')
@track_request('GET', '/api/slow')
def slow_endpoint():
    # Intentionally slow for testing alerts
    time.sleep(random.uniform(0.5, 2.0))
    return {'status': 'done', 'note': 'This was intentionally slow'}


@app.route('/metrics')
def metrics():
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)
```
Create `app/Dockerfile`:

```dockerfile
FROM python:3.12-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app.py .

EXPOSE 8000
CMD ["python", "app.py"]
```
---
## Step 5: Alertmanager Configuration

Create `alertmanager/alertmanager.yml`:

```yaml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'slack-notifications'
  routes:
    - matchers:
        - severity = "critical"
      receiver: 'slack-critical'
      repeat_interval: 15m

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#monitoring'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}{{ end }}'
        send_resolved: true

  - name: 'slack-critical'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#incidents'
        title: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}{{ end }}'
        send_resolved: true
```

Note the `matchers` syntax: the older `match:` keyword still works in v0.27 but is deprecated. Also, `{{ "\n" }}` emits a real newline between alert descriptions; a bare `\n` inside single-quoted YAML would reach Slack as a literal backslash-n.
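The routing tree works top-down: each alert is tested against the child routes first, and falls through to the root receiver if nothing matches. A simplified sketch of that selection logic (`pick_receiver` is illustrative, not Alertmanager's API; real routing also supports `continue`, regex matchers, and nesting):

```python
def pick_receiver(alert_labels: dict) -> str:
    """Mimic the routing tree above: critical alerts go to the #incidents
    receiver, everything else falls through to the root receiver."""
    child_routes = [
        ({'severity': 'critical'}, 'slack-critical'),
    ]
    for matchers, receiver in child_routes:
        # A route matches when every matcher label equals the alert's label
        if all(alert_labels.get(k) == v for k, v in matchers.items()):
            return receiver
    return 'slack-notifications'  # root receiver

print(pick_receiver({'alertname': 'TargetDown', 'severity': 'critical'}))   # slack-critical
print(pick_receiver({'alertname': 'HighCPUUsage', 'severity': 'warning'}))  # slack-notifications
```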
---
## Step 6: Grafana Auto-Provisioning

Create `grafana/provisioning/datasources/prometheus.yml`:

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
```
Create `grafana/provisioning/dashboards/dashboard.yml`:

```yaml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards
      foldersFromFilesStructure: false
```

For the `node-exporter.json` file from the project tree, export any dashboard's JSON from the Grafana UI, or download a community one (for example, the Node Exporter Full dashboard, ID 1860, from grafana.com) and save it into `grafana/provisioning/dashboards/` so this provider picks it up on startup.
---
## Step 7: Launch and Test

```bash
# Start the entire stack
docker compose up -d

# Verify all services are running
docker compose ps

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets | python -m json.tool

# Generate some traffic to the sample app
for i in $(seq 1 50); do
  curl -s http://localhost:8000/ > /dev/null
  curl -s http://localhost:8000/api/process > /dev/null
  curl -s http://localhost:8000/api/slow > /dev/null
done

# Check metrics
curl http://localhost:8000/metrics
```
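The `/metrics` response is Prometheus's plain-text exposition format: `# HELP` and `# TYPE` comment lines, then one `name{labels} value` line per sample. A minimal parser sketch to make the shape concrete (it skips comments and ignores edge cases like escaped label values and optional timestamps):

```python
def parse_metrics(text: str) -> dict:
    """Parse simple exposition-format lines into {series: float}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip HELP/TYPE comments and blank lines
        series, _, value = line.rpartition(' ')
        samples[series] = float(value)
    return samples

example = '''\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/",status="200"} 50.0
http_active_requests 0.0
'''
print(parse_metrics(example))
```

After the traffic loop above, you should see nonzero `http_requests_total` counters for each endpoint in the real output.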
Access the services:

- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (log in as admin / changeme)
- Alertmanager: http://localhost:9093
- Sample app: http://localhost:8000
- Node Exporter metrics: http://localhost:9100/metrics
---
## Step 8: Useful PromQL Queries

Try these queries in the Prometheus UI:

```promql
# Request rate per second
rate(http_requests_total[5m])

# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# CPU usage percentage
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage percentage
(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100

# Active HTTP requests
http_active_requests
```
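`histogram_quantile` estimates a quantile from cumulative bucket counts: it finds the bucket containing the target rank and interpolates linearly inside it. A simplified Python sketch of that estimation (PromQL's real implementation additionally handles NaN, unordered buckets, and other edge cases this omits):

```python
def histogram_quantile(q, buckets):
    """Estimate the q-th quantile from cumulative histogram buckets.

    buckets: sorted list of (upper_bound, cumulative_count) pairs,
    ending with a float('inf') bucket, as Prometheus histograms require."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float('inf'):
                # Quantile falls past the last finite bucket; return its bound
                return prev_bound
            # Linear interpolation within this bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# 100 requests: 50 under 0.1s, 90 under 0.5s, 99 under 1.0s
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (float('inf'), 100)]
print(histogram_quantile(0.95, buckets))  # ~0.778 seconds
```

This is also why bucket boundaries matter: the estimate can never be more precise than the bucket the quantile lands in, which is the reason the sample app defines buckets clustered around its expected latencies.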
---
## Step 9: Production Hardening Tips

1. **Secure Grafana**: Change default passwords, enable HTTPS, configure OAuth
2. **Retention policy**: Adjust `--storage.tsdb.retention.time` based on storage capacity
3. **Remote storage**: Use Thanos or Cortex for long-term metric storage
4. **Service discovery**: Replace static configs with Consul, Kubernetes, or EC2 SD
5. **Recording rules**: Pre-compute expensive queries for dashboard performance

```yaml
# Example recording rule, added to a rule file loaded via rule_files
# (e.g. prometheus/alert.rules.yml) -- not to prometheus.yml itself
groups:
  - name: recording_rules
    rules:
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```
---
## Cleanup

```bash
# Stop and remove everything, including volumes
docker compose down -v
```
## Next Steps

You now have a complete, working monitoring stack. Add your own services as scrape targets, build custom dashboards for your specific needs, and work through the hardening tips in Step 9 before treating it as production-grade.