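Building a Production-Grade Monitoring System with Prometheus and Grafana
Monitoring is the foundation of reliable infrastructure. This tutorial builds a complete observability stack with Docker Compose, using Prometheus to collect metrics and Grafana to visualize them.
What you will build
- A Prometheus server that scrapes metrics from multiple targets
- Grafana dashboards with real-time visualization
- Node Exporter for system metrics (CPU, memory, disk)
- A sample Python application with custom metrics
- Alertmanager with Slack notifications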
Prerequisites
- Docker and Docker Compose installed
- Basic familiarity with YAML configuration
- A Slack webhook URL (optional, for alerts)
---
Step 1: Project structure
Create the project directories:
mkdir monitoring-stack && cd monitoring-stack
mkdir -p prometheus grafana/provisioning/datasources grafana/provisioning/dashboards alertmanager app
The final layout:
monitoring-stack/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   └── alert.rules.yml
├── grafana/
│   └── provisioning/
│       ├── datasources/
│       │   └── prometheus.yml
│       └── dashboards/
│           ├── dashboard.yml
│           └── node-exporter.json
├── alertmanager/
│   └── alertmanager.yml
└── app/
    ├── app.py
    ├── requirements.txt
    └── Dockerfile
---
Step 2: Docker Compose configuration
Create docker-compose.yml:
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alert.rules.yml:/etc/prometheus/alert.rules.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: unless-stopped
    networks:
      - monitoring
  grafana:
    image: grafana/grafana:10.3.1
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    restart: unless-stopped
    networks:
      - monitoring
  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    ports:
      - "9100:9100"
    restart: unless-stopped
    networks:
      - monitoring
  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped
    networks:
      - monitoring
  sample-app:
    build: ./app
    container_name: sample-app
    ports:
      - "8000:8000"
    restart: unless-stopped
    networks:
      - monitoring
volumes:
  prometheus_data:
  grafana_data:
networks:
  monitoring:
    driver: bridge
---
Step 3: Prometheus configuration
Create prometheus/prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
rule_files:
  - "alert.rules.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'sample-app'
    static_configs:
      - targets: ['sample-app:8000']
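Once the stack is running (Step 5), the three scrape jobs above can be checked through the Prometheus HTTP API at /api/v1/targets. The following is a minimal sketch, not part of the original tutorial: the helper names `summarize_targets` and `fetch_targets` and the `PROMETHEUS_URL` constant are illustrative, and it assumes Prometheus is reachable on the host port mapped in docker-compose.yml.

```python
import json
import urllib.request

# Assumption: Prometheus is published on localhost:9090, as in docker-compose.yml.
PROMETHEUS_URL = "http://localhost:9090"


def summarize_targets(payload):
    """Map each 'job/instance' pair to the health Prometheus reports for it."""
    summary = {}
    for target in payload["data"]["activeTargets"]:
        labels = target["labels"]
        key = f"{labels['job']}/{labels['instance']}"
        summary[key] = target["health"]  # 'up', 'down', or 'unknown'
    return summary


def fetch_targets(base_url=PROMETHEUS_URL, timeout=5):
    """Fetch the active-target list from a running Prometheus server."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)
```

With the stack up, `summarize_targets(fetch_targets())` should list one entry per job defined in scrape_configs; any entry that is not 'up' points at a wrong hostname, port, or a container that failed to start.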
Alert rules go in prometheus/alert.rules.yml:
groups:
  - name: system_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage has been above 80% for more than 5 minutes (current: {{ $value }}%)"
      - alert: TargetDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.instance }} is down"
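To make the HighCPUUsage expression concrete, here is a rough Python sketch of its arithmetic: irate() takes the per-second increase between the last two samples of each CPU's idle counter, avg by(instance) averages those idle fractions across CPUs, and the result is subtracted from 100. The sample values and the function names `irate` and `cpu_usage_percent` are hypothetical, and counter resets are ignored for brevity.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp, value) samples."""
    (t1, v1), (t2, v2) = samples[-2], samples[-1]
    return (v2 - v1) / (t2 - t1)


def cpu_usage_percent(idle_samples_by_cpu):
    """100 minus the average idle fraction across CPUs, as a percentage."""
    idle_rates = [irate(samples) for samples in idle_samples_by_cpu.values()]
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Hypothetical idle-mode counter samples taken 15 s apart (the scrape interval).
samples = {
    "cpu0": [(0, 1000.0), (15, 1001.5)],  # idle for 1.5 s out of 15 s
    "cpu1": [(0, 2000.0), (15, 2003.0)],  # idle for 3.0 s out of 15 s
}
usage = cpu_usage_percent(samples)
print(f"{usage:.1f}% busy, above threshold: {usage > 80}")
```

The alert only fires once this condition has held continuously for the 5 minutes given in the rule's `for` field; a single evaluation above 80 merely puts it into the pending state.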
---
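Step 4: Python application with custom metrics
Create app/app.py (its imports require the flask and prometheus_client packages, which belong in app/requirements.txt):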
import functools
import random
import time

from flask import Flask, Response
from prometheus_client import (
    Counter, Histogram, Gauge,
    generate_latest, CONTENT_TYPE_LATEST
)

app = Flask(__name__)

# Define custom metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total number of HTTP requests',
    ['method', 'endpoint', 'status']
)
REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'HTTP request processing time in seconds',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)
ACTIVE_REQUESTS = Gauge(
    'http_active_requests',
    'Number of HTTP requests currently in flight'
)


def track_request(method, endpoint):
    """Decorator that records request metrics."""
    def decorator(f):
        @functools.wraps(f)  # keep f's name so Flask endpoint names stay unique
        def wrapper(*args, **kwargs):
            ACTIVE_REQUESTS.inc()
            start_time = time.time()
            try:
                result = f(*args, **kwargs)
                status = 200
                return result
            except Exception:
                status = 500
                raise
            finally:
                # Runs on success and failure, so every request is counted
                duration = time.time() - start_time
                REQUEST_COUNT.labels(method, endpoint, status).inc()
                REQUEST_DURATION.labels(method, endpoint).observe(duration)
                ACTIVE_REQUESTS.dec()
        return wrapper
    return decorator


@app.route('/')
@track_request('GET', '/')
def home():
    # Simulate variable processing time
    time.sleep(random.uniform(0.01, 0.1))
    return {'status': 'ok', 'message': 'Hello from the monitored app!'}


@app.route('/metrics')
def metrics():
    # Expose all registered metrics in the Prometheus text format
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)
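The bucket list passed to the Histogram is easier to reason about once you see that buckets are cumulative: each observation increments every bucket whose upper bound is at least as large as the observed value, and the client library adds a final +Inf bucket automatically. The standard-library sketch below imitates that counting with made-up durations; `cumulative_bucket_counts` is an illustrative helper, not a prometheus_client function.

```python
import math

# Same upper bounds as REQUEST_DURATION, plus the implicit +Inf catch-all bucket.
BUCKETS = [0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, math.inf]


def cumulative_bucket_counts(durations, bounds=BUCKETS):
    """Count observations <= each upper bound, like the *_bucket{le=...} series."""
    return {le: sum(1 for d in durations if d <= le) for le in bounds}


# Hypothetical request durations in seconds.
observed = [0.02, 0.04, 0.08, 0.3, 7.0]
counts = cumulative_bucket_counts(observed)
for le, count in counts.items():
    print(f"le={le}: {count}")
```

Because the 7-second request only lands in the +Inf bucket, any quantile that falls there cannot be estimated beyond the largest finite bound, which is why the bounds should cover the latencies you expect to see.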
---
Step 5: Start and test
# Start the whole stack
docker compose up -d
# Verify that all services are running
docker compose ps
# Send test traffic to the sample app
for i in $(seq 1 50); do
  curl -s http://localhost:8000/ > /dev/null
done
# Inspect the metrics
curl http://localhost:8000/metrics
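The /metrics output is plain text, so a quick way to confirm that the test traffic was counted is to sum the samples of http_requests_total. The sketch below does this with the standard library only; `sum_metric` and `SAMPLE` are illustrative, the sample text is hand-written rather than captured output, and the parser assumes sample lines carry no trailing timestamp.

```python
def sum_metric(exposition_text, metric_name):
    """Sum the values of every series of metric_name in Prometheus text format."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]  # drop the label set, if any
        if name == metric_name:
            total += float(value)
    return total


# Hand-written example in the format the sample app exposes.
SAMPLE = """\
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/",status="200"} 50.0
http_requests_total{method="GET",endpoint="/",status="500"} 2.0
http_active_requests 0.0
"""

print(sum_metric(SAMPLE, "http_requests_total"))
```

To run it against the live app, pass the body returned by the curl command above instead of `SAMPLE`.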
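Access URLs (host ports as mapped in docker-compose.yml):
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (log in with admin / changeme)
- Alertmanager: http://localhost:9093
- Node Exporter metrics: http://localhost:9100/metrics
- Sample app: http://localhost:8000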
---
Step 6: Common PromQL queries
# Requests per second
rate(http_requests_total[5m])
# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# CPU usage (%)
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage (%)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
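The 95th percentile query is an estimate, not an exact value: histogram_quantile finds the bucket that contains the requested rank and interpolates linearly inside it. The following Python sketch approximates that logic for classic histograms, assuming 0 < q <= 1 and positive bucket bounds. It works on raw cumulative counts, whereas the query above feeds in per-second rates, but the interpolation is the same. The bucket counts are invented for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower, prev_count = (0.0, 0) if idx == 0 else buckets[idx - 1]
    # Assume observations are spread evenly between lower and upper.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Hypothetical cumulative counts: 100 requests, 90 of them at or under 0.1 s.
buckets = [(0.05, 20), (0.1, 90), (0.25, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
print(f"estimated p95: {p95:.3f} s")
```

Here rank 95 falls halfway through the (0.1, 0.25] bucket, so the estimate lands midway between the two bounds; narrower buckets around the latencies you care about give tighter estimates.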
---
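Production hardening recommendations
1. Grafana security: change the default password, and enable HTTPS and OAuth authentication
2. Retention policy: adjust retention.time to match your storage capacity
3. Remote storage: use Thanos or Cortex for long-term metric storage
4. Service discovery: replace the static configs with Consul, Kubernetes, or EC2 service discovery
5. Recording rules: precompute expensive queries to improve dashboard performance
Your production-grade monitoring stack is now in place. From here you can add your own services as scrape targets and build custom dashboards for your specific needs.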