Building a Production-Grade Monitoring System with Prometheus and Grafana

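Monitoring is the cornerstone of reliable infrastructure. In this tutorial you'll build a complete observability stack with Docker Compose, using Prometheus to collect metrics and Grafana to visualize them.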

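What You'll Build

  • A Prometheus server that scrapes metrics from multiple targets
  • Grafana dashboards with live visualizations
  • Node Exporter for system metrics (CPU, memory, disk)
  • A sample Python application that exposes custom metrics
  • Alertmanager with Slack notifications

Prerequisites

  • Docker and Docker Compose installed
  • Basic familiarity with YAML configuration
  • A Slack webhook URL (optional, for alerts)

---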

Step 1: Project Structure

Create the project directories:

    mkdir monitoring-stack && cd monitoring-stack
    mkdir -p prometheus grafana/provisioning/datasources grafana/provisioning/dashboards alertmanager app

The final layout:

    monitoring-stack/
    ├── docker-compose.yml
    ├── prometheus/
    │   ├── prometheus.yml
    │   └── alert.rules.yml
    ├── grafana/
    │   └── provisioning/
    │       ├── datasources/
    │       │   └── prometheus.yml
    │       └── dashboards/
    │           ├── dashboard.yml
    │           └── node-exporter.json
    ├── alertmanager/
    │   └── alertmanager.yml
    └── app/
        ├── app.py
        ├── requirements.txt
        └── Dockerfile

---

Step 2: Docker Compose Configuration

Create docker-compose.yml:

    version: '3.8'

    services:
      prometheus:
        image: prom/prometheus:v2.51.0
        container_name: prometheus
        ports:
          - "9090:9090"
        volumes:
          - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
          - ./prometheus/alert.rules.yml:/etc/prometheus/alert.rules.yml
          - prometheus_data:/prometheus
        command:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus'
          - '--storage.tsdb.retention.time=30d'
          # Allows reloading the configuration via HTTP POST to /-/reload
          - '--web.enable-lifecycle'
        restart: unless-stopped
        networks:
          - monitoring

      grafana:
        image: grafana/grafana:10.3.1
        container_name: grafana
        ports:
          - "3000:3000"
        environment:
          - GF_SECURITY_ADMIN_USER=admin
          # Placeholder password: change it before exposing Grafana
          - GF_SECURITY_ADMIN_PASSWORD=changeme
          - GF_USERS_ALLOW_SIGN_UP=false
        volumes:
          - grafana_data:/var/lib/grafana
          - ./grafana/provisioning:/etc/grafana/provisioning
        restart: unless-stopped
        networks:
          - monitoring

      node-exporter:
        # Note: without host mounts and host networking, filesystem and
        # network metrics describe the container rather than the host.
        image: prom/node-exporter:v1.7.0
        container_name: node-exporter
        ports:
          - "9100:9100"
        restart: unless-stopped
        networks:
          - monitoring

      alertmanager:
        image: prom/alertmanager:v0.27.0
        container_name: alertmanager
        ports:
          - "9093:9093"
        volumes:
          - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
        restart: unless-stopped
        networks:
          - monitoring

      sample-app:
        build: ./app
        container_name: sample-app
        ports:
          - "8000:8000"
        restart: unless-stopped
        networks:
          - monitoring

    volumes:
      prometheus_data:
      grafana_data:

    networks:
      monitoring:
        driver: bridge

---

Step 3: Prometheus Configuration

Create prometheus/prometheus.yml:

    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - alertmanager:9093

    rule_files:
      - "alert.rules.yml"

    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']

      - job_name: 'node-exporter'
        static_configs:
          - targets: ['node-exporter:9100']

      - job_name: 'sample-app'
        static_configs:
          - targets: ['sample-app:8000']
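
Once the stack is running (Step 5), one way to confirm that Prometheus has picked up these three jobs is to ask its HTTP API for the active targets. The sketch below is an illustration rather than part of the original setup. It assumes Prometheus is reachable at http://localhost:9090, as published in docker-compose.yml, and the helper names summarize_targets and fetch_targets are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumption: Prometheus is published on localhost:9090 (see docker-compose.yml).
PROMETHEUS_URL = "http://localhost:9090"


def summarize_targets(payload):
    """Map each job name to a list of (instance, health) pairs."""
    summary = {}
    for target in payload["data"]["activeTargets"]:
        labels = target["labels"]
        summary.setdefault(labels["job"], []).append(
            (labels["instance"], target["health"])
        )
    return summary


def fetch_targets(base_url=PROMETHEUS_URL, timeout=5):
    """Fetch /api/v1/targets from a running Prometheus and summarize it."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return summarize_targets(json.load(resp))
```

Calling fetch_targets() against the running stack should list the prometheus, node-exporter, and sample-app jobs. A health value other than "up" usually points at a wrong host name or port in static_configs.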

Alert rules go in prometheus/alert.rules.yml:

    groups:
      - name: system_alerts
        rules:
          - alert: HighCPUUsage
            # rate() is used here instead of irate(): the Prometheus docs
            # recommend rate() for alerts, since brief dips can reset `for`.
            expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
              description: "CPU usage has been above 80% for more than 5 minutes (current: {{ $value }}%)"

          - alert: TargetDown
            expr: up == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Target {{ $labels.instance }} is down"
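
The for clause is what keeps a brief spike from notifying anyone. An alert whose expression becomes true first enters a pending state and only fires once the expression has stayed true for the whole for duration. If the expression turns false in between, the timer resets. The toy model below illustrates that behaviour for HighCPUUsage. It is a simplified sketch, not Prometheus's actual implementation: the function name alert_states and the sample values are invented for illustration, and it assumes one rule evaluation every 60 seconds.

```python
def alert_states(values, threshold, for_seconds, eval_interval):
    """Return the alert state after each rule evaluation.

    values: the expression's value at each evaluation (e.g. CPU usage in %).
    Simplified model of the `for` clause: inactive -> pending -> firing.
    """
    states = []
    active_since = None  # evaluation time at which the condition became true
    for step, value in enumerate(values):
        now = step * eval_interval
        if value > threshold:
            if active_since is None:
                active_since = now
            held_for = now - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            active_since = None  # condition cleared: the timer resets
            states.append("inactive")
    return states


# Hypothetical CPU usage samples, one per minute.
cpu = [70, 85, 90, 75, 85, 88, 91, 95, 97, 99]
states = alert_states(cpu, threshold=80, for_seconds=300, eval_interval=60)
for minute, state in enumerate(states):
    print(f"t={minute}m cpu={cpu[minute]}% -> {state}")
```

In this run the dip to 75% at minute 3 resets the timer, so the alert only fires at minute 9, five minutes after usage crossed the threshold again at minute 4.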

---

Step 4: A Python Application with Custom Metrics

Create app/app.py:

    import functools
    import random
    import time

    from flask import Flask, Response
    from prometheus_client import (
        Counter, Histogram, Gauge,
        generate_latest, CONTENT_TYPE_LATEST,
    )

    app = Flask(__name__)

    # Define the custom metrics
    REQUEST_COUNT = Counter(
        'http_requests_total',
        'Total number of HTTP requests',
        ['method', 'endpoint', 'status'],
    )

    REQUEST_DURATION = Histogram(
        'http_request_duration_seconds',
        'HTTP request processing time in seconds',
        ['method', 'endpoint'],
        buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
    )

    ACTIVE_REQUESTS = Gauge(
        'http_active_requests',
        'Number of HTTP requests currently in flight',
    )


    def track_request(method, endpoint):
        """Decorator that records request metrics for a view function."""
        def decorator(f):
            @functools.wraps(f)  # preserve the view's name for Flask routing
            def wrapper(*args, **kwargs):
                ACTIVE_REQUESTS.inc()
                start_time = time.time()
                try:
                    result = f(*args, **kwargs)
                    status = 200
                    return result
                except Exception:
                    status = 500
                    raise
                finally:
                    # Runs on success and on failure, so every request is counted
                    duration = time.time() - start_time
                    REQUEST_COUNT.labels(method, endpoint, status).inc()
                    REQUEST_DURATION.labels(method, endpoint).observe(duration)
                    ACTIVE_REQUESTS.dec()
            return wrapper
        return decorator


    @app.route('/')
    @track_request('GET', '/')
    def home():
        # Simulate 10 to 100 ms of work
        time.sleep(random.uniform(0.01, 0.1))
        return {'status': 'ok', 'message': 'Hello from the monitored app!'}


    @app.route('/metrics')
    def metrics():
        # Expose all registered metrics in the Prometheus text format
        return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)


    if __name__ == '__main__':
        app.run(host='0.0.0.0', port=8000)
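
The buckets list passed to REQUEST_DURATION decides how much latency detail is available later. A Prometheus histogram is cumulative: each observation increments every bucket whose upper bound (the le label) is greater than or equal to the observed value, plus the implicit +Inf bucket. The standalone sketch below mimics that counting in plain Python to show roughly what the _bucket series on /metrics will look like. bucket_counts is a hypothetical helper, not part of prometheus_client, and the sample durations are invented.

```python
import math

# Same upper bounds as REQUEST_DURATION in app.py; +Inf is added implicitly.
BUCKETS = [0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]


def bucket_counts(durations, bounds=BUCKETS):
    """Return cumulative (upper_bound, count) pairs, like the *_bucket series."""
    all_bounds = list(bounds) + [math.inf]
    return [
        (le, sum(1 for d in durations if d <= le))
        for le in all_bounds
    ]


# Hypothetical request durations in seconds.
samples = [0.012, 0.034, 0.048, 0.07, 0.09, 0.3]
for le, count in bucket_counts(samples):
    label = "+Inf" if math.isinf(le) else le
    print(f'le="{label}" {count}')
```

Because home() sleeps for 10 to 100 ms, most observations from the sample app should land between the 0.01 and 0.1 bounds, and the larger buckets will all carry nearly the same count. Buckets are most useful when their bounds sit close to the latencies you actually expect.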

---

Step 5: Start and Test

    # Start the whole stack
    docker compose up -d

    # Check that every service is running
    docker compose ps

    # Send some test traffic to the sample app
    for i in $(seq 1 50); do
      curl -s http://localhost:8000/ > /dev/null
    done

    # Inspect the metrics
    curl http://localhost:8000/metrics
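
The /metrics output is plain text in the Prometheus exposition format: comment lines start with #, and each sample is a metric name, optional {labels}, and a value. To check the counters from a script rather than by eye, a rough parser is enough for simple cases. The sketch below is an illustration under stated assumptions: it ignores timestamps and escaped characters in label values, parse_samples and total are hypothetical names, and the example text is handwritten rather than captured output.

```python
import re

# name, optional {labels}, value; timestamps and escaped quotes are ignored.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Yield (metric_name, labels_dict, float_value) for each sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match:
            name, raw_labels, value = match.groups()
            yield name, dict(LABEL_RE.findall(raw_labels or "")), float(value)


def total(text, metric):
    """Sum a metric across all of its label combinations."""
    return sum(v for name, _, v in parse_samples(text) if name == metric)


# Handwritten example in the exposition format (not real output).
EXAMPLE = '''\
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/",status="200"} 50.0
http_requests_total{method="GET",endpoint="/",status="500"} 2.0
http_active_requests 0.0
'''

print(total(EXAMPLE, "http_requests_total"))
```

To run it against the live app, fetch the text with urllib.request.urlopen("http://localhost:8000/metrics").read().decode() and pass it to total(). For anything beyond a quick check, prometheus_client ships its own parser, prometheus_client.parser.text_string_to_metric_families.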

Access URLs:

  • Prometheus: http://localhost:9090
  • Grafana: http://localhost:3000 (admin/changeme)
  • Alertmanager: http://localhost:9093

---

Step 6: Useful PromQL Queries

    # Per-second request rate
    rate(http_requests_total[5m])

    # 95th-percentile latency
    histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

    # CPU usage (%)
    100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

    # Memory usage (%)
    (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
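
histogram_quantile does not see individual request durations. It estimates the quantile from the cumulative bucket counts, assuming observations are spread evenly inside the bucket where the target rank falls. The sketch below reproduces that linear interpolation in simplified form. It approximates the documented behaviour and is not Prometheus's code; estimate_quantile and the bucket counts are hypothetical.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    buckets must be sorted by upper bound and end with (math.inf, total).
    The lowest bucket is assumed to start at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical cumulative counts for 100 requests.
example = [
    (0.01, 0), (0.05, 40), (0.1, 90), (0.25, 100),
    (0.5, 100), (1.0, 100), (2.5, 100), (5.0, 100), (math.inf, 100),
]
print(estimate_quantile(0.95, example))
print(estimate_quantile(0.5, example))
```

The estimate can never be more precise than the bucket boundaries, which is one reason the bucket choice in app.py matters.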

---

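Production Hardening Tips

1. Grafana security: change the default password, and enable HTTPS and OAuth authentication.
2. Retention: tune retention.time (the --storage.tsdb.retention.time flag) to match your storage capacity.
3. Remote storage: use Thanos or Cortex for long-term metric storage.
4. Service discovery: replace the static configs with Consul, Kubernetes, or EC2 service discovery.
5. Recording rules: precompute expensive queries to speed up dashboards.

Your production-grade monitoring stack is now complete! You can add your own services as scrape targets and build custom dashboards tailored to your needs.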