Building a Production-Grade Monitoring System with Prometheus and Grafana: A Hands-On Tutorial

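Monitoring is the cornerstone of reliable infrastructure. This tutorial builds a complete observability stack with Docker Compose, using Prometheus to collect metrics and Grafana to visualize them.

What You Will Build

A Prometheus server that collects metrics from multiple targets
Grafana dashboards with real-time visualization
Node Exporter for system metrics (CPU, memory, disk)
A sample Python application with custom metrics
Alertmanager with Slack notifications

Prerequisites

Docker and Docker Compose installed
Basic familiarity with YAML configuration
A Slack Webhook URL (optional, for alerts)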

---

Step 1: Project Structure

Create the project directories:

mkdir monitoring-stack && cd monitoring-stack
mkdir -p prometheus grafana/provisioning/datasources grafana/provisioning/dashboards alertmanager app

Final structure:

monitoring-stack/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   └── alert.rules.yml
├── grafana/
│   └── provisioning/
│       ├── datasources/
│       │   └── prometheus.yml
│       └── dashboards/
│           ├── dashboard.yml
│           └── node-exporter.json
├── alertmanager/
│   └── alertmanager.yml
└── app/
    ├── app.py
    ├── requirements.txt
    └── Dockerfile

---

Step 2: Docker Compose Configuration

Create docker-compose.yml:

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alert.rules.yml:/etc/prometheus/alert.rules.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: unless-stopped
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:10.3.1
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    restart: unless-stopped
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    ports:
      - "9100:9100"
    restart: unless-stopped
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped
    networks:
      - monitoring

  sample-app:
    build: ./app
    container_name: sample-app
    ports:
      - "8000:8000"
    restart: unless-stopped
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:

networks:
  monitoring:
    driver: bridge

---

Step 3: Prometheus Configuration

Create prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "alert.rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'sample-app'
    static_configs:
      - targets: ['sample-app:8000']

Define the alert rules in prometheus/alert.rules.yml:

groups:
  - name: system_alerts
    rules:
      - alert: HighCPUUsage
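        # rate() rather than irate(): irate() uses only the last two samples and is too volatile for alerting
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80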
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage has been above 80% for more than 5 minutes (current: {{ $value }}%)"

      - alert: TargetDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.instance }} is down"
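
To make the HighCPUUsage expression concrete: the per-second rate of the idle counter is the fraction of time a core spent idle, so averaging it across cores and subtracting from 100 yields a usage percentage. The following is a simplified sketch with hypothetical sample values; it ignores counter resets and the extrapolation that Prometheus applies, and the function names are illustrative only.

```python
def per_second_rate(first, last):
    """Per-second increase between two (timestamp, counter_value) samples."""
    (t0, v0), (t1, v1) = first, last
    return (v1 - v0) / (t1 - t0)


def cpu_usage_percent(idle_samples_by_core):
    """100 minus the average idle fraction across cores, as a percentage."""
    idle_rates = [
        per_second_rate(first, last)
        for first, last in idle_samples_by_core.values()
    ]
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Hypothetical idle-counter samples taken 15 s apart (one scrape interval).
samples = {
    "cpu0": ((0.0, 1000.0), (15.0, 1003.0)),  # idle for 3 s of 15 s
    "cpu1": ((0.0, 2000.0), (15.0, 2001.5)),  # idle for 1.5 s of 15 s
}

usage = cpu_usage_percent(samples)
print(f"CPU usage: {usage:.1f}%  alert condition met: {usage > 80}")
```

With these made-up numbers the average idle fraction is 15%, so the computed usage is above the 80% threshold; the alert would still fire only after the condition holds for the full 5 minutes set by `for`.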

---

Step 4: Python Application with Custom Metrics

Create app/app.py:

import time
import random
from functools import wraps

from flask import Flask, Response
from prometheus_client import (
    Counter, Histogram, Gauge,
    generate_latest, CONTENT_TYPE_LATEST
)

app = Flask(__name__)

# Define custom metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total number of HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'HTTP request processing time in seconds',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)

ACTIVE_REQUESTS = Gauge(
    'http_active_requests',
    'Number of HTTP requests currently in flight'
)


def track_request(method, endpoint):
    """Decorator that records request count, duration, and in-flight requests."""
    def decorator(f):
        @wraps(f)  # keep the view's name so Flask registers a distinct endpoint
        def wrapper(*args, **kwargs):
            ACTIVE_REQUESTS.inc()
            start_time = time.time()
            try:
                result = f(*args, **kwargs)
                status = 200
                return result
            except Exception:
                status = 500
                raise
            finally:
                # Runs on both success and failure, so every request is counted
                duration = time.time() - start_time
                REQUEST_COUNT.labels(method, endpoint, status).inc()
                REQUEST_DURATION.labels(method, endpoint).observe(duration)
                ACTIVE_REQUESTS.dec()
        return wrapper
    return decorator


@app.route('/')
@track_request('GET', '/')
def home():
    # Simulate 10-100 ms of work so the histogram has something to show
    time.sleep(random.uniform(0.01, 0.1))
    return {'status': 'ok', 'message': 'Hello from the monitored app!'}


@app.route('/metrics')
def metrics():
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)
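
The histogram declared in app.py exposes cumulative buckets: each `le` bucket counts every observation less than or equal to its upper bound, and `+Inf` counts them all. The sketch below reproduces that bookkeeping with the standard library only, using the same bucket bounds. It illustrates the idea and is not how prometheus_client is implemented internally; the durations are hypothetical.

```python
import bisect
import math

# Same upper bounds as REQUEST_DURATION in app.py
BUCKETS = [0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]


def cumulative_bucket_counts(durations, bounds=BUCKETS):
    """Return {upper_bound: number of observations <= upper_bound}, including +Inf."""
    per_bucket = [0] * (len(bounds) + 1)  # the last slot is the +Inf bucket
    for d in durations:
        # bisect_left returns the index of the first bound >= d
        per_bucket[bisect.bisect_left(bounds, d)] += 1

    cumulative, running = {}, 0
    for bound, count in zip(bounds + [math.inf], per_bucket):
        running += count
        cumulative[bound] = running
    return cumulative


# Hypothetical request durations in seconds
counts = cumulative_bucket_counts([0.02, 0.04, 0.09, 0.3, 7.0])
for bound, count in counts.items():
    label = "+Inf" if math.isinf(bound) else bound
    print(f'le="{label}" {count}')
```

Because the counts are cumulative, a request that takes 7 seconds lands only in `+Inf`, which is a hint that the largest finite bucket may be too small for the workload.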

---

Step 5: Start and Test

# Start the whole stack
docker compose up -d

# Verify that all services are running
docker compose ps

# Send test traffic to the sample app
for i in $(seq 1 50); do
  curl -s http://localhost:8000/ > /dev/null
done

# View the metrics
curl http://localhost:8000/metrics
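
The output of the last curl command is in the Prometheus text exposition format: `# HELP` and `# TYPE` comment lines followed by one `series value` line per sample. As a rough sketch of how that output can be read programmatically, the hypothetical helper below parses it into a dictionary. It assumes no trailing timestamps on sample lines, and the sample text is illustrative rather than captured from a real run.

```python
def parse_metrics(text):
    """Parse exposition-format text into {series_with_labels: float_value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        # The value is the last space-separated field; label values may contain spaces
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Illustrative excerpt of what /metrics might return after 50 requests
SAMPLE_OUTPUT = """\
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/",status="200"} 50.0
# HELP http_active_requests Number of HTTP requests currently in flight
# TYPE http_active_requests gauge
http_active_requests 0.0
"""

parsed = parse_metrics(SAMPLE_OUTPUT)
for series, value in parsed.items():
    print(series, value)
```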

Access URLs:

Prometheus: http://localhost:9090
Grafana: http://localhost:3000 (admin/changeme)
Alertmanager: http://localhost:9093
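
Besides opening the Prometheus UI, one way to confirm that all three scrape jobs are healthy is to query the Prometheus HTTP API at `/api/v1/targets`. The sketch below assumes the `9090` port mapping from docker-compose.yml; the function names are illustrative, and only `summarize_targets` runs without a live stack.

```python
import json
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: port mapping from docker-compose.yml


def summarize_targets(payload):
    """Map each job name to a list of (instance, health) pairs."""
    summary = {}
    for target in payload["data"]["activeTargets"]:
        labels = target["labels"]
        summary.setdefault(labels["job"], []).append(
            (labels["instance"], target["health"])
        )
    return summary


def fetch_targets(base_url=PROMETHEUS_URL):
    """Fetch the targets payload from a running Prometheus server."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def unhealthy_targets(summary):
    """Return (job, instance) pairs whose health is anything other than 'up'."""
    return [
        (job, instance)
        for job, targets in summary.items()
        for instance, health in targets
        if health != "up"
    ]
```

With the stack running, `unhealthy_targets(summarize_targets(fetch_targets()))` should return an empty list once every target has been scraped successfully at least once.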

---

Step 6: Useful PromQL Queries

# Request rate per second
rate(http_requests_total[5m])

# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# CPU usage (%)
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
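
The 95th percentile query above does not return an exact measurement. `histogram_quantile` finds the bucket that contains the target rank and interpolates linearly inside it, so the result depends on the bucket layout. The following is a simplified sketch of that estimate under stated assumptions: it takes sorted `(upper_bound, cumulative_count)` pairs ending with `+Inf`, treats the lowest bucket as starting at 0, and uses hypothetical counts.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from sorted (upper_bound, cumulative_count) pairs."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    if total == 0:
        return math.nan

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Beyond the last finite bucket: report its upper bound
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts for a subset of the app's buckets
example = [(0.05, 20), (0.1, 80), (0.25, 90), (0.5, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, example)
print(f"estimated p95: {p95:.3f} s")
```

In this example the rank falls halfway through the 0.25 to 0.5 bucket, so the estimate sits in the middle of that range regardless of where the real observations were. Narrower buckets around the latencies of interest give tighter estimates.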

---

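Production Hardening Recommendations

1. Grafana security: change the default password, and enable HTTPS and OAuth authentication.

2. Retention policy: adjust the retention time (the --storage.tsdb.retention.time flag) to match your storage capacity.

3. Remote storage: use Thanos or Cortex for long-term metric storage.

4. Service discovery: replace the static configs with Consul, Kubernetes, or EC2 service discovery.

5. Recording rules: precompute expensive queries to improve dashboard performance.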
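
For the retention recommendation, the Prometheus storage documentation gives a rough sizing rule: needed disk space is approximately the retention time in seconds, times the ingested samples per second, times the bytes per sample (typically around 1 to 2 bytes after compression). The sketch below applies that rule; the series count is a hypothetical input, and the result is a planning estimate rather than a guarantee.

```python
SECONDS_PER_DAY = 86_400


def estimated_disk_bytes(retention_days, series_count, scrape_interval_s,
                         bytes_per_sample=2.0):
    """Rough TSDB size: retention seconds * samples per second * bytes per sample."""
    samples_per_second = series_count / scrape_interval_s
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


# Hypothetical workload: 10,000 active series, with the 15 s scrape interval
# and 30-day retention used in this tutorial
estimate = estimated_disk_bytes(retention_days=30, series_count=10_000,
                                scrape_interval_s=15)
print(f"estimated disk usage: {estimate / 1e9:.2f} GB")
```

Running the same calculation with your own series count shows quickly whether a 30-day retention fits on the volume backing `prometheus_data`.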

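Your production-grade monitoring stack is now in place. From here, add your own services as scrape targets and build custom dashboards that fit your specific needs.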