Tags: Prometheus, Grafana, monitoring, observability, DevOps

Monitoring with Prometheus + Grafana: Basic Setup

Monitoring with Prometheus + Grafana: scraping metrics, PromQL queries, dashboards, and alerts. A Docker Compose stack that runs locally, plus a ready-to-use Node.js metrics exporter.

Published · 8 min read

Prometheus + Grafana is the de facto standard combination for cloud-native observability. This article sets up a local Docker Compose stack, scrapes metrics from a Node.js app, and builds a Grafana dashboard, end to end.

Architecture

App / Exporter (exposes /metrics)
        ↑ scraped every 15s
Prometheus (stores time series, evaluates rules)
        ↓ pushes alerts
Alertmanager (dedupes, routes → Slack/PagerDuty)

Grafana ← PromQL query ← Prometheus

Docker Compose setup

# compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
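    ports: ["9090:9090"]
    # On Linux, host.docker.internal is not defined by default; this maps it
    # to the host so the 'app' scrape target resolves (Docker 20.10+).
    extra_hosts: ["host.docker.internal:host-gateway"]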
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prom-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:latest
    ports: ["3001:3000"]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
    volumes:
      - graf-data:/var/lib/grafana

  node-exporter:
    image: prom/node-exporter:latest
    ports: ["9100:9100"]

volumes:
  prom-data:
  graf-data:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'app'
    static_configs:
      - targets: ['host.docker.internal:3000']
    metrics_path: '/metrics'

Exposing /metrics from a Node.js app

npm install express prom-client

import express from 'express'
import client from 'prom-client'

const app = express()

// Default metric: process CPU, memory, event loop
client.collectDefaultMetrics()

// Custom metric
const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
})

const httpTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
})

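// Time every request and record it once the response has been sent
app.use((req, res, next) => {
  const end = httpDuration.startTimer()
  res.on('finish', () => {
    // Label with the route pattern (e.g. /users/:id), not the raw URL,
    // so the number of time series stays small
    const route = req.route?.path || 'unknown'
    end({ method: req.method, route, status: res.statusCode })
    httpTotal.inc({ method: req.method, route, status: res.statusCode })
  })
  next()
})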

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType)
  res.send(await client.register.metrics())
})

app.listen(3000)

Running curl http://localhost:3000/metrics returns output like this (abridged):

# HELP http_request_duration_seconds HTTP request duration in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05",method="GET",route="/users",status="200"} 142
http_request_duration_seconds_bucket{le="0.1",method="GET",route="/users",status="200"} 198
http_request_duration_seconds_count{method="GET",route="/users",status="200"} 200
http_request_duration_seconds_sum{method="GET",route="/users",status="200"} 12.4
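
To make the sample output concrete, here is a minimal sketch that reads those lines and derives two numbers by hand: the average latency (`_sum / _count`) and the share of requests that finished within 100 ms (the `le="0.1"` bucket divided by `_count`). Buckets are cumulative, so `le="0.1"` already includes everything counted under `le="0.05"`. The `parseLine` helper is hypothetical and handles only this simple line shape, not the full exposition format.

```javascript
// Sample lines copied from the /metrics output above.
const sample = `
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05",method="GET",route="/users",status="200"} 142
http_request_duration_seconds_bucket{le="0.1",method="GET",route="/users",status="200"} 198
http_request_duration_seconds_count{method="GET",route="/users",status="200"} 200
http_request_duration_seconds_sum{method="GET",route="/users",status="200"} 12.4
`

// Hypothetical helper: parses `name{label="value",...} number` lines only.
function parseLine(line) {
  const m = line.match(/^([a-zA-Z_:][a-zA-Z0-9_:]*)\{(.*)\}\s+(\S+)$/)
  if (!m) return null // comments, blank lines, unsupported shapes
  const labels = {}
  for (const [, key, value] of m[2].matchAll(/([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"/g)) {
    labels[key] = value
  }
  return { name: m[1], labels, value: Number(m[3]) }
}

const series = sample.split('\n').map(parseLine).filter(Boolean)

// Look up a series value by metric name and, optionally, its `le` label.
const find = (name, le) =>
  series.find(s => s.name === name && (le === undefined || s.labels.le === le)).value

const count = find('http_request_duration_seconds_count')
const sum = find('http_request_duration_seconds_sum')
const under100ms = find('http_request_duration_seconds_bucket', '0.1')

const avgSeconds = sum / count             // 12.4 / 200
const shareUnder100ms = under100ms / count // 198 / 200

console.log(avgSeconds.toFixed(3), shareUnder100ms.toFixed(2)) // prints "0.062 0.99"
```

In other words, the sample describes an endpoint averaging about 62 ms, with 99% of requests at or under 100 ms.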

PromQL: the query language

# Per-second request rate, averaged over the last 5 minutes
rate(http_requests_total[5m])

# Request rate by route
sum(rate(http_requests_total[5m])) by (route)

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))

# p95 latency
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))

# Memory usage (fraction of memory in use)
1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# CPU usage
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))
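
The p95 query is the least obvious of these, so here is a simplified sketch of the linear interpolation that histogram_quantile() performs on cumulative buckets. It is a model for intuition, not Prometheus's actual implementation: in PromQL the inputs are per-second bucket rates rather than raw counts, but the interpolation works the same way. The function name `histogramQuantile` and the bucket counts beyond `le="0.1"` are assumptions made for illustration.

```javascript
// Simplified sketch of histogram_quantile() on classic (cumulative) buckets.
// buckets: [{ le, count }] sorted by le ascending; the last entry is le = Infinity.
function histogramQuantile(q, buckets) {
  if (buckets.length < 2) return NaN
  const total = buckets[buckets.length - 1].count
  if (total === 0) return NaN
  const rank = q * total
  // First bucket whose cumulative count reaches the target rank
  const i = buckets.findIndex(b => b.count >= rank)
  // Rank falls in the +Inf bucket: report the highest finite upper bound
  if (buckets[i].le === Infinity) return buckets[buckets.length - 2].le
  const lowerBound = i === 0 ? 0 : buckets[i - 1].le
  const countBelow = i === 0 ? 0 : buckets[i - 1].count
  const countInBucket = buckets[i].count - countBelow
  if (countInBucket === 0) return lowerBound
  // Assume observations are spread evenly inside the bucket
  return lowerBound + (buckets[i].le - lowerBound) * ((rank - countBelow) / countInBucket)
}

// The first two counts come from the sample /metrics output; the rest are assumed.
const buckets = [
  { le: 0.05, count: 142 },
  { le: 0.1, count: 198 },
  { le: 0.25, count: 200 },     // assumed
  { le: Infinity, count: 200 }, // assumed
]

const p95 = histogramQuantile(0.95, buckets)
console.log(p95.toFixed(3)) // prints "0.093"
```

Because the result is interpolated, its accuracy depends on the bucket boundaries: here the p95 can only be placed somewhere between 0.05 s and 0.1 s, so buckets should be dense around the latencies you care about.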

Grafana dashboard

  1. Log in to Grafana at http://localhost:3001 (admin/admin)
  2. Connections → Data sources (Configuration → Data Sources in older Grafana versions) → Add → Prometheus → URL http://prometheus:9090
  3. Dashboards → Import → ID 1860 (Node Exporter Full), or build your own
  4. Create a new panel: Time series visualization, query rate(http_requests_total[5m])

Cardinality: the biggest pitfall

// ❌ Wrong: user_id has 100k values → 100k time series
httpTotal.inc({ user_id: req.user.id })

// ❌ Wrong: full URL with params → unbounded series
httpTotal.inc({ url: req.url })  // /users/1, /users/2, /users/3...

// ✓ Right: route pattern, few distinct values
httpTotal.inc({ route: req.route.path })  // /users/:id

Rule of thumb: keep the total cardinality of a single metric under 10k series. Method × route × status usually stays under 1k, which is safe.
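
The arithmetic behind that rule can be sketched quickly. The worst-case series count is the product of the number of distinct values per label, and a histogram multiplies it again, since each label combination exposes one series per bucket (including +Inf) plus `_sum` and `_count`. The label counts below are hypothetical numbers for a small API, and `estimateSeries` is an illustrative helper, not part of prom-client.

```javascript
// Worst-case series count: the product of distinct values per label.
function estimateSeries(labelValueCounts) {
  return Object.values(labelValueCounts).reduce((product, n) => product * n, 1)
}

// Hypothetical numbers for a small API.
const labels = { method: 4, route: 20, status: 8 }

// http_requests_total: one series per label combination
const counterSeries = estimateSeries(labels)

// http_request_duration_seconds: per label combination, the 8 configured
// buckets plus the +Inf bucket, plus the _sum and _count series
const configuredBuckets = 8
const histogramSeries = counterSeries * (configuredBuckets + 1 + 2)

// Swapping route for a user_id label with 100k distinct values
const withUserId = estimateSeries({ method: 4, user_id: 100_000, status: 8 })

console.log(counterSeries, histogramSeries, withUserId) // prints "640 7040 3200000"
```

Under these assumptions the counter is comfortably small, the histogram is already close to the 10k guideline, and a single user_id label pushes the counter into the millions.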

Alert rule

# Prometheus alert rules (load this file via rule_files in prometheus.yml)
groups:
- name: app
  rules:
  - alert: HighErrorRate
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.05
    for: 5m
    labels: { severity: critical }
    annotations:
      summary: "Error rate > 5% for 5 minutes"

  - alert: HighLatency
    expr: |
      histogram_quantile(0.95,
        sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
    for: 5m
    labels: { severity: warning }
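
The `for: 5m` clause means the expression must stay true for five minutes before the alert fires. Until then the alert is pending, and a single evaluation below the threshold resets the timer. A minimal model of that state machine is sketched below, with hypothetical sample values and a one-minute evaluation step chosen for brevity (the config above evaluates every 15s); `alertStates` is an illustrative name, not a Prometheus API.

```javascript
// Simplified model of an alert rule's `for:` clause.
// samples: [{ t: secondsSinceStart, value }], one entry per rule evaluation.
function alertStates(samples, threshold, forSeconds) {
  let activeSince = null // time the condition first became true
  return samples.map(({ t, value }) => {
    if (value > threshold) {
      if (activeSince === null) activeSince = t
      return t - activeSince >= forSeconds ? 'firing' : 'pending'
    }
    activeSince = null // condition cleared: the timer starts over
    return 'inactive'
  })
}

// Hypothetical error ratios, one evaluation per minute.
const ratios = [0.01, 0.08, 0.09, 0.02, 0.07, 0.07, 0.07, 0.07, 0.07, 0.07]
const samples = ratios.map((value, i) => ({ t: i * 60, value }))

// HighErrorRate: threshold 0.05, for: 5m
const states = alertStates(samples, 0.05, 300)
states.forEach((state, i) => console.log(`${i}m`, state))
```

In this run the two-minute spike at the start never leaves pending, and the alert only fires at minute 9, five minutes after the ratio crossed the threshold for good at minute 4. This is what keeps short blips from paging anyone.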

Conclusion

Prometheus + Grafana covers most metric needs (roughly 80%). The initial setup takes about half a day, and a dashboard with the key signals (request rate, error rate, p95 latency, memory) is the baseline for any production app. See Observability for how metrics combine with logs and traces.
