Monitoring with Prometheus + Grafana is the de facto standard combo for cloud-native observability. This post sets up a local Docker Compose stack, scrapes metrics from a Node.js app, and builds a Grafana dashboard, from A to Z.
Architecture
App / Exporter (exposes /metrics)
↑ scraped every 15s
Prometheus (stores time series, evaluates rules)
↓ pushes alerts
Alertmanager (dedupes, routes → Slack/PagerDuty)
Grafana ← PromQL queries ← Prometheus
Setup Docker Compose
# compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prom-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
    # On Linux, host.docker.internal is not defined by default;
    # this maps it to the host so Prometheus can scrape the app on port 3000.
    extra_hosts:
      - "host.docker.internal:host-gateway"
  grafana:
    image: grafana/grafana:latest
    ports: ["3001:3000"]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin  # local use only
    volumes:
      - graf-data:/var/lib/grafana
  node-exporter:
    image: prom/node-exporter:latest
    ports: ["9100:9100"]
volumes:
  prom-data:
  graf-data:
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'app'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['host.docker.internal:3000']
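Once the stack is up, it helps to confirm that all three jobs are actually being scraped. The sketch below assumes Node.js 18+ (for the built-in `fetch`) and uses the Prometheus HTTP API endpoint `/api/v1/targets`; the helper names `summarizeTargets` and `checkTargets` are hypothetical, not part of any library.

```javascript
// Sketch: list scrape targets and their health via the Prometheus HTTP API.
// Assumes Prometheus is reachable at http://localhost:9090, as in compose.yml.

// Pure helper: reduce the /api/v1/targets response to job, instance and health.
function summarizeTargets(body) {
  const targets = body?.data?.activeTargets ?? []
  return targets.map((t) => ({
    job: t.labels?.job ?? 'unknown',
    instance: t.labels?.instance ?? 'unknown',
    health: t.health, // "up", "down" or "unknown"
    lastError: t.lastError || null,
  }))
}

// Hypothetical wrapper that performs the HTTP call (Node.js 18+ global fetch).
async function checkTargets(baseUrl = 'http://localhost:9090') {
  const res = await fetch(`${baseUrl}/api/v1/targets`)
  if (!res.ok) throw new Error(`Prometheus returned HTTP ${res.status}`)
  return summarizeTargets(await res.json())
}

// Usage: checkTargets().then(console.table)
```

If the `app` job shows `down` with a connection error, the usual suspects are the app not running on port 3000 or `host.docker.internal` not resolving inside the container.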
Exposing /metrics from a Node.js app
npm install express prom-client
// ES module syntax: save as .mjs or set "type": "module" in package.json
import express from 'express'
import client from 'prom-client'

const app = express()

// Default metrics: process CPU, memory, event loop
client.collectDefaultMetrics()

// Custom metrics
const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
})
const httpTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
})

// Time every request and record it once the response has been sent
app.use((req, res, next) => {
  const end = httpDuration.startTimer()
  res.on('finish', () => {
    // Use the route pattern (/users/:id), not the raw URL, to keep cardinality low
    const route = req.route?.path || 'unknown'
    end({ method: req.method, route, status: res.statusCode })
    httpTotal.inc({ method: req.method, route, status: res.statusCode })
  })
  next()
})

// Endpoint scraped by Prometheus
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType)
  res.send(await client.register.metrics())
})

app.listen(3000)
Running curl http://localhost:3000/metrics returns output like:
# HELP http_request_duration_seconds HTTP request duration in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05",method="GET",route="/users",status="200"} 142
http_request_duration_seconds_bucket{le="0.1",method="GET",route="/users",status="200"} 198
http_request_duration_seconds_count{method="GET",route="/users",status="200"} 200
http_request_duration_seconds_sum{method="GET",route="/users",status="200"} 12.4
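The histogram lines above are easier to read once you remember that bucket counts are cumulative: `le="0.1"` counts every request that took 0.1s or less. As a minimal sketch, the snippet below copies the numbers from that sample output into a plain object and derives two quick statistics; the object shape and the function names `meanSeconds` and `fractionWithin` are made up for illustration.

```javascript
// Sketch: derive simple statistics from the sample histogram output above.
const sample = {
  buckets: [
    { le: 0.05, count: 142 }, // requests that took <= 0.05s
    { le: 0.1, count: 198 }, // requests that took <= 0.1s (includes the 142 above)
  ],
  count: 200, // _count: total number of observations
  sum: 12.4, // _sum: total observed time in seconds
}

// Mean latency in seconds: total observed time divided by number of observations.
function meanSeconds(h) {
  return h.count === 0 ? NaN : h.sum / h.count
}

// Fraction of requests that finished within `le` seconds (le must be a bucket bound).
function fractionWithin(h, le) {
  const bucket = h.buckets.find((b) => b.le === le)
  if (!bucket) throw new Error(`no bucket with le=${le}`)
  return bucket.count / h.count
}

console.log(meanSeconds(sample)) // about 0.062 s
console.log(fractionWithin(sample, 0.1)) // 0.99
```

So in this sample, 99% of GET /users requests completed within 100 ms, with a mean of roughly 62 ms.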
PromQL: the query language
# Per-second request rate over the last 5 minutes
rate(http_requests_total[5m])
# Request rate by route
sum(rate(http_requests_total[5m])) by (route)
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
# p95 latency
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))
# Memory usage (fraction of memory in use)
1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
# CPU usage (fraction of CPU time not idle, averaged over all cores)
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))
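To build some intuition for what `rate()` and `histogram_quantile()` compute, here is a simplified JavaScript sketch of both. It is not Prometheus's implementation: `simpleRate` skips the extrapolation Prometheus applies at the window edges, and `simpleQuantile` only handles classic cumulative buckets with 0 < q ≤ 1. Both function names and input shapes are assumptions for illustration.

```javascript
// Simplified rate(): per-second increase of a counter over a window.
// samples: [{ t: seconds, v: counterValue }, ...] sorted by time.
function simpleRate(samples) {
  if (samples.length < 2) return NaN
  let increase = 0
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v
    // A drop means the counter was reset (e.g. process restart): count from zero.
    increase += delta >= 0 ? delta : samples[i].v
  }
  return increase / (samples[samples.length - 1].t - samples[0].t)
}

// Simplified histogram_quantile() using linear interpolation inside a bucket.
// buckets: [{ le, count }, ...] sorted by le, cumulative, last one le = Infinity.
function simpleQuantile(q, buckets) {
  if (buckets.length < 2) return NaN
  const total = buckets[buckets.length - 1].count
  if (total === 0) return NaN
  const rank = q * total
  const i = buckets.findIndex((b) => b.count >= rank)
  // Quantile falls in the +Inf bucket: report the highest finite upper bound.
  if (buckets[i].le === Infinity) return buckets[i - 1].le
  const lower = i === 0 ? 0 : buckets[i - 1].le
  const below = i === 0 ? 0 : buckets[i - 1].count
  const inBucket = buckets[i].count - below
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket)
}

// Example: a counter that restarts between the second and third sample.
console.log(simpleRate([{ t: 0, v: 100 }, { t: 30, v: 120 }, { t: 60, v: 10 }])) // 0.5
```

The interpolation step is why bucket boundaries matter: a quantile can never be more precise than the width of the bucket it lands in, so pick buckets around the latencies you actually care about.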
Grafana dashboard
- Log in to Grafana at http://localhost:3001 (admin/admin)
- Configuration → Data Sources → Add → Prometheus → URL http://prometheus:9090 (in recent Grafana versions the menu is Connections → Data sources)
- Dashboards → Import → ID 1860 (Node Exporter Full), or build your own
- Create a new panel: Time series visualization, query rate(http_requests_total[5m])
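The data source step can also be scripted instead of clicked through. The sketch below assumes Node.js 18+ and Grafana's HTTP API endpoint `POST /api/datasources` with basic auth, using the local admin/admin credentials from compose.yml; the function names `prometheusDataSource` and `createDataSource` are hypothetical.

```javascript
// Sketch: register Prometheus as a Grafana data source over the HTTP API.

// Pure helper: build the request payload for a Prometheus data source.
function prometheusDataSource(url = 'http://prometheus:9090') {
  return {
    name: 'Prometheus',
    type: 'prometheus',
    access: 'proxy', // Grafana's backend queries Prometheus, not the browser
    url, // container-to-container address inside the Compose network
    isDefault: true,
  }
}

// Hypothetical wrapper that sends the payload (Node.js 18+ global fetch).
async function createDataSource(
  grafanaUrl = 'http://localhost:3001',
  user = 'admin',
  password = 'admin',
) {
  const auth = Buffer.from(`${user}:${password}`).toString('base64')
  const res = await fetch(`${grafanaUrl}/api/datasources`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Basic ${auth}`,
    },
    body: JSON.stringify(prometheusDataSource()),
  })
  if (!res.ok) throw new Error(`Grafana returned HTTP ${res.status}`)
  return res.json()
}

// Usage: createDataSource().then(console.log)
```

Note the URL is `http://prometheus:9090`, not `localhost`: Grafana resolves it from inside its own container, where the Compose service name is the hostname.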
Cardinality: the biggest pitfall
// ❌ Wrong: user_id has 100k values → 100k time series
httpTotal.inc({ user_id: req.user.id })
// ❌ Wrong: full URL with params → unbounded series
httpTotal.inc({ url: req.url }) // /users/1, /users/2, /users/3...
// ✓ Right: route pattern, few distinct values
httpTotal.inc({ route: req.route.path }) // /users/:id
Rule of thumb: keep the total cardinality of a single metric under 10k series. Method × route × status usually stays under 1k, which is safe.
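A quick back-of-the-envelope check makes the rule concrete. The sketch below multiplies the number of distinct values per label to get an upper bound on series count; for a histogram, each label combination produces one `_bucket` series per configured bucket plus `le="+Inf"`, `_sum` and `_count`. The function name `estimateSeries` and the example label counts are assumptions for illustration.

```javascript
// Sketch: upper bound on the number of time series one metric can produce.
// labelValues: distinct values per label, e.g. { method: 5, route: 20 }.
// histogramBuckets: number of configured buckets (0 for a counter or gauge).
function estimateSeries(labelValues, histogramBuckets = 0) {
  const combos = Object.values(labelValues).reduce((acc, n) => acc * n, 1)
  // Counter or gauge: one series per label combination.
  if (histogramBuckets === 0) return combos
  // Histogram: configured buckets + le="+Inf" + _sum + _count per combination.
  return combos * (histogramBuckets + 3)
}

// Hypothetical app: 5 methods, 20 routes, 8 status codes.
const labels = { method: 5, route: 20, status: 8 }
console.log(estimateSeries(labels)) // 800 for http_requests_total
console.log(estimateSeries(labels, 8)) // 8800 for the 8-bucket histogram
console.log(estimateSeries({ user_id: 100000 })) // 100000
```

Real numbers are usually lower because not every combination occurs, but the histogram case shows how quickly buckets eat into a 10k budget.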
Alert rule
# Prometheus alerting rules (a separate file that prometheus.yml must list under rule_files)
groups:
  - name: app
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Error rate > 5% for 5 minutes"
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 5m
        labels: { severity: warning }
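The `for: 5m` clause is what keeps a short spike from paging anyone: the expression has to stay true on every evaluation for the whole duration before the alert fires. Here is a simplified sketch of that inactive → pending → firing lifecycle; the function name `alertStates` and the input shape are assumptions, and real Prometheus tracks this per label set.

```javascript
// Sketch: how a `for` duration gates an alert.
// evaluations: [{ t: seconds, active: boolean }, ...] in time order, where
// `active` says whether the alert expression returned a result at time t.
function alertStates(evaluations, forSeconds) {
  let activeSince = null
  return evaluations.map(({ t, active }) => {
    if (!active) {
      activeSince = null // condition cleared: the pending timer starts over
      return 'inactive'
    }
    if (activeSince === null) activeSince = t
    return t - activeSince >= forSeconds ? 'firing' : 'pending'
  })
}

// Condition holds for the full 5 minutes: the alert fires at t=300.
console.log(alertStates(
  [{ t: 0, active: true }, { t: 150, active: true }, { t: 300, active: true }],
  300,
)) // [ 'pending', 'pending', 'firing' ]

// A single healthy evaluation in between resets the timer.
console.log(alertStates(
  [{ t: 0, active: true }, { t: 150, active: false }, { t: 300, active: true }],
  300,
)) // [ 'pending', 'inactive', 'pending' ]
```

Only alerts in the firing state are sent on to Alertmanager, which then handles deduplication and routing.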
Conclusion
Monitoring with Prometheus + Grafana covers about 80% of metric needs. The initial setup takes an afternoon, and the key dashboards (request rate, error rate, p95, memory) are the baseline for any production app. See Observability to understand how metrics fit together with logs and traces.