Monitoring & Logging Best Practices
Introduction
Effective monitoring and logging are essential for operating production systems. This guide covers Prometheus for metrics, Grafana for visualization, the ELK Stack for centralized logging, distributed tracing with Jaeger, alerting with Alertmanager, and general observability best practices.
1. Prometheus - Metrics Collection
# docker-compose.yml - Prometheus setup
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=15d'

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"

volumes:
  prometheus_data:
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-east-1'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'

rule_files:
  - 'alerts.yml'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'app'
    static_configs:
      - targets: ['app:3000']
    metrics_path: '/metrics'
# Node.js application with Prometheus metrics
import express from 'express';
import { register, Counter, Histogram, Gauge } from 'prom-client';

const app = express();

// Metrics
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status']
});

const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.1, 0.5, 1, 2, 5]
});

const activeConnections = new Gauge({
  name: 'active_connections',
  help: 'Number of active connections'
});

// Middleware: count requests, track duration and in-flight connections
app.use((req, res, next) => {
  activeConnections.inc();
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    activeConnections.dec();
    httpRequestsTotal.inc({
      method: req.method,
      route: req.route?.path || req.path,
      status: res.statusCode
    });
    httpRequestDuration.observe({
      method: req.method,
      route: req.route?.path || req.path,
      status: res.statusCode
    }, duration);
  });
  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.send(await register.metrics());
});

app.listen(3000);
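The process-level metrics queried later in this guide (process_resident_memory_bytes, process_cpu_seconds_total) are not produced by the custom metrics above; they come from prom-client's default metrics collector, which you can enable with one call:

// Enable prom-client's built-in default metrics
// (CPU, memory, event loop lag, and other Node.js process metrics)
import { collectDefaultMetrics } from 'prom-client';
collectDefaultMetrics();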
2. Grafana - Visualization
# docker-compose.yml - Add Grafana
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3001:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning

volumes:
  grafana_data:
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
# Example dashboard JSON (simplified)
{
  "dashboard": {
    "title": "Application Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          { "expr": "rate(http_requests_total[5m])" }
        ],
        "type": "graph"
      },
      {
        "title": "Response Time (p95)",
        "targets": [
          { "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))" }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          { "expr": "rate(http_requests_total{status=~\"5..\"}[5m])" }
        ],
        "type": "graph"
      }
    ]
  }
}
# Key Grafana queries
# Request rate by route
sum by (route) (rate(http_requests_total[5m]))
# 95th percentile response time
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Error percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
# Memory usage
process_resident_memory_bytes / 1024 / 1024
# CPU usage
rate(process_cpu_seconds_total[5m])
3. ELK Stack - Centralized Logging
# docker-compose.yml - ELK Stack
version: '3.8'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    container_name: elasticsearch
    environment:
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - xpack.security.enabled=false
    ports:
      - "9200:9200"
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data

  logstash:
    image: docker.elastic.co/logstash/logstash:8.11.0
    container_name: logstash
    ports:
      - "5000:5000/tcp"
      - "5000:5000/udp"
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
    depends_on:
      - elasticsearch

  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0
    container_name: kibana
    ports:
      - "5601:5601"
    environment:
      ELASTICSEARCH_HOSTS: http://elasticsearch:9200
    depends_on:
      - elasticsearch

volumes:
  elasticsearch_data:
# logstash.conf
input {
  tcp {
    port => 5000
    codec => json
  }
  beats {
    port => 5044
  }
}

filter {
  if [type] == "nodejs" {
    json {
      source => "message"
    }
    date {
      match => ["timestamp", "ISO8601"]
      target => "@timestamp"
    }
    mutate {
      remove_field => ["message"]
    }
  } else {
    # Parse plain access-log lines; only applied to non-JSON events
    grok {
      match => {
        "message" => "%{COMBINEDAPACHELOG}"
      }
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
  stdout {
    codec => rubydebug
  }
}
# Node.js Winston logger
import winston from 'winston';
import 'winston-logstash';

const logger = winston.createLogger({
  level: 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'api',
    environment: process.env.NODE_ENV
  },
  transports: [
    // Console
    new winston.transports.Console({
      format: winston.format.simple()
    }),
    // File
    new winston.transports.File({
      filename: 'logs/error.log',
      level: 'error'
    }),
    new winston.transports.File({
      filename: 'logs/combined.log'
    }),
    // Logstash (transport provided by the winston-logstash package)
    new winston.transports.Logstash({
      host: 'logstash',
      port: 5000
    })
  ]
});

// Usage
logger.info('User logged in', { userId: 123, ip: req.ip });
logger.error('Database connection failed', { error: err.message });
logger.warn('High memory usage', { usage: process.memoryUsage() });
4. Alerting with AlertManager
# docker-compose.yml - Add AlertManager
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'

volumes:
  alertmanager_data:
# alertmanager.yml
global:
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'slack-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    - match:
        severity: warning
      receiver: 'slack-notifications'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        send_resolved: true
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
        description: '{{ .GroupLabels.alertname }}'
# alerts.yml - Prometheus alert rules
groups:
  - name: application
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: HighResponseTime
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High response time"
          description: "95th percentile is {{ $value }}s"

      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} has been down for 5+ minutes"

      - alert: HighMemoryUsage
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
            / node_memory_MemTotal_bytes > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is {{ $value | humanizePercentage }}"

      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space"
          description: "Only {{ $value | humanizePercentage }} remaining"
5. Distributed Tracing with Jaeger
# docker-compose.yml - Add Jaeger
  jaeger:
    image: jaegertracing/all-in-one:latest
    container_name: jaeger
    ports:
      - "5775:5775/udp"
      - "6831:6831/udp"
      - "6832:6832/udp"
      - "5778:5778"
      - "16686:16686"
      - "14268:14268"
      - "14250:14250"
      - "9411:9411"
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411
# Node.js with OpenTelemetry
import { NodeSDK } from '@opentelemetry/sdk-node';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'my-api',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0'
  }),
  traceExporter: new JaegerExporter({
    endpoint: 'http://jaeger:14268/api/traces'
  }),
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation()
  ]
});

sdk.start();
// Manual tracing
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('my-api');

async function processOrder(orderId) {
  const span = tracer.startSpan('processOrder');
  span.setAttribute('orderId', orderId);
  try {
    // Business logic
    await validateOrder(orderId);
    await chargeCustomer(orderId);
    await createShipment(orderId);
    span.setStatus({ code: SpanStatusCode.OK });
  } catch (error) {
    span.recordException(error);
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error.message
    });
    throw error;
  } finally {
    span.end();
  }
}
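Note that startSpan does not make the new span active, so spans created by the auto-instrumentations inside the business logic are not parented to it. A minimal alternative sketch using tracer.startActiveSpan (same hypothetical helper functions as above):

// Same flow with startActiveSpan: the callback runs with the span active,
// so child spans (e.g. from HTTP or DB auto-instrumentation) nest under it
async function processOrderActive(orderId) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttribute('orderId', orderId);
    try {
      await validateOrder(orderId);
      await chargeCustomer(orderId);
      await createShipment(orderId);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      span.end();
    }
  });
}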
6. Log Aggregation Patterns
# Structured logging format
{
  "timestamp": "2024-01-15T10:30:45.123Z",
  "level": "info",
  "service": "api",
  "environment": "production",
  "traceId": "abc123",
  "spanId": "def456",
  "userId": 123,
  "action": "order.created",
  "orderId": 789,
  "duration": 250,
  "metadata": {
    "ip": "192.168.1.1",
    "userAgent": "Mozilla/5.0..."
  }
}
# Best practices for logging
// 1. Use log levels appropriately
logger.debug('Detailed debugging info');    // Development only
logger.info('User action');                 // Important events
logger.warn('Deprecated API used');         // Warnings
logger.error('Database error', { error });  // Errors
// Winston has no 'fatal' level by default; use 'error' (or define a custom level) for critical failures
logger.error('Service crashed');            // Critical

// 2. Add context
logger.info('Order created', {
  orderId: order.id,
  userId: user.id,
  amount: order.total,
  items: order.items.length
});

// 3. Include trace IDs
import { v4 as uuidv4 } from 'uuid';

app.use((req, res, next) => {
  req.traceId = req.headers['x-trace-id'] || uuidv4();
  res.setHeader('x-trace-id', req.traceId);
  next();
});

logger.info('Request received', {
  traceId: req.traceId,
  method: req.method,
  path: req.path
});

// 4. Sanitize sensitive data
function sanitize(obj) {
  const sensitive = ['password', 'token', 'secret', 'apiKey'];
  return Object.entries(obj).reduce((acc, [key, value]) => {
    acc[key] = sensitive.includes(key) ? '***' : value;
    return acc;
  }, {});
}

logger.info('User data', sanitize(userData));

// 5. Log errors with stack traces
try {
  await riskyOperation();
} catch (error) {
  logger.error('Operation failed', {
    error: error.message,
    stack: error.stack,
    context: { userId, orderId }
  });
}
7. Kubernetes Monitoring
# Prometheus ServiceMonitor for Kubernetes
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: default
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
# Pod with Prometheus annotations
apiVersion: v1
kind: Pod
metadata:
  name: myapp
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "3000"
    prometheus.io/path: "/metrics"
spec:
  containers:
    - name: app
      image: myapp:latest
      ports:
        - containerPort: 3000
          name: metrics
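A ServiceMonitor is only picked up when your Prometheus instance is managed by the Prometheus Operator, and the prometheus.io/* annotations only take effect if a scrape job is configured to honor them. A sketch of such a job, based on the widely used kubernetes-pods relabeling pattern (adjust names and labels to your cluster):

# prometheus.yml - scrape pods that opt in via prometheus.io/* annotations
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use prometheus.io/path as the metrics path (defaults to /metrics)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Scrape the port given in prometheus.io/port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__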
# Fluentd DaemonSet for log collection
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: fluentd
  template:
    metadata:
      labels:
        k8s-app: fluentd
    spec:
      serviceAccountName: fluentd
      containers:
        - name: fluentd
          image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
          env:
            - name: FLUENT_ELASTICSEARCH_HOST
              value: "elasticsearch.default.svc.cluster.local"
            - name: FLUENT_ELASTICSEARCH_PORT
              value: "9200"
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
8. Performance Monitoring
# Application Performance Monitoring (APM)
// Start the agent as early as possible, before other modules are loaded
import apm from 'elastic-apm-node';

apm.start({
  serviceName: 'my-api',
  serverUrl: 'http://apm-server:8200',
  environment: process.env.NODE_ENV
});

// Custom transactions
const transaction = apm.startTransaction('process-payment', 'payment');
try {
  await processPayment(orderId);
  transaction.result = 'success';
} catch (error) {
  apm.captureError(error);
  transaction.result = 'error';
  throw error;
} finally {
  transaction.end();
}

// Custom spans
const span = apm.startSpan('database-query');
const result = await db.query('SELECT * FROM orders WHERE id = ?', [orderId]);
span.end();
# Real User Monitoring (RUM)
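For the browser side, a minimal sketch using Elastic's RUM agent (@elastic/apm-rum); the service name and server URL below are placeholders, and the APM server must be reachable from users' browsers:

// Browser bundle: initialize the Elastic APM RUM agent
import { init as initApm } from '@elastic/apm-rum';

const apm = initApm({
  serviceName: 'my-frontend',            // placeholder frontend service name
  serverUrl: 'http://apm-server:8200',   // placeholder; must be reachable from the browser
  environment: 'production'
});

// Optionally record a custom transaction around a user interaction
const transaction = apm.startTransaction('checkout-click', 'user-interaction');
// ...work...
transaction.end();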
9. Best Practices
Monitoring & Logging Best Practices:
- ✓ Implement the 4 golden signals: latency, traffic, errors, saturation
- ✓ Use structured logging (JSON format)
- ✓ Include trace IDs for correlation
- ✓ Set up alerts for critical metrics
- ✓ Monitor infrastructure and application metrics
- ✓ Use distributed tracing for microservices
- ✓ Implement log retention policies
- ✓ Sanitize sensitive data before logging
- ✓ Use appropriate log levels
- ✓ Create dashboards for key metrics
- ✓ Set up on-call rotations
- ✓ Document runbooks for alerts
- ✓ Test your alerting system regularly
- ✓ Avoid alert fatigue with proper thresholds
- ✓ Implement SLIs, SLOs, and SLAs (see the example alert rule after this list)
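As a hedged illustration of the SLO item above: assuming a 99.9% availability SLO measured from the http_requests_total metric defined earlier, a multi-window burn-rate alert might look roughly like this (the 14.4x threshold follows the common burn-rate guidance; all names are illustrative):

# slo-alerts.yml - illustrative burn-rate alert for a 99.9% availability SLO
groups:
  - name: slo
    rules:
      - alert: HighErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > (14.4 * 0.001)
          )
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h])) > (14.4 * 0.001)
          )
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget for the 99.9% SLO is burning too fast"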
Conclusion
Effective monitoring and logging enable proactive incident response and system optimization. Use Prometheus for metrics, the ELK Stack for centralized logs, distributed tracing for request-level visibility, and Alertmanager for alerting. Always monitor the four golden signals and maintain observability across your stack.
💡 Pro Tip: Implement the RED method (Rate, Errors, Duration) for services and the USE method (Utilization, Saturation, Errors) for resources. These frameworks ensure you monitor what matters most and can quickly diagnose issues in production.
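For the RED method, the three signals map directly onto the metrics instrumented in section 1; the corresponding PromQL (mirroring the queries shown earlier) is:

# Rate - requests per second
sum(rate(http_requests_total[5m]))
# Errors - share of requests failing
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Duration - 95th percentile latency
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))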