Observability is a critical capability for modern distributed systems. This article explores practical approaches to the three pillars: logs, traces, and metrics.
Three Pillars of Observability
Core Concepts
Three Pillars of Observability:
┌─────────────────────────────────────────────────────┐
│                                                     │
│  Logs                                               │
│  ├── Record discrete events                         │
│  ├── Detailed context information                   │
│  └── Troubleshooting and auditing                   │
│                                                     │
│  Metrics                                            │
│  ├── Numeric time series                            │
│  ├── Aggregation and trend analysis                 │
│  └── Alerting and dashboards                        │
│                                                     │
│  Traces                                             │
│  ├── Complete request path through system           │
│  ├── Service-to-service call relationships          │
│  └── Performance bottleneck identification          │
│                                                     │
└─────────────────────────────────────────────────────┘
Tool Ecosystem
| Category | Tool Options |
|---|---|
| Logs | ELK Stack, Loki, Splunk |
| Metrics | Prometheus, Datadog, InfluxDB |
| Traces | Jaeger, Zipkin, OpenTelemetry |
| Unified Platform | Grafana, Datadog, New Relic |
Structured Logging
Log Design
import pino from 'pino';
import express, { Request } from 'express';

const app = express();
app.use(express.json());

// Create logger instance
const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  base: {
    service: 'user-service',
    version: process.env.APP_VERSION,
    environment: process.env.NODE_ENV,
  },
  timestamp: () => `,"timestamp":"${new Date().toISOString()}"`,
});

// Request context logger
// (req.user and req.log assume auth middleware and Express type augmentation)
function createRequestLogger(req: Request) {
  return logger.child({
    requestId: req.headers['x-request-id'],
    userId: req.user?.id,
    path: req.path,
    method: req.method,
  });
}

// Usage example
app.use((req, res, next) => {
  req.log = createRequestLogger(req);
  next();
});

app.post('/api/orders', async (req, res) => {
  req.log.info({ orderData: req.body }, 'Creating order');
  try {
    const order = await orderService.create(req.body);
    req.log.info({ orderId: order.id }, 'Order created successfully');
    res.json(order);
  } catch (error) {
    req.log.error({ error: error.message, stack: error.stack }, 'Order creation failed');
    res.status(500).json({ error: 'Internal error' });
  }
});
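With this configuration, each event is emitted as a single JSON line; the order-created log above would look roughly like this (field values are illustrative):

{"level":"info","timestamp":"2024-05-01T12:00:00.000Z","service":"user-service","version":"1.2.3","environment":"production","requestId":"abc-123","userId":"42","path":"/api/orders","method":"POST","orderId":"ord_789","msg":"Order created successfully"}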
Log Level Strategy
// Log level definitions
enum LogLevel {
  TRACE = 'trace', // Most detailed, development debugging
  DEBUG = 'debug', // Debug information
  INFO = 'info',   // Normal business events
  WARN = 'warn',   // Potential issues
  ERROR = 'error', // Errors requiring attention
  FATAL = 'fatal', // Critical errors, system unavailable
}

// Log level usage guidelines
// (retryCount, paymentGateway and databaseConnection are illustrative placeholders)
class OrderService {
  async processOrder(orderId: string) {
    // TRACE: Detailed debug info
    logger.trace({ orderId }, 'Starting order processing');

    // DEBUG: Debug-related info
    logger.debug({ orderId, step: 'validation' }, 'Validating order');

    // INFO: Important business events
    logger.info({ orderId, amount: 100 }, 'Order payment initiated');

    // WARN: Non-fatal issues
    if (retryCount > 0) {
      logger.warn({ orderId, retryCount }, 'Payment retry required');
    }

    // ERROR: Error events
    try {
      await paymentGateway.charge(orderId);
    } catch (error) {
      logger.error({ orderId, error: error.message }, 'Payment failed');
      throw error;
    }

    // FATAL: System-level critical errors
    if (!databaseConnection) {
      logger.fatal('Database connection lost');
      process.exit(1);
    }
  }
}
Log Aggregation
// Transport logs to centralized storage
import pino from 'pino';

// Development: Pretty output
const devTransport = pino.transport({
  target: 'pino-pretty',
  options: { colorize: true },
});

// Production: Send to Loki
const prodTransport = pino.transport({
  targets: [
    {
      target: 'pino-loki',
      options: {
        host: process.env.LOKI_HOST,
        labels: { app: 'user-service' },
      },
    },
    {
      target: 'pino/file',
      options: { destination: '/var/log/app.log' },
    },
  ],
});

const logger = pino(
  process.env.NODE_ENV === 'production' ? prodTransport : devTransport
);
Distributed Tracing
OpenTelemetry Integration
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

// Initialize SDK
const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'user-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://jaeger:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': { enabled: false },
    }),
  ],
});

sdk.start();

// Graceful shutdown
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((error) => console.log('Error terminating tracing', error))
    .finally(() => process.exit(0));
});
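One detail worth noting: the auto-instrumentations can only patch modules such as http and express if this SDK setup runs before the rest of the application is loaded. Assuming the code above lives in a file like tracing.ts, the entry point might look like this (alternatively, the file can be preloaded with node --require):

// server.ts – load the tracing setup first so auto-instrumentation can patch
// http/express before the application code imports them
import './tracing';
import './app';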
Custom Spans
import { trace, SpanStatusCode, context, propagation } from '@opentelemetry/api';

const tracer = trace.getTracer('user-service');

async function processPayment(orderId: string, amount: number) {
  return tracer.startActiveSpan('processPayment', async (span) => {
    try {
      // Add attributes
      span.setAttribute('order.id', orderId);
      span.setAttribute('payment.amount', amount);

      // Add events
      span.addEvent('payment_started');

      // Call payment gateway
      const result = await paymentGateway.charge({
        orderId,
        amount,
      });

      span.addEvent('payment_completed', {
        transactionId: result.transactionId,
      });

      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error.message,
      });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}

// Cross-service context propagation
async function callInventoryService(orderId: string) {
  return tracer.startActiveSpan('callInventoryService', async (span) => {
    try {
      const headers = {};

      // Inject the active trace context into the outgoing request headers
      propagation.inject(context.active(), headers);

      const response = await fetch('http://inventory-service/reserve', {
        method: 'POST',
        headers: {
          ...headers,
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({ orderId }),
      });
      return response.json();
    } finally {
      span.end();
    }
  });
}
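On the receiving side, the inventory service can rebuild the caller's context from the incoming headers. The HTTP auto-instrumentation normally does this automatically; a manual sketch (the handleReserve handler and the Express types are assumptions, not part of the example above) might look like this:

import { context, propagation, trace } from '@opentelemetry/api';
import type { Request, Response } from 'express';

const tracer = trace.getTracer('inventory-service');

// Restore the remote parent context from the incoming headers, then start
// a span as a child of the caller's span.
async function handleReserve(req: Request, res: Response) {
  const parentContext = propagation.extract(context.active(), req.headers);
  const span = tracer.startSpan('reserveInventory', undefined, parentContext);
  try {
    // ... reserve stock here ...
    res.json({ reserved: true });
  } finally {
    span.end();
  }
}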
Metrics Monitoring
Prometheus Metrics
import { Registry, Counter, Histogram, Gauge } from 'prom-client';

const registry = new Registry();

// Counter: Cumulative values
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'path', 'status'],
  registers: [registry],
});

// Histogram: Distribution statistics
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
  registers: [registry],
});

// Gauge: Current values
const activeConnections = new Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
  registers: [registry],
});

// Middleware to collect metrics
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    httpRequestsTotal.inc({
      method: req.method,
      path: req.route?.path || req.path,
      status: res.statusCode,
    });
    httpRequestDuration.observe(
      { method: req.method, path: req.route?.path || req.path },
      duration
    );
  });
  next();
});

// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', registry.contentType);
  res.end(await registry.metrics());
});
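It is usually worth exposing the Node.js runtime metrics (event-loop lag, heap usage, GC) alongside the custom ones; prom-client ships a collector for this:

import { collectDefaultMetrics } from 'prom-client';

// Register default Node.js process and runtime metrics on the same registry
collectDefaultMetrics({ register: registry });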
Business Metrics
// Order-related metrics (registered on the same registry served by /metrics above)
const ordersCreated = new Counter({
  name: 'orders_created_total',
  help: 'Total orders created',
  labelNames: ['status', 'payment_method'],
  registers: [registry],
});

const orderValue = new Histogram({
  name: 'order_value_dollars',
  help: 'Order value distribution',
  buckets: [10, 50, 100, 500, 1000, 5000],
  registers: [registry],
});

const inventoryLevel = new Gauge({
  name: 'inventory_level',
  help: 'Current inventory level',
  labelNames: ['product_id'],
  registers: [registry],
});

// Record in business logic
async function createOrder(orderData: CreateOrderDto) {
  const order = await orderRepository.create(orderData);

  ordersCreated.inc({
    status: order.status,
    payment_method: order.paymentMethod,
  });
  orderValue.observe(order.totalAmount);

  for (const item of order.items) {
    inventoryLevel.dec({ product_id: item.productId });
  }

  return order;
}
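Gauges that mirror external state, such as stock levels, are often easier to keep accurate by periodically overwriting them from the source of truth rather than decrementing inline. A sketch, where inventoryRepository.getStockLevels is a hypothetical query:

// Refresh the gauge from the authoritative stock levels every 30 seconds
setInterval(async () => {
  const levels = await inventoryRepository.getStockLevels(); // hypothetical helper
  for (const { productId, quantity } of levels) {
    inventoryLevel.set({ product_id: productId }, quantity);
  }
}, 30_000);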
Alerting Strategy
Prometheus Alert Rules
# prometheus/alerts.yml
groups:
  - name: application
    rules:
      # Error rate alert
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # Latency alert
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }}s"

      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.instance }} is unreachable"

  - name: infrastructure
    rules:
      # CPU usage
      - alert: HighCPUUsage
        expr: |
          100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "CPU usage is {{ $value }}%"

      # Memory usage
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is {{ $value }}%"
Alert Notifications
// Alert handler
interface AlertPayload {
  status: 'firing' | 'resolved';
  alerts: Alert[];
}

interface Alert {
  labels: Record<string, string>;
  annotations: Record<string, string>;
  startsAt: string;
  endsAt?: string;
}

app.post('/api/alerts', async (req, res) => {
  const payload: AlertPayload = req.body;

  for (const alert of payload.alerts) {
    const message = formatAlertMessage(alert, payload.status);

    // Choose notification channel based on severity
    if (alert.labels.severity === 'critical') {
      await sendPagerDutyAlert(alert);
      await sendSlackMessage('#incidents', message);
    } else if (alert.labels.severity === 'warning') {
      await sendSlackMessage('#alerts', message);
    }
  }

  res.status(200).send('OK');
});

function formatAlertMessage(alert: Alert, status: string): string {
  const emoji = status === 'firing' ? '🔴' : '✅';
  return `${emoji} **${alert.labels.alertname}**
Status: ${status}
Severity: ${alert.labels.severity}
Summary: ${alert.annotations.summary}
Description: ${alert.annotations.description}`;
}
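sendSlackMessage and sendPagerDutyAlert are referenced but not defined above. As an illustration only, a minimal Slack sender posting to an incoming webhook (SLACK_WEBHOOK_URL is an assumed environment variable) could look like this:

// Illustrative sketch – post the formatted message to a Slack incoming webhook
async function sendSlackMessage(channel: string, text: string): Promise<void> {
  await fetch(process.env.SLACK_WEBHOOK_URL!, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ channel, text }),
  });
}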
Health Checks
// Health check endpoints
interface HealthCheck {
  name: string;
  check: () => Promise<boolean>;
}

const healthChecks: HealthCheck[] = [
  {
    name: 'database',
    check: async () => {
      await prisma.$queryRaw`SELECT 1`;
      return true;
    },
  },
  {
    name: 'redis',
    check: async () => {
      const result = await redis.ping();
      return result === 'PONG';
    },
  },
  {
    name: 'external-api',
    check: async () => {
      const response = await fetch('https://api.external.com/health');
      return response.ok;
    },
  },
];

app.get('/health', async (req, res) => {
  const results: Record<string, { status: string; latency: number }> = {};
  let healthy = true;

  for (const check of healthChecks) {
    const start = Date.now();
    try {
      await check.check();
      results[check.name] = {
        status: 'healthy',
        latency: Date.now() - start,
      };
    } catch (error) {
      healthy = false;
      results[check.name] = {
        status: 'unhealthy',
        latency: Date.now() - start,
      };
    }
  }

  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'healthy' : 'unhealthy',
    checks: results,
    timestamp: new Date().toISOString(),
  });
});

// Readiness check
app.get('/ready', async (req, res) => {
  const ready = await isApplicationReady();
  res.status(ready ? 200 : 503).send(ready ? 'Ready' : 'Not Ready');
});

// Liveness check
app.get('/live', (req, res) => {
  res.status(200).send('Alive');
});
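isApplicationReady is not defined above; it is typically where startup work such as migrations or cache warm-up is gated before the instance receives traffic. A hypothetical sketch:

// Hypothetical readiness gate – the concrete checks depend on the application
let migrationsComplete = false; // assumed to be set by startup code

async function isApplicationReady(): Promise<boolean> {
  if (!migrationsComplete) return false;
  try {
    await prisma.$queryRaw`SELECT 1`; // reuse the database connectivity check
    return true;
  } catch {
    return false;
  }
}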
Best Practices Summary
Observability Best Practices:
┌─────────────────────────────────────────────────────┐
│                                                     │
│  Logging                                            │
│  ├── Use structured logging                         │
│  ├── Include request context                        │
│  ├── Set appropriate log levels                     │
│  └── Centralize log storage                         │
│                                                     │
│  Tracing                                            │
│  ├── Use OpenTelemetry                              │
│  ├── Propagate trace context                        │
│  ├── Add meaningful spans                           │
│  └── Record key attributes and events               │
│                                                     │
│  Metrics                                            │
│  ├── RED method (Rate, Errors, Duration)            │
│  ├── USE method (Utilization, Saturation, Errors)   │
│  ├── Business metrics                               │
│  └── Reasonable alert thresholds                    │
│                                                     │
│  Alerting                                           │
│  ├── SLO-based alerting                             │
│  ├── Tiered notification strategy                   │
│  ├── Avoid alert fatigue                            │
│  └── Document alert response                        │
│                                                     │
└─────────────────────────────────────────────────────┘
| Scenario | Recommended Solution |
|---|---|
| Log storage | Loki + Grafana |
| Metrics monitoring | Prometheus + Grafana |
| Distributed tracing | Jaeger / Tempo |
| Unified standard | OpenTelemetry |
| Alert management | Alertmanager |
Observability is the eyes of a running system: with comprehensive logging, tracing, and metrics in place, issues can be identified and resolved quickly.
Systems you can’t see are systems you can’t manage. Observability makes complex systems transparent and controllable.