Introduction
“Just add more logging” is the default response when production issues are hard to debug. But more logging often means more noise, not more insight.
True observability requires a different approach: understanding your system through metrics, traces, and structured logs working together.
The Three Pillars
Logs
What happened at a specific point in time.
{
  "timestamp": "2024-01-15T10:23:45Z",
  "level": "error",
  "message": "Payment failed",
  "userId": "user_123",
  "paymentId": "pay_456",
  "error": "Card declined",
  "traceId": "abc123"
}
Good for: Detailed debugging, audit trails, understanding specific events.
Bad for: Aggregation, trends, understanding system-wide behavior.
Metrics
Numeric measurements over time.
http_requests_total{method="POST", path="/api/payments", status="500"} 42
http_request_duration_seconds{quantile="0.99"} 2.5
active_database_connections 45
Good for: Alerting, dashboards, understanding trends and patterns.
Bad for: Understanding why something happened, debugging specific requests.
Traces
The journey of a request through your system.
Trace: abc123
├── API Gateway (2ms)
├── Auth Service (15ms)
├── Payment Service (450ms)
│   ├── Validate Request (5ms)
│   ├── Check Fraud (200ms)
│   └── Process Payment (240ms)
│       └── External Provider (235ms) ← slow!
└── Notification Service (25ms)
Good for: Understanding latency, finding bottlenecks, debugging distributed systems.
Bad for: Aggregation (too much data), simple systems.
Implementing Effective Logging
Structure Your Logs
Unstructured logs are nearly useless at scale:
// Bad
console.log(`User ${userId} failed to pay: ${error}`);

// Good
logger.error('Payment failed', {
  userId,
  paymentId,
  amount,
  currency,
  errorCode: error.code,
  errorMessage: error.message,
  traceId: context.traceId
});
Log Levels Matter
Use levels consistently:
- ERROR: Something failed that shouldn’t have
- WARN: Something unexpected but handled
- INFO: Significant business events
- DEBUG: Detailed information for debugging (off in production)
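For example, a single payment flow might emit all four levels. This is a rough sketch assuming the same logger and field names as the earlier examples:
logger.debug('Calling payment provider', { paymentId, provider: 'stripe' });    // diagnostic detail, off in production
logger.info('Payment authorized', { paymentId, amount });                       // significant business event
logger.warn('Payment provider timed out, retrying', { paymentId, attempt: 2 }); // unexpected but handled
logger.error('Payment failed after retries', { paymentId, errorCode });         // something failed that shouldn't have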
Include Context
Every log should answer: who, what, when, where, why?
function processOrder(order: Order, context: Context) {
  const logContext = {
    orderId: order.id,
    userId: order.userId,
    traceId: context.traceId,
    spanId: context.spanId
  };
  logger.info('Processing order', { ...logContext, amount: order.total });
  try {
    // ... process order
    logger.info('Order processed successfully', logContext);
  } catch (error) {
    logger.error('Order processing failed', {
      ...logContext,
      error: error.message,
      stack: error.stack
    });
    throw error;
  }
}
Implementing Metrics
The Four Golden Signals
Start with these for every service:
- Latency: How long requests take
- Traffic: How many requests you’re handling
- Errors: How many requests fail
- Saturation: How “full” your service is
// Example with the prom-client Prometheus client
import { Histogram, Counter } from 'prom-client';

const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests',
  labelNames: ['method', 'path', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5]
});
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status']
});
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const labels = {
      method: req.method,
      path: req.route?.path || 'unknown',
      status: res.statusCode
    };
    httpRequestDuration.observe(labels, duration);
    httpRequestsTotal.inc(labels);
  });
  next();
});
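The middleware above covers latency, traffic, and errors; saturation is usually reported with a gauge. A minimal sketch with the same Prometheus client, assuming a hypothetical connection pool object that exposes its counts:
import { Gauge } from 'prom-client';

declare const pool: { totalCount: number; idleCount: number }; // assumed connection pool

// Saturation: how "full" the service is (here, database connection pool usage)
const dbConnectionsInUse = new Gauge({
  name: 'active_database_connections',
  help: 'Database connections currently in use'
});

// Sample the pool every few seconds
setInterval(() => {
  dbConnectionsInUse.set(pool.totalCount - pool.idleCount);
}, 5000);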
Business Metrics
Don’t just monitor infrastructure—monitor what matters to the business:
const ordersProcessed = new Counter({
  name: 'orders_processed_total',
  help: 'Total orders processed',
  labelNames: ['status', 'payment_method']
});
const orderValue = new Histogram({
  name: 'order_value_dollars',
  help: 'Order value in dollars',
  buckets: [10, 50, 100, 500, 1000, 5000]
});
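Recording them is then a one-liner wherever the business event completes. A sketch assuming a hypothetical order shape:
// Hypothetical order-completion hook recording the business metrics defined above
function recordOrderMetrics(order: { status: string; paymentMethod: string; total: number }) {
  ordersProcessed.inc({ status: order.status, payment_method: order.paymentMethod });
  orderValue.observe(order.total);
}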
Cardinality Warning
Be careful with label values. High cardinality kills metric systems:
// Bad - userId has unlimited values
httpRequests.inc({ userId: user.id });
// Good - limited set of values
httpRequests.inc({ userType: user.type }); // 'free', 'premium', 'enterprise'
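Unbounded values also sneak in through URL paths (numeric IDs, UUIDs). One common mitigation, sketched here with a hypothetical helper, is to normalize paths to route templates before using them as a label:
// Collapse high-cardinality path segments into placeholders before labelling
function normalizePath(path: string): string {
  return path
    .replace(/\/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, '/:uuid') // UUIDs
    .replace(/\/\d+/g, '/:id');                                                            // numeric IDs
}

httpRequestsTotal.inc({ method: 'GET', path: normalizePath('/users/123/orders/456'), status: '200' });
// labels become { method: 'GET', path: '/users/:id/orders/:id', status: '200' }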
Implementing Distributed Tracing
Propagate Context
Pass trace context through your entire request flow:
// HTTP client
async function callService(url: string, context: Context) {
  return fetch(url, {
    headers: {
      'X-Trace-Id': context.traceId,
      'X-Span-Id': context.spanId,
      'X-Parent-Span-Id': context.parentSpanId
    }
  });
}
// Message queue
async function publishMessage(queue: MessageQueue, message: any, context: Context) {
  await queue.publish({
    ...message,
    _traceContext: {
      traceId: context.traceId,
      spanId: generateSpanId(),
      parentSpanId: context.spanId
    }
  });
}
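On the consuming side, rebuild the context from the embedded fields so downstream spans and logs stay tied to the original trace. A sketch assuming the same Context shape as above; generateTraceId is a hypothetical helper alongside generateSpanId:
// Message consumer: restore trace context from the payload before doing any work
async function handleMessage(raw: any) {
  const context: Context = {
    traceId: raw._traceContext?.traceId ?? generateTraceId(), // start a fresh trace if none was propagated
    spanId: generateSpanId(),
    parentSpanId: raw._traceContext?.spanId
  };
  logger.info('Message received', { traceId: context.traceId });
  // ... process the message, passing `context` to downstream calls
}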
Instrument Key Operations
Focus tracing on:
- External service calls
- Database queries
- Cache operations
- Message queue operations
- Significant business logic
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service');

async function processPayment(payment: Payment, context: Context) {
  // startActiveSpan parents the new span from the currently active context
  return tracer.startActiveSpan('processPayment', async (span) => {
    span.setAttributes({
      'payment.id': payment.id,
      'payment.amount': payment.amount,
      'payment.currency': payment.currency
    });
    try {
      const result = await paymentProvider.charge(payment);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      throw error;
    } finally {
      span.end(); // spans must be ended explicitly
    }
  });
}
Connecting the Pillars
The real power comes from connecting logs, metrics, and traces:
Trace ID in Everything
Include the trace ID in every log line. For metrics, link to traces through exemplars (next section) rather than labels, since a trace ID label has unbounded cardinality:
logger.info('Payment processed', {
  traceId: context.traceId, // Links to trace
  paymentId: payment.id
});
Exemplars
Link metrics to specific traces:
// When you see a spike in latency, click through to see
// the actual traces that caused it
httpLatency.observe(
{ method: 'POST', path: '/payments' },
duration,
{ traceId: context.traceId } // Exemplar
);
Alerting Strategy
Alert on Symptoms, Not Causes
# Bad - alerts on cause
- alert: HighCPU
  expr: cpu_usage_percent > 80
# Good - alerts on symptom
- alert: HighLatency
  expr: http_request_duration_seconds{quantile="0.99"} > 2
Use Multiple Signals
- alert: PaymentServiceDegraded
  expr: |
    (
      sum(rate(payment_errors_total[5m])) / sum(rate(payment_requests_total[5m])) > 0.01
    )
    and
    (
      histogram_quantile(0.99, sum(rate(payment_duration_seconds_bucket[5m])) by (le)) > 5
    )
  annotations:
    summary: "Payment service is degraded - high errors AND high latency"
Practical Tips
Start Simple
Don’t try to instrument everything at once:
- Add the four golden signals to each service
- Add structured logging with trace IDs (see the sketch after this list)
- Add tracing to external calls
- Expand based on what you need to debug
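For the structured-logging step, a minimal Express sketch that attaches a trace ID to every request and to a request-scoped logger; the child() call assumes a pino- or winston-style logger:
import { randomUUID } from 'crypto';

// Accept an incoming trace ID if present, otherwise start a new one,
// and attach a request-scoped logger that stamps it on every line.
app.use((req, res, next) => {
  const traceId = req.header('X-Trace-Id') ?? randomUUID();
  res.setHeader('X-Trace-Id', traceId);
  (req as any).log = logger.child({ traceId }); // child() as in pino/winston-style loggers
  next();
});

app.post('/api/payments', (req, res) => {
  (req as any).log.info('Payment requested'); // automatically carries traceId
  res.sendStatus(202);
});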
Make Dashboards Useful
A good dashboard answers: “Is the system healthy right now?”
Include:
- Request rate and error rate
- Latency percentiles (p50, p95, p99)
- Saturation metrics (queue depth, connection pool)
- Key business metrics
Practice Using Your Observability
Run game days where you:
- Inject failures
- Try to diagnose using only observability tools
- Identify gaps in instrumentation
Conclusion
Good observability isn’t about collecting more data—it’s about collecting the right data and making it easy to use.
Focus on:
- Structured, contextual logs
- The four golden signals for metrics
- Traces for understanding request flow
- Connecting all three with trace IDs
When something goes wrong in production, you should be able to go from alert to root cause in minutes, not hours.