Building Resilient APIs: Patterns for Production

Introduction

APIs fail. Networks are unreliable, dependencies go down, and unexpected load happens. The difference between a good API and a great API is how it handles these failures.

This guide covers practical patterns I’ve used to build resilient APIs in production.

Circuit Breaker Pattern

The Problem

When a downstream service is failing, continuing to call it:

  • Wastes resources
  • Increases latency
  • Can cause cascading failures

The Solution

A circuit breaker monitors failures and “opens” when a threshold is reached, immediately returning errors instead of calling the failing service.

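// Example values; tune them per dependency.
const FAILURE_THRESHOLD = 5;   // consecutive failures before the circuit opens
const RESET_TIMEOUT = 30000;   // milliseconds to wait before allowing a probe call

class CircuitBreaker {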
  private state: 'closed' | 'open' | 'half-open' = 'closed';
  private failureCount = 0;
  private lastFailureTime?: number;

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (this.shouldAttemptReset()) {
        this.state = 'half-open';
      } else {
        throw new Error('Circuit breaker is open');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failureCount = 0;
    this.state = 'closed';
  }

  private onFailure() {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    
    if (this.failureCount >= FAILURE_THRESHOLD) {
      this.state = 'open';
    }
  }

  private shouldAttemptReset(): boolean {
    return Date.now() - this.lastFailureTime! > RESET_TIMEOUT;
  }
}
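
The state machine is easier to follow when it is separated from the async plumbing. Here is a minimal sketch of the same transitions written as a pure function. The names `BreakerSnapshot` and `transition`, the explicit `now` argument, and the threshold values are my own choices for illustration, not part of any library.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';
type BreakerEvent = 'request' | 'success' | 'failure';

interface BreakerSnapshot {
  state: BreakerState;
  failureCount: number;
  lastFailureTime?: number;
}

// Illustrative values only.
const SKETCH_FAILURE_THRESHOLD = 3;
const SKETCH_RESET_TIMEOUT_MS = 30000;

// Returns the next snapshot for a given event at time `now` (milliseconds).
function transition(s: BreakerSnapshot, event: BreakerEvent, now: number): BreakerSnapshot {
  switch (event) {
    case 'request':
      // An open breaker moves to half-open once the reset timeout has elapsed.
      if (
        s.state === 'open' &&
        s.lastFailureTime !== undefined &&
        now - s.lastFailureTime > SKETCH_RESET_TIMEOUT_MS
      ) {
        return { ...s, state: 'half-open' };
      }
      return s;
    case 'success':
      // Any success closes the circuit and clears the failure count.
      return { state: 'closed', failureCount: 0 };
    case 'failure': {
      const failureCount = s.failureCount + 1;
      return {
        state: failureCount >= SKETCH_FAILURE_THRESHOLD ? 'open' : s.state,
        failureCount,
        lastFailureTime: now,
      };
    }
  }
}
```

Because the failure count is only cleared on success, a failed probe in the half-open state pushes the breaker straight back to open.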

Real-World Impact

After implementing circuit breakers for our payment provider integrations:

  • Reduced cascading failures by 90%
  • Improved API response times during provider outages
  • Better visibility into dependency health

Retry with Exponential Backoff

The Problem

Transient failures are common (network blips, temporary overload). Immediate retries can make things worse.

The Solution

Retry with exponentially increasing delays, and add jitter so that clients do not all retry at the same moment (the thundering herd problem).

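// Resolves after the given number of milliseconds.
function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Note: maxRetries counts total attempts, including the first call.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3
): Promise<T> {
  // Caught values are `unknown` in TypeScript, so keep the original value as-is.
  let lastError: unknown;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;

      if (attempt < maxRetries - 1) {
        // Exponential delay (1s, 2s, 4s, ...) plus up to 1s of jitter, capped at 10s
        const delay = Math.min(
          1000 * Math.pow(2, attempt) + Math.random() * 1000,
          10000
        );
        await sleep(delay);
      }
    }
  }

  throw lastError;
}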

When to Retry

Not all errors should be retried:

  • ✅ Network timeouts
  • ✅ 503 Service Unavailable
  • ✅ 429 Too Many Requests (honor the Retry-After header when present)
  • ❌ 400 Bad Request
  • ❌ 401 Unauthorized
  • ❌ 404 Not Found
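
The list above can be turned into a small predicate that a retry loop consults before sleeping. This is a sketch under assumptions: the `RequestFailure` shape and the `isRetryable` name are hypothetical, and `ETIMEDOUT` is the Node.js-style error code I am assuming your HTTP client surfaces for network timeouts.

```typescript
// Hypothetical error shape: `status` for HTTP responses, `code` for network-level errors.
interface RequestFailure {
  status?: number;
  code?: string;
}

// Status codes from the list above that are worth retrying.
const RETRYABLE_STATUS = new Set<number>([429, 503]);

// Assumed network error code for timeouts (Node.js style).
const RETRYABLE_CODES = new Set<string>(['ETIMEDOUT']);

function isRetryable(failure: RequestFailure): boolean {
  if (failure.status !== undefined) {
    // An HTTP response arrived: only retry statuses that signal a temporary condition.
    return RETRYABLE_STATUS.has(failure.status);
  }
  // No response at all: retry only known transient network errors.
  return failure.code !== undefined && RETRYABLE_CODES.has(failure.code);
}
```

Anything the predicate does not recognize is treated as non-retryable, which is the safer default for client errors such as 400, 401, and 404.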

Timeouts

The Problem

Without timeouts, a slow dependency can block your entire API.

The Solution

Set aggressive timeouts and fail fast.

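async function withTimeout<T>(
  promise: Promise<T>,
  timeoutMs: number
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Timed out after ${timeoutMs}ms`)),
      timeoutMs
    );
  });

  try {
    // Whichever settles first wins. The losing promise is not cancelled;
    // its result is simply ignored.
    return await Promise.race([promise, timeout]);
  } finally {
    // Clear the timer so it cannot fire later or keep the process alive.
    clearTimeout(timer);
  }
}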
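
Racing a promise against a timer stops the caller from waiting, but the underlying work keeps running. When the operation accepts an `AbortSignal`, the timeout can cancel it as well. The sketch below assumes a hypothetical helper name, `withAbortTimeout`, and assumes the wrapped function actually honors the signal it receives.

```typescript
// Hypothetical helper: runs `fn` with a signal that aborts after `timeoutMs`.
async function withAbortTimeout<T>(
  fn: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);

  try {
    // `fn` is responsible for rejecting when the signal aborts.
    return await fn(controller.signal);
  } finally {
    // Stop the timer whether `fn` succeeded, failed, or was aborted.
    clearTimeout(timer);
  }
}
```

With `fetch`, a call would look like `withAbortTimeout(signal => fetch(url, { signal }), 500)`, where `url` is whatever endpoint you are calling.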

Choosing Timeout Values

  • P95 latency + buffer: If P95 is 200ms, set timeout to 500ms
  • Consider cascading timeouts: Each layer should have a shorter timeout than the layer above it, so the inner call gives up before its caller does
  • Monitor and adjust: Use metrics to tune timeout values
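
The first two guidelines reduce to simple arithmetic. The sketch below is one way to write it down; the function names, the 2.5x headroom factor (chosen only because it maps a 200ms P95 to 500ms), and the 50ms safety margin are assumptions for illustration, not recommendations.

```typescript
// Timeout derived from observed P95 latency times a headroom factor.
function timeoutFromP95(p95Ms: number, headroom = 2.5): number {
  return Math.round(p95Ms * headroom);
}

// Cascading timeouts: a downstream call gets what is left of the caller's
// budget minus a safety margin, so it gives up before the caller does.
function downstreamTimeout(
  callerTimeoutMs: number,
  elapsedMs: number,
  marginMs = 50
): number {
  return Math.max(0, callerTimeoutMs - elapsedMs - marginMs);
}
```

A result of 0 from `downstreamTimeout` means the budget is already spent, and the sensible move is to fail immediately rather than start the call.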

Graceful Degradation

The Problem

When a non-critical dependency fails, should your entire API fail?

The Solution

Identify critical vs non-critical dependencies and degrade gracefully.

async function getUserProfile(userId: string) {
  const [user, preferences, recommendations] = await Promise.allSettled([
    fetchUser(userId),           // Critical
    fetchPreferences(userId),    // Non-critical
    fetchRecommendations(userId) // Non-critical
  ]);

  if (user.status === 'rejected') {
    throw new Error('Failed to fetch user');
  }

  return {
    user: user.value,
    preferences: preferences.status === 'fulfilled' 
      ? preferences.value 
      : null,
    recommendations: recommendations.status === 'fulfilled'
      ? recommendations.value
      : []
  };
}
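
For a single non-critical call, the same idea fits in a small wrapper. This is a sketch with a hypothetical name, `withFallback`. The optional `onError` callback is there so the degradation gets recorded instead of passing silently.

```typescript
// Hypothetical helper: returns `fallback` if `fn` rejects, reporting the error.
async function withFallback<T>(
  fn: () => Promise<T>,
  fallback: T,
  onError?: (error: unknown) => void
): Promise<T> {
  try {
    return await fn();
  } catch (error) {
    // Report the failure (for example, to logs or metrics) before degrading.
    onError?.(error);
    return fallback;
  }
}
```

Only wrap dependencies the response can live without. A critical call should still be allowed to fail the request.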

Rate Limiting

The Problem

Unlimited requests can overwhelm your API and downstream services.

The Solution

Implement rate limiting at multiple levels:

  1. Per-user limits: Prevent individual users from overwhelming the system
  2. Global limits: Protect overall system capacity
  3. Dependency limits: Respect downstream service limits

class RateLimiter {
  private requests = new Map<string, number[]>();

  async checkLimit(key: string, limit: number, windowMs: number): Promise<boolean> {
    const now = Date.now();
    const windowStart = now - windowMs;
    
    const requests = this.requests.get(key) || [];
    const recentRequests = requests.filter(time => time > windowStart);
    
    if (recentRequests.length >= limit) {
      return false;
    }
    
    recentRequests.push(now);
    this.requests.set(key, recentRequests);
    return true;
  }
}
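
Checking several levels at once has one subtlety: if the per-user check records the request and the global check then rejects it, the user loses quota for a request that was never served. The sketch below separates checking from recording to avoid that. `WindowCounter`, `allowRequest`, the injected `now` argument, and the limits are all assumptions made for illustration.

```typescript
// Sliding-window counter with separate "check" and "record" steps.
class WindowCounter {
  private hits = new Map<string, number[]>();
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Drops timestamps outside the window and returns the remaining ones.
  private recent(key: string, now: number): number[] {
    const recent = (this.hits.get(key) ?? []).filter(t => t > now - this.windowMs);
    this.hits.set(key, recent);
    return recent;
  }

  hasCapacity(key: string, now: number): boolean {
    return this.recent(key, now).length < this.limit;
  }

  record(key: string, now: number): void {
    this.recent(key, now).push(now);
  }
}

// Illustrative limits: 2 requests per user and 3 overall, per second.
const perUserLimit = new WindowCounter(2, 1000);
const globalLimit = new WindowCounter(3, 1000);

function allowRequest(userId: string, now: number): boolean {
  // Check every level first; record only if all of them have capacity.
  if (!perUserLimit.hasCapacity(userId, now) || !globalLimit.hasCapacity('all', now)) {
    return false;
  }
  perUserLimit.record(userId, now);
  globalLimit.record('all', now);
  return true;
}
```

Like the class above, this keeps state in process memory, so it only limits traffic seen by a single instance.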

Monitoring and Observability

Essential Metrics

Track these metrics for every API endpoint:

  • Request rate
  • Error rate
  • Latency (P50, P95, P99)
  • Dependency health
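
To make these numbers concrete, here is a minimal in-memory sketch of how error rate and latency percentiles fall out of recorded requests. The `EndpointMetrics` class is hypothetical and keeps every sample, which is fine for illustration. A real service would more likely use a metrics library with bounded histograms.

```typescript
class EndpointMetrics {
  private durations: number[] = [];
  private errors = 0;

  // Record one finished request.
  record(durationMs: number, ok: boolean): void {
    this.durations.push(durationMs);
    if (!ok) this.errors++;
  }

  requestCount(): number {
    return this.durations.length;
  }

  // Fraction of recorded requests that failed (0 when nothing is recorded).
  errorRate(): number {
    return this.durations.length === 0 ? 0 : this.errors / this.durations.length;
  }

  // Nearest-rank percentile, e.g. latency(95) for P95. Returns 0 with no data.
  latency(p: number): number {
    if (this.durations.length === 0) return 0;
    const sorted = [...this.durations].sort((a, b) => a - b);
    const rank = Math.ceil((p * sorted.length) / 100);
    return sorted[Math.max(0, rank - 1)];
  }
}
```

Dividing `requestCount()` by the length of the observation window gives the request rate.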

Structured Logging

Log enough context to debug issues:

logger.info('Payment processed', {
  userId,
  paymentId,
  amount,
  provider,
  duration: Date.now() - startTime,
  success: true
});
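
What makes a log line structured is that it is emitted as data instead of free text. As a rough sketch of what a logger might do underneath, the hypothetical `formatLogLine` function below renders one JSON object per line. The field names `timestamp`, `level`, and `message` are my assumptions, not a standard.

```typescript
type LogContext = Record<string, unknown>;

// Renders a log entry as a single JSON line.
function formatLogLine(
  level: 'info' | 'error',
  message: string,
  context: LogContext,
  now: Date = new Date()
): string {
  // Context goes first so it cannot overwrite the core fields.
  return JSON.stringify({
    ...context,
    timestamp: now.toISOString(),
    level,
    message,
  });
}
```

The failure path deserves the same treatment: log the same identifiers with `success: false` plus the error message, so both outcomes can be queried by the same fields.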

Conclusion

Building resilient APIs requires thinking about failure modes upfront. The patterns covered here—circuit breakers, retries, timeouts, graceful degradation, and rate limiting—form a solid foundation.

Remember: failures will happen. The goal is to handle them gracefully and maintain a good user experience even when things go wrong.
