Ongoing

Rebuilding the Payment System

Lead Engineer · 2026 · 3 months · 2 people · 2 min read

Reduced payment processing time by 60% and improved reliability to 99.9% uptime

Overview

Led the complete rebuild of a legacy payment processing system handling $2M+ in daily transactions.

Problem

The existing payment system was built 5 years ago and suffered from frequent timeouts, poor error handling, and a monolithic architecture that made it difficult to add new payment providers.

Constraints

Zero downtime migration required
Must maintain backward compatibility with existing API contracts
Limited to 3-month timeline
Small team of 2 engineers

Approach

We adopted a strangler fig pattern to incrementally migrate payment flows to a new microservices architecture. We started with the lowest-risk payment methods and migrated higher-volume flows only after the new services had proven stable.
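As a rough illustration of how such an incremental cutover can be routed, the sketch below sends a configurable percentage of each payment method's traffic to the new service and leaves everything else on the legacy path. All names here (PaymentRequest, RolloutConfig, bucketFor, chooseBackend) and the hashing scheme are assumptions made for the example, not the project's actual code.

```typescript
// Hypothetical names throughout; this is a sketch, not the real router.
type Backend = "legacy" | "new";

interface PaymentRequest {
  customerId: string;
  method: string; // e.g. "gift_card" or "card" (illustrative values)
}

// Percentage of traffic (0-100) sent to the new service, per payment method.
type RolloutConfig = Record<string, number>;

// Deterministic 0-99 bucket, so a given customer always takes the same path.
function bucketFor(customerId: string): number {
  let hash = 0;
  for (let i = 0; i < customerId.length; i++) {
    hash = (hash * 31 + customerId.charCodeAt(i)) >>> 0; // keep as uint32
  }
  return hash % 100;
}

// Methods missing from the config default to 0%, i.e. they stay on legacy.
function chooseBackend(req: PaymentRequest, rollout: RolloutConfig): Backend {
  const percent = rollout[req.method] ?? 0;
  return bucketFor(req.customerId) < percent ? "new" : "legacy";
}
```

Raising a method's percentage step by step moves more customers to the new path, and setting it back to 0 returns that method to the legacy system without a deploy.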

Key Decisions

Use event sourcing for payment state management

Reasoning:

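Event sourcing provides a complete audit trail and makes it easier to debug payment issues, because every state change is recorded as an immutable event. The append-only write pattern can also improve write performance, since events are inserted rather than updated in place.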

Alternatives considered:

Traditional CRUD with audit logs
State machine with database transactions
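To make the event sourcing decision above more concrete, here is a minimal in-memory sketch: the payment's current state is never stored directly and is instead derived by replaying its events in order. The event names, the PaymentState shape, and the EventStore class are hypothetical, and a real store would persist events durably (for example in PostgreSQL or Kafka from the stack below) rather than in an array.

```typescript
// Hypothetical event and state shapes, for illustration only.
type PaymentEvent =
  | { type: "PaymentInitiated"; paymentId: string; amountCents: number }
  | { type: "PaymentAuthorized"; paymentId: string; providerRef: string }
  | { type: "PaymentCaptured"; paymentId: string }
  | { type: "PaymentFailed"; paymentId: string; reason: string };

type PaymentStatus = "none" | "initiated" | "authorized" | "captured" | "failed";

interface PaymentState {
  status: PaymentStatus;
  amountCents: number;
  failureReason?: string;
}

const initialState: PaymentState = { status: "none", amountCents: 0 };

// Pure transition function: given a state and one event, return the next state.
function applyEvent(state: PaymentState, event: PaymentEvent): PaymentState {
  switch (event.type) {
    case "PaymentInitiated":
      return { status: "initiated", amountCents: event.amountCents };
    case "PaymentAuthorized":
      return { ...state, status: "authorized" };
    case "PaymentCaptured":
      return { ...state, status: "captured" };
    case "PaymentFailed":
      return { ...state, status: "failed", failureReason: event.reason };
  }
}

// Append-only log: events are only ever added, never updated or deleted.
class EventStore {
  private readonly log: PaymentEvent[] = [];

  append(event: PaymentEvent): void {
    this.log.push(event);
  }

  eventsFor(paymentId: string): PaymentEvent[] {
    return this.log.filter((e) => e.paymentId === paymentId);
  }
}

// Current state is a left fold of the payment's events over the initial state.
function currentState(store: EventStore, paymentId: string): PaymentState {
  return store.eventsFor(paymentId).reduce(applyEvent, initialState);
}
```

Because the full event list is retained, the same replay can answer "what happened to this payment, and in what order" when debugging, which is the audit benefit described in the reasoning.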

Implement circuit breaker pattern for external payment providers

Reasoning:

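A circuit breaker prevents cascading failures when a payment provider experiences issues: after repeated failures it stops sending requests to that provider for a cooldown period, which allows graceful degradation and automatic recovery.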

Alternatives considered:

Simple retry logic with exponential backoff
Manual failover to backup providers
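The circuit breaker decision above can be sketched as a small state machine with closed, open, and half-open states. The class below is a simplified illustration and not the project's implementation: the threshold, the cooldown, and the injectable clock are assumed parameters, and this version does not limit how many trial calls run concurrently while half-open.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Hypothetical, simplified breaker; parameter names are assumptions.
class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;
  private readonly now: () => number;

  constructor(
    failureThreshold: number,
    resetTimeoutMs: number,
    now: () => number = () => Date.now(), // injectable clock, useful in tests
  ) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  getState(): BreakerState {
    return this.state;
  }

  // Once the cooldown has elapsed, an open breaker lets a trial call through.
  canRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = "half-open";
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.state = "closed";
    this.failures = 0;
  }

  // A failed trial call, or too many consecutive failures, opens the breaker.
  recordFailure(): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  // Wraps a provider call; falls back instead of calling while open.
  async exec<T>(call: () => Promise<T>, fallback: () => T): Promise<T> {
    if (!this.canRequest()) return fallback();
    try {
      const result = await call();
      this.recordSuccess();
      return result;
    } catch {
      this.recordFailure();
      return fallback();
    }
  }
}
```

Compared with plain retries, the open state stops traffic to a struggling provider entirely for the cooldown period, so slow timeouts do not pile up behind it.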

Tech Stack

Node.js
TypeScript
PostgreSQL
Redis
Kafka
Docker
Kubernetes

Result & Impact

Processing Time: 60% reduction (from 2.5s to 1s average)
Uptime: 99.9% (up from 97.2%)
Error Rate: 0.1% (down from 2.3%)

The new system handles peak loads without degradation and has eliminated customer complaints about payment failures. The modular architecture has made it trivial to add 3 new payment providers in the months following launch.
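One way a modular design can keep the cost of adding a provider low is to put every provider behind a shared interface and look it up by name. The sketch below assumes such an interface; PaymentProvider, ChargeResult, and ProviderRegistry are hypothetical names and do not describe the actual service boundaries.

```typescript
// Hypothetical shapes; a real integration would carry far more detail.
interface ChargeResult {
  ok: boolean;
  providerRef?: string;
  error?: string;
}

interface PaymentProvider {
  readonly name: string;
  charge(amountCents: number, currency: string): Promise<ChargeResult>;
}

class ProviderRegistry {
  private readonly providers = new Map<string, PaymentProvider>();

  // Adding a provider is one register() call; callers do not change.
  register(provider: PaymentProvider): void {
    if (this.providers.has(provider.name)) {
      throw new Error(`Provider already registered: ${provider.name}`);
    }
    this.providers.set(provider.name, provider);
  }

  get(name: string): PaymentProvider {
    const provider = this.providers.get(name);
    if (!provider) {
      throw new Error(`Unknown provider: ${name}`);
    }
    return provider;
  }

  names(): string[] {
    return Array.from(this.providers.keys());
  }
}
```

Under this assumption, a new provider only needs an adapter that implements the interface, and the rest of the payment flow stays untouched.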

Learnings

Event sourcing adds complexity but the debugging benefits are worth it for financial systems
Incremental migration with feature flags is less risky than big-bang rewrites
Investing in observability early pays dividends when troubleshooting production issues

Additional Context

This project was particularly challenging because we had to maintain 100% backward compatibility while completely rewriting the internals. The strangler fig pattern allowed us to prove the new system worked before fully committing to the migration.

The event sourcing approach was controversial at first, but it proved invaluable when debugging complex payment flows and understanding exactly what happened during edge cases.
