
Design a Payment Processing System

A payment processing system handles the flow of money between buyers, merchants, & financial institutions. Systems like Stripe, Square, & Adyen process millions of transactions daily where correctness is paramount — a single bug can mean money lost, duplicated charges, or regulatory violations.

This document covers the full design with emphasis on the properties that make payment systems uniquely challenging: idempotency, consistency, & auditability.

[Figure: Payment processing system architecture]

Functional Requirements

  • Accept payments from customers (credit card, debit card, bank transfer)
  • Support the full payment lifecycle: authorize, capture, refund, void
  • Provide idempotent payment APIs to prevent double charges
  • Maintain a ledger of all financial transactions
  • Reconcile internal records with external payment processor statements
  • Support multiple currencies
  • Expose payment status to merchants via API & webhooks

Non-Functional Requirements

  • Correctness above all — no double charges, no lost payments
  • Exactly-once processing semantics for payment operations
  • High availability — 99.99% uptime (merchants lose revenue during downtime)
  • Strong consistency for financial state transitions
  • PCI DSS compliance for handling card data
  • Full audit trail for every state change
  • Latency: authorization within 2 seconds, settlement within 24 hours

Estimation

Traffic

  • 10 million transactions per day
  • Average: ~115 transactions/second
  • Peak: ~500 transactions/second (Black Friday, flash sales)
  • Each transaction involves 2-4 internal state changes (authorize, capture, etc.)

Storage

  • Per transaction record: ~1 KB (payment details, metadata, state history)
  • Ledger entry per state change: ~500 bytes
  • 10 million transactions/day x 1 KB = 10 GB/day for transactions
  • Ledger: ~40 million entries/day x 500 bytes = 20 GB/day
  • Over 5 years with retention requirements: ~55 TB total
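
The arithmetic behind these figures can be reproduced with a short script. This is only a back-of-envelope sketch: the constant names are made up for illustration, and the ledger count assumes the upper bound of 4 entries per transaction.

```python
# Back-of-envelope estimate using the figures from this section.
TX_PER_DAY = 10_000_000
TX_RECORD_BYTES = 1_000        # ~1 KB per transaction record
LEDGER_ENTRIES_PER_TX = 4      # assumption: upper bound of the 2-4 state changes
LEDGER_ENTRY_BYTES = 500
SECONDS_PER_DAY = 24 * 60 * 60
RETENTION_YEARS = 5

avg_tps = TX_PER_DAY / SECONDS_PER_DAY
tx_gb_per_day = TX_PER_DAY * TX_RECORD_BYTES / 1e9
ledger_gb_per_day = TX_PER_DAY * LEDGER_ENTRIES_PER_TX * LEDGER_ENTRY_BYTES / 1e9
total_tb = (tx_gb_per_day + ledger_gb_per_day) * 365 * RETENTION_YEARS / 1000

print(f"average TPS:      {avg_tps:.1f}")
print(f"transactions/day: {tx_gb_per_day:.0f} GB")
print(f"ledger/day:       {ledger_gb_per_day:.0f} GB")
print(f"5-year total:     {total_tb:.2f} TB")
```

The total ignores indexes, replicas, and backups, which in practice multiply the raw figure several times over.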

External API Calls

  • Each payment involves at least one call to an external payment processor (Visa, Mastercard networks, bank APIs)
  • External call latency: 200ms-2s
  • Must handle timeouts, retries, & network failures gracefully

High-Level Design

Components

  • Payment API — merchant-facing REST API for payment operations
  • Payment Service — orchestrates the payment lifecycle
  • Idempotency Store — tracks idempotency keys to prevent duplicate processing
  • Payment State Machine — enforces valid state transitions
  • Ledger Service — double-entry bookkeeping for all financial movements
  • Payment Processor Adapter — integrates with external processors (Stripe, Adyen, bank networks)
  • Reconciliation Service — compares internal records against external statements
  • Webhook Service — notifies merchants of payment status changes
  • Database — relational DB with strong consistency (PostgreSQL)

Payment Flow: Authorization & Capture

Merchant -> Payment API -> Payment Service
                        -> Idempotency check (duplicate? return cached result)
                        -> Create payment record (state: PENDING)
                        -> Ledger entry (debit customer, credit pending)
                        -> Payment Processor Adapter -> External Processor (Visa/Mastercard)
                        -> Receive auth response
                        -> Update payment (state: AUTHORIZED)
                        -> Ledger entry (move amount from pending to authorized)
                        -> Return result to merchant
                        -> Webhook: payment.authorized

Later (capture):
Merchant -> Payment API -> Payment Service
                        -> Verify payment is in AUTHORIZED state
                        -> Payment Processor Adapter -> External Processor (capture)
                        -> Update payment (state: CAPTURED)
                        -> Ledger entry (move from authorized to settled)
                        -> Webhook: payment.captured

Detailed Design

Payment State Machine

Every payment follows a strict state machine. Only valid transitions are allowed — this prevents impossible states like refunding an unauthorized payment.

State Transitions:

  PENDING -----> AUTHORIZED -----> CAPTURED -----> SETTLED
     |               |                |
     |               v                v
     |            VOIDED          REFUND_PENDING --> REFUNDED
     |                                |
     v                                v
   FAILED                       REFUND_FAILED

Valid transitions:
  PENDING      -> AUTHORIZED, FAILED
  AUTHORIZED   -> CAPTURED, VOIDED
  CAPTURED     -> SETTLED, REFUND_PENDING
  REFUND_PENDING -> REFUNDED, REFUND_FAILED
  SETTLED      -> REFUND_PENDING

Implementation: store the current state in the payment record. Every state transition is a conditional update.

UPDATE payments
SET state = 'AUTHORIZED', updated_at = NOW()
WHERE payment_id = ? AND state = 'PENDING';
-- 0 rows affected means the state was not PENDING: reject the transition

The conditional update acts as an optimistic lock. Two concurrent requests to authorize the same payment will not both succeed.
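
A minimal sketch of the transition check, assuming an in-memory SQLite database as a stand-in for PostgreSQL (so CURRENT_TIMESTAMP replaces NOW()). The `VALID_TRANSITIONS` table and the `transition` helper are hypothetical names introduced for illustration.

```python
import sqlite3

# Allowed transitions, copied from the list above.
VALID_TRANSITIONS = {
    "PENDING": {"AUTHORIZED", "FAILED"},
    "AUTHORIZED": {"CAPTURED", "VOIDED"},
    "CAPTURED": {"SETTLED", "REFUND_PENDING"},
    "REFUND_PENDING": {"REFUNDED", "REFUND_FAILED"},
    "SETTLED": {"REFUND_PENDING"},
}

def transition(conn, payment_id, from_state, to_state):
    """Move a payment from from_state to to_state; return True on success."""
    if to_state not in VALID_TRANSITIONS.get(from_state, set()):
        return False
    # The WHERE clause only matches if the row is still in from_state.
    cur = conn.execute(
        "UPDATE payments SET state = ?, updated_at = CURRENT_TIMESTAMP "
        "WHERE payment_id = ? AND state = ?",
        (to_state, payment_id, from_state),
    )
    conn.commit()
    return cur.rowcount == 1

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments ("
    "payment_id TEXT PRIMARY KEY, state TEXT NOT NULL, updated_at TEXT)"
)
conn.execute("INSERT INTO payments (payment_id, state) VALUES ('pay_abc123', 'PENDING')")
conn.commit()

first = transition(conn, "pay_abc123", "PENDING", "AUTHORIZED")     # succeeds
second = transition(conn, "pay_abc123", "PENDING", "AUTHORIZED")    # row is no longer PENDING
illegal = transition(conn, "pay_abc123", "AUTHORIZED", "REFUNDED")  # not in the transition table
print(first, second, illegal)
```

Checking the transition table in application code gives a clear error early, while the conditional UPDATE remains the actual guard against concurrent writers.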

Idempotency

Idempotency is the single most critical property. Network failures, client retries, & load balancer replays can all cause duplicate requests.

How It Works

The merchant includes an Idempotency-Key header (typically a UUID) with every request. The system stores the key alongside the result.

idempotency_store
+-----------------+------------------------------------------------+
| idempotency_key | "550e8400-e29b-41d4-a716-446655440000"  (PK)   |
| payment_id      | "pay_abc123"                                   |
| http_status     | 200                                            |
| response_body   | { "status": "authorized", ... }                |
| created_at      | 2026-03-15T10:30:00Z                           |
| expires_at      | 2026-03-16T10:30:00Z                           |
+-----------------+------------------------------------------------+

Request Processing

1. Receive request with Idempotency-Key
2. Check idempotency store:
   a. Key exists with completed result -> return cached result (no processing)
   b. Key exists with in-progress status -> return 409 Conflict (concurrent duplicate)
   c. Key does not exist -> insert key with status "in_progress"
3. Process the payment
4. Update idempotency store with the final result
5. Return result to merchant

The insert in step 2c uses a unique constraint on idempotency_key. If two identical requests race, only one insert succeeds. The loser receives a 409 & should retry after a short delay.
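
The steps above can be sketched as follows, again assuming SQLite in place of the production store. The `status` column and the `handle_request` function are assumptions made for this example; the table shown earlier does not list a status field.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE idempotency_store ("
    "idempotency_key TEXT PRIMARY KEY, status TEXT NOT NULL, "
    "http_status INTEGER, response_body TEXT)"
)

def handle_request(key, process):
    """Run process() at most once per key and replay the stored result afterwards."""
    try:
        # Step 2c: the primary key makes this insert fail for a duplicate key.
        conn.execute(
            "INSERT INTO idempotency_store (idempotency_key, status) "
            "VALUES (?, 'in_progress')",
            (key,),
        )
        conn.commit()
    except sqlite3.IntegrityError:
        conn.rollback()
        row = conn.execute(
            "SELECT status, http_status, response_body "
            "FROM idempotency_store WHERE idempotency_key = ?",
            (key,),
        ).fetchone()
        if row[0] == "completed":
            return row[1], json.loads(row[2])          # step 2a: cached result
        return 409, {"error": "request in progress"}   # step 2b: concurrent duplicate

    http_status, body = process()                      # step 3
    conn.execute(
        "UPDATE idempotency_store SET status = 'completed', "
        "http_status = ?, response_body = ? WHERE idempotency_key = ?",
        (http_status, json.dumps(body), key),
    )                                                  # step 4
    conn.commit()
    return http_status, body                           # step 5

calls = []

def charge():
    calls.append(1)
    return 200, {"status": "authorized"}

key = "550e8400-e29b-41d4-a716-446655440000"
first = handle_request(key, charge)
replay = handle_request(key, charge)
print(first, replay, len(calls))
```

One gap this sketch leaves open: if `process()` raises, the key stays in_progress. A real system needs a policy for that case, for example expiring stale in-progress keys after resolving the payment's state.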

Stripe popularized this pattern by making the Idempotency-Key header a standard part of its API. According to Stripe's documentation, keys may be pruned once they are at least 24 hours old, and the API replays the saved response (HTTP status code and body) for any repeated key. Public documentation of this approach helped make idempotency keys a widely adopted convention among payment providers.

Double-Spend Prevention

A customer should not be charged twice for the same order. Multiple layers of defense:

  1. Idempotency keys (described above) — prevents duplicate API calls
  2. Payment state machine — a payment can only be authorized once (PENDING -> AUTHORIZED)
  3. Unique constraint on order reference — one payment per order ID
  4. External processor deduplication — most processors accept a merchant reference ID & reject duplicates

payments table:
UNIQUE INDEX on (merchant_id, merchant_reference)
-- Prevents two payments for the same order from the same merchant
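
A small sketch of the unique index in action, assuming SQLite and a simplified `payments` schema. The `create_payment` helper and the index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments ("
    "payment_id TEXT PRIMARY KEY, merchant_id TEXT NOT NULL, "
    "merchant_reference TEXT NOT NULL, amount INTEGER NOT NULL)"
)
conn.execute(
    "CREATE UNIQUE INDEX uq_merchant_order "
    "ON payments (merchant_id, merchant_reference)"
)

def create_payment(payment_id, merchant_id, merchant_reference, amount):
    """Insert a payment; return False if this merchant already paid this order."""
    try:
        conn.execute(
            "INSERT INTO payments VALUES (?, ?, ?, ?)",
            (payment_id, merchant_id, merchant_reference, amount),
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        conn.rollback()
        return False

created = create_payment("pay_1", "merchant_456", "order_789", 10000)
duplicate = create_payment("pay_2", "merchant_456", "order_789", 10000)
other_merchant = create_payment("pay_3", "merchant_999", "order_789", 10000)
print(created, duplicate, other_merchant)
```

The same order reference is allowed for a different merchant because the index covers the pair, not the reference alone.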

Ledger Design

The ledger uses double-entry bookkeeping. Every money movement creates two entries: a debit & a credit. The sum of all debits must always equal the sum of all credits.

ledger_entries
+-----------------+------------------------------------------------+
| entry_id        | bigint (PK)                                    |
| payment_id      | varchar (FK to payments)                       |
| account_id      | varchar (e.g., "customer_123", "merchant_456") |
| entry_type      | enum: DEBIT, CREDIT                            |
| amount          | bigint (cents, never floating point)            |
| currency        | varchar(3) (ISO 4217: "USD", "EUR")            |
| created_at      | timestamp                                      |
+-----------------+------------------------------------------------+

Example: Customer pays $100 to Merchant

Entry 1: DEBIT  customer_123   10000 USD  (money leaves customer)
Entry 2: CREDIT merchant_456   10000 USD  (money enters merchant)

Critical rules:

  • Always use integers for money — store amounts in the smallest currency unit (cents for USD). Floating point arithmetic causes rounding errors that accumulate into real financial discrepancies.
  • Entries are append-only — never update or delete a ledger entry. Corrections are made by adding a reversal entry.
  • Every entry has a balancing counterpart — the system rejects any operation that does not balance.
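
These rules can be illustrated with a small in-memory ledger. This is a sketch under simplifying assumptions (no persistence, no concurrency); the `Entry` and `Ledger` classes are invented for the example and are not taken from any real system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)          # frozen: an entry cannot be modified once created
class Entry:
    payment_id: str
    account_id: str
    entry_type: str              # "DEBIT" or "CREDIT"
    amount: int                  # smallest currency unit, e.g. cents
    currency: str

class Ledger:
    def __init__(self):
        self._entries = []

    def post(self, entries):
        """Append a group of entries only if debits equal credits per currency."""
        totals = {}
        for e in entries:
            if e.entry_type not in ("DEBIT", "CREDIT"):
                raise ValueError("unknown entry type")
            if not isinstance(e.amount, int) or e.amount <= 0:
                raise ValueError("amount must be a positive integer")
            sign = 1 if e.entry_type == "DEBIT" else -1
            totals[e.currency] = totals.get(e.currency, 0) + sign * e.amount
        if any(total != 0 for total in totals.values()):
            raise ValueError("unbalanced entries rejected")
        self._entries.extend(entries)

    def reverse(self, payment_id):
        """Correct a payment by appending mirrored entries, never by editing."""
        flipped = [
            Entry(e.payment_id, e.account_id,
                  "CREDIT" if e.entry_type == "DEBIT" else "DEBIT",
                  e.amount, e.currency)
            for e in self._entries if e.payment_id == payment_id
        ]
        self.post(flipped)

    def balance(self, account_id, currency):
        """Credits minus debits for one account in one currency."""
        total = 0
        for e in self._entries:
            if e.account_id == account_id and e.currency == currency:
                total += e.amount if e.entry_type == "CREDIT" else -e.amount
        return total

ledger = Ledger()
ledger.post([
    Entry("pay_abc123", "customer_123", "DEBIT", 10000, "USD"),
    Entry("pay_abc123", "merchant_456", "CREDIT", 10000, "USD"),
])
merchant_after_payment = ledger.balance("merchant_456", "USD")
ledger.reverse("pay_abc123")
merchant_after_reversal = ledger.balance("merchant_456", "USD")
print(merchant_after_payment, merchant_after_reversal)
```

After the reversal the ledger holds four entries rather than zero, which is what preserves the audit trail.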

Square's ledger architecture follows these principles rigorously. Square built a custom double-entry ledger system that tracks every cent flowing through their ecosystem across sellers, buyers, and Square's own accounts. Every transaction produces balanced debit and credit entries, and their system runs continuous balance checks to ensure the books never drift out of balance -- a property they call "provable correctness."

Account Balances

Balances are derived by summing ledger entries per account. For performance, maintain a materialized balance that is updated atomically with each new entry.

account_balances
+-----------------+------------------+
| account_id      | "merchant_456"   |
| currency        | "USD"            |
| balance         | 1500000 (cents)  |
| last_entry_id   | 98765            |
+-----------------+------------------+

Update balance atomically with new entry:
BEGIN;
  INSERT INTO ledger_entries (...) VALUES (...);
  UPDATE account_balances
    SET balance = balance + 10000, last_entry_id = :new_entry_id
    WHERE account_id = 'merchant_456' AND currency = 'USD'
      AND last_entry_id = :expected_last_entry_id;
  -- Optimistic lock on last_entry_id: if 0 rows are affected, another writer
  -- got there first, so ROLLBACK and retry instead of committing
COMMIT;
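
The same transaction can be exercised end to end with SQLite standing in for PostgreSQL. The `credit` function and the simplified schemas are assumptions for this sketch; it shows that a stale `last_entry_id` causes a rollback, so no orphan ledger entry is left behind.

```python
import sqlite3

# isolation_level=None lets the code issue BEGIN/COMMIT/ROLLBACK explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE ledger_entries ("
    "entry_id INTEGER PRIMARY KEY, account_id TEXT, entry_type TEXT, "
    "amount INTEGER, currency TEXT)"
)
conn.execute(
    "CREATE TABLE account_balances ("
    "account_id TEXT, currency TEXT, balance INTEGER, last_entry_id INTEGER, "
    "PRIMARY KEY (account_id, currency))"
)
conn.execute("INSERT INTO ledger_entries VALUES (98765, 'merchant_456', 'CREDIT', 1500000, 'USD')")
conn.execute("INSERT INTO account_balances VALUES ('merchant_456', 'USD', 1500000, 98765)")

def credit(account_id, currency, amount, expected_last_entry_id):
    """Append a CREDIT entry and bump the balance in one transaction."""
    conn.execute("BEGIN")
    cur = conn.execute(
        "INSERT INTO ledger_entries (account_id, entry_type, amount, currency) "
        "VALUES (?, 'CREDIT', ?, ?)",
        (account_id, amount, currency),
    )
    updated = conn.execute(
        "UPDATE account_balances SET balance = balance + ?, last_entry_id = ? "
        "WHERE account_id = ? AND currency = ? AND last_entry_id = ?",
        (amount, cur.lastrowid, account_id, currency, expected_last_entry_id),
    )
    if updated.rowcount != 1:
        conn.execute("ROLLBACK")   # stale view of the balance: undo the insert too
        return False
    conn.execute("COMMIT")
    return True

ok = credit("merchant_456", "USD", 10000, expected_last_entry_id=98765)
stale = credit("merchant_456", "USD", 10000, expected_last_entry_id=98765)
balance = conn.execute(
    "SELECT balance FROM account_balances WHERE account_id = 'merchant_456'"
).fetchone()[0]
print(ok, stale, balance)
```

A caller that receives False would re-read `last_entry_id` and try again.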

Retry Handling

External payment processor calls can fail due to timeouts, network errors, or temporary outages. The retry strategy must be careful — retrying a charge that actually succeeded means double-charging the customer.

Safe vs Unsafe Retries

Safe to retry (idempotent operations):
- Authorization, provided the processor deduplicates by merchant reference (otherwise resolve via inquiry first)
- Status check / inquiry
- Void (voiding an already-voided payment is a no-op)

Unsafe to retry without safeguards:
- Capture (some processors treat each capture as a new charge)
- Refund (could issue multiple refunds)

Retry Strategy

1. Call external processor
2. Timeout or network error?
   a. Mark payment as UNCERTAIN in internal state
   b. Wait & send an inquiry/status check to the processor
   c. Processor confirms the original succeeded -> update state accordingly
   d. Processor confirms the original failed -> safe to retry
3. Never retry blindly without checking the current state at the processor
4. Use exponential backoff: 1s, 2s, 4s, 8s, 16s (delays capped at 30s), at most 5 attempts
5. After max attempts, flag for manual review

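The UNCERTAIN state is key. It means "we sent a request but do not know the outcome." It sits outside the normal lifecycle transitions listed earlier: a payment enters it only when a processor call times out, and leaves it only once an inquiry confirms the outcome. The system must resolve UNCERTAIN states before allowing any further operations on that payment.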
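
A rough sketch of the inquiry loop described in the retry strategy. The `inquire` callable, the outcome strings, and the function names are all hypothetical; a real processor exposes its own status API and response codes.

```python
import time

def backoff_delays(base=1.0, cap=30.0, max_attempts=5):
    """Exponentially growing delays in seconds, each capped at `cap`."""
    return [min(cap, base * 2 ** i) for i in range(max_attempts)]

def resolve_uncertain(inquire, sleep=time.sleep):
    """Poll the processor until it reports a definite outcome.

    inquire() is a hypothetical callable that returns "SUCCEEDED", "FAILED",
    or None when the processor is unreachable or has no answer yet.
    """
    for delay in backoff_delays():
        sleep(delay)
        outcome = inquire()
        if outcome == "SUCCEEDED":
            return "CONFIRMED"        # the original call went through: do not resend
        if outcome == "FAILED":
            return "SAFE_TO_RETRY"    # processor confirms nothing was charged
    return "MANUAL_REVIEW"            # attempts exhausted without a definite answer

# Demo with a fake processor that answers on the third inquiry.
answers = iter([None, None, "SUCCEEDED"])
waits = []
result = resolve_uncertain(lambda: next(answers), sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the loop testable without real delays.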

PayPal handles this challenge across hundreds of microservices by using a saga-based approach for distributed transactions. Rather than relying on traditional two-phase commit (which is too slow and fragile at their scale), each step in a payment flow publishes an event, and compensating transactions are triggered automatically if any step fails. PayPal's system resolves millions of uncertain states daily through a combination of automated inquiry calls to downstream processors and a reconciliation engine that detects and resolves discrepancies within minutes.

Reconciliation

Reconciliation compares internal records against external statements to catch discrepancies.

Types of Discrepancies

  • Missing from processor: internal record says captured, processor has no record
  • Missing from internal: processor charged the customer, but no internal record exists
  • Amount mismatch: internal says $100, processor says $99.50 (likely FX or fee issue)
  • State mismatch: internal says authorized, processor says captured

Reconciliation Process

1. Daily: download settlement files from each processor
2. Parse into a standardized format
3. Join internal payment records with processor records on merchant reference
4. Flag mismatches:
   a. Automatic resolution for known patterns (e.g., FX rounding)
   b. Queue for manual review if unresolvable
5. Generate a reconciliation report
6. Track the reconciliation rate (target: 99.99% auto-matched)

Reconciliation runs as a batch job. It is not on the critical payment path but is essential for financial integrity. Any unreconciled transaction older than 48 hours triggers an alert.
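
The matching step (join on merchant reference, then classify) can be sketched in a few lines. The record shape used here, a dict mapping merchant reference to an (amount in cents, state) tuple, is an assumption for illustration; real settlement files carry many more fields.

```python
def reconcile(internal, external):
    """Compare two dicts keyed by merchant reference.

    Each value is an (amount_in_cents, state) tuple. Returns a list of
    (merchant_reference, discrepancy_type) pairs for records that differ.
    """
    issues = []
    for ref in sorted(internal.keys() | external.keys()):
        ours, theirs = internal.get(ref), external.get(ref)
        if theirs is None:
            issues.append((ref, "missing_from_processor"))
        elif ours is None:
            issues.append((ref, "missing_from_internal"))
        elif ours[0] != theirs[0]:
            issues.append((ref, "amount_mismatch"))
        elif ours[1] != theirs[1]:
            issues.append((ref, "state_mismatch"))
    return issues

internal = {
    "order_1": (10000, "CAPTURED"),
    "order_2": (10000, "CAPTURED"),
    "order_3": (5000, "AUTHORIZED"),
    "order_4": (2500, "CAPTURED"),
}
external = {
    "order_1": (10000, "CAPTURED"),
    "order_2": (9950, "CAPTURED"),
    "order_3": (5000, "CAPTURED"),
    "order_5": (700, "CAPTURED"),
}
issues = reconcile(internal, external)
for ref, kind in issues:
    print(ref, kind)
```

Records that match on both amount and state produce no entry, so the auto-match rate is the share of references absent from `issues`.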

PCI Compliance Considerations

PCI DSS (Payment Card Industry Data Security Standard) governs how card data is handled.

Architecture Implications

  • Tokenization: never store raw card numbers. Use a PCI-compliant vault (or the processor's tokenization) to convert card numbers into tokens.
  • Network segmentation: the cardholder data environment (CDE) must be isolated from the rest of the infrastructure.
  • Encryption: card data encrypted in transit (TLS 1.2+) & at rest (AES-256).
  • Access control: strict access logs for any system touching card data.
  • Audit logging: every access to card data is logged with who, when, & why.

Payment flow with tokenization:

1. Customer enters card number in merchant's checkout (hosted by processor)
2. Processor returns a token: "tok_abc123"
3. Merchant sends token to our Payment API
4. Payment Service passes token to processor for authorization
5. Raw card number never touches our servers -> reduced PCI scope

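Using a processor-hosted checkout (like Stripe Elements or Adyen Drop-in) keeps raw card data off the merchant's systems, which shrinks PCI scope to its minimum. It does not remove the obligation entirely: the merchant typically still completes a short self-assessment. This is the recommended approach for most systems.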

Webhook Delivery

Merchants need to know when payment states change. Webhooks provide asynchronous notification.

Webhook delivery:
1. Payment state changes
2. Webhook Service creates an event payload (signed with HMAC)
3. POST to merchant's configured endpoint
4. If 2xx response, mark as delivered
5. If failure, retry with exponential backoff (1min, 5min, 30min, 2hr, 24hr)
6. After exhausting retries, mark as failed & alert the merchant via email

Sign webhooks with HMAC-SHA256 so merchants can verify authenticity. Include a timestamp in the signed payload so that merchants can reject events older than a tolerance window; this limits replay attacks.
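
A minimal signing and verification sketch using Python's standard library. The message layout ("timestamp.payload"), the 300-second tolerance, and the example secret are assumptions for illustration, not the format of any particular provider.

```python
import hashlib
import hmac
import json
import time

def sign(secret: bytes, payload: bytes, timestamp: int) -> str:
    """HMAC-SHA256 over "<timestamp>.<payload>", hex encoded."""
    message = str(timestamp).encode() + b"." + payload
    return hmac.new(secret, message, hashlib.sha256).hexdigest()

def verify(secret, payload, timestamp, signature, now=None, tolerance=300):
    """Check the signature and reject events outside the tolerance window."""
    now = int(time.time()) if now is None else now
    if abs(now - timestamp) > tolerance:
        return False
    expected = sign(secret, payload, timestamp)
    # compare_digest runs in constant time to avoid leaking match position.
    return hmac.compare_digest(expected, signature)

secret = b"whsec_example"                     # hypothetical shared secret
payload = json.dumps({"type": "payment.captured", "payment_id": "pay_abc123"}).encode()
sent_at = 1_700_000_000
signature = sign(secret, payload, sent_at)

valid = verify(secret, payload, sent_at, signature, now=sent_at + 10)
tampered = verify(secret, payload + b" ", sent_at, signature, now=sent_at + 10)
replayed = verify(secret, payload, sent_at, signature, now=sent_at + 3600)
print(valid, tampered, replayed)
```

Because the timestamp is part of the signed message, an attacker cannot refresh it without invalidating the signature.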

Trade-Offs & Alternatives

Synchronous vs Asynchronous Processing

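+------------------+--------------------------+------------+------------------------------------+
| Approach         | Latency                  | Complexity | Failure Handling                   |
+------------------+--------------------------+------------+------------------------------------+
| Synchronous      | Lower (direct response)  | Lower      | Harder (timeouts block the client) |
| Async with queue | Higher (polling/webhook) | Higher     | Easier (retry from queue)          |
+------------------+--------------------------+------------+------------------------------------+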

Most payment systems use a hybrid: the initial authorization is synchronous (merchant needs an immediate response), while settlement, reconciliation, & notifications are asynchronous.

Single DB vs Event Sourcing

  • Single relational DB: simpler. Payment state is the latest row. Works well up to moderate scale. PostgreSQL with strong consistency is the standard choice.
  • Event sourcing: store every state change as an immutable event. Reconstruct current state by replaying events. More complex but provides a perfect audit trail & supports temporal queries.

For payment systems, the append-only ledger already provides event-sourcing-like properties. A hybrid approach (relational DB for current state + append-only ledger for history) gives the best of both worlds.

SQL vs NoSQL

SQL (PostgreSQL) is strongly preferred for payment systems:

  • ACID transactions ensure financial consistency
  • Foreign keys enforce referential integrity
  • Mature tooling for reporting & reconciliation queries
  • Easier auditing

Many NoSQL databases default to eventual consistency or restrict multi-record transactions, which makes the guarantees that payment systems require harder to obtain. The throughput demands (500 TPS peak) are well within PostgreSQL's capabilities.

Bottlenecks & Scaling

Database Scaling

At 500 TPS, a single PostgreSQL instance with connection pooling (PgBouncer) handles the load comfortably. For growth:

  • Vertical scaling first (PostgreSQL scales well vertically)
  • Read replicas for reporting & reconciliation queries
  • Partition the payments table by date (monthly partitions)
  • Archive old transactions to cold storage after the retention period

Sharding is typically not needed until 10,000+ TPS. If required, shard by merchant_id to keep all of a merchant's data together.
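
If sharding does become necessary, routing by merchant_id can be as simple as a stable hash. The sketch below is one possible approach, not a prescribed one; `shard_for` is a hypothetical helper. It uses SHA-256 because Python's built-in `hash()` for strings is randomized per process and would route the same merchant differently after a restart.

```python
import hashlib

def shard_for(merchant_id: str, num_shards: int) -> int:
    """Map a merchant to a shard index that is stable across processes."""
    digest = hashlib.sha256(merchant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

shard = shard_for("merchant_456", 16)
print(shard)
```

Plain modulo routing reshuffles most merchants when `num_shards` changes, so a production design would likely add a lookup table or consistent hashing on top.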

External Processor Latency

The external processor call (200ms-2s) is the dominant latency component. Mitigations:

  • Connection pooling to processor APIs
  • Circuit breaker pattern: if a processor is failing, fail fast instead of waiting for timeouts
  • Multi-processor support: if Processor A is down, route to Processor B
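  • Timeout budget: set a 3-second timeout. On expiry, mark the payment UNCERTAIN and tell the merchant the outcome is unknown rather than reporting a definite failure; the merchant can then retry with the same idempotency key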
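
The circuit breaker from the list above can be sketched as a small class. This is a simplified illustration with made-up names and thresholds; production breakers usually track failure rates over a sliding window and need to be thread-safe.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None          # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None      # cool-down elapsed: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0              # any success closes the circuit fully
        return result

now = [0.0]                            # fake clock so the demo needs no sleeping
breaker = CircuitBreaker(failure_threshold=2, reset_after=30.0, clock=lambda: now[0])

def failing_processor():
    raise TimeoutError("processor timed out")

outcomes = []
for _ in range(3):
    try:
        breaker.call(failing_processor)
    except TimeoutError:
        outcomes.append("timeout")     # a real call was attempted
    except RuntimeError:
        outcomes.append("fast-fail")   # breaker open, processor not called

now[0] = 31.0                          # after the cool-down a trial call is allowed
recovered = breaker.call(lambda: "ok")
print(outcomes, recovered)
```

A fast-failed request never reached the processor, so unlike a timeout it leaves no uncertain state behind and can be routed to another processor immediately.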

Idempotency Store Performance

Every request hits the idempotency store. Keep it fast:

  • Use Redis as a write-through cache in front of the persistent store
  • TTL of 24 hours on idempotency keys (configurable per merchant)
  • At 500 TPS, Redis handles this trivially

Reconciliation at Scale

With 10 million transactions/day, reconciliation jobs process large datasets:

  • Use batch processing (Apache Spark or a simple partitioned SQL job)
  • Process each processor's settlement file independently in parallel
  • Partition by date & processor to limit working set size

Common Pitfalls

  • Using floating point for money: 0.1 + 0.2 != 0.3 in IEEE 754. Store amounts as integers in the smallest currency unit (cents). This is the most common & most dangerous mistake in payment system design.
  • Retrying without checking external state: blindly retrying a timed-out charge can double-charge the customer. Always send an inquiry to the processor before retrying.
  • Not implementing idempotency from day one — adding idempotency retroactively is extremely difficult. Design it in from the start.
  • Mutable ledger entries — updating or deleting ledger entries destroys the audit trail. Use append-only entries with reversals for corrections.
  • Ignoring the UNCERTAIN state — when an external call times out, the payment is in an unknown state. The system must resolve this before proceeding, not assume success or failure.
  • Skipping reconciliation — without daily reconciliation, discrepancies accumulate silently. By the time they surface, they may be impossible to resolve.
  • Handling PCI data directly when you do not need to — use processor-hosted checkout & tokenization to minimize PCI scope. Most systems never need to touch raw card numbers.
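
The floating point pitfall at the top of this list is easy to reproduce. The sketch below also shows one way to parse a decimal string into integer minor units; `to_minor_units` is a hypothetical helper, and the default exponent of 2 assumes a currency such as USD.

```python
from decimal import Decimal

float_equal = (0.1 + 0.2) == 0.3       # False: binary floats cannot represent 0.1 exactly
cents_equal = (10 + 20) == 30          # True: integer cents add exactly

def to_minor_units(amount: str, exponent: int = 2) -> int:
    """Parse a decimal string such as "19.99" into integer minor units."""
    scaled = Decimal(amount).scaleb(exponent)
    if scaled != scaled.to_integral_value():
        raise ValueError("more decimal places than the currency allows")
    return int(scaled)

print(float_equal, cents_equal)
print(to_minor_units("19.99"), to_minor_units("100"), to_minor_units("500", exponent=0))
```

Parsing from a string matters: converting a float like 19.99 first would bake the rounding error in before the integer conversion happens.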

Key Takeaways

  • Correctness is the top priority. Use ACID transactions, double-entry bookkeeping, & strict state machines to ensure financial integrity.
  • Idempotency is non-negotiable. Every payment API must accept an idempotency key & return cached results for duplicate requests.
  • Store money as integers in the smallest currency unit. Never use floating point for financial calculations.
  • The ledger is append-only. Corrections are made by adding reversal entries, never by modifying existing records.
  • Handle external processor failures with an UNCERTAIN state. Resolve it via inquiry before retrying to prevent double charges.
  • Reconciliation is a core feature, not an afterthought. Automate it, run it daily, & alert on discrepancies.
  • Minimize PCI scope by using tokenization & processor-hosted checkout. Only bring card data in-house if you have a specific reason & the compliance infrastructure to support it.