Design a Payment Processing System
A payment processing system handles the flow of money between buyers, merchants, & financial institutions. Systems like Stripe, Square, & Adyen process millions of transactions daily where correctness is paramount — a single bug can mean money lost, duplicated charges, or regulatory violations.
This document covers the full design with emphasis on the properties that make payment systems uniquely challenging: idempotency, consistency, & auditability.

Functional Requirements
- Accept payments from customers (credit card, debit card, bank transfer)
- Support the full payment lifecycle: authorize, capture, refund, void
- Provide idempotent payment APIs to prevent double charges
- Maintain a ledger of all financial transactions
- Reconcile internal records with external payment processor statements
- Support multiple currencies
- Expose payment status to merchants via API & webhooks
Non-Functional Requirements
- Correctness above all — no double charges, no lost payments
- Effectively exactly-once processing for payment operations (retries are expected; idempotency makes them safe)
- High availability — 99.99% uptime (merchants lose revenue during downtime)
- Strong consistency for financial state transitions
- PCI DSS compliance for handling card data
- Full audit trail for every state change
- Latency: authorization within 2 seconds, settlement within 24 hours
Estimation
Traffic
- 10 million transactions per day
- Average: ~115 transactions/second
- Peak: ~500 transactions/second (Black Friday, flash sales)
- Each transaction involves 2-4 internal state changes (authorize, capture, etc.)
Storage
- Per transaction record: ~1 KB (payment details, metadata, state history)
- Ledger entry per state change: ~500 bytes
- 10 million transactions/day x 1 KB = 10 GB/day for transactions
- Ledger: up to ~40 million entries/day (4 state changes x 10 million transactions) x 500 bytes = 20 GB/day
- Over 5 years with retention requirements: ~55 TB total
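The figures above can be checked with a few lines of arithmetic. This is only a back-of-envelope sketch: it assumes decimal units (1 GB = 10^9 bytes) and takes the upper bound of 4 state changes per transaction for the ledger.

```python
# Back-of-envelope check of the traffic and storage estimates.
SECONDS_PER_DAY = 24 * 60 * 60
GB = 1_000_000_000
TB = 1_000 * GB

tx_per_day = 10_000_000
avg_tps = tx_per_day / SECONDS_PER_DAY              # about 115.7

tx_bytes_per_day = tx_per_day * 1_000               # ~1 KB per transaction record
ledger_entries_per_day = tx_per_day * 4             # upper bound: 4 state changes each
ledger_bytes_per_day = ledger_entries_per_day * 500

daily_gb = (tx_bytes_per_day + ledger_bytes_per_day) / GB
five_year_tb = (tx_bytes_per_day + ledger_bytes_per_day) * 365 * 5 / TB

print(f"average TPS: {avg_tps:.1f}")
print(f"daily growth: {daily_gb:.0f} GB, 5-year total: {five_year_tb:.2f} TB")
```

Indexes, replicas, and backups are not included, so real capacity planning would add headroom on top of this number.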
External API Calls
- Each payment involves at least one call to an external payment processor (Visa, Mastercard networks, bank APIs)
- External call latency: 200ms-2s
- Must handle timeouts, retries, & network failures gracefully
High-Level Design
Components
- Payment API — merchant-facing REST API for payment operations
- Payment Service — orchestrates the payment lifecycle
- Idempotency Store — tracks idempotency keys to prevent duplicate processing
- Payment State Machine — enforces valid state transitions
- Ledger Service — double-entry bookkeeping for all financial movements
- Payment Processor Adapter — integrates with external processors (Stripe, Adyen, bank networks)
- Reconciliation Service — compares internal records against external statements
- Webhook Service — notifies merchants of payment status changes
- Database — relational DB with strong consistency (PostgreSQL)
Payment Flow: Authorization & Capture
Merchant -> Payment API -> Payment Service
-> Idempotency check (duplicate? return cached result)
-> Create payment record (state: PENDING)
-> Ledger entry (debit customer, credit pending)
-> Payment Processor Adapter -> External Processor (Visa/Mastercard)
-> Receive auth response
-> Update payment (state: AUTHORIZED)
-> Ledger entry (update pending to authorized)
-> Return result to merchant
-> Webhook: payment.authorized
Later (capture):
Merchant -> Payment API -> Payment Service
-> Verify payment is in AUTHORIZED state
-> Payment Processor Adapter -> External Processor (capture)
-> Update payment (state: CAPTURED)
-> Ledger entry (move from authorized to settled)
-> Webhook: payment.captured
Detailed Design
Payment State Machine
Every payment follows a strict state machine. Only valid transitions are allowed — this prevents impossible states like refunding an unauthorized payment.
State Transitions:
PENDING -----> AUTHORIZED -----> CAPTURED -----> SETTLED
   |               |                 |
   |               v                 v
   |            VOIDED         REFUND_PENDING --> REFUNDED
   |                                 |
   v                                 v
FAILED                         REFUND_FAILED
Valid transitions:
PENDING -> AUTHORIZED, FAILED
AUTHORIZED -> CAPTURED, VOIDED
CAPTURED -> SETTLED, REFUND_PENDING
REFUND_PENDING -> REFUNDED, REFUND_FAILED
SETTLED -> REFUND_PENDING
Implementation: store the current state in the payment record. Every state transition is a conditional update.
UPDATE payments
SET state = 'AUTHORIZED', updated_at = NOW()
WHERE payment_id = ? AND state = 'PENDING'
-- Returns 0 rows affected if state was not PENDING -> reject the transition
The conditional update acts as an optimistic lock. Two concurrent requests to authorize the same payment will not both succeed.
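A minimal sketch of the same idea in application code, using an in-memory SQLite database as a stand-in for PostgreSQL. The `transition` helper and the table layout are illustrative assumptions; only the transition table itself comes from the list above.

```python
import sqlite3

# Transition table copied from the "Valid transitions" list.
VALID_TRANSITIONS = {
    "PENDING": {"AUTHORIZED", "FAILED"},
    "AUTHORIZED": {"CAPTURED", "VOIDED"},
    "CAPTURED": {"SETTLED", "REFUND_PENDING"},
    "REFUND_PENDING": {"REFUNDED", "REFUND_FAILED"},
    "SETTLED": {"REFUND_PENDING"},
}


def transition(conn, payment_id, from_state, to_state):
    """Move a payment between states; return True only if this call won."""
    if to_state not in VALID_TRANSITIONS.get(from_state, set()):
        raise ValueError(f"illegal transition {from_state} -> {to_state}")
    cur = conn.execute(
        "UPDATE payments SET state = ?, updated_at = CURRENT_TIMESTAMP "
        "WHERE payment_id = ? AND state = ?",
        (to_state, payment_id, from_state),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows: the payment was no longer in from_state


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (payment_id TEXT PRIMARY KEY, state TEXT, updated_at TEXT)"
)
conn.execute("INSERT INTO payments (payment_id, state) VALUES ('pay_abc123', 'PENDING')")
conn.commit()

first = transition(conn, "pay_abc123", "PENDING", "AUTHORIZED")   # wins
second = transition(conn, "pay_abc123", "PENDING", "AUTHORIZED")  # loses
```

The in-memory check rejects transitions that are never legal, while the `WHERE ... AND state = ?` clause rejects transitions that were legal when the request started but lost a race.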
Idempotency
Idempotency is the single most critical property. Network failures, client retries, & load balancer replays can all cause duplicate requests.
How It Works
The merchant includes an Idempotency-Key header (typically a UUID) with every request. The system stores the key alongside the result.
idempotency_store
+-----------------+------------------------------------------------+
| idempotency_key | "550e8400-e29b-41d4-a716-446655440000" (PK) |
| payment_id | "pay_abc123" |
| status | "completed" (or "in_progress" while processing) |
| http_status | 200 |
| response_body | { "status": "authorized", ... } |
| created_at | 2026-03-15T10:30:00Z |
| expires_at | 2026-03-16T10:30:00Z |
+-----------------+------------------------------------------------+
Request Processing
1. Receive request with Idempotency-Key
2. Check idempotency store:
a. Key exists with completed result -> return cached result (no processing)
b. Key exists with in-progress status -> return 409 Conflict (concurrent duplicate)
c. Key does not exist -> insert key with status "in_progress"
3. Process the payment
4. Update idempotency store with the final result
5. Return result to merchant
The insert in step 2c uses a unique constraint on idempotency_key. If two identical requests race, only one insert succeeds. The loser receives a 409 & should retry after a short delay.
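The request-processing steps could be sketched as follows. SQLite stands in for the persistent store, and `handle_request` and the `process` callback are hypothetical names introduced for illustration. The primary key on `idempotency_key` is what decides the race.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE idempotency_store (
           idempotency_key TEXT PRIMARY KEY,   -- unique constraint decides races
           status          TEXT NOT NULL,      -- 'in_progress' or 'completed'
           http_status     INTEGER,
           response_body   TEXT
       )"""
)


def handle_request(key, process):
    """Run `process` at most once per key; replay the stored result afterwards."""
    try:
        # Step 2c: claim the key. Only one concurrent request can succeed.
        conn.execute(
            "INSERT INTO idempotency_store (idempotency_key, status) "
            "VALUES (?, 'in_progress')",
            (key,),
        )
        conn.commit()
    except sqlite3.IntegrityError:
        conn.rollback()
        row = conn.execute(
            "SELECT status, http_status, response_body FROM idempotency_store "
            "WHERE idempotency_key = ?",
            (key,),
        ).fetchone()
        if row[0] == "completed":
            return row[1], json.loads(row[2])          # step 2a: cached result
        return 409, {"error": "request in progress"}   # step 2b: concurrent duplicate

    http_status, body = process()                      # step 3
    conn.execute(                                      # step 4
        "UPDATE idempotency_store SET status = 'completed', http_status = ?, "
        "response_body = ? WHERE idempotency_key = ?",
        (http_status, json.dumps(body), key),
    )
    conn.commit()
    return http_status, body                           # step 5


calls = []

def charge():
    calls.append(1)
    return 200, {"status": "authorized", "payment_id": "pay_abc123"}

key = "550e8400-e29b-41d4-a716-446655440000"
first = handle_request(key, charge)
replay = handle_request(key, charge)   # charge() is not called a second time
```

The sketch does not handle a crash between steps 2c and 4, which would leave the key stuck in "in_progress". A production version would need a timeout or recovery job for such keys.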
Stripe popularized this pattern and made the Idempotency-Key header a standard part of their API design. Stripe's documentation states that keys can be pruned once they are at least 24 hours old, and that the API replays the saved response (including HTTP status code and body) for any repeated key. Their public documentation of this approach has helped make idempotency keys a de facto industry standard that other payment providers have adopted.
Double-Spend Prevention
A customer should not be charged twice for the same order. Multiple layers of defense:
- Idempotency keys (described above) — prevents duplicate API calls
- Payment state machine — a payment can only be authorized once (PENDING -> AUTHORIZED)
- Unique constraint on order reference — one payment per order ID
- External processor deduplication — most processors accept a merchant reference ID & reject duplicates
payments table:
UNIQUE INDEX on (merchant_id, merchant_reference)
-- Prevents two payments for the same order from the same merchant
Ledger Design
The ledger uses double-entry bookkeeping. Every money movement creates two entries: a debit & a credit. The sum of all debits must always equal the sum of all credits.
ledger_entries
+-----------------+------------------------------------------------+
| entry_id | bigint (PK) |
| payment_id | varchar (FK to payments) |
| account_id | varchar (e.g., "customer_123", "merchant_456") |
| entry_type | enum: DEBIT, CREDIT |
| amount | bigint (cents, never floating point) |
| currency | varchar(3) (ISO 4217: "USD", "EUR") |
| created_at | timestamp |
+-----------------+------------------------------------------------+
Example: Customer pays $100 to Merchant
Entry 1: DEBIT customer_123 10000 USD (money leaves customer)
Entry 2: CREDIT merchant_456 10000 USD (money enters merchant)
Critical rules:
- Always use integers for money — store amounts in the smallest currency unit (cents for USD). Floating point arithmetic causes rounding errors that accumulate into real financial discrepancies.
- Entries are append-only — never update or delete a ledger entry. Corrections are made by adding a reversal entry.
- Every entry has a balancing counterpart — the system rejects any operation that does not balance.
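These rules could be enforced in code roughly as follows. The `Ledger` and `Entry` classes are hypothetical and keep entries in memory; a real ledger would persist them in the `ledger_entries` table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    account_id: str
    entry_type: str   # "DEBIT" or "CREDIT"
    amount: int       # smallest currency unit (cents for USD)
    currency: str


class Ledger:
    def __init__(self):
        self._entries = []  # append-only: no update or delete methods exist

    def post(self, entries):
        """Append a group of entries only if debits equal credits per currency."""
        totals = {}
        for e in entries:
            if e.entry_type not in ("DEBIT", "CREDIT"):
                raise ValueError(f"unknown entry type {e.entry_type!r}")
            if not isinstance(e.amount, int) or e.amount <= 0:
                raise ValueError("amount must be a positive integer")
            sign = 1 if e.entry_type == "DEBIT" else -1
            totals[e.currency] = totals.get(e.currency, 0) + sign * e.amount
        if any(total != 0 for total in totals.values()):
            raise ValueError("unbalanced posting rejected")
        self._entries.extend(entries)

    def reverse(self, entries):
        """Correct a posting by appending its mirror image."""
        flipped = [
            Entry(e.account_id,
                  "CREDIT" if e.entry_type == "DEBIT" else "DEBIT",
                  e.amount, e.currency)
            for e in entries
        ]
        self.post(flipped)

    def entries(self):
        return tuple(self._entries)


ledger = Ledger()
payment = [
    Entry("customer_123", "DEBIT", 10000, "USD"),
    Entry("merchant_456", "CREDIT", 10000, "USD"),
]
ledger.post(payment)
ledger.reverse(payment)  # the original entries stay; two reversal entries are added
```

Balancing is checked per currency, so a posting that debits USD and credits EUR in equal numeric amounts is still rejected.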
Square's ledger architecture follows these principles rigorously. Square built a custom double-entry ledger system that tracks every cent flowing through their ecosystem across sellers, buyers, and Square's own accounts. Every transaction produces balanced debit and credit entries, and their system runs continuous balance checks to ensure the books never drift out of balance -- a property they call "provable correctness."
Account Balances
Balances are derived by summing ledger entries per account. For performance, maintain a materialized balance that is updated atomically with each new entry.
account_balances
+-----------------+------------------+
| account_id | "merchant_456" |
| currency | "USD" |
| balance | 1500000 (cents) |
| last_entry_id | 98765 |
+-----------------+------------------+
Update balance atomically with new entry:
BEGIN;
INSERT INTO ledger_entries (...) VALUES (...);
UPDATE account_balances SET balance = balance + 10000, last_entry_id = ?
WHERE account_id = 'merchant_456' AND last_entry_id = ?;
-- Optimistic lock on last_entry_id prevents concurrent corruption.
-- If the UPDATE affects 0 rows, ROLLBACK and retry instead of committing.
COMMIT;
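The transaction above, including the rollback path when the optimistic lock fails, might look like this in application code. SQLite is used as a stand-in for PostgreSQL, and the `credit` helper is a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE ledger_entries (
        entry_id INTEGER PRIMARY KEY, account_id TEXT, entry_type TEXT, amount INTEGER
    );
    CREATE TABLE account_balances (
        account_id TEXT PRIMARY KEY, balance INTEGER, last_entry_id INTEGER
    );
    INSERT INTO ledger_entries VALUES (98765, 'merchant_456', 'CREDIT', 1500000);
    INSERT INTO account_balances VALUES ('merchant_456', 1500000, 98765);
    """
)


def credit(account_id, amount, expected_last_entry_id):
    """Insert a CREDIT entry and bump the balance in one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute(
            "INSERT INTO ledger_entries (account_id, entry_type, amount) "
            "VALUES (?, 'CREDIT', ?)",
            (account_id, amount),
        )
        new_entry_id = cur.lastrowid
        updated = conn.execute(
            "UPDATE account_balances SET balance = balance + ?, last_entry_id = ? "
            "WHERE account_id = ? AND last_entry_id = ?",
            (amount, new_entry_id, account_id, expected_last_entry_id),
        )
        if updated.rowcount != 1:
            # Another writer moved last_entry_id: undo the insert as well.
            raise RuntimeError("stale last_entry_id, reload and retry")
    return new_entry_id


entry_id = credit("merchant_456", 10000, expected_last_entry_id=98765)  # succeeds
try:
    credit("merchant_456", 10000, expected_last_entry_id=98765)  # stale, rolled back
    stale_rejected = False
except RuntimeError:
    stale_rejected = True
```

Raising inside the `with conn:` block is what guarantees the ledger entry and the balance never diverge: either both changes commit or neither does.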
Retry Handling
External payment processor calls can fail due to timeouts, network errors, or temporary outages. The retry strategy must be careful — retrying a charge that actually succeeded means double-charging the customer.
Safe vs Unsafe Retries
Safe to retry (idempotent operations):
- Authorization, but only when the processor deduplicates by merchant reference or idempotency key; otherwise resolve the outcome via inquiry first
- Status check / inquiry
- Void (voiding an already-voided payment is a no-op)
Unsafe to retry without safeguards:
- Capture (some processors treat each capture as a new charge)
- Refund (could issue multiple refunds)
Retry Strategy
1. Call external processor
2. Timeout or network error?
a. Mark payment as UNCERTAIN in internal state
b. Wait & send an inquiry/status check to the processor
c. Processor confirms the original succeeded -> update state accordingly
d. Processor confirms the original failed -> safe to retry
3. Never retry blindly without checking the current state at the processor
4. Use exponential backoff: 1s, 2s, 4s, 8s, max 30s, max 5 attempts
5. After max attempts, flag for manual review
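The UNCERTAIN state is key. It means "we sent a request but do not know the outcome." The system must resolve UNCERTAIN states before allowing any further operations on that payment. UNCERTAIN does not appear in the transition list of the Payment State Machine section, so an implementation has to add it there, with transitions into it from any state that calls the processor and out of it once the inquiry resolves.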
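A sketch of the resolution loop under stated assumptions: `inquire` is a hypothetical callable that asks the processor about the original request and returns "succeeded", "failed", or "unknown". Real processors expose status checks in their own ways.

```python
import time


def backoff_delays(base=1.0, cap=30.0, max_attempts=5):
    """Delays between attempts: 1s, 2s, 4s, 8s, ... capped at `cap`."""
    return [min(base * 2 ** i, cap) for i in range(max_attempts)]


def resolve_uncertain(inquire, sleep=time.sleep):
    """Poll the processor until the outcome of a timed-out call is known."""
    for delay in backoff_delays():
        sleep(delay)
        outcome = inquire()
        if outcome == "succeeded":
            return "APPLY_RESULT"      # step 2c: record the processor's result
        if outcome == "failed":
            return "SAFE_TO_RETRY"     # step 2d: the original did not go through
    return "MANUAL_REVIEW"             # step 5: still unknown after max attempts


# Usage with a fake processor that answers on the third inquiry.
answers = iter(["unknown", "unknown", "succeeded"])
decision = resolve_uncertain(lambda: next(answers), sleep=lambda seconds: None)
```

Passing `sleep` as a parameter keeps the loop testable without real waiting. The function only returns a decision; applying it to the payment record is left to the caller.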
PayPal handles this challenge across hundreds of microservices by using a saga-based approach for distributed transactions. Rather than relying on traditional two-phase commit (which is too slow and fragile at their scale), each step in a payment flow publishes an event, and compensating transactions are triggered automatically if any step fails. PayPal's system resolves millions of uncertain states daily through a combination of automated inquiry calls to downstream processors and a reconciliation engine that detects and resolves discrepancies within minutes.
Reconciliation
Reconciliation compares internal records against external statements to catch discrepancies.
Types of Discrepancies
- Missing from processor: internal record says captured, processor has no record
- Missing from internal: processor charged the customer, but no internal record exists
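- Amount mismatch: internal and processor records disagree on the amount, e.g., one side shows 99.50 where the other shows a different figure (likely FX or fee issue)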
- State mismatch: internal says authorized, processor says captured
Reconciliation Process
1. Daily: download settlement files from each processor
2. Parse into a standardized format
3. Join internal payment records with processor records on merchant reference
4. Flag mismatches:
a. Automatic resolution for known patterns (e.g., FX rounding)
b. Queue for manual review if unresolvable
5. Generate a reconciliation report
6. Track the reconciliation rate (target: 99.99% auto-matched)
Reconciliation runs as a batch job. It is not on the critical payment path but is essential for financial integrity. Any unreconciled transaction older than 48 hours triggers an alert.
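The matching step (steps 3 and 4 of the process) could be sketched as a join on merchant reference. The record shape, a mapping from reference to `(amount_cents, state)`, and the sample data are assumptions made for illustration.

```python
def reconcile(internal, processor):
    """Compare two {merchant_reference: (amount_cents, state)} mappings."""
    all_refs = internal.keys() | processor.keys()
    issues = []
    for ref in sorted(all_refs):
        ours, theirs = internal.get(ref), processor.get(ref)
        if theirs is None:
            issues.append((ref, "missing_from_processor"))
        elif ours is None:
            issues.append((ref, "missing_from_internal"))
        elif ours[0] != theirs[0]:
            issues.append((ref, "amount_mismatch"))
        elif ours[1] != theirs[1]:
            issues.append((ref, "state_mismatch"))
    matched = len(all_refs) - len(issues)
    return issues, matched


internal = {
    "order-1": (10000, "CAPTURED"),
    "order-2": (5000, "CAPTURED"),
    "order-3": (2500, "AUTHORIZED"),
    "order-4": (700, "CAPTURED"),
}
processor = {
    "order-1": (10000, "CAPTURED"),
    "order-2": (4990, "CAPTURED"),   # amount differs
    "order-3": (2500, "CAPTURED"),   # state differs
    "order-5": (1200, "CAPTURED"),   # no internal record
}
issues, matched = reconcile(internal, processor)
```

At 10 million transactions per day the same comparison would run as a partitioned SQL or batch job rather than in memory, but the classification logic stays the same.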
PCI Compliance Considerations
PCI DSS (Payment Card Industry Data Security Standard) governs how card data is handled.
Architecture Implications
- Tokenization: never store raw card numbers. Use a PCI-compliant vault (or the processor's tokenization) to convert card numbers into tokens.
- Network segmentation: the cardholder data environment (CDE) must be isolated from the rest of the infrastructure.
- Encryption: card data encrypted in transit (TLS 1.2+) & at rest (AES-256).
- Access control: least-privilege access to any system touching card data, with access granted per role and reviewed regularly.
- Audit logging: every access to card data is logged with who, when, & why.
Payment flow with tokenization:
1. Customer enters card number in merchant's checkout (hosted by processor)
2. Processor returns a token: "tok_abc123"
3. Merchant sends token to our Payment API
4. Payment Service passes token to processor for authorization
5. Raw card number never touches our servers -> reduced PCI scope
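Using a processor-hosted checkout (like Stripe Elements or Adyen Drop-in) keeps raw card data off the merchant's servers and reduces the merchant's PCI scope to a minimum. It does not eliminate it: the merchant still has to attest compliance, typically through a short self-assessment questionnaire. This is the recommended approach for most systems.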
Webhook Delivery
Merchants need to know when payment states change. Webhooks provide asynchronous notification.
Webhook delivery:
1. Payment state changes
2. Webhook Service creates an event payload (signed with HMAC)
3. POST to merchant's configured endpoint
4. If 2xx response, mark as delivered
5. If failure, retry with increasing backoff (1min, 5min, 30min, 2hr, 24hr)
6. After exhausting retries, mark as failed & alert the merchant via email
Sign webhooks with HMAC-SHA256 so merchants can verify authenticity. Include a timestamp to prevent replay attacks.
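A minimal sketch of signing and verification using Python's standard `hmac` module. The signed message layout (`<timestamp>.<payload>`) and the 300-second tolerance are assumptions, not a prescribed format.

```python
import hashlib
import hmac
import time


def sign(secret: bytes, payload: bytes, timestamp: int) -> str:
    """HMAC-SHA256 over "<timestamp>.<payload>", hex-encoded."""
    message = str(timestamp).encode() + b"." + payload
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret, payload, timestamp, signature, tolerance_s=300, now=None):
    """Merchant-side check: signature matches and the timestamp is recent."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > tolerance_s:
        return False  # too old or too far in the future: possible replay
    expected = sign(secret, payload, timestamp)
    return hmac.compare_digest(expected, signature)  # constant-time comparison


secret = b"whsec_example"            # hypothetical shared secret
payload = b'{"event": "payment.captured", "payment_id": "pay_abc123"}'
sent_at = 1_700_000_000
signature = sign(secret, payload, sent_at)

ok = verify(secret, payload, sent_at, signature, now=sent_at + 10)
replayed = verify(secret, payload, sent_at, signature, now=sent_at + 3600)
```

Because the timestamp is part of the signed message, an attacker cannot refresh it on a captured webhook without invalidating the signature.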
Trade-Offs & Alternatives
Synchronous vs Asynchronous Processing
| Approach | Latency | Complexity | Failure Handling |
|---|---|---|---|
| Synchronous | Lower (direct response) | Lower | Harder (timeouts block the client) |
| Async with queue | Higher (polling/webhook) | Higher | Easier (retry from queue) |
Most payment systems use a hybrid: the initial authorization is synchronous (merchant needs an immediate response), while settlement, reconciliation, & notifications are asynchronous.
Single DB vs Event Sourcing
- Single relational DB: simpler. Payment state is the latest row. Works well up to moderate scale. PostgreSQL with strong consistency is the standard choice.
- Event sourcing: store every state change as an immutable event. Reconstruct current state by replaying events. More complex but provides a perfect audit trail & supports temporal queries.
For payment systems, the append-only ledger already provides event-sourcing-like properties. A hybrid approach (relational DB for current state + append-only ledger for history) gives the best of both worlds.
SQL vs NoSQL
SQL (PostgreSQL) is strongly preferred for payment systems:
- ACID transactions ensure financial consistency
- Foreign keys enforce referential integrity
- Mature tooling for reporting & reconciliation queries
- Easier auditing
Many NoSQL databases default to eventual consistency or offer only limited multi-record transactions, which makes the guarantees payment systems require harder to obtain. The throughput demands (500 TPS peak) are well within PostgreSQL's capabilities.
Bottlenecks & Scaling
Database Scaling
At 500 TPS, a single PostgreSQL instance with connection pooling (PgBouncer) handles the load comfortably. For growth:
- Vertical scaling first (PostgreSQL scales well vertically)
- Read replicas for reporting & reconciliation queries
- Partition the payments table by date (monthly partitions)
- Archive old transactions to cold storage after the retention period
Sharding is typically not needed until 10,000+ TPS. If required, shard by merchant_id to keep all of a merchant's data together.
External Processor Latency
The external processor call (200ms-2s) is the dominant latency component. Mitigations:
- Connection pooling to processor APIs
- Circuit breaker pattern: if a processor is failing, fail fast instead of waiting for timeouts
- Multi-processor support: if Processor A is down, route to Processor B
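- Timeout budget: set a 3-second timeout on the processor call. On timeout, mark the payment UNCERTAIN and tell the merchant the outcome is unknown rather than reporting a definite failure; any merchant retry must reuse the same idempotency key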
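The circuit breaker mitigation could be sketched as below. The class, its thresholds, and the injected `clock` are illustrative assumptions; production systems often use an existing resilience library instead.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

One breaker per processor lets the routing layer treat "circuit open" as the signal to send traffic to another processor. A call that fails fast never reached the processor, so it is a definite failure, not an UNCERTAIN outcome.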
Idempotency Store Performance
Every request hits the idempotency store. Keep it fast:
- Use Redis as a write-through cache in front of the persistent store
- TTL of 24 hours on idempotency keys (configurable per merchant)
- At 500 TPS, Redis handles this trivially
Reconciliation at Scale
With 10 million transactions/day, reconciliation jobs process large datasets:
- Use batch processing (Apache Spark or a simple partitioned SQL job)
- Process each processor's settlement file independently in parallel
- Partition by date & processor to limit working set size
Common Pitfalls
- Using floating point for money: 0.1 + 0.2 != 0.3 in IEEE 754. Store amounts as integers in the smallest currency unit (cents). This is the most common & most dangerous mistake in payment system design.
- Retrying without checking external state: blindly retrying a timed-out authorization can double-charge the customer. Always send an inquiry to the processor before retrying.
- Not implementing idempotency from day one — adding idempotency retroactively is extremely difficult. Design it in from the start.
- Mutable ledger entries — updating or deleting ledger entries destroys the audit trail. Use append-only entries with reversals for corrections.
- Ignoring the UNCERTAIN state — when an external call times out, the payment is in an unknown state. The system must resolve this before proceeding, not assume success or failure.
- Skipping reconciliation — without daily reconciliation, discrepancies accumulate silently. By the time they surface, they may be impossible to resolve.
- Handling PCI data directly when you do not need to — use processor-hosted checkout & tokenization to minimize PCI scope. Most systems never need to touch raw card numbers.
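The floating point pitfall at the top of this list is easy to reproduce. The `to_cents` helper below is a hypothetical example and assumes a currency with two decimal places, such as USD.

```python
from decimal import Decimal, ROUND_HALF_UP

# Binary floating point cannot represent 0.1 or 0.2 exactly.
print(0.1 + 0.2 == 0.3)          # False
print(sum([0.1] * 10) == 1.0)    # False


def to_cents(amount: str) -> int:
    """Parse a decimal string such as "19.99" into integer cents."""
    cents = (Decimal(amount) * 100).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    return int(cents)


print(to_cents("0.10") + to_cents("0.20") == to_cents("0.30"))  # True
```

Parsing from a string rather than from a float matters: `Decimal(0.1)` would carry the binary rounding error into the decimal value.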
Key Takeaways
- Correctness is the top priority. Use ACID transactions, double-entry bookkeeping, & strict state machines to ensure financial integrity.
- Idempotency is non-negotiable. Every payment API must accept an idempotency key & return cached results for duplicate requests.
- Store money as integers in the smallest currency unit. Never use floating point for financial calculations.
- The ledger is append-only. Corrections are made by adding reversal entries, never by modifying existing records.
- Handle external processor failures with an UNCERTAIN state. Resolve it via inquiry before retrying to prevent double charges.
- Reconciliation is a core feature, not an afterthought. Automate it, run it daily, & alert on discrepancies.
- Minimize PCI scope by using tokenization & processor-hosted checkout. Only bring card data in-house if you have a specific reason & the compliance infrastructure to support it.