Design an E-Commerce Platform

This case study walks through the design of a large-scale e-commerce platform similar to Amazon or Shopify. The system supports millions of products across many sellers, handles spiky traffic during flash sales, and ensures reliable checkout & payment processing.

E-commerce microservices architecture

Functional Requirements

Product catalog: sellers create, update, and manage product listings with images, descriptions, pricing, and categories
Search & browse: buyers find products via keyword search, category navigation, and filters (price, rating, brand)
Shopping cart: users add, remove, and update items; carts persist across sessions
Checkout flow: address selection, shipping method, coupon/discount application, order summary, payment
Inventory management: real-time stock tracking, reservations during checkout, low-stock alerts
Order processing: order creation, status tracking, cancellation, returns & refunds
Payment integration: support for credit cards, digital wallets, and buy-now-pay-later via third-party gateways
Recommendations: personalized product suggestions based on browsing history, purchase history, and collaborative filtering
Multi-seller support: seller onboarding, per-seller storefronts, commission tracking, payouts
Reviews & ratings: buyers leave reviews; aggregate ratings displayed on product pages

Non-Functional Requirements

Availability: 99.99% uptime target; checkout & payment paths are the most critical
Latency: product page loads under 200 ms (p95); search results under 300 ms
Consistency: inventory counts must be strongly consistent to prevent overselling
Scalability: handle 100 M+ daily active users, 10x traffic spikes during events (Black Friday, Prime Day)
Durability: zero tolerance for lost orders or payment records
Security: PCI-DSS compliance for payment data; encryption at rest & in transit

Estimation

Traffic

100 M daily active users
Each user views ~20 pages per session -> 2 B page views/day -> ~23 K requests/second average
Peak during flash sales: ~230 K requests/second (10x)
5% of sessions result in a purchase -> 5 M orders/day -> ~58 orders/second average

Storage

500 M products in the catalog; each product record ~10 KB -> 5 TB for product metadata
Product images: average 5 images per product at 500 KB each -> 1.25 PB raw (served via CDN)
Orders: 5 M/day * 5 KB avg -> 25 GB/day -> ~9 TB/year
User data (profiles, carts, wishlists): 500 M users * 5 KB -> 2.5 TB

Bandwidth

Average page size 2 MB (mostly images from CDN)
2 B page views * 2 MB -> 4 PB/day outbound (CDN absorbs 95%+)
Origin bandwidth: ~200 TB/day

High-Level Design

The platform is split into independently deployable services behind an API gateway.

Core Services

Product Service — CRUD for product listings, catalog browsing, category management
Search Service — full-text search over the product catalog with filters & facets
Cart Service — shopping cart state management
Order Service — order lifecycle from creation through fulfillment
Inventory Service — stock levels, reservations, replenishment triggers
Payment Service — integrates with external payment gateways
User Service — authentication, profiles, addresses
Recommendation Service — ML-based product suggestions
Seller Service — seller onboarding, storefront config, commission & payout tracking
Notification Service — order confirmations, shipping updates, promotional emails

Data Flow — Happy Path Purchase

Client -> CDN (static assets, product images)
Client -> API Gateway -> Product Service (browse/search)
Client -> API Gateway -> Cart Service (add to cart)
Client -> API Gateway -> Order Service (place order)
  Order Service -> Inventory Service (reserve stock)
  Order Service -> Payment Service -> External Gateway (charge)
  Order Service -> Notification Service (confirmation email)
  Order Service -> Seller Service (notify seller to ship)

Data Stores

Product DB — PostgreSQL with read replicas; holds product metadata, categories, seller info
Search Index — Elasticsearch cluster; synced from Product DB via change data capture
Cart Store — Redis cluster; fast reads/writes, TTL-based expiry for abandoned carts
Order DB — PostgreSQL with strong consistency; partitioned by order date
Inventory DB — PostgreSQL; requires serializable isolation for stock reservation
User DB — PostgreSQL with read replicas
Object Storage — S3 for product images, invoices, and other blobs
Analytics Warehouse — ClickHouse or BigQuery for recommendation model training & business intelligence

Detailed Design

Product Catalog & Search

The Product Service owns the source of truth for all product data. When a seller creates or updates a listing, the change is written to PostgreSQL and an event is published to a Kafka topic.

The Search Service consumes these events and updates an Elasticsearch index. The index is optimized for full-text queries with filters on price, category, brand, rating, and seller.

Seller writes product -> Product DB (PostgreSQL)
                      -> Kafka (product.updated event)
                      -> Search Indexer -> Elasticsearch

Search queries flow through the API gateway to the Search Service, which fans out to Elasticsearch shards. Results are enriched with pricing and stock data before returning to the client.

Faceted navigation (e.g., "Laptops > Brand: Dell > Price: $500-$ 1000") is handled by Elasticsearch aggregations. Category trees are stored in PostgreSQL using a materialized path or nested set model.

Shopping Cart

Carts are stored in Redis for speed. Each cart is a hash keyed by user ID, with fields for each product (SKU, quantity, price snapshot).

For guest users, a session-based cart ID is stored in a cookie. When the guest signs in, the Cart Service merges the session cart with any existing user cart.

Cart data has a TTL of 30 days. If the user returns after expiry, the cart is empty. Price snapshots in the cart may become stale, so the checkout flow re-validates prices & availability at the moment of purchase.

Inventory Management & Reservation

Inventory is the most consistency-critical component. The Inventory Service uses PostgreSQL with serializable transactions to prevent overselling.

When a user begins checkout, the Order Service calls the Inventory Service to create a soft reservation:

BEGIN SERIALIZABLE;
  SELECT quantity FROM inventory WHERE sku = ? FOR UPDATE;
  -- check quantity >= requested
  UPDATE inventory SET quantity = quantity - ?, reserved = reserved + ? WHERE sku = ?;
COMMIT;

Soft reservations have a TTL (e.g., 10 minutes). If the user does not complete payment, a background job releases the reservation and restores available stock.

For flash sales with extreme contention on a single SKU, the system uses a token bucket approach: pre-allocate purchase tokens in Redis and let the Inventory DB reconcile asynchronously. This trades strict real-time accuracy for throughput.

Checkout & Payment Flow

Checkout is a multi-step process with compensation (saga pattern) for failures at any stage.

1. Validate cart (prices, availability)
2. Reserve inventory
3. Calculate totals (subtotal, tax, shipping, discounts)
4. Create order record (status: PENDING_PAYMENT)
5. Initiate payment via Payment Service
   -> Payment Service calls Stripe/Adyen/PayPal
6a. Payment SUCCESS -> update order to CONFIRMED, notify seller
6b. Payment FAILURE -> release inventory reservation, update order to FAILED

The Order Service acts as the saga orchestrator. Each step is idempotent so that retries are safe. The Payment Service stores a unique idempotency key per order to prevent double charges.

For PCI-DSS compliance, the platform never stores raw card numbers. The client-side checkout form sends card data directly to the payment gateway (e.g., Stripe Elements), which returns a token. Only the token is sent to the backend.

Order Processing & Fulfillment

After payment confirmation, the order moves through a state machine:

PENDING_PAYMENT -> CONFIRMED -> PROCESSING -> SHIPPED -> DELIVERED
                -> CANCELLED (from CONFIRMED or PROCESSING)
                -> RETURN_REQUESTED -> REFUNDED (from DELIVERED)

State transitions publish events to Kafka. Downstream consumers handle:

Seller notification (ship the order)
Warehouse management system integration (for platform-fulfilled orders)
Notification Service (email/SMS updates to buyer)
Analytics pipeline (revenue tracking)

Recommendations

The Recommendation Service combines multiple signals:

Collaborative filtering — users who bought X also bought Y (computed offline in batch)
Content-based filtering — similar product attributes (category, brand, price range)
Real-time personalization — recent browsing and cart activity stored in Redis, used to re-rank candidates

A nightly batch job trains the collaborative filtering model on the analytics warehouse. Results are pre-computed and stored in a key-value store (DynamoDB or Redis) keyed by user ID and product ID.

Real-time features (recently viewed, trending in category) are computed using a streaming pipeline on Kafka + Flink.

Multi-Seller Support

Each product belongs to a seller. The Seller Service manages:

Seller profiles & verification
Per-seller commission rates
Payout scheduling (aggregate orders, subtract commission, transfer funds)

When an order contains items from multiple sellers, the Order Service splits it into sub-orders, one per seller. Each sub-order has its own fulfillment lifecycle. The buyer sees a unified order view; sellers see only their sub-order.

Payouts run as a scheduled batch job (e.g., bi-weekly). The Payment Service transfers funds to seller bank accounts via ACH or equivalent.

Search Relevance & Autocomplete

Search ranking combines:

Text relevance (BM25 score from Elasticsearch)
Popularity (sales velocity, click-through rate)
Seller quality (rating, fulfillment speed)
Sponsored boost (paid placement by sellers)

Autocomplete uses a prefix-based index in Elasticsearch with a completion suggester. Popular queries are cached in Redis with short TTLs.

Spell correction uses an n-gram based approach: the search index includes n-gram sub-fields, and fuzzy matching catches common typos.

Trade-Offs & Alternatives

SQL vs NoSQL for Product Catalog

PostgreSQL was chosen for its strong consistency, relational modeling (products-categories-sellers), and mature tooling. A document store like MongoDB could simplify the schema for products with highly variable attributes, but joins across sellers, categories, and inventory become harder.

An alternative is a hybrid: PostgreSQL for structured data (pricing, inventory, seller info) and a document store for flexible product attributes (specifications vary by category).

Synchronous vs Asynchronous Checkout

The design uses synchronous payment calls during checkout for immediate user feedback. An alternative is fully asynchronous: accept the order, return a "processing" status, and confirm later. This improves throughput but degrades user experience — buyers want instant confirmation.

The hybrid chosen here is synchronous for the payment call (typically < 2 seconds) with asynchronous post-payment processing (notifications, fulfillment triggers).

Event Sourcing for Orders

The order state machine could be implemented with event sourcing: store every state change as an immutable event and derive the current state by replaying events. This provides a perfect audit trail and enables temporal queries ("what was the order status at 3 PM?").

The trade-off is increased complexity. For most e-commerce platforms, a status column with a separate audit log table achieves 90% of the benefit with far less complexity.

Cart in Redis vs Database

Redis provides sub-millisecond cart access but risks data loss if a node fails before replication. For most carts this is acceptable — losing a cart is annoying but not catastrophic. For high-value B2B carts, persisting to PostgreSQL with a Redis cache in front is safer.

Bottlenecks & Scaling

Hot Product Problem

A viral product or flash sale creates extreme read & write contention on a single SKU. Mitigations:

Cache product detail pages aggressively at the CDN edge
Use Redis for inventory pre-allocation (token bucket) to avoid DB contention
Rate-limit add-to-cart requests per user to prevent bot-driven spikes
Queue checkout requests and process in order (virtual waiting room)

Database Scaling

Product DB: read replicas handle browse traffic; write traffic is moderate (seller updates)
Order DB: partition by order creation date; older partitions move to cold storage
Inventory DB: shard by SKU range; each shard handles a subset of products
Cross-shard transactions (multi-seller orders) use the saga pattern rather than distributed transactions

Search Index Scaling

Elasticsearch scales horizontally by adding data nodes and shards. For 500 M products:

50 shards across 25 nodes (2 shards per node)
Each shard holds ~10 M products
Replica shards on different nodes for fault tolerance
Index aliasing enables zero-downtime re-indexing when the mapping changes

Payment Gateway Reliability

External payment gateways are a single point of failure. Mitigations:

Support multiple gateways (Stripe as primary, Adyen as fallback)
Circuit breaker pattern: if Stripe error rate exceeds threshold, route to Adyen
Idempotency keys on every payment request to handle retries safely
Store payment intent locally before calling the gateway so no charge is ever lost

CDN & Static Asset Scaling

Product images are the largest bandwidth consumer. Serve them through a multi-tier CDN:

Edge nodes close to users (CloudFront, Fastly)
Origin shield layer to reduce load on object storage
Image transformation service (resize, WebP conversion) at the edge or origin
Cache-Control headers with long max-age and versioned URLs for cache busting

Session & Authentication Scaling

With 100 M daily active users, session management becomes a scaling concern:

Use stateless JWT tokens for authentication to avoid server-side session storage
Short-lived access tokens (15 minutes) with longer-lived refresh tokens stored in Redis
Rate-limit login attempts per IP and per account to prevent credential stuffing
During flash sales, pre-authenticate users and issue session tokens before the event starts to reduce auth service load at peak

Monitoring & Alerting

An e-commerce platform requires granular monitoring across every service:

Business metrics: orders per minute, cart abandonment rate, revenue per second (any sudden drop signals a system problem)
Inventory alerts: alert when stock reservation failures spike, indicating contention or bugs
Payment success rate: track per-gateway success rates; a drop below threshold triggers automatic failover
Search latency: monitor p50, p95, p99; degraded search directly reduces conversion rates
Consumer lag: track Kafka consumer lag for the search indexer and notification pipeline; stale data causes poor user experience

Common Pitfalls

Overselling: relying on application-level checks without database-level serialization leads to race conditions; always use SELECT ... FOR UPDATE or equivalent
Cart-price drift: prices change between when the user adds to cart and when they check out; always re-validate at checkout
Distributed transactions: trying to atomically update inventory + create order + charge payment in a single transaction across services does not scale; use sagas with compensation
Search index lag: if the search index falls behind the source database, users see stale results (out-of-stock items appearing available); monitor consumer lag and alert early
Ignoring idempotency: payment retries without idempotency keys cause double charges; every external call that has side effects needs an idempotency mechanism
Monolithic product schema: forcing all product types into a single rigid schema leads to sparse columns and awkward queries; use a flexible attribute model
Neglecting seller experience: focusing only on buyer flows while seller tools (inventory upload, order management, analytics) remain clunky drives sellers to competing platforms

Key Takeaways

Inventory reservation with strong consistency is the most critical correctness requirement; overselling erodes trust faster than anything else
The saga pattern with compensation is the practical approach to distributed transactions across services like inventory, payment, and order management
Redis is an excellent fit for shopping carts and session data where sub-millisecond latency matters and occasional data loss is tolerable
Search and browse are read-heavy paths that benefit from aggressive caching (CDN, Redis, Elasticsearch replicas) while checkout is write-heavy and demands consistency
Multi-seller support fundamentally changes order processing: every order may need to be split into sub-orders with independent fulfillment lifecycles
Payment processing requires PCI-DSS compliance (tokenization, never storing raw card data) and idempotency on every gateway call
Flash sales and viral products create hot-key problems that require specific mitigations (token buckets, virtual waiting rooms, CDN edge caching)