Microservices Architecture
Service Decomposition
Breaking a monolith into independently deployable services. Each service owns its data and communicates over the network.
Monolith:                    Microservices:
┌─────────────────┐          ┌──────┐ ┌──────┐ ┌──────┐
│ Auth    Orders  │          │ Auth │ │Orders│ │Invent│
│ Inventory       │   ──>    │ [DB] │ │ [DB] │ │ [DB] │
│ Payments        │          └──────┘ └──────┘ └──────┘
│ [Single DB]     │          ┌──────┐ ┌──────┐
└─────────────────┘          │ Pay  │ │Notify│
                             │ [DB] │ │ [DB] │
                             └──────┘ └──────┘
Decomposition Strategies
By business capability: Align services to business domains (DDD bounded contexts).
E-commerce bounded contexts:
┌──────────┐   ┌───────────┐   ┌──────────┐
│ Catalog  │   │   Order   │   │ Shipping │
│ Context  │   │  Context  │   │ Context  │
│          │   │           │   │          │
│ Product  │   │ Order     │   │ Shipment │
│ Category │   │ LineItem  │   │ Tracking │
│ Price    │   │ Payment   │   │ Carrier  │
└──────────┘   └───────────┘   └──────────┘
"Product" in Catalog = full description, images, specs
"Product" in Order = just SKU, name, price at purchase time
Different models for different contexts.
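These context-specific models can be sketched as separate types; the field names here are illustrative, not from any real catalog:

```python
import dataclasses
from dataclasses import dataclass

# The "same" Product, modeled differently per bounded context.

@dataclass
class CatalogProduct:
    """Catalog context: rich, mutable product description."""
    sku: str
    name: str
    description: str
    image_urls: list[str]
    current_price_cents: int

@dataclass(frozen=True)
class OrderLineProduct:
    """Order context: immutable snapshot of what was bought, at what price."""
    sku: str
    name: str
    price_at_purchase_cents: int
```

Freezing the order-side model captures the domain rule that a line item's price must not change after purchase, even if the catalog price does.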
Strangler Fig Pattern: Incrementally replace monolith functionality with microservices.
Phase 1: [Monolith] handles everything
Phase 2: [Proxy] ──> [Monolith]      (most routes)
                 ──> [New Service]   (migrated routes)
Phase 3: [Proxy] ──> [New Service A]
                 ──> [New Service B]
                 ──> [Monolith]      (shrinking)
Phase N: [Proxy] ──> [Service A] [Service B] ... [Service N]
         (monolith decommissioned)
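During phases 2 and 3, the proxy's job reduces to a routing table; a minimal sketch, where the path prefixes and service names are hypothetical:

```python
# Strangler-fig routing at the proxy: migrated prefixes go to new services,
# everything else falls through to the monolith.
MIGRATED_PREFIXES = {
    "/orders": "orders-service",
    "/inventory": "inventory-service",
}

def route(path: str) -> str:
    """Pick the backend for a request path."""
    for prefix, service in MIGRATED_PREFIXES.items():
        if path.startswith(prefix):
            return service
    return "monolith"
```

Migrating a capability is then just adding an entry to the table; rolling back is deleting it.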
API Gateway
Single entry point for all client requests. Handles cross-cutting concerns.
                  ┌───────────────────┐
Mobile App ─────> │                   │ ──> Auth Service
Web App ────────> │    API Gateway    │ ──> Order Service
3rd Party ──────> │                   │ ──> Product Service
                  │ - Routing         │ ──> Payment Service
                  │ - Auth/AuthZ      │
                  │ - Rate limiting   │
                  │ - SSL termination │
                  │ - Request/Response│
                  │   transformation  │
                  │ - Caching         │
                  └───────────────────┘
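Gateway-side rate limiting is commonly a token bucket per client; a minimal sketch (rate and capacity values are illustrative):

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter, one bucket per client.

    rate: tokens refilled per second; capacity: maximum burst size.
    """

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A gateway would keep one bucket per API key or client IP and reject with HTTP 429 when `allow()` returns False.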
Backend for Frontend (BFF):
┌──────────┐      ┌──────────┐
│Mobile BFF│ ───> │          │
└──────────┘      │ Services │
┌──────────┐      │          │
│ Web BFF  │ ───> │          │
└──────────┘      └──────────┘
Each BFF tailored to client needs.
Technologies: Kong, NGINX, AWS API Gateway, Envoy.
Service Mesh
Infrastructure layer for service-to-service communication. Handles networking concerns without application code changes.
Without mesh:              With mesh (sidecar proxy):

┌─────────┐                ┌─────────┬─────────┐
│Service A│ ──direct──>    │Service A│ Proxy A │
└─────────┘                └─────────┴────┬────┘
                                          │ (mTLS, retries,
┌─────────┐                ┌─────────┬────┴────┐  circuit breaking)
│Service B│                │Service B│ Proxy B │
└─────────┘                └─────────┴─────────┘
Data Plane: Sidecar proxies (Envoy) handle all traffic
Control Plane: Configures proxies (Istio Pilot, Linkerd)
Istio Architecture
┌─────────────────────────────────────────────┐
│                Control Plane                │
│  ┌────────┐  ┌─────────┐  ┌──────────────┐  │
│  │ Pilot  │  │ Citadel │  │    Galley    │  │
│  │(routing│  │ (certs/ │  │   (config    │  │
│  │ config)│  │  mTLS)  │  │  validation) │  │
│  └────┬───┘  └────┬────┘  └──────┬───────┘  │
└───────┼───────────┼──────────────┼──────────┘
        │           │              │
┌───────┼───────────┼──────────────┼──────────┐
│       ▼           ▼              ▼          │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐      │
│  │ Envoy A │  │ Envoy B │  │ Envoy C │      │
│  │(sidecar)│  │(sidecar)│  │(sidecar)│      │
│  └────┬────┘  └────┬────┘  └────┬────┘      │
│  ┌────┴────┐  ┌────┴────┐  ┌────┴────┐      │
│  │  Svc A  │  │  Svc B  │  │  Svc C  │      │
│  └─────────┘  └─────────┘  └─────────┘      │
│                 Data Plane                  │
└─────────────────────────────────────────────┘
Note: since Istio 1.5, Pilot, Citadel, and Galley are consolidated into a single istiod binary; the diagram shows their logical roles.
Linkerd
Lighter weight than Istio. Uses its own micro-proxy (linkerd2-proxy, written in Rust) instead of Envoy. Focuses on simplicity and low resource overhead.
Circuit Breaker
Prevents cascading failures by stopping calls to a failing service.
States:
  CLOSED ──(failures ≥ threshold)──> OPEN
    ↑                                 │
    │                        (timeout expires)
    │                                 │
    └──(success)── HALF-OPEN <────────┘
                   (allow limited requests)
CLOSED:    Normal operation, requests pass through
OPEN:      Requests fail immediately (fast-fail)
HALF-OPEN: Allow a few test requests
           Success → CLOSED
           Failure → OPEN
A runnable Python version of this state machine:

import time
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(self, failure_threshold, success_threshold, timeout):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.success_count = 0
        self.success_threshold = success_threshold
        self.last_failure_time = None
        self.timeout = timeout  # seconds the circuit stays open before probing

    def allow_request(self):
        if self.state is CircuitState.CLOSED:
            return True
        if self.state is CircuitState.OPEN:
            # Once the timeout expires, move to HALF_OPEN and let a probe through.
            if (self.last_failure_time is not None
                    and time.monotonic() - self.last_failure_time > self.timeout):
                self.state = CircuitState.HALF_OPEN
                self.success_count = 0
                return True
            return False  # fail fast
        return True  # HALF_OPEN: allow test requests

    def record_success(self):
        if self.state is CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
        elif self.state is CircuitState.CLOSED:
            self.failure_count = 0

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.monotonic()
        if self.state is CircuitState.CLOSED:
            if self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN
        elif self.state is CircuitState.HALF_OPEN:
            self.state = CircuitState.OPEN  # any probe failure reopens the circuit
Distributed Tracing (OpenTelemetry)
Track requests across service boundaries. Each request gets a trace ID; each service hop creates a span.
Trace ID: abc-123
Service A (span 1)
├── HTTP GET /order/42 [0ms ─────── 250ms]
│
├── Service B (span 2)
│ ├── gRPC GetUser(42) [10ms ──── 80ms]
│ └── Redis cache lookup [15ms ─ 20ms]
│
├── Service C (span 3)
│ ├── gRPC GetInventory(42) [85ms ──── 180ms]
│ └── PostgreSQL query [90ms ── 170ms]
│
└── Response to client [250ms]
Context propagation:
  HTTP headers:  traceparent: 00-{trace-id}-{parent-id}-{flags}
                 (W3C Trace Context: 32-hex-char trace ID, 16-hex-char parent span ID)
  gRPC metadata: same format
  Each service extracts the incoming context, creates a child span, and propagates the new context downstream.
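A hand-rolled sketch of that propagation step (OpenTelemetry SDKs do this for you; this simplified parser accepts only version-00 headers):

```python
import re

# W3C Trace Context traceparent layout: version-traceid-parentid-flags.
TRACEPARENT_RE = re.compile(
    r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header):
    """Extract (trace_id, parent_span_id) from an incoming traceparent header."""
    m = TRACEPARENT_RE.match(header)
    if m is None:
        return None
    trace_id, span_id, _flags = m.groups()
    return trace_id, span_id

def make_traceparent(trace_id, span_id, sampled=True):
    """Build the outgoing header, with this service's new span as the parent."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
```

The trace ID stays constant across the whole request; only the parent span ID changes at each hop.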
OpenTelemetry Architecture:
┌──────────┐     ┌────────────┐     ┌──────────┐
│   App    │────>│    OTel    │────>│ Backend  │
│  (SDK)   │     │  Collector │     │ (Jaeger, │
│          │     │            │     │  Zipkin, │
│  Traces  │     │ - Receive  │     │  Tempo)  │
│  Metrics │     │ - Process  │     │          │
│  Logs    │     │ - Export   │     │          │
└──────────┘     └────────────┘     └──────────┘
Three signals: Traces, Metrics, Logs (unified under OTel)
Chaos Engineering
Deliberately inject failures to test system resilience.
Principles (Netflix):
1. Define steady state (measurable baseline)
2. Hypothesize steady state continues under failure
3. Introduce real-world events:
- Kill instances
- Inject latency
- Network partitions
- CPU/memory pressure
- Clock skew
4. Observe: does steady state hold?
Tools:
- Chaos Monkey: randomly kills instances
- Litmus Chaos: Kubernetes-native chaos
- Gremlin: SaaS chaos platform
- Toxiproxy: network failure simulation
Chaos experiment example:
Steady state:   p99 latency < 200ms, error rate < 0.1%
Experiment:     kill 1 of 3 instances of order-service
Expected:       traffic redistributes, latency spike < 500ms,
                recovery within 30s, no errors
Actual result?  Run it and find out.
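In-process fault injection, the idea behind a tool like Toxiproxy, can be approximated with a wrapper that adds latency or errors at a configured rate; this sketch is illustrative, not any tool's API:

```python
import random
import time

def inject_faults(func, latency_s=0.0, error_rate=0.0, rng=None):
    """Wrap a callable to behave like a degraded dependency.

    latency_s:  extra delay added to every call
    error_rate: probability of raising instead of calling through
    """
    rng = rng if rng is not None else random.Random()

    def wrapper(*args, **kwargs):
        if latency_s:
            time.sleep(latency_s)
        if rng.random() < error_rate:
            raise ConnectionError("injected fault")
        return func(*args, **kwargs)

    return wrapper
```

Wrapping a client this way in a test environment lets you verify the hypothesized steady state before running chaos experiments in production.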
Observability: The Three Pillars
┌─────────────────────────────────────────────────┐
│                  Observability                  │
│                                                 │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐     │
│  │ Metrics  │   │   Logs   │   │  Traces  │     │
│  │          │   │          │   │          │     │
│  │ Counter  │   │ Struct-  │   │ Spans    │     │
│  │ Gauge    │   │ ured     │   │ Context  │     │
│  │ Histogram│   │ events   │   │ Timing   │     │
│  │          │   │          │   │          │     │
│  │Prometheus│   │ ELK/Loki │   │ Jaeger/  │     │
│  │ Grafana  │   │ Datadog  │   │ Tempo    │     │
│  └──────────┘   └──────────┘   └──────────┘     │
└─────────────────────────────────────────────────┘
Metrics: WHAT is happening (request rate, error rate, latency)
Logs: WHY it happened (detailed event records)
Traces: WHERE in the system it happened (request flow)
Key Metrics (RED Method)
For services:
Rate: requests per second
Errors: failed requests per second
Duration: distribution of request latency
For resources (USE Method):
Utilization: % of time resource is busy
Saturation: queue depth / backlog
Errors: error count
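A toy tracker for the RED numbers (a real service would use a metrics library such as prometheus_client; this only shows the arithmetic, with a naive nearest-rank p99):

```python
import time
from bisect import insort

class REDTracker:
    """Tracks Rate, Errors, and Duration for one service endpoint."""

    def __init__(self):
        self.start = time.monotonic()
        self.requests = 0
        self.errors = 0
        self.durations = []  # kept sorted for percentile lookup

    def observe(self, duration_s, is_error=False):
        self.requests += 1
        if is_error:
            self.errors += 1
        insort(self.durations, duration_s)

    def rate(self):
        """Requests per second since the tracker started."""
        return self.requests / max(time.monotonic() - self.start, 1e-9)

    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0

    def p99(self):
        """Nearest-rank 99th percentile of request duration."""
        if not self.durations:
            return 0.0
        idx = max(int(0.99 * len(self.durations)) - 1, 0)
        return self.durations[idx]
```

Production systems use histogram buckets rather than storing every sample, trading exactness for bounded memory.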
Structured Logging
# Structured logging: each event carries machine-readable key/value fields
# (rendered by a JSON formatter or a library such as structlog).
import logging

logger = logging.getLogger("orders")

async def process_order(order_id, user_id, db):
    logger.info("Processing order",
                extra={"order_id": order_id, "user_id": user_id})
    try:
        order = await db.get_order(order_id)
    except Exception as exc:
        logger.error("Failed to fetch order from database",
                     extra={"order_id": order_id, "error": str(exc)})
        raise
    logger.info("Order processed successfully",
                extra={"order_id": order_id,
                       "total": order.total,
                       "items": len(order.items)})
    return order
Anti-Patterns and Pitfalls
1. Distributed Monolith: services tightly coupled, deploy together
Fix: proper domain boundaries, async communication
2. Chatty services: too many synchronous calls
Fix: aggregate APIs, batch operations, async patterns
3. Shared database: multiple services accessing one DB
Fix: each service owns its data, use events to sync
4. No circuit breakers: cascading failures
Fix: circuit breakers, bulkheads, timeouts on every call
5. Ignoring partial failure: treating network as reliable
Fix: retries with backoff, idempotency, graceful degradation
6. Missing observability: "it works on my machine"
Fix: traces, metrics, logs from day one
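The fix for pitfall 5 can be sketched as retries with exponential backoff and full jitter (attempt counts and delays are illustrative; only safe for idempotent operations):

```python
import random
import time

def retry_with_backoff(func, max_attempts=5, base_delay_s=0.1, max_delay_s=5.0):
    """Call func, retrying on exception with exponential backoff + full jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            time.sleep(random.uniform(0, delay))  # jitter avoids thundering herds
```

Full jitter (sleeping a random fraction of the backoff window) spreads retries from many clients so they do not hammer a recovering service in lockstep.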
Real-World Technology Stack
| Concern                 | Technology                             |
|-------------------------|----------------------------------------|
| API Gateway             | Kong, AWS API Gateway, Envoy           |
| Service Mesh            | Istio, Linkerd, Cilium                 |
| Circuit Breaker         | Resilience4j, Polly, built into mesh   |
| Tracing                 | OpenTelemetry, Jaeger, Zipkin, Datadog |
| Metrics                 | Prometheus + Grafana, Datadog          |
| Logging                 | ELK Stack, Grafana Loki, Datadog       |
| Chaos                   | Litmus, Gremlin, Chaos Monkey          |
| Container Orchestration | Kubernetes                             |
| CI/CD                   | ArgoCD, Flux, GitHub Actions           |
Key Takeaways
- Decompose by business capability, not technical layer. Each service should map to a bounded context with its own data store.
- A service mesh extracts networking concerns (mTLS, retries, observability) from application code into infrastructure, at the cost of operational complexity.
- Circuit breakers, timeouts, and retries with exponential backoff are non-negotiable in production microservices.
- Observability (metrics, logs, traces) is not optional. Without it, debugging distributed systems is nearly impossible.
- Chaos engineering shifts resilience testing left: find failures before your customers do.
- Start with a monolith, extract services when you understand the domain boundaries. Premature decomposition creates distributed monoliths.