Microservices Architecture

Service Decomposition

Breaking a monolith into independently deployable services. Each service owns its data and communicates over the network.

Monolith:                    Microservices:
┌─────────────────┐         ┌──────┐ ┌──────┐ ┌──────┐
│  Auth  Orders   │         │ Auth │ │Orders│ │Invent│
│  Inventory      │   ──>   │ [DB] │ │ [DB] │ │ [DB] │
│  Payments       │         └──────┘ └──────┘ └──────┘
│  [Single DB]    │         ┌──────┐ ┌──────┐
└─────────────────┘         │ Pay  │ │Notify│
                            │ [DB] │ │ [DB] │
                            └──────┘ └──────┘

Decomposition Strategies

By business capability: Align services to business domains (DDD bounded contexts).

E-commerce bounded contexts:
  ┌──────────┐  ┌───────────┐  ┌──────────┐
  │ Catalog  │  │   Order   │  │ Shipping │
  │ Context  │  │  Context  │  │ Context  │
  │          │  │           │  │          │
  │ Product  │  │ Order     │  │ Shipment │
  │ Category │  │ LineItem  │  │ Tracking │
  │ Price    │  │ Payment   │  │ Carrier  │
  └──────────┘  └───────────┘  └──────────┘

  "Product" in Catalog = full description, images, specs
  "Product" in Order   = just SKU, name, price at purchase time
  Different models for different contexts.
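
The two Product models can be made concrete as separate types, translated at the context boundary; a Python sketch with illustrative field names:

```python
from dataclasses import dataclass

# Catalog context: the full, editable product model.
@dataclass
class CatalogProduct:
    sku: str
    name: str
    description: str
    image_urls: list[str]
    price_cents: int

# Order context: only what an order line needs, with the
# price frozen at purchase time.
@dataclass(frozen=True)
class OrderLineProduct:
    sku: str
    name: str
    price_at_purchase_cents: int

def to_order_line(p: CatalogProduct) -> OrderLineProduct:
    """Translate at the boundary (an anti-corruption layer)."""
    return OrderLineProduct(sku=p.sku, name=p.name,
                            price_at_purchase_cents=p.price_cents)
```

Neither context imports the other's model; only the translation function knows both.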

Strangler Fig Pattern: Incrementally replace monolith functionality with microservices.

Phase 1:  [Monolith] handles everything
Phase 2:  [Proxy] ──> [Monolith] (most routes)
                  ──> [New Service] (migrated routes)
Phase 3:  [Proxy] ──> [New Service A]
                  ──> [New Service B]
                  ──> [Monolith] (shrinking)
Phase N:  [Proxy] ──> [Service A] [Service B] ... [Service N]
                      (monolith decommissioned)
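
Phase 2 is essentially prefix routing in the proxy; a minimal Python sketch (the route table and upstream names are illustrative):

```python
# Routes already migrated to new services; everything else
# falls through to the monolith (strangler fig, phase 2).
MIGRATED_ROUTES = {
    "/orders": "http://order-service",
    "/payments": "http://payment-service",
}
MONOLITH = "http://monolith"

def route(path: str) -> str:
    """Pick the upstream for a request path."""
    for prefix, upstream in MIGRATED_ROUTES.items():
        if path == prefix or path.startswith(prefix + "/"):
            return upstream
    return MONOLITH  # unmigrated routes still hit the monolith
```

As routes migrate, entries move into the table; phase N is the table with the monolith fallback removed.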

API Gateway

Single entry point for all client requests. Handles cross-cutting concerns.

                    ┌────────────────────┐
  Mobile App ─────> │                    │ ──> Auth Service
  Web App ────────> │    API Gateway     │ ──> Order Service
  3rd Party ──────> │                    │ ──> Product Service
                    │  - Routing         │ ──> Payment Service
                    │  - Auth/AuthZ      │
                    │  - Rate limiting   │
                    │  - SSL termination │
                    │  - Request/Response│
                    │    transformation  │
                    │  - Caching         │
                    └────────────────────┘

Backend for Frontend (BFF):
  ┌──────────┐     ┌──────────┐
  │Mobile BFF│────>│          │
  └──────────┘     │ Services │
  ┌──────────┐     │          │
  │ Web BFF  │────>│          │
  └──────────┘     └──────────┘
  Each BFF tailored to client needs.

Technologies: Kong, NGINX, AWS API Gateway, Envoy.
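
Rate limiting, one of the concerns listed above, is commonly implemented as a token bucket per client; a minimal sketch, not tied to any particular gateway:

```python
import time

class TokenBucket:
    """Allow `rate` requests/sec with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # gateway would respond 429 Too Many Requests
```

A gateway keeps one bucket per API key or client IP, typically in a shared store so limits hold across gateway replicas.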

Service Mesh

Infrastructure layer for service-to-service communication. Handles networking concerns without application code changes.

Without mesh:                With mesh (sidecar proxy):
┌─────────┐                  ┌─────────┬─────────┐
│Service A│──direct──>       │Service A│ Proxy A │
└─────────┘                  └─────────┴────┬────┘
                                            │ (mTLS, retries,
┌─────────┐                  ┌─────────┬────┴────┐  circuit breaking)
│Service B│                  │Service B│ Proxy B │
└─────────┘                  └─────────┴─────────┘

Data Plane:  Sidecar proxies (Envoy) handle all traffic
Control Plane: Configures proxies (Istio Pilot, Linkerd)

Istio Architecture

┌─────────────────────────────────────────────┐
│                Control Plane                │
│  ┌────────┐  ┌─────────┐  ┌──────────────┐  │
│  │ Pilot  │  │ Citadel │  │   Galley     │  │
│  │(routing│  │ (certs/ │  │ (config      │  │
│  │ config)│  │  mTLS)  │  │  validation) │  │
│  └────┬───┘  └────┬────┘  └──────┬───────┘  │
└───────┼───────────┼──────────────┼──────────┘
        │           │              │
┌───────┼───────────┼──────────────┼──────────┐
│       ▼           ▼              ▼          │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐      │
│  │ Envoy A │  │ Envoy B │  │ Envoy C │      │
│  │(sidecar)│  │(sidecar)│  │(sidecar)│      │
│  └────┬────┘  └────┬────┘  └────┬────┘      │
│  ┌────┴────┐  ┌────┴────┐  ┌────┴────┐      │
│  │  Svc A  │  │  Svc B  │  │  Svc C  │      │
│  └─────────┘  └─────────┘  └─────────┘      │
│              Data Plane                     │
└─────────────────────────────────────────────┘

Linkerd

Lighter weight than Istio. Uses its own micro-proxy (linkerd2-proxy, written in Rust) instead of Envoy. Focuses on simplicity and low resource overhead.

Circuit Breaker

Prevents cascading failures by stopping calls to a failing service.

States:
  CLOSED ──(failures > threshold)──> OPEN
    ↑                                   │
    │                            (timeout expires)
    │                                   │
    └──(success)── HALF-OPEN <──────────┘
                   (allow limited requests)

  CLOSED:    Normal operation, requests pass through
  OPEN:      Requests fail immediately (fast-fail)
  HALF-OPEN: Allow a few test requests
             Success → CLOSED
             Failure → OPEN

ENUMERATION CircuitState ← {CLOSED, OPEN, HALF_OPEN}

STRUCTURE CircuitBreaker
    state: CircuitState
    failure_count: integer
    failure_threshold: integer
    success_count: integer
    success_threshold: integer
    last_failure_time: timestamp or NONE
    timeout: duration

PROCEDURE NEW_CIRCUIT_BREAKER(failure_threshold, success_threshold, timeout)
    RETURN CircuitBreaker(
        state ← CLOSED, failure_count ← 0, failure_threshold,
        success_count ← 0, success_threshold,
        last_failure_time ← NONE, timeout)

PROCEDURE ALLOW_REQUEST(cb) → boolean
    IF cb.state = CLOSED THEN RETURN TRUE
    IF cb.state = OPEN THEN
        IF cb.last_failure_time IS NOT NONE THEN
            IF ELAPSED(cb.last_failure_time) > cb.timeout THEN
                cb.state ← HALF_OPEN
                cb.success_count ← 0
                RETURN TRUE
        RETURN FALSE   // fail fast
    IF cb.state = HALF_OPEN THEN RETURN TRUE   // simplified: real implementations cap in-flight probes

PROCEDURE RECORD_SUCCESS(cb)
    IF cb.state = HALF_OPEN THEN
        cb.success_count ← cb.success_count + 1
        IF cb.success_count ≥ cb.success_threshold THEN
            cb.state ← CLOSED
            cb.failure_count ← 0
    ELSE IF cb.state = CLOSED THEN
        cb.failure_count ← 0

PROCEDURE RECORD_FAILURE(cb)
    cb.failure_count ← cb.failure_count + 1
    cb.last_failure_time ← NOW()
    IF cb.state = CLOSED THEN
        IF cb.failure_count ≥ cb.failure_threshold THEN
            cb.state ← OPEN
    ELSE IF cb.state = HALF_OPEN THEN
        cb.state ← OPEN
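
The pseudocode above translates directly into a runnable form; a Python sketch of the same state machine:

```python
import time
from enum import Enum

class CircuitState(Enum):
    CLOSED = 1
    OPEN = 2
    HALF_OPEN = 3

class CircuitBreaker:
    def __init__(self, failure_threshold=5, success_threshold=2, timeout=30.0):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.success_count = 0
        self.success_threshold = success_threshold
        self.last_failure_time = None   # monotonic timestamp
        self.timeout = timeout          # seconds

    def allow_request(self) -> bool:
        if self.state is CircuitState.OPEN:
            # After the timeout expires, let a probe request through.
            if (self.last_failure_time is not None
                    and time.monotonic() - self.last_failure_time > self.timeout):
                self.state = CircuitState.HALF_OPEN
                self.success_count = 0
                return True
            return False  # fail fast
        return True  # CLOSED or HALF_OPEN

    def record_success(self):
        if self.state is CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
        elif self.state is CircuitState.CLOSED:
            self.failure_count = 0  # any success resets the streak

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.monotonic()
        if self.state is CircuitState.CLOSED:
            if self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN
        elif self.state is CircuitState.HALF_OPEN:
            self.state = CircuitState.OPEN  # probe failed: back to OPEN
```

The caller checks `allow_request()` before each call and reports the outcome with `record_success()` / `record_failure()`; libraries like Resilience4j wrap this pattern in a decorator.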

Distributed Tracing (OpenTelemetry)

Track requests across service boundaries. Each request gets a trace ID; each service hop creates a span.

Trace ID: abc-123

Service A (span 1)
├── HTTP GET /order/42         [0ms ─────── 250ms]
│
├── Service B (span 2)
│   ├── gRPC GetUser(42)       [10ms ──── 80ms]
│   └── Redis cache lookup     [15ms ─ 20ms]
│
├── Service C (span 3)
│   ├── gRPC GetInventory(42)  [85ms ──── 180ms]
│   └── PostgreSQL query       [90ms ── 170ms]
│
└── Response to client          [250ms]

Context propagation:
  HTTP headers: traceparent: 00-abc123-span1-01
  gRPC metadata: same format
  Each service extracts context, creates child span, propagates
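
Extraction and injection of the traceparent header can be sketched as follows (simplified handling of the W3C Trace Context format; helper names are illustrative):

```python
import secrets
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpanContext:
    trace_id: str
    span_id: str
    flags: str = "01"

def extract(headers: dict) -> Optional[SpanContext]:
    """Parse a traceparent header: version-trace_id-span_id-flags."""
    parts = headers.get("traceparent", "").split("-")
    if len(parts) != 4:
        return None
    _version, trace_id, span_id, flags = parts
    return SpanContext(trace_id, span_id, flags)

def start_child(parent: SpanContext) -> SpanContext:
    """A child span keeps the trace ID but gets a fresh span ID."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.flags)

def inject(ctx: SpanContext, headers: dict) -> None:
    """Write the context onto outgoing request headers."""
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{ctx.flags}"
```

In practice the OpenTelemetry SDK's propagators do this for you; the point is that the trace ID survives every hop while each hop mints its own span ID.
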

OpenTelemetry Architecture:
  ┌──────────┐     ┌────────────┐     ┌──────────┐
  │   App    │────>│ OTel       │────>│ Backend  │
  │ (SDK)    │     │ Collector  │     │ (Jaeger, │
  │          │     │            │     │ Zipkin,  │
  │ Traces   │     │ - Receive  │     │ Tempo)   │
  │ Metrics  │     │ - Process  │     │          │
  │ Logs     │     │ - Export   │     │          │
  └──────────┘     └────────────┘     └──────────┘

  Three signals: Traces, Metrics, Logs (unified under OTel)

Chaos Engineering

Deliberately inject failures to test system resilience.

Principles (Netflix):
  1. Define steady state (measurable baseline)
  2. Hypothesize steady state continues under failure
  3. Introduce real-world events:
     - Kill instances
     - Inject latency
     - Network partitions
     - CPU/memory pressure
     - Clock skew
  4. Observe: does steady state hold?

Tools:
  - Chaos Monkey: randomly kills instances
  - Litmus Chaos: Kubernetes-native chaos
  - Gremlin: SaaS chaos platform
  - Toxiproxy: network failure simulation

Chaos experiment example:

  Steady state: p99 latency < 200ms, error rate < 0.1%

  Experiment: kill 1 of 3 instances of order-service

  Expected: traffic redistributes, latency spike < 500ms,
            recovery within 30s, no errors

  Actual result? Run it and find out.
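
For in-process experiments, latency and failure injection can be as simple as a wrapper around a service call; a hypothetical Python sketch (tools like Toxiproxy do this at the network layer instead):

```python
import random
import time

def chaos(func, *, fail_rate=0.1, max_delay=0.5, rng=None):
    """Wrap func so a fraction of calls fail and the rest gain latency."""
    rng = rng or random.Random()
    def wrapper(*args, **kwargs):
        if rng.random() < fail_rate:
            raise ConnectionError("chaos: injected failure")
        time.sleep(rng.uniform(0, max_delay))  # injected latency
        return func(*args, **kwargs)
    return wrapper
```

Run the wrapped dependency in a staging (or carefully scoped production) environment and watch whether the steady-state metrics hold.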

Observability: The Three Pillars

┌─────────────────────────────────────────────────┐
│                Observability                     │
│                                                  │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐      │
│  │ Metrics  │  │  Logs    │  │  Traces  │      │
│  │          │  │          │  │          │      │
│  │ Counter  │  │ Struct-  │  │ Spans    │      │
│  │ Gauge    │  │ ured     │  │ Context  │      │
│  │ Histogram│  │ events   │  │ Timing   │      │
│  │          │  │          │  │          │      │
│  │Prometheus│  │ ELK/Loki │  │ Jaeger/  │      │
│  │ Grafana  │  │ Datadog  │  │ Tempo    │      │
│  └──────────┘  └──────────┘  └──────────┘      │
└─────────────────────────────────────────────────┘

Metrics: WHAT is happening (request rate, error rate, latency)
Logs:    WHY it happened (detailed event records)
Traces:  WHERE in the system it happened (request flow)

Key Metrics (RED Method)

For services:
  Rate:      requests per second
  Errors:    failed requests per second
  Duration:  distribution of request latency

For resources (USE Method):
  Utilization: % of time resource is busy
  Saturation:  queue depth / backlog
  Errors:      error count
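
Given a window of request records, the RED numbers fall out directly; a Python sketch with an illustrative record shape:

```python
def red_metrics(requests, window_seconds):
    """requests: list of (ok: bool, duration_ms: float) seen in the window."""
    n = len(requests)
    errors = sum(1 for ok, _ in requests if not ok)
    durations = sorted(d for _, d in requests)
    # p99: the latency below which 99% of requests fall.
    p99 = durations[min(n - 1, int(0.99 * n))] if durations else 0.0
    return {
        "rate_rps": n / window_seconds,
        "error_rps": errors / window_seconds,
        "p99_ms": p99,
    }
```

Real systems compute these as streaming histograms (e.g. Prometheus) rather than sorting raw samples, but the definitions are the same.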

Structured Logging

// Structured logging with tracing
ASYNC PROCEDURE PROCESS_ORDER(order_id, user_id, db) → Order or error
    LOG_INFO("Processing order", order_id ← order_id, user_id ← user_id)

    order ← AWAIT db.GET_ORDER(order_id)
    IF order IS error THEN
        LOG_ERROR("Failed to fetch order from database",
                  order_id ← order_id, error ← error_message)
        RETURN error

    LOG_INFO("Order processed successfully",
             order_id ← order_id,
             total ← order.total,
             items ← LENGTH(order.items))

    RETURN order
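
The pseudocode above corresponds to JSON-per-line structured logging; a stdlib-only Python sketch (a production service would typically use a logging library's JSON formatter):

```python
import json
import sys

def log_json(level: str, message: str, **fields) -> str:
    """Emit one JSON object per event: machine-parseable key/value logs."""
    line = json.dumps({"level": level, "message": message, **fields})
    print(line, file=sys.stdout)
    return line

log_json("info", "Processing order", order_id=42, user_id=7)
```

Because every field is a key/value pair, a log backend can index and query on `order_id` directly instead of grepping free text.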

Anti-Patterns and Pitfalls

1. Distributed Monolith: services tightly coupled, deploy together
   Fix: proper domain boundaries, async communication

2. Chatty services: too many synchronous calls
   Fix: aggregate APIs, batch operations, async patterns

3. Shared database: multiple services accessing one DB
   Fix: each service owns its data, use events to sync

4. No circuit breakers: cascading failures
   Fix: circuit breakers, bulkheads, timeouts on every call

5. Ignoring partial failure: treating network as reliable
   Fix: retries with backoff, idempotency, graceful degradation

6. Missing observability: "it works on my machine"
   Fix: traces, metrics, logs from day one
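
Pitfall 5's fix, retries with exponential backoff, usually adds jitter to avoid synchronized retry storms; a minimal Python sketch (safe only for idempotent calls):

```python
import random
import time

def retry(func, *, attempts=5, base_delay=0.1, max_delay=5.0,
          sleep=time.sleep, rng=None):
    """Retry func on exception with exponential backoff and full jitter."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            # Full jitter: sleep uniformly in [0, min(cap, base * 2^attempt)].
            sleep(rng.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```

Combine this with idempotency keys on the server side so a retried request that already succeeded is not applied twice.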

Real-World Technology Stack

| Concern | Technology |
|---|---|
| API Gateway | Kong, AWS API Gateway, Envoy |
| Service Mesh | Istio, Linkerd, Cilium |
| Circuit Breaker | Resilience4j, Polly, built into mesh |
| Tracing | OpenTelemetry, Jaeger, Zipkin, Datadog |
| Metrics | Prometheus + Grafana, Datadog |
| Logging | ELK Stack, Grafana Loki, Datadog |
| Chaos | Litmus, Gremlin, Chaos Monkey |
| Container Orchestration | Kubernetes |
| CI/CD | ArgoCD, Flux, GitHub Actions |

Key Takeaways

  • Decompose by business capability, not technical layer. Each service should map to a bounded context with its own data store.
  • A service mesh extracts networking concerns (mTLS, retries, observability) from application code into infrastructure, at the cost of operational complexity.
  • Circuit breakers, timeouts, and retries with exponential backoff are non-negotiable in production microservices.
  • Observability (metrics, logs, traces) is not optional. Without it, debugging distributed systems is nearly impossible.
  • Chaos engineering shifts resilience testing left: find failures before your customers do.
  • Start with a monolith, extract services when you understand the domain boundaries. Premature decomposition creates distributed monoliths.