Microservices Architecture
Service Decomposition
Breaking a monolith into independently deployable services. Each service owns its data and communicates over the network.
Monolith:                    Microservices:
┌─────────────────┐          ┌──────┐ ┌──────┐ ┌──────┐
│ Auth    Orders  │          │ Auth │ │Orders│ │Invent│
│ Inventory       │   ──>    │ [DB] │ │ [DB] │ │ [DB] │
│ Payments        │          └──────┘ └──────┘ └──────┘
│ [Single DB]     │          ┌──────┐ ┌──────┐
└─────────────────┘          │ Pay  │ │Notify│
                             │ [DB] │ │ [DB] │
                             └──────┘ └──────┘
Decomposition Strategies
By business capability: Align services to business domains (DDD bounded contexts).
E-commerce bounded contexts:
┌──────────┐   ┌───────────┐   ┌──────────┐
│ Catalog  │   │   Order   │   │ Shipping │
│ Context  │   │  Context  │   │ Context  │
│          │   │           │   │          │
│ Product  │   │ Order     │   │ Shipment │
│ Category │   │ LineItem  │   │ Tracking │
│ Price    │   │ Payment   │   │ Carrier  │
└──────────┘   └───────────┘   └──────────┘
"Product" in Catalog = full description, images, specs
"Product" in Order = just SKU, name, price at purchase time
Different models for different contexts.
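These context-specific models can be sketched as separate types; the field names here are illustrative, not from any real catalog:

```python
import dataclasses
from dataclasses import dataclass

# The "same" Product, modeled differently per bounded context.

@dataclass
class CatalogProduct:
    """Catalog context: rich, mutable product description."""
    sku: str
    name: str
    description: str
    image_urls: list[str]
    current_price_cents: int

@dataclass(frozen=True)
class OrderLineProduct:
    """Order context: immutable snapshot of what was bought, at what price."""
    sku: str
    name: str
    price_at_purchase_cents: int
```

Freezing the order-side model captures the domain rule that a line item's price must not change after purchase, even if the catalog price does.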
Strangler Fig Pattern: Incrementally replace monolith functionality with microservices.
Phase 1: [Monolith] handles everything
Phase 2: [Proxy] ──> [Monolith]      (most routes)
                 ──> [New Service]   (migrated routes)
Phase 3: [Proxy] ──> [New Service A]
                 ──> [New Service B]
                 ──> [Monolith]      (shrinking)
Phase N: [Proxy] ──> [Service A] [Service B] ... [Service N]
         (monolith decommissioned)
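During phases 2 and 3, the proxy's job reduces to a routing table; a minimal sketch, where the path prefixes and service names are hypothetical:

```python
# Strangler-fig routing at the proxy: migrated prefixes go to new services,
# everything else falls through to the monolith.
MIGRATED_PREFIXES = {
    "/orders": "orders-service",
    "/inventory": "inventory-service",
}

def route(path: str) -> str:
    """Pick the backend for a request path."""
    for prefix, service in MIGRATED_PREFIXES.items():
        if path.startswith(prefix):
            return service
    return "monolith"
```

Migrating a capability is then just adding an entry to the table; rolling back is deleting it.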
API Gateway
Single entry point for all client requests. Handles cross-cutting concerns.
                  ┌───────────────────┐
Mobile App ─────> │                   │ ──> Auth Service
Web App ────────> │    API Gateway    │ ──> Order Service
3rd Party ──────> │                   │ ──> Product Service
                  │ - Routing         │ ──> Payment Service
                  │ - Auth/AuthZ      │
                  │ - Rate limiting   │
                  │ - SSL termination │
                  │ - Request/Response│
                  │   transformation  │
                  │ - Caching         │
                  └───────────────────┘
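Gateway-side rate limiting is commonly a token bucket per client; a minimal sketch (rate and capacity values are illustrative):

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter, one bucket per client.

    rate: tokens refilled per second; capacity: maximum burst size.
    """

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A gateway would keep one bucket per API key or client IP and reject with HTTP 429 when `allow()` returns False.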
Backend for Frontend (BFF):
┌──────────┐      ┌──────────┐
│Mobile BFF│ ───> │          │
└──────────┘      │ Services │
┌──────────┐      │          │
│ Web BFF  │ ───> │          │
└──────────┘      └──────────┘
Each BFF tailored to client needs.
Technologies: Kong, NGINX, AWS API Gateway, Envoy.
Service Mesh
Infrastructure layer for service-to-service communication. Handles networking concerns without application code changes.
Without mesh:              With mesh (sidecar proxy):

┌─────────┐                ┌─────────┬─────────┐
│Service A│ ──direct──>    │Service A│ Proxy A │
└─────────┘                └─────────┴────┬────┘
                                          │ (mTLS, retries,
┌─────────┐                ┌─────────┬────┴────┐  circuit breaking)
│Service B│                │Service B│ Proxy B │
└─────────┘                └─────────┴─────────┘
Data Plane: Sidecar proxies (Envoy) handle all traffic
Control Plane: Configures proxies (Istio Pilot, Linkerd)
Istio Architecture
┌─────────────────────────────────────────────┐
│                Control Plane                │
│  ┌────────┐  ┌─────────┐  ┌──────────────┐  │
│  │ Pilot  │  │ Citadel │  │    Galley    │  │
│  │(routing│  │ (certs/ │  │   (config    │  │
│  │ config)│  │  mTLS)  │  │  validation) │  │
│  └────┬───┘  └────┬────┘  └──────┬───────┘  │
└───────┼───────────┼──────────────┼──────────┘
        │           │              │
┌───────┼───────────┼──────────────┼──────────┐
│       ▼           ▼              ▼          │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐      │
│  │ Envoy A │  │ Envoy B │  │ Envoy C │      │
│  │(sidecar)│  │(sidecar)│  │(sidecar)│      │
│  └────┬────┘  └────┬────┘  └────┬────┘      │
│  ┌────┴────┐  ┌────┴────┐  ┌────┴────┐      │
│  │  Svc A  │  │  Svc B  │  │  Svc C  │      │
│  └─────────┘  └─────────┘  └─────────┘      │
│                 Data Plane                  │
└─────────────────────────────────────────────┘
Note: since Istio 1.5, Pilot, Citadel, and Galley are consolidated into a single istiod binary; the diagram shows their logical roles.
Linkerd
Lighter weight than Istio. Uses its own micro-proxy (linkerd2-proxy, written in Rust) instead of Envoy. Focuses on simplicity and low resource overhead.
Circuit Breaker
Prevents cascading failures by stopping calls to a failing service.
States:
  CLOSED ──(failures ≥ threshold)──> OPEN
    ↑                                 │
    │                        (timeout expires)
    │                                 │
    └──(success)── HALF-OPEN <────────┘
                   (allow limited requests)
CLOSED:    Normal operation, requests pass through
OPEN:      Requests fail immediately (fast-fail)
HALF-OPEN: Allow a few test requests
           Success → CLOSED
           Failure → OPEN
A runnable Python version of this state machine:

import time
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(self, failure_threshold, success_threshold, timeout):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.success_count = 0
        self.success_threshold = success_threshold
        self.last_failure_time = None
        self.timeout = timeout  # seconds the circuit stays open before probing

    def allow_request(self):
        if self.state is CircuitState.CLOSED:
            return True
        if self.state is CircuitState.OPEN:
            # Once the timeout expires, move to HALF_OPEN and let a probe through.
            if (self.last_failure_time is not None
                    and time.monotonic() - self.last_failure_time > self.timeout):
                self.state = CircuitState.HALF_OPEN
                self.success_count = 0
                return True
            return False  # fail fast
        return True  # HALF_OPEN: allow test requests

    def record_success(self):
        if self.state is CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
        elif self.state is CircuitState.CLOSED:
            self.failure_count = 0

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.monotonic()
        if self.state is CircuitState.CLOSED:
            if self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN
        elif self.state is CircuitState.HALF_OPEN:
            self.state = CircuitState.OPEN  # any probe failure reopens the circuit
Distributed Tracing (OpenTelemetry)
Track requests across service boundaries. Each request gets a trace ID; each service hop creates a span.
Trace ID: abc-123
Service A (span 1)
├── HTTP GET /order/42 [0ms ─────── 250ms]
│
├── Service B (span 2)
│ ├── gRPC GetUser(42) [10ms ──── 80ms]
│ └── Redis cache lookup [15ms ─ 20ms]
│
├── Service C (span 3)
│ ├── gRPC GetInventory(42) [85ms ──── 180ms]
│ └── PostgreSQL query [90ms ── 170ms]
│
└── Response to client [250ms]
Context propagation:
  HTTP headers:  traceparent: 00-{trace-id}-{parent-id}-{flags}
                 (W3C Trace Context: 32-hex-char trace ID, 16-hex-char parent span ID)
  gRPC metadata: same format
  Each service extracts the incoming context, creates a child span, and propagates the new context downstream.
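A hand-rolled sketch of that propagation step (OpenTelemetry SDKs do this for you; this simplified parser accepts only version-00 headers):

```python
import re

# W3C Trace Context traceparent layout: version-traceid-parentid-flags.
TRACEPARENT_RE = re.compile(
    r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header):
    """Extract (trace_id, parent_span_id) from an incoming traceparent header."""
    m = TRACEPARENT_RE.match(header)
    if m is None:
        return None
    trace_id, span_id, _flags = m.groups()
    return trace_id, span_id

def make_traceparent(trace_id, span_id, sampled=True):
    """Build the outgoing header, with this service's new span as the parent."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
```

The trace ID stays constant across the whole request; only the parent span ID changes at each hop.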
OpenTelemetry Architecture:
┌──────────┐     ┌────────────┐     ┌──────────┐
│   App    │────>│    OTel    │────>│ Backend  │
│  (SDK)   │     │  Collector │     │ (Jaeger, │
│          │     │            │     │  Zipkin, │
│  Traces  │     │ - Receive  │     │  Tempo)  │
│  Metrics │     │ - Process  │     │          │
│  Logs    │     │ - Export   │     │          │
└──────────┘     └────────────┘     └──────────┘
Three signals: Traces, Metrics, Logs (unified under OTel)
Chaos Engineering
Deliberately inject failures to test system resilience.
Principles (Netflix):
1. Define steady state (measurable baseline)
2. Hypothesize steady state continues under failure
3. Introduce real-world events:
- Kill instances
- Inject latency
- Network partitions
- CPU/memory pressure
- Clock skew
4. Observe: does steady state hold?
Tools:
- Chaos Monkey: randomly kills instances
- Litmus Chaos: Kubernetes-native chaos
- Gremlin: SaaS chaos platform
- Toxiproxy: network failure simulation
Chaos experiment example:
Steady state:   p99 latency < 200ms, error rate < 0.1%
Experiment:     kill 1 of 3 instances of order-service
Expected:       traffic redistributes, latency spike < 500ms,
                recovery within 30s, no errors
Actual result?  Run it and find out.
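In-process fault injection, the idea behind a tool like Toxiproxy, can be approximated with a wrapper that adds latency or errors at a configured rate; this sketch is illustrative, not any tool's API:

```python
import random
import time

def inject_faults(func, latency_s=0.0, error_rate=0.0, rng=None):
    """Wrap a callable to behave like a degraded dependency.

    latency_s:  extra delay added to every call
    error_rate: probability of raising instead of calling through
    """
    rng = rng if rng is not None else random.Random()

    def wrapper(*args, **kwargs):
        if latency_s:
            time.sleep(latency_s)
        if rng.random() < error_rate:
            raise ConnectionError("injected fault")
        return func(*args, **kwargs)

    return wrapper
```

Wrapping a client this way in a test environment lets you verify the hypothesized steady state before running chaos experiments in production.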
Observability: The Three Pillars
┌─────────────────────────────────────────────────┐
│                  Observability                  │
│                                                 │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐     │
│  │ Metrics  │   │   Logs   │   │  Traces  │     │
│  │          │   │          │   │          │     │
│  │ Counter  │   │ Struct-  │   │ Spans    │     │
│  │ Gauge    │   │ ured     │   │ Context  │     │
│  │ Histogram│   │ events   │   │ Timing   │     │
│  │          │   │          │   │          │     │
│  │Prometheus│   │ ELK/Loki │   │ Jaeger/  │     │
│  │ Grafana  │   │ Datadog  │   │ Tempo    │     │
│  └──────────┘   └──────────┘   └──────────┘     │
└─────────────────────────────────────────────────┘
Metrics: WHAT is happening (request rate, error rate, latency)
Logs: WHY it happened (detailed event records)
Traces: WHERE in the system it happened (request flow)
Key Metrics (RED Method)
For services:
Rate: requests per second
Errors: failed requests per second
Duration: distribution of request latency
For resources (USE Method):
Utilization: % of time resource is busy
Saturation: queue depth / backlog
Errors: error count
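A toy tracker for the RED numbers (a real service would use a metrics library such as prometheus_client; this only shows the arithmetic, with a naive nearest-rank p99):

```python
import time
from bisect import insort

class REDTracker:
    """Tracks Rate, Errors, and Duration for one service endpoint."""

    def __init__(self):
        self.start = time.monotonic()
        self.requests = 0
        self.errors = 0
        self.durations = []  # kept sorted for percentile lookup

    def observe(self, duration_s, is_error=False):
        self.requests += 1
        if is_error:
            self.errors += 1
        insort(self.durations, duration_s)

    def rate(self):
        """Requests per second since the tracker started."""
        return self.requests / max(time.monotonic() - self.start, 1e-9)

    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0

    def p99(self):
        """Nearest-rank 99th percentile of request duration."""
        if not self.durations:
            return 0.0
        idx = max(int(0.99 * len(self.durations)) - 1, 0)
        return self.durations[idx]
```

Production systems use histogram buckets rather than storing every sample, trading exactness for bounded memory.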
Structured Logging
# Structured logging: each event carries machine-readable key/value fields
# (rendered by a JSON formatter or a library such as structlog).
import logging

logger = logging.getLogger("orders")

async def process_order(order_id, user_id, db):
    logger.info("Processing order",
                extra={"order_id": order_id, "user_id": user_id})
    try:
        order = await db.get_order(order_id)
    except Exception as exc:
        logger.error("Failed to fetch order from database",
                     extra={"order_id": order_id, "error": str(exc)})
        raise
    logger.info("Order processed successfully",
                extra={"order_id": order_id,
                       "total": order.total,
                       "items": len(order.items)})
    return order
Anti-Patterns and Pitfalls
1. Distributed Monolith: services tightly coupled, deploy together
Fix: proper domain boundaries, async communication
2. Chatty services: too many synchronous calls
Fix: aggregate APIs, batch operations, async patterns
3. Shared database: multiple services accessing one DB
Fix: each service owns its data, use events to sync
4. No circuit breakers: cascading failures
Fix: circuit breakers, bulkheads, timeouts on every call
5. Ignoring partial failure: treating network as reliable
Fix: retries with backoff, idempotency, graceful degradation
6. Missing observability: "it works on my machine"
Fix: traces, metrics, logs from day one
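The fix for pitfall 5 can be sketched as retries with exponential backoff and full jitter (attempt counts and delays are illustrative; only safe for idempotent operations):

```python
import random
import time

def retry_with_backoff(func, max_attempts=5, base_delay_s=0.1, max_delay_s=5.0):
    """Call func, retrying on exception with exponential backoff + full jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            time.sleep(random.uniform(0, delay))  # jitter avoids thundering herds
```

Full jitter (sleeping a random fraction of the backoff window) spreads retries from many clients so they do not hammer a recovering service in lockstep.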
Real-World Technology Stack
| Concern                 | Technology                             |
|-------------------------|----------------------------------------|
| API Gateway             | Kong, AWS API Gateway, Envoy           |
| Service Mesh            | Istio, Linkerd, Cilium                 |
| Circuit Breaker         | Resilience4j, Polly, built into mesh   |
| Tracing                 | OpenTelemetry, Jaeger, Zipkin, Datadog |
| Metrics                 | Prometheus + Grafana, Datadog          |
| Logging                 | ELK Stack, Grafana Loki, Datadog       |
| Chaos                   | Litmus, Gremlin, Chaos Monkey          |
| Container Orchestration | Kubernetes                             |
| CI/CD                   | ArgoCD, Flux, GitHub Actions           |
Key Takeaways
- Decompose by business capability, not technical layer. Each service should map to a bounded context with its own data store.
- A service mesh extracts networking concerns (mTLS, retries, observability) from application code into infrastructure, at the cost of operational complexity.
- Circuit breakers, timeouts, and retries with exponential backoff are non-negotiable in production microservices.
- Observability (metrics, logs, traces) is not optional. Without it, debugging distributed systems is nearly impossible.
- Chaos engineering shifts resilience testing left: find failures before your customers do.
- Start with a monolith, extract services when you understand the domain boundaries. Premature decomposition creates distributed monoliths.