Zero Trust Architecture

Zero trust is a security model that eliminates implicit trust based on network location. Traditional security treats everything inside the corporate network as trusted and everything outside as untrusted. Zero trust assumes no network location, user, or device is inherently trustworthy and verifies every request.

Never Trust, Always Verify

The core principle of zero trust is that every access request must be authenticated, authorized, and encrypted regardless of where it originates.

Traditional Perimeter Security vs Zero Trust

Traditional (castle and moat):
  Outside firewall: untrusted, blocked
  Inside firewall: trusted, full access
  
  Problem: once an attacker gets inside (phishing, VPN compromise,
  insider threat), they can move laterally without restriction.

Zero trust:
  Every request: authenticate, authorize, encrypt
  Network location: irrelevant to trust decisions
  Internal service A calling internal service B: must prove identity
  
  If an attacker compromises one service, they cannot automatically
  access other services. Every hop requires authentication.
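
The contrast between the two models can be sketched in a few lines of Python. This is a minimal illustration, not a real policy engine: the `10.0.0.0/8` range, the `Request` type, and the `ALLOWED_CALLS` set are assumptions made up for the example.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional

# Hypothetical values for illustration only.
CORPORATE_NET = ipaddress.ip_network("10.0.0.0/8")
ALLOWED_CALLS = {("order-service", "payment-service")}


@dataclass(frozen=True)
class Request:
    source_ip: str
    verified_identity: Optional[str]  # None when no valid credential was presented
    target: str


def perimeter_allows(req: Request) -> bool:
    """Castle and moat: trust is decided by network location alone."""
    return ipaddress.ip_address(req.source_ip) in CORPORATE_NET


def zero_trust_allows(req: Request) -> bool:
    """Zero trust: the source IP is ignored; identity and policy decide."""
    if req.verified_identity is None:
        return False
    return (req.verified_identity, req.target) in ALLOWED_CALLS


# An attacker on a compromised internal host without a valid service credential.
attacker = Request("10.2.3.4", None, "payment-service")
# A legitimate caller that has proven its identity.
legit = Request("10.2.3.5", "order-service", "payment-service")
```

Here `perimeter_allows(attacker)` is true because the request comes from inside the network, while `zero_trust_allows(attacker)` is false because no identity was proven.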

Zero Trust Principles

Core principles:
  1. Verify explicitly
     Every request is authenticated and authorized using all available signals:
     identity, device health, location, behavior pattern, data classification.
  
  2. Use least privilege access
     Grant minimum permissions for the minimum time needed.
     Just-in-time access: admin rights granted for 1 hour, then revoked.
     Just-enough access: read-only when write is not needed.
  
  3. Assume breach
     Design as if attackers are already inside the network.
     Minimize blast radius through segmentation.
     Monitor and log everything for detection.
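
Just-in-time access can be modeled as a grant that carries its own expiry. The sketch below assumes a hypothetical `Grant` record and an in-memory list of grants; a real system would persist grants and audit each use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Grant:
    user: str
    permission: str
    expires_at: datetime


def grant_jit(user, permission, now, ttl=timedelta(hours=1)):
    """Issue a time-limited grant; the 1-hour default mirrors the example above."""
    return Grant(user, permission, now + ttl)


def is_permitted(grants, user, permission, now):
    """A permission holds only while a matching, unexpired grant exists."""
    return any(
        g.user == user and g.permission == permission and now < g.expires_at
        for g in grants
    )


start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
grants = [grant_jit("alice", "db:admin", start)]
```

Because the expiry is checked on every call, no separate revocation step is needed once the hour has passed.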

Google's BeyondCorp is the best-known zero trust implementation. After the Operation Aurora attack in 2009, Google moved away from VPN-based access. Google employees access internal applications through an identity-aware proxy that verifies their identity and device health on every request, regardless of whether they are in the office or at home.

Trust Signals

Signals evaluated per request:
  
  Identity:
    Who is making the request?
    Is the identity verified (MFA, certificate)?
    
  Device:
    Is the device managed and compliant?
    Is the OS patched and up to date?
    Is disk encryption enabled?
    Is endpoint detection running?
    
  Context:
    What time is the request? (3 AM login from new location?)
    What is the risk level of the requested resource?
    Does this match the user's normal behavior pattern?
    
  Network:
    Is the connection encrypted?
    Is the source IP from a high-risk geography?
    (Network location alone is NOT sufficient for trust)
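
One way to combine these signals is a small decision function that returns allow, step-up (ask for re-authentication), or deny. The field names, the two risk levels, and the ordering of the checks are assumptions for illustration; real policy engines weigh many more signals.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Signals:
    mfa_verified: bool
    device_managed: bool
    os_patched: bool
    disk_encrypted: bool
    connection_encrypted: bool
    unusual_context: bool  # e.g. a 3 AM login from a new location
    resource_risk: str     # "low" or "high" (hypothetical scale)


def decide(s: Signals) -> str:
    """Return "allow", "step_up", or "deny" for a single request."""
    # Identity and transport encryption are hard requirements.
    if not s.mfa_verified or not s.connection_encrypted:
        return "deny"
    device_ok = s.device_managed and s.os_patched and s.disk_encrypted
    # High-risk resources require a fully compliant device.
    if s.resource_risk == "high" and not device_ok:
        return "deny"
    # Anything unusual or a weaker device triggers re-authentication.
    if s.unusual_context or not device_ok:
        return "step_up"
    return "allow"


baseline = Signals(
    mfa_verified=True, device_managed=True, os_patched=True,
    disk_encrypted=True, connection_encrypted=True,
    unusual_context=False, resource_risk="low",
)
```

Note that no field describes whether the request came from inside the corporate network; the decision does not depend on it.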

Micro-Segmentation

Micro-segmentation divides the network into small, isolated zones. Each zone has its own access controls, limiting how far an attacker can move after compromising a single component.

Network Segmentation Levels

Traditional segmentation:
  DMZ | Application Tier | Database Tier
  Three zones, broad access within each zone.
  Compromise one app server --> access all app servers.

Micro-segmentation:
  Each service has its own security boundary.
  Order Service can talk to Payment Service (port 443 only).
  Order Service CANNOT talk to User Database directly.
  Payment Service can talk to Payment Database (port 5432 only).
  Payment Service CANNOT talk to Order Database.
  
  Even within the same Kubernetes namespace, pods can be isolated
  from each other once network policies select them.
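
The difference in blast radius can be made concrete by listing what a compromised service is allowed to connect to under each model. This is a toy sketch: `user-service` is a hypothetical third member of the application tier, and the rule sets are plain Python sets rather than real firewall state.

```python
def direct_targets(source, allowed):
    """Services that `source` may open a connection to under `allowed`."""
    return {dst for src, dst in allowed if src == source}


# Hypothetical application tier; user-service is invented for the example.
APP_TIER = ["order-service", "payment-service", "user-service"]

# Traditional segmentation: every server in a tier may reach every other one.
flat_rules = {(a, b) for a in APP_TIER for b in APP_TIER if a != b}

# Micro-segmentation: only the explicitly listed paths exist.
micro_rules = {
    ("order-service", "payment-service"),
    ("payment-service", "payment-db"),
}

flat_reach = direct_targets("order-service", flat_rules)
micro_reach = direct_targets("order-service", micro_rules)
```

With the flat tier, a compromised `order-service` can connect to every other application server; with the explicit rules it can reach only `payment-service`.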

Implementation Approaches

Network policy (Kubernetes):
  Define which pods can communicate with which other pods.
  Default deny: no traffic unless explicitly allowed. Kubernetes allows
  all pod traffic by default, so default deny must itself be applied as a policy.
  
  Example policy:
    Allow: order-service --> payment-service (port 443)
    Allow: payment-service --> payment-db (port 5432)
    Deny: all other traffic
    
  A compromised order-service pod cannot reach the payment database
  because no policy allows that traffic.
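
The example policy could be expressed as Kubernetes `NetworkPolicy` objects. The sketch below builds them as Python dictionaries and serializes them to JSON, which `kubectl apply -f` also accepts. The `shop` namespace and the `app` label key are assumptions. For brevity the default-deny policy covers ingress only; locking down egress as well would need additional rules, including one for DNS.

```python
import json


def default_deny_ingress(namespace):
    """Select every pod in the namespace and allow no inbound traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        # An empty podSelector matches all pods; no ingress rules means deny.
        "spec": {"podSelector": {}, "policyTypes": ["Ingress"]},
    }


def allow_ingress(namespace, src_app, dst_app, port):
    """Allow pods labeled app=src_app to reach app=dst_app on one TCP port."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"allow-{src_app}-to-{dst_app}", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": dst_app}},
            "policyTypes": ["Ingress"],
            "ingress": [{
                "from": [{"podSelector": {"matchLabels": {"app": src_app}}}],
                "ports": [{"protocol": "TCP", "port": port}],
            }],
        },
    }


policies = [
    default_deny_ingress("shop"),
    allow_ingress("shop", "order-service", "payment-service", 443),
    allow_ingress("shop", "payment-service", "payment-db", 5432),
]
manifest_json = json.dumps(
    {"apiVersion": "v1", "kind": "List", "items": policies}, indent=2
)
```

Network policies take effect only when the cluster's network plugin enforces them, so the same manifests may behave differently across clusters.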

Software-defined networking:
  VMware NSX, Cisco ACI, cloud VPCs
  Policy defined at the workload level, not the network level
  Follows the workload as it moves between hosts

Host-based firewalls:
  iptables/nftables rules on each host
  Fine-grained but harder to manage at scale

Micro-Segmentation Challenges

Challenges:
  - Mapping all legitimate communication paths (service dependency mapping)
  - Maintaining policies as services evolve (new services, new dependencies)
  - Debugging connectivity issues (blocked traffic looks like service failures)
  - Performance overhead of policy evaluation on every packet
  
Strategies:
  - Start with monitoring mode (log but don't block) to discover traffic patterns
  - Use service mesh telemetry to draft policies from observed traffic, then review them before enforcing
  - Implement gradual rollout: enforce for one service at a time
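
Monitoring mode produces flow records that can be aggregated into candidate allow rules. The record format (`src`, `dst`, `port`) and the `min_count` threshold below are assumptions for illustration; the output is a draft for human review, since observed traffic may include flows that should not be allowed.

```python
from collections import Counter


def candidate_rules(flows, min_count=1):
    """Aggregate observed flows into sorted (src, dst, port) rule candidates."""
    counts = Counter((f["src"], f["dst"], f["port"]) for f in flows)
    return sorted(rule for rule, seen in counts.items() if seen >= min_count)


# Hypothetical flow log collected while policies were in log-only mode.
observed = [
    {"src": "order-service", "dst": "payment-service", "port": 443},
    {"src": "order-service", "dst": "payment-service", "port": 443},
    {"src": "payment-service", "dst": "payment-db", "port": 5432},
    {"src": "payment-service", "dst": "payment-db", "port": 5432},
    {"src": "order-service", "dst": "payment-db", "port": 5432},  # one-off flow
]

draft = candidate_rules(observed, min_count=2)
```

The one-off flow from `order-service` to `payment-db` falls below the threshold and is left out of the draft, which flags it for investigation rather than silently allowing it.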

Netflix uses micro-segmentation across their AWS infrastructure. Each microservice runs in its own security group with explicit ingress and egress rules, limiting the blast radius of any single compromise.

Service Mesh

A service mesh provides infrastructure-level security, observability, and traffic management for service-to-service communication. It implements zero trust networking without requiring application code changes.

Service Mesh Architecture

Service mesh components:

  Data plane (sidecar proxy per service):
    A proxy runs alongside each service instance (Envoy in Istio, linkerd2-proxy in Linkerd)
    Intercepts all inbound and outbound network traffic
    Handles mTLS, authorization, retries, observability
    
  Control plane (centralized management):
    Istiod (Istio) or Linkerd control plane
    Distributes certificates, policies, and configuration
    Manages proxy configuration across all sidecars

  Traffic flow:
    Service A --> [Envoy sidecar A] --mTLS--> [Envoy sidecar B] --> Service B
    
    Application code makes a plain HTTP call to Service B.
    The sidecar handles encryption, authentication, and authorization transparently.

Service Mesh Security Features

Security capabilities:
  
  Automatic mTLS:
    Every service-to-service call is encrypted
    Both sides present certificates (mutual authentication)
    Certificates are automatically rotated
    No application code changes required
  
  Authorization policies:
    Define which services can call which endpoints
    Based on service identity (not IP address)
    Can include request-level conditions (HTTP method, path, headers)
  
  Certificate management:
    Automatic issuance and rotation (24-hour lifetime typical)
    Short-lived certificates limit exposure if compromised
    Identity tied to workload, not to network location

Istio authorization policy example:
  
  Rule: only order-service can call payment-service
  
  Service: payment-service
  Allow:
    Source: order-service (verified by mTLS identity)
    Methods: POST
    Paths: /api/payments/*
  Deny:
    All other sources
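
The semantics of this rule can be sketched as a small evaluator. This models the allow/deny logic in plain Python; it is not Istio's policy engine or its manifest schema, and `AllowRule`, `path_matches`, and `is_allowed` are hypothetical names. The source identity is assumed to have already been verified by mTLS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AllowRule:
    source: str          # caller identity, already verified by mTLS
    methods: frozenset   # permitted HTTP methods
    path_pattern: str    # exact path, or a prefix ending in "*"


def path_matches(pattern, path):
    """Support exact matches and trailing-"*" prefix matches only."""
    if pattern.endswith("*"):
        return path.startswith(pattern[:-1])
    return path == pattern


def is_allowed(rules, source, method, path):
    """Default deny: a request passes only if some rule matches every field."""
    return any(
        r.source == source
        and method in r.methods
        and path_matches(r.path_pattern, path)
        for r in rules
    )


PAYMENT_SERVICE_RULES = [
    AllowRule("order-service", frozenset({"POST"}), "/api/payments/*"),
]
```

Because the rule is keyed on service identity rather than IP address, it keeps working when pods are rescheduled onto different hosts.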

Airbnb adopted Istio service mesh to enforce zero trust between their microservices. Every service call is mutually authenticated, and authorization policies define exactly which services can communicate.

Mutual TLS (mTLS)

Standard TLS authenticates only the server: the client verifies the server's certificate. Mutual TLS requires both sides to present and verify certificates, ensuring both client and server identities are verified.

TLS vs mTLS

Standard TLS:
  Client verifies server certificate: "Am I talking to the real server?"
  Server does NOT verify client: "I'll accept any client"
  Used for: browser-to-server HTTPS

Mutual TLS:
  Client verifies server certificate: "Am I talking to the real server?"
  Server verifies client certificate: "Is this client who it claims to be?"
  Used for: service-to-service communication in zero trust

mTLS Handshake

mTLS handshake (additional steps beyond regular TLS):

  1. Client --> Server: ClientHello (same as TLS)
  2. Server --> Client: ServerHello + server certificate (same as TLS)
  3. Server --> Client: CertificateRequest (NEW: server asks for client cert)
  4. Client --> Server: client certificate + CertificateVerify (NEW: client proves it holds the matching private key)
  5. Both verify each other's certificates
  6. Encrypted communication established
  
  Server now knows: this request comes from "order-service" (not just "some client")
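
With Python's standard `ssl` module, the difference between TLS and mTLS on the server side comes down to requiring a client certificate. The sketch below is a minimal illustration: the helper names are hypothetical, the certificate file paths passed to `load_identity` are supplied by the caller, and the assumption that service identity is carried in a URI subject alternative name is only one common convention.

```python
import ssl


def new_server_context():
    """Server side: CERT_REQUIRED makes the server request and verify a client cert."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.verify_mode = ssl.CERT_REQUIRED
    return ctx


def new_client_context():
    """Client side: PROTOCOL_TLS_CLIENT verifies the server cert and hostname by default."""
    return ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)


def load_identity(ctx, cert_file, key_file, ca_file):
    """Load this side's own certificate and the CA used to verify the peer."""
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    ctx.load_verify_locations(cafile=ca_file)


def peer_uris(peer_cert):
    """Extract URI subject alternative names from a getpeercert() dictionary."""
    return [value for kind, value in peer_cert.get("subjectAltName", ()) if kind == "URI"]
```

After a successful handshake, calling `getpeercert()` on the wrapped socket returns a dictionary that `peer_uris` can read, which is how the server would learn that the caller is `order-service` rather than an anonymous client.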

mTLS in Practice

mTLS implementation approaches:

  Service mesh (recommended):
    Sidecar proxies handle mTLS automatically
    Applications see plain HTTP traffic
    Certificate rotation is transparent
    Zero application code changes
  
  Application-level:
    Application configures TLS with client certificate
    Must handle certificate loading, rotation, and error handling
    More control but more development work
  
  Infrastructure-level:
    Load balancers or API gateways terminate mTLS
    Backend services receive verified identity as a header
    Simpler for backends but trust shifts to the infrastructure

Certificate Lifecycle

Short-lived certificate pattern:
  1. Service starts, requests certificate from CA (e.g., Istio's built-in CA in istiod, formerly Citadel)
  2. CA issues certificate valid for 24 hours
  3. Service uses certificate for mTLS
  4. Before expiration, service requests new certificate
  5. Old certificate expires, limiting exposure window
  
  Traditional: certificates valid for 1-2 years
    If compromised, attacker may have months of access before it expires
    
  Zero trust: certificates valid for hours
    If compromised, attacker has hours at most
    Automatic rotation eliminates manual renewal failures
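
The rotation decision and the resulting exposure window reduce to simple date arithmetic. In this sketch, renewing at half the certificate lifetime is an assumed policy chosen for illustration, not a documented default of any particular mesh.

```python
from datetime import datetime, timedelta, timezone


def should_renew(issued_at, expires_at, now, renew_fraction=0.5):
    """Renew once the given fraction of the certificate lifetime has elapsed."""
    lifetime = expires_at - issued_at
    return now >= issued_at + lifetime * renew_fraction


def exposure_window(expires_at, stolen_at):
    """How long a stolen certificate remains usable before it expires."""
    return max(expires_at - stolen_at, timedelta(0))


issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
short_lived_expiry = issued + timedelta(hours=24)
long_lived_expiry = issued + timedelta(days=365)
stolen = issued + timedelta(hours=20)
```

For a key stolen 20 hours after issuance, the 24-hour certificate leaves an attacker 4 hours, while the one-year certificate leaves more than 364 days.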

Stripe uses mTLS between all internal services. Every API call between their payment processing components requires mutual certificate verification, ensuring that even if an attacker gains network access, they cannot impersonate a service.

Zero Trust Implementation Strategy

Phased rollout:

  Phase 1: Visibility (months 1-3)
    Deploy service mesh in permissive mode
    Map all service-to-service communication
    Identify all dependencies and traffic patterns
    No enforcement yet
    
  Phase 2: Authentication (months 3-6)
    Enable mTLS in strict mode
    All service communication is encrypted and authenticated
    Services without certificates cannot communicate
    
  Phase 3: Authorization (months 6-12)
    Define and enforce authorization policies
    Start with coarse policies (service-level allow/deny)
    Gradually add fine-grained rules (method, path, headers)
    
  Phase 4: Continuous improvement
    Add device health checks and user context
    Implement just-in-time access for privileged operations
    Automate policy updates based on observed traffic patterns

Common Pitfalls

Treating zero trust as a product. Zero trust is an architecture and a mindset, not a single tool you can buy. It requires changes across identity, network, application, and data layers.
Skipping the visibility phase. Enforcing strict policies without understanding existing traffic patterns causes outages. Always start with observe-only mode.
mTLS without authorization. Encrypting and authenticating traffic without authorization policies means any authenticated service can call any other service. Authentication without authorization is incomplete.
Ignoring legacy systems. Older systems that cannot participate in mTLS or identity-aware access create holes in zero trust coverage. Plan migration paths or use compensating controls (network segmentation, API gateways).
Certificate management failures. Expired certificates cause outages. Automated rotation with monitoring for expiration is essential. Manual certificate management does not scale.
Policy sprawl. Thousands of fine-grained rules become impossible to audit and maintain. Start coarse and refine only where needed.

Key Takeaways

Zero trust eliminates implicit trust based on network location. Every request must be authenticated, authorized, and encrypted regardless of origin.
Micro-segmentation limits blast radius by isolating services into small security zones with explicit communication policies.
Service meshes implement zero trust at the infrastructure level, providing automatic mTLS, authorization, and certificate management without application code changes.
mTLS ensures both sides of a service-to-service connection verify each other's identity, replacing the traditional model of trusting anything inside the network.
Implement zero trust incrementally: visibility first, authentication second, authorization third. Rushing to enforcement without understanding traffic patterns causes outages.