
Cloud Observability

The Three Pillars

Observability is the ability to understand a system's internal state from its external outputs.

                    Observability
            ┌───────────┼───────────┐
         Metrics      Logs       Traces
      (aggregated   (discrete   (request
       numbers)     events)     flow)
         │            │            │
     CloudWatch   CloudWatch   X-Ray
     Prometheus    Logs         Jaeger
     Datadog      Fluentd      Tempo

Monitoring

Amazon CloudWatch

CloudWatch is AWS's native monitoring and observability platform.

Core components:

  • Metrics: Time-series data points (CPU, memory, custom)
  • Alarms: Trigger actions based on metric thresholds
  • Logs: Centralized log collection and querying
  • Dashboards: Visualize metrics and logs together
  • Events/EventBridge: React to state changes

CloudWatch Metric Anatomy:
  Namespace:  AWS/EC2
  MetricName: CPUUtilization
  Dimensions: InstanceId=i-1234567890abcdef0
  Statistic:  Average
  Period:     300 seconds
  Value:      72.5
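
The anatomy above maps directly onto the CloudWatch API. A minimal sketch of retrieving that exact metric with boto3's `get_metric_statistics` (the instance ID is the placeholder from the example; a real call needs AWS credentials):

```python
from datetime import datetime, timedelta, timezone

def cpu_request_params(instance_id, minutes=60, period=300):
    """Build get_metric_statistics parameters matching the anatomy above."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period,          # seconds; must be a multiple of 60
        "Statistics": ["Average"],
    }

def fetch_cpu(instance_id):
    import boto3  # requires AWS credentials to actually call
    cw = boto3.client("cloudwatch")
    resp = cw.get_metric_statistics(**cpu_request_params(instance_id))
    # Datapoints come back unordered; sort by timestamp before plotting
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
```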

Key AWS Metrics to Monitor

| Service  | Metric                        | Alarm Threshold    |
|----------|-------------------------------|--------------------|
| EC2      | CPUUtilization                | > 80% sustained    |
| ALB      | TargetResponseTime            | > 500ms p99        |
| ALB      | HTTPCode_Target_5XX_Count     | > 0                |
| Lambda   | Errors                        | > 0                |
| Lambda   | Duration                      | > 80% of timeout   |
| RDS      | FreeStorageSpace              | < 20% remaining    |
| SQS      | ApproximateAgeOfOldestMessage | > acceptable delay |
| DynamoDB | ThrottledRequests             | > 0                |

Google Cloud Monitoring (formerly Stackdriver)

  • Metrics Explorer: Query and visualize any metric
  • Uptime checks: HTTP, TCP, HTTPS probes from global locations
  • Alerting policies: Multi-condition alerts with notification channels
  • Service monitoring: SLO tracking based on request latency and availability
  • MQL: Monitoring Query Language for advanced metric queries

CloudWatch Alarms

Metric ──► Evaluation ──► State Change ──► Action
                          │
                    ┌─────┼─────┐
                    OK   ALARM  INSUFFICIENT_DATA
                          │
                    ┌─────┼─────────────┐
                    SNS   Auto Scaling   Lambda
                    │     EC2 action     Custom
                    Email/SMS/PagerDuty  remediation

# CloudWatch alarm via CloudFormation
HighCpuAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: high-cpu-web-servers
    MetricName: CPUUtilization
    Namespace: AWS/EC2
    Statistic: Average
    Period: 300
    EvaluationPeriods: 3
    Threshold: 80
    ComparisonOperator: GreaterThanThreshold
    Dimensions:
      - Name: AutoScalingGroupName
        Value: !Ref WebServerASG
    AlarmActions:
      - !Ref ScaleUpPolicy
      - !Ref OpsNotificationTopic

Prometheus and Grafana (Cloud-Agnostic)

┌──────────┐     ┌────────────┐     ┌─────────┐
│ App with │◄────│ Prometheus │────►│ Grafana │
│ /metrics │pull │  Server    │     │Dashboard│
│ endpoint │     │  (TSDB)    │     │         │
└──────────┘     │  AlertMgr  │────►│ Alerts  │
                 └────────────┘     └─────────┘

Managed offerings:

  • Amazon Managed Prometheus (AMP): Serverless Prometheus-compatible monitoring
  • Amazon Managed Grafana (AMG): Managed Grafana with SSO integration
  • GCP Managed Prometheus: Globally scalable, Prometheus-compatible
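
The pull model in the diagram hinges on the `/metrics` endpoint. In practice a client library such as prometheus_client generates it; the sketch below is a dependency-free, stdlib-only illustration of the text exposition format Prometheus scrapes (the counter name and port are illustrative):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

COUNTERS = {"http_requests_total": 0}

def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in counters.items():
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics(COUNTERS).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)

# To serve for a Prometheus scrape target:
# HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

Prometheus then pulls this endpoint on its configured scrape interval and stores the samples in its TSDB.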

Log Aggregation

Architecture

Sources              Collection          Storage/Query
┌─────────┐
│ App logs │─┐       ┌──────────┐       ┌──────────────┐
│ (stdout) │ │       │ Fluentd/ │       │ CloudWatch   │
└─────────┘ ├──────►│ Fluent   │──────►│ Logs         │
┌─────────┐ │       │ Bit      │       ├──────────────┤
│ System  │─┤       └──────────┘       │ OpenSearch   │
│ logs    │ │                          ├──────────────┤
└─────────┘ │       ┌──────────┐       │ S3 + Athena  │
┌─────────┐ │       │CloudWatch│       ├──────────────┤
│ AWS svc │─┘       │ Agent    │──────►│ Datadog/     │
│ logs    │         └──────────┘       │ Splunk       │
└─────────┘                            └──────────────┘
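
The collection stage in the diagram is usually just configuration. A hedged Fluent Bit sketch, tailing JSON application logs into CloudWatch Logs (paths, group name, and region are placeholders):

```
[INPUT]
    Name              tail
    Path              /var/log/app/*.log
    Parser            json

[OUTPUT]
    Name              cloudwatch_logs
    Match             *
    region            us-east-1
    log_group_name    /app/orders
    log_stream_prefix fluent-
    auto_create_group true
```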

Structured Logging

{
  "timestamp": "2026-03-24T10:15:30.123Z",
  "level": "ERROR",
  "service": "order-service",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "message": "Payment processing failed",
  "error_code": "GATEWAY_TIMEOUT",
  "order_id": "ORD-9876",
  "customer_id": "CUST-5432",
  "duration_ms": 30000
}

Best practices:

  • Use JSON format for machine parseability
  • Include correlation IDs (trace_id) for request tracing
  • Log at appropriate levels (DEBUG, INFO, WARN, ERROR)
  • Never log sensitive data (PII, credentials, tokens)
  • Set log retention policies to control costs
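
The practices above can be sketched with the stdlib logging module; a minimal JSON formatter (the service name and field list are illustrative, matching the example record):

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record, matching the fields above."""
    def format(self, record):
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ",
                                       time.gmtime(record.created)),
            "level": record.levelname,
            "service": "order-service",  # hypothetical service name
            "message": record.getMessage(),
        }
        # Correlation and business fields are passed via `extra=`
        for key in ("trace_id", "span_id", "error_code", "order_id", "duration_ms"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("order-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment processing failed",
             extra={"trace_id": "abc123def456", "error_code": "GATEWAY_TIMEOUT"})
```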

CloudWatch Logs Insights

# Find the top 10 slowest API endpoints in the last hour
fields @timestamp, @message
| filter @message like /duration_ms/
| parse @message '"path":"*","duration_ms":*}' as path, duration
| stats avg(duration) as avg_ms, max(duration) as max_ms, count() as requests by path
| sort max_ms desc
| limit 10

Distributed Tracing

How Tracing Works

Client ──► API Gateway ──► Order Service ──► Payment Service
  │            │                │                    │
  │     Trace: abc123    Span: order-1         Span: pay-1
  │     Span: gw-1       Parent: gw-1          Parent: order-1
  │            │                │                    │
  ▼            ▼                ▼                    ▼
  ┌──────────────────────────────────────────────────┐
  │ Trace abc123                                      │
  │ ├── gw-1     [════════════════════════════]       │
  │ ├── order-1    [══════════════════]               │
  │ ├── db-query     [═══]                            │
  │ └── pay-1              [══════════]               │
  │                            Total: 450ms           │
  └──────────────────────────────────────────────────┘

Tracing Services

| Service       | Provider | Protocol                      |
|---------------|----------|-------------------------------|
| AWS X-Ray     | AWS      | X-Ray SDK, OpenTelemetry      |
| Cloud Trace   | GCP      | OpenTelemetry, Zipkin         |
| Azure Monitor | Azure    | OpenTelemetry                 |
| Jaeger        | OSS      | OpenTracing, OpenTelemetry    |
| Tempo         | Grafana  | OpenTelemetry, Zipkin, Jaeger |

OpenTelemetry

OpenTelemetry (OTel) is the industry-standard framework for collecting telemetry data.

# Auto-instrumentation with OpenTelemetry
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True)  # plaintext gRPC to a local collector
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Auto-instrument frameworks
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()

# Manual span for custom business logic
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    span.set_attribute("order.total", total)
    result = process(order_id)

Custom Metrics

Business Metrics

Beyond infrastructure metrics, track domain-specific indicators.

# Publish custom metric to CloudWatch
import boto3

cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_data(
    Namespace='OrderService',
    MetricData=[{
        'MetricName': 'OrderValue',
        'Dimensions': [
            {'Name': 'Region', 'Value': 'us-east-1'},
            {'Name': 'PaymentType', 'Value': 'credit_card'},
        ],
        'Value': order_total,
        'Unit': 'None',
        'StorageResolution': 60  # Standard resolution
    }]
)

SLIs, SLOs, and SLAs

| Concept         | Definition                              | Example                       |
|-----------------|-----------------------------------------|-------------------------------|
| SLI (Indicator) | Quantitative measure of service         | 99.2% of requests < 200ms     |
| SLO (Objective) | Target value for an SLI                 | 99.5% of requests < 200ms     |
| SLA (Agreement) | Contract with consequences              | 99.9% uptime or credit issued |
| Error budget    | Allowed failures (1 - SLO) minus actual | 0.5% - 0.2% = 0.3% remaining  |
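
The error-budget arithmetic in the last row can be made concrete. A small sketch, assuming a 30-day rolling window (the function and field names are illustrative):

```python
def error_budget(slo, observed_failure_ratio, window_minutes=30 * 24 * 60):
    """Return allowed, consumed, and remaining budget for a rolling window."""
    allowed = 1.0 - slo                        # e.g. 0.5% for a 99.5% SLO
    remaining = allowed - observed_failure_ratio
    return {
        "allowed_pct": allowed * 100,
        "consumed_pct": observed_failure_ratio * 100,
        "remaining_pct": remaining * 100,
        # budget expressed as minutes of total outage the window still tolerates
        "remaining_minutes": remaining * window_minutes,
    }

b = error_budget(slo=0.995, observed_failure_ratio=0.002)
# matches the table: 0.5% allowed - 0.2% consumed = 0.3% remaining
```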

Dashboards and Alerting

Dashboard Design Principles

  1. USE method for resources: Utilization, Saturation, Errors
  2. RED method for services: Rate, Errors, Duration
  3. Layered views: Overview → service → instance drill-down
  4. Business context: Correlate technical metrics with business KPIs

Effective Alerting

| Principle                     | Practice                                         |
|-------------------------------|--------------------------------------------------|
| Alert on symptoms, not causes | Alert on error rate, not CPU                     |
| Reduce noise                  | Multi-window, multi-burn-rate alerts             |
| Actionable alerts only        | Every alert should have a runbook                |
| Appropriate urgency           | Page for customer impact, ticket for degradation |
| Avoid alert fatigue           | Review and prune alerts quarterly                |
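
Multi-window, multi-burn-rate alerting can be sketched in a few lines. Burn rate is how fast the error budget is being consumed relative to the SLO; the 14.4 threshold below is the commonly cited value for burning 2% of a 30-day budget in one hour (the function names and default SLO are illustrative):

```python
def burn_rate(error_ratio, slo):
    """Budget consumption speed: 1.0 means exactly on budget for the window."""
    return error_ratio / (1.0 - slo)

def should_page(short_window_ratio, long_window_ratio, slo=0.999, threshold=14.4):
    # Page only when both a short and a long window exceed the burn-rate
    # threshold: fast enough to matter, sustained enough to be real.
    return (burn_rate(short_window_ratio, slo) >= threshold
            and burn_rate(long_window_ratio, slo) >= threshold)
```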

Cost Monitoring and Optimization

AWS Cost Management Tools

| Tool                   | Purpose                                 |
|------------------------|-----------------------------------------|
| Cost Explorer          | Visualize spending trends and forecasts |
| Budgets                | Set spending thresholds with alerts     |
| Cost Anomaly Detection | ML-based unusual spend detection        |
| Savings Plans          | Commit to usage for discounts           |
| Trusted Advisor        | Recommendations for cost optimization   |
| Compute Optimizer      | Right-sizing recommendations            |

Cost Allocation

Organization
├── Account: Production
│   ├── Tag: team=platform    → $12,000/mo
│   ├── Tag: team=data        → $8,500/mo
│   └── Tag: team=frontend    → $3,200/mo
├── Account: Staging           → $2,100/mo
└── Account: Development       → $1,800/mo

  • Tagging strategy: Enforce tags (team, environment, project, cost-center)
  • AWS Organizations: Separate accounts per environment/team
  • Cost allocation tags: Activate tags in Billing for cost breakdowns
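
Once the `team` tag is activated as a cost allocation tag, the per-team breakdown above can be pulled programmatically. A hedged sketch using the Cost Explorer API (dates and tag key are placeholders; the call needs `ce:GetCostAndUsage` permission):

```python
def team_cost_request(start, end, tag_key="team"):
    """Build a Cost Explorer query grouping monthly spend by a cost-allocation tag."""
    return {
        "TimePeriod": {"Start": start, "End": end},  # "YYYY-MM-DD"
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": [{"Type": "TAG", "Key": tag_key}],
    }

def costs_by_team(start, end):
    import boto3  # requires billing permissions to actually call
    ce = boto3.client("ce")
    resp = ce.get_cost_and_usage(**team_cost_request(start, end))
    return {
        group["Keys"][0]: float(group["Metrics"]["UnblendedCost"]["Amount"])
        for period in resp["ResultsByTime"]
        for group in period["Groups"]
    }
```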

Common Cost Optimization Actions

  1. Delete unused resources: Unattached EBS volumes, idle load balancers, old snapshots
  2. Right-size instances: Match instance family and size to workload
  3. Use Savings Plans / Reserved Instances: 30-72% savings for predictable workloads
  4. Spot instances: 60-90% savings for fault-tolerant batch workloads
  5. Storage lifecycle policies: Auto-tier to cheaper storage classes
  6. Review data transfer: Minimize cross-region and internet egress
  7. Optimize Lambda: Right-size memory, reduce duration, use Graviton
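
Step 1 above is easy to automate. A sketch that only reports candidates, never deletes them (unattached EBS volumes have status "available"; region and return shape are illustrative):

```python
def unattached_volume_filter():
    # EBS volumes with status "available" are not attached to any instance
    return [{"Name": "status", "Values": ["available"]}]

def find_unattached_volumes(region="us-east-1"):
    import boto3  # requires ec2:DescribeVolumes permission to actually call
    ec2 = boto3.client("ec2", region_name=region)
    vols = ec2.describe_volumes(Filters=unattached_volume_filter())["Volumes"]
    # Report (id, size in GiB) pairs for human review before any deletion
    return [(v["VolumeId"], v["Size"]) for v in vols]
```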

FinOps

FinOps is the practice of bringing financial accountability to cloud spending.

FinOps Lifecycle

    Inform ──────► Optimize ──────► Operate
    │               │                │
    Cost visibility Right-sizing     Governance
    Allocation      Reservations     Automation
    Forecasting     Waste removal    Continuous
    Showback        Architecture     improvement

FinOps Practices

| Phase    | Activity                     | Tools                             |
|----------|------------------------------|-----------------------------------|
| Inform   | Cost allocation and showback | Cost Explorer, Kubecost           |
| Inform   | Forecasting and budgeting    | AWS Budgets, Datadog Cost         |
| Optimize | Right-sizing compute         | Compute Optimizer, Spot.io        |
| Optimize | Reserved/Savings Plans       | AWS Cost Explorer recommendations |
| Operate  | Anomaly detection            | Cost Anomaly Detection            |
| Operate  | Automated scheduling         | Instance Scheduler, Lambda        |
| Operate  | Tagging enforcement          | AWS Config rules, SCP             |

Key FinOps Metrics

  • Unit economics: Cost per transaction, cost per customer, cost per request
  • Coverage: Percentage of spend covered by reservations/savings plans
  • Waste: Percentage of spend on idle or unused resources
  • Forecast accuracy: Predicted vs actual spend variance

Key Takeaways

  • Observability combines metrics, logs, and traces to understand system behavior
  • CloudWatch, Cloud Monitoring, and Prometheus cover infrastructure and custom metrics
  • Structured logging with correlation IDs enables cross-service debugging
  • OpenTelemetry is the standard for vendor-neutral instrumentation
  • Alert on symptoms (error rate, latency) rather than causes (CPU, memory)
  • FinOps brings financial discipline to cloud spending through visibility, optimization, and governance
  • Cost optimization is continuous: tag, monitor, right-size, reserve, and automate