Cloud Observability

The Three Pillars

Observability is the ability to infer a system's internal state from its external outputs.

                    Observability
            ┌───────────┼───────────┐
         Metrics      Logs       Traces
      (aggregated   (discrete   (request
       numbers)     events)     flow)
         │            │            │
     CloudWatch   CloudWatch   X-Ray
     Prometheus    Logs         Jaeger
     Datadog      Fluentd      Tempo

Monitoring

Amazon CloudWatch

CloudWatch is AWS's native monitoring and observability platform.

Core components:

  • Metrics: Time-series data points (CPU, memory, custom)
  • Alarms: Trigger actions based on metric thresholds
  • Logs: Centralized log collection and querying
  • Dashboards: Visualize metrics and logs together
  • Events (now Amazon EventBridge): React to state changes

CloudWatch Metric Anatomy:
  Namespace:  AWS/EC2
  MetricName: CPUUtilization
  Dimensions: InstanceId=i-1234567890abcdef0
  Statistic:  Average
  Period:     300 seconds
  Value:      72.5
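
To make the Statistic and Period fields concrete, here is a minimal sketch of how raw samples could be rolled up into one Average per 300-second period. It is a plain-Python illustration, not how CloudWatch is implemented; the function name `aggregate_average` and the sample data are hypothetical.

```python
from collections import defaultdict

def aggregate_average(samples, period_seconds=300):
    """Group (epoch_seconds, value) samples into fixed periods and average each one."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Align each sample to the start of the period it falls in
        period_start = timestamp - (timestamp % period_seconds)
        buckets[period_start].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

# Hypothetical CPU samples: three in the first period, two in the second
samples = [(0, 70.0), (60, 75.0), (299, 72.5), (300, 90.0), (540, 80.0)]
averages = aggregate_average(samples)
```

Each key in `averages` is a period start time, and each value corresponds to what the Value field would show for that period with Statistic set to Average.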

Key AWS Metrics to Monitor

Service    Metric                         Alarm Threshold
EC2        CPUUtilization                 > 80% sustained
ALB        TargetResponseTime             > 500 ms (p99)
ALB        HTTPCode_Target_5XX_Count      > 0
Lambda     Errors                         > 0
Lambda     Duration                       > 80% of timeout
RDS        FreeStorageSpace               < 20% remaining
SQS        ApproximateAgeOfOldestMessage  > acceptable delay
DynamoDB   ThrottledRequests              > 0

Google Cloud Monitoring (formerly Stackdriver)

  • Metrics Explorer: Query and visualize any metric
  • Uptime checks: HTTP, TCP, HTTPS probes from global locations
  • Alerting policies: Multi-condition alerts with notification channels
  • Service monitoring: SLO tracking based on request latency and availability
  • MQL: Monitoring Query Language for advanced metric queries (Google now recommends PromQL for new queries)

CloudWatch Alarms

Metric ──► Evaluation ──► State Change ──► Action
                          │
                    ┌─────┼─────┐
                    OK   ALARM  INSUFFICIENT_DATA
                          │
                    ┌─────┼─────────────┐
                    SNS   Auto Scaling   Lambda
                    │     EC2 action     Custom
                    Email/SMS/PagerDuty  remediation

# CloudWatch alarm via CloudFormation
HighCpuAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: high-cpu-web-servers
    MetricName: CPUUtilization
    Namespace: AWS/EC2
    Statistic: Average
    Period: 300
    EvaluationPeriods: 3
    Threshold: 80
    ComparisonOperator: GreaterThanThreshold
    Dimensions:
      - Name: AutoScalingGroupName
        Value: !Ref WebServerASG
    AlarmActions:
      - !Ref ScaleUpPolicy
      - !Ref OpsNotificationTopic
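
The way Period, EvaluationPeriods, and Threshold combine into the three alarm states can be sketched in a few lines of Python. This is a simplified model under stated assumptions: every one of the last `evaluation_periods` datapoints must breach the threshold, and the function name `evaluate_alarm` is hypothetical. Real CloudWatch alarms also support settings such as DatapointsToAlarm and TreatMissingData, which this sketch ignores.

```python
def evaluate_alarm(datapoints, threshold=80.0, evaluation_periods=3):
    """Return a simplified alarm state for a list of per-period datapoints.

    Mirrors ComparisonOperator GreaterThanThreshold: a datapoint breaches
    only when it is strictly greater than the threshold.
    """
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    if all(value > threshold for value in recent):
        return "ALARM"
    return "OK"

# Three consecutive 5-minute averages above 80% would trigger the alarm actions
state = evaluate_alarm([62.0, 85.0, 91.0, 88.0])
```

Requiring several consecutive breaching periods is what keeps a single short CPU spike from triggering a scale-up or a page.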

Prometheus and Grafana (Cloud-Agnostic)

┌──────────┐     ┌────────────┐     ┌─────────┐
│ App with │◄────│ Prometheus │────►│ Grafana │
│ /metrics │pull │  Server    │     │Dashboard│
│ endpoint │     │  (TSDB)    │     │         │
└──────────┘     │  AlertMgr  │────►│ Alerts  │
                 └────────────┘     └─────────┘

  • Amazon Managed Prometheus (AMP): Serverless Prometheus-compatible monitoring
  • Amazon Managed Grafana (AMG): Managed Grafana with SSO integration
  • GCP Managed Prometheus: Globally scalable, Prometheus-compatible
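
What Prometheus pulls from the /metrics endpoint is plain text in the Prometheus exposition format. The sketch below renders one counter in that format using only the standard library, to show what the payload looks like. The function name `render_counter` and the metric values are hypothetical, and label values are not escaped here; a real service would normally use an official Prometheus client library instead.

```python
def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    `samples` maps a tuple of (label_name, label_value) pairs to the counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

# Hypothetical request counts, labeled by HTTP method and status code
body = render_counter(
    "http_requests_total",
    "Total HTTP requests handled.",
    {
        (("method", "GET"), ("status", "200")): 1027,
        (("method", "POST"), ("status", "500")): 3,
    },
)
```

Serving `body` from a /metrics route is all the application side of the pull model requires; the Prometheus server handles scraping, storage, and alert evaluation.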

Log Aggregation

Architecture

Sources              Collection          Storage/Query
┌─────────┐
│ App logs │─┐       ┌──────────┐       ┌──────────────┐
│ (stdout) │ │       │ Fluentd/ │       │ CloudWatch   │
└─────────┘ ├──────►│ Fluent   │──────►│ Logs         │
┌─────────┐ │       │ Bit      │       ├──────────────┤
│ System  │─┤       └──────────┘       │ OpenSearch   │
│ logs    │ │                          ├──────────────┤
└─────────┘ │       ┌──────────┐       │ S3 + Athena  │
┌─────────┐ │       │CloudWatch│       ├──────────────┤
│ AWS svc │─┘       │ Agent    │──────►│ Datadog/     │
│ logs    │         └──────────┘       │ Splunk       │
└─────────┘                            └──────────────┘

Structured Logging

{
  "timestamp": "2026-03-24T10:15:30.123Z",
  "level": "ERROR",
  "service": "order-service",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "message": "Payment processing failed",
  "error_code": "GATEWAY_TIMEOUT",
  "order_id": "ORD-9876",
  "customer_id": "CUST-5432",
  "duration_ms": 30000
}

Best practices:

  • Use JSON format for machine parseability
  • Include correlation IDs (trace_id) for request tracing
  • Log at appropriate levels (DEBUG, INFO, WARN, ERROR)
  • Never log sensitive data (PII, credentials, tokens)
  • Set log retention policies to control costs
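
As a sketch of the first two practices, the standard `logging` module can emit JSON with correlation IDs through a custom formatter. The class name `JsonFormatter` and the list of copied fields are assumptions made for illustration, not a fixed schema.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record
        for key in ("trace_id", "span_id", "order_id", "error_code", "duration_ms"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("order-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="order-service"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment processing failed",
    extra={"trace_id": "abc123def456", "error_code": "GATEWAY_TIMEOUT", "duration_ms": 30000},
)
```

Copying only an explicit list of fields into the output is one simple way to keep unexpected sensitive values out of the logs.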

CloudWatch Logs Insights

# Top 10 slowest API endpoints; the time range (last hour) is set outside the query
fields @timestamp, @message
| filter @message like /duration_ms/
| parse @message '"path":"*","duration_ms":*,' as path, duration
| stats avg(duration) as avg_ms, max(duration) as max_ms, count() as requests by path
| sort max_ms desc
| limit 10

Distributed Tracing

How Tracing Works

Client ──► API Gateway ──► Order Service ──► Payment Service
  │            │                │                    │
  │     Trace: abc123    Span: order-1         Span: pay-1
  │     Span: gw-1       Parent: gw-1          Parent: order-1
  │            │                │                    │
  ▼            ▼                ▼                    ▼
  ┌──────────────────────────────────────────────────┐
  │ Trace abc123                                      │
  │ ├── gw-1     [════════════════════════════]       │
  │ ├── order-1    [══════════════════]               │
  │ ├── db-query     [═══]                            │
  │ └── pay-1              [══════════]               │
  │                            Total: 450ms           │
  └──────────────────────────────────────────────────┘
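
The parent/child links in the diagram depend on each service passing trace context to the next one. A common way to do this is the W3C Trace Context `traceparent` header, which carries a version, a 32-hex-digit trace ID, a 16-hex-digit span ID, and flags. The sketch below shows the idea with hypothetical helper names; tracing SDKs such as OpenTelemetry normally handle this propagation for you.

```python
import secrets

def new_traceparent():
    """Start a new trace: random 16-byte trace ID, 8-byte span ID, sampled flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"

def child_traceparent(incoming):
    """Continue an incoming trace: keep the trace ID, mint a new span ID.

    Returns (outgoing_header, parent_span_id) so the new span can record its parent.
    """
    version, trace_id, parent_span_id, flags = incoming.split("-")
    span_id = secrets.token_hex(8)
    return f"{version}-{trace_id}-{span_id}-{flags}", parent_span_id

# API Gateway starts the trace; each downstream service continues it
gateway_header = new_traceparent()
order_header, order_parent = child_traceparent(gateway_header)
payment_header, payment_parent = child_traceparent(order_header)
```

All three headers share one trace ID, while `order_parent` and `payment_parent` hold the span IDs of the calling service, which is the information a tracing backend needs to draw the waterfall.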

Tracing Services

Service        Provider  Protocol
AWS X-Ray      AWS       X-Ray SDK, OpenTelemetry
Cloud Trace    GCP       OpenTelemetry, Zipkin
Azure Monitor  Azure     OpenTelemetry
Jaeger         OSS       OpenTracing, OpenTelemetry
Tempo          Grafana   OpenTelemetry, Zipkin, Jaeger

OpenTelemetry

OpenTelemetry (OTel) is the vendor-neutral, CNCF-hosted standard for generating, collecting, and exporting traces, metrics, and logs.

# Auto-instrumentation with OpenTelemetry
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# service.name identifies this service in the tracing backend
provider = TracerProvider(resource=Resource.create({"service.name": "order-service"}))
# insecure=True assumes a plaintext gRPC collector on a private network;
# remove it when the collector endpoint uses TLS
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True)
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Auto-instrument frameworks (call before the Flask app is created)
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()

# Manual span for custom business logic
tracer = trace.get_tracer(__name__)

def handle_order(order_id, total):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.total", total)
        # process() stands for the application's own order-handling function
        return process(order_id)

Custom Metrics

Business Metrics

Beyond infrastructure metrics, track domain-specific indicators.

# Publish custom metric to CloudWatch
import boto3

cloudwatch = boto3.client('cloudwatch')

def publish_order_value(order_total):
    """Record one order's value as a custom CloudWatch metric."""
    cloudwatch.put_metric_data(
        Namespace='OrderService',
        MetricData=[{
            'MetricName': 'OrderValue',
            'Dimensions': [
                {'Name': 'Region', 'Value': 'us-east-1'},
                {'Name': 'PaymentType', 'Value': 'credit_card'},
            ],
            'Value': order_total,
            'Unit': 'None',  # Unitless; CloudWatch has no currency unit
            'StorageResolution': 60  # Standard (1-minute) resolution; 1 = high resolution
        }]
    )

SLIs, SLOs, and SLAs

Concept          Definition                                  Example
SLI (Indicator)  Quantitative measure of service behavior    99.8% of requests < 200ms
SLO (Objective)  Target value for an SLI                     99.5% of requests < 200ms
SLA (Agreement)  Contract with consequences                  99.9% uptime or credit issued
Error budget     Allowed failures (100% - SLO) minus actual  0.5% - 0.2% = 0.3% remaining
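
The error budget arithmetic can be written out as a small function. This is a minimal sketch with a hypothetical name, `error_budget`, working in percentage points: the total budget is 100% minus the SLO, and the consumed part is the share of requests that missed the target.

```python
def error_budget(slo_percent, observed_good_percent):
    """Return (total_budget, consumed, remaining) in percentage points."""
    total = 100.0 - slo_percent
    consumed = 100.0 - observed_good_percent
    return total, consumed, total - consumed

# A 99.5% SLO with 0.2% of requests failing leaves about 0.3 points of budget
total, consumed, remaining = error_budget(slo_percent=99.5, observed_good_percent=99.8)

# If only 99.2% of requests are good, the budget is overspent (negative remainder)
_, _, overspent = error_budget(slo_percent=99.5, observed_good_percent=99.2)
```

A negative remainder is the signal many teams use to pause risky releases until reliability recovers.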

Dashboards and Alerting

Dashboard Design Principles

  1. USE method for resources: Utilization, Saturation, Errors
  2. RED method for services: Rate, Errors, Duration
  3. Layered views: Overview → service → instance drill-down
  4. Business context: Correlate technical metrics with business KPIs
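
As an illustration of the RED method, the sketch below reduces one window of request records to a rate, an error ratio, and a p99 duration. The function name `red_summary`, the record shape, and the choice to count only 5xx responses as errors are assumptions for this example.

```python
def red_summary(requests, window_seconds):
    """Summarize (status_code, duration_ms) records from one time window into RED metrics."""
    if not requests:
        return {"rate_per_second": 0.0, "error_ratio": 0.0, "p99_duration_ms": None}
    durations = sorted(duration for _, duration in requests)
    errors = sum(1 for status, _ in requests if status >= 500)
    # Nearest-rank p99 with integer math: index of ceil(0.99 * n), zero-based
    p99_index = (99 * len(durations) + 99) // 100 - 1
    return {
        "rate_per_second": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p99_duration_ms": durations[p99_index],
    }

# Hypothetical one-minute window: 98 fast successes, one slow success, one failure
requests = [(200, 100)] * 98 + [(500, 900), (200, 300)]
summary = red_summary(requests, window_seconds=60)
```

These three numbers are a reasonable top row for a service overview dashboard, with drill-downs per instance beneath them.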

Effective Alerting

Principle                      Practice
Alert on symptoms, not causes  Alert on error rate, not CPU
Reduce noise                   Multi-window, multi-burn-rate alerts
Actionable alerts only         Every alert should have a runbook
Appropriate urgency            Page for customer impact, ticket for degradation
Avoid alert fatigue            Review and prune alerts quarterly
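
A multi-window, multi-burn-rate alert can be sketched as follows. Burn rate is the observed error ratio divided by the error ratio the SLO allows, so 1.0 means the budget is being spent exactly on schedule. The default threshold of 14.4, paired with a 1-hour long window and a 5-minute short window, follows the example in the Google SRE Workbook for a 30-day SLO; treat it as a starting point rather than a rule. The function names are hypothetical.

```python
def burn_rate(error_ratio, slo):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    return error_ratio / (1.0 - slo)

def should_page(long_window_errors, short_window_errors, slo=0.995, threshold=14.4):
    """Page only when both the long and the short window burn faster than the threshold."""
    return (
        burn_rate(long_window_errors, slo) > threshold
        and burn_rate(short_window_errors, slo) > threshold
    )

# 10% errors over the last hour and 12% over the last 5 minutes: still burning, so page
page_now = should_page(long_window_errors=0.10, short_window_errors=0.12)

# 10% over the last hour but 1% over the last 5 minutes: already recovering, no page
page_recovered = should_page(long_window_errors=0.10, short_window_errors=0.01)
```

The long window keeps brief blips from paging anyone, and the short window lets the alert clear quickly once the problem stops.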

Cost Monitoring and Optimization

AWS Cost Management Tools

Tool                    Purpose
Cost Explorer           Visualize spending trends and forecasts
Budgets                 Set spending thresholds with alerts
Cost Anomaly Detection  ML-based unusual spend detection
Savings Plans           Commit to usage for discounts
Trusted Advisor         Recommendations for cost optimization
Compute Optimizer       Right-sizing recommendations

Cost Allocation

Organization
├── Account: Production
│   ├── Tag: team=platform    → $12,000/mo
│   ├── Tag: team=data        → $8,500/mo
│   └── Tag: team=frontend    → $3,200/mo
├── Account: Staging           → $2,100/mo
└── Account: Development       → $1,800/mo

  • Tagging strategy: Enforce tags (team, environment, project, cost-center)
  • AWS Organizations: Separate accounts per environment/team
  • Cost allocation tags: Activate tags in Billing for cost breakdowns
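
A per-team breakdown like the one in the tree comes from grouping billing line items by tag. The sketch below assumes a simplified line-item shape (a `cost` and a `tags` dictionary); the field names, the resource IDs, and the function `cost_by_tag` are hypothetical and do not match any specific billing export format.

```python
from collections import defaultdict

def cost_by_tag(line_items, tag_key="team"):
    """Sum cost per value of `tag_key`; untagged spend is grouped under 'untagged'."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += item["cost"]
    return dict(totals)

# Hypothetical line items; one resource is missing the team tag
line_items = [
    {"resource": "i-0abc", "cost": 120.0, "tags": {"team": "platform"}},
    {"resource": "db-1", "cost": 85.0, "tags": {"team": "data"}},
    {"resource": "vol-9", "cost": 15.0, "tags": {}},
    {"resource": "i-0def", "cost": 30.0, "tags": {"team": "platform"}},
]
totals = cost_by_tag(line_items)
```

Reporting the "untagged" bucket explicitly makes gaps in tag enforcement visible instead of hiding them in a shared total.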

Common Cost Optimization Actions

  1. Delete unused resources: Unattached EBS volumes, idle load balancers, old snapshots
  2. Right-size instances: Match instance family and size to workload
  3. Use Savings Plans / Reserved Instances: 30-72% savings for predictable workloads
  4. Spot instances: 60-90% savings for fault-tolerant batch workloads
  5. Storage lifecycle policies: Auto-tier to cheaper storage classes
  6. Review data transfer: Minimize cross-region and internet egress
  7. Optimize Lambda: Right-size memory, reduce duration, use Graviton
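
The first action, finding unattached EBS volumes, can be sketched with the EC2 `describe_volumes` call, filtering on the `available` status that marks a volume as not attached to any instance. The function takes the client as a parameter so it can be tried with a stub; the name `find_unattached_volumes` is hypothetical, and the sketch reads only the first page of results.

```python
def find_unattached_volumes(ec2_client):
    """Return (volume_id, size_gib) for EBS volumes not attached to any instance.

    `ec2_client` is expected to behave like boto3.client("ec2").
    Only the first page of results is read; use a paginator for large accounts.
    """
    response = ec2_client.describe_volumes(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )
    return [(vol["VolumeId"], vol["Size"]) for vol in response["Volumes"]]
```

With boto3 installed and credentials configured, a call would look like `find_unattached_volumes(boto3.client("ec2"))`. Review the results before deleting anything, since a detached volume may still hold data someone needs.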

FinOps

FinOps is the practice of bringing financial accountability to cloud spending.

FinOps Lifecycle

    Inform ──────► Optimize ──────► Operate
    │               │                │
    Cost visibility Right-sizing     Governance
    Allocation      Reservations     Automation
    Forecasting     Waste removal    Continuous
    Showback        Architecture     improvement

FinOps Practices

Phase     Activity                      Tools
Inform    Cost allocation and showback  Cost Explorer, Kubecost
Inform    Forecasting and budgeting     AWS Budgets, Datadog Cost
Optimize  Right-sizing compute          Compute Optimizer, Spot.io
Optimize  Reserved/Savings Plans        AWS Cost Explorer recommendations
Operate   Anomaly detection             Cost Anomaly Detection
Operate   Automated scheduling          Instance Scheduler, Lambda
Operate   Tagging enforcement           AWS Config rules, SCP

Key FinOps Metrics

  • Unit economics: Cost per transaction, cost per customer, cost per request
  • Coverage: Percentage of spend covered by reservations/savings plans
  • Waste: Percentage of spend on idle or unused resources
  • Forecast accuracy: Predicted vs actual spend variance
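
The four metrics above are simple ratios, sketched here as one function. The name `finops_metrics`, its parameters, and all input figures are hypothetical; coverage is computed against total spend for simplicity, although in practice it is often measured against commitment-eligible spend only.

```python
def finops_metrics(total_spend, transactions, committed_spend, idle_spend, forecast_spend):
    """Compute the four FinOps indicators as plain ratios."""
    return {
        "cost_per_transaction": total_spend / transactions,
        "coverage_pct": 100.0 * committed_spend / total_spend,
        "waste_pct": 100.0 * idle_spend / total_spend,
        # Positive variance means actual spend came in above the forecast
        "forecast_variance_pct": 100.0 * (total_spend - forecast_spend) / forecast_spend,
    }

# Hypothetical month: $27,600 total spend across 1.2 million transactions
metrics = finops_metrics(
    total_spend=27_600.0,
    transactions=1_200_000,
    committed_spend=16_560.0,
    idle_spend=1_380.0,
    forecast_spend=26_000.0,
)
```

Tracking these values month over month matters more than any single reading, because the trend shows whether optimization work is paying off.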

Key Takeaways

  • Observability combines metrics, logs, and traces to understand system behavior
  • CloudWatch, Cloud Monitoring, and Prometheus cover infrastructure and custom metrics
  • Structured logging with correlation IDs enables cross-service debugging
  • OpenTelemetry is the standard for vendor-neutral instrumentation
  • Alert on symptoms (error rate, latency) rather than causes (CPU, memory)
  • FinOps brings financial discipline to cloud spending through visibility, optimization, and governance
  • Cost optimization is continuous: tag, monitor, right-size, reserve, and automate