
Cloud Observability

The Three Pillars

Observability is the ability to understand a system's internal state from its external outputs.

                    Observability
            ┌───────────┼───────────┐
         Metrics      Logs       Traces
      (aggregated   (discrete   (request
       numbers)     events)     flow)
         │            │            │
     CloudWatch   CloudWatch   X-Ray
     Prometheus    Logs         Jaeger
     Datadog      Fluentd      Tempo

Monitoring

Amazon CloudWatch

CloudWatch is AWS's native monitoring and observability platform.

Core components:

  • Metrics: Time-series data points (CPU, memory, custom)
  • Alarms: Trigger actions based on metric thresholds
  • Logs: Centralized log collection and querying
  • Dashboards: Visualize metrics and logs together
  • Events/EventBridge: React to state changes

CloudWatch Metric Anatomy:
  Namespace:  AWS/EC2
  MetricName: CPUUtilization
  Dimensions: InstanceId=i-1234567890abcdef0
  Statistic:  Average
  Period:     300 seconds
  Value:      72.5
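
The anatomy above maps directly onto the CloudWatch API. A minimal sketch of retrieving that exact metric with boto3's `get_metric_statistics` (the instance ID is the placeholder from the example; a real call needs AWS credentials):

```python
from datetime import datetime, timedelta, timezone

def cpu_request_params(instance_id, minutes=60, period=300):
    """Build get_metric_statistics parameters matching the anatomy above."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period,          # seconds; must be a multiple of 60
        "Statistics": ["Average"],
    }

def fetch_cpu(instance_id):
    import boto3  # requires AWS credentials to actually call
    cw = boto3.client("cloudwatch")
    resp = cw.get_metric_statistics(**cpu_request_params(instance_id))
    # Datapoints come back unordered; sort by timestamp before plotting
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
```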

Key AWS Metrics to Monitor

| Service  | Metric                        | Alarm Threshold    |
|----------|-------------------------------|--------------------|
| EC2      | CPUUtilization                | > 80% sustained    |
| ALB      | TargetResponseTime            | > 500ms p99        |
| ALB      | HTTPCode_Target_5XX_Count     | > 0                |
| Lambda   | Errors                        | > 0                |
| Lambda   | Duration                      | > 80% of timeout   |
| RDS      | FreeStorageSpace              | < 20% remaining    |
| SQS      | ApproximateAgeOfOldestMessage | > acceptable delay |
| DynamoDB | ThrottledRequests             | > 0                |

Google Cloud Monitoring (formerly Stackdriver)

  • Metrics Explorer: Query and visualize any metric
  • Uptime checks: HTTP, TCP, HTTPS probes from global locations
  • Alerting policies: Multi-condition alerts with notification channels
  • Service monitoring: SLO tracking based on request latency and availability
  • MQL: Monitoring Query Language for advanced metric queries

CloudWatch Alarms

Metric ──► Evaluation ──► State Change ──► Action
                          │
                    ┌─────┼─────┐
                    OK   ALARM  INSUFFICIENT_DATA
                          │
                    ┌─────┼─────────────┐
                    SNS   Auto Scaling   Lambda
                    │     EC2 action     Custom
                    Email/SMS/PagerDuty  remediation

# CloudWatch alarm via CloudFormation
HighCpuAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: high-cpu-web-servers
    MetricName: CPUUtilization
    Namespace: AWS/EC2
    Statistic: Average
    Period: 300
    EvaluationPeriods: 3
    Threshold: 80
    ComparisonOperator: GreaterThanThreshold
    Dimensions:
      - Name: AutoScalingGroupName
        Value: !Ref WebServerASG
    AlarmActions:
      - !Ref ScaleUpPolicy
      - !Ref OpsNotificationTopic

Prometheus and Grafana (Cloud-Agnostic)

┌──────────┐     ┌────────────┐     ┌─────────┐
│ App with │◄────│ Prometheus │────►│ Grafana │
│ /metrics │pull │  Server    │     │Dashboard│
│ endpoint │     │  (TSDB)    │     │         │
└──────────┘     │  AlertMgr  │────►│ Alerts  │
                 └────────────┘     └─────────┘

Managed offerings:

  • Amazon Managed Prometheus (AMP): Serverless Prometheus-compatible monitoring
  • Amazon Managed Grafana (AMG): Managed Grafana with SSO integration
  • GCP Managed Prometheus: Globally scalable, Prometheus-compatible
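
The pull model in the diagram hinges on the `/metrics` endpoint. In practice a client library such as prometheus_client generates it; the sketch below is a dependency-free, stdlib-only illustration of the text exposition format Prometheus scrapes (the counter name and port are illustrative):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

COUNTERS = {"http_requests_total": 0}

def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in counters.items():
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics(COUNTERS).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)

# To serve for a Prometheus scrape target:
# HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

Prometheus then pulls this endpoint on its configured scrape interval and stores the samples in its TSDB.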

Log Aggregation

Architecture

Sources              Collection          Storage/Query
┌─────────┐
│ App logs │─┐       ┌──────────┐       ┌──────────────┐
│ (stdout) │ │       │ Fluentd/ │       │ CloudWatch   │
└─────────┘ ├──────►│ Fluent   │──────►│ Logs         │
┌─────────┐ │       │ Bit      │       ├──────────────┤
│ System  │─┤       └──────────┘       │ OpenSearch   │
│ logs    │ │                          ├──────────────┤
└─────────┘ │       ┌──────────┐       │ S3 + Athena  │
┌─────────┐ │       │CloudWatch│       ├──────────────┤
│ AWS svc │─┘       │ Agent    │──────►│ Datadog/     │
│ logs    │         └──────────┘       │ Splunk       │
└─────────┘                            └──────────────┘
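
The collection stage in the diagram is usually just configuration. A hedged Fluent Bit sketch, tailing JSON application logs into CloudWatch Logs (paths, group name, and region are placeholders):

```
[INPUT]
    Name              tail
    Path              /var/log/app/*.log
    Parser            json

[OUTPUT]
    Name              cloudwatch_logs
    Match             *
    region            us-east-1
    log_group_name    /app/orders
    log_stream_prefix fluent-
    auto_create_group true
```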

Structured Logging

{
  "timestamp": "2026-03-24T10:15:30.123Z",
  "level": "ERROR",
  "service": "order-service",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "message": "Payment processing failed",
  "error_code": "GATEWAY_TIMEOUT",
  "order_id": "ORD-9876",
  "customer_id": "CUST-5432",
  "duration_ms": 30000
}

Best practices:

  • Use JSON format for machine parseability
  • Include correlation IDs (trace_id) for request tracing
  • Log at appropriate levels (DEBUG, INFO, WARN, ERROR)
  • Never log sensitive data (PII, credentials, tokens)
  • Set log retention policies to control costs
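
The practices above can be sketched with the stdlib logging module; a minimal JSON formatter (the service name and field list are illustrative, matching the example record):

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record, matching the fields above."""
    def format(self, record):
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ",
                                       time.gmtime(record.created)),
            "level": record.levelname,
            "service": "order-service",  # hypothetical service name
            "message": record.getMessage(),
        }
        # Correlation and business fields are passed via `extra=`
        for key in ("trace_id", "span_id", "error_code", "order_id", "duration_ms"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("order-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment processing failed",
             extra={"trace_id": "abc123def456", "error_code": "GATEWAY_TIMEOUT"})
```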

CloudWatch Logs Insights

# Find the top 10 slowest API endpoints in the last hour
fields @timestamp, @message
| filter @message like /duration_ms/
| parse @message '"path":"*","duration_ms":*}' as path, duration
| stats avg(duration) as avg_ms, max(duration) as max_ms, count() as requests by path
| sort max_ms desc
| limit 10

Distributed Tracing

How Tracing Works

Client ──► API Gateway ──► Order Service ──► Payment Service
  │            │                │                    │
  │     Trace: abc123    Span: order-1         Span: pay-1
  │     Span: gw-1       Parent: gw-1          Parent: order-1
  │            │                │                    │
  ▼            ▼                ▼                    ▼
  ┌──────────────────────────────────────────────────┐
  │ Trace abc123                                      │
  │ ├── gw-1     [════════════════════════════]       │
  │ ├── order-1    [══════════════════]               │
  │ ├── db-query     [═══]                            │
  │ └── pay-1              [══════════]               │
  │                            Total: 450ms           │
  └──────────────────────────────────────────────────┘

Tracing Services

| Service       | Provider | Protocol                      |
|---------------|----------|-------------------------------|
| AWS X-Ray     | AWS      | X-Ray SDK, OpenTelemetry      |
| Cloud Trace   | GCP      | OpenTelemetry, Zipkin         |
| Azure Monitor | Azure    | OpenTelemetry                 |
| Jaeger        | OSS      | OpenTracing, OpenTelemetry    |
| Tempo         | Grafana  | OpenTelemetry, Zipkin, Jaeger |

OpenTelemetry

OpenTelemetry (OTel) is the industry-standard framework for collecting telemetry data.

# Auto-instrumentation with OpenTelemetry
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True)  # plaintext gRPC to a local collector
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Auto-instrument frameworks
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()

# Manual span for custom business logic
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    span.set_attribute("order.total", total)
    result = process(order_id)

Custom Metrics

Business Metrics

Beyond infrastructure metrics, track domain-specific indicators.

# Publish custom metric to CloudWatch
import boto3

cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_data(
    Namespace='OrderService',
    MetricData=[{
        'MetricName': 'OrderValue',
        'Dimensions': [
            {'Name': 'Region', 'Value': 'us-east-1'},
            {'Name': 'PaymentType', 'Value': 'credit_card'},
        ],
        'Value': order_total,
        'Unit': 'None',
        'StorageResolution': 60  # Standard resolution
    }]
)

SLIs, SLOs, and SLAs

| Concept         | Definition                              | Example                       |
|-----------------|-----------------------------------------|-------------------------------|
| SLI (Indicator) | Quantitative measure of service         | 99.2% of requests < 200ms     |
| SLO (Objective) | Target value for an SLI                 | 99.5% of requests < 200ms     |
| SLA (Agreement) | Contract with consequences              | 99.9% uptime or credit issued |
| Error budget    | Allowed failures (1 - SLO) minus actual | 0.5% - 0.2% = 0.3% remaining  |
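
The error-budget arithmetic in the last row can be made concrete. A small sketch, assuming a 30-day rolling window (the function and field names are illustrative):

```python
def error_budget(slo, observed_failure_ratio, window_minutes=30 * 24 * 60):
    """Return allowed, consumed, and remaining budget for a rolling window."""
    allowed = 1.0 - slo                        # e.g. 0.5% for a 99.5% SLO
    remaining = allowed - observed_failure_ratio
    return {
        "allowed_pct": allowed * 100,
        "consumed_pct": observed_failure_ratio * 100,
        "remaining_pct": remaining * 100,
        # budget expressed as minutes of total outage the window still tolerates
        "remaining_minutes": remaining * window_minutes,
    }

b = error_budget(slo=0.995, observed_failure_ratio=0.002)
# matches the table: 0.5% allowed - 0.2% consumed = 0.3% remaining
```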

Dashboards and Alerting

Dashboard Design Principles

  1. USE method for resources: Utilization, Saturation, Errors
  2. RED method for services: Rate, Errors, Duration
  3. Layered views: Overview → service → instance drill-down
  4. Business context: Correlate technical metrics with business KPIs

Effective Alerting

| Principle                     | Practice                                         |
|-------------------------------|--------------------------------------------------|
| Alert on symptoms, not causes | Alert on error rate, not CPU                     |
| Reduce noise                  | Multi-window, multi-burn-rate alerts             |
| Actionable alerts only        | Every alert should have a runbook                |
| Appropriate urgency           | Page for customer impact, ticket for degradation |
| Avoid alert fatigue           | Review and prune alerts quarterly                |
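
Multi-window, multi-burn-rate alerting can be sketched in a few lines. Burn rate is how fast the error budget is being consumed relative to the SLO; the 14.4 threshold below is the commonly cited value for burning 2% of a 30-day budget in one hour (the function names and default SLO are illustrative):

```python
def burn_rate(error_ratio, slo):
    """Budget consumption speed: 1.0 means exactly on budget for the window."""
    return error_ratio / (1.0 - slo)

def should_page(short_window_ratio, long_window_ratio, slo=0.999, threshold=14.4):
    # Page only when both a short and a long window exceed the burn-rate
    # threshold: fast enough to matter, sustained enough to be real.
    return (burn_rate(short_window_ratio, slo) >= threshold
            and burn_rate(long_window_ratio, slo) >= threshold)
```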

Cost Monitoring and Optimization

AWS Cost Management Tools

| Tool                   | Purpose                                 |
|------------------------|-----------------------------------------|
| Cost Explorer          | Visualize spending trends and forecasts |
| Budgets                | Set spending thresholds with alerts     |
| Cost Anomaly Detection | ML-based unusual spend detection        |
| Savings Plans          | Commit to usage for discounts           |
| Trusted Advisor        | Recommendations for cost optimization   |
| Compute Optimizer      | Right-sizing recommendations            |

Cost Allocation

Organization
├── Account: Production
│   ├── Tag: team=platform    → $12,000/mo
│   ├── Tag: team=data        → $8,500/mo
│   └── Tag: team=frontend    → $3,200/mo
├── Account: Staging           → $2,100/mo
└── Account: Development       → $1,800/mo

  • Tagging strategy: Enforce tags (team, environment, project, cost-center)
  • AWS Organizations: Separate accounts per environment/team
  • Cost allocation tags: Activate tags in Billing for cost breakdowns
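
Once the `team` tag is activated as a cost allocation tag, the per-team breakdown above can be pulled programmatically. A hedged sketch using the Cost Explorer API (dates and tag key are placeholders; the call needs `ce:GetCostAndUsage` permission):

```python
def team_cost_request(start, end, tag_key="team"):
    """Build a Cost Explorer query grouping monthly spend by a cost-allocation tag."""
    return {
        "TimePeriod": {"Start": start, "End": end},  # "YYYY-MM-DD"
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": [{"Type": "TAG", "Key": tag_key}],
    }

def costs_by_team(start, end):
    import boto3  # requires billing permissions to actually call
    ce = boto3.client("ce")
    resp = ce.get_cost_and_usage(**team_cost_request(start, end))
    return {
        group["Keys"][0]: float(group["Metrics"]["UnblendedCost"]["Amount"])
        for period in resp["ResultsByTime"]
        for group in period["Groups"]
    }
```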

Common Cost Optimization Actions

  1. Delete unused resources: Unattached EBS volumes, idle load balancers, old snapshots
  2. Right-size instances: Match instance family and size to workload
  3. Use Savings Plans / Reserved Instances: 30-72% savings for predictable workloads
  4. Spot instances: 60-90% savings for fault-tolerant batch workloads
  5. Storage lifecycle policies: Auto-tier to cheaper storage classes
  6. Review data transfer: Minimize cross-region and internet egress
  7. Optimize Lambda: Right-size memory, reduce duration, use Graviton
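
Step 1 above is easy to automate. A sketch that only reports candidates, never deletes them (unattached EBS volumes have status "available"; region and return shape are illustrative):

```python
def unattached_volume_filter():
    # EBS volumes with status "available" are not attached to any instance
    return [{"Name": "status", "Values": ["available"]}]

def find_unattached_volumes(region="us-east-1"):
    import boto3  # requires ec2:DescribeVolumes permission to actually call
    ec2 = boto3.client("ec2", region_name=region)
    vols = ec2.describe_volumes(Filters=unattached_volume_filter())["Volumes"]
    # Report (id, size in GiB) pairs for human review before any deletion
    return [(v["VolumeId"], v["Size"]) for v in vols]
```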

FinOps

FinOps is the practice of bringing financial accountability to cloud spending.

FinOps Lifecycle

    Inform ──────► Optimize ──────► Operate
    │               │                │
    Cost visibility Right-sizing     Governance
    Allocation      Reservations     Automation
    Forecasting     Waste removal    Continuous
    Showback        Architecture     improvement

FinOps Practices

| Phase    | Activity                     | Tools                             |
|----------|------------------------------|-----------------------------------|
| Inform   | Cost allocation and showback | Cost Explorer, Kubecost           |
| Inform   | Forecasting and budgeting    | AWS Budgets, Datadog Cost         |
| Optimize | Right-sizing compute         | Compute Optimizer, Spot.io        |
| Optimize | Reserved/Savings Plans       | AWS Cost Explorer recommendations |
| Operate  | Anomaly detection            | Cost Anomaly Detection            |
| Operate  | Automated scheduling         | Instance Scheduler, Lambda        |
| Operate  | Tagging enforcement          | AWS Config rules, SCP             |

Key FinOps Metrics

  • Unit economics: Cost per transaction, cost per customer, cost per request
  • Coverage: Percentage of spend covered by reservations/savings plans
  • Waste: Percentage of spend on idle or unused resources
  • Forecast accuracy: Predicted vs actual spend variance

Key Takeaways

  • Observability combines metrics, logs, and traces to understand system behavior
  • CloudWatch, Cloud Monitoring, and Prometheus cover infrastructure and custom metrics
  • Structured logging with correlation IDs enables cross-service debugging
  • OpenTelemetry is the standard for vendor-neutral instrumentation
  • Alert on symptoms (error rate, latency) rather than causes (CPU, memory)
  • FinOps brings financial discipline to cloud spending through visibility, optimization, and governance
  • Cost optimization is continuous: tag, monitor, right-size, reserve, and automate