Cloud Observability
The Three Pillars
Observability is the ability to infer a system's internal state from its external outputs.
                 Observability
       ┌───────────────┼───────────────┐
    Metrics          Logs           Traces
  (aggregated     (discrete       (request
    numbers)        events)          flow)
       │               │               │
  CloudWatch      CloudWatch        X-Ray
  Prometheus        Logs           Jaeger
   Datadog         Fluentd          Tempo
Monitoring
Amazon CloudWatch
CloudWatch is AWS's native monitoring and observability platform.
Core components:
- Metrics: Time-series data points (CPU, memory, custom)
- Alarms: Trigger actions based on metric thresholds
- Logs: Centralized log collection and querying
- Dashboards: Visualize metrics and logs together
- Events/EventBridge: React to state changes
CloudWatch Metric Anatomy:
Namespace: AWS/EC2
MetricName: CPUUtilization
Dimensions: InstanceId=i-1234567890abcdef0
Statistic: Average
Period: 300 seconds
Value: 72.5
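The anatomy above maps one-to-one onto parameters of the CloudWatch `get_metric_statistics` API. A hedged sketch that builds the request for this exact metric (the instance ID is the placeholder from the example above):

```python
import datetime

# Each field of the metric anatomy becomes a get_metric_statistics parameter.
def cpu_stats_request(instance_id):
    """Request 5-minute average CPUUtilization for one instance over the last hour."""
    now = datetime.datetime.now(datetime.timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": now - datetime.timedelta(hours=1),
        "EndTime": now,
        "Period": 300,              # seconds per datapoint
        "Statistics": ["Average"],
    }

# With AWS credentials configured, the request can be sent via boto3:
# import boto3
# cloudwatch = boto3.client("cloudwatch")
# resp = cloudwatch.get_metric_statistics(**cpu_stats_request("i-1234567890abcdef0"))
# datapoints = resp["Datapoints"]
```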
Key AWS Metrics to Monitor
| Service  | Metric                        | Alarm Threshold    |
|----------|-------------------------------|--------------------|
| EC2      | CPUUtilization                | > 80% sustained    |
| ALB      | TargetResponseTime            | > 500ms p99        |
| ALB      | HTTPCode_Target_5XX_Count     | > 0                |
| Lambda   | Errors                        | > 0                |
| Lambda   | Duration                      | > 80% of timeout   |
| RDS      | FreeStorageSpace              | < 20% remaining    |
| SQS      | ApproximateAgeOfOldestMessage | > acceptable delay |
| DynamoDB | ThrottledRequests             | > 0                |
Google Cloud Monitoring (formerly Stackdriver)
- Metrics Explorer: Query and visualize any metric
- Uptime checks: HTTP, TCP, HTTPS probes from global locations
- Alerting policies: Multi-condition alerts with notification channels
- Service monitoring: SLO tracking based on request latency and availability
- MQL: Monitoring Query Language for advanced metric queries
CloudWatch Alarms
Metric ──► Evaluation ──► State Change ──► Action
                               │
                 ┌─────────────┼─────────────┐
                OK           ALARM    INSUFFICIENT_DATA
                               │
                 ┌─────────────┼─────────────┐
                SNS      Auto Scaling      Lambda
                 │        EC2 action       Custom
         Email/SMS/PagerDuty             remediation
# CloudWatch alarm via CloudFormation
HighCpuAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: high-cpu-web-servers
    MetricName: CPUUtilization
    Namespace: AWS/EC2
    Statistic: Average
    Period: 300
    EvaluationPeriods: 3
    Threshold: 80
    ComparisonOperator: GreaterThanThreshold
    Dimensions:
      - Name: AutoScalingGroupName
        Value: !Ref WebServerASG
    AlarmActions:
      - !Ref ScaleUpPolicy
      - !Ref OpsNotificationTopic
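The same alarm can be created imperatively through the CloudWatch `put_metric_alarm` API. A sketch with boto3, where the Auto Scaling group name and action ARNs are placeholders:

```python
# Mirror the CloudFormation alarm above as a put_metric_alarm request.
def high_cpu_alarm_params(asg_name, action_arns):
    """3 consecutive 5-minute periods above 80% average CPU trigger the actions."""
    return {
        "AlarmName": "high-cpu-web-servers",
        "MetricName": "CPUUtilization",
        "Namespace": "AWS/EC2",
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 3,
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
        "AlarmActions": action_arns,  # scaling policy / SNS topic ARNs (placeholders)
    }

# With AWS credentials configured:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(
#     **high_cpu_alarm_params("web-server-asg", ["arn:aws:sns:..."]))
```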
Prometheus and Grafana (Cloud-Agnostic)
┌──────────┐      ┌────────────┐      ┌───────────┐
│ App with │◄─────│ Prometheus │─────►│  Grafana  │
│ /metrics │ pull │   Server   │      │ Dashboard │
│ endpoint │      │   (TSDB)   │      ├───────────┤
└──────────┘      │  AlertMgr  │─────►│  Alerts   │
                  └────────────┘      └───────────┘
- Amazon Managed Prometheus (AMP): Serverless Prometheus-compatible monitoring
- Amazon Managed Grafana (AMG): Managed Grafana with SSO integration
- GCP Managed Prometheus: Globally scalable, Prometheus-compatible
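The pull model in the diagram is simple enough to sketch without dependencies: the app renders its counters in the Prometheus text exposition format and the server scrapes them on an interval. A minimal stdlib sketch (in real services you would use the official `prometheus_client` library, which also handles the HTTP endpoint and histograms):

```python
# In-process counters, keyed by label value.
request_count = {}

def record_request(path):
    """Increment the per-path request counter."""
    request_count[path] = request_count.get(path, 0) + 1

def metrics_endpoint():
    """Render counters in the Prometheus text exposition format.
    A Prometheus server scraping this output reconstructs the time series."""
    lines = ["# TYPE http_requests_total counter"]
    for path, n in sorted(request_count.items()):
        lines.append(f'http_requests_total{{path="{path}"}} {n}')
    return "\n".join(lines) + "\n"
```

Serving `metrics_endpoint()` at `/metrics` over HTTP is all an application needs for Prometheus to start collecting.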
Log Aggregation
Architecture
  Sources           Collection           Storage/Query
┌─────────┐
│ App logs │─┐ ┌──────────┐ ┌──────────────┐
│ (stdout) │ │ │ Fluentd/ │ │ CloudWatch │
└─────────┘ ├──────►│ Fluent │──────►│ Logs │
┌─────────┐ │ │ Bit │ ├──────────────┤
│ System │─┤ └──────────┘ │ OpenSearch │
│ logs │ │ ├──────────────┤
└─────────┘ │ ┌──────────┐ │ S3 + Athena │
┌─────────┐ │ │CloudWatch│ ├──────────────┤
│ AWS svc │─┘ │ Agent │──────►│ Datadog/ │
│ logs │ └──────────┘ │ Splunk │
└─────────┘ └──────────────┘
Structured Logging
{
  "timestamp": "2026-03-24T10:15:30.123Z",
  "level": "ERROR",
  "service": "order-service",
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "message": "Payment processing failed",
  "error_code": "GATEWAY_TIMEOUT",
  "order_id": "ORD-9876",
  "customer_id": "CUST-5432",
  "duration_ms": 30000
}
Best practices:
- Use JSON format for machine parseability
- Include correlation IDs (trace_id) for request tracing
- Log at appropriate levels (DEBUG, INFO, WARN, ERROR)
- Never log sensitive data (PII, credentials, tokens)
- Set log retention policies to control costs
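The practices above can be applied with stdlib `logging` alone. A sketch of a JSON formatter (the service name and field list are illustrative) that emits one object per line and carries correlation IDs through the `extra` kwarg:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, machine-parseable by any aggregator."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": "order-service",   # illustrative service name
            "message": record.getMessage(),
        }
        # Correlation and domain fields arrive via logger's `extra` kwarg.
        for key in ("trace_id", "span_id", "order_id", "error_code"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)   # log to stdout for collectors
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("order-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment processing failed",
             extra={"trace_id": "abc123def456", "error_code": "GATEWAY_TIMEOUT"})
```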
CloudWatch Logs Insights
# Find top 10 slowest API endpoints in the last hour
fields @timestamp, @message
| filter @message like /duration_ms/
| parse @message '"path":"*","duration_ms":*,' as path, duration
| stats avg(duration) as avg_ms, max(duration) as max_ms, count() as requests by path
| sort max_ms desc
| limit 10
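Insights queries run asynchronously through the CloudWatch Logs API: `start_query` returns a query ID, and `get_query_results` is polled until the status is `Complete`. A sketch of the request builder, with the log group name as a placeholder:

```python
import time

def insights_query_params(log_group, query, window_minutes=60):
    """Build a start_query request covering the last `window_minutes` of logs."""
    end = int(time.time())
    return {
        "logGroupName": log_group,
        "startTime": end - window_minutes * 60,  # epoch seconds
        "endTime": end,
        "queryString": query,
    }

# With AWS credentials configured:
# import boto3
# logs = boto3.client("logs")
# q = logs.start_query(**insights_query_params(
#     "/app/orders",                             # placeholder log group
#     "fields @timestamp, @message | limit 10"))
# results = logs.get_query_results(queryId=q["queryId"])  # poll until "Complete"
```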
Distributed Tracing
How Tracing Works
Client ──► API Gateway ──► Order Service ──► Payment Service
              │                  │                  │
        Trace: abc123       Span: order-1      Span: pay-1
        Span: gw-1          Parent: gw-1       Parent: order-1
              │                  │                  │
              ▼                  ▼                  ▼
┌──────────────────────────────────────────────────┐
│ Trace abc123                                     │
│ ├── gw-1      [════════════════════════════]     │
│ ├── order-1       [══════════════════]           │
│ ├── db-query      [═══]                          │
│ └── pay-1              [══════════]              │
│ Total: 450ms                                     │
└──────────────────────────────────────────────────┘
Tracing Services
| Service       | Provider | Protocol                      |
|---------------|----------|-------------------------------|
| AWS X-Ray     | AWS      | X-Ray SDK, OpenTelemetry      |
| Cloud Trace   | GCP      | OpenTelemetry, Zipkin         |
| Azure Monitor | Azure    | OpenTelemetry                 |
| Jaeger        | OSS      | OpenTracing, OpenTelemetry    |
| Tempo         | Grafana  | OpenTelemetry, Zipkin, Jaeger |
OpenTelemetry
OpenTelemetry (OTel) is the industry-standard framework for collecting telemetry data.
# Auto-instrumentation with OpenTelemetry
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
# Auto-instrument frameworks
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()
# Manual span for custom business logic
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    span.set_attribute("order.total", total)
    result = process(order_id)
Custom Metrics
Business Metrics
Beyond infrastructure metrics, track domain-specific indicators.
# Publish custom metric to CloudWatch
import boto3
cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_data(
    Namespace='OrderService',
    MetricData=[{
        'MetricName': 'OrderValue',
        'Dimensions': [
            {'Name': 'Region', 'Value': 'us-east-1'},
            {'Name': 'PaymentType', 'Value': 'credit_card'},
        ],
        'Value': order_total,
        'Unit': 'None',
        'StorageResolution': 60  # Standard resolution
    }]
)
SLIs, SLOs, and SLAs
| Concept         | Definition                                   | Example                       |
|-----------------|----------------------------------------------|-------------------------------|
| SLI (Indicator) | Quantitative measure of service behavior     | 99.2% of requests < 200ms     |
| SLO (Objective) | Target value for an SLI                      | 99.5% of requests < 200ms     |
| SLA (Agreement) | Contract with consequences                   | 99.9% uptime or credit issued |
| Error budget    | Allowed failure rate (1 − SLO) minus failures already spent | 0.5% − 0.2% = 0.3% remaining |
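The error-budget arithmetic is simple subtraction over failure rates; a small helper makes it concrete:

```python
def error_budget_remaining(slo_target, observed_good_ratio):
    """Remaining error budget = allowed failure rate minus failures already spent."""
    budget = 1.0 - slo_target          # e.g. 99.5% SLO  -> 0.5% budget
    spent = 1.0 - observed_good_ratio  # e.g. 99.8% good -> 0.2% spent
    return budget - spent

# The table's example: 0.5% budget, 0.2% spent -> 0.3% remaining
remaining = error_budget_remaining(0.995, 0.998)
```

A negative result means the SLO has been breached for the period.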
Dashboards and Alerting
Dashboard Design Principles
- USE method for resources: Utilization, Saturation, Errors
- RED method for services: Rate, Errors, Duration
- Layered views: Overview → service → instance drill-down
- Business context: Correlate technical metrics with business KPIs
Effective Alerting
| Principle                     | Practice                                         |
|-------------------------------|--------------------------------------------------|
| Alert on symptoms, not causes | Alert on error rate, not CPU                     |
| Reduce noise                  | Multi-window, multi-burn-rate alerts             |
| Actionable alerts only        | Every alert should have a runbook                |
| Appropriate urgency           | Page for customer impact, ticket for degradation |
| Avoid alert fatigue           | Review and prune alerts quarterly                |
Cost Monitoring and Optimization
AWS Cost Management Tools
| Tool                   | Purpose                                  |
|------------------------|------------------------------------------|
| Cost Explorer          | Visualize spending trends and forecasts  |
| Budgets                | Set spending thresholds with alerts      |
| Cost Anomaly Detection | ML-based unusual spend detection         |
| Savings Plans          | Commit to usage for discounts            |
| Trusted Advisor        | Recommendations for cost optimization    |
| Compute Optimizer      | Right-sizing recommendations             |
Cost Allocation
Organization
├── Account: Production
│ ├── Tag: team=platform → $12,000/mo
│ ├── Tag: team=data → $8,500/mo
│ └── Tag: team=frontend → $3,200/mo
├── Account: Staging → $2,100/mo
└── Account: Development → $1,800/mo
- Tagging strategy: Enforce tags (team, environment, project, cost-center)
- AWS Organizations: Separate accounts per environment/team
- Cost allocation tags: Activate tags in Billing for cost breakdowns
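Tag-based breakdowns like the tree above come from the Cost Explorer API. A sketch of a `get_cost_and_usage` request grouped by the `team` tag (the tag key mirrors the tagging strategy above and must be activated as a cost allocation tag first):

```python
def cost_by_team_params(start, end):
    """Build a Cost Explorer request for monthly cost grouped by the team tag.
    `start` and `end` are YYYY-MM-DD date strings."""
    return {
        "TimePeriod": {"Start": start, "End": end},
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": [{"Type": "TAG", "Key": "team"}],  # must be activated in Billing
    }

# With AWS credentials configured:
# import boto3
# ce = boto3.client("ce")
# resp = ce.get_cost_and_usage(**cost_by_team_params("2026-02-01", "2026-03-01"))
# for group in resp["ResultsByTime"][0]["Groups"]:
#     print(group["Keys"], group["Metrics"]["UnblendedCost"]["Amount"])
```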
Common Cost Optimization Actions
- Delete unused resources: Unattached EBS volumes, idle load balancers, old snapshots
- Right-size instances: Match instance family and size to workload
- Use Savings Plans / Reserved Instances: 30-72% savings for predictable workloads
- Spot instances: 60-90% savings for fault-tolerant batch workloads
- Storage lifecycle policies: Auto-tier to cheaper storage classes
- Review data transfer: Minimize cross-region and internet egress
- Optimize Lambda: Right-size memory, reduce duration, use Graviton
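The first action on the list is easy to automate: an EBS volume in the `available` state is attached to nothing and still billing. A sketch of the filter logic, with the actual describe call left commented:

```python
def unattached_volumes(volumes):
    """Return IDs of volumes attached to nothing (state 'available')."""
    return [v["VolumeId"] for v in volumes if v["State"] == "available"]

# Against a real account (read-only describe call), with credentials configured:
# import boto3
# ec2 = boto3.client("ec2")
# resp = ec2.describe_volumes(
#     Filters=[{"Name": "status", "Values": ["available"]}])
# print(unattached_volumes(resp["Volumes"]))
```

A scheduled Lambda running this check and reporting (or snapshotting and deleting) is a common first FinOps automation.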
FinOps
FinOps is the practice of bringing financial accountability to cloud spending.
FinOps Lifecycle
Inform ──────► Optimize ──────► Operate
   │                │                │
Cost visibility  Right-sizing    Governance
Allocation       Reservations    Automation
Forecasting      Waste removal   Continuous
Showback         Architecture    improvement
FinOps Practices
| Phase    | Activity                      | Tools                             |
|----------|-------------------------------|-----------------------------------|
| Inform   | Cost allocation and showback  | Cost Explorer, Kubecost           |
| Inform   | Forecasting and budgeting     | AWS Budgets, Datadog Cost         |
| Optimize | Right-sizing compute          | Compute Optimizer, Spot.io        |
| Optimize | Reserved/Savings Plans        | AWS Cost Explorer recommendations |
| Operate  | Anomaly detection             | Cost Anomaly Detection            |
| Operate  | Automated scheduling          | Instance Scheduler, Lambda        |
| Operate  | Tagging enforcement           | AWS Config rules, SCP             |
Key FinOps Metrics
- Unit economics: Cost per transaction, cost per customer, cost per request
- Coverage: Percentage of spend covered by reservations/savings plans
- Waste: Percentage of spend on idle or unused resources
- Forecast accuracy: Predicted vs actual spend variance
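These metrics are simple ratios over billing data; a sketch with illustrative numbers:

```python
def unit_cost(total_spend, units):
    """Unit economics: spend divided by units of business value served."""
    return total_spend / units

def commitment_coverage(committed_spend, total_spend):
    """Coverage: fraction of spend under reservations or savings plans."""
    return committed_spend / total_spend

def waste_ratio(idle_spend, total_spend):
    """Waste: fraction of spend going to idle or unused resources."""
    return idle_spend / total_spend

# Illustrative month: $12,000 spend, 3M requests served,
# $9,000 under savings plans, $600 on idle resources.
cpr = unit_cost(12_000, 3_000_000)        # cost per request
cov = commitment_coverage(9_000, 12_000)  # commitment coverage
waste = waste_ratio(600, 12_000)          # waste share
```

Tracking unit cost over time is usually more meaningful than total spend, since it stays flat when spend grows in line with the business.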
Key Takeaways
- Observability combines metrics, logs, and traces to understand system behavior
- CloudWatch, Cloud Monitoring, and Prometheus cover infrastructure and custom metrics
- Structured logging with correlation IDs enables cross-service debugging
- OpenTelemetry is the standard for vendor-neutral instrumentation
- Alert on symptoms (error rate, latency) rather than causes (CPU, memory)
- FinOps brings financial discipline to cloud spending through visibility, optimization, and governance
- Cost optimization is continuous: tag, monitor, right-size, reserve, and automate