Grafana Dashboards
Grafana is the visualization layer for your metrics. It connects to Prometheus (and dozens of other data sources) and turns time-series data into dashboards you can actually use. Together, Prometheus and Grafana are the de facto standard monitoring stack for cloud-native infrastructure.
Grafana + Prometheus
The setup is straightforward: Prometheus collects and stores metrics, Grafana queries Prometheus and renders panels.
# Grafana datasource configuration
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    access: proxy
    isDefault: true
    jsonData:
      timeInterval: 15s
Once connected, every PromQL query you write in Grafana runs against Prometheus. The query editor supports autocomplete, and you can switch between the visual builder and raw PromQL.
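To make the data flow concrete, the sketch below builds roughly the kind of request that ends up at Prometheus when a panel runs a query, using the Prometheus HTTP API's instant-query endpoint (`/api/v1/query`). The base URL is taken from the datasource config above. The helper names `instant_query_url` and `run_instant_query` are hypothetical, and the network call is only defined here, not executed.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://prometheus:9090"  # assumed: same URL as the datasource config


def instant_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build a URL for the Prometheus instant-query endpoint."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url}/api/v1/query?{params}"


def run_instant_query(promql: str, base_url: str = PROMETHEUS_URL) -> list:
    """Send the query and return the result list (needs a reachable Prometheus)."""
    with urllib.request.urlopen(instant_query_url(promql, base_url), timeout=10) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]


# Only build the URL here; call run_instant_query() where Prometheus is reachable.
url = instant_query_url("sum(rate(http_requests_total[5m]))")
print(url)
```

Running the same PromQL outside Grafana like this is a handy way to check whether a confusing panel is a query problem or a visualization problem.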
Dashboard Design Principles
The goal of a dashboard is to answer a question at a glance. If someone has to study your dashboard for five minutes to understand what is happening, the dashboard has failed.
The USE Method
For infrastructure resources (CPU, memory, disk, network), use Brendan Gregg's USE method:
- Utilization: How busy is the resource? (percentage of time busy)
- Saturation: How much extra work is queued? (queue length)
- Errors: How many error events occurred?
# CPU Utilization
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
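# CPU Saturation (load average per core)
# on (instance) is needed because node_load1 carries extra labels (such as job)
# that the count(...) by (instance) result does not
node_load1 / on (instance) count(node_cpu_seconds_total{mode="idle"}) by (instance)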
# Disk Errors (availability of this metric depends on your exporter and platform)
rate(node_disk_io_errors_total[5m])
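The arithmetic behind the utilization and saturation queries can be sketched in a few lines of Python. The per-core idle fractions and the load average below are made-up sample values, and the function names are hypothetical; the point is only to show what the PromQL expressions compute for a single instance.

```python
def cpu_utilization(idle_fractions: list[float]) -> float:
    """1 minus the average idle fraction across cores (mirrors the utilization query)."""
    return 1 - sum(idle_fractions) / len(idle_fractions)


def cpu_saturation(load1: float, cores: int) -> float:
    """1-minute load average per core; values above 1.0 suggest queued work."""
    return load1 / cores


# Hypothetical 4-core instance: fraction of the last 5 minutes each core sat idle.
idle = [0.25, 0.5, 0.75, 0.5]

print(cpu_utilization(idle))            # 0.5
print(cpu_saturation(6.0, len(idle)))   # 1.5
```

A utilization of 0.5 with a per-core load of 1.5 is the kind of combination the USE method is meant to surface: the CPU does not look fully busy, yet work is queuing.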
The RED Method
For request-driven services (APIs, microservices), use Tom Wilkie's RED method:
- Rate: How many requests per second?
- Errors: How many of those requests are failing?
- Duration: How long do those requests take?
# Rate
sum(rate(http_requests_total[5m])) by (service)
# Error rate (percentage)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
* 100
# Duration (p50, p90, p99)
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.90, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
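It helps to know that `histogram_quantile` estimates a percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The following is a simplified sketch of that idea, not Prometheus's actual implementation: it assumes sorted buckets ending in `+Inf`, a lower bound of 0 for the first bucket, and made-up bucket counts.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for le="0.1", "0.5", "1", "+Inf".
buckets = [(0.1, 40), (0.5, 80), (1.0, 98), (math.inf, 100)]

for q in (0.50, 0.90, 0.99):
    print(q, round(histogram_quantile(q, buckets), 3))
```

Two practical consequences follow: the estimate can never be more precise than the bucket boundaries, and a p99 that lands in the `+Inf` bucket is reported as the largest finite boundary, so bucket layout should bracket your SLO target.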
The Golden Signals Dashboard
Google's SRE book defines four golden signals: latency, traffic, errors, and saturation. A single dashboard covering these for your primary service is the most valuable dashboard you will build.
Latency
Time to serve a request. Show percentiles, not averages. The average can look fine while the p99 is catastrophic.
# Panel: Request Latency
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.90, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Display as a time series graph with three lines: p50, p90, p99. Add a horizontal threshold line at your SLO target.
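A small numeric example shows why the average misleads. The latency samples below are invented (98 fast requests and 2 very slow ones), and `percentile` is a hypothetical nearest-rank helper rather than anything Grafana or Prometheus provides.

```python
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in seconds.
latencies = [0.05] * 98 + [5.0] * 2

mean = sum(latencies) / len(latencies)
print(f"mean = {mean:.3f}s")                      # mean = 0.149s
print(f"p50  = {percentile(latencies, 50):.3f}s")  # p50  = 0.050s
print(f"p99  = {percentile(latencies, 99):.3f}s")  # p99  = 5.000s
```

The mean of roughly 150 ms looks healthy, while the p99 shows that some users wait five seconds.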
Traffic
Volume of requests the system is handling.
# Panel: Request Rate
sum(rate(http_requests_total[5m]))
Errors
Rate of failed requests.
# Panel: Error Rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100
Display as a percentage. Add a threshold at your error budget (e.g., 0.1% for a 99.9% SLO).
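The threshold value follows directly from the SLO. A minimal sketch of that arithmetic, with hypothetical function names and sample error rates:

```python
def error_budget_percent(slo_percent: float) -> float:
    """Allowed error rate in percent, e.g. 99.9 -> 0.1."""
    # Rounding hides floating-point noise such as 0.09999999999999432.
    return round(100.0 - slo_percent, 6)


def over_budget(error_rate_percent: float, slo_percent: float) -> bool:
    """True when the observed error rate exceeds what the SLO allows."""
    return error_rate_percent > error_budget_percent(slo_percent)


print(error_budget_percent(99.9))   # 0.1
print(over_budget(0.25, 99.9))      # True
print(over_budget(0.05, 99.9))      # False
```

Because the panel query multiplies by 100, the threshold is entered as 0.1, not 0.001.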
Saturation
How full your service is. For most services, this maps to resource utilization.
# Panel: Memory Usage
container_memory_usage_bytes{pod=~"myapp.*"}
/
container_spec_memory_limit_bytes{pod=~"myapp.*"}
* 100
# Panel: CPU Usage
sum(rate(container_cpu_usage_seconds_total{pod=~"myapp.*"}[5m])) by (pod)
/
sum(container_spec_cpu_quota{pod=~"myapp.*"} / container_spec_cpu_period{pod=~"myapp.*"}) by (pod)
* 100
Variable Templates
Templates make dashboards reusable. Instead of hardcoding service names, use variables that appear as dropdowns.
# Variable: service
Label: Service
Query: label_values(http_requests_total, service)
Refresh: On dashboard load
# Variable: instance
Label: Instance
Query: label_values(http_requests_total{service="$service"}, instance)
Refresh: On dashboard load
Then use $service and $instance in your panel queries:
rate(http_requests_total{service="$service", instance="$instance"}[5m])
Users can switch between services without editing queries. Chain variables so selecting a service filters the available instances.
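Conceptually, Grafana substitutes the selected values into the query text before sending it to Prometheus. The sketch below imitates that with Python's `string.Template`, which happens to use the same `$name` syntax; it is an analogy, not Grafana's interpolation code, and the service and instance values are made up.

```python
from string import Template

# Same panel query as above, with $service and $instance placeholders.
PANEL_QUERY = Template(
    'rate(http_requests_total{service="$service", instance="$instance"}[5m])'
)


def render(service: str, instance: str) -> str:
    """Fill in the dashboard variables the way a dropdown selection would."""
    return PANEL_QUERY.substitute(service=service, instance=instance)


print(render("checkout", "10.0.0.5:8080"))
# rate(http_requests_total{service="checkout", instance="10.0.0.5:8080"}[5m])
```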
Multi-Value Variables
Allow selecting multiple values to compare services side by side:
# Enable multi-value and "All" option in variable settings
rate(http_requests_total{service=~"$service"}[5m])
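The =~ operator performs regex matching, and Grafana expands a multi-value variable into a regex alternation such as value1|value2|value3. An exact match (service="$service") only works while a single value is selected, so switch every query that uses the variable to =~ once multi-value is enabled.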
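The expansion can be imitated in Python to see what the final query looks like. This is a sketch under assumptions: `multi_value_regex` is a hypothetical helper, the selected values are made up, and `re.escape` only approximates whatever escaping Grafana applies.

```python
import re


def multi_value_regex(values: list[str]) -> str:
    """Join selected values into a regex alternation, escaping special characters."""
    return "|".join(re.escape(v) for v in values)


selected = ["checkout", "payments", "search"]
pattern = multi_value_regex(selected)
query = f'rate(http_requests_total{{service=~"{pattern}"}}[5m])'

print(query)
# rate(http_requests_total{service=~"checkout|payments|search"}[5m])

# Prometheus anchors regex matchers at both ends, so fullmatch is the right analogy.
print(bool(re.fullmatch(pattern, "payments")))      # True
print(bool(re.fullmatch(pattern, "payments-v2")))   # False
```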
Annotations for Deploys
Mark deployment events on your dashboards so you can correlate changes with metric shifts.
# Push annotation via Grafana API after each deploy
# In your CI/CD pipeline:
curl -X POST http://grafana:3000/api/annotations \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Deploy v1.2.3 by ci-bot",
    "tags": ["deploy", "myapp"],
    "time": '$(date +%s000)'
  }'
Configure annotation queries in the dashboard settings to display these as vertical lines on every panel. When you see a latency spike, the deploy marker tells you immediately whether it correlates with a release.
Dashboard JSON & Provisioning
Dashboards are JSON. Store them in version control and provision them automatically.
# Grafana provisioning config
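apiVersion: 1
providers:
  - name: default
    orgId: 1
    folder: "Platform"
    type: file
    options:
      path: /var/lib/grafana/dashboards
      # To mirror the directory layout as Grafana folders instead, set
      # foldersFromFilesStructure: true and leave folder blank.
      foldersFromFilesStructure: false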
Export a dashboard as JSON from the Grafana UI, commit it to your repository, and mount it into the Grafana container. Changes go through pull requests, not ad-hoc UI edits.
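Once dashboards go through pull requests, a small check in CI can catch broken exports before they are provisioned. The sketch below is one possible shape for such a check: the set of required keys is an assumption chosen for illustration, the function names are hypothetical, and the demo runs against a temporary directory rather than a real repository.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = ("title", "uid", "panels")  # assumed minimum for this check


def check_dashboard(path: Path) -> list[str]:
    """Return the problems found in one dashboard file (empty list if none)."""
    try:
        dashboard = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"{path.name}: invalid JSON ({exc.msg})"]
    return [f"{path.name}: missing '{key}'" for key in REQUIRED_KEYS if key not in dashboard]


def check_directory(root: Path) -> list[str]:
    """Check every *.json file under root, in a stable order."""
    problems: list[str] = []
    for path in sorted(root.rglob("*.json")):
        problems.extend(check_dashboard(path))
    return problems


# Demo with two throwaway files: one complete, one missing keys.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    good = {"title": "Service overview", "uid": "svc-overview", "panels": []}
    (root / "good.json").write_text(json.dumps(good), encoding="utf-8")
    (root / "bad.json").write_text(json.dumps({"title": "No uid"}), encoding="utf-8")
    problems = check_directory(root)

print(problems)
# ["bad.json: missing 'uid'", "bad.json: missing 'panels'"]
```

In a pipeline you would point `check_directory` at the dashboards folder in the repository and fail the build when the returned list is not empty.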
Panel Types
Choose the right visualization for the data:
- Time series: The default. Use for rates, latencies, utilization over time.
- Stat: Single number. Use for current values -- active users, error count in the last hour.
- Gauge: Progress toward a limit. Use for percentage-based metrics -- disk usage, SLO budget remaining.
- Table: Tabular data. Use for top-N queries or instance-level breakdowns.
- Heatmap: Distribution over time. Use for histogram data -- shows how latency distribution shifts.
- Logs: Display log lines from Loki or Elasticsearch alongside metrics.
Panel Configuration Tips
# Set meaningful Y-axis units
Unit: requests/sec, seconds, bytes, percent (0-100)
# Set thresholds for visual cues (values use the panel's own scale; the error rate query above yields 0-100)
Base: green
1: yellow (1% error rate warning)
5: red (5% error rate critical)
# Use legend to clarify multi-line charts
Legend mode: Table
Legend values: Last, Min, Max, Avg
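Threshold steps work as a simple lookup: a value takes the color of the highest step it has reached, or the base color if it has reached none. The sketch below shows that logic with illustrative step values that assume the error rate is expressed on a 0-100 scale; `threshold_color` is a hypothetical helper, not a Grafana API.

```python
from bisect import bisect_right


def threshold_color(value: float, steps: list[tuple[float, str]], base: str = "green") -> str:
    """Return the color of the highest step whose threshold is <= value."""
    bounds = [threshold for threshold, _ in steps]
    index = bisect_right(bounds, value)
    return base if index == 0 else steps[index - 1][1]


# Illustrative steps for an error-rate panel on a 0-100 percent scale.
ERROR_RATE_STEPS = [(1.0, "yellow"), (5.0, "red")]

for rate in (0.2, 1.0, 3.4, 7.5):
    print(rate, threshold_color(rate, ERROR_RATE_STEPS))
# 0.2 green
# 1.0 yellow
# 3.4 yellow
# 7.5 red
```

The lookup only makes sense when the step values and the query result share a scale, so check the panel unit whenever a threshold never seems to trigger.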
Build 3 Dashboards, Not 50
Most teams create too many dashboards. Nobody looks at them. Focus on three:
- Service overview: The golden signals dashboard described above. One per critical service. This is where you start during an incident.
- Infrastructure: Node health, Kubernetes cluster state, resource utilization across the fleet. This answers "is the platform healthy?"
- SLO tracking: Error budget burn rate, SLI trends, availability over rolling windows. This is for weekly review and capacity planning.
If someone asks for a fourth dashboard, ask what question it answers that the first three do not. Usually the answer is "I want to see this one specific metric" -- which is a panel on an existing dashboard, not a new dashboard.
Common Pitfalls
- Dashboard sprawl. Fifty dashboards and nobody can find anything. Organize into folders, limit creation, archive unused dashboards.
- Using averages instead of percentiles. An average latency of 100ms can hide a p99 of 5 seconds. Always show percentiles for latency.
- No annotations. Without deploy markers, correlating changes with metric shifts requires guesswork.
- Hardcoded service names. Use template variables. A dashboard locked to one service name has to be copied and edited for every other service.
- Editing dashboards in the UI without saving to code. Someone clicks "save," someone else overwrites it. Store dashboard JSON in version control and provision automatically.
- Too many panels per dashboard. If a dashboard has 40 panels, it loads slowly and nobody scrolls to the bottom. Aim for 8 to 12 panels per dashboard.
- Ignoring time range defaults. Set a sensible default time range (last 1 hour for operational dashboards, last 7 days for SLO tracking).
Key Takeaways
- Use the RED method (rate, errors, duration) for services and the USE method (utilization, saturation, errors) for infrastructure.
- Build a golden signals dashboard for every critical service. It is the first place you look during an incident.
- Template variables make dashboards reusable across services and instances.
- Annotate deployments so you can correlate releases with metric changes.
- Store dashboards as JSON in version control. Provision them automatically.
- Build 3 good dashboards instead of 50 mediocre ones. Each dashboard should answer a clear question.