Grafana Dashboards

Grafana is the visualization layer for your metrics. It connects to Prometheus (and dozens of other data sources) and turns time-series data into dashboards you can actually use. Prometheus plus Grafana is a widely used monitoring stack for cloud-native infrastructure.

Grafana + Prometheus

The setup is straightforward: Prometheus collects and stores metrics, Grafana queries Prometheus and renders panels.

# Grafana datasource configuration
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    access: proxy
    isDefault: true
    jsonData:
      timeInterval: 15s  # set this to your Prometheus scrape interval

Once connected, every PromQL query you write in Grafana runs against Prometheus. The query editor supports autocomplete, and you can switch between the visual builder and raw PromQL.
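To check a query outside the editor, you can hit Prometheus directly. The following is a minimal Python sketch that builds the URL for Prometheus's instant-query HTTP endpoint (/api/v1/query). The base URL is the one from the datasource config; the function name is made up for illustration. Grafana itself usually issues range queries for time series panels, so treat this as a debugging aid, not a picture of Grafana's internals.

```python
from urllib.parse import urlencode

# Assumed base URL, copied from the datasource config.
PROMETHEUS_URL = "http://prometheus:9090"


def instant_query_url(promql, base_url=PROMETHEUS_URL):
    """Return the URL for Prometheus's instant-query endpoint."""
    # urlencode percent-encodes brackets, braces and quotes in the PromQL.
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


print(instant_query_url("up"))
# http://prometheus:9090/api/v1/query?query=up
```

Paste the resulting URL into curl or a browser to confirm that a query returns data before you start debugging the panel.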

Dashboard Design Principles

The goal of a dashboard is to answer a question at a glance. If someone has to study your dashboard for five minutes to understand what is happening, the dashboard has failed.

The USE Method

For infrastructure resources (CPU, memory, disk, network), use Brendan Gregg's USE method:

Utilization: How busy is the resource? (percentage of time busy)
Saturation: How much extra work is queued? (queue length)
Errors: How many error events occurred?

# CPU Utilization (fraction from 0 to 1 per instance; multiply by 100 for percent)
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

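# CPU Saturation (1-minute load average per core; above 1 means work is queuing)
# on(instance) is required because node_load1 carries labels (such as job) that the count does not
node_load1 / on(instance) count by (instance) (node_cpu_seconds_total{mode="idle"})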

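# Errors (network interface errors per second; check which error counters your exporter exposes)
rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m])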
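It helps to see what the CPU utilization query actually computes. The sketch below redoes the arithmetic in Python for one instance with two cores. The counter readings are made-up numbers, and the calculation ignores the extrapolation and counter-reset handling that Prometheus's rate() performs.

```python
def cpu_utilization(idle_start, idle_end, window_seconds):
    """Mirror 1 - avg(rate(idle_counter[window])) for a single instance.

    idle_start and idle_end hold one idle-seconds counter reading per core,
    taken at the start and end of the window.
    """
    idle_rates = [
        (end - start) / window_seconds
        for start, end in zip(idle_start, idle_end)
    ]
    return 1 - sum(idle_rates) / len(idle_rates)


# Two cores over a 5-minute (300 s) window.
# Core 0 was idle for 240 s, core 1 for 150 s.
utilization = cpu_utilization([1000.0, 2000.0], [1240.0, 2150.0], 300)
print(f"{utilization:.0%}")  # 35%
```

Each core's idle rate is the fraction of the window it spent idle (0.8 and 0.5 here), so one minus the average is the fraction of time the instance was busy.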

The RED Method

For request-driven services (APIs, microservices), use Tom Wilkie's RED method:

Rate: How many requests per second?
Errors: How many of those requests are failing?
Duration: How long do those requests take?

# Rate
sum(rate(http_requests_total[5m])) by (service)

# Error rate (percentage)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
* 100

# Duration (p50, p90, p99)
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.90, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
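The percentiles that histogram_quantile returns are estimates, interpolated from bucket counts. Here is a simplified Python sketch of that interpolation, assuming cumulative buckets sorted by upper bound and a lowest bucket that starts at 0. It is an illustration of the idea, not Prometheus's implementation, and the bucket values are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, ending with (math.inf, total_count).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies beyond the last finite bucket, so the
                # best available answer is that bucket's upper bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# 100 requests: 50 under 0.1 s, 90 under 0.5 s, 99 under 1 s.
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(round(histogram_quantile(0.95, buckets), 3))  # 0.778
```

Because the result is interpolated, its accuracy depends on bucket boundaries. Place bucket bounds close to your SLO target so the percentile near that value is meaningful.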

The Golden Signals Dashboard

Google's SRE book defines four golden signals. A single dashboard covering these for your primary service is the most valuable dashboard you will build.

Latency

Time to serve a request. Show percentiles, not averages. The average can look fine while the p99 is catastrophic.

# Panel: Request Latency
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.90, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

Display as a time series graph with three lines: p50, p90, p99. Add a horizontal threshold line at your SLO target.
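A quick worked example shows why the average misleads. The latencies below are invented: 98 requests at 50 ms and 2 requests at 5 seconds.

```python
import statistics

# 98 fast requests (50 ms) and 2 slow ones (5 s), in seconds.
latencies = [0.05] * 98 + [5.0] * 2

mean = statistics.mean(latencies)
median = statistics.median(latencies)
# quantiles(n=100) returns the 99 cut points p1..p99; index 98 is p99.
p99 = statistics.quantiles(latencies, n=100)[98]

print(f"mean={mean * 1000:.0f} ms  p50={median * 1000:.0f} ms  p99={p99 * 1000:.0f} ms")
# mean=149 ms  p50=50 ms  p99=5000 ms
```

A 149 ms average looks acceptable, yet 2% of users waited five seconds. Only the p99 line makes that visible.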

Traffic

Volume of requests the system is handling.

# Panel: Request Rate
sum(rate(http_requests_total[5m]))

Errors

Rate of failed requests.

# Panel: Error Rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100

Display as a percentage. Add a threshold at your error budget (e.g., 0.1% for a 99.9% SLO).
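The threshold value follows directly from the SLO. A small Python sketch of the arithmetic, with hypothetical function names and a made-up request volume:

```python
def error_budget_percent(slo_percent):
    """Allowed error rate, in percent, for an SLO given in percent."""
    # Rounding removes float noise such as 0.0999999999999943.
    return round(100.0 - slo_percent, 6)


def allowed_failures(total_requests, slo_percent):
    """Number of requests that may fail before the budget is spent."""
    return round(total_requests * error_budget_percent(slo_percent) / 100)


print(error_budget_percent(99.9))         # 0.1
print(allowed_failures(1_000_000, 99.9))  # 1000
```

Because the panel query multiplies by 100, the threshold is entered as 0.1, not 0.001.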

Saturation

How full your service is. For most services, this maps to resource utilization.

# Panel: Memory Usage
container_memory_usage_bytes{pod=~"myapp.*"}
/
container_spec_memory_limit_bytes{pod=~"myapp.*"}
* 100

# Panel: CPU Usage
sum(rate(container_cpu_usage_seconds_total{pod=~"myapp.*"}[5m])) by (pod)
/
sum(container_spec_cpu_quota{pod=~"myapp.*"} / container_spec_cpu_period{pod=~"myapp.*"}) by (pod)
* 100
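The CPU panel divides usage by a limit derived from the CFS quota and period. The arithmetic can be sketched in Python as follows; the numbers are made up, and the quota and period are assumed to be in microseconds, as the container runtime reports them.

```python
def cpu_limit_cores(quota_us, period_us):
    """CPU limit in cores: runtime allowed per scheduling period."""
    return quota_us / period_us


def cpu_usage_percent(usage_cores, quota_us, period_us):
    """Usage as a percentage of the limit, as in the CPU panel query."""
    return usage_cores / cpu_limit_cores(quota_us, period_us) * 100


# A quota of 50 ms per 100 ms period is a limit of half a core.
print(cpu_limit_cores(50_000, 100_000))          # 0.5
# Using 0.25 cores against that limit is 50% saturation.
print(cpu_usage_percent(0.25, 50_000, 100_000))  # 50.0
```

Containers without a CPU limit have no meaningful quota, so their series will be missing or nonsensical in this panel. Filter them out or chart them against requests instead.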

Variable Templates

Templates make dashboards reusable. Instead of hardcoding service names, use variables that appear as dropdowns.

# Variable: service
Label:       Service
Query:       label_values(http_requests_total, service)
Refresh:     On dashboard load

# Variable: instance
Label:       Instance
Query:       label_values(http_requests_total{service="$service"}, instance)
Refresh:     On dashboard load

Then use $service and $instance in your panel queries:

rate(http_requests_total{service="$service", instance="$instance"}[5m])

Users can switch between services without editing queries. Chain variables so selecting a service filters the available instances.

Multi-Value Variables

Allow selecting multiple values to compare services side by side:

# Enable multi-value and "All" option in variable settings
rate(http_requests_total{service=~"$service"}[5m])

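The =~ operator uses regex matching. With the Prometheus data source, Grafana formats a multi-select variable as a regex alternation such as (value1|value2|value3), so an exact-match = would never match more than one selected value.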
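To make the substitution concrete, here is a simplified Python sketch of how a multi-value selection ends up in the query. The function names are hypothetical, and the escaping of regex metacharacters that Grafana performs is deliberately left out, so this only holds for plain values like the ones shown.

```python
import re


def format_multi_value(values):
    """Join the selected values into a regex alternation."""
    if len(values) == 1:
        return values[0]
    return "(" + "|".join(values) + ")"


def interpolate(query, selections):
    """Replace each $name in the query with its formatted selection."""
    def substitute(match):
        name = match.group(1)
        if name not in selections:
            return match.group(0)  # leave unknown variables untouched
        return format_multi_value(selections[name])

    return re.sub(r"\$(\w+)", substitute, query)


query = 'rate(http_requests_total{service=~"$service"}[5m])'
print(interpolate(query, {"service": ["checkout", "payments"]}))
# rate(http_requests_total{service=~"(checkout|payments)"}[5m])
```

The service names checkout and payments are placeholders. With a single selection the variable expands to the bare value, which is why the same query works for one service or several.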

Annotations for Deploys

Mark deployment events on your dashboards so you can correlate changes with metric shifts.

# Push annotation via Grafana API after each deploy
# In your CI/CD pipeline:

curl -X POST http://grafana:3000/api/annotations \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Deploy v1.2.3 by ci-bot",
    "tags": ["deploy", "myapp"],
    "time": '$(date +%s000)'
  }'

Add an annotation query in the dashboard settings, filtered on the deploy tag, to display these as vertical lines on every panel. When you see a latency spike, the deploy marker tells you immediately whether it correlates with a release.
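If your pipeline steps are written in Python, the same request can be built with the standard library. This is a minimal sketch under the same assumptions as the curl call (Grafana at http://grafana:3000 and a token in hand); the function name and the example token are placeholders, and the request is built but not sent.

```python
import json
import time
import urllib.request


def build_annotation_request(base_url, token, text, tags, when=None):
    """Build a POST request for Grafana's /api/annotations endpoint.

    when is a Unix timestamp in seconds; it defaults to the current time.
    """
    seconds = time.time() if when is None else when
    body = json.dumps({
        "text": text,
        "tags": tags,
        "time": int(seconds * 1000),  # the API expects epoch milliseconds
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/annotations",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


request = build_annotation_request(
    "http://grafana:3000",
    "example-token",
    "Deploy v1.2.3 by ci-bot",
    ["deploy", "myapp"],
    when=1_700_000_000,
)
print(json.loads(request.data)["time"])  # 1700000000000
```

Sending it is one more line, urllib.request.urlopen(request, timeout=10), which you would wrap in error handling so a Grafana outage does not fail the deploy.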

Dashboard JSON & Provisioning

Dashboards are JSON. Store them in version control and provision them automatically.

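# Grafana provisioning config
apiVersion: 1
providers:
  - name: default
    orgId: 1
    # Leave folder empty: foldersFromFilesStructure only takes effect when no
    # folder is set. A Platform/ subdirectory under path becomes the "Platform" folder.
    folder: ""
    type: file
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true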

Export a dashboard as JSON from the Grafana UI, commit it to your repository, and mount it into the Grafana container. Changes go through pull requests, not ad-hoc UI edits.
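Exported JSON tends to produce noisy diffs. One way to tame that is to normalize each export before committing it. The sketch below is one possible approach, not a Grafana tool: it assumes the numeric id is specific to the instance the dashboard was exported from (the uid is what identifies it across instances) and that the version counter is not worth tracking in Git.

```python
import json


def normalize_dashboard(exported):
    """Return a stable JSON string for a dashboard given as a parsed dict."""
    dashboard = dict(exported)
    dashboard["id"] = None          # instance-specific; serialized as null
    dashboard.pop("version", None)  # bumped on every UI save
    # Sorted keys and fixed indentation keep diffs limited to real changes.
    return json.dumps(dashboard, indent=2, sort_keys=True) + "\n"


exported = {"uid": "svc-overview", "id": 42, "version": 7, "title": "Service overview"}
print(normalize_dashboard(exported))
```

Run it as a pre-commit step or in CI so every dashboard in the repository has the same shape, whoever exported it. The uid and title above are made-up examples.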

Panel Types

Choose the right visualization for the data:

Time series: The default. Use for rates, latencies, utilization over time.
Stat: Single number. Use for current values -- active users, error count in the last hour.
Gauge: Progress toward a limit. Use for percentage-based metrics -- disk usage, SLO budget remaining.
Table: Tabular data. Use for top-N queries or instance-level breakdowns.
Heatmap: Distribution over time. Use for histogram data -- shows how latency distribution shifts.
Logs: Display log lines from Loki or Elasticsearch alongside metrics.

Panel Configuration Tips

# Set meaningful Y-axis units
Unit: requests/sec, seconds, bytes, percent (0-100)

# Set thresholds for visual cues (values are in the panel's unit, here percent 0-100)
Base:   green
1:      yellow  (1% error rate warning)
5:      red     (5% error rate critical)

# Use legend to clarify multi-line charts
Legend mode: Table
Legend values: Last, Min, Max, Avg
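In the dashboard JSON, units and thresholds live under the panel's fieldConfig. The Python sketch below builds that fragment for an error-rate panel whose values run from 0 to 100, so a 1% warning is the value 1. The color_for helper is hypothetical and only illustrates how absolute threshold steps are read.

```python
import json

thresholds = {
    "mode": "absolute",
    "steps": [
        {"color": "green", "value": None},  # base step, serialized as null
        {"color": "yellow", "value": 1},    # 1% error rate warning
        {"color": "red", "value": 5},       # 5% error rate critical
    ],
}

field_config = {"defaults": {"unit": "percent", "thresholds": thresholds}}


def color_for(value, steps):
    """Pick the color of the highest step whose value has been reached."""
    color = steps[0]["color"]
    for step in steps[1:]:
        if value >= step["value"]:
            color = step["color"]
    return color


print(color_for(0.4, thresholds["steps"]))  # green
print(color_for(2.5, thresholds["steps"]))  # yellow
print(json.dumps(field_config, indent=2))
```

The unit id percent expects values from 0 to 100. If a query returns a ratio from 0 to 1, either multiply by 100 in the query or use the percentunit unit and scale the thresholds to match.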

Build 3 Dashboards, Not 50

Most teams create too many dashboards, and most of those dashboards go unread. Focus on three:

Service overview: The golden signals dashboard described above. One per critical service. This is where you start during an incident.
Infrastructure: Node health, Kubernetes cluster state, resource utilization across the fleet. This answers "is the platform healthy?"
SLO tracking: Error budget burn rate, SLI trends, availability over rolling windows. This is for weekly review and capacity planning.

If someone asks for a fourth dashboard, ask what question it answers that the first three do not. Usually the answer is "I want to see this one specific metric" -- which is a panel on an existing dashboard, not a new dashboard.
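The error budget burn rate on the SLO tracking dashboard is simple arithmetic: the observed error ratio divided by the error ratio the SLO allows. A minimal Python sketch, with made-up numbers and an assumed 30-day SLO window:

```python
def burn_rate(error_ratio, slo):
    """Observed error ratio divided by the allowed error ratio (1 - SLO).

    Both arguments are fractions, for example 0.005 and 0.999.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1")
    return error_ratio / (1.0 - slo)


def days_until_exhausted(rate, window_days=30):
    """Days until the budget is gone if the current burn rate holds."""
    return window_days / rate


rate = burn_rate(error_ratio=0.005, slo=0.999)
print(round(rate, 2))                        # 5.0
print(round(days_until_exhausted(rate), 1))  # 6.0
```

A burn rate of 1 spends the budget exactly over the window. Here 0.5% of requests fail against a 0.1% allowance, so a 30-day budget lasts about six days.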

Common Pitfalls

Dashboard sprawl. Fifty dashboards and nobody can find anything. Organize into folders, limit creation, archive unused dashboards.
Using averages instead of percentiles. An average latency of 100ms can hide a p99 of 5 seconds. Always show percentiles for latency.
No annotations. Without deploy markers, correlating changes with metric shifts requires guesswork.
Hardcoded service names. Use template variables. A dashboard locked to one service name has to be copied and edited for every other service.
Editing dashboards in the UI without saving to code. Someone clicks "save," someone else overwrites it. Store dashboard JSON in version control and provision automatically.
Too many panels per dashboard. If a dashboard has 40 panels, it loads slowly and nobody scrolls to the bottom. Aim for 8 to 12 panels per dashboard.
Ignoring time range defaults. Set a sensible default time range (last 1 hour for operational dashboards, last 7 days for SLO tracking).
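Several of these pitfalls can be caught mechanically once dashboards live in version control. The following Python sketch checks a dashboard's JSON for three of them. It assumes the usual top-level keys of an exported dashboard (panels, templating.list, time), counts only top-level panels, and uses the 12-panel guideline from this section as its default; the function name is hypothetical.

```python
def lint_dashboard(dashboard, max_panels=12):
    """Return a list of warnings for a dashboard given as a parsed JSON dict."""
    warnings = []

    panels = dashboard.get("panels", [])
    if len(panels) > max_panels:
        warnings.append(f"{len(panels)} panels (aim for at most {max_panels})")

    if not dashboard.get("templating", {}).get("list"):
        warnings.append("no template variables defined")

    if not dashboard.get("time", {}).get("from"):
        warnings.append("no default time range set")

    return warnings


example = {
    "title": "Service overview",
    "panels": [{"type": "timeseries"}] * 40,
    "templating": {"list": []},
}
for warning in lint_dashboard(example):
    print(warning)
# 40 panels (aim for at most 12)
# no template variables defined
# no default time range set
```

Run it over every JSON file in the repository during CI and fail the build on warnings you care about. That turns the guidelines into something a pull request cannot quietly skip.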

Key Takeaways

Use the RED method (rate, errors, duration) for services and the USE method (utilization, saturation, errors) for infrastructure.
Build a golden signals dashboard for every critical service. It is the first place you look during an incident.
Template variables make dashboards reusable across services and instances.
Annotate deployments so you can correlate releases with metric changes.
Store dashboards as JSON in version control. Provision them automatically.
Build 3 good dashboards instead of 50 mediocre ones. Each dashboard should answer a clear question.