Prometheus and Grafana Storage Monitoring Dashboard: What to Put On It
Design Around Questions
A storage dashboard should answer operational questions quickly. Avoid building one giant dashboard with every metric. Build focused views that help a person decide what to do next.
Start with these questions:
- Is anything close to full?
- Is latency abnormal?
- Which workload is busiest?
- Are snapshots or backups growing unexpectedly?
- Are monitoring targets healthy?
Prometheus evaluates alerting rules and sends firing alerts to Alertmanager, which handles routing and notification. Grafana can visualize Prometheus data and manage alert rules of its own. See the official docs for Prometheus alerting, Grafana alerting, and Grafana with Prometheus.
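As a rough illustration of that wiring, the fragment below shows how a Prometheus configuration might load rule files and point at an Alertmanager instance. The rule file path and the Alertmanager hostname are hypothetical placeholders, and this is a sketch rather than a complete configuration.

```yaml
# prometheus.yml (fragment): a minimal sketch, not a full configuration.
rule_files:
  # Hypothetical path; point this at wherever your rule files live.
  - "rules/storage_*.yml"

alerting:
  alertmanagers:
    - static_configs:
        # Hypothetical hostname; 9093 is Alertmanager's default port.
        - targets: ["alertmanager.example.internal:9093"]
```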
Dashboard 1: Capacity
The capacity dashboard should show:
- Top volumes or datastores by percent used.
- Free capacity trend over 7, 30, and 90 days.
- Daily growth rate.
- Snapshot or backup repository growth.
- Forecast date for reaching 80, 90, and 95 percent.
Good capacity panels are mostly tables and trend lines. Operators need sortable lists more than decorative gauges.
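The growth and forecast panels above could be backed by rules along the following lines. This is a sketch that assumes a gauge named `storage_volume_used_percent` (the same placeholder used in the alert example later in this article); the rule names, the 7-day lookback, and the 30-day horizon are illustrative choices, not recommendations.

```yaml
groups:
  - name: storage_capacity_forecast
    rules:
      # Growth in percentage points per day, fitted over the last 7 days.
      # storage_volume_used_percent is a placeholder metric name.
      - record: storage:volume_used_percent:growth_per_day
        expr: deriv(storage_volume_used_percent[7d]) * 86400

      # Estimated seconds until 90 percent used. The "> 0" filter keeps only
      # volumes that are growing; volumes already above 90 yield negative values.
      - record: storage:volume_seconds_until_90_percent
        expr: |
          (90 - storage_volume_used_percent)
          / (deriv(storage_volume_used_percent[7d]) > 0)

      # Warn when a linear fit of the last 7 days crosses 90 percent
      # within the next 30 days.
      - alert: StorageVolumeForecastAbove90Percent
        expr: predict_linear(storage_volume_used_percent[7d], 30 * 86400) > 90
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Storage volume forecast to exceed 90 percent within 30 days"
```

A linear fit is a simplification. Bursty or seasonal growth will make the forecast jump around, so treat the result as a prompt to investigate rather than a precise date.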
Dashboard 2: Performance
The performance dashboard should show:
- Read and write latency.
- Read and write IOPS.
- Throughput in MB/s.
- Queue depth or saturation signals if available.
- Top talkers by host, volume, datastore, or workload.
Use percentiles such as p95 or p99 where possible. Average latency can hide short but painful spikes.
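If the exporter exposes latency as a Prometheus histogram, a percentile can be derived with `histogram_quantile`. The sketch below assumes hypothetical metrics named `storage_read_latency_seconds_bucket` and `storage_read_ops_total` with a `volume` label. Many storage exporters publish only averages or gauges, in which case this approach does not apply as written.

```yaml
groups:
  - name: storage_performance
    rules:
      # p99 read latency per volume over 5-minute windows.
      # The histogram metric and the "volume" label are assumptions.
      - record: storage:read_latency_seconds:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, volume) (rate(storage_read_latency_seconds_bucket[5m]))
          )

      # Read IOPS per volume, derived from a hypothetical operations counter.
      - record: storage:read_ops:rate5m
        expr: sum by (volume) (rate(storage_read_ops_total[5m]))
```

A top talkers panel could then query something like `topk(10, storage:read_ops:rate5m)` and render the result as a sortable table.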
Dashboard 3: Protection Health
Storage monitoring should include protection state, not only hardware health. The protection dashboard should show:
- Latest successful backup or snapshot age.
- Replication lag.
- Snapshot count and oldest snapshot.
- Failed backup or replication jobs.
- Restore test date or evidence link if you track it.
This is the dashboard that helps catch silent data protection drift.
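Backup age and replication lag also lend themselves to simple alert rules. The sketch below assumes two hypothetical metrics: `storage_backup_last_success_timestamp_seconds` (a Unix timestamp of the last successful backup) and `storage_replication_lag_seconds`. The thresholds are placeholders to adjust against your own recovery objectives.

```yaml
groups:
  - name: storage_protection
    rules:
      # Fires when the newest successful backup is older than 26 hours,
      # which leaves some slack around a daily schedule.
      # The metric name and threshold are assumptions.
      - alert: StorageBackupTooOld
        expr: time() - storage_backup_last_success_timestamp_seconds > 26 * 3600
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Last successful backup is older than 26 hours"

      # Fires when replication has been more than 15 minutes behind
      # for a sustained period. The metric name is an assumption.
      - alert: StorageReplicationLagHigh
        expr: storage_replication_lag_seconds > 900
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag above 15 minutes"
```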
Dashboard 4: Monitoring Health
Every monitoring stack needs a dashboard for itself. The monitoring health dashboard should show:
- Prometheus target up/down state.
- Exporter scrape duration and failures.
- Missing metrics.
- Alert rule evaluation failures.
- Notification delivery issues.
If monitoring is unhealthy, storage dashboards may look quiet for the wrong reason.
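Several of these signals are available from Prometheus and Alertmanager themselves. The sketch below uses the built-in `up` series, the `prometheus_rule_evaluation_failures_total` counter, and Alertmanager's `alertmanager_notifications_failed_total` counter. It assumes Prometheus scrapes both itself and Alertmanager. The job name `storage_exporter` and the metric `storage_volume_used_percent` are placeholders.

```yaml
groups:
  - name: storage_monitoring_health
    rules:
      # A scrape target for the (hypothetically named) storage exporter job is down.
      - alert: StorageExporterDown
        expr: up{job="storage_exporter"} == 0
        for: 5m
        labels:
          severity: critical

      # The placeholder storage metric has disappeared entirely,
      # for example after an exporter upgrade renamed it.
      - alert: StorageMetricsMissing
        expr: absent(storage_volume_used_percent)
        for: 15m
        labels:
          severity: warning

      # One or more rule evaluations failed in the last 15 minutes.
      - alert: PrometheusRuleEvaluationFailures
        expr: increase(prometheus_rule_evaluation_failures_total[15m]) > 0
        labels:
          severity: warning

      # Alertmanager is failing to deliver notifications.
      - alert: AlertmanagerNotificationsFailing
        expr: rate(alertmanager_notifications_failed_total[15m]) > 0
        for: 10m
        labels:
          severity: critical
```

Scrape duration per target is available as `scrape_duration_seconds` and works well as a trend panel next to the `up` state.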
Example Alert Ideas
groups:
  - name: storage_capacity
    rules:
      - alert: StorageVolumeAbove90Percent
        expr: storage_volume_used_percent > 90
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "Storage volume above 90 percent used"
Tune alert names and expressions to your exporter. The structure matters more than the placeholder metric name.