Storage Homelab Observability Playbook
Why Observability Matters in Your Homelab
A storage homelab without monitoring is a black box. You can provision volumes, configure replication, and run workloads, but you can't answer critical questions: Is performance degrading? When will I run out of capacity? Are my snapshots consuming unexpected space?
Observability transforms your homelab from a sandbox into a learning platform where you can measure the impact of configuration changes, identify bottlenecks, and troubleshoot issues using the same tools production teams rely on.
The Lightweight Stack
You don't need enterprise APM suites or SaaS observability platforms. This stack runs on a single 4-core VM with 8GB RAM:
- Prometheus: Time-series metrics database (collects IOPS, latency, capacity data)
- Grafana: Visualization and dashboarding (charts, graphs, alerts)
- Node Exporter: System metrics from Linux hosts (CPU, disk, network)
- NetApp Harvest or Pure Exporter: Storage array-specific metrics
- Alertmanager: Alert routing and notification management
Total deployment time with Ansible: 30 minutes. Ongoing maintenance: 15 minutes per month.
Complete Ansible Deployment Playbook
This playbook deploys the full stack on an Ubuntu 22.04 or Rocky Linux 9 VM. It's idempotent: run it repeatedly to converge the host to the same configuration.
Inventory File
# inventory/homelab.ini
[observability]
monitor01 ansible_host=192.168.1.50 ansible_user=ubuntu
[observability:vars]
prometheus_version=2.48.1
grafana_version=10.2.3
node_exporter_version=1.7.0
Main Playbook
---
- name: Deploy Observability Stack for Storage Homelab
  hosts: observability
  become: yes
  vars:
    prometheus_dir: /opt/prometheus
    grafana_data_dir: /var/lib/grafana
    prometheus_port: 9090
    grafana_port: 3000

  tasks:
    - name: Install prerequisite packages
      ansible.builtin.package:
        name:
          - wget
          - tar
        state: present

    - name: Install Debian-specific Grafana dependencies
      ansible.builtin.package:
        name:
          - adduser
          - libfontconfig1
        state: present
      when: ansible_os_family == "Debian"

    # === Prometheus Installation ===
    - name: Create Prometheus user
      ansible.builtin.user:
        name: prometheus
        shell: /bin/false
        create_home: no
        system: yes

    - name: Create Prometheus directories
      ansible.builtin.file:
        path: "{{ item }}"
        state: directory
        owner: prometheus
        group: prometheus
      loop:
        - "{{ prometheus_dir }}"
        - "{{ prometheus_dir }}/data"
        - /etc/prometheus

    - name: Download Prometheus
      ansible.builtin.unarchive:
        src: "https://github.com/prometheus/prometheus/releases/download/v{{ prometheus_version }}/prometheus-{{ prometheus_version }}.linux-amd64.tar.gz"
        dest: "{{ prometheus_dir }}"
        remote_src: yes
        creates: "{{ prometheus_dir }}/prometheus-{{ prometheus_version }}.linux-amd64"

    - name: Copy Prometheus binaries
      ansible.builtin.copy:
        src: "{{ prometheus_dir }}/prometheus-{{ prometheus_version }}.linux-amd64/{{ item }}"
        dest: /usr/local/bin/
        mode: '0755'
        remote_src: yes
      loop:
        - prometheus
        - promtool

    - name: Deploy Prometheus configuration
      ansible.builtin.copy:
        dest: /etc/prometheus/prometheus.yml
        owner: prometheus
        group: prometheus
        content: |
          global:
            scrape_interval: 15s
            evaluation_interval: 15s
          scrape_configs:
            - job_name: 'prometheus'
              static_configs:
                - targets: ['localhost:9090']
            - job_name: 'node_exporter'
              static_configs:
                - targets: ['localhost:9100']
            - job_name: 'netapp_harvest'
              static_configs:
                - targets: ['localhost:12990']  # Adjust for your exporter

    - name: Create Prometheus systemd service
      ansible.builtin.copy:
        dest: /etc/systemd/system/prometheus.service
        content: |
          [Unit]
          Description=Prometheus Time Series Database
          After=network.target

          [Service]
          User=prometheus
          Group=prometheus
          Type=simple
          ExecStart=/usr/local/bin/prometheus \
            --config.file=/etc/prometheus/prometheus.yml \
            --storage.tsdb.path={{ prometheus_dir }}/data \
            --web.listen-address=0.0.0.0:{{ prometheus_port }}

          [Install]
          WantedBy=multi-user.target

    - name: Start and enable Prometheus
      ansible.builtin.systemd:
        name: prometheus
        state: started
        enabled: yes
        daemon_reload: yes

    # === Node Exporter Installation ===
    - name: Create Node Exporter user
      ansible.builtin.user:
        name: node_exporter
        shell: /bin/false
        create_home: no
        system: yes

    - name: Download Node Exporter
      ansible.builtin.unarchive:
        src: "https://github.com/prometheus/node_exporter/releases/download/v{{ node_exporter_version }}/node_exporter-{{ node_exporter_version }}.linux-amd64.tar.gz"
        dest: /tmp
        remote_src: yes
        creates: "/tmp/node_exporter-{{ node_exporter_version }}.linux-amd64"

    - name: Copy Node Exporter binary
      ansible.builtin.copy:
        src: "/tmp/node_exporter-{{ node_exporter_version }}.linux-amd64/node_exporter"
        dest: /usr/local/bin/
        mode: '0755'
        remote_src: yes

    - name: Create Node Exporter systemd service
      ansible.builtin.copy:
        dest: /etc/systemd/system/node_exporter.service
        content: |
          [Unit]
          Description=Node Exporter
          After=network.target

          [Service]
          # Default collectors don't need root; run unprivileged
          User=node_exporter
          ExecStart=/usr/local/bin/node_exporter

          [Install]
          WantedBy=multi-user.target

    - name: Start and enable Node Exporter
      ansible.builtin.systemd:
        name: node_exporter
        state: started
        enabled: yes
        daemon_reload: yes

    # === Grafana Installation ===
    - name: Add Grafana APT repository key
      ansible.builtin.apt_key:
        url: https://apt.grafana.com/gpg.key
        state: present
      when: ansible_os_family == "Debian"

    - name: Add Grafana APT repository
      ansible.builtin.apt_repository:
        repo: "deb https://apt.grafana.com stable main"
        state: present
      when: ansible_os_family == "Debian"

    - name: Add Grafana YUM repository
      ansible.builtin.yum_repository:
        name: grafana
        description: Grafana OSS
        baseurl: https://rpm.grafana.com
        gpgcheck: yes
        gpgkey: https://rpm.grafana.com/gpg.key
      when: ansible_os_family == "RedHat"

    - name: Install Grafana
      ansible.builtin.package:
        name: grafana
        state: present

    - name: Start and enable Grafana
      ansible.builtin.systemd:
        name: grafana-server
        state: started
        enabled: yes

    - name: Display access information
      ansible.builtin.debug:
        msg:
          - "Prometheus: http://{{ ansible_host }}:{{ prometheus_port }}"
          - "Grafana: http://{{ ansible_host }}:{{ grafana_port }} (admin/admin)"
          - "Node Exporter: http://{{ ansible_host }}:9100/metrics"
Critical Storage Metrics to Track
Focus on metrics that answer operational questions rather than collecting everything. Here's the priority list:
Latency Metrics
- Read latency (ms): storage_volume_read_latency_milliseconds
- Write latency (ms): storage_volume_write_latency_milliseconds
- Percentiles (P50, P95, P99): use the Prometheus histogram_quantile() function
Why it matters: Latency spikes indicate disk contention, controller saturation, or network issues. Track baseline latency during idle periods (typically <1ms for flash arrays) and alert when sustained latency exceeds 2x baseline.
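If your exporter publishes latency as a histogram (the bucket metric name below is an assumption; each exporter names these differently), percentiles come from histogram_quantile() applied to bucket rates:

```promql
# P99 read latency per volume over the last 5 minutes
# (storage_volume_read_latency_seconds_bucket is a hypothetical bucket metric)
histogram_quantile(0.99,
  sum by (le, volume) (rate(storage_volume_read_latency_seconds_bucket[5m])))
```

Keeping the `le` label in the `sum by` clause is required; without it, histogram_quantile() has no buckets to interpolate over.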
Throughput and IOPS
- Read IOPS: rate(storage_volume_read_ops_total[5m])
- Write IOPS: rate(storage_volume_write_ops_total[5m])
- Read throughput (MB/s): rate(storage_volume_read_bytes_total[5m]) / 1024 / 1024 (add the write-bytes counter for total throughput)
Why it matters: Correlate IOPS patterns with application behavior. A database backup job should show high read IOPS at scheduled times. Unexpected IOPS spikes may indicate runaway queries or malware.
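Using the counters above, total IOPS per volume can be graphed with a single expression (metric names as listed; adjust to match your exporter):

```promql
# Combined read + write IOPS per volume, 5-minute rate
sum by (volume) (
  rate(storage_volume_read_ops_total[5m]) + rate(storage_volume_write_ops_total[5m])
)
```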
Capacity Tracking
- Used space: storage_volume_used_bytes
- Snapshot consumed: storage_volume_snapshot_bytes
- Growth rate: deriv(storage_volume_used_bytes[24h])
Why it matters: Capacity planning prevents emergency expansions. Forecast when volumes reach 80% capacity and proactively expand or migrate data.
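For the forecast itself, predict_linear() extrapolates recent growth. This sketch flags volumes projected to cross 80% within 14 days (same assumed metric names as above):

```promql
# True for volumes projected to exceed 80% of capacity within 14 days,
# based on a linear fit over the last 7 days of usage
predict_linear(storage_volume_used_bytes[7d], 14 * 86400)
  > (storage_volume_size_bytes * 0.8)
```

A 7-day window smooths out daily churn; shorten it if your workloads change capacity behavior faster than that.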
Practical Alert Rules
Add these alerting rules to /etc/prometheus/alerts.yml and reference that file from prometheus.yml:
groups:
  - name: storage_alerts
    interval: 1m
    rules:
      - alert: HighReadLatency
        expr: storage_volume_read_latency_milliseconds > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High read latency on {{ $labels.volume }}"
          description: "Read latency is {{ $value }}ms (threshold: 5ms)"

      - alert: CriticalCapacity
        expr: (storage_volume_used_bytes / storage_volume_size_bytes) > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Volume {{ $labels.volume }} is 90% full"
          description: "Used: {{ $value | humanizePercentage }}"

      - alert: SnapshotGrowth
        # delta() (not rate(): snapshot bytes is a gauge) over 1 GiB in 6h
        expr: delta(storage_volume_snapshot_bytes[6h]) > 1073741824
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "Rapid snapshot growth on {{ $labels.volume }}"
          description: "Snapshot space increasing faster than expected"
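To load the rules, the prometheus.yml deployed by the playbook needs a rule_files entry, plus an alerting block if you run Alertmanager (the localhost:9093 target below assumes a local Alertmanager on its default port):

```yaml
# Additions to /etc/prometheus/prometheus.yml
rule_files:
  - /etc/prometheus/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']  # assumed local Alertmanager
```

After editing, validate with promtool and restart Prometheus so the rules take effect.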
Building Effective Dashboards
Create three dashboards in Grafana—keep them focused on specific troubleshooting workflows:
1. Performance Dashboard
- Time-series graphs of latency (read/write) per volume
- IOPS heatmap showing hot volumes
- Throughput stacked area chart
- Queue depth gauge (indicates saturation)
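For the queue depth and saturation panels, node_exporter already exposes usable signals on the Linux hosts (these are standard node_exporter disk metrics; your array-side exporter may expose its own equivalents):

```promql
# In-flight I/Os per device (queue depth) from node_exporter
node_disk_io_now

# Fraction of each second the device was busy; values near 1 indicate saturation
rate(node_disk_io_time_seconds_total[5m])
```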
2. Capacity Dashboard
- Table of volumes sorted by % used (descending)
- Snapshot space consumption trend (30-day view)
- Capacity forecast graph (linear regression predicting days until full)
- Thin provisioning overcommit ratio gauge
3. Health Dashboard
- Controller CPU/memory utilization
- Network throughput (to/from array)
- Alerts panel showing active warnings/criticals
- Uptime and last restart timestamp
Dashboard design tip: Use dashboard variables for volume selection so you can inspect any volume without editing panel queries. Example: a $volume variable populated from Prometheus label values.
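In Grafana, that $volume variable can be defined as a Query variable against the Prometheus datasource using label_values() (assuming the volume label used by the metrics in this guide):

```promql
label_values(storage_volume_used_bytes, volume)
```

Panel queries then reference it as `{volume=~"$volume"}`, and enabling the variable's Multi-value option lets you overlay several volumes on one graph.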
Operational Value Beyond Monitoring
Observability data becomes a force multiplier for your homelab learning:
- Performance experiments: Measure before/after metrics when tuning block size, caching, or network MTU
- Failure simulation: Kill a service and watch metrics diverge—understand failure modes visually
- Content creation: Include real telemetry in blog posts or tutorials to prove optimizations work
- Interview prep: Demo live Grafana dashboards during job interviews to showcase practical monitoring skills
Every production storage engineer should know how to query Prometheus, build Grafana panels, and interpret latency distributions. Your homelab is the perfect environment to master these skills without production pressure.
From Homelab to Production
Once your homelab stack is stable, migrating to production requires these enhancements:
- High availability: Run Prometheus in HA mode with remote write to long-term storage (Thanos, Cortex)
- Authentication: Enable Grafana LDAP/OAuth integration and enforce RBAC on dashboards
- Alerting routing: Configure Alertmanager to route alerts to PagerDuty, Slack, or email based on severity
- Data retention: Extend Prometheus retention from 15 days (homelab default) to 90+ days with compaction
- Security: Enable TLS for all component communication, rotate API tokens regularly
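Retention is a Prometheus launch flag, so extending it means editing the ExecStart line in the systemd unit this playbook creates. A sketch, with an optional size cap added alongside the time-based retention:

```ini
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data \
  --storage.tsdb.retention.time=90d \
  --storage.tsdb.retention.size=50GB
```

Whichever limit is hit first wins, which keeps a long time window from silently filling the disk.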
The architecture remains identical—only scale and security posture change. This makes your homelab experience directly transferable to enterprise environments.