Storage Homelab Observability Playbook
Why Observability Matters in Your Homelab
A storage homelab without monitoring is a black box. You can provision volumes, configure replication, and run workloads, but you can't answer critical questions: Is performance degrading? When will I run out of capacity? Are my snapshots consuming unexpected space?
Observability transforms your homelab from a sandbox into a learning platform where you can measure the impact of configuration changes, identify bottlenecks, and troubleshoot issues using the same tools production teams rely on.
The Lightweight Stack
You don't need enterprise APM suites or SaaS observability platforms. This stack runs on a single 4-core VM with 8GB RAM:
- Prometheus: Time-series metrics database (collects IOPS, latency, capacity data)
- Grafana: Visualization and dashboarding (charts, graphs, alerts)
- Node Exporter: System metrics from Linux hosts (CPU, disk, network)
- NetApp Harvest or Pure Exporter: Storage array-specific metrics
- Alertmanager: Alert routing and notification management
Total deployment time with Ansible: 30 minutes. Ongoing maintenance: 15 minutes per month.
Complete Ansible Deployment Playbook
This playbook deploys the full stack on an Ubuntu 22.04 or Rocky Linux 9 VM. It's idempotent: run it repeatedly to converge the host to the same configuration.
Inventory File
# inventory/homelab.ini
[observability]
monitor01 ansible_host=192.168.1.50 ansible_user=ubuntu
[observability:vars]
prometheus_version=2.48.1
grafana_version=10.2.3
node_exporter_version=1.7.0
Main Playbook
---
- name: Deploy Observability Stack for Storage Homelab
  hosts: observability
  become: yes
  vars:
    prometheus_dir: /opt/prometheus
    grafana_data_dir: /var/lib/grafana
    prometheus_port: 9090
    grafana_port: 3000

  tasks:
    - name: Install prerequisite packages
      ansible.builtin.package:
        name:
          - wget
          - tar
        state: present

    - name: Install Debian-specific Grafana dependencies
      ansible.builtin.package:
        name:
          - adduser
          - libfontconfig1
        state: present
      when: ansible_os_family == "Debian"

    # === Prometheus Installation ===
    - name: Create Prometheus user
      ansible.builtin.user:
        name: prometheus
        shell: /bin/false
        create_home: no
        system: yes

    - name: Create Prometheus directories
      ansible.builtin.file:
        path: "{{ item }}"
        state: directory
        owner: prometheus
        group: prometheus
      loop:
        - "{{ prometheus_dir }}"
        - "{{ prometheus_dir }}/data"
        - /etc/prometheus

    - name: Download Prometheus
      ansible.builtin.unarchive:
        src: "https://github.com/prometheus/prometheus/releases/download/v{{ prometheus_version }}/prometheus-{{ prometheus_version }}.linux-amd64.tar.gz"
        dest: "{{ prometheus_dir }}"
        remote_src: yes
        creates: "{{ prometheus_dir }}/prometheus-{{ prometheus_version }}.linux-amd64"

    - name: Copy Prometheus binaries
      ansible.builtin.copy:
        src: "{{ prometheus_dir }}/prometheus-{{ prometheus_version }}.linux-amd64/{{ item }}"
        dest: /usr/local/bin/
        mode: '0755'
        remote_src: yes
      loop:
        - prometheus
        - promtool

    - name: Deploy Prometheus configuration
      ansible.builtin.copy:
        dest: /etc/prometheus/prometheus.yml
        owner: prometheus
        group: prometheus
        content: |
          global:
            scrape_interval: 15s
            evaluation_interval: 15s
          scrape_configs:
            - job_name: 'prometheus'
              static_configs:
                - targets: ['localhost:9090']
            - job_name: 'node_exporter'
              static_configs:
                - targets: ['localhost:9100']
            - job_name: 'netapp_harvest'
              static_configs:
                - targets: ['localhost:12990']  # Adjust for your exporter

    - name: Create Prometheus systemd service
      ansible.builtin.copy:
        dest: /etc/systemd/system/prometheus.service
        content: |
          [Unit]
          Description=Prometheus Time Series Database
          After=network.target

          [Service]
          User=prometheus
          Group=prometheus
          Type=simple
          ExecStart=/usr/local/bin/prometheus \
            --config.file=/etc/prometheus/prometheus.yml \
            --storage.tsdb.path={{ prometheus_dir }}/data \
            --web.listen-address=0.0.0.0:{{ prometheus_port }}

          [Install]
          WantedBy=multi-user.target

    - name: Start and enable Prometheus
      ansible.builtin.systemd:
        name: prometheus
        state: started
        enabled: yes
        daemon_reload: yes

    # === Node Exporter Installation ===
    - name: Create Node Exporter user
      ansible.builtin.user:
        name: node_exporter
        shell: /bin/false
        create_home: no
        system: yes

    - name: Download Node Exporter
      ansible.builtin.unarchive:
        src: "https://github.com/prometheus/node_exporter/releases/download/v{{ node_exporter_version }}/node_exporter-{{ node_exporter_version }}.linux-amd64.tar.gz"
        dest: /tmp
        remote_src: yes
        creates: "/tmp/node_exporter-{{ node_exporter_version }}.linux-amd64"

    - name: Copy Node Exporter binary
      ansible.builtin.copy:
        src: "/tmp/node_exporter-{{ node_exporter_version }}.linux-amd64/node_exporter"
        dest: /usr/local/bin/
        mode: '0755'
        remote_src: yes

    - name: Create Node Exporter systemd service
      ansible.builtin.copy:
        dest: /etc/systemd/system/node_exporter.service
        content: |
          [Unit]
          Description=Node Exporter
          After=network.target

          [Service]
          # Default collectors don't need root; run unprivileged
          User=node_exporter
          ExecStart=/usr/local/bin/node_exporter

          [Install]
          WantedBy=multi-user.target

    - name: Start and enable Node Exporter
      ansible.builtin.systemd:
        name: node_exporter
        state: started
        enabled: yes
        daemon_reload: yes

    # === Grafana Installation ===
    - name: Add Grafana APT repository key
      ansible.builtin.apt_key:
        url: https://apt.grafana.com/gpg.key
        state: present
      when: ansible_os_family == "Debian"

    - name: Add Grafana APT repository
      ansible.builtin.apt_repository:
        repo: "deb https://apt.grafana.com stable main"
        state: present
      when: ansible_os_family == "Debian"

    - name: Add Grafana YUM repository
      ansible.builtin.yum_repository:
        name: grafana
        description: Grafana OSS
        baseurl: https://rpm.grafana.com
        gpgcheck: yes
        gpgkey: https://rpm.grafana.com/gpg.key
      when: ansible_os_family == "RedHat"

    - name: Install Grafana
      ansible.builtin.package:
        name: grafana
        state: present

    - name: Start and enable Grafana
      ansible.builtin.systemd:
        name: grafana-server
        state: started
        enabled: yes

    - name: Display access information
      ansible.builtin.debug:
        msg:
          - "Prometheus: http://{{ ansible_host }}:{{ prometheus_port }}"
          - "Grafana: http://{{ ansible_host }}:{{ grafana_port }} (admin/admin)"
          - "Node Exporter: http://{{ ansible_host }}:9100/metrics"
Critical Storage Metrics to Track
Focus on metrics that answer operational questions rather than collecting everything. Here's the priority list:
Latency Metrics
- Read latency (ms): storage_volume_read_latency_milliseconds
- Write latency (ms): storage_volume_write_latency_milliseconds
- Percentiles (P50, P95, P99): use the Prometheus histogram_quantile() function
Why it matters: Latency spikes indicate disk contention, controller saturation, or network issues. Track baseline latency during idle periods (typically <1ms for flash arrays) and alert when sustained latency exceeds 2x baseline.
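If your exporter publishes latency as a histogram (the bucket metric name below is an assumption; each exporter names these differently), percentiles come from histogram_quantile() applied to bucket rates:

```promql
# P99 read latency per volume over the last 5 minutes
# (storage_volume_read_latency_seconds_bucket is a hypothetical bucket metric)
histogram_quantile(0.99,
  sum by (le, volume) (rate(storage_volume_read_latency_seconds_bucket[5m])))
```

Keeping the `le` label in the `sum by` clause is required; without it, histogram_quantile() has no buckets to interpolate over.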
Throughput and IOPS
- Read IOPS: rate(storage_volume_read_ops_total[5m])
- Write IOPS: rate(storage_volume_write_ops_total[5m])
- Read throughput (MB/s): rate(storage_volume_read_bytes_total[5m]) / 1024 / 1024 (add the write-bytes counter for total throughput)
Why it matters: Correlate IOPS patterns with application behavior. A database backup job should show high read IOPS at scheduled times. Unexpected IOPS spikes may indicate runaway queries or malware.
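Using the counters above, total IOPS per volume can be graphed with a single expression (metric names as listed; adjust to match your exporter):

```promql
# Combined read + write IOPS per volume, 5-minute rate
sum by (volume) (
  rate(storage_volume_read_ops_total[5m]) + rate(storage_volume_write_ops_total[5m])
)
```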
Capacity Tracking
- Used space: storage_volume_used_bytes
- Snapshot consumed: storage_volume_snapshot_bytes
- Growth rate: deriv(storage_volume_used_bytes[24h])
Why it matters: Capacity planning prevents emergency expansions. Forecast when volumes reach 80% capacity and proactively expand or migrate data.
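For the forecast itself, predict_linear() extrapolates recent growth. This sketch flags volumes projected to cross 80% within 14 days (same assumed metric names as above):

```promql
# True for volumes projected to exceed 80% of capacity within 14 days,
# based on a linear fit over the last 7 days of usage
predict_linear(storage_volume_used_bytes[7d], 14 * 86400)
  > (storage_volume_size_bytes * 0.8)
```

A 7-day window smooths out daily churn; shorten it if your workloads change capacity behavior faster than that.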
Practical Alert Rules
Add these alerting rules to /etc/prometheus/alerts.yml and reference that file from prometheus.yml:
groups:
  - name: storage_alerts
    interval: 1m
    rules:
      - alert: HighReadLatency
        expr: storage_volume_read_latency_milliseconds > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High read latency on {{ $labels.volume }}"
          description: "Read latency is {{ $value }}ms (threshold: 5ms)"

      - alert: CriticalCapacity
        expr: (storage_volume_used_bytes / storage_volume_size_bytes) > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Volume {{ $labels.volume }} is 90% full"
          description: "Used: {{ $value | humanizePercentage }}"

      - alert: SnapshotGrowth
        # delta() (not rate(): snapshot bytes is a gauge) over 1 GiB in 6h
        expr: delta(storage_volume_snapshot_bytes[6h]) > 1073741824
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "Rapid snapshot growth on {{ $labels.volume }}"
          description: "Snapshot space increasing faster than expected"
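To load the rules, the prometheus.yml deployed by the playbook needs a rule_files entry, plus an alerting block if you run Alertmanager (the localhost:9093 target below assumes a local Alertmanager on its default port):

```yaml
# Additions to /etc/prometheus/prometheus.yml
rule_files:
  - /etc/prometheus/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']  # assumed local Alertmanager
```

After editing, validate with promtool and restart Prometheus so the rules take effect.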
Building Effective Dashboards
Create three dashboards in Grafana—keep them focused on specific troubleshooting workflows:
1. Performance Dashboard
- Time-series graphs of latency (read/write) per volume
- IOPS heatmap showing hot volumes
- Throughput stacked area chart
- Queue depth gauge (indicates saturation)
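For the queue depth and saturation panels, node_exporter already exposes usable signals on the Linux hosts (these are standard node_exporter disk metrics; your array-side exporter may expose its own equivalents):

```promql
# In-flight I/Os per device (queue depth) from node_exporter
node_disk_io_now

# Fraction of each second the device was busy; values near 1 indicate saturation
rate(node_disk_io_time_seconds_total[5m])
```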
2. Capacity Dashboard
- Table of volumes sorted by % used (descending)
- Snapshot space consumption trend (30-day view)
- Capacity forecast graph (linear regression predicting days until full)
- Thin provisioning overcommit ratio gauge
3. Health Dashboard
- Controller CPU/memory utilization
- Network throughput (to/from array)
- Alerts panel showing active warnings/criticals
- Uptime and last restart timestamp
Dashboard design tip: Use dashboard variables for volume selection so you can inspect any volume without editing panel queries. Example: a $volume variable populated from Prometheus label values.
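In Grafana, that $volume variable can be defined as a Query variable against the Prometheus datasource using label_values() (assuming the volume label used by the metrics in this guide):

```promql
label_values(storage_volume_used_bytes, volume)
```

Panel queries then reference it as `{volume=~"$volume"}`, and enabling the variable's Multi-value option lets you overlay several volumes on one graph.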
Operational Value Beyond Monitoring
Observability data becomes a force multiplier for your homelab learning:
- Performance experiments: Measure before/after metrics when tuning block size, caching, or network MTU
- Failure simulation: Kill a service and watch metrics diverge—understand failure modes visually
- Content creation: Include real telemetry in blog posts or tutorials to prove optimizations work
- Interview prep: Demo live Grafana dashboards during job interviews to showcase practical monitoring skills
Every production storage engineer should know how to query Prometheus, build Grafana panels, and interpret latency distributions. Your homelab is the perfect environment to master these skills without production pressure.
From Homelab to Production
Once your homelab stack is stable, migrating to production requires these enhancements:
- High availability: Run Prometheus in HA mode with remote write to long-term storage (Thanos, Cortex)
- Authentication: Enable Grafana LDAP/OAuth integration and enforce RBAC on dashboards
- Alerting routing: Configure Alertmanager to route alerts to PagerDuty, Slack, or email based on severity
- Data retention: Extend Prometheus retention from 15 days (homelab default) to 90+ days with compaction
- Security: Enable TLS for all component communication, rotate API tokens regularly
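Retention is a Prometheus launch flag, so extending it means editing the ExecStart line in the systemd unit this playbook creates. A sketch, with an optional size cap added alongside the time-based retention:

```ini
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data \
  --storage.tsdb.retention.time=90d \
  --storage.tsdb.retention.size=50GB
```

Whichever limit is hit first wins, which keeps a long time window from silently filling the disk.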
The architecture remains identical—only scale and security posture change. This makes your homelab experience directly transferable to enterprise environments.