Storage Homelab Observability Playbook

Why Observability Matters in Your Homelab

A storage homelab without monitoring is a black box. You can provision volumes, configure replication, and run workloads, but you can't answer critical questions: Is performance degrading? When will I run out of capacity? Are my snapshots consuming unexpected space?

Observability transforms your homelab from a sandbox into a learning platform where you can measure the impact of configuration changes, identify bottlenecks, and troubleshoot issues using the same tools production teams rely on.

The Lightweight Stack

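You don't need enterprise APM suites or SaaS observability platforms. This stack runs on a single 4-core VM with 8GB RAM and, as deployed by the playbook below, consists of Prometheus for metric collection and storage, Node Exporter for host metrics, Grafana for dashboards, and a storage exporter such as NetApp Harvest as a scrape target.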

Total deployment time with Ansible: about 30 minutes. Ongoing maintenance: roughly 15 minutes per month.

Complete Ansible Deployment Playbook

This playbook deploys the full stack on an Ubuntu 22.04 or Rocky Linux 9 VM. It is idempotent, so you can rerun it at any time to enforce a consistent configuration.

Inventory File

# inventory/homelab.ini
[observability]
monitor01 ansible_host=192.168.1.50 ansible_user=ubuntu

[observability:vars]
prometheus_version=2.48.1
# grafana_version is not referenced by the playbook, which installs the current packaged release
grafana_version=10.2.3
node_exporter_version=1.7.0

Main Playbook

---
- name: Deploy Observability Stack for Storage Homelab
  hosts: observability
  become: yes
  
  vars:
    prometheus_dir: /opt/prometheus
    grafana_data_dir: /var/lib/grafana
    prometheus_port: 9090
    grafana_port: 3000
    
  tasks:
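    - name: Install prerequisite packages
      ansible.builtin.package:
        name:
          - wget
          - tar
        state: present

    # adduser and libfontconfig1 are Debian/Ubuntu package names and do not exist on Rocky Linux
    - name: Install Debian-specific prerequisites
      ansible.builtin.package:
        name:
          - adduser
          - libfontconfig1
        state: present
      when: ansible_os_family == "Debian"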
    
    # === Prometheus Installation ===
    - name: Create Prometheus user
      ansible.builtin.user:
        name: prometheus
        shell: /bin/false
        create_home: no
        system: yes
    
    - name: Create Prometheus directories
      ansible.builtin.file:
        path: "{{ item }}"
        state: directory
        owner: prometheus
        group: prometheus
      loop:
        - "{{ prometheus_dir }}"
        - "{{ prometheus_dir }}/data"
        - /etc/prometheus
    
    - name: Download Prometheus
      ansible.builtin.unarchive:
        src: "https://github.com/prometheus/prometheus/releases/download/v{{ prometheus_version }}/prometheus-{{ prometheus_version }}.linux-amd64.tar.gz"
        dest: "{{ prometheus_dir }}"
        remote_src: yes
        creates: "{{ prometheus_dir }}/prometheus-{{ prometheus_version }}.linux-amd64"
    
    - name: Copy Prometheus binaries
      ansible.builtin.copy:
        src: "{{ prometheus_dir }}/prometheus-{{ prometheus_version }}.linux-amd64/{{ item }}"
        dest: /usr/local/bin/
        mode: '0755'
        remote_src: yes
      loop:
        - prometheus
        - promtool
    
    - name: Deploy Prometheus configuration
      ansible.builtin.copy:
        dest: /etc/prometheus/prometheus.yml
        content: |
          global:
            scrape_interval: 15s
            evaluation_interval: 15s
          
          scrape_configs:
            - job_name: 'prometheus'
              static_configs:
                - targets: ['localhost:9090']
            
            - job_name: 'node_exporter'
              static_configs:
                - targets: ['localhost:9100']
            
            - job_name: 'netapp_harvest'
              static_configs:
                - targets: ['localhost:12990']  # Adjust for your exporter
        owner: prometheus
        group: prometheus
    
    - name: Create Prometheus systemd service
      ansible.builtin.copy:
        dest: /etc/systemd/system/prometheus.service
        content: |
          [Unit]
          Description=Prometheus Time Series Database
          After=network.target
          
          [Service]
          User=prometheus
          Group=prometheus
          Type=simple
          ExecStart=/usr/local/bin/prometheus \
            --config.file=/etc/prometheus/prometheus.yml \
            --storage.tsdb.path={{ prometheus_dir }}/data \
            --web.listen-address=0.0.0.0:{{ prometheus_port }}
          
          [Install]
          WantedBy=multi-user.target
    
    - name: Start and enable Prometheus
      ansible.builtin.systemd:
        name: prometheus
        state: started
        enabled: yes
        daemon_reload: yes
    
    # === Node Exporter Installation ===
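    - name: Download Node Exporter
      ansible.builtin.unarchive:
        src: "https://github.com/prometheus/node_exporter/releases/download/v{{ node_exporter_version }}/node_exporter-{{ node_exporter_version }}.linux-amd64.tar.gz"
        dest: /tmp
        remote_src: yes
        # Skip the download on later runs once the archive has been extracted
        creates: "/tmp/node_exporter-{{ node_exporter_version }}.linux-amd64"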
    
    - name: Copy Node Exporter binary
      ansible.builtin.copy:
        src: "/tmp/node_exporter-{{ node_exporter_version }}.linux-amd64/node_exporter"
        dest: /usr/local/bin/
        mode: '0755'
        remote_src: yes
    
    - name: Create Node Exporter systemd service
      ansible.builtin.copy:
        dest: /etc/systemd/system/node_exporter.service
        content: |
          [Unit]
          Description=Node Exporter
          After=network.target
          
          [Service]
          User=root
          ExecStart=/usr/local/bin/node_exporter
          
          [Install]
          WantedBy=multi-user.target
    
    - name: Start and enable Node Exporter
      ansible.builtin.systemd:
        name: node_exporter
        state: started
        enabled: yes
        daemon_reload: yes
    
    # === Grafana Installation ===
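    - name: Add Grafana APT repository key
      ansible.builtin.apt_key:
        url: https://apt.grafana.com/gpg.key
        state: present
      when: ansible_os_family == "Debian"
    
    - name: Add Grafana APT repository
      ansible.builtin.apt_repository:
        repo: "deb https://apt.grafana.com stable main"
        state: present
      when: ansible_os_family == "Debian"
    
    # Rocky Linux needs the RPM repository; without it the install task finds no grafana package
    - name: Add Grafana YUM repository
      ansible.builtin.yum_repository:
        name: grafana
        description: Grafana OSS
        baseurl: https://rpm.grafana.com
        gpgcheck: yes
        gpgkey: https://rpm.grafana.com/gpg.key
      when: ansible_os_family == "RedHat"
    
    - name: Install Grafana
      ansible.builtin.package:
        name: grafana
        state: present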
    
    - name: Start and enable Grafana
      ansible.builtin.systemd:
        name: grafana-server
        state: started
        enabled: yes
    
    - name: Display access information
      ansible.builtin.debug:
        msg:
          - "Prometheus: http://{{ ansible_host }}:{{ prometheus_port }}"
          - "Grafana: http://{{ ansible_host }}:{{ grafana_port }} (admin/admin)"
          - "Node Exporter: http://{{ ansible_host }}:9100/metrics"
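
The playbook installs Grafana but leaves it without a data source, so the first login starts with manual setup. A minimal sketch of two extra tasks that could follow the "Install Grafana" task, assuming the package created /etc/grafana/provisioning/datasources and a grafana group (as the official deb and rpm packages normally do). The data source name and the grafana_datasource register variable are illustrative choices, not part of the original playbook:

```yaml
# Hypothetical additions to the tasks list, placed after "Install Grafana".
- name: Provision Prometheus as the default Grafana data source
  ansible.builtin.copy:
    dest: /etc/grafana/provisioning/datasources/prometheus.yml
    content: |
      apiVersion: 1
      datasources:
        - name: Prometheus
          type: prometheus
          access: proxy
          url: "http://localhost:{{ prometheus_port }}"
          isDefault: true
    owner: root
    group: grafana
    mode: '0640'
  register: grafana_datasource

# Grafana reads provisioning files at startup, so restart only when the file changed.
- name: Restart Grafana to load the provisioned data source
  ansible.builtin.systemd:
    name: grafana-server
    state: restarted
  when: grafana_datasource is changed
```

Restarting only on change keeps repeated runs quiet, which matches the idempotent behaviour the rest of the playbook aims for.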

Critical Storage Metrics to Track

Focus on metrics that answer operational questions rather than collecting everything. Here's the priority list:

Latency Metrics

Why it matters: Latency spikes indicate disk contention, controller saturation, or network issues. Track baseline latency during idle periods (typically <1ms for flash arrays) and alert when sustained latency exceeds 2x baseline.
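
The "2x baseline" rule can be expressed as a Prometheus rule group. This is a sketch only: it reuses the storage_volume_read_latency_milliseconds metric and volume label from the alert rules later in this guide, and it stands in a 7-day median for the idle-period baseline, which is an assumption you may want to replace with a measured value. The group, recording rule, and alert names are hypothetical:

```yaml
groups:
  - name: storage_latency_baseline
    rules:
      # Baseline: per-volume median read latency over the past 7 days (window is an assumption).
      - record: storage_volume:read_latency_milliseconds:baseline_7d
        expr: quantile_over_time(0.5, storage_volume_read_latency_milliseconds[7d])

      # Fires when current latency stays above twice the recorded baseline for 10 minutes.
      - alert: ReadLatencyAboveBaseline
        expr: |
          storage_volume_read_latency_milliseconds
            > 2 * storage_volume:read_latency_milliseconds:baseline_7d
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Read latency on {{ $labels.volume }} is more than 2x its 7-day median"
```

Recording the baseline as its own series keeps the alert expression cheap and lets you graph the baseline next to live latency.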

Throughput and IOPS

Why it matters: Correlate IOPS patterns with application behavior. A database backup job should show high read IOPS at scheduled times. Unexpected IOPS spikes may indicate runaway queries or malware.
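
One way to separate scheduled load from unexpected spikes is to compare current IOPS with the same time on the previous day, so a nightly backup does not trip the alert. The sketch below assumes a counter named storage_volume_read_ops_total, which is a hypothetical name; substitute whatever your exporter exposes. The 3x factor and the 100 IOPS floor are likewise placeholders:

```yaml
groups:
  - name: storage_iops
    rules:
      # storage_volume_read_ops_total is a hypothetical counter name.
      - record: storage_volume:read_iops:rate5m
        expr: rate(storage_volume_read_ops_total[5m])

      # Compare against the value 24 hours earlier; the floor avoids noise on near-idle volumes.
      - alert: UnexpectedReadIOPS
        expr: |
          storage_volume:read_iops:rate5m
            > 3 * (storage_volume:read_iops:rate5m offset 1d)
          and storage_volume:read_iops:rate5m > 100
        for: 15m
        labels:
          severity: info
        annotations:
          summary: "Read IOPS on {{ $labels.volume }} is well above the same time yesterday"
```

The alert stays silent for the first day after deployment because the offset lookup has no history to compare against.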

Capacity Tracking

Why it matters: Capacity planning prevents emergency expansions. Forecast when each volume will reach 80% capacity, then expand it or migrate data before it gets there.
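
Prometheus can produce this forecast with predict_linear, which fits a linear trend to recent samples and extrapolates it. A minimal sketch, assuming the storage_volume_used_bytes and storage_volume_size_bytes metrics used in the alert rules later in this guide; the 7-day lookback, 14-day horizon, and rule names are assumptions:

```yaml
groups:
  - name: storage_capacity_forecast
    rules:
      # Extrapolate the last 7 days of usage 14 days ahead (14 * 24 * 3600 seconds).
      - alert: VolumeProjectedAbove80Percent
        expr: |
          predict_linear(storage_volume_used_bytes[7d], 14 * 24 * 3600)
            / storage_volume_size_bytes > 0.8
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.volume }} is projected to pass 80% within 14 days"
```

A volume that is already above 80% and not shrinking also matches this expression, so it complements rather than replaces a threshold alert on current usage.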

Practical Alert Rules

Save these alerting rules to /etc/prometheus/alerts.yml and list that file under rule_files in prometheus.yml:

groups:
  - name: storage_alerts
    interval: 1m
    rules:
      - alert: HighReadLatency
        expr: storage_volume_read_latency_milliseconds > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High read latency on {{ $labels.volume }}"
          description: "Read latency is {{ $value }}ms (threshold: 5ms)"
      
      - alert: CriticalCapacity
        expr: (storage_volume_used_bytes / storage_volume_size_bytes) > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Volume {{ $labels.volume }} is 90% full"
          description: "Used: {{ $value | humanizePercentage }}"
      
      - alert: SnapshotGrowth
        expr: delta(storage_volume_snapshot_bytes[6h]) > 1073741824  # more than 1 GiB of growth in 6h (delta, since this is a gauge)
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "Rapid snapshot growth on {{ $labels.volume }}"
          description: "Snapshot space increasing faster than expected"
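
Prometheus only evaluates rule files that its main configuration names. As a sketch, the prometheus.yml deployed by the playbook could gain a rule_files entry like this (the global block is repeated only to show where the new key sits):

```yaml
# /etc/prometheus/prometheus.yml (excerpt)
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Load the alerting rules saved above
rule_files:
  - /etc/prometheus/alerts.yml
```

Running promtool check rules /etc/prometheus/alerts.yml before restarting Prometheus catches syntax errors early. Firing alerts then appear on the Alerts page of the Prometheus UI; delivering notifications would additionally require Alertmanager, which this playbook does not install.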

Building Effective Dashboards

Create three dashboards in Grafana—keep them focused on specific troubleshooting workflows:

1. Performance Dashboard

2. Capacity Dashboard

3. Health Dashboard

Dashboard design tip: Use a dashboard variable for volume selection so you can inspect any volume without editing panel queries. For example, define a $volume variable with the query label_values(storage_volume_used_bytes, volume) and filter each panel query with {volume=~"$volume"}.

Operational Value Beyond Monitoring

Observability data becomes a force multiplier for your homelab learning:

Every production storage engineer should know how to query Prometheus, build Grafana panels, and interpret latency distributions. Your homelab is the perfect environment to master these skills without production pressure.

From Homelab to Production

Once your homelab stack is stable, migrating to production requires these enhancements:

The architecture remains identical—only scale and security posture change. This makes your homelab experience directly transferable to enterprise environments.

