NetApp ONTAP Code Update: A Complete NDU Maintenance Runbook

NetApp guide — Storage KnowHow

Overview

This runbook walks through a planned NetApp ONTAP nondisruptive upgrade (NDU) from start to finish. It follows the same structure as an enterprise Method of Procedure (MOP): prerequisites, preparatory work, pre-change health checks, communications, the upgrade itself, monitoring, post-validation, and a backout plan.

ONTAP’s automated upgrade path handles node failover, image activation, and giveback in sequence — but the procedure still requires deliberate pre- and post-checks to confirm cluster readiness and capture evidence. NetApp documents the core automated upgrade command as cluster image update.

The example target version throughout this runbook is 9.17.1P3.


Prerequisites

Before opening a maintenance window, confirm the following are in place:

| Area | Requirement |
| --- | --- |
| Target ONTAP image | Downloaded, staged, and hash-verified (at least 24 hours prior) |
| Maintenance window | Duration agreed and recorded in your change record (e.g. 4 hours) |
| Health alerts | No active system health alerts |
| Storage failover | Healthy, with takeover possible for all HA pairs |
| Aggregates | All online; no broken disks |
| Data LIFs | All on home ports; any exceptions documented |
| Monitoring | AIQUM maintenance mode plan confirmed for the cluster |
| Change record | Change ticket created; command outputs will be attached as evidence |

Preparatory Work — Image Staging

Complete this at least 24 hours before the maintenance window so any image validation issues can be resolved without time pressure.

1. Download the Image

Download the target ONTAP package from mysupport.netapp.com. Select the correct version family (e.g., NetApp ONTAP 9.17.x) and download both the image (image.tgz) and its MD5 hash file.

2. Verify the Hash

On your staging server, verify the file integrity before uploading to the cluster:

# PowerShell — run on the staging server
Get-FileHash -Algorithm MD5 "C:\staging\NetApp\image.tgz"

Compare the output against the MD5 hash file from the NetApp support portal. Do not proceed if there is a mismatch.
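
On a staging server without PowerShell, the same comparison could be sketched in Python. This is a minimal illustration rather than part of the documented procedure: the function names are hypothetical, and the expected value is whatever the downloaded MD5 hash file contains.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading in chunks to limit memory use."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_matches(path, expected_hex):
    """Compare case-insensitively, since tools differ on upper vs lower case hex."""
    return md5_of_file(path).lower() == expected_hex.strip().lower()
```

A caller would pass the path of the staged image.tgz and the hash string copied from the hash file, and stop the procedure if hash_matches returns False.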

3. Upload the Image to the Cluster

From an ONTAP CLI session, pull the image from your internal staging server:

cluster image package get -url http://<staging-server>/NetApp/image.tgz

4. Validate the Image

cluster image validate -version 9.17.1P3

Review all output. Fix any errors before the maintenance window. Warnings should be reviewed and documented in the change record. Do not ignore them without understanding the impact.


Pre-Change Health Checks

Run the following commands immediately before starting the maintenance window. Attach all output to the change record as your before-state evidence.

1. System Health Alerts

system health alert show

Expected result: no active alerts. Investigate and resolve any alerts before proceeding.

2. Storage Failover Status

storage failover show

Expected result: all HA pairs show state Connected and Takeover possible: true.

3. Aggregate Status

storage aggregate show -state !online

Expected result: no output (all aggregates are online).

4. Broken Disks

storage disk show -broken

Expected result: no output (no broken disks).

5. LIF Home Port Status

net int show -is-home false

Expected result: no output (all LIFs are on home ports). If LIFs are off home ports, either revert them or document the exceptions before starting.
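
The filtered checks in steps 3 to 5 pass when the query matches nothing. If the output is captured by a script, a pass test might look like the sketch below. It assumes the CLI reports an empty query with the text "There are no entries matching your query."; confirm that wording on your release before relying on it. The function name is hypothetical.

```python
# Assumed wording of the CLI's empty-result message; verify on your ONTAP release.
NO_ENTRIES_MARKER = "There are no entries matching your query."


def check_passes(output):
    """A filtered 'show' check passes when nothing matched the filter."""
    text = output.strip()
    return text == "" or NO_ENTRIES_MARKER in text
```

Anything else, such as a listed aggregate, disk, or LIF, should be treated as a finding to resolve or document before the window starts.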

6. Current Cluster Version

cluster image show

Record the current running version for the change record.

7. Cluster and Node Health

cluster show
system node show -fields health

All nodes must be healthy before starting the upgrade.

8. SnapMirror Status (If Applicable)

snapmirror show -fields status,healthy

Confirm there are no unhealthy or unexpected SnapMirror relationships that could be interrupted by a node failover during the upgrade.
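
To keep the before-state evidence consistent between changes, the checks above could be collected into a single report. The sketch below is one possible shape, not a NetApp tool: run_command is a hypothetical hook (any function that takes a command string and returns its output, for example a wrapper around your SSH client), and the report layout is an arbitrary choice.

```python
from datetime import datetime, timezone

# The pre-change commands from this section, in the order they are run.
PRE_CHECK_COMMANDS = [
    "system health alert show",
    "storage failover show",
    "storage aggregate show -state !online",
    "storage disk show -broken",
    "net int show -is-home false",
    "cluster image show",
    "cluster show",
    "system node show -fields health",
    "snapmirror show -fields status,healthy",
]


def capture_evidence(run_command, commands, label):
    """Run each command through the supplied hook and return one text report."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    sections = [f"### {label} evidence captured {stamp}"]
    for command in commands:
        output = run_command(command).rstrip()
        sections.append(f"\n::> {command}\n{output}")
    return "\n".join(sections) + "\n"
```

The returned text can be saved to a file and attached to the change record; running the same function after the upgrade with a different label gives a comparable after-state report.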


Communications — Starting Maintenance

Before executing the change:

  1. Email notification — Notify storage operations stakeholders that the maintenance is starting.
  2. Chat notification — Post to your team channel: <CHANGE_ID> - <cluster-hostname> NetApp ONTAP Code Update - Starting
  3. Change record — Transition the change ticket to Implementing.

Change Execution

1. Pause Monitoring

If using AIQUM (Active IQ Unified Manager), log in and enable maintenance mode for the cluster for the duration of the maintenance window. This suppresses expected alerts during the failover/giveback cycles and prevents false escalations.

2. SSH Into the Cluster

Connect to the cluster management LIF. The automated upgrade manages each node in sequence; you do not need to SSH to individual nodes.

3. Generate “Start Maintenance” AutoSupport

This creates a timestamped record in NetApp’s systems and suppresses automatic support case creation for the stated maintenance duration:

system node autosupport invoke -node * -type all -message "MAINT=4h Starting_Code_Update"

Adjust the MAINT=Xh duration to match your approved maintenance window.
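
If the command is generated from the change record rather than typed by hand, the duration substitution could be sketched as follows. The function name and the default note text are illustrative only; the command string itself is the one shown above.

```python
def maintenance_autosupport_command(window_hours, note="Starting_Code_Update"):
    """Build the AutoSupport invoke command for a window of whole hours."""
    if not isinstance(window_hours, int) or window_hours < 1:
        raise ValueError("window_hours must be a positive whole number")
    # Keep the note as a single token, matching the underscore style used above.
    safe_note = note.strip().replace(" ", "_")
    return (
        "system node autosupport invoke -node * -type all "
        f'-message "MAINT={window_hours}h {safe_note}"'
    )
```

For a four-hour window this reproduces the command in step 3 exactly.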

4. Initiate the Nondisruptive Upgrade

The following command triggers the full automated upgrade sequence: validation, node failover, image activation, giveback, and repeat for each remaining node.

cluster image update -version 9.17.1P3 -pause-after none -ignore-validation-warning false -skip-confirmation false -stabilize-minutes 4 -nodes *

Key flags:

- -pause-after none: upgrade all nodes without a manual pause between them
- -ignore-validation-warning false: do not silently bypass warnings
- -stabilize-minutes 4: wait 4 minutes after each node giveback before moving to the next node

Note: If your session disconnects during the upgrade, open a new SSH session and continue monitoring. The upgrade continues regardless of your session state.


Monitoring Progress

Open a separate SSH session to monitor progress without interrupting the upgrade session.

Live Progress

cluster image show-update-progress

Run this periodically. Each invocation prints a snapshot of the current upgrade phase and per-node status.
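
Repeated polling could be scripted along these lines. This is a sketch under stated assumptions: run_command and is_finished are hypothetical hooks supplied by the caller, and the sketch deliberately does not parse ONTAP output itself, because the exact text varies by release.

```python
import time


def poll_progress(run_command, is_finished, interval_seconds=60,
                  max_polls=240, sleep=time.sleep):
    """Re-run the progress command until is_finished(output) is true
    or the poll budget is used up. Returns every output seen, newest last."""
    seen = []
    for attempt in range(max_polls):
        output = run_command("cluster image show-update-progress")
        seen.append(output)
        if is_finished(output):
            break
        # Do not sleep after the final attempt.
        if attempt < max_polls - 1:
            sleep(interval_seconds)
    return seen
```

The list of outputs doubles as a progress log for the change record. The sleep parameter exists so the loop can be exercised in a test without waiting.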

Upgrade History

cluster image show-update-history

Shows completed phases and their timestamps — useful for evidence capture.

HA Status During Upgrade

storage failover show

During the upgrade, one node will be in takeover state at a time. Verify each node returns to Takeover possible: true before the upgrade moves to the next HA pair.


Backout Plan

Warning: Only execute the following steps if the automated upgrade fails and cannot be resumed.

Engage NetApp Support before starting any backout actions. Revert and downgrade procedures are Support-led and must be scoped to your specific ONTAP version, platform, and failure state.

Step 1 — Set the Boot Image

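Identify which image slot (image1 or image2) holds the previous release, then mark that slot as the default so ONTAP boots it on the next restart. The -image parameter takes a slot name, not a version string. If the slot differs between nodes, run the modify command per node instead of using a wildcard:

system node image show
system node image modify -node * -image <image1|image2> -isdefault true
system node image show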

Step 2 — Verify Revert Preconditions

Switch to advanced privilege mode and run a check-only pass before committing:

set -privilege advanced

system node revert-to -node <node1> -check-only true -version 9.16.1P2

Resolve any issues reported before proceeding.

Step 3 — Execute the Revert

system node revert-to -node <node1> -version 9.16.1P2

Step 4 — Filesystem Revert (If Required)

Only needed if directed by NetApp Support for a specific failure state:

system node run -node <node1>
revert_to 9.16.1P2
boot_ontap

Step 5 — Repeat for the HA Partner Node

Perform Steps 2–4 on the partner node of each affected HA pair.

Step 6 — Restore HA Configuration

cluster ha modify -configured true
storage failover modify -node <node1> -enabled true

The cluster ha modify command applies only to two-node clusters; omit it on clusters with four or more nodes.

Post-Validation

Run the following commands after the upgrade completes. Attach all output to the change record as your after-state evidence.

1. Confirm Target Version

version -v

The output must show 9.17.1P3.

2. Confirm Upgrade Completion

cluster image show
cluster image show-update-history

All nodes must show the target version. The update history must show completed with no failures.

3. Storage Failover Status

storage failover show

All HA pairs must be back to Connected and Takeover possible: true.

4. LIF Placement

net int show -is-home false

All LIFs should be back on their home ports. If any are not, revert them:

network interface revert -vserver <svm-name> -lif <lif-name>
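
When several LIFs are off their home ports, the per-LIF commands could be generated from a list instead of typed individually. A minimal sketch; the function name and the input shape (pairs of SVM name and LIF name taken from the show output) are assumptions made for illustration.

```python
def lif_revert_commands(lifs):
    """Build one revert command per (svm_name, lif_name) pair."""
    return [
        f"network interface revert -vserver {svm} -lif {lif}"
        for svm, lif in lifs
    ]
```

Review the generated commands before running them, and re-run the is-home check afterwards to confirm nothing remains off its home port.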

5. System Health Alerts

system health alert show

Expected result: no new active alerts.
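
Deciding whether an alert is new means comparing against the before-state capture. One rough way to do that is a line-level comparison of the two outputs, sketched below. It treats each non-blank line as an opaque string and does not interpret alert fields; the function name is hypothetical.

```python
def new_lines(before_output, after_output):
    """Return non-blank lines present after the change but not before,
    in their original order and without duplicates."""
    before = {line.strip() for line in before_output.splitlines() if line.strip()}
    result = []
    for line in after_output.splitlines():
        stripped = line.strip()
        if stripped and stripped not in before and stripped not in result:
            result.append(stripped)
    return result
```

An empty result suggests no new alert lines appeared. Lines that differ only in a timestamp will still show up as new, so the result is a prompt for review rather than a verdict.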

6. Cluster and Node Health

cluster show
system node show -fields health

All nodes must be healthy.

7. Generate “End Maintenance” AutoSupport

system node autosupport invoke -node * -type all -message "MAINT=END"

Resume Monitoring

Disable AIQUM maintenance mode for the cluster once all post-validation checks pass. Confirm monitoring dashboards are showing green before calling the change complete.


Communications — Completing Maintenance

  1. Email notification — Notify storage operations stakeholders that the maintenance is complete.
  2. Chat notification — Post to your team channel: <CHANGE_ID> - <cluster-hostname> NetApp ONTAP Code Update - Completed
  3. Change record — Attach all pre- and post-validation command output, then transition the ticket to Implemented.

Key Takeaways

  1. Stage and validate the image at least 24 hours before the window, so validation errors surface while there is still time to fix them rather than during the change.
  2. Capture storage failover show, system health alert show, and cluster image show output before and after as your change evidence.
  3. The cluster image update command handles failover and giveback automatically — avoid manual node intervention unless directed by NetApp Support.
  4. Keep AIQUM maintenance mode active for the full upgrade window to suppress expected failover alerts.
  5. Engage NetApp Support before starting any backout. Reverting ONTAP is not a self-service operation.
