NetApp ONTAP Code Update: A Complete NDU Maintenance Runbook
Overview
This runbook walks through a planned NetApp ONTAP nondisruptive upgrade (NDU) from start to finish. It follows the same structure as an enterprise Method of Procedure (MOP): prerequisites, preparatory work, pre-change health checks, communications, the upgrade itself, monitoring, post-validation, and a backout plan.
ONTAP’s automated upgrade path handles node failover, image activation, and giveback in sequence — but the procedure still requires deliberate pre- and post-checks to confirm cluster readiness and capture evidence. NetApp documents the core automated upgrade command as cluster image update.
The example target version throughout this runbook is 9.17.1P3.
Prerequisites
Before opening a maintenance window, confirm the following are in place:
| Area | Requirement |
|---|---|
| Target ONTAP image | Downloaded, staged, and hash-verified (at least 24 hours prior) |
| Maintenance window | Duration agreed and recorded in your change record (e.g. 4 hours) |
| Health alerts | No active system health alerts |
| Storage failover | Healthy and possible for all HA pairs |
| Aggregates | All online; no broken disks |
| Data LIFs | All on home ports — exceptions documented |
| Monitoring | AIQUM maintenance mode plan confirmed for the cluster |
| Change record | Change ticket created; command outputs will be attached as evidence |
Preparatory Work — Image Staging
Complete this at least 24 hours before the maintenance window so any image validation issues can be resolved without time pressure.
1. Download the Image
Download the target ONTAP package from mysupport.netapp.com. Select the correct version family (e.g., NetApp ONTAP 9.17.x) and download both the image (image.tgz) and its MD5 hash file.
2. Verify the Hash
On your staging server, verify the file integrity before uploading to the cluster:
# PowerShell — run on the staging server
Get-FileHash -Algorithm MD5 "C:\staging\NetApp\image.tgz"
Compare the output against the MD5 hash file from the NetApp support portal. Do not proceed if there is a mismatch.
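Where the staging server has Python rather than PowerShell, the same comparison can be sketched as below. This is a minimal illustration, not a NetApp tool: the function names are hypothetical, and the expected hash is assumed to have been copied by hand from the downloaded hash file.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the lowercase hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read in fixed-size chunks so a multi-gigabyte image is never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_matches(path: Path, expected_hex: str) -> bool:
    """Compare case-insensitively, since Get-FileHash prints uppercase hex."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

A False result from hash_matches corresponds to the mismatch case above: stop and download the image again.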
3. Upload the Image to the Cluster
From an ONTAP CLI session, pull the image from your internal staging server:
cluster image package get -url http://<staging-server>/NetApp/image.tgz
4. Validate the Image
cluster image validate -version 9.17.1P3
Review all output. Fix any errors before the maintenance window. Warnings should be reviewed and documented in the change record. Do not ignore them without understanding the impact.
Pre-Change Health Checks
Run the following commands immediately before starting the maintenance window. Attach all output to the change record as your before-state evidence.
1. System Health Alerts
system health alert show
Expected result: no active alerts. Investigate and resolve any alerts before proceeding.
2. Storage Failover Status
storage failover show
Expected result: all HA pairs show state Connected and Takeover possible: true.
3. Aggregate Status
storage aggregate show -state !online
Expected result: no entries returned (all aggregates are online). When nothing matches, ONTAP reports "There are no entries matching your query."
4. Broken Disks
storage disk show -broken
Expected result: no entries returned (no broken disks).
5. LIF Home Port Status
net int show -is-home false
Expected result: no entries returned (all LIFs are on home ports). If LIFs are off home ports, either revert them or document the exceptions before starting.
6. Current Cluster Version
cluster image show
Record the current running version for the change record.
7. Cluster and Node Health
cluster show
system node show -fields health
All nodes must be healthy before starting the upgrade.
8. SnapMirror Status (If Applicable)
snapmirror show -fields status,healthy
Confirm there are no unhealthy or unexpected SnapMirror relationships that could be interrupted by a node failover during the upgrade.
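To keep the before-state evidence in one place, the checks above could be collected into a single timestamped file. The sketch below assumes a caller-supplied run_command function (hypothetical, for example a wrapper around your SSH tooling) that takes an ONTAP command string and returns its output as text. The file naming and the "::>" prefix are illustrative choices only.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable

# The pre-change commands from this section, in the order they are run.
PRE_CHECK_COMMANDS = [
    "system health alert show",
    "storage failover show",
    "storage aggregate show -state !online",
    "storage disk show -broken",
    "net int show -is-home false",
    "cluster image show",
    "cluster show",
    "system node show -fields health",
    "snapmirror show -fields status,healthy",
]


def capture_evidence(run_command: Callable[[str], str], out_dir: Path, label: str) -> Path:
    """Run each command and write all output to one timestamped text file."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"{label}-{stamp}.txt"
    with out_file.open("w", encoding="utf-8") as handle:
        for command in PRE_CHECK_COMMANDS:
            # Record the command next to its output so the file reads like a session log.
            handle.write(f"::> {command}\n{run_command(command)}\n\n")
    return out_file
```

Calling capture_evidence(run_command, Path("evidence"), "pre-change") would produce one file to attach to the change record. A human still needs to read the output against the expected results listed above.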
Communications — Starting Maintenance
Before executing the change:
- Email notification: Notify storage operations stakeholders that the maintenance is starting.
- Chat notification: Post the following to your team channel.
  <CHANGE_ID> - <cluster-hostname> NetApp ONTAP Code Update - Starting
- Change record: Transition the change ticket to Implementing.
Change Execution
1. Pause Monitoring
If using AIQUM (Active IQ Unified Manager), log in and enable maintenance mode for the cluster for the duration of the maintenance window. This suppresses expected alerts during the failover/giveback cycles and prevents false escalations.
2. SSH Into the Cluster
Connect to the cluster management LIF. The automated upgrade manages each node in sequence; you do not need to SSH to individual nodes.
3. Generate “Start Maintenance” AutoSupport
This creates a timestamped record in NetApp’s systems and sets a maintenance window suppression period:
system node autosupport invoke -node * -type all -message "MAINT=4h Starting_Code_Update"
Adjust the MAINT=Xh duration to match your approved maintenance window.
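If the window length comes from the change record, the command string can be generated rather than typed by hand. A minimal sketch follows; the function name and the validation rules are assumptions made for illustration, not ONTAP requirements.

```python
def maintenance_autosupport_command(window_hours: int, note: str = "Starting_Code_Update") -> str:
    """Build the 'start maintenance' AutoSupport command for a window in whole hours."""
    if window_hours < 1:
        raise ValueError("window_hours must be at least 1")
    if any(ch.isspace() for ch in note):
        # Keep the note as one token, matching the underscore style used in this runbook.
        raise ValueError("note must not contain whitespace")
    return (
        "system node autosupport invoke -node * -type all "
        f'-message "MAINT={window_hours}h {note}"'
    )
```

With window_hours=4 this reproduces the command shown above.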
4. Initiate the Nondisruptive Upgrade
The following command triggers the full automated upgrade sequence: validation, node failover, image activation, giveback, and repeat for each remaining node.
cluster image update -version 9.17.1P3 -pause-after none -ignore-validation-warning false -skip-confirmation false -stabilize-minutes 4 -nodes *
Key flags:
- -pause-after none — upgrade all nodes without manual pause between them
- -ignore-validation-warning false — do not silently bypass warnings
- -stabilize-minutes 4 — wait 4 minutes after each node giveback before moving to the next node
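For planning, it can help to assemble the command from its parameters and estimate how much of the window the stabilization waits alone will consume. The sketch below uses hypothetical helper names and assumes, as described above, that the wait applies once after each node's giveback. Real upgrade time per node is much longer and is not modelled here.

```python
def build_update_command(version: str, stabilize_minutes: int = 4, nodes: str = "*") -> str:
    """Assemble the cluster image update command with the flags used in this runbook."""
    return (
        f"cluster image update -version {version} -pause-after none "
        "-ignore-validation-warning false -skip-confirmation false "
        f"-stabilize-minutes {stabilize_minutes} -nodes {nodes}"
    )


def total_stabilize_minutes(node_count: int, stabilize_minutes: int = 4) -> int:
    """Upper bound on time spent purely in stabilization waits across all nodes."""
    return node_count * stabilize_minutes
```

For a four-node cluster with the value above, the stabilization waits add up to at most 16 minutes of the window.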
Note: If your session disconnects during the upgrade, open a new SSH session and continue monitoring. The upgrade continues regardless of your session state.
Monitoring Progress
Open a separate SSH session to monitor progress without interrupting the upgrade session.
Live Progress
cluster image show-update-progress
Run this repeatedly or leave it open. It shows the current upgrade phase and per-node status.
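"Run this repeatedly" can be expressed as a simple polling loop. In the sketch below, fetch and is_done are caller-supplied assumptions: fetch would run cluster image show-update-progress through whatever tooling you use and return its text, and is_done decides from that text whether the upgrade has finished. The output format is deliberately not parsed here.

```python
import time
from typing import Callable, Tuple


def poll_until(
    fetch: Callable[[], str],
    is_done: Callable[[str], bool],
    interval_seconds: float = 60.0,
    max_polls: int = 240,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, str]:
    """Call fetch() until is_done(output) is true or max_polls is reached.

    Returns (finished, last_output) so the caller can log the final state.
    """
    last_output = ""
    for attempt in range(max_polls):
        last_output = fetch()
        if is_done(last_output):
            return True, last_output
        if attempt < max_polls - 1:
            # Wait between polls, but not after the final attempt.
            sleep(interval_seconds)
    return False, last_output
```

The defaults (one poll per minute for up to four hours) are chosen only to line up with the example maintenance window; adjust them to your own.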
Upgrade History
cluster image show-update-history
Shows completed phases and their timestamps — useful for evidence capture.
HA Status During Upgrade
storage failover show
During the upgrade, one node will be in takeover state at a time. Verify each node returns to Takeover possible: true before the upgrade moves to the next HA pair.
Backout Plan
Warning: Only execute the following steps if the automated upgrade fails and cannot be resumed.
Engage NetApp Support before starting any backout actions. Revert and downgrade procedures are Support-led and must be scoped to your specific ONTAP version, platform, and failure state.
Step 1 — Set the Boot Image
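Tell ONTAP which image to boot on the next restart. The -image parameter takes an image slot name (image1 or image2), not a version string, so list the images first and select the slot that holds 9.16.1P2:
system node image show
system node image modify -node <node1> -image <image1|image2> -isdefault true
system node image show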
Step 2 — Verify Revert Preconditions
Switch to advanced privilege mode and run a check-only pass before committing:
set -privilege advanced
system node revert-to -node <node1> -check-only true -version 9.16.1P2
Resolve any issues reported before proceeding.
Step 3 — Execute the Revert
system node revert-to -node <node1> -version 9.16.1P2
Step 4 — Filesystem Revert (If Required)
Only needed if directed by NetApp Support for a specific failure state:
system node run -node <node1>
revert_to 9.16.1P2
boot_ontap
Step 5 — Repeat for the HA Partner Node
Perform Steps 2–4 on the partner node of each affected HA pair.
Step 6 — Restore HA Configuration
On a two-node cluster, re-enable cluster HA first (skip this command on larger clusters):
cluster ha modify -configured true
storage failover modify -node <node1> -enabled true
Post-Validation
Run the following commands after the upgrade completes. Attach all output to the change record as your after-state evidence.
1. Confirm Target Version
version -v
The output must show 9.17.1P3.
2. Confirm Upgrade Completion
cluster image show
cluster image show-update-history
All nodes must show the target version. The update history must show completed with no failures.
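The "all nodes on target" rule is easy to check mechanically once the node-to-version pairs have been read out of the cluster image show output. The sketch below assumes that mapping has already been built (by hand or by your own parsing) and uses a hypothetical function name.

```python
from typing import Dict, List


def nodes_not_on_target(node_versions: Dict[str, str], target_version: str) -> List[str]:
    """Return the sorted names of nodes whose running version differs from the target."""
    return sorted(
        node
        for node, version in node_versions.items()
        if version.strip() != target_version
    )
```

An empty list means every node reports the target version; any name returned needs investigation before the change is closed.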
3. Storage Failover Status
storage failover show
All HA pairs must be back to Connected and Takeover possible: true.
4. LIF Placement
net int show -is-home false
All LIFs should be back on their home ports. If any are not, revert them:
network interface revert -vserver <svm-name> -lif <lif-name>
5. System Health Alerts
system health alert show
Expected result: no new active alerts.
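"No new active alerts" is a comparison against the before-state evidence. A minimal sketch, assuming each alert has already been reduced to one identifying string (how you derive that string from the command output is left open here):

```python
from typing import Iterable, List


def new_alerts(before: Iterable[str], after: Iterable[str]) -> List[str]:
    """Return alerts present after the upgrade that were not present before it."""
    seen_before = set(before)
    # Preserve the order in which alerts appear in the post-change output.
    return [alert for alert in after if alert not in seen_before]
```

An empty result matches the expected outcome above. Alerts that existed before the change and are still present are not flagged, which is why the pre-change evidence matters.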
6. Cluster and Node Health
cluster show
system node show -fields health
All nodes must be healthy.
7. Generate “End Maintenance” AutoSupport
system node autosupport invoke -node * -type all -message "MAINT=END"
Resume Monitoring
Disable AIQUM maintenance mode for the cluster once all post-validation checks pass. Confirm monitoring dashboards are showing green before calling the change complete.
Communications — Completing Maintenance
- Email notification: Notify storage operations stakeholders that the maintenance is complete.
- Chat notification: Post the following to your team channel.
  <CHANGE_ID> - <cluster-hostname> NetApp ONTAP Code Update - Completed
- Change record: Attach all pre- and post-validation command output, then transition the ticket to Implemented.
Key Takeaways
- Stage and validate the image at least 24 hours before the window; validation errors found during the change are avoidable.
- Capture storage failover show, system health alert show, and cluster image show output before and after as your change evidence.
- The cluster image update command handles failover and giveback automatically; avoid manual node intervention unless directed by NetApp Support.
- Keep AIQUM maintenance mode active for the full upgrade window to suppress expected failover alerts.
- Engage NetApp Support before starting any backout. Reverting ONTAP is not a self-service operation.