An on-call runbook template for SolarWinds Orion

Last updated: 2026-05-24

A practical template for the runbook your on-call rotation needs but probably doesn't have written down — escalation, response timing, common Orion-specific gotchas, and the handoff at end of shift.

Most network teams running SolarWinds Orion as their monitoring backbone have an on-call rotation. Most of those rotations don't have a written runbook. The reason is usually some combination of "the senior engineer just knows" and "we've been doing it this way for years." Both are fine until the senior engineer gets the flu or leaves the company.

This is a template you can rip off, fork, and customize. It's not exhaustive; your environment has its own quirks. It is the minimum an on-call engineer can read in 10 minutes and come away knowing what to do when the page fires.

When to use the runbook

Any page that:

Is severity Critical or higher.
Is not auto-resolving (alert is still active after 5 minutes).
Is on a node you don't immediately recognize.

For a routine Warning that you know from experience is flappy and irrelevant, acknowledge and move on.

Step 1 — Acknowledge within 5 minutes

The page goes from your paging tool (OnPage, PagerDuty, etc.) to your phone. Acknowledge in the paging tool first so it stops re-paging and doesn't escalate to the secondary. Don't acknowledge in Orion until you've at least looked at the alert: acknowledging in Orion suppresses further alert actions, so if you fall asleep again nothing re-pages you and the secondary won't get pinged.

Step 2 — Triage within 90 seconds

Open the alert. Read the alert text. Don't predict it; read it. Then check:

Is the node itself up? If the node is down, you have a node failure, not the alert condition the rule fired on.
What changed in the last hour? Check NCM for recent config changes, and check Orion's audit log for recent admin actions.
Are there other alerts firing at the same time? Cluster of alerts = bigger event. Single alert = localized.
Customer impact? Is this affecting actual users / customers / billable services?

Triage from the phone using PocketNOC, the Orion Web Console mobile view, or whichever tool gets you there fastest. Detailed flow: Handling a 3am Orion alert from your phone.
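
The "other alerts firing at the same time" check can be made mechanical. Here is a minimal Python sketch, assuming you have already exported active alerts as (node, triggered_at) pairs. The function names, the 10-minute window, and the 3-node threshold are all illustrative choices, not Orion settings.

```python
from datetime import datetime, timedelta

def nodes_alerting_near(alerts, page_time, window=timedelta(minutes=10)):
    """Return the set of nodes with an alert triggered within `window` of the page."""
    return {node for node, triggered_at in alerts
            if abs(triggered_at - page_time) <= window}

def is_cluster(alerts, page_time, min_nodes=3, window=timedelta(minutes=10)):
    """True when at least `min_nodes` distinct nodes alerted around the page time."""
    return len(nodes_alerting_near(alerts, page_time, window)) >= min_nodes
```

Tune the window and threshold to your environment. A site with 20 nodes and a site with 2,000 will not share a sensible definition of "cluster".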

Step 3 — Decide the path

Three buckets:

Defer. No customer impact, no blast radius, known-flappy alert. Acknowledge in Orion, file a ticket for the morning, go back to bed.
Investigate from the phone. Issue is localized and bounded. Look at the relevant performance charts, recent alerts, related nodes. You may decide it self-resolved or that it needs a ticket but not a fix right now.
Engage from the laptop. Anything that requires running commands on a device, pulling syslog, editing alert rules, or making config changes.

Default to the first bucket (Defer) when in doubt at 3am. The 30 minutes of recovery sleep are worth more than a slightly-faster investigation of something that will still be there in the morning.

Step 4 — Communicate

If you engage past triage:

Update Slack / your team channel that you've picked it up.
If you're going to be 30+ minutes, drop the timeline in the channel so people don't wonder.
If you need to wake someone else (a DBA, a vendor support engineer, a colleague), don't hesitate. The cost of waking the wrong person at 3am is small. The cost of an avoidable outage is large.

Common Orion-specific gotchas

"Node down" that isn't

Orion's node-down detection relies on ICMP ping by default. A node can stop responding to ICMP for reasons that don't matter to actual service health — ICMP rate limiting, firewall ACL change, host firewall update. Always check whether the node is reachable on its actual service ports before treating "node down" as a real outage.
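
One way to run that service-port check is a plain TCP connect from a machine that has a network path to the node. This is a rough Python sketch using only the standard library. The host name and port list in the usage comment are hypothetical, and a successful TCP handshake only shows the port is accepting connections, not that the service behind it is healthy.

```python
import socket

def reachable_ports(host, ports, timeout=2.0):
    """Return the subset of `ports` on `host` that accept a TCP connection."""
    open_ports = []
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or no route.
            with socket.create_connection((host, port), timeout=timeout):
                open_ports.append(port)
        except OSError:
            pass
    return open_ports

# Hypothetical usage: ICMP is blocked, but are SSH and HTTPS still answering?
# reachable_ports("core-sw-01.example.net", [22, 443])
```

If the service ports answer while ping fails, you are most likely looking at an ICMP policy change rather than an outage.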

Interface "down" that just got reconfigured

If someone shut down an interface administratively, Orion will fire an interface-down alert. Check admin status vs oper status in the interface details — admin down means somebody did it on purpose.
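
If you pull the two statuses over SNMP rather than reading them in the web console, the comparison is easy to script. The sketch below assumes the standard IF-MIB integer values for ifAdminStatus and ifOperStatus (1 = up, 2 = down). The function name and the three labels it returns are made up for illustration.

```python
# IF-MIB (RFC 2863) integer values shared by ifAdminStatus and ifOperStatus.
UP, DOWN = 1, 2

def classify_interface(admin_status: int, oper_status: int) -> str:
    """Label an interface from its admin and oper status values."""
    if admin_status == DOWN:
        return "admin-down"   # somebody shut it on purpose
    if oper_status == UP:
        return "up"           # interface is passing traffic again
    return "fault"            # admin up but not operational: a real problem
```

Only the "fault" case deserves the 3am investigation. "admin-down" is a question for whoever made the change.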

Polling lag on big installs

A large Orion install can have polling lag during database maintenance windows or after a polling-engine restart. An alert that fires at 3am for a condition that actually occurred 10 minutes earlier can be confusing. Check the alert timestamp vs the data timestamp.
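
The timestamp comparison is simple arithmetic, sketched here in Python. Both timestamps are assumed to be in the same time zone, and the 5-minute staleness threshold is an arbitrary example rather than an Orion default.

```python
from datetime import datetime, timedelta

def polling_lag(alert_time: datetime, last_poll_time: datetime) -> timedelta:
    """How old the underlying data was when the alert fired."""
    return alert_time - last_poll_time

def is_stale(alert_time, last_poll_time, max_lag=timedelta(minutes=5)):
    """True when the alert fired on data older than `max_lag`."""
    return polling_lag(alert_time, last_poll_time) > max_lag
```

A stale alert is not automatically a false alert, but it does mean the condition may already have cleared by the time you were paged.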

Alert rule that's too noisy

If you're getting the same alert from the same node multiple times in one shift, the rule needs work. Don't fix it at 3am — file a ticket for the morning to either add a flap-dampening condition or scope the rule more tightly.
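
To back the morning ticket with numbers, you can count repeats per node and rule over the shift. A small Python sketch, assuming the shift's alert history has been exported as (node, rule_name) pairs. The function name, the threshold of 3, and the sample names in the usage comment are illustrative.

```python
from collections import Counter

def noisy_rules(alerts, min_repeats=3):
    """Return {(node, rule): count} for pairs that fired at least `min_repeats` times."""
    counts = Counter(alerts)
    return {key: n for key, n in counts.items() if n >= min_repeats}

# Hypothetical usage:
# noisy_rules([("edge-rtr-02", "High latency")] * 4 + [("fw-01", "Node down")])
```

Paste the result into the ticket so whoever tunes the rule can see which node and rule pairs are the offenders.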

Authentication failures

Orion accounts have lockout policies. If the on-call engineer locks out the SolarWinds service account with a typo at 3am, the entire alerting pipeline can stop. Use individual accounts; never share a credential with the polling engine.

End-of-shift handoff

When your shift ends:

Hand off any active incidents to the incoming on-call in a written ticket / Slack message. Verbal handoffs at 8am between two tired humans lose information.
Note any alerts you acknowledged but didn't fix — they're now the incoming engineer's problem.
List anything you couldn't reproduce by the end of your shift but that is worth watching for during the next one.

A written handoff is a small habit that costs 5 minutes and prevents real "I thought you had it" outages.
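
If it helps the habit stick, the three handoff items can be turned into a fixed message format. This Python sketch is one possible layout, not a standard. The function name, the section titles, and the sample items in the usage comment are invented for illustration.

```python
def handoff_note(active_incidents, acked_not_fixed, watch_items):
    """Build a plain-text handoff message with one section per handoff item."""
    sections = [
        ("Active incidents", active_incidents),
        ("Acknowledged but not fixed", acked_not_fixed),
        ("Watch for next shift", watch_items),
    ]
    lines = []
    for title, items in sections:
        lines.append(f"{title}:")
        # Write an explicit "none" so an empty section is not mistaken for a skipped one.
        lines.extend(f"- {item}" for item in (items or ["none"]))
    return "\n".join(lines)

# Hypothetical usage:
# print(handoff_note(["Core link flapping, ticket open"], [], ["Intermittent latency on edge-rtr-02"]))
```

Paste the output into the ticket or team channel. The explicit "none" lines tell the incoming engineer that you checked, not that you forgot.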

Recommendations

Have a real paging tool. OnPage, PagerDuty, OpsGenie — anything that overrides Do Not Disturb and escalates. Standard push notifications are not reliable enough for on-call.
Have a mobile monitoring viewer. PocketNOC, or at minimum the Orion Web Console mobile view bookmarked. The page tells you there's a problem; the viewer tells you what.
Read the runbook before your first shift. Not during it.
Write down the gotchas your team has hit. Add them to the runbook. Every Orion install has its own personality; document the personality.

Closing

A runbook is a forcing function for the team to think about the on-call workflow when nobody is on call. The 30 minutes spent writing it pays for itself the first time a new engineer picks up the rotation. Steal this template, change it to fit your environment, and put it in the same place your team puts everything else.

Jason Lazerus — Founder, WeaveHub Technologies — 20+ years network and security engineering