Handling a 3am Orion alert from your phone
Last updated: 2026-05-24
The actual workflow when an Orion alert wakes you up. What to check in the first 60 seconds, what to defer to the laptop, and how to make the call without a desktop session.
The page wakes you up at 3:07am. Critical alert: a core switch interface is down. By the time you've found your phone, your brain has run through three scenarios — flapping link, hardware failure, somebody-tripped-over-a-cable — and you need data to pick between them. You have about 60 seconds before you commit to either "I'll fix this from here" or "I need to get to the laptop."
This post is about those 60 seconds.
Step 1: Open the alert in your monitoring tool
Whichever app gets there first — PocketNOC, the OnPage page that arrived a few seconds earlier, a Datadog notification, an email — the goal of step one is the same: read the actual alert text, not your prediction of it.
The alert text in Orion tells you which node, which interface, which alert rule triggered, and the timestamp. That's enough to start narrowing. If the alert is on a Gigabit Ethernet customer-facing port on an access switch in a remote office, the response is very different from the same alert on a 100G uplink between two core devices in the primary datacenter.
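As a rough sketch of that first narrowing, the alert fields can be combined with what you already know about the device to guess how wide the failure might reach. The field names, the role and site labels, and the `likely_blast_radius` helper are all hypothetical illustrations, not Orion's schema or PocketNOC's logic.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Alert:
    """Fields from the alert text, plus context you already know.

    Field names and role/site labels are hypothetical, not Orion's schema.
    """
    node: str
    interface: str
    rule: str
    fired_at: datetime
    node_role: str   # illustrative labels: "core" or "access"
    site: str        # illustrative labels: "primary-dc" or "remote-office"


def likely_blast_radius(alert: Alert) -> str:
    """Guess 'wide', 'narrow', or 'unknown' from the alert text alone."""
    if alert.node_role == "core" and alert.site == "primary-dc":
        return "wide"      # e.g. an uplink between two core devices
    if alert.node_role == "access" and alert.site == "remote-office":
        return "narrow"    # e.g. a single port on a remote access switch
    return "unknown"       # the text alone does not settle it; keep reading
```

The result is only a starting guess. Steps 2 and 3 exist to confirm or overturn it.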
Step 2: Check the related node, not just the alerted object
The instinct is to open the failing interface immediately. Resist for 10 seconds. Open the node first. If the node itself shows critical, the interface is the wrong target and the node is the right one. Whatever made the node unreachable also made every interface on it appear down. Don't waste a minute working an interface-level problem that's actually a node-level problem.
In PocketNOC: tap the alert → tap the node name in the alert detail → look at node status, response time, and the last 30 minutes of CPU/memory. In the web console: click through to the node detail page. Same shape, more clicks.
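The node-first rule is simple enough to write down as a tiny function. This is a minimal sketch: the status strings are illustrative labels rather than Orion's internal status codes, and `triage_target` is a made-up name.

```python
def triage_target(node_status: str, interface_status: str) -> str:
    """Pick what to investigate: 'node', 'interface', or 'nothing'.

    Status strings are illustrative labels, not Orion's internal codes.
    """
    node_unhealthy = {"critical", "down", "unreachable"}
    if node_status.lower() in node_unhealthy:
        # Whatever took the node out also makes every interface look down,
        # so the node check always wins over the interface check.
        return "node"
    if interface_status.lower() == "down":
        return "interface"
    return "nothing"
```

The ordering is the whole point: the interface status is only consulted once the node has been ruled out.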
Step 3: Check correlated alerts
If three other nodes in the same rack also went unhealthy in the last 5 minutes, you're not looking at a single interface failure. You're looking at a rack-level problem — switch reboot, power event, top-of-rack fabric issue. The right response is "stop chasing the interface and find out what happened to the rack."
In Orion, the alerts list, sorted by recency, is where you spot this. PocketNOC's alerts screen does the same thing on a phone: scroll down a screen and a half. If the screen shows "active alerts: 1" you're probably in a localized failure. If it shows "active alerts: 17, all in the last 4 minutes" you're in something bigger.
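What your eyes are doing on that alerts list is a windowed count grouped by rack. A minimal sketch, assuming each alert can be reduced to a hypothetical `(rack, node, fired_at)` tuple; the default of four distinct nodes mirrors "the alerted node plus three others" and is a judgment call, not a product setting.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def racks_with_correlated_alerts(alerts, now, window_minutes=5, min_nodes=4):
    """Return racks where at least `min_nodes` distinct nodes alerted recently.

    `alerts` is an iterable of (rack, node, fired_at) tuples. The tuple shape
    is an assumption for illustration, not an Orion export format.
    """
    cutoff = now - timedelta(minutes=window_minutes)
    nodes_by_rack = defaultdict(set)
    for rack, node, fired_at in alerts:
        # Only count alerts that fired inside the lookback window.
        if cutoff <= fired_at <= now:
            nodes_by_rack[rack].add(node)
    return sorted(
        rack for rack, nodes in nodes_by_rack.items() if len(nodes) >= min_nodes
    )
```

An empty result suggests a localized failure. Any rack in the result is a reason to stop chasing the interface.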
Step 4: Make the deferral call
After 60 seconds you should know roughly which bucket this is in:
- Single localized failure, low blast radius. Maybe a single port on an access switch. Acknowledge, file a ticket for the morning, go back to bed.
- Service-affecting but the system handled it. A redundant link went down, traffic is on the backup, no customer impact. Acknowledge, check it again in the morning.
- Active customer impact. Something's not redundant, or the redundancy didn't fail over cleanly. Stay up. Decide whether the phone is enough or if you need the laptop.
- You can't tell. This is the default at 3:08am. Decide quickly. Erring toward "I'll get the laptop" costs you 10 minutes of sleep but might save you from making the wrong call from a 6-inch screen.
The point of having the tool on the phone is not "I can fix everything from bed." It's "I can decide quickly whether this needs me to get out of bed." The latter is much more valuable.
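The four buckets above can be sketched as a small decision function. The three inputs are hypothetical yes/no/don't-know answers you would have formed during steps 1 to 3, with `None` standing for "can't tell". The order of the checks is one reasonable reading of the list, not a formal policy.

```python
from enum import Enum
from typing import Optional


class Bucket(Enum):
    LOCALIZED = "acknowledge, file a ticket for the morning"
    HANDLED = "acknowledge, check again in the morning"
    CUSTOMER_IMPACT = "stay up, decide phone vs laptop"
    UNCLEAR = "get the laptop"


def deferral_call(
    customer_impact: Optional[bool],
    redundancy_took_over: Optional[bool],
    blast_radius_low: Optional[bool],
) -> Bucket:
    """Map what you learned in 60 seconds onto one of the four buckets."""
    if customer_impact is None:
        return Bucket.UNCLEAR            # can't tell: err toward the laptop
    if customer_impact:
        return Bucket.CUSTOMER_IMPACT
    if redundancy_took_over:
        return Bucket.HANDLED            # backup path is carrying traffic
    if blast_radius_low:
        return Bucket.LOCALIZED          # e.g. one port on an access switch
    return Bucket.UNCLEAR                # nothing confirmed, so don't guess
```

Anything not positively confirmed falls through to `UNCLEAR`, which matches the advice to err toward the laptop.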
What this looks like in PocketNOC specifically
When the push fires:
- Tap the notification — opens the alert detail.
- Top of screen: severity, alert name, node, interface, timestamp, current acknowledgment state.
- Tap the node → node detail with status, response time, recent performance chart, related interfaces, recent alerts on this node.
- Tap "Acknowledge" if you've decided not to engage right now — writes back to Orion via SWIS, alerts stop re-firing for the same condition.
- Back to alerts list — confirm whether this is isolated or part of a larger event.
Whole flow under 90 seconds with practice. The reason the app exists is the difference between that 90 seconds and the 8-12 minutes it takes to boot the laptop, connect to VPN, log into the web console, navigate to alerts, and load the node detail page in a browser.
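For the curious, the Acknowledge write-back in the flow above might look roughly like this at the SWIS level. This is a sketch under assumptions, not PocketNOC's implementation: it assumes a client object with an `invoke(entity, verb, *args)` method in the style of the Orion SDK's Python client, and it assumes the `Acknowledge` verb on `Orion.AlertActive` takes a list of alert object IDs followed by a note. Check both against the SDK documentation for your Orion version. `RecordingClient` is a made-up stand-in so the example runs without a server.

```python
def acknowledge(swis, alert_object_id: int, note: str):
    """Acknowledge one active alert through a SWIS-style client.

    ASSUMPTION: the entity name, verb name, and argument order below follow
    the Orion SDK's documented alert verbs; verify against your version.
    """
    return swis.invoke("Orion.AlertActive", "Acknowledge", [alert_object_id], note)


class RecordingClient:
    """Hypothetical stand-in that records calls instead of contacting Orion."""

    def __init__(self):
        self.calls = []

    def invoke(self, entity, verb, *args):
        # Store exactly what would have been sent to the server.
        self.calls.append((entity, verb, args))
        return True


client = RecordingClient()
acknowledge(client, 4242, "Acked from phone, ticket in the morning")
```

Writing a short note with the acknowledgment costs a few seconds and tells the morning shift why nobody engaged at 3am.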
What this does NOT replace
You're still going to want a laptop for:
- Reading recent config changes against NCM diffs.
- Pulling syslog for that device for the last hour.
- SSHing into the device and looking at interface counters directly.
- Updating an alert rule because you've learned this kind of alert is too noisy.
- Anything that requires a multi-line response or pasting a config snippet.
The phone is the triage tool. The laptop is the workshop. The on-call engineer who treats them as such — phone first, laptop only when the triage tells them they need it — sleeps more, makes better calls, and burns out slower than the one who reaches for the laptop on every page.
Closing
The point of mobile Orion access isn't to eliminate the laptop. It's to make the decision about whether you need the laptop a 90-second decision instead of a 10-minute one. PocketNOC, OnPage, and the Orion Web Console mobile view all serve that goal in different ways. The one you pick depends on whether your bottleneck right now is the page never arriving (paging tool), or arriving but with nothing to look at (mobile monitoring viewer).
For most on-call rotations, the second one is the next gap to close.