ISC Network Services Oncall Procedures
The NES oncall phone is managed by a staff member from NES for one
week at a time. Every Friday a message goes to the current staff
member who has responsibility for the oncall phone and the
person who will have responsibility for the upcoming week warning
them of the upcoming handoff.
The following Monday a message goes to
email@example.com and firstname.lastname@example.org notifying them of who
has responsibility for the NES oncall phone for that week and that the
phone will be handed off sometime Monday morning.
If the person who is scheduled to have responsibility for the phone
has switched shifts with another staff member, a note should be
sent to email@example.com and firstname.lastname@example.org notifying
them of the change in rotation.
The normal rotation for the NES oncall phone is:
- Joe Myers
- Diane Galeone
- David Anstine
- John Monko
- Anthony Massey
- Dane Fetterman
- David Dimm
- Peg Duffy
- David Tenney
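The rotation above can be turned into a small lookup for a given date. The following is only a sketch: the anchor date is a hypothetical placeholder (the source does not say when any particular person's week started), and the function name `oncall_for` is made up for illustration. It also knows nothing about swapped shifts, so the notification emails remain the authoritative record.

```python
from datetime import date, timedelta

# Rotation order as listed above.
ROTATION = [
    "Joe Myers", "Diane Galeone", "David Anstine", "John Monko",
    "Anthony Massey", "Dane Fetterman", "David Dimm", "Peg Duffy",
    "David Tenney",
]

# ASSUMPTION: hypothetical Monday on which the first person in the list
# took the phone. Replace it with a real handoff date before relying on this.
ANCHOR_MONDAY = date(2024, 1, 1)


def oncall_for(day: date) -> str:
    """Return who is scheduled to hold the phone in the week containing `day`."""
    # Snap to the Monday of that week, since the handoff happens on Monday.
    monday = day - timedelta(days=day.weekday())
    weeks_since_anchor = (monday - ANCHOR_MONDAY).days // 7
    # Python's % always returns a non-negative index, so dates before
    # the anchor also work.
    return ROTATION[weeks_since_anchor % len(ROTATION)]
```

With the placeholder anchor, `oncall_for(date(2024, 1, 10))` falls in the second week of the cycle and returns the second name in the list.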
There are two monitoring systems, Spectrum and Nagios, and
you can expect to get alarms from both of them.
Spectrum is a system that works with SNMP traps that will set alarms
when certain conditions are met. It can be configured to alarm in
- https://jira.net.isc.upenn.edu/confluence/display/netman/NetManSpectrum - Overview of Spectrum alarming
- OneClick is the web interface for managing Spectrum alarms. You may have to use OneClick to disable Spectrum alarms.
- Attention is the software that handles the alarming for Spectrum. Hedwig is our Attention server and you should be able to manage the disposition of alarms by communicating with Hedwig via your phone. Ludwig is the secondary Attention server so it's possible that you could also get calls from Ludwig.
- Catsup is a web overview of outstanding Spectrum alarms.
Nagios has the ability to run customized scripts that can also send
alarms. Where feasible we try to set up these tests so that they in
turn trigger SNMP traps so that Spectrum can handle the alarming but we do
have a number of scripts still running that will initiate alarms from
- inability to ping the server
- inability to telnet to the server
- configured disk usage limit reached for a particular filesystem
- server is not running the right number of specified processes
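To give a feel for what checks of this kind look like, here is a minimal sketch of two of them: a telnet-style TCP connection test and a disk usage test. It is not the scripts actually in use here. The function names and the separate warning and critical thresholds are assumptions for illustration; the only convention borrowed from Nagios is the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import shutil
import socket

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def tcp_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Rough equivalent of 'can I telnet to the server': try a TCP connect."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def classify(used_pct: float, warn_pct: float, crit_pct: float) -> int:
    """Map a usage percentage onto a plugin status (thresholds are assumed)."""
    if used_pct >= crit_pct:
        return CRITICAL
    if used_pct >= warn_pct:
        return WARNING
    return OK


def disk_status(path: str, warn_pct: float, crit_pct: float) -> int:
    """Check how full the filesystem holding `path` is."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return UNKNOWN
    return classify(100.0 * usage.used / usage.total, warn_pct, crit_pct)
```

A real plugin would print a one-line status message and exit with the returned code, for example `sys.exit(disk_status("/var", 80, 90))`.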
Both of these systems send SMS messages to the oncall phone when there is an alarm for a particular service. Follow the escalation procedures below if you should receive an alarm. If the situation that caused an alarm clears, Spectrum will send an email message to the Zimbra account, nsoncall. You should sign into that account's mailbox to look for cleared messages.
Nagios will send an SMS message when the situation has cleared.
- http://www.net.isc.upenn.edu/nes/services/vendor-escalation.html - Vendor service escalation procedures
- http://www.net.isc.upenn.edu/nes/services/responsibilities-internal.html - Services responsibility list
- https://secure.www.upenn.edu/netdev/nes/contacts - NES Contact sheet
- https://zimbra.upenn.edu/ - Zimbra Webmail client. Use this to access the nsoncall mailbox.
The oncall phone should be included in all monitoring email
reflector lists and the oncall operator will get warnings if there
is a problem with any service.
The basic escalation procedure when you're on call is based on the
responsibility list for Network Services. Regardless of what action you take, it is a good idea to report back to the group with an email message.
Again, whatever action is taken should be reported back to the group via email.
1. Determine whether an immediate response is needed, depending on the
time of day and whether the affected service is (A)lways available, (H)ighly available or has (B)asic availability.
For example, if it is 19:00 and you get alarmed for a service that is listed as having (B)asic availability, you can ignore the alarm. You should not have gotten called. Please send a note to the staff alerting them of the situation so that the alarm schedule or responsibility list can be adjusted. If you want to verify that you can ignore the alarm, you can skip to step #3 and try calling one of the people in the escalation list.
2. Determine if there was a scheduled outage. Look through your mail. The person primarily responsible for the service is supposed to make sure that an outage notice is sent, to make sure that Nagios alarms are suppressed prior to the outage and to notify the Netman group to suppress Spectrum alarms, but people can forget. If there was a scheduled outage, you can ignore the alarms but please send email notifying the group that alarms were not suppressed.
Also note that monitoring is not foolproof and we can get false alarms, so you could get an alarm and then a recovery of service message within minutes. If that happens, you don't need to call anyone else but do send email to the group letting them know what happened.
3. Try using the service and/or pinging the server to ascertain if there really is a problem. If you see a failure or are unable to test, check the responsibility list for Network Services and try calling one of the people responsible for the service/server.
- If you cannot reach the people responsible for the service/server or are unsure who to call,
try calling the following people in the order listed
below until you reach somebody:
- Peggy Yetter
- Eric Snyder
- Adam Preset
- Mark Sirota
- Shumon Huque
- Deke Kassabian
If you can't reach anyone at all, let the NOC know and try again
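The escalation procedure can be summarized as a short decision function. This is a sketch under stated assumptions, not an official rule set: the response windows for each availability class are hypothetical (this page does not define them; they are chosen only so that the 19:00 (B)asic example comes out as "ignore"), the names `triage` and `first_reached` are invented, and the "service looks fine" branch is an assumption, since the procedure only says to report back by email in every case.

```python
from datetime import time
from typing import Callable, Optional

# ASSUMPTION: hypothetical response windows per availability class.
# Take the real hours from the services responsibility list.
RESPONSE_WINDOWS = {
    "A": (time(0, 0), time(23, 59, 59)),
    "H": (time(7, 0), time(22, 0)),
    "B": (time(9, 0), time(17, 0)),
}

# Fallback call order, as listed above.
FALLBACK_CALL_ORDER = [
    "Peggy Yetter", "Eric Snyder", "Adam Preset",
    "Mark Sirota", "Shumon Huque", "Deke Kassabian",
]


def triage(availability: str, now: time, scheduled_outage: bool,
           recovered: bool, service_ok: Optional[bool]) -> str:
    """Return the suggested action. service_ok=None means 'unable to test'."""
    start, end = RESPONSE_WINDOWS[availability]
    if not (start <= now <= end):
        return "ignore; email staff so the alarm schedule can be adjusted"
    if scheduled_outage:
        return "ignore; email group that alarms were not suppressed"
    if recovered:
        return "no call needed; email group about the false alarm"
    if service_ok:
        # Assumption: nothing is visibly wrong, so only report back.
        return "email group with what you observed"
    return "call the people responsible; email group afterwards"


def first_reached(order: list, answers: Callable[[str], bool]) -> Optional[str]:
    """Walk the call order and return the first person who answers."""
    for person in order:
        if answers(person):
            return person
    return None  # nobody answered: let the NOC know and try again
```

For instance, `triage("B", time(19, 0), False, False, None)` returns the "ignore" action, matching the 19:00 example in the procedure.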