Penn Computing


ISC Network Services Oncall Procedures

The NES oncall phone is carried by one NES staff member for a week at a time. Every Friday a message goes to the staff member who currently has the oncall phone and to the person who will have it the following week, warning them of the upcoming handoff. The following Monday a message goes to engineering@isc.upenn.edu and noc@isc.upenn.edu announcing who has responsibility for the NES oncall phone that week. The phone is handed off sometime Monday morning.

If the person who is scheduled to have responsibility for the phone has switched shifts with another staff member, a note should be sent to engineering@isc.upenn.edu and noc@isc.upenn.edu notifying them of the change in rotation.

The normal rotation for the NES oncall phone is:

  • Joe Myers
  • Diane Galeone
  • David Anstine
  • John Monko
  • Anthony Massey
  • Dane Fetterman
  • David Dimm
  • Peg Duffy
  • David Tenney

To determine the oncall rotation for a particular date, use the lookup form (month, day, year).
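
The weekly rotation can also be worked out by hand from the list above. The following is a minimal sketch, not the actual lookup tool: it assumes shifts change every Monday and that the rotation never skips or swaps a week. The reference date `REFERENCE_MONDAY` and the function name `oncall_for` are hypothetical; substitute a real Monday on which the first person in the list took the phone.

```python
from datetime import date, timedelta

# Rotation order as listed above.
ROTATION = [
    "Joe Myers", "Diane Galeone", "David Anstine", "John Monko",
    "Anthony Massey", "Dane Fetterman", "David Dimm", "Peg Duffy",
    "David Tenney",
]

# ASSUMPTION: a Monday on which the first person in ROTATION began a shift.
# This date is a placeholder for illustration, not a real handoff date.
REFERENCE_MONDAY = date(2024, 1, 1)


def oncall_for(day, rotation=ROTATION, reference=REFERENCE_MONDAY):
    """Return the person scheduled for the Monday-to-Monday week containing `day`."""
    # Snap the date back to the Monday that starts its week.
    monday = day - timedelta(days=day.weekday())
    weeks_elapsed = (monday - reference).days // 7
    # Python's % yields a non-negative index, so dates before the
    # reference Monday are handled as well.
    return rotation[weeks_elapsed % len(rotation)]
```

Because the handoff happens sometime Monday morning, the previous person may still have the phone early on a Monday, and shift swaps announced by email override whatever this calculation returns.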

Alarms

There are two monitoring systems, Spectrum and Nagios, and you can expect alarms from both. Spectrum works with SNMP traps and sets alarms when certain conditions are met. It can be configured to alarm in situations such as:
  • inability to ping the server
  • inability to telnet to the server
  • configured disk usage limit reached for a particular filesystem
  • server is not running the right number of specified processes
Nagios can run customized scripts that also send alarms. Where feasible, we set up these tests so that they in turn trigger SNMP traps and Spectrum handles the alarming, but a number of scripts still initiate alarms from Nagios alone.
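
For readers unfamiliar with how a customized Nagios check reports its result, here is a minimal sketch of a disk usage check. It is not one of the NES scripts. It relies only on the standard Nagios plugin convention that the exit code carries the state (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and that the first line of output is the status text. The thresholds of 80% and 90% and the function names are assumptions made for illustration.

```python
import shutil
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(percent_used, warn=80.0, crit=90.0):
    """Map a disk usage percentage to an (exit_code, status_line) pair.

    The default thresholds are illustrative assumptions.
    """
    if percent_used >= crit:
        return CRITICAL, f"DISK CRITICAL - {percent_used:.1f}% used"
    if percent_used >= warn:
        return WARNING, f"DISK WARNING - {percent_used:.1f}% used"
    return OK, f"DISK OK - {percent_used:.1f}% used"


def check_filesystem(path, warn=80.0, crit=90.0):
    """Measure usage of the filesystem holding `path` and evaluate it."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - cannot read {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero size"
    return evaluate(100.0 * usage.used / usage.total, warn, crit)


def main(argv):
    """Print the status line and return the exit code for Nagios."""
    path = argv[1] if len(argv) > 1 else "/"
    code, line = check_filesystem(path)
    print(line)
    return code

# As a plugin entry point this would end with: sys.exit(main(sys.argv))
```

Keeping the threshold logic in `evaluate` separate from the measurement makes the check easy to test without touching a real filesystem.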

Both of these systems send SMS messages to the oncall phone when there is an alarm for a particular service. If you receive an alarm, follow the escalation procedures below. If the situation that caused an alarm clears, Spectrum will send an email message to the Zimbra account, nsoncall, so sign in to that account's mailbox to look for cleared messages. Nagios will send an SMS message when the situation has cleared.

Other Links

  1. http://www.net.isc.upenn.edu/nes/services/vendor-escalation.html - Vendor service escalation procedures
  2. http://www.net.isc.upenn.edu/nes/services/responsibilities-internal.html - Services responsibility list
  3. https://secure.www.upenn.edu/netdev/nes/contacts - NES Contact sheet
  4. https://zimbra.upenn.edu/ - Zimbra Webmail client. Use this to access the nsoncall mailbox.

Escalation Procedures

The oncall phone should be included in all monitoring email reflector lists, so the oncall operator will get warnings if there is a problem with any service. The basic escalation procedure when you're on call is based on the responsibility list for Network Services. Regardless of what action you take, it is a good idea to report back to the group with an email message.
  1. Determine whether an immediate response is needed, depending on the time of day and whether the affected service is (A)lways available, (H)ighly available or has (B)asic availability. For example, if it is 19:00 and you get an alarm for a service that is listed as having (B)asic availability, you can ignore the alarm. You should not have been alerted in that case. Please send a note to the staff describing the situation so that the alarm schedule or responsibility list can be adjusted. If you want to verify that you can ignore the alarm, you can skip to step #3 and try calling one of the people in the escalation list.

    Determine whether there was a scheduled outage. Look through your mail. The person primarily responsible for the service is supposed to send an outage notice, suppress Nagios alarms before the outage, and ask the Netman group to suppress Spectrum alarms, but people can forget. If there was a scheduled outage, you can ignore the alarms, but please send email notifying the group that alarms were not suppressed.

    Also note that monitoring is not foolproof and we can get false alarms, so you could get an alarm followed by a recovery-of-service message within minutes. If that happens, you don't need to call anyone else, but do send email to the group letting them know what happened.

  2. Try using the service and/or pinging the server to confirm whether there really is a problem. If you see a failure or are unable to test, check the responsibility list for Network Services and try calling one of the people responsible for the service/server.

  3. If you cannot reach the people responsible for the service/server or are unsure who to call, try calling the following people in the order listed below until you reach somebody:
    • Peggy Yetter
    • Eric Snyder
    • Adam Preset
    • Mark Sirota
    • Shumon Huque
    • Deke Kassabian

  4. If you can't reach anyone at all, let the NOC know and try again later.

Again, whatever action is taken should be reported back to the group via email.
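
The decisions in step 1 can be summarized as a small function. This is only a sketch of the reasoning, not a tool used by NES. The coverage windows for (H)ighly available and (B)asic services below are hypothetical placeholders; the real hours come from the responsibility list. The names `COVERAGE`, `is_covered` and `triage` are likewise invented for illustration.

```python
from datetime import datetime, time
from enum import Enum


class Action(Enum):
    IGNORE_AND_REPORT_SCHEDULE = "ignore; email staff so the alarm schedule can be adjusted"
    IGNORE_AND_REPORT_UNSUPPRESSED = "ignore; email the group that alarms were not suppressed"
    REPORT_FALSE_ALARM = "no call needed; email the group describing what happened"
    VERIFY_AND_ESCALATE = "test the service, then call the people responsible"


# HYPOTHETICAL coverage windows: (start, end, weekdays with Monday = 0).
# None means the service is covered at all times.
COVERAGE = {
    "A": None,
    "H": (time(7, 0), time(23, 0), range(0, 7)),
    "B": (time(9, 0), time(17, 0), range(0, 5)),
}


def is_covered(availability, when):
    """Return True if a service of this class expects a response at `when`."""
    window = COVERAGE[availability]
    if window is None:
        return True
    start, end, weekdays = window
    return when.weekday() in weekdays and start <= when.time() < end


def triage(availability, when, scheduled_outage=False, already_recovered=False):
    """Follow step 1: coverage first, then scheduled outage, then false alarm."""
    if not is_covered(availability, when):
        return Action.IGNORE_AND_REPORT_SCHEDULE
    if scheduled_outage:
        return Action.IGNORE_AND_REPORT_UNSUPPRESSED
    if already_recovered:
        return Action.REPORT_FALSE_ALARM
    return Action.VERIFY_AND_ESCALATE
```

Every outcome includes a report by email, which matches the rule that whatever action is taken should be reported back to the group.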

Information Systems and Computing
University of Pennsylvania