ISC Network Services Downtime
If downtime is necessary for any of our services, please try to follow the steps below.
- Downtime checklist
Design a downtime procedure that includes a checklist of the steps you plan to carry out during the downtime and a backout procedure in case there are problems. The procedure should minimize full disruption of service, so consider any steps that can be taken without taking the service down.
Examples of plans:
- Determine downtime window
Check the Services responsibility list for the normal downtime window for the service and coordinate with the person who is the primary contact for the service to determine the necessary downtime window. Most service changes should be scheduled from Tuesday through Thursday.
Notification of the scheduled downtime should go out at least one week before the actual downtime. It is not always possible to give one week's notice, but we should aim for it. The following people should be notified:
- Network management
- Primary users of the service
Use a mailing list associated with the service or, if it is a central service, send a note to the NOC so that they can send out an outage notice using the Outages app.
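The scheduling guidance above (a Tuesday through Thursday window, with notice at least one week ahead) can be checked mechanically. The following Python is only a minimal sketch: the function names `is_preferred_day` and `notice_deadline` are hypothetical and are not part of any existing ISC tool.

```python
from datetime import date, timedelta

# date.weekday() numbers Monday as 0, so Tuesday..Thursday are 1..3.
PREFERRED_WEEKDAYS = {1, 2, 3}
# Assumption: "one week prior" is taken as exactly seven calendar days.
NOTICE_PERIOD = timedelta(weeks=1)


def is_preferred_day(downtime: date) -> bool:
    """Return True if the downtime falls on Tuesday, Wednesday, or Thursday."""
    return downtime.weekday() in PREFERRED_WEEKDAYS


def notice_deadline(downtime: date) -> date:
    """Return the latest date on which the notification should go out."""
    return downtime - NOTICE_PERIOD


if __name__ == "__main__":
    planned = date(2005, 8, 17)
    print("Preferred day:", is_preferred_day(planned))
    print("Send notice by:", notice_deadline(planned).isoformat())
```

A check like this is only a reminder. The normal downtime window for a given service still comes from the Services responsibility list and the primary contact.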
Example of an outage note to be sent to the NOC:
Subject: Outage for www.upenn.edu
Date: August 17, 2005
Start Time: 5:00 am
Duration: 1 hour
Building(s) Affected: Entire campus
Service(s) Affected: www.upenn.edu
Description: www.upenn.edu will be unavailable to all users on
August 17, 2005 from 5:00am to 6:00am while we
upgrade hardware. We do not anticipate
that the server will be unavailable during this
entire window but are reserving this outage in
case of problems.
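To keep outage notes consistent, the fields in the example above could be filled in from a small template. This is a hedged sketch, not an existing tool: the `OutageNote` class and its attribute names are hypothetical, and only the field labels are taken from the example note.

```python
from dataclasses import dataclass


@dataclass
class OutageNote:
    """Fields of an outage note, mirroring the labels in the example above."""
    subject: str
    date: str
    start_time: str
    duration: str
    buildings_affected: str
    services_affected: str
    description: str

    def render(self) -> str:
        """Return the note as plain text, one labeled field per line."""
        lines = [
            f"Subject: {self.subject}",
            f"Date: {self.date}",
            f"Start Time: {self.start_time}",
            f"Duration: {self.duration}",
            f"Building(s) Affected: {self.buildings_affected}",
            f"Service(s) Affected: {self.services_affected}",
            f"Description: {self.description}",
        ]
        return "\n".join(lines)


if __name__ == "__main__":
    note = OutageNote(
        subject="Outage for www.upenn.edu",
        date="August 17, 2005",
        start_time="5:00 am",
        duration="1 hour",
        buildings_affected="Entire campus",
        services_affected="www.upenn.edu",
        description="www.upenn.edu will be unavailable while we upgrade hardware.",
    )
    print(note.render())
```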
- Add an entry to the firstname.lastname@example.org calendar. This is a public calendar. To view it in your calendar client, use the URL "https://zimbra.upenn.edu/home/nt-dtime/Calendar"; for an HTML version, use "https://zimbra.upenn.edu/home/nt-dtime/Calendar?fmt=html". Only the people responsible for sending out announcements should have admin access to update this calendar.
- Subject: - should contain the name of the server/service experiencing the downtime
- Location: - should be either "Off-site", which indicates that staff will do the work remotely from home, or "On-site", which indicates that staff will be on campus
- Attendees: - the invite list should include the staff members who are required for the work and other interested parties who will be overseeing the work
- Description: - description of the work to be done. Try to also include the outage message that was sent.
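The calendar fields listed above lend themselves to a quick completeness check before an entry is saved. The sketch below is illustrative only: `DowntimeCalendarEntry` and `problems` are hypothetical names, and the code does not talk to the calendar server.

```python
from dataclasses import dataclass, field
from typing import List

# The two Location values named in the list above.
VALID_LOCATIONS = ("Off-site", "On-site")


@dataclass
class DowntimeCalendarEntry:
    subject: str
    location: str
    attendees: List[str] = field(default_factory=list)
    description: str = ""

    def problems(self) -> List[str]:
        """Return human-readable problems; an empty list means the entry looks complete."""
        found = []
        if not self.subject.strip():
            found.append("Subject should name the server/service experiencing the downtime")
        if self.location not in VALID_LOCATIONS:
            found.append('Location should be "Off-site" or "On-site"')
        if not self.attendees:
            found.append("Attendees should list the staff required for the work")
        if not self.description.strip():
            found.append("Description should describe the work to be done")
        return found


if __name__ == "__main__":
    entry = DowntimeCalendarEntry(
        subject="www.upenn.edu",
        location="Remote",  # not one of the two accepted values
        attendees=["staff-member@example.org"],  # placeholder address
        description="Hardware upgrade",
    )
    for problem in entry.problems():
        print(problem)
```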
- Application announcement
If appropriate, applications associated with the service should warn users that the service will not be available. Many of our user service web sites have a status block; place an announcement there.
- Suppress alarms
Plan to suppress all monitoring alarms that may be associated with the service.
- Test changes
If appropriate, test any service changes once they are applied. Be prepared to respond immediately to any trouble reports about the service. If there are any problems, send updates to the notification list above. Send a summary of the outage to internal staff, whether the work succeeded or ran into problems.
- Update documentation
Update the internal documentation in the source code repository or Wiki to reflect anything that changed as a result of the downtime. For example, if the downtime requires a failover to a system in another facility, modify the ActiveSystemsLocation document on the Wiki.
- Send out a notification following step #3 for scheduled downtime, but skip adding an entry to the downtime calendar. For an unscheduled downtime, it may be easier to call ProDesk (215-573-4017) and the NOC (215-573-9631) than to send email.
- Add an announcement on the user service web site.
Information Systems and Computing, University of Pennsylvania