DePaul University Networks and Telecom                      J. Kristoff
Request for Comments: 5                                         R&D/N&T
Category: Best Current Practice                              June, 2003
Class: Public
Revision: 1.0

     Network Services and Infrastructure Operational Requirements

Status of this Memo

This document specifies a Best Current Practice for the DePaul University community, and requests discussion and feedback for improvement. Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) DePaul University (2003). All rights reserved.

Abstract

This memo describes the installation and operational requirements for DePaul University network services and infrastructure. As information technology is increasingly critical to the institution, attention to the correct, safe, reliable, secure and efficient operation of the network infrastructure and its services is required. The network staff has the responsibility to manage critical network services and infrastructure.

Kristoff                 Best Current Practice                 [Page 1]

RFC 5            Network Operational Requirements            June 2003

Table of Contents

1. Installation Procedures
   1.1. Testing
   1.2. Back-out
   1.3. Cabling
   1.4. Documentation
   1.5. Network Management
   1.6. Security
   1.7. Disaster Recovery
   1.8. Deployment
   1.9. Sign-off and Turn-over
2. Operational Procedures
   2.1. Documentation
        2.1.1. Labels
        2.1.2. Configuration Worksheets
        2.1.3. Diagrams
        2.1.4. Change Management
        2.1.5. Journals
   2.2. Environmentals
   2.3. Monitoring
        2.3.1. Availability
        2.3.2. Integrity
        2.3.3. Resources
        2.3.4. Performance
        2.3.5. Events
   2.4. Troubleshooting
   2.5. Backup and Restoral
   2.6. Maintenance
        2.6.1. Testing
        2.6.2. Deployment
        2.6.3. Back-out
   2.7. Removal and Retirement
3. General Operational Considerations
Acknowledgments
References
Security Considerations
Editor's Address

1. Installation Procedures

All new installations require planning. While some new installations are trivial and require less preparation, even small changes to services and infrastructure can affect overall operations in significant ways. Network staff must perform all new installations using consistent best practices.
If there isn't enough time to perform a quality installation, the time and opportunity to go back and fix something later will undoubtedly not be available once it is in production. Lapses in quality may compound and reduce the overall integrity of a system over time. Therefore, the network staff must opt for quality before expediency.

1.1. Testing

The underlying goal of testing is not to make something work, but to try to make something fail. Testing is done to identify and fix problems before they are introduced into production.

Changes to a production system should first be tested in a test environment. A test environment should provide sufficient fault isolation so that failures there do not impact systems on the production network.

A three-step testing process may be used. The first step is for the tester to test in a test environment. The tester should look for potential problems such as bugs, failure modes and security issues in an isolated environment where exploration and changes can be tolerated.

The second step is to do additional testing by challenging colleagues and other experts to try to make what is being tested fail. The tester should praise or reward anyone who can identify a previously unforeseen problem. After all, someone who finds a problem during testing saves the tester, the installation team, the support team and the institution a lot of time, effort and embarrassment later.

The third and final step is to invite the users who will eventually be using what is being tested to test it. These users should have a vested interest in seeing things work well in production. If these end users cannot find faults in the testing phase, hopefully they will not find problems after the tested system is moved into production either.

1.2. Back-out

Successful testing offers no guarantee that a new installation will be smoothly introduced into production.
The installation team must be able to restore the production network to a stable state in case a new installation fails. Therefore, all new installations must have a back-out plan.

1.3. Cabling

Cabling systems generally consist of passive, non-mechanical components. Failures in cabling systems should be a relatively rare occurrence. Failures that do occur are often the result of either poor installation practices or an uncontrollable external event (e.g. a backhoe).

Cabling system components must be installed properly so that failures due to poor installation practices are minimized. Cabling installations that are hurried and substandard can dramatically increase intermittent connectivity problems, frustrate support staff and agitate end users.

A cabling system must be neatly installed using sound cabling management practices. A quality cabling system follows a number of best practices as described in the following paragraphs.

First, cabling systems should be made of the highest quality cabling available and installed by professional installers. Preconstructed cables from a reliable source should be used. A high quality cable installation mitigates the possibility of failure due to cable degradation, marginal equipment or installation deficiencies.

Second, cables must be of the proper length. Cables that are too short cause strain on connectors, hinder flexibility, weaken the quality of the cable and can potentially induce accidents. Cables that are too long may affect signal propagation and also make it more difficult to properly arrange cabling systems in confined spaces.

Third, cables should not be coupled together to form longer cables. Couplings can come apart when least expected. Even cables that are screwed together are probably not watertight.
A coupled cable could be at the mercy of a leaky air conditioner or even a spilled drink.

Fourth, wire ties and cabling management should be used to organize cables, to keep cables in a bunch, to relieve strain on connections and to prevent cables from slipping. The ends of fastened wire ties must be trimmed or wrapped flush to the base of the latch to prevent injury and snagging.

Fifth, both ends of a cable must be properly labeled. Color coded cabling may also be used to help differentiate between connection types that are not obvious by looking at the physical connection (e.g. test versus production). Identifying cable characteristics helps network staff trace the proper connections during troubleshooting. Labels and color coding also help prevent the network staff from disconnecting the wrong cable when making changes.

1.4. Documentation

New installations must include accompanying documentation such as configuration worksheets, installation instructions, baseline measurements, vendor contact info, vulnerability assessment results, disaster recovery procedures and troubleshooting guidelines. Hardware and cables must have appropriately affixed labels that help identify components during the installed life of the equipment.

1.5. Network Management

If a new installation cannot be properly managed once in production, it should probably not be installed at all. To highlight the need for adequate network management procedures, the following four questions should be asked of any new installation:

   o  What will be the effect of this breaking?

   o  How will support staff know when this breaks?

   o  How will this be fixed when it breaks?

   o  How will this function be recovered in the event of an intrusion or disaster?

Regardless of the answer, the first question must be asked. It may be that this installation is not important.
If so, and it is still worth doing, the installation can proceed. In that case, however, those who need this functionality should be able to do without it for days, weeks or more if it becomes unavailable.

Network management tools address the second question. An automated monitoring system is usually best suited to this job. The least desirable way to learn of a problem is from an end user. When a user notifies the network staff of a problem, the problem takes far longer to fix, because now not only the original problem needs fixing, but so does the user.

The third question must be answered before the installation and preferably during the design phase. All necessary tools, hardware, software, documentation and vendor contact info must be readily available at all times after the installation is complete.

Incident response and disaster recovery procedures address the fourth question.

1.6. Security

New installations should have documented security considerations and in some cases may require a formal security review by the institution's information security team. Network staff should review the Network Security Principles outlined in [dpunet-rfc2] to better understand the impact new installations may have on the network architecture.

1.7. Disaster Recovery

New installations should have documented disaster recovery considerations and in some cases may require a review by the institution's disaster recovery team.

1.8. Deployment

Large or complex installations should have a project manager assigned to ensure proper planning for a successful implementation. The project manager may be a design, installation or support staff team member, but one individual should oversee and coordinate a new installation. The project manager can be a dedicated person for all new installations or selected from the network staff for each installation.
While it is common for organizations to schedule new installations at odd hours of the day or on weekends, it may not always be wise to do so. For example, off-hours work may occur when vendor support is limited. Likewise, staff may not do their best thinking at times when they would normally be sleeping. The installation team must make a calculated decision in determining the appropriate installation schedule. The best time to make changes may well be during normal working hours.

The installation team must be prepared to back out a new installation and return the network to a stable state quickly in case of problems. Staff who can provide support and perform back-out procedures must be readily available and prepared to respond to problems after any new installation for a reasonable period of normal usage.

1.9. Sign-off and Turn-over

A designated person or group, separate from the installation team but familiar with and knowledgeable about the technical details of the new installation, should conduct an installation review. The review process is not a means to halt or remove the installation, but rather to provide feedback and helpful recommendations on issues that are discovered during the review, much like a home inspection.

The installation team should respond to outstanding issues found by the independent review and make necessary changes or future plans to address problems found.

The installation documentation, the installation review and the installation team's response to the review should be turned over to the operational staff as soon as possible. The operational staff should plan on taking over operational support as soon as the installation has been stable for some short period of normal usage. The new installation should be considered incomplete until this sign-off and turn-over process occurs.

2. Operational Procedures

The production network must be maintained and monitored. This section details the guiding principles for on-going network services and infrastructure operations.

2.1. Documentation

Maintaining network documentation must be a core responsibility for all network operational staff. Network documentation is used for planning, troubleshooting and recovery procedures. Timely and accurate network documentation is critical to the overall operation of network services and infrastructure.

Operational staff should perform periodic documentation reviews. A documentation review can be done in conjunction with disaster recovery testing schedules if disaster recovery testing occurs at least once per calendar year. A documentation review will help ensure that all current services and infrastructure documentation is accurate and up to date.

2.1.1. Labels

Labels on cables and equipment must be updated whenever changes or moves render the old labels obsolete. Old labels should be removed, not scratched out or written over.

2.1.2. Configuration Worksheets

The network staff should maintain configuration worksheets that detail default and non-default parameters for network systems. These worksheets will help ensure consistency, which will avoid problems due to misconfiguration or incompatibilities. Configuration worksheets should be used as guides for installing additional systems or as authoritative data for automated tools that periodically audit systems for changes. Configuration worksheets should not be publicly available.

2.1.3. Diagrams

Operational staff should maintain overview diagrams for services and infrastructure. Network diagrams should be available for physical connectivity and network protocols in use. Application flow diagrams should also be maintained for services supported by the operational staff.
Troubleshooting flow charts for junior staff, other support staff and self-supporting users may also be maintained. Overview diagrams that exclude specific configuration details should be made publicly available.

2.1.4. Change Management

All production network changes should be documented. Change management documentation provides a lifetime audit trail that is helpful for future staff and, more importantly, for diagnosing future issues caused by changes. Except for instances concerning confidential data, summarized change notices should be publicly available.

2.1.5. Journals

Irregular events, incidents, outstanding issues, areas for research and downtime details should be documented in a journal or activity log. Journals help organize thoughts, issues, problems and ideas that can be used for future planning, development and maintenance. Journals that exclude detailed configuration and all confidential data can be made publicly available.

2.2. Environmentals

Operational teams must ensure that a proper environmental state is maintained within the operating area of equipment. This includes proper cooling, humidity control, lighting, power, physical security, hard copy documentation, emergency telephones, terminal access cables, tools, battery backup and emergency power shut-off. Other environmental accessories needed for proper network operation may also be required at the discretion of the design, installation or operational teams.

2.3. Monitoring

Network services and infrastructure will inevitably fail or degrade over time. Therefore it is of utmost importance to have good network monitoring in place to identify potential problems and opportunities for improvement. Numerous network system attributes should be monitored, including service availability, system integrity, resource activity, performance statistics and anomalous events.

2.3.1. Availability

All network devices and services should be monitored by an availability monitoring system. Availability monitoring may be in the form of rudimentary in-band tools such as PING, but may include more complex mechanisms that track availability of specific services through various paths. Availability monitoring tools should check equipment and services periodically, with a polling interval no longer than a few minutes. Availability monitoring should be able to notify support staff using both in-band (e.g. email) and out-of-band (e.g. pager) communications when a system becomes unavailable.

Availability systems should maintain historical and trending statistics as well. Downtime reports that indicate the duration of outages and the user populations affected should be maintained. Trending of downtime activity should be used to help identify areas needing improvement.

2.3.2. Integrity

The operational integrity of system resources and data is of utmost importance. Integrity checking systems should perform strong auditing of sensitive network configurations and data on a regular basis.

2.3.3. Resources

The goal of resource monitoring is to track trends and identify anomalies in the resources being monitored. Trending helps identify when changes in usage are occurring, when equipment is beginning to fail or when an upgrade may be necessary.

All network equipment and services that contain finite resources should be monitored by resource monitoring tools. Resources to be monitored may include memory usage, capacity utilization, error counters, AAA (authentication, authorization and accounting) activity, security violations and aggregate traffic flow statistics. Resource monitoring may retain long term data of up to a year or more to assist in the long term planning process. Computers should perform all resource monitoring, with key events being pushed to operational staff for review.
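
The trending and anomaly identification described for resource monitoring can be sketched in a few lines of Python. This is a minimal illustration only and does not describe any tool in use at DePaul. The function names (find_anomalies, trend_per_sample), the trailing-window baseline, the three-standard-deviation threshold, the minimum spread and the sample readings are all assumptions chosen for the example.

```python
from statistics import mean, pstdev


def find_anomalies(samples, window=8, threshold=3.0, min_spread=1.0):
    """Return the indices of readings that deviate sharply from recent history.

    samples    -- numeric readings in time order (hypothetical, e.g. link
                  utilization percentages collected by a poller)
    window     -- number of preceding readings that form the baseline
    threshold  -- how many standard deviations count as anomalous
    min_spread -- floor for the standard deviation, so a perfectly flat
                  baseline does not flag every tiny change
    """
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        average = mean(baseline)
        spread = max(pstdev(baseline), min_spread)
        if abs(samples[i] - average) > threshold * spread:
            anomalies.append(i)
    return anomalies


def trend_per_sample(samples):
    """Return the least-squares slope of the readings (change per sample)."""
    n = len(samples)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(samples)
    numerator = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical utilization readings; the last one is a sudden spike.
    readings = [40, 41, 39, 40, 42, 40, 41, 39, 40, 41, 40, 95]
    for index in find_anomalies(readings):
        # In practice this is where a key event would be pushed to staff.
        print(f"anomalous reading {readings[index]} at sample {index}")
    print(f"trend: {trend_per_sample(readings):+.2f} per sample")
```

A positive slope sustained over long term data would suggest that an upgrade may eventually be necessary, while an isolated flagged reading is a candidate key event for operational staff to review. A production tool would need more robust statistics and persistent storage than this sketch provides.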
2.3.4. Performance

Network monitoring should cover more than the mere availability and status of individual system components. To ensure that the entire internetwork system is behaving as desired from an end user perspective, performance monitoring should be used to provide estimates and trends in the overall throughput, response time and correctness of an internetwork system. Interactions among various network components may cause unpredictable behavior that cannot be seen with finely focused monitoring systems.

Response times (both one-way and round trip), active and available routes, latency and throughput for network equipment, and application behavior should be monitored so that an end-to-end view can be gained. An end-to-end view using performance monitoring should mimic real usage patterns as closely as possible.

2.3.5. Events

Anomalous events such as system outages, security incidents, overload conditions or severe error conditions should be tracked and put into an activity log.

2.4. Troubleshooting

Operational staff must have adequate training and the resources available to diagnose and correct problems. Training may be achieved by participating in design and installation tasks, through more formal training from a third party training institute or through independent learning from practice, reading and experience. Resources required vary, but may include hardware tools, software utilities, access to vendor support channels and test systems.

Operational staff must recognize that a major objective in fixing any problem, real or imagined, is restoring an end user's confidence in the network and support staff. It is therefore of utmost importance for operational staff to have good customer service skills and always treat others with a high degree of respect.
Whether or not a problem is real from the network perspective, if an end user says there is a problem, then something, perhaps not network related, needs fixing, and the network staff should do their part in helping resolve it quickly and painlessly.

Operational staff often receive reports from other support staff or end users regarding problems that may involve the network infrastructure or its services. Operational staff should verify all information they receive from others before troubleshooting a particular problem. In many instances, information supplied can be inaccurate, misleading or incomplete.

In some cases, changes to the network are made to help isolate problems. Changes must be documented at the time of the troubleshooting process or shortly thereafter so that the system can be restored to a proper state. Complex or unique troubleshooting activities should be logged to a journal for future reference.

2.5. Backup and Restoral

Hardware, software, configuration information and critical data should be backed up with redundant systems, near-line systems, off-line systems or some combination of all three. Housing services, infrastructure and data in an off-site facility may be used for additional redundancy or back-up purposes.

Systems and services requiring a backup and restoral process should have a recovery plan and the plan should be tested. A disaster recovery test is a good time to test backup and restoral procedures.

2.6. Maintenance

Network hardware and software components often need to be maintained to extend the life of a system, enhance a system or fix problems discovered in the normal operation of a system. All maintenance activity should be documented. A planning calendar that schedules periodic reviews of hardware and software components may be used.

2.6.1. Testing

Changes, upgrades and patch processes should be tested in a test environment.
2.6.2. Deployment

System upgrades and patches should be applied regularly in order to stay current with improvements, demand and fixes. The maintenance team must understand the details of an upgrade or patch in order to determine when changes should be applied or avoided. Upgrade, patch, replacement and back-out processes must be documented appropriately. New installation deployment procedures as detailed earlier in this memo also apply to operational procedures.

2.6.3. Back-out

Upgrades and patches should have a back-out plan in case of failure or unexpected change in behavior. The back-out procedure should be tested in a test environment. Time scheduled to perform upgrades and patches should include time for back-out in case of problems.

2.7. Removal and Retirement

Whenever a service or piece of equipment is no longer being used, all components must be completely removed from the production environment. This includes all physical components such as supporting cables and connectors as well as software packages installed for the system.

Equipment and services should have an expiration date. This expiration can be used as part of a service tracking calendar to help plan for when hardware is going out of warranty, when software reaches end-of-life status from a vendor or simply as a prompt to manually verify the continued life expectancy of a system. If necessary, an expiration can be revised during a review process.

3. General Operational Considerations

Successful operations depend on a number of factors. In addition to the practices described in this memo, available staff, training, funding, physical resources and staff leadership may largely determine the success or failure of the institution's network operations.
Managers and senior executives must instill operational excellence as a primary focus for network staff by encouraging the application of quality and best practices.

Acknowledgments

Cara Kaufman-Rosenthal provided useful feedback on an early version of this memo. Additional thanks to the entire Networks and Telecom Group who were able to read and comment on a near final version.

References

[dpunet-rfc2] Kristoff, J., Network Security Principles. DePaul University, DPUNET RFC 2, January 2002.

[TOE] R. Hail, J. Kristoff, Towards Operational Excellence, Implementing and Maintaining Quality in Computer Networks, 1995, available at http://condor.depaul.edu/~jkristof/papers/toe.html.

Security Considerations

Installation and changes to network systems may introduce security vulnerabilities not previously present. Operational staff must be diligent to ensure new vulnerabilities are identified and minimized. Network management and monitoring should be used to identify attacks and breaches in security protections. Detailed security and incident response procedures are not covered in this memo, but should be part of the overall operational plan.

This dpunet-RFC does not introduce any new security concerns. Other documents may address specific security considerations found in the area of network operations.

Editor's Address

John Kristoff
Research & Design, Networks and Telecom
DePaul University
1 East Jackson Boulevard
Chicago, IL 60604
USA

Phone: +01 312 362-5878
EMail: jtk@depaul.edu