In examining key failure points of your operation,
look for ways to mitigate risk with built-in redundancies and
automated procedures.
Nobody likes network downtime. It
strikes at the heart of corporate profitability and its highly visible
nature always leads to unhappy customers and upper management questions: Why
did this happen? How can it be prevented? Don’t we already have procedures
in place? That is, until budget time comes around and funding for redundancies,
staffing and training is hard to come by. The systems get more complex, but
the goal remains the same: keep the network up.
Setting realistic goals, then allocating sufficient resources to achieve
those goals, is the ideal scenario for uptime management in any organization.
A simple stepwise procedure can lead you from having downtime manage your
organization to your organization managing its own uptime.
There are many resources available to assist in developing a cohesive and
comprehensive uptime management plan. The National Institute of Standards
and Technology (NIST) produced a contingency planning guide for information
technology systems, which is an invaluable document for any IT organization.
It outlines a seven-step approach:
1. Develop the contingency planning policy statement.
2. Conduct a business impact analysis.
3. Identify preventive controls.
4. Develop recovery strategies.
5. Develop an IT contingency plan.
6. Plan testing, training and exercises.
7. Plan maintenance.
A lot of information on high-availability measurements and standards is
available. The most common metric of availability is expressed in “nines,” as
in “five nines of reliability.” This refers to 99.999% availability, or only
about five minutes of unplanned downtime per year.
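To make the "nines" concrete, the short Python sketch below converts an availability target into a yearly downtime budget. The function name and the use of an average 365.25-day year are assumptions made for illustration, not part of any standard.

```python
# Assumption: an average year of 365.25 days is used for the conversion.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget_minutes(nines: int) -> float:
    """Return the allowed downtime per year, in minutes, for N nines."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        availability = 100 * (1 - 10 ** -n)
        budget = downtime_budget_minutes(n)
        print(f"{n} nines ({availability:.3f}%): {budget:,.1f} minutes per year")
```

Under these assumptions, five nines works out to a little over five minutes per year, while three nines allows roughly nine hours.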
That metric alone, though, is not enough to set a target for your company’s
high-availability planning. Is one five-minute outage a year the same as
five one-minute outages throughout the year? Is downtime at 3 a.m. the same
as downtime at 3 p.m.? Ultimately, it is the user’s experience, along with
lost revenue, lost opportunities and resources diverted from planned
activities to firefighting, that determines the true impact of downtime.
Achieving the lofty goal of five nines sounds like it should be every
organization’s objective, but restoring service from any outage in less than
five minutes implies a fully automated system. With an outage of any
complexity, a human cannot recognize, analyze, diagnose, formulate a plan
and implement it in five minutes. Just rebooting one server can eat up most
of your time budget for the year.
A plan that provides reasonable expectations should be created, based on:
the criticality of the network; the negative impact of downtime; and the
available resources to increase uptime. Take an honest reckoning of what
will be acceptable downtime and build the systems necessary to achieve it.
When the inevitable downtime occurs, remind yourself (and upper management)
that it was all part of the plan.
In addition to availability, several other measurements should be
considered:
Mean time to repair. You can make the system reliable, but failures
will eventually happen. This measures the time from failure to recovery,
once the problem is diagnosed.
Affected users. Take time to think about an outage that lasts only
one minute, but affects 1,000 users, versus an outage that affects one user
for 1,000 minutes. Which is worse for your organization?
Potential affected users. If a 10,000-subscriber cable TV system goes
out, but only 10% of the homes have TVs on, then the potential affected
users is 10,000 but the number affected is only 1,000.
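These measurements are simple enough to track in a few lines of code. The sketch below is one possible way to record them; the `Outage` class, its field names and the idea of summarizing impact as "user-minutes" are illustrative assumptions rather than an established standard.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Outage:
    repair_minutes: float   # time taken to restore service
    potential_users: int    # everyone who could have been affected
    active_fraction: float  # share of those users actually using the service

    @property
    def affected_users(self) -> int:
        """Users who actually experienced the outage."""
        return round(self.potential_users * self.active_fraction)

    @property
    def user_minutes(self) -> float:
        """Affected users multiplied by outage duration (hypothetical metric)."""
        return self.affected_users * self.repair_minutes


def mean_time_to_repair(outages: Sequence[Outage]) -> float:
    """Average repair time across a set of recorded outages."""
    return sum(o.repair_minutes for o in outages) / len(outages)


if __name__ == "__main__":
    # The cable TV example: 10,000 subscribers, 10% with TVs on.
    cable = Outage(repair_minutes=30, potential_users=10_000, active_fraction=0.10)
    print(cable.potential_users, cable.affected_users)
```

Note that a one-minute outage for 1,000 users and a 1,000-minute outage for one user produce the same user-minutes figure, which is exactly why the number alone cannot say which is worse for a given organization.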
A standard calculation of loss can be summarized as: L = P x T x (Cr + Cp),
where P is the probability that a disaster will occur, expressed as a
percentage; Cr and Cp are the costs of lost revenue and lost productivity,
respectively, attributed to being down per unit of time; and T is the length
of the downtime.
This measurement has to be done for each failure point in the system, as
each has its own probability and cost impact. When all of the possible
downtime costs for various scenarios are estimated, the cost of lessening
the risks can be compared to the probability and costs associated with those
risks.
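As a rough illustration of that comparison, the sketch below estimates the expected loss for a single failure point and checks it against the cost of a mitigation. It reads the cost term as lost revenue plus lost productivity per unit of time, as the definitions describe. The class, the function and every dollar figure are hypothetical, and probability is written as a fraction (0.05 for 5%).

```python
from dataclasses import dataclass, replace


@dataclass
class FailurePoint:
    name: str
    probability: float                 # P, as a fraction (0.05 means 5%)
    downtime_hours: float              # T
    revenue_cost_per_hour: float       # lost revenue per hour of downtime
    productivity_cost_per_hour: float  # lost productivity per hour of downtime

    def expected_loss(self) -> float:
        """L = P x T x (lost revenue + lost productivity per hour)."""
        hourly_cost = self.revenue_cost_per_hour + self.productivity_cost_per_hour
        return self.probability * self.downtime_hours * hourly_cost


def worth_mitigating(fp: FailurePoint, mitigation_cost: float,
                     residual_probability: float) -> bool:
    """True if the drop in expected loss exceeds the cost of the mitigation."""
    mitigated = replace(fp, probability=residual_probability)
    savings = fp.expected_loss() - mitigated.expected_loss()
    return savings > mitigation_cost


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    router = FailurePoint("core router", 0.05, 4.0, 10_000.0, 2_500.0)
    print(router.expected_loss())
    print(worth_mitigating(router, mitigation_cost=1_500.0, residual_probability=0.01))
```

Repeating the calculation for each failure point gives a ranked list of where redundancy spending is likely to pay for itself.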
In examining key failure points of your operation–equipment, connectivity,
processes and staffing–look for ways to mitigate risk with built-in
redundancies and automated procedures that can shave downtime to a minimum.
Sometimes, simple solutions can provide cost-effective means to reduce the
likelihood of failures, and to shorten their duration when things do go
amiss.
Fault tolerance and redundancies can be built into most systems and
processes. Standard techniques, such as RAID arrays, high-availability
clustering, hot sites and protection switching, can be employed wherever
possible to provide alternate resources that can be brought to bear when
necessary. Battery backup, standby generators and diversity routing from
multiple telecom providers can also be used.
In considering redundancy, do not forget the human element. Adequate
staffing and cross training are often overlooked. In the event of a
region-wide outage due to a hurricane, local staff may not be able to access
the necessary facilities, or they may be consumed with personal issues. In
these cases, remote access from staff outside the affected area can make all
the difference.
If these redundancies can be brought to bear without human intervention,
critical time can be saved. System and environmental monitoring solutions
and services can alert personnel to potential problems before downtime
occurs and automatically trigger redundant failover switching when
predetermined conditions are met.
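One way to picture an automatic trigger of this kind is a monitor that switches to the backup after a set number of consecutive failed health checks. The sketch below is a minimal illustration; the `FailoverMonitor` class, the three-failure threshold and the callback names are assumptions, and a real deployment would wire `check` and `fail_over` to actual probes and switching hardware.

```python
from typing import Callable


class FailoverMonitor:
    """Triggers failover once after a run of consecutive failed health checks."""

    def __init__(self, check: Callable[[], bool], fail_over: Callable[[], None],
                 max_failures: int = 3) -> None:
        self.check = check                # returns True when the primary is healthy
        self.fail_over = fail_over        # switches traffic to the redundant system
        self.max_failures = max_failures  # the predetermined condition
        self.consecutive_failures = 0
        self.failed_over = False

    def poll(self) -> None:
        """Run one health check and fail over if the threshold is reached."""
        if self.failed_over:
            return  # already on the backup; nothing more to do
        if self.check():
            self.consecutive_failures = 0  # a healthy check resets the count
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.max_failures:
            self.fail_over()
            self.failed_over = True


if __name__ == "__main__":
    # Scripted health-check results stand in for a real probe.
    results = iter([True, False, False, False])
    monitor = FailoverMonitor(lambda: next(results),
                              lambda: print("switching to backup"))
    for _ in range(4):
        monitor.poll()
```

Requiring several failures in a row is a deliberate choice here: it keeps a single dropped probe from causing an unnecessary switch, at the cost of a slightly slower reaction.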
The human element can never be fully automated away, but clear-cut
procedures for identification, notification, mitigation, escalation and
resolution can reduce downtime and possibly prevent costly mistakes that are
often born of crisis thinking. Reliability, availability and scalability
have direct impacts on customer satisfaction, employee productivity and
revenue generation.
Optimizing your network for maximum availability is not just smart business;
it is critical for business continuity and long-term success.
David Weiss is CEO of Dataprobe, Paramus, N.J.