Understanding Network Availability
Availability means uptime, and we like uptime; but what does uptime really mean? We like the idea of fault-tolerant systems and five-nines (99.999%) availability, but how hard is that to achieve and what does it cost? Networks are complex and made up of numerous devices, such as routers, switches, and servers, each with its own availability figure, which makes it difficult but not impossible to determine a network’s end-to-end availability. This newsletter provides the tools needed to understand network availability and describes the fundamental mathematics and simple probability concepts needed to answer such questions.
The definition of availability, or reliability, is the probability that a device will perform a required function without failure under defined conditions for a defined period of time. Two main factors are involved in the calculation of availability: Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR, an acronym that is sometimes expanded differently elsewhere). Availability is MTBF divided by the sum of MTBF and MTTR: Availability = MTBF / (MTBF + MTTR). MTBF is simply the average time between failures of a device, assuming the device is fixed and put back into service after each failure. MTTR is the average time needed to fix the device and restore it to service. For example, a router may have an MTBF of 100,000 hours and an MTTR of 1 hour.
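As a quick check of the arithmetic, the router example can be worked in a few lines of Python. This is only a sketch: the function name `availability` is an illustrative choice, and the formula is the MTBF / (MTBF + MTTR) ratio described above.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a device is in service: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Router from the text: MTBF of 100,000 hours, MTTR of 1 hour.
router = availability(100_000, 1)
print(f"{router:.6f}")  # 0.999990
```

The result is five-nines to the precision usually quoted, which is why a 100,000 hour MTBF paired with a 1 hour MTTR is a convenient reference point.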
It is important to keep in mind that every device has a probability of failure. A way to understand this is to view the device in the context of the batch produced. Let’s say 1,000,000 IP phones were manufactured by a company in a batch that used the same parts and software. If 10 of these IP phones failed in a five-year period, what is the probability that a phone you bought from this batch will fail over a five-year period? The answer is 10 divided by a million, or .00001, which leaves a probability of not failing of .99999. If we knew which 10 in the batch would fail before we bought them, the ones we bought would never fail and we wouldn’t need to talk about MTBF. In other words, everything has a probability of failure over its useful life and, like the Lotto, there’s always the chance that yours is the one! Also, keep in mind that MTBF is a number based on averages and probabilities, so just because something has an MTBF of 100,000 operational hours doesn’t mean it will run for 100,000 hours and then die. The one you buy may be the one that contributes to the low end of the curve, similar to the experience of purchasing a light bulb.
Network designers should understand the availability factors for the devices in their networks, but where do these factors, MTBF and MTTR, come from? One could come up with a set of averages by simply monitoring and logging actual failure events over a representative period of time, say several years. This method would be tedious and, in today’s world, of little use, since a router, for example, would likely be replaced by a newer version within three years. So how do we know the availability of a newly purchased device?
Determining MTBF
Electronics manufacturers turn to statistical projections, mathematical models, and databases that record device and component histories. These reliability prediction models offer standard equations that allow manufacturers to calculate the failure rate of a device based on its component data and parameters. There are several different reliability prediction models available, but they share a common goal. MIL-HDBK-217 is a reliability prediction standard originally developed for defense and aerospace organizations but adopted by many commercial and industrial companies. Often referred to simply as 217, MIL-HDBK-217 includes mathematical reliability models for nearly all types of electrical and electronic items. These reliability models are based on parameters of the device such as number of pins, number of transistors, power dissipation, and environmental factors. Results from MIL-HDBK-217 are provided as both a failure rate and an MTBF, where the MTBF is the mathematical inverse of the failure rate. Another popular prediction model, especially in the telecommunications industry, is Telcordia’s SR-332 model. Telcordia found that 217 gave pessimistic numbers for its commercial-quality products and modified the models to better reflect its field experience, including the effects of the burn-in period for new products and the degradation period of old-age products. Using these prediction capabilities, manufacturers come up with a good estimate of a new device’s MTBF.
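The inverse relationship between failure rate and MTBF is easy to illustrate. The sketch below is deliberately simplified: the component names and failure rates are hypothetical, and summing component rates is only a rough parts-count style approximation, not the actual MIL-HDBK-217 or SR-332 equations, which apply additional factors not shown here.

```python
# Hypothetical component failure rates, in failures per million hours.
# These numbers are invented for illustration only.
component_failure_rates = {
    "power_supply": 4.0,
    "cpu_board": 3.5,
    "fan": 2.5,
}


def mtbf_from_failure_rate(failures_per_million_hours: float) -> float:
    """MTBF in hours is the inverse of the failure rate."""
    return 1_000_000 / failures_per_million_hours


# Simplifying assumption: any single component failure takes the device
# down, so the device failure rate is the sum of the component rates.
device_rate = sum(component_failure_rates.values())
device_mtbf = mtbf_from_failure_rate(device_rate)

print(f"Device failure rate: {device_rate} failures per million hours")
print(f"Device MTBF: {device_mtbf:,.0f} hours")
```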
MTTR is a different story, as there may be many events and circumstances that influence how long it takes to repair a failed device, or bring it back to full operation. In the best case, a failed part can be swapped with an on-site spare for minimal downtime. In other cases, complex hardware and software modifications may be needed after extensive troubleshooting. There may even be circumstances where detection or notification of the failure takes a long time, which delays even the start of the repair process. Availability can be substantially lowered, even for a very reliable device, by a high MTTR. For example, in the case of the router described above with a 100,000 hour MTBF, a one hour MTTR would yield an availability of .99999, but a two hour MTTR misses the five-nines goal with a .99998 uptime result. This points to the fact that good notification, troubleshooting, and repair functions can contribute greatly to the overall availability level even if the devices’ MTBF levels are not as high as desired. Spending money on this side of the equation, such as having plenty of spares around, could be well worthwhile.
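To see how sensitive availability is to repair time, the same router can be run through a range of MTTR values. This is a sketch only: the 4 and 8 hour cases and the 8,760 hour year are added here for illustration and are not figures from the text.

```python
HOURS_PER_YEAR = 8760  # 365 days; leap years ignored for simplicity


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected downtime per year implied by an availability figure."""
    return (1 - avail) * HOURS_PER_YEAR * 60


MTBF = 100_000  # router MTBF from the text, in hours

for mttr in (1, 2, 4, 8):
    a = availability(MTBF, mttr)
    print(f"MTTR {mttr} h: availability {a:.5f}, "
          f"about {downtime_minutes_per_year(a):.1f} minutes down per year")
```

Doubling the repair time roughly doubles the expected downtime, which is the point made above: money spent on spares and fast notification shows up directly in the availability figure.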
Calculating Network Availability
Thus far the discussion has focused on the availability of a single device, but networks involve many devices, all of which must be functioning in order to maintain the operation of the intended application. Let’s take a simple network problem to demonstrate the difficulty of achieving a high availability level for a network application. In this example, the concern is the availability of the printer: what is the probability that the printer can perform its job when called upon at any time of the day? In this diagram, the printer can only print pages of information sent to it by the file server if all of the items along the network path are up and functioning. For simplicity, I have not included in this diagram every device that could have an effect on availability. I have assigned a high individual device availability number to each item for demonstration purposes, with a slightly lower value for the printer because it is likely to have mechanical parts, which historically are more prone to failure than electronic parts. The probability that the printer will be able to print the requested pages becomes the probability that all devices along the network path are able to function at the same time. If one device is down, the printer cannot do what it is called upon to do and the printer “system” will be down. I have found that end users really do not care why the print job cannot be performed, only that the system is down (again). This is an example of a dependent, serial system in which the end-to-end availability is a function of the availability of each item in the series.
One can determine the end-to-end availability of a dependent series (AS) when the individual availabilities of the items in the series are known by multiplying the individual availabilities together: AS = A1 × A2 × ... × An. The result, .9501, is the probability that all devices are up and functioning at the same time, meaning that the printer will be capable of doing the called-upon task. Unavailability of the printer is the probability that the printer will not be functioning, which is .0499 (1 – .9501); in other words, it will be down about 5% of the time.
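The series calculation is just a running product, as the sketch below shows. Note the assumptions: apart from the printer’s .97, the device names and availability values are hypothetical stand-ins chosen for illustration, not the figures from the diagram, so the printed result will not match .9501.

```python
import math


def series_availability(availabilities):
    """End-to-end availability of a dependent series: product of the parts."""
    return math.prod(availabilities)


# Hypothetical device availabilities, NOT the values from the diagram.
# Only the printer's .97 is taken from the text.
path = {
    "file server": 0.999,
    "LAN switch A": 0.9995,
    "router A": 0.999,
    "WAN link": 0.995,
    "router B": 0.999,
    "LAN switch B": 0.9995,
    "printer": 0.97,
}

a_s = series_availability(path.values())
print(f"End-to-end availability: {a_s:.4f}")
print(f"Unavailability:          {1 - a_s:.4f}")
```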
Other Important Relationships
This may seem obvious, that all parts of the system need to be functioning in order to print pages from a far off file server; however, this calculation process also highlights several other, maybe less obvious, facts.
First, no matter how high the availability of individual devices, when two or more are required to function in a series, the resulting end-to-end availability will be lowered. The resulting unavailability (1 – availability) of the series is approximately the sum of the unavailability factors of the individual devices; the approximation is very close when each device’s unavailability is small. By adding more devices to the series, one is adding to the unavailability of the end-to-end system. This is a great point to remember, as we really like to add things to networks: multiplexers, firewalls, gateways, session border controllers, etc. The message in this is that if a very high availability is required, the number of devices in the series must be kept low.
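The relationship between series unavailability and the sum of the individual unavailabilities can be checked numerically. In this sketch the device values are hypothetical (only the .97 echoes the printer in the text); it shows that the simple sum slightly overstates the exact figure and is a close approximation when the individual unavailabilities are small.

```python
import math


def exact_unavailability(availabilities):
    """1 minus the product of the individual availabilities."""
    return 1 - math.prod(availabilities)


def summed_unavailability(availabilities):
    """Sum of the individual unavailabilities (an upper-bound estimate)."""
    return sum(1 - a for a in availabilities)


devices = [0.999, 0.9995, 0.97]  # hypothetical values for illustration

exact = exact_unavailability(devices)
approx = summed_unavailability(devices)
print(f"Exact:  {exact:.6f}")
print(f"Summed: {approx:.6f}")
```

Each device added to the list raises both figures, which is the reason for keeping the series short.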
A second important, less obvious fact is that the end-to-end availability can be no higher than the availability of the weakest device in the series. The printer has an availability of .97 in this example. If all of the other devices were perfect, with a 1.00 availability, the end-to-end availability would be .97.
But the other devices are less than perfect, so we multiply the .97 by numbers less than 1, bringing the result down below .97. In effect, the worst device availability in the series becomes the high-water mark of the series and each added device lowers the result from there. The message in this is that if a very high end-to-end availability is required, every device in the series must have a very, very high individual availability.
Both of these facts lead to another observation: in order to maintain very high series availability numbers, it is going to be expensive. Fewer devices per series and higher quality devices in the series will more than likely add to the overall expense of the network.
Determining the Impact of a Backup Strategy
There is another approach that can be used to maintain a high series availability number, but it comes with a price as well. A backup strategy, at some level, will increase the overall system availability. However, before a backup scheme is installed, it is very useful to know just how much of an improvement will be achieved in the overall system availability figure. The approach involves determining the probability that the primary and backup methods will both be down at the same time. The calculation procedure is not too difficult if the device availabilities for both the primary system and the backup system are known.
As shown in the diagram, a simple dial backup strategy is proposed for the printer system. However, the redundancy employed covers only a portion of the entire series, which is the network related items, leaving the servers and printer unprotected. What is the resulting end-to-end availability of the new system?
The calculation is a three-step process. First, the availability of the dial backup network leg (AB) and the availability of the primary network leg that is being protected (AP) are determined, each as a series product of the devices in that leg.
Next, these factors, AB and AP, are used to determine the new availability for the network (Anet), including the backup enhancement. This is done by multiplying the unavailabilities (1 – availability) of these two network legs and then converting this number back to availability (1 – unavailability): Anet = 1 – (1 – AP) × (1 – AB). Anet represents the combined availability of the primary and backup network segments only.
Finally, the availability of the backed-up segment, Anet, is combined with the availabilities of the other, non-backed-up devices, using the same series calculation technique, to give the end-to-end availability of the series with the backup strategy employed. Here the result says that the backup strategy raises the overall availability for the printer operations from .9501 to .9692.
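The three steps can be sketched in Python as follows. All device values here are hypothetical placeholders (only the printer’s .97 comes from the text), so the output will not reproduce the .9692 figure; the sketch also assumes that the primary and backup legs fail independently of each other.

```python
import math


def series(availabilities):
    """Availability of devices that must all be up at once."""
    return math.prod(availabilities)


def parallel(a_primary, a_backup):
    """Availability of a primary/backup pair: down only if both are down."""
    return 1 - (1 - a_primary) * (1 - a_backup)


# Hypothetical values for illustration, not the figures from the diagram.
primary_leg = [0.9995, 0.999, 0.995, 0.999, 0.9995]  # switches, routers, WAN
backup_leg = [0.99, 0.98, 0.99]                      # modems and dial line
unprotected = [0.999, 0.97]                          # file server, printer

# Step 1: series availability of each network leg.
a_p = series(primary_leg)
a_b = series(backup_leg)

# Step 2: combine the two legs into one backed-up network segment.
a_net = parallel(a_p, a_b)

# Step 3: put the backed-up segment back in series with the rest.
before = series(unprotected + primary_leg)
after = series(unprotected + [a_net])

print(f"AP = {a_p:.4f}, AB = {a_b:.4f}, Anet = {a_net:.4f}")
print(f"End-to-end without backup: {before:.4f}")
print(f"End-to-end with backup:    {after:.4f}")
```

Even a fairly unreliable backup leg pushes Anet close to 1, but the end-to-end figure still cannot rise above the availability of the unprotected devices.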
Testing Alternative Strategies
This technique for determining the availability of a backup strategy can be used to test and compare other ideas. For example, the same calculation procedure could be performed to determine the resulting improvement in printer availability by backing up the printer only. In this case, the new end-to-end availability would increase to .9786, which is better than the end-to-end availability produced by backing up the network piece (.9692).
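The printer-only case can be checked using just the figures quoted in the text. One assumption is needed, and it is not stated in the article: the backup printer is taken to have the same .97 availability as the primary and to fail independently of it.

```python
series_before = 0.9501  # end-to-end availability without backup (from the text)
printer = 0.97          # printer availability (from the text)


def with_backup(a_primary, a_backup):
    """Availability of a primary/backup pair: down only if both are down."""
    return 1 - (1 - a_primary) * (1 - a_backup)


# Availability of everything in the series except the printer.
rest_of_path = series_before / printer

# Assumption: the backup printer is identical to the primary (.97).
printer_pair = with_backup(printer, printer)

series_after = rest_of_path * printer_pair
print(round(series_after, 4))  # 0.9786
```

Under that assumption the result matches the .9786 quoted above, which suggests the weakest device in a series is usually the first candidate for a backup.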
Why burn your calculator out doing this type of exercise? Estimating the effect of a backup strategy can show the effectiveness of proposed strategies which can be compared to the availability goals for the system and the cost to deploy each one. Also, this technique can be used in a variety of situations besides networking; for example, estimating the impact of backing up a whole server versus mirroring the hard drives. Even with a backup strategy, achieving five-nines is going to be expensive.
Further Training
These types of tips and insights can be found in all training classes provided by McGuire Consulting. The high quality nature of these courses is based on the many years of work and training experience of the author, Jay D. McGuire.