|
You can, and should, monitor plant health.
I’ve got a secret. Actually, a few of them. And I’m willing to tell.
So, you want to build a broadband data infrastructure? You are probably spending a fortune on backbone switches and routers, on DSLAMs and modems in your POPs, on CMTSs in your head ends, on provisioning systems, on upgrading your network operations center (NOC), on building out your plant for two-way operation and higher frequencies, on conditioning your lines. On infrastructure. Next, you’ll spend money on aspirin.
Why? In a word, operations. Networks aren’t static; they are living things. They grow, they atrophy, they get sick, they die. Investing in infrastructure and not in operations is like deciding that being healthy today means you will live forever. Why not opt out of your medical plan? Because you know better, that’s why.
You’ll build out your NOC and staff it; you’ll put in the latest tools. You’ll display the big real-time maps when the executives take their tours and use the alarm views when they aren’t looking. Behind it all, you take millions, maybe tens of millions, of dollars of infrastructure and run basic reachability tests (ping, SNMP’s ifOperStatus, etc.) against the key elements, hoping to see massive failures before all the phones start ringing. Even so, you’ll rely on your customers to tell you that things are broken as often as you see the failures yourself.
A better way? That’s part of the secret. You can and should monitor plant health.
Remember the old days of leased lines? Telco circuits just 20 years ago were expensive and noisy. People ran data using protocols optimized for high levels of errors (e.g., SNA over X.25). Several things changed that, one of which was monitoring. The technology to improve quality emerged, but it was surveillance that drove its implementation. The use of T1 ESF registers led, in part, to the high-quality data we now take for granted. The trick is to use instrumentation for monitoring and not just debugging.
Good news: your end-of-line monitors have become cheap. Yes, there are expensive plant surveillance systems. What you might not know is that the high-speed data infrastructure you are deploying already contains large amounts of instrumentation, available free through SNMP. DOCSIS and DSL modems make excellent monitors.
The “leased lines” folks figured this stuff out years ago. It has been a secret only to dial providers, and now to HSD (high-speed data) implementers. Dial infrastructures largely failed to make use of similar instrumentation because call quality varied too much. A few smart providers monitor call logs and CDRs (call detail records) for indications that specific ports are bad, but almost nobody watches line quality.
Don’t act like the dial folks simply because broadband access is replacing dial. “Always on” means both endpoints are fixed, which gives you the opportunity to monitor a statically built link. If you run a cable plant, watch all of your upstream interfaces from each head end and correlate poor lines with fiber nodes. By distributing the polling load to the head ends, and efficiently packing multiple questions into each SNMP request, you can collect large amounts of data with overhead similar to that of pings. By thresholding the results locally, you can look for problems without backhauling all of that data to a central location; you simply send problem notification events to your NOC (via SNMP TRAPs). Chances are that you already have caching or data collection servers deployed in your larger head ends, so you can do this distributed monitoring using your existing boxes. Similar architectures and solutions are easily deployed in DSL environments.
Don’t just ping! Don’t just collect bandwidth utilization. Collect and analyze data for problems. To do this, you must define the problem set in advance and then look for specific symptoms.
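To make the local-thresholding idea concrete, here is a minimal Python sketch of what a head-end collector might do after each polling pass. It is an illustration under stated assumptions, not a description of any particular product: the `Sample` fields, the function names, and the 1 percent threshold are all hypothetical, and the SNMP polling and TRAP sending are deliberately left out. It only shows turning two successive counter readings into an error rate and grouping the bad upstreams by fiber node.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Sample:
    """One poll of an upstream interface. Field names are hypothetical."""
    interface: str       # label for the upstream interface
    fiber_node: str      # fiber node served by this upstream
    good_packets: int    # cumulative count of packets received without error
    error_packets: int   # cumulative count of packets received in error


def error_rate(prev: Sample, curr: Sample) -> Optional[float]:
    """Fraction of packets in error between two polls, or None if unusable."""
    good = curr.good_packets - prev.good_packets
    bad = curr.error_packets - prev.error_packets
    # A negative delta means a counter reset or wrap; zero traffic gives no rate.
    if good < 0 or bad < 0 or good + bad == 0:
        return None
    return bad / (good + bad)


def problem_nodes(prev: Dict[str, Sample], curr: Dict[str, Sample],
                  threshold: float = 0.01) -> Dict[str, List[str]]:
    """Group interfaces whose error rate exceeds the threshold by fiber node.

    The 1 percent default is an arbitrary placeholder, not a recommendation.
    """
    flagged: Dict[str, List[str]] = {}
    for name, sample in curr.items():
        if name not in prev:
            continue  # first poll of a new interface: nothing to compare against
        rate = error_rate(prev[name], sample)
        if rate is not None and rate > threshold:
            flagged.setdefault(sample.fiber_node, []).append(name)
    return flagged


if __name__ == "__main__":
    previous = {
        "us1": Sample("us1", "node-a", 1000, 0),
        "us2": Sample("us2", "node-b", 1000, 0),
    }
    current = {
        "us1": Sample("us1", "node-a", 1900, 100),
        "us2": Sample("us2", "node-b", 2000, 1),
    }
    # Only this small result, not the raw counters, would go to the NOC.
    print(problem_nodes(previous, current))
```

Everything stays at the head end except the short list of flagged nodes, which is what keeps the backhaul cost close to that of a ping sweep.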
If you don’t know what you are looking for, the odds of finding anything useful are minimal! A review I recently did of one carrier’s significant trouble tickets showed that most of the problems it tracked were line and frame relay outages. This seemed unusual given that the carrier was selling an IP-based service. It wasn’t until I looked at the number of problems closed because they were “not understood” or “cleared themselves” that I realized this provider was only looking for line and frame circuit hits. Had it been monitoring higher-layer services, its problem set would have looked quite different (data corruption, routing instability, etc.). Finding and fixing those problems would have greatly improved both service availability and end-user satisfaction.
The amazing thing is that using SNMP both to collect data and to send TRAPs allows for trivial integration of plant monitoring with your NOC tools. Nearly all enterprise management systems can accept SNMP traps, so you now have a single console that watches both Layer 3 infrastructure and plant health.
Integration with your provisioning system is crucial. If your provisioning process includes updating DNS so that your systems can all be reached by name, that’s all you need. Assuming that you have logically named the modems in each POP/head end, you need only run a script that does a nightly “zone transfer” to get the addresses of every networked device. From this, you parse the names of the head end/POP modems and collect the addresses that you want to watch from each location. Armed with this list, you walk the SNMP MIB interfaces table for each box and arrange to monitor the inbound (uplink) interfaces. Any additions, deletions, or changes that have been made to your network are automatically incorporated each night. Someday, much of this device discovery will be done by querying an LDAP database.
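The name-parsing step of that nightly discovery might look something like the following Python sketch. It assumes the zone transfer has already produced a list of (name, address) pairs, and it assumes a purely hypothetical naming convention of `<role><number>.<site>.<domain>`, such as `cmts1.he-east.example.net`. Your own convention will differ, so the pattern and the function name `devices_by_site` are illustrative only.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical convention: <role><number>.<site>.<domain>
# e.g. "cmts1.he-east.example.net" or "dslam2.pop-west.example.net".
NAME_PATTERN = re.compile(
    r"^(?P<role>cmts|dslam)(?P<unit>\d+)\.(?P<site>[a-z0-9-]+)\."
)


def devices_by_site(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group the addresses of head end/POP devices by the site in their name.

    `records` is assumed to be (hostname, address) pairs taken from a
    zone transfer; names that do not fit the convention are ignored.
    """
    sites: Dict[str, List[str]] = defaultdict(list)
    for name, address in records:
        match = NAME_PATTERN.match(name.lower())
        if match:
            sites[match.group("site")].append(address)
    # Sort so that successive nightly runs are easy to compare.
    return {site: sorted(addresses) for site, addresses in sites.items()}


if __name__ == "__main__":
    zone = [
        ("cmts1.he-east.example.net.", "10.1.0.1"),
        ("www.example.net.", "10.0.0.5"),
        ("dslam2.pop-west.example.net.", "10.2.0.2"),
        ("CMTS2.he-east.example.net.", "10.1.0.2"),
    ]
    for site, addresses in devices_by_site(zone).items():
        print(site, addresses)
```

Each site’s list is then what the collector at that location would walk with SNMP. Comparing tonight’s result with last night’s is one simple way to spot additions and deletions.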
Until then, you are watching all the important inbound links; you pick up changes or additions nightly; you have a low-impact, distributed architecture; and you can easily integrate these collectors into your centralized management platform. Do this and you can reallocate some of your aspirin budget to buy the champagne you’ll use at your victory party.
More secrets? Sure. We could discuss route peering; high-speed access with poor interconnects isn’t valuable at all. It gives your customers fast access to a bottleneck. Most people think you should peer with everybody you can, especially at points away from the public NAPs. It turns out that sometimes it’s better not to peer, and surveillance will again tell you where and why. That’s another secret, and the topic for another article.
Steinberg is founder and president of Netops, Pleasantville, NY. |
|