Features

November 2007

COVER STORY

Scrutinized

Maine implements remote network-management technology to fill the gap between SNMP and packet analyzers.

by Drew Robb

Duncan Bond, Maine’s data network supervisor, now has four tools for analyzing network traffic and resolving bandwidth issues. Photograph by Andy Molloy

Managing a network requires a variety of tools, each providing different insight into what is flowing over that network. Unfortunately, no single product provides all the necessary oversight. For Duncan Bond, Maine’s data network supervisor, Simple Network Management Protocol (SNMP) provided the high-level view showing when a pipeline was overloaded, and packet capture supplied the fine details. Sometimes, however, that was not enough.

“It is not a big deal when the problem is here in Augusta,” says Bond. “We can pack up a packet analyzer fairly readily and go look at the problem.”

Remote locations were a different story.

“When a location is one hundred or two hundred miles away, the logistics and the cost of driving there and back is significant,” he says. “What we needed was an intermediate tool we could place on the network that would give us some insight into what was happening at the remote locations.”

To gain this missing insight, he activated NetFlow on all the routers and switches made by Cisco Systems that were in the state’s network. These devices sent NetFlow data to a server running Plixer’s Scrutinizer software for analysis and reporting.

Maine’s network is headquartered in the state capital, Augusta, located in the southwestern part of the state. From there it stretches more than 200 miles to locations on the U.S.-Canadian border. Five hundred Cisco routers connect 750 business units throughout the state.

The network backbone consists of asynchronous transfer mode (ATM) circuits leased from Verizon, connected with Nortel switches. All ATM locations have at least 20-Mbps bandwidth and some have up to 50 Mbps. Each ATM location has Cisco routers that connect to the edge sites. In the Augusta/Lewiston area, a 50-Mbps synchronous optical network (SONET) ring from Oxford Networks provides backup, so emergency data and voice communications can still get through if the ATM network goes down.

A metro area network serving the capital area consists of 100-Mbps and Gigabit Ethernet fiber optics, and a 100-Mbps wireless connection. These are linked by four Cisco 8600 series switches.

The WAN is primarily T-1 (1.544 Mbps) lines leased from Verizon. The LANs at the edge sites run 10 Mbps, 100 Mbps or Gigabit Ethernet. Check Point Software firewalls and InterSpect IPSes protect the network.

Several traffic views

The state had two ways of monitoring its network traffic: SNMP data using WhatsUp Gold from Ipswitch, and WebNM network-management software from Plixer International. WebNM uses a variety of protocols and applications, such as SNMP and Multi Router Traffic Grapher (MRTG), to give the network administrators several ways of viewing traffic.

An alphabetical list of links to maps of different network segments is provided, such as “Ellsworth (area 63)” or “State Office Complex MAN (area 10).” Next to each is a count of the number of items up in that segment, the number of items down and the number of items with services down.

Clicking on any of the map links brings up a network map of that area and shows the current status of the links. Administrators can click on the links to drill down into more detailed performance and historical data. They can also view performance graphs showing historical information.

“We have always had good reporting on how much bandwidth is being used and whether there are any errors on the circuit,” says Bond.

WebNM tells if a circuit is down or overloaded, but not what traffic is causing the problem. For more detailed information, Bond uses Fluke Networks’ Protocol Inspector.

“We have a number of Fluke remote probes, primarily on our major critical links, such as the one to the Internet,” says Bond. “That allows us to have a real-time RMON-based solution, as well as a packet-capture mechanism if we need it.”

These were placed at backbone locations, but monitoring an edge location meant driving out and installing one temporarily on that LAN. The drive from Augusta to Caribou takes four hours, so that was not a solution to immediate problems. To improve service levels, the state needed something that would give it greater insight into what was happening on the network.

“One of the more common problems is people calling and saying the circuit is slow,” says Bond. “With SNMP data, we could look and see that the T-1 is saturated, but what we wanted was more detailed information on why it is saturated.”

Just as important was a way to find out what was happening in the past. “Yesterday afternoon at 2:00 something happened, but they are just reporting it now,” he adds.
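The saturation check Bond describes can be sketched from two SNMP interface counter readings. This is a minimal illustration, not the method WhatsUp Gold or WebNM actually uses: it assumes the 32-bit ifInOctets counter, the 1.544-Mbps T-1 rate cited above, and invented sample values.

```python
T1_BPS = 1_544_000        # T-1 line rate cited in the article
COUNTER32_MAX = 2 ** 32   # assumption: 32-bit ifInOctets counter, which wraps


def utilization(prev_octets, curr_octets, interval_s, link_bps=T1_BPS):
    """Return link utilization (1.0 = fully saturated) between two samples."""
    # The modulo handles a single counter wrap between the two readings.
    delta = (curr_octets - prev_octets) % COUNTER32_MAX
    return (delta * 8) / (interval_s * link_bps)


# Hypothetical readings taken 300 seconds apart, with a wrap in between.
u = utilization(4_294_000_000, 50_000_000, 300)
print(f"T-1 inbound utilization: {u:.1%}")
```

A figure near 100 percent confirms the circuit is full, but, as Bond notes, it says nothing about which hosts or applications filled it.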

NetFlow across the network

This led him to look at implementing NetFlow across his network. Originally developed a dozen years ago by Cisco, NetFlow version 9 was recently adopted by the IETF as the basis for a standard called IPFIX. Because it is a standard, flow data from different vendors’ devices can be analyzed together, and the state can still apply the correct level of forensics or traffic analysis to it.

NetFlow is part of the router or switch software. It gathers data on what protocols, applications and IP addresses are using a circuit. It then periodically forwards this data to a server for analysis. This lets an enterprise identify what traffic is hogging a connection.

“If the bandwidth is saturated, we can quickly tell what caused it,” says Bond. “If it is not a business-critical application, then we can get the user to cease and, depending on its urgency, do it off hours.”
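To illustrate how flow records point to the traffic hogging a link, here is a minimal sketch that totals bytes per conversation. The Flow fields and the sample addresses are assumptions made for the example; they are not Scrutinizer’s data model or the state’s actual traffic.

```python
from collections import Counter
from typing import NamedTuple


class Flow(NamedTuple):
    """Simplified flow record (hypothetical fields for illustration)."""
    src: str
    dst: str
    protocol: int   # IP protocol number (6 = TCP, 17 = UDP)
    dst_port: int
    octets: int


def top_conversations(flows, n=3):
    """Return the n host pairs that moved the most bytes."""
    totals = Counter()
    for f in flows:
        # Treat A->B and B->A as the same conversation.
        key = tuple(sorted((f.src, f.dst)))
        totals[key] += f.octets
    return totals.most_common(n)


# Invented sample records exported for one saturated circuit.
flows = [
    Flow("10.1.1.5", "10.9.0.20", 6, 445, 9_000_000),
    Flow("10.9.0.20", "10.1.1.5", 6, 50123, 400_000),
    Flow("10.1.1.7", "10.9.0.30", 6, 80, 250_000),
    Flow("10.1.1.8", "10.9.0.53", 17, 53, 4_000),
]

for (a, b), octets in top_conversations(flows):
    print(f"{a} <-> {b}: {octets} bytes")
```

The pair at the top of the list is the first place to look when a user reports a slow circuit.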

Once he decided to start using NetFlow, Bond initially experimented with building his own collection-and-analysis software. After a few days, he saw that, while it was an interesting exercise, it would take too much of his time. He decided to use Plixer’s Scrutinizer.

“We have a long-standing relationship with the vendor because of WebNM,” says Bond. “I tried the beta release and was very pleased with the results, so we decided to move forward with it.”

He also liked the price. Scrutinizer costs $1,995 to $8,995 for the software, including support and updates. He says that the initial setup was just a straight software install. He currently runs version 5.0 of the software on a Windows server with dual 3.8-GHz hyperthreaded CPUs, 4 GB of RAM and a 400-GB RAID.

The larger task was setting up NetFlow reporting on the routers. Activating the reporting on a single router is not a big job, but 500 routers is a different story.

“You have to put an entry in the router to activate the NetFlow and tell it where to send it,” Bond explains. “Going in and changing 500 routers is not trivial.”

Because of the variation in network devices, no template would fit all the devices, especially on the core and minicore locations. On the bulk of the networks, however, staff could telnet to a box and use a scripted cut-and-paste operation. The whole task took about six to eight hours.
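A scripted cut-and-paste of this kind might look like the sketch below, which generates the text to paste into one router. The commands are classic Cisco IOS NetFlow export commands, but exact syntax varies by IOS release (some releases use "ip flow ingress" on the interface instead). The collector address, the UDP port 2055 and the interface names are assumptions for illustration, not the state’s configuration.

```python
def netflow_config(collector_ip, interfaces, collector_port=2055, version=5):
    """Build the IOS configuration text that enables NetFlow export."""
    lines = [
        "configure terminal",
        f"ip flow-export version {version}",
        # Tell the router where to send the flow records.
        f"ip flow-export destination {collector_ip} {collector_port}",
    ]
    for name in interfaces:
        # Enable flow accounting on each interface to be monitored.
        lines += [f"interface {name}", " ip route-cache flow"]
    lines.append("end")
    return "\n".join(lines)


# Hypothetical collector address (documentation range) and interfaces.
config = netflow_config("192.0.2.10", ["Serial0/0", "FastEthernet0/0"])
print(config)
```

Generating the text per device keeps the paste consistent across hundreds of similar edge routers, while the core and minicore devices that fit no template can still be configured by hand.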

Bond now gets NetFlow data from all Cisco routers on his network, as well as IPFIX data from the Nortel routers on the ATM backbone. He also integrated Scrutinizer with his WhatsUp and WebNM installations.

Data in a timely manner

Initially, Scrutinizer provides the NetFlow data in one-minute intervals. These are then rolled up to five-minute, 30-minute, two-hour, 12-hour, one-day and seven-day intervals. Bond has Scrutinizer set up so that the one-minute data is kept for a week, the five- and 30-minute data for a month, and the one- and seven-day interval data for a year. He says a year’s worth of data uses about 140 GB of space.
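The rollup idea can be sketched as follows. The article does not describe how Scrutinizer aggregates internally, so this example simply assumes byte counts are summed into larger time buckets; the sample values are invented.

```python
from collections import defaultdict


def roll_up(samples, bucket_minutes):
    """Sum (minute, octets) samples into buckets of bucket_minutes.

    Returns a dict mapping each bucket's starting minute to its total.
    """
    buckets = defaultdict(int)
    for minute, octets in samples:
        start = minute - (minute % bucket_minutes)  # align to bucket boundary
        buckets[start] += octets
    return dict(buckets)


# Ten invented one-minute samples: minute index and bytes seen.
one_minute = [(m, 1000 + m) for m in range(10)]

five_minute = roll_up(one_minute, 5)
thirty_minute = roll_up(list(five_minute.items()), 30)

print(five_minute)
print(thirty_minute)
```

Each coarser level is built from the level below it, which is why the fine-grained data can be discarded after a week while the daily and weekly summaries are kept for a year.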

Five staff members in his area routinely use Scrutinizer and another six or eight in network services use it frequently. He also sometimes makes the information available to qualified staff in other agencies. For example, if he spots traffic that potentially violates access policies, he can make that data available to human resources.

Bond says the long-term NetFlow information is also useful for capacity planning, and he sometimes gives agency heads access to information on their own network through a specialized log limited to a specific address range.

“As the granularity decreases, you can see trends easily,” he says. “At a glance, I can look back a year and see if there is an increase in particular protocols or implementations.”

This helps indicate when there is an actual need for additional bandwidth, and it also helps distinguish when a problem can be addressed simply by shifting traffic load.

“Yesterday, users at a site reported extreme slowness and Scrutinizer showed us that a particular conversation between two servers was causing the problem,” he says. “Someone was copying files between the two servers, and, since it was not critical, we were able to get support staff to cancel that transaction and reschedule it off hours.”

He also cites an example in which bandwidth to a number of sites was saturated every Tuesday morning. The NetFlow data traced the problem to automated software updates.

“Rather than adding bandwidth, they were able to come up with alternate ways to handle the updates without causing problems at the site,” says Bond. “It became a traffic-engineering exercise, changing business practices rather than paying for new circuits.”

Security, encryption functions

NetFlow also helps with security. When someone hooks an infected laptop into the network, for example, that traffic shows up as an anomaly. It also helps verify that links that should be encrypted actually are.

“With NetFlow, it is fairly easy: Look at the graphs, and see what protocols are present on the WAN links,” he says. “On the edges, we could see the point-to-point conversations, but not in the encrypted tunnel, which validated what we expected to be happening.”

Bond now has four levels of data to analyze: up/down status (WhatsUp), utilization (WebNM), NetFlow/IPFIX (Scrutinizer) and packets (Fluke).

He says that when a user reports a site being slow, he first looks at the WhatsUp screen and follows the hot link over to the WebNM data to see if it is a bandwidth problem. If so, he goes back to the WhatsUp screen and drills into the Scrutinizer data for the devices. The entire process only takes a couple of minutes.

There are still times when packet analyzers are needed. In one example, an application running at an edge location was having an interactive session with a server located at the core, and was running extremely slowly.

“With NetFlow, we could see that the conversation was happening but not determine why it was running slow,” he says. “With the packet analyzer, we were able to look inside and see that there was an invalid directory pointer calling in some NetBIOS packets that were putting in about a three-minute delay on every transaction.”

“NetFlow provides us the kind of insight into the network that lets us come up with rapid answers to problems and saves us a lot of time,” says Bond. “Some things that would take us a couple of weeks to answer, now we can spot it almost instantly. That translates into better productivity for my crew.”

About Cisco Systems

Founded in 1984 by a small group of computer scientists from Stanford University, Cisco Systems designs, manufactures and sells IP-based networking products and technologies. Headquartered in San Jose, Calif., the company has more than 61,535 employees worldwide, providing products and solutions in the company’s core development areas of routing and switching, as well as in advanced technologies such as application networking, digital media, mobility, storage, unified communications, data centers, security, telepresence and video.

Drew Robb is a freelance writer from Los Angeles who specializes in technology and engineering.
