
Press Release

Email to CITES-Techsupport on core upgrade

March 27, 2006

Greetings, Tech Support Staff:

I know we've all been pretty silent here about how things have been going with the spring break cutover from the old "cube" core to the new 10-gigabit core. I apologize for that, but everyone in Network Engineering and Network Maintenance was extremely busy and working as much as 18 hours a day last week. Our first priority was to stabilize the network for normal use as rapidly as possible. Now that that has been accomplished, I'm sending out this update.

We installed and extensively tested the new core equipment before the cutover, in order to address as many issues as possible before impacting the production UIUCnet. The migration plan involved first severing all the edges of the old cube and connecting those devices only to the new node distribution devices, thus enabling traffic flow on the new network without actually moving any fiber optic cables. From there, we moved virtual router instances to get the new core equipment actually routing instead of merely switching backbone traffic. This was followed by the big Network Maintenance job of moving the hundreds of fiber optic connections from the old to the new equipment.

We expected some difficulties and were amply rewarded for our expectations. Many of the issues we encountered in bringing this equipment up in production are quite minor, and we'll be working with the vendor over time to get them resolved. There were, however, two major issues that caused considerable disruption last week until we and the Foundry support engineers resolved them.

The worst of these was the widespread connectivity issue experienced Wednesday and Thursday. This had the greatest impact in the central part of campus but plagued departmental networks campuswide. It was one of those problems that are hard to diagnose but easy to fix. The Foundry NetIron routers that we are using for the new core and node distribution devices are marketed primarily to large Internet Service Providers. In that environment, they have to move a very large amount of traffic to and from IP addresses all over the Internet, but are directly connected to a relatively small number of other devices. The default table sizes and hardware packet forwarding resources are therefore set to make sense for that application. We, by contrast, are directly connecting dozens of buildings and tens of thousands of end systems to each of our NetIron routers. The routers can work well in this application, but their tables and resource allocations needed to be resized for them to do so.

One of the tables that needed to be resized was the ARP table, which by default in the Foundry NetIron has room for 8192 entries. The NetIron distribution router in Node 4 currently has 14,000 ARP table entries, and once the table size was increased to accommodate that, the intermittent connectivity issue went away.
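To make the arithmetic concrete, here is a minimal Python sketch of the capacity check implied above. It uses only the two figures quoted in this message; the function name is hypothetical and this is not a Foundry tool or command.

```python
def arp_table_status(entries_in_use, table_capacity):
    """Return (fits, shortfall) for an ARP table of the given capacity.

    Hypothetical helper for illustration only.
    """
    shortfall = max(0, entries_in_use - table_capacity)
    return shortfall == 0, shortfall


DEFAULT_CAPACITY = 8192  # default NetIron ARP table size quoted above
NODE4_ENTRIES = 14000    # Node 4 ARP entry count quoted above

fits, shortfall = arp_table_status(NODE4_ENTRIES, DEFAULT_CAPACITY)
print(fits, shortfall)  # False 5808
```

In other words, with the default size, roughly 5,800 of the Node 4 entries had no room in the table at any given moment, which is consistent with connectivity that came and went.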

The second serious issue last week was the instability seen on the north end of campus, including DCL and the CITES data center, during evenings. Since the Node 1 distribution device ran very well during the day but began destabilizing like clockwork at 7:45 pm every evening, we were sure this was due to some form of traffic entering the device that was causing a problem. The slightly good news here was that, as I mentioned in my CCSP talk on the new core, the individual port cards on the NetIrons have a lot of autonomy, with their own processors and hardware forwarding resources. It was one of these port card processors that was crashing while the rest of the device continued to operate normally. Thus, we were able to immediately improve reliability somewhat by moving important connections such as the CITES data center and the old cube devices to one of the port cards that was not crashing. However, this remained a serious problem and we were not able to resolve it without assistance from Foundry.

Since the timing of this was so predictable, Foundry technical and developer resources were able to watch the device very carefully and determine the exact cause of the crash, which turned out to be malformed IP packets coming from our Cisco VPN server. Once we made that identification, the workaround was easy: we temporarily moved the VPN server to a different device. This will give Foundry developers some time to fix their bug. It's interesting that this was a case of a bug in one device triggering a bug in another.

So, where are we now?

All of the new core is up, stable, and handling traffic for the entire campus. All virtual router instances are now on the new node distribution devices. With a few exceptions due to gigabit Ethernet link issues, all physical gigabit Ethernet connections to buildings have been moved to the new node distribution devices. Most of the old "cube" core devices have been retired, with several of them powered off entirely, and we expect to complete that process this week.

We need to finish VLAN cleanup. We were pretty lavish in provisioning VLANs across the new core, and as a result a lot of VLANs extend to places they don't need to go; we can improve stability by pruning them back. Network Maintenance continues to move 100-megabit links off ancient Cisco Catalyst and Foundry Ironcore switches to the newer 100-megabit ancillary node switches.
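The pruning idea can be sketched in a few lines of Python: for each link, compare the VLANs currently carried with the VLANs actually needed there, and list the difference as candidates for removal. The link names, VLAN numbers, and function name below are all made up for illustration; they are not taken from the UIUCnet configuration.

```python
def prunable_vlans(provisioned, needed):
    """Map each link to the VLANs it carries but does not need.

    Both arguments map a link name to a set of VLAN IDs. A link missing
    from `needed` is treated as needing no VLANs (an assumption).
    """
    return {
        link: sorted(vlans - needed.get(link, set()))
        for link, vlans in provisioned.items()
    }


# Hypothetical example data
provisioned = {"node1-bldgA": {10, 20, 30}, "node1-bldgB": {10, 20, 30}}
needed = {"node1-bldgA": {10}, "node1-bldgB": {20, 30}}

result = prunable_vlans(provisioned, needed)
print(result)  # {'node1-bldgA': [20, 30], 'node1-bldgB': [10]}
```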

It was a very rough week for all of us, and we certainly understand and apologize for the problems caused by the disruptions. The very good news is that by all our observations and instrumentation, this equipment is running very smoothly now, and the worst is behind us. There may be further brief interruptions as we complete the cutover and clean things up, but they should be well controlled and will be scheduled during our maintenance windows.

If anyone has any questions or concerns about the process, please feel free to let me know.

Thanks again for your patience.

--
Charley Kline
UIUCnet Architect
CITES Network Engineering

Last modified April 12, 2006