November 03, 2016 / by Jacob Uecker / In architecture

Routing Over Trunks: A Case of Shared Fate

As someone who often reviews a large number of network configurations, I’m always befuddled when I see layer 2 trunks between routers. All too often, this leads a network down a dangerous path of VLAN extensions and, ultimately, network meltdown.

Let’s visit the scenario first. Imagine a set of three routers: Linus, Charlie, and Lucy. They’re physically connected in a full mesh, and each has a LAN or two hanging off it. Something like the diagram above.

Now, here’s what typically happens. Someone demands, pleads, or otherwise convinces the network folks that a device connected behind Lucy must be able to communicate with devices behind Linus at layer 2. Do you have a klaxon sounding in your head? You should. Red lights galore.

Let’s dive down the rabbit hole a bit and see where it leads. An unwitting engineer says, “We can do this by setting up a VLAN trunk between Linus and Lucy and then running our routing over one VLAN on that trunk. Other VLANs will be allowed over the trunk for L2 extensions.” And of course, why not go whole hog and set it up so that devices behind Charlie are also in the loop? (Pun intended.) A quick change request and maintenance window later, we’re in business.

Let’s review. Do we have network reachability between routers using a routing protocol, just as we did before? Check. Do we have layer 2 access over a VLAN? Check. Have we primed our network for pain and suffering? Double check.
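The pun about the loop is only half a joke. Here is a minimal Python sketch of the design just described, assuming each router bridges the extended VLAN between its trunks and ignoring whatever spanning tree might do about it. The VLAN IDs (10 for routing, 20 for the extended LAN) and the function name are made up for illustration; they are not from any real configuration.

```python
# Hypothetical model: each inter-router trunk lists the VLAN IDs it carries.
# VLAN 10 = routed transit VLAN, VLAN 20 = extended LAN (both IDs are made up).
TRUNKS = {
    ("Linus", "Lucy"): {10, 20},
    ("Lucy", "Charlie"): {10, 20},
    ("Charlie", "Linus"): {10, 20},
}


def vlan_has_loop(trunks, vlan):
    """Return True if the links carrying `vlan` form a layer 2 cycle."""
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for (a, b), vlans in trunks.items():
        if vlan not in vlans:
            continue
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return True  # both ends already connected: this link closes a cycle
        parent[root_a] = root_b
    return False


print(vlan_has_loop(TRUNKS, 20))  # True: VLAN 20 rides all three sides of the triangle
```

Once Charlie joins in, the extended VLAN runs over every side of the full mesh, so the sketch reports a physical loop at layer 2 on exactly the links the routing protocol depends on.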

In complex systems like networks, it’s often very helpful to think about how things will fail. We do this because networks do fail, and if we understand how and under what conditions, we can recover from those failures more quickly. In fact, I’d say that failure analysis is critical to good network design.

One aspect of failure analysis is the concept of shared fate. This is when a failure in one domain of the network leads to a failure in another. One more definition before we apply this to our situation: a failure domain is the part of the network that is impacted when a device misbehaves. Russ White puts it this way: “A failure domain is any group of devices that will share state when the network topology changes.” A VLAN is an excellent example of a failure domain. It’s not uncommon for an entire VLAN to be impacted by a single misbehaving endpoint.

So now, let’s apply this to our situation. Suppose there is an issue, a broadcast storm for example, on the VLAN that we’ve trunked everywhere. This could easily impact every link the VLAN is trunked on, namely our inter-router links. When that happens, hello packets can be lost and our routing protocol adjacencies can drop, even though nothing is wrong with the routing VLAN itself. Now our routing protocol flaps, and network reachability in general is impacted.
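The shared-fate argument can be put into a few lines of Python. This is only a sketch under simplifying assumptions: a problem on a VLAN is treated as taking down every link that carries it, and the VLAN IDs (10 for routing, 20 for the extended LAN), the data layout, and the function names are all hypothetical.

```python
# Hypothetical designs: each inter-router link maps to the VLAN IDs it carries.
SHARED_TRUNKS = {
    ("Linus", "Lucy"): {10, 20},
    ("Lucy", "Charlie"): {10, 20},
    ("Charlie", "Linus"): {10, 20},
}
ROUTED_ONLY = {
    ("Linus", "Lucy"): {10},
    ("Lucy", "Charlie"): {10},
    ("Charlie", "Linus"): {10},
}


def links_at_risk(links, failed_vlan):
    """Links that carry the misbehaving VLAN and therefore share its fate."""
    return {pair for pair, vlans in links.items() if failed_vlan in vlans}


def surviving_adjacencies(links, routing_vlan, failed_vlan):
    """Routing adjacencies on links the misbehaving VLAN does not touch."""
    at_risk = links_at_risk(links, failed_vlan)
    return {
        pair
        for pair, vlans in links.items()
        if routing_vlan in vlans and pair not in at_risk
    }


# A problem on VLAN 20 puts every adjacency at risk in the trunked design...
print(len(surviving_adjacencies(SHARED_TRUNKS, 10, 20)))  # 0
# ...and none of them when the inter-router links carry only routed traffic.
print(len(surviving_adjacencies(ROUTED_ONLY, 10, 20)))  # 3
```

The model is deliberately pessimistic, since a real storm may only degrade a link, but it makes the coupling visible: the routing protocol's failure domain has quietly become the same as the extended VLAN's.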

So what can be done? The right answer is almost always: it depends. There are a number of options, and the best one depends on your environment. The first is to aggressively question whether the device really needs to be in the same broadcast domain. Very often this is a demand from a vendor who doesn’t understand his or her own equipment. Push hard. Next, overlay technologies can be a great option (something that we’ll cover in a later post; #whetyournetworkappetite). Or maybe running another cable is the way to go. This is where good design comes into play.