Networking Tales

Understanding networking concepts

IEEE 802.1D Spanning Tree Topology Changes

The Spanning Tree Protocol (STP) was designed to prevent Layer-2 loops in a LAN with redundant links. STP avoids potential loops by forcing some switch ports into a blocking state in which they are not allowed to forward data frames. But the protocol must also be able to react to topology changes and converge to a new logical topology in which some of the blocked ports are moved to the forwarding state.

In the original IEEE 802.1D specification of STP, there is no mechanism to ensure that all the switches in the STP domain are aware of a change in the topology and have reacted accordingly. As a consequence, a switch running STP cannot move a port directly to the forwarding state, because doing so may cause a temporary loop. Stale MAC address table entries can also be a source of temporary loops and outages during a topology change. STP relies on timers, and on the reduction of the MAC table aging time, to handle topology changes.

In this blog post, we will examine the events and BPDUs generated during a topology change with the help of Wireshark and a sample network in GNS3:

S1 has been configured as the root bridge using the command spanning-tree vlan 1 root primary:

S1 configured as root

S2 has been configured as the secondary root bridge using the command spanning-tree vlan 1 root secondary:

S2 configured as secondary root

S2 will become the designated switch on the network segment between S2 and S3. Therefore, S3 will place its Gi0/1 port into a blocking state to prevent a forwarding loop.
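The root bridge setup described above can be sketched as follows (only the two spanning-tree macros are taken from the post; the prompts are illustrative):

```
! On S1 - become the root bridge for VLAN 1 (lowers the bridge priority)
S1(config)# spanning-tree vlan 1 root primary

! On S2 - become the backup root bridge for VLAN 1
S2(config)# spanning-tree vlan 1 root secondary
```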

In a stable scenario, S1 will generate configuration BPDUs every 2 seconds (hello timer). These BPDUs will be received by S2 and S3 on their root ports. In the following Wireshark capture, we can see that S1 is sending configuration BPDUs to S2 every 2 seconds. Notice that S2 does not send any BPDU to S1 because a switch does not send BPDUs through its root port in a stable topology:

From the traffic capture on the link between S2 and S3, we can see that S2 is relaying the BPDUs it receives from S1 to S3. When the BPDUs are forwarded to S3, S2 changes the root path cost from 0 to 4 and uses its Bridge ID (BID) as the sender BID (the Bridge Identifier field):

When S3 receives these BPDUs, it realizes that there is a path to the root through S2, but it keeps its Gi0/0 port as the root port because it offers the best path to the root. S3 does not send BPDUs to S2 because its Gi0/1 port is in a blocking state.

In this network topology, all traffic between PC2 and PC3 must go through S1. Let’s generate some traffic between PC3 and PC2 to verify how their MAC addresses are learned by the switches:

S3 has learned the MAC address of PC2 through its root port Gi0/0:

PC2 MAC address
S3 MAC table

S2 has learned the MAC address of PC3 through its root port Gi0/0:

PC3 MAC address
S2 MAC table

Now let’s shut down the interface Gi0/0 on S2 to simulate a link failure between S1 and S2. Before shutting down the port, let’s activate the debugging of STP events on S3:
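A possible sequence of commands for this step (a sketch; the exact interface name and prompts are taken from the scenario described above):

```
! On S3 - log STP state transitions
S3# debug spanning-tree events

! On S2 - simulate the link failure towards S1
S2(config)# interface GigabitEthernet0/0
S2(config-if)# shutdown
```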

Immediately after the port Gi0/0 on S2 is shut down, S2 starts announcing itself as the root. Since S3 is blocking its Gi0/1 port, S2 was receiving BPDUs only on its root port, which is now administratively down. S2 concludes that all paths to the root have been lost and starts an election process, which results in S2 declaring itself the root bridge.

Let’s examine the events captured on S3:

For approximately 20 seconds (the MaxAge timer), S3 ignores the inferior BPDUs it receives from S2. Only after the MaxAge timer expires does S3 react to those inferior BPDUs and unblock its Gi0/1 port. As soon as the port Gi0/1 enters the Listening state, S3 starts sending BPDUs to S2. S2 realizes that these BPDUs are superior and stops claiming to be the root. S2 selects its Gi0/1 port as the root port and generates a Topology Change Notification (TCN) BPDU, which is received by S3 on its Gi0/1 port. Then, S3 forwards the TCN towards the root, S1. From the debug, we can see that S3 also generates a TCN when its Gi0/1 port transitions to the Forwarding state.

Let’s examine the Wireshark captures on the link between S2 and S3. After the link failure, S2 starts announcing itself as the root:

When the MaxAge timer expires, S3 starts relaying the BPDUs generated by S1 to S2:

When S2 receives the superior BPDU, it elects its Gi0/1 port as the root port and generates a TCN:

S3 acknowledges the TCN received from S2:

Examining the Wireshark capture on the link between S1 and S3, we can see the TCN relayed from S3 to S1:

As soon as S1 receives the TCN, it acknowledges the TCN and sets the Topology Change (TC) bit on the subsequent BPDUs:

S3 sends a second TCN to S1 when its Gi0/1 port transitions to Forwarding state:

S1 sets the TC bit on the configuration BPDUs for MaxAge + Forward Delay seconds (35 seconds by default):

When S2 and S3 receive a configuration BPDU with the TC bit set, they reduce their MAC address table aging time from the default value of 300 seconds to Forward Delay seconds (15 seconds by default). This way, during a topology change, switches can age out old dynamic entries that were learned from the network topology prior to the link failure. This also prevents network outages due to "black holes". For example, imagine that PC3 is sending traffic to PC2. In the initial topology, the traffic follows the path PC3->S3->S1->S2->PC2. If the link between S1 and S2 fails, the data traffic will be interrupted for about 50 seconds until STP converges to the new topology. But without the MAC aging reduction mechanism, S3 would keep sending the frames from PC3 towards S1 for up to 300 seconds, until the dynamic entry for PC2’s MAC address expired.
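The aging time in effect can be checked on the switches; something like the following could be used (a sketch; the output format varies by IOS version):

```
! Shows the dynamic MAC aging time (300 seconds by default,
! temporarily reduced to 15 seconds during a topology change)
S3# show mac address-table aging-time
```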

To conclude, let’s explain why S3 waits for MaxAge seconds before reacting to the inferior BPDUs received from S2. The reception of an inferior BPDU from S2 may indicate that S2 has lost its connection to the root or that the root is down. The key point is to understand that, even though S3 is still receiving BPDUs from S1 every 2 seconds, it is not safe to assume that S1 is still alive. In this small topology, we could probably conclude that S1 is alive because it is directly connected to S3, but in a larger topology with several hops between a switch and the root, a switch may keep receiving BPDUs for a while after the root goes down due to propagation delays. This is why STP maintains the old root information for a port for MaxAge seconds before accepting new information.

BGP Path Selection (Part II)

In this second part of my blog series about the BGP Path Selection Algorithm, I will discuss the following attributes:

  • Origin Type
  • Lowest MED
  • eBGP over iBGP
  • Lower RID

The first attribute in the list is the Origin Type or Origin Code, which is a well-known mandatory BGP attribute. There are three possible values for this attribute:

  1. IGP (i)
  2. EGP (e)
  3. Incomplete (?)

When a network is advertised into BGP using a network statement, the origin IGP (i) is used by default. EGP (e) refers to the old Exterior Gateway Protocol that predates BGP, so the e origin code will rarely appear in a BGP table nowadays. Finally, networks that are advertised into BGP using the redistribute command are marked with the Incomplete (?) origin code. IGP (i) is preferred over EGP (e), which is preferred over Incomplete (?).

Let’s use the following topology to see this attribute in action:

R2 and R3 are both connected to the LAN segment (192.168.1.0/24 network). An eBGP session is configured between R1 and R2 and between R1 and R3. R2 uses a network statement to advertise the 192.168.1.0/24 network to R1, while R3 uses the redistribute connected command instead.

Here is the BGP configuration on every router:

R1 BGP configuration
R2 BGP configuration
R3 BGP configuration
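The three configurations above might look like the following sketch (the AS numbers and neighbor addresses are assumptions, since the post shows them only in screenshots):

```
! R2 - network statement => origin IGP (i)
router bgp 200
 neighbor 10.0.12.1 remote-as 100
 network 192.168.1.0 mask 255.255.255.0

! R3 - redistribution => origin Incomplete (?)
router bgp 300
 neighbor 10.0.13.1 remote-as 100
 redistribute connected
```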

If we examine the BGP table on R1, we can see that R1 has learned two routes to the 192.168.1.0/24 network:

The preferred path to reach the 192.168.1.0/24 network is through R2, which has the i origin code. The route through R3 has the ? origin since the 192.168.1.0/24 network was advertised into BGP in R3 using redistribution.

The next attribute in the list is the Multi-Exit Discriminator (MED). The MED attribute, also called the metric, is an optional non-transitive BGP attribute whose purpose is to influence inbound traffic coming from another AS. When you advertise a network into BGP, the MED is set by default to the IGP metric of the path to the network being advertised. Since the MED is a metric, a lower MED is preferred over a higher MED.

Let’s use the following topology to explain the MED attribute:

Above we have two Autonomous Systems (AS): AS100 (R1 and R2) and AS300 (R3). R1 and R2 both advertise the 192.168.1.0/24 network into BGP. Therefore, R3 has two possible paths to reach the 192.168.1.0/24 network.

Let’s begin using the following initial configuration:

R1 BGP configuration
R2 BGP configuration
R3 BGP configuration

If we examine R3’s BGP table, we can see that the BGP path selection algorithm has chosen the path through R1:

Notice from the previous output that both routes learned by R3 have a metric of 0. Now let’s set the MED in R1 to 500 by using a route-map:
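A route-map along these lines could be used on R1 (the route-map name and the neighbor address are hypothetical):

```
! Hypothetical route-map name and neighbor address
route-map SET-MED permit 10
 set metric 500
!
router bgp 100
 neighbor 10.0.13.3 route-map SET-MED out
!
! An outbound soft reset (clear ip bgp * soft out) may be
! needed for the new policy to take effect
```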

If we examine R3’s BGP table again, we can see that the BGP path selection algorithm has now chosen the path through R2, because it is the path with the lower MED:

Let’s discuss the next decision factor used in BGP: eBGP over iBGP. The BGP Path Selection Algorithm always prefers eBGP routes over iBGP routes. Let’s change the previous topology to see this decision step in action:

R1 and R2 are now using an iBGP session between them to share the routes learned from R3. Here is the new BGP configuration on every router:

R1 BGP configuration
R2 BGP configuration
R3 BGP configuration
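The updated configurations on R1 and R2 might be sketched as follows (peer addresses are assumptions):

```
! R1 - eBGP to R3, iBGP to R2 within AS 100
router bgp 100
 neighbor 10.0.13.3 remote-as 300   ! eBGP
 neighbor 10.0.12.2 remote-as 100   ! iBGP

! R2 - eBGP to R3, iBGP to R1 within AS 100
router bgp 100
 neighbor 10.0.23.3 remote-as 300   ! eBGP
 neighbor 10.0.12.1 remote-as 100   ! iBGP
```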

If we examine the BGP table on R1, we can see that R1 has learned two routes to reach the 192.168.1.0/24 network: one route through R3 (eBGP) and one route through R2 (iBGP). The preferred route is the eBGP route through R3:

If we examine the BGP table on R2, we can see that R2 has also chosen the eBGP route through R3:

If none of the previous attributes can be used as a decision factor to select the best BGP path, the BGP Selection Algorithm will choose the path learned from the peer with the lowest Router ID (RID). This is the last attribute in the list shown at the beginning of the post.

Let’s go back to the topology used to describe the MED attribute:

Before configuring the MED in R1, why did R3 choose the route through R1 to reach the 192.168.1.0/24 network? Let’s examine R3’s BGP table again:

If we examine the attributes discussed in Part I of this blog series about BGP Path Selection (weight, local preference, locally originated routes, shortest AS path), we can verify that none of them can be used as a decision factor to select the best path. Both routes have a weight of 0, we have not configured local preference values to favor one route over the other, neither route is locally originated, and both routes have the same AS path length.

Now, let’s examine the new attributes discussed in this post. The origin type is the same for the two routes (i), since both R1 and R2 advertise the 192.168.1.0/24 network using a network statement. Both routes have a MED of 0, since both R1 and R2 are advertising a directly connected network. Finally, both routes are eBGP routes, since R3 has learned both from peers that belong to a different AS. As a result, BGP has chosen the route advertised by the peer with the lowest RID, which is R1 (1.1.1.1).

BGP Path Selection (Part I)

The BGP best-path selection algorithm refers to the logic used by BGP to select the best path when multiple routes to the same network are available. BGP uses a set of Path Attributes (PA) to identify the preferred route. In the first part of this blog post, I will discuss the following attributes for the best-path selection:

  • Weight
  • Local preference
  • Locally originated
  • Shortest AS_Path

Let’s begin by examining the following network topology:

Each router represents a different Autonomous System (AS) and R5 has a loopback interface which will be advertised to the other routers through BGP. Here is the external BGP (eBGP) configuration on every router:

R1 BGP configuration
R2 BGP configuration
R3 BGP configuration
R4 BGP configuration
R5 BGP configuration

Note that the 5.5.5.5/32 network corresponding to the R5 loopback (Lo5) interface has been advertised into BGP through the network command.
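For reference, R5’s side of the configuration might look like this sketch (the loopback name and the neighbor addresses are assumptions; the AS numbers follow the AS paths discussed below):

```
interface Loopback5
 ip address 5.5.5.5 255.255.255.255

router bgp 500
 neighbor 10.0.25.2 remote-as 200   ! eBGP session to R2
 neighbor 10.0.45.4 remote-as 400   ! eBGP session to R4
 network 5.5.5.5 mask 255.255.255.255
```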

R1 can follow two possible routes to reach the 5.5.5.5/32 network: one using R2 as the next hop and another using R3 as the next hop. If we examine the output of the show ip bgp command on R1, we can see that R1 has learned both routes to the 5.5.5.5/32 network and that the preferred path selected by BGP is through R2:

The BGP path-selection algorithm has chosen the route with the lowest AS hop count. The path learned through R2 has an AS path length of 2 (AS200 and AS500), while the path learned through R3 has an AS path length of 3 (AS300, AS400 and AS500). The AS hop count is represented by the AS_Path attribute. The output of the show ip route bgp command on R1 confirms that the route through R2, which corresponds to the route with the shortest AS path, has been installed in the routing table:

Now imagine that, for some reason, we want to force R1 to select the longer route through R3. One option is to configure the Weight attribute on the routes learned through R3. The BGP weight is the first attribute examined when selecting the best path. The default value for this attribute is 32768 for locally originated routes and 0 for routes learned through a BGP peer. Let’s change the weight associated with the BGP adjacency with R3:
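One way to do this on R1 is the per-neighbor weight command (the neighbor address and the weight value are assumptions):

```
router bgp 100
 ! Routes learned from R3 get a weight of 100 instead of the default 0
 neighbor 10.0.13.3 weight 100
```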

We need to perform a soft reset of the BGP neighbor relationships (using the clear ip bgp * in command) in order to see the effect of the weight attribute:

Now the preferred route is through R3, even though it is a longer route in terms of AS path length. It can be confirmed by examining R1’s routing table:

The weight attribute is a Cisco-defined attribute and only works on Cisco devices. If we are working in a multi-vendor environment, we can use the local preference attribute for the same purpose. Local preference is the second attribute examined by the BGP path-selection algorithm, when the weight value cannot be used as a tiebreaker. The weight attribute is locally significant and is not advertised to other peers. In contrast, the local preference attribute is a well-known discretionary attribute and is advertised throughout an AS (but not between eBGP peers). A higher local preference value is preferred over a lower one, and it is configured within an AS to influence the selection of the next hop for outbound traffic.

Let’s configure a local preference value of 500 on R1 using a route-map as follows:
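A sketch of such a configuration on R1 (the route-map name and the neighbor address are hypothetical); since local preference is applied to the routes received from R3, the route-map is applied inbound:

```
! Hypothetical route-map name and neighbor address
route-map SET-LOCALPREF permit 10
 set local-preference 500
!
router bgp 100
 neighbor 10.0.13.3 route-map SET-LOCALPREF in
```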

After the configuration of the local preference attribute, we can verify that now R1 prefers the route through R3 to reach the network 5.5.5.5/32:

Now imagine that a router has learned two routes to the same network and both have the same local preference. Which attribute does the BGP path selection algorithm examine next? BGP will always prefer a route that has been locally originated over a route learned from a BGP peer. A route can be originated locally through the network command, by aggregation, or by redistribution. To illustrate this point, let’s advertise into BGP the 10.0.12.0/30 network between R1 and R2:
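On R1, this could be as simple as adding another network statement (a sketch):

```
router bgp 100
 network 10.0.12.0 mask 255.255.255.252
```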

If we examine the BGP table on R1, we can see that R1 has learned two routes to the 10.0.12.0/30 network: one route using R2 as the next hop, and the preferred locally originated route, represented by a next hop of 0.0.0.0. Notice the difference in the AS path between the two routes. The route through R2 has an AS path of (200, i), while the local route has an AS path of (i). The i origin code represents a route locally originated using a network command; the origin codes will be discussed in my next post. Also notice that Cisco by default assigns a weight of 32768 to local routes, which ensures that local routes are always preferred:

In summary, if a router learns two or more routes to the same destination, BGP will select the route with the highest weight attribute. If all routes have the same weight, BGP will select the route with the highest local preference. At the next decision point, a locally originated route is preferred over a route received from a BGP peer. Finally, if there is a tie in all the BGP attributes discussed so far, the route with the shortest AS path is selected as the preferred route.

The Accumulated IGP Metric (AIGP) is another attribute that can be configured to influence the selection of the best route, but it is beyond the scope of this post. In Part II, we will discuss other important attributes considered by the BGP path-selection algorithm (origin type, lowest MED, eBGP over iBGP and lowest RID).

OSPF and distribute lists

In this blog post, I will talk about the effect of the distribute-list command on the operation of the OSPF routing protocol. This command can be applied in the inbound or in the outbound direction. When applied in the inbound direction, it filters which routes are installed in the routing table; when applied in the outbound direction, it filters which routes are redistributed into OSPF from other routing protocols or other routing sources, such as static or connected routes.

We will be using the following single-area OSPF topology consisting of two routers:

The LAN segments LAN1 and LAN2 are assumed to be stub networks, so the router interfaces attached to these LANs are configured as passive interfaces:

R1 OSPF configuration
R2 OSPF configuration
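R1’s OSPF configuration might be sketched as follows (the process ID, interface names and the inter-router network are assumptions; only the LAN1 prefix 10.0.1.0/24 comes from the post):

```
router ospf 1
 network 10.0.1.0 0.0.0.255 area 0     ! LAN1 stub network
 network 10.0.12.0 0.0.0.255 area 0    ! link towards R2 (assumed)
 passive-interface GigabitEthernet0/1  ! interface attached to LAN1
```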

If we examine the routing table of R1, we can see that the 10.0.2.0/24 network (LAN2) has been learned via OSPF:

R1’s OSPF routes

Likewise, R2 has learned the network 10.0.1.0/24 (LAN1) from R1 via OSPF:

R2’s OSPF routes

Now imagine that we want to prevent the network 10.0.1.0/24 from being installed in R2’s routing table. If R2 removed this network from its Link State Database (LSDB), or if R1 stopped announcing the network in its router-LSA, the correct operation of the OSPF protocol would be altered (remember that one of the principles of a link-state routing protocol is that all routers must share the same view of the network topology). But R2 can learn the network 10.0.1.0/24 from R1’s router-LSA and decide afterwards whether to install this network in its routing table. This is the purpose of the distribute-list command when applied in the inbound direction. Let’s see the effect on R2:

Inbound distribute-list configuration on R2
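A sketch of what this configuration might look like (the access-list number and OSPF process ID are assumptions):

```
access-list 1 deny 10.0.1.0 0.0.0.255
access-list 1 permit any
!
router ospf 1
 distribute-list 1 in
```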

R2 has been configured with a standard access list that denies the network 10.0.1.0/24, and the access list has been referenced in the inbound distribute-list command. Let’s see the content of R2’s routing table after applying the distribute list:

R2’s routing table after applying the distribute list

We can see that network 10.0.1.0/24 is missing because it has been blocked by the distribute list. Let’s now examine the content of the router-LSA generated by R1 in R2’s LSDB:

R1’s router-LSA

As we can see, network 10.0.1.0/24 is still present in the LSDB. The distribute list applied has only prevented this network from entering R2’s routing table.

Let’s now examine another scenario. Imagine that R2 is redistributing into OSPF routes from other routing protocols (like EIGRP or RIP) or from other routing sources (like connected or static routes). For example, let’s create a loopback interface on R2 and assign it the IP address 172.16.0.2/32. This will automatically create an entry in R2’s routing table with a connected route to the host address 172.16.0.2/32:

Loopback interface configuration on R2
Connected route to loopback 0

Let’s now redistribute this new connected route into OSPF:

Redistribution of connected subnets into OSPF
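A sketch of the two steps on R2 (the OSPF process ID is an assumption):

```
interface Loopback0
 ip address 172.16.0.2 255.255.255.255
!
router ospf 1
 redistribute connected subnets
```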

We can verify that this new connected route has been redistributed into OSPF through an external type-5 LSA:

Content of the R2’s LSDB after connected subnet redistribution

We can also verify that R1 has learned this external OSPF route:

R1’s OSPF routes after route redistribution in R2

Let’s now create another loopback interface on R2 and see the effect on R1’s routing table:

Configuration of a second loopback interface on R2
R1 OSPF routes after the creation of the second loopback interface on R2

As we can see, R1 has learned a second external route to the new loopback interface IP address because R2 is redistributing all connected subnets into OSPF. In fact, if we examine the content of R2’s LSDB, we can verify that R2 is now generating two external type-5 LSAs:

R2’s LSDB after creating a second loopback interface

If we want to prevent this new route from being installed in R1’s routing table, we can use the distribute-list command in the outbound direction on R2. Let’s see how it works:

Outbound distribute list configuration on R2
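A sketch of what the outbound distribute-list configuration might look like (the access-list number and OSPF process ID are assumptions; the implicit deny at the end of the standard access list blocks every other address):

```
access-list 2 permit host 172.16.0.2
!
router ospf 1
 distribute-list 2 out connected
```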

In order to configure this new distribute list, a second standard access list has been defined. This access list permits only the IP address of the first loopback interface (Loopback0) and denies any other IP address. The access list is referenced in the outbound distribute-list command. Let’s now examine the content of R1’s routing table:

R1 OSPF routes after applying the outbound distribute list

The external route to network 10.2.2.2/32 is gone, since it has been filtered by the outbound distribute list on R2. Let’s now examine the new content of R2’s LSDB:

R2’s LSDB after applying the outbound distribute-list

We can verify that R2 has stopped generating the external LSA associated with the new loopback interface. The outbound distribute list filters which routes are injected into the OSPF domain by the Autonomous System Border Router (ASBR), which is R2 in this case. Unlike the inbound operation of distribute lists, when applied outbound, a distribute list modifies which LSAs are flooded into the OSPF area.

OSPFv3: building the topology

In the previous blog post, we reviewed the process of building a simple single-area OSPFv2 topology using the output of the “show ip ospf database” command. In this post, we will repeat the same process using OSPFv3 and IPv6.

We will be using the same network topology, but the router interfaces have been configured with IPv6 addresses:
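Before reading the LSDB, note that OSPFv3 is enabled per interface rather than with network statements. A configuration sketch for R1 (the interface name and address are assumptions, apart from the stub prefix 2001:1::/64 that appears later in the LSAs; an explicit router-id is required when no IPv4 address is configured):

```
ipv6 unicast-routing
!
interface GigabitEthernet0/0
 ipv6 address 2001:1::1/64
 ipv6 ospf 1 area 0
!
ipv6 router ospf 1
 router-id 1.1.1.1
```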

Let’s start the topology reconstruction process by examining the Router-LSA generated by R1:

The previous output shows some differences between the format of the Router-LSA in OSPFv3 and its counterpart in OSPFv2:

  • The option field is longer (24 bits)
  • The Link State ID is no longer equal to the RID of the advertising router
  • All addressing information has been removed from the router-LSA
  • Link state information about interfaces connected to stub networks has been removed from the router-LSA

In OSPFv3, a router may generate one or more Router-LSAs. Remember that the tuple (LS type, Link State ID, advertising router RID) uniquely identifies an LSA in the Link State Database (LSDB). If a router generates more than one Router-LSA, they can be distinguished by giving each a different Link State ID.

In OSPFv2, both Router and Network-LSAs store some addressing information. Specifically, in an OSPFv2 Router-LSA we may find the following addressing information:

  • For point-to-point links: local interface IPv4 address
  • For links connected to transit networks: DR interface IPv4 address and local interface IPv4 address
  • For links connected to stub networks: local interface IPv4 address and network mask

In OSPFv2, the network mask for transit networks is stored in the Network-LSA generated by the Designated Router (DR).

By contrast, OSPFv3 removes all addressing information originally stored in Router and Network-LSAs and moves it to other LSA types, as discussed later in this post. OSPFv3 Router and Network-LSAs contain only the topological information necessary to build the Shortest Path First (SPF) tree. This means that Router and Network-LSAs are flooded only when a topological change occurs, and prefix changes do not trigger any SPF recalculation, which is an improvement over OSPFv2.

The Router-LSA generated by R1 describes two point-to-point links:

The fields used to describe the attached interfaces are:

  • Metric
  • Local Interface ID
  • Neighbor Interface ID
  • Neighbor Router ID

Metric represents the outbound cost assigned to the interface. Local Interface ID is a number that uniquely identifies the local router interface. Neighbor Interface ID corresponds to the interface ID of the neighboring router, and Neighbor Router ID identifies the node attached at the other end of the link.

The two links described in the R1 Router-LSA can be translated into the following graph:

The neighbor RID can be used by the local router as a key to search in the LSDB for the next Router-LSA and place another piece of the puzzle. Let’s examine both R2 and R3 Router-LSAs:

R3 router-LSA
R2 router-LSA

These new Router-LSAs contain link descriptions that point back to R1; they correspond to the serial links that connect R2 and R3 to R1. The LSAs also contain another interface type: “Link connected to a Transit Network”. The transit network corresponds to the Ethernet segment that connects routers R2 and R3 through the switch SW1. Like OSPFv2, OSPFv3 represents broadcast and non-broadcast multi-access (NBMA) networks by adding a pseudonode to the SPF topology. This pseudonode is described by a Network-LSA created by the DR, which is R3 in this topology. The Link State ID of a Network-LSA corresponds to the interface ID of the DR. This field, in combination with the RID of the DR, can be used as a key to search the LSDB for the Network-LSA associated with the Ethernet segment:

If we compare the previous output with the structure of a Network-LSA in OSPFv2, we may notice two main differences. First, the Link State ID field in OSPFv2 is equal to the IP address of the DR; in OSPFv3, this value is replaced by the interface ID of the DR, as mentioned before. In addition, in OSPFv3 the network mask field is missing. Remember that Router and Network-LSAs do not store any addressing information.

We can find information about prefixes and IPv6 addresses in the last two LSAs that need to be analyzed in order to complete the starting network topology: Link and Intra-Area-Prefix-LSAs. But before moving on to the next LSAs, let’s add another piece to the puzzle using the information stored on the Network-LSA generated by R3:

Basically, the Network-LSA describes the list of routers connected to the same network segment (R2 and R3). The relationship between R2 and R3 is described by the Network-LSA pseudonode, which is represented by the cloud in the previous graph. The cost to reach this pseudonode can be obtained from the Router-LSAs.

At this point, you may notice that Router and Network-LSAs provide all the topological information needed to build the SPF tree from any router to any other router. However, in addition to the topological information, routers need information about prefixes and neighbor link-local addresses in order to build their routing tables. In OSPFv2, the next-hop IP address is determined from the Router-LSAs received from the neighboring routers. But Router-LSAs have an area flooding scope; therefore, if an interface IP address changes, a new Router-LSA is flooded throughout the entire area, even though such a change is only relevant to the neighboring routers connected to the link where the interface is attached. Another improvement of OSPFv3 over OSPFv2 is the addition of a link-local flooding scope. In OSPFv3, neighbor-specific information is placed in another LSA type: the Link-LSA. A router originates a separate Link-LSA for each attached link connected to one or more neighbors. A Link-LSA provides the following information:

  • The router’s link-local address
  • The list of IPv6 prefixes associated with the interface
  • A collection of options bits (beyond the scope of this discussion)

Let’s examine the Link-LSA generated by R1 for its interface connected to the stub network where PC1 is attached:

From the previous output we can derive the following information:

  • Link State ID: 4 (interface ID of R1 interface)
  • Router Priority: 1
  • Link Local Address: FE80::1
  • Prefix associated with R1 interface: 2001:1::/64

The link-local scope of Link-LSAs can be confirmed by looking at the Link-LSAs generated by R2 from the point of view of R1:

The previous output shows that R1 receives only one Link-LSA from R2, the one associated with the point-to-point link between R1 and R2.

Finally, let’s examine the last LSA we will discuss in this post: the Intra-Area-Prefix-LSA. An Intra-Area-Prefix-LSA provides a list of IPv6 prefixes that are associated with either a Router-LSA or a Network-LSA. A router may generate multiple Intra-Area-Prefix-LSAs in order to keep the size of each LSA small. The different LSAs are distinguished by their Link State ID.

Let’s examine the Intra-Area-Prefix LSA generated by R1:

The previous LSA is linked to R1’s Router-LSA by three fields: Referenced LSA Type, Referenced Link State ID and Referenced Advertising Router. A total of three IPv6 prefixes are described, corresponding to the two point-to-point links and the stub network.

If we examine the Intra-Area-Prefix LSAs generated by R3, we get the following output:

R3 generates two Intra-Area-Prefix LSAs: one for the point-to-point link to R1 and another for the transit network connecting R2 and R3. The Intra-Area-Prefix LSA associated with the transit network is linked to the Network-LSA generated by R3.

As a final note, let’s examine the output of the Intra-Area-Prefix LSA generated by R2:

The previous LSA describes only the IPv6 prefix associated with the point-to-point link connecting R1 to R2. R2’s Intra-Area-Prefix-LSA omits the IPv6 prefix associated with the transit network because that prefix is already described in the Intra-Area-Prefix-LSA generated by the DR.

Combining all the information from the different Link and Intra-Area-Prefix LSAs we can build the final topology:

OSPFv2: building the topology

This blog post describes the process of building a simple network topology using the information derived from the “show ip ospf database” command. This is an interesting exercise to understand how to read the information contained in the different LSA types and how they are linked together like a puzzle.

We will begin the process using a sample topology that consists of three routers configured using single area OSPFv2. Therefore, we will review only Type-1 and Type-2 LSAs. In a later blog post, I will repeat the same exercise but using IPv6 and OSPFv3.

The goal of this exercise is to reconstruct the same topology using only the information from the Link State Database (LSDB). Let’s start by examining the output of the “show ip ospf database router self-originate” command in router R1:

The previous command shows the detailed information about the Router LSA (Type-1 LSA) originated from R1. From the previous output, I want to highlight the following information:

  • Link State ID: 1.1.1.1
  • Advertising Router: 1.1.1.1
  • Number of links: 5

The Link State ID identifies a node in the OSPF graph. In a Type-1 LSA, the Link State ID value is the Router ID (RID) of the advertising router. The Link State ID is the key used to search in the LSDB in order to link the different LSAs together. The final OSPF graph is used by the routers to calculate the shortest path using the Shortest Path First (SPF) algorithm (Dijkstra algorithm).

The output shows a total of 5 links in R1 instead of 3. This is due to the way point-to-point links are described in OSPFv2. When addressing information is assigned to the interfaces of a point-to-point link, it is modeled as if a stub network were attached to the router. This extra node is used in SPF calculations only when the traffic goes to the network assigned to the point-to-point link (for example, when the destination is one of the serial interfaces). For simplicity, we will ignore those two extra stub networks associated with the serial links on R1 and consider only the point-to-point links and the real stub network where PC1 is attached.

The first connection described in the output of the previous command corresponds to the point-to-point connection between R1 and R3:

It can be read as node 1.1.1.1 is connected to node 3.3.3.3 through a link that has an associated IP of 10.0.13.1 and a cost of 64. The Link ID field points to the Link State ID of the Router LSA generated by R3. This relationship between R1 and R3 can be represented as follows:

The second connection corresponds to the additional stub network associated with the addressing information assigned to the previous serial link between R1 and R3, so we will ignore it when building the topology.

The third connection corresponds to the point-to-point connection between R1 and R2:

This information allows us to add another node to the previous graph:

The next connection is the other extra stub network associated with the serial link between R1 and R2, so we will skip it. The final connection corresponds to the real stub network associated with the network segment where PC1 is attached:

In this case, the Link ID corresponds to the subnet ID (192.168.10.0) and the Link Data field carries the subnet mask (255.255.255.0). The cost to reach this network is also specified. With that information, we can add another piece to the graph:

At this point, we have two choices to continue the reconstruction of the topology. The R1’s Router LSA is connected to the R2 and R3’s Router LSAs through the Link ID fields. For example, the serial link between R1 and R2 has a Link ID of 2.2.2.2. This value can be used by R1 as a key to look for the R2’s Router LSA in the LSDB:

The first connection described in the R2’s Router LSA is the serial connection to R1:

The Link ID 1.1.1.1 points to R1’s Router LSA, which allows us to place the two pieces of the puzzle together. Notice that the Link Data field corresponds to the IP address of R2’s serial interface. This IP can be used by R1 as the next-hop address when computing routes through R2. The interface cost value of 64 is used by R2 when computing routes that use its serial interface as the outgoing interface.

Before examining the third connection described in the R2’s Router LSA, let’s have a look at the output associated with R3’s Router LSA:

In this LSA, the first connection corresponds to the serial connection between R1 and R3. At this point, we can add the costs and IPs associated with the point-to-point links connecting R1 to R2 and R3:

In order to complete the final topology, let’s analyze the link to the transit network described in the Router LSAs of R2 and R3. The transit network corresponds to the Ethernet segment that connects routers R2 and R3 through the switch SW1. Ethernet networks are modeled in OSPF by adding a pseudonode to the topology. In multiaccess networks like Ethernet, more than two routers can be attached to the same network segment. To reduce the complexity and size of the LSDB, instead of creating a connection between every possible pair of routers, an extra node (the pseudonode) is created and used as a vertex in the graph to which all the attached routers connect.

The pseudonode is described by the Type-2 LSA created by the Designated Router (DR), which is R3 in this topology:

The Link State ID of the Type-2 LSA corresponds to the IP address of the DR (192.168.20.2). The network mask is also specified in the LSA, which can be used to calculate the network address from the IP of the DR. Finally, the Type-2 LSA stores a list of RIDs associated with all the routers connected to the transit network. These RIDs can be used as keys to search the LSDB for the corresponding Router LSAs, in order to find all the possible paths in the graph.
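Using the values discussed above, deriving the pseudonode’s subnet and its edges from a Type-2 LSA might look like this sketch (the dictionary layout is invented for illustration):

```python
import ipaddress

# Hypothetical representation of the Type-2 (Network) LSA from the DR (R3).
network_lsa = {
    "link_state_id": "192.168.20.2",   # IP address of the DR's interface
    "mask": "255.255.255.0",
    "attached_routers": ["2.2.2.2", "3.3.3.3"],
}

# The subnet the pseudonode represents is derived from the DR IP + mask.
subnet = ipaddress.ip_interface(
    f"{network_lsa['link_state_id']}/{network_lsa['mask']}"
).network
print(subnet)  # 192.168.20.0/24

# Add the pseudonode as an extra vertex: each attached router connects
# to it instead of connecting to every other router directly.
edges = [(rid, str(subnet)) for rid in network_lsa["attached_routers"]]
```

With two attached routers this yields two edges into the pseudonode, which is exactly the last piece needed to close the graph.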

We can conclude the process of building the topology by adding the information provided by the Type-2 LSA, resulting in the following diagram:

In the next post, I will repeat the same process but using IPv6 and OSPFv3 to discuss the differences.

The Magic Number

Mastering subnetting is a fundamental skill in network engineering. Finding the subnet ID given an IP address and a non-trivial subnet mask can be a challenge at first, but with practice it can become second nature, making it possible to solve subnetting problems in just a few seconds.

One of the main study resources I used for the CCNA was the Cisco Official Cert Guide (OCG). In this book, the author describes a process to find the subnet ID when using a difficult mask. A difficult mask is a subnet mask that has one octet that is neither 0 nor 255. Since, by definition, a subnet mask cannot interleave 0s and 1s, the possible values for that octet are 128, 192, 224, 240, 248, 252 and 254. The OCG book calls that octet the interesting octet.

The process described in the book for finding the resident subnet ID of an IP address can be summarized as follows. For every octet in the subnet mask:

  1. If the mask octet is 255, just copy the corresponding IP octet in the same position.
  2. If the mask octet is 0, put a 0 in the same position.
  3. If the mask octet is an interesting octet:
    • Calculate the magic number as 256 – V, where V is the decimal value of the interesting octet.
    • Calculate the subnet ID octet as the largest multiple of the magic number that does not exceed the corresponding IP octet.

As an example, let’s find the resident subnet ID of the IP address 10.0.50.76/29. In this case, the subnet mask in dotted decimal notation (DDN) is 255.255.255.248. Following the previous procedure, since the first three octets are 255, we just copy the first three octets of the IP address: 10.0.50.X. To find the last octet ‘X’, we follow step 3 and calculate the magic number as 256 – 248 = 8. The value ‘X’ is the largest multiple of 8 that does not exceed 76, which is 72. Therefore, the subnet ID is 10.0.50.72/29.
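The three rules above can be sketched as a small helper (a toy function written for this post, not something from the OCG):

```python
def subnet_id(ip, mask):
    """Find the resident subnet ID using the magic-number method."""
    out = []
    for ip_oct, mask_oct in zip(ip.split("."), mask.split(".")):
        ip_oct, mask_oct = int(ip_oct), int(mask_oct)
        if mask_oct == 255:
            out.append(ip_oct)                     # rule 1: copy the IP octet
        elif mask_oct == 0:
            out.append(0)                          # rule 2: host-only octet
        else:
            magic = 256 - mask_oct                 # rule 3: the magic number
            out.append((ip_oct // magic) * magic)  # largest multiple <= octet
    return ".".join(map(str, out))

print(subnet_id("10.0.50.76", "255.255.255.248"))  # 10.0.50.72
```

Integer division makes the “largest multiple that does not exceed the octet” step a one-liner.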

But, at this point, you may ask yourself: why does it work? In order to understand this process, let’s have a look at the following figure:

The figure shows four different subnetting schemes. The first one corresponds to a /25 mask, which divides the original network into two subnets. The second one corresponds to a /26, which divides the network into four subnets. The next one divides the network into eight subnets (/27) and the last one into sixteen subnets (/28). The figure highlights one important fact: in all subnetting schemes, the last subnet octet always corresponds to the interesting octet described in the OCG book. This is because the interesting octet, by definition, has all 0s in the host portion of the octet and all 1s in the subnet portion. Another important fact we can notice from the figure is that the size of the resulting subnets can be calculated by subtracting the interesting octet value from 256, which corresponds to the magic number concept used in the OCG book.

Once we know the size (magic number) of the resulting divisions of the original network range, finding the multiple of this number closest to the original IP octet value without going over gives us the desired resident subnet ID.

Layer 3 Switches

Trying to explain how a layer 3 switch works is, in my opinion, a very useful exercise when you are starting to learn basic switching and routing concepts. It is an opportunity to review both the layer 2 switching logic and the layer 3 routing logic.

Let’s review the operation of a L3 switch using an example based on the following simple topology:

If the switch receives a frame from PC1 destined to PC2, since both PCs reside on the same network segment (VLAN 10), the switch will behave as a L2 switch while forwarding the received frame. It will look up PC2’s MAC address in its MAC address table and forward the frame out the port connected to PC2 (assuming PC2’s address is found in the MAC table).

In this first forwarding scenario, if the destination MAC address belongs to another host in the same VLAN where the receiving port resides, the corresponding layer 2 forwarding process is triggered depending on the type of incoming frame (known unicast, unknown unicast, broadcast, or multicast frame).

Now let’s see what happens when PC1 sends a frame to a device in another subnet, for example PC4. In this case, a layer 3 device is needed for hosts in different VLANs to communicate. PC1 can’t send the frame directly to PC4 because the destination IP resides in another network. Therefore, PC1 will send the frame to its default gateway, which corresponds to the layer 3 switch in the topology. When the L3 switch receives the frame, it sees that the destination MAC address is its own MAC address in that VLAN, and the layer 3 forwarding process is triggered. Since the frame is destined to the switch itself, the switch acts like an end device in this scenario and cannot behave like a L2 switch. The switch takes the role of a router and performs the following steps to process the incoming frame:

  1. The IP packet is de-encapsulated from the received data-link frame.
  2. The destination IP address is looked up in the L3 switch’s routing table to find a route that matches the destination address.
  3. If a route is found, the packet is encapsulated in a new data-link frame and the frame is transmitted through the outgoing interface specified in the matched routing table entry. If no route is found, the packet is discarded.
  4. In order to send the new frame, the switch must find the destination MAC address using its local ARP cache or through an ARP request.
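The four steps above can be sketched as follows (the routing and ARP tables, interface names, and MAC values are all invented for illustration; a real L3 switch performs these lookups in hardware):

```python
import ipaddress

# Hypothetical tables for the topology discussed above.
routing_table = {"10.0.40.0/24": "SVI40"}        # prefix -> outgoing interface
arp_table = {"10.0.40.4": "00:1b:44:11:3a:b7"}   # assumed IP/MAC for PC4

def route_packet(dst_ip):
    """Steps 2-4: longest-prefix match, then ARP resolution for the new frame."""
    addr = ipaddress.ip_address(dst_ip)
    matches = [p for p in routing_table if addr in ipaddress.ip_network(p)]
    if not matches:
        return None                               # step 3: no route, discard
    best = max(matches, key=lambda p: int(p.split("/")[1]))
    out_iface = routing_table[best]
    dst_mac = arp_table.get(dst_ip)               # step 4: ARP cache lookup
    return out_iface, dst_mac                     # re-encapsulate and transmit
```

If the ARP cache lookup misses, a real switch would hold or drop the packet and send an ARP request before building the new frame.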

If the outgoing interface identified at step 3 is a routed interface, the new frame is forwarded out the physical routed interface. If the outgoing interface is a Switch Virtual Interface (SVI), the forwarding process is a bit more complex.

An SVI is a logical interface inside the switch that is bound to a VLAN and allows the switch to send IP packets to that VLAN like any other IP device. You can think of an SVI as the physical Network Interface Card (NIC) of a PC, except that the SVI is virtual (it is built in software). The SVI allows the switch to send and receive IP packets through the physical ports associated with the corresponding VLAN.

If the outgoing interface identified by the routing table is an SVI, out which physical port is the frame forwarded? The answer to this question depends on the destination MAC address:

  • If the destination MAC address is a broadcast or multicast address, the frame is forwarded out all physical ports assigned to the outgoing VLAN.
  • If the destination MAC address is a unicast address (learned through the ARP process or found in the ARP table), the switch must look for the destination MAC in its MAC address table, like a regular L2 switch. The matched MAC table entry will identify the outgoing physical port that will be used to forward the frame.
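The port-selection logic behind an SVI can be sketched like this (the MAC table, VLAN membership, port names, and addresses are all hypothetical):

```python
# Hypothetical tables: (MAC, VLAN) -> port, and VLAN -> member ports.
mac_table = {("00:1b:44:11:3a:b7", 40): "Gi0/4"}
vlan_ports = {40: ["Gi0/3", "Gi0/4", "Gi0/5"]}

def svi_egress_ports(dst_mac, vlan):
    """Pick the physical port(s) behind an SVI for an outgoing frame."""
    if int(dst_mac[1], 16) & 1:
        # I/G bit set: broadcast or multicast, flood to the whole VLAN.
        return vlan_ports[vlan]
    port = mac_table.get((dst_mac, vlan))
    # Known unicast goes out one port; unknown unicast is flooded too.
    return [port] if port else vlan_ports[vlan]
```

The I/G (individual/group) bit is the least significant bit of the first octet of the destination MAC, so broadcast (ff:ff:ff:ff:ff:ff) and multicast addresses both take the flooding branch.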

While preparing for the CCNA exam, I used to review all these steps and concepts involved in the operation of a L3 switch to make sure I really understood the fundamentals of routing and switching: MAC address tables, ARP tables, routing tables, layer 2 forwarding logic, SVIs, layer 3 routing logic, etc. In future blog posts, I’ll take a deeper look at some of those specific switching and routing topics.

Temporary loops in STP

When I was studying all the basic concepts related to the traditional STP protocol, I started asking myself one question: why do we need the listening and learning states? Why can’t a switch interface move from the blocking state directly to the forwarding state?

The theory says that if the LAN topology suddenly changes (for example, a link fails) and an interface that was previously blocked is moved to a forwarding state, a temporary loop may be created. The cause of these temporary loops could be stale MAC table entries learned under the old topology. To solve this problem, STP defines two interim states (the listening and learning states).

During the listening state, the old MAC table entries are removed and during the learning state the interface starts to learn the source MAC addresses of the received frames. These two transitory states help the switches to adapt to the new topology and avoid the creation of potential temporary loops.
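The resulting 802.1D state progression can be sketched as a small state machine (a simplification that ignores every timer except the forward delay):

```python
from enum import Enum

class PortState(Enum):
    BLOCKING = "blocking"
    LISTENING = "listening"   # no MAC learning, no data forwarding
    LEARNING = "learning"     # learns source MACs, still no forwarding
    FORWARDING = "forwarding"

# 802.1D default: each interim state lasts one forward delay (15 s), so an
# unblocked port waits up to 30 s before it actually forwards frames.
FORWARD_DELAY = 15

NEXT = {
    PortState.BLOCKING: PortState.LISTENING,
    PortState.LISTENING: PortState.LEARNING,
    PortState.LEARNING: PortState.FORWARDING,
}

def next_state(state):
    """Advance a port one step toward forwarding (after forward delay)."""
    return NEXT.get(state, state)
```

Walking a blocked port through `next_state` twice lands it in learning, and a third step finally reaches forwarding, which is the delay the rest of this post questions.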

All of this may sound reasonable, but I wasn’t able to find a scenario where a temporary loop was created as a consequence of moving an interface directly from a blocking state to a forwarding state. I searched the Internet, found some forum discussions about this topic and, surprisingly, came across a blog post from the author of the Cisco Official Cert Guide for the CCNA certification, Wendell Odom. In that post, the author admitted he wasn’t able to find a case in which the listening state is really necessary in STP to avoid temporary loops. He also quoted a fragment from a book written by Radia Perlman, the creator of STP. In that book, Radia Perlman even suggested the listening state wasn’t really necessary.

I strongly recommend reading the Wendell Odom article. As a summary, it seems that learning MAC addresses immediately after unblocking an interface isn’t harmful: even though an interface could potentially learn a wrong MAC address, it will not create a loop. Therefore, the listening and learning states could have been merged into a single “preforwarding” state in the original STP definition, as suggested by Radia Perlman in her book.

A unicast ARP request?

After analyzing some Wireshark traffic captures, I noticed that most ARP requests were broadcast frames, as expected, but occasionally I found ARP requests that were unicast frames. I thought ARP requests were always broadcast frames, but I was wrong.

Let’s do a quick review of the ARP protocol using a simple topology in Packet Tracer:

Imagine that PC1 (10.10.10.1) wants to access the Web Server (10.10.10.3). PC1 will generate an IP packet that needs to be encapsulated inside an Ethernet frame. To create the Ethernet frame, PC1 needs to know the MAC address of the Web Server. This is where the ARP process comes into play. PC1 will send an ARP request asking for the MAC address of 10.10.10.3. Since the destination MAC address is unknown, this frame must be a broadcast frame, and it will be received by all the devices in the LAN. Only the device with IP 10.10.10.3 (the Web Server) will reply with its MAC address.

The Wireshark capture shown below corresponds to an ARP request. In the Ethernet section of the capture, we can see that it is a broadcast frame (the destination MAC address is FF:FF:FF:FF:FF:FF).

The next Wireshark capture is also an ARP request, but the destination address is a unicast MAC address instead of a broadcast address.

What is, then, the purpose of a unicast ARP request if the destination MAC address is already known? I found the answer to this question in RFC 1122 – “Requirements for Internet Hosts – Communication Layers”. Section 2.3.2.1 describes four mechanisms for flushing out-of-date ARP entries. One of them, called “unicast poll”, is described as: “Actively poll the remote host by periodically sending a point-to-point ARP Request to it, and delete the entry if no ARP Reply is received from N successive polls”.

Following the example shown in the previous capture, the sending host has an entry in its ARP table that matches the IP 192.168.18.31 with the MAC address 60:f2:62:0f:50:a5. In order to verify if the MAC address is still valid, a unicast ARP request is sent using the MAC address recorded in the ARP table. If the destination host is still using that MAC address, it will reply with an ARP reply message. If no reply is received, it means that the MAC address is no longer in use and the sending host will delete the corresponding ARP table entry.
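The unicast-poll logic can be sketched as follows (`send_unicast_arp_request` is a hypothetical helper standing in for the real ARP transmit/receive path; the constant and function names are invented):

```python
MAX_POLLS = 3   # the "N successive polls" from RFC 1122

def poll_arp_entry(arp_table, ip, send_unicast_arp_request):
    """Verify a cached IP-to-MAC mapping; flush it after N unanswered polls."""
    mac = arp_table[ip]
    for _ in range(MAX_POLLS):
        # Point-to-point ARP Request addressed to the cached MAC;
        # the helper returns True if an ARP Reply came back.
        if send_unicast_arp_request(ip, mac):
            return True                  # entry confirmed still valid
    del arp_table[ip]                    # no reply: stale entry removed
    return False
```

Because the request is addressed to the cached MAC, only the host being verified ever sees it, which keeps the poll from disturbing the rest of the LAN.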

One final observation. In this particular capture of a unicast ARP request, the “Target MAC address” field is filled with a MAC address of all zeros (00:00:00:00:00:00). However, it is possible to find ARP requests that also use the MAC address of the destination host in this field. For example, the capture shown below:

Does it depend on the particular implementation of the ARP protocol running on the sending host? Or maybe another protocol feature? I’ll keep digging into it.


© 2022 Networking Tales
