ACI Common Pervasive Gateway

I’d like to thank Takuya Kishida for his work on this post!

Common Pervasive Gateway is an older feature that was used to connect multiple ACI fabrics together via an L2 connection prior to the availability of ACI MultiPod and ACI MultiSite. While most customers will use a newer feature such as ACI MultiPod or ACI MultiSite, there are still a number of customers who use the Common Pervasive Gateway feature to support L2 connectivity between fabrics.

Introduction

In this post, we’ll discuss the Common Pervasive Gateway feature in ACI, which has been available since 1.2(1i).

This feature addresses the requirement to extend a Bridge Domain (BD) across multiple ACI Fabrics so that servers in each ACI Fabric share the same default gateway.

A normal default gateway configured as an ACI BD subnet is called a pervasive gateway. Because this feature makes that gateway common to multiple fabrics, it is called Common Pervasive Gateway.

This feature allows one or more virtual machines (VMs) or conventional hosts to move between ACI Fabrics seamlessly, without any configuration changes or extra operations, as if they were moving within a single ACI Fabric.

Official documents for this feature are as follows.

The problem solved by Common Pervasive Gateway

Without Common Pervasive Gateway, one ACI Fabric cannot learn an EndPoint (EP) that resides in another ACI Fabric. The steps below show why.

Figure: The problem

※ pMAC : physical MAC

  1. H1 ARPs for GW IP 192.168.0.254
    L1 responds to ARP with pMAC
  2. H1 sends the packet with DMAC as pMAC and DIP as 192.168.1.2
  3. L1 routes the packet and sends it to one of the Spines for proxy
  4. Spine, being unable to find the entry for 192.168.1.2, sends glean packets
  5. BL1 ARPs for 192.168.1.2 with pMAC as ARP sender MAC and 192.168.1.254 as ARP sender IP.
    BL2 doesn’t learn pMAC and 192.168.1.254 since ACI Fabric2 owns these as its own MAC and IP as well.
  6. If 192.168.1.2 is already learned:
    ARP request from BL1 is forwarded to L2. And L2 sends it to H2.
    If 192.168.1.2 is not learned yet:
    ARP request from BL1 is sent to one of the Spines for proxy and glean packets are sent for 192.168.1.2 to learn 192.168.1.2 as EP.
    Then, next ARP request from BL1 is forwarded to H2.
    If ARP flood is enabled:
    ARP request from BL1 is flooded to H2
  7. H2 responds to ARP with DMAC as pMAC, DIP as 192.168.1.254, SMAC as H2MAC, SIP as 192.168.1.2
  8. L2 receives the ARP reply but does not forward it anywhere, because the DMAC and DIP are L2’s own router MAC and IP.
    Hence ACI Fabric1 cannot learn H2.
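
The failure in step 8 can be sketched with a toy model. This is a simplified illustration, not ACI code: the `Leaf` class, the `handle_arp_reply` method, and the MAC values are made up for the example. The only rule modeled is that a leaf consumes an ARP reply locally when the destination MAC and target IP are its own gateway MAC and IP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArpReply:
    dmac: str        # destination MAC of the frame
    target_ip: str   # IP the reply is addressed to (DIP in the steps above)
    smac: str        # sender MAC (the host)
    sender_ip: str   # sender IP (the host)


@dataclass
class Leaf:
    name: str
    router_mac: str
    gateway_ips: frozenset

    def handle_arp_reply(self, reply: ArpReply) -> str:
        # Toy rule: a reply addressed to this leaf's own gateway MAC and IP
        # is consumed locally and never reaches the other fabric.
        if reply.dmac == self.router_mac and reply.target_ip in self.gateway_ips:
            return "consumed"
        return "bridged"


# Hypothetical values: both fabrics use the same pMAC and gateway IP.
PMAC = "00:22:bd:f8:19:ff"
leaf2 = Leaf("L2", PMAC, frozenset({"192.168.1.254"}))
reply = ArpReply(dmac=PMAC, target_ip="192.168.1.254",
                 smac="00:50:56:00:00:02", sender_ip="192.168.1.2")
print(leaf2.handle_arp_reply(reply))  # consumed
```

Because the reply is consumed on L2, BL1 in ACI Fabric1 never sees it, which matches the outcome described in step 8.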

NOTE: This may work if we configure a different pMAC on the BD in each ACI Fabric. However, that means having two different MACs for the same IP address, so a VM may keep using a stale ARP entry after a vMotion from one Fabric to another.

What if we configure different pMACs and different subnet IPs in each ACI Fabric? Then we would need to reconfigure the default gateway IP address on the VM whenever it moves between ACI Fabrics.

Common Pervasive Gateway resolves this problem. The next section describes how it does so.

How it works with Common Pervasive Gateway

This section describes how common pervasive gateway works.

Common Pervasive Gateway allows us to have a virtual MAC and a virtual IP that are common (i.e., the same) across multiple ACI Fabrics.

All devices in a Bridge Domain with the Common Pervasive Gateway feature are supposed to use the Common Pervasive Gateway (virtual IP) as their default gateway.

However, each ACI Fabric still requires a non-virtual IP on the BD in addition to the virtual IP. This non-virtual IP must be in the same subnet as the virtual IP and must be unique to each ACI Fabric. This is similar to the physical IP and virtual IP configuration in HSRP.

Each stretched BD also requires a physical MAC address (Custom MAC address in the APIC GUI) that is unique to each ACI Fabric.
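
As a rough illustration of these addressing rules, the sketch below checks a planned set of BD settings for consistency. It is not an APIC API: the `FabricBd` record, the `check_plan` function, and the Fabric2 values are assumptions made up for this example.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class FabricBd:
    fabric: str
    virtual_ip: str      # e.g. "192.168.0.254/24", shared by all fabrics
    virtual_mac: str     # shared by all fabrics
    physical_ip: str     # non-virtual IP, unique per fabric
    physical_mac: str    # custom MAC, unique per fabric


def check_plan(bds):
    """Return a list of problems; an empty list means the plan looks consistent."""
    problems = []
    if len({bd.virtual_ip for bd in bds}) != 1:
        problems.append("virtual IP differs between fabrics")
    if len({bd.virtual_mac.lower() for bd in bds}) != 1:
        problems.append("virtual MAC differs between fabrics")
    if len({ipaddress.ip_interface(bd.physical_ip).ip for bd in bds}) != len(bds):
        problems.append("non-virtual IP is not unique per fabric")
    if len({bd.physical_mac.lower() for bd in bds}) != len(bds):
        problems.append("physical MAC is not unique per fabric")
    for bd in bds:
        vnet = ipaddress.ip_interface(bd.virtual_ip).network
        pnet = ipaddress.ip_interface(bd.physical_ip).network
        if vnet != pnet:
            problems.append(f"{bd.fabric}: non-virtual IP is not in the virtual IP subnet")
    return problems


fabric1 = FabricBd("Fabric1", "192.168.0.254/24", "00:00:AC:AC:AC:AC",
                   "192.168.0.253/24", "00:22:BD:F8:19:FF")
# Fabric2 values are hypothetical.
fabric2 = FabricBd("Fabric2", "192.168.0.254/24", "00:00:AC:AC:AC:AC",
                   "192.168.0.252/24", "00:22:BD:F8:19:FE")
print(check_plan([fabric1, fabric2]))  # []
```

A plan in which both fabrics reuse the same physical MAC or non-virtual IP would be reported as a problem, which corresponds to the failure case described in the previous section.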

Figure: How Common Pervasive Gateway works

  1. H1 ARPs for GW IP 192.168.0.254 (vIP)
    L1 responds to ARP with vMAC
  2. H1 sends the packet with DMAC as vMAC and DIP as 192.168.1.2
  3. L1 routes the packet and sends it to one of the Spines for proxy
    When vMAC is configured on BD, a packet with DMAC set to pMAC1 won’t be routed since router-mac for the BD becomes vMAC.
  4. Spine, being unable to find the entry for 192.168.1.2, sends glean packets
  5. BL1 ARPs for 192.168.1.2 with pMAC1 as ARP sender MAC and 192.168.1.251 as ARP sender IP.
    -> Both sender MAC/IP are set to physical not virtual
    BL2 learns pMAC1 and 192.168.1.251.
  6. If 192.168.1.2 is already learned:
    ARP request from BL1 is forwarded to L2. And L2 sends it to H2.
    If 192.168.1.2 is not learned yet:
    ARP request from BL1 is sent to one of the Spines for proxy and glean packets are sent for 192.168.1.2 to learn 192.168.1.2 as EP.
    Then, next ARP request from BL1 is forwarded to H2.
    If ARP flood is enabled:
    ARP request from BL1 is flooded to H2
  7. H2 responds to ARP with DMAC as pMAC1, DIP as 192.168.1.251, SMAC as H2MAC, SIP as 192.168.1.2
    L2 learns H2MAC,IP(192.168.1.2) and updates COOP on Spines in ACI Fabric2.
  8. L2 bridges the packet to BL2
  9. BL2 bridges the packet to BL1
    BL1 learns H2MAC,IP(192.168.1.2) and updates COOP on Spines in ACI Fabric1.
  10. Subsequent packets from H1 are routed to ACI Fabric2 through BL1.
    BL1 sends them out on L2OUT with SMAC as pMAC1, DMAC as H2MAC

Configuration

Components

  • Multiple ACI Fabrics
  • A Bridge Domain (BD) in each ACI Fabric
  • A unique physical BD MAC (Custom MAC) for each ACI Fabric
  • A unique non-virtual IP for each ACI Fabric
  • An identical virtual MAC across ACI Fabrics
  • An identical virtual IP across ACI Fabrics
  • An L2Out between the BDs of each ACI Fabric

Screen Shot 2018-11-29 at 10.08.58 PM.png

An L2Out in ACI represents a Layer 2 network, which may contain other switches or different networks, whereas a normal EPG is basically supposed to contain only EndPoints such as servers.

Strictly speaking, the term “L2Out” refers to an External Bridged Network, which is associated directly with a BD. However, attaching an L2 network via an EPG (L2EPG) accomplishes the same thing, so many documents use “L2Out” for either configuration.

Note: L2Out via EPG is the most widely deployed form of L2Out and is the recommended configuration.

Design Prerequisites

When using Common Pervasive Gateway, we have the following requirements.
(These requirements are described in the Cisco ACI Basic Configuration Guide, chapter “ACI Fabric Layer 3 Outside Connectivity”.)

  • The Bridge domain that is configured to communicate across ACI fabrics must be configured for flood mode
         => This means we need to set L2 Unknown Unicast to flood on the BD for Common Pervasive Gateway
  • Only one EPG from a bridge domain (if the BD has multiple EPGs) should be configured on a border Leaf on the port which is connected to the second Fabric
         => This means
    + we should have only one L2Out for each BD with Common Pervasive Gateway
    + normal EPGs in the same BD with Common Pervasive Gateway cannot use the port used for the L2Out connection between the two ACI Fabrics
  • Do not connect hosts directly to an inter-connected Layer 2 network that enables a pervasive common gateway among the two ACI fabrics.
         => This means the L2Out connection between the two ACI Fabrics for Common Pervasive Gateway must be used only for L2Out connectivity between the two ACI Fabrics.
    For example, even if there is an external L2 switch between the two ACI Fabrics for this L2Out connectivity, we cannot connect any hosts to that switch in the same VLAN as the L2Out. Because of the second prerequisite above, we also cannot connect hosts to that switch in other VLANs that belong to any EPG in the same BD.

The image below illustrates these prerequisites.

Screen Shot 2018-11-29 at 10.11.24 PM.png

Common Pervasive Gateway is also a way to extend a BD. Hence each BD with Common Pervasive Gateway requires its own L2Out.

GUI

Screen Shot 2018-11-29 at 10.14.59 PM.png

FAQ

  • What is the purpose of virtual IP?

The BD subnet marked as virtual IP should be the default gateway for servers. This IP must be identical across the ACI Fabrics.

Also, if a BD subnet is configured as virtual IP, that IP address is not used as the source of ARP requests originated from the ACI Leaf BD SVI as long as another non-virtual IP exists in the same subnet. This ensures that BL1 in ACI Fabric1 in the previous scenario sends its ARP request to H2 with a sender MAC and IP that are unique to its ACI Fabric.
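
The sender IP selection described above can be sketched as follows. This is only a model of the stated rule, not the leaf's actual implementation; the function name `pick_arp_sender_ip` and the tuple layout are assumptions for this example.

```python
import ipaddress


def pick_arp_sender_ip(svi_addresses, target_ip):
    """svi_addresses: list of (cidr, is_virtual) tuples configured on the BD SVI.

    Returns the sender IP as a string, or None when no SVI subnet covers target_ip.
    """
    target = ipaddress.ip_address(target_ip)
    candidates = [
        (ipaddress.ip_interface(cidr), is_virtual)
        for cidr, is_virtual in svi_addresses
    ]
    in_subnet = [(iface, virt) for iface, virt in candidates
                 if target in iface.network]
    if not in_subnet:
        return None
    # Prefer a non-virtual address; fall back to the virtual one
    # only when nothing else exists in that subnet.
    non_virtual = [iface for iface, virt in in_subnet if not virt]
    chosen = non_virtual[0] if non_virtual else in_subnet[0][0]
    return str(chosen.ip)


svi = [("192.168.0.254/24", True), ("192.168.0.253/24", False)]
print(pick_arp_sender_ip(svi, "192.168.0.1"))                          # 192.168.0.253
print(pick_arp_sender_ip([("192.168.0.254/24", True)], "192.168.0.1"))  # 192.168.0.254
```

The second call shows the fallback case: with only a virtual IP in the subnet, the virtual IP ends up as the ARP sender IP.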

  • What is the purpose of virtual MAC?

When a virtual MAC is configured, it becomes the router-mac for the BD. Hence a packet with its DMAC set to the physical MAC (custom MAC in GUI) is no longer routed on the BD but just bridged.

ARP requests for any of the SVI subnets on the BD are resolved with this virtual MAC. If they were not resolved with the virtual MAC, they might hit CSCux73998.

  • How is physical MAC (custom MAC in GUI) used after virtual MAC is configured?

When traffic is routed on BD and source MAC needs to be rewritten, physical MAC is used as a source MAC for the packet.

Also it is used as a source MAC when traffic is generated from BD SVI such as ping from BD SVI.

This applies to both virtual IP and non-virtual IP. So even when ping is sourced from virtual IP, physical MAC is used as source MAC.

  • How is a packet with DMAC set to physical MAC (custom MAC in GUI) processed on ACI Leaf after virtual MAC is configured?

It will be bridged as normal Layer2 traffic. It cannot be routed on BD with virtual MAC since router-mac for the BD is already replaced with virtual MAC. Hence as long as ACI Leaf doesn’t learn the physical MAC from outside, it will be processed as unknown L2 unicast.

  • Can we still ping to BD SVI with DMAC set to physical MAC (custom MAC in GUI) after virtual MAC is configured?

Yes. Judging from the Layer 2 forwarding table alone, such a packet should simply be bridged rather than processed by the CPU. However, due to the sup-tcam entry below, ICMP packets destined to a BD SVI (router-ip for the BD) are sup-redirected.

module-1# show system internal aclqos asic sup-tcam entries unit 0 type ipv4 ingress
  eid-->20 key-type-->IPV4
    wide_key(1):0/0 vec_type(3):2/0 sup_tx(1):0/0 myrip(1):1/0 l3_prot(8):1/0
  result--->  count----> 112760
    stats_vld(1):1 policer_apply(1):1 stats_police_idx(10):14 sup_redirect(1):1 sup_code(8):7
    sup_traffic_class(4):5

module-1# show platform internal ns forwarding router-ip ing | egrep '220000|IP'
POS    TYPE   OI   SEG-ID   IP-Addr            Valid
  10      0    0   220000   192.168.0.254        1
 130      0    0   220000   192.168.0.253        1
 207      0    0   220000   192.168.5.254        1
 569      0    0   220000   192.168.1.254        1
 997      0    0   220000   99.99.99.254         1
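
The forwarding behavior from the last few answers can be summarized in a small decision function. This is a deliberately simplified model under stated assumptions, not the hardware pipeline: the `classify` function is hypothetical, and it only covers the ICMP-to-SVI redirect and the router-mac check discussed above.

```python
def classify(dmac, dst_ip, is_icmp, router_mac, router_ips):
    """Return 'sup-redirect', 'route', or 'bridge' for a packet arriving on the BD."""
    # ICMP to one of the BD's own SVI addresses is punted to the CPU first,
    # mirroring the myrip/l3_prot match in the sup-tcam entry above.
    if is_icmp and dst_ip in router_ips:
        return "sup-redirect"
    # Only frames addressed to the router-mac (the vMAC once configured) are routed.
    if dmac.lower() == router_mac.lower():
        return "route"
    return "bridge"


VMAC = "00:00:ac:ac:ac:ac"
PMAC = "00:22:bd:f8:19:ff"
ROUTER_IPS = {"192.168.0.254", "192.168.0.253", "192.168.1.254"}

print(classify(PMAC, "192.168.0.253", True, VMAC, ROUTER_IPS))  # sup-redirect
print(classify(PMAC, "192.168.1.2", True, VMAC, ROUTER_IPS))    # bridge
print(classify(VMAC, "192.168.1.2", True, VMAC, ROUTER_IPS))    # route
```
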
  • Can we still use non-virtual IP as default gateway for servers when virtual MAC and virtual IP are configured?

Technically yes, but it is not recommended. It works because ARP is always resolved with the vMAC, even for the non-virtual IP.

  • What if virtual MAC is configured without any virtual IP?

The behavior described in “What is the purpose of virtual MAC?” still applies.

However, ACI Leaf switches cannot tell which BD subnet is common across ACI Fabrics. Hence the ARP request for H2 from BL1 in the previous scenario may be generated with the sender IP that is supposed to be the virtual IP, in other words, the one that is also configured as a BD SVI in the other ACI Fabric. If that happens, ACI Fabric1 fails to resolve ARP for H2.

  • Why do we have to configure a non-virtual IP on top of the virtual IP in the same subnet?

If there were only a virtual IP, the target IP of the ARP reply from H2 in the previous scenario would be the virtual IP, because BL1 would have no choice other than generating the ARP request with the sender IP set to the virtual IP. Hence the ARP reply from H2 would be sup-redirected on L2 in ACI Fabric2 due to this sup-tcam entry.

module-1# show system internal aclqos asic sup-tcam entry unit 0 type arp ing
  eid-->27 key-type-->ARP
    wide_key(1):0/0 vec_type(3):4/0 sup_tx(1):0/0 myrip(1):1/0
  result--->  count----> 90208
    stats_vld(1):1 policer_apply(1):1 stats_police_idx(10):1b sup_redirect(1):1 sup_code(8):1
    sup_traffic_class(4):5
  • Why do we have to configure a unique non-virtual IP for each ACI Fabric?

Same reason as above.

  • Why do we have to configure a unique physical MAC (custom MAC in GUI) for each ACI Fabric?

Basically the same reason as for the non-virtual IP. If the DMAC of the ARP reply matches one of the router-macs, the reply is sup-redirected and not forwarded.

module-1# show system internal aclqos asic sup-tcam entry unit 0 type arp ing
  eid-->29 key-type-->ARP
    wide_key(1):0/0 vec_type(3):4/0 sup_tx(1):0/0 infra(1):0/0 ivxlan(1):0/0 myrmac(1):1/0
    dst_mac_broadcast(1):1/0
  result--->  count----> 5
    stats_vld(1):1 policer_apply(1):1 stats_police_idx(10):1d sup_redirect(1):1 sup_code(8):1
    sup_traffic_class(4):5

module-1# show platform internal ns forwarding router-mac ingress
  0        00:0c:0c:0c:0c:0c
  2        00:22:bd:f8:19:ff
  4        00:00:ac:ac:ac:ac

Verification

We need to be able to verify that the current software and hardware state is correct. We can do that by checking each of the following processes.

  • Policy Manager (Logical Object)
admin@apic1:~> moquery -c fvBD -f 'fv.BD.name=="BD1"'
# fv.BD
name                   : BD1
dn                     : uni/tn-TK/BD-BD1
mac                    : 00:22:BD:F8:19:FF        <--- physical BD MAC (custom MAC in GUI)
vmac                   : 00:00:AC:AC:AC:AC        <--- virtual MAC

admin@apic1:~> moquery -c fvSubnet -f 'fv.Subnet.ip=="192.168.0.254/24"'
ip           : 192.168.0.254/24
dn           : uni/tn-TK/BD-BD1/subnet-[192.168.0.254/24]
virtual      : yes                                <--- virtual IP flag
  • Policy Manager (Concrete Object)
admin@apic1:~> moquery -c sviIf -f 'svi.If.id=="vlan44"'
id           : vlan44
dn           : topology/pod-1/node-101/sys/ctx-[vxlan-2228224]/bd-[vxlan-16711542]/svi-[vlan44]
mac          : 00:22:BD:F8:19:FF                  <--- physical BD MAC (custom MAC in GUI)
vmac         : 00:00:AC:AC:AC:AC                  <--- virtual MAC

admin@apic1:~> moquery -c ipv4Addr -f 'ipv4.Addr.addr=="192.168.0.254/24"'
addr             : 192.168.0.254/24
ctrl             : pervasive,virtual              <--- virtual IP flag
dn               : topology/pod-1/node-101/sys/ipv4/inst/dom-TK:VRF1/if-[vlan44]/addr-[192.168.0.254/24]
type             : primary                        <--- primary IP flag (could be secondary)
  • Policy Element (Concrete Object)
Leaf-1# moquery -c sviIf -f 'svi.If.id=="vlan44"'
id           : vlan44
dn           : sys/ctx-[vxlan-2228224]/bd-[vxlan-16711542]/svi-[vlan44]
mac          : 00:22:BD:F8:19:FF                  <--- physical BD MAC (custom MAC in GUI)
vmac         : 00:00:AC:AC:AC:AC                  <--- virtual MAC

Leaf-1# moquery -c ipv4Addr -f 'ipv4.Addr.addr=="192.168.0.254/24"'
addr             : 192.168.0.254/24
ctrl             : pervasive,virtual              <--- virtual IP flag
dn               : sys/ipv4/inst/dom-TK:VRF1/if-[vlan44]/addr-[192.168.0.254/24]
type             : primary
  • SVIMgr
Leaf-1# show system internal interface-vlan info vlan 44 | egrep 'data|mac|------'
SVI database
------------
mac=0022.bdf8.19ff iod=98
vmac=0000.acac.acac
------- fsm log -------
------- cfg pss info -------
mac=0022.bdf8.19ff
vmac=4ff7.c475.5d0a
default-mac=FALSE
------- runtime pss info -------
  • IPMgr
Leaf-1# show ip int vlan 44
IP Interface Status for VRF "TK:VRF1"
vlan44, Interface status: protocol-up/link-up/admin-up, iod: 98,
  IP address: 192.168.1.254, IP subnet: 192.168.1.0/24 secondary
  IP address: 192.168.0.254, IP subnet: 192.168.0.0/24  virtual
  IP address: 192.168.0.253, IP subnet: 192.168.0.0/24 secondary
  IP broadcast address: 255.255.255.255
  IP primary address route-preference: 1, tag: 0
  • ARP
Leaf-1# vsh -c 'show ip arp internal info int vlan 44'
 arp_if 0x97d0fe94
Interface Vlan44        context_name TK:VRF1 and context_id 11
ttl 0   ARP rearp interval is 0 and Throttle timeoutis 300
ARP proxy is Disabled and Local proxy is Disabled
 gratuitous_arp_timer_flag is Enabled
 gratuitous_arp_request is Enabled
 gratuitous_arp_update_cache is Enabled
 FHRP proxy arp node is (nil)
 VMAC : 0000.acac.acac
  • ELTMC
Leaf-1# vsh_lc -c 'show system internal eltmc info int vlan 44'
            IfInfo:
           interface:         Vlan44   :::         ifindex:      151060524
                 iod:             98   :::           state:             up
                 Mod:              0   :::            Port:              0
             context:        TK:VRF1   :::         vlan_id:             44
           vlan_type:        BD_VLAN   :::      hw_vlan_id:             19
   access_encap_type:        Unknown   :::    access_encap:              0
   fabric_encap_type:          VXLAN   :::    fabric_encap:       16711542
       ip_mode_flags:           0x80   :::              IP:     0xc0a800fe
              ip_str:  192.168.0.254   :::             MTU:           9000
        Router MAC:00.22.bd.f8.19.ff   :::            Vmac:00.00.ac.ac.ac.ac <--- even though it is still described as router-mac in here, it no longer acts as router-mac. Vmac is router-mac.
      NorthStar Info:
           qq_tbl_id:            429   :::         qq_ocam:              0
     seg_stat_tbl_id:             92   :::        seg_ocam:              0
         flood_encap:             34   :::  igmp_mld_encap:             49
            rmac idx:              2   :::        vmac_idx:              4

            BCM Info:
      bcm_l3_intf_id:             29        <--- srcMAC of packets from this SVI can be checked with this
              if_bmp:

Leaf-1# vsh_lc -c 'show system internal eltmc info rmac_table'
Eltmc rmac entries ..
         tenant_rmac:00:00:ac:ac:ac:ac
         ns_rmac_idx:              4   :::      ns_ref_cnt:              1
         bcm_vfp_idx:              0   :::     bcm_ref_cnt:              0
  
         tenant_rmac:00:22:bd:f8:19:ff
         ns_rmac_idx:              2   :::      ns_ref_cnt:             22
         bcm_vfp_idx:              0   :::     bcm_ref_cnt:             71
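
The outputs above print the same MAC in three notations (`00:22:BD:F8:19:FF`, `0022.bdf8.19ff`, `00.22.bd.f8.19.ff`) and the SVI IP once as hex (`0xc0a800fe`). When comparing saved outputs, a few small helpers can make the values comparable. This is a minimal sketch; the helper names are made up for this example, and `parse_moquery` assumes the simple `key : value <--- note` layout shown in this post.

```python
import ipaddress
import re


def normalize_mac(text):
    """Return a MAC as 12 lowercase hex digits, whatever separator the CLI used."""
    digits = re.sub(r"[^0-9a-fA-F]", "", text).lower()
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {text!r}")
    return digits


def hex_to_ipv4(text):
    """Convert a value such as '0xc0a800fe' to dotted-decimal form."""
    return str(ipaddress.IPv4Address(int(text, 16)))


def parse_moquery(output):
    """Parse 'key : value <--- note' lines into a dict, dropping the notes."""
    record = {}
    for line in output.splitlines():
        if line.startswith("#") or " : " not in line:
            continue
        key, value = line.split(" : ", 1)
        record[key.strip()] = value.split("<---")[0].strip()
    return record


sample = """# fv.BD
name                   : BD1
mac                    : 00:22:BD:F8:19:FF        <--- physical BD MAC
vmac                   : 00:00:AC:AC:AC:AC        <--- virtual MAC
"""
bd = parse_moquery(sample)
print(normalize_mac(bd["mac"]) == normalize_mac("0022.bdf8.19ff"))   # True
print(normalize_mac(bd["vmac"]) == normalize_mac("00.00.ac.ac.ac.ac"))  # True
print(hex_to_ipv4("0xc0a800fe"))  # 192.168.0.254
```
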

 

Troubleshooting TIPs

As described in the FAQ section, the BD SVI always uses the physical BD MAC (custom MAC in GUI) as the source MAC, even though it replies to ARP requests with the virtual MAC.

Hence the expected destination and source MACs of packets passing through a BD with a virtual MAC are as follows.

(assuming both BDs are using virtual MAC)

ICMP echo:

H1(192.168.0.1) ---> [(192.168.0.254)BD---BD(192.168.1.254)] ---> H2(192.168.1.2)

  H1 to BD:  DMAC: VMAC,   SMAC: H1MAC
  BD to H2:  DMAC: H2MAC,  SMAC: pMAC

ICMP reply:

H1(192.168.0.1) <--- [(192.168.0.254)BD---BD(192.168.1.254)] <--- H2(192.168.1.2)

  BD to H1:  DMAC: H1MAC,  SMAC: pMAC
  H2 to BD:  DMAC: VMAC,   SMAC: H2MAC

As described in the FAQ section, the router-mac is replaced with the virtual MAC. So if the DMAC of an incoming packet is the pMAC instead of the vMAC, the packet won’t be routed; it is handled as L2 unicast and most likely flooded or proxied within the BD.

!!! Caution !!!

Some devices, such as NetApp with the fast path feature, do not use ARP when replying to packets; instead they use the source MAC of the incoming packet as the destination MAC of the reply. Such devices may reply to the ACI Leaf with the pMAC as the destination MAC, since the pMAC is the source MAC of packets leaving an ACI BD with a virtual MAC. If that happens, the traffic won’t be routed.
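
The difference between a host that consults its ARP cache and one that reuses the incoming source MAC can be sketched like this. The functions `reply_dmac` and `leaf_action` are hypothetical and model only the behavior described in this caution, not any vendor's implementation.

```python
def reply_dmac(uses_fast_path, arp_cache, gateway_ip, last_seen_smac):
    """Destination MAC a host puts on a reply sent through its gateway."""
    if uses_fast_path:
        # Reuse the source MAC of the incoming packet instead of consulting ARP.
        return last_seen_smac
    return arp_cache[gateway_ip]


def leaf_action(dmac, router_mac):
    """A leaf routes only frames addressed to the BD's router-mac (the vMAC)."""
    return "routed" if dmac.lower() == router_mac.lower() else "bridged"


VMAC = "00:00:ac:ac:ac:ac"
PMAC = "00:22:bd:f8:19:ff"
arp_cache = {"192.168.1.254": VMAC}   # ARP is always answered with the vMAC

normal = reply_dmac(False, arp_cache, "192.168.1.254", last_seen_smac=PMAC)
fast = reply_dmac(True, arp_cache, "192.168.1.254", last_seen_smac=PMAC)
print(leaf_action(normal, VMAC))  # routed
print(leaf_action(fast, VMAC))    # bridged
```
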
