Operationalizing ACI: Enhanced Endpoint Tracker

One of the most frequent questions that comes up when I discuss ACI with customers is, “How can we better operationalize ACI?” In other words, how can we take what we have with ACI and make it simpler to consume, not only for ourselves but also for our Ops staff?

For those new to ACI (and even for those of us with a bit more experience), quickly isolating stale or stuck endpoints on the fabric, and then clearing those endpoints to stop the black-holing of traffic, can be time-consuming and frustrating.

Enhanced Endpoint Tracker was developed to make it easier to address stale/stuck endpoints, and to make it easier to support and operationalize ACI in general! While this ACI App can be found on the Cisco ACI App Store (and installed on the APIC), I highly recommend that you download it here in standalone OVA form. Why do I recommend the standalone OVA? For applications that get installed on the APIC, Cisco imposes a 2G memory limit and a 10G disk quota on stateful applications to ensure that ACI apps don’t take too many resources away from critical APIC operations**. As a result, I have found that the current iteration of Enhanced Endpoint Tracker runs much better as a standalone OVA.

**Note – As of APIC 4.0, the Enhanced Endpoint Tracker was rewritten and can be run on the APIC much more efficiently. If the standalone OVA is not an option, you should be able to take advantage of Enhanced Endpoint Tracker as an App on the APIC.

What benefits do I get from the Enhanced Endpoint Tracker?

  1. Per-node History – Per-node history gives you a rundown of the endpoint entry on each ACI Leaf in the entire fabric: the status of the endpoint (e.g., when it was created, when it was deleted), the interface on the ACI Leaf, encapsulation, pctag, EPG, etc.
  2. Move Events – Shows every move event for your endpoint, such as which Leaf the endpoint moved from, which Leaf it moved to, and the time of the move.
  3. NEW – Rapid EP moves – Rapid Moves will allow you to detect when an endpoint is rapidly updating. More often than not, this is a result of IP-to-MAC address changes, which can occur when servers are set up in an active-active fashion without VPC. In these cases, the IP address is the same, but the physical MAC address for the IP dataplane traffic changes, which can cause problems and should be addressed.
  4. Off-Subnet Events – Off-subnet events show whether the endpoint ever had an IP address associated with it that did not belong to the defined subnet on the Bridge Domain (e.g., a Microsoft VM with no access to DHCP services and without a static IP address will auto-assign an IP from the 169.254.x.x range and try to talk).
  5. Stale Events – Is this endpoint currently stale on any ACI Leafs in the fabric?
  6. Clear Endpoint – The ability to clear this endpoint from all ACI Leafs in the entire fabric.

The ability to Search for Endpoints

With the “Endpoint History” button, located at the top of the Enhanced Endpoint Tracker screen, you can search for your endpoint by its IPv4 or MAC address. Once you have entered the endpoint of choice, you’ll have several options to review for the endpoint, such as:

EndPoints Search

Moves – The ability to Find Endpoints which are flapping

The “Endpoint Moves” tab allows you to quickly ascertain how many times endpoints in your fabric are moving from one ACI Leaf to another. Now, I should say that in a normal ACI Fabric, some level of endpoint mobility is completely normal; Virtual Machines move from one host to another depending on the requirements of the application and the available hardware resources. However, an endpoint that is flapping and moving constantly can indicate a loop or a misconfiguration, which can needlessly eat up ACI Leaf resources and cause issues.

Moves
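To make the idea of "counting moves" concrete, here is a minimal Python sketch. It is not how Enhanced Endpoint Tracker is implemented; the `count_moves` function, the `(timestamp, leaf)` history format, and the leaf names are all assumptions made up for illustration.

```python
from collections import Counter


def count_moves(history):
    """Count leaf-to-leaf moves for one endpoint.

    `history` is a hypothetical list of (timestamp, leaf_name) observations.
    Returns a Counter keyed by (from_leaf, to_leaf).
    """
    moves = Counter()
    ordered = sorted(history)  # order observations by timestamp
    # Compare each observation with the one that follows it.
    for (_, prev_leaf), (_, next_leaf) in zip(ordered, ordered[1:]):
        if prev_leaf != next_leaf:
            moves[(prev_leaf, next_leaf)] += 1
    return moves


# Illustrative data only: an endpoint bouncing between two leafs.
history = [
    (100, "leaf-101"),
    (160, "leaf-102"),
    (220, "leaf-101"),
    (280, "leaf-102"),
    (281, "leaf-102"),  # same leaf again, so not a move
]
moves = count_moves(history)
total_moves = sum(moves.values())
```

An endpoint with a handful of moves over a day is probably just a VM migrating; one that racks up moves back and forth between the same pair of leafs within minutes is the kind worth investigating.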

Rapid Moves – The ability to see which endpoints are changing their IP-to-MAC binding

Rapid Moves will allow you to detect when an endpoint is rapidly updating, as this indicates that the endpoint is unstable or misconfigured. When an endpoint is flagged as rapid, analysis is temporarily disabled for that endpoint and rapid notifications are sent. More often than not, this is a result of IP-to-MAC address changes, which can occur when servers are set up in an active-active fashion without VPC. In these cases, the IP address is the same, but the physical MAC address for the IP dataplane traffic changes, which can cause problems and should be addressed.
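As a rough illustration of what "rapidly updating" could mean, the sketch below flags an IP whose MAC binding changes more than a threshold number of times inside a sliding time window. The function names, the event format, the threshold, and the window length are all hypothetical; the app's actual detection logic and defaults may well differ.

```python
from collections import deque


def binding_change_times(events):
    """Return the timestamps at which the MAC bound to one IP changed.

    `events` is a hypothetical list of (timestamp, mac) learns for one IP.
    """
    ordered = sorted(events)
    return [
        t
        for (_, prev_mac), (t, mac) in zip(ordered, ordered[1:])
        if mac != prev_mac
    ]


def is_rapid(change_times, threshold, window_seconds):
    """True if more than `threshold` changes fall within `window_seconds`."""
    window = deque()
    for t in sorted(change_times):
        window.append(t)
        # Drop changes that are too old to share a window with `t`.
        while t - window[0] > window_seconds:
            window.popleft()
        if len(window) > threshold:
            return True
    return False


# Illustrative data: one IP alternating between two NICs every 5 seconds,
# as an active-active server without VPC might do.
flapping = [(0, "aa:aa"), (5, "bb:bb"), (10, "aa:aa"), (15, "bb:bb"), (20, "aa:aa")]
stable = [(0, "aa:aa"), (30, "aa:aa")]

flapping_is_rapid = is_rapid(binding_change_times(flapping), threshold=3, window_seconds=60)
stable_is_rapid = is_rapid(binding_change_times(stable), threshold=3, window_seconds=60)
```

With these made-up numbers, the alternating endpoint is flagged and the stable one is not.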

Off-Subnet – The ability to quickly identify Endpoints which have misconfigured IPv4 addresses

The “Off-Subnet Endpoints” option allows you to quickly find and isolate misconfigurations on your end hosts. The Off-Subnet Endpoints tab has two display modes: “Currently Off-Subnet” hosts (as expected, only those hosts that are currently using IP addresses outside the subnets configured on their Bridge Domain), and “Historically Off-Subnet Events”, which keeps a historical record of off-subnet events that have occurred in the past.

OffSubnet Endpoints
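The off-subnet check itself is easy to reason about. Here is a small Python sketch using the standard `ipaddress` module; the `is_off_subnet` helper and the sample addresses are my own illustration, not the app's code. Bridge Domain subnets are written here in gateway form (e.g., 10.1.1.1/24), which is why the sketch parses them with `strict=False`.

```python
import ipaddress


def is_off_subnet(endpoint_ip, bd_subnets):
    """True if `endpoint_ip` is outside every subnet in `bd_subnets`.

    `bd_subnets` is a hypothetical list of Bridge Domain subnets given as
    gateway-address/prefix strings, e.g. "10.1.1.1/24".
    """
    ip = ipaddress.ip_address(endpoint_ip)
    # strict=False lets us pass a gateway address rather than a network address.
    networks = [ipaddress.ip_network(s, strict=False) for s in bd_subnets]
    return not any(ip in net for net in networks)


bd_subnets = ["10.1.1.1/24"]  # illustrative Bridge Domain subnet

healthy_host = is_off_subnet("10.1.1.50", bd_subnets)
apipa_host = is_off_subnet("169.254.10.20", bd_subnets)

# The 169.254.x.x self-assigned range is the IPv4 link-local block.
apipa_is_link_local = ipaddress.ip_address("169.254.10.20").is_link_local
```

In this example the 10.1.1.50 host is on-subnet, while the VM that fell back to a 169.254.x.x address is reported as off-subnet.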

Stale Endpoints – The ability to Find Stale Endpoints and Clear them from the Fabric!

And certainly, we have saved the best for last. Anyone who has ever had a stale or “stuck” endpoint knows the headache involved, not only in tracking down the misbehaving endpoint entry in the fabric, but also in removing that stale entry to restore service.

Note – If you want more information on stale/stuck endpoint entries, please take a look at the ACI Fabric Endpoint Learning Whitepaper.

The “Stale Endpoints” tab has two options: one that lets you immediately identify endpoints which are currently stale, and one that lets you see Historically Stale Endpoints.

Stale Endpoints

Once a Stale Endpoint has been identified, just click on the Endpoint in question, and then click the “Clear Endpoint” button. This will issue an API call and clear the stale endpoint entry from all ACI Leaf switches in the fabric! Voila!

Clearing an Endpoint – part 1
Clearing an Endpoint – part 2
