Upgrading ACI Fabric and MSO: Please Read This First

This article points out items you should pay attention to before upgrading your ACI Fabric and MSO controllers. The items listed here come from first-hand experience: I was pulled in to help customers upgrade and, in some cases, into escalations where customers had a “not such a smooth upgrade” because they had not done their due diligence before attempting to upgrade.

From the 5.x release of ACI software onward, most of these checks are built into the APIC OS itself, which makes upgrades much smoother (as long as you don’t ignore the warning checks). Even so, it’s worth a short discussion on this topic.

Previously, Jody wrote some excellent ACI Upgrade Guidance documents that are still very relevant. I am not going to discuss every step here, since Jody has already gone through all of that in his articles. Instead, this is more of an addendum to them.

Please go through these articles first.

  1. ACI Upgrade Tool
  2. Upgrading your ACI Fabric

List of items you need to pay attention to in order to ensure a smooth ACI/MSO upgrade:

  1. The very first thing you should do is go to the Cisco ACI Upgrade Checklist. This site should be your de facto go-to site before upgrading.
    1. Go through every item in that checklist, and check each one off after reading and verifying it.
    2. Please don’t skip reading the release notes (listed on that site), either.
    3. Go through the hardware support list there for the release you are upgrading to and verify that you don’t have unsupported hardware.
    4. Determine the appropriate upgrade path. If you are on a very old release, it may well be necessary to go through multiple hops to reach the desired release.
  2. Back up your APIC and MSO to a safe remote location (you can back up locally and then scp/ftp to a safe location). Do not keep the backup only on the local controllers.
    1. DO NOT back up your APIC or MSO without AES Encryption enabled. If things go sour and you need to restore, backups taken without AES Encryption will be useless for the most part.
    2. If you already have AES Encryption turned on but you don’t know your AES Encryption passphrase, set a new passphrase before you take the backup. Make sure to save the passphrase somewhere you can look it up if you can’t remember it. I personally always use a simple phrase that I can remember, like “This is my very important backup”. I’m not suggesting that you use that phrase, but you need to remember your passphrase.
  3. Remember that the basic order of operations for an upgrade is:
    1. MSO
    2. APICs
    3. Leaves/Spines of physical Fabric
  4. If you have an ACI Multi-Site configuration (physical fabrics, cloud fabrics, or a combination), please make sure that you upgrade your MSO first. This is particularly important (mandatory) if you are upgrading from ACI Fabric version 5.0.2x to a higher version like 5.1.2e and have Cloud ACI sites in the mix. ACI Fabric version 5.1.2e requires MSO Release 3.0.2x to 3.1.1x or higher.
    1. A major change was made in the 5.1.2e release with respect to the BGP EVPN peer. Instead of using an IP from the External Subnet Pool, the private IP of the CSR is used to establish BGP sessions. If you are upgrading from a pre-5.1.x release to 5.1.x or later and you don’t upgrade your MSO first, your cloud sites will not come up.
    2. If you are using RADIUS for user authentication, be aware that the format of the AVPair that RADIUS gives out to users needs to be slightly modified. If you don’t do this and your RADIUS server gives out AVPairs for both APIC and MSO, your users may not be able to log in to the upgraded MSO when the old-style MSO AVPair is listed after the APIC AVPair. This change was made for SSO support, which is how it works on MSO running on ND (Nexus Dashboard, which is basically a rebranded CASE from version 2.0 onwards). My suggestion is to use the newer AVPair format and combine the APIC and MSO AVPairs in a single line. You can read about this in: Cisco ACI Multi-Site Configuration Guide, Release 3.1(x)

Example AVPair on FreeRADIUS:

Before 3.1.1x:

support Crypt-Password := "7GZlyJ1vpBbMA"
Cisco-AVPair = "shell:domains =all/custom-privilege/", # This is the APIC AVPair
Cisco-AVPair += "shell:msc-roles=powerUser/",  # This is the MSO AVPair (old style). Notice that the old-style attribute is msc-roles and that the APIC AVPair comes before the MSO AVPair
Cisco-AVPair += "shell:roles=customrole",
Cisco-AVPair += "shell:priv-lvl=7"

From 3.1.1x and later:

support Crypt-Password := "7GZlyJ1vpBbMA"
Cisco-AVPair = "shell:domains =all/custom-privilege/,msoall/powerUser/",  # Combined APIC + MSO AVPair; the MSO domain has to be msoall
Cisco-AVPair += "shell:roles=customrole",
Cisco-AVPair += "shell:priv-lvl=7"

If you do not follow the above rules, your users will see the error "Authentication domain does not exist" when trying to log in to the MSO with their credentials, as you can see in the figure below.
Figure 1
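
As a rough pre-upgrade check, the AVPair rule above can be sketched in Python. This is only an illustration under stated assumptions: the function name `lint_avpairs` and its warning texts are hypothetical, and the function only scans AVPair strings that you paste in from your RADIUS users file. It does not talk to the RADIUS server, the APIC, or MSO.

```python
def lint_avpairs(avpairs):
    """Return warnings for one user's Cisco-AVPair values (hypothetical helper).

    avpairs: the AVPair strings, in the order they appear in the users file.
    """
    warnings = []
    # Tolerate stray spaces such as "shell:domains =all/..."
    normalized = [pair.replace(" ", "") for pair in avpairs]

    # Old-style MSO AVPair: a separate shell:msc-roles attribute.
    for index, pair in enumerate(normalized):
        if pair.startswith("shell:msc-roles="):
            warnings.append(f"AVPair {index}: old-style MSO format (shell:msc-roles)")

    # New style: the MSO role is carried as an msoall/ entry inside shell:domains.
    domain_pairs = [p for p in normalized if p.startswith("shell:domains=")]
    if domain_pairs and not any("msoall/" in p for p in domain_pairs):
        warnings.append("no msoall/ entry in shell:domains (expected for MSO users)")
    return warnings


old_style = [
    "shell:domains =all/custom-privilege/",
    "shell:msc-roles=powerUser/",
    "shell:roles=customrole",
    "shell:priv-lvl=7",
]
new_style = [
    "shell:domains =all/custom-privilege/,msoall/powerUser/",
    "shell:roles=customrole",
    "shell:priv-lvl=7",
]

for label, pairs in (("old", old_style), ("new", new_style)):
    print(label, lint_avpairs(pairs))
```

The second warning is only meaningful for users who actually need MSO access; an APIC-only user will legitimately have no msoall/ entry.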
  5. Some items to verify for the APIC upgrade:
    1. Overlapping VLAN pools, combined with EPGs that have multiple bindings referencing multiple physical domains. This causes inconsistencies in FD VLAN allocation, which can cause endpoint connectivity problems after the upgrade. Things might have been working before (based on the order of configuration), but after the upgrade and leaf reboot, VLANs might get allocated from the wrong pools and cause an outage. Unless you are using port-local VLANs, this sort of configuration is incorrect and you should avoid it. If you see that you have overlapping VLANs, do not upgrade until you fix them. You can read more about this in the following article: Overlap VLAN Pool Lead Intermittent Packet Drop to VPC Endpoints and Spaning-tree Loop.
    2. It goes without saying that endpoints should be dual-homed with VPC. If not, you should expect an outage. There should be at least 2 spines (odd/even IDs) in a production network. Upgrades should be done in odd/even groups of leaves and spines. The same applies to L3Outs: you should have redundant L3Outs on odd and even border leaves. In one interesting situation I was involved with, the DNS server resided in the ACI Fabric and the customer had 2 border leaves for the redundant L3Out. However, both border leaves had even leaf IDs, so both were in the even group. The customer had not paid attention to that, and during the upgrade of the even group of leaves, DNS resolution failed, causing an application outage.
    3. Please check the prefixes defined in your L3Out External EPGs and the scopes defined on those prefixes, especially if you are using Shared L3Outs or Transit L3Outs. In very old software (2.x release) the scope checks were not as stringent, so incorrect scopes would still work; after the upgrade they stop working, causing an outage. Please see: Understanding Scope of Prefixes in L3Out External EPGs.
    4. Make sure you have console connections to all APICs (CIMC), leaves, and spines. If there is an issue, or the upgrade seems hung and you want to check its status, you will need to go in from the console to investigate.
    5. Before downloading images to the APIC, make sure to delete older images (APIC and switch) that you are no longer using. Keep the image that the fabric is currently running and delete the previous images. This ensures that you will not run out of space.
    6. Rogue Endpoint Detection. Some folks like to turn off the Rogue EP Detection feature (if it was turned on) before the upgrade and enable it again afterwards. Generally, if Rogue Endpoint Detection is already turned on and you are satisfied with it, I don’t feel this is necessary. However, if you do turn it off before the upgrade and intend to turn it back on, first make sure that you don’t have flapping endpoints. If you do have flapping endpoints (generally an endpoint configuration/connectivity issue), fix that before turning Rogue Endpoint Detection back on. I suggest using the docker-based Endpoint Tracker to determine whether you have flapping EPs. (Don’t use the ACI App version, as it does not give you as good visibility as the OVA-based or docker-based one.) You can get the location of this EP tracker at: EP Tracker Info. It should take you no longer than 5 minutes to get the docker-based EP tracker up and running.
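
The overlapping VLAN pool check described in the list above can be approximated with a small Python sketch. It assumes you have already exported your VLAN pools into a dictionary of pool name to inclusive (first, last) VLAN ranges; the function name `find_overlapping_pools` and the sample pool names are made up for illustration. It only flags candidate overlaps between pools. Whether an overlap is actually dangerous still depends on EPGs being bound to multiple domains that reference those pools, which this sketch does not inspect.

```python
from itertools import combinations


def find_overlapping_pools(pools):
    """Return (pool_a, pool_b, first_vlan, last_vlan) for every overlapping range.

    pools: dict mapping a pool name to a list of inclusive (first, last) VLAN ranges.
    """
    overlaps = []
    # Compare every pair of pools exactly once, in a stable (sorted) order.
    for (name_a, ranges_a), (name_b, ranges_b) in combinations(sorted(pools.items()), 2):
        for a_first, a_last in ranges_a:
            for b_first, b_last in ranges_b:
                low = max(a_first, b_first)
                high = min(a_last, b_last)
                if low <= high:  # the two inclusive ranges share at least one VLAN
                    overlaps.append((name_a, name_b, low, high))
    return overlaps


# Hypothetical pools, for illustration only.
sample_pools = {
    "pool_bm": [(100, 199)],
    "pool_vmm": [(150, 250)],
    "pool_l3": [(300, 310)],
}

for name_a, name_b, low, high in find_overlapping_pools(sample_pools):
    print(f"{name_a} and {name_b} overlap on VLANs {low}-{high}")
```

With the sample data, only pool_bm and pool_vmm are reported, for VLANs 150-199.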

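The odd/even redundancy point from the list above (the DNS outage example) lends itself to a similar sanity check. The sketch below assumes that upgrade groups are split strictly by odd and even node IDs and that you supply the border-leaf node IDs for each L3Out yourself; the function name and the L3Out names are hypothetical.

```python
def l3outs_without_group_redundancy(l3out_border_leaves):
    """Return the L3Outs whose border leaves all fall in the same odd/even group.

    l3out_border_leaves: dict mapping an L3Out name to a list of border-leaf node IDs.
    """
    at_risk = []
    for name, node_ids in sorted(l3out_border_leaves.items()):
        groups = {"even" if node_id % 2 == 0 else "odd" for node_id in node_ids}
        if len(groups) < 2:
            # Every border leaf reboots in the same upgrade window.
            at_risk.append(name)
    return at_risk


# Hypothetical layout: l3out_dns mirrors the mistake described above.
sample_l3outs = {
    "l3out_dns": [102, 104],
    "l3out_wan": [101, 102],
}

print(l3outs_without_group_redundancy(sample_l3outs))
```

With the sample data, l3out_dns is flagged because both of its border leaves have even IDs, while l3out_wan spans both groups.
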
Conclusion: Remember, if you do your due diligence before the upgrade and follow the correct procedures, you should have zero impact on production traffic.

