In this article, I would like to share a real story from a network operation: an aggregator switch swap. In short, the operation failed and the rollback proved very difficult. Building on this true scenario, I want to share some lessons learned and how to better prepare this kind of operation.
First, depending on your knowledge and experience in the networking field, you may guess what could motivate a switch swap in a production environment. Let me give you the core reasons in this case.
The topology can be described simply: a FastEthernet-based aggregator switch connected to five other switches. Three of them are access switches that connect end devices (computers, phones, printers, ...). The last two are the core backbone switches of the campus/enterprise network. Let me give you a picture for better clarification.
Now, here are the reasons:
- The aggregator switch (X) is old and end of life.
- The aggregator switch (X) has only FastEthernet ports, which deliver at most 100 Mb/s, whereas the three access switches have full Gigabit Ethernet ports. The statistics show that the trunk interfaces of switch (X) are saturated. The target is to have gigabit capacity on both sides of the trunk links.
- There are many errors on the two interfaces that connect this aggregator switch (X) to the core backbone switches.

These are just the main reasons. It is also important to mention that the aggregator switch (X) has only two gigabit SFP ports, which are used to interconnect it to the core switches (one port per core switch). It is better to have the same port capacity on each side of an interconnection. Below is a small picture of the architecture:
Second, I will explain what was done before the operation itself, i.e. during the planning phase:
- They went to the technical room and made sure the interconnection scheme/topology was the correct one.
- They found and labeled all cables on switch (X).
- They scheduled the operation after business hours, because replacing the aggregator switch would cause a service interruption.
- They notified all departments connected to one of the three access switches, so those departments could speed up their tasks and take some precautions.
- They checked the state (up or down) of all ports and the LED colors.
- The replacement for the old switch (X) is a newer model: a new Cisco 3650 was configured correctly by adapting the existing configuration to it. They basically updated the configuration for the new switch.
- A written plan stating each action in order was shared with the operation staff.

Once the planning phase was done, they waited for the operation window.
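The "check the state of all ports" step becomes much more useful if you record it as a baseline snapshot, so that both the migration and a possible rollback can be verified against known-good data instead of memory. Below is a minimal Python sketch of that idea; the `show interfaces status` sample text, port names, and helper functions are illustrative assumptions, not captures from the switches in this story.

```python
# Sketch: snapshot port states before the operation so the migration and
# the rollback can both be checked against a baseline. The sample below
# is illustrative Cisco IOS "show interfaces status"-style output.

SAMPLE_OUTPUT = """\
Port      Name        Status       Vlan    Duplex  Speed  Type
Fa0/1     ACCESS-SW1  connected    trunk   a-full  a-100  10/100BaseTX
Fa0/2     ACCESS-SW2  connected    trunk   a-full  a-100  10/100BaseTX
Gi0/1     CORE-SW1    connected    routed  a-full  a-1000 1000BaseSX SFP
Gi0/2     CORE-SW2    connected    routed  a-full  a-1000 1000BaseSX SFP
"""

STATUSES = {"connected", "notconnect", "disabled", "err-disabled"}

def snapshot(show_output: str) -> dict:
    """Map each port name to its status column."""
    states = {}
    for line in show_output.splitlines():
        fields = line.split()
        if not fields or fields[0] == "Port":
            continue                      # skip blanks and the header row
        status = next((f for f in fields if f in STATUSES), None)
        if status:
            states[fields[0]] = status
    return states

def diff(before: dict, after: dict) -> dict:
    """Ports whose state changed; these need investigation after rollback."""
    return {port: (before[port], after.get(port, "missing"))
            for port in before if after.get(port) != before[port]}

# Usage: compare the pre-change baseline with the state seen after rollback.
baseline = snapshot(SAMPLE_OUTPUT)
after_rollback = dict(baseline)
after_rollback["Gi0/1"] = "notconnect"    # the kind of symptom seen later
print(diff(baseline, after_rollback))     # -> {'Gi0/1': ('connected', 'notconnect')}
```

The point of the sketch is the workflow, not the parser: capture the table before touching anything, capture it again after each step, and let the diff tell you exactly which ports changed.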
In this section, let me explain what happened at the time of action. A member of the operation staff suggested not removing all the cables from the aggregator switch (X) and not pulling it from the rack. Instead, he proposed simply powering on the new switch without racking it (just placing it on a table near the rack), then moving the two fiber links from the old aggregator switch to the new 3650, to see whether those two critical links would come up. One of them did, and after five (5) minutes the second one did as well. The situation already looked strange, since these links are Layer 3 and run EIGRP: the synchronization time was far too long.

With the two links up, they decided to test the data and voice services by connecting a PC and an IP phone to the new switch before continuing the migration. Both got IP addresses and worked fine. Ten minutes later, the team suddenly noticed that one SFP port went down. They removed the SFP module and reinserted it; that port came back up, but the other SFP port went down. Since time was running out and they were facing an abnormal situation, they decided not to proceed with the migration.

Just after this decision to roll back, both links went down. They swapped the fiber patch cords and the SFP modules, but the ports stayed down. The last option was to perform the rollback by simply moving the two fiber links from the new 3650 back to the old switch. Unfortunately, after this action, the two SFP ports on the old switch were also down, and communication between switch (X) and the core switches was impossible. The staff was now anxious: something that had always worked no longer did, the time allotted for the operation was over, and the next day was a business day.
A solution had to be found immediately. It was clear that the problem did not come from the switch configurations; the suspects were the fiber cables, the SFP modules, or the switch ports. Whether they inserted the old or the new SFP modules, the switch detected them, but the result remained the same: the ports stayed down. They suspected the fiber run between the core switches and the aggregator switch, but at that hour, and given the fiber length, nothing could be done about it. After more than forty-five (45) minutes, the links were still down. What to do? What was the mistake? Either way, the service was interrupted because of their operation, and they were responsible: before the operation, the service was up despite the errors on the links.

While they were still thinking about a solution that evening, they noticed that the links were up again and the services (data and voice) were operational. Nothing had been touched, nothing had been modified, so what happened? The real question is: what should be done to avoid this kind of confusion? I answer with this short statement: always make sure that your rollback will succeed before scheduling this kind of operation (network equipment swapping). If your rollback requires a device reboot, reboot the equipment beforehand to analyze its behavior. If your rollback requires inserting an SFP, check the behavior by removing and re-inserting the SFP modules before the operation. The staff's one big mistake was failing to test each action involved in the rollback. Finally, take plenty of time to plan everything carefully when operating on a production network environment. I hope you enjoyed this reading and learned something useful from a real-world scenario. Thanks for reading, and stay tuned for future articles.
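The lesson about rehearsing the rollback can be expressed as a small plan runner in which every step carries its own verification check, so a failing check aborts the plan at a known point instead of leaving the team guessing. This is only a sketch of the idea: the step names, the simulated link state, and the simulated EIGRP failure below are hypothetical, chosen to mirror the incident.

```python
# Sketch: a change/rollback plan where each step must verify before the
# next one runs. Step names and the simulated states are illustrative.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    action: Callable[[], None]    # what the step does
    verify: Callable[[], bool]    # check that must pass before moving on

def run_plan(steps):
    """Run steps in order; stop at the first failed verification."""
    completed = []
    for step in steps:
        step.action()
        if not step.verify():
            return completed, step.name   # abort here: verification failed
        completed.append(step.name)
    return completed, None

# Usage: a simulated rollback where the physical move succeeds but the
# routing adjacency check fails, mirroring the incident in the story.
link = {"Gi0/1": "down"}
plan = [
    Step("move fiber back to old switch",
         action=lambda: link.update({"Gi0/1": "up"}),
         verify=lambda: link["Gi0/1"] == "up"),
    Step("confirm EIGRP adjacency",
         action=lambda: None,
         verify=lambda: False),           # simulated failure
]
done, failed_at = run_plan(plan)
print(done, failed_at)
```

Running the rehearsal with stubbed actions before the maintenance window forces you to write down a concrete check for every rollback step, which is exactly what this team was missing.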