Modern Day Icarus Story – Part 1 fsck

My goal this week was to configure VMware HCX in my home lab to migrate VMs between two VMware vCenter clusters. HCX simplifies application mobility and migration between clouds. Last week I successfully paired both sites, and I was ready to extend the network.

On Monday morning I discovered that my target site was inaccessible. I was disappointed, since this had worked the week before. Troubleshooting pointed to the TP-Link T1700G-28TQ switch in my home lab as a possible culprit. After pings failed, I unplugged the Ethernet cable connected to my target site router, and to my surprise the link light stayed on instead of going out. I quickly discovered that the management plane of the switch had crashed, while the data plane was still switching some, but not all, traffic. I rebooted the switch and the networking problem was solved. I successfully logged into the HCX target site, but I started to feel the heat from the sun melting the wax in my wings.

TP-Link switch at top of rack

I didn’t expect to run into new problems at the source site after I solved the target site networking problem. The management UIs for both NSX-T and vCenter Server at the source site weren’t accessible. I started to lose altitude as feathers came off my wings once I saw the dreaded write failures on the Linux consoles of both VMs. My home lab uses both VMware vSAN and NFSv3 on a QNAP NAS for storage. These critical VMs were stored on the QNAP NAS, which has only one network path, and that path runs through the failed switch. I wouldn’t have had any issues if I had stored these VMs on vSAN, since the vSAN servers are connected to two switches for redundancy in case of a single failure. After rebooting both management VMs I saw that the file systems were corrupted and the VMs were halted.
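
The single-path problem above is easy to check for ahead of time. Here is a minimal Python sketch of that reasoning, not something I actually ran in the lab: the backend and switch names (`qnap-nfs`, `vsan-hosts`, `switch-a`, `switch-b`) are hypothetical placeholders, and the inventory is typed in by hand rather than discovered from the network.

```python
# Hypothetical inventory: which switches each storage backend is cabled to.
# These names are illustrative placeholders, not the real lab hostnames.
UPLINKS = {
    "qnap-nfs": {"switch-a"},                # one path only
    "vsan-hosts": {"switch-a", "switch-b"},  # redundant paths
}


def affected_by(failed_switch, uplinks):
    """Return the backends that lose every network path if failed_switch dies."""
    return sorted(
        name
        for name, switches in uplinks.items()
        if switches and switches <= {failed_switch}
    )


def single_points_of_failure(uplinks):
    """Return the backends that depend on exactly one switch."""
    return sorted(
        name for name, switches in uplinks.items() if len(switches) == 1
    )


if __name__ == "__main__":
    print("Lost if switch-a fails:", affected_by("switch-a", UPLINKS))
    print("Single-path backends:", single_points_of_failure(UPLINKS))
```

Anything that shows up in the second list is a poor home for management VMs, unless a second uplink is added.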

VMware vCenter file system errors on console

I knew I wouldn’t crash and drown in the ocean below like Icarus once I was able to boot the VM and access the vCenter Server UI after cleaning the file system. I followed VMware knowledge base article 2149838, which describes the recommended approach with e2fsck.

Before taking an in-depth enterprise Linux class, I would have been anxious about editing the GRUB boot loader to change the boot target and clean the file system. However, these steps were now second nature to me, since I had to perform them from memory to pass the hands-on Linux certification associated with the class.

I haven’t managed my home lab like an enterprise environment; I took shortcuts to save time and money. I was lucky that fsck worked, since I didn’t have a backup of vCenter or of the distributed virtual switch (dvs). After this hard lesson I configured a vCenter backup schedule and exported the dvs configuration. My next blog post will go over the steps I took to recover the NSX management console and VM.

vCenter Backup UI