Disconnected ESX Host

Got a call today.


All VMs on an ESX host had just gone grey – all disconnected.

Troubleshooting steps:

  1. Ping ESX host Service Console – All ok

  2. Check the host in the VI Client – NOT OK – all machines are greyed out (hey, that is what they said, wasn’t it?).

  3. SSH into the Service console - All ok

  4. Connect the VI Client directly to the host – NOT OK. It could not load the inventory.

  5. All VMs on the host were running and responding to ping.

  6. No failover was initiated in the cluster.

  7. On the console I saw 7 vmware-hostd processes, each using a lot of RAM.

  8. Ran service mgmt-vmware stop to stop the management service – it GOT STUCK.

  9. Off to this KB, which helped me stop the service and get the host responsive again.

        cd /var/run/vmware
        ls -l vmware-hostd.PID watchdog-hostd.PID   # check that the PID files are there
        cat vmware-hostd.PID                        # prints the current PID, e.g. 1234
        kill -9 <PID>                               # replace <PID> with the number printed by cat
        rm vmware-hostd.PID watchdog-hostd.PID      # remove the stale PID files
        service mgmt-vmware start                   # restart the agent
  10. The host came back online – the VMs were no longer grey.
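
If this ever has to be done again, the manual steps from step 9 could be wrapped in a small script. What follows is only a rough sketch, not something taken from the KB: the function names kill_from_pidfile and force_restart_hostd are made up for illustration, and the directory and file names are the ones used above. It checks that the PID file really contains a number before killing anything.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, paths are from the steps above.

# Kill the process named in a PID file, then remove the file.
kill_from_pidfile() {
    pidfile="$1"
    if [ ! -f "$pidfile" ]; then
        echo "no PID file: $pidfile"
        return 0
    fi
    pid=$(cat "$pidfile")
    # Refuse to continue if the file holds anything other than digits.
    case "$pid" in
        ''|*[!0-9]*)
            echo "unexpected contents in $pidfile" >&2
            return 1
            ;;
    esac
    # kill -0 sends no signal; it only tests whether the process exists.
    if kill -0 "$pid" 2>/dev/null; then
        kill -9 "$pid"
    fi
    rm -f "$pidfile"
}

# Same order as the manual steps: kill hostd, remove both PID files,
# start the management agent again.
force_restart_hostd() {
    run_dir="${1:-/var/run/vmware}"
    kill_from_pidfile "$run_dir/vmware-hostd.PID" || return 1
    rm -f "$run_dir/watchdog-hostd.PID"
    service mgmt-vmware start
}
```

Calling force_restart_hostd with no argument works on /var/run/vmware; passing another directory makes it easy to try the PID file handling somewhere harmless first.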

Here start my questions.

  1. Why did this happen?

    I started digging into the logs and found a gap of about 20 minutes in the system logs, which is really strange.

  2. It seems this happened after a snapshot removal.


  3. I have opened an SR with VMware to get to the bottom of this issue.
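
For the log gap mentioned in question 1, a gap like that can be located without scrolling through the file by hand. This is a minimal sketch under some assumptions: the log uses classic syslog timestamps ("Jan 12 10:01:00 host ..."), all lines fall within the same month, and the function name find_log_gaps is made up. Which log file to point it at is also left open here.

```shell
#!/bin/sh
# Sketch only: assumes "Mon DD HH:MM:SS" at the start of every line
# and does not handle month or year rollover.

# $1 = log file, $2 = smallest gap to report, in seconds (default 600)
find_log_gaps() {
    awk -v min="${2:-600}" '
    {
        # Skip lines whose third field is not HH:MM:SS.
        if (split($3, t, ":") != 3) next
        stamp = $1 " " $2 " " $3
        now = $2 * 86400 + t[1] * 3600 + t[2] * 60 + t[3]
        if (seen && now - prev >= min)
            printf "gap of %d s between \"%s\" and \"%s\"\n", now - prev, prev_stamp, stamp
        prev = now
        prev_stamp = stamp
        seen = 1
    }' "$1"
}
```

Running find_log_gaps on a log file with the default threshold lists every silence of ten minutes or more, which should make a 20 minute hole stand out along with the timestamps on either side of it.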