This was a weird one that hit me today.
I had a performance issue on a server.
esxtop was the first thing I looked at, and I got this:
ID  GID  NAME  NWLD  %USED  %RUN   %SYS  %WAIT   %RDY
69  69   VSE   5     83.98  85.74  0.00  380.16  22.14
So I looked up which machine it was:
And the result I got was nuddah!!
[[email protected] root]# vm-support -x | grep VSE
vmid=1428 VSE
[[email protected] ]# ps -efww | grep VSE
root 4426 1 0 Feb23 ? 00:00:06 /usr/lib/vmware/bin/vmkload_app …… …… a/**VSE/VSE.vmx**
root 4476 3756 0 12:27 pts/2 00:00:00 grep VSE
So there was a running VM – or so it seemed.
I ran the same steps on the other host in the cluster:
ID  GID  NAME  NWLD  %USED  %RUN   %SYS  %WAIT   %RDY
59  59   VSE   5     11.74  11.96  0.00  487.23  6.94
[[email protected] root]# vm-support -x | grep VSE
vmid=1410 VSE
[[email protected] root]# vmware-cmd -l | grep VSE
[[email protected] root]#
OK, so what was going on here? Looking at the details of the machine, I saw that the name of the VM had no correlation to the actual folder it was in.
Looking for the machine again, this time by its folder name:
[[email protected] root]# vmware-cmd -l | grep CSG1
/vmfs/volumes/…………a/CSG1/CSG1.vmx
[[email protected] ]# vmware-cmd -l | grep CSG1
[[email protected] ]#
OK. So I had now found the machine named VSE running on dmz2, but I still had a process running on dmz1 that was taking up CPU:
[[email protected] ]# ps -efww | grep VSE
root 4426 1 0 Feb23 ? 00:00:06 /usr/lib/vmware/bin/vmkload_app ... /**VSE/VSE.vmx**
root 4476 3756 0 12:27 pts/2 00:00:00 grep VSE
I looked into the folder itself
[[email protected] ]# ls -la /vmfs/volumes/……a/VSE/
total 23413952
drwxr-xr-x 1 root root         980 Mar 17 11:39 .
drwxr-xr-t 1 root root        2380 Mar 15 11:21 ..
-rw------- 1 root root        2510 Mar 17 11:35 vmdumper.png
-rw------- 1 root root 23573652480 Feb 23 23:53 VSE_1-flat.vmdk
-rw------- 1 root root   268435456 Feb 23 23:53 VSE-6785c36f.vswp
-rw------- 1 root root   131604480 Feb 23 23:53 VSE-flat.vmdk
-rwxr-xr-- 1 root root        1960 Feb 24 02:04 VSE.vmx
[[email protected] ]#
As you can see, all the files were old, and this looked like a phantom machine.
Time to kill the process on dmz1
I have the wid (WorldID) from before – 1428
[[email protected] ]# less /proc/vmware/vm/1428/cpu/status
The master world ID for this process is in the output: it is the number after "vm." in vm.XXXX
(the 4 digits; in my case it was 1427)
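If you would rather not eyeball the output, the lookup can be scripted. This is only a rough sketch: it assumes the status output contains a token of the form vm.XXXX as described above, and the function name master_world_id is made up for illustration.

```shell
#!/bin/sh
# master_world_id (hypothetical helper): read text on stdin and print the
# digits that follow "vm." on the first line that contains such a token.
# Assumption: the cpu/status output includes the group as "vm.<digits>".
master_world_id() {
    sed -n 's/.*vm\.\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Intended use on the host (not run here):
#   master_world_id < /proc/vmware/vm/1428/cpu/status
```

Check the number it prints against the raw output before passing it to a kill command.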
Then kill the process
[[email protected] ]# /usr/lib/vmware/bin/vmkload_app -k 9 1427
Warning: Mar 17 12:37:04.706: Sending signal '9' to world 1427.
The process was gone and no longer using a full processor on nothing:
[[email protected] ]# ps -efww | grep VSE
root 4785 3756 0 12:37 pts/2 00:00:00 grep VSE
Just to be on the safe side I took a vm-support snapshot of the VMID before the whole process – maybe I can find something out about the problem later on.
How the phantom happened I am still not sure. What worries me more is how this can be detected in the future, so that I do not have to wait for a problem to arise before finding these things out.
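One possible starting point, as a rough sketch rather than a proven solution: compare the .vmx paths of the running vmkload_app processes (from ps -efww) with the VMs registered on that host (from vmware-cmd -l), and flag anything that is running but not registered. The function names and the /tmp file names are made up for illustration. It assumes both commands print the .vmx path in the same form, and that the path contains no spaces.

```shell
#!/bin/sh
# list_running_vmx (hypothetical helper): read `ps -efww` output on stdin and
# print every field ending in .vmx from vmkload_app lines, skipping grep lines.
list_running_vmx() {
    awk '/vmkload_app/ && !/grep/ {
        for (i = 1; i <= NF; i++) if ($i ~ /\.vmx$/) print $i
    }' | sort -u
}

# find_phantoms (hypothetical helper): $1 is a file holding `ps -efww` output,
# $2 is a file holding `vmware-cmd -l` output. Prints each running .vmx path
# that has no exact matching line in the registered list.
find_phantoms() {
    ps_file=$1
    registered_file=$2
    list_running_vmx < "$ps_file" | while read -r vmx; do
        if ! grep -qxF "$vmx" "$registered_file"; then
            echo "$vmx"
        fi
    done
}

# Intended use on each host (not run here):
#   ps -efww > /tmp/ps.out
#   vmware-cmd -l > /tmp/registered.out
#   find_phantoms /tmp/ps.out /tmp/registered.out
```

In the case above, this would have listed the VSE/VSE.vmx process on dmz1, since vmware-cmd -l returned nothing for VSE on that host. Run from cron, any output could be mailed out as a warning.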
I would be interested in hearing your comments or suggestions as to how to address the above question.