The Future of Cluster Services

For those of you who don’t already know, VMware have a new feature coming out in their next version. I am lucky enough to be part of the Beta group that is busy testing it. The feature is called VMware Fault Tolerance. All of this info has already been made public and even demonstrated.

Fault Tolerance Demo - VMware Roadmap

In a nutshell: we all know that one of the greatest features available today with VI is HA (High Availability). If you do not know what HA is, then I suggest you start here. In short, HA removes your dependency on physical hardware. If your physical host goes down (hardware failure / power … you get the picture) then within a pre-defined timeout, the VMs that were running on that host will automatically be powered on across the other hosts in the cluster (if you have it configured correctly of course).

Fault Tolerance will come and either complement or replace Microsoft Clustering. This feature will protect your VM with a second VM, with everything down to the level of resident memory and mouse movements replicated between the primary and secondary machine. And you know what the best part of it all is? No need for cluster configuration and no dependence on any clustering solution; it is all built into your Virtual Infrastructure. This can be implemented on any supported VM OS, with no shared storage between VMs, and you are not limited to certain editions of software for the clustering solution.

So if you have HA, why would you need to use a Microsoft Cluster? (I hear you back there over in row 10..) Well, yes you could argue that HA covers most cases of failure, but not all.

HA will not kick in if there is an OS failure, only a host failure (you could probably achieve that with Virtual Machine Monitoring, but that is still an experimental feature).

If the host goes down, then your OS will reboot on a new host. Now how healthy is that for databases that did not close properly? I would think - not … that … good!

If your host goes down then your server / application will be unavailable until the OS comes back up. Now I do not want to picture you when your Exchange/SQL server goes down and your whole organization is waiting for it to come back up (Truth be told - it can be pretty quick - sometimes less than 120 seconds, but no-one likes downtime. Especially not Management). So any good Exchange/SQL Admin will plan in advance for this kind of scenario and cover that with a redundant Cluster.
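To put that "less than 120 seconds" into perspective, here is a quick back-of-the-envelope sketch. The 120 second figure is the HA restart time mentioned above; the number of host failures per year is purely a made-up assumption for illustration, and the function names are my own.

```python
# Toy arithmetic only: the failure counts are hypothetical, not measurements.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def yearly_downtime_minutes(failures_per_year: int, downtime_per_failure_s: float) -> float:
    """Return the total downtime per year, in minutes."""
    return failures_per_year * downtime_per_failure_s / 60


def availability(failures_per_year: int, downtime_per_failure_s: float) -> float:
    """Return the fraction of the year the service is up."""
    downtime_s = failures_per_year * downtime_per_failure_s
    return 1 - downtime_s / SECONDS_PER_YEAR


if __name__ == "__main__":
    # 120 s per failure is the HA restart time quoted in the post.
    for failures in (1, 3, 12):
        minutes = yearly_downtime_minutes(failures, 120)
        uptime = availability(failures, 120)
        print(f"{failures} host failure(s) a year: {minutes:.0f} min down, {uptime:.5%} up")
```

The numbers look small on paper, but this only counts the reboot itself. It says nothing about the state of a database that did not close properly, which is the other half of the argument.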

Now there are several things here (let’s take Microsoft for example).

  1. You will need a Windows Enterprise Server license (actually two of them)
  2. Most probably (depending on the technology) also shared storage between the two hosts.
  3. You will need two Physical servers.
  4. If your OS does the Harakiri - the other cluster node will take ownership of the shared resources, and will kick in with the minimum amount of downtime.

You can see where I am going here, can’t you?


Now of course there are limitations with FT, and not all of them can be divulged at the moment, but I gather the product will only get better in the future.

So what do you think? Will Fault Tolerance replace a decent amount of the clustered systems we use today? Or will we, as VI Admins, provide even higher availability for our virtual servers than we do today and continue using physical clusters for our critical services?