Nutanix is one of the few hyperconverged solutions to offer synchronous mirroring and metro-cluster capability.
So if you are concerned about losing a whole site with multiple Nutanix nodes, the short answer for protecting against site failures is: use the Metro Availability capability.
Minimum requirements for Metro Availability:
- two three-node Nutanix clusters, i.e. one three-node Nutanix cluster per site
- a (redundant) IP network with less than 5 ms round-trip time between the sites.
If you plan to mirror all the data (and virtual machines), use the same node models across the sites so that you have enough capacity to mirror everything. If you plan to mirror only a portion of your data (and virtual machines), you can use different node models between the sites. End result: two separate Nutanix clusters in different geographical locations synchronously replicating all or a portion of the data over an IP network (mirroring is done at container/datastore level). With Metro Availability you can achieve a recovery point objective (RPO) of zero, i.e. you won’t lose any data in case of a site failure (provided that you have mirrored everything).
If you need to lower your recovery time objective (RTO), i.e. the time it takes to start services on the surviving site, you can do that by automating site failover: add a third location with sufficient network connectivity and the capability to run a tie-breaker or witness virtual machine.
If you can live with a recovery point objective (RPO) other than zero, then there are more options: asynchronous replication and/or plain tape backup with a third-party backup application.
Availability Domains a.k.a. Block Awareness
From time to time I get to customer meetings where they have done some studying by themselves by reading www.nutanixbible.com, an excellent resource for explaining the inner workings of a Nutanix system. They have found “Availability Domains” a.k.a. “Block Awareness”. Maybe they have a limited budget and cannot afford all the gear required (six nodes) for synchronous mirroring with Metro Availability. So in order to shave some costs they come up with the idea of splitting a single (three-node) Nutanix cluster over two or more physical locations by taking advantage of the “Availability Domains” feature.
Let’s first explore the different levels of awareness.
- Node awareness: By design Nutanix is node aware and writes only one replica of data locally. The other replicas are written to other nodes in the cluster.
- Block awareness: Each block can contain one to four nodes. If you have enough blocks, your cluster can also be block aware. Each block has a serial number, which can be used to determine in which block a node resides. With block awareness, replicas are not written to nodes that reside in the same block as the original. This way, with a large enough cluster (a minimum of three blocks and enough nodes), you can lose a complete block with multiple nodes even when using only replication factor 2 (RF2) for data.
- Rack awareness: As far as I know there is no rack serial number recorded within the Nutanix configuration, so there is no way to distinguish in which rack a node or block resides. Since a rack isn’t really an active component with a separate IP address or such, there is no easy way to identify a rack other than manually assigning blocks to different racks. Manual operations are a source of mistakes and mayhem.
- Site awareness: Likewise, there is no serial number or other identifying information for a site that can be automatically and reliably determined. Assigning a site identity would have to be a manual operation.
So there is a difference between block awareness and rack/site awareness: you can automatically detect in which block a node resides, but you cannot automatically determine in which rack or site it resides.
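To make the difference concrete, here is a minimal sketch of what block-aware replica placement boils down to. This is illustrative Python, not Nutanix code, and the node and block names are made up; the point is simply that the block serial is discoverable by the software, while a rack or site attribute would have to be supplied by hand.

```python
# Minimal sketch of block-aware replica placement (illustrative only, not Nutanix code).
# Each node knows the serial number of the block it sits in, so the cluster can avoid
# putting both RF2 replicas into the same block. There is no comparable, automatically
# discoverable "rack" or "site" attribute, which is why rack/site awareness would
# require manual input.
from dataclasses import dataclass
import random

@dataclass(frozen=True)
class Node:
    name: str
    block_serial: str  # discoverable from the hardware

def place_replicas(local: Node, peers: list[Node], rf: int = 2) -> list[Node]:
    """Pick targets for the remaining rf-1 replicas, avoiding the local node's block."""
    candidates = [n for n in peers if n.block_serial != local.block_serial]
    if len(candidates) < rf - 1:
        raise RuntimeError("not enough blocks for block awareness")
    return [local] + random.sample(candidates, rf - 1)

nodes = [Node("A1", "BLK-A"), Node("B1", "BLK-B"), Node("C1", "BLK-C")]
print(place_replicas(nodes[0], nodes[1:]))  # second copy lands in BLK-B or BLK-C
```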
Let’s explore a few possible scenarios and try to find out why using block awareness to split the nodes of a single Nutanix cluster between sites might not be such a good idea. (Most likely it is not supported anyway.)
Single two-node Nutanix cluster split between two sites
The minimum size for a Nutanix cluster is three nodes. Why is that? It comes down to “quorum”. If you only had two nodes, you wouldn’t have “quorum” or witness capability.
Let’s say one of the nodes somehow got corrupted while the other remained intact. You would have one copy of “bad” data and one copy of “good” data. Without a witness or “quorum”, how would you know which one is “correct”? With three nodes you can lose or corrupt one node, and the end result would be one copy of “bad” data and two copies of “good” data. Two “good” copies can vote out a single “bad” copy, so the cluster remains healthy and can continue to operate.
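The “quorum” argument is just majority arithmetic. A tiny sketch of the same reasoning (illustrative Python, not Nutanix code):

```python
# Quorum is simple majority arithmetic: more than half of the cluster's nodes
# must be healthy and in agreement for the cluster to keep running (illustrative only).
def has_quorum(healthy_nodes: int, cluster_size: int) -> bool:
    return healthy_nodes > cluster_size / 2

print(has_quorum(1, 2))  # False -> with two nodes, one survivor cannot outvote the other
print(has_quorum(2, 3))  # True  -> two "good" copies outvote a single "bad" copy
```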
Picture: failure scenarios for a two-node Nutanix cluster split between two sites
Single three-node Nutanix cluster split between two sites
Now let’s say that you had three Nutanix nodes, each in a separate Nutanix block (or chassis). The minimum requirement of three blocks for block awareness is fulfilled. The minimum requirement of three nodes per cluster is fulfilled. And you wanted to split them over two different sites.
SiteA would have one node / one block, SiteB would have two nodes / blocks. If SiteA, with only one node, went down, you would still have two nodes on SiteB and “quorum”, i.e. a majority of cluster nodes, available. No problem: the Nutanix cluster would remain operational, hypervisor HA would kick in and start the VMs from SiteA on the remaining nodes on the surviving SiteB. In this optimal failure mode, you would be able to split your cluster over two separate physical locations and withstand a site failure.
However, we are all familiar with Murphy’s Law, and failures typically occur in a non-optimal way. If you lost SiteB with its two nodes, the one remaining node on SiteA wouldn’t have “quorum” and wouldn’t be able to keep the cluster running: not only would the VMs on SiteB go down, the VMs on SiteA would go down as well.
Picture: failure scenarios for a three-node Nutanix cluster split between two sites
On top of that, with three nodes you can only use Replication Factor 2 (RF2), which is designed to withstand the loss of a single node at a time. In case of a multi-node failure, you would lose some data, since both replicas of some of the data would be lost.
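A quick back-of-the-envelope check of the two possible site failures (illustrative Python; the site names simply match the example above):

```python
# Enumerate single-site failures for a three-node cluster split 1 + 2 over two sites:
# losing the two-node site takes the majority with it (illustrative only).
layout = {"SiteA": 1, "SiteB": 2}
total = sum(layout.values())

for failed_site, lost in layout.items():
    surviving = total - lost
    print(f"lose {failed_site}: {surviving}/{total} nodes left, quorum={surviving > total / 2}")

# lose SiteA: 2/3 nodes left, quorum=True
# lose SiteB: 1/3 nodes left, quorum=False  -> 1 out of 2 site failures is fatal
```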
Conclusion:
Splitting a single Nutanix cluster over two sites isn’t such a good idea. You would have a 50% chance of total cluster failure, combined with some data loss, in case of a site failure.
Single three-node Nutanix cluster split over three sites
Wait!! You said that the minimum size of a Nutanix cluster is three nodes; what if I split them over three sites? With one node per site, one could lose a site and still have two nodes / sites remaining, the cluster would remain operational and there wouldn’t be any data loss. Correct, that could work. Let’s dig in a little further; maybe there is a reason not to do it.
Picture: failure scenarios for a three-node Nutanix cluster split over three sites
Running three sites with redundant networks spanning the sites can be quite costly, so it might actually be cheaper to build a two-site Metro Availability solution. But let’s put economics aside and say you already have three sites with sufficient network connectivity.
You also have to consider network partitioning: if you lose connectivity between the sites, each site would have 1/3 of the nodes available and none of the sites would have “quorum”. This would result in shutting down the whole cluster, even though there wasn’t really a problem with the nodes themselves.
There are also things to consider when you want to expand your three-site cluster. After the initial three nodes you can expand a single-site Nutanix cluster one node at a time. With the nodes split over three sites, you can’t add just one node to one site and still be guaranteed “quorum” after a site failure.
Single four-node Nutanix cluster split over three sites
Let’s say you expanded SiteA to have two nodes, while SiteB and SiteC each had one node. If you lost or corrupted SiteB (or SiteC), you would have two nodes on SiteA and one node on SiteC (or SiteB), i.e. 3 out of 4 nodes still intact. No problem: the Nutanix cluster would remain operational and hypervisor HA would start the VMs on the surviving nodes. But if SiteA went down or got corrupted, you wouldn’t have quorum or a majority of nodes available any more: 50% (2 out of 4) of the nodes are gone and 50% (2 out of 4) are operational, with no way to determine which half is bad and which is good, so the cluster would shut down.
Picture: failure scenarios for a four-node Nutanix cluster split over three sites
So in this case there is a 33% chance of total cluster failure in case of a site failure.
Single five-node Nutanix cluster split over three sites
So in order to maintain “quorum” through a site failure, you would have to expand the cluster by adding at least two nodes (a quick check of the arithmetic follows the lists below):
- SiteA: one node
- SiteB: two nodes
- SiteC: two nodes
If you lose:
- SiteA, you would have 4/5 nodes available and “quorum”
- SiteB, you would have 3/5 nodes available and “quorum”
- SiteC, you would have 3/5 nodes available and “quorum”
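The same majority arithmetic, wrapped in a small helper and applied to the four-node (2/1/1) and five-node (1/2/2) layouts discussed above (illustrative Python; the site names are just labels):

```python
# Which single-site failures leave less than a majority of nodes standing?
# Illustrative only; the layouts match the four-node and five-node examples above.
def fatal_site_failures(layout: dict[str, int]) -> list[str]:
    total = sum(layout.values())
    return [site for site, lost in layout.items() if (total - lost) <= total / 2]

print(fatal_site_failures({"SiteA": 2, "SiteB": 1, "SiteC": 1}))  # ['SiteA'] -> 1 of 3 site failures is fatal (~33%)
print(fatal_site_failures({"SiteA": 1, "SiteB": 2, "SiteC": 2}))  # []        -> quorum survives any single site failure
```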
Picture: Failure scenarios for a five-node Nutanix cluster split over three sites
It might be a better idea to add three nodes instead of two to keep the sites balanced. Some skew in the number of nodes per site is allowed, but there are limitations on how much; this limitation is related to the metadata service (Cassandra) replication rings and maintaining “quorum” within those. To be on the safe side, it is better to keep the sites uniform.
You could continue expanding your cluster by adding a node per site in each expansion until your blocks are full. Depending on the model used, you can have one to four nodes per block.
So what happens when your blocks get full? To be on the safe side you would have to add an additional site. With multiple blocks sharing a site, you could end up in a situation where Nutanix honors block awareness, writing replicas into different blocks, but into blocks that reside at the same site. Should you lose that site, you could lose data, as both replicas of some of the data might be gone.
As far as I know there isn’t a “rack” or “site” construct in Availability Domains. With such constructs one could expand beyond one block per site while maintaining only three sites. But since there isn’t really an easy way to implement those without manual intervention, it is unlikely that Nutanix will implement such constructs and rack and/or site awareness.
Conclusion:
While in theory you could build a single three-node Nutanix cluster and split the nodes across three sites, there are additional risks and restrictions, and there might be further complications that I have not thought of. Most likely splitting the nodes of a single cluster between sites is not supported by Nutanix anyway. Implement at your own risk.
What about Replication Factor 3 (RF3)?
The above examples are with Replication Factor 2 (RF2). With RF2 you can lose or corrupt any single node in the cluster and keep the cluster healthy and running. What about Replication Factor 3 (RF3)? With RF3 you can lose or corrupt any two nodes in the cluster at the same time.
With RF2 + block awareness + enough nodes you can lose a complete block with a maximum of four nodes and have your cluster remain operational. With RF3 + block awareness + enough nodes you can lose two blocks with a maximum of four nodes per block, eight nodes in total, and have your cluster remain operational.
What if you wanted to use RF3 and split the nodes over multiple sites? We have already established that splitting nodes over two sites won’t work with RF2; it won’t work with RF3 either, so let’s skip that case. How about RF3 and three sites?
Single Nutanix Cluster with RF3 split over three sites
The minimum number of nodes required for RF3 is five, in order to withstand a two-node failure. In case of a two-node failure you would end up with two “bad” copies and three “good” copies and still have “quorum”.
You could split the nodes in the following way: 1 node at SiteA, 2 nodes at SiteB, 2 nodes at SiteC.
In case of a site failure, the results would be:
- SiteA / 1 node down, 4/5 of nodes available, “quorum” available, cluster remains running
- SiteB / 2 nodes down, 3/5 of nodes available, “quorum” available, cluster remains running
- SiteC / 2 nodes down, 3/5 of nodes available, “quorum” available, cluster remains running
Picture: failure scenarios for a five-node Nutanix cluster with RF3 split over three sites
This is very much like the five-node / three-site example with RF2. One difference is that with RF2 you could start with three nodes and expand to five nodes as required. With RF3 you would have to start with at least five nodes, which isn’t far off the minimum number of nodes required for Metro Availability (six). With RF3 you would need three sites, whereas with Metro Availability you could manage with only two sites.
Another difference with RF3 is that you could expand your cluster to have two blocks per site. Since block awareness with RF3 writes one copy locally and two remotely, even in the worst-case scenario you would have a maximum of two copies of the same data within a site, one replica per block. The third copy would be written to a different block at a different site.
Once you need to expand beyond two blocks per site with RF3, there is a risk that all the replicas of a piece of data will be held within a single site, each in a separate block but at the same site. Should you lose that site, you could lose all three copies of the data and experience data loss.
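To see why going beyond two blocks per site reopens the data-loss window, here is a small counting sketch (illustrative Python; the block and site labels are made up). RF3 with block awareness only guarantees three different blocks; once a single site hosts three blocks, "three different blocks" can still mean "one site".

```python
# RF3 + block awareness guarantees the three replicas land in three different blocks.
# If one site hosts three (or more) blocks, all three replicas can still end up in
# that single site. Illustrative only; block and site labels are hypothetical.
from itertools import combinations

block_to_site = {
    "BLK-1": "SiteA", "BLK-2": "SiteA", "BLK-3": "SiteA",  # three blocks in SiteA
    "BLK-4": "SiteB", "BLK-5": "SiteB",
    "BLK-6": "SiteC", "BLK-7": "SiteC",
}

risky = [combo for combo in combinations(block_to_site, 3)
         if len({block_to_site[b] for b in combo}) == 1]  # all replicas in one site

print(risky)  # [('BLK-1', 'BLK-2', 'BLK-3')] -> losing SiteA would lose all three copies
```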
Conclusion:
With RF3 and block awareness you could split your cluster over three sites and expand it further than with RF2 and block awareness. Other than that, similar risks and restrictions apply to this solution as well. As with RF2, there might be additional complications. Most likely the setup is not supported by Nutanix. Implement at your own risk.
Single Nutanix Cluster with RF3 split over five sites
In theory you could split your cluster over five sites instead of three. This would buy you more room to expand, but otherwise the same problems would apply. The cost of running five sites with redundant networks spanning them would drive the total cost to a quite ridiculous level, making the two-site, two-cluster Metro Availability solution look like an even sweeter deal.
Conclusion:
If you are concerned about site failure, use Metro Availability. In most cases Availability Domains or Block Awareness won’t solve your problem. Splitting a single Nutanix cluster between sites most likely won’t be supported by Nutanix. The Nutanix Bible also states: “As of Acropolis base software version 4.5 and later block awareness is best effort and doesn’t have strict requirements for enabling.” So even if the stars were aligned, there is no guarantee that Block Awareness would work in every situation.
Thanks for reading; comments are welcome.