Fault Tolerance and Scaling

Maintaining a quorum

To ensure that both service-level and cluster-level operations run smoothly, a quorum of cluster nodes must be running at all times. A quorum means that more than half of the nodes (a strict majority) are running and communicating with each other at any given moment.

Your cluster should always be designed and built to contain an odd number of nodes. This helps maintain a quorum in both normal and adverse networking conditions. Keep this in mind when planning your deployment and looking ahead to maintaining your cluster.

Number of nodes in cluster	Number of nodes required for a quorum
3	2
5	3
n	floor(n / 2) + 1
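
The quorum arithmetic in the table can be sketched in a few lines of Python. This is only an illustration of the majority rule described above; the function names quorum_size and tolerated_failures are hypothetical and are not part of the product.

```python
def quorum_size(node_count: int) -> int:
    """Smallest number of nodes that is strictly more than half the cluster."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    # Integer division rounds down, so this is floor(n / 2) + 1.
    return node_count // 2 + 1


def tolerated_failures(node_count: int) -> int:
    """Nodes that can be lost while the remaining nodes still form a quorum."""
    return node_count - quorum_size(node_count)


if __name__ == "__main__":
    for n in (3, 4, 5, 6, 7):
        print(f"{n} nodes: quorum {quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

Running the sketch shows why an odd node count is preferred: a four-node cluster tolerates one failure, the same as a three-node cluster, so the extra even node adds no fault tolerance.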

Failure handling

The health of all services in the system is monitored.

If a service is found to be unhealthy, the system automatically attempts to self-heal, generally by restarting the process.
It may take five minutes or more for a service to be recognized as unhealthy and restarted, potentially on a different node when appropriate.
Service interruptions may occur depending on the type of failure.
The Cluster Management dashboard provides a view of events regarding detected failures.

When a cluster node becomes unavailable for any reason, whether planned or unplanned:

The cluster generally moves the services that had been running on that node onto other nodes.
It may take five minutes or more for a node to be recognized as unavailable. This delay is designed to prevent unwarranted service disruptions that could be triggered by temporary conditions, such as intermittent network issues.
Follow the instructions for gracefully shutting down or rebooting a node any time a node is shut down or rebooted.

Scaling

A standard deployment of three nodes supports:

6000 concurrent host sessions
The failure of 1 node

If capacity is needed beyond what the standard deployment provides, there are two options: vertical scaling and horizontal scaling. Vertical scaling is recommended in most cases as it does not involve managing additional nodes.

Vertical scaling

To scale vertically, you add more memory and CPU cores to each of your existing nodes. This adds capacity to handle additional requests on the existing nodes, i.e. more concurrent host connections, but it does not increase the number of nodes that can fail without a system disruption.

Host Session Capacity

For each additional 6000 host sessions needed, the following should be added to each node in the cluster:

2 additional CPU Cores
4 GB RAM

Max concurrent host sessions	Required CPU Cores	Required Memory
6000	Base requirement - 8 Cores	Base requirement - 16 GB
12,000	10 Cores	20 GB
18,000	12 Cores	24 GB
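
As a rough sketch, the per-node sizing rule above can be written as a small calculation. The constants come from the table; the function name vertical_requirements is hypothetical, and rounding a partial increment up to the next full 6000-session step is an assumption, since the table only lists exact multiples.

```python
import math

BASE_SESSIONS = 6000      # sessions supported by the base requirement
BASE_CORES = 8            # base CPU cores per node
BASE_MEMORY_GB = 16       # base memory per node
STEP_SESSIONS = 6000      # each additional block of sessions...
STEP_CORES = 2            # ...adds this many cores per node
STEP_MEMORY_GB = 4        # ...and this much memory per node


def vertical_requirements(target_sessions: int) -> tuple[int, int]:
    """Return (cores, memory_gb) needed on EACH node for the target load.

    Assumption: a partial step is rounded up to a full 6000-session step.
    """
    extra = max(0, target_sessions - BASE_SESSIONS)
    steps = math.ceil(extra / STEP_SESSIONS)
    return (BASE_CORES + steps * STEP_CORES,
            BASE_MEMORY_GB + steps * STEP_MEMORY_GB)


if __name__ == "__main__":
    for sessions in (6000, 12000, 18000):
        cores, memory = vertical_requirements(sessions)
        print(f"{sessions} sessions: {cores} cores, {memory} GB per node")
```

The figures are per node, so the additions apply to every node in the cluster, not to the cluster as a whole.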

Horizontal scaling

To scale horizontally, add more nodes to your cluster. This adds capacity and increases the number of node failures the cluster can tolerate, but it also means managing additional nodes and the complexity that comes with them.

Important

You must always have an odd number of nodes in your cluster.

Host Session Capacity

When scaling horizontally, each node added to the cluster adds capacity for approximately an additional 2000 host sessions.

Max concurrent host sessions	Required Number of Nodes*
6000	Base requirement - 3 nodes
10,000	5 nodes
14,000	7 nodes

* Assuming base system requirements for CPU and Memory
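
The node count for a given load can be estimated the same way. This is a minimal sketch that assumes the approximate figure of 2000 sessions per added node holds linearly and rounds any even result up to the next odd number to satisfy the odd-node rule; the function name horizontal_node_count is hypothetical.

```python
import math

BASE_NODES = 3             # standard deployment
BASE_SESSIONS = 6000       # sessions supported by the standard deployment
SESSIONS_PER_NODE = 2000   # approximate capacity added by each extra node


def horizontal_node_count(target_sessions: int) -> int:
    """Return an odd node count estimated to support the target load.

    Assumes every node meets the base CPU and memory requirements.
    """
    extra = max(0, target_sessions - BASE_SESSIONS)
    nodes = BASE_NODES + math.ceil(extra / SESSIONS_PER_NODE)
    # The cluster must always contain an odd number of nodes.
    if nodes % 2 == 0:
        nodes += 1
    return nodes


if __name__ == "__main__":
    for sessions in (6000, 8000, 10000, 14000):
        print(f"{sessions} sessions: {horizontal_node_count(sessions)} nodes")
```

Under these assumptions, a target of 8000 sessions would need only four nodes by capacity alone, but the sketch returns five because of the odd-node requirement.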

Headroom

When building a fault-tolerant cluster, each node must reserve a minimum level of free compute resources so that it can take on additional load when needed, for example when services move over from an unavailable node.

When scaling vertically, we recommend doubling the system requirements.
When scaling horizontally, these resources are factored into the system requirements.