8.7 Configuring Cascade Failover Prevention

A cascading failover occurs when a bad cluster resource causes a server to fail, fails over to another server and causes it to fail as well, and then continues failing over to (and bringing down) additional cluster servers until, possibly, every server in the cluster has failed.

OES Cluster Services provides a Cascade Failover Prevention function. It detects if a node has failed because of a bad cluster resource, and prevents that bad resource from failing over to other servers in the cluster.

8.7.1 Understanding the Cascade Failover Prevention Quarantine

The cascade failover prevention quarantine puts a resource in a comatose state rather than letting it load on (and potentially bring down) other cluster nodes. A resource can be quarantined if systematic analysis of the logged node failures determines that the following conditions are true:

  • The resource is likely responsible for several consecutive node failures, independent of interference from failures of other resources.

    The consecutive failures might have occurred on different nodes in the cluster. If the resource loads successfully on any node, the failure count for the resource starts over.

  • Loading the resource would put the cluster in grave danger.

OES Cluster Services does the following to determine if a resource should be put into quarantine:

  1. Traces the history of node failures for the suspected bad resource. This includes:

    • Which node the resource was running on or loading on

    • Whether the node failed

    • The state the resource was in when the node failed

    • Whether other resources were trying to load when the node failed

  2. Repeats the above process until one of the following happens:

    • The end of the cluster log file is reached

    • Enough consecutive node failures are found

    • It is found that the node did not fail

    • It is found that the whole cluster was down

    • The entries in the log file are more than 365 days old

If the resource attempts to load on a node where it was previously loaded and there are additional nodes still available in the cluster, it will not be quarantined and will be allowed to load. Also, a resource is not quarantined when it is initially brought online.

For a particular resource, there is no fixed number of node failures that triggers a quarantine. Generally, three consecutive node failures trigger a resource quarantine. However, the actual number of failures considered can vary based on other factors, such as how many nodes are in the cluster and what other resources are doing at the time of the node failures. For example, if no other resources are in a running or loading state when a resource loads but never reaches a running state, two consecutive node failures might be enough to trigger a quarantine.

Factors that might contribute to a resource being quarantined include:

  • A large number of consecutive node failures (generally, three or more)

  • No other resources are causing node failures

  • The resource never reaches a running state

Factors that might help prevent a resource from being quarantined include:

  • A small number of consecutive node failures (generally, one or two)

  • The resource has failed on this node previously

  • Other resources are causing node failures

  • The resource reaches a running state

  • There is one node left up and running in the cluster

The resource quarantine is disabled if:

  • Cascade Failover Prevention is turned off.

    See Disabling Cascade Failover Prevention.

  • There is no shared storage (SAN) or SBD partition.

  • There are enough nodes in the cluster to form a quorum.

8.7.2 Releasing a Resource from Quarantine

While a resource is in quarantine, you can still manually take the resource from the comatose state to an offline state, and then bring it online or cluster migrate it to other cluster nodes.
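
For example, assuming a quarantined resource named RES1 and a target node named NODE2 (placeholder names; substitute your own resource and node names), you could run commands such as the following as the root user on a cluster node:

  # Check resource states; a quarantined resource is reported as comatose
  cluster status

  # Take the resource from the comatose state to the offline state
  cluster offline RES1

  # Bring the resource online, or migrate it to another cluster node
  cluster online RES1
  cluster migrate RES1 NODE2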

To get the resource out of quarantine so that it is once again able to fail over automatically:

  1. Log in to the node as the root user.

  2. Disable Cascade Failover Prevention. See Disabling Cascade Failover Prevention.

  3. Re-enable Cascade Failover Prevention. See Manually Enabling Cascade Failover Prevention.

8.7.3 Enabling or Disabling Cascade Failover Prevention

The Cascade Failover Prevention function is enabled by default when you install OES Cluster Services. You can control Cascade Failover Prevention by creating a configuration file in the /etc/modprobe.d folder and using it to set the value of the ncs_cascade_failover_detection_flag. Two detection modes are supported:

  • A value of 0 disables the function.

  • A value of 2 enables the function. This value is assumed if the configuration file does not exist, or if the flag line is not present in the file. The file does not exist by default.

The setting in the file is specific to a node, and is set separately on each node. After you modify the setting on a node, you must manually unload and reload the OES Cluster Services software on the server to apply the change. The setting persists through patches and upgrades.
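
For example, to check the current setting on a node, search the configuration file for the flag (this assumes the suggested file name novell-ncs.conf; no output, or a missing file, means the default value of 2 is in effect):

  grep ncs_cascade_failover_detection_flag /etc/modprobe.d/novell-ncs.conf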

Disabling Cascade Failover Prevention

You might need to disable the Cascade Failover Prevention function for the following reasons:

  • To get a resource out of quarantine and allow it to once again automatically fail over, you can temporarily disable the Cascade Failover Prevention function on the node where the resource was taken offline.

  • To stop using the Cascade Failover Prevention function in the cluster, you can disable the function on every node in the cluster.

To disable Cascade Failover Prevention (a combined command-line example follows this procedure):

  1. Log in to the node as the root user.

  2. Navigate to the /etc/modprobe.d folder.

  3. In a text editor, create a configuration file (such as novell-ncs.conf) under the /etc/modprobe.d folder.

    If the file already exists, open it.

  4. Add the following content to the file. If the line already exists, change the flag value to 0.

    options crm ncs_cascade_failover_detection_flag=0

    The flag value 0 disables Cascade Failover Prevention for the node.

  5. Save the file, then close the text editor.

  6. Open a terminal console, then restart OES Cluster Services software to apply the modified Cascade Failover Prevention setting on this node:

    rcnovell-ncs restart

    or

    systemctl restart novell-ncs.service

  7. Repeat this procedure on other nodes where you want to disable Cascade Failover Prevention.

  8. (Optional) Re-enable Cascade Failover Prevention as described in Manually Enabling Cascade Failover Prevention.
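
The following is one possible command-line version of steps 3 through 6 on a single node, assuming the suggested file name novell-ncs.conf and that the file contains no other settings (otherwise, edit the file in a text editor as described above):

  # Create the configuration file with the flag set to 0 (disabled)
  echo "options crm ncs_cascade_failover_detection_flag=0" > /etc/modprobe.d/novell-ncs.conf

  # Restart OES Cluster Services to apply the setting on this node
  systemctl restart novell-ncs.service

  # Confirm that the service restarted successfully
  systemctl status novell-ncs.service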

Manually Enabling Cascade Failover Prevention

You might need to manually re-enable Cascade Failover Prevention:

  • If you disabled the function to release resources from the quarantine

  • If you disabled the function for other reasons

To enable Cascade Failover Prevention (a combined command-line example follows this procedure):

  1. Log in to the node as the root user.

  2. Navigate to the /etc/modprobe.d folder.

  3. Use one of the following methods to enable Cascade Failover Prevention:

    • Delete the file: Delete the /etc/modprobe.d/novell-ncs.conf file that you created in Disabling Cascade Failover Prevention.

    • Change the value to 2: In a text editor, open the /etc/modprobe.d/novell-ncs.conf file that you created in Disabling Cascade Failover Prevention, change the flag value from 0 to 2, then save the file.

      options crm ncs_cascade_failover_detection_flag=2

      The flag value 2 enables Cascade Failover Prevention for the node.

    • Delete the line: In a text editor, open the /etc/modprobe.d/novell-ncs.conf file that you created in Disabling Cascade Failover Prevention, remove the ncs_cascade_failover_detection_flag line, then save the file.

  4. Open a terminal console, then restart OES Cluster Services software to apply the modified Cascade Failover Prevention setting on this node:

    rcnovell-ncs restart

    or

    systemctl restart novell-ncs.service

  5. Repeat this procedure on other nodes where you want to enable Cascade Failover Prevention.
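
The following is one possible command-line version of these steps on a single node, using the change-the-value method with the suggested file name novell-ncs.conf (delete the file or remove the flag line instead if you prefer those methods):

  # Change the flag value from 0 to 2 (enabled) in the configuration file
  sed -i 's/ncs_cascade_failover_detection_flag=0/ncs_cascade_failover_detection_flag=2/' /etc/modprobe.d/novell-ncs.conf

  # Restart OES Cluster Services to apply the setting on this node
  systemctl restart novell-ncs.service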