Changing the Failover Cluster Heartbeat Interval

Windows Server Failover Clustering is a high availability platform that constantly monitors the network connections and health of the nodes in a cluster. If a node is not reachable over the network, recovery action is taken to bring applications and services online on another node in the cluster.

Failover Clustering by default is configured to deliver the highest levels of availability, with the smallest amount of downtime. The default settings out of the box are optimized for failures where there is a complete loss of a server, what we will refer to in this blog as a ‘hard’ failure. These would be unrecoverable failure scenarios such as the failure of non-redundant hardware or power. In these situations the server is lost and the goal is for Failover Clustering to very quickly detect the loss of the server and rapidly recover on another server in the cluster. To accomplish this fast recovery from hard failures, the default settings for cluster health monitoring are fairly aggressive. However, they are fully configurable to allow flexibility for a variety of scenarios.

These default settings deliver the best behavior for most customers. However, as clusters are stretched from inches to possibly miles apart, the cluster may become exposed to additional and potentially unreliable networking components between the nodes. Another factor is that the quality of commodity servers is constantly increasing. Coupled with added resiliency through redundant components (such as dual power supplies, NIC teaming, and multi-path I/O), this means non-redundant hardware failures may be fairly rare. Because hard failures may be less frequent, some customers may wish to tune the cluster for transient failures, making it more resilient to brief network failures between the nodes. Increasing the default failure thresholds decreases the sensitivity to brief network issues.

Trade-offs

It is important to understand that there is no single right answer here. The optimal settings depend on your specific business requirements and service level agreements.

Aggressive Monitoring – Provides the fastest detection of and recovery from hard failures, which delivers the highest levels of availability. Clustering is less forgiving of transient failures, and may in some situations prematurely fail over resources when there are transient network outages.
Relaxed Monitoring – Provides more forgiving failure detection, which gives greater tolerance of brief transient network issues. The longer time-outs mean that recovery from hard failures takes more time, which increases downtime.

Think of it like your cell phone. When the other end goes silent, how long are you willing to sit there going “Hello?... Hello?... Hello?” before you hang up and call the person back? When the other end goes silent, you don’t know when or even if they will come back.

The key question you need to ask yourself is: What is more important to you? To quickly recover when you pull out the power cord or to be tolerant to a network hiccup?

Settings

There are four primary settings that affect cluster heartbeating and health detection between nodes.

Delay – This defines the frequency at which cluster heartbeats are sent between nodes. The delay is the number of seconds before the next heartbeat is sent. Within the same cluster there can be different delays between nodes on the same subnet and between nodes which are on different subnets.
Threshold – This defines the number of heartbeats which are missed before the cluster takes recovery action. The threshold is a number of heartbeats. Within the same cluster there can be different thresholds between nodes on the same subnet and between nodes which are on different subnets.

It is important to understand that the delay and the threshold have a cumulative effect on the total health detection time. For example, setting CrossSubnetDelay to send a heartbeat every 2 seconds and setting CrossSubnetThreshold to 10 missed heartbeats before taking recovery action means that the cluster has a total network tolerance of 20 seconds before recovery action is taken. In general, continuing to send frequent heartbeats but having greater thresholds is the preferred method.
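The arithmetic above can be checked against a live cluster. The following is a minimal sketch, assuming the FailoverClusters PowerShell module is available and that the delay properties are reported in milliseconds (as in the Windows Server 2008 R2 table later in this post). The variable names are illustrative only.

```powershell
# Sketch: estimate how long the cluster tolerates missed heartbeats.
# Assumption: delays are stored in milliseconds (for example 1000 = 1 second).
Import-Module FailoverClusters

$cluster = Get-Cluster

# Tolerance (seconds) = delay (ms) x threshold (heartbeats) / 1000
$sameSubnetSeconds  = $cluster.SameSubnetDelay  * $cluster.SameSubnetThreshold  / 1000
$crossSubnetSeconds = $cluster.CrossSubnetDelay * $cluster.CrossSubnetThreshold / 1000

"Same-subnet tolerance:  $sameSubnetSeconds seconds"
"Cross-subnet tolerance: $crossSubnetSeconds seconds"
```

With a CrossSubnetDelay of 2000 milliseconds and a CrossSubnetThreshold of 10, this works out to the 20 seconds described above.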

Windows Server 2012 and later:

Parameter	Fast Failover (Default)	Relaxed	Maximum
SameSubnetDelay	1 second	1 second	2 seconds
SameSubnetThreshold	5 heartbeats	10 heartbeats	120 heartbeats
CrossSubnetDelay	1 second	1 second	4 seconds
CrossSubnetThreshold	5 heartbeats	20 heartbeats	120 heartbeats

Note: The “Relaxed” column above is a recommendation for customers looking to make their clusters more tolerant of transient failures. The recommended values raise the tolerance to 10 seconds for nodes on the same subnet and 20 seconds for nodes on different subnets. If the Hyper-V role is installed on a Windows Server 2012 R2 Failover Cluster, the SameSubnetThreshold default is automatically increased to 10 and the CrossSubnetThreshold default is automatically increased to 20.

Disclaimer: When increasing the cluster thresholds, it is recommended to increase them in moderation. It is important to understand that increased resiliency to network hiccups comes at the cost of increased downtime when a hard failure occurs. In most customers’ minds, a server is down on the network when it is no longer accessible to clients. Traditionally, for TCP-based applications, this means the resiliency of the TCP reconnect window. While the cluster thresholds can be configured for durations of minutes, to achieve reasonable recovery times for clients it is generally not recommended to exceed the TCP reconnect timeouts. Evaluate your business needs to define the maximum values that are right for your deployment’s configuration.

It is critical to recognize that cranking up the thresholds to high values does not fix or resolve the transient network issue. It simply masks the problem by making health monitoring less sensitive. The #1 mistake made broadly by customers is the perception that not triggering cluster health detection means the issue is resolved (which is not true!). I like to think of it this way: just because you choose not to go to the doctor does not mean you are healthy. In other words, the lack of someone telling you that you have a problem does not mean the problem went away.

Configuration:

Cluster heartbeat configuration settings are considered advanced settings and are only exposed via PowerShell. These settings can be changed while the cluster is up and running, with no downtime, and take effect immediately with no need to reboot or restart the cluster.

To view the current heartbeat configuration values:
PS C:\Windows\system32> Get-Cluster | Format-List *subnet*

A setting can be modified with the following syntax:
PS C:\Windows\system32> (Get-Cluster).SameSubnetThreshold = 20
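As a further illustration, the “Relaxed” values from the Windows Server 2012 table could be applied in one short script. This is a sketch only, assuming it runs on a cluster node with the FailoverClusters module available. Choose the values for your own environment based on the trade-offs discussed earlier.

```powershell
# Sketch: apply the "Relaxed" thresholds from the table above.
# Assumption: run on a cluster node with sufficient permissions.
Import-Module FailoverClusters

$cluster = Get-Cluster

# Heartbeats are still sent every second; only the number of
# missed heartbeats tolerated before recovery is raised.
$cluster.SameSubnetThreshold  = 10
$cluster.CrossSubnetThreshold = 20

# Confirm the values now in effect.
$cluster | Format-List *subnet*
```

Leaving the delay values at their defaults follows the guidance above to keep heartbeats frequent and raise the thresholds instead.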

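Note: Windows Server 2008:

The nodes of a cluster periodically exchange datagram messages with one another over a dedicated cluster network (by default every 5 seconds). As long as a cluster node is running, it continuously and periodically sends messages to the other nodes in the cluster. These messages are called heartbeats, and the dedicated network that carries them is called the heartbeat link.

If the active node fails and the standby node receives no heartbeat within the specified period (by default two intervals, or 10 seconds), a failover takes place: the standby node takes over the cluster and provides service to clients.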

Windows Server 2008 R2:

Parameter	Default	Range
SameSubnetDelay	1000 milliseconds	250-2000 milliseconds
CrossSubnetDelay	1000 milliseconds	250-4000 milliseconds
SameSubnetThreshold	5	3-10
CrossSubnetThreshold	5	3-10

Additional Considerations for Logging:

In Windows Server 2012 there is additional logging to the Cluster.log for heartbeat traffic when heartbeats are dropped. By default the RouteHistoryLength setting is set to 10, which is twice the default threshold. If you increase the SameSubnetThreshold or CrossSubnetThreshold values, it is recommended to increase the RouteHistoryLength value to twice the new threshold, so that there is sufficient logging if you ever need to troubleshoot dropped heartbeat packets. This can be done with the following syntax:
PS C:\Windows\system32> (get-cluster).RouteHistoryLength = 20
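Rather than hard-coding the value, RouteHistoryLength could be derived from the thresholds currently configured. The sketch below assumes that "twice the value" should be based on the larger of the two thresholds. That is an interpretation, not something the original post states, and the variable names are illustrative.

```powershell
# Sketch: keep RouteHistoryLength at twice the larger heartbeat threshold.
# Assumption: sizing from the larger threshold covers both subnet cases.
Import-Module FailoverClusters

$cluster = Get-Cluster

$largestThreshold = [Math]::Max($cluster.SameSubnetThreshold, $cluster.CrossSubnetThreshold)
$cluster.RouteHistoryLength = 2 * $largestThreshold

# Show the resulting value.
$cluster | Format-List RouteHistoryLength
```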

For more information on troubleshooting issues with nodes being removed from cluster membership due to network communication issues, please see the following blog:
http://blogs.technet.com/b/askcore/archive/2012/02/08/having-a-problem-with-nodes-being-removed-from-active-failover-cluster-membership.aspx

Source: http://blogs.msdn.com/b/clustering/archive/2012/11/21/10370765.aspx

posted on 2014-05-26 21:29 momingliu11