Event Id 1146 Microsoft Windows Failover Clustering Tools
Hello, the alert description gives the most information in this case: 'Cluster network 'Cluster Network 2' is down. None of the available nodes can communicate using this network.' This is networking at the Windows level, so you'll have to get with your networking team and Windows team to check the binding order of the adapters and the port configuration. It also wouldn't hurt to ask the network team if they noticed any glitches or rebooted any switches (or ran upgrades, etc.) that could have caused an issue. Check all cables and make sure they are good.
2012-7-9: Having a problem with nodes being removed from an active failover cluster (Windows Server, Failover Clustering). The nodes gave me a cluster network name resource error. Event ID: 1231. Source: Microsoft-Windows. Open Event Viewer and view the events related to failover clustering.
I would also go in and check what type of traffic 'Cluster Network 2' transports. I like to name mine after what they are, such as 'Public Cluster Network', 'Private Cluster Network', 'Management Network', etc. In cases like this it makes it much easier to understand what is affected. Depending on the version of Windows, the failover clustering administration GUI should show each network and whether it's up or down. It sounds like the private network lost connection, because if it were the public network you wouldn't have been able to remotely access SQL Server. Private traffic can run over the public network in the event of a failure, but public traffic can't traverse the private network.
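The reasoning above (private traffic can fall back to the public network, but not the other way round) can be sketched as a small model. This is only an illustration: the `ClusterNetwork` class, its fields, and the `assess` function are hypothetical names made up for this example, not part of any Windows or clustering API.

```python
from dataclasses import dataclass


@dataclass
class ClusterNetwork:
    # Hypothetical model of one cluster network; field names are assumptions.
    name: str
    client_access: bool    # True for a "public" network that clients connect through
    cluster_traffic: bool  # True if node-to-node (private) traffic may use it
    is_up: bool


def assess(networks):
    """Summarize what still works given which networks are up."""
    up = [n for n in networks if n.is_up]
    return {
        # Clients need at least one working network that allows client access.
        "clients_can_connect": any(n.client_access for n in up),
        # Node-to-node traffic can use any working network that allows it.
        "nodes_can_communicate": any(n.cluster_traffic for n in up),
        "down": [n.name for n in networks if not n.is_up],
    }


networks = [
    ClusterNetwork("Public Cluster Network", client_access=True,
                   cluster_traffic=True, is_up=True),
    ClusterNetwork("Private Cluster Network", client_access=False,
                   cluster_traffic=True, is_up=False),
]
print(assess(networks))
```

With descriptive network names, the `down` entry alone tells you which kind of traffic was affected, which is the point of naming them.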
Start from the time of the event and work with networking to see if anything happened. Check your network connections in the Windows failover clustering tool and have networking check ports and cables. Figure out what 'Cluster Network 2' is and what type of traffic traverses it.
Find the root cause.
Sean Gallardy, MCC.
In Windows Server 2008 Failover Clustering, the team invested significant time into making clustering easier. In the Windows Server 2008 R2 release we have continued down that path, adding several troubleshooting enhancements. One of the important aspects of troubleshooting a service outage is doing diligent postmortem analysis: understanding why you experienced the problem so that you can take corrective action to avoid seeing it again. A common source of problems is third-party resource dlls, which may not have had the same detailed level of testing as the in-box dlls. In previous releases we offered the ability to isolate components into separate processes, and in 2008 R2 we have built in additional isolation logic, so that if a resource dll crashes, little else is affected, offering even higher availability to your mission-critical applications.
The resource dll is a component provided by the application being clustered, and it acts as a proxy between the application and the cluster. If the cluster wants to stop or start the application it will notify the resource dll, and the resource dll will communicate this information to the application. The cluster does not load the resource dlls into the cluster service process; instead it loads them into the Resource Hosting Subsystem (RHS.exe) process, which is recyclable. Previously all resources used to run in a single RHS process by default.
But this meant that if one resource crashed, the entire RHS process could fail, and all resources hosted by that RHS process would fail with it. We’ve improved our default behavior in 2008 R2 by separating our critical resources from the other dlls in RHS. The Cluster Group (including the quorum resource) and the Storage Group (including Available Storage and Cluster Shared Volumes) now all run in a single, isolated RHS process. The other resource dlls run in one or more additional RHS processes. There are two common reasons for seeing instability in a resource dll: 1. The resource dll itself may crash. In most cases this is caused by an access violation in the resource dll.
In previous releases we took action to alert the admin of this event by reporting it to the Resource Control Manager (RCM) and exiting. RCM is a component inside the cluster service which, upon receiving notification that the resource caused a crash, would mark this resource as “run in a separate monitor”. This offered higher availability to the cluster because the resource is then loaded in its own RHS process, and if it crashes again only that resource is affected. In R2 we have enhanced this behavior: in addition to reporting the failure and isolating the resource, we report the access violation by generating a Windows Error Report (WER).
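The isolation behavior described above can be illustrated with a toy model. It is a simplified sketch under stated assumptions: the `RhsModel` class, its attributes, and the resource names are hypothetical and only mimic the idea that a crash takes down every resource in the same RHS process, after which the offending resource is moved to a process of its own.

```python
class RhsModel:
    """Toy model (hypothetical) of resources grouped into RHS processes."""

    def __init__(self, resources):
        # Maps an RHS process id to the set of resource names it hosts.
        # By default every resource shares process 0.
        self.hosts = {0: set(resources)}
        # Resources marked to "run in a separate monitor".
        self.separate_monitor = set()
        self._next_id = 1

    def crash(self, resource):
        """Simulate `resource` crashing; return every resource affected."""
        host_id = next(h for h, members in self.hosts.items()
                       if resource in members)
        # Everything hosted in the same process fails along with it.
        affected = sorted(self.hosts[host_id])
        # Mark the offender and move it into its own process.
        self.separate_monitor.add(resource)
        if len(self.hosts[host_id]) > 1:
            self.hosts[host_id].discard(resource)
            self.hosts[self._next_id] = {resource}
            self._next_id += 1
        return affected


model = RhsModel(["FileShare", "IPAddress", "ThirdPartyApp"])
print(model.crash("ThirdPartyApp"))  # first crash affects all three resources
print(model.crash("ThirdPartyApp"))  # second crash affects only ThirdPartyApp
```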
WER will collect a dump file, create a problem report, and handle the report according to the policy applied on that computer, which brings the issue to the attention of the system administrator or Microsoft. 2. The resource dll might take too long to perform the requested action; in some cases it might even deadlock. There is no effective way to tell whether it is just taking a long time or is deadlocked.
One way to solve this issue is to limit the amount of time we wait for the resource to complete a request; if it does not complete in that time, we assume that the component handling the call is not in a healthy state. Some activities, such as online and offline, can take some time, so you may see the ‘pending’ state in the UI. If online is taking a long time, the resource might spawn a worker thread and tell RHS that the online call is pending, which notifies RHS that it requires more time. Once the resource comes online it will notify RHS. Offline is handled in a similar way. All other activities are simply limited by time and have to complete before RHS decides that they have timed out. In previous releases, when RHS decided that an activity had timed out, it would notify RCM and terminate the process.
RCM will then isolate the resource in a separate RHS process. In R2 we improved the logic based on whether the event is common or not. Many deadlocks are one-time events caused by a race condition, so it may not be appropriate to isolate that resource because of a single occurrence, as having too many individual RHS processes can cause a slight performance impact. A timeout will again create a WER report and forward it to the appropriate destination. In Windows Server 2008 R2 the reports can be found under Control Panel > System and Security > Action Center > Problem Reports. All RHS issues will be in the category “Failover Cluster Resource Host Subsystem”. The image below shows two issues.
The first shows an access violation, which is sent to WER as a problem report. Opening the dump file in a debugger would provide more details about which resource caused the problem.
The second item is generated when RHS has detected a call is taking too long. In this case, RHS explicitly calls WER to generate a problem report and provides additional information that allows the user to see details about which resource and call caused the issue without looking into the dump file. This example shows the case when the ONLINERESOURCE call to the resource “r1” of the type “FlexRes” took too long.
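The time-limit approach described earlier can be sketched as a small watchdog. This is a minimal illustration, not how RHS is implemented: `call_with_deadline` and its return format are hypothetical, and the names “r1”, “FlexRes”, and ONLINERESOURCE are taken from the example above. Note that the watchdog cannot tell a slow call from a deadlocked one; it only knows the deadline passed.

```python
import threading
import time


def call_with_deadline(resource, resource_type, call_name, func, timeout_s):
    """Run `func` in a worker thread and report whether it met the deadline."""
    # A daemon thread lets the program exit even if the call never returns.
    worker = threading.Thread(target=func, daemon=True)
    worker.start()
    worker.join(timeout_s)  # wait at most timeout_s seconds
    return {
        "resource": resource,
        "type": resource_type,
        "call": call_name,
        # Still running after the deadline: treat the call as unhealthy.
        "timed_out": worker.is_alive(),
    }


def slow_online():
    time.sleep(0.5)  # stands in for an online call that takes too long


report = call_with_deadline("r1", "FlexRes", "ONLINERESOURCE",
                            slow_online, timeout_s=0.05)
print(report)
```

Carrying the resource name, type, and call name in the report mirrors the idea that the problem report identifies the culprit without anyone having to open the dump file.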
You can learn more about the benefits and configuration of Windows Error Reporting from the following resources:
Thanks, Vladimir Petter, Senior Software Development Engineer, Clustering & High-Availability, Microsoft.
We have 2 clustered servers for file shares that have been up and running for 2 months. Recently, one disk crashed and the cluster switched to another node, with this error in Event Viewer: Cluster resource 'FileServer-(SPDBCLUST)(Cluster Disk 14)' (resource type ', DLL 'clusres.dll') either crashed or deadlocked. The Resource Hosting Subsystem (RHS) process will now attempt to terminate, and the resource will be marked to run in a separate monitor. Source: FailoverClustering. Event ID: 1230. Please guide us on this issue. Waiting for your reply.
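When several of these events pile up, it can help to pull the resource name and DLL out of the message text. The following is a rough sketch that works on the message quoted above; `parse_rhs_event` is a hypothetical helper, and the patterns are guesses based only on this one message, so they may need adjusting for other wordings. Because the resource type is missing from the quoted text, the sketch returns None for it rather than guessing.

```python
import re


def _first(pattern, text):
    # Return the first captured group, or None if the pattern does not match.
    match = re.search(pattern, text)
    return match.group(1) if match else None


def parse_rhs_event(message):
    """Extract resource name, resource type, and DLL from an event message."""
    return {
        "resource": _first(r"Cluster resource '([^']*)'", message),
        "resource_type": _first(r"resource type '([^']*)', DLL", message),
        "dll": _first(r"DLL '([^']*)'", message),
    }


message = (
    "Cluster resource 'FileServer-(SPDBCLUST)(Cluster Disk 14)' "
    "(resource type ', DLL 'clusres.dll') either crashed or deadlocked."
)
print(parse_rhs_event(message))
```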