Thursday 6 June 2013

Test or debug the fence_scsi watchdog service in a RHEL 6 cluster

Environment

Resolution

1) The fence_scsi_check watchdog script should trigger a reboot when a clusternode has been successfully fenced via the fence_scsi agent. To test this, simply use the fence_node utility. The cluster node that was fenced should reboot itself.
# fence_node <nodename>
2) Verify that a reboot does not occur while a clusternode is still registered with any device listed in the fence_scsi.dev file. A reboot is triggered only when the clusternode's key is no longer registered with any of the devices listed in the fence_scsi.dev file.
This test is similar to test #1, but instead of using fence_node to remove a clusternode's key from all the devices at once, remove the key from the devices in multiple steps.
Assume that a node is registered with three devices: /dev/sda, /dev/sdb, and /dev/sdc.
# fence_scsi -o off -n <nodename> -d /dev/sda
# fence_scsi -o off -n <nodename> -d /dev/sdb
After removing a node's key from two of the three devices, no reboot should be triggered.
# fence_scsi -o off -n <nodename> -d /dev/sdc
After removing a node's key from the last device, a reboot should be triggered.
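Between the steps of test #2 it can help to count how many devices still hold the node's key. The following is only a minimal sketch and not part of fence_scsi: the function names read_keys and count_registered are hypothetical, and it assumes that sg_persist (from sg3_utils) is installed, that the fence_scsi.dev file lists one device per line, and that the key is stored as a hex string without the 0x prefix.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting registrations; not shipped with fence_scsi.

# read_keys DEVICE: print the reservation keys registered on DEVICE.
# Assumes sg_persist from sg3_utils is available.
read_keys() {
    sg_persist -n -i -k -d "$1" 2>/dev/null
}

# count_registered KEY DEVFILE: print the number of devices listed in DEVFILE
# (assumed one device per line) on which KEY is still registered.
# KEY is assumed to be a hex string without the leading 0x.
count_registered() {
    key=$1
    devfile=$2
    count=0
    while IFS= read -r dev; do
        # Skip blank lines in the device list.
        [ -n "$dev" ] || continue
        # sg_persist prints each key on its own line as 0x<hex>.
        if read_keys "$dev" </dev/null | grep -Eqi "^[[:space:]]*0x${key}[[:space:]]*$"; then
            count=$((count + 1))
        fi
    done < "$devfile"
    echo "$count"
}
```

Under those assumptions, calling count_registered with the contents of the fence_scsi.key file and the path to the fence_scsi.dev file should print 1 after the first two fence_scsi commands and 0 after the third, at which point the reboot is expected.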
3) The fence_scsi_check watchdog script should not trigger a reboot if registration fails.
If registration (unfencing) fails, neither the fence_scsi.key nor the fence_scsi.dev file should exist, and the fence_scsi_check watchdog script should exit without error.
This can be tested in at least two ways. One option is to manually define the devices you wish to register with in the /etc/cluster/cluster.conf file. In this list of devices, specify either non-existent devices or devices that do not support SCSI-3 persistent reservations. Another option is to manually remove the fence_scsi.key file and/or the fence_scsi.dev file.
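To confirm the precondition for test #3, the presence of the two state files can be checked directly. This is a small sketch with a hypothetical function name, report_state_files. The directory that holds the files is passed as an argument because this article does not state it; on RHEL 6 it is commonly /var/run/cluster, but treat that as an assumption to verify on your system.

```shell
#!/bin/sh
# Hypothetical helper; not part of fence_scsi.

# report_state_files DIR: report whether fence_scsi.key and fence_scsi.dev
# exist in DIR. After a failed unfencing, both are expected to be absent.
report_state_files() {
    dir=$1
    for f in fence_scsi.key fence_scsi.dev; do
        if [ -e "$dir/$f" ]; then
            echo "$f: present"
        else
            echo "$f: absent"
        fi
    done
}
```

Running report_state_files against the state directory after a failed registration should report both files as absent.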
4) The fence_scsi_check watchdog script should not trigger a reboot if the watchdog service is running but the cman service has not been started.
There must be a distinction between a node that has been fenced (i.e. it was registered with the devices, but now is not) and a clusternode that has never been registered with the devices. The watchdog script should not trigger a reboot unless the clusternode lost its registrations. This test also depends on the existence of the fence_scsi.key and fence_scsi.dev files.
To test this, simply start the watchdog service without starting the cman service. Neither the fence_scsi.dev file nor the fence_scsi.key file should exist, so there are no registrations to check and no reboot should occur.
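The behaviour expected across the four tests can be summarised as a small decision function. This is a model of the behaviour described here, not the actual fence_scsi_check script. The function name watchdog_action, its arguments, and its output strings are all hypothetical.

```shell
#!/bin/sh
# Hypothetical model of the expected watchdog behaviour; not the real script.

# watchdog_action DIR REGISTERED
#   DIR         directory expected to hold fence_scsi.key and fence_scsi.dev
#   REGISTERED  number of listed devices that still hold the node's key
# Prints "reboot" only when the node was registered and has lost every key.
watchdog_action() {
    dir=$1
    registered=$2
    if [ ! -e "$dir/fence_scsi.key" ] || [ ! -e "$dir/fence_scsi.dev" ]; then
        # Never registered (cman not started) or unfencing failed: tests 3 and 4.
        echo "exit-clean"
    elif [ "$registered" -eq 0 ]; then
        # Key removed from all devices: tests 1 and 2 (final step).
        echo "reboot"
    else
        # Still registered with at least one device: test 2 (intermediate steps).
        echo "exit-clean"
    fi
}
```

The order of the checks matters: the state files are tested first, so a node with no registrations and no state files is treated as never registered, not as fenced.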