DM RAID: Fix raid_resume not reviving failed devices in all cases

DM RAID: Fix raid_resume not reviving failed devices in all cases When a device fails in a RAID array, it is marked as Faulty. Later, md_check_recovery is called which (through the call chain) calls 'hot_remove_disk' in order to have the personalities remove the device from use in the array. Sometimes, it is possible for the array to be suspended before the personalities get their chance to perform 'hot_remove_disk'. This is normally not an issue. If the array is deactivated, then the failed device will be noticed when the array is reinstantiated. If the array is resumed and the disk is still missing, md_check_recovery will be called upon resume and 'hot_remove_disk' will be called at that time. However, (for dm-raid) if the device has been restored, a resume on the array would cause it to attempt to revive the device by calling 'hot_add_disk'. If 'hot_remove_disk' had not been called, a situation is then created where the device is thought to concurrently be the replacement and the device to be replaced. Thus, the device is first sync'ed with the rest of the array (because it is the replacement device) and then marked Faulty and removed from the array (because it is also the device being replaced). The solution is to check and see if the device had properly been removed before the array was suspended. This is done by seeing whether the device's 'raid_disk' field is -1 - a condition that implies that 'md_check_recovery -> remove_and_add_spares (where raid_disk is set to -1) -> hot_remove_disk' has been called. If 'raid_disk' is not -1, then 'hot_remove_disk' must be called to complete the removal of the previously faulty device before it can be revived via 'hot_add_disk'. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
author: Jonathan Brassow <jbrassow@redhat.com> 2013-05-08 18:00:54 -0500
committer: NeilBrown <neilb@suse.de> 2013-06-14 08:10:25 +1000
commit: a4dc163a55964d683f92742705c90c78c0f56c0c (patch)
tree: cbe562b8c39069ba7f0691eb06316ec087c24255 /drivers/md
parent: f381e71b042af910fbe5f8222792cc5092750993 (diff)
1 files changed, 15 insertions, 0 deletions
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 59d15ec0ba81..49f0bd510fb9 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -1587,6 +1587,21 @@ static void attempt_restore_of_faulty_devices(struct raid_set *rs)
 			DMINFO("Faulty %s device #%d has readable super block."
 			       "  Attempting to revive it.",
 			       rs->raid_type->name, i);
+
+			/*
+			 * Faulty bit may be set, but sometimes the array can
+			 * be suspended before the personalities can respond
+			 * by removing the device from the array (i.e. calling
+			 * 'hot_remove_disk').  If they haven't yet removed
+			 * the failed device, its 'raid_disk' number will be
+			 * '>= 0' - meaning we must call this function
+			 * ourselves.
+			 */
+			if ((r->raid_disk >= 0) &&
+			    (r->mddev->pers->hot_remove_disk(r->mddev, r) != 0))
+				/* Failed to revive this device, try next */
+				continue;
+
 			r->raid_disk = i;
 			r->saved_raid_disk = i;
 			flags = r->flags;
author	Jonathan Brassow <jbrassow@redhat.com>	2013-05-08 18:00:54 -0500
committer	NeilBrown <neilb@suse.de>	2013-06-14 08:10:25 +1000
commit	a4dc163a55964d683f92742705c90c78c0f56c0c (patch)
tree	cbe562b8c39069ba7f0691eb06316ec087c24255 /drivers/md
parent	f381e71b042af910fbe5f8222792cc5092750993 (diff)