Replacing a Failed Disk in Software RAID-1
You know something’s wrong when you see this:
# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [linear] [multipath] [raid10]
md2 : active raid1 sdb3[2] sda3[0]
1458830400 blocks [2/1] [U_]
md0 : active raid1 sdb1[1] sda1[0]
4200896 blocks [2/2] [UU]
md1 : active raid1 sdb2[2] sda2[0]
2104448 blocks [2/1] [U_]
unused devices: <none>
U_ means that one disk is missing from the array; UU would mean that everything is all right. In this example /dev/sdb failed. Substitute your own device name in the commands that follow.
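The U_ check can also be done by a small script. The following is only a sketch: the function name degraded_arrays and its optional file argument are assumptions made here so it can be tried on a saved copy of /proc/mdstat. It prints the name of every array whose status field contains an underscore.

```shell
#!/bin/sh
# Print the name of every md array whose status field contains "_",
# meaning at least one member is missing.
# $1 is an optional path to a saved copy of /proc/mdstat (an assumption
# added here so the function can be tried without a real array).
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                  # remember the current array
        /blocks/ && $NF ~ /^\[[U_]+\]$/ && $NF ~ /_/ { print name }
    ' "${1:-/proc/mdstat}"
}

# Only read the live file when it exists (it is absent without the md driver).
if [ -r /proc/mdstat ]; then
    degraded_arrays
fi
```

An empty result means no array is degraded.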
Now to remove the physical disk you first have to remove every partition of that disk from the software raid:
#mdadm /dev/md0 -r /dev/sdb1
#mdadm /dev/md1 -r /dev/sdb2
#mdadm /dev/md2 -r /dev/sdb3
If a partition is still an active member of its array because only part of the disk is damaged, mdadm will refuse to remove it. Flag that partition as failed first, then run the remove command again:
#mdadm --manage /dev/md0 --fail /dev/sdb1
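With many arrays it is easy to mistype one of these commands. As a rough sketch, the helper below reads /proc/mdstat and prints the fail and remove commands for every member on one disk. It only prints them and runs nothing. The function name removal_commands and the optional file argument are made up for this example.

```shell
#!/bin/sh
# Print (without running) the mdadm commands that fail and then remove
# every array member located on one disk.
# $1: disk name without /dev/, e.g. sdb
# $2: optional path to a saved copy of /proc/mdstat (assumption for testing)
removal_commands() {
    awk -v disk="$1" '
        /^md[0-9]+ :/ {
            for (i = 3; i <= NF; i++) {
                if ($i !~ /\[[0-9]+\]/) continue     # skip "active", "raid1", ...
                part = $i
                sub(/\[.*/, "", part)                # "sdb3[2](F)" -> "sdb3"
                if (index(part, disk) == 1) {
                    printf "mdadm /dev/%s --fail /dev/%s\n", $1, part
                    printf "mdadm /dev/%s --remove /dev/%s\n", $1, part
                }
            }
        }
    ' "${2:-/proc/mdstat}"
}

# Show the commands for sdb when a live /proc/mdstat is available.
if [ -r /proc/mdstat ]; then
    removal_commands sdb
fi
```

Review the printed lines before running them as root. The --fail line is only needed for members that are still active in their array.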
Now shut down your computer and replace the disk (or simply swap the disk if it is hot-swappable). The new disk has to be at least as large as the old one.
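Copy the partition table from the working disk (sda) to the new one (sdb). Double-check the direction: if= must be the working disk and of= the new one, otherwise you overwrite the partition table of your only good disk.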
#dd if=/dev/sda of=/dev/sdb count=1 bs=512
Check that it worked by opening /dev/sdb in fdisk and printing the partition table (p). IMPORTANT: write the partition table out (w) before leaving fdisk, so that the kernel re-reads it and picks up the new partitions.
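The dd command above copies only the first 512-byte sector, which is where an MBR partition table with its primary partitions lives. For other layouts, a commonly used alternative is to dump the table with sfdisk and feed it to the new disk. The wrapper below is only a sketch: the function name copy_table and the DRY_RUN switch are assumptions of this example, and by default it just prints the pipeline instead of touching any disk.

```shell
#!/bin/sh
# Copy the partition table from one disk to another via an sfdisk dump.
# $1: source (working) disk, $2: destination (new) disk.
# DRY_RUN is a hypothetical switch for this sketch: unless it is set
# to 0, the pipeline is printed rather than executed.
copy_table() {
    src=$1
    dst=$2
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "sfdisk -d $src | sfdisk $dst"
    else
        sfdisk -d "$src" | sfdisk "$dst"
    fi
}

copy_table /dev/sda /dev/sdb
```

Run it with DRY_RUN=0 as root only once you are sure which disk is which.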
Now all you have to do is add the partitions to the raid again:
#mdadm /dev/md0 -a /dev/sdb1
#mdadm /dev/md1 -a /dev/sdb2
#mdadm /dev/md2 -a /dev/sdb3
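Once the partitions are added, the kernel rebuilds each mirror in the background, and /proc/mdstat shows a recovery line with a percentage while that is going on. As a small sketch (the function name rebuild_progress and its optional file argument are assumptions of this example), the progress can be pulled out like this:

```shell
#!/bin/sh
# Print "<array> <percent>" for every array that is currently rebuilding.
# $1 is an optional path to a saved copy of /proc/mdstat (assumption for testing).
rebuild_progress() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                  # remember the current array
        /(recovery|resync) *=/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /%$/) {
                    pct = $i
                    sub(/^=/, "", pct)               # "=100.0%" -> "100.0%"
                    print name, pct
                }
            }
        }
    ' "${1:-/proc/mdstat}"
}

# Only read the live file when it exists.
if [ -r /proc/mdstat ]; then
    rebuild_progress
fi
```

When the function prints nothing and every array shows UU again, the rebuild is finished.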
If you get an error message that your partition is “not large enough to join array”, the kernel is probably still using the old partition table of the new disk. In this case, remove the partitions you already added from the raid (flagging them as failed first, as described above), open fdisk again and write the changes to disk. Alternatively you could run
#sfdisk -R /dev/sdb
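to make the kernel re-read the partition table. Newer versions of sfdisk (util-linux 2.26 and later) no longer have the -R option; on those systems blockdev --rereadpt /dev/sdb does the same job.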