
The RAID

Posted: 2023-01-08 21:15:00 by Alasdair Keyes

After a drive failed on New Year's Day, I ordered a new disk and rebuilt the array. Thankfully, because the OS checks the array monthly, all the data on the three remaining drives was readable and the array is complete again.
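For anyone wanting to confirm that an array really is back to full strength, here is a minimal sketch that reads /proc/mdstat. The function name degraded_arrays is my own invention for illustration. It relies on the status map the kernel prints for each array (for example [UUUU]), where an underscore marks a missing or failed member.

```shell
#!/bin/sh
# List md arrays that are running with a missing or failed member.
# Reads /proc/mdstat-style text on stdin. Each array has a status line
# ending in a map such as [UUUU]; an underscore marks a missing member.
degraded_arrays() {
    awk '
        $1 ~ /^md/ && $2 == ":" { name = $1 }   # remember the current array
        /\[[U_]*_[U_]*\]/       { print name }  # status map has a gap
    '
}
```

Running `degraded_arrays < /proc/mdstat` should print nothing on a healthy system. While a rebuild is still in progress the array will continue to be listed, as the new member only shows as U once recovery has finished.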

I put the failed disk into another machine and ran the following badblocks command on it.

badblocks -o sdb_badblocks.txt -b 4096 -w -s /dev/sdb

I used the destructive write-mode test (-w) as the data was not needed now that the array was back to full strength. Incidentally, using a block size of 4096 instead of the default 1024 seemed to provide about a 2x-3x speed increase.

Even with that, the 2TB disk took just over 33 hours for a full write pass and a confirmation read pass.

At the end of it, the full write and read passes had completed with no errors reported. This is frustrating, as mdadm had evidently hit a read error when it rejected the disk; this was logged in syslog.

I thought that maybe the bad sectors had been remapped by the firmware during the badblocks test, but checking the SMART stats again showed that no errors were reported and no reallocation had been logged (ID# 5 below).

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   134   134   054    Pre-fail  Offline      -       103
  3 Spin_Up_Time            0x0007   168   168   024    Pre-fail  Always       -       342 (Average 311)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       75
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   146   146   020    Pre-fail  Offline      -       29
  9 Power_On_Hours          0x0012   086   086   000    Old_age   Always       -       99078
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       75
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       989
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       989
194 Temperature_Celsius     0x0002   200   200   000    Old_age   Always       -       30 (Min/Max 16/44)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
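
Rather than eyeballing the whole table, the handful of attributes that track remapped or unreadable sectors (IDs 5, 196, 197 and 198) can be picked out with a little awk. This is only a sketch: the function name smart_sector_counts is made up for illustration, and it assumes the ten-column layout shown above, where the raw value is the tenth field.

```shell
#!/bin/sh
# Print NAME=RAW_VALUE for the SMART attributes that track remapped or
# unreadable sectors. Reads `smartctl -A` style output on stdin and
# assumes the column layout shown in the table above.
smart_sector_counts() {
    awk '$1 ~ /^(5|196|197|198)$/ { print $2 "=" $10 }'
}
```

Typical use would be `smartctl -A /dev/sdb | smart_sector_counts` (the device name will differ per machine). For the table above, all four values come out as 0.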

So I'm not sure why the error occurred. Maybe the controller is bad, the cable is dodgy, or my server got hit by some stray cosmic rays that caused some kind of CRC error (yes, it can and does happen). The server has ECC memory, so bit flips in the RAM should have been detected had they occurred.

Interestingly, this is the first failed disk I've had within a Linux MDADM array in over 20 years of running servers (I've had plenty of failed disks in Dell PERC controllers and whatever controllers Supermicro jam into their servers!). All previous arrays have been torn down before a disk failed.

As such, this was also the first time I've had to rebuild an array. This particular RAID had been running for over 11 years before this disk failed. For those interested, I followed this post by Red Hat about the steps to take: https://www.redhat.com/sysadmin/raid-drive-mdadm.

Should something similar happen again, I think I would run badblocks in non-destructive mode on the disk in situ and, if it passed, push it back into the array to be rebuilt before I looked at buying a new disk.
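
A rough sketch of what that might look like is below. It is untested against a real failure, and the names are assumptions: retest_and_readd, run and DRY_RUN are my own, and /dev/md0 and /dev/sdX1 are placeholders. It uses badblocks -n (the non-destructive read-write mode) and mdadm --manage with --remove and --add. By default it only prints the commands it would run, so nothing touches the disks until DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: retest a member that mdadm has marked faulty, in place, and
# return it to the array only if badblocks finds nothing.
# Prints the commands unless DRY_RUN=0 is set in the environment.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

retest_and_readd() {
    array=$1; member=$2; log=${3:-badblocks.txt}
    # Detach the faulty member so the disk is no longer held by md.
    run mdadm --manage "$array" --remove "$member" || return 1
    # -n: non-destructive read-write test; -s: show progress.
    run badblocks -n -s -b 4096 -o "$log" "$member" || return 1
    # Bad blocks are listed in the -o file, so a non-empty file means
    # failure. The exit status alone is not relied on here.
    if [ "${DRY_RUN:-1}" != 1 ] && [ -s "$log" ]; then
        echo "bad blocks found; not re-adding $member" >&2
        return 1
    fi
    # Add the member back; md then rebuilds onto it.
    run mdadm --manage "$array" --add "$member"
}
```

Calling `retest_and_readd /dev/md0 /dev/sdX1` prints the three commands in order. The non-destructive mode is slower than a plain read test since it has to read, write test patterns and restore each block, so on a 2TB disk it would still be a long wait.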

If you found this useful, please feel free to donate via bitcoin to 1NT2ErDzLDBPB8CDLk6j1qUdT6FmxkMmNz
