Thank you for the very comprehensive introduction. Let’s consider two scenarios (both with TLER-capable hard drives):
Let CRITICAL_ERROR mean an unrecoverable-error message that leads to the disk being removed from the RAID array, and let DEFER_HANDLING mean an error message that signals deferred error handling.
1. With a TLER-capable RAID controller. Let me quote from the WD article you provided (quotation in italics):
“TLER-capable hard drives will perform the normal error recovery, and after 7 seconds, issue an error message to the RAID controller and defer the error recovery task until a later time.
The error handling is further coordinated between the TLER-capable hard drive and the RAID card. The TLER-capable drive will respond without waiting on the error to be resolved. RAID cards are very capable of handling this with a combination of parity protection and journaling. The RAID card flags the error in the error log and proceeds to deliver data using parity protection until the drive retries its own error recovery and corrects the error”.
So, the disk encounters an error, tries to handle it for 7 seconds, fails to do so within that period, and sends the DEFER_HANDLING message. The controller understands it (it can differentiate it from CRITICAL_ERROR), switches to degraded mode, and reconstructs the required data from the redundant information on the other disks. After some time, once it considers that the disk has recovered from the failure, the controller returns from degraded mode to standard mode.
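To make my reading of scenario 1 concrete, here is a minimal Python sketch of how I imagine a TLER-aware controller reacting. It is only a model of my understanding, not real controller firmware: `DriveMessage`, `ArrayState` and `tler_aware_step` are names I made up, and DEFER_HANDLING / CRITICAL_ERROR are the placeholder messages defined above.

```python
from enum import Enum, auto


class DriveMessage(Enum):
    OK = auto()
    DEFER_HANDLING = auto()   # placeholder: drive gave up after ~7 s, will retry later
    CRITICAL_ERROR = auto()   # placeholder: unrecoverable failure


class ArrayState(Enum):
    NORMAL = auto()
    DEGRADED = auto()         # data served from parity, disk still a member
    DISK_REMOVED = auto()     # disk dropped, admin must rebuild


def tler_aware_step(state: ArrayState, message: DriveMessage) -> ArrayState:
    """Hypothetical reaction of a TLER-aware controller to one drive message."""
    if state is ArrayState.DISK_REMOVED:
        # Once the disk is dropped, only a manual rebuild brings it back.
        return state
    if message is DriveMessage.CRITICAL_ERROR:
        return ArrayState.DISK_REMOVED
    if message is DriveMessage.DEFER_HANDLING:
        # Log the error and keep serving data from redundancy.
        return ArrayState.DEGRADED
    # Drive answers normally again: assume it has finished its own recovery.
    return ArrayState.NORMAL
```

With this model, NORMAL followed by DEFER_HANDLING gives DEGRADED, and a later OK brings the array back to NORMAL without the disk ever leaving the array.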
2. With a RAID controller that is NOT TLER-capable:
“TLER-capable hard drives will perform the normal error recovery, and after 7 seconds, issue an error message to the RAID controller and defer the error recovery task until a later time.”
How will a controller that is not TLER-aware understand such a message and differentiate it from a critical failure that must lead to the disk being removed from the array? Most likely it can’t, so it will remove the disk from the array and switch to degraded mode until the administrator rebuilds the array. Do you agree?
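My guess about scenario 2, written as a small Python sketch. Again, this is only an assumption about the behaviour, not a description of any real controller: `legacy_controller_step` and the state strings are hypothetical, and the message names are the placeholders I defined at the top.

```python
def legacy_controller_step(array_state: str, drive_message: str) -> str:
    """Hypothetical controller that is not TLER-aware.

    It only distinguishes "OK" from "anything else", so the placeholder
    messages DEFER_HANDLING and CRITICAL_ERROR are treated the same way.
    """
    if array_state == "DISK_REMOVED":
        # Array keeps running degraded without the disk until the admin rebuilds it.
        return array_state
    if drive_message == "OK":
        return "NORMAL"
    # Any error message, deferred or critical, drops the disk from the array.
    return "DISK_REMOVED"
```

If this guess is right, a single DEFER_HANDLING from the drive already costs the array a member, even if the drive answers OK a moment later.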
“The error handling is further coordinated between the TLER-capable hard drive and the RAID card.”
How will a controller that is NOT TLER-aware coordinate the error handling any further?