Over a short span of about two months, I have had two SSDs fail in similar ways, so I am posting the information I collected to get feedback.
The vendor confirmed that the drives were tested before deployment.
Failure signature: some of the sectors become unreadable.
/tmp/raid/var/log @mw136.sjc# dd if=/dev/sda of=/dev/null bs=512 skip=1007297200 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.000221492 s, 2.3 MB/s
/tmp/raid/var/log @mw136.sjc# dd if=/dev/sda of=/dev/null bs=512 skip=1007297210 count=1
dd: reading '/dev/sda': Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.28118 s, 0.0 kB/s
/tmp/raid/var/log @mw136.sjc# dd if=/dev/sda of=/dev/null bs=512 skip=1007297300 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.000207949 s, 2.5 MB/s
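The three spot checks above could be extended into a sweep over a range of sectors, to see how large the unreadable region is. The following is a minimal Python sketch, not a tested diagnostic tool. The name `find_unreadable_sectors` is hypothetical, and it assumes 512-byte logical sectors as in the dd commands. It reads through the page cache, so kernel readahead may cause neighbors of a bad sector to fail as well.

```python
import errno
import os


def find_unreadable_sectors(path, start, count, sector_size=512):
    """Return sector numbers in [start, start + count) whose read fails with EIO.

    Hypothetical helper: roughly `dd if=path bs=512 skip=N count=1` for each N.
    """
    bad = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for sector in range(start, start + count):
            try:
                os.pread(fd, sector_size, sector * sector_size)
            except OSError as exc:
                if exc.errno != errno.EIO:
                    raise  # e.g. permission problems are not media errors
                bad.append(sector)
    finally:
        os.close(fd)
    return bad
```

On the affected host this could be called (as root) as `find_unreadable_sectors("/dev/sda", 1007297200, 101)` to cover the range between the two readable sectors shown above.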
So far, 3 drives in my system have failed within a short span: one is completely dead and the other two show the symptoms above.
I suspect that "fstrim" might be causing this, but it is only a hunch and I have no conclusive evidence.
In another system with drives from a non-Intel vendor, enabling fstrim caused XFS filesystem panics and system instability (freezes etc.).
The system with Intel drives runs CentOS 7.2 with MD RAID-0.
Since MD RAID-0 disables fstrim by default, I enabled it with the "raid0.devices_discard_performance=Y" module parameter.
A note in the Linux kernel sources (linux-3.10.0-327.36.3.el7/drivers/md/raid0.c) says the following about the module parameter:
/* Unfortunately, some devices have awful discard performance,
 * especially for small sized requests. This is particularly
 * bad for RAID0 with a small chunk size resulting in a small
 * DISCARD requests hitting the underlaying drives.
 * Only allow DISCARD if the sysadmin confirms that all devices
 * in use can handle small DISCARD requests at reasonable speed,
 * by setting a module parameter.
 */
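What each block device advertises for discard can be read from sysfs. The kernel's queue-sysfs documentation says that a `discard_max_bytes` or `discard_granularity` of 0 means the device does not support discard. A rough Python sketch follows. `discard_limits` is a hypothetical helper name, the `sysfs_root` argument exists only so the function can be tried against a fake directory, and the device names in the usage note are assumptions about this system.

```python
import os


def discard_limits(devices, sysfs_root="/sys/block"):
    """Map each device name to (discard_granularity, discard_max_bytes) in bytes.

    A value of None means the attribute file could not be read.
    """
    def read_int(path):
        try:
            with open(path) as fh:
                return int(fh.read().strip())
        except (OSError, ValueError):
            return None

    limits = {}
    for dev in devices:
        queue = os.path.join(sysfs_root, dev, "queue")
        limits[dev] = (
            read_int(os.path.join(queue, "discard_granularity")),
            read_int(os.path.join(queue, "discard_max_bytes")),
        )
    return limits
```

For example, `discard_limits(["md0", "sda", "sdb"])` would show the array next to its members. A non-zero value on the md device does not by itself prove that discards are passed down to the drives, so this is only a first check.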
PS: Refer to the attachments for detailed SMART data and other logs
1) The drive seems to have correct partition alignment
2) SMART data seems to indicate that the drive has 90% remaining life
3) SMART data shows that "91858" LBAs were written to the drive, which is pretty low for the drive to fail.
SMART devstat data shows the following; I am not sure which figure is reliable:
1 0x018 6 6020013265 Logical Sectors Written
1 0x020 6 29657954 Number of Write Commands
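One possible reading is that the two write counters agree and only differ in units. On Intel drives, smartctl usually labels attribute 241 `Host_Writes_32MiB`, meaning the raw value counts 32 MiB units rather than single LBAs. Treating "91858" as that attribute is an assumption, since the attachments are not reproduced here, and so is the 512-byte logical sector size. Under those assumptions the arithmetic can be checked with a few lines of Python:

```python
MIB = 1024 * 1024
SECTOR_BYTES = 512             # assumed logical sector size

smart_raw = 91858              # figure quoted above as "LBAs written"
devstat_sectors = 6020013265   # devstat "Logical Sectors Written"

bytes_from_smart = smart_raw * 32 * MIB           # assumes 32 MiB units
bytes_from_devstat = devstat_sectors * SECTOR_BYTES

diff_mib = abs(bytes_from_smart - bytes_from_devstat) / MIB

print(f"SMART attribute: {bytes_from_smart / 1e12:.2f} TB")
print(f"devstat:         {bytes_from_devstat / 1e12:.2f} TB")
print(f"difference:      {diff_mib:.1f} MiB")
```

Both figures come to roughly 3.08 TB and differ by less than one 32 MiB unit, so under these assumptions the counters are consistent with each other.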
1) Is the SSD model (INTEL SSDSC2BB016T7) susceptible to the problem described in the Linux kernel source note above? Are any recent (post-2010) drives impacted by that module parameter?
2) Based on the attached information, is it possible to tell why certain sectors are unreadable (e.g. failures induced by too many writes, or other issues evident in the logs/SMART data)?
3) If the attached information is insufficient to conclude why the failure happened, what information do you recommend collecting?
4) Are there any known gotchas around fstrim and Intel SSDs?
If these questions can be better answered in a different community, please point me to it.