10 Replies Latest reply on Jul 6, 2017 3:57 PM by Intel Corporation

    Medium error - very frequent in multiple drives

    kthommandra

      In a very short span (about 2 months), two SSDs have failed in similar ways, so I am posting the information I collected in the hope of getting feedback.

      The vendor confirmed that the drives were tested before deployment.

       

      Failure signature - some of the sectors become unreadable

       

      /tmp/raid/var/log @mw136.sjc# dd if=/dev/sda of=/dev/null bs=512 skip=1007297200 count=1

      1+0 records in

      1+0 records out

      512 bytes (512 B) copied, 0.000221492 s, 2.3 MB/s

       

      /tmp/raid/var/log @mw136.sjc# dd if=/dev/sda of=/dev/null bs=512 skip=1007297210 count=1

      dd: reading '/dev/sda': Input/output error

      0+0 records in

      0+0 records out

      0 bytes (0 B) copied, 0.28118 s, 0.0 kB/s

       

      /tmp/raid/var/log @mw136.sjc# dd if=/dev/sda of=/dev/null bs=512 skip=1007297300 count=1

      1+0 records in

      1+0 records out

      512 bytes (512 B) copied, 0.000207949 s, 2.5 MB/s
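The three dd probes above can be generalized into a scan over a range of sectors. The following is a minimal sketch, not a diagnostic tool: the function name `find_unreadable_sectors` is hypothetical, it assumes a Unix system and read permission on the device, and it reads through the page cache just as plain dd does, so the kernel may report a whole page of neighbouring sectors as failed together.

```python
import os


def find_unreadable_sectors(path, start, count, sector_size=512):
    """Return the sector numbers in [start, start + count) whose read fails.

    Equivalent in spirit to running
        dd if=PATH of=/dev/null bs=512 skip=SECTOR count=1
    once per sector and collecting the ones that report an I/O error.
    """
    bad = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for sector in range(start, start + count):
            try:
                # pread reads at an absolute byte offset without seeking.
                os.pread(fd, sector_size, sector * sector_size)
            except OSError:
                # EIO from the block layer lands here (the "Input/output error").
                bad.append(sector)
    finally:
        os.close(fd)
    return bad
```

For the range probed above, a call such as `find_unreadable_sectors("/dev/sda", 1007297200, 101)` would cover sectors 1007297200 through 1007297300. The resulting list can be compared between runs to see whether the unreadable region grows.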

       

       

      So far in my system, 3 drives have failed within a short span. One completely dead and another two with symptoms as shown above.

      I'm guessing that "fstrim" might be causing this. This is just a hunch and I have no conclusive evidence.

      In another system, with drives from a non-Intel vendor, enabling fstrim caused XFS filesystem panics and system instability (freezes, etc.).

       

      The system with Intel drives runs CentOS 7.2 with MD RAID-0.

      Since MD RAID-0 disables discard (and therefore fstrim) by default, I enabled it with the "raid0.devices_discard_performance=Y" module parameter.

       

      A comment in the Linux kernel sources (linux-3.10.0.327.36.3.el7/drivers/md/raid0.c) says the following about the module parameter:

       

                      /* Unfortunately, some devices have awful discard performance,
                       * especially for small sized requests. This is particularly
                       * bad for RAID0 with a small chunk size resulting in a small
                       * DISCARD requests hitting the underlaying drives.
                       * Only allow DISCARD if the sysadmin confirms that all devices
                       * in use can handle small DISCARD requests at reasonable speed,
                       * by setting a module parameter.
                       */
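To confirm whether the parameter is actually in effect on a running system, its current value can be read back. This is a rough sketch that assumes the parameter is exposed at /sys/module/raid0/parameters/devices_discard_performance, as module parameters with read permissions normally are; that path and the function name `raid0_discard_enabled` are assumptions, not something confirmed in this thread.

```python
from pathlib import Path

# ASSUMPTION: the RHEL/CentOS raid0 module exposes the parameter at this path.
DEFAULT_PARAM = "/sys/module/raid0/parameters/devices_discard_performance"


def raid0_discard_enabled(param_path=DEFAULT_PARAM):
    """Return True/False for the parameter value, or None if it is not exposed."""
    try:
        value = Path(param_path).read_text().strip()
    except FileNotFoundError:
        # Module not loaded, or this kernel does not have the parameter.
        return None
    # The kernel prints boolean module parameters as "Y" or "N".
    return value.upper().startswith("Y")
```

A result of None would mean the raid0 module is not loaded or that the running kernel does not carry this parameter at all.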

       

      Summary:

      PS: Refer to the attachments for detailed SMART data and other logs

       

      1) The drive seems to have correct partition alignment

      2) SMART data seems to indicate that the drive has 90% remaining life

      3) SMART data shows that "91858" LBAs were written to the drive, which seems far too low for the drive to have worn out.

          The SMART devstat data shows the following; I am not sure which figure is reliable:

        1  0x018  6       6020013265  Logical Sectors Written

        1  0x020  6         29657954  Number of Write Commands
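The two write counters can be cross-checked with a little arithmetic. The sketch below assumes that the "91858" figure is the raw value of a host-writes SMART attribute counted in 32 MiB units (one increment per 65,536 sectors), which is how Intel drives commonly report it; that unit is an assumption and is not stated in the post. The capacity and sector size are the values reported for this drive model later in the thread.

```python
SECTOR_BYTES = 512                       # logical sector size reported by the drive
CAPACITY_BYTES = 1_600_321_314_816       # user capacity reported for this model

logical_sectors_written = 6_020_013_265  # devstat 0x018
write_commands = 29_657_954              # devstat 0x020
smart_host_writes_raw = 91_858           # the "LBAs written" figure from SMART

# ASSUMPTION: the SMART raw value counts units of 32 MiB (65,536 sectors).
SMART_UNIT_BYTES = 32 * 1024 ** 2

devstat_bytes = logical_sectors_written * SECTOR_BYTES
smart_bytes = smart_host_writes_raw * SMART_UNIT_BYTES

# If the assumption holds, the two totals differ by less than one 32 MiB unit.
counters_agree = abs(devstat_bytes - smart_bytes) < SMART_UNIT_BYTES

# Total host writes expressed as multiples of the drive capacity.
drive_writes = devstat_bytes / CAPACITY_BYTES

# Average size of one write command, in KiB.
avg_write_kib = devstat_bytes / write_commands / 1024

print(f"devstat total : {devstat_bytes / 1e12:.2f} TB")
print(f"SMART total   : {smart_bytes / 1e12:.2f} TB (assuming 32 MiB units)")
print(f"counters agree: {counters_agree}")
print(f"drive writes  : {drive_writes:.2f}")
print(f"avg write size: {avg_write_kib:.1f} KiB")
```

Under that assumption both counters describe the same volume, roughly 3 TB or about two full drive writes, so the two figures would be consistent with each other rather than contradictory.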

       

      Questions:

       

      1) Is the SSD model (INTEL SSDSC2BB016T7) susceptible to the problem described in the Linux kernel sources (see the comment above)? Are any recent (post-2010) drives affected by that module parameter?

       

      2) Based on the attached information, is it possible to tell why certain sectors are unreadable (e.g., failures induced by too many writes, or other issues evident in the logs or SMART data)?

       

      3) If the attached information is insufficient to conclude why the failure happened, what information do you recommend collecting?

       

      4) Are there any known gotchas around fstrim and Intel SSDs?

       

      If these questions would be better answered in a different community, please point me to it.

       

      -krishna

       

       

       

       

       

       

        • 1. Re: Medium error - very frequent in multiple drives
          Intel Corporation
          This message was posted on behalf of Intel Corporation

          Hello Kthommandra,

          Thanks for bringing this situation to our attention. We'd like to engage additional resources to investigate it. Please allow us some time, and we'll get back to you with an update.

          Regards,
          Nestor C

          • 2. Re: Medium error - very frequent in multiple drives
            kthommandra

            Thank you for taking a look.

             

            Another observation: fstrim is very slow when /sys/block/XXX/queue/nomerges is set to 2, and it is fast when the value is 0 or 1.

             

            What is the recommended size for the trim requests for these drives?
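When comparing fstrim behaviour across servers, it can help to record the block-queue settings next to each timing. The sketch below reads a few files from /sys/block/<device>/queue/; the function name `read_queue_settings` is hypothetical, and which of these files exist depends on the kernel version, so missing entries are returned as None rather than treated as errors.

```python
from pathlib import Path

# Queue attributes relevant to request merging and discard sizing.
QUEUE_FILES = (
    "nomerges",             # 0 = all merges, 1 = simple merges only, 2 = no merges
    "discard_granularity",  # smallest discard unit the device reports, in bytes
    "discard_max_bytes",    # largest single discard request, in bytes
    "max_sectors_kb",       # largest I/O request the block layer will issue, in KiB
)


def read_queue_settings(device, sysfs_root="/sys/block"):
    """Return {attribute: int or None} for the given block device."""
    queue = Path(sysfs_root) / device / "queue"
    settings = {}
    for name in QUEUE_FILES:
        try:
            settings[name] = int((queue / name).read_text().strip())
        except (FileNotFoundError, ValueError):
            # Attribute absent on this kernel, or not a plain integer.
            settings[name] = None
    return settings
```

For example, `read_queue_settings("sda")` could be logged before each fstrim run so that slow and fast runs can be matched to the nomerges value in effect at the time.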

            • 3. Re: Medium error - very frequent in multiple drives
              Intel Corporation

              Hello Kthommandra,

              After checking the logs, we would like you to update the firmware on the SSDs, as the version shown in the logs is not the latest.

              To do that, please download the Intel® SSD Data Center Tool. The command to run the firmware update is: isdct load -intelssd X (where X is the index of the drive)


              The following changes are included in this firmware update:

              • Correction to SMART attribute BBh and F1h increment behavior
              • Fix to drive behavior when power loss occurs during Secure Erase
              • Fixed issue where SCT Extended Status Code, Action Code and Function Code were not being cleared on a COMRESET
              • Fix to address occasional Standby Immediate failure
              • Legacy ATA commands not relevant in ACS-3 no longer aborted
              • Correction to drive behavior when running SMART selftest using Smartctl* and ABORT command received

              At the same time, could you please provide us with the nLog from the Intel® SSD Data Center Tool?
              The command is: isdct dump -nlog -intelssd X

              In case you would like to check the guide, here it is. 

              We will be waiting for your response.

              Regards,
              Nestor C

              • 4. Re: Medium error - very frequent in multiple drives
                kthommandra

                I have updated all the drives in the affected system to the latest firmware:

                 

                ===

                Firmware : N2010112

                 

                FirmwareUpdateAvailable : The selected Intel SSD contains current firmware as of this tool release.

                ===

                 

                During the firmware update I noticed the following errors in dmesg; I think these are harmless.

                 

                Jun 10 11:14:13 kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

                Jun 10 11:14:13 kernel: ata5.00: failed command: DOWNLOAD MICROCODE

                Jun 10 11:14:13 kernel: ata5.00: cmd 92/03:a4:00:00:07/00:00:00:00:00/40 tag 19 pio 83968 out#012         res 40/00:60:27:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

                Jun 10 11:14:14 kernel: ata5.00: status: { DRDY }

                Jun 10 11:14:14 kernel: ata5: hard resetting link

                Jun 10 11:14:16 kernel: ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

                Jun 10 11:14:16 kernel: ata5.00: configured for UDMA/133

                Jun 10 11:14:16 kernel: ata5: EH complete

                Jun 10 11:14:16 kernel: ata5.00: Enabling discard_zeroes_data

                • 5. Re: Medium error - very frequent in multiple drives
                  kthommandra

                  Just to clarify, the errors described in the original post still persist after the firmware upgrade.

                  • 6. Re: Medium error - very frequent in multiple drives
                    kthommandra

                    This is the requested log file from the impacted drive.

                    • 7. Re: Medium error - very frequent in multiple drives
                      Intel Corporation

                      Hi Kthommandra,

                      Thank you so much for all the information provided. We will keep you posted with any news.

                      Regards,
                      Nestor C

                      • 8. Re: Medium error - very frequent in multiple drives
                        kthommandra

                        Hi

                         

                        We have another occurrence of the exact same issue on another server.

                         

                        This is the third failure in a short span of time.

                        Kindly escalate the investigation.

                         

                        Device Model:    INTEL SSDSC2BB016T7

                        Firmware Version: N2010112

                        User Capacity:    1,600,321,314,816 bytes [1.60 TB]

                        Sector Sizes:    512 bytes logical, 4096 bytes physical

                        Rotation Rate:    Solid State Device

                        ATA Version is:  ACS-3 (unknown minor revision code: 0x006d)

                        SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)

                         

                        We had re-enabled fstrim on this server with nomerges=0.

                        We don't know whether the issue is related to fstrim, but for now we have disabled fstrim again.

                         

                        I have attached the nlog from this newly failed drive.

                         

                        Questions:

                         

                        1) Can you share some information about the TRIM requirements of these drives, especially what kinds of TRIM requests (size, frequency, etc.) can be detrimental to the drive?

                        2) Is there any kind of re-formatting that we could do to re-use these drives instead of replacing them?

                        • 9. Re: Medium error - very frequent in multiple drives
                          kthommandra

                          Is there any update on the investigation?

                           

                          A related question: when a drive is in this state, could we just re-format it and use it, potentially with lowered capacity?

                          • 10. Re: Medium error - very frequent in multiple drives
                            Intel Corporation

                            Hello Kthommandra,

                            We apologize for the long delay. Please check your private messages inbox.

                            Please let us know if you have any questions.

                            Regards,
                            Nestor C