10 Replies Latest reply on May 7, 2013 9:49 AM by selig

    Intel HW RAID: consistency check error, puncturing bad block - on new HDDs???

    mr.teecee

      Hi,

       

      I have an Intel server with an Intel case, board, and an SRCSASL4I Intel HW RAID card. I have 2 backplane expander boxes, so the HW RAID card has 6 + 4 HDDs attached. I always had problems with the badly chosen WDC 2 TB Green HDDs, so more than a few months ago I replaced some of the disks, which left me with 6 x WD RED 2 TB + 4 x Samsung 2 TB. I still had some issues with these disks as well. Now I have replaced all the remaining non-WD-RED disks, so I have a 10 x WD RED 2 TB RAID, still with problems!

       

      These errors appeared a few times a week:

      Consistency Check detected uncorrectable multiple medium errors:

      Puncturing bad block: PD Int.Ports 0-3:2:2

       

      Since all the HDDs are brand new, I cannot understand the error message.

       

      First of all: WHICH DISK does 'Int.Ports 0-3:2:2' in this error message refer to? In the RAID Web Console I see 'Connector: Int. Ports 0-3' for both the 4-disk and the 6-disk expander.

       

      Also, the messages are:

      Consistency Check detected uncorrectable multiple medium errors: ( PD Int.Ports 0-3:2:2 Location 0x1651aa63 VD 0)

      Puncturing bad block: PD Int.Ports 0-3:2:2 Location 0x1651aa63

      Consistency Check detected uncorrectable multiple medium errors: ( PD Int.Ports 0-3:2:4 Location 0x16513b66 VD 0)

      Puncturing bad block: PD Int.Ports 0-3:2:4 Location 0x16513b66

      Consistency Check detected uncorrectable multiple medium errors: ( PD Int.Ports 0-3:2:0 Location 0x13c3085 VD 0)

      Puncturing bad block: PD Int.Ports 0-3:2:0 Location 0x13c3085

      Consistency Check detected uncorrectable multiple medium errors: ( PD Int.Ports 0-3:2:1 Location 0x13c3005 VD 0)

      Puncturing bad block: PD Int.Ports 0-3:2:1 Location 0x13c3005


      So this refers to 4 disks, as if each of them had a bad block? 4 out of 10 new disks? No way!

      Please note that for all disks I have 'Media Error Count: 0' and 'Pred Fail Count: 0' in the RAID web console.


      PLEASE help me understand and resolve the problem.


      Thank You for reading this and hopefully trying to help me!
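
      The log lines quoted in the post above can be grouped per physical disk with a few lines of Python. This is only a sketch: the regular expression is written against the four message pairs shown in this thread, the function name is hypothetical, and the meaning of the numbers after the connector name and the unit of "Location" are not confirmed anywhere in the thread, so the code keeps the PD string as-is and does not interpret it.

```python
import re

# The four message pairs from the post, with whitespace normalized.
LOG = """\
Consistency Check detected uncorrectable multiple medium errors: ( PD Int.Ports 0-3:2:2 Location 0x1651aa63 VD 0)
Puncturing bad block: PD Int.Ports 0-3:2:2 Location 0x1651aa63
Consistency Check detected uncorrectable multiple medium errors: ( PD Int.Ports 0-3:2:4 Location 0x16513b66 VD 0)
Puncturing bad block: PD Int.Ports 0-3:2:4 Location 0x16513b66
Consistency Check detected uncorrectable multiple medium errors: ( PD Int.Ports 0-3:2:0 Location 0x13c3085 VD 0)
Puncturing bad block: PD Int.Ports 0-3:2:0 Location 0x13c3085
Consistency Check detected uncorrectable multiple medium errors: ( PD Int.Ports 0-3:2:1 Location 0x13c3005 VD 0)
Puncturing bad block: PD Int.Ports 0-3:2:1 Location 0x13c3005
"""

# Assumption: every PD is written as "Int.Ports A-B:X:Y" followed by a hex Location.
PATTERN = re.compile(
    r"PD\s+(?P<pd>Int\.Ports\s+\d+-\d+:\d+:\d+)\s+Location\s+(?P<loc>0x[0-9a-fA-F]+)"
)


def punctured_blocks(log_text):
    """Map each PD string to the list of punctured locations (as integers)."""
    blocks = {}
    for line in log_text.splitlines():
        if "Puncturing bad block" not in line:
            continue  # the matching "Consistency Check" line repeats the same data
        match = PATTERN.search(line)
        if match:
            pd = " ".join(match.group("pd").split())  # collapse repeated spaces
            blocks.setdefault(pd, []).append(int(match.group("loc"), 16))
    return blocks


blocks = punctured_blocks(LOG)
for pd, locations in sorted(blocks.items()):
    print(pd, [hex(loc) for loc in locations])
```

      Grouped this way, the log shows one punctured location on each of four PDs, and the locations reported for 0-3:2:0 and 0-3:2:1 are only 0x80 apart.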

        • 1. Re: Intel HW RAID: consistency check error, puncturing bad block - on new HDDs???
          selig

          Dear mr.teecee,

           

          First of all, validate S.M.A.R.T. values.

           

          Boot into Linux (a LiveCD will do) and run the following command for each disk:

          smartctl -A /dev/[hdd]

           

          for example:

          smartctl -A /dev/sda

          smartctl -A /dev/sg0

          The device name can vary depending on the configuration. You can check the drive name assignment in dmesg.

           

           

          Post the output here and I will check your results.

           

           

          Greetings,

          Saelic Vogel

          1 of 1 people found this helpful
          • 2. Re: Intel HW RAID: consistency check error, puncturing bad block - on new HDDs???
            mr.teecee

            Thanks, I'll probably do that this afternoon/evening. Since the RAID is in use, the easiest way is to shut down the server for a while and pull the disks out. (I think I cannot use a LiveCD on the server, since the RAID controller exposes only 1 volume as a disk...)

             

            Thanks,

              Tamas

            • 3. Re: Intel HW RAID: consistency check error, puncturing bad block - on new HDDs???
              selig

              It depends. With some Linux drivers it is possible to check each disk independently.

               

              For example: if the RAID volume shows up as /dev/sda, it may be possible to address each physical disk of its RAID group via a SCSI generic device /dev/sgN (N=0,1,2...).

               

              Run   ls -l /dev | grep -E 'sd|sg'   and see what's there.
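
              As a complement to the /dev/sgN approach above, smartmontools also has a `-d megaraid,N` device type for disks behind LSI MegaRAID-based controllers. Whether the SRCSASL4I responds to it is an assumption that this thread does not confirm, and the helper names below are hypothetical. The sketch only builds and runs one `smartctl -A` call per disk number:

```python
import shutil
import subprocess


def smartctl_commands(device, disk_ids):
    """Build one `smartctl -A` command per physical disk behind `device`.

    Assumption: the controller is reachable through smartctl's
    `-d megaraid,N` device type; this is not confirmed in the thread.
    """
    return [["smartctl", "-A", "-d", f"megaraid,{n}", device] for n in disk_ids]


def run_all(device, disk_ids):
    """Run the commands and return {disk number: smartctl output}."""
    disk_ids = list(disk_ids)
    if shutil.which("smartctl") is None:
        raise RuntimeError("smartctl not found; install the smartmontools package")
    outputs = {}
    for n, cmd in zip(disk_ids, smartctl_commands(device, disk_ids)):
        result = subprocess.run(cmd, capture_output=True, text=True)
        outputs[n] = result.stdout
    return outputs


# Example: print the commands for 10 disks without running them.
for cmd in smartctl_commands("/dev/sda", range(10)):
    print(" ".join(cmd))
```

              The disk numbers 0 to 9 are a guess; the numbering the controller actually uses has to be read from its own tools.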

              • 4. Re: Intel HW RAID: consistency check error, puncturing bad block - on new HDDs???
                mr.teecee

                >> It depends. There is possibility to check each disk independently on some Linux drivers.

                Ahh, I see.

                Do You know if Ubuntu Server 13.04 or 12.04 LTS supports it? (I'm downloading 13.04 64-bit at the moment, so hopefully I can do it in 1 shutdown without pulling the HDDs out.)

                 

                Thanks!

                • 5. Re: Intel HW RAID: consistency check error, puncturing bad block - on new HDDs???
                  selig

                  The kernel should support it, but I'm not sure whether the Ubuntu LiveCD includes the smartmontools package.

                   

                  I don't remember when I last used a LiveCD, so I can't advise you which one to pick. Anyway, the main points are 1) what you've got in /dev/ and 2) having the smartmontools package installed.

                   

                  You can post the results here and I will help you.

                  • 6. Re: Intel HW RAID: consistency check error, puncturing bad block - on new HDDs???
                    mr.teecee

                    Hi,

                     

                    I've put the test results up here: Dropbox - WD RED SMART test

                    I can see only 1 error, in WDRED08.txt: 2 RAW READ ERRORS. All the other HDDs were OK. Or did I miss something?

                    The SMART data is at the bottom of each txt file. I was able to get it with SentinelHD after pulling the disks out one by one...

                     

                    Thank You very much for helping me with this.

                    • 7. Re: Intel HW RAID: consistency check error, puncturing bad block - on new HDDs???
                      selig

                      In WDRED05 the Spin Up Time is reported as 6700 ms, which is a little longer than usual (compare with 4266 in WDRED07). It's also puzzling that the other disks show zero there. Anyway, it is not related to your problem.

                       

                      Raw Read Error Rate = 2 in WDRED08 definitely indicates that disk replacement is necessary (especially in mission-critical environments). I would start from this point to solve the issue: replace this disk. There is a high probability that the consistency error is caused by it.

                       

                      The other disks are fine. Do what I mentioned above.

                       

                      S.V.

                      1 of 1 people found this helpful
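
                      The manual check described above (reading the raw values in each disk's `smartctl -A` output) can be sketched in Python. The sample table is illustrative only: just the raw values 2 and 6700 come from this thread, the remaining columns are placeholders and not the poster's actual output. The set of attributes to watch and the function names are assumptions made for this sketch.

```python
# Attributes whose non-zero raw value is treated as suspicious in this sketch.
WATCHED = {
    "Raw_Read_Error_Rate",
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}

# Illustrative sample in the usual `smartctl -A` table layout (placeholder columns).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       2
  3 Spin_Up_Time            0x0027   100   100   021    Pre-fail  Always       -       6700
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
"""


def parse_attributes(text):
    """Return {attribute name: raw value} for every attribute row in the table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have the raw value in column 10.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def suspicious(attrs):
    """Return the watched attributes whose raw value is greater than zero."""
    return {name: raw for name, raw in attrs.items() if name in WATCHED and raw > 0}


print(suspicious(parse_attributes(SAMPLE)))
```

                      Running the same two functions over each of the ten text files would give a one-line summary per disk.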
                      • 8. Re: Intel HW RAID: consistency check error, puncturing bad block - on new HDDs???
                        mr.teecee

                        >> Raw Read Error Rate = 2 in WDRED08 definitely indicates that disk replacement is necessary (especially in mission-critical environments). I would start from this point to solve the issue: replace this disk. There is a high probability that the consistency error is caused by it.

                        >> The other disks are fine. Do what I mentioned above.

                         

                        Thanks. I have 9 of the 10 disks in RAID5 and 1 as a hot spare, so I'll replace WDRED08 with the hot spare and run a consistency check again.

                         

                        Also, I'm curious: the 'puncturing bad block' message appears 4 times and refers to 4 different disks according to the error messages: 0-3:2:2, 0-3:2:4, 0-3:2:0, 0-3:2:1

                        How can I tell which disks these are?

                        The SMART data was good except for disk 08, which shows 2 raw read errors. 2 is 2, not 4 as it is stated in the error messages. All the others were clean according to SMART :-\

                         

                        Thanks again, I'll write my result of the consistency check after replacement!


                         

                        • 9. Re: Intel HW RAID: consistency check error, puncturing bad block - on new HDDs???
                          mr.teecee

                          >> Thanks again, I'll write my result of the consistency check after replacement!

                          Haha! It seems You were right: 2 of the 4 'bad block' messages appeared shortly after I started the copyback process.

                          If the volume doesn't get much load, then it will take 6 hours to complete the copyback and the consistency check.

                          • 10. Re: Intel HW RAID: consistency check error, puncturing bad block - on new HDDs???
                            selig

                            >> 2 is 2, not 4 as it is stated in the error messages

                            It's not that simple. Based on the SMART records we can assume that there are 2 bad blocks for sure, but we must also assume that there could be some pending bad blocks which were not yet reported by the SMART attributes (Current Pending Sector or Raw Read Error Rate). There is also the Write Error Rate parameter, whose value can vary while the disk is powered on and which (based on WD information) is updated on a power cycle (strange to me; some time ago it was updated during runtime, I must validate it).

                             

                             

                            If you are interested in my opinion, I would say that HW RAID is obsolete. I don't use RAID controllers for data security, because they don't offer a good level of protection. The only thing you get is dispersion of data, which in reality you can easily lose. We have a brilliant file system called ZFS. It solves a lot of problems, and it's much more reliable and error resilient. It is the file system for current storage needs - even M$ plagiarized it in their tragic Windows Server 2012 system, where it is known as 'storage pools'.

                            You see, in your situation you don't even know whether the data you put on those bad sectors is still OK - there is no integrity control. The consistency check only informs you that there is some inconsistency between the physical drives which must be corrected. But how does the controller know which data is correct? If you have critical data, the best way to protect it is ZFS, not $10000 controllers.

                             

                             

                            Anyway, which device referred to the physical disks in your situation? Was it /dev/sgN?
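
                            The question raised above ("how does the controller know which data is correct?") can be illustrated with a toy Python model. It is a simplified sketch with made-up 4-byte blocks and hypothetical helper names, not a description of how ZFS or the RAID controller is actually implemented: XOR parity alone shows that a stripe is inconsistent, while a separately stored per-block checksum also shows which block is wrong, so it can be rebuilt from the others.

```python
import hashlib
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def checksum(block):
    """Per-block checksum, kept apart from the data in this toy model."""
    return hashlib.sha256(block).digest()


# A stripe of three data blocks plus one parity block (RAID5-like).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
checksums = [checksum(block) for block in data]

# Silent corruption: one block changes and no drive reports a read error.
stored = list(data)
stored[1] = b"BxBB"

# Parity alone only says "something in this stripe is inconsistent".
parity_mismatch = xor_blocks(stored) != parity

# Checksums identify the damaged block.
bad = [i for i, block in enumerate(stored) if checksum(block) != checksums[i]]

# With the bad block known, it can be rebuilt from the other blocks and parity.
others = [block for i, block in enumerate(stored) if i != bad[0]]
repaired = xor_blocks(others + [parity])

print(parity_mismatch, bad, repaired)
```

                            Without the checksum list, the same mismatch could equally be blamed on any of the three data blocks or on the parity block itself.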