10 Replies Latest reply on Aug 26, 2009 10:19 PM by Codeplug

    Why is Matrix Manager marking a drive as bad that isn't bad?

    theshowmecanuck

      Problem Description

       

      I just built a new system. I have a pair of 640GB drives mirrored as the system drive(s). I have 4x1TB drives in RAID10 for storage. Initially I had the storage volume set up as RAID5. The processor is an i7 920 on an Asus P6T motherboard with 6 GB of DDR3, running Windows 7 RC1. There are six useful SATA ports on the board (0 to 5). Ports 0 and 1 are used for the mirrored system volume. Ports 2 through 5 are used for the storage volume.
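
      For reference, the usable capacity of the layouts mentioned above can be worked out with a few lines of Python. This is only a rough sketch: the function name is made up for illustration, it assumes equal-sized drives, and it uses nominal sizes (the formatted sizes a controller reports will be somewhat smaller).

```python
def usable_capacity(level, drive_count, drive_size):
    """Usable capacity for equal-sized drives, in the same unit as drive_size.

    Hypothetical helper for illustration; not part of any Intel tool.
    """
    if drive_count < 2:
        raise ValueError("need at least 2 drives")
    if level == "RAID1":
        # Every drive holds a full copy, so capacity is one drive's worth.
        return drive_size
    if level == "RAID5":
        if drive_count < 3:
            raise ValueError("RAID5 needs at least 3 drives")
        # One drive's worth of space goes to parity.
        return (drive_count - 1) * drive_size
    if level == "RAID10":
        if drive_count < 4 or drive_count % 2:
            raise ValueError("RAID10 needs an even number of drives, at least 4")
        # Half of the drives are mirrors of the other half.
        return (drive_count // 2) * drive_size
    raise ValueError(f"unsupported level: {level}")


# Nominal sizes in GB for the setup described above.
print("System mirror :", usable_capacity("RAID1", 2, 640))
print("Storage RAID5 :", usable_capacity("RAID5", 4, 1000))
print("Storage RAID10:", usable_capacity("RAID10", 4, 1000))
```

      With four drives, moving from RAID5 to RAID10 gives up one drive's worth of space in exchange for the mirrored layout.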

       

      Just after I finished building the machine and installing the OS (with the storage volume being RAID5) the Matrix Storage Manager reported that SATA Port #4 had failed. I removed the drive and replaced it (I bought one drive extra as a spare). The machine rebuilt the array. In the meantime, I installed the 'bad' drive in an external hard drive enclosure, formatted it to NTFS and ran disk check on it, including looking for bad sectors. The disk check came back clean. No problems with the supposedly 'bad' hard drive.

       

      A day later I had an issue with the drive at port #4. I took it out of the raid array via the config utility, and added it back to the RAID array without physically changing the drive. It rebuilt and ran fine for a while. I doubt very much if two new drives are bad, and especially so when the storage manager reported that it added the 'new' drive into the array and that it functioned normally... even when I didn't really put a new drive back in.

       

      The next day I rebuilt the storage volume as RAID10 for performance reasons. Last night, after about a week of no issues, I removed a drive from a set of two other SATA ports that Asus adds to the board that serve no useful purpose from what I can see. They support some 'special' features that I can't make heads or tails of (and I wish they were just two more 'normal' SATA ports). They call them JMstore or something like that. Anyway, I attached my SATA DVD R/W drive to one of those and attached the front panel eSATA port to the second. The DVD works but the system (Windows 7) didn't recognize that the external drive was connected, so I shut down, removed the external drive from the front panel and attached it to an eSATA port built right into the back panel of the motherboard (on the back of the PC with the sound output plugs, USB ports, etc).

       

      After that happened I restarted the machine and lo and behold the Matrix Storage Manager told me that the drive on SATA Port #4 had failed again. On top of that, the drive on SATA Port #1 of the mirrored system volume was showing failed. The machine was not shut down abnormally when working with the eSATA drive issue. So I shut down the system and restarted it, and when the RAID config screen came up I pressed <CTRL+I> and marked those drives as not being part of the RAID sets (i.e. the drives on ports 1 and 4). I then added them back to their respective RAID arrays and continued with the reboot. The system came back up and rebuilt the drives. That is, the system did not say, "those drives are failed and I can't add them to your RAID volumes." I believe that, like the first time, the drives were not really bad, but that there is something wrong with either the Matrix Storage Manager or the Mobo.

       

      Questions

       

      As I sit here I just checked and the Storage Manager is telling me the drive at SATA Port 4 has failed again. Why would the storage manager report a drive as being bad when it is not? Would anyone believe that it is the Mobo SATA port #4 that is bad and not the Matrix Storage Manager? And why?

       

      I am definitely going to talk to the store I bought it at and will likely get it exchanged. But I am interested in any feedback from the people here and especially from Intel.

       

      Matrix Storage Manager 'Storage Report' follows:

      ---------------------------------------------------------------------------------------------

      System Information

       

      Kit Installed: 8.9.0.1015
      Kit Install History: 8.9.0.1015, Uninstall
      Shell Version: 8.9.0.1015

       

      OS Name: Microsoft Windows 7 Ultimate
      OS Version: 6.1.7100  Build 7100
      System Name: TBONE
      System Manufacturer: ASUSTeK Computer INC.
      System Model: P6T
      Processor: Intel(R) Core(TM) i7 CPU         920  @ 2.67GHz
      BIOS Version/Date: American Megatrends Inc. 0603   , 05/19/2009

       

      Language: ENU

       

       

       

      Intel(R) Matrix Storage Manager

       

      Intel RAID Controller: Intel(R) ICH8R/ICH9R/ICH10R/DO/PCH SATA RAID Controller
      Number of Serial ATA ports: 6

      RAID Option ROM Version: 8.0.0.1038
      Driver Version: 8.9.0.1015
      RAID Plug-In Version: 8.9.0.1015
      Language Resource Version of the RAID Plug-In: 8.9.0.1015
      Create Volume Wizard Version: 8.9.0.1015
      Language Resource Version of the Create Volume Wizard: 8.9.0.1015
      Create Volume from Existing Hard Drive Wizard Version: 8.9.0.1015
      Language Resource Version of the Create Volume from Existing Hard Drive Wizard: 8.9.0.1015
      Modify Volume Wizard Version: 8.9.0.1015
      Language Resource Version of the Modify Volume Wizard: 8.9.0.1015
      Delete Volume Wizard Version: 8.9.0.1015
      Language Resource Version of the Delete Volume Wizard: 8.9.0.1015
      ISDI Library Version: 8.9.0.1015
      Event Monitor User Notification Tool Version: 8.9.0.1015
      Language Resource Version of the Event Monitor User Notification Tool: 8.9.0.1015
      Event Monitor Version: 8.9.0.1015

      Array_0000
      Status: No active migrations
      Hard Drive Data Cache Enabled: Yes
      Size: 1192.3 GB
      Free Space: 0 GB
      Number of Hard Drives: 2
      Hard Drive Member 1: WDC WD6400AAKS-00A7B0
      Hard Drive Member 2: WDC WD6400AAKS-00A7B0
      Number of Volumes: 2
      Volume Member 1: Win7SystemRAID1
      Volume Member 2: temp

      Array_0001
      Status: No active migrations
      Hard Drive Data Cache Enabled: Yes
      Size: 3726 GB
      Free Space: 0 GB
      Number of Hard Drives: 4
      Hard Drive Member 1: ST31000528AS
      Hard Drive Member 2: ST31000528AS
      Hard Drive Member 3: ST31000528AS
      Hard Drive Member 4: ST31000528AS
      Number of Volumes: 1
      Volume Member 1: Storage_RAID10

      Win7SystemRAID1
      Status: Normal
      System Volume: No
      Volume Write-Back Cache Enabled: No
      RAID Level: RAID 1 (mirroring)
      Size: 300 GB
      Physical Sector Size: 512 Bytes
      Logical Sector Size: 512 Bytes
      Number of Hard Drives: 2
      Hard Drive Member 1: WDC WD6400AAKS-00A7B0
      Hard Drive Member 2: WDC WD6400AAKS-00A7B0
      Parent Array: Array_0000

      temp
      Status: Normal
      System Volume: No
      Volume Write-Back Cache Enabled: No
      RAID Level: RAID 1 (mirroring)
      Size: 296.1 GB
      Physical Sector Size: 512 Bytes
      Logical Sector Size: 512 Bytes
      Number of Hard Drives: 2
      Hard Drive Member 1: WDC WD6400AAKS-00A7B0
      Hard Drive Member 2: WDC WD6400AAKS-00A7B0
      Parent Array: Array_0000

      Storage_RAID10
      Status: Degraded
      System Volume: Yes
      Volume Write-Back Cache Enabled: No
      RAID Level: RAID 10 (striping and mirroring)
      Strip Size: 64 KB
      Size: 1863 GB
      Physical Sector Size: 512 Bytes
      Logical Sector Size: 512 Bytes
      Number of Hard Drives: 4
      Hard Drive Member 1: ST31000528AS
      Hard Drive Member 2: ST31000528AS
      Hard Drive Member 3: ST31000528AS
      Hard Drive Member 4: ST31000528AS
      Parent Array: Array_0001

      Hard Drive 0
      Usage: Array member
      Status: Normal
      Device Port: 0
      Device Port Location: Internal
      Current Serial ATA Transfer Mode: Generation 2
      Model: WDC WD6400AAKS-00A7B0
      Serial Number: WD-WMASY4234584
      Firmware: 01.03B01
      Native Command Queuing Support: Yes
      Hard Drive Data Cache Enabled: Yes
      Size: 596.1 GB
      Physical Sector Size: 512 Bytes
      Logical Sector Size: 512 Bytes
      Number of Volumes: 2
      Volume Member 1: Win7SystemRAID1
      Volume Member 2: temp
      Parent Array: Array_0000

      Hard Drive 1
      Usage: Array member
      Status: Normal
      Device Port: 1
      Device Port Location: Internal
      Current Serial ATA Transfer Mode: Generation 2
      Model: WDC WD6400AAKS-00A7B0
      Serial Number: WD-WMASY4386166
      Firmware: 01.03B01
      Native Command Queuing Support: Yes
      Hard Drive Data Cache Enabled: Yes
      Size: 596.1 GB
      Physical Sector Size: 512 Bytes
      Logical Sector Size: 512 Bytes
      Number of Volumes: 2
      Volume Member 1: Win7SystemRAID1
      Volume Member 2: temp
      Parent Array: Array_0000

      Hard Drive 2
      Usage: Array member
      Status: Normal
      Device Port: 2
      Device Port Location: Internal
      Current Serial ATA Transfer Mode: Generation 2
      Model: ST31000528AS
      Serial Number: 6VP064KV
      Firmware: CC34
      Native Command Queuing Support: Yes
      Hard Drive Data Cache Enabled: Yes
      Size: 931.5 GB
      Physical Sector Size: 512 Bytes
      Logical Sector Size: 512 Bytes
      Number of Volumes: 1
      Volume Member 1: Storage_RAID10
      Parent Array: Array_0001

      Hard Drive 3
      Usage: Array member
      Status: Normal
      Device Port: 3
      Device Port Location: Internal
      Current Serial ATA Transfer Mode: Generation 2
      Model: ST31000528AS
      Serial Number: 6VP05JN0
      Firmware: CC34
      Native Command Queuing Support: Yes
      Hard Drive Data Cache Enabled: Yes
      Size: 931.5 GB
      Physical Sector Size: 512 Bytes
      Logical Sector Size: 512 Bytes
      Number of Volumes: 1
      Volume Member 1: Storage_RAID10
      Parent Array: Array_0001

      Hard Drive 4
      Usage: Array member
      Status: Failed
      Device Port: 4
      Device Port Location: Internal
      Current Serial ATA Transfer Mode: Generation 2
      Model: ST31000528AS
      Serial Number: 6VP04ZWE
      Firmware: CC34
      Native Command Queuing Support: Yes
      Hard Drive Data Cache Enabled: Yes
      Size: 931.5 GB
      Physical Sector Size: 512 Bytes
      Logical Sector Size: 512 Bytes
      Number of Volumes: 1
      Volume Member 1: Storage_RAID10
      Parent Array: Array_0001

      Hard Drive 5
      Usage: Array member
      Status: Normal
      Device Port: 5
      Device Port Location: Internal
      Current Serial ATA Transfer Mode: Generation 2
      Model: ST31000528AS
      Serial Number: 5VP01JXT
      Firmware: CC34
      Native Command Queuing Support: Yes
      Hard Drive Data Cache Enabled: Yes
      Size: 931.5 GB
      Physical Sector Size: 512 Bytes
      Logical Sector Size: 512 Bytes
      Number of Volumes: 1
      Volume Member 1: Storage_RAID10
      Parent Array: Array_0001

      -------------------------------------------------------------------------
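
      For anyone who wants to scan a report like the one above without reading every block, here is a rough Python sketch. It assumes the report has been saved as plain text, that blocks are separated by blank lines, and that each block starts with a title line followed by "Key: Value" lines. The function names are made up for illustration and the sample is a shortened excerpt of the report above.

```python
import re


def parse_report(text):
    """Split a report into {section_title: {field: value}}.

    Assumes blank-line-separated blocks with a title on the first line.
    Blocks whose first line is already a "Key: Value" pair are skipped.
    """
    sections = {}
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines or ":" in lines[0]:
            continue
        fields = {}
        for ln in lines[1:]:
            key, sep, value = ln.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        sections[lines[0]] = fields
    return sections


def unhealthy(sections):
    """Return {section_title: status} for anything not reported as healthy."""
    healthy = (None, "Normal", "No active migrations")
    return {name: fields["Status"]
            for name, fields in sections.items()
            if fields.get("Status") not in healthy}


# Shortened excerpt of the report above.
SAMPLE = """
Storage_RAID10
Status: Degraded
RAID Level: RAID 10 (striping and mirroring)

Hard Drive 3
Status: Normal
Device Port: 3

Hard Drive 4
Status: Failed
Device Port: 4
Serial Number: 6VP04ZWE
"""

for name, status in unhealthy(parse_report(SAMPLE)).items():
    print(f"{name}: {status}")
```

      Saving a report after each failure and comparing the port and serial number of the failed entry makes it easier to see whether the same port keeps showing up.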

        • 1. Re: Why is Matrix Manager marking a drive as bad that isn't bad?
          Ndi

          Mine does that too, on the same port.

           

          I narrowed it down to 2 possibilities:

           

          a) Port 4 is bananas (b-a-n-a-n-a-s!)

           

          b) It fails because of a SMART event. In my setup, I have 2 sections of 3 drives each via RAID hotplug interfaces. The first disk is in slot 3 (last in group 1), and the other 4 are in group 2 (FFU-UUU with F free and U used, order is 1-2-3-4). It's like that because of a recent migration. Anyway, as you can see, the 4th drive is at the bottom. After getting it out and testing it, it was fine EXCEPT the SMART log listed a SMART EXCEEDED event with temperature being 69 C (max is 60 or so).

           

          The controller might have spit it out because of a SMART fail, even if the drive is good. Mind you, the first drive it spit out was fine too: rebuild, spit, rebuild. After 2 months or so, it actually started to give out bad sectors, cooked most likely. So even though the drive looks fine for now, it might still be on its way out. If it's an enterprise drive it must have a log. If not, it might. Do a SMART test.

           

          Also, you might want to touch the drive if it fails again. If you can _just_barely_ keep your finger on the hottest spot, it's probably near maxtemp.

           

          Also, I'd pay Intel good money to allow me to see SMART events. Not necessarily expose the drive, just pass them on in a log or so. When the drive gives a clue, write in a text file: Drive 4 maxtemp exceeded. It would be invaluable in the decision to rebuild or replace.
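
          Something along these lines is what I have in mind, as a rough Python sketch. How the temperatures would be read from behind the RAID controller is deliberately left out, since that is exactly the part that isn't exposed. The function names, the 60 C default and the sample readings are assumptions for illustration only.

```python
def smart_log_lines(readings, max_temp_c=60):
    """Build log lines for drives running hotter than max_temp_c.

    readings: {port_number: temperature_in_celsius}; how these values are
    obtained is outside the scope of this sketch.
    """
    return [
        f"Drive {port} maxtemp exceeded ({temp} C > {max_temp_c} C)"
        for port, temp in sorted(readings.items())
        if temp > max_temp_c
    ]


def append_to_log(path, lines):
    """Append each line to a plain text file, one event per line."""
    with open(path, "a", encoding="utf-8") as log:
        for line in lines:
            log.write(line + "\n")


# Illustrative readings: one hot drive on port 4.
for line in smart_log_lines({2: 41, 3: 40, 4: 69, 5: 41}):
    print(line)
```

          Even a log this basic would show whether a drive was kicked out right after a temperature event or for no visible reason.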

          • 2. Re: Why is Matrix Manager marking a drive as bad that isn't bad?
            theshowmecanuck

            Unfortunately temperature doesn't seem to be an issue here. The SMART reports for all the drives report max temperatures of only around 40 or 41 degrees. At least if they looked way high, it would tell me something.

            • 3. Re: Why is Matrix Manager marking a drive as bad that isn't bad?
              Ndi

              I'd still try a hands-on feel of the drives at high activity or on failure.

               

              Also, they are thrown out when a counter starts increasing, but before failing, such as relocation counter and somesuch. My drive actually failed a while after so I don't think it's simply bananas.

               

              You can try swapping drives. Since it's stripe over mirror, it should (and will, because the order doesn't matter) be safe to swap drives 3 and 4. If it spits drive 3 on port 4, it might be the port after all. Mine never complained of any other port, but those that it did complain about failed soon after. So it might not be bogus. I have 3 failed drives here.
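
              The logic of the swap test can be sketched in Python like this. It is only an illustration: the function name and the drive labels are made up, and it assumes you note the port and the drive serial each time something gets marked as failed.

```python
from collections import defaultdict


def diagnose(failures):
    """Guess whether failures follow a port or a drive.

    failures: list of (port, serial) pairs, one per 'failed' event.
    Returns ("port", ports), ("drive", serials) or ("inconclusive", []).
    """
    serials_by_port = defaultdict(set)
    ports_by_serial = defaultdict(set)
    for port, serial in failures:
        serials_by_port[port].add(serial)
        ports_by_serial[serial].add(port)

    # A port that rejected several different drives looks suspicious.
    bad_ports = sorted(p for p, s in serials_by_port.items() if len(s) > 1)
    # A drive that was rejected on several different ports looks suspicious.
    bad_drives = sorted(s for s, p in ports_by_serial.items() if len(p) > 1)

    if bad_ports and not bad_drives:
        return ("port", bad_ports)
    if bad_drives and not bad_ports:
        return ("drive", bad_drives)
    return ("inconclusive", [])


# Two different drives failing on port 4 points at the port.
print(diagnose([(4, "drive-A"), (4, "drive-B")]))
# The same drive failing on ports 4 and 3 points at the drive.
print(diagnose([(4, "drive-A"), (3, "drive-A")]))
```

              A single failure, or the same drive failing repeatedly on the same port, comes back as inconclusive, which is why the swap is worth doing.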

              • 4. Re: Why is Matrix Manager marking a drive as bad that isn't bad?
                UglyPercy

                I had four Seagate 1.5 TB 7200.11 series drives in a RAID-10 for months under Vista 64-bit with no issues under IMSM 8.8.  I upgrade to Windows 7 RC1 and IMSM 8.9.0.1015, and every day a drive is marked as "failed" -- sometimes the same one, sometimes a different one.  Sometimes the same port, sometimes a different one.  Each time, I mark the drive as "normal", and rebuild the array.  And then another one is marked as "failed" again in a day or so.  I downgrade back to Vista and IMSM 8.8, and no more problems -- it's been a few weeks now.

                 

                The inescapable conclusion: IMSM 8.9.0.1015 does not work properly under Windows 7 RC1, at least for some users.

                • 5. Re: Why is Matrix Manager marking a drive as bad that isn't bad?
                  Ndi

                  I didn't know IMSM could mark drives as failed. I believe that's the controller's job.

                  • 6. Re: Why is Matrix Manager marking a drive as bad that isn't bad?
                    UglyPercy

                    Not sure what you mean by "it's the controller's job".  You mean the actual ICH chip?  Yes, it does the actual "failed" marking, being a piece of hardware, but it is itself controlled by software, the drivers in the IMSM package.

                     

                    If you mean that the controller *exclusively* decides whether to mark a drive as failed, I don't believe that's correct.  IMSM decides to do the actual marking based somehow on perceived controller status, and I believe IMSM 8.9.0.1015 is making bad decisions under Windows 7 RC1.

                    • 7. Re: Why is Matrix Manager marking a drive as bad that isn't bad?
                      Ndi

                      Well I don't know how it works, if I did I wouldn't be here with an open question.

                       

                      However, I'm pretty sure that IMSM is event-driven and that it receives events from the controller. If it didn't, it would have to keep polling for errors every now and then, which would be bad for hot-plug. As a result, IMSM needs an event, and that event is hardware-based. It is possible that one version ignores, say, a temperature SMART alert below 50 degrees and another version doesn't. I couldn't say for sure.

                       

                      IMSM just kicking drives out for no actual reason and with no provocation, I find that hard to believe. I point out that, as in my previous post, the key words are "I believe" it does or doesn't do that. With no documentation or specifications it's speculation and I could be dead wrong. I have a few years as a coder behind me, and while that gives me insight into implementation techniques it does nothing in the sense of sniffing out real life.

                       

                      As the darned thing doesn't even have an option to kick out a working drive on user request (while it's OK and rebuilt-to), though I wish it did so I could diagnose the drive while still inside, I doubt it simply kicks stuff out. Maybe some ignoring went berserk in W7, this is possible, even likely. Unprovoked, however, I doubt it.

                       

                      Oh, and, I've had bad experiences with non-enterprise drives. Internal recovery algos delay drive response, something as little as thermal recalibration could get it spit out of the RAID. There's a reason RE(WD) and NS(Seagate) are twice the price. Well, were, they are getting cheaper now.

                       

                      The only thing I'm sure about is this (I'll say it again). All drives my controller spit out either were bad or went bad soon after. By soon I mean months. They usually fail faster and faster until I pull them out.

                      • 8. Re: Why is Matrix Manager marking a drive as bad that isn't bad?
                        UglyPercy

                        I don't know anything for sure either, but for the record both IMSM 8.8 and 8.9.0.1015 had the hourly/daily failure marking issue under Windows 7 RC1.  I bet there's a reason 8.9.0 has not been officially released yet, and that this is part of it.  No official IMSM release has yet been declared as being Windows 7 compatible, and I'm sure it isn't just laziness on Intel's part.

                        • 9. Re: Why is Matrix Manager marking a drive as bad that isn't bad?
                          theshowmecanuck

                          I think I might have a solution, but if so, it is kind of a crappy one.

                          On the two Arrays, I set "Hard Drive Data Cache Enabled" to 'No'

                          I already had "Enable Volume Write-Back Cache" set to 'No' for the volumes. Ever since then, I've not had a problem. I don't like it though, as it removes some of the performance perks of RAID. Then again, since this doesn't come with battery backup etc., it is rather sensible. It has only been about 5 days without issue, so I won't say it is 'fixed', but considering this happened almost daily, it is looking pretty good.


                          I have a 3ware card coming that has its own RISC processor on board as well as 128 MB (expandable) of DDR2 RAM AND battery backup. ;-) The only problem is that now I have to buy one for my other tower PC. I can see it is feeling jealous.

                          • 10. Re: Why is Matrix Manager marking a drive as bad that isn't bad?
                            Codeplug

                            Your issue sounds the same as what's described in this thread: http://communities.intel.com/thread/5036

                             

                            Did 8.9 remain stable with hd-cache turned off?

                             

                            gg