8 Replies Latest reply on Aug 31, 2017 8:36 PM by Intel Corporation

    2 S5500BC Server Boards Both Producing DIMM B1 Uncorrectable ECC Errors

    L.D

      Hi,

       

      I'm building a Windows Server 2012 R2 system using some existing server parts that we have available. I started with one S5500BC board and added the required hardware: an LSI 9260-8i card (this card was already in a number of our old servers with the same board and worked fine), an Intel I350-T4 network card and a 2-port USB 3 card. I have made exactly the same upgrades and additions to 2 other servers based on the same S5500BC board and Server 2012 R2, and neither of those servers has any issues.

       

      With this build, the server lost power 2 nights in a row, with Windows just logging the standard 'kernel power' error, as if the plug had been pulled (but it wasn't). After the second night of this I installed the Intel ASC and it told me there was an uncorrectable ECC issue in DIMM slot B1. I powered down the server and moved the RAM about, and it still reported the same error in the same slot. I did this a couple of times to be sure and it always reported DIMM slot B1 as the problem, regardless of which stick was in it. I took this to mean that the motherboard itself was faulty, so I replaced the whole board with another S5500BC that we had spare. This is a working board that (like the first one) has never given any reason to believe there are any problems with it.

       

      After putting the second board in, I ran the ASC again and the same B1 error was present. I thought this might be the ASC calling up the old logs, so I uninstalled and reinstalled it, and the memory error was gone. I checked it after the server had been running for around 12 hours and it remained the same. However, a couple of nights ago the server again 'lost power' or just shut down ungracefully. I checked the ASC again, and the same B1 error is back! On a completely different board.

       

      What could be going on here? I quite urgently need to get this server stable. I have tried to update the firmware using the S5500BC_BIOS63_BMC61_FRUSDR22_ME112 package. I ran the Windows version of the One Boot Flash Update utility (v9.7 build 21), pointing it at the extracted location with the following command:

       

      flashupdt -u C:\TempPath

       

      But I get the following:

       

      Update file configuration: XXX S5500BC,1.0

      *ERROR* BMC responded with incompatible values

       

      Could anyone please help? The only thing that is different about this server compared to the other 2 that are stable is that this one has 8 drives instead of 6. But it had 8 drives during its previous installation too, and there were no issues then.

       

      Thanks.

        • 1. Re: 2 S5500BC Server Boards Both Producing DIMM B1 Uncorrectable ECC Errors
          Intel Corporation
          This message was posted on behalf of Intel Corporation

          Hello L D,

           

          To start, let me mention a couple of things before we begin the troubleshooting process.

           

          First, the S5500BC board reached end of life back in 2013, so my support here will be best effort, since we no longer even have this board in our labs for testing or troubleshooting purposes. Second, Windows Server 2012 R2 was never tested or validated on this board, so from that standpoint you might see software instability.

           

          That said, the power loss you describe looks more like an actual power subsystem issue than a board-related one. It could be the power distribution board or one of the power supply units, if there is more than one. In fact, since this server has 8 HDDs and the others have only 6, the available power may be insufficient when all the HDDs are working at full load, which would cause the server to shut down. Another possible cause of unexpected shutdowns is processor overheating.

           

          As for the ECC error, it is difficult to tell, but the issue could be following the RAM stick.

           

          As for the BIOS not updating, it's possible that the version you are trying to install is too far ahead of the one currently installed. You could try updating to the oldest version available on the website first, and then updating to the latest one.

           

          For more information, see the Technical Product Specification.

           

          Hope this helps a bit.

           

          Jose H.

          • 2. Re: 2 S5500BC Server Boards Both Producing DIMM B1 Uncorrectable ECC Errors
            L.D

            Hi Jose,

             

            Thanks for taking the time to reply to me.

             

            I am aware of the EOL status of the S5500BC board, and of the fact that it's not really validated for Server 2012 R2, even though there are a couple of driver downloads available in the drivers section (which is why I thought I'd go for 2012 over 2016). These things certainly do create difficulties here, especially when this server is meant to be an important production machine (budget is very tight).

             

            In terms of this being merely a power issue: I should have said that this server board and chassis started off life with a single non-redundant PSU. When I was rebuilding and upgrading it for this project, I swapped that PSU for a spare redundant dual PSU that was known to be good. When the shutdowns happened on the first 2 nights, I assumed they were caused by this PSU change, so I went back to the old single PSU, but the problem still happened repeatedly. For this reason I doubt this is related directly to the PSU(s) that I have tested with so far. As for a power distribution board or power daughter board, this chassis does not contain one, so that cannot be the cause.

             

            As for the storage controller (and PSU) having 8 disks attached, I am again almost certain that this cannot be the cause of the problem, because this board and chassis had 8 disks connected in the past without issue. I also have this same board model running at another site with 8 disks attached and none of these problems.

             

            As for CPU overheating, almost certainly not. As I said above, this same issue is present on 2 S5500BC boards! When I changed to the second board, I used 2 spare CPUs (the same model as the original 2) and fresh thermal paste. Nowhere (BIOS or ASC) are any temperatures being reported as anywhere near problem levels.

             

            As for the RAM, yes, this is one of the only options that still kind of makes sense, yet it also doesn't, because I have moved the RAM around to all the different slots and it always reports the problem in slot B1. **UPDATE**: when I came in and checked this morning, it was actually reporting both slot B1 AND slot B2. It has never done this before in the week or so that this problem has been present; it always said only B1. Also, this RAM just came out of a working server where there were never any issues.

             

            I did what you suggested and updated to the oldest available firmware via the EFI shell. It then allowed me to step through a succession of newer firmware versions, and I am now on the latest for BIOS (68), BMC (61), ME (v1.22) and FRUSDR (R22), so there should be nothing else firmware-wise that I can update on this board now. Thanks for your help with this.

             

            So if this continues to happen tonight, which I assume it will as nothing has helped so far, what would you suggest I do? Should I swap all the RAM with another server and see whether the problem follows the RAM to that server? By the way, this board is populated with the maximum 32GB of RAM. Aside from moving or trying different RAM (which I will have to source first), I have no idea what to try next.

             

            Thanks.

            • 3. Re: 2 S5500BC Server Boards Both Producing DIMM B1 Uncorrectable ECC Errors
              Intel Corporation
              This message was posted on behalf of Intel Corporation

              Hi L D,

              After doing a bit of research on this board, we found that the Xeon 5500 series processors integrate the memory controller (IMC) within the processor. Since the RAM issue keeps showing up even after swapping the board, the common device would be the processors, so the ECC errors could be generated by the IMC, meaning the processor itself could be the issue.

              By any chance, is it possible for you to swap the processors, try different ones, or test one processor at a time?

              Please let me know.

              Jose H.

              • 4. Re: 2 S5500BC Server Boards Both Producing DIMM B1 Uncorrectable ECC Errors
                L.D

                Hi Jose,

                 

                Thanks for spending so much time looking into this. As I mentioned in the 4th paragraph of my last post, when I changed to the second S5500BC board I also fitted 2 different E5620 CPUs (both identical). These are the same model of CPU as were in the first board when I first came across this error. The issue has occurred a number of times since the different board and the 2 different CPUs were installed.

                 

                For this reason, I highly doubt that the CPUs are the cause of the issue.
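
                The elimination reasoning in this post can be written down as a small comparison: list each configuration that failed and keep only the components that were identical in all of them. The sketch below is illustrative only; the function name `unchanged_components` and the labels in `failing_runs` are made up for this example and simply restate what the thread describes (two boards, two CPU pairs, the same RAM throughout).

```python
def unchanged_components(runs):
    """Return the component names whose value is identical in every run."""
    first, *rest = runs
    return sorted(
        name
        for name, value in first.items()
        if all(run.get(name) == value for run in rest)
    )


# Hypothetical labels restating the failing setups described in this thread.
failing_runs = [
    {"board": "S5500BC #1", "cpus": "E5620 pair #1", "ram": "same 32GB set"},
    {"board": "S5500BC #2", "cpus": "E5620 pair #2", "ram": "same 32GB set"},
]

# Components that never changed between failures remain suspects.
suspects = unchanged_components(failing_runs)
print(suspects)
```

                Only components that are actually recorded can be ruled out this way. Anything left out of the comparison, such as firmware level, add-in cards or drives, stays a candidate as well.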

                 

                Thanks.

                • 5. Re: 2 S5500BC Server Boards Both Producing DIMM B1 Uncorrectable ECC Errors
                  Intel Corporation
                  This message was posted on behalf of Intel Corporation

                  Hello L D,

                   

                  I would like to take a look at the board's BMC system event log (SEL). If you could retrieve it and attach it, I would really appreciate it. That should shed some light on this.

                   

                  Jose H.

                  • 6. Re: 2 S5500BC Server Boards Both Producing DIMM B1 Uncorrectable ECC Errors
                    L.D

                    Hi Jose,

                     

                    After updating all the board firmware to the latest versions, the issue has not been present now for 5 or 6 nights! I almost dare to say that this issue is solved, but I'm not going to mark a correct answer until next working week, just to be sure.

                     

                    As for the SEL that you mentioned, if you still want it, can you tell me how to generate it and I'll upload it? I can't see it anywhere in ASC. Do I have to get it from the EFI shell?

                     

                    L D

                    • 7. Re: 2 S5500BC Server Boards Both Producing DIMM B1 Uncorrectable ECC Errors
                      Intel Corporation
                      This message was posted on behalf of Intel Corporation

                      Hello L D,

                      I am glad to hear that the issue has not reappeared after the BIOS update. As for the logs, let's just skip them for now. For newer machines there are tools available from the Download Center that allow you to generate the logs, but for these machines I am not entirely sure. I actually couldn't find them.

                      I will wait until next week and hopefully it won't show up anymore.

                      Regards

                      Jose H.
                       

                      • 8. Re: 2 S5500BC Server Boards Both Producing DIMM B1 Uncorrectable ECC Errors
                        Intel Corporation
                        This message was posted on behalf of Intel Corporation

                        Hello L D,

                        I will proceed to mark this thread as answered as part of our regular process. If by any chance the issue reappears, just create a new post and we will continue where we left off.

                        Regards

                        Jose H.