9 Replies Latest reply on Feb 27, 2011 6:10 PM by tedk

    Linux boot error: Timeout while waiting for Read request answer

    Ehsan

      Hi,

       

      When trying to boot Linux on the cores, I encounter an error:

       

      ERROR: Timeout while waiting for Read request answer (CMD=0x7) with TID 99! Cancelling request...

       

      I've tried re-initialization and power cycling several times. Sometimes I see an "Unexpected packet" error instead. How can I get around this? Could it be a hardware issue?

       

      Thanks,

      -Ehsan

        • 1. Re: Linux boot error: Timeout while waiting for Read request answer
          Nil

          Hi,

           

          Have a look at this thread

           

          http://communities.intel.com/message/103940

           

          As stated there, you might have to file a bug under the category "Marc Administration Needed" in Bugzilla:

          http://marcbug.scc-dc.com/bugzilla3/

          • 2. Re: Linux boot error: Timeout while waiting for Read request answer
            Ehsan

            I've already read that thread, but my issue is different: I cannot boot Linux, yet training and re-initialization complete without problems.

             

            -Ehsan

            • 3. Re: Linux boot error: Timeout while waiting for Read request answer
              tedk

              Is this a Data Center machine or your own local system? If DC, what is its name?

              Have you successfully booted Linux before?

              Have you tried booting on a subset of cores?

              Have you captured the output of training? Does it show any errors?

               

              Usually the issue you brought up is fixed by a complete power cycle, being careful to take power down and bring it up in the proper order. If this is a Data Center machine we can do that for you.

              • 4. Re: Linux boot error: Timeout while waiting for Read request answer
                Ehsan

                It's our local machine at UIUC.

                Yes, I had used the machine before. The problem started after a couple of runs of the FV example program in the rcce apps/STENCIL folder.

                I've tried booting on a subset of cores, and sometimes it works for some of them.

                The output of initialization doesn't show any errors.

                We tried a complete power cycle before; it helped only once, and then the problem came back.

                Could this problem be caused by the example program?

                • 5. Re: Linux boot error: Timeout while waiting for Read request answer
                  tedk

                  I've never run FV. Let me run it on one of our internal machines and then we can compare results. If FV puts the machine into a bad state, I can always get it back up (because it's local).  Power management is on the leading edge, and there are still things about it we need to learn. It'll probably be the evening or next morning before I can have results for you.

                  • 6. Re: Linux boot error: Timeout while waiting for Read request answer
                    Nil

                    For me, re-initializing the system solved that problem (though not always). I encountered this problem with FV as well.

                    • 7. Re: Linux boot error: Timeout while waiting for Read request answer
                      tedk

                      I built and ran FV on one of our local systems. It seemed to run fine. But it does have a side effect that I found surprising.

                       

                      First I built RCCE with power management enabled, then built FV. I ran FV on just two cores, 0 and 1, with an argument of 3. That sets the frequency divider to 3, which means the frequency is 533 MHz. This was also my default; it was the frequency I chose when I trained.

                       

                      Now before running FV, I did a sccBmc -c "status" to look at the voltage settings. I got the following. (The lines below are actually the output after I returned VCC4 to its pre-FV value.)

                       

                      Tertiary supplies:
                         OPVR VCC0: 1.0900 V
                         OPVR VCC1: 1.0895 V
                         OPVR VCC2: 1.0914 V
                         OPVR VCC3: 1.0893 V
                         OPVR VCC4: 1.0904 V
                         OPVR VCC5: 1.0913 V
                         OPVR VCC7: 1.0902 V

                       

                      If you look inside the RCCE code, you'll see that when you run FV, RCCE searches an array called RC_V_MHz_cap[]. This is an array of structures, and the third field of each structure is the maximum frequency allowed at a particular voltage.

                       

                      triple RC_V_MHz_cap[] = {
                      /* 0 */ {0.7, 0x70, 460},
                      /* 1 */ {0.8, 0x80, 598},
                      /* 2 */ {0.9, 0x90, 644},
                      /* 3 */ {1.0, 0xA0, 748},
                      /* 4 */ {1.1, 0xB0, 875},
                      /* 5 */ {1.2, 0xC0, 1024},
                      /* 6 */ {1.3, 0xD0, 1198}
                      };

                       

                      With FV 3, RCCE goes through the array and chooses the first element whose frequency is larger than RC_GLOBAL_CLOCK/Fdiv = 533 MHz, which is the second element (index 1, 598 MHz). The voltage for this element is 0.8 V.

                       

                      After running FV 3, I did another sccBmc -c "status" and got the following. Note that VCC4 is now 0.8431 V: I chose the same frequency I already had but ended up with a different voltage than I started with.

                       

                      Tertiary supplies:
                        OPVR VCC0: 1.0954 V
                        OPVR VCC1: 1.0949 V
                        OPVR VCC2: 1.0921 V
                        OPVR VCC3: 1.0946 V
                        OPVR VCC4: 0.8431 V
                        OPVR VCC5: 1.0904 V
                        OPVR VCC7: 1.0885 V

                       

                      Also note that this island is VCC4 for the status command but power domain 0 for RCCE. The status command numbers the voltage islands as

                      123

                      456

                      RCCE numbers them as

                      345

                      012

                       

                      The side effect is that there does not seem to be an easy way of returning VCC4 to 1.09 V. I tried rebooting Linux, then retraining, then running sccPowercycle; none of these changed VCC4. What got VCC4 back to 1.09 V was removing crbif, telnetting to the BMC, issuing a power off, issuing a power on, exiting the BMC, and then rebooting the MCPC with the reboot command.

                       

                      So if you run FV and experiment, you end up putting the SCC into a state that is not its normal boot-up state. I don't know what the consequences of that are. What do you think?

                      • 8. Re: Linux boot error: Timeout while waiting for Read request answer
                        Ehsan

                        Thanks, our system is working again.

                        Is there any way to use FV or change the voltage safely? We will need to change the voltage frequently for our power management research.

                        • 9. Re: Linux boot error: Timeout while waiting for Read request answer
                          tedk

                          RCCE tries to optimize: when you set a frequency, it chooses a voltage for which that frequency is considered the maximum. Lowering the voltage further might actually harm the chip, and raising it would consume extra power. Exactly which value FV chooses is controlled by the values in the array RC_V_MHz_cap[], which is initialized in RCCE_power_management.c. You could define a different initialization.

                           

                          Or you could skip RCCE entirely and set the voltage yourself. The difficulty there is that there is no reasonable way for a program to read the voltage. So if you change the frequency and the voltage changes as a result, there is currently no way of getting back.

                           

                          I'm nearly positive that if you use the RCCE API, you will not enter an unsafe range.

                          I don't know why you got the boot error. When I see such an error, I can usually get my system back by retraining and rebooting. Sometimes I must actually power cycle.