18 Replies. Latest reply on Jun 14, 2012 7:26 AM by drodo

    RCCE_iset_power function causes cores to be unstable

    wbrozas

      I have been trying to use RCCE_iset_power to set the frequency and voltage accordingly, but after the call to the function I am no longer able to use the cores. If I set the frequency divider to 4 and the voltage level is then set to 0, is it possible that level 0 is too low a voltage to be stable at that frequency on my SCC?

        • 1. Re: RCCE_iset_power function causes cores to be unstable
          tedk

          Please also take a look at Bug 360, http://marcbug.scc-dc.com/bugzilla3/show_bug.cgi?id=360

          I ran FV on my marc101 with a divider of 4.

           

          What do you mean by "voltage set to 0?"

           

          tekubasx@marc101:/shared/tekubasx/FV$ rccerun -nue 8 -f rc.hosts FV 4
          pssh -h PSSH_HOST_FILE.4689 -t -1 -p 8 /shared/tekubasx/FV/mpb.4689 < /dev/null
          [1] 17:23:53 [SUCCESS] rck00
          [2] 17:23:53 [SUCCESS] rck01
          [3] 17:23:53 [SUCCESS] rck03
          [4] 17:23:53 [SUCCESS] rck02
          [5] 17:23:53 [SUCCESS] rck05
          [6] 17:23:53 [SUCCESS] rck04
          [7] 17:23:53 [SUCCESS] rck06
          [8] 17:23:53 [SUCCESS] rck07
          pssh -h PSSH_HOST_FILE.4689 -t -1 -P -p 8 /shared/tekubasx/FV/FV 8 0.533 00 01 02 03 04 05 06 07 4 < /dev/null
          rck01: UE 1, Core ID 1; size of V dom 0 is 4, size of F dom 0 is 2
          rck00: UE 0, Core ID 0; size of V dom 0 is 4, size of F dom 0 is 2
          rck03: UE 3, Core ID 3; size of V dom 0 is 4, size of F dom 1 is 2
          rck05: UE 5, Core ID 5; size of V dom 1 is 4, size of F dom 2 is 2
          rck06: UE 6, Core ID 6; size of V dom 1 is 4, size of F dom 3 is 2
          rck07: UE 7, Core ID 7; size of V dom 1 is 4, size of F dom 3 is 2
          rck00: RC_V_MHz_cap[Vlevel].MHz_cap MHz: 460 400
          RC_V_MHz_cap[Vlevel].MHz_cap MHz: 460 400
          outVlevel: 0
          Requested fdiv: 4, actual fdiv, vlevel: 4, 0
          rck02: UE 2, Core ID 2; size of V dom 0 is 4, size of F dom 1 is 2
          rck04: UE 4, Core ID 4; size of V dom 1 is 4, size of F dom 2 is 2
          rck04: RC_V_MHz_cap[Vlevel].MHz_cap MHz: 460 400
          RC_V_MHz_cap[Vlevel].MHz_cap MHz: 460 400
          outVlevel: 0
          Requested fdiv: 4, actual fdiv, vlevel: 4, 0 <== Note the tiles that have a divider of 4 now. VCC4 and VCC5.
          rck00: Clock divider for tile 0 is 4
          Clock divider for tile 1 is 4
          Clock divider for tile 2 is 4
          Clock divider for tile 3 is 4
          Clock divider for tile 4 is 3
          Clock divider for tile 5 is 3
          Clock divider for tile 6 is 4
          Clock divider for tile 7 is 4
          Clock divider for tile 8 is 4
          Clock divider for tile 9 is 4
          Clock divider for tile 10 is 3
          Clock divider for tile 11 is 3
          Clock divider for tile 12 is 3
          Clock divider for tile 13 is 3
          Clock divider for tile 14 is 3
          Clock divider for tile 15 is 3
          Clock divider for tile 16 is 3
          Clock divider for tile 17 is 3
          Clock divider for tile 18 is 3
          Clock divider for tile 19 is 3
          Clock divider for tile 20 is 3
          Clock divider for tile 21 is 3
          Clock divider for tile 22 is 3
          Clock divider for tile 23 is 3
          [1] 17:23:54 [SUCCESS] rck01
          [2] 17:23:54 [SUCCESS] rck03
          [3] 17:23:54 [SUCCESS] rck00
          [4] 17:23:54 [SUCCESS] rck02
          [5] 17:23:54 [SUCCESS] rck04
          [6] 17:23:54 [SUCCESS] rck05
          [7] 17:23:54 [SUCCESS] rck06
          [8] 17:23:54 [SUCCESS] rck07

           

          All cores are up.
          tekubasx@marc101:/shared/tekubasx/FV$ sccBoot -s
          INFO: Welcome to sccBoot 1.4.1 (build date Jul  4 2011 - 16:14:13)...
          Status: The following cores can be reached with ping (booted): All cores!

           

          Voltage has dropped.
          tekubasx@marc101:/shared/tekubasx/FV$ sccBmc -c status |grep OPV
            OPVR VCC0: 1.0922 V
            OPVR VCC1: 1.0910 V
            OPVR VCC2: 1.0836 V
            OPVR VCC3: 1.0910 V
            OPVR VCC4: 0.7435 V <==
            OPVR VCC5: 0.7339 V
            OPVR VCC7: 1.0746 V
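
Checking the rails by eye gets tedious when the status command is run repeatedly. The following is a minimal sketch, assuming the `OPVR VCCn: x.xxxx V` line format shown in the transcript above; the function names `parse_opvr` and `rails_below` are made up for this illustration and are not part of sccKit.

```python
import re

# Matches lines such as "OPVR VCC4: 0.7435 V <==" as they appear in the
# "sccBmc -c status" transcript quoted in this thread.
OPVR_LINE = re.compile(r"OPVR\s+(VCC\d+):\s*([0-9]+\.[0-9]+)\s*V")


def parse_opvr(status_text):
    """Return {rail_name: volts} for every OPVR line found in the text."""
    return {m.group(1): float(m.group(2)) for m in OPVR_LINE.finditer(status_text)}


def rails_below(status_text, threshold_v):
    """Return the sorted names of rails whose reading is below threshold_v."""
    readings = parse_opvr(status_text)
    return sorted(name for name, volts in readings.items() if volts < threshold_v)
```

Fed the seven OPVR lines above with a threshold of 0.8, `rails_below` returns `['VCC4', 'VCC5']`, the two rails that dropped.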

          • 2. Re: RCCE_iset_power function causes cores to be unstable
            tedk

            Oh, I see what you mean by a Vlevel of 0. This is the first entry in RC_V_MHz_cap[], which should drop voltage to around 0.7. And it does in my example.

             

            Are you dropping voltage for all 48 cores? The example I tried was 8 cores resulting in 2 power domains for a total of 16 cores because 4 of the cores are in one power domain and 4 in another.
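
To make the relation between the requested divider and the resulting voltage level concrete, here is a small Python model of a table lookup of this kind. It is only a sketch under stated assumptions: the 1600 MHz reference clock is inferred from the numbers in this thread (divider 3 gives 533 MHz, divider 4 gives 400 MHz); only the level 0 row (about 0.7 V, 460 MHz cap) comes from the log above; the remaining rows are illustrative placeholders, and RCCE's real selection logic may differ.

```python
GLOBAL_CLOCK_MHZ = 1600.0  # assumption: consistent with /3 -> 533 MHz and /4 -> 400 MHz

# (nominal volts, MHz cap) per voltage level.
# Only level 0 is taken from the log in this thread ("MHz_cap MHz: 460 400");
# every other row is a placeholder, NOT the contents of RC_V_MHz_cap[].
V_MHZ_CAP = [(0.7, 460), (0.8, 600), (0.9, 700), (1.0, 800), (1.1, 900)]


def tile_mhz(fdiv):
    """Tile frequency in MHz for a given frequency divider."""
    return GLOBAL_CLOCK_MHZ / fdiv


def lowest_vlevel(fdiv, table=V_MHZ_CAP):
    """Return the index of the first level whose cap covers the tile frequency."""
    mhz = tile_mhz(fdiv)
    for level, (_volts, cap) in enumerate(table):
        if mhz <= cap:
            return level
    raise ValueError(f"no voltage level in the table supports {mhz:.0f} MHz")
```

Under this model, any divider of 4 or more lands on level 0, because 400 MHz and everything slower sits below the 460 MHz cap.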

            • 3. Re: RCCE_iset_power function causes cores to be unstable
              wbrozas

              Yes, I'm dropping the voltage of all 48 cores. Even when the run reports success, some cores are no longer reachable (that is, I cannot ssh into all of the cores; try it with pssh).

              • 4. Re: RCCE_iset_power function causes cores to be unstable
                tedk

                What is your initial frequency? What I mean is ... when you initialize with sccBmc -i , what menu item do you choose?

                • 5. Re: RCCE_iset_power function causes cores to be unstable
                  tedk

                  Please look at bug 360 and bug 110

                  http://marcbug.scc-dc.com/bugzilla3/show_bug.cgi?id=360

                  http://marcbug.scc-dc.com/bugzilla3/show_bug.cgi?id=110

                   

                  I think that if you download the NewSettings.zip attached to bug 110 and install it on your MCPC, it will correct your problem. The zip contains a settings directory that replaces /opt/sccKit/current/settings.

                   

                  At this point, I'd recommend that everyone replace their settings directory with the one in this zip. Do save the original settings just to be safe.

                  • 6. Re: RCCE_iset_power function causes cores to be unstable
                    wbrozas

                    I have sccKit 1.4.1 patch 3. I tried the settings in NewSettings.zip.

                    It seemed to work for 8 cores, but when I run it on all 48 it fails.

                     

                    [1] 14:40:48 [FAILURE] rck18 Exited with error code 255
                    [2] 14:40:48 [FAILURE] rck16 Exited with error code 255
                    [3] 14:40:48 [FAILURE] rck08 Exited with error code 139
                    [4] 14:40:49 [FAILURE] rck37 Exited with error code 139
                    [5] 14:40:59 [FAILURE] rck21 Exited with error code 139

                     

                    I started with mesh0

                    (533 800 800)

                    • 7. Re: RCCE_iset_power function causes cores to be unstable
                      wbrozas

                      I believe that when FV is run on rck16, some cores become unreachable. rck00 and rck08 seem to be fine. If I edit my hosts file to list rck16 first and then run FV on one core, the output seems to say it worked, but if I then try to pssh and echo something, not all cores respond. Once some cores become unreachable, FV on 48 cores will obviously hang because of the barrier.

                      • 8. Re: RCCE_iset_power function causes cores to be unstable
                        tedk

                        That's unfortunate. Please let me run some tests here. Did you do a complete power cycle before initializing with sccBmc -i? That new settings file worked for other users with the same issue as yours. If your chip is damaged, we'll send you a new one.

                        • 9. Re: RCCE_iset_power function causes cores to be unstable
                          wbrozas

                          I have tried a complete power cycle with no luck. rck16 returns success after the system call, but then I am unable to use a few other cores. The same happens with rck40. Also, be sure to try to ssh to all the cores. Sometimes ping works but ssh does not.
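
Because ping can succeed while ssh fails, a per-core ssh probe is a stricter check than `sccBoot -s`. The sketch below assumes the cores are named rck00 through rck47, as in the logs in this thread, and that key-based ssh from the MCPC already works; the helper names are hypothetical. `BatchMode` and `ConnectTimeout` are standard OpenSSH client options.

```python
import subprocess


def core_hostnames(num_cores=48):
    """Host names rck00..rck47, following the naming seen in the logs."""
    return [f"rck{n:02d}" for n in range(num_cores)]


def ssh_probe(host, timeout_s=5):
    """Return True if a trivial command can be run on host over ssh."""
    cmd = ["ssh", "-o", "BatchMode=yes", "-o", f"ConnectTimeout={timeout_s}",
           host, "true"]
    try:
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL,
                                timeout=timeout_s + 5)
    except (subprocess.TimeoutExpired, OSError):
        # A hung session or a missing ssh binary both count as unreachable.
        return False
    return result.returncode == 0


def unreachable(hosts, probe=ssh_probe):
    """Return the hosts for which the probe fails, in input order."""
    return [h for h in hosts if not probe(h)]
```

Calling `unreachable(core_hostnames())` after a run of FV would list the cores that no longer accept ssh sessions. The `probe` argument is injectable so the logic can be exercised without real hardware.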

                          • 10. Re: RCCE_iset_power function causes cores to be unstable
                            tedk

                            We've been looking at this issue the last few days. In summary, it does seem that dropping the voltage to 0.7 with 400MHz does put the SCC into a bad state.

                             

                            Has anyone operated in this range? It's possible that there may be some variation among the chips. I've done my study on marc101, which runs 1.4.1.3. When the SCC gets into that bad state, I can regain control by becoming root and running /opt/sccKit/current/bin/sccPowercycle. I think users in the data center have sudo access to this command.

                             

                            Here's a summary of what I did so far. I started by initializing with Tile533_Mesh800_DDR800. You can check that the tile is really running at 533 MHz by reading GCBCFG with sccDump. I issued sccDump for each of the 24 tiles. The value returned is shown in EAS:Table 4. That table doesn't show the least significant 8 bits, which are 0xf0 when SCC Linux is booted.

                                 GCBCFG   = 0x00a8e2f0 <== this is the value that indicates 533 MHz

                             

                            I then used the RCCE example Fdiv to drop the frequency to 400 MHz. Fdiv does not change the voltage. Then, GCBCFG is

                                 GCBCFG   = 0x00e0e3f0 <== this is the value that indicates 400 MHz.

                             

                            I can look at the voltages with "sccBmc -c status |grep OP"

                                OPVR VCC0: 1.0969 V

                                OPVR VCC1: 1.0958 V

                                OPVR VCC2: 1.0957 V

                                OPVR VCC3: 1.0956 V

                                OPVR VCC4: 1.0959 V

                                OPVR VCC5: 1.0965 V

                                OPVR VCC7: 1.0956 V

                            The cores are up and running, as seen with "sccBoot -s"

                            $ sccBoot -s

                            INFO: Welcome to sccBoot 1.4.1 (build date Jul  4 2011 - 16:14:13)...

                            Status: The following cores can be reached with ping (booted): All cores!

                             

                            I wrote a short C program (called setvoltage) that changes just the voltage. This is not a RCCE program. I wanted to change only the voltage and not modify the frequency. As you know, there are 6 voltage domains (also called power domains). RCCE defines a core in each power domain as the power domain master. If you are not using RCCE, any core can change the voltage in any power domain. In any case, I chose to run setvoltage on 6 cores, one in each power domain, and picked the cores that RCCE identifies as power domain masters.

                               $ pssh -h pssh.pdmhosts -t -1 -P -p 6 /shared/tekubasx/POWER/setvolt

                             

                            So, I first dropped the voltage to 0.9v.

                               $ sccBmc -c status |grep OP

                                 OPVR VCC0: 0.9227 V

                                 OPVR VCC1: 0.9155 V

                                 OPVR VCC2: 1.0792 V <== the mesh "highway" ...not changed

                                 OPVR VCC3: 0.9093 V

                                 OPVR VCC4: 0.9226 V

                                 OPVR VCC5: 0.9103 V

                                 OPVR VCC7: 0.9086 V

                             

                            "sccBoot -s" still shows all cores. So I then dropped the voltage to 0.8v.

                               $ sccBmc -c status |grep OP

                                 OPVR VCC0: 0.9228 V

                                 OPVR VCC1: 0.9154 V

                                 OPVR VCC2: 1.0792 V <== the mesh "highway" ...not changed

                                 OPVR VCC3: 0.9093 V

                                 OPVR VCC4: 0.9227 V

                                 OPVR VCC5: 0.9105 V

                                 OPVR VCC7: 0.9088 V

                             

                            "sccBoot -s" still shows all cores. So I then dropped the voltage to 0.7v.

                               $ sccBmc -c status |grep OP

                                 OPVR VCC0: 0.7279 V

                                 OPVR VCC1: 0.7192 V

                                 OPVR VCC2: 1.0609 V

                                 OPVR VCC3: 0.7175 V

                                 OPVR VCC4: 0.7297 V

                                 OPVR VCC5: 0.7185 V

                                 OPVR VCC7: 0.7197 V

                             

                            But now "sccBoot -s" does not show all cores.

                            $ sccBoot -s

                            INFO: Welcome to sccBoot 1.4.1 (build date Jul  4 2011 - 16:14:13)...

                            Status: The following cores can be reached with ping (booted): 28 cores (PIDs = 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x12, 0x13, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1b, 0x1d, 0x21 and 0x23)...
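
The status line lists the cores that still answer, so the ones that dropped out have to be worked out by subtraction. A minimal sketch, assuming the PIDs number the 48 cores from 0x00 to 0x2f and that the status line keeps the format shown above (the function names are invented for this example):

```python
import re

NUM_CORES = 48  # PIDs 0x00..0x2f, assuming one PID per core


def reachable_pids(status_line):
    """Parse the PID list out of an 'sccBoot -s' status line."""
    if "All cores" in status_line:
        return set(range(NUM_CORES))
    return {int(tok, 16) for tok in re.findall(r"0x[0-9a-fA-F]{2}", status_line)}


def missing_pids(status_line):
    """Return the sorted PIDs that are absent from the status line."""
    return sorted(set(range(NUM_CORES)) - reachable_pids(status_line))
```

For the status line quoted above, `missing_pids` yields 20 cores: 0x11, 0x14, 0x1a, 0x1c, 0x1e, 0x1f, 0x20, 0x22, and 0x24 through 0x2f.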

                             


                            At this point I must run sccPowercycle to return the system to operation. So what I'm suggesting is that 0.7 V is too low for 400 MHz. Within RCCE, you might try changing the values in RC_V_MHz_cap[], that is, changing the 0x70 to 0x80.

                             

                            I can verify that if I issue the RCCE "FV 4" I get errors and hangs. And if you change the RCCE array mentioned above, we might be able to run FV successfully. What do you think?

                            • 11. Re: RCCE_iset_power function causes cores to be unstable
                              tedk

                              I attached some files with more information. The pdf contains some screen output and a diagram showing how the cores, tiles, and domains are numbered.

                              • 12. Re: RCCE_iset_power function causes cores to be unstable
                                tedk

                                One of the real difficulties with looking at power management is the fact that you cannot read (only write) the RPC register.

                                 

                                Another thing is that if you increase the voltage before increasing the frequency and don't wait long enough (and I'm not sure what long enough is), you crash.

                                 

                                Message was edited by: Ted Kubaska typo

                                 

                                Message was edited by: Ted Kubaska another typo
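
The ordering constraint described in this post can be sketched as follows. This is an illustration only: `set_voltage` and `set_divider` are hypothetical callbacks standing in for whatever actually writes the hardware, the settle time is an unknown that has to be found experimentally (the post above says as much), and lowering the frequency before lowering the voltage is general DVFS practice rather than something stated in this thread.

```python
import time


def change_operating_point(set_voltage, set_divider, old, new, settle_s):
    """Move from old to new, where each is a (vlevel, fdiv) pair.

    A larger fdiv means a lower frequency. The voltage is kept at the
    higher of the two levels for as long as the divider is being changed.
    """
    old_v, old_div = old
    new_v, new_div = new

    if new_v > old_v:
        # Raise the voltage first and give the rail time to settle.
        set_voltage(new_v)
        time.sleep(settle_s)  # settle_s is a guess; the right value is unknown

    if new_div != old_div:
        set_divider(new_div)

    if new_v < old_v:
        # Only drop the voltage once the clock is already at its new setting.
        set_voltage(new_v)
```

Passing the two setters in as arguments keeps the ordering logic testable with plain recording functions before any real register write is attempted.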

                                • 13. Re: RCCE_iset_power function causes cores to be unstable
                                  tedk

                                  I just modified RC_V_MHz_cap[] (0.7 to 0.8, 0x70 to 0x80) and rebuilt RCCE and FV, and now I do not see the hang with FV 4 that I did before.

                                  • 14. Re: RCCE_iset_power function causes cores to be unstable
                                    wbrozas

                                    Thanks, those attachments will help. I'm just curious: if you set the divider to 16, will the voltage level still be 0? For me, it seemed that anything that set the voltage level to 0 was unstable (for some cores, not all).
