    core instabilities due to voltage changes




      We are having trouble with the power management functions of the RCCE library. At first we had access to only one Intel SCC machine at our local institute and assumed that the hardware might be malfunctioning, but now we have access to more chips (including marc020 and marc040) and the problems still occur. I already filed a bug report describing the problems, which I quote here:



      I wrote an RCCE test program that uses the power management features of the RCCE library. It basically runs an algorithm (qsort), measures the time it took to complete, increases the frequency divider (up to a maximum of 16) and repeats. While only RCCE_set_frequency_divider was used, the cores stayed stable. But when RCCE_iset_power and RCCE_wait_power were used, the program hung and I had to interrupt it with ^C. The program was executed with the following parameters:

      rccerun -nue 1 -f ./rcc.hosts setfDiv 4

      Here 4 is the maximum frequency divider used, meaning that the program finishes after it has set the frequency divider to 2, 3 and finally 4.

      The source and binary can be found under /shared/jochenzimmermann/setfDiv

      First, all cores in the power domain of core 0 were down, and sccBoot -l did not bring them up. After that I ran sccBmc -i and sccBoot -l, but now not a single core can be booted anymore. Using sccPowercycle may solve this problem, but executing the binary afterwards will probably put the SCC into an unstable state once again. Maybe there is something wrong with my test program, but it is very basic.
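
      For readers without the attachment, the loop described in the report can be sketched roughly as follows. This is not the attached program. It is a minimal, self-contained C sketch that runs off-chip: the divider change goes through a hypothetical hook (set_divider_fn), and stub_set_divider is a stand-in that only records the requested value. On the SCC that hook would wrap RCCE_set_frequency_divider, or RCCE_iset_power followed by RCCE_wait_power; those calls are deliberately left out here rather than guessed at. The names time_qsort and sweep_dividers, the array size, and starting the sweep at divider 2 are assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical hook: request a frequency divider, return 0 on success.
   On the SCC this would wrap the RCCE power management calls. */
typedef int (*set_divider_fn)(int fdiv);

/* Three-way integer comparison for qsort (avoids overflow of a - b). */
int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sort n pseudo-random ints with qsort and return the CPU time in seconds,
   or -1.0 if the buffer could not be allocated. */
double time_qsort(size_t n) {
    int *data = malloc(n * sizeof *data);
    if (!data) return -1.0;
    srand(42);                       /* fixed seed: same input every run */
    for (size_t i = 0; i < n; i++) data[i] = rand();
    clock_t t0 = clock();
    qsort(data, n, sizeof *data, cmp_int);
    clock_t t1 = clock();
    free(data);
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

/* Step the divider through 2..max_fdiv, timing one qsort run after each
   change. out_secs needs room for max_fdiv - 1 entries. Returns the number
   of steps completed; stops early if the hook reports a failure. */
int sweep_dividers(int max_fdiv, size_t n, set_divider_fn set_divider,
                   double *out_secs) {
    assert(max_fdiv >= 2 && max_fdiv <= 16);
    int steps = 0;
    for (int fdiv = 2; fdiv <= max_fdiv; fdiv++) {
        if (set_divider(fdiv) != 0) break;
        out_secs[steps++] = time_qsort(n);
    }
    return steps;
}

/* Off-chip stand-in for the hook: only remembers the last request. */
int last_fdiv = 0;
int stub_set_divider(int fdiv) { last_fdiv = fdiv; return 0; }
```

      Calling sweep_dividers(4, 100000, stub_set_divider, secs) with a double secs[15] buffer mirrors the "setfDiv 4" invocation above: it requests dividers 2, 3 and 4 and then returns. Keeping the divider change behind a hook makes it easy to swap the frequency-only variant for the voltage-changing variant without touching the timing loop.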




      For those of you who have already used the power management functions of the RCCE library: did you have similar problems?

      I also tried several variants of my test program (e.g. not looping through the frequency dividers), but that did not help. I have attached the source code in case you want to take a closer look at it or even try running it yourself. Any information on this would really help us.


      Thanks in advance


          Short update: we got an sccKit upgrade (from 1.4.0) on marc040. The program now runs perfectly fine on marc040, although on some other machines with the same sccKit version it still causes the problems mentioned above. I think sccKit >= 1.4.2 is necessary but not sufficient for the RCCE power management functions to work. Sadly, I don't know why the program fails on some machines and works on others; hardware issues seem the most likely cause.