9 Replies. Latest reply: Apr 12, 2012 9:11 AM by mwaughex

Stability problems at 800 MHz tile clock

philippg Community Member

Has anybody encountered stability problems when running the SCC with a tile clock of 800 MHz? I've been using the sccBmc initialization presets 0 (Tile533_Mesh800_DDR800) and 4 (Tile800_Mesh800_DDR800). With the latter I keep getting freezes or timeouts when running programs (e.g. NPB's LU or BT benchmark), and sometimes it is even impossible to boot Linux. This leads me to assume that at least our SCC might not work reliably with a tile/core clock of 800 MHz. At 533 MHz everything works like a charm. Also, the preset training sometimes fails and sccBmc switches to the "more sophisticated training option"; I don't know whether this is important.

 

Has anybody else encountered similar problems? I guess/hope/expect that sccBmc sets the tile/core voltages accordingly when switching the operating frequency, so that operation at 800 MHz is actually stable, but can anybody confirm this?


  • 1. Re: Stability problems at 800 MHz tile clock
    michael.riepen Community Member

    We have a sighting with regard to the MPB at 800 MHz and we're currently looking into it. Do these workloads make use of the message passing buffer? I think for the time being it would be best to stick with 533 MHz for MPB-related applications. As soon as we know more, we'll keep you updated!

     

    The preset training usually fails when the system has been trained before (as the presets are applied on top of the trained settings). However, the following fast training should then be able to train the system properly. So, this is expected behaviour in case of a re-training...

     

    "sccBmc -i" does not modify the voltages. It only changes the frequencies via JTAG and trains the system interface as well as the memory controllers according to the new frequency settings.

  • 2. Re: Stability problems at 800 MHz tile clock
    philippg Community Member

    The MPB is probably the source of the error; we only ran the unmodified BT benchmark supplied with RCCE 1.0.7. However, I have now encountered a different stability problem:

     

    I wrote a small test program to set the frequency and voltage of the tiles (i.e. with RCCE_iset_power()) and was able to set them to 400 MHz (frequency divider 4) at 0.6 volts (voltage id 0) - which, according to the documentation in the Programmer's Guide, is an invalid combination:

     

    UE 0, Core ID 0; size of V dom 0 is 1, F dom 0 is 1
    UE 0 trying to write VID word 10470 (level 0, V 0.700000) to address b7f5e000
    UE 0 writes VID word again
    UE 0 at voltage level 0, frequency divider 4
    Finished, new frequency and voltage: 400.000000 MHz, 0.600000 V

     

    The last line is output from my own program; the MHz and V values are calculated according to the Programmer's Guide (V = vid * 0.1 + 0.6). However, I'm wondering whether voltage level 0 is really 0.6 V or actually 0.7 V, as indicated by the RCCE library function output.

     

    After that, additional attempts to set the frequency and voltage via RCCE_iset_power() fail and I am unable to use the SCC since even training with sccBmc -i with Tile533_Mesh800_DDR800 fails with the following error:


    INFO: =====================================================
    INFO: Starting system initialization (with setting Tile533_Mesh800_DDR800)...
    INFO: processRCCFile(): Configuring SCC with content of file "/opt/sccKit/1.2.3/settings/Tile533_Mesh800_DDR800_preset.rlb" (via BMC server 10.3.16.127:5010).
    INFO: Trying to train in preset mode:
    INFO: Resetting Rocky Lake SCC device: Done.
    INFO: Resetting Rocky Lake FPGA: Done.
    INFO: Re-Initializing PCIe driver...
    ERROR: Timeout while waiting for Read request answer (CMD=0x7) with TID 0! Cancelling request...
    ERROR: Failed to trim SIF! Please try to reload the driver or powercycle SCC board in case of repeated failure!
    INFO: System initialization done.
    INFO: =====================================================

     

    I tried power cycling the board (which remote users are apparently also permitted to do via telnet on port 5010, since the interface offers it) and resetting both the SCC and the FPGA, without any change. An attempt to boot Linux also results in the following warnings:

     

    WARNING: Received unexpected IO-Packet:
    INFO: Unexpected IO packet 024 from RC to HOST -> transferPacket(0x00, CORE0, 0x0_00000528, NCIOWR, 0x10, 0x0000000000000000_0000000000000000_0000000000000000_0000009400000094);
    WARNING: Received unexpected IO-Packet:
    INFO: Unexpected IO packet 116 from RC to HOST -> transferPacket(0x10, CORE0, 0x0_00000528, NCIORD, 0x01, 0x0000000000000000_0000000000000000_0000000000000000_0000000000a8e2fa);
    WARNING: Received unexpected IO-Packet:
    INFO: Unexpected IO packet 007 from RC to HOST -> transferPacket(0x00, CORE1, 0x0_00000528, NCIORD, 0x10, 0x0000000000000000_0000000000000000_0000000000000000_0000000000a8e2ff);
    WARNING: Received unexpected IO-Packet:
    INFO: Unexpected IO packet 000 from RC to HOST -> transferPacket(0x01, CORE1, 0x0_00000528, NCIORD, 0x30, 0x0000000000000000_0000000000000000_0000000000000000_0000000000a8e2ff);
    WARNING: Received unexpected IO-Packet:
    INFO: Unexpected IO packet 090 from RC to HOST -> transferPacket(0x10, CORE1, 0x0_000052e0, NCIORD, 0x01, 0x0000000000000000_0000000000000000_0000000000000000_0000000000a8e2ff);

     

    Linux seems to boot successfully according to sccBoot -s; however, I am unable to ssh to the cores or run programs with rccerun. I will create a new administrative task entry in the Bugzilla system to have someone take a look at it (maybe reloading the driver is sufficient, or the MCPC needs to be rebooted?); this post is just to document the occurrence.

  • 3. Re: Stability problems at 800 MHz tile clock
    tedk Community Member

    Did you make your test program available anywhere? I'd like to try it out. There are two documentation issues you bring up: 1) you are able to set the board to what should be an invalid voltage/frequency setting, and 2) there's some concern about whether voltage level 0 is 0.6 V or 0.7 V.

     

    You say you are using RCCE 1.0.7. RCCE is now at 1.0.8, but I think the update is mainly man pages and a slight code reorganization to allow compiling with gcc. I don't think using 1.0.8 addresses your issue, but it is worth updating.

     

    I think I'd like to try out your code on something other than marc007, such as one of our internal systems.

  • 4. Re: Stability problems at 800 MHz tile clock
    philippg Community Member

    Since the file permissions are set the way they are, you should be able to access the binary at marc007.scc-dc.com:/shared/philippg/bt.B.36

     

    But it is really just the icc-8.1-compiled version of the NPB BT benchmark supplied with your RCCE 1.0.7. I did not modify anything, which is exactly why I'm puzzled that the SCC seems to crash after a few runs (the compiler in use is also the one supplied). I also considered thermal issues for a moment, but this is probably not the cause. Furthermore, it's "funny" that the SCC crashes in such a way that a power cycle is not sufficient to solve the problem.

     

    Regarding the power issue (which is apparently independent of my SCC problems, since the SCC also crashes without using the power management functions):

     

    As can be seen in the output, RCCE_iset_power() or one of its sub-functions prints out "UE 0 trying to write VID word 10470 (level 0, V 0.700000) to address b7f5e000". This seems to contradict the documentation of the SCC, which clearly states voltage level 0 to represent 0.6 volts.

     

    I was also able to set the SCC to a frequency of 400 MHz @ 0.6 volts, which should not be possible according to the documentation, since 400 MHz should require a higher voltage level. My best guess at this point is that voltage level 0 actually *is* 0.7 volts, so the system works as it should and it's just a documentation error?

     

     

    Regarding the version of RCCE: I use 1.0.7 because 1.0.8 does not compile for me on marc007. I tried compiling it the same way I did with 1.0.7 (with SCC being the target, of course) using the supplied compilers/software, and it results in

     

    exec: 36: -a: not found
    make: *** [RCCE_admin.o] Error 2

     

    Any hints? But again, as you said, it probably has no effect regarding the crash/freeze issue.

  • 5. Re: Stability problems at 800 MHz tile clock
    tedk Community Member

    In common/symbols.in, if you specify icc as the compiler, you won't see the -a error.

     

    I don't know yet what is causing that and will find out. But I did a ./configure SCC and then a ./makeall and was able to build the RCCE library in my home directory on marc007 (/home/tekubasx).

     

    You should not need icpc; RCCE is strictly C, not C++. But thanks for pointing this out. I don't see any reason why it should not also build with icpc, and I'll look into it.

  • 6. Re: Stability problems at 800 MHz tile clock
    philippg Community Member

    Ah, I did not notice/try that before. Changing the entry in symbols(.in) from icpc to icc works, thanks! At least that's one issue resolved.

  • 7. Re: Stability problems at 800 MHz tile clock
    tedk Community Member

    On marc007, icpc should work and compile RCCE 1.0.8 now. There was a blank line before the shebang in /opt/icc-8.1.038/bin/icpc. This appears to have been unique to marc007.

  • 8. Re: Stability problems at 800 MHz tile clock
    philippg Community Member

    A note regarding the voltage documentation issue: This is an excerpt taken from RCCE_V1.0.8/src/RCCE_power_management.c

     

    // the following array contains triples of voltage/VID value/max_frequency
    triple RC_V_MHz_cap[] = {
    /* 0 */ {0.7, 0x70, 460},
    /* 1 */ {0.8, 0x80, 598},
    /* 2 */ {0.9, 0x90, 644},
    /* 3 */ {1.0, 0xA0, 748},
    /* 4 */ {1.1, 0xB0, 875},
    /* 5 */ {1.2, 0xC0, 1024},
    /* 6 */ {1.3, 0xD0, 1198}
    };

     

    However, in SCCProgrammersGuide.pdf v0.63, Table 9 on page 37, level 0 is listed as 0.6 volts with a maximum frequency of 327 MHz. Hence my confusion.

  • 9. Re: Stability problems at 800 MHz tile clock
    mwaughex Community Member

    SCC Programmer's Guide Revision 1.0 corrects the documentation error.

    http://communities.intel.com/docs/DOC-5684
