16 Replies Latest reply on Aug 27, 2010 9:32 AM by pspillai

    DMA errors


      We're having more issues with our RockyLake board in Pittsburgh.  I can bring the system up, initialize the interface with the conservative settings (533, 800, 800), and boot Linux on all of the cores.  sccKonsole also works.  However, if I try to launch a binary on all cores simultaneously, I almost always get the following error:


      ERROR: Timeout while waiting for DMA transfer to complete! Cancelling request...


      This problem does not occur if I launch the binary on individual cores, or if I run simultaneously on all after having warmed up the buffer caches.  So it has something to do with simultaneous transfers to / from the host.  What is causing this?  Is there something that might help?  How can I recover from this?  Currently, I have to reboot and reinitialize everything when this happens.  Any suggestions are appreciated.  Thanks,


      - Babu

        • 1. Re: DMA errors

          How are you launching the binary? With pssh or rccerun? Can you run one of the simple RCCE examples, like pingpong?

          Does the problem you see always occur when you try to launch a binary?

          • 2. Re: DMA errors

            Actually, I was using sccKonsole, with keyboard directed to all console tabs.  The binary is just a trivial hello world program, but it is statically linked, so it is 500K in size.  If I run the binary on each core individually, warming up the buffer cache, and then launch it in parallel, everything is fine.  On a cold system, launching in parallel almost always results in the DMA error, requiring a reboot.
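
            The warm-up step described above can also be done without executing the binary, by reading it once on each core before the parallel launch. A minimal sketch, assuming a POSIX shell is available on the cores; `warm_file` is a hypothetical helper name, not part of sccKit:

```shell
# Hypothetical helper: read a file once and discard the data, so that a
# later launch on this core can be served from the local buffer cache.
warm_file() {
    [ -r "$1" ] || return 1     # fail if the file is missing or unreadable
    cat "$1" > /dev/null        # sequential read pulls the pages into the cache
}
```

            Since each core runs its own Linux instance, this would have to be run on every core, and one core at a time (for example through pssh with a small -p value); otherwise the warm-up itself produces the same burst of simultaneous reads.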

            • 3. Re: DMA errors

              I tried out a "hello world" on an MCPC/RL system here. I don't see the problem you are experiencing. Could it be hw-related? The usage model is not really to bring up sccKonsole and execute this way (pssh is the preferred method) but I don't see why it should not work fine in this case. Can you try RCCE with the pingpong example?


              The test is  just ..

              #include <stdio.h>

              int main(void) { printf("hello SCC\n"); return 0; }

              I compile it as

              icc -static -DCOPPERRIDGE -mcpu=pentium -gcc-version=340 myhello.c

              (the -DCOPPERRIDGE is not really needed for something this simple, but it doesn't hurt) and get an a.out that I copy to /shared/tkubasx. I then issue

              sccKonsole 0..3

              and get a konsole window with four tabs. I choose

              Edit-> Copy Input to -> All Tabs

              I then issue


              in one tab and I see output in all tabs. I need the complete path here.


              Is this what you are doing?

              • 4. Re: DMA errors

                Babu, are you using the latest sccKit 1.2.3 along with the latest BMC firmware 1.06?

                • 5. Re: DMA errors

                  sccKit 1.2.3 is downloadable from our public SVN and there are directions on how to install the BMC update on this site. It's easy to upgrade and doesn't take long.


                  It would be important to know if the error you see is due to an older sccKit and BMC firmware. We'd use that information as an incentive for people to upgrade!

                  • 6. Re: DMA errors

                    Hi Ted,


                    The hello world test you performed is essentially what I was doing, but I ran on 48 cores, not just 4.  I think the problem occurs when too many concurrent "network" flows happen, so with just 4, nothing bad happens.  I just checked on the system here -- with 4 parallel instances, there are no DMA errors.  I'm not sure exactly how many are needed before problems occur.


                    - Babu
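
                    One way to narrow down how many parallel instances are needed is to repeat the same launch with increasing core counts. The sketch below only prints candidate `sccKonsole` invocations; it assumes the `0..3` range syntax shown earlier in the thread extends to larger ranges, and `core_range` and `list_trials` are hypothetical helper names:

```shell
# Hypothetical helper: print an sccKonsole-style range for the first N cores,
# e.g. 4 -> 0..3. Assumes the "first..last" syntax used earlier in the thread.
core_range() {
    n=$1
    case $n in
        ''|*[!0-9]*) return 1 ;;    # reject empty or non-numeric input
    esac
    [ "$n" -ge 1 ] || return 1      # need at least one core
    echo "0..$(( n - 1 ))"
}

# Print one candidate launch per line for increasing core counts.
# The counts are arbitrary illustration values, capped at 48 cores.
list_trials() {
    for n in 4 8 16 24 32 48; do
        echo "sccKonsole $(core_range "$n")"
    done
}
```

                    Each printed line would be tried in turn from a cold system; the first count that triggers the DMA timeout gives an upper bound on the threshold.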

                    • 7. Re: DMA errors

                      Interestingly, this problem does not occur when I use pssh.  However, the binaries don't actually start simultaneously when using pssh on the system here.  Please take a look at the attached screenshot.  The load meter in the lower right shows 3 peaks -- the first two are with pssh.  The load is spread out because not all of the tasks start at the same time.  The last is using sccKonsole.  Note the sharp spike.  (This test was using a small pi calculator program that loads an SCC core for about 8.5 seconds; the sccKonsole run did not encounter DMA errors because after the first pssh run, the binary was in the buffer caches).


                      - Babu

                      • 8. Re: DMA errors

                        Ok -- turns out pssh by default runs only 32 ssh instances in parallel.  I used -p 48 to increase this.  Now the tasks run fast on pssh as well, but I don't get the DMA errors.  Perhaps there is still just enough variation in start times using pssh that things work, while using sccKonsole, there are too many concurrent accesses and things break.  I am not sure why this happens, though.  Perhaps it is still a host hardware compatibility issue.
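
                        If variation in start times is what makes the difference, that can be tested by staggering the launch on purpose. A minimal sketch, assuming a POSIX shell on the cores and hostnames of the form rckNN; `core_index` and `stagger_delay` are hypothetical helpers, and the default group size of 8 cores per second is an arbitrary illustration value:

```shell
# Hypothetical helper: print the numeric core index for a hostname,
# e.g. rck07 -> 7. Assumes hostnames follow the rckNN pattern.
core_index() {
    digits=${1#rck}                 # strip the "rck" prefix
    case $digits in
        ''|*[!0-9]*) return 1 ;;    # reject anything that is not all digits
    esac
    # Drop leading zeros so values such as 08 are not parsed as octal.
    while [ "${#digits}" -gt 1 ] && [ "${digits#0}" != "$digits" ]; do
        digits=${digits#0}
    done
    echo "$digits"
}

# Print a start delay in whole seconds so that `group` cores start per second.
stagger_delay() {
    idx=$(core_index "$1") || return 1
    group=${2:-8}
    [ "$group" -ge 1 ] || return 1
    echo $(( idx / group ))
}

# Possible use on each core (assumes `hostname` prints rckNN):
#   sleep "$(stagger_delay "$(hostname)")" && exec /shared/<myname>/myhello
```

                        With a group size of 8, cores 0 to 7 would start immediately and cores 40 to 47 after 5 seconds. This only probes the timing hypothesis; it does not address whatever puts the DMA path into a bad state.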

                        • 9. Re: DMA errors

                          I have now upgraded to 1.2.3 and BMC firmware 1.06.  However, the problem persists.  Also, I have experienced the DMA errors when starting the tasks using pssh, though much less often than with sccKonsole.  This seems to occur when lots of simultaneous packets are sent by the SCC cores to the host.  I'm not sure if it is the FPGA that gets into a bad state or the crbif driver, but unloading and reloading the driver does not seem to fix it.  I have to reboot everything to restore operation once the DMA error occurs.  I will try to create a program that simply generates a lot of data packets between cores and to the host and see if I can reproduce this problem that way.


                          - Babu

                          • 10. Re: DMA errors

                            I made a "hello world" ..


                            #include <stdio.h>

                            int main(void) {
                                printf("hello SCC\n");
                                return 0;
                            }


                            I used the sccGui to bring up 48 konsole windows, redirected input from rck00 to go to all cores, cd'ed to /shared/<myname> on rck00, and ran ./myhello.

                            This appeared to work fine. So far I have not seen errors, but I'm looking further.

                            • 11. Re: DMA errors

                              The DMA errors continue to plague the system in Pittsburgh.  We have changed out the host machine again, and now have one of the Intel SR1630 servers that are known to work well.  All of the software has been upgraded, including sccKit and the BMC firmware.  The same errors continue to occur very reproducibly here.  What Ted outlined in the previous message (hello world program launched simultaneously on all 48 cores using sccKonsole with inputs copied to all tabs) consistently causes the DMA errors.  (Ted: I assume your hello world binary is statically linked, and therefore pretty large -- mine is around 550KB).


                              So it seems that we either have a problem with the host interface card, the cable, or the Rocky Lake board itself.  I'm not sure what is the best way to proceed.  Any advice is appreciated.


                              - Babu

                              • 12. Re: DMA errors

                                My SCC hello world program is 459323 bytes.

                                Yes, it is statically linked. icc switches are

                                -static -mcpu=pentium -gcc-version=340 -DCOPPERRIDGE

                                although I doubt the -DCOPPERRIDGE is necessary for such a simple program.


                                Are you still getting those DMA errors? We cannot duplicate the error here.

                                • 13. Re: DMA errors

                                  I attached the myhello that I ran.

                                  • 14. Re: DMA errors

                                    Also, please try performing a memory test by clicking on the swiss army knife button in the sccGui. This test checks whether the hardware has any problems.
