5 Replies Latest reply on Sep 8, 2011 9:34 AM by tedk

    RCCE_get into local MPB requires explicit flush?

    eshifer

      I encountered an apparent difference between RCCE emulator and SCC HW for RCCE_get. Is there a known difference?

      My code implements a blocked algorithm for matrix inversion using Gauss-Jordan elimination. Within this code, I'm passing some scaling information through the MPB in the "gory" API mode, using RCCE_put and RCCE_get.

      The receive side needs to use the MPB arguments and pass them unchanged to another tile. I assumed it's OK in this case to RCCE_get directly into the local MPB, with the added benefit of having the incoming arguments in the local L1 for immediate use. This seemed to be the case on the RCCE emulator, where the MPB arguments were consumed and delivered along a tile chain. However, on the HW, the local tile got 0's instead of the MPB data.

      To work around it, I had to pay the overhead of copying the MPB arguments from the remote tile to private memory and then again to the local MPB:

       

      // ME_com points to remote tile. ME to local tile.

      // target_ptr used to be the same for the whole tile chain - just passing information along the MPB and consuming it

      // on the way in local L1 without paying copying overhead - worked on the RCCE emulator but not on real HW

      //RCCE_get((t_vcharp)(target_ptr), (t_vcharp)(target_ptr), size, ME_com); => returns '0 to local core on HW

      RCCE_get((t_vcharp)(local_buf), (t_vcharp)(target_ptr), size, ME_com);
      RCCE_put((t_vcharp)(target_ptr), (t_vcharp)(local_buf), size, ME);

        • 1. Re: RCCE_get into local MPB requires explicit flush?
          tedk

          I'm not aware of any known differences between answers from the RCCE emulator and RCCE on SCC. Where are you taking the emulator from? We had a couple of emulator updates, and I think the emulator in the trunk is now working correctly. (Thanks to Andreas Prell for that.)

           

          Sorry, but we're confused about what you are actually doing. Can you provide more code context? Are you interested in passing the arguments of put and get as well as or instead of the actual data? Why are you referring to passing to another tile rather than a core?

          • 2. Re: RCCE_get into local MPB requires explicit flush?
            eshifer

            I'm logged in to Marc013. Pulled from trunk and reproduced. Printed out the difference between emulator and HW -

            source file (debug prints on lines #72 and #89) -

            /home/eshifer/chk_rcce_get/rcce/apps/mat_inv_scc/RCCE_mat_inv_async.c

             

            Note that on the emulator the outer routine sees the true numbers that were passed from remote MPB to local MPB. On HW, the value read is '0. When passing the matrix argument on HW from remote MPB to private memory and then to local MPB, the problem was fixed (the HW fix is the commented-out lines in the attached source file at lines 60,61 for a back-to-back RCCE_get and RCCE_put).

             

            /home/eshifer/chk_rcce_get/rcce/apps/mat_inv_emulator/emulator.log

            hosts mat_inv_async n 4 b 2 x 2 y 2
            mat_inv_async 4 0.533 00 01 02 03 n 4 b 2 x 2 y 2
            Parallel Matrix Inversion on 4 cores
            com inner routine pointer check -  target_ptr 134583840, target_val 36.000000
            com outer routine pointer check -  target_ptr 134583840, target_val 36.000000
            com inner routine pointer check -  target_ptr 134583872, target_val -0.916667
            com outer routine pointer check -  target_ptr 134583872, target_val -0.916667
            ...

            Total time: 0.013866
            collecting results..

             

            /home/eshifer/chk_rcce_get/rcce/apps/mat_inv_scc/scc.log

            eshifer@marc013:/shared/eshifer$ ./rccerun -nue 4 -f ./rc.hosts mat_inv_async n 4 b 2 x 2 y 2
            pssh -h PSSH_HOST_FILE.16601 -t -1 -p 4 /shared/eshifer/mpb.16601 < /dev/null
            [1] 13:22:16 [SUCCESS] rck01
            [2] 13:22:16 [SUCCESS] rck03
            [3] 13:22:18 [SUCCESS] rck02
            [4] 13:22:33 [FAILURE] rck00 Exited with error code 255
            pssh -h PSSH_HOST_FILE.16601 -t -1 -P -p 4 /shared/eshifer/mat_inv_async 4 0.533 00 01 02 03 n 4 b 2 x 2 y 2 < /dev/null
            rck01: com inner routine pointer check -  target_ptr -1216921536, target_val 0.000000
            com outer routine pointer check -  target_ptr -1216921536, target_val 0.000000
            com inner routine pointer check -  target_ptr -1216921504, target_val 0.000000
            com outer routine pointer check -  target_ptr -1216921504, target_val 0.000000
            ...

            rck00: Parallel Matrix Inversion on 4 cores
            Total time: 0.033951
            collecting results..
            Faliure - Matrix inversion compute error
            [1] 13:22:38 [SUCCESS] rck02
            [2] 13:22:38 [SUCCESS] rck03
            [3] 13:22:38 [SUCCESS] rck01
            [4] 13:22:38 [SUCCESS] rck00

            • 3. Re: RCCE_get into local MPB requires explicit flush?
              tedk

              I think I finally understand what it is you want to do. Yes, it should work on the hw without the detour.

               

              In your code can you be assured that ME is never equal to ME_com? The put is performed by ME and the get by ME_com. We should be able to put into the MPB as well as a private buffer as long as ME != ME_com.

               

              But there is a memcpy_get() and memcpy_put() inside the get and put. These are routines optimized for the hw. When you use the emulator, you are actually using the standard memcpy(). We could also use the standard memcpy() with the hw. The MPB-to-MPB path may not have been tested with the optimized memcpys. For example, in RCCE_get.c, we could replace memcpy_get with memcpy as follows ... this entails rebuilding RCCE.

                  //  memcpy((void *)target, (void *)source, num_bytes); <== this is the Linux memcpy
                  memcpy_get((void *)target, (void *)source, num_bytes);
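One way to try both copy routines without editing the call each time is a compile-time switch. The following is only a minimal sketch, not RCCE's actual source: the macro name `USE_LIBC_MEMCPY` and the wrapper `copy_from_mpb` are hypothetical, and `memcpy_get_standin` is a plain byte loop standing in for the hardware-optimized `memcpy_get` so that the fragment compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for RCCE's optimized memcpy_get(); the real routine is
   hardware-specific. This byte loop only keeps the sketch self-contained. */
static void *memcpy_get_standin(void *dest, const void *src, size_t count)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    while (count--)
        *d++ = *s++;
    return dest;
}

/* Hypothetical wrapper: build with -DUSE_LIBC_MEMCPY to select the
   standard memcpy(); otherwise the (stand-in) optimized copy is used. */
static void *copy_from_mpb(void *target, const void *source, size_t num_bytes)
{
    assert(target != NULL && source != NULL);
#ifdef USE_LIBC_MEMCPY
    return memcpy(target, source, num_bytes);
#else
    return memcpy_get_standin(target, source, num_bytes);
#endif
}
```

With a switch like this, the same test program can be rebuilt with and without the flag to see whether the copy routine is what changes the result.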

               

              Another thing to try is to look at the return values for RCCE_get() and RCCE_put(). The error codes are in RCCE.h. I think what happens is that if RCCE gets an error, it won't transfer the bytes and consequently you'll see zero.
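A minimal sketch of such a check, under stated assumptions: it assumes a successful call returns 0 (RCCE.h lists the actual codes), and the names `XFER_OK`, `report_xfer`, `CHECKED` and `stub_get` are hypothetical. `stub_get` is only a stand-in for a transfer routine so that the fragment compiles without RCCE.

```c
#include <assert.h>
#include <stdio.h>

#define XFER_OK 0   /* assumption: success is 0; confirm against RCCE.h */

/* Hypothetical helper: print the failing call and its status code to
   stderr, then pass the code through unchanged. */
static int report_xfer(const char *what, int rc, const char *file, int line)
{
    assert(what != NULL);
    if (rc != XFER_OK)
        fprintf(stderr, "%s:%d: %s failed with code %d\n", file, line, what, rc);
    return rc;
}

/* Wraps any expression that yields an int status code. */
#define CHECKED(call) report_xfer(#call, (call), __FILE__, __LINE__)

/* Stand-in for a transfer routine: rejects non-positive sizes with an
   arbitrary nonzero code (7 has no meaning beyond this sketch). */
static int stub_get(int num_bytes)
{
    return num_bytes > 0 ? XFER_OK : 7;
}
```

On SCC the wrapped expression would be the real call, for example `CHECKED(RCCE_get((t_vcharp)(target_ptr), (t_vcharp)(target_ptr), size, ME_com))`, so a nonzero status is reported at the point where the zeros first appear.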

               

              I don't think the RCCE spec is in error, but when I looked at the description of put and get, I think it could be clearer. In both cases (put and get), target points to where the data are placed.

              • 4. Re: RCCE_get into local MPB requires explicit flush?
                eshifer

                Yep, when I changed to the regular memcpy the problem disappeared.

                I wasn't able to run with error reporting for RCCE_get, as the HW got stuck and suddenly the sccBsc -i and sccBoot -l commands were not found. What setup should I run to see these initialization commands?

                As for ME and ME_com - these are different cores (if not, it's a bug in the program). 

                • 5. Re: RCCE_get into local MPB requires explicit flush?
                  tedk

                  Ah, so then there is a bug in memcpy_put/get. If you don't file a bug, I will soon. Isn't it somebody's rule that a bug occurs in every untested path?

                   

                  If you cannot find the sccKit commands ... I don't know why that would occur. The commands are in /opt/sccKit/current/bin. You can just put that in your path or use the perl script in /opt/sccKit/current as in eval `/opt/sccKit/current/setup` ... which I have in my .bashrc.