
    Memory performance on Galileo/Quark.

    Krzysztof_Czarnowski

      This is related to the threads "Using ESRAM from user-space code" and "intel_qrk_esram_map_range fails".

       

      My primary motivation for this investigation was to verify what performance gain can be achieved by using the eSRAM that is (kind of) available on Quark, using a Galileo board (gen 1) and the Linux BSP (v. 1.0.2). The results are... well... unexpected. With some additional effort it might be possible to make eSRAM usable, but more insight into what is really going on is needed first.

       

      I've run two memory performance tests, one in the kernel and one in user space, for various "kinds" of memory. The test consists of performing, in a tight loop, a sequence of memory reads from a 512 kB memory area. After each read the address is incremented by 32 and finally wrapped back to the start. (The step is 32 because I was trying to cancel out cache effects and first thought the cache line size was 32 B; it is actually 16 B on Quark, but that does no harm here.)
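
      For reference, a minimal sketch of the inner loop of such a test (my own reconstruction, not the exact test code; names and sizes are illustrative):

          #include <stddef.h>
          #include <stdint.h>

          #define AREA_SIZE (512 * 1024)   /* size of the tested memory area */
          #define STEP      32             /* stride chosen to defeat the cache */

          /* Perform 'nreads' loads from 'area', advancing by STEP and wrapping
           * back to the start; 'volatile' keeps the loads from being optimized out. */
          static void read_test(volatile uint8_t *area, size_t nreads)
          {
              size_t off = 0;

              while (nreads--) {
                  (void)area[off];
                  off += STEP;
                  if (off >= AREA_SIZE)
                      off = 0;
              }
          }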

       

      In user space three memory areas are tested:

      dram --- allocated by the test application with malloc()

      rmem --- kmalloc()'ed by a kernel driver and mmap()'ed by the application

      esram --- kmalloc()'ed and overlaid with eSRAM by the kernel driver, then mmap()'ed by the application

       

      In the driver two memory areas are tested:

      rmem --- kmalloc()'ed by the driver

      esram --- kmalloc()'ed and overlaid with eSRAM by the driver (a rough setup sketch follows below)
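
      For context, a rough sketch of how the driver side can hand such a kmalloc()'ed buffer to user space (the device name and function names here are my own assumptions, not the actual driver code; in the esram case the same buffer is additionally overlaid through the BSP's intel_qrk_esram_map_range call before being handed out):

          /* Kernel driver side (sketch): expose a kmalloc()'ed test buffer to
           * user space through a char device's mmap() handler. User space then
           * open()s the device node and mmap()s TEST_AREA_SIZE bytes of it. */
          #include <linux/fs.h>
          #include <linux/io.h>
          #include <linux/mm.h>
          #include <linux/slab.h>

          #define TEST_AREA_SIZE (512 * 1024)

          static void *test_buf;   /* test_buf = kmalloc(TEST_AREA_SIZE, GFP_KERNEL); */

          static int testmem_mmap(struct file *filp, struct vm_area_struct *vma)
          {
              unsigned long pfn = virt_to_phys(test_buf) >> PAGE_SHIFT;
              unsigned long len = vma->vm_end - vma->vm_start;

              if (len > TEST_AREA_SIZE)
                  return -EINVAL;

              /* Map the buffer's physical pages into the calling process. */
              return remap_pfn_range(vma, vma->vm_start, pfn, len,
                                     vma->vm_page_prot);
          }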

       

      The results are as follows:

       

      Number of reads: 1015808 (~1M)

                    userspace    kernel
      dram          16745
      rmem          104918       104103
      esram         62750        61854

       

      Number of reads: 10010624 (~10M)

                    userspace    kernel
      dram          154117
      rmem          1025651      1024575
      esram         609470       608786

       

      It looks like the memory allocated in user space is _much_ faster than the memory allocated by the kernel. I still have to work out which physical memory block dram actually comes from (I need to decode /proc/PID/pagemap for that). The kernel got its memory from

      00100000-0efdefff : System RAM

      (excerpt from /proc/iomem).
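
      For reference, decoding /proc/PID/pagemap comes down to reading one 64-bit entry per virtual page: bits 0-54 hold the page frame number and bit 63 says whether the page is present. A rough sketch of such a lookup for the current process (my own helper, not part of the test app):

          #include <fcntl.h>
          #include <stdint.h>
          #include <unistd.h>

          /* Return the physical frame number backing 'vaddr' in the current
           * process, or 0 if the page is not present (or on error). */
          static uint64_t vaddr_to_pfn(const void *vaddr)
          {
              long pagesize = sysconf(_SC_PAGESIZE);
              uint64_t entry = 0;
              int fd = open("/proc/self/pagemap", O_RDONLY);

              if (fd < 0)
                  return 0;
              pread(fd, &entry, sizeof(entry),
                    ((uintptr_t)vaddr / pagesize) * sizeof(entry));
              close(fd);

              if (!(entry & (1ULL << 63)))        /* bit 63: page present */
                  return 0;
              return entry & ((1ULL << 55) - 1);  /* bits 0-54: page frame number */
          }

      With the PFN in hand, the physical address (PFN times the page size) can then be matched against the ranges listed in /proc/iomem.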


      So finally, the questions: does anyone understand this behaviour?

      Is there a link to the Galileo physical memory map anywhere?

      Is it possible to gain anything from eSRAM at all, and preferably from user space?


        • 1. Re: Memory performance on Galileo/Quark.
          Intel_Alvarado

          Hi Krzysztof_Czarnowski,

           

          We are currently working on your post to provide you the best possible answer. We will contact you as soon as we have more information.

           

          Regards

          Sergio

          • 2. Re: Memory performance on Galileo/Quark.
            Krzysztof_Czarnowski

            Sorry for the late reply. There has been some progress and I can now get the expected performance: I can use eSRAM from user-space code and gain some performance. Obviously the gain depends heavily on the actual use case (around 40% in my test, but probably closer to 10% in real-life scenarios). However, I still have the feeling that I lack some understanding...

             

            More details:

             

             

            I finally got down to analysing the physical memory mapping of the test app and
            found something unexpected about the malloc()'ed memory (dram), the fast one.
            Namely, even after the read test has been performed, it is not mapped. (The only
            mapped page out of the 128+1 allocated pages is the +1, i.e. the one used for
            heap housekeeping.)

             

             

            So the interesting question is what exactly is being read during the test
            since it’s not memory ;-) Whatever it actually is, it is quite fast :-)

            I dismiss any supposition that the reads are being optimized out: the test is
            run by the same function in all cases (and volatile is used where appropriate
            and not used where it should not be used). So I think it's a hardware feature
            of the platform. Please note that on a PC I don't get this effect of unusually
            fast malloc()'ed memory in exactly the same test.

             

            So I added to the test app a function to write over the dram region when
            requested (“filldram” command) and performed yet another experiment:

             

            command> read_test_dram 10000000
            read_test_dram 10010624 --> 154083
            command> filldram
            command> read_test_dram 10000000
            read_test_dram 10010624 --> 1025555
            command> read_test_rmem 10000000
            read_test_rmem 10010624 --> 1025771
            command> read_test_esram 10000000
            read_test_esram 10010624 --> 609685
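
            (For completeness, the filldram command is nothing more than a write over the
            whole dram area; something along these lines is enough, with an arbitrary
            pattern value and illustrative names, reusing AREA_SIZE from the read-test
            sketch above:)

                #include <string.h>

                /* Touch every byte of the area once; the pattern value is arbitrary. */
                static void filldram(volatile uint8_t *area)
                {
                    memset((void *)area, 0xA5, AREA_SIZE);
                }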

             

            I can see that dram is so fast only before it's written to. After "filling it
            with data" it starts to perform just like rmem, WHICH IS EXACTLY AS EXPECTED
            (and matches what I see on a PC). At the same time we can see that the eSRAM
            used by the application is roughly 40% faster than regular DRAM (in this test,
            which consists of a tight read loop avoiding cache influence), the same as when
            it is used by the kernel---see the previous test results.

             

            So finally I have consistent/reasonable/expected results, problem solved ;-)

             

            Number of reads: 10010624 (~10M)

                          userspace    kernel
            dram          1025555
            rmem          1025771      ---
            esram         609685       ---

             

            I'd like to understand why it works this way. My point of view:
            -- after malloc() succeeds, the memory is allocated in virtual address space
            but it's not mapped to physical pages. This is absolutely OK!
            -- when the app reads from that memory, a page fault should be generated and a
            physical page mapped in. This seems not to happen. Right, the memory has not
            been written by the application yet, so there is no valid data, but that
            shouldn't matter. Anyway, where is the data read from when the read is
            performed? :-(
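
            A quick way to cross-check this without parsing pagemap by hand would be
            mincore(2); a rough sketch of the helper I have in mind (names are mine, not
            from the test app):

                #include <stdint.h>
                #include <stdlib.h>
                #include <sys/mman.h>
                #include <unistd.h>

                /* Count how many pages of [area, area + size) are currently resident. */
                static size_t resident_pages(const void *area, size_t size)
                {
                    long pagesize = sysconf(_SC_PAGESIZE);
                    uintptr_t base = (uintptr_t)area & ~((uintptr_t)pagesize - 1);
                    size_t len = ((uintptr_t)area + size) - base;
                    size_t npages = (len + pagesize - 1) / pagesize;
                    unsigned char *vec = malloc(npages);
                    size_t i, n = 0;

                    if (vec && mincore((void *)base, len, vec) == 0)
                        for (i = 0; i < npages; i++)
                            n += vec[i] & 1;    /* bit 0: page is resident */
                    free(vec);
                    return n;
                }

            Calling it on the dram area before the read test, after the read test and again
            after filldram should show directly whether the reads fault real pages in.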

            • 3. Re: Memory performance on Galileo/Quark.
              Intel_Alvarado

              Hi Krzysztof_Czarnowski,

              We will investigate this and give you an update as soon as possible.

              Regards

              Sergio

              • 4. Re: Memory performance on Galileo/Quark.
                Intel_Alvarado

                Hi,

                 

                We are following up on your case to see whether you were able to find the answer to your question about where the data is read from. Please let us know if you still need assistance on this thread.

                 

                Regards

                Sergio