22 Replies · Latest reply on Aug 10, 2016 6:37 AM by Intel Corporation

    Critical performance drop on newly created large file

    AlexNZ
      • NVMe drive model:  Intel SSD DC P3700 U.2 NVMe SSD
      • Capacity: 764G
      • FS: XFS
      • Other HW:
        • AIC SB122A-PH
        • 8 Intel DC P3700 NVMe SSDs: 2 on CPU 0, 6 on CPU 1
        • 128 GiB RAM (8 x 16 GiB DDR4 2400 MHz DIMMs)
        • 2 x Intel E5-2620v3 2.4 GHz CPUs
        • 2 x Intel DC S2510 SATA SSDs (one is used as the system drive).
        • Note that both are engineering samples provided by Intel NSG, but all have been updated to the latest firmware using isdct 3.0.0.
      • OS: CentOS Linux release 7.2.1511 (Core)
      • Kernel: Linux fs00 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 23 17:05:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

       

      We have been testing two Intel DC P3700 U.2 800GB NVMe SSDs to see the impact of the emulated sector size (512 vs 4096) on throughput. Using fio 2.12, we observed a puzzling collapse in performance. The steps to reproduce are given below.

       

      Steps:

      1. Copy or sequentially write a single large file (300 GB or larger); a sketch of the commands we use is given after the results below.

      2. Start an fio test with the following config:

      [readtest]
      thread=1
      blocksize=2m
      filename=/export/beegfs/data0/file_000000
      rw=randread
      direct=1
      buffered=0
      ioengine=libaio
      nrfiles=1
      gtod_reduce=0
      numjobs=32
      iodepth=128
      runtime=360
      group_reporting=1
      percentage_random=90

       

      3. Observe extremely slow performance:

      fio-2.12

      Starting 32 threads

      readtest: (groupid=0, jobs=32): err= 0: pid=5097: Thu Jul 14 13:00:25 2016

        read : io=65536KB, bw=137028B/s, iops=0, runt=489743msec

          slat (usec): min=4079, max=7668, avg=5279.19, stdev=662.80

          clat (msec): min=3, max=25, avg=18.97, stdev= 6.16

           lat (msec): min=8, max=31, avg=24.25, stdev= 6.24

          clat percentiles (usec):

           |  1.00th=[ 3280],  5.00th=[ 4320], 10.00th=[ 9664], 20.00th=[17536],

           | 30.00th=[18816], 40.00th=[20352], 50.00th=[20608], 60.00th=[21632],

           | 70.00th=[21632], 80.00th=[22912], 90.00th=[25472], 95.00th=[25472],

           | 99.00th=[25472], 99.50th=[25472], 99.90th=[25472], 99.95th=[25472],

           | 99.99th=[25472]

          lat (msec) : 4=3.12%, 10=9.38%, 20=25.00%, 50=62.50%

        cpu          : usr=0.00%, sys=74.84%, ctx=792583, majf=0, minf=16427

        IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%

           submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%

           complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%

           issued    : total=r=32/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0

           latency   : target=0, window=0, percentile=100.00%, depth=128

       

      Run status group 0 (all jobs):

         READ: io=65536KB, aggrb=133KB/s, minb=133KB/s, maxb=133KB/s, mint=489743msec, maxt=489743msec

       

      Disk stats (read/write):

        nvme0n1: ios=0/64317, merge=0/0, ticks=0/1777871, in_queue=925406, util=0.19%

       

      4. Repeat the test

      5. Performance is much higher:

      fio-2.12

      Starting 32 threads

       

      readtest: (groupid=0, jobs=32): err= 0: pid=5224: Thu Jul 14 13:11:58 2016

      read : io=861484MB, bw=2389.3MB/s, iops=1194, runt=360564msec

          slat (usec): min=111, max=203593, avg=26742.15, stdev=21321.98

          clat (msec): min=414, max=5176, avg=3391.05, stdev=522.29

           lat (msec): min=414, max=5247, avg=3417.79, stdev=524.75

          clat percentiles (msec):

           |  1.00th=[ 1614],  5.00th=[ 2376], 10.00th=[ 2802], 20.00th=[ 3097],

           | 30.00th=[ 3228], 40.00th=[ 3359], 50.00th=[ 3458], 60.00th=[ 3556],

           | 70.00th=[ 3654], 80.00th=[ 3785], 90.00th=[ 3949], 95.00th=[ 4080],

           | 99.00th=[ 4359], 99.50th=[ 4424], 99.90th=[ 4752], 99.95th=[ 4883],

           | 99.99th=[ 5014]

          bw (KB  /s): min= 4096, max=385795, per=3.13%, avg=76601.82, stdev=30418.14

          lat (msec) : 500=0.01%, 750=0.08%, 1000=0.10%, 2000=2.71%, >=2000=97.11%

        cpu          : usr=0.01%, sys=0.76%, ctx=1106592, majf=0, minf=2097203

        IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.2%, >=64=99.5%

           submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%

           complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%

           issued    : total=r=430742/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0

           latency   : target=0, window=0, percentile=100.00%, depth=128

       

      Run status group 0 (all jobs):

         READ: io=861484MB, aggrb=2389.3MB/s, minb=2389.3MB/s, maxb=2389.3MB/s, mint=360564msec, maxt=360564msec

       

      Disk stats (read/write):

        nvme0n1: ios=7302600/0, merge=0/0, ticks=18446744073148861576/0, in_queue=18446744073151667706, util=100.00%

       

       

      Note: even a 30-minute delay between steps 1 and 2 doesn't improve the situation.
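
      For completeness, this is roughly how we create the large file and launch the job above (the dd command and the job-file name readtest.fio are illustrative assumptions; any sequential copy or write of a 300 GB+ file reproduces the issue):

      # write a single ~300 GiB file sequentially (a plain cp of an existing large file works just as well)
      dd if=/dev/zero of=/export/beegfs/data0/file_000000 bs=1M count=307200 oflag=direct

      # save the [readtest] config above as readtest.fio and run it
      fio readtest.fio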

       

      More info about the NVMe drives used:

       

      [root@fs00 ~]# isdct show -intelssd 0

       

      - Intel SSD DC P3700 Series CVFT420400AP800HGN -

       

      Bootloader : 8B1B0131

      DevicePath : /dev/nvme0n1

      DeviceStatus : Healthy

      Firmware : 8DV10171

      FirmwareUpdateAvailable : The selected Intel SSD contains current firmware as of this tool release.

      Index : 0

      ModelNumber : INTEL SSDPE2MD800G4

      ProductFamily : Intel SSD DC P3700 Series

      SerialNumber : CVFT420400AP800HGN

      [root@fs00 ~]# isdct show -a -intelssd 0|grep SectorSize

      SectorSize : 4096

       

      [root@fs00 ~]# isdct show -intelssd 1

       

      - Intel SSD DC P3700 Series CVFT420400G3800HGN -

       

      Bootloader : 8B1B0131

      DevicePath : /dev/nvme1n1

      DeviceStatus : Healthy

      Firmware : 8DV10171

      FirmwareUpdateAvailable : The selected Intel SSD contains current firmware as of this tool release.

      Index : 1

      ModelNumber : INTEL SSDPE2MD800G4

      ProductFamily : Intel SSD DC P3700 Series

      SerialNumber : CVFT420400G3800HGN

       

      [root@fs00 ~]# isdct show -a -intelssd 1|grep SectorSize

      SectorSize : 512

        • 1. Re: Critical performance drop on newly created large file
          zperry

          Since we deal with a similar situation, I tried the above steps and confirmed this issue on our machine. In fact, I also tried it with both XFS and EXT4. The symptom showed up regardless.

          • 2. Re: Critical performance drop on newly created large file
            Intel Corporation
            This message was posted on behalf of Intel Corporation

            AlexNZ,

            Thanks for bringing this situation to our attention; we would like to verify this and provide a solution as quickly as possible. Please allow us some time to check on this and we will keep you all posted.

            NC

            • 3. Re: Critical performance drop on newly created large file
              Intel Corporation
              This message was posted on behalf of Intel Corporation

              Hello,

              After reviewing the settings, we would like to verify the following:

              For the read test, could you please try: fio --output=test_result.txt --name=myjob --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 --norandommap --randrepeat=0 --runtime=600 --blocksize=4K --rw=randread --iodepth=32 --numjobs=4 --group_reporting

              It is important to note that we normally run these tests with 4 threads (numjobs=4) and iodepth=32 for blocksize=4K.
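
              For reference, the same suggestion expressed as a job file in the format used earlier in this thread (the file name myjob.fio is only an assumption) would look like:

              [myjob]
              filename=/dev/nvme0n1
              ioengine=libaio
              direct=1
              norandommap
              randrepeat=0
              runtime=600
              blocksize=4k
              rw=randread
              iodepth=32
              numjobs=4
              group_reporting

              It could then be launched with: fio --output=test_result.txt myjob.fio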

              Please let us know, as we may need to keep researching this.

              NC

              • 4. Re: Critical performance drop on newly created large file
                AlexNZ

                Hello,

                 

                With the proposed settings I received the following result:

                 

                myjob: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32

                ...

                fio-2.12

                Starting 4 processes

                 

                myjob: (groupid=0, jobs=4): err= 0: pid=23560: Wed Jul 20 07:06:08 2016

                  read : io=1092.2GB, bw=1863.1MB/s, iops=477156, runt=600001msec

                    slat (usec): min=1, max=63, avg= 2.76, stdev= 1.57

                    clat (usec): min=14, max=3423, avg=260.81, stdev=90.86

                     lat (usec): min=18, max=3426, avg=263.68, stdev=90.84

                    clat percentiles (usec):

                     |  1.00th=[  114],  5.00th=[  139], 10.00th=[  157], 20.00th=[  185],

                     | 30.00th=[  207], 40.00th=[  229], 50.00th=[  251], 60.00th=[  274],

                     | 70.00th=[  298], 80.00th=[  326], 90.00th=[  374], 95.00th=[  422],

                     | 99.00th=[  532], 99.50th=[  588], 99.90th=[  716], 99.95th=[  788],

                     | 99.99th=[ 1048]

                    bw (KB  /s): min= 5400, max=494216, per=25.36%, avg=484036.11, stdev=14017.77

                    lat (usec) : 20=0.01%, 50=0.01%, 100=0.23%, 250=49.61%, 500=48.54%

                    lat (usec) : 750=1.55%, 1000=0.06%

                    lat (msec) : 2=0.01%, 4=0.01%

                  cpu          : usr=15.00%, sys=41.78%, ctx=77056567, majf=0, minf=264

                  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%

                     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%

                     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%

                     issued    : total=r=286294132/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0

                     latency   : target=0, window=0, percentile=100.00%, depth=32

                 

                Run status group 0 (all jobs):

                   READ: io=1092.2GB, aggrb=1863.1MB/s, minb=1863.1MB/s, maxb=1863.1MB/s, mint=600001msec, maxt=600001msec

                 

                Disk stats (read/write):

                  nvme0n1: ios=286276788/29109, merge=0/0, ticks=72929877/10859607, in_queue=84848144, util=99.33%

                 

                But in this case the test ran against the raw device (/dev/nvme0n1), whereas in our case the target was a file on XFS on the NVMe drive.

                Also, during the latest tests we determined that flushing the page cache (echo 1 > /proc/sys/vm/drop_caches) solves the problem.

                Why the page cache affects direct I/O is still the question.

                Could it be something specific to the NVMe driver?
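
                For anyone who wants to reproduce the workaround, the sequence is roughly as follows (the leading sync is our assumption, to make sure dirty pages are written back before dropping the cache):

                sync
                echo 1 > /proc/sys/vm/drop_caches    # frees clean page-cache pages only (not dentries/inodes)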

                 

                AlexNZ

                • 5. Re: Critical performance drop on newly created large file
                  zperry

                  I have read this thread with great interest. I concur with AlexNZ: testing files residing on a file system is far more relevant to production situations. We do so to figure out the overhead of

                   

                  • local file system (XFS, EXT4 etc)
                  • Distributed file system(s) (Lustre, GPFS etc)

                   

                  over raw devices (individual and aggregated).

                   

                  The following suggestion from NC is only for testing raw devices.

                  fio --output=test_result.txt --name=myjob --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 --norandommap --randrepeat=0 --runtime=600 --blocksize=4K --rw=randread --iodepth=32 --numjobs=4 --group_reporting

                  On our end, we have done many hundreds of raw device tests.  Results are always in line with what Intel has published.  But this particular file testing result, as I posted on July 15, is a "shocker"!

                   

                  It would be great to know why fio reading a regular file from an NVMe SSD with direct=1 is still affected by data in the page cache.

                   

                  Another point: we understand why numjobs=4 and iodepth=32 are usually used for Intel NVMe SSDs. But such settings are only optimal for raw devices, right? When it comes to reading/writing regular files, IMHO we should configure fio with parameter values that match the actual workloads as closely as possible. NC, your view?

                  • 6. Re: Critical performance drop on newly created large file
                    Intel Corporation
                    This message was posted on behalf of Intel Corporation

                    Hello all,

                    Having reviewed this situation and all the information provided, we will be escalating it and will post updates here. Please expect a response soon.

                    NC

                    • 7. Re: Critical performance drop on newly created large file
                      Intel Corporation
                      This message was posted on behalf of Intel Corporation

                      Hello all,

                      We would like you to run the test again, but before that, could you please TRIM the drives first? Once you do, please share the results back with us.

                      Also, please make sure you are using the correct driver in this link.
                      Something important to mention is that the performance tools we use are Synthetic Benchmarking tools, as explained in the Intel® Solid-State Drive DC P3700 evaluation guide, and these are intended to measure the behavior of the SSD without taking into consideration other components in the system that would add "bottlenecks". Synthetic benchmarks measure raw drive I/O transfer rates.

                      Here is the evaluation guide.

                      Please let us know.

                      NC

                      • 8. Re: Critical performance drop on newly created large file
                        zperry

                        Thanks for your follow-up.  I did try fstrim on a DC P3700 NVMe SSD here.

                         

                        First of all, let's get the driver and firmware issue out of the way. The server runs CentOS 7.2:

                         

                        [root@fs11 ~]# uname -a

                        Linux fs11 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 23 17:05:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

                        [root@fs11 ~]# cat /etc/redhat-release

                        CentOS Linux release 7.2.1511 (Core)

                         

                        We also use the latest isdct:

                         

                        [root@fs11 ~]# isdct version

                        - Version Information -

                        Name: Intel(R) Data Center Tool

                        Version: 3.0.0

                        Description: Interact and configure Intel SSDs.

                         

                        And, according to the tool, the drive is healthy:

                         

                        [root@fs11 ~]# isdct show -intelssd 2

                         

                         

                        - Intel SSD DC P3700 Series CVFT515400401P6JGN -

                         

                         

                        Bootloader : 8B1B0131

                        DevicePath : /dev/nvme2n1

                        DeviceStatus : Healthy

                        Firmware : 8DV10171

                        FirmwareUpdateAvailable : The selected Intel SSD contains current firmware as of this tool release.

                        Index : 2

                        ModelNumber : INTEL SSDPE2MD016T4

                        ProductFamily : Intel SSD DC P3700 Series

                        SerialNumber : CVFT515400401P6JGN

                         

                        While the drive had a file system (XFS) with data on it, I ran fstrim:

                         

                        [root@fs11 ~]# fstrim -v /export/beegfs/data2

                        fstrim: /export/beegfs/data2: FITRIM ioctl failed: Input/output error

                         

                        So, I unmounted the XFS, used isdct delete to remove all data, recreated the XFS, mounted it again, and then ran fstrim:

                         

                        Same outcome. Please see the session log below:

                         

                        [root@fs11 ~]# df -h

                        Filesystem      Size  Used Avail Use% Mounted on

                        /dev/sda3       192G  2.4G  190G   2% /

                        devtmpfs         63G     0   63G   0% /dev

                        tmpfs            63G     0   63G   0% /dev/shm

                        tmpfs            63G   26M   63G   1% /run

                        tmpfs            63G     0   63G   0% /sys/fs/cgroup

                        /dev/sda1       506M  166M  340M  33% /boot

                        /dev/sdb        168G   73M  157G   1% /export/beegfs/meta

                        tmpfs            13G     0   13G   0% /run/user/99

                        /dev/nvme2n1    1.5T  241G  1.3T  17% /export/beegfs/data2

                        tmpfs            13G     0   13G   0% /run/user/0

                        [root@fs11 ~]# umount /export/beegfs/data2

                        [root@fs11 ~]# df -h

                        Filesystem      Size  Used Avail Use% Mounted on

                        /dev/sda3       192G  2.4G  190G   2% /

                        devtmpfs         63G     0   63G   0% /dev

                        tmpfs            63G     0   63G   0% /dev/shm

                        tmpfs            63G   26M   63G   1% /run

                        tmpfs            63G     0   63G   0% /sys/fs/cgroup

                        /dev/sda1       506M  166M  340M  33% /boot

                        /dev/sdb        168G   73M  157G   1% /export/beegfs/meta

                        tmpfs            13G     0   13G   0% /run/user/99

                        tmpfs            13G     0   13G   0% /run/user/0

                        [root@fs11 ~]# isdct delete -f -intelssd 2

                        Deleting...

                         

                         

                        - Intel SSD DC P3700 Series CVFT515400401P6JGN -

                         

                         

                        Status : Delete successful.

                         

                         

                        [root@fs11 ~]# mkfs.xfs -K -f -d agcount=24 -l size=128m,version=2 -i size=512 -s size=4096 /dev/nvme2n1

                        meta-data=/dev/nvme2n1           isize=512    agcount=24, agsize=16279311 blks

                                 =                       sectsz=4096  attr=2, projid32bit=1

                                 =                       crc=0        finobt=0

                        data     =                       bsize=4096   blocks=390703446, imaxpct=5

                                 =                       sunit=0      swidth=0 blks

                        naming   =version 2              bsize=4096   ascii-ci=0 ftype=0

                        log      =internal log           bsize=4096   blocks=32768, version=2

                                 =                       sectsz=4096  sunit=1 blks, lazy-count=1

                        realtime =none                   extsz=4096   blocks=0, rtextents=0

                        [root@fs11 ~]# mount -a

                        [root@fs11 ~]# df -h

                        Filesystem      Size  Used Avail Use% Mounted on

                        /dev/sda3       192G  2.4G  190G   2% /

                        devtmpfs         63G     0   63G   0% /dev

                        tmpfs            63G     0   63G   0% /dev/shm

                        tmpfs            63G   26M   63G   1% /run

                        tmpfs            63G     0   63G   0% /sys/fs/cgroup

                        /dev/sda1       506M  166M  340M  33% /boot

                        /dev/sdb        168G   73M  157G   1% /export/beegfs/meta

                        tmpfs            13G     0   13G   0% /run/user/99

                        tmpfs            13G     0   13G   0% /run/user/0

                        /dev/nvme2n1    1.5T   34M  1.5T   1% /export/beegfs/data2

                        [root@fs11 ~]# fstrim -v /export/beegfs/data2

                        fstrim: /export/beegfs/data2: FITRIM ioctl failed: Input/output error

                         

                        dmesg showed the following:

                         

                        [985659.986980] XFS (nvme2n1): Unmounting Filesystem

                        [985716.637432]  nvme2n1: unknown partition table

                        [985741.677631]  nvme2n1: unknown partition table

                        [985760.616153] XFS (nvme2n1): Mounting V4 Filesystem

                        [985760.619709] XFS (nvme2n1): Ending clean mount

                        [985817.484055] blk_update_request: I/O error, dev nvme2n1, sector 3079279312

                        [985817.484081] blk_update_request: I/O error, dev nvme2n1, sector 3104445112

                        [985817.484136] blk_update_request: I/O error, dev nvme2n1, sector 3087667912

                        [985817.484171] blk_update_request: I/O error, dev nvme2n1, sector 3070890712

                        [985817.484230] blk_update_request: I/O error, dev nvme2n1, sector 3121222312

                        [985817.484249] blk_update_request: I/O error, dev nvme2n1, sector 3062502112

                        [985817.484266] blk_update_request: I/O error, dev nvme2n1, sector 3112833712

                        [985817.484284] blk_update_request: I/O error, dev nvme2n1, sector 3096056512

                         

                        Please note that I created the XFS on the entire block device, without using a partition. Also, I applied fstrim to another DC P3700 and it succeeded. Even with this "good" NVMe SSD, the symptom that AlexNZ first reported still showed up - I am becoming suspicious of the Linux OS itself but have found nothing concrete yet. Now the inability to use fstrim is becoming an issue. What can we do about it?

                        • 9. Re: Critical performance drop on newly created large file
                          AlexNZ

                          Hello,

                           

                          I can confirm that after TRIM the result is still poor.

                          Actually, after a quick look at the Linux kernel code, including the XFS implementation, I found that even during direct reads the page cache is still involved.

                          But such poor performance still looks weird.
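
                          One way to check whether the freshly written file is still resident in the page cache before starting fio is a third-party tool such as vmtouch (assuming it is installed; it is not part of CentOS):

                          vmtouch -v /export/beegfs/data0/file_000000    # reports how many pages of the file are currently resident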

                          • 10. Re: Critical performance drop on newly created large file
                            zperry

                            Just a quick supplement regarding the I/O errors that I reported in my last reply: I even tried to do the following:

                             

                            1. umount the drive
                            2. Do a nvmeformat: isdct start -intelssd 2 -nvmeformat LBAformat=3 SecureEraseSetting=0 ProtectionInformation=0 MetaDataSettings=0
                            3. Recreate XFS
                            4. mount the XFS
                            5. ran fstrim -v to the mount point.

                             

                            I still got

                             

                            [root@fs11 ~]# dmesg |tail -11

                            [987891.677911]  nvme2n1: unknown partition table

                            [987898.749260] XFS (nvme2n1): Mounting V4 Filesystem

                            [987898.752844] XFS (nvme2n1): Ending clean mount

                            [987948.612051] blk_update_request: I/O error, dev nvme2n1, sector 3070890712

                            [987948.612088] blk_update_request: I/O error, dev nvme2n1, sector 3087667912

                            [987948.612151] blk_update_request: I/O error, dev nvme2n1, sector 3121222312

                            [987948.612193] blk_update_request: I/O error, dev nvme2n1, sector 3062502112

                            [987948.612211] blk_update_request: I/O error, dev nvme2n1, sector 3104445112

                            [987948.612228] blk_update_request: I/O error, dev nvme2n1, sector 3079279312

                            [987948.612296] blk_update_request: I/O error, dev nvme2n1, sector 3096056512

                            [987948.612314] blk_update_request: I/O error, dev nvme2n1, sector 3112833712

                             

                            So, unlike the SCSI drives I used years ago, the format didn't remap the "bad" sectors. I would appreciate a hint as to how to get this issue resolved too.
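
                            As a sanity check on whether these are real media errors, the drive's error counters could be read with nvme-cli, assuming the nvme package is installed (it is not part of the base CentOS 7 install):

                            nvme smart-log /dev/nvme2n1 | grep -i media_errors    # lifetime media error count
                            nvme error-log /dev/nvme2n1 | head                    # most recent controller error log entries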

                            • 11. Re: Critical performance drop on newly created large file
                              zperry

                              I tried to narrow down the cause of the fstrim issue further. It seems to me the hardware (i.e. the NVMe SSD itself) is responsible, rather than the software layer on top of it (XFS). So I decided to add a partition table first and create the XFS on the partition. As is evident below, adding the partition didn't help.

                               

                              Is the drive faulty? If so, why does isdct still report its DeviceStatus as Healthy?

                               

                              [root@fs11 ~]# isdct delete -f -intelssd 2

                              Deleting...

                               

                              - Intel SSD DC P3700 Series CVFT515400401P6JGN -

                               

                              Status : Delete successful.

                               

                              [root@fs11 ~]# parteed -a optimal /dev/nvme2n1 mklabel gpt

                              -bash: parteed: command not found

                              [root@fs11 ~]# parted -a optimal /dev/nvme2n1 mklabel gpt

                              Information: You may need to update /etc/fstab.

                               

                              [root@fs11 ~]# parted /dev/nvme2n1 mkpart primary 1048576B 100%

                              Information: You may need to update /etc/fstab.

                               

                              [root@fs11 ~]# parted /dev/nvme2n1                                       

                              GNU Parted 3.1

                              Using /dev/nvme2n1

                              Welcome to GNU Parted! Type 'help' to view a list of commands.

                              (parted) print                                                           

                              Model: Unknown (unknown)

                              Disk /dev/nvme2n1: 1600GB

                              Sector size (logical/physical): 4096B/4096B

                              Partition Table: gpt

                              Disk Flags:

                               

                              Number  Start  End    Size    File system  Name    Flags

                              1      1049kB  1600GB  1600GB              primary

                               

                              (parted) quit     

                              [root@fs11 ~]# mkfs.xfs -K -f -d agcount=24 -l size=128m,version=2 -i size=512 -s size=4096 /dev/nvme2n1

                              meta-data=/dev/nvme2n1          isize=512    agcount=24, agsize=16279311 blks

                                      =                      sectsz=4096  attr=2, projid32bit=1

                                      =                      crc=0        finobt=0

                              data    =                      bsize=4096  blocks=390703446, imaxpct=5

                                      =                      sunit=0      swidth=0 blks

                              naming  =version 2              bsize=4096  ascii-ci=0 ftype=0

                              log      =internal log          bsize=4096  blocks=32768, version=2

                                      =                      sectsz=4096  sunit=1 blks, lazy-count=1

                              realtime =none                  extsz=4096  blocks=0, rtextents=0

                              [root@fs11 ~]# mount -a

                              [root@fs11 ~]# lsblk

                              NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT

                              sda      8:0    0 223.6G  0 disk

                              ├─sda1    8:1    0  512M  0 part /boot

                              ├─sda2    8:2    0  31.5G  0 part [SWAP]

                              └─sda3    8:3    0 191.6G  0 part /

                              sdb      8:16  0 223.6G  0 disk /export/beegfs/meta

                              sdc      8:32  0  59.6G  0 disk

                              sdd      8:48  0  59.6G  0 disk

                              sr0      11:0    1  1024M  0 rom 

                              nvme0n1 259:2    0  1.5T  0 disk

                              nvme1n1 259:6    0  1.5T  0 disk

                              nvme2n1 259:7    0  1.5T  0 disk /export/beegfs/data2

                              nvme3n1 259:5    0  1.5T  0 disk

                              nvme4n1 259:0    0  1.5T  0 disk

                              nvme5n1 259:3    0  1.5T  0 disk

                              nvme6n1 259:1    0  1.5T  0 disk

                              nvme7n1 259:4    0  1.5T  0 disk

                              [root@fs11 ~]# fstrim -v /export/beegfs/data2

                               

                              [991510.325647] XFS (nvme2n1): Mounting V4 Filesystem

                              [991510.329236] XFS (nvme2n1): Ending clean mount

                              [991558.419113] blk_update_request: I/O error, dev nvme2n1, sector 3087667912

                              [991558.419159] blk_update_request: I/O error, dev nvme2n1, sector 3112833712

                              [991558.419178] blk_update_request: I/O error, dev nvme2n1, sector 3121222312

                              [991558.419196] blk_update_request: I/O error, dev nvme2n1, sector 3079279312

                              [991558.419214] blk_update_request: I/O error, dev nvme2n1, sector 3096056512

                              [991558.419286] blk_update_request: I/O error, dev nvme2n1, sector 3070890712

                              [991558.419304] blk_update_request: I/O error, dev nvme2n1, sector 3062502112

                              [991558.419322] blk_update_request: I/O error, dev nvme2n1, sector 3104445112

                              • 12. Re: Critical performance drop on newly created large file
                                Intel Corporation
                                This message was posted on behalf of Intel Corporation

                                Hello,

                                Thanks everyone for trying the suggestion. We would like to gather all these inputs and research here with our department in order to work in a resolution for all of you.

                                Please allow us some time to do the research, we will keep you posted.

                                NC

                                • 13. Re: Critical performance drop on newly created large file
                                  zperry

                                  Thanks for following up. I reviewed what I had done regarding fstrim, and the tests that I have done, and came up with two additional plausible causes:

                                   

                                  1. In the way I run mkfs.xfs, I always use the -K option; what if I don't use it?
                                  2. I would like to take advantage of the variable sector size support provided by the DC P3700, so we are currently evaluating the performance benefits of using a large SectorSize. Thus, the NVMe SSDs that I tested fstrim on have a 4096 sector size. What happens if I retain the default 512? (A reformat sketch follows below.)
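
                                  If it helps, the second experiment can presumably be done with the same nvmeformat command used earlier in this thread, assuming LBAformat=0 corresponds to the 512-byte sector format on these drives (and noting that it destroys all data on the SSD):

                                  isdct start -intelssd 2 -nvmeformat LBAformat=0 SecureEraseSetting=0 ProtectionInformation=0 MetaDataSettings=0
                                  isdct show -a -intelssd 2 | grep SectorSize    # confirm the new sector size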

                                   

                                  My tests indicate that the Intel DC P3700 firmware, the Linux NVMe driver, or both may have a bug. The following is my evidence. Please review.

                                   

                                  We use a lot of Intel DC P3700 SSDs of various capacities - 800GB and 1.6TB being the two most common - and have done hundreds of tests on them.

                                   

                                  We also understand that with Intel DC P3700 NVMe SSDs there is no need to run trim at all; the firmware's garbage collection takes care of such needs transparently and behind the scenes. But still, IMHO it's a good idea that, when the sector size is changed, well-known Linux utilities still work as anticipated. We ran into this issue by serendipity, and got a "nice" surprise along the way.

                                   

                                  Case 1. mkfs.xfs without -K

                                   

                                  We pick /dev/nvme2n1, unmount it, delete all data on it with isdct delete, run mkfs.xfs without the -K flag, and then run fstrim.

                                  [root@fs11 ~]# man mkfs.xfs

                                  [root@fs11 ~]# lsblk

                                  NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT

                                  sda       8:0    0 223.6G  0 disk

                                  ├─sda1    8:1    0   512M  0 part /boot

                                  ├─sda2    8:2    0  31.5G  0 part [SWAP]

                                  └─sda3    8:3    0 191.6G  0 part /

                                  sdb       8:16   0 223.6G  0 disk /export/beegfs/meta

                                  sdc       8:32   0  59.6G  0 disk

                                  sdd       8:48   0  59.6G  0 disk

                                  sr0      11:0    1  1024M  0 rom 

                                  nvme0n1 259:2    0   1.5T  0 disk /export/beegfs/data0

                                  nvme1n1 259:6    0   1.5T  0 disk /export/beegfs/data1

                                  nvme2n1 259:7    0   1.5T  0 disk /export/beegfs/data2

                                  nvme3n1 259:5    0   1.5T  0 disk /export/beegfs/data3

                                  nvme4n1 259:0    0   1.5T  0 disk /export/beegfs/data4

                                  nvme5n1 259:3    0   1.5T  0 disk

                                  nvme6n1 259:1    0   1.5T  0 disk

                                  nvme7n1 259:4    0   1.5T  0 disk

                                  [root@fs11 ~]# umount /export/beegfs/data2

                                  [root@fs11 ~]# lsblk

                                  NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT

                                  sda       8:0    0 223.6G  0 disk

                                  ├─sda1    8:1    0   512M  0 part /boot

                                  ├─sda2    8:2    0  31.5G  0 part [SWAP]

                                  └─sda3    8:3    0 191.6G  0 part /

                                  sdb       8:16   0 223.6G  0 disk /export/beegfs/meta

                                  sdc       8:32   0  59.6G  0 disk

                                  sdd       8:48   0  59.6G  0 disk

                                  sr0      11:0    1  1024M  0 rom 

                                  nvme0n1 259:2    0   1.5T  0 disk /export/beegfs/data0

                                  nvme1n1 259:6    0   1.5T  0 disk /export/beegfs/data1

                                  nvme2n1 259:7    0   1.5T  0 disk

                                  nvme3n1 259:5    0   1.5T  0 disk /export/beegfs/data3

                                  nvme4n1 259:0    0   1.5T  0 disk /export/beegfs/data4

                                  nvme5n1 259:3    0   1.5T  0 disk

                                  nvme6n1 259:1    0   1.5T  0 disk

                                  nvme7n1 259:4    0   1.5T  0 disk

                                  [root@fs11 ~]# isdct delete -f -intelssd 2

                                  Deleting...

                                   

                                   

                                  - Intel SSD DC P3700 Series CVFT515400401P6JGN -

                                   

                                   

                                  Status : Delete successful.

                                   

                                   

                                   

                                   

                                  [root@fs11 ~]# mkfs.xfs -f -d agcount=24 -l size=128m,version=2 -i size=512 -s size=4096 /dev/nvme2n1

                                  meta-data=/dev/nvme2n1           isize=512    agcount=24, agsize=16279311 blks

                                           =                       sectsz=4096  attr=2, projid32bit=1

                                           =                       crc=0        finobt=0

                                  data     =                       bsize=4096   blocks=390703446, imaxpct=5

                                           =                       sunit=0      swidth=0 blks

                                  naming   =version 2              bsize=4096   ascii-ci=0 ftype=0

                                  log      =internal log           bsize=4096   blocks=32768, version=2

                                           =                       sectsz=4096  sunit=1 blks, lazy-count=1

                                  realtime =none                   extsz=4096   blocks=0, rtextents=0

                                   

                                   

                                  [root@fs11 ~]# mount -a

                                  [root@fs11 ~]# lsblk

                                  NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT

                                  sda       8:0    0 223.6G  0 disk

                                  ├─sda1    8:1    0   512M  0 part /boot

                                  ├─sda2    8:2    0  31.5G  0 part [SWAP]

                                  └─sda3    8:3    0 191.6G  0 part /

                                  sdb       8:16   0 223.6G  0 disk /export/beegfs/meta

                                  sdc       8:32   0  59.6G  0 disk

                                  sdd       8:48   0  59.6G  0 disk

                                  sr0      11:0    1  1024M  0 rom 

                                  nvme0n1 259:2    0   1.5T  0 disk /export/beegfs/data0

                                  nvme1n1 259:6    0   1.5T  0 disk /export/beegfs/data1

                                  nvme2n1 259:7    0   1.5T  0 disk /export/beegfs/data2

                                  nvme3n1 259:5    0   1.5T  0 disk /export/beegfs/data3

                                  nvme4n1 259:0    0   1.5T  0 disk /export/beegfs/data4

                                  nvme5n1 259:3    0   1.5T  0 disk

                                  nvme6n1 259:1    0   1.5T  0 disk

                                  nvme7n1 259:4    0   1.5T  0 disk

                                  [root@fs11 ~]# fstrim -v /export/beegfs/data2

                                  fstrim: /export/beegfs/data2: FITRIM ioctl failed: Input/output error

                                  [root@fs11 ~]# dmesg |tail -11

                                  [1085309.104445]  nvme2n1: unknown partition table

                                  [1085888.004212] XFS (nvme2n1): Mounting V4 Filesystem

                                  [1085888.008012] XFS (nvme2n1): Ending clean mount

                                  [1085937.102088] blk_update_request: I/O error, dev nvme2n1, sector 3112833712

                                  [1085937.102306] blk_update_request: I/O error, dev nvme2n1, sector 3087667912

                                  [1085937.102471] blk_update_request: I/O error, dev nvme2n1, sector 3070890712

                                  [1085937.102630] blk_update_request: I/O error, dev nvme2n1, sector 3062502112

                                  [1085937.102788] blk_update_request: I/O error, dev nvme2n1, sector 3104445112

                                  [1085937.102946] blk_update_request: I/O error, dev nvme2n1, sector 3121222312

                                  [1085937.103085] blk_update_request: I/O error, dev nvme2n1, sector 3096056512

                                  [1085937.103085] blk_update_request: I/O error, dev nvme2n1, sector 3079279312

                                   

                                  So, dropping the -K flag didn't make a difference.

                                   

                                  Case 2. 4096 sector size vs 512 sector size (default)

                                   

                                  All odd numbered SSDs have 512 sector size; all even numbered SSDs have 4096 sector size.

                                   

                                  [root@fs11 ~]# lsblk

                                  NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT

                                  sda       8:0    0 223.6G  0 disk

                                  ├─sda1    8:1    0   512M  0 part /boot

                                  ├─sda2    8:2    0  31.5G  0 part [SWAP]

                                  └─sda3    8:3    0 191.6G  0 part /

                                  sdb       8:16   0 223.6G  0 disk /export/beegfs/meta

                                  sdc       8:32   0  59.6G  0 disk

                                  sdd       8:48   0  59.6G  0 disk

                                  sr0      11:0    1  1024M  0 rom 

                                  nvme0n1 259:2    0   1.5T  0 disk /export/beegfs/data0

                                  nvme1n1 259:6    0   1.5T  0 disk /export/beegfs/data1

                                  nvme2n1 259:7    0   1.5T  0 disk /export/beegfs/data2

                                  nvme3n1 259:5    0   1.5T  0 disk /export/beegfs/data3

                                  nvme4n1 259:0    0   1.5T  0 disk /export/beegfs/data4

                                  nvme5n1 259:3    0   1.5T  0 disk /export/beegfs/data5

                                  nvme6n1 259:1    0   1.5T  0 disk /export/beegfs/data6

                                  nvme7n1 259:4    0   1.5T  0 disk /export/beegfs/data7

                                  [root@fs11 ~]# for i in {0..7}

                                  > do

                                  > echo "Apply fstrim -v to /export/beegfs/data${i} ..."

                                  > echo "which is layered on top of /dev/nvme${i}n1 ..."

                                  > fstrim -v /export/beegfs/data${i}

                                  > done

                                  Apply fstrim -v to /export/beegfs/data0 ...

                                  which is layered on top of /dev/nvme0n1 ...

                                  fstrim: /export/beegfs/data0: FITRIM ioctl failed: Input/output error

                                  Apply fstrim -v to /export/beegfs/data1 ...

                                  which is layered on top of /dev/nvme1n1 ...

                                  /export/beegfs/data1: 1.5 TiB (1600185982976 bytes) trimmed

                                  Apply fstrim -v to /export/beegfs/data2 ...

                                  which is layered on top of /dev/nvme2n1 ...

                                  fstrim: /export/beegfs/data2: FITRIM ioctl failed: Input/output error

                                  Apply fstrim -v to /export/beegfs/data3 ...

                                  which is layered on top of /dev/nvme3n1 ...

                                  /export/beegfs/data3: 1.5 TiB (1600185982976 bytes) trimmed

                                  Apply fstrim -v to /export/beegfs/data4 ...

                                  which is layered on top of /dev/nvme4n1 ...

                                  fstrim: /export/beegfs/data4: FITRIM ioctl failed: Input/output error

                                  Apply fstrim -v to /export/beegfs/data5 ...

                                  which is layered on top of /dev/nvme5n1 ...

                                  /export/beegfs/data5: 1.5 TiB (1600185982976 bytes) trimmed

                                  Apply fstrim -v to /export/beegfs/data6 ...

                                  which is layered on top of /dev/nvme6n1 ...

                                  fstrim: /export/beegfs/data6: FITRIM ioctl failed: Input/output error

                                  Apply fstrim -v to /export/beegfs/data7 ...

                                  which is layered on top of /dev/nvme7n1 ...

                                  /export/beegfs/data7: 1.5 TiB (1600185982976 bytes) trimmed

                                   

                                  So, fstrim -v failed when applied to all SSDs with a 4096 sector size, but worked on all drives with a 512 sector size.
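
                                  For completeness, the sector size of each drive can be confirmed with a loop like the following (assuming the isdct index i maps to /dev/nvme${i}n1, as it does on this host):

                                  for i in {0..7}; do
                                      echo -n "intelssd ${i}: "
                                      isdct show -a -intelssd ${i} | grep SectorSize
                                  done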

                                  • 14. Re: Critical performance drop on newly created large file
                                    Intel Corporation
                                    This message was posted on behalf of Intel Corporation

                                    Hello everyone,

                                    We would like to address the performance drop question first so we don't mix up the two issues.
                                    Can you please confirm that this was the process you followed:

                                    -Create large file
                                    -Flush page cache
                                    -Run FIO

                                    Now, at which step are you flushing the page cache to avoid the performance drop?

                                    Please let us know.
                                    NC
