2 Replies Latest reply on Feb 28, 2018 6:34 AM by Intel Corporation

    P4600 and Linux kernel 4.13 timeout

    berthierp

      Hi

       

I have installed two P4600 NVMe devices in a server and installed Proxmox 5.1-3.  The running kernel is 4.13.13-6-pve.  There is no RAID controller involved.

       

      # nvme list

      Node             SN                   Model                                    Namespace Usage                      Format           FW Rev

      ---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------

      /dev/nvme0n1     BTLE736103CH4P0KGN   INTEL SSDPEDKE040T7                      1           4.00  TB /   4.00  TB    512   B +  0 B   QDV10170

      /dev/nvme1n1     BTLE736103AG4P0KGN   INTEL SSDPEDKE040T7                      1           4.00  TB /   4.00  TB    512   B +  0 B   QDV10170

       

      The output of "isdct show -a -intelssd" is attached in the file "intelssdp4600-2.txt".

       

Using LVM, I can reliably reproduce a "hang" of 1-2 minutes that does not lead to any fatal error:

       

      vgcreate SSD /dev/nvme0n1 /dev/nvme1n1

      lvcreate -l 100%FREE -n SSDVMSTORE01 --stripes 2 --stripesize 128 --type striped SSD

      lvremove -d -v SSD/SSDVMSTORE01

Do you really want to remove and DISCARD active logical volume SSD/SSDVMSTORE01? [y/n]: y

       

Here the command hangs for 1-2 minutes.  In the logs I see:

       

      Feb 21 14:14:17 px kernel: [ 3654.745355] nvme nvme0: I/O 200 QID 14 timeout, aborting

      Feb 21 14:14:17 px kernel: [ 3654.745772] nvme nvme0: I/O 201 QID 14 timeout, aborting

      Feb 21 14:14:17 px kernel: [ 3654.746110] nvme nvme0: I/O 202 QID 14 timeout, aborting

      Feb 21 14:14:17 px kernel: [ 3654.746436] nvme nvme0: I/O 203 QID 14 timeout, aborting

      Feb 21 14:14:32 px kernel: [ 3669.013614] nvme nvme0: Abort status: 0x0

      Feb 21 14:14:32 px kernel: [ 3669.014012] nvme nvme0: Abort status: 0x0

      Feb 21 14:14:32 px kernel: [ 3669.014325] nvme nvme0: Abort status: 0x0

      Feb 21 14:14:32 px kernel: [ 3669.014629] nvme nvme0: Abort status: 0x0

      Feb 21 14:15:10 px kernel: [ 3707.737495] nvme nvme1: I/O 297 QID 14 timeout, aborting

      Feb 21 14:15:10 px kernel: [ 3707.737902] nvme nvme1: I/O 298 QID 14 timeout, aborting

      Feb 21 14:15:10 px kernel: [ 3707.738231] nvme nvme1: I/O 299 QID 14 timeout, aborting

      Feb 21 14:15:10 px kernel: [ 3707.738547] nvme nvme1: I/O 300 QID 14 timeout, aborting

      Feb 21 14:15:25 px kernel: [ 3722.005726] nvme nvme1: Abort status: 0x0

      Feb 21 14:15:25 px kernel: [ 3722.006113] nvme nvme1: Abort status: 0x0

      Feb 21 14:15:25 px kernel: [ 3722.006434] nvme nvme1: Abort status: 0x0

      Feb 21 14:15:25 px kernel: [ 3722.006751] nvme nvme1: Abort status: 0x0
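For what it's worth, the gap between each "timeout, aborting" message and the corresponding "Abort status: 0x0" completion is about 14 seconds on both controllers. That can be checked with a quick POSIX awk pass over the log lines (self-contained sketch; the log lines are reproduced inline with the syslog prefix stripped, keeping only the kernel's monotonic timestamps):

```shell
# For each controller, measure how long the first abort took to complete
# after the first timeout fired, using the bracketed monotonic timestamps.
delays="$(awk '
  /nvme nvme/ {
    ts = $2; sub(/\]/, "", ts)      # "3654.745355]" -> "3654.745355"
    c  = $4; sub(/:$/, "", c)       # "nvme0:"       -> "nvme0"
    if ($0 ~ /timeout, aborting/ && !(c in t)) t[c] = ts
    if ($0 ~ /Abort status/ && (c in t) && !(c in done)) {
      done[c] = 1
      printf "%s: abort completed %.1f s after timeout\n", c, ts - t[c]
    }
  }
' <<'EOF'
[ 3654.745355] nvme nvme0: I/O 200 QID 14 timeout, aborting
[ 3669.013614] nvme nvme0: Abort status: 0x0
[ 3707.737495] nvme nvme1: I/O 297 QID 14 timeout, aborting
[ 3722.005726] nvme nvme1: Abort status: 0x0
EOF
)"
echo "$delays"
```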

       

      After this, the command completes without error.
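The reproduction above can also be wrapped in a small script that times the lvremove step. It is a sketch only: the device paths are taken from this post, it assumes issue_discards=1 in lvm.conf (which is what makes lvremove issue the DISCARD), and it is deliberately gated behind an environment variable because it destroys any data on the two devices:

```shell
#!/bin/sh
# Scripted reproduction of the hang (hypothetical wrapper; DESTRUCTIVE).
# Set RUN_REPRO=1 explicitly to run it against /dev/nvme0n1 and /dev/nvme1n1.
set -eu
status=skipped
if [ "${RUN_REPRO:-0}" = 1 ] && [ -b /dev/nvme0n1 ] && [ -b /dev/nvme1n1 ]; then
  vgcreate SSD /dev/nvme0n1 /dev/nvme1n1
  lvcreate -l 100%FREE -n SSDVMSTORE01 --stripes 2 --stripesize 128 \
    --type striped SSD
  # -f answers the "remove and DISCARD" prompt; time makes the hang visible.
  time lvremove -f SSD/SSDVMSTORE01
  vgremove -f SSD
  status=done
fi
echo "$status"
```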

       

This does not happen with Debian 9.3 (kernel 4.9.x).  If I instead partition each device into a ~2 GB primary partition plus the remainder, and repeat the same operation on /dev/nvmeXn1p1 or p2, the timeout occurs on the second partition but not on the first:

      parted -a optimal /dev/nvme0n1 mklabel gpt

      parted -a optimal /dev/nvme0n1 mkpart primary 4 2047

      parted -a optimal /dev/nvme0n1 mkpart primary 2048 100%

       

      parted -a optimal /dev/nvme1n1 mklabel gpt

      parted -a optimal /dev/nvme1n1 mkpart primary 4 2047

      parted -a optimal /dev/nvme1n1 mkpart primary 2048 100%

       

       

      Any clues?

      Best,

      Pierre