7 Replies Latest reply on May 22, 2012 1:24 AM by emilec

    SLES 10 SP3/SP4 Modular Server Dual Storage and MPIO (Multipath) Setup

    emilec

      Overview

      I've recently had a long discussion with Intel Technical Support about MPIO drivers for SUSE Linux Enterprise Server 10 SP3/SP4. Drivers for SP1 and SP2 are available for download, but nothing for SP3 and SP4. Intel eventually told me to upgrade to SLES 11, which was not possible for my client, not to mention that SLES 10 has long-term support from SUSE, for which Intel needs to provide drivers.

       

      I then contacted SUSE and they replied to say that the MPIO drivers are now part of SP3 and SP4, but there is no documentation (that I can find) to support this.
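
      If you want to confirm this yourself on an SP3/SP4 system before going further (my own quick sanity check, not something SUSE pointed me at), the kernel module and the userland tools are on the stock install media:

      # modinfo dm-multipath | head -n 2
      # rpm -q multipath-tools device-mapper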

       

      Using the SLES 10 SP1/SP2 installation guide as a base, along with other sources from the web, I have come up with a working solution.

       

      Update

      Be sure to read the additional comments on the path grouping policy in my post of Apr 19, 2012.


      Installation

      Start by following the detailed PDF for SLES 10 SP1/SP2, taking my notes below into account.

      The best way to do this is to start with a single controller, install SLES, configure MPIO and then add the second controller.

      • Check that fstab mounts the disks by /dev/disk/by-id, as per the PDF
      • Check that you have the SLES MPIO packages installed. If they are not there, install them from YaST
      # rpm -qa | grep device
      device-mapper-1.02.13-6.14
      # rpm -qa | grep multi
      multipath-tools-0.4.7-34.38
      
      • Do NOT install the Intel packages (dm-intel, mpath_prio_intel)
      • Set services to start
      # chkconfig boot.multipath on
      # chkconfig multipathd on
      
      
      • Edit kernel settings (note there is no dm-intel)
      # vi /etc/sysconfig/kernel
      INITRD_MODULES="mptsas processor thermal fan reiserfs edd dm-multipath"
      • Run mkinitrd
      # mkinitrd
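
      To double-check that dm-multipath actually made it into the rebuilt initrd (my own hedged check; it assumes the initrd is the usual gzipped cpio archive that mkinitrd produces on SLES 10), list its contents and look for the module:
      # zcat /boot/initrd | cpio -it | grep multipath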
      
      • Create a multipath.conf file
      # vi /etc/multipath.conf
      
      devnode_blacklist {
              devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
              devnode "^hd[a-z]"
              devnode "^cciss!c[0-9]d[0-9]*"
      }
      
      
      
      devices {
      
      device {
                      vendor                  "Intel"
                      product                 "Multi-Flex"
                      path_grouping_policy    group_by_prio
                      getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
                      prio                    "alua /dev/%n"
                      path_checker            tur
                      path_selector           "round-robin 0"
                      hardware_handler                "1 alua"
                      failback                        immediate
                      rr_weight                       uniform
                      no_path_retry           queue
                      rr_min_io                       100
                      features                        "1 queue_if_no_path"
                      }
      }
      

      You will notice some key differences in the device setup compared to the sample multipath.conf.SLES.txt that comes with the Intel drivers. Because there is no mpath_prio_intel, we use alua instead. The prio line is key, because it is what determines the priority of each path when a controller fails and allows the fail-over to happen.
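
      As a side note, if you want to confirm that ALUA priorities can actually be read from the paths (my own sanity check, assuming the /sbin/mpath_prio_alua helper from the multipath-tools 0.4.7 package is installed), you can query the path devices directly once both controllers are in. The path on the owning controller should report a higher value than the standby path:
      # /sbin/mpath_prio_alua /dev/sda
      # /sbin/mpath_prio_alua /dev/sdb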

      • Reboot
      # shutdown -r now
      

       

      • After startup you can now check multipath output on the single controller
      # multipath -ll
      22222000155e8d800 dm-0 Intel,Multi-Flex
      [size=100G][features=1 queue_if_no_path][hwhandler=1 alua]
      \_ round-robin 0 [prio=1][active]
       \_ 0:0:2:0 sda 8:0   [active][ready]
      

       

      • Shutdown
      # shutdown -h now
      

       

      • Insert the second controller and monitor the Modular Server web interface to make sure it's installed correctly, then start up SLES again
      • You should now have two paths
      # multipath -ll
      22222000155e8d800 dm-0 Intel,Multi-Flex
      [size=100G][features=1 queue_if_no_path][hwhandler=1 alua]
      \_ round-robin 0 [prio=2][active]
       \_ 0:0:2:0 sda 8:0   [active][ready]
       \_ 0:0:3:0 sdb 8:16  [failed][ready]
      

      If you run the command repeatedly you will see the paths alternate between failed and active. This is normal and will also show in /var/log/messages (but see the Update above and my later comments on the path grouping policy):

      Mar 27 09:42:19 sles10 kernel: sd 0:0:2:0: alua: port group 00 state S supports touSnA
      Mar 27 09:42:19 sles10 multipathd: sda: tur checker reports path is up
      Mar 27 09:42:19 sles10 multipathd: 8:0: reinstated
      Mar 27 09:42:19 sles10 multipathd: 22222000155e8d800: remaining active paths: 2
      Mar 27 09:42:19 sles10 kernel: sd 0:0:2:0: alua: port group 00 switched to state A
      Mar 27 09:42:21 sles10 kernel: sd 0:0:3:0: Device not ready: <6>: Current: sense key: Not Ready
      Mar 27 09:42:21 sles10 kernel:     Additional sense: Logical unit not accessible, target port in standby state
      Mar 27 09:42:21 sles10 kernel: end_request: I/O error, dev sdb, sector 313529
      Mar 27 09:42:21 sles10 kernel: device-mapper: multipath: Failing path 8:16.
      Mar 27 09:42:21 sles10 multipathd: 8:16: mark as failed
      Mar 27 09:42:21 sles10 multipathd: 22222000155e8d800: remaining active paths: 1
      Mar 27 09:42:21 sles10 kernel: sd 0:0:3:0: Device not ready: <6>: Current: sense key: Not Ready
      Mar 27 09:42:21 sles10 kernel:     Additional sense: Logical unit not accessible, target port in standby state
      Mar 27 09:42:21 sles10 kernel: end_request: I/O error, dev sdb, sector 34262770
      Mar 27 09:42:26 sles10 kernel: sd 0:0:3:0: alua: port group 01 state S supports touSnA
      Mar 27 09:42:26 sles10 multipathd: sdb: tur checker reports path is up
      Mar 27 09:42:26 sles10 multipathd: 8:16: reinstated
      Mar 27 09:42:26 sles10 multipathd: 22222000155e8d800: remaining active paths: 2
      Mar 27 09:42:26 sles10 kernel: sd 0:0:3:0: alua: port group 01 switched to state A
      

       

      Test fail-over

      Now that you have multipath configured, you need to test that your fail-over actually works.
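
      One simple way to do this (just my rough sketch; the file name and size are placeholders, pick a path that sits on the multipathed disk) is to keep a sustained write running in one terminal while you pull the controller in the steps below, and then confirm it completes without I/O errors:
      # dd if=/dev/zero of=/home/mpio-failover-test.img bs=1M count=4096
      # sync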

      • First check the controller affinity for the disk device. In the Modular Server web interface go to Server > Select Compute Module > Virtual Drives
      Drive #(LUN)  Name       Size      RAID Level  Status  Visible  Affinity/Active
      0             SLES10SP3  100.00GB  RAID0       OK      Yes      SCM 1/SCM 2

      As you can see, my system is active on SCM 2.

      • Start a tail of the messages file on the console
      # tail -f /var/log/messages
      
      • Walk to the back of your modular server and pull out SCM2. You should see the following in the logs.
      Mar 27 08:37:42 sles10 kernel:  end_device-0:1:1: mptsas: ioc0: removing ssp device: fw_channel 0, fw_id 1, phy 11,sas_addr 0x500015500002050a
      Mar 27 08:37:42 sles10 kernel:  phy-0:1:40: mptsas: ioc0: delete phy 11, phy-obj (0xffff810c433c1c00)
      Mar 27 08:37:42 sles10 kernel:  port-0:1:1: mptsas: ioc0: delete port 1, sas_addr (0x500015500002050a)
      Mar 27 08:37:42 sles10 kernel: sd 0:0:1:0: alua: Detached
      Mar 27 08:37:42 sles10 kernel: Synchronizing SCSI cache for disk sdb:
      Mar 27 08:37:42 sles10 kernel:  phy-0:3: mptsas: ioc0: delete phy 3, phy-obj (0xffff810c43ab3800)
      Mar 27 08:37:42 sles10 kernel:  port-0:1: mptsas: ioc0: delete port 1, sas_addr (0x5001e671c78f53ff)
      Mar 27 08:37:42 sles10 kernel: mptsas: ioc0: delete expander: num_phys 25, sas_addr (0x5001e671c78f53ff)
      Mar 27 08:37:42 sles10 multipathd: sdb: remove path (uevent)
      Mar 27 08:37:42 sles10 multipathd: 22222000155e8d800: load table [0 209715200 multipath 1 queue_if_no_path 1 alua 1 1 round-robin 0 1 1 8:0 100]
      Mar 27 08:37:42 sles10 multipathd: sdb: path removed from map 22222000155e8d800
      Mar 27 08:37:42 sles10 multipathd: dm-0: add map (uevent)
      Mar 27 08:37:42 sles10 multipathd: dm-0: devmap already registered
      Mar 27 08:37:42 sles10 multipathd: dm-1: add map (uevent)
      Mar 27 08:37:42 sles10 multipathd: dm-3: add map (uevent)
      Mar 27 08:37:42 sles10 multipathd: dm-2: add map (uevent)
      Mar 27 08:37:42 sles10 multipathd: dm-6: add map (uevent)
      Mar 27 08:37:42 sles10 multipathd: dm-5: add map (uevent)
      Mar 27 08:37:42 sles10 multipathd: dm-7: add map (uevent)
      Mar 27 08:37:43 sles10 kernel: sd 0:0:2:0: alua: port group 00 state A supports touSnA
      Mar 27 08:37:43 sles10 multipathd: dm-4: add map (uevent)
      

       

      Again check the event logs in the Modular Server web interface and the affinity.

      Drive #(LUN)  Name       Size      RAID Level  Status  Visible  Affinity/Active
      0             SLES10SP3  100.00GB  RAID0       OK      Yes      SCM 1/SCM 1

      As you can see, my system is now active on SCM 1.

       

      • Test a few things in the fail-over state to make sure your system is still stable
      • Push SCM2 back into the Modular Server and check /var/log/messages
      Mar 27 08:45:57 sles10 kernel: mptsas: ioc0: add expander: num_phys 25, sas_addr (0x5001e671c78f53ff)
      Mar 27 08:45:58 sles10 kernel: mptsas: ioc0: attaching ssp device: fw_channel 0, fw_id 1, phy 11, sas_addr 0x500015500002050a
      Mar 27 08:45:58 sles10 kernel:   Vendor: Intel     Model: Multi-Flex        Rev: 0308
      Mar 27 08:45:58 sles10 kernel:   Type:   Direct-Access                      ANSI SCSI revision: 05
      Mar 27 08:45:58 sles10 kernel:  0:0:3:0: mptscsih: ioc0: qdepth=64, tagged=1, simple=1, ordered=0, scsi_level=6, cmd_que=1
      Mar 27 08:45:58 sles10 kernel:  0:0:3:0: alua: supports explicit TPGS
      Mar 27 08:45:58 sles10 kernel:  0:0:3:0: alua: port group 01 rel port 06
      Mar 27 08:45:58 sles10 kernel:  0:0:3:0: alua: port group 01 state S supports touSnA
      Mar 27 08:45:58 sles10 kernel: SCSI device sdb: 209715200 512-byte hdwr sectors (107374 MB)
      Mar 27 08:45:58 sles10 kernel: sdb: Write Protect is off
      Mar 27 08:45:58 sles10 kernel: sdb: Mode Sense: 97 00 10 08
      Mar 27 08:45:58 sles10 kernel: SCSI device sdb: drive cache: write back w/ FUA
      Mar 27 08:45:58 sles10 kernel: SCSI device sdb: 209715200 512-byte hdwr sectors (107374 MB)
      Mar 27 08:45:58 sles10 kernel: sdb: Write Protect is off
      Mar 27 08:45:58 sles10 kernel: sdb: Mode Sense: 97 00 10 08
      Mar 27 08:45:58 sles10 kernel: SCSI device sdb: drive cache: write back w/ FUA
      Mar 27 08:45:58 sles10 kernel:  sdb:<4>printk: 128 messages suppressed.
      Mar 27 08:45:58 sles10 kernel: Buffer I/O error on device sdb, logical block 0
      Mar 27 08:45:58 sles10 kernel: Buffer I/O error on device sdb, logical block 0
      Mar 27 08:45:58 sles10 kernel: Buffer I/O error on device sdb, logical block 0
      Mar 27 08:45:58 sles10 kernel:  unable to read partition table
      Mar 27 08:45:58 sles10 kernel: sd 0:0:3:0: Attached scsi disk sdb
      Mar 27 08:45:58 sles10 kernel: sd 0:0:3:0: Attached scsi generic sg1 type 0
      Mar 27 08:45:58 sles10 kernel: Buffer I/O error on device sdb, logical block 26214384
      Mar 27 08:45:58 sles10 kernel: Buffer I/O error on device sdb, logical block 26214384
      Mar 27 08:45:58 sles10 kernel: Buffer I/O error on device sdb, logical block 26214398
      Mar 27 08:45:58 sles10 kernel: Buffer I/O error on device sdb, logical block 26214398
      Mar 27 08:45:58 sles10 kernel: Buffer I/O error on device sdb, logical block 0
      Mar 27 08:45:58 sles10 kernel: Buffer I/O error on device sdb, logical block 0
      Mar 27 08:45:58 sles10 kernel: Buffer I/O error on device sdb, logical block 0
      Mar 27 08:45:58 sles10 multipathd: sdb: add path (uevent)
      Mar 27 08:45:58 sles10 multipathd: 22222000155e8d800: load table [0 209715200 multipath 1 queue_if_no_path 1 alua 1 1 round-robin 0 2 1 8:0 100 8:16 100]
      Mar 27 08:45:58 sles10 multipathd: sdb path added to devmap 22222000155e8d800
      Mar 27 08:45:58 sles10 multipathd: dm-0: add map (uevent)
      Mar 27 08:45:58 sles10 multipathd: dm-0: devmap already registered
      Mar 27 08:45:58 sles10 kernel: sd 0:0:2:0: alua: port group 00 state A supports touSnA
      Mar 27 08:45:58 sles10 kernel: sd 0:0:3:0: alua: port group 01 state S supports touSnA
      Mar 27 08:45:58 sles10 kernel: sd 0:0:3:0: alua: port group 01 switched to state A
      Mar 27 08:45:58 sles10 multipathd: dm-1: add map (uevent)
      Mar 27 08:45:58 sles10 multipathd: dm-2: add map (uevent)
      Mar 27 08:45:58 sles10 multipathd: dm-6: add map (uevent)
      Mar 27 08:45:58 sles10 multipathd: dm-7: add map (uevent)
      Mar 27 08:45:58 sles10 multipathd: dm-5: add map (uevent)
      Mar 27 08:45:58 sles10 multipathd: dm-3: add map (uevent)
      Mar 27 08:46:00 sles10 multipathd: dm-4: add map (uevent)

       

      Resources

      Useful sites I found while troubleshooting:

      http://www.linuxquestions.org/questions/blog/bittner-195120/multipath-on-debian-lenny-and-intel-modular-server-mfsys25-3072/

      http://www.suse.com/documentation/sles10/pdfdoc/stor_admin/stor_admin.pdf

      http://sources.redhat.com/lvm2/wiki/MultipathUsageGuide

      http://blogs.citrix.com/2011/02/04/xenserver-multipathing-with-intel-ims/

      http://doc.opensuse.org/products/draft/SLES/SLES-storage_sd_draft/multipathing.html

      http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html-single/DM_Multipath/index.html

       

       

      Intel SP1/SP2 Docs and Drivers:

      http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&DwnldID=17588&ProdId=3034&lang=eng&OSVersion=SUSE%20Linux%20Enterprise%20Server%2010*&DownloadType=Drivers

      http://www.intel.com/support/motherboards/server/sb/CS-029441.htm

        • 1. Re: SLES 10 SP3/SP4 Modular Server Dual Storage and MPIO (Multipath) Setup
          Dan_O

          Very nice solution - thanks for posting this.

          • 2. Re: SLES 10 SP3/SP4 Modular Server Dual Storage and MPIO (Multipath) Setup
            emilec

            Pleasure.

             

            I'm also testing a Promise VTrak with redundant controllers attached to this setup. I'm getting mixed results when simulating fail-over under heavy system load. I'll try to report back if there is anything useful. At the moment it seems that if there is too much load while writing to the Promise and a controller fails, the OS becomes unresponsive. I'm trying different path grouping policies to see if that makes any difference.

            • 3. Re: SLES 10 SP3/SP4 Modular Server Dual Storage and MPIO (Multipath) Setup
              emilec

              Some feedback on my fail-over testing with a Promise VTrak E310s (dual controllers). I had problems with the system hanging when I simulated a controller failure on the modular server while doing heavy writes to the Promise. I went back and did the same test on the local disk in the modular server and it was fine. So the problem was only with the Promise.

               

              I had a feeling this had something to do with the Active/Active vs Active/Passive setup, so I did some more reading on multipath and started looking at all the path grouping policy settings. Your options are as follows:

              multibus: One path group is formed with all paths to a LUN. Suitable for devices in Active/Active mode.
              failover: Each path group has only one path.
              group_by_serial: One path group per storage controller (serial). All paths that connect to the LUN through a controller are assigned to a path group. Suitable for devices in Active/Passive mode.
              group_by_prio: Paths with the same priority are assigned to a path group.
              group_by_node_name: Paths with the same target node name are assigned to a path group.

               

              The default Intel suggests is group_by_prio. I tried multibus, which also failed. I then tried group_by_serial and voila, problem solved! So my updated multipath.conf file (including the Promise VTrak) is as follows:

               

              devices {
                      device {
                              vendor                  "Promise"
                              product                 "VTrak"
                              path_grouping_policy    group_by_serial
                              getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
                              path_checker            tur
                              path_selector           "round-robin 0"
                              hardware_handler                "0"
                              failback                        immediate
                              rr_weight                       uniform
                              no_path_retry           20
                              rr_min_io                       100
                              features                        "1 queue_if_no_path"
                      }
                      device {
                              vendor                  "Intel"
                              product                 "Multi-Flex"
                              path_grouping_policy    group_by_serial
                              getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
                              prio                    "alua /dev/%n"
                              path_checker            tur
                              path_selector           "round-robin 0"
                              hardware_handler                "1 alua"
                              failback                        immediate
                              rr_weight                       uniform
                              no_path_retry           queue
                              rr_min_io                       100
                              features                        "1 queue_if_no_path"
                              }
              }
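
              A note on applying changes: I rebooted after every policy change to be safe. If you want to try reloading in place on a non-boot device (a hedged suggestion; it assumes nothing is holding the maps in a state that prevents a reload), restarting the daemon and re-running multipath should be enough, but for the boot disk a reboot is the safer option:
              # /etc/init.d/multipathd restart
              # multipath -v2
              # multipath -ll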
              

               

              There was also a note saying that using multibus on an Active/Passive setup would reduce I/O performance. My understanding is that both the Modular Server and the VTrak support Active/Active, but I tested it anyway and there was no real performance difference.

               

              Here are some bonnie tests I ran on the VTrak for each grouping policy.

              Disks: 12 x Seagate ST2000NM0011 in one pool with 3 x 6TB RAID 6 volumes.

              group_by_prio

              # bonnie -d /home1/ -s 40000 -m sles10-prio
              Bonnie 1.4: File '/home1//Bonnie.30739', size: 41943040000, volumes: 1
              Writing with putc()...         done:  67740 kB/s  87.2 %CPU
              Rewriting...                   done: 1949875 kB/s  84.9 %CPU
              Writing intelligently...       done: 125190 kB/s  10.4 %CPU
              Reading with getc()...         done:  98664 kB/s  95.6 %CPU
              Reading intelligently...       done: 4000456 kB/s 100.0 %CPU
              Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
                            ---Sequential Output (nosync)--- ---Sequential Input-- --Rnd Seek-
                            -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --04k (03)-
              Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU   /sec %CPU
              sles10 1*40000 67740 87.2 125190 10.4 1949875 84.9 98664 95.6 4000456  100 262467.2  184

               

              multibus

              # bonnie -d /home1/ -s 40000 -m sles10-multi
              Bonnie 1.4: File '/home1//Bonnie.6718', size: 41943040000, volumes: 1
              Writing with putc()...         done:  68732 kB/s  87.6 %CPU
              Rewriting...                   done: 2262718 kB/s  98.1 %CPU
              Writing intelligently...       done: 130749 kB/s   8.5 %CPU
              Reading with getc()...         done: 100383 kB/s  96.7 %CPU
              Reading intelligently...       done: 5008622 kB/s 100.0 %CPU
              Seeker 2...Seeker 1...Seeker 3...start 'em...done...done...done...
                            ---Sequential Output (nosync)--- ---Sequential Input-- --Rnd Seek-
                            -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --04k (03)-
              Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU   /sec %CPU
              sles10 1*40000 68732 87.6 130749  8.5 2262718 98.1 100383 96.7 5008622  100 320667.0  224
              

               

              group_by_serial

              # bonnie -d /home1/ -s 40000 -m sles10-serial
              Bonnie 1.4: File '/home1//Bonnie.8445', size: 41943040000, volumes: 1
              Writing with putc()...         done:  61271 kB/s  89.4 %CPU
              Rewriting...                   done: 1910663 kB/s  94.4 %CPU
              Writing intelligently...       done: 123190 kB/s   9.9 %CPU
              Reading with getc()...         done: 101686 kB/s  97.7 %CPU
              Reading intelligently...       done: 4074685 kB/s 100.0 %CPU
              Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
                            ---Sequential Output (nosync)--- ---Sequential Input-- --Rnd Seek-
                            -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --04k (03)-
              Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU   /sec %CPU
              sles10 1*40000 61271 89.4 123190  9.9 1910663 94.4 101686 97.7 4074685  100 278299.6  223
              • 4. Re: SLES 10 SP3/SP4 Modular Server Dual Storage and MPIO (Multipath) Setup
                emilec

                Update

                After spending a lot of time trying to make this work, and then doing a whole lot of performance testing on the VTrak, I made the system live at a client site. They quickly reported performance problems, and I discovered a huge degradation in I/O performance on the "local" disk. I then removed the secondary controller and the I/O performance returned to normal.

                 

                I then reproduced this in our lab and discovered that using group_by_prio and group_by_serial both cause at least a 50% performance loss in disk I/O. This is quite a shock, as Intel recommends group_by_prio! I then did more reading and, after testing many settings, settled on failover as my preferred path grouping policy. It does not suffer from the same performance loss, and the system remains stable under heavy load and simulated controller failure.

                 

                Configuration

                multipath.conf

                devnode_blacklist {
                        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
                        devnode "^hd[a-z]"
                        devnode "^cciss!c[0-9]d[0-9]*"
                }
                
                devices {
                device {
                                vendor                  "Intel"
                                product                 "Multi-Flex"
                                path_grouping_policy    failover
                                getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
                                prio                    "alua /dev/%d"
                                path_checker            tur
                                path_selector           "round-robin 0"
                                hardware_handler                "1 alua"
                                failback                        immediate
                                #rr_weight                       uniform
                                rr_weight                       priorities
                                no_path_retry           queue
                                rr_min_io                       100
                                features                        "1 queue_if_no_path"
                                }
                }
                

                 

                So now multipath -ll output looks as follows

                # multipath -ll
                22206000155abb71e dm-0 Intel,Multi-Flex
                [size=136G][features=1 queue_if_no_path][hwhandler=1 alua]
                \_ round-robin 0 [prio=1][active]
                 \_ 0:0:2:0 sdb 8:16  [active][ready]
                \_ round-robin 0 [prio=1][enabled]
                 \_ 0:0:3:0 sda 8:0   [active][ready]
                

                 

                As you can see, the controllers are now in an active/enabled state. You will also see that /var/log/messages is now free of the annoying error messages I originally thought were normal. The paths swapping between active and failed was the cause of the I/O performance problems!
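
                A quick way to confirm the flapping has stopped (just a suggestion of mine; the strings are taken from the log extracts above) is to count how often multipathd has marked a path as failed since boot. With the failover policy these counters stay put, whereas with group_by_prio they kept climbing every few seconds:
                # grep -c 'mark as failed' /var/log/messages
                # grep -c 'remaining active paths: 1' /var/log/messages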

                 

                Test fail-over

                So now when we remove a controller we have the following in /var/log/messages

                Apr 17 15:43:30 sles10 kernel:  end_device-0:1:1: mptsas: ioc0: removing ssp device: fw_channel 0, fw_id 1, phy 11,sas_addr 0x500015500002050a
                Apr 17 15:43:30 sles10 kernel:  phy-0:1:40: mptsas: ioc0: delete phy 11, phy-obj (0xffff810266ddd800)
                Apr 17 15:43:30 sles10 kernel:  port-0:1:1: mptsas: ioc0: delete port 1, sas_addr (0x500015500002050a)
                Apr 17 15:43:30 sles10 kernel: sd 0:0:1:0: alua: Detached
                Apr 17 15:43:30 sles10 kernel: Synchronizing SCSI cache for disk sdb:
                Apr 17 15:43:30 sles10 kernel:  phy-0:3: mptsas: ioc0: delete phy 3, phy-obj (0xffff8102672c7c00)
                Apr 17 15:43:30 sles10 kernel:  port-0:1: mptsas: ioc0: delete port 1, sas_addr (0x5001517b9f5e03ff)
                Apr 17 15:43:30 sles10 kernel: mptsas: ioc0: delete expander: num_phys 25, sas_addr (0x5001517b9f5e03ff)
                Apr 17 15:43:30 sles10 multipathd: sdb: remove path (uevent)
                Apr 17 15:43:30 sles10 multipathd: 22206000155abb71e: load table [0 285149758 multipath 1 queue_if_no_path 1 alua 1 1 round-robin 0 1 1 8:0 100]
                Apr 17 15:43:30 sles10 multipathd: sdb: path removed from map 22206000155abb71e
                Apr 17 15:43:30 sles10 multipathd: dm-0: add map (uevent)
                Apr 17 15:43:30 sles10 multipathd: dm-0: devmap already registered
                Apr 17 15:43:30 sles10 multipathd: dm-1: add map (uevent)
                Apr 17 15:43:30 sles10 multipathd: dm-3: add map (uevent)
                Apr 17 15:43:30 sles10 multipathd: dm-2: add map (uevent)
                Apr 17 15:43:30 sles10 multipathd: dm-5: add map (uevent)
                Apr 17 15:43:30 sles10 multipathd: dm-6: add map (uevent)
                Apr 17 15:43:30 sles10 multipathd: dm-7: add map (uevent)
                Apr 17 15:43:31 sles10 kernel: sd 0:0:0:0: alua: port group 00 state S supports touSnA
                Apr 17 15:43:31 sles10 kernel: sd 0:0:0:0: alua: port group 00 switched to state A
                Apr 17 15:43:33 sles10 multipathd: dm-4: add map (uevent)
                

                 

                Let's check the multipath status:

                # multipath -ll
                22206000155abb71e dm-0 Intel,Multi-Flex
                [size=136G][features=1 queue_if_no_path][hwhandler=1 alua]
                \_ round-robin 0 [prio=1][active]
                 \_ 0:0:2:0 sdb 8:16  [active][ready]
                

                 

                We then push the controller back in and have the following in the logs:

                 
                Apr 17 15:47:12 sles10 kernel: mptsas: ioc0: add expander: num_phys 25, sas_addr (0x5001517b9f5e03ff)
                Apr 17 15:47:12 sles10 kernel: mptsas: ioc0: attaching ssp device: fw_channel 0, fw_id 1, phy 11, sas_addr 0x500015500002050a
                Apr 17 15:47:12 sles10 kernel:   Vendor: Intel     Model: Multi-Flex        Rev: 0308
                Apr 17 15:47:12 sles10 kernel:   Type:   Direct-Access                      ANSI SCSI revision: 05
                Apr 17 15:47:12 sles10 kernel:  0:0:2:0: mptscsih: ioc0: qdepth=64, tagged=1, simple=1, ordered=0, scsi_level=6, cmd_que=1
                Apr 17 15:47:12 sles10 kernel:  0:0:2:0: alua: supports explicit TPGS
                Apr 17 15:47:12 sles10 kernel:  0:0:2:0: alua: port group 01 rel port 06
                Apr 17 15:47:12 sles10 kernel:  0:0:2:0: alua: port group 01 state S supports touSnA
                Apr 17 15:47:12 sles10 kernel: SCSI device sdb: 285149758 512-byte hdwr sectors (145997 MB)
                Apr 17 15:47:12 sles10 kernel: sdb: Write Protect is off
                Apr 17 15:47:12 sles10 kernel: sdb: Mode Sense: 97 00 10 08
                Apr 17 15:47:12 sles10 kernel: SCSI device sdb: drive cache: write back w/ FUA
                Apr 17 15:47:12 sles10 kernel: SCSI device sdb: 285149758 512-byte hdwr sectors (145997 MB)
                Apr 17 15:47:12 sles10 kernel: sdb: Write Protect is off
                Apr 17 15:47:12 sles10 kernel: sdb: Mode Sense: 97 00 10 08
                Apr 17 15:47:12 sles10 kernel: SCSI device sdb: drive cache: write back w/ FUA
                Apr 17 15:47:12 sles10 kernel:  sdb:<4>printk: 95 messages suppressed.
                Apr 17 15:47:12 sles10 kernel: Buffer I/O error on device sdb, logical block 0
                Apr 17 15:47:12 sles10 kernel: Buffer I/O error on device sdb, logical block 1
                Apr 17 15:47:12 sles10 kernel: Buffer I/O error on device sdb, logical block 2
                Apr 17 15:47:12 sles10 kernel: Buffer I/O error on device sdb, logical block 3
                Apr 17 15:47:12 sles10 kernel: Buffer I/O error on device sdb, logical block 0
                Apr 17 15:47:12 sles10 kernel: Buffer I/O error on device sdb, logical block 1
                Apr 17 15:47:12 sles10 kernel: Buffer I/O error on device sdb, logical block 2
                Apr 17 15:47:12 sles10 kernel: Buffer I/O error on device sdb, logical block 3
                Apr 17 15:47:12 sles10 kernel: Buffer I/O error on device sdb, logical block 0
                Apr 17 15:47:12 sles10 kernel: Buffer I/O error on device sdb, logical block 1
                Apr 17 15:47:12 sles10 kernel:  unable to read partition table
                Apr 17 15:47:12 sles10 kernel: sd 0:0:2:0: Attached scsi disk sdb
                Apr 17 15:47:12 sles10 kernel: sd 0:0:2:0: Attached scsi generic sg1 type 0
                Apr 17 15:47:12 sles10 multipathd: sdb: add path (uevent)
                Apr 17 15:47:12 sles10 multipathd: 22206000155abb71e: load table [0 285149758 multipath 1 queue_if_no_path 1 alua 2 1 round-robin 0 1 1 8:0 100 round-robin 0 1 1
                Apr 17 15:47:12 sles10 multipathd: sdb path added to devmap 22206000155abb71e
                Apr 17 15:47:12 sles10 multipathd: dm-0: add map (uevent)
                Apr 17 15:47:12 sles10 multipathd: dm-0: devmap already registered
                Apr 17 15:47:12 sles10 kernel: sd 0:0:0:0: alua: port group 00 state A supports touSnA
                Apr 17 15:47:12 sles10 multipathd: dm-1: add map (uevent)
                Apr 17 15:47:12 sles10 multipathd: dm-2: add map (uevent)
                Apr 17 15:47:12 sles10 multipathd: dm-5: add map (uevent)
                Apr 17 15:47:12 sles10 multipathd: dm-3: add map (uevent)
                Apr 17 15:47:12 sles10 multipathd: dm-6: add map (uevent)
                Apr 17 15:47:12 sles10 multipathd: dm-7: add map (uevent)
                Apr 17 15:47:12 sles10 multipathd: dm-4: add map (uevent)
                

                 

                I/O Tests

                group_by_serial with 2 controllers

                # bonnie -s 20000 -d /home -m sles10-mpio-alua-pathgrouping-byserial
                Bonnie 1.4: File '/home/Bonnie.6182', size: 20971520000, volumes: 1
                Writing with putc()...         done:  22053 kB/s  28.2 %CPU
                Rewriting...                   done:  23456 kB/s   3.0 %CPU
                Writing intelligently...       done:  20422 kB/s   1.9 %CPU
                Reading with getc()...         done:  48580 kB/s  43.0 %CPU
                Reading intelligently...       done:  69911 kB/s   5.9 %CPU
                Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
                              ---Sequential Output (nosync)--- ---Sequential Input-- --Rnd Seek-
                              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --04k (03)-
                Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU   /sec %CPU
                sles10 1*20000 22053 28.2 20422  1.9 23456  3.0 48580 43.0 69911  5.9  510.1  1.6
                

                 

                failover with 2 controllers

                # bonnie -s 20000 -d /home -m sles10-mpio-alua-pathgrouping-failover
                Bonnie 1.4: File '/home/Bonnie.7691', size: 20971520000, volumes: 1
                Writing with putc()...         done:  55475 kB/s  64.9 %CPU
                Rewriting...                   done:  46435 kB/s   5.1 %CPU
                Writing intelligently...       done:  62430 kB/s   5.6 %CPU
                Reading with getc()...         done:  75727 kB/s  65.4 %CPU
                Reading intelligently...       done: 110119 kB/s   8.7 %CPU
                Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
                              ---Sequential Output (nosync)--- ---Sequential Input-- --Rnd Seek-
                              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --04k (03)-
                Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU   /sec %CPU
                sles10 1*20000 55475 64.9 62430  5.6 46435  5.1 75727 65.4110119  8.7  890.6  3.0

                 

                Conclusion

                To be frank, I have to question the testing done by Intel on this setup. It's clear, once you test this properly, that there is a huge problem with their preferred path grouping policy. The next step is to repeat these tests and settings with the VTrak attached, which I will try to report back on.

                 

                I hope this helps someone else in the future!

                • 5. Re: SLES 10 SP3/SP4 Modular Server Dual Storage and MPIO (Multipath) Setup
                  emilec

                  A quick update on some testing I have done on SLES 11 SP2. Here the recommended config from Intel does work. The interesting part, though, is why.

                   

                  Config for SLES11

                  # cat /etc/multipath.conf
                  blacklist {
                          devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
                          devnode "^(hd|xvd)[a-z][[0-9]*]"
                          devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
                  }
                  
                  devices {
                          device {
                          vendor                  "Intel"
                          product                 "Multi-Flex"
                          path_grouping_policy    "group_by_prio"
                          getuid_callout          "/lib/udev/scsi_id -g -u /dev/%n"
                          prio                    "alua"
                          path_checker            tur
                          path_selector           "round-robin 0"
                          hardware_handler        "1 alua"
                          failback                immediate
                          rr_weight               uniform
                          rr_min_io               100
                          no_path_retry           queue
                          features                "1 queue_if_no_path"
                          }
                  }

                   

                  SLES11 with path grouping policy group_by_prio

                  # multipath -ll
                  2224b000155126b27 dm-0 Intel,Multi-Flex
                  size=60G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
                  |-+- policy='round-robin 0' prio=130 status=active
                  | `- 0:0:3:0 sdb 8:16 active ready running
                  `-+- policy='round-robin 0' prio=1 status=enabled
                    `- 0:0:0:0 sda 8:0  active ready running
                  

                   

                  SLES10 with path grouping policy group_by_prio

                  # multipath -ll
                  22222000155e8d800 dm-0 Intel,Multi-Flex
                  [size=100G][features=1 queue_if_no_path][hwhandler=1 alua]
                  \_ round-robin 0 [prio=2][active]
                   \_ 0:0:2:0 sda 8:0   [active][ready]
                   \_ 0:0:3:0 sdb 8:16  [failed][ready]
                  

                   

                  As you can see, just as with my failover path grouping policy on SLES 10, the controllers are listed as active/enabled. The noticeable difference is the prio values in the multipath output. On SLES 11 the two path groups get clearly different priorities (130 and 1), whereas on SLES 10 the prio calculation doesn't work: both controllers come back with the same prio value, which is why group_by_prio fails there and the paths keep flapping between failed and active.
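
                  If you want to see the per-path priority values multipathd itself has calculated on each platform (a hedged suggestion, assuming your build includes the interactive multipathd console; the output format differs a little between versions), compare what each path reports on the two systems:
                  # multipathd -k
                  multipathd> show paths

                  On SLES 11 SP2 the two paths report clearly different priorities, matching the prio=130/prio=1 groups above; on SLES 10 they come back equal, which matches the flapping behaviour described here.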

                  • 6. Re: SLES 10 SP3/SP4 Modular Server Dual Storage and MPIO (Multipath) Setup
                    bic_admin

                    Thank you a lot for your work and contribution to the Community, emilec.

                     

                    We got SLES 11 SP2 running with multipath I/O and can do an "affinity change" without problems.

                     

                    But one problem still exists: When shutting down the server, the system is not able to do a clean unmount of the partitions.

                     

                    "Not shutting down MD Raid - reboot/halt scripts do this."     missing

                    Removing multipath targets: May 21 08:53:52 | 22289xxxxx_part2: map in use

                    May 21 08:53:52 | failed to remove multipath map 22289xxxxx

                     

                     

                    When the server boots up again, it does an fsck (file-system check). One time so far, it has found orphaned inodes.

                     

                    Does anybody know how to solve this problem? Do you have it too?

                    • 7. Re: SLES 10 SP3/SP4 Modular Server Dual Storage and MPIO (Multipath) Setup
                      emilec

                      Hi bic_admin

                       

                      I must admit I didn't pay much attention to the shutdown procedure on the SLES 11 SP2 platform I was testing on. I know my SLES 10 platforms all shut down and boot cleanly. Unfortunately my lab equipment has been reassigned to another task, but when it's free I'll see if I can reproduce your problem.