Load balance algorithm w/ AIX and XIV

IBM only supports a queue depth of 1 when attaching AIX to XIV with the default round_robin algorithm. Normally round_robin or load_balance would be the best choice, but with the queue depth pinned at 1 there is a real performance penalty for asynchronous I/O. This looks to have been fixed in 5.3.10 (APAR IZ42730) and 6.1 (APAR IZ43146), but it is still broken (and probably never will be fixed) in earlier releases.
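
Before changing anything, you can check what a disk is currently set to (hdisk4 here is just an example name; substitute your own):

lsattr -El hdisk4 -a algorithm -a queue_depth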

So, IBM's recommendation is to split your storage into a number of LUNs matching the number of paths to your XIV, use the fail_over algorithm with a larger queue depth, and give each LUN a different highest-priority path. This is kind of a poor man's load balancing. It's not that bad, other than having to look at 4 or more hdisks for every LUN, and having to figure out which path to give the highest priority on each one!

IBM doesn’t really see this as a problem, but it’s a huge pain to do correctly in an enterprise.

So, how do we start? First, figure out what hdisk you’re talking about, then run:

lspath -H -l hdiskx -F "status name parent path_id connection"
status  name   parent path_id connection
Enabled hdiskx fscsi0 0       50050763061302fb,4010401200000000
Enabled hdiskx fscsi0 1       50050763060302fb,4010401200000000
Enabled hdiskx fscsi1 2       50050763060802fb,4010401200000000
Enabled hdiskx fscsi1 3       50050763061802fb,4010401200000000

We need the parent device and the connection bit (WWN,LUN#) to specify just a single path. Then run:

lspath -AHE -l hdiskx -p fscsi0 -w "50050763061302fb,4010401200000000"
attribute value              description  user_settable
scsi_id   0x20400            SCSI ID      False
node_name 0x5005076306ffc2fb FC Node Name False
priority  1                  Priority     True

That shows you the priority of this path. You can see it’s still the default of 1. You can check the other paths too.
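
If you'd rather not run that once per path, here is a minimal sketch of a loop that prints the priority of every path on one disk (again, hdisk4 is just an example name):

# Print parent, connection, and current priority for every path on one disk
lspath -l hdisk4 -F "parent connection" | while read parent conn; do
    pri=$(lspath -AE -l hdisk4 -p $parent -w "$conn" | awk '/^priority/ {print $2}')
    echo "$parent $conn priority=$pri"
done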

The goal is to spread the load across all the available paths. To do that, we create as many LUNs as we have paths, 4 in this case. If we need 4GB, we carve it into four 1GB LUNs and give each one a different primary path. So, in this example, we would run:

chpath -l hdiskx -p fscsi0 -w 50050763061302fb,4010401200000000 -a priority=1
chpath -l hdiskx -p fscsi0 -w 50050763060302fb,4010401200000000 -a priority=2
chpath -l hdiskx -p fscsi1 -w 50050763060802fb,4010401200000000 -a priority=3
chpath -l hdiskx -p fscsi1 -w 50050763061802fb,4010401200000000 -a priority=4

The first command isn’t really necessary, but I was on a roll. Now, we have to change the algorithm for the hdisk and set the queue depth:

chdev -l hdiskx -a algorithm=fail_over -a queue_depth=32

Make sure to stagger the next one so that a different path ends up with priority 1 on each LUN: path 1 gets priority 1, path 2 gets 2… and path 0 gets priority 4. Rinse and repeat until you have 4 LUNs, each with a different primary path.
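
If you have more than a handful of LUNs, doing that by hand gets old fast. Here is a minimal ksh sketch of the idea; the hdisk names and the path count of 4 are assumptions to adjust for your environment, and the chdev will refuse if the disk is in use (use -P and reboot in that case):

#!/bin/ksh
# Rotate path priorities so each hdisk prefers a different path,
# then switch each disk to fail_over with a deeper queue.
DISKS="hdisk4 hdisk5 hdisk6 hdisk7"   # example names, adjust for your system
NPATHS=4                              # assumed number of paths per LUN
offset=0
for d in $DISKS; do
    i=0
    lspath -l $d -F "parent connection" | while read parent conn; do
        # priority 1..NPATHS, shifted by one path for each successive disk
        pri=$(( ((i + offset) % NPATHS) + 1 ))
        chpath -l $d -p $parent -w "$conn" -a priority=$pri
        i=$((i + 1))
    done
    chdev -l $d -a algorithm=fail_over -a queue_depth=32
    offset=$((offset + 1))
done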

Now, wasn't that easy? Oh, and when you add more disks, be sure to keep them distributed as evenly as possible.

8 Comments

    john woo

    Nice! So you mean hdisk1 through hdisk4 should look like this:
    ----------------------------------------
    chpath -l hdisk1 -p fscsi0 -w 50050763061302fb,4010401200000000 -a priority=1
    chpath -l hdisk1 -p fscsi0 -w 50050763060302fb,4010401200000000 -a priority=2
    chpath -l hdisk1 -p fscsi1 -w 50050763060802fb,4010401200000000 -a priority=3
    chpath -l hdisk1 -p fscsi1 -w 50050763061802fb,4010401200000000 -a priority=4
    ----------------------------------------
    chpath -l hdisk2 -p fscsi0 -w 50050763061302fb,4010401200000000 -a priority=2
    chpath -l hdisk2 -p fscsi0 -w 50050763060302fb,4010401200000000 -a priority=3
    chpath -l hdisk2 -p fscsi1 -w 50050763060802fb,4010401200000000 -a priority=4
    chpath -l hdisk2 -p fscsi1 -w 50050763061802fb,4010401200000000 -a priority=1
    ----------------------------------------
    chpath -l hdisk3 -p fscsi0 -w 50050763061302fb,4010401200000000 -a priority=3
    chpath -l hdisk3 -p fscsi0 -w 50050763060302fb,4010401200000000 -a priority=4
    chpath -l hdisk3 -p fscsi1 -w 50050763060802fb,4010401200000000 -a priority=1
    chpath -l hdisk3 -p fscsi1 -w 50050763061802fb,4010401200000000 -a priority=2
    ----------------------------------------
    chpath -l hdisk4 -p fscsi0 -w 50050763061302fb,4010401200000000 -a priority=4
    chpath -l hdisk4 -p fscsi0 -w 50050763060302fb,4010401200000000 -a priority=1
    chpath -l hdisk4 -p fscsi1 -w 50050763060802fb,4010401200000000 -a priority=2
    chpath -l hdisk4 -p fscsi1 -w 50050763061802fb,4010401200000000 -a priority=3

      Pat Vaughan

      Yes, that's right. We set the inter-disk policy for each LV to maximum, then make sure every volume group has at least 2 disks and that each disk uses a different FC adapter. That way, when the LPs are allocated, they're balanced between the available FC adapters. A step further would be to stripe the LVs across the disks, making the I/O even more balanced, but that's more than we need. I've found that just setting the inter-disk policy to maximum when the LVs are created gives us a pretty good balance, based on the I/O statistics gathered by the XIV array.
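
      As a concrete illustration (the LV and VG names and the size are made up), creating an LV with the inter-disk policy set to maximum looks something like this:

      # Spread the new LV's logical partitions across all PVs in appvg
      mklv -y app_lv -t jfs2 -e x appvg 64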

    john woo

    For a DS4300 in Active/Standby, why does all I/O always use fscsi0?

      Pat Vaughan

      On our DS4Ks, we assign a controller to each of 2 SAN fabrics, and all of our hosts have 1 FC adapter in each fabric as well. That's pretty standard, and I'll assume your setup is similar. There are a couple of things to look at. fget_config should show you 1 DAR and 2 active DAC devices per DS4K. I've seen the driver have issues and list only one DAC as active and the other as "NONE". The other thing to check is that you have half your LUNs set with the A controller as preferred and half with the B controller.

      Running fget_config -Av should look something like this:

      ---dar1---
      
      User array name = 'FastT2'
      dac1 ACTIVE dac3 ACTIVE
      
      Disk     DAC   LUN Logical Drive
      hdisk33  dac3    1 tsm2_dbvg
      hdisk34  dac3    4 tsm2_backuppool2
      hdisk35  dac1    5 TSM_ASMmirr_1
      hdisk36  dac3    6 tsm_backuppoolNC1
      hdisk37  dac1    7 tsm2_appvg

      If those things look good, then you may have a communications issue. Check your zoning. Or, it could be a LUN masking issue. Double-check the HBA Host Port assignments in the Storage Management GUI.

      And here is my output from running lsattr on the DAR device:

      # lsattr -El dar1
      act_controller dac1,dac3 Active Controllers                          False
      aen_freq       600       Polled AEN frequency in seconds             True
      all_controller dac1,dac3 Available Controllers                       False
      autorecovery   no        Autorecover after failure is corrected      True
      balance_freq   600       Dynamic Load Balancing frequency in seconds True
      cache_size     128       Cache size for both controllers             False
      fast_write_ok  yes       Fast Write available                        False
      held_in_reset  none      Held-in-reset controller                    True
      hlthchk_freq   600       Health check frequency in seconds           True
      load_balancing no        Dynamic Load Balancing                      True
      switch_retries 5         Number of times to retry failed switches    True
      

    john woo

    Thanks for the information.
    I'd like to know whether any kind of storage can do static LUN load balancing without any MPIO software.

    The test setup: HBA1 and HBA2 on the server,
    Switch1 and Switch2,
    LUN0 from Storage1 and LUN1 from Storage2.

    I want to balance LUN0 and LUN1 across HBA1 and HBA2.

    Active paths:
    HBA1-Switch1-Storage1
    HBA1-Switch1-Storage2

    so fscsi0 always carries the load.

    I changed the hdisk2 (LUN1) path priority to 1 so that fscsi1 would handle LUN1 and fscsi0 would handle LUN0.

    But the result is:
    fscsi0 (HBA1) accesses both LUN0 and LUN1.
    If I break fscsi0, all paths fail over to fscsi1.

    Such a headache.

    Does the priority have some relationship with, or requirement from, the storage controller?

      Pat Vaughan

      On AIX, MPIO is now part of the OS. You CAN use some storage arrays with the native MPIO, but you really need the appropriate drivers to make the arrays work correctly. For XIV disk, the "drivers" are pretty much just some ODM entries; the newer versions include one utility. For DS8K disk, the "drivers" include quite a bit of extra code. For DS4K disk, you really need RDAC to make it work correctly. And for some systems, like HP-UX with certain storage arrays, using the included MPIO IS recommended.
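
      If you're not sure what is actually managing a disk, a couple of quick checks help (the grep pattern is just a guess at likely fileset names, and the PCM attribute only exists on MPIO-managed hdisks):

      lsattr -El hdisk4 -a PCM                       # which path-control module owns the disk
      lslpp -l | grep -i -E "xiv|mpio|rdac|sddpcm"   # multipath filesets installed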

      Your issue sounds like you have an active/passive setup (DS4K? or maybe XIV), and the fabric is working as expected (you can access both LUNs from either fabric). So the issue sounds to me like it's a question of which path is set as the primary. With DS4K, that has to be changed in the management GUI. With XIV-attached storage, it is changed by setting the appropriate path to priority 1. Assuming that's been done, I would open a software PMR with IBM.

    neter

    And what would you recommend if I see all 7 of my LUNs on 32 paths each?
    So in total, "lspath | wc" shows 224 paths!!

    By default (and again every time a disk is removed with rmdev -dl … and rediscovered), the priority is 1 on every path of every LUN.
    That means that by default all traffic goes through fscsi0 (the first adapter).

    Is there an easy way to optimally prioritize such a configuration (224 paths, so 56 paths on each of the 4 adapters = 8 paths per LUN on each of the 4 adapters)?

      Pat Vaughan

      Well, the first thing I would do is step back and define what we're trying to do and what we can do. We have 4 adapters; let's assume they're two dual-port 8Gb adapters in a Power7 system. And we have 8 adapters on the storage array; let's assume that's four dual-port adapters in a DS8K. And we got here by zoning all of the ports on the host to all of the ports on the disk array. Of course, we do this for performance. The goal is to spread the load out across all the available adapters and switch ports. But are we really going to generate enough load to warrant that much bandwidth? Probably not.

      Let's say this box's goal in life is to crush the disk array. Is any system with 4 adapters going to saturate the 8 adapters on the disk array? Probably not. Can the disk array even pump out enough data to sustainably saturate those 8 adapters? With spinning disk, probably not.

      But across the enterprise we still want to balance the I/O over those 8 adapters in the disk array. I would probably pick 2 ports on the array, each on a different physical adapter, to zone to each of the adapters in the host. Immediately, we've dropped the number of paths to just 2 per HBA, while still using all the adapters in both the host and the array for performance and availability. You could even zone them on a 1-to-1 basis, but then not all the ports on the array would be used.

      Now it's a matter of setting the priority on each path. I would write a script to set the priority: on the first LUN, set the first path on the first adapter to 1, the first path on the second adapter to 2, and so on. Then for the next LUN, set the first path on the second adapter to 1, the first path on the third adapter to 2, and so on. With just 7 LUNs, you won't be able to put a primary path on every one of the array's adapters.
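
      As a sanity check after a script like that runs, a rough ksh sketch along these lines (assuming the priorities have been set as described above) will count how many disks have their priority-1 path on each adapter:

      #!/bin/ksh
      # Count how many disks have their primary (priority=1) path on each fscsi adapter
      for d in $(lsdev -Cc disk -F name); do
          lspath -l $d -F "parent connection" 2>/dev/null | while read parent conn; do
              pri=$(lspath -AE -l $d -p $parent -w "$conn" | awk '/^priority/ {print $2}')
              [ "$pri" = "1" ] && echo "$parent"
          done
      done | sort | uniq -c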

      A much simpler setup would be to use virtual Fibre Channel adapters to pass each physical adapter through to the LPAR, then load the multipath software for your storage array.

      The ideal for performance would be to assign 8 LUNs for each VG and either set striping on the LVs or set the LV inter-disk policy to maximum. Then, if you can, use either RAW LVs or CIO to get the best performance possible.
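
      For example (the names and sizes here are made up), a striped LV across 8 PVs plus a JFS2 filesystem mounted with CIO would look something like this:

      # Striped LV across 8 hdisks, then a JFS2 filesystem mounted with concurrent I/O
      mklv -y db_lv -t jfs2 -S 64K datavg 256 hdisk10 hdisk11 hdisk12 hdisk13 hdisk14 hdisk15 hdisk16 hdisk17
      crfs -v jfs2 -d db_lv -m /db -A yes
      mount -o cio /db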
