Stor2RRD Overview

If you manage your own SAN, you’ll eventually be asked questions like “Why are some of my databases slow?”, “Why do we periodically have performance problems?” or “Do we have a hot LUN?”. Modern arrays have real-time performance monitoring, but not all of them keep historical data, so you can’t always tell whether there’s a periodic performance issue or whether the current performance is out of the ordinary. There are vendor-supplied products and lots of third-party products that let you gather performance statistics, but they’re usually pretty expensive. If you just need to gather and report on performance data for IBM V7000, SVC, or DS8000 storage, there is a great FREE product called Stor2RRD.

Stor2RRD is developed by XORUX, the developers of the excellent LPAR2RRD tool, and is free to use with relatively modest fees for support. As its name suggests, it collects data from your storage arrays and puts it into RRD databases. It has much the same requirements as LPAR2RRD, a simple Linux web server with Perl and RRDtool, and you can run it on the same server as LPAR2RRD. If you have a DS8000 array, you’ll also need the DSCLI package for your storage; for an SVC or V7000 array, plain SSH is enough.
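
For the SVC/V7000 case, the collection just runs over SSH, so the Stor2RRD user on the Linux box needs a key pair whose public key is registered with a monitoring user on the cluster (the provided directions cover the exact steps). As a rough sketch, with a made-up user and hostname:

ssh-keygen -t rsa -f ~/.ssh/id_rsa_stor2rrd -N ""
# register ~/.ssh/id_rsa_stor2rrd.pub with a monitor-role user on the SVC/V7000, then test:
ssh -i ~/.ssh/id_rsa_stor2rrd monitor@v7000 lssystem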

We had issues getting version 0.45 to work, but the developers responded to a quick email with a preview of the next version, 0.48, which fixed the problem. The setup was pretty simple, we didn’t have any problems with the provided directions, and we got everything set up and tested in a couple of hours.

After running the tool for a couple of weeks, we’ve collected what seems like a lot of data. Some of the high-level graphs are very busy, so much so that they run the risk of being “data porn”, data for data’s sake that loses some of its usefulness. But you can drill down from these high-level graphs to the Storage Pool, MDisk, LUN, drive, or SAN port level and get details like IOPS, throughput, latency, and capacity.

For instance, here is a graph of the read performance for the managed disks in one of our V7000s:
[Graph: V7000 MDisk read performance]

It sure looks like mdiskSSD3, the teal blue one, is a hot array. Here is the read response time for that particular MDisk:
[Graph: mdiskSSD3 read response time]
The response time isn’t too bad on that array, 3ms max and 1.4ms on average, which for this data is more than fast enough.

This is just one simple example of the data that Stor2RRD collects. With this data we have real information showing whether a system’s slowness is because the server is using an abnormal amount of bandwidth, or whether we should consider adding more SSD to an over-subscribed pool. That helps us make intelligent storage decisions and back up our reasoning with real numbers.

For the cost of a small Linux VM, you can deploy a troubleshooting and monitoring tool that rivals some very expensive third-party products. And if it’s helpful in your environment, Stor2RRD annual support is a fraction of the cost of other products.

There is a full-featured demo on the Stor2RRD website where you can try the tool yourself with the developers’ data.

Installing the XIVGui on Fedora 16

I’ve been running the XIVGui on a Windows 7 VM so that I have it available from anywhere. That works, but then I have to launch an rdesktop session, log in, launch the XIVGui, and log in again. I finally got tired of the extra steps and decided to load the XIVGui locally when I upgraded to Fedora 16. I considered making an RPM, but I’m sure IBM would frown on redistributing their code. These manual steps work great on Fedora 16 and should work fine on Fedora 15; I haven’t tested them with RHEL or other versions.

First, you need the 32-bit version of libXtst, even if you’re using the 64-bit client:

yum install libXtst-1.2.0-2.fc15.i686

Then just download the package from IBM’s FTP server, uncompress it, and move the resulting directory to someplace on your system; I used /usr/local/lib.

tar -zxvf xivgui-xxx-linux64.tar.gz
mv XIVGUI /usr/local/lib/

Then, we just need to make a couple of .desktop files.

/usr/share/applications/xivgui.desktop:

[Desktop Entry]
Name=XIVGui
Comment=GUI management tool for IBM XIV
Exec=/usr/local/lib/XIVGUI/xivgui
Icon=/usr/local/lib/XIVGUI/images/xivIconGreen-32.png
Terminal=false
Type=Application
Categories=System;
StartupNotify=true
X-Desktop-File-Install-Version=0.18

/usr/share/applications/xivtop.desktop:

[Desktop Entry]
Name=XIVTop
Comment=GUI performance tool for IBM XIV
Exec=/usr/local/lib/XIVGUI/xivtop
Icon=/usr/local/lib/XIVGUI/images/xivIconTop-32.png
Terminal=false
Type=Application
Categories=System;
StartupNotify=true
X-Desktop-File-Install-Version=0.18

Now XIVGui and XIVTop should show up under “System Tools”.
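
If they don’t show up right away, it’s worth sanity-checking the files and refreshing the desktop database (same paths as above):

desktop-file-validate /usr/share/applications/xivgui.desktop
desktop-file-validate /usr/share/applications/xivtop.desktop
update-desktop-database /usr/share/applications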

NPIV N-Port changes w/ AIX

I was at a meeting with other storage admins where they talked about never using NPIV with AIX servers because AIX can’t handle it if the N-Port ID changes due to an N-Port failover in AG mode. I’ve never seen that. In testing, our AIX boxes handled the failover without any problems. Part of the reason may be that I’ve enabled Dynamic Tracking and Fast I/O Failover on these fibre adapters. Dynamic Tracking allows for N-Port ID changes, and Fast I/O Failover makes the failure of a fibre adapter get detected faster, which can be good if you are using a multi-path driver. It’s a simple change, but it requires either a reboot or bringing down the adapter for the changes to take effect. Here’s the command to make the changes in the ODM, which will be applied the next time you reboot:

chdev -l fscsi1 -a dyntrk=yes -a fc_err_recov=fast_fail -P
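
After the reboot (or after taking the adapter down and running cfgmgr), you can confirm the attributes took with:

lsattr -El fscsi1 -a dyntrk -a fc_err_recov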

The other, and better, option is to build F-Port trunks in front of your AG switch, which preserves the PID in case of an ISL failure, but that requires a trunking license on your switches.

AIX – Remove failed MPIO paths

Here is a quick and dirty script to remove failed MPIO paths. You can end up with failed paths if you make some SAN connection changes.

# loop over the 2107 (DS8000) hdisks
for disk in `lsdev -Cc disk | grep 2107 | awk '{ print $1 }'`
do
        # grab the connection string of every Failed path on this disk
        for path in `lspath -l $disk -F "status connection" | grep Failed | awk '{ print $2 }'`
        do
                echo $disk
                # remove that path definition
                rmpath -l $disk -w $path -d
        done
done
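
Afterwards, a quick check should come back with zero Failed paths left:

lspath | grep -c Failed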

Load balance algorithm w/ AIX and XIV

IBM only supports a queue depth of 1 when attaching to XIV with the default algorithm of round_robin. Usually round_robin or load_balance is the best choice, but since IBM only supports a queue depth of 1 at this time, there is a performance penalty for asynchronous I/Os. This looks to have been fixed in AIX 5.3.10 (APAR IZ42730) and 6.1 (APAR IZ43146), but it is still broken (and probably never to be fixed) in earlier releases.

So, IBM’s recommendation is to split your storage into a number of LUNs matching the number of paths to your XIV, use the fail_over algorithm with a larger queue depth, and assign a different path the highest priority for each LUN. This is kind of a poor man’s load balancing. It’s not that bad, other than having to look at 4 or more hdisks for every LUN, and having to figure out which path to give the highest priority for each one!

IBM doesn’t really see this as a problem, but it’s a huge pain to do correctly in an enterprise.

So, how do we start? First, figure out what hdisk you’re talking about, then run:

lspath -H -l hdiskx -F "status name parent path_id connection"
status  name   parent path_id connection

Enabled hdiskx fscsi0 0       50050763061302fb,4010401200000000
Enabled hdiskx fscsi0 1       50050763060302fb,4010401200000000
Enabled hdiskx fscsi1 2       50050763060802fb,4010401200000000
Enabled hdiskx fscsi1 3       50050763061802fb,4010401200000000

We need the parent device and the connection bit (WWN,LUN#) to specify just a single path. Then run:

lspath -AHE -l hdiskx -p fscsi0 -w "50050763061302fb,4010401200000000"
attribute value              description  user_settable

scsi_id   0x20400            SCSI ID      False
node_name 0x5005076306ffc2fb FC Node Name False
priority  1                  Priority     True

That shows you the priority of this path. You can see it’s still the default of 1. You can check the other paths too.

The goal is to spread the load across all the available paths. With four paths, that means creating 4 LUNs: if we need 4GB, we create four 1GB LUNs. Then we can give each one a different primary path. So, in this example, we would run:

chpath -l hdiskx -p fscsi0 -w 50050763061302fb,4010401200000000 -a priority=1
chpath -l hdiskx -p fscsi0 -w 50050763060302fb,4010401200000000 -a priority=2
chpath -l hdiskx -p fscsi1 -w 50050763060802fb,4010401200000000 -a priority=3
chpath -l hdiskx -p fscsi1 -w 50050763061802fb,4010401200000000 -a priority=4

The first command isn’t really necessary, but I was on a roll. Now, we have to change the algorithm for the hdisk and set the queue depth:

chdev -l hdiskx -a algorithm=fail_over -a queue_depth=32
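
You can double-check that both attributes took with:

lsattr -El hdiskx -a algorithm -a queue_depth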

Make sure to stagger the next one so that path 1 gets a priority of 1, path 2 gets 2, and so on, with path 0 getting a priority of 4. Rinse and repeat until you have 4 LUNs, each with a different primary path.
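
For example, assuming the next LUN shows up as hdisky with the same four target ports (the LUN ID part of the connection string will be different, so it’s just a placeholder here), the staggered version would look something like:

chpath -l hdisky -p fscsi0 -w 50050763060302fb,<lun_id> -a priority=1
chpath -l hdisky -p fscsi1 -w 50050763060802fb,<lun_id> -a priority=2
chpath -l hdisky -p fscsi1 -w 50050763061802fb,<lun_id> -a priority=3
chpath -l hdisky -p fscsi0 -w 50050763061302fb,<lun_id> -a priority=4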

Now wasn’t that easy. Oh, and when you add more disks, be sure to keep them distributed as evenly as possible.

IBM XIV Thin Provisioning

Thin provisioning on an IBM XIV is pretty hot, but there are some gotchas. Thin provisioning lets you allocate more space in LUNs to your hosts than you have in physical storage. So if you have a lot of filesystems or volume groups with a lot of free space, that’s cool: where on other storage systems you would burn the whole space allocated by the LUNs, here you’re only physically allocating as much as you’re really using. It’s also easy to burn yourself, so you have to monitor the free space in the XIV “Storage Pools”. When a Storage Pool fills up, its volumes go to read-only mode until you resize the pool.

Now, here’s a catch. You can define the LUN to any size, but when the first I/O hits, the system allocates 17GB to the LUN regardless of its size. Say you define an 8GB LUN (for giggles) on a thinly provisioned Storage Pool. When the first I/O hits (like you actually turned on the host, ran cfgmgr, or did a scan for new hardware), 17GB will be reserved out of the Storage Pool. And it burns this free space in 17GB increments: an 18GB LUN burns 34GB of free space, etc.

Now, the system tries to keep you from doing stupid things like this. If you specify the size of the LUN in GB, it will automagically round up to the nearest 17GB chunk. But if you specify it in 512-byte blocks (because we all think in 512-byte blocks, right?), the LUN will appear to the host as the exact size specified, and it still burns 17GB of free space.

So, at a minimum the actual physical space you need for a thinly provisioned Storage Pool is:
17GB X [# of LUNs]
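
For example, a pool that will hold 20 thin LUNs needs at least 17GB X 20 = 340GB of real capacity behind it, no matter how small those LUNs are.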

Determining how many BB credits you need

At 4Gb, a full packet is about 1km long, and at 2Gb a full packet is about 2km long! Yes, at any given time 2K of your data is spread from the port to 1km down the cable (as the light travels). Each packet burns 1 buffer credit, no matter the size of the packet. The buffer credit isn’t released until the packet gets to the receiving switch and the sending switch receives an acknowledgment. So, at 4Gb, you need 2 buffer credits for each km of distance: 1 to fill the 1km pipe to the receiving switch, and 1 waiting for the acknowledgment.

So, for a 10km fibre path, you need 20 buffer credits, assuming that every packet is a full packet. In reality, even if every data packet is full, there is some inter-switch communication that doesn’t fill a packet. Those packets are shorter and take less time to transmit, which means more packets are in the pipe at once. So the 2 credits per km number is the MINIMUM you’ll need. And we’re not counting the time it takes to process the frames on the switches, which is probably pretty minimal.
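
If you want that rule of thumb as a formula (my rough math, not a vendor sizing guide), it works out to roughly:

[minimum BB credits] = [distance in km] X [link speed in Gb] / 2

which gives the same 2 credits per km at 4Gb and 1 per km at 2Gb, plus whatever padding you add for the short frames mentioned above.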

Also, be aware that on a 48k there are 1024 buffers per ASIC, of which 8 buffers per port are reserved. So if you put a bunch of long-distance ISLs all on one ASIC, you might run into problems because there aren’t enough buffers to go around. If you have a GoldenEye-based switch, you only get 700 buffer credits per ASIC, again with 8 buffers per port reserved. Use portbuffershow to see your buffer usage per ASIC.

Enabling Access Gateway (NPIV) on Brocade

Brocade’s flavor of NPIV is called Access Gateway. It’s a way to dumb down the switch and make it more of a pass-through device. When AG is enabled, the switch makes far fewer routing or switching decisions and passes all the traffic to an upstream switch. The upstream switch’s ports become F ports, and the “egress” ports on the AG switch become N ports.
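
The actual enable is only a couple of commands, roughly like this on a reasonably recent FOS release (check the Access Gateway guide for your code level, and expect the switch to reboot and drop its existing fabric configuration when AG mode comes on):

switchdisable
ag --modeenable
ag --modeshow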