VIO Disk Tuning

I set up VIO servers so rarely that it’s really easy to miss a step. These are some of the tuning commands I use when setting up VIO. Set these on the VIO server.

Set the FC adapter error recovery policy to fast_fail:

chdev -dev fscsi0 -attr fc_err_recov=fast_fail -perm


Enable dynamic tracking on the FC adapter:

chdev -dev fscsi0 -attr dyntrk=yes -perm


Set the FC adapter link type to point-to-point:

chdev -dev fcs0 -attr init_link=pt2pt -perm


I haven’t found a need to tweak the num_cmd_elems, lg_term_dma, or max_xfer_size with 4Gb Fibre Channel.
For each disk to be used as a VTD, disable SCSI reservations:

chdev -dev hdisk1 -attr reserve_policy=no_reserve -perm
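
A quick way to confirm the attributes made it into the ODM (with -perm they won’t take effect until the reboot mentioned below); the device names here are just examples:

lsdev -dev fscsi0 -attr fc_err_recov
lsdev -dev fscsi0 -attr dyntrk
lsdev -dev fcs0 -attr init_link
lsdev -dev hdisk1 -attr reserve_policy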

After these are run, reboot the VIO server. Then set these on the client LPARs.

Set the VSCSI error recovery mode to fast_fail for each VSCSI adapter:

chdev -l vscsi0 -a vscsi_err_recov=fast_fail


Set the VSCSI path timeout for each VSCSI adapter:

chdev -l vscsi0 -a vscsi_path_to=30


Set the healthcheck mode and interval:

chdev -l hdisk0 -a hcheck_mode=nonactive -a hcheck_interval=20
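
You can double-check the client settings with lsattr (vscsi0 and hdisk0 are example device names):

lsattr -El vscsi0 -a vscsi_err_recov -a vscsi_path_to
lsattr -El hdisk0 -a hcheck_mode -a hcheck_interval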

Monitoring AIX With Munin – Memory

The memory collection script for Munin lists used, free, pinned, and swap in a stacked graph. The problem is that the “used” graph is total used memory, which includes pinned, computational RAM, and filesystem cache. So, the pinned RAM is double-counted. And, to me, it’s very important to know how much RAM is used by filesystem cache. With the default script, a 64GB system with 16GB of pinned memory, 16GB of computational memory, and 32GB of filesystem cache looked like it had 80GB of RAM and was suffering from memory exhaustion. In reality, it’s a 64GB system with only 50% of the memory used by the OS and user programs.

I’ve changed the script for my systems. I’ve added a routine to get the file pages from vmstat and subtract that from the used RAM, and I’ve subtracted the pinned memory from used as well. The results closely match what nmon reports, so it seems pretty good to me.
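
The gist of the change looks something like this (a rough sketch of the fetch side only, not the exact plugin; the field names are just illustrative, and vmstat -v reports 4KB frames):

#!/bin/sh
# Carve filesystem cache and pinned pages out of "used" so they aren't
# double-counted. vmstat -v counts 4KB frames, so multiply by 4 for KB.
vmstat -v | awk '
/memory pages/ {total=$1}
/free pages/   {free=$1}
/pinned pages/ {pin=$1}
/file pages/   {file=$1}
END {
        printf("used.value %d\n",   (total - free - pin - file) * 4)
        printf("cache.value %d\n",  file * 4)
        printf("pinned.value %d\n", pin * 4)
        printf("free.value %d\n",   free * 4)
}'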

Here’s what the new graph looks like:

And, here’s the updated script: memory

Monitoring AIX With Munin

I’ve had a “capacity planning” system in place for a few years that we wrote in-house. We’ve been running nmon every 5 minutes, pulling all kinds of statistics out, and populating RRD files on a central server. It allows us to see trends and patterns that provide data for planning purchases. It’s not really a tactical monitoring system, more a strategic planning system. But, there are a couple of problems.

First, it spawns an nmon process every 5 minutes. nmon has an annoying habit of touching the volume groups on startup. If there is a VG modification running, nmon can’t get the info it wants. So if you run a long process, say a reorgvg, the nmon processes just stack up.

The second problem is that it’s a push system. A cron job fires off nmon every 5 minutes. If the network or the central server is down, the jobs that populate our central RRD files stack up.

Munin fixes these problems. First it doesn’t use nmon. It has a bunch of simple pre-packaged scripts. Each script gets a single set of statistics, a good UNIX approach, and uses just simple OS commands (vmstat, iostat, df, and so on). And, it’s a “pull” system. A cron job on the munin server polls the clients, gets configuration information and the data, creates the RRD files, and generates a set of web pages. If the network or the server is down, it’s no big deal.

The Munin client is written mostly in Perl, and works with the Perl packaged in AIX. You do need some extra modules, mainly Net::Server, from CPAN. The data collection scripts are responsible for checking that they will run successfully, providing RRD configuration information, and returning the actual data. One cool thing is that the RRD configuration is pulled at every execution, and the RRD files are updated accordingly. So, if you want to change a data set from a COUNTER type to a GAUGE, just change the script and the server takes care of everything.
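
To give a feel for how simple these scripts are, here is a minimal sketch of a plugin (a made-up “aix_runq” example, not one of the stock scripts) showing the config-versus-fetch protocol:

#!/bin/sh
# Called with "config", print the graph and RRD definition; otherwise print the data.
if [ "$1" = "config" ]; then
        echo "graph_title Run queue"
        echo "graph_vlabel runnable threads"
        echo "runq.label runq"
        echo "runq.type GAUGE"   # change this to COUNTER and the server adjusts the RRD
        exit 0
fi
# Take the second vmstat sample; the first column (r) is the run queue
RUNQ=`vmstat 1 2 | tail -1 | awk '{print $1}'`
echo "runq.value $RUNQ"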

The downers. I had some problems getting the install script to work. It just needs a lot of modules. So, I got it working on one node, then copied all the files to the other nodes. The data collection scripts are okay, but they have some annoyances. The CPU script doesn’t understand shared CPU partitions. The memory script shows used, free, pinned, and swap. So, my 64GB system with 16GB of pinned memory and no swap shows that I have 80GB of RAM. And, it doesn’t show how much of the “used” RAM is filesystem cache. Luckily they’re really simple and easily updated.

Where do my I/Os go?!?

You may have noticed that in the “vmstat -s” output there are several counters related to I/O.

...
            344270755 start I/Os
             16755196 iodones
...

Here’s what the man page has to say about these counters:

start I/Os
                   Incremented for each read or write I/O request initiated by
                   VMM. This count should equal the sum of page-ins and page-outs.

iodones
                   Incremented at the completion of each VMM I/O request.

There’s something fishy going on here. The system started roughly 20 times more I/Os than it completed. So, what gives? I had a discussion about this once in which the admin explained that I/Os get started, but not completed. The thinking goes that the I/Os get put into a queue, and if the I/O isn’t filled quickly enough it times out and retries. Sort of like a TCP retransmit for disk I/Os.

But, if that happened, you would think that there would be an error in the errpt, or errors logged in the device subsystem. At least you would see unreasonable I/O service times. And I never see anything like that.

What seems to be going on is that the I/Os get dispatched to the VMM, and if they are sequential they are coalesced into 1 I/O and serviced by 1 system call. When the system detects sequential I/O, the next I/O reads in 2 * j2_minPageReadAhead file pages, the next one after that reads in 4 * j2_minPageReadAhead file pages, and so on until it reaches j2_maxPageReadAhead. Each set of I/Os is serviced by 1 system call, even though it pulls in multiple pages. And that is way better for performance.

So the ratio of start I/Os to iodones is really an indicator of how much sequential I/O the system is doing. And, if the system has lots of sequential I/O, it can be an indicator used to tune JFS2 read-ahead.
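
A quick way to eyeball that ratio on a running system (just a one-liner sketch):

vmstat -s | awk '/start I\/Os/ {s=$1} /iodones/ {d=$1} END {printf("start I/Os per iodone: %.1f\n", s/d)}'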

And remember, the difference between minfree and maxfree should be at least equal to j2_maxPageReadAhead. So if you make changes, adjust minfree and maxfree accordingly.
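
Checking (and, if you decide to, changing) the relevant tunables is straightforward; the values below are only examples:

# current read-ahead and free-list settings
ioo -o j2_minPageReadAhead -o j2_maxPageReadAhead
vmo -o minfree -o maxfree

# example only: if j2_maxPageReadAhead goes to 256, widen the
# minfree/maxfree gap to at least 256 as well
# ioo -p -o j2_maxPageReadAhead=256
# vmo -p -o maxfree=<minfree plus at least 256>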

Why I Don’t Care About AIX I/O Wait, And You Probably Shouldn’t Either

I don’t care about I/O Wait time in AIX, at least not a lot. But, I can’t seem to get through to my ex-technical manager or coworkers that I/O Wait largely doesn’t matter.

The thinking goes that I/O wait time reported by NMON or Topas is time that the CPU couldn’t do anything else because there are I/Os that aren’t getting satisfied. But, the systems in question don’t have any I/O load. They’re middleware servers that take requests from the clients, do some processing on them, then query the database for the data, and then the process happens in reverse. There’s no real disk I/O going on at all; in fact, the disk I/O only spikes to about 20MB/s (on an enterprise class SAN) when a handful of reports are written to disk at the top of the hour. Really, there’s not a lot of network I/O going on either, maybe a couple of MB/s on a 1Gb network.

So, to dispel the confusion about the disk I/O bottleneck, we run some disk I/O benchmarks on the SAN. The results come back at several hundred MB/s across the board with a respectable number of IOPS, way, way more than the system generates during normal operations. And, at that point the bottleneck is probably the HBAs and the system bus.

So, what’s going on? Well, I/O wait as reported by the system is badly named. It’s not time that the system is waiting at all. A slight over-simplification: I/O wait time is time that the CPU had nothing to do (idle) while there was an I/O outstanding in the I/O queue. You can generate high I/O wait on a system with nothing going on but a trickle of I/O activity. Conversely, if you want to reduce I/O wait on the same system, pile on more CPU workload. There will be less time that the CPU is “idle”, but with the same I/O load the I/O wait time will be lower. So, really, you can almost ignore I/O wait time.

So, how do I know if my system is bogged down by I/O? I like vmstat and NMON. Look at the disk adapter I/O in NMON. Knowing your underlying disk architecture, is it a lot of I/O both in terms of MB/s and IOPS? That tells you how busy your HBAs are. To see how much work is actually being held up by I/O, look at vmstat. I run it with 2 or 3 second intervals; check the man page for more info.

 # vmstat -Iwt 2

System configuration: lcpu=16 mem=95488MB

   kthr            memory                         page                       faults           cpu       time
----------- --------------------- ------------------------------------ ------------------ ----------- --------
  r   b   p        avm        fre    fi    fo    pi    po    fr     sr    in     sy    cs us sy id wa hr mi se
  3   0   2   16182808     152244     0    70     0     0     0      0  4324  45931 20125 26  4 60 10 12:18:43
  3   0   1   16182831     152220     0    16     0     0     0      0  3509  37090 17622 22  3 67  8 12:18:45

You can see I have a 16-lcpu machine (POWER6 with SMT on, so it's really 8 CPUs), with a runq of 3. The runq shows the number of runnable threads, that is, how many threads are eligible for a CPU timeslice. On that machine, if the runq is 8 or less, you're golden. If it's 16, eh, it's okay. Over that and there are threads waiting for a CPU. The "b" column is threads blocked from running because of I/Os. If you're doing filesystem I/O, that's the column to watch; you want it as low as possible. This system is a DB server doing raw I/O; the "p" column is just like the "b" column, but for raw I/O. For an enterprise DB server, I don't get too excited over a few in the "p" column.

How do you reduce the "b" or "p" column? For filesystem I/O, there are some VMM tuning options to try, but it depends on your I/O characteristics. For both filesystem and raw I/O, spread the I/O across more drive heads (maybe you have a couple of hot LUNs or can't push the IOPS you need), install more disk cache (or DB SGA), and possibly install more HBAs if you're really moving a lot of data.
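
And to figure out whether the load is piled onto a couple of hot LUNs or an overworked adapter, plain iostat does the job (the 2-second, 10-sample numbers are just an example):

iostat -D 2 10     # extended per-disk statistics, including queue and service times
iostat -a 2 10     # throughput rolled up by disk adapter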

AIX 6.1 Screen Blanking w/ Commands

After upgrading to AIX 6.1, you may notice that some terminal behaviors have changed. SMIT loses its pretty borders when commands execute. When you exit commands like more or vi, the screen clears. And, some function keys may not work.

The problem is that IBM changed the terminfo entry for xterm. If you want the old xterm definition, just copy /usr/share/lib/terminfo/x/xterm from any AIX 5.3 system. You can also try setting your TERM type to xterm-old or xterm-r5, but copying the old file works better for me.
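
Either fix is quick; the AIX 5.3 hostname below is obviously just an example:

# try a different TERM type first
export TERM=xterm-old      # or xterm-r5

# or copy the 5.3 terminfo entry over the 6.1 one (scp, ftp, whatever you prefer)
scp aix53host:/usr/share/lib/terminfo/x/xterm /usr/share/lib/terminfo/x/xterm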

AIX Network Sniffing With Wireshark

Wireshark is a really good GUI packet capture and analysis tool. I have a portable version on a thumbdrive that I occasionally use to diagnose problems on AIX servers.

First, capture the data with iptrace. Refer to the iptrace man page for options. For example, to capture rsh traffic (port 514) from hosta and write it to /tmp, run:

startsrc -s iptrace -a "-s hosta -p 514 /tmp/hosta.out"
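
iptrace runs under the SRC, so while the capture is going you can confirm it’s active with:

lssrc -s iptrace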

After you’re done capturing your data, run:

stopsrc -s iptrace

Then download the output file to your desktop and open it with Wireshark. Wireshark will automatically determine that the file is an AIX iptrace capture.

Saving The Structure Of A VG

This process uses the savevg command to back up the VG structure information. I used to do this by first excluding all the filesystems from the backup by populating the /etc/exclude.vg files, but IBM introduced the “-r” flag that does the same thing. We purposely exclude rootvg from this because all the rootvg filesystems will be in the mksysb backup.

#!/bin/perl
#
# Author: Patrick Vaughan - http://patrickv.info
#
# Purpose:
# Saves VG structure to files for easy recovery
#
# Restore with:
# restvg -f /vgdata/savevg.VGNAME DISKNAME
#
$test=1;
$DEBUG=0;
$" = "";
$GREP = "/usr/bin/grep";
$LSVG = "/usr/sbin/lsvg";
$EXCLUDE_VGS = "rootvg"; # VGs to exclude in a REGEXP
$VGDATA_DIR = "/vgdata";
if ( ! -d "$VGDATA_DIR" ) {
         mkdir("$VGDATA_DIR") || die("Could not create directory $VGDATA_DIR: $!\n");
}
@VGS = grep(!/$EXCLUDE_VGS/, `$LSVG -o`);
for ( @VGS ) {
         chomp;
         $vg = $_;
#       exclude files are no longer necessary since the -r flag was introduced
#       if ( -f "/etc/exclude.$vg" ) {
#               unlink("/etc/exclude.$vg") || die("Could not remove file /etc/exclude.$vg: $@\n");
#       }
#       (@LVs) = grep(/^\//, split(/\s+/, `$LSVG -l $vg`, -1 ) );
#       open(EXCLUDE_VG, ">/etc/exclude.$vg") or die("Could not create file /etc/exclude.$vg: $@\n");
#       for (@LVs) {
#               # populate /etc/exclude.vg with ^./filesystem/ because we don't want the data
#               print EXCLUDE_VG ("^.$_/\n");
#       }
#       close(EXCLUDE_VG);
#       undef(@LVs);

         print "savevg of $vg to $VGDATA_DIR/savevg.$vg starting.\n";
         system("savevg -irf $VGDATA_DIR/savevg.$vg $vg");
         print "savevg of $vg to $VGDATA_DIR/savevg.$vg complete.\n";

# We're done with the exclude files, lets delete them
#       unlink("/etc/exclude.$vg") || die("Could not remove file /etc/exclude.$vg: $@\n");
}

This will create one file in /vgdata for each online volume group on the system (rootvg is excluded on purpose). You can then use restvg -f /vgdata/savevg.xxx to restore the VG structure. Check the restvg man page for more information.

To get information about the backup run:

restvg -l -f savevg.X
VOLUME GROUP:           sharedGvg
BACKUP DATE/TIME:       Tue May 29 09:33:56 EDT 2007
UNAME INFO:             AIX cnpg 3 5 002680EF4C00
BACKUP OSLEVEL:         5.3.0.50
MAINTENANCE LEVEL:      5300-05
BACKUP SIZE (MB):       15008
SHRINK SIZE (MB):       1727
VG DATA ONLY:           no
sharedGvg:
LV NAME             TYPE       LPs   PPs   PVs  LV STATE      MOUNT POINT
shareGlv            jfs2       938   938   1    open/syncd    /shared

You can see the VG size and LV info to make decisions on restoring the VG.
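
For example, restoring the VG above onto a new disk is a one-liner (the disk name is just an example):

restvg -f /vgdata/savevg.sharedGvg hdisk5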

Multi-VG mksysb Tape

More than one VG can be saved on a mksysb tape, which makes restoring things like your TSM or NIM master servers much easier. We’ll need to use the rmt tape devices that do not rewind after their operation (rmtX.1). First, rewind the tape (just to be sure we’re at the beginning) and do the mksysb as normal:

/usr/bin/tctl -f/dev/rmt0 rewind
/usr/bin/mksysb -p -v /dev/rmt0.1

Then just append your VGs on the end of the tape with savevg:

/usr/bin/savevg -p -v -f/dev/rmt0.1 vg1
/usr/bin/savevg -p -v -f/dev/rmt0.1 vg2
/usr/bin/tctl -f/dev/rmt0 rewind

To restore, just do the mksysb restore as normal, then do:

/usr/bin/tctl -f/dev/rmt0.1 rewind
/usr/bin/tctl -f/dev/rmt0.1 fsf 4
/usr/bin/restvg -f/dev/rmt0.1 hdiskX

The “fsf 4” will fast forward the tape to the first saved VG after the mksysb.  To restore the 2nd saved VG, use “fsf 5”.
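
So restoring that 2nd saved VG onto a new disk would look something like this (the disk name is just an example):

/usr/bin/tctl -f/dev/rmt0.1 rewind
/usr/bin/tctl -f/dev/rmt0.1 fsf 5
/usr/bin/restvg -f/dev/rmt0.1 hdiskY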

If you need to restore individual files, you can do it like this:

/usr/bin/tctl -f/dev/rmt0 rewind
restore -x -d -v -s4 -f/dev/rmt0.1 ./[path]/[file]

“-s4” is rootvg, use “-s5” for next VG. The files are restored in your current directory.

Using this you can make a mksysb tape with your OS, your TSM application VGs, and your NIM VGs all on one tape. And, if you save the VG structure into your rootvg, you can restore your extra filesystems as well. Then you just have to restore the files from TSM. I wouldn’t try to back up the TSM log, archive log, or DB this way, but those can be recreated from your TSM DB backup. Just be sure to do a “backup devconfig” and “backup volhist” regularly.

NIM Quickstart

I muddled through my first NIM install and got it working okay. But, I’ve since heard Steve Knudson deliver his NIM sessions several times, and that has really solidified the concepts better than trying to read the manuals. You can see a replay of Steve’s presentation at the AIX Virtual User Group, which I highly recommend.

I recently had to build a new NIM server, so I thought it was a good time to post my cliff notes version.

Install these on the NIM Master:

  • bos.net.tcp.server
  • bos.net.nfs.server
  • bos.sysmgt.nim.master
  • bos.sysmgt.nim.spot

Make these directories:

  • /export
  • /export/61 (or 53, or something representative of your AIX version)

Make these filesystems:

  • /export/61/lpp_6100-01-07
  • /export/61/spot_6100-01-07
  • /export/mksysb

The lpp filesystem will be the lppsource for AIX 6.1 TL1 SP7, and the spot filesystem will be the corresponding SPOT. For AIX 6.1, the lppsource is about 4GB and the spot is about 600MB. The mksysb directory is good for customized base install images and mksysb images.
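
If you’d rather create those filesystems from the command line than from smitty, something like this works (rootvg and the sizes are just my assumptions, adjust to fit):

crfs -v jfs2 -g rootvg -m /export/61/lpp_6100-01-07 -a size=5G -A yes
crfs -v jfs2 -g rootvg -m /export/61/spot_6100-01-07 -a size=1G -A yes
crfs -v jfs2 -g rootvg -m /export/mksysb -a size=10G -A yes
mount /export/61/lpp_6100-01-07
mount /export/61/spot_6100-01-07
mount /export/mksysb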

Initialize the NIM master:

smitty nim

Configure the NIM Environment / Advanced Configuration / Initialize the NIM Master Only

Network name:  something like 192_168_1

Primary Network Install Interface: enX

I also let the clients register themselves, it’s just easier that way.

Create the lppsource:

Insert DVD 1

smitty bffcreate

input device: /dev/cd0

directory for storing software: /export/61/lpp_6100-01-07

extend filesystems: yes

process multiple volumes: yes

You’ll be prompted to change DVDs.

Define to lppsource to NIM:

smitty nim

Configure the NIM Environment / Advanced Configuration / Create Basic Installation Resources /  Create a New LPP_SOURCE Only

Resource SERVER: master

LPP_SOURCE Name: lpp_6100-01-07

LPP_SOURCE Directory: /export/61/lpp_6100-01-07

Create new filesystem for LPP_SOURCE: no

Make the SPOT:

smitty nim

Configure the NIM Environment / Advanced Configuration / Create Basic Installation Resources /  Create a New SPOT Only

Resource SERVER: master

Input device: hit F4 and select your lppsource

SPOT Name: spot_6100-01-07

SPOT Directory: /export/61

Create new filesystem for SPOT: no

And, that’s about it. Just register your clients. I don’t use a hardware address, I just populate the IP. I also use nimsh instead of rsh. Nimsh is the newer protocol, and doesn’t require you to set up a .rhosts file on every client.
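
For reference, registering a client by hand from the master looks something like this (the lpar01 name, the 192_168_1 network, and the chrp/64-bit kernel attributes are all examples; the 0 means no hardware address, and connect=nimsh selects nimsh instead of rsh):

nim -o define -t standalone -a platform=chrp -a netboot_kernel=64 \
    -a if1="192_168_1 lpar01 0" -a cable_type1=tp -a connect=nimsh lpar01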