Disk Cloning With Splitvg

In a recent post, Low-impact database clone with splitvg, Anthony English used the splitvg command to clone a database. I hadn't thought about splitvg since playing with it when it was first announced in the Differences Guide for AIX 5.2 (?). As luck would have it, I was building a new LPAR that is a copy of an already existing LPAR. I don't strictly NEED the files in the filesystems copied to the new LPAR, but I do need the filesystems, and getting the files might save the application analysts some time. So, I decided to break out the old splitvg command.

Luckily, I had a spare LUN already assigned to this LPAR. The first step was simply to extend the VG onto the new disk and run mirrorvg.
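
That part looked roughly like this (hdisk11 was my spare LUN, so adjust the names for your environment; plain mirrorvg syncs in the foreground, which is why it took a while):

extendvg cernervg hdisk11
mirrorvg cernervg hdisk11

After everything was synced up, the split was painless and only took a few seconds: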

splitvg -y copyvg -i -c 2 cernervg

After that, the new VG shows up:

# lspv
...
hdisk10         00043a1267585862                    cernervg        active
hdisk11         00043a1267650a54                    copyvg          active
...

And, you can look at the LVs with lsvg:

# lsvg -l copyvg
copyvg:
LV NAME             TYPE       LPs     PPs     PVs  LV STATE      MOUNT POINT
fscernerlv          jfs2       8       8       1    closed/syncd  /fs/fs/cerner
fscernerwhlv        jfs2       1       1       1    closed/syncd  /fs/fs/cerner/w_standard
...

For some odd reason, the filesystems were created with the prefix "/fs/fs". They should have been created with just the "/fs" prefix, but I'll fix that later anyway.

Then I did a varyoffvg and exportvg of copyvg on the source LPAR, presented the LUN to the target LPAR, and ran cfgmgr on the target.
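
The hand-off was roughly this (the LUN re-mapping in the middle happens on the SAN or VIO side):

varyoffvg copyvg
exportvg copyvg
# ...re-map the LUN to the target LPAR, then on the target:
cfgmgr

After that, the disk showed up on the target with the same PVID as before: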

 # lspv
hdisk0          000439c25388aca6                    rootvg          active
hdisk1          00043a1267650a54                    None

A quick importvg, and we're in business:

# importvg -y cernervg hdisk1

But, the filesystems still have the "/fs/fs" prefix. So, a quick and dirty script cleans that up:

# strip the extra /fs/fs prefix: put each copied filesystem back on its original mount point
for fs in `lsvg -l cernervg | grep fs | awk '{ print $7 }' | cut -d'/' -f 4-`
do
    chfs -m /$fs /fs/fs/$fs
done

And the LVs still have the "fs" prefix. I could leave them, but my OCD won't let me:

# drop the leading "fs" from each LV name (chlv -n newname oldname)
for fs in `lsvg -l cernervg | grep fs | awk '{ print $1 }' | cut -d's' -f 2-`
do
    chlv -n $fs fs$fs
done

Then I used "mount -a" to mount all the filesystems. They had to replay the JFS2 logs, but since they didn't have much in the way of writes going on when I ran the splitvg, they were fine.

Overall, it wasn't a bad way to go. The mirrorvg took a while to complete, and fixing the names for the LVs and filesystems took a little work, but nothing major. It's better than creating all the filesystems by hand.

If I really wasn't concerned about the data, I could have used the savevg and restvg commands to recreate the filesystems onto a blank LUN faster and with less effort.
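
Something along these lines would do it (the image file name and target disk are just examples, and savevg's -r flag, if your level has it, captures just the VG structure without the file data):

savevg -i -f /tmp/cernervg.savevg cernervg     # on the source; -i builds the vgdata file first
restvg -f /tmp/cernervg.savevg hdisk1          # on the target, against the blank LUN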

Limit Sendmail Message Size

I recently had an AIX box send a 1.5 GB email to our MS Exchange email system, which brought Exchange to a screeching halt. Our Exchange admin was understandably unimpressed. So after a few seconds of research, I found that sendmail has a setting to limit the maximum message size. Put this in your sendmail.cf file and restart sendmail:

O MaxMessageSize=50000000

That's in bytes, so that should be 50MB.
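
On AIX the usual way to bounce sendmail is through the SRC; something like this (the -q interval should match whatever your /etc/rc.tcpip uses):

stopsrc -s sendmail
startsrc -s sendmail -a "-bd -q30m"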

VIO Disk Tuning

I set up VIO servers so rarely that it's really easy to miss a step. These are some of the tuning commands I use when setting up VIO. Set these on the VIO server.

Set the FC adapter error recovery policy to fast_fail:

chdev -dev fscsi0 -attr fc_err_recov=fast_fail -perm


Enable dynamic tracking on the FC adapter:

chdev -dev fscsi0 -attr dyntrk=yes -perm


Set the FC adapter link type to point-to-point:

chdev -dev fcs0 -attr init_link=pt2pt -perm


I haven’t found a need to tweak num_cmd_elems, lg_term_dma, or max_xfer_size with 4Gb Fibre Channel.

For each disk to be used as a VTD, disable SCSI reservations:

chdev -dev hdisk1 -attr reserve_policy=no_reserve -perm
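
You can sanity-check the attributes from the padmin shell before rebooting:

lsdev -dev fscsi0 -attr
lsdev -dev hdisk1 -attr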

After these are run, reboot the VIO server. Then set these on the client LPARs.

Set the VSCSI error recovery mode to fast_fail for each VSCSI adapter:

chdev -l vscsi0 -a vscsi_err_recov=fast_fail


Set the VSCSI path timeout for each VSCSI adapter:

chdev -l vscsi0 -a vscsi_path_to=30


Set the healthcheck mode and interval:

chdev -l hdisk0 -a hcheck_mode=nonactive -a hcheck_interval=20
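
If the adapter or disk is busy, chdev may refuse the change; in that case add -P to stage it for the next reboot. You can verify everything with lsattr:

lsattr -El vscsi0
lsattr -El hdisk0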

Monitoring AIX With Munin – Memory

The memory collection script for Munin lists used, free, pinned, and swap in a stacked graph. The problem is that the “used” figure is total used memory, which includes pinned memory, computational pages, and filesystem cache. So the pinned RAM is double-counted. And, to me, it’s very important to know how much RAM is used by filesystem cache. With the default script, a 64GB system with 16GB of pinned memory, 16GB of computational memory, and 32GB of filesystem cache looked like it had 80GB of RAM and was suffering from memory exhaustion, when in reality it’s a 64GB system with only 50% of its memory used by the OS and user programs.

I’ve changed the script for my systems. I added a routine to get the file pages from vmstat and subtract that from the used RAM, and I subtract the pinned memory from used as well. The results closely match what nmon reports, so it seems pretty good to me.
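
The guts of the change are just scraping two counters out of “vmstat -v” and converting 4KB pages to bytes, roughly like this (the variable names are mine, not the stock plugin’s); those values then get subtracted from the plugin’s “used” figure:

file_cache_bytes=`vmstat -v | awk '/file pages/ { printf "%.0f\n", $1 * 4096 }'`
pinned_bytes=`vmstat -v | awk '/pinned pages/ { printf "%.0f\n", $1 * 4096 }'`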

Here’s what the new graph looks like:

And, here’s the updated script: memory

Monitoring AIX With Munin

I’ve had a “capacity planning” system in place for a few years that we wrote in-house. We’ve been running nmon every 5 minutes, pulling all kinds of statistics out, and populating RRD files on a central server. It allows us to see trends and patterns that provide data for planning purchases. It’s not really a tactical monitoring system, more of a strategic planning system. But there are a couple of problems.

First, it spawns an nmon process every 5 minutes. nmon has an annoying habit of touching the volume groups on startup, and if there is a VG modification running, nmon can’t get the info it wants. So if you run a long process, say a reorgvg, the nmon processes just stack up.

The second problem is that it’s a push system. A cron job fires off nmon every 5 minutes. If the network or the central server is down, the jobs that populate our central RRD files stack up.

Munin fixes these problems. First, it doesn’t use nmon. It has a bunch of simple pre-packaged scripts. Each script gets a single set of statistics, a good UNIX approach, and uses just simple OS commands (vmstat, iostat, df, and so on). And it’s a “pull” system: a cron job on the Munin server polls the clients, gets configuration information and the data, creates the RRD files, and generates a set of web pages. If the network or the server is down, it’s no big deal.

The Munin client is written mostly in Perl, and works with the Perl packaged with AIX. You do need some extra modules, mainly Net::Server, from CPAN. The data collection scripts are responsible for checking that they will run successfully, providing RRD configuration information, and returning the actual data. One cool thing is that the RRD configuration is pulled on every execution, and the RRD files are updated accordingly. So if you want to change a data set from a COUNTER type to a GAUGE, just change the script and the server takes care of everything.
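
A plugin is just a script that answers “config” with the graph/RRD settings and prints values otherwise. A stripped-down sketch (not one of the stock plugins) looks something like this:

#!/usr/bin/ksh
case "$1" in
config)
    # RRD/graph configuration, re-read by the Munin server on every poll
    echo "graph_title Run queue"
    echo "graph_vlabel runnable threads"
    echo "runq.label runq"
    echo "runq.type GAUGE"
    ;;
*)
    # the actual data point, pulled with a plain OS command
    echo "runq.value `vmstat 1 2 | tail -1 | awk '{ print $1 }'`"
    ;;
esac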

The downers: I had some problems getting the install script to work; it just needs a lot of modules. So I got it working on one node, then copied all the files to the other nodes. The data collection scripts are okay, but they have some annoyances. The CPU script doesn’t understand shared-processor partitions. The memory script shows used, free, pinned, and swap, so my 64GB system with 16GB of pinned memory and no swap shows that I have 80GB of RAM, and it doesn’t show how much of the “used” RAM is filesystem cache. Luckily the scripts are really simple and easily updated.

Tuning AIX for TSM

DRAFT – I’m still working on it

These are some of the things I’ve been playing with. I’ll update this as I find more changes to make. Your mileage most definitely will vary.

We used to say that TSM is a hog and will use any resource it can: CPU, memory, network I/O, disk I/O, tape I/O, everything. And when you were talking about a system with 1 or 2 GB of RAM, a couple of slow (by today’s standards) processors, and only a few adapters, that was right. With current machines, TSM seems to almost sip memory, and isn’t terrible on CPU. It’s great at moving bits between disk and tape. It’s probably the heaviest disk I/O user in your environment; I know it is in mine. There are a few things I’ve done to try to squeeze some more cycles out of TSM.

If you are running your TSM server on AIX, in my opinion you are WAY ahead of the shops running it on Windows. With UNIX, and AIX in particular, you can tune the OS out of the box in ways that Windows shops can only dream of. You have a lot of control over where your data gets placed on the storage devices, you can tweak hardware parameters for adapters and devices, you can specify how your memory is managed and allocated, and you can monitor your system in great detail. For instance, in our shop the Windows group assigns a disk to a drive letter. If that disk is too hot, chances are they don’t know it, and if they did, there’s not a lot they can do other than break up the data on that drive across other mount points that map to other disks. And that’s disruptive. On AIX, you add a disk to the volume group, make sure the LV is set to maximum inter-disk policy, and run a reorgvg. You just solved your hot disk problem and it wasn’t disruptive at all.
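
That fix is only a few commands; something like this (the VG, LV, and disk names are made up):

extendvg tsmdiskvg hdisk12          # add the new disk to the volume group
chlv -e x diskpool01lv              # inter-physical-volume allocation policy: maximum
reorgvg tsmdiskvg diskpool01lv      # spread the LV across the disks while it stays online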

We have a typical TSM system. The largest server has a 500+GB database (TSM 5.5, not 6.X with real DB2) and backs up 2 to 4 TB a night, depending on the night, to old repurposed DS4K disk. After the backup windows are complete it migrates the data off to tape. At this moment, the system is using around 8GB of RAM… out of 64GB. That’s around 56GB of RAM for FS cache.

Speaking of which, try turning FS cache off. The other admins gave me a weird look when I said that. Think about this: most of the data TSM backs up gets written to disk, then not re-read until it gets migrated or copied to tape. In the disk backup pools you’re doing a ton of sequential I/O, way more than you have memory for. And unless you do server-side de-dupe, you’re not going to re-read the stuff you write before you run out of memory and the system has to burn cycles clearing FS cache pages to make room for more. Now, we don’t want to turn it all the way off, because you need features of FS cache like I/O coalescing, read-ahead, and write-behind. Those sequential I/Os get coalesced together in memory and take fewer cycles to service, increasing throughput. If you look at “vmstat -s” you’ll see way more “start I/Os” than “iodones” because those I/Os get grouped together and serviced as one system call.

Consider mounting your disk pools with the “rbrw” option. This tells the system to use whatever memory pages it needs to fulfill the I/Os, for either reads or writes, then put them back on the free page list for re-use. That way your sequential I/Os still get coalesced, but since you’re not going to re-read the data, you might as well free up the RAM instead of making lrud do it later. If you’re using server-side de-dupe, consider the “rbr” option instead. That tells the system to keep the pages in memory for writes, but release them after reads. Chances are that once the data is written and then re-read for de-dupe you aren’t going to read it again anytime soon. That frees up RAM for the stuff you want to do, and uses your FS cache more efficiently. You may see a higher ratio of usr to sys time in your CPU monitoring tool of choice depending on your workload, which means more real work is getting done.
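
The release-behind options are just mount options, so something like this on the disk pool filesystems (the mount point is an example):

mount -o rbrw /tsm/diskpool01

or, to make it stick across remounts, set it in /etc/filesystems (watch out if you already have other options set there):

chfs -a options=rbrw /tsm/diskpool01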

JFS2 read-ahead is your friend. Check the output of “vmstat -s”. The ratio of “start I/Os” to “iodones” is an indication of how much of your I/O is sequential and how efficient the JFS2 read-ahead and write-behind mechanism is. For TSM you want a pretty high ratio; I’m still tuning this. The parameter I’m working with is the ioo tunable j2_maxPageReadAhead. When the OS reads 2 sequential pages, the next sequential I/O reads in 2 * j2_minPageReadAhead pages, the next after that 4 * j2_minPageReadAhead, and so on until the number of pages reaches j2_maxPageReadAhead. I had this parameter set to 512 and just went to 1024. I’ve been told that I could set it to 2048 if I moved to faster disk, but that I should not exceed that.
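
The change itself is one ioo command; -p applies it now and keeps it across reboots (the value shown is just what I’m running):

ioo -p -o j2_maxPageReadAhead=1024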

Bump up your JFS2 buffers. By default JFS2 is tuned for a mixed workload, so it’s not really set up for intense reads and writes. TSM doesn’t do a lot of mixed workload; it’s a lot of intense sequential I/O. What you want to watch is the “external pager filesystem I/Os blocked with no fsbuf” line of “vmstat -v”. If that goes up very quickly, bump up the “j2_nBufferPerPagerDevice” ioo parameter. It’s a restricted parameter, so you may have to use the -F flag. That will bump up the number of filesystem buffers allocated at filesystem mount time. Mine is currently set to 2048, with a default of 512. Once j2_nBufferPerPagerDevice is exhausted, the system will dynamically allocate extra buffer space to service the I/O. These dynamically allocated buffers are carved out of free memory, then returned to the free list when the system is done with them. The problem is that by default the system allocates 256KB (16 buffers) at a time. So, if everything is at the default, the system burns through 512 buffers doing JFS2 I/O, then to alleviate that it nickels and dimes you 16 buffers at a time. I set the j2_dynamicBufferPreallocation ioo parameter to 128. If the “external pager filesystem I/Os blocked with no fsbuf” counter keeps going up quickly, continue to bump those up. With the default settings TSM can push that counter up really fast. Now I rarely see that counter increment; by contrast, an untuned system with moderate disk I/O adds 5 to 6 digits to it per day.
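
In practice that’s watching one counter and setting two tunables; ioo may warn about the restricted one, and the mount-time buffer count takes effect when the filesystems are remounted (values shown are mine):

vmstat -v | grep fsbuf

ioo -p -o j2_nBufferPerPagerDevice=2048
ioo -p -o j2_dynamicBufferPreallocation=128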

The I/Os get passed off to the LVM layer buffers. Check the global I/Os blocked with lvmo. If they keep increasing, bump the buffers up per VG with the lvmo command until they stop increasing. Remember the number of buffers is per physical volume in the volume group, so adding disks bumps this up too. To see which VGs are exhausting their pbufs, run:

for vg in `lsvg -o`
do
     lvmo -v $vg -a
     echo
done

If you see those increasing, bump up the pv_pbuf_count on the offending VG:

lvmo -v VG -o pv_pbuf_count=XXXX

The last time I did big disk I/O operations, like copying disk to disk or, even better, defining a file volume to TSM, I got about 500MB/s pretty well sustained. That’s heavier I/O with DS4500 disks than I get with my DS8100, and by far the most intense workload on our SAN. Your goal should always be to crush the SAN, at least until we get to SSDs.

Play with mounting your database filesystems with the “cio” option. Your database is probably too big to fit into RAM, and lots of random database operations may be slowed down by going through the JFS2 stack. Mounting your filesystems with the “cio” option is the next best thing to using raw LVs for databases. But some users have found that mounting the database this way actually slows down some operations, like database backups. This is probably because you lose the JFS2 read-ahead feature. I didn’t see much difference either way and am still testing with it. So, give it a go and see what you get.
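
Same mechanics as the release-behind options above (the mount point is an example, and the filesystem has to be remounted for it to take effect):

chfs -a options=cio /tsm/db
mount /tsm/db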

Check your disk queues. During your busiest times, look at the output of “iostat -alDRT 5 60”. Good times are during migrations, big deduplication jobs, expiration, and TSM DB backups. You’re looking for numbers in the last few columns, the “serv qfull” one in particular. This is how many times the queue for that disk was full. That can be a big killer, especially for databases, and most especially if you’re using “cio” on your filesystems. If the disk queue is full, the disk doesn’t accept any more requests. If you’re seeing numbers in this column, try adding more disks from different arrays to the volume group of the offending disk, and redistribute the LVs across all the disks (reorgvg works nicely for this but can take a while). You can also bump up the queue depth per disk with chdev, but the disk must be varied off to make the change:

chdev -l hdiskX -a queue_depth=XX
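
If you can’t take the volume group offline, chdev -P will stage the change for the next reboot instead:

chdev -l hdiskX -a queue_depth=XX -P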

By redistributing our database from 2 disks to 4 disks I got a significant increase in speed for everything TSM does except backups (backups are network bound). Expiration is much faster now, and database backups went from 2.5 hours to about 1.5 hours.

We used to say “there are no TSM problems that can’t be fixed by more tape drives”, and it’s still somewhat true. If you’re still using tape and haven’t gone to a VTL or just sets of RAID disk, more and faster tape drives are good. You want to keep your ratio of tape drives to HBAs at about 3 to 1. That’s a general rule; you can go 4 to 1, especially if you have 8Gb HBAs and 2Gb tape drives. If you have two libraries, one for primary storage pools and one for copy storage pools, mix those tape drives on each HBA. You’re probably writing to your copy storage pools, then your primary ones, so by mixing those drives on an HBA you can get a lower effective ratio. Now, there is a new multi-destination migration feature in TSM 6.2, which may change that math by letting you write the copy and primary storage pools at the same time. Currently, I have 4Gb HBAs and 4Gb LTO3 tape drives in a 3 to 1 ratio, and I’m not pushing the HBAs at all. In fact your SAN admin is probably complaining that your tape drives are “slow drain” devices on the SAN, filling up buffers on the switches.

Find under-utilized times and resources. We have an old TSM installation; we started with ADSM 3.1.2 and our procedures have developed from that early version. We run our nightly backups to disk starting about 5pm until about 4am, then copy them to an off-site tape library, then migrate the disk to on-site tape. When we looked at what our slowest link was, it was the network. We are moving to 10Gb Ethernet, but currently run a 2-link etherchannel with LACP negotiation. When we graphed our network throughput on a normal night, the network stays below its saturation point until around midnight, then flirts with its maximum utilization until just before the end of our backup window. So the link that was costing us the most in terms of time wasn’t being used very efficiently. We are looking at how to re-arrange our nightly schedules to smooth out our network utilization, maybe get a higher effective throughput, and cut down the number of systems that exceed their backup windows. The next slowest resource is the tape drives. And when we looked at how utilized they were, we found they were idle most of the night while the backups were running. What I did was develop a simple script to do disk-to-copy-storage-pool backups every hour, starting an hour before our backup windows begin. That took our morning copy storage pool process down from 4 to 6 hours to less than 1 hour. And it has the side effect of better protecting the data on our disk storage pools.
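
The hourly job is nothing fancy; the heart of it is just an administrative command fired from cron through dsmadmc (the pool names and admin ID here are examples):

dsmadmc -id=scriptadmin -password=xxxxx "backup stgpool DISKPOOL OFFSITEPOOL wait=yes"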

Where do my I/Os go?!?

You may have noticed that in the “vmstat -s” output there are several counters related to I/O.

...
            344270755 start I/Os
             16755196 iodones
...

Here’s what the man page has to say about these counters:

start I/Os
                   Incremented for each read or write I/O request initiated by
                   VMM. This count should equal the sum of page-ins and page-outs.

iodones
                   Incremented at the completion of each VMM I/O request.

There’s something fishy going on here. The system started roughly 20 times more I/Os than it completed. So, what gives? I had a discussion about this once in which another admin explained that I/Os get started but not completed. The thinking goes that the I/Os get put into a queue, and if an I/O isn’t filled quickly enough it times out and retries, sort of like a TCP retransmit for disk I/Os.

But if that happened, you would think there would be an error in the errpt, or errors logged in the device subsystem. At the very least you would see unreasonable I/O service times. And I never see anything like that.

What seems to be going on is that the I/Os get dispatched to the VMM, and if they are sequential they are coalesced into one I/O and serviced by one system call. When the system detects sequential I/O, the next I/O reads in 2 * j2_minPageReadAhead file pages, the next one after that reads in 4 * j2_minPageReadAhead file pages, and so on until it reaches j2_maxPageReadAhead. Each set of I/Os is serviced by one system call, even though it pulls in multiple pages. And that is much better for performance.

So the ratio of start I/Os to iodones is really an indicator of how much sequential I/O the system is doing. And, if the system has lots of sequential I/O, it can be an indicator used to tune JFS2 read-ahead.

And remember, the difference between minfree and maxfree should be at least equal to j2_maxPageReadAhead. So if you make changes, adjust minfree and maxfree accordingly.
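
A quick way to check the relationship (the values below are only an example; yours will differ):

vmo -o minfree -o maxfree
ioo -o j2_maxPageReadAhead

# e.g. with minfree=960 and j2_maxPageReadAhead=1024, keep maxfree at 1984 or higher
vmo -p -o maxfree=1984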

Why I Don’t Care About AIX I/O Wait, And You Probably Shouldn’t Either

I don’t care about I/O Wait time in AIX, at least not a lot. But, I can’t seem to get through to my ex-technical manager or coworkers that I/O Wait largely doesn’t matter.

The thinking goes that I/O wait time reported by nmon or topas is time that the CPU couldn’t do anything else because there were I/Os that weren’t getting satisfied. But the systems in question don’t have any I/O load. They’re middleware servers that take requests from the clients, do some processing, query the database for the data, and then the process happens in reverse. There’s no real disk I/O going on at all; in fact, the disk I/O only spikes to about 20MB/s (on an enterprise-class SAN) when a handful of reports are written to disk at the top of the hour. There’s not a lot of network I/O going on either, maybe a couple of MB/s on a 1Gb network.

So, to dispel the confusion about the disk I/O bottleneck, we ran some disk I/O benchmarks on the SAN. The results came back at several hundred MB/s across the board with a respectable number of IOPS, way more than the system generates during normal operations; at that point the bottleneck is probably the HBA adapters and the system bus.

So, what’s going on? Well, the I/O wait reported by the system is badly named. It’s not time that the system is waiting at all. A bit of an over-simplification: I/O wait time is time that the CPU had nothing to do (idle) while there was an I/O outstanding in the I/O queue. You can generate high I/O wait by having a system with nothing going on but a trickle of I/O activity. Conversely, if you want to reduce I/O wait on the same system, pile on more CPU workload. There will be less time that the CPU is “idle”, but with the same I/O load the I/O wait time will be lower. So, really, you can almost ignore I/O wait time.

So, how do I know if my system is bogged down by I/O? I like vmstat and nmon. Look at the disk adapter I/O in nmon. Knowing your underlying disk architecture, is it a lot of I/O, both in terms of MB/s and IOPS? That tells you how busy your HBAs are. To see how much CPU time is actually being held up by I/O, look at vmstat. I run it with 2 or 3 second intervals; check the man page for more info.

 # vmstat -Iwt 2

System configuration: lcpu=16 mem=95488MB

   kthr            memory                         page                       faults           cpu       time
----------- --------------------- ------------------------------------ ------------------ ----------- --------
  r   b   p        avm        fre    fi    fo    pi    po    fr     sr    in     sy    cs us sy id wa hr mi se
  3   0   2   16182808     152244     0    70     0     0     0      0  4324  45931 20125 26  4 60 10 12:18:43
  3   0   1   16182831     152220     0    16     0     0     0      0  3509  37090 17622 22  3 67  8 12:18:45

You can see I have a 16-lcpu machine (Power6 with SMT on, so it's really 8 CPUs) with a runq of 3. The runq shows the number of runnable threads, that is, how many threads are eligible for a CPU timeslice. On this machine, if the runq is 8 or less, you're golden. If it's 16, eh, it's okay. Over that and there are threads waiting on CPUs to finish processing. The "b" column is threads blocked from running because of I/Os. If you're doing filesystem I/O, that's the column to watch; you want it as low as possible. This system is a DB server doing raw I/O, and the "p" column is just like the "b" column but for raw I/O. For an enterprise DB server I don't get too excited over a few in the "p" column.

How do you reduce the "b" or "p" columns? For filesystem I/O there are some VMM tuning options to try, but it depends on your I/O characteristics. For both filesystem and raw I/O, spread the I/O across more drive heads (maybe you have a couple of hot LUNs or can't push the IOPS you need), add more disk cache (or DB SGA), and possibly install more HBAs if you're really moving a lot of data.

AIX 6.1 Screen Blanking w/ Commands

After upgrading to AIX 6.1, you may notice that some terminal behaviors have changed. SMIT loses its pretty borders when commands execute. When you leave commands like more or vi, the screen clears. And some function keys may not work.

The problem is that IBM decided to change the terminfo entry for xterm. If you want the old behavior, just copy /usr/share/lib/terminfo/x/xterm from any AIX 5.3 system. You can also try setting your TERM type to xterm-old or xterm-r5, but copying the old file works better for me.
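
Something like this does it (the 5.3 host name is just an example; keep a copy of the 6.1 file in case you want it back):

cp /usr/share/lib/terminfo/x/xterm /usr/share/lib/terminfo/x/xterm.aix61
scp aix53host:/usr/share/lib/terminfo/x/xterm /usr/share/lib/terminfo/x/xterm

Or, to test without touching any files, just flip your terminal type for the session:

export TERM=xterm-r5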