DRAFT – I’m still working on it
These are some of the things I’ve been playing with. I’ll update this as I find more changes to make. Your mileage most definitely will vary.
We used to say that TSM is a hog and will use any resource it can: CPU, memory, network I/O, disk I/O, tape I/O, everything. And when you were talking about a system with 1 or 2 GB of RAM, a couple of slow (by today’s standards) processors, and only a few adapters, that was right. With current machines, TSM seems to almost sip memory, and isn’t terrible on CPU. It’s great at moving bits between disk and tape. It’s probably the heaviest disk I/O user in your environment; I know it is in mine. There are a few things I’ve done to try to squeeze some more cycles out of TSM.
If you are running your TSM server on AIX, in my opinion, you are WAY ahead of the shops running it on Windows. With UNIX, and AIX in particular, you can easily tune the OS out of the box in ways that Windows shops can only dream of. You have a lot of control over where your data gets placed on the storage devices, you can tweak hardware parameters for adapters and devices, you can specify how your memory is managed and allocated, and you can monitor your system in great detail. For instance, in our shop the Windows group assigns a disk to a drive. If that disk is too hot, chances are they don’t know it, and if they did, there’s not a lot they can do other than break up the data on that drive and move it to other mount points that map to other disks. And that’s disruptive. On AIX, you add a disk to the volume group, make sure the LV is set to Maximum inter-disk policy, and run a reorgvg. You just solved your hot disk problem and it wasn’t disruptive at all.
We have a typical TSM system. The largest server has a 500+GB database (TSM 5.5, not 6.X with real DB2) and backs up 2 to 4 TB a night, depending on the night, to old repurposed DS4K disk. After the backup windows are complete, it migrates the data off to tape. At this moment, the system is using around 8GB of RAM… out of 64GB. That leaves around 56GB of RAM for FS cache.
Speaking of which, try turning FS cache off. The other admins gave me a weird look when I said that. Think about this: most of the data TSM backs up gets written to disk, then not re-read until it gets migrated or copied to tape. In the disk backup pools you’re doing a ton of sequential I/O, way more than you have memory for. And unless you do server-side de-dupe, you’re not going to re-read the stuff you write before you run out of memory and the system has to burn cycles clearing FS cache pages to make room for more. Now, we don’t want to turn it all the way off, because you need features of FS cache like I/O coalescence, read-ahead, and write-behind. Those sequential I/Os get coalesced together in memory and take fewer cycles to service, increasing throughput. If you look at “vmstat -s” you’ll see way more “start I/Os” than “iodones” because those I/Os get grouped together and serviced as one request. Consider mounting your disk pools with the “rbrw” option. This tells the system to use whatever memory pages it needs to fulfill the I/Os for either reads or writes, then put them back on the free page list for re-use. That way your sequential I/Os get coalesced, but since you’re not going to re-read the data, you might as well free up the RAM instead of making lrud do it later. If you’re using server-side de-dupe, consider using the “rbr” option. That tells the system to keep the pages in memory for writes, but flush them after reads. Chances are that once the data is written and then re-read for de-dupe, you aren’t going to re-read it again anytime soon. That frees up RAM for the stuff you want to do, and uses your FS cache more efficiently. You may see a higher ratio of usr to sys time in your CPU monitoring tool of choice depending on your workload, which means more real work is getting done.
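To put a number on the “start I/Os” versus “iodones” comparison, here is a minimal sketch. The function name io_coalesce_ratio is made up for illustration, and it assumes each counter line in the “vmstat -s” output has the count first, followed by the label.

```shell
# Print the ratio of "start I/Os" to "iodones" from `vmstat -s` output on stdin.
# Assumption: counter lines look like "   1032431 start I/Os" (count first).
io_coalesce_ratio() {
    awk '
        /start I\/Os/ { starts = $1 }
        /iodones/     { dones  = $1 }
        END {
            if (dones > 0) printf "%.2f\n", starts / dones
            else           print "n/a"
        }
    '
}

# Usage on a live AIX system:
#   vmstat -s | io_coalesce_ratio
```

The counters are cumulative since boot, so comparing the ratio before and after a remount with “rbrw” is probably more telling than any single reading.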
JFS2 read-ahead is your friend. Check the output of “vmstat -s”. The ratio of “start I/Os” to “iodones” is an indication of how much of your I/O is sequential and how efficient your JFS2 read-ahead and write-behind mechanism is. For TSM, you want a pretty high ratio. I’m still tuning this, using the ioo parameter j2_maxPageReadAhead. When the OS reads 2 sequential pages, the next sequential I/O reads in 2 * j2_minPageReadAhead pages, the one after that 4 * j2_minPageReadAhead, and so on until the number of pages reaches j2_maxPageReadAhead. I had this parameter set to 512 and just went to 1024. I’ve been told that I could set it to 2048 if I moved to faster disk, but that I should not exceed that.
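The ramp-up described above is easier to see written out. This is only a sketch of the doubling as I described it, not of what the kernel literally does. The function name readahead_ramp is hypothetical, and the example assumes j2_minPageReadAhead is 2.

```shell
# Print the read-ahead sizes (in pages) for successive sequential reads:
# 2*min, 4*min, 8*min, ... capped at max.
# $1 = j2_minPageReadAhead, $2 = j2_maxPageReadAhead
readahead_ramp() {
    ra_min=$1
    ra_max=$2
    if [ "$ra_min" -lt 1 ]; then
        echo "min must be at least 1" >&2
        return 1
    fi
    pages=$((ra_min * 2))
    out=""
    while [ "$pages" -lt "$ra_max" ]; do
        out="$out$pages "
        pages=$((pages * 2))
    done
    echo "${out}${ra_max}"
}

# Example: readahead_ramp 2 1024
# Changing the tunable on AIX (persists across reboots with -p):
#   ioo -p -o j2_maxPageReadAhead=1024
```

With a minimum of 2 and a maximum of 1024, it takes nine sequential reads to reach the full read-ahead size, so the bigger setting only pays off on long sequential runs, which is exactly what TSM disk pools produce.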
Bump up your JFS2 cache. By default JFS2 is tuned for a mixed workload, and even then it’s not really set up for intense reads and writes. TSM doesn’t do a lot of mixed workload. It’s a lot of intense sequential I/O. What you want to look at is the “external pager filesystem I/Os blocked with no fsbuf” line of “vmstat -v”. If that goes up very quickly, bump up the “j2_nBufferPerPagerDevice” ioo parameter. It’s a restricted parameter, so you may have to use the -F flag. That will bump up the number of filesystem buffers allocated at filesystem mount time. Mine is currently set to 2048, with a default of 512. Once the j2_nBufferPerPagerDevice buffers are exhausted, the system will dynamically allocate extra buffer space to service the I/O. These dynamically allocated buffers are carved out of free memory, then returned to the free list when the system is done with them. The problem is that by default the system allocates 256k (16 buffers) at a time. So, if everything is at the default, the system burns through 512 buffers doing JFS2 I/O, then to alleviate that it nickel-and-dimes you 16 buffers at a time. I set the j2_dynamicBufferPreallocation ioo parameter to 128. If the “external pager filesystem I/Os blocked with no fsbuf” counter goes up quickly, continue to bump both of those up. With the default settings TSM can push that counter up really fast. With my settings I rarely see an increment in this counter. By contrast, an untuned system with moderate disk I/O sees it climb by 5 to 6 digits per day.
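Since what matters is how fast that counter grows, not its absolute value, a small helper for sampling it can be handy. This is a rough sketch. The function names fsbuf_blocked and fsbuf_delta are invented for illustration, and it assumes the “vmstat -v” line has the count first, followed by the label.

```shell
# Extract the "external pager filesystem I/Os blocked with no fsbuf" counter
# from `vmstat -v` output on stdin.
fsbuf_blocked() {
    awk '/external pager filesystem I\/Os blocked with no fsbuf/ { print $1 }'
}

# Difference between an earlier ($1) and a later ($2) sample of the counter.
fsbuf_delta() {
    echo $(( $2 - $1 ))
}

# Usage on a live AIX system:
#   before=$(vmstat -v | fsbuf_blocked)
#   sleep 3600
#   after=$(vmstat -v | fsbuf_blocked)
#   fsbuf_delta "$before" "$after"
# If the delta is large, raise the tunables (values here are the ones I use):
#   ioo -p -o j2_nBufferPerPagerDevice=2048 -o j2_dynamicBufferPreallocation=128
```

Run it across a migration or a backup window, since that is when TSM is most likely to run the buffers dry.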
The I/Os then get passed off to the LVM layer buffers (pbufs). Check the blocked I/O counts with lvmo. If they keep increasing, bump up the pbuf count per VG with the lvmo command until they stop increasing. Remember that the number of buffers is per physical volume in the volume group, so adding disks bumps this up too. To see which VGs are exhausting their pbufs, run:
for vg in `lsvg -o`
do
    lvmo -v $vg -a
    echo
done
Then raise the pbuf count for the VG that is running short:
lvmo -v VG -o pv_pbuf_count=XXXX
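With more than a handful of volume groups, it can help to filter that output down to the VGs that are actually blocking. A minimal sketch, assuming “lvmo -a” prints “name = value” lines that include vgname and pervg_blocked_io_count. The function name vg_blocked is hypothetical.

```shell
# Read `lvmo -v VG -a` output on stdin and print "VG count" only when the
# per-VG blocked I/O count is non-zero.
# Assumption: lines look like "pervg_blocked_io_count = 4211".
vg_blocked() {
    awk -F' = ' '
        $1 == "vgname"                 { vg = $2 }
        $1 == "pervg_blocked_io_count" { if ($2 + 0 > 0) print vg, $2 }
    '
}

# Usage on a live AIX system:
#   for vg in $(lsvg -o); do lvmo -v "$vg" -a | vg_blocked; done
```

Any VG that shows up in two runs with a higher number the second time is a candidate for a larger pv_pbuf_count.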
The last time I did big disk I/O operations, like copying disk to disk or, even better, defining a file volume to TSM, I got about 500MB/s pretty well sustained. That’s heavier I/O with DS4500 disks than I get with my DS8100, and by far the most intense workload on our SAN. Your goal should always be to crush the SAN, at least until we get to SSDs.
Play with mounting your database with the “cio” option. Your database is probably too big to fit into RAM, and lots of random database operations may be slowed down by going through the JFS2 stack. Mounting your filesystems with the “cio” option is the next best thing to using RAW LVs for databases. But, some users have found that mounting the database this way actually slows down some operations, like database backups. This is probably because you lose the JFS2 read-ahead feature. I didn’t see much difference either way and am still testing with it. So, give it a go and see what you get.
Check your disk queues. During your busiest times, look at the output of “iostat -alDRT 5 60”. Good times to check are during migrations, big deduplication jobs, expiration, and TSM DB backups. You’re looking for numbers in the last few columns, the “serv qfull” one in particular. This is how many times the disk queue for the offending disk was full. That can be a big killer, especially on databases, and most especially if you’re using “cio” on your filesystems. If the disk queue is full, the disk doesn’t accept any more requests. If you’re seeing numbers in this column, try adding more disks from different arrays to the volume group of the offending disk, and redistribute the LVs across all the disks (reorgvg works nicely for this but can take a while). You can also bump up the queue depth per disk with chdev, but the disk can’t be in use, so its volume group must be varied off to make the change (or use the -P flag to have the change take effect at the next reboot):
chdev -l hdiskX -a queue_depth=XX
By redistributing our database from 2 disks to 4 disks I got a significant increase in speed for everything TSM does except backups (backups are network bound). Expiration is much faster now, and database backups went from 2.5 hours to about 1.5 hours.
We used to say “there are no TSM problems that can’t be fixed by more tape drives”, and it’s still somewhat true. If you’re still using tape and haven’t gone to VTL or just sets of RAID disk, more and faster tape drives are good. You want to keep your ratio of tape drives to HBAs at about 3 to 1. That’s a general rule; you can go 4 to 1, especially if you have 8Gb HBAs and 2Gb tape drives. If you have two libraries, one for primary storage pools and one for copy storage pools, mix those tape drives on each HBA. You’re probably writing to your copy storage pools, then your primary ones, so by mixing those drives on an HBA you can get a lower effective ratio. Now, there is a new multi-destination migration feature in TSM 6.2 that may upset that by letting you do the copy storage pool backup and the primary storage pool migration at the same time. Currently, I have 4Gb HBAs and 4Gb LTO3 tape drives in a 3 to 1 ratio, and I’m not pushing the HBAs at all. In fact, your SAN admin is probably complaining that your tape drives are “slow drain” devices on the SAN, filling up buffers on the switches.
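The drive-to-HBA ratio comes down to simple arithmetic, sketched here. The function name hba_load_pct is made up, and the figures in the example are assumptions, not measurements: 80 MB/s is the LTO3 native (uncompressed) rate, and roughly 400 MB/s is a common rule of thumb for a 4Gb Fibre Channel port.

```shell
# Rough percentage of an HBA's bandwidth used when every drive behind it
# is streaming at full speed.
# $1 = drives per HBA, $2 = MB/s per drive, $3 = usable MB/s per HBA
hba_load_pct() {
    echo $(( $1 * $2 * 100 / $3 ))
}

# Example with the assumed figures above:
#   hba_load_pct 3 80 400    # three LTO3 drives on one 4Gb HBA
```

Well-compressible data lets a drive stream faster than its native rate, so treat the result as a floor, not a ceiling, when deciding whether 4 to 1 is safe.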
Find under-utilization times and resources. We have an old TSM installation; we started with ADSM 3.1.2, and our procedures have developed from that early version. We run our nightly backups to disk starting about 5pm until about 4am, then copy them to an off-site tape library, then migrate the disk to on-site tape. When we looked at what our slowest link was, it was the network. We are moving to 10Gb ethernet, but currently run a 2 link etherchannel with LACP negotiation. When we graphed our network throughput on a normal night, the network stayed below its saturation point until around midnight, then flirted with its maximum utilization until just before the end of our backup window. So, the link that was costing us the most in terms of time wasn’t being used very efficiently. We are looking at how to re-arrange our nightly schedules to smooth out our network utilization, maybe get a higher effective throughput, and keep down the number of systems that exceed their backup windows. The next slowest resource is the tape drives. And when we looked at how utilized they were, we found they were idle most of the night while the backups were running. What I did was develop a simple script to do disk to copy storage pool backups every hour, starting an hour before our backup windows begin. That took our morning copy storage pool process from 4 to 6 hours down to less than 1 hour. And it has the side-effect of better protecting the data on our disk storage pools.
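For anyone who wants to try the hourly copy, here is a minimal sketch of that kind of script. It is not the script we actually run. The pool names DISKPOOL and COPYPOOL, the TSM_ADMIN and TSM_PASSWORD variables, and the DRY_RUN switch are all placeholders for illustration.

```shell
# Build the TSM admin command that copies a primary pool to a copy pool.
# $1 = primary (disk) pool, $2 = copy pool, $3 = max processes (default 1)
backup_stgpool_cmd() {
    printf 'backup stgpool %s %s maxprocess=%s wait=yes\n' "$1" "$2" "${3:-1}"
}

# Run the command through dsmadmc, or just print it when DRY_RUN is set.
# TSM_ADMIN and TSM_PASSWORD are assumed to be set by the caller.
run_copy() {
    cmd=$(backup_stgpool_cmd "$@")
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$cmd"
    else
        dsmadmc -id="$TSM_ADMIN" -password="$TSM_PASSWORD" -dataonly=yes "$cmd"
    fi
}

# Usage:
#   DRY_RUN=1 run_copy DISKPOOL COPYPOOL 2
```

Scheduled from cron once an hour across the backup window, each run only has to copy what arrived since the previous run, which is why the morning copy job shrinks so much.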