Static DHCP IPs with KVM Virtualization

When building a virtualization lab system, I’ve found that I want static IPs assigned to my guests. You could just assign static IPs in the guest OS, but then you have to document which IPs are in use by which hosts. It would be easier to just assign static IP entries in the DHCP server. There doesn’t seem to be a straightforward way to get this done.

What I’ve found works is to destroy the network, edit it directly, and then restart it.

[root@m77413 libvirt]# virsh -c qemu:///system net-destroy default
Network default destroyed

[root@m77413 libvirt]# virsh -c qemu:///system net-edit default
Network default XML configuration edited.

[root@m77413 libvirt]# virsh -c qemu:///system net-start default
Network default started

The XML file entries should look like:

  <ip address='192.168.122.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.122.2' end='192.168.122.254' />
      <host mac='52:54:00:10:6e:17' name='cent-install.test' ip='192.168.122.2' />
      <host mac='52:54:00:ab:10:2a' name='cent-netserver.test' ip='192.168.122.3' />
      <host mac='52:54:00:df:47:95' name='install.test' ip='192.168.122.10' />
    </dhcp>
  </ip>
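
Once the network is restarted, you can confirm the entries took effect by dumping the live definition. The guest name in the second command is just an example; use one of your own:

[root@m77413 libvirt]# virsh -c qemu:///system net-dumpxml default
[root@m77413 libvirt]# virsh -c qemu:///system dumpxml cent-install | grep 'mac address'

The first command shows the running network XML with your host entries; the second pulls a guest’s MAC address so you can paste it into a host line.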

TSM Windows Client OS Level Demystified

The client OS level field in TSM is pretty straightforward for most operating systems. On Linux it’s the kernel version, and HP-UX and AIX show a recognizable OS level. For Windows, the OS level is more cryptic. Here is a list of the OS levels:

Operating System         Client OS Level
Windows 95               4.00
Windows 98               4.10
Windows ME               4.90
Windows NT 4.0           4.00
Windows 2000             5.00
Windows XP               5.01
Windows Server 2003      5.02
Windows Vista            6.00
Windows Server 2008      6.00
Windows Server 2008 R2   6.01
Windows 7                6.01
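
If you want to check the level for all of your nodes at once, the server’s SQL interface works well. Here’s a sketch (I believe the NODES table exposes this as CLIENT_OS_LEVEL, and Windows clients report a platform of WinNT):

tsm: TSM>select node_name, platform_name, client_os_level from nodes where platform_name='WinNT'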

OK, I’m Converted. I Like AIXPert

As usual, I’m late to the party. I was at the Power Systems Technical University in San Antonio several years ago (an awesome venue), and there was a session on the new AIXPert feature of AIX 6.1 (later back-ported to 5.3). At the time I thought it was clunky and wasn’t too excited about it.

It’s really just a bunch of pre-packaged shell scripts that are defined in an XML file you need to manage manually (the horror). You run the master aixpert command and specify what XML file you want. You can go with AIX Defaults, Low, Medium, High, and SOX (if you’re into that kind of thing). When you run aixpert, it applies whatever settings are associated with the level you selected. Here is where I usually yawn. It’s just not that exciting. Everyone pretty much does that with a script or text file or some corporate documentation (probably written by someone without a clue, E&Y I’m looking at you). Yes, there are a couple of ways to get to a GUI, but it’s really more manageable for me with the command line.

The bit that gets tedious is that no one definition is going to fit every shop. So you have to export the standard definitions to a custom XML file, open it up, and hack it by hand. Systems like XML; I don’t really care to read it all day long. But it’s not that difficult. Here’s an example:

<AIXPertEntry name="cust_udp_recvspace" function="udp_recvspace">
    <AIXPertRuleType type="LLS"/>
    <AIXPertDescription>Network option udp_recvspace: Set network option udp_recvspace's value to 655360</AIXPertDescription>
    <AIXPertPrereqList>bos.rte.security,bos.rte.shell,bos.rte.ILS,bos.net.tcp.client,bos.rte.commands,bos.rte.date</AIXPertPrereqList>
    <AIXPertCommand>/etc/security/aixpert/bin/ntwkopts</AIXPertCommand>
    <AIXPertArgs>udp_recvspace=655360 s cust_udp_recvspace</AIXPertArgs>
    <AIXPertGroup>Tune network options</AIXPertGroup>
</AIXPertEntry>

Nothing too fancy there. It’s mostly fluff; the PrereqList, Command, and Args options are the important ones, and the rest is more for the user than anything else. When you apply that XML, the system runs the shell script which sets the appropriate network option. It’s all fairly simple.

The cool part is that most of the things set with aixpert can also be checked with aixpert. When you apply an XML file, aixpert saves the rules that applied to /etc/security/aixpert/core/appliedaixpert.xml. If I modify that setting and run “aixpert -c”, aixpert parses the appliedaixpert.xml file and checks things out. This is what I get:

# aixpert -c
do_action(): rule(cust_udp_recvspace_CA6BE6C2) : failed.
Processedrules=66       Passedrules=65  Failedrules=1   Level=AllRules
        Input file=/etc/security/aixpert/core/appliedaixpert.xml

To set the world right again, you just re-apply your XML file. I found a minor issue here: I’ve had to remove the /etc/security/aixpert/core/appliedaixpert.xml file before applying a new one. Otherwise you can end up with the same rule in there repeatedly; why IBM doesn’t offer a command-line switch to handle that, I don’t know.
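
A minimal sketch of that re-apply workflow, assuming your custom profile lives somewhere like /etc/security/aixpert/custom/mysettings.xml:

# rm /etc/security/aixpert/core/appliedaixpert.xml
# aixpert -f /etc/security/aixpert/custom/mysettings.xml

The -f flag applies the rules from the XML file you hand it, and the applied rules get written to appliedaixpert.xml again.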

Another cool thing: you can undo the changes applied by the built-in aixpert rules. When aixpert applies a setting, it writes an undo rule to /etc/security/aixpert/core/undo.xml. Then, running “aixpert -u” will undo what you’ve already done. I would probably purge that file once in a while too, so that you can recover to a known good state.

So, just wrap a dirt-simple cron script around it to notify when something goes wrong… Something like this:

#!/usr/bin/perl
# Run "aixpert -c" and mail the output to an admin if any rules failed.

$EMAIL_ADDRESS = "user\@domain.net";
@TEMP = `hostname`;
$HOSTNAME = $TEMP[0];
chomp($HOSTNAME);
$MAIL_FROM_ADDR = "aixpert\@$HOSTNAME";

@OUTPUT = `/usr/sbin/aixpert -c 2>&1`;
@REPORT = grep(/^Processedrules.*/, @OUTPUT);

# Collapse whitespace so the summary line splits cleanly on '=':
# Processedrules=66=Passedrules=65=Failedrules=1=Level=AllRules
$REPORT[0] =~ s/\s+/=/g;
($null, $PROCESSED, $null, $PASSED, $null, $FAILED, $null, $LEVEL) = split(/=/, $REPORT[0]);

if ( $FAILED > 0 ) {
        open (MAIL, "| /usr/sbin/sendmail -t ");
        select (MAIL);

        print "Mime-Version: 1.0\n";
        print "Content-type: text/html; charset=\"iso-8859-1\"\n";
        print "From: $MAIL_FROM_ADDR\n";
        print "To: $EMAIL_ADDRESS <$EMAIL_ADDRESS>\n";
        print "Subject: Aixpert $HOSTNAME: $FAILED test(s) FAILED\n";
        print "<html><head></head><body>\n";
        print "<pre>\n";

        print "@OUTPUT";

        print "</pre>\n";
        print "</body></html>\n";
        select (STDOUT);
        close (MAIL);

        exit 1;
}
exit 0;
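
To run the check nightly, just add a crontab entry; the path here is wherever you saved the script (mine is hypothetical):

0 6 * * * /usr/local/bin/aixpert_check.pl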

That should keep the auditors happy. There are enough basic security settings in the default XML files that with a little tweaking you can hit all or very nearly all of your security audit queries.

That’s all well and good, as far as that goes, but what really made me like aixpert is that it gives you a very simple framework to apply your own settings, and make sure those settings are correct. If you distribute an XML file and a few scripts around your enterprise, you can ensure that those settings are standardized across hosts too.

Here’s a simple script to make sure the attributes of your vscsi devices are correct:

#!/usr/bin/perl
# Check, or fix, the attributes of all vscsi adapters.
# aixpert sets AIXPERT_CHECK_REPORT=1 when it runs a check ("aixpert -c").

@ADAPTERS = `lsdev -c adapter -Sa | grep -E "^vscsi" | awk '{ print \$1 }'`;
$REPORT = $ENV{'AIXPERT_CHECK_REPORT'};

# Attribute => desired value
%ATTRIBUTES = ("vscsi_err_recov", "fast_fail",
                "vscsi_path_to", 30 );

if ( $REPORT == 1 ) {
        # Check mode: report any attribute that doesn't match and exit non-zero.
        for (@ADAPTERS) {
                chomp($_);
                $ADAPT = $_;
                @TEMP = `lsattr -El $ADAPT | awk '{ print \$1, \$2 }'`;
                for (@TEMP) {
                        ($ATTR, $VALUE) = split(/\s+/, $_);

                        if ( $ATTRIBUTES{$ATTR} && $VALUE ne $ATTRIBUTES{$ATTR} ) {
                                print "$ADAPT attribute $ATTR is $VALUE, should be $ATTRIBUTES{$ATTR}\n";
                                $FAIL++;
                        }
                }
        }
        if ( $FAIL ) {
                exit 1;
        }
} else {
        # Apply mode: reset any attribute that doesn't match the desired value.
        for (@ADAPTERS) {
                chomp($_);
                $ADAPT = $_;
                @TEMP = `lsattr -El $ADAPT | awk '{ print \$1, \$2 }'`;
                for (@TEMP) {
                        ($ATTR, $VALUE) = split(/\s+/, $_);

                        if ( $ATTRIBUTES{$ATTR} && $VALUE ne $ATTRIBUTES{$ATTR} ) {
                                system("/usr/sbin/chdev -l $ADAPT -a $ATTR=$ATTRIBUTES{$ATTR} > /dev/null 2>&1");
                        }
                }
        }
}
exit 0;
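
To test it by hand before handing it over to aixpert, you can mimic check mode by setting the environment variable yourself (the path matches the stanza below). With one adapter out of spec, the output looks like:

# AIXPERT_CHECK_REPORT=1 /etc/security/aixpert/custom/vscsi_config.pl
vscsi0 attribute vscsi_err_recov is delayed_fail, should be fast_fail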

Using a script like this as a template, you can check and correct the value of any number of system attributes. Adding it to aixpert is a breeze: just add a stanza to your XML file:

<AIXPertEntry name="cust_vscsi_config" function="vscsi_config">
    <AIXPertRuleType type="MLS"/>
    <AIXPertDescription>Resets attributes of the vscsi devices</AIXPertDescription>
    <AIXPertPrereqList></AIXPertPrereqList>
    <AIXPertCommand>/etc/security/aixpert/custom/vscsi_config.pl</AIXPertCommand>
    <AIXPertArgs>GENERIC</AIXPertArgs>
    <AIXPertGroup>Custom Rules</AIXPertGroup>
</AIXPertEntry>

Once you decide to get into it, aixpert is a pretty nice little tool. There’s a great little movie created by Nigel Griffiths on the developerWorks website to get you started too!

But, I still don’t care for Systems Director. 🙂

CODBL0004W in IBM License Metric Tool

After installing the IBM License Metric Tool, you might see:
CODBL0004W
Essential periodic calculations did not occur when expected. The last day processed is Apr 25, 2011 while it should be Apr 29, 2011.

By default the tool processes the data collected 2 days prior, so you’ll see the specified dates are a few days old. IBM wants you to collect a bunch of data and open a ticket, but you may be able to correct this yourself. In “CODIF8140E Essential periodic calculations did not occur when expected”, IBM tells you that it’s probable the TLMSRV user doesn’t have the correct privileges on the database, and to turn on debugging and send the logs to IBM. At the bottom of the page, it tells you what is actually needed:
Direct CREATETAB authority = YES 
Direct BINDADD authority = YES 
Direct CONNECT authority = YES

You can save yourself a lot of time and check it yourself. 'su' to your db2 user (probably db2inst1), run db2, and connect to the database as the tlmsrv user:
db2 => connect to TLMA user TLMSRV using XXXXXXX

   Database Connection Information

 Database server        = DB2/AIX64 9.7.0
 SQL authorization ID   = TLMSRV
 Local database alias   = TLMA
Then run 'get authorizations':
db2 => get authorizations

 Administrative Authorizations for Current User

 ...
 Direct CREATETAB authority                 = NO
 Direct BINDADD authority                   = YES
 Direct CONNECT authority                   = YES
...
 Indirect CREATETAB authority               = YES
 Indirect BINDADD authority                 = YES
 Indirect CONNECT authority                 = YES
...

If any of those direct and indirect privileges say NO, you can grant the privilege to the user. If the privileges are OK, you can skip this step. First, re-connect to the database to get the necessary privileges:
db2 => connect to TLMA

   Database Connection Information

 Database server        = DB2/AIX64 9.7.0
 SQL authorization ID   = DB2INST1
 Local database alias   = TLMA

Then grant the appropriate privilege:

db2 => grant CREATETAB on database to TLMSRV
DB20000I  The SQL command completed successfully.
db2 => grant BINDADD on database to TLMSRV
DB20000I  The SQL command completed successfully.
db2 => grant CONNECT on database to TLMSRV
DB20000I  The SQL command completed successfully.

My Privileges are OK, what do I do now?


In IZ98530: NO INFO: CPU CORE ON PARTITION/LOGICAL CPU CORE ON PARTITION, there is a note that this can also be caused by known issues in populating the database during installation. This SHOULD be fixed in version 7.2.2, but you can take some simple steps to correct the issue yourself.

First, login to the database as above and reset the Tier Table Version field:

db2 => connect to TLMA

   Database Connection Information

 Database server        = DB2/AIX64 9.7.0
 SQL authorization ID   = DB2INST1
 Local database alias   = TLMA

db2 => update adm.CONTROL set value = '2010-10-01' where name = 'TIER_TABLE_VERSION'
DB20000I  The SQL command completed successfully.

Now go to the ILMT web page and import a new Tiers XML file. Navigate to License Metric Tool -> Administration -> Import Systems Tier Table -> Import. I manually click on the link provided there, download the file, then import that file via the provided form. If you have Internet access set up, you can have the tool download it for you instead.

Now stop the ILMT server and go back to DB2. We need to clear out a table; the command on the website isn't quite right. Here's what I used:

db2 => connect to tlma

   Database Connection Information

 Database server        = DB2/AIX64 9.7.0
 SQL authorization ID   = DB2INST1
 Local database alias   = TLMA

db2 => select * from adm.PRD_AGGR_TIME
...
  3 record(s) selected.

db2 => delete from adm.PRD_AGGR_TIME
DB20000I  The SQL command completed successfully.
db2 => select * from adm.PRD_AGGR_TIME
...
  0 record(s) selected.

While we're in there, we need to reset the LAST_AGGREGATE_STEP field:

db2 => update adm.CONTROL set value = '0' where name = 'LAST_AGGREGATE_STEP'
DB20000I  The SQL command completed successfully.

Now restart the ILMT processes, wait 24 hours, and see if the problem goes away. If not, you're back to turning debugging up and calling IBM. But hopefully it won't come to that.

Encrypting A Default Username For stopServer.sh

By default ILMT, and WebSphere in general, asks you for a password when running stopServer.sh if security is enabled. That’s nice if you don’t trust your users. But, if you have a secured system you may not want to have to look up the user ID the once or twice a year you bring down WebSphere.

On a new install of ILMT, with the bundled WebSphere server, all you have to do is edit the soap.client.props file:

vi /opt/IBM/LMT/eWAS/profiles/AppSrv01/properties/soap.client.props

And fill in the com.ibm.SOAP.loginUserid and com.ibm.SOAP.loginPassword properties. The entries look like this (the user ID here is a hypothetical admin account; use your own):
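
com.ibm.SOAP.loginUserid=wasadmin
com.ibm.SOAP.loginPassword=MyPlainTextPassword

Now stopServer.sh shouldn’t prompt you anymore for a username and password. The downside of this is that the password is in plain text. IBM provides a tool to encrypt it; just run the tool against the file and the property to encrypt: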

/opt/IBM/LMT/eWAS/profiles/AppSrv01/bin/PropFilePasswordEncoder.sh /opt/IBM/LMT/eWAS/profiles/AppSrv01/properties/soap.client.props com.ibm.SOAP.loginPassword

Your directories may be different, but you get the idea. The tool will ask if you want to make a backup of the file first, and then encrypt the com.ibm.SOAP.loginPassword property.
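
After the tool runs, the password property is stored as an {xor}-encoded string instead of plain text. The encoded value below is just an illustration:

com.ibm.SOAP.loginPassword={xor}LDo8LTor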

Smit 1800-109 Error With Printers

I’ve recently found some of our systems have corrupt smit screens when looking at printer queue characteristics. When looking at any options under “smit chpq” for some of the printers, we got:

 1800-109 There are currently no additional
SMIT screen entries available for this
item.  This item may require installation of
additional software before it can be accessed.

The message clearly points to missing filesets. But printers.rte, bos.rte.printers, and the printer device filesets (like printers.hplj-4si.rte) were all installed and up to date. The problem is that the ODM stanzas for the printers aren’t correct. The queue subsystem looks at files under /var/spool/lpd/pio/@local to do the printing, but smit looks in the ODM.

So, there’s a quick fix. Find the files for the offending printer:

ls /var/spool/lpd/pio/@local/custom | grep queuename
queuename:hp@queuename
queuenameps:hp@queuename

Then just run the piodigest command to read in the colon file and recreate the ODM stanzas:

/usr/lib/lpd/pio/etc/piodigest /var/spool/lpd/pio/@local/custom/queuename:hp@queuename
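
If several queues are affected, a quick loop over the custom directory rebuilds them all. This is an untested sketch, so eyeball the filenames first:

for f in /var/spool/lpd/pio/@local/custom/*
do
        /usr/lib/lpd/pio/etc/piodigest "$f"
done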

After that, the smit screens were available again.

Scheduling TSM VMware Backups

UPDATE: I got some feedback that some people are not clear on creating multiple schedulers, so I’m updating this post.

Since I wrote about doing TSM Full-VM Image Backups With VStore API, I’ve done more testing and have put it into “production” with our smaller VMWare cluster. This cluster is managed by one department inside IT. It has some VMs used by customers inside the enterprise, some VMs used for development or testing, and some VMs used by the group inside IS. Now that I’m ready to put VMWare backups into practice, I need to schedule the backups. And, I got some great feedback on the first post, so I thought I would follow it up.

First we need to set up the client, and here we have a few options. If we have few enough VMs that we can back up all of them in one backup window, then we just specify the VMs in the client options and set up a schedule. If, on the other hand, we have too many VMs to back up in one backup window, then we need to set up multiple schedules with multiple schedulers and options files. This is similar to how you would set up Data Protection for MS SQL backups. If you have divided your VMs into folders, you can specify a folder of VMs to back up for each scheduler in the options file. Otherwise, you probably want to specify the VMs individually for each scheduler.

There is another option: you can assign a whole ESX host to each scheduler. I discounted this because a VMotion event would throw off where the VMs get backed up to inside TSM. In my environment, we would be sure to end up with multiple unnecessary backups that would have to be cleaned up manually.

Because this is a smaller test system, there are a lot of VMs that I don’t want to back up regularly, and I can get it all backed up in one backup window. So, I specified to back up “ALL-VM”s in the GUI, followed by each VM I wanted to exclude.

This, of course, just adds some lines to the dsm.opt file:

VMMC "VM_IMAGES"
DOMAIN.VMFULL "ALL-VM;-VM=crcamzm1-bkup;-VM=DCFM;-VM=dmzftp;-VM=ftp..."

If you're going to run multiple schedulers, at this point you can copy the dsm.opt file to a new file and use that to setup your scheduler services. Otherwise, just restart the scheduler service and you're in business. Also of note, you can schedule file level backups of Windows VMs with the VStorage API similarly just by changing the options.

When I looked at the server side, I was happy to note that Tivoli has added a VM subaction to the Backup action. So, I created a schedule that ran once a week using the enhanced schedule options:

tsm: TSM>def sched standard mmc-opsvc1 action=backup subaction=vm schedstyle=enhanced startt=18:00 DAY=Sunday
ANR2500I Schedule MMC-OPSVC1 defined in policy domain STANDARD.

If you have too much data to back up in one backup window, you can break up the backups into multiple schedules that run either at the same time (you could probably run 2 at a time to increase network utilization) or in different backup windows. First, create a node on the TSM server for each schedule and then assign it to a backup schedule:

tsm: TSM>register node mmc-opsvc1_vm1 PASSWORD userid=none
ANR2060I Node MMC-OPSVC1_VM1 registered in policy domain STANDARD.

tsm: TSM>register node mmc-opsvc1_vm2 PASSWORD userid=none
ANR2060I Node MMC-OPSVC1_VM2 registered in policy domain STANDARD.

tsm: TSM>def sched standard mmc-opsvc1_vm1 action=backup subaction=vm schedstyle=enhanced startt=18:00 DAY=Sunday
ANR2500I Schedule MMC-OPSVC1_VM1 defined in policy domain STANDARD.

tsm: TSM>def sched standard mmc-opsvc1_vm2 action=backup subaction=vm schedstyle=enhanced startt=18:00 DAY=Monday
ANR2500I Schedule MMC-OPSVC1_VM2 defined in policy domain STANDARD.

tsm: TSM>def assoc standard mmc-opsvc1_vm1 mmc-opsvc1_vm1
ANR2510I Node MMC-OPSVC1_VM1 associated with schedule MMC-OPSVC1_VM1 in policy domain STANDARD.

tsm: TSM>def assoc standard mmc-opsvc1_vm2 mmc-opsvc1_vm2
ANR2510I Node MMC-OPSVC1_VM2 associated with schedule MMC-OPSVC1_VM2 in policy domain STANDARD.

On the client, create separate dsm.opt files for each nodename. The VMCHost, VMCUser and VMCPW options will need to be changed for your environment. I put them in with the GUI, then copied and edited the dsm.opt file manually. Here's a basic example; you can specify other options for your environment. Notice that I specified two VMFOLDERs instead of the "ALL-VM" option:

C:\Program Files\Tivoli\TSM\baclient>type dsm_vm1.opt
NODENAME         MMC-OPSVC1_VM1
TCPSERVERADDRESS tsm2.mhc.net
TCPPORT          1502
ERRORLOGRETENTION 7 D
SCHEDLOGRETENTION 7 D
PASSWORDACCESS GENERATE
COMPRESSION NO
DEDUPLICATION NO
SCHEDMODE POLLING

VMCHOST mmc-opsvc1.ad.mhc.net
VMCUSER pvaughan
VMCPW ****
VMBACKUPTYPE FULLVM
VMFULLTYPE VSTOR
VMMC "VM_IMAGES"
DOMAIN.VMFULL "VMFOLDER=Ambulatory Apps;VMFOLDER=Amcom;"

Once you have the dsm.opt files in place, you just need to register a scheduler service for each file. Here's an example of one:

C:\Program Files\Tivoli\TSM\baclient>dsmcutil install /name:"TSM Scheduler VM1"
/node:"MMC-OPSVC1_VM1" /password:"PASSWORD" /optfile:"dsm_vm1.opt" /startnow:Yes
...
Installing TSM Client Service:

       Machine          : MMC-OPSVC1
       Service Name     : TSM Scheduler VM1
       Client Directory : C:\Program Files\Tivoli\TSM\baclient
       Automatic Start  : no
       Logon Account    : LocalSystem
The service was successfully installed.
...
Authenticating TSM password for node MMC-OPSVC1_VM1 ...

Connecting to TSM Server via client options file 'C:\Program Files\Tivoli\TSM\baclient\dsm_vm1.opt' ...

Password authentication successful.

The registry password for TSM node MMC-OPSVC1_VM1 has been updated.

Starting the 'TSM Scheduler VM1' service ...

The service was successfully started.

Another nice thing that Tivoli did is make each set of image backups for a specific VM a filespace in TSM:

tsm: TSM>q file mmc-opsvc1

Node Name           Filespace       FSID     Platform     Filespace     Is Files-        Capacity       Pct
                    Name                                  Type            pace               (MB)      Util
                                                                        Unicode?
---------------     -----------     ----     --------     ---------     ---------     -----------     -----
MMC-OPSVC1          \VMFULL-HMC        1     WinNT        API:TSMVM        No                 0.0       0.0
MMC-OPSVC1          \VMFULL-DC-        2     WinNT        API:TSMVM        No                 0.0       0.0
                     FM
MMC-OPSVC1          \VMFULL-ftp        3     WinNT        API:TSMVM        No                 0.0       0.0

As you can see, a filespace is created in the format of "\VMFULL-VMNAME". If you need to remove the backups for a VM because it's been decommissioned or moved, you can simply delete the filespace.
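
For example, to drop the backups for the DCFM VM from the listing above (verify the exact filespace name with "query filespace" first):

tsm: TSM>delete filespace mmc-opsvc1 \VMFULL-DCFM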

VMWare Datastore Sizing and Locking

I had a recent discussion with a teammate about VMWare datastores. We are using thin provisioning on an ESXi 4.1 installation backed by IBM XIV storage.

In our previous installation we ran ESX 3.X backed by DS4000 disk. What we found out is that VMs grow like weeds and our datastores quickly filled up. This admin just resized the datastores and we went on our way. A technical VMWare rep afterward mentioned that while it is supported, adding extents to VMFS datastores isn’t necessarily best practice.

When we laid down our new datastores, I wanted to avoid adding extents, so made the LUNs 1 TB. That’s as big as I dared to avoid using extents in datastores, but is probably too big for our little installation.

I noticed that our datastores were getting to about 90% utilized, so I added a new LUN and datastore. When I mentioned in our team meeting that I had added a datastore we had a somewhat heated discussion. My teammate really wanted to resize the LUN and add extents to the datastore. I pointed out that I didn’t think that was the best practice and 3 or 4 datastores isn’t really a lot to manage.

So, why not just use one datastore per storage array? The big argument seems to be that people add a second LUN, then extend the datastore to the new LUN. The down-side of this is that if one LUN goes off-line, all the associated data is unavailable. VMWare will try to keep all the data for each VM on one extent, but it’s not always successful. So, if one LUN goes offline, best case is only some of your VMs are affected. Less ideally, they lose part of their data and more VMs are impacted or are running in a state where some of the storage isn’t available. Or, if the failed LUN is the first LUN (the Master Extent), the whole datastore goes offline. At least the architecture allows for a datastore to survive losing an extent under ideal circumstances.

What’s less apparent is the performance hit of multiple VMs and ESX hosts accessing one large LUN. With a lot of VMs generating I/Os you can exceed the disk queues, which default to 32 operations per LUN, for the datastore. Adding more LUNs to the datastore DOES increase the number of queue slots for the whole datastore. And that would be a good thing, assuming the data is equally distributed across all the LUNs, which is not going to be the case.

And, similar to inode locking in a filesystem, shared storage has to contend with volume locking. Multiple systems can read from the same LUN with no problem. But when a write occurs, the volume is locked by one host until the write is committed. Any other host trying to do a write gets a signal that there is a lock and has to wait for the lock to be released. On modern disk arrays, with write caching, this should be very fast; but it’s not ideal.

So, to avoid write locking you could try to keep all the VMs that share a datastore on one host. But that’s not really practical long-term, as VMs get migrated between hosts. Or, you can minimize the number of VMs that are using each datastore. In addition to keeping the number of VMs per datastore low, a strategy to consider is to mix heavy-I/O VMs with VMs that have low I/O requirements, which will help manage the queue depth for each LUN.

How many VMs is too many per datastore? It depends on your environment. I’ve seen recommendations ranging from 12 to 30. If you have a lot of static web servers that don’t do any real writes, you can get away with lots. If you have Oracle or MS SQL servers that do a lot of I/O, including writes, keep the numbers low. You can log into the ESX host, run esxtop, and hit “u” for the disk device view. There are lots of interesting fields in here: CMDS/s, READS/s, WRITES/s, and so on. Check the QUED field to see the current number of queue slots in use.

A good rundown on this is Mastering VMware vSphere 4. Recommendations from the book: stick to single-extent VMFS datastores (one LUN per datastore), don’t add extents just because you don’t want to manage another datastore, but go ahead and span a VMFS datastore if you need really high I/O or really big datastores.

I have another take on it. Always use one LUN per datastore. The argument that datastores backed by multiple LUNs give better performance is a little flawed, because VMWare tries to allocate all the storage associated with one VM on one extent. If you need high I/O, assign a LUN from each datastore, then separate the data logically on the VM. You get to leverage more disk queue slots by bringing in more LUNs per VM, the datastores are a single LUN which is easy to manage and maintain, and LUN locking is less of an issue with smaller datastores. And, while you do end up with more datastores, it’s not that big of a deal to manage.

The down-side, and there usually is one, is that you’re back to relying on more pieces that could go wrong. If you spread the data across multiple datastores, and a datastore goes offline, that VM is impacted. It’s really about the same exposure you have with using multiple LUNs per datastore. If the LUN your data is on goes down, your data is unavailable. So plan your DR and availability schemes accordingly.

TSM Full-VM Image Backup With VStore API

As a request from one of our VMWare admins, I’ve started testing the VMWare image backups. In the past, we’ve installed the regular backup/archive client on each VMWare guest OS. This has let us standardize our install regardless of whether it’s a virtual or physical server. But it doesn’t allow you to take a full snapshot of a VM; instead you have to rely on your bare-metal recovery procedures just as if it were a physical server.

Unfortunately, an application admin messed up a server, almost beyond repair. If the VMWare admins had been made aware of the work, they could have generated a snapshot just before the work started, and recovering would have been quick and simple.

Starting in the 6.2.2 version of the Windows client, TSM supports full image backups using the VStorage API. I’ve heard people complain there isn’t a lot of documentation on this, so I thought I would write about my testing.

VMWare image backups have been in TSM for a while, but previous versions have relied on VMWare Consolidated Backup (VCB), which VMWare is withdrawing support for. The VCB functionality is still in the client, so you can upgrade to the latest version if you are already using image backups with VCB.

Like some other similar products, you will need a Windows server to do the backups. This does not do a proxy backup the way the file-level backups using the VStorage API do, but saves the files in TSM under the node doing the backups. I used the VCenter server because it was handy, though I don’t see why you couldn’t use another Windows box. My VCenter server isn’t super powerful, but it has plenty of power to spare for just running VCenter, and should be adequate to use for backups of our VMWare cluster. Here are the specs:
Intel Xeon E5530 2.4 GHz
4 GB RAM
Windows Server 2008 R2 Standard
2 – 140GB SCSI drives (mirrored)
2 – 1GB NICs (no etherchannel)

Setup could not be much more simple. In the client options, there is a “VM Backup” tab with several options. First, you must select if you want to do a VMWare File Level backup (does a proxy backup for Windows guests), VMWare Full VM backup (either using VCB or VStorage), or a Hyper-V Full VM Backup (for Microsoft Hyper-V users).

Next, you can specify doing a “Domain Full VM” or a “Domain File Level VM” backup. Domain Full VM backups are image backups of the whole VM. Domain File Level backups seem to be proxy node backups of Windows boxes. I don’t know why there are two options to specify this in the GUI, as they seem to be mutually exclusive. In the same section, you can list VMs to backup, use pre-defined groups like “ALL-VM” or “ALL-WINDOWS”, or specify folders of VMs to backup if you’ve broken down your VMs into folders.

Next, you specify your VMWare VCenter or ESX server login information. You’ll probably want to specify a role account for this. In testing I just used my regular VCenter login.

And, finally, you can tell the client to use a specific management class for your image backups. I made a new management class called VM_IMAGES. Expiration and retention seem to work as expected. You must also specify if you’re using the VStorage API or VCB.

Here’s a screenshot:

I turned off client deduplication and compression for my testing; I’ll test their effects later. I found that the test VM images grew in size when compressed, and enabling compression and deduplication cut the transfer speed in half. There is a note in the documentation that says that compression can only be used with the VStore Full VM backup if it’s being saved to a deduplication-enabled storage pool.

There are a couple of other things to note, the VCB backup feature uses a staging area on a datastore to copy the files to before backing them up. The new VStorage API backup doesn’t. This may make it somewhat faster, and in my testing the disk usage did not increase during backups. And, even cooler, the VStorage API uses the changed block tracking feature of ESX 4 and later to only backup the occupied space. So, if you have a VM with a 100GB volume and only 4GB of data, it’s not going to backup the whole 100GB.

Let’s do a backup! In the GUI, you can select Actions -> Backup VM, which opens a GUI selection box. It’s a lot like the file backup GUI; just select the VM:

When you hit the backup button, the same status window used for file backups opens and everything progresses as usual. You can also use the commandline:

tsm> backup vm DCFM -vmfulltype=vstor -vmmc=VM_IMAGES
Full BACKUP VM of virtual machines 'DCFM'.

Backup VM command started.  Total number of virtual machines to process: 1

Backup of Virtual Machine 'DCFM' started

Starting Full VM backup of Virtual Machine 'DCFM'

Backing up Full VM configuration information for 'DCFM'
          4,086 VM Configuration [Sent]
Processing snapshot disk [OpsXIV2 Datastore 1] DCFM/DCFM.vmdk (Hard Disk 1), Capacity: 17,179,869,184, Bytes to Send: 15,998,124,032  (nbdssl)
Volume --> 15,998,124,032 Hard Disk 1
Backup processing of 'DCFM' finished without failure.

Total number of objects inspected:        1
Total number of objects backed up:        1
...
Total number of bytes inspected:     14.89 GB
Total number of bytes transferred:   14.89 GB
Data transfer time:                1,075.59 sec
Network data transfer rate:        14,525.08 KB/sec
Aggregate data transfer rate:      14,487.67 KB/sec
Objects compressed by:                    0%
Total data reduction ratio:            0.00%
Subfile objects reduced by:               0%
Elapsed processing time:           00:17:58

Successful Full VM backup of Virtual Machine 'DCFM'

Unmount virtual machine disk on backup proxy for VM 'DCFM (nbdssl)'
Deleted directory C:\Users\pvaughan\AppData\Local\Temp\2\vmware-pvaughan\421a7a97-64a4-79d1-b854-c9b30ea6dca7-vm-65\san
Deleted directory C:\Users\pvaughan\AppData\Local\Temp\2\vmware-pvaughan\421a7a97-64a4-79d1-b854-c9b30ea6dca7-vm-65\nbdssl

Backup VM command complete
Total number of virtual machines backed up successfully: 1
  virtual machine DCFM backed up to nodename MMC-OPSVC1
Total number of virtual machines failed: 0
Total number of virtual machines processed: 1

When the backup starts, the TSM client creates a new snapshot of the VM. When the backup is finished (or interrupted) the snapshot is dropped. The disk images of that snapshot are what you end up with. The upshot of this is that the disks are probably in pretty good shape (fsck ran clean for me) and you can get a good backup of a running VM. Here’s a screenshot taken during the backup:

Backups are cool and all, but what about restores? The restore process is pretty simple too. Just select Actions -> Restore VM from the GUI. A familiar window pops open, and you select the VM backup to restore:

When you select Restore another window opens. This allows you to restore back to the original VM image (boring), or specify a new VM. This MAY be useful for cloning a VM between VM Clusters if you can’t use the VM Converter tool. If you want to make a new VM, just specify the new name, the Datacenter, ESX host, and Datastore:

The client will reach out and create a new VM exactly like the original, and then restore the files for it. I successfully cloned my test VM while it was running the first try. And, the restored VM booted without incident.
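
There is a command-line equivalent too. This is a hedged sketch with hypothetical target names; check the options available at your client level:

tsm> restore vm DCFM -vmname=DCFM-clone -host=esx01.example.com -datastore=Datastore1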

LPAR Memory Overhead

Here’s a simple thing that I ran across. I have a vendor that recommended I set the Maximum memory in my LPARs to the system maximum. That way you never have to reboot to increase the maximum memory in that LPAR. I found out later that setting your LPAR’s maximum memory to the system maximum makes the hypervisor allocate more memory for overhead.

This is a very old configuration issue, but I just ran across the actual numbers. When the LPAR is activated, the hypervisor allocates 1/64th of the LPAR maximum for page frame tables. This is a memory structure that the hypervisor uses to track the memory pages used by the LPAR. So, let’s say you have a 128GB managed system with LPARs that only really need 16GB of RAM, but each LPAR’s maximum memory is set to 128GB. By the time you’ve activated your 7th LPAR you’re using 2GB per LPAR, or 14GB of RAM, just for the hypervisor’s page frame tables.
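
The arithmetic makes the cost plain. Compare that 128GB maximum to a more reasonable 32GB maximum for the same seven 16GB LPARs:

128GB / 64 = 2GB per LPAR   x 7 LPARs = 14GB of hypervisor overhead
 32GB / 64 = 0.5GB per LPAR x 7 LPARs = 3.5GB of hypervisor overhead

That’s 10.5GB of usable memory back, just for picking a sensible maximum.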