I’ve had a “capacity planning” system in place for a few years that we wrote in-house. We’ve been running nmon every 5 minutes, pulling all kinds of statistics out, and populating RRD files on a central server. It lets us see trends and patterns that provide data for planning purchases. It’s not really a tactical monitoring system, more a strategic planning system. But there are a couple of problems.
First, it spawns an nmon process every 5 minutes. nmon has an annoying habit of touching the volume groups on startup. If there is a VG modification running, nmon can’t get the info it wants. So if you run a long process, say a reorgvg, the nmon processes just stack up.
The second problem is that it’s a push system. A cron job fires off nmon every 5 minutes. If the network or the central server is down, the jobs that populate our central RRD files stack up.
Munin fixes these problems. First, it doesn’t use nmon. It has a bunch of simple pre-packaged scripts. Each script gets a single set of statistics, a good UNIX approach, and uses only simple OS commands (vmstat, iostat, df, and so on). Second, it’s a “pull” system. A cron job on the Munin server polls the clients, gets configuration information and the data, creates the RRD files, and generates a set of web pages. If the network or the server is down, it’s no big deal.
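To make the “pull” idea concrete, here is a minimal sketch of what one poll of one client could look like. It assumes the node answers on TCP port 4949 with a plain-text protocol where `fetch <plugin>` returns `field.value N` lines ending with a lone `.`. The function names `parse_fetch` and `poll` are mine for illustration, not part of Munin, and the real server does a lot more than this.

```python
import socket


def parse_fetch(lines):
    """Turn 'field.value N' lines into a dict, stopping at the '.' terminator."""
    values = {}
    for line in lines:
        line = line.strip()
        if line == ".":
            break  # end of this plugin's response
        if not line or line.startswith("#"):
            continue  # skip blanks and comment/banner lines
        key, _, raw = line.partition(" ")
        field, _, attr = key.rpartition(".")
        if attr == "value" and field:
            # Keep the raw string: a node may report a non-numeric "unknown".
            values[field] = raw.strip()
    return values


def poll(host, plugin, port=4949, timeout=10):
    """Ask one node for one plugin's current values (assumed protocol)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        stream = sock.makefile("rw", encoding="ascii", newline="\n")
        stream.readline()  # discard the greeting line
        stream.write(f"fetch {plugin}\n")
        stream.flush()
        lines = []
        for line in stream:
            lines.append(line)
            if line.strip() == ".":
                break
        stream.write("quit\n")
        stream.flush()
    return parse_fetch(lines)
```

If the connection fails, the server side just gets an exception for that run and tries again on the next cron cycle. Nothing is queued on the client.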
The Munin client is written mostly in Perl, and works with the Perl packaged in AIX. You do need some extra modules, mainly Net::Server, from CPAN. The data collection scripts are responsible for checking that they will run successfully, providing RRD configuration information, and returning the actual data. One cool thing is that the RRD configuration is pulled at every execution, and the RRD files are updated accordingly. So, if you want to change a data set from a COUNTER type to a GAUGE, just change the script and the server takes care of everything.
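Those three responsibilities map onto three ways of calling the same script. Here is a rough sketch of that shape, written in Python only to keep it short (the stock scripts are not Python). The `autoconf` and `config` argument names follow the usual Munin plugin convention as I understand it. `read_load` is a made-up placeholder that returns a fixed number, where a real script would parse something like vmstat output.

```python
import sys


def read_load():
    """Hypothetical placeholder; a real script would parse an OS command."""
    return 0.42


def plugin(args, read_value=read_load):
    """Return the lines a plugin would print for the given arguments."""
    mode = args[0] if args else "fetch"
    if mode == "autoconf":
        # "Can I run on this box?" A real check would test for the command.
        return ["yes"]
    if mode == "config":
        # Graph and RRD settings, re-read by the server on every poll.
        return [
            "graph_title Load average",
            "graph_vlabel load",
            "graph_category system",
            "load.label load",
            "load.type GAUGE",  # switch to COUNTER here to change the data set type
        ]
    # No argument: return the actual data.
    return [f"load.value {read_value()}"]


if __name__ == "__main__":
    print("\n".join(plugin(sys.argv[1:])))
```

Changing the `load.type` line is the whole edit needed to switch the data set type; the server picks it up the next time it asks for the configuration.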
The downers. I had some problems getting the install script to work. It just needs a lot of modules. So, I got it working on one node, then copied all the files to the other nodes. The data collection scripts are okay, but they have some annoyances. The CPU script doesn’t understand shared CPU partitions. The memory script shows used, free, pinned, and swap. So my 64GB system with 16GB of pinned memory and no swap appears to have 80GB of RAM. And it doesn’t show how much of the “used” RAM is filesystem cache. Luckily, the scripts are really simple and easily updated.
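The 80GB figure comes from stacking pinned memory on top of used memory when pinned is already part of used. Here is a sketch of the arithmetic a fixed memory script could do instead. It assumes pinned memory and filesystem cache are both subsets of “used” and don’t overlap each other. The function name and the 8GB free and 20GB cache figures are made up for the example; only the 64GB and 16GB come from my system.

```python
def stack_memory(total_gb, free_gb, pinned_gb, cache_gb):
    """Split RAM into non-overlapping slices that add up to the total."""
    used_gb = total_gb - free_gb
    # Assumption: pinned and cache are both inside "used" and do not overlap.
    other_used_gb = used_gb - pinned_gb - cache_gb
    if other_used_gb < 0:
        raise ValueError("pinned + cache exceed used memory; assumption broken")
    return {
        "pinned": pinned_gb,
        "cache": cache_gb,
        "other_used": other_used_gb,
        "free": free_gb,
    }


# 64GB box with 16GB pinned; free and cache values are hypothetical.
slices = stack_memory(total_gb=64, free_gb=8, pinned_gb=16, cache_gb=20)
naive_total = (64 - 8) + 8 + 16  # used + free + pinned, the double count
```

With these numbers the slices sum to 64, while the naive used + free + pinned stack gives the 80 I was seeing on the graph.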