Free performance monitoring and capacity planning for IBM Power™ platform

Alerting

LPAR2RRD now has a built-in alerting feature. You can define alarms for any CPU pool or lpar in your environment.
This feature is not intended to replace the standard alerting you already have in place.
It is rather additional functionality, aimed especially at critical CPU pools and lpars.
There are 4 main reasons to use it:
  1. It can monitor whole physical servers.
    This is something that the usual monitoring tools are not able to provide for the IBM Power™ platform.
  2. It can monitor CPU pools.
    As above, you can hardly find this in the usual monitoring/alerting tools for the IBM Power™ platform.
  3. Alarming is based on "real physical consumption".
    There are still many monitoring tools that are not aware of virtualization and provide inaccurate data for CPU alerting/monitoring.
  4. All alerts come from the server hosting LPAR2RRD, so you do not need to open a new firewall hole to each monitored lpar.

Alerting types
  • Emailing. You can set a direct email address on each directive, use email groups, or rely on the default email address.
  • Nagios support. You can configure Nagios to pick up alarms from LPAR2RRD via the standard NRPE module.
    LPAR2RRD Nagios plug-in installation
  • External alerting via an external shell script. Each alert can invoke a defined script with given parameters. You can use it for your integration needs.
  • SNMP trap: not yet implemented in v3.20; it should come soon.
  • Alert plug-ins for other monitoring tools can be developed on demand, especially for customers under a support contract.
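
As an illustration of the external script option, here is a minimal sketch of a handler that appends a timestamped record of whatever arguments it receives to a log file. The parameters LPAR2RRD actually passes are not listed on this page, so the sketch makes no assumption about their number or meaning. The names log_alert and ALERT_LOG are hypothetical.

```shell
#!/bin/sh
# Hypothetical external alert handler: records every argument it is given.
# ALERT_LOG and log_alert are illustrative names, not part of LPAR2RRD.
ALERT_LOG="${ALERT_LOG:-/tmp/lpar2rrd_alert_handler.log}"

log_alert() {
    # One line per alert: timestamp, argument count, then the raw arguments.
    printf '%s args=%d: %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$#" "$*" >> "$ALERT_LOG"
}

# Log only when the script is invoked with at least one argument.
if [ "$#" -gt 0 ]; then
    log_alert "$@"
fi
```

Logging the raw arguments first is a cheap way to learn what an alert really delivers before wiring the script into a ticketing or paging system.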

What is configurable
  • CPU maximum and minimum limits that trigger an alert
  • Time of CPU peak: the period over which average CPU utilization must stay above the limit
  • Email groups: you can create different groups and direct alarms to them
  • CPU warning threshold, as a percentage of the CPU critical alarm limit
  • Alert retention: the time before the same alarm is raised again
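
To make two of these settings concrete, the following sketch shows how a warning limit could be derived from a critical limit, and how a retention period could suppress repeated alarms. It only reflects one reading of the descriptions above and is not LPAR2RRD's own code. The function names, the argument order, and the use of minutes for retention are all assumptions.

```shell
#!/bin/sh
# Hypothetical helpers; names and units are assumptions, not LPAR2RRD code.

# warn_threshold CRITICAL PERCENT
# Prints the warning limit as PERCENT of the critical limit (2 decimals).
warn_threshold() {
    awk -v c="$1" -v p="$2" 'BEGIN { printf "%.2f\n", c * p / 100 }'
}

# may_alert_again LAST_EPOCH NOW_EPOCH RETENTION_MINUTES
# Succeeds when at least RETENTION_MINUTES have passed since the last alert.
may_alert_again() {
    [ $(( $2 - $1 )) -ge $(( $3 * 60 )) ]
}

# Example: critical limit of 4 CPU cores, warning at 80% of it.
warn_threshold 4 80    # prints 3.20
```

With a 10 minute retention, an alarm last sent at epoch 1000 would be held back at epoch 1500 and released at epoch 1600.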

How to start
  • First download and replace scripts/update_cfg_alert.sh.
    The one shipped in 3.20 has a bug. Also remove etc/alert.cfg if you have already created it.
  • Create the configuration file
    (the upgrade process creates the configuration file automatically, so you might skip this)
    $ cd /home/lpar2rrd/lpar2rrd
    $ ./scripts/update_cfg_alert.sh
    This creates the configuration file etc/alert.cfg
  • Edit ./etc/alert.cfg and configure alerts
  • Add the following entry to crontab:
    0,10,20,30,40,50 * * * * /home/lpar2rrd/lpar2rrd/load_alert.sh > /home/lpar2rrd/lpar2rrd/load_alert.out 2>&1
  • Check that emailing works from the server hosting LPAR2RRD:
    echo "ok" | mailx -s lpar2rrd_test your_addr@your_company
    
  • When you want to refresh the list of servers/pools/lpars in alert.cfg, just run again:
    $ ./scripts/update_cfg_alert.sh

The configuration of alerts is documented in the configuration file itself: ./etc/alert.cfg

Note: If hundreds of lpars or CPU pools are configured for alerting, it might affect the performance of the server hosting LPAR2RRD. After each big change in the alert configuration, run ./load_alert.sh from the command line to find out the typical run time (it is printed at the end).
The run time should not come close to the interval at which the script is scheduled in crontab (typically 10 minutes; we do not recommend less).
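
The run-time check described in the note can be sketched as a small helper that compares a measured duration with the crontab interval. The 50% default ceiling and the name fits_interval are assumptions chosen for illustration, and the commented usage relies on `date +%s`, which is widely available but not guaranteed on every platform.

```shell
#!/bin/sh
# Hypothetical helper; the 50% default ceiling is an illustrative assumption.

# fits_interval RUNTIME_SECONDS INTERVAL_MINUTES [MAX_PERCENT]
# Succeeds when the run time uses at most MAX_PERCENT of the cron interval.
fits_interval() {
    fi_runtime=$1
    fi_interval_sec=$(( $2 * 60 ))
    fi_max_pct=${3:-50}
    [ $(( fi_runtime * 100 )) -le $(( fi_interval_sec * fi_max_pct )) ]
}

# Possible usage from the LPAR2RRD home directory (assumes `date +%s` works):
#   start=$(date +%s); ./load_alert.sh > /dev/null 2>&1; end=$(date +%s)
#   fits_interval $(( end - start )) 10 || echo "load_alert.sh is slow for a 10 minute schedule"
```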

Known issues
  • Bug introduced in 3.30 - 3.33: new lpar and pool names have the suffix .rrm or .rrh in alert.cfg. Fix:
    $ cd $LPAR2RRD_HOME
    $ ed etc/alert.cfg << EOF
    g/\.rrm/s/\.rrm//
    g/\.rrh/s/\.rrh//
    w
    q
    EOF
    
  • A bug was found that is fixed in 3.21. For older versions the fix is here.
    Just go to the LPAR2RRD home directory and copy and paste the text starting with "ed " and ending with "EOF":
    $ cd $LPAR2RRD_HOME
    $ ed bin/alrt.pl << EOF
    284s/managedname/managedname_prev/g
    w
    q
    EOF
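
Before applying the .rrm/.rrh fix from the first known issue, it can help to check whether alert.cfg is affected at all. The following is a minimal sketch of such a check. count_suffixed is a hypothetical helper name, and the pattern mirrors the one used in the ed commands for that issue.

```shell
#!/bin/sh
# Hypothetical helper, not an LPAR2RRD command.

# count_suffixed FILE
# Prints how many lines of FILE still contain a .rrm or .rrh suffix.
count_suffixed() {
    # grep -c prints 0 and exits non-zero when nothing matches; "|| true"
    # keeps the function usable in scripts that run under "set -e".
    grep -c -E '\.rr[mh]' "$1" || true
}

# Possible usage from the LPAR2RRD home directory:
#   count_suffixed etc/alert.cfg
```

A result of 0 suggests the file is not affected. Running the same check after the fix confirms that no suffixed names remain.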