collectd-nagios is a small tool to interact with the monitoring suite Nagios. It reads values from collectd using the UnixSock plugin and compares them with the specified ranges. collectd-nagios then terminates with an exit code according to the Nagios plugin development guidelines.
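
The compare-and-exit step can be sketched in a few lines of Python. This is only an illustration, not the code of collectd-nagios itself: the function names are hypothetical, and the range notation follows the Nagios plugin development guidelines (a value alerts when it lies outside `min:max`, `~` as the minimum means negative infinity, a leading `@` inverts the check). Whether collectd-nagios accepts exactly the same forms is not claimed here.

```python
import math

# Exit codes defined by the Nagios plugin development guidelines.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_range(spec):
    """Parse a Nagios-style range into (low, high, inverted). Hypothetical helper."""
    inverted = spec.startswith("@")
    if inverted:
        spec = spec[1:]
    if ":" in spec:
        low_s, high_s = spec.split(":", 1)
    else:
        # A bare number N is shorthand for the range 0:N.
        low_s, high_s = "0", spec
    if low_s == "~":
        low = -math.inf
    else:
        low = float(low_s) if low_s else 0.0
    high = float(high_s) if high_s else math.inf
    return low, high, inverted


def alerts(value, spec):
    """Return True if the value should raise an alert for the given range."""
    low, high, inverted = parse_range(spec)
    outside = value < low or value > high
    return not outside if inverted else outside


def exit_code(value, warn, crit):
    """Map a value and two ranges to a Nagios exit code; critical wins over warning."""
    if math.isnan(value):
        return UNKNOWN
    if alerts(value, crit):
        return CRITICAL
    if alerts(value, warn):
        return WARNING
    return OK
```

With a warning range of `80` and a critical range of `90`, a value of 85 maps to WARNING (exit code 1) and a value of 95 maps to CRITICAL (exit code 2).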

Suggested improvements

The following includes a few suggested improvements open for discussion:

Querying values

As of now (version 5.0), only a single dataset (possibly composed of multiple data-sources) may be queried from collectd. This is fairly inflexible. For example, it does not allow checking the percentage of used space on a disk, because collectd 5.0 and later report this information as multiple datasets.

The idea is to support a more flexible syntax when specifying a dataset using the -n (value spec) option:

  • mark -d (data-source) as deprecated and, optionally, append the data-source to the value spec (-n option) separated by another slash, e.g. load/load/midterm
  • support basic arithmetic operations, e.g. df-root/df_complex-used / (df-root/df_complex-used + df-root/df_complex-free); note: backward compatibility is preserved, as specifying a single dataset still works just like before
    • using / as divisor might cause problems; another approach might be to use functions like DIV, SUM, etc. --yann2 on IRC
  • -d is forbidden if -n does not specify a single dataset
  • support simple functions like MIN, MAX, …
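
To make the function-style variant (DIV, SUM, MIN, MAX) more concrete, here is a minimal Python sketch of an evaluator. It is an assumption about how such a syntax could look, not an agreed design: the grammar, the `evaluate` function, and the `values` mapping (standing in for whatever collectd returns for each value spec) are all hypothetical. The point it illustrates is that with function calls, `/` and `-` never need to be treated as operators, so value specs such as df-root/df_complex-used or load/load/midterm can be tokenized unambiguously.

```python
import re


def _div(args):
    if len(args) != 2:
        raise ValueError("DIV takes exactly two arguments")
    return args[0] / args[1]


# Hypothetical function table; each entry receives a list of floats.
FUNCTIONS = {"SUM": sum, "DIV": _div, "MIN": min, "MAX": max}


def evaluate(query, values):
    """Evaluate a query such as 'DIV(a, SUM(a, b))'.

    `values` maps value specs to numbers and stands in for data
    read from collectd (an assumption made for this sketch).
    """
    # Tokens are '(' ')' ',' or any run of other non-space characters,
    # so '/' and '-' stay part of the value spec.
    tokens = re.findall(r"[(),]|[^\s(),]+", query)
    pos = 0

    def parse():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        token = tokens[pos]
        pos += 1
        if token in ("(", ")", ","):
            raise ValueError("unexpected token: " + token)
        if pos < len(tokens) and tokens[pos] == "(":
            if token not in FUNCTIONS:
                raise ValueError("unknown function: " + token)
            pos += 1  # consume '('
            args = [parse()]
            while pos < len(tokens) and tokens[pos] == ",":
                pos += 1
                args.append(parse())
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("expected ')'")
            pos += 1
            return FUNCTIONS[token](args)
        try:
            return float(token)  # numeric literal
        except ValueError:
            return values[token]  # value spec; KeyError if unknown

    result = parse()
    if pos != len(tokens):
        raise ValueError("trailing input after expression")
    return result
```

A query consisting of a single value spec is still a valid expression in this sketch, which mirrors the backward-compatibility note above.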

Output formatting

As of now (version 5.0), the output of collectd-nagios is hard-wired into the tool, e.g. OKAY: 0 critical, 0 warning, 3 okay. In many cases, more information (to be displayed in the Nagios frontend) might be desirable, especially when querying multiple values.

The idea is to make the output configurable through a configuration file and by specifying a format string with placeholders. The syntax might look like the following:

 <Service "disk_percent-root">
   # none of the following options are required
   Output "DISK {status} - free space /: {value:df-root/df_complex-free} ({value}%)"
   PerfData "'/'={value}:{warn}:{crit}:0:{value:df-root/df_complex-free + df-root/df_complex-used}"
   # may be overwritten on the command line:
   Query "df-root/df_complex-used / (df-root/df_complex-used + df-root/df_complex-free)"
   Warn "80"
   Crit "90"
 </Service>
 <Host "hostname">
   # settings applying to a specific host only
   <Service "foo">
     # ...
   </Service>
 </Host>

The service name may then be specified using a newly added command line option.
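
How the placeholders in an Output or PerfData string could be expanded is sketched below in Python. This is a rough illustration under stated assumptions, not a proposal for the final implementation: `render`, `fields`, and `lookup` are hypothetical names, and `lookup` stands in for whatever resolves the text after `value:` (a plain value spec here; an expression evaluator could be plugged in the same way).

```python
import re

# Matches {name} and {name:argument}.
PLACEHOLDER = re.compile(r"\{(\w+)(?::([^}]*))?\}")


def render(template, fields, lookup):
    """Expand placeholders in a format string.

    fields: plain placeholders such as status, value, warn, crit.
    lookup: callable resolving the argument of {value:...} to a number
            (hypothetical hook, an assumption of this sketch).
    """
    def expand(match):
        name, arg = match.group(1), match.group(2)
        if name == "value" and arg is not None:
            # %g-style formatting drops a trailing ".0" for whole numbers.
            return format(lookup(arg), "g")
        if arg is None and name in fields:
            return str(fields[name])
        # Unknown placeholders are left untouched rather than guessed at.
        return match.group(0)

    return PLACEHOLDER.sub(expand, template)


if __name__ == "__main__":
    values = {"df-root/df_complex-free": 70.0}
    fields = {"status": "OK", "value": 30, "warn": 80, "crit": 90}
    template = "DISK {status} - free space /: {value:df-root/df_complex-free} ({value}%)"
    print(render(template, fields, values.__getitem__))
```

Leaving unknown placeholders in the output unchanged is a deliberate choice in this sketch: a typo in the configuration then shows up in the Nagios frontend instead of silently disappearing.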

Nagios Event Broker

Another approach to let collectd interact with Nagios might be to implement a Nagios Event Broker taking care of the communication between the two daemons. An initial idea has been drafted and proposed at the German Nagios Portal Workshop 2011 in Hannover (Monitors 2011). An English summary of the draft is available as a PDF file.

Nagios Output Plugin

Yet another approach would be some kind of Nagios output plugin. The basic idea is very similar to the NEB laid out above. However, since collectd would run as a separate daemon, the data processing would not block the Nagios core (well, a reasonably designed NEB should not do that either ;-)). This might also avoid some code duplication (which, however, should be solved by pulling common code out into a library, so this is not a real argument), and processing the data would benefit from the advanced data-processing features in collectd (i.e., Chains). The threshold configuration would be done as part of the plugin configuration and/or using the central threshold configuration. Data would be fed to Nagios using the external commands interface.
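
For illustration, feeding a result to Nagios through the external commands interface amounts to writing a PROCESS_SERVICE_CHECK_RESULT line to the Nagios command file. The Python sketch below shows that line format. The function names are hypothetical, the location of the command file depends on the Nagios installation and is therefore passed in as a parameter, and an actual output plugin would be part of collectd rather than a Python script.

```python
import time


def passive_check_line(host, service, return_code, output, timestamp=None):
    """Build a PROCESS_SERVICE_CHECK_RESULT external command line."""
    if timestamp is None:
        timestamp = int(time.time())
    # An external command occupies a single line, so newlines are flattened.
    output = output.replace("\n", " ")
    return "[{0}] PROCESS_SERVICE_CHECK_RESULT;{1};{2};{3};{4}\n".format(
        timestamp, host, service, return_code, output
    )


def submit(line, command_file):
    """Write one command to the Nagios command file (a named pipe).

    The path is installation-specific and must be supplied by the caller.
    Opening the pipe blocks until Nagios has it open for reading.
    """
    with open(command_file, "w") as pipe:
        pipe.write(line)
```

Nagios only accepts such results if external commands are enabled and the service is configured to accept passive checks.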