Notifications and thresholds
Starting with version 4.3, collectd introduces the concept of notifications. This document describes the concept and the related thresholds as they are currently implemented. The concept is also documented in the collectd.conf(5) manual page.
Notifications are generic text messages with an associated severity and a time. Their purpose is to inform the user about a notable condition, such as an unusually high CPU load or a failed health check. In addition to the severity and time, a message may be associated with performance data using the usual host/plugin/type tuple. The text doesn't follow any protocol or other specification, and the text of notifications generated by collectd may change without notice between versions. If interpreting the text should become necessary, we will add a machine-readable field or flag for that purpose. The severity can be one of OKAY, WARNING, and FAILURE, with the usual meaning. The time is hopefully self-explanatory.
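To make the shape of a notification concrete, here is a minimal sketch in Python. The class and field names are illustrative assumptions and not collectd's actual API; only the three severity names and the host/plugin/type association come from collectd itself.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    # The three severities collectd distinguishes.
    OKAY = "OKAY"
    WARNING = "WARNING"
    FAILURE = "FAILURE"


@dataclass
class Notification:
    # Hypothetical container; field names are chosen for illustration only.
    severity: Severity
    time: float                   # seconds since the epoch
    message: str                  # free-form text, no fixed protocol
    host: Optional[str] = None    # optional link to performance data
    plugin: Optional[str] = None
    type: Optional[str] = None


example = Notification(
    severity=Severity.WARNING,
    time=time.time(),
    message="CPU load is unusually high",
    host="web01",
    plugin="load",
    type="load",
)
```

Because the message text may change between versions, a consumer of such a structure would be better off branching on the severity and the host/plugin/type fields than on the text.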
Notifications are dispatched in the same manner as performance data: There are "producers", i.e. plugins that create notifications, and "consumers", i.e. plugins that receive notifications and do something with them. The plugins that can currently create or receive notifications are:
- Exec plugin: The Exec plugin can use notifications in both ways: It can receive notifications from the executed programs and dispatch them to other plugins, and it can receive dispatched notifications from the daemon and execute programs to "handle" them. This enables one to submit some custom status to the daemon, which provides a stable and extensible infrastructure to act upon the notification.
Also, it's easy to write a script or application which is called with each notification. This script can perform some appropriate action tailored to one's needs.
- LogFile plugin: The LogFile plugin has been extended to write notifications to the log file along with the other diagnostic output.
- Network plugin: Notifications can be sent and received using the Network plugin, just like performance data.
- Perl plugin: Plugins written in Perl can use notifications, too. The appropriate functions have been ported to Perl so that one can create and dispatch as well as receive and handle notifications.
- SysLog plugin: The SysLog plugin has been extended to support notifications. This feature has just been merged into master, so please be patient until it becomes available in a stable release.
- UnixSock plugin: The UnixSock plugin has been extended to provide the PUTNOTIF command. Using this command one can submit notifications to the system using the UNIX domain socket provided by this plugin.
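As a rough illustration of the UnixSock route, the following Python sketch builds a PUTNOTIF line and sends it over a UNIX domain socket. It assumes the option syntax described in collectd-unixsock(5) (severity, time, host, and message given as key=value pairs); the helper names, the quoting rule, and the socket path passed by the caller are assumptions made for this example, so check them against the manual page of your version.

```python
import socket
import time


def build_putnotif(message, severity, host=None, timestamp=None):
    """Build a PUTNOTIF command line (hypothetical helper, not part of collectd)."""
    if severity not in ("okay", "warning", "failure"):
        raise ValueError("severity must be okay, warning or failure")
    if timestamp is None:
        timestamp = int(time.time())
    parts = ["PUTNOTIF"]
    if host is not None:
        parts.append(f"host={host}")
    parts.append(f"severity={severity}")
    parts.append(f"time={timestamp}")
    # Assumption: backslash-escaping inside double quotes is accepted.
    escaped = message.replace("\\", "\\\\").replace('"', '\\"')
    parts.append(f'message="{escaped}"')
    return " ".join(parts)


def send_command(command, socket_path):
    """Send one command line to the socket and return the first reply line."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall((command + "\n").encode("utf-8"))
        return sock.makefile("r").readline().strip()
```

A caller would pass the path configured for the UnixSock plugin as `socket_path`; the reply line indicates whether the daemon accepted the notification.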
One of the central sources of notifications is checking whether performance values are in an acceptable range. This is done using "Thresholds", the second big change in 4.3. You can define thresholds for any value or group of values, which will then be checked.
Besides range checking, a possibly less obvious mechanism is enabled when thresholds are configured for a value. Because you appear to be interested in the value, a notification will be created when it hasn't been received for an unusually long time. This way you will get a notification for missing values, too, which would otherwise go unnoticed.
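The two checks described above, range checking and detection of missing values, can be sketched in Python as follows. This is not the daemon's implementation: the function names are made up for illustration, the parameter names merely echo the warning/failure minimum and maximum limits, and the rule "missing after two intervals" is an assumption.

```python
def check_value(value, warning_min=None, warning_max=None,
                failure_min=None, failure_max=None):
    """Classify a value against optional warning and failure ranges."""
    def outside(low, high):
        # A limit of None means that side of the range is unbounded.
        return (low is not None and value < low) or \
               (high is not None and value > high)

    # The failure range is checked first because it is the more severe state.
    if outside(failure_min, failure_max):
        return "FAILURE"
    if outside(warning_min, warning_max):
        return "WARNING"
    return "OKAY"


def is_missing(last_update, now, interval, timeout_intervals=2):
    """Report whether a value has been absent for too long.

    Assumption: a value counts as missing once more than
    `timeout_intervals` collection intervals have passed without an update.
    """
    return (now - last_update) > timeout_intervals * interval
```

For example, with a warning maximum of 70 and a failure maximum of 85, a reading of 75 would be classified as a warning and a reading of 90 as a failure.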
When configuring thresholds you can define whether the threshold is supposed to be "persistent". With a persistent threshold, a notification is created for each value that is out of range. This may result in a high number of notifications, basically one notification per interval. If a threshold is configured to be non-persistent, a notification is created only for each state change, i.e. one when the status changes from "okay" to "out of range" and a second one when it changes back to "okay".
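The difference between persistent and non-persistent thresholds can be illustrated with a small Python sketch. The function name is hypothetical, and two details are assumptions made for the example: the value is taken to start out in the "OKAY" state, and a persistent threshold is taken to also report the change back to "OKAY".

```python
def notifications_for(states, persistent):
    """Return the states for which a notification would be created.

    `states` is the per-interval result of the threshold check,
    e.g. "OKAY" or "FAILURE".
    """
    emitted = []
    previous = "OKAY"  # assumption: nothing is wrong before the first value
    for state in states:
        if state != previous:
            # A state change always creates a notification.
            emitted.append(state)
        elif persistent and state != "OKAY":
            # Persistent: repeat the notification while out of range.
            emitted.append(state)
        previous = state
    return emitted
```

Given three consecutive out-of-range intervals followed by a recovery, the non-persistent variant produces two notifications (one for the failure, one for the recovery), while the persistent variant produces one for every out-of-range interval plus the recovery.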
There are no detailed plans for what we're going to build on top of that infrastructure, but we have some interesting ideas. A plugin which sends notifications to a user by email is a must-have. A plugin which makes a (VoIP) phone call would be nice, too. Something using Festival comes to mind. We're always open to new ideas, of course ;)
When thinking about the concept of monitoring functionality in collectd, we tried to learn from the design problems of other solutions. For example, other projects have a "check" which has a certain status. Although health checks and availability checks are important, there are a lot of situations where some performance data needs to be in some range. One example would be the free space in /var, which should never be less than 100 MByte (or something like that).
Of course, other solutions can do that kind of check, too. But the way to do this is a disaster: One defines a script to be executed (including arguments), and the threshold values are encoded within the arguments. This is not only unreadable but also annoying as hell if every plugin uses its own (or a slightly different) syntax for these thresholds. In our opinion the best solution is to have the plugin report the values and let the daemon decide whether they are good or not. The user gets a uniform interface for defining these threshold limits.
So with collectd's notifications you're very flexible: A notification can inform the user that a host is unreachable, that the hard disk is dead or that the moon was destroyed. Using threshold values, a notification can also be created when the system temperature exceeds 70 degrees Celsius. As much functionality as appropriate has been pulled into the daemon so that the plugins can focus on what they were meant to do.