Troubleshooting

From collectd Wiki

It's not working. This page tries to help with the most common pitfalls:

No data appears on the server

I'm trying to use the Network plugin to send data from one instance of collectd to another and the RRDtool plugin to store the data on the server. It doesn't work, though.

  • Are RRD-files created on the server at all?
    • If so, the data was successfully received by the server at least once. You should probably continue with #Graphs are empty below.
  • Are the packets actually sent by the client and received by the server?
    tcpdump -i eth0 -p -n -s 1500 udp port 25826
    • On the client, you should see outgoing packets at roughly the frequency specified by the interval setting. To avoid wasting bandwidth on packet overhead, collectd buffers multiple measurements until a UDP packet is close to full, so if you have only a few sensors it may take several intervals before a packet is sent.
    • If you only have a few active sensors and a long interval, restarting collectd after at least one interval has passed should flush any gathered measurements and give you a packet.
    • If you do not see incoming data on the server, this is a problem with your network, not with collectd. Check that firewalls allow communication on the appropriate port. Make sure that the servers use the correct interface to send data.
  • Is the receiving socket opened by the server?
    netstat -lnp | grep collectd
    • If not, the configuration of the Network plugin is incorrect or something went wrong during start-up.
  • If your client sends packets and a receiving socket is open, make sure you listen on the correct protocol:
    • IPv4: <Listen "0.0.0.0" "25826">
    • IPv6: <Listen "::" "25826">

Graphs are empty

I'm using the RRDtool plugin to record incoming data in RRD-files on the server side. The files are created by the daemon but when I create graphs from the files they are empty.

  • Is the last modification timestamp (mtime) of the RRD-files changing?
    watch -n 10 'ls -l $RRD_FILE'
  • Are the files re-created if you delete or rename the files?
  • Is the last update field in the RRD-file changing?
    while sleep 10; do rrdtool info $RRD_FILE | grep last_update; done
  • If the RRD-files are getting modified and (re)created correctly, check your types.db(5) file. If it is inconsistent between clients and server – especially if the data source type differs – then the server may not record any data. The Network plugin uses the types.db to parse network traffic.
  • The interval between received packets might be too long, causing the server-side graphing software to consider the node offline. Check for UDP packet loss and review the graphing software's settings. For testing, set the client's interval configuration option to a small value (e.g. two seconds).
  • Make sure that only one client sends data for any given RRD file. The most common cause for two clients updating the same file is that they are using the same host name. In all likelihood this is a very bad idea.
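
The types.db consistency check mentioned above can be sketched as a small shell function. This is only a minimal sketch and not part of collectd: the names normalise_types_db and compare_types_db are hypothetical, and it assumes you have copied the client's types.db next to the server's. It strips comments and blank lines, collapses whitespace, sorts both files, and prints the definitions that differ.

```shell
# Hypothetical helper: print the type definitions of a types.db file in a
# canonical form (no comments, no blank lines, single spaces, sorted).
normalise_types_db() {
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1" | tr -s ' \t' ' ' | sort
}

# Hypothetical helper: compare two types.db files.
# Prints differing lines; returns 0 if they match, 1 if they differ.
compare_types_db() {
    a=$(mktemp) || return 2
    b=$(mktemp) || { rm -f "$a"; return 2; }
    normalise_types_db "$1" > "$a"
    normalise_types_db "$2" > "$b"
    diff "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return "$status"
}
```

For example, `compare_types_db /usr/share/collectd/types.db /tmp/client-types.db` compares the server's file with a copy from the client. Both paths are assumptions; use the locations configured with TypesDB on your systems.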

(Huge) Spikes in RRD files and graphs

Discussion

This is most commonly associated with the COUNTER data source type. Both the DERIVE and the COUNTER data source types (DSTs) divide the change between two reads by the time between the reads. See the data source page for a full discussion of the topic.

The main difference between the DERIVE and COUNTER data source types is how they handle the case when the new value is smaller than the old value. The DERIVE DST will interpret this decrease as a negative rate and – if the minimum value is set to zero – discard this value. The COUNTER DST, on the other hand, assumes that the 32-bit or 64-bit value overflowed and will calculate a (positive) rate accordingly.

The problem is that sometimes the COUNTER DST is too clever for its own good: if the counter is reset (i.e. forced back to zero), the false assumption that an overflow has taken place results in huge values being computed. Assume, for example, that the old value is 5 billion (5 ⋅ 10^9). Then the counter was reset to zero and has increased to 42 since then. Because the old value is greater than 2^32 - 1, a 64-bit overflow is assumed. Thus, the new rate is calculated from a difference of 42 + 2^64 - 5 ⋅ 10^9, which is roughly 18 ⋅ 10^18 (18 quintillion).
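
The wrap-around assumption described above can be reproduced with a few lines of shell. This is a minimal sketch of the reasoning in this section, not RRDtool's actual code; the function name counter_delta is hypothetical. It prints the increase that would be assumed between an old and a new counter value. The 64-bit case does not fit into signed 64-bit shell arithmetic, so it falls back to awk floating point and is only approximate.

```shell
# Hypothetical helper: counter_delta OLD NEW
# Prints the increase a COUNTER-style wrap-around assumption would yield.
counter_delta() {
    old=$1
    new=$2
    delta=$(( new - old ))
    if [ "$delta" -ge 0 ]; then
        echo "$delta"                    # normal case: the counter increased
        return
    fi
    delta=$(( delta + 4294967296 ))      # assume a 32-bit wrap: add 2^32
    if [ "$delta" -ge 0 ]; then
        echo "$delta"
        return
    fi
    # Still negative: assume a 64-bit wrap and add 2^64 instead (approximate).
    awk -v old="$old" -v new="$new" 'BEGIN { printf "%.3e\n", new - old + 2^64 }'
}
```

With these definitions, `counter_delta 4000000000 42` prints 294967338 (a presumed 32-bit wrap), and `counter_delta 5000000000 42` prints roughly 1.845e+19, which is the 18 quintillion from the example above.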

Fix

Just as a minimum value of zero prevents negative rates when using DERIVE, a maximum value can be configured. Huge spikes are only allowed into the RRD file if no maximum value has been set or if the maximum value was set too high. The trick is to set this maximum value and force RRDtool to throw away all offending rates.

 file="/path/to/your.rrd"
 rrdtool tune "$file" --maximum value:1234
 rrdtool dump "$file" | rrdtool restore --range-check - "$file--FIXED"
 mv "$file--FIXED" "$file"

The above uses rrdtune(1) to set a new maximum rate of 1234 for the data source named value. The offending values are then removed by dumping the file and restoring it with the range check enabled.
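
Before and after applying the fix, it can help to confirm which limits are actually stored in the file. The following is a minimal sketch, assuming the usual `ds[NAME].min = ...` and `ds[NAME].max = ...` lines printed by `rrdtool info`; the function name ds_limits is hypothetical.

```shell
# Hypothetical helper: filter `rrdtool info` output (read from stdin) down to
# the minimum and maximum limits of every data source.
ds_limits() {
    grep -E '^ds\[[^]]+\]\.(min|max) = '
}
```

Use it as `rrdtool info "$file" | ds_limits`. A maximum shown as NaN means that no upper limit is set, so spikes would be stored unchecked.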


"Value is too old" errors using perl via EXEC plugin

You may get a lot of "value is too old" errors when running a Perl script via the Exec plugin if the script provides values using the PUTVAL command.
The issue turned out to be output buffering.
This one line at the beginning of the Perl script fixes the issue:

 # If set to nonzero, forces a flush after every write or print:
 $| = 1;