Minutes from 2020-07-20

From collectd Wiki
Revision as of 22:42, 20 July 2020 by Sranganath (talk | contribs)


Attendees: Slawomir Strehlau, Octo, Pawel Zak, Matthias Runge, Pawel T, Piotr Dresler, Sunku, Barbara Kaczorowska, Elene Margalitadze, Jabir Kanhira Kadavathu, Svetlana Schmidt

NVMe extension to smart plugin

Updates to Netlink plugin with SRIOV metrics

Collectd 6.0 Design Aspects

  • Label-based metrics, and converting plugins over to them, are important
  • There are a bunch of backward-compatibility items in the code; for example, libvirt is transparently converted into virt, but the term libvirt has been hardcoded in multiple places. Little hacks like this exist in many places in the code and should be cleaned up
  • Tree-wide removal of absolute types, which has been merged
  • Not necessary for all plugins; go with a subset of plugins
  • Features based on the doc discussed during the Feb 2020 meetup:
    • https://docs.google.com/presentation/d/1MEIusZJEpgUCegFKP6Qfup7kV04ketvH5M6YnPNaObI/edit#slide=id.g7db70ed343_0_7
    • Ability to add/remove plugins without reloading the Collectd daemon: relatively complicated; no one is working on it
    • Dynamically change plugin configuration without restarting the Collectd daemon: what's needed is someone to come up with a design and propose a concrete path forward. It's not a heavily contested feature, but it's hard to implement. Open to contributions
    • Ability to customize/add metadata/labels on metrics, and ability to control the level of metadata/labels exposed by read plugins for easier consumption: this is WIP
      • The convention is similar to Prometheus; for example, labels starting and ending with '_' are reserved for write plugins
      • We could probably get rid of metadata and instead make that information available via the label set
    • Minimal CPU footprint: could consider this at a later time; it isn't a blocker. The plan is to run regression tests as things approach the beta stage, so that 6.0 doesn't occupy a lot of CPU
    • One item that might reduce the CPU footprint is modifying string lookups to use a binary search tree; not converting to a string, however, might result in more string comparisons
    • Ability to collect metrics at sub-second level: we already have this. Collectd currently uses an unusual time format internally, in which the value 2^30 is considered one second. We are considering changing this to counting microseconds, since that is easy to convert to something human-readable. It is a purely internal detail; externally we expose seconds with three digits of precision after the decimal separator. Changing it means touching a lot of places, and it is not obvious that this would improve the situation much: a lot of work for little gain. It would, however, make things easier for new contributors
  • Container packaging, better alignment with Kubernetes: not a release item
  • Ability to produce/consume on-wire metric streams in a standardized format: the plan is to have a new write plugin that exports an OpenTelemetry protobuf stream. Ideally it would do much more, for example also offer the ASCII export format that OpenCensus uses; we already have write_prometheus, which uses the Prometheus format
  • Support for distribution metrics: interns are working on it
  • Florian is updating a design doc and will publish it. There is a page in the wiki, the new plugin maintainers guide; it's a starting point and will be expanded
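As a concrete illustration of the sub-second time format discussed above, the sketch below models the 2^30-units-per-second encoding and the microsecond-counting alternative that was floated. All names here are hypothetical, chosen for illustration only; collectd's actual internal time type and conversion macros may differ.

```c
#include <stdint.h>

/* Hypothetical sketch of collectd's internal time representation:
 * a 64-bit unsigned integer in which 2^30 units equal one second. */
typedef uint64_t cdtime_sketch_t;

#define SKETCH_FRACTION_BITS 30
#define SKETCH_ONE_SECOND (1ULL << SKETCH_FRACTION_BITS)

/* Convert seconds (as a double) into the 2^30-based representation. */
static cdtime_sketch_t seconds_to_sketch(double seconds) {
  return (cdtime_sketch_t)(seconds * (double)SKETCH_ONE_SECOND);
}

/* Convert the internal representation back to seconds as a double. */
static double sketch_to_seconds(cdtime_sketch_t t) {
  return (double)t / (double)SKETCH_ONE_SECOND;
}

/* Convert to whole microseconds -- the alternative encoding the
 * minutes mention as easier for humans to read directly. */
static uint64_t sketch_to_usec(cdtime_sketch_t t) {
  return (t * 1000000ULL) >> SKETCH_FRACTION_BITS;
}
```

The binary encoding makes conversions cheap (shifts instead of divisions), but raw values are unreadable without a conversion step; counting plain microseconds would make values directly human-readable, which is the trade-off weighed above.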

Collectd 6.0 release date

  • Announce it widely, consider a flag day, and provide as much info as possible, as early as possible
  • It is hard to accommodate the two types of customers below:
    • Those who expect ancient versions on old hardware
    • Those doing pre-alpha work on new hardware that's never been seen
  • It may not be an issue if we are rolling this out for enterprises
  • We should probably release the last collectd 5.x minor version shortly before we do collectd 6, so that it satisfies 5.x users
    • 5.x will only get minor releases going forward, containing only security updates
    • 5.13 in December would be good, depending on the number of new features going in
  • Then it's up to customers to update
  • Needs testing from a large group, with a beta release of 6.0 for people to try out
    • If testers come back with issues like CPU consumption, crashes, etc., we can look at them more closely
  • Announcement methods: blog post, wiki home page, etc. As soon as we know what the release is going to look like, we can publish a document describing it
  • It might make sense to write a press-release-like blog post with an overview and a longer text; this would be good at the time of the beta release, along with talking points and an upgrade guide
  • It would be better to hold off on announcing until we know more; a simple heads-up that 6.x is coming will not make users happy, and announcing without enough details will cause uncertainty
  • Release date: it makes sense to focus on features at this time. We are not ready to finalize a date, but shoot for the end of the year; there will be a good amount of heads-up time. The beta testing phase should be at least four weeks — we are not going to force people to migrate within a month
  • Florian will port a few core plugins: cpu, mem, disk, partition, temp, etc.

Intern project - distribution metrics

  • Welcome Elene, Barbara & Svetlana
  • The intern project is a new feature for 6.0; it could slip to 6.1
  • Current data types are counter, gauge, and derive. Distribution metrics will additionally be provided, giving more insight into collected parameters
  • Examples: alerting, e.g. when 50% of queries take more than 300ms; another example is that after a crash, one could look at request latencies at 20ms
  • The methodology is to distribute the collected scalar values into buckets and increase the counter of the corresponding bucket, similar to a histogram in Prometheus
  • Counters will be used for the histogram, and averages will be calculated as necessary
  • Contemplating removing derive, or making counter behave like derive by removing the overflow logic. Of the existing collectd 5.0 data types, Prometheus and OpenTelemetry only have an equivalent of derive
  • There is almost no usage of counter metrics in collectd now, and the type is planned to be deprecated. This affects people writing custom plugins, as they pick counter without understanding the implications
  • The scope is to implement the data types and tests; if the interns have time, add this to the daemon; if there is still more time, look at writing one read and one write plugin
  • value_t is a union type: gauge maps to a double, counter to a 64-bit integer. When the distribution data type is added, it could be a pointer to a distribution structure, which means there is no performance impact if no one uses it. If it is used, it will not be as fast as a scalar metric, since updating it works differently than for counter/derive, etc.
  • There are two main use cases:
    • A plugin asynchronously gets updates (for example, counting the latency of requests asynchronously); a read callback once in a while gets a view of the distribution, which is handled by the daemon and passed on to write plugins. Updating the distribution quickly is the main concern here; performing lots of operations per second is important
    • The distribution comes from an external source, say code that scrapes a Prometheus backend (exporter, etc.). In this case we don't use the update function but rather pass the distribution on to the right write plugin. In this scenario, calculating percentiles is more important
  • Overall, in a system with a ton of scalar metrics, this doesn't impact performance much. The interns are to implement their idea and benchmark it to see how it performs under various loads, then figure out the best approach
  • All the write plugins will eventually need to be updated to support the new data type. As a path forward, we should find a relatively simple default behavior to implement, like reporting the median or the 80th percentile; we need to add functions such as a distribution-to-percentile function, and the result can likely be handled as a gauge. Some write plugins, like write_stackdriver and write_prometheus, can use distribution metrics easily. Read plugins such as ping and curl measure response times that could change to distributions. There are a lot of possibilities, but using distribution metrics is not mandatory
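To make the bucket-update and percentile discussion above concrete, here is a minimal sketch of a bucketed distribution type in the spirit of a Prometheus histogram. All type and function names are hypothetical — this is not the interns' actual design nor collectd's API, just an illustration of the two operations discussed: fast updates and percentile queries.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bucketed distribution: fixed upper bounds, one counter
 * per bucket, plus a running sum so averages can be derived. */
typedef struct {
  double *upper_bounds;  /* ascending bucket upper bounds; last bucket catches the rest */
  uint64_t *counts;      /* observation count per bucket */
  size_t num_buckets;
  uint64_t total_count;  /* total number of observations */
  double sum;            /* sum of all observed values */
} dist_sketch_t;

/* Record one observation: find its bucket and increment that counter.
 * This is the hot path for the asynchronous-update use case. */
static void dist_sketch_update(dist_sketch_t *d, double value) {
  size_t i = 0;
  while (i < d->num_buckets - 1 && value > d->upper_bounds[i])
    i++;
  d->counts[i]++;
  d->total_count++;
  d->sum += value;
}

/* Approximate percentile: return the upper bound of the first bucket
 * whose cumulative count reaches p percent of all observations. */
static double dist_sketch_percentile(const dist_sketch_t *d, double p) {
  uint64_t threshold = (uint64_t)((p / 100.0) * (double)d->total_count + 0.5);
  uint64_t cumulative = 0;
  for (size_t i = 0; i < d->num_buckets; i++) {
    cumulative += d->counts[i];
    if (cumulative >= threshold)
      return d->upper_bounds[i];
  }
  return d->upper_bounds[d->num_buckets - 1];
}
```

Note that the percentile is only bucket-resolution accurate; since it collapses to a single number, it is the kind of value a write plugin could expose as an ordinary gauge, as suggested above.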