Difference between revisions of "Minutes from 2020-07-06"

From collectd Wiki

Revision as of 08:22, 7 July 2020

Attendees: Slawomir Strehlau, Octo, Pawel Zak, Matthias Runge, Robert, Piotr Dresler, Kamil Waitrowski, Sunku


Collectd 6.0 & Open Metrics work (by Octo)

  • The core data structures were changed in the collectd-6.0 branch, which caused a good amount of work
    • The data structures are a C adaptation of the Prometheus protobuf. There is a structure called "metric family" that has a metric name and a metric type (indicating gauge, counter, etc). A metric family can contain many metrics; each metric holds the actual value, the sampling time and interval (the interval being an addition to the Prometheus format), labels, and meta data. Implications:
    • The daemon now takes a metric family instead of a value list (plugin_dispatch_metric_family())
    • It allows plugins like the CPU and memory plugins to send related metrics to the daemon in one call. If the timestamp is zero, they all receive the same timestamp, which simplifies aggregation / alignment.
      Currently, plugins can populate the time field of the value list they dispatch; if it is unset, the dispatch function sets the timestamp. This happens for each value list individually, resulting in differences in the millisecond range between related metrics.
    • Currently there are three differences between the implemented data structures and OpenMetrics / Prometheus:
      • OpenMetrics uses timestamps in milliseconds since the epoch, whereas collectd currently uses the more accurate cdtime_t data structure.
        Suggestion: store times in microseconds instead.
      • metric_t contains an "interval" field that is not present in Prometheus.
      • metric_t contains a "meta" field that is not present in Prometheus. Current thinking is to make this internal-only again, for example to mark metrics received via the network.
    • Functions that work today already:
      • Dispatching / routing of metrics to write plugins.
      • The metrics cache: converting counters to rates, storing meta data with metric identifiers.
    • The CPU, Write Stackdriver, and Write Log plugins have been ported.
    • Formatting in "Graphite", JSON, and Stackdriver works.
    • A few unit tests still need updating.
    • Code that looks up metrics with partial matches, e.g. for thresholds and aggregation, is not written yet and needs a solid design.
  • Migrating plugins
    • 173 plugins can potentially be built in today's collectd. Some of them are barely used and could be removed, for example the Ascent and XMMS plugins.
    • Quite a few decisions need to be taken. For example, the CSV plugin is easy to migrate, but it still has to be decided how to map metric labels to a filesystem path.
    • Each write plugin needs to be updated to accept the new metric_family_t.
    • Many read plugins potentially still build thanks to the compatibility layer, but should be reviewed to make use of the new naming schema.
    • Plugins that allow users to map inputs to plugin instance or type instance need special care, e.g. the PostgreSQL plugin.
    • Suggestion: Let collectd 6 ship with fewer plugins than collectd 5. Let community migrate the plugins they depend on and drop unused plugins.
    • Suggestion: Move infrequently used plugins into a separate repository, similar to the go-plugins and python-plugins repositories.
    • AI(octo): put together a single document on design decisions. The advantage of one huge document over many small ones is that all the information is consolidated in one place.
  • Next steps
    • Not quite ready for everyone to jump in and port plugins, but not far off. It would be great if folks could earmark some time.
    • octo will clean up the git history. Once it is in the collectd-6.0 branch, more people can migrate plugins. Everyone will then be asked to contribute, collaborating either in a Google Doc or on the wiki.
    • Migrating a few plugins to get a feel for the API and make improvements.
      • Currently migrating the Memory plugin to figure out an elegant way to create a metric family with many similar metrics.
  • Misc
    • Ensuring backward compatibility is subtle; expect changes.
    • A release at the end of the year might be realistic, depending on how many people contribute to the migration effort and whether or not we migrate all 173 plugins.
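
The "metric family" idea and the shared-timestamp behaviour described above can be sketched roughly as follows. This is only an illustration under stated assumptions: all sketch_* names, the field layout, and the use of millisecond timestamps are hypothetical and are not the definitions from the collectd-6.0 branch. The conversion helper assumes that cdtime_t counts units of 2^-30 seconds, as in collectd 5.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified types. The real metric_t / metric_family_t in the
 * collectd-6.0 branch may look different. */
typedef enum { SKETCH_GAUGE, SKETCH_COUNTER } sketch_type_t;

typedef struct {
  double value;            /* the actual value */
  uint64_t time_ms;        /* sampling time; 0 means "unset" */
  uint64_t interval_ms;    /* collectd addition, not part of Prometheus */
  const char *label_name;  /* a single label, to keep the sketch short */
  const char *label_value;
} sketch_metric_t;

typedef struct {
  const char *name;        /* metric name shared by all metrics */
  sketch_type_t type;      /* gauge, counter, ... */
  sketch_metric_t *metrics;
  size_t metrics_num;
} sketch_family_t;

/* Give every metric with an unset time the same timestamp, so that related
 * metrics dispatched in one call line up. Returns the number of metrics that
 * were stamped. */
static size_t sketch_fill_time(sketch_family_t *fam, uint64_t now_ms) {
  assert(fam != NULL);
  size_t stamped = 0;
  for (size_t i = 0; i < fam->metrics_num; i++) {
    if (fam->metrics[i].time_ms == 0) {
      fam->metrics[i].time_ms = now_ms;
      stamped++;
    }
  }
  return stamped;
}

/* Convert a fixed-point time with 2^30 units per second (assumed cdtime_t
 * layout) to whole milliseconds, rounding to nearest. */
static uint64_t sketch_fixed_to_ms(uint64_t t) {
  uint64_t whole = t >> 30;                 /* full seconds */
  uint64_t frac = t & ((1ULL << 30) - 1);   /* fractional part */
  return whole * 1000 + ((frac * 1000 + (1ULL << 29)) >> 30);
}
```

A plugin in this sketch would fill one sketch_family_t with, say, one metric per CPU, leave time_ms at zero, and hand the family over in one call. The dispatch side would then call sketch_fill_time() once, so all metrics in the family carry an identical timestamp.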

Porting effort for 6.0

  • The 4.x branch continued to be maintained for a relatively long time; maintenance eventually stopped (about 5 years)
  • However, 4.11 bug fixes kept coming in for about 6 months after the 5.0 release, after which the branch ended up dead
  • So 6.0 will be released, but 5.x will still be supported, limited to major security releases.
  • For the 6.0 release, create a separate directory for less frequently used plugins (e.g. the teamspeak plugin). The CPU and memory plugins could stay in the core plugin list
  • Can 6.0 be released without all plugins having been ported?
    • We need to be open for people to port additional plugins to 6.0, and only accept plugins where people are invested in porting them
    • Florian will send a doc listing which 6.0 features are written and which are still to be done

Go collectd changes for 6.0

  • The 6.0 changes will break the go-collectd framework. Different branches will be maintained, and we will check whether the packages can be kept backward compatible
  • The API needs to be stabilized first. Key plugins are currently being ported in C for 6.0, and a stable API for go-collectd needs to be worked out. Once that is there, no huge changes are expected, and a 6.0 branch that passes metric families can be created in go-collectd. The Go data structures, along with the plain text protocol, need to be updated

Interns on distribution metrics

  • 3 interns started today at Google (in the 2nd or 4th semester of their bachelor's studies). They will research a solution and write a design document on distribution metrics, and will present the doc in the next call. They are working on the 6.0 branch; distribution metrics will be a new feature in 6.0.
  • Example usage of distribution metrics: consider the latency of a web service, reported as 1 metric every 10 seconds, with every metric covering 2000 requests. The naïve approach is to calculate a single value over all requests happening in those 10 seconds. If we want to use latency as a Service Level Indicator, we want to calculate the 95th percentile over all requests instead. This is what distribution metrics allow us to do: they describe a distribution over a certain range of values.
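
The latency example above could look roughly like the following sketch. It assumes a fixed-bucket histogram, which is only one possible design; the interns' design document was not available at the time of the call. All sketch_* names and the bucket boundaries are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BUCKETS 5

/* Hypothetical distribution: counts[i] holds the observations that are
 * <= upper_bounds[i] and greater than the previous bound. The last bucket
 * is a catch-all for anything larger. */
typedef struct {
  double upper_bounds[SKETCH_BUCKETS];
  uint64_t counts[SKETCH_BUCKETS];
  uint64_t total;
} sketch_distribution_t;

/* Record one observation, e.g. the latency of a single request. */
static void sketch_dist_update(sketch_distribution_t *d, double value) {
  assert(d != NULL);
  size_t i = 0;
  while (i < SKETCH_BUCKETS - 1 && value > d->upper_bounds[i])
    i++;
  d->counts[i]++;
  d->total++;
}

/* Estimate a percentile (0 < p <= 100) as the upper bound of the bucket in
 * which the cumulative count reaches p percent of all observations.
 * Returns -1.0 if nothing has been recorded. */
static double sketch_dist_percentile(const sketch_distribution_t *d, double p) {
  assert(d != NULL);
  if (d->total == 0)
    return -1.0;
  double target = p / 100.0 * (double)d->total;
  uint64_t seen = 0;
  for (size_t i = 0; i < SKETCH_BUCKETS; i++) {
    seen += d->counts[i];
    if ((double)seen >= target)
      return d->upper_bounds[i];
  }
  return d->upper_bounds[SKETCH_BUCKETS - 1];
}
```

With bucket bounds of 10, 50, 100, 500 and 1000 ms and 2000 requests in one interval (1800 at 20 ms, 150 at 80 ms, 50 at 400 ms), the 95th percentile estimate falls into the 100 ms bucket. The precision of the estimate is limited by the bucket boundaries, which is the usual trade-off of this approach.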

Go-collectd features