Plugin:mcelog

From collectd Wiki
Revision as of 16:14, 16 June 2020 by MichaelForde (talk | contribs)

Mcelog plugin
Type: read
Callbacks: config, init, read, shutdown
Status: supported
First version: 5.8
Copyright: 2016–2017 Intel Corporation
License: MIT license
Manpage: collectd.conf(5)

The mcelog plugin sends notifications and statistics about Machine Check Exceptions (MCEs) when they occur. It relies on the mcelog Linux utility to detect that an exception has occurred. mcelog supports a client-server model and handles the logging and accounting of exceptions; the plugin simply uses the mcelog client protocol to learn about them. The goal of this feature is to expose the metrics and events of the platform's Reliability, Availability and Serviceability (RAS) features to higher-level fault management applications. The plugin does the following:

  • Checks mcelog server liveness and reports a failure if the server is not running or if it fails.
  • Retrieves aggregated memory corrected and uncorrected errors from the client protocol (submitted as events/stats).

Mcelog must be configured to run on the platform in daemon mode and logging capabilities must be enabled. For a full description of available options please refer to the collectd.conf(5) manual page.
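The liveness check described above can be approximated from outside collectd. The following is a minimal Python sketch, not the plugin's actual code: it only tests whether something accepts connections on the client socket. The function name mcelog_alive and the timeout value are illustrative assumptions; the default path is the one shown in the Synopsis.

```python
import socket


def mcelog_alive(path="/var/run/mcelog-client", timeout=1.0):
    """Return True if a server accepts connections on the UNIX socket at `path`.

    Hypothetical helper for illustration; it does not speak the mcelog
    client protocol, it only checks that the socket is reachable.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return True
    except OSError:
        # Covers a missing socket file, a refused connection and a timeout.
        return False
    finally:
        sock.close()
```

A monitoring script could call mcelog_alive() periodically and raise an alert when it returns False, which mirrors the failure report the plugin produces when the mcelog server is not running.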

Synopsis

 <Plugin mcelog>
   <Memory>
     McelogClientSocket "/var/run/mcelog-client"
     PersistentNotification false
   </Memory>
 </Plugin>

This will change once the branch "feat_mcelog_mem_notification_level" is merged. For now, if every option is commented out, the plugin defaults to the socket:

# <Plugin mcelog>
#   <Memory>
#     McelogClientSocket "/var/run/mcelog-client"
#     PersistentNotification false
#   </Memory>
#   McelogLogfile "/var/log/mcelog"
# </Plugin>

Parameters

None yet

Metrics


| Metric/Feature Name | Data Type | Format Example | Internal Collectd Version | Description | Dependencies | Limitations | Comments |
|---|---|---|---|---|---|---|---|
| Memory corrected errors | Int | 51522 | None | Number of corrected memory errors since system boot | | | Gets metrics from the mcelog daemon. |
| Memory corrected errors in 24 Hours | Int | 51522 | None | Number of corrected memory errors in the previous 24 hours | | | Gets metrics from the mcelog daemon. |
| Memory uncorrected errors | Int | 51522 | None | Number of uncorrected memory errors since system boot | | | Gets metrics from the mcelog daemon. |
| Memory uncorrected errors in 24 Hours | Int | 51522 | None | Number of uncorrected memory errors in the previous 24 hours | | | Gets metrics from the mcelog daemon. |
| Socket | Int | 0 | None | Number of the socket the error occurred on | | | Gets metrics from the mcelog daemon. |
| Channel | Char | 0 | None | Memory channel; each channel represents a DIMM module | | | Gets metrics from the mcelog daemon. |
| Memory DIMM | Char | B1 | None | Memory DIMM corresponding to the memory used by the cores on which the errors occurred | | | Gets metrics from the mcelog daemon. |
| Memory Slot | Char | 1 | None | Memory slot corresponding to the memory used by the cores on which the errors occurred | | | Gets metrics from the mcelog daemon. |
| CPU ID | Int | 0 | Future | CPU ID of the cores on which the errors occurred. Will be added to the new EDAC plugin. | | | |
| Memory Page | Hex | 0x12345 | Future | Memory page corresponding to the memory used by the cores on which the errors occurred. Will be added to the new EDAC plugin. | | | Not part of Collectd. Currently available with kernel EDAC logs. |
| Memory Offset | Hex | 0x0 | Future | Memory offset in the page. Will be added to the new EDAC plugin. | | | Not part of Collectd. Currently available with kernel EDAC logs. |
| Memory Row | Hex | 0x12345 | | | | | Not part of Collectd. Currently available with kernel EDAC logs. |
| Memory Grain | Int | 8 | Future | The byte granularity of the error (the error grain). Will be added to the new EDAC plugin. | | | Not part of Collectd. Currently available with kernel EDAC logs. |
| Error Syndrome | Hex | 0x6ce3 | Future | Memory syndrome corresponding to the memory used by the cores on which the errors occurred. Will be added to the new EDAC plugin. | | | Not part of Collectd. Currently available with kernel EDAC logs. |
| Error Type | Text | | Future | Error type. Will be added to the new EDAC plugin. | | | Not part of Collectd. Currently available with kernel EDAC logs. |
| Error code | Integer | 0101:0090 | Future | Error code reported by EDAC. Will be added to the new EDAC plugin. | | | Not part of Collectd. Currently available with kernel EDAC logs. |
| Logging | Log path | | | Configurable logging path | | | Not part of Collectd. Currently available with kernel EDAC logs. |
| dimmX or rankX directory info | Varying | | Future | Expose interface files provided by sysfs through mcX/dimmX or rankX directories | | | Not part of Collectd. Currently available with kernel EDAC logs. |
| csrowX directory info | Varying | | Future | Expose interface files provided by sysfs through mcX/csrowX directories | | | Not part of Collectd. Currently available with kernel EDAC logs. |
| RAS interrupts | Count on each core | [CoreID]:[InterruptCount] | Future | Expose the RAS-related interrupts on cores of interest via Collectd | | | Discussion open to see if this info can be exposed through the plugin. |
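
Because the corrected and uncorrected error totals listed above count from system boot, a consumer that wants to know how many errors appeared between two polls has to subtract successive samples. The following Python sketch shows one way to do that; the DimmErrors record, its field names and the new_errors helper are hypothetical and are not part of collectd or mcelog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimmErrors:
    """One sample of the per-DIMM values described in the Metrics table.

    Field names are illustrative assumptions, not collectd identifiers.
    """
    socket: int
    channel: str
    dimm: str
    slot: str
    corrected_total: int    # corrected errors since boot
    uncorrected_total: int  # uncorrected errors since boot


def new_errors(previous, current):
    """Return (corrected, uncorrected) errors added between two samples."""
    def delta(old, new):
        # The totals restart at zero after a reboot, so a decrease is
        # treated as a counter reset rather than a negative error count.
        return new - old if new >= old else new

    return (
        delta(previous.corrected_total, current.corrected_total),
        delta(previous.uncorrected_total, current.uncorrected_total),
    )
```

For example, if the corrected total for a DIMM goes from 10 to 12 between two polls while the uncorrected total goes from 0 to 1, new_errors returns (2, 1).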

Example Graph

None yet.

Dependencies

Also See

RAS/mcelog Plugin High Level Design Tests Executed