Plugin:mcelog

From collectd Wiki
Latest revision as of 16:14, 16 June 2020

Mcelog plugin
Type: read
Callbacks: config, init, read, shutdown
Status: supported
First version: 5.8
Copyright: 2016–2017 Intel Corporation
License: MIT license
Manpage: collectd.conf(5)
List of Plugins

The mcelog plugin sends notifications and statistics about Machine Check Exceptions (MCE) when they occur. It relies on the mcelog Linux utility to detect exceptions: mcelog supports a client-server model and handles the logging and accounting of exceptions, and the plugin uses the mcelog client protocol to detect when one has occurred. The goal of this feature is to expose the Reliability, Availability and Serviceability (RAS) metrics and events provided by the platform to higher-level fault management applications.

The plugin does the following:

  • Checks mcelog server liveness and reports a failure if the server is not running or if it fails.
  • Retrieves aggregated memory corrected and uncorrected error counts through the client protocol and submits them as events and statistics.

Mcelog must be configured to run on the platform in daemon mode and logging capabilities must be enabled. For a full description of available options please refer to the collectd.conf(5) manual page.
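The liveness check described above can be illustrated with a minimal Python sketch. It is not the plugin's implementation (the plugin itself is part of collectd); it only shows the idea of probing the client socket. The function name mcelog_is_alive is hypothetical, the socket path is the one shown in the Synopsis, and the use of a stream-type Unix socket is an assumption.

```python
import socket

# Client socket path as shown in the Synopsis; adjust it to your setup.
DEFAULT_SOCKET_PATH = "/var/run/mcelog-client"


def mcelog_is_alive(path=DEFAULT_SOCKET_PATH, timeout=1.0):
    """Return True if something accepts connections on the Unix socket at `path`.

    Assumption: the daemon listens on a stream-type Unix domain socket.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return True
    except OSError:
        # Covers a missing socket file, a refused connection and a timeout.
        return False
    finally:
        sock.close()
```

Calling mcelog_is_alive() with no arguments probes the default path. A False result only says the socket could not be reached; it does not distinguish between a stopped daemon and a misconfigured socket path.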

Synopsis

 <Plugin mcelog>
   <Memory>
     McelogClientSocket "/var/run/mcelog-client"
     PersistentNotification false
   </Memory>
 </Plugin>

This configuration will change once the branch "feat_mcelog_mem_notification_level" is merged. For now, if every option is commented out, the plugin defaults to the client socket:

# <Plugin mcelog>
#   <Memory>
#     McelogClientSocket "/var/run/mcelog-client"
#     PersistentNotification false
#   </Memory>
#   McelogLogfile "/var/log/mcelog"
# </Plugin>

Parameters

None yet

Metrics


{| class="wikitable"
|-
! Metric/Feature Name !! Data Type !! Format Example !! Internal Collectd Version !! Description !! Dependencies !! Limitations !! Comments
|-
| Memory corrected errors || Int || 51522 || None || Number of corrected memory errors since system boot || || || Gets metrics from the mcelog daemon.
|-
| Memory corrected errors in 24 hours || Int || 51522 || None || Number of corrected memory errors in the previous 24 hours || || || Gets metrics from the mcelog daemon.
|-
| Memory uncorrected errors || Int || 51522 || None || Number of uncorrected memory errors since system boot || || || Gets metrics from the mcelog daemon.
|-
| Memory uncorrected errors in 24 hours || Int || 51522 || None || Number of uncorrected memory errors in the previous 24 hours || || || Gets metrics from the mcelog daemon.
|-
| Socket || Int || 0 || None || Socket number the error occurred on || || || Gets metrics from the mcelog daemon.
|-
| Channel || Char || 0 || None || Memory channel; each channel represents a DIMM module || || || Gets metrics from the mcelog daemon.
|-
| Memory DIMM || Char || B1 || None || Memory DIMM corresponding to the memory used by the cores on which errors occurred || || || Gets metrics from the mcelog daemon.
|-
| Memory Slot || Char || 1 || None || Memory slot corresponding to the memory used by the cores on which errors occurred || || || Gets metrics from the mcelog daemon.
|-
| CPU ID || Int || 0 || Future || CPU ID of the cores on which errors occurred. Will be added to the new EDAC plugin. || || ||
|-
| Memory Page || Hex || 0x12345 || Future || Memory page corresponding to the memory used by the cores on which errors occurred. Will be added to the new EDAC plugin. || || || Not part of collectd. Currently available with kernel EDAC logs.
|-
| Memory Offset || Hex || 0x0 || Future || Memory offset in the page. Will be added to the new EDAC plugin. || || || Not part of collectd. Currently available with kernel EDAC logs.
|-
| Memory Row || Hex || 0x12345 || || || || || Not part of collectd. Currently available with kernel EDAC logs.
|-
| Memory Grain || Int || 8 || Future || The byte granularity, or error grain. Will be added to the new EDAC plugin. || || || Not part of collectd. Currently available with kernel EDAC logs.
|-
| Error Syndrome || Hex || 0x6ce3 || Future || Memory syndrome corresponding to the memory used by the cores on which errors occurred. Will be added to the new EDAC plugin. || || || Not part of collectd. Currently available with kernel EDAC logs.
|-
| Error Type || Text || || Future || Error type. Will be added to the new EDAC plugin. || || || Not part of collectd. Currently available with kernel EDAC logs.
|-
| Error Code || Integer || 0101:0090 || Future || Error code reported by EDAC. Will be added to the new EDAC plugin. || || || Not part of collectd. Currently available with kernel EDAC logs.
|-
| Logging || Log path || || || Configurable logging path || || || Not part of collectd. Currently available with kernel EDAC logs.
|-
| dimmX or rankX directory info || Varying || || Future || Expose interface files provided by sysfs through the mcX/dimmX or rankX directories || || || Not part of collectd. Currently available with kernel EDAC logs.
|-
| csrowX directory info || Varying || || Future || Expose interface files provided by sysfs through the mcX/csrowX directories || || || Not part of collectd. Currently available with kernel EDAC logs.
|-
| RAS interrupts || Count on each core || [CoreID]:[InterruptCount] || Future || Expose the RAS-related interrupts on cores of interest via collectd || || || Discussion open to see whether this information can be exposed through the plugin.
|}
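To make the per-DIMM metrics above more concrete, here is a hedged Python sketch that turns a textual memory error report into the socket, channel, DIMM and corrected/uncorrected counters listed in the table. The record layout used here (a "SOCKET ... CHANNEL ... DIMM ..." header followed by "total" and "in 24h" counts) is an assumption made for illustration and may not match what your mcelog version emits; parse_records and the sample values are hypothetical.

```python
import re

# Assumed header layout: "SOCKET <n> CHANNEL <n> DIMM <name>".
HEADER = re.compile(r"^SOCKET (\S+) CHANNEL (\S+) DIMM (\S+)")
TOTAL = re.compile(r"^(\d+) total")
LAST_24H = re.compile(r"^(\d+) in 24h")


def parse_records(text):
    """Parse an assumed mcelog-style memory report into a list of dicts."""
    records = []
    current = None   # record being filled in
    section = None   # "corrected" or "uncorrected"
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            current = {
                "socket": header.group(1),
                "channel": header.group(2),
                "dimm": header.group(3),
            }
            records.append(current)
            section = None
            continue
        if current is None:
            continue  # ignore anything before the first header
        stripped = line.strip()
        if stripped.startswith("corrected memory errors"):
            section = "corrected"
        elif stripped.startswith("uncorrected memory errors"):
            section = "uncorrected"
        elif section is not None:
            total = TOTAL.match(stripped)
            recent = LAST_24H.match(stripped)
            if total:
                current[section + "_total"] = int(total.group(1))
            elif recent:
                current[section + "_24h"] = int(recent.group(1))
    return records
```

Each returned dictionary corresponds to one DIMM and carries the four counters from the first rows of the table, which is roughly the shape a fault management application would consume.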

Example Graph

None yet.

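Dependencies

  • mcelog (http://www.mcelog.org)

Also See

  • RAS/mcelog Plugin High Level Design (https://wiki.opnfv.org/pages/viewpage.action?pageId=13207205)
  • Tests Executed (https://wiki.opnfv.org/display/fastpath/Memory+RAS+Plugin+Executed+Tests)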