Documentation/edac.txt: Add Nehalem specific EDAC characteristics

As Nehalem has a different binding to EDAC API, and its own different
error injection code, documents it.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
This commit is contained in:
Mauro Carvalho Chehab 2009-08-05 21:16:56 -03:00
parent 4157d9f554
commit 31983a04d6

View file

@ -6,6 +6,8 @@ Written by Doug Thompson <dougthompson@xmission.com>
7 Dec 2005
17 Jul 2007 Updated
(c) Mauro Carvalho Chehab <mchehab@redhat.com>
05 Aug 2009 Nehalem interface
EDAC is maintained and written by:
@ -717,3 +719,111 @@ unique drivers for their hardware systems.
The 'test_device_edac' sample driver is located at the
bluesmoke.sourceforge.net project site for EDAC.
=======================================================================
NEHALEM USAGE OF EDAC APIs
This chapter documents some EXPERIMENTAL mappings for EDAC API to handle
Nehalem EDAC driver. They will likely be changed on future versions
of the driver.
Due to the way Nehalem exports Memory Controller data, some adjustments
were done at i7core_edac driver. This chapter will cover those differences
1) On Nehalem, there are one Memory Controller per Quick Patch Interconnect
(QPI). At the driver, the term "socket" means one QPI. It should also be
associated with the CPU physical socket.
Each MC have 3 physical read channels, 3 physical write channels and
3 logic channels. The driver currenty sees it as just 3 channels.
Each channel can have up to 3 DIMMs.
The minimum known unity is DIMMs. There are no information about csrows.
As EDAC API maps the minimum unity is csrows, the driver exports one
DIMM per csrow.
Currently, it also exports the several memory controllers as just one. This
limit will be removed on future versions of the driver.
2) Nehalem MC has the hability to generate errors. The driver implements this
functionality via some error injection nodes:
For injecting a memory error, there are some sysfs nodes, under
/sys/devices/system/edac/mc/mc0/:
inject_addrmatch:
Controls the error injection mask register. It is possible to specify
several characteristics of the address to match an error code:
dimm = the affected dimm. Numbers are relative to a channel;
rank = the memory rank;
channel = the channel that will generate an error;
bank = the affected bank;
page = the page address;
column (or col) = the address column.
each of the above values can be set to "any" to match any valid value.
At driver init, all values are set to any.
For example, to generate an error at rank 1 of dimm 2, for any channel,
any bank, any page, any column:
echo "dimm:2 rank:1" >/sys/devices/system/edac/mc/mc0/inject_addrmatch
To return to the default behaviour of matching any, you can do:
echo "dimm:any rank:any" >/sys/devices/system/edac/mc/mc0/inject_addrmatch
inject_eccmask:
specifies what bits will have troubles,
inject_section:
specifies what ECC cache section will get the error:
3 for both
2 for the highest
1 for the lowest
inject_socket:
specifies what QPI (or processor socket) will generate the error.
on Xeon 35xx, it should be 0.
on Xeon 55xx, it should be 0 or 1.
inject_type:
specifies the type of error, being a combination of the following bits:
bit 0 - repeat
bit 1 - ecc
bit 2 - parity
inject_enable starts the error generation when something different
than 0 is written.
All inject vars can be read. root permission is needed for write.
Datasheet states that the error will only be generated after a write on an
address that matches inject_addrmatch. It seems, however, that reading will
also produce an error.
For example, the following code will generate an error for any write access
at socket 0, on any DIMM/address on channel 2:
echo "channel:2" > /sys/devices/system/edac/mc/mc0/inject_addrmatch
echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
echo 0 >/sys/devices/system/edac/mc/mc0/inject_socket
echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null
The generated error message will look like:
EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))
3) Nehalem specific Corrected Error memory counters
Nehalem have some registers to count memory errors, reporting it on a
way that it is different from what EDAC API allows. Due to that, a
separate sysfs note were created to handle such counters.
They can be read by looking at the contents of "corrected_error_counts"
counter:
$ cat /sys/devices/system/edac/mc/mc0/corrected_error_counts
dimm0: 15866
dimm1: 0
dimm2: 27285