Documentation/edac.txt: Add Nehalem specific EDAC characteristics
Nehalem has a different binding to the EDAC API and its own error injection code, so document both.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
parent 4157d9f554
commit 31983a04d6
1 changed file with 110 additions and 0 deletions
@@ -6,6 +6,8 @@ Written by Doug Thompson <dougthompson@xmission.com>
7 Dec 2005
17 Jul 2007 Updated

(c) Mauro Carvalho Chehab <mchehab@redhat.com>
05 Aug 2009 Nehalem interface

EDAC is maintained and written by:

@@ -717,3 +719,111 @@ unique drivers for their hardware systems.
The 'test_device_edac' sample driver is located at the
bluesmoke.sourceforge.net project site for EDAC.

=======================================================================
NEHALEM USAGE OF EDAC APIs

This chapter documents some EXPERIMENTAL mappings of the EDAC API used by
the Nehalem EDAC driver. They will likely change in future versions
of the driver.

Due to the way Nehalem exports Memory Controller data, some adjustments
were made in the i7core_edac driver. This chapter covers those differences.

1) On Nehalem, there is one Memory Controller per Quick Path Interconnect
   (QPI). In the driver, the term "socket" means one QPI. It should also
   correspond to the physical CPU socket.

   Each MC has 3 physical read channels, 3 physical write channels and
   3 logical channels. The driver currently sees them as just 3 channels.
   Each channel can have up to 3 DIMMs.

   The smallest known unit is the DIMM. There is no information about
   csrows. As the EDAC API treats the csrow as its smallest unit, the
   driver exports one DIMM per csrow.

   Currently, it also exports the several memory controllers as just one.
   This limit will be removed in future versions of the driver.
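
   Since each csrow stands for one DIMM here, walking the csrow
   directories gives a per-DIMM view. The following is a minimal sketch,
   not part of the driver: MC_DIR and list_dimms are names chosen for this
   example, and it assumes the generic csrowX/size_mb attribute of the
   EDAC sysfs interface is present.

```shell
#!/bin/sh
# Sketch only: MC_DIR and list_dimms are illustrative names.
MC_DIR="${MC_DIR:-/sys/devices/system/edac/mc/mc0}"

# Print one line per csrow directory (one DIMM per csrow on this driver).
list_dimms() {
    for row in "$MC_DIR"/csrow*; do
        [ -d "$row" ] || continue
        if [ -r "$row/size_mb" ]; then
            printf '%s: %s MB\n' "$(basename "$row")" "$(cat "$row/size_mb")"
        else
            printf '%s: size unknown\n' "$(basename "$row")"
        fi
    done
}

list_dimms
```

   On a machine without the driver loaded, the function simply prints
   nothing.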

2) The Nehalem MC has the ability to generate errors. The driver exposes
   this functionality via some error injection nodes.

   To inject a memory error, use the following sysfs nodes, under
   /sys/devices/system/edac/mc/mc0/:

   inject_addrmatch:
      Controls the error injection mask register. Several characteristics
      of the address can be specified to select where the error is
      generated:
         dimm = the affected dimm. Numbers are relative to a channel;
         rank = the memory rank;
         channel = the channel that will generate an error;
         bank = the affected bank;
         page = the page address;
         column (or col) = the address column.
      Each of the above values can be set to "any" to match any valid value.

      At driver init, all values are set to "any".

      For example, to generate an error at rank 1 of dimm 2, for any channel,
      any bank, any page, any column:
         echo "dimm:2 rank:1" >/sys/devices/system/edac/mc/mc0/inject_addrmatch

      To return to the default behaviour of matching any, you can do:
         echo "dimm:any rank:any" >/sys/devices/system/edac/mc/mc0/inject_addrmatch
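
   The "field:value" strings used above can be assembled by a small
   helper. This is a sketch under assumptions: build_addrmatch is a
   hypothetical function, and it accepts only "any" or decimal numbers
   because those are the only value forms shown in this chapter. Whether
   the driver accepts other notations is not covered here.

```shell
#!/bin/sh
# Sketch only: build_addrmatch is an illustrative helper, not a driver tool.
# Usage: build_addrmatch field=value [field=value ...]
build_addrmatch() {
    out=""
    for pair in "$@"; do
        key="${pair%%=*}"
        val="${pair#*=}"
        # Only the field names documented for inject_addrmatch are allowed.
        case "$key" in
            dimm|rank|channel|bank|page|column|col) ;;
            *) echo "unknown field: $key" >&2; return 1 ;;
        esac
        # Accept "any" or a decimal number (an assumption of this sketch).
        case "$val" in
            any) ;;
            ''|*[!0-9]*) echo "bad value for $key: $val" >&2; return 1 ;;
        esac
        out="${out:+$out }$key:$val"
    done
    printf '%s\n' "$out"
}

build_addrmatch dimm=2 rank=1
# As root, the result could be redirected to the node:
#   build_addrmatch dimm=2 rank=1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch
```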

   inject_eccmask:
      Specifies which bits will have errors.

   inject_section:
      Specifies which ECC cache section will get the error:
         3 for both
         2 for the highest
         1 for the lowest

   inject_socket:
      Specifies which QPI (or processor socket) will generate the error.
         On Xeon 35xx, it should be 0.
         On Xeon 55xx, it should be 0 or 1.

   inject_type:
      Specifies the type of error, as a combination of the following bits:
         bit 0 - repeat
         bit 1 - ecc
         bit 2 - parity

   inject_enable:
      Starts the error generation when a value other than 0 is written.
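
   The inject_type value is a bitwise OR of the bits listed above. As a
   worked example, the sketch below turns flag names into that number.
   inject_type_value is a hypothetical helper and the flag names are just
   the labels used in the bit list.

```shell
#!/bin/sh
# Sketch only: inject_type_value is an illustrative helper.
# bit 0 = repeat (1), bit 1 = ecc (2), bit 2 = parity (4)
inject_type_value() {
    value=0
    for flag in "$@"; do
        case "$flag" in
            repeat) value=$((value | 1)) ;;
            ecc)    value=$((value | 2)) ;;
            parity) value=$((value | 4)) ;;
            *) echo "unknown flag: $flag" >&2; return 1 ;;
        esac
    done
    echo "$value"
}

inject_type_value repeat ecc    # prints 3
```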

   All inject nodes can be read. Root permission is needed to write them.
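
   Because the nodes are readable, the current injection setup can be
   inspected before enabling it. A minimal sketch, assuming every node is
   a plain file as described in this chapter (MC_DIR and
   show_inject_nodes are names made up for the example):

```shell
#!/bin/sh
# Sketch only: MC_DIR and show_inject_nodes are illustrative names.
MC_DIR="${MC_DIR:-/sys/devices/system/edac/mc/mc0}"

# Print the current value of each injection node, if it can be read.
show_inject_nodes() {
    for node in inject_addrmatch inject_eccmask inject_section \
                inject_socket inject_type inject_enable; do
        if [ -f "$MC_DIR/$node" ] && [ -r "$MC_DIR/$node" ]; then
            printf '%s: %s\n' "$node" "$(cat "$MC_DIR/$node")"
        else
            printf '%s: not available\n' "$node"
        fi
    done
}

show_inject_nodes
```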

   The datasheet states that the error will only be generated after a write
   to an address that matches inject_addrmatch. It seems, however, that a
   read will also produce an error.

   For example, the following commands will generate an error for any write
   access at socket 0, on any DIMM/address on channel 2:

   echo "channel:2" >/sys/devices/system/edac/mc/mc0/inject_addrmatch
   echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
   echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
   echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
   echo 0 >/sys/devices/system/edac/mc/mc0/inject_socket
   echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
   dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null

   The generated error message will look like:

   EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))
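
   Fields of such a message can be pulled out with sed when checking an
   injection. The sketch below uses the sample line above as input;
   get_field is a hypothetical helper. In this sample the syndrome, read
   as a number, is 64, the same value as the inject_eccmask used in the
   example. That is an observation about this one message, not a
   documented guarantee.

```shell
#!/bin/sh
# Sketch only: get_field is an illustrative helper.
line='EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))'

# get_field NAME LINE: print the text after "NAME=" up to the next "," or ")".
get_field() {
    printf '%s\n' "$2" | sed -n "s/.*[ (]$1=\([^,)]*\).*/\1/p"
}

get_field socket "$line"                         # prints 0
get_field Channel "$line"                        # prints 2
printf '%d\n' "$(get_field syndrome "$line")"    # prints 64
```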

3) Nehalem specific Corrected Error memory counters

   Nehalem has some registers that count memory errors, reporting them in a
   way that differs from what the EDAC API allows. For that reason, a
   separate sysfs node was created to expose those counters.

   They can be read from the "corrected_error_counts" node:

   $ cat /sys/devices/system/edac/mc/mc0/corrected_error_counts
   dimm0: 15866
   dimm1: 0
   dimm2: 27285
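
   A total across DIMMs can be computed from that output. This is a
   minimal sketch, assuming the "dimmN: count" layout shown above;
   total_corrected is a name made up for the example, and it reads
   standard input when no file is given.

```shell
#!/bin/sh
# Sketch only: total_corrected is an illustrative helper.
# Usage: total_corrected [FILE]   (reads standard input if FILE is omitted)
total_corrected() {
    awk -F': *' '/^dimm[0-9]+:/ { total += $2 } END { print total + 0 }' "${1:--}"
}

# Sample data copied from the output above.
printf 'dimm0: 15866\ndimm1: 0\ndimm2: 27285\n' | total_corrected    # prints 43151
```

   On a live system the node itself can be passed as the argument:
   total_corrected /sys/devices/system/edac/mc/mc0/corrected_error_counts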