Commit graph

317 commits

Author SHA1 Message Date
Tamas Vincze
118f3e1afd edac: i5000_edac critical fix panic out of bounds
EDAC MC0: INTERNAL ERROR: channel-b out of range (4 >= 4)
Kernel panic - not syncing: EDAC MC0: Uncorrected Error  (XEN) Domain 0 crashed: 'noreboot' set - not rebooting.

This happens because FERR_NF_FBD bit 28 is not updated on i5000.  Due to
that, both bits 28 and 29 may be equal to one, returning channel = 3.  As
this value is invalid, EDAC core generates the panic.

Addresses http://bugzilla.kernel.org/show_bug.cgi?id=14568

Signed-off-by: Tamas Vincze <tom@vincze.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-16 12:15:38 -08:00
Roel Kluin
926311fd7d amd64_edac: Ensure index stays within bounds in amd64_get_scrub_rate
Add a missing iterator variable thus fixing the conditional of the
for-loop in amd64_get_scrub_rate().

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2010-01-15 10:45:58 +01:00
Borislav Petkov
5213c32f9d edac, pci: remove pesky debug printk
Do not spam the logs needlessly with the sole info that
edac_pci_dev_parity_clear is being called.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-24 11:07:09 +01:00
Borislav Petkov
92389102b6 amd64_edac: restrict PCI config space access
Do not access F2x19[0,4] on K8 since they're undefined there.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-24 11:07:08 +01:00
Borislav Petkov
43f5e68733 amd64_edac: fix forcing module load/unload
Clear the override flag after force-loading the module.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-24 11:07:08 +01:00
Borislav Petkov
56b34b91e2 amd64_edac: make driver loading more robust
Currently, the module does not initialize fully when the DIMMs aren't
ECC but remains still loaded. Propagate the error when no instance of
the driver is properly initialized and prevent further loading.

Reorganize and polish error handling in amd64_edac_init() while at it.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-24 11:07:07 +01:00
Borislav Petkov
8f68ed9728 amd64_edac: fix driver instance freeing
Fix use-after-free errors by pushing all memory-freeing calls to the end
of amd64_remove_one_instance().

Reported-by: Darren Jenkins <darrenrjenkins@gmail.com>
LKML-Reference: <1261370306.11354.52.camel@ICE-BOX>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-24 11:07:07 +01:00
Borislav Petkov
603adaf6b3 amd64_edac: fix K8 chip select reporting
Fix the case when amd64_debug_display_dimm_sizes() reports only half the
amount of DRAM on it because it doesn't account for when the single DCT
operates in 128-bit mode and merges chip selects from different DIMMs.

Reported-by: Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de>
LKML-Reference: <200912112202.48173.johannes.hirte@fem.tu-ilmenau.de>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-24 11:07:07 +01:00
Linus Torvalds
661e338f72 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp:
  edac, mce, amd: silence GART TLB errors
  edac, mce: correct corenum reporting
2009-12-16 10:09:43 -08:00
Borislav Petkov
256f7276af edac, mce, amd: silence GART TLB errors
Although reporting of benign GART TLB errors is disabled in
__mcheck_cpu_apply_quirks, those are still being logged, and, as a
result, trip up amd64_edac. Pull up reporting check so that machines
with loaded edac module bail out early and don't spit fragments into
dmesg.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-16 17:48:39 +01:00
Nils Carlson
bbead2104e edac: i5100 add 6 ranks per channel
Add support for 6 ranks per channel to the i5100 chipset.  I have tested
the patch as far as possible with correctible errors and things appear
good.  The DIMM mapping is correct for our board, but boards may differ.

Signed-off-by: Nils Carlson <nils.carlson@ludd.ltu.se>
Acked-by: Arthur Jones <ajones@riverbed.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16 07:20:12 -08:00
Nils Carlson
295439f2a3 edac: i5100 add scrubbing
Addscrubbing to the i5100 chipset.  The i5100 chipset only supports one
scrubbing rate, which is not constant but dependent on memory load.  The
rate returned by this driver is an estimate based on some experimentation,
but is substantially closer to the truth than the speed supplied in the
documentation.

Also, scrubbing is done once, and then a done-bit is set.  This means that
to accomplish continuous scrubbing a re-enabling mechanism must be used.
I have created the simplest possible such mechanism in the form of a
work-queue which will check every five minutes.  This interval is quite
arbitrary but should be sufficient for all sizes of system memory.

Signed-off-by: Nils Carlson <nils.carlson@ludd.ltu.se>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16 07:20:12 -08:00
Nils Carlson
b18dfd05f9 edac: i5100 clean controller to channel terms
The i5100 driver uses the word controller instead of channel in a lot of
places, this is simply a cleanup of the patch.

Signed-off-by: Nils Carlson <nils.carlson@ludd.ltu.se>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16 07:20:12 -08:00
Borislav Petkov
35d8069234 edac, mce: correct corenum reporting
Fix core number reporting with NB MCEs.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-15 15:52:13 +01:00
Borislav Petkov
505422517d x86, msr: Add support for non-contiguous cpumasks
The current rd/wrmsr_on_cpus helpers assume that the supplied
cpumasks are contiguous. However, there are machines out there
like some K8 multinode Opterons which have a non-contiguous core
enumeration on each node (e.g. cores 0,2 on node 0 instead of 0,1), see
http://www.gossamer-threads.com/lists/linux/kernel/1160268.

This patch fixes out-of-bounds writes (see URL above) by adding per-CPU
msr structs which are used on the respective cores.

Additionally, two helpers, msrs_{alloc,free}, are provided for use by
the callers of the MSR accessors.

Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Mauro Carvalho Chehab <mchehab@redhat.com>
Cc: Aristeu Rozanski <aris@redhat.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
LKML-Reference: <20091211171440.GD31998@aftab>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-12-11 10:59:21 -08:00
Borislav Petkov
df5b1606bd amd64_edac: bump driver version
This was long overdue ...

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-08 13:38:14 +01:00
Andrew Morton
18ba54ac12 amd64_edac: fix use-uninitialised bug
drivers/edac/amd64_edac.c: In function 'amd64_edac_init':
drivers/edac/amd64_edac.c:2840: warning: 'ret' may be used uninitialized in this function

Cc: Doug Thompson <dougthompson@xmission.com>
Cc: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-08 13:38:13 +01:00
Borislav Petkov
bdc30a0c8c amd64_edac: correct sys address to chip select mapping
The routine does the reverse mapping of the error address of a CECC back
to the node id, DRAM controller and chip select of the DIMM which caused
the error. We should lookup the channel using the syndromes _only_ when
the DCTs are ganged so fix that.

Also, add an early exit when there's an error while scanning for the
csrow thus decreasing indentation levels for better readability.

Finally, fixup comments.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-08 13:38:12 +01:00
Borislav Petkov
bfc04aec7d amd64_edac: add a leaner syndrome decoding algorithm
Instead of using the whole syndrome tables for channel decoding, use a
set of eigenvectors with which the tables can be generated to search for
the syndrome in error. The algorithm operates independently of symbol
size and can be used for both x4 and x8 syndromes.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-08 13:37:59 +01:00
Borislav Petkov
986a42a250 amd64_edac: remove early hw support check
The .probe_valid_hardware low_ops member checked whether the DCTs are in
DDR3 mode and bailed out if so. Now that all the needed changes for DDR3
support is in place, remove it.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-07 19:14:31 +01:00
Borislav Petkov
6b4c0bdeb0 amd64_edac: detect DDR3 memory type
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-07 19:14:31 +01:00
Borislav Petkov
239642fe19 edac: add memory types strings for debugging
Instead of using deeply-nested conditionals for dumping the DIMM type in
debug mode, add a strings array of the supported DIMM types.

This is useful in cases where an edac driver supports multiple DRAM
types and is only defined in debug builds.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-07 19:14:31 +01:00
Borislav Petkov
cec7924f56 edac, mce: update AMD F10h revD check
F10h revD start with model number 8.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-07 19:14:30 +01:00
Borislav Petkov
1f6bcee75e amd64_edac: remove unneeded extract_error_address wrapper
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-07 19:14:30 +01:00
Borislav Petkov
44e9e2ee21 amd64_edac: rename StinkyIdentifier
SystemAddress -> sys_addr

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-07 19:14:30 +01:00
Borislav Petkov
ad858bfa14 amd64_edac: remove superfluous dbg printk
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-07 19:14:29 +01:00
Borislav Petkov
1433eb9903 amd64_edac: enhance address to DRAM bank mapping
Add cs mode to cs size mapping tables for DDR2 and DDR3 and F10
and all K8 flavors and remove klugdy table of pseudo values. Add a
low_ops->dbam_to_cs member which is family-specific and replaces
low_ops->dbam_map_to_pages since the pages calculation is a one liner
now.

Further cleanups, while at it:

- shorten family name defines
- align amd64_family_types struct members

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-07 19:14:29 +01:00
Borislav Petkov
d16149e8c3 amd64_edac: cleanup f10_early_channel_count
Do not read DCLR[01] again since this is done in
amd64_read_mc_registers() earlier. There can be more than two physical
DIMMs present so clamp the channels value to max 2. Also, do not report
DCT data width - it is also done earlier.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-07 19:14:29 +01:00
Borislav Petkov
8566c4df16 amd64_edac: dump DIMM sizes on K8 too
Extend f10_debug_display_dimm_sizes to dump the logical DIMMs
configuration on K8 revF too. Remove the ganged arg since we print the
DCT operating mode (ganged vs unganged) earlier.

Also, DCT csrow configuration is relevant therefore dump it as
KERN_DEBUG instead of only on debug builds. Remove misleading DIMM
output since there's no reliable way of mapping of chip selects to
actual physical DIMMs.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-07 19:14:28 +01:00
Borislav Petkov
8de1d91e62 amd64_edac: cleanup rest of amd64_dump_misc_regs
Clarify bitfields description, add PCI config function/offset names to
registers for easy reference, simplify code layout, remove unneeded
info.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-07 19:14:28 +01:00
Borislav Petkov
68798e1760 amd64_edac: cleanup DRAM cfg low debug output
Carve out the register-specific debug statements into a separate
function, clarify meanings of the single bitfields in the register,
remove irrelevant output and macros.

There should be no functionality change resulting from this patch.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-07 19:14:28 +01:00
Borislav Petkov
6ba5dcdc44 amd64_edac: wrap-up pci config read error handling
Add a pci config read wrapper for signaling pci config space access
errors instead of them being visible only on a debug build. This is
important on amd64_edac since it uses all those pci config register
values to access the DRAM/DIMM configuration of the nodes.

In addition, the wrapper makes a _lot_ (look at the diffstat!) of
error handling code superfluous and improves much of the overall code
readability by removing error handling details out of the way.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-07 19:14:27 +01:00
Borislav Petkov
f6d6ae9657 amd64_edac: unify MCGCTL ECC switching
Unify almost identical code into one function and remove NUMA-specific
usage (specifically cpumask_of_node()) in favor of generic topology
methods.

Remove unused defines, while at it.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-07 19:14:27 +01:00
Rusty Russell
ba578cb34a cpumask: use modern cpumask style in drivers/edac/amd64_edac.c
cpumask_t -> struct cpumask, and don't put one on the stack.  (Note: this
is actually on the stack unless CONFIG_CPUMASK_OFFSTACK=y).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-07 19:14:27 +01:00
Borislav Petkov
e97f8bb8ce amd64_edac: make DRAM regions output more human-readable
Do not shift the TOP_MEM and TOP_MEM2 values by 23 but rather save the
whole 64-bit value read from the MSR. Although the TOP_MEM/TOP_MEM2 bits
are only a subset of the 64bit register, the values are correct since
the remaining bits are Read-As-Zero and no shifting is needed.

Also, cleanup DRAM base/limit debug output.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-07 19:14:27 +01:00
Borislav Petkov
72381bd55e amd64_edac: clarify DRAM CTL debug reporting
Make debug info formulations about the DRAM and DCT configuration of the
machine more human readable.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-12-07 19:14:26 +01:00
Ingo Molnar
26fb20d008 Merge branch 'perf/mce' into perf/core
Merge reason: It's ready for v2.6.33.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-03 20:11:06 +01:00
Borislav Petkov
17adea01b9 amd64_edac: fix CECCs reporting
Shift error type bits properly.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-11-04 14:04:06 +01:00
Li Hong
a3c4c58085 amd64_edac: fix a wrong goto clause in amd64_edac.c
In amd64_edac_init(void) in amd64_edac.c, cache_k8_northbridges() is
called before pci_register_driver. If it fails, should exit with err
directly.

Signed-off-by: Li Hong <lihong.hi@gmail.com>
Acked-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-11-04 14:02:32 +01:00
Keith Mannthey
c2494ace99 edac: i5100 fix initialization code
Allow csrows to properly initialize when the topology only has active
channels on 2 and 3.  This new check allows proper detection and
initialization in this topology.  Only checking the first mrt that
represented channels 0 and 1 is not sufficient.

I also fixed up the related debug information path.  I can submit as a 2nd
patch if needed.

Signed-off-by: Keith Mannthey <kmannth@us.ibm.com>
Acked-by: Aristeu Rozanski <aris@ruivo.org>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-10-29 07:39:30 -07:00
Ira W. Snyder
0616fb003d edac: i5400 fix missing CONFIG_PCI define
When building without CONFIG_PCI the edac_pci_idx variable is unused,
causing a build-time warning.  Wrap the variable in #ifdef CONFIG_PCI,
just like the rest of the PCI support.

Signed-off-by: Ira W. Snyder <iws@ovro.caltech.edu>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-10-29 07:39:30 -07:00
Jeff Roberson
156edd4aaa edac: i5400 fix csrow mapping
The i5400 EDAC driver has several bugs with chip-select row computation
which most likely lead to bugs in detailed error reporting.  Attempts to
contact the authors have gone mostly unanswered so I am presenting my diff
here.  I do not subscribe to lkml and would appreciate being kept in the
cc.

The most egregious problem was miscalculating the addresses of MTR
registers after register 0 by assuming they are 32bit rather than 16.
This caused the driver to miss half of the memories.  Most motherboards
tend to have only 8 dimm slots and not 16, so this may not have been
noticed before.

Further, the row calculations multiplied the number of dimms several
times, ultimately ending up with a maximum row of 32.  The chipset only
supports 4 dimms in each of 4 channels, so csrow could not be higher than
4 unless you use a row per-rank with dual-rank dimms.  I opted to
eliminate this behavior as it is confusing to the user and the error
reporting works by slot and not rank.  This gives a much clearer view of
memory by slot and channel in /sys.

Signed-off-by: Jeff Roberson <jroberson@jroberson.net>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-10-29 07:39:30 -07:00
Borislav Petkov
4997811e3b amd64_edac: fix DRAM base and limit extraction masks, v2
This is a proper fix as a follow-up to 66216a7 and 916d11b.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-10-16 18:51:22 +02:00
Borislav Petkov
fb2531953f mce, edac: Use an atomic notifier for MCEs decoding
Add an atomic notifier which ensures proper locking when conveying
MCE info to EDAC for decoding. The actual notifier call overrides a
default, negative priority notifier.

Note: make sure we register the default decoder only once since
mcheck_init() runs on each CPU.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
LKML-Reference: <20091003065752.GA8935@liondog.tnic>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-10-12 12:24:45 +02:00
Linus Torvalds
624235c5b3 Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, pci: Correct spelling in a comment
  x86: Simplify bound checks in the MTRR code
  x86: EDAC: carve out AMD MCE decoding logic
  initcalls: Add early_initcall() for modules
  x86: EDAC: MCE: Fix MCE decoding callback logic
2009-10-08 12:06:36 -07:00
Borislav Petkov
94baaee494 amd64_edac: beef up DRAM error injection
When injecting DRAM ECC errors (F3xBC_x8), EccVector[15:0] is a bitmask
of which bits should be error injected when written to and holds the
payload of 16-bit DRAM word when read, respectively.

Add /sysfs members to show the DRAM ECC section/word/vector.

Fail wrong injection values entered over /sysfs instead of truncating
them.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-10-07 16:51:28 +02:00
Borislav Petkov
66216a7a15 amd64_edac: fix DRAM base and limit extraction
On Fam10h and above, F1x[1, 0][7C:40] are DRAM Base/Limit registers
which specify the destination node of a DRAM address. Those address
boundaries are being extracted into ->dram_base[] and ->dram_limit[].
Correct the extraction masks to match the respective address bits.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-10-07 16:51:15 +02:00
Borislav Petkov
9d858bb10a amd64_edac: fix chip select handling
Different processor families support a different number of chip selects.
Handle this in a family-dependent way with the proper values assigned at
init time (see amd64_set_dct_base_and_mask).

Remove _DCSM_COUNT defines since they're used at one place and originate
from public documentation.

CC: Keith Mannthey <kmannth@us.ibm.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-10-07 16:50:50 +02:00
Keith Mannthey
2cff18c22c amd64_edac: simple fix to allow reporting of CECC errors
This allows the errors to be further decoded and mapped to csrows.
Tested with ECC debug dimms and an Rev F cpu based system.

Signed-off-by: Keith Mannthey <kmannth@us.ibm.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-10-07 16:49:58 +02:00
Borislav Petkov
8edc544589 amd64_edac: fix K8 intlv_sel check
The check when DRAM interleaving is enabled should be done against the
pvt->dram_IntlvSel field and not against the ->dram_limit.

Simplify first loop and fixup printk formatting while at it.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
2009-10-07 16:49:43 +02:00