Replace per-DCT macros with smarter ones, drop hack and look for the
spare rank on all chip selects on a channel.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
When node interleaving is enabled, a subset of the addr[14:12] bits has
to be removed in order to get the normalized DCT address of the DRAM
channel. The actual number of bits to remove is determined by F1x[1,
0][7C:40][IntlvEn]. Do this correctly.
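A minimal sketch of the bit removal described above. It assumes that the
bits to drop start at addr[12] and that their count equals the number of
set bits in IntlvEn (assumed encodings: 0, 1, 3, 7). The helper names are
illustrative and are not the driver's:

```c
#include <stdint.h>

/* Count the set bits in the 3-bit IntlvEn field. */
static unsigned int intlv_shift(unsigned int intlv_en)
{
	unsigned int n = 0;

	for (intlv_en &= 0x7; intlv_en; intlv_en >>= 1)
		n += intlv_en & 1;
	return n;
}

/*
 * Hypothetical helper: squeeze the node-interleave bits out of an address.
 * Bits [11:0] are kept, the next `shift` bits are dropped, and the upper
 * bits move down to close the gap.
 */
static uint64_t remove_intlv_bits(uint64_t addr, unsigned int intlv_en)
{
	unsigned int shift = intlv_shift(intlv_en);

	return ((addr >> (12 + shift)) << 12) | (addr & 0xfff);
}
```

With IntlvEn == 0 the address is returned unchanged.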
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
On revC3 and revE Fam10h machines and later, non-interleaved graphics
framebuffer memory under the 16G mark can be swapped with a region
located at the bottom of memory so that the GPU can use the interleaved
region and thus two channels. Add support for that.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
The address bits from MC4_STATUS differ only between K8 and the rest so
no need for a per-family method.
No functional change.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Use the struct mce directly instead of copying from it into a custom
struct err_regs.
No functionality change.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
The only difference is that F10h used to sport ganged DCTs and F15h
doesn't so adjust the F10h routine and reuse it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Remove reporting of errors with UC bit set - this is done by the MCE
decoding code anyway and this driver deals with DRAM ECC errors only. UC
(NB uncorrectable error) doesn't necessarily mean it is a DRAM error.
Remove unused macros while at it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Whether we are chipkill capable or not has no bearing on computing the
channel index on a ganged DCT configuration, so remove that. Also,
simplify debug statements. Finally, remove old error injection
leftovers while at it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Remove family names from macro names, drop single bit defines and
comment their meaning instead.
No functional change.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
* Restrict DCT ganged mode check since only Fam10h supports it
* Adjust DRAM type detection for BD since it only supports DDR3
* Remove second and thus unneeded DCLR read in k8_early_channel_count() - we do
that in read_mc_regs()
* Cleanup comments and remove family names from register macros
* Remove unused defines
There should be no functional change resulting from this patch.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Do not read DBAM regs twice and simplify code around them.
There should be no functional change resulting from this patch.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
This function maps the system address to the normalized DCT address.
Document what the code does for more clarity and wrap insane bitmasks in
a more understandable macro which generates them. Also, reduce number of
arguments passed to the function. Finally, rename this function to what
it actually does.
No functional change.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Cleanup and simplify f10_determine_channel(); make it more readable.
Also drop f10_map_intlv_en_to_shift() in favor of simply counting the
bits in F1x124[DramIntlvEn] which is equivalent.
There should be no functionality change resulting from this patch.
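The equivalence can be sketched in plain C, assuming the only valid
IntlvEn encodings are 0, 1, 3 and 7. Both function names below are
stand-ins; in the kernel the bit count would typically come from a helper
such as hweight8():

```c
/* Old-style lookup: map an interleave-enable encoding to a shift amount. */
static int demo_map_intlv_en_to_shift(unsigned int intlv_en)
{
	switch (intlv_en) {
	case 0x1: return 1;
	case 0x3: return 2;
	case 0x7: return 3;
	default:  return 0;
	}
}

/* Replacement: simply count the set bits. */
static int demo_count_intlv_bits(unsigned int intlv_en)
{
	int n = 0;

	for (; intlv_en; intlv_en &= intlv_en - 1)
		n++;
	return n;
}
```

The two agree only for the contiguous encodings listed above; a value
such as 0x5 would give different results.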
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Add a struct representing the DRAM chip select base/limit register
pairs. Concentrate all CS handling in a single function. Also, add CS
looping macros for cleaner, more readable code. While at it, adjust code
to F15h. Finally, do smaller macro names cleanups (remove family names
from register macros) and debug messages clarification.
No functional change.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Add a struct representing the DRAM base/limit range pairs and remove all
cached subfields. Replace them with accessor functions, which actually
saves us some space:
   text    data     bss     dec     hex filename
  14712    1577     336   16625    40f1 drivers/edac/amd64_edac_mod.o.after
  14831    1609     336   16776    4188 drivers/edac/amd64_edac_mod.o.before
Also, it simplifies the code a lot, allowing the K8 and F10h routines to
be merged.
No functional change.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
F15h "multiplexes" between the configuration space of the two DRAM
controllers by toggling D18F1x10C[DctCfgSel] while F10h has a different
set of registers for DCT0 and DCT1 in extended PCI config space.
Add DCT configuration space accessors per family thus wrapping all the
different access prerequisites. Clean up code while at it, shorten
names.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Raise the debug level of these routines so that their output gets issued
only when the highest debug level is selected. Otherwise, don't pollute
driver debug output.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Add tile support for the EDAC driver, which provides unified system
error (memory, PCI, etc.) reporting. For now, the TILEPro port
reports memory correctable error (CE) only.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Get rid of old users of of_platform_driver in arch/powerpc. Most
of_platform_driver users can be converted to use the platform_bus
directly.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Fix the totally wrong page, row and DIMM-label info previously reported
to edac-core by the i82975x driver.
Signed-off-by: Arvind R. <arvino55@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
amd64_debug_display_dimm_sizes() reports the distribution of the DIMMs
on each DRAM controller and its chip select sizes. Thus, the latter have
nothing to do with whether we're running in ganged DCT mode or not
- their sizes don't change all of a sudden. Fix that by removing the
ganged-check and dump DCT0's config for DCT1 when in ganged mode since
they're identical.
Reported-and-tested-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Fix a bunch of
warning: ‘inline’ is not at beginning of declaration
messages when building a 'make allyesconfig' kernel with -Wextra.
These warnings are trivial to kill, yet rather annoying when building with
-Wextra.
The more we can cut down on pointless crap like this the better (IMHO).
A previous patch to do this for a 'allnoconfig' build has already been
merged. This just takes the cleanup a little further.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Minor formatting fixup since the information which core was associated
with the MCE is not always valid.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Building for X86_32 produces shift count warnings, so use BIT_64() to
eliminate the warnings.
drivers/edac/mce_amd.c:778: warning: left shift count >= width of type
drivers/edac/mce_amd.c:778: warning: left shift count >= width of type
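A sketch of why the macro helps: on a 32-bit build, a shift such as
1 << 40 or 1UL << 40 operates on a 32-bit operand, so the count exceeds
the type width. The macro below is a userspace stand-in for the kernel's
BIT_64(), and the helper and the bit number used with it are chosen for
illustration only:

```c
#include <stdint.h>

/* Userspace stand-in for BIT_64(): force the shifted operand to 64 bits. */
#define BIT_64(n) (UINT64_C(1) << (n))

/* Test a single bit of a 64-bit status word; well-defined on 32-bit too. */
static int status_bit_set(uint64_t status, unsigned int bit)
{
	return (status & BIT_64(bit)) != 0;
}
```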
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Doug Thompson <dougthompson@xmission.com>
Cc: bluesmoke-devel@lists.sourceforge.net
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Now that everything is in place, enable MCE decoding on F15h. Make the
initcall routine a bit more readable.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Shorten up MCi_STATUS flags and add BD's new deferred and poison types.
Also, simplify formatting.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
MCE bank 2 is redefined from a BU to a CU (Combined Unit) bank on F15h.
Add a decoder function for CU MCEs.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Add a decoder for F15h DC MCEs to support the new types of DC MCEs
introduced by the BD microarchitecture.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
F15h enlarges the extended error code of an MCE to a 5-bit field
(MCi_STATUS[20:16]). Add a mask variable whose default of 0xf is
overridden on F15h.
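Roughly, the extraction looks like the sketch below. It uses pure
functions rather than a driver-global mask variable, takes the 0x1f value
as the natural mask for a 5-bit field, and all names are hypothetical:

```c
#include <stdint.h>

/* Pick the extended error code mask: 4 bits by default, 5 bits on F15h. */
static unsigned int demo_xec_mask(unsigned int family)
{
	return (family == 0x15) ? 0x1f : 0xf;
}

/* The extended error code starts at bit 16 of MCi_STATUS. */
static unsigned int demo_extract_xec(uint64_t status, unsigned int mask)
{
	return (unsigned int)(status >> 16) & mask;
}
```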
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
K8 does not allow for an atomic RMW to a cacheline as F10h does so
disable the error injection interface for it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Make the ->{get|set}_sdram_scrub_rate return the actual scrub rate
bandwidth it succeeded setting and remove superfluous arg pointer used
for that. A negative value returned still means that an error occurred
while setting the scrubrate. Document this for future reference.
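The return convention can be illustrated as follows. The bandwidth table
and the "largest supported rate not exceeding the request" policy are
assumptions made for the example and are not taken from any driver:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical supported bandwidths in bytes/sec, in descending order. */
static const uint32_t demo_supported_bw[] = {
	1600000000u, 800000000u, 400000000u, 100000000u,
};

/*
 * Return the bandwidth actually programmed (>= 0), or a negative errno
 * if no supported rate satisfies the request.
 */
static int demo_set_scrub_rate(uint32_t requested_bw)
{
	size_t i;
	size_t n = sizeof(demo_supported_bw) / sizeof(demo_supported_bw[0]);

	for (i = 0; i < n; i++)
		if (demo_supported_bw[i] <= requested_bw)
			return (int)demo_supported_bw[i];

	return -EINVAL;
}
```

A caller only needs to check the sign of the result and no longer needs a
separate output pointer for the rate.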
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Now that all prerequisites are in place, drop the two-stage driver
instances initialization in favor of the following simple init sequence:
1. Probe PCI device: we only test ECC capabilities here and if none exit
early.
2. If the hw supports ECC and it is/can be enabled, we init the per-node
instance.
Remove "amd64_" prefix from static functions touched, while at it.
There actually should be no visible functional change resulting from
this patch.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Rework the code to check the hardware ECC capabilities at PCI probing
time. We do all further initialization only if we actually can/have ECC
enabled.
While at it:
0. Fix function naming.
1. Simplify/clarify debug output.
2. Remove amd64_ prefix from the static functions
3. Reorganize code.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
This is in preparation for the init path reorganization where we want
only to
1) test whether a particular node supports ECC
2) can it be enabled
and only then do the necessary allocation/initialization. For that,
we need to decouple the ECC settings of the node from the instance's
descriptor.
There should be no functional change introduced by this patch.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
PCI ECS has been enabled by default since 2.6.26 on AMD so this code is
just superfluous now, remove it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Remove static allocation in favor of dynamically allocating space for as
many driver instances as northbridges present on the system.
There should be no functional change resulting from this patch.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Add a macro per printk level, shorten up error messages. Add relevant
information to KERN_INFO level. No functional change.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Rename variables representing PCI devices to their BKDG names for faster
search and shorter, clearer code.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Move the remaining per-family init code into the proper place and
simplify the rest of the initialization. Reorganize error handling in
amd64_init_one_instance().
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Run a per-family init function which does all the settings based on
the family this driver instance is running on. Move the scrubrate
calculation in it and simplify code.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
* 'x86-amd-nb-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, cacheinfo: Cleanup L3 cache index disable support
x86, amd-nb: Cleanup AMD northbridge caching code
x86, amd-nb: Complete the rename of AMD NB and related code
Conflicts:
MAINTAINERS
arch/arm/mach-omap2/pm24xx.c
drivers/scsi/bfa/bfa_fcpim.c
Needed to update to apply fixes for which the old branch was too
outdated.
When matching error address to the range contained by one memory node,
we're in valid range when node interleaving
1. is disabled, or
2. enabled and when the address bits we interleave on match the
interleave selector on this node (see the "Node Interleaving" section in
the BKDG for an enlightening example).
Thus, when we early-exit, we need to reverse the compound logic
statement properly.
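The logic can be sketched like this, assuming the interleaved address
bits are addr[14:12] masked by the interleave-enable value. The function
names are made up for the illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Address bits [14:12], the ones node interleaving is assumed to use. */
static unsigned int demo_intlv_bits(uint64_t addr)
{
	return (unsigned int)(addr >> 12) & 0x7;
}

/* Positive condition: interleaving is off, or the selector matches. */
static bool demo_in_valid_range(uint64_t addr, unsigned int intlv_en,
				unsigned int intlv_sel)
{
	return !intlv_en || (demo_intlv_bits(addr) & intlv_en) == intlv_sel;
}

/* Early-exit condition: the De Morgan negation of the above. */
static bool demo_skip_node(uint64_t addr, unsigned int intlv_en,
			   unsigned int intlv_sel)
{
	return intlv_en && (demo_intlv_bits(addr) & intlv_en) != intlv_sel;
}
```

Negating only one half of the compound statement is the kind of slip
this commit guards against.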
Cc: <stable@kernel.org>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
This corrects the misprint introduced when moving '#if
PAGE_SHIFT' from i7core_edac.c to edac_core.h (commit
e9144601d3)
Cc: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Andrei Konovalov <akonovalov@mvista.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
00740c5854 changed edac_core to
un-/register a workqueue item only if a lowlevel driver supplies a
polling routine. Normally, when we remove a polling low-level driver, we
go and cancel all the queued work. However, the workqueue unreg happens
based on the ->op_state setting, and edac_mc_del_mc() sets this to
OP_OFFLINE _before_ we cancel the work item, leading to NULL ptr oops on
the workqueue list.
Fix it by putting the unreg stuff in proper order.
Cc: <stable@kernel.org> #36.x
Reported-and-tested-by: Tobias Karnat <tobias.karnat@googlemail.com>
LKML-Reference: <1291201307.3029.21.camel@Tobias-Karnat>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Otherwise, variable i will be -1 inside the last iteration of the
while loop.
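The loop itself is not quoted in this message, so the following is only a
generic illustration of how an unwind loop can end up with an index of -1
in its body. Both functions are invented and report the lowest index the
body ever sees:

```c
/* Post-decrement after the test: the final pass runs with i == -1. */
static int demo_lowest_index_buggy(int i)
{
	int lowest = i;

	while (i-- >= 0)
		if (i < lowest)
			lowest = i;
	return lowest;
}

/* Pre-decrement before the test: the body only sees i >= 0. */
static int demo_lowest_index_fixed(int i)
{
	int lowest = i;

	while (--i >= 0)
		if (i < lowest)
			lowest = i;
	return lowest;
}
```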
Signed-off-by: Axel Lin <axel.lin@gmail.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Change EDAC's Makefile to use <modules>-y instead of
<modules>-objs because -objs is deprecated and not mentioned in
Documentation/kbuild/makefiles.txt.
[bp: Fixup commit message]
[bp: Fixup indentation]
Signed-off-by: Tracey Dent <tdent48227@gmail.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Support more than just the "Misc Control" part of the northbridges.
Support more flags by turning "gart_supported" into a single bit flag
that is stored in a flags member. Clean up related code by using a set
of functions (amd_nb_num(), amd_nb_has_feature() and node_to_amd_nb())
instead of accessing the NB data structures directly. Reorder the
initialization code and put the GART flush words caching in a separate
function.
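As a rough userspace sketch of the flags approach, with invented names
and bit values prefixed demo_ to keep them apart from the real amd_nb
code:

```c
#include <stdbool.h>

#define DEMO_NB_GART	0x1	/* hypothetical feature bits */
#define DEMO_NB_OTHER	0x2

struct demo_nb_info {
	unsigned int num;	/* number of northbridges found */
	unsigned int flags;	/* one bit per supported feature */
};

static struct demo_nb_info demo_nbs = { .num = 2, .flags = DEMO_NB_GART };

/* Accessors hide the structure layout from callers. */
static unsigned int demo_nb_num(void)
{
	return demo_nbs.num;
}

static bool demo_nb_has_feature(unsigned int feature)
{
	return (demo_nbs.flags & feature) == feature;
}
```

Adding a further feature then costs one more bit rather than one more
struct member.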
Signed-off-by: Hans Rosenfeld <hans.rosenfeld@amd.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Not only the naming of the files was confusing, it was even more so for
the function and variable names.
Renamed the K8 NB and NUMA stuff that is also used on other AMD
platforms. This also renames the CONFIG_K8_NUMA option to
CONFIG_AMD_NUMA and the related file k8topology_64.c to
amdtopology_64.c. No functional changes intended.
Signed-off-by: Hans Rosenfeld <hans.rosenfeld@amd.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
* 'linux_next' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/i7core: (34 commits)
i7core_edac: return -ENODEV when devices were already probed
i7core_edac: properly terminate pci_dev_table
i7core_edac: Avoid PCI refcount to reach zero on successive load/reload
i7core_edac: Fix refcount error at PCI devices
i7core_edac: it is safe to i7core_unregister_mci() when mci=NULL
i7core_edac: Fix an oops at i7core probe
i7core_edac: Remove unused member channels in i7core_pvt
i7core_edac: Remove unused arg csrow from get_dimm_config
i7core_edac: Reduce args of i7core_register_mci
i7core_edac: Introduce i7core_unregister_mci
i7core_edac: Use saved pointers
i7core_edac: Check probe counter in i7core_remove
i7core_edac: Call pci_dev_put() when alloc_i7core_dev() failed
i7core_edac: Fix error path of i7core_register_mci
i7core_edac: Fix order of lines in i7core_register_mci
i7core_edac: Always do get/put for all devices
i7core_edac: Introduce i7core_pci_ctl_create/release
i7core_edac: Introduce free_i7core_dev
i7core_edac: Introduce alloc_i7core_dev
i7core_edac: Reduce args of i7core_get_onedevice
...
* 'devel' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/edac: (25 commits)
i7300_edac: Properly initialize per-csrow memory size
V4L/DVB: i7300_edac: better initialize page counts
MAINTAINERS: Add maintainer for i7300-edac driver
i7300-edac: CodingStyle cleanup
i7300_edac: Improve comments
i7300_edac: Cleanup: reorganize the file contents
i7300_edac: Properly detect channel on CE errors
i7300_edac: enrich FBD error info for corrected errors
i7300_edac: enrich FBD error info for fatal errors
i7300_edac: pre-allocate a buffer used to prepare err messages
i7300_edac: Fix MTR x4/x8 detection logic
i7300_edac: Make the debug messages coherent with the others
i7300_edac: Cleanup: remove get_error_info logic
i7300_edac: Add a code to cleanup error registers
i7300_edac: Add support for reporting FBD errors
i7300_edac: Properly detect the type of error correction
i7300_edac: Detect if the device is on single mode
i7300_edac: Adds detection for enhanced scrub mode on x8
i7300_edac: Clear the error bit after reading
i7300_edac: Add error detection code for global errors
...
Due to the nature of i7core, we need to probe and attach all PCI
devices used by this driver during the first time probe is called.
However, PCI core will call the probe routine one time for each CPU
socket. If we return -EINVAL to those calls, it would seem that the
driver fails, when, in fact, there's no more devices left to initialize.
Changing the return code to -ENODEV solves this issue.
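A simplified picture of the idea, using a hypothetical probe function and
counter rather than the driver's actual code:

```c
#include <errno.h>

/* Counts how many times the (hypothetical) probe has done real work. */
static int demo_probed;

/* The first call binds every socket at once; later calls have nothing to do. */
static int demo_probe(void)
{
	if (demo_probed >= 1)
		return -ENODEV;	/* no device left to bind, not a failure */

	demo_probed++;
	/* ... attach all PCI devices for all sockets here ... */
	return 0;
}
```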
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
pci_xeon_fixup() expects a null-terminated table, while
i7core_get_all_devices just does a for loop over 0..ARRAY_SIZE. As other
tables are zero-terminated, change this one to be terminated with 0 as
well, which fixes a bug where the code may run past the table elements.
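For illustration, a terminated table and a walker that relies on the
terminator might look like this. The struct, table contents and names are
invented for the example:

```c
#include <stddef.h>

struct demo_dev_table {
	const char *name;	/* NULL marks the end of the table */
	int n_devs;
};

static const struct demo_dev_table demo_table[] = {
	{ "first",  3 },
	{ "second", 2 },
	{ NULL,     0 },	/* terminator */
};

/* Walk until the terminator; no array size needs to be passed around. */
static int demo_count_entries(const struct demo_dev_table *t)
{
	int n = 0;

	while (t->name) {
		n++;
		t++;
	}
	return n;
}
```

Without the terminating entry, the same walker would read past the end
of the array.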
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
That's a nasty bug that took me a lot of time to track down, and whose
fix took just one line. The best fragrances and the worst poisons are
shipped in the smallest bottles.
drivers/pci/search.c implements the pci_get_device() function. The normal
behavior is that you call it, and the function returns a pdev pointer
and increments pdev->kobj.kref.refcount of the PCI device. However,
if you want to keep searching, you need to pass the previous
pdev pointer to the search.
When you pass a non-NULL pointer as the "from" argument, pci_get_device()
decrements that pdev's kobj.kref.refcount, assuming that the driver won't
be using the previous pdev.
The solution is simple: we just need to call pci_dev_get() manually,
for the pdev's that the driver will actually use.
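The refcount behavior can be mimicked with a small mock. Nothing below is
the PCI core's code: mock_get_device() only imitates the "drop the
reference on from, take one on the result" semantics described above:

```c
#include <stddef.h>

struct mock_dev {
	int refcount;
};

static struct mock_dev mock_devs[3] = { { 1 }, { 1 }, { 1 } };

/* Drop the reference on `from`, take one on the next device, if any. */
static struct mock_dev *mock_get_device(struct mock_dev *from)
{
	struct mock_dev *next = from ? from + 1 : mock_devs;

	if (from)
		from->refcount--;
	if (next >= mock_devs + 3)
		return NULL;
	next->refcount++;
	return next;
}

/* Walk all devices and remember device 1, optionally pinning it. */
static struct mock_dev *mock_find_and_keep(int take_extra_ref)
{
	struct mock_dev *d = NULL, *kept = NULL;

	while ((d = mock_get_device(d)) != NULL) {
		if (d == &mock_devs[1]) {
			kept = d;
			if (take_extra_ref)
				kept->refcount++;	/* stands in for pci_dev_get() */
		}
	}
	return kept;
}
```

Without the extra reference, the kept device falls back to its initial
count once the walk moves on, so the driver holds a pointer it does not
own.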
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Probably due to a bug or some testing logic at PCI level, device
refcount for <bus>:00.0 device is decremented at the end of the
pci_get_device, made by i7core_get_all_devices(). The fact is that
the first versions of the driver relied on those devices to probe
for Nehalem, but the current versions don't use it at all.
So, let's just remove those devices from the driver, making it simpler
and fixing the bug.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Changeset c91d57ba9ce5b5c93a7077e2f72510eb1f9131c4 moved the init
of the priv pointer to the end of the probe routine. However, we need
it before that; otherwise, we hit an OOPS:
[ 67.743453] EDAC DEBUG: mci_bind_devs: Associated fn 0.0, dev = ffff88011b46e000, socket 0
[ 67.751861] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[ 67.759685] IP: [<ffffffffa017e484>] i7core_probe+0x979/0x130c [i7core_edac]
[ 67.766721] PGD 10bd38067 PUD 10bd37067 PMD 0
[ 67.771178] Oops: 0000 [#1] SMP
[ 67.774414] last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map
[ 67.782213] CPU 1
[ 67.784042] Modules linked in: i7core_edac(+) edac_core cpufreq_ondemand binfmt_misc dm_multipath video output pci_slot snd_hda_codd
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
We can check the number of channels in i7core_register_mci.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
In i7core_probe, when setup of the mci for the 2nd or a later socket
fails, we should clean up the mci already prepared for the earlier
sockets before the "put" of all devices.
So let's have an i7core_unregister_mci that can be shared between here
and i7core_remove.
While here, fix a typo "hanler".
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Prevent i7core_remove from running multiple times.
Otherwise the value of probed will go negative and things will go wrong.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
The flag is_registered is not initialized until mci_bind_devs()
is called. Refer to it properly.
mci->dev and mci->edac_check are required in edac_mc_add_mc(),
so prepare them just before the call.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
We already do 'get' for all sockets at once. So do 'put' in the
same way.
Also make the 'get' function take void, since it handles only the
single, static, known-size table pci_dev_table[].
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Add a pair of methods.
While here, sort out the lines in i7core_register_mci() a bit.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Add a method to pair with the previously introduced alloc_i7core_dev().
Using them as a pair will help proper resource handling.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
It's nice to have a method for a single purpose.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Since we need to pass the index of the entry, pass the table itself
instead of passing individual members of the table.
While here make it static.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Commit 47251b4d960bdfa648b0d06dbc6d445f41cb3906 changed
the logic for unexplained reasons. It looks strange that it
can release i7core_dev without calling i7core_put_devices(),
which releases i7core_dev->pdev.
Fix that part.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
The legacy PCI probe sometimes causes hangs. Better to have it
disabled by default, and have a parameter to enable it.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
This is a nasty bug. Since the kobject count is reduced to zero by
edac_mc_del_mc(), which triggers the kobj release method, the
mci memory is freed automatically. So, all we have left is ctl_name,
as shown by enabling debug:
[ 80.822186] EDAC DEBUG: in drivers/edac/edac_mc_sysfs.c, line at 1020: edac_remove_sysfs_mci_device() remove_link
[ 80.832590] EDAC DEBUG: in drivers/edac/edac_mc_sysfs.c, line at 1024: edac_remove_sysfs_mci_device() remove_mci_instance
[ 80.843776] EDAC DEBUG: in drivers/edac/edac_mc_sysfs.c, line at 640: edac_mci_control_release() mci instance idx=0 releasing
[ 80.855163] EDAC MC: Removed device 0 for i7core_edac.c i7 core #0: DEV 0000:3f:03.0
[ 80.862936] EDAC DEBUG: in drivers/edac/i7core_edac.c, line at 2089: (null): free structs
[ 80.871134] EDAC DEBUG: in drivers/edac/edac_mc.c, line at 238: edac_mc_free()
[ 80.878379] EDAC DEBUG: in drivers/edac/edac_mc_sysfs.c, line at 726: edac_mc_unregister_sysfs_main_kobj()
[ 80.888043] EDAC DEBUG: in drivers/edac/i7core_edac.c, line at 1232: drivers/edac/i7core_edac.c: i7core_put_devices()
Also, kfree(mci) shouldn't happen at the kobj.release, as it happens
when edac_remove_sysfs_mci_device() is called, but the logic is:
	edac_remove_sysfs_mci_device(mci);
	edac_printk(KERN_INFO, EDAC_MC,
		"Removed device %d for %s %s: DEV %s\n", mci->mc_idx,
		mci->mod_name, mci->ctl_name, edac_dev_name(mci));
So, as the edac_printk() needs the mci struct, this generates an OOPS.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
A very nasty bug was happening in the edac core, due to the way mci
objects are freed. mci memory is freed by edac_mci_control_release() when
the kobject count reaches zero. However, from the logs, this clearly
happens before the final use of the mci struct:
[15799.607454] EDAC DEBUG: in drivers/edac/edac_mc_sysfs.c, line at 640: edac_mci_control_release() mci instance idx=0 releasing
[15799.618773] EDAC DEBUG: in drivers/edac/edac_mc_sysfs.c, line at 769: edac_inst_grp_release()
[15799.627326] EDAC DEBUG: in drivers/edac/edac_mc_sysfs.c, line at 894: edac_remove_mci_instance_attributes() end of seeking for group all_channel_counts
[15799.640887] EDAC DEBUG: in drivers/edac/edac_mc_sysfs.c, line at 877: edac_remove_mci_instance_attributes() sysfs_attrib = ffffffffa01d7240
[15799.653412] EDAC DEBUG: in drivers/edac/edac_mc_sysfs.c, line at 1020: edac_remove_sysfs_mci_device() remove_link
[15799.663753] EDAC DEBUG: in drivers/edac/edac_mc_sysfs.c, line at 1024: edac_remove_sysfs_mci_device() remove_mci_instance
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
There are two groups of sysfs attributes: one for rdimm and another
for udimm. Instead of dynamically changing the single static struct
to handle udimms, declare two vars and make them constant.
This avoids the risk of having two or more memory controllers, each
needing a different set of attributes.
While here, use const in all places where it is applicable.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
edac_core: use const for constant sysfs arguments
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
With multi-sockets, more than one edac pci handler is enabled. Be sure to
un-register all instances.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
* 'x86-amd-nb-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, amd_nb: Enable GART support for AMD family 0x15 CPUs
x86, amd: Use compute unit information to determine thread siblings
x86, amd: Extract compute unit information for AMD CPUs
x86, amd: Add support for CPUID topology extension of AMD CPUs
x86, nmi: Support NMI watchdog on newer AMD CPU families
x86, mtrr: Assume SYS_CFG[Tom2ForceMemTypeWB] exists on all future AMD CPUs
x86, k8: Rename k8.[ch] to amd_nb.[ch] and CONFIG_K8_NB to CONFIG_AMD_NB
x86, k8-gart: Decouple handling of garts and northbridges
x86, cacheinfo: Fix dependency of AMD L3 CID
x86, kvm: add new AMD SVM feature bits
x86, cpu: Fix allowed CPUID bits for KVM guests
x86, cpu: Update AMD CPUID feature bits
x86, cpu: Fix renamed, not-yet-shipping AMD CPUID feature bit
x86, AMD: Remove needless CPU family check (for L3 cache info)
x86, tsc: Remove CPU frequency calibration on AMD
Fix
drivers/edac/mce_amd.c:262: warning: left shift count >= width of type
on 32-bit builds.
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
F11h has almost the same MCE signatures as K8 except DRAM ECC and MC5
bank errors. Reuse functionality from the other families.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Now that all decoders have been taught about F14h, models < 0x10
MCEs, enable decoding on this family of CPUs. Also, issue a short
informational message upon boot that MCE decoding gets enabled.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
F14h CPUs do not generate LS MCEs so exit early and warn the user in
case this path is ever hit that something else might be going haywire.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Add support for IC MCEs for F14h CPUs. K8 and F10h are almost identical
so use one function for both.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Add per-family data cache decoders. Since there is a certain overlap
between the different DC MCE signatures, reuse functionality between the
families as far as possible.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Drop "edac_" string from the filenames since they're prefixed with edac/
in their pathname anyway.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Add sysfs injection facilities for testing of the MCE decoding code.
Remove large parts of amd64_edac_dbg.c, as a result, which did only
NB MCE injection anyway and the new injection code supports that
functionality already.
Add an injection module so that MCE decoding code in production kernels
like those in RHEL and SLES can be tested.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Move toplevel sysfs class to the stub and make it available to
non-modularized code too. Add proper refcounting of its users and move
the registration functionality into the reference counting routines.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
... instead of the MCi_STATUS info only for improved handling of certain
types of errors later.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
.. so that the user knows what she's looking at there in dmesg. Also,
fix a minor cosmetic output inconsistency.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
The patch below updates broken web addresses in the kernel.
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Cc: Maciej W. Rozycki <macro@linux-mips.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Finn Thain <fthain@telegraphics.com.au>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Dimitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Mike Frysinger <vapier.adi@gmail.com>
Acked-by: Ben Pfaff <blp@cs.stanford.edu>
Acked-by: Hans J. Koch <hjk@linutronix.de>
Reviewed-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
f4347553b3 removed the edac polling
mechanism in favor of using a notifier chain for conveying MCE
information to edac. However, the module removal path didn't test
whether the driver had set up the polling function workqueue at all and
the rmmod process was hanging in the kernel at try_to_del_timer_sync()
in the cancel_delayed_work() path, trying to cancel an uninitialized
work struct.
Fix that by adding a balancing check to the workqueue removal path.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Due to the current edac-core limits, we cannot represent a per-channel
memory size for FB-DIMM drivers. So, we need to sum up all values
for each slot in order to properly represent the total amount of
memory found by the i7300 driver.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
It is still somewhat fake, as the pages may not be in this exact order,
and may even be used in mirror mode, but this is a better guess than the
other random fake values.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
The file names are somewhat misleading as the code is not specific to
AMD K8 CPUs anymore. The files accommodate code for other AMD CPU
northbridges as well.
Same is true for the config option which is valid for AMD CPU
northbridges in general and not specific to K8.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
LKML-Reference: <20100917160343.GD4958@loge.amd.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
So far we only provide num_k8_northbridges. This is required in
different areas (e.g. L3 cache index disable, GART). But not all AMD
CPUs provide a GART. Thus it is useful to split off the GART handling
from the generic caching of AMD northbridge misc devices.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
LKML-Reference: <20100917160254.GC4958@loge.amd.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This is basically a cleanup patch, improving the comments for each
function.
While here, do a few cleanups.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
This change should introduce no functional change. It just rearranges
the contents of the C file in order to make it easier to understand and
maintain.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Instead of dynamically allocating a buffer for it where needed,
just allocate it once. As we'll use the same buffer also during
fatal and non-fatal errors, it is very risky to dynamically allocate
it during an error.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
As the error logic in this driver came from the i5400 driver, it
was using one function to get errors, and another to display them.
Let's make it simpler and avoid doing it in two steps.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
There's no mention in the datasheet about how to enable global error
reporting. So, I'm assuming that those errors are always enabled.
Maybe I'm plain wrong about that ;)
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
When the Overflow MCi_STATUS bit is set, EDAC reports the lost error
with a "no information available" message which often puzzles users
parsing the dmesg. This doesn't make much sense since this error has
been lost anyway so no need for reporting it separately. Thus, report
the overflow bit setting in the MCE dump instead. While at it, remove
reporting of MiscV and ErrorEnable (en) which are superfluous.
Now it looks like this:
[ 1501.650024] MC4_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
[ 1501.666887] Northbridge Error, node 2
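A minimal sketch of how the summary line above could be derived from the
status register, assuming the architectural MCi_STATUS layout (bit 62
overflow, bit 61 uncorrected, bit 57 processor context corrupt). The
function name and buffer handling are hypothetical and not the driver's
actual code:

```c
#include <stdint.h>
#include <stdio.h>

#define MCI_STATUS_OVER (1ULL << 62) /* Overflow: an earlier error was lost */
#define MCI_STATUS_UC   (1ULL << 61) /* Uncorrected error */
#define MCI_STATUS_PCC  (1ULL << 57) /* Processor context corrupt */

/* Format the summary part of the dump into buf; returns snprintf()'s result. */
int format_mc_status(char *buf, size_t len, uint64_t status)
{
	return snprintf(buf, len,
			"%s error, other errors lost: %s, CPU context corrupt: %s",
			(status & MCI_STATUS_UC) ? "Uncorrected" : "Corrected",
			(status & MCI_STATUS_OVER) ? "yes" : "no",
			(status & MCI_STATUS_PCC) ? "yes" : "no");
}
```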
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
-EIO is not the only error code that pci_enable_device() may return, and
the set of errors may be extended in the future. We should compare the
return code with zero, not with a concrete error value.
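A minimal sketch of the pattern, using a stand-in for pci_enable_device()
so that it builds outside the kernel. fake_enable_device() and
probe_device() are hypothetical names:

```c
#include <errno.h>

/*
 * Stand-in for pci_enable_device(): returns 0 on success or a negative
 * errno value. The caller chooses the result here for illustration.
 */
static int fake_enable_device(int result)
{
	return result;
}

/* Returns 0 on success, or propagates whatever error was reported. */
int probe_device(int enable_result)
{
	int rc = fake_enable_device(enable_result);

	if (rc)		/* not "rc == -EIO": any non-zero value is a failure */
		return rc;

	/* ... continue with device setup ... */
	return 0;
}
```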
Signed-off-by: Kulikov Vasiliy <segooon@gmail.com>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Cc: Jeff Roberson <jroberson@jroberson.net>
Cc: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In 5753c082f6 ("powerpc/85xx: Kconfig
cleanup") menuconfig MPC85xx was replaced by FSL_SOC_BOOKE but some
references inside the code were not adjusted accordingly. This patch
addresses these missing pieces.
Signed-off-by: Christoph Egger <siccegge@cs.fau.de>
Cc: Doug Thompson <dougthompson@xmission.com>
Cc: Peter Tyser <ptyser@xes-inc.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Scott Wood <scottwood@freescale.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
of_device is just an alias for platform_device, so remove it entirely. Also
replace to_of_device() with to_platform_device() and update comment blocks.
This patch was initially generated from the following semantic patch, and then
edited by hand to pick up the bits that coccinelle didn't catch.
@@
@@
-struct of_device
+struct platform_device
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Reviewed-by: David S. Miller <davem@davemloft.net>
EDAC MC3: CE page 0xc32281, offset 0x8a0, grain 0, syndrome 0x1, row 2, channel 1, label "": amd64_edac
EDAC MC3: CE - no information available: amd64_edacError Overflow
Add the missing space before "Error Overflow" on the second line.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Fortify the interface to not accept negative values, remove
memctrl_int_store() as a result. Also, sanitize bandwidth setting by
making the argument a simple u32 instead of strange u32 pointer being
passed around for no obvious reason. Then, fix error handling and teach
it to return proper error values. Finally, make code more readable,
simplify debug messages.
Cc: Mauro Carvalho Chehab <mchehab@redhat.com>
Cc: Arthur Jones <ajones@riverbed.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Acked-by: Doug Thompson <dougthompson@xmission.com>
The correct check is to verify whether in high range we're below 4GB
and not to extract the DctSelBaseAddr again. See "2.8.5 Routing DRAM
Requests" in the F10h BKDG.
Cc: <stable@kernel.org> # .32.x .33.x .34.x
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Acked-by: Doug Thompson <dougthompson@xmission.com>
Switch to reusing the mcheck core's machine check polling mechanism
instead of duplicating functionality by using the EDAC polling routine.
Correct formatting while at it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Acked-by: Doug Thompson <dougthompson@xmission.com>
All F2x110-related bit defines are used at only one place so replace
them with simple BIT() macros.
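A minimal sketch of testing a register bit with BIT() instead of a named
single-bit define. The macro mirrors the common kernel definition;
reg_bit_set() is a hypothetical helper, and no actual F2x110 bit
positions are implied:

```c
#include <stdint.h>

#define BIT(nr) (1UL << (nr))

/* Test a single bit of a 32-bit register value without a named define. */
int reg_bit_set(uint32_t reg, unsigned int nr)
{
	return (reg & BIT(nr)) != 0;
}
```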
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Acked-by: Doug Thompson <dougthompson@xmission.com>
This option differs from EDAC_DEBUG only by printing the file and
line of where the debug statement is placed, which contains unneeded
information. So remove it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Acked-by: Doug Thompson <dougthompson@xmission.com>
Remove the two syndrome extraction macros and add a single function
which does the same thing but with proper typechecking. While at it,
make sure to cache ECC syndrome size and dump it in debug output.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
The MPC85xx EDAC driver is missing module device aliases, so the driver
won't load automatically on boot. This patch fixes the issue by adding
proper MODULE_DEVICE_TABLE() macros.
Signed-off-by: Anton Vorontsov <avorontsov@mvista.com>
Cc: Doug Thompson <dougthompson@xmission.com>
Cc: Peter Tyser <ptyser@xes-inc.com>
Cc: Dave Jiang <djiang@mvista.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't print failure to detect Core i7 EDAC facilities to the console at
boot time, most often occurring on Core i7 desktops and laptops.
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Simply add a proper ID into the device table.
Signed-off-by: Anton Vorontsov <avorontsov@mvista.com>
Cc: Doug Thompson <dougthompson@xmission.com>
Cc: Peter Tyser <ptyser@xes-inc.com>
Cc: Dave Jiang <djiang@mvista.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 5753c082f6 ("powerpc/85xx:
Kconfig cleanup"), there is no MPC85xx Kconfig symbol anymore, so the
driver became non-selectable.
This patch fixes the issue by switching to PPC_85xx symbol.
Signed-off-by: Anton Vorontsov <avorontsov@mvista.com>
Cc: Doug Thompson <dougthompson@xmission.com>
Cc: Peter Tyser <ptyser@xes-inc.com>
Cc: Dave Jiang <djiang@mvista.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/i7core:
MAINTAINERS: Add an entry for i7core_edac
i7core_edac: Avoid doing multiple probes for the same card
i7core_edac: Properly discover the first QPI device
As Nehalem/Nehalem-EP/Westmere systems use several devices for the same
functionality (memory controller), the default way of probing devices doesn't
work. So, instead of a per-device probe, all devices should be probed at once.
This means that we should block any new probe attempt; otherwise, it will
try to register the same device several times.
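A minimal sketch of such a guard, assuming a single flag is enough. The
names are hypothetical, and a real driver would also need locking around
the flag:

```c
#include <errno.h>

static int probed;	/* set once the first probe has registered all devices */

/* Hypothetical probe entry point: succeeds once, rejects later attempts. */
int probe_all_devices(void)
{
	if (probed)
		return -EINVAL;

	/* ... find and register every memory controller device here ... */
	probed = 1;
	return 0;
}
```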
Acked-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
On Nehalem/Nehalem-EP/Westmere, the first QPI device is the last PCI bus.
The last bus is generally at 0x3f or 0xff, but there are also other systems
using different setups. For example, HP Z800 has 0x7f as the last bus.
This patch adds logic to discover the last bus, dynamically detecting it
at runtime.
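A minimal sketch of the idea, assuming the set of existing bus numbers is
available as a plain array. The array and find_last_bus() are
hypothetical and stand in for whatever enumeration the PCI core provides:

```c
#include <stddef.h>

/* Return the highest bus number in the list, or -1 if the list is empty. */
int find_last_bus(const unsigned char *buses, size_t count)
{
	int last = -1;
	size_t i;

	for (i = 0; i < count; i++)
		if (buses[i] > last)
			last = buses[i];

	return last;
}
```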
Acked-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
When calculating the DCT channel from the syndrome we need to know the
syndrome type (x4 vs x8). On F10h, this is read out from extended PCI
cfg space register F3x180 while on K8 we only support x4 syndromes and
don't have extended PCI config space anyway.
Make the code accessing F3x180 F10h only and fall back to x4 syndromes
on everything else.
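A minimal sketch of the fallback. The bit position used for the symbol
size in F3x180 is an assumption here, and syndrome_type() is a
hypothetical helper. On K8 the register would not be read at all; in this
sketch its value is simply ignored:

```c
#include <stdint.h>

#define ECC_SYMBOL_SIZE_BIT (1U << 25)	/* assumed position in F3x180 */

/*
 * Return 4 or 8, the syndrome type to use when decoding the channel.
 * "family" is the CPU family (0xf for K8, 0x10 for F10h).
 */
unsigned int syndrome_type(unsigned int family, uint32_t f3x180)
{
	if (family == 0x10 && (f3x180 & ECC_SYMBOL_SIZE_BIT))
		return 8;

	return 4;	/* K8 and everything else: x4 only */
}
```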
Cc: <stable@kernel.org> # .33.x .34.x
Reported-by: Jeffrey Merkey <jeffmerkey@gmail.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
* 'linux_next' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/i7core: (83 commits)
i7core_edac: Better describe the supported devices
Add support for Westmere to i7core_edac driver
i7core_edac: don't free on success
i7core_edac: Add support for X5670
Always call i7core_[ur]dimm_check_mc_ecc_err
i7core_edac: fix memory leak of i7core_dev
EDAC: add __init to i7core_xeon_pci_fixup
i7core_edac: Fix wrong device id for channel 1 devices
i7core: add support for Lynnfield alternate address
i7core_edac: Add initial support for Lynnfield
i7core_edac: do not export static functions
edac: fix i7core build
edac: i7core_edac produces undefined behaviour on 32bit
i7core_edac: Use a more generic approach for probing PCI devices
i7core_edac: PCI device is called NONCORE, instead of NOCORE
i7core_edac: Fix ringbuffer maxsize
i7core_edac: First store, then increment
i7core_edac: Better parse "any" addrmask
i7core_edac: Use a lockless ringbuffer
edac: Create an unique instance for each kobj
...
Fixes build errors in EDAC drivers caused by the OF
device_node pointer being moved into struct device
Signed-off-by: Anatolij Gustschin <agust@denx.de>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Merging in current state of Linus' tree to deal with merge conflicts and
build failures in vio.c after merge.
Conflicts:
drivers/i2c/busses/i2c-cpm.c
drivers/i2c/busses/i2c-mpc.c
drivers/net/gianfar.c
Also fixed up one line in arch/powerpc/kernel/vio.c to use the
correct node pointer.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
.name, .match_table and .owner are duplicated in both of_platform_driver
and device_driver. This patch removes the extra copies from struct
of_platform_driver and converts all users to the device_driver members.
This patch is a pretty mechanical change. The usage model doesn't change
and if any drivers have been missed, or if anything has been fixed up
incorrectly, then it will fail with a compile time error, and the fixup
will be trivial. This patch looks big and scary because it touches so
many files, but it should be pretty safe.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Sean MacLennan <smaclennan@pikatech.com>
This adds new PCI IDs for the Westmere's memory controller
devices and modifies the i7core_edac driver to be able to
probe both Nehalem and Westmere processors.
Signed-off-by: Vernon Mauery <vernux@us.ibm.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
This fixes all occurrences of pci_enable_device and pci_disable_device
in all comments. There are no code changes involved.
Signed-off-by: Roman Fietze <roman.fietze@telemotive.de>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
This fixes an error in function i7core_check_error.
In commit ca9c90ba09, which converts the
driver to use double buffering, there is a change in the logic. Before,
if mce_count was zero, it skipped over a couple of statements and
finished out with a call to the *check_mc_ecc_err function. The current
code checks to see if mce_count is 0 and then exits.
This change reverts the behavior back to the original where if there are
no errors to report, we skip to the end and call the *check_mc_ecc_err
function.
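A minimal sketch of the restored control flow. The counters and function
bodies are stand-ins for illustration and not the driver's code:

```c
int ecc_checks;	/* counts calls to the stand-in ECC check */
int reported;	/* counts MCE entries reported */

static void check_mc_ecc_err(void)
{
	ecc_checks++;
}

void check_error(int mce_count)
{
	int i;

	if (!mce_count)
		goto check_ce_error;	/* nothing queued, but still poll ECC */

	for (i = 0; i < mce_count; i++)
		reported++;	/* stands in for reporting one MCE entry */

check_ce_error:
	check_mc_ecc_err();
}
```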
This fix allows the driver to work again on my Nehalem-based blades.
Signed-off-by: Vernon Mauery <vernux@us.ibm.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
It's called only from an __init function and is the only user
of pcibios_scan_specific_bus which will be marked as __devinit in
the next patch.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Fix build warning (missing header file) and
build error when CONFIG_SMP=n.
drivers/edac/i7core_edac.c:860: error: implicit declaration of function 'msleep'
drivers/edac/i7core_edac.c:1700: error: 'struct cpuinfo_x86' has no member named 'phys_proc_id'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Currently, only one PCI set of tables is allowed. This prevents using
the driver for other devices like Lynnfield, which have a different
set of PCI IDs.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Fix ringbuffer store logic.
While here, add a few comments to the code and remove the undesired
printk that could otherwise be called during NMI time.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Current code only works when there's just one memory
controller, since we need one kobj for each instance.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Instead of displaying 3 values in the same variable, break it into 3
different sysfs nodes:
/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0
/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1
/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm2
For registered dimms, however, the error counters are already being
displayed at:
/sys/devices/system/edac/mc/mc0/csrow*/ce_count
So, there's no need to add any extra sysfs nodes.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Currently, all sysfs nodes are stored at /sys/.*/mc. (regex)
However, sometimes it is needed to create attribute groups.
This patch extends edac_core to allow groups creation.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
The old module removal strategy didn't work on devices with multiple
cores, since only one PCI device is used to open all mc's, due to
the nature of Nehalem.
Also, it was based on the pdev value. However, this doesn't point to the
pci device used at mci->dev.
So, instead, it unregisters all devices at once, deleting them from the
device list.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Instead of creating just one memory controller, create one per socket
(e.g. per QuickPath Interconnect).
This better reflects the Nehalem architecture.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Instead of using a static table assuming always 2 CPU sockets, allocate
space dynamically for Nehalem PCI devs.
This patch is part of a series of patches that changes i7core_edac to
allow more than 2 sockets and to properly report one memory controller
per socket.
On the Xeon 55XX series CPUs the PCI devices are not exposed via ACPI so
we must explicitly probe them to make them usable as Linux PCI devices.
This moves the detection of this state to before pci_register_driver is
called. Its present position was not working on my systems, the driver
would complain about not finding a specific device.
This patch allows the driver to load on my systems.
Signed-off-by: Keith Mannthey <kmannth@us.ibm.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Instead of assuming that the entire machine has either registered or
unregistered memories, decide this per CPU socket.
While here, fix a bug in i7core_mce_output_error(), where we're
using m->cpu directly as if it represented a socket. Instead, the
proper socket_id is given by cpu_data[m->cpu].phys_proc_id.
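A minimal sketch of the lookup, with a hypothetical table standing in for
the kernel's cpu_data[] array:

```c
struct cpu_info {
	int phys_proc_id;	/* physical package (socket) the CPU belongs to */
};

/* Map a logical CPU number to its socket; returns -1 if out of range. */
int cpu_to_socket(const struct cpu_info *cpus, int ncpus, int cpu)
{
	if (cpu < 0 || cpu >= ncpus)
		return -1;

	return cpus[cpu].phys_proc_id;
}
```

With four logical CPUs spread over two sockets, CPU 2 maps to socket 1
rather than to a non-existent "socket 2".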
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
---
Nehalem and later chipsets provide a special device that has counters for
corrected memory errors detected on registered DIMMs. This device is only
seen if there are registered memories plugged.
After this patch, on a machine fully equipped with RDIMMs, it will use
Device 3 function 2 to count corrected errors instead of relying on mcelog.
For unregistered DIMMs, it will keep the old behavior, counting errors
via mcelog.
This patch was developed together with Keith Mannthey <kmannth@us.ibm.com>
Signed-off-by: Keith Mannthey <kmannth@us.ibm.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
From: Keith Mannthey <kmannth@us.ibm.com>
Simple correction to a shift value.
ECC_ENABLED is bit 4 of MC_STATUS, Dev 3 Fun 0 Offset 0x4c
This correctly identifies the state of the ECC at the machine.
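A minimal sketch of the corrected test, assuming MC_STATUS has already
been read into a 32-bit value. The macro and function names are
hypothetical:

```c
#include <stdint.h>

#define MC_STATUS_ECC_ENABLED_SHIFT 4	/* ECC_ENABLED is bit 4 of MC_STATUS */

/* Return 1 if the ECC_ENABLED bit is set in the register value, else 0. */
int ecc_enabled(uint32_t mc_status)
{
	return (mc_status >> MC_STATUS_ECC_ENABLED_SHIFT) & 1;
}
```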
Signed-off-by: Keith Mannthey <kmannth@us.ibm.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
There were two stupid error injection bugs introduced by wrong
cut-and-paste: one at socket store, and another at the error inject
register. The last one was causing the code to not work at all.
While here, add debug messages to allow seeing what registers are being
set while sending error injection.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
i7core_get_devices() was prepared to get just the first found device of each type.
Due to that, on Xeon 55xx, only socket 1 was retrieved.
Rework i7core_get_devices() to clean it and to properly support Xeon 55xx.
While here, fix a small typo.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Xeon55xx fails to probe with this error message:
EDAC DEBUG: in drivers/edac/i7core_edac.c, line at 1660: MC: drivers/edac/i7core_edac.c: i7core_init()
EDAC i7core: Device not found: dev 00:00.0 PCI ID 8086:2c41
i7core_edac: probe of 0000:00:14.0 failed with error -22
This is due to the fact that, on Xeon35xx (and i7core), device 00.0 has
PCI ID 8086:2c40.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
m->bank is not related to the memory bank but, instead, to the MCA Error
register bank. Fix it accordingly. While here, improve the comments for
Nehalem bank.
A later fix is needed, in order to get bank/rank information from MCA
error log.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Enrich the mcelog error by using the encoded information in the MCE status and
misc registers (IA32_MCx_STATUS, IA32_MCx_MISC).
Some fixes are still needed here, in order to properly fill the EDAC
fields.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>