(with Martin Schwidefsky <schwidefsky@de.ibm.com>)
The pgd/pud/pmd/pte page table allocation functions get a mm_struct pointer as
first argument. The free functions do not get the mm_struct argument. This
is 1) asymmetrical and 2) to do mm related page table allocations the mm
argument is needed on the free function as well.
[kamalesh@linux.vnet.ibm.com: i386 fix]
[akpm@linux-foundation.org: coding-syle fixes]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use set_pte() for setting up the 2MB pages in the direct mapping.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
pud and pmd entries in the RAM area might be marked as non present.
Do not try to modify them in the selftest.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move the readout of the large entry into the spinlock section to
prevent an unlikely but possible race.
Mark the pmd/pud entry present after the split. We preserved the
non present bit in the new split mapping.
Remove the stale gfp_flags double initialization.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
lookup_address() returns a wrong level and a wrong pointer to a non
existing pte, when pmd or pud entries are marked !present. This
happens for example due to boot time mapping of GART into the low
memory space.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
An Athlon 64 X2 test system showed hard hangs shortly after marking
the kernel text read-only, if we tried to preserve largepages and
changed the PSE entry from RW to RO. The pagetable code itself is
correct, it's the CPU that locked up hard (and not even the NMI
watchdog could punch through that hard hang).
So be conservative and always do splitups - like we did in the past.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When CPA is called on a range which fits into a large page mapping,
avoid to split the page when:
1) There is no change of attributes
2) The range to change is a complete large mapping
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The number of arguments which need to be transported is increasing
and we want to add flush optimizations and large page preserving.
Create struct cpa data and pass a pointer instead of increasing the
number of arguments further.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We only need to flush the caches in cpa() if the the caching attributes
have changed. Otherwise only flush the TLBs.
This checks the PAT bits too although they are currently not used by
the kernel.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Mask out the not supported bits (e.g. NX). If the clr/set masks
are empty after the mask return without changing anything.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When an ioremap is unmapped, do not change the page attributes. There might
be another mapping of the same physical address. PAT might detect a conflicting
mapping attribute for no good reason. The mapping is removed anyway.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that cpa works on non-direct mappings as well, we can safely
remove the range check in ioremap_change_attr().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove tons of castings which make the code hard to read.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When splitting large pages, we ge the pfn from the existing entry
instead of calculating it ourself.
This removes the last remaining range restriction of the cpa code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When changing the attributes of a pte, we should use the PFN from the
existing PTE rather than going through hoops calculating what we think
it might have been; this is both fragile and totally unneeded. It also
makes it more hairy to call any of these functions on non-direct maps
for no good reason whatsover.
With this change, __change_page_attr() no longer takes a pfn as argument,
which simplifies all the callers.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@tglx.de>
Right now, enforcing that the high mapping of the kernel text doesn't
get the NX bit is done deep in the guts of CPA, rather than in the
static_protection() function that enforces all other per-arch sanity
checks.
This patch moves this sanity check into the central static_protection()
function instead, and makes it apply ONLY to the kernel text, not to all
other areas in the high mapping.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Revert "defer cr3 reload when doing pud_clear()" since I'm going to
replace it.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The constructors for PAE and non-PAE pgd_ctors are more or less
identical, and can be made into the same function.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: William Irwin <wli@holomorphy.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use the _ASM_EXTABLE macro from <asm/asm.h>, instead of open-coding
__ex_table entires in arch/x86/mm/init_32.c.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
print out node_data addr and bootmap_start addr.
helpful for debugging early crashes on high-end NUMA systems.
Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In split_large_page we clear the NX bit for the new split ptes, but we
need to preserve the original setting of it for the split ptes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Kevin Winchester reported the loss of direct rendering, due to:
[ 0.588184] agpgart: Detected AGP bridge 0
[ 0.588184] agpgart: unable to get memory for graphics translation table.
[ 0.588184] agpgart: agp_backend_initialize() failed.
[ 0.588207] agpgart-amd64: probe of 0000:00:00.0 failed with error -12
and bisected it down to:
commit 266b9f8727
Author: Thomas Gleixner <tglx@linutronix.de>
Date: Wed Jan 30 13:34:06 2008 +0100
x86: fix ioremap RAM check
this check was too strict and caused an ioremap() failure.
the problem is due to the somewhat unclean way of how the GART code
reserves a memory range for its aperture, and how it utilizes it
later on.
Allow RAM pages to be ioremap()-ed too, as long as they are reserved.
Bisected-by: Kevin Winchester <kjwinchester@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Kevin Winchester <kjwinchester@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
swsusp_pg_dir[] is used for suspend, but not for hibernation.
clean-up the ifdefs which worked by accident, while implying the opposite.
Delete the __nosavedata, which also implied the opposite.
Some day we may optimize CONFIG_ACPI_SLEEP to build minimal kernels
for just hibernate or just suspend but not both,
but today isn't that day.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
This patch fixes a bug of early_ioremap_reset(), which had been fixed
before by "convert the boot time page table to the kernels native
format" patch. But that patch has been reverted now.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch replaces __change_page_attr_set_clr() with
change_page_attr_set_clr() in change_page_attr_clear() to flush the
TLB/cache properly.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
memnode.map is s16 array because of nodeid is 16 bit now.
so need to increase the nodemap_size according to that bits.
Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
one early crash on one 8 node 256g machine:
Command line: console=uart8250,io,0x3f8,115200n8 initrd=kernel.org/mydisk11_x86_64.gz rw root=/dev/ram0 debug initcall_debug apic=debug acpi.debug_level=0x0000000f pci=routeirq ip=dhcp load_ramdisk=1 ramdisk_size=131072 BOOT_IMAGE=kernel.org/bzImage_2.6.25_k8.1
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009bc00 (usable)
BIOS-e820: 000000000009bc00 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e6000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 00000000dffe0000 (usable)
BIOS-e820: 00000000dffe0000 - 00000000dffee000 (ACPI data)
BIOS-e820: 00000000dffee000 - 00000000dffff050 (ACPI NVS)
BIOS-e820: 00000000dffff050 - 00000000e0000000 (reserved)
BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
BIOS-e820: 00000000ff700000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000004020000000 (usable)
Early serial console at I/O port 0x3f8 (options '115200n8')
console [uart0] enabled
end_pfn_map = 67239936
Kernel panic - not syncing: Duplicated early reservation d40000-e42000
Pid: 0, comm: swapper Not tainted 2.6.24-smp-g5a514e21-dirty #3
Call Trace:
[<ffffffff80221545>] lapic_get_maxlvt+0x0/0x10
[<ffffffff80221657>] clear_local_APIC+0x5/0xcf
[<ffffffff80221726>] disable_local_APIC+0x5/0x17
[<ffffffff8021fe16>] smp_send_stop+0x46/0x4c
[<ffffffff80235293>] panic+0x94/0x13e
[<ffffffff80bc3b03>] sctp_eps_proc_init+0x12/0x34
[<ffffffff80b9f1c5>] reserve_early+0x30/0x6c
[<ffffffff80803925>] init_memory_mapping+0x2cd/0x2dc
[<ffffffff80b9dc01>] setup_arch+0x21f/0x44e
[<ffffffff80b978be>] start_kernel+0x6f/0x2c7
[<ffffffff80b971cc>] _sinittext+0x1cc/0x1d3
it turns out there is overlap between pgtable and bss...
in System.map we have
ffffffff80d40420 b rsi_table
ffffffff80d40620 B krb5_seq_lock
ffffffff80d40628 b i.20437
ffffffff80d40630 b xprt_rdma_inline_write_padding
ffffffff80d40638 b sunrpc_table_header
ffffffff80d40640 b zero
ffffffff80d40644 b min_memreg
ffffffff80d40648 b rpcrdma_tk_lock_g
ffffffff80d40650 B sctp_assocs_id_lock
ffffffff80d40658 B proc_net_sctp
ffffffff80d40660 B sctp_assocs_id
ffffffff80d40680 B sysctl_sctp_mem
ffffffff80d40690 B sysctl_sctp_rmem
ffffffff80d406a0 B sysctl_sctp_wmem
ffffffff80d406b0 b sctp_ctl_socket
ffffffff80d406b8 b sctp_pf_inet6_specific
ffffffff80d406c0 b sctp_pf_inet_specific
ffffffff80d406c8 b sctp_af_v4_specific
ffffffff80d406d0 b sctp_af_v6_specific
ffffffff80d406d8 b sctp_rand.33270
ffffffff80d406dc b sctp_memory_pressure
ffffffff80d406e0 b sctp_sockets_allocated
ffffffff80d406e4 b sctp_memory_allocated
ffffffff80d406e8 b sctp_sysctl_header
ffffffff80d406f0 b zero
ffffffff80d406f4 A __bss_stop
ffffffff80d406f4 A _end
need to round up table_start to PAGE_SIZE.
also make the panic more informative.
Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This just adds the PCI IDs of AMD's family 10h and 11h CPU's northbridges to
k8topology discovery.
Signed-off-by: Joachim Deguara <joachim.deguara@amd.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Put appropriate pagetable update hooks in so that paravirt knows
what's going on in there.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use a standard list threaded through page->lru for maintaining the pgd
list on PAE. This is the same as 64-bit, and seems saner than using a
non-standard list via page->index.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In x86 PAE mode, stop treating pmds as a special case. Previously
they were always allocated and freed with the pgd. The modifies the
code to be the same as 64-bit mode, where they are allocated on
demand.
This is a step on the way to unifying 32/64-bit pagetable allocation
as much as possible.
There is a complicating wart, however. When you install a new
reference to a pmd in the pgd, the processor isn't guaranteed to see
it unless you reload cr3. Since reloading cr3 also has the
side-effect of flushing the tlb, this is an expense that we want to
avoid whereever possible.
This patch simply avoids reloading cr3 unless the update is to the
current pagetable. Later patches will optimise this further.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: William Irwin <wli@holomorphy.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The change from current to tsk in do_page_fault is safe as
this is set at the very beginning of the function.
Removes a likely() annotation from the 64-bit version, this
could have instead been added to 32-bit.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>