Commit graph

21012 commits

Author SHA1 Message Date
Alexei Starovoitov
1a6877b9c0 lib: introduce strncpy_from_unsafe()
generalize FETCH_FUNC_NAME(memory, string) into
strncpy_from_unsafe() and fix sparse warnings that were
present in original implementation.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-28 16:27:27 -07:00
David S. Miller
0d36938bb8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-08-27 21:45:31 -07:00
Christoph Hellwig
41e94a8513 add devm_memremap_pages
This behaves like devm_memremap except that it ensures we have page
structures available that can back the region.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[djbw: catch attempts to remap RAM, drop flags]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-27 19:40:58 -04:00
Wang Nan
a2fb3382ed tracing/uprobes: Do not print '0x (null)' when offset is 0
When manually added uprobe point with zero address, 'uprobe_events'
output '(null)' instead of 0x00000000:

  # echo p:probe_libc/abs_0 /path/to/lib.bin:0x0 arg1=%ax > \
            /sys/kernel/debug/tracing/uprobe_events

  # cat /sys/kernel/debug/tracing/uprobe_events
    p:probe_libc/abs_0 /path/to/lib.bin:0x          (null) arg1=%ax

 This patch fixes this behavior:

  # cat /sys/kernel/debug/tracing/uprobe_events
  p:probe_libc/abs_0 /path/to/lib.bin:0x0000000000000000

Signed-off-by: Wang Nan <wangnan0@huawei.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
Link: http://lkml.kernel.org/r/1440586666-235233-8-git-send-email-wangnan0@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-08-26 10:43:01 -03:00
Tejun Heo
20f1f4b5ff Merge branch 'for-4.3-unified-base' into for-4.3 2015-08-25 14:19:29 -04:00
Aleksa Sarai
ce52399520 cgroup: pids: fix invalid get/put usage
Fix incorrect usage of css_get and css_put to put a different css in
pids_{cancel_,}attach() than the one grabbed in pids_can_attach(). This
could lead to quite serious memory leakage (and unsafe operations on the
putted css).

tj: minor comment update

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2015-08-25 14:19:25 -04:00
Jan H. Schönherr
dd9d384375 sched: Fix cpu_active_mask/cpu_online_mask race
There is a race condition in SMP bootup code, which may result
in

    WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418
    workqueue_cpu_up_callback()
or
    kernel BUG at kernel/smpboot.c:135!

It can be triggered with a bit of luck in Linux guests running
on busy hosts.

	CPU0                        CPUn
	====                        ====

	_cpu_up()
	  __cpu_up()
				    start_secondary()
				      set_cpu_online()
					cpumask_set_cpu(cpu,
						   to_cpumask(cpu_online_bits));
	  cpu_notify(CPU_ONLINE)
	    <do stuff, see below>
					cpumask_set_cpu(cpu,
						   to_cpumask(cpu_active_bits));

During the various CPU_ONLINE callbacks CPUn is online but not
active. Several things can go wrong at that point, depending on
the scheduling of tasks on CPU0.

Variant 1:

  cpu_notify(CPU_ONLINE)
    workqueue_cpu_up_callback()
      rebind_workers()
        set_cpus_allowed_ptr()

  This call fails because it requires an active CPU; rebind_workers()
  ends with a warning:

    WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418
    workqueue_cpu_up_callback()

Variant 2:

  cpu_notify(CPU_ONLINE)
    smpboot_thread_call()
      smpboot_unpark_threads()
       ..
        __kthread_unpark()
          __kthread_bind()
          wake_up_state()
           ..
            select_task_rq()
              select_fallback_rq()

  The ->wake_cpu of the unparked thread is not allowed, making a call
  to select_fallback_rq() necessary. Then, select_fallback_rq() cannot
  find an allowed, active CPU and promptly resets the allowed CPUs, so
  that the task in question ends up on CPU0.

  When those unparked tasks are eventually executed, they run
  immediately into a BUG:

    kernel BUG at kernel/smpboot.c:135!

Just changing the order in which the online/active bits are set
(and adding some memory barriers), would solve the two issues
above. However, it would change the order of operations back to
the one before commit 6acbfb9697 ("sched: Fix hotplug vs.
set_cpus_allowed_ptr()"), thus, reintroducing that particular
problem.

Going further back into history, we have at least the following
commits touching this topic:
- commit 2baab4e904 ("sched: Fix select_fallback_rq() vs cpu_active/cpu_online")
- commit 5fbd036b55 ("sched: Cleanup cpu_active madness")

Together, these give us the following non-working solutions:

  - secondary CPU sets active before online, because active is assumed to
    be a subset of online;

  - secondary CPU sets online before active, because the primary CPU
    assumes that an online CPU is also active;

  - secondary CPU sets online and waits for primary CPU to set active,
    because it might deadlock.

Commit 875ebe940d ("powerpc/smp: Wait until secondaries are
active & online") introduces an arch-specific solution to this
arch-independent problem.

Now, go for a more general solution without explicit waiting and
simply set active twice: once on the secondary CPU after online
was set and once on the primary CPU after online was seen.

set_cpus_allowed_ptr()")

Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@vger.kernel.org>
Cc: Anton Blanchard <anton@samba.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Wilson <msw@amazon.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 6acbfb9697 ("sched: Fix hotplug vs. set_cpus_allowed_ptr()")
Link: http://lkml.kernel.org/r/1439408156-18840-1-git-send-email-jschoenh@amazon.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-25 12:14:39 +02:00
Ingo Molnar
f612a7b1a7 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU cleanup from Paul E. McKenney:

 "Privatize smp_mb__after_unlock_lock().  This commit moves the
  definition of smp_mb__after_unlock_lock() to kernel/rcu/tree.h,
  in recognition of the fact that RCU is the only thing using
  this, that nothing else is likely to use it, and that it is
  likely to go away completely."

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-25 09:44:49 +02:00
Linus Torvalds
84f3fe4608 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
 "A series of small fixlets for a regression visible on OMAP devices
  caused by the conversion of the OMAP interrupt chips to hierarchical
  interrupt domains.  Mostly one liners on the driver side plus a small
  helper function in the core to avoid open coded mess in the drivers"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/crossbar: Restore set_wake functionality
  irqchip/crossbar: Restore the mask on suspend behaviour
  ARM: OMAP: wakeupgen: Restore the irq_set_type() mechanism
  irqchip/crossbar: Restore the irq_set_type() mechanism
  genirq: Introduce irq_chip_set_type_parent() helper
  genirq: Don't return ENOSYS in irq_chip_retrigger_hierarchy
2015-08-22 07:45:36 -07:00
Linus Torvalds
f8a89fc05a Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "Two minimalistic fixes for 4.2 regressions:

   - Eric fixed a thinko in the timer_list base switching code caused by
     the overhaul of the timer wheel.  It can cause a cpu to see the
     wrong base for a timer while we move the timer around.

   - Guenter fixed a regression for IMX if booted w/o device tree, where
     the timer interrupt is not initialized and therefor the machine
     fails to boot"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource/imx: Fix boot with non-DT systems
  timer: Write timer->flags atomically
2015-08-22 07:37:41 -07:00
Guenter Roeck
85e1cd6e76 hrtimer: Handle failure of tick_init_highres() gracefully
Commit 75e3b37d05 ("hrtimer: Drop return code of hrtimer_switch_to_hres()")
drops the return code of hrtimer_switch_to_hres(). While doing so, it also
drops the return statement itself on failure. This may cause a system hang.
Seen when running arm:multi_v7_defconfig in qemu with devicetree file
vexpress-v2p-ca9.

Fixes: 75e3b37d05 ("hrtimer: Drop return code of hrtimer_switch_to_hres()")
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Link: http://lkml.kernel.org/r/1440231047-16256-1-git-send-email-linux@roeck-us.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-08-22 10:57:50 +02:00
David S. Miller
dc25b25897 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/usb/qmi_wwan.c

Overlapping additions of new device IDs to qmi_wwan.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-21 11:44:04 -07:00
Thomas Gleixner
05ddaa4d6d Merge branch 'fortglx/4.3/time' of https://git.linaro.org/people/john.stultz/linux into timers/core
- A handful or y2038 related items
- A walltime to monotonic limit
- Small fixes for timespec_trunc() and timer_list output
2015-08-20 21:13:22 +02:00
Ingo Molnar
40a2ea1bd9 Merge branch 'perf/urgent' into perf/core, to pick up fixes before adding more changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-20 11:48:56 +02:00
Grygorii Strashko
b7560de198 genirq: Introduce irq_chip_set_type_parent() helper
This helper is required for irq chips which do not implement a
irq_set_type callback and need to call down the irq domain hierarchy
for the actual trigger type change.

This helper is required to fix further wreckage caused by the
conversion of TI OMAP to hierarchical irq domains and therefor tagged
for stable.

[ tglx: Massaged changelog ]

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: <linux@arm.linux.org.uk>
Cc: <nsekhar@ti.com>
Cc: <jason@lakedaemon.net>
Cc: <balbi@ti.com>
Cc: <linux-arm-kernel@lists.infradead.org>
Cc: <tony@atomide.com>
Cc: <marc.zyngier@arm.com>
Cc: stable@vger.kernel.org # 4.1
Link: http://lkml.kernel.org/r/1439554830-19502-3-git-send-email-grygorii.strashko@ti.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-08-20 00:25:25 +02:00
Grygorii Strashko
6d4affea7d genirq: Don't return ENOSYS in irq_chip_retrigger_hierarchy
irq_chip_retrigger_hierarchy() returns -ENOSYS if it was not able to
find at least one .irq_retrigger() callback implemented in the IRQ
domain hierarchy.

That's wrong, because check_irq_resend() expects a 0 return value from
the callback in case that the hardware assisted resend was not
possible. If the return value is non zero the core code assumes
hardware resend success and the software resend is not invoked.

This results in lost interrupts on platforms where none of the parent
irq chips in the hierarchy implements the retrigger callback.

This is observable on TI OMAP, where the hierarchy is:

 ARM GIC <- OMAP wakeupgen <- TI Crossbar

Return 0 instead so the software resend mechanism gets invoked.

[ tglx: Massaged changelog ]

Fixes: 85f08c17de ('genirq: Introduce helper functions...')
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: <linux@arm.linux.org.uk>
Cc: <nsekhar@ti.com>
Cc: <jason@lakedaemon.net>
Cc: <balbi@ti.com>
Cc: <linux-arm-kernel@lists.infradead.org>
Cc: <tony@atomide.com>
Cc: stable@vger.kernel.org # 4.1
Link: http://lkml.kernel.org/r/1439554830-19502-2-git-send-email-grygorii.strashko@ti.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-08-20 00:25:25 +02:00
Tejun Heo
3e1d2eed39 cgroup: introduce cgroup_subsys->legacy_name
This allows cgroup subsystems to use a different name on the unified
hierarchy.  cgroup_subsys->name is used on the unified hierarchy,
->legacy_name elsewhere.  If ->legacy_name is not explicitly set, it's
automatically set to ->name and the userland visible behavior remains
unchanged.

v2: Make parse_cgroupfs_options() only consider ->legacy_name as mount
    options are used only on legacy hierarchies.  Suggested by Li
    Zefan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org
2015-08-18 13:58:16 -07:00
Tejun Heo
d98817d496 cgroup: don't print subsystems for the default hierarchy
It doesn't make sense to print subsystems on mount option or
/proc/PID/cgroup for the default hierarchy.

* cgroup.controllers file at the root of the default hierarchy lists
  the currently attached controllers.

* The default hierarchy is catch-all for unmounted subsystems.

* The default hierarchy doesn't accept any mount options.

Suppress subsystem printing on mount options and /proc/PID/cgroup for
the default hierarchy.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org
2015-08-18 13:58:16 -07:00
Frederic Weisbecker
b48362d8aa hrtimer: Unconfuse switch_hrtimer_base() a bit
The variable called "this_base" is confusing because its name suggests
it's of "struct hrtimer_clock_base" type, along with "base" and "new_base"
which doesn't help understanding this complicated function.

Make its name clearer and fix the misleading comment while at it.

[ tglx: Fixed the comment for real ]

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/1439907509-9553-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-08-18 18:36:59 +02:00
Frederic Weisbecker
662b3e1946 hrtimer: Simplify get_target_base() by returning current base
Instead of fetching again the current cpu base, just take it from the
parameter.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/1439907509-9553-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-08-18 18:36:59 +02:00
Eric Dumazet
d0023a1448 timer: Write timer->flags atomically
lock_timer_base() cannot prevent the following :

CPU1 ( in __mod_timer()
timer->flags |= TIMER_MIGRATING;
spin_unlock(&base->lock);
base = new_base;
spin_lock(&base->lock);
// The next line clears TIMER_MIGRATING
timer->flags &= ~TIMER_BASEMASK;
                                  CPU2 (in lock_timer_base())
                                  see timer base is cpu0 base
                                  spin_lock_irqsave(&base->lock, *flags);
                                  if (timer->flags == tf)
                                       return base; // oops, wrong base
timer->flags |= base->cpu // too late

We must write timer->flags in one go, otherwise we can fool other cpus.

Fixes: bc7a34b8b9 ("timer: Reduce timer migration overhead if disabled")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jon Christopherson <jon@jons.org>
Cc: David Miller <davem@davemloft.net>
Cc: xen-devel@lists.xen.org
Cc: david.vrabel@citrix.com
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Link: http://lkml.kernel.org/r/1439831928.32680.11.camel@edumazet-glaptop2.roam.corp.google.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
2015-08-18 15:31:16 +02:00
Ingo Molnar
a5dd192496 Merge branch 'x86/urgent' into x86/asm to fix up conflicts and to pick up fixes
Conflicts:
	arch/x86/entry/entry_64_compat.S
	arch/x86/math-emu/get_address.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-18 09:39:47 +02:00
Linus Torvalds
e9ab22d292 Merge branch 'for-4.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
 "A fix for a subtle bug introduced back during 3.17 cycle which
  interferes with setting configurations under specific conditions"

* 'for-4.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cpuset: use trialcs->mems_allowed as a temp variable
2015-08-17 16:15:26 -07:00
Luiz Capitulino
75e3b37d05 hrtimer: Drop return code of hrtimer_switch_to_hres()
It's not checked by the caller.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Link: http://lkml.kernel.org/r/20150811164043.538241ef@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-08-17 23:19:03 +02:00
Baolin Wang
9ca3085060 time: Introduce timespec64_to_jiffies()/jiffies_to_timespec64()
The conversion between struct timespec and jiffies is not year 2038
safe on 32bit systems. Introduce timespec64_to_jiffies() and
jiffies_to_timespec64() functions which use struct timespec64 to
make it ready for 2038 issue.

Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Baolin Wang <baolin.wang@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2015-08-17 11:25:41 -07:00
Baolin Wang
8758a240e2 time: Introduce current_kernel_time64()
The current_kernel_time() is not year 2038 safe on 32bit systems
since it returns a timespec value. Introduce current_kernel_time64()
which returns a timespec64 value.

Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Baolin Wang <baolin.wang@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2015-08-17 11:25:35 -07:00
Xunlei Pang
7494e9eede time: Add the common weak version of update_persistent_clock()
The weak update_persistent_clock64() calls update_persistent_clock(),
if the architecture defines an update_persistent_clock64() to replace
and remove its update_persistent_clock() version, when building the
kernel the linker will throw an undefined symbol error, that is, any
arch that switches to update_persistent_clock64() will have this issue.

To solve the issue, we add the common weak update_persistent_clock().

Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2015-08-17 11:25:16 -07:00
Wang YanQing
e1d7ba8735 time: Always make sure wall_to_monotonic isn't positive
Two issues were found on an IMX6 development board without an
enabled RTC device(resulting in the boot time and monotonic
time being initialized to 0).

Issue 1:exportfs -a generate:
       "exportfs: /opt/nfs/arm does not support NFS export"
Issue 2:cat /proc/stat:
       "btime 4294967236"

The same issues can be reproduced on x86 after running the
following code:
	int main(void)
	{
	    struct timeval val;
	    int ret;

	    val.tv_sec = 0;
	    val.tv_usec = 0;
	    ret = settimeofday(&val, NULL);
	    return 0;
	}

Two issues are different symptoms of same problem:
The reason is a positive wall_to_monotonic pushes boot time back
to the time before Epoch, and getboottime will return negative
value.

In symptom 1:
          negative boot time cause get_expiry() to overflow time_t
          when input expire time is 2147483647, then cache_flush()
          always clears entries just added in ip_map_parse.
In symptom 2:
          show_stat() uses "unsigned long" to print negative btime
          value returned by getboottime.

This patch fix the problem by prohibiting time from being set to a value which
would cause a negative boot time. As a result one can't set the CLOCK_REALTIME
time prior to (1970 + system uptime).

Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Wang YanQing <udknight@gmail.com>
[jstultz: reworded commit message]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2015-08-17 11:24:54 -07:00
Karsten Blees
de4a95faf1 time: Fix nanosecond file time rounding in timespec_trunc()
timespec_trunc() avoids rounding if granularity <= nanoseconds-per-jiffie
(or TICK_NSEC). This optimization assumes that:

 1. current_kernel_time().tv_nsec is already rounded to TICK_NSEC (i.e.
    with HZ=1000 you'd get 1000000, 2000000, 3000000... but never 1000001).
    This is no longer true (probably since hrtimers introduced in 2.6.16).

 2. TICK_NSEC is evenly divisible by all possible granularities. This may
    be true for HZ=100, 250, 1000, but obviously not for HZ=300 /
    TICK_NSEC=3333333 (introduced in 2.6.20).

Thus, sub-second portions of in-core file times are not rounded to on-disk
granularity. I.e. file times may change when the inode is re-read from disk
or when the file system is remounted.

This affects all file systems with file time granularities > 1 ns and < 1s,
e.g. CEPH (1000 ns), UDF (1000 ns), CIFS (100 ns), NTFS (100 ns) and FUSE
(configurable from user mode via struct fuse_init_out.time_gran).

Steps to reproduce with e.g. UDF:

  $ dd if=/dev/zero of=udfdisk count=10000 && mkudffs udfdisk
  $ mkdir udf && mount udfdisk udf
  $ touch udf/test && stat -c %y udf/test
  2015-06-09 10:22:56.130006767 +0200
  $ umount udf && mount udfdisk udf
  $ stat -c %y udf/test
  2015-06-09 10:22:56.130006000 +0200

Remounting truncates the mtime to 1 µs.

Fix the rounding in timespec_trunc() and update the documentation.

timespec_trunc() is exclusively used to calculate inode's [acm]time (mostly
via current_fs_time()), and always with super_block.s_time_gran as second
argument. So this can safely be changed without side effects.

Note: This does _not_ fix the issue for FAT's 2 second mtime resolution,
as super_block.s_time_gran isn't prepared to handle different ctime /
mtime / atime resolutions nor resolutions > 1 second.

Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2015-08-17 11:23:46 -07:00
John Stultz
38bf985b05 timer_list: Add the base offset so remaining nsecs are accurate for non monotonic timers
I noticed for non-monotonic timers in timer_list, some of the
output looked a little confusing.

For example:
 #1: <0000000000000000>, posix_timer_fn, S:01, hrtimer_start_range_ns, leap-a-day/2360
 # expires at 1434412800000000000-1434412800000000000 nsecs [in 1434410725062375469 to 1434410725062375469 nsecs]

You'll note the relative time till the expiration "[in xxx to
yyy nsecs]" is incorrect. This is because its printing the delta
between CLOCK_MONOTONIC time to the CLOCK_REALTIME expiration.

This patch fixes this issue by adding the clock offset to the
"now" time which we use to calculate the delta.

Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2015-08-17 11:23:31 -07:00
Oleg Nesterov
bf3eac84c4 percpu-rwsem: kill CONFIG_PERCPU_RWSEM
Remove CONFIG_PERCPU_RWSEM, the next patch adds the unconditional
user of percpu_rw_semaphore.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2015-08-15 13:52:11 +02:00
Oleg Nesterov
9287f6925a percpu-rwsem: introduce percpu_down_read_trylock()
Add percpu_down_read_trylock(), it will have the user soon.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2015-08-15 13:52:10 +02:00
Christoph Hellwig
7d3dcf26a6 devres: add devm_memremap
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-14 16:01:21 -04:00
Linus Torvalds
b25c6cee55 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc fixes: PMU driver corner cases, tooling fixes, and an 'AUX'
  (Intel PT) race related core fix"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/cqm: Do not access cpu_data() from CPU_UP_PREPARE handler
  perf/x86/intel: Fix memory leak on hot-plug allocation fail
  perf: Fix PERF_EVENT_IOC_PERIOD migration race
  perf: Fix double-free of the AUX buffer
  perf: Fix fasync handling on inherited events
  perf tools: Fix test build error when bindir contains double slash
  perf stat: Fix transaction lenght metrics
  perf: Fix running time accounting
2015-08-14 10:57:16 -07:00
Linus Torvalds
5e5013c6b5 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Ingo Molnar:
 "A single fix for a locking self-test crash"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/pvqspinlock: Fix kernel panic in locking-selftest
2015-08-14 10:45:23 -07:00
Dan Williams
92281dee82 arch: introduce memremap()
Existing users of ioremap_cache() are mapping memory that is known in
advance to not have i/o side effects.  These users are forced to cast
away the __iomem annotation, or otherwise neglect to fix the sparse
errors thrown when dereferencing pointers to this memory.  Provide
memremap() as a non __iomem annotated ioremap_*() in the case when
ioremap is otherwise a pointer to cacheable memory. Empirically,
ioremap_<cacheable-type>() call sites are seeking memory-like semantics
(e.g.  speculative reads, and prefetching permitted).

memremap() is a break from the ioremap implementation pattern of adding
a new memremap_<type>() for each mapping type and having silent
compatibility fall backs.  Instead, the implementation defines flags
that are passed to the central memremap() and if a mapping type is not
supported by an arch memremap returns NULL.

We introduce a memremap prototype as a trivial wrapper of
ioremap_cache() and ioremap_wt().  Later, once all ioremap_cache() and
ioremap_wt() usage has been removed from drivers we teach archs to
implement arch_memremap() with the ability to strictly enforce the
mapping type.

Cc: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-14 13:23:28 -04:00
David Howells
cfc411e7ff Move certificate handling to its own directory
Move certificate handling out of the kernel/ directory and into a certs/
directory to get all the weird stuff in one place and move the generated
signing keys into this directory.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: David Woodhouse <David.Woodhouse@intel.com>
2015-08-14 16:06:13 +01:00
David S. Miller
182ad468e7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/cavium/Kconfig

The cavium conflict was overlapping dependency
changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13 16:23:11 -07:00
Richard Guy Briggs
15ce414b82 fixup: audit: implement audit by executable
The Intel build-bot detected a sparse warning with with a patch I posted a
couple of days ago that was accepted in the audit/next tree:

Subject: [linux-next:master 6689/6751] kernel/audit_watch.c:543:36: sparse: dereference of noderef expression
Date: Friday, August 07, 2015, 06:57:55 PM
From: kbuild test robot <fengguang.wu@intel.com>
tree:   git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
head:   e6455bc5b91f41f842f30465c9193320f0568707
commit: 2e3a8aeb63e5335d4f837d453787c71bcb479796 [6689/6751] Merge remote- tracking branch 'audit/next'
sparse warnings: (new ones prefixed by >>)
>> kernel/audit_watch.c:543:36: sparse: dereference of noderef expression
   kernel/audit_watch.c:544:28: sparse: dereference of noderef expression

34d99af5 Richard Guy Briggs 2015-08-05  541  int audit_exe_compare(struct task_struct *tsk, struct audit_fsnotify_mark *mark)
34d99af5 Richard Guy Briggs 2015-08-05  542  {
34d99af5 Richard Guy Briggs 2015-08-05 @543     unsigned long ino = tsk->mm- >exe_file->f_inode->i_ino;
34d99af5 Richard Guy Briggs 2015-08-05  544     dev_t dev = tsk->mm->exe_file- >f_inode->i_sb->s_dev;

:::::: The code at line 543 was first introduced by commit
:::::: 34d99af52a audit: implement audit by executable

tsk->mm->exe_file requires RCU access.  The warning was reproduceable by adding
"C=1 CF=-D__CHECK_ENDIAN__" to the build command, and verified eliminated with
this patch.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
2015-08-12 22:04:07 -04:00
Wei-Chun Chao
140d8b335a bpf: fix bpf_perf_event_read() loop upper bound
Verifier rejects programs incorrectly.

Fixes: 35578d7984 ("bpf: Implement function bpf_perf_event_read()")
Cc: Kaixu Xia <xiakaixu@huawei.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Wei-Chun Chao <weichunc@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-12 16:42:50 -07:00
Eric W. Biederman
faf00da544 userns,pidns: Force thread group sharing, not signal handler sharing.
The code that places signals in signal queues computes the uids, gids,
and pids at the time the signals are enqueued.  Which means that tasks
that share signal queues must be in the same pid and user namespaces.

Sharing signal handlers is fine, but bizarre.

So make the code in fork and userns_install clearer by only testing
for what is functionally necessary.

Also update the comment in unshare about unsharing a user namespace to
be a little more explicit and make a little more sense.

Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-08-12 14:55:28 -05:00
Eric W. Biederman
12c641ab82 unshare: Unsharing a thread does not require unsharing a vm
In the logic in the initial commit of unshare made creating a new
thread group for a process, contingent upon creating a new memory
address space for that process.  That is wrong.  Two separate
processes in different thread groups can share a memory address space
and clone allows creation of such proceses.

This is significant because it was observed that mm_users > 1 does not
mean that a process is multi-threaded, as reading /proc/PID/maps
temporarily increments mm_users, which allows other processes to
(accidentally) interfere with unshare() calls.

Correct the check in check_unshare_flags() to test for
!thread_group_empty() for CLONE_THREAD, CLONE_SIGHAND, and CLONE_VM.
For sighand->count > 1 for CLONE_SIGHAND and CLONE_VM.
For !current_is_single_threaded instead of mm_users > 1 for CLONE_VM.

By using the correct checks in unshare this removes the possibility of
an accidental denial of service attack.

Additionally using the correct checks in unshare ensures that only an
explicit unshare(CLONE_VM) can possibly trigger the slow path of
current_is_single_threaded().  As an explict unshare(CLONE_VM) is
pointless it is not expected there are many applications that make
that call.

Cc: stable@vger.kernel.org
Fixes: b2e0d98705 userns: Implement unshare of the user namespace
Reported-by: Ricky Zhou <rickyz@chromium.org>
Reported-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-08-12 14:54:26 -05:00
David Howells
99db443506 PKCS#7: Appropriately restrict authenticated attributes and content type
A PKCS#7 or CMS message can have per-signature authenticated attributes
that are digested as a lump and signed by the authorising key for that
signature.  If such attributes exist, the content digest isn't itself
signed, but rather it is included in a special authattr which then
contributes to the signature.

Further, we already require the master message content type to be
pkcs7_signedData - but there's also a separate content type for the data
itself within the SignedData object and this must be repeated inside the
authattrs for each signer [RFC2315 9.2, RFC5652 11.1].

We should really validate the authattrs if they exist or forbid them
entirely as appropriate.  To this end:

 (1) Alter the PKCS#7 parser to reject any message that has more than one
     signature where at least one signature has authattrs and at least one
     that does not.

 (2) Validate authattrs if they are present and strongly restrict them.
     Only the following authattrs are permitted and all others are
     rejected:

     (a) contentType.  This is checked to be an OID that matches the
     	 content type in the SignedData object.

     (b) messageDigest.  This must match the crypto digest of the data.

     (c) signingTime.  If present, we check that this is a valid, parseable
     	 UTCTime or GeneralTime and that the date it encodes fits within
     	 the validity window of the matching X.509 cert.

     (d) S/MIME capabilities.  We don't check the contents.

     (e) Authenticode SP Opus Info.  We don't check the contents.

     (f) Authenticode Statement Type.  We don't check the contents.

     The message is rejected if (a) or (b) are missing.  If the message is
     an Authenticode type, the message is rejected if (e) is missing; if
     not Authenticode, the message is rejected if (d) - (f) are present.

     The S/MIME capabilities authattr (d) unfortunately has to be allowed
     to support kernels already signed by the pesign program.  This only
     affects kexec.  sign-file suppresses them (CMS_NOSMIMECAP).

     The message is also rejected if an authattr is given more than once or
     if it contains more than one element in its set of values.

 (3) Add a parameter to pkcs7_verify() to select one of the following
     restrictions and pass in the appropriate option from the callers:

     (*) VERIFYING_MODULE_SIGNATURE

	 This requires that the SignedData content type be pkcs7-data and
	 forbids authattrs.  sign-file sets CMS_NOATTR.  We could be more
	 flexible and permit authattrs optionally, but only permit minimal
	 content.

     (*) VERIFYING_FIRMWARE_SIGNATURE

	 This requires that the SignedData content type be pkcs7-data and
	 requires authattrs.  In future, this will require an attribute
	 holding the target firmware name in addition to the minimal set.

     (*) VERIFYING_UNSPECIFIED_SIGNATURE

	 This requires that the SignedData content type be pkcs7-data but
	 allows either no authattrs or only permits the minimal set.

     (*) VERIFYING_KEXEC_PE_SIGNATURE

	 This only supports the Authenticode SPC_INDIRECT_DATA content type
	 and requires at least an SpcSpOpusInfo authattr in addition to the
	 minimal set.  It also permits an SPC_STATEMENT_TYPE authattr (and
	 an S/MIME capabilities authattr because the pesign program doesn't
	 remove these).

     (*) VERIFYING_KEY_SIGNATURE
     (*) VERIFYING_KEY_SELF_SIGNATURE

	 These are invalid in this context but are included for later use
	 when limiting the use of X.509 certs.

 (4) The pkcs7_test key type is given a module parameter to select between
     the above options for testing purposes.  For example:

	echo 1 >/sys/module/pkcs7_test_key/parameters/usage
	keyctl padd pkcs7_test foo @s </tmp/stuff.pkcs7

     will attempt to check the signature on stuff.pkcs7 as if it contains a
     firmware blob (1 being VERIFYING_FIRMWARE_SIGNATURE).

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marcel Holtmann <marcel@holtmann.org>
Reviewed-by: David Woodhouse <David.Woodhouse@intel.com>
2015-08-12 17:01:01 +01:00
David Woodhouse
770f2b9876 modsign: Use extract-cert to process CONFIG_SYSTEM_TRUSTED_KEYS
Fix up the dependencies somewhat too, while we're at it.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2015-08-12 17:01:01 +01:00
Ingo Molnar
9b9412dc70 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU changes from Paul E. McKenney:

  - The combination of tree geometry-initialization simplifications
    and OS-jitter-reduction changes to expedited grace periods.
    These two are stacked due to the large number of conflicts
    that would otherwise result.

    [ With one addition, a temporary commit to silence a lockdep false
      positive. Additional changes to the expedited grace-period
      primitives (queued for 4.4) remove the cause of this false
      positive, and therefore include a revert of this temporary commit. ]

  - Documentation updates.

  - Torture-test updates.

  - Miscellaneous fixes.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-12 12:12:12 +02:00
Andrea Parri
ff277d4250 sched/deadline: Fix comment in enqueue_task_dl()
The "dl_boosted" flag is set by comparing *absolute* deadlines
(c.f., rt_mutex_setprio()).

Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1438782979-9057-2-git-send-email-parri.andrea@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-12 12:06:10 +02:00
Andrea Parri
4ffa08ed4c sched/deadline: Fix comment in push_dl_tasks()
The comment is "misleading"; fix it by adapting a comment from
push_rt_tasks().

Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1438782979-9057-1-git-send-email-parri.andrea@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-12 12:06:10 +02:00
Peter Zijlstra
6c37067e27 sched: Change the sched_class::set_cpus_allowed() calling context
Change the calling context of sched_class::set_cpus_allowed() such
that we can assume the task is inactive.

This allows us to easily make changes that affect accounting done by
enqueue/dequeue. This does in fact completely remove
set_cpus_allowed_rt() and greatly reduces set_cpus_allowed_dl().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dedekind1@gmail.com
Cc: juri.lelli@arm.com
Cc: mgorman@suse.de
Cc: riel@redhat.com
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20150515154833.667516139@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-12 12:06:10 +02:00
Peter Zijlstra
c5b2803840 sched: Make sched_class::set_cpus_allowed() unconditional
Give every class a set_cpus_allowed() method, this enables some small
optimization in the RT,DL implementation by avoiding a double
cpumask_weight() call.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dedekind1@gmail.com
Cc: juri.lelli@arm.com
Cc: mgorman@suse.de
Cc: riel@redhat.com
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20150515154833.614517487@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-12 12:06:09 +02:00
Peter Zijlstra
25834c73f9 sched: Fix a race between __kthread_bind() and sched_setaffinity()
Because sched_setscheduler() checks p->flags & PF_NO_SETAFFINITY
without locks, a caller might observe an old value and race with the
set_cpus_allowed_ptr() call from __kthread_bind() and effectively undo
it:

	__kthread_bind()
	  do_set_cpus_allowed()
						<SYSCALL>
						  sched_setaffinity()
						    if (p->flags & PF_NO_SETAFFINITIY)
						    set_cpus_allowed_ptr()
	  p->flags |= PF_NO_SETAFFINITY

Fix the bug by putting everything under the regular scheduler locks.

This also closes a hole in the serialization of task_struct::{nr_,}cpus_allowed.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dedekind1@gmail.com
Cc: juri.lelli@arm.com
Cc: mgorman@suse.de
Cc: riel@redhat.com
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20150515154833.545640346@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-12 12:06:09 +02:00
Byungchul Park
7855a35ac0 sched: Ensure a task has a non-normalized vruntime when returning back to CFS
Current code ensures that a task has a normalized vruntime when switching away
from the fair class, but it does not ensure the task has a non-normalized
vruntime when switching back to the fair class.

This is an example breaking this consistency:

  1. a task is in fair class and !queued
  2. changes its class to RT class (still !queued)
  3. changes its class to fair class again (still !queued)

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1439197375-27927-1-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-12 12:06:09 +02:00
Aravind Gopalakrishnan
e237882b8f sched/numa: Fix NUMA_DIRECT topology identification
Systems which have all nodes at a distance of at most 1 hop should be
identified as 'NUMA_DIRECT'.

However, the scheduler incorrectly identifies it as 'NUMA_BACKPLANE'.
This is because 'n' is assigned to sched_max_numa_distance but the
code (mis)interprets it to mean 'number of hops'.

Rik had actually used sched_domains_numa_levels for detecting a
'NUMA_DIRECT' topology:

  http://marc.info/?l=linux-kernel&m=141279712429834&w=2

But that was changed when he removed the hops table in the
subsequent version:

  http://marc.info/?l=linux-kernel&m=141353106106771&w=2

Fixing the issue here.

Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1439256048-3748-1-git-send-email-Aravind.Gopalakrishnan@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-12 12:06:08 +02:00
Will Deacon
77e430e3e4 locking/qrwlock: Make use of _{acquire|release|relaxed}() atomics
The qrwlock implementation is slightly heavy in its use of memory
barriers, mainly through the use of _cmpxchg() and _return() atomics, which
imply full barrier semantics.

This patch modifies the qrwlock code to use the more relaxed atomic
routines so that we can reduce the unnecessary barrier overhead on
weakly-ordered architectures.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman.Long@hp.com
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1438880084-18856-7-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-12 11:59:06 +02:00
Alexander Shishkin
c2ad6b51ef perf/ring-buffer: Clarify the use of page::private for high-order AUX allocations
A question [1] was raised about the use of page::private in AUX buffer
allocations, so let's add a clarification about its intended use.

The private field and flag are used by perf's rb_alloc_aux() path to
tell the pmu driver the size of each high-order allocation, so that the
driver can program those appropriately into its hardware. This only
matters for PMUs that don't support hardware scatter tables. Otherwise,
every page in the buffer is just a page.

This patch adds a comment about the private field to the AUX buffer
allocation path.

  [1] http://marc.info/?l=linux-kernel&m=143803696607968

Reported-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1438063204-665-1-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-12 11:43:20 +02:00
Ingo Molnar
3d325bf0da Merge branch 'perf/urgent' into perf/core, to pick up fixes before applying new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-12 11:39:19 +02:00
Peter Zijlstra
c7999c6f3f perf: Fix PERF_EVENT_IOC_PERIOD migration race
I ran the perf fuzzer, which triggered some WARN()s which are due to
trying to stop/restart an event on the wrong CPU.

Use the normal IPI pattern to ensure we run the code on the correct CPU.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: bad7192b84 ("perf: Fix PERF_EVENT_IOC_PERIOD to force-reset the period")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-12 11:37:22 +02:00
Ben Hutchings
ee9397a6fb perf: Fix double-free of the AUX buffer
If rb->aux_refcount is decremented to zero before rb->refcount,
__rb_free_aux() may be called twice resulting in a double free of
rb->aux_pages.  Fix this by adding a check to __rb_free_aux().

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: 57ffc5ca67 ("perf: Fix AUX buffer refcounting")
Link: http://lkml.kernel.org/r/1437953468.12842.17.camel@decadent.org.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-12 11:37:21 +02:00
Daniel Wagner
9f61668073 tracing: Allow triggers to filter for CPU ids and process names
By extending the filter rules by more generic fields
we can write triggers filters like

  echo 'stacktrace if cpu == 1' > \
	/sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger

or

  echo 'stacktrace if comm == sshd'  > \
	/sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger

CPU and COMM are not part of struct trace_entry. We could add the two
new fields to ftrace_common_field list and fix up all depending
sides. But that looks pretty ugly. Another thing I would like to
avoid that the 'format' file contents changes.

All this can be avoided by introducing another list which contains
non field members of struct trace_entry.

Link: http://lkml.kernel.org/r/1439210146-24707-1-git-send-email-daniel.wagner@bmw-carit.de

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-08-11 18:01:06 -04:00
Dan Williams
124fe20d94 mm: enhance region_is_ram() to region_intersects()
region_is_ram() is used to prevent the establishment of aliased mappings
to physical "System RAM" with incompatible cache settings.  However, it
uses "-1" to indicate both "unknown" memory ranges (ranges not described
by platform firmware) and "mixed" ranges (where the parameters describe
a range that partially overlaps "System RAM").

Fix this up by explicitly tracking the "unknown" vs "mixed" resource
cases and returning REGION_INTERSECTS, REGION_MIXED, or REGION_DISJOINT.
This re-write also adds support for detecting when the requested region
completely eclipses all of a resource.  Note, the implementation treats
overlaps between "unknown" and the requested memory type as
REGION_INTERSECTS.

Finally, other memory types can be passed in by name, for now the only
usage "System RAM".

Suggested-by: Luis R. Rodriguez <mcgrof@suse.com>
Reviewed-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-10 23:07:05 -04:00
Alban Crequy
24ee3cf89b cpuset: use trialcs->mems_allowed as a temp variable
The comment says it's using trialcs->mems_allowed as a temp variable but
it didn't match the code. Change the code to match the comment.

This fixes an issue when writing in cpuset.mems when a sub-directory
exists: we need to write several times for the information to persist:

| root@alban:/sys/fs/cgroup/cpuset# mkdir footest9
| root@alban:/sys/fs/cgroup/cpuset# cd footest9
| root@alban:/sys/fs/cgroup/cpuset/footest9# mkdir aa
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
|
| root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > cpuset.mems
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
|
| root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > cpuset.mems
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
| 0
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat aa/cpuset.mems
|
| root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > aa/cpuset.mems
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat aa/cpuset.mems
| 0
| root@alban:/sys/fs/cgroup/cpuset/footest9#

This should help to fix the following issue in Docker:
https://github.com/opencontainers/runc/issues/133
In some conditions, a Docker container needs to be started twice in
order to work.

Signed-off-by: Alban Crequy <alban@endocode.com>
Tested-by: Iago López Galeiras <iago@endocode.com>
Cc: <stable@vger.kernel.org> # 3.17+
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2015-08-10 11:18:41 -04:00
Viresh Kumar
ecbebcb868 kernel: broadcast-hrtimer: Migrate to new 'set-state' interface
Migrate broadcast-hrtimer driver to the new 'set-state' interface
provided by clockevents core, the earlier 'set-mode' interface is marked
obsolete now.

Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
2015-08-10 11:41:08 +02:00
Kaixu Xia
35578d7984 bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter
According to the perf_event_map_fd and index, the function
bpf_perf_event_read() can convert the corresponding map
value to the pointer to struct perf_event and return the
Hardware PMU counter value.

Signed-off-by: Kaixu Xia <xiakaixu@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-09 22:50:06 -07:00
Kaixu Xia
ea317b267e bpf: Add new bpf map type to store the pointer to struct perf_event
Introduce a new bpf map type 'BPF_MAP_TYPE_PERF_EVENT_ARRAY'.
This map only stores the pointer to struct perf_event. The
user space event FDs from perf_event_open() syscall are converted
to the pointer to struct perf_event and stored in map.

Signed-off-by: Kaixu Xia <xiakaixu@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-09 22:50:05 -07:00
Wang Nan
2a36f0b92e bpf: Make the bpf_prog_array_map more generic
All the map backends are of generic nature. In order to avoid
adding much special code into the eBPF core, rewrite part of
the bpf_prog_array map code and make it more generic. So the
new perf_event_array map type can reuse most of code with
bpf_prog_array map and add fewer lines of special code.

Signed-off-by: Wang Nan <wangnan0@huawei.com>
Signed-off-by: Kaixu Xia <xiakaixu@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-09 22:50:05 -07:00
Kaixu Xia
ffe8690c85 perf: add the necessary core perf APIs when accessing events counters in eBPF programs
This patch add three core perf APIs:
 - perf_event_attrs(): export the struct perf_event_attr from struct
   perf_event;
 - perf_event_get(): get the struct perf_event from the given fd;
 - perf_event_read_local(): read the events counters active on the
   current CPU;
These APIs are needed when accessing events counters in eBPF programs.

The API perf_event_read_local() comes from Peter and I add the
corresponding SOB.

Signed-off-by: Kaixu Xia <xiakaixu@huawei.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-09 22:50:05 -07:00
Greg Kroah-Hartman
5d44f4b348 Merge 4.2-rc6 into char-misc-next
We want the fixes in Linus's tree in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-09 16:28:09 -07:00
David Woodhouse
99d27b1b52 modsign: Add explicit CONFIG_SYSTEM_TRUSTED_KEYS option
Let the user explicitly provide a file containing trusted keys, instead of
just automatically finding files matching *.x509 in the build tree and
trusting whatever we find. This really ought to be an *explicit*
configuration, and the build rules for dealing with the files were
fairly painful too.

Fix applied from James Morris that removes an '=' from a macro definition
in kernel/Makefile as this is a feature that only exists from GNU make 3.82
onwards.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2015-08-07 16:26:14 +01:00
David Woodhouse
fb11794991 modsign: Use single PEM file for autogenerated key
The current rule for generating signing_key.priv and signing_key.x509 is
a classic example of a bad rule which has a tendency to break parallel
make. When invoked to create *either* target, it generates the other
target as a side-effect that make didn't predict.

So let's switch to using a single file signing_key.pem which contains
both key and certificate. That matches what we do in the case of an
external key specified by CONFIG_MODULE_SIG_KEY anyway, so it's also
slightly cleaner.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2015-08-07 16:26:14 +01:00
David Woodhouse
1329e8cc69 modsign: Extract signing cert from CONFIG_MODULE_SIG_KEY if needed
Where an external PEM file or PKCS#11 URI is given, we can get the cert
from it for ourselves instead of making the user drop signing_key.x509
in place for us.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2015-08-07 16:26:14 +01:00
David Woodhouse
19e91b69d7 modsign: Allow external signing key to be specified
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2015-08-07 16:26:14 +01:00
David Howells
091f6e26eb MODSIGN: Extract the blob PKCS#7 signature verifier from module signing
Extract the function that drives the PKCS#7 signature verification given a
data blob and a PKCS#7 blob out from the module signing code and lump it with
the system keyring code as it's generic.  This makes it independent of module
config options and opens it to use by the firmware loader.

Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Kyle McMartin <kyle@kernel.org>
2015-08-07 16:26:13 +01:00
David Howells
1c39449921 system_keyring.c doesn't need to #include module-internal.h
system_keyring.c doesn't need to #include module-internal.h as it doesn't use
the one thing that exports.  Remove the inclusion.

Signed-off-by: David Howells <dhowells@redhat.com>
2015-08-07 16:26:13 +01:00
David Howells
3f1e1bea34 MODSIGN: Use PKCS#7 messages as module signatures
Move to using PKCS#7 messages as module signatures because:

 (1) We have to be able to support the use of X.509 certificates that don't
     have a subjKeyId set.  We're currently relying on this to look up the
     X.509 certificate in the trusted keyring list.

 (2) PKCS#7 message signed information blocks have a field that supplies the
     data required to match with the X.509 certificate that signed it.

 (3) The PKCS#7 certificate carries fields that specify the digest algorithm
     used to generate the signature in a standardised way and the X.509
     certificates specify the public key algorithm in a standardised way - so
     we don't need our own methods of specifying these.

 (4) We now have PKCS#7 message support in the kernel for signed kexec purposes
     and we can make use of this.

To make this work, the old sign-file script has been replaced with a program
that needs compiling in a previous patch.  The rules to build it are added
here.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Vivek Goyal <vgoyal@redhat.com>
2015-08-07 16:26:13 +01:00
Frans Klaver
3da56d1663 kernel: exit: fix typo in comment
s,critiera,criteria,

While at it, add a comma, because it makes sense grammatically.

Signed-off-by: Frans Klaver <fransklaver@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
2015-08-07 13:59:49 +02:00
David Kershner
18896451ea kthread: export kthread functions
The s-Par visornic driver, currently in staging, processes a queue being
serviced by the an s-Par service partition.  We can get a message that
something has happened with the Service Partition, when that happens, we
must not access the channel until we get a message that the service
partition is back again.

The visornic driver has a thread for processing the channel, when we get
the message, we need to be able to park the thread and then resume it
when the problem clears.

We can do this with kthread_park and unpark but they are not exported
from the kernel, this patch exports the needed functions.

Signed-off-by: David Kershner <david.kershner@unisys.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Richard Weinberger <richard.weinberger@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07 04:39:41 +03:00
Amanieu d'Antras
26135022f8 signal: fix information leak in copy_siginfo_to_user
This function may copy the si_addr_lsb, si_lower and si_upper fields to
user mode when they haven't been initialized, which can leak kernel
stack data to user mode.

Just checking the value of si_code is insufficient because the same
si_code value is shared between multiple signals.  This is solved by
checking the value of si_signo in addition to si_code.

Signed-off-by: Amanieu d'Antras <amanieu@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07 04:39:40 +03:00
Amanieu d'Antras
3c00cb5e68 signal: fix information leak in copy_siginfo_from_user32
This function can leak kernel stack data when the user siginfo_t has a
positive si_code value.  The top 16 bits of si_code descibe which fields
in the siginfo_t union are active, but they are treated inconsistently
between copy_siginfo_from_user32, copy_siginfo_to_user32 and
copy_siginfo_to_user.

copy_siginfo_from_user32 is called from rt_sigqueueinfo and
rt_tgsigqueueinfo in which the user has full control overthe top 16 bits
of si_code.

This fixes the following information leaks:
x86:   8 bytes leaked when sending a signal from a 32-bit process to
       itself. This leak grows to 16 bytes if the process uses x32.
       (si_code = __SI_CHLD)
x86:   100 bytes leaked when sending a signal from a 32-bit process to
       a 64-bit process. (si_code = -1)
sparc: 4 bytes leaked when sending a signal from a 32-bit process to a
       64-bit process. (si_code = any)

parsic and s390 have similar bugs, but they are not vulnerable because
rt_[tg]sigqueueinfo have checks that prevent sending a positive si_code
to a different process.  These bugs are also fixed for consistency.

Signed-off-by: Amanieu d'Antras <amanieu@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07 04:39:40 +03:00
Richard Guy Briggs
34d99af52a audit: implement audit by executable
This adds the ability audit the actions of a not-yet-running process.

This patch implements the ability to filter on the executable path.  Instead of
just hard coding the ino and dev of the executable we care about at the moment
the rule is inserted into the kernel, use the new audit_fsnotify
infrastructure to manage this dynamically.  This means that if the filename
does not yet exist but the containing directory does, or if the inode in
question is unlinked and creat'd (aka updated) the rule will just continue to
work.  If the containing directory is moved or deleted or the filesystem is
unmounted, the rule is deleted automatically.  A future enhancement would be to
have the rule survive across directory disruptions.

This is a heavily modified version of a patch originally submitted by Eric
Paris with some ideas from Peter Moody.

Cc: Peter Moody <peter@hda3.com>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: minor whitespace clean to satisfy ./scripts/checkpatch]
Signed-off-by: Paul Moore <pmoore@redhat.com>
2015-08-06 16:17:25 -04:00
Richard Guy Briggs
7f49294282 audit: clean simple fsnotify implementation
This is to be used to audit by executable path rules, but audit watches should
be able to share this code eventually.

At the moment the audit watch code is a lot more complex.  That code only
creates one fsnotify watch per parent directory.  That 'audit_parent' in
turn has a list of 'audit_watches' which contain the name, ino, dev of
the specific object we care about.  This just creates one fsnotify watch
per object we care about.  So if you watch 100 inodes in /etc this code
will create 100 fsnotify watches on /etc.  The audit_watch code will
instead create 1 fsnotify watch on /etc (the audit_parent) and then 100
individual watches chained from that fsnotify mark.

We should be able to convert the audit_watch code to do one fsnotify
mark per watch and simplify things/remove a whole lot of code.  After
that conversion we should be able to convert the audit_fsnotify code to
support that hierarchy if the optimization is necessary.

Move the access to the entry for audit_match_signal() to the beginning of
the audit_del_rule() function in case the entry found is the same one passed
in.  This will enable it to be used by audit_autoremove_mark_rule(),
kill_rules() and audit_remove_parent_watches().

This is a heavily modified and merged version of two patches originally
submitted by Eric Paris.

Cc: Peter Moody <peter@hda3.com>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: added a space after a declaration to keep ./scripts/checkpatch happy]
Signed-off-by: Paul Moore <pmoore@redhat.com>
2015-08-06 16:14:53 -04:00
Richard Guy Briggs
84cb777e67 audit: use macros for unset inode and device values
Clean up a number of places were casted magic numbers are used to represent
unset inode and device numbers in preparation for the audit by executable path
patch set.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: enclosed the _UNSET macros in parentheses for ./scripts/checkpatch]
Signed-off-by: Paul Moore <pmoore@redhat.com>
2015-08-06 14:39:02 -04:00
Wang Nan
04a22fae4c tracing, perf: Implement BPF programs attached to uprobes
By copying BPF related operation to uprobe processing path, this patch
allow users attach BPF programs to uprobes like what they are already
doing on kprobes.

After this patch, users are allowed to use PERF_EVENT_IOC_SET_BPF on a
uprobe perf event. Which make it possible to profile user space programs
and kernel events together using BPF.

Because of this patch, CONFIG_BPF_EVENTS should be selected by
CONFIG_UPROBE_EVENT to ensure trace_call_bpf() is compiled even if
KPROBE_EVENT is not set.

Signed-off-by: Wang Nan <wangnan0@huawei.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@gmail.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kaixu Xia <xiakaixu@huawei.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
Link: http://lkml.kernel.org/r/1435716878-189507-3-git-send-email-wangnan0@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2015-08-06 15:29:14 -03:00
Mathias Krause
71cf5aeeb8 kernel, cpu: Remove bogus __ref annotations
cpu_chain lost its __cpuinitdata annotation long ago in commit
5c113fbeed ("fix cpu_chain section mismatch..."). This and the global
__cpuinit annotation drop in v3.11 vanished the need to mark all users,
including transitive ones, with the __ref annotation. Just get rid of it
to not wrongly hide section mismatches.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-05 15:18:26 -07:00
Richard Guy Briggs
8c85fc9ae6 audit: make audit_del_rule() more robust
Move the access to the entry for audit_match_signal() to earlier in the
function in case the entry found is the same one passed in.  This will enable
it to be used by audit_remove_mark_rule().

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: tweaked subject line as it no longer made sense after multiple revs]
Signed-off-by: Paul Moore <pmoore@redhat.com>
2015-08-05 17:46:42 -04:00
Tejun Heo
d0ec4230a0 cgroup: export cgrp_dfl_root
While cgroup subsystems can't be modules, blkcg supports dynamically
loadable policies which interact with cgroup core.  Export
cgrp_dfl_root so that cgroup_on_dfl() can be used in those modules.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
2015-08-05 16:03:19 -04:00
Vitaly Kuznetsov
32145c4677 cpu-hotplug: export cpu_hotplug_enable/cpu_hotplug_disable
Hyper-V module needs to disable cpu hotplug (offlining) as there is no
support from hypervisor side to reassign already opened event channels
to a different CPU. Currently it is been done by altering
smp_ops.cpu_disable but it is hackish.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-05 11:46:44 -07:00
Vitaly Kuznetsov
89af7ba574 cpu-hotplug: convert cpu_hotplug_disabled to a counter
As a prerequisite to exporting cpu_hotplug_enable/cpu_hotplug_disable
functions to modules we need to convert cpu_hotplug_disabled to a counter
to properly support disable -> disable -> enable call sequences. E.g.
after Hyper-V vmbus module (which is supposed to be the first user of
exported cpu_hotplug_enable/cpu_hotplug_disable) did cpu_hotplug_disable()
hibernate path calls disable_nonboot_cpus() and if we hit an error in
_cpu_down() enable_nonboot_cpus() will be called on the failure path (thus
making cpu_hotplug_disabled = 0 and leaving cpu hotplug in 'enabled'
state). Same problem is possible if more than 1 module use
cpu_hotplug_disable/cpu_hotplug_enable on their load/unload paths. When
one of these modules is been unloaded it is logical to leave cpu hotplug
in 'disabled' state.

To support the change we need to increse cpu_hotplug_disabled counter
in disable_nonboot_cpus() unconditionally as all users of
disable_nonboot_cpus() are supposed to do enable_nonboot_cpus() in case
an error was returned.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-05 11:46:44 -07:00
Paul Moore
ae9d2fb482 audit: fix uninitialized variable in audit_add_rule()
As reported by the 0-Day testing service:

   kernel/auditfilter.c: In function 'audit_rule_change':
>> kernel/auditfilter.c:864:6: warning: 'err' may be used uninit...
     int err;

Cc: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
2015-08-05 11:19:45 -04:00
Richard Guy Briggs
aa7c043d97 audit: eliminate unnecessary extra layer of watch parent references
The audit watch parent count was imbalanced, adding an unnecessary layer of
watch parent references.  Decrement the additional parent reference when a
watch is reused, already having a reference to the parent.

audit_find_parent() gets a reference to the parent, if the parent is
already known.  This additional parental reference is not needed if the
watch is subsequently found by audit_add_to_parent(), and consumed if
the watch does not already exist, so we need to put the parent if the
watch is found, and do nothing if this new watch is added to the parent.

If the parent wasn't already known, it is created with a refcount of 1
and added to the audit_watch_group, then incremented by one to be
subsequently consumed by the newly created watch in
audit_add_to_parent().

The rule points to the watch, not to the parent, so the rule's refcount
gets bumped, not the parent's.

See LKML, 2015-07-16

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
2015-08-04 18:26:33 -04:00
Richard Guy Briggs
f8259b262b audit: eliminate unnecessary extra layer of watch references
The audit watch count was imbalanced, adding an unnecessary layer of watch
references.  Only add the second reference when it is added to a parent.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
2015-08-04 18:21:39 -04:00
Tim Gardner
1dadafa86a workqueue: Make flush_workqueue() available again to non GPL modules
Commit 37b1ef31a5 ("workqueue: move
flush_scheduled_work() to workqueue.h") moved the exported non GPL
flush_scheduled_work() from a function to an inline wrapper.
Unfortunately, it directly calls flush_workqueue() which is a GPL function.
This has the effect of changing the licensing requirement for this function
and makes it unavailable to non GPL modules.

See commit ad7b1f841f ("workqueue: Make
schedule_work() available again to non GPL modules") for precedent.

Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2015-08-04 14:04:54 -04:00
Paul E. McKenney
12d560f4ea rcu,locking: Privatize smp_mb__after_unlock_lock()
RCU is the only thing that uses smp_mb__after_unlock_lock(), and is
likely the only thing that ever will use it, so this commit makes this
macro private to RCU.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>
2015-08-04 08:49:21 -07:00
Paul E. McKenney
3dbe43f6fb Merge branches 'doc.2015.07.15a' and 'torture.2015.07.15a' into HEAD
doc.2015.07.15a: Documentation updates.
torture.2015.07.15a: Torture-test updates.
2015-08-04 08:42:02 -07:00
Paul E. McKenney
8ff4fbfd69 Merge branches 'fixes.2015.07.22a' and 'initexp.2015.08.04a' into HEAD
fixes.2015.07.22a: Miscellaneous fixes.
initexp.2015.08.04a: Initialization and expedited updates.
	(Single branch due to conflicts.)
2015-08-04 08:40:58 -07:00
Paul E. McKenney
af859beaab rcu: Silence lockdep false positive for expedited grace periods
In a CONFIG_PREEMPT=y kernel, synchronize_rcu_expedited()
acquires the ->exp_funnel_mutex in rcu_preempt_state, then invokes
synchronize_sched_expedited, which acquires the ->exp_funnel_mutex in
rcu_sched_state.  There can be no deadlock because rcu_preempt_state
->exp_funnel_mutex acquisition always precedes that of rcu_sched_state.
But lockdep does not know that, so it gives false-positive splats.

This commit therefore associates a separate lock_class_key structure
with the rcu_sched_state structure's ->exp_funnel_mutex, allowing
lockdep to see the lock ordering, avoiding the false positives.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-08-04 08:39:21 -07:00
Alexander Shishkin
9a6694cfa2 perf/x86/intel/pt: Do not force sync packets on every schedule-in
Currently, the PT driver zeroes out the status register every time before
starting the event. However, all the writable bits are already taken care
of in pt_handle_status() function, except the new PacketByteCnt field,
which in new versions of PT contains the number of packet bytes written
since the last sync (PSB) packet. Zeroing it out before enabling PT forces
a sync packet to be written. This means that, with the existing code, a
sync packet (PSB and PSBEND, 18 bytes in total) will be generated every
time a PT event is scheduled in.

To avoid these unnecessary syncs and save a WRMSR in the fast path, this
patch changes the default behavior to not clear PacketByteCnt field, so
that the sync packets will be generated with the period specified as
"psb_period" attribute config field. This has little impact on the trace
data as the other packets that are normally sent within PSB+ (between PSB
and PSBEND) have their own generation scenarios which do not depend on the
sync packets.

One exception where we do need to force PSB like this when tracing starts,
so that the decoder has a clear sync point in the trace. For this purpose
we aready have hw::itrace_started flag, which we are currently using to
output PERF_RECORD_ITRACE_START. This patch moves setting itrace_started
from perf core to the pmu::start, where it should still be 0 on the very
first run.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1438264104-16189-1-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:16:55 +02:00
Andy Lutomirski
e5779e8e12 perf/x86/hw_breakpoints: Disallow kernel breakpoints unless kprobe-safe
Code on the kprobe blacklist doesn't want unexpected int3
exceptions. It probably doesn't want unexpected debug exceptions
either. Be safe: disallow breakpoints in nokprobes code.

On non-CONFIG_KPROBES kernels, there is no kprobe blacklist.  In
that case, disallow kernel breakpoints entirely.

It will be particularly important to keep hw breakpoints out of the
entry and NMI code once we move debug exceptions off the IST stack.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e14b152af99640448d895e3c2a8c2d5ee19a1325.1438312874.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 10:16:54 +02:00
Peter Zijlstra
fed66e2cdd perf: Fix fasync handling on inherited events
Vince reported that the fasync signal stuff doesn't work proper for
inherited events. So fix that.

Installing fasync allocates memory and sets filp->f_flags |= FASYNC,
which upon the demise of the file descriptor ensures the allocation is
freed and state is updated.

Now for perf, we can have the events stick around for a while after the
original FD is dead because of references from child events. So we
cannot copy the fasync pointer around. We can however consistently use
the parent's fasync, as that will be updated.

Reported-and-Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Arnaldo Carvalho deMelo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1434011521.1495.71.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 09:57:52 +02:00
Peter Zijlstra
3c8e479355 sched: Remove finish_arch_switch()
One less arch hook..

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-04 09:38:33 +02:00
Vladimir Davydov
cf780b7dc7 cgroup: fix idr_preload usage
It does not make much sense to call idr_preload with the same gfp mask
as the following idr_alloc, but this is what we do in cgroup_idr_alloc.
This patch fixes the idr_preload usage by making cgroup_idr_alloc call
idr_alloc w/o __GFP_WAIT. Since it is now safe to call cgroup_idr_alloc
with GFP_KERNEL, the patch also fixes all its callers appropriately.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2015-08-03 10:40:07 -04:00
Yuyang Du
7ea241afbf sched/fair: Clean up load average references
For cfs_rq, we have load.weight, runnable_load_avg, and load_avg.
Clean up how they are used:

  - First, as group sched_entity already largely uses load_avg, we now expand
    to use load_avg in all cases.

  - Second, for CPU-wide load balancing, we choose to use runnable_load_avg
    in all cases, which is the same as before this series.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arjan@linux.intel.com
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: fengguang.wu@intel.com
Cc: len.brown@intel.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: rafael.j.wysocki@intel.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1436918682-4971-8-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-03 12:24:32 +02:00