Commit graph

2376 commits

Author SHA1 Message Date
Rafael J. Wysocki
a61dec7447 cpufreq: schedutil: Avoid missing updates for one-CPU policies
Commit 152db033d7 (schedutil: Allow cpufreq requests to be made
even when kthread kicked) made changes to prevent utilization updates
from being discarded during processing a previous request, but it
left a small window in which that still can happen in the one-CPU
policy case.  Namely, updates coming in after setting work_in_progress
in sugov_update_commit() and clearing it in sugov_work() will still
be dropped due to the work_in_progress check in sugov_update_single().

To close that window, rearrange the code so as to acquire the update
lock around the deferred update branch in sugov_update_single()
and drop the work_in_progress check from it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2018-05-24 10:21:18 +02:00
Joel Fernandes (Google)
152db033d7 schedutil: Allow cpufreq requests to be made even when kthread kicked
Currently there is a chance of a schedutil cpufreq update request to be
dropped if there is a pending update request. This pending request can
be delayed if there is a scheduling delay of the irq_work and the wake
up of the schedutil governor kthread.

A very bad scenario is when a schedutil request was already just made,
such as to reduce the CPU frequency, then a newer request to increase
CPU frequency (even sched deadline urgent frequency increase requests)
can be dropped, even though the rate limits suggest that its Ok to
process a request. This is because of the way the work_in_progress flag
is used.

This patch improves the situation by allowing new requests to happen
even though the old one is still being processed. Note that in this
approach, if an irq_work was already issued, we just update next_freq
and don't bother to queue another request so there's no extra work being
done to make this happen.

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-23 10:37:56 +02:00
Viresh Kumar
036399782b cpufreq: Rename cpufreq_can_do_remote_dvfs()
This routine checks if the CPU running this code belongs to the policy
of the target CPU or if not, can it do remote DVFS for it remotely. But
the current name of it implies as if it is only about doing remote
updates.

Rename it to make it more relevant.

Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-23 10:37:08 +02:00
Patrick Bellasi
fd7d5287fd cpufreq: schedutil: Cleanup and document iowait boost
The iowait boosting code has been recently updated to add a progressive
boosting behavior which allows to be less aggressive in boosting tasks
doing only sporadic IO operations, thus being more energy efficient for
example on mobile platforms.

The current code is now however a bit convoluted. Some functionalities
(e.g. iowait boost reset) are replicated in different paths and their
documentation is slightly misaligned.

Let's cleanup the code by consolidating all the IO wait boosting related
functionality within within few dedicated functions and better define
their role:

- sugov_iowait_boost: set/increase the IO wait boost of a CPU
- sugov_iowait_apply: apply/reduce the IO wait boost of a CPU

Both these two function are used at every sugov update and they make
use of a unified IO wait boost reset policy provided by:

- sugov_iowait_reset: reset/disable the IO wait boost of a CPU
     if a CPU is not updated for more then one tick

This makes possible a cleaner and more self-contained design for the IO
wait boosting code since the rest of the sugov update routines, both for
single and shared frequency domains, follow the same template:

   /* Configure IO boost, if required */
   sugov_iowait_boost()

   /* Return here if freq change is in progress or throttled */

   /* Collect and aggregate utilization information */
   sugov_get_util()
   sugov_aggregate_util()

   /*
    * Add IO boost, if currently enabled, on top of the aggregated
    * utilization value
    */
   sugov_iowait_apply()

As a extra bonus, let's also add the documentation for the new
functions and better align the in-code documentation.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-22 14:05:05 +02:00
Patrick Bellasi
295f1a9953 cpufreq: schedutil: Fix iowait boost reset
A more energy efficient update of the IO wait boosting mechanism has
been introduced in:

   commit a5a0809bc5 ("cpufreq: schedutil: Make iowait boost more energy efficient")

where the boost value is expected to be:

 - doubled at each successive wakeup from IO
   staring from the minimum frequency supported by a CPU

 - reset when a CPU is not updated for more then one tick
   by either disabling the IO wait boost or resetting its value to the
   minimum frequency if this new update requires an IO boost.

This approach is supposed to "ignore" boosting for sporadic wakeups from
IO, while still getting the frequency boosted to the maximum to benefit
long sequence of wakeup from IO operations.

However, these assumptions are not always satisfied.
For example, when an IO boosted CPU enters idle for more the one tick
and then wakes up after an IO wait, since in sugov_set_iowait_boost() we
first check the IOWAIT flag, we keep doubling the iowait boost instead
of restarting from the minimum frequency value.

This misbehavior could happen mainly on non-shared frequency domains,
thus defeating the energy efficiency optimization, but it can also
happen on shared frequency domain systems.

Let fix this issue in sugov_set_iowait_boost() by:
 - first check the IO wait boost reset conditions
   to eventually reset the boost value
 - then applying the correct IO boost value
   if required by the caller

Fixes: a5a0809bc5 (cpufreq: schedutil: Make iowait boost more energy efficient)
Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-22 14:05:05 +02:00
Mathieu Malaterre
3febfc8a21 sched/deadline: Make the grub_reclaim() function static
Since the grub_reclaim() function can be made static, make it so.

Silences the following GCC warning (W=1):

  kernel/sched/deadline.c:1120:5: warning: no previous prototype for ‘grub_reclaim’ [-Wmissing-prototypes]

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180516200902.959-1-malat@debian.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-18 09:05:22 +02:00
Mathieu Malaterre
f6a3463063 sched/debug: Move the print_rt_rq() and print_dl_rq() declarations to kernel/sched/sched.h
In the following commit:

  6b55c9654f ("sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h")

the print_cfs_rq() prototype was added to <kernel/sched/sched.h>,
right next to the prototypes for print_cfs_stats(), print_rt_stats()
and print_dl_stats().

Finish this previous commit and also move related prototypes for
print_rt_rq() and print_dl_rq().

Remove existing extern declarations now that they not needed anymore.

Silences the following GCC warning, triggered by W=1:

  kernel/sched/debug.c:573:6: warning: no previous prototype for ‘print_rt_rq’ [-Wmissing-prototypes]
  kernel/sched/debug.c:603:6: warning: no previous prototype for ‘print_dl_rq’ [-Wmissing-prototypes]

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180516195348.30426-1-malat@debian.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-18 09:05:14 +02:00
Ingo Molnar
13a553199f Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
- Updates to the handling of expedited grace periods, perhaps most
   notably parallelizing their initialization.  Other changes
   include fixes from Boqun Feng.

 - Miscellaneous fixes.  These include an nvme fix from Nitzan Carmi
   that I am carrying because it depends on a new SRCU function
   cleanup_srcu_struct_quiesced().  This branch also includes fixes
   from Byungchul Park and Yury Norov.

 - Updates to reduce lock contention in the rcu_node combining tree.
   These are in preparation for the consolidation of RCU-bh,
   RCU-preempt, and RCU-sched into a single flavor, which was
   requested by Linus Torvalds in response to a security flaw
   whose root cause included confusion between the multiple flavors
   of RCU.

 - Torture-test updates that save their users some time and effort.

Conflicts:
	drivers/nvme/host/core.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-16 09:34:51 +02:00
Christoph Hellwig
fddda2b7b5 proc: introduce proc_create_seq{,_data}
Variants of proc_create{,_data} that directly take a struct seq_operations
argument and drastically reduces the boilerplate code in the callers.

All trivial callers converted over.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16 07:23:35 +02:00
Paul E. McKenney
c3442697c2 softirq: Eliminate unused cond_resched_softirq() macro
The cond_resched_softirq() macro is not used anywhere in mainline, so
this commit simplifies the kernel by eliminating it.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
2018-05-15 10:27:35 -07:00
Viresh Kumar
ecd2884291 cpufreq: schedutil: Don't set next_freq to UINT_MAX
The schedutil driver sets sg_policy->next_freq to UINT_MAX on certain
occasions to discard the cached value of next freq:
- In sugov_start(), when the schedutil governor is started for a group
  of CPUs.
- And whenever we need to force a freq update before rate-limit
  duration, which happens when:
  - there is an update in cpufreq policy limits.
  - Or when the utilization of DL scheduling class increases.

In return, get_next_freq() doesn't return a cached next_freq value but
recalculates the next frequency instead.

But having special meaning for a particular value of frequency makes the
code less readable and error prone. We recently fixed a bug where the
UINT_MAX value was considered as valid frequency in
sugov_update_single().

All we need is a flag which can be used to discard the value of
sg_policy->next_freq and we already have need_freq_update for that. Lets
reuse it instead of setting next_freq to UINT_MAX.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-15 10:38:12 +02:00
Dietmar Eggemann
1b04722c3b Revert "cpufreq: schedutil: Don't restrict kthread to related_cpus unnecessarily"
This reverts commit e2cabe48c2.

Lifting the restriction that the sugov kthread is bound to the
policy->related_cpus for a system with a slow switching cpufreq driver,
which is able to perform DVFS from any cpu (e.g. cpufreq-dt), is not
only not beneficial it also harms Enery-Aware Scheduling (EAS) on
systems with asymmetric cpu capacities (e.g. Arm big.LITTLE).

The sugov kthread which does the update for the little cpus could
potentially run on a big cpu. It could prevent that the big cluster goes
into deeper idle states although all the tasks are running on the little
cluster.

Example: hikey960 w/ 4.16.0-rc6-+
         Arm big.LITTLE with per-cluster DVFS

root@h960:~# cat /proc/cpuinfo | grep "^CPU part"
CPU part        : 0xd03 (Cortex-A53, little cpu)
CPU part        : 0xd03
CPU part        : 0xd03
CPU part        : 0xd03
CPU part        : 0xd09 (Cortex-A73, big cpu)
CPU part        : 0xd09
CPU part        : 0xd09
CPU part        : 0xd09

root@h960:/sys/devices/system/cpu/cpufreq# ls
policy0  policy4  schedutil

root@h960:/sys/devices/system/cpu/cpufreq# cat policy*/related_cpus
0 1 2 3
4 5 6 7

(1) w/o the revert:

root@h960:~# ps -eo pid,class,rtprio,pri,psr,comm | awk 'NR == 1 ||
/sugov/'
  PID CLS RTPRIO PRI PSR COMMAND
  1489 #6      0 140   1 sugov:0
  1490 #6      0 140   0 sugov:4

The sugov kthread sugov:4 responsible for policy4 runs on cpu0. (In this
case both sugov kthreads run on little cpus).

cross policy (cluster) remote callback example:
...
migration/1-14 [001] enqueue_task_fair: this_cpu=1 cpu_of(rq)=5
migration/1-14 [001] sugov_update_shared: this_cpu=1 sg_cpu->cpu=5
                     sg_cpu->sg_policy->policy->related_cpus=4-7
  sugov:4-1490 [000] sugov_work: this_cpu=0
                     sg_cpu->sg_policy->policy->related_cpus=4-7
...

The remote callback (this_cpu=1, target_cpu=5) is executed on cpu=0.

(2) w/ the revert:

root@h960:~# ps -eo pid,class,rtprio,pri,psr,comm | awk 'NR == 1 ||
/sugov/'
  PID CLS RTPRIO PRI PSR COMMAND
  1491 #6      0 140   2 sugov:0
  1492 #6      0 140   4 sugov:4

The sugov kthread sugov:4 responsible for policy4 runs on cpu4.

cross policy (cluster) remote callback example:
...
migration/1-14 [001] enqueue_task_fair: this_cpu=1 cpu_of(rq)=7
migration/1-14 [001] sugov_update_shared: this_cpu=1 sg_cpu->cpu=7
                     sg_cpu->sg_policy->policy->related_cpus=4-7
  sugov:4-1492 [004] sugov_work: this_cpu=4
                     sg_cpu->sg_policy->policy->related_cpus=4-7
...

The remote callback (this_cpu=1, target_cpu=7) is executed on cpu=4.

Now the sugov kthread executes again on the policy (cluster) for which
the Operating Performance Point (OPP) should be changed.
It avoids the problem that an otherwise idle policy (cluster) is running
schedutil (the sugov kthread) for another one.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-15 10:29:26 +02:00
Rohit Jain
943d355d7f sched/core: Distinguish between idle_cpu() calls based on desired effect, introduce available_idle_cpu()
In the following commit:

  247f2f6f3c ("sched/core: Don't schedule threads on pre-empted vCPUs")

... we distinguish between idle_cpu() when the vCPU is not running for
scheduling threads.

However, the idle_cpu() function is used in other places for
actually checking whether the state of the CPU is idle or not.

Hence split the use of that function based on the desired return value,
by introducing the available_idle_cpu() function.

This fixes a (slight) regression in that initial vCPU commit, because
some code paths (like the load-balancer) don't care and shouldn't care
if the vCPU is preempted or not, they just want to know if there's any
tasks on the CPU.

Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dhaval.giani@oracle.com
Cc: linux-kernel@vger.kernel.org
Cc: matt@codeblueprint.co.uk
Cc: steven.sistare@oracle.com
Cc: subhra.mazumdar@oracle.com
Link: http://lkml.kernel.org/r/1525883988-10356-1-git-send-email-rohit.k.jain@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 09:12:26 +02:00
Mel Gorman
1378447598 sched/numa: Stagger NUMA balancing scan periods for new threads
Threads share an address space and each can change the protections of the
same address space to trap NUMA faults. This is redundant and potentially
counter-productive as any thread doing the update will suffice. Potentially
only one thread is required but that thread may be idle or it may not have
any locality concerns and pick an unsuitable scan rate.

This patch uses independent scan period but they are staggered based on
the number of address space users when the thread is created.  The intent
is that threads will avoid scanning at the same time and have a chance
to adapt their scan rate later if necessary. This reduces the total scan
activity early in the lifetime of the threads.

The different in headline performance across a range of machines and
workloads is marginal but the system CPU usage is reduced as well as overall
scan activity.  The following is the time reported by NAS Parallel Benchmark
using unbound openmp threads and a D size class:

			      4.17.0-rc1             4.17.0-rc1
				 vanilla           stagger-v1r1
	Time bt.D      442.77 (   0.00%)      419.70 (   5.21%)
	Time cg.D      171.90 (   0.00%)      180.85 (  -5.21%)
	Time ep.D       33.10 (   0.00%)       32.90 (   0.60%)
	Time is.D        9.59 (   0.00%)        9.42 (   1.77%)
	Time lu.D      306.75 (   0.00%)      304.65 (   0.68%)
	Time mg.D       54.56 (   0.00%)       52.38 (   4.00%)
	Time sp.D     1020.03 (   0.00%)      903.77 (  11.40%)
	Time ua.D      400.58 (   0.00%)      386.49 (   3.52%)

Note it's not a universal win but we have no prior knowledge of which
thread matters but the number of threads created often exceeds the size
of the node when the threads are not bound. However, there is a reducation
of overall system CPU usage:

				    4.17.0-rc1             4.17.0-rc1
				       vanilla           stagger-v1r1
	sys-time-bt.D         48.78 (   0.00%)       48.22 (   1.15%)
	sys-time-cg.D         25.31 (   0.00%)       26.63 (  -5.22%)
	sys-time-ep.D          1.65 (   0.00%)        0.62 (  62.42%)
	sys-time-is.D         40.05 (   0.00%)       24.45 (  38.95%)
	sys-time-lu.D         37.55 (   0.00%)       29.02 (  22.72%)
	sys-time-mg.D         47.52 (   0.00%)       34.92 (  26.52%)
	sys-time-sp.D        119.01 (   0.00%)      109.05 (   8.37%)
	sys-time-ua.D         51.52 (   0.00%)       45.13 (  12.40%)

NUMA scan activity is also reduced:

	NUMA alloc local               1042828     1342670
	NUMA base PTE updates        140481138    93577468
	NUMA huge PMD updates           272171      180766
	NUMA page range updates      279832690   186129660
	NUMA hint faults               1395972     1193897
	NUMA hint local faults          877925      855053
	NUMA hint local percent             62          71
	NUMA pages migrated           12057909     9158023

Similar observations are made for other thread-intensive workloads. System
CPU usage is lower even though the headline gains in performance tend to be
small. For example, specjbb 2005 shows almost no difference in performance
but scan activity is reduced by a third on a 4-socket box. I didn't find
a workload (thread intensive or otherwise) that suffered badly.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20180504154109.mvrha2qo5wdl65vr@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 09:12:24 +02:00
Ingo Molnar
dfd5c3ea64 Linux 4.17-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlr4xw8eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGNYoH/1d5zyMpVJVUKZ0K
 LuEctCGby1PjSvSOhmMuxFVagFAqfBJXmwWTeohLfLG48r/Yk0AsZQ5HH13/8baj
 k/T8UgUvKZKustndCRp+joQ3Pa1ZpcIFaWRvB8pKFCefJ/F/Lj4B4X1HYI7vLq0K
 /ZBXUdy3ry0lcVuypnaARYAb2O7l/nyZIjZ3FhiuyymWe7Jpo+G7VK922LOMSX/y
 VYFZCWa8nxN+yFhO0ao9X5k7ggIiUrEBtbfNrk19VtAn0hx+OYKW2KfJK/eHNey/
 CKrOT+KAxU8VU29AEIbYzlL3yrQmULcEoIDiqJ/6m5m6JwsEbP6EqQHs0TiuQFpq
 A0MO9rw=
 =yjUP
 -----END PGP SIGNATURE-----

Merge tag 'v4.17-rc5' into sched/core, to pick up fixes and dependencies

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 09:02:14 +02:00
Linus Torvalds
66e1c94db3 Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/pti updates from Thomas Gleixner:
 "A mixed bag of fixes and updates for the ghosts which are hunting us.

  The scheduler fixes have been pulled into that branch to avoid
  conflicts.

   - A set of fixes to address a khread_parkme() race which caused lost
     wakeups and loss of state.

   - A deadlock fix for stop_machine() solved by moving the wakeups
     outside of the stopper_lock held region.

   - A set of Spectre V1 array access restrictions. The possible
     problematic spots were discuvered by Dan Carpenters new checks in
     smatch.

   - Removal of an unused file which was forgotten when the rest of that
     functionality was removed"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/vdso: Remove unused file
  perf/x86/cstate: Fix possible Spectre-v1 indexing for pkg_msr
  perf/x86/msr: Fix possible Spectre-v1 indexing in the MSR driver
  perf/x86: Fix possible Spectre-v1 indexing for x86_pmu::event_map()
  perf/x86: Fix possible Spectre-v1 indexing for hw_perf_event cache_*
  perf/core: Fix possible Spectre-v1 indexing for ->aux_pages[]
  sched/autogroup: Fix possible Spectre-v1 indexing for sched_prio_to_weight[]
  sched/core: Fix possible Spectre-v1 indexing for sched_prio_to_weight[]
  sched/core: Introduce set_special_state()
  kthread, sched/wait: Fix kthread_parkme() completion issue
  kthread, sched/wait: Fix kthread_parkme() wait-loop
  sched/fair: Fix the update of blocked load when newly idle
  stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock
2018-05-13 10:53:08 -07:00
Linus Torvalds
86a4ac433b Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
 "Revert the new NUMA aware placement approach which turned out to
  create more problems than it solved"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Revert "sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine()"
2018-05-13 10:46:53 -07:00
Mel Gorman
789ba28013 Revert "sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine()"
This reverts commit 7347fc87df.

Srikar Dronamra pointed out that while the commit in question did show
a performance improvement on ppc64, it did so at the cost of disabling
active CPU migration by automatic NUMA balancing which was not the intent.
The issue was that a serious flaw in the logic failed to ever active balance
if SD_WAKE_AFFINE was disabled on scheduler domains. Even when it's enabled,
the logic is still bizarre and against the original intent.

Investigation showed that fixing the patch in either the way he suggested,
using the correct comparison for jiffies values or introducing a new
numa_migrate_deferred variable in task_struct all perform similarly to a
revert with a mix of gains and losses depending on the workload, machine
and socket count.

The original intent of the commit was to handle a problem whereby
wake_affine, idle balancing and automatic NUMA balancing disagree on the
appropriate placement for a task. This was particularly true for cases where
a single task was a massive waker of tasks but where wake_wide logic did
not apply.  This was particularly noticeable when a futex (a barrier) woke
all worker threads and tried pulling the wakees to the waker nodes. In that
specific case, it could be handled by tuning MPI or openMP appropriately,
but the behavior is not illogical and was worth attempting to fix. However,
the approach was wrong. Given that we're at rc4 and a fix is not obvious,
it's better to play safe, revert this commit and retry later.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: efault@gmx.de
Cc: ggherdovich@suse.cz
Cc: hpa@zytor.com
Cc: matt@codeblueprint.co.uk
Cc: mpe@ellerman.id.au
Link: http://lkml.kernel.org/r/20180509163115.6fnnyeg4vdm2ct4v@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-12 08:37:56 +02:00
Rafael J. Wysocki
97739501f2 cpufreq: schedutil: Avoid using invalid next_freq
If the next_freq field of struct sugov_policy is set to UINT_MAX,
it shouldn't be used for updating the CPU frequency (this is a
special "invalid" value), but after commit b7eaf1aab9 (cpufreq:
schedutil: Avoid reducing frequency of busy CPUs prematurely) it
may be passed as the new frequency to sugov_update_commit() in
sugov_update_single().

Fix that by adding an extra check for the special UINT_MAX value
of next_freq to sugov_update_single().

Fixes: b7eaf1aab9 (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely)
Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 4.12+ <stable@vger.kernel.org> # 4.12+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-09 12:21:17 +02:00
Juri Lelli
a744490f12 cpufreq: schedutil: remove stale comment
After commit 794a56ebd9 (sched/cpufreq: Change the worker kthread to
SCHED_DEADLINE) schedutil kthreads are "ignored" for a clock frequency
selection point of view, so the potential corner case for RT tasks is not
possible at all now.

Remove the stale comment mentioning it.

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-09 12:20:24 +02:00
Peter Zijlstra
354d779307 sched/autogroup: Fix possible Spectre-v1 indexing for sched_prio_to_weight[]
> kernel/sched/autogroup.c:230 proc_sched_autogroup_set_nice() warn: potential spectre issue 'sched_prio_to_weight'

Userspace controls @nice, sanitize the array index.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-05 08:34:42 +02:00
Peter Zijlstra
7281c8dec8 sched/core: Fix possible Spectre-v1 indexing for sched_prio_to_weight[]
> kernel/sched/core.c:6921 cpu_weight_nice_write_s64() warn: potential spectre issue 'sched_prio_to_weight'

Userspace controls @nice, so sanitize the value before using it to
index an array.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-05 08:32:36 +02:00
Rohit Jain
247f2f6f3c sched/core: Don't schedule threads on pre-empted vCPUs
In paravirt configurations today, spinlocks figure out whether a vCPU is
running to determine whether or not spinlock should bother spinning. We
can use the same logic to prioritize CPUs when scheduling threads. If a
vCPU has been pre-empted, it will incur the extra cost of VMENTER and
the time it actually spends to be running on the host CPU. If we had
other vCPUs which were actually running on the host CPU and idle we
should schedule threads there.

Performance numbers:

Note: With patch is referred to as Paravirt in the following and without
patch is referred to as Base.

1) When only 1 VM is running:

    a) Hackbench test on KVM 8 vCPUs, 10,000 loops (lower is better):

	+-------+-----------------+----------------+
	|Number |Paravirt         |Base            |
	|of     +---------+-------+-------+--------+
	|Threads|Average  |Std Dev|Average| Std Dev|
	+-------+---------+-------+-------+--------+
	|1      |1.817    |0.076  |1.721  | 0.067  |
	|2      |3.467    |0.120  |3.468  | 0.074  |
	|4      |6.266    |0.035  |6.314  | 0.068  |
	|8      |11.437   |0.105  |11.418 | 0.132  |
	|16     |21.862   |0.167  |22.161 | 0.129  |
	|25     |33.341   |0.326  |33.692 | 0.147  |
	+-------+---------+-------+-------+--------+

2) When two VMs are running with same CPU affinities:

    a) tbench test on VM 8 cpus

    Base:

	VM1:

	Throughput 220.59 MB/sec   1 clients  1 procs  max_latency=12.872 ms
	Throughput 448.716 MB/sec  2 clients  2 procs  max_latency=7.555 ms
	Throughput 861.009 MB/sec  4 clients  4 procs  max_latency=49.501 ms
	Throughput 1261.81 MB/sec  7 clients  7 procs  max_latency=76.990 ms

	VM2:

	Throughput 219.937 MB/sec  1 clients  1 procs  max_latency=12.517 ms
	Throughput 470.99 MB/sec   2 clients  2 procs  max_latency=12.419 ms
	Throughput 841.299 MB/sec  4 clients  4 procs  max_latency=37.043 ms
	Throughput 1240.78 MB/sec  7 clients  7 procs  max_latency=77.489 ms

    Paravirt:

	VM1:

	Throughput 222.572 MB/sec  1 clients  1 procs  max_latency=7.057 ms
	Throughput 485.993 MB/sec  2 clients  2 procs  max_latency=26.049 ms
	Throughput 947.095 MB/sec  4 clients  4 procs  max_latency=45.338 ms
	Throughput 1364.26 MB/sec  7 clients  7 procs  max_latency=145.124 ms

	VM2:

	Throughput 224.128 MB/sec  1 clients  1 procs  max_latency=4.564 ms
	Throughput 501.878 MB/sec  2 clients  2 procs  max_latency=11.061 ms
	Throughput 965.455 MB/sec  4 clients  4 procs  max_latency=45.370 ms
	Throughput 1359.08 MB/sec  7 clients  7 procs  max_latency=168.053 ms

    b) Hackbench with 4 fd 1,000,000 loops

	+-------+--------------------------------------+----------------------------------------+
	|Number |Paravirt                              |Base                                    |
	|of     +----------+--------+---------+--------+----------+--------+---------+----------+
	|Threads|Average1  |Std Dev1|Average2 | Std Dev|Average1  |Std Dev1|Average2 | Std Dev 2|
	+-------+----------+--------+---------+--------+----------+--------+---------+----------+
	|  1    | 3.748    | 0.620  | 3.576   | 0.432  | 4.006    | 0.395  | 3.446   | 0.787    |
	+-------+----------+--------+---------+--------+----------+--------+---------+----------+

    Note that this test was run just to show the interference effect
    over-subscription can have in baseline

    c) schbench results with 2 message groups on 8 vCPU VMs

	+-----------+-------+---------------+--------------+------------+
	|           |       | Paravirt      | Base         |            |
	+-----------+-------+-------+-------+-------+------+------------+
	|           |Threads| VM1   | VM2   |  VM1  | VM2  |%Improvement|
	+-----------+-------+-------+-------+-------+------+------------+
	|50.0000th  |    1  | 52    | 53    |  58   | 54   |  +6.25%    |
	|75.0000th  |    1  | 69    | 61    |  83   | 59   |  +8.45%    |
	|90.0000th  |    1  | 80    | 80    |  89   | 83   |  +6.98%    |
	|95.0000th  |    1  | 83    | 83    |  93   | 87   |  +7.78%    |
	|*99.0000th |    1  | 92    | 94    |  99   | 97   |  +5.10%    |
	|99.5000th  |    1  | 95    | 100   |  102  | 103  |  +4.88%    |
	|99.9000th  |    1  | 107   | 123   |  105  | 203  |  +25.32%   |
	+-----------+-------+-------+-------+-------+------+------------+
	|50.0000th  |    2  | 56    | 62    |  67   | 59   |  +6.35%    |
	|75.0000th  |    2  | 69    | 75    |  80   | 71   |  +4.64%    |
	|90.0000th  |    2  | 80    | 82    |  90   | 81   |  +5.26%    |
	|95.0000th  |    2  | 85    | 87    |  97   | 91   |  +8.51%    |
	|*99.0000th |    2  | 98    | 99    |  107  | 109  |  +8.79%    |
	|99.5000th  |    2  | 107   | 105   |  109  | 116  |  +5.78%    |
	|99.9000th  |    2  | 9968  | 609   |  875  | 3116 | -165.02%   |
	+-----------+-------+-------+-------+-------+------+------------+
	|50.0000th  |    4  | 78    | 77    |  78   | 79   |  +1.27%    |
	|75.0000th  |    4  | 98    | 106   |  100  | 104  |   0.00%    |
	|90.0000th  |    4  | 987   | 1001  |  995  | 1015 |  +1.09%    |
	|95.0000th  |    4  | 4136  | 5368  |  5752 | 5192 |  +13.16%   |
	|*99.0000th |    4  | 11632 | 11344 |  11024| 10736|  -5.59%    |
	|99.5000th  |    4  | 12624 | 13040 |  12720| 12144|  -3.22%    |
	|99.9000th  |    4  | 13168 | 18912 |  14992| 17824|  +2.24%    |
	+-----------+-------+-------+-------+-------+------+------------+

    Note: Improvement is measured for (VM1+VM2)

Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dhaval.giani@oracle.com
Cc: matt@codeblueprint.co.uk
Cc: steven.sistare@oracle.com
Cc: subhra.mazumdar@oracle.com
Link: http://lkml.kernel.org/r/1525294330-7759-1-git-send-email-rohit.k.jain@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-04 10:00:09 +02:00
Viresh Kumar
c976a862ba sched/fair: Avoid calling sync_entity_load_avg() unnecessarily
Call sync_entity_load_avg() directly from find_idlest_cpu() instead of
select_task_rq_fair(), as that's where we need to use task's utilization
value. And call sync_entity_load_avg() only after making sure sched
domain spans over one of the allowed CPUs for the task.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/cd019d1753824c81130eae7b43e2bbcec47cc1ad.1524738578.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-04 10:00:08 +02:00
Viresh Kumar
f1d88b4468 sched/fair: Rearrange select_task_rq_fair() to optimize it
Rearrange select_task_rq_fair() a bit to avoid executing some
conditional statements in few specific code-paths. That gets rid of the
goto as well.

This shouldn't result in any functional changes.

Tested-by: Rohit Jain <rohit.k.jain@oracle.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20831b8d237bf3a20e4e328286f678b425ff04c9.1524738578.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-04 10:00:07 +02:00
Peter Zijlstra
b5bf9a90bb sched/core: Introduce set_special_state()
Gaurav reported a perceived problem with TASK_PARKED, which turned out
to be a broken wait-loop pattern in __kthread_parkme(), but the
reported issue can (and does) in fact happen for states that do not do
condition based sleeps.

When the 'current->state = TASK_RUNNING' store of a previous
(concurrent) try_to_wake_up() collides with the setting of a 'special'
sleep state, we can loose the sleep state.

Normal condition based wait-loops are immune to this problem, but for
sleep states that are not condition based are subject to this problem.

There already is a fix for TASK_DEAD. Abstract that and also apply it
to TASK_STOPPED and TASK_TRACED, both of which are also without
condition based wait-loop.

Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-04 07:54:54 +02:00
Peter Zijlstra
85f1abe001 kthread, sched/wait: Fix kthread_parkme() completion issue
Even with the wait-loop fixed, there is a further issue with
kthread_parkme(). Upon hotplug, when we do takedown_cpu(),
smpboot_park_threads() can return before all those threads are in fact
blocked, due to the placement of the complete() in __kthread_parkme().

When that happens, sched_cpu_dying() -> migrate_tasks() can end up
migrating such a still runnable task onto another CPU.

Normally the task will have hit schedule() and gone to sleep by the
time we do kthread_unpark(), which will then do __kthread_bind() to
re-bind the task to the correct CPU.

However, when we loose the initial TASK_PARKED store to the concurrent
wakeup issue described previously, do the complete(), get migrated, it
is possible to either:

 - observe kthread_unpark()'s clearing of SHOULD_PARK and terminate
   the park and set TASK_RUNNING, or

 - __kthread_bind()'s wait_task_inactive() to observe the competing
   TASK_RUNNING store.

Either way the WARN() in __kthread_bind() will trigger and fail to
correctly set the CPU affinity.

Fix this by only issuing the complete() when the kthread has scheduled
out. This does away with all the icky 'still running' nonsense.

The alternative is to promote TASK_PARKED to a special state, this
guarantees wait_task_inactive() cannot observe a 'stale' TASK_RUNNING
and we'll end up doing the right thing, but this preserves the whole
icky business of potentially migating the still runnable thing.

Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-03 07:38:05 +02:00
Vincent Guittot
457be908c8 sched/fair: Fix the update of blocked load when newly idle
With commit:

  31e77c93e4 ("sched/fair: Update blocked load when newly idle")

... we release the rq->lock when updating blocked load of idle CPUs.

This opens a time window during which another CPU can add a task to this
CPU's cfs_rq.

The check for newly added task of idle_balance() is not in the common path.
Move the out label to include this check.

Reported-by: Heiner Kallweit <hkallweit1@gmail.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 31e77c93e4 ("sched/fair: Update blocked load when newly idle")
Link: http://lkml.kernel.org/r/20180426103133.GA6953@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-03 07:38:03 +02:00
Linus Torvalds
71b8ebbf3d Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
 "A few scheduler fixes:

   - Prevent a bogus warning vs. runqueue clock update flags in
     do_sched_rt_period_timer()

   - Simplify the helper functions which handle requests for skipping
     the runqueue clock updat.

   - Do not unlock the tunables mutex in the error path of the cpu
     frequency scheduler utils. Its not held.

   - Enforce proper alignement for 'struct util_est' in sched_avg to
     prevent a misalignment fault on IA64"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Force proper alignment of 'struct util_est'
  sched/core: Simplify helpers for rq clock update skip requests
  sched/rt: Fix rq->clock_update_flags < RQCF_ACT_SKIP warning
  sched/cpufreq/schedutil: Fix error path mutex unlock
2018-04-15 12:43:30 -07:00
Linus Torvalds
1fe43114ea More power management updates for 4.17-rc1
- Rework the idle loop in order to prevent CPUs from spending too
    much time in shallow idle states by making it stop the scheduler
    tick before putting the CPU into an idle state only if the idle
    duration predicted by the idle governor is long enough.  That
    required the code to be reordered to invoke the idle governor
    before stopping the tick, among other things (Rafael Wysocki,
    Frederic Weisbecker, Arnd Bergmann).
 
  - Add the missing description of the residency sysfs attribute to
    the cpuidle documentation (Prashanth Prakash).
 
  - Finalize the cpufreq cleanup moving frequency table validation
    from drivers to the core (Viresh Kumar).
 
  - Fix a clock leak regression in the armada-37xx cpufreq driver
    (Gregory Clement).
 
  - Fix the initialization of the CPU performance data structures
    for shared policies in the CPPC cpufreq driver (Shunyong Yang).
 
  - Clean up the ti-cpufreq, intel_pstate and CPPC cpufreq drivers
    a bit (Viresh Kumar, Rafael Wysocki).
 
  - Mark the expected switch fall-throughs in the PM QoS core (Gustavo
    Silva).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJazfv7AAoJEILEb/54YlRx/kYP+gPOX5O5cFF22Y2xvDHPMWjm
 D/3Nc2aRo+5DuHHECSIJ3ZVQzVoamN5zQ1KbsBRV0bJgwim4fw4M199Jr/0I2nES
 1pkByuxLrAtwb83uX3uBIQnwgKOAwRftOTeVaFaMoXgIbyUqK7ZFkGq0xQTnKqor
 6+J+78O7wMaIZ0YXQP98BC6g96vs/f+ICrh7qqY85r4NtO/thTA1IKevBmlFeIWR
 yVhEYgwSFBaWehKK8KgbshmBBEk3qzDOYfwZF/JprPhiN/6madgHgYjHC8Seok5c
 QUUTRlyO1ULTQe4JulyJUKobx7HE9u/FXC0RjbBiKPnYR4tb9Hd8OpajPRZo96AT
 8IQCdzL2Iw/ZyQsmQZsWeO1HwPTwVlF/TO2gf6VdQtH221izuHG025p8/RcZe6zb
 fTTFhh6/tmBvmOlbKMwxaLbGbwcj/5W5GvQXlXAtaElLobwwNEcEyVfF4jo4Zx/U
 DQc7agaAps67lcgFAqNDy0PoU6bxV7yoiAIlTJHO9uyPkDNyIfb0ZPlmdIi3xYZd
 tUD7C+VBezrNCkw7JWL1xXLFfJ5X7K6x5bi9I7TBj1l928Hak0dwzs7KlcNBtF1Y
 SwnJsNa3kxunGsPajya8dy5gdO0aFeB9Bse0G429+ugk2IJO/Q9M9nQUArJiC9Xl
 Gw1bw5Ynv6lx+r5EqxHa
 =Pnk4
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull more power management updates from Rafael Wysocki:
 "These include one big-ticket item which is the rework of the idle loop
  in order to prevent CPUs from spending too much time in shallow idle
  states. It reduces idle power on some systems by 10% or more and may
  improve performance of workloads in which the idle loop overhead
  matters. This has been in the works for several weeks and it has been
  tested and reviewed quite thoroughly.

  Also included are changes that finalize the cpufreq cleanup moving
  frequency table validation from drivers to the core, a few fixes and
  cleanups of cpufreq drivers, a cpuidle documentation update and a PM
  QoS core update to mark the expected switch fall-throughs in it.

  Specifics:

   - Rework the idle loop in order to prevent CPUs from spending too
     much time in shallow idle states by making it stop the scheduler
     tick before putting the CPU into an idle state only if the idle
     duration predicted by the idle governor is long enough.

     That required the code to be reordered to invoke the idle governor
     before stopping the tick, among other things (Rafael Wysocki,
     Frederic Weisbecker, Arnd Bergmann).

   - Add the missing description of the residency sysfs attribute to the
     cpuidle documentation (Prashanth Prakash).

   - Finalize the cpufreq cleanup moving frequency table validation from
     drivers to the core (Viresh Kumar).

   - Fix a clock leak regression in the armada-37xx cpufreq driver
     (Gregory Clement).

   - Fix the initialization of the CPU performance data structures for
     shared policies in the CPPC cpufreq driver (Shunyong Yang).

   - Clean up the ti-cpufreq, intel_pstate and CPPC cpufreq drivers a
     bit (Viresh Kumar, Rafael Wysocki).

   - Mark the expected switch fall-throughs in the PM QoS core (Gustavo
     Silva)"

* tag 'pm-4.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
  tick-sched: avoid a maybe-uninitialized warning
  cpufreq: Drop cpufreq_table_validate_and_show()
  cpufreq: SCMI: Don't validate the frequency table twice
  cpufreq: CPPC: Initialize shared perf capabilities of CPUs
  cpufreq: armada-37xx: Fix clock leak
  cpufreq: CPPC: Don't set transition_latency
  cpufreq: ti-cpufreq: Use builtin_platform_driver()
  cpufreq: intel_pstate: Do not include debugfs.h
  PM / QoS: mark expected switch fall-throughs
  cpuidle: Add definition of residency to sysfs documentation
  time: hrtimer: Use timerqueue_iterate_next() to get to the next timer
  nohz: Avoid duplication of code related to got_idle_tick
  nohz: Gather tick_sched booleans under a common flag field
  cpuidle: menu: Avoid selecting shallow states with stopped tick
  cpuidle: menu: Refine idle state selection for running tick
  sched: idle: Select idle state before stopping the tick
  time: hrtimer: Introduce hrtimer_next_event_without()
  time: tick-sched: Split tick_nohz_stop_sched_tick()
  cpuidle: Return nohz hint from cpuidle_select()
  jiffies: Introduce USER_TICK_USEC and redefine TICK_USEC
  ...
2018-04-11 17:03:20 -07:00
Rafael J. Wysocki
554c8aa8ec sched: idle: Select idle state before stopping the tick
In order to address the issue with short idle duration predictions
by the idle governor after the scheduler tick has been stopped,
reorder the code in cpuidle_idle_call() so that the governor idle
state selection runs before tick_nohz_idle_go_idle() and use the
"nohz" hint returned by cpuidle_select() to decide whether or not
to stop the tick.

This isn't straightforward, because menu_select() invokes
tick_nohz_get_sleep_length() to get the time to the next timer
event and the number returned by the latter comes from
__tick_nohz_idle_stop_tick().  Fortunately, however, it is possible
to compute that number without actually stopping the tick and with
the help of the existing code.

Namely, tick_nohz_get_sleep_length() can be made call
tick_nohz_next_event(), introduced earlier, to get the time to the
next non-highres timer event.  If that happens, tick_nohz_next_event()
need not be called by __tick_nohz_idle_stop_tick() again.

If it turns out that the scheduler tick cannot be stopped going
forward or the next timer event is too close for the tick to be
stopped, tick_nohz_get_sleep_length() can simply return the time to
the next event currently programmed into the corresponding clock
event device.

In addition to knowing the return value of tick_nohz_next_event(),
however, tick_nohz_get_sleep_length() needs to know the time to the
next highres timer event, but with the scheduler tick timer excluded,
which can be computed with the help of hrtimer_get_next_event().

That minimum of that number and the tick_nohz_next_event() return
value is the total time to the next timer event with the assumption
that the tick will be stopped.  It can be returned to the idle
governor which can use it for predicting idle duration (under the
assumption that the tick will be stopped) and deciding whether or
not it makes sense to stop the tick before putting the CPU into the
selected idle state.

With the above, the sleep_length field in struct tick_sched is not
necessary any more, so drop it.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=199227
Reported-by: Doug Smythies <dsmythies@telus.net>
Reported-by: Thomas Ilsche <thomas.ilsche@tu-dresden.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
2018-04-09 11:54:07 +02:00
Rafael J. Wysocki
45f1ff59e2 cpuidle: Return nohz hint from cpuidle_select()
Add a new pointer argument to cpuidle_select() and to the ->select
cpuidle governor callback to allow a boolean value indicating
whether or not the tick should be stopped before entering the
selected state to be returned from there.

Make the ladder governor ignore that pointer (to preserve its
current behavior) and make the menu governor return 'false" through
it if:
 (1) the idle exit latency is constrained at 0, or
 (2) the selected state is a polling one, or
 (3) the expected idle period duration is within the tick period
     range.

In addition to that, the correction factor computations in the menu
governor need to take the possibility that the tick may not be
stopped into account to avoid artificially small correction factor
values.  To that end, add a mechanism to record tick wakeups, as
suggested by Peter Zijlstra, and use it to modify the menu_update()
behavior when tick wakeup occurs.  Namely, if the CPU is woken up by
the tick and the return value of tick_nohz_get_sleep_length() is not
within the tick boundary, the predicted idle duration is likely too
short, so make menu_update() try to compensate for that by updating
the governor statistics as though the CPU was idle for a long time.

Since the value returned through the new argument pointer of
cpuidle_select() is not used by its caller yet, this change by
itself is not expected to alter the functionality of the code.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2018-04-06 09:29:34 +02:00
Mark Rutland
3eda69c92d kernel/fork.c: detect early free of a live mm
KASAN splats indicate that in some cases we free a live mm, then
continue to access it, with potentially disastrous results.  This is
likely due to a mismatched mmdrop() somewhere in the kernel, but so far
the culprit remains elusive.

Let's have __mmdrop() verify that the mm isn't live for the current
task, similar to the existing check for init_mm.  This way, we can catch
this class of issue earlier, and without requiring KASAN.

Currently, idle_task_exit() leaves active_mm stale after it switches to
init_mm.  This isn't harmful, but will trigger the new assertions, so we
must adjust idle_task_exit() to update active_mm.

Link: http://lkml.kernel.org/r/20180312140103.19235-1-mark.rutland@arm.com
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:27 -07:00
Rafael J. Wysocki
ed98c34919 sched: idle: Do not stop the tick before cpuidle_idle_call()
Make cpuidle_idle_call() decide whether or not to stop the tick.

First, the cpuidle_enter_s2idle() path deals with the tick (and with
the entire timekeeping for that matter) by itself and it doesn't need
the tick to be stopped beforehand.

Second, to address the issue with short idle duration predictions
by the idle governor after the tick has been stopped, it will be
necessary to change the ordering of cpuidle_select() with respect
to tick_nohz_idle_stop_tick().  To prepare for that, put a
tick_nohz_idle_stop_tick() call in the same branch in which
cpuidle_select() is called.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2018-04-05 19:01:29 +02:00
Rafael J. Wysocki
2aaf709a51 sched: idle: Do not stop the tick upfront in the idle loop
Push the decision whether or not to stop the tick somewhat deeper
into the idle loop.

Stopping the tick upfront leads to unpleasant outcomes in case the
idle governor doesn't agree with the nohz code on the duration of the
upcoming idle period.  Specifically, if the tick has been stopped and
the idle governor predicts short idle, the situation is bad regardless
of whether or not the prediction is accurate.  If it is accurate, the
tick has been stopped unnecessarily which means excessive overhead.
If it is not accurate, the CPU is likely to spend too much time in
the (shallow, because short idle has been predicted) idle state
selected by the governor [1].

As the first step towards addressing this problem, change the code
to make the tick stopping decision inside of the loop in do_idle().
In particular, do not stop the tick in the cpu_idle_poll() code path.
Also don't do that in tick_nohz_irq_exit() which doesn't really have
enough information on whether or not to stop the tick.

Link: https://marc.info/?l=linux-pm&m=150116085925208&w=2 # [1]
Link: https://tu-dresden.de/zih/forschung/ressourcen/dateien/projekte/haec/powernightmares.pdf
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2018-04-05 19:01:14 +02:00
Rafael J. Wysocki
0e7767687f time: tick-sched: Reorganize idle tick management code
Prepare the scheduler tick code for reworking the idle loop to
avoid stopping the tick in some cases.

The idea is to split the nohz idle entry call to decouple the idle
time stats accounting and preparatory work from the actual tick stop
code, in order to later be able to delay the tick stop once we reach
more power-knowledgeable callers.

Move away the tick_nohz_start_idle() invocation from
__tick_nohz_idle_enter(), rename the latter to
__tick_nohz_idle_stop_tick() and define tick_nohz_idle_stop_tick()
as a wrapper around it for calling it from the outside.

Make tick_nohz_idle_enter() only call tick_nohz_start_idle() instead
of calling the entire __tick_nohz_idle_enter(), add another wrapper
disabling and enabling interrupts around tick_nohz_idle_stop_tick()
and make the current callers of tick_nohz_idle_enter() call it too
to retain their current functionality.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2018-04-05 18:58:47 +02:00
Davidlohr Bueso
adcc8da885 sched/core: Simplify helpers for rq clock update skip requests
By renaming the functions we can get rid of the skip parameter
and have better code redability. It makes zero sense to have
things such as:

  rq_clock_skip_update(rq, false)

When the skip request is in fact not going to happen. Ever. Rename
things such that we end up with:

  rq_clock_skip_update(rq)
  rq_clock_cancel_skipupdate(rq)

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: matt@codeblueprint.co.uk
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20180404161539.nhadkff2aats74jh@linux-n805
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-05 09:20:46 +02:00
Davidlohr Bueso
d29a20645d sched/rt: Fix rq->clock_update_flags < RQCF_ACT_SKIP warning
While running rt-tests' pi_stress program I got the following splat:

  rq->clock_update_flags < RQCF_ACT_SKIP
  WARNING: CPU: 27 PID: 0 at kernel/sched/sched.h:960 assert_clock_updated.isra.38.part.39+0x13/0x20

  [...]

  <IRQ>
  enqueue_top_rt_rq+0xf4/0x150
  ? cpufreq_dbs_governor_start+0x170/0x170
  sched_rt_rq_enqueue+0x65/0x80
  sched_rt_period_timer+0x156/0x360
  ? sched_rt_rq_enqueue+0x80/0x80
  __hrtimer_run_queues+0xfa/0x260
  hrtimer_interrupt+0xcb/0x220
  smp_apic_timer_interrupt+0x62/0x120
  apic_timer_interrupt+0xf/0x20
  </IRQ>

  [...]

  do_idle+0x183/0x1e0
  cpu_startup_entry+0x5f/0x70
  start_secondary+0x192/0x1d0
  secondary_startup_64+0xa5/0xb0

We can get rid of it be the "traditional" means of adding an
update_rq_clock() call after acquiring the rq->lock in
do_sched_rt_period_timer().

The case for the RT task throttling (which this workload also hits)
can be ignored in that the skip_update call is actually bogus and
quite the contrary (the request bits are removed/reverted).

By setting RQCF_UPDATED we really don't care if the skip is happening
or not and will therefore make the assert_clock_updated() check happy.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Cc: linux-kernel@vger.kernel.org
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20180402164954.16255-1-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-05 09:20:46 +02:00
Ingo Molnar
ea2a6af517 Merge branch 'linus' into sched/urgent, to pick up fixes and updates
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-05 09:20:34 +02:00
Linus Torvalds
642e7fd233 Merge branch 'syscalls-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux
Pull removal of in-kernel calls to syscalls from Dominik Brodowski:
 "System calls are interaction points between userspace and the kernel.
  Therefore, system call functions such as sys_xyzzy() or
  compat_sys_xyzzy() should only be called from userspace via the
  syscall table, but not from elsewhere in the kernel.

  At least on 64-bit x86, it will likely be a hard requirement from
  v4.17 onwards to not call system call functions in the kernel: It is
  better to use use a different calling convention for system calls
  there, where struct pt_regs is decoded on-the-fly in a syscall wrapper
  which then hands processing over to the actual syscall function. This
  means that only those parameters which are actually needed for a
  specific syscall are passed on during syscall entry, instead of
  filling in six CPU registers with random user space content all the
  time (which may cause serious trouble down the call chain). Those
  x86-specific patches will be pushed through the x86 tree in the near
  future.

  Moreover, rules on how data may be accessed may differ between kernel
  data and user data. This is another reason why calling sys_xyzzy() is
  generally a bad idea, and -- at most -- acceptable in arch-specific
  code.

  This patchset removes all in-kernel calls to syscall functions in the
  kernel with the exception of arch/. On top of this, it cleans up the
  three places where many syscalls are referenced or prototyped, namely
  kernel/sys_ni.c, include/linux/syscalls.h and include/linux/compat.h"

* 'syscalls-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux: (109 commits)
  bpf: whitelist all syscalls for error injection
  kernel/sys_ni: remove {sys_,sys_compat} from cond_syscall definitions
  kernel/sys_ni: sort cond_syscall() entries
  syscalls/x86: auto-create compat_sys_*() prototypes
  syscalls: sort syscall prototypes in include/linux/compat.h
  net: remove compat_sys_*() prototypes from net/compat.h
  syscalls: sort syscall prototypes in include/linux/syscalls.h
  kexec: move sys_kexec_load() prototype to syscalls.h
  x86/sigreturn: use SYSCALL_DEFINE0
  x86: fix sys_sigreturn() return type to be long, not unsigned long
  x86/ioport: add ksys_ioperm() helper; remove in-kernel calls to sys_ioperm()
  mm: add ksys_readahead() helper; remove in-kernel calls to sys_readahead()
  mm: add ksys_mmap_pgoff() helper; remove in-kernel calls to sys_mmap_pgoff()
  mm: add ksys_fadvise64_64() helper; remove in-kernel call to sys_fadvise64_64()
  fs: add ksys_fallocate() wrapper; remove in-kernel calls to sys_fallocate()
  fs: add ksys_p{read,write}64() helpers; remove in-kernel calls to syscalls
  fs: add ksys_truncate() wrapper; remove in-kernel calls to sys_truncate()
  fs: add ksys_sync_file_range helper(); remove in-kernel calls to syscall
  kernel: add ksys_setsid() helper; remove in-kernel call to sys_setsid()
  kernel: add ksys_unshare() helper; remove in-kernel calls to sys_unshare()
  ...
2018-04-02 21:22:12 -07:00
Linus Torvalds
ce6eba3dba Merge branch 'sched-wait-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull wait_var_event updates from Ingo Molnar:
 "This introduces the new wait_var_event() API, which is a more flexible
  waiting primitive than wait_on_atomic_t().

  All wait_on_atomic_t() users are migrated over to the new API and
  wait_on_atomic_t() is removed. The migration fixes one bug and should
  result in no functional changes for the other usecases"

* 'sched-wait-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/wait: Improve __var_waitqueue() code generation
  sched/wait: Remove the wait_on_atomic_t() API
  sched/wait, arch/mips: Fix and convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, fs/ocfs2: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, fs/nfs: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, fs/fscache: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, fs/btrfs: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, fs/afs: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, drivers/media: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait, drivers/drm: Convert wait_on_atomic_t() usage to the new wait_var_event() API
  sched/wait: Introduce wait_var_event()
2018-04-02 16:50:39 -07:00
Linus Torvalds
46e0d28bdb Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main scheduler changes in this cycle were:

   - NUMA balancing improvements (Mel Gorman)

   - Further load tracking improvements (Patrick Bellasi)

   - Various NOHZ balancing cleanups and optimizations (Peter Zijlstra)

   - Improve blocked load handling, in particular we can now reduce and
     eventually stop periodic load updates on 'very idle' CPUs. (Vincent
     Guittot)

   - On isolated CPUs offload the final 1Hz scheduler tick as well, plus
     related cleanups and reorganization. (Frederic Weisbecker)

   - Core scheduler code cleanups (Ingo Molnar)"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
  sched/core: Update preempt_notifier_key to modern API
  sched/cpufreq: Rate limits for SCHED_DEADLINE
  sched/fair: Update util_est only on util_avg updates
  sched/cpufreq/schedutil: Use util_est for OPP selection
  sched/fair: Use util_est in LB and WU paths
  sched/fair: Add util_est on top of PELT
  sched/core: Remove TASK_ALL
  sched/completions: Use bool in try_wait_for_completion()
  sched/fair: Update blocked load when newly idle
  sched/fair: Move idle_balance()
  sched/nohz: Merge CONFIG_NO_HZ_COMMON blocks
  sched/fair: Move rebalance_domains()
  sched/nohz: Optimize nohz_idle_balance()
  sched/fair: Reduce the periodic update duration
  sched/nohz: Stop NOHZ stats when decayed
  sched/cpufreq: Provide migration hint
  sched/nohz: Clean up nohz enter/exit
  sched/fair: Update blocked load from NEWIDLE
  sched/fair: Add NOHZ stats balancing
  sched/fair: Restructure nohz_balance_kick()
  ...
2018-04-02 11:49:41 -07:00
Dominik Brodowski
7d4dd4f159 sched: add do_sched_yield() helper; remove in-kernel call to sched_yield()
Using the sched-internal do_sched_yield() helper allows us to get rid of
the sched-internal call to the sys_sched_yield() syscall.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Ingo Molnar <mingo@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02 20:15:31 +02:00
Jules Maselbas
1b5d43cfb6 sched/cpufreq/schedutil: Fix error path mutex unlock
This patch prevents the 'global_tunables_lock' mutex from being
unlocked before being locked.  This mutex is not locked if the
sugov_kthread_create() function fails.

Signed-off-by: Jules Maselbas <jules.maselbas@arm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chris Redpath <chris.redpath@arm.com>
Cc: Dietmar Eggermann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Stephen Kyle <stephen.kyle@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: nd@arm.com
Link: http://lkml.kernel.org/r/20180329144301.38419-1-jules.maselbas@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-31 20:42:38 +02:00
Davidlohr Bueso
b720342849 sched/core: Update preempt_notifier_key to modern API
No changes in refcount semantics, use DEFINE_STATIC_KEY_FALSE()
for initialization and replace:

  static_key_slow_inc|dec()   =>   static_branch_inc|dec()
  static_key_false()          =>   static_branch_unlikely()

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Link: http://lkml.kernel.org/r/20180326210929.5244-4-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-27 07:51:45 +02:00
Linus Torvalds
bf45bae961 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Two sched debug output related fixes: a console output fix and
  formatting fixes"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/debug: Adjust newlines for better alignment
  sched/debug: Fix per-task line continuation for console output
2018-03-25 07:33:30 -10:00
Claudio Scordino
e97a90f706 sched/cpufreq: Rate limits for SCHED_DEADLINE
When the SCHED_DEADLINE scheduling class increases the CPU utilization, it
should not wait for the rate limit, otherwise it may miss some deadline.

Tests using rt-app on Exynos5422 with up to 10 SCHED_DEADLINE tasks have
shown reductions of even 10% of deadline misses with a negligible
increase of energy consumption (measured through Baylibre Cape).

Signed-off-by: Claudio Scordino <claudio@evidence.eu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: linux-pm@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Todd Kjos <tkjos@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/1520937340-2755-1-git-send-email-claudio@evidence.eu.com
2018-03-23 22:48:22 +01:00
Joe Lawrence
e9ca267096 sched/debug: Adjust newlines for better alignment
Scheduler debug stats include newlines that display out of alignment
when prefixed by timestamps.  For example, the dmesg utility:

  % echo t > /proc/sysrq-trigger
  % dmesg
  ...
  [   83.124251]
  runnable tasks:
   S           task   PID         tree-key  switches  prio     wait-time
  sum-exec        sum-sleep
  -----------------------------------------------------------------------------------------------------------

At the same time, some syslog utilities (like rsyslog by default) don't
like the additional newlines control characters, saving lines like this
to /var/log/messages:

  Mar 16 16:02:29 localhost kernel: #012runnable tasks:#012 S           task   PID         tree-key ...
                                    ^^^^               ^^^^
Clean these up by moving newline characters to their own SEQ_printf
invocation.  This leaves the /proc/sched_debug unchanged, but brings the
entire output into alignment when prefixed:

  % echo t > /proc/sysrq-trigger
  % dmesg
  ...
  [   62.410368] runnable tasks:
  [   62.410368]  S           task   PID         tree-key  switches  prio     wait-time             sum-exec        sum-sleep
  [   62.410369] -----------------------------------------------------------------------------------------------------------
  [   62.410369]  I  kworker/u12:0     5      1932.215593       332   120         0.000000         3.621252         0.000000 0 0 /

and no escaped control characters from rsyslog in /var/log/messages:

  Mar 16 16:15:06 localhost kernel: runnable tasks:
  Mar 16 16:15:06 localhost kernel: S           task   PID         tree-key  ...

Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1521484555-8620-3-git-send-email-joe.lawrence@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 09:30:09 +01:00
Joe Lawrence
a8c024cd9b sched/debug: Fix per-task line continuation for console output
When the SEQ_printf() macro prints to the console, it runs a simple
printk() without KERN_CONT "continued" line printing.  The result of
this is oddly wrapped task info, for example:

  % echo t > /proc/sysrq-trigger
  % dmesg
  ...
  runnable tasks:
  ...
  [   29.608611]  I
  [   29.608613]       rcu_sched     8      3252.013846      4087   120
  [   29.608614]         0.000000        29.090111         0.000000
  [   29.608615]  0 0
  [   29.608616]  /

Modify SEQ_printf to use pr_cont() for expected one-line results:

  % echo t > /proc/sysrq-trigger
  % dmesg
  ...
  runnable tasks:
  ...
  [  106.716329]  S        cpuhp/5    37      2006.315026        14   120         0.000000         0.496893         0.000000 0 0 /

Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1521484555-8620-2-git-send-email-joe.lawrence@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 09:30:09 +01:00
Peter Zijlstra
b3fc5c9bb3 sched/wait: Improve __var_waitqueue() code generation
Since we fixed hash_64() to not suck there is no need to play games to
attempt to improve the hash value on 64-bit.

Also, since we don't use the bit value for the variables, use hash_ptr()
directly.

No change in functionality.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 08:23:25 +01:00