Commit graph

16949 commits

Author SHA1 Message Date
Tejun Heo
6612f05b88 cgroup: unify pidlist and other file handling
In preparation of conversion to kernfs, cgroup file handling is
updated so that it can be easily mapped to kernfs.  With the previous
changes, the difference between pidlist and other files are very
small.  Both are served by seq_file in a pretty standard way with the
only difference being !pidlist files use single_open().

This patch adds cftype->seq_start(), ->seq_next and ->seq_stop() and
implements the matching cgroup_seqfile_start/next/stop() which either
emulates single_open() behavior or invokes cftype->seq_*() operations
if specified.  This allows using single seq_operations for both
pidlist and other files and makes cgroup_pidlist_operations and
cgorup_pidlist_open() no longer necessary.  As cgroup_pidlist_open()
was the only user of cftype->open(), the method is dropped together.

This brings cftype file interface very close to kernfs interface and
mapping shouldn't be too difficult.  Once converted to kernfs, most of
the plumbing code including cgroup_seqfile_*() will be removed as
kernfs provides those facilities.

This patch does not introduce any behavior changes.

v2: Refreshed on top of the updated "cgroup: introduce struct
    cgroup_pidlist_open_file".

v3: Refreshed on top of the updated "cgroup: attach cgroup_open_file
    to all cgroup files".

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-12-05 12:28:04 -05:00
Tejun Heo
2da8ca822d cgroup: replace cftype->read_seq_string() with cftype->seq_show()
In preparation of conversion to kernfs, cgroup file handling is
updated so that it can be easily mapped to kernfs.  This patch
replaces cftype->read_seq_string() with cftype->seq_show() which is
not limited to single_open() operation and will map directcly to
kernfs seq_file interface.

The conversions are mechanical.  As ->seq_show() doesn't have @css and
@cft, the functions which make use of them are converted to use
seq_css() and seq_cft() respectively.  In several occassions, e.f. if
it has seq_string in its name, the function name is updated to fit the
new method better.

This patch does not introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Aristeu Rozanski <arozansk@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
2013-12-05 12:28:04 -05:00
Tejun Heo
7da1127927 cgroup: attach cgroup_open_file to all cgroup files
In preparation of conversion to kernfs, cgroup file handling is
updated so that it can be easily mapped to kernfs.  This patch
attaches cgroup_open_file, which used to be attached to pidlist files,
to all cgroup files, introduces seq_css/cft() accessors to determine
the cgroup_subsys_state and cftype associated with a given cgroup
seq_file, exports them as public interface.

This doesn't cause any behavior changes but unifies cgroup file
handling across different file types and will help converting them to
kernfs seq_show() interface.

v2: Li pointed out that the original patch was using
    single_open_size() incorrectly assuming that the size param is
    private data size.  Fix it by allocating @of separately and
    passing it to single_open() and explicitly freeing it in the
    release path.  This isn't the prettiest but this path is gonna be
    restructured by the following patches pretty soon.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-12-05 12:28:04 -05:00
Tejun Heo
5d22444f42 cgroup: generalize cgroup_pidlist_open_file
In preparation of conversion to kernfs, cgroup file handling is
updated so that it can be easily mapped to kernfs.  This patch renames
cgroup_pidlist_open_file to cgroup_open_file and updates it so that it
only contains a field to identify the specific file, ->cfe, and an
opaque ->priv pointer.  When cgroup is converted to kernfs, this will
be replaced by kernfs_open_file which contains about the same
information.

As whether the file is "cgroup.procs" or "tasks" should now be
determined from cgroup_open_file->cfe, the cftype->private for the two
files now carry the file type and cgroup_pidlist_start() reads the
type through cfe->type->private.  This makes the distinction between
cgroup_tasks_open() and cgroup_procs_open() unnecessary.
cgroup_pidlist_open() is now directly used as the open method.

This patch doesn't make any behavior changes.

v2: Refreshed on top of the updated "cgroup: introduce struct
    cgroup_pidlist_open_file".

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-12-05 12:28:04 -05:00
Tejun Heo
896f519963 cgroup: unify read path so that seq_file is always used
With the recent removal of cftype->read() and ->read_map(), only three
operations are remaining, ->read_u64(), ->read_s64() and
->read_seq_string().  Currently, the first two are handled directly
while the last is handled through seq_file.

It is trivial to serve the first two through the seq_file path too.
This patch restructures read path so that all operations are served
through cgroup_seqfile_show().  This makes all cgroup files seq_file -
single_open/release() are now used by default,
cgroup_seqfile_operations is dropped, and cgroup_file_operations uses
seq_read() for read.

This simplifies the code and makes the read path easy to convert to
use kernfs.

Note that, while cgroup_file_operations uses seq_read() for read, it
still uses generic_file_llseek() for seeking instead of seq_lseek().
This is different from cgroup_seqfile_operations but shouldn't break
anything and brings the seeking behavior aligned with kernfs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-12-05 12:28:04 -05:00
Tejun Heo
a742c59de6 cgroup: unify cgroup_write_X64() and cgroup_write_string()
cgroup_write_X64() and cgroup_write_string() both implement about the
same buffering logic.  Unify the two into cgroup_file_write() which
always allocates dynamic buffer for simplicity and uses kstrto*()
instead of simple_strto*().

This patch doesn't make any visible behavior changes except for
possibly different error value from kstrsto*().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-12-05 12:28:03 -05:00
Tejun Heo
6e0755b08d cgroup: remove cftype->read(), ->read_map() and ->write()
In preparation of conversion to kernfs, cgroup file handling is being
consolidated so that it can be easily mapped to the seq_file based
interface of kernfs.

After recent updates, ->read() and ->read_map() don't have any user
left and ->write() never had any user.  Remove them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-12-05 12:28:03 -05:00
Tejun Heo
51ffe41178 cpuset: convert away from cftype->read()
In preparation of conversion to kernfs, cgroup file handling is being
consolidated so that it can be easily mapped to the seq_file based
interface of kernfs.

All users of cftype->read() can be easily served, usually better, by
seq_file and other methods.  Rename cpuset_common_file_read() to
cpuset_common_read_seq_string() and convert it to use
read_seq_string() interface instead.  This not only simplifies the
code but also makes it more versatile.  Before, the file couldn't
output if the result is longer than PAGE_SIZE.  After the conversion,
seq_file automatically grows the buffer until the output can fit.

This patch doesn't make any visible behavior changes except for being
able to handle output larger than PAGE_SIZE.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-12-05 12:28:02 -05:00
Tejun Heo
44ffc75ba9 cgroup, sched: convert away from cftype->read_map()
In preparation of conversion to kernfs, cgroup file handling is being
consolidated so that it can be easily mapped to the seq_file based
interface of kernfs.

cftype->read_map() doesn't add any value and being replaced with
->read_seq_string().  Update cpu_stats_show() and cpuacct_stats_show()
accordingly.

This patch doesn't make any visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2013-12-05 12:28:01 -05:00
Tejun Heo
afb2bc14e1 cgroup: don't guarantee cgroup.procs is sorted if sane_behavior
For some reason, tasks and cgroup.procs guarantee that the result is
sorted.  This is the only reason this whole pidlist logic is necessary
instead of just iterating through sorted member tasks.  We can't do
anything about the existing interface but at least ensure that such
expectation doesn't exist for the new interface so that pidlist logic
may be removed in the distant future.

This patch scrambles the sort order if sane_behavior so that the
output is usually not sorted in the new interface.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-11-29 10:42:59 -05:00
Tejun Heo
045023658c cgroup: remove cgroup_pidlist->use_count
After the recent changes, pidlist ref is held only between
cgroup_pidlist_start() and cgroup_pidlist_stop() during which
cgroup->pidlist_mutex is also held.  IOW, the reference count is
redundant now.  While in use, it's always one and pidlist_mutex is
held - holding the mutex has exactly the same protection.

This patch collapses destroy_dwork queueing into cgroup_pidlist_stop()
so that pidlist_mutex is not released inbetween and drops
pidlist->use_count.

This patch shouldn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-11-29 10:42:59 -05:00
Tejun Heo
4bac00d16a cgroup: load and release pidlists from seq_file start and stop respectively
Currently, pidlists are reference counted from file open and release
methods.  This means that holding onto an open file may waste memory
and reads may return data which is very stale.  Both aren't critical
because pidlists are keyed and shared per namespace and, well, the
user isn't supposed to have large delay between open and reads.

cgroup is planned to be converted to use kernfs and it'd be best if we
can stick to just the seq_file operations - start, next, stop and
show.  This can be achieved by loading pidlist on demand from start
and release with time delay from stop, so that consecutive reads don't
end up reloading the pidlist on each iteration.  This would remove the
need for hooking into open and release while also avoiding issues with
holding onto pidlist for too long.

The previous patches implemented delayed release and restructured
pidlist handling so that pidlists can be loaded and released from
seq_file start / stop.  This patch actually moves pidlist load to
start and release to stop.

This means that pidlist is pinned only between start and stop and may
go away between two consecutive read calls if the two calls are apart
by more than CGROUP_PIDLIST_DESTROY_DELAY.  cgroup_pidlist_start()
thus can't re-use the stored cgroup_pid_list_open_file->pidlist
directly.  During start, it's only used as a hint indicating whether
this is the first start after open or not and pidlist is always looked
up or created.

pidlist_mutex locking and reference counting are moved out of
pidlist_array_load() so that pidlist_array_load() can perform lookup
and creation atomically.  While this enlarges the area covered by
pidlist_mutex, given how the lock is used, it's highly unlikely to be
noticeable.

v2: Refreshed on top of the updated "cgroup: introduce struct
    cgroup_pidlist_open_file".

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-11-29 10:42:59 -05:00
Tejun Heo
069df3b7ae cgroup: remove cgroup_pidlist->rwsem
cgroup_pidlist locking is needlessly complicated.  It has outer
cgroup->pidlist_mutex to protect the list of pidlists associated with
a cgroup and then each pidlist has rwsem to synchronize updates and
reads.  Given that the only read access is from seq_file operations
which are always invoked back-to-back, the rwsem is a giant overkill.
All it does is adding unnecessary complexity.

This patch removes cgroup_pidlist->rwsem and protects all accesses to
pidlists belonging to a cgroup with cgroup->pidlist_mutex.
pidlist->rwsem locking is removed if it's nested inside
cgroup->pidlist_mutex; otherwise, it's replaced with
cgroup->pidlist_mutex locking.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-11-29 10:42:59 -05:00
Tejun Heo
e6b817103d cgroup: refactor cgroup_pidlist_find()
Rename cgroup_pidlist_find() to cgroup_pidlist_find_create() and
separate out finding proper to cgroup_pidlist_find().  Also, move
locking to the caller.

This patch is preparation for pidlist restructure and doesn't
introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-11-29 10:42:59 -05:00
Tejun Heo
62236858f3 cgroup: introduce struct cgroup_pidlist_open_file
For pidlist files, seq_file->private pointed to the loaded
cgroup_pidlist; however, pidlist loading is planned to be moved to
cgroup_pidlist_start() for kernfs conversion and seq_file->private
needs to carry more information from open to allow that.

This patch introduces struct cgroup_pidlist_open_file which contains
type, cgrp and pidlist and updates pidlist seq_file->private to point
to it using seq_open_private() and seq_release_private().  Note that
this eventually will be replaced by kernfs_open_file.

While this patch makes more information available to seq_file
operations, they don't use it yet and this patch doesn't introduce any
behavior changes except for allocation of the extra private struct.

v2: use __seq_open_private() instead of seq_open_private() for brevity
    as suggested by Li.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-11-29 10:42:58 -05:00
Tejun Heo
b1a2136731 cgroup: implement delayed destruction for cgroup_pidlist
Currently, pidlists are reference counted from file open and release
methods.  This means that holding onto an open file may waste memory
and reads may return data which is very stale.  Both aren't critical
because pidlists are keyed and shared per namespace and, well, the
user isn't supposed to have large delay between open and reads.

cgroup is planned to be converted to use kernfs and it'd be best if we
can stick to just the seq_file operations - start, next, stop and
show.  This can be achieved by loading pidlist on demand from start
and release with time delay from stop, so that consecutive reads don't
end up reloading the pidlist on each iteration.  This would remove the
need for hooking into open and release while also avoiding issues with
holding onto pidlist for too long.

This patch implements delayed release of pidlist.  As pidlists could
be lingering on cgroup removal waiting for the timer to expire, cgroup
free path needs to queue the destruction work item immediately and
flush.  As those work items are self-destroying, each work item can't
be flushed directly.  A new workqueue - cgroup_pidlist_destroy_wq - is
added to serve as flush domain.

Note that this patch just adds delayed release on top of the current
implementation and doesn't change where pidlist is loaded and
released.  Following patches will make those changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-11-29 10:42:58 -05:00
Tejun Heo
b9f3cecaba cgroup: remove cftype->release()
Now that pidlist files don't use cftype->release(), it doesn't have
any user left.  Remove it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-11-29 10:42:58 -05:00
Tejun Heo
ac1e69aa78 cgroup: don't skip seq_open on write only opens on pidlist files
Currently, cgroup_pidlist_open() skips seq_open() and pidlist loading
if the file is opened write-only, which is a sensible optimization as
pidlist loading can be costly and there often are occasions where
tasks or cgroup.procs is opened write-only.  However, pidlist init and
release are planned to be moved to cgroup_pidlist_start/stop()
respectively which would make this optimization unnecessary.

This patch removes the optimization and always fully initializes
pidlist files regardless of open mode.  This will help moving pidlist
handling to start/stop by unifying rw paths and removes the need for
specifying cftype->release() in addition to .release in
cgroup_pidlist_operations as file->f_op is now always overridden.  As
pidlist files were the only user of cftype->release(), the next patch
will remove the method.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-11-29 10:42:58 -05:00
Tejun Heo
c729b11edf cgroup: Merge branch 'for-3.13-fixes' into for-3.14
Pull to receive e605b36575 ("cgroup: fix cgroup_subsys_state leak
for seq_files") as for-3.14 is scheduled to have a lot of changes
which depend on it.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-11-27 18:17:27 -05:00
Tejun Heo
e605b36575 cgroup: fix cgroup_subsys_state leak for seq_files
If a cgroup file implements either read_map() or read_seq_string(),
such file is served using seq_file by overriding file->f_op to
cgroup_seqfile_operations, which also overrides the release method to
single_release() from cgroup_file_release().

Because cgroup_file_open() didn't use to acquire any resources, this
used to be fine, but since f7d58818ba ("cgroup: pin
cgroup_subsys_state when opening a cgroupfs file"), cgroup_file_open()
pins the css (cgroup_subsys_state) which is put by
cgroup_file_release().  The patch forgot to update the release path
for seq_files and each open/release cycle leaks a css reference.

Fix it by updating cgroup_file_release() to also handle seq_files and
using it for seq_file release path too.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v3.12
2013-11-27 18:16:21 -05:00
Peter Zijlstra
0fc0287c9e cpuset: Fix memory allocator deadlock
Juri hit the below lockdep report:

[    4.303391] ======================================================
[    4.303392] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
[    4.303394] 3.12.0-dl-peterz+ #144 Not tainted
[    4.303395] ------------------------------------------------------
[    4.303397] kworker/u4:3/689 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[    4.303399]  (&p->mems_allowed_seq){+.+...}, at: [<ffffffff8114e63c>] new_slab+0x6c/0x290
[    4.303417]
[    4.303417] and this task is already holding:
[    4.303418]  (&(&q->__queue_lock)->rlock){..-...}, at: [<ffffffff812d2dfb>] blk_execute_rq_nowait+0x5b/0x100
[    4.303431] which would create a new lock dependency:
[    4.303432]  (&(&q->__queue_lock)->rlock){..-...} -> (&p->mems_allowed_seq){+.+...}
[    4.303436]

[    4.303898] the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock:
[    4.303918] -> (&p->mems_allowed_seq){+.+...} ops: 2762 {
[    4.303922]    HARDIRQ-ON-W at:
[    4.303923]                     [<ffffffff8108ab9a>] __lock_acquire+0x65a/0x1ff0
[    4.303926]                     [<ffffffff8108cbe3>] lock_acquire+0x93/0x140
[    4.303929]                     [<ffffffff81063dd6>] kthreadd+0x86/0x180
[    4.303931]                     [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0
[    4.303933]    SOFTIRQ-ON-W at:
[    4.303933]                     [<ffffffff8108abcc>] __lock_acquire+0x68c/0x1ff0
[    4.303935]                     [<ffffffff8108cbe3>] lock_acquire+0x93/0x140
[    4.303940]                     [<ffffffff81063dd6>] kthreadd+0x86/0x180
[    4.303955]                     [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0
[    4.303959]    INITIAL USE at:
[    4.303960]                    [<ffffffff8108a884>] __lock_acquire+0x344/0x1ff0
[    4.303963]                    [<ffffffff8108cbe3>] lock_acquire+0x93/0x140
[    4.303966]                    [<ffffffff81063dd6>] kthreadd+0x86/0x180
[    4.303969]                    [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0
[    4.303972]  }

Which reports that we take mems_allowed_seq with interrupts enabled. A
little digging found that this can only be from
cpuset_change_task_nodemask().

This is an actual deadlock because an interrupt doing an allocation will
hit get_mems_allowed()->...->__read_seqcount_begin(), which will spin
forever waiting for the write side to complete.

Cc: John Stultz <john.stultz@linaro.org>
Cc: Mel Gorman <mgorman@suse.de>
Reported-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@gmail.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
2013-11-27 13:52:47 -05:00
Tejun Heo
edab95103d cgroup: Merge branch 'memcg_event' into for-3.14
Merge v3.12 based patch series to move cgroup_event implementation to
memcg into for-3.14.  The following two commits cause a conflict in
kernel/cgroup.c

  2ff2a7d03b ("cgroup: kill css_id")
  79bd9814e5 ("cgroup, memcg: move cgroup_event implementation to memcg")

Each patch removes a struct definition from kernel/cgroup.c.  As the
two are adjacent, they cause a context conflict.  Easily resolved by
removing both structs.

Signed-off-by: Tejun Heo <tj@kernel.org>
2013-11-22 18:32:25 -05:00
Tejun Heo
b36824c75c cgroup: unexport cgroup_css() and remove __file_cft()
Now that cgroup_event is made memcg specific, the temporarily exported
functions are no longer necessary.  Unexport cgroup_css() and remove
__file_cft() which doesn't have any user left.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
2013-11-22 18:20:44 -05:00
Tejun Heo
fba9480783 cgroup, memcg: move cgroup->event_list[_lock] and event callbacks into memcg
cgroup_event is being moved from cgroup core to memcg and the
implementation is already moved by the previous patch.  This patch
moves the data fields and callbacks.

* cgroup->event_list[_lock] are moved to mem_cgroup.

* cftype->[un]register_event() are moved to cgroup_event.  This makes
  it impossible for individual cftype definitions to specify their
  event callbacks.  This is worked around by simply hard-coding
  filename to event callback mapping in cgroup_write_event_control().
  This is awkward and inflexible, which is actually desirable given
  that we don't want to grow more usages of this feature.

* eventfd_ctx declaration is removed from cgroup.h, which makes
  vmpressure.h miss eventfd_ctx declaration.  Include eventfd.h from
  vmpressure.h.

v2: Use file name from dentry instead of cftype.  This will allow
    removing all cftype handling in the function.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
2013-11-22 18:20:43 -05:00
Tejun Heo
79bd9814e5 cgroup, memcg: move cgroup_event implementation to memcg
cgroup_event is way over-designed and tries to build a generic
flexible event mechanism into cgroup - fully customizable event
specification for each user of the interface.  This is utterly
unnecessary and overboard especially in the light of the planned
unified hierarchy as there's gonna be single agent.  Simply generating
events at fixed points, or if that's too restrictive, configureable
cadence or single set of configureable points should be enough.

Thankfully, memcg is the only user and gets to keep it.  Replacing it
with something simpler on sane_behavior is strongly recommended.

This patch moves cgroup_event and "cgroup.event_control"
implementation to mm/memcontrol.c.  Clearing of events on cgroup
destruction is moved from cgroup_destroy_locked() to
mem_cgroup_css_offline(), which shouldn't make any noticeable
difference.

cgroup_css() and __file_cft() are exported to enable the move;
however, this will soon be reverted once the event code is updated to
be memcg specific.

Note that "cgroup.event_control" will now exist only on the hierarchy
with memcg attached to it.  While this change is visible to userland,
it is unlikely to be noticeable as the file has never been meaningful
outside memcg.

Aside from the above change, this is pure code relocation.

v2: Per Li Zefan's comments, init/Kconfig updated accordingly and
    poll.h inclusion moved from cgroup.c to memcontrol.c.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
2013-11-22 18:20:42 -05:00
Tejun Heo
e5fca243ab cgroup: use a dedicated workqueue for cgroup destruction
Since be44562613 ("cgroup: remove synchronize_rcu() from
cgroup_diput()"), cgroup destruction path makes use of workqueue.  css
freeing is performed from a work item from that point on and a later
commit, ea15f8ccdb ("cgroup: split cgroup destruction into two
steps"), moves css offlining to workqueue too.

As cgroup destruction isn't depended upon for memory reclaim, the
destruction work items were put on the system_wq; unfortunately, some
controller may block in the destruction path for considerable duration
while holding cgroup_mutex.  As large part of destruction path is
synchronized through cgroup_mutex, when combined with high rate of
cgroup removals, this has potential to fill up system_wq's max_active
of 256.

Also, it turns out that memcg's css destruction path ends up queueing
and waiting for work items on system_wq through work_on_cpu().  If
such operation happens while system_wq is fully occupied by cgroup
destruction work items, work_on_cpu() can't make forward progress
because system_wq is full and other destruction work items on
system_wq can't make forward progress because the work item waiting
for work_on_cpu() is holding cgroup_mutex, leading to deadlock.

This can be fixed by queueing destruction work items on a separate
workqueue.  This patch creates a dedicated workqueue -
cgroup_destroy_wq - for this purpose.  As these work items shouldn't
have inter-dependencies and mostly serialized by cgroup_mutex anyway,
giving high concurrency level doesn't buy anything and the workqueue's
@max_active is set to 1 so that destruction work items are executed
one by one on each CPU.

Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
cgroup_destroy_wq can't be allocated from cgroup_init().  Do it from a
separate core_initcall().  In the future, we probably want to reorder
so that workqueue init happens before cgroup_init().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: Shawn Bohrer <shawn.bohrer@gmail.com>
Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com
Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils
Cc: stable@vger.kernel.org # v3.9+
2013-11-22 17:14:39 -05:00
Linus Torvalds
78dc53c422 Merge branch 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
 "In this patchset, we finally get an SELinux update, with Paul Moore
  taking over as maintainer of that code.

  Also a significant update for the Keys subsystem, as well as
  maintenance updates to Smack, IMA, TPM, and Apparmor"

and since I wanted to know more about the updates to key handling,
here's the explanation from David Howells on that:

 "Okay.  There are a number of separate bits.  I'll go over the big bits
  and the odd important other bit, most of the smaller bits are just
  fixes and cleanups.  If you want the small bits accounting for, I can
  do that too.

   (1) Keyring capacity expansion.

        KEYS: Consolidate the concept of an 'index key' for key access
        KEYS: Introduce a search context structure
        KEYS: Search for auth-key by name rather than target key ID
        Add a generic associative array implementation.
        KEYS: Expand the capacity of a keyring

     Several of the patches are providing an expansion of the capacity of a
     keyring.  Currently, the maximum size of a keyring payload is one page.
     Subtract a small header and then divide up into pointers, that only gives
     you ~500 pointers on an x86_64 box.  However, since the NFS idmapper uses
     a keyring to store ID mapping data, that has proven to be insufficient to
     the cause.

     Whatever data structure I use to handle the keyring payload, it can only
     store pointers to keys, not the keys themselves because several keyrings
     may point to a single key.  This precludes inserting, say, and rb_node
     struct into the key struct for this purpose.

     I could make an rbtree of records such that each record has an rb_node
     and a key pointer, but that would use four words of space per key stored
     in the keyring.  It would, however, be able to use much existing code.

     I selected instead a non-rebalancing radix-tree type approach as that
     could have a better space-used/key-pointer ratio.  I could have used the
     radix tree implementation that we already have and insert keys into it by
     their serial numbers, but that means any sort of search must iterate over
     the whole radix tree.  Further, its nodes are a bit on the capacious side
     for what I want - especially given that key serial numbers are randomly
     allocated, thus leaving a lot of empty space in the tree.

     So what I have is an associative array that internally is a radix-tree
     with 16 pointers per node where the index key is constructed from the key
     type pointer and the key description.  This means that an exact lookup by
     type+description is very fast as this tells us how to navigate directly to
     the target key.

     I made the data structure general in lib/assoc_array.c as far as it is
     concerned, its index key is just a sequence of bits that leads to a
     pointer.  It's possible that someone else will be able to make use of it
     also.  FS-Cache might, for example.

   (2) Mark keys as 'trusted' and keyrings as 'trusted only'.

        KEYS: verify a certificate is signed by a 'trusted' key
        KEYS: Make the system 'trusted' keyring viewable by userspace
        KEYS: Add a 'trusted' flag and a 'trusted only' flag
        KEYS: Separate the kernel signature checking keyring from module signing

     These patches allow keys carrying asymmetric public keys to be marked as
     being 'trusted' and allow keyrings to be marked as only permitting the
     addition or linkage of trusted keys.

     Keys loaded from hardware during kernel boot or compiled into the kernel
     during build are marked as being trusted automatically.  New keys can be
     loaded at runtime with add_key().  They are checked against the system
     keyring contents and if their signatures can be validated with keys that
     are already marked trusted, then they are marked trusted also and can
     thus be added into the master keyring.

     Patches from Mimi Zohar make this usable with the IMA keyrings also.

   (3) Remove the date checks on the key used to validate a module signature.

        X.509: Remove certificate date checks

     It's not reasonable to reject a signature just because the key that it was
     generated with is no longer valid datewise - especially if the kernel
     hasn't yet managed to set the system clock when the first module is
     loaded - so just remove those checks.

   (4) Make it simpler to deal with additional X.509 being loaded into the kernel.

        KEYS: Load *.x509 files into kernel keyring
        KEYS: Have make canonicalise the paths of the X.509 certs better to deduplicate

     The builder of the kernel now just places files with the extension ".x509"
     into the kernel source or build trees and they're concatenated by the
     kernel build and stuffed into the appropriate section.

   (5) Add support for userspace kerberos to use keyrings.

        KEYS: Add per-user_namespace registers for persistent per-UID kerberos caches
        KEYS: Implement a big key type that can save to tmpfs

     Fedora went to, by default, storing kerberos tickets and tokens in tmpfs.
     We looked at storing it in keyrings instead as that confers certain
     advantages such as tickets being automatically deleted after a certain
     amount of time and the ability for the kernel to get at these tokens more
     easily.

     To make this work, two things were needed:

     (a) A way for the tickets to persist beyond the lifetime of all a user's
         sessions so that cron-driven processes can still use them.

         The problem is that a user's session keyrings are deleted when the
         session that spawned them logs out and the user's user keyring is
         deleted when the UID is deleted (typically when the last log out
         happens), so neither of these places is suitable.

         I've added a system keyring into which a 'persistent' keyring is
         created for each UID on request.  Each time a user requests their
         persistent keyring, the expiry time on it is set anew.  If the user
         doesn't ask for it for, say, three days, the keyring is automatically
         expired and garbage collected using the existing gc.  All the kerberos
         tokens it held are then also gc'd.

     (b) A key type that can hold really big tickets (up to 1MB in size).

         The problem is that Active Directory can return huge tickets with lots
         of auxiliary data attached.  We don't, however, want to eat up huge
         tracts of unswappable kernel space for this, so if the ticket is
         greater than a certain size, we create a swappable shmem file and dump
         the contents in there and just live with the fact we then have an
         inode and a dentry overhead.  If the ticket is smaller than that, we
         slap it in a kmalloc()'d buffer"

* 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (121 commits)
  KEYS: Fix keyring content gc scanner
  KEYS: Fix error handling in big_key instantiation
  KEYS: Fix UID check in keyctl_get_persistent()
  KEYS: The RSA public key algorithm needs to select MPILIB
  ima: define '_ima' as a builtin 'trusted' keyring
  ima: extend the measurement list to include the file signature
  kernel/system_certificate.S: use real contents instead of macro GLOBAL()
  KEYS: fix error return code in big_key_instantiate()
  KEYS: Fix keyring quota misaccounting on key replacement and unlink
  KEYS: Fix a race between negating a key and reading the error set
  KEYS: Make BIG_KEYS boolean
  apparmor: remove the "task" arg from may_change_ptraced_domain()
  apparmor: remove parent task info from audit logging
  apparmor: remove tsk field from the apparmor_audit_struct
  apparmor: fix capability to not use the current task, during reporting
  Smack: Ptrace access check mode
  ima: provide hash algo info in the xattr
  ima: enable support for larger default filedata hash algorithms
  ima: define kernel parameter 'ima_template=' to change configured default
  ima: add Kconfig default measurement list template
  ...
2013-11-21 19:46:00 -08:00
Linus Torvalds
3eaded86ac Merge git://git.infradead.org/users/eparis/audit
Pull audit updates from Eric Paris:
 "Nothing amazing.  Formatting, small bug fixes, couple of fixes where
  we didn't get records due to some old VFS changes, and a change to how
  we collect execve info..."

Fixed conflict in fs/exec.c as per Eric and linux-next.

* git://git.infradead.org/users/eparis/audit: (28 commits)
  audit: fix type of sessionid in audit_set_loginuid()
  audit: call audit_bprm() only once to add AUDIT_EXECVE information
  audit: move audit_aux_data_execve contents into audit_context union
  audit: remove unused envc member of audit_aux_data_execve
  audit: Kill the unused struct audit_aux_data_capset
  audit: do not reject all AUDIT_INODE filter types
  audit: suppress stock memalloc failure warnings since already managed
  audit: log the audit_names record type
  audit: add child record before the create to handle case where create fails
  audit: use given values in tty_audit enable api
  audit: use nlmsg_len() to get message payload length
  audit: use memset instead of trying to initialize field by field
  audit: fix info leak in AUDIT_GET requests
  audit: update AUDIT_INODE filter rule to comparator function
  audit: audit feature to set loginuid immutable
  audit: audit feature to only allow unsetting the loginuid
  audit: allow unsetting the loginuid (with priv)
  audit: remove CONFIG_AUDIT_LOGINUID_IMMUTABLE
  audit: loginuid functions coding style
  selinux: apply selinux checks on new audit message types
  ...
2013-11-21 19:18:14 -08:00
Linus Torvalds
b5898cd057 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs bits and pieces from Al Viro:
 "Assorted bits that got missed in the first pull request + fixes for a
  couple of coredump regressions"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fold try_to_ascend() into the sole remaining caller
  dcache.c: get rid of pointless macros
  take read_seqbegin_or_lock() and friends to seqlock.h
  consolidate simple ->d_delete() instances
  gfs2: endianness misannotations
  dump_emit(): use __kernel_write(), not vfs_write()
  dump_align(): fix the dumb braino
2013-11-20 14:25:39 -08:00
Linus Torvalds
82023bb7f7 More ACPI and power management updates for 3.13-rc1
- ACPI-based device hotplug fixes for issues introduced recently and
   a fix for an older error code path bug in the ACPI PCI host bridge
   driver.
 
 - Fix for recently broken OMAP cpufreq build from Viresh Kumar.
 
 - Fix for a recent hibernation regression related to s2disk.
 
 - Fix for a locking-related regression in the ACPI EC driver from
   Puneet Kumar.
 
 - System suspend error code path fix related to runtime PM and
   runtime PM documentation update from Ulf Hansson.
 
 - cpufreq's conservative governor fix from Xiaoguang Chen.
 
 - New processor IDs for intel_idle and turbostat and removal of
   an obsolete Kconfig option from Len Brown.
 
 - New device IDs for the ACPI LPSS (Low-Power Subsystem) driver and
   ACPI-based PCI hotplug (ACPIPHP) cleanup from Mika Westerberg.
 
 - Removal of several ACPI video DMI blacklist entries that are not
   necessary any more from Aaron Lu.
 
 - Rework of the ACPI companion representation in struct device and
   code cleanup related to that change from Rafael J Wysocki,
   Lan Tianyu and Jarkko Nikula.
 
 - Fixes for assigning names to ACPI-enumerated I2C and SPI devices
   from Jarkko Nikula.
 
 /
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIcBAABCAAGBQJSjLYNAAoJEILEb/54YlRxkEQP/1pmFWNwSsxLtTHd+PEs0Xbo
 QccqvjQrnw/c8GcmK4eZrz6/xyuepmmjy9kfRKj2ENZniy0NEsSFqkTdSO3vYlva
 8HKWUj7MV3evhFERXAF6Tu0b4Enx4jOP7VMtmYxJo3qrSnKRUcUzc6DGv/ACsUT1
 Nkj0Lhdsg053Z+YzIXrl50w0tCDEMhVmWlMHBtYgr+dMNVnkfPBGkqMblMkKCXT2
 w/yHvauZlxQHtI+8bVqTuGgNN0CPzdlpFGiuUF+5mDf6dRX8zlSn56Ia+Wyw1k9X
 dQp4jYQOgPRo03rNKqQPDiPxUdc7T0RAHRvDB51Ncweuh5PfZGguQe71p6/LKY2W
 i6zblZ0f/vc13hTiMrP+qzKcwZvgPB5DH7SfnHr61JKV7GNFCdYAqoceS5hYMzR9
 d2Fd+txgm763IHWewXfDS/G2cU492R5qr4jpmUIACBQKWDZcqmSRDwRj83t56Ltb
 jgFBMbg4vZxG7IARhind74xsALxdhsgmFjPmx+0qPWjYxcU8otQZpXbgGNI9iOuW
 pxIQv5WPQW0tTmwO4HSuVCOwDPLPz5R0jkev7SvSj3Ek3TeD7He4LmnK055CATiC
 puq+6dp1FISPOPJYk+0DI61qN/CB/qNwRp8LU3ctZwudPVhznIE9FFQ3iN1FdBg2
 X8VDcT9t7VvVuxSBjgkj
 =QMp+
 -----END PGP SIGNATURE-----

Merge tag 'pm+acpi-2-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull more ACPI and power management updates from Rafael Wysocki:

 - ACPI-based device hotplug fixes for issues introduced recently and a
   fix for an older error code path bug in the ACPI PCI host bridge
   driver

 - Fix for recently broken OMAP cpufreq build from Viresh Kumar

 - Fix for a recent hibernation regression related to s2disk

 - Fix for a locking-related regression in the ACPI EC driver from
   Puneet Kumar

 - System suspend error code path fix related to runtime PM and runtime
   PM documentation update from Ulf Hansson

 - cpufreq's conservative governor fix from Xiaoguang Chen

 - New processor IDs for intel_idle and turbostat and removal of an
   obsolete Kconfig option from Len Brown

 - New device IDs for the ACPI LPSS (Low-Power Subsystem) driver and
   ACPI-based PCI hotplug (ACPIPHP) cleanup from Mika Westerberg

 - Removal of several ACPI video DMI blacklist entries that are not
   necessary any more from Aaron Lu

 - Rework of the ACPI companion representation in struct device and code
   cleanup related to that change from Rafael J Wysocki, Lan Tianyu and
   Jarkko Nikula

 - Fixes for assigning names to ACPI-enumerated I2C and SPI devices from
   Jarkko Nikula

* tag 'pm+acpi-2-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (24 commits)
  PCI / hotplug / ACPI: Drop unused acpiphp_debug declaration
  ACPI / scan: Set flags.match_driver in acpi_bus_scan_fixed()
  ACPI / PCI root: Clear driver_data before failing enumeration
  ACPI / hotplug: Fix PCI host bridge hot removal
  ACPI / hotplug: Fix acpi_bus_get_device() return value check
  cpufreq: governor: Remove fossil comment in the cpufreq_governor_dbs()
  ACPI / video: clean up DMI table for initial black screen problem
  ACPI / EC: Ensure lock is acquired before accessing ec struct members
  PM / Hibernate: Do not crash kernel in free_basic_memory_bitmaps()
  ACPI / AC: Remove struct acpi_device pointer from struct acpi_ac
  spi: Use stable dev_name for ACPI enumerated SPI slaves
  i2c: Use stable dev_name for ACPI enumerated I2C slaves
  ACPI: Provide acpi_dev_name accessor for struct acpi_device device name
  ACPI / bind: Use (put|get)_device() on ACPI device objects too
  ACPI: Eliminate the DEVICE_ACPI_HANDLE() macro
  ACPI / driver core: Store an ACPI device pointer in struct acpi_dev_node
  cpufreq: OMAP: Fix compilation error 'r & ret undeclared'
  PM / Runtime: Fix error path for prepare
  PM / Runtime: Update documentation around probe|remove|suspend
  cpufreq: conservative: set requested_freq to policy max when it is over policy max
  ...
2013-11-20 13:25:04 -08:00
Linus Torvalds
1ee2dcc224 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Mostly these are fixes for fallout due to merge window changes, as
  well as cures for problems that have been with us for a much longer
  period of time"

 1) Johannes Berg noticed two major deficiencies in our genetlink
    registration.  Some genetlink protocols we passing in constant
    counts for their ops array rather than something like
    ARRAY_SIZE(ops) or similar.  Also, some genetlink protocols were
    using fixed IDs for their multicast groups.

    We have to retain these fixed IDs to keep existing userland tools
    working, but reserve them so that other multicast groups used by
    other protocols can not possibly conflict.

    In dealing with these two problems, we actually now use less state
    management for genetlink operations and multicast groups.

 2) When configuring interface hardware timestamping, fix several
    drivers that simply do not validate that the hwtstamp_config value
    is one the driver actually supports.  From Ben Hutchings.

 3) Invalid memory references in mwifiex driver, from Amitkumar Karwar.

 4) In dev_forward_skb(), set the skb->protocol in the right order
    relative to skb_scrub_packet().  From Alexei Starovoitov.

 5) Bridge erroneously fails to use the proper wrapper functions to make
    calls to netdev_ops->ndo_vlan_rx_{add,kill}_vid.  Fix from Toshiaki
    Makita.

 6) When detaching a bridge port, make sure to flush all VLAN IDs to
    prevent them from leaking, also from Toshiaki Makita.

 7) Put in a compromise for TCP Small Queues so that deep queued devices
    that delay TX reclaim non-trivially don't have such a performance
    decrease.  One particularly problematic area is 802.11 AMPDU in
    wireless.  From Eric Dumazet.

 8) Fix crashes in tcp_fastopen_cache_get(), we can see NULL socket dsts
    here.  Fix from Eric Dumzaet, reported by Dave Jones.

 9) Fix use after free in ipv6 SIT driver, from Willem de Bruijn.

10) When computing mergeable buffer sizes, virtio-net fails to take the
    virtio-net header into account.  From Michael Dalton.

11) Fix seqlock deadlock in ip4_datagram_connect() wrt.  statistic
    bumping, this one has been with us for a while.  From Eric Dumazet.

12) Fix NULL deref in the new TIPC fragmentation handling, from Erik
    Hugne.

13) 6lowpan bit used for traffic classification was wrong, from Jukka
    Rissanen.

14) macvlan has the same issue as normal vlans did wrt.  propagating LRO
    disabling down to the real device, fix it the same way.  From Michal
    Kubecek.

15) CPSW driver needs to soft reset all slaves during suspend, from
    Daniel Mack.

16) Fix small frame pacing in FQ packet scheduler, from Eric Dumazet.

17) The xen-netfront RX buffer refill timer isn't properly scheduled on
    partial RX allocation success, from Ma JieYue.

18) When ipv6 ping protocol support was added, the AF_INET6 protocol
    initialization cleanup path on failure was borked a little.  Fix
    from Vlad Yasevich.

19) If a socket disconnects during a read/recvmsg/recvfrom/etc that
    blocks we can do the wrong thing with the msg_name we write back to
    userspace.  From Hannes Frederic Sowa.  There is another fix in the
    works from Hannes which will prevent future problems of this nature.

20) Fix route leak in VTI tunnel transmit, from Fan Du.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits)
  genetlink: make multicast groups const, prevent abuse
  genetlink: pass family to functions using groups
  genetlink: add and use genl_set_err()
  genetlink: remove family pointer from genl_multicast_group
  genetlink: remove genl_unregister_mc_group()
  hsr: don't call genl_unregister_mc_group()
  quota/genetlink: use proper genetlink multicast APIs
  drop_monitor/genetlink: use proper genetlink multicast APIs
  genetlink: only pass array to genl_register_family_with_ops()
  tcp: don't update snd_nxt, when a socket is switched from repair mode
  atm: idt77252: fix dev refcnt leak
  xfrm: Release dst if this dst is improper for vti tunnel
  netlink: fix documentation typo in netlink_set_err()
  be2net: Delete secondary unicast MAC addresses during be_close
  be2net: Fix unconditional enabling of Rx interface options
  net, virtio_net: replace the magic value
  ping: prevent NULL pointer dereference on write to msg_name
  bnx2x: Prevent "timeout waiting for state X"
  bnx2x: prevent CFC attention
  bnx2x: Prevent panic during DMAE timeout
  ...
2013-11-19 15:50:47 -08:00
Kirill A. Shutemov
24b9fdc59a kernel/bounds: avoid circular dependencies in generated headers
<linux/spinlock.h> has heavy dependencies on other header files.
It triggers circular dependencies in generated headers on IA64, at
least:

  CC      kernel/bounds.s
In file included from /home/space/kas/git/public/linux/arch/ia64/include/asm/thread_info.h:9:0,
                 from include/linux/thread_info.h:54,
                 from include/asm-generic/preempt.h:4,
                 from arch/ia64/include/generated/asm/preempt.h:1,
                 from include/linux/preempt.h:18,
                 from include/linux/spinlock.h:50,
                 from kernel/bounds.c:14:
/home/space/kas/git/public/linux/arch/ia64/include/asm/asm-offsets.h:1:35: fatal error: generated/asm-offsets.h: No such file or directory
compilation terminated.

Let's replace <linux/spinlock.h> with <linux/spinlock_types.h>, it's
enough to find out size of spinlock_t.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-and-Tested-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-11-19 14:20:12 -08:00
Johannes Berg
c53ed74236 genetlink: only pass array to genl_register_family_with_ops()
As suggested by David Miller, make genl_register_family_with_ops()
a macro and pass only the array, evaluating ARRAY_SIZE() in the
macro, this is a little safer.

The openvswitch has some indirection, assing ops/n_ops directly in
that code. This might ultimately just assign the pointers in the
family initializations, saving the struct genl_family_and_ops and
code (once mcast groups are handled differently.)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-19 16:39:05 -05:00
Linus Torvalds
4007162647 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq cleanups from Ingo Molnar:
 "This is a multi-arch cleanup series from Thomas Gleixner, which we
  kept to near the end of the merge window, to not interfere with
  architecture updates.

  This series (motivated by the -rt kernel) unifies more aspects of IRQ
  handling and generalizes PREEMPT_ACTIVE"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  preempt: Make PREEMPT_ACTIVE generic
  sparc: Use preempt_schedule_irq
  ia64: Use preempt_schedule_irq
  m32r: Use preempt_schedule_irq
  hardirq: Make hardirq bits generic
  m68k: Simplify low level interrupt handling code
  genirq: Prevent spurious detection for unconditionally polled interrupts
2013-11-19 10:40:00 -08:00
Rafael J. Wysocki
b38f67c4ae Merge branch 'pm-sleep'
* pm-sleep:
  PM / Hibernate: Do not crash kernel in free_basic_memory_bitmaps()
2013-11-19 01:07:08 +01:00
Linus Torvalds
b29c8306a3 This batch of changes is mostly clean ups and small bug fixes.
The only real feature that was added this release is from Namhyung Kim,
 who introduced "set_graph_notrace" filter that lets you run the function
 graph tracer and not trace particular functions and their call chain.
 
 Tom Zanussi added some updates to the ftrace multibuffer tracing that
 made it more consistent with the top level tracing.
 
 One of the fixes for perf function tracing required an API change in
 RCU; the addition of "rcu_is_watching()". As Paul McKenney is pushing
 that change in this release too, he gave me a branch that included
 all the changes to get that working, and I pulled that into my tree
 in order to complete the perf function tracing fix.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQEcBAABAgAGBQJSgX5SAAoJEKQekfcNnQGulUAH/jORqJrKaNAulmZ314VsAqfa
 zMtF5UAAPf7kqc3AN/jtFrhJUNEfxWOo7A4r0FsM/rKdWJF+98GA6aqYVD+XoWFt
 +36fg1enxbXUjixQ96Uh+o1+BJUgYDqljuWzqSu/oiXWfWwl8+WL4kcbhb+V9WcF
 SpdzLCWVZRfhyDiN3+0zvyQ8RSG2Pd7CWn9zroI0e4sxGo0Ki6JUnIcXtZGOBDOQ
 IIZdjXvGSfpJ+3u3XvRPXJcltRCtOsVWxYzrmvRlmHDW5QMe1+WmmrlojTePrLaJ
 xn8+3WINqetAR+ZQnazbpt1XzJzKa8QtFgpiN0kT6qL7cg3N1Owc4vLGohl7wok=
 =Nesf
 -----END PGP SIGNATURE-----

Merge tag 'trace-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing update from Steven Rostedt:
 "This batch of changes is mostly clean ups and small bug fixes.  The
  only real feature that was added this release is from Namhyung Kim,
  who introduced "set_graph_notrace" filter that lets you run the
  function graph tracer and not trace particular functions and their
  call chain.

  Tom Zanussi added some updates to the ftrace multibuffer tracing that
  made it more consistent with the top level tracing.

  One of the fixes for perf function tracing required an API change in
  RCU; the addition of "rcu_is_watching()".  As Paul McKenney is pushing
  that change in this release too, he gave me a branch that included all
  the changes to get that working, and I pulled that into my tree in
  order to complete the perf function tracing fix"

* tag 'trace-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Add rcu annotation for syscall trace descriptors
  tracing: Do not use signed enums with unsigned long long in fgragh output
  tracing: Remove unused function ftrace_off_permanent()
  tracing: Do not assign filp->private_data to freed memory
  tracing: Add helper function tracing_is_disabled()
  tracing: Open tracer when ftrace_dump_on_oops is used
  tracing: Add support for SOFT_DISABLE to syscall events
  tracing: Make register/unregister_ftrace_command __init
  tracing: Update event filters for multibuffer
  recordmcount.pl: Add support for __fentry__
  ftrace: Have control op function callback only trace when RCU is watching
  rcu: Do not trace rcu_is_watching() functions
  ftrace/x86: skip over the breakpoint for ftrace caller
  trace/trace_stat: use rbtree postorder iteration helper instead of opencoding
  ftrace: Add set_graph_notrace filter
  ftrace: Narrow down the protected area of graph_lock
  ftrace: Introduce struct ftrace_graph_data
  ftrace: Get rid of ftrace_graph_filter_enabled
  tracing: Fix potential out-of-bounds in trace_get_user()
  tracing: Show more exact help information about snapshot
2013-11-16 12:23:18 -08:00
Al Viro
b26d4cd385 consolidate simple ->d_delete() instances
Rename simple_delete_dentry() to always_delete_dentry() and export it.
Export simple_dentry_operations, while we are at it, and get rid of
their duplicates

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-15 22:04:17 -05:00
Linus Torvalds
9073e1a804 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "Usual earth-shaking, news-breaking, rocket science pile from
  trivial.git"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits)
  doc: usb: Fix typo in Documentation/usb/gadget_configs.txt
  doc: add missing files to timers/00-INDEX
  timekeeping: Fix some trivial typos in comments
  mm: Fix some trivial typos in comments
  irq: Fix some trivial typos in comments
  NUMA: fix typos in Kconfig help text
  mm: update 00-INDEX
  doc: Documentation/DMA-attributes.txt fix typo
  DRM: comment: `halve' -> `half'
  Docs: Kconfig: `devlopers' -> `developers'
  doc: typo on word accounting in kprobes.c in mutliple architectures
  treewide: fix "usefull" typo
  treewide: fix "distingush" typo
  mm/Kconfig: Grammar s/an/a/
  kexec: Typo s/the/then/
  Documentation/kvm: Update cpuid documentation for steal time and pv eoi
  treewide: Fix common typo in "identify"
  __page_to_pfn: Fix typo in comment
  Correct some typos for word frequency
  clk: fixed-factor: Fix a trivial typo
  ...
2013-11-15 16:47:22 -08:00
Linus Torvalds
f080480488 Here are the 3.13 KVM changes. There was a lot of work on the PPC
side: the HV and emulation flavors can now coexist in a single kernel
 is probably the most interesting change from a user point of view.
 On the x86 side there are nested virtualization improvements and a
 few bugfixes.  ARM got transparent huge page support, improved
 overcommit, and support for big endian guests.
 
 Finally, there is a new interface to connect KVM with VFIO.  This
 helps with devices that use NoSnoop PCI transactions, letting the
 driver in the guest execute WBINVD instructions.  This includes
 some nVidia cards on Windows, that fail to start without these
 patches and the corresponding userspace changes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJShPAhAAoJEBvWZb6bTYbyl48P/297GgmELHAGBgjvb6q7yyGu
 L8+eHjKbh4XBAkPwyzbvUjuww5z2hM0N3JQ0BDV9oeXlO+zwwCEns/sg2Q5/NJXq
 XxnTeShaKnp9lqVBnE6G9rAOUWKoyLJ2wItlvUL8JlaO9xJ0Vmk0ta4n2Nv5GqDp
 db6UD7vju6rHtIAhNpvvAO51kAOwc01xxRixCVb7KUYOnmO9nvpixzoI/S0Rp1gu
 w/OWMfCosDzBoT+cOe79Yx1OKcpaVW94X6CH1s+ShCw3wcbCL2f13Ka8/E3FIcuq
 vkZaLBxio7vjUAHRjPObw0XBW4InXEbhI1DjzIvm8dmc4VsgmtLQkTCG8fj+jINc
 dlHQUq6Do+1F4zy6WMBUj8tNeP1Z9DsABp98rQwR8+BwHoQpGQBpAxW0TE0ZMngC
 t1caqyvjZ5pPpFUxSrAV+8Kg4AvobXPYOim0vqV7Qea07KhFcBXLCfF7BWdwq/Jc
 0CAOlsLL4mHGIQWZJuVGw0YGP7oATDCyewlBuDObx+szYCoV4fQGZVBEL0KwJx/1
 7lrLN7JWzRyw6xTgJ5VVwgYE1tUY4IFQcHu7/5N+dw8/xg9KWA3f4PeMavIKSf+R
 qteewbtmQsxUnvuQIBHLs8NRWPnBPy+F3Sc2ckeOLIe4pmfTte6shtTXcLDL+LqH
 NTmT/cfmYp2BRkiCfCiS
 =rWNf
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM changes from Paolo Bonzini:
 "Here are the 3.13 KVM changes.  There was a lot of work on the PPC
  side: the HV and emulation flavors can now coexist in a single kernel
  is probably the most interesting change from a user point of view.

  On the x86 side there are nested virtualization improvements and a few
  bugfixes.

  ARM got transparent huge page support, improved overcommit, and
  support for big endian guests.

  Finally, there is a new interface to connect KVM with VFIO.  This
  helps with devices that use NoSnoop PCI transactions, letting the
  driver in the guest execute WBINVD instructions.  This includes some
  nVidia cards on Windows, that fail to start without these patches and
  the corresponding userspace changes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (146 commits)
  kvm, vmx: Fix lazy FPU on nested guest
  arm/arm64: KVM: PSCI: propagate caller endianness to the incoming vcpu
  arm/arm64: KVM: MMIO support for BE guest
  kvm, cpuid: Fix sparse warning
  kvm: Delete prototype for non-existent function kvm_check_iopl
  kvm: Delete prototype for non-existent function complete_pio
  hung_task: add method to reset detector
  pvclock: detect watchdog reset at pvclock read
  kvm: optimize out smp_mb after srcu_read_unlock
  srcu: API for barrier after srcu read unlock
  KVM: remove vm mmap method
  KVM: IOMMU: hva align mapping page size
  KVM: x86: trace cpuid emulation when called from emulator
  KVM: emulator: cleanup decode_register_operand() a bit
  KVM: emulator: check rex prefix inside decode_register()
  KVM: x86: fix emulation of "movzbl %bpl, %eax"
  kvm_host: typo fix
  KVM: x86: emulate SAHF instruction
  MAINTAINERS: add tree for kvm.git
  Documentation/kvm: add a 00-INDEX file
  ...
2013-11-15 13:51:36 +09:00
Linus Torvalds
ce6513f758 Mainly boring here, too. rmmod --wait finally removed, though.
Cheers,
 Rusty.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJSe3ngAAoJENkgDmzRrbjxqkMP/jFwTIVy+tZbPL36xR4C7UI/
 JZ9JU2c2HTyAtqp/T/bljA0QUDQWUASCfwG5WmRgyvMkEwhfuGrQ3dveQLRq5iKD
 Ln/LIN8JXXijhRr+ywhXLAcp1P5ysSJJYYS5lZTCmJ2Cv9jnAvmUl0KqdTEx+ZNH
 YsWBiI9+WmwhODiAdUlqtThDK37w8OsWeMq2agf97bBERlRYnRZvzwy3tSP2mf5j
 4wx8viOdzPC7NVblyX1cj3gonFFQJtMI4s/e787QzkUpNQjvrN3XecPiQX6aBCX3
 seVjuv6panv1tw1HqyU1KXWo7fs2uCc9mVR5Rr3Zok+8qpKWkj0dyCnF3A+ufsrO
 vlkrFLUsv/U1NUkWJM6mJKzMjKRD4iF702QsEEpNA5rlOsAMMGSSlju4eu6GvadI
 ZJ+ZDaNWUDPbWa9Xgjyp+DKWR6vybNgEHZmLmcCdeLt1u8Th1E/ujsKxv4SN6eIO
 2v+lNPjGEivoNXUX52toRZ1324U3FFzburCSA0c55+r1sjPT6SXCfl8kISSKvVtt
 iFemsDxhaSwqVzqbsx3ztU010Z0f9uVbpZHAQgZ514Uk25HtwhkaQSdiIP+cPXE8
 rClzj9m4gD+Jy0T+P0HjPlSxKCGSlgLiEBWEigX36/F4Isv+GL1HjvrGGCWM4VnO
 lIyw5ux/UH8USct9nH4x
 =xg2p
 -----END PGP SIGNATURE-----

Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull module updates from Rusty Russell:
 "Mainly boring here, too.  rmmod --wait finally removed, though"

* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  modpost: fix bogus 'exported twice' warnings.
  init: fix in-place parameter modification regression
  asmlinkage, module: Make ksymtab and kcrctab symbols and __this_module __visible
  kernel: add support for init_array constructors
  modpost: Optionally ignore secondary errors seen if a single module build fails
  module: remove rmmod --wait option.
2013-11-15 13:27:50 +09:00
Christoph Hellwig
ca5ecd64c2 kernel: fix generic_exec_single indentation
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:22 +09:00
Christoph Hellwig
40c01e8bd5 kernel: provide a __smp_call_function_single stub for !CONFIG_SMP
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:22 +09:00
Christoph Hellwig
0a06ff068f kernel: remove CONFIG_USE_GENERIC_SMP_HELPERS
We've switched over every architecture that supports SMP to it, so
remove the new useless config variable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:22 +09:00
Christoph Hellwig
fc21c0cff2 revert "softirq: Add support for triggering softirq work on softirqs"
This commit was incomplete in that code to remove items from the per-cpu
lists was missing and never acquired a user in the 5 years it has been in
the tree.  We're going to implement what it seems to try to archive in a
simpler way, and this code is in the way of doing so.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:22 +09:00
Peter Zijlstra
539edb5846 mm: properly separate the bloated ptl from the regular case
Use kernel/bounds.c to convert build-time spinlock_t size check into a
preprocessor symbol and apply that to properly separate the page::ptl
situation.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:20 +09:00
Kirill A. Shutemov
e009bb30c8 mm: implement split page table lock for PMD level
The basic idea is the same as with PTE level: the lock is embedded into
struct page of table's page.

We can't use mm->pmd_huge_pte to store pgtables for THP, since we don't
take mm->page_table_lock anymore.  Let's reuse page->lru of table's page
for that.

pgtable_pmd_page_ctor() returns true, if initialization is successful
and false otherwise.  Current implementation never fails, but assumption
that constructor can fail will help to port it to -rt where spinlock_t
is rather huge and cannot be embedded into struct page -- dynamic
allocation is required.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:15 +09:00
Kirill A. Shutemov
e1f56c89b0 mm: convert mm->nr_ptes to atomic_long_t
With split page table lock for PMD level we can't hold mm->page_table_lock
while updating nr_ptes.

Let's convert it to atomic_long_t to avoid races.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15 09:32:14 +09:00
Rafael J. Wysocki
6a0c7cd330 PM / Hibernate: Do not crash kernel in free_basic_memory_bitmaps()
I have received a report about the BUG_ON() in free_basic_memory_bitmaps()
triggering mysteriously during an aborted s2disk hibernation attempt.
The only way I can explain that is that /dev/snapshot was first
opened for writing (resume mode), then closed and then opened again
for reading and closed again without freezing tasks.  In that case
the first invocation of snapshot_open() would set the free_bitmaps
flag in snapshot_state, which is a static variable.  That flag
wouldn't be cleared later and the second invocation of snapshot_open()
would just leave it like that, so the subsequent snapshot_release()
would see data->frozen set and free_basic_memory_bitmaps() would be
called unnecessarily.

To prevent that from happening clear data->free_bitmaps in
snapshot_open() when the file is being opened for reading (hibernate
mode).

In addition to that, replace the BUG_ON() in free_basic_memory_bitmaps()
with a WARN_ON() as the kernel can continue just fine if the condition
checked by that macro occurs.

Fixes: aab1728915 (PM / hibernate: Fix user space driven resume regression)
Reported-by: Oliver Lorenz <olli@olorenz.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: 3.12+ <stable@vger.kernel.org> # 3.12+
2013-11-14 23:26:58 +01:00
Johannes Berg
4534de8305 genetlink: make all genl_ops users const
Now that genl_ops are no longer modified in place when
registering, they can be made const. This patch was done
mostly with spatch:

@@
identifier ops;
@@
+const
 struct genl_ops ops[] = {
 ...
 };

(except the struct thing in net/openvswitch/datapath.c)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-14 17:10:41 -05:00
Johannes Berg
88d36a9949 taskstats: use genl_register_family_with_ops()
This simplifies the code since there's no longer a
need to have error handling in the registration.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-14 17:10:40 -05:00