Commit graph

518967 commits

Author SHA1 Message Date
Jens Axboe
ae994ea972 cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL
Commit 9470e4a693 only covered the initial bug report, there are
other spots in CFQ where we need to check that we actually have
a valid cfq_group_data structure.

Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data")
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-20 10:26:50 -06:00
Jens Axboe
9470e4a693 cfq-iosched: fix sysfs oops when attempting to read unconfigured weights
If none of the devices in the system are using CFQ, then attempting to
read:

/sys/fs/cgroup/blkio/blkio.leaf_weight

will results in a NULL dereference. Check for a valid cfq_group_data
struct before attempting to dereference it.

Reported-by: Andrey Wagin <avagin@gmail.com>
Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data")
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-19 10:19:36 -06:00
Jens Axboe
4ceab71b9d cfq-iosched: move group scheduling functions under ifdef
If CFQ_GROUP_IOSCHED is not set, the compiler produces the
following warning:

  CC      block/cfq-iosched.o
  linux/block/cfq-iosched.c:469:2:
    warning: 'cpd_to_cfqgd' defined but not used [-Wunused-function]
    *cpd_to_cfqgd(struct blkcg_policy_data *cpd)
     ^

In reality, two other lookup functions aren't used either if
CFQ_GROUP_IOSCHED isn't set. Move all three under one of the
CFQ_GROUP_IOSCHED sections in the code.

Reported-by: Vladimir Zapolskiy <vz@mleia.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-19 10:13:01 -06:00
Jens Axboe
0bb979472a cfq-iosched: fix the setting of IOPS mode on SSDs
A previous commit wanted to make CFQ default to IOPS mode on
non-rotational storage, however it did so when the queue was
initialized and the non-rotational flag is only set later on
in the probe.

Add an elevator hook that gets called off the add_disk() path,
at that point we know that feature probing has finished, and
we can reliably check for the various flags that drivers can
set.

Fixes: 41c0126b ("block: Make CFQ default to IOPS mode on SSDs")
Tested-by: Romain Francoise <romain@orebokech.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-10 08:01:20 -06:00
Steven Rostedt
ae11f7efda blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file
blktrace.c is currently maintained by Jens Axboe as it is used for
debugging the block layer. That file should be added to the MAINTAINERS
file under BLOCK LAYER, such that people will know to Cc the BLOCK
LAYER maintainer when modifying that file.

Currently:

 $ scripts/get_maintainer.pl -f kernel/trace/blktrace.c
 Steven Rostedt <rostedt@goodmis.org> (maintainer:TRACING)
 Ingo Molnar <mingo@redhat.com> (maintainer:TRACING)
 linux-kernel@vger.kernel.org (open list)

After the patch:

 $ scripts/get_maintainer.pl -f kernel/trace/blktrace.c
 Jens Axboe <axboe@kernel.dk> (maintainer:BLOCK LAYER)
 Steven Rostedt <rostedt@goodmis.org> (maintainer:TRACING)
 Ingo Molnar <mingo@redhat.com> (maintainer:TRACING)
 linux-kernel@vger.kernel.org (open list)

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-09 15:30:50 -06:00
Arianna Avanzini
e48453c386 block, cgroup: implement policy-specific per-blkcg data
The block IO (blkio) controller enables the block layer to provide service
guarantees in a hierarchical fashion. Specifically, service guarantees
are provided by registered request-accounting policies. As of now, a
proportional-share and a throttling policy are available. They are
implemented, respectively, by the CFQ I/O scheduler and the blk-throttle
subsystem. Unfortunately, as for adding new policies, the current
implementation of the block IO controller is only halfway ready to allow
new policies to be plugged in. This commit provides a solution to make
the block IO controller fully ready to handle new policies.
In what follows, we first describe briefly the current state, and then
list the changes made by this commit.

The throttling policy does not need any per-cgroup information to perform
its task. In contrast, the proportional share policy uses, for each cgroup,
both the weight assigned by the user to the cgroup, and a set of dynamically-
computed weights, one for each device.

The first, user-defined weight is stored in the blkcg data structure: the
block IO controller allocates a private blkcg data structure for each
cgroup in the blkio cgroups hierarchy (regardless of which policy is active).
In other words, the block IO controller internally mirrors the blkio cgroups
with private blkcg data structures.

On the other hand, for each cgroup and device, the corresponding dynamically-
computed weight is maintained in the following, different way. For each device,
the block IO controller keeps a private blkcg_gq structure for each cgroup in
blkio. In other words, block IO also keeps one private mirror copy of the blkio
cgroups hierarchy for each device, made of blkcg_gq structures.
Each blkcg_gq structure keeps per-policy information in a generic array of
dynamically-allocated 'dedicated' data structures, one for each registered
policy (so currently the array contains two elements). To be inserted into the
generic array, each dedicated data structure embeds a generic blkg_policy_data
structure. Consider now the array contained in the blkcg_gq structure
corresponding to a given pair of cgroup and device: one of the elements
of the array contains the dedicated data structure for the proportional-share
policy, and this dedicated data structure contains the dynamically-computed
weight for that pair of cgroup and device.

The generic strategy adopted for storing per-policy data in blkcg_gq structures
is already capable of handling new policies, whereas the one adopted with blkcg
structures is not, because per-policy data are hard-coded in the blkcg
structures themselves (currently only data related to the proportional-
share policy).

This commit addresses the above issues through the following changes:
. It generalizes blkcg structures so that per-policy data are stored in the same
  way as in blkcg_gq structures.
  Specifically, it lets also the blkcg structure store per-policy data in a
  generic array of dynamically-allocated dedicated data structures. We will
  refer to these data structures as blkcg dedicated data structures, to
  distinguish them from the dedicated data structures inserted in the generic
  arrays kept by blkcg_gq structures.
  To allow blkcg dedicated data structures to be inserted in the generic array
  inside a blkcg structure, this commit also introduces a new blkcg_policy_data
  structure, which is the equivalent of blkg_policy_data for blkcg dedicated
  data structures.
. It adds to the blkcg_policy structure, i.e., to the descriptor of a policy, a
  cpd_size field and a cpd_init field, to be initialized by the policy with,
  respectively, the size of the blkcg dedicated data structures, and the
  address of a constructor function for blkcg dedicated data structures.
. It moves the CFQ-specific fields embedded in the blkcg data structure (i.e.,
  the fields related to the proportional-share policy), into a new blkcg
  dedicated data structure called cfq_group_data.

Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-07 08:10:08 -06:00
Tahsin Erdogan
41c0126b3f block: Make CFQ default to IOPS mode on SSDs
CFQ idling causes reduced IOPS throughput on non-rotational disks.
Since disk head seeking is not applicable to SSDs, it doesn't really
help performance by anticipating future near-by IO requests.

By turning off idling (and switching to IOPS mode), we allow other
processes to dispatch IO requests down to the driver and so increase IO
throughput.

Following FIO benchmark results were taken on a cloud SSD offering with
idling on and off:

Idling     iops    avg-lat(ms)    stddev            bw
------------------------------------------------------
    On     7054    90.107         38.697     28217KB/s
   Off    29255    21.836         11.730    117022KB/s

fio --name=temp --size=100G --time_based --ioengine=libaio \
    --randrepeat=0 --direct=1 --invalidate=1 --verify=0 \
    --verify_fatal=0 --rw=randread --blocksize=4k --group_reporting=1 \
    --filename=/dev/sdb --runtime=10 --iodepth=64 --numjobs=10

And the following is from a local SSD run:

Idling     iops    avg-lat(ms)    stddev            bw
------------------------------------------------------
    On    19320    33.043         14.068     77281KB/s
   Off    21626    29.465         12.662     86507KB/s

fio --name=temp --size=5G --time_based --ioengine=libaio \
    --randrepeat=0 --direct=1 --invalidate=1 --verify=0 \
    --verify_fatal=0 --rw=randread --blocksize=4k --group_reporting=1 \
    --filename=/fio_data --runtime=10 --iodepth=64 --numjobs=10

Reviewed-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-05 19:21:15 -06:00
Jens Axboe
3f21c265cd block: add blk_set_queue_dying() to blkdev.h
We export this function and NVMe wants to use it, but for some reason
it was never added to the block header. Do that.

Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-05 10:57:37 -06:00
Keith Busch
f26cdc8536 blk-mq: Shared tag enhancements
Storage controllers may expose multiple block devices that share hardware
resources managed by blk-mq. This patch enhances the shared tags so a
low-level driver can access the shared resources not tied to the unshared
h/w contexts. This way the LLD can dynamically add and delete disks and
request queues without having to track all the request_queue hctx's to
iterate outstanding tags.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-01 14:35:56 -06:00
Jens Axboe
e548ca4ee4 block: don't honor chunk sizes for data-less IO
We don't need to honor chunk sizes for IO that doesn't carry any
data.

Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-29 13:11:32 -06:00
Jens Axboe
beefa6ba7b block: only honor SG gap prevention for merges that contain data
We can safely merge anything that wont generate an SG list entry,
so if the bio is data-less (discard), don't look at potential
SG gaps.

Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-29 13:10:23 -06:00
Julia Lawall
f6454b049d block: fix returnvar.cocci warnings
Remove unneeded variable used to store return value.

Generated by: scripts/coccinelle/misc/returnvar.cocci

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-26 14:02:45 -06:00
Christoph Hellwig
5f1b670d0b block, dm: don't copy bios for request clones
Currently dm-multipath has to clone the bios for every request sent
to the lower devices, which wastes cpu cycles and ties down memory.

This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio
to not complete bios attached to a request, which we set on clone
requests similar to bios in a flush sequence.  With this change I/O
errors on a path failure only get propagated to dm-multipath, which
can then either resubmit the I/O or complete the bios on the original
request.

I've done some basic testing of this on a Linux target with ALUA support,
and it survives path failures during I/O nicely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-22 08:58:57 -06:00
Mike Snitzer
326e1dbb57 block: remove management of bi_remaining when restoring original bi_end_io
Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
non-chains") regressed all existing callers that followed this pattern:
 1) saving a bio's original bi_end_io
 2) wiring up an intermediate bi_end_io
 3) restoring the original bi_end_io from intermediate bi_end_io
 4) calling bio_endio() to execute the restored original bi_end_io

The regression was due to BIO_CHAIN only ever getting set if
bio_inc_remaining() is called.  For the above pattern it isn't set until
step 3 above (step 2 would've needed to establish BIO_CHAIN).  As such
the first bio_endio(), in step 2 above, never decremented __bi_remaining
before calling the intermediate bi_end_io -- leaving __bi_remaining with
the value 1 instead of 0.  When bio_inc_remaining() occurred during step
3 it brought it to a value of 2.  When the second bio_endio() was
called, in step 4 above, it should've called the original bi_end_io but
it didn't because there was an extra reference that wasn't dropped (due
to atomic operations being optimized away since BIO_CHAIN wasn't set
upfront).

Fix this issue by removing the __bi_remaining management complexity for
all callers that use the above pattern -- bio_chain() is the only
interface that _needs_ to be concerned with __bi_remaining.  For the
above pattern callers just expect the bi_end_io they set to get called!
Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
that aren't associated with the bio_chain() interface.

Also, the bio_inc_remaining() interface has been moved local to bio.c.

Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-22 08:58:55 -06:00
Ming Lei
b04a5636a6 block: replace trylock with mutex_lock in blkdev_reread_part()
The only possible problem of using mutex_lock() instead of trylock
is about deadlock.

If there aren't any locks held before calling blkdev_reread_part(),
deadlock can't be caused by this conversion.

If there are locks held before calling blkdev_reread_part(),
and if these locks arn't required in open, close handler and I/O
path, deadlock shouldn't be caused too.

Both user space's ioctl(BLKRRPART) and md_setup_drive() from
init/do_mounts_md.c belongs to the 1st case, so the conversion is safe
for the two cases.

For loop, the previous patches in this pathset has fixed the ABBA lock
dependency, so the conversion is OK.

For nbd, tx_lock is held when calling the function:

	- both open and release won't hold the lock
	- when blkdev_reread_part() is run, I/O thread has been stopped
	already, so tx_lock won't be acquired in I/O path at that time.
	- so the conversion won't cause deadlock for nbd

For dasd, both dasd_open(), dasd_release() and request function don't
acquire any mutex/semphone, so the conversion should be safe.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jarod Wilson <jarod@redhat.com>
Acked-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-20 09:05:45 -06:00
Jarod Wilson
be32417796 block: export blkdev_reread_part() and __blkdev_reread_part()
This patch exports blkdev_reread_part() for block drivers, also
introduce __blkdev_reread_part().

For some drivers, such as loop, reread of partitions can be run
from the release path, and bd_mutex may already be held prior to
calling ioctl_by_bdev(bdev, BLKRRPART, 0), so introduce
__blkdev_reread_part for use in such cases.

CC: Christoph Hellwig <hch@lst.de>
CC: Jens Axboe <axboe@kernel.dk>
CC: Tejun Heo <tj@kernel.org>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: Markus Pargmann <mpa@pengutronix.de>
CC: Stefan Weinhuber <wein@de.ibm.com>
CC: Stefan Haberland <stefan.haberland@de.ibm.com>
CC: Sebastian Ott <sebott@linux.vnet.ibm.com>
CC: Fabian Frederick <fabf@skynet.be>
CC: Ming Lei <ming.lei@canonical.com>
CC: David Herrmann <dh.herrmann@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Peter Zijlstra <peterz@infradead.org>
CC: nbd-general@lists.sourceforge.net
CC: linux-s390@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-20 09:05:42 -06:00
Christoph Hellwig
343df3c79c suspend: simplify block I/O handling
Stop abusing struct page functionality and the swap end_io handler, and
instead add a modified version of the blk-lib.c bio_batch helpers.

Also move the block I/O code into swap.c as they are directly tied into
each other.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Tested-by: Ming Lin <mlin@kernel.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Rafael J. Wysocki <rjw@rjwysocki.net>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-19 09:19:59 -06:00
Jens Axboe
b2dbe0a60f block: collapse bio bit space
Various previous patches removed bits and left holes, collapse them
all. Leave the reset start bit where it is, we don't need to change
that.

Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-19 09:18:28 -06:00
Christoph Hellwig
97ca223c3b block: remove unused BIO_RW_BLOCK and BIO_EOF flags
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-19 09:17:05 -06:00
Christoph Hellwig
b25de9d6da block: remove BIO_EOPNOTSUPP
Since the big barrier rewrite/removal in 2007 we never fail FLUSH or
FUA requests, which means we can remove the magic BIO_EOPNOTSUPP flag
to help propagating those to the buffer_head layer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-19 09:17:03 -06:00
Christoph Hellwig
4ecd4fef3a block: use an atomic_t for mq_freeze_depth
lockdep gets unhappy about the not disabling irqs when using the queue_lock
around it.  Instead of trying to fix that up just switch to an atomic_t
and get rid of the lock.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-19 09:12:59 -06:00
Shaohua Li
5b3f341f09 blk-mq: make plug work for mutiple disks and queues
Last patch makes plug work for multiple queue case. However it only
works for single disk case, because it assumes only one request in the
plug list. If a task is accessing multiple disks, eg MD/DM, the
assumption is wrong. Let blk_attempt_plug_merge() record request from
the same queue.

V2: use NULL parameter in !mq case. Fix a bug. Add comments in
blk_attempt_plug_merge to make it less (hopefully) confusion.

Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-08 14:17:23 -06:00
Shaohua Li
f984df1f0f blk-mq: do limited block plug for multiple queue case
plug is still helpful for workload with IO merge, but it can be harmful
otherwise especially with multiple hardware queues, as there is
(supposed) no lock contention in this case and plug can introduce
latency. For multiple queues, we do limited plug, eg plug only if there
is request merge. If a request doesn't have merge with following
request, the requet will be dispatched immediately.

V2: check blk_queue_nomerges() as suggested by Jeff.

Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-08 14:17:21 -06:00
Shaohua Li
239ad215f0 blk-mq: avoid re-initialize request which is failed in direct dispatch
If we directly issue a request and it fails, we use
blk_mq_merge_queue_io(). But we already assigned bio to a request in
blk_mq_bio_to_request. blk_mq_merge_queue_io shouldn't run
blk_mq_bio_to_request again.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-08 14:17:19 -06:00
Jeff Moyer
e6c4438ba7 blk-mq: fix plugging in blk_sq_make_request
The following appears in blk_sq_make_request:

	/*
	 * If we have multiple hardware queues, just go directly to
	 * one of those for sync IO.
	 */

We clearly don't have multiple hardware queues, here!  This comment was
introduced with this commit 07068d5b8e (blk-mq: split make request
handler for multi and single queue):

    We want slightly different behavior from them:

    - On single queue devices, we currently use the per-process plug
      for deferred IO and for merging.

    - On multi queue devices, we don't use the per-process plug, but
      we want to go straight to hardware for SYNC IO.

The old code had this:

        use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync);

and that was converted to:

	use_plug = !is_flush_fua && !is_sync;

which is not equivalent.  For the single queue case, that second half of
the && expression is always true.  So, what I think was actually inteded
follows (and this more closely matches what is done in blk_queue_bio).

V2: delete the 'likely', which should not be a big deal

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-08 14:17:17 -06:00
Shaohua Li
5596d0d591 sched: always use blk_schedule_flush_plug in io_schedule_out
block plug callback could sleep, so we introduce a parameter
'from_schedule' and corresponding drivers can use it to destinguish a
schedule plug flush or a plug finish. Unfortunately io_schedule_out
still uses blk_flush_plug(). This causes below output (Note, I added a
might_sleep() in raid1_unplug to make it trigger faster, but the whole
thing doesn't matter if I add might_sleep). In raid1/10, this can cause
deadlock.

This patch makes io_schedule_out always uses blk_schedule_flush_plug.
This should only impact drivers (as far as I know, raid 1/10) which are
sensitive to the 'from_schedule' parameter.

[  370.817949] ------------[ cut here ]------------
[  370.817960] WARNING: CPU: 7 PID: 145 at ../kernel/sched/core.c:7306 __might_sleep+0x7f/0x90()
[  370.817969] do not call blocking ops when !TASK_RUNNING; state=2 set at [<ffffffff81092fcf>] prepare_to_wait+0x2f/0x90
[  370.817971] Modules linked in: raid1
[  370.817976] CPU: 7 PID: 145 Comm: kworker/u16:9 Tainted: G        W       4.0.0+ #361
[  370.817977] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
[  370.817983] Workqueue: writeback bdi_writeback_workfn (flush-9:1)
[  370.817985]  ffffffff81cd83be ffff8800ba8cb298 ffffffff819dd7af 0000000000000001
[  370.817988]  ffff8800ba8cb2e8 ffff8800ba8cb2d8 ffffffff81051afc ffff8800ba8cb2c8
[  370.817990]  ffffffffa00061a8 000000000000041e 0000000000000000 ffff8800ba8cba28
[  370.817993] Call Trace:
[  370.817999]  [<ffffffff819dd7af>] dump_stack+0x4f/0x7b
[  370.818002]  [<ffffffff81051afc>] warn_slowpath_common+0x8c/0xd0
[  370.818004]  [<ffffffff81051b86>] warn_slowpath_fmt+0x46/0x50
[  370.818006]  [<ffffffff81092fcf>] ? prepare_to_wait+0x2f/0x90
[  370.818008]  [<ffffffff81092fcf>] ? prepare_to_wait+0x2f/0x90
[  370.818010]  [<ffffffff810776ef>] __might_sleep+0x7f/0x90
[  370.818014]  [<ffffffffa0000c03>] raid1_unplug+0xd3/0x170 [raid1]
[  370.818024]  [<ffffffff81421d9a>] blk_flush_plug_list+0x8a/0x1e0
[  370.818028]  [<ffffffff819e3550>] ? bit_wait+0x50/0x50
[  370.818031]  [<ffffffff819e21b0>] io_schedule_timeout+0x130/0x140
[  370.818033]  [<ffffffff819e3586>] bit_wait_io+0x36/0x50
[  370.818034]  [<ffffffff819e31b5>] __wait_on_bit+0x65/0x90
[  370.818041]  [<ffffffff8125b67c>] ? ext4_read_block_bitmap_nowait+0xbc/0x630
[  370.818043]  [<ffffffff819e3550>] ? bit_wait+0x50/0x50
[  370.818045]  [<ffffffff819e3302>] out_of_line_wait_on_bit+0x72/0x80
[  370.818047]  [<ffffffff810935e0>] ? autoremove_wake_function+0x40/0x40
[  370.818050]  [<ffffffff811de744>] __wait_on_buffer+0x44/0x50
[  370.818053]  [<ffffffff8125ae80>] ext4_wait_block_bitmap+0xe0/0xf0
[  370.818058]  [<ffffffff812975d6>] ext4_mb_init_cache+0x206/0x790
[  370.818062]  [<ffffffff8114bc6c>] ? lru_cache_add+0x1c/0x50
[  370.818064]  [<ffffffff81297c7e>] ext4_mb_init_group+0x11e/0x200
[  370.818066]  [<ffffffff81298231>] ext4_mb_load_buddy+0x341/0x360
[  370.818068]  [<ffffffff8129a1a3>] ext4_mb_find_by_goal+0x93/0x2f0
[  370.818070]  [<ffffffff81295b54>] ? ext4_mb_normalize_request+0x1e4/0x5b0
[  370.818072]  [<ffffffff8129ab67>] ext4_mb_regular_allocator+0x67/0x460
[  370.818074]  [<ffffffff81295b54>] ? ext4_mb_normalize_request+0x1e4/0x5b0
[  370.818076]  [<ffffffff8129ca4b>] ext4_mb_new_blocks+0x4cb/0x620
[  370.818079]  [<ffffffff81290956>] ext4_ext_map_blocks+0x4c6/0x14d0
[  370.818081]  [<ffffffff812a4d4e>] ? ext4_es_lookup_extent+0x4e/0x290
[  370.818085]  [<ffffffff8126399d>] ext4_map_blocks+0x14d/0x4f0
[  370.818088]  [<ffffffff81266fbd>] ext4_writepages+0x76d/0xe50
[  370.818094]  [<ffffffff81149691>] do_writepages+0x21/0x50
[  370.818097]  [<ffffffff811d5c00>] __writeback_single_inode+0x60/0x490
[  370.818099]  [<ffffffff811d630a>] writeback_sb_inodes+0x2da/0x590
[  370.818103]  [<ffffffff811abf4b>] ? trylock_super+0x1b/0x50
[  370.818105]  [<ffffffff811abf4b>] ? trylock_super+0x1b/0x50
[  370.818107]  [<ffffffff811d665f>] __writeback_inodes_wb+0x9f/0xd0
[  370.818109]  [<ffffffff811d69db>] wb_writeback+0x34b/0x3c0
[  370.818111]  [<ffffffff811d70df>] bdi_writeback_workfn+0x23f/0x550
[  370.818116]  [<ffffffff8106bbd8>] process_one_work+0x1c8/0x570
[  370.818117]  [<ffffffff8106bb5b>] ? process_one_work+0x14b/0x570
[  370.818119]  [<ffffffff8106c09b>] worker_thread+0x11b/0x470
[  370.818121]  [<ffffffff8106bf80>] ? process_one_work+0x570/0x570
[  370.818124]  [<ffffffff81071868>] kthread+0xf8/0x110
[  370.818126]  [<ffffffff81071770>] ? kthread_create_on_node+0x210/0x210
[  370.818129]  [<ffffffff819e9322>] ret_from_fork+0x42/0x70
[  370.818131]  [<ffffffff81071770>] ? kthread_create_on_node+0x210/0x210
[  370.818132] ---[ end trace 7b4deb71e68b6605 ]---

V2: don't change ->in_iowait

Cc: NeilBrown <neilb@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-08 14:17:15 -06:00
Shaohua Li
dd6cf3e18d blk: clean up plug
Current code looks like inner plug gets flushed with a
blk_finish_plug(). Actually it's a nop. All requests/callbacks are added
to current->plug, while only outmost plug is assigned to current->plug.
So inner plug always has empty request/callback list, which makes
blk_flush_plug_list() a nop. This tries to make the code more clear.

Signed-off-by: Shaohua Li <shli@fb.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-08 14:17:14 -06:00
Christoph Hellwig
9dc6c806b3 nbd: stop using req->cmd
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05 13:40:44 -06:00
Christoph Hellwig
a7928c1578 block: move PM request support to IDE
This removes the request types and hacks from the block code and into the
old IDE driver.  There is a small amunt of code duplication due to this,
but it's not too bad.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05 13:40:42 -06:00
Christoph Hellwig
ac7cdff00a block: remove REQ_TYPE_PM_SHUTDOWN
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05 13:40:10 -06:00
Christoph Hellwig
b0b93b48a3 block: move REQ_TYPE_SENSE to the ide driver
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05 13:40:07 -06:00
Christoph Hellwig
b42171ef7d block: move REQ_TYPE_ATA_TASKFILE and REQ_TYPE_ATA_PC to ide.h
These values are only used by the IDE driver, so move them into it
by allowing drivers to take cmd_type values after the first private
one.  Note that we have to turn cmd_type into a plain unsigned integer
so that gcc doesn't complain about mismatching enum types.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05 13:40:05 -06:00
Christoph Hellwig
4f8c9510ba block: rename REQ_TYPE_SPECIAL to REQ_TYPE_DRV_PRIV
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05 13:40:03 -06:00
Jens Axboe
dac56212e8 bio: skip atomic inc/dec of ->bi_cnt for most use cases
Struct bio has a reference count that controls when it can be freed.
Most uses cases is allocating the bio, which then returns with a
single reference to it, doing IO, and then dropping that single
reference. We can remove this atomic_dec_and_test() in the completion
path, if nobody else is holding a reference to the bio.

If someone does call bio_get() on the bio, then we flag the bio as
now having valid count and that we must properly honor the reference
count when it's being put.

Tested-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05 13:32:49 -06:00
Jens Axboe
c4cf5261f8 bio: skip atomic inc/dec of ->bi_remaining for non-chains
Struct bio has an atomic ref count for chained bio's, and we use this
to know when to end IO on the bio. However, most bio's are not chained,
so we don't need to always introduce this atomic operation as part of
ending IO.

Add a helper to elevate the bi_remaining count, and flag the bio as
now actually needing the decrement at end_io time. Rename the field
to __bi_remaining to catch any current users of this doing the
incrementing manually.

For high IOPS workloads, this reduces the overhead of bio_endio()
substantially.

Tested-by: Robert Elliott <elliott@hp.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05 13:32:47 -06:00
Linus Torvalds
d9cee5d4f6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
 "This fixes a build problem with bcm63xx and yet another fix to the
  memzero_explicit function to ensure that the memset is not elided"

* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  hwrng: bcm63xx - Fix driver compilation
  lib: make memzero_explicit more robust against dead store elimination
2015-05-05 09:03:52 -07:00
Linus Torvalds
c02d7da3dd media fixes for v4.1-rc3
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVSIvdAAoJEAhfPr2O5OEVegMQAI53Jv5ls3uauICyh6Cd+KJW
 PowBTV041kFefzZf/1QzoJRaGFsXJlkK6ZVv1PsTyfnR02z+2r0dFpTRJHtWK+08
 CbbXbWi7wmQXya6mb+c6heWVZa/NAH+DYbDZwrSLUeHgOst13SUMKefLr90kjhBv
 ay08WsL3P9fZMwRUOndE57qGkBZ1w+Irp12VYk+Udt8F1ImTxfcJFulazD5DGWO0
 JjoYzO1mlRlp7wVkhKfQJd/fjCgHWlFGUV4aH6i6N6nh3RtU5X03eNDmGLiHcn/z
 7fHOB2DNtsEDWqmXLpLwdndt9rmFYmV3pk/vjQDaJmNLS+FojUmrPC5xdOe1e2SW
 ikzQr5BgIGnvl4n7SIECzutjX9knxuhtGRU1myiRVBDcyVmR2e9+Gx3xTmo9Lg7I
 mXbxQx7yZHqpHP9cA6Ih8fCAwqqxysKrRUlfHG0J1/G3iVGbVYTnlSTDeTTkdg5k
 2vpAfxbH5iqwCmcvpNSitU2vCNNF8IOOhTI2LvT5OZk9H2cK0YpfWIozK1KSxefT
 8X3NQ5Tt+lqOVCdiA81HNZDpm56lVo3YNNalgesrQCQl1bCCo0WEuf+LqjzgMBtA
 fRdlYq8KDi9sr6xxfomdD9cuS+SnhTiMpqXTersO8bHEiqKqTOee9IgNsfCr81KI
 j/prjUBnL0VgtzmfmbBb
 =gmuD
 -----END PGP SIGNATURE-----

Merge tag 'media/v4.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull media fixes from Mauro Carvalho Chehab:
 "Three driver fixes:

   - fix for omap4, fixing a regression due to a subsystem API that got
     removed for 4.1 (commit efde234674);

   - fix for one of the formats supported by Marvel ccic driver;

   - fix rcar_vin driver that, when stopping abnormally, the driver
     can't return from wait_for_completion"

* tag 'media/v4.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
  [media] v4l: omap4iss: Replace outdated OMAP4 control pad API with syscon
  [media] media: soc_camera: rcar_vin: Fix wait_for_completion
  [media] marvell-ccic: fix Y'CbCr ordering
2015-05-05 08:42:06 -07:00
Álvaro Fernández Rojas
f440c4ee3e hwrng: bcm63xx - Fix driver compilation
- s/clk_didsable_unprepare/clk_disable_unprepare
- s/prov/priv
- s/error/ret (bcm63xx_rng_probe)

Fixes: 6229c16060 ("hwrng: bcm63xx - make use of devm_hwrng_register")
Signed-off-by: Álvaro Fernández Rojas <noltari@gmail.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-05-04 17:49:52 +08:00
Daniel Borkmann
7829fb09a2 lib: make memzero_explicit more robust against dead store elimination
In commit 0b053c9518 ("lib: memzero_explicit: use barrier instead
of OPTIMIZER_HIDE_VAR"), we made memzero_explicit() more robust in
case LTO would decide to inline memzero_explicit() and eventually
find out it could be elimiated as dead store.

While using barrier() works well for the case of gcc, recent efforts
from LLVMLinux people suggest to use llvm as an alternative to gcc,
and there, Stephan found in a simple stand-alone user space example
that llvm could nevertheless optimize and thus elimitate the memset().
A similar issue has been observed in the referenced llvm bug report,
which is regarded as not-a-bug.

Based on some experiments, icc is a bit special on its own, while it
doesn't seem to eliminate the memset(), it could do so with an own
implementation, and then result in similar findings as with llvm.

The fix in this patch now works for all three compilers (also tested
with more aggressive optimization levels). Arguably, in the current
kernel tree it's more of a theoretical issue, but imho, it's better
to be pedantic about it.

It's clearly visible with gcc/llvm though, with the below code: if we
would have used barrier() only here, llvm would have omitted clearing,
not so with barrier_data() variant:

  static inline void memzero_explicit(void *s, size_t count)
  {
    memset(s, 0, count);
    barrier_data(s);
  }

  int main(void)
  {
    char buff[20];
    memzero_explicit(buff, sizeof(buff));
    return 0;
  }

  $ gcc -O2 test.c
  $ gdb a.out
  (gdb) disassemble main
  Dump of assembler code for function main:
   0x0000000000400400  <+0>: lea   -0x28(%rsp),%rax
   0x0000000000400405  <+5>: movq  $0x0,-0x28(%rsp)
   0x000000000040040e <+14>: movq  $0x0,-0x20(%rsp)
   0x0000000000400417 <+23>: movl  $0x0,-0x18(%rsp)
   0x000000000040041f <+31>: xor   %eax,%eax
   0x0000000000400421 <+33>: retq
  End of assembler dump.

  $ clang -O2 test.c
  $ gdb a.out
  (gdb) disassemble main
  Dump of assembler code for function main:
   0x00000000004004f0  <+0>: xorps  %xmm0,%xmm0
   0x00000000004004f3  <+3>: movaps %xmm0,-0x18(%rsp)
   0x00000000004004f8  <+8>: movl   $0x0,-0x8(%rsp)
   0x0000000000400500 <+16>: lea    -0x18(%rsp),%rax
   0x0000000000400505 <+21>: xor    %eax,%eax
   0x0000000000400507 <+23>: retq
  End of assembler dump.

As gcc, clang, but also icc defines __GNUC__, it's sufficient to define
this in compiler-gcc.h only to be picked up. For a fallback or otherwise
unsupported compiler, we define it as a barrier. Similarly, for ecc which
does not support gcc inline asm.

Reference: https://llvm.org/bugs/show_bug.cgi?id=15495
Reported-by: Stephan Mueller <smueller@chronox.de>
Tested-by: Stephan Mueller <smueller@chronox.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Stephan Mueller <smueller@chronox.de>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: mancha security <mancha1@zoho.com>
Cc: Mark Charlebois <charlebm@gmail.com>
Cc: Behan Webster <behanw@converseincode.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-05-04 17:49:51 +08:00
Linus Torvalds
5ebe6afaf0 Linux 4.1-rc2 2015-05-03 19:22:23 -07:00
Linus Torvalds
8663da2c09 Some miscellaneous bug fixes and some final on-disk and ABI changes
for ext4 encryption which provide better security and performance.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJVRsVDAAoJEPL5WVaVDYGj/UUIAI6zLGhq3I8uQLZQC22Ew2Ph
 TPj6eABDuTrB/7QpAu21Dk59N70MQpsBTES6yLWWLf/eHp0gsH7gCNY/C9185vOh
 tQjzw18hRH2IfPftOBrjDlPGbbBD8Gu9jAmpm5kKKOtBuSVbKQ4GeN6BTECkgwlg
 U5EJHJJ5Ahl4MalODFreOE5ZrVC7FWGEpc1y/MquQ0qcGSGlNd35leK5FE2bfHWZ
 M1IJfXH5RRVPUBp26uNvzEg0TtpqkigmCJUT6gOVLfSYBw+lYEbGl4lCflrJmbgt
 8EZh3Q0plsDbNhMzqSvOE4RvsOZ28oMjRNbzxkAaoz/FlatWX2hrfAoI2nqRrKg=
 =Unbp
 -----END PGP SIGNATURE-----

Merge tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Some miscellaneous bug fixes and some final on-disk and ABI changes
  for ext4 encryption which provide better security and performance"

* tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix growing of tiny filesystems
  ext4: move check under lock scope to close a race.
  ext4: fix data corruption caused by unwritten and delayed extents
  ext4 crypto: remove duplicated encryption mode definitions
  ext4 crypto: do not select from EXT4_FS_ENCRYPTION
  ext4 crypto: add padding to filenames before encrypting
  ext4 crypto: simplify and speed up filename encryption
2015-05-03 18:23:53 -07:00
Linus Torvalds
101a6fd387 Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
 "One intel fix, one rockchip fix, and a bunch of radeon fixes for some
  regressions from audio rework and vm stability"

* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
  drm/i915/chv: Implement WaDisableShadowRegForCpd
  drm/radeon: fix userptr return value checking (v2)
  drm/radeon: check new address before removing old one
  drm/radeon: reset BOs address after clearing it.
  drm/radeon: fix lockup when BOs aren't part of the VM on release
  drm/radeon: add SI DPM quirk for Sapphire R9 270 Dual-X 2G GDDR5
  drm/radeon: adjust pll when audio is not enabled
  drm/radeon: only enable audio streams if the monitor supports it
  drm/radeon: only mark audio as connected if the monitor supports it (v3)
  drm/radeon/audio: don't enable packets until the end
  drm/radeon: drop dce6_dp_enable
  drm/radeon: fix ordering of AVI packet setup
  drm/radeon: Use drm_calloc_ab for CS relocs
  drm/rockchip: fix error check when getting irq
  MAINTAINERS: add entry for Rockchip drm drivers
2015-05-03 18:15:48 -07:00
Dave Airlie
71aee81937 Merge tag 'drm-intel-fixes-2015-04-30' of git://anongit.freedesktop.org/drm-intel into drm-fixes
Just a single intel fix
* tag 'drm-intel-fixes-2015-04-30' of git://anongit.freedesktop.org/drm-intel:
  drm/i915/chv: Implement WaDisableShadowRegForCpd
2015-05-04 08:56:47 +10:00
Dave Airlie
df9ebeb2da Merge branch 'drm-next0420' of https://github.com/markyzq/kernel-drm-rockchip into drm-fixes
one fix and maintainers update
* 'drm-next0420' of https://github.com/markyzq/kernel-drm-rockchip:
  drm/rockchip: fix error check when getting irq
  MAINTAINERS: add entry for Rockchip drm drivers
2015-05-04 08:56:27 +10:00
Linus Torvalds
61f06db00e SCSI fixes on 20150503
This is three logical fixes (as 5 patches).  The 3ware class of drivers were
 causing an oops with multiqueue by tearing down the command mappings after
 completing the command (where the variables in the command used to tear down
 the mapping were no-longer valid). There's also a fix for the qnap iscsi
 target which was choking on us sending it commands that were too long and a
 fix for the reworked aha1542 allocating GFP_KERNEL under a lock.
 
 Signed-off-by: James Bottomley <JBottomley@Odin.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABAgAGBQJVRkgEAAoJEDeqqVYsXL0MZbEIAL7Repky0TI1GvyYVpbJLSc+
 SlJurEQ9DIpsnNtHJlEfLHicqsrK3v/xS+3Kopd1OUEIjQ0kTFPkenTiJbwFNIB+
 l7D3V1EfEdnOki7F8KU5bFf3i4KWeEUZ4v/FR3PC7dz4cFRav7OrMoGaA36yT/ns
 I4NFZ7iOa/6QXzfrywTDU5HbHgQYIN2MfCPy/NrVP95Yq09TkN5ulXDz/h6PD6Iy
 GV/RmeckUqkdO5SZq9kkIgR/czLpVCaqpf3/G6lFdfunNkhQJ96lQTzYwbvtPBrJ
 m6+sxrcCWlHzIkJsXrKtopPYzSzGdApLSsZjuYtP3RJD0uO9psfgW8pbls3dDCg=
 =n4aD
 -----END PGP SIGNATURE-----

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "This is three logical fixes (as 5 patches).

  The 3ware class of drivers were causing an oops with multiqueue by
  tearing down the command mappings after completing the command (where
  the variables in the command used to tear down the mapping were
  no-longer valid).  There's also a fix for the qnap iscsi target which
  was choking on us sending it commands that were too long and a fix for
  the reworked aha1542 allocating GFP_KERNEL under a lock"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  3w-9xxx: fix command completion race
  3w-xxxx: fix command completion race
  3w-sas: fix command completion race
  aha1542: Allocate memory before taking a lock
  SCSI: add 1024 max sectors black list flag
2015-05-03 13:22:32 -07:00
Linus Torvalds
3333222484 Merge branch 'next' of git://git.infradead.org/users/vkoul/slave-dma
Pull slave dmaengine fixes from Vinod Koul:
 "Here are the fixes in dmaengine subsystem for rc2:

   - privatecnt fix for slave dma request API by Christopher

   - warn fix for PM ifdef in usb-dmac by Geert

   - fix hardware dependency for xgene by Jean"

* 'next' of git://git.infradead.org/users/vkoul/slave-dma:
  dmaengine: increment privatecnt when using dma_get_any_slave_channel
  dmaengine: xgene: Set hardware dependency
  dmaengine: usb-dmac: Protect PM-only functions to kill warning
2015-05-03 10:49:04 -07:00
Linus Torvalds
180d89f6ef powerpc fixes for 4.1 # 2
- Build fix for SMP=n in book3s_xics.c
 - Fix for Daniel's pci_controller_ops on powernv.
 - Revert the TM syscall abort patch for now.
 - CPU affinity fix from Nathan.
 - Two EEH fixes from Gavin.
 - Fix for CR corruption from Sam.
 - Selftest build fix.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVRKvXAAoJEFHr6jzI4aWAQPwP/jctjzdpbt+/Ra+/f48E4TuP
 cLDqbVJcOV+aC0lflXDBwnORn7qff2zzN6yTUcj9lAkq/ILBY7lY8m/bNvj/C0g1
 yH1Bh6EIjKLyqLKyfFnu+H1DU2s+ROhaAFh9JXhW28j7gU0iSwb7kyBlQ3MP7py4
 8OTbVs1vBKg42SND5FX8JsJG7Vk5v/sNz7WXc2HdtIIWQip4tp95vKvftuCABZgj
 2bMfHF5OXCYd3yalVZGeuiIX3ZAezN9F2GpfFoetCn0118Fkp97pfEVkQ0p64tI7
 xomtzgNXZh9jKFvqqhqlcUDFWEpqr27UjB5/ToWa2YKL4ACrYrgvvo+ifL4qLLtb
 M9itrZVfHElHjA0JSn/hDMdaRKBALcyX1+71rvTpGOMvrdtUY7NaD/h+2jQJ6Cz8
 V8o7uI7SGOdGjWtzNV+bHN+bmhF1MKA1WJXk9a1Pexi+T0vtyZNcTQXr00RVoZJp
 zsrE5cZGwgXkz0tlkNK4Zf5U8xURqZKGZWoCxG4kCkwWPPyZZCWH0HDQtNzxMJXJ
 xrxDTuuF9B/B72xZ6UpVHYlIwYGLEzPz5jtL7r9muxjVEuaewT3NmX+3ZAQZKk/f
 hKMiwHpDSKs36K1Afn8g4ycjfzAy2HyL6TVMvHjO8XG14HVyI+tJ49oeqBTRQLO1
 2ZGZCkjGNJd/R1Ii1qeW
 =S0WU
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux

Pull powerpc fixes from Michael Ellerman:
 - build fix for SMP=n in book3s_xics.c
 - fix for Daniel's pci_controller_ops on powernv.
 - revert the TM syscall abort patch for now.
 - CPU affinity fix from Nathan.
 - two EEH fixes from Gavin.
 - fix for CR corruption from Sam.
 - selftest build fix.

* tag 'powerpc-4.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux:
  powerpc/powernv: Restore non-volatile CRs after nap
  powerpc/eeh: Delay probing EEH device during hotplug
  powerpc/eeh: Fix race condition in pcibios_set_pcie_reset_state()
  powerpc/pseries: Correct cpu affinity for dlpar added cpus
  selftests/powerpc: Fix the pmu install rule
  Revert "powerpc/tm: Abort syscalls in active transactions"
  powerpc/powernv: Fix early pci_controller_ops loading.
  powerpc/kvm: Fix SMP=n build error in book3s_xics.c
2015-05-03 10:28:36 -07:00
Jan Kara
2c869b262a ext4: fix growing of tiny filesystems
The estimate of necessary transaction credits in ext4_flex_group_add()
is too pessimistic. It reserves credit for sb, resize inode, and resize
inode dindirect block for each group added in a flex group although they
are always the same block and thus it is enough to account them only
once. Also the number of modified GDT block is overestimated since we
fit EXT4_DESC_PER_BLOCK(sb) descriptors in one block.

Make the estimation more precise. That reduces number of requested
credits enough that we can grow 20 MB filesystem (which has 1 MB
journal, 79 reserved GDT blocks, and flex group size 16 by default).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2015-05-02 23:58:32 -04:00
Davide Italiano
280227a75b ext4: move check under lock scope to close a race.
fallocate() checks that the file is extent-based and returns
EOPNOTSUPP in case is not. Other tasks can convert from and to
indirect and extent so it's safe to check only after grabbing
the inode mutex.

Signed-off-by: Davide Italiano <dccitaliano@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-05-02 23:21:15 -04:00
Lukas Czerner
d2dc317d56 ext4: fix data corruption caused by unwritten and delayed extents
Currently it is possible to lose whole file system block worth of data
when we hit the specific interaction with unwritten and delayed extents
in status extent tree.

The problem is that when we insert delayed extent into extent status
tree the only way to get rid of it is when we write out delayed buffer.
However there is a limitation in the extent status tree implementation
so that when inserting unwritten extent should there be even a single
delayed block the whole unwritten extent would be marked as delayed.

At this point, there is no way to get rid of the delayed extents,
because there are no delayed buffers to write out. So when a we write
into said unwritten extent we will convert it to written, but it still
remains delayed.

When we try to write into that block later ext4_da_map_blocks() will set
the buffer new and delayed and map it to invalid block which causes
the rest of the block to be zeroed loosing already written data.

For now we can fix this by simply not allowing to set delayed status on
written extent in the extent status tree. Also add WARN_ON() to make
sure that we notice if this happens in the future.

This problem can be easily reproduced by running the following xfs_io.

xfs_io -f -c "pwrite -S 0xaa 4096 2048" \
          -c "falloc 0 131072" \
          -c "pwrite -S 0xbb 65536 2048" \
          -c "fsync" /mnt/test/fff

echo 3 > /proc/sys/vm/drop_caches
xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fff

This can be theoretically also reproduced by at random by running fsx,
but it's not very reliable, though on machines with bigger page size
(like ppc) this can be seen more often (especially xfstest generic/127)

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-05-02 21:36:55 -04:00