Commit graph

2576 commits

Author SHA1 Message Date
Mike Snitzer
70d6c400ac dm kcopyd: add WRITE SAME support to dm_kcopyd_zero
Add WRITE SAME support to dm-io and make it accessible to
dm_kcopyd_zero().  dm_kcopyd_zero() provides an asynchronous interface
whereas the blkdev_issue_write_same() interface is synchronous.

WRITE SAME is a SCSI command that can be leveraged for more efficient
zeroing of a specified logical extent of a device which supports it.
Only a single zeroed logical block is transfered to the target for each
WRITE SAME and the target then writes that same block across the
specified extent.

The dm thin target uses this.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:37 +00:00
Mike Snitzer
4f0b70b047 dm linear: add WRITE SAME support
The linear target can already support WRITE SAME requests so signal
this by setting num_write_same_requests to 1.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:37 +00:00
Mike Snitzer
23508a96cd dm: add WRITE SAME support
WRITE SAME bios have a payload that contain a single page.  When
cloning WRITE SAME bios DM has no need to modify the bi_io_vec
attributes (and doing so would be detrimental).  DM need only alter the
start and end of the WRITE SAME bio accordingly.

Rather than duplicate __clone_and_map_discard, factor out a common
function that is also used by __clone_and_map_write_same.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:37 +00:00
Mike Snitzer
d54eaa5a0f dm: prepare to support WRITE SAME
Allow targets to opt in to WRITE SAME support by setting
'num_write_same_requests' in the dm_target structure.

A dm device will only advertise WRITE SAME support if all its
targets and all its underlying devices support it.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:36 +00:00
Mikulas Patocka
9c5091f2ee dm ioctl: use kmalloc if possible
If the parameter buffer is small enough, try to allocate it with kmalloc()
rather than vmalloc().

vmalloc is noticeably slower than kmalloc because it has to manipulate
page tables.

In my tests, on PA-RISC this patch speeds up activation 13 times.
On Opteron this patch speeds up activation by 5%.

This patch introduces a new function free_params() to free the
parameters and this uses new flags that record whether or not vmalloc()
was used and whether or not the input buffer must be wiped after use.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:36 +00:00
Mikulas Patocka
5023e5cf58 dm ioctl: remove PF_MEMALLOC
When allocating memory for the userspace ioctl data, set some
appropriate GPF flags directly instead of using PF_MEMALLOC.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:36 +00:00
Joe Thornber
7960123f2d dm persistent data: improve improve space map block alloc failure message
Improve space map error message when unable to allocate a new
metadata block.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:36 +00:00
Mike Snitzer
c397741c76 dm thin: use DMERR_LIMIT for errors
Throttle all errors logged from the IO path by dm thin.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:34 +00:00
Mike Snitzer
89ddeb8cb1 dm persistent data: use DMERR_LIMIT for errors
Nearly all of persistent-data is in the IO path so throttle error
messages with DMERR_LIMIT to limit the amount logged when
something has gone wrong.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:34 +00:00
Mike Snitzer
a5bd968aeb dm block manager: reinstate message when validator fails
Reinstate a useful error message when the block manager buffer validator fails.
This was mistakenly eliminated when the block manager was converted to use
dm-bufio.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:34 +00:00
Jonathan Brassow
3a0f9aaee0 dm raid: round region_size to power of two
If the user does not supply a bitmap region_size to the dm raid target,
a reasonable size is computed automatically.  If this is not a power of 2,
the md code will report an error later.

This patch catches the problem early and rounds the region_size to the
next power of two.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:33 +00:00
Joe Thornber
2aab38502d dm thin: cleanup dead code
Remove unused @data_block parameter from cell_defer.
Change thin_bio_map to use many returns rather than setting a variable.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:33 +00:00
Joe Thornber
f286ba0eed dm thin: rename cell_defer_except to cell_defer_no_holder
Rename cell_defer_except() to cell_defer_no_holder() which describes
its function more clearly.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:33 +00:00
Mikulas Patocka
9aa0c0e60f dm snapshot: optimize track_chunk
track_chunk is always called with interrupts enabled. Consequently, we
do not need to save and restore interrupt state in "flags" variable.
This patch changes spin_lock_irqsave to spin_lock_irq and
spin_unlock_irqrestore to spin_unlock_irq.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:33 +00:00
Mikulas Patocka
19cbbc60c6 dm raid: use DM_ENDIO_INCOMPLETE
Use a defined macro DM_ENDIO_INCOMPLETE instead of a numeric constant.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:32 +00:00
Mikulas Patocka
7c27213b20 dm raid1: remove impossible mempool_alloc error test
mempool_alloc can't fail if __GFP_WAIT is specified, so the condition
that tests if read_record is non-NULL is always true.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:32 +00:00
Mike Snitzer
018debea8d dm thin: emit ignore_discard in status when discards disabled
If "ignore_discard" is specified when creating the thin pool device then
discard support is disabled for that device.  The pool device's status
should reflect this fact rather than stating "no_discard_passdown"
(which implies discards are enabled but passdown is disabled).

Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:32 +00:00
Joe Thornber
e3cbf94513 dm persistent data: fix nested btree deletion
When deleting nested btrees, the code forgets to delete the innermost
btree.  The thin-metadata code serendipitously compensates for this by
claiming there is one extra layer in the tree.

This patch corrects both problems.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:32 +00:00
Joe Thornber
563af186df dm thin: wake worker when discard is prepared
When discards are prepared it is best to directly wake the worker that
will process them.  The worker will be woken anyway, via periodic
commit, but there is no reason to not wake_worker here.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:31 +00:00
Joe Thornber
e8088073c9 dm thin: fix race between simultaneous io and discards to same block
There is a race when discard bios and non-discard bios are issued
simultaneously to the same block.

Discard support is expensive for all thin devices precisely because you
have to be careful to quiesce the area you're discarding.  DM thin must
handle this conflicting IO pattern (simultaneous non-discard vs discard)
even though a sane application shouldn't be issuing such IO.

The race manifests as follows:

1. A non-discard bio is mapped in thin_bio_map.
   This doesn't lock out parallel activity to the same block.

2. A discard bio is issued to the same block as the non-discard bio.

3. The discard bio is locked in a dm_bio_prison_cell in process_discard
   to lock out parallel activity against the same block.

4. The non-discard bio's mapping continues and its all_io_entry is
   incremented so the bio is accounted for in the thin pool's all_io_ds
   which is a dm_deferred_set used to track time locality of non-discard IO.

5. The non-discard bio is finally locked in a dm_bio_prison_cell in
   process_bio.

The race can result in deadlock, leaving the block layer hanging waiting
for completion of a discard bio that never completes, e.g.:

INFO: task ruby:15354 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ruby            D ffffffff8160f0e0     0 15354  15314 0x00000000
 ffff8802fb08bc58 0000000000000082 ffff8802fb08bfd8 0000000000012900
 ffff8802fb08a010 0000000000012900 0000000000012900 0000000000012900
 ffff8802fb08bfd8 0000000000012900 ffff8803324b9480 ffff88032c6f14c0
Call Trace:
 [<ffffffff814e5a19>] schedule+0x29/0x70
 [<ffffffff814e3d85>] schedule_timeout+0x195/0x220
 [<ffffffffa06b9bc1>] ? _dm_request+0x111/0x160 [dm_mod]
 [<ffffffff814e589e>] wait_for_common+0x11e/0x190
 [<ffffffff8107a170>] ? try_to_wake_up+0x2b0/0x2b0
 [<ffffffff814e59ed>] wait_for_completion+0x1d/0x20
 [<ffffffff81233289>] blkdev_issue_discard+0x219/0x260
 [<ffffffff81233e79>] blkdev_ioctl+0x6e9/0x7b0
 [<ffffffff8119a65c>] block_ioctl+0x3c/0x40
 [<ffffffff8117539c>] do_vfs_ioctl+0x8c/0x340
 [<ffffffff8119a547>] ? block_llseek+0x67/0xb0
 [<ffffffff811756f1>] sys_ioctl+0xa1/0xb0
 [<ffffffff810561f6>] ? sys_rt_sigprocmask+0x86/0xd0
 [<ffffffff814ef099>] system_call_fastpath+0x16/0x1b

The thinp-test-suite's test_discard_random_sectors reliably hits this
deadlock on fast SSD storage.

The fix for this race is that the all_io_entry for a bio must be
incremented whilst the dm_bio_prison_cell is held for the bio's
associated virtual and physical blocks.  That cell locking wasn't
occurring early enough in thin_bio_map.  This patch fixes this.

Care is taken to always call the new function inc_all_io_entry() with
the relevant cells locked, but they are generally unlocked before
calling issue() to try to avoid holding the cells locked across
generic_submit_request.

Also, now that thin_bio_map may lock bios in a cell, process_bio() is no
longer the only thread that will do so.  Because of this we must be sure
to use cell_defer_except() to release all non-holder entries, that
were added by the other thread, because they must be deferred.

This patch depends on "dm thin: replace dm_cell_release_singleton with
cell_defer_except".

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@vger.kernel.org
2012-12-21 20:23:31 +00:00
Joe Thornber
b7ca9c9273 dm thin: replace dm_cell_release_singleton with cell_defer_except
Change existing users of the function dm_cell_release_singleton to share
cell_defer_except instead, and then remove the now-unused function.

Everywhere that calls dm_cell_release_singleton, the bio in question
is the holder of the cell.

If there are no non-holder entries in the cell then cell_defer_except
behaves exactly like dm_cell_release_singleton.  Conversely, if there
*are* non-holder entries then dm_cell_release_singleton must not be used
because those entries would need to be deferred.

Consequently, it is safe to replace use of dm_cell_release_singleton
with cell_defer_except.

This patch is a pre-requisite for "dm thin: fix race between
simultaneous io and discards to same block".

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:31 +00:00
Mike Snitzer
c1a94672a8 dm: disable WRITE SAME
WRITE SAME bios are not yet handled correctly by device-mapper so
disable their use on device-mapper devices by setting
max_write_same_sectors to zero.

As an example, a ciphertext device is incompatible because the data
gets changed according to the location at which it written and so the
dm crypt target cannot support it.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Cc: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:30 +00:00
Alasdair G Kergon
e910d7ebec dm ioctl: prevent unsafe change to dm_ioctl data_size
Abort dm ioctl processing if userspace changes the data_size parameter
after we validated it but before we finished copying the data buffer
from userspace.

The dm ioctl parameters are processed in the following sequence:
 1. ctl_ioctl() calls copy_params();
 2. copy_params() makes a first copy of the fixed-sized portion of the
    userspace parameters into the local variable "tmp";
 3. copy_params() then validates tmp.data_size and allocates a new
    structure big enough to hold the complete data and copies the whole
    userspace buffer there;
 4. ctl_ioctl() reads userspace data the second time and copies the whole
    buffer into the pointer "param";
 5. ctl_ioctl() reads param->data_size without any validation and stores it
    in the variable "input_param_size";
 6. "input_param_size" is further used as the authoritative size of the
    kernel buffer.

The problem is that userspace code could change the contents of user
memory between steps 2 and 4.  In particular, the data_size parameter
can be changed to an invalid value after the kernel has validated it.
This lets userspace force the kernel to access invalid kernel memory.

The fix is to ensure that the size has not changed at step 4.

This patch shouldn't have a security impact because CAP_SYS_ADMIN is
required to run this code, but it should be fixed anyway.

Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@kernel.org
2012-12-21 20:23:30 +00:00
Mikulas Patocka
550929faf8 dm persistent data: rename node to btree_node
This patch fixes a compilation failure on sparc32 by renaming struct node.

struct node is already defined in include/linux/node.h. On sparc32, it
happens to be included through other dependencies and persistent-data
doesn't compile because of conflicting declarations.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-12-21 20:23:30 +00:00
Linus Torvalds
4ccc804586 Single bugfix for raid1/raid10.
Fixes a recently introduced deadlock.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIVAwUAULVcJjnsnt1WYoG5AQKJjw//WXClUwmYi7FD9u8OfqLKsmvJCB8rTQFL
 RQTkm/AGwXMTrmV2iifR7Hpm14wP7pbwwiFNbLv4cw8W8ldt+4PfjCRTyoSwH5HT
 +evDAxmtuTfhznGn9fWCJClmrng0+W0ir3Bmkju+u35orCx97+98Cgv4rVAeYZ0R
 TR59g10y0c1QQuLPXoe3J5iVKuXjrlW4USIGRzkaqKmSZa9LGLETLE9V/RQtcdVD
 HB+uVMEMIGfibWSa918yWRbje8tGYeFyWOrxLs6eS/MdMEWPOdYgesxMErN+8/9q
 ZPYc+wIbGdYfzn8RcuAlE10ZMkS/6eNCn/O5ztBrM9Iyztecv3TKxNzb1S9RHppZ
 ze9d7qfX5kDhwc7YPikPZlnP4CDElDjaPzb0jSyy6FwNzWV45YuC9D5n4xGPOgcC
 83ORlSzMcv6NOFZc8HjrV4NFYE4Dezm0sThFPMEkY2FfLzIztg1H5Q0k0bvfxtqa
 yzCaQtuGjMhsbcLELqHCXFNHFhBVaetuFKAPRnynnkgSDMiZVjmV5/rsapy+qBON
 4BSI7Shwq5jn1xrqVd6ylLic5nkFIGuU7jZ15VftzP3ggQxmSLuq8Q7ZOZ+cXdZ9
 bQTkZyrwsIp8qxBE9DCi33VDDcUNiSvCnTdr18XPAzJ0DQIQ4hhlmLxGGj8BV5f4
 KKOqr3RBP6I=
 =fV2f
 -----END PGP SIGNATURE-----

Merge tag 'md-3.7-fixes' of git://neil.brown.name/md

Pull md bugfix from NeilBrown:
 "Single bugfix for raid1/raid10.

  Fixes a recently introduced deadlock."

* tag 'md-3.7-fixes' of git://neil.brown.name/md:
  md/raid1{,0}: fix deadlock in bitmap_unplug.
2012-12-02 16:24:31 -08:00
NeilBrown
874807a831 md/raid1{,0}: fix deadlock in bitmap_unplug.
If the raid1 or raid10 unplug function gets called
from a make_request function (which is very possible) when
there are bios on the current->bio_list list, then it will not
be able to successfully call bitmap_unplug() and it could
need to submit more bios and wait for them to complete.
But they won't complete while current->bio_list is non-empty.

So detect that case and handle the unplugging off to another thread
just like we already do when called from within the scheduler.

RAID1 version of bug was introduced in 3.6, so that part of fix is
suitable for 3.6.y.  RAID10 part won't apply.

Cc: stable@vger.kernel.org
Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Reported-by: Peter Maloney <peter.maloney@brockmann-consult.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-11-27 12:14:40 +11:00
Linus Torvalds
1d838d70fb Several bug fixes for md in 3.7
- raid5 discard has problems
  - raid10 replacement devices have problems
  - bad block lock seqlock usage has problems
  - dm-raid doesn't free everything
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIVAwUAUK/PfTnsnt1WYoG5AQJlFBAAry6TrfIEed7Sz1BwY0w1Ofd5ZFt6DCN3
 CXc6yi7LQhaMAUYsMcF07BFfuphal0St68vwckFkd1jPShUgruetzsUPLdS1+cql
 AKOQZmJegN+yvpf+N6PxER8z0Ju8M0RNVCvgRZB166ujmoEHGf7A564Hby+FINpZ
 zk1d5eVtcRL05oV0NbeLaX8bNp42nNx2wwvFtM6NEVF4vwbzGzXkC9ePQ6oERJvQ
 Oqsu6F+TzqztIPYk/fbl1Yr/FPVAWXi4dR7KNxs/jHFcnWPi9vKcjjh1jrq46rNy
 xQY+y0xW6FlN0uApIKT6NC3UWutgwOGUqRdCRc4LJ1nT6aHVIn5OCIsipgRrlV0O
 da5pM+rgIMJK3kyT6NjhtuWuQZE4P4OSOmnq5q81VT9XOKADVsFOfibtrIr8cxYS
 c/8mNJVfd+cU58XNKGIEt886DsN+uzWiY8U8HZVckfeVxrBTIPas4ERXlurx+G1D
 jhXqK8TuEfi6ILNdBlWPphAr2ytFqWWpQIGXgYGHEIJp5WaUHoEoEblznl1MiRlZ
 +tYIYy0SRkcZuxs6nUNF8Or5vFidjvaIFJPjIJwSIhwgzkaV+YFad4GfI7/WgWaq
 7VU12MG7UlXLlaGN1Yadvh3jAk7L45DPzWUa/Zgvvtrvvdp3JU7VQhD8d6oc/kxD
 3IOrUdAXWxU=
 =fznK
 -----END PGP SIGNATURE-----

Merge tag 'md-3.7-fixes' of git://neil.brown.name/md

Pull md fixes from NeilBrown:
 "Several bug fixes for md in 3.7:

   - raid5 discard has problems
   - raid10 replacement devices have problems
   - bad block lock seqlock usage has problems
   - dm-raid doesn't free everything"

* tag 'md-3.7-fixes' of git://neil.brown.name/md:
  md/raid10: decrement correct pending counter when writing to replacement.
  md/raid10: close race that lose writes lost when replacement completes.
  md/raid5: Make sure we clear R5_Discard when discard is finished.
  md/raid5: move resolving of reconstruct_state earlier in stripe_handle.
  md/raid5: round discard alignment up to power of 2.
  md: make sure everything is freed when dm-raid stops an array.
  md: Avoid write invalid address if read_seqretry returned true.
  md: Reassigned the parameters if read_seqretry returned true in func md_is_badblock.
2012-11-23 12:11:13 -10:00
Jens Axboe
a8c32a5c98 dm: fix deadlock with request based dm and queue request_fn recursion
Request based dm attempts to re-run the request queue off the
request completion path. If used with a driver that potentially does
end_io from its request_fn, we could deadlock trying to recurse
back into request dispatch. Fix this by punting the request queue
run to kblockd.

Tested to fix a quickly reproducible deadlock in such a scenario.

Cc: stable@kernel.org
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-23 14:32:54 +01:00
NeilBrown
884162df2a md/raid10: decrement correct pending counter when writing to replacement.
When a write to a replacement device completes, we carefully
and correctly found the rdev that the write actually went to
and the blithely called rdev_dec_pending on the primary rdev,
even if this write was to the replacement.

This means that any writes to an array while a replacement
was ongoing would cause the nr_pending count for the primary
device to go negative, so it could never be removed.

This bug has been present since replacement was introduced in
3.3, so it is suitable for any -stable kernel since then.

Reported-by: "George Spelvin" <linux@horizon.com>
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2012-11-22 15:12:42 +11:00
NeilBrown
e7c0c3fa29 md/raid10: close race that lose writes lost when replacement completes.
When a replacement operation completes there is a small window
when the original device is marked 'faulty' and the replacement
still looks like a replacement.  The faulty should be removed and
the replacement moved in place very quickly, bit it isn't instant.

So the code write out to the array must handle the possibility that
the only working device for some slot in the replacement - but it
doesn't.  If the primary device is faulty it just gives up.  This
can lead to corruption.

So make the code more robust: if either  the primary or the
replacement is present and working, write to them.  Only when
neither are present do we give up.

This bug has been present since replacement was introduced in
3.3, so it is suitable for any -stable kernel since then.

Reported-by: "George Spelvin" <linux@horizon.com>
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2012-11-22 15:12:36 +11:00
NeilBrown
ca64cae960 md/raid5: Make sure we clear R5_Discard when discard is finished.
commit 9e44476851
    MD: raid5 avoid unnecessary zero page for trim

change raid5 to clear R5_Discard when the complete request is
handled rather than when submitting the per-device discard request.
However it did not clear R5_Discard for the parity device.

This means that if the stripe_head was reused before it expired from
the cache, the setting would be wrong and a hang would result.

Also if the R5_Uptodate bit happens to be set, R5_Discard again
won't be cleared.  But R5_Uptodate really should be clear at this point.

So make sure R5_Discard is cleared in all cases, and clear
R5_Uptodate when a 'discard' completes.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-11-22 09:14:13 +11:00
NeilBrown
ef5b7c69b7 md/raid5: move resolving of reconstruct_state earlier in
stripe_handle.

The chunk of code in stripe_handle which responds to a
*_result value in reconstruct_state is really the completion
of some processing that happened outside of handle_stripe
(possibly asynchronously) and so should be one of the first
things done in handle_stripe().

After the next patch it will be important that it happens before
handle_stripe_clean_event(), as that will clear some dev->flags
bit that this code tests.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-11-22 09:14:09 +11:00
NeilBrown
4ac6875eeb md/raid5: round discard alignment up to power of 2.
blkdev_issue_discard currently assumes that the granularity
is a power of 2.  So in raid5, round the chosen number up to
avoid embarrassment.

Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-11-20 19:42:56 +11:00
NeilBrown
5eff3c439d md: make sure everything is freed when dm-raid stops an array.
md_stop() would stop an array, but not free various attached
data structures.
For internal arrays, these are freed later in do_md_stop() or
mddev_put(), but they don't apply for dm-raid arrays.
So get md_stop() to free them, and only all it from dm-raid.
For internal arrays we now call __md_stop.

Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-11-20 10:27:37 +11:00
majianpeng
35f9ac2dce md: Avoid write invalid address if read_seqretry returned true.
If read_seqretry returned true and bbp was changed, it will write
invalid address which can cause some serious problem.

This bug was introduced by commit v3.0-rc7-130-g2699b67.
So fix is suitable for 3.0.y thru 3.6.y.

Reported-by: zhuwenfeng@kedacom.com
Tested-by: zhuwenfeng@kedacom.com
Cc: stable@vger.kernel.org
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-11-20 10:27:17 +11:00
majianpeng
ab05613a06 md: Reassigned the parameters if read_seqretry returned true in func md_is_badblock.
This bug was introduced by commit(v3.0-rc7-126-g2230dfe).
So fix is suitable for 3.0.y thru 3.6.y.

Cc: stable@vger.kernel.org
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-11-20 10:27:05 +11:00
Jonathan Brassow
ed30be077e MD RAID10: Fix oops when creating RAID10 arrays via dm-raid.c
Commit 2863b9eb didn't take into account the changes to add TRIM support to
RAID10 (commit 532a2a3fb).  That is, when using dm-raid.c to create the
RAID10 arrays, there is no mddev->gendisk or mddev->queue.  The code added
to support TRIM simply assumes that mddev->queue is available without
checking.  The result is an oops any time dm-raid.c attempts to create a
RAID10 device.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-31 11:42:30 +11:00
NeilBrown
02b898f2f0 md/raid1: Fix assembling of arrays containing Replacements.
setup_conf in raid1.c uses conf->raid_disks before assigning
a value.  It is used when including 'Replacement' devices.

The consequence is that assembling an array which contains a
replacement will misbehave and either not include the replacement, or
not include the device being replaced.

Though this doesn't lead directly to data corruption, it could lead to
reduced data safety.

So use mddev->raid_disks, which is initialised, instead.

Bug was introduced by commit c19d57980b
      md/raid1: recognise replacements when assembling arrays.

in 3.3, so fix is suitable for 3.3.y thru 3.6.y.

Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-31 11:42:03 +11:00
Eric Sandeen
0be1fecd7e md faulty: use disk_stack_limits()
in:
fe86cdce block: do not artificially constrain max_sectors for stacking drivers

max_sectors defaults to UINT_MAX.  md faulty wasn't using
disk_stack_limits(), so inherited this large value as well.
This triggered a bug in XFS when stressed over md_faulty, when
a very large bio_alloc() failed.

That was on an older kernel, and I can't reproduce exactly the
same thing upstream, but I think the fix is appropriate in any
case.

Thanks to Mike Snitzer for pointing out the problem.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-22 10:44:55 +11:00
Linus Torvalds
9db908806b md updates for 3.7
"discard" support, some dm-raid improvements and other assorted
 bits and pieces.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIVAwUAUHk6Rjnsnt1WYoG5AQKovQ//Ym0ROo5a6uekb2USLyFSdQH3TC7z0v0+
 +kujrgoc4nHZU/vj5yfMvPVomEUsAhHEwTkvvCiXFFHn6cxPzC8ezm8d40xEeISX
 qp6i2bPlvGURhsW1tYeD+THtY82/oyzQ4Wa/vaE1sjVLQ+caa2q7kVVgAL9Bj/Kz
 aESIZjAuPxQNE1674/KR0EmMFcbpd0z1WDV+ydKlRV5jHCHGYf8OmxOenJFf+V/b
 /f9p2u+NUq5BN5WLhThcysO8lPX1Y7GG8IYay3DlSt/crU24R2a2j0qh/BDoK8+t
 /DceoHipbIiGxXLVjM7y+1RwPpCh75HJSZQHltPype2Z3iwtwEth9uTkEE3M2h/W
 tOQEbOZku0kcgsrys7JBmpkBwkR9oZqq1kDd4YBzqW4PiGVP6z0JRH8QpjjB+mjN
 47ODYIZcaEYZ+0Jj8kcVxo3gv4Xj4DWH+auSNZihTVmjQPVqrcy3CAt3CkuDzTkY
 34fZVuCDiCetLGCGQKrwfMDnySVy5xOmtC6iWsEY5rExAeb0E+BCzcBvbAXzt+ef
 MPDsrxWbo/ZkvpuwXOwLFTccBuRtAsFi7CM4jcow53W6XMnPpdubphNw5nylaEm1
 DEzfID58mv8VHWRuW15vr7SbtROjYJkEFCIaEK3oprrRUYftZntIABcknqvcIYR+
 /ULNzkRU1w4=
 =XRmL
 -----END PGP SIGNATURE-----

Merge tag 'md-3.7' of git://neil.brown.name/md

Pull md updates from NeilBrown:
 - "discard" support, some dm-raid improvements and other assorted bits
   and pieces.

* tag 'md-3.7' of git://neil.brown.name/md: (29 commits)
  md: refine reporting of resync/reshape delays.
  md/raid5: be careful not to resize_stripes too big.
  md: make sure manual changes to recovery checkpoint are saved.
  md/raid10: use correct limit variable
  md: writing to sync_action should clear the read-auto state.
  Subject: [PATCH] md:change resync_mismatches to atomic64_t to avoid races
  md/raid5: make sure to_read and to_write never go negative.
  md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write.
  md/raid5: protect debug message against NULL derefernce.
  md/raid5: add some missing locking in handle_failed_stripe.
  MD: raid5 avoid unnecessary zero page for trim
  MD: raid5 trim support
  md/bitmap:Don't use IS_ERR to judge alloc_page().
  md/raid1: Don't release reference to device while handling read error.
  raid: replace list_for_each_continue_rcu with new interface
  add further __init annotations to crypto/xor.c
  DM RAID: Fix for "sync" directive ineffectiveness
  DM RAID: Fix comparison of index and quantity for "rebuild" parameter
  DM RAID: Add rebuild capability for RAID10
  DM RAID: Move 'rebuild' checking code to its own function
  ...
2012-10-13 13:22:01 -07:00
Mikulas Patocka
dba141601d dm: store dm_target_io in bio front_pad
Use the recently-added bio front_pad field to allocate struct dm_target_io.

Prior to this patch, dm_target_io was allocated from a mempool. For each
dm_target_io, there is exactly one bio allocated from a bioset.

This patch merges these two allocations into one allocation: we create a
bioset with front_pad equal to the size of dm_target_io so that every
bio allocated from the bioset has sizeof(struct dm_target_io) bytes
before it. We allocate a bio and use the bytes before the bio as
dm_target_io.

_tio_cache is removed and the tio_pool mempool is now only used for
request-based devices.

This idea was introduced by Kent Overstreet.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: tj@kernel.org
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Bill Pemberton <wfp5p@viridian.itc.virginia.edu>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12 21:02:15 +01:00
Mike Snitzer
4f81a41762 dm thin: move bio_prison code to separate module
The bio prison code will be useful to other future DM targets so
move it to a separate module.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12 21:02:13 +01:00
Mike Snitzer
44feb387f6 dm thin: prepare to separate bio_prison code
The bio prison code will be useful to share with future DM targets.

Prepare to move this code into a separate module, adding a dm prefix
to structures and functions that will be exported.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12 21:02:10 +01:00
Mike Snitzer
28eed34e76 dm thin: support discard with non power of two block size
Support discards when the pool's block size is not a power of 2.
The block layer assumes discard_granularity is a power of 2 (in
blkdev_issue_discard), so we set this to the largest power of 2 that is
a divides into the number of sectors in each block, but never less than
DATA_DEV_BLOCK_SIZE_MIN_SECTORS.

This patch eliminates the "Discard support must be disabled when the
block size is not a power of 2" constraint that was imposed in commit
55f2b8b ("dm thin: support for non power of 2 pool blocksize").  That
commit was incomplete: using a block size that is not a power of 2
shouldn't mean disabling discard support on the device completely.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12 21:02:07 +01:00
Wei Yongjun
0bcf08798e dm persistent data: convert to use le32_add_cpu
Convert cpu_to_le32(le32_to_cpu(E1) + E2) to use le32_add_cpu().

dpatch engine is used to auto generate this patch.
(https://github.com/weiyj/dpatch)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12 16:59:47 +01:00
Mikulas Patocka
fe5fe90639 dm: use ACCESS_ONCE for sysfs values
Use the ACCESS_ONCE macro in dm-bufio and dm-verity where a variable
can be modified asynchronously (through sysfs) and we want to prevent
compiler optimizations that assume that the variable hasn't changed.
(See Documentation/atomic_ops.txt.)

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12 16:59:46 +01:00
Wei Yongjun
54499afbb8 dm bufio: use list_move
Use list_move() instead of list_del() + list_add().

spatch with a semantic match was used to find this.
(http://coccinelle.lip6.fr/)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12 16:59:44 +01:00
Wei Yongjun
a71a261f5c dm mpath: fix check for null mpio in end_io fn
The mpio dereference should be moved below the BUG_ON NULL test
in multipath_end_io().

spatch with a semantic match was used to found this.
(http://coccinelle.lip6.fr/)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-10-12 16:59:42 +01:00
NeilBrown
72f36d5972 md: refine reporting of resync/reshape delays.
If 'resync_max' is set to 0 (as is often done when starting a
reshape, so the mdadm can remain in control during a sensitive
period), and if the reshape request is initially delayed because
another array using the same array is resyncing or reshaping etc,
when user-space cannot easily tell when the delay changes from being
due to a conflicting reshape, to being due to resync_max = 0.

So introduce a new state: (curr_resync == 3) to reflect this, make
sure it is visible both via /proc/mdstat and via the "sync_completed"
sysfs attribute, and ensure that the event transition from one delay
state to the other is properly notified.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 14:25:57 +11:00
NeilBrown
e56108d65f md/raid5: be careful not to resize_stripes too big.
When a RAID5 is reshaping, conf->raid_disks is increased
before mddev->delta_disks becomes zero.
This can result in check_reshape calling resize_stripes with a
number that is too large.  This particularly happens
when md_check_recovery calls ->check_reshape().

If we use ->previous_raid_disks, we don't risk this.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-10-11 14:24:13 +11:00