Commit graph

1768 commits

Author SHA1 Message Date
Arnd Bergmann
402ab352c2 dm ioctl: use nonseekable_open
The dm control device does not implement read/write, so it has no use for
seeking.  Using no_llseek prevents falling back to default_llseek, which
requires the BKL.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-08-12 04:13:57 +01:00
Kiyoshi Ueda
3f77316de0 dm: separate device deletion from dm_put
This patch separates the device deletion code from dm_put()
to make sure the deletion happens in the process context.

By this patch, device deletion always occurs in an ioctl (process)
context and dm_put() can be called in interrupt context.
As a result, the request-based dm's bad dm_put() usage pointed out
by Mikulas below disappears.
    http://marc.info/?l=dm-devel&m=126699981019735&w=2

Without this patch, I confirmed there is a case to crash the system:
    dm_put() => dm_table_destroy() => vfree() => BUG_ON(in_interrupt())

Some more backgrounds and details:
In request-based dm, a device opener can remove a mapped_device
while the last request is still completing, because bios in the last
request complete first and then the device opener can close and remove
the mapped_device before the last request completes:
  CPU0                                          CPU1
  =================================================================
  <<INTERRUPT>>
  blk_end_request_all(clone_rq)
    blk_update_request(clone_rq)
      bio_endio(clone_bio) == end_clone_bio
        blk_update_request(orig_rq)
          bio_endio(orig_bio)
                                                <<I/O completed>>
                                                dm_blk_close()
                                                dev_remove()
                                                  dm_put(md)
                                                    <<Free md>>
   blk_finish_request(clone_rq)
     ....
     dm_end_request(clone_rq)
       free_rq_clone(clone_rq)
       blk_end_request_all(orig_rq)
       rq_completed(md)

So request-based dm used dm_get()/dm_put() to hold md for each I/O
until its request completion handling is fully done.
However, the final dm_put() can call the device deletion code which
must not be run in interrupt context and may cause kernel panic.

To solve the problem, this patch moves the device deletion code,
dm_destroy(), to predetermined places that is actually deleting
the mapped_device in ioctl (process) context, and changes dm_put()
just to decrement the reference count of the mapped_device.
By this change, dm_put() can be used in any context and the symmetric
model below is introduced:
    dm_create():  create a mapped_device
    dm_destroy(): destroy a mapped_device
    dm_get():     increment the reference count of a mapped_device
    dm_put():     decrement the reference count of a mapped_device

dm_destroy() waits for all references of the mapped_device to disappear,
then deletes the mapped_device.

dm_destroy() uses active waiting with msleep(1), since deleting
the mapped_device isn't performance-critical task.
And since at this point, nobody opens the mapped_device and no new
reference will be taken, the pending counts are just for racing
completing activity and will eventually decrease to zero.

For the unlikely case of the forced module unload, dm_destroy_immediate(),
which doesn't wait and forcibly deletes the mapped_device, is also
introduced and used in dm_hash_remove_all().  Otherwise, "rmmod -f"
may be stuck and never return.
And now, because the mapped_device is deleted at this point, subsequent
accesses to the mapped_device may cause NULL pointer references.

Cc: stable@kernel.org
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-08-12 04:13:56 +01:00
Kiyoshi Ueda
98f332855e dm ioctl: release _hash_lock between devices in remove_all
This patch changes dm_hash_remove_all() to release _hash_lock when
removing a device.  After removing the device, dm_hash_remove_all()
takes _hash_lock and searches the hash from scratch again.

This patch is a preparation for the next patch, which changes device
deletion code to wait for md reference to be 0.  Without this patch,
the wait in the next patch may cause AB-BA deadlock:
  CPU0                                CPU1
  -----------------------------------------------------------------------
  dm_hash_remove_all()
    down_write(_hash_lock)
                                      table_status()
                                        md = find_device()
                                               dm_get(md)
                                                 <increment md->holders>
                                        dm_get_live_or_inactive_table()
                                          dm_get_inactive_table()
                                            down_write(_hash_lock)
    <in the md deletion code>
      <wait for md->holders to be 0>

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-08-12 04:13:55 +01:00
Kiyoshi Ueda
abdc568b05 dm: prevent access to md being deleted
This patch prevents access to mapped_device which is being deleted.

Currently, even after a mapped_device has been removed from the hash,
it could be accessed through idr_find() using minor number.
That could cause a race and NULL pointer reference below:
  CPU0                          CPU1
  ------------------------------------------------------------------
  dev_remove(param)
    down_write(_hash_lock)
    dm_lock_for_deletion(md)
      spin_lock(_minor_lock)
      set_bit(DMF_DELETING)
      spin_unlock(_minor_lock)
    __hash_remove(hc)
    up_write(_hash_lock)
                                dev_status(param)
                                  md = find_device(param)
                                         down_read(_hash_lock)
                                         __find_device_hash_cell(param)
                                           dm_get_md(param->dev)
                                             md = dm_find_md(dev)
                                                    spin_lock(_minor_lock)
                                                    md = idr_find(MINOR(dev))
                                                    spin_unlock(_minor_lock)
    dm_put(md)
      free_dev(md)
                                             dm_get(md)
                                         up_read(_hash_lock)
                                  __dev_status(md, param)
                                  dm_put(md)

This patch fixes such problems.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-08-12 04:13:54 +01:00
Peter Rajnoha
856a6f1dbd dm ioctl: return uevent flag after rename
All the dm ioctls that generate uevents set the DM_UEVENT_GENERATED flag so
that userspace knows whether or not to wait for a uevent to be processed
before continuing,

The dm rename ioctl sets this flag but was not structured to return it
to userspace.  This patch restructures the rename ioctl processing to
behave like the other ioctls that return data and so fix this.

Signed-off-by: Peter Rajnoha <prajnoha@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-08-12 04:13:53 +01:00
Alasdair G Kergon
094ea9a071 dm ioctl: make __dev_status void
__dev_status() cannot fail so make it void and simplify callers.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-08-12 04:13:52 +01:00
Peter Rajnoha
6be5449401 dm ioctl: remove __dev_status from geometry and target message
Remove useless __dev_status call while processing an ioctl that sets up
device geometry and target message.  The data is not returned to
userspace so there is no point collecting it and in the case of
target_message it is collected before processing the message so if it
did return it might be stale.

Signed-off-by: Peter Rajnoha <prajnoha@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-08-12 04:13:52 +01:00
Mikulas Patocka
c241104506 dm snapshot: test chunk size against both origin and snapshot
Validate chunk size against both origin and snapshot sector size

Don't allow chunk size smaller than either origin or snapshot logical
sector size. Reading or writing data not aligned to sector size is not
allowed and causes immediate errors.

This requires us to open the origin before initialising the
exception store and to export dm_snap_origin.

Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-08-12 04:13:51 +01:00
Mikulas Patocka
1e5554c842 dm snapshot: iterate origin and cow devices
Iterate both origin and snapshot devices

iterate_devices method should call the callback for all the devices where
the bio may be remapped. Thus, snapshot_iterate_devices should call the callback
for both snapshot and origin underlying devices because it remaps some bios
to the snapshot and some to the origin.

snapshot_iterate_devices called the callback only for the origin device.
This led to badly calculated device limits if snapshot and origin were placed
on different types of disks.

Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-08-12 04:13:50 +01:00
Alasdair G Kergon
6bbf79a140 dm mpath: fix NULL pointer dereference when path parameters missing
multipath_ctr() forgets to return an error after detecting
missing path parameters.  Fix this.

Signed-off-by: Patrick LoPresti <lopresti@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2010-08-12 04:13:49 +01:00
Linus Torvalds
3d30701b58 Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md: (24 commits)
  md: clean up do_md_stop
  md: fix another deadlock with removing sysfs attributes.
  md: move revalidate_disk() back outside open_mutex
  md/raid10: fix deadlock with unaligned read during resync
  md/bitmap:  separate out loading a bitmap from initialising the structures.
  md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log.
  md/bitmap: optimise scanning of empty bitmaps.
  md/bitmap: clean up plugging calls.
  md/bitmap: reduce dependence on sysfs.
  md/bitmap: white space clean up and similar.
  md/raid5: export raid5 unplugging interface.
  md/plug: optionally use plugger to unplug an array during resync/recovery.
  md/raid5: add simple plugging infrastructure.
  md/raid5: export is_congested test
  raid5: Don't set read-ahead when there is no queue
  md: add support for raising dm events.
  md: export various start/stop interfaces
  md: split out md_rdev_init
  md: be more careful setting MD_CHANGE_CLEAN
  md/raid5: ensure we create a unique name for kmem_cache when mddev has no gendisk
  ...
2010-08-10 15:38:19 -07:00
NeilBrown
fd8aa2c181 Merge git://git.infradead.org/users/dwmw2/libraid-2.6 into for-linus 2010-08-10 10:02:33 +10:00
David Woodhouse
2144381da4 Merge branch 'async' of macbook:git/btrfs-unstable
Conflicts:
	drivers/md/Makefile
	lib/raid6/unroll.pl
2010-08-09 10:36:44 +01:00
NeilBrown
6e17b02764 md: clean up do_md_stop
There is only one error exit from do_md_stop, so make that more
explicit and discard the 'err' variable.
Also drop the 'revalidate' variable by moving the unlock calls around.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-08-08 21:22:45 +10:00
NeilBrown
bb4f1e9d0e md: fix another deadlock with removing sysfs attributes.
Move the deletion of sysfs attributes from reconfig_mutex to
open_mutex didn't really help as a process can try to take
open_mutex while holding reconfig_mutex, so the same deadlock can
happen, just requiring one more process to be involved in the chain.

I looks like I cannot easily use locking to wait for the sysfs
deletion to complete, so don't.

The only things that we cannot do while the deletions are still
pending is other things which can change the sysfs namespace: run,
takeover, stop.  Each of these can fail with -EBUSY.
So set a flag while doing a sysfs deletion, and fail run, takeover,
stop if that flag is set.

This is suitable for 2.6.35.x

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2010-08-08 21:21:27 +10:00
Dan Williams
147e0b6a63 md: move revalidate_disk() back outside open_mutex
Commit b821eaa5 "md: remove ->changed and related code" moved
revalidate_disk() under open_mutex, and lockdep noticed.

[ INFO: possible circular locking dependency detected ]
2.6.32-mdadm-locking #1
-------------------------------------------------------
mdadm/3640 is trying to acquire lock:
 (&bdev->bd_mutex){+.+.+.}, at: [<ffffffff811acecb>] revalidate_disk+0x5b/0x90

but task is already holding lock:
 (&mddev->open_mutex){+.+...}, at: [<ffffffffa055e07a>] do_md_stop+0x4a/0x4d0 [md_mod]

which lock already depends on the new lock.

It is suitable for 2.6.35.x

Cc: <stable@kernel.org>
Reported-by: Przemyslaw Czarnowski <przemyslaw.hawrylewicz.czarnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-08-08 21:20:17 +10:00
Arnd Bergmann
6e9624b8ca block: push down BKL into .open and .release
The open and release block_device_operations are currently
called with the BKL held. In order to change that, we must
first make sure that all drivers that currently rely
on this have no regressions.

This blindly pushes the BKL into all .open and .release
operations for all block drivers to prepare for the
next step. The drivers can subsequently replace the BKL
with their own locks or remove it completely when it can
be shown that it is not needed.

The functions blkdev_get and blkdev_put are the only
remaining users of the big kernel lock in the block
layer, besides a few uses in the ioctl code, none
of which need to serialize with blkdev_{get,put}.

Most of these two functions is also under the protection
of bdev->bd_mutex, including the actual calls to
->open and ->release, and the common code does not
access any global data structures that need the BKL.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:25:34 +02:00
FUJITA Tomonori
00fff26539 block: remove q->prepare_flush_fn completely
This removes q->prepare_flush_fn completely (changes the
blk_queue_ordered API).

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:24:15 +02:00
FUJITA Tomonori
144d6ed551 dm: stop using q->prepare_flush_fn
use REQ_FLUSH flag instead.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Alasdair G Kergon <agk@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:24:14 +02:00
Christoph Hellwig
7b6d91daee block: unify flags for struct bio and struct request
Remove the current bio flags and reuse the request flags for the bio, too.
This allows to more easily trace the type of I/O from the filesystem
down to the block driver.  There were two flags in the bio that were
missing in the requests:  BIO_RW_UNPLUG and BIO_RW_AHEAD.  Also I've
renamed two request flags that had a superflous RW in them.

Note that the flags are in bio.h despite having the REQ_ name - as
blkdev.h includes bio.h that is the only way to go for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:20:39 +02:00
Christoph Hellwig
33659ebbae block: remove wrappers for request type/flags
Remove all the trivial wrappers for the cmd_type and cmd_flags fields in
struct requests.  This allows much easier grepping for different request
types instead of unwinding through macros.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:17:56 +02:00
NeilBrown
51e9ac7703 md/raid10: fix deadlock with unaligned read during resync
If the 'bio_split' path in raid10-read is used while
resync/recovery is happening it is possible to deadlock.
Fix this be elevating ->nr_waiting for the duration of both
parts of the split request.

This fixes a bug that has been present since 2.6.22
but has only started manifesting recently for unknown reasons.
It is suitable for and -stable since then.

Reported-by:  Justin Bronder <jsbronder@gentoo.org>
Tested-by:  Justin Bronder <jsbronder@gentoo.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
2010-08-07 21:17:00 +10:00
NeilBrown
69e51b449d md/bitmap: separate out loading a bitmap from initialising the structures.
dm makes this distinction between ->ctr and ->resume, so we need to
too.

Also get the new bitmap_load to clear out the bitmap first, as this is
most consistent with the dm suspend/resume approach

Signed-off-by: NeilBrown <neilb@suse.de>
2010-07-26 13:21:34 +10:00
NeilBrown
e384e58549 md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log.
This allows md/raid5 to fully work as a dm target.

Normally md uses a 'filemap' which contains a list of pages of bits
each of which may be written separately.
dm-log uses and all-or-nothing approach to writing the log, so
when using a dm-log, ->filemap is NULL and the flags normally stored
in filemap_attr are stored in ->logattrs instead.



Signed-off-by: NeilBrown <neilb@suse.de>
2010-07-26 13:21:34 +10:00
NeilBrown
ef42567335 md/bitmap: optimise scanning of empty bitmaps.
A bitmap is stored as one page per 2048 bits.
If none of the bits are set, the page is not allocated.

When bitmap_get_counter finds that a page isn't allocate,
it just reports that one bit work of space isn't flagged,
rather than reporting that 2048 bits worth of space are
unflagged.
This can cause searches for flagged bits (e.g. bitmap_close_sync)
to do more work than is really necessary.

So change bitmap_get_counter (when creating) to report a number of
blocks that more accurately reports the range of the device for which
no counter currently exists.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-07-26 13:21:32 +10:00
NeilBrown
b63d7c2e29 md/bitmap: clean up plugging calls.
1/ use md_unplug in bitmap.c as we will soon be using bitmaps under
  arrays with no queue attached.

2/ Don't bother plugging the queue when we set a bit in the bitmap.
   The reason for this was to encourage as many bits as possible to
   get set before we unplug and write stuff out.
   However every personality already plugs the queue after
   bitmap_startwrite either directly (raid1/raid10) or be setting
   STRIPE_BIT_DELAY which causes the queue to be plugged later
   (raid5).

Signed-off-by: NeilBrown <neilb@suse.de>
2010-07-26 13:21:32 +10:00
NeilBrown
5ff5afffe6 md/bitmap: reduce dependence on sysfs.
For dm-raid45 we will want to use bitmaps in dm-targets which don't
have entries in sysfs, so cope with the mddev not living in sysfs.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-07-26 13:21:31 +10:00
NeilBrown
ac2f40be46 md/bitmap: white space clean up and similar.
Fixes some whitespace problems
Fixed some checkpatch.pl complaints.
Replaced kmalloc ... memset(0), with kzalloc
Fixed an unlikely memory leak on an error path.
Reformatted a number of 'if/else' sets, sometimes
replacing goto with an else clause.
Removed some old comments and commented-out code.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-07-26 13:07:22 +10:00
NeilBrown
9f7c222001 md/raid5: export raid5 unplugging interface.
Also remove remaining accesses to ->queue and ->gendisk when ->queue
is NULL (As it is in a DM target).

Signed-off-by: NeilBrown <neilb@suse.de>
2010-07-26 12:53:10 +10:00
NeilBrown
252ac5221a md/plug: optionally use plugger to unplug an array during resync/recovery.
If an array doesn't have a 'queue' then md_do_sync cannot
unplug it.
In that case it will have a 'plugger', so make that available
to the mddev, and use it to unplug the array if needed.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-07-26 12:53:08 +10:00
NeilBrown
2ac8740151 md/raid5: add simple plugging infrastructure.
md/raid5 uses the plugging infrastructure provided by the block layer
and 'struct request_queue'.  However when we plug raid5 under dm there
is no request queue so we cannot use that.

So create a similar infrastructure that is much lighter weight and use
it for raid5.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-07-26 12:53:08 +10:00
NeilBrown
11d8a6e371 md/raid5: export is_congested test
the dm module will need this for dm-raid45.

Also only access ->queue->backing_dev_info->congested_fn
if ->queue actually exists.  It won't in a dm target.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-07-26 12:52:29 +10:00
NeilBrown
4a5add4995 raid5: Don't set read-ahead when there is no queue
dm-raid456 does not provide a 'queue' for raid5 to use,
so we must make raid5 stop depending on the queue.

First: read_ahead
dm handles read-ahead adjustment fully in userspace, so
simply don't do any readahead adjustments if there is
no queue.

Also re-arrange code slightly so all the accesses to ->queue are
together.

Finally, move the blk_queue_merge_bvec function into the 'if' as
the ->split_io setting in dm-raid456 has the same effect.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-07-26 12:52:27 +10:00
NeilBrown
768a418db1 md: add support for raising dm events.
dm uses scheduled work to raise events to user-space.
So allow md device to have work_structs and schedule them on an error.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-07-26 12:52:27 +10:00
NeilBrown
390ee602a1 md: export various start/stop interfaces
export entry points for starting and stopping md arrays.
This will be used by a module to make md/raid5 work under
dm.
Also stop calling md_stop_writes from md_stop, as that won't
work well with dm - it will want to call the two separately.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-07-26 12:52:27 +10:00
NeilBrown
e8bb9a839a md: split out md_rdev_init
This functionality will be needed separately in a subsequent patch, so
split it into it's own exported function.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-07-26 12:52:27 +10:00
NeilBrown
676e42d896 md: be more careful setting MD_CHANGE_CLEAN
When MD_CHANGE_CLEAN is set we might block in md_write_start.
So we should only set it when fairly sure that something will clear
it.

There are two places where it is set so as to encourage a metadata
update to record the progress of resync/recovery.  This should only
be done if the internal metadata update mechanisms are in use, which
can be tested by by inspecting '->persistent'.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-07-26 12:52:27 +10:00
NeilBrown
f4be6b43f1 md/raid5: ensure we create a unique name for kmem_cache when mddev has no gendisk
We will shortly allow md devices with no gendisk (they are attached to
a dm-target instead).  That will cause mdname() to return 'mdX'.
There is one place where mdname really needs to be unique: when
creating the name for a slab cache.
So in that case, if there is no gendisk, you the address of the mddev
formatted in HEX to provide a unique name.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-07-26 12:52:26 +10:00
NeilBrown
c41d4ac40d md/raid5: factor out code for changing size of stripe cache.
Separate the actual 'change' code from the sysfs interface
so that it can eventually be called internally.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-07-21 13:28:15 +10:00
NeilBrown
00bcb4ac7e md: reduce dependence on sysfs.
We will want md devices to live as dm targets where sysfs is not
visible.  So allow md to not connect to sysfs.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-07-21 13:27:53 +10:00
NeilBrown
3424bf6a77 md/raid5: don't include 'spare' drives when reshaping to fewer devices.
There are few situations where it would make any sense to add a spare
when reducing the number of devices in an array, but it is
conceivable:  A 6 drive RAID6 with two missing devices could be
reshaped to a 5 drive RAID6, and a spare could become available
just in time for the reshape, but not early enough to have been
recovered first.  'freezing' recovery can make this easy to
do without any races.

However doing such a thing is a bad idea.  md will not record the
partially-recovered state of the 'spare' and when the reshape
finished it will think that the spare is still spare.
Easiest way to avoid this confusion is to simply disallow it.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-06-24 13:36:04 +10:00
NeilBrown
2f11588249 md/raid5: add a missing 'continue' in a loop.
As the comment says, the tail of this loop only applies to devices
that are not fully in sync, so if In_sync was set, we should avoid
the rest of the loop.

This bug will hardly ever cause an actual problem.  The worst it
can do is allow an array to be assembled that is dirty and degraded,
which is not generally a good idea (without warning the sysadmin
first).

This will only happen if the array is RAID4 or a RAID5/6 in an
intermediate state during a reshape and so has one drive that is
all 'parity' - no data - while some other device has failed.

This is certainly possible, but not at all common.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-06-24 13:35:49 +10:00
NeilBrown
415e72d034 md/raid5: Allow recovered part of partially recovered devices to be in-sync
During a recovery of reshape the early part of some devices might be
in-sync while the later parts are not.
We we know we are looking at an early part it is good to treat that
part as in-sync for stripe calculations.

This is particularly important for a reshape which suffers device
failure.  Treating the data as in-sync can mean the difference between
data-safety and data-loss.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-06-24 13:35:39 +10:00
NeilBrown
674806d62f md/raid5: More careful check for "has array failed".
When we are reshaping an array, the device failure combinations
that cause us to decide that the array as failed are more subtle.

In particular, any 'spare' will be fully in-sync in the section
of the array that has already been reshaped, thus failures that
affect only that section are less critical.

So encode this subtlety in a new function and call it as appropriate.

The case that showed this problem was a 4 drive RAID5 to 8 drive RAID6
conversion where the last two devices failed.
This resulted in:

  good good good good incomplete good good failed failed

while converting a 5-drive RAID6 to 8 drive RAID5
The incomplete device causes the whole array to look bad,
bad as it was actually good for the section that had been
converted to 8-drives, all the data was actually safe.

Reported-by: Terry Morris <tbmorris@tbmorris.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-06-24 13:35:27 +10:00
NeilBrown
70fffd0bfa md: Don't update ->recovery_offset when reshaping an array to fewer devices.
When an array is reshaped to have fewer devices, the reshape proceeds
from the end of the devices to the beginning.

If a device happens to be non-In_sync (which is possible but rare)
we would normally update the ->recovery_offset as the reshape
progresses. However that would be wrong as the recover_offset records
that the early part of the device is in_sync, while in fact it would
only be the later part that is in_sync, and in any case the offset
number would be measured from the wrong end of the device.

Relatedly, if after a reshape a spare is discovered to not be
recoverred all the way to the end, not allow spare_active
to incorporate it in the array.

This becomes relevant in the following sample scenario:

A 4 drive RAID5 is converted to a 6 drive RAID6 in a combined
operation.
The RAID5->RAID6 conversion will cause a 5 drive to be included as a
spare, then the 5drive -> 6drive reshape will effectively rebuild that
spare as it progresses.  The 6th drive is treated as in_sync the whole
time as there is never any case that we might consider reading from
it, but must not because there is no valid data.

If we interrupt this reshape part-way through and reverse it to return
to a 5-drive RAID6 (or event a 4-drive RAID5), we don't want to update
the recovery_offset - as that would be wrong - and we don't want to
include that spare as active in the 5-drive RAID6 when the reversed
reshape completed and it will be mostly out-of-sync still.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-06-24 13:35:18 +10:00
NeilBrown
e4e11e385d md/raid5: avoid oops when number of devices is reduced then increased.
The entries in the stripe_cache maintained by raid5 are enlarged
when we increased the number of devices in the array, but not
shrunk when we reduce the number of devices.
So if entries are added after reducing the number of devices, we
much ensure to initialise the whole entry, not just the part that
is currently relevant.  Otherwise if we enlarge the array again,
we will reference uninitialised values.

As grow_buffers/shrink_buffer now want to use a count that is stored
explicity in the raid_conf, they should get it from there rather than
being passed it as a parameter.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-06-24 13:35:02 +10:00
Maciej Trela
049d6c1ef9 md: enable raid4->raid0 takeover
Only level 5 with layout=PARITY_N can be taken over to raid0 now.
Lets allow level 4 either.

Signed-off-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-06-24 13:34:57 +10:00
Maciej Trela
001048a318 md: clear layout after ->raid0 takeover
After takeover from raid5/10 -> raid0 mddev->layout is not cleared.

Signed-off-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-06-24 13:34:45 +10:00
Maciej Trela
f73ea87375 md: fix raid10 takeover: use new_layout for setup_conf
Use mddev->new_layout in setup_conf.
Also use new_chunk, and don't set ->degraded in takeover().  That
gets set in run()

Signed-off-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-06-24 13:33:51 +10:00
NeilBrown
e93f68a1fc md: fix handling of array level takeover that re-arranges devices.
Most array level changes leave the list of devices largely unchanged,
possibly causing one at the end to become redundant.
However conversions between RAID0 and RAID10 need to renumber
all devices (except 0).

This renumbering is currently being done in the ->run method when the
new personality takes over.  However this is too late as the common
code in md.c might already have invalidated some of the devices if
they had a ->raid_disk number that appeared to high.

Moving it into the ->takeover method is too early as the array is
still active at that time and wrong ->raid_disk numbers could cause
confusion.

So add a ->new_raid_disk field to mdk_rdev_s and use it to communicate
the new raid_disk number.
Now the common code knows exactly which devices need to be renumbered,
and which can be invalidated, and can do it all at a convenient time
when the array is suspend.
It can also update some symlinks in sysfs which previously were not be
updated correctly.

Reported-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-06-24 13:33:24 +10:00
Prasanna S. Panchamukhi
0544a21db0 md: raid10: Fix null pointer dereference in fix_read_error()
Such NULL pointer dereference can occur when the driver was fixing the
read errors/bad blocks and the disk was physically removed
causing a system crash. This patch check if the
rcu_dereference() returns valid rdev before accessing it in fix_read_error().

Cc: stable@kernel.org
Signed-off-by: Prasanna S. Panchamukhi <prasanna.panchamukhi@riverbed.com>
Signed-off-by: Rob Becker <rbecker@riverbed.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-06-24 13:31:03 +10:00
NeilBrown
f3b99be19d Restore partition detection of newly created md arrays.
Commit  b821eaa572 broke partition
detection for md arrays.

The logic was almost right.  However if revalidate_disk is called
when the device is not yet open, bdev->bd_disk won't be set, so the
flush_disk() Call will not set bd_invalidated.

So when md_open is called we still need to ensure that
->bd_invalidated gets set.  This is easily done with a call to
check_disk_size_change in the place where the offending commit removed
check_disk_change.  At the important times, the size will have changed
from 0 to non-zero, so check_disk_size_change will set bd_invalidated.

Tested-by: Duncan <1i5t5.duncan@cox.net>
Reported-by: Duncan <1i5t5.duncan@cox.net>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-06-24 13:31:03 +10:00
Akinobu Mita
55af6bb509 md: convert cpu notifier to return encapsulate errno value
By the previous modification, the cpu notifier can return encapsulate
errno value.  This converts the cpu notifiers for raid5.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27 09:12:48 -07:00
Linus Torvalds
e8bebe2f71 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (69 commits)
  fix handling of offsets in cris eeprom.c, get rid of fake on-stack files
  get rid of home-grown mutex in cris eeprom.c
  switch ecryptfs_write() to struct inode *, kill on-stack fake files
  switch ecryptfs_get_locked_page() to struct inode *
  simplify access to ecryptfs inodes in ->readpage() and friends
  AFS: Don't put struct file on the stack
  Ban ecryptfs over ecryptfs
  logfs: replace inode uid,gid,mode initialization with helper function
  ufs: replace inode uid,gid,mode initialization with helper function
  udf: replace inode uid,gid,mode init with helper
  ubifs: replace inode uid,gid,mode initialization with helper function
  sysv: replace inode uid,gid,mode initialization with helper function
  reiserfs: replace inode uid,gid,mode initialization with helper function
  ramfs: replace inode uid,gid,mode initialization with helper function
  omfs: replace inode uid,gid,mode initialization with helper function
  bfs: replace inode uid,gid,mode initialization with helper function
  ocfs2: replace inode uid,gid,mode initialization with helper function
  nilfs2: replace inode uid,gid,mode initialization with helper function
  minix: replace inode uid,gid,mode init with helper
  ext4: replace inode uid,gid,mode init with helper
  ...

Trivial conflict in fs/fs-writeback.c (mark bitfields unsigned)
2010-05-21 19:37:45 -07:00
NeilBrown
19fdb9eefb Merge commit '3ff195b011d7decf501a4d55aeed312731094796' into for-linus
Conflicts:
	drivers/md/md.c

- Resolved conflict in md_update_sb
- Added extra 'NULL' arg to new instance of sysfs_get_dirent.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-22 08:31:36 +10:00
Christoph Hellwig
8018ab0574 sanitize vfs_fsync calling conventions
Now that the last user passing a NULL file pointer is gone we can remove
the redundant dentry argument and associated hacks inside vfs_fsynmc_range.

The next step will be removig the dentry argument from ->fsync, but given
the luck with the last round of method prototype changes I'd rather
defer this until after the main merge window.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-21 18:31:21 -04:00
Eric W. Biederman
3ff195b011 sysfs: Implement sysfs tagged directory support.
The problem.  When implementing a network namespace I need to be able
to have multiple network devices with the same name.  Currently this
is a problem for /sys/class/net/*, /sys/devices/virtual/net/*, and
potentially a few other directories of the form /sys/ ... /net/*.

What this patch does is to add an additional tag field to the
sysfs dirent structure.  For directories that should show different
contents depending on the context such as /sys/class/net/, and
/sys/devices/virtual/net/ this tag field is used to specify the
context in which those directories should be visible.  Effectively
this is the same as creating multiple distinct directories with
the same name but internally to sysfs the result is nicer.

I am calling the concept of a single directory that looks like multiple
directories all at the same path in the filesystem tagged directories.

For the networking namespace the set of directories whose contents I need
to filter with tags can depend on the presence or absence of hotplug
hardware or which modules are currently loaded.  Which means I need
a simple race free way to setup those directories as tagged.

To achieve a reace free design all tagged directories are created
and managed by sysfs itself.

Users of this interface:
- define a type in the sysfs_tag_type enumeration.
- call sysfs_register_ns_types with the type and it's operations
- sysfs_exit_ns when an individual tag is no longer valid

- Implement mount_ns() which returns the ns of the calling process
  so we can attach it to a sysfs superblock.
- Implement ktype.namespace() which returns the ns of a syfs kobject.

Everything else is left up to sysfs and the driver layer.

For the network namespace mount_ns and namespace() are essentially
one line functions, and look to remain that.

Tags are currently represented a const void * pointers as that is
both generic, prevides enough information for equality comparisons,
and is trivial to create for current users, as it is just the
existing namespace pointer.

The work needed in sysfs is more extensive.  At each directory
or symlink creating I need to check if the directory it is being
created in is a tagged directory and if so generate the appropriate
tag to place on the sysfs_dirent.  Likewise at each symlink or
directory removal I need to check if the sysfs directory it is
being removed from is a tagged directory and if so figure out
which tag goes along with the name I am deleting.

Currently only directories which hold kobjects, and
symlinks are supported.  There is not enough information
in the current file attribute interfaces to give us anything
to discriminate on which makes it useless, and there are
no potential users which makes it an uninteresting problem
to solve.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-21 09:37:31 -07:00
NeilBrown
be6800a73a md: don't insist on valid event count for spare devices.
Devices which know that they are spares do not really need to have
an event count that matches the rest of the array, so there are no
data-in-sync issues. It is enough that the uuid matches.
So remove the requirement that the event count is up-to-date.

We currently still write out and event count on spares, but this
allows us in a year or 3 to stop doing that completely.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:28:01 +10:00
NeilBrown
a8707c08f4 md: simplify updating of event count to sometimes avoid updating spares.
When updating the event count for a simple clean <-> dirty transition,
we try to avoid updating the spares so they can safely spin-down.
As the event_counts across an array must be +/- 1, this means
decrementing the event_count on a dirty->clean transition.
This is not always safe and we have to avoid the unsafe time.
We current do this with a misguided idea about it being safe or
not depending on whether the event_count is odd or even.  This
approach only works reliably in a few common instances, but easily
falls down.

So instead, simply keep internal state concerning whether it is safe
or not, and always assume it is not safe when an array is first
assembled.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:28:01 +10:00
Gabriele A. Trombetti
7b0bb5368a md/raid6: Fix raid-6 read-error correction in degraded state
Fix: Raid-6 was not trying to correct a read-error when in
singly-degraded state and was instead dropping one more device, going to
doubly-degraded state. This patch fixes this behaviour.

Tested-by: Janos Haar <janos.haar@netcenter.hu>
Signed-off-by: Gabriele A. Trombetti <g.trombetti.lkrnl1213@logicschema.com>
Reported-by: Janos Haar <janos.haar@netcenter.hu>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
2010-05-18 15:28:00 +10:00
NeilBrown
75a73a29e5 md: restore ability of spare drives to spin down.
Some time ago we stopped the clean/active metadata updates
from being written to a 'spare' device in most cases so that
it could spin down and say spun down.  Device failure/removal
etc are still recorded on spares.

However commit 51d5668cb2 broke this 50% of the time,
depending on whether the event count is even or odd.
The change log entry said:

   This means that the alignment between 'odd/even' and
    'clean/dirty' might take a little longer to attain,

how ever the code makes no attempt to create that alignment, so it
could take arbitrarily long.

So when we find that clean/dirty is not aligned with odd/even,
force a second metadata-update immediately.  There are already cases
where a second metadata-update is needed immediately (e.g. when a
device fails during the metadata update).  We just piggy-back on that.

Reported-by: Joe Bryant <tenminjoe@yahoo.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
2010-05-18 15:28:00 +10:00
NeilBrown
af3a2cd6b8 md: Fix read balancing in RAID1 and RAID10 on drives > 2TB
read_balance uses a "unsigned long" for a sector number which
will get truncated beyond 2TB.
This will cause read-balancing to be non-optimal, and can cause
data to be read from the 'wrong' branch during a resync.  This has a
very small chance of returning wrong data.

Reported-by: Jordan Russell <jr-list-2010@quo.to>
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:28:00 +10:00
NeilBrown
2dc40f8094 md/linear: standardise all printk messages
md/linear:mdname:

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:59 +10:00
NeilBrown
b5a20961f3 md/raid0: tidy up printk messages.
All messages now start
   md/raid0:md-device-name:

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:59 +10:00
NeilBrown
128595ed6f md/raid10: tidy up printk messages.
All raid10 printk messages now start
   md/raid10:md-device-name:

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:59 +10:00
NeilBrown
9dd1e2faf7 md/raid1: improve printk messages
Make sure the array name is included in a uniform way in all printk
messages.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:59 +10:00
NeilBrown
0c55e02259 md/raid5: improve consistency of error messages.
Many 'printk' messages from the raid456 module mention 'raid5' even
though it may be a 'raid6' or even 'raid4' array.  This can cause
confusion.
Also the actual array name is not always reported and when it is
it is not reported consistently.

So change all the messages to start:
    md/raid:%s:
where '%s' becomes e.g. md3 to identify the particular array.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:58 +10:00
NeilBrown
08fb730ca3 md: remove EXPERIMENTAL designation from RAID10
RAID10 has been available for quite a while now and is quite well
tested, so we can remove the EXPERIMENTAL designation.

Reported-by: Eric MSP Veith <eveith@wwweb-library.net>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:58 +10:00
Dan Williams
f2859af671 md: allow integers to be passed to md/level
e.g. allow md to interpret 'echo 4 > md/level' as a request for raid4.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2010-05-18 15:27:58 +10:00
Dan Williams
bb7f8d2217 md: notify mdstat waiters of level change
Level modifications change the output of mdstat.  The mdmon manager
thread is interested in these events for external metadata management.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2010-05-18 15:27:57 +10:00
Dan Williams
f1b29bcae1 md/raid4: permit raid0 takeover
For consistency allow raid4 to takeover raid0 in addition to raid5 (with a
raid4 layout).

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2010-05-18 15:27:57 +10:00
NeilBrown
e555190d82 md/raid1: delay reads that could overtake behind-writes.
When a raid1 array is configured to support write-behind
on some devices, it normally only reads from other devices.
If all devices are write-behind (because the rest have failed)
it is possible for a read request to be serviced before a
behind-write request, which would appear as data corruption.

So when forced to read from a WriteMostly device, wait for any
write-behind to complete, and don't start any more behind-writes.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:57 +10:00
NeilBrown
d754c5ae1f md/raid1: fix confusing 'redirect sector' message.
This message seems to suggest the named device is the one on which a
read failed, however it is actually the device that the read will be
redirected to.
So make the message a little clearer.

Reported-by: Tim Burgess <ozburgess@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:56 +10:00
NeilBrown
9e35b99c7e md: don't unregister the thread in mddev_suspend
This is
 - unnecessary because mddev_suspend is always followed by a call to
   ->stop, and each ->stop unregisters the thread, and
 - a problem as it makes it awkwards to suspend and then resume a
   device as we will want later.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:56 +10:00
NeilBrown
fafd7fb052 md: factor out init code for an mddev
This is a simple factorisation that makes mddev_find easier to read.


Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:55 +10:00
NeilBrown
21a52c6d05 md: pass mddev to make_request functions rather than request_queue
We used to pass the personality make_request function direct
to the block layer so the first argument had to be a queue.
But now we have the intermediary md_make_request so it makes
at lot more sense to pass a struct mddev_s.
It makes it possible to have an mddev without its own queue too.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:55 +10:00
NeilBrown
cca9cf90c5 md: call md_stop_writes from md_stop
This moves the call to the other side of set_readonly, but that should
not be an issue.
This encapsulates in 'md_stop' all of the functionality for internally
stopping the array, leaving all the interactions with externalities
(sysfs, request_queue, gendisk) in do_md_stop.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:54 +10:00
NeilBrown
a4bd82d0d0 md: split md_set_readonly out of do_md_stop
Using do_md_stop to set an array to read-only is a little confusing.
Now most of the common code has been factored out, split
md_set_readonly off in to a separate function.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:54 +10:00
NeilBrown
a047e12540 md: factor md_stop_writes out of do_md_stop.
Further refactoring of do_md_stop.
This one requires some explanation as it takes code from different
places in do_md_stop, so some re-ordering happens.

We only get into this part of do_md_stop if there are no active opens
of the device, so no writes can be happening and the device must have
been flushed.  In md_stop_writes we want to stop any internal sources
of writes - i.e. resync - and flush out the metadata.

The only code that was previously before some of this code is
code to clean up the queue, the mddev, the gendisk, or sysfs, all
of which is probably better after code that makes active changes (i.e.
triggers writes).

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:54 +10:00
NeilBrown
6177b472ab md: start to refactor do_md_stop
do_md_stop is large and clunky, so hard to understand.

This is a first step of refactoring, pulling two simple
sub-functions out.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:53 +10:00
NeilBrown
fe60b01428 md: factor do_md_run to separate accesses to ->gendisk
As part of relaxing the binding between an mddev and gendisk,
we separate do_md_run into two functions.
  md_run does all the work internal to md
  do_md_run calls md_run and makes and changes to gendisk
     that are required.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:53 +10:00
NeilBrown
b821eaa572 md: remove ->changed and related code.
We set ->changed to 1 and call check_disk_change at the end
of md_open so that bd_invalidated would be set and thus
partition rescan would happen appropriately.

Now that we call revalidate_disk directly, which sets bd_invalidates,
that indirection is no longer needed and can be removed.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:53 +10:00
NeilBrown
49ce6cea85 md: don't reference gendisk in getgeo
Using ->array_sectors rather than get_capacity() is more
direct and is a step towards relaxing the tight connection
between mddev and gendisk.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:52 +10:00
NeilBrown
490773268c md: move io accounting out of personalities into md_make_request
While I generally prefer letting personalities do as much as possible,
given that we have a central md_make_request anyway we may as well use
it to simplify code.
Also this centralises knowledge of ->gendisk which will help later.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:52 +10:00
NeilBrown
2b7f22284d md/raid5: small tidyup in raid5_align_endio
Diving through ->queue to find mddev is unnecessarily complex - there
is an easier path to finding mddev, so use that.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:50 +10:00
NeilBrown
a78d38a1a1 md: add support for raid5 to raid4 conversion
This is unlikely to be wanted, but we may as well provide it
for completeness.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:49 +10:00
Maciej Trela
5cac7861b2 md: notify level changes through sysfs.
Level changes can be very significant, so make sure
to notify them via sysfs.

Signed-off-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:49 +10:00
NeilBrown
233fca36bb md: Relax checks on ->max_disks when external metadata handling is used.
When metadata is being managed by user-space, md doesn't know
what the maximum number of devices allowed in an array is
so ->max_disks is 0.  In this case we should allow any (+ve)
number of disks.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:49 +10:00
Maciej Trela
b71031076e md: Correctly handle device removal via sysfs
Writing "none" to "../md/dev-xx/slot" removes that device
from being an active part of the array, but it didn't
set ->raid_disk to -1 to record this fact.


Signed-off-by: Maciej Trela <Maciej.Trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:48 +10:00
Trela, Maciej
dab8b29248 md: Add support for Raid0->Raid10 takeover
Signed-off-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:48 +10:00
Trela, Maciej
9af204cf72 md: Add support for Raid5->Raid0 and Raid10->Raid0 takeover
Signed-off-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:48 +10:00
Trela Maciej
54071b3808 md:Add support for Raid0->Raid5 takeover
Signed-off-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:47 +10:00
NeilBrown
84707f38e7 md: don't use mddev->raid_disks in raid0 or raid10 while array is active.
In a subsequent patch we will make it possible to change
mddev->raid_disks while a RAID0 or RAID10 array is active.  This is
part of the process of reshaping such an array.

This means that we cannot use this value while processes requests
(it is OK to use it during initialisation as we are locked against
changes then).
Both RAID0 and RAID10 have the same value stored in the private data
structure, so use that value instead.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:47 +10:00
NeilBrown
c0cc75f84e md: discard StateChanged device flag.
This was needed when sysfs files could only be 'notified'
from process context.  Now that we have sys_notify_direct,
we can call it directly from an interrupt.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:47 +10:00
H Hartley Sweeten
7b92813c3c drivers/md: Remove unnecessary casts of void *
void pointers do not need to be cast to other pointer types.

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:46 +10:00
Paul Clements
696fcd535b md: expose max value of behind writes counter
Keep track of the maximum number of concurrent write-behind requests
for an md array and exposed this number in sysfs at
   md/bitmap/max_backlog_used

Writing any value to this file will clear it.

This allows userspace to be involved in tuning bitmap/backlog.

Signed-off-by: Paul Clements <paul.clements@steeleye.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:46 +10:00
NeilBrown
ee8b81b03d md: remove some dead fields from mddev_s
These fields have never been used.
commit 4b6d287f62
added them, but also added identical files to bitmap_super_s,
and only used the latter.

So remove these unused fields.

Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:45 +10:00
NeilBrown
964147d5c8 md/raid1: fix counting of write targets.
There is a very small race window when writing to a
RAID1 such that if a device is marked faulty at exactly the wrong
time, the write-in-progress will not be sent to the device,
but the bitmap (if present) will be updated to say that
the write was sent.

Then if the device turned out to still be usable as was re-added
to the array, the bitmap-based-resync would skip resyncing that
block, possibly leading to corruption.  This would only be a problem
if no further writes were issued to that area of the device (i.e.
that bitmap chunk).

Suitable for any pending -stable kernel.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-18 15:27:13 +10:00
NeilBrown
a64c876fd3 md: manage redundancy group in sysfs when changing level.
Some levels expect the 'redundancy group' to be present,
others don't.
So when we change level of an array we might need to
add or remove this group.

This requires fixing up the current practice of overloading ->private
to indicate (when ->pers == NULL) that something needs to be removed.
So create a new ->to_remove to fill that role.

When changing levels, we may need to add or remove attributes.  When
changing RAID5 -> RAID6, we both add and remove the same thing.  It is
important to catch this and optimise it out as the removal is delayed
until a lock is released, so trying to add immediately would cause
problems.


Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-17 14:45:40 +10:00
NeilBrown
b6eb127d27 md: remove unneeded sysfs files more promptly
When an array is stopped we need to remove some
sysfs files which are dependent on the type of array.

We need to delay that deletion as deleting them while holding
reconfig_mutex can lead to deadlocks.

We currently delay them until the array is completely destroyed.
However it is possible to deactivate and then reactivate the array.
It is also possible to need to remove sysfs files when changing level,
which can potentially happen several times before an array is
destroyed.

So we need to delete these files more promptly: as soon as
reconfig_mutex is dropped.

We need to ensure this happens before do_md_run can restart the array,
so we use open_mutex for some extra locking.  This is not deadlock
prone.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2010-05-17 14:40:07 +10:00