156658 commits

Author SHA1 Message Date
Dan Williams
ad643f54c8 ioat1: trim ioat_dma_desc_sw
Save 4 bytes per software descriptor by transmitting tx_cnt in an unused
portion of the hardware descriptor.

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-09-08 17:30:24 -07:00
Dan Williams
345d852391 ioat: __devinit annotate the initialization paths
Mark all single use initialization routines with __devinit.

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-09-08 17:30:24 -07:00
Dan Williams
f6ab95b557 ioat: preserve chanctrl bits when re-arming interrupts
The register write in ioat_dma_cleanup_tasklet is unfortunate in two
ways:
1/ It clears the extra 'enable' bits that we set at alloc_chan_resources time
2/ It gives the impression that it disables interrupts when it is in
   fact re-arming interrupts

[ Impact: fix, persist the value of the chanctrl register when re-arming ]
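
As a rough illustration of the difference between clobbering and preserving a control register on re-arm, here is a minimal sketch. The bit names and positions (CTL_INT_REARM, CTL_ERR_INT_EN, CTL_ANY_ERR_ABORT_EN) and both helper functions are hypothetical, not taken from the driver, and the register is modelled as a plain value rather than an MMIO access.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit names and positions, chosen for illustration only. */
#define CTL_INT_REARM        (1u << 0)
#define CTL_ERR_INT_EN       (1u << 1)
#define CTL_ANY_ERR_ABORT_EN (1u << 2)
#define CTL_KNOWN_BITS \
	(CTL_INT_REARM | CTL_ERR_INT_EN | CTL_ANY_ERR_ABORT_EN)

/* Old pattern: writing only the re-arm bit drops the enable bits. */
static inline uint16_t rearm_clobber(uint16_t chanctrl)
{
	(void)chanctrl;
	return CTL_INT_REARM;
}

/* Preserving pattern: keep whatever was enabled and add the re-arm bit. */
static inline uint16_t rearm_preserve(uint16_t chanctrl)
{
	assert((chanctrl & ~CTL_KNOWN_BITS) == 0);
	return (uint16_t)(chanctrl | CTL_INT_REARM);
}
```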

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-09-08 17:30:24 -07:00
Dan Williams
bb32078630 ioat: ignore reserved bits for chancnt and xfercap
Don't trust that the reserved bits are always zero, also sanity check
the returned value.

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-09-08 17:30:24 -07:00
Dan Williams
4fb9b9e8d5 ioat: cleanup completion status reads
The cleanup path makes an effort to only perform an atomic read of the
64-bit completion address.  However in the 32-bit case it does not
matter if we read the upper-32 and lower-32 non-atomically because the
upper-32 will always be zero.

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-09-08 17:30:24 -07:00
Dan Williams
6df9183a15 ioat: add some dev_dbg() calls
Provide some output for debugging the driver.

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-09-08 17:30:23 -07:00
Dan Williams
38e12f64a1 ioat1: kill unused unmap parameters
The unified ioat1/ioat2 ioat_dma_unmap() implementation derives the
source and dest addresses from the unmap descriptor.  There is no longer
a need to track this information in struct ioat_desc_sw.

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-09-08 17:30:23 -07:00
Dan Williams
5cbafa65b9 ioat2,3: convert to a true ring buffer
Replace the current linked list munged into a ring with a native ring
buffer implementation.  The benefit of this approach is reduced overhead
as many parameters can be derived from ring position with simple pointer
comparisons and descriptor allocation/freeing becomes just a
manipulation of head/tail pointers.

It requires a contiguous allocation for the software descriptor
information.

Since this arrangement is significantly different from the ioat1 chain,
move ioat2,3 support into its own file and header.  Common routines are
exported from drivers/dma/ioat/dma.[ch].
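
To make the head/tail idea concrete, here is a minimal sketch of a power-of-two ring in which allocation and freeing are only index arithmetic. It is not the driver's code: struct ring, RING_ORDER and the ring_* helpers are hypothetical names, and it uses free-running 16-bit indices rather than the pointer comparisons mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical ring: names and size are illustrative, not the driver's. */
#define RING_ORDER 4
#define RING_SIZE  (1u << RING_ORDER)

struct ring {
	uint16_t head; /* free-running index of the next slot to hand out */
	uint16_t tail; /* free-running index of the oldest active slot */
};

/* Number of descriptors currently in flight (wrap-safe). */
static inline uint16_t ring_active(const struct ring *r)
{
	return (uint16_t)(r->head - r->tail);
}

/* Number of free slots. */
static inline uint16_t ring_space(const struct ring *r)
{
	return (uint16_t)(RING_SIZE - ring_active(r));
}

/* "Allocate" n slots: returns the index of the first one, or -1 if full. */
static inline int ring_alloc(struct ring *r, uint16_t n)
{
	int idx;

	if (ring_space(r) < n)
		return -1;
	idx = r->head & (RING_SIZE - 1);
	r->head = (uint16_t)(r->head + n);
	return idx;
}

/* "Free" n completed slots by advancing the tail. */
static inline void ring_free(struct ring *r, uint16_t n)
{
	assert(n <= ring_active(r));
	r->tail = (uint16_t)(r->tail + n);
}
```

Because the ring size is a power of two, masking the free-running index gives the slot position, and head minus tail gives the occupancy even after the indices wrap.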

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-09-08 17:29:55 -07:00
Dan Williams
dcbc853af6 ioat: prepare the code for ioat[12]_dma_chan split
Prepare the code for the conversion of the ioat2 linked-list-ring into a
native ring buffer.  After this conversion ioat2 channels will share
less of the ioat1 infrastructure, but there will still be places where
sharing is possible.  struct ioat_chan_common is created to house the
channel attributes that will remain common between ioat1 and ioat2
channels.

For every routine that accesses both common and hardware specific fields
the old unified 'ioat_chan' pointer is split into an 'ioat' and a 'chan'
pointer, where 'chan' references the common fields and 'ioat' the
hardware/version specific ones.

[ Impact: pure structure member movement/variable renames, no logic changes ]

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-09-08 17:29:55 -07:00
Dan Williams
a6a39ca1ba ioat: fix self test interrupts
If a callback is to be attached to a descriptor the channel needs to
know at ->prep time so it can set the interrupt enable bit.  This is in
preparation for moving ioat2 descriptor preparation from
->submit to ->prep.

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-09-08 17:29:55 -07:00
Dan Williams
a0587bcf3e ioat1: move descriptor allocation from submit to prep
The async_tx api assumes that after a successful ->prep a subsequent
->submit will not fail due to a lack of resources.

This also fixes a bug in the allocation failure case.  Previously the
descriptors allocated prior to the allocation failure would not be
returned to the free list.

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-09-08 17:29:55 -07:00
Dan Williams
c7984f4e4e ioat: define descriptor control bit-field
This cleans up a mess of and'ing and or'ing bit definitions, and allows
simple assignments from the specified dma_ctrl_flags parameter.
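
A minimal sketch of what a control bit-field buys over manual masking follows. The layout is hypothetical: struct desc_ctl, its field names and widths, and the REQ_* request flags are illustrative stand-ins, not the hardware descriptor format or the dmaengine flag values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: field names and widths are illustrative only. */
struct desc_ctl {
	uint32_t int_en:1;   /* raise an interrupt on completion */
	uint32_t fence:1;    /* order against the following descriptor */
	uint32_t null_op:1;  /* descriptor carries no transfer */
	uint32_t reserved:29;
};

/* The control word is assumed to stay exactly 32 bits wide. */
static_assert(sizeof(struct desc_ctl) == sizeof(uint32_t),
	      "control word must stay 32 bits");

/* Hypothetical request flags standing in for the caller's flag word. */
#define REQ_INTERRUPT (1u << 0)
#define REQ_FENCE     (1u << 1)

/* Each field is a plain assignment instead of and/or on a raw word. */
static inline struct desc_ctl desc_ctl_from_flags(unsigned int flags)
{
	struct desc_ctl ctl = {0};

	ctl.int_en = !!(flags & REQ_INTERRUPT);
	ctl.fence = !!(flags & REQ_FENCE);
	return ctl;
}
```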

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-09-08 17:29:55 -07:00
Dan Williams
77867fff03 ioat: fix type mismatch for ->dmacount
->dmacount tracks the sequence number of active descriptors.  It is
written to the DMACOUNT register to update the channel's view of pending
descriptors in the chain.  The register is 16-bits so ->dmacount should
be unsigned and 16-bit as well.  Also modify ->desccount to maintain
alignment.

This was never a problem in practice because we never compared dmacount
values, but this is a bug waiting to happen.
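
The following sketch shows why a counter wider than the register it shadows becomes a problem only once values are compared. struct counters and both helpers are hypothetical; the 16-bit register is modelled by truncation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software-side counters shadowing a 16-bit register. */
struct counters {
	int wide;        /* mismatched type: keeps counting past 65535 */
	uint16_t narrow; /* matches the register width and wraps with it */
};

/* Account for n newly submitted descriptors in both counters. */
static inline void count_submit(struct counters *c, uint16_t n)
{
	assert(n > 0);
	c->wide += n;
	c->narrow = (uint16_t)(c->narrow + n);
}

/* Value a 16-bit register would hold after the same submissions. */
static inline uint16_t reg_value(const struct counters *c)
{
	return (uint16_t)c->wide;
}
```

After the count passes 65535 the narrow counter still equals the modelled register value, while the wide one does not, so any comparison against it would be wrong.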

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-09-08 17:29:54 -07:00
Dan Williams
f2427e276f ioat: split ioat_dma_probe into core/version-specific routines
Towards the removal of ioatdma_device.version split the initialization
path into distinct versions.  This conversion:
1/ moves version specific probe code to version specific routines
2/ removes the need for ioat_device
3/ turns off the ioat1 msi quirk if the device is reinitialized for intx

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-09-08 17:29:54 -07:00
Dan Williams
b31b78f1ab ioat: kill function prototype ifdef guards
The only .c files that utilize these protected prototypes depend on
CONFIG_INTEL_IOATDMA=y, so there is no value gained in providing empty
prototypes.

[ Impact: pure cleanup ]

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-09-08 17:29:54 -07:00
Dan Williams
bc3c702585 ioat: cleanup some long deref chains and 80 column collisions
* reduce device->common. to dma-> in ioat_dma_{probe,remove,selftest}
* ioat_lookup_chan_by_index to ioat_chan_by_index
* multi-line function definitions
* ioat_desc_sw.async_tx to ioat_desc_sw.txd
* desc->txd. to tx-> in cleanup routine

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-09-08 17:29:54 -07:00
Dan Williams
e6c0b69a43 ioat: convert ioat_probe to pcim/devm
The driver currently duplicates much of what these routines offer, so
just use the common code.  For example ->irq_mode tracks what interrupt
mode was initialized, which duplicates the ->msix_enabled and
->msi_enabled handling in pcim_release.

This also adds a check to the return value of dma_async_device_register,
which can fail.

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-09-08 17:29:44 -07:00
Dan Williams
1f27adc2f0 ioat: move definitions to dma.h
Some of these defines may be useful outside of dma.c and the header is
private so there are no namespace pollution concerns.

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-09-08 17:29:02 -07:00
Dan Williams
a348a7e6fd Merge commit 'v2.6.31-rc1' into dmaengine
2009-09-08 14:32:24 -07:00
Dan Williams
07a3b417dc md/raid456: distribute raid processing over multiple cores
Now that the resources to handle stripe_head operations are allocated
percpu it is possible for raid5d to distribute stripe handling over
multiple cores.  This conversion also adds a call to cond_resched() in
the non-multicore case to prevent one core from getting monopolized for
raid operations.

Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:13:13 -07:00
Yuri Tikhonov
b774ef491b md/raid6: remove synchronous infrastructure
These routines have been replaced by their asynchronous counterparts.

Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:13:13 -07:00
Yuri Tikhonov
6c0069c0ae md/raid6: asynchronous handle_stripe6
1/ Use STRIPE_OP_BIOFILL to offload completion of read requests to
   raid_run_ops
2/ Implement a handler for sh->reconstruct_state similar to the raid5 case
   (adds handling of Q parity)
3/ Prevent handle_parity_checks6 from running concurrently with 'compute'
   operations
4/ Hook up raid_run_ops

Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:13:13 -07:00
Dan Williams
d82dfee0ad md/raid6: asynchronous handle_parity_check6
[ Based on an original patch by Yuri Tikhonov ]

Implement the state machine for handling the RAID-6 parities check and
repair functionality.  Note that the raid6 case does not need to check
for new failures, like raid5, as it will always writeback the correct
disks.  The raid5 case can be updated to check zero_sum_result to avoid
getting confused by new failures rather than retrying the entire check
operation.

Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:13:13 -07:00
Yuri Tikhonov
a9b39a741a md/raid6: asynchronous handle_stripe_dirtying6
In the synchronous implementation of stripe dirtying we processed a
degraded stripe with one call to handle_stripe_dirtying6().  I.e.
compute the missing blocks from the other drives, then copy in the new
data and reconstruct the parities.

In the asynchronous case we do not perform stripe operations directly.
Instead, operations are scheduled with flags to be later serviced by
raid_run_ops.  So, for the degraded case the final reconstruction step
can only be carried out after all blocks have been brought up to date by
being read, or computed.  Like the raid5 case schedule_reconstruction()
sets STRIPE_OP_RECONSTRUCT to request a parity generation pass and
through operation chaining can handle compute and reconstruct in a
single raid_run_ops pass.

[dan.j.williams@intel.com: fixup handle_stripe_dirtying6 gating]
Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:13:12 -07:00
Yuri Tikhonov
5599becca4 md/raid6: asynchronous handle_stripe_fill6
Modify handle_stripe_fill6 to work asynchronously by introducing
fetch_block6 as the raid6 analog of fetch_block5 (schedule compute
operations for missing/out-of-sync disks).

[dan.j.williams@intel.com: compute D+Q in one pass]
Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:13:12 -07:00
Yuri Tikhonov
c0f7bddbe6 md/raid5,6: common schedule_reconstruction for raid5/6
Extend schedule_reconstruction5 for reuse by the raid6 path.  Add
support for generating Q and BUG() if a request is made to perform
'prexor'.

Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:13:12 -07:00
Dan Williams
ac6b53b6e6 md/raid6: asynchronous raid6 operations
[ Based on an original patch by Yuri Tikhonov ]

The raid_run_ops routine uses the asynchronous offload api and
the stripe_operations member of a stripe_head to carry out xor+pq+copy
operations asynchronously, outside the lock.

The operations performed by RAID-6 are the same as in the RAID-5 case
except for no support of STRIPE_OP_PREXOR operations. All the others
are supported:
STRIPE_OP_BIOFILL
 - copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
 - generate missing blocks (1 or 2) in the cache from the other blocks
STRIPE_OP_BIODRAIN
 - copy data out of request buffers to satisfy a write request
STRIPE_OP_RECONSTRUCT
 - recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
 - verify that the parity is correct

The flow is the same as in the RAID-5 case, and reuses some routines, namely:
1/ ops_complete_postxor (renamed to ops_complete_reconstruct)
2/ ops_complete_compute (updated to set up to 2 targets uptodate)
3/ ops_run_check (renamed to ops_run_check_p for xor parity checks)

[neilb@suse.de: fixes to get it to pass mdadm regression suite]
Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:13:12 -07:00
Dan Williams
4e7d2c0aef md/raid5: factor out mark_uptodate from ops_complete_compute5
ops_complete_compute5 can be reused in the raid6 path if it is updated to
generically handle a second target.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:13:11 -07:00
Dan Williams
f6dbf65161 iop-adma: P+Q self test
Even though the intent is to extend dmatest with P+Q tests there is
still value in having an always-on sanity check to prevent an
unintentionally broken driver from registering.

This depends on raid6_pq.ko for verification, the side effect being that
PQ capable channels will fail to register when raid6 is disabled.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:12:40 -07:00
Dan Williams
7bf649aee8 iop-adma: P+Q support for iop13xx adma engines
iop33x support is not included because that engine is a bit more awkward
to handle in that it can either be in xor mode or pq mode.  The
dmaengine/async_tx layers currently only comprehend static capabilities.

Note iop13xx does not support hardware PQ continuation so the driver
must handle the DMA_PREP_CONTINUE flag for operations across > 16
sources. From the comment for dma_maxpq:

/* When an engine does not support native continuation we need 3 extra
 * source slots to reuse P and Q with the following coefficients:
 * 1/ {00} * P : remove P from Q', but use it as a source for P'
 * 2/ {01} * Q : use Q to continue Q' calculation
 * 3/ {00} * Q : subtract Q from P' to cancel (2)
 */

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:12:39 -07:00
Dan Williams
72be12f0c3 iop-adma: fix lockdep false positive
lockdep correctly identifies a potential recursive locking case for
iop_chan->lock, but in the dependency submission case we expect that the same
class will be acquired for both the parent dependency and the child channel.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:12:39 -07:00
Dan Williams
507fbec4cf iop-adma: cleanup iop_adma_run_tx_complete_actions
Replace 'desc->async_tx.' with 'tx->'

[ Impact: pure cleanup ]

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:12:39 -07:00
Dan Williams
cb3c82992f async_tx: raid6 recovery self test
Port drivers/md/raid6test/test.c to use the async raid6 recovery
routines.  This is meant as a unit test for raid6 acceleration drivers.  In
addition to the 16-drive test case this implements tests for the 4-disk and
5-disk special cases (dma devices cannot generically handle fewer than 2
sources), and adds a test for the D+Q case.

Reviewed-by: Andre Noll <maan@systemlinux.org>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:09:28 -07:00
Dan Williams
58691d64c4 dmatest: add pq support
Test raid6 p+q operations with a simple "always multiply by 1" q
calculation to fit into dmatest's current destination verification
scheme.

Reviewed-by: Andre Noll <maan@systemlinux.org>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:09:27 -07:00
Dan Williams
0a82a6239b async_tx: add support for asynchronous RAID6 recovery operations
async_raid6_2data_recov() recovers two data disk failures

async_raid6_datap_recov() recovers a data disk and the P disk

These routines are a port of the synchronous versions found in
drivers/md/raid6recov.c.  The primary difference is breaking out the xor
operations into separate calls to async_xor.  Two helper routines are
introduced to perform scalar multiplication where needed.
async_sum_product() multiplies two sources by scalar coefficients and
then sums (xor) the result.  async_mult() simply multiplies a single
source by a scalar.
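
As a purely synchronous, byte-at-a-time sketch of what the two helpers compute (not the async_tx implementation, which submits these as offloadable operations): gf_mul, mult and sum_product are hypothetical names, and the field is assumed to be GF(2^8) with the RAID-6 polynomial 0x11d.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply in GF(2^8), reducing by x^8 + x^4 + x^3 + x^2 + 1 (0x11d). */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
	uint8_t r = 0;

	while (b) {
		if (b & 1)
			r ^= a;
		a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
		b >>= 1;
	}
	return r;
}

/* dest[i] = coef * src[i]: the single-source scalar multiply. */
static void mult(uint8_t *dest, const uint8_t *src, uint8_t coef, size_t len)
{
	assert(dest && src);
	for (size_t i = 0; i < len; i++)
		dest[i] = gf_mul(coef, src[i]);
}

/* dest[i] = ca * a[i] + cb * b[i], where '+' is xor. */
static void sum_product(uint8_t *dest, const uint8_t *a, const uint8_t *b,
			uint8_t ca, uint8_t cb, size_t len)
{
	assert(dest && a && b);
	for (size_t i = 0; i < len; i++)
		dest[i] = gf_mul(ca, a[i]) ^ gf_mul(cb, b[i]);
}
```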

This implementation also includes, in contrast to the original
synchronous-only code, special case handling for the 4-disk and 5-disk
array cases.  In these situations the default N-disk algorithm will
present 0-source or 1-source operations to dma devices.  To cover for
dma devices where the minimum source count is 2 we implement 4-disk and
5-disk handling in the recovery code.

[ Impact: asynchronous raid6 recovery routines for 2data and datap cases ]

Cc: Yuri Tikhonov <yur@emcraft.com>
Cc: Ilya Yanok <yanok@emcraft.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: David Woodhouse <David.Woodhouse@intel.com>
Reviewed-by: Andre Noll <maan@systemlinux.org>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:09:27 -07:00
Dan Williams
b2f46fd8ef async_tx: add support for asynchronous GF multiplication
[ Based on an original patch by Yuri Tikhonov ]

This adds support for doing asynchronous GF multiplication by adding
two additional functions to the async_tx API:

 async_gen_syndrome() does simultaneous XOR and Galois field
    multiplication of sources.

 async_syndrome_val() validates the given source buffers against known P
    and Q values.

When a request is made to run async_pq against more than the hardware
maximum number of supported sources we need to reuse the previous
generated P and Q values as sources into the next operation.  Care must
be taken to remove Q from P' and P from Q'.  For example to perform a 5
source pq op with hardware that only supports 4 sources at a time the
following approach is taken:

p, q = PQ(src0, src1, src2, src3, COEF({01}, {02}, {04}, {08}))
p', q' = PQ(p, q, q, src4, COEF({00}, {01}, {00}, {10}))

p' = p + q + q + src4 = p + src4
q' = {00}*p + {01}*q + {00}*q + {10}*src4 = q + {10}*src4
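
The five-source example above can be checked with a small software model. This is a sketch with one byte per source, not driver code: gf_mul, struct pq, pq_pass and pq_5src_via_4 are hypothetical names, and the field is assumed to be GF(2^8) with the RAID-6 polynomial 0x11d.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply in GF(2^8), reducing by x^8 + x^4 + x^3 + x^2 + 1 (0x11d). */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
	uint8_t r = 0;

	while (b) {
		if (b & 1)
			r ^= a;
		a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
		b >>= 1;
	}
	return r;
}

struct pq {
	uint8_t p, q;
};

/* One P+Q pass over n single-byte sources with explicit coefficients. */
static struct pq pq_pass(const uint8_t *src, const uint8_t *coef, int n)
{
	struct pq r = {0, 0};

	assert(n > 0);
	for (int i = 0; i < n; i++) {
		r.p ^= src[i];
		r.q ^= gf_mul(coef[i], src[i]);
	}
	return r;
}

/* Five sources through a four-source engine, following the text above. */
static struct pq pq_5src_via_4(const uint8_t s[5])
{
	static const uint8_t c0[4] = {0x01, 0x02, 0x04, 0x08};
	static const uint8_t c1[4] = {0x00, 0x01, 0x00, 0x10};
	struct pq first = pq_pass(s, c0, 4);
	/* Second pass reuses p, q, q as sources, then the new source. */
	uint8_t cont[4] = {first.p, first.q, first.q, s[4]};

	return pq_pass(cont, c1, 4);
}
```

Feeding q in twice cancels it out of p' (xor), while the {00} coefficients keep p and the second copy of q out of q'.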

Note: 4 is the minimum acceptable maxpq otherwise we punt to
synchronous-software path.

The DMA_PREP_CONTINUE flag indicates to the driver to reuse p and q as
sources (in the above manner) and fill the remaining slots up to maxpq
with the new sources/coefficients.

Note1: Some devices have native support for P+Q continuation and can skip
this extra work.  Devices with this capability can advertise it with
dma_set_maxpq.  It is up to each driver how to handle the
DMA_PREP_CONTINUE flag.

Note2: The api supports disabling the generation of P when generating Q,
this is ignored by the synchronous path but is implemented by some dma
devices to save unnecessary writes.  In this case the continuation
algorithm is simplified to only reuse Q as a source.

Cc: H. Peter Anvin <hpa@zytor.com>
Cc: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
Signed-off-by: Ilya Yanok <yanok@emcraft.com>
Reviewed-by: Andre Noll <maan@systemlinux.org>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:09:27 -07:00
Dan Williams
95475e5711 async_tx: remove walk of tx->parent chain in dma_wait_for_async_tx
We currently walk the parent chain when waiting for a given tx to
complete however this walk may race with the driver cleanup routine.
The routines in async_raid6_recov.c may fall back to the synchronous
path at any point so we need to be prepared to call async_tx_quiesce()
(which calls dma_wait_for_async_tx).  To remove the ->parent walk we
guarantee that every time a dependency is attached ->issue_pending() is
invoked, then we can simply poll the initial descriptor until
completion.

This also allows for a lighter weight 'issue pending' implementation as
there is no longer a requirement to iterate through all the channels'
->issue_pending() routines as long as operations have been submitted in
an ordered chain.  async_tx_issue_pending() is added for this case.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:09:27 -07:00
Dan Williams
af1f951eb6 async_tx: kill needless module_{init|exit}
If module_init and module_exit are nops then neither need to be defined.

[ Impact: pure cleanup ]

Reviewed-by: Andre Noll <maan@systemlinux.org>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:09:26 -07:00
Dan Williams
ad283ea4a3 async_tx: add sum check flags
Replace the flat zero_sum_result with a collection of flags to contain
the P (xor) zero-sum result, and the soon to be utilized Q (raid6
Reed-Solomon syndrome) zero-sum result.  Use the SUM_CHECK_ namespace instead
of DMA_ since these flags will be used on non-dma-zero-sum enabled
platforms.
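
A sketch of how a flag collection can replace a single zero/non-zero result follows. The flag names are modelled on the SUM_CHECK_ namespace mentioned above, but the exact definitions and the three helpers are assumptions made for illustration.

```c
#include <assert.h>

/* Names follow the SUM_CHECK_ namespace; exact values are assumed. */
enum sum_check_flags {
	SUM_CHECK_P_RESULT = 1 << 0, /* set when the P (xor) check failed */
	SUM_CHECK_Q_RESULT = 1 << 1, /* set when the Q (syndrome) check failed */
};

/* Hypothetical helper: both checks passed when no flag is set. */
static inline int sum_check_passed(unsigned int result)
{
	assert((result & ~(unsigned int)(SUM_CHECK_P_RESULT |
					 SUM_CHECK_Q_RESULT)) == 0);
	return result == 0;
}

/* Hypothetical helpers: test each parity independently. */
static inline int p_failed(unsigned int result)
{
	return !!(result & SUM_CHECK_P_RESULT);
}

static inline int q_failed(unsigned int result)
{
	return !!(result & SUM_CHECK_Q_RESULT);
}
```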

Reviewed-by: Andre Noll <maan@systemlinux.org>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:09:26 -07:00
Dan Williams
d6f38f31f3 md/raid5,6: add percpu scribble region for buffer lists
Use percpu memory rather than stack for storing the buffer lists used in
parity calculations.  Include space for dma address conversions and pass
that to async_tx via the async_submit_ctl.scribble pointer.

[ Impact: move memory pressure from stack to heap ]

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:09:26 -07:00
Dan Williams
36d1c6476b md/raid6: move the spare page to a percpu allocation
In preparation for asynchronous handling of raid6 operations move the
spare page to a percpu allocation to allow multiple simultaneous
synchronous raid6 recovery operations.

Make this allocation cpu hotplug aware to maximize allocation
efficiency.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-08-29 19:09:26 -07:00
NeilBrown
80ffb3ccea Fix new incorrect error return from do_md_stop.
Recent commit c8c00a6915
changed the exit paths in do_md_stop and was not quite
careful enough.  There is one path where 'err' now needs
to be cleared but it isn't.
So setting an array to readonly (with mdadm --readonly) will
work, but will incorrectly report an error: ENXIO.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-18 10:35:26 +10:00
NeilBrown
4d484a4a7a md: allow upper limit for resync/reshape to be set when array is read-only
Normally we only allow the upper limit for a reshape to be decreased
when the array is not performing a sync/recovery/reshape, otherwise there
could be races.  But if an array is part-way through a reshape when it
is assembled the reshape is started immediately leaving no window
to set an upper bound.

If the array is started read-only, the reshape will be suspended until
the array becomes writable, so that provides a window during which it
is perfectly safe to reduce the upper limit of a reshape.

So: allow the upper limit (sync_max) to be reduced even if the reshape
thread is running, as long as the array is still read-only.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-13 10:41:50 +10:00
NeilBrown
1a67dde0ab md/raid5: Properly remove excess drives after shrinking a raid5/6
We were removing the drives from the array, but not
removing symlinks from /sys/.... and not marking the device
as having been removed.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-13 10:41:49 +10:00
NeilBrown
a639755cf8 md/raid5: make sure a reshape restarts at the correct address.
This "if" doesn't allow for the possibility that the number of devices
doesn't change, and so sector_nr isn't set correctly in that case.
So change '>' to '>='.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-13 10:13:00 +10:00
NeilBrown
67ac6011db md/raid5: allow new reshape modes to be restarted in the middle.
md/raid5 doesn't allow a reshape to restart if it involves writing
over the same part of disk that it would be reading from.
This happens at the beginning of a reshape that increases the number
of devices, at the end of a reshape that decreases the number of
devices, and continuously for a reshape that does not change the
number of devices.

The current code is correct for the "increase number of devices"
case as the critical section at the start is handled by userspace
performing a backup.

It does not work for reducing the number of devices, or the
no-change case.
For 'reducing', we need to invert the test.  For no-change we cannot
really be sure things will be safe, so simply require the array
to be read-only, which is how the user-space code which carefully
starts such arrays works.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-13 10:06:24 +10:00
NeilBrown
51d5668cb2 md: never advance 'events' counter by more than 1.
When assembling arrays, md allows two devices to have different event
counts as long as the difference is only '1'.  This is to cope with
a system failure between updating the metadata on two different
devices.

However there are currently times when we update the event count by
2.  This was done to keep the event count even when the array is clean
and odd when it is dirty, which allows us to avoid writing common
updates to spare devices and so allow those spares to go to sleep.

This is bad for the above reason.  So change it to never increase by
two.  This means that the alignment between 'odd/even' and
'clean/dirty' might take a little longer to attain, but that is only a
small cost.  The spares will get a few more updates but they will
still be spared (;-) most updates and can still go to sleep.

Prior to this patch there was a small chance that after a crash an
array would fail to assemble due to the overly large event count
mismatch.
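
The assembly rule described above can be sketched as a simple tolerance check. events_compatible and events_next are hypothetical helpers, not md functions; they only model "differ by at most 1" and "advance by exactly 1".

```c
#include <assert.h>
#include <stdint.h>

/* Two devices may assemble together if their counts differ by at most 1. */
static inline int events_compatible(uint64_t a, uint64_t b)
{
	uint64_t diff = a > b ? a - b : b - a;

	return diff <= 1;
}

/* Advance by exactly one per metadata update, never two. */
static inline uint64_t events_next(uint64_t events)
{
	assert(events != UINT64_MAX);
	return events + 1;
}
```

A device that missed a single update stays compatible after a step of one, whereas a step of two would leave it outside the tolerance.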

Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-13 09:54:02 +10:00
NeilBrown
c8c00a6915 Remove deadlock potential in md_open
A recent commit:
  commit 449aad3e25

introduced the possibility of an A-B/B-A deadlock between
bd_mutex and reconfig_mutex.

__blkdev_get holds bd_mutex while calling md_open which takes
   reconfig_mutex,
do_md_run is always called with reconfig_mutex held, and it now
   takes bd_mutex in the call to revalidate_disk.

This potential deadlock was not caught by lockdep due to the
use of mutex_lock_interruptible_nested which was introduced
by
   commit d63a5a74de
to avoid a warning of an impossible deadlock.

It is quite possible to split reconfig_mutex into two locks.
One protects the array data structures while it is being
reconfigured, the other ensures that an array is never even partially
open while it is being deactivated.
In particular, the second lock prevents an open from completing
between the time when do_md_stop checks if there are any active opens,
and the time when the array is either set read-only, or when ->pers is
set to NULL.  So we can be certain that no IO is in flight as the
array is being destroyed.

So create a new lock, open_mutex, just to ensure exclusion between
'open' and 'stop'.

This avoids the deadlock and also avoids the lockdep warning mentioned
in commit d63a5a74d

Reported-by: "Mike Snitzer" <snitzer@gmail.com>
Reported-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2009-08-10 12:50:52 +10:00
Linus Torvalds
7b2aa037e8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6:
  USB: fix oops on disconnect in cdc-acm
  USB: storage: include Prolific Technology USB drive in unusual_devs list
  USB: ftdi_sio: add product_id for Marvell OpenRD Base, Client
  USB: ftdi_sio: add vendor and product id for Bayer glucose meter serial converter cable
  USB: EHCI: fix counting of transaction error retries
  USB: EHCI: fix two new bugs related to Clear-TT-Buffer
  USB: usbfs: fix -ENOENT error code to be -ENODEV
  USB: musb: fix the nop registration for OMAP3EVM
  USB: devio: Properly do access_ok() checks
  USB: pl2303: New vendor and product id
2009-08-07 19:06:36 -07:00
Linus Torvalds
710ad849ae Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6:
  Staging: rspiusb: Fix buffer overflow
  staging: add dependencies on PCI for drivers that require it
  Staging: rtl8192su: fix build error
  Staging: rt2870: Revert d44ca7 Removal of kernel_thread() API
  Staging: rt2870: Add USB ID for Linksys, Planex Communications, Belkin
2009-08-07 19:06:13 -07:00