Commit graph

233391 commits

Author SHA1 Message Date
Yan, Zheng
6848ad6461 Btrfs: Fix balance panic
Mark the cloned backref_node as checked in clone_backref_node()

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-02-14 16:00:03 -05:00
Linus Torvalds
15a831f253 Merge branch 'devicetree/merge' of git://git.secretlab.ca/git/linux-2.6
* 'devicetree/merge' of git://git.secretlab.ca/git/linux-2.6:
  Revert "dt: add documentation of ARM dt boot interface"
2011-02-14 10:10:37 -08:00
Linus Torvalds
f1b6a4ec27 Merge branch 'rtc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'rtc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  RTC: Fix minor compile warning
  RTC: Convert rtc drivers to use the alarm_irq_enable method
  RTC: Fix rtc driver ioctl specific shortcutting
2011-02-14 10:10:07 -08:00
Chris Mason
e3f24cc521 Btrfs: don't release pages when we can't clear the uptodate bits
Btrfs tracks uptodate state in an rbtree as well as in the
page bits.  This is supposed to enable us to use block sizes other than
the page size, but there are a few parts still missing before that
completely works.

But, our readpage routine trusts this additional range based tracking
of uptodateness, much in the same way the buffer head up to date bits
are trusted for the other filesystems.

The problem is that sometimes we need to allocate memory in order to
split records in the rbtree, even when we are just clearing bits.  This
can be difficult when our clearing function is called GFP_ATOMIC, which
can happen in the releasepage path.

So, what happens today looks like this:

releasepage called with GFP_ATOMIC
btrfs_releasepage calls clear_extent_bit
clear_extent_bit fails to allocate ram, leaving the up to date bit set
btrfs_releasepage returns success

The end result is the page being gone, but btrfs thinking the range is
up to date.   Later on if someone tries to read that same page, the
btrfs readpage code will return immediately thinking the page is already
up to date.

This commit fixes things to fail the releasepage when we can't clear the
extent state bits.  It covers both data pages and metadata tree blocks.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-02-14 13:04:01 -05:00
Chris Mason
eb14ab8ed2 Btrfs: fix page->private races
There is a race where btrfs_releasepage can drop the
page->private contents just as alloc_extent_buffer is setting
up pages for metadata.  Because of how the Btrfs page flags work,
this results in us skipping the crc on the page during IO.

This patch sovles the race by waiting until after the extent buffer
is inserted into the radix tree before it sets page private.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-02-14 13:03:52 -05:00
J. Bruce Fields
83f6b0c182 nfsd: break lease on unlink due to rename
4795bb37ef "nfsd: break lease on unlink,
link, and rename", only broke the lease on the file that was being
renamed, and didn't handle the case where the target path refers to an
already-existing file that will be unlinked by a rename--in that case
the target file should have any leases broken as well.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-02-14 10:35:19 -05:00
J. Bruce Fields
acfdf5c383 nfsd4: acquire only one lease per file
Instead of acquiring one lease each time another client opens a file,
nfsd can acquire just one lease to represent all of them, and reference
count it to determine when to release it.

This fixes a regression introduced by
c45821d263 "locks: eliminate fl_mylease
callback": after that patch, only the struct file * is used to determine
who owns a given lease.  But since we recently converted the server to
share a single struct file per open, if we acquire multiple leases on
the same file from nfsd, it then becomes impossible on unlocking a lease
to determine which of those leases (all of whom share the same struct
file *) we meant to remove.

Thanks to Takashi Iwai <tiwai@suse.de> for catching a bug in a previous
version of this patch.

Tested-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-02-14 10:35:19 -05:00
J. Bruce Fields
5d926e8c2f nfsd4: modify fi_delegations under recall_lock
Modify fi_delegations only under the recall_lock, allowing us to use
that list on lease breaks.

Also some trivial cleanup to simplify later changes.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-02-14 10:35:19 -05:00
J. Bruce Fields
65bc58f518 nfsd4: remove unused deleg dprintk's.
These aren't all that useful, and get in the way of the next steps.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-02-14 10:35:19 -05:00
J. Bruce Fields
edab9782b5 nfsd4: split lease setting into separate function
Splitting some code into a separate function which we'll be adding some
more to.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-02-14 10:35:18 -05:00
J. Bruce Fields
dd239cc05f nfsd4: fix leak on allocation error
Also share some common exit code.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-02-14 10:35:18 -05:00
J. Bruce Fields
22d38c4c10 nfsd4: add helper function for lease setup
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-02-14 10:35:18 -05:00
J. Bruce Fields
6b57d9c86d nfsd4: split up nfsd_break_deleg_cb
We'll be adding some more code here soon.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-02-14 10:35:18 -05:00
Konstantin Khorenko
3aa6e0aa8a NFSD: memory corruption due to writing beyond the stat array
If nfsd fails to find an exported via NFS file in the readahead cache, it
should increment corresponding nfsdstats counter (ra_depth[10]), but due to a
bug it may instead write to ra_depth[11], corrupting the following field.

In a kernel with NFSDv4 compiled in the corruption takes the form of an
increment of a counter of the number of NFSv4 operation 0's received; since
there is no operation 0, this is harmless.

In a kernel with NFSDv4 disabled it corrupts whatever happens to be in the
memory beyond nfsdstats.

Signed-off-by: Konstantin Khorenko <khorenko@openvz.org>
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-02-14 10:35:18 -05:00
Benny Halevy
0af3f814cc NFSD: use nfserr for status after decode_cb_op_status
Bugs introduced in 85a5648019
"NFSD: Update XDR decoders in NFSv4 callback client"

Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-02-14 10:35:18 -05:00
J. Bruce Fields
541ce98c10 nfsd: don't leak dentry count on mnt_want_write failure
The exit cleanup isn't quite right here.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-02-14 10:31:08 -05:00
Grant Likely
7211da1778 Revert "dt: add documentation of ARM dt boot interface"
This reverts commit 9830fcd6f6.

The ARM dt support has not been merged yet; this documentation update
was premature.

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2011-02-14 08:13:20 -07:00
Borislav Petkov
1c9d16e359 x86: Fix mwait_usable section mismatch
We use it in non __cpuinit code now too so drop marker.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
LKML-Reference: <20110211171754.GA21047@aftab>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-14 12:08:28 +01:00
Dan Williams
4abed0af1e dmaengine: add slave-dma maintainer
Slave-dma has become the predominant usage model for dmaengine and needs
special attention.  Memory-to-memory dma usage cases will continue to be
maintained by Dan.

Cc: Alan Cox <alan@linux.intel.com>
Acked-by: Vinod Koul <vinod.koul@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2011-02-14 02:42:00 -08:00
Dan Williams
e19d1d4988 Merge branch 'imx' into dmaengine-fixes 2011-02-14 02:40:46 -08:00
Anatolij Gustschin
a646bd7f08 dma: ipu_idmac: do not lose valid received data in the irq handler
Currently when two or more buffers are queued by the camera driver
and so the double buffering is enabled in the idmac, we lose one
frame comming from CSI since the reporting of arrival of the first
frame is deferred by the DMAIC_7_EOF interrupt handler and reporting
of the arrival of the last frame is not done at all. So when requesting
N frames from the image sensor we actually receive N - 1 frames in
user space.

The reason for this behaviour is that the DMAIC_7_EOF interrupt
handler misleadingly assumes that the CUR_BUF flag is pointing to the
buffer used by the IDMAC. Actually it is not the case since the
CUR_BUF flag will be flipped by the FSU when the FSU is sending the
<TASK>_NEW_FRM_RDY signal when new frame data is delivered by the CSI.
When sending this singal, FSU updates the DMA_CUR_BUF and the
DMA_BUFx_RDY flags: the DMA_CUR_BUF is flipped, the DMA_BUFx_RDY
is cleared, indicating that the frame data is beeing written by
the IDMAC to the pointed buffer. DMA_BUFx_RDY is supposed to be
set to the ready state again by the MCU, when it has handled the
received data. DMAIC_7_CUR_BUF flag won't be flipped here by the
IPU, so waiting for this event in the EOF interrupt handler is wrong.
Actually there is no spurious interrupt as described in the comments,
this is the valid DMAIC_7_EOF interrupt indicating reception of the
frame from CSI.

The patch removes code that waits for flipping of the DMAIC_7_CUR_BUF
flag in the DMAIC_7_EOF interrupt handler. As the comment in the
current code denotes, this waiting doesn't help anyway. As a result
of this removal the reporting of the first arrived frame is not
deferred to the time of arrival of the next frame and the drivers
software flag 'ichan->active_buffer' is in sync with DMAIC_7_CUR_BUF
flag, so the reception of all requested frames works.

This has been verified on the hardware which is triggering the
image sensor by the programmable state machine, allowing to
obtain exact number of frames. On this hardware we do not tolerate
losing frames.

This patch also removes resetting the DMA_BUFx_RDY flags of
all channels in ipu_disable_channel() since transfers on other
DMA channels might be triggered by other running tasks and the
buffers should always be ready for data sending or reception.

Signed-off-by: Anatolij Gustschin <agust@denx.de>
Reviewed-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de>
Tested-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2011-02-14 02:28:16 -08:00
Thomas Gleixner
6ee5859df5 Merge branch 'fortglx/2.6.38/tip/timers/rtc' of git://git.linaro.org/people/jstultz/linux into timers/urgent 2011-02-14 09:00:30 +01:00
David Miller
795abaf1e4 klist: Fix object alignment on 64-bit.
Commit c0e69a5bbc ("klist.c: bit 0 in pointer can't be used as flag")
intended to make sure that all klist objects were at least pointer size
aligned, but used the constant "4" which only works on 32-bit.

Use "sizeof(void *)" which is correct in all cases.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: stable <stable@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-13 16:54:24 -08:00
Dave Airlie
dc7cec23c7 Merge remote branch 'intel/drm-intel-fixes' of /ssd/git/drm-next into drm-fixes
* 'intel/drm-intel-fixes' of /ssd/git/drm-next:
  drm/i915: Fix resume regression from 5d1d0cc
  drm/i915/tv: Use polling rather than interrupt-based hotplug
  drm/i915: Trigger modesetting if force-audio changes
  drm/i915/sdvo: If we have an EDID confirm it matches the mode of the connection
  drm/i915: Disable RC6 on Ironlake
  drm/i915/lvds: Restore dithering on native modes for gen2/3
  drm/i915: Invalidate TLB caches on SNB BLT/BSD rings
2011-02-14 10:13:34 +10:00
Alex Deucher
c2049b3d29 drm/radeon/kms: improve 6xx/7xx CS error output
Makes debugging CS rejections much easier.

Signed-off-by: Alex Deucher <alexdeucher@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Dave Airlie <airlied@redhat.com>
2011-02-14 10:13:01 +10:00
Marek Olšák
fff1ce4dc6 drm/radeon/kms: check AA resolve registers on r300
This is an important security fix because we allowed arbitrary values
to be passed to AARESOLVE_OFFSET. This also puts the right buffer address
in the register.

Signed-off-by: Marek Olšák <maraeo@gmail.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2011-02-14 10:12:14 +10:00
Marek Olšák
501834349e drm/radeon/kms: fix tracking of BLENDCNTL, COLOR_CHANNEL_MASK, and GB_Z on r300
Also move ZB_DEPTHCLEARVALUE to the list of safe regs.

Signed-off-by: Marek Olšák <maraeo@gmail.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2011-02-14 10:11:04 +10:00
Alex Deucher
27dcfc1022 drm/radeon/kms: use linear aligned for evergreen/ni bo blits
Not only is linear aligned supposedly more performant,
linear general is only supported by the CB in single
slice mode.  The texture hardware doesn't support
linear general, but I think the hw automatically
upgrades it to linear aligned.

Signed-off-by: Alex Deucher <alexdeucher@gmail.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2011-02-14 10:10:50 +10:00
Alex Deucher
1ea9dbf250 drm/radeon/kms: use linear aligned for 6xx/7xx bo blits
Not only is linear aligned supposedly more performant,
linear general is only supported by the CB in single
slice mode.  The texture hardware doesn't support
linear general, but I think the hw automatically
upgrades it to linear aligned.

Signed-off-by: Alex Deucher <alexdeucher@gmail.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2011-02-14 10:10:48 +10:00
Dave Airlie
8fd1b84cc9 drm/radeon: fix race between GPU reset and TTM delayed delete thread.
My evergreen has been in a remote PC for week and reset has never once
saved me from certain doom, I finally relocated to the box with a
serial cable and noticed an oops when the GPU resets, and the TTM
delayed delete thread tries to remove something from the GTT.

This stops the delayed delete thread from executing across the GPU
reset handler, and woot I can GPU reset now.

Signed-off-by: Dave Airlie <airlied@redhat.com>
2011-02-14 10:10:24 +10:00
Alex Deucher
0f234f5fdc drm/radeon/kms: evergreen/ni big endian fixes (v2)
Based on 6xx/7xx endian fixes from Cédric Cano.

v2: fix typo in shader

Signed-off-by: Alex Deucher <alexdeucher@gmail.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2011-02-14 10:10:09 +10:00
Cédric Cano
4eace7fdfa drm/radeon/kms: 6xx/7xx big endian fixes
agd5f: minor cleanups

Signed-off-by: Cédric Cano <ccano@interfaceconcept.com>
Signed-off-by: Alex Deucher <alexdeucher@gmail.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2011-02-14 09:23:38 +10:00
Cédric Cano
4589433c57 drm/radeon/kms: atombios big endian fixes
agd5f: additional cleanups/fixes

Signed-off-by: Cédric Cano <ccano@interfaceconcept.com>
Signed-off-by: Alex Deucher <alexdeucher@gmail.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2011-02-14 09:23:36 +10:00
Cédric Cano
dee54c40a1 drm/radeon: 6xx/7xx non-kms endian fixes
agd5f: minor cleanups

Signed-off-by: Cédric Cano <ccano@interfaceconcept.com>
Signed-off-by: Alex Deucher <alexdeucher@gmail.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2011-02-14 09:23:35 +10:00
Marek Olšák
40b4a7599d drm/radeon/kms: optimize CS state checking for r100->r500
The colorbuffer, zbuffer, and texture states are checked only once when
they get changed. This improves performance in the apps which emit
lots of draw packets and few state changes.

This drops performance in glxgears by a 1% or so, but glxgears is not
a benchmark we care about.
The time spent in the kernel when running Torcs dropped from 33% to 23%
and the frame rate is higher, which is a good thing.

r600 might need something like this as well.

Signed-off-by: Marek Olšák <maraeo@gmail.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2011-02-14 09:23:27 +10:00
Kees Cook
01e2f533a2 drm: do not leak kernel addresses via /proc/dri/*/vma
In the continuing effort to avoid kernel addresses leaking to unprivileged
users, this patch switches to %pK for /proc/dri/*/vma.

Signed-off-by: Kees Cook <kees.cook@canonical.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2011-02-14 09:23:20 +10:00
Alex Deucher
9fad321ac6 drm/radeon/kms: add connector table for mac g5 9600
PPC Mac cards do not provide connector tables in
their vbios.  Their connector/encoder configurations
must be hardcoded in the driver.

verified by nyef on #radeon

Signed-off-by: Alex Deucher <alexdeucher@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Dave Airlie <airlied@redhat.com>
2011-02-14 09:22:55 +10:00
Jesper Juhl
e917fd39eb radeon mkregtable: Add missing fclose() calls
drivers/gpu/drm/radeon/mkregtable.c:parser_auth() almost always remembers
to fclose(file) before returning, but it misses two spots.

This is not really important since the process will exit shortly after and
thus close the file for us, but being explicit prevents static analysis
tools from complaining about leaked memory and missing fclose() calls and
it also seems to be the prefered style of the existing code to explicitly
close the file.

So, here's a patch to add the two missing fclose() calls.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2011-02-14 09:22:54 +10:00
Alex Deucher
c9417bdd4c drm/radeon/kms: fix interlaced modes on dce4+
- set scaler table clears the interleave bit, need to
reset it in encoder quirks, this was already done for
pre-dce4.
- remove the interleave settings from set_base() functions
this is now handled in the encoder quirks functions, and
isn't technically part of the display base setup.
- rename evergreen_do_set_base() to dce4_do_set_base() since
it's used on both evergreen and NI asics.

Fixes:
https://bugzilla.kernel.org/show_bug.cgi?id=28182

Signed-off-by: Alex Deucher <alexdeucher@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Dave Airlie <airlied@redhat.com>
2011-02-14 09:22:53 +10:00
Dave Airlie
16f9fdcbcc drm/radeon: fix memory debugging since d961db75ce
The old code dereferenced a value, the new code just needs to pass
the ptr.

fixes an oops looking at files in debugfs.

cc: stable@kernel.org
Signed-off-by: Dave Airlie <airlied@redhat.com>
2011-02-14 09:22:51 +10:00
Linus Torvalds
091994cfb8 Merge branch 'spi/merge' of git://git.secretlab.ca/git/linux-2.6
* 'spi/merge' of git://git.secretlab.ca/git/linux-2.6:
  devicetree-discuss is moderated for non-subscribers
  MAINTAINERS: Add entry for GPIO subsystem
  dt: add documentation of ARM dt boot interface
  dt: Remove obsolete description of powerpc boot interface
  dt: Move device tree documentation out of powerpc directory
  spi/spi_sh_msiof: fix wrong address calculation, which leads to an Oops
2011-02-13 07:59:48 -08:00
Linus Torvalds
d8ed516f82 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
  ALSA: hda - add quirk for Ordissimo EVE using a realtek ALC662
  ALSA: hrtimer: remove superfluous tasklet invocation
  ALSA: hrtimer: handle delayed timer interrupts
  ALSA: HDA: Add subwoofer quirk for Acer Aspire 8942G
  ALSA: hda - Don't handle empty patch files
  ALSA: hda - Fix missing CA initialization for HDMI/DP
  ALSA: usbaudio - Enable the E-MU 0204 USB
  ALSA: hda - switch lfe with side in mixer for 4930g
  ASoC: Improve WM8994 digital power sequencing
  ASoC: Create an AIF1ADCDAT signal widget to match AIF2
  asoc: davinci: da830/omap-l137: correct cpu_dai_name
  ASoC: fill in snd_soc_pcm_runtime.card before calling snd_soc_dai_link.init()
2011-02-13 07:58:50 -08:00
Linus Torvalds
f00eaeea7a Revert "pci: use security_capable() when checking capablities during config space read"
This reverts commit 47970b1b2a.

It turns out it breaks several distributions.  Looks like the stricter
selinux checks fail due to selinux policies not being set to allow the
access - breaking X, but also lspci.

So while the change was clearly the RightThing(tm) to do in theory, in
practice we have backwards compatibility issues making it not work.

Reported-by: Dave Young <hidave.darkstar@gmail.com>
Acked-by: David Airlie <airlied@linux.ie>
Acked-by: Alex Riesen <raa.lkml@gmail.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-13 07:50:50 -08:00
Takashi Iwai
6146124118 Merge branch 'fix/asoc' into for-linus 2011-02-13 10:05:30 +01:00
Grant Likely
c170093d31 Merge branch 'devicetree/merge' into spi/merge 2011-02-12 23:53:34 -07:00
Paul Bolle
78bba987bc devicetree-discuss is moderated for non-subscribers
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2011-02-12 23:27:23 -07:00
Grant Likely
a0dc00b430 MAINTAINERS: Add entry for GPIO subsystem
I'll probably regret this....

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-12 09:46:30 -08:00
Linus Torvalds
c8e0b00ed1 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  jbd2: call __jbd2_log_start_commit with j_state_lock write locked
  ext4: serialize unaligned asynchronous DIO
  ext4: make grpinfo slab cache names static
  ext4: Fix data corruption with multi-block writepages support
  ext4: fix up ext4 error handling
  ext4: unregister features interface on module unload
  ext4: fix panic on module unload when stopping lazyinit thread
2011-02-12 09:10:24 -08:00
Theodore Ts'o
e447183180 jbd2: call __jbd2_log_start_commit with j_state_lock write locked
On an SMP ARM system running ext4, I've received a report that the
first J_ASSERT in jbd2_journal_commit_transaction has been triggering:

	J_ASSERT(journal->j_running_transaction != NULL);

While investigating possible causes for this problem, I noticed that
__jbd2_log_start_commit() is getting called with j_state_lock only
read-locked, in spite of the fact that it's possible for it might
j_commit_request.  Fix this by grabbing the necessary information so
we can test to see if we need to start a new transaction before
dropping the read lock, and then calling jbd2_log_start_commit() which
will grab the write lock.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-12 08:18:24 -05:00
Eric Sandeen
e9e3bcecf4 ext4: serialize unaligned asynchronous DIO
ext4 has a data corruption case when doing non-block-aligned
asynchronous direct IO into a sparse file, as demonstrated
by xfstest 240.

The root cause is that while ext4 preallocates space in the
hole, mappings of that space still look "new" and 
dio_zero_block() will zero out the unwritten portions.  When
more than one AIO thread is going, they both find this "new"
block and race to zero out their portion; this is uncoordinated
and causes data corruption.

Dave Chinner fixed this for xfs by simply serializing all
unaligned asynchronous direct IO.  I've done the same here.
The difference is that we only wait on conversions, not all IO.
This is a very big hammer, and I'm not very pleased with
stuffing this into ext4_file_write().  But since ext4 is
DIO_LOCKING, we need to serialize it at this high level.

I tried to move this into ext4_ext_direct_IO, but by then
we have the i_mutex already, and we will wait on the
work queue to do conversions - which must also take the
i_mutex.  So that won't work.

This was originally exposed by qemu-kvm installing to
a raw disk image with a normal sector-63 alignment.  I've
tested a backport of this patch with qemu, and it does
avoid the corruption.  It is also quite a lot slower
(14 min for package installs, vs. 8 min for well-aligned)
but I'll take slow correctness over fast corruption any day.

Mingming suggested that we can track outstanding
conversions, and wait on those so that non-sparse
files won't be affected, and I've implemented that here;
unaligned AIO to nonsparse files won't take a perf hit.

[tytso@mit.edu: Keep the mutex as a hashed array instead
 of bloating the ext4 inode]

[tytso@mit.edu: Fix up namespace issues so that global
 variables are protected with an "ext4_" prefix.]

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-12 08:17:34 -05:00