Commit graph

154 commits

Author SHA1 Message Date
Jens Axboe
6eca9004df [BLOCK] Fix bad sharing of tag busy list on queues with shared tag maps
For the locking to work, only the tag map and tag bit map may be shared
(incidentally, I was just explaining this to Nick yesterday, but I
apparently didn't review the code well enough myself). But we also share
the busy list!  The busy_list must be queue private, or we need a
block_queue_tag covering lock as well.

So we have to move the busy_list to the queue. This'll work fine, and
it'll actually also fix a problem with blk_queue_invalidate_tags() which
will invalidate tags across all shared queues. This is a bit confusing,
the low level driver should call it for each queue seperately since
otherwise you cannot kill tags on just a single queue for eg a hard
drive that stops responding. Since the function has no callers
currently, it's not an issue.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-29 11:33:06 +01:00
Nick Piggin
adb4ddbbfb block: use lock bitops for the tag map.
The block queue tag map can use lock bitops.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-29 11:33:06 +01:00
Oleg Nesterov
abbeb88d00 blk_sync_queue() should cancel request_queue->unplug_work
blk_sync_queue() cancels the timer, but forgets to cancel the work.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-29 11:33:05 +01:00
Jerome Marchand
b238b3d4be block layer: remove a unused argument of drive_stat_acct()
The nr_sector argument of drive_stat_acct() is not used anymore since the read and write sectors statistics are now updated in end_that_request_first(). This patch removes the useless argument.

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-29 11:33:05 +01:00
Jens Axboe
642f149031 SG: Change sg_set_page() to take length and offset argument
Most drivers need to set length and offset as well, so may as well fold
those three lines into one.

Add sg_assign_page() for those two locations that only needed to set
the page, where the offset/length is set outside of the function context.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-24 11:20:47 +02:00
Jens Axboe
7aeacf9822 [BLOCK] blk_rq_map_sg: force clear termination bit
Since blk_rq_map_sg() sets the termination bit at the end of the sg
table, we could see it prematurely on the next mapping unless we
force drivers to do a full sg_init_table() prior to each mapping. So
force clear the termination bit to avoid having to put that clear in
the driver for every mapping.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-23 09:49:25 +02:00
Jens Axboe
ad0d4083e6 [BLOCK] Don't clear sg_dma_len/addr() in blk_rq_map_sg()
It's not a proper lvalue on all archs.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-23 09:27:05 +02:00
Jens Axboe
9b61764bcb [SG] Update block layer to use sg helpers
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-22 19:39:33 +02:00
Pavel Emelyanov
ba25f9dcc4 Use helpers to obtain task pid in printks
The task_struct->pid member is going to be deprecated, so start
using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in
the kernel.

The first thing to start with is the pid, printed to dmesg - in
this case we may safely use task_pid_nr(). Besides, printks produce
more (much more) than a half of all the explicit pid usage.

[akpm@linux-foundation.org: git-drm went and changed lots of stuff]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Dave Airlie <airlied@linux.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:43 -07:00
Randy Dunlap
8f731f7d83 kernel-api docbook: fix content problems
Fix kernel-api docbook contents problems.

docproc: linux-2.6.23-git13/include/asm-x86/unaligned_32.h: No such file or directory
Warning(linux-2.6.23-git13//include/linux/list.h:482): bad line: 			of list entry
Warning(linux-2.6.23-git13//mm/filemap.c:864): No description found for parameter 'ra'
Warning(linux-2.6.23-git13//block/ll_rw_blk.c:3760): No description found for parameter 'req'
Warning(linux-2.6.23-git13//include/linux/input.h:1077): No description found for parameter 'private'
Warning(linux-2.6.23-git13//include/linux/input.h:1077): No description found for parameter 'cdev'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:35 -07:00
Jens Axboe
ba951841ce [BLOCK] blk_rq_map_sg() next_sg fixup
Don't ever use sg_next() on the last entry, it may not be valid!

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-17 19:34:11 +02:00
Linus Torvalds
b6257a9036 Merge branch 'for-linus' of git://git.kernel.dk/data/git/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/data/git/linux-2.6-block:
  [SCSI] Remove full sg table memset()
  [SCSI] ide-scsi: remove usage of sg_last()
  Fix loop terminating conditions in fill_sg().
  [BLOCK] Clear sg entry before filling in blk_rq_map_sg()
  IA64: iommu uses sg_next with an invalid sg element
  cciss: disable DMA refetch on Smart Array P600
  swiotlb: fix map_sg failure handling
  SPARC64: fix iommu sg chaining
  [SCSI] ide-scsi: use scsi_sg_count() instead of ->use_sg
2007-10-17 09:08:13 -07:00
Peter Zijlstra
e0bf68ddec mm: bdi init hooks
provide BDI constructor/destructor hooks

[akpm@linux-foundation.org: compile fix]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:45 -07:00
Jens Axboe
60573b874b [BLOCK] Clear sg entry before filling in blk_rq_map_sg()
The memset() of the sg entry was originally removed, because it could
overwrite a chain pointer. But it's quite OK to memset() it when we know
it's a valid entry, since it can't contain a chain pointer.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-17 13:02:33 +02:00
Linus Torvalds
92d15c2ccb Merge branch 'for-linus' of git://git.kernel.dk/data/git/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/data/git/linux-2.6-block: (63 commits)
  Fix memory leak in dm-crypt
  SPARC64: sg chaining support
  SPARC: sg chaining support
  PPC: sg chaining support
  PS3: sg chaining support
  IA64: sg chaining support
  x86-64: enable sg chaining
  x86-64: update pci-gart iommu to sg helpers
  x86-64: update nommu to sg helpers
  x86-64: update calgary iommu to sg helpers
  swiotlb: sg chaining support
  i386: enable sg chaining
  i386 dma_map_sg: convert to using sg helpers
  mmc: need to zero sglist on init
  Panic in blk_rq_map_sg() from CCISS driver
  remove sglist_len
  remove blk_queue_max_phys_segments in libata
  revert sg segment size ifdefs
  Fixup u14-34f ENABLE_SG_CHAINING
  qla1280: enable use_sg_chaining option
  ...
2007-10-16 10:09:16 -07:00
Fengguang Wu
f2e189827a readahead: remove the limit max_sectors_kb imposed on max_readahead_kb
Remove the size limit max_sectors_kb imposed on max_readahead_kb.

The size restriction is unreasonable.  Especially when max_sectors_kb cannot
grow larger than max_hw_sectors_kb, which can be rather small for some disk
drives.

Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
Jens Axboe
563063a808 ll_rw_blk: temporarily enable max_segments tweaking
Expose this setting for now, so that users can play with enabling
large commands without defaulting it to on globally. This is a debug
patch, it will be dropped for the final versions.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-16 11:08:53 +02:00
Jens Axboe
f565913ef8 block: convert to using sg helpers
Convert the main rq mapper (blk_rq_map_sg()) to the sg helper setup.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-16 11:07:11 +02:00
Jens Axboe
fd5d806266 block: convert blkdev_issue_flush() to use empty barriers
Then we can get rid of ->issue_flush_fn() and all the driver private
implementations of that.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-16 11:05:02 +02:00
Jens Axboe
bf2de6f5a4 block: Initial support for data-less (or empty) barrier support
This implements functionality to pass down or insert a barrier
in a queue, without having data attached to it. The ->prepare_flush_fn()
infrastructure from data barriers are reused to provide this
functionality.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-16 11:03:56 +02:00
Jens Axboe
c07e2b4129 block: factor our bio_check_eod()
End of device check is done twice in __generic_make_request() and it's
fully inlined each time.  Factor out bio_check_eod().

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-16 11:03:55 +02:00
Jens Axboe
a0cd128542 block: add end_queued_request() and end_dequeued_request() helpers
We can use this helper in the elevator core for BLKPREP_KILL, and it'll
also be useful for the empty barrier patch.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-16 11:03:53 +02:00
Jens Axboe
4fa253f33c block: ll_rw_blk.c: cosmetics
Fix ?: construct, a typo, whitespace, and similar.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-16 11:03:49 +02:00
Greg Kroah-Hartman
19c38de88a kobjects: fix up improper use of the kobject name field
A number of different drivers incorrect access the kobject name field
directly.  This is not correct as the name might not be in the array.
Use the proper accessor function instead.
2007-10-12 14:51:02 -07:00
NeilBrown
6712ecf8f6 Drop 'size' argument from bio_endio and bi_end_io
As bi_end_io is only called once when the reqeust is complete,
the 'size' argument is now redundant.  Remove it.

Now there is no need for bio_endio to subtract the size completed
from bi_size.  So don't do that either.

While we are at it, change bi_end_io to return void.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:25:57 +02:00
NeilBrown
5bb23a688b Don't decrement bi_size in bio_endio
The only caller of bio_endio that does not pass the full bi_size
is end_that_request_first.  Also, no ->bi_end_io method is really
interested in bi_size being decremented.

So move the decrement and related code into ll_rw_blk and merge it
with order_bio_endio to form req_bio_endio which does endio functionality
specific to request completion.

As some ->bi_end_io methods do check bi_size of 0, we set it thus for
now, but that will go in the next patch.

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./block/ll_rw_blk.c |   42 +++++++++++++++++++++++++++---------------
 ./fs/bio.c          |   23 +++++++++++------------
 2 files changed, 38 insertions(+), 27 deletions(-)

diff .prev/block/ll_rw_blk.c ./block/ll_rw_blk.c
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:25:57 +02:00
NeilBrown
d24517d793 Remove flush_dry_bio_endio
The entire function of flush_dry_bio_endio is to undo the effects
of bio_endio (when called on a barrier request).  So remove the
function and the call to bio_endio.

This allows us to remove "bi_size" from "struct request_queue".

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./block/ll_rw_blk.c      |   39 ++-------------------------------------
 ./include/linux/blkdev.h |    1 -
 2 files changed, 2 insertions(+), 38 deletions(-)

diff .prev/block/ll_rw_blk.c ./block/ll_rw_blk.c
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:25:57 +02:00
Satyam Sharma
db47d47537 ll_rw_blk: blk_cpu_notifier should be __cpuinitdata
blk_cpu_notifier is marked as __devinitdata, but __devinitdata need not
be __init even if HOTPLUG_CPU=n, which wastes space. It should be marked
__cpuinitdata, and the callback itself as __cpuinit.

Signed-off-by: Satyam Sharma <satyam@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:25:56 +02:00
Jens Axboe
6c92e699b5 Fixup rq_for_each_segment() indentation
Remove one level of nesting where appropriate.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:25:56 +02:00
NeilBrown
bc1c56fde6 Share code between init_request_from_bio and blk_rq_bio_prep
These have very similar functions and should share code where
possible.

Signed-off-by: Neil Brown <neilb@suse.de>

diff .prev/block/ll_rw_blk.c ./block/ll_rw_blk.c
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:25:56 +02:00
NeilBrown
66846572bf Stop exporting blk_rq_bio_prep
blk_rq_bio_prep is exported for use in exactly
one place.  That place can benefit from using
the new blk_rq_append_bio instead.
So
  - change dm-emc to call blk_rq_append_bio
  - stop exporting blk_rq_bio_prep, and
  - initialise rq_disk in blk_rq_bio_prep,
       as dm-emc needs it.

Signed-off-by: Neil Brown <neilb@suse.de>

diff .prev/block/ll_rw_blk.c ./block/ll_rw_blk.c
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:25:56 +02:00
NeilBrown
3001ca7712 New function blk_req_append_bio
ll_back_merge_fn is currently exported to SCSI where is it used,
together with blk_rq_bio_prep, in exactly the same way these
functions are used in __blk_rq_map_user.

So move the common code into a new function (blk_rq_append_bio), and
don't export ll_back_merge_fn any longer.

Signed-off-by: Neil Brown <neilb@suse.de>

diff .prev/block/ll_rw_blk.c ./block/ll_rw_blk.c
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:25:56 +02:00
NeilBrown
5705f70217 Introduce rq_for_each_segment replacing rq_for_each_bio
Every usage of rq_for_each_bio wraps a usage of
bio_for_each_segment, so these can be combined into
rq_for_each_segment.

We define "struct req_iterator" to hold the 'bio' and 'index' that
are needed for the double iteration.

Signed-off-by: Neil Brown <neilb@suse.de>

Various compile fixes by me...

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:25:56 +02:00
NeilBrown
9dfa52831e Merge blk_recount_segments into blk_recalc_rq_segments
blk_recalc_rq_segments calls blk_recount_segments on each bio,
then does some extra calculations to handle segments that overlap
two bios.

If we merge the code from blk_recount_segments into
blk_recalc_rq_segments, we can process the whole request one bio_vec
at a time, and not need the messy cross-bio calculations.

Then blk_recount_segments can be implemented by calling
blk_recalc_rq_segments, passing it a simple on-stack request which
stores just the bio.

Signed-off-by: Neil Brown <neilb@suse.de>

diff .prev/block/ll_rw_blk.c ./block/ll_rw_blk.c
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10 09:25:56 +02:00
Nick Piggin
dd941252a8 shared tag queue barrier comment
Should add some comments for the tag barriers (they won't be so important
if we can switch over to the explicit _lock bitops, but for now we should
make it clear).

Jens' original patch said a barrier after the test_and_clear_bit was also
required. I can't see why (and it would prevent the use of the _lock bitop).

Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
--
2007-09-14 13:56:47 -07:00
Jens Axboe
f3da54ba14 Fix race with shared tag queue maps
There's a race condition in blk_queue_end_tag() for shared tag maps,
users include stex (promise supertrak thingy) and qla2xxx.  The former
at least has reported bugs in this area, not sure why we haven't seen
any for the latter.  It could be because the window is narrow and that
other conditions in the qla2xxx code hide this.  It's a real bug,
though, as the stex smp users can attest.

We need to ensure two things - the tag bit clearing needs to happen
AFTER we cleared the tag pointer, as the tag bit clearing/setting is
what protects this map.  Secondly, we need to ensure that the visibility
of the tag pointer and tag bit clear are ordered properly.

[ I removed the SMP barriers - "test_and_clear_bit()" already implies
  all the required barriers.  -- Linus ]

Also see http://bugzilla.kernel.org/show_bug.cgi?id=7842

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-13 08:20:25 -07:00
Alan D. Brunelle
c7149d6bce Fix remap handling by blktrace
This patch provides more information concerning REMAP operations on block
IOs. The additional information provides clearer details at the user level,
and supports post-processing analysis in btt.

o  Adds in partition remaps on the same device.
o  Fixed up the remap information in DM to be in the right order
o  Sent up mapped-from and mapped-to device information

Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-08-11 22:34:48 +02:00
Jens Axboe
165125e1e4 [BLOCK] Get rid of request_queue_t typedef
Some of the code has been gradually transitioned to using the proper
struct request_queue, but there's lots left. So do a full sweet of
the kernel and get rid of this typedef and replace its uses with
the proper type.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-24 09:28:11 +02:00
Paul Mundt
20c2df83d2 mm: Remove slab destructors from kmem_cache_create().
Slab destructors were no longer supported after Christoph's
c59def9f22 change. They've been
BUGs for both slab and slub, and slob never supported them
either.

This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2007-07-20 10:11:58 +09:00
Christoph Lameter
94f6030ca7 Slab allocators: Replace explicit zeroing with __GFP_ZERO
kmalloc_node() and kmem_cache_alloc_node() were not available in a zeroing
variant in the past.  But with __GFP_ZERO it is possible now to do zeroing
while allocating.

Use __GFP_ZERO to remove the explicit clearing of memory via memset whereever
we can.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:02 -07:00
FUJITA Tomonori
abae1fde63 add a struct request pointer to the request structure
This adds a struct request pointer to the request structure for the
second data phase (bidi for now). A request queue supporting bidi
requests sets QUEUE_FLAG_BIDI. This prevents sending bidi requests to
a non-bidi queue.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-16 08:52:46 +02:00
FUJITA Tomonori
4e2872d6b0 bind bsg to all SCSI devices
This patch binds bsg to all SCSI devices (their request queues) like
the current sg driver does. We can send SCSI commands to non disk and
cdrom scsi devices like OSD via bsg.

This patch removes bsg_register_queue from blk_register_queue so bsg
devices aren't bound to non SCSI block devices. If they want bsg, I'll
send a patch to do that.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-16 08:52:46 +02:00
FUJITA Tomonori
d351af01b9 bsg: bind bsg to request_queue instead of gendisk
This patch binds bsg devices to request_queue instead of gendisk. Any
objects (like transport entities) can define own request_handler and
create own bsg device.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-16 08:52:46 +02:00
Jens Axboe
3d6392cfbd bsg: support for full generic block layer SG v3
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-16 08:52:44 +02:00
Tejun Heo
f4b09303d0 [BLOCK] drop unnecessary bvec rewinding from flush_dry_bio_endio
Barrier bios are completed twice - once after the barrier write itself
is done and again after the whole sequence is complete.
flush_dry_bio_endio() is for the first completion.  It doesn't really
complete the bio.  It rewinds bvec and resets bio so that it can be
completed again when the whole barrier sequence is complete.

The bvec rewinding code has the following problems.

1. The rewinding code is wrong because filesystems may pass bvec with
   non zero bv_offset.

2. The block layer doesn't guarantee anything about the state of
   bvec array on request completion.  bv_offset and len are updated
   iff __end_that_request_first() completes the bvec partially.

Because of #2, #1 doesn't really matter (nobody cares whether bvec is
re-wound correctly or not) but then again by not doing unwinding at
all, we'll always give back the same bvec to the caller as full bvec
completion doesn't alter bvecs and the final completion is always full
completion.

Drop unnecessary rewinding code.

This is spotted by Neil Brown.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-10 08:03:33 +02:00
Jens Axboe
32eef96411 blk_hw_contig_segment(): bad segment size checks
Two bugs in there:

- The virt oversize check should use the current bio hardware back
  size and the next bio front size, not the same bio. Spotted by
  Neil Brown.

- The segment size check should add hw front sizes, not total bio
  sizes. Spotted by James Bottomley

Acked-by: James Bottomley <James.Bottomley@SteelEye.com>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-10 08:03:32 +02:00
Tejun Heo
bc90ba093a block: always requeue !fs requests at the front
SCSI marks internal commands with REQ_PREEMPT and push it at the front
of the request queue using blk_execute_rq().  When entering suspended
or frozen state, SCSI devices are quiesced using
scsi_device_quiesce().  In quiesced state, only REQ_PREEMPT requests
are processed.  This is how SCSI blocks other requests out while
suspending and resuming.  As all internal commands are pushed at the
front of the queue, this usually works.

Unfortunately, this interacts badly with ordered requeueing.  To
preserve request order on requeueing (due to busy device, active EH or
other failures), requests are sorted according to ordered sequence on
requeue if IO barrier is in progress.

The following sequence deadlocks.

1. IO barrier sequence issues.

2. Suspend requested.  Queue is quiesced with part or all of IO
   barrier sequence at the front.

3. During suspending or resuming, SCSI issues internal command which
   gets deferred and requeued for some reason.  As the command is
   issued after the IO barrier in #1, ordered requeueing code puts the
   request after IO barrier sequence.

4. The device is ready to process requests again but still is in
   quiesced state and the first request of the queue isn't
   REQ_PREEMPT, so command processing is deadlocked -
   suspending/resuming waits for the issued request to complete while
   the request can't be processed till device is put back into
   running state by resuming.

This can be fixed by always putting !fs requests at the front when
requeueing.

The following thread reports this deadlock.

  http://thread.gmane.org/gmane.linux.kernel/537473

Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: David Greaves <david@dgreaves.com>
Acked-by: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-15 16:12:20 -07:00
Jens Axboe
f653c34dd3 ll_rw_blk: fix gcc 4.2 warning on current_io_context()
current_io_context() is both static and exported with EXPORT_SYMBOL().
As there are no users outside of ll_rw_blk.c itself, just kill the
export.

Problem reported by Martin Michlmayr <tbm@cyrius.com>

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-15 10:44:15 -07:00
Neil Brown
d89d87965d When stacked block devices are in-use (e.g. md or dm), the recursive calls
to generic_make_request can use up a lot of space, and we would rather they
didn't.

As generic_make_request is a void function, and as it is generally not
expected that it will have any effect immediately, it is safe to delay any
call to generic_make_request until there is sufficient stack space
available.

As ->bi_next is reserved for the driver to use, it can have no valid value
when generic_make_request is called, and as __make_request implicitly
assumes it will be NULL (ELEVATOR_BACK_MERGE fork of switch) we can be
certain that all callers set it to NULL.  We can therefore safely use
bi_next to link pending requests together, providing we clear it before
making the real call.

So, we choose to allow each thread to only be active in one
generic_make_request at a time.  If a subsequent (recursive) call is made,
the bio is linked into a per-thread list, and is handled when the active
call completes.

As the list of pending bios is per-thread, there are no locking issues to
worry about.

I say above that it is "safe to delay any call...".  There are, however,
some behaviours of a make_request_fn which would make it unsafe.  These
include any behaviour that assumes anything will have changed after a
recursive call to generic_make_request.

These could include:
 - waiting for that call to finish and call it's bi_end_io function.
   md use to sometimes do this (marking the superblock dirty before
   completing a write) but doesn't any more
 - inspecting the bio for fields that generic_make_request might
   change, such as bi_sector or bi_bdev.  It is hard to see a good
   reason for this, and I don't think anyone actually does it.
 - inspecing the queue to see if, e.g. it is 'full' yet.  Again, I
   think this is very unlikely to be useful, or to be done.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <dm-devel@redhat.com>

Alasdair G Kergon <agk@redhat.com> said:

 I can see nothing wrong with this in principle.

 For device-mapper at the moment though it's essential that, while the bio
 mappings may now get delayed, they still get processed in exactly
 the same order as they were passed to generic_make_request().

 My main concern is whether the timing changes implicit in this patch
 will make the rare data-corrupting races in the existing snapshot code
 more likely. (I'm working on a fix for these races, but the unfinished
 patch is already several hundred lines long.)

 It would be helpful if some people on this mailing list would test
 this patch in various scenarios and report back.

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-05-11 13:28:37 +02:00
Linus Torvalds
9a9136e270 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (25 commits)
  sound: convert "sound" subdirectory to UTF-8
  MAINTAINERS: Add cxacru website/mailing list
  include files: convert "include" subdirectory to UTF-8
  general: convert "kernel" subdirectory to UTF-8
  documentation: convert the Documentation directory to UTF-8
  Convert the toplevel files CREDITS and MAINTAINERS to UTF-8.
  remove broken URLs from net drivers' output
  Magic number prefix consistency change to Documentation/magic-number.txt
  trivial: s/i_sem /i_mutex/
  fix file specification in comments
  drivers/base/platform.c: fix small typo in doc
  misc doc and kconfig typos
  Remove obsolete fat_cvf help text
  Fix occurrences of "the the "
  Fix minor typoes in kernel/module.c
  Kconfig: Remove reference to external mqueue library
  Kconfig: A couple of grammatical fixes in arch/i386/Kconfig
  Correct comments in genrtc.c to refer to correct /proc file.
  Fix more "deprecated" spellos.
  Fix "deprecated" typoes.
  ...

Fix trivial comment conflict in kernel/relay.c.
2007-05-09 12:54:17 -07:00