Commit graph

1575 commits

Author SHA1 Message Date
Wanlong Gao
d11bb4462c blk-cgroup: be able to remove the record of unplugged device
The bug is we're not able to remove the device from blkio cgroup's
per-device control files if it gets unplugged.

To reproduce the bug:

  # mount -t cgroup -o blkio xxx /cgroup
  # cd /cgroup
  # echo "8:0 1000" > blkio.throttle.read_bps_device
  # unplug the device
  # cat blkio.throttle.read_bps_device
  8:0	1000
  # echo "8:0 0" > blkio.throttle.read_bps_device
  -bash: echo: write error: No such device

After patching, the device removal will succeed.

Thanks for the comments of Paul, Zefan, and Vivek.

Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-09-21 10:22:10 +02:00
Andrew Morton
499337bb65 block/blk-sysfs.c: fix kerneldoc references
The kerneldoc for blk_release_queue() is referring to blk_cleanup_queue().

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-09-21 10:01:22 +02:00
Suresh Jayaraman
75df713627 block: document blk-plug
Thus spake Andrew Morton:

"And I have the usual maintainability whine.  If someone comes up to
vmscan.c and sees it calling blk_start_plug(), how are they supposed to
work out why that call is there?  They go look at the blk_start_plug()
definition and it is undocumented.  I think we can do better than this?"

Adapted from the LWN article - http://lwn.net/Articles/438256/ by Jens
Axboe and from an earlier attempt by Shaohua Li to document blk-plug.

[akpm@linux-foundation.org: grammatical and spelling tweaks]
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-09-21 10:00:16 +02:00
Christoph Hellwig
27a84d54c0 block: refactor generic_make_request
Move all the checks performed on a bio into a new helper, and call it as
soon as bio is submitted even if it is a re-submission from ->make_request.

We explicitly mark the new helper as beeing non-inlined as the stack
usage for printing the block device name in the failure case is quite
high and this a patch where we have to be extremely conservative about
stack usage.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-09-15 14:01:40 +02:00
Tao Ma
8ad6a56f56 block: Don't check QUEUE_FLAG_SAME_COMP in __blk_complete_request
In __blk_complete_request, we check both QUEUE_FLAG_SAME_COMP and req->cpu
to decide whether we should use req->cpu. Actually the user can also
select the complete cpu by either setting BIO_CPU_AFFINE or by calling
bio_set_completion_cpu. Current solution makes these 2 ways don't work
any more. So we'd better just check req->cpu.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-09-14 09:31:01 +02:00
Christoph Hellwig
5a7bbad27a block: remove support for bio remapping from ->make_request
There is very little benefit in allowing to let a ->make_request
instance update the bios device and sector and loop around it in
__generic_make_request when we can archive the same through calling
generic_make_request from the driver and letting the loop in
generic_make_request handle it.

Note that various drivers got the return value from ->make_request and
returned non-zero values for errors.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-09-12 12:12:01 +02:00
Jens Axboe
c20e8de27f block: rename __make_request() to blk_queue_bio()
Now that it's exported, lets put it in a more sane namespace.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-09-12 12:08:31 +02:00
Christoph Hellwig
166e1f901b block: export __make_request
Avoid the hacks need for request based device mappers currently by simply
exporting the symbol instead of trying to get it through the back door.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-09-12 12:08:27 +02:00
Wang Sheng-Hui
484fc254b8 elevator: use ELV_NAME_MAX instead of magic number 16 for chosen_elevator
We have ELV_NAME_MAX defined to 16, and hence we should use it
instead of the magic nubmer 16 for elevator's name string.

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-09-12 08:59:20 +02:00
Nao Nishijima
a72c5e5eb7 [SCSI] genhd: add a new attribute "alias" in gendisk
This patch allows the user to set an "alias" of the disk via sysfs interface.

This patch only adds a new attribute "alias" in gendisk structure.
To show the alias instead of the device name in kernel messages,
we need to revise printk messages and use alias_name() in them.

Example:
(current) printk("disk name is %s\n", disk->disk_name);
(new)     printk("disk name is %s\n", alias_name(disk));

Users can use alphabets, numbers, '-' and '_' in "alias" attribute. A disk can
have an "alias" which length is up to 255 bytes. This attribute is write-once.

Suggested-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Suggested-by: Jon Masters <jcm@redhat.com>
Signed-off-by: Nao Nishijima <nao.nishijima.xt@hitachi.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2011-08-29 00:16:19 -07:00
Shaohua Li
56ebdaf2fa block: simplify force plug flush code a little bit
Cleaning up the code a little bit. attempt_plug_merge() traverses the plug
list anyway, we can do the request counting there, so stack size is reduced
a little bit.
The motivation here is I suspect if we should count the requests for each
queue (task could handle multiple disks in the meantime), but my test doesn't
show it's worthy doing. If somebody proves we should do it, below change
will make that more easier.

Signed-off-by: Shaohua Li <shli@kernel.org>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-24 16:04:34 +02:00
Shaohua Li
a632716275 block: change force plug flush call order
Do blk_flush_plug_list() first and then add new request aDo blk_flush_plug_list() first and then add new request aDo blk_flush_plug_list() first and then add new request at the tail. New
request can't be merged to existing requests, but later new requests might
be merged with this new one. If blk_flush_plug_list() is done later, the
merge doesn't happen.
Believe it or not, this fixes a 10% regression running sysbench workload.

Signed-off-by: Shaohua Li <shli@kernel.org>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-24 16:04:32 +02:00
Eric Seppanen
e8037d4983 block: Fix queue_flag update when rq_affinity goes from 2 to 1
Commit 5757a6d76c added the QUEUE_FLAG_SAME_FORCE flag, but fails to
clear that flag when the current state is '2' (SAME_COMP + SAME_FORCE)
and the new state is '1' (SAME_COMP).

Acked-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Eric Seppanen <eric@purestorage.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-24 08:51:34 +02:00
Tejun Heo
d27769ec3d block: add GENHD_FL_NO_PART_SCAN
There are cases where suppressing partition scan is useful - e.g. for
lo devices and pseudo SATA devices which advertise to be a disk but
get upset on partition scan (some port multiplier control devices show
such behavior).

This patch adds GENHD_FL_NO_PART_SCAN which suppresses partition scan
regardless of the number of possible partitions.  disk_partitionable()
is renamed to disk_part_scan_enabled() as suppressing partition scan
doesn't imply the device can't be partitioned using
BLKPG_ADD/DEL_PARTITION calls from userland.  show_partition() now
directly tests disk_max_parts() to maintain backward-compatibility.

-v2: Updated to make it clear that only partition scan is suppressed
     not partitioning itself as suggested by Kay Sievers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-23 20:01:04 +02:00
Christoph Hellwig
65299a3b78 block: separate priority boosting from REQ_META
Add a new REQ_PRIO to let requests preempt others in the cfq I/O schedule,
and lave REQ_META purely for marking requests as metadata in blktrace.

All existing callers of REQ_META except for XFS are updated to also
set REQ_PRIO for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-23 14:50:29 +02:00
Linus Torvalds
5ccc38740a Merge branch 'for-linus' of git://git.kernel.dk/linux-block
* 'for-linus' of git://git.kernel.dk/linux-block: (23 commits)
  Revert "cfq: Remove special treatment for metadata rqs."
  block: fix flush machinery for stacking drivers with differring flush flags
  block: improve rq_affinity placement
  blktrace: add FLUSH/FUA support
  Move some REQ flags to the common bio/request area
  allow blk_flush_policy to return REQ_FSEQ_DATA independent of *FLUSH
  xen/blkback: Make description more obvious.
  cfq-iosched: Add documentation about idling
  block: Make rq_affinity = 1 work as expected
  block: swim3: fix unterminated of_device_id table
  block/genhd.c: remove useless cast in diskstats_show()
  drivers/cdrom/cdrom.c: relax check on dvd manufacturer value
  drivers/block/drbd/drbd_nl.c: use bitmap_parse instead of __bitmap_parse
  bsg-lib: add module.h include
  cfq-iosched: Reduce linked group count upon group destruction
  blk-throttle: correctly determine sync bio
  loop: fix deadlock when sysfs and LOOP_CLR_FD race against each other
  loop: add BLK_DEV_LOOP_MIN_COUNT=%i to allow distros 0 pre-allocated loop devices
  loop: add management interface for on-demand device allocation
  loop: replace linked list of allocated devices with an idr index
  ...
2011-08-19 10:47:07 -07:00
Jens Axboe
b53d1ed734 Revert "cfq: Remove special treatment for metadata rqs."
We have a kernel build regression since 3.1-rc1, which is about 10%
regression. The kernel source is in an ext3 filesystem.
Alex Shi bisect it to commit:
commit a07405b780
Author: Justin TerAvest <teravest@google.com>
Date:   Sun Jul 10 22:09:19 2011 +0200

    cfq: Remove special treatment for metadata rqs.

Apparently this is caused by lack metadata preemption, where ext3/ext4
do use READ_META. I didn't see a way to fix the issue, so suggest
reverting the patch.

This reverts commit a07405b780.

Reported-by: Alex Shi<alex.shi@intel.com>
Reported-by: Shaohua Li<shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-19 08:34:48 +02:00
Jeff Moyer
4853abaae7 block: fix flush machinery for stacking drivers with differring flush flags
Commit ae1b153962, block: reimplement
FLUSH/FUA to support merge, introduced a performance regression when
running any sort of fsyncing workload using dm-multipath and certain
storage (in our case, an HP EVA).  The test I ran was fs_mark, and it
dropped from ~800 files/sec on ext4 to ~100 files/sec.  It turns out
that dm-multipath always advertised flush+fua support, and passed
commands on down the stack, where those flags used to get stripped off.
The above commit changed that behavior:

static inline struct request *__elv_next_request(struct request_queue *q)
{
        struct request *rq;

        while (1) {
-               while (!list_empty(&q->queue_head)) {
+               if (!list_empty(&q->queue_head)) {
                        rq = list_entry_rq(q->queue_head.next);
-                       if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
-                           (rq->cmd_flags & REQ_FLUSH_SEQ))
-                               return rq;
-                       rq = blk_do_flush(q, rq);
-                       if (rq)
-                               return rq;
+                       return rq;
                }

Note that previously, a command would come in here, have
REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:

struct request *blk_do_flush(struct request_queue *q, struct request *rq)
{
        unsigned int fflags = q->flush_flags; /* may change, cache it */
        bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
        bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
        bool do_postflush = has_flush && !has_fua && (rq->cmd_flags &
        REQ_FUA);
        unsigned skip = 0;
...
        if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
                rq->cmd_flags &= ~REQ_FLUSH;
		if (!has_fua)
			rq->cmd_flags &= ~REQ_FUA;
	        return rq;
	}

So, the flush machinery was bypassed in such cases (q->flush_flags == 0
&& rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).

Now, however, we don't get into the flush machinery at all.  Instead,
__elv_next_request just hands a request with flush and fua bits set to
the scsi_request_fn, even if the underlying request_queue does not
support flush or fua.

The agreed upon approach is to fix the flush machinery to allow
stacking.  While this isn't used in practice (since there is only one
request-based dm target, and that target will now reflect the flush
flags of the underlying device), it does future-proof the solution, and
make it function as designed.

In order to make this work, I had to add a field to the struct request,
inside the flush structure (to store the original req->end_io).  Shaohua
had suggested overloading the union with rb_node and completion_data,
but the completion data is used by device mapper and can also be used by
other drivers.  So, I didn't see a way around the additional field.

I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
the lost performance.  Comments and other testers, as always, are
appreciated.

Cheers,
Jeff

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-15 21:37:25 +02:00
Shaohua Li
bcf30e75b7 block: improve rq_affinity placement
This patch reverts commit 35ae66e0a09ab70ed(block: Make rq_affinity = 1
work as expected). The purpose is to avoid an unnecessary IPI.
Let's take an example. My test box has cpu 0-7, one socket. Say request is
added from CPU 1, blk_complete_request() occurs at CPU 7. Without the reverted
patch, softirq will be done at CPU 7. With it, an IPI will be directed to CPU
0, and softirq will be done at CPU 0. In this case, doing softirq at CPU 0 and
CPU 7 have no difference from cache sharing point view and we can avoid an
ipi if doing it in CPU 7.
An immediate concern is this is just like QUEUE_FLAG_SAME_FORCE, but actually
not. blk_complete_request() is running in interrupt handler, and currently
I/O controller doesn't support multiple interrupts (I checked several LSI
cards and AHCI), so only one CPU can run blk_complete_request(). This is
still quite different as QUEUE_FLAG_SAME_FORCE.
Since only one CPU runs softirq, the only difference with below patch is
softirq not always runs at the first CPU of a group.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-11 10:39:04 +02:00
Jeff Moyer
fa1bf42ff9 allow blk_flush_policy to return REQ_FSEQ_DATA independent of *FLUSH
blk_insert_flush has the following check:

	/*
	 * If there's data but flush is not necessary, the request can be
	 * processed directly without going through flush machinery.  Queue
	 * for normal execution.
	 */
	if ((policy & REQ_FSEQ_DATA) &&
	    !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
		list_add_tail(&rq->queuelist, &q->queue_head);
		return;
	}

However, blk_flush_policy will not return with policy set to only
REQ_FSEQ_DATA:

static unsigned int blk_flush_policy(unsigned int fflags, struct request *rq)
{
	unsigned int policy = 0;

	if (fflags & REQ_FLUSH) {
		if (rq->cmd_flags & REQ_FLUSH)
			policy |= REQ_FSEQ_PREFLUSH;
		if (blk_rq_sectors(rq))
			policy |= REQ_FSEQ_DATA;
		if (!(fflags & REQ_FUA) && (rq->cmd_flags & REQ_FUA))
			policy |= REQ_FSEQ_POSTFLUSH;
	}
	return policy;
}

Notice that REQ_FSEQ_DATA is only set if REQ_FLUSH is set.  Fix this
mismatch by moving the setting of REQ_FSEQ_DATA outside of the REQ_FLUSH
check.

Tejun notes:

  Hmmm... yes, this can become a correctness issue if (and only if)
  blk_queue_flush() is called to change q->flush_flags while requests
  are in-flight; otherwise, requests wouldn't reach the function at all.
  Also, I think it would be a generally good idea to always set
  FSEQ_DATA if the request has data.

Cheers,
Jeff

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-09 20:32:09 +02:00
Tao Ma
35ae66e0a0 block: Make rq_affinity = 1 work as expected
Commit 5757a6d76c introduced a new rq_affinity = 2 so as to make
the request completed in the __make_request cpu. But it makes the
old rq_affinity = 1 not work any more. The root cause is that
if the 'cpu' and 'req->cpu' is in the same group and cpu != req->cpu,
ccpu will be the same as group_cpu, so the completion will be
excuted in the 'cpu' not 'group_cpu'.

This patch fix problem by simpling removing group_cpu and the codes
are more explicit now. If ccpu == cpu, we complete in cpu, otherwise
we raise_blk_irq to ccpu.

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Roland Dreier <roland@purestorage.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Reviewed-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-05 09:37:47 +02:00
Akinobu Mita
dd48c085c1 fault-injection: add ability to export fault_attr in arbitrary directory
init_fault_attr_dentries() is used to export fault_attr via debugfs.
But it can only export it in debugfs root directory.

Per Forlin is working on mmc_fail_request which adds support to inject
data errors after a completed host transfer in MMC subsystem.

The fault_attr for mmc_fail_request should be defined per mmc host and
export it in debugfs directory per mmc host like
/sys/kernel/debug/mmc0/mmc_fail_request.

init_fault_attr_dentries() doesn't help for mmc_fail_request.  So this
introduces fault_create_debugfs_attr() which is able to create a
directory in the arbitrary directory and replace
init_fault_attr_dentries().

[akpm@linux-foundation.org: extraneous semicolon, per Randy]
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Tested-by: Per Forlin <per.forlin@linaro.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03 14:25:20 -10:00
Herbert Poetzl
f95fe9cfb4 block/genhd.c: remove useless cast in diskstats_show()
Remove the (unsigned long long) cast in diskstats_show() and adjusts the
seq_printf() format string to 'unsigned long'

diskstats_show() uses part_stat_read() to get the stats, which either
accesses the specified field in the struct disk_stats directly (non SMP)
or sums up the per CPU values in a variable of the same type as the field,
so in any case the result will have the same type and range as the
specified field which for all disk_stats entries is unsigned long

Also, for unsigned long ranges the output of %lu should be identical to
the one of %llu, so no change in the actual proc entry contents.

Signed-off-by: Herbert Poetzl <herbert@13thfloor.at>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-02 12:43:50 +02:00
Jens Axboe
e2a5429ff7 bsg-lib: add module.h include
Due to conflicts with the moduleh tree in linux-next, we
run into an include file mess. We really need export.h
in that tree, but if we add module.h locally then the
issue is easier to resolve.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-02 10:43:35 +02:00
Vivek Goyal
a5395b83b7 cfq-iosched: Reduce linked group count upon group destruction
FQ keeps track of number of groups which are linked on blkcg->blkg_list.
This is useful to avoid races between queue exit and cgroup exit code
paths. So if at the request queue exit time linked group count is not
zero, that means there are some group out there which is yet to be
deleted under rcu read period and queue exit code should wait for
on rcu period.

In my previous patch I forgot to decrease the number of group count.
So in current form, we nr_blkcg_linked_grps is always non-zero and
we will always wait one rcu period (if BLK_CGROUP=y). The side effect
of this is that it can increase boot time. I am surprised, nobody
complained so far.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-02 09:24:09 +02:00
Shaohua Li
e5a94f5684 blk-throttle: correctly determine sync bio
read request is always sync. Using rw_is_sync() to determine
if a bio is sync.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-01 10:31:06 +02:00
Mike Christie
aa387cc895 block: add bsg helper library
This moves the FC classes bsg code to the block layer and
makes it a lib so that other classes like iscsi and SAS can use it.

It is helpful because working with the request queue, bios,
creating scatterlists, etc are a pain that the LLD does not
have to worry about with normal IOs and should not have to
worry about for bsg requests.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-31 22:05:09 +02:00
Akinobu Mita
b2c9cd3793 fail_make_request: cleanup should_fail_request
This changes should_fail_request() to more usable wrapper function of
should_fail().  It can avoid putting #ifdef CONFIG_FAIL_MAKE_REQUEST in
the middle of a function.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26 16:49:46 -07:00
Jens Axboe
11ccf116d0 block: fix warning with calling smp_processor_id() in preemptible section
After commit 5757a6d7 introduced an unsafe calling of
smp_processor_id(), with preempt debuggin turned on we spew a lot of:

BUG: using smp_processor_id() in preemptible [00000000] code: kjournald/514
caller is __make_request+0x1b8/0x308
[<c0019f44>] (unwind_backtrace+0x0/0xe8) from [<c024b4cc>] (debug_smp_processor_id+0xbc/0xf0)
[<c024b4cc>] (debug_smp_processor_id+0xbc/0xf0) from [<c0223d14>] (__make_request+0x1b8/0x308)
[<c0223d14>] (__make_request+0x1b8/0x308) from [<c02215ac>] (generic_make_request+0x4dc/0x558)
[<c02215ac>] (generic_make_request+0x4dc/0x558) from [<c022173c>] (submit_bio+0x114/0x138)
[<c022173c>] (submit_bio+0x114/0x138) from [<c011f504>] (submit_bh+0x148/0x16c)
[<c011f504>] (submit_bh+0x148/0x16c) from [<c0121ed8>] (__sync_dirty_buffer+0x88/0xd8)
[<c0121ed8>] (__sync_dirty_buffer+0x88/0xd8) from [<c01aff78>] (journal_commit_transaction+0x1198/0x1688)
[<c01aff78>] (journal_commit_transaction+0x1198/0x1688) from [<c01b4034>] (kjournald+0xb4/0x224)
[<c01b4034>] (kjournald+0xb4/0x224) from [<c0069ea0>] (kthread+0x8c/0x94)
[<c0069ea0>] (kthread+0x8c/0x94) from [<c00137f8>] (kernel_thread_exit+0x0/0x8)

Fix this by just using raw_smp_processor_id(), it's just a hint
after all. There's no pinning of the CPU or accessing per-cpu
structures involved.

Reported-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-26 15:01:15 +02:00
Linus Torvalds
8ded371f81 Merge branch 'for-3.1/drivers' of git://git.kernel.dk/linux-block
* 'for-3.1/drivers' of git://git.kernel.dk/linux-block:
  cciss: do not attempt to read from a write-only register
  xen/blkback: Add module alias for autoloading
  xen/blkback: Don't let in-flight requests defer pending ones.
  bsg: fix address space warning from sparse
  bsg: remove unnecessary conditional expressions
  bsg: fix bsg_poll() to return POLLOUT properly
2011-07-25 10:38:18 -07:00
Linus Torvalds
096a705bbc Merge branch 'for-3.1/core' of git://git.kernel.dk/linux-block
* 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits)
  block: strict rq_affinity
  backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu
  block: fix patch import error in max_discard_sectors check
  block: reorder request_queue to remove 64 bit alignment padding
  CFQ: add think time check for group
  CFQ: add think time check for service tree
  CFQ: move think time check variables to a separate struct
  fixlet: Remove fs_excl from struct task.
  cfq: Remove special treatment for metadata rqs.
  block: document blk_plug list access
  block: avoid building too big plug list
  compat_ioctl: fix make headers_check regression
  block: eliminate potential for infinite loop in blkdev_issue_discard
  compat_ioctl: fix warning caused by qemu
  block: flush MEDIA_CHANGE from drivers on close(2)
  blk-throttle: Make total_nr_queued unsigned
  block: Add __attribute__((format(printf...) and fix fallout
  fs/partitions/check.c: make local symbols static
  block:remove some spare spaces in genhd.c
  block:fix the comment error in blkdev.h
  ...
2011-07-25 10:33:36 -07:00
Dan Williams
5757a6d76c block: strict rq_affinity
Some systems benefit from completions always being steered to the strict
requester cpu rather than the looser "per-socket" steering that
blk_cpu_to_group() attempts by default. This is because the first
CPU in the group mask ends up being completely overloaded with work,
while the others (including the original submitter) has power left
to spare.

Allow the strict mode to be set by writing '2' to the sysfs control
file. This is identical to the scheme used for the nomerges file,
where '2' is a more aggressive setting than just being turned on.

echo 2 > /sys/block/<bdev>/queue/rq_affinity

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Roland Dreier <roland@purestorage.com>
Tested-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-23 20:44:25 +02:00
Jens Axboe
4c64500ead block: fix patch import error in max_discard_sectors check
A '!' snuck in before the unlikely, rendering it useless.

Reported-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-23 20:34:59 +02:00
Linus Torvalds
d4e06701b8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (77 commits)
  [SCSI] fix crash in scsi_dispatch_cmd()
  [SCSI] sr: check_events() ignore GET_EVENT when TUR says otherwise
  [SCSI] bnx2i: Fixed kernel panic due to illegal usage of sc->request->cpu
  [SCSI] bfa: Update the driver version to 3.0.2.1
  [SCSI] bfa: Driver and BSG enhancements.
  [SCSI] bfa: Added support to query PHY.
  [SCSI] bfa: Added HBA diagnostics support.
  [SCSI] bfa: Added support for flash configuration
  [SCSI] bfa: Added support to obtain SFP info.
  [SCSI] bfa: Added support for CEE info and stats query.
  [SCSI] bfa: Extend BSG interface.
  [SCSI] bfa: FCS bug fixes.
  [SCSI] bfa: DMA memory allocation enhancement.
  [SCSI] bfa: Brocade-1860 Fabric Adapter vHBA support.
  [SCSI] bfa: Brocade-1860 Fabric Adapter PLL init fixes.
  [SCSI] bfa: Added Fabric Assigned Address(FAA) support
  [SCSI] bfa: IOC bug fixes.
  [SCSI] bfa: Enable ASIC block configuration and query.
  [SCSI] bnx2i: Updated copyright and bump version
  [SCSI] bnx2i: Modified to skip CNIC registration if iSCSI is not supported
  ...

Fix up some trivial conflicts in:
 - drivers/scsi/bnx2fc/{bnx2fc.h,bnx2fc_fcoe.c}:
	Crazy broadcom version number conflicts
 - drivers/target/tcm_fc/tfc_cmd.c
	Just trivial cleanups done on adjacent lines
2011-07-23 11:13:11 -07:00
James Bottomley
bfe159a512 [SCSI] fix crash in scsi_dispatch_cmd()
USB surprise removal of sr is triggering an oops in
scsi_dispatch_command().  What seems to be happening is that USB is
hanging on to a queue reference until the last close of the upper
device, so the crash is caused by surprise remove of a mounted CD
followed by attempted unmount.

The problem is that USB doesn't issue its final commands as part of
the SCSI teardown path, but on last close when the block queue is long
gone.  The long term fix is probably to make sr do the teardown in the
same way as sd (so remove all the lower bits on ejection, but keep the
upper disk alive until last close of user space).  However, the
current oops can be simply fixed by not allowing any commands to be
sent to a dead queue.

Cc: stable@kernel.org
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2011-07-21 14:21:18 -07:00
Lai Jiangshan
57bdfbf9ee block,rcu: Convert call_rcu(disk_free_ptbl_rcu_cb) to kfree_rcu()
The rcu callback disk_free_ptbl_rcu_cb() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(disk_free_ptbl_rcu_cb).

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-07-20 14:10:13 -07:00
Shaohua Li
7700fc4f67 CFQ: add think time check for group
Currently when the last queue of a group has no request, we don't expire
the queue to hope request from the group comes soon, so the group doesn't
miss its share. But if the think time is big, the assumption isn't correct
and we just waste bandwidth. In such case, we don't do idle.

[global]
runtime=30
direct=1

[test1]
cgroup=test1
cgroup_weight=1000
rw=randread
ioengine=libaio
size=500m
runtime=30
directory=/mnt
filename=file1
thinktime=9000

[test2]
cgroup=test2
cgroup_weight=1000
rw=randread
ioengine=libaio
size=500m
runtime=30
directory=/mnt
filename=file2

	patched		base
test1	64k		39k
test2	548k		540k
total	604k		578k

group1 gets much better throughput because it waits less time.

To check if the patch changes behavior of queue without think time. I also
tried to give test1 2ms think time or no think time. The test result is stable.
The thoughput doesn't change with/without the patch.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-12 14:24:56 +02:00
Shaohua Li
f5f2b6ceb2 CFQ: add think time check for service tree
Currently when the last queue of a service tree has no request, we don't
expire the queue to hope request from the service tree comes soon, so the
service tree doesn't miss its share. But if the think time is big, the
assumption isn't correct and we just waste bandwidth. In such case, we
don't do idle.

[global]
runtime=10
direct=1

[test1]
rw=randread
ioengine=libaio
size=500m
directory=/mnt
filename=file1
thinktime=9000

[test2]
rw=read
ioengine=libaio
size=1G
directory=/mnt
filename=file2

	patched		base
test1	41k/s		33k/s
test2	15868k/s	15789k/s
total	15902k/s	15817k/s

A slightly better

To check if the patch changes behavior of queue without think time. I also
tried to give test1 2ms think time or no think time. The test has variation
even without the patch, but the average throughput doesn't change with/without
the patch.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-12 14:24:55 +02:00
Shaohua Li
383cd7213f CFQ: move think time check variables to a separate struct
Move the variables to do think time check to a sepatate struct. This is
to prepare adding think time check for service tree and group. No
functional change.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-12 14:24:35 +02:00
Justin TerAvest
4aede84b33 fixlet: Remove fs_excl from struct task.
fs_excl is a poor man's priority inheritance for filesystems to hint to
the block layer that an operation is important. It was never clearly
specified, not widely adopted, and will not prevent starvation in many
cases (like across cgroups).

fs_excl was introduced with the time sliced CFQ IO scheduler, to
indicate when a process held FS exclusive resources and thus needed
a boost.

It doesn't cover all file systems, and it was never fully complete.
Lets kill it.

Signed-off-by: Justin TerAvest <teravest@google.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-12 08:35:10 +02:00
Justin TerAvest
a07405b780 cfq: Remove special treatment for metadata rqs.
There is no consistency among filesystems from what bios (or requests)
are marked as being metadata. It's interesting to expose this in traces,
but we shouldn't schedule the requests differently based on whether or
not they're marked as being metadata.

Signed-off-by: Justin TerAvest <teravest@google.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-10 22:09:19 +02:00
Shaohua Li
55c022bbdd block: avoid building too big plug list
When I test fio script with big I/O depth, I found the total throughput drops
compared to some relative small I/O depth. The reason is the thread accumulates
big requests in its plug list and causes some delays (surely this depends
on CPU speed).
I thought we'd better have a threshold for requests. When a threshold reaches,
this means there is no request merge and queue lock contention isn't severe
when pushing per-task requests to queue, so the main advantages of blk plug
don't exist. We can force a plug list flush in this case.
With this, my test throughput actually increases and almost equals to small
I/O depth. Another side effect is irq off time decreases in blk_flush_plug_list()
for big I/O depth.
The BLK_MAX_REQUEST_COUNT is choosen arbitarily, but 16 is efficiently to
reduce lock contention to me. But I'm open here, 32 is ok in my test too.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-08 08:19:20 +02:00
Mike Snitzer
0f79960391 block: eliminate potential for infinite loop in blkdev_issue_discard
Due to the recently identified overflow in read_capacity_16() it was
possible for max_discard_sectors to be zero but still have discards
enabled on the associated device's queue.

Eliminate the possibility for blkdev_issue_discard to infinitely loop.

Interestingly this issue wasn't identified until a device, whose
discard_granularity was 0 due to read_capacity_16 overflow, was consumed
by blk_stack_limits() to construct limits for a higher-level DM
multipath device.  The multipath device's resulting limits never had the
discard limits stacked because blk_stack_limits() will only do so if
the bottom device's discard_granularity != 0.  This resulted in the
multipath device's limits.max_discard_sectors being 0.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-06 21:32:02 +02:00
Johannes Stezenbach
390192b300 compat_ioctl: fix warning caused by qemu
On Linux x86_64 host with 32bit userspace, running
qemu or even just "qemu-img create -f qcow2 some.img 1G"
causes a kernel warning:

ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(00005326){t:'S';sz:0} arg(7fffffff) on some.img
ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(801c0204){t:02;sz:28} arg(fff77350) on some.img

ioctl 00005326 is CDROM_DRIVE_STATUS,
ioctl 801c0204 is FDGETPRM.

The warning appears because the Linux compat-ioctl handler for these
ioctls only applies to block devices, while qemu also uses the ioctls on
plain files.

Signed-off-by: Johannes Stezenbach <js@sig21.net>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-01 22:32:26 +02:00
Tejun Heo
85ef06d1d2 block: flush MEDIA_CHANGE from drivers on close(2)
Currently, only open(2) is defined as the 'clearing' point.  It has
two roles - first, it's an acknowledgement from userland indicating
that the event has been received and kernel can clear pending states
and proceed to generate more events.  Secondly, it's passed on to
device drivers as a hint indicating that a synchronization point has
been reached and it might want to take a deeper look at the device.

The latter currently is only used by sr which uses two different
mechanisms - GET_EVENT_MEDIA_STATUS_NOTIFICATION and TEST_UNIT_READY
to discover events, where the former is lighter weight and safe to be
used repeatedly but may not provide full coverage.  Among other
things, GET_EVENT can't detect media removal while TUR can.

This patch makes close(2) - blkdev_put() - indicate clearing hint for
MEDIA_CHANGE to drivers.  disk_check_events() is renamed to
disk_flush_events() and updated to take @mask for events to flush
which is or'd to ev->clearing and will be passed to the driver on the
next ->check_events() invocation.

This change makes sr generate MEDIA_CHANGE when media is ejected from
userland - e.g. with eject(1).

Note: Given the current usage, it seems @clearing hint is needlessly
complex.  disk_clear_events() can simply clear all events and the hint
can be boolean @flush.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-01 16:17:47 +02:00
Jens Axboe
04bf7869ca Merge branch 'for-linus' into for-3.1/core
Conflicts:
	block/blk-throttle.c
	block/cfq-iosched.c

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-01 16:17:13 +02:00
Shaohua Li
726e99ab88 cfq-iosched: make code consistent
ioc->ioc_data is rcu protectd, so uses correct API to access it.
This doesn't change any behavior, but just make code consistent.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: stable@kernel.org # after ab4bd22d
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-27 09:36:06 +02:00
Shaohua Li
3181faa85b cfq-iosched: fix a rcu warning
I got a rcu warnning at boot. the ioc->ioc_data is rcu_deferenced, but
doesn't hold rcu_read_lock.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: stable@kernel.org # after ab4bd22d
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-27 09:36:06 +02:00
Namhyung Kim
2b727c6300 bsg: fix address space warning from sparse
copy_from/to_user() and blk_rq_map_user() want __user pointer.
This patch fixes following warnings from sparse:

   CHECK   block/bsg.c
 block/bsg.c:185:38: warning: incorrect type in argument 2 (different address spaces)
 block/bsg.c:185:38:    expected void const [noderef] <asn:1>*from
 block/bsg.c:185:38:    got void *<noident>
 block/bsg.c:295:58: warning: incorrect type in argument 4 (different address spaces)
 block/bsg.c:295:58:    expected void [noderef] <asn:1>*<noident>
 block/bsg.c:295:58:    got void *[assigned] dxferp
 block/bsg.c:311:52: warning: incorrect type in argument 4 (different address spaces)
 block/bsg.c:311:52:    expected void [noderef] <asn:1>*<noident>
 block/bsg.c:311:52:    got void *[assigned] dxferp
 block/bsg.c:448:37: warning: incorrect type in argument 1 (different address spaces)
 block/bsg.c:448:37:    expected void [noderef] <asn:1>*dst
 block/bsg.c:448:37:    got void *<noident>

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-20 13:27:45 +02:00
Namhyung Kim
44194e3e88 bsg: remove unnecessary conditional expressions
Second condition in OR always implies first condition is false
thus bytes_read in the second is not needed. The same goes to
bytes_written.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-20 13:27:44 +02:00
Namhyung Kim
80ceb05713 bsg: fix bsg_poll() to return POLLOUT properly
POLLOUT should be returned only if bd->queued_cmds < bd->max_queue
so that bsg_alloc_command() can proceed.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-20 13:27:44 +02:00
Joe Perches
d2f31a5fd6 blk-throttle: Make total_nr_queued unsigned
The total of two unsigned values should also be unsigned.

Update throtl_log output to unsigned.
Update total_nr_queued test to non-zero to be the
same as the other total_nr_queued tests.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-13 20:19:27 +02:00
Joe Perches
fd16d26319 block: Add __attribute__((format(printf...) and fix fallout
Use the compiler to verify format strings and arguments.

Fix fallout.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-13 20:18:49 +02:00
Wanlong Gao
9f5e486550 block:remove some spare spaces in genhd.c
Remove the end-of-line spaces in genhd.c.

Signed-off-by: Wanlong Gao <wanlong.gao@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-13 10:45:43 +02:00
Joe Perches
08e8138ade block: Add __attribute__((format(printf...) and fix fallout
Use the compiler to verify format strings and arguments.

Fix fallout.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-13 10:42:49 +02:00
Tejun Heo
fdd514e16b block: make disk_block_events() properly wait for work cancellation
disk_block_events() should guarantee that the event work is not in
flight on return and once blocked it shouldn't issue further
cancellations.

Because there was no synchronization between the first blocker doing
cancel_delayed_work_sync() and the following blockers, the following
blockers could finish before cancellation was complete, which broke
both guarantees - event work could be in flight and cancellation could
happen after return.

This bug triggered WARN_ON_ONCE() in disk_clear_events() reported in
bug#34662.

  https://bugzilla.kernel.org/show_bug.cgi?id=34662

Fix it by adding an outer mutex which protects both block count
manipulation and work cancellation.

-v2: Use outer mutex instead of bit waitqueue per Linus.

Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-09 20:43:59 +02:00
Tejun Heo
c3af54afba block: remove non-syncing __disk_block_events() and fold it into disk_block_events()
After the previous update to disk_check_events(), nobody is using
non-syncing __disk_block_events().  Remove @sync and, as this makes
__disk_block_events() virtually identical to disk_block_events(),
remove the underscore prefixed version.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-09 20:43:55 +02:00
Tejun Heo
a9dce2a3b4 block: don't use non-syncing event blocking in disk_check_events()
This patch is part of fix for triggering of WARN_ON_ONCE() in
disk_clear_events() reported in bug#34662.

  https://bugzilla.kernel.org/show_bug.cgi?id=34662

disk_clear_events() blocks events, schedules and flushes the event
work.  It expects the work to have started execution on schedule and
finished on return from flush.  WARN_ON_ONCE() triggers if the event
work hasn't executed as expected.  This problem happens because
__disk_block_events() fails to guarantee that the event work item is
not in flight on return from the function in race-free manner.  The
problem is two-fold and this patch addresses one of them.

When __disk_block_events() is called with @sync == %false, it bumps
event block count, calls cancel_delayed_work() and return.  This makes
it impossible to guarantee that event polling is not in flight on
return from syncing __disk_block_events() - if the first blocker was
non-syncing, polling could still be in progress and later syncing ones
would assume that the first blocker already canceled it.

Making __disk_block_events() cancel_sync regardless of block count
isn't feasible either as it may race with forced event checking in
disk_clear_events().

As disk_check_events() is the only user of non-syncing
__disk_block_events(), updating it to directly cancel and schedule
event work is the easiest way to solve the issue.

Note that there's another bug in __disk_block_events() and this patch
doesn't fix the issue completely.  Later patch will fix the other bug.

Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-09 20:43:54 +02:00
Paul Bolle
df4156569d block: rename the return of two functions
If we rename the return of alloc_io_context() and get_io_context() from
"ret" to "ioc" the code get's (a bit) more readable and (a lot) more
grepable.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-06 05:57:25 +02:00
Paul Bolle
8aea45451b CFQ: make two functions static
Correctly suggested by sparse.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-06 05:57:25 +02:00
Jens Axboe
9b50902db5 cfq-iosched: fix locking around ioc->ioc_data assignment
Since we are modifying this RCU pointer, we need to hold
the lock protecting it around it.

This fixes a potential reuse and double free of a cfq
io_context structure. The bug has been in CFQ for a long
time, it hit very few people but those it did hit seemed
to see it a lot.

Tracked in RH bugzilla here:

https://bugzilla.redhat.com/show_bug.cgi?id=577968

Credit goes to Paul Bolle for figuring out that the issue
was around the one-hit ioc->ioc_data cache. Thanks to his
hard work the issue is now fixed.

Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-06 05:57:21 +02:00
Jens Axboe
ab4bd22d3c cfq-iosched: fix locking around ioc->ioc_data assignment
Since we are modifying this RCU pointer, we need to hold
the lock protecting it around it.

This fixes a potential reuse and double free of a cfq
io_context structure. The bug has been in CFQ for a long
time, it hit very few people but those it did hit seemed
to see it a lot.

Tracked in RH bugzilla here:

https://bugzilla.redhat.com/show_bug.cgi?id=577968

Credit goes to Paul Bolle for figuring out that the issue
was around the one-hit ioc->ioc_data cache. Thanks to his
hard work the issue is now fixed.

Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-06 05:56:49 +02:00
Jeff Moyer
796d5116c4 iosched: prevent aliased requests from starving other I/O
Hi, Jens,

If you recall, I posted an RFC patch for this back in July of last year:
http://lkml.org/lkml/2010/7/13/279

The basic problem is that a process can issue a never-ending stream of
async direct I/Os to the same sector on a device, thus starving out
other I/O in the system (due to the way the alias handling works in both
cfq and deadline).  The solution I proposed back then was to start
dispatching from the fifo after a certain number of aliases had been
dispatched.  Vivek asked why we had to treat aliases differently at all,
and I never had a good answer.  So, I put together a simple patch which
allows aliases to be added to the rb tree (it adds them to the right,
though that doesn't matter as the order isn't guaranteed anyway).  I
think this is the preferred solution, as it doesn't break up time slices
in CFQ or batches in deadline.  I've tested it, and it does solve the
starvation issue.  Let me know what you think.

Cheers,
Jeff

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-02 21:19:05 +02:00
Paul Bolle
e2bd9678fc block: Use hlist_entry() for io_context.cic_list.first
list_entry() and hlist_entry() are both simply aliases for
container_of(), but since io_context.cic_list.first is an hlist_node one
should at least use the correct alias.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-02 13:05:02 +02:00
Paul Bolle
28304f485c cfq-iosched: Remove bogus check in queue_fail path
queue_fail can only be reached if cic is NULL, so its check for cic must
be bogus.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-02 13:05:02 +02:00
Kyungmin Park
4495a7d41d CFQ: Fix typo and remove unnecessary semicolon
Fix comment typo and remove unnecessary semicolon at macro

Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-31 19:49:44 +02:00
Linus Torvalds
bdf7cf1c83 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  loop: export module parameters
  block: export blk_{get,put}_queue()
  block: remove unused variable in bio_attempt_front_merge()
  block: always allocate genhd->ev if check_events is implemented
  brd: export module parameters
  brd: fix comment on initial device creation
  brd: handle on-demand devices correctly
  brd: limit 'max_part' module param to DISK_MAX_PARTS
  brd: get rid of unused members from struct brd_device
  block: fix oops on !disk->queue and sysfs discard alignment display
2011-05-27 10:24:40 -07:00
Jens Axboe
d86e0e83b3 block: export blk_{get,put}_queue()
We need them in SCSI to fix a bug, but currently they are not
exported to modules. Export them.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-27 07:45:45 +02:00
Ben Blum
f780bdb7c1 cgroups: add per-thread subsystem callbacks
Add cgroup subsystem callbacks for per-thread attachment in atomic contexts

Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
for cgroups's subsystem interface.  Unlike can_attach and attach, these
are for per-thread operations, to be called potentially many times when
attaching an entire threadgroup.

Also, the old "bool threadgroup" interface is removed, as replaced by
this.  All subsystems are modified for the new interface - of note is
cpuset, which requires from/to nodemasks for attach to be globally scoped
(though per-cpuset would work too) to persist from its pre_attach to
attach_task and attach.

This is a pre-patch for cgroup-procs-writable.patch.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26 17:12:34 -07:00
Luca Tettamanti
700c4f3325 block: remove unused variable in bio_attempt_front_merge()
sector is never read inside the function.

Signed-off-by: Luca Tettamanti <kronos.it@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-26 21:07:26 +02:00
Tejun Heo
75e3f3ee3c block: always allocate genhd->ev if check_events is implemented
9fd097b149 (block: unexport DISK_EVENT_MEDIA_CHANGE for legacy/fringe
drivers) removed DISK_EVENT_MEDIA_CHANGE from legacy/fringe block
drivers which have inadequate ->check_events().  Combined with earlier
change 7c88a168da (block: don't propagate unlisted DISK_EVENTs to
userland), this enables using ->check_events() for internal processing
while avoiding enabling in-kernel block event polling which can lead
to infinite event loop.

Unfortunately, this made many drivers including floppy without any bit
set in disk->events and ->async_events in which case disk_add_events()
simply skipped allocation of disk->ev, which disables whole event
handling.  As ->check_events() is still used during open processing
for revalidation, this can lead to open failure.

This patch always allocates disk->ev if ->check_events is implemented.
In the long term, it would make sense to simply include the event
structure inline into genhd as it's now used by virtually all block
devices.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ondrej Zary <linux@rainbow-software.org>
Reported-by: Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-26 21:06:50 +02:00
Namhyung Kim
1547010e6e cfq-iosched: free cic_index if cfqd allocation fails
When struct cfq_data allocation fails, cic_index need to be freed.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-24 10:23:22 +02:00
Namhyung Kim
20359f27e8 cfq-iosched: remove unused 'group_changed' in cfq_service_tree_add()
The 'group_changed' variable is initialized to 0 and never changed, so
checking the variable is meaningless.

It is a leftover from 0bbfeb8320 ("cfq-iosched: Always provide group
iosolation."). Let's get rid of it.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Justin TerAvest <teravest@google.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-24 10:23:22 +02:00
Namhyung Kim
229836bd63 cfq-iosched: reduce bit operations in cfq_choose_req()
Reduce the number of bit operations in cfq_choose_req() on average
(and worst) cases.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-24 10:23:21 +02:00
Namhyung Kim
b9f8ce0599 cfq-iosched: algebraic simplification in cfq_prio_to_maxrq()
Simplify the calculation in cfq_prio_to_maxrq(), plus replace CFQ_PRIO_LISTS to
IOPRIO_BE_NR since they are the same and IOPRIO_BE_NR looks more reasonable in
this context IMHO.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-24 10:23:21 +02:00
Vivek Goyal
4cbadbd16e blk-cgroup: Initialize ioc->cgroup_changed at ioc creation time
If we don't explicitly initialize it to zero, CFQ might think that
cgroup of ioc has changed and it generates lots of unnecessary calls
to call_for_each_cic(changed_cgroup). Fix it.

cfq_get_io_context()
  cfq_ioc_set_cgroup()
     call_for_each_cic(ioc, changed_cgroup)

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-23 19:35:04 +02:00
Vivek Goyal
95cf3dd9db block: call elv_bio_merged() when merged
Commit 73c1010119 ("block: initial patch for on-stack per-task plugging")
removed calls to elv_bio_merged() when @bio merged with @req. Re-add them.

This in turn will update merged stats in associated group. That
should be safe as long as request has got reference to the blkio_group.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Cc: Divyesh Shah <dpshah@google.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-23 10:02:19 +02:00
Vivek Goyal
317389a773 cfq-iosched: Make IO merge related stats per cpu
Make BLKIO_STAT_MERGED per cpu hence gettring rid of need of taking
blkg->stats_lock.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-23 10:02:19 +02:00
Vivek Goyal
2abae55f5a cfq-iosched: Fix a memory leak of per cpu stats for root group
We allocated per cpu stats struct for root group but did not free it.
Fix it.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-23 10:02:19 +02:00
Jens Axboe
771949d03b block: get rid of on-stack plugging debug checks
We don't need them anymore, so kill:

- REQ_ON_PLUG checks in various places
- !rq_mergeable() check in plug merging

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:52:16 +02:00
Jens Axboe
0eb8e88572 Merge branch 'for-linus' into for-2.6.40/core
This patch merges in a fix that missed 2.6.39 final.

Conflicts:
	block/blk.h
2011-05-20 20:36:16 +02:00
Vivek Goyal
af75cd3c67 blk-throttle: Make no throttling rule group processing lockless
Currently we take a queue lock on each bio to check if there are any
throttling rules associated with the group and also update the stats.
Now access the group under rcu and update the stats without taking
the queue lock. Queue lock is taken only if there are throttling rules
associated with the group.

So the common case of root group when there are no rules, save
unnecessary pounding of request queue lock.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:34:53 +02:00
Vivek Goyal
f0bdc8cdd9 blk-cgroup: Make cgroup stat reset path blkg->lock free for dispatch stats
Now dispatch stats update is lock free. But reset of these stats still
takes blkg->stats_lock and is dependent on that. As stats are per cpu,
we should be able to just reset the stats on each cpu without any locks.
(Atleast for 64bit arch).

On 32bit arch there is a small race where 64bit updates are not atomic.
The result of this race can be that in the presence of other writers,
one might not get 0 value after reset of a stat and might see something
intermediate

One can write more complicated code to cover this race like sending IPI
to other cpus to reset stats and for offline cpus, reset these directly.

Right not I am not taking that path because reset_update is more of a
debug feature and it can happen only on 32bit arch and possibility of
it happening is small. Will fix it if it becomes a real problem. For
the time being going for code simplicity.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:34:53 +02:00
Vivek Goyal
575969a0dd blk-cgroup: Make 64bit per cpu stats safe on 32bit arch
Some of the stats are 64bit and updation will be non atomic on 32bit
architecture. Use sequence counters on 32bit arch to make reading
of stats safe.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:34:53 +02:00
Vivek Goyal
5624a4e445 blk-throttle: Make dispatch stats per cpu
Currently we take blkg_stat lock for even updating the stats. So even if
a group has no throttling rules (common case for root group), we end
up taking blkg_lock, for updating the stats.

Make dispatch stats per cpu so that these can be updated without taking
blkg lock.

If cpu goes offline, these stats simply disappear. No protection has
been provided for that yet. Do we really need anything for that?

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:34:52 +02:00
Vivek Goyal
4843c69d49 blk-throttle: Free up a group only after one rcu grace period
Soon we will allow accessing a throtl_grp under rcu_read_lock(). Hence
start freeing up throtl_grp after one rcu grace period.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:34:52 +02:00
Vivek Goyal
5617cbef77 blk-throttle: Use helper function to add root throtl group to lists
Use same helper function for root group as we use with dynamically
allocated groups to add it to various lists.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:34:52 +02:00
Vivek Goyal
269f541555 blk-throttle: Introduce a helper function to fill in device details
A helper function for the code which is used at 2-3 places. Makes reading
code little easier.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:34:52 +02:00
Vivek Goyal
29b125892f blk-throttle: Dynamically allocate root group
Currently, we allocate root throtl_grp statically. But as we will be
introducing per cpu stat pointers and that will be allocated
dynamically even for root group, we might as well make whole root
throtl_grp allocation dynamic and treat it in same manner as other
groups.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:34:52 +02:00
Vivek Goyal
f469a7b4d5 blk-cgroup: Allow sleeping while dynamically allocating a group
Currently, all the cfq_group or throtl_group allocations happen while
we are holding ->queue_lock and sleeping is not allowed.

Soon, we will move to per cpu stats and also need to allocate the
per group stats. As one can not call alloc_percpu() from atomic
context as it can sleep, we need to drop ->queue_lock, allocate the
group, retake the lock and continue processing.

In throttling code, I check the queue DEAD flag again to make sure
that driver did not call blk_cleanup_queue() in the mean time.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:34:52 +02:00
Vivek Goyal
56edf7d75d cfq-iosched: Fix a possible race with cfq cgroup removal code
blkg->key = cfqd is an rcu protected pointer and hence we used to do
call_rcu(cfqd->rcu_head) to free up cfqd after one rcu grace period.

The problem here is that even though cfqd is around, there are no
gurantees that associated request queue (td->queue) or q->queue_lock
is still around. A driver might have called blk_cleanup_queue() and
release the lock.

It might happen that after freeing up the lock we call
blkg->key->queue->queue_ock and crash. This is possible in following
path.

blkiocg_destroy()
 blkio_unlink_group_fn()
  cfq_unlink_blkio_group()

Hence, wait for an rcu peirod if there are groups which have not
been unlinked from blkcg->blkg_list. That way, if there are any groups
which are taking cfq_unlink_blkio_group() path, can safely take queue
lock.

This is how we have taken care of race in throttling logic also.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:34:52 +02:00
Vivek Goyal
3e59cf9d66 cfq-iosched: Get rid of redundant function parameter "create"
Nobody seems to be using cfq_find_alloc_cfqg() function parameter "create".
Get rid of that.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:34:52 +02:00
Vivek Goyal
a23e686955 blk-cgroup: move some fields of unaccounted_time file under right config option
cgroup unaccounted_time file is created only if CONFIG_DEBUG_BLK_CGROUP=y.
there are some fields which are out side this config option. Fix that.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:34:52 +02:00
Vivek Goyal
a29a171e7c blk-throttle: Do the new group initialization with the help of a function
Group initialization code seems to be at two places. root group
initialization in blk_throtl_init() and dynamically allocated group
in throtl_find_alloc_tg(). Create a common function and use at both
the places.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:34:51 +02:00
Jens Axboe
698567f3fa Merge commit 'v2.6.39' into for-2.6.40/core
Since for-2.6.40/core was forked off the 2.6.39 devel tree, we've
had churn in the core area that makes it difficult to handle
patches for eg cfq or blk-throttle. Instead of requiring that they
be based in older versions with bugs that have been fixed later
in the rc cycle, merge in 2.6.39 final.

Also fixes up conflicts in the below files.

Conflicts:
	drivers/block/paride/pcd.c
	drivers/cdrom/viocd.c
	drivers/ide/ide-cd.c

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-20 20:33:15 +02:00
James Bottomley
0a58e077eb block: add proper state guards to __elv_next_request
blk_cleanup_queue() calls elevator_exit() and after this, we can't
touch the elevator without oopsing.  __elv_next_request() must check
for this state because in the refcounted queue model, we can still
call it after blk_cleanup_queue() has been called.

This was reported as causing an oops attributable to scsi.

Signed-off-by: James Bottomley <James.Bottomley@suse.de>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-18 19:30:32 +02:00
Shaohua Li
3ec717b7ca block: don't delay blk_run_queue_async
Let's check a scenario:
1. blk_delay_queue(q, SCSI_QUEUE_DELAY);
2. blk_run_queue_async();
the second one will became a noop, because q->delay_work already has
WORK_STRUCT_PENDING_BIT set, so the delayed work will still run after
SCSI_QUEUE_DELAY. But blk_run_queue_async actually hopes the delayed
work runs immediately.

Fix this by doing a cancel on potentially pending delayed work
before queuing an immediate run of the workqueue.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-18 12:24:03 +02:00
Martin K. Petersen
a934a00a69 block: Fix discard topology stacking and reporting
In some cases we would end up stacking discard_zeroes_data incorrectly.
Fix this by enabling the feature by default for stacking drivers and
clearing it for low-level drivers. Incorporating a device that does not
support dzd will then cause the feature to be disabled in the stacking
driver.

Also ensure that the maximum discard value does not overflow when
exported in sysfs and return 0 in the alignment and dzd fields for
devices that don't support discard.

Reported-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-18 10:37:35 +02:00
Vivek Goyal
70087dc38c blk-throttle: Use task_subsys_state() to determine a task's blkio_cgroup
Currentlly we first map the task to cgroup and then cgroup to
blkio_cgroup. There is a more direct way to get to blkio_cgroup
from task using task_subsys_state(). Use that.

The real reason for the fix is that it also avoids a race in generic
cgroup code. During remount/umount rebind_subsystems() is called and
it can do following with and rcu protection.

cgrp->subsys[i] = NULL;

That means if somebody got hold of cgroup under rcu and then it tried
to do cgroup->subsys[] to get to blkio_cgroup, it would get NULL which
is wrong. I was running into this race condition with ltp running on a
upstream derived kernel and that lead to crash.

So ideally we should also fix cgroup generic code to wait for rcu
grace period before setting pointer to NULL. Li Zefan is not very keen
on introducing synchronize_wait() as he thinks it will slow
down moun/remount/umount operations.

So for the time being atleast fix the kernel crash by taking a more
direct route to blkio_cgroup.

One tester had reported a crash while running LTP on a derived kernel
and with this fix crash is no more seen while the test has been
running for over 6 days.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-16 15:24:08 +02:00
Lukas Czerner
8af1954d17 blkdev: Do not return -EOPNOTSUPP if discard is supported
Currently we return -EOPNOTSUPP in blkdev_issue_discard() if any of the
bio fails due to underlying device not supporting discard request.
However, if the device is for example dm device composed of devices
which some of them support discard and some of them does not, it is ok
for some bios to fail with EOPNOTSUPP, but it does not mean that discard
is not supported at all.

This commit removes the check for bios failed with EOPNOTSUPP and change
blkdev_issue_discard() to return operation not supported if and only if
the device does not actually supports it, not just part of the device as
some bios might indicate.

This change also fixes problem with BLKDISCARD ioctl() which now works
correctly on such dm devices.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
CC: Jens Axboe <jaxboe@fusionio.com>
CC: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-06 19:30:01 -06:00