The "greedy" code was an attempt to retain glocks for a minimum length
of time when they relate to mmap()ed files. The current implementation
of this feature is not, however, ideal in that it requires allocating
memory in order to do this and it's overly complicated.
It also misses the mark by ignoring the other I/O operations which are
just as likely to suffer from the same problem. So the plan is to remove
this now and then add the functionality back as part of the glock state
machine at a later date (and thus take into account all the possible
users of this feature).
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Here is something I spotted (while looking for something entirely
different) the other day.
Rather than using a completion in each and every struct gfs2_holder,
this removes it in favour of hashed wait queues, thus saving a
considerable amount of memory both on the stack (where a number of
gfs2_holder structures are allocated) and in particular in the
gfs2_inode which has 8 gfs2_holder structures embedded within it.
As a result on x86_64 the gfs2_inode shrinks from 2488 bytes to
1912 bytes, a saving of 576 bytes per inode (no, that's not a typo!).
In actual practice we get a much better result than that since
now that a gfs2_inode is under the 2048 byte barrier, we get two
per 4k slab page effectively halving the amount of memory required
to store gfs2_inodes.
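The packing argument above can be checked with a little arithmetic. The
following is only a sketch under a simplified model: it assumes a 4096
byte page and ignores any slab alignment or bookkeeping overhead, and the
helper names are hypothetical, not part of gfs2.

```c
#include <stddef.h>

/*
 * Simplified model: how many objects of obj_size fit in one page.
 * Real slab caches also spend space on alignment and bookkeeping,
 * which this sketch deliberately ignores.
 */
size_t objs_per_page(size_t page_size, size_t obj_size)
{
	return obj_size ? page_size / obj_size : 0;
}

/* Page bytes effectively consumed per object, in the same model. */
size_t page_bytes_per_obj(size_t page_size, size_t obj_size)
{
	size_t n = objs_per_page(page_size, obj_size);

	return n ? page_size / n : 0;
}
```

With the sizes quoted above, a 2488 byte object gets a page to itself,
while a 1912 byte object shares the page with one other.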
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This removes the extra filldir callback which gfs2 was using to
enclose an attempt at readahead for inodes during readdir. The
code was too complicated and also hurt performance badly in the
case that the getdents64/readdir call isn't being followed by
stat() and it wasn't even getting it right all the time when it
was.
As a result, on my test box an "ls" of a directory containing 250000
files fell from about 7mins (freshly mounted, so nothing cached) to
between about 15 to 25 seconds. When the directory content was cached,
the time taken fell from about 3mins to about 4 or 5 seconds.
Interestingly in the cached case, running "ls -l" once reduced the time
taken for subsequent runs of "ls" to about 6 secs even without this
patch. Now it turns out that there was a special case of glocks being
used for prefetching the metadata, but because of the timeouts for these
locks (set to 10 secs) the metadata was being timed out before it was
being used and thus the prefetch code was constantly trying to prefetch
the same data over and over.
Calling "ls -l" meant that the inodes were brought into memory and once
the inodes are cached, the glocks are not disposed of until the inodes
are pushed out of the cache, thus extending the lifetime of the glocks,
and thus bringing down the time for subsequent runs of "ls"
considerably.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
It occurred to me that although a gfs2 specific writepages for ordered
writes and journaled data would be tricky, by hooking writepages only
for "data=writeback" mounts we could take advantage of not needing
buffer heads (we don't use them on the read side, nor have we for some
time) and create much larger I/Os for the block layer.
Using blktrace both before and after, it's possible to see that for large
I/Os, most of the requests generated through writepages are now 1024
sectors after this patch is applied as opposed to 8 sectors before.
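To put those request sizes in bytes, here is a trivial conversion sketch.
It assumes the usual 512 byte sector unit; the helper name is
hypothetical.

```c
#include <stdint.h>

/* Assumption: sizes are counted in 512 byte sectors. */
#define SECTOR_BYTES 512u

/* Convert a request size in sectors to bytes. */
uint64_t sectors_to_bytes(uint64_t sectors)
{
	return sectors * SECTOR_BYTES;
}
```

Under that assumption, 8 sectors is a single 4 KiB request and 1024
sectors is a 512 KiB request, i.e. 128 times larger.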
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
If master recovery happens on an rsb in one recovery sequence, then that
sequence is aborted before lock recovery happens, then in the next
sequence, we rely on the previous master recovery (which may now be
invalid due to another node ignoring a lookup result) and go on to do the
lock recovery where we get stuck due to an invalid master value.
- recovery cycle begins: master of rsb X has left
- nodes A and B send node C an rcom lookup for X to find the new master
- C gets lookup from B first, sets B as new master, and sends reply back to B
- C gets lookup from A next, and sends reply back to A saying B is master
- A gets lookup reply from C and sets B as the new master in the rsb
- recovery cycle on A, B and C is aborted to start a new recovery
- B gets lookup reply from C and ignores it since there's a new recovery
- recovery cycle begins: some other node has joined
- B doesn't think it's the master of X so it doesn't rebuild it in the directory
- C looks up the master of X, no one is master, so it becomes new master
- B looks up the master of X, finds it's C
- A believes that B is the master of X, so it sends its lock to B
- B sends an error back to A
- A resends
- this repeats forever, the incorrect master value on A is never corrected
The fix is to do master recovery on an rsb that still has the NEW_MASTER
flag set from an earlier recovery sequence, and therefore didn't complete
lock recovery.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When a user process exits, we clear all the locks it holds. There is a
problem, though, with locks that the process had begun unlocking before it
exited. We couldn't find the lkb's that were in the process of being
unlocked remotely, to flag that they are DEAD. To solve this, we move
lkb's being unlocked onto a new list in the per-process structure that
tracks what locks the process is holding. We can then go through this
list to flag the necessary lkb's when clearing locks for a process when it
exits.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch converts the DLM TCP lowcomms to use workqueues rather than using its
own daemon functions, simultaneously removing a lot of code and making it more
scalable on multi-processor machines.
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
On Thu, Jan 11, 2007 at 10:26:27PM -0800, Andrew Morton wrote:
>...
> Changes since 2.6.20-rc3-mm1:
>...
> git-gfs2-nmw.patch
>...
> git trees
>...
This patch makes the needlessly global gfs2_change_nlink_i() static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This is for Red Hat bugzilla bug bz #222302:
Moving a virtual IP from node to node between two NFS-over-GFS2
servers was causing one of the GFS2 servers to become confused and
reference a deleted inode. The problem was due to vfs dentries that did
not reference the gfs2_dops and therefore didn't call the gfs2 revalidate
code to revalidate a dentry after a directory had been deleted & recreated.
This patch is a crosswrite from a RHEL4 bug found in GFS1 as
bz #190756 and it is against the latest -nmw git tree.
Signed-off-by: Robert Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Make the dlm_config_info values readable and writeable via configfs
entries.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Add a new dlm_config_info field to enable log_debug output and change
log_debug() to use it.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Add a "ci_" prefix to the fields in the dlm_config_info struct so that we
can use macros to add configfs functions to access them (in a later
patch). No functional changes in this patch, just naming changes.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Some common, non-error messages should use log_debug instead of log_error
so they can be turned off.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Second round of gfs2_rename lock re-ordering to allow Anaconda to add a
root partition on top of gfs2. Prior to this patch, the recursive
lock detector in glock.c could be triggered by an attempt to lock
the rgrp twice. This fixes it by checking to see whether the rgrp
is already locked.
This fixes Red Hat bugzilla #221237
Signed-off-by: S. Wendy Cheng <wcheng@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Update the quilt header comments to match the
code changes.
Change gfs2_lookup_simple to return an error in the case
of a NULL inode.
The callers of gfs2_lookup_simple do not check for NULL
in the no entry case and as such would end up dereferencing a NULL pointer.
This fixes:
http://projects.info-pull.com/mokb/MOKB-15-11-2006.html
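The general pattern of returning an error pointer instead of NULL can be
sketched in user space as follows. This is an illustration only: the
err_ptr()/is_err()/ptr_err() helpers are stand-ins modelled on the
kernel's ERR_PTR()/IS_ERR()/PTR_ERR(), the inode table and function names
are hypothetical, and the choice of -ENOENT is an assumption rather than
something stated in this message.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* User-space stand-ins for the kernel's error pointer helpers. */
void *err_ptr(long error) { return (void *)(intptr_t)error; }
long ptr_err(const void *ptr) { return (long)(intptr_t)ptr; }
int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct inode_sketch { int ino; };	/* hypothetical stand-in */

static struct inode_sketch table[] = { { 1 }, { 2 } };

/* Hypothetical raw lookup: returns NULL when there is no entry. */
struct inode_sketch *raw_lookup(int ino)
{
	for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++)
		if (table[i].ino == ino)
			return &table[i];
	return NULL;
}

/* Wrapper that never returns NULL: "no entry" becomes an error pointer. */
struct inode_sketch *lookup_simple_sketch(int ino)
{
	struct inode_sketch *inode = raw_lookup(ino);

	if (inode == NULL)
		return err_ptr(-ENOENT);
	return inode;
}
```

Callers that already test the result with an is_err() style check then
catch the missing-entry case without needing a separate NULL test.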
Signed-off-by: Russell Cattelan <cattelan@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
In case of unlinked files with dirty pages GFS2 wasn't clearing
the pages in quite the right order. This patch clears the pages
earlier (before the qlock_dq) to avoid the situation that the
release of the glock results in attempting to write back data that
has already been deallocated.
This fixes Red Hat bugzilla: #220117
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
I just noticed this message when testing some other changes I'd made to
lowcomms (to use workqueues) but the problem seems to be in the current
git trees too. I'm amazed no-one has seen it.
BUG: spinlock already unlocked on CPU#1, dlm_recoverd/16868
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
I was a little over-enthusiastic turning schedule() calls into cond_sched() when fixing the DLM for Andrew Morton.
These four should really be calls to schedule() or the dlm can busy-wait.
Signed-Off-By: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Bugzilla 215088
Fix deadlock in gfs2_change_nlink() while installing RHEL5 into GFS2
partition. The gfs2_rename() apparently needs block allocation for the
new name (into the directory) where it requires rg locks. At the same
time, while updating the nlink count for the replaced file,
gfs2_change_nlink() tries to return the inode meta-data back to resource
group where it needs rg locks too. Our logic doesn't allow these locks
to be acquired recursively by the same process (the RHEL installer),
which results in a BUG call. This only happens within rename code path and
only if the destination file exists before the rename operation.
Signed-off-by: S. Wendy Cheng <wcheng@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This is partially derived from a patch written by Russell Cattelan.
It fixes a bug where there is a race between readpages and truncate
by ignoring readpages for stuffed files. This is ok because a stuffed
file will never be more than one block (minus sizeof(struct gfs2_dinode))
in size and block size is always less than page size, so we do not lose
anything efficiency-wise by not doing readahead for stuffed files. They
will have already been "read ahead" by the action of reading the inode
in, in the first place.
This is the remaining part of the fix for Red Hat bugzilla #218966
which had not yet made it upstream.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Russell Cattelan <cattelan@redhat.com>
This patch fixes Red Hat bugzilla #212627 in which a deadlock occurs
due to trying to take the i_mutex while holding a glock. The correct
locking order is defined as i_mutex -> glock in all cases.
I've left dealing with allocating writes. I know that we need to do
that, but for now this should do the trick. We don't need to take the
i_mutex on write, because the VFS has already taken it for us. On read
we don't need it since the glock is enough protection. The reason that
I've made some of the checks into a separate function is that we'll need
to do the checks again in the allocating write case eventually, so this
is partly in preparation for this. Likewise the return value test of !=
1 might look a bit odd and that's because we'll need a third return value
in case of requiring an allocation.
I've made the change to deferred mode on the glock to ensure flushing
read caches on other nodes. I notice that (using blktrace to look at
what's going on) we appear to do a better job of large I/Os than ext3
after this patch (in terms of not splitting up the I/Os).
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Wendy Cheng <wcheng@redhat.com>
Remove the following unused functions:
- lowcomms_send_message()
- lowcomms_max_buffer_size()
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When the dlm fakes an unlock/cancel reply from a failed node using a stub
message struct, it wasn't setting the flags in the stub message. So, in
the process of receiving the fake message the lkb flags would be updated
and cleared from the zero flags in the message. The problem observed in
tests was the loss of the USER flag which caused the dlm to think a user
lock was a kernel lock and subsequently fail an assertion checking the
validity of the ast/callback field.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
LVB's are not sent as part of new requests, but the code receiving the
request was copying data into the lvb anyway. The space in the message
where it mistakenly thought the lvb lived actually contained the resource
name, so it wound up incorrectly copying this name data into the lvb. Fix
is to just create the lvb, not copy junk into it.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The send_args() function is used to copy parameters into a message for a
number of different message types. Only some of those types are set up
beforehand (in create_message) to include space for sending lvb data.
send_args was wrongly copying the lvb for all message types as long as the
lock had an lvb. This means that the lvb data was being written past the
end of the message into unknown space.
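The shape of the fix, copying the lvb only for message types that
reserved room for it, might look roughly like this. It is a sketch only:
the enum values, the function names and the choice of which types carry
an lvb are hypothetical placeholders, not the dlm's real definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical message types; the real dlm has its own set. */
enum msg_type_sketch {
	MSG_REQUEST_SKETCH,
	MSG_CONVERT_SKETCH,
	MSG_UNLOCK_SKETCH
};

/* Assumption: only these types were created with space for an lvb. */
bool msg_has_lvb_space(enum msg_type_sketch type)
{
	switch (type) {
	case MSG_CONVERT_SKETCH:
	case MSG_UNLOCK_SKETCH:
		return true;
	default:
		return false;
	}
}

/* Copy the lvb only when the type has room; returns bytes copied. */
size_t copy_lvb(enum msg_type_sketch type, char *extra, size_t extra_len,
		const char *lvb, size_t lvb_len)
{
	if (!lvb || !msg_has_lvb_space(type) || lvb_len > extra_len)
		return 0;
	memcpy(extra, lvb, lvb_len);
	return lvb_len;
}
```

The explicit length check is an extra safeguard in this sketch; the
essential point is the per-type decision before touching the buffer.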
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Check if we receive a message from another lockspace member running a
version of the dlm with an incompatible inter-node message protocol.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
A reply to a recovery message will often be received after the relevant
recovery sequence has aborted and the next recovery sequence has begun.
We need to ignore replies to these old messages from the previous
recovery. There's already a way to do this for synchronous recovery
requests using the rc_id number, but not for async.
Each recovery sequence already has a locally unique sequence number
associated with it. This patch adds a field to the rcom (recovery
message) structure where this recovery sequence number can be placed,
rc_seq. When a node sends a reply to a recovery request, it copies the
rc_seq number it received into rc_seq_reply. When the first node receives
the reply to its recovery message, it will check whether rc_seq_reply
matches the current recovery sequence number, ls_recover_seq, and if not
then it ignores the old reply.
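The sequence number check can be sketched in a few lines. The field names
rc_seq, rc_seq_reply and ls_recover_seq come from the description above;
the struct layouts and function names are hypothetical simplifications,
not the real dlm structures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cut-down recovery message. */
struct rcom_sketch {
	uint64_t rc_seq;	/* sender's recovery sequence number */
	uint64_t rc_seq_reply;	/* rc_seq copied back from the request */
};

/* Hypothetical cut-down lockspace. */
struct ls_sketch {
	uint64_t ls_recover_seq;
};

/* Requesting node stamps its current recovery sequence number. */
void stamp_request(const struct ls_sketch *ls, struct rcom_sketch *req)
{
	req->rc_seq = ls->ls_recover_seq;
	req->rc_seq_reply = 0;
}

/* Replying node echoes the request's rc_seq in rc_seq_reply. */
void stamp_reply(const struct ls_sketch *replier,
		 const struct rcom_sketch *req, struct rcom_sketch *reply)
{
	reply->rc_seq = replier->ls_recover_seq;
	reply->rc_seq_reply = req->rc_seq;
}

/* A reply is only accepted if it answers the current recovery. */
bool reply_is_current(const struct ls_sketch *ls,
		      const struct rcom_sketch *reply)
{
	return reply->rc_seq_reply == ls->ls_recover_seq;
}
```

If the requesting node starts a new recovery sequence before the reply
arrives, its ls_recover_seq no longer matches and the reply is dropped.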
An old, inadequate approach to filtering out old replies (checking if the
current stage of recovery has moved back to the start) has been removed
from two spots.
The protocol version number is changed to reflect the different rcom
structures.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
There's a chance the new master of a resource hasn't learned it's the new
master before another node sends it a lock during recovery. The node
sending the lock needs to resend if this happens.
- A sends a master lookup for resource R to C
- B sends a master lookup for resource R to C
- C receives A's lookup, assigns A to be master of R and
  sends a reply back to A
- C receives B's lookup and sends a reply back to B saying
  that A is the master
- B receives lookup reply from C and sends its lock for R to A
- A receives lock from B, doesn't think it's the master of R
  and sends an error back to B
- A receives lookup reply from C and becomes master of R
- B gets error back from A and resends its lock back to A
  (this resending is what this patch does)
- A receives lock from B, it now sees it's the master of R
  and takes the lock
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
If an fs has already been shut down, a lockfs callback should do nothing.
An fs that's been shut down can't acquire locks or do anything with
respect to the cluster.
Also, remove FIXME comment in withdraw function. The missing bits of the
withdraw procedure are now all done by user space.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Some HID devices by Apple have both keyboard and mouse interfaces; the
keyboard interface is handled by usbhid, but the mouse (really
touchpad) interface must be handled by the separate 'appletouch'
driver. Using HID_QUIRK_IGNORE will make hiddev ignore both
interfaces, therefore a new quirk flag to ignore only the mouse
interface is required.
Signed-off-by: Soeren Sonnenburg <kernel@nn7.de>
Signed-off-by: Sergey Vlasov <vsu@altlinux.ru>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
CONFIG_INPUT_DEBUG is a non-existent option, so remove anything depending
on it.
Also, as we have the new CONFIG_HID_DEBUG, this should be used in places
where ifdef DEBUG was used before.
Suggested by Adrian Bunk.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
The comment in hid_get_class_descriptor() says a very obvious thing
and is also violating codingstyle. Just remove it.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
The unused hid_find_field_by_usage() function has been commented out for
a pretty long time. Remove it completely.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
hidinput_{open,close}() functions do not belong to usbhid, but
to the generic HID layer. Move them, and fix hooks in struct
hid_device, so that now the callbacks are done to transport-specific
_open() functions, but not input_open() functions.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
hid-debug.h contains a lot of code, and should not therefore
be a header.
This patch moves the code to generic hid layer as .c source, and
introduces CONFIG_HID_DEBUG to conditionally compile it, instead
of playing with #define DEBUG and including hid-debug.h.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Add a force feedback driver for PantherLord USB/PS2 2in1 Adapter,
0810:0001. The device identifies itself as "Twin USB Joystick".
Signed-off-by: Anssi Hannula <anssi.hannula@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Add new quirk HID_QUIRK_SKIP_OUTPUT_REPORTS to skip output reports
when enumerating reports on a hid-input device. Add this quirk and
HID_QUIRK_MULTI_INPUT to 0810:0001.
PantherLord Twin USB Joystick, 0810:0001 has separate input reports
for 2 distinct game controllers in the same interface, so it needs
HID_QUIRK_MULTI_INPUT. However, the device also contains one output
report per controller which is used to control the force feedback
function, and we do not want those to appear as separate input
devices as well. The simplest approach seems to be to add a quirk to
skip output reports on 0810:0001, and allow the force feedback
driver to handle those.
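A quirk like this typically boils down to a bitmask test while walking
the reports. The sketch below only illustrates that idea: the flag values,
enum and function name are hypothetical and do not reflect the actual
definitions in the HID headers.

```c
#include <stdbool.h>

/* Hypothetical flag values, chosen only for this illustration. */
#define QUIRK_MULTI_INPUT_SKETCH		0x1u
#define QUIRK_SKIP_OUTPUT_REPORTS_SKETCH	0x2u

enum report_type_sketch { REPORT_INPUT_SKETCH, REPORT_OUTPUT_SKETCH };

/*
 * Decide whether a report should be registered with the input layer.
 * Output reports are skipped when the skip-output quirk is set, so
 * that a force feedback driver can handle them instead.
 */
bool should_register_report(unsigned int quirks,
			    enum report_type_sketch type)
{
	if (type == REPORT_OUTPUT_SKETCH &&
	    (quirks & QUIRK_SKIP_OUTPUT_REPORTS_SKETCH))
		return false;
	return true;
}
```

A device carrying both quirks still has its input reports registered,
one input device per report, while its output reports are left alone.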
Signed-off-by: Anssi Hannula <anssi.hannula@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Allow hid devices with HID_QUIRK_MULTI_INPUT to have force feedback.
This was previously disabled because there were not any force
feedback drivers for such devices. This will change with my upcoming
patch.
Signed-off-by: Anssi Hannula <anssi.hannula@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Remove prototypes for functions that don't exist.
Signed-off-by: Hoang-Nam Nguyen <hnguyen@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch removes do_mmap() from ehca:
- Call remap_pfn_range() for hardware register block
- Use vm_insert_page() to register memory allocated for completion
  queues and queue pairs
- The actual mmap() call/trigger is now controlled by user space,
  i.e. libehca
Signed-off-by: Hoang-Nam Nguyen <hnguyen@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The iWARP connection manager uses the ib_addr services to do route
resolution (neighbour discovery in the IP world). The ib_addr
netevent callback routine, however, currently only acts on InfiniBand
neighbour updates. It needs to act on ethernet neighbour updates as
well.
This patch just removes filtering on device type altogether and will
trigger on any neighbour updates where the nud_type is valid. This
simplifies the code some.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Make the untyped data region in ib_user_mad have type u64 so that it
gets aligned properly. This avoids alignment faults in ib_umad when
casting the data field to an rmpp_mad and accessing the 64-bit tid
field on architectures like ia64.
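The effect of the element type on alignment can be seen with a small
stand-alone sketch. The header struct here is a hypothetical 12 byte
stand-in, not the real ib_user_mad header; only the contrast between a
byte array and a 64-bit array as the trailing member matters.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header: three 32-bit fields, so 4 byte alignment. */
struct hdr_sketch {
	uint32_t id;
	uint32_t status;
	uint32_t length;
};

/* Byte-typed payload: starts right after the header. */
struct mad_u8 {
	struct hdr_sketch hdr;
	uint8_t data[];
};

/* 64-bit typed payload: the compiler pads so data is suitably aligned. */
struct mad_u64 {
	struct hdr_sketch hdr;
	uint64_t data[];
};

size_t data_offset_u8(void) { return offsetof(struct mad_u8, data); }
size_t data_offset_u64(void) { return offsetof(struct mad_u64, data); }
```

With the byte array, nothing guarantees that data is aligned for a 64-bit
access; with the 64-bit array, both the offset of data and the alignment
of the whole struct follow the alignment of uint64_t.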
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When there is a call to send_tsk_mgmt, SRP posts a send and waits for 5
seconds to get a response.
When the QP is in the error state it is obvious that there will be no
response so it is quite useless to wait. In fact, the timeout causes
SRP to wait a long time to reconnect when a QP error occurs. (Each
abort and each reset_device calls send_tsk_mgmt, which waits for the
timeout). The following patch solves this problem by identifying the
failure and returning an immediate error code.
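The early-return idea can be modelled as below. This is a toy sketch, not
the SRP driver code: the struct, its fields, the function name and the
-1 error value are hypothetical, and the wait is modelled as a number
rather than an actual sleep.

```c
#include <stdbool.h>

#define TSK_MGMT_TIMEOUT_MS 5000	/* the 5 second wait described above */

/* Hypothetical target state for this sketch. */
struct target_sketch {
	bool qp_in_error;	/* QP is known to be in the error state */
	bool response_arrives;	/* whether a response would come back */
};

/*
 * Returns 0 on success and -1 on failure. *waited_ms reports how long
 * the caller would have been blocked waiting for a response.
 */
int send_tsk_mgmt_sketch(const struct target_sketch *t, int *waited_ms)
{
	if (t->qp_in_error) {
		/* No response can arrive, so fail without waiting. */
		*waited_ms = 0;
		return -1;
	}
	if (!t->response_arrives) {
		*waited_ms = TSK_MGMT_TIMEOUT_MS;
		return -1;
	}
	*waited_ms = 0;
	return 0;
}
```

Since each abort and each reset_device goes through this path, skipping
the wait once per call is what shortens the overall reconnect time.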
Signed-off-by: Ishai Rabinovitz <ishai@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
struct ib_wc currently only includes the local QP number: this matches
the IB spec, but seems mostly useless. The following patch replaces
this with the pointer to qp itself, and updates all low level drivers
and all users.
This has the following advantages:
- Ability to get a per-qp context through wc->qp->qp_context
- Existing drivers already have the qp pointer ready in poll cq, so
  this change actually saves a tiny bit (extra memory read) on data path
  (for ehca it would actually be expensive to find the QP pointer when
  polling a CQ, but ehca does not support SRQ so we can leave wc->qp as
  NULL for ehca)
- Users that need the QP number can still get it through wc->qp->qp_num
Use case:
In IPoIB connected mode code, I have a common CQ shared by multiple
QPs. To track connection usage, I need a way to get at some per-QP
context upon the completion, and I would like to avoid allocating
context object per work request just to stick a QP pointer into it.
With this code, I can just use wc->qp->qp_context.
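As a rough illustration of that use case, a completion handler could reach
its per-connection state like this. The structs are hypothetical cut-down
stand-ins that keep only the qp, qp_num and qp_context members named
above; the connection struct and handler are invented for the sketch.

```c
#include <stddef.h>

/* Hypothetical cut-down stand-ins for the verbs structures. */
struct ib_qp_sketch {
	unsigned int qp_num;
	void *qp_context;
};

struct ib_wc_sketch {
	struct ib_qp_sketch *qp;
};

/* Hypothetical per-connection state hung off qp_context. */
struct conn_sketch {
	unsigned long completions;
};

/* Handler for a CQ shared by many QPs: find the connection via the QP. */
void handle_completion(const struct ib_wc_sketch *wc)
{
	struct conn_sketch *conn;

	if (!wc->qp)		/* a driver may leave qp as NULL */
		return;
	conn = wc->qp->qp_context;
	conn->completions++;
}
```

No per work request context object is needed; the QP number remains
available as wc->qp->qp_num for users that want it.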
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
<rdma/ib_verbs.h> uses struct kref, so it should include <linux/kref.h>
explicitly to avoid hidden include dependencies.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Since we actively avoid highmem, calling kmap_atomic() instead
of page_address() is effectively only obfuscation.
Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
Since we actively avoid highmem, calling kmap_atomic() instead
of page_address() is effectively only obfuscation.
Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>