Commit graph

225 commits

Author SHA1 Message Date
Andy Grover
0b088e003c RDS: Use page_remainder_alloc() for recv bufs
Instead of splitting up a page into RDS_FRAG_SIZE chunks
ourselves, ask rds_page_remainder_alloc() to do it. While it
is possible PAGE_SIZE > FRAG_SIZE, on x86en it isn't, so having
duplicate "carve up a page into buffers" code seems excessive.

The other modification this spawns is the use of a single
struct scatterlist in rds_page_frag instead of a bare page ptr.
This causes verbosity to increase in some places, and decrease
in others.

Finally, I decided to unify the lifetimes and alloc/free of
rds_page_frag and its page. This is a nice simplification in itself,
but will be extra-nice once we come to adding cmason's recycling
patch.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:15:20 -07:00
Zach Brown
fc19de38be RDS/IB: disconnect when IB devices are removed
Currently IB device removal destroys connections which are associated with the
device.  This prevents connections from being re-established when replacement
devices are added.

Instead we'll queue shutdown work on the connections as their devices are
removed.  When we see that devices are added we triger connection attempts on
all connections that don't currently have a device.

The result is that RDS sockets can resume device-independent work (bcopy, not
RDMA) across IB device removal and restoration.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
2010-09-08 18:15:19 -07:00
Zach Brown
f3c6808d3d RDS: introduce rds_conn_connect_if_down()
A few paths had the same block of code to queue a connection's connect work if
it was in the right state.  Let's move this in to a helper function.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
2010-09-08 18:15:18 -07:00
Zach Brown
3e0249f9c0 RDS/IB: add refcount tracking to struct rds_ib_device
The RDS IB client .remove callback used to free the rds_ibdev for the given
device unconditionally.  This could race other users of the struct.  This patch
adds refcounting so that we only free the rds_ibdev once all of its users are
done.

Many rds_ibdev users are tied to connections.  We give the connection a
reference and change these users to reference the device in the connection
instead of looking it up in the IB client data.  The only user of the IB client
data remaining is the first lookup of the device as connections are built up.

Incrementing the reference count of a device found in the IB client data could
race with final freeing so we use an RCU grace period to make sure that freeing
won't happen until those lookups are done.

MRs need the rds_ibdev to get at the pool that they're freed in to.  They exist
outside a connection and many MRs can reference different devices from one
socket, so it was natural to have each MR hold a reference.  MR refs can be
dropped from interrupt handlers and final device teardown can block so we push
it off to a work struct.  Pool teardown had to be fixed to cancel its pending
work instead of deadlocking waiting for all queued work, including itself, to
finish.

MRs get their reference from the global device list, which gets a reference.
It is left unprotected by locks and remains racy.  A simple global lock would
be a significant bottleneck.  More scalable (complicated) locking should be
done carefully in a later patch.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
2010-09-08 18:15:17 -07:00
Zach Brown
89bf9d4158 RDS/IB: get the xmit max_sge from the RDS IB device on the connection
rds_ib_xmit_rdma() was calling ib_get_client_data() to get at the rds_ibdevice
just to get the max_sge for the transmit.  This patch instead has it get it
directly off the rds_ibdev which is stored on the connection.

The current code won't free the rds_ibdev until all the IB connections that use
it are freed.  So it's safe to reference the rds_ibdev this way.  In the future
it also makes it easier to support proper reference counting of the rds_ibdev
struct.

As an additional bonus, this gets rid of the performance hit of calling in to
the IB stack to look up the rds_ibdev.  The current implementation in the IB
stack acquires an interrupt blocking spinlock to protect the registration of
client callback data.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
2010-09-08 18:15:16 -07:00
Zach Brown
a46ca94e7f RDS/IB: rds_ib_cm_handle_connect() forgot to unlock c_cm_lock
rds_ib_cm_handle_connect() could return without unlocking the c_conn_lock if
rds_setup_qp() failed.  Rather than adding another imbalanced mutex_unlock() to
this error path we only unlock the mutex once as we exit the function, reducing
the likelyhood of making this same mistake in the future.  We remove the
previous mulitple return sites, leaving one unambigious return path.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
2010-09-08 18:15:15 -07:00
Chris Mason
1cc2228c59 rds: Fix reference counting on the for xmit_atomic and xmit_rdma
This makes sure we have the proper number of references in
rds_ib_xmit_atomic and rds_ib_xmit_rdma.  We also consistently
drop references the same way for all message types as the IOs end.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-09-08 18:15:13 -07:00
Chris Mason
bcf50ef2ce rds: use RCU to protect the connection hash
The connection hash was almost entirely RCU ready, this
just makes the final couple of changes to use RCU instead
of spinlocks for everything.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-09-08 18:15:12 -07:00
Chris Mason
abf454398c RDS: use locking on the connection hash list
rds_conn_destroy really needs locking while it changes the
connection hash.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-09-08 18:15:11 -07:00
Chris Mason
c9e65383a2 rds: Fix RDMA message reference counting
The RDS send_xmit code was trying to get fancy with message
counting and was dropping the final reference on the RDMA messages
too early.  This resulted in memory corruption and oopsen.

The fix here is to always add a ref as the parts of the message passes
through rds_send_xmit, and always drop a ref as the parts of the message
go through completion handling.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-09-08 18:15:10 -07:00
Chris Mason
7e3f2952ee rds: don't let RDS shutdown a connection while senders are present
This is the first in a long line of patches that tries to fix races
between RDS connection shutdown and RDS traffic.

Here we are maintaining a count of active senders to make sure
the connection doesn't go away while they are using it.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-09-08 18:15:09 -07:00
Chris Mason
38a4e5e613 rds: Use RCU for the bind lookup searches
The RDS bind lookups are somewhat expensive in terms of CPU
time and locking overhead.  This commit changes them into a
faster RCU based hash tree instead of the rbtrees they were using
before.

On large NUMA systems it is a significant improvement.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-09-08 18:15:08 -07:00
Andy Grover
e4c52c98e0 RDS/IB: add _to_node() macros for numa and use {k,v}malloc_node()
Allocate send/recv rings in memory that is node-local to the HCA.
This significantly helps performance.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:14:06 -07:00
Andy Grover
4a81802b5e RDS/IB: Remove unused variable in ib_remove_addr()
Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:12:29 -07:00
Chris Mason
764f2dd92f rds: rcu-ize rds_ib_get_device()
rds_ib_get_device is called very often as we turn an
ip address into a corresponding device structure.  It currently
take a global spinlock as it walks different lists to find active
devices.

This commit changes the lists over to RCU, which isn't very complex
because they are not updated very often at all.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-09-08 18:12:28 -07:00
Chris Mason
c83188dcd7 rds: per-rm flush_wait waitq
This removes a global waitqueue used to wait for rds messages
and replaces it with a waitqueue inside the rds_message struct.

The global waitqueue turns into a global lock and significantly
bottlenecks operations on large machines.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-09-08 18:12:27 -07:00
Chris Mason
976673ee1b rds: switch to rwlock on bind_lock
The bind_lock is almost entirely readonly, but it gets
hammered during normal operations and is a major bottleneck.

This commit changes it to an rwlock, which takes it from 80%
of the system time on a big numa machine down to much lower
numbers.

A better fix would involve RCU, which is done in a later commit

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-09-08 18:12:26 -07:00
Andy Grover
ce47f52f42 RDS: Update comments in rds_send_xmit()
Update comments to reflect changes in previous commit.

Keeping as separate commits due to different authorship.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:12:25 -07:00
Chris Mason
9e29db0e36 RDS: Use a generation counter to avoid rds_send_xmit loop
rds_send_xmit is required to loop around after it releases the lock
because someone else could done a trylock, found someone working on the
list and backed off.

But, once we drop our lock, it is possible that someone else does come
in and make progress on the list.  We should detect this and not loop
around if another process is actually working on the list.

This patch adds a generation counter that is bumped every time we
get the lock and do some send work.  If the retry notices someone else
has bumped the generation counter, it does not need to loop around and
continue working.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:12:24 -07:00
Andy Grover
acfcd4d4ec RDS: Get pong working again
Call send_xmit() directly from pong()

Set pongs as op_active

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:12:23 -07:00
Andy Grover
a40aa9233a RDS: Do wait_event_interruptible instead of wait_event
Can't see a reason not to allow signals to interrupt the wait.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:12:22 -07:00
Andy Grover
fcc5450c63 RDS: Remove send_quota from send_xmit()
The purpose of the send quota was really to give fairness
when different connections were all using the same
workq thread to send backlogged msgs -- they could only send
so many before another connection could make progress.

Now that each connection is pushing the backlog from its
completion handler, they are all guaranteed to make progress
and the quota isn't needed any longer.

A thread *will* have to send all previously queued data, as well
as any further msgs placed on the queue while while c_send_lock
was held. In a pathological case a single process can get
roped into doing this for long periods while other threads
get off free. But, since it can only do this until the transport
reports full, this is a bounded scenario.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:12:21 -07:00
Andy Grover
51e2cba8b5 RDS: Move atomic stats from general to ib-specific area
Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:12:20 -07:00
Andy Grover
ab1a6926f5 RDS: rds_message_unmapped() doesn't need to check if queue active
If the queue has nobody on it, then wake_up does nothing.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:12:19 -07:00
Andy Grover
cf4b7389ee RDS: Fix locking in send on m_rs_lock
Do not nest m_rs_lock under c_lock

Disable interrupts in {rdma,atomic}_send_complete

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:12:18 -07:00
Andy Grover
f2ec76f288 RDS: Use NOWAIT in message_map_pages()
Can no longer block, so use NOWAIT.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:12:17 -07:00
Andy Grover
2fa57129df RDS: Bypass workqueue when queueing cong updates
Now that rds_send_xmit() does not block, we can call it directly
instead of going through the helper thread.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:12:16 -07:00
Andy Grover
a7d3a28148 RDS: Call rds_send_xmit() directly from sendmsg()
rds_sendmsg() is calling the send worker function to
send the just-queued datagrams, presumably because it wants
the behavior where anything not sent will re-call the send
worker. We now ensure all queued datagrams are sent by retrying
from the send completion handler, so this isn't needed any more.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:12:15 -07:00
Andy Grover
2ad8099b58 RDS: rds_send_xmit() locking/irq fixes
rds_message_put() cannot be called with irqs off, so move it after
irqs are re-enabled.

Spinlocks throughout the function do not to use _irqsave because
the lock of c_send_lock at top already disabled irqs.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:12:13 -07:00
Andy Grover
049ee3f500 RDS: Change send lock from a mutex to a spinlock
This change allows us to call rds_send_xmit() from a tasklet,
which is crucial to our new operating model.

* Change c_send_lock to a spinlock
* Update stats fields "sem_" to "_lock"
* Remove unneeded rds_conn_is_sending()

About locking between shutdown and send -- send checks if the
connection is up. Shutdown puts the connection into
DISCONNECTING. After this, all threads entering send will exit
immediately. However, a thread could be *in* send_xmit(), so
shutdown acquires the c_send_lock to ensure everyone is out
before proceeding with connection shutdown.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:12:12 -07:00
Andy Grover
f17a1a55fb RDS: Refill recv ring directly from tasklet
Performance is better if we use allocations that don't block
to refill the receive ring. Since the whole reason we were
kicking out to the worker thread was so we could do blocking
allocs, we no longer need to do this.

Remove gfp params from rds_ib_recv_refill(); we always use
GFP_NOWAIT.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:12:11 -07:00
Andy Grover
77dd550e55 RDS: Stop supporting old cong map sending method
We now ask the transport to give us a rm for the congestion
map, and then we handle it normally. Previously, the
transport defined a function that we would call to send
a congestion map.

Convert TCP and loop transports to new cong map method.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:12:10 -07:00
Andy Grover
e32b4a7049 RDS/IB: Do not wait for send ring to be empty on conn shutdown
Now that we are signaling send completions much less, we are likely
to have dirty entries in the send queue when the connection is
shut down (on rmmod, for example.) These are cleaned up a little
further down in conn_shutdown, but if we wait on the ring_empty_wait
for them, it'll never happen, and we hand on unload.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:12:09 -07:00
Andy Grover
ff3d7d3613 RDS: Perform unmapping ops in stages
Previously, RDS would wait until the final send WR had completed
and then handle cleanup. With silent ops, we do not know
if an atomic, rdma, or data op will be last. This patch
handles any of these cases by keeping a pointer to the last
op in the message in m_last_op.

When the TX completion event fires, rds dispatches to per-op-type
cleanup functions, and then does whole-message cleanup, if the
last op equalled m_last_op.

This patch also moves towards having op-specific functions take
the op struct, instead of the overall rm struct.

rds_ib_connection has a pointer to keep track of a a partially-
completed data send operation. This patch changes it from an
rds_message pointer to the narrower rm_data_op pointer, and
modifies places that use this pointer as needed.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:12:08 -07:00
Andy Grover
aa0a4ef4ac RDS: Make sure cmsgs aren't used in improper ways
It hasn't cropped up in the field, but this code ensures it is
impossible to issue operations that pass an rdma cookie (DEST, MAP)
in the same sendmsg call that's actually initiating rdma or atomic
ops.

Disallowing this perverse-but-technically-allowed usage makes silent
RDMA heuristics slightly easier.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:12:07 -07:00
Andy Grover
2c3a5f9abb RDS: Add flag for silent ops. Do atomic op before RDMA
Add a flag to the API so users can indicate they want
silent operations. This is needed because silent ops
cannot be used with USE_ONCE MRs, so we can't just
assume silent.

Also, change send_xmit to do atomic op before rdma op if
both are present, and centralize the hairy logic to determine if
we want to attempt silent, or not.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:12:06 -07:00
Andy Grover
7e3bd65ebf RDS: Move some variables around for consistency
Also, add a comment.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:12:05 -07:00
Andy Grover
940786eb0a RDS: queue failure notifications for dropped atomic ops
When dropping ops in the send queue, we notify the client
of failed rdma ops they asked for notifications on, but not
atomic ops. It should be for both.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:12:04 -07:00
Andy Grover
ee4c7b47e4 RDS: Add a warning if trying to allocate 0 sgs
rds_message_alloc_sgs() only works when nents is nonzero.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:12:03 -07:00
Andy Grover
372cd7dedf RDS: Do not set op_active in r_m_copy_from_user().
Do not allocate sgs for data for 0-length datagrams

Set data.op_active in rds_sendmsg() instead of
rds_message_copy_from_user().

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:12:02 -07:00
Andy Grover
5b2366bd28 RDS: Rewrite rds_send_xmit
Simplify rds_send_xmit().

Send a congestion map (via xmit_cong_map) without
decrementing send_quota.

Move resetting of conn xmit variables to end of loop.

Update comments.

Implement a special case to turn off sending an rds header
when there is an atomic op and no other data.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:12:01 -07:00
Andy Grover
6c7cc6e469 RDS: Rename data op members prefix from m_ to op_
For consistency.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:59 -07:00
Andy Grover
f8b3aaf2ba RDS: Remove struct rds_rdma_op
A big changeset, but it's all pretty dumb.

struct rds_rdma_op was already embedded in struct rm_rdma_op.
Remove rds_rdma_op and put its members in rm_rdma_op. Rename
members with "op_" prefix instead of "r_", for consistency.

Of course this breaks a lot, so fixup the code accordingly.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:58 -07:00
Andy Grover
d0ab25a83c RDS: purge atomic resources too in rds_message_purge()
Add atomic_free_op function, analogous to rdma_free_op,
and call it in rds_message_purge().

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:57 -07:00
Andy Grover
4324879df0 RDS: Inline rdma_prepare into cmsg_rdma_args
cmsg_rdma_args just calls rdma_prepare and does a little
arg checking -- not quite enough to justify its existence.
Plus, it is the only caller of rdma_prepare().

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:56 -07:00
Andy Grover
241eef3e2f RDS: Implement silent atomics
Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:55 -07:00
Andy Grover
d37c935905 RDS: Move loop-only function to loop.c
Also, try to better-document the locking around the
rm and its m_inc in loop.c.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:54 -07:00
Andy Grover
c8de3f1005 RDS/IB: Make all flow control code conditional on i_flowctl
Maybe things worked fine with the flow control code running
even in the non-flow-control case, but making it explicitly
conditional helps the non-fc case be easier to read.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:53 -07:00
Andy Grover
1d34f17571 RDS: Remove unsignaled_bytes sysctl
Removed unsignaled_bytes sysctl and code to signal
based on it. I believe unsignaled_wrs is more than
sufficient for our purposes.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:52 -07:00
Andy Grover
da5a06cef5 RDS: rewrite rds_ib_xmit
Now that the header always goes first, it is possible to
simplify rds_ib_xmit. Instead of having a path to handle 0-byte
dgrams and another path to handle >0, these can both be handled
in one path. This lets us eliminate xmit_populate_wr().

Rename sent to bytes_sent, to differentiate better from other
variable named "send".

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:51 -07:00
Andy Grover
919ced4ce7 RDS/IB: Remove ib_[header/data]_sge() functions
These functions were to cope with differently ordered
sg entries depending on RDS 3.0 or 3.1+. Now that
we've dropped 3.0 compatibility we no longer need them.

Also, modify usage sites for these to refer to sge[0] or [1]
directly. Reorder code to initialize header sgs first.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:50 -07:00
Andy Grover
6f3d05db0d RDS/IB: Remove dead code
Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:49 -07:00
Andy Grover
f147dd9eca RDS/IB: Disallow connections less than RDS 3.1
RDS 3.0 connections (in OFED 1.3 and earlier) put the
header at the end. 3.1 connections put it at the head.
The code has significant added complexity in order to
handle both configurations. In OFED 1.6 we can
drop this and simplify the code by only supporting
"header-first" configuration.

This patch checks the protocol version, and if prior
to 3.1, does not complete the connection.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:48 -07:00
Andy Grover
9c030391e8 RDS/IB: eliminate duplicate code
both atomics and rdmas need to convert ib-specific completion codes
into RDS status codes. Rename rds_ib_rdma_send_complete to
rds_ib_send_complete, and have it take a pointer to the function to
call with the new error code.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:47 -07:00
Andy Grover
809fa148a2 RDS: inc_purge() transport function unused - remove it
Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:46 -07:00
Andy Grover
6200ed7799 RDS: Whitespace
Tidy up some whitespace issues.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:44 -07:00
Andy Grover
d22faec22c RDS: Do not mask address when pinning pages
This does not appear to be necessary.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:43 -07:00
Andy Grover
40589e74f7 RDS: Base init_depth and responder_resources on hw values
Instead of using a constant for initiator_depth and
responder_resources, read the per-QP values when the
device is enumerated, and then use these values when creating
the connection.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:42 -07:00
Andy Grover
15133f6e67 RDS: Implement atomic operations
Implement a CMSG-based interface to do FADD and CSWP ops.

Alter send routines to handle atomic ops.

Add atomic counters to stats.

Add xmit_atomic() to struct rds_transport

Inline rds_ib_send_unmap_rdma into unmap_rm

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:41 -07:00
Andy Grover
a63273d499 RDS: Clear up some confusing code in send_remove_from_sock
The previous code was correct, but made the assumption that
if r_notifier was non-NULL then either r_recverr or r_notify
was true. Valid, but fragile. Changed to explicitly check
r_recverr (shows up in greps for recverr now, too.)

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:40 -07:00
Andy Grover
f4dd96f7b2 RDS: make sure all sgs alloced are initialized
rds_message_alloc_sgs() now returns correctly-initialized
sg lists, so calleds need not do this themselves.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:39 -07:00
Andy Grover
ff87e97a9d RDS: make m_rdma_op a member of rds_message
This eliminates a separate memory alloc, although
it is now necessary to add an "r_active" flag, since
it is no longer to use the m_rdma_op pointer as an
indicator of if an rdma op is present.

rdma SGs allocated from rm sg pool.

rds_rm_size also gets bigger. It's a little inefficient to
run through CMSGs twice, but it makes later steps a lot smoother.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:38 -07:00
Andy Grover
21f79afa5f RDS: fold rdma.h into rds.h
RDMA is now an intrinsic part of RDS, so it's easier to just have
a single header.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:37 -07:00
Andy Grover
fc445084f1 RDS: Explicitly allocate rm in sendmsg()
r_m_copy_from_user used to allocate the rm as well as kernel
buffers for the data, and then copy the data in. Now, sendmsg()
allocates the rm, although the data buffer alloc still happens
in r_m_copy_from_user.

SGs are still allocated with rm, but now r_m_alloc_sgs() is
used to reserve them. This allows multiple SG lists to be
allocated from the one rm -- this is important once we also
want to alloc our rdma sgl from this pool.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:36 -07:00
Andy Grover
3ef13f3c22 RDS: cleanup/fix rds_rdma_unuse
First, it looks to me like the atomic_inc is wrong.
We should be decrementing refcount only once here, no? It's
already being done by the mr_put() at the end.

Second, simplify the logic a bit by bailing early (with a warning)
if !mr.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:35 -07:00
Andy Grover
e779137aa7 RDS: break out rdma and data ops into nested structs in rds_message
Clearly separate rdma-related variables in rm from data-related ones.
This is in anticipation of adding atomic support.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:33 -07:00
Andy Grover
8690bfa17a RDS: cleanup: remove "== NULL"s and "!= NULL"s in ptr comparisons
Favor "if (foo)" style over "if (foo != NULL)".

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:11:32 -07:00
Andy Grover
2dc3935734 RDS: move rds_shutdown_worker impl. to rds_conn_shutdown
This fits better in connection.c, rather than threads.c.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:10:13 -07:00
Andy Grover
9de0864cf5 RDS: Fix locking in send on m_rs_lock
Do not nest m_rs_lock under c_lock

Disable interrupts in {rdma,atomic}_send_complete

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:07:32 -07:00
Andy Grover
7c82eaf00e RDS: Rewrite rds_send_drop_to() for clarity
This function has been the source of numerous bugs; it's just
too complicated. Simplified to nest spinlocks cleanly within
the second loop body, and kick out early if there are no
rms to drop.

This will be a little slower because conn lock is grabbed for
each entry instead of "caching" the lock across rms, but this
should be entirely irrelevant to fastpath performance.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:07:32 -07:00
Tina Yang
35b52c7053 RDS: Fix corrupted rds_mrs
On second look at this bug (OFED #2002), it seems that the
collision is not with the retransmission queue (packet acked
by the peer), but with the local send completion.  A theoretical
sequence of events (from time t0 to t3) is thought to be as
follows,

Thread #1
t0:
    sock_release
    rds_release
    rds_send_drop_to /* wait on send completion */
t2:
    rds_rdma_drop_keys()   /* destroy & free all mrs */

Thread #2
t1:
    rds_ib_send_cq_comp_handler
    rds_ib_send_unmap_rm
    rds_message_unmapped   /* wake up #1 @ t0 */
t3:
    rds_message_put
    rds_message_purge
    rds_mr_put   /* memory corruption detected */

The problem with the rds_rdma_drop_keys() is it could
remove a mr's refcount more than its due (i.e. repeatedly
as long as it still remains in the tree (mr->r_refcount > 0)).
Theoretically it should remove only one reference - reference
by the tree.

        /* Release any MRs associated with this socket */
        while ((node = rb_first(&rs->rs_rdma_keys))) {
                mr = container_of(node, struct rds_mr, r_rb_node);
                if (mr->r_trans == rs->rs_transport)
                        mr->r_invalidate = 0;
                rds_mr_put(mr);
        }

I think the correct way of doing it is to remove the mr from
the tree and rds_destroy_mr it first, then a rds_mr_put()
to decrement its reference count by one.  Whichever thread
holds the last reference will free the mr via rds_mr_put().

Signed-off-by: Tina Yang <tina.yang@oracle.com>
Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:07:31 -07:00
Andy Grover
9e2effba2c RDS: Fix BUG_ONs to not fire when in a tasklet
in_interrupt() is true in softirqs. The BUG_ONs are supposed
to check for if irqs are disabled, so we should use
BUG_ON(irqs_disabled()) instead, duh.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
2010-09-08 18:07:31 -07:00
Eric Dumazet
f037590fff rds: fix a leak of kernel memory
struct rds_rdma_notify contains a 32 bits hole on 64bit arches,
make sure it is zeroed before copying it to user.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-18 23:40:03 -07:00
Julia Lawall
5daf47bb4e net/rds: Add missing mutex_unlock
Add a mutex_unlock missing on the error path.  In each case, whenever the
label out is reached from elsewhere in the function, mutex is not locked.

The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression E1;
@@

* mutex_lock(E1);
  <+... when != E1
  if (...) {
    ... when != E1
*   return ...;
  }
  ...+>
* mutex_unlock(E1);
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Reviewed-by: Zach Brown <zach.brown@oracle.com>
Acked-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-29 00:18:48 -07:00
Joe Perches
ccbd6a5a4f net: Remove unnecessary semicolons after switch statements
Also added an explicit break; to avoid
a fallthrough in net/ipv4/tcp_input.c

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-17 17:44:35 -07:00
David S. Miller
e1703b36c3 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/e100.c
	drivers/net/e1000e/netdev.c
2010-04-27 12:49:13 -07:00
Dan Carpenter
24acc68956 rdma: potential ERR_PTR dereference
In the original code, the "goto out" calls "rdma_destroy_id(cm_id);"
That isn't needed here and would cause problems because "cm_id" is an
ERR_PTR.  The new code just returns directly.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-22 15:57:26 -07:00
Eric Dumazet
aa39514516 net: sk_sleep() helper
Define a new function to return the waitqueue of a "struct sock".

static inline wait_queue_head_t *sk_sleep(struct sock *sk)
{
	return sk->sk_sleep;
}

Change all read occurrences of sk_sleep by a call to this function.

Needed for a future RCU conversion. sk_sleep wont be a field directly
available.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-20 16:37:13 -07:00
David S. Miller
871039f02f Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/stmmac/stmmac_main.c
	drivers/net/wireless/wl12xx/wl1271_cmd.c
	drivers/net/wireless/wl12xx/wl1271_main.c
	drivers/net/wireless/wl12xx/wl1271_spi.c
	net/core/ethtool.c
	net/mac80211/scan.c
2010-04-11 14:53:53 -07:00
Tejun Heo
5a0e3ad6af include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files.  percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed.  Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability.  As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

  http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
  only the necessary includes are there.  ie. if only gfp is used,
  gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
  blocks and try to put the new include such that its order conforms
  to its surrounding.  It's put in the include block which contains
  core kernel includes, in the same order that the rest are ordered -
  alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
  doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
  because the file doesn't have fitting include block), it prints out
  an error message indicating which .h file needs to be added to the
  file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
   over 4000 files, deleting around 700 includes and adding ~480 gfp.h
   and ~3000 slab.h inclusions.  The script emitted errors for ~400
   files.

2. Each error was manually checked.  Some didn't need the inclusion,
   some needed manual addition while adding it to implementation .h or
   embedding .c file was more appropriate for others.  This step added
   inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
   from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
   e.g. lib/decompress_*.c used malloc/free() wrappers around slab
   APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
   editing them as sprinkling gfp.h and slab.h inclusions around .h
   files could easily lead to inclusion dependency hell.  Most gfp.h
   inclusion directives were ignored as stuff from gfp.h was usually
   wildly available and often used in preprocessor macros.  Each
   slab.h inclusion directive was examined and added manually as
   necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
   were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
   distributed build env didn't work with gcov compiles) and a few
   more options had to be turned off depending on archs to make things
   build (like ipr on powerpc/64 which failed due to missing writeq).

   * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
   * powerpc and powerpc64 SMP allmodconfig
   * sparc and sparc64 SMP allmodconfig
   * ia64 SMP allmodconfig
   * s390 SMP allmodconfig
   * alpha SMP allmodconfig
   * um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
   a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-30 22:02:32 +09:00
Dan Carpenter
18062ca947 rds: cleanup: remove unneeded variable
We never use "sk" so this patch removes it.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-24 13:34:09 -07:00
Tina Yang
768bbedf9c RDS: Enable per-cpu workqueue threads
Create per-cpu workqueue threads instead of a single
krdsd thread. This is a step towards better scalability.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-16 21:17:02 -07:00
Andy Grover
561c7df63e RDS: Do not call set_page_dirty() with irqs off
set_page_dirty() unconditionally re-enables interrupts, so
if we call it with irqs off, they will be on after the call,
and that's bad. This patch moves the call after we've re-enabled
interrupts in send_drop_to(), so it's safe.

Also, add BUG_ONs to let us know if we ever do call set_page_dirty
with interrupts off.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-16 21:17:01 -07:00
Sherman Pun
450d06c020 RDS: Properly unmap when getting a remote access error
If the RDMA op has aborted with a remote access error,
in addition to what we already do (tell userspace it has
completed with an error) also unmap it and put() the rm.

Otherwise, hangs may occur on arches that track maps and
will not exit without proper cleanup.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-16 21:17:00 -07:00
Andy Grover
b98ba52f96 RDS: only put sockets that have seen congestion on the poll_waitq
rds_poll_waitq's listeners will be awoken if we receive a congestion
notification. Bad performance may result because *all* polled sockets
contend for this single lock. However, it should not be necessary to
wake pollers when a congestion update arrives if they have never
experienced congestion, and not putting these on the waitq will
hopefully greatly reduce contention.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-16 21:16:59 -07:00
Tina Yang
550a8002e4 RDS: Fix locking in rds_send_drop_to()
It seems rds_send_drop_to() called
__rds_rdma_send_complete(rs, rm, RDS_RDMA_CANCELED)
with only rds_sock lock, but not rds_message lock. It raced with
other threads that is attempting to modify the rds_message as well,
such as from within rds_rdma_send_complete().

Signed-off-by: Tina Yang <tina.yang@oracle.com>
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-16 21:16:58 -07:00
Andy Grover
97069788d6 RDS: Turn down alarming reconnect messages
RDS's error messages when a connection goes down are a little
extreme. A connection may go down, and it will be re-established,
and everything is fine. This patch links these messages through
rdsdebug(), instead of to printk directly.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-16 21:16:57 -07:00
Andy Grover
571c02fa81 RDS: Workaround for in-use MRs on close causing crash
if a machine is shut down without closing sockets properly, and
freeing all MRs, then a BUG_ON will bring it down. This patch
changes these to WARN_ONs -- leaking MRs is not fatal (although
not ideal, and there is more work to do here for a proper fix.)

Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-16 21:16:56 -07:00
Tina Yang
048c15e641 RDS: Fix send locking issue
Fix a deadlock between rds_rdma_send_complete() and
rds_send_remove_from_sock() when rds socket lock and
rds message lock are acquired out-of-order.

Signed-off-by: Tina Yang <Tina.Yang@oracle.com>
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-16 21:16:55 -07:00
Andy Grover
2e7b3b9945 RDS: Fix congestion issues for loopback
We have two kinds of loopback: software (via loop transport)
and hardware (via IB). sw is used for 127.0.0.1, and doesn't
support rdma ops. hw is used for sends to local device IPs,
and supports rdma. Both are used in different cases.

For both of these, when there is a congestion map update, we
want to call rds_cong_map_updated() but not actually send
anything -- since loopback local and foreign congestion maps
point to the same spot, they're already in sync.

The old code never called sw loop's xmit_cong_map(),so
rds_cong_map_updated() wasn't being called for it. sw loop
ports would not work right with the congestion monitor.

Fixing that meant that hw loopback now would send congestion maps
to itself. This is also undesirable (racy), so we check for this
case in the ib-specific xmit code.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-16 21:16:55 -07:00
Andy Grover
8e82376e5f RDS/TCP: Wait to wake thread when write space available
Instead of waking the send thread whenever any send space is available,
wait until it is at least half empty. This is modeled on how
sock_def_write_space() does it, and may help to minimize context
switches.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-16 21:16:55 -07:00
Andy Grover
b075cfdb66 RDS: update copy_to_user state in tcp transport
Other transports use rds_page_copy_user, which updates our
s_copy_to_user counter. TCP doesn't, so it needs to explicity
call rds_stats_add().

Reported-by: Richard Frank <richard.frank@oracle.com>
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-16 21:16:54 -07:00
Andy Grover
1123fd734d RDS: sendmsg() should check sndtimeo, not rcvtimeo
Most likely cut n paste error - sendmsg() was checking sock_rcvtimeo.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-16 21:16:54 -07:00
Andy Grover
735f61e626 RDS: Do not BUG() on error returned from ib_post_send
BUGging on a runtime error code should be avoided. This
patch also eliminates all other BUG()s that have no real
reason to exist.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-16 21:16:53 -07:00
Joe Perches
6884b348ed net/rds: remove uses of NIPQUAD, use %pI4
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Andy Grover <andy.grover@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-03 20:16:48 -08:00
Linus Torvalds
e69381b417 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (45 commits)
  RDMA/cxgb3: Fix error paths in post_send and post_recv
  RDMA/nes: Fix stale ARP issue
  RDMA/nes: FIN during MPA startup causes timeout
  RDMA/nes: Free kmap() resources
  RDMA/nes: Check for zero STag
  RDMA/nes: Fix Xansation test crash on cm_node ref_count
  RDMA/nes: Abnormal listener exit causes loopback node crash
  RDMA/nes: Fix crash in nes_accept()
  RDMA/nes: Resource not freed for REJECTed connections
  RDMA/nes: MPA request/response error checking
  RDMA/nes: Fix query of ORD values
  RDMA/nes: Fix MAX_CM_BUFFER define
  RDMA/nes: Pass correct size to ioremap_nocache()
  RDMA/nes: Update copyright and branding string
  RDMA/nes: Add max_cqe check to nes_create_cq()
  RDMA/nes: Clean up struct nes_qp
  RDMA/nes: Implement IB_SIGNAL_ALL_WR as an iWARP extension
  RDMA/nes: Add additional SFP+ PHY uC status check and PHY reset
  RDMA/nes: Correct fast memory registration implementation
  IB/ehca: Fix error paths in post_send and post_recv
  ...
2009-12-16 10:32:31 -08:00
Linus Torvalds
d7fc02c7ba Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1815 commits)
  mac80211: fix reorder buffer release
  iwmc3200wifi: Enable wimax core through module parameter
  iwmc3200wifi: Add wifi-wimax coexistence mode as a module parameter
  iwmc3200wifi: Coex table command does not expect a response
  iwmc3200wifi: Update wiwi priority table
  iwlwifi: driver version track kernel version
  iwlwifi: indicate uCode type when fail dump error/event log
  iwl3945: remove duplicated event logging code
  b43: fix two warnings
  ipw2100: fix rebooting hang with driver loaded
  cfg80211: indent regulatory messages with spaces
  iwmc3200wifi: fix NULL pointer dereference in pmkid update
  mac80211: Fix TX status reporting for injected data frames
  ath9k: enable 2GHz band only if the device supports it
  airo: Fix integer overflow warning
  rt2x00: Fix padding bug on L2PAD devices.
  WE: Fix set events not propagated
  b43legacy: avoid PPC fault during resume
  b43: avoid PPC fault during resume
  tcp: fix a timewait refcnt race
  ...

Fix up conflicts due to sysctl cleanups (dead sysctl_check code and
CTL_UNNUMBERED removed) in
	kernel/sysctl_check.c
	net/ipv4/sysctl_net_ipv4.c
	net/ipv6/addrconf.c
	net/sctp/sysctl.c
2009-12-08 07:55:01 -08:00
Joe Perches
f64f9e7192 net: Move && and || to end of previous line
Not including net/atm/

Compiled tested x86 allyesconfig only
Added a > 80 column line or two, which I ignored.
Existing checkpatch plaints willfully, cheerfully ignored.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-29 16:55:45 -08:00
Sean Hefty
6f8372b69c RDMA/cm: fix loopback address support
The RDMA CM is intended to support the use of a loopback address
when establishing a connection; however, the behavior of the CM
when loopback addresses are used is confusing and does not always
work, depending on whether loopback was specified by the server,
the client, or both.

The defined behavior of rdma_bind_addr is to associate an RDMA
device with an rdma_cm_id, as long as the user specified a non-
zero address.  (ie they weren't just trying to reserve a port)
Currently, if the loopback address is passed to rdam_bind_addr,
no device is associated with the rdma_cm_id.  Fix this.

If a loopback address is specified by the client as the destination
address for a connection, it will fail to establish a connection.
This is true even if the server is listing across all addresses or
on the loopback address itself.  The issue is that the server tries
to translate the IP address carried in the REQ message to a local
net_device address, which fails.  The translation is not needed in
this case, since the REQ carries the actual HW address that should
be used.

Finally, cleanup loopback support to be more transport neutral.
Replace separate calls to get/set the sgid and dgid from the
device address to a single call that behaves correctly depending
on the format of the device address.  And support both IPv4 and
IPv6 address formats.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>

[ Fixed RDS build by s/ib_addr_get/rdma_addr_get/  - Roland ]

Signed-off-by: Roland Dreier <rolandd@cisco.com>
2009-11-19 13:26:06 -08:00
Eric W. Biederman
6d4561110a sysctl: Drop & in front of every proc_handler.
For consistency drop & in front of every proc_handler.  Explicity
taking the address is unnecessary and it prevents optimizations
like stubbing the proc_handlers to NULL.

Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2009-11-18 08:37:40 -08:00