Commit graph

86 commits

Author SHA1 Message Date
Roland Dreier
efcd99717f IPoIB/cm: Factor out ipoib_cm_free_rx_reap_list()
Factor out the code for going through the rx_reap list of struct
ipoib_cm_rx and freeing each one.  This consolidates the code
duplicated between ipoib_cm_dev_stop() and ipoib_cm_rx_reap() and
reduces the risk of error when adding additional accounting.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-01-25 14:15:24 -08:00
Roland Dreier
7b3687df66 IPoIB/cm: Factor out ipoib_cm_create_srq()
Factor out the code to create an SRQ and allocate the receive ring in
ipoib_cm_dev_init() into a new function ipoib_cm_create_srq().  This
will make the code neater when support for devices that don't implement
SRQs is added.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-01-25 14:15:24 -08:00
Roland Dreier
1efb61444c IPoIB/cm: Factor out ipoib_cm_free_rx_ring()
Factor out the code to unmap/free skbs and free the receive ring in
ipoib_cm_dev_cleanup() into a new function ipoib_cm_free_rx_ring().
This function will be called from a couple of other places when
support for devices that don't implement SRQs is added.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-01-25 14:15:24 -08:00
Roland Dreier
2337f80941 IPoIB: Trivial formatting cleanups
Fix whitespace blunders, convert "foo* bar" to "foo *bar", etc.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
2008-01-25 14:15:23 -08:00
Roland Dreier
09f60f8f54 IPoIB/cm: Fix receive QP cleanup
Commit 1b524963 ("IPoIB/cm: Use common CQ for CM send completions")
changed how the high-order bits of work request IDs were used, which
had the effect that IPOIB_CM_RX_DRAIN_WRID was no longer handled as a
connected mode receive completion.  This leads to the messages

    ib1: cm send completion event with wrid 1073741823 (> 64)
    ib1: RX drain timing out

when an interface with connected mode QPs is brought down.  Fix this
by making sure that both IPOIB_OP_CM and IPOIB_OP_RECV are set in
IPOIB_CM_RX_DRAIN_WRID.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-10-26 13:44:25 -07:00
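The failure described in 09f60f8f54 can be reproduced in a small user-space sketch. The bit positions (IPOIB_OP_RECV as bit 31, IPOIB_OP_CM as bit 30) and the two drain values are assumptions, chosen because they reproduce the wrid 1073741823 seen in the log message; the function names are hypothetical and are not the driver's own.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed flag layout, picked so the numbers match the log message. */
#define OP_RECV (UINT32_C(1) << 31)
#define OP_CM   (UINT32_C(1) << 30)

#define DRAIN_WRID_OLD UINT32_C(0x7fffffff) /* assumed: OP_CM set, OP_RECV clear */
#define DRAIN_WRID_NEW UINT32_C(0xffffffff) /* assumed: OP_CM and OP_RECV both set */

enum completion_kind { UD_SEND, UD_RECV, CM_SEND, CM_RECV };

/* Hypothetical dispatcher: classify a completion by its high-order bits. */
enum completion_kind classify(uint32_t wr_id)
{
    if (wr_id & OP_CM)
        return (wr_id & OP_RECV) ? CM_RECV : CM_SEND;
    return (wr_id & OP_RECV) ? UD_RECV : UD_SEND;
}

/* Index a CM send handler would see once the CM flag is masked off. */
uint32_t cm_send_index(uint32_t wr_id)
{
    return wr_id & ~OP_CM;
}
```

Under these assumptions the old drain value is classified as a CM send completion with index 1073741823, while the new value is classified as a CM receive completion.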
Michael S. Tsirkin
1b524963fd IPoIB/cm: Use common CQ for CM send completions
Use the same CQ for CM send completions as for all other IPoIB
completions.  This means all completions are processed via the same
NAPI polling routine.  This should help reduce the number of
interrupts for bi-directional traffic (such as TCP) and fixes "driver
is hogging interrupts" errors reported for IPoIB send side, e.g.
<https://bugs.openfabrics.org/show_bug.cgi?id=508>

To do this, keep a per-interface counter of outstanding send WRs, and
stop the interface when this counter reaches the send queue size to
avoid CQ overruns.

Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-10-19 21:39:34 -07:00
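The accounting described in 1b524963fd can be sketched as follows. This is a user-space illustration only: the struct, the function names and the queue depth are hypothetical, and the restart policy (wake on any completion) is an assumption that the commit message does not specify.

```c
#include <assert.h>
#include <stdbool.h>

#define SENDQ_SIZE 4 /* hypothetical send queue depth */

struct iface {
    unsigned int tx_outstanding; /* send WRs posted but not yet completed */
    bool queue_stopped;
};

/* Post one send WR; returns false if the interface is currently stopped. */
bool post_send(struct iface *ifc)
{
    if (ifc->queue_stopped)
        return false;
    if (++ifc->tx_outstanding == SENDQ_SIZE)
        ifc->queue_stopped = true; /* counter reached the send queue size */
    return true;
}

/* Handle one send completion; assumed policy: restart on any completion. */
void send_done(struct iface *ifc)
{
    --ifc->tx_outstanding;
    ifc->queue_stopped = false;
}
```

Because the number of outstanding send WRs never exceeds SENDQ_SIZE, the number of send completions that can be pending on the shared CQ is bounded as well.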
Roland Dreier
fd312561ad IPoIB: Rewrite "if (!likely(...))" as "if (unlikely(!(...)))"
It's too hard to figure out what "!likely(...)" really means, and who
knows how compilers interpret the hint.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-10-17 21:54:44 -07:00
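The two spellings from fd312561ad are logically equivalent and differ only in the hint given to the compiler. A minimal user-space sketch, assuming GCC or Clang for __builtin_expect; the two function names are hypothetical.

```c
#include <assert.h>

/* User-space stand-ins for the kernel's branch-prediction hints. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Old spelling: hint that ok is usually true, then negate the result. */
int rejects_old(int ok)
{
    if (!likely(ok))
        return 1;
    return 0;
}

/* New spelling: the hint applies directly to the branch being tested. */
int rejects_new(int ok)
{
    if (unlikely(!(ok)))
        return 1;
    return 0;
}
```

Both functions return the same value for every input; the second form states outright that the rejecting branch is the rare one.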
Linus Torvalds
ce9d3c9a6a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (87 commits)
  mlx4_core: Fix section mismatches
  IPoIB: Allow setting policy to ignore multicast groups
  IB/mthca: Mark error paths as unlikely() in post_srq_recv functions
  IB/ipath: Minor fix to ordering of freeing and zeroing of tid pages.
  IB/ipath: Remove redundant link state checks
  IB/ipath: Fix IB_EVENT_PORT_ERR event
  IB/ipath: Better handling of unexpected GPIO interrupts
  IB/ipath: Maintain active time on all chips
  IB/ipath: Fix QHT7040 serial number check
  IB/ipath: Indicate a couple of chip bugs to userspace
  IB/ipath: iba6110 rev4 no longer needs recv header overrun workaround
  IB/ipath: Use counters in ipath_poll and cleanup interrupts in ipath_close
  IB/ipath: Remove duplicate copy of LMC
  IB/ipath: Add ability to set the LMC via the sysfs debugging interface
  IB/ipath: Optimize completion queue entry insertion and polling
  IB/ipath: Implement IB_EVENT_QP_LAST_WQE_REACHED
  IB/ipath: Generate flush CQE when QP is in error state
  IB/ipath: Remove redundant code
  IB/ipath: Future proof eeprom checksum code (contents reading)
  IB/ipath: UC RDMA WRITE with IMMEDIATE doesn't send the immediate
  ...
2007-10-11 19:43:13 -07:00
Roland Dreier
de90351219 [IPoIB]: Convert to netdevice internal stats
Use the stats member of struct netdevice in IPoIB, so we can save
memory by deleting the stats member of struct ipoib_dev_priv, and save
code by deleting ipoib_get_stats().

Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:53:41 -07:00
Dotan Barak
ede6bc04f3 IPoIB/cm: Clean up initialization of QP attr in ipoib_cm_create_tx_qp()
Make the way the QP is created in ipoib_cm_create_tx_qp()
consistent with ipoib_cm_create_rx_qp().

Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-10-09 19:59:18 -07:00
Sean Hefty
1d84612649 IB/cm: Include HCA ACK delay in local ACK timeout
The IB CM should include the HCA ACK delay when calculating the local
ACK timeout value to use for RC QPs.  If the HCA ACK delay is large
enough relative to the packet life time, then if it is not taken into
account, the calculated timeout value ends up being too small, which
can result in "retry exceeded" errors.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-07-10 21:50:05 -07:00
Roland Dreier
20089ca557 IPoIB/cm: Fix warning if IPV6 is not enabled
Fix

    drivers/infiniband/ulp/ipoib/ipoib_cm.c:1151: warning: unused variable 'dev'

by getting rid of the variable dev, which is only used if CONFIG_IPV6
is enabled, and replacing the one use of it with the value it is
assigned, namely priv->dev.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-07-10 11:18:34 -07:00
Ralph Campbell
841adfca9c IPoIB/cm: Partial error clean up unmaps wrong address
If a page can't be allocated for the frag list of a skb, the code to
unmap the partially allocated list is off by one.  For example, if
'frags' equals one, i == 0, and the alloc_page() fails, then the old
loop would have unmapped mapping[1] which is uninitialized.  The same
would happen if the call to ib_dma_map_page() failed.

Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Acked-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-07-02 20:48:31 -07:00
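The off-by-one in 841adfca9c can be illustrated with a user-space simulation. Slot 0 stands for the mapping of the linear head and slot i + 1 for page fragment i. The struct, the function names and the exact shape of the old unwind loop are assumptions made to reproduce the behaviour the message describes; they are not copied from the driver.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_FRAGS 4

struct rx_buf {
    bool mapped[MAX_FRAGS + 1]; /* slot 0: head, slot i + 1: fragment i */
    int bad_unmaps;             /* unmaps of slots that were never mapped */
};

static void unmap_slot(struct rx_buf *b, int slot)
{
    if (b->mapped[slot])
        b->mapped[slot] = false;
    else
        b->bad_unmaps++;
}

/*
 * Map the head and up to frags fragments (frags <= MAX_FRAGS). Fragment
 * fail_at fails to allocate; pass -1 for no failure. Returns 0 on
 * success and -1 after unwinding.
 */
int map_rx_buf(struct rx_buf *b, int frags, int fail_at, bool old_unwind)
{
    int i;

    b->mapped[0] = true;
    for (i = 0; i < frags; i++) {
        if (i == fail_at)
            goto partial_error;
        b->mapped[i + 1] = true;
    }
    return 0;

partial_error:
    unmap_slot(b, 0);
    if (old_unwind) {
        /* Assumed old loop: starts at slot i + 1, which was never mapped. */
        for (; i >= 0; --i)
            unmap_slot(b, i + 1);
    } else {
        /* Fragments 0 .. i - 1 succeeded, so unmap slots i down to 1. */
        for (; i > 0; --i)
            unmap_slot(b, i);
    }
    return -1;
}
```

With frags equal to one and a failure at i == 0, the old-style unwind touches slot 1 although it was never mapped, while the corrected loop touches only slot 0.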
Roland Dreier
13ef5f44c3 IPoIB/cm: Remove dead definition of struct ipoib_cm_id
It's completely unused.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-06-21 13:39:08 -07:00
Michael S. Tsirkin
82c3aca6ad IPoIB/cm: Fix interoperability when MTU doesn't match
IPoIB connected mode currently rejects a connection request unless the
supported MTU is >= the local netdevice MTU. This breaks
interoperability with implementations that might have tweaked
IPOIB_CM_MTU, and there's really no longer a reason to do so: this test
is just a leftover from when we did not tweak MTU per-connection.  Fix
this by making the test as permissive as possible.

Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-06-21 13:38:08 -07:00
Michael S. Tsirkin
3ec7393a68 IPoIB/cm: Initialize RX before moving QP to RTR
Fix a crasher bug in IPoIB CM: once a QP is in the RTR state, a
receive completion (or even an asynchronous error) might be observed
on this QP, so we have to initialize all of our receive data
structures before moving to the RTR state.

As an optimization (since modify_qp might take a long time), the
jiffies update done when moving RX to the passive_ids list is also
left in place to reduce the chance of the RX being misdetected as
stale.

This fixes bug <https://bugs.openfabrics.org/show_bug.cgi?id=662>.

Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-06-21 13:03:50 -07:00
Michael S. Tsirkin
ec56dc0b7f IPoIB/cm: Fix performance regression on Mellanox
commit 518b1646 ("IPoIB/cm: Fix SRQ WR leak") introduced a severe
performance regression on Mellanox cards, because keeping a QP in the
error state for extended periods of time moves hardware to the slow
path (until the QP is destroyed).  For example, MPI latency goes from
~3 usecs to ~7 usecs.

Fix this by posting a send WR on one of the QPs that are being
flushed, instead of using a separate drain QP that is kept in the
error state.

This fixes bug <https://bugs.openfabrics.org/show_bug.cgi?id=636>,
reported and bisected by Scott Weitzenkamp at Cisco and debugged by
Sasha Mikheev at Voltaire.

Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-05-29 16:07:09 -07:00
Michael S. Tsirkin
2dfbfc3712 IPoIB/cm: Drain cq in ipoib_cm_dev_stop()
Since NAPI polling is disabled while ipoib_cm_dev_stop() is running,
ipoib_cm_dev_stop() must poll the CQ itself in order to see the
packets draining.

Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-05-24 14:02:40 -07:00
Michael S. Tsirkin
8fd357a6e3 IPoIB/cm: Fix timeout check in ipoib_cm_dev_stop()
time_after() was used backwards, so the timeout occurred immediately.

Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-05-24 14:02:39 -07:00
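The argument-order mistake fixed in 8fd357a6e3 is easy to make because time_after(a, b) is true when a is later than b. A user-space sketch, with time_after() re-implemented as a function using the kernel's signed-difference technique; timed_out() and timed_out_backwards() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* True when a is later than b, even if the tick counter has wrapped. */
static bool time_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/* Correct: the timeout has occurred once "now" is after the deadline. */
bool timed_out(unsigned long now, unsigned long deadline)
{
    return time_after(now, deadline);
}

/* Arguments swapped: reports a timeout while the deadline is still ahead. */
bool timed_out_backwards(unsigned long now, unsigned long deadline)
{
    return time_after(deadline, now);
}
```

With the arguments swapped, the check fires on the first pass through the wait loop, which matches the "timeout occurred immediately" symptom.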
Michael S. Tsirkin
518b1646f8 IPoIB/cm: Fix SRQ WR leak
SRQ WR leakage has been observed with IPoIB/CM: e.g. flipping ports on
and off will, with time, leak out all WRs and then all connections
will start getting RNR NAKs.  Fix this in the way suggested by the spec:
move the QP being destroyed to the error state, wait for "Last WQE
Reached" event and then post WR on a "drain QP" connected to the same
CQ.  Once we observe a completion on the drain QP, it's safe to call
ib_destroy_qp.

Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-05-21 13:35:40 -07:00
Michael S. Tsirkin
7c5b9ef857 IPoIB/cm: Optimize stale connection detection
In the presence of some running RX connections, we repeat
queue_delayed_work calls each 4 RX WRs, which is a waste.  It's enough
to start the stale task when the first passive connection is added, and
rerun it every IPOIB_CM_RX_DELAY as long as there are outstanding
passive connections.

This removes some code from RX data path.

Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-05-14 14:11:01 -07:00
Roland Dreier
8d1cc86a62 IPoIB: Convert to NAPI
Convert the IP-over-InfiniBand network device driver over to using
NAPI to handle completions for the main CQ.  This covers all receives
as well as datagram mode sends; send completions for connected mode
connections are still handled from interrupt context.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-05-06 21:18:11 -07:00
Michael S. Tsirkin
f4fd0b224d IB: Add CQ comp_vector support
Add a num_comp_vectors member to struct ib_device and extend
ib_create_cq() to pass in a comp_vector parameter -- this parallels
the userspace libibverbs API.  Update all hardware drivers to set
num_comp_vectors to 1 and have all ULPs pass 0 for the comp_vector
value.  Pass the value of num_comp_vectors to userspace rather than
hard-coding a value of 1.

We want multiple CQ event vector support (via MSI-X or similar for
adapters that can generate multiple interrupts), but it's not clear
how many vectors we want, or how we want to deal with policy issues
such as how to decide which vector to use or how to set up interrupt
affinity.  This patch is useful for experimenting, since no core
changes will be necessary when updating a driver to support multiple
vectors, and we know that we want to make at least these changes
anyway.

Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-05-06 21:18:11 -07:00
Michael S. Tsirkin
d6ef7d68f6 IPoIB/cm: Don't crash if remote side uses one QP for both directions
The IPoIB CM spec allows the use of a single connection in both
active->passive and passive->active directions.  The current Linux
code uses one connection for each direction, but if another node only
uses one connection for both directions, we oops when we try to look
up the passive connection.  Fix by checking that qp_context is
non-NULL before dereferencing it.

Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
2007-05-06 21:18:11 -07:00
Michael S. Tsirkin
347fcfbed2 IPoIB/cm: Fix error handling in ipoib_cm_dev_open()
If skb allocation fails when we start the device, we call
ipoib_cm_dev_stop() even though ipoib_cm_dev_open() did not run to
completion, so we pass an invalid pointer to ib_destroy_cm_id and get
an oops.

Fix by clearing cm.id on error, and testing it in cm_dev_stop().
This fixes <https://bugs.openfabrics.org/show_bug.cgi?id=561>

Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-04-30 17:30:28 -07:00
Linus Torvalds
afc2e82c08 Merge branch 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband
* 'for-linus' of master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband: (49 commits)
  IB: Set class_dev->dev in core for nice device symlink
  IB/ehca: Implement modify_port
  IB/umad: Clarify documentation of transaction ID
  IPoIB/cm: spin_lock_irqsave() -> spin_lock_irq() replacements
  IB/mad: Change SMI to use enums rather than magic return codes
  IB/umad: Implement GRH handling for sent/received MADs
  IB/ipoib: Use ib_init_ah_from_path to initialize ah_attr
  IB/sa: Set src_path_bits correctly in ib_init_ah_from_path()
  IB/ucm: Simplify ib_ucm_event()
  RDMA/ucma: Simplify ucma_get_event()
  IB/mthca: Simplify CQ cleaning in mthca_free_qp()
  IB/mthca: Fix mthca_write_mtt() on HCAs with hidden memory
  IB/mthca: Update HCA firmware revisions
  IB/ipath: Fix WC format drift between user and kernel space
  IB/ipath: Check that a UD work request's address handle is valid
  IB/ipath: Remove duplicate stuff from ipath_verbs.h
  IB/ipath: Check reserved memory keys
  IB/ipath: Fix unit selection when all CPU affinity bits set
  IB/ipath: Don't allow QPs 0 and 1 to be opened multiple times
  IB/ipath: Disable IB link earlier in shutdown sequence
  ...
2007-04-27 09:39:27 -07:00
Arnaldo Carvalho de Melo
459a98ed88 [SK_BUFF]: Introduce skb_reset_mac_header(skb)
For the common, open coded 'skb->mac.raw = skb->data' operation, so that we can
later turn skb->mac.raw into an offset, reducing the size of struct sk_buff in
64bit land while possibly keeping it as a pointer on 32bit.

This one touches just the most simple case, next will handle the slightly more
"complex" cases.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:32 -07:00
Roland Dreier
37aebbde70 IPoIB/cm: spin_lock_irqsave() -> spin_lock_irq() replacements
There are quite a few places in ipoib_cm.c where we know IRQs are
enabled because we do something that sleeps in the same function, so
we can convert several occurrences of spin_lock_irqsave() to a plain
spin_lock_irq().  This cleans up the source a little and makes the
code smaller too:

add/remove: 0/0 grow/shrink: 1/5 up/down: 3/-51 (-48)
function                                     old     new   delta
ipoib_cm_tx_reap                             403     406      +3
ipoib_cm_stale_task                          146     145      -1
ipoib_cm_dev_stop                            173     172      -1
ipoib_cm_tx_handler                          964     956      -8
ipoib_cm_rx_handler                          956     937     -19
ipoib_cm_skb_reap                            212     190     -22

Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-04-24 21:30:37 -07:00
Roland Dreier
a89875fc7e IPoIB: Remove pointless opcode field from debugging output
There's no point in printing the opcode field in the completion
handling debugging output, since the type of completion is already
printed at the beginning of the line.  In fact the opcode field is not
even defined for completions with a status other than success.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-04-18 20:20:53 -07:00
Michael S. Tsirkin
6371ea3d48 IPoIB/cm: Fix DMA direction typo
Receive buffers need to be mapped with DMA_FROM_DEVICE.  Incorrectly
mapping with DMA_TO_DEVICE causes a hard lock on ppc64 machines with
an IOMMU.

This fixes <https://bugs.openfabrics.org/show_bug.cgi?id=431>

Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-04-10 08:58:30 -07:00
Michael S. Tsirkin
77d8e1efea IB/ipoib: Fix thinko in packet length checks
The packet length checks in ipoib are broken: we add 4 bytes (IPoIB
encapsulation header) when sending a packet, not 20 bytes (hardware
address length) to each packet.  Therefore, if connected mode is
enabled so that the interface MTU is larger than the multicast MTU,
IPoIB may end up trying to send too-long multicast packets.  For
example, multicast is broken if a message of size 2048 bytes is sent
on an interface with UD MTU 2048, because 2048 is bigger than the real
limit of 2044 but the code tests against the wrong limit of 2060.

This patch fixes <https://bugs.openfabrics.org/show_bug.cgi?id=418>,
submitted by Scott Weitzenkamp <sweitzen@cisco.com>.

Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-03-22 14:40:16 -07:00
Michael S. Tsirkin
60a596dab7 IPoIB/cm: Fix reaping of stale connections
The sense of the time_after_eq() test in ipoib_cm_stale_task() is
reversed so that only non-stale connections are reaped.  Fix this by
changing to time_before_eq().

Noticed by Pradeep Satyanarayana <pradeep@us.ibm.com>.

Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-03-22 14:32:09 -07:00
Michael S. Tsirkin
1812063ba3 IPoIB/cm: Improve small message bandwidth
Avoid the overhead of freeing/reallocating and mapping/unmapping for
DMA pages that have not been written to by hardware.

Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-02-20 20:16:14 -08:00
Michael S. Tsirkin
8a2e65f87c IPoIB: CM error handling thinko fix
ipoib_cm_alloc_rx_skb() might be called from IRQ context, so it must
use dev_kfree_skb_any(), not kfree_skb().

Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-02-16 13:57:35 -08:00
Roland Dreier
551fd6122d IPoIB: Only allow root to change between datagram and connected mode
Change the permissions of the "mode" sysfs attribute to be S_IWUSR
instead of S_IWUGO.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-02-16 13:57:33 -08:00
Michael S. Tsirkin
839fcaba35 IPoIB: Connected mode experimental support
The following patch adds experimental support for IPoIB connected
mode, as defined by the draft from the IETF ipoib working group.  The
idea is to increase performance by increasing the MTU from the maximum
of 2K (theoretically 4K) supported by IPoIB on top of UD.  With this
code, I'm able to get 800MByte/sec or more with netperf without
options on a Mellanox 4x back-to-back DDR system.

Some notes on code:
1. SRQ is used for scalability to large cluster sizes
2. Only RC connections are used (UC does not support SRQ now)
3. Retry count is set to 0 since spec draft warns against retries
4. Each connection is used for data transfers in only 1 direction, so
   each connection is either active(TX) or passive (RX).  2 sides that
   want to communicate create 2 connections.
5. Each active (TX) connection has a separate CQ for send completions -
   this keeps the code simple without CQ resize and other tricks
6. To detect stale passive side connections (where the remote side is
   down), we keep an LRU list of passive connections (updated once per
   second per connection) and destroy a connection after it has been
   unused for several seconds. The LRU rule makes it possible to avoid
   scanning connections that have recently been active.

Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2007-02-10 08:00:48 -08:00
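The LRU rule in note 6 of 839fcaba35 can be sketched in user space. Timestamps are kept ordered with the most recently used connection first, so reaping starts at the tail and stops at the first entry that is still fresh. The array-based layout and all names here are hypothetical simplifications and do not reflect the driver's actual data structures.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CONNS 8

struct lru {
    unsigned long last_used[MAX_CONNS]; /* index 0 = most recently used */
    size_t count;
};

/* Record activity: move entry idx to the front with a fresh timestamp. */
void lru_touch(struct lru *l, size_t idx, unsigned long now)
{
    for (; idx > 0; idx--)
        l->last_used[idx] = l->last_used[idx - 1];
    l->last_used[0] = now;
}

/* Add a new connection at the front; returns 0 on success, -1 if full. */
int lru_add(struct lru *l, unsigned long now)
{
    if (l->count == MAX_CONNS)
        return -1;
    l->count++;
    lru_touch(l, l->count - 1, now);
    return 0;
}

/* Drop stale entries from the tail; stop at the first fresh one. */
size_t lru_reap(struct lru *l, unsigned long now, unsigned long timeout)
{
    size_t reaped = 0;

    while (l->count > 0 && now - l->last_used[l->count - 1] > timeout) {
        l->count--;
        reaped++;
    }
    return reaped;
}
```

Because the list is ordered by last use, lru_reap() never examines a connection that is newer than the first fresh entry it meets, which is the scanning saving the note describes.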