When flushing out queued commands after a successful device reset,
make sure that SRP completes the right commands, instead of calling
scsi_done on the command passed into the device reset handler over and
over.
Signed-off-by: Ishai Rabinovitz <ishai@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If a reconnection attempt fails, then SRP does two scsi_host_put()s.
This is a historical relic from an earlier version of the driver that
took a reference on the scsi_host before trying to reconnect, so get
rid of the extra scsi_host_put().
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Sending a DREQ may fail, for example because the remote target has
already broken the connection. If so, then SRP should not wait for
the disconnection to complete, because it never will.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Setting fw_cmd_doorbell allows FW command to be queued using posted
writes instead of requiring polling on a "go" bit, so it should be a
performance boost. However, the option causes problems with at least
some device/firmware combinations, so set the default to 0 until we
understand what's going on better.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix race condition during destruction calls to avoid possibility of
accessing object after it has been freed. Instead of waking up a wait
queue directly, which is susceptible to a race where the object is
freed between the reference count going to 0 and the wake_up(), use a
completion to wait in the function doing the freeing.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The ipath driver's table of PCI IDs needs a { 0, } entry at the end.
This makes all of the device aliases visible to userspace so hotplug
loads the module for all supported devices. Without the patch,
modinfo ipath_core only shows:
alias: pci:v00001FC1d0000000Dsv*sd*bc*sc*i*
instead of the correct:
alias: pci:v00001FC1d00000010sv*sd*bc*sc*i*
alias: pci:v00001FC1d0000000Dsv*sd*bc*sc*i*
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Addresses for ioremap must be calculated off of pci_resource_start;
we can't directly use the bus address as seen by the HCA. Fix the
code that remaps device memory for FMR access.
Based on patch by Klaus Smolin.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When deleting a child interface with a non-default P_Key via
/sys/class/net/ibX/delete_child, the interface must be freed with
free_netdev() (rather than kfree() on the private data).
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix races in in destroying various objects. If a destroy routine
waits for an object to become free by doing
wait_event(&obj->wait, !atomic_read(&obj->refcount));
/* now clean up and destroy the object */
and another place drops a reference to the object by doing
if (atomic_dec_and_test(&obj->refcount))
wake_up(&obj->wait);
then this is susceptible to a race where the wait_event() and final
freeing of the object occur between the atomic_dec_and_test() and the
wake_up(). And this is a use-after-free, since wake_up() will be
called on part of the already-freed object.
Fix this in mthca by replacing the atomic_t refcounts with plain old
integers protected by a spinlock. This makes it possible to do the
decrement of the reference count and the wake_up() so that it appears
as a single atomic operation to the code waiting on the wait queue.
While touching this code, also simplify mthca_cq_clean(): the CQ being
cleaned cannot go away, because it still has a QP attached to it. So
there's no reason to be paranoid and look up the CQ by number; it's
perfectly safe to use the pointer that the callers already have.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If a SCSI abort completes, or the command completes successfully, then
the driver must remove the command from its queue of pending
commands. Similarly, if a device reset succeeds, then all commands
queued for the given device must be removed from the queue.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The code to display local_link_integrity_errors and
excessive_buffer_overrun_errors in
/sys/class/infiniband/<hca>/ports/<n>/counters/
uses the wrong shift to extract the 4 bit values.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Names that are the opposite of their intended meanings are not so helpful.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The reset code now turns off the PRESENT flag during a reset, so that
other code won't attempt to access a device that's in mid-reset.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Remember when the verbs layer unregisters from the lower-level code.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Different ipath hardware types have different numbers of buffers
available, so we decide on the counts ourselves unless we are specifically
overridden with a module parameter.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Some systems do not set up 64-bit maps on systems with 2GB or less of
memory installed, so we have to fall back to trying a 32-bit setup.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We were accidentally exposing the "reset" sysfs file more than once
per device.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
GuidInfo records have 8 byte GUIDs in them, so an index should be
multiplied by 8 to get an offset. mthca_query_gid() was incorrectly
multiplying by 16.
Noticed by Leonid Keller <leonid@mellanox.co.il>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch makes the needlessly global mthca_update_rate() static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Only check that RMPP version is not specified when MAD class does not
support RMPP. Just because a class is allowed to use RMPP doesn't
mean that rmpp_version needs to be set for the MAD agent to
register. Checking this was a recent change which was too pedantic.
Signed-off-by: Hal Rosenstock <halr@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If a SCSI abort succeeds, then the aborted request should to be
removed from the list of pending requests. This fixes list corruption
after an abort occurs.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The driver allocates SRQ WQEs size with a power of 2 size both for
Tavor and for memfree. For Tavor, however, the hardware only requires
the WQE size to be a multiple of 16, not a power of 2, and the max
number of scatter-gather allowed is reported accordingly by the
firmware (and this is the value currently returned by
ib_query_device() and ibv_query_device()).
If the max number of scatter/gather entries reported by the FW is used
when creating an SRQ, the creation will fail for Tavor, since the
required WQE size will be increased to the next power of 2, which
turns out to be larger than the device permitted max WQE size (which
is not a power of 2).
This patch reduces the reported SRQ max wqe size so that it can be used
successfully in creating an SRQ on Tavor HCAs.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When allocating gid_cache, use kmalloc(sizeof *gid_cache, ...) rather
than kmalloc(sizeof *pkey_cache, ...). It doesn't really matter which
one is used, since the size ends up the same either way, but it's much
better to say what we mean.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We know ipoib_flush_paths() is called from plain process context with
interrupts enabled, since it does wait_for_completion(). So there's
no need to use spin_lock_irqsave() -- spin_lock_irq() is fine.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ib_sa_cancel_query() must be called with priv->lock held since
a completion might arrive and set path->query to NULL.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The PCI spec recommends against drivers playing with a device's PCI
read burst size, and says that systems software should configure it.
And we actually have users that report that changing it from the
default set by BIOS hurts performance and/or stability for them. On
the other hand, the Mellanox Programmer's Reference Manual recommends
turning it up all the way to the maximum value. Some tests conducted
here in the lab do not show performance improvement from this tuning,
but this might be just me.
As a work-around, make this tuning an option, off by default (safe
value), with an eye towards removing it completely one day if no one
complains.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Make IPoIB's send and receive queue sizes tunable via module
parameters ("send_queue_size" and "recv_queue_size"). This allows the
queue sizes to be enlarged to fix disastrously bad performance on some
platforms and workloads, without bloating memory usage when large
queues aren't needed.
Signed-off-by: Shirley Ma <xma@us.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ipoib_mcast_restart_task() might free an mcast object while a join
request is still outstanding, leading to an oops when the query
completes. Fix this by waiting for query to complete, similar to what
ipoib_stop_thread() is doing. The wait for mcast completion code is
consolidated in wait_for_mcast_join().
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Push translation of static rate to HCA format into low-level drivers,
where it belongs. For static rate encoding, use encoding of rate
field from IB standard PathRecord, with addition of value 0, for
backwards compatibility with current usage. The changes are:
- Add enum ib_rate to midlayer includes.
- Get rid of static rate translation in IPoIB; just use static rate
directly from Path and MulticastGroup records.
- Update mthca driver to translate absolute static rate into the
format used by hardware. This also fixes mthca's static rate
handling for HCAs that are capable of 4X DDR.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Consolidate IPoIB's private neighbour data handling into
ipoib_neigh_alloc() and ipoib_neigh_free(). This will make it easier
to keep track of the neighbour structures that IPoIB is handling, and
is a nice cleanup of the code:
add/remove: 2/1 grow/shrink: 1/8 up/down: 100/-178 (-78)
function old new delta
ipoib_neigh_alloc - 61 +61
ipoib_neigh_free - 36 +36
ipoib_mcast_join_finish 1288 1291 +3
path_rec_completion 575 573 -2
ipoib_mcast_join_task 664 660 -4
ipoib_neigh_destructor 101 92 -9
ipoib_neigh_setup_dev 14 3 -11
ipoib_neigh_setup 17 - -17
path_free 238 215 -23
ipoib_mcast_free 329 306 -23
ipoib_mcast_send 718 684 -34
neigh_add_path 705 650 -55
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Change the mthca debugging trace output code so that it can enabled
and disabled at runtime with the debug_level module parameter in
sysfs. Also, don't allow CONFIG_INFINIBAND_MTHCA_DEBUG to be disabled
unless CONFIG_EMBEDDED is selected. We want users (and especially
distros) to have this turned on unless they really need to save space,
because by the time we want debugging output, it's usually too late to
rebuild a kernel.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Don't allow CONFIG_INFINIBAND_IPOIB_DEBUG to be disabled unless
CONFIG_EMBEDDED is selected. We want users (and especially distros)
to have this turned on unless they really need to save space, because
by the time we want debugging output, it's usually too late to rebuild
a kernel. The debugging output can be controlled at runtime via the
debug_level module parameter in sysfs.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We have seen the following OOPs in cancel_mads, when restarting opensm
multiple times:
Call Trace:
[<c010549b>] show_stack+0x9b/0xb0
[<c01055ec>] show_registers+0x11c/0x190
[<c01057cd>] die+0xed/0x160
[<c031b966>] do_page_fault+0x3f6/0x5d0
[<c010511f>] error_code+0x4f/0x60
[<f8ac4e38>] cancel_mads+0x128/0x150 [ib_mad]
[<f8ac2811>] unregister_mad_agent+0x11/0x130 [ib_mad]
[<f8ac2a12>] ib_unregister_mad_agent+0x12/0x20 [ib_mad]
[<f8b10f23>] ib_umad_close+0xf3/0x130 [ib_umad]
[<c0162937>] __fput+0x187/0x1c0
[<c01627a9>] fput+0x19/0x20
[<c0160f7a>] filp_close+0x3a/0x60
[<c0121ca8>] put_files_struct+0x68/0xa0
[<c0103cf7>] do_signal+0x47/0x100
[<c0103ded>] do_notify_resume+0x3d/0x40
[<c0103f9e>] work_notifysig+0x13/0x25
We traced this back to local_completions unlocking mad_agent_priv->lock
while still keeping a pointer into local_list. A later call to
list_del(&local->completion_list) would then corrupt the list.
To fix this, remove the entry from local_list after looking it up but
before releasing mad_agent_priv->lock, to prevent cancel_mads from
finding and freeing it.
Signed-off-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Integrate the ipath core and OpenIB drivers into the kernel build
infrastructure. Add entry to MAINTAINERS.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The ipath_verbs.c file implements the driver-specific components of the
kernel's Infiniband verbs layer.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Completion queues, local and remote memory keys, and memory region
support.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This is an implementation of the Infiniband RC ("reliable connection")
protocol.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
These files implement the Infiniband UC ("unreliable connection") and UD
("unreliable datagram") protocols.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
These header files are used by the layered Infiniband driver.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The layering interfaces are used to implement the Infiniband protocols
and the ethernet emulation driver.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
These files introduce a char device that userspace apps use to gain
direct memory-mapped access to the InfiniPath hardware, and routines for
pinning and unpinning user memory in cases where the hardware needs to
DMA into the user address space.
Signed-off-by: Bryan O'Sullivan <bos@pathscale.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>