This replaces find_vq/del_vq with find_vqs/del_vqs virtio operations,
and updates all drivers. This is needed for MSI support, because MSI
needs to know the total number of vectors upfront.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (+ lguest/9p compile fixes)
Add a linked list of all virtqueues for a virtio device: this helps for
debugging and is also needed for upcoming interface change.
Also, add a "name" field for clearer debug messages.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We need to enforce the IP alignment on the non-mergeable RX path just
like the other RX path. Not doing so results in misaligned IP
headers.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Through a bug in the tun driver, I noticed that virtio_net is
producing bogus hdr_len values. In particular, it only includes
the IP header in the linear area, and excludes the entire TCP
header. This causes the TCP header to be copied twice for each
packet. (The bug omitted the second copy :)
This patch corrects this.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch converts unicast address list to standard list_head using
previously introduced struct netdev_hw_addr. It also relaxes the
locking. Original spinlock (still used for multicast addresses) is not
needed and is no longer used for a protection of this list. All
reading and writing takes place under rtnl (with no changes).
I also removed a possibility to specify the length of the address
while adding or deleting unicast address. It's always dev->addr_len.
The convertion touched especially e1000 and ixgbe codes when the
change is not so trivial.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
drivers/net/bnx2.c | 13 +--
drivers/net/e1000/e1000_main.c | 24 +++--
drivers/net/ixgbe/ixgbe_common.c | 14 ++--
drivers/net/ixgbe/ixgbe_common.h | 4 +-
drivers/net/ixgbe/ixgbe_main.c | 6 +-
drivers/net/ixgbe/ixgbe_type.h | 4 +-
drivers/net/macvlan.c | 11 +-
drivers/net/mv643xx_eth.c | 11 +-
drivers/net/niu.c | 7 +-
drivers/net/virtio_net.c | 7 +-
drivers/s390/net/qeth_l2_main.c | 6 +-
drivers/scsi/fcoe/fcoe.c | 16 ++--
include/linux/netdevice.h | 18 ++--
net/8021q/vlan.c | 4 +-
net/8021q/vlan_dev.c | 10 +-
net/core/dev.c | 195 +++++++++++++++++++++++++++-----------
net/dsa/slave.c | 10 +-
net/packet/af_packet.c | 4 +-
18 files changed, 227 insertions(+), 137 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
We were avoiding calling sg_init* on scatterlists passed
into virtnet_send_command to prevent extraneous end markers.
This caused build warnings for uninitialized variables.
Cleanup the code to create proper scatterlists.
Signed-off-by: Alex Williamson <alex.williamson@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
VIRTIO_NET_F_MAC indicates the presence of the mac field in config
space, not the validity of the value it contains. Allow the mac to be
changed at runtime, but only push the change into config space with the
VIRTIO_NET_F_MAC feature present.
Signed-off-by: Alex Williamson <alex.williamson@hp.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Impact: Make NetworkManager work with virtio_net
For now the semantics are simple: There is always carrier.
This allows a seamless experience with e.g., qemu/kvm
where NetworkManager just configures and sets up
everything automagically.
If/when a generally agreed-upon way to control
carrier on/off in the emulator/hypervisor level
emerges, it will be trivial to extend the driver
to support that too, but for now even this 2-liner
makes user experience that much better.
Signed-off-by: Pantelis Koukousoulas <pktoss@gmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Many physical NICs let the OS re-program the "hardware" MAC
address. Virtual NICs should allow this too.
Signed-off-by: Alex Williamson <alex.williamson@hp.com>
Acked-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
VLAN filtering allows the hypervisor to drop packets from VLANs
that we're not a part of, further reducing the number of extraneous
packets recieved. This makes use of the VLAN virtqueue command class.
The CTRL_VLAN feature bit tells us whether the backend supports VLAN
filtering.
Signed-off-by: Alex Williamson <alex.williamson@hp.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make use of the MAC control virtqueue class to support a MAC
filter table. The filter table is managed by the hypervisor.
We consider the table to be available if the CTRL_RX feature
bit is set. We leave it to the hypervisor to manage the table
and enable promiscuous or all-multi mode as necessary depending
on the resources available to it.
Signed-off-by: Alex Williamson <alex.williamson@hp.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make use of the RX_MODE control virtqueue class to enable the
set_rx_mode netdev interface. This allows us to selectively
enable/disable promiscuous and allmulti mode so we don't see
packets we don't want. For now, we automatically enable these
as needed if additional unicast or multicast addresses are
requested.
Signed-off-by: Alex Williamson <alex.williamson@hp.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This will be used for RX mode, MAC filter table, VLAN filtering, etc...
The control transaction consists of one or more "out" sg entries and
one or more "in" sg entries. The first out entry contains a header
defining the class and command. Additional out entries may provide
data for the command. The last in entry provides a status response
back from the command.
Virtqueues typically run asynchronous, running a callback function
when there's data in the channel. We can't readily make use of this
in the command paths where we need to use this. Instead, we kick
the virtqueue and spin. The kick causes an I/O write, triggering an
immediate trap into the hypervisor.
Signed-off-by: Alex Williamson <alex.williamson@hp.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Without this fix, virtio_net makes incorrect usage of scatterlists. It sets
the end of the scatterlist chain after the first element, despite the fact
that more entries come after it.
If you try to run dma_map_sg() on one of the scatterlists given to you by
add_buf(), you will get a null pointer oops.
Signed-off-by: Ira W. Snyder <iws@ovro.caltech.edu>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
802.1Q expanded the maximum ethernet frame size by 4 bytes for the
VLAN tag. We're not taking this into account in virtio_net, which
means the buffers we provide to the backend in the virtqueue RX ring
aren't big enough to hold a full MTU VLAN packet. For QEMU/KVM,
this results in the backend exiting with a packet truncation error.
Signed-off-by: Alex Williamson <alex.williamson@hp.com>
Acked-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow the host to inform us that the link is down by adding
a VIRTIO_NET_F_STATUS which indicates that device status is
available in virtio_net config.
This is currently useful for simulating link down conditions
(e.g. using proposed qemu 'set_link' monitor command) but
would also be needed if we were to support device assignment
via virtio.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (added future masking)
Signed-off-by: David S. Miller <davem@davemloft.net>
Following the removal of the unused struct net_device * parameter from
the NAPI functions named *netif_rx_* in commit 908a7a1, they are
exactly equivalent to the corresponding *napi_* functions and are
therefore redundant.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the napi api was changed to separate its 1:1 binding to the net_device
struct, the netif_rx_[prep|schedule|complete] api failed to remove the now
vestigual net_device structure parameter. This patch cleans up that api by
properly removing it..
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't really have a max tx packet size limit, so allow configuring
the device with up to 64k tx MTU.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If segmentation offload is enabled by the host, we currently allocate
maximum sized packet buffers and pass them to the host. This uses up
20 ring entries, allowing us to supply only 20 packet buffers to the
host with a 256 entry ring. This is a huge overhead when receiving
small packets, and is most keenly felt when receiving MTU sized
packets from off-host.
The VIRTIO_NET_F_MRG_RXBUF feature flag is set by hosts which support
using receive buffers which are smaller than the maximum packet size.
In order to transfer large packets to the guest, the host merges
together multiple receive buffers to form a larger logical buffer.
The number of merged buffers is returned to the guest via a field in
the virtio_net_hdr.
Make use of this support by supplying single page receive buffers to
the host. On receive, we extract the virtio_net_hdr, copy 128 bytes of
the payload to the skb's linear data buffer and adjust the fragment
offset to point to the remaining data. This ensures proper alignment
and allows us to not use any paged data for small packets. If the
payload occupies multiple pages, we simply append those pages as
fragments and free the associated skbs.
This scheme allows us to be efficient in our use of ring entries
while still supporting large packets. Benchmarking using netperf from
an external machine to a guest over a 10Gb/s network shows a 100%
improvement from ~1Gb/s to ~2Gb/s. With a local host->guest benchmark
with GSO disabled on the host side, throughput was seen to increase
from 700Mb/s to 1.7Gb/s.
Based on a patch from Herbert Xu.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (use netdev_priv)
Signed-off-by: David S. Miller <davem@davemloft.net>
Seems like an oversight that we have set-tx-csum and set-sg hooked
up, but not set-tso.
Also leads to the strange situation that if you e.g. disable tx-csum,
then tso doesn't get disabled.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Each time we re-fill the recv queue with buffers, we allocate
one too many skbs and free it again when adding fails. We should
recycle the pages allocated in this case.
A previous version of this patch made trim_pages() trim trailing
unused pages from skbs with some paged data, but this actually
caused a barely measurable slowdown.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (use netdev_priv)
Signed-off-by: David S. Miller <davem@davemloft.net>
We have some reasons to kill netdev->priv:
1. netdev->priv is equal to netdev_priv().
2. netdev_priv() wraps the calculation of netdev->priv's offset, obviously
netdev_priv() is more flexible than netdev->priv.
But we cann't kill netdev->priv, because so many drivers reference to it
directly.
This patch is a safe convert for netdev->priv to netdev_priv(netdev).
Since all of the netdev->priv is only for read.
But it is too big to be sent in one mail.
I split it to 4 parts and make every part smaller than 100,000 bytes,
which is max size allowed by vger.
Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This converts pretty much everything to print_mac. There were
a few things that had conflicts which I have just dropped for
now, no harm done.
I've built an allyesconfig with this and looked at the files
that weren't built very carefully, but it's a huge patch.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we hack the virtio_net driver to always allocate full-sized (64k+)
skbuffs, the driver slows down (lguest numbers):
Time to receive 1GB (small buffers): 10.85 seconds
Time to receive 1GB (64k+ buffers): 24.75 seconds
Of course, large buffers use up more space in the ring, so we increase
that from 128 to 2048:
Time to receive 1GB (64k+ buffers, 2k ring): 16.61 seconds
If we recycle pages rather than using alloc_page/free_page:
Time to receive 1GB (64k+ buffers, 2k ring, recycle pages): 10.81 seconds
This demonstrates that with efficient allocation, we don't need to
have a separate "small buffer" queue.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Finally this patch lets virtio_net receive GSO packets in addition
to sending them. This can definitely be optimised for the non-GSO
case. For comparison the Xen approach stores one page in each skb
and uses subsequent skb's pages to construct an SG skb instead of
preallocating the maximum amount of pages per skb.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (added feature bits)
This patch adds some basic ethtool operations to virtio_net so
I could test SG without GSO (which was really useful because TSO
turned out to be buggy :)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (remove MTU setting)
On Mon, 2008-05-26 at 17:42 +1000, Rusty Russell wrote:
> If we fail to transmit a packet, we assume the queue is full and put
> the skb into last_xmit_skb. However, if more space frees up before we
> xmit it, we loop, and the result can be transmitting the same skb twice.
>
> Fix is simple: set skb to NULL if we've used it in some way, and check
> before sending.
...
> diff -r 564237b31993 drivers/net/virtio_net.c
> --- a/drivers/net/virtio_net.c Mon May 19 12:22:00 2008 +1000
> +++ b/drivers/net/virtio_net.c Mon May 19 12:24:58 2008 +1000
> @@ -287,21 +287,25 @@ again:
> free_old_xmit_skbs(vi);
>
> /* If we has a buffer left over from last time, send it now. */
> - if (vi->last_xmit_skb) {
> + if (unlikely(vi->last_xmit_skb)) {
> if (xmit_skb(vi, vi->last_xmit_skb) != 0) {
> /* Drop this skb: we only queue one. */
> vi->dev->stats.tx_dropped++;
> kfree_skb(skb);
> + skb = NULL;
> goto stop_queue;
> }
> vi->last_xmit_skb = NULL;
With this, may drop an skb and then later in the function discover that
we could have sent it after all. Poor wee skb :)
How about the incremental patch below?
Cheers,
Mark.
Subject: [PATCH] virtio_net: Delay dropping tx skbs
Currently we drop the skb in start_xmit() if we have a
queued buffer and fail to transmit it.
However, if we delay dropping it until we've stopped the
queue and enabled the tx notification callback, then there
is a chance space might become available for it.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We can handle receiving partial csums, so set the
appropriate feature bit.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
virtio_net uses a timer to free old transmitted packets, rather than
leaving callbacks enabled all the time. If the host promises to
always notify us when the transmit ring is empty, we can free packets
at that point and avoid the timer.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
virtio_net currently only frees old transmit skbs just
before queueing new ones. If the queue is full, it then
enables interrupts and waits for notification that more
work has been performed.
However, a side-effect of this scheme is that there are
always xmit skbs left dangling when no new packets are
sent, against the Documentation/networking/driver.txt
guideline:
"... it is not allowed for your TX mitigation scheme
to let TX packets "hang out" in the TX ring unreclaimed
forever if no new TX packets are sent."
Add a timer to ensure that any time we queue new TX
skbs, we will shortly free them again.
This fixes an easily reproduced hang at shutdown where
iptables attempts to unload nf_conntrack and nf_conntrack
waits for an skb it is tracking to be freed, but virtio_net
never frees it.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
hdr->csum_start is the offset from the start of the ethernet
header to the transport layer checksum field. skb->csum_start
is the offset from skb->head.
skb_partial_csum_set() assumes that skb->data points to the
ethernet header - i.e. it computes skb->csum_start by adding
the headroom to hdr->csum_start.
Since eth_type_trans() skb_pull()s the ethernet header,
skb_partial_csum_set() should be called before
eth_type_trans().
(Without this patch, GSO packets from a guest to the world outside the
host are corrupted).
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Because we cache the last failed-to-xmit packet, if there are no
packets queued behind that one we may never send it (reproduced here
as TCP stalls, "cured" by an outgoing ping).
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
If we fail to transmit a packet, we assume the queue is full and put
the skb into last_xmit_skb. However, if more space frees up before we
xmit it, we loop, and the result can be transmitting the same skb twice.
Fix is simple: set skb to NULL if we've used it in some way, and check
before sending.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
A recent proposed feature addition to the virtio block driver revealed
some flaws in the API: in particular, we assume that feature
negotiation is complete once a driver's probe function returns.
There is nothing in the API to require this, however, and even I
didn't notice when it was violated.
So instead, we require the driver to specify what features it supports
in a table, we can then move the feature negotiation into the virtio
core. The intersection of device and driver features are presented in
a new 'features' bitmap in the struct virtio_device.
Note that this highlights the difference between Linux unsigned-long
bitmaps where each unsigned long is in native endian, and a
straight-forward little-endian array of bytes.
Drivers can still remove feature bits in their probe routine if they
really have to.
API changes:
- dev->config->feature() no longer gets and acks a feature.
- drivers should advertise their features in the 'feature_table' field
- use virtio_has_feature() for extra sanity when checking feature bits
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
So, we previously had a 'VIRTIO_NET_F_GSO' bit which meant that 'the
host can handle csum offload, and any TSO (v4&v6 incl ECN) or UFO
packets you might want to send. I thought this was good enough for
Linux, but it actually isn't, since we don't do UFO in software.
So, add separate feature bits for what the host can handle. Add
equivalent ones for the guest to say what it can handle, because LRO
is coming too (thanks Herbert!).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Herbert tells me that returning NETDEV_TX_BUSY from hard_start_xmit is
seen as a poor thing to do; we should cache the packet and stop the queue.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Herbert Xu points out (within another patch) that my scatterlists are
too short: one entry for the gso header, one for the skb->data, and
MAX_SKB_FRAGS for all the fragments.
Fix both xmit and recv sides (recv currently unused, coming in later
patch).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
get_buf() gives the length written by the other side, which will be
zero. We want to add the skb length.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
[NETNS][IPV6] tcp - assign the netns for timewait sockets
[IPV4]: Fix byte value boundary check in do_ip_getsockopt().
BNX2X: Correct bringing chip out of reset
[NETFILTER]: nf_nat: autoload IPv4 connection tracking
[NETFILTER]: xt_hashlimit: fix mask calculation
[XFRM]: xfrm_user: fix selector family initialization
rt61pci: rt61pci_beacon_update do not free skb twice
ssb-mipscore: Fix interrupt vectors
ssb-pcicore: Fix IRQ TPS flag handling
mac80211: use short_preamble mode from capability if ERP IE not present
[NET]: Undo code bloat in hot paths due to print_mac().
[TCP]: Don't allow FRTO to take place while MTU is being probed
[TCP]: tcp_simple_retransmit can cause S+L
[TCP]: Fix NewReno's fast rexmit/recovery problems with GSOed skb
[TCP]: Restore 2.6.24 mark_head_lost behavior for newreno/fack
nl80211: fix STA AID bug
b43legacy: fix bcm4303 crash
iwlwifi: fix n-band association problem
ipw2200: set MAC address on radiotap interface
libertas: fix mode initialization problem
If print_mac() is used inside of a pr_debug() the compiler
can't see that the call is redundant so still performs it
even of pr_debug() ends up being a nop.
So don't use print_mac() in such cases in hot code paths,
use MAC_FMT et al. instead.
As noted by Joe Perches, pr_debug() could be modified to
handle this better, but that is a change to an interface
used by the entire kernel and thus needs to be validated
carefully. This here is thus the less risky fix for
2.6.25
Signed-off-by: David S. Miller <davem@davemloft.net>