Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the napi api was changed to separate its 1:1 binding to the net_device
struct, the netif_rx_[prep|schedule|complete] api failed to remove the now
vestigual net_device structure parameter. This patch cleans up that api by
properly removing it..
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't really have a max tx packet size limit, so allow configuring
the device with up to 64k tx MTU.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If segmentation offload is enabled by the host, we currently allocate
maximum sized packet buffers and pass them to the host. This uses up
20 ring entries, allowing us to supply only 20 packet buffers to the
host with a 256 entry ring. This is a huge overhead when receiving
small packets, and is most keenly felt when receiving MTU sized
packets from off-host.
The VIRTIO_NET_F_MRG_RXBUF feature flag is set by hosts which support
using receive buffers which are smaller than the maximum packet size.
In order to transfer large packets to the guest, the host merges
together multiple receive buffers to form a larger logical buffer.
The number of merged buffers is returned to the guest via a field in
the virtio_net_hdr.
Make use of this support by supplying single page receive buffers to
the host. On receive, we extract the virtio_net_hdr, copy 128 bytes of
the payload to the skb's linear data buffer and adjust the fragment
offset to point to the remaining data. This ensures proper alignment
and allows us to not use any paged data for small packets. If the
payload occupies multiple pages, we simply append those pages as
fragments and free the associated skbs.
This scheme allows us to be efficient in our use of ring entries
while still supporting large packets. Benchmarking using netperf from
an external machine to a guest over a 10Gb/s network shows a 100%
improvement from ~1Gb/s to ~2Gb/s. With a local host->guest benchmark
with GSO disabled on the host side, throughput was seen to increase
from 700Mb/s to 1.7Gb/s.
Based on a patch from Herbert Xu.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (use netdev_priv)
Signed-off-by: David S. Miller <davem@davemloft.net>
Seems like an oversight that we have set-tx-csum and set-sg hooked
up, but not set-tso.
Also leads to the strange situation that if you e.g. disable tx-csum,
then tso doesn't get disabled.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Each time we re-fill the recv queue with buffers, we allocate
one too many skbs and free it again when adding fails. We should
recycle the pages allocated in this case.
A previous version of this patch made trim_pages() trim trailing
unused pages from skbs with some paged data, but this actually
caused a barely measurable slowdown.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (use netdev_priv)
Signed-off-by: David S. Miller <davem@davemloft.net>
We have some reasons to kill netdev->priv:
1. netdev->priv is equal to netdev_priv().
2. netdev_priv() wraps the calculation of netdev->priv's offset, obviously
netdev_priv() is more flexible than netdev->priv.
But we cann't kill netdev->priv, because so many drivers reference to it
directly.
This patch is a safe convert for netdev->priv to netdev_priv(netdev).
Since all of the netdev->priv is only for read.
But it is too big to be sent in one mail.
I split it to 4 parts and make every part smaller than 100,000 bytes,
which is max size allowed by vger.
Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This converts pretty much everything to print_mac. There were
a few things that had conflicts which I have just dropped for
now, no harm done.
I've built an allyesconfig with this and looked at the files
that weren't built very carefully, but it's a huge patch.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we hack the virtio_net driver to always allocate full-sized (64k+)
skbuffs, the driver slows down (lguest numbers):
Time to receive 1GB (small buffers): 10.85 seconds
Time to receive 1GB (64k+ buffers): 24.75 seconds
Of course, large buffers use up more space in the ring, so we increase
that from 128 to 2048:
Time to receive 1GB (64k+ buffers, 2k ring): 16.61 seconds
If we recycle pages rather than using alloc_page/free_page:
Time to receive 1GB (64k+ buffers, 2k ring, recycle pages): 10.81 seconds
This demonstrates that with efficient allocation, we don't need to
have a separate "small buffer" queue.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Finally this patch lets virtio_net receive GSO packets in addition
to sending them. This can definitely be optimised for the non-GSO
case. For comparison the Xen approach stores one page in each skb
and uses subsequent skb's pages to construct an SG skb instead of
preallocating the maximum amount of pages per skb.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (added feature bits)
This patch adds some basic ethtool operations to virtio_net so
I could test SG without GSO (which was really useful because TSO
turned out to be buggy :)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (remove MTU setting)
On Mon, 2008-05-26 at 17:42 +1000, Rusty Russell wrote:
> If we fail to transmit a packet, we assume the queue is full and put
> the skb into last_xmit_skb. However, if more space frees up before we
> xmit it, we loop, and the result can be transmitting the same skb twice.
>
> Fix is simple: set skb to NULL if we've used it in some way, and check
> before sending.
...
> diff -r 564237b31993 drivers/net/virtio_net.c
> --- a/drivers/net/virtio_net.c Mon May 19 12:22:00 2008 +1000
> +++ b/drivers/net/virtio_net.c Mon May 19 12:24:58 2008 +1000
> @@ -287,21 +287,25 @@ again:
> free_old_xmit_skbs(vi);
>
> /* If we has a buffer left over from last time, send it now. */
> - if (vi->last_xmit_skb) {
> + if (unlikely(vi->last_xmit_skb)) {
> if (xmit_skb(vi, vi->last_xmit_skb) != 0) {
> /* Drop this skb: we only queue one. */
> vi->dev->stats.tx_dropped++;
> kfree_skb(skb);
> + skb = NULL;
> goto stop_queue;
> }
> vi->last_xmit_skb = NULL;
With this, may drop an skb and then later in the function discover that
we could have sent it after all. Poor wee skb :)
How about the incremental patch below?
Cheers,
Mark.
Subject: [PATCH] virtio_net: Delay dropping tx skbs
Currently we drop the skb in start_xmit() if we have a
queued buffer and fail to transmit it.
However, if we delay dropping it until we've stopped the
queue and enabled the tx notification callback, then there
is a chance space might become available for it.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We can handle receiving partial csums, so set the
appropriate feature bit.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
virtio_net uses a timer to free old transmitted packets, rather than
leaving callbacks enabled all the time. If the host promises to
always notify us when the transmit ring is empty, we can free packets
at that point and avoid the timer.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
virtio_net currently only frees old transmit skbs just
before queueing new ones. If the queue is full, it then
enables interrupts and waits for notification that more
work has been performed.
However, a side-effect of this scheme is that there are
always xmit skbs left dangling when no new packets are
sent, against the Documentation/networking/driver.txt
guideline:
"... it is not allowed for your TX mitigation scheme
to let TX packets "hang out" in the TX ring unreclaimed
forever if no new TX packets are sent."
Add a timer to ensure that any time we queue new TX
skbs, we will shortly free them again.
This fixes an easily reproduced hang at shutdown where
iptables attempts to unload nf_conntrack and nf_conntrack
waits for an skb it is tracking to be freed, but virtio_net
never frees it.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
hdr->csum_start is the offset from the start of the ethernet
header to the transport layer checksum field. skb->csum_start
is the offset from skb->head.
skb_partial_csum_set() assumes that skb->data points to the
ethernet header - i.e. it computes skb->csum_start by adding
the headroom to hdr->csum_start.
Since eth_type_trans() skb_pull()s the ethernet header,
skb_partial_csum_set() should be called before
eth_type_trans().
(Without this patch, GSO packets from a guest to the world outside the
host are corrupted).
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Because we cache the last failed-to-xmit packet, if there are no
packets queued behind that one we may never send it (reproduced here
as TCP stalls, "cured" by an outgoing ping).
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
If we fail to transmit a packet, we assume the queue is full and put
the skb into last_xmit_skb. However, if more space frees up before we
xmit it, we loop, and the result can be transmitting the same skb twice.
Fix is simple: set skb to NULL if we've used it in some way, and check
before sending.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
A recent proposed feature addition to the virtio block driver revealed
some flaws in the API: in particular, we assume that feature
negotiation is complete once a driver's probe function returns.
There is nothing in the API to require this, however, and even I
didn't notice when it was violated.
So instead, we require the driver to specify what features it supports
in a table, we can then move the feature negotiation into the virtio
core. The intersection of device and driver features are presented in
a new 'features' bitmap in the struct virtio_device.
Note that this highlights the difference between Linux unsigned-long
bitmaps where each unsigned long is in native endian, and a
straight-forward little-endian array of bytes.
Drivers can still remove feature bits in their probe routine if they
really have to.
API changes:
- dev->config->feature() no longer gets and acks a feature.
- drivers should advertise their features in the 'feature_table' field
- use virtio_has_feature() for extra sanity when checking feature bits
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
So, we previously had a 'VIRTIO_NET_F_GSO' bit which meant that 'the
host can handle csum offload, and any TSO (v4&v6 incl ECN) or UFO
packets you might want to send. I thought this was good enough for
Linux, but it actually isn't, since we don't do UFO in software.
So, add separate feature bits for what the host can handle. Add
equivalent ones for the guest to say what it can handle, because LRO
is coming too (thanks Herbert!).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Herbert tells me that returning NETDEV_TX_BUSY from hard_start_xmit is
seen as a poor thing to do; we should cache the packet and stop the queue.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Herbert Xu points out (within another patch) that my scatterlists are
too short: one entry for the gso header, one for the skb->data, and
MAX_SKB_FRAGS for all the fragments.
Fix both xmit and recv sides (recv currently unused, coming in later
patch).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
get_buf() gives the length written by the other side, which will be
zero. We want to add the skb length.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
[NETNS][IPV6] tcp - assign the netns for timewait sockets
[IPV4]: Fix byte value boundary check in do_ip_getsockopt().
BNX2X: Correct bringing chip out of reset
[NETFILTER]: nf_nat: autoload IPv4 connection tracking
[NETFILTER]: xt_hashlimit: fix mask calculation
[XFRM]: xfrm_user: fix selector family initialization
rt61pci: rt61pci_beacon_update do not free skb twice
ssb-mipscore: Fix interrupt vectors
ssb-pcicore: Fix IRQ TPS flag handling
mac80211: use short_preamble mode from capability if ERP IE not present
[NET]: Undo code bloat in hot paths due to print_mac().
[TCP]: Don't allow FRTO to take place while MTU is being probed
[TCP]: tcp_simple_retransmit can cause S+L
[TCP]: Fix NewReno's fast rexmit/recovery problems with GSOed skb
[TCP]: Restore 2.6.24 mark_head_lost behavior for newreno/fack
nl80211: fix STA AID bug
b43legacy: fix bcm4303 crash
iwlwifi: fix n-band association problem
ipw2200: set MAC address on radiotap interface
libertas: fix mode initialization problem
If print_mac() is used inside of a pr_debug() the compiler
can't see that the call is redundant so still performs it
even of pr_debug() ends up being a nop.
So don't use print_mac() in such cases in hot code paths,
use MAC_FMT et al. instead.
As noted by Joe Perches, pr_debug() could be modified to
handle this better, but that is a change to an interface
used by the entire kernel and thus needs to be validated
carefully. This here is thus the less risky fix for
2.6.25
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'disable_cb' is really just a hint and as such, it's possible for more
work to get queued up while callbacks are disabled. Under stress with an
SMP guest, this printk triggers very frequently. There is no race here, this
is how things are designed to work so let's just remove the printk.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a race in virtio_net, dealing with disabling/enabling the callback.
I saw the following oops:
kernel BUG at /space/kvm/drivers/virtio/virtio_ring.c:218!
illegal operation: 0001 [#1] SMP
Modules linked in: sunrpc dm_mod
CPU: 2 Not tainted 2.6.25-rc1zlive-host-10623-gd358142-dirty #99
Process swapper (pid: 0, task: 000000000f85a610, ksp: 000000000f873c60)
Krnl PSW : 0404300180000000 00000000002b81a6 (vring_disable_cb+0x16/0x20)
R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:3 PM:0 EA:3
Krnl GPRS: 0000000000000001 0000000000000001 0000000010005800 0000000000000001
000000000f3a0900 000000000f85a610 0000000000000000 0000000000000000
0000000000000000 000000000f870000 0000000000000000 0000000000001237
000000000f3a0920 000000000010ff74 00000000002846f6 000000000fa0bcd8
Krnl Code: 00000000002b819a: a7110001 tmll %r1,1
00000000002b819e: a7840004 brc 8,2b81a6
00000000002b81a2: a7f40001 brc 15,2b81a4
>00000000002b81a6: a51b0001 oill %r1,1
00000000002b81aa: 40102000 sth %r1,0(%r2)
00000000002b81ae: 07fe bcr 15,%r14
00000000002b81b0: eb7ff0380024 stmg %r7,%r15,56(%r15)
00000000002b81b6: a7f13e00 tmll %r15,15872
Call Trace:
([<000000000fa0bcd0>] 0xfa0bcd0)
[<00000000002b8350>] vring_interrupt+0x5c/0x6c
[<000000000010ab08>] do_extint+0xb8/0xf0
[<0000000000110716>] ext_no_vtime+0x16/0x1a
[<0000000000107e72>] cpu_idle+0x1c2/0x1e0
The problem can be triggered with a high amount of host->guest traffic.
I think its the following race:
poll says netif_rx_complete
poll calls enable_cb
enable_cb opens the interrupt mask
a new packet comes, an interrupt is triggered----\
enable_cb sees that there is more work |
enable_cb disables the interrupt |
. V
. interrupt is delivered
. skb_recv_done does atomic napi test, ok
some waiting disable_cb is called->check fails->bang!
.
poll would do napi check
poll would do disable_cb
The fix is to let enable_cb not disable the interrupt again, but expect the
caller to do the cleanup if it returns false. In that case, the interrupt is
only disabled, if the napi test_set_bit was successful.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (cleaned up doco)
Add a new poll_controller handler that the netpoll interface needs.
This enables netconsole logging from a kvm guest over the virtio
net interface.
Signed-off-by: Amit Shah <amitshah@gmx.net>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I got the following oops during interface ifup. Unfortunately its not
easily reproducable so I cant say for sure that my fix fixes this
problem, but I am confident and I think its correct anyway:
<2>kernel BUG at /space/kvm/drivers/virtio/virtio_ring.c:234!
<4>illegal operation: 0001 [#1] PREEMPT SMP
<4>Modules linked in:
<4>CPU: 0 Not tainted 2.6.24zlive-guest-07293-gf1ca151-dirty #91
<4>Process swapper (pid: 0, task: 0000000000800938, ksp: 000000000084ddb8)
<4>Krnl PSW : 0404300180000000 0000000000466374 (vring_disable_cb+0x30/0x34)
<4> R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:3 PM:0 EA:3
<4>Krnl GPRS: 0000000000000001 0000000000000001 0000000010003800 0000000000466344
<4> 000000000e980900 00000000008848b0 000000000084e748 0000000000000000
<4> 000000000087b300 0000000000001237 0000000000001237 000000000f85bdd8
<4> 000000000e980920 00000000001137c0 0000000000464754 000000000f85bdd8
<4>Krnl Code: 0000000000466368: e3b0b0700004 lg %r11,112(%r11)
<4> 000000000046636e: 07fe bcr 15,%r14
<4> 0000000000466370: a7f40001 brc 15,466372
<4> >0000000000466374: a7f4fff6 brc 15,466360
<4> 0000000000466378: eb7ff0500024 stmg %r7,%r15,80(%r15)
<4> 000000000046637e: a7f13e00 tmll %r15,15872
<4> 0000000000466382: b90400ef lgr %r14,%r15
<4> 0000000000466386: a7840001 brc 8,466388
<4>Call Trace:
<4>([<000201500f85c000>] 0x201500f85c000)
<4> [<0000000000466556>] vring_interrupt+0x72/0x88
<4> [<00000000004801a0>] kvm_extint_handler+0x34/0x44
<4> [<000000000010d22c>] do_extint+0xbc/0xf8
<4> [<0000000000113f98>] ext_no_vtime+0x16/0x1a
<4> [<000000000010a182>] cpu_idle+0x216/0x238
<4>([<000000000010a162>] cpu_idle+0x1f6/0x238)
<4> [<0000000000568656>] rest_init+0xaa/0xb8
<4> [<000000000084ee2c>] start_kernel+0x3fc/0x490
<4> [<0000000000100020>] _stext+0x20/0x80
<4>
<4> <0>Kernel panic - not syncing: Fatal exception in interrupt
<4>
After looking at the code and the dump I think the following scenario
happened: Ifup was running on cpu2 and the interrupt arrived on cpu0.
Now virtnet_open on cpu 2 managed to execute napi_enable and disable_cb
but did not execute rx_schedule. Meanwhile on cpu 0 skb_recv_done was
called by vring_interrupt, executed netif_rx_schedule_prep, which
succeeded and therefore called disable_cb. This triggered the BUG_ON,
as interrupts were already disabled by cpu 2.
I think the proper solution is to make the call to disable_cb depend on
the atomic update of NAPI_STATE_SCHED by using netif_rx_schedule_prep
in the same way as skb_recv_done.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
This fixes a potential dangling xmit problem.
We also suppress refill interrupts until we need them.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fix bug found by Christian Borntraeger: if the other side fills all
the registered network buffers before we enable NAPI, we will never
get an interrupt. The simplest fix is to process the input queue once
on open.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Hello Rusty,
virtnet_probe already calls alloc_etherdev, which calls ether_setup.
There is no need to do that again.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
A reset function solves three problems:
1) It allows us to renegotiate features, eg. if we want to upgrade a
guest driver without rebooting the guest.
2) It gives us a clean way of shutting down virtqueues: after a reset,
we know that the buffers won't be used by the host, and
3) It helps the guest recover from messed-up drivers.
So we remove the ->shutdown hook, and the only way we now remove
feature bits is via reset.
We leave it to the driver to do the reset before it deletes queues:
the balloon driver, for example, needs to chat to the host in its
remove function.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Since we want to reset the device to remove them, this is simpler
(device is reset for us on driver remove).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
1) Turn GSO on virtio net into an all-or-nothing (keep checksumming
separate). Having multiple bits is a pain: if you can't support something
you should handle it in software, which is still a performance win.
2) Make VIRTIO_NET_HDR_GSO_ECN a flag in the header, so it can apply to
IPv6 or v4.
3) Rename VIRTIO_NET_F_NO_CSUM to VIRTIO_NET_F_CSUM (ie. means we do
checksumming).
4) Add csum and gso params to virtio_net to allow more testing.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It's far easier to deal with packets if we don't have to parse the
packet to figure out the header length to know how much to pull into
the skb data. Add the field to the virtio_net_hdr struct (and fix the
spaces that somehow crept in there).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It seems that virtio_net wants to disable callbacks (interrupts) before
calling netif_rx_schedule(), so we can't use the return value to do so.
Rename "restart" to "cb_enable" and introduce "cb_disable" hook: callback
now returns void, rather than a boolean.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Previously we used a type/len pair within the config space, but this
seems overkill. We now simply define a structure which represents the
layout in the config space: the config space can now only be extended
at the end.
The main driver-visible changes:
1) We indicate what fields are present with an explicit feature bit.
2) Virtqueues are explicitly numbered, and not in the config space.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Use it in virtio_net (replacing buggy version there), it's also going
to be used by TAP for partial csum support.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: David S. Miller <davem@davemloft.net>
The virtio code never hooked through the ->remove callback. Although
noone supports device removal at the moment, this code is already
needed for module unloading.
This of course also revealed bugs in virtio_blk, virtio_net and lguest
unloading paths.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The network driver uses two virtqueues: one for input packets and one
for output packets. This has nice locking properties (ie. we don't do
any for recv vs send).
TODO:
1) Big packets.
2) Multi-client devices (maybe separate driver?).
3) Resolve freeing of old xmit skbs (Christian Borntraeger)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: netdev@vger.kernel.org