Meaning receive multiple messages, reducing the number of syscalls and
net stack entry/exit operations.
Next patches will introduce mechanisms where protocols that want to
optimize this operation will provide an unlocked_recvmsg operation.
This takes into account comments made by:
. Paul Moore: sock_recvmsg is called only for the first datagram,
sock_recvmsg_nosec is used for the rest.
. Caitlin Bestler: recvmmsg now has a struct timespec timeout, that
works in the same fashion as the ppoll one.
If the underlying protocol returns a datagram with MSG_OOB set, this
will make recvmmsg return right away with as many datagrams (+ the OOB
one) it has received so far.
. Rémi Denis-Courmont & Steven Whitehouse: If we receive N < vlen
datagrams and then recvmsg returns an error, recvmmsg will return
the successfully received datagrams, store the error and return it
in the next call.
This paves the way for a subsequent optimization, sk_prot->unlocked_recvmsg,
where we will be able to acquire the lock only at batch start and end, not at
every underlying recvmsg call.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create a new socket level option to report number of queue overflows
Recently I augmented the AF_PACKET protocol to report the number of frames lost
on the socket receive queue between any two enqueued frames. This value was
exported via a SOL_PACKET level cmsg. AFter I completed that work it was
requested that this feature be generalized so that any datagram oriented socket
could make use of this option. As such I've created this patch, It creates a
new SOL_SOCKET level option called SO_RXQ_OVFL, which when enabled exports a
SOL_SOCKET level cmsg that reports the nubmer of times the sk_receive_queue
overflowed between any two given frames. It also augments the AF_PACKET
protocol to take advantage of this new feature (as it previously did not touch
sk->sk_drops, which this patch uses to record the overflow count). Tested
successfully by me.
Notes:
1) Unlike my previous patch, this patch simply records the sk_drops value, which
is not a number of drops between packets, but rather a total number of drops.
Deltas must be computed in user space.
2) While this patch currently works with datagram oriented protocols, it will
also be accepted by non-datagram oriented protocols. I'm not sure if thats
agreeable to everyone, but my argument in favor of doing so is that, for those
protocols which aren't applicable to this option, sk_drops will always be zero,
and reporting no drops on a receive queue that isn't used for those
non-participating protocols seems reasonable to me. This also saves us having
to code in a per-protocol opt in mechanism.
3) This applies cleanly to net-next assuming that commit
977750076d (my af packet cmsg patch) is reverted
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6 sit: Set relay to 0.0.0.0 directly if relay_prefixlen == 0.
Do not use bit-shift if relay_prefixlen == 0;
relay_prefix << 32 does not result in 0.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6 sit: Fix 6rd relay address.
Relay's address should be extracted from real IPv6 address
instead of configured prefix.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6 sit: Ensure to initialize 6rd parameters.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The error unwinding code in set_netns has a bug
that will make it run into a BUG_ON if passed a
bad wiphy index, fix by not trying to unlock a
wiphy that doesn't exist.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Commit 9ef1d4c7c7 ("[NETLINK]: Missing
initializations in dumped data") introduced a typo in
initialization. This patch fixes this.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow enable/disable UFO on bridge device via ethtool
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that software UFO is supported, UFO can be enabled on master
devices like bridge, bond even though the attached device doesn't
support this feature in hardware.
This allows UFO to be used between KVM host and guest even when a
physical interface attached to the bridge doesn't support UFO.
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP_HTABLE_SIZE was initialy defined to 128, which is a bit small for
several setups.
4000 active UDP sockets -> 32 sockets per chain in average. An
incoming frame has to lookup all sockets to find best match, so long
chains hurt latency.
Instead of a fixed size hash table that cant be perfect for every
needs, let UDP stack choose its table size at boot time like tcp/ip
route, using alloc_large_system_hash() helper
Add an optional boot parameter, uhash_entries=x so that an admin can
force a size between 256 and 65536 if needed, like thash_entries and
rhash_entries.
dmesg logs two new lines :
[ 0.647039] UDP hash table entries: 512 (order: 0, 4096 bytes)
[ 0.647099] UDP Lite hash table entries: 512 (order: 0, 4096 bytes)
Maximal size on 64bit arches would be 65536 slots, ie 1 MBytes for non
debugging spinlocks.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no reason that cipso_v4_delopt() is not
defined as a static function.
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Function argument len was redeclarated within the
function. This patch fix the redeclaration of symbol 'len'.
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Might as well use the ipv6_addr_set_v4mapped() inline we created last
year.
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change ip6_route_redirect() to use ipv6_addr_copy().
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for route lookup using sk_mark on IPv4 listening sockets.
Signed-off-by: Atis Elsts <atis@mikrotik.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This continues the previous patch, by applying the same change to CCID-3.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes a redundancy in the CCID half-connection (hc) naming scheme:
* instead of 'hctx->tx_...', write 'hc->tx_...';
* instead of 'hcrx->rx_...', write 'hc->rx_...';
which works because the 'type' of the half-connection is encoded in the
'rx_' / 'tx_' prefixes.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This implements the new naming scheme also for CCID-3.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch starts a less problematic naming convention for CCID structs.
The old naming convention used 'hc{tx,rx}->ccid?hc{tx,rx}->...' as
recurring prefixes, which made the code
* hard to write (not easy to fit into 80 characters);
* hard to read (most of the space is occupied by prefixes).
The new naming scheme:
* struct entries for the TX socket are prefixed by 'tx_';
* and those for the RX socket are prefixed by 'rx_'.
The identifiers then remain distinguishable when grep-ing through the tree:
(a) RX/TX sockets are distinguished by the naming scheme,
(b) individual CCIDs are distinguished by filename (ccid{2,3,4}.{c,h}).
This first patch implements the scheme for CCID-2.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Everything including this header includes net/cfg80211.h, which
includes linux/netdevice.h, which includes linux/ethtool.h already. Why
slow-down the build, even a little bit?
Signed-off-by: John W. Linville <linville@tuxdriver.com>
It's useful to provide firmware and hardware version to user space and have a
generic interface to retrieve them. Users can provide the version information
in bug reports etc.
Add fields for firmware and hardware version to struct wiphy.
(Dropped nl80211 bits for now and modified remaining bits in favor of
ethtool. -- JWL)
Cc: Kalle Valo <kalle.valo@nokia.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Refactor wext to
* split out iwpriv handling
* split out iwspy handling
* split out procfs support
* allow cfg80211 to have wireless extensions compat code
w/o CONFIG_WIRELESS_EXT
After this, drivers need to
- select WIRELESS_EXT - for wext support
- select WEXT_PRIV - for iwpriv support
- select WEXT_SPY - for iwspy support
except cfg80211 -- which gets new hooks in wext-core.c
and can then get wext handlers without CONFIG_WIRELESS_EXT.
Wireless extensions procfs support is auto-selected
based on PROC_FS and anything that requires the wext core
(i.e. WIRELESS_EXT or CFG80211_WEXT).
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Linux keeps scan results up to 15 seconds. This can be a problem for fast
moving clients: they get back stale data. But if the kernel reports the age
of the BSS items, then user-space can simply weed out old entries by itself.
Signed-off-by: Holger Schurig <hs4233@mail.mn-solutions.de>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
kfree_skb() should be used to free struct sk_buff pointers.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Cc: stable@kernel.org
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When receiving data frames, we can send them only to
the interface they belong to based on transmitting
station (this doesn't work for probe requests). Also,
don't try to handle other frames for AP_VLAN at all
since those interface should only receive data.
Additionally, the transmit side must check that the
station we're sending a frame to is actually on the
interface we're transmitting on, and not transmit
packets to functions that live on other interfaces,
so validate that as well.
Another bug fix is needed in sta_info.c where in the
VLAN case when adding/removing stations we overwrite
the sdata variable we still need.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Cc: stable@kernel.org
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This fixes a bug with arp_notify.
If arp_notify is enabled, kernel will crash if address is changed
and no IP address is assigned.
http://bugzilla.kernel.org/show_bug.cgi?id=14330
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All usages of structure net_proto_ops should be declared const.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On Friday 02 October 2009 20:53:51 you wrote:
> This is good although I would have shortened the name.
Ah, I knew I forgot something :) Here is v4.
tavi
>From 24d96d825b9fa832b22878cc6c990d5711968734 Mon Sep 17 00:00:00 2001
From: Octavian Purdila <opurdila@ixiacom.com>
Date: Fri, 2 Oct 2009 00:51:15 +0300
Subject: [PATCH] ipv6: new sysctl for sending TLLAO with unicast NAs
Neighbor advertisements responding to unicast neighbor solicitations
did not include the target link-layer address option. This patch adds
a new sysctl option (disabled by default) which controls whether this
option should be sent even with unicast NAs.
The need for this arose because certain routers expect the TLLAO in
some situations even as a response to unicast NS packets.
Moreover, RFC 2461 recommends sending this to avoid a race condition
(section 4.4, Target link-layer address)
Signed-off-by: Cosmin Ratiu <cratiu@ixiacom.com>
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Atis Elsts wrote:
> Not sure if there is need to fill the mark from skb in tunnel xmit functions. In any case, it's not done for GRE or IPIP tunnels at the moment.
Ok, I'll just drop that part, I'm not sure what should be done in this case.
> Also, in this patch you are doing that for SIT (v6-in-v4) tunnels only, and not doing it for v4-in-v6 or v6-in-v6 tunnels. Any reason for that?
I just sent that patch out too quickly, here's a better one with the updates.
Add support for IPv6 route lookups using sk_mark.
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After updating firmware stored in flash, users may wish to reset the
relevant hardware and start the new firmware immediately. This should
not be completely automatic as it may be disruptive.
A selective reset may also be useful for debugging or diagnostics.
This adds a separate reset operation which takes flags indicating the
components to be reset. Drivers are allowed to reset only a subset of
those requested, and must indicate the actual subset. This allows the
use of generic component masks and some future expansion.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jarek Poplawski a écrit :
>
>
> Hmm... So you made me to do some "real" work here, and guess what?:
> there is one serious checkpatch warning! ;-) Plus, this new parameter
> should be added to the function description. Otherwise:
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
>
> Thanks,
> Jarek P.
>
> PS: I guess full "Don't" would show we really mean it...
Okay :) Here is the last round, before the night !
Thanks again
[RFC] pkt_sched: gen_estimator: Don't report fake rate estimators
We currently send TCA_STATS_RATE_EST elements to netlink users, even if no estimator
is running.
# tc -s -d qdisc
qdisc pfifo_fast 0: dev eth0 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
Sent 112833764978 bytes 1495081739 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0
User has no way to tell if the "rate 0bit 0pps" is a real estimation, or a fake
one (because no estimator is active)
After this patch, tc command output is :
$ tc -s -d qdisc
qdisc pfifo_fast 0: dev eth0 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
Sent 561075 bytes 1196 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
We add a parameter to gnet_stats_copy_rate_est() function so that
it can use gen_estimator_active(bstats, r), as suggested by Jarek.
This parameter can be NULL if check is not necessary, (htb for
example has a mandatory rate estimator)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here is a followup on this area, thanks.
[RFC] af_packet: fill skb->mark at xmit
skb->mark may be used by classifiers, so fill it in case user
set a SO_MARK option on socket.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 Rapid Deployment (6rd; draft-ietf-softwire-ipv6-6rd) builds upon
mechanisms of 6to4 (RFC3056) to enable a service provider to rapidly
deploy IPv6 unicast service to IPv4 sites to which it provides
customer premise equipment. Like 6to4, it utilizes stateless IPv6 in
IPv4 encapsulation in order to transit IPv4-only network
infrastructure. Unlike 6to4, a 6rd service provider uses an IPv6
prefix of its own in place of the fixed 6to4 prefix.
With this option enabled, the SIT driver offers 6rd functionality by
providing additional ioctl API to configure the IPv6 Prefix for in
stead of static 2002::/16 for 6to4.
Original patch was done by Alexandre Cassen <acassen@freebox.fr>
based on old Internet-Draft.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When routing daemon wants to enable forwarding of multicast traffic it
performs something like:
struct vifctl vc = {
.vifc_vifi = 1,
.vifc_flags = 0,
.vifc_threshold = 1,
.vifc_rate_limit = 0,
.vifc_lcl_addr = ip, /* <--- ip address of physical
interface, e.g. eth0 */
.vifc_rmt_addr.s_addr = htonl(INADDR_ANY),
};
setsockopt(fd, IPPROTO_IP, MRT_ADD_VIF, &vc, sizeof(vc));
This leads (in the kernel) to calling vif_add() function call which
search the (physical) device using assigned IP address:
dev = ip_dev_find(net, vifc->vifc_lcl_addr.s_addr);
The current API (struct vifctl) does not allow to specify an
interface other way than using it's IP, and if there are more than a
single interface with specified IP only the first one will be found.
The attached patch (against 2.6.30.4) allows to specify an interface
by its index, instead of IP address:
struct vifctl vc = {
.vifc_vifi = 1,
.vifc_flags = VIFF_USE_IFINDEX, /* NEW */
.vifc_threshold = 1,
.vifc_rate_limit = 0,
.vifc_lcl_ifindex = if_nametoindex("eth0"), /* NEW */
.vifc_rmt_addr.s_addr = htonl(INADDR_ANY),
};
setsockopt(fd, IPPROTO_IP, MRT_ADD_VIF, &vc, sizeof(vc));
Signed-off-by: Ilia K. <mail4ilia@gmail.com>
=== modified file 'include/linux/mroute.h'
Signed-off-by: David S. Miller <davem@davemloft.net>
An incoming datagram must bring into cpu cache *lot* of cache lines,
in particular : (other parts omitted (hash chains, ip route cache...))
On 32bit arches :
offsetof(struct sock, sk_rcvbuf) =0x30 (read)
offsetof(struct sock, sk_lock) =0x34 (rw)
offsetof(struct sock, sk_sleep) =0x50 (read)
offsetof(struct sock, sk_rmem_alloc) =0x64 (rw)
offsetof(struct sock, sk_receive_queue)=0x74 (rw)
offsetof(struct sock, sk_forward_alloc)=0x98 (rw)
offsetof(struct sock, sk_callback_lock)=0xcc (rw)
offsetof(struct sock, sk_drops) =0xd8 (read if we add dropcount support, rw if frame dropped)
offsetof(struct sock, sk_filter) =0xf8 (read)
offsetof(struct sock, sk_socket) =0x138 (read)
offsetof(struct sock, sk_data_ready) =0x15c (read)
We can avoid sk->sk_socket and socket->fasync_list referencing on sockets
with no fasync() structures. (socket->fasync_list ptr is probably already in cache
because it shares a cache line with socket->wait, ie location pointed by sk->sk_sleep)
This avoids one cache line load per incoming packet for common cases (no fasync())
We can leave (or even move in a future patch) sk->sk_socket in a cold location
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A number of drivers (recently including cfg80211-based ones)
assume that all wireless handlers, including statistics, can
sleep and they often also implicitly assume that the rtnl is
held around their invocation. This is almost always true now
except when reading from sysfs:
BUG: sleeping function called from invalid context at kernel/mutex.c:280
in_atomic(): 1, irqs_disabled(): 0, pid: 10450, name: head
2 locks held by head/10450:
#0: (&buffer->mutex){+.+.+.}, at: [<c10ceb99>] sysfs_read_file+0x24/0xf4
#1: (dev_base_lock){++.?..}, at: [<c12844ee>] wireless_show+0x1a/0x4c
Pid: 10450, comm: head Not tainted 2.6.32-rc3 #1
Call Trace:
[<c102301c>] __might_sleep+0xf0/0xf7
[<c1324355>] mutex_lock_nested+0x1a/0x33
[<f8cea53b>] wdev_lock+0xd/0xf [cfg80211]
[<f8cea58f>] cfg80211_wireless_stats+0x45/0x12d [cfg80211]
[<c13118d6>] get_wireless_stats+0x16/0x1c
[<c12844fe>] wireless_show+0x2a/0x4c
Fix this by using the rtnl instead of dev_base_lock.
Reported-by: Miles Lane <miles.lane@gmail.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch exports the link-speed (in Mbps) and duplex of an interface
via sysfs. This eliminates the need to use ethtool just to check the
link-speed. Not requiring 'ethtool' and not relying on the SIOCETHTOOL
ioctl should be helpful in an embedded environment where space is at a
premium as well.
NOTE: This patch also intentionally allows non-root users to check the link
speed and duplex -- something not possible with ethtool.
Here's some sample output:
# cat /sys/class/net/eth0/speed
100
# cat /sys/class/net/eth0/duplex
half
# ethtool eth0
Settings for eth0:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Half 1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: Not reported
Advertised auto-negotiation: No
Speed: 100Mb/s
Duplex: Half
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: off
Supports Wake-on: g
Wake-on: g
Current message level: 0x000000ff (255)
Link detected: yes
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of having to modify every non-mac80211 for device type assignment,
do this inside the netdev notifier callback of cfg80211. So all drivers
that integrate with cfg80211 will export a proper device type.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
For various purposes including a wireless extensions
bugfix, we need to hook into the netdev creation before
before netdev_register_kobject(). This will also ease
doing the dev type assignment that Marcel was working
on for cfg80211 drivers w/o touching them all.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently dirty a cache line to update tunnel device stats
(tx_packets/tx_bytes). We better use the txq->tx_bytes/tx_packets
counters that already are present in cpu cache, in the cache
line shared with txq->_xmit_lock
This patch extends IPTUNNEL_XMIT() macro to use txq pointer
provided by the caller.
Also &tunnel->dev->stats can be replaced by &dev->stats
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The FIB algorithim for IPV4 is set at compile time, but kernel goes through
the overhead of function call indirection at runtime. Save some
cycles by turning the indirect calls to direct calls to either
hash or trie code.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add Ancilliary data to better represent loss information
I've had a few requests recently to provide more detail regarding frame loss
during an AF_PACKET packet capture session. Specifically the requestors want to
see where in a packet sequence frames were lost, i.e. they want to see that 40
frames were lost between frames 302 and 303 in a packet capture file. In order
to do this we need:
1) The kernel to export this data to user space
2) The applications to make use of it
This patch addresses item (1). It does this by doing the following:
A) Anytime we drop a frame for which we would increment po->stats.tp_drops, we
also no increment a stats called po->stats.tp_gap.
B) Every time we successfully enqueue a frame to sk_receive_queue, we record the
value of po->stats.tp_gap in skb->mark. skb->cb would nominally be the place to
record this, but since all the space there is used up, we're overloading
skb->mark. Its safe to do since any enqueued packet is guaranteed to be
unshared at this point, and skb->mark isn't used for anything else in the rx
path to the application. After we record tp_gap in the skb, we zero
po->stats.tp_gap. This allows us to keep a counter of the number of frames lost
between any two enqueued packets
C) When the application goes to dequeue a frame from the packet socket, we look
at skb->mark for that frame. If it is non-zero, we add a cmsg chunk to the
msghdr of level SOL_PACKET and type PACKET_GAPDATA. Its a 32 bit integer that
represents the number of frames lost between this packet and the last previous
frame received.
Note there is a chance that if there is frame loss after a receive, and then the
socket is closed, some gap data might be lost. This is covered by the use of
the PACKET_AUXDATA socket option, which gives total loss data. With a bit of
math, the final gap can be determined that way.
I've tested this patch myself, and it works well.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
include/linux/if_packet.h | 2 ++
net/packet/af_packet.c | 33 +++++++++++++++++++++++++++++++++
2 files changed, 35 insertions(+)
Signed-off-by: David S. Miller <davem@davemloft.net>