Commit graph

399 commits

Author SHA1 Message Date
Harald Welte
37d2e7a20d [NETFILTER] nfnetlink: unconditionally require CAP_NET_ADMIN
This patch unconditionally requires CAP_NET_ADMIN for all nfnetlink
messages.  It also removes the per-message cap_required field, since all
existing subsystems use CAP_NET_ADMIN for all their messages anyway.

Patrick McHardy owes me a beer if we ever need to re-introduce this.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-14 15:24:59 -08:00
Pablo Neira Ayuso
5655820852 [NETFILTER] ctnetlink: More thorough size checking of attributes
Add missing size checks. Thanks Patrick McHardy for the hint.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-14 15:22:11 -08:00
Pablo Neira Ayuso
dbd36ea496 [NETFILTER] ctnetlink: use size_t to make gcc-4.x happy
Make gcc-4.x happy. Use size_t instead of int. Thanks to Patrick McHardy
for the hint.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-14 15:21:01 -08:00
Vlad Drukker
a2d7222f0f [NETFILTER] {ip,nf}_conntrack TCP: Accept SYN+PUSH like SYN
Some devices (e.g. Qlogic iSCSI HBA hardware like QLA4010 up to firmware
3.0.0.4) initiates TCP with SYN and PUSH flags set.

The Linux TCP/IP stack deals fine with that, but the connection tracking
code doesn't.

This patch alters TCP connection tracking to accept SYN+PUSH as a valid
flag combination.

Signed-off-by: Vlad Drukker <vlad@storewiz.com>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-12 12:13:14 -08:00
Jeff Garzik
c050970a25 [PATCH] TCP: fix vegas build
Recent TCP changes broke the build.

Signed-off-by: Jeff Garzik <jgarzik@pobox.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-11 09:21:28 -08:00
Stephen Hemminger
6a438bbe68 [TCP]: speed up SACK processing
Use "hints" to speed up the SACK processing. Various forms 
of this have been used by TCP developers (Web100, STCP, BIC)
to avoid the 2x linear search of outstanding segments.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10 17:14:59 -08:00
Stephen Hemminger
caa20d9abe [TCP]: spelling fixes
Minor spelling fixes for TCP code.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10 17:13:47 -08:00
John Heffner
326f36e9e7 [TCP]: receive buffer growth limiting with mixed MTU
This is a patch for discussion addressing some receive buffer growing issues.
This is partially related to the thread "Possible BUG in IPv4 TCP window
handling..." last week.

Specifically it addresses the problem of an interaction between rcvbuf
moderation (receiver autotuning) and rcv_ssthresh.  The problem occurs when
sending small packets to a receiver with a larger MTU.  (A very common case I
have is a host with a 1500 byte MTU sending to a host with a 9k MTU.)  In
such a case, the rcv_ssthresh code is targeting a window size corresponding
to filling up the current rcvbuf, not taking into account that the new rcvbuf
moderation may increase the rcvbuf size.

One hunk makes rcv_ssthresh use tcp_rmem[2] as the size target rather than
rcvbuf.  The other changes the behavior when it overflows its memory bounds
with in-order data so that it tries to grow rcvbuf (the same as with
out-of-order data).

These changes should help my problem of mixed MTUs, and should also help the
case from last week's thread I think.  (In both cases though you still need
tcp_rmem[2] to be set much larger than the TCP window.)  One question is if
this is too aggressive at trying to increase rcvbuf if it's under memory
stress.

Orignally-from: John Heffner <jheffner@psc.edu>
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10 17:11:48 -08:00
Stephen Hemminger
9772efb970 [TCP]: Appropriate Byte Count support
This is an updated version of the RFC3465 ABC patch originally
for Linux 2.6.11-rc4 by Yee-Ting Li. ABC is a way of counting
bytes ack'd rather than packets when updating congestion control.

The orignal ABC described in the RFC applied to a Reno style
algorithm. For advanced congestion control there is little
change after leaving slow start.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10 17:09:53 -08:00
Stephen Hemminger
7faffa1c7f [TCP]: add tcp_slow_start helper
Move all the code that does linear TCP slowstart to one
inline function to ease later patch to add ABC support.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10 17:07:24 -08:00
Stephen Hemminger
2d2abbab63 [TCP]: simplify microsecond rtt sampling
Simplify the code that comuputes microsecond rtt estimate used
by TCP Vegas. Move the callback out of the RTT sampler and into
the end of the ack cleanup.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10 16:56:12 -08:00
Stephen Hemminger
f4805eded7 [TCP]: fix congestion window update when using TSO deferal
TCP peformance with TSO over networks with delay is awful.
On a 100Mbit link with 150ms delay, we get 4Mbits/sec with TSO and
50Mbits/sec without TSO.

The problem is with TSO, we intentionally do not keep the maximum
number of packets in flight to fill the window, we hold out to until 
we can send a MSS chunk. But, we also don't update the congestion window 
unless we have filled, as per RFC2861.

This patch replaces the check for the congestion window being full
with something smarter that accounts for TSO.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10 16:53:30 -08:00
Herbert Xu
fb286bb299 [NET]: Detect hardware rx checksum faults correctly
Here is the patch that introduces the generic skb_checksum_complete
which also checks for hardware RX checksum faults.  If that happens,
it'll call netdev_rx_csum_fault which currently prints out a stack
trace with the device name.  In future it can turn off RX checksum.

I've converted every spot under net/ that does RX checksum checks to
use skb_checksum_complete or __skb_checksum_complete with the
exceptions of:

* Those places where checksums are done bit by bit.  These will call
netdev_rx_csum_fault directly.

* The following have not been completely checked/converted:

ipmr
ip_vs
netfilter
dccp

This patch is based on patches and suggestions from Stephen Hemminger
and David S. Miller.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10 13:01:24 -08:00
Thomas Graf
a8f74b2288 [NETLINK]: Make netlink_callback->done() optional
Most netlink families make no use of the done() callback, making
it optional gets rid of all unnecessary dummy implementations.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-10 02:26:40 +01:00
Yasuyuki Kozakai
9fb9cbb108 [NETFILTER]: Add nf_conntrack subsystem.
The existing connection tracking subsystem in netfilter can only
handle ipv4.  There were basically two choices present to add
connection tracking support for ipv6.  We could either duplicate all
of the ipv4 connection tracking code into an ipv6 counterpart, or (the
choice taken by these patches) we could design a generic layer that
could handle both ipv4 and ipv6 and thus requiring only one sub-protocol
(TCP, UDP, etc.) connection tracking helper module to be written.

In fact nf_conntrack is capable of working with any layer 3
protocol.

The existing ipv4 specific conntrack code could also not deal
with the pecularities of doing connection tracking on ipv6,
which is also cured here.  For example, these issues include:

1) ICMPv6 handling, which is used for neighbour discovery in
   ipv6 thus some messages such as these should not participate
   in connection tracking since effectively they are like ARP
   messages

2) fragmentation must be handled differently in ipv6, because
   the simplistic "defrag, connection track and NAT, refrag"
   (which the existing ipv4 connection tracking does) approach simply
   isn't feasible in ipv6

3) ipv6 extension header parsing must occur at the correct spots
   before and after connection tracking decisions, and there were
   no provisions for this in the existing connection tracking
   design

4) ipv6 has no need for stateful NAT

The ipv4 specific conntrack layer is kept around, until all of
the ipv4 specific conntrack helpers are ported over to nf_conntrack
and it is feature complete.  Once that occurs, the old conntrack
stuff will get placed into the feature-removal-schedule and we will
fully kill it off 6 months later.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-11-09 16:38:16 -08:00
Krzysztof Piotr Oledzki
5fd52fe098 [NETFILTER] ctnetlink: ICMP_ID is u_int16_t not u_int8_t.
Signed-off-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 13:04:32 -08:00
Krzysztof Piotr Oledzki
439a9994bb [NETFILTER] ctnetlink: Fix oops when no ICMP ID info in message
This patch fixes an userspace triggered oops. If there is no ICMP_ID
info the reference to attr will be NULL.

Signed-off-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 13:04:08 -08:00
Pablo Neira Ayuso
a856a19a9f [NETFILTER] ctnetlink: Add support to identify expectations by ID's
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 13:03:42 -08:00
Pablo Neira Ayuso
fcda46128d [NETFILTER] ctnetlink: propagate error instaed of returning -EPERM
Propagate the error to userspace instead of returning -EPERM if the get
conntrack operation fails.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 13:03:26 -08:00
Pablo Neira Ayuso
fe902a91ff [NETFILTER] ctnetlink: return -EINVAL if size is wrong
Return -EINVAL if the size isn't OK instead of -EPERM.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 13:03:09 -08:00
Yasuyuki Kozakai
d63a928108 [NETFILTER]: stop tracking ICMP error at early point
Currently connection tracking handles ICMP error like normal packets
if it failed to get related connection. But it fails that after all.

This makes connection tracking stop tracking ICMP error at early point.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 13:02:45 -08:00
Philip Craig
5978a9b82c [NETFILTER] PPTP helper: fix PNS-PAC expectation call id
The reply tuple of the PNS->PAC expectation was using the wrong call id.

So we had the following situation:
- PNS behind NAT firewall
- PNS call id requires NATing
- PNS->PAC gre packet arrives first

then the PNS->PAC expectation is matched, and the other expectation
is deleted, but the PAC->PNS gre packets do not match the gre conntrack
because the call id is wrong.

We also cannot use ip_nat_follow_master().

Signed-off-by: Philip Craig <philipc@snapgear.com>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 13:01:53 -08:00
Pablo Neira Ayuso
81e5c27d08 [NETFILTER] ctnetlink: get_conntrack can use GFP_KERNEL
ctnetlink_get_conntrack is always called from user context, so GFP_KERNEL
is enough.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 13:01:19 -08:00
Pablo Neira Ayuso
7a4fe3664b [NETFILTER] ctnetlink: kill unused includes
Kill some useless headers included in ctnetlink. They aren't used in any
way.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 13:00:47 -08:00
Pablo Neira Ayuso
119a318494 [NETFILTER] ctnetlink: add module alias to fix autoloading
Add missing module alias. This is a must to load ctnetlink on demand. For
example, the conntrack tool will fail if the module isn't loaded.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 13:00:29 -08:00
Pablo Neira Ayuso
02a78cdf42 [NETFILTER] ctnetlink: add marking support from userspace
This patch adds support for conntrack marking from user space.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 13:00:04 -08:00
Pablo Neira Ayuso
51df784ed7 [NETFILTER] ctnetlink: check if protoinfo is present
This fixes an oops triggered from userspace. If we don't pass information
about the private protocol info, the reference to attr will be NULL. This is
likely to happen in update messages.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 12:59:41 -08:00
Harald Welte
a2506c0432 [NETFILTER] nfnetlink: nfattr_parse() can never fail, make it void
nfattr_parse (and thus nfattr_parse_nested) always returns success. So we
can make them 'void' and remove all the checking at the caller side.

Based on original patch by Pablo Neira Ayuso <pablo@netfilter.org>

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 12:59:13 -08:00
Yasuyuki Kozakai
eaae4fa45e [NETFILTER]: refcount leak of proto when ctnetlink dumping tuple
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 12:58:46 -08:00
Yasuyuki Kozakai
46998f59c0 [NETFILTER]: packet counter of conntrack is 32bits
The packet counter variable of conntrack was changed to 32bits from 64bits.
This follows that change.
		    
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-09 12:58:05 -08:00
Herbert Xu
89f5f0aeed [IPV4]: Fix ip_queue_xmit identity increment for TSO packets
When ip_queue_xmit calls ip_select_ident_more for IP identity selection
it gives it the wrong packet count for TSO packets.  The ip_select_*
functions expect one less than the number of packets, so we need to
subtract one for TSO packets.

This bug was diagnosed and fixed by Tom Young.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-08 09:41:56 -08:00
Jesper Juhl
a51482bde2 [NET]: kfree cleanup
From: Jesper Juhl <jesper.juhl@gmail.com>

This is the net/ part of the big kfree cleanup patch.

Remove pointless checks for NULL prior to calling kfree() in net/.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Arnaldo Carvalho de Melo <acme@conectiva.com.br>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
2005-11-08 09:41:34 -08:00
Julian Anastasov
dc8103f25f [IPVS]: fix connection leak if expire_nodest_conn=1
There was a fix in 2.6.13 that changed the behaviour of
ip_vs_conn_expire_now function not to put reference to connection,
its callers should hold write lock or connection refcnt. But we
forgot to convert one caller, when the real server for connection
is unavailable caller should put the connection reference. It
happens only when sysctl var expire_nodest_conn is set to 1 and
such connections never expire. Thanks to Roberto Nibali who found
the problem and tested a 2.4.32-rc2 patch, which is equal to this
2.6 version. Patch for 2.4 is already sent to Marcelo.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Roberto Nibali <ratz@drugphish.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-08 09:40:05 -08:00
Stephen Hemminger
6df716340d [TCP/DCCP]: Randomize port selection
This patch randomizes the port selected on bind() for connections
to help with possible security attacks. It should also be faster
in most cases because there is no need for a global lock.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-11-05 21:23:15 -02:00
Harald Welte
433a4d3b54 [NETFILTER]: CONNMARK target needs ip_conntrack
There's a missing dependency from the CONNMARK target to ip_conntrack.

Signed-off-by: Pablo Neira Ayuso <pablo@eurodev.net>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-11-05 16:39:20 -02:00
Harald Welte
0f81eb4db4 [NETFILTER]: Fix double free after netlink_unicast() in ctnetlink
It's not necessary to free skb if netlink_unicast() failed.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-11-05 03:28:37 -02:00
Harald Welte
d2a7bb7141 [NETFILTER] NAT: Fix module refcount dropping too far
The unknown protocol is used as a fallback when a protocol isn't known.
Hence we cannot handle it failing, so don't set ".me".  It's OK, since we
only grab a reference from within the same module (iptable_nat.ko), so we
never take the module refcount from 0 to 1.

Also, remove the "protocol is NULL" test: it's never NULL.

Signed-off-by: Rusty Rusty <rusty@rustcorp.com.au>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-11-05 01:23:34 -02:00
Harald Welte
d811552eda [NETFILTER] PPTP helper: Fix endianness bug in GRE key / CallID NAT
This endianness bug slipped through while changing the 'gre.key' field in the
conntrack tuple from 32bit to 16bit.

None of my tests caught the problem, since the linux pptp client always has
'0' as call id / gre key.  Only windows clients actually trigger the bug.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-11-04 23:19:17 -02:00
Harald Welte
3428c209c6 [NETFILTER] PPTP helper: Fix compilation of conntrack helper without NAT
This patch fixes compilation of the PPTP conntrack helper when NAT is
configured off.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-11-04 23:02:53 -02:00
Stephen Hemminger
450b5b1898 [TCP]: BIC max increment too large
The max growth of BIC TCP is too large. Original code was based on
BIC 1.0 and the default there was 32. Later code (2.6.13) included
compensation for delayed acks, and should have reduced the default
value to 16; since normally TCP gets one ack for every two packets sent.

The current value of 32 makes BIC too aggressive and unfair to other
flows.

Submitted-by: Injong Rhee <rhee@eos.ncsu.edu>
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Acked-by: Ian McDonald <imcdnzl@gmail.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-11-02 21:24:01 -02:00
Yan Zheng
8713dbf057 [MCAST]: ip[6]_mc_add_src should be called when number of sources is zero
And filter mode is exclude.

Further explanation by David Stevens:

Multicast source filters aren't widely used yet, and that's really the only
feature that's affected if an application actually exercises this bug, as far
as I can tell. An ordinary filter-less multicast join should still work, and
only forwarded multicast traffic making use of filters and doing empty-source
filters with the MSFILTER ioctl would be at risk of not getting multicast
traffic forwarded to them because the reports generated would not be based on
the correct counts.

Signed-off-by: Yan Zheng <yanzheng@21cn.com
Acked-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-11-02 21:03:57 -02:00
Harald Welte
6b7d31fcdd [NETFILTER]: Add "revision" support to arp_tables and ip6_tables
Like ip_tables already has it for some time, this adds support for
having multiple revisions for each match/target.  We steal one byte from
the name in order to accomodate a 8 bit version number.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-31 16:36:08 -02:00
Jean Delvare
3fa63c7d82 [PATCH] Typo fix: dot after newline in printk strings
Typo fix: dots appearing after a newline in printk strings.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30 17:37:20 -08:00
Jayachandran C
9fcc2e8a75 [IPV4]: Fix issue reported by Coverity in ipv4/fib_frontend.c
fib_del_ifaddr() dereferences ifa->ifa_dev, so the code already assumes that
ifa->ifa_dev is non-NULL, the check is unnecessary.

Signed-off-by: Jayachandran C. <c.jayachandran at gmail.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-29 02:53:39 -02:00
Ananda Raju
e89e9cf539 [IPv4/IPv6]: UFO Scatter-gather approach
Attached is kernel patch for UDP Fragmentation Offload (UFO) feature.

1. This patch incorporate the review comments by Jeff Garzik.
2. Renamed USO as UFO (UDP Fragmentation Offload)
3. udp sendfile support with UFO

This patches uses scatter-gather feature of skb to generate large UDP
datagram. Below is a "how-to" on changes required in network device
driver to use the UFO interface.

UDP Fragmentation Offload (UFO) Interface:
-------------------------------------------
UFO is a feature wherein the Linux kernel network stack will offload the
IP fragmentation functionality of large UDP datagram to hardware. This
will reduce the overhead of stack in fragmenting the large UDP datagram to
MTU sized packets

1) Drivers indicate their capability of UFO using
dev->features |= NETIF_F_UFO | NETIF_F_HW_CSUM | NETIF_F_SG

NETIF_F_HW_CSUM is required for UFO over ipv6.

2) UFO packet will be submitted for transmission using driver xmit routine.
UFO packet will have a non-zero value for

"skb_shinfo(skb)->ufo_size"

skb_shinfo(skb)->ufo_size will indicate the length of data part in each IP
fragment going out of the adapter after IP fragmentation by hardware.

skb->data will contain MAC/IP/UDP header and skb_shinfo(skb)->frags[]
contains the data payload. The skb->ip_summed will be set to CHECKSUM_HW
indicating that hardware has to do checksum calculation. Hardware should
compute the UDP checksum of complete datagram and also ip header checksum of
each fragmented IP packet.

For IPV6 the UFO provides the fragment identification-id in
skb_shinfo(skb)->ip6_frag_id. The adapter should use this ID for generating
IPv6 fragments.

Signed-off-by: Ananda Raju <ananda.raju@neterion.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (forwarded)
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-28 16:30:00 -02:00
Linus Torvalds
236fa08168 Merge master.kernel.org:/pub/scm/linux/kernel/git/acme/net-2.6.15 2005-10-28 08:50:37 -07:00
Herbert Xu
2ad41065d9 [TCP]: Clear stale pred_flags when snd_wnd changes
This bug is responsible for causing the infamous "Treason uncloaked"
messages that's been popping up everywhere since the printk was added.
It has usually been blamed on foreign operating systems.  However,
some of those reports implicate Linux as both systems are running
Linux or the TCP connection is going across the loopback interface.

In fact, there really is a bug in the Linux TCP header prediction code
that's been there since at least 2.1.8.  This bug was tracked down with
help from Dale Blount.

The effect of this bug ranges from harmless "Treason uncloaked"
messages to hung/aborted TCP connections.  The details of the bug
and fix is as follows.

When snd_wnd is updated, we only update pred_flags if
tcp_fast_path_check succeeds.  When it fails (for example,
when our rcvbuf is used up), we will leave pred_flags with
an out-of-date snd_wnd value.

When the out-of-date pred_flags happens to match the next incoming
packet we will again hit the fast path and use the current snd_wnd
which will be wrong.

In the case of the treason messages, it just happens that the snd_wnd
cached in pred_flags is zero while tp->snd_wnd is non-zero.  Therefore
when a zero-window packet comes in we incorrectly conclude that the
window is non-zero.

In fact if the peer continues to send us zero-window pure ACKs we
will continue making the same mistake.  It's only when the peer
transmits a zero-window packet with data attached that we get a
chance to snap out of it.  This is what triggers the treason
message at the next retransmit timeout.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-27 15:11:04 -02:00
David Engel
dcab5e1eec [IPV4]: Fix setting broadcast for SIOCSIFNETMASK
Fix setting of the broadcast address when the netmask is set via
SIOCSIFNETMASK in Linux 2.6.  The code wanted the old value of
ifa->ifa_mask but used it after it had already been overwritten with
the new value.

Signed-off-by: David Engel <gigem@comcast.net>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-26 01:20:21 -02:00
Jayachandran C
0d0d2bba97 [IPV4]: Remove dead code from ip_output.c
skb_prev is assigned from skb, which cannot be NULL. This patch removes the
unnecessary NULL check.

Signed-off-by: Jayachandran C. <c.jayachandran at gmail.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-26 00:58:54 -02:00
Herbert Xu
1371e37da2 [IPV4]: Kill redundant rcu_dereference on fa_info
This patch kills a redundant rcu_dereference on fa->fa_info in fib_trie.c.
As this dereference directly follows a list_for_each_entry_rcu line, we
have already taken a read barrier with respect to getting an entry from
the list.

This read barrier guarantees that all values read out of fa are valid.
In particular, the contents of structure pointed to by fa->fa_info is
initialised before fa->fa_info is actually set (see fn_trie_insert);
the setting of fa->fa_info itself is further separated with a write
barrier from the insertion of fa into the list.

Therefore by taking a read barrier after obtaining fa from the list
(which is given by list_for_each_entry_rcu), we can be sure that
fa->fa_info contains a valid pointer, as well as the fact that the
data pointed to by fa->fa_info is itself valid.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-26 00:25:03 -02:00