Any socket recv of less than this ammount will not be offloaded
Signed-off-by: Chris Leech <christopher.leech@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an extra argument to sk_eat_skb, and make it move early copied
packets to the async_wait_queue instead of freeing them.
Signed-off-by: Chris Leech <christopher.leech@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Needed to be able to call tcp_cleanup_rbuf in tcp_input.c for I/OAT
Signed-off-by: Chris Leech <christopher.leech@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds an async_wait_queue and some additional fields to tcp_sock, and a
dma_cookie_t to sk_buff.
Signed-off-by: Chris Leech <christopher.leech@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Provides for pinning user space pages in memory, copying to iovecs,
and copying from sk_buffs including fragmented and chained sk_buffs.
Signed-off-by: Chris Leech <christopher.leech@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Attempts to allocate per-CPU DMA channels
Signed-off-by: Chris Leech <christopher.leech@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Aki M Nyrhinen <anyrhine@cs.helsinki.fi>
IMHO the current fix to the problem (in_flight underflow in reno)
is incorrect. it treats the symptons but ignores the problem. the
problem is timing out packets other than the head packet when we
don't have sack. i try to explain (sorry if explaining the obvious).
with sack, scanning the retransmit queue for timed out packets is
fine because we know which packets in our retransmit queue have been
acked by the receiver.
without sack, we know only how many packets in our retransmit queue the
receiver has acknowledged, but no idea which packets.
think of a "typical" slow-start overshoot case, where for example
every third packet in a window get lost because a router buffer gets
full.
with sack, we check for timeouts on those every third packet (as the
rest have been sacked). the packet counting works out and if there
is no reordering, we'll retransmit exactly the packets that were
lost.
without sack, however, we check for timeout on every packet and end up
retransmitting consecutive packets in the retransmit queue. in our
slow-start example, 2/3 of those retransmissions are unnecessary. these
unnecessary retransmissions eat the congestion window and evetually
prevent fast recovery from continuing, if enough packets were lost.
Signed-off-by: David S. Miller <davem@davemloft.net>
A soft lockup existed in the handling of ack vector records.
Specifically, when a tail of the list of ack vector records was
removed, it was possible to end up iterating infinitely on an element
of the tail.
Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are several bugs in error handling in br_add_bridge:
- when dev_alloc_name fails, allocated net_device is not freed
- unregister_netdev is called when rtnl lock is held
- free_netdev is called before netdev_run_todo has a chance to be run after
unregistering net_device
Signed-off-by: Jiri Benc <jbenc@suse.cz>
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The skb allocation may fail, which can result in a NULL pointer dereference
in irlap_queue_xmit().
Coverity CID: 434.
Signed-off-by: Florin Malita <fmalita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The /proc/sys/net/ethernet directory has been sitting empty for more than
10 years! Time to eliminate it!
Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Trimming the head of an skb by calling skb_pull can cause the packet
to become unaligned if the length pulled is odd. Since the length is
entirely arbitrary for a FIN packet carrying data, this is actually
quite common.
Unaligned data is not the end of the world, but we should avoid it if
it's easily done. In this case it is trivial. Since we're discarding
all of the head data it doesn't matter whether we move skb->data forward
or back.
However, it is still possible to have unaligned skb->data in general.
So network drivers should be prepared to handle it instead of crashing.
This patch also adds an unlikely marking on len < headlen since partial
ACKs on head data are extremely rare in the wild. As the return value
of __pskb_trim_head is no longer ever NULL that has been removed.
Signed-off-by: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When snd_cwnd is smaller than 38 and the connection is in
congestion avoidance phase (snd_cwnd > snd_ssthresh), the snd_cwnd
seems to stop growing.
The additive increase was confused because C array's are 0 based.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
It appears that sockaddr_in.sin_zero is not zeroed during
getsockopt(...SO_ORIGINAL_DST...) operation. This can lead
to an information leak (CVE-2006-1343).
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Noticed that dev_alloc_name() comment was incorrect, and more spellung
errors.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In addition to the real on-link routes, NONEXTHOP routes
should be considered on-link.
Problem reported by Meelis Roos <mroos@linux.ee>.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bridge will OOPS on removal if other application has the SAP open.
The bridge SAP might be shared with other usages, so need
to do reference counting on module removal rather than explicit
close/delete.
Since packet might arrive after or during removal, need to clear
the receive function handle, so LLC only hands it to user (if any).
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If kmalloc fails, error path leaks data allocated from asn1_oid_decode().
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When parsing unknown sequence extensions the "son"-pointer points behind
the last known extension for this type, don't try to interpret it.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The condition "> H323_ERROR_STOP" can never be true since H323_ERROR_STOP
is positive and is the highest possible return code, while real errors are
negative, fix the checks. Also only abort on real errors in some spots
that were just interpreting any return value != 0 as error.
Fixes crashes caused by use of stale data after a parsing error occured:
BUG: unable to handle kernel paging request at virtual address bfffffff
printing eip:
c01aa0f8
*pde = 1a801067
*pte = 00000000
Oops: 0000 [#1]
PREEMPT
Modules linked in: ip_nat_h323 ip_conntrack_h323 nfsd exportfs sch_sfq sch_red cls_fw sch_hfsc xt_length ipt_owner xt_MARK iptable_mangle nfs lockd sunrpc pppoe pppoxx
CPU: 0
EIP: 0060:[<c01aa0f8>] Not tainted VLI
EFLAGS: 00210646 (2.6.17-rc4 #8)
EIP is at memmove+0x19/0x22
eax: d77264e9 ebx: d77264e9 ecx: e88d9b17 edx: d77264e9
esi: bfffffff edi: bfffffff ebp: de6a7680 esp: c0349db8
ds: 007b es: 007b ss: 0068
Process asterisk (pid: 3765, threadinfo=c0349000 task=da068540)
Stack: <0>00000006 c0349e5e d77264e3 e09a2b4e e09a38a0 d7726052 d7726124 00000491
00000006 00000006 00000006 00000491 de6a7680 d772601e d7726032 c0349f74
e09a2dc2 00000006 c0349e5e 00000006 00000000 d76dda28 00000491 c0349f74
Call Trace:
[<e09a2b4e>] mangle_contents+0x62/0xfe [ip_nat]
[<e09a2dc2>] ip_nat_mangle_tcp_packet+0xa1/0x191 [ip_nat]
[<e0a2712d>] set_addr+0x74/0x14c [ip_nat_h323]
[<e0ad531e>] process_setup+0x11b/0x29e [ip_conntrack_h323]
[<e0ad534f>] process_setup+0x14c/0x29e [ip_conntrack_h323]
[<e0ad57bd>] process_q931+0x3c/0x142 [ip_conntrack_h323]
[<e0ad5dff>] q931_help+0xe0/0x144 [ip_conntrack_h323]
...
Found by the PROTOS c07-h2250v4 testsuite.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both cause the 'entries' count in the export cache to be non-zero at module
removal time, so unregistering that cache fails and results in an oops.
1/ exp_pseudoroot (used for NFSv4 only) leaks a reference to an export
entry.
2/ sunrpc_cache_update doesn't increment the entries count when it adds
an entry.
Thanks to "david m. richter" <richterd@citi.umich.edu> for triggering the
problem and finding one of the bugs.
Cc: "david m. richter" <richterd@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix memory corruption caused by snmp_trap_decode:
- When snmp_trap_decode fails before the id and address are allocated,
the pointers contain random memory, but are freed by the caller
(snmp_parse_mangle).
- When snmp_trap_decode fails after allocating just the ID, it tries
to free both address and ID, but the address pointer still contains
random memory. The caller frees both ID and random memory again.
- When snmp_trap_decode fails after allocating both, it frees both,
and the callers frees both again.
The corruption can be triggered remotely when the ip_nat_snmp_basic
module is loaded and traffic on port 161 or 162 is NATed.
Found by multiple testcases of the trap-app and trap-enc groups of the
PROTOS c06-snmpv1 testsuite.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Enable SO_LINGER functionality for 1-N style sockets. The socket API
draft will be clarfied to allow for this functionality. The linger
settings will apply to all associations on a given socket.
Signed-off-by: Vladislav Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
If SCTP receives a badly formatted HB-ACK chunk, it is possible
that we may access invalid memory and potentially have a buffer
overflow. We should really make sure that the chunk format is
what we expect, before attempting to touch the data.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
sctp_rcv().
The goal is to hold the ref on the association/endpoint throughout the
state-machine process. We accomplish like this:
/* ref on the assoc/ep is taken during lookup */
if owned_by_user(sk)
sctp_add_backlog(skb, sk);
else
inqueue_push(skb, sk);
/* drop the ref on the assoc/ep */
However, in sctp_add_backlog() we take the ref on assoc/ep and hold it
while the skb is on the backlog queue. This allows us to get rid of the
sock_hold/sock_put in the lookup routines.
Now sctp_backlog_rcv() needs to account for potential association move.
In the unlikely event that association moved, we need to retest if the
new socket is locked by user. If we don't this, we may have two packets
racing up the stack toward the same socket and we can't deal with it.
If the new socket is still locked, we'll just add the skb to its backlog
continuing to hold the ref on the association. This get's rid of the
need to move packets from one backlog to another and it also safe in
case new packets arrive on the same backlog queue.
The last step, is to lock the new socket when we are moving the
association to it. This is needed in case any new packets arrive on
the association when it moved. We want these to go to the backlog since
we would like to avoid the race between this new packet and a packet
that may be sitting on the backlog queue of the old socket toward the
same association.
Signed-off-by: Vladislav Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
flags is a u16, so use htons instead of htonl. Also avoid double
conversion.
Noticed by Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Solar Designer found a race condition in do_add_counters(). The beginning
of paddc is supposed to be the same as tmp which was sanity-checked
above, but it might not be the same in reality. In case the integer
overflow and/or the race condition are triggered, paddc->num_counters
might not match the allocation size for paddc. If the check below
(t->private->number != paddc->num_counters) nevertheless passes (perhaps
this requires the race condition to be triggered), IPT_ENTRY_ITERATE()
would read kernel memory beyond the allocation size, potentially causing
an oops or leaking sensitive data (e.g., passwords from host system or
from another VPS) via counter increments. This requires CAP_NET_ADMIN.
Signed-off-by: Solar Designer <solar@openwall.com>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
GRE keys are 16 bit.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The prefix argument for nf_log_packet is a format specifier,
so don't pass the user defined string directly to it.
Signed-off-by: Philip Craig <philipc@snapgear.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Coverity checker spotted that we may leak 'hold' in
net/ipv4/netfilter/ipt_recent.c::checkentry() when the following
is true:
if (!curr_table->status_proc) {
...
if(!curr_table) {
...
return 0; <-- here we leak.
Simply moving an existing vfree(hold); up a bit avoids the possible leak.
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: "Angelo P. Castellani" <angelo.castellani+lkml@gmail.com>
Using NewReno, if a sk_buff is timed out and is accounted as lost_out,
it should also be removed from the sacked_out.
This is necessary because recovery using NewReno fast retransmit could
take up to a lot RTTs and the sk_buff RTO can expire without actually
being really lost.
left_out = sacked_out + lost_out
in_flight = packets_out - left_out + retrans_out
Using NewReno without this patch, on very large network losses,
left_out becames bigger than packets_out + retrans_out (!!).
For this reason unsigned integer in_flight overflows to 2^32 - something.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes the unused EXPORT_SYMBOL(tr_source_route).
(Note, the usage in net/llc/llc_output.c can't be modular.)
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Casting BE16 to int and back may or may not work. Correct, to be sure.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A single caller passes __u32. Inside function "net" is compared with
__u32 (__be32 really, just wasn't annotated).
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a potential jiffy wraparound bug in the transmit watchdog
that is easily avoided by using time_after().
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The classical IP over ATM code maintains its own IPv4 <-> <ATM stuff>
ARP table, using the standard neighbour-table code. The
neigh_table_init function adds this neighbour table to a linked list
of all neighbor tables which is used by the functions neigh_delete()
neigh_add() and neightbl_set(), all called by the netlink code.
Once the ATM neighbour table is added to the list, there are two
tables with family == AF_INET there, and ARP entries sent via netlink
go into the first table with matching family. This is indeterminate
and often wrong.
To see the bug, on a kernel with CLIP enabled, create a standard IPv4
ARP entry by pinging an unused address on a local subnet. Then attempt
to complete that entry by doing
ip neigh replace <ip address> lladdr <some mac address> nud reachable
Looking at the ARP tables by using
ip neigh show
will reveal two ARP entries for the same address. One of these can be
found in /proc/net/arp, and the other in /proc/net/atm/arp.
This patch adds a new function, neigh_table_init_no_netlink() which
does everything the neigh_table_init() does, except add the table to
the netlink all-arp-tables chain. In addition neigh_table_init() has a
check that all tables on the chain have a distinct address family.
The init call in clip.c is changed to call
neigh_table_init_no_netlink().
Since ATM ARP tables are rather more complicated than can currently be
handled by the available rtattrs in the netlink protocol, no
functionality is lost by this patch, and non-ATM ARP manipulation via
netlink is rescued. A more complete solution would involve a rtattr
for ATM ARP entries and some way for the netlink code to give
neigh_add and friends more information than just address family with
which to find the correct ARP table.
[ I've changed the assertion checking in neigh_table_init() to not
use BUG_ON() while holding neigh_tbl_lock. Instead we remember that
we found an existing tbl with the same family, and after dropping
the lock we'll give a diagnostic kernel log message and a stack dump.
-DaveM ]
Signed-off-by: Simon Kelley <simon@thekelleys.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
When deleting the last child the level of a class should drop to zero.
Noticed by Andreas Mueller <andreas@stapelspeicher.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet6_csk_xit does not free skb when routing fails.
Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that netdevice sysfs registration is done as part of
register_netdevice; bridge code no longer has to be tricky when adding
it's kobjects to bridges.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The last step of netdevice registration was being done by a delayed
call, but because it was delayed, it was impossible to return any error
code if the class_device registration failed.
Side effects:
* one state in registration process is unnecessary.
* register_netdevice can sleep inside class_device registration/hotplug
* code in netdev_run_todo only does unregistration so it is simpler.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The test used in the linkwatch does not handle wrap-arounds correctly.
Since the intention of the code is to eliminate bursts of messages we
can afford to delay things up to a second. Using that fact we can
easily handle wrap-arounds by making sure that we don't delay things
by more than one second.
This is based on diagnosis and a patch by Stefan Rompf.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Stefan Rompf <stefan@loplof.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes the following unused EXPORT_SYMBOL's:
- irias_find_attrib
- irias_new_string_value
- irias_new_octseq_value
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Samuel Ortiz <samuel.ortiz@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Alan Stern <stern@rowland.harvard.edu>
This chain does it's own locking via the RTNL semaphore, and
can also run recursively so adding a new mutex here was causing
deadlocks.
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix error point to options in ip_options_fragment(). optptr get a
error pointer to the ipv4 header, correct is pointer to ipv4 options.
Signed-off-by: Wei Yongjun <weiyj@soft.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is another result from my likely profiling tool
(dwalker@mvista.com just sent the patch of the profiling tool to
linux-kernel mailing list, which is similar to what I use).
On my system (not very busy, normal development machine within a
VMWare workstation), I see a 6/5 miss/hit ratio for this "likely".
Signed-off-by: Hua Zhong <hzhong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Atomically create attributes when class device is added. This avoids
the race between registering class_device (which generates hotplug
event), and the creation of attribute groups.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xiaoliang (David) Wei wrote:
> Hi gurus,
>
> I am reading the code of tcp_highspeed.c in the kernel and have a
> question on the hstcp_cong_avoid function, specifically the following
> AI part (line 136~143 in net/ipv4/tcp_highspeed.c ):
>
> /* Do additive increase */
> if (tp->snd_cwnd < tp->snd_cwnd_clamp) {
> tp->snd_cwnd_cnt += ca->ai;
> if (tp->snd_cwnd_cnt >= tp->snd_cwnd) {
> tp->snd_cwnd++;
> tp->snd_cwnd_cnt -= tp->snd_cwnd;
> }
> }
>
> In this part, when (tp->snd_cwnd_cnt == tp->snd_cwnd),
> snd_cwnd_cnt will be -1... snd_cwnd_cnt is defined as u16, will this
> small chance of getting -1 becomes a problem?
> Shall we change it by reversing the order of the cwnd++ and cwnd_cnt -=
> cwnd?
Absolutely correct. Thanks.
Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are out of date and don't tell the user anything useful.
The similar messages which IPV4 and the core networking used
to output were killed a long time ago.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Calling sock_orphan inside bh_lock_sock in dccp_close can lead to dead
locks. For example, the inet_diag code holds sk_callback_lock without
disabling BH. If an inbound packet arrives during that admittedly tiny
window, it will cause a dead lock on bh_lock_sock. Another possible
path would be through sock_wfree if the network device driver frees the
tx skb in process context with BH enabled.
We can fix this by moving sock_orphan out of bh_lock_sock.
The tricky bit is to work out when we need to destroy the socket
ourselves and when it has already been destroyed by someone else.
By moving sock_orphan before the release_sock we can solve this
problem. This is because as long as we own the socket lock its
state cannot change.
So we simply record the socket state before the release_sock
and then check the state again after we regain the socket lock.
If the socket state has transitioned to DCCP_CLOSED in the time being,
we know that the socket has been destroyed. Otherwise the socket is
still ours to keep.
This problem was discoverd by Ingo Molnar using his lock validator.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
It makes sense to add this simple statistic to keep track of received
multicast packets.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Discard an unexpected chunk in CLOSED state rather can calling BUG().
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use pskb_pull() to handle incoming COOKIE_ECHO and HEARTBEAT chunks that
are received as skb's with fragment list.
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a rare situation that causes lksctp to go into infinite recursion
and crash the system. The trigger is a packet that contains at least the
first two DATA fragments of a message bundled together. The recursion is
triggered when the user data buffer is smaller that the full data message.
The problem is that we clone the skb for every fragment in the message.
When reassembling the full message, we try to link skbs from the "first
fragment" clone using the frag_list. However, since the frag_list is shared
between two clones in this rare situation, we end up setting the frag_list
pointer of the second fragment to point to itself. This causes
sctp_skb_pull() to potentially recurse indefinitely.
Proposed solution is to make a copy of the skb when attempting to link
things using frag_list.
Signed-off-by: Vladislav Yasevich <vladsilav.yasevich@hp.com>
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a deadlock situation in the receive path by allowing
temporary spillover of the receive buffer.
- If the chunk we receive has a tsn that immediately follows the ctsn,
accept it even if we run out of receive buffer space and renege data with
higher TSNs.
- Once we accept one chunk in a packet, accept all the remaining chunks
even if we run out of receive buffer space.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Mark Butler <butlerm@middle.net>
Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
zd1211 with softmac and wpa_supplicant revealed an issue with softmac
and the use of workqueues. Some of the work functions actually
reschedule themselves, so this meant that there could still be
pending work after flush_scheduled_work() had been called during
ieee80211softmac_stop().
This patch introduces a "running" flag which is used to ensure that
rescheduling does not happen in this situation.
I also used this flag to ensure that softmac's hooks into ieee80211 are
non-operational once the stop operation has been started. This simply
makes softmac a little more robust, because I could crash it easily
by receiving frames in the short timeframe after shutting down softmac
and before turning off the ZD1211 radio. (ZD1211 is now fixed as well!)
Signed-off-by: Daniel Drake <dsd@gentoo.org>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When wpa_supplicant exits, it uses SIOCSIWMLME to request
deauthentication. softmac then tries to reassociate without any user
intervention, which isn't the desired behaviour of this signal.
This change makes softmac only attempt reassociation if the remote
network itself deauthenticated us.
Signed-off-by: Daniel Drake <dsd@gentoo.org>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This patch fixes hello messages sent when a node is a level 1
router. Slightly contrary to the spec (maybe) VMS ignores hello
messages that do not name level2 routers that it also knows about.
So, here we simply name all the routers that the node knows about
rather just other level1 routers. (I hope the patch is clearer than
the description. sorry).
Signed-off-by: Patrick Caulfield <patrick@tykepenguin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Calling sock_orphan inside bh_lock_sock in tcp_close can lead to dead
locks. For example, the inet_diag code holds sk_callback_lock without
disabling BH. If an inbound packet arrives during that admittedly tiny
window, it will cause a dead lock on bh_lock_sock. Another possible
path would be through sock_wfree if the network device driver frees the
tx skb in process context with BH enabled.
We can fix this by moving sock_orphan out of bh_lock_sock.
The tricky bit is to work out when we need to destroy the socket
ourselves and when it has already been destroyed by someone else.
By moving sock_orphan before the release_sock we can solve this
problem. This is because as long as we own the socket lock its
state cannot change.
So we simply record the socket state before the release_sock
and then check the state again after we regain the socket lock.
If the socket state has transitioned to TCP_CLOSE in the time being,
we know that the socket has been destroyed. Otherwise the socket is
still ours to keep.
Note that I've also moved the increment on the orphan count forward.
This may look like a problem as we're increasing it even if the socket
is just about to be destroyed where it'll be decreased again. However,
this simply enlarges a window that already exists. This also changes
the orphan count test by one.
Considering what the orphan count is meant to do this is no big deal.
This problem was discoverd by Ingo Molnar using his lock validator.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert all ROSE sysctl time values from jiffies to ms as units.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert all NET/ROM sysctl time values from jiffies to ms as units.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert all AX.25 sysctl time values from jiffies to ms as units.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The locking rule for rose_remove_neigh() are that the caller needs to
hold rose_neigh_list_lock, so we better don't take it yet again in
rose_neigh_list_lock.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move AX.25 symbol exports to next to their definitions where they're
supposed to be these days.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jing Min Zhao <zhaojingmin@users.sourceforge.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/ipv4/netfilter/ip_nat_standalone.c: In function 'ip_nat_out':
net/ipv4/netfilter/ip_nat_standalone.c:223: warning: unused variable 'ctinfo'
net/ipv4/netfilter/ip_nat_standalone.c:222: warning: unused variable 'ct'
Surprisingly no complaints so far ..
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a Choice element contains an unsupported choice no error is returned
and parsing continues normally, but the choice value is not set and
contains data from the last parsed message. This may in turn lead to
parsing of more stale data and following crashes.
Fixes a crash triggered by testcase 0003243 from the PROTOS c07-h2250v4
testsuite following random other testcases:
CPU: 0
EIP: 0060:[<c01a9554>] Not tainted VLI
EFLAGS: 00210646 (2.6.17-rc2 #3)
EIP is at memmove+0x19/0x22
eax: d7be0307 ebx: d7be0307 ecx: e841fcf9 edx: d7be0307
esi: bfffffff edi: bfffffff ebp: da5eb980 esp: c0347e2c
ds: 007b es: 007b ss: 0068
Process events/0 (pid: 4, threadinfo=c0347000 task=dff86a90)
Stack: <0>00000006 c0347ea6 d7be0301 e09a6b2c 00000006 da5eb980 d7be003e d7be0052
c0347f6c e09a6d9c 00000006 c0347ea6 00000006 00000000 d7b9a548 00000000
c0347f6c d7b9a548 00000004 e0a1a119 0000028f 00000006 c0347ea6 00000006
Call Trace:
[<e09a6b2c>] mangle_contents+0x40/0xd8 [ip_nat]
[<e09a6d9c>] ip_nat_mangle_tcp_packet+0xa1/0x191 [ip_nat]
[<e0a1a119>] set_addr+0x60/0x14d [ip_nat_h323]
[<e0ab6e66>] q931_help+0x2da/0x71a [ip_conntrack_h323]
[<e0ab6e98>] q931_help+0x30c/0x71a [ip_conntrack_h323]
[<e09af242>] ip_conntrack_help+0x22/0x2f [ip_conntrack]
[<c022934a>] nf_iterate+0x2e/0x5f
[<c025d357>] xfrm4_output_finish+0x0/0x39f
[<c02294ce>] nf_hook_slow+0x42/0xb0
[<c025d357>] xfrm4_output_finish+0x0/0x39f
[<c025d732>] xfrm4_output+0x3c/0x4e
[<c025d357>] xfrm4_output_finish+0x0/0x39f
[<c0230370>] ip_forward+0x1c2/0x1fa
[<c022f417>] ip_rcv+0x388/0x3b5
[<c02188f9>] netif_receive_skb+0x2bc/0x2ec
[<c0218994>] process_backlog+0x6b/0xd0
[<c021675a>] net_rx_action+0x4b/0xb7
[<c0115606>] __do_softirq+0x35/0x7d
[<c0104294>] do_softirq+0x38/0x3f
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the TPKT len included in the packet is below the lowest valid value
of 4 an underflow occurs which results in an endless loop.
Found by testcase 0000058 from the PROTOS c07-h2250v4 testsuite.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
fix infinite loop in the SCTP-netfilter code: check SCTP chunk size to
guarantee progress of for_each_sctp_chunk(). (all other uses of
for_each_sctp_chunk() are preceded by do_basic_checks(), so this fix
should be complete.)
Based on patch from Ingo Molnar <mingo@elte.hu>
CVE-2006-1527
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* 'audit.b10' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
[PATCH] Audit Filter Performance
[PATCH] Rework of IPC auditing
[PATCH] More user space subject labels
[PATCH] Reworked patch for labels on user space messages
[PATCH] change lspp ipc auditing
[PATCH] audit inode patch
[PATCH] support for context based audit filtering, part 2
[PATCH] support for context based audit filtering
[PATCH] no need to wank with task_lock() and pinning task down in audit_syscall_exit()
[PATCH] drop task argument of audit_syscall_{entry,exit}
[PATCH] drop gfp_mask in audit_log_exit()
[PATCH] move call of audit_free() into do_exit()
[PATCH] sockaddr patch
[PATCH] deal with deadlocks in audit_free()
When iptables userspace adds an ipt_standard_target, it calculates the size
of the entire entry as:
sizeof(struct ipt_entry) + XT_ALIGN(sizeof(struct ipt_standard_target))
ipt_standard_target looks like this:
struct xt_standard_target
{
struct xt_entry_target target;
int verdict;
};
xt_entry_target contains a pointer, so when compiled for 64 bit the
structure gets an extra 4 byte of padding at the end. On 32 bit
architectures where iptables aligns to 8 byte it will also have 4
byte padding at the end because it is only 36 bytes large.
The compat_ipt_standard_fn in the kernel adjusts the offsets by
sizeof(struct ipt_standard_target) - sizeof(struct compat_ipt_standard_target),
which will always result in 4, even if the structure from userspace
was already padded to a multiple of 8. On x86 this works out by
accident because userspace only aligns to 4, on all other
architectures this is broken and causes incorrect adjustments to
the size and following offsets.
Thanks to Linus for lots of debugging help and testing.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The below patch should be applied after the inode and ipc sid patches.
This patch is a reworking of Tim's patch that has been updated to match
the inode and ipc patches since its similar.
[updated:
> Stephen Smalley also wanted to change a variable from isec to tsec in the
> user sid patch. ]
Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
On Thursday 23 March 2006 09:08, John D. Ramsdell wrote:
> I noticed that a socketcall(bind) and socketcall(connect) event contain a
> record of type=SOCKADDR, but I cannot see one for a system call event
> associated with socketcall(accept). Recording the sockaddr of an accepted
> socket is important for cross platform information flow analys
Thanks for pointing this out. The following patch should address this.
Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We eliminated rt6_dflt_lock (to protect default router pointer)
at 2.6.17-rc1, and introduced rt6_select() for general router selection.
The function is called in the context of rt6_lock read-lock held,
but this means, we have some race conditions when we do round-robin.
Signed-off-by; YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
xfrm_policy_afinfo_lock can be taken in bh context, at:
[<c013fe1a>] lockdep_acquire_read+0x54/0x6d
[<c0f6e024>] _read_lock+0x15/0x22
[<c0e8fcdb>] xfrm_policy_get_afinfo+0x1a/0x3d
[<c0e8fd10>] xfrm_decode_session+0x12/0x32
[<c0e66094>] ip_route_me_harder+0x1c9/0x25b
[<c0e770d3>] ip_nat_local_fn+0x94/0xad
[<c0e2bbc8>] nf_iterate+0x2e/0x7a
[<c0e2bc50>] nf_hook_slow+0x3c/0x9e
[<c0e3a342>] ip_push_pending_frames+0x2de/0x3a7
[<c0e53e19>] icmp_push_reply+0x136/0x141
[<c0e543fb>] icmp_reply+0x118/0x1a0
[<c0e54581>] icmp_echo+0x44/0x46
[<c0e53fad>] icmp_rcv+0x111/0x138
[<c0e36764>] ip_local_deliver+0x150/0x1f9
[<c0e36be2>] ip_rcv+0x3d5/0x413
[<c0df760f>] netif_receive_skb+0x337/0x356
[<c0df76c3>] process_backlog+0x95/0x110
[<c0df5fe2>] net_rx_action+0xa5/0x16d
[<c012d8a7>] __do_softirq+0x6f/0xe6
[<c0105ec2>] do_softirq+0x52/0xb1
this means that all write-locking of xfrm_policy_afinfo_lock must be
bh-safe. This patch fixes xfrm_policy_register_afinfo() and
xfrm_policy_unregister_afinfo().
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
xfrm_state_afinfo_lock can be read-locked from bh context, so take it
in a bh-safe manner in xfrm_state_register_afinfo() and
xfrm_state_unregister_afinfo(). Found by the lock validator.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following unlikely should be replaced by likely because the
condition happens every time unless there is a hard error to transmit
a packet.
Signed-off-by: Hua Zhong <hzhong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
xfrm typemap->lock may be used in softirq context, so all write_lock()
uses must be softirq-safe.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
I was looking through the xfrm input/output code in order to abstract
out the address family specific encapsulation/decapsulation code. During
that process I found this bug in the IP ID selection code in xfrm4_output.c.
At that point dst is still the xfrm_dst for the current SA which
represents an internal flow as far as the IPsec tunnel is concerned.
Since the IP ID is going to sit on the outside of the encapsulated
packet, we obviously want the external flow which is just dst->child.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert inet_init to an fs_initcall to make sure its called before any
device driver's initcall.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
1 line removal, of unused macro.
ran 'egrep -r' from linux-2.6.16/ for Nprintk and
didn't see it anywhere else but here, in #define...
Signed-off-by: Soyoung Park <speattle@yahoo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following one line fix is needed to make loss function of
netem work right when doing loss on the local host.
Otherwise, higher layers just recover.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the sk_timer function x25_heartbeat_expiry() is called by the
kernel in a running/terminating process, spinlock-recursion and
spinlock-lockup locks up the kernel. This has happened with testing
on some distro's and the patch below fixed it.
Signed-off-by: Shaun Pereira <spereira@tusc.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6:
[PATCH] forcedeth: fix initialization
[PATCH] sky2: version 1.2
[PATCH] sky2: reset function can be devinit
[PATCH] sky2: use ALIGN() macro
[PATCH] sky2: add fake idle irq timer
[PATCH] sky2: reschedule if irq still pending
[PATCH] bcm43xx: make PIO mode usable
[PATCH] bcm43xx: add to MAINTAINERS
[PATCH] softmac: fix SIOCSIWAP
[PATCH] Fix crash on big-endian systems during scan
e1000: Update truesize with the length of the packet for packet split
[PATCH] Fix locking in gianfar
Need to allow for VLAN header when bridging.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The targets don't do the basic verification themselves anymore so
the ipt action needs to take care of it.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
xt_table->lock should be initialized before xt_replace_table() call, which
uses it. This patch removes strict requirement that table should define
lock before registering.
Signed-off-by: Dmitry Mishin <dim@openvz.org>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>