Commit graph

915 commits

Author SHA1 Message Date
Pavel Emelyanov
b8e1f9b5c3 [NET] sysctl: make sysctl_somaxconn per-namespace
Just move the variable on the struct net and adjust
its usage.

Others sysctls from sys.net.core table are more
difficult to virtualize (i.e. make them per-namespace),
but I'll look at them as well a bit later.

Signed-off-by: Pavel Emelyanov <xemul@oenvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:56:57 -08:00
Pavel Emelyanov
790a353289 [NET] sysctl: prepare core tables to point to netns variables
Some of ctl variables are going to be on the struct
net. Here's the way to adjust the ->data pointer on the
ctl_table-s to point on the right variable.

Since some pointers still point on the global variables,
I keep turning the write bits off on such tables.

This looks to become a common procedure for net sysctls,
so later parts of this code may migrate to some more
generic place.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:56:56 -08:00
Pavel Emelyanov
024626e36d [NET] sysctl: make the sys.net.core sysctls per-namespace
Making them per-namespace is required for the following
two reasons:

 First, some ctl values have a per-namespace meaning.
 Second, making them writable from the sub-namespace
 is an isolation hole.

So I introduce the pernet operations to create these
tables. For init_net I use the existing statically
declared tables, for sub-namespace they are duplicated
and the write bits are removed from the mode.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:56:56 -08:00
Denis Cheng
3b5b34fd2b [NET] net/core/dev.c: use LIST_HEAD instead of LIST_HEAD_INIT
single list_head variable initialized with LIST_HEAD_INIT could almost
always can be replaced with LIST_HEAD declaration, this shrinks the code
and looks better.

Signed-off-by: Denis Cheng <crquan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:56:51 -08:00
Daniel Lezcano
9eb87f3f7e [IPV6]: Make fib6_rules_init to return an error code.
When the fib_rules initialization finished, no return code is provided
so there is no way to know, for the caller, if the initialization has
been successful or has failed. This patch fix that.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:56:46 -08:00
Denis V. Lunev
5a3e55d68e [NET]: Multiple namespaces in the all dst_ifdown routines.
Move dst entries to a namespace loopback to catch refcounting leaks.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:56:44 -08:00
Herbert Xu
a59322be07 [UDP]: Only increment counter on first peek/recv
The previous move of the the UDP inDatagrams counter caused each
peek of the same packet to be counted separately.  This may be
undesirable.

This patch fixes this by adding a bit to sk_buff to record whether
this packet has already been seen through skb_recv_datagram.  We
then only increment the counter when the packet is seen for the
first time.

The only dodgy part is the fact that skb_recv_datagram doesn't have
a good way of returning this new bit of information.  So I've added
a new function __skb_recv_datagram that does return this and made
skb_recv_datagram a wrapper around it.

The plan is to eventually replace all uses of skb_recv_datagram with
this new function at which time it can be renamed its proper name.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:56:34 -08:00
Herbert Xu
27ab256864 [UDP]: Avoid repeated counting of checksum errors due to peeking
Currently it is possible for two processes to peek on the same socket
and end up incrementing the error counter twice for the same packet.

This patch fixes it by making skb_kill_datagram return whether it
succeeded in unlinking the packet and only incrementing the counter
if it did.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:56:32 -08:00
Pavel Emelyanov
33eb9cfc70 [NET]: Isolate the net/core/ sysctl table
Using ctl paths we can put all the stuff, related to net/core/
sysctl table, into one file and remove all the references on it.

As a good side effect this hides the "core_table" name from
the global scope :)

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:56:26 -08:00
Pavel Emelyanov
7e2e109cef [NET]: Remove unneeded ifdefs from sysctl_net_core.c
This file is already compiled out when the SYSCTL=n, so
these ifdefs, that enclose the whole file, can be removed.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:56:25 -08:00
Pavel Emelyanov
c3bac5a71b [NEIGH]: Use the ctl paths to create neighbours sysctls
The appropriate path is prepared right inside this function. It
is prepared similar to how the ctl tables were.

Since the path is modified, it is put on the stack, to avoid
possible races with multiple calls to neigh_sysctl_register() : it
is called by protocols and I didn't find any protection in this
case. Did I overlooked the rtnl lock?.

The stack growth of the neigh_sysctl_register() is 40 bytes. I
believe this is OK, since this is not that much and this function
is not called with the deep stack (device/protocols register).

The device's name is stored on the template to free it later.

This will help with the net namespaces, as each namespace should
have its own set of these ctls.

Besides, this saves ~350 bytes from the neigh template :)

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:55:24 -08:00
Pavel Emelyanov
3c607bbb47 [NEIGH]: Cleanup the neigh_sysctl_register
This mainly removes the err variable, as this call always
return the same error code (-ENOBUFS).

Besides, I moved the call to kmalloc() from the *t declaration
into the code (this is confusing when a variable is initialized
with the result of some call) and removed unneeded comment near
the error path.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:55:24 -08:00
Patrick McHardy
a99a00cf1a [NET]: Move netfilter checksum helpers to net/core/utils.c
This allows to get rid of the CONFIG_NETFILTER dependency of NET_ACT_NAT.
This patch redefines the old names to keep the noise low, the next patch
converts all users.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:55:14 -08:00
Pavel Emelyanov
df1b86c53d [NET]: Nicer WARN_ON in netstat_show
The

        if (statement)
                WARN_ON(1);

looks much better as

        WARN_ON(statement);

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:55:10 -08:00
Pavel Emelyanov
82d8a867ff [NET]: Make macro to specify the ptype_base size
Currently this size is 16, but as the comment says this
is so only because all the chains (except one) has the
length 1. I think, that some day this may change, so
growing this hash will be much easier.

Besides, symbolic names are read better than magic constants.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:55:04 -08:00
Pavel Emelyanov
8d8ad9d7c4 [NET]: Name magic constants in sock_wake_async()
The sock_wake_async() performs a bit different actions
depending on "how" argument. Unfortunately this argument
ony has numerical magic values.

I propose to give names to their constants to help people
reading this function callers understand what's going on
without looking into this function all the time.

I suppose this is 2.6.25 material, but if it's not (or the
naming seems poor/bad/awful), I can rework it against the
current net-2.6 tree.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:55:03 -08:00
Arnaldo Carvalho de Melo
ebb53d7565 [NET] proto: Use pcounters for the inuse field
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:40 -08:00
Pavel Emelyanov
9859a79023 [NET]: Compact sk_stream_mem_schedule() code
This function references sk->sk_prot->xxx for many times.
It turned out, that there's so many code in it, that gcc
cannot always optimize access to sk->sk_prot's fields.

After saving the sk->sk_prot on the stack and comparing
disassembled code, it turned out that the function became
~10 bytes shorter and made less dereferences (on i386 and
x86_64). Stack consumption didn't grow.

Besides, this patch drives most of this function into the
80 columns limit.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:36 -08:00
Benjamin Thery
3ef1355dcb [NET]: Make netns cleanup to run in a separate queue
This patch adds a separate workqueue for cleaning up a network
namespace. If we use the keventd workqueue to execute cleanup_net(),
there is a problem to unregister devices in IPv6. Indeed the code
that cleans up also schedule work in keventd: as long as cleanup_net()
hasn't return, dst_gc_task() cannot run and as long as dst_gc_task() has
not run, there are still some references pending on the net devices and
cleanup_net() can not unregister and exit the keventd workqueue.

Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-by: Denis V. Lunev <den@openvz.org>
Acked-By: Kirill Korotaev <dev@sw.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:35 -08:00
Adrian Bunk
02d45827fa [NET] net/core/request_sock.c: Remove unused exports.
This patch removes the following unused EXPORT_SYMBOL's:
- reqsk_queue_alloc
- __reqsk_queue_destroy
- reqsk_queue_destroy

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:33 -08:00
Denis V. Lunev
e372c41401 [NET]: Consolidate net namespace related proc files creation.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:28 -08:00
Eric W. Biederman
4b3da706bb [NET]: Make the netlink methods in rtnetlink handle multiple network namespaces
After the previous prep work this just consists of removing checks
limiting the code to work in the initial network namespace, and
updating rtmsg_ifinfo so we can generate events for devices in
something other then the initial network namespace.

Referring to network other network devices like the IFLA_LINK
and IFLA_MASTER attributes do, gets interesting if those network
devices happen to be in other network namespaces.  Currently
ifindex numbers are allocated globally so I have taken the path
of least resistance and not still report the information even
though the devices they are talking about are invisible.

If applications start getting confused or when ifindex
numbers become local to the network namespace we may need
to do something different in the future.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Denis V. Lunev <den@openz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:26 -08:00
Denis V. Lunev
97c53cacf0 [NET]: Make rtnetlink infrastructure network namespace aware (v3)
After this patch none of the netlink callback support anything
except the initial network namespace but the rtnetlink infrastructure
now handles multiple network namespaces.

Changes from v2:
- IPv6 addrlabel processing

Changes from v1:
- no need for special rtnl_unlock handling
- fixed IPv6 ndisc

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:25 -08:00
Denis V. Lunev
b854272b3c [NET]: Modify all rtnetlink methods to only work in the initial namespace (v2)
Before I can enable rtnetlink to work in all network namespaces I need
to be certain that something won't break.  So this patch deliberately
disables all of the rtnletlink methods in everything except the
initial network namespace.  After the methods have been audited this
extra check can be disabled.

Changes from v1:
- added IPv6 addrlabel protection

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2008-01-28 14:54:24 -08:00
Stephen Hemminger
c7b6ea24b4 [NETPOLL]: Don't need rx_flags.
The rx_flags variable is redundant. Turning rx on/off is done
via setting the rx_np pointer.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:18 -08:00
Stephen Hemminger
33f807ba0d [NETPOLL]: Kill NETPOLL_RX_DROP, set but never tested.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:18 -08:00
Stephen Hemminger
0953864160 [NETPOLL]: no need to store local_mac
The local_mac is managed by the network device, no need to keep a
spare copy and all the management problems that could cause.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:17 -08:00
Stephen Hemminger
5106930bd6 [NETPOLL]: netpoll_poll() cleanup
Restructure code slightly to improve readability:
  * dereference device once
  * change obvious while() loop
  * let poll_napi() handle null list itself

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:16 -08:00
Stephen Hemminger
0adc9add77 [NETPOLL]: Use skb_queue_purge().
Use standard routine for flushing queue.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:16 -08:00
Oliver Hartkopp
cd05acfe65 [CAN]: Allocate protocol numbers for PF_CAN
This patch adds a protocol/address family number, ARP hardware type,
ethernet packet type, and a line discipline number for the SocketCAN
implementation.

Signed-off-by: Oliver Hartkopp <oliver.hartkopp@volkswagen.de>
Signed-off-by: Urs Thuermann <urs.thuermann@volkswagen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:09 -08:00
Pavel Emelyanov
c0ef877b2c [NET]: Move sock_valbool_flag to socket.c
The sock_valbool_flag() helper is used in setsockopt to
set or reset some flag on the sock. This helper is required
in the net/socket.c only, so move it there.

Besides, patch two places in sys_setsockopt() that repeat
this helper functionality manually.

Since this is not a bugfix, but a trivial cleanup, I
prepared this patch against net-2.6.25, but it also
applies (with a single offset) to the latest net-2.6.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:54:00 -08:00
Herbert Xu
352e512c32 [NET]: Eliminate duplicate copies of dst_discard
We have a number of copies of dst_discard scattered around the place
which all do the same thing, namely free a packet on the input or
output paths.

This patch deletes all of them except dst_discard and points all the
users to it.

The only non-trivial bit is decnet where it returns an error.
However, conceptually this is identical to the blackhole functions
used in IPv4 and IPv6 which do not return errors.  So they should
either all return errors or all return zero.  For now I've stuck with
the majority and picked zero as the return value.

It doesn't really matter in practice since few if any driver would
react differently depending on a zero return value or NET_RX_DROP.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:37 -08:00
Pavel Emelyanov
b24b8a247f [NET]: Convert init_timer into setup_timer
Many-many code in the kernel initialized the timer->function
and  timer->data together with calling init_timer(timer). There
is already a helper for this. Use it for networking code.

The patch is HUGE, but makes the code 130 lines shorter
(98 insertions(+), 228 deletions(-)).

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:35 -08:00
Wang Chen
33c732c361 [IPV4]: Add raw drops counter.
Add raw drops counter for IPv4 in /proc/net/raw .

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:33 -08:00
Jens Axboe
9c55e01c0c [TCP]: Splice receive support.
Support for network splice receive.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:53:31 -08:00
Gautham R Shenoy
86ef5c9a8e cpu-hotplug: replace lock_cpu_hotplug() with get_online_cpus()
Replace all lock_cpu_hotplug/unlock_cpu_hotplug from the kernel and use
get_online_cpus and put_online_cpus instead as it highlights the
refcount semantics in these operations.

The new API guarantees protection against the cpu-hotplug operation, but
it doesn't guarantee serialized access to any of the local data
structures. Hence the changes needs to be reviewed.

In case of pseries_add_processor/pseries_remove_processor, use
cpu_maps_update_begin()/cpu_maps_update_done() as we're modifying the
cpu_present_map there.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:02 +01:00
Denis V. Lunev
ff4b950277 [NETNS]: Re-export init_net via EXPORT_SYMBOL.
init_net is used added as a parameter to a lot of old API calls, f.e.
ip_dev_find. These calls were exported as EXPORT_SYMBOL. So, export init_net
as EXPORT_SYMBOL to keep networking API consistent.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-23 03:11:42 -08:00
Patrick McHardy
68365458a4 [NET]: rtnl_link: fix use-after-free
When unregistering the rtnl_link_ops, all existing devices using
the ops are destroyed. With nested devices this may lead to a
use-after-free despite the use of for_each_netdev_safe() in case
the upper device is next in the device list and is destroyed
by the NETDEV_UNREGISTER notifier.

The easy fix is to restart scanning the device list after removing
a device. Alternatively we could add new devices to the front of
the list to avoid having dependant devices follow the device they
depend on. A third option would be to only restart scanning if
dev->iflink of the next device matches dev->ifindex of the current
one. For now this seems like the safest solution.

With this patch, the veth rtnl_link_ops unregistration can use
rtnl_link_unregister() directly since it now also handles destruction
of multiple devices at once.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-20 20:31:45 -08:00
David S. Miller
cecbb63967 [NEIGH]: Revert 'Fix race between neigh_parms_release and neightbl_fill_parms'
Commit 9cd4002942 (Fix race between
neigh_parms_release and neightbl_fill_parms) introduced device
reference counting regressions for several people, see:

	http://bugzilla.kernel.org/show_bug.cgi?id=9778

for example.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-20 20:31:42 -08:00
Pavel Emelyanov
9cd4002942 [NEIGH]: Fix race between neigh_parms_release and neightbl_fill_parms
The neightbl_fill_parms() is called under the write-locked tbl->lock
and accesses the parms->dev. The negh_parm_release() calls the
dev_put(parms->dev) without this lock. This creates a tiny race window
on which the parms contains potentially stale dev pointer.

To fix this race it's enough to move the dev_put() upper under the
tbl->lock, but note, that the parms are held by neighbors and thus can
live after the neigh_parms_release() is called, so we still can have a
parm with bad dev pointer.

I didn't find where the neigh->parms->dev is accessed, but still think
that putting the dev is to be done in a place, where the parms are
really freed. Am I right with that?

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-10 03:48:38 -08:00
Paul Moore
02f1c89d6e [NET]: Clone the sk_buff 'iif' field in __skb_clone()
Both NetLabel and SELinux (other LSMs may grow to use it as well) rely
on the 'iif' field to determine the receiving network interface of
inbound packets.  Unfortunately, at present this field is not
preserved across a skb clone operation which can lead to garbage
values if the cloned skb is sent back through the network stack.  This
patch corrects this problem by properly copying the 'iif' field in
__skb_clone() and removing the 'iif' field assignment from
skb_act_clone() since it is no longer needed.

Also, while we are here, put the assignments in the same order as the
offsets to reduce cacheline bounces.

Signed-off-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-08 23:30:17 -08:00
David S. Miller
fed17f3094 [NET]: Stop polling when napi_disable() is pending.
This finally adds the code in net_rx_action() to break out of the
->poll()'ing loop when a napi_disable() is found to be pending.

Now, even if a device is being flooded with packets it can be cleanly
brought down.

Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-08 23:30:13 -08:00
Wei Yongjun
1ac70e7ad2 [NET]: Fix function put_cmsg() which may cause usr application memory overflow
When used function put_cmsg() to copy kernel information to user 
application memory, if the memory length given by user application is 
not enough, by the bad length calculate of msg.msg_controllen, 
put_cmsg() function may cause the msg.msg_controllen to be a large 
value, such as 0xFFFFFFF0, so the following put_cmsg() can also write 
data to usr application memory even usr has no valid memory to store 
this. This may cause usr application memory overflow.

int put_cmsg(struct msghdr * msg, int level, int type, int len, void *data)
{
    struct cmsghdr __user *cm
        = (__force struct cmsghdr __user *)msg->msg_control;
    struct cmsghdr cmhdr;
    int cmlen = CMSG_LEN(len);
    ~~~~~~~~~~~~~~~~~~~~~
    int err;

    if (MSG_CMSG_COMPAT & msg->msg_flags)
        return put_cmsg_compat(msg, level, type, len, data);

    if (cm==NULL || msg->msg_controllen < sizeof(*cm)) {
        msg->msg_flags |= MSG_CTRUNC;
        return 0; /* XXX: return error? check spec. */
    }
    if (msg->msg_controllen < cmlen) {
    ~~~~~~~~~~~~~~~~~~~~~~~~
        msg->msg_flags |= MSG_CTRUNC;
        cmlen = msg->msg_controllen;
    }
    cmhdr.cmsg_level = level;
    cmhdr.cmsg_type = type;
    cmhdr.cmsg_len = cmlen;

    err = -EFAULT;
    if (copy_to_user(cm, &cmhdr, sizeof cmhdr))
        goto out;
    if (copy_to_user(CMSG_DATA(cm), data, cmlen - sizeof(struct cmsghdr)))
        goto out;
    cmlen = CMSG_SPACE(len);
~~~~~~~~~~~~~~~~~~~~~~~~~~~
    If MSG_CTRUNC flags is set, msg->msg_controllen is less than 
CMSG_SPACE(len), "msg->msg_controllen -= cmlen" will cause unsinged int 
type msg->msg_controllen to be a large value.
~~~~~~~~~~~~~~~~~~~~~~~~~~~
    msg->msg_control += cmlen;
    msg->msg_controllen -= cmlen;
    ~~~~~~~~~~~~~~~~~~~~~
    err = 0;
out:
    return err;
}

The same promble exists in put_cmsg_compat(). This patch can fix this 
problem.

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-12-20 14:36:44 -08:00
Joe Perches
53ccaae1ef [NET] net/core/: Spelling fixes
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-12-20 14:02:06 -08:00
Wang Chen
d59b54b150 [NET]: Fix wrong comments for unregister_net*
There are some return value comments for void functions.
Fixed it.

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-12-11 02:45:32 -08:00
Herbert Xu
2d4baff8da [SKBUFF]: Free old skb properly in skb_morph
The skb_morph function only freed the data part of the dst skb, but leaked
the auxiliary data such as the netfilter fields.  This patch fixes this by
moving the relevant parts from __kfree_skb to skb_release_all and calling
it in skb_morph.

It also makes kfree_skbmem static since it's no longer called anywhere else
and it now no longer does skb_release_data.

Thanks to Yasuyuki KOZAKAI for finding this problem and posting a patch for
it.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2007-11-26 23:11:19 +08:00
Pavel Emelyanov
1f8170b0ec [PKTGEN]: Fix double unlock of xfrm_state->lock
The pktgen_output_ipsec() function can unlock this lock twice
due to merged error and plain paths. Remove one of the calls
to spin_unlock.

Other possible solution would be to place "return 0" right 
after the first unlock, but at this place the err is known 
to be 0, so these solutions are the same except for this one
makes the code shorter.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-19 22:51:24 -08:00
Pavel Emelyanov
dab6ba3688 [INET]: Fix potential kfree on vmalloc-ed area of request_sock_queue
The request_sock_queue's listen_opt is either vmalloc-ed or
kmalloc-ed depending on the number of table entries. Thus it 
is expected to be handled properly on free, which is done in 
the reqsk_queue_destroy().

However the error path in inet_csk_listen_start() calls 
the lite version of reqsk_queue_destroy, called 
__reqsk_queue_destroy, which calls the kfree unconditionally. 

Fix this and move the __reqsk_queue_destroy into a .c file as 
it looks too big to be inline.

As David also noticed, this is an error recovery path only,
so no locking is required and the lopt is known to be not NULL.

reqsk_queue_yank_listen_sk is also now only used in
net/core/request_sock.c so we should move it there too.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-15 02:57:06 -08:00
Pavel Emelyanov
c67625a1ec [NET]: Remove notifier block from chain when register_netdevice_notifier fails
Commit fcc5a03ac4:

	[NET]: Allow netdev REGISTER/CHANGENAME events to fail

makes the register_netdevice_notifier() handle the error from the
NETDEV_REGISTER event, sent to the registering block.

The bad news is that in this case the notifier block is 
not removed from the list, but the error is returned to the 
caller. In case the caller is in module init function and 
handles this error this can abort the module loading. The
notifier block will be then removed from the kernel, but 
will be left in the list. Oops :(

I think that the notifier block should be removed from the
chain in case of error, regardless whether this error is 
handled by the caller or not. In the worst case (the error 
is _not_ handled) module will not receive the events any 
longer.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-14 15:53:16 -08:00
Denis V. Lunev
022cbae611 [NET]: Move unneeded data to initdata section.
This patch reverts Eric's commit 2b008b0a8e

It diets .text & .data section of the kernel if CONFIG_NET_NS is not set.
This is safe after list operations cleanup.

Signed-of-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-13 03:23:50 -08:00
Denis V. Lunev
ed160e839d [NET]: Cleanup pernet operation without CONFIG_NET_NS
If CONFIG_NET_NS is not set, the only namespace is possible.

This patch removes list of pernet_operations and cleanups code a bit.
This list is not needed if there are no namespaces. We should just call
->init method.

Additionally, the ->exit will be called on module unloading only. This
case is safe - the code is not discarded. For the in/kernel code, ->exit
should never be called.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-13 03:23:21 -08:00
Adrian Bunk
6aed42159d [NET]: Unexport sysctl_{r,w}mem_max.
sysctl_{r,w}mem_max can now be unexported.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-12 21:24:14 -08:00
Denis V. Lunev
2994c63863 [INET]: Small possible memory leak in FIB rules
This patch fixes a small memory leak. Default fib rules can be deleted by
the user if the rule does not carry FIB_RULE_PERMANENT flag, f.e. by
	ip rule flush

Such a rule will not be freed as the ref-counter has 2 on start and becomes
clearly unreachable after removal.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 22:12:03 -08:00
Alexey Dobriyan
33d36bb83c [NETNS]: init dev_base_lock only once
* it already statically initialized
* reinitializing live global spinlock every time netns is
  setup is also wrong

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 22:09:25 -08:00
Joe Perches
e9671fcb3b [NET]: Fix infinite loop in dev_mc_unsync().
From: Joe Perches <joe@perches.com>

Based upon an initial patch and report by Luis R. Rodriguez.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-10 21:36:04 -08:00
Pavel Emelyanov
b733c007ed [NET]: Clean proto_(un)register from in-code ifdefs
The struct proto has the per-cpu "inuse" counter, which is handled
with a special care. All the handling code hides under the ifdef
CONFIG_SMP and it introduces some code duplication and makes it
look worse than it could.

Clean this.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:15:04 -08:00
Johann Felix Soden
45a19b0a72 [NETNS]: Fix compiler error in net_namespace.c
Because net_free is called by copy_net_ns before its declaration, the
compiler gives an error. This patch puts net_free before copy_net_ns
to fix this.

The compiler error:
net/core/net_namespace.c: In function 'copy_net_ns':
net/core/net_namespace.c:97: error: implicit declaration of function 'net_free'
net/core/net_namespace.c: At top level:
net/core/net_namespace.c:104: warning: conflicting types for 'net_free'
net/core/net_namespace.c:104: error: static declaration of 'net_free' follows non-static declaration
net/core/net_namespace.c:97: error: previous implicit declaration of 'net_free' was here

The error was introduced by the '[NET]: Hide the dead code in the
net_namespace.c' patch (6a1a3b9f68).

Signed-off-by: Johann Felix Soden <johfel@users.sourceforge.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:15:02 -08:00
Jiri Olsa
40208d71e0 [NET]: Removing duplicit #includes
Removing duplicit #includes for net/

Signed-off-by: Jiri Olsa <olsajiri@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:11:44 -08:00
Eric Dumazet
286ab3d460 [NET]: Define infrastructure to keep 'inuse' changes in an efficent SMP/NUMA way.
"struct proto" currently uses an array stats[NR_CPUS] to track change on
'inuse' sockets per protocol.

If NR_CPUS is big, this means we use a big memory area for this.
Moreover, all this memory area is located on a single node on NUMA
machines, increasing memory pressure on the boot node.

In this patch, I tried to :

- Keep a fast !CONFIG_SMP implementation
- Keep a fast CONFIG_SMP implementation for often used protocols
(tcp,udp,raw,...)
- Introduce a NUMA efficient implementation

Some helper macros are defined in include/net/sock.h
These macros take into account CONFIG_SMP

If a "struct proto" is declared without using DEFINE_PROTO_INUSE /
REF_PROTO_INUSE
macros, it will automatically use a default implementation, using a
dynamically allocated percpu zone.
This default implementation will be NUMA efficient, but might use 32/64
bytes per possible cpu
because of current alloc_percpu() implementation.
However it still should be better than previous implementation based on
stats[NR_CPUS] field.

When a "struct proto" is changed to use the new macros, we use a single
static "int" percpu variable,
lowering the memory and cpu costs, still preserving NUMA efficiency.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:08:57 -08:00
Alexey Dobriyan
3f192b5c58 [NET]: Remove /proc/net/stat/*_arp_cache upon module removal
neigh_table_init_no_netlink() creates them, but they aren't removed anywhere.

Steps to reproduce:

	modprobe clip
	rmmod clip
	cat /proc/net/stat/clip_arp_cache

BUG: unable to handle kernel paging request at virtual address f89d7758
printing eip: c05a99da *pdpt = 0000000000004001 *pde = 0000000004408067 *pte = 0000000000000000
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: atm af_packet ipv6 binfmt_misc sbs sbshc fan dock battery backlight ac power_supply parport loop rtc_cmos rtc_core rtc_lib serio_raw button k8temp hwmon amd_rng sr_mod cdrom shpchp pci_hotplug ehci_hcd ohci_hcd uhci_hcd usbcore
Pid: 2082, comm: cat Not tainted (2.6.24-rc1-b1d08ac064268d0ae2281e98bf5e82627e0f0c56-bloat #4)
EIP: 0060:[<c05a99da>] EFLAGS: 00210256 CPU: 0
EIP is at neigh_stat_seq_next+0x26/0x3f
EAX: 00000001 EBX: f89d7600 ECX: c587bf40 EDX: 00000000
ESI: 00000000 EDI: 00000001 EBP: 00000400 ESP: c587bf1c
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process cat (pid: 2082, ti=c587b000 task=c5984e10 task.ti=c587b000)
Stack: c06228cc c5313790 c049e5c0 0804f000 c45a7b00 c53137b0 00000000 00000000
       00000082 00000001 00000000 00000000 00000000 fffffffb c58d6780 c049e437
       c45a7b00 c04b1f93 c587bfa0 00000400 0804f000 00000400 0804f000 c04b1f2f
Call Trace:
 [<c049e5c0>] seq_read+0x189/0x281
 [<c049e437>] seq_read+0x0/0x281
 [<c04b1f93>] proc_reg_read+0x64/0x77
 [<c04b1f2f>] proc_reg_read+0x0/0x77
 [<c048907e>] vfs_read+0x80/0xd1
 [<c0489491>] sys_read+0x41/0x67
 [<c04080fa>] sysenter_past_esp+0x6b/0xc1
 =======================
Code: e9 ec 8d 05 00 56 8b 11 53 8b 40 70 8b 58 3c eb 29 0f a3 15 80 91 7b c0 19 c0 85 c0 8d 42 01 74 17 89 c6 c1 fe 1f 89 01 89 71 04 <8b> 83 58 01 00 00 f7 d0 8b 04 90 eb 09 89 c2 83 fa 01 7e d2 31
EIP: [<c05a99da>] neigh_stat_seq_next+0x26/0x3f SS:ESP 0068:c587bf1c

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-07 04:08:53 -08:00
Jens Axboe
c46f2334c8 [SG] Get rid of __sg_mark_end()
sg_mark_end() overwrites the page_link information, but all users want
__sg_mark_end() behaviour where we just set the end bit. That is the most
natural way to use the sg list, since you'll fill it in and then mark the
end point.

So change sg_mark_end() to only set the termination bit. Add a sg_magic
debug check as well, and clear a chain pointer if it is set.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-11-02 08:47:06 +01:00
Stephen Hemminger
3b582cc14c [NET]: docbook fixes for netif_ functions
Documentation updates for network interfaces.

1. Add doc for netif_napi_add
2. Remove doc for unused returns from netif_rx
3. Add doc for netif_receive_skb

[ Incorporated minor mods from Randy Dunlap -DaveM ]

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-01 02:21:47 -07:00
Pavel Emelyanov
d57a9212e0 [NET]: Hide the net_ns kmem cache
This cache is only required to create new namespaces,
but we won't have them in CONFIG_NET_NS=n case.

Hide it under the appropriate ifdef.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-01 00:46:50 -07:00
Pavel Emelyanov
1a2ee93d28 [NET]: Mark the setup_net as __net_init
The setup_net is called for the init net namespace
only (int the CONFIG_NET_NS=n of course) from the __init
function, so mark it as __net_init to disappear with the
caller after the boot.

Yet again, in the perfect world this has to be under
#ifdef CONFIG_NET_NS, but it isn't guaranteed that every
subsystem is registered *after* the init_net_ns is set
up. After we are sure, that we don't start registering
them before the init net setup, we'll be able to move
this code under the ifdef.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-01 00:45:59 -07:00
Pavel Emelyanov
6a1a3b9f68 [NET]: Hide the dead code in the net_namespace.c
The namespace creation/destruction code is never called
if the CONFIG_NET_NS is n, so it's OK to move it under
appropriate ifdef.

The copy_net_ns() in the "n" case checks for flags and
returns -EINVAL when new net ns is requested. In a perfect
world this stub must be in net_namespace.h, but this
function need to know the CLONE_NEWNET value and thus
requires sched.h. On the other hand this header is to be
injected into almost every .c file in the networking code,
and making all this code depend on the sched.h is a
suicidal attempt.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-01 00:44:50 -07:00
Pavel Emelyanov
1dba323b3f [NETNS]: Make the init/exit hooks checks outside the loop
When the new pernet something (subsys, device or operations) is
being registered, the init callback is to be called for each
namespace, that currently exitst in the system. During the
unregister, the same is to be done with the exit callback.

However, not every pernet something has both calls, but the
check for the appropriate pointer to be not NULL is performed
inside the for_each_net() loop.

This is (at least) strange, so tune this.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-01 00:42:43 -07:00
Pavel Emelyanov
6257ff2177 [NET]: Forget the zero_it argument of sk_alloc()
Finally, the zero_it argument can be completely removed from
the callers and from the function prototype.

Besides, fix the checkpatch.pl warnings about using the
assignments inside if-s.

This patch is rather big, and it is a part of the previous one.
I splitted it wishing to make the patches more readable. Hope 
this particular split helped.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-01 00:39:31 -07:00
Pavel Emelyanov
154adbc846 [NET]: Remove bogus zero_it argument from sk_alloc
At this point nobody calls the sk_alloc(() with zero_it == 0,
so remove unneeded checks from it.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-01 00:38:43 -07:00
Pavel Emelyanov
8fd1d178a3 [NET]: Make the sk_clone() lighter
The sk_prot_alloc() already performs all the stuff needed by the
sk_clone(). Besides, the sk_prot_alloc() requires almost twice
less arguments than the sk_alloc() does, so call the sk_prot_alloc()
saving the stack a bit.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-01 00:37:32 -07:00
Pavel Emelyanov
2e4afe7b35 [NET]: Move some core sock setup into sk_prot_alloc
The security_sk_alloc() and the module_get is a part of the
object allocations - move it in the proper place.

Note, that since we do not reset the newly allocated sock
in the sk_alloc() (memset() is removed with the previous
patch) we can safely do this.

Also fix the error path in sk_prot_alloc() - release the security
context if needed.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-01 00:36:26 -07:00
Pavel Emelyanov
3f0666ee30 [NET]: Auto-zero the allocated sock object
We have a __GFP_ZERO flag that allocates a zeroed chunk of memory.
Use it in the sk_alloc() and avoid a hand-made memset().

This is a temporary patch that will help us in the nearest future :)

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-01 00:34:42 -07:00
Pavel Emelyanov
c308c1b20e [NET]: Cleanup the allocation/freeing of the sock object
The sock object is allocated either from the generic cache with
the kmalloc, or from the proc->slab cache.

Move this logic into an isolated set of helpers and make the
sk_alloc/sk_free look a bit nicer.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-01 00:33:50 -07:00
Pavel Emelyanov
1e2e6b89f1 [NET]: Move the get_net() from sock_copy()
The sock_copy() is supposed to just clone the socket. In a perfect
world it has to be just memcpy, but we have to handle the security
mark correctly. All the extra setup must be performed in sk_clone() 
call, so move the get_net() into more proper place.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-01 00:31:26 -07:00
Pavel Emelyanov
f1a6c4da14 [NET]: Move the sock_copy() from the header
The sock_copy() call is not used outside the sock.c file,
so just move it into a sock.c

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-11-01 00:29:45 -07:00
David S. Miller
51c739d1f4 [NET]: Fix incorrect sg_mark_end() calls.
This fixes scatterlist corruptions added by

	commit 68e3f5dd4d
	[CRYPTO] users: Fix up scatterlist conversion errors

The issue is that the code calls sg_mark_end() which clobbers the
sg_page() pointer of the final scatterlist entry.

The first part fo the fix makes skb_to_sgvec() do __sg_mark_end().

After considering all skb_to_sgvec() call sites the most correct
solution is to call __sg_mark_end() in skb_to_sgvec() since that is
what all of the callers would end up doing anyways.

I suspect this might have fixed some problems in virtio_net which is
the sole non-crypto user of skb_to_sgvec().

Other similar sg_mark_end() cases were converted over to
__sg_mark_end() as well.

Arguably sg_mark_end() is a poorly named function because it doesn't
just "mark", it clears out the page pointer as a side effect, which is
what led to these bugs in the first place.

The one remaining plain sg_mark_end() call is in scsi_alloc_sgtable()
and arguably it could be converted to __sg_mark_end() if only so that
we can delete this confusing interface from linux/scatterlist.h

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-30 21:29:29 -07:00
Daniel Lezcano
310928d963 [NETNS]: fix net released by rcu callback
When a network namespace reference is held by a network subsystem,
and when this reference is decremented in a rcu update callback, we
must ensure that there is no more outstanding rcu update before
trying to free the network namespace.

In the normal case, the rcu_barrier is called when the network namespace
is exiting in the cleanup_net function.

But when a network namespace creation fails, and the subsystems are
undone (like the cleanup), the rcu_barrier is missing.

This patch adds the missing rcu_barrier.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-30 21:16:21 -07:00
Daniel Lezcano
93ee31f14f [NET]: Fix free_netdev on register_netdev failure.
Point 1:
The unregistering of a network device schedule a netdev_run_todo.
This function calls dev->destructor when it is set and the
destructor calls free_netdev.

Point 2:
In the case of an initialization of a network device the usual code
is:
 * alloc_netdev
 * register_netdev
    -> if this one fails, call free_netdev and exit with error.

Point 3:
In the register_netdevice function at the later state, when the device
is at the registered state, a call to the netdevice_notifiers is made.
If one of the notification falls into an error, a rollback to the
registered state is done using unregister_netdevice.

Conclusion:
When a network device fails to register during initialization because
one network subsystem returned an error during a notification call
chain, the network device is freed twice because of fact 1 and fact 2.
The second free_netdev will be done with an invalid pointer.

Proposed solution:
The following patch move all the code of unregister_netdevice *except*
the call to net_set_todo, to a new function "rollback_registered".

The following functions are changed in this way:
 * register_netdevice: calls rollback_registered when a notification fails
 * unregister_netdevice: calls rollback_register + net_set_todo, the call
                         order to net_set_todo is changed because it is the
                         latest now. Since it justs add an element to a list
                         that should not break anything.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-30 21:16:18 -07:00
David S. Miller
0a7606c121 [NET]: Fix race between poll_napi() and net_rx_action()
netpoll_poll_lock() synchronizes the ->poll() invocation
code paths, but once we have the lock we have to make
sure that NAPI_STATE_SCHED is still set.  Otherwise we
get:

	cpu 0			cpu 1

	net_rx_action()		poll_napi()
	netpoll_poll_lock()	... spin on ->poll_lock
	->poll()
	  netif_rx_complete
	netpoll_poll_unlock()	acquire ->poll_lock()
				->poll()
				 netif_rx_complete()
				 CRASH

Based upon a bug report from Tina Yang.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-29 22:37:28 -07:00
Eric W. Biederman
ceaa79c434 [NETNS]: Fix get_net_ns_by_pid
The pid namespace patches changed the semantics of
find_task_by_pid without breaking the compile resulting
in get_net_ns_by_pid doing the wrong thing.

So switch to using the intended find_task_by_vpid.

Combined with Denis' earlier patch to make netlink traffic
fully synchronous the inadvertent race I introduced with
accessing current is actually removed.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-26 22:56:12 -07:00
Eric W. Biederman
2b008b0a8e [NET]: Marking struct pernet_operations __net_initdata was inappropriate
It is not safe to to place struct pernet_operations in a special section.
We need struct pernet_operations to last until we call unregister_pernet_subsys.
Which doesn't happen until module unload.

So marking struct pernet_operations is a disaster for modules in two ways.
- We discard it before we call the exit method it points to.
- Because I keep struct pernet_operations on a linked list discarding
  it for compiled in code removes elements in the middle of a linked
  list and does horrible things for linked insert.

So this looks safe assuming __exit_refok is not discarded
for modules.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-26 22:54:53 -07:00
Adrian Bunk
bbbb1a812d [NET]: Unexport sock_enable_timestamp().
sock_enable_timestamp() no longer has any modular users.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-26 03:59:45 -07:00
Stephen Hemminger
c8d90dca32 [NET] dev_change_name: ignore changes to same name
Prevent error/backtrace from dev_rename() when changing
name of network device to the same name. This is a common
situation with udev and other scripts that bind addr to device.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-26 03:53:42 -07:00
Jamal Hadi Salim
a057ae3c10 [NET_CLS_ACT]: Use skb_act_clone
clean skb_clone of any signs of CONFIG_NET_CLS_ACT and
have mirred us skb_act_clone()

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-26 02:47:54 -07:00
Linus Torvalds
06dbbfef82 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
  [IPV4]: Explicitly call fib_get_table() in fib_frontend.c
  [NET]: Use BUILD_BUG_ON in net/core/flowi.c
  [NET]: Remove in-code externs for some functions from net/core/dev.c
  [NET]: Don't declare extern variables in net/core/sysctl_net_core.c
  [TCP]: Remove unneeded implicit type cast when calling tcp_minshall_update()
  [NET]: Treat the sign of the result of skb_headroom() consistently
  [9P]: Fix missing unlock before return in p9_mux_poll_start
  [PKT_SCHED]: Fix sch_prio.c build with CONFIG_NETDEVICES_MULTIQUEUE
  [IPV4] ip_gre: sendto/recvfrom NBMA address
  [SCTP]: Consolidate sctp_ulpq_renege_xxx functions
  [NETLINK]: Fix ACK processing after netlink_dump_start
  [VLAN]: MAINTAINERS update
  [DCCP]: Implement SIOCINQ/FIONREAD
  [NET]: Validate device addr prior to interface-up
2007-10-25 15:50:32 -07:00
Jens Axboe
642f149031 SG: Change sg_set_page() to take length and offset argument
Most drivers need to set length and offset as well, so may as well fold
those three lines into one.

Add sg_assign_page() for those two locations that only needed to set
the page, where the offset/length is set outside of the function context.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-24 11:20:47 +02:00
Pavel Emelyanov
f0fe91ded3 [NET]: Use BUILD_BUG_ON in net/core/flowi.c
Instead of ugly extern not-existing function.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-23 21:27:57 -07:00
Pavel Emelyanov
342709efc7 [NET]: Remove in-code externs for some functions from net/core/dev.c
Inconsistent prototype and real type for functions may have worse
consequences, than those for variables, so move them into a header.

Since they are used privately in net/core, make this file reside in
the same place.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-23 21:27:56 -07:00
Pavel Emelyanov
a37ae4086e [NET]: Don't declare extern variables in net/core/sysctl_net_core.c
Some are already declared in include/linux/netdevice.h, while
some others (xfrm ones) need to be declared.

The driver/net/rrunner.c just uses same extern as well, so
cleanup it also.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-23 21:27:56 -07:00
Jeff Garzik
bada339ba2 [NET]: Validate device addr prior to interface-up
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-23 21:27:50 -07:00
Linus Torvalds
f09cc910fe Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (30 commits)
  [IPSEC] IPV6: Fix to add tunnel mode SA correctly.
  [NET]: Cut off the queue_mapping field from sk_buff
  [NET]: Hide the queue_mapping field inside netif_subqueue_stopped
  [NET]: Make and use skb_get_queue_mapping
  [NET]: Use the skb_set_queue_mapping where appropriate
  [INET]: Use MODULE_ALIAS_NET_PF_PROTO_TYPE where possible.
  [INET]: Let inet_diag and friends autoload
  [NIU]: Cleanup PAGE_SIZE checks a bit
  [NET]: Fix SKB_WITH_OVERHEAD calculation
  [ATM]: Fix clip module reload crash.
  [TG3]: Update version to 3.85
  [TG3]: PCI command adjustment
  [TG3]: Add management FW version to ethtool report
  [TG3]: Add 5723 support
  [Bluetooth] Convert RFCOMM to use kthread API
  [Bluetooth] Add constant for Bluetooth socket options level
  [Bluetooth] Add support for handling simple eSCO links
  [Bluetooth] Add address and channel attribute to RFCOMM TTY device
  [Bluetooth] Fix wrong argument in debug code of HIDP
  [Bluetooth] Add generic driver for Bluetooth USB devices
  ...
2007-10-22 19:22:33 -07:00
Jens Axboe
fa05f1286b Update net/ to use sg helpers
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-22 21:19:56 +02:00
Pavel Emelyanov
668f895a85 [NET]: Hide the queue_mapping field inside netif_subqueue_stopped
Many places get the queue_mapping field from skb to pass it to the
netif_subqueue_stopped() which will be 0 in any case.

Make the helper that works with sk_buff

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-22 02:59:56 -07:00
Pavel Emelyanov
dfa4091129 [NET]: Use the skb_set_queue_mapping where appropriate
There's already such a helper to initialize this field.  Use it.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-22 02:59:55 -07:00
Randy Dunlap
bfb85c9f75 [ATM]: Fix clip module reload crash.
net/atm/clip.c crashes the kernel if it (module) is loaded, removed,
and then loaded again.  Its exit call to neigh_table_clear()
should destroy the cache after freeing it.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-22 02:59:52 -07:00
Jan Engelhardt
96de0e252c Convert files to UTF-8 and some cleanups
* Convert files to UTF-8.

  * Also correct some people's names
    (one example is Eißfeldt, which was found in a source file.
    Given that the author used an ß at all in a source file
    indicates that the real name has in fact a 'ß' and not an 'ss',
    which is commonly used as a substitute for 'ß' when limited to
    7bit.)

  * Correct town names (Goettingen -> Göttingen)

  * Update Eberhard Mönkeberg's address (http://lkml.org/lkml/2007/1/8/313)

Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
2007-10-19 23:21:04 +02:00
Linus Torvalds
804b908adf Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
  [NET]: Fix possible dev_deactivate race condition
  [INET]: Justification for local port range robustness.
  [PACKET]: Kill unused pg_vec_endpage() function
  [NET]: QoS/Sched as menuconfig
  [NET]: Fix bug in sk_filter race cures.
  [PATCH] mac80211: make ieee802_11_parse_elems return void
2007-10-19 11:54:39 -07:00
Pavel Emelyanov
ba25f9dcc4 Use helpers to obtain task pid in printks
The task_struct->pid member is going to be deprecated, so start
using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in
the kernel.

The first thing to start with is the pid, printed to dmesg - in
this case we may safely use task_pid_nr(). Besides, printks produce
more (much more) than a half of all the explicit pid usage.

[akpm@linux-foundation.org: git-drm went and changed lots of stuff]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Dave Airlie <airlied@linux.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:43 -07:00
Jiri Slaby
1977f03272 remove asm/bitops.h includes
remove asm/bitops.h includes

including asm/bitops directly may cause compile errors. don't include it
and include linux/bitops instead. next patch will deny including asm header
directly.

Cc: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:41 -07:00
Pavel Emelyanov
b488893a39 pid namespaces: changes to show virtual ids to user
This is the largest patch in the set. Make all (I hope) the places where
the pid is shown to or get from user operate on the virtual pids.

The idea is:
 - all in-kernel data structures must store either struct pid itself
   or the pid's global nr, obtained with pid_nr() call;
 - when seeking the task from kernel code with the stored id one
   should use find_task_by_pid() call that works with global pids;
 - when showing pid's numerical value to the user the virtual one
   should be used, but however when one shows task's pid outside this
   task's namespace the global one is to be used;
 - when getting the pid from userspace one need to consider this as
   the virtual one and use appropriate task/pid-searching functions.

[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: nuther build fix]
[akpm@linux-foundation.org: yet nuther build fix]
[akpm@linux-foundation.org: remove unneeded casts]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:40 -07:00
Pavel Emelyanov
cf7b708c8d Make access to task's nsproxy lighter
When someone wants to deal with some other taks's namespaces it has to lock
the task and then to get the desired namespace if the one exists.  This is
slow on read-only paths and may be impossible in some cases.

E.g.  Oleg recently noticed a race between unshare() and the (sent for
review in cgroups) pid namespaces - when the task notifies the parent it
has to know the parent's namespace, but taking the task_lock() is
impossible there - the code is under write locked tasklist lock.

On the other hand switching the namespace on task (daemonize) and releasing
the namespace (after the last task exit) is rather rare operation and we
can sacrifice its speed to solve the issues above.

The access to other task namespaces is proposed to be performed
like this:

     rcu_read_lock();
     nsproxy = task_nsproxy(tsk);
     if (nsproxy != NULL) {
             / *
               * work with the namespaces here
               * e.g. get the reference on one of them
               * /
     } / *
         * NULL task_nsproxy() means that this task is
         * almost dead (zombie)
         * /
     rcu_read_unlock();

This patch has passed the review by Eric and Oleg :) and,
of course, tested.

[clg@fr.ibm.com: fix unshare()]
[ebiederm@xmission.com: Update get_net_ns_by_pid]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 11:53:37 -07:00
Olof Johansson
9b013e05e0 [NET]: Fix bug in sk_filter race cures.
Looks like this might be causing problems, at least for me on ppc. This
happened during a normal boot, right around first interface config/dhcp
run..

cpu 0x0: Vector: 300 (Data Access) at [c00000000147b820]
    pc: c000000000435e5c: .sk_filter_delayed_uncharge+0x1c/0x60
    lr: c0000000004360d0: .sk_attach_filter+0x170/0x180
    sp: c00000000147baa0
   msr: 9000000000009032
   dar: 4
 dsisr: 40000000
  current = 0xc000000004780fa0
  paca    = 0xc000000000650480
    pid   = 1295, comm = dhclient3
0:mon> t
[c00000000147bb20] c0000000004360d0 .sk_attach_filter+0x170/0x180
[c00000000147bbd0] c000000000418988 .sock_setsockopt+0x788/0x7f0
[c00000000147bcb0] c000000000438a74 .compat_sys_setsockopt+0x4e4/0x5a0
[c00000000147bd90] c00000000043955c .compat_sys_socketcall+0x25c/0x2b0
[c00000000147be30] c000000000007508 syscall_exit+0x0/0x40
--- Exception: c01 (System Call) at 000000000ff618d8
SP (fffdf040) is in userspace
0:mon> 

I.e. null pointer deref at sk_filter_delayed_uncharge+0x1c:

0:mon> di $.sk_filter_delayed_uncharge
c000000000435e40  7c0802a6      mflr    r0
c000000000435e44  fbc1fff0      std     r30,-16(r1)
c000000000435e48  7c8b2378      mr      r11,r4
c000000000435e4c  ebc2cdd0      ld      r30,-12848(r2)
c000000000435e50  f8010010      std     r0,16(r1)
c000000000435e54  f821ff81      stdu    r1,-128(r1)
c000000000435e58  380300a4      addi    r0,r3,164
c000000000435e5c  81240004      lwz     r9,4(r4)

That's the deref of fp:

static void sk_filter_delayed_uncharge(struct sock *sk, struct sk_filter *fp)
{
        unsigned int size = sk_filter_len(fp);
...

That is called from sk_attach_filter():

...
        rcu_read_lock_bh();
        old_fp = rcu_dereference(sk->sk_filter);
        rcu_assign_pointer(sk->sk_filter, fp);
        rcu_read_unlock_bh();

        sk_filter_delayed_uncharge(sk, old_fp);
        return 0;
...

So, looks like rcu_dereference() returned NULL. I don't know the
filter code at all, but it seems like it might be a valid case?
sk_detach_filter() seems to handle a NULL sk_filter, at least.

So, this needs review by someone who knows the filter, but it fixes the
problem for me:

Signed-off-by: Olof Johansson <olof@lixom.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-18 21:48:39 -07:00
Linus Torvalds
a57793651f Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (51 commits)
  [IPV6]: Fix again the fl6_sock_lookup() fixed locking
  [NETFILTER]: nf_conntrack_tcp: fix connection reopening fix
  [IPV6]: Fix race in ipv6_flowlabel_opt() when inserting two labels
  [IPV6]: Lost locking in fl6_sock_lookup
  [IPV6]: Lost locking when inserting a flowlabel in ipv6_fl_list
  [NETFILTER]: xt_sctp: fix mistake to pass a pointer where array is required
  [NET]: Fix OOPS due to missing check in dev_parse_header().
  [TCP]: Remove lost_retrans zero seqno special cases
  [NET]: fix carrier-on bug?
  [NET]: Fix uninitialised variable in ip_frag_reasm()
  [IPSEC]: Rename mode to outer_mode and add inner_mode
  [IPSEC]: Disallow combinations of RO and AH/ESP/IPCOMP
  [IPSEC]: Use the top IPv4 route's peer instead of the bottom
  [IPSEC]: Store afinfo pointer in xfrm_mode
  [IPSEC]: Add missing BEET checks
  [IPSEC]: Move type and mode map into xfrm_state.c
  [IPSEC]: Fix length check in xfrm_parse_spi
  [IPSEC]: Move ip_summed zapping out of xfrm6_rcv_spi
  [IPSEC]: Get nexthdr from caller in xfrm6_rcv_spi
  [IPSEC]: Move tunnel parsing for IPv4 out of xfrm4_input
  ...
2007-10-18 14:40:30 -07:00
Eric W. Biederman
d12af679bc sysctl: fix neighbour table sysctls.
- In ipv6 ndisc_ifinfo_syctl_change so it doesn't depend on binary
  sysctl names for a function that works with proc.

- In neighbour.c reorder the table to put the possibly unused entries
  at the end so we can remove them by terminating the table early.

- In neighbour.c kill the entries with questionable binary sysctl
  handling behavior.

- In neighbour.c if we don't have a strategy routine remove the
  binary path.  So we don't the default sysctl strategy routine
  on data that is not ready for it.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-18 14:37:22 -07:00
Herbert Xu
13996378e6 [IPSEC]: Rename mode to outer_mode and add inner_mode
This patch adds a new field to xfrm states called inner_mode.  The existing
mode object is renamed to outer_mode.

This is the first part of an attempt to fix inter-family transforms.  As it
is we always use the outer family when determining which mode to use.  As a
result we may end up shoving IPv4 packets into netfilter6 and vice versa.

What we really want is to use the inner family for the first part of outbound
processing and the outer family for the second part.  For inbound processing
we'd use the opposite pairing.

I've also added a check to prevent silly combinations such as transport mode
with inter-family transforms.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-17 21:35:51 -07:00
Pavel Emelyanov
47e958eac2 [NET]: Fix the race between sk_filter_(de|at)tach and sk_clone()
The proposed fix is to delay the reference counter decrement
until the quiescent state pass. This will give sk_clone() a
chance to get the reference on the cloned filter.

Regular sk_filter_uncharge can happen from the sk_free() only
and there's no need in delaying the put - the socket is dead
anyway and is to be release itself.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-17 21:22:42 -07:00
Pavel Emelyanov
d3904b7399 [NET]: Cleanup the error path in sk_attach_filter
The sk_filter_uncharge is called for error handling and
for releasing the former filter, but this will have to
be done in a bit different manner, so cleanup the error
path a bit.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-17 21:22:17 -07:00
Pavel Emelyanov
309dd5fc87 [NET]: Move the filter releasing into a separate call
This is done merely as a preparation for the fix.

The sk_filter_uncharge() unaccounts the filter memory and calls
the sk_filter_release(), which in turn decrements the refcount
anf frees the filter.

The latter function will be required separately.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-17 21:21:51 -07:00
Pavel Emelyanov
55b333253d [NET]: Introduce the sk_detach_filter() call
Filter is attached in a separate function, so do the
same for filter detaching.

This also removes one variable sock_setsockopt().

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-17 21:21:26 -07:00
Pavel Emelyanov
4ae289444b [NEIGH]: Ensure that pneigh_lookup is protected with RTNL
The pnigh_lookup is used to lookup proxy entries and to 
create them in case lookup failed. 

However, the "creation" code does not perform the re-lookup
after GFP_KERNEL allocation. This is done because the code
is expected to be protected with the RTNL lock, so add the 
assertion (mainly to address future questions from new network 
developers like me :) ).

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-15 12:54:15 -07:00
Herbert Xu
a030847e9f [NET]: Avoid copying TCP packets unnecessarily
TCP packets all have writable heads, that is, even though it's cloned, it is
writable up to the end of the TCP header.  This patch makes skb_checksum_help
aware of this fact by using skb_clone_writable and avoiding a copy for TCP.

I've also modified the BUG_ON tests to be unsigned.  The only case where this
makes a difference is if csum_start points to a location before skb->data.
Since skb->data should always include the header where the checksum field
is (and all currently callers adhere to that), this change is safe and may
uncover bugs later.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-15 12:26:34 -07:00
Herbert Xu
172a863f2b [NET]: Fix csum_start update in pskb_expand_head
I got confused by the dual nature of the off variable in the
function pskb_expand_head.  The csum_start offset should use
nhead instead of off which can change depending on whether we
are using offsets or pointers.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-15 12:26:33 -07:00
Herbert Xu
f697c3e8b3 [NET]: Avoid unnecessary cloning for ingress filtering
As it is we always invoke pt_prev before ing_filter, even if there are no
ingress filters attached.  This can cause unnecessary cloning in pt_prev.

This patch changes it so that we only invoke pt_prev if there are ingress
filters attached.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-15 12:26:26 -07:00
Herbert Xu
e0053ec07e [SKBUFF]: Add skb_morph
This patch creates a new function skb_morph that's just like skb_clone
except that it lets user provide the spare skb that will be overwritten
by the one that's to be cloned.

This will be used by IP fragment reassembly so that we get back the same
skb that went in last (rather than the head skb that we get now which
requires us to carry around double pointers all over the place).

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-15 12:26:24 -07:00
Herbert Xu
dec18810c5 [SKBUFF]: Merge common code between copy_skb_header and skb_clone
This patch creates a new function __copy_skb_header to merge the common
code between copy_skb_header and skb_clone.  Having two functions which
are largely the same is a source of wasted labour as well as confusion.

In fact the tc_verd stuff is almost certainly a bug since it's treated
differently in skb_clone compared to the callers of copy_skb_header
(skb_copy/pskb_copy/skb_copy_expand).

I've kept that difference in tact with a comment added asking for
clarification.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-15 12:26:24 -07:00
Randy Dunlap
c4ea43c552 net core: fix kernel-doc for new function parameters
Fix networking code kernel-doc for newly added parameters.

Warning(linux-2.6.23-git2//net/core/sock.c:879): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:570): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:594): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:617): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:641): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:667): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:722): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:959): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:1195): No description found for parameter 'dev'
Warning(linux-2.6.23-git2//net/core/dev.c:2105): No description found for parameter 'n'
Warning(linux-2.6.23-git2//net/core/dev.c:3272): No description found for parameter 'net'
Warning(linux-2.6.23-git2//net/core/dev.c:3445): No description found for parameter 'net'
Warning(linux-2.6.23-git2//include/linux/netdevice.h:1301): No description found for parameter 'cpu'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-13 09:52:26 -07:00
Kay Sievers
7eff2e7a8b Driver core: change add_uevent_var to use a struct
This changes the uevent buffer functions to use a struct instead of a
long list of parameters. It does no longer require the caller to do the
proper buffer termination and size accounting, which is currently wrong
in some places. It fixes a known bug where parts of the uevent
environment are overwritten because of wrong index calculations.

Many thanks to Mathieu Desnoyers for finding bugs and improving the
error handling.

Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-12 14:51:01 -07:00
Denis V. Lunev
cd40b7d398 [NET]: make netlink user -> kernel interface synchronious
This patch make processing netlink user -> kernel messages synchronious.
This change was inspired by the talk with Alexey Kuznetsov about current
netlink messages processing. He says that he was badly wrong when introduced 
asynchronious user -> kernel communication.

The call netlink_unicast is the only path to send message to the kernel
netlink socket. But, unfortunately, it is also used to send data to the
user.

Before this change the user message has been attached to the socket queue
and sk->sk_data_ready was called. The process has been blocked until all
pending messages were processed. The bad thing is that this processing
may occur in the arbitrary process context.

This patch changes nlk->data_ready callback to get 1 skb and force packet
processing right in the netlink_unicast.

Kernel -> user path in netlink_unicast remains untouched.

EINTR processing for in netlink_run_queue was changed. It forces rtnl_lock
drop, but the process remains in the cycle until the message will be fully
processed. So, there is no need to use this kludges now.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 21:15:29 -07:00
Denis V. Lunev
1536cc0d55 [NET]: rtnl_unlock cleanups
There is no need to process outstanding netlink user->kernel packets
during rtnl_unlock now. There is no rtnl_trylock in the rtnetlink_rcv
anymore.

Normal code path is the following:
netlink_sendmsg
   netlink_unicast
       netlink_sendskb
           skb_queue_tail
           netlink_data_ready
               rtnetlink_rcv
                   mutex_lock(&rtnl_mutex);
                   netlink_run_queue(sk, qlen, &rtnetlink_rcv_msg);
                   mutex_unlock(&rtnl_mutex);

So, it is possible, that packets can be present in the rtnl->sk_receive_queue
during rtnl_unlock, but there is no need to process them at that moment as
rtnetlink_rcv for that packet is pending.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 21:12:58 -07:00
Pavel Emelyanov
9b77265235 [NET]: Remove double dev->flags checking when calling dev_close()
The unregister_netdevice() and dev_change_net_namespace()
both check for dev->flags to be IFF_UP before calling the
dev_close(), but the dev_close() checks for IFF_UP itself,
so remove those unneeded checks.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:55:52 -07:00
Pavel Emelyanov
32f0c4cbe4 [NETNS]: Don't memset() netns to zero manually
The newly created net namespace is set to 0 with memset()
in setup_net(). The setup_net() is also called for the
init_net_ns(), which is zeroed naturally as a global var.

So remove this memset and allocate new nets with the
kmem_cache_zalloc().

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:54:59 -07:00
Pavel Emelyanov
4665079cbb [NETNS]: Move some code into __init section when CONFIG_NET_NS=n
With the net namespaces many code leaved the __init section,
thus making the kernel occupy more memory than it did before.
Since we have a config option that prohibits the namespace
creation, the functions that initialize/finalize some netns
stuff are simply not needed and can be freed after the boot.

Currently, this is almost not noticeable, since few calls
are no longer in __init, but when the namespaces will be
merged it will be possible to free more code. I propose to
use the __net_init, __net_exit and __net_initdata "attributes"
for functions/variables that are not used if the CONFIG_NET_NS
is not set to save more space in memory.

The exiting functions cannot just reside in the __exit section,
as noticed by David, since the init section will have
references on it and the compilation will fail due to modpost
checks. These references can exist, since the init namespace
never dies and the exit callbacks are never called. So I
introduce the __exit_refok attribute just like it is already
done with the __init_refok.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:54:58 -07:00
Jeff Garzik
14e3e07979 [NET]: split dev_ifsioc() according to locking
This always bugged me: dev_ioctl() called dev_ifsioc() either inside
read_lock(dev_base_lock) or rtnl_lock(), depending on the ioctl being
executed.

This change moves the ioctls executed inside dev_base_lock to a new
function, dev_ifsioc_locked().  Now the locking context is completely
clear to the reader.

Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:54:49 -07:00
Stephen Hemminger
cfcabdcc2d [NET]: sparse warning fixes
Fix a bunch of sparse warnings. Mostly about 0 used as
NULL pointer, and shadowed variable declarations.
One notable case was that hash size should have been unsigned.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:54:48 -07:00
Eric W. Biederman
f4618d39a3 [NETNS]: Simplify the network namespace list locking rules.
Denis V. Lunev <den@sw.ru> noticed that the locking rules
for the network namespace list are over complicated and broken.

In particular the current register_netdev_notifier currently
does not take any lock making the for_each_net iteration racy
with network namespace creation and destruction. Oops.

The fact that we need to use for_each_net in rtnl_unlock() when
the rtnetlink support becomes per network namespace makes designing
the proper locking tricky.  In addition we need to be able to call
rtnl_lock() and rtnl_unlock() when we have the net_mutex held.

After thinking about it and looking at the alternatives carefully
it looks like the simplest and most maintainable solution is
to remove net_list_mutex altogether, and to use the rtnl_mutex instead.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:52:55 -07:00
Stephen Hemminger
3b04ddde02 [NET]: Move hardware header operations out of netdevice.
Since hardware header operations are part of the protocol class
not the device instance, make them into a separate object and
save memory.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:52:52 -07:00
Stephen Hemminger
0c4e85813d [NET]: Wrap netdevice hardware header creation.
Add inline for common usage of hardware header creation, and
fix bug in IPV6 mcast where the assumption about negative return is
an errno. Negative return from hard_header means not enough space
was available,(ie -N bytes).

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:52:50 -07:00
Eric W. Biederman
2774c7aba6 [NET]: Make the loopback device per network namespace.
This patch makes loopback_dev per network namespace.  Adding
code to create a different loopback device for each network
namespace and adding the code to free a loopback device
when a network namespace exits.

This patch modifies all users the loopback_dev so they
access it as init_net.loopback_dev, keeping all of the
code compiling and working.  A later pass will be needed to
update the users to use something other than the initial network
namespace.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:52:49 -07:00
Eric W. Biederman
9dd776b6d7 [NET]: Add network namespace clone & unshare support.
This patch allows you to create a new network namespace
using sys_clone, or sys_unshare.

As the network namespace is still experimental and under development
clone and unshare support is only made available when CONFIG_NET_NS is
selected at compile time.

As this patch introduces network namespace support into code paths
that exist when the CONFIG_NET is not selected there are a few
additions made to net_namespace.h to allow a few more functions
to be used when the networking stack is not compiled in.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:52:46 -07:00
Eric W. Biederman
8b41d1887d [NET]: Fix running without sysfs
When sysfs support is compiled out the kernel still keeps and maintains
the kobject tree.  So it is not safe to skip our kobject reference counting or
to avoid becoming members of the kobject tree.  It is safe to not add
the networking specific sysfs attributes.

This patch removes the sysfs special cases from net/core/dev.c
renames functions from netdev_sysfs_xxxx to netdev_kobject_xxxx
and always compiles in net-sysfs.c

net-sysfs.c is modified with a CONFIG_SYSFS guard around the parts
that are actually sysfs specific.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:52:46 -07:00
Daniel Lezcano
de3cb747ff [NET]: Dynamically allocate the loopback device, part 1.
This patch replaces all occurences to the static variable
loopback_dev to a pointer loopback_dev. That provides the
mindless, trivial, uninteressting change part for the dynamic
allocation for the loopback.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-By: Kirill Korotaev <dev@sw.ru>
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:52:14 -07:00
Joe Perches
0795af5729 [NET]: Introduce and use print_mac() and DECLARE_MAC_BUF()
This is nicer than the MAC_FMT stuff.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:51:42 -07:00
Pavel Emelyanov
768f3591e2 [NETNS]: Cleanup list walking in setup_net and cleanup_net
I proposed introducing a list_for_each_entry_continue_reverse macro
to be used in setup_net() when unrolling the failed ->init callback.

Here is the macro and some more cleanup in the setup_net() itself
to remove one variable from the stack :) The same thing is for the
cleanup_net() - the existing list_for_each_entry_reverse() is used.

Minor, but the code looks nicer.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:51:35 -07:00
Herbert Xu
52886051ff [SKBUFF]: Fix up csum_start when head room changes
Thanks for noticing the bug where csum_start is not updated
when the head room changes.

This patch fixes that.  It also moves the csum/ip_summed
copying into copy_skb_header so that skb_copy_expand gets
it too.  I've checked its callers and no one should be upset
by this.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:51:24 -07:00
Herbert Xu
0cfad07555 [NETLINK]: Avoid pointer in netlink_run_queue
I was looking at Patrick's fix to inet_diag and it occured
to me that we're using a pointer argument to return values
unnecessarily in netlink_run_queue.  Changing it to return
the value will allow the compiler to generate better code
since the value won't have to be memory-backed.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:51:24 -07:00
Denis V. Lunev
76c72d4f44 [IPV4/IPV6/DECNET]: Small cleanup for fib rules.
This patch slightly cleanups FIB rules framework. rules_list as a pointer
on struct fib_rules_ops is useless. It is always assigned with a static
per/subsystem list in IPv4, IPv6 and DecNet.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:51:22 -07:00
Pavel Emelyanov
056925ab31 [NET]: Cleanup calling netdev notifiers.
The call_netdev_notifiers routine can successfully be used in
the net/core_dev.c itself.

This will save 6 lines of code and 62 ;) bytes of .text section.

62 is rather small, but I have one more patch saving ~30 bytes
from netns code (sent to Eric), so altogether they can save
some more noticeable amount.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:51:21 -07:00
Pavel Emelyanov
30d97d3585 [NETNS]: Consolidate hashes creation in netdev_init()
The dev_name_hash and the dev_index_hash are now booth kmalloc-ed
(and each element is properly initialized as usually) so I think
it's worth consolidating this code making it look nicer (and
saving 28 bytes of .text section ;) )

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:51:21 -07:00
Eric W. Biederman
ad7379d494 [NET]: Fix the prototype of call_netdevice_notifiers.
This replaces the void * parameter with a struct net_device * which
is what is actually required.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:51:20 -07:00
Jamal Hadi Salim
22dd749501 [NET]: migrate HARD_TX_LOCK to header file
HARD_TX_LOCK micro is a nice aggregation that could be used
in other spots. move it to netdevice.h
Also makes sure the previously superflous cpu arguement is used.
Thanks to DaveM for the suggestions.

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:51:20 -07:00
Jeff Garzik
88d3aafdae [ETHTOOL] Provide default behaviors for a few ethtool sub-ioctls
For the operations
	get-tx-csum
	get-sg
	get-tso
	get-ufo
the default ethtool_op_xxx behavior is fine for all drivers, so we
permit op==NULL to imply the default behavior.

This provides a more uniform behavior across all drivers, eliminating
ethtool(8) "ioctl not supported" errors on older drivers that had
not been updated for the latest sub-ioctls.

The ethtool_op_xxx() functions are left exported, in case anyone
wishes to call them directly from a driver-private implementation --
a not-uncommon case.  Should an ethtool_op_xxx() helper remain unused
for a while, except by net/core/ethtool.c, we can un-export it at a
later date.

[ Resolved conflicts with set/get value ethtool patch... -DaveM ]

Signed-off-by: Jeff Garzik <jeff@garzik.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:51:17 -07:00
Eric W. Biederman
077130c0cf [NET]: Fix race when opening a proc file while a network namespace is exiting.
The problem:  proc_net files remember which network namespace the are
against but do not remember hold a reference count (as that would pin
the network namespace).   So we currently have a small window where
the reference count on a network namespace may be incremented when opening
a /proc file when it has already gone to zero.

To fix this introduce maybe_get_net and get_proc_net.

maybe_get_net increments the network namespace reference count only if it is
greater then zero, ensuring we don't increment a reference count after it
has gone to zero.

get_proc_net handles all of the magic to go from a proc inode to the network
namespace instance and call maybe_get_net on it.

PROC_NET the old accessor is removed so that we don't get confused and use
the wrong helper function.

Then I fix up the callers to use get_proc_net and handle the case case
where get_proc_net returns NULL.  In that case I return -ENXIO because
effectively the network namespace has already gone away so the files
we are trying to access don't exist anymore.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:49:22 -07:00
David S. Miller
9d5010db7e [NET]: Add a might_sleep() to dev_close().
Requested by Johannes Berg.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:49:15 -07:00
Eric Dumazet
86bba269d0 [PATCH] NET : convert IP route cache garbage collection from softirq processing to a workqueue
When the periodic IP route cache flush is done (every 600 seconds on
default configuration), some hosts suffer a lot and eventually trigger
the "soft lockup" message.

dst_run_gc() is doing a scan of a possibly huge list of dst_entries,
eventually freeing some (less than 1%) of them, while holding the
dst_lock spinlock for the whole scan.

Then it rearms a timer to redo the full thing 1/10 s later...
The slowdown can last one minute or so, depending on how active are
the tcp sessions.

This second version of the patch converts the processing from a softirq
based one to a workqueue.

Even if the list of entries in garbage_list is huge, host is still
responsive to softirqs and can make progress.

Instead of resetting gc timer to 0.1 second if one entry was freed in a
gc run, we do this if more than 10% of entries were freed.

Before patch :

Aug 16 06:21:37 SRV1 kernel: BUG: soft lockup detected on CPU#0!
Aug 16 06:21:37 SRV1 kernel:
Aug 16 06:21:37 SRV1 kernel: Call Trace:
Aug 16 06:21:37 SRV1 kernel:  <IRQ>  [<ffffffff802286f0>] wake_up_process+0x10/0x20
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff80251e09>] softlockup_tick+0xe9/0x110
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803cd380>] dst_run_gc+0x0/0x140
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff802376f3>] run_local_timers+0x13/0x20
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff802379c7>] update_process_times+0x57/0x90
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff80216034>] smp_local_timer_interrupt+0x34/0x60
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff802165cc>] smp_apic_timer_interrupt+0x5c/0x80
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff8020a816>] apic_timer_interrupt+0x66/0x70
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803cd3d3>] dst_run_gc+0x53/0x140
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803cd3c6>] dst_run_gc+0x46/0x140
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff80237148>] run_timer_softirq+0x148/0x1c0
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff8023340c>] __do_softirq+0x6c/0xe0
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff8020ad6c>] call_softirq+0x1c/0x30
Aug 16 06:21:37 SRV1 kernel:  <EOI>  [<ffffffff8020cb34>] do_softirq+0x34/0x90
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff802331cf>] local_bh_enable_ip+0x3f/0x60
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff80422913>] _spin_unlock_bh+0x13/0x20
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803dfde8>] rt_garbage_collect+0x1d8/0x320
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803cd4dd>] dst_alloc+0x1d/0xa0
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803e1433>] __ip_route_output_key+0x573/0x800
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803c02e2>] sock_common_recvmsg+0x32/0x50
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803e16dc>] ip_route_output_flow+0x1c/0x60
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff80400160>] tcp_v4_connect+0x150/0x610
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803ebf07>] inet_bind_bucket_create+0x17/0x60
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff8040cd16>] inet_stream_connect+0xa6/0x2c0
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff80422981>] _spin_lock_bh+0x11/0x30
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803c0bdf>] lock_sock_nested+0xcf/0xe0
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff80422981>] _spin_lock_bh+0x11/0x30
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803be551>] sys_connect+0x71/0xa0
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803eee3f>] tcp_setsockopt+0x1f/0x30
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803c030f>] sock_common_setsockopt+0xf/0x20
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803be4bd>] sys_setsockopt+0x9d/0xc0
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff8028881e>] sys_ioctl+0x5e/0x80
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff80209c4e>] system_call+0x7e/0x83

After patch : (RT_CACHE_DEBUG set to 2 to get following traces)

dst_total: 75469 delayed: 74109 work_perf: 141 expires: 150 elapsed: 8092 us
dst_total: 78725 delayed: 73366 work_perf: 743 expires: 400 elapsed: 8542 us
dst_total: 86126 delayed: 71844 work_perf: 1522 expires: 775 elapsed: 8849 us
dst_total: 100173 delayed: 68791 work_perf: 3053 expires: 1256 elapsed: 9748 us
dst_total: 121798 delayed: 64711 work_perf: 4080 expires: 1997 elapsed: 10146 us
dst_total: 154522 delayed: 58316 work_perf: 6395 expires: 25 elapsed: 11402 us
dst_total: 154957 delayed: 58252 work_perf: 64 expires: 150 elapsed: 6148 us
dst_total: 157377 delayed: 57843 work_perf: 409 expires: 400 elapsed: 6350 us
dst_total: 163745 delayed: 56679 work_perf: 1164 expires: 775 elapsed: 7051 us
dst_total: 176577 delayed: 53965 work_perf: 2714 expires: 1389 elapsed: 8120 us
dst_total: 198993 delayed: 49627 work_perf: 4338 expires: 1997 elapsed: 8909 us
dst_total: 226638 delayed: 46865 work_perf: 2762 expires: 2748 elapsed: 7351 us

I successfully reduced the IP route cache of many hosts by a four factor
thanks to this patch. Previously, I had to disable "ip route flush cache"
to avoid crashes.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:49:15 -07:00
David S. Miller
678aa8e4eb [NET]: #if 0 out net_alloc() for now.
We will undo this once it is actually used.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:49:14 -07:00
Eric W. Biederman
d8a5ec6727 [NET]: netlink support for moving devices between network namespaces.
The simplest thing to implement is moving network devices between
namespaces.  However with the same attribute IFLA_NET_NS_PID we can
easily implement creating devices in the destination network
namespace as well.  However that is a little bit trickier so this
patch sticks to what is simple and easy.

A pid is used to identify a process that happens to be a member
of the network namespace we want to move the network device to.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:49:13 -07:00
Eric W. Biederman
ce286d3273 [NET]: Implement network device movement between namespaces
This patch introduces NETIF_F_NETNS_LOCAL a flag to indicate
a network device is local to a single network namespace and
should never be moved.  Useful for pseudo devices that we
need an instance in each network namespace (like the loopback
device) and for any device we find that cannot handle multiple
network namespaces so we may trap them in the initial network
namespace.

This patch introduces the function dev_change_net_namespace
a function used to move a network device from one network
namespace to another.  To the network device nothing
special appears to happen, to the components of the network
stack it appears as if the network device was unregistered
in the network namespace it is in, and a new device
was registered in the network namespace the device
was moved to.

This patch sets up a namespace device destructor that
upon the exit of a network namespace moves all of the
movable network devices  to the initial network namespace
so they are not lost.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:49:12 -07:00
Eric W. Biederman
b267b17964 [NET]: Factor out __dev_alloc_name from dev_alloc_name
When forcibly changing the network namespace of a device
I need something that can generate a name for the device
in the new namespace without overwriting the old name.

__dev_alloc_name provides me that functionality.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:49:11 -07:00
Eric W. Biederman
881d966b48 [NET]: Make the device list and device lookups per namespace.
This patch makes most of the generic device layer network
namespace safe.  This patch makes dev_base_head a
network namespace variable, and then it picks up
a few associated variables.  The functions:
dev_getbyhwaddr
dev_getfirsthwbytype
dev_get_by_flags
dev_get_by_name
__dev_get_by_name
dev_get_by_index
__dev_get_by_index
dev_ioctl
dev_ethtool
dev_load
wireless_process_ioctl

were modified to take a network namespace argument, and
deal with it.

vlan_ioctl_set and brioctl_set were modified so their
hooks will receive a network namespace argument.

So basically anthing in the core of the network stack that was
affected to by the change of dev_base was modified to handle
multiple network namespaces.  The rest of the network stack was
simply modified to explicitly use &init_net the initial network
namespace.  This can be fixed when those components of the network
stack are modified to handle multiple network namespaces.

For now the ifindex generator is left global.

Fundametally ifindex numbers are per namespace, or else
we will have corner case problems with migration when
we get that far.

At the same time there are assumptions in the network stack
that the ifindex of a network device won't change.  Making
the ifindex number global seems a good compromise until
the network stack can cope with ifindex changes when
you change namespaces, and the like.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:49:10 -07:00
Eric W. Biederman
b4b510290b [NET]: Support multiple network namespaces with netlink
Each netlink socket will live in exactly one network namespace,
this includes the controlling kernel sockets.

This patch updates all of the existing netlink protocols
to only support the initial network namespace.  Request
by clients in other namespaces will get -ECONREFUSED.
As they would if the kernel did not have the support for
that netlink protocol compiled in.

As each netlink protocol is updated to be multiple network
namespace safe it can register multiple kernel sockets
to acquire a presence in the rest of the network namespaces.

The implementation in af_netlink is a simple filter implementation
at hash table insertion and hash table look up time.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:49:09 -07:00
Eric W. Biederman
e9dc865340 [NET]: Make device event notification network namespace safe
Every user of the network device notifiers is either a protocol
stack or a pseudo device.  If a protocol stack that does not have
support for multiple network namespaces receives an event for a
device that is not in the initial network namespace it quite possibly
can get confused and do the wrong thing.

To avoid problems until all of the protocol stacks are converted
this patch modifies all netdev event handlers to ignore events on
devices that are not in the initial network namespace.

As the rest of the code is made network namespace aware these
checks can be removed.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10 16:49:09 -07:00