Using ->family in struct xt_*_param, multiple struct xt_{match,target}
can be squashed together.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
By passing in the family through which extensions were invoked, a bit
of data space can be reclaimed. The "family" member will be added to
the parameter structures and the check functions be adjusted.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch does this for target extensions' destroy functions.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch does this for target extensions' checkentry functions.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch does this for target extensions' target functions.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch does this for match extensions' destroy functions.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch does this for match extensions' checkentry functions.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The function signatures for Xtables extensions have grown over time.
It involves a lot of typing/replication, and also a bit of stack space
even if they are not used. Realize an NFWS2008 idea and pack them into
structs. The skb remains outside of the struct so gcc can continue to
apply its optimizations.
This patch does this for match extensions' match functions.
A few ambiguities have also been addressed. The "offset" parameter for
example has been renamed to "fragoff" (there are so many different
offsets already) and "protoff" to "thoff" (there is more than just one
protocol here, so clarify).
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
ip6t_LOG does certainly not depend on the filter table.
(Also, move it so that menuconfig still displays it correctly.)
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
It used to be that {ip,ip6,etc}_tables called extension->checkentry
themselves, but this can be moved into the xtables core.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Usually -EINVAL is used when checkentry fails (see *_tables).
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Ebtables ORs (1 << NF_BR_NUMHOOKS) into the hook mask to indicate that
the extension was called from a base chain. So this also needs to be
present in the extensions' ->hooks.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The function signatures will be changed to match those of Xtables, and
the datalen argument will be gone. ebt_among unfortunately relies on
it, so we need to obtain it somehow.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The TPROXY target implements redirection of non-local TCP/UDP traffic to local
sockets. Additionally, it's possible to manipulate the packet mark if and only
if a socket has been found. (We need this because we cannot use multiple
targets in the same iptables rule.)
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Add iptables 'socket' match, which matches packets for which a TCP/UDP
socket lookup succeeds.
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The iptables tproxy core is a module that contains the common routines used by
various tproxy related modules (TPROXY target and socket match)
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Netfilter connection tracking requires all IPv4 packets to be defragmented.
Both the socket match and the TPROXY target depend on this functionality, so
this patch separates the Netfilter IPv4 defrag hooks into a separate module.
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
From kernel perspective, allow entrance in nf_hook_slow().
Stuff which uses nf_register_hook/nf_register_hooks, but otherwise not netns-ready:
DECnet netfilter
ipt_CLUSTERIP
nf_nat_standalone.c together with XFRM (?)
IPVS
several individual match modules (like hashlimit)
ctnetlink
NOTRACK
all sorts of queueing and reporting to userspace
L3 and L4 protocol sysctls, bridge sysctls
probably something else
Anyway critical mass has been achieved, there is no reason to hide netfilter any longer.
From userspace perspective, allow to manipulate all sorts of
iptables/ip6tables/arptables rules.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Same story as with iptable_filter, iptables_raw tables.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
First, allow entry in notifier hook.
Second, start conntrack cleanup in netns to which netdevice belongs.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Add init_net checks to not remove kmem_caches twice and so on.
Refactor functions to split code which should be executed only for
init_net into one place.
ip_ct_attach and ip_ct_destroy assignments remain separate, because
they're separate stages in setup and teardown.
NOTE: NOTRACK code is in for-every-net part. It will be made per-netns
after we decidce how to do it correctly.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Note, sysctl table is always duplicated, this is simpler and less
special-cased.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Heh, last minute proof-reading of this patch made me think,
that this is actually unneeded, simply because "ct" pointers will be
different for different conntracks in different netns, just like they
are different in one netns.
Not so sure anymore.
[Patrick: pointers will be different, flushing can only be done while
inactive though and thus it needs to be per netns]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This is cleaner, we already know conntrack to which event is relevant.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Conntrack code will use it for
a) removing expectations and helpers when corresponding module is removed, and
b) removing conntracks when L3 protocol conntrack module is removed.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Again, it's deducible from skb, but we're going to use it for
nf_conntrack_checksum and statistics, so just pass it from upper layer.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
It's deducible from skb->dev or skb->dst->dev, but we know netns at
the moment of call, so pass it down and use for finding and creating
conntracks.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
What is confirmed connection in one netns can very well be unconfirmed
in another one.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Make per-netns a) expectation hash and b) expectations count.
Expectations always belongs to netns to which it's master conntrack belong.
This is natural and doesn't bloat expectation.
Proc files and leaf users are stubbed to init_net, this is temporary.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Take netns from skb->dst->dev. It should be safe because, they are called
from LOCAL_OUT hook where dst is valid (though, I'm not exactly sure about
IPVS and queueing packets to userspace).
[Patrick: its safe everywhere since they already expect skb->dst to be set]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
* make per-netns conntrack hash
Other solution is to add ->ct_net pointer to tuplehashes and still has one
hash, I tried that it's ugly and requires more code deep down in protocol
modules et al.
* propagate netns pointer to where needed, e. g. to conntrack iterators.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Sysctls and proc files are stubbed to init_net's one. This is temporary.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Conntrack (struct nf_conn) gets pointer to netns: ->ct_net -- netns in which
it was created. It comes from netdevice.
->ct_net is write-once field.
Every conntrack in system has ->ct_net initialized, no exceptions.
->ct_net doesn't pin netns: conntracks are recycled after timeouts and
pinning background traffic will prevent netns from even starting shutdown
sequence.
Right now every conntrack is created in init_net.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
One comment: #ifdefs around #include is necessary to overcome amazing compile
breakages in NOTRACK-in-netns patch (see below).
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Now that dev_net() exists, the usefullness of them is even less. Also they're
a big problem in resolving circular header dependencies necessary for
NOTRACK-in-netns patch. See below.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
When a match or target is looked up using xt_find_{match,target},
Xtables will also search the NFPROTO_UNSPEC module list. This allows
for protocol-independent extensions (like xt_time) to be reused from
other components (e.g. arptables, ebtables).
Extensions that take different codepaths depending on match->family
or target->family of course cannot use NFPROTO_UNSPEC within the
registration structure (e.g. xt_pkttype).
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The netfilter subsystem only supports a handful of protocols (much
less than PF_*) and even non-PF protocols like ARP and
pseudo-protocols like PF_BRIDGE. By creating NFPROTO_*, we can earn a
few memory savings on arrays that previously were always PF_MAX-sized
and keep the pseudo-protocols to ourselves.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This updates xt_recent to support the IPv6 address family.
The new /proc/net/xt_recent directory must be used for this.
The old proc interface can also be configured out.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Like with other modules (such as ipt_state), ipt_recent.h is changed
to forward definitions to (IOW include) xt_recent.h, and xt_recent.c
is changed to use the new constant names.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
and (try to) consistently use u_int8_t for the L3 family.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Because of rounding, in certain conditions, i.e. when in congestion
avoidance state rho is smaller than 1/128 of the current cwnd, TCP
Hybla congestion control starves and the cwnd is kept constant
forever.
This patch forces an increment by one segment after #send_cwnd calls
without increments(newreno behavior).
Signed-off-by: Daniele Lacamera <root@danielinux.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Benjamin Thery tracked down a bug that explains many instances
of the error
unregister_netdevice: waiting for %s to become free. Usage count = %d
It turns out that netdev_run_todo can dead-lock with itself if
a second instance of it is run in a thread that will then free
a reference to the device waited on by the first instance.
The problem is really quite silly. We were trying to create
parallelism where none was required. As netdev_run_todo always
follows a RTNL section, and that todo tasks can only be added
with the RTNL held, by definition you should only need to wait
for the very ones that you've added and be done with it.
There is no need for a second mutex or spinlock.
This is exactly what the following patch does.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch add mc_count to struct in_device and updates
increment/decrement/initilaize of this field in IPv4 and in IPv6.
- Also printing the vfs /proc entry (/proc/net/igmp) is adjusted to
use the new mc_count.
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Ali Saidi <saidi@engin.umich.edu>
When TCP receive copy offload is enabled it's possible that
tcp_rcv_established() will cause two acks to be sent for a single
packet. In the case that a tcp_dma_early_copy() is successful,
copied_early is set to true which causes tcp_cleanup_rbuf() to be
called early which can send an ack. Further along in
tcp_rcv_established(), __tcp_ack_snd_check() is called and will
schedule a delayed ACK. If no packets are processed before the delayed
ack timer expires the packet will be acked twice.
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesper Dangaard Brouer <hawk@comx.dk> reported a bug when setting a VLAN
device down that is in promiscous mode:
When the VLAN device is set down, the promiscous count on the real
device is decremented by one by vlan_dev_stop(). When removing the
promiscous flag from the VLAN device afterwards, the promiscous
count on the real device is decremented a second time by the
vlan_change_rx_flags() callback.
The root cause for this is that the ->change_rx_flags() callback is
invoked while the device is down. The synchronization is meant to mirror
the behaviour of the ->set_rx_mode callbacks, meaning the ->open function
is responsible for doing a full sync on open, the ->close() function is
responsible for doing full cleanup on ->stop() and ->change_rx_flags()
is meant to do incremental changes while the device is UP.
Only invoke ->change_rx_flags() while the device is UP to provide the
intended behaviour.
Tested-by: Jesper Dangaard Brouer <jdb@comx.dk>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Despite the fact that cloned rpc clients won't have the cl_autobind flag
set, they may still find themselves calling rpcb_getport_async(). For this
to happen, it suffices for a _parent_ rpc_clnt to use autobinding, in which
case any clone may find itself triggering the !xprt_bound() case in
call_bind().
The correct fix for this is to walk back up the tree of cloned rpc clients,
in order to find the parent that 'owns' the transport, either because it
has clnt->cl_autobind set, or because it originally created the
transport...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Basically, try_module_get here are pretty useless. Any other module using
this API will pin sunrpc in memory due using exported symbols.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The content of init_ipv6_mibs/cleanup_ipv6_mibs will be moved to new
calls one by one next.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
idev has been stored on seq->private. NULL has been stored for global
statistics.
The situation is changed with net namespace. We need to store pointer to
struct net and the only place is seq->private. So, we'll have for
/proc/net/dev_snmp6/* and for /proc/net/snmp6 pointers of two different
types stored in the same field.
This effectively requires to separate seq_ops of these files.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simple, comsolidate sockstat6 staff in one place, at the beginning of
the file. Right now sockstat6_seq_open/sockstat6_seq_fops looks like an
intrusion in the middle of snmp6 code.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
I'm quite sure that if I give this function in its old format
for you to inspect, you start to wonder what is the type of
demanded or if it's a global variable.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
It all started from me noticing that this urgent check in
tcp_clean_rtx_queue is unnecessarily inside the loop. Then
I took a longer look to it and found out that the users of
urg_mode can trivially do without, well almost, there was
one gotcha.
Bonus: those funny people who use urg with >= 2^31 write_seq -
snd_una could now rejoice too (that's the only purpose for the
between being there, otherwise a simple compare would have done
the thing). Not that I assume that the rest of the tcp code
happily lives with such mind-boggling numbers :-). Alas, it
turned out to be impossible to set wmem to such numbers anyway,
yes I really tried a big sendfile after setting some wmem but
nothing happened :-). ...Tcp_wmem is int and so is sk_sndbuf...
So I hacked a bit variable to long and found out that it seems
to work... :-)
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add some packet-split receive hooks.
For one this allows to do NUMA node affine page allocs. Later on these
hooks will be extended to do emergency reserve allocations for
fragments.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wrap calling sk->sk_backlog_rcv() in a function. This will allow extending the
generic sk_backlog_rcv behaviour.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: David S. Miller <davem@davemloft.net>
This makes that ip6_route_net_init() does all of the route init code.
There used to be a race between ip6_route_net_init() and ip6_net_init()
and someone relying on the combined result was left out cold.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip6_route_net_init() error handling looked less than solid, fix 'er up.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the socket cached in the skb if it's present.
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To be able to use the cached socket reference in the skb during input
processing we add a new set of lookup functions that receive the skb on
their argument list.
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To be able to use the cached socket reference in the skb during input
processing we add a new set of lookup functions that receive the skb on
their argument list.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
The association response IEs are sent to userland with an IWEVCUSTOM
event, which unfortunately is limited to a little more than 100 bytes
of IE information with the encoding used. Many APs send so much
IE information that this message overflows. When the IWEVCUSTOM
event is too large, the kernel doesn't send it to userland anyway --
better just not to send it.
An attempt was made by Jouni Malinen to correct this issue by
converting to use IWEVASSOCREQIE and IWEVASSOCRESPIE messages instead
("mac80211: Use IWEVASSOCREQIE instead of IWEVCUSTOM"). Unfortunately,
that caused a problem due to 32-/64-bit interactions on some systems and
was reverted after the 'userland ABI' rule was invoked. That leaves
us with this option instead of a proper fix, at least until we move
to a cfg80211-based solution.
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This patch adjusts the rate control API to allow multi-rate retry
if supported by the driver. The ieee80211_hw struct specifies how
many alternate rate selections the driver supports.
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Free up 2 bytes in skb->cb to be used for multi-rate retry later.
Move iv_len and icv_len initialization into key alloc.
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The LED state was not being updated by rfkill_force_state(), which
will cause regressions in wireless drivers that had old-style rfkill
support and are updated to use rfkill_force_state().
The LED state was not being updated when a change was detected through
the rfkill->get_state() hook, either.
Move the LED trigger update calls into notify_rfkill_state_change(),
where it should have been in the first place. This takes care of both
issues above.
Signed-off-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Acked-by: Ivo van Doorn <IvDoorn@gmail.com>
Cc: stable@kernel.org
Signed-off-by: John W. Linville <linville@tuxdriver.com>
- This patch (against the linux-wireless-next git tree) removes a
redundant check in ieee80211_master_start_xmit (net/mac80211/tx.c)
and adjust indentation in this method accordingly.
In this method, there is no need to call again the
ieee80211_is_data() method; this is checked immediately before, in the
"if" command (we will not enter this block unless ieee80211_is_data()
is true, so that the "and" (&&) condition in that "if" command will be
fullfilled ).
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This patch removes doubly defined variables in ieee80211_master_start_xmit
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Restore revert "mac80211: Use IWEVASSOCREQIE instead of IWEVCUSTOM",
originally reverted in commit bf7394ccc1.
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Since IPVS now has partial IPv6 support, this patch moves IPVS from
net/ipv4/ipvs to net/netfilter/ipvs. It's a result of:
$ git mv net/ipv4/ipvs net/netfilter
and adapting the relevant Kconfigs/Makefiles to the new path.
Signed-off-by: Julius Volz <juliusv@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
While debugging another bug it was found that NetRom socks
are sometimes seen unorphaned in sk_free(). This patch moves
sock_orphan() in nr_release() to the beginning (like in ax25,
or rose).
Reported-and-tested-by: Bernard Pidoux f6bvp <f6bvp@free.fr>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we reverted 30902dc3cb ("ax25: Fix
std timer socket destroy handling.") we have to put some kind of fix
in to cure the issue whereby unaccepted connections do not get destroyed.
The approach used here is from Tihomir Heidelberg - 9a4gl
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 30902dc3cb.
It causes all kinds of problems, based upon a report by
Bernard (f6bvp) and analysis by Jarek Poplawski.
Signed-off-by: David S. Miller <davem@davemloft.net>
The inititator/responder resources in the event have been swapped. They
no represent what the local peer would set their values to in order to
match the peer. Note that iWARP does not exchange these on the wire and
the provider is simply putting in the local device max.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Update the svc_rdma_send_error code to use the DMA LKEY which is valid
regardless of the memory registration strategy in use.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Use FRMR to map local RPC reply data. This allows RDMA_WRITE to send reply
data using a single WR. The FRMR is invalidated by linking the LOCAL_INV WR
to the RDMA_SEND message used to complete the reply.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
RPCRDMA requests that specify a read-list are fetched with RDMA_READ. Using
an FRMR to map the data sink improves NFSRDMA security on transports that
place the RDMA_READ data sink LKEY on the wire because the valid lifetime
of the MR is only the duration of the RDMA_READ. The LKEY is invalidated
when the last RDMA_READ WR completes.
Mapping the data sink also allows for very large amounts to data to be
fetched with a single WR, so if the client is also using FRMR, the entire
RPC read-list can be fetched with a single WR.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
WR can be submitted as linked lists of WR. Update the svc_rdma_send
routine to handle WR chains. This will be used to submit a WR that
uses an FRMR with another WR that invalidates the FRMR.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Update the svc_rdma_post_recv routine to use the adapter's global LKEY
instead of sc_phys_mr which is only valid when using a DMA MR.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Fast Reg MR introduces a new WR type. Add a service to register the
region with the adapter and update the completion handling to support
completions with a NULL WR context.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Query the device capabilities in the svc_rdma_accept function to determine
what advanced memory management capabilities are supported by the device.
Based on the query, select the most secure model available given the
requirements of the transport and capabilities of the adapter.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Add services for the allocating, freeing, and unmapping Fast Reg MR. These
services will be used by the transport connection setup, send and receive
routines.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
qdisc->requeue was planned to universally replace all requeuing code,
but at the top level we never requeue more than one skb, so qdisc->
gso_skb is enough for this. qdisc->requeue would be used on the lower
levels only for one level deep requeuing (like in sch_hfsc) after
finishing all the changes.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jay Cliburn noticed and diagnosed a bug triggered in
dev_gso_skb_destructor() after last change from qdisc->gso_skb
to qdisc->requeue list. Since gso_segmented skbs can't be queued
to another list this patch brings back qdisc->gso_skb for them.
Reported-by: Jay Cliburn <jcliburn@gmail.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Provides implementation of the enhancements of XFRM/PF_KEY MIGRATE mechanism
specified in draft-ebalard-mext-pfkey-enhanced-migrate-00. Defines associated
PF_KEY SADB_X_EXT_KMADDRESS extension and XFRM/netlink XFRMA_KMADDRESS
attribute.
Signed-off-by: Arnaud Ebalard <arno@natisbad.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This protocol provides some connection handling and negotiated
congestion control. Nokia cellular modems use it for bulk transfers.
It provides packet boundaries (hence SOCK_SEQPACKET). Congestion
control is per packet rather per byte, so we do not re-use the
generic socket memory accounting.
Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Clean up: The svc_addsock() function no longer uses its "proto"
argument, so remove it.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The INIT perameter carries the adapatation value in network-byte
order. We need to store it in host byte order as expected
by data types and the user API.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
This patch enables cookie-echo retransmission transport switch
feature. If COOKIE-ECHO retransmission happens, it will be sent
to the address other than the one last sent to.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
RFC3873 defined SCTP_MIB_OUTOFBLUES:
sctpOutOfBlues OBJECT-TYPE
SYNTAX Counter32
MAX-ACCESS read-only
STATUS current
DESCRIPTION
"The number of out of the blue packets received by the host.
An out of the blue packet is an SCTP packet correctly formed,
including the proper checksum, but for which the receiver was
unable to identify an appropriate association."
REFERENCE
"Section 8.4 in RFC2960 deals with the Out-Of-The-Blue
(OOTB) packet definition and procedures."
But OOTB packet INIT, INIT-ACK and SHUTDOWN-ACK(COOKIE-WAIT or
COOKIE-ECHOED state) are not counted by SCTP_MIB_OUTOFBLUES.
Case 1(INIT):
Endpoint A Endpoint B
(CLOSED) (CLOSED)
INIT ---------->
<---------- ABORT
Case 2(INIT-ACK):
Endpoint A Endpoint B
(CLOSED) (CLOSED)
INIT-ACK ---------->
<---------- ABORT
Case 3(SHUTDOWN-ACK):
Endpoint A Endpoint B
(CLOSED) (CLOSED)
<---------- INIT
SHUTDOWN-ACK ---------->
<---------- SHUTDOWN-COMPLETE
Case 4(SHUTDOWN-ACK):
Endpoint A Endpoint B
(CLOSED) (COOKIE-ECHOED)
SHUTDOWN-ACK ---------->
<---------- SHUTDOWN-COMPLETE
This patch fixed the problem.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
RFC 4960: Section 9.2
The sender of the SHUTDOWN MAY also start an overall guard timer
'T5-shutdown-guard' to bound the overall time for the shutdown
sequence. At the expiration of this timer, the sender SHOULD abort
the association by sending an ABORT chunk. If the 'T5-shutdown-
guard' timer is used, it SHOULD be set to the recommended value of 5
times 'RTO.Max'.
The timer 'T5-shutdown-guard' is used to counter the overall time
for shutdown sequence, and it's start by the sender of the SHUTDOWN.
So timer 'T5-shutdown-guard' should be start when we send the first
SHUTDOWN chunk and enter the SHUTDOWN-SENT state, not start when we
receipt of the SHUTDOWN primitive and enter SHUTDOWN-PENDING state.
If 'T5-shutdown-guard' timer is start at SHUTDOWN-PENDING state, the
association may be ABORT while data is still transmitting.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
sctp_is_any() function that is used to check for wildcard addresses
only looks at the address itself to determine the address family.
This function is used in the API to check the address passed in from
the user. If the user simply zerroes out the sockaddr_storage and
pass that in, we'll end up failing. So, let's try harder to determine
the address family by also checking the socket if it's possible.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
sctp_chunks should be put on a diet. This is some of the low hanging
fruit that we can strip out. Changes all the __s8/__u8 flags to
bitfields. Saves 12 bytes per chunk.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Chunks placed on the retransmit list are marked as inelegible
for fast retrasnmission. Since missing indications determine
when fast reransmission is done, there is not point in calling
sctp_mark_missing() on the retransmit list since those chunks
will not be marked.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
There is a possibility of walking the transport list twice during
SACK processing when doing SFR-CACC algorithm. We can restructure
the code to only do this once.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Frist small step in optimizing SACK processing. Do not call
sctp_mark_missing() when there are no gaps reported and thus
not missing chunks.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
The iptables tproxy code has to be able to do UDP socket hash lookups,
so we have to provide an exported lookup function for this purpose.
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current TCP code relies on the local port of the listening socket
being the same as the destination address of the incoming
connection. Port redirection used by many transparent proxying
techniques obviously breaks this, so we have to store the original
destination port address.
This patch extends struct inet_request_sock and stores the incoming
destination port value there. It also modifies the handshake code to
use that value as the source port when sending reply packets.
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Netfilter's ip_route_me_harder() tries to re-route packets either
generated or re-routed by Netfilter. This patch changes
ip_route_me_harder() to handle packets from non-locally-bound sockets
with IP_TRANSPARENT set as local and to set the appropriate flowi
flags when re-doing the routing lookup.
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
The TCP stack sends out SYN+ACK/ACK/RST reply packets in response to
incoming packets. The non-local source address check on output bites
us again, as replies for transparently redirected traffic won't have a
chance to leave the node.
This patch selectively sets the FLOWI_FLAG_ANYSRC flag when doing the
route lookup for those replies. Transparent replies are enabled if the
listening socket has the transparent socket flag set.
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_iif() in inet_sock.h requires route.h. Since users of inet_iif()
usually require other route.h functionality anyway this patch moves
inet_iif() to route.h.
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Setting IP_TRANSPARENT is not really useful without allowing non-local
binds for the socket. To make user-space code simpler we allow these
binds even if IP_TRANSPARENT is set but IP_FREEBIND is not.
Signed-off-by: Tóth László Attila <panther@balabit.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces the IP_TRANSPARENT socket option: enabling that
will make the IPv4 routing omit the non-local source address check on
output. Setting IP_TRANSPARENT requires NET_ADMIN capability.
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_route_output() contains a check to make sure that no flows with
non-local source IP addresses are routed. This obviously makes using
such addresses impossible.
This patch introduces a flowi flag which makes omitting this check
possible. The new flag provides a way of handling transparent and
non-transparent connections differently.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the caller of pskb_expand_head specifies a negative nhead
we'll silently overwrite other people's memory. This patch
makes it BUG instead.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu came up with the idea and the original patch to make
xfrm_state dump list contain also dumpers:
As it is we go to extraordinary lengths to ensure that states
don't go away while dumpers go to sleep. It's much easier if
we just put the dumpers themselves on the list since they can't
go away while they're going.
I've also changed the order of addition on new states to prevent
a never-ending dump.
Timo Teräs improved the patch to apply cleanly to latest tree,
modified iteration code to be more readable by using a common
struct for entries in the list, implemented the same idea for
xfrm_policy dumping and moved the af_key specific "last" entry
caching to af_key.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Timo Teras <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a xfrm_{state,policy}_walk leak if pfkey socket is closed while
dumping is on-going.
Signed-off-by: Timo Teras <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds skb_recycle_check(), which can be used by a network
driver after transmitting an skb to check whether this skb can be
recycled as a receive buffer.
skb_recycle_check() checks that the skb is not shared or cloned, and
that it is linear and its head portion large enough (as determined by
the driver) to be recycled as a receive buffer. If these conditions
are met, it does any necessary reference count dropping and cleans
up the skbuff as if it just came from __alloc_skb().
Signed-off-by: Lennert Buytenhek <buytenh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following actions are possible:
tcp_v6_rcv
skb->dev = NULL;
tcp_v6_do_rcv
tcp_v6_hnd_req
tcp_check_req
req->rsk_ops->send_ack == tcp_v6_send_ack
So, skb->dev can be NULL in tcp_v6_send_ack. We must obtain namespace
from dst entry.
Thanks to Vitaliy Gusev <vgusev@openvz.org> for initial problem finding
in IPv4 code.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When CONFIG_MAC80211_MESH=n and CONFIG_MAC80211_NOINLINE=y,
gcc doesn't optimize out a call to ieee80211_rx_h_mesh_fwding,
even if the previous comparison is always false in this case.
This leads to the following errors during modpost:
ERROR: "mpp_path_lookup" [net/mac80211/mac80211.ko] undefined!
ERROR: "mpp_path_add" [net/mac80211/mac80211.ko] undefined!
Fix by removing the possibility of uninlining
ieee80211_rx_h_mesh_fwding rx handler.
Signed-off-by: Davide Pesavento <davidepesa@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This patch removes wme_tx_queue and wme_rx_queue from struct sta_info
and from the debugfs sub-structure of struct sta_info
in net/mac80211/sta_info.h, as they are useless and not used.
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
In "mac80211: make master iface not wireless" I accidentally
forgot to include these changes ... leading to the expected
BUG_ON errors.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Since call to function sctp_sf_abort_violation() need paramter 'arg' with
'struct sctp_chunk' type, it will read the chunk type and chunk length from
the chunk_hdr member of chunk. But call to sctp_sf_violation_paramlen()
always with 'struct sctp_paramhdr' type's parameter, it will be passed to
sctp_sf_abort_violation(). This may cause kernel panic.
sctp_sf_violation_paramlen()
|-- sctp_sf_abort_violation()
|-- sctp_make_abort_violation()
This patch fixed this problem. This patch also fix two place which called
sctp_sf_violation_paramlen() with wrong paramter type.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add more docbook comments to network device functions and cleanup
the comments.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev_change_name and netdev_drivername should use const char on
parameters that are read-only input values. The strcpy to newname is
not needed since newname is not used later in function.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We're never supposed to shrink the headroom or tailroom. In fact,
shrinking the headroom is a fatal action.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
since commit ff7d9756b5
"nfsd: use static memory for callback program and stats"
do_probe_callback uses a static callback program
(NFS4_CALLBACK) rather than the one set in clp->cl_callback.cb_prog
as passed in by the client in setclientid (4.0)
or create_session (4.1).
This patches introduces rpc_create_args.prognumber that allows
overriding program->number when creating rpc_clnt.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The RPCB XDR functions are used for multiple procedures. For instance,
rpcb_encode_getaddr() is used for RPCB_GETADDR, RPCB_SET, and
RPCB_UNSET. Make the XDR debug messages more generic so they are less
confusing.
And, unlike in other RPC consumers in the kernel, a single debug flag
enables all levels of debug messages in the RPC bind client, including
XDR debug messages. Since the XDR decoders already report success or
failure in this case, remove redundant debug messages in the mid-level
rpcb_register_call() function.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
With the new rpcbind code, a PMAP_UNSET will not have any effect on
services registered via rpcbind v3 or v4.
Implement a version of svc_unregister() that uses an RPCB_UNSET with
an empty netid string to make sure we have cleared *all* entries for
a kernel RPC service when shutting down, or before starting a fresh
instance of the service.
Use the new version only when CONFIG_SUNRPC_REGISTER_V4 is enabled;
otherwise, the legacy PMAP version is used to ensure complete
backwards-compatibility with the Linux portmapper daemon.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: When doing an RPCB_SET, make the kernel's rpcb client use the
shorthand "::" for the universal form of the IPv6 ANY address.
Without this patch, rpcbind will advertise:
0000:0000:0000:0000:0000:0000:0000:0000.x.y
This is cosmetic only. It cleans up the display of information from
/sbin/rpcinfo.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
TI-RPC is a user-space library of RPC functions that replaces ONC RPC
and allows RPC to operate in the new world of IPv6.
TI-RPC combines the concept of a transport protocol (UDP and TCP)
and a protocol family (PF_INET and PF_INET6) into a single identifier
called a "netid." For example, "udp" means UDP over IPv4, and "udp6"
means UDP over IPv6.
For rpcbind, then, the RPC service tuple that is registered and
advertised is:
[RPC program, RPC version, service address and port, netid]
instead of
[RPC program, RPC version, port, protocol]
Service address is typically ANYADDR, but can be a specific address
of one of the interfaces on a multi-homed host. The third item in
the new tuple is expressed as a universal address.
The current Linux rpcbind implementation registers a netid for both
protocol families when RPCB_SET is done for just the PF_INET6 version
of the netid (ie udp6 or tcp6). So registering "udp6" causes a
registration for "udp" to appear automatically as well.
We've recently determined that this is incorrect behavior. In the
TI-RPC world, "udp6" is not meant to imply that the registered RPC
service handles requests from AF_INET as well, even if the listener
socket does address mapping. "udp" and "udp6" are entirely separate
capabilities, and must be registered separately.
The Linux kernel, unlike TI-RPC, leverages address mapping to allow a
single listener socket to handle requests for both AF_INET and AF_INET6.
This is still OK, but the kernel currently assumes registering "udp6"
will cover "udp" as well. It registers only "udp6" for it's AF_INET6
services, even though they handle both AF_INET and AF_INET6 on the same
port.
So svc_register() actually needs to register both "udp" and "udp6"
explicitly (and likewise for TCP). Until rpcbind is fixed, the
kernel can ignore the return code for the second RPCB_SET call.
Please merge this with commit 15231312:
SUNRPC: Support IPv6 when registering kernel RPC services
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Olaf Kirch <okir@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
In order to advertise NFS-related services on IPv6 interfaces via
rpcbind, the kernel RPC server implementation must use
rpcb_v4_register() instead of rpcb_register().
A new kernel build option allows distributions to use the legacy
v2 call until they integrate an appropriate user-space rpcbind
daemon that can support IPv6 RPC services.
I tried adding some automatic logic to fall back if registering
with a v4 protocol request failed, but there are too many corner
cases. So I just made it a compile-time switch that distributions
can throw when they've replaced portmapper with rpcbind.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Create a separate server-level interface for unregistering RPC services.
The mechanics of, and the API for, registering and unregistering RPC
services will diverge further as support for IPv6 is added.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Bruce suggested there's no need to expose the difference between an error
sending the PMAP_SET request and an error reply from the portmapper to
rpcb_register's callers. The user space equivalent of rpcb_register() is
pmap_set(3), which returns a bool_t : either the PMAP set worked, or it
didn't. Simple.
So let's remove the "*okay" argument from rpcb_register() and
rpcb_v4_register(), and simply return an error if any part of the call
didn't work.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
My plan is to use an AF_INET listener on systems that support only IPv4,
and an AF_INET6 listener on systems that can support IPv6. Incoming
IPv4 packets will be posted to an AF_INET6 listener with a mapped IPv4
address.
Max Matveev <makc@sgi.com> says:
Creating a single listener can be dangerous - if net.ipv6.bindv6only
is enabled then it's possible to create another listener in v4
namespace on the same port and steal the traffic from the "unifed"
listener. You need to disable V6ONLY explicitly via a sockopt to stop
that.
Set appropriate socket option on RPC server listener sockets to prevent
this.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Teach svc_create_xprt() to use the correct ANY address for AF_INET6 based
RPC services.
No caller uses AF_INET6 yet.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Introduce and initialize an address family field in the svc_serv structure.
This field will determine what family to use for the service's listener
sockets and what families are advertised via the local rpcbind daemon.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The current code ignores rules for internal options in HBH/DST options
header in packet processing if 'Not strict' mode is specified (which is not
implemented). Clearly it is not expected by user.
Kernel should reject HBH/DST rule insertion with 'Not strict' mode
in the first place.
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
9p: fix put_data error handling
9p: use an IS_ERR test rather than a NULL test
9p: introduce missing kfree
9p-trans_fd: fix and clean up module init/exit paths
9p-trans_fd: don't do fs segment mangling in p9_fd_poll()
9p-trans_fd: clean up p9_conn_create()
9p-trans_fd: fix trans_fd::p9_conn_destroy()
9p: implement proper trans module refcounting and unregistration
Abhishek Kulkarni pointed out an inconsistency in the way
errors are returned from p9_put_data. On deeper exploration it
seems the error handling for this path was completely wrong.
This patch adds checks for allocation problems and propagates
errors correctly.
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Error handling code following a kmalloc should free the allocated data.
The semantic match that finds the problem is as follows:
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@r exists@
local idexpression x;
statement S;
expression E;
identifier f,l;
position p1,p2;
expression *ptr != NULL;
@@
(
if ((x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...)) == NULL) S
|
x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...);
...
if (x == NULL) S
)
<... when != x
when != if (...) { <+...x...+> }
x->f = E
...>
(
return \(0\|<+...x...+>\|ptr\);
|
return@p2 ...;
)
@script:python@
p1 << r.p1;
p2 << r.p2;
@@
print "* file: %s kmalloc %s return %s" % (p1[0].file,p1[0].line,p2[0].line)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
trans_fd leaked p9_mux_wq on module unload. Fix it. While at it,
collapse p9_mux_global_init() into p9_trans_fd_init(). It's easier to
follow this way and the global poll_tasks array is about to removed
anyway.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
p9_fd_poll() is never called with user pointers and f_op->poll()
doesn't expect its arguments to be from userland. There's no need to
set kernel ds before calling f_op->poll() from p9_fd_poll(). Remove
it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
* Use kzalloc() to allocate p9_conn and remove 0/NULL initializations.
* Clean up error return paths.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
p9_conn_destroy() first kills all current requests by calling
p9_conn_cancel(), then waits for the request list to be cleared by
waiting on p9_conn->equeue. After that, polling is stopped and the
trans is destroyed. This sequence has a few problems.
* Read and write works were never cancelled and the p9_conn can be
destroyed while the works are running as r/w works remove requests
from the list and dereference the p9_conn from them.
* The list emptiness wait using p9_conn->equeue wouldn't trigger
because p9_conn_cancel() always clears all the lists and the only
way the wait can be triggered is to have another task to issue a
request between the slim window between p9_conn_cancel() and the
wait, which isn't safe under the current implementation with or
without the wait.
This patch fixes the problem by first stopping poll, which can
schedule r/w works, first and cancle r/w works which guarantees that
r/w works are not and will not run from that point and then calling
p9_conn_cancel() and do the rest of destruction.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
9p trans modules aren't refcounted nor were they unregistered
properly. Fix it.
* Add 9p_trans_module->owner and reference the module on each trans
instance creation and put it on destruction.
* Protect v9fs_trans_list with a spinlock. This isn't strictly
necessary as the list is manipulated only during module loading /
unloading but it's a good idea to make the API safe.
* Unregister trans modules when the corresponding module is being
unloaded.
* While at it, kill unnecessary EXPORT_SYMBOL on p9_trans_fd_init().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
There are two improvements in this simple patch:
1. wiphy_counter is a static var only used in one function, so
can use local static instead of global static;
2. wiphy_counter wrap handling killed one comparision;
Signed-off-by: Denis ChengRq <crquan@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This patch fixes the led behavior in IBSS. After we joined an IBSS cell we
need to inform the led that we got associated. Although there is no 802.11
association in IBSS mode, the semantic of "There is a link" is relevant.
This allows the led to blink in IBSS mode (at least this solves a bug for
iwlwifi).
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Reviewed-by: Tomas Winkler <tomas.winkler@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
While associated, we should probe with the SSID we're associated to,
not the scan SSID.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Tomas Winkler <tomas.winkler@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Long awaited, hard work. This patch totally cleans up the rate control
API to remove the requirement to include internal headers outside of
net/mac80211/.
There's one internal use in the PID algorithm left for mesh networking,
we'll have to figure out a way to clean that one up and decide how to
do the peer link evaluation, possibly independent of the rate control
algorithm or via new API.
Additionally, ath9k is left using the cross-inclusion hack for now, we
will add new API where necessary to make this work properly, but right
now I'm not expert enough to do it. It's still off better than before.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When I split off the action frame handling I made the code drop
all action frames we don't want to handle. This is wrong since
some action frames are actually handled via rx_h_mgmt through
being queued to the sta/mesh implementations.
Thanks to Li YanBo for noticing the problem.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Cc: Li YanBo <dreamfly281@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Currently the mesh code doesn't support bridging mesh point interfaces
with wired ethernet or AP to construct an MPP or MAP. This patch adds
code to support the "6 address frame format packet" functionality to
mesh point interfaces. Now the mesh network can be used as backhaul
for end to end communication.
Signed-off-by: Li YanBo <dreamfly281@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Reject configuring mesh-id for non-mesh, monitor flags for non-monitor.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
We currently allow monitor flags changes and mesh ID changes when
the interface is up, which can lead to trouble. Change it to only
allow when down.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This patch makes cfg80211 show the interface in the nl80211
information about a specific interface. API users are required
to keep the type updated (everything else is fairly complicated)
but you will get a warning if you fail to keep it updated.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
There's no need to register the master netdev with cfg80211,
in fact, this is quite dangerous and lead to having to add
checks for the master interface all over the config handlers.
This patch removes the "ieee80211_ptr" from the master iface
in favour of having a small netdev_priv() associated with
the master interface that stores the ieee80211_local pointer.
Because of this, a lot of code in the configuration handlers
can go away. To make this patch easier to verify I have also
removed a number of wiphy_priv() calls in favour of getting
the sdata first and then the local pointer from that.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The statically defined regdomains are used in a very convoluted
way, use them instead to prime the information we have and then
continue operating normally.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
A few pointers and structures in the regulatory code are const,
but because it wasn't done properly a whole bunch of bogus
casts were needed to compile without warning. Mark everything
const properly to avoid that kind of junk code.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
The recent code from Luis is an #ifdef hell and contains lots of
code that's stuffed into the wrong file making a whole bunch of
things needlessly non-static, and besides, what is it doing in
core.c??
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When Luis added the static regdomains back he used +/-20
of the centre frequencies to account for 40MHz bandwidth
neglecting the fact that 40MHz bandwidth cannot be used
on the channels close to the allowed band edges.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This patch removes the potentially allocated ifalias when the (new) given alias is empty.
E.g. when setting
echo "" > /sys/class/net/eth0/ifalias
Signed-off-by: Oliver Hartkopp <oliver@hartkopp.net>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The reasons for disabling paccept() are as follows:
* The API is more complex than needed. There is AFAICS no demonstrated
use case that the sigset argument of this syscall serves that couldn't
equally be served by the use of pselect/ppoll/epoll_pwait + traditional
accept(). Roland seems to concur with this opinion
(http://thread.gmane.org/gmane.linux.kernel/723953/focus=732255). I
have (more than once) asked Ulrich to explain otherwise
(http://thread.gmane.org/gmane.linux.kernel/723952/focus=731018), but he
does not respond, so one is left to assume that he doesn't know of such
a case.
* The use of a sigset argument is not consistent with other I/O APIs
that can block on a single file descriptor (e.g., read(), recv(),
connect()).
* The behavior of paccept() when interrupted by a signal is IMO strange:
the kernel restarts the system call if SA_RESTART was set for the
handler. I think that it should not do this -- that it should behave
consistently with paccept()/ppoll()/epoll_pwait(), which never restart,
regardless of SA_RESTART. The reasoning here is that the very purpose
of paccept() is to wait for a connection or a signal, and that
restarting in the latter case is probably never useful. (Note: Roland
disagrees on this point, believing that rather paccept() should be
consistent with accept() in its behavior wrt EINTR
(http://thread.gmane.org/gmane.linux.kernel/723953/focus=732255).)
I believe that instead, a simpler API, consistent with Ulrich's other
recent additions, is preferable:
accept4(int fd, struct sockaddr *sa, socklen_t *salen, ind flags);
(This simpler API was originally proposed by Ulrich:
http://thread.gmane.org/gmane.linux.network/92072)
If this simpler API is added, then if we later decide that the sigset
argument really is required, then a suitable bit in 'flags' could be added
to indicate the presence of the sigset argument.
At this point, I am hoping we either will get a counter-argument from
Ulrich about why we really do need paccept()'s sigset argument, or that he
will resubmit the original accept4() patch.
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Alan Cox <alan@redhat.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Check in dequeue_skb() the state of tx_queue for requeued skb to save
on locking and re-requeuing, and possibly remove the current check in
qdisc_run(). Based on the idea of Alexander Duyck.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no reason to call into the complicated qdiscs
just to remember the last SKB where we found the device
blocked.
The SKB is outside of the qdiscs realm at this point.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch add support for keeping an additional character alias
associated with an network interface. This is useful for maintaining
the SNMP ifAlias value which is a user defined value. Routers use this
to hold information like which circuit or line it is connected to. It
is just an arbitrary text label on the network device.
There are two exposed interfaces with this patch, the value can be
read/written either via netlink or sysfs.
This could be maintained just by the snmp daemon, but it is more
generally useful for other management tools, and the kernel is good
place to act as an agreed upon interface to store it.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When there is no listener socket for a received packet, send an error
back to the sender.
Signed-off-by: Remi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Phonet endpoints are bound to individual ports.
This provides a /proc/sys/net/phonet (or sysctl) interface for
selecting the range of automatically allocated ports (much like the
ip_local_port_range with IPv4).
Signed-off-by: Remi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This provides the basic SOCK_DGRAM transport protocol for Phonet.
Signed-off-by: Remi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This provides the socket API for the Phonet protocols family.
Signed-off-by: Remi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This provides support for configuring Phonet addresses, notifying
Phonet configuration changes, and dumping the configuration.
Signed-off-by: Remi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This provides support for adding Phonet addresses to and removing
Phonet addresses from network devices.
Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the basis for the Phonet protocol families, and introduces
the ETH_P_PHONET packet type and the PF_PHONET socket family.
Signed-off-by: Remi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As discovered by Timo Teräs, the currently xfrm_state_walk scheme
is racy because if a second dump finishes before the first, we
may free xfrm states that the first dump would walk over later.
This patch fixes this by storing the dumps in a list in order
to calculate the correct completion counter which cures this
problem.
I've expanded netlink_cb in order to accomodate the extra state
related to this. It shouldn't be a big deal since netlink_cb
is kmalloced for each dump and we're just increasing it by 4 or
8 bytes.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The break after the return serves no purpose.
Signed-off-by: Julia Lawall <julia@diku.dk>
Reviewed-by: Richard Genoud <richard.genoud@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function localtime_3 in xt_time.c gives a wrong monthday in a leap
year after 28th 2. calculating monthday should use the array
days_since_leapyear[] not days_since_year[] in a leap year.
Signed-off-by: Kaihui Luo <kaih.luo@gmail.com>
Acked-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This minor cleanup simplifies later changes which will convert
struct sk_buff and friends over to using struct list_head.
Signed-off-by: David S. Miller <davem@davemloft.net>
The nr_conns variable in the sync message header is only eight bits wide
and will overflow on interfaces with a large MTU. As a result the backup
won't parse all connections contained in the sync buffer. On regular
ethernet with an MTU of 1500 this isn't a problem, because we can't
overflow the value, but consider jumbo frames being used on a cross-over
connection between both directors.
We now restrict the size of the sync buffer, so that we never put more
than 255 connections into a single sync buffer.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Signed-off-by: Simon Horman <horms@verge.net.au>
I'm trying to use the TCP_MAXSEG option to setsockopt() to set the MSS
for both sides of a bidirectional connection.
man tcp says: "If this option is set before connection establishment, it
also changes the MSS value announced to the other end in the initial
packet."
However, the kernel only uses the MTU/route cache to set the advertised
MSS. That means if I set the MSS to, say, 500 before calling connect(),
I will send at most 500-byte packets, but I will still receive 1500-byte
packets in reply.
This is a bug, either in the kernel or the documentation.
This patch (applies to latest net-2.6) reduces the advertised value to
that requested by the user as long as setsockopt() is called before
connect() or accept(). This seems like the behavior that one would
expect as well as that which is documented.
I've tried to make sure that things that depend on the advertised MSS
are set correctly.
Signed-off-by: Tom Quetchenbach <virtualphtn@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently dequeueing a packet and requeueing the same packet will cause a
different packet to be pulled on the next dequeue. This change forces
requeue to rewind the current_band.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently simple_tx_hash is hashing inside of udp fragments. As a result
packets are getting getting sent to all queues when they shouldn't be.
This causes a serious performance regression which can be seen by sending
UDP frames larger than mtu on multiqueue devices. This change will make
it so that fragments are hashed only as IP datagrams w/o any protocol
information.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If lost skb is sacked, we might have nothing to retransmit
as high as the retransmit_high is pointing to, so place
it lower to avoid unnecessary walking.
This is mainly for the case where high L'ed skbs gets sacked.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most importantly avoid doing it with cumulative ACK. However,
since we have lost_cnt_hint in the picture as well needing
adjustments, it's not as trivial as dealing with
retransmit_skb_hint (and cannot be done in the all place we
could trivially leave retransmit_skb_hint untouched).
With the previous patch, this should mostly remove O(n^2)
behavior while cumulative ACKs start flowing once rexmit
after a lossy round-trip made it through.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most importantly avoid doing it with cumulative ACK. Not clearing
means that we no longer need n^2 processing in resolution of each
fast recovery.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
This doesn't much sense here afaict, probably never has. Since
fragmenting and collapsing deal the hints by themselves, there
should be very little reason for the rexmit loop to do that.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both loops are quite similar, so they can be combined
with little effort. As a result, forward_skb_hint becomes
obsolete as well.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
The validity of the retransmit_high must then be ensured
if no L'ed skb exits!
This makes a minor change to behavior, we now have to
iterate the head to find out that the loop terminates.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>