Allow privileged users in any user namespace to bind to
privileged sockets in network namespaces they control.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow an unpriviled user who has created a user namespace, and then
created a network namespace to effectively use the new network
namespace, by reducing capable(CAP_NET_ADMIN) and
capable(CAP_NET_RAW) calls to be ns_capable(net->user_ns,
CAP_NET_ADMIN), or capable(net->user_ns, CAP_NET_RAW) calls.
Settings that merely control a single network device are allowed.
Either the network device is a logical network device where
restrictions make no difference or the network device is hardware NIC
that has been explicity moved from the initial network namespace.
In general policy and network stack state changes are allowed
while resource control is left unchanged.
Allow creating raw sockets.
Allow the SIOCSARP ioctl to control the arp cache.
Allow the SIOCSIFFLAG ioctl to allow setting network device flags.
Allow the SIOCSIFADDR ioctl to allow setting a netdevice ipv4 address.
Allow the SIOCSIFBRDADDR ioctl to allow setting a netdevice ipv4 broadcast address.
Allow the SIOCSIFDSTADDR ioctl to allow setting a netdevice ipv4 destination address.
Allow the SIOCSIFNETMASK ioctl to allow setting a netdevice ipv4 netmask.
Allow the SIOCADDRT and SIOCDELRT ioctls to allow adding and deleting ipv4 routes.
Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
adding, changing and deleting gre tunnels.
Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
adding, changing and deleting ipip tunnels.
Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
adding, changing and deleting ipsec virtual tunnel interfaces.
Allow setting the MRT_INIT, MRT_DONE, MRT_ADD_VIF, MRT_DEL_VIF, MRT_ADD_MFC,
MRT_DEL_MFC, MRT_ASSERT, MRT_PIM, MRT_TABLE socket options on multicast routing
sockets.
Allow setting and receiving IPOPT_CIPSO, IP_OPT_SEC, IP_OPT_SID and
arbitrary ip options.
Allow setting IP_SEC_POLICY/IP_XFRM_POLICY ipv4 socket option.
Allow setting the IP_TRANSPARENT ipv4 socket option.
Allow setting the TCP_REPAIR socket option.
Allow setting the TCP_CONGESTION socket option.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- In rtnetlink_rcv_msg convert the capable(CAP_NET_ADMIN) check
to ns_capable(net->user-ns, CAP_NET_ADMIN). Allowing unprivileged
users to make netlink calls to modify their local network
namespace.
- In the rtnetlink doit methods add capable(CAP_NET_ADMIN) so
that calls that are not safe for unprivileged users are still
protected.
Later patches will remove the extra capable calls from methods
that are safe for unprivilged users.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for supporting the creation of network namespaces
by unprivileged users, modify all of the per net sysctl exports
and refuse to allow them to unprivileged users.
This makes it safe for unprivileged users in general to access
per net sysctls, and allows sysctls to be exported to unprivileged
users on an individual basis as they are deemed safe.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
order-5 allocations can fail with current kernels, we should
try vmalloc() as well.
Reported-by: Julien Tinnes <jln@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
Minor conflict due to some IS_ENABLED conversions done
in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently if a socket was repaired with a few packet in a write queue,
a kernel bug may be triggered:
kernel BUG at net/ipv4/tcp_output.c:2330!
RIP: 0010:[<ffffffff8155784f>] tcp_retransmit_skb+0x5ff/0x610
According to the initial realization v3.4-rc2-963-gc0e88ff,
all skb-s should look like already posted. This patch fixes code
according with this sentence.
Here are three points, which were not done in the initial patch:
1. A tcp send head should not be changed
2. Initialize TSO state of a skb
3. Reset the retransmission time
This patch moves logic from tcp_sendmsg to tcp_write_xmit. A packet
passes the ussual way, but isn't sent to network. This patch solves
all described problems and handles tcp_sendpages.
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the offload callbacks into its own structure.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since GSO/GRO support is now separated, make IPv4 GSO a
stand-alone init call and not part of inet_init().
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Switch IPv4 code base to using the new GRO/GSO calls and data.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create a new data structure for IPv4 protocols that holds GRO/GSO
callbacks and a new array to track the protocols that register GRO/GSO.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert to using the new GSO/GRO registration mechanism and new
packet offload structure.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change fixes two sparse warnings triggered by casting the ip addresses
from netlink messages in an u32 instead of be32. This change corrects that
in order to resolve the sparse warnings.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch add the support of 'ip link .. type ipip'.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This parameter was missing in the dump.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netdev_state_change() was called only when end points or link was updated. Now
that all parameters are advertised via netlink, we must advertise any change.
This patch also prepares the support of ipip tunnels management via rtnl. The
code which update tunnels will be put in a new function.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the latest kernel there are two things that must be done post decryption
so that the packet are forwarded.
1. Remove the mark from the packet. This will cause the packet to not match
the ipsec-policy again. However doing this causes the post-decryption check to
fail also and the packet will get dropped. (cat /proc/net/xfrm_stat).
2. Remove the sp association in the skbuff so that no policy check is done on
the packet for VTI tunnels.
Due to #2 above we must now do a security-policy check in the vti rcv path
prior to resetting the mark in the skbuff.
Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com>
Reported-by: Ruben Herold <ruben@puettmann.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The defitions of for_each_ip_tunnel_rcu() are same,
so unify it. Also, don't hide the parameter 't'.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__IPTUNNEL_XMIT() is an ugly macro, convert it to a static
inline function, so make it more readable.
IPTUNNEL_XMIT() is unused, just remove it.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We added support for RFC 5961 in latest kernels but TCP fails
to perform exhaustive check of ACK sequence.
We can update our view of peer tsval from a frame that is
later discarded by tcp_ack()
This makes timestamps enabled sessions vulnerable to injection of
a high tsval : peers start an ACK storm, since the victim
sends a dupack each time it receives an ACK from the other peer.
As tcp_validate_incoming() is called before tcp_ack(), we should
not peform tcp_replace_ts_recent() from it, and let callers do it
at the right time.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Romain Francoise <romain@orebokech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(1<<optname) is undefined behavior in C with a negative optname or
optname larger than 31. In those cases the result of the shift is
not necessarily zero (e.g., on x86).
This patch simplifies the code with a switch statement on optname.
It also allows the compiler to generate better code (e.g., using a
64-bit mask).
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
Minor conflict between the BCM_CNIC define removal in net-next
and a bug fix added to net. Based upon a conflict resolution
patch posted by Stephen Rothwell.
Signed-off-by: David S. Miller <davem@davemloft.net>
It is usefull for daemons that monitor link event to have the full parameters of
these interfaces when a rtnl message is sent.
It allows also to dump them via rtnetlink.
It is based on what is done for GRE tunnels.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 6b78f16e4b (gre: add GSO support) we added GSO support to GRE
tunnels.
This patch does the same for IPIP tunnels.
Performance of single TCP flow over an IPIP tunnel is increased by 40%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We've observed that in case if UDP diag module is not
supported in kernel the netlink returns NLMSG_DONE without
notifying a caller that handler is missed.
This patch makes __inet_diag_dump to return error code instead.
So as example it become possible to detect such situation
and handle it gracefully on userspace level.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: David Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For passive TCP connections using TCP_DEFER_ACCEPT facility,
we incorrectly increment req->retrans each time timeout triggers
while no SYNACK is sent.
SYNACK are not sent for TCP_DEFER_ACCEPT that were established (for
which we received the ACK from client). Only the last SYNACK is sent
so that we can receive again an ACK from client, to move the req into
accept queue. We plan to change this later to avoid the useless
retransmit (and potential problem as this SYNACK could be lost)
TCP_INFO later gives wrong information to user, claiming imaginary
retransmits.
Decouple req->retrans field into two independent fields :
num_retrans : number of retransmit
num_timeout : number of timeouts
num_timeout is the counter that is incremented at each timeout,
regardless of actual SYNACK being sent or not, and used to
compute the exponential timeout.
Introduce inet_rtx_syn_ack() helper to increment num_retrans
only if ->rtx_syn_ack() succeeded.
Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
when we re-send a SYNACK in answer to a (retransmitted) SYN.
Prior to this patch, we were not counting these retransmits.
Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
only if a synack packet was successfully queued.
Reported-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Elliott Hughes <enh@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When sending data into a tcp socket in repair state we should check
for the amount of data being 0 explicitly. Otherwise we'll have an skb
with seq == end_seq in rcv queue, but tcp doesn't expect this to happen
(in particular a warn_on in tcp_recvmsg shoots).
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Reported-by: Giorgos Mavrikas <gmavrikas@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix double sizeof when parsing IPv6 address from
user space because it breaks get/del by specific IPv6 address.
Problem noticed by David Binderman:
https://bugzilla.kernel.org/show_bug.cgi?id=49171
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reading TCP stats when using TCP Illinois congestion control algorithm
can cause a divide by zero kernel oops.
The division by zero occur in tcp_illinois_info() at:
do_div(t, ca->cnt_rtt);
where ca->cnt_rtt can become zero (when rtt_reset is called)
Steps to Reproduce:
1. Register tcp_illinois:
# sysctl -w net.ipv4.tcp_congestion_control=illinois
2. Monitor internal TCP information via command "ss -i"
# watch -d ss -i
3. Establish new TCP conn to machine
Either it fails at the initial conn, or else it needs to wait
for a loss or a reset.
This is only related to reading stats. The function avg_delay() also
performs the same divide, but is guarded with a (ca->cnt_rtt > 0) at its
calling point in update_params(). Thus, simply fix tcp_illinois_info().
Function tcp_illinois_info() / get_info() is called without
socket lock. Thus, eliminate any race condition on ca->cnt_rtt
by using a local stack variable. Simply reuse info.tcpv_rttcnt,
as its already set to ca->cnt_rtt.
Function avg_delay() is not affected by this race condition, as
its called with the socket lock.
Cc: Petr Matousek <pmatouse@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
The following patchset contains fixes for your net tree, two of them
are due to relatively recent changes, one has been a longstanding bug,
they are:
* Fix incorrect usage of rt_gateway in the H.323 helper, from
Julian Anastasov.
* Skip re-route in nf_nat code for ICMP traffic. If CONFIG_XFRM is
enabled, we waste cycles to look up for the route again. This problem
seems to be there since really long time. From Ulrich Weber.
* Fix mismatching section in nf_conntrack_reasm, from Hein Tibosch.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
adds a "hwaddr" to the "IP-Config: Complete" KERN_INFO message
with the dev_addr of the device selected for auto configuration.
Signed-off-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This message allows to get the devconf for an interface.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ICMP tuples have id in src and type/code in dst.
So comparing src.u.all with dst.u.all will always fail here
and ip_xfrm_me_harder() is called for every ICMP packet,
even if there was no NAT.
Signed-off-by: Ulrich Weber <ulrich.weber@sophos.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Make it simple -- just put new nlattr with just sk->sk_shutdown bits.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use same header helpers than tcp_v6_early_demux() because they
are a bit faster, and as they make IPv4/IPv6 versions look
the same.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A packet with an invalid ack_seq may cause a TCP Fast Open socket to switch
to the unexpected TCP_CLOSING state, triggering a BUG_ON kernel panic.
When a FIN packet with an invalid ack_seq# arrives at a socket in
the TCP_FIN_WAIT1 state, rather than discarding the packet, the current
code will accept the FIN, causing state transition to TCP_CLOSING.
This may be a small deviation from RFC793, which seems to say that the
packet should be dropped. Unfortunately I did not expect this case for
Fast Open hence it will trigger a BUG_ON panic.
It turns out there is really nothing bad about a TFO socket going into
TCP_CLOSING state so I could just remove the BUG_ON statements. But after
some thought I think it's better to treat this case like TCP_SYN_RECV
and return a RST to the confused peer who caused the unacceptable ack_seq
to be generated in the first place.
Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a bit TCPI_OPT_SYN_DATA (32) to the socket option TCP_INFO:tcpi_options.
It's set if the data in SYN (sent or received) is acked by SYN-ACK. Server or
client application can use this information to check Fast Open success rate.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A small host typically needs ~10 fib_info structures, so create initial
hash table with 16 slots instead of only one. This removes potential
false sharing and reallocs/rehashes (1->2->4->8->16)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SIOCINQ can use the lock_sock_fast() version to avoid double acquisition
of socket lock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 5961 5.2 [Blind Data Injection Attack].[Mitigation]
All TCP stacks MAY implement the following mitigation. TCP stacks
that implement this mitigation MUST add an additional input check to
any incoming segment. The ACK value is considered acceptable only if
it is in the range of ((SND.UNA - MAX.SND.WND) <= SEG.ACK <=
SND.NXT). All incoming segments whose ACK value doesn't satisfy the
above condition MUST be discarded and an ACK sent back.
Move tcp_send_challenge_ack() before tcp_ack() to avoid a forward
declaration.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_ioctl() tries to take into account if tcp socket received a FIN
to report correct number bytes in receive queue.
But its flaky because if the application ate the last skb,
we return 1 instead of 0.
Correct way to detect that FIN was received is to test SOCK_DONE.
Reported-by: Elliot Hughes <enh@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we can not flush cached pmtu/redirect informations via
the ipv4_sysctl_rtcache_flush sysctl. We need to check the rt_genid
of the old route and reset the nh exeption if the old route is
expired when we bind a new route to a nh exeption.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use be32_to_cpu instead of htonl to keep sparse happy.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.org>
Signed-off-by: David S. Miller <davem@davemloft.net>