This adds destination address-based selection. The old "inverse"
member is overloaded (memory-wise) with a new "flags" variable,
similar to how J.Park did it with xt_string rev 1. Since revision 0
userspace only sets flag 0x1, no great changes are made to explicitly
test for different revisions.
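A minimal sketch of the overloading described above. The structure and
flag names (sel_info, SEL_F_INVERT, SEL_F_DST) are hypothetical and only
the value 0x1 comes from the text; the point is that one bit test can
serve both revisions because revision 0 never stores anything but 0 or 1.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag names; only the value 0x1 is taken from the text. */
#define SEL_F_INVERT 0x1u /* same bit that revision 0 stored in "inverse" */
#define SEL_F_DST    0x2u /* assumed bit for destination-address selection */

struct sel_info {
    union {
        uint32_t inverse; /* revision 0 view: 0 or 1 */
        uint32_t flags;   /* revision 1 view: bit field */
    };
};

/* One test covers both revisions, since rev 0 only ever sets bit 0. */
static bool sel_inverted(const struct sel_info *info)
{
    return (info->flags & SEL_F_INVERT) != 0;
}

/* Only a revision 1 userspace can set this bit. */
static bool sel_by_destination(const struct sel_info *info)
{
    return (info->flags & SEL_F_DST) != 0;
}
```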
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
This patch adds flow-based timestamping for conntracks. We use
two 64-bit variables: one stores the creation timestamp once the
conntrack has been confirmed, the other stores the deletion time.
This conntrack extension is disabled by default; to enable it, run:
echo 1 > /proc/sys/net/netfilter/nf_conntrack_timestamp
This patch saves memory for user-space flow-based loggers such as
ulogd2. In short, ulogd2 does not need to keep a hashtable of
conntracks in user-space to know when they were created and
destroyed; instead, it uses the kernel timestamps. These
nanosecond-resolution timestamps are also useful for a sane IPFIX
implementation in user-space. Other custom user-space applications
can benefit from this via libnetfilter_conntrack.
This patch modifies the /proc output to display the delta time
in seconds since the flow start. You can also obtain the
flow-start date by means of the conntrack-tools.
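As an illustration of the delta shown in /proc, the sketch below derives
whole seconds from two nanosecond timestamps. The structure layout and
the names flow_tstamp and flow_delta_seconds are assumptions made for
this example, not the identifiers used by the patch.

```c
#include <stdint.h>

#define SK_NSEC_PER_SEC 1000000000ULL

/* Hypothetical layout: two 64-bit nanosecond timestamps per flow. */
struct flow_tstamp {
    uint64_t start; /* set once the conntrack is confirmed */
    uint64_t stop;  /* set when the conntrack is destroyed, 0 while alive */
};

/* Whole seconds elapsed since the flow started, clamped at 0. */
static uint64_t flow_delta_seconds(const struct flow_tstamp *ts,
                                   uint64_t now_ns)
{
    if (now_ns <= ts->start)
        return 0;
    return (now_ns - ts->start) / SK_NSEC_PER_SEC;
}
```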
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Add support for SNMP broadcast connection tracking. SNMP broadcast
requests are now paired with the SNMP responses, which allows SNMP
broadcasts to be used with a firewall enabled.
Please refer to the following conversation:
http://marc.info/?l=netfilter-devel&m=125992205006600&w=2
Patrick McHardy wrote:
> > The best solution would be to add generic broadcast tracking, the
> > use of expectations for this is a bit of abuse.
> > The second best choice I guess would be to move the help() function
> > to a shared module and generalize it so it can be used for both.
This patch implements the "second best choice".
Since the netbios-ns conntrack module uses the same helper
functionality as the snmp, only one helper function is added
for both snmp and netbios-ns modules into the new object -
nf_conntrack_broadcast.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
If an skb is to be NF_QUEUE'd, but no program has opened the queue, the
packet is dropped.
This adds a v2 target revision of xt_NFQUEUE that allows packets to
continue through the ruleset instead.
Because the actual queueing happens outside of the target context, the
'bypass' flag has to be communicated back to the netfilter core.
Unfortunately the only choice to do this without adding a new function
argument is to use the target function return value (i.e. the verdict).
In the NF_QUEUE case, the upper 16 bits already contain the queue number
to use. The previous patch reduced NF_VERDICT_MASK to 0xff, i.e.
we now have extra room for a new flag.
If a hook issued a NF_QUEUE verdict, then the netfilter core will
continue packet processing if the queueing hook
returns -ESRCH (== "this queue does not exist") and the new
NF_VERDICT_FLAG_QUEUE_BYPASS flag is set in the verdict value.
Note: If the queue exists, but userspace does not consume packets fast
enough, the skb will still be dropped.
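The verdict layout and the bypass decision can be sketched as follows.
The numeric values are assumptions modelled on the kernel's netfilter
header (low 8 bits for the verdict, queue number in the upper 16 bits,
the bypass flag in between); the SK_ names and helper functions are
local to this example.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values; the SK_ prefix marks them as local to this sketch. */
#define SK_NF_QUEUE               3u
#define SK_VERDICT_MASK           0x000000ffu
#define SK_VERDICT_FLAG_Q_BYPASS  0x00008000u
#define SK_VERDICT_QBITS          16

/* Build a queue verdict: queue number in the upper 16 bits. */
static uint32_t sk_queue_verdict(uint16_t queue, bool bypass)
{
    uint32_t v = ((uint32_t)queue << SK_VERDICT_QBITS) | SK_NF_QUEUE;

    if (bypass)
        v |= SK_VERDICT_FLAG_Q_BYPASS;
    return v;
}

/* Extract the queue number from a verdict. */
static uint16_t sk_verdict_queue(uint32_t verdict)
{
    return (uint16_t)(verdict >> SK_VERDICT_QBITS);
}

/* True if the packet should continue through the ruleset after the
 * queueing attempt failed with err. */
static bool sk_continue_after_queue_error(uint32_t verdict, int err)
{
    return (verdict & SK_VERDICT_MASK) == SK_NF_QUEUE &&
           err == -ESRCH &&
           (verdict & SK_VERDICT_FLAG_Q_BYPASS) != 0;
}
```

Any other error, such as a full queue, still leads to a drop in this
sketch, which mirrors the note above.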
Signed-off-by: Florian Westphal <fwestphal@astaro.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch adds a new netfilter target which creates audit records
for packets traversing a certain chain.
It can be used to record packets which are rejected administratively
as follows:
-N AUDIT_DROP
-A AUDIT_DROP -j AUDIT --type DROP
-A AUDIT_DROP -j DROP
A rule which would typically drop or reject a packet would then
invoke the new chain to record packets before dropping them:
-j AUDIT_DROP
The module is protocol independent and works for iptables, ip6tables
and ebtables.
The following information is logged:
- netfilter hook
- packet length
- incoming/outgoing interface
- MAC src/dst/proto for ethernet packets
- src/dst/protocol address for IPv4/IPv6
- src/dst port for TCP/UDP/UDPLITE
- icmp type/code
Cc: Patrick McHardy <kaber@trash.net>
Cc: Eric Paris <eparis@parisplace.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Thomas Graf <tgraf@redhat.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
One iptables invocation with 135000 rules takes 35 seconds of cpu time
on a recent server, using a 32bit distro and a 64bit kernel.
We eventually trigger NMI/RCU watchdog.
INFO: rcu_sched_state detected stall on CPU 3 (t=6000 jiffies)
COMPAT mode has quadratic behavior and consumes 16 bytes of memory per
rule.
Switch the xt_compat algorithms to use an array instead of a list, and
use a binary search to locate an offset in the sorted array.
This halves the memory needed (8 bytes per rule) and removes the
quadratic behavior [ O(N*N) -> O(N*log2(N)) ]
The iptables run time drops from 35 s to 150 ms.
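The array-plus-binary-search idea can be sketched like this. The entry
layout (a 32-bit offset and a 32-bit cumulative delta, 8 bytes in total)
and the names sk_delta_entry and sk_delta_before are assumptions for
illustration, not the kernel's definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-byte entry; the array is sorted by ascending offset. */
struct sk_delta_entry {
    uint32_t offset; /* rule offset */
    int32_t  delta;  /* cumulative size delta up to and including this rule */
};

/* Return the cumulative delta of all entries located strictly before
 * offset, using a binary search: O(log N) per lookup. */
static int32_t sk_delta_before(const struct sk_delta_entry *tab, size_t n,
                               uint32_t offset)
{
    size_t lo = 0, hi = n; /* search window is [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tab[mid].offset < offset)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* lo is now the number of entries with an offset below the target. */
    return lo ? tab[lo - 1].delta : 0;
}
```

Storing cumulative deltas is a design choice of this sketch: it avoids
summing over earlier entries, so N lookups cost O(N*log2(N)) overall.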
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add a new revision 3 that contains port ranges for all of origsrc,
origdst, replsrc and repldst. The high ports are appended to the
original v2 data structure to allow sharing most of the code with
v1 and v2. Use of the revision-specific port matching function is
made dependent on par->match->revision.
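A rough sketch of a revision-dependent port check, assuming a
hypothetical structure in which the high port is appended after the
original single port. Ports are taken in host byte order here for
simplicity; the names are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: the high port follows the older single port. */
struct sk_port_sel {
    uint16_t port;      /* exact port (older revisions) or range start */
    uint16_t port_high; /* range end, only meaningful from revision 3 */
};

/* Older revisions compare one port; revision 3 checks an inclusive range. */
static bool sk_port_match(const struct sk_port_sel *sel, uint16_t port,
                          unsigned int revision)
{
    if (revision >= 3)
        return port >= sel->port && port <= sel->port_high;
    return port == sel->port;
}
```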
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since a string is stored, and not something like a MAC address that
would rely on (un)signedness, drop the qualifier.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Using "iptables -L" with a lot of rules causes excessive BH latency.
Jesper mentioned ~6 ms and was worried about frame drops.
Switch to a per_cpu seqlock scheme, so that taking a snapshot of
counters doesn't need to block BH (for this cpu, but also other cpus).
This adds two increments on the seqlock sequence per ipt_do_table() call,
which is a reasonable cost for allowing "iptables -L" to not block BH
processing.
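The scheme can be approximated in userspace with C11 atomics. This is
only a sketch under stated assumptions: a single writer per slot (as
with per-cpu data), hypothetical names, and no relation to the kernel's
actual seqcount API. The writer bumps the sequence twice per update; the
reader retries if it saw an odd or changed sequence.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-cpu counter slot guarded by a sequence counter. */
struct sk_counter_slot {
    atomic_uint seq;          /* odd while an update is in progress */
    _Atomic uint64_t packets;
    _Atomic uint64_t bytes;
};

/* Writer side: two sequence increments per update. Assumes one writer. */
static void sk_slot_add(struct sk_counter_slot *s, uint64_t bytes)
{
    unsigned int seq = atomic_load_explicit(&s->seq, memory_order_relaxed);

    atomic_store_explicit(&s->seq, seq + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&s->packets,
        atomic_load_explicit(&s->packets, memory_order_relaxed) + 1,
        memory_order_relaxed);
    atomic_store_explicit(&s->bytes,
        atomic_load_explicit(&s->bytes, memory_order_relaxed) + bytes,
        memory_order_relaxed);
    atomic_store_explicit(&s->seq, seq + 2, memory_order_release);
}

/* Reader side: never blocks the writer, retries on a torn snapshot. */
static void sk_slot_snapshot(struct sk_counter_slot *s,
                             uint64_t *packets, uint64_t *bytes)
{
    unsigned int before, after;

    do {
        before = atomic_load_explicit(&s->seq, memory_order_acquire);
        *packets = atomic_load_explicit(&s->packets, memory_order_relaxed);
        *bytes = atomic_load_explicit(&s->bytes, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        after = atomic_load_explicit(&s->seq, memory_order_relaxed);
    } while ((before & 1u) || before != after);
}
```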
Reported-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We are supposed to use the kernel's own types in userspace exports.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1699 commits)
bnx2/bnx2x: Unsupported Ethtool operations should return -EINVAL.
vlan: Calling vlan_hwaccel_do_receive() is always valid.
tproxy: use the interface primary IP address as a default value for --on-ip
tproxy: added IPv6 support to the socket match
cxgb3: function namespace cleanup
tproxy: added IPv6 support to the TPROXY target
tproxy: added IPv6 socket lookup function to nf_tproxy_core
be2net: Changes to use only priority codes allowed by f/w
tproxy: allow non-local binds of IPv6 sockets if IP_TRANSPARENT is enabled
tproxy: added tproxy sockopt interface in the IPV6 layer
tproxy: added udp6_lib_lookup function
tproxy: added const specifiers to udp lookup functions
tproxy: split off ipv6 defragmentation to a separate module
l2tp: small cleanup
nf_nat: restrict ICMP translation for embedded header
can: mcp251x: fix generation of error frames
can: mcp251x: fix endless loop in interrupt handler if CANINTF_MERRF is set
can-raw: add msg_flags to distinguish local traffic
9p: client code cleanup
rds: make local functions/variables static
...
Fix up conflicts in net/core/dev.c, drivers/net/pcmcia/smc91c92_cs.c and
drivers/net/wireless/ath/ath9k/debug.c as per David
This requires a new revision as the old target structure was
IPv4 specific.
Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The conntrack code can export the internal secid to userspace. These are
dynamic, can change on lsm changes, and have no meaning in userspace. We
should be sending lsm contexts to userspace instead. This patch sends
the secctx (rather than secid) to userspace over the netlink socket. We use a
new field CTA_SECCTX and stop using the old CTA_SECMARK field since it did
not send particularly useful information.
Signed-off-by: Eric Paris <eparis@redhat.com>
Reviewed-by: Paul Moore <paul.moore@hp.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: James Morris <jmorris@namei.org>
Right now secmark has lots of direct selinux calls. Use all LSM calls and
remove all SELinux specific knowledge. The only SELinux specific knowledge
we leave is the mode. The only point is to make sure that other LSMs at
least test this generic code before they assume it works. (They may also
have to make changes if they do not represent labels as strings)
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Paul Moore <paul.moore@hp.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: James Morris <jmorris@namei.org>
This patch allows listening to events that report destroyed
expectations.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch adds the basic infrastructure to support user-space
expectation helpers via ctnetlink and the netfilter queuing
infrastructure NFQUEUE. Basically, this patch:
* adds the NF_CT_EXPECT_USERSPACE flag to identify user-space
created expectations. I have also added a sanity check in
__nf_ct_expect_check() to prevent kernel-space helpers from
creating an expectation if the master conntrack has no
helper assigned.
* adds some branches to check if the master conntrack helper
exists, otherwise we skip the code that refers to kernel-space
helpers, such as the local expectation list and the expectation
policy.
* allows setting the timeout for user-space expectations with
no helper assigned.
* adds a list of expectations created from user-space that depends
on ctnetlink (if this module is removed, they are deleted).
* includes USERSPACE in the /proc output for expectations
that have been created by a user-space helper.
This patch also modifies ctnetlink to skip including the helper
name in the Netlink messages if no kernel-space helper is set
(since no user-space expectation has a kernel-space helper
assigned).
You can access an example user-space FTP conntrack helper at:
http://people.netfilter.org/pablo/userspace-conntrack-helpers/nf-ftp-helper-userspace-POC.tar.bz
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
With this patch, you can specify the expectation flags for user-space
created expectations.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (27 commits)
netfilter: fix CONFIG_COMPAT support
isdn/avm: fix build when PCMCIA is not enabled
header: fix broken headers for user space
e1000e: don't check for alternate MAC addr on parts that don't support it
e1000e: disable ASPM L1 on 82573
ll_temac: Fix poll implementation
netxen: fix a race in netxen_nic_get_stats()
qlnic: fix a race in qlcnic_get_stats()
irda: fix a race in irlan_eth_xmit()
net: sh_eth: remove unused variable
netxen: update version 4.0.74
netxen: fix inconsistent lock state
vlan: Match underlying dev carrier on vlan add
ibmveth: Fix opps during MTU change on an active device
ehea: Fix synchronization between HW and SW send queue
bnx2x: Update bnx2x version to 1.52.53-4
bnx2x: Fix PHY locking problem
rds: fix a leak of kernel memory
netlink: fix compat recvmsg
netfilter: fix userspace header warning
...
Conflicts:
include/linux/if_pppox.h
Fix conflict between Changli's __packed header file fixes and
the new PPTP driver.
Signed-off-by: David S. Miller <davem@davemloft.net>
__packed is only defined in kernel space, so we should use
__attribute__((packed)) for the code shared between kernel and user space.
Two __attribute() annotations are replaced with __attribute__() too.
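For illustration, the portable spelling looks like this in a header
shared with userspace. The structure is hypothetical and not taken from
the patch; the sizes assume a GCC-compatible compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wire-format header, not a structure from the patch.
 * __attribute__((packed)) is understood by GCC-compatible compilers in
 * both kernel and userspace builds, unlike the kernel-only __packed. */
struct sk_hdr_packed {
    uint8_t  type;
    uint32_t session;
} __attribute__((packed));

/* Without padding, the 32-bit member directly follows the first byte. */
_Static_assert(sizeof(struct sk_hdr_packed) == 5, "no padding expected");
_Static_assert(offsetof(struct sk_hdr_packed, session) == 1,
               "session follows type immediately");
```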
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"make headers_check" issued the following warning:
CHECK include/linux/netfilter (64 files)
usr/include/linux/netfilter/xt_ipvs.h:19: found __[us]{8,16,32,64} type without #include <linux/types.h>
Fix this, as suggested, by including linux/types.h.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
unifdef-y and header-y have the same semantics,
so there is no need to have both.
Drop the unifdef-y variant and sort all lines again.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
We should copy the initial value to userspace for iptables-save and
to allow removal of specific quota rules.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
In some situations a CPU match permits a better spreading of
connections, or selecting targets only for a given cpu.
With Receive Packet Steering or a multiqueue NIC and appropriate IRQ
affinities, we can distribute traffic on available cpus, per session
(all RX packets for a given flow are handled by a given cpu).
Some legacy applications are not SMP friendly, so one way to scale a
server is to run multiple copies of them.
Instead of randomly choosing an instance, we can use the cpu number as a
key so that the softirq handler for a whole instance is running on a single
cpu, maximizing cache effects in TCP/UDP stacks.
Using NAT for example, a four-way machine might run four copies of a
server application, using a separate listening port for each instance,
but still presenting a unique external port:
iptables -t nat -A PREROUTING -p tcp --dport 80 -m cpu --cpu 0 \
-j REDIRECT --to-port 8080
iptables -t nat -A PREROUTING -p tcp --dport 80 -m cpu --cpu 1 \
-j REDIRECT --to-port 8081
iptables -t nat -A PREROUTING -p tcp --dport 80 -m cpu --cpu 2 \
-j REDIRECT --to-port 8082
iptables -t nat -A PREROUTING -p tcp --dport 80 -m cpu --cpu 3 \
-j REDIRECT --to-port 8083
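The decision behind each of these rules is a single comparison. The
sketch below assumes a hypothetical info structure with an invert field
and takes the current cpu as a parameter, since a userspace example has
no equivalent of the kernel's notion of the processing cpu.

```c
#include <stdbool.h>

/* Hypothetical match data: cpu number from --cpu plus an invert flag. */
struct sk_cpu_info {
    unsigned int cpu;
    unsigned int invert;
};

/* Match when the packet is processed on the configured cpu, or on any
 * other cpu if the rule is inverted. */
static bool sk_cpu_match(const struct sk_cpu_info *info,
                         unsigned int current_cpu)
{
    return (info->cpu == current_cpu) != (info->invert != 0);
}
```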
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This implements the kernel-space side of the netfilter matcher xt_ipvs.
[ minor fixes by Simon Horman <horms@verge.net.au> ]
Signed-off-by: Hannes Eder <heder@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
[ Patrick: added xt_ipvs.h to Kbuild ]
Signed-off-by: Patrick McHardy <kaber@trash.net>
This adds a `CHECKSUM' target, which can be used in the iptables mangle
table.
You can use this target to compute and fill in the checksum in
a packet that lacks a checksum. This is particularly useful
if you need to work around old applications, such as dhcp clients,
that do not work well with checksum offloads, but you don't want to
disable checksum offload in your device.
The problem happens in the field with virtualized applications.
For reference, see Red Hat bz 605555, as well as
http://www.spinics.net/lists/kvm/msg37660.html
Typical expected use (helps old dhclient binary running in a VM):
iptables -A POSTROUTING -t mangle -p udp --dport bootpc \
-j CHECKSUM --checksum-fill
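For readers unfamiliar with what "filling in" a checksum involves, here
is a sketch of the one's-complement Internet checksum. It is not the
code path used by the target, and a real UDP checksum additionally
covers a pseudo-header; the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum over big-endian 16-bit words, as used by the
 * IP, UDP and TCP checksums. Intended for packet-sized buffers. */
static uint16_t sk_inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (len & 1)
        sum += (uint32_t)data[len - 1] << 8; /* pad the odd byte with zero */

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);

    return (uint16_t)~sum;
}
```

A receiver can verify a packet by running the same sum over the data
with the checksum included; the result is then 0.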
Includes fixes by Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch moves NFULNL_COPY_PACKET definition from
linux/netfilter/nfnetlink_log.h to net/netfilter/nfnetlink_log.h
since this copy mode is only for internal use.
I have also changed the value from 0x03 to 0xff. This avoids a gap
in the values visible from user-space, which could confuse users if
we add new copy modes in the future.
This change was introduced in:
http://www.spinics.net/lists/netfilter-devel/msg13535.html
Since this change is not included in any stable Linux kernel,
I think it's safe to make this change now. Anyway, this copy
mode does not make any sense from user-space, so this patch
should not break any existing setup.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch implements an idletimer Xtables target that can be used to
identify when interfaces have been idle for a certain period of time.
Timers are identified by labels and are created when a rule is set with a new
label. The rules also take a timeout value (in seconds) as an option. If
more than one rule uses the same timer label, the timer will be restarted
whenever any of the rules get a hit.
One entry for each timer is created in sysfs. This attribute contains the
time remaining for the timer to expire. The attributes are located under
the xt_idletimer class:
/sys/class/xt_idletimer/timers/<label>
When the timer expires, the target module sends a sysfs notification to
userspace, which can then decide what to do (e.g. disconnect to save power).
Cc: Timo Teras <timo.teras@iki.fi>
Signed-off-by: Luciano Coelho <luciano.coelho@nokia.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
- must use atomic_inc_not_zero() in instance_lookup_get()
- must use hlist_add_head_rcu() instead of hlist_add_head()
- must use hlist_del_rcu() instead of hlist_del()
- Introduce NFULNL_COPY_DISABLED to stop lockless reader from using an
instance, before we do final instance_put() on it.
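The semantics of the first item can be sketched with C11 atomics: take
a reference only if the count has not already dropped to zero, so a
lockless reader never revives an instance that is being freed. The
function name is hypothetical and this is a userspace approximation,
not the kernel's implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Increment *ref unless it is zero. Returns true if a reference was
 * taken, false if the object is already on its way to being freed. */
static bool sk_ref_get_unless_zero(atomic_int *ref)
{
    int old = atomic_load(ref);

    while (old != 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(ref, &old, old + 1))
            return true;
    }
    return false;
}
```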
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
twice per packet. This is bad for performance.
__read_mostly annotation is also a bad choice.
This patch introduces the IPS_UNTRACKED bit so that we can later use a
per_cpu untracked structure more easily.
A new helper, nf_ct_untracked_get() returns a pointer to
nf_conntrack_untracked.
Another one, nf_ct_untracked_status_or() is used by nf_nat_init() to add
IPS_NAT_DONE_MASK bits to untracked status.
nf_ct_is_untracked() prototype is changed to work on a nf_conn pointer.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
commit f3c5c1bfd4 (netfilter: xtables: make ip_tables reentrant)
introduced a performance regression, because the stackptr array is shared by
all cpus, adding cache line ping pongs. (16 cpus share a 64-byte cache
line)
Fix this using alloc_percpu().
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-By: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The text describing the return codes that are expected on calls to
checkentry() was incorrect. Instead of returning true or false, or an error
code, it should return 0 or an error code.
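A sketch of the corrected convention, using a hypothetical target info
structure and validation rule: the function returns 0 on success and a
negative errno value on failure, never true or false.

```c
#include <errno.h>

/* Hypothetical target data used only for this illustration. */
struct sk_tginfo {
    unsigned int timeout;
};

/* Return 0 if the rule is acceptable, otherwise a negative error code. */
static int sk_checkentry(const struct sk_tginfo *info)
{
    if (info->timeout == 0)
        return -EINVAL;
    return 0;
}
```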
Signed-off-by: Luciano Coelho <luciano.coelho@nokia.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Since xt_action_param is writable, let's use it. The pointer to
'bool hotdrop' always worried me (8 bytes (64-bit) to write 1 byte!).
Surprisingly, this results in a reduction in size:
   text    data     bss filename
5457066  692730  357892 vmlinux.o-prev
5456554  692730  357892 vmlinux.o
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
In the future, layer-3 matches will be an xt module of their own, and
need to set the fragoff and thoff fields. Adding more pointers would
needlessly increase memory requirements (especially on 64-bit, where
pointers are wider).
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
The structures carried - besides match/target - almost the same data.
It is possible to combine them, as extensions are evaluated serially,
and so, the callers end up a little smaller.
   text    data     bss filename
 -15318     740     104 net/ipv4/netfilter/ip_tables.o
 +15286     740     104 net/ipv4/netfilter/ip_tables.o
 -15333     540     152 net/ipv6/netfilter/ip6_tables.o
 +15269     540     152 net/ipv6/netfilter/ip6_tables.o
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
There has been quite some confusion in userspace about
XT_FUNCTION_MAXNAMELEN; because struct xt_entry_match used MAX-1,
userspace would have to do an awkward MAX-2 for maximum length
checking (due to '\0'). This patch adds a new define that matches the
definition of XT_TABLE_MAXNAMELEN: it is the size of the actual
struct member, not one off.
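The off-by-one can be illustrated as below. The SK_ names and the
numeric values (30 and 29) are assumptions for this sketch; only the
relationship "new define equals old define minus one, and equals the
member size" comes from the text.

```c
#include <stdbool.h>
#include <string.h>

/* Assumed values; only their relationship is taken from the text. */
#define SK_FUNCTION_MAXNAMELEN  30
#define SK_EXTENSION_MAXNAMELEN (SK_FUNCTION_MAXNAMELEN - 1)

/* Hypothetical userspace view: the member is sized by the new define. */
struct sk_entry_match_user {
    char name[SK_EXTENSION_MAXNAMELEN];
    unsigned char revision;
};

/* A name fits if it leaves room for the terminating '\0', i.e. its
 * length is at most MAX-2 in terms of the old define. */
static bool sk_name_fits(const char *name)
{
    return strlen(name) < sizeof(((struct sk_entry_match_user *)0)->name);
}
```

With the new define, the check is a plain comparison against the member
size, and no manual subtraction is needed in userspace.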
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
I suspect an unfortunate series of events occurring under a DDoS
attack, in function __nf_conntrack_find() in nf_conntrack_core.c.
Add a stats counter to see if the search is restarted too often.
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Replace the runtime oif name resolving by netdevice notifier based
resolving. When an oif is given, a netdevice notifier is registered
to resolve the name on NETDEV_REGISTER or NETDEV_CHANGE and unresolve
it again on NETDEV_UNREGISTER or NETDEV_CHANGE to a different name.
Signed-off-by: Patrick McHardy <kaber@trash.net>