summaryrefslogtreecommitdiff
path: root/net/netfilter
AgeCommit message (Collapse)AuthorFilesLines
2022-05-31netfilter: flowtable: fix nft_flow_route source address for nat casewenxu1-2/+2
For snat and dnat cases, the saddr should be taken from reverse tuple. Fixes: 3412e1641828 (netfilter: flowtable: nft_flow_route use more data for reverse route) Signed-off-by: wenxu <wenxu@chinatelecom.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-31netfilter: flowtable: fix missing FLOWI_FLAG_ANYSRC flagwenxu1-0/+2
The nf_flow_table gets route through ip_route_output_key. If the saddr is not local one, then FLOWI_FLAG_ANYSRC flag should be set. Without this flag, the route lookup for other_dst will fail. Fixes: 3412e1641828 (netfilter: flowtable: nft_flow_route use more data for reverse route) Signed-off-by: wenxu <wenxu@chinatelecom.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-31netfilter: nf_tables: double hook unregistration in netns pathPablo Neira Ayuso1-13/+41
__nft_release_hooks() is called from pre_netns exit path which unregisters the hooks, then the NETDEV_UNREGISTER event is triggered which unregisters the hooks again. [ 565.221461] WARNING: CPU: 18 PID: 193 at net/netfilter/core.c:495 __nf_unregister_net_hook+0x247/0x270 [...] [ 565.246890] CPU: 18 PID: 193 Comm: kworker/u64:1 Tainted: G E 5.18.0-rc7+ #27 [ 565.253682] Workqueue: netns cleanup_net [ 565.257059] RIP: 0010:__nf_unregister_net_hook+0x247/0x270 [...] [ 565.297120] Call Trace: [ 565.300900] <TASK> [ 565.304683] nf_tables_flowtable_event+0x16a/0x220 [nf_tables] [ 565.308518] raw_notifier_call_chain+0x63/0x80 [ 565.312386] unregister_netdevice_many+0x54f/0xb50 Unregister and destroy netdev hook from netns pre_exit via kfree_rcu so the NETDEV_UNREGISTER path see unregistered hooks. Fixes: 767d1216bff8 ("netfilter: nftables: fix possible UAF over chains from packet path in netns") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-31netfilter: nf_tables: hold mutex on netns pre_exit pathPablo Neira Ayuso1-0/+4
clean_net() runs in workqueue while walking over the lists, grab mutex. Fixes: 767d1216bff8 ("netfilter: nftables: fix possible UAF over chains from packet path in netns") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-31netfilter: nf_tables: sanitize nft_set_desc_concat_parse()Pablo Neira Ayuso1-4/+13
Add several sanity checks for nft_set_desc_concat_parse(): - validate desc->field_count not larger than desc->field_len array. - field length cannot be larger than desc->field_len (ie. U8_MAX) - total length of the concatenation cannot be larger than register array. Joint work with Florian Westphal. Fixes: f3a2181e16f1 ("netfilter: nf_tables: Support for sets with multiple ranged fields") Reported-by: <zhangziming.zzm@antgroup.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-27netfilter: nf_tables: set element extended ACK reporting supportPablo Neira Ayuso1-3/+9
Report the element that causes problems via netlink extended ACK for set element commands. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-27netfilter: cttimeout: fix slab-out-of-bounds read in cttimeout_net_exitFlorian Westphal1-2/+3
syzbot reports: BUG: KASAN: slab-out-of-bounds in __list_del_entry_valid+0xcc/0xf0 lib/list_debug.c:42 [..] list_del include/linux/list.h:148 [inline] cttimeout_net_exit+0x211/0x540 net/netfilter/nfnetlink_cttimeout.c:617 No reproducer so far. Looking at recent changes in this area its clear that the free_head must not be at the end of the structure because nf_ct_timeout structure has variable size. Reported-by: <syzbot+92968395eedbdbd3617d@syzkaller.appspotmail.com> Fixes: 78222bacfca9 ("netfilter: cttimeout: decouple unlink and free on netns destruction") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-27netfilter: nfnetlink: fix warn in nfnetlink_unbindFlorian Westphal1-19/+5
syzbot reports following warn: WARNING: CPU: 0 PID: 3600 at net/netfilter/nfnetlink.c:703 nfnetlink_unbind+0x357/0x3b0 net/netfilter/nfnetlink.c:694 The syzbot generated program does this: socket(AF_NETLINK, SOCK_RAW, NETLINK_NETFILTER) = 3 setsockopt(3, SOL_NETLINK, NETLINK_DROP_MEMBERSHIP, [1], 4) = 0 ... which triggers 'WARN_ON_ONCE(nfnlnet->ctnetlink_listeners == 0)' check. Instead of counting, just enable reporting for every bind request and check if we still have listeners on unbind. While at it, also add the needed bounds check on nfnl_group2type[] access. Reported-by: <syzbot+4903218f7fba0a2d6226@syzkaller.appspotmail.com> Reported-by: <syzbot+afd2d80e495f96049571@syzkaller.appspotmail.com> Fixes: 2794cdb0b97b ("netfilter: nfnetlink: allow to detect if ctnetlink listeners exist") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-26netfilter: nft_limit: Clone packet limits' cost valuePhil Sutter1-0/+2
When cloning a packet-based limit expression, copy the cost value as well. Otherwise the new limit is not functional anymore. Fixes: 3b9e2ea6c11bf ("netfilter: nft_limit: move stateful fields out of expression data") Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-26netfilter: nf_tables: disallow non-stateful expression in sets earlierPablo Neira Ayuso1-9/+10
Since 3e135cd499bf ("netfilter: nft_dynset: dynamic stateful expression instantiation"), it is possible to attach stateful expressions to set elements. cd5125d8f518 ("netfilter: nf_tables: split set destruction in deactivate and destroy phase") introduces conditional destruction on the object to accomodate transaction semantics. nft_expr_init() calls expr->ops->init() first, then check for NFT_STATEFUL_EXPR, this stills allows to initialize a non-stateful lookup expressions which points to a set, which might lead to UAF since the set is not properly detached from the set->binding for this case. Anyway, this combination is non-sense from nf_tables perspective. This patch fixes this problem by checking for NFT_STATEFUL_EXPR before expr->ops->init() is called. The reporter provides a KASAN splat and a poc reproducer (similar to those autogenerated by syzbot to report use-after-free errors). It is unknown to me if they are using syzbot or if they use similar automated tool to locate the bug that they are reporting. For the record, this is the KASAN splat. [ 85.431824] ================================================================== [ 85.432901] BUG: KASAN: use-after-free in nf_tables_bind_set+0x81b/0xa20 [ 85.433825] Write of size 8 at addr ffff8880286f0e98 by task poc/776 [ 85.434756] [ 85.434999] CPU: 1 PID: 776 Comm: poc Tainted: G W 5.18.0+ #2 [ 85.436023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 Fixes: 0b2d8a7b638b ("netfilter: nf_tables: add helper functions for expression handling") Reported-and-tested-by: Aaron Adams <edg-e@nccgroup.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-nextJakub Kicinski3-48/+17
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next, misc updates and fallout fixes from recent Florian's code rewritting (from last pull request): 1) Use new flowi4_l3mdev field in ip_route_me_harder(), from Martin Willi. 2) Avoid unnecessary GC with a timestamp in conncount, from William Tu and Yifeng Sun. 3) Remove TCP conntrack debugging, from Florian Westphal. 4) Fix compilation warning in ctnetlink, from Florian. * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: ctnetlink: fix up for "netfilter: conntrack: remove unconfirmed list" netfilter: conntrack: remove pr_debug callsites from tcp tracker netfilter: nf_conncount: reduce unnecessary GC netfilter: Use l3mdev flow key when re-routing mangled packets ==================== Link: https://lore.kernel.org/r/20220519220206.722153-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski4-70/+48
drivers/net/ethernet/mellanox/mlx5/core/main.c b33886971dbc ("net/mlx5: Initialize flow steering during driver probe") 40379a0084c2 ("net/mlx5_fpga: Drop INNOVA TLS support") f2b41b32cde8 ("net/mlx5: Remove ipsec_ops function table") https://lore.kernel.org/all/20220519040345.6yrjromcdistu7vh@sx1/ 16d42d313350 ("net/mlx5: Drain fw_reset when removing device") 8324a02c342a ("net/mlx5: Add exit route when waiting for FW") https://lore.kernel.org/all/20220519114119.060ce014@canb.auug.org.au/ tools/testing/selftests/net/mptcp/mptcp_join.sh e274f7154008 ("selftests: mptcp: add subflow limits test-cases") b6e074e171bc ("selftests: mptcp: add infinite map testcase") 5ac1d2d63451 ("selftests: mptcp: Add tests for userspace PM type") https://lore.kernel.org/all/20220516111918.366d747f@canb.auug.org.au/ net/mptcp/options.c ba2c89e0ea74 ("mptcp: fix checksum byte order") 1e39e5a32ad7 ("mptcp: infinite mapping sending") ea66758c1795 ("tcp: allow MPTCP to update the announced window") https://lore.kernel.org/all/20220519115146.751c3a37@canb.auug.org.au/ net/mptcp/pm.c 95d686517884 ("mptcp: fix subflow accounting on close") 4d25247d3ae4 ("mptcp: bypass in-kernel PM restrictions for non-kernel PMs") https://lore.kernel.org/all/20220516111435.72f35dca@canb.auug.org.au/ net/mptcp/subflow.c ae66fb2ba6c3 ("mptcp: Do TCP fallback on early DSS checksum failure") 0348c690ed37 ("mptcp: add the fallback check") f8d4bcacff3b ("mptcp: infinite mapping receiving") https://lore.kernel.org/all/20220519115837.380bb8d4@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-18netfilter: nf_tables: disable expression reduction infraPablo Neira Ayuso1-10/+1
Either userspace or kernelspace need to pre-fetch keys inconditionally before comparisons for this to work. Otherwise, register tracking data is misleading and it might result in reducing expressions which are not yet registers. First expression is also guaranteed to be evaluated always, however, certain expressions break before writing data to registers, before comparing the data, leaving the register in undetermined state. This patch disables this infrastructure by now. Fixes: b2d306542ff9 ("netfilter: nf_tables: do not reduce read-only expressions") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-18netfilter: flowtable: move dst_check to packet pathRitaro Takenaka2-22/+20
Fixes sporadic IPv6 packet loss when flow offloading is enabled. IPv6 route GC and flowtable GC are not synchronized. When dst_cache becomes stale and a packet passes through the flow before the flowtable GC teardowns it, the packet can be dropped. So, it is necessary to check dst every time in packet path. Fixes: 227e1e4d0d6c ("netfilter: nf_flowtable: skip device lookup from interface index") Signed-off-by: Ritaro Takenaka <ritarot634@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-18netfilter: flowtable: fix TCP flow teardownPablo Neira Ayuso2-27/+9
This patch addresses three possible problems: 1. ct gc may race to undo the timeout adjustment of the packet path, leaving the conntrack entry in place with the internal offload timeout (one day). 2. ct gc removes the ct because the IPS_OFFLOAD_BIT is not set and the CLOSE timeout is reached before the flow offload del. 3. tcp ct is always set to ESTABLISHED with a very long timeout in flow offload teardown/delete even though the state might be already CLOSED. Also as a remark we cannot assume that the FIN or RST packet is hitting flow table teardown as the packet might get bumped to the slow path in nftables. This patch resets IPS_OFFLOAD_BIT from flow_offload_teardown(), so conntrack handles the tcp rst/fin packet which triggers the CLOSE/FIN state transition. Moreover, teturn the connection's ownership to conntrack upon teardown by clearing the offload flag and fixing the established timeout value. The flow table GC thread will asynchonrnously free the flow table and hardware offload entries. Before this patch, the IPS_OFFLOAD_BIT remained set for expired flows on which is also misleading since the flow is back to classic conntrack path. If nf_ct_delete() removes the entry from the conntrack table, then it calls nf_ct_put() which decrements the refcnt. This is not a problem because the flowtable holds a reference to the conntrack object from flow_offload_alloc() path which is released via flow_offload_free(). This patch also updates nft_flow_offload to skip packets in SYN_RECV state. Since we might miss or bump packets to slow path, we do not know what will happen there while we are still in SYN_RECV, this patch postpones offload up to the next packet which also aligns to the existing behaviour in tc-ct. flow_offload_teardown() does not reset the existing tcp state from flow_offload_fixup_tcp() to ESTABLISHED anymore, packets bump to slow path might have already update the state to CLOSE/FIN. Joint work with Oz and Sven. Fixes: 1e5b2471bcc4 ("netfilter: nf_flow_table: teardown flow timeout race") Signed-off-by: Oz Shlomo <ozsh@nvidia.com> Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-18netfilter: ctnetlink: fix up for "netfilter: conntrack: remove unconfirmed list"Stephen Rothwell1-0/+2
After merging the net-next tree, today's linux-next build (powerpc ppc64_defconfig) produced this warning: nf_conntrack_netlink.c:1717 warning: 'ctnetlink_dump_one_entry' defined but not used Fixes: 8a75a2c17410 ("netfilter: conntrack: remove unconfirmed list") Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Florian Westphal <fw@strlen.de>
2022-05-16netfilter: conntrack: remove pr_debug callsites from tcp trackerFlorian Westphal1-48/+4
They are either obsolete or useless. Those in the normal processing path cannot be enabled on a production system; they generate too much noise. One pr_debug call resides in an error path and does provide useful info, merge it with the existing nf_log_invalid(). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-16netfilter: nf_conncount: reduce unnecessary GCWilliam Tu1-0/+11
Currently nf_conncount can trigger garbage collection (GC) at multiple places. Each GC process takes a spin_lock_bh to traverse the nf_conncount_list. We found that when testing port scanning use two parallel nmap, because the number of connection increase fast, the nf_conncount_count and its subsequent call to __nf_conncount_add take too much time, causing several CPU lockup. This happens when user set the conntrack limit to +20,000, because the larger the limit, the longer the list that GC has to traverse. The patch mitigate the performance issue by avoiding unnecessary GC with a timestamp. Whenever nf_conncount has done a GC, a timestamp is updated, and beforce the next time GC is triggered, we make sure it's more than a jiffies. By doin this we can greatly reduce the CPU cycles and avoid the softirq lockup. To reproduce it in OVS, $ ovs-appctl dpctl/ct-set-limits zone=1,limit=20000 $ ovs-appctl dpctl/ct-get-limits At another machine, runs two nmap $ nmap -p1- <IP> $ nmap -p1- <IP> Signed-off-by: William Tu <u9012063@gmail.com> Co-authored-by: Yifeng Sun <pkusunyifeng@gmail.com> Reported-by: Greg Rose <gvrose8192@gmail.com> Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-16netfilter: nft_flow_offload: fix offload with pppoe + vlanFelix Fietkau1-1/+2
When running a combination of PPPoE on top of a VLAN, we need to set info->outdev to the PPPoE device, otherwise PPPoE encap is skipped during software offload. Fixes: 72efd585f714 ("netfilter: flowtable: add pppoe support") Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-16netfilter: nft_flow_offload: skip dst neigh lookup for ppp devicesFelix Fietkau1-9/+13
The dst entry does not contain a valid hardware address, so skip the lookup in order to avoid running into errors here. The proper hardware address is filled in from nft_dev_path_info Fixes: 72efd585f714 ("netfilter: flowtable: add pppoe support") Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-16netfilter: flowtable: fix excessive hw offload attempts after failureFelix Fietkau1-1/+3
If a flow cannot be offloaded, the code currently repeatedly tries again as quickly as possible, which can significantly increase system load. Fix this by limiting flow timeout update and hardware offload retry to once per second. Fixes: c07531c01d82 ("netfilter: flowtable: Remove redundant hw refresh bit") Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-nextDavid S. Miller12-300/+411
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next This is v2 including deadlock fix in conntrack ecache rework reported by Jakub Kicinski. The following patchset contains Netfilter updates for net-next, mostly updates to conntrack from Florian Westphal. 1) Add a dedicated list for conntrack event redelivery. 2) Include event redelivery list in conntrack dumps of dying type. 3) Remove per-cpu dying list for event redelivery, not used anymore. 4) Add netns .pre_exit to cttimeout to zap timeout objects before synchronize_rcu() call. 5) Remove nf_ct_unconfirmed_destroy. 6) Add generation id for conntrack extensions for conntrack timeout and helpers. 7) Detach timeout policy from conntrack on cttimeout module removal. 8) Remove __nf_ct_unconfirmed_destroy. 9) Remove unconfirmed list. 10) Remove unconditional local_bh_disable in init_conntrack(). 11) Consolidate conntrack iterator nf_ct_iterate_cleanup(). 12) Detect if ctnetlink listeners exist to short-circuit event path early. 13) Un-inline nf_ct_ecache_ext_add(). 14) Add nf_conntrack_events autodetect ctnetlink listener mode and make it default. 15) Add nf_ct_ecache_exist() to check for event cache extension. 16) Extend flowtable reverse route lookup to include source, iif, tos and mark, from Sven Auhagen. 17) Do not verify zero checksum UDP packets in nf_reject, from Kevin Mitchell. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-13netfilter: flowtable: nft_flow_route use more data for reverse routeSven Auhagen1-0/+8
When creating a flow table entry, the reverse route is looked up based on the current packet. There can be scenarios where the user creates a custom ip rule to route the traffic differently. In order to support those scenarios, the lookup needs to add more information based on the current packet. The patch adds multiple new information to the route lookup. Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: conntrack: add nf_conntrack_events autodetect modeFlorian Westphal3-8/+24
This adds the new nf_conntrack_events=2 mode and makes it the default. This leverages the earlier flag in struct net to allow to avoid the event extension as long as no event listener is active in the namespace. This avoids, for most cases, allocation of ct->ext area. A followup patch will take further advantage of this by avoiding calls down into the event framework if the extension isn't present. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: conntrack: un-inline nf_ct_ecache_ext_addFlorian Westphal2-3/+33
Only called when new ct is allocated or the extension isn't present. This function will be extended, place this in the conntrack module instead of inlining. The callers already depend on nf_conntrack module. Return value is changed to bool, noone used the returned pointer. Make sure that the core drops the newly allocated conntrack if the extension is requested but can't be added. This makes it necessary to ifdef the section, as the stub always returns false we'd drop every new conntrack if the the ecache extension is disabled in kconfig. Add from data path (xt_CT, nft_ct) is unchanged. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: nfnetlink: allow to detect if ctnetlink listeners existFlorian Westphal1-3/+37
At this time, every new conntrack gets the 'event cache extension' enabled for it. This is because the 'net.netfilter.nf_conntrack_events' sysctl defaults to 1. Changing the default to 0 means that commands that rely on the event notification extension, e.g. 'conntrack -E' or conntrackd, stop working. We COULD detect if there is a listener by means of 'nfnetlink_has_listeners()' and only add the extension if this is true. The downside is a dependency from conntrack module to nfnetlink module. This adds a different way: inc/dec a counter whenever a ctnetlink group is being (un)subscribed and toggle a flag in struct net. Next patches will take advantage of this and will only add the event extension if the flag is set. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: conntrack: add nf_ct_iter_data object for nf_ct_iterate_cleanup*()Pablo Neira Ayuso5-41/+47
This patch adds a structure to collect all the context data that is passed to the cleanup iterator. struct nf_ct_iter_data { struct net *net; void *data; u32 portid; int report; }; There is a netns field that allows to clean up conntrack entries specifically owned by the specified netns. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: conntrack: avoid unconditional local_bh_disableFlorian Westphal1-5/+2
Now that the conntrack entry isn't placed on the pcpu list anymore the bh only needs to be disabled in the 'expectation present' case. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: conntrack: remove unconfirmed listFlorian Westphal2-98/+3
It has no function anymore and can be removed. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: conntrack: remove __nf_ct_unconfirmed_destroyFlorian Westphal3-39/+13
Its not needed anymore: A. If entry is totally new, then the rcu-protected resource must already have been removed from global visibility before call to nf_ct_iterate_destroy. B. If entry was allocated before, but is not yet in the hash table (uncofirmed case), genid gets incremented and synchronize_rcu() call makes sure access has completed. C. Next attempt to peek at extension area will fail for unconfirmed conntracks, because ext->genid != genid. D. Conntracks in the hash are iterated as before. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: cttimeout: decouple unlink and free on netns destructionFlorian Westphal1-0/+12
Increment the extid on module removal; this makes sure that even in extreme cases any old uncofirmed entry that happened to be kept e.g. on nfnetlink_queue list will not trip over a stale timeout reference. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: extensions: introduce extension genid countFlorian Westphal2-1/+86
Multiple netfilter extensions store pointers to external data in their extension area struct. Examples: 1. Timeout policies 2. Connection tracking helpers. No references are taken for these. When a helper or timeout policy is removed, the conntrack table gets traversed and affected extensions are cleared. Conntrack entries not yet in the hashtable are referenced via a special list, the unconfirmed list. On removal of a policy or connection tracking helper, the unconfirmed list gets traversed an all entries are marked as dying, this prevents them from getting committed to the table at insertion time: core checks for dying bit, if set, the conntrack entry gets destroyed at confirm time. The disadvantage is that each new conntrack has to be added to the percpu unconfirmed list, and each insertion needs to remove it from this list. The list is only ever needed when a policy or helper is removed -- a rare occurrence. Add a generation ID count: Instead of adding to the list and then traversing that list on policy/helper removal, increment a counter that is stored in the extension area. For unconfirmed conntracks, the extension has the genid valid at ct allocation time. Removal of a helper/policy etc. increments the counter. At confirmation time, validate that ext->genid == global_id. If the stored number is not the same, do not allow the conntrack insertion, just like as if a confirmed-list traversal would have flagged the entry as dying. After insertion, the genid is no longer relevant (conntrack entries are now reachable via the conntrack table iterators and is set to 0. This allows removal of the percpu unconfirmed list. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: remove nf_ct_unconfirmed_destroy helperFlorian Westphal2-15/+3
This helper tags connections not yet in the conntrack table as dying. These nf_conn entries will be dropped instead when the core attempts to insert them from the input or postrouting 'confirm' hook. After the previous change, the entries get unlinked from the list earlier, so that by the time the actual exit hook runs, new connections no longer have a timeout policy assigned. Its enough to walk the hashtable instead. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: cttimeout: decouple unlink and free on netns destructionFlorian Westphal1-2/+28
Make it so netns pre_exit unlinks the objects from the pernet list, so they cannot be found anymore. netns core issues a synchronize_rcu() before calling the exit hooks so any the time the exit hooks run unconfirmed nf_conn entries have been free'd or they have been committed to the hashtable. The exit hook still tags unconfirmed entries as dying, this can now be removed in a followup change. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: conntrack: remove the percpu dying listFlorian Westphal3-46/+13
Its no longer needed. Entries that need event redelivery are placed on the new pernet dying list. The advantage is that there is no need to take additional spinlock on conntrack removal unless event redelivery failed or the conntrack entry was never added to the table in the first place (confirmed bit not set). The IPS_CONFIRMED bit now needs to be set as soon as the entry has been unlinked from the unconfirmed list, else the destroy function may attempt to unlink it a second time. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: conntrack: include ecache dying list in dumpsFlorian Westphal2-0/+53
The new pernet dying list includes conntrack entries that await delivery of the 'destroy' event via ctnetlink. The old percpu dying list will be removed soon. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: ecache: use dedicated list for event redeliveryFlorian Westphal2-70/+80
This disentangles event redelivery and the percpu dying list. Because entries are now stored on a dedicated list, all entries are in NFCT_ECACHE_DESTROY_FAIL state and all entries still have confirmed bit set -- the reference count is at least 1. The 'struct net' back-pointer can be removed as well. The pcpu dying list will be removed eventually, it has no functionality. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-03net: sysctl: introduce sysctl SYSCTL_THREETonghao Zhang1-3/+1
This patch introdues the SYSCTL_THREE. KUnit: [00:10:14] ================ sysctl_test (10 subtests) ================= [00:10:14] [PASSED] sysctl_test_api_dointvec_null_tbl_data [00:10:14] [PASSED] sysctl_test_api_dointvec_table_maxlen_unset [00:10:14] [PASSED] sysctl_test_api_dointvec_table_len_is_zero [00:10:14] [PASSED] sysctl_test_api_dointvec_table_read_but_position_set [00:10:14] [PASSED] sysctl_test_dointvec_read_happy_single_positive [00:10:14] [PASSED] sysctl_test_dointvec_read_happy_single_negative [00:10:14] [PASSED] sysctl_test_dointvec_write_happy_single_positive [00:10:14] [PASSED] sysctl_test_dointvec_write_happy_single_negative [00:10:14] [PASSED] sysctl_test_api_dointvec_write_single_less_int_min [00:10:14] [PASSED] sysctl_test_api_dointvec_write_single_greater_int_max [00:10:14] =================== [PASSED] sysctl_test =================== ./run_kselftest.sh -c sysctl ... ok 1 selftests: sysctl: sysctl.sh Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: David Ahern <dsahern@kernel.org> Cc: Simon Horman <horms@verge.net.au> Cc: Julian Anastasov <ja@ssi.bg> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Jozsef Kadlecsik <kadlec@netfilter.org> Cc: Florian Westphal <fw@strlen.de> Cc: Shuah Khan <shuah@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Lorenz Bauer <lmb@cloudflare.com> Cc: Akhmat Karakotov <hmukos@yandex-team.ru> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Reviewed-by: Simon Horman <horms@verge.net.au> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski5-32/+51
include/linux/netdevice.h net/core/dev.c 6510ea973d8d ("net: Use this_cpu_inc() to increment net->core_stats") 794c24e9921f ("net-core: rx_otherhost_dropped to core_stats") https://lore.kernel.org/all/20220428111903.5f4304e0@canb.auug.org.au/ drivers/net/wan/cosa.c d48fea8401cf ("net: cosa: fix error check return value of register_chrdev()") 89fbca3307d4 ("net: wan: remove support for COSA and SRP synchronous serial boards") https://lore.kernel.org/all/20220428112130.1f689e5e@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-28netfilter: nft_socket: only do sk lookups when indev is availableFlorian Westphal1-14/+38
Check if the incoming interface is available and NFT_BREAK in case neither skb->sk nor input device are set. Because nf_sk_lookup_slow*() assume packet headers are in the 'in' direction, use in postrouting is not going to yield a meaningful result. Same is true for the forward chain, so restrict the use to prerouting, input and output. Use in output work if a socket is already attached to the skb. Fixes: 554ced0a6e29 ("netfilter: nf_tables: add support for native socket matching") Reported-and-tested-by: Topi Miettinen <toiwoton@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-04-27netfilter: conntrack: fix udp offload timeout sysctlVolodymyr Mytnyk1-1/+1
`nf_flowtable_udp_timeout` sysctl option is available only if CONFIG_NFT_FLOW_OFFLOAD enabled. But infra for this flow offload UDP timeout was added under CONFIG_NF_FLOW_TABLE config option. So, if you have CONFIG_NFT_FLOW_OFFLOAD disabled and CONFIG_NF_FLOW_TABLE enabled, the `nf_flowtable_udp_timeout` is not present in sysfs. Please note, that TCP flow offload timeout sysctl option is present even CONFIG_NFT_FLOW_OFFLOAD is disabled. I suppose it was a typo in commit that adds UDP flow offload timeout and CONFIG_NF_FLOW_TABLE should be used instead. Fixes: 975c57504da1 ("netfilter: conntrack: Introduce udp offload timeout configuration") Signed-off-by: Volodymyr Mytnyk <volodymyr.mytnyk@plvision.eu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-04-27netfilter: nf_conntrack_tcp: re-init for syn packets onlyFlorian Westphal1-15/+6
Jaco Kroon reported tcp problems that Eric Dumazet and Neal Cardwell pinpointed to nf_conntrack tcp_in_window() bug. tcp trace shows following sequence: I > R Flags [S], seq 3451342529, win 62580, options [.. tfo [|tcp]> R > I Flags [S.], seq 2699962254, ack 3451342530, win 65535, options [..] R > I Flags [P.], seq 1:89, ack 1, [..] Note 3rd ACK is from responder to initiator so following branch is taken: } else if (((state->state == TCP_CONNTRACK_SYN_SENT && dir == IP_CT_DIR_ORIGINAL) || (state->state == TCP_CONNTRACK_SYN_RECV && dir == IP_CT_DIR_REPLY)) && after(end, sender->td_end)) { ... because state == TCP_CONNTRACK_SYN_RECV and dir is REPLY. This causes the scaling factor to be reset to 0: window scale option is only present in syn(ack) packets. This in turn makes nf_conntrack mark valid packets as out-of-window. This was always broken, it exists even in original commit where window tracking was added to ip_conntrack (nf_conntrack predecessor) in 2.6.9-rc1 kernel. Restrict to 'tcph->syn', just like the 3rd condtional added in commit 82b72cb94666 ("netfilter: conntrack: re-init state for retransmitted syn-ack"). Upon closer look, those conditionals/branches can be merged: Because earlier checks prevent syn-ack from showing up in original direction, the 'dir' checks in the conditional quoted above are redundant, remove them. Return early for pure syn retransmitted in reply direction (simultaneous open). Fixes: 9fb9cbb1082d ("[NETFILTER]: Add nf_conntrack subsystem.") Reported-by: Jaco Kroon <jaco@uls.co.za> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-04-22netfilter: nft_set_rbtree: overlap detection with element re-addition after ↵Pablo Neira Ayuso1-1/+5
deletion This patch fixes spurious EEXIST errors. Extend d2df92e98a34 ("netfilter: nft_set_rbtree: handle element re-addition after deletion") to deal with elements with same end flags in the same transation. Reset the overlap flag as described by 7c84d41416d8 ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion"). Fixes: 7c84d41416d8 ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion") Fixes: d2df92e98a34 ("netfilter: nft_set_rbtree: handle element re-addition after deletion") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-04-19ipvs: correctly print the memory size of ip_vs_conn_tabPengcheng Yang1-1/+1
The memory size of ip_vs_conn_tab changed after we use hlist instead of list. Fixes: 731109e78415 ("ipvs: use hlist instead of list") Signed-off-by: Pengcheng Yang <yangpc@wangsu.com> Acked-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-04-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni2-5/+4
2022-04-12netfilter: nf_tables: nft_parse_register can return a negative valueAntoine Tenart1-1/+1
Since commit 6e1acfa387b9 ("netfilter: nf_tables: validate registers coming from userspace.") nft_parse_register can return a negative value, but the function prototype is still returning an unsigned int. Fixes: 6e1acfa387b9 ("netfilter: nf_tables: validate registers coming from userspace.") Signed-off-by: Antoine Tenart <atenart@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-04-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-nextDavid S. Miller7-125/+135
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Replace unnecessary list_for_each_entry_continue() in nf_tables, from Jakob Koschel. 2) Add struct nf_conntrack_net_ecache to conntrack event cache and use it, from Florian Westphal. 3) Refactor ctnetlink_dump_list(), also from Florian. 4) Bump module reference counter on cttimeout object addition/removal, from Florian. 5) Consolidate nf_log MAC printer, from Phil Sutter. 6) Add basic logging support for unknown ethertype, from Phil Sutter. 7) Consolidate check for sysctl nf_log_all_netns toggle, also from Phil. 8) Replace hardcode value in nft_bitwise, from Jeremy Sowden. 9) Rename BASIC-like goto tags in nft_bitwise to more meaningful names, also from Jeremy. 10) nft_fib support for reverse path filtering with policy-based routing on iif. Extend selftests to cover for this new usecase, from Florian. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-11netfilter: nft_socket: make cgroup match work in input tooFlorian Westphal1-4/+3
cgroupv2 helper function ignores the already-looked up sk and uses skb->sk instead. Just pass sk from the calling function instead; this will make cgroup matching work for udp and tcp in input even when edemux did not set skb->sk already. Fixes: e0bb96db96f8 ("netfilter: nft_socket: add support for cgroupsv2") Signed-off-by: Florian Westphal <fw@strlen.de> Tested-by: Topi Miettinen <toiwoton@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-04-11netfilter: nft_fib: reverse path filter for policy-based routing on iifPablo Neira Ayuso1-0/+4
If policy-based routing using the iif selector is used, then the fib expression fails to look up for the reverse path from the prerouting hook because the input interface cannot be inferred. In order to support this scenario, extend the fib expression to allow to use after the route lookup, from the forward hook. This patch also adds support for the input hook for usability reasons. Since the prerouting hook cannot be used for the scenario described above, users need two rules: one for the forward chain and another rule for the input chain to check for the reverse path check for locally targeted traffic. Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-04-09netfilter: bitwise: improve error goto labelsJeremy Sowden1-5/+6
Replace two labels (`err1` and `err2`) with more informative ones. Signed-off-by: Jeremy Sowden <jeremy@azazel.net> Signed-off-by: Florian Westphal <fw@strlen.de>