Age | Commit message (Collapse) | Author | Files | Lines |
|
using packetdrill it's possible to observe that the receiver key contains
random values when clients transmit MP_CAPABLE with data and checksum (as
specified in RFC8684 §3.1). Fix the layout of mptcp_out_options, to avoid
using the skb extension copy when writing the MP_CAPABLE sub-option.
Fixes: d7b269083786 ("mptcp: shrink mptcp_out_options struct")
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/233
Reported-by: Poorva Sonparote <psonparo@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Link: https://lore.kernel.org/r/20211027203855.264600-1-mathew.j.martineau@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
sk->sk_err appears to expect a positive value, a convention that ktls
doesn't always follow and that leads to memory corruption in other code.
For instance,
[kworker]
tls_encrypt_done(..., err=<negative error from crypto request>)
tls_err_abort(.., err)
sk->sk_err = err;
[task]
splice_from_pipe_feed
...
tls_sw_do_sendpage
if (sk->sk_err) {
ret = -sk->sk_err; // ret is positive
splice_from_pipe_feed (continued)
ret = actor(...) // ret is still positive and interpreted as bytes
// written, resulting in underflow of buf->len and
// sd->len, leading to huge buf->offset and bogus
// addresses computed in later calls to actor()
Fix all tls_err_abort() callers to pass a negative error code
consistently and centralize the error-prone sign flip there, throwing in
a warning to catch future misuse and uninlining the function so it
really does only warn once.
Cc: stable@vger.kernel.org
Fixes: c46234ebb4d1e ("tls: RX path for ktls")
Reported-by: syzbot+b187b77c8474f9648fae@syzkaller.appspotmail.com
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Two fixes:
* bridge vs. 4-addr mode check was wrong
* management frame registrations locking was
wrong, causing list corruption/crashes
====================
Link: https://lore.kernel.org/r/20211027143756.91711-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Daniel Borkmann says:
====================
pull-request: bpf 2021-10-26
We've added 12 non-merge commits during the last 7 day(s) which contain
a total of 23 files changed, 118 insertions(+), 98 deletions(-).
The main changes are:
1) Fix potential race window in BPF tail call compatibility check, from Toke Høiland-Jørgensen.
2) Fix memory leak in cgroup fs due to missing cgroup_bpf_offline(), from Quanyang Wang.
3) Fix file descriptor reference counting in generic_map_update_batch(), from Xu Kuohai.
4) Fix bpf_jit_limit knob to the max supported limit by the arch's JIT, from Lorenz Bauer.
5) Fix BPF sockmap ->poll callbacks for UDP and AF_UNIX sockets, from Cong Wang and Yucong Sun.
6) Fix BPF sockmap concurrency issue in TCP on non-blocking sendmsg calls, from Liu Jian.
7) Fix build failure of INODE_STORAGE and TASK_STORAGE maps on !CONFIG_NET, from Tejun Heo.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Fix potential race in tail call compatibility check
bpf: Move BPF_MAP_TYPE for INODE_STORAGE and TASK_STORAGE outside of CONFIG_NET
selftests/bpf: Use recv_timeout() instead of retries
net: Implement ->sock_is_readable() for UDP and AF_UNIX
skmsg: Extract and reuse sk_msg_is_readable()
net: Rename ->stream_memory_read to ->sock_is_readable
tcp_bpf: Fix one concurrency problem in the tcp_bpf_send_verdict function
cgroup: Fix memory leak caused by missing cgroup_bpf_offline
bpf: Fix error usage of map_fd and fdget() in generic_map_update_batch()
bpf: Prevent increasing bpf_jit_limit above max
bpf: Define bpf_jit_alloc_exec_limit for arm64 JIT
bpf: Define bpf_jit_alloc_exec_limit for riscv JIT
====================
Link: https://lore.kernel.org/r/20211026201920.11296-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The proto ops ->stream_memory_read() is currently only used
by TCP to check whether psock queue is empty or not. We need
to rename it before reusing it for non-TCP protocols, and
adjust the exsiting users accordingly.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211008203306.37525-2-xiyou.wangcong@gmail.com
|
|
During a testing of an user-space application which transmits UDP
multicast datagrams and utilizes multicast routing to send the UDP
datagrams out of defined network interfaces, I've found a multicast
router does not fill-in UDP checksum into locally produced, looped-back
and forwarded UDP datagrams, if an original output NIC the datagrams
are sent to has UDP TX checksum offload enabled.
The datagrams are sent malformed out of the NIC the datagrams have been
forwarded to.
It is because:
1. If TX checksum offload is enabled on the output NIC, UDP checksum
is not calculated by kernel and is not filled into skb data.
2. dev_loopback_xmit(), which is called solely by
ip_mc_finish_output(), sets skb->ip_summed = CHECKSUM_UNNECESSARY
unconditionally.
3. Since 35fc92a9 ("[NET]: Allow forwarding of ip_summed except
CHECKSUM_COMPLETE"), the ip_summed value is preserved during
forwarding.
4. If ip_summed != CHECKSUM_PARTIAL, checksum is not calculated during
a packet egress.
The minimum fix in dev_loopback_xmit():
1. Preserves skb->ip_summed CHECKSUM_PARTIAL. This is the
case when the original output NIC has TX checksum offload enabled.
The effects are:
a) If the forwarding destination interface supports TX checksum
offloading, the NIC driver is responsible to fill-in the
checksum.
b) If the forwarding destination interface does NOT support TX
checksum offloading, checksums are filled-in by kernel before
skb is submitted to the NIC driver.
c) For local delivery, checksum validation is skipped as in the
case of CHECKSUM_UNNECESSARY, thanks to skb_csum_unnecessary().
2. Translates ip_summed CHECKSUM_NONE to CHECKSUM_UNNECESSARY. It
means, for CHECKSUM_NONE, the behavior is unmodified and is there
to skip a looped-back packet local delivery checksum validation.
Signed-off-by: Cyril Strejc <cyril.strejc@skoda.cz>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The management registrations locking was broken, the list was
locked for each wdev, but cfg80211_mgmt_registrations_update()
iterated it without holding all the correct spinlocks, causing
list corruption.
Rather than trying to fix it with fine-grained locking, just
move the lock to the wiphy/rdev (still need the list on each
wdev), we already need to hold the wdev lock to change it, so
there's no contention on the lock in any case. This trivially
fixes the bug since we hold one wdev's lock already, and now
will hold the lock that protects all lists.
Cc: stable@vger.kernel.org
Reported-by: Jouni Malinen <j@w1.fi>
Fixes: 6cd536fe62ef ("cfg80211: change internal management frame registration API")
Link: https://lore.kernel.org/r/20211025133111.5cf733eab0f4.I7b0abb0494ab712f74e2efcd24bb31ac33f7eee9@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Use the more precise __kernel_sa_family_t for smctp_family, to match
struct sockaddr.
Also, use an unsigned int for the network member; negative networks
don't make much sense. We're already using unsigned for mctp_dev and
mctp_skb_cb, but need to change mctp_sock to suit.
Fixes: 60fc63981693 ("mctp: Add sockaddr_mctp to uapi")
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Acked-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Multiple VRFs are generally meant to be "separate" but right now md5
keys for the default VRF also affect connections inside VRFs if the IP
addresses happen to overlap.
So far the combination of TCP_MD5SIG_FLAG_IFINDEX with tcpm_ifindex == 0
was an error, accept this to mean "key only applies to default VRF".
This is what applications using VRFs for traffic separation want.
Signed-off-by: Leonard Crestez <cdleonard@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
transport encap_port update should be updated when sctp_vtag_verify()
succeeds, namely, returns 1, not returns 0. Correct it in this patch.
While at it, also fix the indentation.
Fixes: a1dd2cf2f1ae ("sctp: allow changing transport encap_port by peer packets")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pablo Neira Ayuso says:
====================
Netfilter fixes for net (v2)
The following patchset contains Netfilter fixes for net:
1) Move back the defrag users fields to the global netns_nf area.
Kernel fails to boot if conntrack is builtin and kernel is booted
with: nf_conntrack.enable_hooks=1. From Florian Westphal.
2) Rule event notification is missing relevant context such as
the position handle and the NLM_F_APPEND flag.
3) Rule replacement is expanded to add + delete using the existing
rule handle, reverse order of this operation so it makes sense
from rule notification standpoint.
4) Propagate to userspace the NLM_F_CREATE and NLM_F_EXCL flags
from the rule notification path.
Patches #2, #3 and #4 are used by 'nft monitor' and 'iptables-monitor'
userspace utilities which are not correctly representing the following
operations through netlink notifications:
- rule insertions
- rule addition/insertion from position handle
- create table/chain/set/map/flowtable/...
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Include the NLM_F_CREATE and NLM_F_EXCL flags in netlink event
notifications, otherwise userspace cannot distiguish between create and
add commands.
Fixes: 96518518cc41 ("netfilter: add nftables")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Fixes following warning:
include/net/sock.h:533: warning: Function parameter or member 'sk_peer_lock' not described in 'sock'
Fixes: 35306eb23814 ("af_unix: fix races in sk_peer_pid and sk_peer_cred accesses")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Link: https://lore.kernel.org/r/20211001164622.58520-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Jann Horn reported that SO_PEERCRED and SO_PEERGROUPS implementations
are racy, as af_unix can concurrently change sk_peer_pid and sk_peer_cred.
In order to fix this issue, this patch adds a new spinlock that needs
to be used whenever these fields are read or written.
Jann also pointed out that l2cap_sock_get_peer_pid_cb() is currently
reading sk->sk_peer_pid which makes no sense, as this field
is only possibly set by AF_UNIX sockets.
We will have to clean this in a separate patch.
This could be done by reverting b48596d1dc25 "Bluetooth: L2CAP: Add get_peer_pid callback"
or implementing what was truly expected.
Fixes: 109f6e39fa07 ("af_unix: Allow SO_PEERCRED to work across namespaces.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jann Horn <jannh@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Syzkaller reported a false positive deadlock involving
the nl socket lock and the subflow socket lock:
MPTCP: kernel_bind error, err=-98
============================================
WARNING: possible recursive locking detected
5.15.0-rc1-syzkaller #0 Not tainted
--------------------------------------------
syz-executor998/6520 is trying to acquire lock:
ffff8880795718a0 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close+0x267/0x7b0 net/mptcp/protocol.c:2738
but task is already holding lock:
ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1612 [inline]
ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close+0x23/0x7b0 net/mptcp/protocol.c:2720
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(k-sk_lock-AF_INET);
lock(k-sk_lock-AF_INET);
*** DEADLOCK ***
May be due to missing lock nesting notation
3 locks held by syz-executor998/6520:
#0: ffffffff8d176c50 (cb_lock){++++}-{3:3}, at: genl_rcv+0x15/0x40 net/netlink/genetlink.c:802
#1: ffffffff8d176d08 (genl_mutex){+.+.}-{3:3}, at: genl_lock net/netlink/genetlink.c:33 [inline]
#1: ffffffff8d176d08 (genl_mutex){+.+.}-{3:3}, at: genl_rcv_msg+0x3e0/0x580 net/netlink/genetlink.c:790
#2: ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1612 [inline]
#2: ffff8880787c8c60 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close+0x23/0x7b0 net/mptcp/protocol.c:2720
stack backtrace:
CPU: 1 PID: 6520 Comm: syz-executor998 Not tainted 5.15.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_deadlock_bug kernel/locking/lockdep.c:2944 [inline]
check_deadlock kernel/locking/lockdep.c:2987 [inline]
validate_chain kernel/locking/lockdep.c:3776 [inline]
__lock_acquire.cold+0x149/0x3ab kernel/locking/lockdep.c:5015
lock_acquire kernel/locking/lockdep.c:5625 [inline]
lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5590
lock_sock_fast+0x36/0x100 net/core/sock.c:3229
mptcp_close+0x267/0x7b0 net/mptcp/protocol.c:2738
inet_release+0x12e/0x280 net/ipv4/af_inet.c:431
__sock_release net/socket.c:649 [inline]
sock_release+0x87/0x1b0 net/socket.c:677
mptcp_pm_nl_create_listen_socket+0x238/0x2c0 net/mptcp/pm_netlink.c:900
mptcp_nl_cmd_add_addr+0x359/0x930 net/mptcp/pm_netlink.c:1170
genl_family_rcv_msg_doit+0x228/0x320 net/netlink/genetlink.c:731
genl_family_rcv_msg net/netlink/genetlink.c:775 [inline]
genl_rcv_msg+0x328/0x580 net/netlink/genetlink.c:792
netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504
genl_rcv+0x24/0x40 net/netlink/genetlink.c:803
netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340
netlink_sendmsg+0x86d/0xdb0 net/netlink/af_netlink.c:1929
sock_sendmsg_nosec net/socket.c:704 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:724
sock_no_sendpage+0x101/0x150 net/core/sock.c:2980
kernel_sendpage.part.0+0x1a0/0x340 net/socket.c:3504
kernel_sendpage net/socket.c:3501 [inline]
sock_sendpage+0xe5/0x140 net/socket.c:1003
pipe_to_sendpage+0x2ad/0x380 fs/splice.c:364
splice_from_pipe_feed fs/splice.c:418 [inline]
__splice_from_pipe+0x43e/0x8a0 fs/splice.c:562
splice_from_pipe fs/splice.c:597 [inline]
generic_splice_sendpage+0xd4/0x140 fs/splice.c:746
do_splice_from fs/splice.c:767 [inline]
direct_splice_actor+0x110/0x180 fs/splice.c:936
splice_direct_to_actor+0x34b/0x8c0 fs/splice.c:891
do_splice_direct+0x1b3/0x280 fs/splice.c:979
do_sendfile+0xae9/0x1240 fs/read_write.c:1249
__do_sys_sendfile64 fs/read_write.c:1314 [inline]
__se_sys_sendfile64 fs/read_write.c:1300 [inline]
__x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1300
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f215cb69969
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffc96bb3868 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
RAX: ffffffffffffffda RBX: 00007f215cbad072 RCX: 00007f215cb69969
RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000005
RBP: 0000000000000000 R08: 00007ffc96bb3a08 R09: 00007ffc96bb3a08
R10: 0000000100000002 R11: 0000000000000246 R12: 00007ffc96bb387c
R13: 431bde82d7b634db R14: 0000000000000000 R15: 0000000000000000
the problem originates from uncorrect lock annotation in the mptcp
code and is only visible since commit 2dcb96bacce3 ("net: core: Correct
the sock::sk_lock.owned lockdep annotations"), but is present since
the port-based endpoint support initial implementation.
This patch addresses the issue introducing a nested variant of
lock_sock_fast() and using it in the relevant code path.
Fixes: 1729cf186d8a ("mptcp: create the listening socket for new port")
Fixes: 2dcb96bacce3 ("net: core: Correct the sock::sk_lock.owned lockdep annotations")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Reported-and-tested-by: syzbot+1dd53f7a89b299d59eaf@syzkaller.appspotmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This is a revert of
7b1957b049 ("netfilter: nf_defrag_ipv4: use net_generic infra")
and a partial revert of
8b0adbe3e3 ("netfilter: nf_defrag_ipv6: use net_generic infra").
If conntrack is builtin and kernel is booted with:
nf_conntrack.enable_hooks=1
.... kernel will fail to boot due to a NULL deref in
nf_defrag_ipv4_enable(): Its called before the ipv4 defrag initcall is
made, so net_generic() returns NULL.
To resolve this, move the user refcount back to struct net so calls
to those functions are possible even before their initcalls have run.
Fixes: 7b1957b04956 ("netfilter: nf_defrag_ipv4: use net_generic infra")
Fixes: 8b0adbe3e38d ("netfilter: nf_defrag_ipv6: use net_generic infra").
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes berg says:
====================
Some fixes:
* potential use-after-free in CCMP/GCMP RX processing
* potential use-after-free in TX A-MSDU processing
* revert to low data rates for no-ack as the commit
broke other things
* limit VHT MCS/NSS in radiotap injection
* drop frames with invalid addresses in IBSS mode
* check rhashtable_init() return value in mesh
* fix potentially unaligned access in mesh
* fix late beacon hrtimer handling in hwsim (syzbot)
* fix documentation for PTK0 rekeying
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
@IEEE80211_KEY_FLAG_GENERATE_IV setting is irrelevant for RX.
Move the requirement to the correct section in the PTK0 rekey
documentation.
Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
Link: https://lore.kernel.org/r/20210924200514.7936-1-alexander@wetzel-home.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
We observed below report when playing with netlink sock:
UBSAN: shift-out-of-bounds in net/sched/sch_api.c:580:10
shift exponent 249 is too large for 32-bit type
CPU: 0 PID: 685 Comm: a.out Not tainted
Call Trace:
dump_stack_lvl+0x8d/0xcf
ubsan_epilogue+0xa/0x4e
__ubsan_handle_shift_out_of_bounds+0x161/0x182
__qdisc_calculate_pkt_len+0xf0/0x190
__dev_queue_xmit+0x2ed/0x15b0
it seems like kernel won't check the stab log value passing from
user, and will use the insane value later to calculate pkt_len.
This patch just add a check on the size/cell_log to avoid insane
calculation.
Reported-by: Abaci <abaci@linux.alibaba.com>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Multipath RTA_FLOW is embedded in nexthop. Dump it in fib_add_nexthop()
to get the length of rtnexthop correct.
Fixes: b0f60193632e ("ipv4: Refactor nexthop attributes in fib_dump_info")
Signed-off-by: Xiao Liang <shaw.leon@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
on error
Commit 86f8b1c01a0a ("net: dsa: Do not make user port errors fatal")
decided it was fine to ignore errors on certain ports that fail to
probe, and go on with the ports that do probe fine.
Commit fb6ec87f7229 ("net: dsa: Fix type was not set for devlink port")
noticed that devlink_port_type_eth_set(dlp, dp->slave); does not get
called, and devlink notices after a timeout of 3600 seconds and prints a
WARN_ON. So it went ahead to unregister the devlink port. And because
there exists an UNUSED port flavour, we actually re-register the devlink
port as UNUSED.
Commit 08156ba430b4 ("net: dsa: Add devlink port regions support to
DSA") added devlink port regions, which are set up by the driver and not
by DSA.
When we trigger the devlink port deregistration and reregistration as
unused, devlink now prints another WARN_ON, from here:
devlink_port_unregister:
WARN_ON(!list_empty(&devlink_port->region_list));
So the port still has regions, which makes sense, because they were set
up by the driver, and the driver doesn't know we're unregistering the
devlink port.
Somebody needs to tear them down, and optionally (actually it would be
nice, to be consistent) set them up again for the new devlink port.
But DSA's layering stays in our way quite badly here.
The options I've considered are:
1. Introduce a function in devlink to just change a port's type and
flavour. No dice, devlink keeps a lot of state, it really wants the
port to not be registered when you set its parameters, so changing
anything can only be done by destroying what we currently have and
recreating it.
2. Make DSA cache the parameters passed to dsa_devlink_port_region_create,
and the region returned, keep those in a list, then when the devlink
port unregister needs to take place, the existing devlink regions are
destroyed by DSA, and we replay the creation of new regions using the
cached parameters. Problem: mv88e6xxx keeps the region pointers in
chip->ports[port].region, and these will remain stale after DSA frees
them. There are many things DSA can do, but updating mv88e6xxx's
private pointers is not one of them.
3. Just let the driver do it (i.e. introduce a very specific method
called ds->ops->port_reinit_as_unused, which unregisters its devlink
port devlink regions, then the old devlink port, then registers the
new one, then the devlink port regions for it). While it does work,
as opposed to the others, it's pretty horrible from an API
perspective and we can do better.
4. Introduce a new pair of methods, ->port_setup and ->port_teardown,
which in the case of mv88e6xxx must register and unregister the
devlink port regions. Call these 2 methods when the port must be
reinitialized as unused.
Naturally, I went for the 4th approach.
Fixes: 08156ba430b4 ("net: dsa: Add devlink port regions support to DSA")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
lock_sock_fast() and lock_sock_nested() contain lockdep annotations for the
sock::sk_lock.owned 'mutex'. sock::sk_lock.owned is not a regular mutex. It
is just lockdep wise equivalent. In fact it's an open coded trivial mutex
implementation with some interesting features.
sock::sk_lock.slock is a regular spinlock protecting the 'mutex'
representation sock::sk_lock.owned which is a plain boolean. If 'owned' is
true, then some other task holds the 'mutex', otherwise it is uncontended.
As this locking construct is obviously endangered by lock ordering issues as
any other locking primitive it got lockdep annotated via a dedicated
dependency map sock::sk_lock.dep_map which has to be updated at the lock
and unlock sites.
lock_sock_nested() is a straight forward 'mutex' lock operation:
might_sleep();
spin_lock_bh(sock::sk_lock.slock)
while (!try_lock(sock::sk_lock.owned)) {
spin_unlock_bh(sock::sk_lock.slock);
wait_for_release();
spin_lock_bh(sock::sk_lock.slock);
}
The lockdep annotation for sock::sk_lock.owned is for unknown reasons
_after_ the lock has been acquired, i.e. after the code block above and
after releasing sock::sk_lock.slock, but inside the bottom halves disabled
region:
spin_unlock(sock::sk_lock.slock);
mutex_acquire(&sk->sk_lock.dep_map, subclass, 0, _RET_IP_);
local_bh_enable();
The placement after the unlock is obvious because otherwise the
mutex_acquire() would nest into the spin lock held region.
But that's from the lockdep perspective still the wrong place:
1) The mutex_acquire() is issued _after_ the successful acquisition which
is pointless because in a dead lock scenario this point is never
reached which means that if the deadlock is the first instance of
exposing the wrong lock order lockdep does not have a chance to detect
it.
2) It only works because lockdep is rather lax on the context from which
the mutex_acquire() is issued. Acquiring a mutex inside a bottom halves
and therefore non-preemptible region is obviously invalid, except for a
trylock which is clearly not the case here.
This 'works' stops working on RT enabled kernels where the bottom halves
serialization is done via a local lock, which exposes this misplacement
because the 'mutex' and the local lock nest the wrong way around and
lockdep complains rightfully about a lock inversion.
The placement is wrong since the initial commit a5b5bb9a053a ("[PATCH]
lockdep: annotate sk_locks") which introduced this.
Fix it by moving the mutex_acquire() in front of the actual lock
acquisition, which is what the regular mutex_lock() operation does as well.
lock_sock_fast() is not that straight forward. It looks at the first glance
like a convoluted trylock operation:
spin_lock_bh(sock::sk_lock.slock)
if (!sock::sk_lock.owned)
return false;
while (!try_lock(sock::sk_lock.owned)) {
spin_unlock_bh(sock::sk_lock.slock);
wait_for_release();
spin_lock_bh(sock::sk_lock.slock);
}
spin_unlock(sock::sk_lock.slock);
mutex_acquire(&sk->sk_lock.dep_map, subclass, 0, _RET_IP_);
local_bh_enable();
return true;
But that's not the case: lock_sock_fast() is an interesting optimization
for short critical sections which can run with bottom halves disabled and
sock::sk_lock.slock held. This allows to shortcut the 'mutex' operation in
the non contended case by preventing other lockers to acquire
sock::sk_lock.owned because they are blocked on sock::sk_lock.slock, which
in turn avoids the overhead of doing the heavy processing in release_sock()
including waking up wait queue waiters.
In the contended case, i.e. when sock::sk_lock.owned == true the behavior
is the same as lock_sock_nested().
Semantically this shortcut means, that the task acquired the 'mutex' even
if it does not touch the sock::sk_lock.owned field in the non-contended
case. Not telling lockdep about this shortcut acquisition is hiding
potential lock ordering violations in the fast path.
As a consequence the same reasoning as for the above lock_sock_nested()
case vs. the placement of the lockdep annotation applies.
The current placement of the lockdep annotation was just copied from
the original lock_sock(), now renamed to lock_sock_nested(),
implementation.
Fix this by moving the mutex_acquire() in front of the actual lock
acquisition and adding the corresponding mutex_release() into
unlock_sock_fast(). Also document the fast path return case with a comment.
Reported-by: Sebastian Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: netdev@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Lino reports that on his system with bcmgenet as DSA master and KSZ9897
as a switch, rebooting or shutting down never works properly.
What does the bcmgenet driver have special to trigger this, that other
DSA masters do not? It has an implementation of ->shutdown which simply
calls its ->remove implementation. Otherwise said, it unregisters its
network interface on shutdown.
This message can be seen in a loop, and it hangs the reboot process there:
unregister_netdevice: waiting for eth0 to become free. Usage count = 3
So why 3?
A usage count of 1 is normal for a registered network interface, and any
virtual interface which links itself as an upper of that will increment
it via dev_hold. In the case of DSA, this is the call path:
dsa_slave_create
-> netdev_upper_dev_link
-> __netdev_upper_dev_link
-> __netdev_adjacent_dev_insert
-> dev_hold
So a DSA switch with 3 interfaces will result in a usage count elevated
by two, and netdev_wait_allrefs will wait until they have gone away.
Other stacked interfaces, like VLAN, watch NETDEV_UNREGISTER events and
delete themselves, but DSA cannot just vanish and go poof, at most it
can unbind itself from the switch devices, but that must happen strictly
earlier compared to when the DSA master unregisters its net_device, so
reacting on the NETDEV_UNREGISTER event is way too late.
It seems that it is a pretty established pattern to have a driver's
->shutdown hook redirect to its ->remove hook, so the same code is
executed regardless of whether the driver is unbound from the device, or
the system is just shutting down. As Florian puts it, it is quite a big
hammer for bcmgenet to unregister its net_device during shutdown, but
having a common code path with the driver unbind helps ensure it is well
tested.
So DSA, for better or for worse, has to live with that and engage in an
arms race of implementing the ->shutdown hook too, from all individual
drivers, and do something sane when paired with masters that unregister
their net_device there. The only sane thing to do, of course, is to
unlink from the master.
However, complications arise really quickly.
The pattern of redirecting ->shutdown to ->remove is not unique to
bcmgenet or even to net_device drivers. In fact, SPI controllers do it
too (see dspi_shutdown -> dspi_remove), and presumably, I2C controllers
and MDIO controllers do it too (this is something I have not researched
too deeply, but even if this is not the case today, it is certainly
plausible to happen in the future, and must be taken into consideration).
Since DSA switches might be SPI devices, I2C devices, MDIO devices, the
insane implication is that for the exact same DSA switch device, we
might have both ->shutdown and ->remove getting called.
So we need to do something with that insane environment. The pattern
I've come up with is "if this, then not that", so if either ->shutdown
or ->remove gets called, we set the device's drvdata to NULL, and in the
other hook, we check whether the drvdata is NULL and just do nothing.
This is probably not necessary for platform devices, just for devices on
buses, but I would really insist for consistency among drivers, because
when code is copy-pasted, it is not always copy-pasted from the best
sources.
So depending on whether the DSA switch's ->remove or ->shutdown will get
called first, we cannot really guarantee even for the same driver if
rebooting will result in the same code path on all platforms. But
nonetheless, we need to do something minimally reasonable on ->shutdown
too to fix the bug. Of course, the ->remove will do more (a full
teardown of the tree, with all data structures freed, and this is why
the bug was not caught for so long). The new ->shutdown method is kept
separate from dsa_unregister_switch not because we couldn't have
unregistered the switch, but simply in the interest of doing something
quick and to the point.
The big question is: does the DSA switch's ->shutdown get called earlier
than the DSA master's ->shutdown? If not, there is still a risk that we
might still trigger the WARN_ON in unregister_netdevice that says we are
attempting to unregister a net_device which has uppers. That's no good.
Although the reference to the master net_device won't physically go away
even if DSA's ->shutdown comes afterwards, remember we have a dev_hold
on it.
The answer to that question lies in this comment above device_link_add:
* A side effect of the link creation is re-ordering of dpm_list and the
* devices_kset list by moving the consumer device and all devices depending
* on it to the ends of these lists (that does not happen to devices that have
* not been registered when this function is called).
so the fact that DSA uses device_link_add towards its master is not
exactly for nothing. device_shutdown() walks devices_kset from the back,
so this is our guarantee that DSA's shutdown happens before the master's
shutdown.
Fixes: 2f1e8ea726e9 ("net: dsa: link interfaces with the DSA master to get rid of lockdep warnings")
Link: https://lore.kernel.org/netdev/20210909095324.12978-1-LinoSanfilippo@gmx.de/
Reported-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Sometimes when unbinding the mv88e6xxx driver on Turris MOX, these error
messages appear:
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete be:79:b4:9e:9e:96 vid 1 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete be:79:b4:9e:9e:96 vid 0 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 100 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 1 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 0 from fdb: -2
(and similarly for other ports)
What happens is that DSA has a policy "even if there are bugs, let's at
least not leak memory" and dsa_port_teardown() clears the dp->fdbs and
dp->mdbs lists, which are supposed to be empty.
But deleting that cleanup code, the warnings go away.
=> the FDB and MDB lists (used for refcounting on shared ports, aka CPU
and DSA ports) will eventually be empty, but are not empty by the time
we tear down those ports. Aka we are deleting them too soon.
The addresses that DSA complains about are host-trapped addresses: the
local addresses of the ports, and the MAC address of the bridge device.
The problem is that offloading those entries happens from a deferred
work item scheduled by the SWITCHDEV_FDB_DEL_TO_DEVICE handler, and this
races with the teardown of the CPU and DSA ports where the refcounting
is kept.
In fact, not only it races, but fundamentally speaking, if we iterate
through the port list linearly, we might end up tearing down the shared
ports even before we delete a DSA user port which has a bridge upper.
So as it turns out, we need to first tear down the user ports (and the
unused ones, for no better place of doing that), then the shared ports
(the CPU and DSA ports). In between, we need to ensure that all work
items scheduled by our switchdev handlers (which only run for user
ports, hence the reason why we tear them down first) have finished.
Fixes: 161ca59d39e9 ("net: dsa: reference count the MDB entries at the cross-chip notifier level")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210914134726.2305133-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit 3df98d79215ace13 ("lsm,selinux: pass flowi_common instead of flowi
to the LSM hooks") introduced flowi{4,6}_to_flowi_common() functions which
cause UBSAN warning when building with LLVM 11.0.1 on Ubuntu 21.04.
================================================================================
UBSAN: object-size-mismatch in ./include/net/flow.h:197:33
member access within address ffffc9000109fbd8 with insufficient space
for an object of type 'struct flowi'
CPU: 2 PID: 7410 Comm: systemd-resolve Not tainted 5.14.0 #51
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
Call Trace:
dump_stack_lvl+0x103/0x171
ubsan_type_mismatch_common+0x1de/0x390
__ubsan_handle_type_mismatch_v1+0x41/0x50
udp_sendmsg+0xda2/0x1300
? ip_skb_dst_mtu+0x1f0/0x1f0
? sock_rps_record_flow+0xe/0x200
? inet_send_prepare+0x2d/0x90
sock_sendmsg+0x49/0x80
____sys_sendmsg+0x269/0x370
__sys_sendmsg+0x15e/0x1d0
? syscall_enter_from_user_mode+0xf0/0x1b0
do_syscall_64+0x3d/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f7081a50497
Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
RSP: 002b:00007ffc153870f8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f7081a50497
RDX: 0000000000000000 RSI: 00007ffc15387140 RDI: 000000000000000c
RBP: 00007ffc15387140 R08: 0000563f29a5e4fc R09: 000000000000cd28
R10: 0000563f29a68a30 R11: 0000000000000246 R12: 000000000000000c
R13: 0000000000000001 R14: 0000563f29a68a30 R15: 0000563f29a5e50c
================================================================================
I don't think we need to call flowi{4,6}_to_flowi() from these functions
because the first member of "struct flowi4" and "struct flowi6" is
struct flowi_common __fl_common;
while the first member of "struct flowi" is
union {
struct flowi_common __fl_common;
struct flowi4 ip4;
struct flowi6 ip6;
struct flowidn dn;
} u;
which should point to the same address without access to "struct flowi".
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Daniel Borkmann says:
====================
bpf-next 2021-08-31
We've added 116 non-merge commits during the last 17 day(s) which contain
a total of 126 files changed, 6813 insertions(+), 4027 deletions(-).
The main changes are:
1) Add opaque bpf_cookie to perf link which the program can read out again,
to be used in libbpf-based USDT library, from Andrii Nakryiko.
2) Add bpf_task_pt_regs() helper to access userspace pt_regs, from Daniel Xu.
3) Add support for UNIX stream type sockets for BPF sockmap, from Jiang Wang.
4) Allow BPF TCP congestion control progs to call bpf_setsockopt() e.g. to switch
to another congestion control algorithm during init, from Martin KaFai Lau.
5) Extend BPF iterator support for UNIX domain sockets, from Kuniyuki Iwashima.
6) Allow bpf_{set,get}sockopt() calls from setsockopt progs, from Prankur Gupta.
7) Add bpf_get_netns_cookie() helper for BPF_PROG_TYPE_{SOCK_OPS,CGROUP_SOCKOPT}
progs, from Xu Liu and Stanislav Fomichev.
8) Support for __weak typed ksyms in libbpf, from Hao Luo.
9) Shrink struct cgroup_bpf by 504 bytes through refactoring, from Dave Marchevsky.
10) Fix a smatch complaint in verifier's narrow load handling, from Andrey Ignatov.
11) Fix BPF interpreter's tail call count limit, from Daniel Borkmann.
12) Big batch of improvements to BPF selftests, from Magnus Karlsson, Li Zhijian,
Yucong Sun, Yonghong Song, Ilya Leoshkevich, Jussi Maki, Ilya Leoshkevich, others.
13) Another big batch to revamp XDP samples in order to give them consistent look
and feel, from Kumar Kartikeya Dwivedi.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (116 commits)
MAINTAINERS: Remove self from powerpc BPF JIT
selftests/bpf: Fix potential unreleased lock
samples: bpf: Fix uninitialized variable in xdp_redirect_cpu
selftests/bpf: Reduce more flakyness in sockmap_listen
bpf: Fix bpf-next builds without CONFIG_BPF_EVENTS
bpf: selftests: Add dctcp fallback test
bpf: selftests: Add connect_to_fd_opts to network_helpers
bpf: selftests: Add sk_state to bpf_tcp_helpers.h
bpf: tcp: Allow bpf-tcp-cc to call bpf_(get|set)sockopt
selftests: xsk: Preface options with opt
selftests: xsk: Make enums lower case
selftests: xsk: Generate packets from specification
selftests: xsk: Generate packet directly in umem
selftests: xsk: Simplify cleanup of ifobjects
selftests: xsk: Decrease sending speed
selftests: xsk: Validate tx stats on tx thread
selftests: xsk: Simplify packet validation in xsk tests
selftests: xsk: Rename worker_* functions that are not thread entry points
selftests: xsk: Disassociate umem size with packets sent
selftests: xsk: Remove end-of-test packet
...
====================
Link: https://lore.kernel.org/r/20210830225618.11634-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In HTB offload mode, qdiscs of leaf classes are grafted to netdev
queues. sch_htb expects the dev_queue field of these qdiscs to point to
the corresponding queues. However, qdisc creation may fail, and in that
case noop_qdisc is used instead. Its dev_queue doesn't point to the
right queue, so sch_htb can lose track of used netdev queues, which will
cause internal inconsistencies.
This commit fixes this bug by keeping track of the netdev queue inside
struct htb_class. All reads of cl->leaf.q->dev_queue are replaced by the
new field, the two values are synced on writes, and WARNs are added to
assert equality of the two values.
The driver API has changed: when TC_HTB_LEAF_DEL needs to move a queue,
the driver used to pass the old and new queue IDs to sch_htb. Now that
there is a new field (offload_queue) in struct htb_class that needs to
be updated on this operation, the driver will pass the old class ID to
sch_htb instead (it already knows the new class ID).
Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20210826115425.1744053-1-maximmi@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Clean up and consolidate ct ecache infrastructure by merging ct and
expect notifiers, from Florian Westphal.
2) Missing counters and timestamp in nfnetlink_queue and _log conntrack
information.
3) Missing error check for xt_register_template() in iptables mangle,
as a incremental fix for the previous pull request, also from
Florian Westphal.
4) Add netfilter hooks for the SRv6 lightweigh tunnel driver, from
Ryoga Sato. The hooks are enabled via nf_hooks_lwtunnel sysctl
to make sure existing netfilter rulesets do not break. There is
a static key to disable the hooks by default.
The pktgen_bench_xmit_mode_netif_receive.sh shows no noticeable
impact in the seg6_input path for non-netfilter users: similar
numbers with and without this patch.
This is a sample of the perf report output:
11.67% kpktgend_0 [ipv6] [k] ipv6_get_saddr_eval
7.89% kpktgend_0 [ipv6] [k] __ipv6_addr_label
7.52% kpktgend_0 [ipv6] [k] __ipv6_dev_get_saddr
6.63% kpktgend_0 [kernel.vmlinux] [k] asm_exc_nmi
4.74% kpktgend_0 [ipv6] [k] fib6_node_lookup_1
3.48% kpktgend_0 [kernel.vmlinux] [k] pskb_expand_head
3.33% kpktgend_0 [ipv6] [k] ip6_rcv_core.isra.29
3.33% kpktgend_0 [ipv6] [k] seg6_do_srh_encap
2.53% kpktgend_0 [ipv6] [k] ipv6_dev_get_saddr
2.45% kpktgend_0 [ipv6] [k] fib6_table_lookup
2.24% kpktgend_0 [kernel.vmlinux] [k] ___cache_free
2.16% kpktgend_0 [ipv6] [k] ip6_pol_route
2.11% kpktgend_0 [kernel.vmlinux] [k] __ipv6_addr_type
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch introduces netfilter hooks for solving the problem that
conntrack couldn't record both inner flows and outer flows.
This patch also introduces a new sysctl toggle for enabling lightweight
tunnel netfilter hooks.
Signed-off-by: Ryoga Saito <contact@proelbtn.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
The kernel provides a "/proc/sys/net/ipv6/conf/<iface>/mtu"
file, which can temporarily record the mtu value of the last
received RA message when the RA mtu value is lower than the
interface mtu, but this proc has following limitations:
(1) when the interface mtu (/sys/class/net/<iface>/mtu) is
updeated, mtu6 (/proc/sys/net/ipv6/conf/<iface>/mtu) will
be updated to the value of interface mtu;
(2) mtu6 (/proc/sys/net/ipv6/conf/<iface>/mtu) only affect
ipv6 connection, and not affect ipv4.
Therefore, when the mtu option is carried in the RA message,
there will be a problem that the user sometimes cannot obtain
RA mtu value correctly by reading mtu6.
After this patch set, if a RA message carries the mtu option,
you can send a netlink msg which nlmsg_type is RTM_GETLINK,
and then by parsing the attribute of IFLA_INET6_RA_MTU to
get the mtu value carried in the RA message received on the
inet6 device. In addition, you can also get a link notification
when ra_mtu is updated so it doesn't have to poll.
In this way, if the MTU values that the device receives from
the network in the PCO IPv4 and the RA IPv6 procedures are
different, the user can obtain the correct ipv6 ra_mtu value
and compare the value of ra_mtu and ipv4 mtu, then the device
can use the lower MTU value for both IPv4 and IPv6.
Signed-off-by: Rocco Yue <rocco.yue@mediatek.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20210827150412.9267-1-rocco.yue@mediatek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
ipsec-next
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2021-08-27
1) Remove an unneeded extra variable in esp4 esp_ssg_unref.
From Corey Minyard.
2) Add a configuration option to change the default behaviour
to block traffic if there is no matching policy.
Joint work with Christian Langrock and Antony Antony.
3) Fix a shift-out-of-bounce bug reported from syzbot.
From Pavel Skripkin.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
drivers/net/wwan/mhi_wwan_mbim.c - drop the extra arg.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
A few more things:
* Use correct DFS domain for self-managed devices
* some preparations for transmit power element handling
and other 6 GHz regulatory handling
* TWT support in AP mode in mac80211
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Both SKB_FRAG_PAGE_ORDER are defined to the same value in
net/core/sock.c and drivers/vhost/net.c.
Move the SKB_FRAG_PAGE_ORDER definition to net/core/sock.h,
as both net/core/sock.c and drivers/vhost/net.c include it,
and it seems a reasonable file to put the macro.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Reuse the conntrack event notofier struct, this allows to remove the
extra register/unregister functions and avoids a pointer in struct net.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This prepares for merge for ct and exp notifier structs.
The 'fcn' member is renamed to something unique.
Second, the register/unregister api is simplified. There is only
one implementation so there is no need to do any error checking.
Replace the EBUSY logic with WARN_ON_ONCE. This allows to remove
error unwinding.
The exp notifier register/unregister function is removed in
a followup patch.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
nf_conntrack_eventmask_report and nf_ct_deliver_cached_events shared
most of their code. This unifies the layout by changing
if (nf_ct_is_confirmed(ct)) {
foo
}
to
if (!nf_ct_is_confirmed(ct)))
return
foo
This removes one level of indentation.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Function 'mctp_dev_get_rtnl' is declared twice, so remove the
repeated declaration.
Cc: Jeremy Kerr <jk@codeconstruct.com.au>
Cc: Matt Johnston <matt@codeconstruct.com.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch added the MP_FAIL suboption sending support.
Add a new flag named send_mp_fail in struct mptcp_subflow_context. If
this flag is set, send out MP_FAIL suboption.
Add a new member fail_seq in struct mptcp_out_options to save the data
sequence number to put into the MP_FAIL suboption.
An MP_FAIL option could be included in a RST or on the subflow-level
ACK.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@xiaomi.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
After the previous patch we can alias with a union several
fields in mptcp_out_options. Such struct is stack allocated and
memset() for each plain TCP out packet. Every saved byted counts.
Before:
pahole -EC mptcp_out_options
# ...
/* size: 136, cachelines: 3, members: 17 */
After:
pahole -EC mptcp_out_options
# ...
/* size: 56, cachelines: 1, members: 9 */
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
correct comments in set and get fn_sernum
Signed-off-by: zhang kai <zhangkaiheb@126.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As explained in commit e358bef7c392 ("net: dsa: Give drivers the chance
to veto certain upper devices"), the hellcreek driver uses some tricks
to comply with the network stack expectations: it enforces port
separation in standalone mode using VLANs. For untagged traffic,
bridging between ports is prevented by using different PVIDs, and for
VLAN-tagged traffic, it never accepts 8021q uppers with the same VID on
two ports, so packets with one VLAN cannot leak from one port to another.
That is almost fine*, and has worked because hellcreek relied on an
implicit behavior of the DSA core that was changed by the previous
patch: the standalone ports declare the 'rx-vlan-filter' feature as 'on
[fixed]'. Since most of the DSA drivers are actually VLAN-unaware in
standalone mode, that feature was actually incorrectly reflecting the
hardware/driver state, so there was a desire to fix it. This leaves the
hellcreek driver in a situation where it has to explicitly request this
behavior from the DSA framework.
We configure the ports as follows:
- Standalone: 'rx-vlan-filter' is on. An 8021q upper on top of a
standalone hellcreek port will go through dsa_slave_vlan_rx_add_vid
and will add a VLAN to the hardware tables, giving the driver the
opportunity to refuse it through .port_prechangeupper.
- Bridged with vlan_filtering=0: 'rx-vlan-filter' is off. An 8021q upper
on top of a bridged hellcreek port will not go through
dsa_slave_vlan_rx_add_vid, because there will not be any attempt to
offload this VLAN. The driver already disables VLAN awareness, so that
upper should receive the traffic it needs.
- Bridged with vlan_filtering=1: 'rx-vlan-filter' is on. An 8021q upper
on top of a bridged hellcreek port will call dsa_slave_vlan_rx_add_vid,
and can again be vetoed through .port_prechangeupper.
*It is not actually completely fine, because if I follow through
correctly, we can have the following situation:
ip link add br0 type bridge vlan_filtering 0
ip link set lan0 master br0 # lan0 now becomes VLAN-unaware
ip link set lan0 nomaster # lan0 fails to become VLAN-aware again, therefore breaking isolation
This patch fixes that corner case by extending the DSA core logic, based
on this requested attribute, to change the VLAN awareness state of the
switch (port) when it leaves the bridge.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Introduce TWT action frames parsing support to mac80211.
Currently just individual TWT agreement are support in AP mode.
Whenever the AP receives a TWT action frame from an associated client,
after performing sanity checks, it will notify the underlay driver with
requested parameters in order to check if they are supported and if there
is enough room for a new agreement. The driver is expected to set the
agreement result and report it to mac80211.
Drivers supporting this have two new callbacks:
- add_twt_setup (mandatory)
- twt_teardown_request (optional)
mac80211 will send an action frame reply according to the result
reported by the driver.
Tested-by: Peter Chiu <chui-hao.chiu@mediatek.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/257512f2e22ba42b9f2624942a128dd8f141de4b.1629741512.git.lorenzo@kernel.org
[use le16p_replace_bits(), minor cleanups, use (void *) casts,
fix to use ieee80211_get_he_iftype_cap() correctly]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Right now, cross-tree bridging setups work somewhat by mistake.
In the case of cross-tree bridging with sja1105, all switch instances
need to agree upon a common VLAN ID for forwarding a packet that belongs
to a certain bridging domain.
With TX forwarding offload, the VLAN ID is the bridge VLAN for
VLAN-aware bridging, and the tag_8021q TX forwarding offload VID
(a VLAN which has non-zero VBID bits) for VLAN-unaware bridging.
The VBID for VLAN-unaware bridging is derived from the dp->bridge_num
value calculated by DSA independently for each switch tree.
If ports from one tree join one bridge, and ports from another tree join
another bridge, DSA will assign them the same bridge_num, even though
the bridges are different. If cross-tree bridging is supported, this
is an issue.
Modify DSA to calculate the bridge_num globally across all switch trees.
This has the implication for a driver that the dp->bridge_num value that
DSA will assign to its ports might not be contiguous, if there are
boards with multiple DSA drivers instantiated. Additionally, all
bridge_num values eat up towards each switch's
ds->num_fwd_offloading_bridges maximum, which is potentially unfortunate,
and can be seen as a limitation introduced by this patch. However, that
is the lesser evil for now.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
Minor updates:
* BSS coloring support
* MEI commands for Intel platforms
* various fixes/cleanups
* tag 'mac80211-next-for-net-next-2021-08-20' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next:
cfg80211: fix BSS color notify trace enum confusion
mac80211: Fix insufficient headroom issue for AMSDU
mac80211: add support for BSS color change
nl80211: add support for BSS coloring
mac80211: Use flex-array for radiotap header bitmap
mac80211: radiotap: Use BIT() instead of shifts
mac80211: Remove unnecessary variable and label
mac80211: include <linux/rbtree.h>
mac80211: Fix monitor MTU limit so that A-MSDUs get through
mac80211: remove unnecessary NULL check in ieee80211_register_hw()
mac80211: Reject zero MAC address in sta_info_insert_check()
nl80211: vendor-cmd: add Intel vendor commands for iwlmei usage
====================
Link: https://lore.kernel.org/r/20210820105329.48674-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Luiz Augusto von Dentz says:
====================
bluetooth-next pull request for net-next:
- Add support for Foxconn Mediatek Chip
- Add support for LG LGSBWAC92/TWCM-K505D
- hci_h5 flow control fixes and suspend support
- Switch to use lock_sock for SCO and RFCOMM
- Various fixes for extended advertising
- Reword Intel's setup on btusb unifying the supported generations
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
drivers/ptp/Kconfig:
55c8fca1dae1 ("ptp_pch: Restore dependency on PCI")
e5f31552674e ("ethernet: fix PTP_1588_CLOCK dependencies")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This reverts commit 9ea3e52c5bc8bb4a084938dc1e3160643438927a.
Cited commit added a check to make sure 'action' is not NULL, but
'action' is already dereferenced before the check, when calling
flow_offload_has_one_action().
Therefore, the check does not make any sense and results in a smatch
warning:
include/net/flow_offload.h:322 flow_action_mixed_hw_stats_check() warn:
variable dereferenced before check 'action' (see line 319)
Fix by reverting this commit.
Cc: gushengxian <gushengxian@yulong.com>
Fixes: 9ea3e52c5bc8 ("flow_offload: action should not be NULL when it is referenced")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20210819105842.1315705-1-idosch@idosch.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, when creating an ingress qdisc on an indirect device before
the driver registered for callbacks, the driver will not have a chance
to register its filter configuration callbacks.
To fix that, modify the code such that it keeps track of all the ingress
qdiscs that call flow_indr_dev_setup_offload(). When a driver calls
flow_indr_dev_register(), go through the list of tracked ingress qdiscs
and call the driver callback entry point so as to give it a chance to
register its callback.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add gfp_t mask as an input parameter to mem_cgroup_charge_skmem(),
to give more control to the networking stack and enable it to change
memcg charging behavior. In the future, the networking stack may decide
to avoid oom-kills when fallbacks are more appropriate.
One behavior change in mem_cgroup_charge_skmem() by this patch is to
avoid force charging by default and let the caller decide when and if
force charging is needed through the presence or absence of
__GFP_NOFAIL.
Signed-off-by: Wei Wang <weiwan@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|