Age | Commit message (Collapse) | Author | Files | Lines |
|
This patch makes cpumap functional, by adding SKB allocation and
invoking the network stack on the dequeuing CPU.
For constructing the SKB on the remote CPU, the xdp_buff in converted
into a struct xdp_pkt, and it mapped into the top headroom of the
packet, to avoid allocating separate mem. For now, struct xdp_pkt is
just a cpumap internal data structure, with info carried between
enqueue to dequeue.
If a driver doesn't have enough headroom it is simply dropped, with
return code -EOVERFLOW. This will be picked up the xdp tracepoint
infrastructure, to allow users to catch this.
V2: take into account xdp->data_meta
V4:
- Drop busypoll tricks, keeping it more simple.
- Skip RPS and Generic-XDP-recursive-reinjection, suggested by Alexei
V5: correct RCU read protection around __netif_receive_skb_core.
V6: Setting TASK_RUNNING vs TASK_INTERRUPTIBLE based on talk with Rik van Riel
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch connects cpumap to the xdp_do_redirect_map infrastructure.
Still no SKB allocation are done yet. The XDP frames are transferred
to the other CPU, but they are simply refcnt decremented on the remote
CPU. This served as a good benchmark for measuring the overhead of
remote refcnt decrement. If driver page recycle cache is not
efficient then this, exposes a bottleneck in the page allocator.
A shout-out to MST's ptr_ring, which is the secret behind is being so
efficient to transfer memory pointers between CPUs, without constantly
bouncing cache-lines between CPUs.
V3: Handle !CONFIG_BPF_SYSCALL pointed out by kbuild test robot.
V4: Make Generic-XDP aware of cpumap type, but don't allow redirect yet,
as implementation require a separate upstream discussion.
V5:
- Fix a maybe-uninitialized pointed out by kbuild test robot.
- Restrict bpf-prog side access to cpumap, open when use-cases appear
- Implement cpu_map_enqueue() as a more simple void pointer enqueue
V6:
- Allow cpumap type for usage in helper bpf_redirect_map,
general bpf-prog side restriction moved to earlier patch.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The 'cpumap' is primarily used as a backend map for XDP BPF helper
call bpf_redirect_map() and XDP_REDIRECT action, like 'devmap'.
This patch implement the main part of the map. It is not connected to
the XDP redirect system yet, and no SKB allocation are done yet.
The main concern in this patch is to ensure the datapath can run
without any locking. This adds complexity to the setup and tear-down
procedure, which assumptions are extra carefully documented in the
code comments.
V2:
- make sure array isn't larger than NR_CPUS
- make sure CPUs added is a valid possible CPU
V3: fix nitpicks from Jakub Kicinski <kubakici@wp.pl>
V5:
- Restrict map allocation to root / CAP_SYS_ADMIN
- WARN_ON_ONCE if queue is not empty on tear-down
- Return -EPERM on memlock limit instead of -ENOMEM
- Error code in __cpu_map_entry_alloc() also handle ptr_ring_cleanup()
- Moved cpu_map_enqueue() to next patch
V6: all notice by Daniel Borkmann
- Fix err return code in cpu_map_alloc() introduced in V5
- Move cpu_possible() check after max_entries boundary check
- Forbid usage initially in check_map_func_compatibility()
V7:
- Fix alloc error path spotted by Daniel Borkmann
- Did stress test adding+removing CPUs from the map concurrently
- Fixed refcnt issue on cpu_map_entry, kthread started too soon
- Make sure packets are flushed during tear-down, involved use of
rcu_barrier() and kthread_run only exit after queue is empty
- Fix alloc error path in __cpu_map_entry_alloc() for ptr_ring
V8:
- Nitpicking comments and gramma by Edward Cree
- Fix missing semi-colon introduced in V7 due to rebasing
- Move struct bpf_cpu_map_entry members cpu+map_id to tracepoint patch
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Saeed Mahameed says:
====================
mlx5-updates-2017-10-11: IPoIB Multi Pkey support
This series provides the support for IPoIB Multi Pkey.
InfiniBand Pkeys are the equivalent of Ethernet vlans.
Currently IPoIB device driver supports only default Pkey and IPoIB Pkey child
interfaces are not supported with IPoIB offloads mode, this series will add
the support for that by allowing creating mlx5 multiple IPoIB netdevices with
a non-default Pkey.
mlx5 IPoIB Pkey child interface is smaller version of mlx5i IPoIB interfaces and shares
most of its resources with the parent IPoIB interface, namely RX steering and ring
queue resources.
The only mlx5 resources a child Pkey interface will be creating are the TX rings,
since they should be assigned to a specific Pkey.
mlx5i Pkey netdev is implemented via new mlx5e netdev profile implemented in
mlx5/core/ipoib/ipoib_vlan.c.
The series starts with a refactoring of mlx5e PTP and mlx5 clock implementation
to move the code to be part of mlx5 core rather than mlx5e netdevice, in order to
make mlx5 clock and PTP registration part of the core to be shared with mlx5e
master Ethernet netdev/IPoIB parent netdev and mlx5_ib in the near future.
Add the support for attaching multiple underlay QPs for the different Pkeys
in mlx5 core RX steering.
Add Pkey index to rdma_netdev to add the ability to set PKEY index to lower
IPoIB offload netdev.
Use hash-table to map between DQPN (Destination QP number) to child netdev
for the IPoIB parent netdev to forward RX packets to the corresponding
child Pkey netdev, since the RX rings are shared.
The reset of the series adds the ipoib child Pkey: mlx5e netdev profile,
netdev nods implementation and minimal set of ethtool callbacks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
PTP code is moved to core section of mlx5 driver in order to share
it between ethernet and infiniband. This movement involves the following
changes:
- Change mlx5e_ prefix to be mlx5_
- Add clock structs to Core
- Add clock object to mlx5_core_dev
- Call Init/Uninit clock from core init/cleanup
- Rename mlx5e_tstamp to be mlx5_clock
Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Eitan Rabin <rabin@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
|
When using 'ethtool -L' on a VF to change number of requested queues
from PF, we shouldn't trust the VF to reset itself after making the
request. Doing it that way opens the door for a potentially malicious
VF to do nasty things to the PF which should never be the case.
This makes it such that after VF makes a successful request, PF will
then reset the VF to institute required changes. Only if the request
fails will PF send a message back to VF letting it know the request was
unsuccessful.
Testing-hints:
There should be no real functional changes. This is simply hardening
against a potentially malicious VF.
Signed-off-by: Alan Brady <alan.brady@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
|
|
Some of Broadcom's PHYs run by default in slave mode with Automatic
Slave/Master configuration disabled. It stops them from working properly
with some devices.
So far it has been verified for BCM54210E and BCM50212E which don't
work well with Intel's I217-LM and I218-LM:
http://ark.intel.com/products/60019/Intel-Ethernet-Connection-I217-LM
http://ark.intel.com/products/71307/Intel-Ethernet-Connection-I218-LM
I was told there is massive ping loss.
This commit adds support for a new flag which can be set by an ethernet
driver to fixup PHY setup.
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
Work continues in various areas:
* port authorized event for 4-way-HS offload (Avi)
* enable MFP optional for such devices (Emmanuel)
* Kees's timer setup patch for mac80211 mesh
(the part that isn't trivially scripted)
* improve VLAN vs. TXQ handling (myself)
* load regulatory database as firmware file (myself)
* with various other small improvements and cleanups
I merged net-next once in the meantime to allow Kees's
timer setup patch to go in.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Verifier log buffer can be quite large (up to 16MB currently).
As Eric Dumazet points out if we allow multiple verification
requests to proceed simultaneously, malicious user may use the
verifier as a way of allocating large amounts of unswappable
memory to OOM the host.
Switch to a strategy of allocating a smaller buffer (1024B)
and writing it out into the user buffer after every print.
While at it remove the old BUG_ON().
This is in preparation of the global verifier lock removal.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The biggest piece of global state protected by the verifier lock
is the verifier_log. Move that log to struct bpf_verifier_env.
struct bpf_verifier_env has to be passed now to all invocations
of verbose().
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Put the loose log_* variables into a structure. This will make
it simpler to remove the global verifier state in following patches.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Switch the DO_ONCE() macro from the deprecated jump label API to the new
one. The new one is more readable, and for DO_ONCE() it also makes the
generated code more icache-friendly: now the one-time initialization
code is placed out-of-line at the jump target, rather than at the inline
fallthrough case.
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
Pull networking fixes from David Miller:
1) Fix object leak on IPSEC offload failure, from Steffen Klassert.
2) Fix range checks in ipset address range addition operations, from
Jozsef Kadlecsik.
3) Fix pernet ops unregistration order in ipset, from Florian Westphal.
4) Add missing netlink attribute policy for nl80211 packet pattern
attrs, from Peng Xu.
5) Fix PPP device destruction race, from Guillaume Nault.
6) Write marks get lost when BPF verifier processes R1=R2 register
assignments, causing incorrect liveness information and less state
pruning. Fix from Alexei Starovoitov.
7) Fix blockhole routes so that they are marked dead and therefore not
cached in sockets, otherwise IPSEC stops working. From Steffen
Klassert.
8) Fix broadcast handling of UDP socket early demux, from Paolo Abeni.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (37 commits)
cdc_ether: flag the u-blox TOBY-L2 and SARA-U2 as wwan
net: thunderx: mark expected switch fall-throughs in nicvf_main()
udp: fix bcast packet reception
netlink: do not set cb_running if dump's start() errs
ipv4: Fix traffic triggered IPsec connections.
ipv6: Fix traffic triggered IPsec connections.
ixgbe: incorrect XDP ring accounting in ethtool tx_frame param
net: ixgbe: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
Revert commit 1a8b6d76dc5b ("net:add one common config...")
ixgbe: fix masking of bits read from IXGBE_VXLANCTRL register
ixgbe: Return error when getting PHY address if PHY access is not supported
netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'
netfilter: SYNPROXY: skip non-tcp packet in {ipv4, ipv6}_synproxy_hook
tipc: Unclone message at secondary destination lookup
tipc: correct initialization of skb list
gso: fix payload length when gso_size is zero
mlxsw: spectrum_router: Avoid expensive lookup during route removal
bpf: fix liveness marking
doc: Fix typo "8023.ad" in bonding documentation
ipv6: fix net.ipv6.conf.all.accept_dad behaviour for real
...
|
|
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes for your net tree,
they are:
1) Fix packet drops due to incorrect ECN handling in IPVS, from Vadim
Fedorenko.
2) Fix splat with mark restoration in xt_socket with non-full-sock,
patch from Subash Abhinov Kasiviswanathan.
3) ipset bogusly bails out when adding IPv4 range containing more than
2^31 addresses, from Jozsef Kadlecsik.
4) Incorrect pernet unregistration order in ipset, from Florian Westphal.
5) Races between dump and swap in ipset results in BUG_ON splats, from
Ross Lagerwall.
6) Fix chain renames in nf_tables, from JingPiao Chen.
7) Fix race in pernet codepath with ebtables table registration, from
Artem Savkov.
8) Memory leak in error path in set name allocation in nf_tables, patch
from Arvind Yadav.
9) Don't dump chain counters if they are not available, this fixes a
crash when listing the ruleset.
10) Fix out of bound memory read in strlcpy() in x_tables compat code,
from Eric Dumazet.
11) Make sure we only process TCP packets in SYNPROXY hooks, patch from
Lin Zhang.
12) Cannot load rules incrementally anymore after xt_bpf with pinned
objects, added in revision 1. From Shmulik Ladkani.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
For iWARP unaligned MPA flow, a slowpath event of flushing an
MPA connection that entered an unaligned state is required.
The flush ramrod is received on the ll2 queue, and a pre-registered
callback function is called to handle the flush event.
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The option of sending a packet on the ll2 and dropping it exists in
hardware and was not used until now, thus not exposed.
The iWARP unaligned MPA flow requires this functionality for
flushing the tx queue.
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When more than one ll2 queue is opened ( that is not an OOO queue )
ll2 code does not have enough information to determine whether
the queue is the main one or not, so a new field is added to the
acquire input data to expose the control of determining whether
the queue is the main queue or a secondary queue.
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add an access function that, given a bridge netdevice, returns whether the
bridge device is currently an mrouter or not. The function uses the already
existing br_multicast_is_router function to check that.
This function is needed in order to allow ports that join an already
existing bridge to know the current mrouter state of the bridge device.
Together with the bridge device mrouter ports switchdev notifications, it
is possible to have full offloading of the semantics of the bridge device
mcast router state.
Due to the fact that the bridge multicast router status can change in
packet RX path, take the multicast_router bridge spinlock to protect the
read.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Nogah Frankel <nogahf@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit 2c16d6033264 ("netfilter: xt_bpf: support ebpf") introduced
support for attaching an eBPF object by an fd, with the
'bpf_mt_check_v1' ABI expecting the '.fd' to be specified upon each
IPT_SO_SET_REPLACE call.
However this breaks subsequent iptables calls:
# iptables -A INPUT -m bpf --object-pinned /sys/fs/bpf/xxx -j ACCEPT
# iptables -A INPUT -s 5.6.7.8 -j ACCEPT
iptables: Invalid argument. Run `dmesg' for more information.
That's because iptables works by loading existing rules using
IPT_SO_GET_ENTRIES to userspace, then issuing IPT_SO_SET_REPLACE with
the replacement set.
However, the loaded 'xt_bpf_info_v1' has an arbitrary '.fd' number
(from the initial "iptables -m bpf" invocation) - so when 2nd invocation
occurs, userspace passes a bogus fd number, which leads to
'bpf_mt_check_v1' to fail.
One suggested solution [1] was to hack iptables userspace, to perform a
"entries fixup" immediatley after IPT_SO_GET_ENTRIES, by opening a new,
process-local fd per every 'xt_bpf_info_v1' entry seen.
However, in [2] both Pablo Neira Ayuso and Willem de Bruijn suggested to
depricate the xt_bpf_info_v1 ABI dealing with pinned ebpf objects.
This fix changes the XT_BPF_MODE_FD_PINNED behavior to ignore the given
'.fd' and instead perform an in-kernel lookup for the bpf object given
the provided '.path'.
It also defines an alias for the XT_BPF_MODE_FD_PINNED mode, named
XT_BPF_MODE_PATH_PINNED, to better reflect the fact that the user is
expected to provide the path of the pinned object.
Existing XT_BPF_MODE_FD_ELF behavior (non-pinned fd mode) is preserved.
References: [1] https://marc.info/?l=netfilter-devel&m=150564724607440&w=2
[2] https://marc.info/?l=netfilter-devel&m=150575727129880&w=2
Reported-by: Rafael Buchbinder <rafi@rbk.ms>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
This patch adds a new bridge port flag BR_NEIGH_SUPPRESS to
suppress arp and nd flood on bridge ports. It implements
rfc7432, section 10.
https://tools.ietf.org/html/rfc7432#section-10
for ethernet VPN deployments. It is similar to the existing
BR_PROXYARP* flags but has a few semantic differences to conform
to EVPN standard. Unlike the existing flags, this new flag suppresses
flood of all neigh discovery packets (arp and nd) to tunnel ports.
Supports both vlan filtering and non-vlan filtering bridges.
In case of EVPN, it is mainly used to avoid flooding
of arp and nd packets to tunnel ports like vxlan.
This patch adds netlink and sysfs support to set this bridge port
flag.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Instead of u8, use char for prog and map name. It can avoid the
userspace tool getting compiler's signess warning. The
bpf_prog_aux, bpf_map, bpf_attr, bpf_prog_info and
bpf_map_info are changed.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: Lin Zhang <xiaolou4617@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch does not impact existing functionalities.
It contains the changes in perf event area needed for
subsequent bpf_perf_event_read_value and
bpf_perf_prog_read_value helpers.
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Pull MMC fixes from Ulf Hansson:
"MMC core:
- Fix driver strength selection when selecting hs400es
- Delete bounce buffer handling:
This change fixes a problem related to how bounce buffers are being
allocated. However, instead of trying to fix that, let's just
remove the mmc bounce buffer code altogether, as it has practically
no use.
MMC host:
- meson-gx: A couple of fixes related to clock/phase/tuning
- sdhci-xenon: Fix clock resource by adding an optional bus clock"
* tag 'mmc-v4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: sdhci-xenon: Fix clock resource by adding an optional bus clock
mmc: meson-gx: include tx phase in the tuning process
mmc: meson-gx: fix rx phase reset
mmc: meson-gx: make sure the clock is rounded down
mmc: Delete bounce buffer handling
mmc: core: add driver strength selection when selecting hs400es
|
|
Geeralize private netem_rb_to_skb()
TCP rtx queue will soon be converted to rb-tree,
so we will need skb_rbtree_walk() helpers.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull watchddog clean-up and fixes from Thomas Gleixner:
"The watchdog (hard/softlockup detector) code is pretty much broken in
its current state. The patch series addresses this by removing all
duct tape and refactoring it into a workable state.
The reasons why I ask for inclusion that late in the cycle are:
1) The code causes lockdep splats vs. hotplug locking which get
reported over and over. Unfortunately there is no easy fix.
2) The risk of breakage is minimal because it's already broken
3) As 4.14 is a long term stable kernel, I prefer to have working
watchdog code in that and the lockdep issues resolved. I wouldn't
ask you to pull if 4.14 wouldn't be a LTS kernel or if the
solution would be easy to backport.
4) The series was around before the merge window opened, but then got
delayed due to the UP failure caused by the for_each_cpu()
surprise which we discussed recently.
Changes vs. V1:
- Addressed your review points
- Addressed the warning in the powerpc code which was discovered late
- Changed two function names which made sense up to a certain point
in the series. Now they match what they do in the end.
- Fixed a 'unused variable' warning, which got not detected by the
intel robot. I triggered it when trying all possible related config
combinations manually. Randconfig testing seems not random enough.
The changes have been tested by and reviewed by Don Zickus and tested
and acked by Micheal Ellerman for powerpc"
* 'core-watchdog-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
watchdog/core: Put softlockup_threads_initialized under ifdef guard
watchdog/core: Rename some softlockup_* functions
powerpc/watchdog: Make use of watchdog_nmi_probe()
watchdog/core, powerpc: Lock cpus across reconfiguration
watchdog/core, powerpc: Replace watchdog_nmi_reconfigure()
watchdog/hardlockup/perf: Fix spelling mistake: "permanetely" -> "permanently"
watchdog/hardlockup/perf: Cure UP damage
watchdog/hardlockup: Clean up hotplug locking mess
watchdog/hardlockup/perf: Simplify deferred event destroy
watchdog/hardlockup/perf: Use new perf CPU enable mechanism
watchdog/hardlockup/perf: Implement CPU enable replacement
watchdog/hardlockup/perf: Implement init time detection of perf
watchdog/hardlockup/perf: Implement init time perf validation
watchdog/core: Get rid of the racy update loop
watchdog/core, powerpc: Make watchdog_nmi_reconfigure() two stage
watchdog/sysctl: Clean up sysctl variable name space
watchdog/sysctl: Get rid of the #ifdeffery
watchdog/core: Clean up header mess
watchdog/core: Further simplify sysctl handling
watchdog/core: Get rid of the thread teardown/setup dance
...
|
|
Merging this brings in the timer_setup() change, which allows
me to apply Kees's mac80211 changes for it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
This patch adds a new queue (list) that tracks the sent but not yet
acked or SACKed skbs for a TCP connection. The list is chronologically
ordered by skb->skb_mstamp (the head is the oldest sent skb).
This list will be used to optimize TCP Rack recovery, which checks
an skb's timestamp to judge if it has been lost and needs to be
retransmitted. Since TCP write queue is ordered by sequence instead
of sent time, RACK has to scan over the write queue to catch all
eligible packets to detect lost retransmission, and iterates through
SACKed skbs repeatedly.
Special cares for rare events:
1. TCP repair fakes skb transmission so the send queue needs adjusted
2. SACK reneging would require re-inserting SACKed skbs into the
send queue. For now I believe it's not worth the complexity to
make RACK work perfectly on SACK reneging, so we do nothing here.
3. Fast Open: currently for non-TFO, send-queue correctly queues
the pure SYN packet. For TFO which queues a pure SYN and
then a data packet, send-queue only queues the data packet but
not the pure SYN due to the structure of TFO code. This is okay
because the SYN receiver would never respond with a SACK on a
missing SYN (i.e. SYN is never fast-retransmitted by SACK/RACK).
In order to not grow sk_buff, we use an union for the new list and
_skb_refdst/destructor fields. This is a bit complicated because
we need to make sure _skb_refdst and destructor are properly zeroed
before skb is cloned/copied at transmit, and before being freed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Just simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pull networking fixes from David Miller:
1) Check iwlwifi 9000 reorder buffer out-of-space condition properly,
from Sara Sharon.
2) Fix RCU splat in qualcomm rmnet driver, from Subash Abhinov
Kasiviswanathan.
3) Fix session and tunnel release races in l2tp, from Guillaume Nault
and Sabrina Dubroca.
4) Fix endian bug in sctp_diag_dump(), from Dan Carpenter.
5) Several mlx5 driver fixes from the Mellanox folks (max flow counters
cap check, invalid memory access in IPoIB support, etc.)
6) tun_get_user() should bail if skb->len is zero, from Alexander
Potapenko.
7) Fix RCU lookups in inetpeer, from Eric Dumazet.
8) Fix locking in packet_do_bund().
9) Handle cb->start() error properly in netlink dump code, from Jason
A. Donenfeld.
10) Handle multicast properly in UDP socket early demux code. From Paolo
Abeni.
11) Several erspan bug fixes in ip_gre, from Xin Long.
12) Fix use-after-free in socket filter code, in order to handle the
fact that listener lock is no longer taken during the three-way TCP
handshake. From Eric Dumazet.
13) Fix infoleak in RTM_GETSTATS, from Nikolay Aleksandrov.
14) Fix tail call generation in x86-64 BPF JIT, from Alexei Starovoitov.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (77 commits)
net: 8021q: skip packets if the vlan is down
bpf: fix bpf_tail_call() x64 JIT
net: stmmac: dwmac-rk: Add RK3128 GMAC support
rndis_host: support Novatel Verizon USB730L
net: rtnetlink: fix info leak in RTM_GETSTATS call
socket, bpf: fix possible use after free
mlxsw: spectrum_router: Track RIF of IPIP next hops
mlxsw: spectrum_router: Move VRF refcounting
net: hns3: Fix an error handling path in 'hclge_rss_init_hw()'
net: mvpp2: Fix clock resource by adding an optional bus clock
r8152: add Linksys USB3GIGV1 id
l2tp: fix l2tp_eth module loading
ip_gre: erspan device should keep dst
ip_gre: set tunnel hlen properly in erspan_tunnel_init
ip_gre: check packet length and mtu correctly in erspan_xmit
ip_gre: get key from session_id correctly in erspan_rcv
tipc: use only positive error codes in messages
ppp: fix __percpu annotation
udp: perform source validation for mcast early demux
IPv4: early demux can return an error code
...
|
|
Add extack arg to netdev_upper_dev_link and netdev_master_upper_dev_link
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pass extack to do_set_master and down to ndo_add_slave
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add netlink_ext_ack to netdev_notifier_info to allow notifier
handlers to return errors to userspace.
Clean up the initialization in dev.c such that extack is easily
added in subsequent patches where relevant. Specifically, remove
the init call in call_netdevice_notifiers_info and have callers
initalize on stack when info is declared.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
x-netns interfaces are bound to two netns: the link netns and the upper
netns. Usually, this kind of interfaces is created in the link netns and
then moved to the upper netns. At the end, the interface is visible only
in the upper netns. The link nsid is advertised via netlink in the upper
netns, thus the user always knows where is the link part.
There is no such mechanism in the link netns. When the interface is moved
to another netns, the user cannot "follow" it.
This patch adds a new netlink attribute which helps to follow an interface
which moves to another netns. When the interface is unregistered, the new
nsid is advertised. If the interface is a x-netns interface (ie
rtnl_link_ops->get_link_net is defined), the nsid is allocated if needed.
CC: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
introduce BPF_PROG_QUERY command to retrieve a set of either
attached programs to given cgroup or a set of effective programs
that will execute for events within a cgroup
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
for cgroup bits
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
introduce BPF_F_ALLOW_MULTI flag that can be used to attach multiple
bpf programs to a cgroup.
The difference between three possible flags for BPF_PROG_ATTACH command:
- NONE(default): No further bpf programs allowed in the subtree.
- BPF_F_ALLOW_OVERRIDE: If a sub-cgroup installs some bpf program,
the program in this cgroup yields to sub-cgroup program.
- BPF_F_ALLOW_MULTI: If a sub-cgroup installs some bpf program,
that cgroup program gets run in addition to the program in this cgroup.
NONE and BPF_F_ALLOW_OVERRIDE existed before. This patch doesn't
change their behavior. It only clarifies the semantics in relation
to new flag.
Only one program is allowed to be attached to a cgroup with
NONE or BPF_F_ALLOW_OVERRIDE flag.
Multiple programs are allowed to be attached to a cgroup with
BPF_F_ALLOW_MULTI flag. They are executed in FIFO order
(those that were attached first, run first)
The programs of sub-cgroup are executed first, then programs of
this cgroup and then programs of parent cgroup.
All eligible programs are executed regardless of return code from
earlier programs.
To allow efficient execution of multiple programs attached to a cgroup
and to avoid penalizing cgroups without any programs attached
introduce 'struct bpf_prog_array' which is RCU protected array
of pointers to bpf programs.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
for cgroup bits
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The rework of the core hotplug code triggers the WARN_ON in start_wd_cpu()
on powerpc because it is called multiple times for the boot CPU.
The first call is via:
start_wd_on_cpu+0x80/0x2f0
watchdog_nmi_reconfigure+0x124/0x170
softlockup_reconfigure_threads+0x110/0x130
lockup_detector_init+0xbc/0xe0
kernel_init_freeable+0x18c/0x37c
kernel_init+0x2c/0x160
ret_from_kernel_thread+0x5c/0xbc
And then again via the CPU hotplug registration:
start_wd_on_cpu+0x80/0x2f0
cpuhp_invoke_callback+0x194/0x620
cpuhp_thread_fun+0x7c/0x1b0
smpboot_thread_fn+0x290/0x2a0
kthread+0x168/0x1b0
ret_from_kernel_thread+0x5c/0xbc
This can be avoided by setting up the cpu hotplug state with nocalls and
move the initialization to the watchdog_nmi_probe() function. That
initializes the hotplug callbacks without invoking the callback and the
following core initialization function then configures the watchdog for the
online CPUs (in this case CPU0) via softlockup_reconfigure_threads().
Reported-and-tested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: linuxppc-dev@lists.ozlabs.org
|
|
The recent cleanup of the watchdog code split watchdog_nmi_reconfigure()
into two stages. One to stop the NMI and one to restart it after
reconfiguration. That was done by adding a boolean 'run' argument to the
code, which is functionally correct but not necessarily a piece of art.
Replace it by two explicit functions: watchdog_nmi_stop() and
watchdog_nmi_start().
Fixes: 6592ad2fcc8f ("watchdog/core, powerpc: Make watchdog_nmi_reconfigure() two stage")
Requested-by: Linus 'Nursing his pet-peeve' Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Thomas 'Mopping up garbage' Gleixner <tglx@linutronix.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1710021957480.2114@nanos
|
|
In may, Steven sent a patch deleting the bounce buffer handling
and the CONFIG_MMC_BLOCK_BOUNCE option.
I chose the less invasive path of making it a runtime config
option, and we merged that successfully for kernel v4.12.
The code is however just standing in the way and taking up
space for seemingly no gain on any systems in wide use today.
Pierre says the code was there to improve speed on TI SDHCI
controllers on certain HP laptops and possibly some Ricoh
controllers as well. Early SDHCI controllers lacked the
scatter-gather feature, which made software bounce buffers
a significant speed boost.
We are clearly talking about the list of SDHCI PCI-based
MMC/SD card readers found in the pci_ids[] list in
drivers/mmc/host/sdhci-pci-core.c.
The TI SDHCI derivative is not supported by the upstream
kernel. This leaves the Ricoh.
What we can however notice is that the x86 defconfigs in the
kernel did not enable CONFIG_MMC_BLOCK_BOUNCE option, which
means that any such laptop would have to have a custom
configured kernel to actually take advantage of this
bounce buffer speed-up. It simply seems like there was
a speed optimization for the Ricoh controllers that noone
was using. (I have not checked the distro defconfigs but
I am pretty sure the situation is the same there.)
Bounce buffers increased performance on the OMAP HSMMC
at one point, and was part of the original submission in
commit a45c6cb81647 ("[ARM] 5369/1: omap mmc: Add new
omap hsmmc controller for 2430 and 34xx, v3")
This optimization was removed in
commit 0ccd76d4c236 ("omap_hsmmc: Implement scatter-gather
emulation")
which found that scatter-gather emulation provided even
better performance.
The same was introduced for SDHCI in
commit 2134a922c6e7 ("sdhci: scatter-gather (ADMA) support")
I am pretty positively convinced that software
scatter-gather emulation will do for any host controller what
the bounce buffers were doing. Essentially, the bounce buffer
was a reimplementation of software scatter-gather-emulation in
the MMC subsystem, and it should be done away with.
Cc: Pierre Ossman <pierre@ossman.eu>
Cc: Juha Yrjola <juha.yrjola@solidboot.com>
Cc: Steven J. Hill <Steven.Hill@cavium.com>
Cc: Shawn Lin <shawn.lin@rock-chips.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Suggested-by: Steven J. Hill <Steven.Hill@cavium.com>
Suggested-by: Shawn Lin <shawn.lin@rock-chips.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
|
|
Before commit 9c5d760b8d22 ("mm: split gfp_mask and mapping flags into
separate fields") the private_* fields of struct adrress_space were
grouped together and using "ditto" in comments describing the last
fields was correct.
With introduction of gpf_mask between private_lock and private_list
"ditto" references the wrong description.
Fix it by using the elaborate description.
Link: http://lkml.kernel.org/r/1507009987-8746-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
inline function
pfn_to_section_nr() and section_nr_to_pfn() are defined as macro.
pfn_to_section_nr() has no issue even if it is defined as macro. But
section_nr_to_pfn() has overflow issue if sec is defined as int.
section_nr_to_pfn() just shifts sec by PFN_SECTION_SHIFT. If sec is
defined as unsigned long, section_nr_to_pfn() returns pfn as 64 bit value.
But if sec is defined as int, section_nr_to_pfn() returns pfn as 32 bit
value.
__remove_section() calculates start_pfn using section_nr_to_pfn() and
scn_nr defined as int. So if hot-removed memory address is over 16TB,
overflow issue occurs and section_nr_to_pfn() does not calculate correct
pfn.
To make callers use proper arg, the patch changes the macros to inline
functions.
Fixes: 815121d2b5cd ("memory_hotplug: clear zone when removing the memory")
Link: http://lkml.kernel.org/r/e643a387-e573-6bbf-d418-c60c8ee3d15e@gmail.com
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
I do not see anything that restricts this macro to 32 bit width.
Link: http://lkml.kernel.org/r/1505921975-23379-1-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "exec: binfmt_misc: fix use-after-free, kill
iname[BINPRM_BUF_SIZE]".
It looks like this code was always wrong, then commit 948b701a607f
("binfmt_misc: add persistent opened binary handler for containers")
added more problems.
This patch (of 6):
load_script() can simply use i_name instead, it points into bprm->buf[]
and nobody can change this memory until we call prepare_binprm().
The only complication is that we need to also change the signature of
bprm_change_interp() but this change looks good too.
While at it, do whitespace/style cleanups.
NOTE: the real motivation for this change is that people want to
increase BINPRM_BUF_SIZE, we need to change load_misc_binary() too but
this looks more complicated because afaics it is very buggy.
Link: http://lkml.kernel.org/r/20170918163446.GA26793@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Travis Gummels <tgummels@redhat.com>
Cc: Ben Woodard <woodard@redhat.com>
Cc: Jim Foraker <foraker1@llnl.gov>
Cc: <tdhooge@llnl.gov>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Drop the global lru lock in isolate callback before calling
zap_page_range which calls cond_resched, and re-acquire the global lru
lock before returning. Also change return code to LRU_REMOVED_RETRY.
Use mmput_async when fail to acquire mmap sem in an atomic context.
Fix "BUG: sleeping function called from invalid context"
errors when CONFIG_DEBUG_ATOMIC_SLEEP is enabled.
Also restore mmput_async, which was initially introduced in commit
ec8d7c14ea14 ("mm, oom_reaper: do not mmput synchronously from the oom
reaper context"), and was removed in commit 212925802454 ("mm: oom: let
oom_reap_task and exit_mmap run concurrently").
Link: http://lkml.kernel.org/r/20170914182231.90908-1-sherryy@android.com
Fixes: f2517eb76f1f2 ("android: binder: Add global lru shrinker to binder")
Signed-off-by: Sherry Yang <sherryy@android.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reported-by: Kyle Yan <kyan@codeaurora.org>
Acked-by: Arve Hjønnevåg <arve@android.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Martijn Coenen <maco@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Riley Andrews <riandrews@android.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hoeun Ryu <hoeun.ryu@gmail.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Andrea has noticed that the oom_reaper doesn't invalidate the range via
mmu notifiers (mmu_notifier_invalidate_range_start/end) and that can
corrupt the memory of the kvm guest for example.
tlb_flush_mmu_tlbonly already invokes mmu notifiers but that is not
sufficient as per Andrea:
"mmu_notifier_invalidate_range cannot be used in replacement of
mmu_notifier_invalidate_range_start/end. For KVM
mmu_notifier_invalidate_range is a noop and rightfully so. A MMU
notifier implementation has to implement either ->invalidate_range
method or the invalidate_range_start/end methods, not both. And if you
implement invalidate_range_start/end like KVM is forced to do, calling
mmu_notifier_invalidate_range in common code is a noop for KVM.
For those MMU notifiers that can get away only implementing
->invalidate_range, the ->invalidate_range is implicitly called by
mmu_notifier_invalidate_range_end(). And only those secondary MMUs
that share the same pagetable with the primary MMU (like AMD iommuv2)
can get away only implementing ->invalidate_range"
As the callback is allowed to sleep and the implementation is out of
hand of the MM it is safer to simply bail out if there is an mmu
notifier registered. In order to not fail too early make the
mm_has_notifiers check under the oom_lock and have a little nap before
failing to give the current oom victim some more time to exit.
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20170913113427.2291-1-mhocko@kernel.org
Fixes: aac453635549 ("mm, oom: introduce oom reaper")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There's a typo in recent change of VM_MPX definition. We want it to be
VM_HIGH_ARCH_4, not VM_HIGH_ARCH_BIT_4.
This bug does cause visible regressions. In arch_vma_name the vmflags
are tested against VM_MPX. With the incorrect value of VM_MPX, a number
of vmas (such as the stack) test positive and end up being marked as
"[mpx]" in /proc/N/maps instead of their correct names.
This confuses tools like rr which expect to be able to find familiar
vmas.
Fixes: df3735c5b40f ("x86,mpx: make mpx depend on x86-64 to free up VMA flag")
Link: http://lkml.kernel.org/r/20170918140253.36856-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kyle Huey <me@kylehuey.com>
Cc: <stable@vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Device alias can be set by either rtnetlink (rtnl is held) or sysfs.
rtnetlink hold the rtnl mutex, sysfs acquires it for this purpose.
Add an extra mutex for it and use rcu to protect concurrent accesses.
This allows the sysfs path to not take rtnl and would later allow
to not hold it when dumping ifalias.
Based on suggestion from Eric Dumazet.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In order to allow the ipmr module to do partial multicast forwarding
according to the device parent ID, add the device parent ID field to the
VIF struct. This way, the forwarding path can use the parent ID field
without invoking switchdev calls, which requires the RTNL lock.
When a new VIF is added, set the device parent ID field in it by invoking
the switchdev_port_attr_get call.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Similarly to the offload_fwd_mark field, the offload_mr_fwd_mark field is
used to allow partial offloading of MFC multicast routes.
Switchdev drivers can offload MFC multicast routes to the hardware by
registering to the FIB notification chain. When one of the route output
interfaces is not offload-able, i.e. has different parent ID, the route
cannot be fully offloaded by the hardware. Examples to non-offload-able
devices are a management NIC, dummy device, pimreg device, etc.
Similar problem exists in the bridge module, as one bridge can hold
interfaces with different parent IDs. At the bridge, the problem is solved
by the offload_fwd_mark skb field.
Currently, when a route cannot go through full offload, the only solution
for a switchdev driver is not to offload it at all and let the packet go
through slow path.
Using the offload_mr_fwd_mark field, a driver can indicate that a packet
was already forwarded by hardware to all the devices with the same parent
ID as the input device. Further patches in this patch-set are going to
enhance ipmr to skip multicast forwarding to devices with the same parent
ID if a packets is marked with that field.
The reason why the already existing "offload_fwd_mark" bit cannot be used
is that a switchdev driver would want to make the distinction between a
packet that has already gone through L2 forwarding but did not go through
multicast forwarding, and a packet that has already gone through both L2
and multicast forwarding.
For example: when a packet is ingressing from a switchport enslaved to a
bridge, which is configured with multicast forwarding, the following
scenarios are possible:
- The packet can be trapped to the CPU due to exception while multicast
forwarding (for example, MTU error). In that case, it had already gone
through L2 forwarding in the hardware, thus A switchdev driver would
want to set the skb->offload_fwd_mark and not the
skb->offload_mr_fwd_mark.
- The packet can also be trapped due to a pimreg/dummy device used as one
of the output interfaces. In that case, it can go through both L2 and
(partial) multicast forwarding inside the hardware, thus a switchdev
driver would want to set both the skb->offload_fwd_mark and
skb->offload_mr_fwd_mark.
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellaox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|