author: Linus Torvalds <torvalds@linux-foundation.org> 2021-06-30 15:51:09 -0700
committer: Linus Torvalds <torvalds@linux-foundation.org> 2021-06-30 15:51:09 -0700
commit: dbe69e43372212527abf48609aba7fc39a6daa27
tree: 96cfafdf70f5325ceeac1054daf7deca339c9730 /Documentation/networking
parent: a6eaf3850cb171c328a8b0db6d3c79286a1eba9d
parent: b6df00789e2831fff7a2c65aa7164b2a4dcbe599
Merge tag 'net-next-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
 "Core:

   - BPF:
      - add syscall program type and libbpf support for generating
        instructions and bindings for in-kernel BPF loaders (BPF loaders
        for BPF), this is a stepping stone for signed BPF programs
      - infrastructure to migrate TCP child sockets from one listener to
        another in the same reuseport group/map to improve flexibility
        of service hand-off/restart
      - add broadcast support to XDP redirect

   - allow bypass of the lockless qdisc to improve performance (for
     pktgen: +23% with one thread, +44% with 2 threads)

   - add a simpler version of "DO_ONCE()" which does not require jump
     labels, intended for slow-path usage

   - virtio/vsock: introduce SOCK_SEQPACKET support

   - add getsockopt to retrieve netns cookie

   - ip: treat lowest address of an IPv4 subnet as ordinary unicast
     address allowing reclaiming of precious IPv4 addresses

   - ipv6: use prandom_u32() for ID generation

   - ip: add support for more flexible field selection for hashing
     across multi-path routes (w/ offload to mlxsw)

   - icmp: add support for extended RFC 8335 PROBE (ping)

   - seg6: add support for SRv6 End.DT46 behavior

   - mptcp:
      - DSS checksum support (RFC 8684) to detect middlebox meddling
      - support Connection-time 'C' flag
      - time stamping support

   - sctp: packetization Layer Path MTU Discovery (RFC 8899)

   - xfrm: speed up state addition with seq set

   - WiFi:
      - hidden AP discovery on 6 GHz and other HE 6 GHz improvements
      - aggregation handling improvements for some drivers
      - minstrel improvements for no-ack frames
      - deferred rate control for TXQs to improve reaction times
      - switch from round robin to virtual time-based airtime scheduler

   - add trace points:
      - tcp checksum errors
      - openvswitch - action execution, upcalls
      - socket errors via sk_error_report

  Device APIs:

   - devlink: add rate API for hierarchical control of max egress rate
     of virtual devices (VFs, SFs etc.)

   - don't require RCU read lock to be held around BPF hooks in NAPI
     context

   - page_pool: generic buffer recycling

  New hardware/drivers:

   - mobile:
      - iosm: PCIe Driver for Intel M.2 Modem
      - support for Qualcomm MSM8998 (ipa)

   - WiFi: Qualcomm QCN9074 and WCN6855 PCI devices

   - sparx5: Microchip SparX-5 family of Enterprise Ethernet switches

   - Mellanox BlueField Gigabit Ethernet (control NIC of the DPU)

   - NXP SJA1110 Automotive Ethernet 10-port switch

   - Qualcomm QCA8327 switch support (qca8k)

   - Mikrotik 10/25G NIC (atl1c)

  Driver changes:

   - ACPI support for some MDIO, MAC and PHY devices from Marvell and
     NXP (our first foray into MAC/PHY description via ACPI)

   - HW timestamping (PTP) support: bnxt_en, ice, sja1105, hns3, tja11xx

   - Mellanox/Nvidia NIC (mlx5)
      - NIC VF offload of L2 bridging
      - support IRQ distribution to Sub-functions

   - Marvell (prestera):
      - add flower and match all
      - devlink trap
      - link aggregation

   - Netronome (nfp): connection tracking offload

   - Intel 1GE (igc): add AF_XDP support

   - Marvell DPU (octeontx2): ingress ratelimit offload

   - Google vNIC (gve): new ring/descriptor format support

   - Qualcomm mobile (rmnet & ipa): inline checksum offload support

   - MediaTek WiFi (mt76)
      - mt7915 MSI support
      - mt7915 Tx status reporting
      - mt7915 thermal sensors support
      - mt7921 decapsulation offload
      - mt7921 enable runtime pm and deep sleep

   - Realtek WiFi (rtw88)
      - beacon filter support
      - Tx antenna path diversity support
      - firmware crash information via devcoredump

   - Qualcomm WiFi (wcn36xx)
      - Wake-on-WLAN support with magic packets and GTK rekeying

   - Micrel PHY (ksz886x/ksz8081): add cable test support"

* tag 'net-next-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2168 commits)
  tcp: change ICSK_CA_PRIV_SIZE definition
  tcp_yeah: check struct yeah size at compile time
  gve: DQO: Fix off by one in gve_rx_dqo()
  stmmac: intel: set PCI_D3hot in suspend
  stmmac: intel: Enable PHY WOL option in EHL
  net: stmmac: option to enable PHY WOL with PMT enabled
  net: say "local" instead of "static" addresses in ndo_dflt_fdb_{add,del}
  net: use netdev_info in ndo_dflt_fdb_{add,del}
  ptp: Set lookup cookie when creating a PTP PPS source.
  net: sock: add trace for socket errors
  net: sock: introduce sk_error_report
  net: dsa: replay the local bridge FDB entries pointing to the bridge dev too
  net: dsa: ensure during dsa_fdb_offload_notify that dev_hold and dev_put are on the same dev
  net: dsa: include fdb entries pointing to bridge in the host fdb list
  net: dsa: include bridge addresses which are local in the host fdb list
  net: dsa: sync static FDB entries on foreign interfaces to hardware
  net: dsa: install the host MDB and FDB entries in the master's RX filter
  net: dsa: reference count the FDB addresses at the cross-chip notifier level
  net: dsa: introduce a separate cross-chip notifier type for host FDBs
  net: dsa: reference count the MDB entries at the cross-chip notifier level
  ...
Diffstat (limited to 'Documentation/networking')
-rw-r--r-- Documentation/networking/af_xdp.rst | 32
-rw-r--r-- Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst | 126
-rw-r--r-- Documentation/networking/device_drivers/ethernet/amazon/ena.rst | 164
-rw-r--r-- Documentation/networking/device_drivers/ethernet/google/gve.rst | 53
-rw-r--r-- Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst | 88
-rw-r--r-- Documentation/networking/device_drivers/index.rst | 1
-rw-r--r-- Documentation/networking/device_drivers/wwan/index.rst | 18
-rw-r--r-- Documentation/networking/device_drivers/wwan/iosm.rst | 96
-rw-r--r-- Documentation/networking/devlink/devlink-port.rst | 35
-rw-r--r-- Documentation/networking/devlink/devlink-trap.rst | 1
-rw-r--r-- Documentation/networking/devlink/index.rst | 1
-rw-r--r-- Documentation/networking/devlink/netdevsim.rst | 26
-rw-r--r-- Documentation/networking/devlink/prestera.rst | 141
-rw-r--r-- Documentation/networking/dsa/configuration.rst | 68
-rw-r--r-- Documentation/networking/dsa/dsa.rst | 21
-rw-r--r-- Documentation/networking/dsa/sja1105.rst | 61
-rw-r--r-- Documentation/networking/ethtool-netlink.rst | 8
-rw-r--r-- Documentation/networking/ip-sysctl.rst | 95
-rw-r--r-- Documentation/networking/mptcp-sysctl.rst | 29
-rw-r--r-- Documentation/networking/nf_conntrack-sysctl.rst | 24
-rw-r--r-- Documentation/networking/phy.rst | 6
21 files changed, 954 insertions, 140 deletions
diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
index 2ccc5644cc98..42576880aa4a 100644
--- a/Documentation/networking/af_xdp.rst
+++ b/Documentation/networking/af_xdp.rst
@@ -290,19 +290,19 @@ round-robin example of distributing packets is shown below:
#define MAX_SOCKS 16
struct {
- __uint(type, BPF_MAP_TYPE_XSKMAP);
- __uint(max_entries, MAX_SOCKS);
- __uint(key_size, sizeof(int));
- __uint(value_size, sizeof(int));
+ __uint(type, BPF_MAP_TYPE_XSKMAP);
+ __uint(max_entries, MAX_SOCKS);
+ __uint(key_size, sizeof(int));
+ __uint(value_size, sizeof(int));
} xsks_map SEC(".maps");
static unsigned int rr;
SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
{
- rr = (rr + 1) & (MAX_SOCKS - 1);
+ rr = (rr + 1) & (MAX_SOCKS - 1);
- return bpf_redirect_map(&xsks_map, rr, XDP_DROP);
+ return bpf_redirect_map(&xsks_map, rr, XDP_DROP);
}
Note, that since there is only a single set of FILL and COMPLETION
@@ -379,7 +379,7 @@ would look like this for the TX path:
.. code-block:: c
if (xsk_ring_prod__needs_wakeup(&my_tx_ring))
- sendto(xsk_socket__fd(xsk_handle), NULL, 0, MSG_DONTWAIT, NULL, 0);
+ sendto(xsk_socket__fd(xsk_handle), NULL, 0, MSG_DONTWAIT, NULL, 0);
I.e., only use the syscall if the flag is set.
@@ -442,9 +442,9 @@ purposes. The supported statistics are shown below:
.. code-block:: c
struct xdp_statistics {
- __u64 rx_dropped; /* Dropped for reasons other than invalid desc */
- __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
- __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
+ __u64 rx_dropped; /* Dropped for reasons other than invalid desc */
+ __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
+ __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
};
XDP_OPTIONS getsockopt
@@ -483,15 +483,15 @@ like this:
.. code-block:: c
// struct xdp_rxtx_ring {
- // __u32 *producer;
- // __u32 *consumer;
- // struct xdp_desc *desc;
+ // __u32 *producer;
+ // __u32 *consumer;
+ // struct xdp_desc *desc;
// };
// struct xdp_umem_ring {
- // __u32 *producer;
- // __u32 *consumer;
- // __u64 *desc;
+ // __u32 *producer;
+ // __u32 *consumer;
+ // __u64 *desc;
// };
// typedef struct xdp_rxtx_ring RING;
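Review note on the round-robin hunk in af_xdp.rst: the expression `(rr + 1) & (MAX_SOCKS - 1)` only cycles through every slot when `MAX_SOCKS` is a power of two, as it is with the value 16 used there. The following is a minimal Python sketch of that property, not part of the patch; the helper names `next_slot` and `visited_slots` are hypothetical and exist only for illustration.

```python
MAX_SOCKS = 16  # same value as in the documented example; a power of two


def next_slot(rr: int, max_socks: int = MAX_SOCKS) -> int:
    """Advance a round-robin index using the mask form from the XDP example."""
    # Hypothetical guard: the mask trick is only a true modulo for powers of two.
    if max_socks <= 0 or max_socks & (max_socks - 1):
        raise ValueError("max_socks must be a power of two")
    return (rr + 1) & (max_socks - 1)


def visited_slots(max_socks: int) -> list:
    """Return the distinct slots reached in max_socks steps, without the guard."""
    rr, seen = 0, set()
    for _ in range(max_socks):
        rr = (rr + 1) & (max_socks - 1)  # unguarded, to expose the failure mode
        seen.add(rr)
    return sorted(seen)
```

With 16 sockets every slot is reached; with a non-power-of-two count such as 12 the mask skips most slots, so a plain modulo would be needed instead.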
diff --git a/Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst b/Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst
index 70643b58de05..4118384cf8eb 100644
--- a/Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst
+++ b/Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst
@@ -27,34 +27,136 @@ these MAP frames and send them to appropriate PDN's.
2. Packet format
================
-a. MAP packet (data / control)
+a. MAP packet v1 (data / control)
-MAP header has the same endianness of the IP packet.
+MAP header fields are in big endian format.
Packet format::
- Bit 0 1 2-7 8 - 15 16 - 31
+ Bit 0 1 2-7 8-15 16-31
Function Command / Data Reserved Pad Multiplexer ID Payload length
- Bit 32 - x
- Function Raw Bytes
+
+ Bit 32-x
+ Function Raw bytes
Command (1)/ Data (0) bit value is to indicate if the packet is a MAP command
-or data packet. Control packet is used for transport level flow control. Data
+or data packet. Command packet is used for transport level flow control. Data
packets are standard IP packets.
-Reserved bits are usually zeroed out and to be ignored by receiver.
+Reserved bits must be zero when sent and ignored when received.
-Padding is number of bytes to be added for 4 byte alignment if required by
-hardware.
+Padding is the number of bytes to be appended to the payload to
+ensure 4 byte alignment.
Multiplexer ID is to indicate the PDN on which data has to be sent.
Payload length includes the padding length but does not include MAP header
length.
-b. MAP packet (command specific)::
+b. MAP packet v4 (data / control)
+
+MAP header fields are in big endian format.
+
+Packet format::
+
+ Bit 0 1 2-7 8-15 16-31
+ Function Command / Data Reserved Pad Multiplexer ID Payload length
+
+ Bit 32-(x-33) (x-32)-x
+ Function Raw bytes Checksum offload header
+
+Command (1)/ Data (0) bit value is to indicate if the packet is a MAP command
+or data packet. Command packet is used for transport level flow control. Data
+packets are standard IP packets.
+
+Reserved bits must be zero when sent and ignored when received.
+
+Padding is the number of bytes to be appended to the payload to
+ensure 4 byte alignment.
+
+Multiplexer ID is to indicate the PDN on which data has to be sent.
+
+Payload length includes the padding length but does not include MAP header
+length.
+
+Checksum offload header has the information about the checksum processing done
+by the hardware. Checksum offload header fields are in big endian format.
+
+Packet format::
+
+ Bit 0-14 15 16-31
+ Function Reserved Valid Checksum start offset
+
+ Bit 32-47 48-63
+ Function Checksum length Checksum value
+
+Reserved bits must be zero when sent and ignored when received.
+
+Valid bit indicates whether the partial checksum is calculated and is valid.
+Set to 1 if it is valid. Set to 0 otherwise.
+
+Padding is the number of bytes to be appended to the payload to
+ensure 4 byte alignment.
+
+Checksum start offset indicates the offset in bytes from the beginning of the
+IP header from which the modem computed the checksum.
+
+Checksum length is the length in bytes, starting from CKSUM_START_OFFSET,
+over which the checksum is computed.
+
+Checksum value indicates the computed checksum.
+
+c. MAP packet v5 (data / control)
+
+MAP header fields are in big endian format.
+
+Packet format::
+
+ Bit 0 1 2-7 8-15 16-31
+ Function Command / Data Next header Pad Multiplexer ID Payload length
+
+ Bit 32-x
+ Function Raw bytes
+
+Command (1)/ Data (0) bit value is to indicate if the packet is a MAP command
+or data packet. Command packet is used for transport level flow control. Data
+packets are standard IP packets.
+
+Next header indicates the presence of another header; currently this is
+limited to the checksum header.
+
+Padding is the number of bytes to be appended to the payload to
+ensure 4 byte alignment.
+
+Multiplexer ID is to indicate the PDN on which data has to be sent.
+
+Payload length includes the padding length but does not include MAP header
+length.
+
+d. Checksum offload header v5
+
+Checksum offload header fields are in big endian format.
+
+ Bit 0 - 6 7 8-15 16-31
+ Function Header Type Next Header Checksum Valid Reserved
+
+Header Type indicates the type of header; this is usually set to CHECKSUM.
+
+Header types
+
+= ==========================================
+0 Reserved
+1 Reserved
+2 checksum header
+= ==========================================
+
+Checksum Valid is to indicate whether the header checksum is valid. Value of 1
+implies that checksum is calculated on this packet and is valid, value of 0
+indicates that the calculated packet checksum is invalid.
+
+Reserved bits must be zero when sent and ignored when received.
+
+e. MAP packet v1/v5 (command specific)::
- Bit 0 1 2-7 8 - 15 16 - 31
+ Bit 0 1 2-7 8 - 15 16 - 31
Function Command Reserved Pad Multiplexer ID Payload length
Bit 32 - 39 40 - 45 46 - 47 48 - 63
Function Command name Reserved Command Type Reserved
@@ -74,7 +176,7 @@ Command types
3 is for error during processing of commands
= ==========================================
-c. Aggregation
+f. Aggregation
Aggregation is multiple MAP packets (can be data or command) delivered to
rmnet in a single linear skb. rmnet will process the individual
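Review note on the rmnet.rst changes: the v1 MAP header layout and the aggregation rule can be illustrated with a short parser. This is a minimal Python sketch, not the driver's code. It assumes that "bit 0" in the tables is the most significant bit of the first byte (so the command flag is mask 0x80 and the pad length is the low six bits), and it handles v1 frames only, with no v4 checksum trailer or v5 next header. The names `MapHeader`, `parse_map_header` and `split_aggregate` are hypothetical.

```python
import struct
from typing import Iterator, NamedTuple, Tuple


class MapHeader(NamedTuple):
    is_command: bool   # Command (1) / Data (0) bit
    pad_len: int       # bytes appended to the payload for 4 byte alignment
    mux_id: int        # Multiplexer ID, selects the PDN
    payload_len: int   # includes padding, excludes the 4 byte MAP header


def parse_map_header(frame: bytes) -> MapHeader:
    """Decode the 4 byte big endian MAP v1 header at the start of frame."""
    if len(frame) < 4:
        raise ValueError("MAP header needs 4 bytes")
    flags, mux_id, payload_len = struct.unpack("!BBH", frame[:4])
    # Assumption: bit 0 of the table is the MSB of the first byte.
    return MapHeader(bool(flags & 0x80), flags & 0x3F, mux_id, payload_len)


def split_aggregate(buf: bytes) -> Iterator[Tuple[MapHeader, bytes]]:
    """Yield (header, payload without padding) for each MAP frame in buf."""
    offset = 0
    while offset < len(buf):
        hdr = parse_map_header(buf[offset:offset + 4])
        end = offset + 4 + hdr.payload_len
        if end > len(buf) or hdr.pad_len > hdr.payload_len:
            raise ValueError("truncated or inconsistent MAP frame")
        yield hdr, buf[offset + 4:end - hdr.pad_len]
        offset = end
```

Because the payload length already includes the padding, the walker can step from one frame to the next using only the 4 byte header, which is what makes aggregation in a single linear buffer workable.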
diff --git a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
index f8c6469f2bd2..01b2a69b0cb0 100644
--- a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
+++ b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
@@ -11,12 +11,12 @@ ENA is a networking interface designed to make good use of modern CPU
features and system architectures.
The ENA device exposes a lightweight management interface with a
-minimal set of memory mapped registers and extendable command set
+minimal set of memory mapped registers and extendible command set
through an Admin Queue.
The driver supports a range of ENA devices, is link-speed independent
-(i.e., the same driver is used for 10GbE, 25GbE, 40GbE, etc.), and has
-a negotiated and extendable feature set.
+(i.e., the same driver is used for 10GbE, 25GbE, 40GbE, etc), and has
+a negotiated and extendible feature set.
Some ENA devices support SR-IOV. This driver is used for both the
SR-IOV Physical Function (PF) and Virtual Function (VF) devices.
@@ -27,9 +27,9 @@ is advertised by the device via the Admin Queue), a dedicated MSI-X
interrupt vector per Tx/Rx queue pair, adaptive interrupt moderation,
and CPU cacheline optimized data placement.
-The ENA driver supports industry standard TCP/IP offload features such
-as checksum offload and TCP transmit segmentation offload (TSO).
-Receive-side scaling (RSS) is supported for multi-core scaling.
+The ENA driver supports industry standard TCP/IP offload features such as
+checksum offload. Receive-side scaling (RSS) is supported for multi-core
+scaling.
The ENA driver and its corresponding devices implement health
monitoring mechanisms such as watchdog, enabling the device and driver
@@ -38,22 +38,20 @@ debug logs.
Some of the ENA devices support a working mode called Low-latency
Queue (LLQ), which saves several more microseconds.
-
ENA Source Code Directory Structure
===================================
================= ======================================================
ena_com.[ch] Management communication layer. This layer is
- responsible for the handling all the management
- (admin) communication between the device and the
- driver.
+ responsible for handling all the management
+ (admin) communication between the device and the
+ driver.
ena_eth_com.[ch] Tx/Rx data path.
ena_admin_defs.h Definition of ENA management interface.
ena_eth_io_defs.h Definition of ENA data path interface.
ena_common_defs.h Common definitions for ena_com layer.
ena_regs_defs.h Definition of ENA PCI memory-mapped (MMIO) registers.
ena_netdev.[ch] Main Linux kernel driver.
-ena_syfsfs.[ch] Sysfs files.
ena_ethtool.c ethtool callbacks.
ena_pci_id_tbl.h Supported device IDs.
================= ======================================================
@@ -69,7 +67,7 @@ ENA management interface is exposed by means of:
- Asynchronous Event Notification Queue (AENQ)
ENA device MMIO Registers are accessed only during driver
-initialization and are not involved in further normal device
+initialization and are not used during further normal device
operation.
AQ is used for submitting management commands, and the
@@ -100,28 +98,27 @@ group may have multiple syndromes, as shown below
The events are:
- ==================== ===============
- Group Syndrome
- ==================== ===============
- Link state change **X**
- Fatal error **X**
- Notification Suspend traffic
- Notification Resume traffic
- Keep-Alive **X**
- ==================== ===============
+==================== ===============
+Group Syndrome
+==================== ===============
+Link state change **X**
+Fatal error **X**
+Notification Suspend traffic
+Notification Resume traffic
+Keep-Alive **X**
+==================== ===============
ACQ and AENQ share the same MSI-X vector.
-Keep-Alive is a special mechanism that allows monitoring of the
-device's health. The driver maintains a watchdog (WD) handler which,
-if fired, logs the current state and statistics then resets and
-restarts the ENA device and driver. A Keep-Alive event is delivered by
-the device every second. The driver re-arms the WD upon reception of a
-Keep-Alive event. A missed Keep-Alive event causes the WD handler to
-fire.
+Keep-Alive is a special mechanism that allows monitoring the device's health.
+A Keep-Alive event is delivered by the device every second.
+The driver maintains a watchdog (WD) handler which logs the current state and
+statistics. If the keep-alive events aren't delivered as expected the WD resets
+the device and the driver.
Data Path Interface
===================
+
I/O operations are based on Tx and Rx Submission Queues (Tx SQ and Rx
SQ correspondingly). Each SQ has a completion queue (CQ) associated
with it.
@@ -131,26 +128,24 @@ physical memory.
The ENA driver supports two Queue Operation modes for Tx SQs:
-- Regular mode
+- **Regular mode:**
+ In this mode the Tx SQs reside in the host's memory. The ENA
+ device fetches the ENA Tx descriptors and packet data from host
+ memory.
- * In this mode the Tx SQs reside in the host's memory. The ENA
- device fetches the ENA Tx descriptors and packet data from host
- memory.
+- **Low Latency Queue (LLQ) mode or "push-mode":**
+ In this mode the driver pushes the transmit descriptors and the
+ first 128 bytes of the packet directly to the ENA device memory
+ space. The rest of the packet payload is fetched by the
+ device. For this operation mode, the driver uses a dedicated PCI
+ device memory BAR, which is mapped with write-combine capability.
-- Low Latency Queue (LLQ) mode or "push-mode".
-
- * In this mode the driver pushes the transmit descriptors and the
- first 128 bytes of the packet directly to the ENA device memory
- space. The rest of the packet payload is fetched by the
- device. For this operation mode, the driver uses a dedicated PCI
- device memory BAR, which is mapped with write-combine capability.
+ **Note that** not all ENA devices support LLQ, and this feature is negotiated
+ with the device upon initialization. If the ENA device does not
+ support LLQ mode, the driver falls back to the regular mode.
The Rx SQs support only the regular mode.
-Note: Not all ENA devices support LLQ, and this feature is negotiated
- with the device upon initialization. If the ENA device does not
- support LLQ mode, the driver falls back to the regular mode.
-
The driver supports multi-queue for both Tx and Rx. This has various
benefits:
@@ -165,6 +160,7 @@ benefits:
Interrupt Modes
===============
+
The driver assigns a single MSI-X vector per queue pair (for both Tx
and Rx directions). The driver assigns an additional dedicated MSI-X vector
for management (for ACQ and AENQ).
@@ -190,20 +186,21 @@ unmasked by the driver after NAPI processing is complete.
Interrupt Moderation
====================
+
ENA driver and device can operate in conventional or adaptive interrupt
moderation mode.
-In conventional mode the driver instructs device to postpone interrupt
+**In conventional mode** the driver instructs device to postpone interrupt
posting according to static interrupt delay value. The interrupt delay
-value can be configured through ethtool(8). The following ethtool
-parameters are supported by the driver: tx-usecs, rx-usecs
+value can be configured through `ethtool(8)`. The following `ethtool`
+parameters are supported by the driver: ``tx-usecs``, ``rx-usecs``
-In adaptive interrupt moderation mode the interrupt delay value is
+**In adaptive interrupt moderation mode** the interrupt delay value is
updated by the driver dynamically and adjusted every NAPI cycle
according to the traffic nature.
-Adaptive coalescing can be switched on/off through ethtool(8)
-adaptive_rx on|off parameter.
+Adaptive coalescing can be switched on/off through `ethtool(8)`'s
+:code:`adaptive_rx on|off` parameter.
More information about Adaptive Interrupt Moderation (DIM) can be found in
Documentation/networking/net_dim.rst
@@ -214,17 +211,10 @@ The rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK
and can be configured by the ETHTOOL_STUNABLE command of the
SIOCETHTOOL ioctl.
-SKB
-===
-The driver-allocated SKB for frames received from Rx handling using
-NAPI context. The allocation method depends on the size of the packet.
-If the frame length is larger than rx_copybreak, napi_get_frags()
-is used, otherwise netdev_alloc_skb_ip_align() is used, the buffer
-content is copied (by CPU) to the SKB, and the buffer is recycled.
-
Statistics
==========
-The user can obtain ENA device and driver statistics using ethtool.
+
+The user can obtain ENA device and driver statistics using `ethtool`.
The driver can collect regular or extended statistics (including
per-queue stats) from the device.
@@ -232,22 +222,23 @@ In addition the driver logs the stats to syslog upon device reset.
MTU
===
+
The driver supports an arbitrarily large MTU with a maximum that is
negotiated with the device. The driver configures MTU using the
SetFeature command (ENA_ADMIN_MTU property). The user can change MTU
-via ip(8) and similar legacy tools.
+via `ip(8)` and similar legacy tools.
Stateless Offloads
==================
+
The ENA driver supports:
-- TSO over IPv4/IPv6
-- TSO with ECN
- IPv4 header checksum offload
- TCP/UDP over IPv4/IPv6 checksum offloads
RSS
===
+
- The ENA device supports RSS that allows flexible Rx traffic
steering.
- Toeplitz and CRC32 hash functions are supported.
@@ -260,41 +251,42 @@ RSS
function delivered in the Rx CQ descriptor is set in the received
SKB.
- The user can provide a hash key, hash function, and configure the
- indirection table through ethtool(8).
+ indirection table through `ethtool(8)`.
DATA PATH
=========
+
Tx
--
-ena_start_xmit() is called by the stack. This function does the following:
+:code:`ena_start_xmit()` is called by the stack. This function does the following:
-- Maps data buffers (skb->data and frags).
-- Populates ena_buf for the push buffer (if the driver and device are
- in push mode.)
+- Maps data buffers (``skb->data`` and frags).
+- Populates ``ena_buf`` for the push buffer (if the driver and device are
+ in push mode).
- Prepares ENA bufs for the remaining frags.
-- Allocates a new request ID from the empty req_id ring. The request
+- Allocates a new request ID from the empty ``req_id`` ring. The request
ID is the index of the packet in the Tx info. This is used for
- out-of-order TX completions.
+ out-of-order Tx completions.
- Adds the packet to the proper place in the Tx ring.
-- Calls ena_com_prepare_tx(), an ENA communication layer that converts
- the ena_bufs to ENA descriptors (and adds meta ENA descriptors as
- needed.)
+- Calls :code:`ena_com_prepare_tx()`, an ENA communication layer that converts
+ the ``ena_bufs`` to ENA descriptors (and adds meta ENA descriptors as
+ needed).
* This function also copies the ENA descriptors and the push buffer
- to the Device memory space (if in push mode.)
+ to the Device memory space (if in push mode).
-- Writes doorbell to the ENA device.
+- Writes a doorbell to the ENA device.
- When the ENA device finishes sending the packet, a completion
interrupt is raised.
- The interrupt handler schedules NAPI.
-- The ena_clean_tx_irq() function is called. This function handles the
+- The :code:`ena_clean_tx_irq()` function is called. This function handles the
completion descriptors generated by the ENA, with a single
completion descriptor per completed packet.
- * req_id is retrieved from the completion descriptor. The tx_info of
- the packet is retrieved via the req_id. The data buffers are
- unmapped and req_id is returned to the empty req_id ring.
+ * ``req_id`` is retrieved from the completion descriptor. The ``tx_info`` of
+ the packet is retrieved via the ``req_id``. The data buffers are
+ unmapped and ``req_id`` is returned to the empty ``req_id`` ring.
* The function stops when the completion descriptors are completed or
the budget is reached.
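Review note on the Tx steps above: the request ID bookkeeping that allows out-of-order Tx completions can be modelled as a free list plus a lookup table. The following is a simplified Python sketch under that reading, not the ENA driver's implementation; the class `TxRing` and its methods are hypothetical names loosely echoing the documented functions.

```python
from collections import deque


class TxRing:
    """Toy model of the req_id ring and the per-packet Tx info table."""

    def __init__(self, size: int):
        self.free_req_ids = deque(range(size))  # the "empty req_id ring"
        self.tx_info = [None] * size            # req_id -> in-flight packet

    def start_xmit(self, packet):
        """Take a free req_id for the packet; None means the ring is full."""
        if not self.free_req_ids:
            return None
        req_id = self.free_req_ids.popleft()
        self.tx_info[req_id] = packet
        return req_id

    def clean_tx_irq(self, completed_req_ids, budget: int):
        """Handle up to budget completions, which may arrive in any order."""
        done = []
        for req_id in completed_req_ids[:budget]:
            packet = self.tx_info[req_id]
            if packet is None:
                raise ValueError(f"completion for idle req_id {req_id}")
            self.tx_info[req_id] = None        # stands in for unmapping buffers
            self.free_req_ids.append(req_id)   # return req_id to the free ring
            done.append(packet)
        return done
```

Since the completion descriptor carries the `req_id` rather than a ring position, the driver can look up the packet directly and does not depend on completions arriving in submission order.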
@@ -303,12 +295,11 @@ Rx
- When a packet is received from the ENA device.
- The interrupt handler schedules NAPI.
-- The ena_clean_rx_irq() function is called. This function calls
- ena_rx_pkt(), an ENA communication layer function, which returns the
- number of descriptors used for a new unhandled packet, and zero if
+- The :code:`ena_clean_rx_irq()` function is called. This function calls
+ :code:`ena_com_rx_pkt()`, an ENA communication layer function, which returns the
+ number of descriptors used for a new packet, and zero if
no new packet is found.
-- Then it calls the ena_clean_rx_irq() function.
-- ena_eth_rx_skb() checks packet length:
+- :code:`ena_rx_skb()` checks packet length:
* If the packet is small (len < rx_copybreak), the driver allocates
a SKB for the new packet, and copies the packet payload into the
@@ -317,9 +308,10 @@ Rx
- In this way the original data buffer is not passed to the stack
and is reused for future Rx packets.
- * Otherwise the function unmaps the Rx buffer, then allocates the
- new SKB structure and hooks the Rx buffer to the SKB frags.
+ * Otherwise the function unmaps the Rx buffer, sets the first
+ descriptor as `skb`'s linear part and the other descriptors as the
+ `skb`'s frags.
- The new SKB is updated with the necessary information (protocol,
- checksum hw verify result, etc.), and then passed to the network
- stack, using the NAPI interface function napi_gro_receive().
+ checksum hw verify result, etc), and then passed to the network
+ stack, using the NAPI interface function :code:`napi_gro_receive()`.
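Review note on the Rx path as revised in this hunk: the `rx_copybreak` decision can be summarized in a few lines. This is an illustrative Python sketch, not kernel code; the dictionary stands in for an SKB, and `build_rx_skb` is a hypothetical name.

```python
def build_rx_skb(buffers, rx_copybreak: int):
    """Return (skb, recycled_buffers) for one received packet.

    buffers is the list of Rx buffers (one per descriptor) holding the packet.
    """
    length = sum(len(b) for b in buffers)
    if length < rx_copybreak:
        # Small packet: copy the payload into a fresh linear area and keep
        # the original Rx buffers for reuse by future packets.
        skb = {"linear": b"".join(buffers), "frags": []}
        return skb, list(buffers)
    # Large packet: hand the buffers themselves to the stack. The first
    # descriptor becomes the linear part and the rest become frags.
    skb = {"linear": buffers[0], "frags": list(buffers[1:])}
    return skb, []
```

The copy costs CPU time for small packets but avoids unmapping and reallocating a full Rx buffer, which is the trade-off the threshold controls.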
diff --git a/Documentation/networking/device_drivers/ethernet/google/gve.rst b/Documentation/networking/device_drivers/ethernet/google/gve.rst
index 793693cef6e3..6d73ee78f3d7 100644
--- a/Documentation/networking/device_drivers/ethernet/google/gve.rst
+++ b/Documentation/networking/device_drivers/ethernet/google/gve.rst
@@ -47,13 +47,24 @@ The driver interacts with the device in the following ways:
- Transmit and Receive Queues
- See description below
+Descriptor Formats
+------------------
+GVE supports two descriptor formats: GQI and DQO. These two formats have
+entirely different descriptors, which will be described below.
+
Registers
---------
-All registers are MMIO and big endian.
+All registers are MMIO.
The registers are used for initializing and configuring the device as well as
querying device status in response to management interrupts.
+Endianness
+----------
+- Admin Queue messages and registers are all Big Endian.
+- GQI descriptors and datapath registers are Big Endian.
+- DQO descriptors and datapath registers are Little Endian.
+
Admin Queue (AQ)
----------------
The Admin Queue is a PAGE_SIZE memory block, treated as an array of AQ
@@ -97,10 +108,10 @@ the queues associated with that interrupt.
The handler for these irqs schedule the napi for that block to run
and poll the queues.
-Traffic Queues
---------------
-gVNIC's queues are composed of a descriptor ring and a buffer and are
-assigned to a notification block.
+GQI Traffic Queues
+------------------
+GQI queues are composed of a descriptor ring and a buffer and are assigned to a
+notification block.
The descriptor rings are power-of-two-sized ring buffers consisting of
fixed-size descriptors. They advance their head pointer using a __be32
@@ -121,3 +132,35 @@ Receive
The buffers for receive rings are put into a data ring that is the same
length as the descriptor ring and the head and tail pointers advance over
the rings together.
+
+DQO Traffic Queues
+------------------
+- Every TX and RX queue is assigned a notification block.
+
+- TX and RX buffer queues, which send descriptors to the device, use MMIO
+ doorbells to notify the device of new descriptors.
+
+- RX and TX completion queues, which receive descriptors from the device, use a
+ "generation bit" to know when a descriptor was populated by the device. The
+ driver initializes all bits with the "current generation". The device will
+ populate received descriptors with the "next generation" which is inverted
+ from the current generation. When the ring wraps, the current/next generation
+ are swapped.
+
+- It's the driver's responsibility to ensure that the RX and TX completion
+ queues are not overrun. This can be accomplished by limiting the number of
+ descriptors posted to HW.
+
+- TX packets have a 16 bit completion_tag and RX buffers have a 16 bit
+ buffer_id. These will be returned on the TX completion and RX queues
+ respectively to let the driver know which packet/buffer was completed.
+
+Transmit
+~~~~~~~~
+A packet's buffers are DMA mapped for the device to access before transmission.
+After the packet was successfully transmitted, the buffers are unmapped.
+
+Receive
+~~~~~~~
+The driver posts fixed sized buffers to HW on the RX buffer queue. The packet
+received on the associated RX queue may span multiple descriptors.
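Review note on the DQO completion queues described in gve.rst: the generation bit scheme can be simulated to check the wrap-around behaviour. The following is a minimal Python model under the description given in the patch, not the gve driver's code; `CompletionRing`, `device_post` and `poll` are hypothetical names, and the device side is simulated in the same object for convenience.

```python
class CompletionRing:
    """Toy completion ring where a generation bit marks fresh descriptors."""

    def __init__(self, size: int):
        self.size = size
        self.cur_gen = 0                    # driver's "current generation"
        # The driver initializes every descriptor with the current generation.
        self.desc = [(self.cur_gen, None)] * size
        self.head = 0                       # next descriptor the driver reads
        self.dev_tail = 0                   # next descriptor the device writes
        self.dev_gen = self.cur_gen ^ 1     # device writes the "next generation"

    def device_post(self, payload) -> None:
        """Simulated device write; the generation flips when the ring wraps."""
        self.desc[self.dev_tail] = (self.dev_gen, payload)
        self.dev_tail += 1
        if self.dev_tail == self.size:
            self.dev_tail = 0
            self.dev_gen ^= 1

    def poll(self) -> list:
        """Consume descriptors whose generation differs from cur_gen."""
        out = []
        while self.desc[self.head][0] != self.cur_gen:
            out.append(self.desc[self.head][1])
            self.head += 1
            if self.head == self.size:
                self.head = 0
                self.cur_gen ^= 1           # swap current/next on wrap
        return out
```

The model also shows why the driver must limit outstanding descriptors: if the device wrote more than `size` entries before a poll, the overwritten slots would carry the old generation again and the driver would miss them.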
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst
index 936a10f1942c..ef8cb62e82a1 100644
--- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst
@@ -12,6 +12,7 @@ Contents
- `Enabling the driver and kconfig options`_
- `Devlink info`_
- `Devlink parameters`_
+- `Bridge offload`_
- `mlx5 subfunction`_
- `mlx5 function attributes`_
- `Devlink health reporters`_
@@ -217,6 +218,37 @@ users try to enable them.
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
+Bridge offload
+==============
+The mlx5 driver implements support for offloading bridge rules when in switchdev
+mode. Linux bridge FDBs are automatically offloaded when an mlx5 switchdev
+representor is attached to a bridge.
+
+- Change device to switchdev mode::
+
+ $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
+
+- Attach mlx5 switchdev representor 'enp8s0f0' to bridge netdev 'bridge1'::
+
+ $ ip link set enp8s0f0 master bridge1
+
+VLANs
+-----
+The following bridge VLAN functions are supported by mlx5:
+
+- VLAN filtering (including multiple VLANs per port)::
+
+ $ ip link set bridge1 type bridge vlan_filtering 1
+ $ bridge vlan add dev enp8s0f0 vid 2-3
+
+- VLAN push on bridge ingress::
+
+ $ bridge vlan add dev enp8s0f0 vid 3 pvid
+
+- VLAN pop on bridge egress::
+
+ $ bridge vlan add dev enp8s0f0 vid 3 untagged
+
mlx5 subfunction
================
mlx5 supports subfunction management using devlink port (see :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`) interface.
@@ -568,3 +600,59 @@ tc and eswitch offloads tracepoints:
$ cat /sys/kernel/debug/tracing/trace
...
kworker/u48:7-2221 [009] ...1 1475.387435: mlx5e_rep_neigh_update: netdev: ens1f0 MAC: 24:8a:07:9a:17:9a IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_connected=1
+
+Bridge offloads tracepoints:
+
+- mlx5_esw_bridge_fdb_entry_init: trace bridge FDB entry offloaded to mlx5::
+
+ $ echo mlx5:mlx5_esw_bridge_fdb_entry_init >> set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ kworker/u20:9-2217 [003] ...1 318.582243: mlx5_esw_bridge_fdb_entry_init: net_device=enp8s0f0_0 addr=e4:fd:05:08:00:02 vid=0 flags=0 used=0
+
+- mlx5_esw_bridge_fdb_entry_cleanup: trace bridge FDB entry deleted from mlx5::
+
+ $ echo mlx5:mlx5_esw_bridge_fdb_entry_cleanup >> set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ ip-2581 [005] ...1 318.629871: mlx5_esw_bridge_fdb_entry_cleanup: net_device=enp8s0f0_1 addr=e4:fd:05:08:00:03 vid=0 flags=0 used=16
+
+- mlx5_esw_bridge_fdb_entry_refresh: trace bridge FDB entry offload refreshed in
+ mlx5::
+
+ $ echo mlx5:mlx5_esw_bridge_fdb_entry_refresh >> set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ kworker/u20:8-3849 [003] ...1 466716: mlx5_esw_bridge_fdb_entry_refresh: net_device=enp8s0f0_0 addr=e4:fd:05:08:00:02 vid=3 flags=0 used=0
+
+- mlx5_esw_bridge_vlan_create: trace bridge VLAN object add on mlx5
+ representor::
+
+ $ echo mlx5:mlx5_esw_bridge_vlan_create >> set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ ip-2560 [007] ...1 318.460258: mlx5_esw_bridge_vlan_create: vid=1 flags=6
+
+- mlx5_esw_bridge_vlan_cleanup: trace bridge VLAN object delete from mlx5
+ representor::
+
+ $ echo mlx5:mlx5_esw_bridge_vlan_cleanup >> set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ bridge-2582 [007] ...1 318.653496: mlx5_esw_bridge_vlan_cleanup: vid=2 flags=8
+
+- mlx5_esw_bridge_vport_init: trace mlx5 vport assigned with bridge upper
+ device::
+
+ $ echo mlx5:mlx5_esw_bridge_vport_init >> set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ ip-2560 [007] ...1 318.458915: mlx5_esw_bridge_vport_init: vport_num=1
+
+- mlx5_esw_bridge_vport_cleanup: trace mlx5 vport removed from bridge upper
+ device::
+
+ $ echo mlx5:mlx5_esw_bridge_vport_cleanup >> set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ ip-5387 [000] ...1 573713: mlx5_esw_bridge_vport_cleanup: vport_num=1
diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst
index d8279de7bf25..3a5a1d46e77e 100644
--- a/Documentation/networking/device_drivers/index.rst
+++ b/Documentation/networking/device_drivers/index.rst
@@ -18,6 +18,7 @@ Contents:
qlogic/index
wan/index
wifi/index
+ wwan/index
.. only:: subproject and html
diff --git a/Documentation/networking/device_drivers/wwan/index.rst b/Documentation/networking/device_drivers/wwan/index.rst
new file mode 100644
index 000000000000..1cb8c7371401
--- /dev/null
+++ b/Documentation/networking/device_drivers/wwan/index.rst
@@ -0,0 +1,18 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+WWAN Device Drivers
+===================
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ iosm
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/wwan/iosm.rst b/Documentation/networking/device_drivers/wwan/iosm.rst
new file mode 100644
index 000000000000..aceb0223eb46
--- /dev/null
+++ b/Documentation/networking/device_drivers/wwan/iosm.rst
@@ -0,0 +1,96 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+.. Copyright (C) 2020-21 Intel Corporation
+
+.. _iosm_driver_doc:
+
+===========================================
+IOSM Driver for Intel M.2 PCIe based Modems
+===========================================
+The IOSM (IPC over Shared Memory) driver is a WWAN PCIe host driver developed
+for Linux or Chrome platforms for data exchange over the PCIe interface between
+the host platform and an Intel M.2 modem. The driver exposes an interface
+conforming to the MBIM protocol [1]. Any front-end application (e.g. Modem
+Manager) can manage the MBIM interface to enable data communication towards
+WWAN.
+
+Basic usage
+===========
+MBIM functions are inactive when unmanaged. The IOSM driver only provides a
+userspace interface, the MBIM "WWAN PORT", which represents the MBIM control
+channel; it does not play any role in managing the functionality. It is the job
+of a userspace application to detect port enumeration and enable MBIM
+functionality.
+
+Examples of such userspace applications are:
+
+- mbimcli (included with the libmbim [2] library), and
+- Modem Manager [3]
+
+A management application must carry out the following actions to establish an
+MBIM IP session:
+
+- open the MBIM control channel
+- configure network connection settings
+- connect to network
+- configure IP network interface
+
+Management application development
+==================================
+The driver and userspace interfaces are described below. The MBIM protocol is
+described in [1] Mobile Broadband Interface Model v1.0 Errata-1.
+
+MBIM control channel userspace ABI
+----------------------------------
+
+/dev/wwan0mbim0 character device
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The driver exposes an MBIM interface to the MBIM function by implementing an
+MBIM WWAN port. The userspace end of the control channel pipe is a
+/dev/wwan0mbim0 character device. Applications shall use this interface for
+MBIM protocol communication.
+
+Fragmentation
+~~~~~~~~~~~~~
+The userspace application is responsible for all control message fragmentation
+and defragmentation as per the MBIM specification.
+
+/dev/wwan0mbim0 write()
+~~~~~~~~~~~~~~~~~~~~~~~
+The MBIM control messages from the management application must not exceed the
+negotiated control message size.
+
+/dev/wwan0mbim0 read()
+~~~~~~~~~~~~~~~~~~~~~~
+The management application must accept control messages of up to the negotiated
+control message size.
+
+MBIM data channel userspace ABI
+-------------------------------
+
+wwan0-X network device
+~~~~~~~~~~~~~~~~~~~~~~
+The IOSM driver exposes an IP link interface "wwan0-X" of type "wwan" for IP
+traffic. The iproute network utility is used for creating the "wwan0-X" network
+interface and for associating it with an MBIM IP session. The driver supports
+up to 8 IP sessions for simultaneous IP communication.
+
+The userspace management application is responsible for creating a new IP link
+prior to establishing an MBIM IP session where the SessionId is greater than 0.
+
+For example, creating a new IP link for an MBIM IP session with SessionId 1:
+
+ ip link add dev wwan0-1 parentdev-name wwan0 type wwan linkid 1
+
+The driver will automatically map the "wwan0-1" network device to MBIM IP
+session 1.
+
+References
+==========
+[1] "MBIM (Mobile Broadband Interface Model) Errata-1"
+ - https://www.usb.org/document-library/
+
+[2] libmbim - "a glib-based library for talking to WWAN modems and
+ devices which speak the Mobile Interface Broadband Model (MBIM)
+ protocol"
+ - http://www.freedesktop.org/wiki/Software/libmbim/
+
+[3] Modem Manager - "a DBus-activated daemon which controls mobile
+ broadband (2G/3G/4G) devices and connections"
+ - http://www.freedesktop.org/wiki/Software/ModemManager/
diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst
index ab790e7980b8..7627b1da01f2 100644
--- a/Documentation/networking/devlink/devlink-port.rst
+++ b/Documentation/networking/devlink/devlink-port.rst
@@ -164,6 +164,41 @@ device to instantiate the subfunction device on particular PCI function.
A subfunction device is created on the :ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`.
At this point a matching subfunction driver binds to the subfunction's auxiliary device.
+Rate object management
+======================
+
+Devlink provides an API to manage the TX rates of a single devlink port or a
+group. This is done through rate objects, which can be one of two types:
+
+``leaf``
+ Represents a single devlink port; created/destroyed by the driver. Since a
+ leaf has a one-to-one mapping to its devlink port, in userspace it is referred
+ to as ``pci/<bus_addr>/<port_index>``;
+
+``node``
+ Represents a group of rate objects (leafs and/or nodes); created/deleted by
+ request from userspace; initially empty (no rate objects added). In
+ userspace it is referred to as ``pci/<bus_addr>/<node_name>``, where
+ ``node_name`` can be any identifier except a decimal number, to avoid
+ collisions with leafs.
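
The naming rule for the two rate object types can be illustrated with a small parser. This is a sketch under stated assumptions: ``classify_rate_object`` is a hypothetical helper written for this example, not part of devlink or iproute2, and the handles used below are made-up examples.

```python
def classify_rate_object(handle):
    """Classify a rate object handle as "leaf" or "node".

    The handle is expected to look like pci/<bus_addr>/<name>. A name made
    only of decimal digits is a port index (leaf); anything else is a node
    name. Raises ValueError for handles that do not have three components.
    """
    parts = handle.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected <bus>/<bus_addr>/<name>: %r" % handle)
    name = parts[2]
    # Decimal names are reserved for port indices, so they cannot be nodes.
    if name.isascii() and name.isdigit():
        return "leaf"
    return "node"


kinds = [
    classify_rate_object("pci/0000:03:00.0/1"),       # port index 1
    classify_rate_object("pci/0000:03:00.0/group1"),  # node named "group1"
]
```

Because a node name can never be a plain decimal number, the last component alone is enough to tell the two types apart.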
+
+The API allows configuring the following rate object parameters:
+
+``tx_share``
+ Minimum TX rate value shared among all other rate objects, or among the rate
+ objects that are part of the same parent group if the object belongs to a group.
+
+``tx_max``
+ Maximum TX rate value.
+
+``parent``
+ Parent node name. Parent node rate limits are considered as additional limits
+ to all node children limits. ``tx_max`` is an upper limit for children.
+ ``tx_share`` is a total bandwidth distributed among children.
+
+Driver implementations are allowed to support either or both rate object types
+and the methods for setting their parameters.
+
Terms and Definitions
=====================
diff --git a/Documentation/networking/devlink/devlink-trap.rst b/Documentation/networking/devlink/devlink-trap.rst
index efa5f7f42c88..90d1381b88de 100644
--- a/Documentation/networking/devlink/devlink-trap.rst
+++ b/Documentation/networking/devlink/devlink-trap.rst
@@ -497,6 +497,7 @@ drivers:
* Documentation/networking/devlink/netdevsim.rst
* Documentation/networking/devlink/mlxsw.rst
+ * Documentation/networking/devlink/prestera.rst
.. _Generic-Packet-Trap-Groups:
diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst
index 8428a1220723..b3b9e0692088 100644
--- a/Documentation/networking/devlink/index.rst
+++ b/Documentation/networking/devlink/index.rst
@@ -46,3 +46,4 @@ parameters, info versions, and other features it supports.
qed
ti-cpsw-switch
am65-nuss-cpsw-switch
+ prestera
diff --git a/Documentation/networking/devlink/netdevsim.rst b/Documentation/networking/devlink/netdevsim.rst
index 02c2d20dc673..8a292fb5aaea 100644
--- a/Documentation/networking/devlink/netdevsim.rst
+++ b/Documentation/networking/devlink/netdevsim.rst
@@ -57,6 +57,32 @@ entries, FIB rule entries and nexthops that the driver will allow.
$ devlink resource set netdevsim/netdevsim0 path /nexthops size 16
$ devlink dev reload netdevsim/netdevsim0
+Rate objects
+============
+
+The ``netdevsim`` driver supports rate objects management, which includes:
+
+- registering/unregistering leaf rate objects per VF devlink port;
+- creation/deletion of node rate objects;
+- setting tx_share and tx_max rate values for any rate object type;
+- setting parent node for any rate object type.
+
+Rate nodes and their parameters are exposed in ``netdevsim`` debugfs in RO mode.
+For example, for a created rate node named ``some_group``:
+
+.. code:: shell
+
+ $ ls /sys/kernel/debug/netdevsim/netdevsim0/rate_groups/some_group
+ rate_parent tx_max tx_share
+
+The same parameters are exposed for leaf objects in the corresponding port
+directories. For example:
+
+.. code:: shell
+
+ $ ls /sys/kernel/debug/netdevsim/netdevsim0/ports/1
+ dev ethtool rate_parent tx_max tx_share
+
Driver-specific Traps
=====================
diff --git a/Documentation/networking/devlink/prestera.rst b/Documentation/networking/devlink/prestera.rst
new file mode 100644
index 000000000000..49409d1d3081
--- /dev/null
+++ b/Documentation/networking/devlink/prestera.rst
@@ -0,0 +1,141 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================
+prestera devlink support
+========================
+
+This document describes the devlink features implemented by the ``prestera``
+device driver.
+
+Driver-specific Traps
+=====================
+
+.. list-table:: List of Driver-specific Traps Registered by ``prestera``
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``arp_bc``
+ - ``trap``
+ - Traps ARP broadcast packets (both requests/responses)
+ * - ``is_is``
+ - ``trap``
+ - Traps IS-IS packets
+ * - ``ospf``
+ - ``trap``
+ - Traps OSPF packets
+ * - ``ip_bc_mac``
+ - ``trap``
+ - Traps IPv4 packets with broadcast DA Mac address
+ * - ``stp``
+ - ``trap``
+ - Traps STP BPDU
+ * - ``lacp``
+ - ``trap``
+ - Traps LACP packets
+ * - ``lldp``
+ - ``trap``
+ - Traps LLDP packets
+ * - ``router_mc``
+ - ``trap``
+ - Traps multicast packets
+ * - ``vrrp``
+ - ``trap``
+ - Traps VRRP packets
+ * - ``dhcp``
+ - ``trap``
+ - Traps DHCP packets
+ * - ``mtu_error``
+ - ``trap``
+ - Traps (exception) packets that exceeded port's MTU
+ * - ``mac_to_me``
+ - ``trap``
+ - Traps packets with switch-port's DA Mac address
+ * - ``ttl_error``
+ - ``trap``
+ - Traps (exception) IPv4 packets whose TTL was exceeded
+ * - ``ipv4_options``
+ - ``trap``
+ - Traps (exception) packets due to malformed IPv4 header options
+ * - ``ip_default_route``
+ - ``trap``
+ - Traps packets that have no specific IP interface (IP to me) and no forwarding prefix
+ * - ``local_route``
+ - ``trap``
+ - Traps packets that have been sent to one of the switch's IP interface addresses
+ * - ``ipv4_icmp_redirect``
+ - ``trap``
+ - Traps (exception) IPV4 ICMP redirect packets
+ * - ``arp_response``
+ - ``trap``
+ - Traps ARP reply packets that have switch-port's DA Mac address
+ * - ``acl_code_0``
+ - ``trap``
+ - Traps packets that have ACL priority set to 0 (tc pref 0)
+ * - ``acl_code_1``
+ - ``trap``
+ - Traps packets that have ACL priority set to 1 (tc pref 1)
+ * - ``acl_code_2``
+ - ``trap``
+ - Traps packets that have ACL priority set to 2 (tc pref 2)
+ * - ``acl_code_3``
+ - ``trap``
+ - Traps packets that have ACL priority set to 3 (tc pref 3)
+ * - ``acl_code_4``
+ - ``trap``
+ - Traps packets that have ACL priority set to 4 (tc pref 4)
+ * - ``acl_code_5``
+ - ``trap``
+ - Traps packets that have ACL priority set to 5 (tc pref 5)
+ * - ``acl_code_6``
+ - ``trap``
+ - Traps packets that have ACL priority set to 6 (tc pref 6)
+ * - ``acl_code_7``
+ - ``trap``
+ - Traps packets that have ACL priority set to 7 (tc pref 7)
+ * - ``ipv4_bgp``
+ - ``trap``
+ - Traps IPv4 BGP packets
+ * - ``ssh``
+ - ``trap``
+ - Traps SSH packets
+ * - ``telnet``
+ - ``trap``
+ - Traps Telnet packets
+ * - ``icmp``
+ - ``trap``
+ - Traps ICMP packets
+ * - ``rxdma_drop``
+ - ``drop``
+ - Drops packets (RxDMA) due to the lack of ingress buffers etc.
+ * - ``port_no_vlan``
+ - ``drop``
+ - Drops packets due to a misconfigured network or an internal bug (config issue).
+ * - ``local_port``
+ - ``drop``
+ - Drops packets whose decision (FDB entry) is to bridge packet back to the incoming port/trunk.
+ * - ``invalid_sa``
+ - ``drop``
+ - Drops packets with multicast source MAC address.
+ * - ``illegal_ip_addr``
+ - ``drop``
+ - Drops packets with illegal SIP/DIP multicast/unicast addresses.
+ * - ``illegal_ipv4_hdr``
+ - ``drop``
+ - Drops packets with illegal IPV4 header.
+ * - ``ip_uc_dip_da_mismatch``
+ - ``drop``
+ - Drops packets with destination MAC being unicast, but destination IP address being multicast.
+ * - ``ip_sip_is_zero``
+ - ``drop``
+ - Drops packets with zero (0) IPV4 source address.
+ * - ``met_red``
+ - ``drop``
+ - Drops non-conforming packets (dropped by Ingress policer, metering drop), e.g. packet rate exceeded configured bandwidth.
diff --git a/Documentation/networking/dsa/configuration.rst b/Documentation/networking/dsa/configuration.rst
index 774f0e76c746..2b08f1a772d3 100644
--- a/Documentation/networking/dsa/configuration.rst
+++ b/Documentation/networking/dsa/configuration.rst
@@ -292,3 +292,71 @@ configuration.
# bring up the bridge devices
ip link set br0 up
+
+Forwarding database (FDB) management
+------------------------------------
+
+The existing DSA switches do not have the necessary hardware support to keep
+the software FDB of the bridge in sync with the hardware tables, so the two
+tables are managed separately (``bridge fdb show`` queries both, and depending
+on whether the ``self`` or ``master`` flags are being used, a ``bridge fdb
+add`` or ``bridge fdb del`` command acts upon entries from one or both tables).
+
+Up until kernel v4.14, DSA only supported user space management of bridge FDB
+entries using the bridge bypass operations (which do not update the software
+FDB, just the hardware one) using the ``self`` flag (which is optional and can
+be omitted).
+
+ .. code-block:: sh
+
+ bridge fdb add dev swp0 00:01:02:03:04:05 self static
+ # or shorthand
+ bridge fdb add dev swp0 00:01:02:03:04:05 static
+
+Due to a bug, the bridge bypass FDB implementation provided by DSA did not
+distinguish between ``static`` and ``local`` FDB entries (``static`` are meant
+to be forwarded, while ``local`` are meant to be locally terminated, i.e. sent
+to the host port). Instead, all FDB entries with the ``self`` flag (implicit or
+explicit) are treated by DSA as ``static`` even if they are ``local``.
+
+ .. code-block:: sh
+
+ # This command:
+ bridge fdb add dev swp0 00:01:02:03:04:05 static
+ # behaves the same for DSA as this command:
+ bridge fdb add dev swp0 00:01:02:03:04:05 local
+ # or shorthand, because the 'local' flag is implicit if 'static' is not
+ # specified, it also behaves the same as:
+ bridge fdb add dev swp0 00:01:02:03:04:05
+
+The last command is an incorrect way of adding a static bridge FDB entry to a
+DSA switch using the bridge bypass operations, and works by mistake. Other
+drivers will treat an FDB entry added by the same command as ``local`` and as
+such, will not forward it, as opposed to DSA.
+
+Between kernel v4.14 and v5.14, DSA has supported in parallel two modes of
+adding a bridge FDB entry to the switch: the bridge bypass discussed above, as
+well as a new mode using the ``master`` flag which installs FDB entries in the
+software bridge too.
+
+ .. code-block:: sh
+
+ bridge fdb add dev swp0 00:01:02:03:04:05 master static
+
+Since kernel v5.14, DSA has gained stronger integration with the bridge's
+software FDB, and the support for its bridge bypass FDB implementation (using
+the ``self`` flag) has been removed. This results in the following changes:
+
+ .. code-block:: sh
+
+ # This is the only valid way of adding an FDB entry that is supported,
+ # compatible with v4.14 kernels and later:
+ bridge fdb add dev swp0 00:01:02:03:04:05 master static
+ # This command is no longer buggy and the entry is properly treated as
+ # 'local' instead of being forwarded:
+ bridge fdb add dev swp0 00:01:02:03:04:05
+ # This command no longer installs a static FDB entry to hardware:
+ bridge fdb add dev swp0 00:01:02:03:04:05 static
+
+Script writers are therefore encouraged to use the ``master static`` set of
+flags when working with bridge FDB entries on DSA switch interfaces.
diff --git a/Documentation/networking/dsa/dsa.rst b/Documentation/networking/dsa/dsa.rst
index 8688009514cc..20baacf2bc5c 100644
--- a/Documentation/networking/dsa/dsa.rst
+++ b/Documentation/networking/dsa/dsa.rst
@@ -93,14 +93,15 @@ A tagging protocol may tag all packets with switch tags of the same length, or
the tag length might vary (for example packets with PTP timestamps might
require an extended switch tag, or there might be one tag length on TX and a
different one on RX). Either way, the tagging protocol driver must populate the
-``struct dsa_device_ops::overhead`` with the length in octets of the longest
-switch frame header. The DSA framework will automatically adjust the MTU of the
-master interface to accomodate for this extra size in order for DSA user ports
-to support the standard MTU (L2 payload length) of 1500 octets. The ``overhead``
-is also used to request from the network stack, on a best-effort basis, the
-allocation of packets with a ``needed_headroom`` or ``needed_tailroom``
-sufficient such that the act of pushing the switch tag on transmission of a
-packet does not cause it to reallocate due to lack of memory.
+``struct dsa_device_ops::needed_headroom`` and/or ``struct dsa_device_ops::needed_tailroom``
+with the length in octets of the longest switch frame header/trailer. The DSA
+framework will automatically adjust the MTU of the master interface to
+accommodate for this extra size in order for DSA user ports to support the
+standard MTU (L2 payload length) of 1500 octets. The ``needed_headroom`` and
+``needed_tailroom`` properties are also used to request from the network stack,
+on a best-effort basis, the allocation of packets with enough extra space such
+that the act of pushing the switch tag on transmission of a packet does not
+cause it to reallocate due to lack of memory.
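
The MTU adjustment described above comes down to simple arithmetic, sketched here under stated assumptions: ``master_mtu`` is a hypothetical helper, and the tag sizes in the usage lines are made-up values, not those of any particular tagging protocol.

```python
ETH_DATA_LEN = 1500  # standard L2 payload length in octets


def master_mtu(needed_headroom, needed_tailroom, user_mtu=ETH_DATA_LEN):
    """MTU the master interface needs so user ports keep their full MTU.

    needed_headroom and needed_tailroom are the lengths in octets of the
    longest switch frame header and trailer, respectively.
    """
    return user_mtu + needed_headroom + needed_tailroom


header_tagger = master_mtu(needed_headroom=8, needed_tailroom=0)   # 8-octet header
trailer_tagger = master_mtu(needed_headroom=0, needed_tailroom=4)  # 4-octet trailer
```

A tagger that uses only a header leaves ``needed_tailroom`` at zero, and a tail tagger does the opposite.
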
Even though applications are not expected to parse DSA-specific frame headers,
the format on the wire of the tagging protocol represents an Application Binary
@@ -169,8 +170,8 @@ The job of this method is to prepare the skb in a way that the switch will
understand what egress port the packet is for (and not deliver it towards other
ports). Typically this is fulfilled by pushing a frame header. Checking for
insufficient size in the skb headroom or tailroom is unnecessary provided that
-the ``overhead`` and ``tail_tag`` properties were filled out properly, because
-DSA ensures there is enough space before calling this method.
+the ``needed_headroom`` and ``needed_tailroom`` properties were filled out
+properly, because DSA ensures there is enough space before calling this method.
The reception of a packet goes through the tagger's ``rcv`` function. The
passed ``struct sk_buff *skb`` has ``skb->data`` pointing at
diff --git a/Documentation/networking/dsa/sja1105.rst b/Documentation/networking/dsa/sja1105.rst
index 7395a33baaf9..da4057ba37f1 100644
--- a/Documentation/networking/dsa/sja1105.rst
+++ b/Documentation/networking/dsa/sja1105.rst
@@ -5,7 +5,7 @@ NXP SJA1105 switch driver
Overview
========
-The NXP SJA1105 is a family of 6 devices:
+The NXP SJA1105 is a family of 10 SPI-managed automotive switches:
- SJA1105E: First generation, no TTEthernet
- SJA1105T: First generation, TTEthernet
@@ -13,9 +13,11 @@ The NXP SJA1105 is a family of 6 devices:
- SJA1105Q: Second generation, TTEthernet, no SGMII
- SJA1105R: Second generation, no TTEthernet, SGMII
- SJA1105S: Second generation, TTEthernet, SGMII
-
-These are SPI-managed automotive switches, with all ports being gigabit
-capable, and supporting MII/RMII/RGMII and optionally SGMII on one port.
+- SJA1110A: Third generation, TTEthernet, SGMII, integrated 100base-T1 and
+ 100base-TX PHYs
+- SJA1110B: Third generation, TTEthernet, SGMII, 100base-T1, 100base-TX
+- SJA1110C: Third generation, TTEthernet, SGMII, 100base-T1, 100base-TX
+- SJA1110D: Third generation, TTEthernet, SGMII, 100base-T1
Being automotive parts, their configuration interface is geared towards
set-and-forget use, with minimal dynamic interaction at runtime. They
@@ -579,3 +581,54 @@ A board would need to hook up the PHYs connected to the switch to any other
MDIO bus available to Linux within the system (e.g. to the DSA master's MDIO
bus). Link state management then works by the driver manually keeping in sync
(over SPI commands) the MAC link speed with the settings negotiated by the PHY.
+
+By comparison, the SJA1110 supports an MDIO slave access point over which its
+internal 100base-T1 PHYs can be accessed from the host. This is, however, not
+used by the driver; instead, the internal 100base-T1 and 100base-TX PHYs are
+accessed through SPI commands, modeled in Linux as virtual MDIO buses.
+
+The microcontroller attached to the SJA1110 port 0 also has an MDIO controller
+operating in master mode; however, the driver does not support this either,
+since the microcontroller gets disabled when the Linux driver operates.
+Discrete PHYs connected to the switch ports should have their MDIO interface
+attached to an MDIO controller from the host system and not to the switch,
+similar to SJA1105.
+
+Port compatibility matrix
+-------------------------
+
+The SJA1105 port compatibility matrix is:
+
+===== ============== ============== ==============
+Port SJA1105E/T SJA1105P/Q SJA1105R/S
+===== ============== ============== ==============
+0 xMII xMII xMII
+1 xMII xMII xMII
+2 xMII xMII xMII
+3 xMII xMII xMII
+4 xMII xMII SGMII
+===== ============== ============== ==============
+
+
+The SJA1110 port compatibility matrix is:
+
+===== ============== ============== ============== ==============
+Port SJA1110A SJA1110B SJA1110C SJA1110D
+===== ============== ============== ============== ==============
+0 RevMII (uC) RevMII (uC) RevMII (uC) RevMII (uC)
+1 100base-TX 100base-TX 100base-TX
+ or SGMII SGMII
+2 xMII xMII xMII xMII
+ or SGMII or SGMII
+3 xMII xMII xMII
+ or SGMII or SGMII SGMII
+ or 2500base-X or 2500base-X or 2500base-X
+4 SGMII SGMII SGMII SGMII
+ or 2500base-X or 2500base-X or 2500base-X or 2500base-X
+5 100base-T1 100base-T1 100base-T1 100base-T1
+6 100base-T1 100base-T1 100base-T1 100base-T1
+7 100base-T1 100base-T1 100base-T1 100base-T1
+8 100base-T1 100base-T1 n/a n/a
+9 100base-T1 100base-T1 n/a n/a
+10 100base-T1 n/a n/a n/a
+===== ============== ============== ============== ==============
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index 25131df3c2bd..6ea91e41593f 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -1363,8 +1363,8 @@ in an implementation specific way.
``ETHTOOL_A_FEC_AUTO`` requests the driver to choose FEC mode based on SFP
module parameters. This does not mean autonegotiation.
-MODULE_EEPROM
-=============
+MODULE_EEPROM_GET
+=================
Fetch module EEPROM data dump.
This interface is designed to allow dumps of at most 1/2 page at once. This
@@ -1383,12 +1383,14 @@ Request contents:
``ETHTOOL_A_MODULE_EEPROM_I2C_ADDRESS`` u8 page I2C address
======================================= ====== ==========================
+If ``ETHTOOL_A_MODULE_EEPROM_BANK`` is not specified, bank 0 is assumed.
+
Kernel response contents:
+---------------------------------------------+--------+---------------------+
| ``ETHTOOL_A_MODULE_EEPROM_HEADER`` | nested | reply header |
+---------------------------------------------+--------+---------------------+
- | ``ETHTOOL_A_MODULE_EEPROM_DATA`` | nested | array of bytes from |
+ | ``ETHTOOL_A_MODULE_EEPROM_DATA`` | binary | array of bytes from |
| | | module EEPROM |
+---------------------------------------------+--------+---------------------+
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index c2ecc9894fd0..b3fa522e4cd9 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -99,6 +99,35 @@ fib_multipath_hash_policy - INTEGER
- 0 - Layer 3
- 1 - Layer 4
- 2 - Layer 3 or inner Layer 3 if present
+ - 3 - Custom multipath hash. Fields used for multipath hash calculation
+ are determined by fib_multipath_hash_fields sysctl
+
+fib_multipath_hash_fields - UNSIGNED INTEGER
+ When fib_multipath_hash_policy is set to 3 (custom multipath hash), the
+ fields used for multipath hash calculation are determined by this
+ sysctl.
+
+ This value is a bitmask which enables various fields for multipath hash
+ calculation.
+
+ Possible fields are:
+
+ ====== ============================
+ 0x0001 Source IP address
+ 0x0002 Destination IP address
+ 0x0004 IP protocol
+ 0x0008 Unused (Flow Label)
+ 0x0010 Source port
+ 0x0020 Destination port
+ 0x0040 Inner source IP address
+ 0x0080 Inner destination IP address
+ 0x0100 Inner IP protocol
+ 0x0200 Inner Flow Label
+ 0x0400 Inner source port
+ 0x0800 Inner destination port
+ ====== ============================
+
+ Default: 0x0007 (source IP, destination IP and IP protocol)
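
As an illustration of how the bitmask is composed, the following sketch ORs together the bit values from the table above. The dictionary keys and the helper names (``hash_fields_mask``, ``decode_hash_fields``) are hypothetical labels chosen for this example; only the numeric values come from the table.

```python
# Bit values copied from the table above; the key names are illustrative.
HASH_FIELDS = {
    "src_ip": 0x0001,
    "dst_ip": 0x0002,
    "ip_proto": 0x0004,
    "flowlabel": 0x0008,
    "src_port": 0x0010,
    "dst_port": 0x0020,
    "inner_src_ip": 0x0040,
    "inner_dst_ip": 0x0080,
    "inner_ip_proto": 0x0100,
    "inner_flowlabel": 0x0200,
    "inner_src_port": 0x0400,
    "inner_dst_port": 0x0800,
}


def hash_fields_mask(*names):
    """OR together the bits for the named fields."""
    mask = 0
    for name in names:
        mask |= HASH_FIELDS[name]
    return mask


def decode_hash_fields(mask):
    """List the field names enabled in a mask, in table order."""
    return [name for name, bit in HASH_FIELDS.items() if mask & bit]


default_mask = hash_fields_mask("src_ip", "dst_ip", "ip_proto")
five_tuple = hash_fields_mask("src_ip", "dst_ip", "ip_proto",
                              "src_port", "dst_port")
print(hex(default_mask), hex(five_tuple))
```

The resulting value would then be written to the sysctl (for IPv4, ``net.ipv4.fib_multipath_hash_fields``) and only takes effect once ``fib_multipath_hash_policy`` is set to 3.
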
fib_sync_mem - UNSIGNED INTEGER
Amount of dirty memory from fib entries that can be backlogged before
@@ -732,6 +761,31 @@ tcp_syncookies - INTEGER
network connections you can set this knob to 2 to enable
unconditionally generation of syncookies.
+tcp_migrate_req - BOOLEAN
+ The incoming connection is tied to a specific listening socket when
+ the initial SYN packet is received during the three-way handshake.
+ When a listener is closed, in-flight request sockets during the
+ handshake and established sockets in the accept queue are aborted.
+
+ If the listener has SO_REUSEPORT enabled, other listeners on the
+ same port should have been able to accept such connections. This
+ option makes it possible to migrate such child sockets to another
+ listener after close() or shutdown().
+
+ The BPF_SK_REUSEPORT_SELECT_OR_MIGRATE type of eBPF program should
+ usually be used to define the policy to pick an alive listener.
+ Otherwise, the kernel will randomly pick an alive listener only if
+ this option is enabled.
+
+ Note that migration between listeners with different settings may
+ crash applications. Let's say migration happens from listener A to
+ B, and only B has TCP_SAVE_SYN enabled. B cannot read SYN data from
+ the requests migrated from A. To avoid such a situation, cancel
+ migration by returning SK_DROP from that type of eBPF program, or
+ disable this option.
+
+ Default: 0
+
tcp_fastopen - INTEGER
Enable TCP Fast Open (RFC7413) to send and accept data in the opening
SYN packet.
@@ -1743,6 +1797,35 @@ fib_multipath_hash_policy - INTEGER
- 0 - Layer 3 (source and destination addresses plus flow label)
- 1 - Layer 4 (standard 5-tuple)
- 2 - Layer 3 or inner Layer 3 if present
+ - 3 - Custom multipath hash. Fields used for multipath hash calculation
+ are determined by fib_multipath_hash_fields sysctl
+
+fib_multipath_hash_fields - UNSIGNED INTEGER
+ When fib_multipath_hash_policy is set to 3 (custom multipath hash), the
+ fields used for multipath hash calculation are determined by this
+ sysctl.
+
+ This value is a bitmask which enables various fields for multipath hash
+ calculation.
+
+ Possible fields are:
+
+ ====== ============================
+ 0x0001 Source IP address
+ 0x0002 Destination IP address
+ 0x0004 IP protocol
+ 0x0008 Flow Label
+ 0x0010 Source port
+ 0x0020 Destination port
+ 0x0040 Inner source IP address
+ 0x0080 Inner destination IP address
+ 0x0100 Inner IP protocol
+ 0x0200 Inner Flow Label
+ 0x0400 Inner source port
+ 0x0800 Inner destination port
+ ====== ============================
+
+ Default: 0x0007 (source IP, destination IP and IP protocol)
anycast_src_echo_reply - BOOLEAN
Controls the use of anycast addresses as source addresses for ICMPv6
@@ -2751,6 +2834,18 @@ encap_port - INTEGER
Default: 0
+plpmtud_probe_interval - INTEGER
+ The time interval (in milliseconds) for the PLPMTUD probe timer,
+ which is configured to expire after this period to receive an
+ acknowledgment to a probe packet. This is also the time interval
+ between the probes for the current pmtu when the probe search
+ is done.
+
+ PLPMTUD will be disabled when 0 is set, and other values for it
+ must be >= 5000.
+
+ Default: 0
+
``/proc/sys/net/core/*``
========================
diff --git a/Documentation/networking/mptcp-sysctl.rst b/Documentation/networking/mptcp-sysctl.rst
index 6af0196c4297..76d939e688b8 100644
--- a/Documentation/networking/mptcp-sysctl.rst
+++ b/Documentation/networking/mptcp-sysctl.rst
@@ -7,13 +7,13 @@ MPTCP Sysfs variables
/proc/sys/net/mptcp/* Variables
===============================
-enabled - INTEGER
+enabled - BOOLEAN
Control whether MPTCP sockets can be created.
- MPTCP sockets can be created if the value is nonzero. This is
- a per-namespace sysctl.
+ MPTCP sockets can be created if the value is 1. This is a
+ per-namespace sysctl.
- Default: 1
+ Default: 1 (enabled)
add_addr_timeout - INTEGER (seconds)
Set the timeout after which an ADD_ADDR control message will be
@@ -24,3 +24,24 @@ add_addr_timeout - INTEGER (seconds)
sysctl.
Default: 120
+
+checksum_enabled - BOOLEAN
+ Control whether DSS checksum can be enabled.
+
+ DSS checksum can be enabled if the value is nonzero. This is a
+ per-namespace sysctl.
+
+ Default: 0
+
+allow_join_initial_addr_port - BOOLEAN
+ Allow peers to send join requests to the IP address and port number used
+ by the initial subflow if the value is 1. This controls a flag that is
+ sent to the peer at connection time, and whether such join requests are
+ accepted or denied.
+
+ Joins to addresses advertised with ADD_ADDR are not affected by this
+ value.
+
+ This is a per-namespace sysctl.
+
+ Default: 1
diff --git a/Documentation/networking/nf_conntrack-sysctl.rst b/Documentation/networking/nf_conntrack-sysctl.rst
index 11a9b76786cb..0467b30e4abe 100644
--- a/Documentation/networking/nf_conntrack-sysctl.rst
+++ b/Documentation/networking/nf_conntrack-sysctl.rst
@@ -177,3 +177,27 @@ nf_conntrack_gre_timeout_stream - INTEGER (seconds)
This extended timeout will be used in case there is a GRE stream
detected.
+
+nf_flowtable_tcp_timeout - INTEGER (seconds)
+ default 30
+
+ Control offload timeout for TCP connections.
+ TCP connections may be offloaded from nf conntrack to nf flow table.
+ Once aged, the connection is returned to nf conntrack with the TCP pickup timeout.
+
+nf_flowtable_tcp_pickup - INTEGER (seconds)
+ default 120
+
+ TCP connection timeout after being aged from nf flow table offload.
+
+nf_flowtable_udp_timeout - INTEGER (seconds)
+ default 30
+
+ Control offload timeout for UDP connections.
+ UDP connections may be offloaded from nf conntrack to nf flow table.
+ Once aged, the connection is returned to nf conntrack with the UDP pickup timeout.
+
+nf_flowtable_udp_pickup - INTEGER (seconds)
+ default 30
+
+ UDP connection timeout after being aged from nf flow table offload.
diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst
index 3f05d50ecd6e..571ba08386e7 100644
--- a/Documentation/networking/phy.rst
+++ b/Documentation/networking/phy.rst
@@ -292,6 +292,12 @@ Some of the interface modes are described below:
Note: due to legacy usage, some 10GBASE-R usage incorrectly makes
use of this definition.
+``PHY_INTERFACE_MODE_25GBASER``
+ This is the IEEE 802.3 PCS Clause 107 defined 25GBASE-R protocol.
+ The PCS is identical to 10GBASE-R, i.e. 64B/66B encoded
+ running 2.5 times as fast, giving a fixed bit rate of 25.78125 Gbaud.
+ Please refer to the IEEE standard for further information.
+
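The 25.78125 Gbaud figure follows from the 64B/66B encoding overhead. The short sketch below reproduces the arithmetic; it assumes nominal data rates of 10 and 25 Gb/s, and the function name is illustrative only.

```python
from fractions import Fraction

# 64B/66B sends 66 line bits for every 64 payload bits.
ENCODING_OVERHEAD = Fraction(66, 64)


def line_rate_gbaud(data_rate_gbps):
    """Return the 64B/66B line rate for a nominal data rate given in Gb/s."""
    return Fraction(data_rate_gbps) * ENCODING_OVERHEAD


rate_10g = line_rate_gbaud(10)
rate_25g = line_rate_gbaud(25)

# The ratio between the two line rates is the factor quoted in the text.
speedup = rate_25g / rate_10g

print(float(rate_10g), float(rate_25g), float(speedup))
```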
``PHY_INTERFACE_MODE_100BASEX``
This defines IEEE 802.3 Clause 24. The link operates at a fixed data
rate of 125Mbps using a 4B/5B encoding scheme, resulting in an underlying