summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2018-07-08tcp: remove redundant SOCK_DONE checksEric Dumazet1-9/+5
In both tcp_splice_read() and tcp_recvmsg(), we already test sock_flag(sk, SOCK_DONE) right before evaluating sk->sk_state, so "!sock_flag(sk, SOCK_DONE)" is always true. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08net: sched: Fix warnings from xchg() on RCU'd cookie pointer.David S. Miller1-1/+1
The kbuild test robot reports: >> net/sched/act_api.c:71:15: sparse: incorrect type in initializer (different address spaces) @@ expected struct tc_cookie [noderef] <asn:4>*__ret @@ got [noderef] <asn:4>*__ret @@ net/sched/act_api.c:71:15: expected struct tc_cookie [noderef] <asn:4>*__ret net/sched/act_api.c:71:15: got struct tc_cookie *new_cookie >> net/sched/act_api.c:71:13: sparse: incorrect type in assignment (different address spaces) @@ expected struct tc_cookie *old @@ got struct tc_cookie [noderef] <struct tc_cookie *old @@ net/sched/act_api.c:71:13: expected struct tc_cookie *old net/sched/act_api.c:71:13: got struct tc_cookie [noderef] <asn:4>*[assigned] __ret >> net/sched/act_api.c:132:48: sparse: dereference of noderef expression Handle this in the usual way by force casting away the __rcu annotation when we are using xchg() on it. Fixes: eec94fdb0480 ("net: sched: use rcu for action cookie update") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08net: sched: change action API to use array of pointers to actionsVlad Buslov2-54/+56
Act API used linked list to pass set of actions to functions. It is intrusive data structure that stores list nodes inside action structure itself, which means it is not safe to modify such list concurrently. However, action API doesn't use any linked list specific operations on this set of actions, so it can be safely refactored into plain pointer array. Refactor action API to use array of pointers to tc_actions instead of linked list. Change argument 'actions' type of exported action init, destroy and dump functions. Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08net: sched: atomically check-allocate actionVlad Buslov17-59/+213
Implement function that atomically checks if action exists and either takes reference to it, or allocates idr slot for action index to prevent concurrent allocations of actions with same index. Use EBUSY error pointer to indicate that idr slot is reserved. Implement cleanup helper function that removes temporary error pointer from idr. (in case of error between idr allocation and insertion of newly created action to specified index) Refactor all action init functions to insert new action to idr using this API. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08net: sched: use reference counting action initVlad Buslov1-18/+17
Change action API to assume that action init function always takes reference to action, even when overwriting existing action. This is necessary because action API continues to use action pointer after init function is done. At this point action becomes accessible for concurrent modifications, so user must always hold reference to it. Implement helper put list function to atomically release list of actions after action API init code is done using them. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08net: sched: don't release reference on action overwriteVlad Buslov17-58/+50
Return from action init function with reference to action taken, even when overwriting existing action. Action init API initializes its fourth argument (pointer to pointer to tc action) to either existing action with same index or newly created action. In case of existing index(and bind argument is zero), init function returns without incrementing action reference counter. Caller of action init then proceeds working with action, without actually holding reference to it. This means that action could be deleted concurrently. Change action init behavior to always take reference to action before returning successfully, in order to protect from concurrent deletion. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08net: sched: implement reference counted action releaseVlad Buslov2-23/+62
Implement helper delete function that uses new action ops 'delete', instead of destroying action directly. This is required so act API could delete actions by index, without holding any references to action that is being deleted. Implement function __tcf_action_put() that releases reference to action and frees it, if necessary. Refactor action deletion code to use new put function and not to rely on rtnl lock. Remove rtnl lock assertions that are no longer needed. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08net: sched: add 'delete' function to action opsVlad Buslov16-0/+136
Extend action ops with 'delete' function. Each action type to implements its own delete function that doesn't depend on rtnl lock. Implement delete function that is required to delete actions without holding rtnl lock. Use action API function that atomically deletes action only if it is still in action idr. This implementation prevents concurrent threads from deleting same action twice. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08net: sched: implement action API that deletes action by indexVlad Buslov1-0/+39
Implement new action API function that atomically finds and deletes action from idr by index. Intended to be used by lockless actions that do not rely on rtnl lock. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08net: sched: always take reference to actionVlad Buslov1-26/+20
Without rtnl lock protection it is no longer safe to use pointer to tc action without holding reference to it. (it can be destroyed concurrently) Remove unsafe action idr lookup function. Instead of it, implement safe tcf idr check function that atomically looks up action in idr and increments its reference and bind counters. Implement both action search and check using new safe function Reference taken by idr check is temporal and should not be accounted by userspace clients (both logically and to preserver current API behavior). Subtract temporal reference when dumping action to userspace using existing tca_get_fill function arguments. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08net: sched: implement unlocked action init APIVlad Buslov18-27/+46
Add additional 'rtnl_held' argument to act API init functions. It is required to implement actions that need to release rtnl lock before loading kernel module and reacquire if afterwards. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08net: sched: change type of reference and bind countersVlad Buslov17-42/+54
Change type of action reference counter to refcount_t. Change type of action bind counter to atomic_t. This type is used to allow decrementing bind counter without testing for 0 result. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08net: sched: use rcu for action cookie updateVlad Buslov1-14/+30
Implement functions to atomically update and free action cookie using rcu mechanism. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-08openvswitch: kernel datapath clone actionYifeng Sun2-0/+106
Add 'clone' action to kernel datapath by using existing functions. When actions within clone don't modify the current flow, the flow key is not cloned before executing clone actions. This is a follow up patch for this incomplete work: https://patchwork.ozlabs.org/patch/722096/ v1 -> v2: Refactor as advised by reviewer. Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com> Signed-off-by: Andy Zhou <azhou@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07tipc: extend link reset criteria for stale packet retransmissionJon Maloy1-19/+24
Currently a link is declared stale and reset if there has been 100 repeated attempts to retransmit the same packet. However, in certain infrastructures we see that packet (NACK) duplicates and delays may cause such retransmit attempts to occur at a high rate, so that the peer doesn't have a reasonable chance to acknowledge the reception before the 100-limit is hit. This may take much less than the stipulated link tolerance time, and despite that probe/probe replies otherwise go through as normal. We now extend the criteria for link reset to also being time based. I.e., we don't reset the link until the link tolerance time is passed AND we have made 100 retransmissions attempts. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07net/sched: flower: Add supprt for matching on QinQ vlan headersJianbo Liu1-14/+51
As support dissecting of QinQ inner and outer vlan headers, user can add rules to match on QinQ vlan headers. Signed-off-by: Jianbo Liu <jianbol@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07net/sched: flower: Dump the ethertype encapsulated in vlanJianbo Liu1-0/+4
Currently the encapsulated ethertype is not dumped as it's the same as TCA_FLOWER_KEY_ETH_TYPE keyvalue. But the dumping result is inconsistent with input, we add dumping it with TCA_FLOWER_KEY_VLAN_ETH_TYPE. Signed-off-by: Jianbo Liu <jianbol@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07net/flow_dissector: Add support for QinQ dissectionJianbo Liu1-15/+17
Dissect the QinQ packets to get both outer and inner vlan information, then store to the extended flow keys. Signed-off-by: Jianbo Liu <jianbol@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07net/sched: flower: Add support for matching on vlan ethertypeJianbo Liu1-2/+5
As flow dissector stores vlan ethertype, tc flower now can match on that. It is to make preparation for supporting QinQ. Signed-off-by: Jianbo Liu <jianbol@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07net/flow_dissector: Save vlan ethertype from headersJianbo Liu1-0/+2
Change vlan dissector key to save vlan tpid to support both 802.1Q and 802.1AD ethertype. Signed-off-by: Jianbo Liu <jianbol@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07rtnetlink: add rtnl_link_state check in rtnl_configure_linkRoopa Prabhu1-3/+6
rtnl_configure_link sets dev->rtnl_link_state to RTNL_LINK_INITIALIZED and unconditionally calls __dev_notify_flags to notify user-space of dev flags. current call sequence for rtnl_configure_link rtnetlink_newlink rtnl_link_ops->newlink rtnl_configure_link (unconditionally notifies userspace of default and new dev flags) If a newlink handler wants to call rtnl_configure_link early, we will end up with duplicate notifications to user-space. This patch fixes rtnl_configure_link to check rtnl_link_state and call __dev_notify_flags with gchanges = 0 if already RTNL_LINK_INITIALIZED. Later in the series, this patch will help the following sequence where a driver implementing newlink can call rtnl_configure_link to initialize the link early. makes the following call sequence work: rtnetlink_newlink rtnl_link_ops->newlink (vxlan) -> rtnl_configure_link (initializes link and notifies user-space of default dev flags) rtnl_configure_link (updates dev flags if requested by user ifm and notifies user-space of new dev flags) Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07ip: unconditionally set cork gso_sizeWillem de Bruijn2-4/+2
Now that ipc(6)->gso_size is correctly initialized in all callers of ip(6)_setup_cork, it is safe to unconditionally pass it to the cork. Link: http://lkml.kernel.org/r/20180619164752.143249-1-willemdebruijn.kernel@gmail.com Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07ip: remove tx_flags from ipcm_cookie and use same logic for v4 and v6Willem de Bruijn5-17/+10
skb_shinfo(skb)->tx_flags is derived from sk->sk_tsflags, possibly after modification by __sock_cmsg_send, by calling sock_tx_timestamp. The IPv4 and IPv6 paths do this conversion differently. In IPv4, the individual protocols that support tx timestamps call this function and store the result in ipc.tx_flags. In IPv6, sock_tx_timestamp is called in __ip6_append_data. There is no need to store both tx_flags and ts_flags in the cookie as one is derived from the other. Convert when setting up the cork and remove the redundant field. This is similar to IPv6, only have the conversion happen only once per datagram, in ip(6)_setup_cork. Also change __ip6_append_data to match __ip_append_data. Only update tskey if timestamping is enabled with OPT_ID. The SOCK_.. test is redundant: only valid protocols can have non-zero cork->tx_flags. After this change the IPv4 and IPv6 logic is the same. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07ipv6: fold sockcm_cookie into ipcm6_cookieWillem de Bruijn9-43/+27
ipcm_cookie includes sockcm_cookie. Do the same for ipcm6_cookie. This reduces the number of arguments that need to be passed around, applies ipcm6_init to all cookie fields at once and reduces code differentiation between ipv4 and ipv6. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07sock: sockc cookie initializerWillem de Bruijn2-7/+4
Initialize the cookie in one location to reduce code duplication and avoid bugs from inconsistent initialization, such as that fixed in commit 9887cba19978 ("ip: limit use of gso_size to udp"). Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07ipv6: ipcm6_cookie initializerWillem de Bruijn5-18/+6
Initialize the cookie in one location to reduce code duplication and avoid bugs from inconsistent initialization, such as that fixed in commit 9887cba19978 ("ip: limit use of gso_size to udp"). Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07ipv4: ipcm_cookie initializersWillem de Bruijn5-39/+6
Initialize the cookie in one location to reduce code duplication and avoid bugs from inconsistent initialization, such as that fixed in commit 9887cba19978 ("ip: limit use of gso_size to udp"). Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-06net: ipv6: listify ipv6_rcv() and ip6_rcv_finish()Edward Cree2-16/+116
Essentially the same as the ipv4 equivalents. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-06net: ipv4: fix list processing on L3 slave devicesEdward Cree1-8/+15
If we have an L3 master device, l3mdev_ip_rcv() will steal the skb, but we were returning NET_RX_SUCCESS from ip_rcv_finish_core() which meant that ip_list_rcv_finish() would keep it on the list. Instead let's move the l3mdev_ip_rcv() call into the caller, so that our response to a steal can be different in the single packet path (return NET_RX_SUCCESS) and the list path (forget this packet and continue). Fixes: 5fa12739a53d ("net: ipv4: listify ip_rcv_finish") Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05net: core: filter: mark expected switch fall-throughGustavo A. R. Silva1-0/+1
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Warning level 2 was used: -Wimplicit-fallthrough=2 Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05net: decnet: dn_nsp_in: mark expected switch fall-throughGustavo A. R. Silva1-0/+1
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05tipc: mark expected switch fall-throughsGustavo A. R. Silva2-0/+2
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Warning level 2 was used: -Wimplicit-fallthrough=2 Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05devlink: Add enable_sriov boolean generic parameterVasundhara Volam1-1/+5
enable_sriov - Enables Single-Root Input/Output Virtualization(SR-IOV) characteristic of the device. Reviewed-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05devlink: Add generic parameters internal_err_reset and max_macsMoshe Shemesh1-1/+13
Add 2 first generic parameters to devlink configuration parameters set: internal_err_reset - When set enables reset device on internal errors. max_macs - max number of MACs per ETH port. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05devlink: Add devlink notifications support for paramsMoshe Shemesh1-0/+50
Add devlink_param_notify() function to support devlink param notifications. Add notification call to devlink param set, register and unregister functions. Add devlink_param_value_changed() function to enable the driver notify devlink on value change. Driver should use this function after value was changed on any configuration mode part to driverinit. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05devlink: Add support for get/set driverinit valueMoshe Shemesh1-0/+77
"driverinit" configuration mode value is held by devlink to enable the driver query the value after reload. Two additional functions added to help the driver get/set the value from/to devlink: devlink_param_driverinit_value_set() and devlink_param_driverinit_value_get(). Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05devlink: Add param set commandMoshe Shemesh1-0/+134
Add param set command to set value for a parameter. Value can be set to any of the supported configuration modes. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05devlink: Add param get commandMoshe Shemesh1-0/+250
Add param get command which gets data per parameter. Option to dump the parameters data per device. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05devlink: Add devlink_param register and unregisterMoshe Shemesh1-0/+148
Define configuration parameters data structure. Add functions to register and unregister the driver supported configuration parameters table. For each parameter registered, the driver should fill all the parameter's fields. In case the only supported configuration mode is "driverinit" the parameter's get()/set() functions are not required and should be set to NULL, for any other configuration mode, these functions are required and should be set by the driver. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05net: limit each hash list length to MAX_GRO_SKBSLi RongQing1-33/+23
After commit 07d78363dcff ("net: Convert NAPI gro list into a small hash table.")' there is 8 hash buckets, which allows more flows to be held for merging. but MAX_GRO_SKBS, the total held skb for merging, is 8 skb still, limit the hash table performance. keep MAX_GRO_SKBS as 8 skb, but limit each hash list length to 8 skb, not the total 8 skb Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05net: ipv4: fix drop handling in ip_list_rcv() and ip_list_rcv_finish()Edward Cree1-5/+11
Since callees (ip_rcv_core() and ip_rcv_finish_core()) might free or steal the skb, we can't use the list_cut_before() method; we can't even do a list_del(&skb->list) in the drop case, because skb might have already been freed and reused. So instead, take each skb off the source list before processing, and add it to the sublist afterwards if it wasn't freed or stolen. Fixes: 5fa12739a53d net: ipv4: listify ip_rcv_finish Fixes: 17266ee93984 net: ipv4: listified version of ip_rcv Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net/sched: Make etf report drops on error_queueJesus Sanchez-Palencia2-2/+37
Use the socket error queue for reporting dropped packets if the socket has enabled that feature through the SO_TXTIME API. Packets are dropped either on enqueue() if they aren't accepted by the qdisc or on dequeue() if the system misses their deadline. Those are reported as different errors so applications can react accordingly. Userspace can retrieve the errors through the socket error queue and the corresponding cmsg interfaces. A struct sock_extended_err* is used for returning the error data, and the packet's timestamp can be retrieved by adding both ee_data and ee_info fields as e.g.: ((__u64) serr->ee_data << 32) + serr->ee_info This feature is disabled by default and must be explicitly enabled by applications. Enabling it can bring some overhead for the Tx cycles of the application. Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net/sched: Add HW offloading capability to ETFJesus Sanchez-Palencia1-1/+70
Add infra so etf qdisc supports HW offload of time-based transmission. For hw offload, the time sorted list is still used, so packets are dequeued always in order of txtime. Example: $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0 $ tc qdisc add dev enp2s0 parent 100:1 etf offload delta 100000 \ clockid CLOCK_REALTIME In this example, the Qdisc will use HW offload for the control of the transmission time through the network adapter. The hrtimer used for packets scheduling inside the qdisc will use the clockid CLOCK_REALTIME as reference and packets leave the Qdisc "delta" (100000) nanoseconds before their transmission time. Because this will be using HW offload and since dynamic clocks are not supported by the hrtimer, the system clock and the PHC clock must be synchronized for this mode to behave as expected. Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net/sched: Introduce the ETF QdiscVinicius Costa Gomes3-0/+396
The ETF (Earliest TxTime First) qdisc uses the information added earlier in this series (the socket option SO_TXTIME and the new role of sk_buff->tstamp) to schedule packets transmission based on absolute time. For some workloads, just bandwidth enforcement is not enough, and precise control of the transmission of packets is necessary. Example: $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0 $ tc qdisc add dev enp2s0 parent 100:1 etf delta 100000 \ clockid CLOCK_TAI In this example, the Qdisc will provide SW best-effort for the control of the transmission time to the network adapter, the time stamp in the socket will be in reference to the clockid CLOCK_TAI and packets will leave the qdisc "delta" (100000) nanoseconds before its transmission time. The ETF qdisc will buffer packets sorted by their txtime. It will drop packets on enqueue() if their skbuff clockid does not match the clock reference of the Qdisc. Moreover, on dequeue(), a packet will be dropped if it expires while being enqueued. The qdisc also supports the SO_TXTIME deadline mode. For this mode, it will dequeue a packet as soon as possible and change the skb timestamp to 'now' during etf_dequeue(). Note that both the qdisc's and the SO_TXTIME ABIs allow for a clockid to be configured, but it's been decided that usage of CLOCK_TAI should be enforced until we decide to allow for other clockids to be used. The rationale here is that PTP times are usually in the TAI scale, thus no other clocks should be necessary. For now, the qdisc will return EINVAL if any clocks other than CLOCK_TAI are used. Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net/sched: Allow creating a Qdisc watchdog with other clocksVinicius Costa Gomes1-2/+9
This adds 'qdisc_watchdog_init_clockid()' that allows a clockid to be passed, this allows other time references to be used when scheduling the Qdisc to run. Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net: packet: Hook into time based transmission.Richard Cochran1-0/+6
For raw layer-2 packets, copy the desired future transmit time from the CMSG cookie into the skb. Signed-off-by: Richard Cochran <rcochran@linutronix.de> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net: ipv6: Hook into time based transmissionJesus Sanchez-Palencia3-5/+14
Add a struct sockcm_cookie parameter to ip6_setup_cork() so we can easily re-use the transmit_time field from struct inet_cork for most paths, by copying the timestamp from the CMSG cookie. This is later copied into the skb during __ip6_make_skb(). For the raw fast path, also pass the sockcm_cookie as a parameter so we can just perform the copy at rawv6_send_hdrinc() directly. Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net: ipv4: Hook into time based transmissionJesus Sanchez-Palencia5-0/+9
Add a transmit_time field to struct inet_cork, then copy the timestamp from the CMSG cookie at ip_setup_cork() so we can safely copy it into the skb later during __ip_make_skb(). For the raw fast path, just perform the copy at raw_send_hdrinc(). Signed-off-by: Richard Cochran <rcochran@linutronix.de> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net: Add a new socket option for a future transmit time.Richard Cochran1-0/+35
This patch introduces SO_TXTIME. User space enables this option in order to pass a desired future transmit time in a CMSG when calling sendmsg(2). The argument to this socket option is a 8-bytes long struct provided by the uapi header net_tstamp.h defined as: struct sock_txtime { clockid_t clockid; u32 flags; }; Note that new fields were added to struct sock by filling a 2-bytes hole found in the struct. For that reason, neither the struct size or number of cachelines were altered. Signed-off-by: Richard Cochran <rcochran@linutronix.de> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net: Clear skb->tstamp only on the forwarding pathJesus Sanchez-Palencia1-1/+1
This is done in preparation for the upcoming time based transmission patchset. Now that skb->tstamp will be used to hold packet's txtime, we must ensure that it is being cleared when traversing namespaces. Also, doing that from skb_scrub_packet() before the early return would break our feature when tunnels are used. Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>