path: root/net
Age | Commit message | Author | Files | Lines
2021-08-09 | SUNRPC: Fix potential memory corruption | Trond Myklebust | 1 | -2/+4
We really should not call rpc_wake_up_queued_task_set_status() with xprt->snd_task as an argument unless we are certain that it actually is an rpc_task. Fixes: 0445f92c5d53 ("SUNRPC: Fix disconnection races") Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2021-08-09 | SUNRPC: Convert rpc_client refcount to use refcount_t | Trond Myklebust | 4 | -15/+13
There are now tools in the refcount library that allow us to convert the client shutdown code. Reported-by: Xiyu Yang <[email protected]> Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
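The semantics that a refcount-style counter offers for shutdown paths can be sketched in userspace with C11 atomics. This is only an illustrative model, not the kernel's refcount_t implementation: `ref_t`, `ref_init()`, `ref_inc_not_zero()` and `ref_dec_and_test()` are hypothetical names chosen for this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a reference counter. */
typedef struct {
	atomic_uint count;
} ref_t;

static void ref_init(ref_t *r, unsigned int n)
{
	atomic_init(&r->count, n);
}

/* Take a reference only if the object is still live (count > 0). */
static bool ref_inc_not_zero(ref_t *r)
{
	unsigned int old = atomic_load(&r->count);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if the caller dropped the last one. */
static bool ref_dec_and_test(ref_t *r)
{
	unsigned int old = atomic_fetch_sub(&r->count, 1);

	assert(old > 0); /* dropping below zero would be a bug */
	return old == 1;
}
```

A shutdown path built on such helpers frees the object only when `ref_dec_and_test()` returns true, and lookups that race with shutdown fail cleanly because `ref_inc_not_zero()` refuses to revive a counter that has already reached zero.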
2021-08-09 | xprtrdma: Eliminate rpcrdma_post_sends() | Chuck Lever | 4 | -18/+2
Clean up. Now that there is only one registration mode, there is only one target "post_send" method: frwr_send(). rpcrdma_post_sends() no longer adds much value, especially since all of its call sites ignore the return code value except to check if it's non-zero. Just have them call frwr_send() directly instead. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2021-08-09 | xprtrdma: Add an xprtrdma_post_send_err tracepoint | Chuck Lever | 1 | -1/+5
Unlike xprtrdma_post_send(), this one can be left enabled all the time, and should almost never fire. But we do want to know about immediate errors when they happen. Note that there is already a similar post_linv_err tracepoint. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2021-08-09 | xprtrdma: Add xprtrdma_post_recvs_err() tracepoint | Chuck Lever | 1 | -1/+2
In the vast majority of cases, rc=0. Don't record that in the post_recvs tracepoint. Instead, add a separate tracepoint that can be left enabled all the time to capture the very rare immediate errors returned by ib_post_recv(). Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2021-08-09 | xprtrdma: Put rpcrdma_reps before waking the tear-down completion | Chuck Lever | 1 | -5/+5
Ensure the tear-down completion is awoken only /after/ we've stopped fiddling with rpcrdma_rep objects in rpcrdma_post_recvs(). Fixes: 15788d1d1077 ("xprtrdma: Do not refresh Receive Queue while it is draining") Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2021-08-09 | xprtrdma: Disconnect after an ib_post_send() immediate error | Chuck Lever | 3 | -1/+10
ib_post_send() does not disconnect the QP when it returns an immediate error. Thus, the code that posts LocalInv has to explicitly disconnect after an immediate error. This is just like the frwr_send() callers handle it. If a disconnect isn't done here, the transport deadlocks. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2021-08-09 | SUNRPC: Unset RPC_TASK_NO_RETRANS_TIMEOUT for NULL RPCs | Chuck Lever | 1 | -1/+14
In some rare failure modes, the server is actually reading the transport, but then just dropping the requests on the floor. TCP_USER_TIMEOUT cannot detect that case. Prevent such a stuck server from pinning client resources indefinitely by ensuring that certain idempotent requests (such as NULL) can time out even if the connection is still operational. Otherwise rpc_bind_new_program(), gss_destroy_cred(), or rpc_clnt_test_and_add_xprt() can wait forever. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2021-08-09 | SUNRPC: Refactor rpc_ping() | Chuck Lever | 1 | -11/+13
Make it use the rpc_null_call_helper() so that it can share the new rpc_call_ops structure to be introduced in the next patch. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2021-08-09 | devlink: Fix port_type_set function pointer check | Leon Romanovsky | 1 | -1/+1
Fix a typo when checking existence of port_type_set function pointer. Fixes: 82564f6c706a ("devlink: Simplify devlink port API calls") Reported-by: kernel test robot <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-09 | net: sched: act_mirred: Reset ct info when mirror/redirect skb | Hangbin Liu | 1 | -0/+3
When mirroring or redirecting an skb to a different port, the ct info should be reset for reclassification; otherwise the packets will match unexpected rules. For example, with the following topology and commands:

    -----------
              |
       veth0 -+-------
              |
       veth1 -+-------
              |
   ------------

 tc qdisc add dev veth0 clsact
 # The same with "action mirred egress mirror dev veth1" or "action mirred ingress redirect dev veth1"
 tc filter add dev veth0 egress chain 1 protocol ip flower ct_state +trk action mirred ingress mirror dev veth1
 tc filter add dev veth0 egress chain 0 protocol ip flower ct_state -inv action ct commit action goto chain 1
 tc qdisc add dev veth1 clsact
 tc filter add dev veth1 ingress chain 0 protocol ip flower ct_state +trk action drop

 ping <remote ip via veth0> &
 tc -s filter show dev veth1 ingress

With the command 'tc -s filter show', we can see that the packets were dropped on veth1.

Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct") Signed-off-by: Roi Dayan <[email protected]> Signed-off-by: Hangbin Liu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-09 | net/smc: Allow SMC-D 1MB DMB allocations | Stefan Raspl | 1 | -15/+16
Commit a3fe3d01bd0d7 ("net/smc: introduce sg-logic for RMBs") introduced a restriction for RMB allocations as used by SMC-R. However, SMC-D does not use scatter-gather lists to back its DMBs, yet it was still limited by this restriction. This patch exempts SMC-D from it, but limits allocations to the maximum RMB/DMB size respectively. Signed-off-by: Stefan Raspl <[email protected]> Signed-off-by: Guvenc Gulce <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-09 | net/smc: Correct smc link connection counter in case of smc client | Guvenc Gulce | 3 | -3/+5
SMC clients may be assigned to a different link after the initial connection between two peers was established. In such a case, the connection counter was not correctly set. Update the connection counter correctly when a smc client connection is assigned to a different smc link. Fixes: 07d51580ff65 ("net/smc: Add connection counters for links") Signed-off-by: Guvenc Gulce <[email protected]> Tested-by: Karsten Graul <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-09 | net/smc: fix wait on already cleared link | Karsten Graul | 4 | -7/+33
There can be a race between the waiters for a tx work request buffer and the link down processing that finally clears the link. Although all waiters are woken up before the link is cleared there might be waiters which did not yet get back control and are still waiting. This results in an access to a cleared wait queue head. Fix this by introducing atomic reference counting around the wait calls, and wait with the link clear processing until all waiters have finished. Move the work request layer related calls into smc_wr.c and set the link state to INACTIVE before calling smcr_link_clear() in smc_llc_srv_add_link(). Fixes: 15e1b99aadfb ("net/smc: no WR buffer wait for terminating link group") Signed-off-by: Karsten Graul <[email protected]> Signed-off-by: Guvenc Gulce <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-09 | devlink: Set device as early as possible | Leon Romanovsky | 2 | -10/+10
All kernel devlink implementations call devlink_alloc() during the initialization routine for a specific device, which is used later as the parent device for devlink_register(). Such late device assignment requires us to call devlink_register() before setting other parameters, but that call opens devlink to the world and makes it accessible to netlink users. Any attempt to move devlink_register() to be the last call generates the following error due to access to the devlink->dev pointer.

[ 8.758862] devlink_nl_param_fill+0x2e8/0xe50
[ 8.760305] devlink_param_notify+0x6d/0x180
[ 8.760435] __devlink_params_register+0x2f1/0x670
[ 8.760558] devlink_params_register+0x1e/0x20

The simple API change of setting the devlink device in devlink_alloc() instead of devlink_register() fixes all of the above and ensures that everything is already set prior to the call to devlink_register(). Signed-off-by: Leon Romanovsky <[email protected]> Reviewed-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-09 | net/iucv: Replace deprecated CPU-hotplug functions. | Sebastian Andrzej Siewior | 1 | -9/+9
The functions get_online_cpus() and put_online_cpus() have been deprecated during the CPU hotplug rework. They map directly to cpus_read_lock() and cpus_read_unlock(). Replace deprecated CPU-hotplug functions with the official version. The behavior remains unchanged. Cc: Julian Wiedmann <[email protected]> Cc: Karsten Graul <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Jakub Kicinski <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Julian Wiedmann <[email protected]> Signed-off-by: Karsten Graul <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-09 | net/iucv: get rid of register asm usage | Heiko Carstens | 1 | -20/+22
Using register asm statements has proven to be very error prone, especially when using code instrumentation where gcc may add function calls, which clobber register contents in an unexpected way. Therefore get rid of register asm statements in iucv code, even though there is currently nothing wrong with it. This way we know for sure that the above mentioned bug class won't be introduced here. Acked-by: Karsten Graul <[email protected]> Signed-off-by: Heiko Carstens <[email protected]> Signed-off-by: Karsten Graul <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-09 | net/af_iucv: remove wrappers around iucv (de-)registration | Julian Wiedmann | 1 | -13/+3
These wrappers are just unnecessary obfuscation. Signed-off-by: Julian Wiedmann <[email protected]> Signed-off-by: Karsten Graul <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-09 | net/af_iucv: clean up a try_then_request_module() | Julian Wiedmann | 1 | -11/+3
Use IS_ENABLED(CONFIG_IUCV) to determine whether the iucv_if symbol is available, and let depmod deal with the module dependency. This was introduced back with commit 6fcd61f7bf5d ("af_iucv: use loadable iucv interface"). And to avoid sprinkling IS_ENABLED() over all the code, we're keeping the indirection through pr_iucv->...(). Signed-off-by: Julian Wiedmann <[email protected]> Signed-off-by: Karsten Graul <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-09 | net/af_iucv: support drop monitoring | Julian Wiedmann | 1 | -20/+22
Change the good paths to use consume_skb() instead of kfree_skb(). This avoids flooding dropwatch with false-positives. Signed-off-by: Julian Wiedmann <[email protected]> Signed-off-by: Karsten Graul <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-09 | page_pool: mask the page->signature before the checking | Yunsheng Lin | 1 | -1/+9
As mentioned in commit c07aea3ef4d4 ("mm: add a signature in struct page"): "The page->signature field is aliased to page->lru.next and page->compound_head." And as the comment in page_is_pfmemalloc() says: "lru.next has bit 1 set if the page is allocated from the pfmemalloc reserves. Callers may simply overwrite it if they do not need to preserve that information." The page->signature is OR'ed with PP_SIGNATURE when a page is allocated in page pool, see __page_pool_alloc_pages_slow(), and page->signature is compared directly with PP_SIGNATURE in page_pool_return_skb_page(), which might cause a resource leak for a page from page pool if bit 1 of lru.next is set for a pfmemalloc page. What happens here is that the original page->signature is OR'ed with PP_SIGNATURE after the allocation in order to preserve any existing bits (such as bit 1, used to indicate a pfmemalloc page), so when those bits are present, the page is not considered to be from page pool and its DMA mapping is left stale. As bit 0 is used for page->compound_head, mask both bits 0 and 1 before the check in page_pool_return_skb_page(). The pfmemalloc pages are then returned to the page allocator after their DMA mapping has been cleaned up. Fixes: 6a5bcd84e886 ("page_pool: Allow drivers to hint on SKB recycling") Reviewed-by: Ilias Apalodimas <[email protected]> Signed-off-by: Yunsheng Lin <[email protected]> Signed-off-by: David S. Miller <[email protected]>
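The effect of masking the two low bits before the comparison can be shown with plain integers. This is a minimal sketch, not the kernel code: `PP_SIGNATURE_DEMO` is an arbitrary illustrative value (the real PP_SIGNATURE is defined by the kernel), and `mark_page()`, `check_unmasked()` and `check_masked()` are hypothetical helpers.

```c
#include <stdbool.h>

/* Illustrative value only; not the kernel's PP_SIGNATURE. */
#define PP_SIGNATURE_DEMO 0x40UL
/* Bit 0 is used for compound_head, bit 1 marks a pfmemalloc page. */
#define LOW_BITS_MASK     0x3UL

/* Allocation side: OR the signature in, preserving existing bits. */
static unsigned long mark_page(unsigned long sig)
{
	return sig | PP_SIGNATURE_DEMO;
}

/* Direct comparison: fails as soon as a preserved low bit is set. */
static bool check_unmasked(unsigned long sig)
{
	return sig == PP_SIGNATURE_DEMO;
}

/* Masked comparison: ignores bits 0 and 1 before comparing. */
static bool check_masked(unsigned long sig)
{
	return (sig & ~LOW_BITS_MASK) == PP_SIGNATURE_DEMO;
}
```

With bit 1 set before marking, the direct comparison no longer recognizes the page as belonging to the pool, while the masked comparison still does.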
2021-08-09 | dccp: add do-while-0 stubs for dccp_pr_debug macros | Randy Dunlap | 1 | -3/+3
GCC complains about empty macros in an 'if' statement, so convert them to 'do {} while (0)' macros. Fixes these build warnings:

net/dccp/output.c: In function 'dccp_xmit_packet':
../net/dccp/output.c:283:71: warning: suggest braces around empty body in an 'if' statement [-Wempty-body]
  283 |     dccp_pr_debug("transmit_skb() returned err=%d\n", err);
net/dccp/ackvec.c: In function 'dccp_ackvec_update_old':
../net/dccp/ackvec.c:163:80: warning: suggest braces around empty body in an 'else' statement [-Wempty-body]
  163 |         (unsigned long long)seqno, state);

Fixes: dc841e30eaea ("dccp: Extend CCID packet dequeueing interface") Fixes: 380240864451 ("dccp ccid-2: Update code for the Ack Vector input/registration routine") Signed-off-by: Randy Dunlap <[email protected]> Cc: [email protected] Cc: "David S. Miller" <[email protected]> Cc: Jakub Kicinski <[email protected]> Cc: Gerrit Renker <[email protected]> Signed-off-by: David S. Miller <[email protected]>
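The pattern can be illustrated outside the kernel with a small sketch. `demo_pr_debug()`, `DEMO_DEBUG` and `transmit()` are hypothetical names for this example and are not the dccp macros themselves.

```c
#include <stdio.h>

#define DEMO_DEBUG 0

#if DEMO_DEBUG
#define demo_pr_debug(...) fprintf(stderr, __VA_ARGS__)
#else
/*
 * Expands to a complete (empty) statement rather than to nothing, so
 * "if (x) demo_pr_debug(...);" keeps a real body and -Wempty-body
 * has nothing to complain about.
 */
#define demo_pr_debug(...) do { } while (0)
#endif

static int transmit(int err)
{
	if (err)
		demo_pr_debug("transmit returned err=%d\n", err);
	else
		demo_pr_debug("transmit ok\n");
	return err;
}
```

Had the disabled variant been defined as an empty macro, both branches above would reduce to a bare `;`, which is the construct the warnings quoted above point at.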
2021-08-09 | net: dsa: avoid fast ageing twice when port leaves a bridge | Vladimir Oltean | 1 | -1/+3
Drivers that support both the toggling of address learning and dynamic FDB flushing (mv88e6xxx, b53, sja1105) currently need to fast-age a port twice when it leaves a bridge:
- once, when del_nbp() calls br_stp_disable_port(), which puts the port in the BLOCKING state
- a second time, when dsa_port_switchdev_unsync_attrs() calls dsa_port_clear_brport_flags(), which disables address learning
The knee-jerk reaction might be to say "dsa_port_clear_brport_flags does not need to fast-age the port at all", but the thing is, we still need both code paths to flush the dynamic FDB entries in different situations. When a DSA switch port leaves a bonding/team interface that is (still) a bridge port, no del_nbp() will be called, so we rely on the dsa_port_clear_brport_flags() function to restore proper standalone port functionality with address learning disabled. So the solution is just to avoid doing double the work when both code paths are called in series. Luckily, DSA already caches the STP port state, so we can skip flushing the dynamic FDB when we disable address learning and the STP state is one where no address learning takes place at all. Under that condition, not flushing the FDB is safe because there should not be any dynamic FDB entry at all (they were flushed during the transition towards that state, and none were learned in the meantime). Signed-off-by: Vladimir Oltean <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-09 | net: dsa: still fast-age ports joining a bridge if they can't configure learning | Vladimir Oltean | 1 | -1/+17
Commit 39f32101543b ("net: dsa: don't fast age standalone ports") assumed that all standalone ports disable address learning, but if the switch driver implements .port_fast_age but not .port_bridge_flags (like ksz9477, ksz8795, lantiq_gswip, lan9303), then that might not actually be true. So whereas before, the bridge temporarily walking us through the BLOCKING STP state meant that the standalone ports had a checkpoint to flush their baggage and start fresh when they join a bridge, after that commit they no longer do. Restore the old behavior for these drivers by checking if the switch can toggle address learning. If it can't, disregard the "do_fast_age" argument and unconditionally perform fast ageing on STP state changes. Signed-off-by: Vladimir Oltean <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-09 | netfilter: x_tables: never register tables by default | Florian Westphal | 12 | -132/+201
For historical reasons x_tables still register tables by default in the initial namespace. Only newly created net namespaces add the hook on demand. This means that the init_net always pays hook cost, even if no filtering rules are added (e.g. only used inside a single netns). Note that the hooks are added even when 'iptables -L' is called. This is because there is no way to tell 'iptables -A' and 'iptables -L' apart at kernel level. The only solution would be to register the table, but delay hook registration until the first rule gets added (or policy gets changed). That however means that counters are not hooked either, so 'iptables -L' would always show 0-counters even when traffic is flowing which might be unexpected. This keeps table and hook registration consistent with what is already done in non-init netns: first iptables(-save) invocation registers both table and hooks. This applies the same solution adopted for ebtables. All tables register a template that contains the l3 family, the name and a constructor function that is called when the initial table has to be added. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2021-08-09 | Merge 5.14-rc5 into tty-next | Greg Kroah-Hartman | 68 | -277/+655
We need the tty/serial fixes in here as well. Signed-off-by: Greg Kroah-Hartman <[email protected]>
2021-08-08 | net: dsa: flush the dynamic FDB of the software bridge when fast ageing a port | Vladimir Oltean | 1 | -0/+20
Currently, when DSA performs fast ageing on a port, 'bridge fdb' shows us that the 'self' entries (corresponding to the hardware bridge, as printed by dsa_slave_fdb_dump) are deleted, but the 'master' entries (corresponding to the software bridge) aren't. Indeed, searching through the bridge driver, neither the brport_attr_learning handler nor the IFLA_BRPORT_LEARNING handler call br_fdb_delete_by_port. However, br_stp_disable_port does, which is one of the paths which DSA uses to trigger a fast ageing process anyway. There is, however, one other very promising caller of br_fdb_delete_by_port, and that is the bridge driver's handler of the SWITCHDEV_FDB_FLUSH_TO_BRIDGE atomic notifier. Currently the s390/qeth HiperSockets card driver is the only user of this. I can't say I understand that driver's architecture or interaction with the bridge, but it appears to not be a switchdev driver in the traditional sense of the word. Nonetheless, the mechanism it provides is a useful way for DSA to express the fact that it performs fast ageing too, in a way that does not change the existing behavior for other drivers. Cc: Alexandra Winter <[email protected]> Cc: Julian Wiedmann <[email protected]> Cc: Roopa Prabhu <[email protected]> Cc: Nikolay Aleksandrov <[email protected]> Signed-off-by: Vladimir Oltean <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-08 | net: dsa: don't fast age bridge ports with learning turned off | Vladimir Oltean | 1 | -1/+1
On topology changes, stations that were dynamically learned on ports that are no longer part of the active topology must be flushed - this is described by clause "17.11 Updating learned station location information" of IEEE 802.1D-2004. However, when address learning on the bridge port is turned off in the first place, there is nothing to flush, so skip a potentially expensive operation. We can finally do this now since DSA is aware of the learning state of its bridged ports. Signed-off-by: Vladimir Oltean <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-08 | net: dsa: centralize fast ageing when address learning is turned off | Vladimir Oltean | 2 | -5/+32
Currently DSA leaves it down to device drivers to fast age the FDB on a port when address learning is disabled on it. There are 2 reasons for doing that in the first place:
- when address learning is disabled by user space, through IFLA_BRPORT_LEARNING or the brport_attr_learning sysfs, what user space typically wants to achieve is to operate in a mode with no dynamic FDB entry on that port. But if the port is already up, some addresses might have been already learned on it, and it seems silly to wait for 5 minutes for them to expire until something useful can be done.
- when a port leaves a bridge and becomes standalone, DSA turns off address learning on it. This also has the nice side effect of flushing the dynamically learned bridge FDB entries on it, which is a good idea because standalone ports should not have bridge FDB entries on them.
We let drivers manage fast ageing under this condition because if DSA were to do it, it would need to track each port's learning state, and act upon the transition, which it currently doesn't. But there are 2 reasons why doing it is better after all:
- drivers might get it wrong and not do it (see b53_port_set_learning)
- we would like to flush the dynamic entries from the software bridge too, and letting drivers do that would be another pain point
So track the port learning state and trigger a fast age process automatically within DSA. Signed-off-by: Vladimir Oltean <[email protected]> Signed-off-by: David S. Miller <[email protected]>
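The idea of tracking the learning state centrally and acting on the on-to-off transition can be sketched as follows. This is a simplified model under stated assumptions, not DSA's code: `struct demo_port`, `demo_fast_age()` and `demo_set_learning()` are hypothetical, and the FDB is reduced to an entry counter.

```c
#include <stdbool.h>

/* Hypothetical per-port state kept by the core instead of by drivers. */
struct demo_port {
	bool learning;                    /* cached address-learning state */
	unsigned int dynamic_fdb_entries; /* stand-in for the dynamic FDB  */
	unsigned int flush_count;         /* how often we fast-aged        */
};

/* Stand-in for a fast age: drop all dynamically learned entries. */
static void demo_fast_age(struct demo_port *p)
{
	p->dynamic_fdb_entries = 0;
	p->flush_count++;
}

/* Update the cached state and flush only on the on -> off transition. */
static void demo_set_learning(struct demo_port *p, bool learning)
{
	bool was_learning = p->learning;

	p->learning = learning;
	if (was_learning && !learning)
		demo_fast_age(p);
}
```

Because the previous state is cached, repeated requests to disable learning on a port that is already not learning do not trigger another flush.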
2021-08-08 | batman-adv: Drop NULL check before dropping references | Sven Eckelmann | 19 | -337/+169
The check whether a batman-adv related object is NULL is now done directly in the batadv_*_put functions. It is no longer necessary to perform this check outside these functions. The changes were generated using a coccinelle semantic patch:

  @@
  expression E;
  @@
  - if (likely(E != NULL))
  (
  batadv_backbone_gw_put
  |
  batadv_claim_put
  |
  batadv_dat_entry_put
  |
  batadv_gw_node_put
  |
  batadv_hardif_neigh_put
  |
  batadv_hardif_put
  |
  batadv_nc_node_put
  |
  batadv_nc_path_put
  |
  batadv_neigh_ifinfo_put
  |
  batadv_neigh_node_put
  |
  batadv_orig_ifinfo_put
  |
  batadv_orig_node_put
  |
  batadv_orig_node_vlan_put
  |
  batadv_softif_vlan_put
  |
  batadv_tp_vars_put
  |
  batadv_tt_global_entry_put
  |
  batadv_tt_local_entry_put
  |
  batadv_tt_orig_list_entry_put
  |
  batadv_tt_req_node_put
  |
  batadv_tvlv_container_put
  |
  batadv_tvlv_handler_put
  )(E);

Signed-off-by: Sven Eckelmann <[email protected]> Signed-off-by: Simon Wunderlich <[email protected]>
2021-08-08 | batman-adv: Check ptr for NULL before reducing its refcnt | Sven Eckelmann | 14 | -113/+181
The commit b37a46683739 ("netdevice: add the case if dev is NULL") changed how the NULL check for net_devices has to be handled when trying to reduce their reference counter. Before this commit, it was the responsibility of the caller to check whether the object is NULL or not. But it was changed to behave more like kfree. Now the callee has to handle the NULL case. The batman-adv code was scanned via coccinelle for similar places. These were changed to use the paradigm

  @@
  identifier E, T, R, C;
  identifier put;
  @@
   void put(struct T *E)
   {
  +	if (!E)
  +		return;
  	kref_put(&E->C, R);
   }

Functions which were used in other source files were moved to the header to allow the compiler to inline the NULL check and the kref_put call. Signed-off-by: Sven Eckelmann <[email protected]> Signed-off-by: Simon Wunderlich <[email protected]>
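A kfree-like put function that tolerates NULL can be modelled in userspace as shown here. The sketch is illustrative only: `struct demo_node`, `demo_node_new()` and `demo_node_put()` are hypothetical, and a plain non-atomic counter stands in for kref, so it is not safe for concurrent use.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical refcounted object; 'released' lets callers observe the free. */
struct demo_node {
	unsigned int refcount;
	int *released;
};

static struct demo_node *demo_node_new(int *released)
{
	struct demo_node *n = malloc(sizeof(*n));

	if (!n)
		return NULL;
	n->refcount = 1;
	n->released = released;
	return n;
}

static void demo_node_release(struct demo_node *n)
{
	if (n->released)
		*n->released = 1;
	free(n);
}

/* The callee handles NULL, so callers need no check of their own. */
static void demo_node_put(struct demo_node *n)
{
	if (!n)
		return;

	if (--n->refcount == 0)
		demo_node_release(n);
}
```

With the check inside the put function, error paths can call `demo_node_put()` unconditionally on pointers that may never have been assigned.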
2021-08-08 | batman-adv: Switch to kstrtox.h for kstrtou64 | Sven Eckelmann | 1 | -1/+1
The commit 4c52729377ea ("kernel.h: split out kstrtox() and simple_strtox() to a separate header") moved the kstrtou64 function to a new header called linux/kstrtox.h. Signed-off-by: Sven Eckelmann <[email protected]> Signed-off-by: Simon Wunderlich <[email protected]>
2021-08-08 | batman-adv: Start new development cycle | Simon Wunderlich | 1 | -1/+1
This version will contain all the (major or even only minor) changes for Linux 5.15. The version number isn't a semantic version number with major and minor information. It is just encoding the year of the expected publishing as Linux -rc1 and the number of published versions this year (starting at 0). Signed-off-by: Simon Wunderlich <[email protected]>
2021-08-08 | devlink: Simplify devlink port API calls | Leon Romanovsky | 1 | -48/+47
Devlink port already has pointer to the devlink instance and all API calls that forward these devlink ports to the drivers perform same "devlink_port->devlink" assignment before actual call. This patch removes useless parameter and allows us in the future to create specific devlink_port_ops to manage user space access with reliable ops assignment. Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-08 | net: dsa: don't fast age standalone ports | Vladimir Oltean | 3 | -10/+14
DSA drives the procedure to flush dynamic FDB entries from a port based on the change of STP state: whenever we go from a state where address learning is enabled (LEARNING, FORWARDING) to a state where it isn't (LISTENING, BLOCKING, DISABLED), we need to flush the existing dynamic entries. However, there are cases when this is not needed. Internally, when a DSA switch interface is not under a bridge, DSA still keeps it in the "FORWARDING" STP state. And when that interface joins a bridge, the bridge will meticulously iterate that port through all STP states, starting with BLOCKING and ending with FORWARDING. Because there is a state transition from the standalone version of FORWARDING into the temporary BLOCKING bridge port state, DSA calls the fast age procedure. Since commit 5e38c15856e9 ("net: dsa: configure better brport flags when ports leave the bridge"), DSA asks standalone ports to disable address learning. Therefore, there can be no dynamic FDB entries on a standalone port. Therefore, it does not make sense to flush dynamic FDB entries on one. Signed-off-by: Vladimir Oltean <[email protected]> Signed-off-by: David S. Miller <[email protected]>
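The flush rule described above can be written as a small predicate. This is a sketch of the stated rule only, not DSA's implementation: the enum, `stp_state_learns()` and `needs_fast_age()` are hypothetical names, and the `do_fast_age` flag models a caller that opts out, as a standalone port joining a bridge would.

```c
#include <stdbool.h>

/* Hypothetical STP port states for this sketch. */
enum stp_state {
	STP_DISABLED,
	STP_LISTENING,
	STP_LEARNING,
	STP_FORWARDING,
	STP_BLOCKING,
};

/* Address learning is enabled only in LEARNING and FORWARDING. */
static bool stp_state_learns(enum stp_state s)
{
	return s == STP_LEARNING || s == STP_FORWARDING;
}

/*
 * Flush dynamic FDB entries only when moving from a learning state
 * to a non-learning one, and only if the caller asked for it.
 */
static bool needs_fast_age(enum stp_state old_state,
			   enum stp_state new_state, bool do_fast_age)
{
	return do_fast_age &&
	       stp_state_learns(old_state) &&
	       !stp_state_learns(new_state);
}
```

A standalone port that is nominally FORWARDING and then enters BLOCKING on joining a bridge would match the transition test, which is why the sketch needs the extra flag to suppress the flush in that case.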
2021-08-06 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf | Jakub Kicinski | 17 | -83/+139
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Restrict range element expansion in ipset to avoid soft lockup, from Jozsef Kadlecsik.
2) Memleak in error path for nf_conntrack_bridge for IPv4 packets, from Yajun Deng.
3) Simplify conntrack garbage collection strategy to avoid frequent wake-ups, from Florian Westphal.
4) Fix NFNLA_HOOK_FUNCTION_NAME string, do not include module name.
5) Missing chain family netlink attribute in chain description in nfnetlink_hook.
6) Incorrect sequence number on nfnetlink_hook dumps.
7) Use netlink request family in reply message for consistency.
8) Remove offload_pickup sysctl, use conntrack for established state instead, from Florian Westphal.
9) Translate NFPROTO_INET/ingress to NFPROTO_NETDEV/ingress, since NFPROTO_INET is not exposed through nfnetlink_hook.

* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
  netfilter: nfnetlink_hook: translate inet ingress to netdev
  netfilter: conntrack: remove offload_pickup sysctl again
  netfilter: nfnetlink_hook: Use same family as request message
  netfilter: nfnetlink_hook: use the sequence number of the request message
  netfilter: nfnetlink_hook: missing chain family
  netfilter: nfnetlink_hook: strip off module name from hookfn
  netfilter: conntrack: collect all entries in one cycle
  netfilter: nf_conntrack_bridge: Fix memory leak when error
  netfilter: ipset: Limit the maximal range of consecutive elements to add/delete
====================

Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2021-08-06 | netfilter: nfnetlink_hook: translate inet ingress to netdev | Pablo Neira Ayuso | 1 | -1/+7
The NFPROTO_INET pseudofamily is not exposed through this new netlink interface. The netlink dump either shows NFPROTO_IPV4 or NFPROTO_IPV6 for NFPROTO_INET prerouting/input/forward/output/postrouting hooks. The NFNLA_CHAIN_FAMILY attribute provides the family chain, which specifies if this hook applies to inet traffic only (either IPv4 or IPv6). Translate the inet/ingress hook to netdev/ingress to fully hide the NFPROTO_INET implementation details. Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem") Signed-off-by: Pablo Neira Ayuso <[email protected]>
2021-08-06 | netfilter: conntrack: remove offload_pickup sysctl again | Florian Westphal | 4 | -21/+8
These two sysctls were added because the hardcoded defaults (2 minutes, tcp, 30 seconds, udp) turned out to be too low for some setups. They appeared in 5.14-rc1, so it should be fine to remove them again. Marcelo convinced me that there should be no difference between a flow that was offloaded vs. a flow that was not wrt. timeout handling. Thus the default is changed to those for TCP established and UDP stream, 5 days and 120 seconds, respectively. Marcelo also suggested accounting for the timeout value used for the offloading; this avoids an increase beyond the value in the conntrack sysctl and will also instantly expire the conntrack entry with altered sysctls. Example:

   nf_conntrack_udp_timeout_stream=60
   nf_flowtable_udp_timeout=60

This will remove offloaded udp flows after one minute, rather than two. An earlier version of this patch also cleared the ASSURED bit to allow nf_conntrack to evict the entry via early_drop (i.e., table full). However, it looks like we can safely assume that a connection timed out via HW is still in established state, so this isn't needed. Quoting Oz: [..] the hardware sends all packets with a set FIN flags to sw. [..] Connections that are aged in hardware are expected to be in the established state. In case it turns out that back-to-sw-path transition can occur for 'dodgy' connections too (e.g., one side disappeared while software-path would have been in RETRANS timeout), we can adjust this later. Cc: Oz Shlomo <[email protected]> Cc: Paul Blakey <[email protected]> Suggested-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: Florian Westphal <[email protected]> Reviewed-by: Marcelo Ricardo Leitner <[email protected]> Reviewed-by: Oz Shlomo <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
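One way to read the timeout accounting described above is as a clamped subtraction: the time already spent under the offload timeout is deducted from the conntrack timeout when the flow returns to the software path. The sketch below encodes that reading as an assumption; `pickup_timeout()` is a hypothetical helper, and values are in seconds.

```c
/*
 * Assumed model: remaining conntrack lifetime after a flow leaves the
 * offload path. The offload timeout is deducted from the conntrack
 * timeout and the result is clamped at zero, so the total never
 * exceeds the conntrack sysctl value.
 */
static long pickup_timeout(long conntrack_timeout, long offload_timeout)
{
	long remaining = conntrack_timeout - offload_timeout;

	return remaining < 0 ? 0 : remaining;
}
```

With both values set to 60, as in the example above, the remaining lifetime is 0, so the entry expires immediately after the one-minute offload timeout instead of living for a second minute.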
2021-08-06 | netfilter: nfnetlink_hook: Use same family as request message | Pablo Neira Ayuso | 1 | -3/+3
Use the same family as the request message, for consistency. The netlink payload provides sufficient information to describe the hook object, including the family. This makes it easier for userspace to correlate the hooks that are visited by the packets of a certain family. Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem") Signed-off-by: Pablo Neira Ayuso <[email protected]>
2021-08-06 | netfilter: nfnetlink_hook: use the sequence number of the request message | Pablo Neira Ayuso | 1 | -1/+2
The sequence number allows correlating the netlink reply message (as part of the dump) with the original request message. The cb->seq field is used internally to detect interference (an update) of the hook list during the netlink dump; do not use it as the sequence number in the netlink dump header. Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem") Signed-off-by: Pablo Neira Ayuso <[email protected]>
2021-08-06 | netfilter: nfnetlink_hook: missing chain family | Pablo Neira Ayuso | 1 | -2/+6
The family is relevant for pseudo-families like NFPROTO_INET otherwise the user needs to rely on the hook function name to differentiate it from NFPROTO_IPV4 and NFPROTO_IPV6 names. Add nfnl_hook_chain_desc_attributes instead of using the existing NFTA_CHAIN_* attributes, since these do not provide a family number. Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem") Signed-off-by: Pablo Neira Ayuso <[email protected]>
2021-08-06 | netfilter: nfnetlink_hook: strip off module name from hookfn | Pablo Neira Ayuso | 1 | -0/+1
NFNLA_HOOK_FUNCTION_NAME should include the hook function name only, the module name is already provided by NFNLA_HOOK_MODULE_NAME. Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem") Signed-off-by: Pablo Neira Ayuso <[email protected]>
2021-08-06netfilter: conntrack: collect all entries in one cycleFlorian Westphal1-49/+22
Michal Kubecek reports that conntrack gc is responsible for frequent wakeups (every 125ms) on idle systems. On busy systems, timed out entries are evicted during lookup. The gc worker is only needed to remove entries after the system becomes idle following a busy period. To resolve this, always scan the entire table. If the scan is taking too long, reschedule so other work_structs can run and resume from the next bucket. After a completed scan, wait for 2 minutes before the next cycle. Heuristics for faster re-schedule are removed. GC_SCAN_INTERVAL could be exposed as a sysctl in the future to allow tuning this as needed, or even turning the gc worker off. Reported-by: Michal Kubecek <[email protected]> Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
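The scan-and-resume scheme described above can be sketched as a resumable loop. This Python model is an illustration under stated assumptions, not the kernel worker: the table is modelled as a list of buckets holding expiry timestamps, `gc_run` is a hypothetical name, and `budget` (a maximum number of buckets per run) stands in for the "scan is taking too long" check.

```python
from typing import List, Tuple

GC_SCAN_INTERVAL = 120  # seconds; the commit message states a 2 minute wait


def gc_run(table: List[List[int]], start_bucket: int,
           now: int, budget: int) -> Tuple[int, int]:
    """Evict expired entries starting at start_bucket.

    Returns (next_bucket, delay_seconds) for the next invocation.
    """
    bucket = start_bucket
    scanned = 0
    while bucket < len(table):
        if scanned >= budget:
            # Out of budget: yield now and resume from this bucket.
            return bucket, 0
        # Keep only the entries that have not expired yet.
        table[bucket] = [expiry for expiry in table[bucket] if expiry > now]
        bucket += 1
        scanned += 1
    # Full pass completed: restart from bucket 0 after the long interval.
    return 0, GC_SCAN_INTERVAL
```

A caller would feed the returned bucket index back into the next call, so that a pass interrupted for fairness still covers the whole table before the long sleep begins.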
2021-08-06net: dsa: don't disable multicast flooding to the CPU even without an IGMP querierVladimir Oltean3-19/+0
Commit 08cc83cc7fd8 ("net: dsa: add support for BRIDGE_MROUTER attribute") added an option for users to turn off multicast flooding towards the CPU if they turn off the IGMP querier on a bridge which already has enslaved ports (echo 0 > /sys/class/net/br0/bridge/multicast_router).

And commit a8b659e7ff75 ("net: dsa: act as passthrough for bridge port flags") simply papered over that issue, because it moved the decision to flood the CPU with multicast (or not) from the DSA core down to individual drivers, instead of taking a more radical position then.

The truth is that disabling multicast flooding to the CPU is simply something we are not prepared to do now, if at all. Some reasons:

- ICMP6 neighbor solicitation messages are unregistered multicast packets as far as the bridge is concerned. So if we stop flooding multicast, the outside world cannot ping the bridge device's IPv6 link-local address.

- There might be foreign interfaces bridged with our DSA switch ports (sending a packet towards the host does not necessarily equal termination, but maybe software forwarding). So if there is no one interested in that multicast traffic in the local network stack, that doesn't mean nobody is.

- PTP over L4 (IPv4, IPv6) is multicast, but is unregistered as far as the bridge is concerned. This should reach the CPU port.

- The switch driver might not do FDB partitioning. And since we don't even bother to do more fine-grained flood disabling (such as "disable flooding _from_port_N_ towards the CPU port" as opposed to "disable flooding _from_any_port_ towards the CPU port"), this breaks standalone ports, or even multiple bridges where one has an IGMP querier and one doesn't.

Reverting the logic makes all of the above work.

Fixes: a8b659e7ff75 ("net: dsa: act as passthrough for bridge port flags")
Fixes: 08cc83cc7fd8 ("net: dsa: add support for BRIDGE_MROUTER attribute")
Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2021-08-06net: dsa: stop syncing the bridge mcast_router attribute at join timeVladimir Oltean1-10/+0
Qingfang points out that when a bridge with the default settings is created and a port joins it:

  ip link add br0 type bridge
  ip link set swp0 master br0

DSA calls br_multicast_router() on the bridge to see if the br0 device is a multicast router port, and if it is, it enables multicast flooding to the CPU port, otherwise it disables it.

If we look through the multicast_router_show() sysfs or at the IFLA_BR_MCAST_ROUTER netlink attribute, we see that the default mrouter attribute for the bridge device is "1" (MDB_RTR_TYPE_TEMP_QUERY).

However, br_multicast_router() will return "0" (MDB_RTR_TYPE_DISABLED), because an mrouter port in the MDB_RTR_TYPE_TEMP_QUERY state may not be actually _active_ until it receives an actual IGMP query. So, the br_multicast_router() function should really have been called br_multicast_router_active() perhaps.

When/if an IGMP query is received, the bridge device will transition via br_multicast_mark_router() into the active state until the ip4_mc_router_timer expires after a multicast_querier_interval.

Of course, this does not happen if the bridge is created with an mcast_router attribute of "2" (MDB_RTR_TYPE_PERM).

The point is that in the absence of any IGMP query messages, and in the default bridge configuration, unregistered multicast packets will not be able to reach the CPU port through flooding, and this breaks many use cases (most obviously, IPv6 ND, with its ICMP6 neighbor solicitation multicast messages).

Leave the multicast flooding setting towards the CPU port down to a driver level decision.

Fixes: 010e269f91be ("net: dsa: sync up switchdev objects and port attributes when joining the bridge")
Reported-by: DENG Qingfang <[email protected]>
Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
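The difference between the configured mrouter attribute and the "active" state can be modelled in a few lines of Python. This is a simplified sketch, not the bridge code: the constant values are the ones quoted in the commit message, while `router_active` and its `query_timer_pending` parameter are hypothetical names for the behaviour it describes.

```python
MDB_RTR_TYPE_DISABLED = 0
MDB_RTR_TYPE_TEMP_QUERY = 1  # default for a newly created bridge
MDB_RTR_TYPE_PERM = 2


def router_active(mode: int, query_timer_pending: bool) -> bool:
    """Model of whether the bridge currently acts as a multicast router."""
    if mode == MDB_RTR_TYPE_PERM:
        # Permanent router: active regardless of received queries.
        return True
    if mode == MDB_RTR_TYPE_TEMP_QUERY:
        # Active only while the timer armed by an IGMP query is running.
        return query_timer_pending
    return False
```

In this model a default bridge that has never seen an IGMP query evaluates to inactive, which is the case that led to flooding towards the CPU port being disabled at join time.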
2021-08-06ethtool: return error from ethnl_ops_begin if dev is NULLHeiner Kallweit1-2/+2
Julian reported that after d43c65b05b84 Coverity complains about a missing check whether dev is NULL in ethnl_ops_complete(). There doesn't seem to be any valid case where dev could be NULL when calling ethnl_ops_begin(), therefore return an error if dev is NULL. Fixes: d43c65b05b84 ("ethtool: runtime-resume netdev parent in ethnl_ops_begin") Reported-by: Julian Wiedmann <[email protected]> Signed-off-by: Heiner Kallweit <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski18-68/+173
Build failure in drivers/net/wwan/mhi_wwan_mbim.c: add missing parameter (0, assuming we don't want buffer pre-alloc). Conflict in drivers/net/dsa/sja1105/sja1105_main.c between: 589918df9322 ("net: dsa: sja1105: be stateless with FDB entries on SJA1105P/Q/R/S/SJA1110 too") 0fac6aa098ed ("net: dsa: sja1105: delete the best_effort_vlan_filtering mode") Follow the instructions from the commit message of the former commit - removed the if conditions. When looking at commit 589918df9322 ("net: dsa: sja1105: be stateless with FDB entries on SJA1105P/Q/R/S/SJA1110 too") note that the mask_iotag fields get removed by the following patch. Signed-off-by: Jakub Kicinski <[email protected]>
2021-08-05Merge tag 'net-5.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds15-44/+129
Pull networking fixes from Jakub Kicinski:
"Including fixes from ipsec.

Current release - regressions:
 - sched: taprio: fix init procedure to avoid inf loop when dumping
 - sctp: move the active_key update after sh_keys is added

Current release - new code bugs:
 - sparx5: fix build with old GCC & bitmask on 32-bit targets

Previous releases - regressions:
 - xfrm: redo the PREEMPT_RT RCU vs hash_resize_mutex deadlock fix
 - xfrm: fixes for the compat netlink attribute translator
 - phy: micrel: Fix detection of ksz87xx switch

Previous releases - always broken:
 - gro: set inner transport header offset in tcp/udp GRO hook to avoid crashes when such packets reach GSO
 - vsock: handle VIRTIO_VSOCK_OP_CREDIT_REQUEST, as required by spec
 - dsa: sja1105: fix static FDB entries on SJA1105P/Q/R/S and SJA1110
 - bridge: validate the NUD_PERMANENT bit when adding an extern_learn FDB entry
 - usb: lan78xx: don't modify phy_device state concurrently
 - usb: pegasus: check for errors of IO routines"

* tag 'net-5.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (48 commits)
  net: vxge: fix use-after-free in vxge_device_unregister
  net: fec: fix use-after-free in fec_drv_remove
  net: pegasus: fix uninit-value in get_interrupt_interval
  net: ethernet: ti: am65-cpsw: fix crash in am65_cpsw_port_offload_fwd_mark_update()
  bnx2x: fix an error code in bnx2x_nic_load()
  net: wwan: iosm: fix recursive lock acquire in unregister
  net: wwan: iosm: correct data protocol mask bit
  net: wwan: iosm: endianness type correction
  net: wwan: iosm: fix lkp buildbot warning
  net: usb: lan78xx: don't modify phy_device state concurrently
  docs: networking: netdevsim rules
  net: usb: pegasus: Remove the changelog and DRIVER_VERSION.
  net: usb: pegasus: Check the return value of get_geristers() and friends;
  net/prestera: Fix devlink groups leakage in error flow
  net: sched: fix lockdep_set_class() typo error for sch->seqlock
  net: dsa: qca: ar9331: reorder MDIO write sequence
  VSOCK: handle VIRTIO_VSOCK_OP_CREDIT_REQUEST
  mptcp: drop unused rcu member in mptcp_pm_addr_entry
  net: ipv6: fix returned variable type in ip6_skb_dst_mtu
  nfp: update ethtool reporting of pauseframe control
  ...
2021-08-05Bluetooth: defer cleanup of resources in hci_unregister_dev()Tetsuo Handa3-24/+44
syzbot is hitting a might_sleep() warning at hci_sock_dev_event() due to calling lock_sock() with an rw spinlock held [1].

The history of this locking problem appears to be one of trial and error. Commit b40df5743ee8 ("[PATCH] bluetooth: fix socket locking in hci_sock_dev_event()") in 2.6.21-rc4 changed bh_lock_sock() to lock_sock() as an attempt to fix a lockdep warning. Then, commit 4ce61d1c7a8e ("[BLUETOOTH]: Fix locking in hci_sock_dev_event().") in 2.6.22-rc2 changed lock_sock() to local_bh_disable() + bh_lock_sock_nested() as an attempt to fix the sleep in atomic context warning. Then, commit 4b5dd696f81b ("Bluetooth: Remove local_bh_disable() from hci_sock.c") in 3.3-rc1 removed local_bh_disable(). Then, commit e305509e678b ("Bluetooth: use correct lock to prevent UAF of hdev object") in 5.13-rc5 again changed bh_lock_sock_nested() to lock_sock() as an attempt to fix CVE-2021-3573.

This difficulty comes from the current implementation, in which hci_sock_dev_event(HCI_DEV_UNREG) is responsible for dropping all references from sockets because hci_unregister_dev() immediately reclaims resources as soon as hci_sock_dev_event(HCI_DEV_UNREG) returns. But the history suggests that hci_sock_dev_event(HCI_DEV_UNREG) was not doing what it should do.

Therefore, instead of trying to detach sockets from the device, let's accept not detaching sockets from the device at hci_sock_dev_event(HCI_DEV_UNREG), by moving the actual cleanup of resources from hci_unregister_dev() to hci_cleanup_dev(), which is called by bt_host_release() when all references to this unregistered device (which is a kobject) are gone.

Since hci_sock_dev_event(HCI_DEV_UNREG) no longer resets hci_pi(sk)->hdev, we need to check whether this device was unregistered and return an error based on the HCI_UNREGISTER flag. There might be a subtle behavioral difference in the "monitor the hdev" functionality; please report if you find that something went wrong due to this patch.
Link: https://syzkaller.appspot.com/bug?extid=a5df189917e79d5e59c9 [1] Reported-by: syzbot <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Signed-off-by: Tetsuo Handa <[email protected]> Fixes: e305509e678b ("Bluetooth: use correct lock to prevent UAF of hdev object") Acked-by: Luiz Augusto von Dentz <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
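The deferred-cleanup pattern described in the commit message above can be illustrated with a small reference-counting model in Python. This is a sketch of the idea only, not the Bluetooth code: the `Device` and `Socket` classes are hypothetical, the `cleaned_up` flag stands in for the work done by hci_cleanup_dev(), and the choice of ENODEV as the error is an assumption, since the commit message only says that an error is returned.

```python
import errno


class Device:
    def __init__(self) -> None:
        self.refs = 1              # reference held by the registering driver
        self.unregistered = False  # stand-in for the HCI_UNREGISTER flag
        self.cleaned_up = False    # stand-in for hci_cleanup_dev() having run

    def hold(self) -> None:
        self.refs += 1

    def put(self) -> None:
        self.refs -= 1
        if self.refs == 0:
            # Resources are reclaimed only when the last reference is gone.
            self.cleaned_up = True

    def unregister(self) -> None:
        # Mark the device as gone but leave attached sockets alone.
        self.unregistered = True
        self.put()


class Socket:
    def __init__(self, dev: Device) -> None:
        self.dev = dev
        dev.hold()

    def send(self) -> str:
        # The socket still points at the device, so check the flag first.
        if self.dev.unregistered:
            raise OSError(errno.ENODEV, "device unregistered")
        return "ok"

    def close(self) -> None:
        self.dev.put()
```

In this model, unregistering a device with an open socket leaves the device object alive; operations on the socket fail with an error, and cleanup runs only once the socket is closed and the last reference is dropped.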
2021-08-05Bluetooth: Add support hdev to allocate private dataTedd Ho-Jeong An1-3/+10
This patch adds support for hdev to allocate extra space for private data. The size of the private data is specified in hdev_alloc_size(priv_size), and the allocated buffer can be accessed with hci_get_priv(hdev). Signed-off-by: Tedd Ho-Jeong An <[email protected]> Signed-off-by: Marcel Holtmann <[email protected]>