2024-08-16  net: dsa: provide a software untagging function on RX for VLAN-aware bridges  (Vladimir Oltean; 3 files, -38/+118)

Through code analysis, I realized that the ds->untag_bridge_pvid logic is contradictory - see the newly added FIXME above the kernel-doc for dsa_software_untag_vlan_unaware_bridge().

Moreover, for the Felix driver, I need something very similar, but which is actually _not_ contradictory: untag the bridge PVID on RX, but for VLAN-aware bridges. The existing logic does it for VLAN-unaware bridges.

Since I don't want to change the functionality of drivers which were supposedly properly tested with the ds->untag_bridge_pvid flag, I have introduced a new one: ds->untag_vlan_aware_bridge_pvid, and I have refactored the DSA reception code into a common path for both flags (sketched after this message).

TODO: both flags should be unified under a single ds->software_vlan_untag, which users of both current flags should set. This is not something that can be carried out right away. It needs very careful examination of all drivers which make use of this functionality, since some of them actually get this wrong in the first place.

For example, commit 9130c2d30c17 ("net: dsa: microchip: ksz8795: Use software untagging on CPU port") uses this in a driver which has ds->configure_vlan_while_not_filtering = true. The latter mechanism has been known for many years to be broken by design:
https://lore.kernel.org/netdev/CABumfLzJmXDN_W-8Z=p9KyKUVi_HhS7o_poBkeKHS2BkAiyYpw@mail.gmail.com/
and we have the situation of 2 bugs canceling each other. There is no private VLAN, and the port follows the PVID of the VLAN-unaware bridge. So, it's kinda ok for that driver to use the ds->untag_bridge_pvid mechanism, in a broken way.

Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
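For illustration, the common RX path described above could look roughly like this (the flag and helper names follow the patch description; the bridge lookup and the actual untagging step are assumptions, not the literal patch):

    /* Hedged sketch: decide in one place whether software untagging
     * applies, based on which opt-in flag the driver set and whether
     * the bridge above the port is VLAN-aware.
     */
    static inline struct sk_buff *dsa_software_vlan_untag(struct sk_buff *skb)
    {
        struct dsa_port *dp = dsa_user_to_port(skb->dev);
        struct net_device *br = dsa_port_bridge_dev_get(dp);

        /* software untagging only makes sense on bridge ports */
        if (!br)
            return skb;

        if (br_vlan_enabled(br)) {
            /* VLAN-aware bridge: only if the driver opted in */
            if (!dp->ds->untag_vlan_aware_bridge_pvid)
                return skb;
        } else {
            if (!dp->ds->untag_bridge_pvid)
                return skb;
        }

        /* ... strip the VLAN tag from the skb if it is the PVID ... */
        return skb;
    }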
2024-08-16  net: mscc: ocelot: serialize access to the injection/extraction groups  (Vladimir Oltean; 4 files, -0/+76)

As explained by Horatiu Vultur in commit 603ead96582d ("net: sparx5: Add spinlock for frame transmission from CPU"), which is for a similar hardware design, multiple CPUs can simultaneously perform injection or extraction. There are only 2 register groups for injection and 2 for extraction, and the driver only uses one of each. So we'd better serialize access using spin locks, otherwise frame corruption is possible.

Note that unlike in sparx5, FDMA in ocelot does not have this issue because struct ocelot_fdma_tx_ring already contains an xmit_lock.

I guess this is mostly a problem for NXP LS1028A, as that is dual core. I don't think VSC7514 is. So I'm blaming the commit where LS1028A (aka the felix DSA driver) started using register-based packet injection and extraction.

Fixes: 0a6f17c6ae21 ("net: dsa: tag_ocelot_8021q: add support for PTP timestamping")
Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
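The pattern being added, in sketch form (the lock name follows the patch description; the register sequence is abridged from ocelot_port_inject_frame() and is illustrative):

    /* Guard the single injection register group the driver uses, so two
     * CPUs cannot interleave their IFH/payload register writes.
     */
    spin_lock(&ocelot->inj_lock);

    ocelot_write_rix(ocelot, QS_INJ_CTRL_GAP_SIZE(1) | QS_INJ_CTRL_SOF,
                     QS_INJ_CTRL, grp);
    /* ... write the injection frame header words and the frame data ... */

    spin_unlock(&ocelot->inj_lock);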
2024-08-16  net: mscc: ocelot: fix QoS class for injected packets with "ocelot-8021q"  (Vladimir Oltean; 2 files, -2/+9)

There are 2 distinct code paths (listed below) in the source code which set up an injection header for Ocelot(-like) switches. Code path (2) lacks the QoS class and source port being set correctly. Especially the improper QoS classification is a problem for the "ocelot-8021q" alternative DSA tagging protocol, because we support tc-taprio and each packet needs to be scheduled precisely through its time slot. This includes PTP, which is normally assigned to a traffic class other than 0, but would be sent through TC 0 nonetheless.

The code paths are:

(1) ocelot_xmit_common() from net/dsa/tag_ocelot.c - called only by the standard "ocelot" DSA tagging protocol which uses NPI-based injection - sets up bit fields in the tag manually to account for a small difference (destination port offset) between Ocelot and Seville. Namely, ocelot_ifh_set_dest() is omitted from ocelot_xmit_common(), because there's also seville_ifh_set_dest().

(2) ocelot_ifh_set_basic(), called by:
    - ocelot_fdma_prepare_skb() for FDMA transmission of the ocelot switchdev driver
    - ocelot_port_xmit() -> ocelot_port_inject_frame() for register-based transmission of the ocelot switchdev driver
    - felix_port_deferred_xmit() -> ocelot_port_inject_frame() for the DSA tagger ocelot-8021q when it must transmit PTP frames (also through register-based injection)
    and sets the bit fields according to its own logic.

The problem is that (2) doesn't call ocelot_ifh_set_qos_class(). Copying that logic from ocelot_xmit_common() fixes that; see the sketch below.

Unfortunately, although desirable, it is not easily possible to de-duplicate code paths (1) and (2) and make net/dsa/tag_ocelot.c directly call ocelot_ifh_set_basic(), because of the ocelot/seville difference. This is the "minimal" fix with some logic duplicated (but at least more consolidated).

Fixes: 0a6f17c6ae21 ("net: dsa: tag_ocelot_8021q: add support for PTP timestamping")
Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
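The missing logic, copied in spirit from ocelot_xmit_common() (here "dev" stands for the transmitting netdev, an assumption for illustration):

    /* Derive the QoS class from skb->priority, honoring any tc-mqprio
     * priority-to-TC mapping, then store it in the injection header.
     */
    int qos_class = netdev_get_num_tc(dev) ?
                    netdev_get_prio_tc_map(dev, skb->priority) :
                    skb->priority;

    ocelot_ifh_set_qos_class(ifh, qos_class);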
2024-08-16  net: mscc: ocelot: use ocelot_xmit_get_vlan_info() also for FDMA and register injection  (Vladimir Oltean; 5 files, -43/+75)

Problem description
-------------------

On an NXP LS1028A (felix DSA driver) with the following configuration:

- ocelot-8021q tagging protocol
- VLAN-aware bridge (with STP) spanning at least swp0 and swp1
- 8021q VLAN upper interfaces on swp0 and swp1: swp0.700, swp1.700
- ptp4l on swp0.700 and swp1.700

we see that the ptp4l instances do not see each other's traffic, and they all go to the grand master state due to the ANNOUNCE_RECEIPT_TIMEOUT_EXPIRES condition.

Jumping to the conclusion for the impatient
-------------------------------------------

There is a zero-day bug in the ocelot switchdev driver in the way it handles VLAN-tagged packet injection. The correct logic already exists in the source code, in function ocelot_xmit_get_vlan_info() added by commit 5ca721c54d86 ("net: dsa: tag_ocelot: set the classified VLAN during xmit"). But it is used only for normal NPI-based injection with the DSA "ocelot" tagging protocol. The other injection code paths (register-based and FDMA-based) roll their own wrong logic. This affects and was noticed on the DSA "ocelot-8021q" protocol because it uses register-based injection.

By moving ocelot_xmit_get_vlan_info() to a place that's common for both the DSA tagger and the ocelot switch library, it can also be called from ocelot_port_inject_frame() in ocelot.c.

We need to touch the lines with ocelot_ifh_port_set()'s prototype anyway, so let's rename it to something clearer regarding what it does, and add a kernel-doc. ocelot_ifh_set_basic() should do.

Investigation notes
-------------------

Debugging reveals that PTP event frames (aka those carrying timestamps, like Sync) injected into swp0.700 (but also swp1.700) hit the wire with two VLAN tags:

    00000000: 01 1b 19 00 00 00 00 01 02 03 04 05 81 00 02 bc
                                                  ~~~~~~~~~~~
    00000010: 81 00 02 bc 88 f7 00 12 00 2c 00 00 02 00 00 00
              ~~~~~~~~~~~
    00000020: 00 00 00 00 00 00 00 00 00 00 00 01 02 ff fe 03
    00000030: 04 05 00 01 00 04 00 00 00 00 00 00 00 00 00 00
    00000040: 00 00

The second (unexpected) VLAN tag makes felix_check_xtr_pkt() -> ptp_classify_raw() fail to see these as PTP packets at the link partner's receiving end, and return PTP_CLASS_NONE (because the BPF classifier is not written to expect 2 VLAN tags).

The reason why packets have 2 VLAN tags is because the transmission code treats VLAN incorrectly.

Neither ocelot switchdev, nor felix DSA, declare the NETIF_F_HW_VLAN_CTAG_TX feature. Therefore, at xmit time, all VLANs should be in the skb head, and none should be in the hwaccel area. This is done by:

    static struct sk_buff *validate_xmit_vlan(struct sk_buff *skb,
                                              netdev_features_t features)
    {
        if (skb_vlan_tag_present(skb) &&
            !vlan_hw_offload_capable(features, skb->vlan_proto))
            skb = __vlan_hwaccel_push_inside(skb);
        return skb;
    }

But ocelot_port_inject_frame() handles things incorrectly:

    ocelot_ifh_port_set(ifh, port, rew_op, skb_vlan_tag_get(skb));

    void ocelot_ifh_port_set(struct sk_buff *skb, void *ifh, int port, u32 rew_op)
    {
        (...)
        if (vlan_tag)
            ocelot_ifh_set_vlan_tci(ifh, vlan_tag);
        (...)
    }

The way __vlan_hwaccel_push_inside() pushes the tag inside the skb head is by calling:

    static inline void __vlan_hwaccel_clear_tag(struct sk_buff *skb)
    {
        skb->vlan_present = 0;
    }

which does _not_ zero out skb->vlan_tci as seen by skb_vlan_tag_get(). This means that ocelot, when it calls skb_vlan_tag_get(), sees (and uses) a residual skb->vlan_tci, while the same VLAN tag is _already_ in the skb head.

The trivial fix for double VLAN headers is to replace the content of ocelot_ifh_port_set() with:

    if (skb_vlan_tag_present(skb))
        ocelot_ifh_set_vlan_tci(ifh, skb_vlan_tag_get(skb));

but this would not be correct either, because, as mentioned, vlan_hw_offload_capable() is false for us, so we'd be inserting dead code and we'd always transmit packets with VID=0 in the injection frame header.

I can't actually test the ocelot switchdev driver and rely exclusively on code inspection, but I don't think traffic from 8021q uppers has ever been injected properly, and not double-tagged. Thus I'm blaming the introduction of VLAN fields in the injection header - early driver code.

As hinted at in the early conclusion, what we _want_ to happen for VLAN transmission was already described once in commit 5ca721c54d86 ("net: dsa: tag_ocelot: set the classified VLAN during xmit").

ocelot_xmit_get_vlan_info() intends to ensure that if the port through which we're transmitting is under a VLAN-aware bridge, the outer VLAN tag from the skb head is stripped from there and inserted into the injection frame header (so that the packet is processed in hardware through that actual VLAN). And in all other cases, the packet is sent with VID=0 in the injection frame header, since the port is VLAN-unaware and has logic to strip this VID on egress (making it invisible to the wire).

Fixes: 08d02364b12f ("net: mscc: fix the injection header")
Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
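The intended selection logic can be sketched as follows, based on the description of ocelot_xmit_get_vlan_info() above (names follow the referenced commit; the exact signature and details are assumptions):

    /* If the port is under a VLAN-aware bridge, pop the outer tag from
     * the skb head into the injection frame header; otherwise inject
     * with VID 0, which the VLAN-unaware port strips on egress.
     */
    static void ocelot_xmit_get_vlan_info(struct sk_buff *skb,
                                          struct net_device *br,
                                          u64 *vlan_tci, u64 *tag_type)
    {
        struct vlan_ethhdr *hdr;
        u16 proto, tci;

        if (!br || !br_vlan_enabled(br)) {
            *vlan_tci = 0;
            *tag_type = IFH_TAG_TYPE_C;
            return;
        }

        hdr = (struct vlan_ethhdr *)skb_mac_header(skb);
        br_vlan_get_proto(br, &proto);

        if (ntohs(hdr->h_vlan_proto) == proto) {
            /* outer tag matches the bridge protocol: classify to it */
            vlan_remove_tag(skb, &tci);
            *vlan_tci = tci;
        } else {
            /* fall back to the bridge PVID */
            rcu_read_lock();
            br_vlan_get_pvid_rcu(br, &tci);
            rcu_read_unlock();
            *vlan_tci = tci;
        }

        *tag_type = (proto != ETH_P_8021Q) ? IFH_TAG_TYPE_S : IFH_TAG_TYPE_C;
    }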
2024-08-16  selftests: net: bridge_vlan_aware: test that other TPIDs are seen as untagged  (Vladimir Oltean; 1 file, -1/+53)

The bridge VLAN implementation w.r.t. VLAN protocol is described in merge commit 1a0b20b25732 ("Merge branch 'bridge-next'"). We are only sensitive to those VLAN tags whose TPID is equal to the bridge's vlan_protocol. Thus, an 802.1ad VLAN should be treated as 802.1Q-untagged.

Add 3 tests which validate that:

- 802.1ad-tagged traffic is learned into the PVID of an 802.1Q-aware bridge
- Double-tagged traffic is forwarded when just the PVID of the port is present in the VLAN group of the ports
- Double-tagged traffic is not forwarded when the PVID of the port is absent from the VLAN group of the ports

The test passes with both veth and ocelot.

Signed-off-by: Vladimir Oltean <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Tested-by: Ido Schimmel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2024-08-16  selftests: net: local_termination: add PTP frames to the mix  (Vladimir Oltean; 1 file, -13/+148)

A breakage in the felix DSA driver shows we do not have enough test coverage. More generally, PTP is sufficiently special that it is likely drivers will treat it differently.

This is not meant to be a full PTP test; it just makes sure that PTP packets sent to the different addresses corresponding to their profiles are received correctly. The local_termination selftest seemed like the most appropriate place for this addition.

PTP RX/TX in some cases makes no sense (over a bridge), and this is why $skip_ptp exists. And in others - PTP over a bridge port - the IP stack needs convincing through the available bridge netfilter hooks to leave the PTP packets alone and not have them stolen by the bridge rx_handler. It is safe to assume that users have that figured out already. This is a driver level test, and by using tcpdump, all that extra setup is out of scope here.

send_non_ip() was an unfinished idea; written but never used. Replace it with a more generic send_raw(), and send 3 PTP packet types times 3 transports.

Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2024-08-16  selftests: net: local_termination: don't use xfail_on_veth()  (Vladimir Oltean; 2 files, -16/+99)

xfail_on_veth() for this test is an incorrect approximation which gives false positives and false negatives.

When local_termination fails with "reception succeeded, but should have failed", it is because the DUT ($h2) accepts packets even when not configured as promiscuous. This is not something specific to veth; even the bridge behaves that way, but this is not captured by the xfail_on_veth test.

The IFF_UNICAST_FLT flag is not explicitly exported to user space, but it can somewhat be determined from the interface's behavior. We have to create a macvlan upper with a different MAC address. This forces a dev_uc_add() call in the kernel. When the unicast filtering list is not empty, but the device doesn't support IFF_UNICAST_FLT, __dev_set_rx_mode() force-enables promiscuity on the interface, to ensure correct behavior (that the requested address is received). We can monitor the change in the promiscuity flag and infer from it whether the device supports unicast filtering.

There is no equivalent thing for allmulti, unfortunately. We never know what's hiding behind a device which has allmulti=off. Whether it will actually perform RX multicast filtering of unknown traffic is a strong "maybe". The bridge driver, for example, completely ignores the flag. We'll have to keep the xfail behavior, but instead of XFAIL on just veth, always XFAIL.

Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2024-08-16  selftests: net: local_termination: introduce new tests which capture VLAN behavior  (Vladimir Oltean; 1 file, -7/+110)

Add more coverage to the local termination selftest as follows:

- 8021q upper of $h2
- 8021q upper of $h2, where $h2 is a port of a VLAN-unaware bridge
- 8021q upper of $h2, where $h2 is a port of a VLAN-aware bridge
- 8021q upper of VLAN-unaware br0, which is the upper of $h2
- 8021q upper of VLAN-aware br0, which is the upper of $h2

Especially the cases with traffic sent through the VLAN upper of a VLAN-aware bridge port will be immediately relevant when we will start transmitting PTP packets as an additional kind of traffic.

Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2024-08-16  selftests: net: local_termination: add one more test for VLAN-aware bridges  (Vladimir Oltean; 1 file, -4/+18)

The current bridge() test is for packet reception on a VLAN-unaware bridge. Some things are different enough with VLAN-aware bridges that it's worth renaming this test into vlan_unaware_bridge(), and adding a new vlan_aware_bridge() test. The two will share the same implementation: bridge() becomes a common function, which receives $vlan_filtering as an argument.

Rename it to test_bridge() at the same time, because just bridge() pollutes the global namespace and we cannot invoke the binary with the same name from the iproute2 package currently.

Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2024-08-16  selftests: net: local_termination: parameterize test name  (Vladimir Oltean; 1 file, -18/+20)

There are upcoming tests which verify the RX filtering of a bridge (or bridge port), but under differing vlan_filtering conditions. Since we currently print $h2 (the DUT) in the log_test() output, it becomes necessary to make a further distinction between tests, to not give the user the impression that the exact same thing is run twice.

Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2024-08-16  selftests: net: local_termination: parameterize sending interface  (Vladimir Oltean; 1 file, -19/+20)

In future changes we will want to subject the DUT, $h2, to additional VLAN-tagged traffic. For that, we need to run the tests using $h1.100 as a sending interface, rather than the currently hardcoded $h1.

Add a parameter to run_test() and modify its 2 callers to explicitly pass $h1, as was implicit before.

Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2024-08-16  selftests: net: local_termination: refactor macvlan creation/deletion  (Vladimir Oltean; 1 file, -12/+18)

This will be used in other subtests as well; make new macvlan_create() and macvlan_destroy() functions.

Signed-off-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2024-08-16  char: xillybus: Check USB endpoints when probing device  (Eli Billauer; 1 file, -2/+20)

Ensure, as the driver probes the device, that all endpoints that the driver may attempt to access exist and are of the correct type.

All XillyUSB devices must have a Bulk IN and Bulk OUT endpoint at address 1. This is verified in xillyusb_setup_base_eps().

On top of that, a XillyUSB device may have additional Bulk OUT endpoints. The information about these endpoints' addresses is deduced from a data structure (the IDT) that the driver fetches from the device while probing it. These endpoints are checked in setup_channels().

A XillyUSB device never has more than one IN endpoint, as all data towards the host is multiplexed in this single Bulk IN endpoint. This is why setup_channels() only checks OUT endpoints.

Reported-by: [email protected]
Cc: stable <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/T/
Fixes: a53d1202aef1 ("char: xillybus: Add driver for XillyUSB (Xillybus variant for USB)")
Signed-off-by: Eli Billauer <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
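One way to express such a probe-time check (a sketch, not the literal patch): the USB core helper usb_check_bulk_endpoints() verifies that every address in a zero-terminated array exists on the interface and is of Bulk type.

    /* Confirm endpoint address 1 exists in both directions and is Bulk
     * before the driver ever submits URBs to it.
     */
    static const u8 base_eps[] = { 0x01, USB_DIR_IN | 0x01, 0 };

    if (!usb_check_bulk_endpoints(intf, base_eps))
        return -ENODEV;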
2024-08-16  char: xillybus: Refine workqueue handling  (Eli Billauer; 1 file, -3/+5)

As the wakeup work item now runs on a separate workqueue, it needs to be flushed separately along with flushing the device's workqueue.

Also, move the destroy_workqueue() call to the end of the exit method, so that deinitialization is done in the opposite order of initialization.

Fixes: ccbde4b128ef ("char: xillybus: Don't destroy workqueue from work item running on it")
Cc: stable <[email protected]>
Signed-off-by: Eli Billauer <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
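The ordering described above, in sketch form (symbol names here are hypothetical except for the workqueue API calls):

    /* Flush both workqueues when quiescing the device. */
    static void xillyusb_flush_all(struct xillyusb_dev *xdev)
    {
        flush_workqueue(xdev->workq);   /* the device's own work items */
        flush_workqueue(wakeup_wq);     /* the separate wakeup work item */
    }

    static void __exit xillyusb_exit(void)
    {
        usb_deregister(&xillyusb_driver);

        /* destroy last, mirroring the reverse of initialization */
        destroy_workqueue(wakeup_wq);
    }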
2024-08-15  mm/migrate: fix deadlock in migrate_pages_batch() on large folios  (Gao Xiang; 1 file, -5/+11)

Currently, migrate_pages_batch() can lock multiple locked folios with an arbitrary order. Although folio_trylock() is used to avoid deadlock as commit 2ef7dbb26990 ("migrate_pages: try migrate in batch asynchronously firstly") mentioned, it seems that try_split_folio() is still missing.

It was found by a compaction stress test when I explicitly enabled EROFS compressed files to use large folios, a case which I cannot reproduce with the same workload if large folio support is off (current mainline).

Typically, filesystem reads (with locked file-backed folios) could use another bdev/meta inode to load some other I/Os (e.g. inode extent metadata or caching compressed data), so the locking order will be:

    file-backed folios  (A)
       bdev/meta folios (B)

The following calltrace shows the deadlock:
    Thread 1 takes (B) lock and tries to take folio (A) lock
    Thread 2 takes (A) lock and tries to take folio (B) lock

[Thread 1]
    INFO: task stress:1824 blocked for more than 30 seconds.
          Tainted: G OE 6.10.0-rc7+ #6
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    task:stress state:D stack:0 pid:1824 tgid:1824 ppid:1822 flags:0x0000000c
    Call trace:
     __switch_to+0xec/0x138
     __schedule+0x43c/0xcb0
     schedule+0x54/0x198
     io_schedule+0x44/0x70
     folio_wait_bit_common+0x184/0x3f8
                    <-- folio mapping ffff00036d69cb18 index 996 (**)
     __folio_lock+0x24/0x38
     migrate_pages_batch+0x77c/0xea0  // try_split_folio (mm/migrate.c:1486:2)
                                      // migrate_pages_batch (mm/migrate.c:1734:16)
            <--- LIST_HEAD(unmap_folios) has
                 ..
                 folio mapping 0xffff0000d184f1d8 index 1711; (*)
                 folio mapping 0xffff0000d184f1d8 index 1712;
                 ..
     migrate_pages+0xb28/0xe90
     compact_zone+0xa08/0x10f0
     compact_node+0x9c/0x180
     sysctl_compaction_handler+0x8c/0x118
     proc_sys_call_handler+0x1a8/0x280
     proc_sys_write+0x1c/0x30
     vfs_write+0x240/0x380
     ksys_write+0x78/0x118
     __arm64_sys_write+0x24/0x38
     invoke_syscall+0x78/0x108
     el0_svc_common.constprop.0+0x48/0xf0
     do_el0_svc+0x24/0x38
     el0_svc+0x3c/0x148
     el0t_64_sync_handler+0x100/0x130
     el0t_64_sync+0x190/0x198

[Thread 2]
    INFO: task stress:1825 blocked for more than 30 seconds.
          Tainted: G OE 6.10.0-rc7+ #6
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    task:stress state:D stack:0 pid:1825 tgid:1825 ppid:1822 flags:0x0000000c
    Call trace:
     __switch_to+0xec/0x138
     __schedule+0x43c/0xcb0
     schedule+0x54/0x198
     io_schedule+0x44/0x70
     folio_wait_bit_common+0x184/0x3f8
                    <-- folio = 0xfffffdffc6b503c0
                        (mapping == 0xffff0000d184f1d8 index == 1711) (*)
     __folio_lock+0x24/0x38
     z_erofs_runqueue+0x384/0x9c0 [erofs]
     z_erofs_readahead+0x21c/0x350 [erofs]
                    <-- folio mapping 0xffff00036d69cb18 range from [992, 1024] (**)
     read_pages+0x74/0x328
     page_cache_ra_order+0x26c/0x348
     ondemand_readahead+0x1c0/0x3a0
     page_cache_sync_ra+0x9c/0xc0
     filemap_get_pages+0xc4/0x708
     filemap_read+0x104/0x3a8
     generic_file_read_iter+0x4c/0x150
     vfs_read+0x27c/0x330
     ksys_pread64+0x84/0xd0
     __arm64_sys_pread64+0x28/0x40
     invoke_syscall+0x78/0x108
     el0_svc_common.constprop.0+0x48/0xf0
     do_el0_svc+0x24/0x38
     el0_svc+0x3c/0x148
     el0t_64_sync_handler+0x100/0x130
     el0t_64_sync+0x190/0x198

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move")
Signed-off-by: Gao Xiang <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
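The shape of the fix is to make the folio lock taken inside try_split_folio() follow the same trylock rule as the rest of the batch path. A hedged sketch, assuming a mode parameter is threaded through (details may differ from the applied patch):

    static int try_split_folio(struct folio *folio,
                               struct list_head *split_folios,
                               enum migrate_mode mode)
    {
        int rc;

        if (mode == MIGRATE_ASYNC) {
            /* never block on a folio lock in async mode */
            if (!folio_trylock(folio))
                return -EAGAIN;
        } else {
            folio_lock(folio);
        }
        rc = split_folio_to_list(folio, split_folios);
        folio_unlock(folio);

        return rc;
    }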
2024-08-15  alloc_tag: mark pages reserved during CMA activation as not tagged  (Suren Baghdasaryan; 1 file, -0/+2)

During CMA activation, pages in the CMA area are prepared and then freed without being allocated. This triggers warnings when the memory allocation debug config (CONFIG_MEM_ALLOC_PROFILING_DEBUG) is enabled. Fix this by marking these pages not tagged before freeing them.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: d224eb0287fb ("codetag: debug: mark codetags for reserved pages as empty")
Signed-off-by: Suren Baghdasaryan <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Pasha Tatashin <[email protected]>
Cc: Sourav Panda <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: <[email protected]> [6.10]
Signed-off-by: Andrew Morton <[email protected]>
2024-08-15  alloc_tag: introduce clear_page_tag_ref() helper function  (Suren Baghdasaryan; 3 files, -17/+15)

In several cases we are freeing pages which were not allocated using common page allocators. For such cases, in order to keep allocation accounting correct, we should clear the page tag to indicate that the page being freed is expected to not have a valid allocation tag. Introduce the clear_page_tag_ref() helper function to be used for this.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: d224eb0287fb ("codetag: debug: mark codetags for reserved pages as empty")
Signed-off-by: Suren Baghdasaryan <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Pasha Tatashin <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Sourav Panda <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: <[email protected]> [6.10]
Signed-off-by: Andrew Morton <[email protected]>
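A plausible shape for such a helper, hedged against the alloc_tag machinery of that kernel version (the accessor names are assumptions):

    /* Mark the page's tag reference empty, so freeing a page that never
     * went through the page allocator does not trip
     * CONFIG_MEM_ALLOC_PROFILING_DEBUG warnings.
     */
    static inline void clear_page_tag_ref(struct page *page)
    {
        if (mem_alloc_profiling_enabled()) {
            union codetag_ref *ref = get_page_tag_ref(page);

            if (ref) {
                set_codetag_empty(ref);
                put_page_tag_ref(ref);
            }
        }
    }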
2024-08-15  crash: fix riscv64 crash memory reserve dead loop  (Jinjie Ruan; 1 file, -1/+2)

On a RISCV64 QEMU machine with 512MB memory, the cmdline "crashkernel=500M,high" will cause the system to stall as below:

    Zone ranges:
      DMA32    [mem 0x0000000080000000-0x000000009fffffff]
      Normal   empty
    Movable zone start for each node
    Early memory node ranges
      node   0: [mem 0x0000000080000000-0x000000008005ffff]
      node   0: [mem 0x0000000080060000-0x000000009fffffff]
    Initmem setup node 0 [mem 0x0000000080000000-0x000000009fffffff]
    (stall here)

Commit 5d99cadf1568 ("crash: fix x86_32 crash memory reserve dead loop bug") fixed this on 32-bit architectures. However, the problem is not completely solved. If CRASH_ADDR_LOW_MAX = CRASH_ADDR_HIGH_MAX on a 64-bit architecture - for example, when system memory is equal to CRASH_ADDR_LOW_MAX on RISCV64 - the following infinite loop will also occur:

    -> reserve_crashkernel_generic() and high is true
       -> alloc at [CRASH_ADDR_LOW_MAX, CRASH_ADDR_HIGH_MAX] fail
          -> alloc at [0, CRASH_ADDR_LOW_MAX] fail and repeatedly
             (because CRASH_ADDR_LOW_MAX = CRASH_ADDR_HIGH_MAX).

As Catalin suggested, do not remove the ",high" reservation fallback to ",low" logic, which would change arm64's kdump behavior, but fix it by skipping the above situation, similar to commit d2f32f23190b ("crash: fix x86_32 crash memory reserve dead loop").

After this patch, it prints:

    cannot allocate crashkernel (size:0x1f400000)

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jinjie Ruan <[email protected]>
Suggested-by: Catalin Marinas <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Acked-by: Baoquan He <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: Dave Young <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Vivek Goyal <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
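A sketch of the retry logic with the fix applied (variable names approximate the generic helper; not the literal diff):

    retry:
        crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
                                               search_base, search_end);
        if (!crash_base) {
            /* Fall back to the low region only when it is actually
             * distinct; otherwise we would loop forever when
             * CRASH_ADDR_LOW_MAX == CRASH_ADDR_HIGH_MAX.
             */
            if (high && search_end == CRASH_ADDR_HIGH_MAX &&
                CRASH_ADDR_LOW_MAX < CRASH_ADDR_HIGH_MAX) {
                search_end = CRASH_ADDR_LOW_MAX;
                search_base = 0;
                goto retry;
            }
            pr_warn("cannot allocate crashkernel (size:0x%llx)\n",
                    crash_size);
            return;
        }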
2024-08-15  selftests: memfd_secret: don't build memfd_secret test on unsupported arches  (Muhammad Usama Anjum; 2 files, -0/+5)

[1] mentions that memfd_secret is only supported on arm64, riscv, x86 and x86_64 for now. It doesn't support other architectures. I found the build error on arm and decided to send the fix, as it was creating noise on KernelCI:

    memfd_secret.c: In function 'memfd_secret':
    memfd_secret.c:42:24: error: '__NR_memfd_secret' undeclared (first use in this function); did you mean 'memfd_secret'?
       42 |         return syscall(__NR_memfd_secret, flags);
          |                        ^~~~~~~~~~~~~~~~~
          |                        memfd_secret

Hence I'm adding a condition so that memfd_secret is only compiled on supported architectures. Also check in the run_vmtests script that the memfd_secret binary is present before executing it.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/[email protected]/ [1]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 76fe17ef588a ("secretmem: test: add basic selftest for memfd_secret(2)")
Signed-off-by: Muhammad Usama Anjum <[email protected]>
Reviewed-by: Shuah Khan <[email protected]>
Acked-by: Mike Rapoport (Microsoft) <[email protected]>
Cc: Albert Ou <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: Mike Rapoport (Microsoft) <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
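The guard pattern, in sketch form: only define the syscall wrapper when the architecture provides the syscall number, so unsupported arches skip the test instead of failing the build.

    #include <unistd.h>
    #include <sys/syscall.h>

    #ifdef __NR_memfd_secret
    static int memfd_secret(unsigned int flags)
    {
        return syscall(__NR_memfd_secret, flags);
    }
    #endif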
2024-08-15  mm: fix endless reclaim on machines with unaccepted memory  (Kirill A. Shutemov; 1 file, -22/+20)

Unaccepted memory is considered unusable free memory, which is not counted as free on the zone watermark check. This causes get_page_from_freelist() to accept more memory to hit the high watermark, but it creates problems in the reclaim path.

The reclaim path encounters a failed zone watermark check and attempts to reclaim memory. This is usually successful, but if there is little or no reclaimable memory, it can result in endless reclaim with little to no progress. This can occur early in the boot process, just after the start of the init process, when the only reclaimable memory is the page cache of the init executable and its libraries.

Make unaccepted memory free from the watermark check's point of view. This way unaccepted memory will never be the trigger of memory reclaim. Accept more memory in get_page_from_freelist() if needed.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: dcdfdd40fa82 ("mm: Add support for unaccepted memory")
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reported-by: Jianxiong Gao <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Tested-by: Jianxiong Gao <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Mike Rapoport (Microsoft) <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: <[email protected]> [6.5+]
Signed-off-by: Andrew Morton <[email protected]>
2024-08-15  selftests/mm: compaction_test: fix off by one in check_compaction()  (Dan Carpenter; 1 file, -2/+3)

The "initial_nr_hugepages" variable is an unsigned long, so it takes up to 20 characters to print, plus 1 more character for the NUL terminator. Unfortunately, this buffer is not quite large enough for the terminator to fit. Also use snprintf() for a belt and suspenders approach.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: fb9293b6b015 ("selftests/mm: compaction_test: fix bogus test success and reduce probability of OOM-killer invocation")
Signed-off-by: Dan Carpenter <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-08-15  mm/numa: no task_numa_fault() call if PMD is changed  (Zi Yan; 1 file, -16/+13)

When handling a numa page fault, task_numa_fault() should be called by a process that restores the page table of the faulted folio, to avoid duplicated stats counting. Commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault handling") restructured do_huge_pmd_numa_page() and did not avoid the task_numa_fault() call in the second page table check after a numa migration failure. Fix it by making all !pmd_same() cases return immediately.

This issue can cause task_numa_fault() to be called more often than necessary and lead to unexpected numa balancing results (it is hard to tell whether the issue will cause positive or negative performance impact due to duplicated numa fault counting).

Link: https://lkml.kernel.org/r/[email protected]
Fixes: c5b5a3dd2c1f ("mm: thp: refactor NUMA fault handling")
Reported-by: "Huang, Ying" <[email protected]>
Closes: https://lore.kernel.org/linux-mm/[email protected]/
Signed-off-by: Zi Yan <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-08-15  mm/numa: no task_numa_fault() call if PTE is changed  (Zi Yan; 1 file, -17/+16)

When handling a numa page fault, task_numa_fault() should be called by a process that restores the page table of the faulted folio, to avoid duplicated stats counting. Commit b99a342d4f11 ("NUMA balancing: reduce TLB flush via delaying mapping on hint page fault") restructured do_numa_page() and did not avoid the task_numa_fault() call in the second page table check after a numa migration failure. Fix it by making all !pte_same() cases return immediately. A sketch of the resulting control flow follows this message.

This issue can cause task_numa_fault() to be called more often than necessary and lead to unexpected numa balancing results (it is hard to tell whether the issue will cause positive or negative performance impact due to duplicated numa fault counting).

Link: https://lkml.kernel.org/r/[email protected]
Fixes: b99a342d4f11 ("NUMA balancing: reduce TLB flush via delaying mapping on hint page fault")
Signed-off-by: Zi Yan <[email protected]>
Reported-by: "Huang, Ying" <[email protected]>
Closes: https://lore.kernel.org/linux-mm/[email protected]/
Acked-by: David Hildenbrand <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
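The shape of the fix, for both the PMD and PTE variants, is an immediate return when the re-check fails. A hedged sketch for the PTE case (fragment of do_numa_page(); details elided):

    /* After a failed migration, re-take the PTL and re-check the PTE.
     * If another task already restored it, bail out right away instead
     * of falling through to task_numa_fault(), which would count the
     * fault twice.
     */
    vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
                                   vmf->address, &vmf->ptl);
    if (unlikely(!vmf->pte))
        return 0;
    if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
        pte_unmap_unlock(vmf->pte, vmf->ptl);
        return 0;   /* no task_numa_fault() on this path */
    }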
2024-08-15  mm/vmalloc: fix page mapping if vm_area_alloc_pages() with high order fallback to order 0  (Hailong Liu; 1 file, -9/+2)

__vmap_pages_range_noflush() assumes its pages** argument contains pages with the same page shift. However, since commit e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations"), if gfp_flags includes __GFP_NOFAIL with high order in vm_area_alloc_pages() and the high-order page allocation fails, pages** may contain two different page shifts (high order and order-0). This could lead __vmap_pages_range_noflush() to perform incorrect mappings, potentially resulting in memory corruption.

Users might encounter this as follows (vmap_allow_huge = true, 2M is for PMD_SIZE):

    kvmalloc(2M, __GFP_NOFAIL|GFP_X)
        __vmalloc_node_range_noprof(vm_flags=VM_ALLOW_HUGE_VMAP)
            vm_area_alloc_pages(order=9)
                ---> order-9 allocation failed and fallback to order-0
            vmap_pages_range()
                vmap_pages_range_noflush()
                    __vmap_pages_range_noflush(page_shift = 21)
                        ----> wrong mapping happens

We can remove the fallback code because if a high-order allocation fails, __vmalloc_node_range_noprof() will retry with order-0. It is therefore unnecessary to fall back to order-0 here, and the fix is simply to remove the fallback code.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations")
Signed-off-by: Hailong Liu <[email protected]>
Reported-by: Tangquan Zheng <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Acked-by: Barry Song <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-08-15  mm/memory-failure: use raw_spinlock_t in struct memory_failure_cpu  (Waiman Long; 1 file, -9/+11)

The memory_failure_cpu structure is a per-cpu structure. Access to its content requires the use of get_cpu_var() to lock in the current CPU and disable preemption. The use of a regular spinlock_t for locking purposes is fine for a non-RT kernel. Since the integration of RT spinlock support into the v5.15 kernel, a spinlock_t in an RT kernel becomes a sleeping lock, and taking a sleeping lock in a preemption-disabled context is illegal, resulting in the following kind of warning:

    [12135.732244] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
    [12135.732248] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 270076, name: kworker/0:0
    [12135.732252] preempt_count: 1, expected: 0
    [12135.732255] RCU nest depth: 2, expected: 2
    :
    [12135.732420] Hardware name: Dell Inc. PowerEdge R640/0HG0J8, BIOS 2.10.2 02/24/2021
    [12135.732423] Workqueue: kacpi_notify acpi_os_execute_deferred
    [12135.732433] Call Trace:
    [12135.732436]  <TASK>
    [12135.732450]  dump_stack_lvl+0x57/0x81
    [12135.732461]  __might_resched.cold+0xf4/0x12f
    [12135.732479]  rt_spin_lock+0x4c/0x100
    [12135.732491]  memory_failure_queue+0x40/0xe0
    [12135.732503]  ghes_do_memory_failure+0x53/0x390
    [12135.732516]  ghes_do_proc.constprop.0+0x229/0x3e0
    [12135.732575]  ghes_proc+0xf9/0x1a0
    [12135.732591]  ghes_notify_hed+0x6a/0x150
    [12135.732602]  notifier_call_chain+0x43/0xb0
    [12135.732626]  blocking_notifier_call_chain+0x43/0x60
    [12135.732637]  acpi_ev_notify_dispatch+0x47/0x70
    [12135.732648]  acpi_os_execute_deferred+0x13/0x20
    [12135.732654]  process_one_work+0x41f/0x500
    [12135.732695]  worker_thread+0x192/0x360
    [12135.732715]  kthread+0x111/0x140
    [12135.732733]  ret_from_fork+0x29/0x50
    [12135.732779]  </TASK>

Fix it by using a raw_spinlock_t for locking instead.

Also move the pr_err() out of the lock critical section and after put_cpu_ptr() to avoid indeterminate latency and the possibility of sleep with this call.

[[email protected]: don't hold percpu ref across pr_err(), per Miaohe]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 0f383b6dc96e ("locking/spinlock: Provide RT variant")
Signed-off-by: Waiman Long <[email protected]>
Acked-by: Miaohe Lin <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
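A sketch of the resulting queueing path (abridged from mm/memory-failure.c per the description above; field layout and the exact pr_err() wording are approximations): a raw_spinlock_t stays a real spinning lock on PREEMPT_RT, so it may be taken inside the get_cpu_var() region, and the error print happens only after the per-cpu reference is dropped.

    struct memory_failure_cpu {
        DECLARE_KFIFO(fifo, struct memory_failure_entry,
                      MEMORY_FAILURE_FIFO_SIZE);
        raw_spinlock_t lock;    /* was spinlock_t */
        struct work_struct work;
    };

    void memory_failure_queue(unsigned long pfn, int flags)
    {
        struct memory_failure_entry entry = { .pfn = pfn, .flags = flags };
        struct memory_failure_cpu *mf_cpu;
        unsigned long proc_flags;
        bool buffer_overflow;

        mf_cpu = &get_cpu_var(memory_failure_cpu);
        raw_spin_lock_irqsave(&mf_cpu->lock, proc_flags);
        buffer_overflow = !kfifo_put(&mf_cpu->fifo, entry);
        if (!buffer_overflow)
            schedule_work_on(smp_processor_id(), &mf_cpu->work);
        raw_spin_unlock_irqrestore(&mf_cpu->lock, proc_flags);
        put_cpu_var(memory_failure_cpu);

        /* printing only after the per-cpu reference is released */
        if (buffer_overflow)
            pr_err("Memory failure: buffer overflow when queuing memory failure at %#lx\n",
                   pfn);
    }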
2024-08-15  mm: don't account memmap per-node  (Pasha Tatashin; 9 files, -60/+41)

Fix invalid access to pgdat during hot-remove operation:

ndctl users reported a GPF when trying to destroy a namespace:

    $ ndctl destroy-namespace all -r all -f
      Segmentation fault

    dmesg:
    Oops: general protection fault, probably for non-canonical address 0xdffffc0000005650: 0000 [#1] PREEMPT SMP KASAN PTI
    KASAN: probably user-memory-access in range [0x000000000002b280-0x000000000002b287]
    CPU: 26 UID: 0 PID: 1868 Comm: ndctl Not tainted 6.11.0-rc1 #1
    Hardware name: Dell Inc. PowerEdge R640/08HT8T, BIOS 2.20.1 09/13/2023
    RIP: 0010:mod_node_page_state+0x2a/0x110

cxl-test users report a GPF when trying to unload the test module:

    $ modprobe -r cxl-test

    dmesg:
    BUG: unable to handle page fault for address: 0000000000004200
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 0 P4D 0
    Oops: Oops: 0000 [#1] PREEMPT SMP PTI
    CPU: 0 UID: 0 PID: 1076 Comm: modprobe Tainted: G O N 6.11.0-rc1 #197
    Tainted: [O]=OOT_MODULE, [N]=TEST
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/15
    RIP: 0010:mod_node_page_state+0x6/0x90

Currently, when memory is hot-plugged or hot-removed, the accounting is done based on the assumption that the memmap is allocated from the same node as the hot-plugged/hot-removed memory, which is not always the case. In addition, there are challenges with keeping the node id of the memory that is being removed until the time when memmap accounting is actually performed: this is done after remove_pfn_range_from_zone(), and also after remove_memory_block_devices(), meaning that we cannot use pgdat nor walk through memblocks to get the nid.

Given all of that, account the memmap overhead system-wide instead.

For this we are going to use global atomic counters; but given that memmap size is rarely modified, and normally is only modified either during early boot when there is only one CPU, or under a hotplug global mutex lock, there is no need for per-cpu optimizations.

Also, while we are here, rename nr_memmap to nr_memmap_pages, and nr_memmap_boot to nr_memmap_boot_pages, to be self-explanatory that the units are in page count.

[[email protected]: address a few nits from David Hildenbrand]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 15995a352474 ("mm: report per-page metadata information")
Signed-off-by: Pasha Tatashin <[email protected]>
Reported-by: Yi Zhang <[email protected]>
Closes: https://lore.kernel.org/linux-cxl/CAHj4cs9Ax1=CoJkgBGP_+sNu6-6=6v=_L-ZBZY0bVLD3wUWZQg@mail.gmail.com
Reported-by: Alison Schofield <[email protected]>
Closes: https://lore.kernel.org/linux-mm/Zq0tPd2h6alFz8XF@aschofie-mobl2/#t
Tested-by: Dan Williams <[email protected]>
Tested-by: Alison Schofield <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: David Rientjes <[email protected]>
Tested-by: Yi Zhang <[email protected]>
Cc: Domenico Cerasuolo <[email protected]>
Cc: Fan Ni <[email protected]>
Cc: Joel Granados <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Li Zhijian <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Sourav Panda <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
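The system-wide accounting described above can be sketched as plain global atomics (counter names follow the patch description; the helper shape is an assumption). Since updates happen rarely - at early boot or under the hotplug lock - no per-cpu machinery is needed:

    static atomic_long_t nr_memmap_pages;
    static atomic_long_t nr_memmap_boot_pages;

    /* memmap allocated via the buddy allocator (part of MemTotal) */
    void memmap_pages_add(long delta)
    {
        atomic_long_add(delta, &nr_memmap_pages);
    }

    /* memmap allocated via the boot allocator (not part of MemTotal) */
    void memmap_boot_pages_add(long delta)
    {
        atomic_long_add(delta, &nr_memmap_boot_pages);
    }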
2024-08-15  mm: add system wide stats items category  (Pasha Tatashin; 2 files, -14/+7)

/proc/vmstat contains events and stats; events can only grow, but stats can grow and shrink. vmstat has the following:

    NR_VM_ZONE_STAT_ITEMS:      per-zone stats
    NR_VM_NUMA_EVENT_ITEMS:     per-numa events
    NR_VM_NODE_STAT_ITEMS:      per-numa stats
    NR_VM_WRITEBACK_STAT_ITEMS: system-wide background-writeback and
                                dirty-throttling thresholds
    NR_VM_EVENT_ITEMS:          system-wide events

Rename NR_VM_WRITEBACK_STAT_ITEMS to NR_VM_STAT_ITEMS to track the system-wide stats; we are going to add per-page metadata stats to this category in the next patch. Also delete the unused writeback_stat_name().

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 15995a352474 ("mm: report per-page metadata information")
Signed-off-by: Pasha Tatashin <[email protected]>
Suggested-by: Yosry Ahmed <[email protected]>
Tested-by: Alison Schofield <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Domenico Cerasuolo <[email protected]>
Cc: Joel Granados <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Li Zhijian <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Sourav Panda <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yi Zhang <[email protected]>
Cc: Fan Ni <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-08-15  mm: don't account memmap on failure  (Pasha Tatashin; 1 file, -4/+1)

Patch series "Fixes for memmap accounting", v4.

Memmap accounting provides us with observability of how much memory is used for per-page metadata: i.e. "struct page"'s and "struct page_ext". It also provides information on how much was allocated using the boot allocator (i.e. not part of MemTotal), and how much was allocated using the buddy allocator (i.e. part of MemTotal). This small series fixes a few problems that were discovered with the original patch.

This patch (of 3):

When we fail to allocate the memmap in alloc_vmemmap_page_list(), do not account any already-allocated pages: we're going to free them all before we return from the function.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 15995a352474 ("mm: report per-page metadata information")
Signed-off-by: Pasha Tatashin <[email protected]>
Reviewed-by: Fan Ni <[email protected]>
Reviewed-by: Yosry Ahmed <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Tested-by: Alison Schofield <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Domenico Cerasuolo <[email protected]>
Cc: Joel Granados <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Li Zhijian <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Sourav Panda <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yi Zhang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-08-15  mm/hugetlb: fix hugetlb vs. core-mm PT locking  (David Hildenbrand; 2 files, -3/+41)

We recently made GUP's common page table walking code also walk hugetlb VMAs without most hugetlb special-casing, preparing for the future of having less hugetlb-specific page table walking code in the codebase. Turns out that we missed one page table locking detail: page table locking for hugetlb folios that are not mapped using a single PMD/PUD.

Assume we have a hugetlb folio that spans multiple PTEs (e.g., 64 KiB hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the page tables, will perform a pte_offset_map_lock() to grab the PTE table lock. However, hugetlb code that concurrently modifies these page tables would actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the locks would differ. Something similar can happen right now with hugetlb folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS.

This issue can be reproduced [1], for example triggering:

    [ 3105.936100] ------------[ cut here ]------------
    [ 3105.939323] WARNING: CPU: 31 PID: 2732 at mm/gup.c:142 try_grab_folio+0x11c/0x188
    [ 3105.944634] Modules linked in: [...]
    [ 3105.974841] CPU: 31 PID: 2732 Comm: reproducer Not tainted 6.10.0-64.eln141.aarch64 #1
    [ 3105.980406] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-4.fc40 05/24/2024
    [ 3105.986185] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
    [ 3105.991108] pc : try_grab_folio+0x11c/0x188
    [ 3105.994013] lr : follow_page_pte+0xd8/0x430
    [ 3105.996986] sp : ffff80008eafb8f0
    [ 3105.999346] x29: ffff80008eafb900 x28: ffffffe8d481f380 x27: 00f80001207cff43
    [ 3106.004414] x26: 0000000000000001 x25: 0000000000000000 x24: ffff80008eafba48
    [ 3106.009520] x23: 0000ffff9372f000 x22: ffff7a54459e2000 x21: ffff7a546c1aa978
    [ 3106.014529] x20: ffffffe8d481f3c0 x19: 0000000000610041 x18: 0000000000000001
    [ 3106.019506] x17: 0000000000000001 x16: ffffffffffffffff x15: 0000000000000000
    [ 3106.024494] x14: ffffb85477fdfe08 x13: 0000ffff9372ffff x12: 0000000000000000
    [ 3106.029469] x11: 1fffef4a88a96be1 x10: ffff7a54454b5f0c x9 : ffffb854771b12f0
    [ 3106.034324] x8 : 0008000000000000 x7 : ffff7a546c1aa980 x6 : 0008000000000080
    [ 3106.038902] x5 : 00000000001207cf x4 : 0000ffff9372f000 x3 : ffffffe8d481f000
    [ 3106.043420] x2 : 0000000000610041 x1 : 0000000000000001 x0 : 0000000000000000
    [ 3106.047957] Call trace:
    [ 3106.049522]  try_grab_folio+0x11c/0x188
    [ 3106.051996]  follow_pmd_mask.constprop.0.isra.0+0x150/0x2e0
    [ 3106.055527]  follow_page_mask+0x1a0/0x2b8
    [ 3106.058118]  __get_user_pages+0xf0/0x348
    [ 3106.060647]  faultin_page_range+0xb0/0x360
    [ 3106.063651]  do_madvise+0x340/0x598

Let's make huge_pte_lockptr() effectively use the same PT locks as any core-mm page table walker would. Add ptep_lockptr() to obtain the PTE page table lock using a pte pointer - unfortunately we cannot convert pte_lockptr() because virt_to_page() doesn't work with kmap'ed page tables we can have with CONFIG_HIGHPTE.

Handle CONFIG_PGTABLE_LEVELS correctly by checking in reverse order, such that e.g., CONFIG_PGTABLE_LEVELS==2 with PGDIR_SIZE==P4D_SIZE==PUD_SIZE==PMD_SIZE will work as expected. Document why that works.

There is one ugly case: powerpc 8xx, whereby we have an 8 MiB hugetlb folio being mapped using two PTE page tables. While hugetlb wants to take the PMD table lock, core-mm would grab the PTE table lock of one of both PTE page tables. In such corner cases, we have to make sure that both locks match, which is (fortunately!) currently guaranteed for 8xx as it does not support SMP and consequently doesn't use split PT locks.

[1] https://lore.kernel.org/all/[email protected]/

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code")
Signed-off-by: David Hildenbrand <[email protected]>
Acked-by: Peter Xu <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Tested-by: Baolin Wang <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
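The idea can be sketched as follows (a simplification of the approach described above, not the literal patch; the reverse-order size checks make the degenerate CONFIG_PGTABLE_LEVELS cases fall through to the right lock):

    static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
                                               struct mm_struct *mm,
                                               pte_t *pte)
    {
        const unsigned long size = huge_page_size(h);

        if (size >= PUD_SIZE)
            return pud_lockptr(mm, (pud_t *)pte);
        else if (size >= PMD_SIZE || IS_ENABLED(CONFIG_HIGHPTE))
            return pmd_lockptr(mm, (pmd_t *)pte);
        /* mapped via PTE page table(s): use the PTE table lock */
        return ptep_lockptr(mm, pte);
    }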
2024-08-15  mseal: fix is_madv_discard()  (Pedro Falcato; 1 file, -3/+11)

is_madv_discard() did its check wrong. MADV_ flags are not bitwise; they're normal sequential numbers. So, for instance:

    behavior & (/* ... */ | MADV_REMOVE)

tagged both MADV_REMOVE and MADV_RANDOM (bit 0 set) as discard operations.

As a result the kernel could erroneously block certain madvises (e.g. MADV_RANDOM or MADV_HUGEPAGE) on sealed VMAs due to them sharing bits with blocked MADV operations (e.g. REMOVE or WIPEONFORK).

This is obviously incorrect, so use a switch statement instead.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 8be7258aad44 ("mseal: add mseal syscall")
Signed-off-by: Pedro Falcato <[email protected]>
Tested-by: Jeff Xu <[email protected]>
Reviewed-by: Jeff Xu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
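The switch-based check might look like this (a sketch; the exact set of MADV_* values treated as discards follows the original bitmask's intent):

    /* MADV_* values are sequential integers, not bit flags, so compare
     * them exactly instead of masking.
     */
    static bool is_madv_discard(int behavior)
    {
        switch (behavior) {
        case MADV_FREE:
        case MADV_DONTNEED:
        case MADV_DONTNEED_LOCKED:
        case MADV_REMOVE:
        case MADV_DONTFORK:
        case MADV_WIPEONFORK:
            return true;
        }
        return false;
    }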
2024-08-16  Merge tag 'mediatek-drm-fixes-20240805' of https://git.kernel.org/pub/scm/linux/kernel/git/chunkuang.hu/linux into drm-fixes  (Dave Airlie; 1 file, -2/+2)

Mediatek DRM Fixes - 20240805

1. Set sensible cursor width/height values to fix crash

Signed-off-by: Dave Airlie <[email protected]>
From: Chun-Kuang Hu <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2024-08-15  MAINTAINERS: add selftests to network drivers  (Jakub Kicinski; 1 file, -0/+1)

tools/testing/selftests/drivers/net/ is not listed under networking entries. Add it to NETWORKING DRIVERS.

Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
2024-08-15  bnxt_en: Don't clear ntuple filters and rss contexts during ethtool ops  (Pavan Chebbi; 3 files, -8/+2)

The driver currently blindly deletes its cache of RSS contexts and ntuple filters when the ethtool channel count is changing. It also deletes the ntuple filters cache when the default indirection table is changing.

The core will not allow ethtool channels to drop below any that have been configured as ntuple destinations since this commit from 2022:

47f3ecf4763d ("ethtool: Fail number of channels change when it conflicts with rxnfc")

So there is absolutely no need to delete the ntuple filters and RSS contexts when changing ethtool channels. It is also unnecessary to delete ntuple filters when the default RSS indirection table is changing.

Remove bnxt_clear_usr_fltrs() and bnxt_clear_rss_ctxs() from the ethtool ops and change them to static functions. This bug will cause confusion to the end user and causes failure when running the rss_ctx.py selftest.

Fixes: 1018319f949c ("bnxt_en: Invalidate user filters when needed")
Reported-by: Jakub Kicinski <[email protected]>
Closes: https://lore.kernel.org/netdev/[email protected]/
Reviewed-by: Andy Gospodarek <[email protected]>
Signed-off-by: Pavan Chebbi <[email protected]>
Signed-off-by: Michael Chan <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
2024-08-15  virtio_net: move netdev_tx_reset_queue() call before RX napi enable  (Jiri Pirko; 1 file, -1/+1)

During suspend/resume the following BUG was hit:

    ------------[ cut here ]------------
    kernel BUG at lib/dynamic_queue_limits.c:99!
    Internal error: Oops - BUG: 0 [#1] SMP ARM
    Modules linked in: bluetooth ecdh_generic ecc libaes
    CPU: 1 PID: 1282 Comm: rtcwake Not tainted 6.10.0-rc3-00732-gc8bd1f7f3e61 #15240
    Hardware name: Generic DT based system
    PC is at dql_completed+0x270/0x2cc
    LR is at __free_old_xmit+0x120/0x198
    pc : [<c07ffa54>]    lr : [<c0c42bf4>]    psr: 80000013
    ...
    Flags: Nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
    Control: 10c5387d  Table: 43a4406a  DAC: 00000051
    ...
    Process rtcwake (pid: 1282, stack limit = 0xfbc21278)
    Stack: (0xe0805e80 to 0xe0806000)
    ...
    Call trace:
     dql_completed from __free_old_xmit+0x120/0x198
     __free_old_xmit from free_old_xmit+0x44/0xe4
     free_old_xmit from virtnet_poll_tx+0x88/0x1b4
     virtnet_poll_tx from __napi_poll+0x2c/0x1d4
     __napi_poll from net_rx_action+0x140/0x2b4
     net_rx_action from handle_softirqs+0x11c/0x350
     handle_softirqs from call_with_stack+0x18/0x20
     call_with_stack from do_softirq+0x48/0x50
     do_softirq from __local_bh_enable_ip+0xa0/0xa4
     __local_bh_enable_ip from virtnet_open+0xd4/0x21c
     virtnet_open from virtnet_restore+0x94/0x120
     virtnet_restore from virtio_device_restore+0x110/0x1f4
     virtio_device_restore from dpm_run_callback+0x3c/0x100
     dpm_run_callback from device_resume+0x12c/0x2a8
     device_resume from dpm_resume+0x12c/0x1e0
     dpm_resume from dpm_resume_end+0xc/0x18
     dpm_resume_end from suspend_devices_and_enter+0x1f0/0x72c
     suspend_devices_and_enter from pm_suspend+0x270/0x2a0
     pm_suspend from state_store+0x68/0xc8
     state_store from kernfs_fop_write_iter+0x10c/0x1cc
     kernfs_fop_write_iter from vfs_write+0x2b0/0x3dc
     vfs_write from ksys_write+0x5c/0xd4
     ksys_write from ret_fast_syscall+0x0/0x54
    Exception stack(0xe8bf1fa8 to 0xe8bf1ff0)
    ...
    ---[ end trace 0000000000000000 ]---

After virtnet_napi_enable() is called, the following path is hit:

    __napi_poll()
      -> virtnet_poll()
        -> virtnet_poll_cleantx()
          -> netif_tx_wake_queue()

That wakes the TX queue and allows skbs to be submitted and accounted by BQL counters. Then netdev_tx_reset_queue() is called, which resets the BQL counters and eventually leads to the BUG in dql_completed().

Move the netdev_tx_reset_queue() call, which does the BQL counters reset, before RX napi enable to avoid the issue.

Reported-by: Marek Szyprowski <[email protected]>
Closes: https://lore.kernel.org/netdev/[email protected]/
Fixes: c8bd1f7f3e61 ("virtio_net: add support for Byte Queue Limits")
Tested-by: Marek Szyprowski <[email protected]>
Signed-off-by: Jiri Pirko <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Acked-by: Jason Wang <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
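A sketch of the resulting ordering (function names are from drivers/net/virtio_net.c; the surrounding queue-pair setup is abridged and the wrapper itself is hypothetical):

    static void virtnet_enable_queue_pair_sketch(struct virtnet_info *vi, int qp)
    {
        /* reset the BQL counters first, while nothing can wake the
         * TX queue yet ...
         */
        netdev_tx_reset_queue(netdev_get_tx_queue(vi->dev, qp));

        /* ... then enable napi, which may wake the TX queue and start
         * BQL accounting
         */
        virtnet_napi_enable(vi->rq[qp].vq, &vi->rq[qp].napi);
        virtnet_napi_tx_enable(vi, vi->sq[qp].vq, &vi->sq[qp].napi);
    }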
2024-08-16  Merge tag 'drm-xe-fixes-2024-08-15' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes  (Dave Airlie; 16 files, -160/+247)

- Validate user fence during creation (Brost)
- Fix use after free when client stats are captured (Umesh)
- SRIOV fixes (Michal)
- Runtime PM fixes (Brost)

Signed-off-by: Dave Airlie <[email protected]>
From: Rodrigo Vivi <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2024-08-16  Merge tag 'drm-misc-fixes-2024-08-15' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes  (Dave Airlie; 4 files, -12/+30)

Short summary of fixes pull:

panel:
- dt-bindings style fixes

panel-orientation:
- add quirk for Any Loki Max
- add quirk for Any Loki Zero

rockchip:
- inno-hdmi: fix infoframe upload

v3d:
- fix OOB access in v3d_csd_job_run()

Signed-off-by: Dave Airlie <[email protected]>
From: Thomas Zimmermann <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2024-08-15  block: Fix lockdep warning in blk_mq_mark_tag_wait  (Li Lingfeng; 1 file, -2/+3)
Lockdep reported a warning in Linux version 6.6: [ 414.344659] ================================ [ 414.345155] WARNING: inconsistent lock state [ 414.345658] 6.6.0-07439-gba2303cacfda #6 Not tainted [ 414.346221] -------------------------------- [ 414.346712] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. [ 414.347545] kworker/u10:3/1152 [HC0[0]:SC0[0]:HE0:SE1] takes: [ 414.349245] ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0 [ 414.351204] {IN-SOFTIRQ-W} state was registered at: [ 414.351751] lock_acquire+0x18d/0x460 [ 414.352218] _raw_spin_lock_irqsave+0x39/0x60 [ 414.352769] __wake_up_common_lock+0x22/0x60 [ 414.353289] sbitmap_queue_wake_up+0x375/0x4f0 [ 414.353829] sbitmap_queue_clear+0xdd/0x270 [ 414.354338] blk_mq_put_tag+0xdf/0x170 [ 414.354807] __blk_mq_free_request+0x381/0x4d0 [ 414.355335] blk_mq_free_request+0x28b/0x3e0 [ 414.355847] __blk_mq_end_request+0x242/0xc30 [ 414.356367] scsi_end_request+0x2c1/0x830 [ 414.345155] WARNING: inconsistent lock state [ 414.345658] 6.6.0-07439-gba2303cacfda #6 Not tainted [ 414.346221] -------------------------------- [ 414.346712] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. [ 414.347545] kworker/u10:3/1152 [HC0[0]:SC0[0]:HE0:SE1] takes: [ 414.349245] ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0 [ 414.351204] {IN-SOFTIRQ-W} state was registered at: [ 414.351751] lock_acquire+0x18d/0x460 [ 414.352218] _raw_spin_lock_irqsave+0x39/0x60 [ 414.352769] __wake_up_common_lock+0x22/0x60 [ 414.353289] sbitmap_queue_wake_up+0x375/0x4f0 [ 414.353829] sbitmap_queue_clear+0xdd/0x270 [ 414.354338] blk_mq_put_tag+0xdf/0x170 [ 414.354807] __blk_mq_free_request+0x381/0x4d0 [ 414.355335] blk_mq_free_request+0x28b/0x3e0 [ 414.355847] __blk_mq_end_request+0x242/0xc30 [ 414.356367] scsi_end_request+0x2c1/0x830 [ 414.356863] scsi_io_completion+0x177/0x1610 [ 414.357379] scsi_complete+0x12f/0x260 [ 414.357856] blk_complete_reqs+0xba/0xf0 [ 414.358338] __do_softirq+0x1b0/0x7a2 [ 414.358796] irq_exit_rcu+0x14b/0x1a0 [ 414.359262] sysvec_call_function_single+0xaf/0xc0 [ 414.359828] asm_sysvec_call_function_single+0x1a/0x20 [ 414.360426] default_idle+0x1e/0x30 [ 414.360873] default_idle_call+0x9b/0x1f0 [ 414.361390] do_idle+0x2d2/0x3e0 [ 414.361819] cpu_startup_entry+0x55/0x60 [ 414.362314] start_secondary+0x235/0x2b0 [ 414.362809] secondary_startup_64_no_verify+0x18f/0x19b [ 414.363413] irq event stamp: 428794 [ 414.363825] hardirqs last enabled at (428793): [<ffffffff816bfd1c>] ktime_get+0x1dc/0x200 [ 414.364694] hardirqs last disabled at (428794): [<ffffffff85470177>] _raw_spin_lock_irq+0x47/0x50 [ 414.365629] softirqs last enabled at (428444): [<ffffffff85474780>] __do_softirq+0x540/0x7a2 [ 414.366522] softirqs last disabled at (428419): [<ffffffff813f65ab>] irq_exit_rcu+0x14b/0x1a0 [ 414.367425] other info that might help us debug this: [ 414.368194] Possible unsafe locking scenario: [ 414.368900] CPU0 [ 414.369225] ---- [ 414.369548] lock(&sbq->ws[i].wait); [ 414.370000] <Interrupt> [ 414.370342] lock(&sbq->ws[i].wait); [ 414.370802] *** DEADLOCK *** [ 414.371569] 5 locks held by kworker/u10:3/1152: [ 414.372088] #0: ffff88810130e938 ((wq_completion)writeback){+.+.}-{0:0}, at: process_scheduled_works+0x357/0x13f0 [ 414.373180] #1: ffff88810201fdb8 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_scheduled_works+0x3a3/0x13f0 [ 414.374384] #2: ffffffff86ffbdc0 (rcu_read_lock){....}-{1:2}, at: blk_mq_run_hw_queue+0x637/0xa00 [ 414.375342] #3: 
ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0 [ 414.376377] #4: ffff888106205a08 (&hctx->dispatch_wait_lock){+.-.}-{2:2}, at: blk_mq_dispatch_rq_list+0x1337/0x1ee0 [ 414.378607] stack backtrace: [ 414.379177] CPU: 0 PID: 1152 Comm: kworker/u10:3 Not tainted 6.6.0-07439-gba2303cacfda #6 [ 414.380032] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 414.381177] Workqueue: writeback wb_workfn (flush-253:0) [ 414.381805] Call Trace: [ 414.382136] <TASK> [ 414.382429] dump_stack_lvl+0x91/0xf0 [ 414.382884] mark_lock_irq+0xb3b/0x1260 [ 414.383367] ? __pfx_mark_lock_irq+0x10/0x10 [ 414.383889] ? stack_trace_save+0x8e/0xc0 [ 414.384373] ? __pfx_stack_trace_save+0x10/0x10 [ 414.384903] ? graph_lock+0xcf/0x410 [ 414.385350] ? save_trace+0x3d/0xc70 [ 414.385808] mark_lock.part.20+0x56d/0xa90 [ 414.386317] mark_held_locks+0xb0/0x110 [ 414.386791] ? __pfx_do_raw_spin_lock+0x10/0x10 [ 414.387320] lockdep_hardirqs_on_prepare+0x297/0x3f0 [ 414.387901] ? _raw_spin_unlock_irq+0x28/0x50 [ 414.388422] trace_hardirqs_on+0x58/0x100 [ 414.388917] _raw_spin_unlock_irq+0x28/0x50 [ 414.389422] __blk_mq_tag_busy+0x1d6/0x2a0 [ 414.389920] __blk_mq_get_driver_tag+0x761/0x9f0 [ 414.390899] blk_mq_dispatch_rq_list+0x1780/0x1ee0 [ 414.391473] ? __pfx_blk_mq_dispatch_rq_list+0x10/0x10 [ 414.392070] ? sbitmap_get+0x2b8/0x450 [ 414.392533] ? __blk_mq_get_driver_tag+0x210/0x9f0 [ 414.393095] __blk_mq_sched_dispatch_requests+0xd99/0x1690 [ 414.393730] ? elv_attempt_insert_merge+0x1b1/0x420 [ 414.394302] ? __pfx___blk_mq_sched_dispatch_requests+0x10/0x10 [ 414.394970] ? lock_acquire+0x18d/0x460 [ 414.395456] ? blk_mq_run_hw_queue+0x637/0xa00 [ 414.395986] ? __pfx_lock_acquire+0x10/0x10 [ 414.396499] blk_mq_sched_dispatch_requests+0x109/0x190 [ 414.397100] blk_mq_run_hw_queue+0x66e/0xa00 [ 414.397616] blk_mq_flush_plug_list.part.17+0x614/0x2030 [ 414.398244] ? __pfx_blk_mq_flush_plug_list.part.17+0x10/0x10 [ 414.398897] ? writeback_sb_inodes+0x241/0xcc0 [ 414.399429] blk_mq_flush_plug_list+0x65/0x80 [ 414.399957] __blk_flush_plug+0x2f1/0x530 [ 414.400458] ? __pfx___blk_flush_plug+0x10/0x10 [ 414.400999] blk_finish_plug+0x59/0xa0 [ 414.401467] wb_writeback+0x7cc/0x920 [ 414.401935] ? __pfx_wb_writeback+0x10/0x10 [ 414.402442] ? mark_held_locks+0xb0/0x110 [ 414.402931] ? __pfx_do_raw_spin_lock+0x10/0x10 [ 414.403462] ? lockdep_hardirqs_on_prepare+0x297/0x3f0 [ 414.404062] wb_workfn+0x2b3/0xcf0 [ 414.404500] ? __pfx_wb_workfn+0x10/0x10 [ 414.404989] process_scheduled_works+0x432/0x13f0 [ 414.405546] ? __pfx_process_scheduled_works+0x10/0x10 [ 414.406139] ? do_raw_spin_lock+0x101/0x2a0 [ 414.406641] ? assign_work+0x19b/0x240 [ 414.407106] ? lock_is_held_type+0x9d/0x110 [ 414.407604] worker_thread+0x6f2/0x1160 [ 414.408075] ? __kthread_parkme+0x62/0x210 [ 414.408572] ? lockdep_hardirqs_on_prepare+0x297/0x3f0 [ 414.409168] ? __kthread_parkme+0x13c/0x210 [ 414.409678] ? __pfx_worker_thread+0x10/0x10 [ 414.410191] kthread+0x33c/0x440 [ 414.410602] ? __pfx_kthread+0x10/0x10 [ 414.411068] ret_from_fork+0x4d/0x80 [ 414.411526] ? __pfx_kthread+0x10/0x10 [ 414.411993] ret_from_fork_asm+0x1b/0x30 [ 414.412489] </TASK> When interrupts are turned on while a lock taken with spin_lock_irq() is still held, lockdep throws this warning because of the potential deadlock.
blk_mq_prep_dispatch_rq blk_mq_get_driver_tag __blk_mq_get_driver_tag __blk_mq_alloc_driver_tag blk_mq_tag_busy -> tag is already busy // failed to get driver tag blk_mq_mark_tag_wait spin_lock_irq(&wq->lock) -> lock A (&sbq->ws[i].wait) __add_wait_queue(wq, wait) -> wait queue active blk_mq_get_driver_tag __blk_mq_tag_busy -> 1) tag must be idle, which means there can't be inflight IO spin_lock_irq(&tags->lock) -> lock B (hctx->tags) spin_unlock_irq(&tags->lock) -> unlock B, turns interrupts back on accidentally -> 2) the context must be preempted by an IO interrupt to trigger the deadlock. As shown above, the deadlock is not possible in theory, but the warning still needs to be fixed. Fix it by using spin_lock_irqsave() to take lock B instead of spin_lock_irq(). Fixes: 4f1731df60f9 ("blk-mq: fix potential io hang by wrong 'wake_batch'") Signed-off-by: Li Lingfeng <[email protected]> Reviewed-by: Ming Lei <[email protected]> Reviewed-by: Yu Kuai <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
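To make the fix concrete, a minimal sketch of the change (variable names are illustrative; the real change is in blk_mq_mark_tag_wait() in block/blk-mq.c):

    unsigned long flags;

    spin_lock_irq(&wq->lock);               /* lock A: &sbq->ws[i].wait */
    __add_wait_queue(wq, wait);
    /* lock B: was spin_lock_irq()/spin_unlock_irq(), whose unlock
     * re-enabled interrupts unconditionally while lock A was held */
    spin_lock_irqsave(&tags->lock, flags);
    /* ... recheck whether a tag became available ... */
    spin_unlock_irqrestore(&tags->lock, flags); /* restores saved IRQ state */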
2024-08-16Merge tag 'amd-drm-fixes-6.11-2024-08-14' of ↵Dave Airlie26-252/+420
https://gitlab.freedesktop.org/agd5f/linux into drm-fixes amd-drm-fixes-6.11-2024-08-14: amdgpu: - Fix MES ring buffer overflow - DCN 3.5 fix - DCN 3.2.1 fix - DP MST fix - Cursor fixes - JPEG fixes - Context ops validation - MES 12 fixes - VCN 5.0 fix - HDP fix Signed-off-by: Dave Airlie <[email protected]> From: Alex Deucher <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2024-08-15Merge tag 'perf-tools-fixes-for-v6.11-2024-08-15' of ↵Linus Torvalds22-426/+814
git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools Pull perf tools fixes from Namhyung Kim: "The usual header file sync-ups and one more build fix: - Add README file to explain why we copy the headers - Sync UAPI and other header files with kernel source - Fix build on MIPS 32-bit" * tag 'perf-tools-fixes-for-v6.11-2024-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: perf daemon: Fix the build on 32-bit architectures tools/include: Sync arm64 headers with the kernel sources tools/include: Sync x86 headers with the kernel sources tools/include: Sync filesystem headers with the kernel sources tools/include: Sync network socket headers with the kernel sources tools/include: Sync uapi/asm-generic/unistd.h with the kernel sources tools/include: Sync uapi/sound/asound.h with the kernel sources tools/include: Sync uapi/linux/perf.h with the kernel sources tools/include: Sync uapi/linux/kvm.h with the kernel sources tools/include: Sync uapi/drm/i915_drm.h with the kernel sources perf tools: Add tools/include/uapi/README
2024-08-15smb: smb2pdu.h: Use static_assert() to check struct sizesGustavo A. R. Silva1-0/+2
Commit 9f9bef9bc5c6 ("smb: smb2pdu.h: Avoid -Wflex-array-member-not-at-end warnings") introduced tagged `struct create_context_hdr`. We want to ensure that when new members need to be added to the flexible structure, they are always included within this tagged struct. So, we use `static_assert()` to ensure that the memory layout for both the flexible structure and the tagged struct is the same after any changes. Acked-by: Namjae Jeon <[email protected]> Signed-off-by: Gustavo A. R. Silva <[email protected]> Signed-off-by: Steve French <[email protected]>
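As an illustration of the pattern, a simplified sketch (the field names and layout here are made up, not the real smb2pdu.h definitions):

    #include <assert.h>
    #include <stddef.h>

    /* The tagged struct mirrors the fixed part of the flexible structure. */
    struct create_context_hdr {
            unsigned int next;
            unsigned short name_offset;
    };

    struct create_context {
            struct create_context_hdr hdr;
            unsigned char Buffer[];         /* flexible array member */
    };

    /* Trips at compile time if a new fixed member is ever added to
     * struct create_context outside of the tagged header. */
    static_assert(offsetof(struct create_context, Buffer) ==
                  sizeof(struct create_context_hdr),
                  "new members must go inside create_context_hdr");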
2024-08-15smb3: fix lock breakage for cached writesSteve French1-4/+9
Mandatory locking is enforced for cached writes, which violates default posix semantics and is also enforced inconsistently. This apparently breaks recent versions of libreoffice, but can also be demonstrated by opening a file twice from the same client, locking it from handle one and writing to it from handle two (which fails, returning EACCES). Since there was already a mount option "forcemandatorylock" (which defaults to off), with this change we only break posix semantics on a write to a locked range when the user intentionally specifies that option on mount, i.e. the write now fails only if the user mounted with "forcemandatorylock". Fixes: 85160e03a79e ("CIFS: Implement caching mechanism for mandatory brlocks") Cc: [email protected] Cc: Pavel Shilovsky <[email protected]> Reported-by: [email protected] Reported-by: Kevin Ottens <[email protected]> Reviewed-by: David Howells <[email protected]> Signed-off-by: Steve French <[email protected]>
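The resulting write-path logic is roughly the following sketch (the helper names are hypothetical, not actual cifs functions):

    /* Only enforce mandatory byte-range locks on cached writes when the
     * user opted in at mount time; otherwise keep posix semantics. */
    if (!mounted_with_forcemandatorylock(cifs_sb))      /* hypothetical */
            return true;                                /* write proceeds */
    return !range_conflicts_with_brlock(file, offset, len); /* hypothetical */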
2024-08-15Merge tag 'md-6.11-20240815' of ↵Jens Axboe1-4/+10
https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-6.11 Pull MD fix from Song: "This patch fixes a potential data corruption in a degraded raid1 array with slow (WriteMostly) drives. This issue was introduced in the upstream 6.9 kernel." * tag 'md-6.11-20240815' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md/raid1: Fix data corruption for degraded array with slow disk
2024-08-15md/raid1: Fix data corruption for degraded array with slow diskYu Kuai1-4/+10
read_balance() will avoid reading from slow disks as much as possible, however, if valid data only lands in slow disks, and a new normal disk is still in recovery, unrecovered data can be read: raid1_read_request read_balance raid1_should_read_first -> return false choose_best_rdev -> normal disk is not recovered, return -1 choose_bb_rdev -> recovery check missing, return the normal disk -> read unrecovered data The root cause is that the recovery check is missing in choose_bb_rdev(). Hence add such a check to fix the problem. Also fix a similar problem in choose_slow_rdev(). Cc: [email protected] Fixes: 9f3ced792203 ("md/raid1: factor out choose_bb_rdev() from read_balance()") Fixes: dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()") Reported-and-tested-by: Mateusz Jończyk <[email protected]> Closes: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Yu Kuai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
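A sketch of the added check (names follow read_balance()'s conventions but are illustrative, not the exact diff):

    /* Skip disks whose recovery has not yet reached the sectors we want
     * to read, so unrecovered data is never returned. */
    if (!test_bit(In_sync, &rdev->flags) &&
        rdev->recovery_offset < this_sector + sectors)
            continue;       /* not recovered here yet, try another rdev */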
2024-08-15smb/client: avoid possible NULL dereference in cifs_free_subrequest()Su Hui1-2/+6
Clang static checker (scan-build) warning: cifsglob.h:line 890, column 3 Access to field 'ops' results in a dereference of a null pointer. Commit 519be989717c ("cifs: Add a tracepoint to track credits involved in R/W requests") adds a check for 'rdata->server', which lets clang throw this warning about a NULL dereference. When 'rdata->credits.value != 0 && rdata->server == NULL' happens, add_credits_and_wake_if() will call rdata->server->ops->add_credits(), causing a NULL pointer dereference. Add a check for 'rdata->server' to avoid the NULL dereference. Cc: [email protected] Fixes: 69c3c023af25 ("cifs: Implement netfslib hooks") Reviewed-by: David Howells <[email protected]> Signed-off-by: Su Hui <[email protected]> Signed-off-by: Steve French <[email protected]>
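The fix amounts to guarding the credit release, roughly (a sketch, not the verbatim diff):

    /* In cifs_free_subrequest(): only go through the server ops when
     * the server pointer is actually set. */
    if (rdata->credits.value && rdata->server)
            add_credits_and_wake_if(rdata->server, &rdata->credits, 0);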
2024-08-15Merge patch series "RISC-V: hwprobe: Misaligned scalar perf fix and rename"Palmer Dabbelt6-29/+44
Evan Green <[email protected]> says: The CPUPERF0 hwprobe key was documented and identified in code as a bitmask value, but its contents were an enum. This produced incorrect behavior in conjunction with the WHICH_CPUS hwprobe flag. The first patch in this series fixes the bitmask/enum problem by creating a new hwprobe key that returns the same data, but is properly described as a value instead of a bitmask. The second patch renames the value definitions in preparation for adding vector misaligned access info. As of this version, the old defines are kept in place to maintain source compatibility with older userspace programs. * b4-shazam-merge: RISC-V: hwprobe: Add SCALAR to misaligned perf defines RISC-V: hwprobe: Add MISALIGNED_PERF key Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Palmer Dabbelt <[email protected]>
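To see why the mismatch matters, here is roughly how WHICH_CPUS composes per-CPU answers (the enum values below are assumed for illustration only):

    /* WHICH_CPUS narrows the CPU set by intersecting per-CPU answers as
     * bitmasks, which is meaningless for enum contents. Assuming e.g.
     * EMULATED = 1 and SLOW = 2: */
    unsigned long cpu0 = 1;                 /* EMULATED */
    unsigned long cpu1 = 2;                 /* SLOW */
    unsigned long combined = cpu0 & cpu1;   /* == 0, i.e. "unknown": nonsense */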
2024-08-15riscv: Fix out-of-bounds when accessing Andes per hart vendor extension arrayAlexandre Ghiti1-1/+1
The out-of-bounds access is reported by UBSAN: [ 0.000000] UBSAN: array-index-out-of-bounds in ../arch/riscv/kernel/vendor_extensions.c:41:66 [ 0.000000] index -1 is out of range for type 'riscv_isavendorinfo [32]' [ 0.000000] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.11.0-rc2ubuntu-defconfig #2 [ 0.000000] Hardware name: riscv-virtio,qemu (DT) [ 0.000000] Call Trace: [ 0.000000] [<ffffffff94e078ba>] dump_backtrace+0x32/0x40 [ 0.000000] [<ffffffff95c83c1a>] show_stack+0x38/0x44 [ 0.000000] [<ffffffff95c94614>] dump_stack_lvl+0x70/0x9c [ 0.000000] [<ffffffff95c94658>] dump_stack+0x18/0x20 [ 0.000000] [<ffffffff95c8bbb2>] ubsan_epilogue+0x10/0x46 [ 0.000000] [<ffffffff95485a82>] __ubsan_handle_out_of_bounds+0x94/0x9c [ 0.000000] [<ffffffff94e09442>] __riscv_isa_vendor_extension_available+0x90/0x92 [ 0.000000] [<ffffffff94e043b6>] riscv_cpufeature_patch_func+0xc4/0x148 [ 0.000000] [<ffffffff94e035f8>] _apply_alternatives+0x42/0x50 [ 0.000000] [<ffffffff95e04196>] apply_boot_alternatives+0x3c/0x100 [ 0.000000] [<ffffffff95e05b52>] setup_arch+0x85a/0x8bc [ 0.000000] [<ffffffff95e00ca0>] start_kernel+0xa4/0xfb6 The dereferencing using cpu should actually not happen, so remove it. Fixes: 23c996fc2bc1 ("riscv: Extend cpufeature.c to detect vendor extensions") Signed-off-by: Alexandre Ghiti <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Palmer Dabbelt <[email protected]>
2024-08-15KEYS: trusted: dcp: fix leak of blob encryption keyDavid Gstir1-12/+21
Trusted keys unseal the key blob on load, but keep the sealed payload in the blob field so that every subsequent read (export) will simply convert this field to hex and send it to userspace. With DCP-based trusted keys, we decrypt the blob encryption key (BEK) in the kernel due to hardware limitations and then decrypt the blob payload. BEK decryption is done in-place, which means that the trusted key blob field is modified and consequently holds the BEK in plain text. Every subsequent read of that key thus sends the plain text BEK instead of the encrypted BEK to userspace. This issue only occurs when importing a trusted DCP-based key and then exporting it again. This should rarely happen as the common use cases are to either create a new trusted key and export it, or import a key blob and then just use it without exporting it again. Fix this by performing BEK decryption and encryption in a dedicated buffer. Further, always wipe the plain text BEK buffer to prevent leaking the key via uninitialized memory. Cc: [email protected] # v6.10+ Fixes: 2e8a0f40a39c ("KEYS: trusted: Introduce NXP DCP-backed trusted keys") Signed-off-by: David Gstir <[email protected]> Reviewed-by: Jarkko Sakkinen <[email protected]> Signed-off-by: Jarkko Sakkinen <[email protected]>
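A sketch of the approach (the helper names are hypothetical; only the scratch buffer and the memzero_explicit() wipe reflect what the commit describes):

    u8 bek_plain[AES_KEYSIZE_128];          /* dedicated scratch buffer */
    int ret;

    /* Decrypt the BEK into the scratch buffer instead of in place in the
     * exported blob, so the blob keeps only the encrypted BEK. */
    ret = dcp_unseal_bek(blob->blob_key, bek_plain);        /* hypothetical */
    if (!ret)
            ret = dcp_decrypt_payload(blob, bek_plain);     /* hypothetical */
    memzero_explicit(bek_plain, sizeof(bek_plain));  /* never leak the key */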
2024-08-15KEYS: trusted: fix DCP blob payload length assignmentDavid Gstir1-1/+1
The DCP trusted key type uses the wrong helper function to store the blob's payload length, which can lead to the wrong byte order being used if this ever runs on big endian architectures. Fix this by using the correct helper function. Cc: [email protected] # v6.10+ Fixes: 2e8a0f40a39c ("KEYS: trusted: Introduce NXP DCP-backed trusted keys") Suggested-by: Richard Weinberger <[email protected]> Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Signed-off-by: David Gstir <[email protected]> Reviewed-by: Jarkko Sakkinen <[email protected]> Signed-off-by: Jarkko Sakkinen <[email protected]>
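Conceptually the change is a fixed-endian store instead of a host-order one, as in this sketch (the exact helper used by the driver may differ):

    /* The blob format is little endian regardless of host byte order. */
    put_unaligned_le32(payload_len, &blob->payload_len);    /* store */
    payload_len = get_unaligned_le32(&blob->payload_len);   /* load  */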
2024-08-15Merge tag 'hardening-v6.11-rc4' of ↵Linus Torvalds8-111/+28
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening fixes from Kees Cook: - gcc-plugins: randstruct: Remove GCC 4.7 or newer requirement (Thorsten Blum) - kallsyms: Clean up interaction with LTO suffixes (Song Liu) - refcount: Report UAF for refcount_sub_and_test(0) when counter==0 (Petr Pavlu) - kunit/overflow: Avoid misallocation of driver name (Ivan Orlov) * tag 'hardening-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: kallsyms: Match symbols exactly with CONFIG_LTO_CLANG kallsyms: Do not cleanup .llvm.<hash> suffix before sorting symbols kunit/overflow: Fix UB in overflow_allocation_test gcc-plugins: randstruct: Remove GCC 4.7 or newer requirement refcount: Report UAF for refcount_sub_and_test(0) when counter==0
2024-08-15btrfs: zoned: properly take lock to read/update block group's zoned variablesNaohiro Aota1-6/+8
__btrfs_add_free_space_zoned() references and modifies bg's alloc_offset, ro, and zone_unusable, but without taking the lock. It is mostly safe because they monotonically increase (at least for now) and this function is mostly called by a transaction commit, which is serialized by itself. Still, taking the lock is a safer and correct option and I'm going to add a change to reset zone_unusable while a block group is still alive. So, add locking around the operations. Fixes: 169e0da91a21 ("btrfs: zoned: track unusable bytes for zones") CC: [email protected] # 5.15+ Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Naohiro Aota <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
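A sketch of the added locking (the fields are the ones the commit names; the computation shown is illustrative):

    spin_lock(&block_group->lock);
    /* alloc_offset, ro and zone_unusable are now only read and updated
     * while holding the block group's spinlock. */
    block_group->zone_unusable += to_unusable;
    if (!block_group->ro)
            reclaimable = block_group->zone_capacity -
                          block_group->alloc_offset;    /* illustrative */
    spin_unlock(&block_group->lock);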