Age | Commit message | Author | Files | Lines
2024-07-27 | Merge tag 'firewire-fixes-6.11-rc1' of ↵ | Linus Torvalds | 2 | -5/+3
git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394

Pull firewire fixes from Takashi Sakamoto:
 "The recent integration of compiler collections introduced the technology to check flexible array length at runtime by providing proper annotations. In v6.10 kernel, a patch was merged into firewire subsystem to utilize it, however the annotation was inadequate.

  There is also the related change for the flexible array in sound subsystem, but it causes a regression where the data in the payload of isochronous packet is incorrect for some devices.

  These bugs are now fixed"

* tag 'firewire-fixes-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
  ALSA: firewire-lib: fix wrong value as length of header for CIP_NO_HEADER case
  Revert "firewire: Annotate struct fw_iso_packet with __counted_by()"
2024-07-27 | Merge tag 'spi-fix-v6.11-merge-window' of ↵ | Linus Torvalds | 3 | -81/+114
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi Pull spi fixes from Mark Brown: "The bulk of this is a series of fixes for the microchip-core driver mostly originating from one of their customers, I also applied an additional patch adding support for controlling the word size which came along with it since it's still the merge window and clearly had a bunch of fairly thorough testing. We also have a fix for the compatible used to bind spidev to the BH2228FV" * tag 'spi-fix-v6.11-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: spi: spidev: add correct compatible for Rohm BH2228FV dt-bindings: trivial-devices: fix Rohm BH2228FV compatible string spi: microchip-core: add support for word sizes of 1 to 32 bits spi: microchip-core: ensure TX and RX FIFOs are empty at start of a transfer spi: microchip-core: fix init function not setting the master and motorola modes spi: microchip-core: only disable SPI controller when register value change requires it spi: microchip-core: defer asserting chip select until just before write to TX FIFO spi: microchip-core: fix the issues in the isr
2024-07-27 | Merge tag 'regulator-fix-v6.11-merge-window' of ↵ | Linus Torvalds | 1 | -2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator Pull regulator fixes from Mark Brown: "These two commits clean up the excessively loose dependencies for the RZG2L USB VBCTRL regulator driver, ensuring it shouldn't prompt for people who can't use it" * tag 'regulator-fix-v6.11-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: regulator: Further restrict RZG2L USB VBCTRL regulator dependencies regulator: renesas-usb-vbus-regulator: Update the default
2024-07-27 | Merge tag 'regmap-fix-v6.11-merge-window' of ↵ | Linus Torvalds | 1 | -1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap

Pull regmap fix from Mark Brown:
 "Arnd sent a workaround for a false positive warning which was showing up with GCC 14.1"

* tag 'regmap-fix-v6.11-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
  regmap: maple: work around gcc-14.1 false-positive warning
2024-07-27 | Merge tag 'clk-for-linus' of ↵ | Linus Torvalds | 4 | -9/+11
git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux Pull clk fixes from Stephen Boyd: "A few clk driver fixes for the merge window to fix the build and boot on some SoCs. - Initialize struct clk_init_data in the TI da8xx-cfgchip driver so that stack contents aren't used for things like clk flags leading to unexpected behavior - Don't leak stack contents in a debug print in the new Sophgo clk driver - Disable the new T-Head clk driver on 32-bit targets to fix the build due to a division - Fix Samsung Exynos4 fin_pll wreckage from the clkdev rework done last cycle by using a struct clk_hw directly instead of a struct clk consumer" * tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: clk: samsung: fix getting Exynos4 fin_pll rate from external clocks clk: T-Head: Disable on 32-bit Targets clk: sophgo: clk-sg2042-pll: Fix uninitialized variable in debug output clk: davinci: da8xx-cfgchip: Initialize clk_init_data before use
2024-07-27 | Merge tag 'i3c/for-6.11' of ↵ | Linus Torvalds | 13 | -143/+431
git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux Pull i3c updates from Alexandre Belloni: "This cycle, there are new features for the Designware controller and fixes for the other IPs: - dw: optional apb clock and power management support, IBI handling fixes - mipi-i3c-hci: IBI handling fixes - svc: a few fixes" * tag 'i3c/for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux: dt-bindings: i3c: add header for generic I3C flags i3c: master: svc: Fix error code in svc_i3c_master_do_daa_locked() i3c: master: Enhance i3c_bus_type visibility for device searching & event monitoring i3c: dw: Add power management support i3c: dw: Add some functions for reusability i3c: dw: Save timing registers and other values i3c: master: svc: Improve DAA STOP handle code logic i3c: dw: Add optional apb clock i3c: dw: Use new *_enabled clk API dt-bindings: i3c: dw: Add apb clock binding i3c: master: svc: Convert comma to semicolon i3c: mipi-i3c-hci: Round IBI data chunk size to HW supported value i3c: mipi-i3c-hci: Error out instead on BUG_ON() in IBI DMA setup i3c: mipi-i3c-hci: Set IBI Status and Data Ring base addresses i3c: mipi-i3c-hci: Switch to lower_32_bits()/upper_32_bits() helpers i3c: dw: Remove ibi_capable property i3c: dw: Fix IBI intr programming i3c: dw: Fix clearing queue thld i3c: mipi-i3c-hci: Fix number of DAT/DCT entries for HCI versions < 1.1 i3c: master: svc: resend target address when get NACK
2024-07-27 | Merge tag 'thermal-6.11-rc1-3' of ↵ | Linus Torvalds | 2 | -14/+85
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull thermal control fix from Rafael Wysocki: "Prevent the thermal core from flooding the kernel log with useless messages if thermal zone temperature can never be determined (or its sensor has failed permanently) and make it finally give up and disable defective thermal zones (Rafael Wysocki)" * tag 'thermal-6.11-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: thermal: core: Back off when polling thermal zones on errors thermal: trip: Split thermal_zone_device_set_mode()
2024-07-27 | Merge tag 'mm-hotfixes-stable-2024-07-26-14-33' of ↵ | Linus Torvalds | 14 | -42/+95
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc hotfixes from Andrew Morton: "11 hotfixes, 7 of which are cc:stable. 7 are MM, 4 are other" * tag 'mm-hotfixes-stable-2024-07-26-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: nilfs2: handle inconsistent state in nilfs_btnode_create_block() selftests/mm: skip test for non-LPA2 and non-LVA systems mm/page_alloc: fix pcp->count race between drain_pages_zone() vs __rmqueue_pcplist() mm: memcg: add cacheline padding after lruvec in mem_cgroup_per_node alloc_tag: outline and export free_reserved_page() decompress_bunzip2: fix rare decompression failure mm/huge_memory: avoid PMD-size page cache if needed mm: huge_memory: use !CONFIG_64BIT to relax huge page alignment on 32 bit machines mm: fix old/young bit handling in the faulting path dt-bindings: arm: update James Clark's email address MAINTAINERS: mailmap: update James Clark's email address
2024-07-27 | Merge tag 'timers-urgent-2024-07-26' of ↵ | Linus Torvalds | 4 | -213/+224
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer migration updates from Thomas Gleixner: "Fixes and minor updates for the timer migration code: - Stop testing the group->parent pointer as it is not guaranteed to be stable over a chain of operations by design. This includes a warning which would be nice to have but it produces false positives due to the racy nature of the check. - Plug a race between CPUs going in and out of idle and a CPU hotplug operation. The latter can create and connect a new hierarchy level which is missed in the concurrent updates of CPUs which go into idle. As a result the events of such a CPU might not be processed and timers go stale. Cure it by splitting the hotplug operation into a prepare and online callback. The prepare callback is guaranteed to run on an online and therefore active CPU. This CPU updates the hierarchy and being online ensures that there is always at least one migrator active which handles the modified hierarchy correctly when going idle. The online callback which runs on the incoming CPU then just marks the CPU active and brings it into operation. - Improve tracing and polish the code further so it is more obvious what's going on" * tag 'timers-urgent-2024-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers/migration: Fix grammar in comment timers/migration: Spare write when nothing changed timers/migration: Rename childmask by groupmask to make naming more obvious timers/migration: Read childmask and parent pointer in a single place timers/migration: Use a single struct for hierarchy walk data timers/migration: Improve tracing timers/migration: Move hierarchy setup into cpuhotplug prepare callback timers/migration: Do not rely always on group->parent
2024-07-27 | Merge tag 'riscv-for-linus-6.11-mw2' of ↵ | Linus Torvalds | 43 | -245/+750
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull more RISC-V updates from Palmer Dabbelt: - Support for NUMA (via SRAT and SLIT), console output (via SPCR), and cache info (via PPTT) on ACPI-based systems. - The trap entry/exit code no longer breaks the return address stack predictor on many systems, which results in an improvement to trap latency. - Support for HAVE_ARCH_STACKLEAK. - The sv39 linear map has been extended to support 128GiB mappings. - The frequency of the mtime CSR is now visible via hwprobe. * tag 'riscv-for-linus-6.11-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (21 commits) RISC-V: Provide the frequency of time CSR via hwprobe riscv: Extend sv39 linear mapping max size to 128G riscv: enable HAVE_ARCH_STACKLEAK riscv: signal: Remove unlikely() from WARN_ON() condition riscv: Improve exception and system call latency RISC-V: Select ACPI PPTT drivers riscv: cacheinfo: initialize cacheinfo's level and type from ACPI PPTT riscv: cacheinfo: remove the useless input parameter (node) of ci_leaf_init() RISC-V: ACPI: Enable SPCR table for console output on RISC-V riscv: boot: remove duplicated targets line trace: riscv: Remove deprecated kprobe on ftrace support riscv: cpufeature: Extract common elements from extension checking riscv: Introduce vendor variants of extension helpers riscv: Add vendor extensions to /proc/cpuinfo riscv: Extend cpufeature.c to detect vendor extensions RISC-V: run savedefconfig for defconfig RISC-V: hwprobe: sort EXT_KEY()s in hwprobe_isa_ext0() alphabetically ACPI: NUMA: replace pr_info with pr_debug in arch_acpi_numa_init ACPI: NUMA: change the ACPI_NUMA to a hidden option ACPI: NUMA: Add handler for SRAT RINTC affinity structure ...
2024-07-27 | Merge tag 'for-linus-6.11-rc1a-tag' of ↵ | Linus Torvalds | 6 | -64/+74
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: "Two fixes for issues introduced in this merge window: - fix enhanced debugging in the Xen multicall handling - two patches fixing a boot failure when running as dom0 in PVH mode" * tag 'for-linus-6.11-rc1a-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: x86/xen: fix memblock_reserve() usage on PVH x86/xen: move xen_reserve_extra_memory() xen: fix multicall debug data referencing
2024-07-27 | Merge branches 'fixes' and 'misc' into for-linus | Russell King (Oracle) | 12 | -15/+31
2024-07-27 | hostfs: fix the host directory parse when mounting. | Hongbo Li | 1 | -10/+55
hostfs does not keep the host directory when mounting. When the host directory is none (default), fc->source is used as the host root directory, which is wrong. Use `parse_monolithic` to handle the old mount path when parsing the root directory. For the new mount path, `parse_param` is used to parse the host directory.

Reported-and-tested-by: Maciej Żenczykowski <[email protected]>
Fixes: cd140ce9f611 ("hostfs: convert hostfs to use the new mount API")
Link: https://lore.kernel.org/all/CANP3RGceNzwdb7w=vPf5=7BCid5HVQDmz1K5kC9JG42+HVAh_g@mail.gmail.com/
Cc: Christian Brauner <[email protected]>
Signed-off-by: Hongbo Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[brauner: minor fixes]
Signed-off-by: Christian Brauner <[email protected]>
2024-07-27 | fs: don't allow non-init s_user_ns for filesystems without FS_USERNS_MOUNT | Seth Forshee (DigitalOcean) | 1 | -0/+11
Christian noticed that it is possible for a privileged user to mount most filesystems with a non-initial user namespace in sb->s_user_ns. When fsopen() is called in a non-init namespace the caller's namespace is recorded in fs_context->user_ns. If the returned file descriptor is then passed to a process privileged in init_user_ns, that process can call fsconfig(fd_fs, FSCONFIG_CMD_CREATE), creating a new superblock with sb->s_user_ns set to the namespace of the process which called fsopen().

This is problematic. We cannot assume that any filesystem which does not set FS_USERNS_MOUNT has been written with a non-initial s_user_ns in mind, increasing the risk for bugs and security issues.

Prevent this by returning EPERM from sget_fc() when FS_USERNS_MOUNT is not set for the filesystem and a non-initial user namespace will be used. sget() does not need to be updated as it always uses the user namespace of the current context, or the initial user namespace if SB_SUBMOUNT is set.

Fixes: cb50b348c71f ("convenience helpers: vfs_get_super() and sget_fc()")
Reported-by: Christian Brauner <[email protected]>
Signed-off-by: Seth Forshee (DigitalOcean) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Alexander Mikhalitsyn <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
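The check described above can be sketched in userspace C. This is a minimal model under stated assumptions, not the kernel code: the struct names, the init_ns object and the check_user_ns() helper are hypothetical, and the flag value is only a stand-in for the kernel's FS_USERNS_MOUNT.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in flag value for this model only. */
#define MODEL_FS_USERNS_MOUNT 8

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct model_user_ns { int id; };
struct model_fs_type { int fs_flags; };
struct model_fs_context {
	const struct model_fs_type *fs_type;
	const struct model_user_ns *user_ns;	/* recorded when fsopen() ran */
};

/* Models init_user_ns. */
static const struct model_user_ns init_ns = { 0 };

/*
 * Refuse to create a superblock owned by a non-initial user namespace
 * unless the filesystem type opted in with the userns-mount flag.
 */
static int check_user_ns(const struct model_fs_context *fc)
{
	assert(fc && fc->fs_type && fc->user_ns);

	if (fc->user_ns != &init_ns &&
	    !(fc->fs_type->fs_flags & MODEL_FS_USERNS_MOUNT))
		return -EPERM;
	return 0;
}

/* Example contexts: an opted-in type, and a type that did not opt in. */
static const struct model_user_ns other_ns = { 1 };
static const struct model_fs_type plain_fs = { 0 };
static const struct model_fs_type userns_fs = { MODEL_FS_USERNS_MOUNT };
static const struct model_fs_context plain_in_other = { &plain_fs, &other_ns };
static const struct model_fs_context plain_in_init = { &plain_fs, &init_ns };
static const struct model_fs_context userns_in_other = { &userns_fs, &other_ns };
```

Only the combination "no opt-in flag" plus "non-initial namespace" is rejected; the other two contexts pass through unchanged.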
2024-07-27 | ALSA: firewire-lib: fix wrong value as length of header for CIP_NO_HEADER case | Takashi Sakamoto | 1 | -2/+1
In a commit 1d717123bb1a ("ALSA: firewire-lib: Avoid -Wflex-array-member-not-at-end warning"), DEFINE_FLEX() macro was used to handle variable length of array for header field in struct fw_iso_packet structure. The usage of macro has a side effect that the designated initializer assigns the count of array to the given field. Therefore CIP_HEADER_QUADLETS (=2) is assigned to struct fw_iso_packet.header, while the original designated initializer assigns zero to all fields. With CIP_NO_HEADER flag, the change causes invalid length of header in isochronous packet for 1394 OHCI IT context. This bug affects all of devices supported by ALSA fireface driver; RME Fireface 400, 800, UCX, UFX, and 802. This commit fixes the bug by replacing it with the alternative version of macro which corresponds no initializer. Cc: [email protected] Fixes: 1d717123bb1a ("ALSA: firewire-lib: Avoid -Wflex-array-member-not-at-end warning") Reported-by: Edmund Raile <[email protected]> Closes: https://lore.kernel.org/r/rrufondjeynlkx2lniot26ablsltnynfaq2gnqvbiso7ds32il@qk4r6xps7jh2/ Reviewed-by: Takashi Iwai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Takashi Sakamoto <[email protected]>
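The side effect described above can be modelled in userspace C. This is a minimal sketch, assuming a simplified stand-in for struct fw_iso_packet; the union, the helper name and the store_count switch are hypothetical and only mimic an initializer that writes the element count into a member, versus one that leaves everything zeroed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CIP_HEADER_QUADLETS 2

/* Simplified, hypothetical stand-in for struct fw_iso_packet. */
struct iso_packet {
	uint16_t payload_length;
	uint8_t header_length;	/* length of header[] in bytes */
	uint32_t header[];	/* flexible array member */
};

/* On-stack storage for the struct plus CIP_HEADER_QUADLETS header quadlets. */
union packet_storage {
	struct iso_packet pkt;
	unsigned char bytes[sizeof(struct iso_packet) +
			    CIP_HEADER_QUADLETS * sizeof(uint32_t)];
};

static_assert(sizeof(union packet_storage) >=
	      sizeof(struct iso_packet) + CIP_HEADER_QUADLETS * sizeof(uint32_t),
	      "storage must cover the header quadlets");

/*
 * Returns the header_length a caller sees right after initialization.
 * store_count != 0 models an initializer that also writes the element
 * count into the counter member; store_count == 0 models plain
 * zero-initialization.
 */
static unsigned int header_length_after_init(int store_count)
{
	union packet_storage s;

	memset(&s, 0, sizeof(s));
	if (store_count)
		s.pkt.header_length = CIP_HEADER_QUADLETS;

	/* A packet queued without a CIP header needs a zero-length header. */
	return s.pkt.header_length;
}
```

With the count stored, a caller that never touches header_length in the no-header case ends up with a non-zero value, which matches the regression described in the commit message.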
2024-07-27 | Revert "firewire: Annotate struct fw_iso_packet with __counted_by()" | Takashi Sakamoto | 1 | -3/+2
This reverts commit d3155742db89df3b3c96da383c400e6ff4d23c25.

The header_length field is in bytes, so it cannot express the number of elements in the header field. It seems that the argument of the counted_by attribute cannot be an arithmetic expression, so this commit simply reverts the offending commit.

Suggested-by: Gustavo A. R. Silva <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Sakamoto <[email protected]>
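The unit mismatch can be shown with a small piece of arithmetic. This is a sketch under the assumption that the header elements are 32-bit quadlets; the helper name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Converts a header length in bytes into the number of 32-bit header
 * elements it describes. An annotation that treats the byte length as
 * an element count would overstate the array bound by a factor of four.
 */
static size_t header_elements(unsigned int header_length_bytes)
{
	assert(header_length_bytes % sizeof(uint32_t) == 0);
	return header_length_bytes / sizeof(uint32_t);
}
```

For a two-quadlet header the byte length is 8 while the element count is 2, so the two values cannot be used interchangeably as a bound.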
2024-07-26 | Merge tag 'for-net-2024-07-26' of ↵ | Jakub Kicinski | 6 | -10/+33
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Luiz Augusto von Dentz says: ==================== bluetooth pull request for net: - btmtk: Fix kernel crash when entering btmtk_usb_suspend - btmtk: Fix btmtk.c undefined reference build error - btintel: Fail setup on error - hci_sync: Fix suspending with wrong filter policy - hci_event: Fix setting DISCOVERY_FINDING for passive scanning * tag 'for-net-2024-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Bluetooth: hci_event: Fix setting DISCOVERY_FINDING for passive scanning Bluetooth: btmtk: remove #ifdef around declarations Bluetooth: btmtk: Fix btmtk.c undefined reference build error harder Bluetooth: btmtk: Fix btmtk.c undefined reference build error Bluetooth: hci_sync: Fix suspending with wrong filter policy Bluetooth: btmtk: Fix kernel crash when entering btmtk_usb_suspend Bluetooth: btintel: Fail setup on error ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-07-26 | fbnic: Change kconfig prompt from S390=n to !S390 | Alexander Duyck | 1 | -1/+1
In testing the recent kernel I found that the fbnic driver couldn't be enabled on x86_64 builds. A bit of digging showed that the fbnic driver was the only one to check for S390 to be n, all others had checked for !S390. Since it is a boolean and not a tristate I am not sure it will be N. So just update it to use the !S390 flag. A quick check via "make menuconfig" verified that after making this change there was an option to select the fbnic driver. Fixes 0e03c643dc93 ("eth: fbnic: fix s390 build.") Signed-off-by: Alexander Duyck <[email protected]> Reviewed-by: Joe Damato <[email protected]> Link: https://patch.msgid.link/172192698293.1903337.4255690118685300353.stgit@ahduyck-xeon-server.home.arpa Signed-off-by: Jakub Kicinski <[email protected]>
2024-07-26 | Merge tag 'wireless-2024-07-26' of ↵ | Jakub Kicinski | 8 | -13/+25
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== Couple of more urgent fixes: * ath12k: wowlan loop iteration issue * ath12k: fix soft lockup on suspend in certain scenarios * mt76: fix crash when removing an interface * mac80211: fix injection crash with some drivers that don't want monitor vif * cfg80211: fix S1G beacon parsing in scan * cfg80211: fix MLO link status reporting on connect * tag 'wireless-2024-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: wifi: ath12k: fix soft lockup on suspend wifi: mt76: mt7921: fix null pointer access in mt792x_mac_link_bss_remove wifi: ath12k: fix reusing outside iterator in ath12k_wow_vif_set_wakeups() wifi: cfg80211: correct S1G beacon length calculation wifi: cfg80211: fix reporting failed MLO links status with cfg80211_connect_done wifi: mac80211: use monitor sdata with driver only if desired ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-07-26 | minmax: avoid overly complicated constant expressions in VM code | Linus Torvalds | 2 | -2/+9
The minmax infrastructure is overkill for simple constants, and can cause huge expansions because those simple constants are then used by other things.

For example, 'pageblock_order' is a core VM constant, but because it was implemented using 'min_t()' and all the type-checking that involves, it actually expanded to something like 2.5kB of preprocessor noise.

And when that simple constant was then used inside other expansions:

  #define pageblock_nr_pages		(1UL << pageblock_order)
  #define pageblock_start_pfn(pfn)	ALIGN_DOWN((pfn), pageblock_nr_pages)

and we then use that inside a 'max()' macro:

	case ISOLATE_SUCCESS:
		update_cached = false;
		last_migrated_pfn = max(cc->zone->zone_start_pfn,
			pageblock_start_pfn(cc->migrate_pfn - 1));

the end result was that one statement expanding to 253kB in size.

There are probably other cases of this, but this one case certainly stood out.

I've added 'MIN_T()' and 'MAX_T()' macros for this kind of "core simple constant with specific type" use. These macros skip the type checking, and as such need to be very sparingly used only for obvious cases that have active issues like this.

Reported-by: Lorenzo Stoakes <[email protected]>
Link: https://lore.kernel.org/all/[email protected]/
Cc: David Laight <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
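A macro of the kind described above can be sketched as a plain cast-and-compare. This is an illustrative definition and not necessarily the one in the kernel tree; MODEL_HUGE_ORDER, MODEL_MAX_ORDER and the other MODEL_* names are hypothetical constants chosen for the example. Because the arguments are evaluated twice, a macro like this is only suitable for side-effect-free constants.

```c
#include <assert.h>

/* Type-cast both sides and compare; no type checking, tiny expansion. */
#define MIN_T(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))
#define MAX_T(type, a, b) ((type)(a) > (type)(b) ? (type)(a) : (type)(b))

/* Hypothetical constants standing in for configuration-derived orders. */
#define MODEL_HUGE_ORDER 9
#define MODEL_MAX_ORDER 10

/* The result is an integer constant expression, so it can size an enum. */
enum { MODEL_BLOCK_ORDER = MIN_T(unsigned int, MODEL_HUGE_ORDER, MODEL_MAX_ORDER) };

static_assert(MODEL_BLOCK_ORDER == 9, "the smaller of the two orders is used");

/* Number of pages in one block of MODEL_BLOCK_ORDER. */
static unsigned long model_block_nr_pages(void)
{
	return 1UL << MODEL_BLOCK_ORDER;
}
```

Anything built on top of such a constant expands to a short expression, which is the property the commit message is after.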
2024-07-26 | minmax: avoid overly complex min()/max() macro arguments in xen | Linus Torvalds | 1 | -2/+3
We have some very fancy min/max macros that have tons of sanity checking to warn about mixed signedness etc. This is all things that a sane compiler should warn about, but there are no sane compiler interfaces for this, and '-Wsign-compare' is broken [1] and not useful. So then we compensate (some would say over-compensate) by doing the checks manually with some truly horrid macro games. And no, we can't just use __builtin_types_compatible_p(), because the whole question of "does it make sense to compare these two values" is a lot more complicated than that. For example, it makes a ton of sense to compare unsigned values with simple constants like "5", even if that is indeed a signed type. So we have these very strange macros to try to make sensible type checking decisions on the arguments to 'min()' and 'max()'. But that can cause enormous code expansion if the min()/max() macros are used with complicated expressions, and particularly if you nest these things so that you get the first big expansion then expanded again. The xen setup.c file ended up ballooning to over 50MB of preprocessed noise that takes 15s to compile (obviously depending on the build host), largely due to one single line. So let's split that one single line to just be simpler. I think it ends up being more legible to humans too at the same time. Now that single file compiles in under a second. Reported-and-reviewed-by: Lorenzo Stoakes <[email protected]> Link: https://lore.kernel.org/all/[email protected]/ Link: https://staticthinking.wordpress.com/2023/07/25/wsign-compare-is-garbage/ [1] Cc: David Laight <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
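The idea of splitting one nested expression can be sketched as follows. This is a minimal illustration with simple ternary macros, not the kernel's type-checked min()/max(); the function and parameter names are hypothetical. With the kernel's heavier macros, the nested form is where the expansion multiplies, while the split form passes a plain variable to the outer macro.

```c
#include <assert.h>

/* Deliberately simple stand-ins for the kernel's type-checked macros. */
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MIN3(a, b, c) MIN(MIN(a, b), c)

/* Nested form: the inner MIN() is expanded again inside every MIN3() use. */
static unsigned long limit_nested(unsigned long ratio, unsigned long max_pfn,
				  unsigned long maxmem_pfn, unsigned long extra,
				  unsigned long max_pages)
{
	assert(max_pages >= max_pfn);
	return MIN3(ratio * MIN(max_pfn, maxmem_pfn), extra, max_pages - max_pfn);
}

/* Split form: compute the inner value once, then feed a plain variable. */
static unsigned long limit_split(unsigned long ratio, unsigned long max_pfn,
				 unsigned long maxmem_pfn, unsigned long extra,
				 unsigned long max_pages)
{
	unsigned long maxmem_pages;

	assert(max_pages >= max_pfn);
	maxmem_pages = ratio * MIN(max_pfn, maxmem_pfn);
	return MIN3(maxmem_pages, extra, max_pages - max_pfn);
}
```

Both functions compute the same value; only the size of the preprocessed source differs.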
2024-07-26 | nilfs2: handle inconsistent state in nilfs_btnode_create_block() | Ryusuke Konishi | 2 | -7/+22
Syzbot reported that a buffer state inconsistency was detected in nilfs_btnode_create_block(), triggering a kernel bug. It is not appropriate to treat this inconsistency as a bug; it can occur if the argument block address (the buffer index of the newly created block) is a virtual block number and has been reallocated due to corruption of the bitmap used to manage its allocation state. So, modify nilfs_btnode_create_block() and its callers to treat it as a possible filesystem error, rather than triggering a kernel bug. Link: https://lkml.kernel.org/r/[email protected] Fixes: a60be987d45d ("nilfs2: B-tree node cache") Signed-off-by: Ryusuke Konishi <[email protected]> Reported-by: [email protected] Closes: https://syzkaller.appspot.com/bug?extid=89cc4f2324ed37988b60 Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-26 | selftests/mm: skip test for non-LPA2 and non-LVA systems | Dev Jain | 1 | -1/+15
After my improvement of the test in e4a4ba415419 ("selftests/mm: va_high_addr_switch: dynamically initialize testcases to enable LPA2 testing"), the test began to fail on 4k and 16k pages on non-LPA2 systems. To reduce noise in the CI systems, skip the test when the higher address space is not implemented.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: e4a4ba415419 ("selftests/mm: va_high_addr_switch: dynamically initialize testcases to enable LPA2 testing")
Signed-off-by: Dev Jain <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Mark Brown <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-26 | mm/page_alloc: fix pcp->count race between drain_pages_zone() vs ↵ | Li Zhijian | 1 | -7/+11
__rmqueue_pcplist()

It's expected that no page should be left in pcp_list after calling zone_pcp_disable() in offline_pages(). Previously, it's observed that offline_pages() gets stuck [1] due to some pages remaining in pcp_list.

Cause:
There is a race condition between drain_pages_zone() and __rmqueue_pcplist() involving the pcp->count variable. See below scenario:

         CPU0                              CPU1
    ----------------                    ---------------
                                      spin_lock(&pcp->lock);
                                      __rmqueue_pcplist() {
zone_pcp_disable() {
                                        /* list is empty */
                                        if (list_empty(list)) {
                                          /* add pages to pcp_list */
                                          alloced = rmqueue_bulk()
  mutex_lock(&pcp_batch_high_lock)
  ...
  __drain_all_pages() {
    drain_pages_zone() {
      /* read pcp->count, it's 0 here */
      count = READ_ONCE(pcp->count)
      /* 0 means nothing to drain */
                                          /* update pcp->count */
                                          pcp->count += alloced << order;
      ...
                                      ...
                                      spin_unlock(&pcp->lock);

In this case, after calling zone_pcp_disable() though, there are still some pages in pcp_list. And these pages in pcp_list are neither movable nor isolated, offline_pages() gets stuck as a result.

Solution:
Expand the scope of the pcp->lock to also protect pcp->count in drain_pages_zone(), to ensure no pages are left in the pcp list after zone_pcp_disable()

[1] https://lore.kernel.org/linux-mm/[email protected]/

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 4b23a68f9536 ("mm/page_alloc: protect PCP lists with a spinlock")
Signed-off-by: Li Zhijian <[email protected]>
Reported-by: Yao Xingtao <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
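The locking rule in the solution can be sketched with a userspace model. This is a simplified illustration, assuming a C11 atomic_flag as a stand-in for the spinlock; struct pcp_model and the helper names are hypothetical, and the real function drains in batches rather than in one step. The point it shows is that the counter is read and reset only while the lock is held, so a concurrent refill is either fully visible or has not started.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical model of a per-CPU page list: a lock and a page counter. */
struct pcp_model {
	atomic_flag lock;
	int count;
};

static void model_lock(struct pcp_model *pcp)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&pcp->lock, memory_order_acquire))
		;
}

static void model_unlock(struct pcp_model *pcp)
{
	atomic_flag_clear_explicit(&pcp->lock, memory_order_release);
}

/* Allocation side: the counter is updated under the lock. */
static void model_refill(struct pcp_model *pcp, int alloced, unsigned int order)
{
	assert(alloced >= 0);
	model_lock(pcp);
	pcp->count += alloced << order;
	model_unlock(pcp);
}

/* Drain side: the counter is sampled inside the same critical section. */
static int model_drain(struct pcp_model *pcp)
{
	int drained;

	model_lock(pcp);
	drained = pcp->count;
	pcp->count = 0;
	model_unlock(pcp);
	return drained;
}

/* Example instance used to exercise the helpers. */
static struct pcp_model demo_pcp = { ATOMIC_FLAG_INIT, 0 };
```

Sampling the counter before taking the lock would reintroduce the window shown in the CPU0/CPU1 scenario.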
2024-07-26 | mm: memcg: add cacheline padding after lruvec in mem_cgroup_per_node | Roman Gushchin | 1 | -0/+1
Oliver Sang reported a performance regression caused by commit 98c9daf5ae6b ("mm: memcg: guard memcg1-specific members of struct mem_cgroup_per_node"), which puts some fields of the mem_cgroup_per_node structure under the CONFIG_MEMCG_V1 config option. Apparently it causes false cache sharing between the lruvec and lru_zone_size members of the structure. Fix it by adding explicit padding after the lruvec member.

Even though the padding is not required with CONFIG_MEMCG_V1 set, it seems like the introduced memory overhead is not significant enough to warrant another divergence in the mem_cgroup_per_node layout, so the padding is added unconditionally.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 98c9daf5ae6b ("mm: memcg: guard memcg1-specific members of struct mem_cgroup_per_node")
Signed-off-by: Roman Gushchin <[email protected]>
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-lkp/[email protected]
Tested-by: Oliver Sang <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-07-26 | alloc_tag: outline and export free_reserved_page() | Suren Baghdasaryan | 2 | -15/+18
Outline and export free_reserved_page() because modules use it and it in turn uses page_ext_{get|put} which should not be exported. The same result could be obtained by outlining {get|put}_page_tag_ref() but that would have higher performance impact as these functions are used in more performance critical paths. Link: https://lkml.kernel.org/r/[email protected] Fixes: dcfe378c81f7 ("lib: introduce support for page allocation tagging") Signed-off-by: Suren Baghdasaryan <[email protected]> Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Suggested-by: Christoph Hellwig <[email protected]> Suggested-by: Vlastimil Babka <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Kees Cook <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: Sourav Panda <[email protected]> Cc: <[email protected]> [6.10] Signed-off-by: Andrew Morton <[email protected]>
2024-07-26 | decompress_bunzip2: fix rare decompression failure | Ross Lagerwall | 1 | -1/+2
The decompression code parses a huffman tree and counts the number of symbols for a given bit length. In rare cases, there may be >= 256 symbols with a given bit length, causing the unsigned char to overflow. This causes a decompression failure later when the code tries and fails to find the bit length for a given symbol. Since the maximum number of symbols is 258, use unsigned short instead. Link: https://lkml.kernel.org/r/[email protected] Fixes: bc22c17e12c1 ("bzip2/lzma: library support for gzip, bzip2 and lzma decompression") Signed-off-by: Ross Lagerwall <[email protected]> Cc: Alain Knaff <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
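The overflow can be reproduced with a few lines of C. This is a minimal sketch, assuming every symbol shares the same bit length so that a single counter is incremented once per symbol; the function names are hypothetical.

```c
#include <assert.h>

/* Counts nsymbols increments in an 8-bit counter, which wraps at 256. */
static unsigned int count_with_u8(unsigned int nsymbols)
{
	unsigned char count = 0;
	unsigned int i;

	for (i = 0; i < nsymbols; i++)
		count++;
	return count;
}

/* Same loop with a 16-bit counter, wide enough for the 258-symbol maximum. */
static unsigned int count_with_u16(unsigned int nsymbols)
{
	unsigned short count = 0;
	unsigned int i;

	assert(nsymbols <= 65535u);
	for (i = 0; i < nsymbols; i++)
		count++;
	return count;
}
```

With 258 symbols the narrow counter ends at 2 instead of 258, so a later lookup that relies on the per-length totals cannot find the right symbol.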
2024-07-26 | mm/huge_memory: avoid PMD-size page cache if needed | Gavin Shan | 2 | -5/+19
xarray can't support arbitrary page cache size. the largest and supported page cache size is defined as MAX_PAGECACHE_ORDER by commit 099d90642a71 ("mm/filemap: make MAX_PAGECACHE_ORDER acceptable to xarray"). However, it's possible to have 512MB page cache in the huge memory's collapsing path on ARM64 system whose base page size is 64KB. 512MB page cache is breaking the limitation and a warning is raised when the xarray entry is split as shown in the following example. [root@dhcp-10-26-1-207 ~]# cat /proc/1/smaps | grep KernelPageSize KernelPageSize: 64 kB [root@dhcp-10-26-1-207 ~]# cat /tmp/test.c : int main(int argc, char **argv) { const char *filename = TEST_XFS_FILENAME; int fd = 0; void *buf = (void *)-1, *p; int pgsize = getpagesize(); int ret = 0; if (pgsize != 0x10000) { fprintf(stdout, "System with 64KB base page size is required!\n"); return -EPERM; } system("echo 0 > /sys/devices/virtual/bdi/253:0/read_ahead_kb"); system("echo 1 > /proc/sys/vm/drop_caches"); /* Open the xfs file */ fd = open(filename, O_RDONLY); assert(fd > 0); /* Create VMA */ buf = mmap(NULL, TEST_MEM_SIZE, PROT_READ, MAP_SHARED, fd, 0); assert(buf != (void *)-1); fprintf(stdout, "mapped buffer at 0x%p\n", buf); /* Populate VMA */ ret = madvise(buf, TEST_MEM_SIZE, MADV_NOHUGEPAGE); assert(ret == 0); ret = madvise(buf, TEST_MEM_SIZE, MADV_POPULATE_READ); assert(ret == 0); /* Collapse VMA */ ret = madvise(buf, TEST_MEM_SIZE, MADV_HUGEPAGE); assert(ret == 0); ret = madvise(buf, TEST_MEM_SIZE, MADV_COLLAPSE); if (ret) { fprintf(stdout, "Error %d to madvise(MADV_COLLAPSE)\n", errno); goto out; } /* Split xarray entry. 
Write permission is needed */ munmap(buf, TEST_MEM_SIZE); buf = (void *)-1; close(fd); fd = open(filename, O_RDWR); assert(fd > 0); fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, TEST_MEM_SIZE - pgsize, pgsize); out: if (buf != (void *)-1) munmap(buf, TEST_MEM_SIZE); if (fd > 0) close(fd); return ret; } [root@dhcp-10-26-1-207 ~]# gcc /tmp/test.c -o /tmp/test [root@dhcp-10-26-1-207 ~]# /tmp/test ------------[ cut here ]------------ WARNING: CPU: 25 PID: 7560 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128 Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib \ nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct \ nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 \ ip_set rfkill nf_tables nfnetlink vfat fat virtio_balloon drm fuse \ xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64 virtio_net \ sha1_ce net_failover virtio_blk virtio_console failover dimlib virtio_mmio CPU: 25 PID: 7560 Comm: test Kdump: loaded Not tainted 6.10.0-rc7-gavin+ #9 Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024 pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) pc : xas_split_alloc+0xf8/0x128 lr : split_huge_page_to_list_to_order+0x1c4/0x780 sp : ffff8000ac32f660 x29: ffff8000ac32f660 x28: ffff0000e0969eb0 x27: ffff8000ac32f6c0 x26: 0000000000000c40 x25: ffff0000e0969eb0 x24: 000000000000000d x23: ffff8000ac32f6c0 x22: ffffffdfc0700000 x21: 0000000000000000 x20: 0000000000000000 x19: ffffffdfc0700000 x18: 0000000000000000 x17: 0000000000000000 x16: ffffd5f3708ffc70 x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 x11: ffffffffffffffc0 x10: 0000000000000040 x9 : ffffd5f3708e692c x8 : 0000000000000003 x7 : 0000000000000000 x6 : ffff0000e0969eb8 x5 : ffffd5f37289e378 x4 : 0000000000000000 x3 : 0000000000000c40 x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000 Call trace: xas_split_alloc+0xf8/0x128 split_huge_page_to_list_to_order+0x1c4/0x780 
truncate_inode_partial_folio+0xdc/0x160 truncate_inode_pages_range+0x1b4/0x4a8 truncate_pagecache_range+0x84/0xa0 xfs_flush_unmap_range+0x70/0x90 [xfs] xfs_file_fallocate+0xfc/0x4d8 [xfs] vfs_fallocate+0x124/0x2f0 ksys_fallocate+0x4c/0xa0 __arm64_sys_fallocate+0x24/0x38 invoke_syscall.constprop.0+0x7c/0xd8 do_el0_svc+0xb4/0xd0 el0_svc+0x44/0x1d8 el0t_64_sync_handler+0x134/0x150 el0t_64_sync+0x17c/0x180 Fix it by correcting the supported page cache orders, different sets for DAX and other files. With it corrected, 512MB page cache becomes disallowed on all non-DAX files on ARM64 system where the base page size is 64KB. After this patch is applied, the test program fails with error -EINVAL returned from __thp_vma_allowable_orders() and the madvise() system call to collapse the page caches. Link: https://lkml.kernel.org/r/[email protected] Fixes: 6b24ca4a1a8d ("mm: Use multi-index entries in the page cache") Signed-off-by: Gavin Shan <[email protected]> Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Ryan Roberts <[email protected]> Acked-by: Zi Yan <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Barry Song <[email protected]> Cc: Don Dutile <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Peter Xu <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: William Kucharski <[email protected]> Cc: <[email protected]> [5.17+] Signed-off-by: Andrew Morton <[email protected]>
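The numbers behind the warning can be checked with a little arithmetic. The following is a minimal userspace sketch, not kernel code: the 64KB base page and 512MB huge page sizes come from the message above, while the split limit of order 11 (XA_CHUNK_SHIFT * 2 - 1 with XA_CHUNK_SHIFT equal to 6) is an assumption that the message does not state. Both helper names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: the xarray can split entries only up to this order
 * (XA_CHUNK_SHIFT * 2 - 1 with XA_CHUNK_SHIFT == 6). */
#define SKETCH_MAX_XAS_ORDER 11u

/* Hypothetical helper: smallest order n with (base_page << n) >= huge_size. */
static unsigned int size_to_order(size_t huge_size, size_t base_page)
{
	unsigned int order = 0;

	while ((base_page << order) < huge_size)
		order++;
	return order;
}

/* Hypothetical helper: true if a page cache folio of this order can be split. */
static bool pagecache_order_supported(unsigned int order)
{
	return order <= SKETCH_MAX_XAS_ORDER;
}
```

Under these assumptions a 512MB folio on a 64KB base page is order 13, which is above the limit, while a 2MB folio on a 4KB base page is order 9 and stays within it.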
2024-07-26mm: huge_memory: use !CONFIG_64BIT to relax huge page alignment on 32 bit machinesYang Shi1-1/+1
Yves-Alexis Perez reported commit 4ef9ad19e176 ("mm: huge_memory: don't force huge page alignment on 32 bit") didn't work for x86_32 [1]. It is because x86_32 uses CONFIG_X86_32 instead of CONFIG_32BIT. !CONFIG_64BIT should cover all 32 bit machines. [1] https://lore.kernel.org/linux-mm/CAHbLzkr1LwH3pcTgM+aGQ31ip2bKqiqEQ8=FQB+t2c3dhNKNHA@mail.gmail.com/ Link: https://lkml.kernel.org/r/[email protected] Fixes: 4ef9ad19e176 ("mm: huge_memory: don't force huge page alignment on 32 bit") Signed-off-by: Yang Shi <[email protected]> Reported-by: Yves-Alexis Perez <[email protected]> Tested-by: Yves-Alexis Perez <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Ben Hutchings <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Salvatore Bonaccorso <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: <[email protected]> [6.8+] Signed-off-by: Andrew Morton <[email protected]>
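The difference between testing for a 32-bit symbol and testing for the absence of the 64-bit symbol can be modelled with plain preprocessor checks. This is only a sketch: the SKETCH_ macros are hypothetical stand-ins for Kconfig symbols, set up to mimic an x86_32 build, and the function names are invented for illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for Kconfig symbols on an x86_32 build. */
#define SKETCH_CONFIG_X86_32 1
/* SKETCH_CONFIG_32BIT is deliberately left undefined: x86_32 does not set it. */
/* SKETCH_CONFIG_64BIT is undefined as well, as on any 32-bit build. */

/* Old style check: only relaxes alignment when the 32BIT symbol exists. */
static bool skip_alignment_old_check(void)
{
#ifdef SKETCH_CONFIG_32BIT
	return true;
#else
	return false;	/* x86_32 ends up here, so alignment is still forced */
#endif
}

/* New style check: relaxes alignment whenever the build is not 64-bit. */
static bool skip_alignment_new_check(void)
{
#ifndef SKETCH_CONFIG_64BIT
	return true;	/* every 32-bit build ends up here */
#else
	return false;
#endif
}
```

With this configuration the old check reports false and the new check reports true, which is the behaviour gap the message describes.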
2024-07-26mm: fix old/young bit handling in the faulting pathRam Tummala1-1/+1
Commit 3bd786f76de2 ("mm: convert do_set_pte() to set_pte_range()") replaced do_set_pte() with set_pte_range() and that introduced a regression in the following faulting path of non-anonymous vmas which caused the PTE for the faulting address to be marked as old instead of young. handle_pte_fault() do_pte_missing() do_fault() do_read_fault() || do_cow_fault() || do_shared_fault() finish_fault() set_pte_range() The polarity of the prefault calculation is incorrect. This leads to prefault being incorrectly set for the faulting address. The following check will incorrectly mark the PTE old rather than young. On some architectures this will cause a double fault to mark it young when the access is retried. if (prefault && arch_wants_old_prefaulted_pte()) entry = pte_mkold(entry); On a subsequent fault on the same address, the faulting path will see a non-NULL vmf->pte and, instead of reaching the do_pte_missing() path, the PTE will then be correctly marked young in handle_pte_fault() itself. This bug degrades performance in the fault handling path because of unnecessary double faulting. Link: https://lkml.kernel.org/r/[email protected] Fixes: 3bd786f76de2 ("mm: convert do_set_pte() to set_pte_range()") Signed-off-by: Ram Tummala <[email protected]> Reviewed-by: Yin Fengwei <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Yin Fengwei <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
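The polarity problem can be illustrated with a small userspace model. This is a sketch under assumptions, not the literal patch: the helper names and the 4KB page size are illustrative, and the range test is a simplified stand-in for whatever the kernel uses to decide whether the faulting address falls inside the range being mapped.

```c
#include <stdbool.h>

/* Userspace model; names and the 4KB page size are illustrative. */
#define SKETCH_PAGE_SIZE 4096UL

/* True if val lies in [start, start + len). */
static bool in_range_ul(unsigned long val, unsigned long start, unsigned long len)
{
	return val - start < len;
}

/* Inverted polarity: the faulting address itself counts as prefaulted. */
static bool prefault_buggy(unsigned long fault_addr, unsigned long addr,
			   unsigned int nr)
{
	return in_range_ul(fault_addr, addr, nr * SKETCH_PAGE_SIZE);
}

/* Intended polarity: only ranges that do not contain the faulting
 * address are treated as prefaulted. */
static bool prefault_fixed(unsigned long fault_addr, unsigned long addr,
			   unsigned int nr)
{
	return !in_range_ul(fault_addr, addr, nr * SKETCH_PAGE_SIZE);
}

/* Mirrors the quoted check: old only if prefaulted and the arch wants it. */
static bool pte_marked_old(bool prefault, bool arch_wants_old)
{
	return prefault && arch_wants_old;
}
```

For a single-page mapping at the faulting address, the inverted version marks the PTE old on architectures that want old prefaulted PTEs, while the intended version leaves it young.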
2024-07-26dt-bindings: arm: update James Clark's email addressJames Clark2-2/+2
My new address is [email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: James Clark <[email protected]> Cc: Bjorn Andersson <[email protected]> Cc: Conor Dooley <[email protected]> Cc: David S. Miller <[email protected]> Cc: Geliang Tang <[email protected]> Cc: Hao Zhang <[email protected]> Cc: Jakub Kicinski <[email protected]> Cc: Jiri Kosina <[email protected]> Cc: Kees Cook <[email protected]> Cc: Krzysztof Kozlowski <[email protected]> Cc: Mao Jinlong <[email protected]> Cc: Matthieu Baerts <[email protected]> Cc: Matt Ranostay <[email protected]> Cc: Mike Leach <[email protected]> Cc: Oleksij Rempel <[email protected]> Cc: Rob Herring (Arm) <[email protected]> Cc: Suzuki K Poulose <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-26MAINTAINERS: mailmap: update James Clark's email addressJames Clark2-2/+3
My new address is [email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: James Clark <[email protected]> Cc: Bjorn Andersson <[email protected]> Cc: Conor Dooley <[email protected]> Cc: David S. Miller <[email protected]> Cc: Geliang Tang <[email protected]> Cc: Hao Zhang <[email protected]> Cc: Jakub Kicinski <[email protected]> Cc: Jiri Kosina <[email protected]> Cc: Kees Cook <[email protected]> Cc: Krzysztof Kozlowski <[email protected]> Cc: Mao Jinlong <[email protected]> Cc: Matthieu Baerts <[email protected]> Cc: Matt Ranostay <[email protected]> Cc: Mike Leach <[email protected]> Cc: Oleksij Rempel <[email protected]> Cc: Rob Herring (Arm) <[email protected]> Cc: Suzuki K Poulose <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-26irqchip/loongarch-cpu: Fix return value of lpic_gsi_to_irq()Huacai Chen1-2/+4
lpic_gsi_to_irq() should return a valid Linux interrupt number if acpi_register_gsi() succeeds, and return 0 otherwise. But lpic_gsi_to_irq() converts a negative return value of acpi_register_gsi() to a positive value silently. Convert the return value explicitly. Fixes: e8bba72b396c ("irqchip / ACPI: Introduce ACPI_IRQ_MODEL_LPIC for LoongArch") Reported-by: Miao Wang <[email protected]> Signed-off-by: Huacai Chen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Jiaxun Yang <[email protected]> Cc: <[email protected]> Link: https://lore.kernel.org/r/[email protected]
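The silent sign conversion is easy to reproduce outside the kernel. The sketch below uses fake_register_gsi() as a hypothetical stand-in for acpi_register_gsi(); its success and failure rule is invented for the example, and only the return convention (positive interrupt number or negative errno) follows the message above.

```c
#include <errno.h>

/* Hypothetical stand-in for acpi_register_gsi(): a positive Linux
 * interrupt number on success, a negative errno on failure. */
static int fake_register_gsi(unsigned int gsi)
{
	return gsi < 64 ? (int)gsi + 16 : -EINVAL;
}

/* Buggy shape: a negative errno is implicitly converted to a large
 * unsigned value, which the caller then mistakes for a valid IRQ. */
static unsigned int gsi_to_irq_buggy(unsigned int gsi)
{
	return fake_register_gsi(gsi);
}

/* Fixed shape: convert explicitly, mapping any failure to 0. */
static unsigned int gsi_to_irq_fixed(unsigned int gsi)
{
	int irq = fake_register_gsi(gsi);

	return irq > 0 ? (unsigned int)irq : 0;
}
```

A caller that treats 0 as "no interrupt" only works with the second version; the first one hands back a nonzero value on failure.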
2024-07-26dt-bindings: iio: adc: ad7192: Fix 'single-channel' constraintsRob Herring (Arm)1-3/+2
The 'single-channel' property is a uint32, not an array, so 'items' is an incorrect constraint. This didn't matter until dtschema recently changed how properties are decoded. This results in this warning: Documentation/devicetree/bindings/iio/adc/adi,ad7192.example.dtb: adc@0: \ channel@1:single-channel: 1 is not of type 'array' Fixes: caf7b7632b8d ("dt-bindings: iio: adc: ad7192: Add AD7194 support") Reviewed-by: Conor Dooley <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Rob Herring (Arm) <[email protected]>
2024-07-26KVM: guest_memfd: abstract how prepared folios are recordedPaolo Bonzini1-13/+20
Right now, large folios are not supported in guest_memfd, and therefore the order used by kvm_gmem_populate() is always 0. In this scenario, using the up-to-date bit to track prepared-ness is nice and easy because we have one bit available per page. In the future, however, we might have large pages that are partially populated; for example, in the case of SEV-SNP, if a large page has both shared and private areas inside, it is necessary to populate it at a granularity that is smaller than that of the guest_memfd's backing store. In that case we will have to track preparedness at a 4K level, probably as a bitmap. In preparation for that, do not use explicitly folio_test_uptodate() and folio_mark_uptodate(). Return the state of the page directly from __kvm_gmem_get_pfn(), so that it is expected to apply to 2^N pages with N=*max_order. The function to mark a range as prepared for now takes just a folio, but is expected to take also an index and order (or something like that) when large pages are introduced. Thanks to Michael Roth for pointing out the issue with large pages. Signed-off-by: Paolo Bonzini <[email protected]>
2024-07-26KVM: guest_memfd: let kvm_gmem_populate() operate only on private gfnsPaolo Bonzini3-7/+14
This check is currently performed by sev_gmem_post_populate(), but it applies to all callers of kvm_gmem_populate(): the point of the function is that the memory is being encrypted and some work has to be done on all the gfns in order to encrypt them. Therefore, check the KVM_MEMORY_ATTRIBUTE_PRIVATE attribute prior to invoking the callback, and stop the operation if a shared page is encountered. Because CONFIG_KVM_PRIVATE_MEM in principle does not require attributes, this makes kvm_gmem_populate() depend on CONFIG_KVM_GENERIC_PRIVATE_MEM (which does require them). Reviewed-by: Michael Roth <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
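The control flow described here, checking the private attribute before each callback and stopping at the first shared page, might look roughly like the following userspace model. Everything in it is hypothetical: the array of flags stands in for the memory attributes lookup, the return convention and the use of EINVAL are assumptions, and count_pages_cb() is a dummy callback.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

typedef int (*populate_cb)(size_t gfn_index);

static int cb_calls;	/* how many pages the dummy callback has seen */

/* Dummy callback that only counts invocations. */
static int count_pages_cb(size_t gfn_index)
{
	(void)gfn_index;
	cb_calls++;
	return 0;
}

/* Returns the number of pages populated, or a negative errno if the
 * very first page already failed. is_private[i] models the attribute. */
static long populate_private_only(const bool *is_private, size_t npages,
				  populate_cb cb)
{
	size_t i;
	int ret = 0;

	for (i = 0; i < npages; i++) {
		if (!is_private[i]) {
			ret = -EINVAL;	/* assumption: errno is illustrative */
			break;
		}
		ret = cb(i);
		if (ret)
			break;
	}
	return (ret && i == 0) ? ret : (long)i;
}
```

Because the check sits in the common loop, every callback gets it without repeating the test itself.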
2024-07-26KVM: extend kvm_range_has_memory_attributes() to check subset of attributesPaolo Bonzini3-8/+9
While currently there is no other attribute than KVM_MEMORY_ATTRIBUTE_PRIVATE, KVM code such as kvm_mem_is_private() is written to expect their existence. Allow using kvm_range_has_memory_attributes() as a multi-page version of kvm_mem_is_private(), without it breaking later when more attributes are introduced. Reviewed-by: Michael Roth <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
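Checking a subset of attributes comes down to a masked comparison per page. The sketch below is a userspace model with hypothetical names: the per-page array replaces the real attribute storage, the bit values are illustrative, and SKETCH_ATTR_FUTURE is an invented attribute that only exists to show why the mask matters.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_ATTR_PRIVATE (1UL << 3)	/* illustrative bit value */
#define SKETCH_ATTR_FUTURE  (1UL << 4)	/* hypothetical later attribute */

/* True if, for every page, the bits selected by mask equal attrs. */
static bool range_has_attrs(const unsigned long *page_attrs, size_t npages,
			    unsigned long mask, unsigned long attrs)
{
	for (size_t i = 0; i < npages; i++) {
		if ((page_attrs[i] & mask) != attrs)
			return false;
	}
	return true;
}
```

Passing SKETCH_ATTR_PRIVATE as both mask and attrs asks "is every page private?" and keeps giving the same answer when other bits start appearing on some pages.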
2024-07-26KVM: cleanup and add shortcuts to kvm_range_has_memory_attributes()Paolo Bonzini1-22/+20
Use a guard to simplify early returns, and add two more easy shortcuts. If the requested attributes are invalid, the attributes xarray will never show them as set. And if testing a single page, kvm_get_memory_attributes() is more efficient. Reviewed-by: Michael Roth <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-07-26KVM: guest_memfd: move check for already-populated page to common codePaolo Bonzini2-1/+8
Do not allow populating the same page twice with startup data. In the case of SEV-SNP, for example, the firmware does not allow it anyway, since the launch-update operation is only possible on pages that are still shared in the RMP. Even if it worked, kvm_gmem_populate()'s callback is meant to have side effects such as updating launch measurements, and updating the same page twice is unlikely to have the desired results. Races between calls to the ioctl are not possible because kvm_gmem_populate() holds slots_lock and the VM should not be running. But again, even if this worked on other confidential computing technology, it doesn't matter to guest_memfd.c whether this is something fishy such as missing synchronization in userspace, or rather something intentional. One of the racers wins, and the page is initialized by either kvm_gmem_prepare_folio() or kvm_gmem_populate(). Out of paranoia, adjust sev_gmem_post_populate() anyway to use the same errno that kvm_gmem_populate() is using. Reviewed-by: Michael Roth <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
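A populate-once rule of this kind can be sketched with a single flag per page. This is a userspace model with hypothetical names; in particular, EEXIST is an assumption, since the message only says that both functions should use the same errno without naming it.

```c
#include <errno.h>
#include <stdbool.h>

struct sketch_page {
	bool uptodate;	/* set once the page holds startup data */
	int measured;	/* counts side effects, e.g. launch measurement updates */
};

/* Assumption: EEXIST is used for illustration only. */
static int populate_once(struct sketch_page *p)
{
	if (p->uptodate)
		return -EEXIST;	/* refuse to repeat the side effects */

	p->measured++;
	p->uptodate = true;
	return 0;
}
```

The second call on the same page fails before any side effect runs, so the measurement counter stays at one.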
2024-07-26KVM: remove kvm_arch_gmem_prepare_needed()Paolo Bonzini3-16/+3
It is enough to return 0 if a guest need not do any preparation. This is in fact how sev_gmem_prepare() works for non-SNP guests, and it extends naturally to Intel hosts: the x86 callback for gmem_prepare is optional and returns 0 if not defined. Reviewed-by: Michael Roth <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
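An optional callback that defaults to success needs no separate "is preparation needed" query. The following sketch shows the pattern with hypothetical types and names; sample_prepare() and its failure rule exist only for the example.

```c
#include <errno.h>
#include <stddef.h>

struct sketch_arch_ops {
	int (*gmem_prepare)(unsigned long pfn);	/* optional, may be NULL */
};

/* Example callback: rejects pfn 0 and accepts everything else. */
static int sample_prepare(unsigned long pfn)
{
	return pfn == 0 ? -EINVAL : 0;
}

/* A missing callback means "nothing to prepare", reported as success. */
static int arch_gmem_prepare(const struct sketch_arch_ops *ops,
			     unsigned long pfn)
{
	if (!ops->gmem_prepare)
		return 0;
	return ops->gmem_prepare(pfn);
}
```

Callers can invoke arch_gmem_prepare() unconditionally and treat 0 the same way whether the architecture did real work or had nothing to do.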
2024-07-26KVM: guest_memfd: make kvm_gmem_prepare_folio() operate on a single struct kvmPaolo Bonzini1-30/+19
This is now possible because preparation is done by kvm_gmem_get_pfn() instead of fallocate(). In practice this is not a limitation, because even though guest_memfd can be bound to multiple struct kvm, for hardware implementations of confidential computing only one guest (identified by an ASID on SEV-SNP, or an HKID on TDX) will be able to access it. In the case of intra-host migration (not implemented yet for SEV-SNP, but we can use SEV-ES as an idea of how it will work), the new struct kvm inherits the same ASID and preparation need not be repeated. Reviewed-by: Michael Roth <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-07-26KVM: guest_memfd: delay kvm_gmem_prepare_folio() until the memory is passed to the guestPaolo Bonzini1-44/+66
Initializing the contents of the folio on fallocate() is unnecessarily restrictive. It means that the page is registered with the firmware and then it cannot be touched anymore. In particular, this loses the possibility of using fallocate() to pre-allocate the page for SEV-SNP guests, because kvm_arch_gmem_prepare() then fails. It's only when the guest actually accesses the page (and therefore kvm_gmem_get_pfn() is called) that the page must be cleared from any stale host data and registered with the firmware. The up-to-date flag is clear if this has to be done (i.e. it is the first access and kvm_gmem_populate() has not been called). All in all, there are enough differences between kvm_gmem_get_pfn() and kvm_gmem_populate(), that it's better to separate the two flows completely. Extract the bulk of kvm_gmem_get_folio(), namely the steps that take a folio and end up setting its up-to-date flag, to a new function kvm_gmem_prepare_folio(); these steps are now done only by the non-__-prefixed kvm_gmem_get_pfn(). As a bonus, __kvm_gmem_get_pfn() loses its ugly "bool prepare" argument. One difference is that fallocate(PUNCH_HOLE) can now race with a page fault. Potentially this causes a page to be prepared and added to the filemap even after fallocate(PUNCH_HOLE). This is harmless, as it can be fixed by another hole punching operation, and can be avoided by clearing the private-page attribute prior to invoking fallocate(PUNCH_HOLE). This way, the page fault will cause an exit to user space. The previous semantics, where fallocate() could be used to prepare the pages in advance of running the guest, can be accessed with KVM_PRE_FAULT_MEMORY. For now, accessing a page in one VM will attempt to call kvm_arch_gmem_prepare() in all of those that have bound the guest_memfd. Cleaning this up is left to a separate patch. Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-07-26KVM: guest_memfd: return locked folio from __kvm_gmem_get_pfnPaolo Bonzini1-1/+4
Allow testing the up-to-date flag in the caller without taking the lock again. Reviewed-by: Michael Roth <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-07-26KVM: rename CONFIG_HAVE_KVM_GMEM_* to CONFIG_HAVE_KVM_ARCH_GMEM_*Paolo Bonzini5-11/+11
Add "ARCH" to the symbols; shortly, the "prepare" phase will include both the arch-independent step to clear out contents left in the page by the host, and the arch-dependent step enabled by CONFIG_HAVE_KVM_GMEM_PREPARE. For consistency do the same for CONFIG_HAVE_KVM_GMEM_INVALIDATE as well. Reviewed-by: Michael Roth <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-07-26KVM: guest_memfd: do not go through struct pagePaolo Bonzini1-10/+17
We have a perfectly usable folio, use it to retrieve the pfn and order. All that's needed is a version of folio_file_page that returns a pfn. Reviewed-by: Michael Roth <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-07-26KVM: guest_memfd: delay folio_mark_uptodate() until after successful preparationPaolo Bonzini1-2/+4
The up-to-date flag as is now is not too useful; it tells guest_memfd not to overwrite the contents of a folio, but it doesn't say that the page is ready to be mapped into the guest. For encrypted guests, mapping a private page requires that the "preparation" phase has succeeded, and at the same time the same page cannot be prepared twice. So, ensure that folio_mark_uptodate() is only called on a prepared page. If kvm_gmem_prepare_folio() or the post_populate callback fail, the folio will not be marked up-to-date; it's not a problem to call clear_highpage() again on such a page prior to the next preparation attempt. Reviewed-by: Michael Roth <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
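The ordering described here, clearing first, preparing second, and setting the flag only on success, can be modelled in a few lines. This is a hedged userspace sketch: the struct, the callbacks and the EIO error are all hypothetical, and the clears counter merely stands in for calls to clear_highpage().

```c
#include <errno.h>
#include <stdbool.h>

struct sketch_folio {
	bool uptodate;	/* set only after preparation has succeeded */
	int clears;	/* how many times the contents were zeroed */
};

typedef int (*prepare_fn)(void);

static int prepare_fail(void) { return -EIO; }	/* example failing step */
static int prepare_ok(void) { return 0; }	/* example succeeding step */

static int prepare_folio(struct sketch_folio *f, prepare_fn prepare)
{
	int ret;

	if (f->uptodate)
		return 0;	/* already prepared, never prepare twice */

	f->clears++;		/* zeroing again after a failure is harmless */
	ret = prepare();
	if (ret)
		return ret;	/* flag stays clear so a later attempt retries */

	f->uptodate = true;
	return 0;
}
```

After a failed attempt the folio is still not up-to-date, so the next call zeroes it again and retries; once preparation succeeds, further calls return early.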
2024-07-26KVM: guest_memfd: return folio from __kvm_gmem_get_pfn()Paolo Bonzini1-17/+20
Right now this is simply more consistent and avoids use of pfn_to_page() and put_page(). It will be put to more use in upcoming patches, to ensure that the up-to-date flag is set at the very end of both the kvm_gmem_get_pfn() and kvm_gmem_populate() flows. Reviewed-by: Michael Roth <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-07-26KVM: x86: disallow pre-fault for SNP VMs before initializationPaolo Bonzini6-0/+22
KVM_PRE_FAULT_MEMORY for an SNP guest can race with sev_gmem_post_populate() in bad ways. The following sequence for instance can potentially trigger an RMP fault: thread A, sev_gmem_post_populate: called thread B, sev_gmem_prepare: places below 'pfn' in a private state in RMP thread A, sev_gmem_post_populate: *vaddr = kmap_local_pfn(pfn + i); thread A, sev_gmem_post_populate: copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE); RMP #PF Fix this by only allowing KVM_PRE_FAULT_MEMORY to run after a guest's initial private memory contents have been finalized via KVM_SEV_SNP_LAUNCH_FINISH. Beyond fixing this issue, it just sort of makes sense to enforce this, since the KVM_PRE_FAULT_MEMORY documentation states: "KVM maps memory as if the vCPU generated a stage-2 read page fault" which sort of implies we should be acting on the same guest state that a vCPU would see post-launch after the initial guest memory is all set up. Co-developed-by: Michael Roth <[email protected]> Signed-off-by: Michael Roth <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
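The fix amounts to a gate in front of the pre-fault path. A minimal sketch of such a gate follows; the struct, its field names and the EOPNOTSUPP return value are assumptions made for illustration and are not taken from the message above.

```c
#include <errno.h>
#include <stdbool.h>

struct sketch_vm {
	bool is_snp;		/* guest uses SEV-SNP */
	bool launch_finished;	/* initial private memory has been finalized */
};

/* Non-SNP guests may pre-fault at any time; SNP guests only after launch. */
static bool pre_fault_allowed(const struct sketch_vm *vm)
{
	return !vm->is_snp || vm->launch_finished;
}

static int pre_fault_memory(const struct sketch_vm *vm)
{
	if (!pre_fault_allowed(vm))
		return -EOPNOTSUPP;	/* assumption: errno is illustrative */

	/* ... map memory as if the vCPU had faulted on it ... */
	return 0;
}
```

With the gate in place, the pre-fault path cannot run concurrently with the populate step, because populate only happens before the launch is finished.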
2024-07-26tools/power turbostat: version 2024.07.26Len Brown1-53/+52
Release 2024.07.26: Enable turbostat extensions to add both perf and PMT (Intel Platform Monitoring Technology) counters from the cmdline. Demonstrate PMT access with built-in support for Meteor Lake's Die%c6 counter. This commit: Clean up white-space nits introduced since version 2024.05.10 Signed-off-by: Len Brown <[email protected]>
2024-07-26tools/power turbostat: Include umask=%x in perf counter's configPatryk Wlazlyn1-10/+50
Some counters, like cpu/cache-misses/, expose and require a umask=%x parameter alongside event=%x in the sysfs perf counter's event file. This change makes sure we parse and use it when opening user-added counters. Signed-off-by: Patryk Wlazlyn <[email protected]> Signed-off-by: Len Brown <[email protected]>
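Parsing an event string with an optional umask can be sketched as follows. This is not turbostat's code: parse_event_string() is a hypothetical helper, and the layout of the resulting config value (event in bits 0-7, umask in bits 8-15) is an assumption based on the usual x86 core PMU format.

```c
#include <stdint.h>
#include <stdio.h>

/* Parse a sysfs-style event string such as "event=0x2e,umask=0x41".
 * Returns 0 on success, -1 if no event= field is found. A missing
 * umask is treated as 0. */
static int parse_event_string(const char *s, uint64_t *config)
{
	unsigned int event = 0, umask = 0;
	int matched = sscanf(s, "event=%x,umask=%x", &event, &umask);

	if (matched < 1)
		return -1;

	/* Assumption: event occupies bits 0-7 and umask bits 8-15. */
	*config = (uint64_t)event | ((uint64_t)umask << 8);
	return 0;
}
```

Strings without a umask field still parse, which keeps counters that only specify event= working as before.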