aboutsummaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)AuthorFilesLines
2023-02-03vfio: Support VFIO_NOIOMMU with iommufdJason Gunthorpe1-3/+9
Add a small amount of emulation to vfio_compat to accept the SET_IOMMU to VFIO_NOIOMMU_IOMMU and have vfio just ignore iommufd if it is working on a no-iommu enabled device. Move the enable_unsafe_noiommu_mode module out of container.c into vfio_main.c so that it is always available even if VFIO_CONTAINER=n. This passes Alex's mini-test: https://github.com/awilliam/tests/blob/master/vfio-noiommu-pci-device-open.c Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Kevin Tian <[email protected]> Acked-by: Alex Williamson <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2023-02-03Merge tag 'ceph-for-6.2-rc7' of https://github.com/ceph/ceph-clientLinus Torvalds1-10/+0
Pull ceph fix from Ilya Dryomov: "A safeguard to prevent the kernel client from further damaging the filesystem after running into a case of an invalid snap trace. The root cause of this metadata corruption is still being investigated but it appears to be stemming from the MDS. As such, this is the best we can do for now" * tag 'ceph-for-6.2-rc7' of https://github.com/ceph/ceph-client: ceph: blocklist the kclient when receiving corrupted snap trace ceph: move mount state enum to super.h
2023-02-03Merge tag 'mm-hotfixes-stable-2023-02-02-19-24-2' of ↵Linus Torvalds4-5/+20
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "25 hotfixes, mainly for MM. 13 are cc:stable" * tag 'mm-hotfixes-stable-2023-02-02-19-24-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (26 commits) mm: memcg: fix NULL pointer in mem_cgroup_track_foreign_dirty_slowpath() Kconfig.debug: fix the help description in SCHED_DEBUG mm/swapfile: add cond_resched() in get_swap_pages() mm: use stack_depot_early_init for kmemleak Squashfs: fix handling and sanity checking of xattr_ids count sh: define RUNTIME_DISCARD_EXIT highmem: round down the address passed to kunmap_flush_on_unmap() migrate: hugetlb: check for hugetlb shared PMD in node migration mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups Revert "mm: kmemleak: alloc gray object for reserved region with direct map" freevxfs: Kconfig: fix spelling maple_tree: should get pivots boundary by type .mailmap: update e-mail address for Eugen Hristev mm, mremap: fix mremap() expanding for vma's with vm_ops->close() squashfs: harden sanity check in squashfs_read_xattr_id_table ia64: fix build error due to switch case label appearing next to declaration mm: multi-gen LRU: fix crash during cgroup migration Revert "mm: add nodes= arg to memory.reclaim" zsmalloc: fix a race with deferred_handles storing ...
2023-02-03efi: Drop minimum EFI version check at bootArd Biesheuvel1-2/+1
We currently pass a minimum major version to the generic EFI helper that checks the system table magic and version, and refuse to boot if the value is lower. The motivation for this check is unknown, and even the code that uses major version 2 as the minimum (ARM, arm64 and RISC-V) should make it past this check without problems, and boot to a point where we have access to a console or some other means to inform the user that the firmware's major revision number made us unhappy. (Revision 2.0 of the UEFI specification was released in January 2006, whereas ARM, arm64 and RISC-V support where added in 2009, 2013 and 2017, respectively, so checking for major version 2 or higher is completely arbitrary) So just drop the check. Signed-off-by: Ard Biesheuvel <[email protected]>
2023-02-03ALSA: hda: Fix the control element identification for multiple codecsJaroslav Kysela1-0/+1
Some motherboards have multiple HDA codecs connected to the serial bus. The current code may create multiple mixer controls with the almost identical identification. The current code use id.device field from the control element structure to store the codec address to avoid such clashes for multiple codecs. Unfortunately, the user space do not handle this correctly. For mixer controls, only name and index are used for the identifiers. This patch fixes this problem to compose the index using the codec address as an offset in case, when the control already exists. It is really unlikely that one codec will create 10 similar controls. This patch adds new kernel module parameter 'ctl_dev_id' to allow select the old behaviour, too. The CONFIG_SND_HDA_CTL_DEV_ID Kconfig option sets the default value. BugLink: https://github.com/alsa-project/alsa-lib/issues/294 BugLink: https://github.com/alsa-project/alsa-lib/issues/205 Fixes: 54d174031576 ("[ALSA] hda-codec - Fix connection list parsing") Fixes: 1afe206ab699 ("ALSA: hda - Try to find an empty control index when it's occupied") Signed-off-by: Jaroslav Kysela <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Takashi Iwai <[email protected]>
2023-02-03block: add a bvec_set_virt helperChristoph Hellwig1-0/+12
A small wrapper around bvec_set_page for callers that have a virtual address. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-02-03block: add a bvec_set_folio helperChristoph Hellwig1-0/+13
A smaller wrapper around bvec_set_page that takes a folio instead. There are only two potential users for this in the tree, but the number will grow in the future. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-02-03block: factor out a bvec_set_page helperChristoph Hellwig1-0/+15
Add a helper to initialize a bvec based of a page pointer. This will help removing various open code bvec initializations. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-02-03blk-cgroup: move the cgroup information to struct gendiskChristoph Hellwig1-6/+6
cgroup information only makes sense on a live gendisk that allows file system I/O (which includes the raw block device). So move over the cgroup related members. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Andreas Herrmann <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-02-03blk-cgroup: store a gendisk to throttle in struct task_structChristoph Hellwig1-1/+1
Switch from a request_queue pointer and reference to a gendisk once for the throttle information in struct task_struct. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Andreas Herrmann <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-02-03locking/atomic: cmpxchg: Make __generic_cmpxchg_local compare against ↵Matt Evans1-3/+3
zero-extended 'old' value __generic_cmpxchg_local takes unsigned long old/new arguments which might end up being up-cast from smaller signed types (which will sign-extend). The loaded compare value must be compared against a truncated smaller type, so down-cast appropriately for each size. The issue is apparent on 64-bit machines with code, such as atomic_dec_unless_positive(), that sign-extends from int. 64-bit machines generally don't use the generic cmpxchg but development/early ports might make use of it, so make it correct. Signed-off-by: Matt Evans <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]>
2023-02-03Merge tag 'v6.2-next-soc' of ↵Arnd Bergmann4-0/+235
https://git.kernel.org/pub/scm/linux/kernel/git/matthias.bgg/linux into soc/drivers Introduce MediaTek regulator coupler driver to ensure that the SRAM voltage in par with the GPU voltage. This allows for a stable use of the GPU. mtk-mutex: - add support for MT8188 vdosys0 path - allow it to be build as module - add support for MT8195 vdosys1 path mmsys: - add MT8188 vdosys0 path - allow to be build as a module - add MT8195 vdosys1 path - add support for CMDQ - allow for up to 64 reset bits - add supprot for the MT8195 vppsys[0,1] pathes pm-domains: - keep power for the MT8186 ADSP on by default - add support for MT8188 - add support for buck isolation needed in specific pm-domains for MT8188 and MT8192 mtk-svs: - enable IRQ later to allow using kexec - several improvments on the code base - fix modalias pmic wrapper: - convert binding to yaml. As this is thightly coupled to the MT6357 PMIC, I took patches regarding it as well. * tag 'v6.2-next-soc' of https://git.kernel.org/pub/scm/linux/kernel/git/matthias.bgg/linux: (41 commits) soc: mediatek: mtk-svs: add missing MODULE_DEVICE_TABLE soc: mediatek: mtk-devapc: Switch to devm_clk_get_enabled() soc: mtk-svs: mt8183: refactor o_slope calculation soc: mediatek: mtk-svs: delete superfluous platform data entries soc: mediatek: mtk-svs: move svs_platform_probe into probe soc: mediatek: mtk-svs: improve readability of platform_probe soc: mediatek: mtk-svs: clean up platform probing soc: mediatek: mtk-svs: keep svs alive if CONFIG_DEBUG_FS not supported soc: mediatek: mtk-svs: Use pm_runtime_resume_and_get() in svs_init01() soc: mediatek: mtk-svs: reset svs when svs_resume() fail soc: mediatek: mtk-svs: restore default voltages when svs_init02() fail soc: mediatek: mmsys: add support for MT8195 VPPSYS dt-bindings: arm: mediatek: mmsys: Add support for MT8195 VPPSYS soc: mediatek: Introduce mediatek-regulator-coupler driver soc: mediatek: mtk-svs: Enable the IRQ later soc: mediatek: add mtk-mutex support for mt8195 vdosys1 soc: mediatek: add mtk-mutex component - dp_intf1 soc: mediatek: mmsys: add reset control for MT8195 vdosys1 soc: mediatek: mmsys: add mmsys for support 64 reset bits soc: mediatek: add cmdq support of mtk-mmsys config API for mt8195 vdosys1 ... Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]>
2023-02-03ASoC: amd: update ps platform acp header fileVijendar Mukunda1-457/+294
Rename Audio buffer and soundwire manager instance registers. Remove scratch registers as these registers can be accessed using ACP_SCRATCH_REG_0 register relative offset. Signed-off-by: Vijendar Mukunda <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Mark Brown <[email protected]>
2023-02-03livepatch,x86: Clear relocation targets on a module removalSong Liu1-0/+17
Josh reported a bug: When the object to be patched is a module, and that module is rmmod'ed and reloaded, it fails to load with: module: x86/modules: Skipping invalid relocation target, existing value is nonzero for type 2, loc 00000000ba0302e9, val ffffffffa03e293c livepatch: failed to initialize patch 'livepatch_nfsd' for module 'nfsd' (-8) livepatch: patch 'livepatch_nfsd' failed for module 'nfsd', refusing to load module 'nfsd' The livepatch module has a relocation which references a symbol in the _previous_ loading of nfsd. When apply_relocate_add() tries to replace the old relocation with a new one, it sees that the previous one is nonzero and it errors out. He also proposed three different solutions. We could remove the error check in apply_relocate_add() introduced by commit eda9cec4c9a1 ("x86/module: Detect and skip invalid relocations"). However the check is useful for detecting corrupted modules. We could also deny the patched modules to be removed. If it proved to be a major drawback for users, we could still implement a different approach. The solution would also complicate the existing code a lot. We thus decided to reverse the relocation patching (clear all relocation targets on x86_64). The solution is not universal and is too much arch-specific, but it may prove to be simpler in the end. Reported-by: Josh Poimboeuf <[email protected]> Originally-by: Miroslav Benes <[email protected]> Signed-off-by: Song Liu <[email protected]> Acked-by: Miroslav Benes <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Acked-by: Josh Poimboeuf <[email protected]> Reviewed-by: Joe Lawrence <[email protected]> Tested-by: Joe Lawrence <[email protected]> Signed-off-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-02-03iommu/vt-d: Support cpumask for IOMMU perfmonKan Liang1-0/+1
The perf subsystem assumes that all counters are by default per-CPU. So the user space tool reads a counter from each CPU. However, the IOMMU counters are system-wide and can be read from any CPU. Here we use a CPU mask to restrict counting to one CPU to handle the issue. (with CPU hotplug notifier to choose a different CPU if the chosen one is taken off-line). The CPU is exposed to /sys/bus/event_source/devices/dmar*/cpumask for the user space perf tool. Signed-off-by: Kan Liang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Lu Baolu <[email protected]> Signed-off-by: Joerg Roedel <[email protected]>
2023-02-03iommu/vt-d: Support size of the register set in DRHDKan Liang2-1/+2
A new field, which indicates the size of the remapping hardware register set for this remapping unit, is introduced in the DMA-remapping hardware unit definition (DRHD) structure with the VT-d Spec 4.0. With this information, SW doesn't need to 'guess' the size of the register set anymore. Update the struct acpi_dmar_hardware_unit to reflect the field. Store the size of the register set in struct dmar_drhd_unit for each dmar device. The 'size' information is ResvZ for the old BIOS and platforms. Fall back to the old guessing method. There is nothing changed. Signed-off-by: Kan Liang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Lu Baolu <[email protected]> Signed-off-by: Joerg Roedel <[email protected]>
2023-02-03iommu/vt-d: Remove include/linux/intel-svm.hLu Baolu1-16/+0
There's no need to have a public header for Intel SVA implementation. The device driver should interact with Intel SVA implementation via the IOMMU generic APIs. Reviewed-by: Kevin Tian <[email protected]> Signed-off-by: Lu Baolu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
2023-02-03netfilter: flowtable: cache info of last offloadVlad Buslov1-3/+4
Modify flow table offload to cache the last ct info status that was passed to the driver offload callbacks by extending enum nf_flow_flags with new "NF_FLOW_HW_ESTABLISHED" flag. Set the flag if ctinfo was 'established' during last act_ct meta actions fill call. This infrastructure change is necessary to optimize promoting of UDP connections from 'new' to 'established' in following patches in this series. Signed-off-by: Vlad Buslov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-02-03netfilter: flowtable: allow unidirectional rulesVlad Buslov1-0/+1
Modify flow table offload to support unidirectional connections by extending enum nf_flow_flags with new "NF_FLOW_HW_BIDIRECTIONAL" flag. Only offload reply direction when the flag is set. This infrastructure change is necessary to support offloading UDP NEW connections in original direction in following patches in series. Signed-off-by: Vlad Buslov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-02-03media: v4l2-core: Make the v4l2-core code enable/disable the privacy LED if ↵Hans de Goede1-0/+4
present Make v4l2_async_register_subdev_sensor() try to get a privacy LED associated with the sensor and extend the call_s_stream() wrapper to enable/disable the privacy LED if found. This makes the core handle privacy LED control, rather then having to duplicate this code in all the sensor drivers. Suggested-by: Sakari Ailus <[email protected]> Acked-by: Linus Walleij <[email protected]> Signed-off-by: Hans de Goede <[email protected]> Acked-by: Sakari Ailus <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-02-03Merge tag 'ib-leds-led_get-v6.3' into HEADHans de Goede1-0/+21
Immutable branch from LEDs due for the v6.3 merge window
2023-02-02kexec: introduce sysctl parameters kexec_load_limit_*Ricardo Ribalda1-1/+1
kexec allows replacing the current kernel with a different one. This is usually a source of concerns for sysadmins that want to harden a system. Linux already provides a way to disable loading new kexec kernel via kexec_load_disabled, but that control is very coard, it is all or nothing and does not make distinction between a panic kexec and a normal kexec. This patch introduces new sysctl parameters, with finer tuning to specify how many times a kexec kernel can be loaded. The sysadmin can set different limits for kexec panic and kexec reboot kernels. The value can be modified at runtime via sysctl, but only with a stricter value. With these new parameters on place, a system with loadpin and verity enabled, using the following kernel parameters: sysctl.kexec_load_limit_reboot=0 sysct.kexec_load_limit_panic=1 can have a good warranty that if initrd tries to load a panic kernel, a malitious user will have small chances to replace that kernel with a different one, even if they can trigger timeouts on the disk where the panic kernel lives. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ricardo Ribalda <[email protected]> Reviewed-by: Steven Rostedt (Google) <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Bagas Sanjaya <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Guilherme G. Piccoli <[email protected]> # Steam Deck Cc: Joel Fernandes (Google) <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Philipp Rudo <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02kexec: factor out kexec_load_permittedRicardo Ribalda1-1/+2
Both syscalls (kexec and kexec_file) do the same check, let's factor it out. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ricardo Ribalda <[email protected]> Reviewed-by: Steven Rostedt (Google) <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Bagas Sanjaya <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Guilherme G. Piccoli <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Philipp Rudo <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02util_macros.h: add missing inclusionAndy Shevchenko1-0/+2
The header is the direct user of definitions from the math.h, include it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Andy Shevchenko <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02include/linux/percpu_counter.h: race in uniprocessor percpu_counter_add()Manfred Spraul1-2/+4
The percpu interface is supposed to be preempt and irq safe. But: The uniprocessor implementation of percpu_counter_add() is not irq safe: if an interrupt happens during the +=, then the result is undefined. Therefore: switch from preempt_disable() to local_irq_save(). This prevents interrupts from interrupting the +=, and as a side effect prevents preemption. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Manfred Spraul <[email protected]> Cc: "Sun, Jiebin" <[email protected]> Cc: <[email protected]> Cc: Alexander Sverdlin <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02docs: fault-injection: add requirements of error injectable functionsMasami Hiramatsu (Google)1-2/+4
Add a section about the requirements of the error injectable functions and the type of errors. Since this section must be read before using ALLOW_ERROR_INJECTION() macro, that section is referred from the comment of the macro too. Link: https://lkml.kernel.org/r/167081321427.387937.15475445689482551048.stgit@devnote3 Link: https://lore.kernel.org/all/[email protected]/T/#u Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Chris Mason <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Florent Revest <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Kees Cook <[email protected]> Cc: KP Singh <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt (Google) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02error-injection: remove EI_ETYPE_NONEMasami Hiramatsu (Google)2-2/+2
Patch series "error-injection: Clarify the requirements of error injectable functions". Patches for clarifying the requirement of error injectable functions and to remove the confusing EI_ETYPE_NONE. This patch (of 2): Since the EI_ETYPE_NONE is confusing type, replace it with appropriate errno. The EI_ETYPE_NONE has been introduced for a dummy (error) value, but it can mislead people that they can use ALLOW_ERROR_INJECTION(func, NONE). So remove it from the EI_ETYPE and use appropriate errno instead. [[email protected]: include/linux/error-injection.h needs errno.h] Link: https://lkml.kernel.org/r/167081319306.387937.10079195394503045678.stgit@devnote3 Link: https://lkml.kernel.org/r/167081320421.387937.4259807348852421112.stgit@devnote3 Fixes: 663faf9f7bee ("error-injection: Add injectable error types") Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Chris Mason <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Florent Revest <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Kees Cook <[email protected]> Cc: KP Singh <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt (Google) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02fs: convert writepage_t callback to pass a folioMatthew Wilcox (Oracle)1-1/+1
Patch series "Convert writepage_t to use a folio". More folioisation. I split out the mpage work from everything else because it completely dominated the patch, but some implementations I just converted outright. This patch (of 2): We always write back an entire folio, but that's currently passed as the head page. Convert all filesystems that use write_cache_pages() to expect a folio instead of a page. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Christoph Hellwig <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: add memcpy_from_file_folio()Matthew Wilcox (Oracle)2-0/+30
This is the equivalent of memcpy_from_page(). It differs in that it takes the position in a file instead of offset in a folio, it accepts the total number of bytes to be copied (instead of the number of bytes to be copied from this folio) and it returns how many bytes were copied from the folio, rather than making the caller calculate that and then checking if the caller got it right. [[email protected]: fix typo in comment] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: "Fabio M. De Francesco" <[email protected]> Cc: Ira Weiny <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02block: remove ->rw_pageChristoph Hellwig1-5/+7
The ->rw_page method is a special purpose bypass of the usual bio handling path that is limited to single-page reads and writes and synchronous which causes a lot of extra code in the drivers, callers and the block layer. The only remaining user is the MM swap code. Switch that swap code to simply submit a single-vec on-stack bio an synchronously wait on it based on a newly added QUEUE_FLAG_SYNCHRONOUS flag set by the drivers that currently implement ->rw_page instead. While this touches one extra cache line and executes extra code, it simplifies the block layer and drivers and ensures that all feastures are properly supported by all drivers, e.g. right now ->rw_page bypassed cgroup writeback entirely. [[email protected]: fix comment typo, per Dan] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Dan Williams <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Ira Weiny <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Keith Busch <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Vishal Verma <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: memory-failure: add memory failure stats to sysfsJiaqi Yan2-0/+33
Patch series "Introduce per NUMA node memory error statistics", v2. Background ========== In the RFC for Kernel Support of Memory Error Detection [1], one advantage of software-based scanning over hardware patrol scrubber is the ability to make statistics visible to system administrators. The statistics include 2 categories: * Memory error statistics, for example, how many memory error are encountered, how many of them are recovered by the kernel. Note these memory errors are non-fatal to kernel: during the machine check exception (MCE) handling kernel already classified MCE's severity to be unnecessary to panic (but either action required or optional). * Scanner statistics, for example how many times the scanner have fully scanned a NUMA node, how many errors are first detected by the scanner. The memory error statistics are useful to userspace and actually not specific to scanner detected memory errors, and are the focus of this patchset. Motivation ========== Memory error stats are important to userspace but insufficient in kernel today. Datacenter administrators can better monitor a machine's memory health with the visible stats. For example, while memory errors are inevitable on servers with 10+ TB memory, starting server maintenance when there are only 1~2 recovered memory errors could be overreacting; in cloud production environment maintenance usually means live migrate all the workload running on the server and this usually causes nontrivial disruption to the customer. Providing insight into the scope of memory errors on a system helps to determine the appropriate follow-up action. In addition, the kernel's existing memory error stats need to be standardized so that userspace can reliably count on their usefulness. Today kernel provides following memory error info to userspace, but they are not sufficient or have disadvantages: * HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total, not per NUMA node stats though * ras:memory_failure_event: only available after explicitly enabled * /dev/mcelog provides many useful info about the MCEs, but doesn't capture how memory_failure recovered memory MCEs * kernel logs: userspace needs to process log text Exposing memory error stats is also a good start for the in-kernel memory error detector. Today the data source of memory error stats are either direct memory error consumption, or hardware patrol scrubber detection (either signaled as UCNA or SRAO). Once in-kernel memory scanner is implemented, it will be the main source as it is usually configured to scan memory DIMMs constantly and faster than hardware patrol scrubber. How Implemented =============== As Naoya pointed out [2], exposing memory error statistics to userspace is useful independent of software or hardware scanner. Therefore we implement the memory error statistics independent of the in-kernel memory error detector. It exposes the following per NUMA node memory error counters: /sys/devices/system/node/node${X}/memory_failure/total /sys/devices/system/node/node${X}/memory_failure/recovered /sys/devices/system/node/node${X}/memory_failure/ignored /sys/devices/system/node/node${X}/memory_failure/failed /sys/devices/system/node/node${X}/memory_failure/delayed These counters describe how many raw pages are poisoned and after the attempted recoveries by the kernel, their resolutions: how many are recovered, ignored, failed, or delayed respectively. This approach can be easier to extend for future use cases than /proc/meminfo, trace event, and log. The following math holds for the statistics: * total = recovered + ignored + failed + delayed These memory error stats are reset during machine boot. The 1st commit introduces these sysfs entries. The 2nd commit populates memory error stats every time memory_failure attempts memory error recovery. The 3rd commit adds documentations for introduced stats. [1] https://lore.kernel.org/linux-mm/[email protected]/T/#mc22959244f5388891c523882e61163c6e4d703af [2] https://lore.kernel.org/linux-mm/[email protected]/T/#m52d8d7a333d8536bd7ce74253298858b1c0c0ac6 This patch (of 3): Today kernel provides following memory error info to userspace, but each has its own disadvantage * HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total, not per NUMA node stats though * ras:memory_failure_event: only available after explicitly enabled * /dev/mcelog provides many useful info about the MCEs, but doesn't capture how memory_failure recovered memory MCEs * kernel logs: userspace needs to process log text Exposes per NUMA node memory error stats as sysfs entries: /sys/devices/system/node/node${X}/memory_failure/total /sys/devices/system/node/node${X}/memory_failure/recovered /sys/devices/system/node/node${X}/memory_failure/ignored /sys/devices/system/node/node${X}/memory_failure/failed /sys/devices/system/node/node${X}/memory_failure/delayed These counters describe how many raw pages are poisoned and after the attempted recoveries by the kernel, their resolutions: how many are recovered, ignored, failed, or delayed respectively. The following math holds for the statistics: * total = recovered + ignored + failed + delayed Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jiaqi Yan <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Tony Luck <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: multi-gen LRU: section for memcg LRUT.J. Alumbaugh2-28/+2
Move memcg LRU code into a dedicated section. Improve the design doc to outline its architecture. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: T.J. Alumbaugh <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm/damon: update comments in damon.h for damon_attrsSeongJae Park1-3/+3
Patch series "mm/damon: misc fixes". This patchset contains three miscellaneous simple fixes for DAMON online tuning. This patch (of 3): Commit cbeaa77b0449 ("mm/damon/core: use a dedicated struct for monitoring attributes") moved monitoring intervals from damon_ctx to a new struct, damon_attrs, but a comment in the header file has not updated for the change. Update it. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: cbeaa77b0449 ("mm/damon/core: use a dedicated struct for monitoring attributes") Signed-off-by: SeongJae Park <[email protected]> Cc: Brendan Higgins <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: implement memory-deny-write-execute as a prctlJoey Gouly3-1/+45
Patch series "mm: In-kernel support for memory-deny-write-execute (MDWE)", v2. The background to this is that systemd has a configuration option called MemoryDenyWriteExecute [2], implemented as a SECCOMP BPF filter. Its aim is to prevent a user task from inadvertently creating an executable mapping that is (or was) writeable. Since such BPF filter is stateless, it cannot detect mappings that were previously writeable but subsequently changed to read-only. Therefore the filter simply rejects any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI support (Branch Target Identification), the dynamic loader cannot change an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect(). For libraries, it can resort to unmapping and re-mapping but for the main executable it does not have a file descriptor. The original bug report in the Red Hat bugzilla - [3] - and subsequent glibc workaround for libraries - [4]. This series adds in-kernel support for this feature as a prctl PR_SET_MDWE, that is inherited on fork(). The prctl denies PROT_WRITE | PROT_EXEC mappings. Like the systemd BPF filter it also denies adding PROT_EXEC to mappings. However unlike the BPF filter it only denies it if the mapping didn't previous have PROT_EXEC. This allows to PROT_EXEC -> PROT_EXEC | PROT_BTI with mprotect(), which is a problem with the BPF filter. This patch (of 2): The aim of such policy is to prevent a user task from creating an executable mapping that is also writeable. An example of mmap() returning -EACCESS if the policy is enabled: mmap(0, size, PROT_READ | PROT_WRITE | PROT_EXEC, flags, 0, 0); Similarly, mprotect() would return -EACCESS below: addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0); mprotect(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC); The BPF filter that systemd MDWE uses is stateless, and disallows mprotect() with PROT_EXEC completely. This new prctl allows PROT_EXEC to be enabled if it was already PROT_EXEC, which allows the following case: addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0); mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI); where PROT_BTI enables branch tracking identification on arm64. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Joey Gouly <[email protected]> Co-developed-by: Catalin Marinas <[email protected]> Signed-off-by: Catalin Marinas <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Jeremy Linton <[email protected]> Cc: Kees Cook <[email protected]> Cc: Lennart Poettering <[email protected]> Cc: Mark Brown <[email protected]> Cc: nd <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Szabolcs Nagy <[email protected]> Cc: Topi Miettinen <[email protected]> Cc: Zbigniew Jędrzejewski-Szmek <[email protected]> Cc: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02swap_state: update shadow_nodes for anonymous pageYang Yang1-1/+2
Shadow_nodes is for shadow nodes reclaiming of workingset handling, it is updated when page cache add or delete since long time ago workingset only supported page cache. But when workingset supports anonymous page detection, we missied updating shadow nodes for it. This caused that shadow nodes of anonymous page will never be reclaimd by scan_shadow_nodes() even they use much memory and system memory is tense. So update shadow_nodes of anonymous page when swap cache is add or delete by calling xas_set_update(..workingset_update_node). Link: https://lkml.kernel.org/r/[email protected] Fixes: aae466b0052e ("mm/swap: implement workingset detection for anonymous LRU") Signed-off-by: Yang Yang <[email protected]> Reviewed-by: Ran Xiaokai <[email protected]> Cc: Bagas Sanjaya <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm/hugetlb: convert get_hwpoison_huge_page() to foliosSidhartha Kumar1-2/+2
Straightforward conversion of get_hwpoison_huge_page() to get_hwpoison_hugetlb_folio(). Reduces two references to a head page in memory-failure.c [[email protected]: fix get_hwpoison_hugetlb_folio() stub] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sidhartha Kumar <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm/page_ext: init page_ext early if there are no deferred struct pagesPasha Tatashin1-0/+2
page_ext must be initialized after all struct pages are initialized. Therefore, page_ext is initialized after page_alloc_init_late(), and can optionally be initialized earlier via early_page_ext kernel parameter which as a side effect also disables deferred struct pages. Allow to automatically init page_ext early when there are no deferred struct pages in order to be able to use page_ext during kernel boot and track for example page allocations early. [[email protected]: fix build with CONFIG_PAGE_EXTENSION=n] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Pasha Tatashin <[email protected]> Acked-by: Mike Rapoport (IBM) <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Charan Teja Kalla <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Li Zhe <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: convert mem_cgroup_css_from_page() to mem_cgroup_css_from_folio()Matthew Wilcox (Oracle)1-1/+1
Only one caller doesn't have a folio, so move the page_folio() call to that one caller from mem_cgroup_css_from_folio(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm/fs: convert inode_attach_wb() to take a folioMatthew Wilcox (Oracle)1-6/+6
Patch series "Writeback folio conversions". Remove more calls to compound_head() by passing folios around instead of pages. This patch (of 2): The only caller of inode_attach_wb() which doesn't pass NULL already has a folio, so convert the whole call-chain to take folios. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: add vma_alloc_zeroed_movable_folio()Matthew Wilcox (Oracle)1-17/+16
Replace alloc_zeroed_user_highpage_movable(). The main difference is returning a folio containing a single page instead of returning the page, but take the opportunity to rename the function to match other allocation functions a little better and rewrite the documentation to place more emphasis on the zeroing rather than the highmem aspect. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02filemap: remove find_get_pages_range_tag()Vishal Moola (Oracle)2-18/+0
All callers to find_get_pages_range_tag(), find_get_pages_tag(), pagevec_lookup_range_tag(), and pagevec_lookup_tag() have been removed. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vishal Moola (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02filemap: add filemap_get_folios_tag()Vishal Moola (Oracle)1-0/+2
This is the equivalent of find_get_pages_range_tag(), except for folios instead of pages. One noteable difference is filemap_get_folios_tag() does not take in a maximum pages argument. It instead tries to fill a folio batch and stops either once full (15 folios) or reaching the end of the search range. The new function supports large folios, the initial function did not since all callers don't use large folios. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vishal Moola (Oracle) <[email protected]> Reviewed-by: Matthew Wilcow (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02pagemap: add filemap_grab_folio()Vishal Moola (Oracle)1-0/+20
Patch series "Convert to filemap_get_folios_tag()", v5. This patch series replaces find_get_pages_range_tag() with filemap_get_folios_tag(). This also allows the removal of multiple calls to compound_head() throughout. It also makes a good chunk of the straightforward conversions to folios, and takes the opportunity to introduce a function that grabs a folio from the pagecache. This patch (of 23): Add function filemap_grab_folio() to grab a folio from the page cache. This function is meant to serve as a folio replacement for grab_cache_page, and is used to facilitate the removal of find_get_pages_range_tag(). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vishal Moola (Oracle) <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: fix khugepaged with shmem_enabled=adviseDavid Stevens1-8/+2
Pass vm_flags as a parameter to shmem_is_huge, rather than reading the flags from the vm_area_struct in question. This allows the updated flags from hugepage_madvise to be passed to the check, which is necessary because madvise does not update the vm_area_struct's flags until after hugepage_madvise returns. This fixes an issue when shmem_enabled=madvise, where MADV_HUGEPAGE on shmem was not able to register the mm_struct with khugepaged. Prior to cd89fb065099, the mm_struct was registered by MADV_HUGEPAGE regardless of the value of shmem_enabled (which was only checked when scanning vmas). Link: https://lkml.kernel.org/r/[email protected] Fixes: cd89fb065099 ("mm,thp,shmem: make khugepaged obey tmpfs mount flags") Signed-off-by: David Stevens <[email protected]> Cc: David Stevens <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: discard __GFP_ATOMICNeilBrown2-9/+4
__GFP_ATOMIC serves little purpose. Its main effect is to set ALLOC_HARDER which adds a few little boosts to increase the chance of an allocation succeeding, one of which is to lower the water-mark at which it will succeed. It is *always* paired with __GFP_HIGH which sets ALLOC_HIGH which also adjusts this watermark. It is probable that other users of __GFP_HIGH should benefit from the other little bonuses that __GFP_ATOMIC gets. __GFP_ATOMIC also gives a warning if used with __GFP_DIRECT_RECLAIM. There is little point to this. We already get a might_sleep() warning if __GFP_DIRECT_RECLAIM is set. __GFP_ATOMIC allows the "watermark_boost" to be side-stepped. It is probable that testing ALLOC_HARDER is a better fit here. __GFP_ATOMIC is used by tegra-smmu.c to check if the allocation might sleep. This should test __GFP_DIRECT_RECLAIM instead. This patch: - removes __GFP_ATOMIC - allows __GFP_HIGH allocations to ignore watermark boosting as well as GFP_ATOMIC requests. - makes other adjustments as suggested by the above. The net result is not change to GFP_ATOMIC allocations. Other allocations that use __GFP_HIGH will benefit from a few different extra privileges. This affects: xen, dm, md, ntfs3 the vermillion frame buffer hibernation ksm swap all of which likely produce more benefit than cost if these selected allocation are more likely to succeed quickly. [mgorman: Minor adjustments to rework on top of a series] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: NeilBrown <[email protected]> Signed-off-by: Mel Gorman <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Thierry Reding <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm/page_ext: do not allocate space for page_ext->flags if not neededPasha Tatashin1-0/+18
There is 8 byte page_ext->flags field allocated per page whenever CONFIG_PAGE_EXTENSION is enabled. However, not every user of page_ext uses flags. Therefore, check whether flags is needed at least by one user and if so allocate space for it. For example when page_table_check is enabled, on a machine with 128G of memory before the fix: [ 2.244288] allocated 536870912 bytes of page_ext after the fix: [ 2.160154] allocated 268435456 bytes of page_ext Also, add a kernel-doc comment before page_ext_operations that describes the fields, and remove check if need() is set, as that is now a required field. [[email protected]: address comments from Mike Rapoport] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Pasha Tatashin <[email protected]> Acked-by: David Hildenbrand <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: David Rientjes <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Cc: Charan Teja Kalla <[email protected]> Cc: Li Zhe <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: remove __HAVE_ARCH_PTE_SWP_EXCLUSIVEDavid Hildenbrand1-29/+0
__HAVE_ARCH_PTE_SWP_EXCLUSIVE is now supported by all architectures that support swp PTEs, so let's drop it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: pagevec: add folio_batch_reinit()Lorenzo Stoakes1-0/+5
Patch series "update mlock to use folios", v4. This series updates mlock to use folios, converting the internal interface to using folios exclusively and exposing the folio interface externally. As a product of this we move to using a folio batch rather than a pagevec for mlock folios, which brings it in line with the core folio batches contained in mm/swap.c. This patch (of 5): This performs the same task as pagevec_reinit(), only modifying a folio batch rather than a pagevec. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/9018cecacb39e34c883540f997f9be8281153613.1673526881.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: William Kucharski <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm/memory-failure: convert hugetlb_clear_page_hwpoison to foliosSidhartha Kumar1-2/+2
Change hugetlb_clear_page_hwpoison() to folio_clear_hugetlb_hwpoison() by changing the function to take in a folio. This converts one use of ClearPageHWPoison and HPageRawHwpUnreliable to their folio equivalents. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sidhartha Kumar <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-02-02mm: remove the hugetlb field from struct pageSidhartha Kumar1-12/+0
Patch series "Get rid of tail page fields". Continue the shrinkage of the struct page definition by getting rid of the 'first tail page' and 'second tail page' fields. I originally did this patch set before Hugh's rewrite of the subpages_mapcount, so it needed substantial updates; hope I didn't miss anything. This patch (of 28): commit dad6a5eb5556(mm,hugetlb: use folio fields in second tail page) added a transitional hugetlb field to struct page and struct folio to make room for another int in the first tail of a compound page. Hugetlb folio conversions have changed all page users of this field to use the fields within the folio so struct page no longer needs this hugetlb specific field. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sidhartha Kumar <[email protected]> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>