path: root/include/linux
Age | Commit message | Author | Files | Lines
2022-12-02 | printk: Convert console_drivers list to hlist | Thomas Gleixner | 1 | -3/+5
Replace the open coded single linked list with a hlist so a conversion to SRCU protected list walks can reuse the existing primitives. Co-developed-by: John Ogness <[email protected]> Signed-off-by: John Ogness <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Signed-off-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected]
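As a rough illustration of why an hlist is convenient here, the following userspace sketch shows the node layout with a back-pointer to the previous `next` field, which lets a node be unlinked without walking the list from its head. The `demo_*` names are hypothetical and are not the kernel's `<linux/list.h>` API.

```c
#include <stddef.h>

/* Hypothetical userspace stand-ins for an hlist head and node. */
struct demo_hnode {
	struct demo_hnode *next;
	struct demo_hnode **pprev;	/* address of the pointer that points at us */
};

struct demo_hhead {
	struct demo_hnode *first;
};

/* Insert a node at the front of the list. */
void demo_hlist_add_head(struct demo_hnode *n, struct demo_hhead *h)
{
	n->next = h->first;
	if (h->first)
		h->first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

/* Unlink a node; no list walk and no head pointer are needed. */
void demo_hlist_del(struct demo_hnode *n)
{
	*n->pprev = n->next;
	if (n->next)
		n->next->pprev = n->pprev;
	n->next = NULL;
	n->pprev = NULL;
}
```

With an open-coded singly linked list, removing an entry requires finding its predecessor first; the `pprev` pointer removes that step.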
2022-12-02 | crypto: api - Add crypto_tfm_ctx_dma | Herbert Xu | 1 | -5/+0
This patch adds the helpers crypto_tfm_ctx_aligned and crypto_tfm_ctx_dma. The first aligns the tfm context to the value cra_alignmask. The second sets the alignment according to dma_get_cache_alignment(). This patch also moves crypto_tfm_ctx into algapi.h. Signed-off-by: Herbert Xu <[email protected]>
2022-12-02 | configfs: remove mentions of committable items | Bartosz Golaszewski | 1 | -3/+0
A proposed implementation of committable items has been rejected because the gpio-sim module would be the only user and configfs does not get much development in general. Given that, let's remove the notion of committable items from docs and headers. Signed-off-by: Bartosz Golaszewski <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-12-02 | module: add module_elf_check_arch for module-specific checks | Nicholas Piggin | 1 | -0/+3
The elf_check_arch() function is also used to test compatibility of usermode binaries. Kernel modules may have more specific requirements, for example powerpc would like to test for ABI version compatibility. Add a weak module_elf_check_arch() that defaults to true, and call it from elf_validity_check(). Signed-off-by: Jessica Yu <[email protected]> [np: added changelog, adjust name, rebase] Acked-by: Luis Chamberlain <[email protected]> Signed-off-by: Nicholas Piggin <[email protected]> Reviewed-by: Joel Stanley <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
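The weak-default pattern described above can be sketched in plain C as follows. This is only an illustration under stated assumptions: the `demo_*` names and the header struct are hypothetical, and `__attribute__((weak))` is a GCC/Clang extension rather than standard C.

```c
#include <stdbool.h>

/* Hypothetical stand-in for an ELF header; not the kernel's Elf_Ehdr. */
struct demo_elf_hdr {
	unsigned int abi_version;
};

/*
 * Weak default: accept every module. An architecture can provide a strong
 * definition with the same name to add stricter checks (for example an ABI
 * version test); the linker then picks the strong one.
 */
__attribute__((weak)) bool demo_module_elf_check_arch(const struct demo_elf_hdr *hdr)
{
	(void)hdr;
	return true;
}

/* Generic validity check that calls the (possibly overridden) hook. */
bool demo_elf_validity_check(const struct demo_elf_hdr *hdr)
{
	if (!hdr)
		return false;
	return demo_module_elf_check_arch(hdr);
}
```

The generic caller never needs to know whether an architecture-specific override exists, which keeps the common code free of `#ifdef`s.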
2022-12-01 | jump_label: Prevent key->enabled int overflow | Dmitry Safonov | 1 | -4/+17
1. With CONFIG_JUMP_LABEL=n static_key_slow_inc() doesn't have any protection against key->enabled refcounter overflow.
2. With CONFIG_JUMP_LABEL=y static_key_slow_inc_cpuslocked() still may turn the refcounter negative as (v + 1) may overflow.
key->enabled is indeed a ref-counter, as documented in multiple places: the top comment in jump_label.h, Documentation/staging/static-keys.rst, etc.
As -1 is reserved for a static key that is in the process of being enabled, functions would break with a negative key->enabled refcount:
- for CONFIG_JUMP_LABEL=n a negative return of static_key_count() breaks static_key_false(), static_key_true()
- the ref counter may become 0 from the negative side after too many static_key_slow_inc() calls and lead to use-after-free issues.
As a result of these flaws, some users have to introduce an additional mutex and prevent the reference counter from overflowing themselves; see bpf_enable_runtime_stats() checking the counter against INT_MAX / 2.
Prevent the reference counter overflow by checking if (v + 1) > 0. Change the functions' API to return whether the increment was successful.
Signed-off-by: Dmitry Safonov <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
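A minimal userspace sketch of an increment that refuses to overflow, using C11 atomics, might look like the following. It is not the kernel implementation: `struct demo_key` and `demo_key_inc()` are hypothetical names, and this sketch simply rejects negative values instead of waiting for an in-progress enable to finish. The check is written as `v == INT_MAX` because signed overflow of `v + 1` is undefined behaviour in portable C.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a key with an 'enabled' reference counter. */
struct demo_key {
	atomic_int enabled;
};

/*
 * Try to take one reference. Returns false, leaving the counter untouched,
 * when the counter is negative or already at INT_MAX.
 */
bool demo_key_inc(struct demo_key *key)
{
	int v = atomic_load(&key->enabled);

	do {
		if (v < 0 || v == INT_MAX)
			return false;
		/* On failure the CAS reloads v with the current value. */
	} while (!atomic_compare_exchange_weak(&key->enabled, &v, v + 1));

	return true;
}
```

Returning a success flag lets callers handle a saturated counter themselves rather than guarding every call site with an extra mutex.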
2022-12-01 | nvdimm/cxl/pmem: Add support for master passphrase disable security command | Dave Jiang | 1 | -0/+2
The original nvdimm_security_ops ->disable() only supports user passphrase for security disable. The CXL spec introduced the disabling of master passphrase. Add a ->disable_master() callback to support this new operation and leave the old ->disable() mechanism alone. A "disable_master" command is added for the sysfs attribute in order to allow the command to be issued from userspace. ndctl will need enabling in order to utilize this new operation. Reviewed-by: Jonathan Cameron <[email protected]> Signed-off-by: Dave Jiang <[email protected]> Link: https://lore.kernel.org/r/166983616454.2734609.14204031148234398086.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Dan Williams <[email protected]>
2022-12-01 | blk-crypto: Add support for SM4-XTS blk crypto mode | Tianjia Zhang | 1 | -0/+1
SM4 is a symmetric cipher algorithm widely used in China. The SM4-XTS variant is used to encrypt length-preserving data. This is the mandatory algorithm in some special scenarios. Add support for the algorithm to block inline encryption. This is needed for the inlinecrypt mount option to be supported via blk-crypto-fallback, as it is for the other fscrypt modes. Signed-off-by: Tianjia Zhang <[email protected]> Signed-off-by: Eric Biggers <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-12-01 | firmware: arm_ffa: Move comment before the field it is documenting | Will Deacon | 1 | -1/+1
This is consistent with the other comments in the struct. Co-developed-by: Andrew Walbran <[email protected]> Signed-off-by: Andrew Walbran <[email protected]> Signed-off-by: Quentin Perret <[email protected]> Reviewed-by: Sudeep Holla <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]>
2022-12-01 | firmware: arm_ffa: Move constants to header file | Will Deacon | 1 | -0/+83
FF-A function IDs and error codes will be needed in the hypervisor too, so move them to the header file where they can be shared. Rename the version constants with an "FFA_" prefix so that they are less likely to clash with other code in the tree. Co-developed-by: Andrew Walbran <[email protected]> Signed-off-by: Andrew Walbran <[email protected]> Signed-off-by: Quentin Perret <[email protected]> Reviewed-by: Sudeep Holla <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]>
2022-12-01 | ext4: journal_path mount options should follow links | Lukas Czerner | 1 | -0/+1
Before the commit 461c3af045d3 ("ext4: Change handle_mount_opt() to use fs_parameter") ext4 mount option journal_path did follow links in the provided path. Bring this behavior back by allowing to pass pathwalk flags to fs_lookup_param(). Fixes: 461c3af045d3 ("ext4: Change handle_mount_opt() to use fs_parameter") Signed-off-by: Lukas Czerner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]> Cc: [email protected]
2022-12-01 | efi: libstub: Always enable initrd command line loader and bump version | Ard Biesheuvel | 1 | -1/+1
In preparation for setting a cross-architecture baseline for EFI boot support, remove the Kconfig option that permits the command line initrd loader to be disabled. Also, bump the minor version so that any image built with the new version can be identified as supporting this. Acked-by: Leif Lindholm <[email protected]> Reviewed-by: Daniel Kiper <[email protected]> Signed-off-by: Ard Biesheuvel <[email protected]>
2022-12-01 | wifi: ieee80211: Do not open-code qos address offsets | Kees Cook | 1 | -6/+22
When building with -Wstringop-overflow, GCC's KASAN implementation does not correctly perform bounds checking within some complex structures when faced with literal offsets, and can get very confused. For example, this warning is seen due to literal offsets into struct ieee80211_hdr that may or may not be large enough:
drivers/net/wireless/intel/iwlwifi/mvm/rxmq.c: In function 'iwl_mvm_rx_mpdu_mq':
drivers/net/wireless/intel/iwlwifi/mvm/rxmq.c:2022:29: warning: writing 1 byte into a region of size 0 [-Wstringop-overflow=]
 2022 | *qc &= ~IEEE80211_QOS_CTL_A_MSDU_PRESENT;
In file included from drivers/net/wireless/intel/iwlwifi/mvm/fw-api.h:32,
 from drivers/net/wireless/intel/iwlwifi/mvm/sta.h:15,
 from drivers/net/wireless/intel/iwlwifi/mvm/mvm.h:27,
 from drivers/net/wireless/intel/iwlwifi/mvm/rxmq.c:10:
drivers/net/wireless/intel/iwlwifi/mvm/../fw/api/rx.h:559:16: note: at offset [78, 166] into destination object 'mpdu_len' of size 2
 559 | __le16 mpdu_len;
 | ^~~~~~~~
Refactor ieee80211_get_qos_ctl() to avoid using literal offsets, requiring the creation of the actual structure that is described in the comments. Explicitly choose the desired offset, making the code more human-readable too. This is one of the last remaining warnings to fix before enabling -Wstringop-overflow globally.
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97490 Link: https://github.com/KSPP/linux/issues/181 Cc: Johannes Berg <[email protected]> Cc: Kalle Valo <[email protected]> Cc: Gregory Greenman <[email protected]> Cc: "Gustavo A. R. Silva" <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johannes Berg <[email protected]>
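The general idea of replacing a literal byte offset with a named structure member can be sketched as below. The struct layouts and `demo_*` names are hypothetical and simplified, not the kernel's definitions, and this sketch uses `offsetof()` on a byte buffer rather than casting to a struct pointer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 3-address QoS header layout, for illustration only. */
struct demo_qos_hdr_3addr {
	uint16_t frame_control;
	uint16_t duration_id;
	uint8_t addr1[6];
	uint8_t addr2[6];
	uint8_t addr3[6];
	uint16_t seq_ctrl;
	uint16_t qos_ctrl;
};

/* Hypothetical 4-address variant: addr4 precedes the QoS control field. */
struct demo_qos_hdr_4addr {
	uint16_t frame_control;
	uint16_t duration_id;
	uint8_t addr1[6];
	uint8_t addr2[6];
	uint8_t addr3[6];
	uint16_t seq_ctrl;
	uint8_t addr4[6];
	uint16_t qos_ctrl;
};

/* The named members document where the former magic numbers came from. */
_Static_assert(offsetof(struct demo_qos_hdr_3addr, qos_ctrl) == 24, "3-address offset");
_Static_assert(offsetof(struct demo_qos_hdr_4addr, qos_ctrl) == 30, "4-address offset");

/* Return a pointer to the QoS control field inside a raw frame buffer. */
uint8_t *demo_get_qos_ctl(uint8_t *frame, bool has_addr4)
{
	if (has_addr4)
		return frame + offsetof(struct demo_qos_hdr_4addr, qos_ctrl);
	return frame + offsetof(struct demo_qos_hdr_3addr, qos_ctrl);
}
```

Because the offsets are derived from a declared layout, both a human reader and the compiler can see which object the resulting pointer is meant to land in.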
2022-12-01 | vdso/timens: Refactor copy-pasted find_timens_vvar_page() helper into one copy | Jann Horn | 1 | -0/+6
find_timens_vvar_page() is not architecture-specific, as can be seen from how all five per-architecture versions of it are the same. (arm64, powerpc and riscv are exactly the same; x86 and s390 have two characters difference inside a comment, fewer blank lines, and mark the !CONFIG_TIME_NS version as inline.) Refactor the five copies into a central copy in kernel/time/namespace.c. Signed-off-by: Jann Horn <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-11-30 | Merge tag 'mlx5-updates-2022-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux | Jakub Kicinski | 2 | -6/+3
Saeed Mahameed says:
====================
mlx5-updates-2022-11-29
Misc update for mlx5 driver
1) Various trivial cleanups
2) Maor Dickman, Adds support for trap offload with additional actions
3) From Tariq, UMR (device memory registrations) cleanups, UMR WQE must be aligned to 64B per device spec, (not a bug fix).
* tag 'mlx5-updates-2022-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
 net/mlx5e: Support devlink reload of IPsec core
 net/mlx5e: TC, Add offload support for trap with additional actions
 net/mlx5e: Do early return when setup vports dests for slow path flow
 net/mlx5: Remove redundant check
 net/mlx5e: Delete always true DMA check
 net/mlx5e: Don't access directly DMA device pointer
 net/mlx5e: Don't use termination table when redundant
 net/mlx5: Fix orthography errors in documentation
 net/mlx5: Use generic definition for UMR KLM alignment
 net/mlx5: Generalize name of UMR alignment definition
 net/mlx5: Remove unused UMR MTT definitions
 net/mlx5e: Add padding when needed in UMR WQEs
 net/mlx5: Remove unused ctx variables
 net/mlx5e: Replace zero-length arrays with DECLARE_FLEX_ARRAY() helper
 net/mlx5e: Remove unneeded io-mapping.h #include
====================
Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2022-11-30 | net: phy: Add link between phy dev and mac dev | Xiaolei Wang | 1 | -0/+4
If the external phy used by the current mac interface is managed by another mac interface, this network port cannot work independently. In particular, when the system suspends and resumes, the following trace may appear, so we should create a device link between phy dev and mac dev.
WARNING: CPU: 0 PID: 24 at drivers/net/phy/phy.c:983 phy_error+0x20/0x68
Modules linked in:
CPU: 0 PID: 24 Comm: kworker/0:2 Not tainted 6.1.0-rc3-00011-g5aaef24b5c6d-dirty #34
Hardware name: Freescale i.MX6 SoloX (Device Tree)
Workqueue: events_power_efficient phy_state_machine
unwind_backtrace from show_stack+0x10/0x14
show_stack from dump_stack_lvl+0x68/0x90
dump_stack_lvl from __warn+0xb4/0x24c
__warn from warn_slowpath_fmt+0x5c/0xd8
warn_slowpath_fmt from phy_error+0x20/0x68
phy_error from phy_state_machine+0x22c/0x23c
phy_state_machine from process_one_work+0x288/0x744
process_one_work from worker_thread+0x3c/0x500
worker_thread from kthread+0xf0/0x114
kthread from ret_from_fork+0x14/0x28
Exception stack(0xf0951fb0 to 0xf0951ff8)
Signed-off-by: Xiaolei Wang <[email protected]> Tested-by: Florian Fainelli <[email protected]> Reviewed-by: Florian Fainelli <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2022-11-30 | bpf: Fix a compilation failure with clang lto build | Yonghong Song | 1 | -0/+1
When building the kernel with clang lto (CONFIG_LTO_CLANG_FULL=y), the following compilation error will appear:
$ make LLVM=1 LLVM_IAS=1 -j
...
ld.lld: error: ld-temp.o <inline asm>:26889:1: symbol 'cgroup_storage_map_btf_ids' is already defined
cgroup_storage_map_btf_ids:;
^
make[1]: *** [/.../bpf-next/scripts/Makefile.vmlinux_o:61: vmlinux.o] Error 1
In local_storage.c, we have
BTF_ID_LIST_SINGLE(cgroup_storage_map_btf_ids, struct, bpf_local_storage_map)
Commit c4bcfb38a95e ("bpf: Implement cgroup storage available to non-cgroup-attached bpf progs") added the above identical BTF_ID_LIST_SINGLE definition in bpf_cgrp_storage.c. With duplicated definitions, the llvm linker complains with an lto build. Also, extracting the btf_id of 'struct bpf_local_storage_map' is defined four times for sk, inode, task and cgrp local storages. Let us define a single global one with a different name than cgroup_storage_map_btf_ids, which also fixes the lto compilation error.
Fixes: c4bcfb38a95e ("bpf: Implement cgroup storage available to non-cgroup-attached bpf progs") Signed-off-by: Yonghong Song <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2022-11-30 | iommufd: Add kernel support for testing iommufd | Jason Gunthorpe | 1 | -0/+3
Provide a mock kernel module for the iommu_domain that allows it to run without any HW and the mocking provides a way to directly validate that the PFNs loaded into the iommu_domain are correct. This exposes the access kAPI toward userspace to allow userspace to explore the functionality of pages.c and io_pagetable.c The mock also simulates the rare case of PAGE_SIZE > iommu page size as the mock will operate at a 2K iommu page size. This allows exercising all of the calculations to support this mismatch. This is also intended to support syzkaller exploring the same space. However, it is an unusually invasive config option to enable all of this. The config option should not be enabled in a production kernel. Link: https://lore.kernel.org/r/[email protected] Tested-by: Matthew Rosato <[email protected]> # s390 Tested-by: Eric Auger <[email protected]> # aarch64 Signed-off-by: Jason Gunthorpe <[email protected]>
2022-11-30 | iommufd: vfio container FD ioctl compatibility | Jason Gunthorpe | 1 | -0/+7
iommufd can directly implement the /dev/vfio/vfio container IOCTLs by mapping them into io_pagetable operations. A userspace application can test against iommufd and confirm compatibility then simply make a small change to open /dev/iommu instead of /dev/vfio/vfio. For testing purposes /dev/vfio/vfio can be symlinked to /dev/iommu and then all applications will use the compatibility path with no code changes. A later series allows /dev/vfio/vfio to be directly provided by iommufd, which allows the rlimit mode to work the same as well.
This series just provides the iommufd side of compatibility. Actually linking this to VFIO_SET_CONTAINER is a followup series, with a link in the cover letter.
Internally the compatibility API uses a normal IOAS object that, like vfio, is automatically allocated when the first device is attached.
Userspace can also query or set this IOAS object directly using the IOMMU_VFIO_IOAS ioctl. This allows mixing and matching new iommufd only features while still using the VFIO style map/unmap ioctls.
While this is enough to operate qemu, it has a few differences:
- Resource limits rely on memory cgroups to bound what userspace can do instead of the module parameter dma_entry_limit.
- VFIO P2P is not implemented. The DMABUF patches for vfio are a start at a solution where iommufd would import a special DMABUF. This is to avoid further propagating the follow_pfn() security problem.
- A full audit for pedantic compatibility details (eg errnos, etc) has not yet been done
- powerpc SPAPR is left out, as it is not connected to the iommu_domain framework. It seems interest in SPAPR is minimal as it is currently non-working in v6.1-rc1. They will have to convert to the iommu subsystem framework to enjoy iommufd.
The following are not going to be implemented and we expect to remove them from VFIO type1:
- SW access 'dirty tracking'. As discussed in the cover letter this will be done in VFIO.
- VFIO_TYPE1_NESTING_IOMMU https://lore.kernel.org/all/[email protected]/
- VFIO_DMA_MAP_FLAG_VADDR https://lore.kernel.org/all/[email protected]/
Link: https://lore.kernel.org/r/[email protected] Tested-by: Nicolin Chen <[email protected]> Tested-by: Yi Liu <[email protected]> Tested-by: Lixiao Yang <[email protected]> Tested-by: Matthew Rosato <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Reviewed-by: Eric Auger <[email protected]> Signed-off-by: Nicolin Chen <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2022-11-30 | iommufd: Add kAPI toward external drivers for kernel access | Jason Gunthorpe | 1 | -1/+42
Kernel access is the mode that VFIO "mdevs" use. In this case there is no struct device and no IOMMU connection. iommufd acts as a record keeper for accesses and returns the actual struct pages back to the caller to use however they need. eg with kmap or the DMA API. Each caller must create a struct iommufd_access with iommufd_access_create(), similar to how iommufd_device_bind() works. Using this struct the caller can access blocks of IOVA using iommufd_access_pin_pages() or iommufd_access_rw(). Callers must provide a callback that immediately unpins any IOVA being used within a range. This happens if userspace unmaps the IOVA under the pin. The implementation forwards the access requests directly to the iopt infrastructure that manages the iopt_pages_access. Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Kevin Tian <[email protected]> Tested-by: Nicolin Chen <[email protected]> Tested-by: Yi Liu <[email protected]> Tested-by: Lixiao Yang <[email protected]> Tested-by: Matthew Rosato <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2022-11-30 | iommufd: Add kAPI toward external drivers for physical devices | Jason Gunthorpe | 1 | -0/+9
Add the four functions external drivers need to connect physical DMA to the IOMMUFD:
iommufd_device_bind() / iommufd_device_unbind(): Register the device with iommufd and establish security isolation.
iommufd_device_attach() / iommufd_device_detach(): Connect a bound device to a page table.
Binding a device creates a device object ID in the uAPI, however the generic API does not yet provide any IOCTLs to manipulate them.
Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Kevin Tian <[email protected]> Tested-by: Nicolin Chen <[email protected]> Tested-by: Yi Liu <[email protected]> Tested-by: Lixiao Yang <[email protected]> Tested-by: Matthew Rosato <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2022-11-30 | iommufd: PFN handling for iopt_pages | Jason Gunthorpe | 1 | -0/+7
The top of the data structure provides an IO Address Space (IOAS) that is similar to a VFIO container. The IOAS allows map/unmap of memory into ranges of IOVA called iopt_areas. Multiple IOMMU domains (IO page tables) and in-kernel accesses (like VFIO mdevs) can be attached to the IOAS to access the PFNs that those IOVA areas cover.
The IO Address Space (IOAS) datastructure is composed of:
- struct io_pagetable holding the IOVA map
- struct iopt_areas representing populated portions of IOVA
- struct iopt_pages representing the storage of PFNs
- struct iommu_domain representing each IO page table in the system IOMMU
- struct iopt_pages_access representing in-kernel accesses of PFNs (ie VFIO mdevs)
- struct xarray pinned_pfns holding a list of pages pinned by in-kernel accesses
This patch introduces the lowest part of the datastructure - the movement of PFNs in a tiered storage scheme:
1) iopt_pages::pinned_pfns xarray
2) Multiple iommu_domains
3) The origin of the PFNs, i.e. the userspace pointer
PFNs have to be copied between all combinations of tiers, depending on the configuration.
The interface is an iterator called a 'pfn_reader' which determines which tier each PFN is stored in and loads it into a list of PFNs held in a struct pfn_batch. Each step of the iterator will fill up the pfn_batch, then the caller can use the pfn_batch to send the PFNs to the required destination. Repeating this loop will read all the PFNs in an IOVA range.
The pfn_reader and pfn_batch also keep track of the pinned page accounting.
While PFNs are always stored and accessed as full PAGE_SIZE units the iommu_domain tier can store with a sub-page offset/length to support IOMMUs with a smaller IOPTE size than PAGE_SIZE.
Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Kevin Tian <[email protected]> Tested-by: Nicolin Chen <[email protected]> Tested-by: Yi Liu <[email protected]> Tested-by: Lixiao Yang <[email protected]> Tested-by: Matthew Rosato <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
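The reader/batch loop described above can be pictured with a small userspace sketch. It only models the control flow (fill a fixed-size batch, hand it to the caller, repeat); the `demo_*` names, the batch size, and the flat source array standing in for the storage tiers are all assumptions, and pin accounting is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_BATCH_MAX 4

/* Hypothetical fixed-size batch of page frame numbers. */
struct demo_pfn_batch {
	unsigned long pfns[DEMO_BATCH_MAX];
	size_t count;
};

/* Hypothetical reader; 'src' stands in for whichever tier holds each PFN. */
struct demo_pfn_reader {
	const unsigned long *src;
	size_t pos;
	size_t end;
	struct demo_pfn_batch batch;
};

void demo_reader_init(struct demo_pfn_reader *r, const unsigned long *src, size_t n)
{
	r->src = src;
	r->pos = 0;
	r->end = n;
	r->batch.count = 0;
}

/* Refill the batch; returns false once the whole range has been consumed. */
bool demo_reader_next(struct demo_pfn_reader *r)
{
	r->batch.count = 0;
	while (r->pos < r->end && r->batch.count < DEMO_BATCH_MAX)
		r->batch.pfns[r->batch.count++] = r->src[r->pos++];
	return r->batch.count > 0;
}
```

A caller would loop with `while (demo_reader_next(&r))` and forward `r.batch` to the destination on each step, so memory use stays bounded by the batch size regardless of the range length.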
2022-11-30 | kernel/user: Allow user_struct::locked_vm to be usable for iommufd | Jason Gunthorpe | 1 | -1/+1
Following the pattern of io_uring, perf, skb, and bpf, iommufd will use user->locked_vm for accounting pinned pages. Ensure the value is included in the struct and export free_uid() as iommufd is modular. user->locked_vm is the right accounting to use for ulimit because it is per-user, and the security sandboxing of locked pages is not supposed to be per-process. Other places (vfio, vdpa and infiniband) have used mm->pinned_vm and/or mm->locked_vm for accounting pinned pages, but this is only per-process and inconsistent with the new FOLL_LONGTERM users in the kernel. Concurrent work is underway to try to put this in a cgroup, so everything can be consistent and the kernel can provide a FOLL_LONGTERM limit that actually provides security. Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Kevin Tian <[email protected]> Reviewed-by: Eric Auger <[email protected]> Tested-by: Nicolin Chen <[email protected]> Tested-by: Yi Liu <[email protected]> Tested-by: Lixiao Yang <[email protected]> Tested-by: Matthew Rosato <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2022-11-30 | iommufd: File descriptor, context, kconfig and makefiles | Jason Gunthorpe | 1 | -0/+31
This is the basic infrastructure of a new miscdevice to hold the iommufd IOCTL API. It provides:
- A miscdevice to create file descriptors to run the IOCTL interface over
- A table based ioctl dispatch and centralized extendable pre-validation step
- An xarray mapping userspace IDs to kernel objects. The design has multiple inter-related objects held within a single IOMMUFD fd
- A simple usage count to build a graph of object relations and protect against hostile userspace racing ioctls
The only IOCTL provided in this patch is the generic 'destroy any object by handle' operation.
Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Lu Baolu <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Reviewed-by: Eric Auger <[email protected]> Tested-by: Nicolin Chen <[email protected]> Tested-by: Yi Liu <[email protected]> Tested-by: Lixiao Yang <[email protected]> Tested-by: Matthew Rosato <[email protected]> Signed-off-by: Yi Liu <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
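A table-based dispatcher with a central pre-validation step might be sketched like this in userspace C. Everything named `demo_*` or `DEMO_*` is hypothetical, the command number is made up, and the only validation shown is a minimum argument size; it is not the iommufd code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical argument layout for a "destroy object by handle" command. */
struct demo_destroy_arg {
	unsigned int size;
	unsigned int id;
};

typedef int (*demo_ioctl_fn)(void *arg);

/* One table row per command: number, minimum argument size, handler. */
struct demo_ioctl_op {
	unsigned int cmd;
	size_t min_size;
	demo_ioctl_fn execute;
};

#define DEMO_CMD_DESTROY 0x80u

/* Hypothetical handler: pretend that ID 0 never names a live object. */
int demo_destroy(void *arg)
{
	const struct demo_destroy_arg *cmd = arg;

	return cmd->id ? 0 : -ENOENT;
}

static const struct demo_ioctl_op demo_ops[] = {
	{ DEMO_CMD_DESTROY, sizeof(struct demo_destroy_arg), demo_destroy },
};

/* Look the command up, validate once in a central place, then run it. */
int demo_dispatch(unsigned int cmd, void *arg, size_t user_size)
{
	for (size_t i = 0; i < sizeof(demo_ops) / sizeof(demo_ops[0]); i++) {
		const struct demo_ioctl_op *op = &demo_ops[i];

		if (op->cmd != cmd)
			continue;
		if (user_size < op->min_size)
			return -EINVAL;
		return op->execute(arg);
	}
	return -ENOTTY;
}
```

Adding a command then means adding one table row, and every handler benefits from the same checks without repeating them.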
2022-11-30 | linux/init.h: include <linux/build_bug.h> and <linux/stringify.h> | Masahiro Yamada | 1 | -0/+2
With CONFIG_HAVE_ARCH_PREL32_RELOCATIONS=y, the following code fails to build:
---------------->8----------------
#include <linux/init.h>
int foo(void) { return 0; }
core_initcall(foo);
---------------->8----------------
Include <linux/build_bug.h> for static_assert() and <linux/stringify.h> for __stringify().
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Masahiro Yamada <[email protected]> Cc: Jiangshan Yi <[email protected]> Cc: Kees Cook <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Randy Dunlap <[email protected]> # build-tested Cc: Sami Tolvanen <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | debugfs: fix error when writing negative value to atomic_t debugfs file | Akinobu Mita | 1 | -2/+17
The simple attribute files do not accept a negative value since the commit 488dac0c9237 ("libfs: fix error cast of negative value in simple_attr_write()"), so we have to use a 64-bit value to write a negative value for a debugfs file created by debugfs_create_atomic_t(). This restores the previous behaviour by introducing DEFINE_DEBUGFS_ATTRIBUTE_SIGNED for a signed value. Link: https://lkml.kernel.org/r/[email protected] Fixes: 488dac0c9237 ("libfs: fix error cast of negative value in simple_attr_write()") Signed-off-by: Akinobu Mita <[email protected]> Reported-by: Zhao Gongyi <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Wei Yongjun <[email protected]> Cc: Yicong Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | libfs: add DEFINE_SIMPLE_ATTRIBUTE_SIGNED for signed value | Akinobu Mita | 1 | -2/+10
Patch series "fix error when writing negative value to simple attribute files". The simple attribute files do not accept a negative value since the commit 488dac0c9237 ("libfs: fix error cast of negative value in simple_attr_write()"), but some attribute files want to accept a negative value. This patch (of 3): The simple attribute files do not accept a negative value since the commit 488dac0c9237 ("libfs: fix error cast of negative value in simple_attr_write()"), so we have to use a 64-bit value to write a negative value. This adds DEFINE_SIMPLE_ATTRIBUTE_SIGNED for a signed value. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 488dac0c9237 ("libfs: fix error cast of negative value in simple_attr_write()") Signed-off-by: Akinobu Mita <[email protected]> Reported-by: Zhao Gongyi <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Wei Yongjun <[email protected]> Cc: Yicong Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
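The distinction between an unsigned and a signed attribute parser can be sketched in userspace C as follows. `demo_attr_parse()` is a hypothetical helper, not the libfs implementation; it uses the standard `strtoll()`/`strtoull()` functions and rejects a leading minus sign by hand for the unsigned case, since `strtoull()` on its own would accept it. Leading whitespace is not handled.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Parse an attribute write. The value always travels as a 64-bit unsigned
 * quantity; for a signed attribute the consumer is expected to convert it
 * back to a signed type.
 */
int demo_attr_parse(const char *buf, bool is_signed, unsigned long long *val)
{
	char *end;

	/* Unsigned attributes reject a leading minus sign outright. */
	if (!is_signed && buf[0] == '-')
		return -EINVAL;

	errno = 0;
	if (is_signed)
		*val = (unsigned long long)strtoll(buf, &end, 0);
	else
		*val = strtoull(buf, &end, 0);

	/* Require at least one digit and nothing but an optional newline after. */
	if (errno || end == buf || (*end != '\0' && *end != '\n'))
		return -EINVAL;
	return 0;
}
```

Keeping both variants behind one flag mirrors the idea of offering a signed flavour of the attribute helper alongside the existing unsigned one.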
2022-12-01 | bpf, sockmap: Fix missing BPF_F_INGRESS flag when using apply_bytes | Pengcheng Yang | 1 | -0/+1
When redirecting, we use sk_msg_to_ingress() to get the BPF_F_INGRESS flag from the msg->flags. If apply_bytes is used and it is larger than the current data being processed, sk_psock_msg_verdict() will not be called when sendmsg() is called again. At this time, the msg->flags is 0, and we lose the BPF_F_INGRESS flag. So we need to save the BPF_F_INGRESS flag in sk_psock and use it when redirecting. Fixes: 8934ce2fd081 ("bpf: sockmap redirect ingress support") Signed-off-by: Pengcheng Yang <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Jakub Sitnicki <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2022-11-30 | s390/mm: use pmd_pgtable_page() helper in __gmap_segment_gaddr() | Anshuman Khandual | 1 | -1/+1
In __gmap_segment_gaddr() pmd level page table page is being extracted from the pmd pointer, similar to pmd_pgtable_page() implementation. This reduces some redundancy by directly using pmd_pgtable_page() instead, though first making it available. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Acked-by: Alexander Gordeev <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Heiko Carstens <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm/thp: rename pmd_to_page() as pmd_pgtable_page() | Anshuman Khandual | 1 | -3/+3
The current pmd_to_page(), which derives the page table page containing the pmd address, has a very misleading name. The problem is that it sounds similar to pmd_page(), which derives the page embedded in a given pmd entry, either for a next level page or a mapped huge page. Rename it as pmd_pgtable_page() instead. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: add bdi_set_min_ratio_no_scale() function | Stefan Roesch | 1 | -0/+1
This introduces bdi_set_min_ratio_no_scale(). It uses the max granularity for the ratio. This function is used by the new sysfs knob min_ratio_fine. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Chris Mason <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: add bdi_set_max_ratio_no_scale() function | Stefan Roesch | 1 | -0/+1
This introduces bdi_set_max_ratio_no_scale(). It uses the max granularity for the ratio. This function is used by the new sysfs knob max_ratio_fine. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Chris Mason <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: add bdi_set_min_bytes() function | Stefan Roesch | 1 | -0/+1
This introduces the bdi_set_min_bytes() function. The min_bytes function does not store the min_bytes value. Instead it converts the min_bytes value into the corresponding ratio value. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Chris Mason <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: add bdi_get_min_bytes() function | Stefan Roesch | 1 | -0/+1
This adds a function to return the specified value for min_bytes. It converts the stored min_ratio of the bdi to the corresponding bytes value. This is an approximation as it is based on the value that is returned by global_dirty_limits(), which can change. The returned value can be different than the value when the min_bytes value was set. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Chris Mason <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: add bdi_set_max_bytes() function | Stefan Roesch | 1 | -0/+1
This introduces the bdi_set_max_bytes() function. The max_bytes function does not store the max_bytes value. Instead it converts the max_bytes value into the corresponding ratio value. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Chris Mason <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30 | mm: add bdi_get_max_bytes() function | Stefan Roesch | 1 | -0/+1
This adds a function to return the specified value for max_bytes. It converts the stored max_ratio of the bdi to the corresponding bytes value. It introduces the bdi_get_bytes helper function to do the conversion. This is an approximation as it is based on the value that is returned by global_dirty_limits(), which can change. The helper function will also be used by the min_bytes bdi knob. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Chris Mason <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm: use part per 1000000 for bdi ratiosStefan Roesch1-0/+3
To get finer granularity for ratio calculations use part per million instead of percentages. This is especially important if we want to automatically convert byte values to ratios. Otherwise the values that are actually used can be quite different. This is also important for machines with more main memory (1% of 256GB is already 2.5GB). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Chris Mason <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
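The granularity argument above can be illustrated with a small userspace sketch. It is not the kernel implementation: the helper names, the scale constants and the fixed 256 GiB limit are assumptions chosen for illustration, whereas the kernel derives the limit from global_dirty_limits(), which can change over time.

```c
#include <stdint.h>

/* Hypothetical scales: parts per million versus whole percent. */
#define DEMO_SCALE_PPM 1000000ULL
#define DEMO_SCALE_PCT 100ULL

/*
 * Convert a byte value into a ratio of limit_bytes at the given scale.
 * Integer division truncates, which is where granularity is lost.
 * Note: bytes * scale must fit in 64 bits; this holds for the sizes used here.
 */
static uint64_t demo_bytes_to_ratio(uint64_t bytes, uint64_t limit_bytes,
                                    uint64_t scale)
{
    return bytes * scale / limit_bytes;
}

/* Convert a stored ratio back into bytes; the result is an approximation. */
static uint64_t demo_ratio_to_bytes(uint64_t ratio, uint64_t limit_bytes,
                                    uint64_t scale)
{
    return limit_bytes * ratio / scale;
}
```

With a 256 GiB limit, a request for 100 MiB truncates to a ratio of 0 at percent scale, while at parts-per-million scale it is stored as 381 and converts back to a value within one ppm step of the request.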
2022-11-30mm: add bdi_set_strict_limit() functionStefan Roesch1-0/+1
Patch series "mm/block: add bdi sysfs knobs", v4. At Meta network block devices (nbd) are used to implement remote block storage. In testing and during production it has been observed that these network block devices can consume a huge portion of the dirty writeback cache and writeback can take a considerable time. To be able to give stricter limits, I'm proposing the following changes: 1) introduce strictlimit knob Currently the max_ratio knob exists to limit the dirty_memory. However this knob only applies once (dirty_ratio + dirty_background_ratio) / 2 has been reached. With the BDI_CAP_STRICTLIMIT flag, the max_ratio can be applied without reaching that limit. This change exposes that knob. This knob can also be useful for NFS, fuse filesystems and USB devices. 2) Use part of 1000000 internal calculation The max_ratio is based on percentage. With the current machine sizes percentage values can be very high (1% of a 256GB main memory is already 2.5GB). This change uses part of 1000000 instead of percentages for the internal calculations. 3) Introduce two new sysfs knobs: min_bytes and max_bytes. Currently all calculations are based on ratio, but for a user it is often more convenient to specify a limit in bytes. The new knobs will not store bytes values, instead they will translate the byte value to a corresponding ratio. As the internal values are now part of 1000000, the ratio is closer to the specified value. However the value should be seen more as an approximation as it can fluctuate over time. 4) Introduce two new sysfs knobs: min_ratio_fine and max_ratio_fine. The granularity for the existing sysfs bdi knobs min_ratio and max_ratio is based on percentage values. The new sysfs bdi knobs min_ratio_fine and max_ratio_fine allow specifying the ratio as part of 1 million. This patch (of 20): This adds the bdi_set_strict_limit function to be able to set/unset the BDI_CAP_STRICTLIMIT flag. 
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Roesch <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Chris Mason <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30folio-compat: remove try_to_release_page()Vishal Moola (Oracle)1-1/+0
There are no more callers of try_to_release_page(), so remove it. This saves 85 bytes of kernel text. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vishal Moola (Oracle) <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Theodore Ts'o <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/gup: reliable R/O long-term pinning in COW mappingsDavid Hildenbrand1-3/+24
We already support reliable R/O pinning of anonymous memory. However, assume we end up pinning (R/O long-term) a pagecache page or the shared zeropage inside a writable private ("COW") mapping. The next write access will trigger a write-fault and replace the pinned page by an exclusive anonymous page in the process page tables to break COW: the pinned page no longer corresponds to the page mapped into the process' page table. Now that FAULT_FLAG_UNSHARE can break COW on anything mapped into a COW mapping, let's properly break COW first before R/O long-term pinning something that's not an exclusive anon page inside a COW mapping. FAULT_FLAG_UNSHARE will break COW and map an exclusive anon page instead that can get pinned safely. With this change, we can stop using FOLL_FORCE|FOLL_WRITE for reliable R/O long-term pinning in COW mappings. With this change, the new R/O long-term pinning tests for non-anonymous memory succeed: # [RUN] R/O longterm GUP pin ... with shared zeropage ok 151 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with memfd ok 152 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with tmpfile ok 153 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with huge zeropage ok 154 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with memfd hugetlb (2048 kB) ok 155 Longterm R/O pin is reliable # [RUN] R/O longterm GUP pin ... with memfd hugetlb (1048576 kB) ok 156 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with shared zeropage ok 157 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with memfd ok 158 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with tmpfile ok 159 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with huge zeropage ok 160 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (2048 kB) ok 161 Longterm R/O pin is reliable # [RUN] R/O longterm GUP-fast pin ... 
with memfd hugetlb (1048576 kB) ok 162 Longterm R/O pin is reliable Note 1: We don't care about short-term R/O-pinning, because they have snapshot semantics: they are not supposed to observe modifications that happen after pinning. As one example, assume we start direct I/O to read from a page and store page content into a file: modifications to page content after starting direct I/O are not guaranteed to end up in the file. So even if we'd pin the shared zeropage, the end result would be as expected -- getting zeroes stored to the file. Note 2: For shared mappings we'll now always fall back to the slow path to look up the VMA when R/O long-term pinning. While that's the necessary price we have to pay right now, it's actually not that bad in practice: most FOLL_LONGTERM users already specify FOLL_WRITE, for example, along with FOLL_FORCE because they tried dealing with COW mappings correctly ... Note 3: For users that use FOLL_LONGTERM right now without FOLL_WRITE, such as VFIO, we'd now no longer pin the shared zeropage. Instead, we'd populate exclusive anon pages that we can pin. There was a concern that this could affect the memlock limit of existing setups. For example, a VM running with VFIO could run into the memlock limit and fail to run. However, we essentially had the same behavior already in commit 17839856fd58 ("gup: document and work around "COW can break either way" issue") which got merged into some enterprise distros, and there were not any such complaints. So most probably, we're fine. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Daniel Vetter <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Reviewed-by: John Hubbard <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm: extend FAULT_FLAG_UNSHARE support to anything in a COW mappingDavid Hildenbrand1-4/+4
Extend FAULT_FLAG_UNSHARE to break COW on anything mapped into a COW (i.e., private writable) mapping and adjust the documentation accordingly. FAULT_FLAG_UNSHARE will now also break COW when encountering the shared zeropage, a pagecache page, a PFNMAP, ... inside a COW mapping, by properly replacing the mapped page/pfn by a private copy (an exclusive anonymous page). Note that only do_wp_page() needs care: hugetlb_wp() already handles FAULT_FLAG_UNSHARE correctly. wp_huge_pmd()/wp_huge_pud() also handles it correctly, for example, splitting the huge zeropage on FAULT_FLAG_UNSHARE such that we can handle FAULT_FLAG_UNSHARE on the PTE level. This change is a requirement for reliable long-term R/O pinning in COW mappings. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm: anonymous shared memory namingPasha Tatashin2-14/+14
Since commit 9a10064f5625 ("mm: add a field to store names for private anonymous memory"), a name can be set for private anonymous memory, but not for shared anonymous memory. However, naming shared anonymous memory is just as useful for tracking purposes. Extend the functionality to be able to set names for shared anon. There are two ways to create anonymous shared memory, using memfd or directly via mmap(): 1. fd = memfd_create(...) mem = mmap(..., MAP_SHARED, fd, ...) 2. mem = mmap(..., MAP_SHARED | MAP_ANONYMOUS, -1, ...) In both cases the anonymous shared memory is created the same way by mapping an unlinked file on tmpfs. The memfd way allows giving a name to anonymous shared memory, but it is not useful when parts of the shared memory need distinct names. Example use case: The VMM maps VM memory as anonymous shared memory (not private because VMM is sandboxed and drivers are running in their own processes). However, the VM tells the VMM how parts of the memory are actually used by the guest, how each of the segments should be backed (i.e. 4K pages, 2M pages), and some other information about the segments. The naming allows us to monitor the effective memory footprint for each of these segments from the host without looking inside the guest. 
Sample output: /* Create shared anonymous segment */ anon_shmem = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0); /* Name the segment: "MY-NAME" */ rv = prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, anon_shmem, SIZE, "MY-NAME"); cat /proc/<pid>/maps (and smaps): 7fc8e2b4c000-7fc8f2b4c000 rw-s 00000000 00:01 1024 [anon_shmem:MY-NAME] If the segment is not named, the output is: 7fc8e2b4c000-7fc8f2b4c000 rw-s 00000000 00:01 1024 /dev/zero (deleted) Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Pasha Tatashin <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Bagas Sanjaya <[email protected]> Cc: Colin Cross <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Paul Gortmaker <[email protected]> Cc: Peter Xu <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Vincent Whitchurch <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: xu xin <[email protected]> Cc: Yang Shi <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
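The mmap()/prctl() sequence from the commit message above can be wrapped into a small userspace helper. This is a sketch only: the helper name map_named_shared() is hypothetical, the fallback #define values are assumptions for toolchains whose headers lack the constants, and naming is treated as best effort because the prctl() call fails on kernels without anonymous VMA naming support.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <sys/prctl.h>

/* Fallback definitions (assumed values) for older userspace headers. */
#ifndef PR_SET_VMA
#define PR_SET_VMA 0x53564d41
#endif
#ifndef PR_SET_VMA_ANON_NAME
#define PR_SET_VMA_ANON_NAME 0
#endif

/*
 * Hypothetical helper: create a shared anonymous mapping and try to name it.
 * Returns the mapping, or NULL if mmap() failed. *named is set to 1 if the
 * kernel accepted the name and to 0 otherwise; the mapping stays usable
 * either way.
 */
static void *map_named_shared(size_t size, const char *name, int *named)
{
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return NULL;

    int rv = prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
                   (unsigned long)mem, size, (unsigned long)name);
    if (named)
        *named = (rv == 0);
    return mem;
}
```

If the name was accepted, the region should show up in /proc/<pid>/maps in the [anon_shmem:...] form quoted above.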
2022-11-30mm: shrinkers: add missing includes for undeclared typesT.J. Mercier1-0/+3
The shrinker.h header depends on a user including other headers before it for types used by shrinker.h. Fix this by including the appropriate headers in shrinker.h. ./include/linux/shrinker.h:13:9: error: unknown type name `gfp_t' 13 | gfp_t gfp_mask; | ^~~~~ ./include/linux/shrinker.h:71:26: error: field `list' has incomplete type 71 | struct list_head list; | ^~~~ ./include/linux/shrinker.h:82:9: error: unknown type name `atomic_long_t' 82 | atomic_long_t *nr_deferred; | Link: https://lkml.kernel.org/r/[email protected] Fixes: 83aeeada7c69 ("vmscan: use atomic-long for shrinker batching") Fixes: b0d40c92adaf ("superblock: introduce per-sb cache shrinker infrastructure") Signed-off-by: T.J. Mercier <[email protected]> Cc: Al Viro <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30zram: add size class equals check into recompressionAlexey Romanov1-0/+2
It makes no sense for us to recompress the object if it will be in the same size class. We anyway don't get any memory gain. But, at the same time, we get a CPU time overhead when inserting this object into zspage and decompressing it afterwards. [senozhatsky: rebased and fixed conflicts] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alexey Romanov <[email protected]> Signed-off-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Suleiman Souhlal <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm: mmu_gather: prepare to gather encoded page pointers with flagsLinus Torvalds1-1/+1
This is purely a preparatory patch that makes all the data structures ready for encoding flags with the mmu_gather page pointers. The code currently always sets the flag to zero and doesn't use it yet, but now it's tracking the type state along. The next step will be to actually start using it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm: teach release_pages() to take an array of encoded page pointers tooLinus Torvalds1-2/+19
release_pages() already could take either an array of page pointers, or an array of folio pointers. Expand it to also accept an array of encoded page pointers, which is what both the existing mlock() use and the upcoming mmu_gather use of encoded page pointers wants. Note that release_pages() won't actually use, or react to, any extra encoded bits. Instead, this is very much a case of "I have walked the array of encoded pages and done everything the extra bits tell me to do, now release it all". Also, while the "either page or folio pointers" dual use was handled with a cast of the pointer in "release_folios()", this takes a slightly different approach and uses the "transparent union" attribute to describe the set of arguments to the function: https://gcc.gnu.org/onlinedocs/gcc/Common-Type-Attributes.html which has been supported by gcc forever, but the kernel hasn't used before. That allows us to avoid using various wrappers with casts, and just use the same function regardless of use. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
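The "transparent union" approach described above can be sketched in userspace as follows. The type and function names here are hypothetical stand-ins, not the kernel's definitions, and the attribute is a GCC/Clang extension. The sketch also assumes that all object pointers share one size and representation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for two unrelated element types. */
struct page_like  { int id; };
struct folio_like { int id; };

/*
 * A transparent union lets callers pass any member type directly,
 * without casts and without wrapper functions.
 */
typedef union {
    struct page_like  **pages;
    struct folio_like **folios;
} demo_release_arg __attribute__((__transparent_union__));

/* Count the non-NULL entries, whatever pointer type the array holds. */
static size_t demo_count_entries(demo_release_arg arg, size_t nr)
{
    size_t live = 0;

    for (size_t i = 0; i < nr; i++) {
        void *entry;

        /* memcpy() reads the slot without caring about its static type. */
        memcpy(&entry, (const char *)arg.pages + i * sizeof(entry),
               sizeof(entry));
        if (entry)
            live++;
    }
    return live;
}
```

A caller can then pass either a `struct page_like **` or a `struct folio_like **` to demo_count_entries(), and the compiler rejects any pointer type that is not a member of the union.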
2022-11-30mm: introduce 'encoded' page pointers with embedded extra bitsLinus Torvalds1-1/+33
We already have this notion in parts of the MM code (see the mlock code with the LRU_PAGE and NEW_PAGE bits), but I'm going to introduce a new case, and I refuse to do the same thing we've done before where we just put bits in the raw pointer and say it's still a normal pointer. So this introduces a 'struct encoded_page' pointer that cannot be used for anything else than to encode a real page pointer and a couple of extra bits in the low bits. That way the compiler can trivially track the state of the pointer and you just explicitly encode and decode the extra bits. Note that this makes the alignment of 'struct page' explicit even for the case where CONFIG_HAVE_ALIGNED_STRUCT_PAGE is not set. That is entirely redundant in almost all cases, since the page structure already contains several word-sized entries. However, on m68k, the alignment of even 32-bit data is just 16 bits, and as such in theory the alignment of 'struct page' could be too. So let's just make it very very explicit that the alignment needs to be at least 32 bits, giving us a guarantee of two unused low bits in the pointer. Now, in practice, our page struct array is aligned much more than that anyway, even on m68k, and our existing code in mm/mlock.c obviously already depended on that. But since the whole point of this change is to be careful about the type system when hiding extra bits in the pointer, let's also be explicit about the assumptions we make. NOTE! This is being very careful in another way too: it has a build-time assertion that the 'flags' added to the page pointer actually fit in the two bits. That means that this helper must be inlined, and can only be used in contexts where the compiler can statically determine that the value fits in the available bits. 
[[email protected]: kerneldoc on a forward-declared struct confuses htmldocs] Link: https://lore.kernel.org/all/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Hugh Dickins <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> [s390] Cc: Nadav Amit <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
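The idea of a distinct pointer type that carries two extra low bits can be sketched in userspace as follows. All names below are hypothetical and are not taken from the patch. Unlike the kernel helper, which the message says uses a build-time assertion, this sketch checks the flag range at run time.

```c
#include <assert.h>
#include <stdint.h>

/* Alignment of at least 4 bytes guarantees two unused low pointer bits. */
struct demo_item {
    _Alignas(4) int value;
};

/* Never defined: an encoded pointer cannot be dereferenced by accident. */
struct demo_encoded_item;

#define DEMO_ENCODE_BITS 3UL

/* Pack a pointer and up to two flag bits into the distinct encoded type. */
static inline struct demo_encoded_item *
demo_encode_item(struct demo_item *item, unsigned long flags)
{
    assert(flags <= DEMO_ENCODE_BITS); /* run-time stand-in for a build-time check */
    return (struct demo_encoded_item *)((uintptr_t)item | flags);
}

/* Extract the flag bits. */
static inline unsigned long
demo_encoded_flags(struct demo_encoded_item *enc)
{
    return (uintptr_t)enc & DEMO_ENCODE_BITS;
}

/* Recover the original pointer by masking the flag bits off. */
static inline struct demo_item *
demo_encoded_ptr(struct demo_encoded_item *enc)
{
    return (struct demo_item *)((uintptr_t)enc & ~(uintptr_t)DEMO_ENCODE_BITS);
}
```

Because `struct demo_encoded_item` is an incomplete type, code must decode explicitly before touching the object, which is the type-system discipline the message argues for.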
2022-11-30mm: remove unused savedwrite infrastructureDavid Hildenbrand1-24/+0
NUMA hinting no longer uses savedwrite, let's rip it out. ... and while at it, drop __pte_write() and __pmd_write() on ppc64. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Xu <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/autonuma: use can_change_(pte|pmd)_writable() to replace savedwriteDavid Hildenbrand1-0/+2
commit b191f9b106ea ("mm: numa: preserve PTE write permissions across a NUMA hinting fault") added remembering write permissions using ordinary pte_write() for PROT_NONE mapped pages to avoid write faults when remapping the page !PROT_NONE on NUMA hinting faults. That commit noted: The patch looks hacky but the alternatives looked worse. The tidest was to rewalk the page tables after a hinting fault but it was more complex than this approach and the performance was worse. It's not generally safe to just mark the page writable during the fault if it's a write fault as it may have been read-only for COW so that approach was discarded. Later, commit 288bc54949fc ("mm/autonuma: let architecture override how the write bit should be stashed in a protnone pte.") introduced a family of savedwrite PTE functions that didn't necessarily improve the whole situation. One confusing thing is that nowadays, if a page is pte_protnone() and pte_savedwrite() then also pte_write() is true. Another source of confusion is that there is only a single pte_mk_savedwrite() call in the kernel. All other write-protection code seems to silently rely on pte_wrprotect(). Ever since PageAnonExclusive was introduced and we started using it in mprotect context via commit 64fe24a3e05e ("mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection"), we do have machinery in place to avoid write faults when changing protection, which is exactly what we want to do here. Let's similarly do what ordinary mprotect() does nowadays when upgrading write permissions and reuse can_change_pte_writable() and can_change_pmd_writable() to detect if we can upgrade PTE permissions to be writable. For anonymous pages there should be absolutely no change: if an anonymous page is not exclusive, it could not have been mapped writable -- because only exclusive anonymous pages can be mapped writable. 
However, there *might* be a change for writable shared mappings that require writenotify: if they are not dirty, we cannot map them writable. While it might not matter in practice, we'd need a different way to identify whether writenotify is actually required -- and ordinary mprotect would benefit from that as well. Note that we don't optimize for the actual migration case: (1) When migration succeeds the new PTE will not be writable because the source PTE was not writable (protnone); in the future we might just optimize that case similarly by reusing can_change_pte_writable()/can_change_pmd_writable() when removing migration PTEs. (2) When migration fails, we'd have to recalculate the "writable" flag because we temporarily dropped the PT lock; for now keep it simple and set "writable=false". We'll remove all savedwrite leftovers next. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Xu <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm/mprotect: factor out check whether manual PTE write upgrades are requiredDavid Hildenbrand1-2/+14
Let's factor the check out into vma_wants_manual_pte_write_upgrade(), to be reused in NUMA hinting fault context soon. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Xu <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-11-30mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mappedHugh Dickins2-10/+22
Can the lock_compound_mapcount() bit_spin_lock apparatus be removed now? Yes. Not by atomic64_t or cmpxchg games, those get difficult on 32-bit; but if we slightly abuse subpages_mapcount by additionally demanding that one bit be set there when the compound page is PMD-mapped, then a cascade of two atomic ops is able to maintain the stats without bit_spin_lock. This is harder to reason about than when bit_spin_locked, but I believe safe; and no drift in stats detected when testing. When there are racing removes and adds, of course the sequence of operations is less well-defined; but each operation on subpages_mapcount is atomically good. What might be disastrous, is if subpages_mapcount could ever fleetingly appear negative: but the pte lock (or pmd lock) these rmap functions are called under, ensures that a last remove cannot race ahead of a first add. Continue to make an exception for hugetlb (PageHuge) pages, though that exception can be easily removed by a further commit if necessary: leave subpages_mapcount 0, don't bother with COMPOUND_MAPPED in its case, just carry on checking compound_mapcount too in folio_mapped(), page_mapped(). Evidence is that this way goes slightly faster than the previous implementation in all cases (pmds after ptes now taking around 103ms); and relieves us of worrying about contention on the bit_spin_lock. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Acked-by: Kirill A. 
Shutemov <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: James Houghton <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: John Hubbard <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mina Almasry <[email protected]> Cc: Muchun Song <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Peter Xu <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yang Shi <[email protected]> Cc: Yu Zhao <[email protected]> Cc: Zach O'Keefe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
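The encoding idea in the message above, one counter that holds both a subpage count and a "PMD-mapped" bit, can be sketched with C11 atomics. This is a much simplified illustration and not the kernel's rmap code: the bit value, the struct and the helper names are assumptions, and the real code also tracks a separate compound_mapcount and runs under the pte or pmd lock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed layout: one high bit flags "PMD-mapped", the low bits count subpages. */
#define DEMO_COMPOUND_MAPPED 0x800000
#define DEMO_SUBPAGES_MAPPED (DEMO_COMPOUND_MAPPED - 1)

struct demo_folio {
    atomic_int subpages_mapcount;
};

/* One more subpage is mapped by PTE. */
static void demo_add_pte_map(struct demo_folio *f)
{
    atomic_fetch_add(&f->subpages_mapcount, 1);
}

/* One subpage PTE mapping goes away. */
static void demo_remove_pte_map(struct demo_folio *f)
{
    atomic_fetch_sub(&f->subpages_mapcount, 1);
}

/* The whole compound page becomes PMD-mapped: set the flag bit atomically. */
static void demo_set_pmd_mapped(struct demo_folio *f)
{
    atomic_fetch_add(&f->subpages_mapcount, DEMO_COMPOUND_MAPPED);
}

/* The PMD mapping goes away: clear the flag bit atomically. */
static void demo_clear_pmd_mapped(struct demo_folio *f)
{
    atomic_fetch_sub(&f->subpages_mapcount, DEMO_COMPOUND_MAPPED);
}

static int demo_subpages(struct demo_folio *f)
{
    return atomic_load(&f->subpages_mapcount) & DEMO_SUBPAGES_MAPPED;
}

static bool demo_pmd_mapped(struct demo_folio *f)
{
    return (atomic_load(&f->subpages_mapcount) & DEMO_COMPOUND_MAPPED) != 0;
}
```

Since each update is a single atomic add or subtract on the same word, both pieces of state stay readable without a lock.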