path: root/include
2023-10-19  PM / devfreq: rockchip-dfi: add support for RK3588  (Sascha Hauer, 1 file, -0/+18)
Add support for the RK3588 to the driver. The RK3588 has four DDR channels with a register stride of 0x4000 between the channel registers; it also has a DDRMON_CTRL register per channel. Link: https://lore.kernel.org/all/[email protected]/ Reviewed-by: Jonathan Cameron <[email protected]> Reviewed-by: Sebastian Reichel <[email protected]> Acked-by: Chanwoo Choi <[email protected]> Acked-by: Heiko Stuebner <[email protected]> Signed-off-by: Sascha Hauer <[email protected]> Signed-off-by: Chanwoo Choi <[email protected]>
2023-10-19  PM / devfreq: rockchip-dfi: Add perf support  (Sascha Hauer, 2 files, -0/+3)
The DFI is a unit which is suitable for measuring DDR utilization, but so far it could only be used as an event driver for the DDR frequency scaling driver. This adds perf support to the DFI driver. Usage with the 'perf' tool can look like: perf stat -a -e rockchip_ddr/cycles/,\ rockchip_ddr/read-bytes/,\ rockchip_ddr/write-bytes/,\ rockchip_ddr/bytes/ sleep 1 Performance counter stats for 'system wide': 1582524826 rockchip_ddr/cycles/ 1802.25 MB rockchip_ddr/read-bytes/ 1793.72 MB rockchip_ddr/write-bytes/ 3595.90 MB rockchip_ddr/bytes/ 1.014369709 seconds time elapsed perf support has been tested on a RK3568 and a RK3399, the latter with dual channel DDR. Link: https://lore.kernel.org/all/[email protected]/ Reviewed-by: Sebastian Reichel <[email protected]> Acked-by: Chanwoo Choi <[email protected]> Acked-by: Heiko Stuebner <[email protected]> Signed-off-by: Sascha Hauer <[email protected]> [cw00.choi: Fix typo from 'write_acccess' to 'write_access'] Signed-off-by: Chanwoo Choi <[email protected]>
2023-10-19  PM / devfreq: rockchip-dfi: Handle LPDDR4X  (Sascha Hauer, 1 file, -0/+1)
In the DFI driver LPDDR4X can be handled in the same way as LPDDR4. Add the missing case. Link: https://lore.kernel.org/all/[email protected]/ Reviewed-by: Jonathan Cameron <[email protected]> Reviewed-by: Sebastian Reichel <[email protected]> Acked-by: Chanwoo Choi <[email protected]> Acked-by: Heiko Stuebner <[email protected]> Signed-off-by: Sascha Hauer <[email protected]> Signed-off-by: Chanwoo Choi <[email protected]>
2023-10-19  PM / devfreq: rockchip-dfi: Add RK3568 support  (Sascha Hauer, 1 file, -0/+12)
This adds RK3568 support to the DFI driver. Only the initialization differs from the currently supported RK3399. Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Sascha Hauer <[email protected]> Acked-by: Heiko Stuebner <[email protected]> Signed-off-by: Chanwoo Choi <[email protected]>
2023-10-19  PM / devfreq: rk3399_dmc,dfi: generalize DDRTYPE defines  (Sascha Hauer, 2 files, -6/+18)
The DDRTYPE defines are named to be RK3399 specific, but they can be used for other Rockchip SoCs as well, so replace the RK3399_PMUGRF_ prefix with ROCKCHIP_. They are defined in a SoC specific header file, so when generalizing the prefix also move the new defines to a SoC agnostic header file. While at it use GENMASK to define the DDRTYPE bitfield and give it a name including the full register name. Link: https://lore.kernel.org/all/[email protected]/ Reviewed-by: Sebastian Reichel <[email protected]> Acked-by: Chanwoo Choi <[email protected]> Acked-by: Heiko Stuebner <[email protected]> Signed-off-by: Sascha Hauer <[email protected]> Signed-off-by: Chanwoo Choi <[email protected]>
2023-10-19  inet: lock the socket in ip_sock_set_tos()  (Eric Dumazet, 1 file, -0/+1)
Christoph Paasch reported a panic in TCP stack [1] Indeed, we should not call sk_dst_reset() without holding the socket lock, as __sk_dst_get() callers do not all rely on bare RCU. [1] BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD 12bad6067 P4D 12bad6067 PUD 12bad5067 PMD 0 Oops: 0000 [#1] PREEMPT SMP CPU: 1 PID: 2750 Comm: syz-executor.5 Not tainted 6.6.0-rc4-g7a5720a344e7 #49 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 RIP: 0010:tcp_get_metrics+0x118/0x8f0 net/ipv4/tcp_metrics.c:321 Code: c7 44 24 70 02 00 8b 03 89 44 24 48 c7 44 24 4c 00 00 00 00 66 c7 44 24 58 02 00 66 ba 02 00 b1 01 89 4c 24 04 4c 89 7c 24 10 <49> 8b 0f 48 8b 89 50 05 00 00 48 89 4c 24 30 33 81 00 02 00 00 69 RSP: 0018:ffffc90000af79b8 EFLAGS: 00010293 RAX: 000000000100007f RBX: ffff88812ae8f500 RCX: ffff88812b5f8f01 RDX: 0000000000000002 RSI: ffffffff8300f080 RDI: 0000000000000002 RBP: 0000000000000002 R08: 0000000000000003 R09: ffffffff8205eca0 R10: 0000000000000002 R11: ffff88812b5f8f00 R12: ffff88812a9e0580 R13: 0000000000000000 R14: ffff88812ae8fbd2 R15: 0000000000000000 FS: 00007f70a006b640(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000012bad7003 CR4: 0000000000170ee0 Call Trace: <TASK> tcp_fastopen_cache_get+0x32/0x140 net/ipv4/tcp_metrics.c:567 tcp_fastopen_cookie_check+0x28/0x180 net/ipv4/tcp_fastopen.c:419 tcp_connect+0x9c8/0x12a0 net/ipv4/tcp_output.c:3839 tcp_v4_connect+0x645/0x6e0 net/ipv4/tcp_ipv4.c:323 __inet_stream_connect+0x120/0x590 net/ipv4/af_inet.c:676 tcp_sendmsg_fastopen+0x2d6/0x3a0 net/ipv4/tcp.c:1021 tcp_sendmsg_locked+0x1957/0x1b00 net/ipv4/tcp.c:1073 tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1336 __sock_sendmsg+0x83/0xd0 net/socket.c:730 __sys_sendto+0x20a/0x2a0 net/socket.c:2194 __do_sys_sendto net/socket.c:2206 [inline] Fixes: e08d0b3d1723 ("inet: implement lockless IP_TOS") Reported-by: Christoph Paasch <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
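The shape of the fix described above is simply to take the socket lock around the TOS update; a rough sketch under that assumption (the lockless __ip_sock_set_tos() helper name is inferred from the commit description, not quoted from the patch):

/* Hedged sketch: perform the TOS update, which may end up calling
 * sk_dst_reset(), only while holding the socket lock.
 */
void ip_sock_set_tos(struct sock *sk, int val)
{
	lock_sock(sk);
	__ip_sock_set_tos(sk, val);	/* assumed lockless helper */
	release_sock(sk);
}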
2023-10-19  net: stmmac: intel: remove unnecessary field struct plat_stmmacenet_data::ext_snapshot_num  (Johannes Zink, 1 file, -1/+0)
Do not store the bitmask for enabling AUX_SNAPSHOT0. The previous commit ("net: stmmac: fix PPS capture input index") takes care of calculating the proper bit mask from the request data's extts.index field, which is 0 if not explicitly specified otherwise. Signed-off-by: Johannes Zink <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2023-10-19  fs: fix umask on NFS with CONFIG_FS_POSIX_ACL=n  (Max Kellermann, 1 file, -0/+5)
Make IS_POSIXACL() return false if POSIX ACL support is disabled. Never skip applying the umask in namei.c and never bother to do any ACL specific checks if the filesystem falsely indicates it has ACLs enabled when the feature is completely disabled in the kernel. This fixes a problem where the umask is always ignored in the NFS client when compiled without CONFIG_FS_POSIX_ACL. This is a 4-year-old regression caused by commit 013cdf1088d723, which itself was not completely wrong, but failed to consider all the side effects of the misdesigned VFS code. Prior to that commit, there were two places where the umask could be applied, for example when creating a directory: 1. in the VFS layer in SYSCALL_DEFINE3(mkdirat), but only if !IS_POSIXACL() 2. again (unconditionally) in nfs3_proc_mkdir() The first one does not apply, because even without CONFIG_FS_POSIX_ACL, the NFS client sets SB_POSIXACL in nfs_fill_super(). After that commit, (2.) was replaced by: 2b. in posix_acl_create(), called by nfs3_proc_mkdir() There's one branch in posix_acl_create() which applies the umask; however, without CONFIG_FS_POSIX_ACL, posix_acl_create() is an empty dummy function which does not apply the umask. The approach chosen by this patch is to make IS_POSIXACL() always return false when POSIX ACL support is disabled, so the umask always gets applied by the VFS layer. This is consistent with the (regular) behavior of posix_acl_create(): that function returns early if IS_POSIXACL() is false, before applying the umask. Therefore, posix_acl_create() is responsible for applying the umask if there is ACL support enabled in the file system (SB_POSIXACL), and the VFS layer is responsible for all other cases (no SB_POSIXACL or no CONFIG_FS_POSIX_ACL). Signed-off-by: Max Kellermann <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
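The change boils down to making IS_POSIXACL() compile to a constant false when ACL support is out of the kernel; a sketch of that shape (assuming the usual __IS_FLG() helper in include/linux/fs.h):

/* With CONFIG_FS_POSIX_ACL=n, IS_POSIXACL() is always false, so the
 * VFS (namei.c) applies the umask itself.
 */
#ifdef CONFIG_FS_POSIX_ACL
#define IS_POSIXACL(inode)	__IS_FLG(inode, SB_POSIXACL)
#else
#define IS_POSIXACL(inode)	0
#endif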
2023-10-19  fs: store real path instead of fake path in backing file f_path  (Amir Goldstein, 2 files, -20/+5)
A backing file struct stores two paths: one "real" path referring to f_inode and one "fake" path, which should be displayed to users in /proc/<pid>/maps. There is a lot more potential code that needs to know the "real" path than code that needs to know the "fake" path. Instead of code having to request the "real" path with file_real_path(), store the "real" path in f_path and require code that needs to know the "fake" path to request it with file_user_path(). Replace the file_real_path() helper with a simple const accessor f_path(). After this change, file_dentry() is not expected to observe any files with overlayfs f_path and real f_inode, so the call to ->d_real() should not be needed. Leave the ->d_real() call for now and add an assertion in ovl_d_real() to catch if we made wrong assumptions. Suggested-by: Miklos Szeredi <[email protected]> Link: https://lore.kernel.org/r/CAJfpegtt48eXhhjDFA1ojcHPNKj3Go6joryCPtEFAKpocyBsnw@mail.gmail.com/ Signed-off-by: Amir Goldstein <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
2023-10-19  fs: create helper file_user_path() for user displayed mapped file path  (Amir Goldstein, 1 file, -0/+14)
Overlayfs uses backing files with "fake" overlayfs f_path and "real" underlying f_inode, in order to use underlying inode aops for mapped files and to display the overlayfs path in /proc/<pid>/maps. In preparation for storing the overlayfs "fake" path instead of the underlying "real" path in struct backing_file, define a noop helper file_user_path() that returns f_path for now. Use the new helper in procfs and kernel logs whenever a path of a mapped file is displayed to users. Signed-off-by: Amir Goldstein <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
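At this stage the helper is deliberately a noop accessor; a sketch of what such a helper looks like (hedged, since the later patch changes what it returns for backing files):

/* Path to be shown to users in /proc/<pid>/maps and kernel logs;
 * for now this is simply the file's f_path.
 */
static inline const struct path *file_user_path(struct file *f)
{
	return &f->f_path;
}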
2023-10-19  vfs: predict the error in retry_estale as unlikely  (Mateusz Guzik, 1 file, -1/+1)
Signed-off-by: Mateusz Guzik <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
2023-10-19  file: convert to SLAB_TYPESAFE_BY_RCU  (Christian Brauner, 2 files, -15/+6)
In recent discussions around some performance improvements in the file handling area we discussed switching the file cache to rely on SLAB_TYPESAFE_BY_RCU which allows us to get rid of call_rcu() based freeing for files completely. This is a pretty sensitive change overall but it might actually be worth doing. The main downside is the subtlety. The other one is that we should really wait for Jann's patch to land that enables KASAN to handle SLAB_TYPESAFE_BY_RCU UAFs. Currently it doesn't but a patch for this exists. With SLAB_TYPESAFE_BY_RCU objects may be freed and reused multiple times which requires a few changes. So it isn't sufficient anymore to just acquire a reference to the file in question under rcu using atomic_long_inc_not_zero() since the file might have already been recycled and someone else might have bumped the reference. In other words, callers might see reference count bumps from newer users. For this reason it is necessary to verify that the pointer is the same before and after the reference count increment. This pattern can be seen in get_file_rcu() and __files_get_rcu(). In addition, it isn't possible to access or check fields in struct file without first acquiring a reference on it. Not doing that was always very dodgy and it was only usable for non-pointer data in struct file. With SLAB_TYPESAFE_BY_RCU it is necessary that callers first acquire a reference under rcu or they must hold the files_lock of the fdtable. Failing to do either one of these is a bug. Thanks to Jann for pointing out that we need to ensure memory ordering between reallocations and the pointer check by ensuring that all subsequent loads have a dependency on the second load in get_file_rcu() and providing a fixup that was folded into this patch. Cc: Jann Horn <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-10-19  watch_queue: Annotate struct watch_filter with __counted_by  (Kees Cook, 1 file, -1/+1)
Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). As found with Coccinelle[1], add __counted_by for struct watch_filter. [1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci Cc: David Howells <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Al Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Siddh Raman Pant <[email protected]> Cc: Mauro Carvalho Chehab <[email protected]> Cc: Qian Cai <[email protected]> Signed-off-by: Kees Cook <[email protected]> Tested-by: Siddh Raman Pant <[email protected]> Reviewed-by: "Gustavo A. R. Silva" <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
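The annotation itself is a one-liner on the flexible array member; an abridged illustration of the pattern (unrelated members of struct watch_filter omitted):

struct watch_filter {
	/* ... */
	u32			nr_filters;	/* number of elements in filters[] */
	struct watch_type_filter filters[] __counted_by(nr_filters);
};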
2023-10-19  fs/pipe: move check to pipe_has_watch_queue()  (Max Kellermann, 1 file, -0/+16)
This declutters the code by reducing the number of #ifdefs and makes the watch_queue checks simpler. This has no runtime effect; the machine code is identical. Signed-off-by: Max Kellermann <[email protected]> Message-Id: <[email protected]> Reviewed-by: David Howells <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
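A sketch of the kind of helper this introduces, hedged rather than quoted from the patch; the point is that the #ifdef CONFIG_WATCH_QUEUE lives in one inline function instead of at every call site:

static inline bool pipe_has_watch_queue(const struct pipe_inode_info *pipe)
{
#ifdef CONFIG_WATCH_QUEUE
	return pipe->watch_queue != NULL;
#else
	return false;
#endif
}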
2023-10-19  pipe: reduce padding in struct pipe_inode_info  (Max Kellermann, 1 file, -3/+3)
This has no effect on 64 bit because there are 10 32-bit integers surrounding the two bools, but on 32 bit architectures, this reduces the struct size by 4 bytes by merging the two bools into one word. Signed-off-by: Max Kellermann <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-10-19  fs: add a new SB_I_NOUMASK flag  (Jeff Layton, 2 files, -1/+26)
SB_POSIXACL must be set when a filesystem supports POSIX ACLs, but NFSv4 also sets this flag to prevent the VFS from applying the umask on newly-created files. NFSv4 doesn't support POSIX ACLs however, which causes confusion when other subsystems try to test for them. Add a new SB_I_NOUMASK flag that allows filesystems to opt-in to umask stripping without advertising support for POSIX ACLs. Set the new flag on NFSv4 instead of SB_POSIXACL. Also, move mode_strip_umask to namei.h and convert init_mknod and init_mkdir to use it. Signed-off-by: Jeff Layton <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
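One plausible reading of how the new flag is consumed, sketched from the description above (the exact placement of the check in the real patch may differ):

/* Hedged sketch: strip the umask in the VFS unless the filesystem
 * handles POSIX ACLs or has opted out via SB_I_NOUMASK.
 */
static inline umode_t mode_strip_umask(const struct inode *dir, umode_t mode)
{
	if (!IS_POSIXACL(dir) && !(dir->i_sb->s_iflags & SB_I_NOUMASK))
		mode &= ~current_umask();
	return mode;
}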
2023-10-19  vlynq: remove bus driver  (Wolfram Sang, 1 file, -149/+0)
There are no users with a vlynq_driver in the Kernel tree. Also, only the AR7 platform ever initialized a VLYNQ bus, but AR7 is going to be removed from the Kernel. OpenWRT had some out-of-tree drivers which they probably intended to upport, but AR7 devices are even there not supported anymore because they are "stuck with Kernel 3.18" [1]. This code can go. [1] https://openwrt.org/docs/techref/targets/ar7 Signed-off-by: Wolfram Sang <[email protected]> Acked-by: Greg Kroah-Hartman <[email protected]> Acked-by: Florian Fainelli <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]>
2023-10-19  perf: Disallow mis-matched inherited group reads  (Peter Zijlstra, 1 file, -0/+1)
Because group consistency is non-atomic between parent (filedesc) and children (inherited) events, it is possible for PERF_FORMAT_GROUP read() to try and sum non-matching counter groups -- with non-sensical results. Add group_generation to distinguish the case where a parent group removes and adds an event and thus has the same number, but a different configuration of events as inherited groups. This became a problem when commit fa8c269353d5 ("perf/core: Invert perf_read_group() loops") flipped the order of child_list and sibling_list. Previously it would iterate the group (sibling_list) first, and for each sibling traverse the child_list. In this order, only the group composition of the parent is relevant. By flipping the order the group composition of the child (inherited) events becomes an issue and the mis-match in group composition becomes evident. That said; even prior to this commit, while reading of a group that is not equally inherited was not broken, it still made no sense. (Ab)use ECHILD as error return to indicate issues with child process group composition. Fixes: fa8c269353d5 ("perf/core: Invert perf_read_group() loops") Reported-by: Budimir Markovic <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2023-10-19  RDMA/core: Add support to set privileged QKEY parameter  (Patrisious Haddad, 1 file, -0/+2)
Add netlink command that enables/disables privileged QKEY by default. It is disabled by default, since according to IB spec only privileged users are allowed to use privileged QKEY. According to the IB specification rel-1.6, section 3.5.3: "QKEYs with the most significant bit set are considered controlled QKEYs, and a HCA does not allow a consumer to arbitrarily specify a controlled QKEY." Using rdma tool, $rdma system set privileged-qkey on When enabled non-privileged users would be able to use controlled QKEYs which are considered privileged. Using rdma tool, $rdma system set privileged-qkey off When disabled only privileged users would be able to use controlled QKEYs. You can also use the command below to check the parameter state: $rdma system show netns shared privileged-qkey off copy-on-fork on Signed-off-by: Patrisious Haddad <[email protected]> Link: https://lore.kernel.org/r/90398be70a9d23d2aa9d0f9fd11d2c264c1be534.1696848201.git.leon@kernel.org Signed-off-by: Leon Romanovsky <[email protected]>
2023-10-18  buildid: reduce header file dependencies for module  (Arnd Bergmann, 1 file, -1/+2)
The vmlinux decompressor code intentionally has only a limited set of included header files, but this started running into a build failure because of the bitmap logic needing linux/errno.h: In file included from include/linux/cpumask.h:12, from include/linux/mm_types_task.h:14, from include/linux/mm_types.h:5, from include/linux/buildid.h:5, from include/linux/module.h:14, from arch/arm/boot/compressed/../../../../lib/lz4/lz4_decompress.c:39, from arch/arm/boot/compressed/../../../../lib/decompress_unlz4.c:10, from arch/arm/boot/compressed/decompress.c:60: include/linux/bitmap.h: In function 'bitmap_allocate_region': include/linux/bitmap.h:527:25: error: 'EBUSY' undeclared (first use in this function) 527 | return -EBUSY; | ^~~~~ include/linux/bitmap.h:527:25: note: each undeclared identifier is reported only once for each function it appears in include/linux/bitmap.h: In function 'bitmap_find_free_region': include/linux/bitmap.h:554:17: error: 'ENOMEM' undeclared (first use in this function) 554 | return -ENOMEM; | ^~~~~~ This is easily avoided by changing linux/buildid.h to no longer depend on linux/mm_types.h, a header that pulls in a huge number of indirect dependencies. Fixes: b9c957f554442 ("bitmap: move bitmap_*_region() functions to bitmap.h") Fixes: bd7525dacd7e2 ("bpf: Move stack_map_get_build_id into lib") Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: Yury Norov <[email protected]>
2023-10-18  string: Adjust strtomem() logic to allow for smaller sources  (Kees Cook, 1 file, -2/+5)
Arnd noticed we have a case where a shorter source string is being copied into a destination byte array, but this results in a strnlen() call that exceeds the size of the source. This is seen with -Wstringop-overread: In file included from ../include/linux/uuid.h:11, from ../include/linux/mod_devicetable.h:14, from ../include/linux/cpufeature.h:12, from ../arch/x86/coco/tdx/tdx.c:7: ../arch/x86/coco/tdx/tdx.c: In function 'tdx_panic.constprop': ../include/linux/string.h:284:9: error: 'strnlen' specified bound 64 exceeds source size 60 [-Werror=stringop-overread] 284 | memcpy_and_pad(dest, _dest_len, src, strnlen(src, _dest_len), pad); \ | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ../arch/x86/coco/tdx/tdx.c:124:9: note: in expansion of macro 'strtomem_pad' 124 | strtomem_pad(message.str, msg, '\0'); | ^~~~~~~~~~~~ Use the smaller of the two buffer sizes when calling strnlen(). When src length is unknown (SIZE_MAX), it is adjusted to use dest length, which is what the original code did. Reported-by: Arnd Bergmann <[email protected]> Fixes: dfbafa70bde2 ("string: Introduce strtomem() and strtomem_pad()") Tested-by: Arnd Bergmann <[email protected]> Cc: Andy Shevchenko <[email protected]> Cc: [email protected] Signed-off-by: Kees Cook <[email protected]>
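The underlying idea is just to bound the strnlen() by whichever of the two object sizes is smaller; a simplified, self-contained illustration (not the kernel macro, which derives both sizes with __builtin_object_size()):

#include <string.h>

/* Copy up to dest_len bytes of src into dest, padding the remainder. */
static void strtomem_pad_sketch(char *dest, size_t dest_len,
				const char *src, size_t src_len, char pad)
{
	/* never ask strnlen() to look past the end of the source */
	size_t n = strnlen(src, src_len < dest_len ? src_len : dest_len);

	memcpy(dest, src, n);
	memset(dest + n, pad, dest_len - n);
}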
2023-10-19  dt-bindings: thermal: mediatek: Add LVTS thermal controller definition for mt8192  (Balsam CHIHI, 1 file, -0/+19)
Add LVTS thermal controller definition for MT8192. Signed-off-by: Balsam CHIHI <[email protected]> Reviewed-by: AngeloGioacchino Del Regno <[email protected]> Acked-by: Krzysztof Kozlowski <[email protected]> Signed-off-by: Bernhard Rosenkränzer <[email protected]> Reviewed-by: Matthias Brugger <[email protected]> Reviewed-by: Alexandre Mergnat <[email protected]> Signed-off-by: Daniel Lezcano <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-10-18  compiler.h: move __is_constexpr() to compiler.h  (David Laight, 2 files, -8/+8)
Prior to f747e6667ebb2 __is_constexpr() was in its only user minmax.h. That commit moved it to const.h - but that file just defines ULL(x) and UL(x) so that constants can be defined for .S and .c files. So apart from the word 'const' it wasn't really a good location. Instead move the definition to compiler.h just before the similar is_signed_type() and is_unsigned_type(). This may not be a good long-term home, but the three definitions belong together. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Laight <[email protected]> Reviewed-by: Kees Cook <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Bart Van Assche <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Nick Desaulniers <[email protected]> Cc: Rasmus Villemoes <[email protected]> Cc: Steven Rostedt (Google) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
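For reference, the macro being moved is the well-known integer-constant-expression detector; its usual form is:

/*
 * Evaluates to true when x is an integer constant expression.  It relies
 * on (void *)((long)(x) * 0l) being a null pointer constant only in that
 * case, which changes the type the ?: expression (and hence sizeof) sees.
 */
#define __is_constexpr(x) \
	(sizeof(int) == sizeof(*(8 ? ((void *)((long)(x) * 0l)) : (int *)8)))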
2023-10-18  minmax: relax check to allow comparison between unsigned arguments and signed constants  (David Laight, 1 file, -7/+17)
Allow (for example) min(unsigned_var, 20). The opposite min(signed_var, 20u) is still errored. Since a comparison between signed and unsigned never makes the unsigned value negative it is only necessary to adjust the __types_ok() test. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Laight <[email protected]> Cc: Andy Shevchenko <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Jason A. Donenfeld <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-18  minmax: allow comparisons of 'int' against 'unsigned char/short'  (David Laight, 1 file, -2/+3)
Since 'unsigned char/short' get promoted to 'signed int' it is safe to compare them against an 'int' value. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Laight <[email protected]> Cc: Andy Shevchenko <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Jason A. Donenfeld <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-18  minmax: fix indentation of __cmp_once() and __clamp_once()  (David Laight, 1 file, -15/+15)
Remove the extra indentation and align continuation markers. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Laight <[email protected]> Cc: Andy Shevchenko <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Jason A. Donenfeld <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-18  minmax: allow min()/max()/clamp() if the arguments have the same signedness.  (David Laight, 1 file, -28/+32)
The type-check in min()/max() is there to stop unexpected results if a negative value gets converted to a large unsigned value. However, it also rejects 'unsigned int' v 'unsigned long' compares which are common and never problematic. Replace the 'same type' check with a 'same signedness' check. The new test isn't itself a compile time error, so use static_assert() to report the error and give a meaningful error message. Due to the way builtin_choose_expr() works, detecting the error in the 'non-constant' side (where static_assert() can be used) also detects errors when the arguments are constant. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Laight <[email protected]> Cc: Andy Shevchenko <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Jason A. Donenfeld <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-18  minmax: add umin(a, b) and umax(a, b)  (David Laight, 1 file, -0/+17)
Patch series "minmax: Relax type checks in min() and max()", v4. The min() (etc) functions in minmax.h require that the arguments have exactly the same types. However when the type check fails, rather than look at the types and fix the type of a variable/constant, everyone seems to jump on min_t(). In reality min_t() ought to be rare - when something unusual is being done, not normality. The orginal min() (added in 2.4.9) replaced several inline functions and included the type - so matched the implicit casting of the function call. This was renamed min_t() in 2.4.10 and the current min() added. There is no actual indication that the conversion of negatve values to large unsigned values has ever been an actual problem. A quick grep shows 5734 min() and 4597 min_t(). Having the casts on almost half of the calls shows that something is clearly wrong. If the wrong type is picked (and it is far too easy to pick the type of the result instead of the larger input) then significant bits can get discarded. Pretty much the worst example is in the derived clamp_val(), consider: unsigned char x = 200u; y = clamp_val(x, 10u, 300u); I also suspect that many of the min_t(u16, ...) are actually wrong. For example copy_data() in printk_ringbuffer.c contains: data_size = min_t(u16, buf_size, len); Here buf_size is 'unsigned int' and len 'u16', pass a 64k buffer (can you prove that doesn't happen?) and no data is returned. Apparantly it did - and has since been fixed. The only reason that most of the min_t() are 'fine' is that pretty much all the values in the kernel are between 0 and INT_MAX. Patch 1 adds umin(), this uses integer promotions to convert both arguments to 'unsigned long long'. It can be used to compare a signed type that is known to contain a non-negative value with an unsigned type. The compiler typically optimises it all away. Added first so that it can be referred to in patch 2. Patch 2 replaces the 'same type' check with a 'same signedness' one. This makes min(unsigned_int_var, sizeof()) be ok. The error message is also improved and will contain the expanded form of both arguments (useful for seeing how constants are defined). Patch 3 just fixes some whitespace. Patch 4 allows comparisons of 'unsigned char' and 'unsigned short' to signed types. The integer promotion rules convert them both to 'signed int' prior to the comparison so they can never cause a negative value be converted to a large positive one. Patch 5 (rewritted for v4) allows comparisons of unsigned values against non-negative constant integer expressions. This makes min(unsigned_int_var, 4) be ok. The only common case that is still errored is the comparison of signed values against unsigned constant integer expressions below __INT_MAX__. Typcally min(int_val, sizeof (foo)), the real fix for this is casting the constant: min(int_var, (int)sizeof (foo)). With all the patches applied pretty much all the min_t() could be replaced by min(), and most of the rest by umin(). However they all need careful inspection due to code like: sz = min_t(unsigned char, sz - 1, LIM - 1) + 1; which converts 0 to LIM. This patch (of 6): umin() and umax() can be used when min()/max() errors a signed v unsigned compare when the signed value is known to be non-negative. Unlike min_t(some_unsigned_type, a, b) umin() will never mask off high bits if an inappropriate type is selected. The '+ 0u + 0ul + 0ull' may look strange. The '+ 0u' is needed for 'signed int' on 64bit systems. The '+ 0ul' is needed for 'signed long' on 32bit systems. 
The '+ 0ull' is needed for 'signed long long'. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Laight <[email protected]> Cc: Andy Shevchenko <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Jason A. Donenfeld <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
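Given that description, umin()/umax() are essentially min()/max() with both arguments promoted to unsigned long long first; a hedged sketch (the in-tree macros go through the internal comparison helpers, so the exact expansion differs):

#define umin(x, y)	min((x) + 0u + 0ul + 0ull, (y) + 0u + 0ul + 0ull)
#define umax(x, y)	max((x) + 0u + 0ul + 0ull, (y) + 0u + 0ul + 0ull)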
2023-10-18  kstrtox: remove strtobool()  (Christophe JAILLET, 1 file, -5/+0)
The conversion from strtobool() to kstrtobool() is completed. So strtobool() can now be removed. Link: https://lkml.kernel.org/r/87e3cc2547df174cd5af1fadbf866be4ef9e8e45.1694878151.git.christophe.jaillet@wanadoo.fr Signed-off-by: Christophe JAILLET <[email protected]> Reviewed-by: Andy Shevchenko <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-18  extract and use FILE_LINE macro  (Alexey Dobriyan, 3 files, -3/+4)
Extract the nifty FILE_LINE macro, useful for printk-style debugging: printk("%s\n", FILE_LINE); It should not be used en masse, probably because __FILE__ string literals can be merged while FILE_LINE's won't. But for debugging it is what the doctor ordered. Don't add leading and trailing underscores, they're painful to type. Trust me, I've tried both versions. Link: https://lkml.kernel.org/r/ebf12ac4-5a61-4b12-b8b0-1253eb371332@p183 Signed-off-by: Alexey Dobriyan <[email protected]> Cc: Kees Cook <[email protected]> Cc: Takashi Iwai <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
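The macro itself is tiny; a sketch of the likely definition, assuming the kernel's existing __stringify() helper:

/* Expands to a single string literal such as "drivers/foo/bar.c:123". */
#define FILE_LINE	__FILE__ ":" __stringify(__LINE__)
/* usage, as in the commit message: printk("%s\n", FILE_LINE); */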
2023-10-18  mm: update memfd seal write check to include F_SEAL_WRITE  (Lorenzo Stoakes, 1 file, -7/+8)
The seal_check_future_write() function is called by shmem_mmap() or hugetlbfs_file_mmap() to disallow any future writable mappings of a memfd sealed this way. The F_SEAL_WRITE flag is not checked here, as that is handled via the mapping->i_mmap_writable mechanism and so any attempt at a mapping would fail before this could be run. However we intend to change this, meaning this check can be performed for F_SEAL_WRITE mappings also. The logic here is equally applicable to both flags, so update this function to accommodate both and rename it accordingly. Link: https://lkml.kernel.org/r/913628168ce6cce77df7d13a63970bae06a526e0.1697116581.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-18  mm: drop the assumption that VM_SHARED always implies writable  (Lorenzo Stoakes, 2 files, -2/+13)
Patch series "permit write-sealed memfd read-only shared mappings", v4. The man page for fcntl() describing memfd file seals states the following about F_SEAL_WRITE:- Furthermore, trying to create new shared, writable memory-mappings via mmap(2) will also fail with EPERM. With emphasis on 'writable'. In turns out in fact that currently the kernel simply disallows all new shared memory mappings for a memfd with F_SEAL_WRITE applied, rendering this documentation inaccurate. This matters because users are therefore unable to obtain a shared mapping to a memfd after write sealing altogether, which limits their usefulness. This was reported in the discussion thread [1] originating from a bug report [2]. This is a product of both using the struct address_space->i_mmap_writable atomic counter to determine whether writing may be permitted, and the kernel adjusting this counter when any VM_SHARED mapping is performed and more generally implicitly assuming VM_SHARED implies writable. It seems sensible that we should only update this mapping if VM_MAYWRITE is specified, i.e. whether it is possible that this mapping could at any point be written to. If we do so then all we need to do to permit write seals to function as documented is to clear VM_MAYWRITE when mapping read-only. It turns out this functionality already exists for F_SEAL_FUTURE_WRITE - we can therefore simply adapt this logic to do the same for F_SEAL_WRITE. We then hit a chicken and egg situation in mmap_region() where the check for VM_MAYWRITE occurs before we are able to clear this flag. To work around this, perform this check after we invoke call_mmap(), with careful consideration of error paths. Thanks to Andy Lutomirski for the suggestion! [1]:https://lore.kernel.org/all/[email protected]/ [2]:https://bugzilla.kernel.org/show_bug.cgi?id=217238 This patch (of 3): There is a general assumption that VMAs with the VM_SHARED flag set are writable. If the VM_MAYWRITE flag is not set, then this is simply not the case. Update those checks which affect the struct address_space->i_mmap_writable field to explicitly test for this by introducing [vma_]is_shared_maywrite() helper functions. This remains entirely conservative, as the lack of VM_MAYWRITE guarantees that the VMA cannot be written to. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/d978aefefa83ec42d18dfa964ad180dbcde34795.1697116581.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <[email protected]> Suggested-by: Andy Lutomirski <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-18  mm: make vma_merge() and split_vma() internal  (Lorenzo Stoakes, 1 file, -9/+0)
Now the common pattern of - attempting a merge via vma_merge() and should this fail splitting VMAs via split_vma() - has been abstracted, the former can be placed into mm/internal.h and the latter made static. In addition, the split_vma() nommu variant also need not be exported. Link: https://lkml.kernel.org/r/405f2be10e20c4e9fbcc9fe6b2dfea105f6642e0.1697043508.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Liam R. Howlett <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-18  mm: abstract the vma_merge()/split_vma() pattern for mprotect() et al.  (Lorenzo Stoakes, 1 file, -0/+60)
mprotect() and other functions which change VMA parameters over a range each employ a pattern of:- 1. Attempt to merge the range with adjacent VMAs. 2. If this fails, and the range spans a subset of the VMA, split it accordingly. This is open-coded and duplicated in each case. Also in each case most of the parameters passed to vma_merge() remain the same. Create a new function, vma_modify(), which abstracts this operation, accepting only those parameters which can be changed. To avoid the mess of invoking each function call with unnecessary parameters, create inline wrapper functions for each of the modify operations, parameterised only by what is required to perform the action. We can also significantly simplify the logic - by returning the VMA if we split (or merged VMA if we do not) we no longer need specific handling for merge/split cases in any of the call sites. Note that the userfaultfd_release() case works even though it does not split VMAs - since start is set to vma->vm_start and end is set to vma->vm_end, the split logic does not trigger. In addition, since we calculate pgoff to be equal to vma->vm_pgoff + (start - vma->vm_start) >> PAGE_SHIFT, and start - vma->vm_start will be 0 in this instance, this invocation will remain unchanged. We eliminate a VM_WARN_ON() in mprotect_fixup() as this simply asserts that vma_merge() correctly ensures that flags remain the same, something that is already checked in is_mergeable_vma() and elsewhere, and in any case is not specific to mprotect(). Link: https://lkml.kernel.org/r/0dfa9368f37199a423674bf0ee312e8ea0619044.1697043508.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Liam R. Howlett <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
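As an illustration of the inline wrappers mentioned above, a vma_modify_flags()-style helper could look roughly like this (a sketch based on the description; parameter order and types are assumptions, not quoted from the patch):

/* Only the flags change; mempolicy, uffd context and anon name are
 * taken unchanged from the existing VMA.
 */
static inline struct vm_area_struct
*vma_modify_flags(struct vma_iterator *vmi, struct vm_area_struct *prev,
		  struct vm_area_struct *vma, unsigned long start,
		  unsigned long end, unsigned long new_flags)
{
	return vma_modify(vmi, prev, vma, start, end, new_flags,
			  vma_policy(vma), vma->vm_userfaultfd_ctx,
			  anon_vma_name(vma));
}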
2023-10-18  mm: move vma_policy() and anon_vma_name() decls to mm_types.h  (Lorenzo Stoakes, 3 files, -23/+28)
Patch series "Abstract vma_merge() and split_vma()", v4. The vma_merge() interface is very confusing and its implementation has led to numerous bugs as a result of that confusion. In addition there is duplication both in invocation of vma_merge(), but also in the common mprotect()-style pattern of attempting a merge, then if this fails, splitting the portion of a VMA about to have its attributes changed. This pattern has been copy/pasted around the kernel in each instance where such an operation has been required, each very slightly modified from the last to make it even harder to decipher what is going on. Simplify the whole thing by dividing the actual uses of vma_merge() and split_vma() into specific and abstracted functions and de-duplicate the vma_merge()/split_vma() pattern altogether. Doing so also opens the door to changing how vma_merge() is implemented - by knowing precisely what cases a caller is invoking rather than having a central interface where anything might happen we can untangle the brittle and confusing vma_merge() implementation into something more workable. For mprotect()-like cases we introduce vma_modify() which performs the vma_merge()/split_vma() pattern, returning a pointer to either the merged or split VMA or an ERR_PTR(err) if the splits fail. We provide a number of inline helper functions to make things even clearer:- * vma_modify_flags() - Prepare to modify the VMA's flags. * vma_modify_flags_name() - Prepare to modify the VMA's flags/anon_vma_name * vma_modify_policy() - Prepare to modify the VMA's mempolicy. * vma_modify_flags_uffd() - Prepare to modify the VMA's flags/uffd context. For cases where a new VMA is attempted to be merged with adjacent VMAs we add:- * vma_merge_new_vma() - Prepare to merge a new VMA. * vma_merge_extend() - Prepare to extend the end of a new VMA. This patch (of 5): The vma_policy() define is a helper specifically for a VMA field so it makes sense to host it in the memory management types header. The anon_vma_name(), anon_vma_name_alloc() and anon_vma_name_free() functions are a little out of place in mm_inline.h as they define external functions, and so it makes sense to locate them in mm_types.h. The purpose of these relocations is to make it possible to abstract static inline wrappers which invoke both of these helpers. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/24bfc6c9e382fffbcb0ea8d424392c27d56cc8ca.1697043508.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-18  sched: remove wait bookmarks  (Matthew Wilcox (Oracle), 1 file, -6/+3)
There are no users of wait bookmarks left, so simplify the wait code by removing them. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Acked-by: Ingo Molnar <[email protected]> Cc: Benjamin Segall <[email protected]> Cc: Bin Lai <[email protected]> Cc: Daniel Bristot de Oliveira <[email protected]> Cc: Dietmar Eggemann <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt (Google) <[email protected]> Cc: Valentin Schneider <[email protected]> Cc: Vincent Guittot <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-18  mm: add printf attribute to shrinker_debugfs_name_alloc  (Lucy Mielke, 1 file, -0/+1)
This fixes a compiler warning when compiling an allyesconfig with W=1: mm/internal.h:1235:9: error: function might be a candidate for `gnu_printf' format attribute [-Werror=suggest-attribute=format] [[email protected]: fix shrinker_alloc() as welll per Qi Zheng] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/ZSBue-3kM6gI6jCr@mainframe Fixes: c42d50aefd17 ("mm: shrinker: add infrastructure for dynamically allocating shrinker") Signed-off-by: Lucy Mielke <[email protected]> Cc: Qi Zheng <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-18  hugetlb: memcg: account hugetlb-backed memory in memory controller  (Nhat Pham, 2 files, -0/+14)
Currently, hugetlb memory usage is not accounted for in the memory controller, which could lead to memory overprotection for cgroups with hugetlb-backed memory. This has been observed in our production system. For instance, here is one of our use cases: suppose there are two 32G containers. The machine is booted with hugetlb_cma=6G, and each container may or may not use up to 3 gigantic pages, depending on the workload within it. The rest is anon, cache, slab, etc. We can set the hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness. But it is very difficult to configure memory.max to keep overall consumption, including anon, cache, slab etc. fair. What we have had to resort to is to constantly poll hugetlb usage and readjust memory.max. Similar procedure is done to other memory limits (memory.low for e.g). However, this is rather cumbersome and buggy. Furthermore, when there is a delay in memory limits correction, (for e.g when hugetlb usage changes within consecutive runs of the userspace agent), the system could be in an over/underprotected state. This patch rectifies this issue by charging the memcg when the hugetlb folio is utilized, and uncharging when the folio is freed (analogous to the hugetlb controller). Note that we do not charge when the folio is allocated to the hugetlb pool, because at this point it is not owned by any memcg. Some caveats to consider: * This feature is only available on cgroup v2. * There is no hugetlb pool management involved in the memory controller. As stated above, hugetlb folios are only charged towards the memory controller when it is used. Host overcommit management has to consider it when configuring hard limits. * Failure to charge towards the memcg results in SIGBUS. This could happen even if the hugetlb pool still has pages (but the cgroup limit is hit and reclaim attempt fails). * When this feature is enabled, hugetlb pages contribute to memory reclaim protection. low, min limits tuning must take into account hugetlb memory. * Hugetlb pages utilized while this option is not selected will not be tracked by the memory controller (even if cgroup v2 is remounted later on). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nhat Pham <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Frank van der Linden <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Muchun Song <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Tejun heo <[email protected]> Cc: Yosry Ahmed <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-18  memcontrol: only transfer the memcg data for migration  (Nhat Pham, 1 file, -0/+7)
For most migration use cases, only transfer the memcg data from the old folio to the new folio, and clear the old folio's memcg data. No charging and uncharging will be done. This shaves off some work on the migration path, and avoids the temporary double charging of a folio during its migration. The only exception is replace_page_cache_folio(), which will use the old mem_cgroup_migrate() (now renamed to mem_cgroup_replace_folio). In that context, the isolation of the old page isn't quite as thorough as with migration, so we cannot use our new implementation directly. This patch is the result of the following discussion on the new hugetlb memcg accounting behavior: https://lore.kernel.org/lkml/20231003171329.GB314430@monkey/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nhat Pham <[email protected]> Suggested-by: Johannes Weiner <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Frank van der Linden <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Muchun Song <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Tejun heo <[email protected]> Cc: Yosry Ahmed <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-18  memcontrol: add helpers for hugetlb memcg accounting  (Nhat Pham, 1 file, -0/+21)
Patch series "hugetlb memcg accounting", v4. Currently, hugetlb memory usage is not acounted for in the memory controller, which could lead to memory overprotection for cgroups with hugetlb-backed memory. This has been observed in our production system. For instance, here is one of our usecases: suppose there are two 32G containers. The machine is booted with hugetlb_cma=6G, and each container may or may not use up to 3 gigantic page, depending on the workload within it. The rest is anon, cache, slab, etc. We can set the hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb fairness. But it is very difficult to configure memory.max to keep overall consumption, including anon, cache, slab etcetera fair. What we have had to resort to is to constantly poll hugetlb usage and readjust memory.max. Similar procedure is done to other memory limits (memory.low for e.g). However, this is rather cumbersome and buggy. Furthermore, when there is a delay in memory limits correction, (for e.g when hugetlb usage changes within consecutive runs of the userspace agent), the system could be in an over/underprotected state. This patch series rectifies this issue by charging the memcg when the hugetlb folio is allocated, and uncharging when the folio is freed. In addition, a new selftest is added to demonstrate and verify this new behavior. This patch (of 4): This patch exposes charge committing and cancelling as parts of the memory controller interface. These functionalities are useful when the try_charge() and commit_charge() stages have to be separated by other actions in between (which can fail). One such example is the new hugetlb accounting behavior in the following patch. The patch also adds a helper function to obtain a reference to the current task's memcg. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nhat Pham <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Frank van der Linden <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Muchun Song <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Tejun heo <[email protected]> Cc: Yosry Ahmed <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-18  mm, hugetlb: remove HUGETLB_CGROUP_MIN_ORDER  (Frank van der Linden, 1 file, -11/+0)
Originally, hugetlb_cgroup was the only hugetlb user of tail page structure fields. So, the code defined and checked against HUGETLB_CGROUP_MIN_ORDER to make sure pages weren't too small to use. However, by now, tail page #2 is used to store hugetlb hwpoison and subpool information as well. In other words, without that tail page hugetlb doesn't work. Acknowledge this fact by getting rid of HUGETLB_CGROUP_MIN_ORDER and checks against it. Instead, just check for the minimum viable page order at hstate creation time. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Frank van der Linden <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-18  mm: add folio_xor_flags_has_waiters()  (Matthew Wilcox (Oracle), 1 file, -0/+19)
Optimise folio_end_read() by setting the uptodate bit at the same time we clear the unlock bit. This saves at least one memory barrier and one write-after-write hazard. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Albert Ou <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Andreas Dilger <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: Matt Turner <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Vasily Gorbik <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-18  mm: delete checks for xor_unlock_is_negative_byte()  (Matthew Wilcox (Oracle), 2 files, -6/+0)
Architectures which don't define their own use the one in asm-generic/bitops/lock.h. Get rid of all the ifdefs around "maybe we don't have it". Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Cc: Albert Ou <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Andreas Dilger <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: Matt Turner <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Vasily Gorbik <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-18  bitops: add xor_unlock_is_negative_byte()  (Matthew Wilcox (Oracle), 2 files, -28/+20)
Replace clear_bit_and_unlock_is_negative_byte() with xor_unlock_is_negative_byte(). We have a few places that like to lock a folio, set a flag and unlock it again. Allow for the possibility of combining the latter two operations for efficiency. We are guaranteed that the caller holds the lock, so it is safe to unlock it with the xor. The caller must guarantee that nobody else will set the flag without holding the lock; it is not safe to do this with the PG_dirty flag, for example. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Albert Ou <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Andreas Dilger <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: Matt Turner <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Vasily Gorbik <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
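The asm-generic form of the new helper is roughly the following (hedged sketch; the exact atomic accessor used in asm-generic/bitops/lock.h may differ):

static inline bool xor_unlock_is_negative_byte(unsigned long mask,
					       volatile unsigned long *p)
{
	/* release-ordered XOR flips the flag bits and publishes the unlock */
	long old = atomic_long_fetch_xor_release(mask, (atomic_long_t *)p);

	/* report whether bit 7 of the low byte (e.g. PG_waiters) was set */
	return !!(old & BIT(7));
}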
2023-10-18  mm: add folio_end_read()  (Matthew Wilcox (Oracle), 1 file, -0/+1)
Provide a function for filesystems to call when they have finished reading an entire folio. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Albert Ou <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Andreas Dilger <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: Matt Turner <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Vasily Gorbik <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-18  mm/gup: adapt get_user_page_vma_remote() to never return NULL  (Lorenzo Stoakes, 1 file, -3/+9)
get_user_pages_remote() will never return 0 except in the case of FOLL_NOWAIT being specified, which we explicitly disallow. This simplifies error handling for the caller and avoids the awkwardness of dealing with both errors and failing to pin. Failing to pin here is an error. Link: https://lkml.kernel.org/r/00319ce292d27b3aae76a0eb220ce3f528187508.1696288092.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <[email protected]> Suggested-by: Arnd Bergmann <[email protected]> Reviewed-by: Arnd Bergmann <[email protected]> Acked-by: Catalin Marinas <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Cc: Adrian Hunter <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: John Hubbard <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Richard Cochran <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-18  mm: make __access_remote_vm() static  (Lorenzo Stoakes, 1 file, -2/+0)
Patch series "various improvements to the GUP interface", v2. A series of fixes to simplify and improve the GUP interface with an eye to providing groundwork to future improvements:- * __access_remote_vm() and access_remote_vm() are functionally identical, so make the former static such that in future we can potentially change the external-facing implementation details of this function. * Extend is_valid_gup_args() to cover the missing FOLL_TOUCH case, and simplify things by defining INTERNAL_GUP_FLAGS to check against. * Adjust __get_user_pages_locked() to explicitly treat a failure to pin any pages as an error in all circumstances other than FOLL_NOWAIT being specified, bringing it in line with the nommu implementation of this function. * (With many thanks to Arnd who suggested this in the first instance) Update get_user_page_vma_remote() to explicitly only return a page or an error, simplifying the interface and avoiding the questionable IS_ERR_OR_NULL() pattern. This patch (of 4): access_remote_vm() passes through parameters to __access_remote_vm() directly, so remove the __access_remote_vm() function from mm.h and use access_remote_vm() in the one caller that needs it (ptrace_access_vm()). This allows future adjustments to the GUP-internal __access_remote_vm() function while keeping the access_remote_vm() function stable. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/f7877c5039ce1c202a514a8aeeefc5cdd5e32d19.1696288092.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Arnd Bergmann <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Cc: Adrian Hunter <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: John Hubbard <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Richard Cochran <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-18  mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()  (David Hildenbrand, 1 file, -1/+1)
Let's convert it to consume a folio. [[email protected]: fix kerneldoc] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Suren Baghdasaryan <[email protected]> Reviewed-by: Vishal Moola (Oracle) <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Muchun Song <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-18  percpu_counter: extend _limited_add() to negative amounts  (Hugh Dickins, 1 file, -2/+9)
Though tmpfs does not need it, percpu_counter_limited_add() can be twice as useful if it works sensibly with negative amounts (subs) - typically decrements towards a limit of 0 or nearby: as suggested by Dave Chinner. And in the course of that reworking, skip the percpu counter sum if it is already obvious that the limit would be passed: as suggested by Tim Chen. Extend the comment above __percpu_counter_limited_add(), defining the behaviour with positive and negative amounts, allowing negative limits, but not bothering about overflow beyond S64_MAX. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Carlos Maiolino <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Chuck Lever <[email protected]> Cc: Darrick J. Wong <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Jan Kara <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Tim Chen <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-18  shmem,percpu_counter: add _limited_add(fbc, limit, amount)  (Hugh Dickins, 1 file, -0/+23)
Percpu counter's compare and add are separate functions: without locking around them (which would defeat their purpose), it has been possible to overflow the intended limit. Imagine all the other CPUs fallocating tmpfs huge pages to the limit, in between this CPU's compare and its add. I have not seen reports of that happening; but tmpfs's recent addition of dquot_alloc_block_nodirty() in between the compare and the add makes it even more likely, and I'd be uncomfortable to leave it unfixed. Introduce percpu_counter_limited_add(fbc, limit, amount) to prevent it. I believe this implementation is correct, and slightly more efficient than the combination of compare and add (taking the lock once rather than twice when nearing full - the last 128MiB of a tmpfs volume on a machine with 128 CPUs and 4KiB pages); but it does beg for a better design - when nearing full, there is no new batching, but the costly percpu counter sum across CPUs still has to be done, while locked. Follow __percpu_counter_sum()'s example, including cpu_dying_mask as well as cpu_online_mask: but shouldn't __percpu_counter_compare() and __percpu_counter_limited_add() then be adding a num_dying_cpus() to num_online_cpus(), when they calculate the maximum which could be held across CPUs? But the times when it matters would be vanishingly rare. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Tim Chen <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Darrick J. Wong <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Carlos Maiolino <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Chuck Lever <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
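A hedged usage sketch in the spirit of the tmpfs block accounting the primitive was written for (names of the shmem fields are assumptions, not quoted from mm/shmem.c):

/* Try to account 'pages' blocks; percpu_counter_limited_add() returns
 * false, leaving the counter untouched, if the add would exceed the limit.
 */
if (!percpu_counter_limited_add(&sbinfo->used_blocks,
				sbinfo->max_blocks, pages))
	return -ENOSPC;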