blaster4385/linux-IllusionX - Linux kernel with personal config changes for arch linux

Age	Commit message (Collapse)	Author	Files	Lines
2021-09-03	io_uring: reexpand under-reexpanded iters	Pavel Begunkov	1	-0/+2
	[ 74.211232] BUG: KASAN: stack-out-of-bounds in iov_iter_revert+0x809/0x900 [ 74.212778] Read of size 8 at addr ffff888025dc78b8 by task syz-executor.0/828 [ 74.214756] CPU: 0 PID: 828 Comm: syz-executor.0 Not tainted 5.14.0-rc3-next-20210730 #1 [ 74.216525] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 74.219033] Call Trace: [ 74.219683] dump_stack_lvl+0x8b/0xb3 [ 74.220706] print_address_description.constprop.0+0x1f/0x140 [ 74.224226] kasan_report.cold+0x7f/0x11b [ 74.226085] iov_iter_revert+0x809/0x900 [ 74.227960] io_write+0x57d/0xe40 [ 74.232647] io_issue_sqe+0x4da/0x6a80 [ 74.242578] __io_queue_sqe+0x1ac/0xe60 [ 74.245358] io_submit_sqes+0x3f6e/0x76a0 [ 74.248207] __do_sys_io_uring_enter+0x90c/0x1a20 [ 74.257167] do_syscall_64+0x3b/0x90 [ 74.257984] entry_SYSCALL_64_after_hwframe+0x44/0xae old_size = iov_iter_count(); ... iov_iter_revert(old_size - iov_iter_count()); If iov_iter_revert() is done base on the initial size as above, and the iter is truncated and not reexpanded in the middle, it miscalculates borders causing problems. This trace is due to no one reexpanding after generic_write_checks(). Now iters store how many bytes has been truncated, so reexpand them to the initial state right before reverting. Cc: [email protected] Reported-by: Palash Oswal <[email protected]> Reported-by: Sudip Mukherjee <[email protected]> Reported-and-tested-by: [email protected] Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Al Viro <[email protected]>
2021-09-03	Merge tag 'kbuild-v5.15' of ↵	Linus Torvalds	3	-3/+3
	git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Add -s option (strict mode) to merge_config.sh to make it fail when any symbol is redefined. - Show a warning if a different compiler is used for building external modules. - Infer --target from ARCH for CC=clang to let you cross-compile the kernel without CROSS_COMPILE. - Make the integrated assembler default (LLVM_IAS=1) for CC=clang. - Add <linux/stdarg.h> to the kernel source instead of borrowing <stdarg.h> from the compiler. - Add Nick Desaulniers as a Kbuild reviewer. - Drop stale cc-option tests. - Fix the combination of CONFIG_TRIM_UNUSED_KSYMS and CONFIG_LTO_CLANG to handle symbols in inline assembly. - Show a warning if 'FORCE' is missing for if_changed rules. - Various cleanups * tag 'kbuild-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (39 commits) kbuild: redo fake deps at include/ksym/.h kbuild: clean up objtool_args slightly modpost: get the .mod file path more simply checkkconfigsymbols.py: Fix the '--ignore' option kbuild: merge vmlinux_link() between ARCH=um and other architectures kbuild: do not remove 'linux' link in scripts/link-vmlinux.sh kbuild: merge vmlinux_link() between the ordinary link and Clang LTO kbuild: remove stale *.symversions kbuild: remove unused quiet_cmd_update_lto_symversions gen_compile_commands: extract compiler command from a series of commands x86: remove cc-option-yn test for -mtune= arc: replace cc-option-yn uses with cc-option s390: replace cc-option-yn uses with cc-option ia64: move core-y in arch/ia64/Makefile to arch/ia64/Kbuild sparc: move the install rule to arch/sparc/Makefile security: remove unneeded subdir-$(CONFIG_...) kbuild: sh: remove unused install script kbuild: Fix 'no symbols' warning when CONFIG_TRIM_UNUSD_KSYMS=y kbuild: Switch to 'f' variants of integrated assembler flag kbuild: Shuffle blank line to improve comment meaning ...
2021-09-03	Merge branch 'akpm' (patches from Andrew)	Linus Torvalds	15	-92/+116
	Merge misc updates from Andrew Morton: "173 patches. Subsystems affected by this series: ia64, ocfs2, block, and mm (debug, pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap, bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock, oom-kill, migration, ksm, percpu, vmstat, and madvise)" * emailed patches from Andrew Morton <[email protected]>: (173 commits) mm/madvise: add MADV_WILLNEED to process_madvise() mm/vmstat: remove unneeded return value mm/vmstat: simplify the array size calculation mm/vmstat: correct some wrong comments mm/percpu,c: remove obsolete comments of pcpu_chunk_populated() selftests: vm: add COW time test for KSM pages selftests: vm: add KSM merging time test mm: KSM: fix data type selftests: vm: add KSM merging across nodes test selftests: vm: add KSM zero page merging test selftests: vm: add KSM unmerge test selftests: vm: add KSM merge test mm/migrate: correct kernel-doc notation mm: wire up syscall process_mrelease mm: introduce process_mrelease system call memblock: make memblock_find_in_range method private mm/mempolicy.c: use in_task() in mempolicy_slab_node() mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies mm/mempolicy: advertise new MPOL_PREFERRED_MANY mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY ...
2021-09-03	userfaultfd: prevent concurrent API initialization	Nadav Amit	1	-47/+44
	userfaultfd assumes that the enabled features are set once and never changed after UFFDIO_API ioctl succeeded. However, currently, UFFDIO_API can be called concurrently from two different threads, succeed on both threads and leave userfaultfd's features in non-deterministic state. Theoretically, other uffd operations (ioctl's and page-faults) can be dispatched while adversely affected by such changes of features. Moreover, the writes to ctx->state and ctx->features are not ordered, which can - theoretically, again - let userfaultfd_ioctl() think that userfaultfd API completed, while the features are still not initialized. To avoid races, it is arguably best to get rid of ctx->state. Since there are only 2 states, record the API initialization in ctx->features as the uppermost bit and remove ctx->state. Link: https://lkml.kernel.org/r/[email protected] Fixes: 9cd75c3cd4c3d ("userfaultfd: non-cooperative: add ability to report non-PF events from uffd descriptor") Signed-off-by: Nadav Amit <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03	userfaultfd: change mmap_changing to atomic	Nadav Amit	1	-12/+13
	Patch series "userfaultfd: minor bug fixes". Three unrelated bug fixes. The first two addresses possible issues (not too theoretical ones), but I did not encounter them in practice. The third patch addresses a test bug that causes the test to fail on my system. It has been sent before as part of a bigger RFC. This patch (of 3): mmap_changing is currently a boolean variable, which is set and cleared without any lock that protects against concurrent modifications. mmap_changing is supposed to mark whether userfaultfd page-faults handling should be retried since mappings are undergoing a change. However, concurrent calls, for instance to madvise(MADV_DONTNEED), might cause mmap_changing to be false, although the remove event was still not read (hence acknowledged) by the user. Change mmap_changing to atomic_t and increase/decrease appropriately. Add a debug assertion to see whether mmap_changing is negative. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: df2cc96e77011 ("userfaultfd: prevent non-cooperative events vs mcopy_atomic races") Signed-off-by: Nadav Amit <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Peter Xu <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03	mm/pagemap: add mmap_assert_locked() annotations to find_vma*()	Luigi Rizzo	1	-0/+2
	find_vma() and variants need protection when used. This patch adds mmap_assert_lock() calls in the functions. To make sure the invariant is satisfied, we also need to add a mmap_read_lock() around the get_user_pages_remote() call in get_arg_page(). The lock is not strictly necessary because the mm has been newly created, but the extra cost is limited because the same mutex was also acquired shortly before in __bprm_mm_init(), so it is hot and uncontended. [[email protected]: TOMOYO needs the same protection which get_arg_page() needs] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Luigi Rizzo <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03	mm: remove flush_kernel_dcache_page	Christoph Hellwig	1	-3/+3
	flush_kernel_dcache_page is a rather confusing interface that implements a subset of flush_dcache_page by not being able to properly handle page cache mapped pages. The only callers left are in the exec code as all other previous callers were incorrect as they could have dealt with page cache pages. Replace the calls to flush_kernel_dcache_page with calls to flush_dcache_page, which for all architectures does either exactly the same thing, can contains one or more of the following: 1) an optimization to defer the cache flush for page cache pages not mapped into userspace 2) additional flushing for mapped page cache pages if cache aliases are possible Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Linus Torvalds <[email protected]> Reviewed-by: Ira Weiny <[email protected]> Cc: Alex Shi <[email protected]> Cc: Geoff Levand <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Guo Ren <[email protected]> Cc: Helge Deller <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Nick Hu <[email protected]> Cc: Paul Cercueil <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Ulf Hansson <[email protected]> Cc: Vincent Chen <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03	memcg: enable accounting for new namesapces and struct nsproxy	Vasily Averin	1	-1/+1
	Container admin can create new namespaces and force kernel to allocate up to several pages of memory for the namespaces and its associated structures. Net and uts namespaces have enabled accounting for such allocations. It makes sense to account for rest ones to restrict the host's memory consumption from inside the memcg-limited container. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vasily Averin <[email protected]> Acked-by: Serge Hallyn <[email protected]> Acked-by: Christian Brauner <[email protected]> Acked-by: Kirill Tkhai <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Andrei Vagin <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dmitry Safonov <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "J. Bruce Fields" <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Yutian Yang <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03	memcg: enable accounting for fasync_cache	Vasily Averin	1	-1/+2
	fasync_struct is used by almost all character device drivers to set up the fasync queue, and for regular files by the file lease code. This structure is quite small but long-living and it can be assigned for any open file. It makes sense to account for its allocations to restrict the host's memory consumption from inside the memcg-limited container. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vasily Averin <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Andrei Vagin <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Dmitry Safonov <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "J. Bruce Fields" <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Serge Hallyn <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Yutian Yang <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03	memcg: enable accounting for file lock caches	Vasily Averin	1	-2/+4
	User can create file locks for each open file and force kernel to allocate small but long-living objects per each open file. It makes sense to account for these objects to limit the host's memory consumption from inside the memcg-limited container. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vasily Averin <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Andrei Vagin <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Dmitry Safonov <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "J. Bruce Fields" <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Serge Hallyn <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Yutian Yang <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03	memcg: enable accounting for pollfd and select bits arrays	Vasily Averin	1	-2/+2
	User can call select/poll system calls with a large number of assigned file descriptors and force kernel to allocate up to several pages of memory till end of these sleeping system calls. We have here long-living unaccounted per-task allocations. It makes sense to account for these allocations to restrict the host's memory consumption from inside the memcg-limited container. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vasily Averin <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Andrei Vagin <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Dmitry Safonov <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "J. Bruce Fields" <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Serge Hallyn <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Yutian Yang <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03	memcg: enable accounting for mnt_cache entries	Vasily Averin	1	-2/+3
	Patch series "memcg accounting from OpenVZ", v7. OpenVZ uses memory accounting 20+ years since v2.2.x linux kernels. Initially we used our own accounting subsystem, then partially committed it to upstream, and a few years ago switched to cgroups v1. Now we're rebasing again, revising our old patches and trying to push them upstream. We try to protect the host system from any misuse of kernel memory allocation triggered by untrusted users inside the containers. Patch-set is addressed mostly to cgroups maintainers and cgroups@ mailing list, though I would be very grateful for any comments from maintainersi of affected subsystems or other people added in cc: Compared to the upstream, we additionally account the following kernel objects: - network devices and its Tx/Rx queues - ipv4/v6 addresses and routing-related objects - inet_bind_bucket cache objects - VLAN group arrays - ipv6/sit: ip_tunnel_prl - scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets - nsproxy and namespace objects itself - IPC objects: semaphores, message queues and share memory segments - mounts - pollfd and select bits arrays - signals and posix timers - file lock - fasync_struct used by the file lease code and driver's fasync queues - tty objects - per-mm LDT We have an incorrect/incomplete/obsoleted accounting for few other kernel objects: sk_filter, af_packets, netlink and xt_counters for iptables. They require rework and probably will be dropped at all. Also we're going to add an accounting for nft, however it is not ready yet. We have not tested performance on upstream, however, our performance team compares our current RHEL7-based production kernel and reports that they are at least not worse as the according original RHEL7 kernel. This patch (of 10): The kernel allocates ~400 bytes of 'struct mount' for any new mount. Creating a new mount namespace clones most of the parent mounts, and this can be repeated many times. Additionally, each mount allocates up to PATH_MAX=4096 bytes for mnt->mnt_devname. It makes sense to account for these allocations to restrict the host's memory consumption from inside the memcg-limited container. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vasily Averin <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Acked-by: Christian Brauner <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Yutian Yang <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Andrei Vagin <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dmitry Safonov <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "J. Bruce Fields" <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Serge Hallyn <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Zefan Li <[email protected]> Cc: Borislav Petkov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03	memcg: charge fs_context and legacy_fs_context	Yutian Yang	1	-2/+2
	This patch adds accounting flags to fs_context and legacy_fs_context allocation sites so that kernel could correctly charge these objects. We have written a PoC to demonstrate the effect of the missing-charging bugs. The PoC takes around 1,200MB unaccounted memory, while it is charged for only 362MB memory usage. We evaluate the PoC on QEMU x86_64 v5.2.90 + Linux kernel v5.10.19 + Debian buster. All the limitations including ulimits and sysctl variables are set as default. Specifically, the hard NOFILE limit and nr_open in sysctl are both 1,048,576. /------------------------- POC code ----------------------------/ #define _GNU_SOURCE #include <sys/types.h> #include <sys/file.h> #include <time.h> #include <sys/wait.h> #include <stdint.h> #include <stdlib.h> #include <unistd.h> #include <stdio.h> #include <signal.h> #include <sched.h> #include <fcntl.h> #include <linux/mount.h> #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \ } while (0) #define STACK_SIZE (8 * 1024) #ifndef __NR_fsopen #define __NR_fsopen 430 #endif static inline int fsopen(const char fs_name, unsigned int flags) { return syscall(__NR_fsopen, fs_name, flags); } static char thread_stack[512][STACK_SIZE]; int thread_fn(void arg) { for (int i = 0; i< 800000; ++i) { int fsfd = fsopen("nfs", FSOPEN_CLOEXEC); if (fsfd == -1) { errExit("fsopen"); } } while(1); return 0; } int main(int argc, char argv[]) { int thread_pid; for (int i = 0; i < 1; ++i) { thread_pid = clone(thread_fn, thread_stack[i] + STACK_SIZE, \ SIGCHLD, NULL); } while(1); return 0; } /-------------------------- end --------------------------------*/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yutian Yang <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03	fs, mm: fix race in unlinking swapfile	Hugh Dickins	1	-1/+7
	We had a recurring situation in which admin procedures setting up swapfiles would race with test preparation clearing away swapfiles; and just occasionally that got stuck on a swapfile "(deleted)" which could never be swapped off. That is not supposed to be possible. 2.6.28 commit f9454548e17c ("don't unlink an active swapfile") admitted that it was leaving a race window open: now close it. may_delete() makes the IS_SWAPFILE check (amongst many others) before inode_lock has been taken on target: now repeat just that simple check in vfs_unlink() and vfs_rename(), after taking inode_lock. Which goes most of the way to fixing the race, but swapon() must also check after it acquires inode_lock, that the file just opened has not already been unlinked. Link: https://lkml.kernel.org/r/[email protected] Fixes: f9454548e17c ("don't unlink an active swapfile") Signed-off-by: Hugh Dickins <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03	mm/gup: remove try_get_page(), call try_get_compound_head() directly	John Hubbard	1	-1/+1
	try_get_page() is very similar to try_get_compound_head(), and in fact try_get_page() has fallen a little behind in terms of maintenance: try_get_compound_head() handles speculative page references more thoroughly. There are only two try_get_page() callsites, so just call try_get_compound_head() directly from those, and remove try_get_page() entirely. Also, seeing as how this changes try_get_compound_head() into a non-static function, provide some kerneldoc documentation for it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: John Hubbard <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Vasily Gorbik <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03	writeback: memcg: simplify cgroup_writeback_by_id	Shakeel Butt	1	-11/+9
	Currently cgroup_writeback_by_id calls mem_cgroup_wb_stats() to get dirty pages for a memcg. However mem_cgroup_wb_stats() does a lot more than just get the number of dirty pages. Just directly get the number of dirty pages instead of calling mem_cgroup_wb_stats(). Also cgroup_writeback_by_id() is only called for best-effort dirty flushing, so remove the unused 'nr' parameter and no need to explicitly flush memcg stats. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03	fs: inode: count invalidated shadow pages in pginodesteal	Johannes Weiner	1	-1/+1
	pginodesteal is supposed to capture the impact that inode reclaim has on the page cache state. Currently, it doesn't consider shadow pages that get dropped this way, even though this can have a significant impact on paging behavior, memory pressure calculations etc. To improve visibility into these effects, make sure shadow pages get counted when they get dropped through inode reclaim. This changes the return value semantics of invalidate_mapping_pages() semantics slightly, but the only two users are the inode shrinker itsel and a usb driver that logs it for debugging purposes. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03	fs: drop_caches: fix skipping over shadow cache inodes	Johannes Weiner	1	-1/+2
	When drop_caches truncates the page cache in an inode it also includes any shadow entries for evicted pages. However, there is a preliminary check on whether the inode has pages: if it has only shadow entries, it will skip running truncation on the inode and leave it behind. Fix the check to mapping_empty(), such that it runs truncation on any inode that has cache entries at all. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Reported-by: Roman Gushchin <[email protected]> Acked-by: Roman Gushchin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03	writeback: reliably update bandwidth estimation	Jan Kara	1	-3/+0
	Currently we trigger writeback bandwidth estimation from balance_dirty_pages() and from wb_writeback(). However neither of these need to trigger when the system is relatively idle and writeback is triggered e.g. from fsync(2). Make sure writeback estimates happen reliably by triggering them from do_writepages(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Cc: Michael Stapelberg <[email protected]> Cc: Wu Fengguang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03	writeback: track number of inodes under writeback	Jan Kara	1	-0/+5
	Patch series "writeback: Fix bandwidth estimates", v4. Fix estimate of writeback throughput when device is not fully busy doing writeback. Michael Stapelberg has reported that such workload (e.g. generated by linking) tends to push estimated throughput down to 0 and as a result writeback on the device is practically stalled. The first three patches fix the reported issue, the remaining two patches are unrelated cleanups of problems I've noticed when reading the code. This patch (of 4): Track number of inodes under writeback for each bdi_writeback structure. We will use this to decide whether wb does any IO and so we can estimate its writeback throughput. In principle we could use number of pages under writeback (WB_WRITEBACK counter) for this however normal percpu counter reads are too inaccurate for our purposes and summing the counter is too expensive. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Cc: Wu Fengguang <[email protected]> Cc: Michael Stapelberg <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03	ocfs2: ocfs2_downconvert_lock failure results in deadlock	Gang He	1	-0/+12
	Usually, ocfs2_downconvert_lock() function always downconverts dlm lock to the expected level for satisfy dlm bast requests from the other nodes. But there is a rare situation. When dlm lock conversion is being canceled, ocfs2_downconvert_lock() function will return -EBUSY. You need to be aware that ocfs2_cancel_convert() function is asynchronous in fsdlm implementation. If we does not requeue this lockres entry, ocfs2 downconvert thread no longer handles this dlm lock bast request. Then, the other nodes will not get the dlm lock again, the current node's process will be blocked when acquire this dlm lock again. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Gang He <[email protected]> Reviewed-by: Joseph Qi <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Junxiao Bi <[email protected]> Cc: Changwei Ge <[email protected]> Cc: Gang He <[email protected]> Cc: Jun Piao <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03	ocfs2: quota_local: fix possible uninitialized-variable access in ↵	Tuo Li	2	-1/+2
	ocfs2_local_read_info() A memory block is allocated through kmalloc(), and its return value is assigned to the pointer oinfo. However, oinfo->dqi_gqinode is not initialized but it is accessed in: iput(oinfo->dqi_gqinode); To fix this possible uninitialized-variable access, assign NULL to oinfo->dqi_gqinode, and add ocfs2_qinfo_lock_res_init() behind the assignment in ocfs2_local_read_info(). Remove ocfs2_qinfo_lock_res_init() in ocfs2_global_read_info(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Tuo Li <[email protected]> Reported-by: TOTE Robot <[email protected]> Reviewed-by: Joseph Qi <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Junxiao Bi <[email protected]> Cc: Changwei Ge <[email protected]> Cc: Gang He <[email protected]> Cc: Jun Piao <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03	ocfs2: remove an unnecessary condition	Dan Carpenter	1	-1/+1
	The case where "tmp_oh" is NULL is handled at the start of the function. At this point we know it's non-NULL so this will always return 1. Link: https://lkml.kernel.org/r/YOcItgIXtisi3MaO@mwanda Signed-off-by: Dan Carpenter <[email protected]> Reviewed-by: Joseph Qi <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Junxiao Bi <[email protected]> Cc: Changwei Ge <[email protected]> Cc: Gang He <[email protected]> Cc: Jun Piao <[email protected]> Cc: Larry Chen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-09-03	mm: remove VM_DENYWRITE	David Hildenbrand	1	-1/+0
	All in-tree users of MAP_DENYWRITE are gone. MAP_DENYWRITE cannot be set from user space, so all users are gone; let's remove it. Acked-by: "Eric W. Biederman" <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: David Hildenbrand <[email protected]>
2021-09-03	binfmt: remove in-tree usage of MAP_DENYWRITE	David Hildenbrand	3	-6/+5
	At exec time when we mmap the new executable via MAP_DENYWRITE we have it opened via do_open_execat() and already deny_write_access()'ed the file successfully. Once exec completes, we allow_write_acces(); however, we set mm->exe_file in begin_new_exec() via set_mm_exe_file() and also deny_write_access() as long as mm->exe_file remains set. We'll effectively deny write access to our executable via mm->exe_file until mm->exe_file is changed -- when the process is removed, on new exec, or via sys_prctl(PR_SET_MM_MAP/EXE_FILE). Let's remove all usage of MAP_DENYWRITE, it's no longer necessary for mm->exe_file. In case of an elf interpreter, we'll now only deny write access to the file during exec. This is somewhat okay, because the interpreter behaves (and sometime is) a shared library; all shared libraries, especially the ones loaded directly in user space like via dlopen() won't ever be mapped via MAP_DENYWRITE, because we ignore that from user space completely; these shared libraries can always be modified while mapped and executed. Let's only special-case the main executable, denying write access while being executed by a process. This can be considered a minor user space visible change. While this is a cleanup, it also fixes part of a problem reported with VM_DENYWRITE on overlayfs, as VM_DENYWRITE is effectively unused with this patch and will be removed next: "Overlayfs did not honor positive i_writecount on realfile for VM_DENYWRITE mappings." [1] [1] https://lore.kernel.org/r/[email protected]/ Reported-by: Chengguang Xu <[email protected]> Acked-by: "Eric W. Biederman" <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: David Hildenbrand <[email protected]>
2021-09-03	kernel/fork: always deny write access to current MM exe_file	David Hildenbrand	1	-1/+3
	We want to remove VM_DENYWRITE only currently only used when mapping the executable during exec. During exec, we already deny_write_access() the executable, however, after exec completes the VMAs mapped with VM_DENYWRITE effectively keeps write access denied via deny_write_access(). Let's deny write access when setting or replacing the MM exe_file. With this change, we can remove VM_DENYWRITE for mapping executables. Make set_mm_exe_file() return an error in case deny_write_access() fails; note that this should never happen, because exec code does a deny_write_access() early and keeps write access denied when calling set_mm_exe_file. However, it makes the code easier to read and makes set_mm_exe_file() and replace_mm_exe_file() look more similar. This represents a minor user space visible change: sys_prctl(PR_SET_MM_MAP/EXE_FILE) can now fail if the file is already opened writable. Also, after sys_prctl(PR_SET_MM_MAP/EXE_FILE) the file cannot be opened writable. Note that we can already fail with -EACCES if the file doesn't have execute permissions. Acked-by: "Eric W. Biederman" <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: David Hildenbrand <[email protected]>
2021-09-03	binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib()	David Hildenbrand	2	-2/+2
	uselib() is the legacy systemcall for loading shared libraries. Nowadays, applications use dlopen() to load shared libraries, completely implemented in user space via mmap(). For example, glibc uses MAP_COPY to mmap shared libraries. While this maps to MAP_PRIVATE \| MAP_DENYWRITE on Linux, Linux ignores any MAP_DENYWRITE specification from user space in mmap. With this change, all remaining in-tree users of MAP_DENYWRITE use it to map an executable. We will be able to open shared libraries loaded via uselib() writable, just as we already can via dlopen() from user space. This is one step into the direction of removing MAP_DENYWRITE from the kernel. This can be considered a minor user space visible change. Acked-by: "Eric W. Biederman" <[email protected]> Acked-by: Christian König <[email protected]> Signed-off-by: David Hildenbrand <[email protected]>
2021-09-03	io_uring: fix possible poll event lost in multi shot mode	Xiaoguang Wang	1	-3/+13
	IIUC, IORING_POLL_ADD_MULTI is similar to epoll's edge-triggered mode, that means once one pure poll request returns one event(cqe), we'll need to read or write continually until EAGAIN is returned, then I think there is a possible poll event lost race in multi shot mode: t1 poll request add \| \| t2 \| \| t3 event happens \| \| t4 task work add \| \| t5 \| task work run \| t6 \| commit one cqe \| t7 \| \| user app handles cqe t8 \| new event happen \| t9 \| add back to waitqueue \| t10 \| After t6 but before t9, if new event happens, there'll be no wakeup operation, and if user app has picked up this cqe in t7, read or write until EAGAIN is returned. In t8, new event happens and will be lost, though this race window maybe small. To fix this possible race, add poll request back to waitqueue before committing cqe. Fixes: 88e41cf928a6 ("io_uring: add multishot mode for IORING_OP_POLL_ADD") Signed-off-by: Xiaoguang Wang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-09-03	io_uring: prolong tctx_task_work() with flushing	Pavel Begunkov	1	-0/+3
	io_submit_flush_completions() may enqueue linked requests for task_work execution, so don't leave tctx_task_work() right after the tw list is exhausted, but try to flush and then retry. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/0755d4c2c36301447c63bdd4146c10477cea4249.1630539342.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2021-09-03	io_uring: don't disable kiocb_done() CQE batching	Pavel Begunkov	1	-1/+1
	Not passing issue_flags from kiocb_done() into __io_complete_rw() means that completion batching for this case is disabled, e.g. for most of buffered reads. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/b2689462835c3ee28a5999ef4f9a581e24be04a2.1630539342.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2021-09-03	io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL	Jens Axboe	1	-4/+24
	SQPOLL has a different thread doing submissions, we need to check for that and use the right task context when updating the worker values. Just hold the sqd->lock across the operation, this ensures that the thread cannot go away while we poke at ->io_uring. Link: https://github.com/axboe/liburing/issues/420 Fixes: 2e480058ddc2 ("io-wq: provide a way to limit max number of workers") Reported-by: Johannes Lundberg <[email protected]> Tested-by: Johannes Lundberg <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-09-03	ceph: fix dereference of null pointer cf	Colin Ian King	1	-0/+3
	Currently in the case where kmem_cache_alloc fails the null pointer cf is dereferenced when assigning cf->is_capsnap = false. Fix this by adding a null pointer check and return path. Cc: [email protected] Addresses-Coverity: ("Dereference null return") Fixes: b2f9fa1f3bd8 ("ceph: correctly handle releasing an embedded cap flush") Signed-off-by: Colin Ian King <[email protected]> Reviewed-by: Ilya Dryomov <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2021-09-02	Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi	Linus Torvalds	1	-1/+1
	Pull SCSI updates from James Bottomley: "This series consists of the usual driver updates (ufs, qla2xxx, target, smartpqi, lpfc, mpt3sas). The core change causing the most churn was replacing the command request field request with a macro, allowing us to offset map to it and remove the redundant field; the same was also done for the tag field. The most impactful change is the final removal of scsi_ioctl, which has been deprecated for over a decade" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (293 commits) scsi: ufs: Fix ufshcd_request_sense_async() for Samsung KLUFG8RHDA-B2D1 scsi: ufs: ufs-exynos: Fix static checker warning scsi: mpt3sas: Use the proper SCSI midlayer interfaces for PI scsi: lpfc: Use the proper SCSI midlayer interfaces for PI scsi: lpfc: Copyright updates for 14.0.0.1 patches scsi: lpfc: Update lpfc version to 14.0.0.1 scsi: lpfc: Add bsg support for retrieving adapter cmf data scsi: lpfc: Add cmf_info sysfs entry scsi: lpfc: Add debugfs support for cm framework buffers scsi: lpfc: Add support for maintaining the cm statistics buffer scsi: lpfc: Add rx monitoring statistics scsi: lpfc: Add support for the CM framework scsi: lpfc: Add cmfsync WQE support scsi: lpfc: Add support for cm enablement buffer scsi: lpfc: Add cm statistics buffer support scsi: lpfc: Add EDC ELS support scsi: lpfc: Expand FPIN and RDF receive logging scsi: lpfc: Add MIB feature enablement support scsi: lpfc: Add SET_HOST_DATA mbox cmd to pass date/time info to firmware scsi: fc: Add EDC ELS definition ...
2021-09-02	ceph: drop the mdsc_get_session/put_session dout messages	Jeff Layton	1	-9/+2
	These are very chatty, racy, and not terribly useful. Just remove them. Signed-off-by: Jeff Layton <[email protected]> Reviewed-by: Ilya Dryomov <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2021-09-02	ceph: lockdep annotations for try_nonblocking_invalidate	Jeff Layton	1	-0/+2
	Signed-off-by: Jeff Layton <[email protected]> Reviewed-by: Ilya Dryomov <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2021-09-02	ceph: don't WARN if we're forcibly removing the session caps	Xiubo Li	3	-10/+32
	For example in the case of a forced umount, we'll remove all the session caps even if they are dirty. Move the warning to a wrapper function and make most of the callers use it. Call the core function when removing caps due to a forced umount. Signed-off-by: Xiubo Li <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2021-09-02	ceph: don't WARN if we're force umounting	Xiubo Li	1	-2/+5
	Force umount will try to close the sessions by setting the session state to _CLOSING. We don't want to WARN in this situation, since it's expected. Signed-off-by: Xiubo Li <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2021-09-02	ceph: remove the capsnaps when removing caps	Xiubo Li	3	-18/+87
	capsnaps will take inode references via ihold when queueing to flush. When force unmounting, the client will just close the sessions and may never get a flush reply, causing a leak and inode ref leak. Fix this by removing the capsnaps for an inode when removing the caps. URL: https://tracker.ceph.com/issues/52295 Signed-off-by: Xiubo Li <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2021-09-02	ceph: request Fw caps before updating the mtime in ceph_write_iter	Jeff Layton	1	-15/+17
	The current code will update the mtime and then try to get caps to handle the write. If we end up having to request caps from the MDS, then the mtime in the cap grant will clobber the updated mtime and it'll be lost. This is most noticable when two clients are alternately writing to the same file. Fw caps are continually being granted and revoked, and the mtime ends up stuck because the updated mtimes are always being overwritten with the old one. Fix this by changing the order of operations in ceph_write_iter to get the caps before updating the times. Also, make sure we check the pool full conditions before even getting any caps or uninlining. URL: https://tracker.ceph.com/issues/46574 Reported-by: Jozef Kováč <[email protected]> Signed-off-by: Jeff Layton <[email protected]> Reviewed-by: Xiubo Li <[email protected]> Reviewed-by: Luis Henriques <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2021-09-02	ceph: reconnect to the export targets on new mdsmaps	Xiubo Li	2	-4/+65
	In the case where the export MDS has crashed just after the EImportStart journal is flushed, a standby MDS takes over for it and when replaying the EImportStart journal the MDS will wait the client to reconnect. That may never happen because the client may not have registered or opened the sessions yet. When receiving a new map, ensure we reconnect to valid export targets as well if their sessions don't exist yet. Signed-off-by: Xiubo Li <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2021-09-02	ceph: print more information when we can't find snaprealm	Jeff Layton	1	-6/+5
	Print a bit more information when we can't find the realm during ceph_add_cap. Show both the inode number and the old realm inode number. Suggested-by: Sage Weil <[email protected]> Signed-off-by: Jeff Layton <[email protected]> Reviewed-by: Ilya Dryomov <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2021-09-02	ceph: add ceph_change_snap_realm() helper	Jeff Layton	4	-63/+45
	Consolidate some fiddly code for changing an inode's snap_realm into a new helper function, and change the callers to use it. While we're in here, nothing uses the i_snap_realm_counter field, so remove that from the inode. Signed-off-by: Jeff Layton <[email protected]> Reviewed-by: Luis Henriques <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2021-09-02	ceph: remove redundant initializations from mdsc and session	Jeff Layton	1	-19/+0
	The ceph_mds_client and ceph_mds_session structures are kzalloc'ed so there's no need to explicitly initialize either of their fields to 0. Signed-off-by: Jeff Layton <[email protected]> Reviewed-by: Ilya Dryomov <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2021-09-02	ceph: cancel delayed work instead of flushing on mdsc teardown	Jeff Layton	2	-3/+2
	The first thing metric_delayed_work does is check mdsc->stopping, and then return immediately if it's set. That's good since we would have already torn down the metric structures at this point, otherwise, but there is no locking around mdsc->stopping. It's possible that the ceph_metric_destroy call could race with the delayed_work, in which case we could end up with the delayed_work accessing destroyed percpu variables. At this point in the mdsc teardown, the "stopping" flag has already been set, so there's no benefit to flushing the work. Move the work cancellation in ceph_metric_destroy ahead of the percpu variable destruction, and eliminate the flush_delayed_work call in ceph_mdsc_destroy. Fixes: 18f473b384a6 ("ceph: periodically send perf metrics to MDSes") Signed-off-by: Jeff Layton <[email protected]> Reviewed-by: Xiubo Li <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2021-09-02	ceph: add a new vxattr to return auth mds for an inode	Jeff Layton	1	-0/+19
	Add a new vxattr that shows what MDS is authoritative for an inode (if we happen to have auth caps). If we don't have an auth cap for the inode then just return -1. URL: https://tracker.ceph.com/issues/1276 Signed-off-by: Jeff Layton <[email protected]> Reviewed-by: Luis Henriques <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2021-09-02	ceph: remove some defunct forward declarations	Jeff Layton	1	-6/+0
	We missed these in the recent fscache rework. Reported-by: Steve French <[email protected]> Signed-off-by: Jeff Layton <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2021-09-02	ceph: flush the mdlog before waiting on unsafe reqs	Xiubo Li	1	-0/+76
	For the client requests who will have unsafe and safe replies from MDS daemons, in the MDS side the MDS daemons won't flush the mdlog (journal log) immediatelly, because they think it's unnecessary. That's true for most cases but not all, likes the fsync request. The fsync will wait until all the unsafe replied requests to be safely replied. Normally if there have multiple threads or clients are running, the whole mdlog in MDS daemons could be flushed in time if any request will trigger the mdlog submit thread. So usually we won't experience the normal operations will stuck for a long time. But in case there has only one client with only thread is running, the stuck phenomenon maybe obvious and the worst case it must wait at most 5 seconds to wait the mdlog to be flushed by the MDS's tick thread periodically. This patch will trigger to flush the mdlog in the relevant and auth MDSes to which the in-flight requests are sent just before waiting the unsafe requests to finish. Signed-off-by: Xiubo Li <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2021-09-02	ceph: flush mdlog before umounting	Xiubo Li	3	-0/+27
	Signed-off-by: Xiubo Li <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2021-09-02	ceph: make iterate_sessions a global symbol	Xiubo Li	3	-42/+36
	Signed-off-by: Xiubo Li <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2021-09-02	ceph: make ceph_create_session_msg a global symbol	Xiubo Li	2	-7/+10
	Signed-off-by: Xiubo Li <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>