aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2022-02-22fork: Use IS_ENABLED() in account_kernel_stack()Sebastian Andrzej Siewior1-4/+4
Not strickly needed but checking CONFIG_VMAP_STACK instead of task_stack_vm_area()' result allows the compiler the remove the else path in the CONFIG_VMAP_STACK case where the pointer can't be NULL. Check for CONFIG_VMAP_STACK in order to use the proper path. Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-02-22fork: Only cache the VMAP stack in finish_task_switch()Sebastian Andrzej Siewior1-13/+63
The task stack could be deallocated later, but for fork()/exec() kind of workloads (say a shell script executing several commands) it is important that the stack is released in finish_task_switch() so that in VMAP_STACK case it can be cached and reused in the new task. For PREEMPT_RT it would be good if the wake-up in vfree_atomic() could be avoided in the scheduling path. Far worse are the other free_thread_stack() implementations which invoke __free_pages()/ kmem_cache_free() with disabled preemption. Cache the stack in free_thread_stack() in the VMAP_STACK case and RCU-delay the free path otherwise. Free the stack in the RCU callback. In the VMAP_STACK case this is another opportunity to fill the cache. Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-02-22fork: Move task stack accounting to do_exit()Sebastian Andrzej Siewior2-12/+24
There is no need to perform the stack accounting of the outgoing task in its final schedule() invocation which happens with preemption disabled. The task is leaving, the resources will be freed and the accounting can happen in do_exit() before the actual schedule invocation which frees the stack memory. Move the accounting of the stack memory from release_task_stack() to exit_task_stack_account() which then can be invoked from do_exit(). Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-02-22fork: Move memcg_charge_kernel_stack() into CONFIG_VMAP_STACKSebastian Andrzej Siewior1-33/+36
memcg_charge_kernel_stack() is only used in the CONFIG_VMAP_STACK case. Move memcg_charge_kernel_stack() into the CONFIG_VMAP_STACK block and invoke it from within alloc_thread_stack_node(). Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-02-22fork: Don't assign the stack pointer in dup_task_struct()Sebastian Andrzej Siewior1-31/+16
All four versions of alloc_thread_stack_node() assign now task_struct::stack in case the allocation was successful. Let alloc_thread_stack_node() return an error code instead of the stack pointer and remove the stack assignment in dup_task_struct(). Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-02-22fork, IA64: Provide alloc_thread_stack_node() for IA64Sebastian Andrzej Siewior1-0/+17
Provide a generic alloc_thread_stack_node() for IA64 and CONFIG_ARCH_THREAD_STACK_ALLOCATOR which returns stack pointer and sets task_struct::stack so it behaves exactly like the other implementations. Rename IA64's alloc_thread_stack_node() and add the generic version to the fork code so it is in one place _and_ to drastically lower the chances of fat fingering the IA64 code. Do the same for free_thread_stack(). Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-02-22fork: Duplicate task_struct before stack allocationSebastian Andrzej Siewior1-5/+4
alloc_thread_stack_node() already populates the task_struct::stack member except on IA64. The stack pointer is saved and populated again because IA64 needs it and arch_dup_task_struct() overwrites it. Allocate thread's stack after task_struct has been duplicated as a preparation for further changes. Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-02-22fork: Redo ifdefs around task stack handlingSebastian Andrzej Siewior1-35/+39
The use of ifdef CONFIG_VMAP_STACK is confusing in terms what is actually happenning and what can happen. For instance from reading free_thread_stack() it appears that in the CONFIG_VMAP_STACK case it may receive a non-NULL vm pointer but it may also be NULL in which case __free_pages() is used to free the stack. This is however not the case because in the VMAP case a non-NULL pointer is always returned here. Since it looks like this might happen, the compiler creates the correct dead code with the invocation to __free_pages() and everything around it. Twice. Add spaces between the ifdef and the identifer to recognize the ifdef level which is currently in scope. Add the current identifer as a comment behind #else and #endif. Move the code within free_thread_stack() and alloc_thread_stack_node() into the relevant ifdef blocks. Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-02-22cpuset: Fix kernel-docJiapeng Chong1-5/+5
Fix the following W=1 kernel warnings: kernel/cgroup/cpuset.c:3718: warning: expecting prototype for cpuset_memory_pressure_bump(). Prototype was for __cpuset_memory_pressure_bump() instead. kernel/cgroup/cpuset.c:3568: warning: expecting prototype for cpuset_node_allowed(). Prototype was for __cpuset_node_allowed() instead. Reported-by: Abaci Robot <[email protected]> Signed-off-by: Jiapeng Chong <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2022-02-22audit: log AUDIT_TIME_* records only from rulesRichard Guy Briggs2-20/+71
AUDIT_TIME_* events are generated when there are syscall rules present that are not related to time keeping. This will produce noisy log entries that could flood the logs and hide events we really care about. Rather than immediately produce the AUDIT_TIME_* records, store the data in the context and log it at syscall exit time respecting the filter rules. Note: This eats the audit_buffer, unlike any others in show_special(). Please see https://bugzilla.redhat.com/show_bug.cgi?id=1991919 Fixes: 7e8eda734d30 ("ntp: Audit NTP parameters adjustment") Fixes: 2d87a0674bd6 ("timekeeping: Audit clock adjustments") Signed-off-by: Richard Guy Briggs <[email protected]> [PM: fixed style/whitespace issues] Signed-off-by: Paul Moore <[email protected]>
2022-02-22cgroup-v1: Correct privileges check in release_agent writesMichal Koutný1-2/+4
The idea is to check: a) the owning user_ns of cgroup_ns, b) capabilities in init_user_ns. The commit 24f600856418 ("cgroup-v1: Require capabilities to set release_agent") got this wrong in the write handler of release_agent since it checked user_ns of the opener (may be different from the owning user_ns of cgroup_ns). Secondly, to avoid possibly confused deputy, the capability of the opener must be checked. Fixes: 24f600856418 ("cgroup-v1: Require capabilities to set release_agent") Cc: [email protected] Link: https://lore.kernel.org/stable/[email protected]/ Signed-off-by: Michal Koutný <[email protected]> Reviewed-by: Masami Ichikawa(CIP) <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2022-02-22cgroup: clarify cgroup_css_set_fork()Christian Brauner1-0/+14
With recent fixes for the permission checking when moving a task into a cgroup using a file descriptor to a cgroup's cgroup.procs file and calling write() it seems a good idea to clarify CLONE_INTO_CGROUP permission checking with a comment. Cc: Tejun Heo <[email protected]> Cc: <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2022-02-21random: clear fast pool, crng, and batches in cpuhp bring upJason A. Donenfeld1-0/+11
For the irq randomness fast pool, rather than having to use expensive atomics, which were visibly the most expensive thing in the entire irq handler, simply take care of the extreme edge case of resetting count to zero in the cpuhp online handler, just after workqueues have been reenabled. This simplifies the code a bit and lets us use vanilla variables rather than atomics, and performance should be improved. As well, very early on when the CPU comes up, while interrupts are still disabled, we clear out the per-cpu crng and its batches, so that it always starts with fresh randomness. Cc: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Sultan Alsawaf <[email protected]> Cc: Dominik Brodowski <[email protected]> Acked-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Jason A. Donenfeld <[email protected]>
2022-02-21printk: make suppress_panic_printk staticJiapeng Chong1-1/+1
This symbol is not used outside of printk.c, so marks it static. Fix the following sparse warning: kernel/printk/printk.c:100:19: warning: symbol 'suppress_panic_printk' was not declared. Should it be static? Reported-by: Abaci Robot <[email protected]> Signed-off-by: Jiapeng Chong <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Reviewed-by: Miguel Ojeda <[email protected]> Signed-off-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-02-21printk: Set console_set_on_cmdline=1 when __add_preferred_console() is ↵Andre Kalb1-4/+16
called with user_specified == true In case of using console="" or console=null set console_set_on_cmdline=1 to disable "stdout-path" node from DT. We basically need to set it every time when __add_preferred_console() is called with parameter 'user_specified' set. Therefore we can move setting it into a helper function that is called from __add_preferred_console(). Suggested-by: Petr Mladek <[email protected]> Signed-off-by: Andre Kalb <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Signed-off-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/YgzU4ho8l6XapyG2@pc6682
2022-02-21Merge tag 'v5.17-rc5' into sched/core, to resolve conflictsIngo Molnar37-339/+553
New conflicts in sched/core due to the following upstream fixes: 44585f7bc0cb ("psi: fix "defined but not used" warnings when CONFIG_PROC_FS=n") a06247c6804f ("psi: Fix uaf issue when psi trigger is destroyed while being polled") Conflicts: include/linux/psi_types.h kernel/sched/psi.c Signed-off-by: Ingo Molnar <[email protected]>
2022-02-21Merge tag 'irq-api-2022-02-21' into irq/coreThomas Gleixner26-106/+243
Merge the generic_handle_irq_safe() API back into irq/core.
2022-02-21genirq: Provide generic_handle_irq_safe()Sebastian Andrzej Siewior1-0/+23
Provide generic_handle_irq_safe() which can used from any context. Suggested-by: Thomas Gleixner <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Hans de Goede <[email protected]> Reviewed-by: Oleksandr Natalenko <[email protected]> Reviewed-by: Wolfram Sang <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-02-21x86/speculation: Include unprivileged eBPF status in Spectre v2 mitigation ↵Josh Poimboeuf1-0/+7
reporting With unprivileged eBPF enabled, eIBRS (without retpoline) is vulnerable to Spectre v2 BHB-based attacks. When both are enabled, print a warning message and report it in the 'spectre_v2' sysfs vulnerabilities file. Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]>
2022-02-20Merge tag 'locking_urgent_for_v5.17_rc5' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Borislav Petkov: "Fix a NULL ptr dereference when dumping lockdep chains through /proc/lockdep_chains" * tag 'locking_urgent_for_v5.17_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: lockdep: Correct lock_classes index mapping
2022-02-20Merge tag 'sched_urgent_for_v5.17_rc5' of ↵Linus Torvalds2-14/+33
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Borislav Petkov: "Fix task exposure order when forking tasks" * tag 'sched_urgent_for_v5.17_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix yet more sched_fork() races
2022-02-20Merge tag 'pidfd.v5.17-rc4' of ↵Linus Torvalds1-4/+3
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull pidfd fix from Christian Brauner: "This fixes a problem reported by lockdep when installing a pidfd via fd_install() with siglock and the tasklisk write lock held in copy_process() when calling clone()/clone3() with CLONE_PIDFD. Originally a pidfd was created prior to holding any of these locks but this required a call to ksys_close(). So quite some time ago in 6fd2fe494b17 ("copy_process(): don't use ksys_close() on cleanups") we switched to a get_unused_fd_flags() + fd_install() model. As part of that we moved fd_install() as late as possible. This was done for two main reasons. First, because we needed to ensure that we call fd_install() past the point of no return as once that's called the fd is live in the task's file table. Second, because we tried to ensure that the fd is visible in /proc/<pid>/fd/<pidfd> right when the task is visible. This fix moves the fd_install() to an even later point which means that a task will be visible in proc while the pidfd isn't yet under /proc/<pid>/fd/<pidfd>. While this is a user visible change it's very unlikely that this will have any impact. Nobody should be relying on that and if they do we need to come up with something better but again, it's doubtful this is relevant" * tag 'pidfd.v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: copy_process(): Move fd_install() out of sighand->siglock critical section
2022-02-20Merge branch 'ucount-rlimit-fixes-for-v5.17' of ↵Linus Torvalds4-19/+23
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull ucounts fixes from Eric Biederman: "Michal Koutný recently found some bugs in the enforcement of RLIMIT_NPROC in the recent ucount rlimit implementation. In this set of patches I have developed a very conservative approach changing only what is necessary to fix the bugs that I can see clearly. Cleanups and anything that is making the code more consistent can follow after we have the code working as it has historically. The problem is not so much inconsistencies (although those exist) but that it is very difficult to figure out what the code should be doing in the case of RLIMIT_NPROC. All other rlimits are only enforced where the resource is acquired (allocated). RLIMIT_NPROC by necessity needs to be enforced in an additional location, and our current implementation stumbled it's way into that implementation" * 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: ucounts: Handle wrapping in is_ucounts_overlimit ucounts: Move RLIMIT_NPROC handling after set_user ucounts: Base set_cred_ucounts changes on the real user ucounts: Enforce RLIMIT_NPROC not RLIMIT_NPROC+1 rlimit: Fix RLIMIT_NPROC enforcement failure caused by capability calls in set_user
2022-02-19bpf: Initialize ret to 0 inside btf_populate_kfunc_set()Souptick Joarder (HPE)1-1/+1
Kernel test robot reported below error -> kernel/bpf/btf.c:6718 btf_populate_kfunc_set() error: uninitialized symbol 'ret'. Initialize ret to 0. Fixes: dee872e124e8 ("bpf: Populate kfunc BTF ID sets in struct btf") Reported-by: kernel test robot <[email protected]> Signed-off-by: Souptick Joarder (HPE) <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2022-02-19sched/preempt: Add PREEMPT_DYNAMIC using static keysMark Rutland3-3/+65
Where an architecture selects HAVE_STATIC_CALL but not HAVE_STATIC_CALL_INLINE, each static call has an out-of-line trampoline which will either branch to a callee or return to the caller. On such architectures, a number of constraints can conspire to make those trampolines more complicated and potentially less useful than we'd like. For example: * Hardware and software control flow integrity schemes can require the addition of "landing pad" instructions (e.g. `BTI` for arm64), which will also be present at the "real" callee. * Limited branch ranges can require that trampolines generate or load an address into a register and perform an indirect branch (or at least have a slow path that does so). This loses some of the benefits of having a direct branch. * Interaction with SW CFI schemes can be complicated and fragile, e.g. requiring that we can recognise idiomatic codegen and remove indirections understand, at least until clang proves more helpful mechanisms for dealing with this. For PREEMPT_DYNAMIC, we don't need the full power of static calls, as we really only need to enable/disable specific preemption functions. We can achieve the same effect without a number of the pain points above by using static keys to fold early returns into the preemption functions themselves rather than in an out-of-line trampoline, effectively inlining the trampoline into the start of the function. For arm64, this results in good code generation. For example, the dynamic_cond_resched() wrapper looks as follows when enabled. When disabled, the first `B` is replaced with a `NOP`, resulting in an early return. | <dynamic_cond_resched>: | bti c | b <dynamic_cond_resched+0x10> // or `nop` | mov w0, #0x0 | ret | mrs x0, sp_el0 | ldr x0, [x0, #8] | cbnz x0, <dynamic_cond_resched+0x8> | paciasp | stp x29, x30, [sp, #-16]! | mov x29, sp | bl <preempt_schedule_common> | mov w0, #0x1 | ldp x29, x30, [sp], #16 | autiasp | ret ... compared to the regular form of the function: | <__cond_resched>: | bti c | mrs x0, sp_el0 | ldr x1, [x0, #8] | cbz x1, <__cond_resched+0x18> | mov w0, #0x0 | ret | paciasp | stp x29, x30, [sp, #-16]! | mov x29, sp | bl <preempt_schedule_common> | mov w0, #0x1 | ldp x29, x30, [sp], #16 | autiasp | ret Any architecture which implements static keys should be able to use this to implement PREEMPT_DYNAMIC with similar cost to non-inlined static calls. Since this is likely to have greater overhead than (inlined) static calls, PREEMPT_DYNAMIC is only defaulted to enabled when HAVE_PREEMPT_DYNAMIC_CALL is selected. Signed-off-by: Mark Rutland <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Ard Biesheuvel <[email protected]> Acked-by: Frederic Weisbecker <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-02-19sched/preempt: Decouple HAVE_PREEMPT_DYNAMIC from GENERIC_ENTRYMark Rutland1-0/+2
Now that the enabled/disabled states for the preemption functions are declared alongside their definitions, the core PREEMPT_DYNAMIC logic is no longer tied to GENERIC_ENTRY, and can safely be selected so long as an architecture provides enabled/disabled states for irqentry_exit_cond_resched(). Make it possible to select HAVE_PREEMPT_DYNAMIC without GENERIC_ENTRY. For existing users of HAVE_PREEMPT_DYNAMIC there should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Ard Biesheuvel <[email protected]> Acked-by: Frederic Weisbecker <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-02-19sched/preempt: Simplify irqentry_exit_cond_resched() callersMark Rutland1-8/+4
Currently callers of irqentry_exit_cond_resched() need to be aware of whether the function should be indirected via a static call, leading to ugly ifdeffery in callers. Save them the hassle with a static inline wrapper that does the right thing. The raw_irqentry_exit_cond_resched() will also be useful in subsequent patches which will add conditional wrappers for preemption functions. Note: in arch/x86/entry/common.c, xen_pv_evtchn_do_upcall() always calls irqentry_exit_cond_resched() directly, even when PREEMPT_DYNAMIC is in use. I believe this is a latent bug (which this patch corrects), but I'm not entirely certain this wasn't deliberate. Signed-off-by: Mark Rutland <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Ard Biesheuvel <[email protected]> Acked-by: Frederic Weisbecker <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-02-19sched/preempt: Refactor sched_dynamic_update()Mark Rutland1-22/+37
Currently sched_dynamic_update needs to open-code the enabled/disabled function names for each preemption model it supports, when in practice this is a boolean enabled/disabled state for each function. Make this clearer and avoid repetition by defining the enabled/disabled states at the function definition, and using helper macros to perform the static_call_update(). Where x86 currently overrides the enabled function, it is made to provide both the enabled and disabled states for consistency, with defaults provided by the core code otherwise. In subsequent patches this will allow us to support PREEMPT_DYNAMIC without static calls. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Ard Biesheuvel <[email protected]> Acked-by: Frederic Weisbecker <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-02-19sched/preempt: Move PREEMPT_DYNAMIC logic laterMark Rutland1-136/+136
The PREEMPT_DYNAMIC logic in kernel/sched/core.c patches static calls for a bunch of preemption functions. While most are defined prior to this, the definition of cond_resched() is later in the file, and so we only have its declarations from include/linux/sched.h. In subsequent patches we'd like to define some macros alongside the definition of each of the preemption functions, which we can use within sched_dynamic_update(). For this to be possible, the PREEMPT_DYNAMIC logic needs to be placed after the various preemption functions. As a preparatory step, this patch moves the PREEMPT_DYNAMIC logic after the various preemption functions, with no other changes -- this is purely a move. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Ard Biesheuvel <[email protected]> Acked-by: Frederic Weisbecker <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-02-19sched: Fix yet more sched_fork() racesPeter Zijlstra2-14/+33
Where commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") fixed a fork race vs cgroup, it opened up a race vs syscalls by not placing the task on the runqueue before it gets exposed through the pidhash. Commit 13765de8148f ("sched/fair: Fix fault in reweight_entity") is trying to fix a single instance of this, instead fix the whole class of issues, effectively reverting this commit. Fixes: 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") Reported-by: Linus Torvalds <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Tadeusz Struk <[email protected]> Tested-by: Zhang Qiao <[email protected]> Tested-by: Dietmar Eggemann <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2022-02-18bpf: Call maybe_wait_bpf_programs() only once from generic_map_delete_batch()Eric Dumazet1-1/+2
As stated in the comment found in maybe_wait_bpf_programs(), the synchronize_rcu() barrier is only needed before returning to userspace, not after each deletion in the batch. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: Stanislav Fomichev <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2022-02-18KVM: x86: allow defining return-0 static callsPaolo Bonzini1-0/+1
A few vendor callbacks are only used by VMX, but they return an integer or bool value. Introduce KVM_X86_OP_OPTIONAL_RET0 for them: if a func is NULL in struct kvm_x86_ops, it will be changed to __static_call_return0 when updating static calls. Reviewed-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-02-17Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski13-329/+196
Daniel Borkmann says: ==================== bpf-next 2022-02-17 We've added 29 non-merge commits during the last 8 day(s) which contain a total of 34 files changed, 1502 insertions(+), 524 deletions(-). The main changes are: 1) Add BTFGen support to bpftool which allows to use CO-RE in kernels without BTF info, from Mauricio Vásquez, Rafael David Tinoco, Lorenzo Fontana and Leonardo Di Donato. (Details: https://lpc.events/event/11/contributions/948/) 2) Prepare light skeleton to be used in both kernel module and user space and convert bpf_preload.ko to use light skeleton, from Alexei Starovoitov. 3) Rework bpftool's versioning scheme and align with libbpf's version number; also add linked libbpf version info to "bpftool version", from Quentin Monnet. 4) Add minimal C++ specific additions to bpftool's skeleton codegen to facilitate use of C skeletons in C++ applications, from Andrii Nakryiko. 5) Add BPF verifier sanity check whether relative offset on kfunc calls overflows desc->imm and reject the BPF program if the case, from Hou Tao. 6) Fix libbpf to use a dynamically allocated buffer for netlink messages to avoid receiving truncated messages on some archs, from Toke Høiland-Jørgensen. 7) Various follow-up fixes to the JIT bpf_prog_pack allocator, from Song Liu. 8) Various BPF selftest and vmtest.sh fixes, from Yucong Sun. 9) Fix bpftool pretty print handling on dumping map keys/values when no BTF is available, from Jiri Olsa and Yinjun Zhang. 10) Extend XDP frags selftest to check for invalid length, from Lorenzo Bianconi. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (29 commits) bpf: bpf_prog_pack: Set proper size before freeing ro_header selftests/bpf: Fix crash in core_reloc when bpftool btfgen fails selftests/bpf: Fix vmtest.sh to launch smp vm. libbpf: Fix memleak in libbpf_netlink_recv() bpftool: Fix C++ additions to skeleton bpftool: Fix pretty print dump for maps without BTF loaded selftests/bpf: Test "bpftool gen min_core_btf" bpftool: Gen min_core_btf explanation and examples bpftool: Implement btfgen_get_btf() bpftool: Implement "gen min_core_btf" logic bpftool: Add gen min_core_btf command libbpf: Expose bpf_core_{add,free}_cands() to bpftool libbpf: Split bpf_core_apply_relo() bpf: Reject kfunc calls that overflow insn->imm selftests/bpf: Add Skeleton templated wrapper as an example bpftool: Add C++-specific open/load/etc skeleton wrappers selftests/bpf: Fix GCC11 compiler warnings in -O2 mode bpftool: Fix the error when lookup in no-btf maps libbpf: Use dynamically allocated buffer when receiving netlink messages libbpf: Fix libbpf.map inheritance chain for LIBBPF_0.7.0 ... ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2022-02-17bpf: bpf_prog_pack: Set proper size before freeing ro_headerSong Liu1-0/+1
bpf_prog_pack_free() uses header->size to decide whether the header should be freed with module_memfree() or the bpf_prog_pack logic. However, in kvmalloc() failure path of bpf_jit_binary_pack_alloc(), header->size is not set yet. As a result, bpf_prog_pack_free() may treat a slice of a pack as a standalone kvmalloc'd header and call module_memfree() on the whole pack. This in turn causes use-after-free by other users of the pack. Fix this by setting ro_header->size before freeing ro_header. Fixes: 33c9805860e5 ("bpf: Introduce bpf_jit_binary_pack_[alloc|finalize|free]") Reported-by: [email protected] Reported-by: [email protected] Reported-by: [email protected] Signed-off-by: Song Liu <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2022-02-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski3-2/+8
Fast path bpf marge for some -next work. Signed-off-by: Jakub Kicinski <[email protected]>
2022-02-17Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski3-2/+8
Alexei Starovoitov says: ==================== pull-request: bpf 2022-02-17 We've added 8 non-merge commits during the last 7 day(s) which contain a total of 8 files changed, 119 insertions(+), 15 deletions(-). The main changes are: 1) Add schedule points in map batch ops, from Eric. 2) Fix bpf_msg_push_data with len 0, from Felix. 3) Fix crash due to incorrect copy_map_value, from Kumar. 4) Fix crash due to out of bounds access into reg2btf_ids, from Kumar. 5) Fix a bpf_timer initialization issue with clang, from Yonghong. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Add schedule points in batch ops bpf: Fix crash due to out of bounds access into reg2btf_ids. selftests: bpf: Check bpf_msg_push_data return value bpf: Fix a bpf_timer initialization issue bpf: Emit bpf_timer in vmlinux BTF selftests/bpf: Add test for bpf_timer overwriting crash bpf: Fix crash due to incorrect copy_map_value bpf: Do not try bpf_msg_push_data with len 0 ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2022-02-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski9-13/+32
No conflicts. Signed-off-by: Jakub Kicinski <[email protected]>
2022-02-17bpf: Add schedule points in batch opsEric Dumazet1-0/+3
syzbot reported various soft lockups caused by bpf batch operations. INFO: task kworker/1:1:27 blocked for more than 140 seconds. INFO: task hung in rcu_barrier Nothing prevents batch ops to process huge amount of data, we need to add schedule points in them. Note that maybe_wait_bpf_programs(map) calls from generic_map_delete_batch() can be factorized by moving the call after the loop. This will be done later in -next tree once we get this fix merged, unless there is strong opinion doing this optimization sooner. Fixes: aa2e93b8e58e ("bpf: Add generic support for update and delete batch ops") Fixes: cb4d03ab499d ("bpf: Add generic support for lookup batch op") Reported-by: syzbot <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Reviewed-by: Stanislav Fomichev <[email protected]> Acked-by: Brian Vazquez <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2022-02-17mm/munlock: rmap call mlock_vma_page() munlock_vma_page()Hugh Dickins1-5/+2
Add vma argument to mlock_vma_page() and munlock_vma_page(), make them inline functions which check (vma->vm_flags & VM_LOCKED) before calling mlock_page() and munlock_page() in mm/mlock.c. Add bool compound to mlock_vma_page() and munlock_vma_page(): this is because we have understandable difficulty in accounting pte maps of THPs, and if passed a PageHead page, mlock_page() and munlock_page() cannot tell whether it's a pmd map to be counted or a pte map to be ignored. Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the others, and use that to call mlock_vma_page() at the end of the page adds, and munlock_vma_page() at the end of page_remove_rmap() (end or beginning? unimportant, but end was easier for assertions in testing). No page lock is required (although almost all adds happen to hold it): delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s. Certainly page lock did serialize with page migration, but I'm having difficulty explaining why that was ever important. Mlock accounting on THPs has been hard to define, differed between anon and file, involved PageDoubleMap in some places and not others, required clear_page_mlock() at some points. Keep it simple now: just count the pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks. page_add_new_anon_rmap() callers unchanged: they have long been calling lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED handling (it also checks for not VM_SPECIAL: I think that's overcautious, and inconsistent with other checks, that mmap_region() already prevents VM_LOCKED on VM_SPECIAL; but haven't quite convinced myself to change it). Signed-off-by: Hugh Dickins <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
2022-02-17ucounts: Handle wrapping in is_ucounts_overlimitEric W. Biederman1-1/+2
While examining is_ucounts_overlimit and reading the various messages I realized that is_ucounts_overlimit fails to deal with counts that may have wrapped. Being wrapped should be a transitory state for counts and they should never be wrapped for long, but it can happen so handle it. Cc: [email protected] Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") Link: https://lkml.kernel.org/r/[email protected] Reviewed-by: Shuah Khan <[email protected]> Signed-off-by: "Eric W. Biederman" <[email protected]>
2022-02-17ucounts: Move RLIMIT_NPROC handling after set_userEric W. Biederman1-5/+14
During set*id() which cred->ucounts to charge the the current process to is not known until after set_cred_ucounts. So move the RLIMIT_NPROC checking into a new helper flag_nproc_exceeded and call flag_nproc_exceeded after set_cred_ucounts. This is very much an arbitrary subset of the places where we currently change the RLIMIT_NPROC accounting, designed to preserve the existing logic. Fixing the existing logic will be the subject of another series of changes. Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") Signed-off-by: "Eric W. Biederman" <[email protected]>
2022-02-17ucounts: Base set_cred_ucounts changes on the real userEric W. Biederman1-7/+2
Michal Koutný <[email protected]> wrote: > Tasks are associated to multiple users at once. Historically and as per > setrlimit(2) RLIMIT_NPROC is enforce based on real user ID. > > The commit 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") > made the accounting structure "indexed" by euid and hence potentially > account tasks differently. > > The effective user ID may be different e.g. for setuid programs but > those are exec'd into already existing task (i.e. below limit), so > different accounting is moot. > > Some special setresuid(2) users may notice the difference, justifying > this fix. I looked at cred->ucount and it is only used for rlimit operations that were previously stored in cred->user. Making the fact cred->ucount can refer to a different user from cred->user a bug, affecting all uses of cred->ulimit not just RLIMIT_NPROC. Fix set_cred_ucounts to always use the real uid not the effective uid. Further simplify set_cred_ucounts by noticing that set_cred_ucounts somehow retained a draft version of the check to see if alloc_ucounts was needed that checks the new->user and new->user_ns against the current_real_cred(). Remove that draft version of the check. All that matters for setting the cred->ucounts are the user_ns and uid fields in the cred. Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Reported-by: Michal Koutný <[email protected]> Reviewed-by: Michal Koutný <[email protected]> Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") Signed-off-by: "Eric W. Biederman" <[email protected]>
2022-02-17ucounts: Enforce RLIMIT_NPROC not RLIMIT_NPROC+1Eric W. Biederman1-5/+5
Michal Koutný <[email protected]> wrote: > It was reported that v5.14 behaves differently when enforcing > RLIMIT_NPROC limit, namely, it allows one more task than previously. > This is consequence of the commit 21d1c5e386bc ("Reimplement > RLIMIT_NPROC on top of ucounts") that missed the sharpness of > equality in the forking path. This can be fixed either by fixing the test or by moving the increment to be before the test. Fix it my moving copy_creds which contains the increment before is_ucounts_overlimit. In the case of CLONE_NEWUSER the ucounts in the task_cred changes. The function is_ucounts_overlimit needs to use the final version of the ucounts for the new process. Which means moving the is_ucounts_overlimit test after copy_creds is necessary. Both the test in fork and the test in set_user were semantically changed when the code moved to ucounts. The change of the test in fork was bad because it was before the increment. The test in set_user was wrong and the change to ucounts fixed it. So this fix only restores the old behavior in one lcation not two. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Cc: [email protected] Reported-by: Michal Koutný <[email protected]> Reviewed-by: Michal Koutný <[email protected]> Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") Signed-off-by: "Eric W. Biederman" <[email protected]>
2022-02-17rlimit: Fix RLIMIT_NPROC enforcement failure caused by capability calls in ↵Eric W. Biederman1-2/+1
set_user Solar Designer <[email protected]> wrote: > I'm not aware of anyone actually running into this issue and reporting > it. The systems that I personally know use suexec along with rlimits > still run older/distro kernels, so would not yet be affected. > > So my mention was based on my understanding of how suexec works, and > code review. Specifically, Apache httpd has the setting RLimitNPROC, > which makes it set RLIMIT_NPROC: > > https://httpd.apache.org/docs/2.4/mod/core.html#rlimitnproc > > The above documentation for it includes: > > "This applies to processes forked from Apache httpd children servicing > requests, not the Apache httpd children themselves. This includes CGI > scripts and SSI exec commands, but not any processes forked from the > Apache httpd parent, such as piped logs." > > In code, there are: > > ./modules/generators/mod_cgid.c: ( (cgid_req.limits.limit_nproc_set) && ((rc = apr_procattr_limit_set(procattr, APR_LIMIT_NPROC, > ./modules/generators/mod_cgi.c: ((rc = apr_procattr_limit_set(procattr, APR_LIMIT_NPROC, > ./modules/filters/mod_ext_filter.c: rv = apr_procattr_limit_set(procattr, APR_LIMIT_NPROC, conf->limit_nproc); > > For example, in mod_cgi.c this is in run_cgi_child(). > > I think this means an httpd child sets RLIMIT_NPROC shortly before it > execs suexec, which is a SUID root program. suexec then switches to the > target user and execs the CGI script. > > Before 2863643fb8b9, the setuid() in suexec would set the flag, and the > target user's process count would be checked against RLIMIT_NPROC on > execve(). After 2863643fb8b9, the setuid() in suexec wouldn't set the > flag because setuid() is (naturally) called when the process is still > running as root (thus, has those limits bypass capabilities), and > accordingly execve() would not check the target user's process count > against RLIMIT_NPROC. In commit 2863643fb8b9 ("set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds") capable calls were added to set_user to make it more consistent with fork. Unfortunately because of call site differences those capable calls were checking the credentials of the user before set*id() instead of after set*id(). This breaks enforcement of RLIMIT_NPROC for applications that set the rlimit and then call set*id() while holding a full set of capabilities. The capabilities are only changed in the new credential in security_task_fix_setuid(). The code in apache suexec appears to follow this pattern. Commit 909cc4ae86f3 ("[PATCH] Fix two bugs with process limits (RLIMIT_NPROC)") where this check was added describes the targes of this capability check as: 2/ When a root-owned process (e.g. cgiwrap) sets up process limits and then calls setuid, the setuid should fail if the user would then be running more than rlim_cur[RLIMIT_NPROC] processes, but it doesn't. This patch adds an appropriate test. With this patch, and per-user process limit imposed in cgiwrap really works. So the original use case of this check also appears to match the broken pattern. Restore the enforcement of RLIMIT_NPROC by removing the bad capable checks added in set_user. This unfortunately restores the inconsistent state the code has been in for the last 11 years, but dealing with the inconsistencies looks like a larger problem. Cc: [email protected] Link: https://lore.kernel.org/all/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 2863643fb8b9 ("set_user: add capability check when rlimit(RLIMIT_NPROC) exceeds") History-Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Reviewed-by: Solar Designer <[email protected]> Signed-off-by: "Eric W. Biederman" <[email protected]>
2022-02-16module: fix building with sysfs disabledDmitry Torokhov1-0/+2
Sysfs support might be disabled so we need to guard the code that instantiates "compression" attribute with an #ifdef. Fixes: b1ae6dc41eaa ("module: add in-kernel support for decompressing") Reported-by: kernel test robot <[email protected]> Signed-off-by: Dmitry Torokhov <[email protected]> Signed-off-by: Luis Chamberlain <[email protected]>
2022-02-16bpf: Fix crash due to out of bounds access into reg2btf_ids.Kumar Kartikeya Dwivedi1-2/+3
When commit e6ac2450d6de ("bpf: Support bpf program calling kernel function") added kfunc support, it defined reg2btf_ids as a cheap way to translate the verifier reg type to the appropriate btf_vmlinux BTF ID, however commit c25b2ae13603 ("bpf: Replace PTR_TO_XXX_OR_NULL with PTR_TO_XXX | PTR_MAYBE_NULL") moved the __BPF_REG_TYPE_MAX from the last member of bpf_reg_type enum to after the base register types, and defined other variants using type flag composition. However, now, the direct usage of reg->type to index into reg2btf_ids may no longer fall into __BPF_REG_TYPE_MAX range, and hence lead to out of bounds access and kernel crash on dereference of bad pointer. Fixes: c25b2ae13603 ("bpf: Replace PTR_TO_XXX_OR_NULL with PTR_TO_XXX | PTR_MAYBE_NULL") Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2022-02-16PM: hibernate: fix load_image_and_restore() error pathYe Bin1-1/+3
As 'swsusp_check' open 'hib_resume_bdev', if call 'create_basic_memory_bitmaps' failed, we need to close 'hib_resume_bdev' in 'load_image_and_restore' function. Signed-off-by: Ye Bin <[email protected]> [ rjw: Subject ] Signed-off-by: Rafael J. Wysocki <[email protected]>
2022-02-16libbpf: Split bpf_core_apply_relo()Mauricio Vásquez1-3/+10
BTFGen needs to run the core relocation logic in order to understand what are the types involved in a given relocation. Currently bpf_core_apply_relo() calculates and **applies** a relocation to an instruction. Having both operations in the same function makes it difficult to only calculate the relocation without patching the instruction. This commit splits that logic in two different phases: (1) calculate the relocation and (2) patch the instruction. For the first phase bpf_core_apply_relo() is renamed to bpf_core_calc_relo_insn() who is now only on charge of calculating the relocation, the second phase uses the already existing bpf_core_patch_insn(). bpf_object__relocate_core() uses both of them and the BTFGen will use only bpf_core_calc_relo_insn(). Signed-off-by: Mauricio Vásquez <[email protected]> Signed-off-by: Rafael David Tinoco <[email protected]> Signed-off-by: Lorenzo Fontana <[email protected]> Signed-off-by: Leonardo Di Donato <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2022-02-16locking/lockdep: Iterate lock_classes directly when reading lockdep filesWaiman Long3-15/+56
When dumping lock_classes information via /proc/lockdep, we can't take the lockdep lock as the lock hold time is indeterminate. Iterating over all_lock_classes without holding lock can be dangerous as there is a slight chance that it may branch off to other lists leading to infinite loop or even access invalid memory if changes are made to all_lock_classes list in parallel. To avoid this problem, iteration of lock classes is now done directly on the lock_classes array itself. The lock_classes_in_use bitmap is checked to see if the lock class is being used. To avoid iterating the full array all the times, a new max_lock_class_idx value is added to track the maximum lock_class index that is currently being used. We can theoretically take the lockdep lock for iterating all_lock_classes when other lockdep files (lockdep_stats and lock_stat) are accessed as the lock hold time will be shorter for them. For consistency, they are also modified to iterate the lock_classes array directly. Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2022-02-16sched/isolation: Split housekeeping cpumask per isolation featuresFrederic Weisbecker1-29/+62
To prepare for supporting each housekeeping feature toward cpuset, split the global housekeeping cpumask per HK_TYPE_* entry. This will later allow, for example, to runtime modify the cpulist passed through "isolcpus=", "nohz_full=" and "rcu_nocbs=" kernel boot parameters. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Juri Lelli <[email protected]> Reviewed-by: Phil Auld <[email protected]> Link: https://lore.kernel.org/r/[email protected]