aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2022-01-07livepatch: Avoid CPU hogging with cond_reschedDavid Vernet2-0/+3
When initializing a 'struct klp_object' in klp_init_object_loaded(), and performing relocations in klp_resolve_symbols(), klp_find_object_symbol() is invoked to look up the address of a symbol in an already-loaded module (or vmlinux). This, in turn, calls kallsyms_on_each_symbol() or module_kallsyms_on_each_symbol() to find the address of the symbol that is being patched. It turns out that symbol lookups often take up the most CPU time when enabling and disabling a patch, and may hog the CPU and cause other tasks on that CPU's runqueue to starve -- even in paths where interrupts are enabled. For example, under certain workloads, enabling a KLP patch with many objects or functions may cause ksoftirqd to be starved, and thus for interrupts to be backlogged and delayed. This may end up causing TCP retransmits on the host where the KLP patch is being applied, and in general, may cause any interrupts serviced by softirqd to be delayed while the patch is being applied. So as to ensure that kallsyms_on_each_symbol() does not end up hogging the CPU, this patch adds a call to cond_resched() in kallsyms_on_each_symbol() and module_kallsyms_on_each_symbol(), which are invoked when doing a symbol lookup in vmlinux and a module respectively. Without this patch, if a live-patch is applied on a 36-core Intel host with heavy TCP traffic, a ~10x spike is observed in TCP retransmits while the patch is being applied. Additionally, collecting sched events with perf indicates that ksoftirqd is awakened ~1.3 seconds before it's eventually scheduled. With the patch, no increase in TCP retransmit events is observed, and ksoftirqd is scheduled shortly after it's awakened. Signed-off-by: David Vernet <[email protected]> Acked-by: Miroslav Benes <[email protected]> Acked-by: Song Liu <[email protected]> Signed-off-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-01-07irq: remove unused flags argument from __handle_irq_event_percpu()Sebastian Andrzej Siewior3-11/+4
The __IRQF_TIMER bit from the flags argument was used in add_interrupt_randomness() to distinguish the timer interrupt from other interrupts. This is no longer the case. Remove the flags argument from __handle_irq_event_percpu(). Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Jason A. Donenfeld <[email protected]>
2022-01-07random: remove unused irq_flags argument from add_interrupt_randomness()Sebastian Andrzej Siewior1-1/+1
Since commit ee3e00e9e7101 ("random: use registers from interrupted code for CPU's w/o a cycle counter") the irq_flags argument is no longer used. Remove unused irq_flags. Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dexuan Cui <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Haiyang Zhang <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: K. Y. Srinivasan <[email protected]> Cc: Stephen Hemminger <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Wei Liu <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Acked-by: Wei Liu <[email protected]> Signed-off-by: Jason A. Donenfeld <[email protected]>
2022-01-06Merge tag 'trace-v5.16-rc8' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Three minor tracing fixes: - Fix missing prototypes in sample module for direct functions - Fix check of valid buffer in get_trace_buf() - Fix annotations of percpu pointers" * tag 'trace-v5.16-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Tag trace_percpu_buffer as a percpu pointer tracing: Fix check for trace_percpu_buffer validity in get_trace_buf() ftrace/samples: Add missing prototypes direct functions
2022-01-06cgroup/rstat: check updated_next only for rootWei Yang1-24/+23
After commit dc26532aed0a ("cgroup: rstat: punt root-level optimization to individual controllers"), each rstat on updated_children list has its ->updated_next not NULL. This means we can remove the check on ->updated_next, if we make sure the subtree from @root is on list, which could be done by checking updated_next for root. tj: Coding style fixes. Signed-off-by: Wei Yang <[email protected]> Reviewed-by: Michal Koutný <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2022-01-06cgroup: rstat: explicitly put loop variant in whileWei Yang1-3/+1
Instead of do while unconditionally, let's put the loop variant in while. Signed-off-by: Wei Yang <[email protected]> Reviewed-by: Michal Koutný <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2022-01-06cgroup: Use open-time cgroup namespace for process migration perm checksTejun Heo2-9/+21
cgroup process migration permission checks are performed at write time as whether a given operation is allowed or not is dependent on the content of the write - the PID. This currently uses current's cgroup namespace which is a potential security weakness as it may allow scenarios where a less privileged process tricks a more privileged one into writing into a fd that it created. This patch makes cgroup remember the cgroup namespace at the time of open and uses it for migration permission checks instad of current's. Note that this only applies to cgroup2 as cgroup1 doesn't have namespace support. This also fixes a use-after-free bug on cgroupns reported in https://lore.kernel.org/r/[email protected] Note that backporting this fix also requires the preceding patch. Reported-by: "Eric W. Biederman" <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Cc: Michal Koutný <[email protected]> Cc: Oleg Nesterov <[email protected]> Reviewed-by: Michal Koutný <[email protected]> Reported-by: [email protected] Link: https://lore.kernel.org/r/[email protected] Fixes: 5136f6365ce3 ("cgroup: implement "nsdelegate" mount option") Signed-off-by: Tejun Heo <[email protected]>
2022-01-06cgroup: Allocate cgroup_file_ctx for kernfs_open_file->privTejun Heo3-31/+65
of->priv is currently used by each interface file implementation to store private information. This patch collects the current two private data usages into struct cgroup_file_ctx which is allocated and freed by the common path. This allows generic private data which applies to multiple files, which will be used to in the following patch. Note that cgroup_procs iterator is now embedded as procs.iter in the new cgroup_file_ctx so that it doesn't need to be allocated and freed separately. v2: union dropped from cgroup_file_ctx and the procs iterator is embedded in cgroup_file_ctx as suggested by Linus. v3: Michal pointed out that cgroup1's procs pidlist uses of->priv too. Converted. Didn't change to embedded allocation as cgroup1 pidlists get stored for caching. Signed-off-by: Tejun Heo <[email protected]> Cc: Linus Torvalds <[email protected]> Reviewed-by: Michal Koutný <[email protected]>
2022-01-06cgroup: Use open-time credentials for process migraton perm checksTejun Heo2-4/+12
cgroup process migration permission checks are performed at write time as whether a given operation is allowed or not is dependent on the content of the write - the PID. This currently uses current's credentials which is a potential security weakness as it may allow scenarios where a less privileged process tricks a more privileged one into writing into a fd that it created. This patch makes both cgroup2 and cgroup1 process migration interfaces to use the credentials saved at the time of open (file->f_cred) instead of current's. Reported-by: "Eric W. Biederman" <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Fixes: 187fe84067bd ("cgroup: require write perm on common ancestor when moving processes on the default hierarchy") Reviewed-by: Michal Koutný <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2022-01-05xdp: Move conversion to xdp_frame out of map functionsToke Høiland-Jørgensen2-28/+12
All map redirect functions except XSK maps convert xdp_buff to xdp_frame before enqueueing it. So move this conversion of out the map functions and into xdp_do_redirect(). This removes a bit of duplicated code, but more importantly it makes it possible to support caller-allocated xdp_frame structures, which will be added in a subsequent commit. Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2022-01-05tracing: Tag trace_percpu_buffer as a percpu pointerNaveen N. Rao1-2/+2
Tag trace_percpu_buffer as a percpu pointer to resolve warnings reported by sparse: /linux/kernel/trace/trace.c:3218:46: warning: incorrect type in initializer (different address spaces) /linux/kernel/trace/trace.c:3218:46: expected void const [noderef] __percpu *__vpp_verify /linux/kernel/trace/trace.c:3218:46: got struct trace_buffer_struct * /linux/kernel/trace/trace.c:3234:9: warning: incorrect type in initializer (different address spaces) /linux/kernel/trace/trace.c:3234:9: expected void const [noderef] __percpu *__vpp_verify /linux/kernel/trace/trace.c:3234:9: got int * Link: https://lkml.kernel.org/r/ebabd3f23101d89cb75671b68b6f819f5edc830b.1640255304.git.naveen.n.rao@linux.vnet.ibm.com Cc: [email protected] Reported-by: kernel test robot <[email protected]> Fixes: 07d777fe8c398 ("tracing: Add percpu buffers for trace_printk()") Signed-off-by: Naveen N. Rao <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2022-01-05tracing: Fix check for trace_percpu_buffer validity in get_trace_buf()Naveen N. Rao1-1/+1
With the new osnoise tracer, we are seeing the below splat: Kernel attempted to read user page (c7d880000) - exploit attempt? (uid: 0) BUG: Unable to handle kernel data access on read at 0xc7d880000 Faulting instruction address: 0xc0000000002ffa10 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries ... NIP [c0000000002ffa10] __trace_array_vprintk.part.0+0x70/0x2f0 LR [c0000000002ff9fc] __trace_array_vprintk.part.0+0x5c/0x2f0 Call Trace: [c0000008bdd73b80] [c0000000001c49cc] put_prev_task_fair+0x3c/0x60 (unreliable) [c0000008bdd73be0] [c000000000301430] trace_array_printk_buf+0x70/0x90 [c0000008bdd73c00] [c0000000003178b0] trace_sched_switch_callback+0x250/0x290 [c0000008bdd73c90] [c000000000e70d60] __schedule+0x410/0x710 [c0000008bdd73d40] [c000000000e710c0] schedule+0x60/0x130 [c0000008bdd73d70] [c000000000030614] interrupt_exit_user_prepare_main+0x264/0x270 [c0000008bdd73de0] [c000000000030a70] syscall_exit_prepare+0x150/0x180 [c0000008bdd73e10] [c00000000000c174] system_call_vectored_common+0xf4/0x278 osnoise tracer on ppc64le is triggering osnoise_taint() for negative duration in get_int_safe_duration() called from trace_sched_switch_callback()->thread_exit(). The problem though is that the check for a valid trace_percpu_buffer is incorrect in get_trace_buf(). The check is being done after calculating the pointer for the current cpu, rather than on the main percpu pointer. Fix the check to be against trace_percpu_buffer. Link: https://lkml.kernel.org/r/a920e4272e0b0635cf20c444707cbce1b2c8973d.1640255304.git.naveen.n.rao@linux.vnet.ibm.com Cc: [email protected] Fixes: e2ace001176dc9 ("tracing: Choose static tp_printk buffer by explicit nesting count") Signed-off-by: Naveen N. Rao <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2022-01-05bpf: Fix verifier support for validation of async callbacksKris Van Hees1-0/+1
Commit bfc6bb74e4f1 ("bpf: Implement verifier support for validation of async callbacks.") added support for BPF_FUNC_timer_set_callback to the __check_func_call() function. The test in __check_func_call() is flaweed because it can mis-interpret a regular BPF-to-BPF pseudo-call as a BPF_FUNC_timer_set_callback callback call. Consider the conditional in the code: if (insn->code == (BPF_JMP | BPF_CALL) && insn->imm == BPF_FUNC_timer_set_callback) { The BPF_FUNC_timer_set_callback has value 170. This means that if you have a BPF program that contains a pseudo-call with an instruction delta of 170, this conditional will be found to be true by the verifier, and it will interpret the pseudo-call as a callback. This leads to a mess with the verification of the program because it makes the wrong assumptions about the nature of this call. Solution: include an explicit check to ensure that insn->src_reg == 0. This ensures that calls cannot be mis-interpreted as an async callback call. Fixes: bfc6bb74e4f1 ("bpf: Implement verifier support for validation of async callbacks.") Signed-off-by: Kris Van Hees <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2022-01-05bpf: Don't promote bogus looking registers after null check.Daniel Borkmann1-6/+6
If we ever get to a point again where we convert a bogus looking <ptr>_or_null typed register containing a non-zero fixed or variable offset, then lets not reset these bounds to zero since they are not and also don't promote the register to a <ptr> type, but instead leave it as <ptr>_or_null. Converting to a unknown register could be an avenue as well, but then if we run into this case it would allow to leak a kernel pointer this way. Fixes: f1174f77b50c ("bpf/verifier: rework value tracking") Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2022-01-05Merge branches 'for-next/misc', 'for-next/cache-ops-dzp', ↵Catalin Marinas1-0/+1
'for-next/stacktrace', 'for-next/xor-neon', 'for-next/kasan', 'for-next/armv8_7-fp', 'for-next/atomics', 'for-next/bti', 'for-next/sve', 'for-next/kselftest' and 'for-next/kcsan', remote-tracking branch 'arm64/for-next/perf' into for-next/core * arm64/for-next/perf: (32 commits) arm64: perf: Don't register user access sysctl handler multiple times drivers: perf: marvell_cn10k: fix an IS_ERR() vs NULL check perf/smmuv3: Fix unused variable warning when CONFIG_OF=n arm64: perf: Support new DT compatibles arm64: perf: Simplify registration boilerplate arm64: perf: Support Denver and Carmel PMUs drivers/perf: hisi: Add driver for HiSilicon PCIe PMU docs: perf: Add description for HiSilicon PCIe PMU driver dt-bindings: perf: Add YAML schemas for Marvell CN10K LLC-TAD pmu bindings drivers: perf: Add LLC-TAD perf counter support perf/smmuv3: Synthesize IIDR from CoreSight ID registers perf/smmuv3: Add devicetree support dt-bindings: Add Arm SMMUv3 PMCG binding perf/arm-cmn: Add debugfs topology info perf/arm-cmn: Add CI-700 Support dt-bindings: perf: arm-cmn: Add CI-700 perf/arm-cmn: Support new IP features perf/arm-cmn: Demarcate CMN-600 specifics perf/arm-cmn: Move group validation data off-stack perf/arm-cmn: Optimise DTC counter accesses ... * for-next/misc: : Miscellaneous patches arm64: Use correct method to calculate nomap region boundaries arm64: Drop outdated links in comments arm64: errata: Fix exec handling in erratum 1418040 workaround arm64: Unhash early pointer print plus improve comment asm-generic: introduce io_stop_wc() and add implementation for ARM64 arm64: remove __dma_*_area() aliases docs/arm64: delete a space from tagged-address-abi arm64/fp: Add comments documenting the usage of state restore functions arm64: mm: Use asid feature macro for cheanup arm64: mm: Rename asid2idx() to ctxid2asid() arm64: kexec: reduce calls to page_address() arm64: extable: remove unused ex_handler_t definition arm64: entry: Use SDEI event constants arm64: Simplify checking for populated DT arm64/kvm: Fix bitrotted comment for SVE handling in handle_exit.c * for-next/cache-ops-dzp: : Avoid DC instructions when DCZID_EL0.DZP == 1 arm64: mte: DC {GVA,GZVA} shouldn't be used when DCZID_EL0.DZP == 1 arm64: clear_page() shouldn't use DC ZVA when DCZID_EL0.DZP == 1 * for-next/stacktrace: : Unify the arm64 unwind code arm64: Make some stacktrace functions private arm64: Make dump_backtrace() use arch_stack_walk() arm64: Make profile_pc() use arch_stack_walk() arm64: Make return_address() use arch_stack_walk() arm64: Make __get_wchan() use arch_stack_walk() arm64: Make perf_callchain_kernel() use arch_stack_walk() arm64: Mark __switch_to() as __sched arm64: Add comment for stack_info::kr_cur arch: Make ARCH_STACKWALK independent of STACKTRACE * for-next/xor-neon: : Use SHA3 instructions to speed up XOR arm64/xor: use EOR3 instructions when available * for-next/kasan: : Log potential KASAN shadow aliases arm64: mm: log potential KASAN shadow alias arm64: mm: use die_kernel_fault() in do_mem_abort() * for-next/armv8_7-fp: : Add HWCAPS for ARMv8.7 FEAT_AFP amd FEAT_RPRES arm64: cpufeature: add HWCAP for FEAT_RPRES arm64: add ID_AA64ISAR2_EL1 sys register arm64: cpufeature: add HWCAP for FEAT_AFP * for-next/atomics: : arm64 atomics clean-ups and codegen improvements arm64: atomics: lse: define RETURN ops in terms of FETCH ops arm64: atomics: lse: improve constraints for simple ops arm64: atomics: lse: define ANDs in terms of ANDNOTs arm64: atomics lse: define SUBs in terms of ADDs arm64: atomics: format whitespace consistently * for-next/bti: : BTI clean-ups arm64: Ensure that the 'bti' macro is defined where linkage.h is included arm64: Use BTI C directly and unconditionally arm64: Unconditionally override SYM_FUNC macros arm64: Add macro version of the BTI instruction arm64: ftrace: add missing BTIs arm64: kexec: use __pa_symbol(empty_zero_page) arm64: update PAC description for kernel * for-next/sve: : SVE code clean-ups and refactoring in prepararation of Scalable Matrix Extensions arm64/sve: Minor clarification of ABI documentation arm64/sve: Generalise vector length configuration prctl() for SME arm64/sve: Make sysctl interface for SVE reusable by SME * for-next/kselftest: : arm64 kselftest additions kselftest/arm64: Add pidbench for floating point syscall cases kselftest/arm64: Add a test program to exercise the syscall ABI kselftest/arm64: Allow signal tests to trigger from a function kselftest/arm64: Parameterise ptrace vector length information * for-next/kcsan: : Enable KCSAN for arm64 arm64: Enable KCSAN
2022-01-04swiotlb: Add CONFIG_HAS_IOMEM check around swiotlb_mem_remap()Wei Liu1-0/+7
HAS_IOMEM option may not be selected on some platforms (e.g, s390) and this will cause compilation failure due to missing memremap() implementation. Fix it by stubbing out swiotlb_mem_remap when CONFIG_HAS_IOMEM is not set. Reported-by: kernel test robot <[email protected]> Signed-off-by: Tianyu Lan <[email protected]> Acked-by: Christoph Hellwig <[email protected]> Signed-off-by: Wei Liu <[email protected]>
2022-01-04livepatch: Fix missing unlock on error in klp_enable_patch()Yang Yingliang1-1/+3
Add missing unlock when try_module_get() fails in klp_enable_patch(). Fixes: 5ef3dd20555e8e8 ("livepatch: Fix kobject refcount bug on klp_init_patch_early failure path") Reported-by: Hulk Robot <[email protected]> Signed-off-by: Yang Yingliang <[email protected]> Acked-by: David Vernet <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Signed-off-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-01-04livepatch: Fix kobject refcount bug on klp_init_patch_early failure pathDavid Vernet1-18/+13
When enabling a klp patch with klp_enable_patch(), klp_init_patch_early() is invoked to initialize the kobjects for the patch itself, as well as the 'struct klp_object' and 'struct klp_func' objects that comprise it. However, there are some error paths in klp_enable_patch() where some kobjects may have been initialized with kobject_init(), but an error code is still returned due to e.g. a 'struct klp_object' having a NULL funcs pointer. In these paths, the initial reference of the kobject of the 'struct klp_patch' may never be released, along with one or more of its objects and their functions, as kobject_put() is not invoked on the cleanup path if klp_init_patch_early() returns an error code. For example, if an object entry such as the following were added to the sample livepatch module's klp patch, it would cause the vmlinux klp_object, and its klp_func which updates 'cmdline_proc_show', to never be released: static struct klp_object objs[] = { { /* name being NULL means vmlinux */ .funcs = funcs, }, { /* NULL funcs -- would cause reference leak */ .name = "kvm", }, { } }; Without this change, if CONFIG_DEBUG_KOBJECT is enabled, and the sample klp patch is loaded, the kobjects (the patch, the vmlinux 'struct klp_object', and its func) are observed as initialized, but never released, in the dmesg log output. With the change, these kobject references no longer fail to be released as the error case is properly handled before they are initialized. Signed-off-by: David Vernet <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Acked-by: Miroslav Benes <[email protected]> Acked-by: Josh Poimboeuf <[email protected]> Signed-off-by: Petr Mladek <[email protected]>
2021-12-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller23-369/+672
Alexei Starovoitov says: ==================== pull-request: bpf-next 2021-12-30 The following pull-request contains BPF updates for your *net-next* tree. We've added 72 non-merge commits during the last 20 day(s) which contain a total of 223 files changed, 3510 insertions(+), 1591 deletions(-). The main changes are: 1) Automatic setrlimit in libbpf when bpf is memcg's in the kernel, from Andrii. 2) Beautify and de-verbose verifier logs, from Christy. 3) Composable verifier types, from Hao. 4) bpf_strncmp helper, from Hou. 5) bpf.h header dependency cleanup, from Jakub. 6) get_func_[arg|ret|arg_cnt] helpers, from Jiri. 7) Sleepable local storage, from KP. 8) Extend kfunc with PTR_TO_CTX, PTR_TO_MEM argument support, from Kumar. ==================== Signed-off-by: David S. Miller <[email protected]>
2021-12-30bpf: Fix typo in a comment in bpf lpm_trie.Leon Huayra1-1/+1
Fix typo in a comment in trie_update_elem(). Signed-off-by: Leon Huayra <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-12-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski5-9/+31
drivers/net/ethernet/mellanox/mlx5/core/en_tc.c commit 077cdda764c7 ("net/mlx5e: TC, Fix memory leak with rules with internal port") commit 31108d142f36 ("net/mlx5: Fix some error handling paths in 'mlx5e_tc_add_fdb_flow()'") commit 4390c6edc0fb ("net/mlx5: Fix some error handling paths in 'mlx5e_tc_add_fdb_flow()'") https://lore.kernel.org/all/[email protected]/ net/smc/smc_wr.c commit 49dc9013e34b ("net/smc: Use the bitmap API when applicable") commit 349d43127dac ("net/smc: fix kernel panic caused by race of smc_sock") bitmap_zero()/memset() is removed by the fix Signed-off-by: Jakub Kicinski <[email protected]>
2021-12-29net: Add includes masked by netdevice.h including uapi/bpf.hJakub Kicinski1-0/+1
Add missing includes unmasked by the subsequent change. Mostly network drivers missing an include for XDP_PACKET_HEADROOM. Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-12-29bpf: Allow bpf_local_storage to be used by sleepable programsKP Singh4-15/+50
Other maps like hashmaps are already available to sleepable programs. Sleepable BPF programs run under trace RCU. Allow task, sk and inode storage to be used from sleepable programs. This allows sleepable and non-sleepable programs to provide shareable annotations on kernel objects. Sleepable programs run in trace RCU where as non-sleepable programs run in a normal RCU critical section i.e. __bpf_prog_enter{_sleepable} and __bpf_prog_exit{_sleepable}) (rcu_read_lock or rcu_read_lock_trace). In order to make the local storage maps accessible to both sleepable and non-sleepable programs, one needs to call both call_rcu_tasks_trace and call_rcu to wait for both trace and classical RCU grace periods to expire before freeing memory. Paul's work on call_rcu_tasks_trace allows us to have per CPU queueing for call_rcu_tasks_trace. This behaviour can be achieved by setting rcupdate.rcu_task_enqueue_lim=<num_cpus> boot parameter. In light of these new performance changes and to keep the local storage code simple, avoid adding a new flag for sleepable maps / local storage to select the RCU synchronization (trace / classical). Also, update the dereferencing of the pointers to use rcu_derference_check (with either the trace or normal RCU locks held) with a common bpf_rcu_lock_held helper method. Signed-off-by: KP Singh <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-12-29bpf: Add missing map_get_next_key method to bloom filter map.Haimin Zhang1-0/+6
Without it, kernel crashes in map_get_next_key(). Fixes: 9330986c0300 ("bpf: Add bloom filter map implementation") Reported-by: TCS Robot <[email protected]> Signed-off-by: Haimin Zhang <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Joanne Koong <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-12-29net: Don't include filter.h from net/sock.hJakub Kicinski1-0/+1
sock.h is pretty heavily used (5k objects rebuilt on x86 after it's touched). We can drop the include of filter.h from it and add a forward declaration of struct sk_filter instead. This decreases the number of rebuilt objects when bpf.h is touched from ~5k to ~1k. There's a lot of missing includes this was masking. Primarily in networking tho, this time. Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Marc Kleine-Budde <[email protected]> Acked-by: Florian Fainelli <[email protected]> Acked-by: Nikolay Aleksandrov <[email protected]> Acked-by: Stefano Garzarella <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-12-29notifier: Return an error when a callback has already been registeredBorislav Petkov1-7/+8
Return -EEXIST when a notifier callback has already been registered on a notifier chain. This should avoid any homegrown registration tracking at the callsite like https://lore.kernel.org/amd-gfx/[email protected] for example. This version is an alternative of https://lore.kernel.org/r/[email protected] which needed to touch every caller not checking the registration routine's return value. Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Sebastian Andrzej Siewior <[email protected]> Acked-by: Alan Stern <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-12-28kobject: remove kset from struct kset_uevent_ops callbacksGreg Kroah-Hartman1-1/+1
There is no need to pass the pointer to the kset in the struct kset_uevent_ops callbacks as no one uses it, so just remove that pointer entirely. Reviewed-by: Rafael J. Wysocki <[email protected]> Reviewed-by: Wedson Almeida Filho <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2021-12-27driver core: make kobj_type constant.Wedson Almeida Filho1-1/+1
This way instances of kobj_type (which contain function pointers) can be stored in .rodata, which means that they cannot be [easily/accidentally] modified at runtime. Signed-off-by: Wedson Almeida Filho <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2021-12-25Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-0/+11
Merge misc fixes from Andrew Morton: "9 patches. Subsystems affected by this patch series: mm (kfence, mempolicy, memory-failure, pagemap, pagealloc, damon, and memory-failure), core-kernel, and MAINTAINERS" * emailed patches from Andrew Morton <[email protected]>: mm/hwpoison: clear MF_COUNT_INCREASED before retrying get_any_page() mm/damon/dbgfs: protect targets destructions with kdamond_lock mm/page_alloc: fix __alloc_size attribute for alloc_pages_exact_nid mm: delete unsafe BUG from page_cache_add_speculative() mm, hwpoison: fix condition in free hugetlb page path MAINTAINERS: mark more list instances as moderated kernel/crash_core: suppress unknown crashkernel parameter warning mm: mempolicy: fix THP allocations escaping mempolicy restrictions kfence: fix memory leak when cat kfence objects
2021-12-25kernel/crash_core: suppress unknown crashkernel parameter warningPhilipp Rudo1-0/+11
When booting with crashkernel= on the kernel command line a warning similar to Kernel command line: ro console=ttyS0 crashkernel=256M Unknown kernel command line parameters "crashkernel=256M", will be passed to user space. is printed. This comes from crashkernel= being parsed independent from the kernel parameter handling mechanism. So the code in init/main.c doesn't know that crashkernel= is a valid kernel parameter and prints this incorrect warning. Suppress the warning by adding a dummy early_param handler for crashkernel=. Link: https://lkml.kernel.org/r/[email protected] Fixes: 86d1919a4fb0 ("init: print out unknown kernel parameters") Signed-off-by: Philipp Rudo <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Andrew Halaney <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-12-23Merge branch 'ucount-rlimit-fixes-for-v5.16' of ↵Linus Torvalds1-6/+9
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull ucount fix from Eric Biederman: "This fixes a silly logic bug in the ucount rlimits code, where it was comparing against the wrong limit" * 'ucount-rlimit-fixes-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: ucounts: Fix rlimit max values check
2021-12-23Documentation: livepatch: Add livepatch API pageDavid Vernet1-3/+3
The livepatch subsystem has several exported functions and objects with kerneldoc comments. Though the livepatch documentation contains handwritten descriptions of all of these exported functions, they are currently not pulled into the docs build using the kernel-doc directive. In order to allow readers of the documentation to see the full kerneldoc comments in the generated documentation files, this change adds a new Documentation/livepatch/api.rst page which contains kernel-doc directives to link the kerneldoc comments directly in the documentation. With this, all of the hand-written descriptions of the APIs now cross-reference the kerneldoc comments on the new Livepatching APIs page, and running ./scripts/find-unused-docs.sh on kernel/livepatch no longer shows any files as missing documentation. Note that all of the handwritten API descriptions were left alone with the exception of Documentation/livepatch/system-state.rst, which was updated to allow the cross-referencing to work correctly. The file now follows the cross-referencing formatting guidance specified in Documentation/doc-guide/kernel-doc.rst. Furthermore, some comments around klp_shadow_free_all() were updated to say <_, id> rather than <*, id> to match the rest of the file, and to prevent the docs build from emitting an "Inline emphasis start-string without end string" error. Signed-off-by: David Vernet <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Acked-by: Miroslav Benes <[email protected]> Signed-off-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-12-22kthread: Never put_user the set_child_tid addressEric W. Biederman1-1/+1
Kernel threads abuse set_child_tid. Historically that has been fine as set_child_tid was initialized after the kernel thread had been forked. Unfortunately storing struct kthread in set_child_tid after the thread is running makes struct kthread being unusable for storing result codes of the thread. When set_child_tid is set to struct kthread during fork that results in schedule_tail writing the thread id to the beggining of struct kthread (if put_user does not realize it is a kernel address). Solve this by skipping the put_user for all kthreads. Reported-by: Nathan Chancellor <[email protected]> Link: https://lkml.kernel.org/r/YcNsG0Lp94V13whH@archlinux-ax161 Signed-off-by: "Eric W. Biederman" <[email protected]>
2021-12-21bpf: Use struct_size() helperXiu Jianfeng2-7/+2
In an effort to avoid open-coded arithmetic in the kernel, use the struct_size() helper instead of open-coded calculation. Signed-off-by: Xiu Jianfeng <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://github.com/KSPP/linux/issues/160 Link: https://lore.kernel.org/bpf/[email protected]
2021-12-21kthread: Warn about failed allocations for the init kthreadEric W. Biederman1-1/+1
Failed allocates are not expected when setting up the initial task and it is not really possible to handle them either. So I added a warning to report if such an allocation failure ever happens. Correct the sense of the warning so it warns when an allocation failure happens not when the allocation succeeded. Oops. Reported-by: kernel test robot <[email protected]> Reported-by: Stephen Rothwell <[email protected]> Reported-by: Naresh Kamboju <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/CA+G9fYvLaR5CF777CKeWTO+qJFTN6vAvm95gtzN+7fw3Wi5hkA@mail.gmail.com Link: https://lkml.kernel.org/r/20211216102956.GC10708@xsang-OptiPlex-9020 Fixes: 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads") Signed-off-by: "Eric W. Biederman" <[email protected]>
2021-12-20blktrace: switch trace spinlock to a raw spinlockWander Lairson Costa1-9/+9
The running_trace_lock protects running_trace_list and is acquired within the tracepoint which implies disabled preemption. The spinlock_t typed lock can not be acquired with disabled preemption on PREEMPT_RT because it becomes a sleeping lock. The runtime of the tracepoint depends on the number of entries in running_trace_list and has no limit. The blk-tracer is considered debug code and higher latencies here are okay. Make running_trace_lock a raw_spinlock_t. Signed-off-by: Wander Lairson Costa <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-12-20audit: use struct_size() helper in audit_[send|make]_reply()Xiu Jianfeng2-2/+2
Make use of struct_size() helper instead of an open-coded calculation. Link: https://github.com/KSPP/linux/issues/160 Signed-off-by: Xiu Jianfeng <[email protected]> Signed-off-by: Paul Moore <[email protected]>
2021-12-20swiotlb: Add swiotlb bounce buffer remap function for HV IVMTianyu Lan1-2/+41
In Isolation VM with AMD SEV, bounce buffer needs to be accessed via extra address space which is above shared_gpa_boundary (E.G 39 bit address line) reported by Hyper-V CPUID ISOLATION_CONFIG. The access physical address will be original physical address + shared_gpa_boundary. The shared_gpa_boundary in the AMD SEV SNP spec is called virtual top of memory(vTOM). Memory addresses below vTOM are automatically treated as private while memory above vTOM is treated as shared. Expose swiotlb_unencrypted_base for platforms to set unencrypted memory base offset and platform calls swiotlb_update_mem_attributes() to remap swiotlb mem to unencrypted address space. memremap() can not be called in the early stage and so put remapping code into swiotlb_update_mem_attributes(). Store remap address and use it to copy data from/to swiotlb bounce buffer. Signed-off-by: Tianyu Lan <[email protected]> Acked-by: Christoph Hellwig <[email protected]> Reviewed-by: Michael Kelley <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Wei Liu <[email protected]>
2021-12-20fork: Rename bad_fork_cleanup_threadgroup_lock to bad_fork_cleanup_delayacctEric W. Biederman1-3/+3
I just fixed a bug in copy_process when using the label bad_fork_cleanup_threadgroup_lock. While fixing the bug I looked closer at the label and realized it has been misnamed since 568ac888215c ("cgroup: reduce read locked section of cgroup_threadgroup_rwsem during fork"). Fix the name so that fork is easier to understand. Signed-off-by: "Eric W. Biederman" <[email protected]>
2021-12-20fork: Stop protecting back_fork_cleanup_cgroup_lock with CONFIG_NUMAEric W. Biederman1-1/+1
Mark Brown <[email protected]> reported: > This is also causing further build errors including but not limited to: > > /tmp/next/build/kernel/fork.c: In function 'copy_process': > /tmp/next/build/kernel/fork.c:2106:4: error: label 'bad_fork_cleanup_threadgroup_lock' used but not defined > 2106 | goto bad_fork_cleanup_threadgroup_lock; > | ^~~~ It turns out that I messed up and was depending upon a label protected by an ifdef. Move the label out of the ifdef as the ifdef around the label no longer makes sense (if it ever did). Link: https://lkml.kernel.org/r/[email protected] Fixes: 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads") Signed-off-by: "Eric W. Biederman" <[email protected]>
2021-12-19Merge tag 'timers_urgent_for_v5.16_rc6' of ↵Linus Torvalds1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Borislav Petkov: - Make sure the CLOCK_REALTIME to CLOCK_MONOTONIC offset is never positive * tag 'timers_urgent_for_v5.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timekeeping: Really make sure wall_to_monotonic isn't positive
2021-12-19Merge tag 'locking_urgent_for_v5.16_rc6' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Borislav Petkov: - Fix the rtmutex condition checking when the optimistic spinning of a waiter needs to be terminated * tag 'locking_urgent_for_v5.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/rtmutex: Fix incorrect condition in rtmutex_spin_on_owner()
2021-12-19Merge tag 'core_urgent_for_v5.16_rc6' of ↵Linus Torvalds1-0/+9
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull signal handlign fix from Borislav Petkov: - Prevent lock contention on the new sigaltstack lock on the common-case path, when no changes have been made to the alternative signal stack. * tag 'core_urgent_for_v5.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: signal: Skip the altstack update when not needed
2021-12-18bpf: Extend kfunc with PTR_TO_CTX, PTR_TO_MEM argument supportKumar Kartikeya Dwivedi1-21/+73
Allow passing PTR_TO_CTX, if the kfunc expects a matching struct type, and punt to PTR_TO_MEM block if reg->type does not fall in one of PTR_TO_BTF_ID or PTR_TO_SOCK* types. This will be used by future commits to get access to XDP and TC PTR_TO_CTX, and pass various data (flags, l4proto, netns_id, etc.) encoded in opts struct passed as pointer to kfunc. For PTR_TO_MEM support, arguments are currently limited to pointer to scalar, or pointer to struct composed of scalars. This is done so that unsafe scenarios (like passing PTR_TO_MEM where PTR_TO_BTF_ID of in-kernel valid structure is expected, which may have pointers) are avoided. Since the argument checking happens basd on argument register type, it is not easy to ascertain what the expected type is. In the future, support for PTR_TO_MEM for kfunc can be extended to serve other usecases. The struct type whose pointer is passed in may have maximum nesting depth of 4, all recursively composed of scalars or struct with scalars. Future commits will add negative tests that check whether these restrictions imposed for kfunc arguments are duly rejected by BPF verifier or not. Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-12-18bpf: Add MEM_RDONLY for helper args that are pointers to rdonly mem.Hao Luo7-24/+38
Some helper functions may modify its arguments, for example, bpf_d_path, bpf_get_stack etc. Previously, their argument types were marked as ARG_PTR_TO_MEM, which is compatible with read-only mem types, such as PTR_TO_RDONLY_BUF. Therefore it's legitimate, but technically incorrect, to modify a read-only memory by passing it into one of such helper functions. This patch tags the bpf_args compatible with immutable memory with MEM_RDONLY flag. The arguments that don't have this flag will be only compatible with mutable memory types, preventing the helper from modifying a read-only memory. The bpf_args that have MEM_RDONLY are compatible with both mutable memory and immutable memory. Signed-off-by: Hao Luo <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-12-18bpf: Make per_cpu_ptr return rdonly PTR_TO_MEM.Hao Luo2-6/+28
Tag the return type of {per, this}_cpu_ptr with RDONLY_MEM. The returned value of this pair of helpers is kernel object, which can not be updated by bpf programs. Previously these two helpers return PTR_OT_MEM for kernel objects of scalar type, which allows one to directly modify the memory. Now with RDONLY_MEM tagging, the verifier will reject programs that write into RDONLY_MEM. Fixes: 63d9b80dcf2c ("bpf: Introducte bpf_this_cpu_ptr()") Fixes: eaa6bcb71ef6 ("bpf: Introduce bpf_per_cpu_ptr()") Fixes: 4976b718c355 ("bpf: Introduce pseudo_btf_id") Signed-off-by: Hao Luo <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-12-18bpf: Convert PTR_TO_MEM_OR_NULL to composable types.Hao Luo2-2/+2
Remove PTR_TO_MEM_OR_NULL and replace it with PTR_TO_MEM combined with flag PTR_MAYBE_NULL. Signed-off-by: Hao Luo <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-12-18bpf: Introduce MEM_RDONLY flagHao Luo3-38/+53
This patch introduce a flag MEM_RDONLY to tag a reg value pointing to read-only memory. It makes the following changes: 1. PTR_TO_RDWR_BUF -> PTR_TO_BUF 2. PTR_TO_RDONLY_BUF -> PTR_TO_BUF | MEM_RDONLY Signed-off-by: Hao Luo <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-12-18bpf: Replace PTR_TO_XXX_OR_NULL with PTR_TO_XXX | PTR_MAYBE_NULLHao Luo3-177/+132
We have introduced a new type to make bpf_reg composable, by allocating bits in the type to represent flags. One of the flags is PTR_MAYBE_NULL which indicates a pointer may be NULL. This patch switches the qualified reg_types to use this flag. The reg_types changed in this patch include: 1. PTR_TO_MAP_VALUE_OR_NULL 2. PTR_TO_SOCKET_OR_NULL 3. PTR_TO_SOCK_COMMON_OR_NULL 4. PTR_TO_TCP_SOCK_OR_NULL 5. PTR_TO_BTF_ID_OR_NULL 6. PTR_TO_MEM_OR_NULL 7. PTR_TO_RDONLY_BUF_OR_NULL 8. PTR_TO_RDWR_BUF_OR_NULL Signed-off-by: Hao Luo <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-12-18bpf: Replace RET_XXX_OR_NULL with RET_XXX | PTR_MAYBE_NULLHao Luo2-27/+27
We have introduced a new type to make bpf_ret composable, by reserving high bits to represent flags. One of the flag is PTR_MAYBE_NULL, which indicates a pointer may be NULL. When applying this flag to ret_types, it means the returned value could be a NULL pointer. This patch switches the qualified arg_types to use this flag. The ret_types changed in this patch include: 1. RET_PTR_TO_MAP_VALUE_OR_NULL 2. RET_PTR_TO_SOCKET_OR_NULL 3. RET_PTR_TO_TCP_SOCK_OR_NULL 4. RET_PTR_TO_SOCK_COMMON_OR_NULL 5. RET_PTR_TO_ALLOC_MEM_OR_NULL 6. RET_PTR_TO_MEM_OR_BTF_ID_OR_NULL 7. RET_PTR_TO_BTF_ID_OR_NULL This patch doesn't eliminate the use of these names, instead it makes them aliases to 'RET_PTR_TO_XXX | PTR_MAYBE_NULL'. Signed-off-by: Hao Luo <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]