aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2021-08-25locking/rtmutex: Dont dereference waiter locklessThomas Gleixner2-2/+16
The new rt_mutex_spin_on_onwer() loop checks whether the spinning waiter is still the top waiter on the lock by utilizing rt_mutex_top_waiter(), which is broken because that function contains a sanity check which dereferences the top waiter pointer to check whether the waiter belongs to the lock. That's wrong in the lockless spinwait case: CPU 0 CPU 1 rt_mutex_lock(lock) rt_mutex_lock(lock); queue(waiter0) waiter0 == rt_mutex_top_waiter(lock) rt_mutex_spin_on_onwer(lock, waiter0) { queue(waiter1) waiter1 == rt_mutex_top_waiter(lock) ... top_waiter = rt_mutex_top_waiter(lock) leftmost = rb_first_cached(&lock->waiters); -> signal dequeue(waiter1) destroy(waiter1) w = rb_entry(leftmost, ....) BUG_ON(w->lock != lock) <- UAF The BUG_ON() is correct for the case where the caller holds lock->wait_lock which guarantees that the leftmost waiter entry cannot vanish. For the lockless spinwait case it's broken. Create a new helper function which avoids the pointer dereference and just compares the leftmost entry pointer with current's waiter pointer to validate that currrent is still elegible for spinning. Fixes: 992caf7f1724 ("locking/rtmutex: Add adaptive spinwait mechanism") Reported-by: Sebastian Siewior <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-08-24audit: move put_tree() to avoid trim_trees refcount underflow and UAFRichard Guy Briggs1-1/+1
AUDIT_TRIM is expected to be idempotent, but multiple executions resulted in a refcount underflow and use-after-free. git bisect fingered commit fb041bb7c0a9 ("locking/refcount: Consolidate implementations of refcount_t") but this patch with its more thorough checking that wasn't in the x86 assembly code merely exposed a previously existing tree refcount imbalance in the case of tree trimming code that was refactored with prune_one() to remove a tree introduced in commit 8432c7006297 ("audit: Simplify locking around untag_chunk()") Move the put_tree() to cover only the prune_one() case. Passes audit-testsuite and 3 passes of "auditctl -t" with at least one directory watch. Cc: Jan Kara <[email protected]> Cc: Will Deacon <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Seiji Nishikawa <[email protected]> Cc: [email protected] Fixes: 8432c7006297 ("audit: Simplify locking around untag_chunk()") Signed-off-by: Richard Guy Briggs <[email protected]> Reviewed-by: Jan Kara <[email protected]> [PM: reformatted/cleaned-up the commit description] Signed-off-by: Paul Moore <[email protected]>
2021-08-24bpf: Fix possible out of bound write in narrow load handlingAndrey Ignatov1-0/+4
Fix a verifier bug found by smatch static checker in [0]. This problem has never been seen in prod to my best knowledge. Fixing it still seems to be a good idea since it's hard to say for sure whether it's possible or not to have a scenario where a combination of convert_ctx_access() and a narrow load would lead to an out of bound write. When narrow load is handled, one or two new instructions are added to insn_buf array, but before it was only checked that cnt >= ARRAY_SIZE(insn_buf) And it's safe to add a new instruction to insn_buf[cnt++] only once. The second try will lead to out of bound write. And this is what can happen if `shift` is set. Fix it by making sure that if the BPF_RSH instruction has to be added in addition to BPF_AND then there is enough space for two more instructions in insn_buf. The full report [0] is below: kernel/bpf/verifier.c:12304 convert_ctx_accesses() warn: offset 'cnt' incremented past end of array kernel/bpf/verifier.c:12311 convert_ctx_accesses() warn: offset 'cnt' incremented past end of array kernel/bpf/verifier.c 12282 12283 insn->off = off & ~(size_default - 1); 12284 insn->code = BPF_LDX | BPF_MEM | size_code; 12285 } 12286 12287 target_size = 0; 12288 cnt = convert_ctx_access(type, insn, insn_buf, env->prog, 12289 &target_size); 12290 if (cnt == 0 || cnt >= ARRAY_SIZE(insn_buf) || ^^^^^^^^^^^^^^^^^^^^^^^^^^^ Bounds check. 12291 (ctx_field_size && !target_size)) { 12292 verbose(env, "bpf verifier is misconfigured\n"); 12293 return -EINVAL; 12294 } 12295 12296 if (is_narrower_load && size < target_size) { 12297 u8 shift = bpf_ctx_narrow_access_offset( 12298 off, size, size_default) * 8; 12299 if (ctx_field_size <= 4) { 12300 if (shift) 12301 insn_buf[cnt++] = BPF_ALU32_IMM(BPF_RSH, ^^^^^ increment beyond end of array 12302 insn->dst_reg, 12303 shift); --> 12304 insn_buf[cnt++] = BPF_ALU32_IMM(BPF_AND, insn->dst_reg, ^^^^^ out of bounds write 12305 (1 << size * 8) - 1); 12306 } else { 12307 if (shift) 12308 insn_buf[cnt++] = BPF_ALU64_IMM(BPF_RSH, 12309 insn->dst_reg, 12310 shift); 12311 insn_buf[cnt++] = BPF_ALU64_IMM(BPF_AND, insn->dst_reg, ^^^^^^^^^^^^^^^ Same. 12312 (1ULL << size * 8) - 1); 12313 } 12314 } 12315 12316 new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt); 12317 if (!new_prog) 12318 return -ENOMEM; 12319 12320 delta += cnt - 1; 12321 12322 /* keep walking new program and skip insns we just inserted */ 12323 env->prog = new_prog; 12324 insn = new_prog->insnsi + i + delta; 12325 } 12326 12327 return 0; 12328 } [0] https://lore.kernel.org/bpf/20210817050843.GA21456@kili/ v1->v2: - clarify that problem was only seen by static checker but not in prod; Fixes: 46f53a65d2de ("bpf: Allow narrow loads with offset > 0") Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Andrey Ignatov <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-08-24genirq/msi: Move MSI sysfs handling from PCI to MSI coreBarry Song1-0/+134
Move PCI's MSI sysfs code to the irq core so that other busses such as platform can reuse it. Signed-off-by: Barry Song <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Greg Kroah-Hartman <[email protected]> Acked-by: Bjorn Helgaas <[email protected]> Acked-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-08-24genirq/cpuhotplug: Demote debug printk to KERN_DEBUGLee Jones1-1/+1
This sort of information is only generally useful when debugging. No need to have these sprinkled through the kernel log otherwise. Real world problem: During pre-release testing these have an affect on performance on real products. To the point where so much logging builds up, that it sets off the watchdog(s) on some high profile consumer devices. Signed-off-by: Lee Jones <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-08-23bpf: Migrate cgroup_bpf to internal cgroup_bpf_attach_type enumDave Marchevsky1-63/+93
Add an enum (cgroup_bpf_attach_type) containing only valid cgroup_bpf attach types and a function to map bpf_attach_type values to the new enum. Inspired by netns_bpf_attach_type. Then, migrate cgroup_bpf to use cgroup_bpf_attach_type wherever possible. Functionality is unchanged as attach_type_to_prog_type switches in bpf/syscall.c were preventing non-cgroup programs from making use of the invalid cgroup_bpf array slots. As a result struct cgroup_bpf uses 504 fewer bytes relative to when its arrays were sized using MAX_BPF_ATTACH_TYPE. bpf_cgroup_storage is notably not migrated as struct bpf_cgroup_storage_key is part of uapi and contains a bpf_attach_type member which is not meant to be opaque. Similarly, bpf_cgroup_link continues to report its bpf_attach_type member to userspace via fdinfo and bpf_link_info. To ease disambiguation, bpf_attach_type variables are renamed from 'type' to 'atype' when changed to cgroup_bpf_attach_type. Signed-off-by: Dave Marchevsky <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-08-23ucounts: Increase ucounts reference counter before the security hookAlexey Gladkov1-6/+6
We need to increment the ucounts reference counter befor security_prepare_creds() because this function may fail and abort_creds() will try to decrement this reference. [ 96.465056][ T8641] FAULT_INJECTION: forcing a failure. [ 96.465056][ T8641] name fail_page_alloc, interval 1, probability 0, space 0, times 0 [ 96.478453][ T8641] CPU: 1 PID: 8641 Comm: syz-executor668 Not tainted 5.14.0-rc6-syzkaller #0 [ 96.487215][ T8641] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 [ 96.497254][ T8641] Call Trace: [ 96.500517][ T8641] dump_stack_lvl+0x1d3/0x29f [ 96.505758][ T8641] ? show_regs_print_info+0x12/0x12 [ 96.510944][ T8641] ? log_buf_vmcoreinfo_setup+0x498/0x498 [ 96.516652][ T8641] should_fail+0x384/0x4b0 [ 96.521141][ T8641] prepare_alloc_pages+0x1d1/0x5a0 [ 96.526236][ T8641] __alloc_pages+0x14d/0x5f0 [ 96.530808][ T8641] ? __rmqueue_pcplist+0x2030/0x2030 [ 96.536073][ T8641] ? lockdep_hardirqs_on_prepare+0x3e2/0x750 [ 96.542056][ T8641] ? alloc_pages+0x3f3/0x500 [ 96.546635][ T8641] allocate_slab+0xf1/0x540 [ 96.551120][ T8641] ___slab_alloc+0x1cf/0x350 [ 96.555689][ T8641] ? kzalloc+0x1d/0x30 [ 96.559740][ T8641] __kmalloc+0x2e7/0x390 [ 96.563980][ T8641] ? kzalloc+0x1d/0x30 [ 96.568029][ T8641] kzalloc+0x1d/0x30 [ 96.571903][ T8641] security_prepare_creds+0x46/0x220 [ 96.577174][ T8641] prepare_creds+0x411/0x640 [ 96.581747][ T8641] __sys_setfsuid+0xe2/0x3a0 [ 96.586333][ T8641] do_syscall_64+0x3d/0xb0 [ 96.590739][ T8641] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 96.596611][ T8641] RIP: 0033:0x445a69 [ 96.600483][ T8641] Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 [ 96.620152][ T8641] RSP: 002b:00007f1054173318 EFLAGS: 00000246 ORIG_RAX: 000000000000007a [ 96.628543][ T8641] RAX: ffffffffffffffda RBX: 00000000004ca4c8 RCX: 0000000000445a69 [ 96.636600][ T8641] RDX: 0000000000000010 RSI: 00007f10541732f0 RDI: 0000000000000000 [ 96.644550][ T8641] RBP: 00000000004ca4c0 R08: 0000000000000001 R09: 0000000000000000 [ 96.652500][ T8641] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004ca4cc [ 96.660631][ T8641] R13: 00007fffffe0b62f R14: 00007f1054173400 R15: 0000000000022000 Fixes: 905ae01c4ae2 ("Add a reference to ucounts for each cred") Reported-by: [email protected] Signed-off-by: Alexey Gladkov <[email protected]> Link: https://lkml.kernel.org/r/97433b1742c3331f02ad92de5a4f07d673c90613.1629735352.git.legion@kernel.org Signed-off-by: Eric W. Biederman <[email protected]>
2021-08-23ucounts: Fix regression preventing increasing of rlimits in init_user_nsEric W. Biederman1-4/+4
"Ma, XinjianX" <[email protected]> reported: > When lkp team run kernel selftests, we found after these series of patches, testcase mqueue: mq_perf_tests > in kselftest failed with following message. > > # selftests: mqueue: mq_perf_tests > # > # Initial system state: > # Using queue path: /mq_perf_tests > # RLIMIT_MSGQUEUE(soft): 819200 > # RLIMIT_MSGQUEUE(hard): 819200 > # Maximum Message Size: 8192 > # Maximum Queue Size: 10 > # Nice value: 0 > # > # Adjusted system state for testing: > # RLIMIT_MSGQUEUE(soft): (unlimited) > # RLIMIT_MSGQUEUE(hard): (unlimited) > # Maximum Message Size: 16777216 > # Maximum Queue Size: 65530 > # Nice value: -20 > # Continuous mode: (disabled) > # CPUs to pin: 3 > # ./mq_perf_tests: mq_open() at 296: Too many open files > not ok 2 selftests: mqueue: mq_perf_tests # exit=1 > ``` > > Test env: > rootfs: debian-10 > gcc version: 9 After investigation the problem turned out to be that ucount_max for the rlimits in init_user_ns was being set to the initial rlimit value. The practical problem is that ucount_max provides a limit that applications inside the user namespace can not exceed. Which means in practice that rlimits that have been converted to use the ucount infrastructure were not able to exceend their initial rlimits. Solve this by setting the relevant values of ucount_max to RLIM_INIFINITY. A limit in init_user_ns is pointless so the code should allow the values to grow as large as possible without riscking an underflow or an overflow. As the ltp test case was a bit of a pain I have reproduced the rlimit failure and tested the fix with the following little C program: > #include <stdio.h> > #include <fcntl.h> > #include <sys/stat.h> > #include <mqueue.h> > #include <sys/time.h> > #include <sys/resource.h> > #include <errno.h> > #include <string.h> > #include <stdlib.h> > #include <limits.h> > #include <unistd.h> > > int main(int argc, char **argv) > { > struct mq_attr mq_attr; > struct rlimit rlim; > mqd_t mqd; > int ret; > > ret = getrlimit(RLIMIT_MSGQUEUE, &rlim); > if (ret != 0) { > fprintf(stderr, "getrlimit(RLIMIT_MSGQUEUE) failed: %s\n", strerror(errno)); > exit(EXIT_FAILURE); > } > printf("RLIMIT_MSGQUEUE %lu %lu\n", > rlim.rlim_cur, rlim.rlim_max); > rlim.rlim_cur = RLIM_INFINITY; > rlim.rlim_max = RLIM_INFINITY; > ret = setrlimit(RLIMIT_MSGQUEUE, &rlim); > if (ret != 0) { > fprintf(stderr, "setrlimit(RLIMIT_MSGQUEUE, RLIM_INFINITY) failed: %s\n", strerror(errno)); > exit(EXIT_FAILURE); > } > > memset(&mq_attr, 0, sizeof(struct mq_attr)); > mq_attr.mq_maxmsg = 65536 - 1; > mq_attr.mq_msgsize = 16*1024*1024 - 1; > > mqd = mq_open("/mq_rlimit_test", O_RDONLY|O_CREAT, 0600, &mq_attr); > if (mqd == (mqd_t)-1) { > fprintf(stderr, "mq_open failed: %s\n", strerror(errno)); > exit(EXIT_FAILURE); > } > ret = mq_close(mqd); > if (ret) { > fprintf(stderr, "mq_close failed; %s\n", strerror(errno)); > exit(EXIT_FAILURE); > } > > return EXIT_SUCCESS; > } Fixes: 6e52a9f0532f ("Reimplement RLIMIT_MSGQUEUE on top of ucounts") Fixes: d7c9e99aee48 ("Reimplement RLIMIT_MEMLOCK on top of ucounts") Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") Reported-by: kernel test robot [email protected] Acked-by: Alexey Gladkov <[email protected]> Link: https://lkml.kernel.org/r/87eeajswfc.fsf_-_@disp2133 Signed-off-by: "Eric W. Biederman" <[email protected]>
2021-08-23bpf: Fix ringbuf helper function compatibilityDaniel Borkmann1-2/+6
Commit 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it") extended check_map_func_compatibility() by enforcing map -> helper function match, but not helper -> map type match. Due to this all of the bpf_ringbuf_*() helper functions could be used with a wrong map type such as array or hash map, leading to invalid access due to type confusion. Also, both BPF_FUNC_ringbuf_{submit,discard} have ARG_PTR_TO_ALLOC_MEM as argument and not a BPF map. Therefore, their check_map_func_compatibility() presence is incorrect since it's only for map type checking. Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it") Reported-by: Ryota Shiga (Flatt Security) Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]>
2021-08-23io_uring: remove files pointer in cancellation functionsHao Xu1-1/+1
When doing cancellation, we use a parameter to indicate where it's from do_exit or exec. So a boolean value is good enough for this, remove the struct files* as it is not necessary. Signed-off-by: Hao Xu <[email protected]> [axboe: fixup io_uring_files_cancel for !CONFIG_IO_URING] Signed-off-by: Jens Axboe <[email protected]>
2021-08-23Merge back cpufreq changes for v5.15.Rafael J. Wysocki1-5/+11
2021-08-23irqdomain: Export irq_domain_disconnect_hierarchy()Maulik Shah1-0/+1
Export irq_domain_disconnect_hierarchy() so irqchip module drivers can use it. Signed-off-by: Maulik Shah <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-08-20tracing: Add a probe that attaches to trace eventsTzvetomir Stoyanov (VMware)8-13/+958
A new dynamic event is introduced: event probe. The event is attached to an existing tracepoint and uses its fields as arguments. The user can specify custom format string of the new event, select what tracepoint arguments will be printed and how to print them. An event probe is created by writing configuration string in 'dynamic_events' ftrace file: e[:[SNAME/]ENAME] SYSTEM/EVENT [FETCHARGS] - Set an event probe -:SNAME/ENAME - Delete an event probe Where: SNAME - System name, if omitted 'eprobes' is used. ENAME - Name of the new event in SNAME, if omitted the SYSTEM_EVENT is used. SYSTEM - Name of the system, where the tracepoint is defined, mandatory. EVENT - Name of the tracepoint event in SYSTEM, mandatory. FETCHARGS - Arguments: <name>=$<field>[:TYPE] - Fetch given filed of the tracepoint and print it as given TYPE with given name. Supported types are: (u8/u16/u32/u64/s8/s16/s32/s64), basic type (x8/x16/x32/x64), hexadecimal types "string", "ustring" and bitfield. Example, attach an event probe on openat system call and print name of the file that will be opened: echo "e:esys/eopen syscalls/sys_enter_openat file=\$filename:string" >> dynamic_events A new dynamic event is created in events/esys/eopen/ directory. It can be deleted with: echo "-:esys/eopen" >> dynamic_events Filters, triggers and histograms can be attached to the new event, it can be matched in synthetic events. There is one limitation - an event probe can not be attached to kprobe, uprobe or another event probe. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Acked-by: Masami Hiramatsu <[email protected]> Co-developed-by: Steven Rostedt (VMware) <[email protected]> Signed-off-by: Tzvetomir Stoyanov (VMware) <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-08-20locking/semaphore: Add might_sleep() to down_*() familyXiaoming Ni1-0/+4
Semaphore is sleeping lock. Add might_sleep() to down*() family (with exception of down_trylock()) to detect atomic context sleep. Signed-off-by: Xiaoming Ni <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Will Deacon <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-08-20sched: Introduce dl_task_check_affinity() to check proposed affinityWill Deacon1-17/+29
In preparation for restricting the affinity of a task during execve() on arm64, introduce a new dl_task_check_affinity() helper function to give an indication as to whether the restricted mask is admissible for a deadline task. Signed-off-by: Will Deacon <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Daniel Bristot de Oliveira <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-08-20sched: Allow task CPU affinity to be restricted on asymmetric systemsWill Deacon2-18/+181
Asymmetric systems may not offer the same level of userspace ISA support across all CPUs, meaning that some applications cannot be executed by some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do not feature support for 32-bit applications on both clusters. Although userspace can carefully manage the affinity masks for such tasks, one place where it is particularly problematic is execve() because the CPU on which the execve() is occurring may be incompatible with the new application image. In such a situation, it is desirable to restrict the affinity mask of the task and ensure that the new image is entered on a compatible CPU. From userspace's point of view, this looks the same as if the incompatible CPUs have been hotplugged off in the task's affinity mask. Similarly, if a subsequent execve() reverts to a compatible image, then the old affinity is restored if it is still valid. In preparation for restricting the affinity mask for compat tasks on arm64 systems without uniform support for 32-bit applications, introduce {force,relax}_compatible_cpus_allowed_ptr(), which respectively restrict and restore the affinity mask for a task based on the compatible CPUs. Signed-off-by: Will Deacon <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Valentin Schneider <[email protected]> Reviewed-by: Quentin Perret <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-08-20sched: Split the guts of sched_setaffinity() into a helper functionWill Deacon1-48/+57
In preparation for replaying user affinity requests using a saved mask, split sched_setaffinity() up so that the initial task lookup and security checks are only performed when the request is coming directly from userspace. Signed-off-by: Will Deacon <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Valentin Schneider <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-08-20sched: Introduce task_struct::user_cpus_ptr to track requested affinityWill Deacon2-0/+22
In preparation for saving and restoring the user-requested CPU affinity mask of a task, add a new cpumask_t pointer to 'struct task_struct'. If the pointer is non-NULL, then the mask is copied across fork() and freed on task exit. Signed-off-by: Will Deacon <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Valentin Schneider <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-08-20sched: Reject CPU affinity changes based on task_cpu_possible_mask()Will Deacon1-1/+8
Reject explicit requests to change the affinity mask of a task via set_cpus_allowed_ptr() if the requested mask is not a subset of the mask returned by task_cpu_possible_mask(). This ensures that the 'cpus_mask' for a given task cannot contain CPUs which are incapable of executing it, except in cases where the affinity is forced. Signed-off-by: Will Deacon <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Valentin Schneider <[email protected]> Reviewed-by: Quentin Perret <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-08-20cpuset: Cleanup cpuset_cpus_allowed_fallback() use in select_fallback_rq()Will Deacon2-4/+9
select_fallback_rq() only needs to recheck for an allowed CPU if the affinity mask of the task has changed since the last check. Return a 'bool' from cpuset_cpus_allowed_fallback() to indicate whether the affinity mask was updated, and use this to elide the allowed check when the mask has been left alone. No functional change. Suggested-by: Valentin Schneider <[email protected]> Signed-off-by: Will Deacon <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Valentin Schneider <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-08-20cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus()Will Deacon1-17/+26
Asymmetric systems may not offer the same level of userspace ISA support across all CPUs, meaning that some applications cannot be executed by some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do not feature support for 32-bit applications on both clusters. Modify guarantee_online_cpus() to take task_cpu_possible_mask() into account when trying to find a suitable set of online CPUs for a given task. This will avoid passing an invalid mask to set_cpus_allowed_ptr() during ->attach() and will subsequently allow the cpuset hierarchy to be taken into account when forcefully overriding the affinity mask for a task which requires migration to a compatible CPU. Signed-off-by: Will Deacon <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Valentin Schneider <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-08-20cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1Will Deacon1-2/+6
If the scheduler cannot find an allowed CPU for a task, cpuset_cpus_allowed_fallback() will widen the affinity to cpu_possible_mask if cgroup v1 is in use. In preparation for allowing architectures to provide their own fallback mask, just return early if we're either using cgroup v1 or we're using cgroup v2 with a mask that contains invalid CPUs. This will allow select_fallback_rq() to figure out the mask by itself. Signed-off-by: Will Deacon <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Valentin Schneider <[email protected]> Reviewed-by: Quentin Perret <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-08-20sched: Introduce task_cpu_possible_mask() to limit fallback rq selectionWill Deacon1-6/+3
Asymmetric systems may not offer the same level of userspace ISA support across all CPUs, meaning that some applications cannot be executed by some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do not feature support for 32-bit applications on both clusters. On such a system, we must take care not to migrate a task to an unsupported CPU when forcefully moving tasks in select_fallback_rq() in response to a CPU hot-unplug operation. Introduce a task_cpu_possible_mask() hook which, given a task argument, allows an architecture to return a cpumask of CPUs that are capable of executing that task. The default implementation returns the cpu_possible_mask, since sane machines do not suffer from per-cpu ISA limitations that affect scheduling. The new mask is used when selecting the fallback runqueue as a last resort before forcing a migration to the first active CPU. Signed-off-by: Will Deacon <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Valentin Schneider <[email protected]> Reviewed-by: Quentin Perret <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-08-20sched: Cgroup SCHED_IDLE supportJosh Don4-25/+208
This extends SCHED_IDLE to cgroups. Interface: cgroup/cpu.idle. 0: default behavior 1: SCHED_IDLE Extending SCHED_IDLE to cgroups means that we incorporate the existing aspects of SCHED_IDLE; a SCHED_IDLE cgroup will count all of its descendant threads towards the idle_h_nr_running count of all of its ancestor cgroups. Thus, sched_idle_rq() will work properly. Additionally, SCHED_IDLE cgroups are configured with minimum weight. There are two key differences between the per-task and per-cgroup SCHED_IDLE interface: - The cgroup interface allows tasks within a SCHED_IDLE hierarchy to maintain their relative weights. The entity that is "idle" is the cgroup, not the tasks themselves. - Since the idle entity is the cgroup, our SCHED_IDLE wakeup preemption decision is not made by comparing the current task with the woken task, but rather by comparing their matching sched_entity. A typical use-case for this is a user that creates an idle and a non-idle subtree. The non-idle subtree will dominate competition vs the idle subtree, but the idle subtree will still be high priority vs other users on the system. The latter is accomplished via comparing matching sched_entity in the waken preemption path (this could also be improved by making the sched_idle_rq() decision dependent on the perspective of a specific task). For now, we maintain the existing SCHED_IDLE semantics. Future patches may make improvements that extend how we treat SCHED_IDLE entities. The per-task_group idle field is an integer that currently only holds either a 0 or a 1. This is explicitly typed as an integer to allow for further extensions to this API. For example, a negative value may indicate a highly latency-sensitive cgroup that should be preferred for preemption/placement/etc. Signed-off-by: Josh Don <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Vincent Guittot <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-08-20sched/topology: Skip updating masks for non-online nodesValentin Schneider1-0/+65
The scheduler currently expects NUMA node distances to be stable from init onwards, and as a consequence builds the related data structures once-and-for-all at init (see sched_init_numa()). Unfortunately, on some architectures node distance is unreliable for offline nodes and may very well change upon onlining. Skip over offline nodes during sched_init_numa(). Track nodes that have been onlined at least once, and trigger a build of a node's NUMA masks when it is first onlined post-init. Reported-by: Geetika Moolchandani <[email protected]> Signed-off-by: Srikar Dronamraju <[email protected]> Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-08-20sched: Fix Core-wide rq->lock for uninitialized CPUsPeter Zijlstra2-27/+118
Eugene tripped over the case where rq_lock(), as called in a for_each_possible_cpu() loop came apart because rq->core hadn't been setup yet. This is a somewhat unusual, but valid case. Rework things such that rq->core is initialized to point at itself. IOW initialize each CPU as a single threaded Core. CPU online will then join the new CPU (thread) to an existing Core where needed. For completeness sake, have CPU offline fully undo the state so as to not presume the topology will match the next time it comes online. Fixes: 9edeaea1bc45 ("sched: Core-wide rq->lock") Reported-by: Eugene Syromiatnikov <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Josh Don <[email protected]> Tested-by: Eugene Syromiatnikov <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-08-20locking/ww_mutex: Initialize waiter.ww_ctx properlySebastian Andrzej Siewior1-1/+1
The consolidation of the debug code for mutex waiter intialization sets waiter::ww_ctx to a poison value unconditionally. For regular mutexes this is intended to catch the case where waiter_ww_ctx is dereferenced accidentally. For ww_mutex the poison value has to be overwritten either with a context pointer or NULL for ww_mutexes without context. The rework broke this as it made the store conditional on the context pointer instead of the argument which signals whether ww_mutex code should be compiled in or optiized out. As a result waiter::ww_ctx ends up with the poison pointer for contextless ww_mutexes which causes a later dereference of the poison pointer because it is != NULL. Use the build argument instead so for ww_mutex the poison value is always overwritten. Fixes: c0afb0ffc06e6 ("locking/ww_mutex: Gather mutex_waiter initialization") Reported-by: Guenter Roeck <[email protected]> Suggested-by: Peter Zijlstra <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-08-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski12-53/+116
drivers/ptp/Kconfig: 55c8fca1dae1 ("ptp_pch: Restore dependency on PCI") e5f31552674e ("ethernet: fix PTP_1588_CLOCK dependencies") Signed-off-by: Jakub Kicinski <[email protected]>
2021-08-20bpf: Add support for {set|get} socket options from setsockopt BPFPrankur Gupta1-0/+8
Add logic to call bpf_setsockopt() and bpf_getsockopt() from setsockopt BPF programs. An example use case is when the user sets the IPV6_TCLASS socket option, we would also like to change the tcp-cc for that socket. We don't have any use case for calling bpf_setsockopt() from supposedly read- only sys_getsockopt(), so it is made available to BPF_CGROUP_SETSOCKOPT only at this point. Signed-off-by: Prankur Gupta <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Song Liu <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-08-20bpf: Use kvmalloc for map keys in syscallsStanislav Fomichev1-17/+17
Same as previous patch but for the keys. memdup_bpfptr is renamed to kvmemdup_bpfptr (and converted to kvmalloc). Signed-off-by: Stanislav Fomichev <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Song Liu <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-08-20bpf: Use kvmalloc for map values in syscallStanislav Fomichev1-17/+11
Use kvmalloc/kvfree for temporary value when manipulating a map via syscall. kmalloc might not be sufficient for percpu maps where the value is big (and further multiplied by hundreds of CPUs). Can be reproduced with netcnt test on qemu with "-smp 255". Signed-off-by: Stanislav Fomichev <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Song Liu <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-08-19Merge tag 'net-5.14-rc7' of ↵Linus Torvalds2-6/+17
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes, including fixes from bpf, wireless and mac80211 trees. Current release - regressions: - tipc: call tipc_wait_for_connect only when dlen is not 0 - mac80211: fix locking in ieee80211_restart_work() Current release - new code bugs: - bpf: add rcu_read_lock in bpf_get_current_[ancestor_]cgroup_id() - ethernet: ice: fix perout start time rounding - wwan: iosm: prevent underflow in ipc_chnl_cfg_get() Previous releases - regressions: - bpf: clear zext_dst of dead insns - sch_cake: fix srchost/dsthost hashing mode - vrf: reset skb conntrack connection on VRF rcv - net/rds: dma_map_sg is entitled to merge entries Previous releases - always broken: - ethernet: bnxt: fix Tx path locking and races, add Rx path barriers" * tag 'net-5.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (42 commits) net: dpaa2-switch: disable the control interface on error path Revert "flow_offload: action should not be NULL when it is referenced" iavf: Fix ping is lost after untrusted VF had tried to change MAC i40e: Fix ATR queue selection r8152: fix the maximum number of PLA bp for RTL8153C r8152: fix writing USB_BP2_EN mptcp: full fully established support after ADD_ADDR mptcp: fix memory leak on address flush net/rds: dma_map_sg is entitled to merge entries net: mscc: ocelot: allow forwarding from bridge ports to the tag_8021q CPU port net: asix: fix uninit value bugs ovs: clear skb->tstamp in forwarding path net: mdio-mux: Handle -EPROBE_DEFER correctly net: mdio-mux: Don't ignore memory allocation errors net: mdio-mux: Delete unnecessary devm_kfree net: dsa: sja1105: fix use-after-free after calling of_find_compatible_node, or worse sch_cake: fix srchost/dsthost hashing mode ixgbe, xsk: clean up the resources in ixgbe_xsk_pool_enable error path net: qlcnic: add missed unlock in qlcnic_83xx_flash_read32 mac80211: fix locking in ieee80211_restart_work() ...
2021-08-19bpf: Fix NULL event->prog pointer access in bpf_overflow_handlerYonghong Song1-1/+4
Andrii reported that libbpf CI hit the following oops when running selftest send_signal: [ 1243.160719] BUG: kernel NULL pointer dereference, address: 0000000000000030 [ 1243.161066] #PF: supervisor read access in kernel mode [ 1243.161066] #PF: error_code(0x0000) - not-present page [ 1243.161066] PGD 0 P4D 0 [ 1243.161066] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 1243.161066] CPU: 1 PID: 882 Comm: new_name Tainted: G O 5.14.0-rc5 #1 [ 1243.161066] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 [ 1243.161066] RIP: 0010:bpf_overflow_handler+0x9a/0x1e0 [ 1243.161066] Code: 5a 84 c0 0f 84 06 01 00 00 be 66 02 00 00 48 c7 c7 6d 96 07 82 48 8b ab 18 05 00 00 e8 df 55 eb ff 66 90 48 8d 75 48 48 89 e7 <ff> 55 30 41 89 c4 e8 fb c1 f0 ff 84 c0 0f 84 94 00 00 00 e8 6e 0f [ 1243.161066] RSP: 0018:ffffc900000c0d80 EFLAGS: 00000046 [ 1243.161066] RAX: 0000000000000002 RBX: ffff8881002e0dd0 RCX: 00000000b4b47cf8 [ 1243.161066] RDX: ffffffff811dcb06 RSI: 0000000000000048 RDI: ffffc900000c0d80 [ 1243.161066] RBP: 0000000000000000 R08: 0000000000000000 R09: 1a9d56bb00000000 [ 1243.161066] R10: 0000000000000001 R11: 0000000000080000 R12: 0000000000000000 [ 1243.161066] R13: ffffc900000c0e00 R14: ffffc900001c3c68 R15: 0000000000000082 [ 1243.161066] FS: 00007fc0be2d3380(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000 [ 1243.161066] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1243.161066] CR2: 0000000000000030 CR3: 0000000104f8e000 CR4: 00000000000006e0 [ 1243.161066] Call Trace: [ 1243.161066] <IRQ> [ 1243.161066] __perf_event_overflow+0x4f/0xf0 [ 1243.161066] perf_swevent_hrtimer+0x116/0x130 [ 1243.161066] ? __lock_acquire+0x378/0x2730 [ 1243.161066] ? __lock_acquire+0x372/0x2730 [ 1243.161066] ? lock_is_held_type+0xd5/0x130 [ 1243.161066] ? find_held_lock+0x2b/0x80 [ 1243.161066] ? lock_is_held_type+0xd5/0x130 [ 1243.161066] ? perf_event_groups_first+0x80/0x80 [ 1243.161066] ? perf_event_groups_first+0x80/0x80 [ 1243.161066] __hrtimer_run_queues+0x1a3/0x460 [ 1243.161066] hrtimer_interrupt+0x110/0x220 [ 1243.161066] __sysvec_apic_timer_interrupt+0x8a/0x260 [ 1243.161066] sysvec_apic_timer_interrupt+0x89/0xc0 [ 1243.161066] </IRQ> [ 1243.161066] asm_sysvec_apic_timer_interrupt+0x12/0x20 [ 1243.161066] RIP: 0010:finish_task_switch+0xaf/0x250 [ 1243.161066] Code: 31 f6 68 90 2a 09 81 49 8d 7c 24 18 e8 aa d6 03 00 4c 89 e7 e8 12 ff ff ff 4c 89 e7 e8 ca 9c 80 00 e8 35 af 0d 00 fb 4d 85 f6 <58> 74 1d 65 48 8b 04 25 c0 6d 01 00 4c 3b b0 a0 04 00 00 74 37 f0 [ 1243.161066] RSP: 0018:ffffc900001c3d18 EFLAGS: 00000282 [ 1243.161066] RAX: 000000000000031f RBX: ffff888104cf4980 RCX: 0000000000000000 [ 1243.161066] RDX: 0000000000000000 RSI: ffffffff82095460 RDI: ffffffff820adc4e [ 1243.161066] RBP: ffffc900001c3d58 R08: 0000000000000001 R09: 0000000000000001 [ 1243.161066] R10: 0000000000000001 R11: 0000000000080000 R12: ffff88813bd2bc80 [ 1243.161066] R13: ffff8881002e8000 R14: ffff88810022ad80 R15: 0000000000000000 [ 1243.161066] ? finish_task_switch+0xab/0x250 [ 1243.161066] ? finish_task_switch+0x70/0x250 [ 1243.161066] __schedule+0x36b/0xbb0 [ 1243.161066] ? _raw_spin_unlock_irqrestore+0x2d/0x50 [ 1243.161066] ? lockdep_hardirqs_on+0x79/0x100 [ 1243.161066] schedule+0x43/0xe0 [ 1243.161066] pipe_read+0x30b/0x450 [ 1243.161066] ? wait_woken+0x80/0x80 [ 1243.161066] new_sync_read+0x164/0x170 [ 1243.161066] vfs_read+0x122/0x1b0 [ 1243.161066] ksys_read+0x93/0xd0 [ 1243.161066] do_syscall_64+0x35/0x80 [ 1243.161066] entry_SYSCALL_64_after_hwframe+0x44/0xae The oops can also be reproduced with the following steps: ./vmtest.sh -s # at qemu shell cd /root/bpf && while true; do ./test_progs -t send_signal Further analysis showed that the failure is introduced with commit b89fbfbb854c ("bpf: Implement minimal BPF perf link"). With the above commit, the following scenario becomes possible: cpu1 cpu2 hrtimer_interrupt -> bpf_overflow_handler (due to closing link_fd) bpf_perf_link_release -> perf_event_free_bpf_prog -> perf_event_free_bpf_handler -> WRITE_ONCE(event->overflow_handler, event->orig_overflow_handler) event->prog = NULL bpf_prog_run(event->prog, &ctx) In the above case, the event->prog is NULL for bpf_prog_run, hence causing oops. To fix the issue, check whether event->prog is NULL or not. If it is, do not call bpf_prog_run. This seems working as the above reproducible step runs more than one hour and I didn't see any failures. Fixes: b89fbfbb854c ("bpf: Implement minimal BPF perf link") Signed-off-by: Yonghong Song <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-08-19bpf: Undo off-by-one in interpreter tail call count limitDaniel Borkmann1-1/+1
The BPF interpreter as well as x86-64 BPF JIT were both in line by allowing up to 33 tail calls (however odd that number may be!). Recently, this was changed for the interpreter to reduce it down to 32 with the assumption that this should have been the actual limit "which is in line with the behavior of the x86 JITs" according to b61a28cf11d61 ("bpf: Fix off-by-one in tail call count limiting"). Paul recently reported: I'm a bit surprised by this because I had previously tested the tail call limit of several JIT compilers and found it to be 33 (i.e., allowing chains of up to 34 programs). I've just extended a test program I had to validate this again on the x86-64 JIT, and found a limit of 33 tail calls again [1]. Also note we had previously changed the RISC-V and MIPS JITs to allow up to 33 tail calls [2, 3], for consistency with other JITs and with the interpreter. We had decided to increase these two to 33 rather than decrease the other JITs to 32 for backward compatibility, though that probably doesn't matter much as I'd expect few people to actually use 33 tail calls. [1] https://github.com/pchaigno/tail-call-bench/commit/ae7887482985b4b1745c9b2ef7ff9ae506c82886 [2] 96bc4432f5ad ("bpf, riscv: Limit to 33 tail calls") [3] e49e6f6db04e ("bpf, mips: Limit to 33 tail calls") Therefore, revert b61a28cf11d61 to re-align interpreter to limit a maximum of 33 tail calls. While it is unlikely to hit the limit for the vast majority, programs in the wild could one way or another depend on this, so lets rather be a bit more conservative, and lets align the small remainder of JITs to 33. If needed in future, this limit could be slightly increased, but not decreased. Fixes: b61a28cf11d61 ("bpf: Fix off-by-one in tail call count limiting") Reported-by: Paul Chaignon <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Acked-by: Johan Almbladh <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/bpf/CAO5pjwTWrC0_dzTbTHFPSqDwA56aVH+4KFGVqdq8=ASs0MqZGQ@mail.gmail.com
2021-08-19Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski2-6/+17
Daniel Borkmann says: ==================== pull-request: bpf 2021-08-19 We've added 3 non-merge commits during the last 3 day(s) which contain a total of 3 files changed, 29 insertions(+), 6 deletions(-). The main changes are: 1) Fix to clear zext_dst for dead instructions which was causing invalid program rejections on JITs with bpf_jit_needs_zext such as s390x, from Ilya Leoshkevich. 2) Fix RCU splat in bpf_get_current_{ancestor_,}cgroup_id() helpers when they are invoked from sleepable programs, from Yonghong Song. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests, bpf: Test that dead ldx_w insns are accepted bpf: Clear zext_dst of dead insns bpf: Add rcu_read_lock in bpf_get_current_[ancestor_]cgroup_id() helpers ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2021-08-19tracing/probes: Reject events which have the same name of existing oneMasami Hiramatsu4-2/+36
Since kprobe_events and uprobe_events only check whether the other same-type probe event has the same name or not, if the user gives the same name of the existing tracepoint event (or the other type of probe events), it silently fails to create the tracefs entry (but registered.) as below. /sys/kernel/tracing # ls events/task/task_rename enable filter format hist id trigger /sys/kernel/tracing # echo p:task/task_rename vfs_read >> kprobe_events [ 113.048508] Could not create tracefs 'task_rename' directory /sys/kernel/tracing # cat kprobe_events p:task/task_rename vfs_read To fix this issue, check whether the existing events have the same name or not in trace_probe_register_event_call(). If exists, it rejects to register the new event. Link: https://lkml.kernel.org/r/162936876189.187130.17558311387542061930.stgit@devnote2 Signed-off-by: Masami Hiramatsu <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-08-19tracing/probes: Have process_fetch_insn() take a void * instead of pt_regsSteven Rostedt (VMware)3-5/+7
In preparation to allow event probes to use the process_fetch_insn() callback in trace_probe_tmpl.h, change the data passed to it from a pointer to pt_regs, as the event probe will not be using regs, and make it a void pointer instead. Update the process_fetch_insn() callers for kprobe and uprobe events to have the regs defined in the function and just typecast the void pointer parameter. Link: https://lkml.kernel.org/r/[email protected] Acked-by: Masami Hiramatsu <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-08-19tracing/probe: Change traceprobe_set_print_fmt() to take a typeSteven Rostedt (VMware)4-11/+31
Instead of a boolean "is_return" have traceprobe_set_print_fmt() take a type (currently just PROBE_PRINT_NORMAL and PROBE_PRINT_RETURN). This will simplify adding different types. For example, the development of the event_probe, will need its own type as it prints an event, and not an IP. Link: https://lkml.kernel.org/r/[email protected] Acked-by: Masami Hiramatsu <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-08-19livepatch: Replace deprecated CPU-hotplug functions.Sebastian Andrzej Siewior1-2/+2
The functions get_online_cpus() and put_online_cpus() have been deprecated during the CPU hotplug rework. They map directly to cpus_read_lock() and cpus_read_unlock(). Replace deprecated CPU-hotplug functions with the official version. The behavior remains unchanged. Cc: Josh Poimboeuf <[email protected]> Cc: Jiri Kosina <[email protected]> Cc: Miroslav Benes <[email protected]> Cc: Petr Mladek <[email protected]> Cc: Joe Lawrence <[email protected]> Cc: [email protected] Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Signed-off-by: Jiri Kosina <[email protected]>
2021-08-19dma-mapping: make the global coherent pool conditionalChristoph Hellwig1-22/+27
Only build the code to support the global coherent pool if support for it is enabled. Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Dillon Min <[email protected]>
2021-08-19isystem: trim/fixup stdarg.h and other headersAlexey Dobriyan1-1/+0
Delete/fixup few includes in anticipation of global -isystem compile option removal. Note: crypto/aegis128-neon-inner.c keeps <stddef.h> due to redefinition of uintptr_t error (one definition comes from <stddef.h>, another from <linux/types.h>). Signed-off-by: Alexey Dobriyan <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]>
2021-08-18tracing/probes: Use struct_size() instead of defining custom macrosSteven Rostedt (VMware)2-10/+2
Remove SIZEOF_TRACE_KPROBE() and SIZEOF_TRACE_UPROBE() and use struct_size() as that's what it is made for. No need to have custom macros. Especially since struct_size() has some extra memory checks for correctness. Link: https://lkml.kernel.org/r/[email protected] Acked-by: Masami Hiramatsu <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-08-18tracing/probes: Allow for dot delimiter as well as slash for system namesSteven Rostedt (VMware)1-0/+3
Kprobe and uprobe events can add a "system" to the events that are created via the kprobe_events and uprobe_events files respectively. If they do not include a "system" in the name, then the default "kprobes" or "uprobes" is used. The current notation to specify a system for one of these probe events is to add a '/' delimiter in the name, where the content before the '/' will be the system to use, and the content after will be the event name. echo 'p:my_system/my_event' > kprobe_events But this is inconsistent with the way histogram triggers separate their system / event names. The histogram triggers use a '.' delimiter, which can be confusing. To allow this to be more consistent, as well as keep backward compatibility, allow the kprobe and uprobe events to denote a system name with either a '/' or a '.'. That is: echo 'p:my_system/my_event' > kprobe_events is equivalent to: echo 'p:my_system.my_event' > kprobe_events Link: https://lore.kernel.org/linux-trace-devel/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Suggested-by: Masami Hiramatsu <[email protected]> Acked-by: Masami Hiramatsu <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-08-18tracing/probe: Have traceprobe_parse_probe_arg() take a const argSteven Rostedt (VMware)4-36/+31
The two places that call traceprobe_parse_probe_arg() allocate a temporary buffer to copy the argv[i] into, because argv[i] is constant and the traceprobe_parse_probe_arg() will modify it to do the parsing. These two places allocate this buffer and then free it right after calling this function, leaving the onus of this allocation to the caller. As there's about to be a third user of this function that will have to do the same thing, instead of having the caller allocate the temporary buffer, simply move that allocation into the traceprobe_parse_probe_arg() itself, which will simplify the code of the callers. Link: https://lkml.kernel.org/r/[email protected] Acked-by: Masami Hiramatsu <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-08-18tracing: Have dynamic events have a ref counterSteven Rostedt (VMware)8-23/+80
As dynamic events are not created by modules, if something is attached to one, calling "try_module_get()" on its "mod" field, is not going to keep the dynamic event from going away. Since dynamic events do not need the "mod" pointer of the event structure, make a union out of it in order to save memory (there's one structure for each of the thousand+ events in the kernel), and have any event with the DYNAMIC flag set to use a ref counter instead. Link: https://lore.kernel.org/linux-trace-devel/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Suggested-by: Masami Hiramatsu <[email protected]> Acked-by: Masami Hiramatsu <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-08-18tracing: Add DYNAMIC flag for dynamic eventsSteven Rostedt (VMware)4-6/+8
To differentiate between static and dynamic events, add a new flag DYNAMIC to the event flags that all dynamic events have set. This will allow to differentiate when attaching to a dynamic event from a static event. Static events have a mod pointer that references the module they were created in (or NULL for core kernel). This can be incremented when the event has something attached to it. But there exists no such mechanism for dynamic events. This is dangerous as the dynamic events may now disappear without the "attachment" knowing that it no longer exists. To enforce the dynamic flag, change dyn_event_add() to pass the event that is being created such that it can set the DYNAMIC flag of the event. This helps make sure that no location that creates a dynamic event misses setting this flag. Link: https://lore.kernel.org/linux-trace-devel/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Suggested-by: Masami Hiramatsu <[email protected]> Acked-by: Masami Hiramatsu <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-08-18Merge tag 'cfi-v5.14-rc7' of ↵Linus Torvalds1-4/+4
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull clang cfi fix from Kees Cook: - Use rcu_read_{un}lock_sched_notrace to avoid recursion (Elliot Berman) * tag 'cfi-v5.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: cfi: Use rcu_read_{un}lock_sched_notrace
2021-08-18dma-mapping: add a dma_init_global_coherent helperChristoph Hellwig1-18/+14
Add a new helper to initialize the global coherent pool. This both cleans up the existing initialization which indirects through the reserved_mem_ops that are normally only used for struct device, and also allows using the global pool for non-devicetree architectures. Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Dillon Min <[email protected]>
2021-08-18dma-mapping: simplify dma_init_coherent_memoryChristoph Hellwig1-45/+33
Return the allocated dma_coherent_mem structure, set the use_dma_pfn_offset and print the failure warning inside of dma_init_coherent_memory instead of leaving that to the callers. Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Dillon Min <[email protected]>
2021-08-18dma-mapping: allow using the global coherent pool for !ARMChristoph Hellwig1-0/+2
Switch an ifdef so that the global coherent pool is initialized for any architecture that selects the DMA_GLOBAL_POOL symbol insted of hardcoding ARM. Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Dillon Min <[email protected]>