aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2021-05-19bpf: Introduce bpf_sys_bpf() helper and program type.Alexei Starovoitov2-0/+61
Add placeholders for bpf_sys_bpf() helper and new program type. Make sure to check that expected_attach_type is zero for future extensibility. Allow tracing helper functions to be used in this program type, since they will only execute from user context via bpf_prog_test_run. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: John Fastabend <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-05-18signal: Deliver all of the siginfo perf data in _perfEric W. Biederman1-8/+13
Don't abuse si_errno and deliver all of the perf data in _perf member of siginfo_t. Note: The data field in the perf data structures in a u64 to allow a pointer to be encoded without needed to implement a 32bit and 64bit version of the same structure. There already exists a 32bit and 64bit versions siginfo_t, and the 32bit version can not include a 64bit member as it only has 32bit alignment. So unsigned long is used in siginfo_t instead of a u64 as unsigned long can encode a pointer on all architectures linux supports. v1: https://lkml.kernel.org/r/[email protected] v2: https://lkml.kernel.org/r/[email protected] v3: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Reviewed-by: Marco Elver <[email protected]> Signed-off-by: "Eric W. Biederman" <[email protected]>
2021-05-18signal: Factor force_sig_perf out of perf_sigtrapEric W. Biederman2-9/+15
Separate filling in siginfo for TRAP_PERF from deciding that siginal needs to be sent. There are enough little details that need to be correct when properly filling in siginfo_t that it is easy to make mistakes if filling in the siginfo_t is in the same function with other logic. So factor out force_sig_perf to reduce the cognative load of on reviewers, maintainers and implementors. v1: https://lkml.kernel.org/r/[email protected] v2: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Reviewed-by: Marco Elver <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: "Eric W. Biederman" <[email protected]>
2021-05-18signal: Implement SIL_FAULT_TRAPNOEric W. Biederman1-22/+12
Now that si_trapno is part of the union in _si_fault and available on all architectures, add SIL_FAULT_TRAPNO and update siginfo_layout to return SIL_FAULT_TRAPNO when the code assumes si_trapno is valid. There is room for future changes to reduce when si_trapno is valid but this is all that is needed to make si_trapno and the other members of the the union in _sigfault mutually exclusive. Update the code that uses siginfo_layout to deal with SIL_FAULT_TRAPNO and have the same code ignore si_trapno in in all other cases. v1: https://lkml.kernel.org/r/[email protected] v2: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Reviewed-by: Marco Elver <[email protected]> Signed-off-by: "Eric W. Biederman" <[email protected]>
2021-05-18siginfo: Move si_trapno inside the union inside _si_faultEric W. Biederman1-0/+1
It turns out that linux uses si_trapno very sparingly, and as such it can be considered extra information for a very narrow selection of signals, rather than information that is present with every fault reported in siginfo. As such move si_trapno inside the union inside of _si_fault. This results in no change in placement, and makes it eaiser to extend _si_fault in the future as this reduces the number of special cases. In particular with si_trapno included in the union it is no longer a concern that the union must be pointer aligned on most architectures because the union follows immediately after si_addr which is a pointer. This change results in a difference in siginfo field placement on sparc and alpha for the fields si_addr_lsb, si_lower, si_upper, si_pkey, and si_perf. These architectures do not implement the signals that would use si_addr_lsb, si_lower, si_upper, si_pkey, and si_perf. Further these architecture have not yet implemented the userspace that would use si_perf. The point of this change is in fact to correct these placement issues before sparc or alpha grow userspace that cares. This change was discussed[1] and the agreement is that this change is currently safe. [1]: https://lkml.kernel.org/r/CAK8P3a0+uKYwL1NhY6Hvtieghba2hKYGD6hcKx5n8=4Gtt+pHA@mail.gmail.com Acked-by: Marco Elver <[email protected]> v1: https://lkml.kernel.org/r/[email protected] v2: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Eric W. Biederman" <[email protected]>
2021-05-18kcsan: Report observed value changesMark Rutland3-9/+33
When a thread detects that a memory location was modified without its watchpoint being hit, the report notes that a change was detected, but does not provide concrete values for the change. Knowing the concrete values can be very helpful in tracking down any racy writers (e.g. as specific values may only be written in some portions of code, or under certain conditions). When we detect a modification, let's report the concrete old/new values, along with the access's mask of relevant bits (and which relevant bits were modified). This can make it easier to identify potential racy writers. As the snapshots are at most 8 bytes, we can only report values for acceses up to this size, but this appears to cater for the common case. When we detect a race via a watchpoint, we may or may not have concrete values for the modification. To be helpful, let's attempt to log them when we do as they can be ignored where irrelevant. The resulting reports appears as follows, with values zero-padded to the access width: | ================================================================== | BUG: KCSAN: data-race in el0_svc_common+0x34/0x25c arch/arm64/kernel/syscall.c:96 | | race at unknown origin, with read to 0xffff00007ae6aa00 of 8 bytes by task 223 on cpu 1: | el0_svc_common+0x34/0x25c arch/arm64/kernel/syscall.c:96 | do_el0_svc+0x48/0xec arch/arm64/kernel/syscall.c:178 | el0_svc arch/arm64/kernel/entry-common.c:226 [inline] | el0_sync_handler+0x1a4/0x390 arch/arm64/kernel/entry-common.c:236 | el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:674 | | value changed: 0x0000000000000000 -> 0x0000000000000002 | | Reported by Kernel Concurrency Sanitizer on: | CPU: 1 PID: 223 Comm: syz-executor.1 Not tainted 5.8.0-rc3-00094-ga73f923ecc8e-dirty #3 | Hardware name: linux,dummy-virt (DT) | ================================================================== If an access mask is set, it is shown underneath the "value changed" line as "bits changed: 0x<bits changed> with mask 0x<non-zero mask>". Signed-off-by: Mark Rutland <[email protected]> [ [email protected]: align "value changed" and "bits changed" lines, which required massaging the message; do not print bits+mask if no mask set. ] Signed-off-by: Marco Elver <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-18kcsan: Remove kcsan_report_typeMark Rutland2-42/+20
Now that the reporting code has been refactored, it's clear by construction that print_report() can only be passed KCSAN_REPORT_RACE_SIGNAL or KCSAN_REPORT_RACE_UNKNOWN_ORIGIN, and these can also be distinguished by the presence of `other_info`. Let's simplify things and remove the report type enum, and instead let's check `other_info` to distinguish these cases. This allows us to remove code for cases which are impossible and generally makes the code simpler. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <[email protected]> [ [email protected]: add updated comments to kcsan_report_*() functions ] Signed-off-by: Marco Elver <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-18kcsan: Remove reporting indirectionMark Rutland1-66/+49
Now that we have separate kcsan_report_*() functions, we can factor the distinct logic for each of the report cases out of kcsan_report(). While this means each case has to handle mutual exclusion independently, this minimizes the conditionality of code and makes it easier to read, and will permit passing distinct bits of information to print_report() in future. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <[email protected]> [ [email protected]: retain comment about lockdep_off() ] Signed-off-by: Marco Elver <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-18kcsan: Refactor access_info initializationMark Rutland1-17/+25
In subsequent patches we'll want to split kcsan_report() into distinct handlers for each report type. The largest bit of common work is initializing the `access_info`, so let's factor this out into a helper, and have the kcsan_report_*() functions pass the `aaccess_info` as a parameter to kcsan_report(). There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <[email protected]> Signed-off-by: Marco Elver <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-18kcsan: Fold panic() call into print_report()Mark Rutland1-13/+8
So that we can add more callers of print_report(), lets fold the panic() call into print_report() so the caller doesn't have to handle this explicitly. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <[email protected]> Signed-off-by: Marco Elver <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-18kcsan: Refactor passing watchpoint/other_infoMark Rutland1-9/+4
The `watchpoint_idx` argument to kcsan_report() isn't meaningful for races which were not detected by a watchpoint, and it would be clearer if callers passed the other_info directly so that a NULL value can be passed in this case. Given that callers manipulate their watchpoints before passing the index into kcsan_report_*(), and given we index the `other_infos` array using this before we sanity-check it, the subsequent sanity check isn't all that useful. Let's remove the `watchpoint_idx` sanity check, and move the job of finding the `other_info` out of kcsan_report(). Other than the removal of the check, there should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <[email protected]> Signed-off-by: Marco Elver <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-18kcsan: Distinguish kcsan_report() callsMark Rutland3-15/+33
Currently kcsan_report() is used to handle three distinct cases: * The caller hit a watchpoint when attempting an access. Some information regarding the caller and access are recorded, but no output is produced. * A caller which previously setup a watchpoint detected that the watchpoint has been hit, and possibly detected a change to the location in memory being watched. This may result in output reporting the interaction between this caller and the caller which hit the watchpoint. * A caller detected a change to a modification to a memory location which wasn't detected by a watchpoint, for which there is no information on the other thread. This may result in output reporting the unexpected change. ... depending on the specific case the caller has distinct pieces of information available, but the prototype of kcsan_report() has to handle all three cases. This means that in some cases we pass redundant information, and in others we don't pass all the information we could pass. This also means that the report code has to demux these three cases. So that we can pass some additional information while also simplifying the callers and report code, add separate kcsan_report_*() functions for the distinct cases, updating callers accordingly. As the watchpoint_idx is unused in the case of kcsan_report_unknown_origin(), this passes a dummy value into kcsan_report(). Subsequent patches will refactor the report code to avoid this. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <[email protected]> [ [email protected]: try to make kcsan_report_*() names more descriptive ] Signed-off-by: Marco Elver <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-18kcsan: Simplify value change detectionMark Rutland1-24/+16
In kcsan_setup_watchpoint() we store snapshots of a watched value into a union of u8/u16/u32/u64 sized fields, modify this in place using a consistent field, then later check for any changes via the u64 field. We can achieve the safe effect more simply by always treating the field as a u64, as smaller values will be zero-extended. As the values are zero-extended, we don't need to truncate the access_mask when we apply it, and can always apply the full 64-bit access_mask to the 64-bit value. Finally, we can store the two snapshots and calculated difference separately, which makes the code a little easier to read, and will permit reporting the old/new values in subsequent patches. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <[email protected]> Signed-off-by: Marco Elver <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-18kcsan: Fix debugfs initcall return typeArnd Bergmann1-1/+2
clang with CONFIG_LTO_CLANG points out that an initcall function should return an 'int' due to the changes made to the initcall macros in commit 3578ad11f3fb ("init: lto: fix PREL32 relocations"): kernel/kcsan/debugfs.c:274:15: error: returning 'void' from a function with incompatible result type 'int' late_initcall(kcsan_debugfs_init); ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~ include/linux/init.h:292:46: note: expanded from macro 'late_initcall' #define late_initcall(fn) __define_initcall(fn, 7) Fixes: e36299efe7d7 ("kcsan, debugfs: Move debugfs file creation out of early init") Cc: stable <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Reviewed-by: Marco Elver <[email protected]> Reviewed-by: Nathan Chancellor <[email protected]> Reviewed-by: Miguel Ojeda <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-18Merge branches 'bitmaprange.2021.05.10c', 'doc.2021.05.10c', ↵Paul E. McKenney15-469/+731
'fixes.2021.05.13a', 'kvfree_rcu.2021.05.10c', 'mmdumpobj.2021.05.10c', 'nocb.2021.05.12a', 'srcu.2021.05.12a', 'tasks.2021.05.18a' and 'torture.2021.05.10c' into HEAD bitmaprange.2021.05.10c: Allow "all" for bitmap ranges. doc.2021.05.10c: Documentation updates. fixes.2021.05.13a: Miscellaneous fixes. kvfree_rcu.2021.05.10c: kvfree_rcu() updates. mmdumpobj.2021.05.10c: mem_dump_obj() updates. nocb.2021.05.12a: RCU NOCB CPU updates, including limited deoffloading. srcu.2021.05.12a: SRCU updates. tasks.2021.05.18a: Tasks-RCU updates. torture.2021.05.10c: Torture-test updates.
2021-05-18tasks-rcu: Make show_rcu_tasks_gp_kthreads() be static inlinePaul E. McKenney2-1/+4
In some architectures, the no-op variant of show_rcu_tasks_gp_kthreads() get "no previous prototype" compiler warnings. These are false positives given that kernel/rcu/tasks.h is included only once. But why put up with the compiler noise? This commit therefore adds "static inline" to this definition to force the compiler to accept this situation, while also moving it to its proper place in kernel/rcu/rcu.h. Reported-by: kernel test robot <[email protected]> [ paulmck: Update per Stephen Rothwell feedback. ] Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-18rcu-tasks: Make ksoftirqd provide RCU Tasks quiescent statesPaul E. McKenney1-0/+1
Heavy networking load can cause a CPU to execute continuously and indefinitely within ksoftirqd, in which case there will be no voluntary task switches and thus no RCU-tasks quiescent states. This commit therefore causes the exiting rcu_softirq_qs() to provide an RCU-tasks quiescent state. This of course means that __do_softirq() and its callers cannot be invoked from within a tracing trampoline. Reported-by: Toke Høiland-Jørgensen <[email protected]> Tested-by: Toke Høiland-Jørgensen <[email protected]> Reviewed-by: Masami Hiramatsu <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Masami Hiramatsu <[email protected]>
2021-05-18sched: Make the idle task quack like a per-CPU kthreadValentin Schneider2-18/+33
For all intents and purposes, the idle task is a per-CPU kthread. It isn't created via the same route as other pcpu kthreads however, and as a result it is missing a few bells and whistles: it fails kthread_is_per_cpu() and it doesn't have PF_NO_SETAFFINITY set. Fix the former by giving the idle task a kthread struct along with the KTHREAD_IS_PER_CPU flag. This requires some extra iffery as init_idle() call be called more than once on the same idle task. Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-05-18sched,stats: Further simplify sched_infoPeter Zijlstra1-9/+12
There's no point doing delta==0 updates. Suggested-by: Mel Gorman <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
2021-05-18locking/mutex: clear MUTEX_FLAGS if wait_list is empty due to signalZqiang4-11/+17
When a interruptible mutex locker is interrupted by a signal without acquiring this lock and removed from the wait queue. if the mutex isn't contended enough to have a waiter put into the wait queue again, the setting of the WAITER bit will force mutex locker to go into the slowpath to acquire the lock every time, so if the wait queue is empty, the WAITER bit need to be clear. Fixes: 040a0a371005 ("mutex: Add support for wound/wait style locks") Suggested-by: Peter Zijlstra <[email protected]> Signed-off-by: Zqiang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-05-18locking/lockdep: Correct calling tracepointsLeo Yan1-2/+2
The commit eb1f00237aca ("lockdep,trace: Expose tracepoints") reverses tracepoints for lock_contended() and lock_acquired(), thus the ftrace log shows the wrong locking sequence that "acquired" event is prior to "contended" event: <idle>-0 [001] d.s3 20803.501685: lock_acquire: 0000000008b91ab4 &sg_policy->update_lock <idle>-0 [001] d.s3 20803.501686: lock_acquired: 0000000008b91ab4 &sg_policy->update_lock <idle>-0 [001] d.s3 20803.501689: lock_contended: 0000000008b91ab4 &sg_policy->update_lock <idle>-0 [001] d.s3 20803.501690: lock_release: 0000000008b91ab4 &sg_policy->update_lock This patch fixes calling tracepoints for lock_contended() and lock_acquired(). Fixes: eb1f00237aca ("lockdep,trace: Expose tracepoints") Signed-off-by: Leo Yan <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-05-17genirq: Add a IRQF_NO_DEBUG flagThomas Gleixner4-2/+19
The whole call to note_interrupt() can be avoided or return early when interrupts would be marked accordingly. For IPI handlers which always return HANDLED the whole procedure is pretty pointless to begin with. Add a IRQF_NO_DEBUG flag and mark the interrupt accordingly if supplied when the interrupt is requested. When noirqdebug is set on the kernel commandline, then the interrupt is marked unconditionally so that there is only one condition in the hotpath to evaluate. [ clg: Add changelog ] Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Cédric Le Goater <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-17kdb: Switch to use %ptTsAndy Shevchenko1-8/+1
Use %ptTs instead of open-coded variant to print contents of time64_t type in human readable form. Cc: Jason Wessel <[email protected]> Cc: Daniel Thompson <[email protected]> Cc: [email protected] Signed-off-by: Andy Shevchenko <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Reviewed-by: Douglas Anderson <[email protected]> Reviewed-by: Daniel Thompson <[email protected]> Acked-by: Daniel Thompson <[email protected]> Signed-off-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-17module: check for exit sections in layout_sections() instead of ↵Jessica Yu1-6/+11
module_init_section() Previously, when CONFIG_MODULE_UNLOAD=n, the module loader just does not attempt to load exit sections since it never expects that any code in those sections will ever execute. However, dynamic code patching (alternatives, jump_label and static_call) can have sites in __exit code, even if __exit is never executed. Therefore __exit must be present at runtime, at least for as long as __init code is. Commit 33121347fb1c ("module: treat exit sections the same as init sections when !CONFIG_MODULE_UNLOAD") solves the requirements of jump_labels and static_calls by putting the exit sections in the init region of the module so that they are at least present at init, and discarded afterwards. It does this by including a check for exit sections in module_init_section(), so that it also returns true for exit sections, and the module loader will automatically sort them in the init region of the module. However, the solution there was not completely arch-independent. ARM is a special case where it supplies its own module_{init, exit}_section() functions. Instead of pushing the exit section checks into module_init_section(), just implement the exit section check in layout_sections(), so that we don't have to touch arch-dependent code. Fixes: 33121347fb1c ("module: treat exit sections the same as init sections when !CONFIG_MODULE_UNLOAD") Reviewed-by: Russell King (Oracle) <[email protected]> Signed-off-by: Jessica Yu <[email protected]>
2021-05-16Merge tag 'timers-urgent-2021-05-16' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "Two fixes for timers: - Use the ALARM feature check in the alarmtimer core code insted of the old method of checking for the set_alarm() callback. Drivers can have that callback set but the feature bit cleared. If such a RTC device is selected then alarms wont work. - Use a proper define to let the preprocessor check whether Hyper-V VDSO clocksource should be active. The code used a constant in an enum with #ifdef, which evaluates to always false and disabled the clocksource for VDSO" * tag 'timers-urgent-2021-05-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource/drivers/hyper-v: Re-enable VDSO_CLOCKMODE_HVCLOCK on X86 alarmtimer: Check RTC features instead of ops
2021-05-15Merge tag 'sched-urgent-2021-05-15' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Fix an idle CPU selection bug, and an AMD Ryzen maximum frequency enumeration bug" * tag 'sched-urgent-2021-05-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, sched: Fix the AMD CPPC maximum performance value on certain AMD Ryzen generations sched/fair: Fix clearing of has_idle_cores flag in select_idle_cpu()
2021-05-15Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-1/+1
Merge misc fixes from Andrew Morton: "13 patches. Subsystems affected by this patch series: resource, squashfs, hfsplus, modprobe, and mm (hugetlb, slub, userfaultfd, ksm, pagealloc, kasan, pagemap, and ioremap)" * emailed patches from Andrew Morton <[email protected]>: mm/ioremap: fix iomap_max_page_shift docs: admin-guide: update description for kernel.modprobe sysctl hfsplus: prevent corruption in shrinking truncate mm/filemap: fix readahead return types kasan: fix unit tests with CONFIG_UBSAN_LOCAL_BOUNDS enabled mm: fix struct page layout on 32-bit systems ksm: revert "use GET_KSM_PAGE_NOLOCK to get ksm page in remove_rmap_item_from_tree()" userfaultfd: release page in error path to avoid BUG_ON squashfs: fix divide error in calculate_skip() kernel/resource: fix return code check in __request_free_mem_region mm, slub: move slub_debug static key enabling outside slab_mutex mm/hugetlb: fix cow where page writtable in child mm/hugetlb: fix F_SEAL_FUTURE_WRITE
2021-05-14kernel/resource: fix return code check in __request_free_mem_regionAlistair Popple1-1/+1
Splitting an earlier version of a patch that allowed calling __request_region() while holding the resource lock into a series of patches required changing the return code for the newly introduced __request_region_locked(). Unfortunately this change was not carried through to a subsequent commit 56fd94919b8b ("kernel/resource: fix locking in request_free_mem_region") in the series. This resulted in a use-after-free due to freeing the struct resource without properly releasing it. Fix this by correcting the return code check so that the struct is not freed if the request to add it was successful. Link: https://lkml.kernel.org/r/[email protected] Fixes: 56fd94919b8b ("kernel/resource: fix locking in request_free_mem_region") Signed-off-by: Alistair Popple <[email protected]> Reported-by: kernel test robot <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Dan Williams <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: John Hubbard <[email protected]> Cc: Muchun Song <[email protected]> Cc: Oliver Sang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-05-14Merge tag 'trace-v5.13-rc1' of ↵Linus Torvalds1-4/+27
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "Fix trace_check_vprintf() for %.*s The sanity check of all strings being read from the ring buffer to make sure they are in safe memory space did not account for the %.*s notation having another parameter to process (the length). Add that to the check" * tag 'trace-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Handle %.*s in trace_check_vprintf()
2021-05-14kernel/module: Use BUG_ON instead of if condition followed by BUGzhouchuangao1-2/+1
Fix the following coccinelle report: kernel/module.c:1018:2-5: WARNING: Use BUG_ON instead of if condition followed by BUG. BUG_ON uses unlikely in if(). Through disassembly, we can see that brk #0x800 is compiled to the end of the function. As you can see below: ...... ffffff8008660bec: d65f03c0 ret ffffff8008660bf0: d4210000 brk #0x800 Usually, the condition in if () is not satisfied. For the multi-stage pipeline, we do not need to perform fetch decode and excute operation on brk instruction. In my opinion, this can improve the efficiency of the multi-stage pipeline. Signed-off-by: zhouchuangao <[email protected]> Signed-off-by: Jessica Yu <[email protected]>
2021-05-13tracing: Handle %.*s in trace_check_vprintf()Steven Rostedt (VMware)1-4/+27
If a trace event uses the %*.s notation, the trace_check_vprintf() will fail and will warn about a bad processing of strings, because it does not take into account the length field when processing the star (*) part. Have it handle this case as well. Link: https://lore.kernel.org/linux-nfs/[email protected]/ Reported-by: Chuck Lever III <[email protected]> Fixes: 9a6944fee68e2 ("tracing: Add a verifier to check string pointers for trace events") Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-05-13rcu: Add missing __releases() annotationJules Irenge1-0/+1
Sparse reports a warning at rcu_print_task_stall(): "warning: context imbalance in rcu_print_task_stall - unexpected unlock" The root cause is a missing annotation on rcu_print_task_stall(). This commit therefore adds the missing __releases(rnp->lock) annotation. Signed-off-by: Jules Irenge <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-13rcu: Improve comments describing RCU read-side critical sectionsPaul E. McKenney1-10/+14
There are a number of places that call out the fact that preempt-disable regions of code now act as RCU read-side critical sections, where preempt-disable regions of code include irq-disable regions of code, bh-disable regions of code, hardirq handlers, and NMI handlers. However, someone relying solely on (for example) the call_rcu() header comment might well have no idea that preempt-disable regions of code have RCU semantics. This commit therefore updates the header comments for call_rcu(), synchronize_rcu(), rcu_dereference_bh_check(), and rcu_dereference_sched_check() to call out these new(ish) forms of RCU readers. Reported-by: Michel Lespinasse <[email protected]> [ paulmck: Apply Matthew Wilcox and Michel Lespinasse feedback. ] Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-13tick/nohz: Call tick_nohz_task_switch() with interrupts disabledPeter Zijlstra2-7/+2
Call tick_nohz_task_switch() slightly earlier after the context switch to benefit from disabled IRQs. This way the function doesn't need to disable them once more. Signed-off-by: Peter Zijlstra <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-13tick/nohz: Kick only _queued_ task whose tick dependency is updatedMarcelo Tosatti2-2/+22
When the tick dependency of a task is updated, we want it to aknowledge the new state and restart the tick if needed. If the task is not running, we don't need to kick it because it will observe the new dependency upon scheduling in. But if the task is running, we may need to send an IPI to it so that it gets notified. Unfortunately we don't have the means to check if a task is running in a race free way. Checking p->on_cpu in a synchronized way against p->tick_dep_mask would imply adding a full barrier between prepare_task_switch() and tick_nohz_task_switch(), which we want to avoid in this fast-path. Therefore we blindly fire an IPI to the task's CPU. Meanwhile we can check if the task is queued on the CPU rq because p->on_rq is always set to TASK_ON_RQ_QUEUED _before_ schedule() and its full barrier that precedes tick_nohz_task_switch(). And if the task is queued on a nohz_full CPU, it also has fair chances to be running as the isolation constraints prescribe running single tasks on full dynticks CPUs. So use this as a trick to check if we can spare an IPI toward a non-running task. NOTE: For the ordering to be correct, it is assumed that we never deactivate a task while it is running, the only exception being the task deactivating itself while scheduling out. Suggested-by: Peter Zijlstra <[email protected]> Signed-off-by: Marcelo Tosatti <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-13tick/nohz: Change signal tick dependency to wake up CPUs of member tasksMarcelo Tosatti2-4/+15
Rather than waking up all nohz_full CPUs on the system, only wake up the target CPUs of member threads of the signal. Reduces interruptions to nohz_full CPUs. Signed-off-by: Marcelo Tosatti <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-13tick/nohz: Only wake up a single target cpu when kicking a taskFrederic Weisbecker1-13/+27
When adding a tick dependency to a task, its necessary to wake up the CPU where the task resides to reevaluate tick dependencies on that CPU. However the current code wakes up all nohz_full CPUs, which is unnecessary. Switch to waking up a single CPU, by using ordering of writes to task->cpu and task->tick_dep_mask. [ mingo: Minor readability edit. ] Suggested-by: Peter Zijlstra <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Marcelo Tosatti <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-13tick/nohz: Update nohz_full Kconfig helpFrederic Weisbecker1-5/+6
CONFIG_NO_HZ_FULL behaves just like CONFIG_NO_HZ_IDLE by default. Reassure distros about it. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-13tick/nohz: Update idle_exittime on actual idle exitYunfeng Ye1-6/+8
The idle_exittime field of tick_sched is used to record the time when the idle state was left. but currently the idle_exittime is updated in the function tick_nohz_restart_sched_tick(), which is not always in idle state when nohz_full is configured: tick_irq_exit tick_nohz_irq_exit tick_nohz_full_update_tick tick_nohz_restart_sched_tick ts->idle_exittime = now; It's thus overwritten by mistake on nohz_full tick restart. Move the update to the appropriate idle exit path instead. Signed-off-by: Yunfeng Ye <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-13tick/nohz: Remove superflous check for CONFIG_VIRT_CPU_ACCOUNTING_NATIVEFrederic Weisbecker1-2/+0
The vtime_accounting_enabled_this_cpu() early check already makes what follows as dead code in the case of CONFIG_VIRT_CPU_ACCOUNTING_NATIVE. No need to keep the ifdeferry around. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-13tick/nohz: Conditionally restart tick on idle exitYunfeng Ye1-15/+27
In nohz_full mode, switching from idle to a task will unconditionally issue a tick restart. If the task is alone in the runqueue or is the highest priority, the tick will fire once then eventually stop. But that alone is still undesired noise. Therefore, only restart the tick on idle exit when it's strictly necessary. Signed-off-by: Yunfeng Ye <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-13sched/isolation: Reconcile rcu_nocbs= and nohz_full=Paul Gortmaker1-3/+1
We have a mismatch between RCU and isolation -- in relation to what is considered the maximum valid CPU number. This matters because nohz_full= and rcu_nocbs= are joined at the hip; in fact the former will enforce the latter. So we don't want a CPU mask to be valid for one and denied for the other. The difference 1st appeared as of v4.15; further details are below. As it is confusing to anyone who isn't looking at the code regularly, a reminder is in order; three values exist here: CONFIG_NR_CPUS - compiled in maximum cap on number of CPUs supported. nr_cpu_ids - possible # of CPUs (typically reflects what ACPI says) cpus_present - actual number of present/detected/installed CPUs. For this example, I'll refer to NR_CPUS=64 from "make defconfig" and nr_cpu_ids=6 for ACPI reporting on a board that could run a six core, and present=4 for a quad that is physically in the socket. From dmesg: smpboot: Allowing 6 CPUs, 2 hotplug CPUs setup_percpu: NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:6 nr_node_ids:1 rcu: RCU restricting CPUs from NR_CPUS=64 to nr_cpu_ids=6. smp: Brought up 1 node, 4 CPUs And from userspace, see: paul@trash:/sys/devices/system/cpu$ cat present 0-3 paul@trash:/sys/devices/system/cpu$ cat possible 0-5 paul@trash:/sys/devices/system/cpu$ cat kernel_max 63 Everything is fine if we boot 5x5 for rcu/nohz: Command line: BOOT_IMAGE=/boot/bzImage nohz_full=2-5 rcu_nocbs=2-5 root=/dev/sda1 ro NO_HZ: Full dynticks CPUs: 2-5. rcu: Offload RCU callbacks from CPUs: 2-5. ..even though there is no CPU 4 or 5. Both RCU and nohz_full are OK. Now we push that > 6 but less than NR_CPU and with 15x15 we get: Command line: BOOT_IMAGE=/boot/bzImage rcu_nocbs=2-15 nohz_full=2-15 root=/dev/sda1 ro rcu: Note: kernel parameter 'rcu_nocbs=', 'nohz_full', or 'isolcpus=' contains nonexistent CPUs. rcu: Offload RCU callbacks from CPUs: 2-5. These are both functionally equivalent, as we are only changing flags on phantom CPUs that don't exist, but note the kernel interpretation changes. And worse, it only changes for one of the two - which is the problem. RCU doesn't care if you want to restrict the flags on phantom CPUs but clearly nohz_full does after this change from v4.15. edb9382175c3: ("sched/isolation: Move isolcpus= handling to the housekeeping code") - if (cpulist_parse(str, non_housekeeping_mask) < 0) { - pr_warn("Housekeeping: Incorrect nohz_full cpumask\n"); + err = cpulist_parse(str, non_housekeeping_mask); + if (err < 0 || cpumask_last(non_housekeeping_mask) >= nr_cpu_ids) { + pr_warn("Housekeeping: nohz_full= or isolcpus= incorrect CPU range\n"); To be clear, the sanity check on "possible" (nr_cpu_ids) is new here. The goal was reasonable ; not wanting housekeeping to land on a not-possible CPU, but note two things: 1) this is an exclusion list, not an inclusion list; we are tracking non_housekeeping CPUs; not ones who are explicitly assigned housekeeping 2) we went one further in 9219565aa890 ("sched/isolation: Require a present CPU in housekeeping mask") - ensuring that housekeeping was sanity checking against present and not just possible CPUs. To be clear, this means the check added in v4.15 is doubly redundant. And more importantly, overly strict/restrictive. We care now, because the bitmap boot arg parsing now knows that a value of "N" is NR_CPUS; the size of the bitmap, but the bitmap code doesn't know anything about the subtleties of our max/possible/present CPU specifics as outlined above. So drop the check added in v4.15 (edb9382175c3) and make RCU and nohz_full both in alignment again on NR_CPUS so "N" works for both, and then they can fall back to nr_cpu_ids internally just as before. Command line: BOOT_IMAGE=/boot/bzImage nohz_full=2-N rcu_nocbs=2-N root=/dev/sda1 ro NO_HZ: Full dynticks CPUs: 2-5. rcu: Offload RCU callbacks from CPUs: 2-5. As shown above, with this change, RCU and nohz_full are in sync, even with the use of the "N" placeholder. Same result is achieved with "15". Signed-off-by: Paul Gortmaker <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-12sched: Make multiple runqueue task counters 32-bitAlexey Dobriyan2-7/+7
Make: struct dl_rq::dl_nr_migratory struct dl_rq::dl_nr_running struct rt_rq::rt_nr_boosted struct rt_rq::rt_nr_migratory struct rt_rq::rt_nr_total struct rq::nr_uninterruptible 32-bit. If total number of tasks can't exceed 2**32 (and less due to futex pid limits), then per-runqueue counters can't as well. This patchset has been sponsored by REX Prefix Eradication Society. Signed-off-by: Alexey Dobriyan <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-12sched: Make nr_iowait_cpu() return 32-bit valueAlexey Dobriyan1-1/+1
Runqueue ->nr_iowait counters are 32-bit anyway. Propagate 32-bitness into other code, but don't try too hard. Signed-off-by: Alexey Dobriyan <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-12sched: Make nr_iowait() return 32-bit valueAlexey Dobriyan1-2/+2
Creating 2**32 tasks to wait in D-state is impossible and wasteful. Return "unsigned int" and save on REX prefixes. Signed-off-by: Alexey Dobriyan <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-12sched: Make nr_running() return 32-bit valueAlexey Dobriyan1-2/+2
Creating 2**32 tasks is impossible due to futex pid limits and wasteful anyway. Nobody has done it. Bring nr_running() into 32-bit world to save on REX prefixes. Signed-off-by: Alexey Dobriyan <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-12srcu: Early test SRCU polling startFrederic Weisbecker1-1/+5
Place an early call to start_poll_synchronize_srcu() before the invocation of call_srcu() on the same srcu_struct structure. After the later call to srcu_barrier(), the completion of the first grace period should be visible to a subsequent invocation of poll_state_synchronize_srcu(), and if not, warn. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Boqun Feng <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Neeraj Upadhyay <[email protected]> Cc: Josh Triplett <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Uladzislau Rezki <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-12rcu: Fix various typos in commentsIngo Molnar6-13/+13
Fix ~12 single-word typos in RCU code comments. [ paulmck: Apply feedback from Randy Dunlap. ] Reviewed-by: Randy Dunlap <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-12rcu/nocb: Unify timersFrederic Weisbecker2-56/+42
Now that ->nocb_timer and ->nocb_bypass_timer have become quite similar, this commit merges them together. A new RCU_NOCB_WAKE_BYPASS wake level is introduced. As a result, timers perform all kinds of deferred wake ups but other deferred wakeup callsites only handle non-bypass wakeups in order not to wake up rcuo too early. The timer also unconditionally executes a full barrier so as to order timer_pending() and callback enqueue although the path performing RCU_NOCB_WAKE_FORCE that makes use of it is debatable. It should also test against the rdp leader instead of the current rdp. This unconditional full barrier shouldn't bring visible overhead since these timers almost never fire. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Josh Triplett <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Neeraj Upadhyay <[email protected]> Cc: Boqun Feng <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-12rcu/nocb: Prepare for fine-grained deferred wakeupFrederic Weisbecker3-10/+11
Tuning the deferred wakeup level must be done from a safe wakeup point. Currently those sites are: * ->nocb_timer * user/idle/guest entry * CPU down * softirq/rcuc All of these sites perform the wake up for both RCU_NOCB_WAKE and RCU_NOCB_WAKE_FORCE. In order to merge ->nocb_timer and ->nocb_bypass_timer together, we plan to add a new RCU_NOCB_WAKE_BYPASS that really should be deferred until a timer fires so that we don't wake up the NOCB-gp kthread too early. To prepare for that, this commit specifies the per-callsite wakeup level/limit. Cc: Josh Triplett <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Neeraj Upadhyay <[email protected]> Cc: Boqun Feng <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> [ paulmck: Fix non-NOCB rcu_nocb_need_deferred_wakeup() definition. ] Signed-off-by: Paul E. McKenney <[email protected]>