aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2020-01-21ftrace: Remove abandoned macrosAlex Shi1-2/+0
These 2 macros aren't used from commit eee8ded131f1 ("ftrace: Have the function probes call their own function"), so remove them. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Alex Shi <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-01-20bpf: Fix memory leaks in generic update/delete batch opsBrian Vazquez1-11/+19
Generic update/delete batch ops functions were using __bpf_copy_key without properly freeing the memory. Handle the memory allocation and copy_from_user separately. Fixes: aa2e93b8e58e ("bpf: Add generic support for update and delete batch ops") Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Brian Vazquez <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-01-20tracing: Do not set trace clock if tracefs lockdown is in effectMasami Ichikawa1-0/+5
When trace_clock option is not set and unstable clcok detected, tracing_set_default_clock() sets trace_clock(ThinkPad A285 is one of case). In that case, if lockdown is in effect, null pointer dereference error happens in ring_buffer_set_clock(). Link: http://lkml.kernel.org/r/[email protected] Cc: [email protected] Fixes: 17911ff38aa58 ("tracing: Add locked_down checks to the open calls of files created for tracefs") Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1788488 Signed-off-by: Masami Ichikawa <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-01-20tracing: Fix histogram code when expression has same var as valueSteven Rostedt (VMware)1-0/+22
While working on a tool to convert SQL syntex into the histogram language of the kernel, I discovered the following bug: # echo 'first u64 start_time u64 end_time pid_t pid u64 delta' >> synthetic_events # echo 'hist:keys=pid:start=common_timestamp' > events/sched/sched_waking/trigger # echo 'hist:keys=next_pid:delta=common_timestamp-$start,start2=$start:onmatch(sched.sched_waking).trace(first,$start2,common_timestamp,next_pid,$delta)' > events/sched/sched_switch/trigger Would not display any histograms in the sched_switch histogram side. But if I were to swap the location of "delta=common_timestamp-$start" with "start2=$start" Such that the last line had: # echo 'hist:keys=next_pid:start2=$start,delta=common_timestamp-$start:onmatch(sched.sched_waking).trace(first,$start2,common_timestamp,next_pid,$delta)' > events/sched/sched_switch/trigger The histogram works as expected. What I found out is that the expressions clear out the value once it is resolved. As the variables are resolved in the order listed, when processing: delta=common_timestamp-$start The $start is cleared. When it gets to "start2=$start", it errors out with "unresolved symbol" (which is silent as this happens at the location of the trace), and the histogram is dropped. When processing the histogram for variable references, instead of adding a new reference for a variable used twice, use the same reference. That way, not only is it more efficient, but the order will no longer matter in processing of the variables. From Tom Zanussi: "Just to clarify some more about what the problem was is that without your patch, we would have two separate references to the same variable, and during resolve_var_refs(), they'd both want to be resolved separately, so in this case, since the first reference to start wasn't part of an expression, it wouldn't get the read-once flag set, so would be read normally, and then the second reference would do the read-once read and also be read but using read-once. So everything worked and you didn't see a problem: from: start2=$start,delta=common_timestamp-$start In the second case, when you switched them around, the first reference would be resolved by doing the read-once, and following that the second reference would try to resolve and see that the variable had already been read, so failed as unset, which caused it to short-circuit out and not do the trigger action to generate the synthetic event: to: delta=common_timestamp-$start,start2=$start With your patch, we only have the single resolution which happens correctly the one time it's resolved, so this can't happen." Link: https://lore.kernel.org/r/[email protected] Cc: [email protected] Fixes: 067fe038e70f6 ("tracing: Add variable reference handling to hist triggers") Reviewed-by: Tom Zanuss <[email protected]> Tested-by: Tom Zanussi <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-01-20irqdomain: Fix a memory leak in irq_domain_push_irq()Kevin Hao1-0/+1
Fix a memory leak reported by kmemleak: unreferenced object 0xffff000bc6f50e80 (size 128): comm "kworker/23:2", pid 201, jiffies 4294894947 (age 942.132s) hex dump (first 32 bytes): 00 00 00 00 41 00 00 00 86 c0 03 00 00 00 00 00 ....A........... 00 a0 b2 c6 0b 00 ff ff 40 51 fd 10 00 80 ff ff ........@Q...... backtrace: [<00000000e62d2240>] kmem_cache_alloc_trace+0x1a4/0x320 [<00000000279143c9>] irq_domain_push_irq+0x7c/0x188 [<00000000d9f4c154>] thunderx_gpio_probe+0x3ac/0x438 [<00000000fd09ec22>] pci_device_probe+0xe4/0x198 [<00000000d43eca75>] really_probe+0xdc/0x320 [<00000000d3ebab09>] driver_probe_device+0x5c/0xf0 [<000000005b3ecaa0>] __device_attach_driver+0x88/0xc0 [<000000004e5915f5>] bus_for_each_drv+0x7c/0xc8 [<0000000079d4db41>] __device_attach+0xe4/0x140 [<00000000883bbda9>] device_initial_probe+0x18/0x20 [<000000003be59ef6>] bus_probe_device+0x98/0xa0 [<0000000039b03d3f>] deferred_probe_work_func+0x74/0xa8 [<00000000870934ce>] process_one_work+0x1c8/0x470 [<00000000e3cce570>] worker_thread+0x1f8/0x428 [<000000005d64975e>] kthread+0xfc/0x128 [<00000000f0eaa764>] ret_from_fork+0x10/0x18 Fixes: 495c38d3001f ("irqdomain: Add irq_domain_{push,pop}_irq() functions") Signed-off-by: Kevin Hao <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected]
2020-01-20module: avoid setting info->name early in case we can fall back to ↵Jessica Yu1-5/+4
info->mod->name In setup_load_info(), info->name (which contains the name of the module, mostly used for early logging purposes before the module gets set up) gets unconditionally assigned if .modinfo is missing despite the fact that there is an if (!info->name) check near the end of the function. Avoid assigning a placeholder string to info->name if .modinfo doesn't exist, so that we can fall back to info->mod->name later on. Fixes: 5fdc7db6448a ("module: setup load info before module_sig_check()") Reviewed-by: Miroslav Benes <[email protected]> Signed-off-by: Jessica Yu <[email protected]>
2020-01-20genirq: Introduce irq_domain_translate_onecellYash Shah1-0/+17
Add a new function irq_domain_translate_onecell() that is to be used as the translate function in struct irq_domain_ops. Signed-off-by: Yash Shah <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-01-20Merge tag 'v5.5-rc7' into perf/core, to pick up fixesIngo Molnar24-162/+249
Signed-off-by: Ingo Molnar <[email protected]>
2020-01-20sched/fair: Define sched_idle_cpu() only for SMP configurationsViresh Kumar1-0/+2
sched_idle_cpu() isn't used for non SMP configuration and with a recent change, we have started getting following warning: kernel/sched/fair.c:5221:12: warning: ‘sched_idle_cpu’ defined but not used [-Wunused-function] Fix that by defining sched_idle_cpu() only for SMP configurations. Fixes: 323af6deaf70 ("sched/fair: Load balance aggressively for SCHED_IDLE CPUs") Reported-by: Stephen Rothwell <[email protected]> Signed-off-by: Viresh Kumar <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Vincent Guittot <[email protected]> Cc: Dietmar Eggemann <[email protected]> Link: https://lore.kernel.org/r/f0554f590687478b33914a4aff9f0e6a62886d44.1579499907.git.viresh.kumar@linaro.org
2020-01-19Revert "um: Enable CONFIG_CONSTRUCTORS"Johannes Berg1-1/+1
This reverts commit 786b2384bf1c ("um: Enable CONFIG_CONSTRUCTORS"). There are two issues with this commit, uncovered by Anton in tests on some (Debian) systems: 1) I completely forgot to call any constructors if CONFIG_CONSTRUCTORS isn't set. Don't recall now if it just wasn't needed on my system, or if I never tested this case. 2) With that fixed, it works - with CONFIG_CONSTRUCTORS *unset*. If I set CONFIG_CONSTRUCTORS, it fails again, which isn't totally unexpected since whatever wanted to run is likely to have to run before the kernel init etc. that calls the constructors in this case. Basically, some constructors that gcc emits (libc has?) need to run very early during init; the failure mode otherwise was that the ptrace fork test already failed: ---------------------- $ ./linux mem=512M Core dump limits : soft - 0 hard - NONE Checking that ptrace can change system call numbers...check_ptrace : child exited with exitcode 6, while expecting 0; status 0x67f Aborted ---------------------- Thinking more about this, it's clear that we simply cannot support CONFIG_CONSTRUCTORS in UML. All the cases we need now (gcov, kasan) involve not use of the __attribute__((constructor)), but instead some constructor code/entry generated by gcc. Therefore, we cannot distinguish between kernel constructors and system constructors. Thus, revert this commit. Cc: [email protected] [5.4+] Fixes: 786b2384bf1c ("um: Enable CONFIG_CONSTRUCTORS") Reported-by: Anton Ivanov <[email protected]> Signed-off-by: Johannes Berg <[email protected]> Acked-by: Anton Ivanov <[email protected]> Signed-off-by: Richard Weinberger <[email protected]>
2020-01-19Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/netDavid S. Miller13-95/+134
2020-01-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds2-5/+17
Pull networking fixes from David Miller: 1) Fix non-blocking connect() in x25, from Martin Schiller. 2) Fix spurious decryption errors in kTLS, from Jakub Kicinski. 3) Netfilter use-after-free in mtype_destroy(), from Cong Wang. 4) Limit size of TSO packets properly in lan78xx driver, from Eric Dumazet. 5) r8152 probe needs an endpoint sanity check, from Johan Hovold. 6) Prevent looping in tcp_bpf_unhash() during sockmap/tls free, from John Fastabend. 7) hns3 needs short frames padded on transmit, from Yunsheng Lin. 8) Fix netfilter ICMP header corruption, from Eyal Birger. 9) Fix soft lockup when low on memory in hns3, from Yonglong Liu. 10) Fix NTUPLE firmware command failures in bnxt_en, from Michael Chan. 11) Fix memory leak in act_ctinfo, from Eric Dumazet. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (91 commits) cxgb4: reject overlapped queues in TC-MQPRIO offload cxgb4: fix Tx multi channel port rate limit net: sched: act_ctinfo: fix memory leak bnxt_en: Do not treat DSN (Digital Serial Number) read failure as fatal. bnxt_en: Fix ipv6 RFS filter matching logic. bnxt_en: Fix NTUPLE firmware command failures. net: systemport: Fixed queue mapping in internal ring map net: dsa: bcm_sf2: Configure IMP port for 2Gb/sec net: dsa: sja1105: Don't error out on disabled ports with no phy-mode net: phy: dp83867: Set FORCE_LINK_GOOD to default after reset net: hns: fix soft lockup when there is not enough memory net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key() net/sched: act_ife: initalize ife->metalist earlier netfilter: nat: fix ICMP header corruption on ICMP errors net: wan: lapbether.c: Use built-in RCU list checking netfilter: nf_tables: fix flowtable list del corruption netfilter: nf_tables: fix memory leak in nf_tables_parse_netdev_hooks() netfilter: nf_tables: remove WARN and add NLA_STRING upper limits netfilter: nft_tunnel: ERSPAN_VERSION must not be null netfilter: nft_tunnel: fix null-attribute check ...
2020-01-18Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds2-5/+12
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Ingo Molnar: "Three fixes: fix link failure on Alpha, fix a Sparse warning and annotate/robustify a lockless access in the NOHZ code" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: tick/sched: Annotate lockless access to last_jiffies_update lib/vdso: Make __cvdso_clock_getres() static time/posix-stubs: Provide compat itimer supoprt for alpha
2020-01-18Merge branch 'smp-urgent-for-linus' of ↵Linus Torvalds1-71/+72
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull cpu/SMT fix from Ingo Molnar: "Fix a build bug on CONFIG_HOTPLUG_SMT=y && !CONFIG_SYSFS kernels" * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: cpu/SMT: Fix x86 link error without CONFIG_SYSFS
2020-01-18Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds1-1/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Tooling fixes, three Intel uncore driver fixes, plus an AUX events fix uncovered by the perf fuzzer" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel/uncore: Remove PCIe3 unit for SNR perf/x86/intel/uncore: Fix missing marker for snr_uncore_imc_freerunning_events perf/x86/intel/uncore: Add PCI ID of IMC for Xeon E3 V5 Family perf: Correctly handle failed perf_get_aux_event() perf hists: Fix variable name's inconsistency in hists__for_each() macro perf map: Set kmap->kmaps backpointer for main kernel map chunks perf report: Fix incorrectly added dimensions as switch perf data file tools lib traceevent: Fix memory leakage in filter_event
2020-01-18Merge branch 'locking-urgent-for-linus' of ↵Linus Torvalds3-6/+6
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Ingo Molnar: "Three fixes: - Fix an rwsem spin-on-owner crash, introduced in v5.4 - Fix a lockdep bug when running out of stack_trace entries, introduced in v5.4 - Docbook fix" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/rwsem: Fix kernel crash when spinning on RWSEM_OWNER_UNKNOWN futex: Fix kernel-doc notation warning locking/lockdep: Fix buffer overrun problem in stack_trace[]
2020-01-18Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull rseq fixes from Ingo Molnar: "Two rseq bugfixes: - CLONE_VM !CLONE_THREAD didn't work properly, the kernel would end up corrupting the TLS of the parent. Technically a change in the ABI but the previous behavior couldn't resonably have been relied on by applications so this looks like a valid exception to the ABI rule. - Make the RSEQ_FLAG_UNREGISTER ABI behavior consistent with the handling of other flags. This is not thought to impact any applications either" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rseq: Unregister rseq for clone CLONE_VM rseq: Reject unknown flags on rseq unregister
2020-01-18Merge tag 'for-linus-2020-01-18' of ↵Linus Torvalds1-5/+10
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull thread fixes from Christian Brauner: "Here is an urgent fix for ptrace_may_access() permission checking. Commit 69f594a38967 ("ptrace: do not audit capability check when outputing /proc/pid/stat") introduced the ability to opt out of audit messages for accesses to various proc files since they are not violations of policy. While doing so it switched the check from ns_capable() to has_ns_capability{_noaudit}(). That means it switched from checking the subjective credentials (ktask->cred) of the task to using the objective credentials (ktask->real_cred). This is appears to be wrong. ptrace_has_cap() is currently only used in ptrace_may_access() And is used to check whether the calling task (subject) has the CAP_SYS_PTRACE capability in the provided user namespace to operate on the target task (object). According to the cred.h comments this means the subjective credentials of the calling task need to be used. With this fix we switch ptrace_has_cap() to use security_capable() and thus back to using the subjective credentials. As one example where this might be particularly problematic, Jann pointed out that in combination with the upcoming IORING_OP_OPENAT{2} feature, this bug might allow unprivileged users to bypass the capability checks while asynchronously opening files like /proc/*/mem, because the capability checks for this would be performed against kernel credentials. To illustrate on the former point about this being exploitable: When io_uring creates a new context it records the subjective credentials of the caller. Later on, when it starts to do work it creates a kernel thread and registers a callback. The callback runs with kernel creds for ktask->real_cred and ktask->cred. To prevent this from becoming a full-blown 0-day io_uring will call override_cred() and override ktask->cred with the subjective credentials of the creator of the io_uring instance. With ptrace_has_cap() currently looking at ktask->real_cred this override will be ineffective and the caller will be able to open arbitray proc files as mentioned above. Luckily, this is currently not exploitable but would be so once IORING_OP_OPENAT{2} land in v5.6. Let's fix it now. To minimize potential regressions I successfully ran the criu testsuite. criu makes heavy use of ptrace() and extensively hits ptrace_may_access() codepaths and has a good change of detecting any regressions. Additionally, I succesfully ran the ptrace and seccomp kernel tests" * tag 'for-linus-2020-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: ptrace: reintroduce usage of subjective credentials in ptrace_has_cap()
2020-01-18ptrace: reintroduce usage of subjective credentials in ptrace_has_cap()Christian Brauner1-5/+10
Commit 69f594a38967 ("ptrace: do not audit capability check when outputing /proc/pid/stat") introduced the ability to opt out of audit messages for accesses to various proc files since they are not violations of policy. While doing so it somehow switched the check from ns_capable() to has_ns_capability{_noaudit}(). That means it switched from checking the subjective credentials of the task to using the objective credentials. This is wrong since. ptrace_has_cap() is currently only used in ptrace_may_access() And is used to check whether the calling task (subject) has the CAP_SYS_PTRACE capability in the provided user namespace to operate on the target task (object). According to the cred.h comments this would mean the subjective credentials of the calling task need to be used. This switches ptrace_has_cap() to use security_capable(). Because we only call ptrace_has_cap() in ptrace_may_access() and in there we already have a stable reference to the calling task's creds under rcu_read_lock() there's no need to go through another series of dereferences and rcu locking done in ns_capable{_noaudit}(). As one example where this might be particularly problematic, Jann pointed out that in combination with the upcoming IORING_OP_OPENAT feature, this bug might allow unprivileged users to bypass the capability checks while asynchronously opening files like /proc/*/mem, because the capability checks for this would be performed against kernel credentials. To illustrate on the former point about this being exploitable: When io_uring creates a new context it records the subjective credentials of the caller. Later on, when it starts to do work it creates a kernel thread and registers a callback. The callback runs with kernel creds for ktask->real_cred and ktask->cred. To prevent this from becoming a full-blown 0-day io_uring will call override_cred() and override ktask->cred with the subjective credentials of the creator of the io_uring instance. With ptrace_has_cap() currently looking at ktask->real_cred this override will be ineffective and the caller will be able to open arbitray proc files as mentioned above. Luckily, this is currently not exploitable but will turn into a 0-day once IORING_OP_OPENAT{2} land in v5.6. Fix it now! Cc: Oleg Nesterov <[email protected]> Cc: Eric Paris <[email protected]> Cc: [email protected] Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Serge Hallyn <[email protected]> Reviewed-by: Jann Horn <[email protected]> Fixes: 69f594a38967 ("ptrace: do not audit capability check when outputing /proc/pid/stat") Signed-off-by: Christian Brauner <[email protected]>
2020-01-17lib/vdso: Update coarse timekeeper unconditionallyThomas Gleixner1-20/+17
The low resolution parts of the VDSO, i.e.: clock_gettime(CLOCK_*_COARSE), clock_getres(), time() can be used even if there is no VDSO capable clocksource. But if an architecture opts out of the VDSO data update then this information becomes stale. This affects ARM when there is no architected timer available. The lack of update causes userspace to use stale data forever. Make the update of the low resolution parts unconditional and only skip the update of the high resolution parts if the architecture requests it. Fixes: 44f57d788e7d ("timekeeping: Provide a generic update_vsyscall() implementation") Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-01-17lib/vdso: Make __arch_update_vdso_data() logic understandableThomas Gleixner1-1/+1
The function name suggests that this is a boolean checking whether the architecture asks for an update of the VDSO data, but it works the other way round. To spare further confusion invert the logic. Fixes: 44f57d788e7d ("timekeeping: Provide a generic update_vsyscall() implementation") Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-01-17perf: Correctly handle failed perf_get_aux_event()Mark Rutland1-1/+3
Vince reports a worrying issue: | so I was tracking down some odd behavior in the perf_fuzzer which turns | out to be because perf_even_open() sometimes returns 0 (indicating a file | descriptor of 0) even though as far as I can tell stdin is still open. ... and further the cause: | error is triggered if aux_sample_size has non-zero value. | | seems to be this line in kernel/events/core.c: | | if (perf_need_aux_event(event) && !perf_get_aux_event(event, group_leader)) | goto err_locked; | | (note, err is never set) This seems to be a thinko in commit: ab43762ef010967e ("perf: Allow normal events to output AUX data") ... and we should probably return -EINVAL here, as this should only happen when the new event is mis-configured or does not have a compatible aux_event group leader. Fixes: ab43762ef010967e ("perf: Allow normal events to output AUX data") Reported-by: Vince Weaver <[email protected]> Signed-off-by: Mark Rutland <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Alexander Shishkin <[email protected]> Tested-by: Vince Weaver <[email protected]>
2020-01-17watchdog/softlockup: Enforce that timestamp is valid on bootThomas Gleixner1-4/+6
Robert reported that during boot the watchdog timestamp is set to 0 for one second which is the indicator for a watchdog reset. The reason for this is that the timestamp is in seconds and the time is taken from sched clock and divided by ~1e9. sched clock starts at 0 which means that for the first second during boot the watchdog timestamp is 0, i.e. reset. Use ULONG_MAX as the reset indicator value so the watchdog works correctly right from the start. ULONG_MAX would only conflict with a real timestamp if the system reaches an uptime of 136 years on 32bit and almost eternity on 64bit. Reported-by: Robert Richter <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-01-17locking/osq: Use optimized spinning loop for arm64Waiman Long1-13/+10
Arm64 has a more optimized spinning loop (atomic_cond_read_acquire) using wfe for spinlock that can boost performance of sibling threads by putting the current cpu to a wait state that is broken only when the monitored variable changes or an external event happens. OSQ has a more complicated spinning loop. Besides the lock value, it also checks for need_resched() and vcpu_is_preempted(). The check for need_resched() is not a problem as it is only set by the tick interrupt handler. That will be detected by the spinning cpu right after iret. The vcpu_is_preempted() check, however, is a problem as changes to the preempt state of of previous node will not affect the wait state. For ARM64, vcpu_is_preempted is not currently defined and so is a no-op. Will has indicated that he is planning to para-virtualize wfe instead of defining vcpu_is_preempted for PV support. So just add a comment in arch/arm64/include/asm/spinlock.h to indicate that vcpu_is_preempted() should not be defined as suggested. On a 2-socket 56-core 224-thread ARM64 system, a kernel mutex locking microbenchmark was run for 10s with and without the patch. The performance numbers before patch were: Running locktest with mutex [runtime = 10s, load = 1] Threads = 224, Min/Mean/Max = 316/123,143/2,121,269 Threads = 224, Total Rate = 2,757 kop/s; Percpu Rate = 12 kop/s After patch, the numbers were: Running locktest with mutex [runtime = 10s, load = 1] Threads = 224, Min/Mean/Max = 334/147,836/1,304,787 Threads = 224, Total Rate = 3,311 kop/s; Percpu Rate = 15 kop/s So there was about 20% performance improvement. Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Will Deacon <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-01-17locking/qspinlock: Fix inaccessible URL of MCS lock paperWaiman Long1-6/+7
It turns out that the URL of the MCS lock paper listed in the source code is no longer accessible. I did got question about where the paper was. This patch updates the URL to BZ 206115 which contains a copy of the paper from https://www.cs.rochester.edu/u/scott/papers/1991_TOCS_synch.pdf Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Will Deacon <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-01-17locking/lockdep: Fix lockdep_stats indentation problemWaiman Long1-2/+2
It was found that two lines in the output of /proc/lockdep_stats have indentation problem: # cat /proc/lockdep_stats : in-process chains: 25057 stack-trace entries: 137827 [max: 524288] number of stack traces: 7973 number of stack hash chains: 6355 combined max dependencies: 1356414598 hardirq-safe locks: 57 hardirq-unsafe locks: 1286 : All the numbers displayed in /proc/lockdep_stats except the two stack trace numbers are formatted with a field with of 11. To properly align all the numbers, a field width of 11 is now added to the two stack trace numbers. Fixes: 8c779229d0f4 ("locking/lockdep: Report more stack trace statistics") Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-01-17locking/rwsem: Fix kernel crash when spinning on RWSEM_OWNER_UNKNOWNWaiman Long1-2/+2
The commit 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically spin on owner") will allow a recently woken up waiting writer to spin on the owner. Unfortunately, if the owner happens to be RWSEM_OWNER_UNKNOWN, the code will incorrectly spin on it leading to a kernel crash. This is fixed by passing the proper non-spinnable bits to rwsem_spin_on_owner() so that RWSEM_OWNER_UNKNOWN will be treated as a non-spinnable target. Fixes: 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically spin on owner") Reported-by: Christoph Hellwig <[email protected]> Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Christoph Hellwig <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]
2020-01-17sched/topology: Assert non-NUMA topology masks don't (partially) overlapValentin Schneider1-0/+39
topology.c::get_group() relies on the assumption that non-NUMA domains do not partially overlap. Zeng Tao pointed out in [1] that such topology descriptions, while completely bogus, can end up being exposed to the scheduler. In his example (8 CPUs, 2-node system), we end up with: MC span for CPU3 == 3-7 MC span for CPU4 == 4-7 The first pass through get_group(3, sdd@MC) will result in the following sched_group list: 3 -> 4 -> 5 -> 6 -> 7 ^ / `----------------' And a later pass through get_group(4, sdd@MC) will "corrupt" that to: 3 -> 4 -> 5 -> 6 -> 7 ^ / `-----------' which will completely break things like 'while (sg != sd->groups)' when using CPU3's base sched_domain. There already are some architecture-specific checks in place such as x86/kernel/smpboot.c::topology.sane(), but this is something we can detect in the core scheduler, so it seems worthwhile to do so. Warn and abort the construction of the sched domains if such a broken topology description is detected. Note that this is somewhat expensive (O(t.c²), 't' non-NUMA topology levels and 'c' CPUs) and could be gated under SCHED_DEBUG if deemed necessary. Testing ======= Dietmar managed to reproduce this using the following qemu incantation: $ qemu-system-aarch64 -kernel ./Image -hda ./qemu-image-aarch64.img \ -append 'root=/dev/vda console=ttyAMA0 loglevel=8 sched_debug' -smp \ cores=8 --nographic -m 512 -cpu cortex-a53 -machine virt -numa \ node,cpus=0-2,nodeid=0 -numa node,cpus=3-7,nodeid=1 alongside the following drivers/base/arch_topology.c hack (AIUI wouldn't be needed if '-smp cores=X, sockets=Y' would work with qemu): 8<--- @@ -465,6 +465,9 @@ void update_siblings_masks(unsigned int cpuid) if (cpuid_topo->package_id != cpu_topo->package_id) continue; + if ((cpu < 4 && cpuid > 3) || (cpu > 3 && cpuid < 4)) + continue; + cpumask_set_cpu(cpuid, &cpu_topo->core_sibling); cpumask_set_cpu(cpu, &cpuid_topo->core_sibling); 8<--- [1]: https://lkml.kernel.org/r/[email protected] Reported-by: Zeng Tao <[email protected]> Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-01-17idle: fix spelling mistake "iterrupts" -> "interrupts"Hewenliang1-1/+1
There is a spelling misake in comments of cpuidle_idle_call. Fix it. Signed-off-by: Hewenliang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Steven Rostedt (VMware) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-01-17sched/fair: Remove redundant call to cpufreq_update_util()Vincent Guittot1-7/+7
With commit bef69dd87828 ("sched/cpufreq: Move the cfs_rq_util_change() call to cpufreq_update_util()") update_load_avg() has become the central point for calling cpufreq (not including the update of blocked load). This change helps to simplify further the number of calls to cpufreq_update_util() and to remove last redundant ones. With update_load_avg(), we are now sure that cpufreq_update_util() will be called after every task attachment to a cfs_rq and especially after propagating this event down to the util_avg of the root cfs_rq, which is the level that is used by cpufreq governors like schedutil to set the frequency of a CPU. The SCHED_CPUFREQ_MIGRATION flag forces an early call to cpufreq when the migration happens in a cgroup whereas util_avg of root cfs_rq is not yet updated and this call is duplicated with the one that happens immediately after when the migration event reaches the root cfs_rq. The dedicated flag SCHED_CPUFREQ_MIGRATION is now useless and can be removed. The interface of attach_entity_load_avg() can also be simplified accordingly. Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Rafael J. Wysocki <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-01-17sched/psi: create /proc/pressure and /proc/pressure/{io|memory|cpu} only ↵Wang Long1-4/+6
when psi enabled when CONFIG_PSI_DEFAULT_DISABLED set to N or the command line set psi=0, I think we should not create /proc/pressure and /proc/pressure/{io|memory|cpu}. In the future, user maybe determine whether the psi feature is enabled by checking the existence of the /proc/pressure dir or /proc/pressure/{io|memory|cpu} files. Signed-off-by: Wang Long <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Johannes Weiner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-01-17sched/fair: Fix sgc->{min,max}_capacity calculation for SD_OVERLAPPeng Liu1-22/+4
commit bf475ce0a3dd ("sched/fair: Add per-CPU min capacity to sched_group_capacity") introduced per-cpu min_capacity. commit e3d6d0cb66f2 ("sched/fair: Add sched_group per-CPU max capacity") introduced per-cpu max_capacity. In the SD_OVERLAP case, the local variable 'capacity' represents the sum of CPU capacity of all CPUs in the first sched group (sg) of the sched domain (sd). It is erroneously used to calculate sg's min and max CPU capacity. To fix this use capacity_of(cpu) instead of 'capacity'. The code which achieves this via cpu_rq(cpu)->sd->groups->sgc->capacity (for rq->sd != NULL) can be removed since it delivers the same value as capacity_of(cpu) which is currently only used for the (!rq->sd) case (see update_cpu_capacity()). An sg of the lowest sd (rq->sd or sd->child == NULL) represents a single CPU (and hence sg->sgc->capacity == capacity_of(cpu)). Signed-off-by: Peng Liu <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Valentin Schneider <[email protected]> Link: https://lkml.kernel.org/r/20200104130828.GA7718@iZj6chx1xj0e0buvshuecpZ
2020-01-17sched/fair: calculate delta runnable load only when it's neededPeng Wang1-5/+6
Move the code of calculation for delta_sum/delta_avg to where it is really needed to be done. Signed-off-by: Peng Wang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Vincent Guittot <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-01-17sched/cputime: move rq parameter in irqtime_account_process_tickAlex Shi1-9/+6
Every time we call irqtime_account_process_tick() is in a interrupt, Every caller will get and assign a parameter rq = this_rq(), This is unnecessary and increase the code size a little bit. Move the rq getting action to irqtime_account_process_tick internally is better. base with this patch cputime.o 578792 bytes 577888 bytes Signed-off-by: Alex Shi <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-01-17stop_machine: Make stop_cpus() staticYangtao Li1-1/+1
The function stop_cpus() is only used internally by the stop_machine for stop multiple cpus. Make it static. Signed-off-by: Yangtao Li <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-01-17sched/debug: Reset watchdog on all CPUs while processing sysrq-tWei Li1-2/+9
Lengthy output of sysrq-t may take a lot of time on slow serial console with lots of processes and CPUs. So we need to reset NMI-watchdog to avoid spurious lockup messages, and we also reset softlockup watchdogs on all other CPUs since another CPU might be blocked waiting for us to process an IPI or stop_machine. Add to sysrq_sched_debug_show() as what we did in show_state_filter(). Signed-off-by: Wei Li <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Steven Rostedt (VMware) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-01-17sched/core: Fix size of rq::uclamp initializationLi Guanglei1-1/+2
rq::uclamp is an array of struct uclamp_rq, make sure we clear the whole thing. Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcountinga") Signed-off-by: Li Guanglei <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Qais Yousef <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-01-17sched/uclamp: Fix a bug in propagating uclamp value in new cgroupsQais Yousef1-0/+6
When a new cgroup is created, the effective uclamp value wasn't updated with a call to cpu_util_update_eff() that looks at the hierarchy and update to the most restrictive values. Fix it by ensuring to call cpu_util_update_eff() when a new cgroup becomes online. Without this change, the newly created cgroup uses the default root_task_group uclamp values, which is 1024 for both uclamp_{min, max}, which will cause the rq to to be clamped to max, hence cause the system to run at max frequency. The problem was observed on Ubuntu server and was reproduced on Debian and Buildroot rootfs. By default, Ubuntu and Debian create a cpu controller cgroup hierarchy and add all tasks to it - which creates enough noise to keep the rq uclamp value at max most of the time. Imitating this behavior makes the problem visible in Buildroot too which otherwise looks fine since it's a minimal userspace. Fixes: 0b60ba2dd342 ("sched/uclamp: Propagate parent clamps") Reported-by: Doug Smythies <[email protected]> Signed-off-by: Qais Yousef <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Doug Smythies <[email protected]> Link: https://lore.kernel.org/lkml/000701d5b965$361b6c60$a2524520$@net/
2020-01-17sched/fair: Load balance aggressively for SCHED_IDLE CPUsViresh Kumar1-11/+21
The fair scheduler performs periodic load balance on every CPU to check if it can pull some tasks from other busy CPUs. The duration of this periodic load balance is set to sd->balance_interval for the idle CPUs and is calculated by multiplying the sd->balance_interval with the sd->busy_factor (set to 32 by default) for the busy CPUs. The multiplication is done for busy CPUs to avoid doing load balance too often and rather spend more time executing actual task. While that is the right thing to do for the CPUs busy with SCHED_OTHER or SCHED_BATCH tasks, it may not be the optimal thing for CPUs running only SCHED_IDLE tasks. With the recent enhancements in the fair scheduler around SCHED_IDLE CPUs, we now prefer to enqueue a newly-woken task to a SCHED_IDLE CPU instead of other busy or idle CPUs. The same reasoning should be applied to the load balancer as well to make it migrate tasks more aggressively to a SCHED_IDLE CPU, as that will reduce the scheduling latency of the migrated (SCHED_OTHER) tasks. This patch makes minimal changes to the fair scheduler to do the next load balance soon after the last non SCHED_IDLE task is dequeued from a runqueue, i.e. making the CPU SCHED_IDLE. Also the sd->busy_factor is ignored while calculating the balance_interval for such CPUs. This is done to avoid delaying the periodic load balance by few hundred milliseconds for SCHED_IDLE CPUs. This is tested on ARM64 Hikey620 platform (octa-core) with the help of rt-app and it is verified, using kernel traces, that the newly SCHED_IDLE CPU does load balancing shortly after it becomes SCHED_IDLE and pulls tasks from other busy CPUs. Signed-off-by: Viresh Kumar <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Vincent Guittot <[email protected]> Link: https://lkml.kernel.org/r/e485827eb8fe7db0943d6f3f6e0f5a4a70272781.1578471925.git.viresh.kumar@linaro.org
2020-01-17sched/fair : Improve update_sd_pick_busiest for spare capacity caseVincent Guittot1-5/+9
Similarly to calculate_imbalance() and find_busiest_group(), using the number of idle CPUs when there is only 1 CPU in the group is not efficient because we can't make a difference between a CPU running 1 task and a CPU running dozens of small tasks competing for the same CPU but not enough to overload it. More generally speaking, we should use the number of running tasks when there is the same number of idle CPUs in a group instead of blindly select the 1st one. When the groups have spare capacity and the same number of idle CPUs, we compare the number of running tasks to select the busiest group. Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-01-17watchdog: Remove soft_lockup_hrtimer_cnt and related codeJisheng Zhang1-3/+0
After commit 9cf57731b63e ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work"), the percpu soft_lockup_hrtimer_cnt is not used any more, so remove it and related code. Signed-off-by: Jisheng Zhang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-01-17tracing: Initialize ret in syscall_enter_define_fields()Steven Rostedt (VMware)1-1/+2
If syscall_enter_define_fields() is called on a system call with no arguments, the return code variable "ret" will never get initialized. Initialize it to zero. Fixes: 04ae87a52074e ("ftrace: Rework event_create_dir()") Reported-by: Qian Cai <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-01-16bpf: Remove set but not used variable 'first_key'YueHaibing1-2/+0
kernel/bpf/syscall.c: In function generic_map_lookup_batch: kernel/bpf/syscall.c:1339:7: warning: variable first_key set but not used [-Wunused-but-set-variable] It is never used, so remove it. Reported-by: Hulk Robot <[email protected]> Signed-off-by: YueHaibing <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Brian Vazquez <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-01-16devmap: Adjust tracepoint for map-less queue flushJesper Dangaard Brouer1-1/+1
Now that we don't have a reference to a devmap when flushing the device bulk queue, let's change the the devmap_xmit tracepoint to remote the map_id and map_index fields entirely. Rearrange the fields so 'drops' and 'sent' stay in the same position in the tracepoint struct, to make it possible for the xdp_monitor utility to read both the old and the new format. Signed-off-by: Jesper Dangaard Brouer <[email protected]> Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-01-16xdp: Use bulking for non-map XDP_REDIRECT and consolidate code pathsToke Høiland-Jørgensen1-10/+22
Since the bulk queue used by XDP_REDIRECT now lives in struct net_device, we can re-use the bulking for the non-map version of the bpf_redirect() helper. This is a simple matter of having xdp_do_redirect_slow() queue the frame on the bulk queue instead of sending it out with __bpf_tx_xdp(). Unfortunately we can't make the bpf_redirect() helper return an error if the ifindex doesn't exit (as bpf_redirect_map() does), because we don't have a reference to the network namespace of the ingress device at the time the helper is called. So we have to leave it as-is and keep the device lookup in xdp_do_redirect_slow(). Since this leaves less reason to have the non-map redirect code in a separate function, so we get rid of the xdp_do_redirect_slow() function entirely. This does lose us the tracepoint disambiguation, but fortunately the xdp_redirect and xdp_redirect_map tracepoints use the same tracepoint entry structures. This means both can contain a map index, so we can just amend the tracepoint definitions so we always emit the xdp_redirect(_err) tracepoints, but with the map ID only populated if a map is present. This means we retire the xdp_redirect_map(_err) tracepoints entirely, but keep the definitions around in case someone is still listening for them. With this change, the performance of the xdp_redirect sample program goes from 5Mpps to 8.4Mpps (a 68% increase). Since the flush functions are no longer map-specific, rename the flush() functions to drop _map from their names. One of the renamed functions is the xdp_do_flush_map() callback used in all the xdp-enabled drivers. To keep from having to update all drivers, use a #define to keep the old name working, and only update the virtual drivers in this patch. Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-01-16xdp: Move devmap bulk queue into struct net_deviceToke Høiland-Jørgensen1-36/+27
Commit 96360004b862 ("xdp: Make devmap flush_list common for all map instances"), changed devmap flushing to be a global operation instead of a per-map operation. However, the queue structure used for bulking was still allocated as part of the containing map. This patch moves the devmap bulk queue into struct net_device. The motivation for this is reusing it for the non-map variant of XDP_REDIRECT, which will be changed in a subsequent commit. To avoid other fields of struct net_device moving to different cache lines, we also move a couple of other members around. We defer the actual allocation of the bulk queue structure until the NETDEV_REGISTER notification devmap.c. This makes it possible to check for ndo_xdp_xmit support before allocating the structure, which is not possible at the time struct net_device is allocated. However, we keep the freeing in free_netdev() to avoid adding another RCU callback on NETDEV_UNREGISTER. Because of this change, we lose the reference back to the map that originated the redirect, so change the tracepoint to always return 0 as the map ID and index. Otherwise no functional change is intended with this patch. After this patch, the relevant part of struct net_device looks like this, according to pahole: /* --- cacheline 14 boundary (896 bytes) --- */ struct netdev_queue * _tx __attribute__((__aligned__(64))); /* 896 8 */ unsigned int num_tx_queues; /* 904 4 */ unsigned int real_num_tx_queues; /* 908 4 */ struct Qdisc * qdisc; /* 912 8 */ unsigned int tx_queue_len; /* 920 4 */ spinlock_t tx_global_lock; /* 924 4 */ struct xdp_dev_bulk_queue * xdp_bulkq; /* 928 8 */ struct xps_dev_maps * xps_cpus_map; /* 936 8 */ struct xps_dev_maps * xps_rxqs_map; /* 944 8 */ struct mini_Qdisc * miniq_egress; /* 952 8 */ /* --- cacheline 15 boundary (960 bytes) --- */ struct hlist_head qdisc_hash[16]; /* 960 128 */ /* --- cacheline 17 boundary (1088 bytes) --- */ struct timer_list watchdog_timer; /* 1088 40 */ /* XXX last struct has 4 bytes of padding */ int watchdog_timeo; /* 1128 4 */ /* XXX 4 bytes hole, try to pack */ struct list_head todo_list; /* 1136 16 */ /* --- cacheline 18 boundary (1152 bytes) --- */ Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Björn Töpel <[email protected]> Acked-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-01-16PM: hibernate: fix crashes with init_on_free=1Alexander Potapenko1-10/+10
Upon resuming from hibernation, free pages may contain stale data from the kernel that initiated the resume. This breaks the invariant inflicted by init_on_free=1 that freed pages must be zeroed. To deal with this problem, make clear_free_pages() also clear the free pages when init_on_free is enabled. Fixes: 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options") Reported-by: Johannes Stezenbach <[email protected]> Signed-off-by: Alexander Potapenko <[email protected]> Cc: 5.3+ <[email protected]> # 5.3+ Signed-off-by: Rafael J. Wysocki <[email protected]>
2020-01-16PM: suspend: Add sysfs attribute to control the "sync on suspend" behaviorJonas Meurer3-2/+38
The sysfs attribute `/sys/power/sync_on_suspend` controls, whether or not filesystems are synced by the kernel before system suspend. Congruously, the behaviour of build-time switch CONFIG_SUSPEND_SKIP_SYNC is slightly changed: It now defines the run-tim default for the new sysfs attribute `/sys/power/sync_on_suspend`. The run-time attribute is added because the existing corresponding build-time Kconfig flag for (`CONFIG_SUSPEND_SKIP_SYNC`) is not flexible enough. E.g. Linux distributions that provide pre-compiled kernels usually want to stick with the default (sync filesystems before suspend) but under special conditions this needs to be changed. One example for such a special condition is user-space handling of suspending block devices (e.g. using `cryptsetup luksSuspend` or `dmsetup suspend`) before system suspend. The Kernel trying to sync filesystems after the underlying block device already got suspended obviously leads to dead-locks. Be aware that you have to take care of the filesystem sync yourself before suspending the system in those scenarios. Signed-off-by: Jonas Meurer <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2020-01-16watchdog/softlockup: Remove obsolete check of last reported taskPetr Mladek1-17/+1
commit 9cf57731b63e ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work") ensures that the watchdog is reliably touched during a task switch. As a result the check for an unnoticed task switch is not longer needed. Remove the relevant code, which effectively reverts commit b1a8de1f5343 ("softlockup: make detector be aware of task switch of processes hogging cpu") Signed-off-by: Petr Mladek <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Peter Ziljstra <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-01-16tracing: Allow trace_printk() to nest in other tracing codeSteven Rostedt (VMware)1-5/+19
trace_printk() is used to debug the kernel which includes the tracing infrastructure. But because it writes to the ring buffer, and so does much of the tracing infrastructure, the ring buffer's recursive detection will drop writes to the ring buffer that is in the same context as the current write is happening (it allows interrupts to write when normal context is writing, but wont let normal context write while normal context is writing). This can cause confusion and think that the code is where the trace_printk() exists is not hit. To solve this, up the recursive nesting of the ring buffer when trace_printk() is called before it writes to the buffer itself. Note, this does make it dangerous to use trace_printk() in the ring buffer code itself, because this basically disables the recursion protection of trace_printk() buffer writes. But as trace_printk() is only used for debugging, and if this does occur, the developer will see the cause real quick (recursive blowing up of the stack). Thus the developer can deal with that. But having trace_printk() silently ignored is a much bigger problem, and disabling recursive protection is a small price to pay to fix it. Signed-off-by: Steven Rostedt (VMware) <[email protected]>