aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2020-12-02entry: Rename exit_to_user_mode()Sven Schnelle1-4/+4
In order to make this function publicly available rename it so it can still be inlined. An additional exit_to_user_mode() function will be added with a later commit. Signed-off-by: Sven Schnelle <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-12-02entry: Rename enter_from_user_mode()Sven Schnelle1-5/+5
In order to make this function publicly available rename it so it can still be inlined. An additional enter_from_user_mode() function will be added with a later commit. Signed-off-by: Sven Schnelle <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-12-02entry: Support Syscall User Dispatch on common syscall entryGabriel Krisman Bertazi1-0/+25
Syscall User Dispatch (SUD) must take precedence over seccomp and ptrace, since the use case is emulation (it can be invoked with a different ABI) such that seccomp filtering by syscall number doesn't make sense in the first place. In addition, either the syscall is dispatched back to userspace, in which case there is no resource for to trace, or the syscall will be executed, and seccomp/ptrace will execute next. Since SUD runs before tracepoints, it needs to be a SYSCALL_WORK_EXIT as well, just to prevent a trace exit event when dispatch was triggered. For that, the on_syscall_dispatch() examines context to skip the tracepoint, audit and other work. [ tglx: Add a comment on the exit side ] Signed-off-by: Gabriel Krisman Bertazi <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Andy Lutomirski <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-12-02kernel: Implement selective syscall userspace redirectionGabriel Krisman Bertazi5-1/+118
Introduce a mechanism to quickly disable/enable syscall handling for a specific process and redirect to userspace via SIGSYS. This is useful for processes with parts that require syscall redirection and parts that don't, but who need to perform this boundary crossing really fast, without paying the cost of a system call to reconfigure syscall handling on each boundary transition. This is particularly important for Windows games running over Wine. The proposed interface looks like this: prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <off>, <length>, [selector]) The range [<offset>,<offset>+<length>) is a part of the process memory map that is allowed to by-pass the redirection code and dispatch syscalls directly, such that in fast paths a process doesn't need to disable the trap nor the kernel has to check the selector. This is essential to return from SIGSYS to a blocked area without triggering another SIGSYS from rt_sigreturn. selector is an optional pointer to a char-sized userspace memory region that has a key switch for the mechanism. This key switch is set to either PR_SYS_DISPATCH_ON, PR_SYS_DISPATCH_OFF to enable and disable the redirection without calling the kernel. The feature is meant to be set per-thread and it is disabled on fork/clone/execv. Internally, this doesn't add overhead to the syscall hot path, and it requires very little per-architecture support. I avoided using seccomp, even though it duplicates some functionality, due to previous feedback that maybe it shouldn't mix with seccomp since it is not a security mechanism. And obviously, this should never be considered a security mechanism, since any part of the program can by-pass it by using the syscall dispatcher. For the sysinfo benchmark, which measures the overhead added to executing a native syscall that doesn't require interception, the overhead using only the direct dispatcher region to issue syscalls is pretty much irrelevant. The overhead of using the selector goes around 40ns for a native (unredirected) syscall in my system, and it is (as expected) dominated by the supervisor-mode user-address access. In fact, with SMAP off, the overhead is consistently less than 5ns on my test box. Signed-off-by: Gabriel Krisman Bertazi <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Andy Lutomirski <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-12-01ring-buffer: Add test to validate the time stamp deltasSteven Rostedt (VMware)2-0/+170
While debugging a situation where a delta for an event was calucalted wrong, I realize there was nothing making sure that the delta of events are correct. If a single event has an incorrect delta, then all events after it will also have one. If the discrepency gets large enough, it could cause the time stamps to go backwards when crossing sub buffers, that record a full 64 bit time stamp, and the new deltas are added to that. Add a way to validate the events at most events and when crossing a buffer page. This will help make sure that the deltas are always correct. This test will detect if they are ever corrupted. The test adds a high overhead to the ring buffer recording, as it does the audit for almost every event, and should only be used for testing the ring buffer. This will catch the bug that is fixed by commit 55ea4cf40380 ("ring-buffer: Update write stamp with the correct ts"), which is not applied when this commit is applied. Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-12-01Merge tag 'v5.10-rc6' into rdma.git for-nextJason Gunthorpe3-5/+29
For dependencies in following patches Signed-off-by: Jason Gunthorpe <[email protected]>
2020-12-01Merge tag 'trace-v5.10-rc6' of ↵Linus Torvalds5-15/+33
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: - Use correct timestamp variable for ring buffer write stamp update - Fix up before stamp and write stamp when crossing ring buffer sub buffers - Keep a zero delta in ring buffer in slow path if cmpxchg fails - Fix trace_printk static buffer for archs that care - Fix ftrace record accounting for ftrace ops with trampolines - Fix DYNAMIC_FTRACE_WITH_DIRECT_CALLS dependency - Remove WARN_ON in hwlat tracer that triggers on something that is OK - Make "my_tramp" trampoline in ftrace direct sample code global - Fixes in the bootconfig tool for better alignment management * tag 'trace-v5.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ring-buffer: Always check to put back before stamp when crossing pages ftrace: Fix DYNAMIC_FTRACE_WITH_DIRECT_CALLS dependency ftrace: Fix updating FTRACE_FL_TRAMP tracing: Fix alignment of static buffer tracing: Remove WARN_ON in start_thread() samples/ftrace: Mark my_tramp[12]? global ring-buffer: Set the right timestamp in the slow path of __rb_reserve_next() ring-buffer: Update write stamp with the correct ts docs: bootconfig: Update file format on initrd image tools/bootconfig: Align the bootconfig applied initrd image size to 4 tools/bootconfig: Fix to check the write failure correctly tools/bootconfig: Fix errno reference after printf()
2020-12-01block: merge struct block_device and struct hd_structChristoph Hellwig1-35/+8
Instead of having two structures that represent each block device with different life time rules, merge them into a single one. This also greatly simplifies the reference counting rules, as we can use the inode reference count as the main reference count for the new struct block_device, with the device model reference front ending it for device model interaction. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-12-01block: move the start_sect field to struct block_deviceChristoph Hellwig1-8/+3
Move the start_sect field to struct block_device in preparation of killing struct hd_struct. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-12-01block: remove the nr_sects field in struct hd_structChristoph Hellwig1-1/+1
Now that the hd_struct always has a block device attached to it, there is no need for having two size field that just get out of sync. Additionally the field in hd_struct did not use proper serialization, possibly allowing for torn writes. By only using the block_device field this problem also gets fixed. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Acked-by: Coly Li <[email protected]> [bcache] Acked-by: Chao Yu <[email protected]> [f2fs] Signed-off-by: Jens Axboe <[email protected]>
2020-12-01scs: switch to vmapped shadow stacksSami Tolvanen1-11/+60
The kernel currently uses kmem_cache to allocate shadow call stacks, which means an overflows may not be immediately detected and can potentially result in another task's shadow stack to be overwritten. This change switches SCS to use virtually mapped shadow stacks for tasks, which increases shadow stack size to a full page and provides more robust overflow detection, similarly to VMAP_STACK. Signed-off-by: Sami Tolvanen <[email protected]> Acked-by: Will Deacon <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]>
2020-11-30ring-buffer: Always check to put back before stamp when crossing pagesSteven Rostedt (VMware)1-8/+6
The current ring buffer logic checks to see if the updating of the event buffer was interrupted, and if it is, it will try to fix up the before stamp with the write stamp to make them equal again. This logic is flawed, because if it is not interrupted, the two are guaranteed to be different, as the current event just updated the before stamp before allocation. This guarantees that the next event (this one or another interrupting one) will think it interrupted the time updates of a previous event and inject an absolute time stamp to compensate. The correct logic is to always update the timestamps when traversing to a new sub buffer. Cc: [email protected] Fixes: a389d86f7fd09 ("ring-buffer: Have nested events still record running time stamp") Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-11-30ftrace: Fix DYNAMIC_FTRACE_WITH_DIRECT_CALLS dependencyNaveen N. Rao1-1/+1
DYNAMIC_FTRACE_WITH_DIRECT_CALLS should depend on DYNAMIC_FTRACE_WITH_REGS since we need ftrace_regs_caller(). Link: https://lkml.kernel.org/r/fc4b257ea8689a36f086d2389a9ed989496ca63a.1606412433.git.naveen.n.rao@linux.vnet.ibm.com Cc: [email protected] Fixes: 763e34e74bb7d5c ("ftrace: Add register_ftrace_direct()") Signed-off-by: Naveen N. Rao <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-11-30ftrace: Fix updating FTRACE_FL_TRAMPNaveen N. Rao1-1/+21
On powerpc, kprobe-direct.tc triggered FTRACE_WARN_ON() in ftrace_get_addr_new() followed by the below message: Bad trampoline accounting at: 000000004222522f (wake_up_process+0xc/0x20) (f0000001) The set of steps leading to this involved: - modprobe ftrace-direct-too - enable_probe - modprobe ftrace-direct - rmmod ftrace-direct <-- trigger The problem turned out to be that we were not updating flags in the ftrace record properly. From the above message about the trampoline accounting being bad, it can be seen that the ftrace record still has FTRACE_FL_TRAMP set though ftrace-direct module is going away. This happens because we are checking if any ftrace_ops has the FTRACE_FL_TRAMP flag set _before_ updating the filter hash. The fix for this is to look for any _other_ ftrace_ops that also needs FTRACE_FL_TRAMP. Link: https://lkml.kernel.org/r/56c113aa9c3e10c19144a36d9684c7882bf09af5.1606412433.git.naveen.n.rao@linux.vnet.ibm.com Cc: [email protected] Fixes: a124692b698b0 ("ftrace: Enable trampoline when rec count returns back to one") Signed-off-by: Naveen N. Rao <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-11-30tracing: Fix alignment of static bufferMinchan Kim1-1/+1
With 5.9 kernel on ARM64, I found ftrace_dump output was broken but it had no problem with normal output "cat /sys/kernel/debug/tracing/trace". With investigation, it seems coping the data into temporal buffer seems to break the align binary printf expects if the static buffer is not aligned with 4-byte. IIUC, get_arg in bstr_printf expects that args has already right align to be decoded and seq_buf_bprintf says ``the arguments are saved in a 32bit word array that is defined by the format string constraints``. So if we don't keep the align under copy to temporal buffer, the output will be broken by shifting some bytes. This patch fixes it. Link: https://lkml.kernel.org/r/[email protected] Cc: <[email protected]> Fixes: 8e99cf91b99bb ("tracing: Do not allocate buffer in trace_find_next_entry() in atomic") Signed-off-by: Namhyung Kim <[email protected]> Signed-off-by: Minchan Kim <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-11-30tracing: Remove WARN_ON in start_thread()Vasily Averin1-1/+1
This patch reverts commit 978defee11a5 ("tracing: Do a WARN_ON() if start_thread() in hwlat is called when thread exists") .start hook can be legally called several times if according tracer is stopped screen window 1 [root@localhost ~]# echo 1 > /sys/kernel/tracing/events/kmem/kfree/enable [root@localhost ~]# echo 1 > /sys/kernel/tracing/options/pause-on-trace [root@localhost ~]# less -F /sys/kernel/tracing/trace screen window 2 [root@localhost ~]# cat /sys/kernel/debug/tracing/tracing_on 0 [root@localhost ~]# echo hwlat > /sys/kernel/debug/tracing/current_tracer [root@localhost ~]# echo 1 > /sys/kernel/debug/tracing/tracing_on [root@localhost ~]# cat /sys/kernel/debug/tracing/tracing_on 0 [root@localhost ~]# echo 2 > /sys/kernel/debug/tracing/tracing_on triggers warning in dmesg: WARNING: CPU: 3 PID: 1403 at kernel/trace/trace_hwlat.c:371 hwlat_tracer_start+0xc9/0xd0 Link: https://lkml.kernel.org/r/[email protected] Cc: Ingo Molnar <[email protected]> Cc: [email protected] Fixes: 978defee11a5 ("tracing: Do a WARN_ON() if start_thread() in hwlat is called when thread exists") Signed-off-by: Vasily Averin <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-11-30ring-buffer: Set the right timestamp in the slow path of __rb_reserve_next()Andrea Righi1-3/+3
In the slow path of __rb_reserve_next() a nested event(s) can happen between evaluating the timestamp delta of the current event and updating write_stamp via local_cmpxchg(); in this case the delta is not valid anymore and it should be set to 0 (same timestamp as the interrupting event), since the event that we are currently processing is not the last event in the buffer. Link: https://lkml.kernel.org/r/X8IVJcp1gRE+FJCJ@xps-13-7390 Cc: Ingo Molnar <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: [email protected] Link: https://lwn.net/Articles/831207 Fixes: a389d86f7fd0 ("ring-buffer: Have nested events still record running time stamp") Signed-off-by: Andrea Righi <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-11-30ring-buffer: Update write stamp with the correct tsSteven Rostedt (VMware)1-1/+1
The write stamp, used to calculate deltas between events, was updated with the stale "ts" value in the "info" structure, and not with the updated "ts" variable. This caused the deltas between events to be inaccurate, and when crossing into a new sub buffer, had time go backwards. Link: https://lkml.kernel.org/r/[email protected] Cc: [email protected] Fixes: a389d86f7fd09 ("ring-buffer: Have nested events still record running time stamp") Reported-by: "J. Avila" <[email protected]> Tested-by: Daniel Mentz <[email protected]> Tested-by: Will McVicker <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-11-30genirq/irqdomain: Don't try to free an interrupt that has no mappingMarc Zyngier1-2/+9
When an interrupt allocation fails for N interrupts, it is pretty common for the error handling code to free the same number of interrupts, no matter how many interrupts have actually been allocated. This may result in the domain freeing code to be unexpectedly called for interrupts that have no mapping in that domain. Things end pretty badly. Instead, add some checks to irq_domain_free_irqs_hierarchy() to make sure that thiss does not follow the hierarchy if no mapping exists for a given interrupt. Fixes: 6a6544e520abe ("genirq/irqdomain: Remove auto-recursive hierarchy support") Signed-off-by: Marc Zyngier <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-11-30genirq/irqdomain: Add an irq_create_mapping_affinity() functionLaurent Vivier1-5/+8
There is currently no way to convey the affinity of an interrupt via irq_create_mapping(), which creates issues for devices that expect that affinity to be managed by the kernel. In order to sort this out, rename irq_create_mapping() to irq_create_mapping_affinity() with an additional affinity parameter that can be passed down to irq_domain_alloc_descs(). irq_create_mapping() is re-implemented as a wrapper around irq_create_mapping_affinity(). No functional change. Fixes: e75eafb9b039 ("genirq/msi: Switch to new irq spreading infrastructure") Signed-off-by: Laurent Vivier <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Greg Kurz <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected]
2020-11-29Merge tag 'locking-urgent-2020-11-29' of ↵Linus Torvalds1-1/+27
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Thomas Gleixner: "Two more places which invoke tracing from RCU disabled regions in the idle path. Similar to the entry path the low level idle functions have to be non-instrumentable" * tag 'locking-urgent-2020-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: intel_idle: Fix intel_idle() vs tracing sched/idle: Fix arch_cpu_idle() vs tracing
2020-11-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski8-75/+84
Trivial conflict in CAN, keep the net-next + the byteswap wrapper. Conflicts: drivers/net/can/usb/gs_usb.c Signed-off-by: Jakub Kicinski <[email protected]>
2020-11-27Backmerge tag 'v5.10-rc2' into arm/driversArnd Bergmann14-52/+44
The SCMI pull request for the arm/drivers branch requires v5.10-rc2 because of dependencies with other git trees, so merge that in here. Signed-off-by: Arnd Bergmann <[email protected]>
2020-11-27Merge tag 'printk-for-5.10-rc6-fixup' of ↵Linus Torvalds2-4/+2
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk fixes from Petr Mladek: - do not lose trailing newline in pr_cont() calls - two trivial fixes for a dead store and a config description * tag 'printk-for-5.10-rc6-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: printk: finalize records with trailing newlines printk: remove unneeded dead-store assignment init/Kconfig: Fix CPU number in LOG_CPU_MAX_BUF_SHIFT description
2020-11-27Merge branch 'for-5.10-pr_cont-fixup' into for-linusPetr Mladek1-2/+2
2020-11-27printk: finalize records with trailing newlinesJohn Ogness1-2/+2
Any record with a trailing newline (LOG_NEWLINE flag) cannot be continued because the newline has been stripped and will not be visible if the message is appended. This was already handled correctly when committing in log_output() but was not handled correctly when committing in log_store(). Fixes: f5f022e53b87 ("printk: reimplement log_cont using record extension") Link: https://lore.kernel.org/r/[email protected] Reported-by: Kefeng Wang <[email protected]> Signed-off-by: John Ogness <[email protected]> Tested-by: Kefeng Wang <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Signed-off-by: Petr Mladek <[email protected]>
2020-11-27Merge branch 'linus' into sched/core, to resolve semantic conflictIngo Molnar49-322/+485
Signed-off-by: Ingo Molnar <[email protected]>
2020-11-27dma-mapping: add benchmark support for streaming DMA APIsBarry Song3-0/+371
Nowadays, there are increasing requirements to benchmark the performance of dma_map and dma_unmap particually while the device is attached to an IOMMU. This patch enables the support. Users can run specified number of threads to do dma_map_page and dma_unmap_page on a specific NUMA node with the specified duration. Then dma_map_benchmark will calculate the average latency for map and unmap. A difficulity for this benchmark is that dma_map/unmap APIs must run on a particular device. Each device might have different backend of IOMMU or non-IOMMU. So we use the driver_override to bind dma_map_benchmark to a particual device by: For platform devices: echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override echo xxx > /sys/bus/platform/drivers/xxx/unbind echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind For PCI devices: echo dma_map_benchmark > /sys/bus/pci/devices/0000:00:01.0/driver_override echo 0000:00:01.0 > /sys/bus/pci/drivers/xxx/unbind echo 0000:00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind Cc: Will Deacon <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Robin Murphy <[email protected]> Signed-off-by: Barry Song <[email protected]> [hch: folded in two fixes from Colin Ian King <[email protected]>] Signed-off-by: Christoph Hellwig <[email protected]>
2020-11-27dma-contiguous: fix a typo error in a commenttangjianqiang1-1/+1
Fix a typo error in cma description comment: "then" -> "than". Signed-off-by: tangjianqiang <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2020-11-27dma-pool: no need to check return value of debugfs_create functionsTiezhu Yang1-3/+0
When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Signed-off-by: Tiezhu Yang <[email protected]> Reviewed-by: Robin Murphy <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2020-11-27dma-mapping: Allow mixing bypass and mapped DMA operationAlexey Kardashevskiy2-4/+12
At the moment we allow bypassing DMA ops only when we can do this for the entire RAM. However there are configs with mixed type memory where we could still allow bypassing IOMMU in most cases; POWERPC with persistent memory is one example. This adds an arch hook to determine where bypass can still work and we invoke direct DMA API. The following patch checks the bus limit on POWERPC to allow or disallow direct mapping. This adds a ARCH_HAS_DMA_MAP_DIRECT config option to make the arch_xxxx hooks no-op by default. Signed-off-by: Alexey Kardashevskiy <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2020-11-27kernel/cpu: add arch override for clear_tasks_mm_cpumask() mm handlingNicholas Piggin1-1/+5
powerpc/64s keeps a counter in the mm which counts bits set in mm_cpumask as well as other things. This means it can't use generic code to clear bits out of the mask and doesn't adjust the arch specific counter. Add an arch override that allows powerpc/64s to use clear_tasks_mm_cpumask(). Signed-off-by: Nicholas Piggin <[email protected]> Reviewed-by: Aneesh Kumar K.V <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-11-26Merge remote-tracking branch 'origin/master' into perf/corePeter Zijlstra237-3825/+14568
Further perf/core patches will depend on: d3f7b1bb2040 ("mm/gup: fix gup_fast with dynamic page table folding") which is already in Linus' tree.
2020-11-26bpf: Add a BPF helper for getting the IMA hash of an inodeKP Singh1-0/+26
Provide a wrapper function to get the IMA hash of an inode. This helper is useful in fingerprinting files (e.g executables on execution) and using these fingerprints in detections like an executable unlinking itself. Since the ima_inode_hash can sleep, it's only allowed for sleepable LSM hooks. Signed-off-by: KP Singh <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-11-25cgroup/cgroup.c: replace 'of->kn->priv' with of_cft()Hui Su1-4/+4
we have supplied the inline function: of_cft() in cgroup.h. So replace the direct use 'of->kn->priv' with inline func of_cft(), which is more readable. Signed-off-by: Hui Su <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2020-11-25kernel: cgroup: Mundane spelling fixes throughout the fileBhaskar Chowdhury1-11/+11
Few spelling fixes throughout the file. Signed-off-by: Bhaskar Chowdhury <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2020-11-25workqueue: Kick a worker based on the actual activation of delayed worksYunfeng Ye1-3/+10
In realtime scenario, We do not want to have interference on the isolated cpu cores. but when invoking alloc_workqueue() for percpu wq on the housekeeping cpu, it kick a kworker on the isolated cpu. alloc_workqueue pwq_adjust_max_active wake_up_worker The comment in pwq_adjust_max_active() said: "Need to kick a worker after thawed or an unbound wq's max_active is bumped" So it is unnecessary to kick a kworker for percpu's wq when invoking alloc_workqueue(). this patch only kick a worker based on the actual activation of delayed works. Signed-off-by: Yunfeng Ye <[email protected]> Reviewed-by: Lai Jiangshan <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2020-11-25resource: provide meaningful MODULE_LICENSE() in test suiteAndy Shevchenko1-0/+2
modpost complains that module has no licence provided. Provide it via meaningful MODULE_LICENSE(). Reported-by: Stephen Rothwell <[email protected]> Signed-off-by: Andy Shevchenko <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2020-11-25module: simplify version-attribute handlingJohan Hovold1-6/+4
Instead of using the array-of-pointers trick to avoid having gcc mess up the built-in module-version array stride, specify type alignment when declaring entries to prevent gcc from increasing alignment. This is essentially an alternative (one-line) fix to the problem addressed by commit b4bc842802db ("module: deal with alignment issues in built-in module versions"). gcc can increase the alignment of larger objects with static extent as an optimisation, but this can be suppressed by using the aligned attribute when declaring variables. Note that we have been relying on this behaviour for kernel parameters for 16 years and it indeed hasn't changed since the introduction of the aligned attribute in gcc-3.1. Link: https://lore.kernel.org/lkml/[email protected] Signed-off-by: Johan Hovold <[email protected]> Signed-off-by: Jessica Yu <[email protected]>
2020-11-24bpf: Refactor check_cfg to use a structured loop.Wedson Almeida Filho1-84/+95
The current implementation uses a number of gotos to implement a loop and different paths within the loop, which makes the code less readable than it would be with an explicit while-loop. This patch also replaces a chain of if/if-elses keyed on the same expression with a switch statement. No change in behaviour is intended. Signed-off-by: Wedson Almeida Filho <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-11-24audit: fix macros warningsAlex Shi2-7/+6
Some unused macros could cause gcc warning: kernel/audit.c:68:0: warning: macro "AUDIT_UNINITIALIZED" is not used [-Wunused-macros] kernel/auditsc.c:104:0: warning: macro "AUDIT_AUX_IPCPERM" is not used [-Wunused-macros] kernel/auditsc.c:82:0: warning: macro "AUDITSC_INVALID" is not used [-Wunused-macros] AUDIT_UNINITIALIZED and AUDITSC_INVALID are still meaningful and should be in incorporated. Just remove AUDIT_AUX_IPCPERM. Thanks comments from Richard Guy Briggs and Paul Moore. Signed-off-by: Alex Shi <[email protected]> Cc: Paul Moore <[email protected]> Cc: Richard Guy Briggs <[email protected]> Cc: Eric Paris <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Paul Moore <[email protected]>
2020-11-25bpf: Sanitize BTF data pointer after module is loadedAndrii Nakryiko1-0/+5
Given .BTF section is not allocatable, it will get trimmed after module is loaded. BPF system handles that properly by creating an independent copy of data. But prevent any accidental misused by resetting the pointer to BTF data. Fixes: 36e68442d1af ("bpf: Load and verify kernel module BTFs") Suggested-by: Jessica Yu <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Jessica Yu <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-11-24irq_work: Optimize irq_work_single()Peter Zijlstra1-12/+17
Trade one atomic op for a full memory barrier. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]>
2020-11-24smp: Cleanup smp_call_function*()Peter Zijlstra3-38/+30
Get rid of the __call_single_node union and cleanup the API a little to avoid external code relying on the structure layout as much. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]>
2020-11-24irq_work: CleanupPeter Zijlstra6-21/+16
Get rid of the __call_single_node union and clean up the API a little to avoid external code relying on the structure layout as much. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]>
2020-11-24sched: Limit the amount of NUMA imbalance that can exist at fork timeMel Gorman1-2/+15
At fork time currently, a local node can be allowed to fill completely and allow the periodic load balancer to fix the problem. This can be problematic in cases where a task creates lots of threads that idle until woken as part of a worker poll causing a memory bandwidth problem. However, a "real" workload suffers badly from this behaviour. The workload in question is mostly NUMA aware but spawns large numbers of threads that act as a worker pool that can be called from anywhere. These need to spread early to get reasonable behaviour. This patch limits how much a local node can fill before spilling over to another node and it will not be a universal win. Specifically, very short-lived workloads that fit within a NUMA node would prefer the memory bandwidth. As I cannot describe the "real" workload, the best proxy measure I found for illustration was a page fault microbenchmark. It's not representative of the workload but demonstrates the hazard of the current behaviour. pft timings 5.10.0-rc2 5.10.0-rc2 imbalancefloat-v2 forkspread-v2 Amean elapsed-1 46.37 ( 0.00%) 46.05 * 0.69%* Amean elapsed-4 12.43 ( 0.00%) 12.49 * -0.47%* Amean elapsed-7 7.61 ( 0.00%) 7.55 * 0.81%* Amean elapsed-12 4.79 ( 0.00%) 4.80 ( -0.17%) Amean elapsed-21 3.13 ( 0.00%) 2.89 * 7.74%* Amean elapsed-30 3.65 ( 0.00%) 2.27 * 37.62%* Amean elapsed-48 3.08 ( 0.00%) 2.13 * 30.69%* Amean elapsed-79 2.00 ( 0.00%) 1.90 * 4.95%* Amean elapsed-80 2.00 ( 0.00%) 1.90 * 4.70%* This is showing the time to fault regions belonging to threads. The target machine has 80 logical CPUs and two nodes. Note the ~30% gain when the machine is approximately the point where one node becomes fully utilised. The slower results are borderline noise. Kernel building shows similar benefits around the same balance point. Generally performance was either neutral or better in the tests conducted. The main consideration with this patch is the point where fork stops spreading a task so some workloads may benefit from different balance points but it would be a risky tuning parameter. Signed-off-by: Mel Gorman <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Vincent Guittot <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-11-24sched/numa: Allow a floating imbalance between NUMA nodesMel Gorman1-10/+11
Currently, an imbalance is only allowed when a destination node is almost completely idle. This solved one basic class of problems and was the cautious approach. This patch revisits the possibility that NUMA nodes can be imbalanced until 25% of the CPUs are occupied. The reasoning behind 25% is somewhat superficial -- it's half the cores when HT is enabled. At higher utilisations, balancing should continue as normal and keep things even until scheduler domains are fully busy or over utilised. Note that this is not expected to be a universal win. Any benchmark that prefers spreading as wide as possible with limited communication will favour the old behaviour as there is more memory bandwidth. Workloads that communicate heavily in pairs such as netperf or tbench benefit. For the tests I ran, the vast majority of workloads saw a benefit so it seems to be a worthwhile trade-off. Signed-off-by: Mel Gorman <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Vincent Guittot <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-11-24sched: Avoid unnecessary calculation of load imbalance at clone timeMel Gorman1-3/+5
In find_idlest_group(), the load imbalance is only relevant when the group is either overloaded or fully busy but it is calculated unconditionally. This patch moves the imbalance calculation to the context it is required. Technically, it is a micro-optimisation but really the benefit is avoiding confusing one type of imbalance with another depending on the group_type in the next patch. No functional change. Signed-off-by: Mel Gorman <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Vincent Guittot <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-11-24sched/numa: Rename nr_running and break out the magic numberMel Gorman1-4/+6
This is simply a preparation patch to make the following patches easier to read. No functional change. Signed-off-by: Mel Gorman <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Vincent Guittot <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-11-24sched/idle: Fix arch_cpu_idle() vs tracingPeter Zijlstra1-1/+27
We call arch_cpu_idle() with RCU disabled, but then use local_irq_{en,dis}able(), which invokes tracing, which relies on RCU. Switch all arch_cpu_idle() implementations to use raw_local_irq_{en,dis}able() and carefully manage the lockdep,rcu,tracing state like we do in entry. (XXX: we really should change arch_cpu_idle() to not return with interrupts enabled) Reported-by: Sven Schnelle <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Mark Rutland <[email protected]> Tested-by: Mark Rutland <[email protected]> Link: https://lkml.kernel.org/r/[email protected]