path: root/kernel

2016-11-22  mm: Add a user_ns owner to mm_struct and fix ptrace permission checks  (Eric W. Biederman; 2 files, -18/+17)

    During exec, dumpable is cleared if the file that is being executed is
    not readable by the user executing the file.  A bug in
    ptrace_may_access() allows reading the file if the executable happens
    to enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
    unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER)).

    This problem is fixed with only the necessary userspace breakage by
    adding a user namespace owner to mm_struct, captured at the time of
    exec, so it is clear in which user namespace CAP_SYS_PTRACE must be
    present to be able to safely give read permission to the executable.

    The function ptrace_may_access() is modified to verify that the
    ptracer has CAP_SYS_PTRACE in task->mm->user_ns instead of
    task->cred->user_ns.  This ensures that if the task changes its cred
    into a subordinate user namespace it does not become ptraceable.

    The function ptrace_attach() is modified to only set PT_PTRACE_CAP
    when CAP_SYS_PTRACE is held over task->mm->user_ns.  The intent of
    PT_PTRACE_CAP is to be a flag noting that, whatever permission changes
    the task might go through, the tracer has sufficient permissions for
    it not to be an issue.

    task->cred->user_ns is always the same as or a descendant of
    mm->user_ns, which guarantees that having CAP_SYS_PTRACE over
    mm->user_ns is the worst case for the task's credentials.

    To prevent regressions, mm->dumpable and mm->user_ns are not
    considered when a task has no mm, as simply failing ptrace_may_access()
    causes regressions in privileged applications attempting to read
    things such as /proc/<pid>/stat.

    Cc: [email protected]
    Acked-by: Kees Cook <[email protected]>
    Tested-by: Cyrill Gorcunov <[email protected]>
    Fixes: 8409cca70561 ("userns: allow ptrace from non-init user namespaces")
    Signed-off-by: "Eric W. Biederman" <[email protected]>

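    An editorial sketch of the reworked check, assuming the helper names
    mentioned above (simplified; the real kernel/ptrace.c code differs in
    detail):

        /* Sketch: dumpable and the capability are now evaluated against
         * the mm's owning user namespace, not the task's cred->user_ns. */
        mm = task->mm;
        if (mm &&
            ((get_dumpable(mm) != SUID_DUMP_USER) &&
             !ptrace_has_cap(mm->user_ns, mode)))
                return -EPERM;
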
2016-11-22  locking/mutex: Break out of expensive busy-loop on {mutex,rwsem}_spin_on_owner() when owner vCPU is preempted  (Pan Xinhui; 2 files, -5/+22)

    An over-committed guest with more vCPUs than pCPUs wastes a lot of
    time in the two *_spin_on_owner() loops.  This is caused by the lock
    holder preemption problem: the spinner keeps burning its pCPU while
    the owner's vCPU is not running.  Break out of the loop if
    vcpu_is_preempted(cpu) reports that the owner's vCPU is preempted.

    test case:

        $ perf record -a perf bench sched messaging -g 400 -p && perf report

    before the patch:

        20.68%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
         8.45%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
         4.12%  sched-messaging  [kernel.vmlinux]  [k] system_call
         3.01%  sched-messaging  [kernel.vmlinux]  [k] system_call_common
         2.83%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
         2.64%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
         2.00%  sched-messaging  [kernel.vmlinux]  [k] osq_lock

    after the patch:

         9.99%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
         5.28%  sched-messaging  [unknown]         [H] 0xc0000000000768e0
         4.27%  sched-messaging  [kernel.vmlinux]  [k] __copy_tofrom_user_power7
         3.77%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
         3.24%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
         3.02%  sched-messaging  [kernel.vmlinux]  [k] system_call
         2.69%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task

    Tested-by: Juergen Gross <[email protected]>
    Signed-off-by: Pan Xinhui <[email protected]>
    Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
    Acked-by: Christian Borntraeger <[email protected]>
    Acked-by: Paolo Bonzini <[email protected]>
    Cc: [email protected]
    Cc: Linus Torvalds <[email protected]>
    Cc: Peter Zijlstra <[email protected]>
    Cc: Thomas Gleixner <[email protected]>
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Link: http://lkml.kernel.org/r/[email protected]
    Signed-off-by: Ingo Molnar <[email protected]>

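    A hedged sketch of the resulting loop-exit condition (paraphrased;
    __mutex_owner() and the surrounding structure are assumptions, not the
    exact patch):

        while (__mutex_owner(lock) == owner) {
                /* Stop spinning if the owner's vCPU was preempted by the
                 * hypervisor: the owner cannot make progress anyway. */
                if (!owner->on_cpu || need_resched() ||
                    vcpu_is_preempted(task_cpu(owner))) {
                        ret = false;
                        break;
                }
                cpu_relax();
        }
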
2016-11-22  locking/osq: Break out of spin-wait busy waiting loop for a preempted vCPU in osq_lock()  (Pan Xinhui; 1 file, -1/+8)

    An over-committed guest with more vCPUs than pCPUs wastes a lot of
    time in osq_lock().  This is because if vCPU-A holds the osq lock and
    yields out, vCPU-B ends up waiting for per_cpu node->locked to be set.
    IOW, vCPU-B waits for vCPU-A to run and unlock the osq lock.

    Use the new vcpu_is_preempted(cpu) interface to detect whether a vCPU
    is currently running, and break out of the spin-loop if it is not.

    test case:

        $ perf record -a perf bench sched messaging -g 400 -p && perf report

    before the patch:

        18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
        12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
         5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
         3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
         3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
         3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
         2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

    after the patch:

        20.68%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
         8.45%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
         4.12%  sched-messaging  [kernel.vmlinux]  [k] system_call
         3.01%  sched-messaging  [kernel.vmlinux]  [k] system_call_common
         2.83%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
         2.64%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
         2.00%  sched-messaging  [kernel.vmlinux]  [k] osq_lock

    Suggested-by: Boqun Feng <[email protected]>
    Tested-by: Juergen Gross <[email protected]>
    Signed-off-by: Pan Xinhui <[email protected]>
    Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
    Acked-by: Christian Borntraeger <[email protected]>
    Acked-by: Paolo Bonzini <[email protected]>
    Cc: [email protected]
    Cc: Linus Torvalds <[email protected]>
    Cc: Peter Zijlstra <[email protected]>
    Cc: Thomas Gleixner <[email protected]>
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Link: http://lkml.kernel.org/r/[email protected]
    [ Translated to English. ]
    Signed-off-by: Ingo Molnar <[email protected]>

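    A hedged sketch of the osq spin-wait after the change (node_cpu(),
    mapping the per-CPU node back to its CPU number, is an assumption):

        while (!READ_ONCE(node->locked)) {
                /* If the previous node's vCPU is preempted, waiting for
                 * node->locked just burns pCPU time: unqueue instead. */
                if (need_resched() ||
                    vcpu_is_preempted(node_cpu(node->prev)))
                        goto unqueue;
                cpu_relax();
        }
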
2016-11-22  Merge branch 'linus' into locking/core, to pick up fixes  (Ingo Molnar; 9 files, -56/+103)

    Signed-off-by: Ingo Molnar <[email protected]>

2016-11-22  sched/autogroup: Do not use autogroup->tg in zombie threads  (Oleg Nesterov; 2 files, -0/+20)

    Zombie threads must not use autogroup->tg, exactly because
    for_each_thread() in autogroup_move_group() can't see them and update
    their ->sched_task_group before _put() and the possible free().  So
    the exiting task needs another sched_move_task() before exit_notify(),
    and we need to re-introduce the PF_EXITING (or a similar) check
    removed by the previous change, for another reason.

    Signed-off-by: Oleg Nesterov <[email protected]>
    Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
    Cc: Linus Torvalds <[email protected]>
    Cc: Mike Galbraith <[email protected]>
    Cc: Peter Zijlstra <[email protected]>
    Cc: Thomas Gleixner <[email protected]>
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Link: http://lkml.kernel.org/r/[email protected]
    Signed-off-by: Ingo Molnar <[email protected]>

2016-11-22  sched/autogroup: Fix autogroup_move_group() to never skip sched_move_task()  (Oleg Nesterov; 1 file, -11/+12)

    The PF_EXITING check in task_wants_autogroup() is no longer needed.
    Remove it, but see the next patch.

    However, the comment is correct in that autogroup_move_group() must
    always change task_group() for every thread, so the sysctl_ check is
    very wrong; we can race with cgroups, and even sys_setsid() is not
    safe because a task running with task_group() == ag->tg must
    participate in refcounting:

        int main(void)
        {
            int sctl = open("/proc/sys/kernel/sched_autogroup_enabled", O_WRONLY);

            assert(sctl > 0);
            if (fork()) {
                wait(NULL);    /* destroy the child's ag/tg */
                pause();
            }

            assert(pwrite(sctl, "1\n", 2, 0) == 2);
            assert(setsid() > 0);
            if (fork())
                pause();

            kill(getppid(), SIGKILL);
            sleep(1);

            /* The child has gone, the grandchild runs with kref == 1 */
            assert(pwrite(sctl, "0\n", 2, 0) == 2);
            assert(setsid() > 0);

            /* runs with the freed ag/tg */
            for (;;)
                sleep(1);

            return 0;
        }

    crashes the kernel.  It doesn't really need the sleep(1); it doesn't
    matter whether autogroup_move_group() actually frees the task_group or
    this happens later.

    Reported-by: Vern Lovejoy <[email protected]>
    Signed-off-by: Oleg Nesterov <[email protected]>
    Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
    Cc: Linus Torvalds <[email protected]>
    Cc: Mike Galbraith <[email protected]>
    Cc: Peter Zijlstra <[email protected]>
    Cc: Thomas Gleixner <[email protected]>
    Cc: [email protected]
    Cc: [email protected]
    Link: http://lkml.kernel.org/r/[email protected]
    Signed-off-by: Ingo Molnar <[email protected]>

2016-11-22  genirq/msi: Drop artificial PCI dependency  (Marc Zyngier; 1 file, -3/+1)

    The generic MSI layer doesn't have any PCI ties anymore, and the build
    hack should have been removed some time ago.

    Fixes: d9109698be6e ("genirq: Introduce msi_domain_alloc/free_irqs()")
    Signed-off-by: Marc Zyngier <[email protected]>
    Link: http://lkml.kernel.org/r/[email protected]
    Signed-off-by: Thomas Gleixner <[email protected]>

2016-11-21  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc  (Linus Torvalds; 1 file, -3/+17)

    Pull sparc fixes from David Miller:

     1) With modern networking cards we can run out of 32-bit DMA space,
        so support 64-bit DMA addressing when possible on sparc64.  From
        Dave Tushar.

     2) Some signal frame validation checks are inverted on sparc32, fix
        from Andreas Larsson.

     3) Lockdep tables can get too large in some circumstances on sparc64,
        add a way to adjust the size a bit.  From Babu Moger.

     4) Fix NUMA node probing on some sun4v systems, from Thomas Tai.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
      sparc: drop duplicate header scatterlist.h
      lockdep: Limit static allocations if PROVE_LOCKING_SMALL is defined
      config: Adding the new config parameter CONFIG_PROVE_LOCKING_SMALL for sparc
      sunbmac: Fix compiler warning
      sunqe: Fix compiler warnings
      sparc64: Enable 64-bit DMA
      sparc64: Enable sun4v dma ops to use IOMMU v2 APIs
      sparc64: Bind PCIe devices to use IOMMU v2 service
      sparc64: Initialize iommu_map_table and iommu_pool
      sparc64: Add ATU (new IOMMU) support
      sparc64: Add FORCE_MAX_ZONEORDER and default to 13
      sparc64: fix compile warning section mismatch in find_node()
      sparc32: Fix inverted invalid_frame_pointer checks on sigreturns
      sparc64: Fix find_node warning if numa node cannot be found

2016-11-21  PM / sleep / ACPI: Use the ACPI_FADT_LOW_POWER_S0 flag  (Rafael J. Wysocki; 1 file, -2/+2)

    Modify the ACPI system sleep support setup code to select
    suspend-to-idle as the default system sleep state if the
    ACPI_FADT_LOW_POWER_S0 flag is set in the FADT and the default sleep
    state was not selected from the kernel command line.

    Signed-off-by: Rafael J. Wysocki <[email protected]>
    Tested-by: Mario Limonciello <[email protected]>

2016-11-21  PM / sleep: System sleep state selection interface rework  (Rafael J. Wysocki; 3 files, -31/+132)

    There are systems in which the platform doesn't support any special
    sleep states, so suspend-to-idle (PM_SUSPEND_FREEZE) is the only
    available system sleep state.  However, some user space frameworks
    only use the "mem" and (sometimes) "standby" sleep state labels, so
    the users of those systems need to modify user space in order to be
    able to use system suspend at all, and that may be a pain in practice.

    Commit 0399d4db3edf (PM / sleep: Introduce command line argument for
    sleep state enumeration) attempted to address this problem by adding a
    command line argument to change the meaning of the "mem" string in
    /sys/power/state to make it trigger suspend-to-idle (instead of
    suspend-to-RAM).

    However, there also are systems in which the platform does support
    special sleep states, but suspend-to-idle is the preferred one anyway
    (it even may save more energy than the platform-provided sleep states
    in some cases), and the above commit doesn't help in those cases.

    For this reason, rework the system sleep state selection interface
    again (but preserve backwards compatibility).  Namely, add a new sysfs
    file, /sys/power/mem_sleep, that will control the system suspend mode
    triggered by writing "mem" to /sys/power/state (in analogy with what
    /sys/power/disk does for hibernation).  Make it select suspend-to-RAM
    ("deep" sleep) by default (if supported) and fall back to
    suspend-to-idle ("s2idle") otherwise, and add a new command line
    argument, mem_sleep_default, allowing that default to be overridden if
    need be.

    At the same time, drop the relative_sleep_states command line argument
    that doesn't make sense any more.

    Signed-off-by: Rafael J. Wysocki <[email protected]>
    Tested-by: Mario Limonciello <[email protected]>

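    A hedged usage sketch (the file layout follows the text above; the
    exact value strings shown are an assumption):

        $ cat /sys/power/mem_sleep
        s2idle [deep]
        $ echo s2idle > /sys/power/mem_sleep
        $ echo mem > /sys/power/state       # now triggers suspend-to-idle
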
2016-11-21  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (Linus Torvalds; 1 file, -23/+47)

    Pull networking fixes from David Miller:

     1) Clear congestion control state when changing algorithms on an
        existing socket, from Florian Westphal.

     2) Fix register bit values in altr_tse_pcs portion of stmmac driver,
        from Jia Jie Ho.

     3) Fix PTP handling in stmmac driver for GMAC4, from Giuseppe
        CAVALLARO.

     4) Fix udplite multicast delivery handling, it ignores the udp_table
        parameter passed into the lookups, from Pablo Neira Ayuso.

     5) Synchronize the space estimated by rtnl_vfinfo_size and the space
        actually used by rtnl_fill_vfinfo.  From Sabrina Dubroca.

     6) Fix memory leak in fib_info when splitting nodes, from Alexander
        Duyck.

     7) If a driver does a napi_hash_del() explicitly and not via
        netif_napi_del(), it must perform RCU synchronization as needed.
        Fix this in virtio-net and bnxt drivers, from Eric Dumazet.

     8) Likewise, it is not necessary to invoke napi_hash_del() if we are
        also doing netif_napi_del() in the same code path.  Remove such
        calls from be2net and cxgb4 drivers, also from Eric Dumazet.

     9) Don't allocate an ID in peernet2id_alloc() if the netns is dead,
        from WANG Cong.

    10) Fix OF node and device struct leaks in of_mdio, from Johan Hovold.

    11) We cannot cache routes in ip6_tunnel when using inherited traffic
        classes, from Paolo Abeni.

    12) Fix several crashes and leaks in cpsw driver, from Johan Hovold.

    13) Splice operations cannot use freezable blocking calls in AF_UNIX,
        from WANG Cong.

    14) Link dump filtering by master device and kind support added an
        error in loop index updates during the dump if we actually do
        filter, fix from Zhang Shengju.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (59 commits)
      tcp: zero ca_priv area when switching cc algorithms
      net: l2tp: Treat NET_XMIT_CN as success in l2tp_eth_dev_xmit
      ethernet: stmmac: make DWMAC_STM32 depend on it's associated SoC
      tipc: eliminate obsolete socket locking policy description
      rtnl: fix the loop index update error in rtnl_dump_ifinfo()
      l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind()
      net: macb: add check for dma mapping error in start_xmit()
      rtnetlink: fix FDB size computation
      netns: fix get_net_ns_by_fd(int pid) typo
      af_unix: conditionally use freezable blocking calls in read
      net: ethernet: ti: cpsw: fix fixed-link phy probe deferral
      net: ethernet: ti: cpsw: add missing sanity check
      net: ethernet: ti: cpsw: fix secondary-emac probe error path
      net: ethernet: ti: cpsw: fix of_node and phydev leaks
      net: ethernet: ti: cpsw: fix deferred probe
      net: ethernet: ti: cpsw: fix mdio device reference leak
      net: ethernet: ti: cpsw: fix bad register access in probe error path
      net: sky2: Fix shutdown crash
      cfg80211: limit scan results cache size
      net sched filters: pass netlink message flags in event notification
      ...

2016-11-21  bpf, mlx5: fix mlx5e_create_rq taking reference on prog  (Daniel Borkmann; 1 file, -0/+1)

    In mlx5e_create_rq(), when creating a new queue, we call bpf_prog_add()
    but without checking the return value.  bpf_prog_add() can fail since
    92117d8443bc ("bpf: fix refcnt overflow"), so we really must check it.
    Take the reference right when we assign it to the rq from
    priv->xdp_prog, and just drop the reference on the error path.
    Destruction in mlx5e_destroy_rq() looks good, though.

    Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support")
    Signed-off-by: Daniel Borkmann <[email protected]>
    Acked-by: Saeed Mahameed <[email protected]>
    Signed-off-by: David S. Miller <[email protected]>

2016-11-21  perf/core: Fix address filter parser  (Alexander Shishkin; 1 file, -0/+2)

    The token table passed into match_token() must be null-terminated,
    which it currently is not in perf's address filter string parser, as
    caught by Vince's perf_fuzzer and KASAN.  It doesn't blow up otherwise
    only because of the alignment padding of the table to the next element
    in the .rodata, which is luck.

    Fix this by adding a null terminator to the token table.

    Reported-by: Vince Weaver <[email protected]>
    Tested-by: Vince Weaver <[email protected]>
    Signed-off-by: Alexander Shishkin <[email protected]>
    Acked-by: Peter Zijlstra (Intel) <[email protected]>
    Cc: Arnaldo Carvalho de Melo <[email protected]>
    Cc: Linus Torvalds <[email protected]>
    Cc: Thomas Gleixner <[email protected]>
    Cc: [email protected]
    Cc: [email protected] # v4.7+
    Fixes: 375637bc524 ("perf/core: Introduce address range filtering")
    Link: http://lkml.kernel.org/r/[email protected]
    Signed-off-by: Ingo Molnar <[email protected]>

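    For illustration, a hypothetical match_token() table with the kind of
    terminator the fix adds (not the perf table itself):

        enum { OPT_FILTER, OPT_START, OPT_ERR = -1 };

        static const match_table_t tokens = {
                { OPT_FILTER,   "filter" },
                { OPT_START,    "start"  },
                { OPT_ERR,      NULL     },  /* terminator: match_token()
                                              * stops at a NULL pattern */
        };
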
2016-11-21  sched/wake_q: Rename WAKE_Q to DEFINE_WAKE_Q  (Waiman Long; 4 files, -11/+11)

    Currently the wake_q data structure is defined by the WAKE_Q() macro.
    This macro, however, looks like a function doing something, as "wake"
    is a verb.  Even checkpatch.pl was confused, reporting warnings like:

        WARNING: Missing a blank line after declarations
        #548: FILE: kernel/futex.c:3665:
        +	int ret;
        +	WAKE_Q(wake_q);

    This patch renames the WAKE_Q() macro to DEFINE_WAKE_Q(), which
    clarifies what the macro is doing and eliminates the checkpatch.pl
    warnings.

    Signed-off-by: Waiman Long <[email protected]>
    Acked-by: Davidlohr Bueso <[email protected]>
    Cc: Linus Torvalds <[email protected]>
    Cc: Paul E. McKenney <[email protected]>
    Cc: Peter Zijlstra <[email protected]>
    Cc: Thomas Gleixner <[email protected]>
    Link: http://lkml.kernel.org/r/[email protected]
    [ Resolved conflict and added missing rename. ]
    Signed-off-by: Ingo Molnar <[email protected]>

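    A before/after sketch of the rename (the wake_q_add()/wake_up_q()
    calls are shown only for context):

        DEFINE_WAKE_Q(wake_q);          /* was: WAKE_Q(wake_q); */

        wake_q_add(&wake_q, task);      /* queue a task for wakeup */
        wake_up_q(&wake_q);             /* wake everything queued */
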
2016-11-20  audit: fix formatting of AUDIT_CONFIG_CHANGE events  (Steve Grubb; 4 files, -10/+6)

    The AUDIT_CONFIG_CHANGE events sometimes use an op= field.  The
    current code logs the value of the field with quotes.  This field is
    documented to not be encoded, so it should not have quotes.

    Signed-off-by: Steve Grubb <[email protected]>
    Reviewed-by: Richard Guy Briggs <[email protected]>
    [PM: reformatted commit description to make checkpatch.pl happy]
    Signed-off-by: Paul Moore <[email protected]>

2016-11-20  audit: skip sessionid sentinel value when auto-incrementing  (Richard Guy Briggs; 1 file, -1/+4)

    The value (unsigned int)-1 is used as a sentinel to indicate the
    sessionid is unset.  Skip this value when the session_id value wraps.

    Signed-off-by: Richard Guy Briggs <[email protected]>
    Signed-off-by: Paul Moore <[email protected]>

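    A minimal sketch of the wrap handling described above (an assumption:
    the real audit code is structured differently):

        /* Never hand out the "unset" sentinel, even after a wrap. */
        do {
                sessionid = (unsigned int)atomic_inc_return(&session_id);
        } while (unlikely(sessionid == (unsigned int)-1));
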
2016-11-18  lockdep: Limit static allocations if PROVE_LOCKING_SMALL is defined  (Babu Moger; 1 file, -3/+17)

    Reduce the size of the data structures for lockdep entries by half if
    PROVE_LOCKING_SMALL is defined.  This is used only for sparc.

    Signed-off-by: Babu Moger <[email protected]>
    Acked-by: Sam Ravnborg <[email protected]>
    Signed-off-by: David S. Miller <[email protected]>

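    The shape of the change, as a hedged sketch (the constant names follow
    lockdep's internals, but treat the exact values as illustrative):

        #ifdef CONFIG_PROVE_LOCKING_SMALL
        #define MAX_LOCKDEP_ENTRIES     16384UL   /* halved for sparc */
        #else
        #define MAX_LOCKDEP_ENTRIES     32768UL
        #endif
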
2016-11-18  netns: make struct pernet_operations::id unsigned int  (Alexey Dobriyan; 1 file, -1/+1)

    Make struct pernet_operations::id unsigned.  There are two reasons to
    do so:

    1) This field is really an index into a zero-based array and thus is
       an unsigned entity.  Using a negative value is an out-of-bound
       access by definition.

    2) On x86_64, unsigned 32-bit data which are mixed with pointers via
       array indexing, or via offsets added to or subtracted from
       pointers, are preferred to signed 32-bit data.  An "int" being used
       as an array index needs to be sign-extended to 64-bit before being
       used:

           void f(long *p, int i)
           {
               g(p[i]);
           }

       roughly translates to

           movsx rsi, esi
           mov   rdi, [rsi+...]
           call  g

       MOVSX is a 3-byte instruction which isn't necessary if the variable
       is unsigned, because x86_64 zero-extends by default.

    Now, there is the net_generic() function which, you guessed it right,
    uses "int" as an array index:

        static inline void *net_generic(const struct net *net, int id)
        {
            ...
            ptr = ng->ptr[id - 1];
            ...
        }

    And this function is used a lot, so those sign extensions add up.  The
    patch shaves ~1730 bytes off an allyesconfig kernel (without all the
    junk messing with code generation):

        add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)

    Unfortunately some functions actually grow bigger.  This is a
    seemingly random artefact of code generation, with the register
    allocator being used differently: gcc decides that some variable needs
    to live in new r8+ registers, and every access now requires a REX
    prefix; or it is shifted into r12, so the [r12+0] addressing mode has
    to be used, which is longer than [r8].  However, the overall balance
    is in the negative direction:

        add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
        function                                     old     new   delta
        nfsd4_lock                                  3886    3959     +73
        tipc_link_build_proto_msg                   1096    1140     +44
        mac80211_hwsim_new_radio                    2776    2808     +32
        tipc_mon_rcv                                1032    1058     +26
        svcauth_gss_legacy_init                     1413    1429     +16
        tipc_bcbase_select_primary                   379     392     +13
        nfsd4_exchange_id                           1247    1260     +13
        nfsd4_setclientid_confirm                    782     793     +11
        ...
        put_client_renew_locked                      494     480     -14
        ip_set_sockfn_get                            730     716     -14
        geneve_sock_add                              829     813     -16
        nfsd4_sequence_done                          721     703     -18
        nlmclnt_lookup_host                          708     686     -22
        nfsd4_lockt                                 1085    1063     -22
        nfs_get_client                              1077    1050     -27
        tcf_bpf_init                                1106    1076     -30
        nfsd4_encode_fattr                          5997    5930     -67
        Total: Before=154856051, After=154854321, chg -0.00%

    Signed-off-by: Alexey Dobriyan <[email protected]>
    Signed-off-by: David S. Miller <[email protected]>

2016-11-17  kernel/printk: Block cpuhotplug callback when tasks are frozen  (Thomas Gleixner; 1 file, -2/+4)

    The recent conversion of the console hotplug notifier to the state
    machine missed the fact that the notifier only operated on the
    non-frozen transitions.  As a consequence the console_lock/unlock()
    pair is also invoked during suspend, which results in a lockdep
    warning.  Restore the previous behavior by making the lock/unlock
    conditional on !tasks_frozen.

    Fixes: 90b14889d2f9 ("kernel/printk: Convert to hotplug state machine")
    Reported-and-tested-by: Borislav Petkov <[email protected]>
    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1611171729320.3645@nanos
    Signed-off-by: Thomas Gleixner <[email protected]>
    Cc: Peter Zijlstra <[email protected]>
    Cc: Sebastian Andrzej Siewior <[email protected]>

2016-11-17  Merge branch 'x86/cpufeature' into x86/asm, to pick up dependency  (Ingo Molnar; 4 files, -22/+11)

    Signed-off-by: Ingo Molnar <[email protected]>

2016-11-16  cpufreq: schedutil: irq-work and mutex are only used in slow path  (Viresh Kumar; 1 file, -5/+8)

    Execute the irq-work specific initialization/exit code only when the
    fast path isn't available.

    Signed-off-by: Viresh Kumar <[email protected]>
    Signed-off-by: Rafael J. Wysocki <[email protected]>

2016-11-16  cpufreq: schedutil: move slow path from workqueue to SCHED_FIFO task  (Viresh Kumar; 1 file, -7/+78)

    If slow path frequency changes are conducted in a SCHED_OTHER context,
    then they may be delayed for some amount of time, including
    indefinitely, when real time or deadline activity is taking place.

    Move the slow path to a real time kernel thread.  In the future the
    thread should be made SCHED_DEADLINE.  The RT priority is arbitrarily
    set to 50 for now.

    Hackbench results on an ARM Exynos dual-core A15 platform for 10
    iterations:

        $ hackbench -s 100 -l 100 -g 10 -f 20

        Before    After
        ---------------
        1.808     1.603
        1.847     1.251
        2.229     1.590
        1.952     1.600
        1.947     1.257
        1.925     1.627
        2.694     1.620
        1.258     1.621
        1.919     1.632
        1.250     1.240

        Average:  1.8829  1.5041

    Based on initial work by Steve Muckle.

    Signed-off-by: Steve Muckle <[email protected]>
    Signed-off-by: Viresh Kumar <[email protected]>
    Acked-by: Peter Zijlstra (Intel) <[email protected]>
    Signed-off-by: Rafael J. Wysocki <[email protected]>

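    A hedged sketch of the kthread setup described above (sugov_work and
    the thread-name format are assumptions):

        struct sched_param param = { .sched_priority = 50 };
        struct task_struct *thread;

        /* Run the slow-path frequency updates from an RT kthread so
         * SCHED_OTHER starvation cannot delay them indefinitely. */
        thread = kthread_create(sugov_work, sg_policy, "sugov:%d",
                                policy->cpu);
        if (!IS_ERR(thread))
                sched_setscheduler_nocheck(thread, SCHED_FIFO, &param);
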
2016-11-16  cpufreq: schedutil: enable fast switch earlier  (Viresh Kumar; 1 file, -6/+11)

    The fast_switch_enabled flag will be used by both sugov_policy_alloc()
    and sugov_policy_free() in a later patch.  Prepare for that by moving
    the calls that enable and disable it to the beginning of sugov_init()
    and the end of sugov_exit().

    Signed-off-by: Viresh Kumar <[email protected]>
    Signed-off-by: Rafael J. Wysocki <[email protected]>

2016-11-16  cpufreq: schedutil: Avoid indented labels  (Viresh Kumar; 1 file, -3/+3)

    Switch to the more common practice of writing goto labels without
    indentation.

    Suggested-by: Peter Zijlstra <[email protected]>
    Signed-off-by: Viresh Kumar <[email protected]>
    Signed-off-by: Rafael J. Wysocki <[email protected]>

2016-11-16  bpf: fix range arithmetic for bpf map access  (Josef Bacik; 1 file, -23/+47)

    I made some invalid assumptions with BPF_AND and BPF_MOD that could
    result in invalid accesses to bpf map entries.  Fix this up by doing a
    few things:

    1) Kill BPF_MOD support.  This doesn't actually get used by the
       compiler in real life and just adds extra complexity.

    2) Fix the logic for BPF_AND: don't allow AND of negative numbers, and
       set the minimum value to 0 for positive ANDs.

    3) Don't do operations on the ranges if they are set to the limits, as
       they are by definition undefined, and allowing arithmetic
       operations on those values could make them appear valid when they
       really aren't.

    This fixes the testcase provided by Jann as well as a few other
    theoretical problems.

    Reported-by: Jann Horn <[email protected]>
    Signed-off-by: Josef Bacik <[email protected]>
    Acked-by: Alexei Starovoitov <[email protected]>
    Signed-off-by: David S. Miller <[email protected]>

2016-11-16  genirq/affinity: Use default affinity mask for reserved vectors  (Thomas Gleixner; 1 file, -2/+2)

    The reserved vectors at the beginning and the end of the vector space
    get cpu_possible_mask assigned as their affinity mask.  All other
    non-auto-affine interrupts get the default irq affinity mask assigned.
    Using cpu_possible_mask breaks that rule.  Treat the reserved vectors
    like any other interrupt and use irq_default_affinity as their target
    mask.

    Signed-off-by: Thomas Gleixner <[email protected]>
    Cc: Christoph Hellwig <[email protected]>

2016-11-16  genirq/affinity: Take reserved vectors into account when spreading irqs  (Christoph Hellwig; 1 file, -3/+5)

    The recent addition of reserved vectors at the beginning or the end of
    the vector space did not take the reserved vectors at the beginning
    into account in the various loop exit conditions.  As a consequence
    the last vectors of the spread area are not included in the spread
    algorithm; they are treated like the reserved vectors at the end of
    the vector space and get the default affinity mask assigned.

    Sum up the affinity vectors and the reserved vectors at the beginning,
    and use the sum as the exit condition.

    [ tglx: Fixed all conditions instead of only one and massaged changelog ]

    Signed-off-by: Christoph Hellwig <[email protected]>
    Link: http://lkml.kernel.org/r/[email protected]
    Signed-off-by: Thomas Gleixner <[email protected]>

2016-11-16  bpf: Fix compilation warning in __bpf_lru_list_rotate_inactive  (Martin KaFai Lau; 1 file, -1/+1)

    gcc-6.2.1 gives the following warning:

        kernel/bpf/bpf_lru_list.c: In function '__bpf_lru_list_rotate_inactive.isra.3':
        kernel/bpf/bpf_lru_list.c:201:28: warning: 'next' may be used
        uninitialized in this function [-Wmaybe-uninitialized]

    "next" is currently only assigned inside a while() loop that always
    runs at least once, but the compiler cannot prove that.  This patch
    initializes "next" up front to get rid of the warning.

    Fixes: 3a08c2fd7634 ("bpf: LRU List")
    Reported-by: David Miller <[email protected]>
    Signed-off-by: Martin KaFai Lau <[email protected]>
    Acked-by: Alexei Starovoitov <[email protected]>
    Signed-off-by: David S. Miller <[email protected]>

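    A self-contained illustration of the warning pattern (hypothetical
    code, not the bpf_lru_list.c function):

        int use(int v);

        int demo(int n, const int *items)
        {
                int next;   /* gcc: 'next' may be used uninitialized */
                int i;

                for (i = 0; i < n; i++)
                        next = items[i];   /* only assigned if n >= 1 */

                return use(next);  /* warns: gcc can't prove n >= 1 */
        }

    The fix is to give such a variable a safe initial value before the
    loop.
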
2016-11-16  sched/fair: Fix task group initialization  (Vincent Guittot; 1 file, -1/+1)

    Task moves are now propagated down to the root, and the utilization of
    a cfs_rq reflects reality, so it doesn't need to be estimated at init
    time.

    Signed-off-by: Vincent Guittot <[email protected]>
    Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
    Acked-by: Dietmar Eggemann <[email protected]>
    Cc: Linus Torvalds <[email protected]>
    Cc: [email protected]
    Cc: Peter Zijlstra <[email protected]>
    Cc: Thomas Gleixner <[email protected]>
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Link: http://lkml.kernel.org/r/[email protected]
    Signed-off-by: Ingo Molnar <[email protected]>

2016-11-16  sched/fair: Propagate asynchrous detach  (Vincent Guittot; 1 file, -0/+6)

    A task can be asynchronously detached from a cfs_rq when migrating
    between CPUs.  The load of the migrated task is then removed from the
    source cfs_rq during its next update.  We use this event to set the
    propagation flag.

    During load balance, we take advantage of the update of the blocked
    load to propagate any pending changes.  The propagation relies on the
    patch:

        "sched: Fix hierarchical order in rq->leaf_cfs_rq_list"

    ... which orders children and parents, to ensure that it's all done in
    one pass.

    Signed-off-by: Vincent Guittot <[email protected]>
    Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
    Acked-by: Dietmar Eggemann <[email protected]>
    Cc: Linus Torvalds <[email protected]>
    Cc: [email protected]
    Cc: Peter Zijlstra <[email protected]>
    Cc: Thomas Gleixner <[email protected]>
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Link: http://lkml.kernel.org/r/[email protected]
    Signed-off-by: Ingo Molnar <[email protected]>

2016-11-16  sched/fair: Propagate load during synchronous attach/detach  (Vincent Guittot; 2 files, -1/+188)

    When a task moves from/to a cfs_rq, we set a flag which is then used
    to propagate the change at the parent level (sched_entity and cfs_rq)
    during the next update.  If the cfs_rq is throttled, the flag will
    stay pending until the cfs_rq is unthrottled.

    For propagating the utilization, we copy the utilization of the group
    cfs_rq to the sched_entity.

    For propagating the load, we have to take into account the load of the
    whole task group in order to evaluate the load of the sched_entity.
    Similarly to what was done before the rewrite of PELT, we add a
    correction factor in case the task group's load is greater than its
    share, so it will contribute the same load as a task of equal weight.

    Signed-off-by: Vincent Guittot <[email protected]>
    Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
    Acked-by: Dietmar Eggemann <[email protected]>
    Cc: Linus Torvalds <[email protected]>
    Cc: [email protected]
    Cc: Peter Zijlstra <[email protected]>
    Cc: Thomas Gleixner <[email protected]>
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Link: http://lkml.kernel.org/r/[email protected]
    Signed-off-by: Ingo Molnar <[email protected]>

2016-11-16  sched/fair: Factorize PELT update  (Vincent Guittot; 1 file, -51/+25)

    Every time we modify the load/utilization of a sched_entity, we have
    to sync it with its cfs_rq.  This update is done in different ways:

     - when attaching/detaching a sched_entity, we update the cfs_rq and
       then we sync the entity with the cfs_rq;

     - when enqueueing/dequeuing the sched_entity, we update both the
       sched_entity and cfs_rq metrics to now.

    Use update_load_avg() every time we have to update and sync the cfs_rq
    and sched_entity before changing the state of a sched_entity.

    Signed-off-by: Vincent Guittot <[email protected]>
    Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
    Acked-by: Dietmar Eggemann <[email protected]>
    Cc: Linus Torvalds <[email protected]>
    Cc: [email protected]
    Cc: Peter Zijlstra <[email protected]>
    Cc: Thomas Gleixner <[email protected]>
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Link: http://lkml.kernel.org/r/[email protected]
    Signed-off-by: Ingo Molnar <[email protected]>

2016-11-16  sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list  (Vincent Guittot; 3 files, -7/+49)

    Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that a
    child will always be called before its parent.

    The hierarchical order in the shares update list was introduced by
    commit:

        67e86250f8ea ("sched: Introduce hierarchal order on shares update list")

    With the current implementation a child can still be put after its
    parent.  Let's take the example of:

            root
              \
               b
              / \
             c   d*
                 |
                 e*

    with root -> b -> c already enqueued but not d -> e, so the
    leaf_cfs_rq_list looks like:

        head -> c -> b -> root -> tail

    The branch d -> e will be added the first time that they are enqueued,
    starting with e then d.  When e is added, its parent is not already on
    the list, so e is put at the tail:

        head -> c -> b -> root -> e -> tail

    Then d is added at the head, because its parent is already on the
    list:

        head -> d -> c -> b -> root -> e -> tail

    e is not placed at the right position and will be called last, whereas
    it should be called at the beginning.

    Because we follow the bottom-up enqueue sequence, we are sure that we
    will finish by adding either a cfs_rq without a parent or a cfs_rq
    whose parent is already on the list.  We can use this event to detect
    when we have finished adding a new branch.  For the others, whose
    parents are not already added, we have to ensure that they will be
    added after their children, which have just been inserted in the steps
    before, and after any potential parents that are already in the list.
    The easiest way is to put the cfs_rq just after the last inserted one
    and to keep track of it until the branch is fully added.

    Signed-off-by: Vincent Guittot <[email protected]>
    Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
    Acked-by: Dietmar Eggemann <[email protected]>
    Cc: Linus Torvalds <[email protected]>
    Cc: [email protected]
    Cc: Peter Zijlstra <[email protected]>
    Cc: Thomas Gleixner <[email protected]>
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Link: http://lkml.kernel.org/r/[email protected]
    Signed-off-by: Ingo Molnar <[email protected]>

2016-11-16  sched/fair: Factorize attach/detach entity  (Vincent Guittot; 1 file, -22/+31)

    Factorize post_init_entity_util_avg() and part of attach_task_cfs_rq()
    into one function, attach_entity_cfs_rq(), and create the symmetric
    detach_entity_cfs_rq() function.

    Signed-off-by: Vincent Guittot <[email protected]>
    Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
    Acked-by: Dietmar Eggemann <[email protected]>
    Cc: Linus Torvalds <[email protected]>
    Cc: [email protected]
    Cc: Peter Zijlstra <[email protected]>
    Cc: Thomas Gleixner <[email protected]>
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Link: http://lkml.kernel.org/r/[email protected]
    Signed-off-by: Ingo Molnar <[email protected]>

2016-11-16  sched/fair: Fix incorrect comment for capacity_margin  (Morten Rasmussen; 1 file, -1/+1)

    The comment for capacity_margin introduced in:

        3273163c6775 ("sched/fair: Let asymmetric CPU configurations balance at wake-up")

    ... got its usage the wrong way round - fix it.

    Signed-off-by: Morten Rasmussen <[email protected]>
    Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
    Cc: Linus Torvalds <[email protected]>
    Cc: Peter Zijlstra <[email protected]>
    Cc: Thomas Gleixner <[email protected]>
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Link: http://lkml.kernel.org/r/[email protected]
    Signed-off-by: Ingo Molnar <[email protected]>

2016-11-16  sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups  (Morten Rasmussen; 1 file, -0/+25)

    For asymmetric CPU capacity systems it is counter-productive for
    throughput if low capacity CPUs are pulling tasks from non-overloaded
    CPUs with higher capacity.  The assumption is that higher CPU capacity
    is preferred over running alone in a group with lower CPU capacity.

    This patch rejects higher CPU capacity groups with one or fewer tasks
    per CPU as potential busiest groups, which could otherwise lead to a
    series of failing load-balancing attempts culminating in a
    force-migration.

    Signed-off-by: Morten Rasmussen <[email protected]>
    Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
    Cc: Linus Torvalds <[email protected]>
    Cc: Peter Zijlstra <[email protected]>
    Cc: Thomas Gleixner <[email protected]>
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Link: http://lkml.kernel.org/r/[email protected]
    Signed-off-by: Ingo Molnar <[email protected]>

2016-11-16  sched/fair: Add per-CPU min capacity to sched_group_capacity  (Morten Rasmussen; 3 files, -7/+16)

    struct sched_group_capacity currently represents the compute capacity
    sum of all CPUs in the sched_group.  Unless it is divided by the
    group_weight to get the average capacity per CPU, it hides differences
    in CPU capacity for mixed capacity systems (e.g. high RT/IRQ
    utilization or ARM big.LITTLE).  But even the average may not be
    sufficient if the group covers CPUs of different capacities.

    Instead, by extending struct sched_group_capacity to indicate the min
    per-CPU capacity in the group, a suitable group for a given task
    utilization can more easily be found, such that CPUs with reduced
    capacity can be avoided for tasks with high utilization (not
    implemented by this patch).

    Signed-off-by: Morten Rasmussen <[email protected]>
    Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
    Cc: Linus Torvalds <[email protected]>
    Cc: Peter Zijlstra <[email protected]>
    Cc: Thomas Gleixner <[email protected]>
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Link: http://lkml.kernel.org/r/[email protected]
    Signed-off-by: Ingo Molnar <[email protected]>

2016-11-16  sched/fair: Consider spare capacity in find_idlest_group()  (Morten Rasmussen; 1 file, -5/+45)

    In low-utilization scenarios, comparing relative loads in
    find_idlest_group() doesn't always lead to the most optimal choice.
    Systems with groups containing different numbers of cpus and/or cpus
    of different compute capacity are significantly better off when
    considering spare capacity rather than relative load in those
    scenarios.

    In addition to the existing load-based search, an alternative spare
    capacity based candidate sched_group is found and selected instead if
    sufficient spare capacity exists.  If not, the existing behaviour is
    preserved.

    Signed-off-by: Morten Rasmussen <[email protected]>
    Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
    Cc: Linus Torvalds <[email protected]>
    Cc: Peter Zijlstra <[email protected]>
    Cc: Thomas Gleixner <[email protected]>
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Link: http://lkml.kernel.org/r/[email protected]
    Signed-off-by: Ingo Molnar <[email protected]>

2016-11-16  sched/fair: Compute task/cpu utilization at wake-up correctly  (Morten Rasmussen; 1 file, -4/+35)

    At task wake-up, load-tracking isn't updated until the task is
    enqueued.  The task's own view of its utilization contribution may
    therefore not be aligned with its contribution to the cfs_rq
    load-tracking, which may have been updated in the meantime.
    Basically, the task's own utilization hasn't yet accounted for the
    sleep decay, while the cfs_rq may have (partially).  Estimating the
    cfs_rq utilization in case the task is migrated at wake-up as

        task_rq(p)->cfs.avg.util_avg - p->se.avg.util_avg

    is therefore incorrect, as the two load-tracking signals aren't time
    synchronized (different last update).

    To solve this problem, this patch synchronizes the task utilization
    with its previous rq before the task utilization is used in the
    wake-up path.  Currently the update/synchronization is done _after_
    the task has been placed by select_task_rq_fair().  The
    synchronization is done without having to take the rq lock, using the
    existing mechanism from remove_entity_load_avg().

    Signed-off-by: Morten Rasmussen <[email protected]>
    Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
    Cc: Linus Torvalds <[email protected]>
    Cc: Peter Zijlstra <[email protected]>
    Cc: Thomas Gleixner <[email protected]>
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Link: http://lkml.kernel.org/r/[email protected]
    Signed-off-by: Ingo Molnar <[email protected]>

2016-11-16  sched/cpuacct: Avoid %lld seq_printf warning  (Martin Schwidefsky; 1 file, -1/+1)

    For s390 kernel builds I keep getting this warning:

        kernel/sched/cpuacct.c: In function 'cpuacct_stats_show':
        kernel/sched/cpuacct.c:298:25: warning: format '%lld' expects
        argument of type 'long long int', but argument 4 has type
        'clock_t {aka long int}' [-Wformat=]
           seq_printf(sf, "%s %lld\n",

    Silence the warning by adding an explicit cast.

    Signed-off-by: Martin Schwidefsky <[email protected]>
    Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
    Cc: Linus Torvalds <[email protected]>
    Cc: Peter Zijlstra <[email protected]>
    Cc: Thomas Gleixner <[email protected]>
    Link: http://lkml.kernel.org/r/[email protected]
    Signed-off-by: Ingo Molnar <[email protected]>

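    The shape of the fix, hedged (the arguments are paraphrased from the
    warning above, not the exact source):

        /* clock_t is 'long' on s390, so "%lld" needs an explicit cast. */
        seq_printf(sf, "%s %lld\n", cpuacct_stat_desc[stat],
                   (long long)val);
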
2016-11-16  Merge branch 'linus' into sched/core, to pick up fixes  (Ingo Molnar; 7 files, -30/+39)

    Signed-off-by: Ingo Molnar <[email protected]>

2016-11-16  locking/core: Remove cpu_relax_lowlatency() users  (Christian Borntraeger; 5 files, -12/+12)

    With the s390 special case of a yielding cpu_relax() implementation
    gone, we can now remove all users of cpu_relax_lowlatency() and
    replace them with cpu_relax().

    Signed-off-by: Christian Borntraeger <[email protected]>
    Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
    Cc: Catalin Marinas <[email protected]>
    Cc: Heiko Carstens <[email protected]>
    Cc: Linus Torvalds <[email protected]>
    Cc: Martin Schwidefsky <[email protected]>
    Cc: Nicholas Piggin <[email protected]>
    Cc: Noam Camus <[email protected]>
    Cc: Peter Zijlstra <[email protected]>
    Cc: Russell King <[email protected]>
    Cc: Thomas Gleixner <[email protected]>
    Cc: Will Deacon <[email protected]>
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Link: http://lkml.kernel.org/r/[email protected]
    Signed-off-by: Ingo Molnar <[email protected]>

2016-11-16  locking/core, stop_machine: Yield the CPU during stop machine()  (Christian Borntraeger; 1 file, -1/+1)

    Some time ago the following commit:

        57f2ffe14fd125c2 ("s390: remove diag 44 calls from cpu_relax()")

    ... stopped cpu_relax() on s390 yielding to the hypervisor.

    As it turns out, this made stop_machine() run really slowly on
    virtualized, overcommitted systems.  For example the kprobes test
    during bootup took several seconds, instead of just running unnoticed,
    with large guests.

    Therefore, yielding was reintroduced with commit:

        4d92f50249eb ("s390: reintroduce diag 44 calls for cpu_relax()")

    ... but in fact the stop machine code seems to be the only place where
    this yielding was really necessary.  This place is probably the most
    important one, as it makes all but one guest CPUs wait for one guest
    CPU.

    As we now have cpu_relax_yield(), we can use it in multi_cpu_stop().
    For now let's only add it here.  We can add it later in other places
    when necessary.

    Signed-off-by: Christian Borntraeger <[email protected]>
    Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
    Cc: Catalin Marinas <[email protected]>
    Cc: Heiko Carstens <[email protected]>
    Cc: Linus Torvalds <[email protected]>
    Cc: Martin Schwidefsky <[email protected]>
    Cc: Nicholas Piggin <[email protected]>
    Cc: Noam Camus <[email protected]>
    Cc: Peter Zijlstra <[email protected]>
    Cc: Russell King <[email protected]>
    Cc: Thomas Gleixner <[email protected]>
    Cc: Will Deacon <[email protected]>
    Cc: [email protected]
    Cc: [email protected]
    Cc: [email protected]
    Link: http://lkml.kernel.org/r/[email protected]
    Signed-off-by: Ingo Molnar <[email protected]>

2016-11-16  posix-timers: Make them configurable  (Nicolas Pitre; 9 files, -8/+164)

    Some embedded systems have no use for them.  This removes about 25KB
    from the kernel binary size when they are configured out.

    The corresponding syscalls are routed to a stub logging the attempt to
    use those syscalls, which should be enough of a clue if they were
    disabled without proper consideration.  They are: timer_create,
    timer_gettime, timer_getoverrun, timer_settime, timer_delete,
    clock_adjtime, setitimer, getitimer, alarm.

    The clock_settime, clock_gettime, clock_getres and clock_nanosleep
    syscalls are replaced by simple wrappers compatible with
    CLOCK_REALTIME, CLOCK_MONOTONIC and CLOCK_BOOTTIME only, which should
    cover the vast majority of use cases with very little code.

    Signed-off-by: Nicolas Pitre <[email protected]>
    Acked-by: Richard Cochran <[email protected]>
    Acked-by: Thomas Gleixner <[email protected]>
    Acked-by: John Stultz <[email protected]>
    Reviewed-by: Josh Triplett <[email protected]>
    Cc: Paul Bolle <[email protected]>
    Cc: [email protected]
    Cc: [email protected]
    Cc: Michal Marek <[email protected]>
    Cc: Edward Cree <[email protected]>
    Link: http://lkml.kernel.org/r/[email protected]
    Signed-off-by: Thomas Gleixner <[email protected]>

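    A hedged sketch of what such a stub can look like (an assumption: the
    real stubs share a helper and differ in detail):

        SYSCALL_DEFINE2(timer_gettime, timer_t, timer_id,
                        struct itimerspec __user *, setting)
        {
                /* Log once so a misconfigured kernel leaves a clue. */
                pr_err_once("process %d (%s) attempted a POSIX timer "
                            "syscall while posix timers are disabled\n",
                            task_tgid_vnr(current), current->comm);
                return -ENOSYS;
        }
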
2016-11-16  posix_cpu_timers: Move the add_device_randomness() call to a proper place  (Nicolas Pitre; 2 files, -4/+4)

    There is no logical relation between add_device_randomness() and
    posix_cpu_timers_exit().  Let's move the former to where the latter is
    called.  This way, when posix-cpu-timers.c is compiled out, there is
    no need to worry about losing a call to add_device_randomness().

    Signed-off-by: Nicolas Pitre <[email protected]>
    Acked-by: John Stultz <[email protected]>
    Cc: Paul Bolle <[email protected]>
    Cc: [email protected]
    Cc: [email protected]
    Cc: Richard Cochran <[email protected]>
    Cc: Josh Triplett <[email protected]>
    Cc: Michal Marek <[email protected]>
    Cc: Edward Cree <[email protected]>
    Link: http://lkml.kernel.org/r/[email protected]
    Signed-off-by: Thomas Gleixner <[email protected]>

2016-11-16  timer: Move sys_alarm from timer.c to itimer.c  (Nicolas Pitre; 2 files, -14/+14)

    Move the only user of alarm_setitimer to itimer.c, where it is
    defined.  This allows making alarm_setitimer static, and dropping it
    from the build when __ARCH_WANT_SYS_ALARM is not defined.

    Signed-off-by: Nicolas Pitre <[email protected]>
    Acked-by: John Stultz <[email protected]>
    Cc: Paul Bolle <[email protected]>
    Cc: [email protected]
    Cc: [email protected]
    Cc: Richard Cochran <[email protected]>
    Cc: Josh Triplett <[email protected]>
    Cc: Michal Marek <[email protected]>
    Cc: Edward Cree <[email protected]>
    Link: http://lkml.kernel.org/r/[email protected]
    Signed-off-by: Thomas Gleixner <[email protected]>

2016-11-15  ftrace: Provide API to use global filtering for ftrace ops  (Joel Fernandes; 1 file, -0/+17)

    Currently the global_ops filtering hash is not available to outside
    users registering for function tracing.  Provide an API for those
    users to be able to choose global filtering.

    This is in preparation for pstore's ftrace feature to be able to use
    the global filters.

    Suggested-by: Steven Rostedt <[email protected]>
    Cc: Anton Vorontsov <[email protected]>
    Cc: Colin Cross <[email protected]>
    Cc: Kees Cook <[email protected]>
    Cc: Tony Luck <[email protected]>
    Signed-off-by: Joel Fernandes <[email protected]>
    Acked-by: Steven Rostedt <[email protected]>
    Signed-off-by: Kees Cook <[email protected]>

2016-11-15  tracing: Add new trace_marker_raw  (Steven Rostedt; 4 files, -31/+181)

    A new file is created:

        /sys/kernel/debug/tracing/trace_marker_raw

    This allows applications to create data structures and write the
    binary data directly into it, and then read the trace data out from
    trace_pipe_raw into the same type of data structure.  This saves on
    converting numbers into ASCII, as would be required by trace_marker.

    Suggested-by: Olof Johansson <[email protected]>
    Signed-off-by: Steven Rostedt <[email protected]>

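    A hedged userspace usage sketch (the struct is hypothetical; the file
    path is from the text above):

        #include <fcntl.h>
        #include <stdint.h>
        #include <unistd.h>

        struct my_event {       /* any fixed-layout binary record */
                uint32_t id;
                uint64_t value;
        };

        int main(void)
        {
                struct my_event e = { .id = 1, .value = 42 };
                int fd = open("/sys/kernel/debug/tracing/trace_marker_raw",
                              O_WRONLY);

                if (fd < 0)
                        return 1;
                /* Raw bytes go straight into the ring buffer; no ASCII
                 * conversion as with trace_marker. */
                write(fd, &e, sizeof(e));
                close(fd);
                return 0;
        }
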
2016-11-15  bpf: Add BPF_MAP_TYPE_LRU_PERCPU_HASH  (Martin KaFai Lau; 2 files, -8/+129)

    Provide an LRU version of the existing BPF_MAP_TYPE_PERCPU_HASH.

    Signed-off-by: Martin KaFai Lau <[email protected]>
    Acked-by: Alexei Starovoitov <[email protected]>
    Signed-off-by: David S. Miller <[email protected]>

2016-11-15  bpf: Add BPF_MAP_TYPE_LRU_HASH  (Martin KaFai Lau; 1 file, -14/+252)

    Provide an LRU version of the existing BPF_MAP_TYPE_HASH.

    Signed-off-by: Martin KaFai Lau <[email protected]>
    Acked-by: Alexei Starovoitov <[email protected]>
    Signed-off-by: David S. Miller <[email protected]>