path: root/kernel
2020-03-06  sched/rt: Remove unnecessary push for unfit tasks  [Qais Yousef, 1 file, -5/+2]
In task_woken_rt() and switched_to_rto() we try to trigger push-pull if the task is unfit. But the logic is found lacking because if the task was the only one running on the CPU, then rt_rq is not in an overloaded state and won't trigger a push.

The necessity of this logic was under debate as well; a summary of the discussion can be found in the following thread:

  https://lore.kernel.org/lkml/[email protected]/

Remove the logic for now until a better approach is agreed upon.

Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Fixes: 804d402fb6f6 ("sched/rt: Make RT capacity-aware")
Link: https://lkml.kernel.org/r/[email protected]
2020-03-06  sched/rt: Allow pulling unfitting task  [Qais Yousef, 1 file, -2/+1]
When RT Capacity Awareness was implemented, the logic was done such that if a task was running on a fitting CPU, then it was sticky and we would try our best to keep it there.

But as Steve suggested, to adhere to the strict priority rules of the RT class, allow pulling an RT task to an unfitting CPU to ensure it gets a chance to run ASAP.

LINK: https://lore.kernel.org/lkml/[email protected]/
Suggested-by: Steven Rostedt <[email protected]>
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Fixes: 804d402fb6f6 ("sched/rt: Make RT capacity-aware")
Link: https://lkml.kernel.org/r/[email protected]
2020-03-06  sched/rt: Optimize cpupri_find() on non-heterogeneous systems  [Qais Yousef, 3 files, -8/+31]
Introduce a new cpupri_find_fitness() function that takes the fitness_fn as an argument and is only called when the asym_system static key is enabled. cpupri_find() is now a wrapper function that calls cpupri_find_fitness() passing NULL as the fitness_fn, hence disabling the logic that handles fitness by default.

LINK: https://lore.kernel.org/lkml/[email protected]/
Reported-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Fixes: 804d402fb6f6 ("sched/rt: Make RT capacity-aware")
Link: https://lkml.kernel.org/r/[email protected]
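A minimal sketch of the wrapper pattern described above (illustrative only; the argument lists are simplified and do not claim to match the exact kernel signatures):

  typedef bool (*fitness_fn_t)(struct task_struct *p, int cpu);

  /* Full search; a non-NULL fitness_fn filters candidate CPUs. */
  int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p,
                          struct cpumask *lowest_mask,
                          fitness_fn_t fitness_fn);

  /* Wrapper: passing NULL keeps the old, fitness-free behavior. */
  int cpupri_find(struct cpupri *cp, struct task_struct *p,
                  struct cpumask *lowest_mask)
  {
          return cpupri_find_fitness(cp, p, lowest_mask, NULL);
  }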
2020-03-06  sched/rt: Re-instate old behavior in select_task_rq_rt()  [Qais Yousef, 1 file, -0/+9]
When RT Capacity Awareness support was added, the logic in select_task_rq_rt() was modified to force a search for a fitting CPU if the task currently doesn't run on one.

But if the search failed, and the search was only triggered to fulfill the fitness request, we could end up selecting a new CPU unnecessarily. Fix this and re-instate the original behavior by ensuring we bail out in that case.

This behavior change only affected asymmetric systems that are using util_clamp to implement capacity awareness. Non-asymmetric systems weren't affected.

LINK: https://lore.kernel.org/lkml/[email protected]/
Reported-by: Pavan Kondeti <[email protected]>
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Fixes: 804d402fb6f6 ("sched/rt: Make RT capacity-aware")
Link: https://lkml.kernel.org/r/[email protected]
2020-03-06  sched/rt: cpupri_find: Implement fallback mechanism for !fit case  [Qais Yousef, 1 file, -56/+101]
When searching for the best lowest_mask with a fitness_fn passed, make sure we record the lowest_level that returns a valid lowest_mask so that we can use that as a fallback in case we fail to find a fitting CPU at all levels.

The intention in the original patch was not to allow a down migration to an unfitting CPU. But this missed the case where we are already running on an unfitting one.

With this change RT tasks can still move between unfitting CPUs when they're already running on such a CPU.

And as Steve suggested, to adhere to the strict priority rules of RT: if a task is already running on a fitting CPU but due to priority it can't run on it, allow it to migrate down to an unfitting CPU so it can run.

Reported-by: Pavan Kondeti <[email protected]>
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Fixes: 804d402fb6f6 ("sched/rt: Make RT capacity-aware")
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/lkml/20200203142712.a7yvlyo2y3le5cpn@e107158-lin/
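A sketch of that fallback loop (helper and variable names like __cpupri_find() and best_unfit_idx are illustrative, drawn from the description above rather than a verbatim copy of the patch):

  int idx, cpu, best_unfit_idx = -1;

  for (idx = 0; idx < task_pri; idx++) {
          if (!__cpupri_find(cp, p, lowest_mask, idx))
                  continue;

          if (!lowest_mask || !fitness_fn)
                  return 1;

          /* Remember the first level with a valid mask as a fallback. */
          if (best_unfit_idx == -1)
                  best_unfit_idx = idx;

          /* Drop CPUs that do not fit the task's capacity needs. */
          for_each_cpu(cpu, lowest_mask)
                  if (!fitness_fn(p, cpu))
                          cpumask_clear_cpu(cpu, lowest_mask);

          if (!cpumask_empty(lowest_mask))
                  return 1;
  }

  /* No fitting CPU at any level: fall back to the best unfit level. */
  if (best_unfit_idx != -1)
          return __cpupri_find(cp, p, lowest_mask, best_unfit_idx);

  return 0;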
2020-03-06  sched/fair: Fix reordering of enqueue/dequeue_task_fair()  [Vincent Guittot, 1 file, -8/+9]
Even when a cgroup is throttled, the group se of a child cgroup can still be enqueued and its gse->on_rq stays true. When a task is enqueued on such a child, we still have to update the load_avg and increase h_nr_running of the throttled cfs_rq. Nevertheless, the 1st for_each_sched_entity() loop is skipped because gse->on_rq == true, and the 2nd loop because the cfs_rq is throttled, whereas we have to update both load_avg with the old h_nr_running and increase h_nr_running in such a case.

The same sequence can happen during dequeue when se moves to parent before breaking in the 1st loop.

Note that the update of load_avg will effectively happen only once in order to sync up to the throttled time. The next call to update load_avg will stop early because the clock stays unchanged.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Fixes: 6d4d22468dae ("sched/fair: Reorder enqueue/dequeue_task_fair path")
Link: https://lkml.kernel.org/r/[email protected]
2020-03-06  sched/fair: Fix runnable_avg for throttled cfs  [Vincent Guittot, 1 file, -2/+12]
When a cfs_rq is throttled, its group entity is dequeued and its running tasks are removed. We must update runnable_avg with the old h_nr_running and update group_se->runnable_weight with the new h_nr_running at each level of the hierarchy.

Reviewed-by: Ben Segall <[email protected]>
Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Fixes: 9f68395333ad ("sched/pelt: Add a new runnable average signal")
Link: https://lkml.kernel.org/r/[email protected]
2020-03-06  sched/deadline: Make two functions static  [Yu Chen, 2 files, -4/+4]
Since commit 06a76fe08d4 ("sched/deadline: Move DL related code from sched/core.c to sched/deadline.c"), DL related code moved to deadline.c. Make the following two functions static since they're only used in deadline.c:

  dl_change_utilization()
  init_dl_rq_bw_ratio()

Signed-off-by: Yu Chen <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
2020-03-06  sched/topology: Don't enable EAS on SMT systems  [Valentin Schneider, 1 file, -2/+10]
EAS already requires asymmetric CPU capacities to be enabled, and mixing this with SMT is an aberration, but better be safe than sorry.

Reviewed-by: Dietmar Eggemann <[email protected]>
Acked-by: Quentin Perret <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
2020-03-06  sched/numa: Acquire RCU lock for checking idle cores during NUMA balancing  [Mel Gorman, 1 file, -0/+2]
Qian Cai reported the following bug:

  The linux-next commit ff7db0bf24db ("sched/numa: Prefer using an idle CPU as a migration target instead of comparing tasks") introduced a boot warning,

  [   86.520534][    T1] WARNING: suspicious RCU usage
  [   86.520540][    T1] 5.6.0-rc3-next-20200227 #7 Not tainted
  [   86.520545][    T1] -----------------------------
  [   86.520551][    T1] kernel/sched/fair.c:5914 suspicious rcu_dereference_check() usage!
  [   86.520555][    T1]
  [   86.520555][    T1] other info that might help us debug this:
  [   86.520555][    T1]
  [   86.520561][    T1]
  [   86.520561][    T1] rcu_scheduler_active = 2, debug_locks = 1
  [   86.520567][    T1] 1 lock held by systemd/1:
  [   86.520571][    T1]  #0: ffff8887f4b14848 (&mm->mmap_sem#2){++++}, at: do_page_fault+0x1d2/0x998
  [   86.520594][    T1]
  [   86.520594][    T1] stack backtrace:
  [   86.520602][    T1] CPU: 1 PID: 1 Comm: systemd Not tainted 5.6.0-rc3-next-20200227 #7

task_numa_migrate() checks for idle cores when updating NUMA-related statistics. This relies on reading an RCU-protected structure in test_idle_cores() via this call chain:

  task_numa_migrate -> update_numa_stats -> numa_idle_core -> test_idle_cores

While the locking could be fine-grained, it is more appropriate to acquire the RCU lock for the entire scan of the domain. This patch removes the warning triggered at boot time.

Reported-by: Qian Cai <[email protected]>
Reviewed-by: Paul E. McKenney <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Fixes: ff7db0bf24db ("sched/numa: Prefer using an idle CPU as a migration target instead of comparing tasks")
Link: https://lkml.kernel.org/r/[email protected]
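The fix amounts to wrapping the scan of the node in an RCU read-side critical section, roughly like this (a sketch; the function body is heavily elided and the signature simplified):

  static void update_numa_stats(struct numa_stats *ns, int nid)
  {
          int cpu;

          rcu_read_lock();
          for_each_cpu(cpu, cpumask_of_node(nid)) {
                  struct rq *rq = cpu_rq(cpu);

                  /*
                   * Accumulate load/util of rq; numa_idle_core() below
                   * dereferences RCU-protected sched_domain state.
                   */
                  ns->idle_cpu = numa_idle_core(ns->idle_cpu, cpu);
          }
          rcu_read_unlock();
  }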
2020-03-06  sched/fair: Fix kernel build warning in test_idle_cores() for !SMT NUMA  [Valentin Schneider, 1 file, -5/+9]
Building against tip/sched/core at ff7db0bf24db ("sched/numa: Prefer using an idle CPU as a migration target instead of comparing tasks") with the arm64 defconfig (which doesn't have CONFIG_SCHED_SMT set) leads to:

  kernel/sched/fair.c:1525:20: warning: 'test_idle_cores' declared 'static' but never defined [-Wunused-function]
   static inline bool test_idle_cores(int cpu, bool def);
                      ^~~~~~~~~~~~~~~

Rather than define it in its own CONFIG_SCHED_SMT #define island, bunch it up with test_idle_cores().

Reported-by: Anshuman Khandual <[email protected]>
Reported-by: Naresh Kamboju <[email protected]>
Reviewed-by: Lukasz Luba <[email protected]>
[[email protected]: Edit changelog, minor style change]
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Fixes: ff7db0bf24db ("sched/numa: Prefer using an idle CPU as a migration target instead of comparing tasks")
Link: https://lkml.kernel.org/r/[email protected]
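For reference, a helper bunched with its CONFIG_SCHED_SMT counterpart usually looks like this (an approximation of the common kernel pattern, not the exact patch):

  #ifdef CONFIG_SCHED_SMT
  static inline bool test_idle_cores(int cpu, bool def)
  {
          struct sched_domain_shared *sds;

          sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
          if (sds)
                  return READ_ONCE(sds->has_idle_cores);

          return def;
  }
  #else
  static inline bool test_idle_cores(int cpu, bool def)
  {
          return def;     /* !SMT stub keeps callers compiling */
  }
  #endif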
2020-03-06  sched/fair: Enable tuning of decay period  [Thara Gopinath, 3 files, -2/+33]
Thermal pressure follows PELT signals, which means the decay period for thermal pressure is the default PELT decay period. Depending on SoC characteristics and thermal activity, it might be beneficial to decay thermal pressure more slowly, but still in tune with the PELT signals. One way to achieve this is to provide a command line parameter to set a decay shift parameter to an integer between 0 and 10.

Signed-off-by: Thara Gopinath <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
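A sketch of how such a shift can stretch the decay (the parameter name and helpers here are illustrative assumptions, not necessarily the merged interface):

  /* Boot parameter, e.g. "sched_thermal_decay_shift=3" => 8x slower decay. */
  static int sched_thermal_decay_shift;

  static int __init setup_sched_thermal_decay_shift(char *str)
  {
          int shift = 0;

          if (kstrtoint(str, 10, &shift) == 0)
                  sched_thermal_decay_shift = clamp(shift, 0, 10);

          return 1;
  }
  __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);

  /* Feeding a right-shifted clock into PELT decays 2^shift times slower. */
  static inline u64 rq_clock_thermal(struct rq *rq)
  {
          return rq_clock_task(rq) >> sched_thermal_decay_shift;
  }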
2020-03-06  sched/fair: Update cpu_capacity to reflect thermal pressure  [Thara Gopinath, 1 file, -0/+7]
cpu_capacity initially reflects the maximum possible capacity of a CPU. Thermal pressure on a CPU means this maximum possible capacity is unavailable due to thermal events. This patch subtracts the average thermal pressure for a CPU from its maximum possible capacity so that cpu_capacity reflects the remaining maximum capacity.

Signed-off-by: Thara Gopinath <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
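In effect (a simplified sketch of the arithmetic; the function name is hypothetical and the real update site also discounts rt/dl/irq pressure):

  static unsigned long remaining_capacity(int cpu)
  {
          struct rq *rq = cpu_rq(cpu);
          unsigned long max = arch_scale_cpu_capacity(cpu);
          unsigned long thermal = thermal_load_avg(rq);   /* avg pressure */

          if (thermal >= max)
                  return 1;       /* never report zero capacity */

          return max - thermal;
  }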
2020-03-06  sched/fair: Enable periodic update of average thermal pressure  [Thara Gopinath, 2 files, -0/+10]
Introduce support in the scheduler periodic tick and other CFS bookkeeping APIs to trigger the process of computing the average thermal pressure for a CPU. Also consider avg_thermal.load_avg in others_have_blocked(), which allows for decay of PELT signals.

Signed-off-by: Thara Gopinath <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
2020-03-06  sched/pelt: Add support to track thermal pressure  [Thara Gopinath, 3 files, -0/+65]
Extrapolating from the existing framework that tracks rt/dl utilization using PELT signals, add a similar mechanism to track thermal pressure. The difference from rt/dl utilization tracking is that, instead of tracking time spent by a CPU running an RT/DL task through util_avg, the average thermal pressure is tracked through load_avg. This is because the thermal pressure signal is a weighted "delta" capacity over time, unlike util_avg which is binary. "Delta capacity" here means the difference between the actual capacity of a CPU and its decreased capacity due to a thermal event.

In order to track average thermal pressure, a new sched_avg variable avg_thermal is introduced. The function update_thermal_load_avg() can be called to do the periodic bookkeeping (accumulate, decay and average) of the thermal pressure.

Reviewed-by: Vincent Guittot <[email protected]>
Signed-off-by: Thara Gopinath <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
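A sketch of the bookkeeping hook, assuming the generic PELT helpers are reused the same way as for rt/dl tracking:

  int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity)
  {
          /*
           * "capacity" is the capacity lost to thermal events; feeding it
           * in as the running value makes load_avg converge towards it.
           */
          if (___update_load_sum(now, &rq->avg_thermal,
                                 capacity, capacity, capacity)) {
                  ___update_load_avg(&rq->avg_thermal, 1);
                  return 1;
          }

          return 0;
  }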
2020-03-06  sched/vtime: Prevent unstable evaluation of WARN(vtime->state)  [Chris Wilson, 1 file, -19/+22]
As the vtime is sampled under loose seqcount protection by kcpustat, the vtime fields may change as the code flows. Where logic dictates a field has a static value, use a READ_ONCE.

Signed-off-by: Chris Wilson <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Fixes: 74722bb223d0 ("sched/vtime: Bring up complete kcpustat accessor")
Link: https://lkml.kernel.org/r/[email protected]
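The pattern being applied, in sketch form (the condition is hypothetical; the point is that the field is read exactly once for both the test and any report):

  /* Racy: vtime->state may change between the check and the WARN(). */
  WARN_ON_ONCE(vtime->state == VTIME_INACTIVE);

  /* Stable: snapshot the field once, then evaluate and report it. */
  int state = READ_ONCE(vtime->state);

  WARN_ON_ONCE(state == VTIME_INACTIVE);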
2020-03-06  Merge branch 'linus' into sched/core, to pick up fixes  [Ingo Molnar, 10 files, -127/+287]
Signed-off-by: Ingo Molnar <[email protected]>
2020-03-02  Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds, 1 file, -0/+2]
Pull scheduler fix from Ingo Molnar:

 "Fix a scheduler statistics bug"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix statistics for find_idlest_group()
2020-02-28  Merge tag 'block-5.6-2020-02-28' of git://git.kernel.dk/linux-block  [Linus Torvalds, 1 file, -31/+83]
Pull block fixes from Jens Axboe:

 - Passthrough insertion fix (Ming)
 - Kill off some unused arguments (John)
 - blktrace RCU fix (Jan)
 - Dead fields removal for null_blk (Dongli)
 - NVMe polled IO fix (Bijan)

* tag 'block-5.6-2020-02-28' of git://git.kernel.dk/linux-block:
  nvme-pci: Hold cq_poll_lock while completing CQEs
  blk-mq: Remove some unused function arguments
  null_blk: remove unused fields in 'nullb_cmd'
  blktrace: Protect q->blk_trace with RCU
  blk-mq: insert passthrough request into hctx->dispatch directly
2020-02-28  Merge tag 'pm-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm  [Linus Torvalds, 1 file, -1/+1]
Pull power management fixes from Rafael Wysocki:

 "Fix a recent cpufreq initialization regression (Rafael Wysocki), revert a devfreq commit that made incompatible changes and broke user land on some systems (Orson Zhai), drop a stale reference to a document that has gone away recently (Jonathan Neuschäfer), and fix a typo in a hibernation code comment (Alexandre Belloni)"

* tag 'pm-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: Fix policy initialization for internal governor drivers
  Revert "PM / devfreq: Modify the device name as devfreq(X) for sysfs"
  PM / hibernate: fix typo "reserverd_size" -> "reserved_size"
  Documentation: power: Drop reference to interface.rst
2020-02-28  Merge branches 'pm-sleep' and 'pm-devfreq'  [Rafael J. Wysocki, 1 file, -1/+1]
* pm-sleep:
  PM / hibernate: fix typo "reserverd_size" -> "reserved_size"
  Documentation: power: Drop reference to interface.rst

* pm-devfreq:
  Revert "PM / devfreq: Modify the device name as devfreq(X) for sysfs"
2020-02-27  Merge tag 'audit-pr-20200226' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit  [Linus Torvalds, 2 files, -51/+60]
Pull audit fixes from Paul Moore:

 "Two fixes for problems found by syzbot:

  - Moving audit filter structure fields into a union caused some problems in the code which populates that filter structure. We keep the union (that idea is a good one), but we are fixing the code so that it doesn't needlessly set fields in the union and mess up the error handling.

  - The audit_receive_msg() function wasn't validating user input as well as it should in all cases, we add the necessary checks"

* tag 'audit-pr-20200226' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: always check the netlink payload length in audit_receive_msg()
  audit: fix error handling in audit_data_to_entry()
2020-02-27  sched/fair: Fix statistics for find_idlest_group()  [Vincent Guittot, 1 file, -0/+2]
sgs->group_weight is not set while gathering statistics in update_sg_wakeup_stats(). This means that a group can be classified as fully busy with 0 running tasks if utilization is high enough. This path is mainly used for fork and exec.

Fixes: 57abff067a08 ("sched/fair: Rework find_idlest_group()")
Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
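The fix is essentially one missing assignment while the group statistics are gathered (sketched; the surrounding accumulation code is elided and signatures are simplified):

  static inline void update_sg_wakeup_stats(struct sched_domain *sd,
                                            struct sched_group *group,
                                            struct sg_lb_stats *sgs,
                                            struct task_struct *p)
  {
          /* ... per-CPU accumulation of load, util and nr_running ... */

          sgs->group_weight = group->group_weight;        /* was missing */

          sgs->group_type = group_classify(sd->imbalance_pct, group, sgs);
  }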
2020-02-26  Merge tag 'trace-v5.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace  [Linus Torvalds, 4 files, -35/+127]
Pull tracing and bootconfig updates:

 "Fixes and changes to bootconfig before it goes live in a release.

  Change in API of bootconfig (before it comes live in a release):

  - Have a magic value "BOOTCONFIG" in initrd to know a bootconfig exists
  - Set CONFIG_BOOT_CONFIG to 'n' by default
  - Show error if "bootconfig" is on the cmdline but not compiled in
  - Prevent redefining the same value
  - Have a way to append values
  - Added a SELECT BLK_DEV_INITRD to fix a build failure

  Synthetic event fixes:

  - Switch to raw_smp_processor_id() for recording CPU value in preempt section. (No care for what the value actually is)
  - Fix samples always recording u64 values
  - Fix endianness
  - Check number of values matches number of fields
  - Fix a printing bug

  Fix of trace_printk() breaking postponed start up tests

  Make a function static that is only used in a single file"

* tag 'trace-v5.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  bootconfig: Fix CONFIG_BOOTTIME_TRACING dependency issue
  bootconfig: Add append value operator support
  bootconfig: Prohibit re-defining value on same key
  bootconfig: Print array as multiple commands for legacy command line
  bootconfig: Reject subkey and value on same parent key
  tools/bootconfig: Remove unneeded error message silencer
  bootconfig: Add bootconfig magic word for indicating bootconfig explicitly
  bootconfig: Set CONFIG_BOOT_CONFIG=n by default
  tracing: Clear trace_state when starting trace
  bootconfig: Mark boot_config_checksum() static
  tracing: Disable trace_printk() on postponed tests
  tracing: Have synthetic event test use raw_smp_processor_id()
  tracing: Fix number printing bug in print_synth_event()
  tracing: Check that number of vals matches number of synth event fields
  tracing: Make synth_event trace functions endian-correct
  tracing: Make sure synth_event_trace() example always uses u64
2020-02-26  signal: avoid double atomic counter increments for user accounting  [Linus Torvalds, 1 file, -9/+14]
When queueing a signal, we increment both the users count of pending signals (for RLIMIT_SIGPENDING tracking) and we increment the refcount of the user struct itself (because we keep a reference to the user in the signal structure in order to correctly account for it when freeing).

That turns out to be fairly expensive, because both of them are atomic updates, and particularly under extreme signal handling pressure on big machines, you can get a lot of cache contention on the user struct. That can then cause horrid cacheline ping-pong when you do these multiple accesses.

So change the reference counting to only pin the user for the _first_ pending signal, and to unpin it when the last pending signal is dequeued. That means that when a user sees a lot of concurrent signal queuing - which is the only situation when this matters - the only atomic access needed is generally the 'sigpending' count update.

This was noticed because of a particularly odd timing artifact on a dual-socket 96C/192T Cascade Lake platform: when you get into bad contention, it for some reason seems to be much worse when the contention happens in the upper 32-byte half of the cacheline.

As a result, the kernel test robot will-it-scale 'signal1' benchmark had an odd performance regression simply due to random alignment of the 'struct user_struct' (and pointed to a completely unrelated and apparently nonsensical commit for the regression).

Avoiding the double increments (and decrements on the dequeueing side, of course) makes for much less contention and hugely improved performance on that will-it-scale microbenchmark.

Quoting Feng Tang:

 "It makes a big difference, that the performance score is tripled! bump from original 17000 to 54000. Also the gap between 5.0-rc6 and 5.0-rc6+Jiri's patch is reduced to around 2%"

[ The "2% gap" is the odd cacheline placement difference on that platform: under the extreme contention case, the effect of which half of the cacheline was hot was 5%, so with the reduced contention the odd timing artifact is reduced too ]

It does help in the non-contended case too, but is not nearly as noticeable.

Reported-and-tested-by: Feng Tang <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Huang, Ying <[email protected]>
Cc: Philip Li <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
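The resulting pattern, in sketch form (simplified from the description above; RLIMIT checks and error handling are elided):

  /* Queueing side: take a user reference only for the first signal. */
  if (atomic_inc_return(&user->sigpending) == 1)
          get_uid(user);          /* pin the user struct once */

  /* Dequeueing side: drop the reference with the last pending signal. */
  if (atomic_dec_and_test(&user->sigpending))
          free_uid(user);         /* unpin on the last signal */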
2020-02-25  bootconfig: Fix CONFIG_BOOTTIME_TRACING dependency issue  [Masami Hiramatsu, 1 file, -1/+0]
Since commit d8a953ddde5e ("bootconfig: Set CONFIG_BOOT_CONFIG=n by default") also changed CONFIG_BOOTTIME_TRACING to select CONFIG_BOOT_CONFIG to show the boot-time tracing on the menu, it introduced wrong dependencies with BLK_DEV_INITRD as below.

  WARNING: unmet direct dependencies detected for BOOT_CONFIG
    Depends on [n]: BLK_DEV_INITRD [=n]
    Selected by [y]:
    - BOOTTIME_TRACING [=y] && TRACING_SUPPORT [=y] && FTRACE [=y] && TRACING [=y]

Make CONFIG_BOOT_CONFIG select CONFIG_BLK_DEV_INITRD to fix this error, and make CONFIG_BOOTTIME_TRACING=n by default, so that both boot-time tracing and boot configuration are off by default but still appear in the menu list.

Link: http://lkml.kernel.org/r/158264140162.23842.11237423518607465535.stgit@devnote2
Fixes: d8a953ddde5e ("bootconfig: Set CONFIG_BOOT_CONFIG=n by default")
Reported-by: Randy Dunlap <[email protected]>
Compiled-tested-by: Randy Dunlap <[email protected]>
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-02-25  blktrace: Protect q->blk_trace with RCU  [Jan Kara, 1 file, -31/+83]
KASAN is reporting that __blk_add_trace() has a use-after-free issue when accessing q->blk_trace. Indeed the switching of block tracing (and thus the eventual freeing of q->blk_trace) is completely unsynchronized with the currently running tracing, and thus it can happen that the blk_trace structure is being freed just while __blk_add_trace() works on it.

Protect accesses to q->blk_trace by RCU during tracing and make sure we wait for the end of the RCU grace period when shutting down tracing. Luckily that is a rare enough event that we can afford it. Note that postponing the freeing of blk_trace to an RCU callback should better be avoided, as it could have unexpected user-visible side-effects: the debugfs files would still exist for a short while after block tracing has been shut down.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=205711
CC: [email protected]
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Tested-by: Ming Lei <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Reported-by: Tristan Madani <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
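The reader/updater split then looks roughly like this (a sketch of the RCU pattern, not the full patch; record_event() is a hypothetical stand-in for the real logging):

  /* Tracing hot path: RCU read-side access to q->blk_trace. */
  rcu_read_lock();
  bt = rcu_dereference(q->blk_trace);
  if (bt)
          record_event(bt);
  rcu_read_unlock();

  /* Shutdown path: unpublish, wait out all readers, then free. */
  rcu_assign_pointer(q->blk_trace, NULL);
  synchronize_rcu();
  blk_trace_free(bt);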
2020-02-24  audit: always check the netlink payload length in audit_receive_msg()  [Paul Moore, 1 file, -19/+21]
This patch ensures that we always check the netlink payload length in audit_receive_msg() before we take any action on the payload itself.

Cc: [email protected]
Reported-by: [email protected]
Reported-by: [email protected]
Signed-off-by: Paul Moore <[email protected]>
2020-02-24  sched/numa: Stop an exhaustive search if a reasonable swap candidate or idle CPU is found  [Mel Gorman, 1 file, -4/+27]
When domains are imbalanced or overloaded, all CPUs on the target domain are searched and compared with task_numa_compare(). In some circumstances, a candidate is found that is an obvious win:

 o A task can move to an idle CPU and an idle CPU is found
 o A swap candidate is found that would move to its preferred domain

This patch terminates the search when either condition is met.

Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Phil Auld <[email protected]>
Cc: Hillf Danton <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2020-02-24  sched/numa: Bias swapping tasks based on their preferred node  [Mel Gorman, 1 file, -6/+37]
When swapping tasks for NUMA balancing, it is preferred that tasks move to or remain on their preferred node. When considering an imbalance, encourage tasks to move to their preferred node and discourage tasks from moving away from their preferred node.

Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Phil Auld <[email protected]>
Cc: Hillf Danton <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2020-02-24  sched/numa: Find an alternative idle CPU if the CPU is part of an active NUMA balance  [Mel Gorman, 1 file, -18/+22]
Multiple tasks can attempt to select an idle CPU but fail because numa_migrate_on is already set and the migration fails. Instead of failing, scan for an alternative idle CPU. select_idle_sibling() is not used because it requires IRQs to be disabled and it ignores numa_migrate_on, allowing multiple tasks to stack. This scan may still fail due to races over the idle candidate CPUs, but if this occurs, it's best that a task stays on an available CPU rather than moving to a contended one.

Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Phil Auld <[email protected]>
Cc: Hillf Danton <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2020-02-24  sched/numa: Prefer using an idle CPU as a migration target instead of comparing tasks  [Mel Gorman, 1 file, -17/+102]
task_numa_find_cpu() can scan a node multiple times. Minimally it scans to gather statistics and later to find a suitable target. In some cases, the second scan will simply pick an idle CPU if the load is not imbalanced.

This patch caches information on an idle core while gathering statistics and uses it immediately if load is not imbalanced, to avoid a second scan of the node runqueues. Preference is given to an idle core rather than an idle SMT sibling to avoid packing HT siblings due to linearly scanning the node cpumask.

As a side-effect, even when the second scan is necessary, the importance of using select_idle_sibling() is much reduced because information on idle CPUs is cached and can be reused.

Note that this patch actually makes it harder to move to an idle CPU, as multiple tasks can race for the same idle CPU due to a race checking numa_migrate_on. This is addressed in the next patch.

Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Phil Auld <[email protected]>
Cc: Hillf Danton <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2020-02-24  sched/fair: Take into account runnable_avg to classify group  [Vincent Guittot, 1 file, -1/+30]
Take into account the new runnable_avg signal to classify a group and to mitigate the volatility of util_avg in the face of intensive migration or new tasks with random utilization.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Phil Auld <[email protected]>
Cc: Hillf Danton <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2020-02-24  sched/pelt: Add a new runnable average signal  [Vincent Guittot, 4 files, -21/+126]
Now that runnable_load_avg has been removed, we can replace it by a new signal that will highlight the runnable pressure on a cfs_rq. This signal tracks the waiting time of tasks on the rq and can help to better define the state of rqs.

Currently, only util_avg is used to define the state of a rq: a rq with more than around 80% utilization and more than 1 task is considered as overloaded. But the util_avg signal of a rq can become temporarily low after a task migrates onto another rq, which can bias the classification of the rq.

When tasks compete for the same rq, their runnable average signal will be higher than util_avg as it will include the waiting time, and we can use this signal to better classify cfs_rqs.

The new runnable_avg will track the runnable time of a task, which simply adds the waiting time to the running time. The runnable_avg of a cfs_rq will be the sum of the se's runnable_avg, and the runnable_avg of a group entity will follow that of the rq, similarly to util_avg.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Phil Auld <[email protected]>
Cc: Hillf Danton <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
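Conceptually, the accumulation rule differs from util_avg only in when it ticks (a sketch; the variable names are illustrative, the real code lives in the generic PELT accumulation path):

  /*
   * Per-entity PELT step:
   *   util_sum     grows only while the entity actually runs;
   *   runnable_sum grows while it runs OR waits on the rq.
   */
  if (se_queued)                  /* running or waiting on the rq */
          sa->runnable_sum += scaled_delta;
  if (se_running)                 /* actually occupying the CPU */
          sa->util_sum += scaled_delta;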
2020-02-24  sched/pelt: Remove unused runnable load average  [Vincent Guittot, 5 files, -167/+42]
Now that runnable_load_avg is no longer used, we can remove it to make space for a new signal.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Phil Auld <[email protected]>
Cc: Hillf Danton <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2020-02-24  sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity  [Mel Gorman, 1 file, -31/+50]
The standard load balancer generally tries to keep the number of running tasks or idle CPUs balanced between NUMA domains. The NUMA balancer allows tasks to move if there is spare capacity, but this causes a conflict, and utilisation between NUMA nodes gets badly skewed. This patch uses similar logic between the NUMA balancer and load balancer when deciding if a task migrating to its preferred node can use an idle CPU.

Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Phil Auld <[email protected]>
Cc: Hillf Danton <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2020-02-24  sched/numa: Replace runnable_load_avg by load_avg  [Vincent Guittot, 1 file, -32/+70]
Similarly to what has been done for the normal load balancer, we can replace runnable_load_avg by load_avg in NUMA load balancing and track the other statistics, like the utilization and the number of running tasks, to get a better view of the current state of a node.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Phil Auld <[email protected]>
Cc: Hillf Danton <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2020-02-24  sched/fair: Reorder enqueue/dequeue_task_fair path  [Vincent Guittot, 1 file, -22/+20]
The walk through the cgroup hierarchy during the enqueue/dequeue of a task is split in 2 distinct parts for throttled cfs_rq without any added value, only making the code less readable.

Change the code ordering such that everything related to a cfs_rq (throttled or not) will be done in the same loop.

In addition, the same ordering of steps is used when updating a cfs_rq:

 - update_load_avg
 - update_cfs_group
 - update *h_nr_running

This reordering enables the use of h_nr_running in the PELT algorithm.

No functional or performance changes are expected, and none have been noticed during tests.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Phil Auld <[email protected]>
Cc: Hillf Danton <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2020-02-24  sched/numa: Distinguish between the different task_numa_migrate() failure cases  [Mel Gorman, 1 file, -3/+3]
sched:sched_stick_numa is meant to fire when a task is unable to migrate to the preferred node, but from the trace it's not possible to tell the difference between "no CPU found", "migration to idle CPU failed" and "tasks could not be swapped". Extend the tracepoint accordingly.

Signed-off-by: Mel Gorman <[email protected]>
[ Minor edits. ]
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Phil Auld <[email protected]>
Cc: Hillf Danton <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2020-02-24  sched/numa: Trace when no candidate CPU was found on the preferred node  [Mel Gorman, 1 file, -1/+3]
sched:sched_stick_numa is meant to fire when a task is unable to migrate to the preferred node. The case where no candidate CPU could be found is not traced, which is an important gap. The tracepoint is not fired when the task is not allowed to run on any CPU on the preferred node or the task is already running on the target CPU, but neither are interesting corner cases.

Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Phil Auld <[email protected]>
Cc: Hillf Danton <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2020-02-24  Merge tag 'v5.6-rc3' into sched/core, to pick up fixes and dependent patches  [Ingo Molnar, 170 files, -3179/+9144]
Signed-off-by: Ingo Molnar <[email protected]>
2020-02-22  audit: fix error handling in audit_data_to_entry()  [Paul Moore, 1 file, -32/+39]
Commit 219ca39427bf ("audit: use union for audit_field values since they are mutually exclusive") combined a number of separate fields in the audit_field struct into a single union. Generally this worked just fine because the fields are mutually exclusive.

Unfortunately in audit_data_to_entry() the overlap can be a problem when a specific error case is triggered that causes the error path code to attempt to clean up an audit_field struct, and the cleanup involves attempting to free a stored LSM string (the lsm_str field). Currently the code always has a non-NULL value in the audit_field.lsm_str field because the top of the for-loop transfers a value into audit_field.val (both .lsm_str and .val are part of the same union); if audit_data_to_entry() fails and the audit_field struct is specified to contain an LSM string, but audit_field.lsm_str has not yet been properly set, the error handling code will attempt to free the bogus audit_field.lsm_str value that was set with audit_field.val at the top of the for-loop.

This patch corrects this by ensuring that audit_field.val is only set when needed (it is cleared when the audit_field struct is allocated with kcalloc()). It also corrects a few other issues to ensure that in case of error the proper error code is returned.

Cc: [email protected]
Fixes: 219ca39427bf ("audit: use union for audit_field values since they are mutually exclusive")
Reported-by: [email protected]
Signed-off-by: Paul Moore <[email protected]>
2020-02-22  Merge tag 'irq-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds, 3 files, -18/+24]
Pull irq fixes from Thomas Gleixner:

 "Two fixes for the irq core code which are follow-ups to the recent MSI fixes:

  - The WARN_ON which was put into the MSI setaffinity callback for paranoia reasons actually triggered via a callchain which escaped when all the possible ways to reach that code were analyzed.

    The proc/irq/$N/*affinity interfaces have a quirk which came in when ALPHA moved to the generic interface: in case the written affinity mask does not contain any online CPU, it calls into ALPHA's magic auto affinity setting code.

    A few years later this mechanism was also made available to x86 for no good reasons and in a way which circumvents all sanity checks for interrupts which cannot have their affinity set from process context on x86 due to the way the x86 interrupt delivery works.

    It would be possible to make this work properly, but there is no point in doing so. If the interrupt is not yet started then the affinity setting has no effect, and if it is started already then it is already assigned to an online CPU so there is no point to randomly move it to some other CPU. Just return EINVAL as the code has done before that change forever.

  - The new MSI quirk bit in the irq domain flags turned out to be already occupied, which escaped the author and the reviewers because the already in-use bits were 0, 6, 2, 3, 4, 5, listed in that order. That bit 6 was simply overlooked because the ordering was straightforward linear otherwise. So the new bit ended up being a duplicate. Fix it up by switching the oddball 6 to the obvious 1"

* tag 'irq-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq/irqdomain: Make sure all irq domain flags are distinct
  genirq/proc: Reject invalid affinity masks (again)
2020-02-22  Merge tag 's390-5.6-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux  [Linus Torvalds, 1 file, -9/+0]
Pull s390 fixes from Vasily Gorbik:

 - Remove the ieee_emulation_warnings sysctl, which is dead code.
 - Avoid triggering a rebuild of the kernel during make install.
 - Enable protected virtualization guest support in default configs.
 - Fix the cio_ignore seq_file .next function to increase the position index. And use kobj_to_dev instead of container_of in cio code.
 - Fix storage block address lists to contain absolute addresses in qdio code.
 - A few clang warnings and spelling fixes.

* tag 's390-5.6-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/qdio: fill SBALEs with absolute addresses
  s390/qdio: fill SL with absolute addresses
  s390: remove obsolete ieee_emulation_warnings
  s390: make 'install' not depend on vmlinux
  s390/kaslr: Fix casts in get_random
  s390/mm: Explicitly compare PAGE_DEFAULT_KEY against zero in storage_key_init_range
  s390/pkey/zcrypt: spelling s/crytp/crypt/
  s390/cio: use kobj_to_dev() API
  s390/defconfig: enable CONFIG_PROTECTED_VIRTUALIZATION_GUEST
  s390/cio: cio_ignore_proc_seq_next should increase position index
2020-02-21  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  [Linus Torvalds, 3 files, -9/+57]
Pull networking fixes from David Miller:

 1) Limit xt_hashlimit hash table size to avoid OOM or hung tasks, from Cong Wang.

 2) Fix deadlock in xsk by publishing global consumer pointers when NAPI is finished, from Magnus Karlsson.

 3) Set table field properly to RT_TABLE_COMPAT when necessary, from Jethro Beekman.

 4) NLA_STRING attributes are not necessarily NULL terminated, deal with that in IFLA_ALT_IFNAME. From Eric Dumazet.

 5) Fix checksum handling in atlantic driver, from Dmitry Bezrukov.

 6) Handle mtu==0 devices properly in wireguard, from Jason A. Donenfeld.

 7) Fix several lockdep warnings in bonding, from Taehee Yoo.

 8) Fix cls_flower port blocking, from Jason Baron.

 9) Sanitize internal map names in libbpf, from Toke Høiland-Jørgensen.

10) Fix RDMA race in qede driver, from Michal Kalderon.

11) Fix several false lockdep warnings by adding conditions to list_for_each_entry_rcu(), from Madhuparna Bhowmik.

12) Fix sleep in atomic in mlx5 driver, from Huy Nguyen.

13) Fix potential deadlock in bpf_map_do_batch(), from Yonghong Song.

14) Hey, variables declared in a switch statement before any case statements are not initialized. I learn something every day. Get rid of this stuff in several parts of the networking, from Kees Cook.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
  bnxt_en: Issue PCIe FLR in kdump kernel to cleanup pending DMAs.
  bnxt_en: Improve device shutdown method.
  net: netlink: cap max groups which will be considered in netlink_bind()
  net: thunderx: workaround BGX TX Underflow issue
  ionic: fix fw_status read
  net: disable BRIDGE_NETFILTER by default
  net: macb: Properly handle phylink on at91rm9200
  s390/qeth: fix off-by-one in RX copybreak check
  s390/qeth: don't warn for napi with 0 budget
  s390/qeth: vnicc Fix EOPNOTSUPP precedence
  openvswitch: Distribute switch variables for initialization
  net: ip6_gre: Distribute switch variables for initialization
  net: core: Distribute switch variables for initialization
  udp: rehash on disconnect
  net/tls: Fix to avoid gettig invalid tls record
  bpf: Fix a potential deadlock with bpf_map_do_batch
  bpf: Do not grab the bucket spinlock by default on htab batch ops
  ice: Wait for VF to be reset/ready before configuration
  ice: Don't tell the OS that link is going down
  ice: Don't reject odd values of usecs set by user
  ...
2020-02-21  y2038: remove unused time32 interfaces  [Arnd Bergmann, 2 files, -107/+0]
No users remain, so kill these off before we grow new ones.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnd Bergmann <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Cc: Deepa Dinamani <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2020-02-20  bootconfig: Set CONFIG_BOOT_CONFIG=n by default  [Masami Hiramatsu, 1 file, -1/+2]
Set CONFIG_BOOT_CONFIG=n by default. This also warns the user if CONFIG_BOOT_CONFIG=n but "bootconfig" is given in the kernel command line.

Link: http://lkml.kernel.org/r/158220111291.26565.9036889083940367969.stgit@devnote2
Suggested-by: Steven Rostedt <[email protected]>
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-02-20  tracing: Clear trace_state when starting trace  [Masami Hiramatsu, 1 file, -2/+2]
Clear the trace_state data structure when starting a trace in the __synth_event_trace_start() internal function. Currently trace_state is initialized only in the synth_event_trace_start() API, but the trace_state in synth_event_trace() and synth_event_trace_array() is on the stack without initialization. This means those APIs will see wrong parameters and will skip the closing process in __synth_event_trace_end() because trace_state->disabled may be !0.

Link: http://lkml.kernel.org/r/158193315899.8868.1781259176894639952.stgit@devnote2
Reviewed-by: Tom Zanussi <[email protected]>
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
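The fix boils down to zeroing the state up front (a sketch; the rest of the function body is elided):

  static int __synth_event_trace_start(struct trace_event_file *file,
                        struct synth_event_trace_state *trace_state)
  {
          memset(trace_state, 0, sizeof(*trace_state));   /* was missing */

          /* ... look up the event, honor trace_state->disabled ... */

          return 0;
  }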
2020-02-20  tracing: Disable trace_printk() on postponed tests  [Steven Rostedt (VMware), 1 file, -0/+2]
The tracing selftests check various aspects of the tracing infrastructure, one of which is filtering. If trace_printk() is active during a selftest, it can cause the filtering to fail, which will disable that part of the trace.

To keep the selftests from failing because of trace_printk() calls, trace_printk() checks the variable tracing_selftest_running, and if set, it does not write to the tracing buffer.

As some tracers were registered earlier in boot, the selftests they triggered would fail because not all the infrastructure was set up for the full selftest. Thus, some of the tests were postponed to when their infrastructure was ready (namely file system code). The postpone code did not set the tracing_selftest_running variable, and could fail if a trace_printk() was added and executed during their run.

Cc: [email protected]
Fixes: 9afecfbb95198 ("tracing: Postpone tracer start-up tests till the system is more robust")
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-02-20  tracing: Have synthetic event test use raw_smp_processor_id()  [Steven Rostedt (VMware), 1 file, -6/+6]
The test code that tests synthetic event creation pushes in as one of its test fields the current CPU using "smp_processor_id()". As this is just something to see if the value is correctly passed in, and the actual CPU used does not matter, use raw_smp_processor_id(); otherwise, with debug preemption enabled, a warning happens as smp_processor_id() is called without preemption enabled.

Link: http://lkml.kernel.org/r/[email protected]
Reviewed-by: Tom Zanussi <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>