blaster4385/linux-IllusionX - Linux kernel with personal config changes for arch linux

Age	Commit message (Collapse)	Author	Files	Lines
2017-09-14	sched/wait: Introduce wakeup boomark in wake_up_page_bit	Tim Chen	1	-0/+7
	Now that we have added breaks in the wait queue scan and allow bookmark on scan position, we put this logic in the wake_up_page_bit function. We can have very long page wait list in large system where multiple pages share the same wait list. We break the wake up walk here to allow other cpus a chance to access the list, and not to disable the interrupts when traversing the list for too long. This reduces the interrupt and rescheduling latency, and excessive page wait queue lock hold time. [ v2: Remove bookmark_wake_function ] Signed-off-by: Tim Chen <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-14	sched/wait: Break up long wake list walk	Tim Chen	1	-15/+63
	We encountered workloads that have very long wake up list on large systems. A waker takes a long time to traverse the entire wake list and execute all the wake functions. We saw page wait list that are up to 3700+ entries long in tests of large 4 and 8 socket systems. It took 0.8 sec to traverse such list during wake up. Any other CPU that contends for the list spin lock will spin for a long time. It is a result of the numa balancing migration of hot pages that are shared by many threads. Multiple CPUs waking are queued up behind the lock, and the last one queued has to wait until all CPUs did all the wakeups. The page wait list is traversed with interrupt disabled, which caused various problems. This was the original cause that triggered the NMI watch dog timer in: https://patchwork.kernel.org/patch/9800303/ . Only extending the NMI watch dog timer there helped. This patch bookmarks the waker's scan position in wake list and break the wake up walk, to allow access to the list before the waker resume its walk down the rest of the wait list. It lowers the interrupt and rescheduling latency. This patch also provides a performance boost when combined with the next patch to break up page wakeup list walk. We saw 22% improvement in the will-it-scale file pread2 test on a Xeon Phi system running 256 threads. [ v2: Merged in Linus' changes to remove the bookmark_wake_function, and simply access to flags. ] Reported-by: Kan Liang <[email protected]> Tested-by: Kan Liang <[email protected]> Signed-off-by: Tim Chen <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-14	watchdog/hardlockup: Clean up hotplug locking mess	Thomas Gleixner	1	-6/+0
	All watchdog thread related functions are delegated to the smpboot thread infrastructure, which handles serialization against CPU hotplug correctly. The sysctl interface is completely decoupled from anything which requires CPU hotplug protection. No need to protect the sysctl writes against cpu hotplug anymore. Remove it and add the now required protection to the powerpc arch_nmi_watchdog implementation. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Don Zickus <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Ulrich Obergfell <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-14	watchdog/hardlockup/perf: Simplify deferred event destroy	Thomas Gleixner	1	-5/+2
	Now that all functionality is properly serialized against CPU hotplug, remove the extra per cpu storage which holds the disabled events for cleanup. The core makes sure that cleanup happens before new events are created. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Don Zickus <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Ulrich Obergfell <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-14	watchdog/hardlockup/perf: Use new perf CPU enable mechanism	Thomas Gleixner	2	-84/+8
	Get rid of the hodgepodge which tries to be smart about perf being unavailable and error printout rate limiting. That's all not required simply because this is never invoked when the perf NMI watchdog is not functional. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Don Zickus <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Ulrich Obergfell <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-14	watchdog/hardlockup/perf: Implement CPU enable replacement	Thomas Gleixner	1	-0/+11
	watchdog_nmi_enable() is an unparseable mess, Provide a clean perf specific implementation, which will be used when the existing setup/teardown mess is replaced. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Don Zickus <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Ulrich Obergfell <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-14	watchdog/hardlockup/perf: Implement init time detection of perf	Thomas Gleixner	1	-1/+12
	Use the init time detection of the perf NMI watchdog to determine whether the perf NMI watchdog is functional. If not disable it permanentely. It won't come back magically at runtime. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Don Zickus <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Ulrich Obergfell <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-14	watchdog/hardlockup/perf: Implement init time perf validation	Thomas Gleixner	1	-0/+37
	The watchdog tries to create perf events even after it figured out that perf is not functional or the requested event is not supported. That's braindead as this can be done once at init time and if not supported the NMI watchdog can be turned off unconditonally. Implement the perf hardlockup detector functionality for that. This creates a new event create function, which will replace the unholy mess of the existing one in later patches. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Don Zickus <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Ulrich Obergfell <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-14	watchdog/core: Get rid of the racy update loop	Thomas Gleixner	1	-48/+47
	Letting user space poke directly at variables which are used at run time is stupid and causes a lot of race conditions and other issues. Seperate the user variables and on change invoke the reconfiguration, which then stops the watchdogs, reevaluates the new user value and restarts the watchdogs with the new parameters. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Don Zickus <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Ulrich Obergfell <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-14	watchdog/core, powerpc: Make watchdog_nmi_reconfigure() two stage	Thomas Gleixner	1	-9/+22
	Both the perf reconfiguration and the powerpc watchdog_nmi_reconfigure() need to be done in two steps. 1) Stop all NMIs 2) Read the new parameters and start NMIs Right now watchdog_nmi_reconfigure() is a combination of both. To allow a clean reconfiguration add a 'run' argument and split the functionality in powerpc. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Don Zickus <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Ulrich Obergfell <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-14	watchdog/sysctl: Clean up sysctl variable name space	Thomas Gleixner	2	-29/+28
	Reflect that these variables are user interface related and remove the whitespace damage in the sysctl table while at it. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Don Zickus <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Ulrich Obergfell <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-14	watchdog/sysctl: Get rid of the #ifdeffery	Thomas Gleixner	1	-5/+1
	The sysctl of the nmi_watchdog file prevents writes by setting: min = max = 0 if none of the users is enabled. That involves ifdeffery and is competely non obvious. If none of the facilities is enabeld, then the file can simply be made read only. Move the ifdeffery into the header and use a constant for file permissions. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Don Zickus <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Ulrich Obergfell <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-14	watchdog/core: Further simplify sysctl handling	Thomas Gleixner	1	-20/+7
	Use a single function to update sysctl changes. This is not a high frequency user space interface and it's root only. Preparatory patch to cleanup the sysctl variable handling. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Don Zickus <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Ulrich Obergfell <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-14	watchdog/core: Get rid of the thread teardown/setup dance	Thomas Gleixner	1	-171/+19
	The lockup detector reconfiguration tears down all watchdog threads when the watchdog is disabled and sets them up again when its enabled. That's a pointless exercise. The watchdog threads are not consuming an insane amount of resources, so it's enough to set them up at init time and keep them in parked position when the watchdog is disabled and unpark them when it is reenabled. The smpboot thread infrastructure takes care of keeping the force parked threads in place even across cpu hotplug. Aside of that the code implements the park/unpark facility of smp hotplug threads on its own, which is even more pointless. We have functionality in the smpboot thread code to do so. Use the new thread management functions and get rid of the unholy mess. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Don Zickus <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Ulrich Obergfell <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-14	watchdog/core: Create new thread handling infrastructure	Thomas Gleixner	1	-0/+75
	The lockup detector reconfiguration tears down all watchdog threads when the watchdog is disabled and sets them up again when its enabled. That's a pointless exercise. The watchdog threads are not consuming an insane amount of resources, so it's enough to set them up at init time and keep them in parked position when the watchdog is disabled and unpark them when it is reenabled. The smpboot thread infrastructure takes care of keeping the force parked threads in place even across cpu hotplug. Another horrible mechanism are the open coded park/unpark loops which are used for reconfiguration of the watchdog. The smpboot infrastructure allows exactly the same via smpboot_update_cpumask_thread_percpu(), which is cpu hotplug safe. Using that instead of the open coded loops allows to get rid of the hotplug locking mess in the watchdog code. Implement a clean infrastructure which allows to replace the open coded nonsense. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Don Zickus <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Ulrich Obergfell <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-14	smpboot/threads, watchdog/core: Avoid runtime allocation	Thomas Gleixner	2	-31/+12
	smpboot_update_cpumask_threads_percpu() allocates a temporary cpumask at runtime. This is suboptimal because the call site needs more code size for proper error handling than a statically allocated temporary mask requires data size. Add static temporary cpumask. The function is globaly serialized, so no further protection required. Remove the half baken error handling in the watchdog code and get rid of the export as there are no in tree modular users of that function. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Don Zickus <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Ulrich Obergfell <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-14	watchdog/core: Split out cpumask write function	Thomas Gleixner	1	-19/+21
	Split the write part of the cpumask proc handler out into a separate helper to avoid deep indentation. This also reduces the patch complexity in the following cleanups. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Don Zickus <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Ulrich Obergfell <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-14	watchdog/core: Clean up the #ifdef maze	Thomas Gleixner	1	-20/+13
	The #ifdef maze in this file is horrible, group stuff at least a bit so one can figure out what belongs to what. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Don Zickus <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Ulrich Obergfell <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-14	watchdog/core: Clean up stub functions	Thomas Gleixner	1	-46/+22
	Having stub functions which take a full page is not helping the readablility of code. Condense them and move the doubled #ifdef variant into the SYSFS section. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Don Zickus <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Ulrich Obergfell <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-14	watchdog/core: Remove the park_in_progress obfuscation	Thomas Gleixner	2	-25/+19
	Commit: b94f51183b06 ("kernel/watchdog: prevent false hardlockup on overloaded system") tries to fix the following issue: proc_write() set_sample_period() <--- New sample period becoms visible <----- Broken starts proc_watchdog_update() watchdog_enable_all_cpus() watchdog_hrtimer_fn() update_watchdog_all_cpus() restart_timer(sample_period) watchdog_park_threads() thread->park() disable_nmi() <----- Broken ends The reason why this is broken is that the update of the watchdog threshold becomes immediately effective and visible for the hrtimer function which uses that value to rearm the timer. But the NMI/perf side still uses the old value up to the point where it is disabled. If the rate has been lowered then the NMI can run fast enough to 'detect' a hard lockup because the timer has not fired due to the longer period. The patch 'fixed' this by adding a variable: proc_write() set_sample_period() <----- Broken starts proc_watchdog_update() watchdog_enable_all_cpus() watchdog_hrtimer_fn() update_watchdog_all_cpus() restart_timer(sample_period) watchdog_park_threads() park_in_progress = 1 <----- Broken ends nmi_watchdog() if (park_in_progress) return; The only effect of this variable was to make the window where the breakage can hit small enough that it was not longer observable in testing. From a correctness point of view it is a pointless bandaid which merily papers over the root cause: the unsychronized update of the variable. Looking deeper into the related code pathes unearthed similar problems in the watchdog_start()/stop() functions. watchdog_start() perf_nmi_event_start() hrtimer_start() watchdog_stop() hrtimer_cancel() perf_nmi_event_stop() In both cases the call order is wrong because if the tasks gets preempted or the VM gets scheduled out long enough after the first call, then there is a chance that the next NMI will see a stale hrtimer interrupt count and trigger a false positive hard lockup splat. Get rid of park_in_progress so the code can be gradually deobfuscated and pruned from several layers of duct tape papering over the root cause, which has been either ignored or not understood at all. Once this is removed the underlying problem will be fixed by rewriting the proc interface to do a proper synchronized update. Address the start/stop() ordering problem as well by reverting the call order, so this part is at least correct now. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Don Zickus <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Ulrich Obergfell <[email protected]> Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1709052038270.2393@nanos Signed-off-by: Ingo Molnar <[email protected]>
2017-09-14	watchdog/hardlockup/perf: Prevent CPU hotplug deadlock	Thomas Gleixner	3	-6/+59
	The following deadlock is possible in the watchdog hotplug code: cpus_write_lock() ... takedown_cpu() smpboot_park_threads() smpboot_park_thread() kthread_park() ->park() := watchdog_disable() watchdog_nmi_disable() perf_event_release_kernel(); put_event() _free_event() ->destroy() := hw_perf_event_destroy() x86_release_hardware() release_ds_buffers() get_online_cpus() when a per cpu watchdog perf event is destroyed which drops the last reference to the PMU hardware. The cleanup code there invokes get_online_cpus() which instantly deadlocks because the hotplug percpu rwsem is write locked. To solve this add a deferring mechanism: cpus_write_lock() kthread_park() watchdog_nmi_disable(deferred) perf_event_disable(event); move_event_to_deferred(event); .... cpus_write_unlock() cleaup_deferred_events() perf_event_release_kernel() This is still properly serialized against concurrent hotplug via the cpu_add_remove_lock, which is held by the task which initiated the hotplug event. This is also used to handle event destruction when the watchdog threads are parked via other mechanisms than CPU hotplug. Analyzed-by: Peter Zijlstra <[email protected]> Reported-by: Borislav Petkov <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Don Zickus <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Ulrich Obergfell <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-14	watchdog/hardlockup/perf: Remove broken self disable on failure	Thomas Gleixner	2	-28/+7
	The self disabling feature is broken vs. CPU hotplug locking: CPU 0 CPU 1 cpus_write_lock(); cpu_up(1) wait_for_completion() .... unpark_watchdog() ->unpark() perf_event_create() <- fails watchdog_enable &= ~NMI_WATCHDOG; .... cpus_write_unlock(); CPU 2 cpus_write_lock() cpu_down(2) wait_for_completion() wakeup(watchdog); watchdog() if (!(watchdog_enable & NMI_WATCHDOG)) watchdog_nmi_disable() perf_event_disable() .... cpus_read_lock(); stop_smpboot_threads() park_watchdog(); wait_for_completion(watchdog->parked); Result: End of hotplug and instantaneous full lockup of the machine. There is a similar problem with disabling the watchdog via the user space interface as the sysctl function fiddles with watchdog_enable directly. It's very debatable whether this is required at all. If the watchdog works nicely on N CPUs and it fails to enable on the N + 1 CPU either during hotplug or because the user space interface disabled it via sysctl cpumask and then some perf user grabbed the counter which is then unavailable for the watchdog when the sysctl cpumask gets changed back. There is no real justification for this. One of the reasons WHY this is done is the utter stupidity of the init code of the perf NMI watchdog. Instead of checking upfront at boot whether PERF is available and functional at all, it just does this check at run time over and over when user space fiddles with the sysctl. That's broken beyond repair along with the idiotic error code dependent warn level printks and the even more silly printk rate limiting. If the init code checks whether perf works at boot time, then this mess can be more or less avoided completely. Perf does not come magically into life at runtime. Brain usage while coding is overrated. Remove the cruft and add a temporary safe guard which gets removed later. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Don Zickus <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Ulrich Obergfell <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-14	watchdog/core: Mark hardlockup_detector_disable() __init	Thomas Gleixner	1	-1/+1
	The function is only used by the KVM init code. Mark it __init to prevent creative abuse. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Don Zickus <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Ulrich Obergfell <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-14	watchdog/core: Rename watchdog_proc_mutex	Thomas Gleixner	1	-8/+7
	Following patches will use the mutex for other purposes as well. Rename it as it is not longer a proc specific thing. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Don Zickus <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Ulrich Obergfell <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-14	watchdog/core: Rework CPU hotplug locking	Thomas Gleixner	1	-6/+6
	The watchdog proc interface causes extensive recursive locking of the CPU hotplug percpu rwsem, which is deadlock prone. Replace the get/put_online_cpus() pairs with cpu_hotplug_disable()/enable() calls for now. Later patches will remove that requirement completely. Reported-by: Borislav Petkov <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Don Zickus <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Ulrich Obergfell <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-14	watchdog/core: Remove broken suspend/resume interfaces	Thomas Gleixner	1	-88/+1
	This interface has several issues: - It's causing recursive locking of the hotplug lock. - It's complete overkill to teardown all threads and then recreate them The same can be achieved with the simple hardlockup_detector_perf_stop / restart() interfaces. The abuse from the busy looping poweroff() loop of PARISC has been solved as well. Remove the cruft. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Don Zickus <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Ulrich Obergfell <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-14	watchdog/core: Provide interface to stop from poweroff()	Thomas Gleixner	1	-1/+13
	PARISC has a a busy looping power off routine. If the watchdog is enabled the watchdog timer will still fire, but the thread is not running, which causes the softlockup watchdog to trigger. Provide a interface which allows to turn the watchdog off. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Don Zickus <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Helge Deller <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Ulrich Obergfell <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-14	watchdog/hardlockup: Provide interface to stop/restart perf events	Peter Zijlstra	1	-0/+41
	Provide an interface to stop and restart perf NMI watchdog events on all CPUs. This is only usable during init and especially for handling the perf HT bug on Intel machines. It's safe to use it this way as nothing can start/stop the NMI watchdog in parallel. Signed-off-by: Peter Zijlstra <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Don Zickus <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Sebastian Siewior <[email protected]> Cc: Ulrich Obergfell <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-13	mm: treewide: remove GFP_TEMPORARY allocation flag	Michal Hocko	2	-2/+2
	GFP_TEMPORARY was introduced by commit e12ba74d8ff3 ("Group short-lived and reclaimable kernel allocations") along with __GFP_RECLAIMABLE. It's primary motivation was to allow users to tell that an allocation is short lived and so the allocator can try to place such allocations close together and prevent long term fragmentation. As much as this sounds like a reasonable semantic it becomes much less clear when to use the highlevel GFP_TEMPORARY allocation flag. How long is temporary? Can the context holding that memory sleep? Can it take locks? It seems there is no good answer for those questions. The current implementation of GFP_TEMPORARY is basically GFP_KERNEL \| __GFP_RECLAIMABLE which in itself is tricky because basically none of the existing caller provide a way to reclaim the allocated memory. So this is rather misleading and hard to evaluate for any benefits. I have checked some random users and none of them has added the flag with a specific justification. I suspect most of them just copied from other existing users and others just thought it might be a good idea to use without any measuring. This suggests that GFP_TEMPORARY just motivates for cargo cult usage without any reasoning. I believe that our gfp flags are quite complex already and especially those with highlevel semantic should be clearly defined to prevent from confusion and abuse. Therefore I propose dropping GFP_TEMPORARY and replace all existing users to simply use GFP_KERNEL. Please note that SLAB users with shrinkers will still get __GFP_RECLAIMABLE heuristic and so they will be placed properly for memory fragmentation prevention. I can see reasons we might want some gfp flag to reflect shorterm allocations but I propose starting from a clear semantic definition and only then add users with proper justification. This was been brought up before LSF this year by Matthew [1] and it turned out that GFP_TEMPORARY really doesn't have a clear semantic. It seems to be a heuristic without any measured advantage for most (if not all) its current users. The follow up discussion has revealed that opinions on what might be temporary allocation differ a lot between developers. So rather than trying to tweak existing users into a semantic which they haven't expected I propose to simply remove the flag and start from scratch if we really need a semantic for short term allocations. [1] http://lkml.kernel.org/r/[email protected] [[email protected]: fix typo] [[email protected]: coding-style fixes] [[email protected]: drm/i915: fix up] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Signed-off-by: Stephen Rothwell <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Neil Brown <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-13	Merge branch 'sched-urgent-for-linus' of ↵	Linus Torvalds	5	-3/+25
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Three CPU hotplug related fixes and a debugging improvement" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/debug: Add debugfs knob for "sched_debug" sched/core: WARN() when migrating to an offline CPU sched/fair: Plug hole between hotplug and active_load_balance() sched/fair: Avoid newidle balance for !active CPUs
2017-09-13	Merge tag 'modules-for-v4.14' of ↵	Linus Torvalds	1	-6/+6
	git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux Pull modules updates from Jessica Yu: "Summary of modules changes for the 4.14 merge window: - minor code cleanups and fixes - modpost: avoid building modules that have names that exceed the size of the name field in struct module" * tag 'modules-for-v4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux: module: Remove const attribute from alias for MODULE_DEVICE_TABLE module: fix ddebug_remove_module() modpost: abort if module name is too long
2017-09-12	Merge tag 'selinux-pr-20170831' of ↵	Linus Torvalds	1	-4/+0
	git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux Pull selinux updates from Paul Moore: "A relatively quiet period for SELinux, 11 patches with only two/three having any substantive changes. These noteworthy changes include another tweak to the NNP/nosuid handling, per-file labeling for cgroups, and an object class fix for AF_UNIX/SOCK_RAW sockets; the rest of the changes are minor tweaks or administrative updates (Stephen's email update explains the file explosion in the diffstat). Everything passes the selinux-testsuite" [ Also a couple of small patches from the security tree from Tetsuo Handa for Tomoyo and LSM cleanup. The separation of security policy updates wasn't all that clean - Linus ] * tag 'selinux-pr-20170831' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux: selinux: constify nf_hook_ops selinux: allow per-file labeling for cgroupfs lsm_audit: update my email address selinux: update my email address MAINTAINERS: update the NetLabel and Labeled Networking information selinux: use GFP_NOWAIT in the AVC kmem_caches selinux: Generalize support for NNP/nosuid SELinux domain transitions selinux: genheaders should fail if too many permissions are defined selinux: update the selinux info in MAINTAINERS credits: update Paul Moore's info selinux: Assign proper class to PF_UNIX/SOCK_RAW sockets tomoyo: Update URLs in Documentation/admin-guide/LSM/tomoyo.rst LSM: Remove security_task_create() hook.
2017-09-12	Merge branch 'sched-urgent-for-linus' of ↵	Linus Torvalds	4	-8/+24
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Three fixes: - fix a suspend/resume cpusets bug - fix a !CONFIG_NUMA_BALANCING bug - fix a kerneldoc warning" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix nuisance kernel-doc warning sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs sched/fair: Fix wake_affine_llc() balancing rules
2017-09-12	Merge branch 'irq-urgent-for-linus' of ↵	Linus Torvalds	2	-19/+10
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Ingo Molnar: "A sparse irq race/locking fix, and a MSI irq domains population fix" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Make sparse_irq_lock protect what it should protect genirq/msi: Fix populating multiple interrupts
2017-09-12	sched/debug: Add debugfs knob for "sched_debug"	Peter Zijlstra	3	-3/+8
	I'm forever late for editing my kernel cmdline, add a runtime knob to disable the "sched_debug" thing. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-12	sched/core: WARN() when migrating to an offline CPU	Peter Zijlstra	1	-0/+4
	Migrating tasks to offline CPUs is a pretty big fail, warn about it. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-12	sched/fair: Plug hole between hotplug and active_load_balance()	Peter Zijlstra	1	-0/+7
	The load balancer applies cpu_active_mask to whatever sched_domains it finds, however in the case of active_balance there is a hole between setting rq->{active_balance,push_cpu} and running the stop_machine work doing the actual migration. The @push_cpu can go offline in this window, which would result in us moving a task onto a dead cpu, which is a fairly bad thing. Double check the active mask before the stop work does the migration. CPU0 CPU1 <SoftIRQ> stop_machine(takedown_cpu) load_balance() cpu_stopper_thread() ... work = multi_cpu_stop stop_one_cpu_nowait( /* wait for CPU0 / .func = active_load_balance_cpu_stop ); </SoftIRQ> cpu_stopper_thread() work = multi_cpu_stop / sync with CPU1 / take_cpu_down() <idle> play_dead(); work = active_load_balance_cpu_stop set_task_cpu(p, CPU1); / oops!! */ Reported-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-12	sched/fair: Avoid newidle balance for !active CPUs	Peter Zijlstra	1	-0/+6
	On CPU hot unplug, when parking the last kthread we'll try and schedule into idle to kill the CPU. This last schedule can (and does) trigger newidle balance because at this point the sched domains are still up because of commit: 77d1dfda0e79 ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds") Obviously pulling tasks to an already offline CPU is a bad idea, and all balancing operations _should_ be subject to cpu_active_mask, make it so. Reported-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Fixes: 77d1dfda0e79 ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-11	Merge branch 'for-linus' of ↵	Linus Torvalds	6	-37/+77
	git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace updates from Eric Biederman: "Life has been busy and I have not gotten half as much done this round as I would have liked. I delayed it so that a minor conflict resolution with the mips tree could spend a little time in linux-next before I sent this pull request. This includes two long delayed user namespace changes from Kirill Tkhai. It also includes a very useful change from Serge Hallyn that allows the security capability attribute to be used inside of user namespaces. The practical effect of this is people can now untar tarballs and install rpms in user namespaces. It had been suggested to generalize this and encode some of the namespace information information in the xattr name. Upon close inspection that makes the things that should be hard easy and the things that should be easy more expensive. Then there is my bugfix/cleanup for signal injection that removes the magic encoding of the siginfo union member from the kernel internal si_code. The mips folks reported the case where I had used FPE_FIXME me is impossible so I have remove FPE_FIXME from mips, while at the same time including a return statement in that case to keep gcc from complaining about unitialized variables. I almost finished the work to get make copy_siginfo_to_user a trivial copy to user. The code is available at: git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git neuter-copy_siginfo_to_user-v3 But I did not have time/energy to get the code posted and reviewed before the merge window opened. I was able to see that the security excuse for just copying fields that we know are initialized doesn't work in practice there are buggy initializations that don't initialize the proper fields in siginfo. So we still sometimes copy unitialized data to userspace" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: Introduce v3 namespaced file capabilities mips/signal: In force_fcr31_sig return in the impossible case signal: Remove kernel interal si_code magic fcntl: Don't use ambiguous SIG_POLL si_codes prctl: Allow local CAP_SYS_ADMIN changing exe_file security: Use user_namespace::level to avoid redundant iterations in cap_capable() userns,pidns: Verify the userns for new pid namespaces signal/testing: Don't look for __SI_FAULT in userspace signal/mips: Document a conflict with SI_USER with SIGFPE signal/sparc: Document a conflict with SI_USER with SIGFPE signal/ia64: Document a conflict with SI_USER with SIGFPE signal/alpha: Document a conflict with SI_USER for SIGTRAP
2017-09-11	perf/bpf: fix a clang compilation issue	Yonghong Song	1	-1/+1
	clang does not support variable length array for structure member. It has the following error during compilation: kernel/trace/trace_syscalls.c:568:17: error: fields must have a constant size: 'variable length array in structure' extension will never be supported unsigned long args[sys_data->nb_args]; ^ The fix is to use a fixed array length instead. Reported-by: Nick Desaulniers <[email protected]> Signed-off-by: Yonghong Song <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-09-11	sched/fair: Fix nuisance kernel-doc warning	Randy Dunlap	1	-1/+1
	Work around kernel-doc warning ('*' in Sphinx doc means "emphasis"): ../kernel/sched/fair.c:7584: WARNING: Inline emphasis start-string without end-string. Signed-off-by: Randy Dunlap <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-09	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	Linus Torvalds	4	-12/+35
	Pull networking fixes from David Miller: "The iwlwifi firmware compat fix is in here as well as some other stuff: 1) Fix request socket leak introduced by BPF deadlock fix, from Eric Dumazet. 2) Fix VLAN handling with TXQs in mac80211, from Johannes Berg. 3) Missing __qdisc_drop conversions in prio and qfq schedulers, from Gao Feng. 4) Use after free in netlink nlk groups handling, from Xin Long. 5) Handle MTU update properly in ipv6 gre tunnels, from Xin Long. 6) Fix leak of ipv6 fib tables on netns teardown, from Sabrina Dubroca with follow-on fix from Eric Dumazet. 7) Need RCU and preemption disabled during generic XDP data patch, from John Fastabend" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (54 commits) bpf: make error reporting in bpf_warn_invalid_xdp_action more clear Revert "mdio_bus: Remove unneeded gpiod NULL check" bpf: devmap, use cond_resched instead of cpu_relax bpf: add support for sockmap detach programs net: rcu lock and preempt disable missing around generic xdp bpf: don't select potentially stale ri->map from buggy xdp progs net: tulip: Constify tulip_tbl net: ethernet: ti: netcp_core: no need in netif_napi_del davicom: Display proper debug level up to 6 net: phy: sfp: rename dt properties to match the binding dt-binding: net: sfp binding documentation dt-bindings: add SFF vendor prefix dt-bindings: net: don't confuse with generic PHY property ip6_tunnel: fix setting hop_limit value for ipv6 tunnel ip_tunnel: fix setting ttl and tos value in collect_md mode ipv6: fix typo in fib6_net_exit() tcp: fix a request socket leak sctp: fix missing wake ups in some situations netfilter: xt_hashlimit: fix build error caused by 64bit division netfilter: xt_hashlimit: alloc hashtable with right size ...
2017-09-09	Merge branch 'akpm' (patches from Andrew)	Linus Torvalds	19	-666/+696
	Merge more updates from Andrew Morton: - most of the rest of MM - a small number of misc things - lib/ updates - checkpatch - autofs updates - ipc/ updates * emailed patches from Andrew Morton <[email protected]>: (126 commits) ipc: optimize semget/shmget/msgget for lots of keys ipc/sem: play nicer with large nsops allocations ipc/sem: drop sem_checkid helper ipc: convert kern_ipc_perm.refcount from atomic_t to refcount_t ipc: convert sem_undo_list.refcnt from atomic_t to refcount_t ipc: convert ipc_namespace.count from atomic_t to refcount_t kcov: support compat processes sh: defconfig: cleanup from old Kconfig options mn10300: defconfig: cleanup from old Kconfig options m32r: defconfig: cleanup from old Kconfig options drivers/pps: use surrounding "if PPS" to remove numerous dependency checks drivers/pps: aesthetic tweaks to PPS-related content cpumask: make cpumask_next() out-of-line kmod: move #ifdef CONFIG_MODULES wrapper to Makefile kmod: split off umh headers into its own file MAINTAINERS: clarify kmod is just a kernel module loader kmod: split out umh code into its own file test_kmod: flip INT checks to be consistent test_kmod: remove paranoid UINT_MAX check on uint range processing vfat: deduplicate hex2bin() ...
2017-09-08	bpf: devmap, use cond_resched instead of cpu_relax	John Fastabend	1	-1/+1
	Be a bit more friendly about waiting for flush bits to complete. Replace the cpu_relax() with a cond_resched(). Suggested-by: Daniel Borkmann <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: John Fastabend <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-09-08	bpf: add support for sockmap detach programs	John Fastabend	2	-11/+18
	The bpf map sockmap supports adding programs via attach commands. This patch adds the detach command to keep the API symmetric and allow users to remove previously added programs. Otherwise the user would have to delete the map and re-add it to get in this state. This also adds a series of additional tests to capture detach operation and also attaching/detaching invalid prog types. API note: socks will run (or not run) programs depending on the state of the map at the time the sock is added. We do not for example walk the map and remove programs from previously attached socks. Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: John Fastabend <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-09-08	bpf: don't select potentially stale ri->map from buggy xdp progs	Daniel Borkmann	1	-0/+16
	We can potentially run into a couple of issues with the XDP bpf_redirect_map() helper. The ri->map in the per CPU storage can become stale in several ways, mostly due to misuse, where we can then trigger a use after free on the map: i) prog A is calling bpf_redirect_map(), returning XDP_REDIRECT and running on a driver not supporting XDP_REDIRECT yet. The ri->map on that CPU becomes stale when the XDP program is unloaded on the driver, and a prog B loaded on a different driver which supports XDP_REDIRECT return code. prog B would have to omit calling to bpf_redirect_map() and just return XDP_REDIRECT, which would then access the freed map in xdp_do_redirect() since not cleared for that CPU. ii) prog A is calling bpf_redirect_map(), returning a code other than XDP_REDIRECT. prog A is then detached, which triggers release of the map. prog B is attached which, similarly as in i), would just return XDP_REDIRECT without having called bpf_redirect_map() and thus be accessing the freed map in xdp_do_redirect() since not cleared for that CPU. iii) prog A is attached to generic XDP, calling the bpf_redirect_map() helper and returning XDP_REDIRECT. xdp_do_generic_redirect() is currently not handling ri->map (will be fixed by Jesper), so it's not being reset. Later loading a e.g. native prog B which would, say, call bpf_xdp_redirect() and then returns XDP_REDIRECT would find in xdp_do_redirect() that a map was set and uses that causing use after free on map access. Fix thus needs to avoid accessing stale ri->map pointers, naive way would be to call a BPF function from drivers that just resets it to NULL for all XDP return codes but XDP_REDIRECT and including XDP_REDIRECT for drivers not supporting it yet (and let ri->map being handled in xdp_do_generic_redirect()). There is a less intrusive way w/o letting drivers call a reset for each BPF run. The verifier knows we're calling into bpf_xdp_redirect_map() helper, so it can do a small insn rewrite transparent to the prog itself in the sense that it fills R4 with a pointer to the own bpf_prog. We have that pointer at verification time anyway and R4 is allowed to be used as per calling convention we scratch R0 to R5 anyway, so they become inaccessible and program cannot read them prior to a write. Then, the helper would store the prog pointer in the current CPUs struct redirect_info. Later in xdp_do_*_redirect() we check whether the redirect_info's prog pointer is the same as passed xdp_prog pointer, and if that's the case then all good, since the prog holds a ref on the map anyway, so it is always valid at that point in time and must have a reference count of at least 1. If in the unlikely case they are not equal, it means we got a stale pointer, so we clear and bail out right there. Also do reset map and the owning prog in bpf_xdp_redirect(), so that bpf_xdp_redirect_map() and bpf_xdp_redirect() won't get mixed up, only the last call should take precedence. A tc bpf_redirect() doesn't use map anywhere yet, so no need to clear it there since never accessed in that layer. Note that in case the prog is released, and thus the map as well we're still under RCU read critical section at that time and have preemption disabled as well. Once we commit with the __dev_map_insert_ctx() from xdp_do_redirect_map() and set the map to ri->map_to_flush, we still wait for a xdp_do_flush_map() to finish in devmap dismantle time once flush_needed bit is set, so that is fine. Fixes: 97f91a7cf04f ("bpf: add bpf_redirect_map helper routine") Reported-by: Jesper Dangaard Brouer <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: John Fastabend <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-09-08	kcov: support compat processes	Dmitry Vyukov	1	-0/+1
	Support compat processes in KCOV by providing compat_ioctl callback. Compat mode uses the same ioctl callback: we have 2 commands that do not use the argument and 1 that already checks that the arg does not overflow INT_MAX. This allows to use KCOV-guided fuzzing in compat processes. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Dmitry Vyukov <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	drivers/pps: aesthetic tweaks to PPS-related content	Robert P. J. Day	1	-1/+1
	Collection of aesthetic adjustments to various PPS-related files, directories and Documentation, some quite minor just for the sake of consistency, including: * Updated example of pps device tree node (courtesy Rodolfo G.) * "PPS-API" -> "PPS API" * "pps_source_info_s" -> "pps_source_info" * "ktimer driver" -> "pps-ktimer driver" * "ppstest /dev/pps0" -> "ppstest /dev/pps1" to match example * Add missing PPS-related entries to MAINTAINERS file * Other trivialities Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Robert P. J. Day <[email protected]> Acked-by: Rodolfo Giometti <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	kmod: move #ifdef CONFIG_MODULES wrapper to Makefile	Luis R. Rodriguez	2	-4/+2
	The entire file is now conditionally compiled only when CONFIG_MODULES is enabled, and this this is a bool. Just move this conditional to the Makefile as its easier to read this way. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Luis R. Rodriguez <[email protected]> Cc: Kees Cook <[email protected]> Cc: Dmitry Torokhov <[email protected]> Cc: Jessica Yu <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Michal Marek <[email protected]> Cc: Petr Mladek <[email protected]> Cc: Miroslav Benes <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Matt Redfearn <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: Colin Ian King <[email protected]> Cc: Daniel Mentz <[email protected]> Cc: David Binderman <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-09-08	kmod: split out umh code into its own file	Luis R. Rodriguez	3	-559/+571
	Patch series "kmod: few code cleanups to split out umh code" The usermode helper has a provenance from the old usb code which first required a usermode helper. Eventually this was shoved into kmod.c and the kernel's modprobe calls was converted over eventually to share the same code. Over time the list of usermode helpers in the kernel has grown -- so kmod is just but one user of the API. This series is a simple logical cleanup which acknowledges the code evolution of the usermode helper and shoves the UMH API into its own dedicated file. This way users of the API can later just include umh.h instead of kmod.h. Note despite the diff state the first patch really is just a code shove, no functional changes are done there. I did use git format-patch -M to generate the patch, but in the end the split was not enough for git to consider it a rename hence the large diffstat. I've put this through 0-day and it gives me their machine compilation blessings with all tests as OK. This patch (of 4): There's a slew of usermode helper users and kmod is just one of them. Split out the usermode helper code into its own file to keep the logic and focus split up. This change provides no functional changes. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Luis R. Rodriguez <[email protected]> Cc: Kees Cook <[email protected]> Cc: Dmitry Torokhov <[email protected]> Cc: Jessica Yu <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Michal Marek <[email protected]> Cc: Petr Mladek <[email protected]> Cc: Miroslav Benes <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Matt Redfearn <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: Colin Ian King <[email protected]> Cc: Daniel Mentz <[email protected]> Cc: David Binderman <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>