|
lock_release() takes this nested argument that's mostly pointless
these days; remove the implementation but leave the argument as a
rudiment for now.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
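A minimal userspace sketch of the resulting shape, using a stand-in
lockdep_map type and an invented body rather than the kernel code: the
nested argument is still accepted from callers but is simply ignored.
#include <stdio.h>
/* Stand-in type; not the kernel's struct lockdep_map. */
struct lockdep_map { const char *name; };
static void lock_release(struct lockdep_map *lock, int nested, unsigned long ip)
{
        (void)nested;   /* vestigial: kept only so callers need no change */
        printf("releasing %s from ip %#lx\n", lock->name, ip);
}
int main(void)
{
        struct lockdep_map map = { .name = "demo_lock" };
        lock_release(&map, 1, 0x1234UL);
        return 0;
}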
|
|
The whole migrate_task{,s}() locking seems a little shaky; there's a
lot of lock dropping and reacquiring happening. Pull the locking up into the
callers as far as possible to streamline the lot.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
In preparation to reworking set_cpus_allowed_ptr() move some code
around. This also removes some superfluous #ifdefs and adds comments
to some #endifs.
text data bss dec hex filename
12211532 1738144 1081344 15031020 e55aec defconfig-build/vmlinux.pre
12211532 1738144 1081344 15031020 e55aec defconfig-build/vmlinux.post
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
We still have a few pending issues with the deadline code, one of which
is that switching between scheduling classes can 'leak' CBS state.
Close the hole by retaining the current CBS state when leaving
SCHED_DEADLINE and unconditionally programming the deadline timer.
The timer will then reset the CBS state if the task is still
!SCHED_DEADLINE by the time it hits.
If the task left SCHED_DEADLINE it will not call task_dead_dl() and
we'll not cancel the hrtimer, leaving us with a pending timer pointing
at freed memory. Avoid this by giving the timer a task reference; this
avoids littering the task exit path for this rather uncommon case.
In order to do this, I had to move dl_task_offline_migration() below
the replenishment, such that the task_rq()->lock fully covers that.
While doing this, I noticed that it was buggy in assuming a task is
enqueued and/or that we need to enqueue the task now. Fixing this means
select_task_rq_dl() might encounter an offline rq -- look into that.
As a result this kills cancel_dl_timer() which included a rq->lock
break.
Fixes: 40767b0dc768 ("sched/deadline: Fix deadline parameter modification handling")
Cc: Wanpeng Li <[email protected]>
Cc: Luca Abeni <[email protected]>
Cc: Juri Lelli <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Luca Abeni <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
Remove the direct {push,pull} balancing operations from
switched_{from,to}_dl() / prio_changed_dl() and use the balance
callback queue.
Again, err on the side of too many reschedules; since too few is a
hard bug while too many is just annoying.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
In order to be able to use pull_dl_task() from a callback, we need to
do away with the return value.
Since the return value indicates if we should reschedule, do this
inside the function. Since not all callers currently do this, this can
increase the number of reschedules due to deadline balancing.
Too many reschedules are not a correctness issue; too few are.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
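A small userspace sketch of the change described above, with invented
names rather than the scheduler's: the pull function no longer returns
whether a reschedule is needed, it requests one itself when it pulled
something.
#include <stdbool.h>
#include <stdio.h>
static bool need_resched;
static void resched_curr(void)
{
        need_resched = true;            /* stand-in for marking the running task */
}
/* Before: int pull_task(void); the caller rescheduled on a non-zero return.
 * Now the decision lives inside the function itself. */
static void pull_task(void)
{
        bool pulled = true;             /* pretend we pulled a task over */
        if (pulled)
                resched_curr();         /* err on the side of rescheduling */
}
int main(void)
{
        pull_task();
        printf("need_resched=%d\n", need_resched);
        return 0;
}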
|
|
Remove the direct {push,pull} balancing operations from
switched_{from,to}_rt() / prio_changed_rt() and use the balance
callback queue.
Again, err on the side of too many reschedules; since too few is a
hard bug while too many is just annoying.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
In order to be able to use pull_rt_task() from a callback, we need to
do away with the return value.
Since the return value indicates if we should reschedule, do this
inside the function. Since not all callers currently do this, this can
increase the number of reschedules due to RT balancing.
Too many reschedules are not a correctness issue; too few are.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
In order to remove dropping rq->lock from the
switched_{to,from}()/prio_changed() sched_class methods, run the
balance callbacks after it.
We need to remove the rq->lock dropping because it's buggy: suppose we
use sched_setattr()/sched_setscheduler() to change a running
task from FIFO to OTHER.
By the time we get to switched_from_rt() the task is already enqueued
on the cfs runqueues. If switched_from_rt() does pull_rt_task() and
drops rq->lock, load-balancing can come in and move our task @p to
another rq.
The subsequent switched_to_fair() still assumes @p is on @rq and bad
things will happen.
By using balance callbacks we delay the load-balancing operations
{rt,dl}x{push,pull} until we've done all the important work and the
task is fully set up.
Furthermore, the balance callbacks do not know about @p, therefore
they cannot get confused like this.
Reported-by: Mike Galbraith <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
Reduce duplicate logic; normalize_task() is a simplified version of
__sched_setscheduler(). Parametrize the difference and collapse.
This reduces the number of check_class_changed() sites.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
Generalize the post_schedule() stuff into a balance callback list.
This allows us to more easily use it outside of schedule() and cross
sched_class.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
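A minimal userspace model of the balance-callback idea described above;
the structure and function names mirror the description, not the kernel
sources: callbacks are queued while the (modelled) rq->lock is held and
only invoked after it has been dropped.
#include <stddef.h>
#include <stdio.h>
struct rq;
struct balance_callback {
        struct balance_callback *next;
        void (*func)(struct rq *rq);
};
struct rq {
        int locked;                             /* stand-in for rq->lock */
        struct balance_callback *balance_callback;
};
static void queue_balance_callback(struct rq *rq, struct balance_callback *cb,
                                   void (*func)(struct rq *))
{
        if (cb->func)                           /* already queued */
                return;
        cb->func = func;
        cb->next = rq->balance_callback;
        rq->balance_callback = cb;
}
static void balance_callbacks(struct rq *rq)
{
        struct balance_callback *cb = rq->balance_callback;
        rq->balance_callback = NULL;
        rq->locked = 0;                         /* "drop" rq->lock first */
        while (cb) {
                struct balance_callback *next = cb->next;
                cb->func(rq);                   /* e.g. push/pull balancing */
                cb = next;
        }
}
static void push_tasks(struct rq *rq)
{
        (void)rq;
        printf("push balancing runs after the lock is dropped\n");
}
int main(void)
{
        struct rq rq = { .locked = 1 };
        struct balance_callback cb = { 0 };
        queue_balance_callback(&rq, &cb, push_tasks);   /* under the lock */
        balance_callbacks(&rq);                         /* after unlocking */
        return 0;
}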
|
|
Merge sched/core and timers/core so we can apply the sched balancing
patch queue, which depends on both.
|
|
Currently an hrtimer callback function cannot free its own timer
because __run_hrtimer() still needs to clear HRTIMER_STATE_CALLBACK
after it. Freeing the timer would result in a clear use-after-free.
Solve this by using a scheme similar to regular timers; track the
current running timer in hrtimer_clock_base::running.
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Al Viro <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
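A tiny userspace model of the "track the running timer in the clock base"
scheme described above; field names are illustrative, not the hrtimer
implementation: a timer counts as active while it is queued or while the
base records it as the callback currently in progress, so no per-timer
CALLBACK bit needs to be cleared after the callback returns.
#include <stdbool.h>
#include <stdio.h>
struct hrtimer;
struct clock_base {
        struct hrtimer *running;        /* callback currently executing */
};
struct hrtimer {
        struct clock_base *base;
        bool enqueued;
};
static bool timer_active(const struct hrtimer *t)
{
        return t->enqueued || t->base->running == t;
}
int main(void)
{
        struct clock_base base = { 0 };
        struct hrtimer t = { .base = &base, .enqueued = false };
        base.running = &t;              /* callback in progress, not queued */
        printf("active while running: %d\n", timer_active(&t));
        base.running = NULL;            /* callback done; timer may be freed */
        printf("active afterwards:    %d\n", timer_active(&t));
        return 0;
}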
|
|
A queued hrtimer that gets restarted (hrtimer_start*() while
hrtimer_is_queued()) will briefly appear as unqueued/inactive, even
though the timer has always been active, we just moved it.
Close this hole by preserving timer->state in
hrtimer_start_range_ns()'s remove_hrtimer() call.
Reported-by: Oleg Nesterov <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
I do not understand HRTIMER_STATE_MIGRATE. Unless I am totally
confused it looks buggy and simply unneeded.
migrate_hrtimer_list() sets it to keep hrtimer_active() == T, but this
is not enough: this can fool, say, hrtimer_is_queued() in
dequeue_signal().
Can't migrate_hrtimer_list() simply use HRTIMER_STATE_ENQUEUED?
This fixes the race and we can kill STATE_MIGRATE.
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
Mark the task for later wakeup after the wait_lock has been released.
This way, once the next task is awoken, it will have a better chance
of finding the wait_lock free when it continues executing in
__rt_mutex_slowlock() when trying to acquire the rtmutex, calling
try_to_take_rt_mutex(). In contended scenarios, other tasks attempting to
take the lock may acquire it first, right after the wait_lock is released,
but (a) this can also occur with the current code, as it relies on the
spinlock fairness, and (b) we are dealing with the top-waiter anyway,
so it will always take the lock next.
Signed-off-by: Davidlohr Bueso <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
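A userspace sketch of deferring the wakeup until after the wait_lock is
dropped, as described above; the wake_q-style naming is assumed here, not
quoted from the rtmutex code.
#include <stdio.h>
struct task { const char *name; };
struct wake_q { struct task *pending; };
static void mark_wakeup_next_waiter(struct wake_q *q, struct task *top)
{
        q->pending = top;               /* only record it while holding wait_lock */
}
static void wake_up_q(struct wake_q *q)
{
        if (q->pending) {
                printf("waking %s after the wait_lock is released\n",
                       q->pending->name);
                q->pending = NULL;
        }
}
int main(void)
{
        struct task top = { "top-waiter" };
        struct wake_q q = { 0 };
        /* lock(wait_lock); */
        mark_wakeup_next_waiter(&q, &top);
        /* unlock(wait_lock); */
        wake_up_q(&q);                  /* waiter wakes with the lock free */
        return 0;
}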
|
|
Driver authors seem to get the ordering of irq_set_chained_handler()
and irq_set_handler_data() wrong - ordering the former before the
latter. This opens a race window where, if there is an interrupt
pending, the handler will be called between these two calls,
potentially resulting in an oops.
Provide a single interface to set both of these together, especially
as that's commonly what is required.
Signed-off-by: Russell King <[email protected]>
Cc: Alexandre Courbot <[email protected]>
Cc: Hans Ulli Kroll <[email protected]>
Cc: Jason Cooper <[email protected]>
Cc: Lee Jones <[email protected]>
Cc: Linus Walleij <[email protected]>
Cc: Thierry Reding <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
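A userspace model of the ordering race described above, with invented
types rather than the genirq API: if the chained handler is installed
before its data, an interrupt arriving in the window sees NULL data;
installing both in one step closes the window.
#include <stdio.h>
struct irq_desc {
        void (*handler)(struct irq_desc *desc);
        void *handler_data;
};
static void gpio_demux_handler(struct irq_desc *desc)
{
        if (!desc->handler_data) {
                printf("oops: handler ran before its data was set\n");
                return;
        }
        printf("handled with data %p\n", desc->handler_data);
}
/* The fix: set handler and data together, data first. */
static void set_chained_handler_and_data(struct irq_desc *desc,
                                         void (*handler)(struct irq_desc *),
                                         void *data)
{
        desc->handler_data = data;
        desc->handler = handler;        /* only now can the handler fire */
}
int main(void)
{
        struct irq_desc desc = { 0 };
        int chip_data = 42;
        set_chained_handler_and_data(&desc, gpio_demux_handler, &chip_data);
        desc.handler(&desc);            /* simulate a pending interrupt firing */
        return 0;
}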
|
|
The fix in d151832650ed9 (time: Move clock_was_set_seq update
before updating shadow-timekeeper) was unfortunately incomplete.
The main gist of that change was to do the shadow-copy update
last, so that any state changes were properly duplicated, and
we wouldn't accidentally have stale data in the shadow.
Unfortunately, in the main update_wall_time() logic, we use
the shadow-timekeeper to calculate the next update values,
then, while holding the lock, copy the shadow-timekeeper over,
then call timekeeping_update() to do some additional
bookkeeping (skipping the shadow mirror). The bug with this is
that the additional bookkeeping isn't all read-only, and some of it
changes timekeeper state. Thus we might then overwrite this state
change on the next update.
To avoid this problem, do the timekeeping_update() on the
shadow-timekeeper prior to copying the full state over to
the real-timekeeper.
This avoids problems with both the clock_was_set_seq and
next_leap_ktime being overwritten and possibly the
fast-timekeepers as well.
Many thanks to Prarit for his rigorous testing, which discovered
this problem, along with Prarit and Daniel's work validating this
fix.
Reported-by: Prarit Bhargava <[email protected]>
Tested-by: Prarit Bhargava <[email protected]>
Tested-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: John Stultz <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jiri Bohac <[email protected]>
Cc: Ingo Molnar <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
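A small userspace sketch of the resulting ordering; structure and function
names are borrowed from the description for readability and the bodies are
invented: all bookkeeping, including state-mutating bookkeeping, runs on
the shadow copy before it is mirrored into the real timekeeper, so nothing
gets overwritten by a later copy.
#include <stdio.h>
#include <string.h>
struct timekeeper { unsigned int clock_was_set_seq; long xtime_sec; };
static struct timekeeper tk_core;               /* the "real" timekeeper */
static struct timekeeper shadow_tk;             /* scratch copy */
static void timekeeping_update(struct timekeeper *tk)
{
        tk->clock_was_set_seq++;                /* bookkeeping that mutates state */
}
static void update_wall_time(long delta_sec)
{
        shadow_tk.xtime_sec += delta_sec;       /* accumulate on the shadow */
        timekeeping_update(&shadow_tk);         /* bookkeeping on the shadow too */
        memcpy(&tk_core, &shadow_tk, sizeof(tk_core));  /* mirror last */
}
int main(void)
{
        update_wall_time(1);
        update_wall_time(1);
        printf("seq=%u xtime=%lds\n", tk_core.clock_was_set_seq, tk_core.xtime_sec);
        return 0;
}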
|
|
CLOCK_EVT_MODE_* macros are present for backward compatibility (as most
of the drivers are still using old ->set_mode() interface).
These macros shouldn't be used anymore in code that is common to both
driver interfaces, i.e. ->set_mode() and ->set_state_*().
Drivers implementing ->set_state_*() interface, which have their
clkevt->mode set to 0 (clkevt device structures are normally globally
defined), will not participate in suspend/resume as they will always be
marked as UNUSED.
Fix this by checking state of the clockevent device instead of mode,
which is updated for both the interfaces.
Fixes: ac34ad27fc16 ("clockevents: Do not suspend/resume if unused")
Signed-off-by: Viresh Kumar <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/a1964eef6e8a47d02b1ff9083c6c91f73f0ff643.1434537215.git.viresh.kumar@linaro.org
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing filter fix from Steven Rostedt:
"Vince Weaver reported a warning when he added perf event filters into
his fuzzer tests. There's a missing check of balanced operations when
parentheses are used, and this triggers a WARN_ON() and when reading
the failure, the filter reports no failure occurred.
The operands were not being checked if they match, this adds that"
* tag 'trace-fix-filter-4.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Have filter check for balanced ops
|
|
When the following filter is used it causes a warning to trigger:
# cd /sys/kernel/debug/tracing
# echo "((dev==1)blocks==2)" > events/ext4/ext4_truncate_exit/filter
-bash: echo: write error: Invalid argument
# cat events/ext4/ext4_truncate_exit/filter
((dev==1)blocks==2)
^
parse_error: No error
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1223 at kernel/trace/trace_events_filter.c:1640 replace_preds+0x3c5/0x990()
Modules linked in: bnep lockd grace bluetooth ...
CPU: 3 PID: 1223 Comm: bash Tainted: G W 4.1.0-rc3-test+ #450
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
0000000000000668 ffff8800c106bc98 ffffffff816ed4f9 ffff88011ead0cf0
0000000000000000 ffff8800c106bcd8 ffffffff8107fb07 ffffffff8136b46c
ffff8800c7d81d48 ffff8800d4c2bc00 ffff8800d4d4f920 00000000ffffffea
Call Trace:
[<ffffffff816ed4f9>] dump_stack+0x4c/0x6e
[<ffffffff8107fb07>] warn_slowpath_common+0x97/0xe0
[<ffffffff8136b46c>] ? _kstrtoull+0x2c/0x80
[<ffffffff8107fb6a>] warn_slowpath_null+0x1a/0x20
[<ffffffff81159065>] replace_preds+0x3c5/0x990
[<ffffffff811596b2>] create_filter+0x82/0xb0
[<ffffffff81159944>] apply_event_filter+0xd4/0x180
[<ffffffff81152bbf>] event_filter_write+0x8f/0x120
[<ffffffff811db2a8>] __vfs_write+0x28/0xe0
[<ffffffff811dda43>] ? __sb_start_write+0x53/0xf0
[<ffffffff812e51e0>] ? security_file_permission+0x30/0xc0
[<ffffffff811dc408>] vfs_write+0xb8/0x1b0
[<ffffffff811dc72f>] SyS_write+0x4f/0xb0
[<ffffffff816f5217>] system_call_fastpath+0x12/0x6a
---[ end trace e11028bd95818dcd ]---
Worse yet, reading the error message (the filter again) it says that
there was no error, when there clearly was. The issue is that the
code that checks the input does not check for balanced ops. That is,
it does not verify that an operator is present between a closing
parenthesis and the next token.
This would only cause a warning, and fail out before doing any real
harm, but it should still not cause a warning, and the error reported
should work:
# cd /sys/kernel/debug/tracing
# echo "((dev==1)blocks==2)" > events/ext4/ext4_truncate_exit/filter
-bash: echo: write error: Invalid argument
# cat events/ext4/ext4_truncate_exit/filter
((dev==1)blocks==2)
^
parse_error: Meaningless filter expression
And give no kernel warning.
Link: http://lkml.kernel.org/r/[email protected]
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: [email protected] # 2.6.31+
Reported-by: Vince Weaver <[email protected]>
Tested-by: Vince Weaver <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
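A minimal userspace model of the added check, over a pre-tokenized
expression; the tokenization and operator set are simplified assumptions,
not the filter parser itself: after a closing parenthesis the next token
must be a boolean operator or another closing parenthesis, otherwise the
expression is rejected as meaningless.
#include <stdio.h>
#include <string.h>
static int check_balanced_ops(const char *tok[], int n)
{
        for (int i = 0; i + 1 < n; i++) {
                if (strcmp(tok[i], ")") != 0)
                        continue;
                if (strcmp(tok[i + 1], "&&") && strcmp(tok[i + 1], "||") &&
                    strcmp(tok[i + 1], ")"))
                        return -1;      /* e.g. "((dev==1)blocks==2)" */
        }
        return 0;
}
int main(void)
{
        const char *bad[]  = { "(", "(", "dev==1", ")", "blocks==2", ")" };
        const char *good[] = { "(", "(", "dev==1", ")", "&&", "blocks==2", ")" };
        printf("bad:  %d\n", check_balanced_ops(bad, 6));
        printf("good: %d\n", check_balanced_ops(good, 7));
        return 0;
}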
|
|
The functions irq_move_irq() and irq_move_masked_irq() expect that the
caller passes the top-level irq_data to them when hierarchical
irqdomains are enabled. But that's not true when called from
apic_ack_edge(), which results in a null pointer dereference by
idata->chip->irq_mask(idata).
Instead of fixing the callers to pass the top-level irq_data, rather
change irq_move_irq()/irq_move_masked_irq() to accept any irq_data.
Fixes: 52f518a3a7c 'x86/MSI: Use hierarchical irqdomains to manage MSI interrupts'
Reported-by: Huang Ying <[email protected]>
Signed-off-by: Jiang Liu <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Borislav Petkov <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
For irqs associated with hierarchical irqdomains, there will be multiple
irq_datas for one irq_desc. So enhance irq_data_to_desc() to support
hierarchical irqdomains. Also export irq_data_to_desc() as an inline
function for later reuse.
Signed-off-by: Jiang Liu <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Marc Zyngier <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
It's useful to do per-cpu histograms.
Suggested-by: Daniel Wagner <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
bpf_trace_printk() is a helper function used to debug eBPF programs.
Let socket and TC programs use it as well.
Note, it's a DEBUG ONLY helper. If it's used in a program,
the kernel will print a warning banner to make sure users don't use
it in production.
Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
eBPF programs attached to kprobes need to filter based on
current->pid, uid and other fields, so introduce helper functions:
u64 bpf_get_current_pid_tgid(void)
    Return: current->tgid << 32 | current->pid
u64 bpf_get_current_uid_gid(void)
    Return: current_gid << 32 | current_uid
bpf_get_current_comm(char *buf, int size_of_buf)
    stores current->comm into buf
They can be used from the programs attached to TC as well to classify packets
based on current task fields.
Update tracex2 example to print histogram of write syscalls for each process
instead of aggregated for all.
Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
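A small userspace sketch of how the packed return value described above is
typically split apart; the helper only exists inside a BPF program, so it
is mocked here with made-up pid/tgid values.
#include <stdint.h>
#include <stdio.h>
static uint64_t mock_bpf_get_current_pid_tgid(void)
{
        uint64_t tgid = 1234, pid = 1237;       /* invented values */
        return (tgid << 32) | pid;              /* current->tgid << 32 | current->pid */
}
int main(void)
{
        uint64_t v = mock_bpf_get_current_pid_tgid();
        uint32_t tgid = (uint32_t)(v >> 32);    /* thread group id (process id) */
        uint32_t pid  = (uint32_t)v;            /* thread id */
        printf("tgid=%u pid=%u\n", tgid, pid);
        return 0;
}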
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull lockdep fix from Ingo Molnar:
"A lockdep/modules unload race fix that can oops"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
lockdep: Fix a race between /proc/lock_stat and module unload
|
|
Introduce helper function irq_data_get_node() and variants thereof to
hide struct irq_data implementation details.
Convert the core code to use them.
Signed-off-by: Jiang Liu <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Jason Cooper <[email protected]>
Cc: Kevin Cernekee <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
With the introduction of hierarchical irqdomains, struct irq_data becomes
per-chip instead of per-irq and there may be multiple irq_datas
associated with the same irq. Some per-irq data stored in struct
irq_data may now get duplicated into multiple irq_datas, which causes
an inconsistent view.
So introduce struct irq_common_data to host per-irq common data and to
achieve a consistent view among irq_chips.
Signed-off-by: Jiang Liu <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Jason Cooper <[email protected]>
Cc: Kevin Cernekee <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Marc Zyngier <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
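A sketch of the split described above, with illustrative field names
rather than the kernel's: data that must stay per-irq lives in one common
structure that every irq_data in the hierarchy points to, so all levels
see a single consistent copy.
#include <stdio.h>
struct irq_common_data {
        unsigned int state;                     /* per-irq, shared by all levels */
        int node;
};
struct irq_data {
        unsigned int irq;
        struct irq_common_data *common;         /* shared */
        struct irq_data *parent_data;           /* next level in the hierarchy */
};
int main(void)
{
        struct irq_common_data common = { .state = 0, .node = 0 };
        struct irq_data parent = { .irq = 10, .common = &common, .parent_data = NULL };
        struct irq_data child  = { .irq = 10, .common = &common, .parent_data = &parent };
        parent.common->state = 1;               /* update through one level ...   */
        printf("child sees state %u\n", child.common->state);  /* ... seen by all */
        return 0;
}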
|
|
The functions irq_move_irq() and irq_move_masked_irq() expect that the
caller passes the top-level irq_data to them when hierarchical
irqdomains are enabled. But that's not true when called from
apic_ack_edge(), which results in a null pointer dereference by
idata->chip->irq_mask(idata).
Instead of fixing the callers to pass the top-level irq_data, rather
change irq_move_irq()/irq_move_masked_irq() to accept any irq_data.
Signed-off-by: Jiang Liu <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Borislav Petkov <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
For irqs associated with hierarchical irqdomains, there will be multiple
irq_datas for one irq_desc. So enhance irq_data_to_desc() to support
hierarchical irqdomains. Also export irq_data_to_desc() as an inline
function for later reuse.
Signed-off-by: Jiang Liu <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Marc Zyngier <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
Since the leapsecond is applied at tick-time, this means there is a
small window of time at the start of a leap-second where we cross into
the next second before applying the leap.
This patch modifies adjtimex so that the leap-second is applied on the
second edge, providing more correct leapsecond behavior.
This does make it so that adjtimex()'s returned time values can be
inconsistent with time values read from gettimeofday() or
clock_gettime(CLOCK_REALTIME,...) for a brief period of one tick at
the leapsecond. However, those other interfaces do not provide the
TIME_OOP time_state return that adjtimex() provides, which allows the
leapsecond to be properly represented. They instead only see a time
discontinuity, and cannot tell the first 23:59:59 from the repeated
23:59:59 leap second.
This seems like a reasonable tradeoff given clock_gettime() /
gettimeofday() cannot properly represent a leapsecond, and users
likely care more about performance, while folks who are using
adjtimex() more likely care about leap-second correctness.
Signed-off-by: John Stultz <[email protected]>
Cc: Prarit Bhargava <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jiri Bohac <[email protected]>
Cc: Ingo Molnar <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
Currently, leapsecond adjustments are done at tick time. As a result,
the leapsecond was applied at the first timer tick *after* the
leapsecond (~1-10ms late depending on HZ), rather than exactly on the
second edge.
This was in part historical from back when we were always tick based,
but correcting it has since been avoided because it adds extra
conditional checks in the gettime fastpath, which has a performance
overhead.
However, it was recently pointed out that ABS_TIME CLOCK_REALTIME
timers set for right after the leapsecond could fire a second early,
since some timers may be expired before we trigger the timekeeping
timer, which then applies the leapsecond.
This isn't quite as bad as it sounds, since behaviorally it is similar
to what is possible when ntpd makes leapsecond adjustments without using
the kernel discipline: due to latencies, timers may fire just
prior to the settimeofday() call. (Also, one should note that all
applications using CLOCK_REALTIME timers should always be careful,
since they are prone to quirks from settimeofday() disturbances.)
However, the purpose of having the kernel do the leap adjustment is to
avoid such latencies, so I think this is worth fixing.
So in order to properly keep those timers from firing a second early,
this patch modifies the ntp and timekeeping logic so that we keep
enough state so that the update_base_offsets_now accessor, which
provides the hrtimer core the current time, can check and apply the
leapsecond adjustment on the second edge. This prevents the hrtimer
core from expiring timers too early.
This patch does not modify any other time read path, so no additional
overhead is incurred. However, this also means that the leap-second
continues to be applied at tick time for all other read-paths.
Apologies to Richard Cochran, who pushed for similar changes years
ago, which I resisted due to the concerns about the performance
overhead.
While I suspect this isn't extremely critical, folks who care about
strict leap-second correctness will likely want to watch
this. Potentially a -stable candidate eventually.
Originally-suggested-by: Richard Cochran <[email protected]>
Reported-by: Daniel Bristot de Oliveira <[email protected]>
Reported-by: Prarit Bhargava <[email protected]>
Signed-off-by: John Stultz <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jiri Bohac <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Ingo Molnar <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
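A toy model of "apply the leap on the second edge in the accessor", under
the simplifying assumption of whole seconds and an insertion-type leap; it
only illustrates the repeated second, not the TIME_OOP reporting the entry
discusses.
#include <stdio.h>
static long long next_leap_sec = 100;           /* leap insertion pending at t=100s */
static long long read_time(long long raw_sec)
{
        if (next_leap_sec >= 0 && raw_sec >= next_leap_sec)
                return raw_sec - 1;             /* hold time back: repeat the last second */
        return raw_sec;
}
int main(void)
{
        for (long long t = 98; t <= 102; t++)
                printf("raw=%lld reported=%lld\n", t, read_time(t));
        return 0;
}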
|
|
Currently the leapsecond logic uses what looks like magic values.
Improve this by defining SECS_PER_DAY and using that macro
to make the logic more clear.
Signed-off-by: John Stultz <[email protected]>
Cc: Prarit Bhargava <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jiri Bohac <[email protected]>
Cc: Ingo Molnar <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
It was reported that 868a3e915f7f5eba (hrtimer: Make offset
update smarter) was causing timer problems after suspend/resume.
The problem with that change is the modification to
clock_was_set_seq in timekeeping_update is done prior to
mirroring the time state to the shadow-timekeeper. Thus the
next time we do update_wall_time() the updated sequence is
overwritten by what's in the shadow copy.
This patch moves the shadow-timekeeper mirroring to the end
of the function, after all updates have been made, so all data
is kept in sync.
(This patch also affects the update_fast_timekeeper calls which
were also problematically done prior to the mirroring).
Reported-and-tested-by: Jeremiah Mahler <[email protected]>
Signed-off-by: John Stultz <[email protected]>
Cc: Preeti U Murthy <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ring buffer benchmark buglet fix from Steven Rostedt:
"Wang Long fixed a minor bug in the module parameter for the ring
buffer benchmark, where the produce_fifo was being ignored and the
producer thread's priority was being set with the consumer_fifo
parameter"
* tag 'trace-rb-bm-fix-4.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ring-buffer-benchmark: Fix the wrong sched_priority of producer
|
|
The producer should use producer_fifo as its sched_priority,
so correct it.
Link: http://lkml.kernel.org/r/[email protected]
Cc: [email protected] # 2.6.33+
Signed-off-by: Wang Long <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
|
|
Jovi Zhangwei reported the following problem: the kernel VM bug below
can be triggered by tcpdump, which mmapped a lot of pages with the
GFP_COMP flag.
[Mon May 25 05:29:33 2015] page:ffffea0015414000 count:66 mapcount:1 mapping: (null) index:0x0
[Mon May 25 05:29:33 2015] flags: 0x20047580004000(head)
[Mon May 25 05:29:33 2015] page dumped because: VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page))
[Mon May 25 05:29:33 2015] ------------[ cut here ]------------
[Mon May 25 05:29:33 2015] kernel BUG at mm/migrate.c:1661!
[Mon May 25 05:29:33 2015] invalid opcode: 0000 [#1] SMP
In this case it was triggered by running tcpdump but it's not necessarily
reproducible on all systems.
sudo tcpdump -i bond0.100 'tcp port 4242' -c 100000000000 -w 4242.pcap
Compound pages cannot be migrated and it was not expected that such pages
be marked for NUMA balancing. This did not take into account that drivers
such as net/packet/af_packet.c may insert compound pages into userspace
with vm_insert_page. This patch tells the NUMA balancing protection
scanner to skip all VM_MIXEDMAP mappings which avoids the possibility that
compound pages are marked for migration.
Signed-off-by: Mel Gorman <[email protected]>
Reported-by: Jovi Zhangwei <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
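A sketch of the scanner rule described above; the flag bits are invented
for the demo and the check is reduced to its essence: VMAs without any
read/write/exec permission, or with VM_MIXEDMAP set, are skipped so pages
inserted by drivers are never marked for NUMA migration.
#include <stdbool.h>
#include <stdio.h>
#define VM_READ      0x1UL
#define VM_WRITE     0x2UL
#define VM_EXEC      0x4UL
#define VM_MIXEDMAP  0x8UL                      /* invented value; the real one differs */
static bool numa_scan_skip_vma(unsigned long vm_flags)
{
        if (!(vm_flags & (VM_READ | VM_WRITE | VM_EXEC)))
                return true;                    /* nothing accessible to scan */
        return (vm_flags & VM_MIXEDMAP) != 0;   /* may contain driver-inserted pages */
}
int main(void)
{
        printf("mixedmap: skip=%d\n", numa_scan_skip_vma(VM_READ | VM_MIXEDMAP));
        printf("ordinary: skip=%d\n", numa_scan_skip_vma(VM_READ | VM_WRITE));
        return 0;
}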
|
|
clocksource messages aren't prefixed in dmesg so it's a bit unclear
what subsystem emits the messages.
Use pr_fmt and pr_<level> to auto-prefix the messages appropriately.
Miscellanea:
o Remove "Warning" from KERN_WARNING level messages
o Align "timekeeping watchdog: " messages
o Coalesce formats
o Align multiline arguments
Signed-off-by: Joe Perches <[email protected]>
Cc: John Stultz <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
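The pr_fmt pattern the entry relies on, shown as a userspace stand-in (the
printf-based pr_warn is a mock; in the kernel the pr_<level> macros pick up
pr_fmt): defining pr_fmt before the printing helpers prefixes every message
with the subsystem name.
#include <stdio.h>
#define pr_fmt(fmt) "clocksource: " fmt
/* The ## before __VA_ARGS__ is a gcc/clang extension, as in the kernel. */
#define pr_warn(fmt, ...) printf(pr_fmt(fmt), ##__VA_ARGS__)
int main(void)
{
        pr_warn("timekeeping watchdog: clocksource '%s' marked unstable\n", "tsc");
        return 0;
}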
|
|
Refactor the usecs_to_jiffies conditional code part in time.c and
jiffies.h putting it into conditional functions rather than #ifdefs
to improve readability. This is analogous to the msecs_to_jiffies()
cleanup in commit ca42aaf0c861 ("time: Refactor msecs_to_jiffies")
Signed-off-by: Nicholas Mc Guire <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Sam Ravnborg <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Andrew Hunter <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Michal Marek <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
The MPX code can only work on the current task. You can not,
for instance, enable MPX management in another process or
thread. You can also not handle a fault for another process or
thread.
Despite this, we pass a task_struct around prolifically. This
patch removes all of the task struct passing for code paths
where the code can not deal with another task (which turns out
to be all of them).
This has no functional changes. It's just a cleanup.
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
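An illustration of the cleanup pattern, with invented types and body: a
function that can only ever act on the calling task stops taking a task
pointer and uses the equivalent of current internally.
#include <stdio.h>
struct task { int pid; };
static struct task current_task = { .pid = 42 };
#define current (&current_task)                 /* userspace stand-in */
/* Before: int mpx_enable_management(struct task *tsk); */
static int mpx_enable_management(void)
{
        printf("enabling MPX management for pid %d\n", current->pid);
        return 0;
}
int main(void)
{
        return mpx_enable_management();
}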
|
|
After enlarging the PEBS interrupt threshold, there may be some mixed up
PEBS samples which are discarded by the kernel.
This patch makes the kernel emit a PERF_RECORD_LOST_SAMPLES record with
the number of possible discarded records when it is impossible to demux
the samples.
It makes sure the user is not left in the dark about such discards.
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
When the PEBS interrupt threshold is larger than one record and the
machine supports multiple PEBS events, the records of these events are
mixed up and we need to demultiplex them.
Demuxing the records is hard because the hardware is deficient. The
hardware has two issues that, when combined, create impossible
scenarios to demux.
The first issue is that the 'status' field of the PEBS record is a copy
of the GLOBAL_STATUS MSR at PEBS assist time. To see why this is a
problem let us first describe the regular PEBS cycle:
A) the CTRn value reaches 0:
- the corresponding bit in GLOBAL_STATUS gets set
- we start arming the hardware assist
< some unspecified amount of time later -- this could cover multiple
events of interest >
B) the hardware assist is armed, any next event will trigger it
C) a matching event happens:
- the hardware assist triggers and generates a PEBS record
this includes a copy of GLOBAL_STATUS at this moment
- if we auto-reload we (re)set CTRn
- we clear the relevant bit in GLOBAL_STATUS
Now consider the following chain of events:
A0, B0, A1, C0
The event generated for counter 0 will include a status with counter 1
set, even though it's not at all related to the record. A similar thing
can happen with a !PEBS event if it just happens to overflow at the
right moment.
The second issue is that the hardware will only emit one record for two
or more counters if the event that triggers the assist is 'close'. The
'close' can be several cycles. In some cases even the complete assist,
if the event is something that doesn't need retirement.
For instance, consider this chain of events:
A0, B0, A1, B1, C01
Where C01 is an event that triggers both hardware assists, we will
generate but a single record, but again with both counters listed in the
status field.
This time the record pertains to both events.
Note that these two cases are different but indistinguishable with the
data as generated. Therefore demuxing records with multiple PEBS bits
(we can safely ignore status bits for !PEBS counters) is impossible.
Furthermore we cannot emit the record to both events because that might
cause a data leak -- the events might not have the same privileges -- so
what this patch does is discard such events.
The assumption/hope is that such discards will be rare.
Here are some possible ways you may get a high discard rate:
- when you count the same thing multiple times. But it is not a useful
configuration.
- you can be unfortunate if you measure with a userspace only PEBS
event along with either a kernel or unrestricted PEBS event. Imagine
the event triggering and setting the overflow flag right before
entering the kernel. Then all kernel side events will end up with
multiple bits set.
Signed-off-by: Yan, Zheng <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
[ Changelog improvements. ]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
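A userspace model of the discard rule described above; the masks and bit
layout are invented for the demo and do not match the real PEBS record: a
record whose status has more than one enabled PEBS counter bit set cannot
be attributed safely, so it is dropped (and would be counted as a lost
sample).
#include <stdint.h>
#include <stdio.h>
/* Returns the counter index, -1 for "nothing to attribute",
 * -2 for "ambiguous, discard". */
static int demux_record(uint64_t status, uint64_t pebs_enabled_mask)
{
        uint64_t bits = status & pebs_enabled_mask;
        if (bits == 0)
                return -1;
        if (bits & (bits - 1))                  /* more than one bit set */
                return -2;
        return __builtin_ctzll(bits);           /* unique counter index */
}
int main(void)
{
        uint64_t pebs_mask = 0x3;               /* counters 0 and 1 use PEBS */
        printf("%d\n", demux_record(0x1, pebs_mask));   /*  0: unique attribution */
        printf("%d\n", demux_record(0x3, pebs_mask));   /* -2: ambiguous, discard */
        printf("%d\n", demux_record(0x4, pebs_mask));   /* -1: only a non-PEBS bit */
        return 0;
}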
|
|
Changeset a43455a1d572 ("sched/numa: Ensure task_numa_migrate() checks
the preferred node") fixes an issue where workloads would never
converge on a fully loaded (or overloaded) system.
However, it introduces a regression on less than fully loaded systems,
where workloads converge on a few NUMA nodes, instead of properly
staying spread out across the whole system. This leads to a reduction
in available memory bandwidth, and usable CPU cache, with predictable
performance problems.
The root cause appears to be an interaction between the load balancer
and NUMA balancing, where the short term load represented by the load
balancer differs from the long term load the NUMA balancing code would
like to base its decisions on.
Simply reverting a43455a1d572 would re-introduce the non-convergence
of workloads on fully loaded systems, so that is not a good option. As
an aside, the check done before a43455a1d572 only applied to a task's
preferred node, not to other candidate nodes in the system, so the
converge-on-too-few-nodes problem still happens, just to a lesser
degree.
Instead, try to compensate for the impedance mismatch between the load
balancer and NUMA balancing by only ever considering a lesser loaded
node as a destination for NUMA balancing, regardless of whether the
task is trying to move to the preferred node, or to another node.
This patch also addresses the issue that a system with a single
runnable thread would never migrate that thread to near its memory,
introduced by 095bebf61a46 ("sched/numa: Do not move past the balance
point if unbalanced").
A test where the main thread creates a large memory area, and spawns a
worker thread to iterate over the memory (placed on another node by
select_task_rq_fair), after which the main thread goes to sleep and
waits for the worker thread to loop over all the memory, now sees the
worker thread migrated to where the memory is, instead of having all
the memory migrated over like before.
Jirka has run a number of performance tests on several systems: single
instance SpecJBB 2005 performance is 7-15% higher on a 4 node system,
with higher gains on systems with more cores per socket.
Multi-instance SpecJBB 2005 (one per node), linpack, and stream see
little or no changes with the revert of 095bebf61a46 and this patch.
Reported-by: Artem Bityutskiy <[email protected]>
Reported-by: Jirka Hladky <[email protected]>
Tested-by: Jirka Hladky <[email protected]>
Tested-by: Artem Bityutskiy <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
unbalanced")
Commit 095bebf61a46 ("sched/numa: Do not move past the balance point
if unbalanced") broke convergence of workloads with just one runnable
thread, by making it impossible for the one runnable thread on the
system to move from one NUMA node to another.
Instead, the thread would remain where it was, and pull all the memory
across to its location, which is much slower than just migrating the
thread to where the memory is.
The next patch has a better fix for the issue that 095bebf61a46 tried
to address.
Reported-by: Jirka Hladky <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The optimized task selection logic optimistically selects a new task
to run without first doing a full put_prev_task(). This is so that we
can avoid a put/set on the common ancestors of the old and new task.
Similarly, we should only call check_cfs_rq_runtime() to throttle
eligible groups if they're part of the common ancestry, otherwise it
is possible to end up with no eligible task in the simple task
selection.
Imagine:
          /root
  /prev          /next
  /A             /B
If our optimistic selection ends up throttling /next, we goto simple
and our put_prev_task() ends up throttling /prev, after which we're
going to bug out in set_next_entity() because there aren't any tasks
left.
Avoid this scenario by only throttling common ancestors.
Reported-by: Mohammed Naser <[email protected]>
Reported-by: Konstantin Khlebnikov <[email protected]>
Signed-off-by: Ben Segall <[email protected]>
[ munged Changelog ]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Fixes: 678d5718d8d0 ("sched/fair: Optimize cgroup pick_next_task_fair()")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
preempt_schedule_context() is a tracing safe preemption point but it's
only used when CONFIG_CONTEXT_TRACKING=y. Other configs have tracing
recursion issues since commit:
b30f0e3ffedf ("sched/preempt: Optimize preemption operations on __schedule() callers")
introduced function-based preempt_count_*() ops.
Let's make it available on all configs and give it a more appropriate
name for its new position.
Reported-by: Fengguang Wu <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Since function tracing disables preemption, it needs a safe preemption
point to use when preemption is re-enabled without worrying about tracing
recursion. Ie: to avoid tracing recursion, that preemption point can't
be traced (use of notrace qualifier) and it can't call any traceable
function before that preemption point disables preemption itself, which
disarms the recursion.
preempt_schedule() was fine until commit:
b30f0e3ffedf ("sched/preempt: Optimize preemption operations on __schedule() callers")
because PREEMPT_ACTIVE (which has the property of disabling preemption
and thus disarming tracing preemption recursion) was set before calling
any further function.
But that commit introduced the use of preempt_count_add/sub() functions
to set PREEMPT_ACTIVE and because these functions are called before
preemption gets a chance to be disabled, we have a tracing recursion.
preempt_schedule_context() is one of the possible preemption functions
used by tracing. Its special purpose is to avoid tracing recursion
against context tracking. Let's enhance this function to become more
generally tracing-safe by disabling preemption with raw accessors, such
that no function is called before preemption gets disabled, which disarms
the tracing recursion.
This function is going to become the specific tracing-safe preemption
point in a further commit.
Reported-by: Fengguang Wu <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|