path: root/kernel
Age | Commit message | Author | Files | Lines
2011-10-06 | sched: Use resched IPI to kick off the nohz idle balance | Suresh Siddha | 2 | -22/+28
Current use of smp call function to kick the nohz idle balance can deadlock in this scenario:

1. cpu-A did a generic_exec_single() to cpu-B and, after queuing its call single data (csd) to the call single queue, cpu-A took a timer interrupt. The actual IPI to cpu-B to process the call single queue is not yet sent.
2. As part of the timer interrupt handler, cpu-A decided to kick cpu-B for the idle load balancing (sets cpu-B's rq->nohz_balance_kick to 1), and __smp_call_function_single() with nowait will queue the csd to cpu-B's queue. But generic_exec_single() won't send an IPI to cpu-B, as the call single queue was not empty.
3. cpu-A is busy with a lot of interrupts.
4. Meanwhile cpu-B is entering and exiting idle and notices that it has its rq->nohz_balance_kick set to '1'. So it will go ahead and do the idle load balancing and clear its rq->nohz_balance_kick.
5. At this point, the csd queued as part of step 2 above is still locked and waiting to be serviced on cpu-B.
6. cpu-A is still busy with interrupt load, and now it gets another timer interrupt and as part of it decides to kick cpu-B for another idle load balancing (as it finds cpu-B's rq->nohz_balance_kick cleared in step 4 above) and does __smp_call_function_single() with the same csd that is still locked.
7. And we get a deadlock waiting for the csd_lock() in __smp_call_function_single().

The main issue here is that cpu-B can service the idle load balancer kick request from cpu-A even without receiving the IPI, and this leads to doing multiple __smp_call_function_single() calls on the same csd, leading to deadlock.

To kick a cpu, the scheduler already has the reschedule vector reserved. Use that mechanism (kick_process()) instead of the generic smp call function mechanism to kick off the nohz idle load balancing and avoid the deadlock.

[ This issue is present from 2.6.35+ kernels, but marking it -stable only from v3.0+ as the proposed fix depends on the scheduler_ipi() that was introduced recently. ]

Reported-by: Prarit Bhargava <[email protected]> Signed-off-by: Suresh Siddha <[email protected]> Cc: [email protected] # v3.0+ Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
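The scenario above can be replayed in a small single-threaded userspace model. This is only a sketch: `csd_model`, `queue_model`, `enqueue_csd()`, `service_queue()` and `kick_with_csd()` are hypothetical names invented for illustration, not kernel APIs, and the "deadlock" is modeled as a `false` return where the real code would spin in csd_lock().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of call single data: locked from queueing until serviced. */
struct csd_model {
	bool locked;
	struct csd_model *next;
};

/* Hypothetical model of cpu-B's call single queue. */
struct queue_model {
	struct csd_model *head;
};

/* Queue a csd. Returns true if the caller must send an IPI (queue was empty). */
bool enqueue_csd(struct queue_model *q, struct csd_model *c)
{
	bool need_ipi = (q->head == NULL);

	c->locked = true;
	c->next = q->head;
	q->head = c;
	return need_ipi;
}

/* What cpu-B does when the IPI finally arrives: service and unlock every csd. */
void service_queue(struct queue_model *q)
{
	while (q->head) {
		struct csd_model *c = q->head;

		q->head = c->next;
		c->locked = false;
	}
}

/*
 * Old-style kick that reuses a single csd. Returns false where the real
 * code would wait forever in csd_lock() because the csd is still locked.
 */
bool kick_with_csd(struct queue_model *q, struct csd_model *kick, bool *need_ipi)
{
	if (kick->locked)
		return false;
	*need_ipi = enqueue_csd(q, kick);
	return true;
}
```

In this model the second kick fails because nothing ever told cpu-B to drain its queue. A reschedule IPI carries no per-call data to lock, which is presumably why kicking via the resched vector sidesteps the problem.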
2011-10-05 | rtmutex: Add missing rcu_read_unlock() in debug_rt_mutex_print_deadlock() | Thomas Gleixner | 1 | -1/+3
Signed-off-by: Thomas Gleixner <[email protected]>
2011-10-04 | genirq: Fix fatfinered fixup really | Thomas Gleixner | 1 | -1/+1
Putting the argument inside the quote does not really help. Signed-off-by: Thomas Gleixner <[email protected]>
2011-10-04 | sched: Fix idle_cpu() | Thomas Gleixner | 1 | -1/+14
On -rt we observed hackbench waking all 400 tasks to a single cpu. This is because of select_idle_sibling()'s interaction with the new ipi based wakeup scheme. The existing idle_cpu() test only checks to see if the current task on that cpu is the idle task; it does not take already queued tasks into account, nor does it take queued to be woken tasks into account. If the remote wakeup IPIs come hard enough, there won't be time to schedule away from the idle task, and select_idle_sibling() would thus keep thinking the cpu was in fact idle, regardless of the fact that there were already several hundred tasks runnable. We couldn't reproduce on mainline, but there's no reason it couldn't happen. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
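The difference between the old and the stricter idle test can be sketched in userspace as follows. The `rq_model` structure and both function names are hypothetical stand-ins for the kernel's runqueue and idle_cpu(); the sketch only assumes the three conditions the message describes (current task, queued tasks, pending remote wakeups).

```c
#include <stdbool.h>
#include <stddef.h>

struct task_model { int pid; };

/* Hypothetical, simplified stand-in for a per-CPU runqueue. */
struct rq_model {
	struct task_model *curr;       /* task currently on the CPU */
	struct task_model *idle;       /* this CPU's idle task */
	unsigned int nr_running;       /* tasks already queued to run */
	struct task_model *wake_list;  /* remote wakeups not yet processed */
};

/* Old test: only asks whether the idle task is currently running. */
bool idle_cpu_old(const struct rq_model *rq)
{
	return rq->curr == rq->idle;
}

/* Stricter test: also rejects CPUs with queued or pending-wakeup tasks. */
bool idle_cpu_fixed(const struct rq_model *rq)
{
	if (rq->curr != rq->idle)
		return false;
	if (rq->nr_running)
		return false;
	if (rq->wake_list != NULL)
		return false;
	return true;
}
```

With a pending wakeup on `wake_list`, the old test still reports the CPU as idle, so every further wakeup would be steered to the same CPU; the stricter test reports it as busy.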
2011-10-04 | sched: Convert to struct llist | Peter Zijlstra | 1 | -38/+10
Use the generic llist primitives. We had a private lockless list implementation in the scheduler in the wake-list code, now that we have a generic llist implementation that provides all required operations, switch to it. This patch is not expected to change any behavior. Signed-off-by: Peter Zijlstra <[email protected]> Cc: Huang Ying <[email protected]> Cc: Andrew Morton <[email protected]> Link: http://lkml.kernel.org/r/1315836353.26517.42.camel@twins Signed-off-by: Ingo Molnar <[email protected]>
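For readers unfamiliar with lock-less lists, the two operations such a wake list needs (push one node, take the whole list) can be sketched with C11 atomics. This is a userspace illustration, not the kernel's llist code: `lockless_list`, `list_init()`, `list_push()` and `list_take_all()` are names made up for this sketch.

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
	struct node *next;
	int value;
};

struct lockless_list {
	_Atomic(struct node *) first;
};

void list_init(struct lockless_list *l)
{
	atomic_init(&l->first, (struct node *)NULL);
}

/* Push one node at the head; retries if another thread won the race. */
void list_push(struct lockless_list *l, struct node *n)
{
	struct node *old = atomic_load(&l->first);

	do {
		n->next = old;  /* 'old' is refreshed by a failed compare-exchange */
	} while (!atomic_compare_exchange_weak(&l->first, &old, n));
}

/* Detach the whole list in one atomic step; the caller then owns every node. */
struct node *list_take_all(struct lockless_list *l)
{
	return atomic_exchange(&l->first, (struct node *)NULL);
}
```

Nodes come back in last-in, first-out order. Taking the whole list at once, rather than popping single nodes, avoids the ABA hazards a lock-less single-node pop would have to deal with.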
2011-10-04 | llist: Add llist_next() | Peter Zijlstra | 1 | -1/+1
So we don't have to expose the struct list_node member. Cc: Huang Ying <[email protected]> Cc: Andrew Morton <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1315836348.26517.41.camel@twins Signed-off-by: Ingo Molnar <[email protected]>
2011-10-04 | irq_work: Use llist in the struct irq_work logic | Huang Ying | 1 | -58/+33
Use llist in irq_work instead of the lock-less linked list implementation in irq_work to avoid the code duplication. Signed-off-by: Huang Ying <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2011-10-04 | Merge branch 'linus' into sched/core | Ingo Molnar | 10 | -53/+43
Merge reason: pick up the latest fixes. Signed-off-by: Ingo Molnar <[email protected]>
2011-10-03 | ipv4: NET_IPV4_ROUTE_GC_INTERVAL removal | Vasily Averin | 1 | -1/+1
Remove the obsolete sysctl; the ip_rt_gc_interval variable has not been used since 2.6.38. Signed-off-by: Vasily Averin <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2011-10-03 | genirq: percpu: allow interrupt type to be set at enable time | Marc Zyngier | 1 | -1/+14
As request_percpu_irq() doesn't allow for a percpu interrupt to have its type configured (it is generally impossible to configure it on all CPUs at once), add a 'type' argument to enable_percpu_irq(). This allows some low-level, board specific init code to be switched to a generic API. [ tglx: Added WARN_ON argument ] Signed-off-by: Marc Zyngier <[email protected]> Cc: Abhijeet Dharmapurikar <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]>
2011-10-03 | genirq: Add support for per-cpu dev_id interrupts | Marc Zyngier | 5 | -22/+302
The ARM GIC interrupt controller offers per CPU interrupts (PPIs), which are usually used to connect local timers to each core. Each CPU has its own private interface to the GIC, and only sees the PPIs that are directly connected to it.

While these timers are separate devices and have a separate interrupt line to a core, they all use the same IRQ number.

For these devices, request_irq() is not the right API as it assumes that an IRQ number is visible by a number of CPUs (through the affinity setting), but makes it very awkward to express that an IRQ number can be handled by all CPUs, and yet be a different interrupt line on each CPU, requiring a different dev_id cookie to be passed back to the handler.

The *_percpu_irq() functions are designed to overcome these limitations, by providing a per-cpu dev_id vector:

    int request_percpu_irq(unsigned int irq, irq_handler_t handler, const char *devname, void __percpu *percpu_dev_id);
    void free_percpu_irq(unsigned int, void __percpu *);
    int setup_percpu_irq(unsigned int irq, struct irqaction *new);
    void remove_percpu_irq(unsigned int irq, struct irqaction *act);
    void enable_percpu_irq(unsigned int irq);
    void disable_percpu_irq(unsigned int irq);

The API has a number of limitations:
- no interrupt sharing
- no threading
- common handler across all the CPUs

Once the interrupt is requested using setup_percpu_irq() or request_percpu_irq(), it must be enabled by each core that wishes its local interrupt to be delivered.

Based on an initial patch by Thomas Gleixner.

Signed-off-by: Marc Zyngier <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2011-10-01 | Merge branches 'irq-urgent-for-linus', 'x86-urgent-for-linus' and 'sched-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip | Linus Torvalds | 4 | -30/+11
* 'irq-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
  irq: Fix check for already initialized irq_domain in irq_domain_add
  irq: Add declaration of irq_domain_simple_ops to irqdomain.h
* 'x86-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
  x86/rtc: Don't recursively acquire rtc_lock
* 'sched-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
  posix-cpu-timers: Cure SMP wobbles
  sched: Fix up wchan borkage
  sched/rt: Migrate equal priority tasks to available CPUs
2011-10-01 | Merge branch 'rcu/next' of git://github.com/paulmckrcu/linux into core/rcu | Ingo Molnar | 14 | -425/+599
2011-09-30 | posix-cpu-timers: Cure SMP wobbles | Peter Zijlstra | 2 | -26/+3
David reported:

Attached below is a watered-down version of rt/tst-cpuclock2.c from GLIBC. Just build it with "gcc -o test test.c -lpthread -lrt" or similar.

Run it several times, and you will see cases where the main thread will measure a process clock difference before and after the nanosleep which is smaller than the cpu-burner thread's individual thread clock difference. This doesn't make any sense since the cpu-burner thread is part of the top-level process's thread group.

I've reproduced this on both x86-64 and sparc64 (using both 32-bit and 64-bit binaries).

For example:

    [davem@boricha build-x86_64-linux]$ ./test
    process: before(0.001221967) after(0.498624371) diff(497402404)
    thread: before(0.000081692) after(0.498316431) diff(498234739)
    self: before(0.001223521) after(0.001240219) diff(16698)
    [davem@boricha build-x86_64-linux]$

The diff of 'process' should always be >= the diff of 'thread'.

I make sure to wrap the 'thread' clock measurements the most tightly around the nanosleep() call, and that the 'process' clock measurements are the outer-most ones.

---
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <fcntl.h>
#include <string.h>
#include <errno.h>
#include <pthread.h>

static pthread_barrier_t barrier;

static void *chew_cpu(void *arg)
{
	pthread_barrier_wait(&barrier);
	while (1)
		__asm__ __volatile__("" : : : "memory");
	return NULL;
}

int main(void)
{
	clockid_t process_clock, my_thread_clock, th_clock;
	struct timespec process_before, process_after;
	struct timespec me_before, me_after;
	struct timespec th_before, th_after;
	struct timespec sleeptime;
	unsigned long diff;
	pthread_t th;
	int err;

	err = clock_getcpuclockid(0, &process_clock);
	if (err)
		return 1;

	err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
	if (err)
		return 1;

	pthread_barrier_init(&barrier, NULL, 2);
	err = pthread_create(&th, NULL, chew_cpu, NULL);
	if (err)
		return 1;

	err = pthread_getcpuclockid(th, &th_clock);
	if (err)
		return 1;

	pthread_barrier_wait(&barrier);

	err = clock_gettime(process_clock, &process_before);
	if (err)
		return 1;

	err = clock_gettime(my_thread_clock, &me_before);
	if (err)
		return 1;

	err = clock_gettime(th_clock, &th_before);
	if (err)
		return 1;

	sleeptime.tv_sec = 0;
	sleeptime.tv_nsec = 500000000;
	nanosleep(&sleeptime, NULL);

	err = clock_gettime(th_clock, &th_after);
	if (err)
		return 1;

	err = clock_gettime(my_thread_clock, &me_after);
	if (err)
		return 1;

	err = clock_gettime(process_clock, &process_after);
	if (err)
		return 1;

	diff = process_after.tv_nsec - process_before.tv_nsec;
	printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
	       process_before.tv_sec, process_before.tv_nsec,
	       process_after.tv_sec, process_after.tv_nsec, diff);
	diff = th_after.tv_nsec - th_before.tv_nsec;
	printf("thread: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
	       th_before.tv_sec, th_before.tv_nsec,
	       th_after.tv_sec, th_after.tv_nsec, diff);
	diff = me_after.tv_nsec - me_before.tv_nsec;
	printf("self: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
	       me_before.tv_sec, me_before.tv_nsec,
	       me_after.tv_sec, me_after.tv_nsec, diff);

	return 0;
}

This is due to us using p->se.sum_exec_runtime in thread_group_cputime() where we iterate the thread group and sum all data. This does not take time since the last schedule operation (tick or otherwise) into account. We can cure this by using task_sched_runtime() at the cost of having to take locks.

This also means we can (and must) do away with thread_group_sched_runtime() since the modified thread_group_cputime() is now more accurate and would deadlock when called from thread_group_sched_runtime().

Aside of that it makes the function safe on 32 bit systems. The old code added t->se.sum_exec_runtime unprotected. sum_exec_runtime is a 64bit value and could be changed on another cpu at the same time.

Reported-by: David Miller <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins Tested-by: David Miller <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]>
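The accounting gap can be illustrated with a small model. This is a sketch under stated assumptions: `thread_model` and the three functions are hypothetical, time is a plain nanosecond counter, and locking is left out entirely. A per-thread read adds the time since the last scheduler update, while the stale group sum only adds up what has already been accounted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-thread accounting state, in nanoseconds. */
struct thread_model {
	uint64_t sum_exec_runtime;  /* runtime accounted at the last update */
	uint64_t exec_start;        /* timestamp of that update */
	bool running;               /* currently on a CPU? */
};

/* Per-thread clock: accounted time plus the not-yet-accounted delta. */
uint64_t thread_runtime(const struct thread_model *t, uint64_t now)
{
	uint64_t ns = t->sum_exec_runtime;

	if (t->running)
		ns += now - t->exec_start;
	return ns;
}

/* Old-style group sum: ignores time since the last update. */
uint64_t group_runtime_stale(const struct thread_model *t, size_t n)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += t[i].sum_exec_runtime;
	return sum;
}

/* Group sum built from the per-thread clock, so it can never lag behind it. */
uint64_t group_runtime_fresh(const struct thread_model *t, size_t n, uint64_t now)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += thread_runtime(&t[i], now);
	return sum;
}
```

With a sleeping main thread that has accumulated 1000 ns and a burner thread that has been running unaccounted for 500 ns, the stale sum reports 1000 ns for the group while the burner alone reports 500 ns of new time, which is the same kind of inconsistency the test program exposes.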
2011-09-29 | Resource: fix wrong resource window calculation | Ram Pai | 1 | -1/+6
__find_resource() incorrectly returns a resource window which overlaps an existing allocated window. This happens when the parent's resource-window spans 0x00000000 to 0xffffffff and is entirely allocated to all its children resource-windows. __find_resource() looks for gaps in resource allocation among the children resource windows. When it encounters the last child window it blindly tries the range next to one allocated to the last child. Since the last child's window ends at 0xffffffff the calculation overflows, leading the algorithm to believe that any window in the range 0x00000000 to 0xffffffff is available for allocation. This leads to a conflicting window allocation. Michal Ludvig reported this issue seen on his platform. The following patch fixes the problem and has been verified by Michal. I believe this bug has been there for ages. It got exposed by git commit 2bbc6942273b ("PCI : ability to relocate assigned pci-resources") Signed-off-by: Ram Pai <[email protected]> Tested-by: Michal Ludvig <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
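The overflow is easy to reproduce with 32-bit arithmetic. The sketch below is not the kernel's __find_resource(): `window`, `next_start_unchecked()` and `find_gap()` are hypothetical names, windows are inclusive ranges, children are assumed sorted, non-overlapping and inside the root, and `size` is assumed to be at least 1.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive address range [start, end]. */
struct window {
	uint32_t start;
	uint32_t end;
};

/* The unchecked step: wraps to 0 when the window ends at 0xffffffff. */
uint32_t next_start_unchecked(struct window w)
{
	return w.end + 1;
}

/* Find the first free gap of 'size' addresses between the children of root. */
bool find_gap(struct window root, const struct window *child, size_t n,
	      uint32_t size, struct window *out)
{
	uint32_t start = root.start;
	size_t i;

	for (i = 0; i < n; i++) {
		if (child[i].start > start && child[i].start - start >= size) {
			out->start = start;
			out->end = start + (size - 1);
			return true;
		}
		/* Stop before computing end + 1 when nothing can follow this child. */
		if (child[i].end == root.end)
			return false;
		start = child[i].end + 1;
	}
	if (root.end - start >= size - 1) {
		out->start = start;
		out->end = start + (size - 1);
		return true;
	}
	return false;
}
```

The early return when a child ends exactly at the root's end is what keeps the search from wrapping around to address 0 and treating the whole range as free.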
2011-09-29 | user namespace: usb: make usb urbs user namespace aware (v2) | Serge Hallyn | 1 | -8/+16
Add to the dev_state and alloc_async structures the user namespace corresponding to the uid and euid. Pass these to kill_pid_info_as_uid(), which can then implement a proper, user-namespace-aware uid check. Changelog: Sep 20: Per Oleg's suggestion: Instead of caching and passing user namespace, uid, and euid each separately, pass a struct cred. Sep 26: Address Alan Stern's comments: don't define a struct cred at usbdev_open(), and take and put a cred at async_completed() to ensure it lasts for the duration of kill_pid_info_as_cred(). Signed-off-by: Serge Hallyn <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2011-09-29 | PM / Tracing: build rpm-traces.c only if CONFIG_PM_RUNTIME is set | Ming Lei | 1 | -0/+2
Do not build kernel/trace/rpm-traces.c if CONFIG_PM_RUNTIME is not set, which avoids a build failure. [rjw: Added the changelog and modified the subject slightly.] Signed-off-by: Ming Lei <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2011-09-28 | rcu: Move propagation of ->completed from rcu_start_gp() to rcu_report_qs_rsp() | Paul E. McKenney | 1 | -20/+51
It is possible for the CPU that noted the end of the prior grace period to not need a new one, and therefore to decide to propagate ->completed throughout the rcu_node tree without starting another grace period. However, in so doing, it releases the root rcu_node structure's lock, which can allow some other CPU to start another grace period. The first CPU will be propagating ->completed in parallel with the second CPU initializing the rcu_node tree for the new grace period. In theory this is harmless, but in practice we need to keep things simple. This commit therefore moves the propagation of ->completed to rcu_report_qs_rsp(), and refrains from marking the old grace period as having been completed until it has finished doing this. This prevents anyone from starting a new grace period concurrently with marking the old grace period as having been completed. Of course, the optimization where a CPU needing a new grace period doesn't bother marking the old one completed is still in effect: In that case, the marking happens implicitly as part of initializing the new grace period. Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Remove rcu_needs_cpu_flush() to avoid false quiescent states | Paul E. McKenney | 3 | -29/+0
The purpose of rcu_needs_cpu_flush() was to iterate on pushing the current grace period in order to help the current CPU enter dyntick-idle mode. However, this can result in failures if the CPU starts entering dyntick-idle mode, but then backs out. In this case, the call to rcu_pending() from rcu_needs_cpu_flush() might end up announcing a non-existing quiescent state. This commit therefore removes rcu_needs_cpu_flush() in favor of letting the dyntick-idle machinery at the end of the softirq handler push the loop along via its call to rcu_pending(). Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Wire up RCU_BOOST_PRIO for rcutree | Mike Galbraith | 2 | -7/+15
RCU boost threads start life at RCU_BOOST_PRIO, while others remain at RCU_KTHREAD_PRIO. While here, change thread names to match other kthreads, and adjust rcu_yield() to not override the priority set by the user. This last change sets the stage for runtime changes to priority in the -rt tree. Signed-off-by: Mike Galbraith <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Make rcu_torture_boost() exit loops at end of test | Paul E. McKenney | 1 | -1/+2
One of the loops in rcu_torture_boost() fails to check kthread_should_stop(), and thus might be slowing or even stopping completion of rcutorture tests at rmmod time. This commit adds the kthread_should_stop() check to the offending loop. Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Make rcu_torture_fqs() exit loops at end of test | Paul E. McKenney | 1 | -4/+6
The rcu_torture_fqs() function can prevent the rcutorture tests from completing, resulting in a hang. This commit therefore ensures that rcu_torture_fqs() will exit its inner loops at the end of the test, and also applies the newish ULONG_CMP_LT() macro to time comparisons. Signed-off-by: Paul E. McKenney <[email protected]>
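A wrap-safe time comparison of the kind ULONG_CMP_LT() provides can be sketched as below. The macro is written in the form I believe the kernel uses, but treat that as an assumption; `deadline_not_reached()` is a hypothetical helper added for illustration.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * True if a is "before" b in modular arithmetic: the unsigned difference
 * a - b lands in the upper half of the range only when a trails b.
 */
#define ULONG_CMP_LT(a, b) (ULONG_MAX / 2 < (a) - (b))

/* Hypothetical helper: has 'now' not yet reached 'deadline'? */
bool deadline_not_reached(unsigned long now, unsigned long deadline)
{
	return ULONG_CMP_LT(now, deadline);
}
```

A plain `now < deadline` gives the wrong answer once the deadline has wrapped past ULONG_MAX while `now` has not; the subtraction-based form stays correct as long as the two values are less than half the counter range apart.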
2011-09-28 | rcu: Permit rt_mutex_unlock() with irqs disabled | Paul E. McKenney | 2 | -0/+13
Create a separate lockdep class for the rt_mutex used for RCU priority boosting and enable use of rt_mutex_unlock() with irqs disabled. This prevents RCU priority boosting from falling prey to deadlocks when someone begins an RCU read-side critical section in preemptible state, but releases it with an irq-disabled lock held. Unfortunately, the scheduler's runqueue and priority-inheritance locks still must either completely enclose or be completely enclosed by any overlapping RCU read-side critical section. This version removes a redundant local_irq_restore() noted by Yong Zhang. Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Avoid having just-onlined CPU resched itself when RCU is idle | Paul E. McKenney | 1 | -3/+8
CPUs set rdp->qs_pending when coming online to resolve races with grace-period start. However, this means that if RCU is idle, the just-onlined CPU might needlessly send itself resched IPIs. Adjust the online-CPU initialization to avoid this, and also to correctly cause the CPU to respond to the current grace period if needed. Signed-off-by: Paul E. McKenney <[email protected]> Tested-by: Josh Boyer <[email protected]> Tested-by: Christian Hoffmann <[email protected]>
2011-09-28 | rcu: Suppress NMI backtraces when stall ends before dump | Paul E. McKenney | 3 | -9/+19
It is possible for an RCU CPU stall to end just as it is detected, in which case the current code will uselessly dump all CPUs' stacks. This commit therefore checks for this condition and refrains from sending needless NMIs. And yes, the stall might also end just after we checked all CPUs and tasks, but in that case we would at least have given some clue as to which CPU/task was at fault. Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Prohibit grace periods during early boot | Paul E. McKenney | 1 | -2/+5
Greater use of RCU during early boot (before the scheduler is operating) is causing RCU to attempt to start grace periods during that time, which in turn is resulting in both RCU and the callback functions attempting to use the scheduler before it is ready. This commit prevents these problems by prohibiting RCU grace periods until after the scheduler has spawned the first non-idle task. Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Simplify unboosting checks | Paul E. McKenney | 1 | -10/+10
Commit 7765be (Fix RCU_BOOST race handling current->rcu_read_unlock_special) introduced a new ->rcu_boosted field in the task structure. This is redundant because the existing ->rcu_boost_mutex will be non-NULL at any time that ->rcu_boosted is nonzero. Therefore, this commit removes ->rcu_boosted and tests ->rcu_boost_mutex instead. Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Prevent early boot set_need_resched() from __rcu_pending() | Paul E. McKenney | 1 | -1/+2
There isn't a whole lot of point in poking the scheduler before there are other tasks to switch to. This commit therefore adds a check for rcu_scheduler_fully_active in __rcu_pending() to suppress any pre-scheduler calls to set_need_resched(). The downside of this approach is additional runtime overhead in a reasonably hot code path. Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Dump local stack if cannot dump all CPUs' stacks | Paul E. McKenney | 1 | -2/+4
The trigger_all_cpu_backtrace() function is a no-op in architectures that do not define arch_trigger_all_cpu_backtrace. On such architectures, RCU CPU stall warning messages contain no stack trace information, which makes debugging quite difficult. This commit therefore substitutes dump_stack() for architectures that do not define arch_trigger_all_cpu_backtrace, so that at least the local CPU's stack is dumped as part of the RCU CPU stall warning message. Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Move __rcu_read_unlock()'s barrier() within if-statement | Paul E. McKenney | 1 | -1/+1
We only need to constrain the compiler if we are actually exiting the top-level RCU read-side critical section. This commit therefore moves the first barrier() call in __rcu_read_unlock() to inside the "if" statement, thus avoiding needless register flushes for inner rcu_read_unlock() calls. Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Make rcu_implicit_dynticks_qs() locals be correct size | Paul E. McKenney | 1 | -5/+5
When the ->dynticks field in the rcu_dynticks structure changed to an atomic_t, its size on 64-bit systems changed from 64 bits to 32 bits. The local variables in rcu_implicit_dynticks_qs() need to change as well, hence this commit. Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Eliminate in_irq() checks in rcu_enter_nohz() | Paul E. McKenney | 1 | -7/+0
The in_irq() check in rcu_enter_nohz() is redundant because if we really are in an interrupt, the attempt to re-enter dyntick-idle mode will invoke rcu_needs_cpu() in any case, which will force the check for RCU callbacks. So this commit removes the check along with the set_need_resched(). Suggested-by: Frederic Weisbecker <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | nohz: Remove nohz_cpu_mask | Shi, Alex | 2 | -17/+0
RCU no longer uses this global variable, nor does anyone else. This commit therefore removes this variable. This reduces memory footprint and also removes some atomic instructions and memory barriers from the dyntick-idle path. Signed-off-by: Alex Shi <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Allow rcutorture's stat_interval parameter to be changed at runtime | Paul E. McKenney | 1 | -1/+1
When rcutorture is compiled directly into the kernel (instead of separately as a module), it is necessary to specify rcutorture.stat_interval as a kernel command-line parameter, otherwise, the rcu_torture_stats kthread is never started. However, when working with the system after it has booted, it is convenient to be able to change the time between statistic printing, particularly when logged into the console. This commit therefore allows the stat_interval parameter to be changed at runtime. Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Simplify quiescent-state accounting | Paul E. McKenney | 4 | -32/+32
There is often a delay between the time that a CPU passes through a quiescent state and the time that this quiescent state is reported to the RCU core. It is quite possible that the grace period ended before the quiescent state could be reported, for example, some other CPU might have deduced that this CPU passed through dyntick-idle mode. It is critically important that quiescent state be counted only against the grace period that was in effect at the time that the quiescent state was detected. Previously, this was handled by recording the number of the last grace period to complete when passing through a quiescent state. The RCU core then checks this number against the current value, and rejects the quiescent state if there is a mismatch. However, one additional possibility must be accounted for, namely that the quiescent state was recorded after the prior grace period completed but before the current grace period started. In this case, the RCU core must reject the quiescent state, but the recorded number will match. This is handled when the CPU becomes aware of a new grace period -- at that point, it invalidates any prior quiescent state. This works, but is a bit indirect. The new approach records the current grace period, and the RCU core checks to see (1) that this is still the current grace period and (2) that this grace period has not yet ended. This approach simplifies reasoning about correctness, and this commit changes over to this new approach. Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
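The two-part check described above can be sketched as follows. The structures and function names are hypothetical simplifications, not the kernel's rcu_state or rcu_data; the sketch assumes a grace period is in progress exactly when `gpnum` differs from `completed`.

```c
#include <stdbool.h>

/* Hypothetical global grace-period state. */
struct gp_state {
	unsigned long gpnum;      /* number of the most recently started grace period */
	unsigned long completed;  /* number of the most recently completed one */
};

/* Hypothetical per-CPU record of a detected quiescent state. */
struct cpu_qs {
	bool passed;
	unsigned long gpnum;      /* grace period current when it was detected */
};

/* Record a quiescent state against the grace period current right now. */
void note_quiescent_state(struct cpu_qs *qs, const struct gp_state *gp)
{
	qs->passed = true;
	qs->gpnum = gp->gpnum;
}

/* Accept only if (1) that grace period is still current and (2) it has not ended. */
bool quiescent_state_counts(const struct cpu_qs *qs, const struct gp_state *gp)
{
	return qs->passed &&
	       qs->gpnum == gp->gpnum &&
	       gp->completed != gp->gpnum;
}
```

A quiescent state recorded in the gap between two grace periods carries the old number, so it is rejected by the first comparison once the next grace period starts, without any separate invalidation step.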
2011-09-28 | rcu: Add grace-period, quiescent-state, and call_rcu trace events | Paul E. McKenney | 5 | -10/+68
Add trace events to record grace-period start and end, quiescent states, CPUs noticing grace-period start and end, grace-period initialization, call_rcu() invocation, tasks blocking in RCU read-side critical sections, tasks exiting those same critical sections, force_quiescent_state() detection of dyntick-idle and offline CPUs, CPUs entering and leaving dyntick-idle mode (except from NMIs), CPUs coming online and going offline, and CPUs being kicked for staying in dyntick-idle mode for too long (as in many weeks, even on 32-bit systems). Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> rcu: Add the rcu flavor to callback trace events The earlier trace events for registering RCU callbacks and for invoking them did not include the RCU flavor (rcu_bh, rcu_preempt, or rcu_sched). This commit adds the RCU flavor to those trace events. Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Make TINY_RCU also use softirq for RCU_BOOST=n | Paul E. McKenney | 2 | -91/+93
This patch #ifdefs TINY_RCU kthreads out of the kernel unless RCU_BOOST=y, thus eliminating context-switch overhead if RCU priority boosting has not been configured. Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Add event-trace markers to TREE_RCU kthreads | Paul E. McKenney | 1 | -0/+12
Add event-trace markers to TREE_RCU kthreads to allow including these kthreads' CPU time in the utilization calculations. Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Move RCU_BOOST declarations to allow compiler checking | Paul E. McKenney | 2 | -5/+7
Andi Kleen noticed that one of the RCU_BOOST data declarations was out of sync with the definition. Move the declarations so that the compiler can do the checking in the future. Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Update comments to reflect softirqs vs. kthreads | Paul E. McKenney | 2 | -12/+14
We now have kthreads only for flavors of RCU that support boosting, so update the now-misleading comments accordingly. Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Add RCU type to callback-invocation tracing | Paul E. McKenney | 2 | -8/+8
Add a string to the rcu_batch_start() and rcu_batch_end() trace messages that indicates the RCU type ("rcu_sched", "rcu_bh", or "rcu_preempt"). The trace messages for the actual invocations themselves are not marked, as it should be clear from the rcu_batch_start() and rcu_batch_end() events before and after. Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Put names into TINY_RCU structures under RCU_TRACE | Paul E. McKenney | 5 | -27/+18
In order to allow event tracing to distinguish between flavors of RCU, we need those names in the relevant RCU data structures. TINY_RCU has avoided them for memory-footprint reasons, so add them only if CONFIG_RCU_TRACE=y. Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Event-trace markers for computing RCU CPU utilization | Paul E. McKenney | 1 | -1/+15
This commit adds the trace_rcu_utilization() marker that is to be used to allow postprocessing scripts to compute RCU's CPU utilization, give or take event-trace overhead. Note that we do not include RCU's dyntick-idle interface because event tracing requires RCU protection, which is not available in dyntick-idle mode. Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Add event-tracing for RCU callback invocation | Paul E. McKenney | 4 | -4/+121
There was recently some controversy about the overhead of invoking RCU callbacks. Add TRACE_EVENT()s to obtain fine-grained timings for the start and stop of a batch of callbacks and also for each callback invoked. Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Don't destroy rcu_torture_boost() callback until it is done | Paul E. McKenney | 1 | -1/+1
The rcu_torture_boost() cleanup code destroyed debug-objects state before waiting for the last RCU callback to be invoked, resulting in rare but very real debug-objects warnings. Move the destruction to after the waiting to fix this problem. Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Catch rcutorture up to new RCU API additions | Paul E. McKenney | 1 | -34/+21
Now that the RCU API contains synchronize_rcu_bh(), synchronize_sched(), call_rcu_sched(), and rcu_bh_expedited()... Make rcutorture test synchronize_rcu_bh(), getting rid of the old rcu_bh_torture_synchronize() workaround. Similarly, make rcutorture test synchronize_sched(), getting rid of the old sched_torture_synchronize() workaround. Make rcutorture test call_rcu_sched() instead of wrappering synchronize_sched(). Also add testing of rcu_bh_expedited(). Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Abstract common code for RCU grace-period-wait primitives | Paul E. McKenney | 5 | -73/+23
Pull the code that waits for an RCU grace period into a single function, which is then called by synchronize_rcu() and friends in the case of TREE_RCU and TREE_PREEMPT_RCU, and from rcu_barrier() and friends in the case of TINY_RCU and TINY_PREEMPT_RCU. Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Fix mismatched variable in rcutree_trace.c | Andi Kleen | 1 | -1/+1
rcutree.c defines rcu_cpu_kthread_cpu as int, not unsigned int, so the extern has to follow that. Signed-off-by: Andi Kleen <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Restore checks for blocking in RCU read-side critical sections | Paul E. McKenney | 3 | -38/+52
Long ago, using TREE_RCU with PREEMPT would result in "scheduling while atomic" diagnostics if you blocked in an RCU read-side critical section. However, PREEMPT now implies TREE_PREEMPT_RCU, which defeats this diagnostic. This commit therefore adds a replacement diagnostic based on PROVE_RCU. Because rcu_lockdep_assert() and lockdep_rcu_dereference() are now being used for things that have nothing to do with rcu_dereference(), rename lockdep_rcu_dereference() to lockdep_rcu_suspicious() and add a third argument that is a string indicating what is suspicious. This third argument is passed in from a new third argument to rcu_lockdep_assert(). Update all calls to rcu_lockdep_assert() to add an informative third argument. Also, add a pair of rcu_lockdep_assert() calls from within rcu_note_context_switch(), one complaining if a context switch occurs in an RCU-bh read-side critical section and another complaining if a context switch occurs in an RCU-sched read-side critical section. These are present only if the PROVE_RCU kernel parameter is enabled. Finally, fix some checkpatch whitespace complaints in lockdep.c. Again, you must enable PROVE_RCU to see these new diagnostics. But you are enabling PROVE_RCU to check out new RCU uses in any case, aren't you? Signed-off-by: Paul E. McKenney <[email protected]>
2011-09-28 | rcu: Avoid unnecessary self-wakeup of per-CPU kthreads | Shaohua Li | 1 | -5/+3
There are a number of cases where RCU can find additional work for the per-CPU kthread within the context of that per-CPU kthread. In such cases, the per-CPU kthread is already running, so attempting to wake itself up does nothing except waste CPU cycles. This commit therefore checks to see if it is in the per-CPU kthread context, omitting the wakeup in this case. Signed-off-by: Shaohua Li <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>