Age  |  Commit message  |  Author  |  Files  |  Lines (-/+)
2010-10-19  futex: Fix errors in nested key ref-counting  [Darren Hart]  1 file, -15/+16
futex_wait() is leaking key references due to futex_wait_setup() acquiring an additional reference via the queue_lock() routine. The nested key ref-counting has been masking bugs and complicating code analysis. queue_lock() is only called with a previously ref-counted key, so remove the additional ref-counting from the queue_(un)lock() functions.

Also futex_wait_requeue_pi() drops one key reference too many in unqueue_me_pi(). Remove the key reference handling from unqueue_me_pi(). This was paired with a queue_lock() in futex_lock_pi(), so the count remains unchanged.

Document remaining nested key ref-counting sites.

Signed-off-by: Darren Hart <[email protected]>
Reported-and-tested-by: Matthieu Fertré <[email protected]>
Reported-by: Louis Rilling <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Rusty Russell <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
2010-10-19  dabusb: remove the BKL  [Arnd Bergmann]  1 file, -15/+3
The dabusb device driver is sufficiently serialized using its own mutex, no need for the big kernel lock here in addition. Signed-off-by: Arnd Bergmann <[email protected]> Cc: Mauro Carvalho Chehab <[email protected]>
2010-10-19  sunrpc: remove the big kernel lock  [Arnd Bergmann]  2 files, -30/+11
The sunrpc cache_ioctl function does not need the big kernel lock because it uses its own queue_lock already. rpc_pipe_ioctl apparently should be using i_lock like the other operations on the pipe file descriptor do. Signed-off-by: Arnd Bergmann <[email protected]>
2010-10-19  init/main.c: remove BKL notations  [Namhyung Kim]  1 file, -2/+0
According to commit 5e3d20a68f63fc5a310687d81956c3b96e488b84 (init: Remove the BKL from startup code) these sparse notations should be removed also. Signed-off-by: Namhyung Kim <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]>
2010-10-19  blktrace: remove the big kernel lock  [Arnd Bergmann]  1 file, -11/+3
According to Jens, this code does not need the BKL at all, it is sufficiently serialized by bd_mutex. Signed-off-by: Arnd Bergmann <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Steven Rostedt <[email protected]>
2010-10-19  rtmutex-tester: make it build without BKL  [Arnd Bergmann]  1 file, -0/+6
The big kernel lock is going away, so make sure that if it is disabled by Kconfig, we do not try to validate it, which would result in compile errors. Signed-off-by: Arnd Bergmann <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Andrew Morton <[email protected]>
2010-10-19  dvb-core: kill the big kernel lock  [Arnd Bergmann]  4 files, -40/+11
The dvb core only uses the big kernel lock in the open and ioctl functions, which means it can be replaced with a dvb specific mutex. Fortunately, all the ioctl functions go through dvb_usercopy, so we can move the serialization in there. Signed-off-by: Arnd Bergmann <[email protected]> Cc: Mauro Carvalho Chehab <[email protected]> Cc: [email protected]
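As a rough sketch of the serialization move described above (not the actual diff; the mutex name and the exact dvb_usercopy() shape here are assumptions), the idea is a dvb-core-private mutex taken around the driver's ioctl callback inside dvb_usercopy():

    static DEFINE_MUTEX(dvbdev_mutex);   /* assumed name for the dvb-wide lock */

    int dvb_usercopy(struct file *file, unsigned int cmd, unsigned long arg,
                     int (*func)(struct file *file, unsigned int cmd, void *argp))
    {
            void *parg = NULL;   /* in the real code: stack or kmalloc'd copy of arg */
            int err;

            /* ... copy the ioctl argument in from user space, as before ... */

            mutex_lock(&dvbdev_mutex);    /* stands in for the former lock_kernel() */
            err = func(file, cmd, parg);
            mutex_unlock(&dvbdev_mutex);

            /* ... copy the result back to user space and free parg as needed ... */
            return err;
    }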
2010-10-19  dvb/bt8xx: kill the big kernel lock  [Arnd Bergmann]  1 file, -3/+4
The bt8xx driver only uses the big kernel lock in its dst_ca_ioctl function and never to serialize against other code, so we can trivially replace it with a private mutex. Signed-off-by: Arnd Bergmann <[email protected]> Cc: [email protected] Cc: Mauro Carvalho Chehab <[email protected]>
2010-10-19  tlclk: remove big kernel lock  [Arnd Bergmann]  1 file, -3/+3
This driver already has a global mutex, so let's just use that in the open function instead of the BKL. It may not even be needed there, but this patch should have the smallest impact. Signed-off-by: Arnd Bergmann <[email protected]> Cc: Mark Gross <[email protected]>
2010-10-19  fix rawctl compat ioctls breakage on amd64 and itanic  [Al Viro]  2 files, -173/+140
RAW_SETBIND and RAW_GETBIND 32bit versions are fscked in interesting ways.

1) fs/compat_ioctl.c has COMPATIBLE_IOCTL(RAW_SETBIND) followed by HANDLE_IOCTL(RAW_SETBIND, raw_ioctl). The latter is ignored.

2) on amd64 (and itanic) the damn thing is broken - we have int + u64 + u64 and layouts on i386 and amd64 are _not_ the same. raw_ioctl() would work there, but it's never called due to (1). As it is, i386 /sbin/raw definitely doesn't work on amd64 boxen.

3) switching to raw_ioctl() as is would *not* work on e.g. sparc64 and ppc64, which would be rather sad, seeing that normal userland there is 32bit. The thing is, slapping __packed on the struct in question does not DTRT - it eliminates *all* padding. The real solution is to use compat_u64.

4) of course, all that stuff has no business being outside of raw.c in the first place - there should be ->compat_ioctl() for /dev/rawctl instead of messing with compat_ioctl.c.

[[email protected]: coding-style fixes]
[[email protected]: port to 2.6.36]
Signed-off-by: Al Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
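Point (2) and the compat_u64 remark are easy to see with a small userspace experiment. The sketch below uses hypothetical struct names (the real struct is the raw bind request: an int minor plus two __u64 device numbers) and shows that int + u64 + u64 is padded to 24 bytes on x86-64 but occupies 20 bytes on i386, and that a 64-bit type forced to 4-byte alignment, which is the role compat_u64 plays in the kernel, reproduces the 32-bit layout on a 64-bit build:

    #include <stdint.h>
    #include <stdio.h>

    /* Stand-in for the kernel's compat_u64: a u64 with i386's 4-byte alignment. */
    typedef uint64_t u64_align4 __attribute__((aligned(4)));

    struct native_req {          /* int + u64 + u64, native alignment */
            int raw_minor;
            uint64_t block_major;
            uint64_t block_minor;
    };

    struct compat_req {          /* same fields, i386-compatible layout */
            int raw_minor;
            u64_align4 block_major;
            u64_align4 block_minor;
    };

    int main(void)
    {
            /* On x86-64 this should print native=24 compat=20; 20 bytes is the
             * layout a 32-bit caller hands to the ioctl. */
            printf("native=%zu compat=%zu\n",
                   sizeof(struct native_req), sizeof(struct compat_req));
            return 0;
    }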
2010-10-19  uml: kill big kernel lock  [Arnd Bergmann]  4 files, -18/+20
Three uml device drivers still use the big kernel lock, but all of them can be safely converted to using a per-driver mutex instead. Most likely this is not even necessary, so after further review these can and should be removed as well. The exec system call no longer requires the BKL either, so remove it from there, too. Signed-off-by: Arnd Bergmann <[email protected]> Cc: Jeff Dike <[email protected]> Cc: [email protected]
2010-10-19  x86: ioapic: Call free_irte only if interrupt remapping enabled  [Yinghai Lu]  1 file, -1/+2
On a system that supports intr-remapping, booting with "intremap=off" triggers:

  [  177.895501] BUG: unable to handle kernel NULL pointer dereference at 00000000000000f8
  [  177.913316] IP: [<ffffffff8145fc18>] free_irte+0x47/0xc0
  ...
  [  178.173326] Call Trace:
  [  178.173574]  [<ffffffff810515b4>] destroy_irq+0x3a/0x75
  [  178.192934]  [<ffffffff81051834>] arch_teardown_msi_irq+0xe/0x10
  [  178.193418]  [<ffffffff81458dc3>] arch_teardown_msi_irqs+0x56/0x7f
  [  178.213021]  [<ffffffff81458e79>] free_msi_irqs+0x8d/0xeb

Call free_irte only when interrupt remapping is enabled.

Signed-off-by: Yinghai Lu <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
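The fix presumably reduces to guarding the IRTE teardown in destroy_irq() with the same interrupt-remapping check used elsewhere in io_apic.c; a condensed sketch, with the surrounding teardown steps elided:

    void destroy_irq(unsigned int irq)
    {
            /* ... irq descriptor cleanup ... */

            if (intr_remapping_enabled)     /* skip when booted with intremap=off */
                    free_irte(irq);

            /* ... release the vector under vector_lock ... */
    }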
2010-10-19  perf, powerpc: Fix power_pmu_event_init to not use event->ctx  [Paul Mackerras]  1 file, -1/+1
Commit c3f00c70 ("perf: Separate find_get_context() from event initialization") changed the generic perf_event code to call perf_event_alloc, which calls the arch-specific event_init code, before looking up the context for the new event. Unfortunately, power_pmu_event_init uses event->ctx->task to see whether the new event is a per-task event or a system-wide event, and thus crashes since event->ctx is NULL at the point where power_pmu_event_init gets called. (The reason it needs to know whether it is a per-task event is because there are some hardware events on Power systems which only count when the processor is not idle, and there are some fixed-function counters which count such events. For example, the "run cycles" event counts cycles when the processor is not idle. If the user asks to count cycles, we can use "run cycles" if this is a per-task event, since the processor is running when the task is running, by definition. We can't use "run cycles" if the user asks for "cycles" on a system-wide counter.) Fortunately the information we need is in the event->attach_state field, so we just use that instead. Signed-off-by: Paul Mackerras <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Frederic Weisbecker <[email protected]> LKML-Reference: <20101019055535.GA10398@drongo> Signed-off-by: Ingo Molnar <[email protected]> Reported-by: Alexey Kardashevskiy <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
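A sketch of the check this implies (the helper is hypothetical, but PERF_ATTACH_TASK is the attach_state bit the message refers to): the per-task decision can be made from state that perf_event_alloc() has already filled in, instead of the not-yet-assigned event->ctx:

    /* In power_pmu_event_init(), instead of the old "if (event->ctx->task)": */
    static inline bool event_is_per_task(struct perf_event *event)
    {
            return event->attach_state & PERF_ATTACH_TASK;
    }

Only when this returns true may the PMU substitute "run"-qualified events such as "run cycles" for their plain counterparts.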
2010-10-19  Merge branch 'tip/perf/recordmcount-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core  [Ingo Molnar]  1 file, -1/+7
2010-10-19  ARM: S5PV310: Fix build error on GPIO map  [Kukjin Kim]  2 files, -2/+1
This patch fixes a build error in the GPIO map caused by the conflict between commits 4d914705 and 19a2c065.

- commit 4d914705: Fix on GPIO base addresses
- commit 19a2c065: Moves initial map for merging S5P64X0

Signed-off-by: Kukjin Kim <[email protected]>
2010-10-18  Merge branch 'hotplug' into devel  [Russell King]  29 files, -300/+403
Conflicts: arch/arm/kernel/head-common.S
2010-10-18  Merge branches 'at91', 'dcache', 'ftrace', 'hwbpt', 'misc', 'mmci', 's3c', 'st-ux' and 'unwind' into devel  [Russell King]  921 files, -5717/+14105
2010-10-18  ftrace: Remove recursion between recordmcount and scripts/mod/empty  [Steven Rostedt]  1 file, -1/+7
When DYNAMIC_FTRACE is enabled and we use the C version of recordmcount, all objects are run through the recordmcount program to create a separate section that stores all the callers of mcount.

The build process has a special file: scripts/mod/empty.o. This is built from empty.c, which is literally an empty file (except for a single comment). This file is used to find information about the target elf format, like endianness and word size.

The problem comes up when we need to build recordmcount. The build process requires that empty.o is built first. The build rules for empty.o will try to execute recordmcount on the empty.o file. We get an error that recordmcount does not exist.

To avoid this recursion, the build file will skip running recordmcount if the file that it is building is scripts/mod/empty.o.

[ extra comment Suggested-by: Sam Ravnborg <[email protected]> ]
Reported-by: Ingo Molnar <[email protected]>
Tested-by: Ingo Molnar <[email protected]>
Cc: Michal Marek <[email protected]>
Cc: [email protected]
Signed-off-by: Steven Rostedt <[email protected]>
2010-10-18  ARM: 6441/1: ux500: The platform is not just based on early drop silicon version  [Srinidhi Kasagar]  1 file, -3/+1
Update Kconfig text accordingly.

Signed-off-by: srinidhi kasagar <[email protected]>
Acked-by: Linus Walleij <[email protected]>
Signed-off-by: Russell King <[email protected]>
2010-10-18  Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus  [Linus Torvalds]  12 files, -44/+53
* 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus:
  MIPS: Enable ISA_DMA_API config to fix build failure
  MIPS: 32-bit: Fix build failure in asm/fcntl.h
  MIPS: Remove all generated vmlinuz* files on "make clean"
  MIPS: do_sigaltstack() expects userland pointers
  MIPS: Fix error values in case of bad_stack
  MIPS: Sanitize restart logics
  MIPS: secure_computing, syscall audit: syscall number should in r2, not r0.
  MIPS: Don't block signals if we'd failed to setup a sigframe
2010-10-18  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input  [Linus Torvalds]  1 file, -1/+7
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: evdev - fix EVIOCSABS regression
  Input: evdev - fix Ooops in EVIOCGABS/EVIOCSABS
2010-10-18  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6  [Linus Torvalds]  2 files, -26/+1
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6:
  firewire: ohci: fix TI TSB82AA2 regression since 2.6.35
2010-10-18  mxc_nand: do not depend on disabling the irq in the interrupt handler  [Sascha Hauer]  1 file, -9/+83
This patch reverts the driver to enabling/disabling the NFC interrupt mask rather than enabling/disabling the system interrupt. This cleans up the driver so that it doesn't rely on interrupts being disabled within the interrupt handler. For i.MX21 we keep the current behaviour, that is calling enable_irq/disable_irq_nosync to enable/disable interrupts. This patch is based on earlier work by John Ogness. Signed-off-by: Sascha Hauer <[email protected]> Acked-by: John Ogness <[email protected]> Tested-by: John Ogness <[email protected]> Signed-off-by: David Woodhouse <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-10-18  Merge branch 'for-linus/i2c/2636-rc8' of git://git.fluff.org/bjdooks/linux  [Linus Torvalds]  2 files, -18/+18
* 'for-linus/i2c/2636-rc8' of git://git.fluff.org/bjdooks/linux:
  i2c-imx: do not allow interruptions when waiting for I2C to complete
  i2c-davinci: Fix TX setup for more SoCs
2010-10-18  Merge branch 'fixes'  [Linus Torvalds]  2 files, -31/+28
* fixes:
  v4l1: fix 32-bit compat microcode loading translation
  De-pessimize rds_page_copy_user
2010-10-18  sched: Export account_system_vtime()  [Ingo Molnar]  1 file, -0/+1
KVM uses it for example:

  ERROR: "account_system_vtime" [arch/x86/kvm/kvm.ko] undefined!

Cc: Venkatesh Pallipadi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
2010-10-18  sched: Call tick_check_idle before __irq_enter  [Venkatesh Pallipadi]  2 files, -4/+10
When CPU is idle and on first interrupt, irq_enter calls tick_check_idle() to notify interruption from idle. But, there is a problem if this call is done after __irq_enter, as all routines in __irq_enter may find stale time due to yet to be done tick_check_idle.

Specifically, trace calls in __irq_enter when they use global clock and also account_system_vtime change in this patch as it wants to use sched_clock_cpu() to do proper irq timing.

But, tick_check_idle was moved after __irq_enter intentionally to prevent problem of unneeded ksoftirqd wakeups by the commit ee5f80a:

    irq: call __irq_enter() before calling the tick_idle_check
    Impact: avoid spurious ksoftirqd wakeups

Moving tick_check_idle() before __irq_enter and wrapping it with local_bh_enable/disable would solve both the problems.

Fixed-by: Yong Zhang <[email protected]>
Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
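A sketch of the reordered irq_enter() this describes (close to, but not necessarily identical to, the final code): tick_check_idle() runs before __irq_enter(), and the local_bh_disable()/_local_bh_enable() pair keeps softirqs from being processed, and ksoftirqd from being woken, on this early path:

    void irq_enter(void)
    {
            int cpu = smp_processor_id();

            rcu_irq_enter();
            if (idle_cpu(cpu) && !in_interrupt()) {
                    /* Tell the tick code about idle exit *before* __irq_enter(),
                     * so time read inside __irq_enter() is not stale. */
                    local_bh_disable();
                    tick_check_idle(cpu);
                    _local_bh_enable();     /* does not run softirqs */
            }

            __irq_enter();
    }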
2010-10-18  sched: Remove irq time from available CPU power  [Venkatesh Pallipadi]  3 files, -1/+30
The idea was suggested by Peter Zijlstra here:

  http://marc.info/?l=linux-kernel&m=127476934517534&w=2

irq time is technically not available to the tasks running on the CPU. This patch removes irq time from CPU power, piggybacking on sched_rt_avg_update().

Tested this by keeping CPU X busy with a network intensive task having 75% of a single CPU in irq processing (hard+soft) on a 4-way system, and starting seven cycle soakers on the system. Without this change, there will be two tasks on each CPU. With this change, there is a single task on irq busy CPU X and the remaining 7 tasks are spread around among the other 3 CPUs.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
2010-10-18  sched: Do not account irq time to current task  [Venkatesh Pallipadi]  3 files, -10/+47
Scheduler accounts both softirq and interrupt processing times to the currently running task. This means, if the interrupt processing was for some other task in the system, then the current task ends up being penalized as it gets shorter runtime than otherwise.

Change sched task accounting to account only actual task time for the currently running task. update_curr() now modifies delta_exec to depend on rq->clock_task.

Note that this change only handles the CONFIG_IRQ_TIME_ACCOUNTING case. We can extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort, but that's for later.

This change will impact scheduling behavior in interrupt heavy conditions.

Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy task (nc) running on CPU 3 (and no RSS/RFS). With that I have CPU 2 spending 75%+ of its time in irq processing and CPU 3 spending around 35% of its time running the nc task. Now, if I run another CPU intensive task on CPU 2, without this change /proc/<pid>/schedstat shows 100% of time accounted to this task. With this change, it rightly shows less than 25% accounted to this task, as the remaining time is actually spent on irq processing.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
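A sketch of the accounting change in update_curr() (condensed; only the clock source matters here): the running task is charged against rq->clock_task, a clock that does not advance across hard/soft irq processing when IRQ_TIME_ACCOUNTING is on:

    static void update_curr(struct cfs_rq *cfs_rq)
    {
            struct sched_entity *curr = cfs_rq->curr;
            u64 now = rq_of(cfs_rq)->clock_task;    /* was rq->clock */
            unsigned long delta_exec;

            if (unlikely(!curr))
                    return;

            delta_exec = (unsigned long)(now - curr->exec_start);
            /* ... __update_curr(cfs_rq, curr, delta_exec); curr->exec_start = now; ... */
    }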
2010-10-18  x86: Add IRQ_TIME_ACCOUNTING  [Venkatesh Pallipadi]  3 files, -0/+23
This patch adds the IRQ_TIME_ACCOUNTING option on x86 and enables it at runtime when the TSC is enabled. This change just enables fine grained irq time accounting and isn't used yet. Following patches use it for different purposes.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
2010-10-18  sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time  [Venkatesh Pallipadi]  3 files, -1/+63
s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING which does the fine granularity accounting of user, system, hardirq, softirq times. Adding that option on archs like x86 will be challenging however, given the state of TSC reliability on various platforms and also the overhead it will add in syscall entry exit. Instead, add a lighter variant that only does finer accounting of hardirq and softirq times, providing precise irq times (instead of timer tick based samples). This accounting is added with a new config option CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not interested in paying the perf penalty. This accounting is based on sched_clock, with the code being generic. So, other archs may find it useful as well. This patch just adds the core logic and does not enable this logic yet. Signed-off-by: Venkatesh Pallipadi <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> LKML-Reference: <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
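A sketch of the bookkeeping this adds (variable names from memory and may differ in detail): account_system_vtime() is called on hard/soft irq entry and exit and charges the elapsed sched_clock_cpu() delta to per-cpu hardirq/softirq counters:

    DEFINE_PER_CPU(u64, cpu_hardirq_time);
    DEFINE_PER_CPU(u64, cpu_softirq_time);
    static DEFINE_PER_CPU(u64, irq_start_time);
    static int sched_clock_irqtime;        /* enabled at runtime, e.g. with a usable TSC */

    void account_system_vtime(struct task_struct *curr)
    {
            unsigned long flags;
            u64 now, delta;
            int cpu;

            if (!sched_clock_irqtime)
                    return;

            local_irq_save(flags);
            cpu = smp_processor_id();
            now = sched_clock_cpu(cpu);
            delta = now - per_cpu(irq_start_time, cpu);
            per_cpu(irq_start_time, cpu) = now;

            if (hardirq_count())
                    per_cpu(cpu_hardirq_time, cpu) += delta;
            else if (in_serving_softirq() && !(curr->flags & PF_KSOFTIRQD))
                    per_cpu(cpu_softirq_time, cpu) += delta;   /* ksoftirqd is charged normally */

            local_irq_restore(flags);
    }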
2010-10-18  sched: Add a PF flag for ksoftirqd identification  [Venkatesh Pallipadi]  2 files, -0/+2
To account softirq time cleanly in scheduler, we need to identify whether softirq is invoked in ksoftirqd context or softirq at hardirq tail context. Add PF_KSOFTIRQD for that purpose. As all PF flag bits are currently taken, create space by moving one of the infrequently used bits (PF_THREAD_BOUND) down in task_struct to be along with some other state fields. Signed-off-by: Venkatesh Pallipadi <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> LKML-Reference: <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
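A minimal sketch of how the flag is used (the surrounding ksoftirqd loop is unchanged): the ksoftirqd thread tags itself so the irq-time code can tell softirq work done in ksoftirqd, charged to the thread as normal task time, from softirq work done at hardirq tail:

    static int run_ksoftirqd(void *__bind_cpu)
    {
            set_current_state(TASK_INTERRUPTIBLE);
            current->flags |= PF_KSOFTIRQD;      /* "I am ksoftirqd" marker */

            /* ... existing ksoftirqd loop: do_softirq(), schedule(), ... */
            return 0;
    }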
2010-10-18  sched: Consolidate account_system_vtime extern declaration  [Venkatesh Pallipadi]  4 files, -9/+2
Just a minor cleanup patch that makes things easier for the following patches. No functionality change in this patch.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
2010-10-18  sched: Fix softirq time accounting  [Venkatesh Pallipadi]  5 files, -22/+44
Peter Zijlstra found a bug in the way softirq time is accounted in VIRT_CPU_ACCOUNTING on this thread:

  http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html

The problem is that softirq processing uses local_bh_disable internally. There is no way, later in the flow, to differentiate between whether softirq is being processed or is it just that bh has been disabled. So, a hardirq when bh is disabled results in time being wrongly accounted as softirq.

Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING as well: account_system_time() in normal tick based accounting also uses softirq_count, which will be set even when not in softirq with bh disabled.

Peter also suggested the solution of using 2*SOFTIRQ_OFFSET as the irq count for local_bh_{disable,enable} and using just SOFTIRQ_OFFSET while softirq processing. The patch below does that and adds an API in_serving_softirq() which returns whether we are currently processing softirq or not.

It also changes one of the usages of softirq_count in net/sched/cls_cgroup.c to in_serving_softirq. It looks like many usages of in_softirq really want in_serving_softirq; those changes can be made individually on a case by case basis.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
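A sketch of the resulting preempt_count convention (from memory of the patch): local_bh_disable()/enable() now move the softirq count in steps of 2*SOFTIRQ_OFFSET, while actual softirq processing accounts exactly SOFTIRQ_OFFSET, so the low softirq bit means "currently serving a softirq" rather than merely "bh disabled":

    #define SOFTIRQ_DISABLE_OFFSET  (2 * SOFTIRQ_OFFSET)   /* used by local_bh_*() */

    #define in_softirq()            (softirq_count())                  /* bh off OR serving */
    #define in_serving_softirq()    (softirq_count() & SOFTIRQ_OFFSET) /* really serving */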
2010-10-18  sched: Drop group_capacity to 1 only if local group has extra capacity  [Nikhil Rao]  1 file, -2/+7
When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1 only if the local group has extra capacity. The extra check prevents the case where you always pull from the heaviest group when it is already under-utilized (possible when a large-weight task outweighs the remaining tasks on the system).

For example, consider a 16-cpu quad-core quad-socket machine with MC and NUMA scheduling domains. Let's say we spawn 15 nice0 tasks and one nice-15 task, and each task is running on one core. In this case, we observe the following events when balancing at the NUMA domain:

- find_busiest_group() will always pick the sched group containing the niced task to be the busiest group.
- find_busiest_queue() will then always pick one of the cpus running the nice0 task (never picks the cpu with the nice -15 task since weighted_cpuload > imbalance).
- The load balancer fails to migrate the task since it is the running task and increments sd->nr_balance_failed.
- It repeats the above steps a few more times until sd->nr_balance_failed > 5, at which point it kicks off the active load balancer, wakes up the migration thread and kicks the nice 0 task off the cpu.

The load balancer doesn't stop until we kick out all nice 0 tasks from the sched group, leaving you with 3 idle cpus and one cpu running the nice -15 task.

When balancing at the NUMA domain, we drop sgs.group_capacity to 1 if the child domain (in this case MC) has SD_PREFER_SIBLING set. Subsequent load checks are not relevant because the niced task has a very large weight.

In this patch, we add an extra condition to the "if(prefer_sibling)" check in update_sd_lb_stats(). We drop the capacity of a group only if the local group has extra capacity, ie. nr_running < group_capacity.

This patch preserves the original intent of the prefer_siblings check (to spread tasks across the system in low utilization scenarios) and fixes the case above. It helps in the following ways:

- In low utilization cases (where nr_tasks << nr_cpus), we still drop group_capacity down to 1 if we prefer siblings.
- On very busy systems (where nr_tasks >> nr_cpus), sgs.nr_running will most likely be > sgs.group_capacity.
- When balancing large weight tasks, if the local group does not have extra capacity, we do not pick the group with the niced task as the busiest group. This prevents failed balances, active migration and the under-utilization described above.

Signed-off-by: Nikhil Rao <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
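The extra condition described above presumably amounts to one added test in update_sd_lb_stats(); a sketch of the shape, with field names as used in the message:

    /* Only collapse a sibling group's capacity to 1 when the local group
     * still has headroom (nr_running < group_capacity). */
    if (prefer_sibling && !local_group && sds->this_has_capacity)
            sgs.group_capacity = min(sgs.group_capacity, 1UL);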
2010-10-18  sched: Force balancing on newidle balance if local group has capacity  [Nikhil Rao]  1 file, -3/+25
This patch forces a load balance on a newly idle cpu when the local group has extra capacity and the busiest group does not have any. It improves system utilization when balancing tasks with a large weight differential. Under certain situations, such as a niced down task (i.e. nice = -15) in the presence of nr_cpus NICE0 tasks, the niced task lands on a sched group and kicks away other tasks because of its large weight. This leads to sub-optimal utilization of the machine. Even though the sched group has capacity, it does not pull tasks because sds.this_load >> sds.max_load, and f_b_g() returns NULL. With this patch, if the local group has extra capacity, we shortcut the checks in f_b_g() and try to pull a task over. A sched group has extra capacity if the group capacity is greater than the number of running tasks in that group. Thanks to Mike Galbraith for discussions leading to this patch and for the insight to reuse SD_NEWIDLE_BALANCE. Signed-off-by: Nikhil Rao <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> LKML-Reference: <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2010-10-18  sched: Set group_imb only if a task can be pulled from the busiest cpu  [Nikhil Rao]  1 file, -5/+7
When cycling through sched groups to determine the busiest group, set group_imb only if the busiest cpu has more than 1 runnable task. This patch fixes the case where two cpus in a group have one runnable task each, but there is a large weight differential between these two tasks. The load balancer is unable to migrate any task from this group, and hence does not consider this group to be imbalanced.

Signed-off-by: Nikhil Rao <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
[ small code readability edits ]
Signed-off-by: Ingo Molnar <[email protected]>
2010-10-18  sched: Do not consider SCHED_IDLE tasks to be cache hot  [Nikhil Rao]  1 file, -0/+3
This patch adds a check in task_hot to return if the task has SCHED_IDLE policy. SCHED_IDLE tasks have very low weight, and when run with regular workloads, are typically scheduled many milliseconds apart. There is no need to consider these tasks hot for load balancing. Signed-off-by: Nikhil Rao <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> LKML-Reference: <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
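The check itself is tiny; a sketch of the early-out added to task_hot():

    static int task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
    {
            if (p->sched_class != &fair_sched_class)
                    return 0;

            if (unlikely(p->policy == SCHED_IDLE))   /* never treat SCHED_IDLE as hot */
                    return 0;

            /* ... existing buddy and migration-cost checks follow ... */
            return 0;
    }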
2010-10-18  jump_label: Add COND_STMT(), reducer wrappery  [Peter Zijlstra]  2 files, -10/+12
The use of the JUMP_LABEL() construct ends up creating endless silly wrappers, create a higher level construct to reduce this clutter. Signed-off-by: Peter Zijlstra <[email protected]> Cc: Jason Baron <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Paul Mackerras <[email protected]> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <[email protected]>
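From memory, the helper looks roughly like this: `stmt` is only reachable through the jump label, so with `key` disabled the whole construct is straight-line code with no test-and-branch:

    #define COND_STMT(key, stmt)                                    \
    do {                                                            \
            __label__ jl_enabled;                                   \
            JUMP_LABEL(key, jl_enabled);                            \
            if (0) {                                                \
    jl_enabled:                                                     \
                    stmt;                                           \
            }                                                       \
    } while (0)

A caller such as the perf scheduler hooks in the follow-up patches can then say something like COND_STMT(&perf_task_events, __perf_event_task_sched_in(task)) instead of defining yet another wrapper.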
2010-10-18  perf: Optimize sw events  [Peter Zijlstra]  2 files, -11/+13
Acked-by: Frederic Weisbecker <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <[email protected]>
2010-10-18  perf: Use jump_labels to optimize the scheduler hooks  [Peter Zijlstra]  2 files, -17/+34
Trades a call + conditional + ret for an unconditional jmp. Acked-by: Frederic Weisbecker <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> LKML-Reference: <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2010-10-18  jump_label: Add atomic_t interface  [Peter Zijlstra]  1 file, -0/+44
Add an interface to allow usage of jump_labels with atomic counters. Signed-off-by: Peter Zijlstra <[email protected]> Acked-by: Frederic Weisbecker <[email protected]> LKML-Reference: <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
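The interface is presumably a pair of wrappers along these lines (names as recalled; exact signatures may differ), flipping the label only on the 0->1 and 1->0 transitions of the counter:

    static inline void jump_label_inc(atomic_t *key)
    {
            if (atomic_add_return(1, key) == 1)     /* 0 -> 1: first user */
                    jump_label_enable(key);
    }

    static inline void jump_label_dec(atomic_t *key)
    {
            if (atomic_dec_and_test(key))           /* 1 -> 0: last user gone */
                    jump_label_disable(key);
    }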
2010-10-18  jump_label: Use more consistent naming  [Peter Zijlstra]  3 files, -9/+9
While there are still only a few users around, rename things to make them more consistent.

Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
2010-10-18  perf, hw_breakpoint: Fix crash in hw_breakpoint creation  [Peter Zijlstra]  3 files, -9/+29
hw_breakpoint creation needs to account stuff per-task to ensure there is always sufficient hardware resources to back these things due to ptrace. With the perf per pmu context changes the event initialization no longer has access to the event context, for the simple reason that we need to first find the pmu (result of initialization) before we can find the context. This makes hw_breakpoints unhappy, because it can no longer do per task accounting, cure this by frobbing a task pointer in the event::hw bits for now... Signed-off-by: Peter Zijlstra <[email protected]> Cc: Frederic Weisbecker <[email protected]> LKML-Reference: <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2010-10-18  perf: Find task before event alloc  [Peter Zijlstra]  1 file, -11/+12
So that we can pass the task pointer to the event allocation, so that we can use task associated data during event initialization. Signed-off-by: Peter Zijlstra <[email protected]> LKML-Reference: <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2010-10-18  perf: Fix task refcount bugs  [Peter Zijlstra]  1 file, -3/+4
Currently it looks like find_lively_task_by_vpid() takes a task ref and relies on find_get_context() to drop it. The problem is that perf_event_create_kernel_counter() shouldn't be dropping task refs. Signed-off-by: Peter Zijlstra <[email protected]> Acked-by: Frederic Weisbecker <[email protected]> Acked-by: Matt Helsley <[email protected]> LKML-Reference: <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2010-10-18  perf: Fix group moving  [Peter Zijlstra]  1 file, -1/+6
Matt found we trigger the WARN_ON_ONCE() in perf_group_attach() when we take the move_group path in perf_event_open(). Since we cannot de-construct the group (we rely on it to move the events), we have to simply ignore the double attach. The group state is context invariant and doesn't need changing. Reported-by: Matt Fleming <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> LKML-Reference: <1287135757.29097.1368.camel@twins> Signed-off-by: Ingo Molnar <[email protected]>
2010-10-18  irq_work: Add generic hardirq context callbacks  [Peter Zijlstra]  39 files, -242/+311
Provide a mechanism that allows running code in IRQ context. It is most useful for NMI code that needs to interact with the rest of the system -- like wakeup a task to drain buffers. Perf currently has such a mechanism, so extract that and provide it as a generic feature, independent of perf so that others may also benefit. The IRQ context callback is generated through self-IPIs where possible, or on architectures like powerpc the decrementer (the built-in timer facility) is set to generate an interrupt immediately. Architectures that don't have anything like this get to do with a callback from the timer tick. These architectures can call irq_work_run() at the tail of any IRQ handlers that might enqueue such work (like the perf IRQ handler) to avoid undue latencies in processing the work. Signed-off-by: Peter Zijlstra <[email protected]> Acked-by: Kyle McMartin <[email protected]> Acked-by: Martin Schwidefsky <[email protected]> [ various fixes ] Signed-off-by: Huang Ying <[email protected]> LKML-Reference: <1287036094.7768.291.camel@yhuang-dev> Signed-off-by: Ingo Molnar <[email protected]>
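A usage sketch of the resulting API (the wait queue and function names here are made up, and the real struct also packs state flags into its pointer fields): work queued from NMI context is run later in hard irq context, where waking a task is allowed:

    #include <linux/irq_work.h>
    #include <linux/wait.h>

    static DECLARE_WAIT_QUEUE_HEAD(drain_waitq);    /* hypothetical consumer */
    static struct irq_work drain_work;

    static void drain_work_func(struct irq_work *work)
    {
            /* Runs in IRQ context (self-IPI, decrementer, or timer tick). */
            wake_up_all(&drain_waitq);
    }

    /* Called from NMI context, where wake_up_all() would not be safe. */
    static void nmi_saw_data(void)
    {
            irq_work_queue(&drain_work);
    }

    static int __init drain_init(void)
    {
            init_irq_work(&drain_work, drain_work_func);
            return 0;
    }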
2010-10-18  perf_events: Fix transaction recovery in group_sched_in()  [Stephane Eranian]  1 file, -13/+63
The group_sched_in() function uses a transactional approach to schedule a group of events. In a group, either all events can be scheduled or none are. To schedule each event in, the function calls event_sched_in(). In case of error, event_sched_out() is called on each event in the group.

The problem is that event_sched_out() does not completely cancel the effects of event_sched_in(). Furthermore event_sched_out() changes the state of the event as if it had run, which is not true in this particular case.

Those inconsistencies impact time tracking fields and may lead to events in a group not all reporting the same time_enabled and time_running values. This is demonstrated with the example below:

  $ task -eunhalted_core_cycles,baclears,baclears -e unhalted_core_cycles,baclears,baclears sleep 5
  1946101 unhalted_core_cycles (32.85% scaling, ena=829181, run=556827)
    11423 baclears (32.85% scaling, ena=829181, run=556827)
     7671 baclears (0.00% scaling, ena=556827, run=556827)

  2250443 unhalted_core_cycles (57.83% scaling, ena=962822, run=405995)
    11705 baclears (57.83% scaling, ena=962822, run=405995)
    11705 baclears (57.83% scaling, ena=962822, run=405995)

Notice that in the first group, the last baclears event does not report the same timings as its siblings. This issue comes from the fact that tstamp_stopped is updated by event_sched_out() as if the event had actually run.

To solve the issue, we must ensure that, in case of error, there is no change in the event state whatsoever. That means timings must remain as they were when entering group_sched_in(). To do this we defer updating tstamp_running until we know the transaction succeeded. Therefore, we have split event_sched_in() in two parts, separating the update to tstamp_running. Similarly, in case of error, we do not want to update tstamp_stopped. Therefore, we have split event_sched_out() in two parts, separating the update to tstamp_stopped.

With this patch, we now get the following output:

  $ task -eunhalted_core_cycles,baclears,baclears -e unhalted_core_cycles,baclears,baclears sleep 5
  2492050 unhalted_core_cycles (71.75% scaling, ena=1093330, run=308841)
    11243 baclears (71.75% scaling, ena=1093330, run=308841)
    11243 baclears (71.75% scaling, ena=1093330, run=308841)

  1852746 unhalted_core_cycles (0.00% scaling, ena=784489, run=784489)
     9253 baclears (0.00% scaling, ena=784489, run=784489)
     9253 baclears (0.00% scaling, ena=784489, run=784489)

Note that the uneven timing between groups is a side effect of the process spending most of its time sleeping, i.e., not enough event rotations (but that's a separate issue).

Signed-off-by: Stephane Eranian <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
2010-10-18  perf_events: Fix bogus AMD64 generic TLB events  [Stephane Eranian]  1 file, -2/+2
PERF_COUNT_HW_CACHE_DTLB:READ:MISS had a bogus umask value of 0 which counts nothing. Needed to be 0x7 (to count all possibilities).

PERF_COUNT_HW_CACHE_ITLB:READ:MISS had a bogus umask value of 0 which counts nothing. Needed to be 0x3 (to count all possibilities).

Signed-off-by: Stephane Eranian <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Robert Richter <[email protected]>
Cc: <[email protected]> # as far back as it applies
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>