aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2012-07-31mm/hugetlb: add new HugeTLB cgroupAneesh Kumar K.V5-0/+179
Implement a new controller that allows us to control HugeTLB allocations. The extension allows to limit the HugeTLB usage per control group and enforces the controller limit during page fault. Since HugeTLB doesn't support page reclaim, enforcing the limit at page fault time implies that, the application will get SIGBUS signal if it tries to access HugeTLB pages beyond its limit. This requires the application to know beforehand how much HugeTLB pages it would require for its use. The charge/uncharge calls will be added to HugeTLB code in later patch. Support for cgroup removal will be added in later patches. [[email protected]: s/CONFIG_CGROUP_HUGETLB_RES_CTLR/CONFIG_MEMCG_HUGETLB/g] [[email protected]: s/CONFIG_MEMCG_HUGETLB/CONFIG_CGROUP_HUGETLB/g] Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Aneesh Kumar K.V <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hillf Danton <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31hugetlb: make some static variables globalAneesh Kumar K.V2-5/+7
We will use them later in hugetlb_cgroup.c Signed-off-by: Aneesh Kumar K.V <[email protected]> Cc: David Rientjes <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Hillf Danton <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31hugetlb: add a list for tracking in-use HugeTLB pagesAneesh Kumar K.V2-5/+8
hugepage_activelist will be used to track currently used HugeTLB pages. We need to find the in-use HugeTLB pages to support HugeTLB cgroup removal. On cgroup removal we update the page's HugeTLB cgroup to point to parent cgroup. Signed-off-by: Aneesh Kumar K.V <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hillf Danton <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31hugetlb: simplify migrate_huge_page()Aneesh Kumar K.V3-57/+27
Since we migrate only one hugepage, don't use linked list for passing the page around. Directly pass the page that need to be migrated as argument. This also removes the usage of page->lru in the migrate path. Signed-off-by: Aneesh Kumar K.V <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hillf Danton <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31hugetlb: use mmu_gather instead of a temporary linked list for accumulating ↵Aneesh Kumar K.V4-33/+59
pages Use a mmu_gather instead of a temporary linked list for accumulating pages when we unmap a hugepage range Signed-off-by: Aneesh Kumar K.V <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31hugetlb: add an inline helper for finding hstate indexAneesh Kumar K.V2-9/+17
Add an inline helper and use it in the code. Signed-off-by: Aneesh Kumar K.V <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Hillf Danton <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31hugetlb: don't use ERR_PTR with VM_FAULT* valuesAneesh Kumar K.V1-5/+13
The current use of VM_FAULT_* codes with ERR_PTR requires us to ensure VM_FAULT_* values will not exceed MAX_ERRNO value. Decouple the VM_FAULT_* values from MAX_ERRNO. Signed-off-by: Aneesh Kumar K.V <[email protected]> Acked-by: Hillf Danton <[email protected]> Acked-by: KOSAKI Motohiro <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Cc: David Rientjes <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31hugetlb: rename max_hstate to hugetlb_max_hstateAneesh Kumar K.V1-7/+7
This patchset implements a cgroup resource controller for HugeTLB pages. The controller allows to limit the HugeTLB usage per control group and enforces the controller limit during page fault. Since HugeTLB doesn't support page reclaim, enforcing the limit at page fault time implies that, the application will get SIGBUS signal if it tries to access HugeTLB pages beyond its limit. This requires the application to know beforehand how much HugeTLB pages it would require for its use. The goal is to control how many HugeTLB pages a group of task can allocate. It can be looked at as an extension of the existing quota interface which limits the number of HugeTLB pages per hugetlbfs superblock. HPC job scheduler requires jobs to specify their resource requirements in the job file. Once their requirements can be met, job schedulers like (SLURM) will schedule the job. We need to make sure that the jobs won't consume more resources than requested. If they do we should either error out or kill the application. This patch: Rename max_hstate to hugetlb_max_hstate. We will be using this from other subsystems like hugetlb controller in later patches. Signed-off-by: Aneesh Kumar K.V <[email protected]> Acked-by: David Rientjes <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Hillf Danton <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31mm: prepare for removal of obsolete /proc/sys/vm/nr_pdflush_threadsWanpeng Li9-27/+40
Since per-BDI flusher threads were introduced in 2.6, the pdflush mechanism is not used any more. But the old interface exported through /proc/sys/vm/nr_pdflush_threads still exists and is obviously useless. For back-compatibility, printk warning information and return 2 to notify the users that the interface is removed. Signed-off-by: Wanpeng Li <[email protected]> Cc: Wu Fengguang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31mm/buddy: cleanup on should_fail_alloc_pageGavin Shan1-7/+7
Currently, function should_fail() has "bool" for its return value, so it's reasonable to change the return value of function should_fail_alloc_page() into "bool" as well. The patch does cleanup on function should_fail_alloc_page() to have "bool" for its return value. Signed-off-by: Gavin Shan <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31mm: account the total_vm in the vm_stat_account()Huang Shijie5-9/+4
vm_stat_account() accounts the shared_vm, stack_vm and reserved_vm now. But we can also account for total_vm in the vm_stat_account() which makes the code tidy. Even for mprotect_fixup(), we can get the right result in the end. Signed-off-by: Huang Shijie <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31documentation: update how page-cluster affects swap I/OChristian Ehrhardt1-2/+10
Fix of the documentation of /proc/sys/vm/page-cluster to match the behavior of the code and add some comments about what the tunable will change in that behavior. Signed-off-by: Christian Ehrhardt <[email protected]> Acked-by: Jens Axboe <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31swap: allow swap readahead to be mergedChristian Ehrhardt1-0/+5
Swap readahead works fine, but the I/O to disk is almost always done in page size requests, despite the fact that readahead submits 1<<page-cluster pages at a time. On older kernels the old per device plugging behavior might have captured this and merged the requests, but currently all comes down to much more I/Os than required. On a single device this might not be an issue, but as soon as a server runs on shared san resources savin I/Os not only improves swapin throughput but also provides a lower resource utilization. With a load running KVM in a lot of memory overcommitment (the hot memory is 1.5 times the host memory) swapping throughput improves significantly and the lead feels more responsive as well as achieves more throughput. In a test setup with 16 swap disks running blocktrace on one of those disks shows the improved merging: Prior: Reads Queued: 560,888, 2,243MiB Writes Queued: 226,242, 904,968KiB Read Dispatches: 544,701, 2,243MiB Write Dispatches: 159,318, 904,968KiB Reads Requeued: 0 Writes Requeued: 0 Reads Completed: 544,716, 2,243MiB Writes Completed: 159,321, 904,980KiB Read Merges: 16,187, 64,748KiB Write Merges: 61,744, 246,976KiB IO unplugs: 149,614 Timer unplugs: 2,940 With the patch: Reads Queued: 734,315, 2,937MiB Writes Queued: 300,188, 1,200MiB Read Dispatches: 214,972, 2,937MiB Write Dispatches: 215,176, 1,200MiB Reads Requeued: 0 Writes Requeued: 0 Reads Completed: 214,971, 2,937MiB Writes Completed: 215,177, 1,200MiB Read Merges: 519,343, 2,077MiB Write Merges: 73,325, 293,300KiB IO unplugs: 337,130 Timer unplugs: 11,184 I got ~10% to ~40% more throughput in my cases and at the same time much lower cpu consumption when broken down per transferred kilobyte (the majority of that due to saved interrupts and better cache handling). In a shared SAN others might get an additional benefit as well, because this now causes less protocol overhead. Signed-off-by: Christian Ehrhardt <[email protected]> Acked-by: Rik van Riel <[email protected]> Acked-by: Jens Axboe <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31memcg: remove MEM_CGROUP_CHARGE_TYPE_FORCEKamezawa Hiroyuki1-1/+0
There are no users since commit b24028572fb69 ("memcg: remove PCG_CACHE"). Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Hugh Dickins <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31memcg: rename MEM_CGROUP_CHARGE_TYPE_MAPPED as MEM_CGROUP_CHARGE_TYPE_ANONKamezawa Hiroyuki1-8/+8
Now, in memcg, 2 "MAPPED" enum/macro are found MEM_CGROUP_CHARGE_TYPE_MAPPED MEM_CGROUP_STAT_FILE_MAPPED Thier names looks similar to each other but the former is used for accounting anonymous memory. rename it as TYPE_ANON. Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31memcg: rename MEM_CGROUP_STAT_SWAPOUT as MEM_CGROUP_STAT_SWAPKamezawa Hiroyuki1-5/+5
MEM_CGROUP_STAT_SWAPOUT represents the usage of swap rather than the number of swap-out events. Rename it to be MEM_CGROUP_STAT_SWAP. Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Hugh Dickins <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31mm: make vb_alloc() more foolproofJan Kara1-0/+8
If someone calls vb_alloc() (or vm_map_ram() for that matter) to allocate 0 bytes (0 pages), get_order() returns BITS_PER_LONG - PAGE_CACHE_SHIFT and interesting stuff happens. So make debugging such problems easier and warn about 0-size allocation. [[email protected]: use WARN_ON-return-value feature] Signed-off-by: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31vmalloc: walk vmap_areas by sorted list instead of rb_next()Hong zhi guo1-4/+4
There's a walk by repeating rb_next to find a suitable hole. Could be simply replaced by walk on the sorted vmap_area_list. More simpler and efficient. Mutation of the list and tree only happens in pair within __insert_vmap_area and __free_vmap_area, under protection of vmap_area_lock. The patch code is also under vmap_area_lock, so the list walk is safe, and consistent with the tree walk. Tested on SMP by repeating batch of vmalloc anf vfree for random sizes and rounds for hours. Signed-off-by: Hong Zhiguo <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31drivers/media/video/v4l2-ioctl.c: fix buildAndrew Morton1-2/+6
Fix zillions of these: drivers/media/video/v4l2-ioctl.c:1848: error: unknown field 'func' specified in initializer drivers/media/video/v4l2-ioctl.c:1848: warning: missing braces around initializer drivers/media/video/v4l2-ioctl.c:1848: warning: (near initialization for 'v4l2_ioctls[0].<anonymous>') drivers/media/video/v4l2-ioctl.c:1848: warning: initialization makes integer from pointer without a cast drivers/media/video/v4l2-ioctl.c:1848: error: initializer element is not computable at load time drivers/media/video/v4l2-ioctl.c:1848: error: (near initialization for 'v4l2_ioctls[0].<anonymous>.offset') Cc: Mauro Carvalho Chehab <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31xtensa: select generic atomic64_t supportFengguang Wu1-0/+1
This will fix build errors: block/blk-cgroup.c:609:2: error: unknown type name 'atomic64_t' block/blk-cgroup.c:609:2: error: implicit declaration of function 'ATOMIC64_INIT' [-Werror=implicit-function-declaration] Signed-off-by: Fengguang Wu <[email protected]> Cc: Chris Zankel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31fault-injection: fix failcmd.sh warningAkinobu Mita1-1/+1
"fault-injection: add tool to run command with failslab or fail_page_alloc" added tools/testing/fault-injection/failcmd.sh to make it easier to inject slab/page allocation failures by fault injection. failcmd.sh prints the following warning when running with arguments for command. # ./failcmd.sh echo aaa failcmd.sh: line 209: [: echo: binary operator expected aaa This warning is caused by an improper check whether at least one parameter is left after parsing command options. Fix it by testing the length of $1 instead of $@ Signed-off-by: Akinobu Mita <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31Merge tag 'for-v3.6' of git://git.infradead.org/battery-2.6Linus Torvalds23-57/+789
Pull battery updates from Anton Vorontsov: "The tag contains just a few battery-related changes for v3.6. It's is all pretty straightforward, except one thing. One of our patches added thermal support for power supply class, but thermal/ subsystem changed under our feet. We (well, Stephen, that is) caught the issue and it was decided[1] that I'd just delay the battery pull request, and then will fix it up by merging upstream back into battery tree at the specific commit. That's not all though: another[2] small fixup for thermal subsystem was needed to get rid of a warning in power supply subsystem (the warning was not drivers/power's "fault", the thermal registration function just needed a proper const annotation, which is also done by a small commit on top of the merge. So, to sum this up: - The 'master' branch of the battery tree was in the -next tree for weeks, was never rebased, altered etc. It should be all OK; - Although, for-v3.6 tag contains the 'master' branch + merge + the warning fix. [1] http://lkml.org/lkml/2012/6/19/23 [2] http://lkml.org/lkml/2012/6/18/28" * tag 'for-v3.6' of git://git.infradead.org/battery-2.6: (23 commits) thermal: Constify 'type' argument for the registration routine olpc-battery: update CHARGE_FULL_DESIGN property for BYD LiFe batteries olpc-battery: Add VOLTAGE_MAX_DESIGN property charger-manager: Fix build break related to EXTCON lp8727_charger: Move header file into platform_data directory power_supply: Add min/max alert properties for CAPACITY, TEMP, TEMP_AMBIENT bq27x00_battery: Add support for BQ27425 chip charger-manager: Set current limit of regulator for over current protection charger-manager: Use EXTCON Subsystem to detect charger cables for charging test_power: Add VOLTAGE_NOW and BATTERY_TEMP properties test_power: Add support for USB AC source gpio-charger: Use cansleep version of gpio_set_value bq27x00_battery: Add support for power average and health properties sbs-battery: Don't trigger false supply_changed event twl4030_charger: Allow charger to control the regulator that feeds it twl4030_charger: Add backup-battery charging twl4030_charger: Fix some typos max17042_battery: Support CHARGE_COUNTER power supply attribute smb347-charger: Add constant charge and current properties power_supply: Add constant charge_current and charge_voltage properties ...
2012-07-31Merge branch 'perf-core-for-linus' of ↵Linus Torvalds26-385/+2228
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf updates from Ingo Molnar: "The biggest changes are Intel Nehalem-EX PMU uncore support, uprobes updates/cleanups/fixes from Oleg and diverse tooling updates (mostly fixes) now that Arnaldo is back from vacation." * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits) uprobes: __replace_page() needs munlock_vma_page() uprobes: Rename vma_address() and make it return "unsigned long" uprobes: Fix register_for_each_vma()->vma_address() check uprobes: Introduce vaddr_to_offset(vma, vaddr) uprobes: Teach build_probe_list() to consider the range uprobes: Remove insert_vm_struct()->uprobe_mmap() uprobes: Remove copy_vma()->uprobe_mmap() uprobes: Fix overflow in vma_address()/find_active_uprobe() uprobes: Suppress uprobe_munmap() from mmput() uprobes: Uprobe_mmap/munmap needs list_for_each_entry_safe() uprobes: Clean up and document write_opcode()->lock_page(old_page) uprobes: Kill write_opcode()->lock_page(new_page) uprobes: __replace_page() should not use page_address_in_vma() uprobes: Don't recheck vma/f_mapping in write_opcode() perf/x86: Fix missing struct before structure name perf/x86: Fix format definition of SNB-EP uncore QPI box perf/x86: Make bitfield unsigned perf/x86: Fix LLC-* and node-* events on Intel SandyBridge perf/x86: Add Intel Nehalem-EX uncore support perf/x86: Fix typo in format definition of uncore PCU filter ...
2012-07-31Merge branch 'merge' of ↵Linus Torvalds7-55/+121
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull powerpc updates from Benjamin Herrenschmidt: "Kumar sent me a handful of Freescale related fixes and I added another regression fix to the pile. PS. I -will- eventually learn about that signed tag business :-)" * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: powerpc/kvm/book3s_32: Fix MTMSR_EERI macro powerpc/85xx: p1022ds: fix DIU/LBC switching with NAND enabled powerpc/85xx: p1022ds: disable the NAND flash node if video is enabled powerpc/85xx: Fix sram_offset parameter type powerpc/85xx: P3041DS - change espi input-clock from 40MHz to 35MHz powerpc/85xx: Fix pci base address error for p2020rdb-pc in dts
2012-07-31Merge branch 'for-linus' of ↵Linus Torvalds18-105/+104
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Martin Schwidefsky: "This it the second batch of s390 patches for the 3.6 merge window. Included is enablement for two common code changes, killable page faults and sorted exception tables. And the regular set of cleanup and bug fix patches." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390: make use of user_mode() macro where possible s390/mm: rename user_mode variable to addressing_mode s390/mm: fix fault handling for page table walk case s390/mm: make page faults killable s390: update defconfig s390/mm: downgrade page table after fork of a 31 bit process s390/ipl: Use diagnose 8 command separation s390/linker script: use RO_DATA_SECTION s390/exceptions: sort exception table at build time s390/debug: remove module_exit function / move EXPORT_SYMBOLs
2012-07-31ipv4: Properly purge netdev references on uncached routes.David S. Miller4-4/+69
When a device is unregistered, we have to purge all of the references to it that may exist in the entire system. If a route is uncached, we currently have no way of accomplishing this. So create a global list that is scanned when a network device goes down. This mirrors the logic in net/core/dst.c's dst_ifdown(). Signed-off-by: David S. Miller <[email protected]>
2012-07-31ipv4: Cache routes in nexthop exception entries.David S. Miller3-64/+79
Signed-off-by: David S. Miller <[email protected]>
2012-07-31Merge branch 'nfsd-next' of git://linux-nfs.org/~bfields/linuxLinus Torvalds27-237/+368
Pull nfsd changes from J. Bruce Fields: "This has been an unusually quiet cycle--mostly bugfixes and cleanup. The one large piece is Stanislav's work to containerize the server's grace period--but that in itself is just one more step in a not-yet-complete project to allow fully containerized nfs service. There are a number of outstanding delegation, container, v4 state, and gss patches that aren't quite ready yet; 3.7 may be wilder." * 'nfsd-next' of git://linux-nfs.org/~bfields/linux: (35 commits) NFSd: make boot_time variable per network namespace NFSd: make grace end flag per network namespace Lockd: move grace period management from lockd() to per-net functions LockD: pass actual network namespace to grace period management functions LockD: manage grace list per network namespace SUNRPC: service request network namespace helper introduced NFSd: make nfsd4_manager allocated per network namespace context. LockD: make lockd manager allocated per network namespace LockD: manage grace period per network namespace Lockd: add more debug to host shutdown functions Lockd: host complaining function introduced LockD: manage used host count per networks namespace LockD: manage garbage collection timeout per networks namespace LockD: make garbage collector network namespace aware. LockD: mark host per network namespace on garbage collect nfsd4: fix missing fault_inject.h include locks: move lease-specific code out of locks_delete_lock locks: prevent side-effects of locks_release_private before file_lock is initialized NFSd: set nfsd_serv to NULL after service destruction NFSd: introduce nfsd_destroy() helper ...
2012-07-31ipv4: percpu nh_rth_output cacheEric Dumazet3-7/+34
Input path is mostly run under RCU and doesnt touch dst refcnt But output path on forwarding or UDP workloads hits badly dst refcount, and we have lot of false sharing, for example in ipv4_mtu() when reading rt->rt_pmtu Using a percpu cache for nh_rth_output gives a nice performance increase at a small cost. 24 udpflood test on my 24 cpu machine (dummy0 output device) (each process sends 1.000.000 udp frames, 24 processes are started) before : 5.24 s after : 2.06 s For reference, time on linux-3.5 : 6.60 s Signed-off-by: Eric Dumazet <[email protected]> Tested-by: Alexander Duyck <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2012-07-31ipv4: Restore old dst_free() behavior.Eric Dumazet7-50/+50
commit 404e0a8b6a55 (net: ipv4: fix RCU races on dst refcounts) tried to solve a race but added a problem at device/fib dismantle time : We really want to call dst_free() as soon as possible, even if sockets still have dst in their cache. dst_release() calls in free_fib_info_rcu() are not welcomed. Root of the problem was that now we also cache output routes (in nh_rth_output), we must use call_rcu() instead of call_rcu_bh() in rt_free(), because output route lookups are done in process context. Based on feedback and initial patch from David Miller (adding another call_rcu_bh() call in fib, but it appears it was not the right fix) I left the inet_sk_rx_dst_set() helper and added __rcu attributes to nh_rth_output and nh_rth_input to better document what is going on in this code. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2012-07-31Merge branch 'for-linus' of ↵Linus Torvalds25-898/+1351
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph changes from Sage Weil: "Lots of stuff this time around: - lots of cleanup and refactoring in the libceph messenger code, and many hard to hit races and bugs closed as a result. - lots of cleanup and refactoring in the rbd code from Alex Elder, mostly in preparation for the layering functionality that will be coming in 3.7. - some misc rbd cleanups from Josh Durgin that are finally going upstream - support for CRUSH tunables (used by newer clusters to improve the data placement) - some cleanup in our use of d_parent that Al brought up a while back - a random collection of fixes across the tree There is another patch coming that fixes up our ->atomic_open() behavior, but I'm going to hammer on it a bit more before sending it." Fix up conflicts due to commits that were already committed earlier in drivers/block/rbd.c, net/ceph/{messenger.c, osd_client.c} * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (132 commits) rbd: create rbd_refresh_helper() rbd: return obj version in __rbd_refresh_header() rbd: fixes in rbd_header_from_disk() rbd: always pass ops array to rbd_req_sync_op() rbd: pass null version pointer in add_snap() rbd: make rbd_create_rw_ops() return a pointer rbd: have __rbd_add_snap_dev() return a pointer libceph: recheck con state after allocating incoming message libceph: change ceph_con_in_msg_alloc convention to be less weird libceph: avoid dropping con mutex before fault libceph: verify state after retaking con lock after dispatch libceph: revoke mon_client messages on session restart libceph: fix handling of immediate socket connect failure ceph: update MAINTAINERS file libceph: be less chatty about stray replies libceph: clear all flags on con_close libceph: clean up con flags libceph: replace connection state bits with states libceph: drop unnecessary CLOSED check in socket state change callback libceph: close socket directly from ceph_con_close() ...
2012-07-31[media] radio-tea5777: use library for 64bits divMauro Carvalho Chehab1-3/+5
drivers/built-in.o: In function `radio_tea5777_set_freq': radio-tea5777.c:(.text+0x4d8704): undefined reference to `__udivdi3' Reported-by: Randy Dunlap <[email protected]> Cc: Hans de Goede <[email protected]> Acked-by: Randy Dunlap <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
2012-07-31nfs: explicitly reject LOCK_MAND flock() requestsJeff Layton1-0/+9
We have no mechanism to emulate LOCK_MAND locks on NFSv4, so explicitly return -EINVAL if someone requests it. Signed-off-by: Jeff Layton <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-07-31ASoC: wm8962: Allow VMID time to fully rampMark Brown1-0/+3
Required for reliable power up from cold. Signed-off-by: Mark Brown <[email protected]> Cc: [email protected]
2012-07-31nfs: increase number of permitted callback connections.NeilBrown1-0/+4
By default a sunrpc service is limited to (N+3)*20 connections where N is the number of threads. This is 80 when N==1. If this number is exceeded a warning is printed suggesting that the number of threads be increased. However with services which run a single thread, this is impossible. For such services there is a ->sv_maxconn setting that can be used to forcibly increase the limit, and silence the message. This is used by lockd. The nfs client uses a sunrpc service to handle callbacks and it too is single-threaded, so to avoid the useless messages, and to allow a reasonable number of concurrent connections, we need to set ->sv_maxconn. 1024 seems like a good number. Signed-off-by: NeilBrown <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2012-07-31ASoC: AC97 doesn't use regmap by defaultManuel Lauss4-0/+4
Since commit 38cbf9598feba97de9f9b43efa9153fd7c1a2ec9 ("ASoC: core: Try to use regmap if the driver doesn't set up any I/O") any ASoC codec which doesn't set codec::control_data is assumed to use regmap. That doesn't work with AC97 so this workaround sets the codec::control_data member to a random value to restore proper behaviour. Tested with WM9712. Signed-off-by: Manuel Lauss <[email protected]> Signed-off-by: Mark Brown <[email protected]>
2012-07-31ASoC: sgtl5000: enable VAG_POWER for LINE_INDong Aisheng1-0/+1
LINE_IN also needs VAG_POWER on or we may hear noise when directly route LINE_IN to Headphone Mux. Tested on imx28evk. Signed-off-by: Dong Aisheng <[email protected]> Signed-off-by: Mark Brown <[email protected]>
2012-07-31ASoC: ab8500: Inform SoC Core that we have our own I/O arrangementsLee Jones1-0/+4
If codec->control_data is not populated SoC Core assumes we want to use regmap, which fails catastrophically, as we don't have one: Unable to handle kernel NULL pointer dereference at virtual address 00000080 pgd = c0004000 [00000080] *pgd=00000000 Internal error: Oops: 17 [#1] PREEMPT SMP ARM Modules linked in: CPU: 1 Not tainted (3.5.0-rc6-00884-g0b2419e-dirty #130) PC is at regmap_read+0x10/0x5c LR is at hw_read+0x80/0x90 pc : [<c01a91b8>] lr : [<c0216804>] psr: 60000013 Signed-off-by: Lee Jones <[email protected]> Signed-off-by: Mark Brown <[email protected]>
2012-07-31time: Remove all direct references to timekeeperJohn Stultz1-128/+154
Ingo noted that the numerous timekeeper.value references made the timekeeping code ugly and caused many long lines that had to be broken up. He recommended replacing timekeeper.value references with tk->value. This patch provides a local tk value for all top level time functions and sets it to &timekeeper. Then all timekeeper access is done via a tk pointer. Signed-off-by: John Stultz <[email protected]> Cc: Prarit Bhargava <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2012-07-31time: Clean up offs_real/wall_to_mono and offs_boot/total_sleep_time updatesJohn Stultz1-36/+54
For performance reasons, we maintain ktime_t based duplicates of wall_to_monotonic (offs_real) and total_sleep_time (offs_boot). Since large problems could occur (such as the resume regression on 3.5-rc7, or the leapsecond hrtimer issue) if these value pairs were to be inconsistently updated, this patch this cleans up how we modify these value pairs to ensure we are always consistent. As a side-effect this is also more efficient as we only caulculate the duplicate values when they are changed, rather then every update_wall_time call. This also provides WARN_ONs to detect if future changes break the invariants. Signed-off-by: John Stultz <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Richard Cochran <[email protected]> Cc: Prarit Bhargava <[email protected]> Link: http://lkml.kernel.org/r/[email protected] [ Cleaned up minor style issues. ] Signed-off-by: Ingo Molnar <[email protected]>
2012-07-31time: Clean up stray newlinesJohn Stultz1-10/+0
Ingo noted inconsistent newline usage between functions. This patch cleans those up. Signed-off-by: John Stultz <[email protected]> Cc: Prarit Bhargava <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2012-07-31time/jiffies: Rename ACTHZ to SHIFTED_HZJohn Stultz4-11/+16
Ingo noted that ACTHZ is a confusing name, and requested it be renamed, so this patch renames ACTHZ to SHIFTED_HZ to better describe it. Signed-off-by: John Stultz <[email protected]> Cc: Prarit Bhargava <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2012-07-31time/jiffies: Allow CLOCK_TICK_RATE to be undefinedCatalin Marinas1-4/+8
CLOCK_TICK_RATE is a legacy constant that defines the timer device's granularity. On hardware with particularly coarse granularity, this constant is used to reduce accumulated time error when using jiffies as a clocksource, by calculating the hardware's actual tick length rather then just assuming it is 1sec/HZ. However, for the most part this is unnecessary, as most modern systems don't use jiffies for their clocksource, and their tick device is sufficiently fine grained to avoid major error. Thus, this patch allows an architecture to not define CLOCK_TICK_RATE, in which case ACTHZ defaults to (HZ << 8). Signed-off-by: Catalin Marinas <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Cc: Richard Cochran <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Andrew Morton <[email protected]> [ Commit log & intention tweaks ] Signed-off-by: John Stultz <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2012-07-31Merge branch 'linus' into timers/urgentIngo Molnar396-4797/+9316
Merge in Linus's branch which already has timers/core merged. Signed-off-by: Ingo Molnar <[email protected]>
2012-07-31perf/trace: Add ability to set a target task for eventsAndrew Vagin11-14/+60
A few events are interesting not only for a current task. For example, sched_stat_* events are interesting for a task which wakes up. For this reason, it will be good if such events will be delivered to a target task too. Now a target task can be set by using __perf_task(). The original idea and a draft patch belongs to Peter Zijlstra. I need these events for profiling sleep times. sched_switch is used for getting callchains and sched_stat_* is used for getting time periods. These events are combined in user space, then it can be analyzed by perf tools. Inspired-by: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Arun Sharma <[email protected]> Signed-off-by: Andrew Vagin <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2012-07-31perf/x86: Fix USER/KERNEL tagging of samples properlyPeter Zijlstra5-17/+114
Some PMUs don't provide a full register set for their sample, specifically 'advanced' PMUs like AMD IBS and Intel PEBS which provide 'better' than regular interrupt accuracy. In this case we use the interrupt regs as basis and over-write some fields (typically IP) with different information. The perf core however uses user_mode() to distinguish user/kernel samples, user_mode() relies on regs->cs. If the interrupt skid pushed us over a boundary the new IP might not be in the same domain as the interrupt. Commit ce5c1fe9a9e ("perf/x86: Fix USER/KERNEL tagging of samples") tried to fix this by making the perf core use kernel_ip(). This however is wrong (TM), as pointed out by Linus, since it doesn't allow for VM86 and non-zero based segments in IA32 mode. Therefore, provide a new helper to set the regs->ip field, set_linear_ip(), which massages the regs into a suitable state assuming the provided IP is in fact a linear address. Also modify perf_instruction_pointer() and perf_callchain_user() to deal with segments base offsets. Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1341910954.3462.102.camel@twins Signed-off-by: Ingo Molnar <[email protected]>
2012-07-31perf/x86/intel/uncore: Make UNCORE_PMU_HRTIMER_INTERVAL 64-bitAndrew Morton1-1/+1
i386 allmodconfig: arch/x86/kernel/cpu/perf_event_intel_uncore.c: In function 'uncore_pmu_hrtimer': arch/x86/kernel/cpu/perf_event_intel_uncore.c:728: warning: integer overflow in expression arch/x86/kernel/cpu/perf_event_intel_uncore.c: In function 'uncore_pmu_start_hrtimer': arch/x86/kernel/cpu/perf_event_intel_uncore.c:735: warning: integer overflow in expression Signed-off-by: Andrew Morton <[email protected]> Cc: Zheng Yan <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2012-07-31sched/cleanups: Add load balance cpumask pointer to 'struct lb_env'Michael Wang1-15/+14
With this patch struct ld_env will have a pointer of the load balancing cpumask and we don't need to pass a cpumask around anymore. Signed-off-by: Michael Wang <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2012-07-31vfio: Add PCI device driverAlex Williamson9-0/+3259
Add PCI device support for VFIO. PCI devices expose regions for accessing config space, I/O port space, and MMIO areas of the device. PCI config access is virtualized in the kernel, allowing us to ensure the integrity of the system, by preventing various accesses while reducing duplicate support across various userspace drivers. I/O port supports read/write access while MMIO also supports mmap of sufficiently sized regions. Support for INTx, MSI, and MSI-X interrupts are provided using eventfds to userspace. Signed-off-by: Alex Williamson <[email protected]>
2012-07-31vfio: Type1 IOMMU implementationAlex Williamson5-1/+821
This VFIO IOMMU backend is designed primarily for AMD-Vi and Intel VT-d hardware, but is potentially usable by anything supporting similar mapping functionality. We arbitrarily call this a Type1 backend for lack of a better name. This backend has no IOVA or host memory mapping restrictions for the user and is optimized for relatively static mappings. Mapped areas are pinned into system memory. Signed-off-by: Alex Williamson <[email protected]>