Age  Commit message  (Author, files changed, lines -/+)
2010-03-06  rmap: move exclusively owned pages to own anon_vma in do_wp_page()  (Rik van Riel, 3 files, -0/+32)
When the parent process breaks the COW on a page, both the original page, which stays mapped in the child, and the new page, which is mapped in the parent, end up in the same anon_vma. Generally this won't be a problem, but for some workloads it can preserve the O(N) rmap scanning complexity.

A simple fix is to ensure that, when a page is reused in do_wp_page() because we are already its exclusive owner, the page gets moved to our own anon_vma, of which we are the exclusive user.

Signed-off-by: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Larry Woodman <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
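For illustration, a minimal sketch of the reuse-time move (the helper name is hypothetical; the real patch wires this into do_wp_page() through the rmap code, and anon pages store their anon_vma in page->mapping with the PAGE_MAPPING_ANON bit set):

	/* sketch: point an exclusively owned page's rmap at the
	 * reusing process's own anon_vma */
	static void move_page_to_own_anon_vma(struct page *page,
					      struct vm_area_struct *vma)
	{
		struct anon_vma *anon_vma = vma->anon_vma;

		page->mapping = (struct address_space *)
			((unsigned long)anon_vma | PAGE_MAPPING_ANON);
	}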
2010-03-06  rmap: remove obsolete check from __page_check_anon_rmap()  (Rik van Riel, 1 file, -3/+0)
When an anonymous page is inherited from a parent process, the vma->anon_vma can differ from the page's anon_vma. This can trip up __page_check_anon_rmap(), which is indirectly called from do_swap_page(). Remove that now-obsolete check to prevent an oops.

Signed-off-by: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Larry Woodman <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06  mm: change anon_vma linking to fix multi-process server scalability issue  (Rik van Riel, 14 files, -85/+298)
The old anon_vma code can lead to scalability issues with heavily forking workloads. Specifically, each anon_vma is shared between the parent process and all its child processes.

In a workload with 1000 child processes and a VMA with 1000 anonymous pages per process that get COWed, this leads to a system with a million anonymous pages in the same anon_vma, each of which is mapped in just one of the 1000 processes. However, the current rmap code needs to walk them all, leading to O(N) scanning complexity for each page.

This can result in systems where one CPU is walking the page tables of 1000 processes in page_referenced_one, while all other CPUs are stuck on the anon_vma lock. This leads to catastrophic failure for a benchmark like AIM7, where the total number of processes can reach into the tens of thousands. Real workloads are still a factor of 10 less process-intensive than AIM7, but they are catching up.

This patch changes the way anon_vmas and VMAs are linked, which allows us to associate multiple anon_vmas with a VMA. At fork time, each child process gets its own anon_vmas, in which its COWed pages will be instantiated. The parent's anon_vma is also linked to the VMA, because non-COWed pages could be present in any of the children.

This reduces rmap scanning complexity to O(1) for the pages of the 1000 child processes, with O(N) complexity for at most 1/N pages in the system. This reduces the average scanning cost in heavily forking workloads from O(N) to 2.

The only real complexity in this patch stems from the fact that linking a VMA to anon_vmas now involves memory allocations. This means vma_adjust can fail if it needs to attach a VMA to anon_vma structures. This in turn means error handling needs to be added to the calling functions.

A second source of complexity is that, because there can be multiple anon_vmas, the anon_vma linking in vma_adjust can no longer be done under "the" anon_vma lock. To prevent the rmap code from walking up an incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h to make sure it is impossible to compile a kernel that needs both symbolic values for the same bitflag.

Some test results: without the anon_vma changes, when AIM7 hits around 9.7k users (on a test box with 16GB RAM and not quite enough IO), the system ends up running >99% in system time, with every CPU on the same anon_vma lock in the pageout code. With these changes, AIM7 hits the cross-over point around 29.7k users. This happens with ~99% IO wait time; there never seems to be any spike in system time. The anon_vma lock contention appears to be resolved.

[[email protected]: cleanups]
Signed-off-by: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Larry Woodman <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
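The pivot of the new scheme is a small connector object, one per (vma, anon_vma) pair; a sketch of its shape (field names per the upstream patch, locking comments abridged):

	struct anon_vma_chain {
		struct vm_area_struct *vma;	/* the VMA this link belongs to */
		struct anon_vma *anon_vma;	/* one of the VMA's anon_vmas */
		struct list_head same_vma;	/* chain of all links for this VMA */
		struct list_head same_anon_vma;	/* chain of all links for this anon_vma */
	};

rmap walks same_anon_vma to find every VMA a page may be mapped in; fork and unmap walk same_vma to manage the VMA's set of anon_vmas.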
2010-03-06  mm/memcontrol.c: fix "integer as NULL pointer" sparse warning  (Thiago Farina, 1 file, -1/+1)
mm/memcontrol.c:2548:32: warning: Using plain integer as NULL pointer

Signed-off-by: Thiago Farina <[email protected]>
Acked-by: Balbir Singh <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
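The fix for this class of warning is typically of this shape (illustrative, not the exact hunk from mm/memcontrol.c):

	-	return 0;	/* pointer-returning context: sparse complains */
	+	return NULL;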
2010-03-06  include/linux/fs.h: convert FMODE_* constants to hex  (Andrew Morton, 1 file, -11/+11)
It was tolerable until Eric went and added 8388608.

Cc: Eric Paris <[email protected]>
Cc: Wu Fengguang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
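The point, on a hypothetical flag of that value (8388608 == 0x800000, a single bit that decimal notation hides):

	-#define FMODE_NEW_FLAG	((__force fmode_t)8388608)
	+#define FMODE_NEW_FLAG	((__force fmode_t)0x800000)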
2010-03-06  readahead: introduce FMODE_RANDOM for POSIX_FADV_RANDOM  (Wu Fengguang, 3 files, -1/+18)
This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM.

POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor performance: a 16K read is carried out in 4 _sync_ 1-page reads. In other places, ra_pages == 0 means:
- it's ramfs/tmpfs/hugetlbfs/sysfs/configfs
- some IO error happened where multi-page read IO won't help or should be avoided.

POSIX_FADV_RANDOM actually wants different semantics: disable the *heuristic* readahead algorithm and use a dumb one that faithfully submits read IO for whatever the application requests.

So introduce a flag FMODE_RANDOM for POSIX_FADV_RANDOM.

Note that the random hint is not likely to help random read performance noticeably, and it may be too permissive on huge request sizes (its IO size is not limited by read_ahead_kb).

In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall (NFS read) performance of the application increased by 313%!

Tested-by: Quentin Barnes <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Steven Whitehouse <[email protected]>
Cc: David Howells <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Chuck Lever <[email protected]>
Cc: <[email protected]> [2.6.33.x]
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
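The shape of the fadvise-side change (a sketch following the changelog and the companion f_lock patch below, not the verbatim hunk):

	case POSIX_FADV_NORMAL:
		file->f_ra.ra_pages = bdi->ra_pages;	/* restore heuristic readahead */
		spin_lock(&file->f_lock);
		file->f_mode &= ~FMODE_RANDOM;
		spin_unlock(&file->f_lock);
		break;
	case POSIX_FADV_RANDOM:
		spin_lock(&file->f_lock);
		file->f_mode |= FMODE_RANDOM;	/* readahead now submits exactly what was asked */
		spin_unlock(&file->f_lock);
		break;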
2010-03-06  vfs: take f_lock on modifying f_mode after open time  (Wu Fengguang, 2 files, -0/+4)
We'll introduce FMODE_RANDOM, which will be modified at runtime. So protect all runtime modifications of f_mode with f_lock to avoid races.

Signed-off-by: Wu Fengguang <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Chuck Lever <[email protected]>
Cc: <[email protected]> [2.6.33.x]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06  mm/migrate.c: kill anon local variable from migrate_page_copy  (KOSAKI Motohiro, 1 file, -4/+0)
commit 01b1ae63c2 ("memcg: simple migration handling") removed the mem_cgroup_uncharge_cache_page() call from migrate_page_copy. The local variable `anon' is now unused.

Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06  mm/mempolicy.c: fix indentation of the comments of do_migrate_pages  (KOSAKI Motohiro, 1 file, -30/+30)
Currently, do_migrate_pages() has a very long comment that is not indented properly. I often mistake it for the function's opening comment and get confused. This patch fixes the indentation.

Note: this patch doesn't break the 80-column rule. I guess the original author intended this indentation, but an accident corrupted it.

Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06  memory-hotplug: create /sys/firmware/memmap entry for new memory  ([email protected], 3 files, -24/+43)
A memmap is a directory in sysfs which includes 3 text files: start, end and type. For example:

start: 0x100000
end:   0x7e7b1cff
type:  System RAM

Interface firmware_map_add was not called explicitly. Remove it and add firmware_map_add_hotplug as the hotplug interface of memmap.

Each memory entry has a memmap in sysfs, but when we hot-add new memory, sysfs does not export a memmap entry for it. We add a call in add_memory() to firmware_map_add_hotplug().

Add a new function add_sysfs_fw_map_entry() to create the memmap entry; it is called both when memmap is initialized and when memory is hot-added.

[[email protected]: un-kernedoc a no longer kerneldoc comment]
Signed-off-by: Shaohui Zheng <[email protected]>
Acked-by: Andi Kleen <[email protected]>
Acked-by: Yasunori Goto <[email protected]>
Reviewed-by: Wu Fengguang <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
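A sketch of how the hook slots into memory hot-add (the inclusive end bound and the call placement are assumptions here; the function name comes from the changelog):

	int firmware_map_add_hotplug(u64 start, u64 end, const char *type);

	/* in add_memory(), after the new range has been registered: */
	firmware_map_add_hotplug(start, start + size - 1, "System RAM");
	/* creates /sys/firmware/memmap/<X>/{start,end,type} */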
2010-03-06  mm: fix mbind vma merge problem  (KOSAKI Motohiro, 1 file, -13/+39)
Strangely, the current mbind() doesn't merge a vma with its neighbor vma although that is possible. Unfortunately, a proliferation of vmas can reduce performance. This patch fixes it.

Reproducer program:
----------------------------------------------------------------
#include <numaif.h>
#include <numa.h>
#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>

static unsigned long pagesize;

int main(int argc, char **argv)
{
	void *addr;
	int ch;
	int node;
	struct bitmask *nmask = numa_allocate_nodemask();
	int err;
	int node_set = 0;
	char buf[128];

	while ((ch = getopt(argc, argv, "n:")) != -1) {
		switch (ch) {
		case 'n':
			node = strtol(optarg, NULL, 0);
			numa_bitmask_setbit(nmask, node);
			node_set = 1;
			break;
		default:
			break;
		}
	}
	argc -= optind;
	argv += optind;

	if (!node_set)
		numa_bitmask_setbit(nmask, 0);

	pagesize = getpagesize();

	addr = mmap(NULL, pagesize*3, PROT_READ|PROT_WRITE,
		    MAP_ANON|MAP_PRIVATE, 0, 0);
	if (addr == MAP_FAILED)
		perror("mmap "), exit(1);

	fprintf(stderr, "pid = %d\naddr = %p\n", getpid(), addr);

	/* make pages populated */
	memset(addr, 0, pagesize*3);

	/* first mbind */
	err = mbind(addr+pagesize, pagesize, MPOL_BIND, nmask->maskp,
		    nmask->size, MPOL_MF_MOVE_ALL);
	if (err)
		perror("mbind1 ");

	/* second mbind */
	err = mbind(addr, pagesize*3, MPOL_DEFAULT, NULL, 0, 0);
	if (err)
		perror("mbind2 ");

	sprintf(buf, "cat /proc/%d/maps", getpid());
	system(buf);

	return 0;
}
----------------------------------------------------------------

Result without this patch:

	addr = 0x7fe26ef09000
	[snip]
	7fe26ef09000-7fe26ef0a000 rw-p 00000000 00:00 0
	7fe26ef0a000-7fe26ef0b000 rw-p 00000000 00:00 0
	7fe26ef0b000-7fe26ef0c000 rw-p 00000000 00:00 0
	7fe26ef0c000-7fe26ef0d000 rw-p 00000000 00:00 0

	=> 0x7fe26ef09000-0x7fe26ef0c000 spans three vmas.

Result with this patch:

	addr = 0x7fc9ebc76000
	[snip]
	7fc9ebc76000-7fc9ebc7a000 rw-p 00000000 00:00 0
	7fffbe690000-7fffbe6a5000 rw-p 00000000 00:00 0 [stack]

	=> 0x7fc9ebc76000-0x7fc9ebc7a000 is a single vma.

[[email protected]: fix file offset passed to vma_merge()]
Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06  mm: restore zone->all_unreclaimable to independence word  (KOSAKI Motohiro, 4 files, -23/+14)
commit e815af95 ("change all_unreclaimable zone member to flags") changed the all_unreclaimable member into a bit flag. But it had an undesirable side effect: free_one_page() is one of the hottest paths in the kernel, and increasing atomic ops in it can reduce kernel performance a bit. Thus, this patch partially reverts that commit; at the least, all_unreclaimable shouldn't share a memory word with the other zone flags.

[[email protected]: fix patch interaction]
Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Huang Shijie <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06  mm: remove free_hot_page()  (Li Hong, 3 files, -9/+5)
free_hot_page() is just a wrapper around free_hot_cold_page() with parameter 'cold = 0'. After adding a clear comment to free_hot_cold_page(), it is reasonable to remove this level of indirection.

[[email protected]: fix build]
Signed-off-by: Li Hong <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Larry Woodman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Li Ming Chun <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Americo Wang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
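For reference, per the changelog the function being removed reduces to:

	void free_hot_page(struct page *page)
	{
		free_hot_cold_page(page, 0);	/* cold == 0: treat the page as cache-hot */
	}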
2010-03-06  mm/page_alloc.c: adjust a call site to trace_mm_page_free_direct  (Li Hong, 1 file, -1/+1)
Move the call of trace_mm_page_free_direct() from free_hot_page() to free_hot_cold_page(). It is clearer and closer to kmemcheck_free_shadow(), matching what is done in __free_pages_ok().

Signed-off-by: Li Hong <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Larry Woodman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Li Ming Chun <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06  mm/page_alloc.c: remove duplicate call to trace_mm_page_free_direct  (Li Hong, 1 file, -1/+1)
trace_mm_page_free_direct() is called in __free_pages(). But it is called again in free_hot_page() if order == 0, producing duplicate records in the trace file for the mm_page_free_direct event, as below:

K-PID               CPU#  TIMESTAMP    FUNCTION
gnome-terminal-1567 [000] 4415.246466: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
gnome-terminal-1567 [000] 4415.246468: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
gnome-terminal-1567 [000] 4415.246506: mm_page_alloc: page=ffffea0003db9f40 pfn=1155800 order=0 migratetype=0 gfp_flags=GFP_KERNEL
gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0

This patch removes the first call and adds a call to trace_mm_page_free_direct() in __free_pages_ok().

Signed-off-by: Li Hong <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Larry Woodman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Li Ming Chun <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06  mm, lockdep: annotate reclaim context to zone reclaim too  (KOSAKI Motohiro, 1 file, -0/+2)
Commit cf40bd16fd ("lockdep: annotate reclaim context") introduced reclaim-context annotation, but it didn't annotate zone reclaim. This patch does so.

The point is that commit cf40bd16fd annotated __alloc_pages_direct_reclaim, but zone reclaim doesn't go through __alloc_pages_direct_reclaim. The current call graph is:

__alloc_pages_nodemask
   get_page_from_freelist
       zone_reclaim()
   __alloc_pages_slowpath
       __alloc_pages_direct_reclaim
           try_to_free_pages

In fact, if zone_reclaim_mode=1, the VM never calls __alloc_pages_direct_reclaim under usual VM pressure.

Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Acked-by: Nick Piggin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
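A sketch of the annotation (API names of this era; the reclaim body is a stand-in for the existing zone-reclaim logic):

	static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
	{
		int nr_reclaimed;

		lockdep_set_current_reclaim_state(gfp_mask);
		nr_reclaimed = do_zone_shrink(zone, gfp_mask, order);	/* hypothetical body */
		lockdep_clear_current_reclaim_state();

		return nr_reclaimed;
	}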
2010-03-06  vmscan: get_scan_ratio() cleanup  (KOSAKI Motohiro, 1 file, -9/+14)
get_scan_ratio() should contain all scan-ratio-related calculations. Thus, this patch moves some of them into get_scan_ratio().

Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06  vmscan: check high watermark after shrink zone  (Minchan Kim, 1 file, -10/+12)
Kswapd checks that a zone has sufficient free pages via zone_watermark_ok(). If any zone doesn't have enough pages, we set all_zones_ok to zero; !all_zones_ok makes kswapd retry rather than sleep.

I think the watermark check before shrink_zone() is pointless. Only after kswapd has tried to shrink the zone is the check meaningful. Move the check to after the call to shrink_zone().

[[email protected]: fix comment, layout]
Signed-off-by: Minchan Kim <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Reviewed-by: Wu Fengguang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06  mm: use rlimit helpers  (Jiri Slaby, 4 files, -14/+15)
Make sure the compiler won't do weird things with limits; e.g. fetching them twice may return two different values after writable limits are implemented.

I.e. either use the rlimit helpers added in 3e10e716abf3c71bdb5d86b8f507f9e72236c9cd ("resource: add helpers for fetching rlimits"), or ACCESS_ONCE where those don't apply.

Signed-off-by: Jiri Slaby <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
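The conversion pattern (the surrounding check is illustrative; rlimit() is the helper from the commit cited above):

	/* before: the compiler may fetch the limit twice */
	if (size > current->signal->rlim[RLIMIT_AS].rlim_cur)
		return -ENOMEM;

	/* after: the limit is read exactly once */
	if (size > rlimit(RLIMIT_AS))
		return -ENOMEM;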
2010-03-06  mm: mlock_vma_pages_range() only return success or failure  (KOSAKI Motohiro, 1 file, -2/+2)
Currently, mlock_vma_pages_range() returns only len or 0, which makes the error handling in mmap_region() needlessly complex. This patch simplifies it and makes it consistent with the brk() code.

Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06  mm: mlock_vma_pages_range() never return negative value  (KOSAKI Motohiro, 1 file, -9/+2)
Currently, mlock_vma_pages_range() never returns a negative value, so we can remove some worthless error checks.

Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06  mm: count swap usage  (KAMEZAWA Hiroyuki, 6 files, -7/+23)
A frequent question from users about memory management is how many swap entries are used per process, and this information would give some hints to the oom-killer. While we can count the number of swap entries per process by scanning /proc/<pid>/smaps, that is very slow and not suitable for the usual process-information handlers that work like 'ps' or 'top' (ps and top are slow enough already).

This patch adds a counter of swap entries to mm_counter and updates it at each swap event. The information is exported via the /proc/<pid>/status file as:

[kamezawa@bluextal memory]$ cat /proc/self/status
Name:   cat
State:  R (running)
Tgid:   2910
Pid:    2910
PPid:   2823
TracerPid:  0
Uid:    500  500  500  500
Gid:    500  500  500  500
FDSize: 256
Groups: 500
VmPeak:    82696 kB
VmSize:    82696 kB
VmLck:         0 kB
VmHWM:       432 kB
VmRSS:       432 kB
VmData:      172 kB
VmStk:        84 kB
VmExe:        48 kB
VmLib:      1568 kB
VmPTE:        40 kB
VmSwap:        0 kB   <=============== this.

[[email protected]: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06  mm: avoid false sharing of mm_counter  (KAMEZAWA Hiroyuki, 7 files, -15/+107)
Considering the nature of per-mm stats, they are a shared object among threads and can be a cache-miss point in the page-fault path.

This patch adds a per-thread cache for mm_counter. The RSS value is accumulated in a struct in task_struct and synchronized with the mm's counter at certain events. In this patch, the event is the number of calls to handle_mm_fault: the per-thread value is added to the mm every 64 calls.

A rough estimate with a small benchmark on parallel threads (2 threads) shows:
[before] 4.5 cache-misses/fault
[after]  4.0 cache-misses/fault

Anyway, the most contended object is still mmap_sem as the number of threads grows.

[[email protected]: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
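A sketch of the per-thread cache and its sync point (names follow the changelog's description; the exact layout may differ):

	#define TASK_RSS_EVENTS_THRESH	64	/* fold deltas into mm every 64 faults */

	struct task_rss_stat {
		int events;			/* handle_mm_fault() calls since last sync */
		int count[NR_MM_COUNTERS];	/* per-thread deltas, no atomics needed */
	};

	/* called from handle_mm_fault() */
	static void check_sync_rss_stat(struct task_struct *task)
	{
		if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH))
			sync_mm_rss(task, task->mm);	/* add deltas to mm, zero the cache */
	}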
2010-03-06  mm: clean up mm_counter  (KAMEZAWA Hiroyuki, 12 files, -101/+174)
Presently, the per-mm statistics counters are defined by macros in sched.h. This patch modifies that to:
- define them in mm.h as inline functions
- use an array instead of macro-based name generation

This patch is for reducing the size of a future patch that changes the implementation of the per-mm counters.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
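A sketch of the array form replacing the macro-generated names (member names per the changelog; the actual accessors in the patch may differ):

	enum {
		MM_FILEPAGES,
		MM_ANONPAGES,
		NR_MM_COUNTERS
	};

	static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
	{
		return (unsigned long)atomic_long_read(&mm->rss_stat.count[member]);
	}

	static inline void inc_mm_counter(struct mm_struct *mm, int member)
	{
		atomic_long_inc(&mm->rss_stat.count[member]);
	}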
2010-03-06  infiniband: use for_each_set_bit()  (Akinobu Mita, 1 file, -16/+5)
Replace an open-coded loop with for_each_set_bit().

Signed-off-by: Akinobu Mita <[email protected]>
Acked-by: Roland Dreier <[email protected]>
Cc: Sean Hefty <[email protected]>
Cc: Hal Rosenstock <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
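The conversion pattern (port_mask and handle_port are illustrative):

	unsigned long port_mask;
	int i;

	/* before: open-coded bit scan */
	for (i = 0; i < BITS_PER_LONG; i++)
		if (test_bit(i, &port_mask))
			handle_port(i);

	/* after */
	for_each_set_bit(i, &port_mask, BITS_PER_LONG)
		handle_port(i);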
2010-03-06  bitops: rename for_each_bit() to for_each_set_bit()  (Akinobu Mita, 18 files, -24/+26)
Rename for_each_bit to for_each_set_bit throughout the kernel source tree, to permit a future for_each_clear_bit(), should that ever be added.

The patch includes a macro to map the old for_each_bit() onto the new for_each_set_bit(). This is a (very) temporary thing to ease the migration.

[[email protected]: add temporary for_each_bit()]
Suggested-by: Alexey Dobriyan <[email protected]>
Suggested-by: Andrew Morton <[email protected]>
Signed-off-by: Akinobu Mita <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Russell King <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Artem Bityutskiy <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06  timbgpio: fix build  (David Miller, 1 file, -0/+1)
Use of get_irq_chip_data() et al. requires including linux/irq.h.

Signed-off-by: David S. Miller <[email protected]>
Cc: Richard Röjfors <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06  Fix a dumb typo - use of & instead of &&  (Al Viro, 1 file, -1/+1)
We managed to lose O_DIRECTORY testing due to a stupid typo in commit 1f36f774b2 ("Switch !O_CREAT case to use of do_last()").

Reported-by: Walter Sheets <[email protected]>
Signed-off-by: Al Viro <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
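The bug class, illustrated (not the exact site):

	int a = 2, b = 1;

	if (a & b)	/* bitwise AND: 2 & 1 == 0, so the test silently fails */
		take_branch();
	if (a && b)	/* logical AND: both non-zero, so the test passes */
		take_branch();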
2010-03-06  [WATCHDOG] gef_wdt: Author corrections following split of GE Fanuc joint venture  (Martyn Welch, 2 files, -9/+9)
This patch corrects author and copyright notices following the split-up of the GE Fanuc joint venture.

Signed-off-by: Martyn Welch <[email protected]>
Signed-off-by: Wim Van Sebroeck <[email protected]>
2010-03-06  [WATCHDOG] iTCO_wdt: clean up probe(), modify err msg  (Naga Chumbalkar, 1 file, -10/+9)
It's possible that the platform is not allowing reboot via TCO timer expiration. Also, differentiate between not finding a chipset that has TCO, and the case where TCO is present but the driver fails to initialize for some reason.

Signed-off-by: Naga Chumbalkar <[email protected]>
Signed-off-by: Wim Van Sebroeck <[email protected]>
2010-03-06  [WATCHDOG] ep93xx: watchdog timer driver for TS-72xx SBCs cleanup  (Wim Van Sebroeck, 1 file, -4/+8)
Clean up the driver:
* make release the reverse of probe so that both are consistent
* add the WDIOC_GETSTATUS & WDIOC_GETBOOTSTATUS ioctls.

Signed-off-by: Wim Van Sebroeck <[email protected]>
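A sketch of the two added ioctl cases (standard watchdog API; the surrounding switch and the boot_status value are assumptions):

	case WDIOC_GETSTATUS:
		return put_user(0, (int __user *)arg);
	case WDIOC_GETBOOTSTATUS:
		/* e.g. WDIOF_CARDRESET if the last reboot came from the WDT */
		return put_user(boot_status, (int __user *)arg);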
2010-03-06  [WATCHDOG] support for max63xx watchdog timer chips  (Marc Zyngier, 3 files, -0/+403)
This driver adds support for the max63{69,70,71,72,73,74} family of watchdog timer chips. It has been tested on an Arcom Zeus (max6369).

Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Wim Van Sebroeck <[email protected]>
2010-03-06  [WATCHDOG] ep93xx: added platform side support for TS-72xx WDT driver  (Mika Westerberg, 2 files, -0/+23)
Signed-off-by: Mika Westerberg <[email protected]>
Signed-off-by: Wim Van Sebroeck <[email protected]>
2010-03-06  [WATCHDOG] ep93xx: implemented watchdog timer driver for TS-72xx SBCs  (Mika Westerberg, 3 files, -0/+528)
Technologic Systems TS-72xx SBCs have external glue logic in a CPLD which includes a watchdog timer. This driver implements kernel support for it.

Signed-off-by: Mika Westerberg <[email protected]>
Signed-off-by: Wim Van Sebroeck <[email protected]>
2010-03-06  [LogFS] Change magic number  (Joern Engel, 1 file, -1/+1)
Many changes were made during development that could result in old versions of mklogfs and the kernel code being subtly incompatible. Not being a friend of subtleties, I hereby change the magic number. Any old version of mklogfs is now guaranteed to fail.
2010-03-06  [LogFS] Remove h_version field  (Joern Engel, 2 files, -6/+5)
Incompatible change: h_compr is moved up so the padding is all in one chunk.
2010-03-06  dm raid1: fix deadlock when suspending failed device  (Takahiro Yasui, 1 file, -18/+23)
To prevent deadlock, bios on the hold list must be flushed before dm_rh_stop_recovery() is called in mirror_presuspend().

When there are pending bios on the hold list, recovery waits for their completion after acquiring recovery_count. recovery_count is released when recovery finishes; however, the bios on the hold list are only processed after dm_rh_stop_recovery() in mirror_presuspend(), and dm_rh_stop_recovery() also acquires recovery_count. Recovery therefore can't start because of the pending bios, and a deadlock occurs.

Signed-off-by: Takahiro Yasui <[email protected]>
Signed-off-by: Alasdair G Kergon <[email protected]>
Reviewed-by: Mikulas Patocka <[email protected]>
2010-03-06  dm: eliminate some holes data structures  (Mike Snitzer, 2 files, -15/+15)
Eliminate a 4-byte hole in 'struct dm_io_memory' by moving 'offset' above the 'ptr' to which it applies (size reduced from 24 to 16 bytes). By association, a 4-byte hole is eliminated in 'struct dm_io_request' (size reduced from 56 to 48 bytes).

Also eliminate all six 4-byte holes, and with them a cache line, in 'struct dm_snapshot' (size reduced from 392 to 368 bytes).

Signed-off-by: Mike Snitzer <[email protected]>
Signed-off-by: Alasdair G Kergon <[email protected]>
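The mechanics, sketched on a struct of the same shape (types are illustrative; hole accounting is for 64-bit):

	struct dm_io_memory_before {	/* 24 bytes */
		int type;		/* 4 bytes, then a 4-byte hole:   */
		void *ptr;		/*   'ptr' needs 8-byte alignment */
		unsigned offset;	/* 4 bytes + 4 bytes tail padding */
	};

	struct dm_io_memory_after {	/* 16 bytes: both holes gone */
		int type;
		unsigned offset;	/* moved above the 'ptr' it applies to */
		void *ptr;
	};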
2010-03-06  dm ioctl: introduce flag indicating uevent was generated  (Peter Rajnoha, 4 files, -14/+25)
Set a new DM_UEVENT_GENERATED_FLAG when returning from ioctls to indicate that a uevent was actually generated. This tells the userspace caller that it may need to wait for the event to be processed.

Signed-off-by: Peter Rajnoha <[email protected]>
Signed-off-by: Alasdair G Kergon <[email protected]>
2010-03-06  dm: free dm_io before bio_endio not after  (Mikulas Patocka, 1 file, -2/+2)
Free the dm_io structure before calling bio_endio() instead of after it, to ensure that the io_pool containing it is not referenced after it is freed.

This partially fixes a problem described here: https://www.redhat.com/archives/dm-devel/2010-February/msg00109.html

thread 1: bio_endio(bio, io_error);
          /* scheduling happens */
thread 2: close the device
          remove the device
thread 1: free_io(md, io);

Thread 2, when removing the device, sees a non-empty md->io_pool (because the io hasn't been freed by thread 1 yet) and may crash with a BUG in mempool_free. Thread 1 may also crash when freeing into a nonexistent mempool.

To fix this we must make sure that bio_endio() is the last call and the md structure is not accessed afterwards.

There is another bio_endio in process_barrier, but it is called from the thread, and the thread is destroyed prior to freeing the mempools, so that call is not affected by the bug.

A similar bug exists with module unloads - the module may be unloaded immediately after bio_endio - but that is more difficult to fix.

Signed-off-by: Mikulas Patocka <[email protected]>
Cc: [email protected]
Signed-off-by: Alasdair G Kergon <[email protected]>
2010-03-06  dm table: remove unused dm_get_device range parameters  (Nikanth Karthikesan, 10 files, -32/+21)
Remove the unused parameters (start and len) of dm_get_device() and fix up the callers.

Signed-off-by: Nikanth Karthikesan <[email protected]>
Signed-off-by: Alasdair G Kergon <[email protected]>
2010-03-06  dm ioctl: only issue uevent on resume if state changed  (Mike Snitzer, 1 file, -4/+5)
Only issue a uevent on a resume if the state of the device changed, i.e. if it was suspended and/or its table was replaced.

Signed-off-by: Dave Wysochanski <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
Cc: [email protected]
Signed-off-by: Alasdair G Kergon <[email protected]>
2010-03-06  dm raid1: always return error if all legs fail  (Mikulas Patocka, 1 file, -3/+6)
If all mirror legs fail, always return an error instead of holding the bio, even if the handle_errors option was set. At present it is the responsibility of the driver underneath us to deal with retries, multipath etc.

The patch adds the bio to the failures list instead of holding it directly. do_failures tests first if all legs failed and, if so, returns the bio with -EIO. If any leg is still alive and handle_errors is set, do_failures calls hold_bio.

Reviewed-by: Takahiro Yasui <[email protected]>
Signed-off-by: Mikulas Patocka <[email protected]>
Signed-off-by: Alasdair G Kergon <[email protected]>
2010-03-06  dm mpath: refactor pg_init  (Kiyoshi Ueda, 1 file, -12/+19)
This patch pulls the pg_init path activation code out of process_queued_ios() into a new function. No functional change.

Signed-off-by: Kiyoshi Ueda <[email protected]>
Signed-off-by: Jun'ichi Nomura <[email protected]>
Signed-off-by: Alasdair G Kergon <[email protected]>
2010-03-06  dm mpath: wait for pg_init completion when suspending  (Kiyoshi Ueda, 1 file, -3/+35)
When suspending the device we must wait for all I/O to complete, but pg-init may still be in progress even after flushing the workqueue for kmpath_handlerd in multipath_postsuspend. This patch waits for pg-init completion correctly in multipath_postsuspend().

Signed-off-by: Kiyoshi Ueda <[email protected]>
Signed-off-by: Jun'ichi Nomura <[email protected]>
Signed-off-by: Alasdair G Kergon <[email protected]>
2010-03-06  dm mpath: hold io until all pg_inits completed  (Kiyoshi Ueda, 1 file, -6/+11)
m->queue_io is set to block the processing of I/Os, and it needs to stay set while pg-init, which issues multiple path activations, is in progress. But m->queue_io is cleared when a path activation completes without error in pg_init_done(), even while other path activations are still in progress. That may cause undesired -EIO on paths which have not completed activation.

This patch fixes that by not clearing m->queue_io until all path activations complete. (Before the hardware handlers were moved into the SCSI layer, pg_init only used one path.)

Signed-off-by: Kiyoshi Ueda <[email protected]>
Signed-off-by: Jun'ichi Nomura <[email protected]>
Signed-off-by: Alasdair G Kergon <[email protected]>
2010-03-06  dm mpath: avoid storing private suspended state  (Kiyoshi Ueda, 1 file, -12/+0)
The 'suspended' flag in struct multipath was introduced to check whether the multipath target is in the suspended state, but the same check is now done through dm_suspended(), so remove the flag and related code.

Signed-off-by: Kiyoshi Ueda <[email protected]>
Signed-off-by: Jun'ichi Nomura <[email protected]>
Cc: Mike Anderson <[email protected]>
Signed-off-by: Alasdair G Kergon <[email protected]>
2010-03-06  dm: document when snapshot has finished merging  (Mike Snitzer, 1 file, -0/+44)
Update Documentation/device-mapper/snapshot.txt to cover "How to determine when a snapshot has finished merging".

Signed-off-by: Mike Snitzer <[email protected]>
Signed-off-by: Alasdair G Kergon <[email protected]>
2010-03-06  dm table: remove dm_get from dm_table_get_md  (Kiyoshi Ueda, 3 files, -19/+4)
Remove the dm_get() in dm_table_get_md(), because dm_table_get_md() can be called from presuspend/postsuspend, which are called while the mapped_device is in the DMF_FREEING state, where dm_get() is not allowed.

The justification is the lifetime of the two objects: under the current dm design/implementation, a mapped_device is never freed while targets are doing something, because dm core waits for targets to become quiet in dm_put() using presuspend/postsuspend. So targets should be able to touch the mapped_device without holding a reference count on it, and we should allow targets to touch the mapped_device even if it is in the DMF_FREEING state.

Background: I'm trying to remove the multipath-internal queue, since dm core now has a generic queue for request-based dm. In that patch set, the multipath target wants to ask dm core to start/stop the queue. One such start/stop request can happen during postsuspend() while the target waits for pg-init to complete, because the target stops the queue when starting pg-init and tries to restart it when pg-init completes. Since the queue belongs to the mapped_device, this involves calling dm_table_get_md() and dm_put(). On the other hand, postsuspend() is called from dm_put() for a mapped_device that is in the DMF_FREEING state, and that triggers BUG_ON(DMF_FREEING) in the second dm_put().

I had tried to solve this problem by changing only multipath so that it doesn't touch a mapped_device in the DMF_FREEING state, but I couldn't, and I came to question why we need dm_get() in dm_table_get_md() at all.

Signed-off-by: Kiyoshi Ueda <[email protected]>
Signed-off-by: Jun'ichi Nomura <[email protected]>
Signed-off-by: Alasdair G Kergon <[email protected]>
2010-03-06  dm mpath: skip activate_path for failed paths  (Moger, Babu, 1 file, -2/+5)
This patch adds two minor fixes to device-mapper path activation:

Skip failed paths when calling activate_path. If a path has already failed, activate_path is certain to fail for it, so there is no need to call it; in some cases this would otherwise cause unnecessarily prolonged retries.

Change the misleading message printed when the path being activated fails with SCSI_DH_NOSYS.

Signed-off-by: Babu Moger <[email protected]>
Signed-off-by: Alasdair G Kergon <[email protected]>