|
When the parent process breaks the COW on a page, both the original page, which stays mapped in the child, and the new page, which is mapped in the parent, end up in that same anon_vma. Generally this won't be a problem, but for some workloads it could preserve the O(N) rmap scanning complexity.
A simple fix is to ensure that, when a page that is mapped only in the child gets reused in do_wp_page, it gets moved to the child's own exclusive anon_vma, since the child is already the exclusive owner.
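As a rough sketch of the idea (simplified, not the exact do_wp_page() code; the helper name is an assumption here), the reuse path only needs to re-point an exclusively owned anonymous page at the faulting VMA's own anon_vma:

    /*
     * Sketch: if the faulting task is the sole mapper of this anonymous
     * page, make the page's rmap point at this VMA's own anon_vma so
     * that later rmap walks stay within this process.
     */
    static void reuse_exclusive_anon_page(struct page *page,
                                          struct vm_area_struct *vma,
                                          unsigned long address)
    {
            if (PageAnon(page) && page_mapcount(page) == 1)
                    page_move_anon_rmap(page, vma, address);
    }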
Signed-off-by: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Larry Woodman <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When an anonymous page is inherited from a parent process, the
vma->anon_vma can differ from the page anon_vma. This can trip up
__page_check_anon_rmap, which is indirectly called from do_swap_page().
Remove that obsolete check to prevent an oops.
Signed-off-by: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Larry Woodman <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.
In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.
This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark like AIM7, where the total number of processes can reach into the tens of thousands. Real workloads are still a factor of 10 less process-intensive than AIM7, but they are catching up.
This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.
This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.
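For illustration, a minimal sketch of the new linkage (this follows my reading of the description above; treat exact field names as assumptions): every VMA owns a list of link objects, one per anon_vma it is associated with, and every anon_vma keeps a list of the links that point at it.

    struct anon_vma_chain {
            struct vm_area_struct *vma;     /* the VMA this link belongs to */
            struct anon_vma *anon_vma;      /* the anon_vma the VMA is linked into */
            struct list_head same_vma;      /* entry on the VMA's list of links */
            struct list_head same_anon_vma; /* entry on the anon_vma's list of links */
    };

An rmap walk then only iterates over the links hanging off the page's anon_vma, instead of over every process sharing the parent's anon_vma.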
The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.
A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.
Some test results:
Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.
With these changes, AIM7 hits the cross-over point around 29.7k users. This happens with ~99% IO wait time; there never seems to be any spike in system time. The anon_vma lock contention appears to be resolved.
[[email protected]: cleanups]
Signed-off-by: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Larry Woodman <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
mm/memcontrol.c:2548:32: warning: Using plain integer as NULL pointer
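Purely for illustration (hypothetical code, not the actual memcontrol.c line), the warning fires whenever a literal 0 is used where a pointer is expected:

    static struct mem_cgroup *lookup_broken(void)
    {
            return 0;       /* sparse: Using plain integer as NULL pointer */
    }

    static struct mem_cgroup *lookup_fixed(void)
    {
            return NULL;    /* the fix: use NULL for pointer values */
    }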
Signed-off-by: Thiago Farina <[email protected]>
Acked-by: Balbir Singh <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
It was tolerable until Eric went and added 8388608.
Cc: Eric Paris <[email protected]>
Cc: Wu Fengguang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM.
POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor performance:
a 16K read will be carried out in 4 _sync_ 1-page reads.
In other places, ra_pages==0 means
- it's ramfs/tmpfs/hugetlbfs/sysfs/configfs
- some IO error happened
where multi-page read IO won't help or should be avoided.
POSIX_FADV_RANDOM actually wants different semantics: disable the *heuristic* readahead algorithm and use a dumb one that faithfully submits read IO for whatever the application requests.
So introduce a flag, FMODE_RANDOM, for POSIX_FADV_RANDOM.
Note that the random hint is not likely to help random-read performance noticeably, and it may be too permissive for huge request sizes (its IO size is not limited by read_ahead_kb).
In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall
(NFS read) performance of the application increased by 313%!
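A minimal sketch of the mechanism described above (simplified fragments; the exact call sites are assumptions): fadvise() sets the per-file flag under f_lock, and the sync readahead entry point bypasses the heuristics when the flag is set.

    /* in the POSIX_FADV_RANDOM handling of fadvise(): */
    spin_lock(&file->f_lock);
    file->f_mode |= FMODE_RANDOM;
    spin_unlock(&file->f_lock);

    /* in the sync readahead path: submit exactly what was asked for */
    if (filp && (filp->f_mode & FMODE_RANDOM)) {
            force_page_cache_readahead(mapping, filp, offset, req_size);
            return;
    }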
Tested-by: Quentin Barnes <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Steven Whitehouse <[email protected]>
Cc: David Howells <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Chuck Lever <[email protected]>
Cc: <[email protected]> [2.6.33.x]
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We'll introduce FMODE_RANDOM, which will be modified at runtime. So protect all runtime modifications of f_mode with f_lock to avoid races.
Signed-off-by: Wu Fengguang <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Chuck Lever <[email protected]>
Cc: <[email protected]> [2.6.33.x]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
commit 01b1ae63c2 ("memcg: simple migration handling") removed the mem_cgroup_uncharge_cache_page() call from migrate_page_copy(). The local variable `anon' is now unused.
Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently, do_migrate_pages() has a very long comment which is not indented properly. It is easy to mistake it for the function's opening comment, which is confusing.
This patch fixes the indentation.
Note: this patch doesn't break the 80-column rule. I guess the original author intended this indentation, but an accident corrupted it.
Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
A memmap is a directory in sysfs which includes 3 text files: start, end
and type. For example:
start: 0x100000
end: 0x7e7b1cff
type: System RAM
The interface firmware_map_add was never called explicitly. Remove it and add firmware_map_add_hotplug as the hotplug interface of memmap.
Each memory entry has a memmap directory in sysfs, but when we hot-add new memory, sysfs does not export a memmap entry for it. We add a call to firmware_map_add_hotplug in add_memory.
Also add a new function, add_sysfs_fw_map_entry(), to create the memmap entry; it is called both when memmap is initialized and when memory is hot-added.
[[email protected]: un-kernedoc a no longer kerneldoc comment]
Signed-off-by: Shaohui Zheng <[email protected]>
Acked-by: Andi Kleen <[email protected]>
Acked-by: Yasunori Goto <[email protected]>
Reviewed-by: Wu Fengguang <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Strangely, the current mbind() does not merge a VMA with neighboring VMAs even though it could. Unfortunately, a large number of VMAs can hurt performance. This patch fixes it.
Reproducer program:
----------------------------------------------------------------
#include <numaif.h>
#include <numa.h>
#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>

static unsigned long pagesize;

int main(int argc, char **argv)
{
        void *addr;
        int ch;
        int node;
        struct bitmask *nmask = numa_allocate_nodemask();
        int err;
        int node_set = 0;
        char buf[128];

        while ((ch = getopt(argc, argv, "n:")) != -1) {
                switch (ch) {
                case 'n':
                        node = strtol(optarg, NULL, 0);
                        numa_bitmask_setbit(nmask, node);
                        node_set = 1;
                        break;
                default:
                        ;
                }
        }
        argc -= optind;
        argv += optind;

        if (!node_set)
                numa_bitmask_setbit(nmask, 0);

        pagesize = getpagesize();

        addr = mmap(NULL, pagesize*3, PROT_READ|PROT_WRITE,
                    MAP_ANON|MAP_PRIVATE, 0, 0);
        if (addr == MAP_FAILED)
                perror("mmap "), exit(1);

        fprintf(stderr, "pid = %d \n" "addr = %p\n", getpid(), addr);

        /* make page populate */
        memset(addr, 0, pagesize*3);

        /* first mbind */
        err = mbind(addr+pagesize, pagesize, MPOL_BIND, nmask->maskp,
                    nmask->size, MPOL_MF_MOVE_ALL);
        if (err)
                perror("mbind1 "), exit(1);

        /* second mbind */
        err = mbind(addr, pagesize*3, MPOL_DEFAULT, NULL, 0, 0);
        if (err)
                perror("mbind2 "), exit(1);

        sprintf(buf, "cat /proc/%d/maps", getpid());
        system(buf);

        return 0;
}
----------------------------------------------------------------
result without this patch
addr = 0x7fe26ef09000
[snip]
7fe26ef09000-7fe26ef0a000 rw-p 00000000 00:00 0
7fe26ef0a000-7fe26ef0b000 rw-p 00000000 00:00 0
7fe26ef0b000-7fe26ef0c000 rw-p 00000000 00:00 0
7fe26ef0c000-7fe26ef0d000 rw-p 00000000 00:00 0
=> 0x7fe26ef09000-0x7fe26ef0c000 is covered by three separate VMAs.
result with this patch
addr = 0x7fc9ebc76000
[snip]
7fc9ebc76000-7fc9ebc7a000 rw-p 00000000 00:00 0
7fffbe690000-7fffbe6a5000 rw-p 00000000 00:00 0 [stack]
=> 0x7fc9ebc76000-0x7fc9ebc7a000 is covered by a single VMA.
[[email protected]: fix file offset passed to vma_merge()]
Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
commit e815af95 ("change all_unreclaimable zone member to flags") changed the all_unreclaimable member into a bit flag. But it had an undesirable side effect: free_one_page() is one of the hottest paths in the Linux kernel, and adding atomic ops to it can reduce kernel performance a bit.
Thus, this patch partially reverts that commit; at the least, all_unreclaimable shouldn't share a memory word with the other zone flags.
[[email protected]: fix patch interaction]
Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Huang Shijie <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
free_hot_page() is just a wrapper around free_hot_cold_page() with the parameter 'cold = 0'. After adding a clear comment to free_hot_cold_page(), it is reasonable to remove this extra level of call.
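For reference, the removed wrapper amounted to roughly this (sketch), so callers can simply use free_hot_cold_page(page, 0) directly:

    void free_hot_page(struct page *page)
    {
            free_hot_cold_page(page, 0);    /* cold = 0: page is cache-hot */
    }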
[[email protected]: fix build]
Signed-off-by: Li Hong <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Larry Woodman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Li Ming Chun <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Americo Wang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Move the call of trace_mm_page_free_direct() from free_hot_page() to free_hot_cold_page(). This is clearer and keeps it close to kmemcheck_free_shadow(), as is done in __free_pages_ok().
Signed-off-by: Li Hong <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Larry Woodman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Li Ming Chun <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
trace_mm_page_free_direct() is called in __free_pages(). But it is called again in free_hot_page() if order == 0, which produces duplicate records in the trace file for the mm_page_free_direct event, as below:
K-PID CPU# TIMESTAMP FUNCTION
gnome-terminal-1567 [000] 4415.246466: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
gnome-terminal-1567 [000] 4415.246468: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
gnome-terminal-1567 [000] 4415.246506: mm_page_alloc: page=ffffea0003db9f40 pfn=1155800 order=0 migratetype=0 gfp_flags=GFP_KERNEL
gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
This patch removes the first call and adds a call to
trace_mm_page_free_direct() in __free_pages_ok().
Signed-off-by: Li Hong <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Larry Woodman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Li Ming Chun <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit cf40bd16fd ("lockdep: annotate reclaim context") introduced reclaim-context annotation, but it didn't annotate zone reclaim. This patch does so.
The point is that commit cf40bd16fd annotated __alloc_pages_direct_reclaim, but zone reclaim doesn't use __alloc_pages_direct_reclaim.
The current call graph is:
__alloc_pages_nodemask
   get_page_from_freelist
       zone_reclaim()
   __alloc_pages_slowpath
       __alloc_pages_direct_reclaim
           try_to_free_pages
In fact, if zone_reclaim_mode=1, the VM never calls __alloc_pages_direct_reclaim under usual memory pressure.
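As a sketch of what the annotation looks like (the helpers are the reclaim-context annotations introduced by cf40bd16fd; the surrounding function body here is simplified and the worker name is a placeholder, not the actual zone_reclaim code):

    static int zone_reclaim_annotated(struct zone *zone, gfp_t gfp_mask,
                                      unsigned int order)
    {
            int nr_reclaimed;

            /* tell lockdep we are now in a memory-reclaim context */
            lockdep_set_current_reclaim_state(gfp_mask);
            nr_reclaimed = do_zone_shrink(zone, gfp_mask, order);  /* placeholder */
            lockdep_clear_current_reclaim_state();

            return nr_reclaimed;
    }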
Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Acked-by: Nick Piggin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
get_scan_ratio() should contain all scan-ratio related calculations. Thus, this patch moves some of those calculations into get_scan_ratio().
Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Kswapd checks that a zone has sufficient free pages via zone_watermark_ok(). If any zone doesn't have enough pages, we set all_zones_ok to zero. !all_zones_ok makes kswapd retry rather than sleep.
I think the watermark check before shrink_zone() is pointless. Only after kswapd has tried to shrink the zone is the check meaningful.
Move the check to after the call to shrink_zone().
[[email protected]: fix comment, layout]
Signed-off-by: Minchan Kim <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Reviewed-by: Wu Fengguang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Make sure the compiler won't do weird things with the limits; e.g. fetching them twice may return two different values after writable limits are implemented.
That is, either use the rlimit helpers added in
3e10e716abf3c71bdb5d86b8f507f9e72236c9cd ("resource: add helpers for
fetching rlimits") or ACCESS_ONCE() if those are not applicable.
Signed-off-by: Jiri Slaby <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently, mlock_vma_pages_range() only returns len or 0, so the current error handling in mmap_region() is needlessly complex.
This patch simplifies it and makes it consistent with the brk() code.
Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently, mlock_vma_pages_range() never returns a negative value, so we can remove some worthless error checks.
Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
A frequent question from users about memory management is how many swap entries are used by each process; this information would also give some hints to the oom-killer.
While we can count the number of swap entries per process by scanning /proc/<pid>/smaps, this is very slow and not suitable for the usual process-information handlers such as 'ps' or 'top' (which are already slow enough).
This patch adds a counter of swap entries to mm_counter and updates it at each swap event. The information is exported via the /proc/<pid>/status file as follows:
[kamezawa@bluextal memory]$ cat /proc/self/status
Name: cat
State: R (running)
Tgid: 2910
Pid: 2910
PPid: 2823
TracerPid: 0
Uid: 500 500 500 500
Gid: 500 500 500 500
FDSize: 256
Groups: 500
VmPeak: 82696 kB
VmSize: 82696 kB
VmLck: 0 kB
VmHWM: 432 kB
VmRSS: 432 kB
VmData: 172 kB
VmStk: 84 kB
VmExe: 48 kB
VmLib: 1568 kB
VmPTE: 40 kB
VmSwap: 0 kB <=============== this.
[[email protected]: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Given the nature of per-mm stats, the counters are objects shared among threads and can be a cache-miss point in the page fault path.
This patch adds a per-thread cache for mm_counter. The RSS value is accumulated in a structure in task_struct and synchronized with the mm's counters at certain events.
In this patch, the event is the number of calls to handle_mm_fault: the per-thread value is added to the mm every 64 calls.
A rough estimation with a small benchmark on parallel threads (2 threads) shows:
[before]
4.5 cache-misses/fault
[after]
4.0 cache-misses/fault
In any case, the most contended object is mmap_sem once the number of threads grows.
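A rough sketch of the scheme (names approximate the patch as described above; the threshold value and helpers are assumptions for illustration): each task keeps local RSS deltas plus an event count, and folds the deltas into the mm every 64 faults.

    #define TASK_RSS_EVENTS_THRESH  64

    /* called from the page fault path for the current task */
    static void check_sync_rss_stat(struct task_struct *task)
    {
            if (unlikely(task != current))
                    return;
            if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH)) {
                    sync_mm_rss(task, task->mm);    /* fold cached deltas into mm */
                    task->rss_stat.events = 0;
            }
    }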
[[email protected]: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Presently, the per-mm statistics counters are defined by macros in sched.h.
This patch changes them to be
- defined in mm.h as inline functions
- stored in an array instead of using macro-generated names.
This is preparation for a later patch that modifies the per-mm counter implementation, and it keeps that patch smaller.
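A sketch of the shape this takes (field and counter names are my assumptions based on the description): an enum indexes an array of counters, and a pair of inline helpers replaces the per-counter macros.

    enum {
            MM_FILEPAGES,
            MM_ANONPAGES,
            NR_MM_COUNTERS
    };

    static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
    {
            return (unsigned long)atomic_long_read(&mm->rss_stat.count[member]);
    }

    static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
    {
            atomic_long_add(value, &mm->rss_stat.count[member]);
    }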
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Replace open-coded loop with for_each_set_bit().
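A typical conversion looks like this (illustrative names; for_each_set_bit(bit, addr, size) iterates only over the bits that are set):

    /* before: open-coded scan over the whole bitmap */
    for (i = 0; i < MAX_PORTS; i++) {
            if (!test_bit(i, port_mask))
                    continue;
            handle_port(i);
    }

    /* after: the iterator macro skips cleared bits for us */
    for_each_set_bit(i, port_mask, MAX_PORTS)
            handle_port(i);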
Signed-off-by: Akinobu Mita <[email protected]>
Acked-by: Roland Dreier <[email protected]>
Cc: Sean Hefty <[email protected]>
Cc: Hal Rosenstock <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Rename for_each_bit to for_each_set_bit in the kernel source tree. To
permit for_each_clear_bit(), should that ever be added.
The patch includes a macro to map the old for_each_bit() onto the new
for_each_set_bit(). This is a (very) temporary thing to ease the migration.
[[email protected]: add temporary for_each_bit()]
Suggested-by: Alexey Dobriyan <[email protected]>
Suggested-by: Andrew Morton <[email protected]>
Signed-off-by: Akinobu Mita <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Russell King <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Artem Bityutskiy <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use of get_irq_chip_data() et al. requires including linux/irq.h
Signed-off-by: David S. Miller <[email protected]>
Cc: Richard Röjfors <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We managed to lose O_DIRECTORY testing due to a stupid typo in commit
1f36f774b2 ("Switch !O_CREAT case to use of do_last()")
Reported-by: Walter Sheets <[email protected]>
Signed-off-by: Al Viro <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This patch corrects author and copyright notices following the split-up of
the GE Fanuc joint venture.
Signed-off-by: Martyn Welch <[email protected]>
Signed-off-by: Wim Van Sebroeck <[email protected]>
|
|
It's possible that the platform is not allowing reboot via TCO timer
expiration.
Also, differentiate between not finding a chipset that has TCO, and the case
where TCO is present but the driver fails to initialize for some reason.
Signed-off-by: Naga Chumbalkar <[email protected]>
Signed-off-by: Wim Van Sebroeck <[email protected]>
|
|
Clean-up driver:
* make release the reverse of probe so that both are consistent
* add WDIOC_GETSTATUS & WDIOC_GETBOOTSTATUS ioctls.
Signed-off-by: Wim Van Sebroeck <[email protected]>
|
|
This driver adds support for the max63{69,70,71,72,73,74} family of
watchdog timer chips.
It has been tested on an Arcom Zeus (max6369).
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Wim Van Sebroeck <[email protected]>
|
|
Signed-off-by: Mika Westerberg <[email protected]>
Signed-off-by: Wim Van Sebroeck <[email protected]>
|
|
Technologic Systems TS-72xx SBCs have an external glue-logic CPLD which includes a watchdog timer. This driver implements kernel support for it.
Signed-off-by: Mika Westerberg <[email protected]>
Signed-off-by: Wim Van Sebroeck <[email protected]>
|
|
Many changes were made during development that could result in old
versions of mklogfs and the kernel code being subtly incompatible.
Not being a friend of subtleties, I hereby change the magic number.
Any old version of mklogfs is now guaranteed to fail.
|
|
Incompatible change: h_compr is moved up so the padding is all in one chunk.
|
|
To prevent deadlock, bios in the hold list should be flushed before dm_rh_stop_recovery() is called in mirror_presuspend().
Otherwise recovery cannot start while there are pending bios, and dm_rh_stop_recovery() deadlocks:
when there are pending bios in the hold list, recovery waits for their completion after acquiring recovery_count, and recovery_count is only released once recovery has finished. However, the bios in the hold list are processed only after dm_rh_stop_recovery() in mirror_presuspend(), and dm_rh_stop_recovery() also acquires recovery_count, so a deadlock occurs.
Signed-off-by: Takahiro Yasui <[email protected]>
Signed-off-by: Alasdair G Kergon <[email protected]>
Reviewed-by: Mikulas Patocka <[email protected]>
|
|
Eliminate a 4-byte hole in 'struct dm_io_memory' by moving 'offset' above the 'ptr' member to which it applies (size reduced from 24 to 16 bytes). By association, one 4-byte hole is also eliminated in 'struct dm_io_request' (size reduced from 56 to 48 bytes).
Also eliminate all six 4-byte holes, and with them one cache line, in 'struct dm_snapshot' (size reduced from 392 to 368 bytes).
Signed-off-by: Mike Snitzer <[email protected]>
Signed-off-by: Alasdair G Kergon <[email protected]>
|
|
Set a new DM_UEVENT_GENERATED_FLAG when returning from ioctls to
indicate that a uevent was actually generated. This tells the userspace
caller that it may need to wait for the event to be processed.
Signed-off-by: Peter Rajnoha <[email protected]>
Signed-off-by: Alasdair G Kergon <[email protected]>
|
|
Free the dm_io structure before calling bio_endio() instead of after it,
to ensure that the io_pool containing it is not referenced after it is
freed.
This partially fixes a problem described here
https://www.redhat.com/archives/dm-devel/2010-February/msg00109.html
thread 1:
bio_endio(bio, io_error);
/* scheduling happens */
thread 2:
close the device
remove the device
thread 1:
free_io(md, io);
Thread 2, when removing the device, sees a non-empty md->io_pool (because the io hasn't been freed by thread 1 yet) and may crash with a BUG in mempool_free().
Thread 1 may also crash when freeing into a nonexistent mempool.
To fix this we must make sure that bio_endio() is the last call and
the md structure is not accessed afterwards.
There is another bio_endio in process_barrier, but it is called from the thread
and the thread is destroyed prior to freeing the mempools, so this call is
not affected by the bug.
A similar bug exists with module unloads - the module may be unloaded
immediately after bio_endio - but that is more difficult to fix.
Signed-off-by: Mikulas Patocka <[email protected]>
Cc: [email protected]
Signed-off-by: Alasdair G Kergon <[email protected]>
|
|
Remove the unused parameters (start and len) of dm_get_device() and fix the callers.
Signed-off-by: Nikanth Karthikesan <[email protected]>
Signed-off-by: Alasdair G Kergon <[email protected]>
|
|
Only issue a uevent on a resume if the state of the device changed,
i.e. if it was suspended and/or its table was replaced.
Signed-off-by: Dave Wysochanski <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
Cc: [email protected]
Signed-off-by: Alasdair G Kergon <[email protected]>
|
|
If all mirror legs fail, always return an error instead of holding the
bio, even if the handle_errors option was set. At present it is the
responsibility of the driver underneath us to deal with retries,
multipath etc.
The patch adds the bio to the failures list instead of holding it
directly. do_failures tests first if all legs failed and, if so,
returns the bio with -EIO. If any leg is still alive and handle_errors
is set, do_failures calls hold_bio.
Reviewed-by: Takahiro Yasui <[email protected]>
Signed-off-by: Mikulas Patocka <[email protected]>
Signed-off-by: Alasdair G Kergon <[email protected]>
|
|
This patch pulls the pg_init path activation code out of
process_queued_ios() into a new function.
No functional change.
Signed-off-by: Kiyoshi Ueda <[email protected]>
Signed-off-by: Jun'ichi Nomura <[email protected]>
Signed-off-by: Alasdair G Kergon <[email protected]>
|
|
When suspending the device we must wait for all I/O to complete, but
pg-init may be still in progress even after flushing the workqueue
for kmpath_handlerd in multipath_postsuspend.
This patch waits for pg-init completion correctly in
multipath_postsuspend().
Signed-off-by: Kiyoshi Ueda <[email protected]>
Signed-off-by: Jun'ichi Nomura <[email protected]>
Signed-off-by: Alasdair G Kergon <[email protected]>
|
|
m->queue_io is set to block the processing of I/Os, and it needs to remain set while pg-init, which issues multiple path activations, is in progress.
But m->queue_io is cleared as soon as one path activation completes without error in pg_init_done(), even while other path activations are still in progress. That may cause undesired -EIO on paths whose activation has not yet completed.
This patch fixes that by not clearing m->queue_io until all path activations complete.
(Before the hardware handlers were moved into the SCSI layer, pg_init
only used one path.)
Signed-off-by: Kiyoshi Ueda <[email protected]>
Signed-off-by: Jun'ichi Nomura <[email protected]>
Signed-off-by: Alasdair G Kergon <[email protected]>
|
|
'suspended' flag in struct multipath was introduced to check whether
the multipath target is in suspended state, but the same check is
done through dm_suspended() now, so remove the flag and related code.
Signed-off-by: Kiyoshi Ueda <[email protected]>
Signed-off-by: Jun'ichi Nomura <[email protected]>
Cc: Mike Anderson <[email protected]>
Signed-off-by: Alasdair G Kergon <[email protected]>
|
|
Update Documentation/device-mapper/snapshot.txt to cover "How to
determine when a snapshot has finished merging".
Signed-off-by: Mike Snitzer <[email protected]>
Signed-off-by: Alasdair G Kergon <[email protected]>
|
|
Remove the dm_get() in dm_table_get_md() because dm_table_get_md() could
be called from presuspend/postsuspend, which are called while
mapped_device is in DMF_FREEING state, where dm_get() is not allowed.
The justification is the lifetime of the two objects: as far as the current dm design/implementation goes, mapped_device is never freed while targets are doing something, because dm core waits for targets to become quiet in dm_put() using presuspend/postsuspend. So targets should be able to touch mapped_device without holding a reference count on it, and we should allow targets to touch mapped_device even when it is in the DMF_FREEING state.
Background:
I'm trying to remove the multipath internal queue, since dm core now has a generic queue for request-based dm. In that patch set, the multipath target wants to ask dm core to start/stop the queue. One such start/stop request can happen during postsuspend() while the target waits for pg-init to complete, because the target stops the queue when starting pg-init and tries to restart it when pg-init completes. Since the queue belongs to mapped_device, this involves calling dm_table_get_md() and dm_put(). On the other hand, postsuspend() is called from dm_put() for a mapped_device which is in the DMF_FREEING state, and that triggers BUG_ON(DMF_FREEING) in the second dm_put().
I had tried to solve this problem by changing only multipath so that it would not touch a mapped_device in the DMF_FREEING state, but I couldn't, and that led me to question why we need dm_get() in dm_table_get_md() at all.
Signed-off-by: Kiyoshi Ueda <[email protected]>
Signed-off-by: Jun'ichi Nomura <[email protected]>
Signed-off-by: Alasdair G Kergon <[email protected]>
|
|
This patch adds two minor fixes to device-mapper path activation.
Skip failed paths when calling activate_path(): if the path has already failed, activate_path() will certainly fail, so there is no need to call it. In some cases the unnecessary call could cause prolonged retries.
Also change the misleading message printed when the path being activated fails with SCSI_DH_NOSYS.
Signed-off-by: Babu Moger <[email protected]>
Signed-off-by: Alasdair G Kergon <[email protected]>
|