Age | Commit message (Collapse) | Author | Files | Lines |
|
1.include linux/memblock.h directly. so later could reduce e820.h reference.
2 this patch is done by sed scripts mainly
-v2: use MEMBLOCK_ERROR instead of -1ULL or -1UL
Signed-off-by: Yinghai Lu <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
|
|
1. replace find_e820_area with memblock_find_in_range
2. replace reserve_early with memblock_x86_reserve_range
3. replace free_early with memblock_x86_free_range.
4. NO_BOOTMEM will switch to use memblock too.
5. use _e820, _early wrap in the patch, in following patch, will
replace them all
6. because memblock_x86_free_range support partial free, we can remove some special care
7. Need to make sure that memblock_find_in_range() is called after memblock_x86_fill()
so adjust some calling later in setup.c::setup_arch()
-- corruption_check and mptable_update
-v2: Move reserve_brk() early
Before fill_memblock_area, to avoid overlap between brk and memblock_find_in_range()
that could happen We have more then 128 RAM entry in E820 tables, and
memblock_x86_fill() could use memblock_find_in_range() to find a new place for
memblock.memory.region array.
and We don't need to use extend_brk() after fill_memblock_area()
So move reserve_brk() early before fill_memblock_area().
-v3: Move find_smp_config early
To make sure memblock_find_in_range not find wrong place, if BIOS doesn't put mptable
in right place.
-v4: Treat RESERVED_KERN as RAM in memblock.memory. and they are already in
memblock.reserved already..
use __NOT_KEEP_MEMBLOCK to make sure memblock related code could be freed later.
-v5: Generic version __memblock_find_in_range() is going from high to low, and for 32bit
active_region for 32bit does include high pages
need to replace the limit with memblock.default_alloc_limit, aka get_max_mapped()
-v6: Use current_limit instead
-v7: check with MEMBLOCK_ERROR instead of -1ULL or -1L
-v8: Set memblock_can_resize early to handle EFI with more RAM entries
-v9: update after kmemleak changes in mainline
Suggested-by: David S. Miller <[email protected]>
Suggested-by: Benjamin Herrenschmidt <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
|
|
According to node range in early_node_map[] with __memblock_find_in_range
to find free range.
Will be used by memblock_x86_find_in_range_node()
memblock_x86_find_in_range_node will be used to find right buffer for NODE_DATA
Signed-off-by: Yinghai Lu <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
|
|
It will be used memblock_x86_to_bootmem converting
It is an wrapper for reserve_bootmem, and x86 64bit is using special one.
Also clean up that version for x86_64. We don't need to take care of numa
path for that, bootmem can handle it how
Signed-off-by: Yinghai Lu <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
|
|
So we can avoid export memblock_reserved_init_regions()
Suggested by Ben.
-v2: use __init_memblock attribute
Signed-off-by: Yinghai Lu <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
|
|
When pcpu_build_alloc_info() searches best_upa value, it ignores current value
if the number of waste units exceeds 1/3 of the number of total cpus. But the
comment on the code says that it will ignore if wastage is over 25%.
Modify the comment.
Signed-off-by: Namhyung Kim <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
The original code did not free the old map. This patch fixes it.
tj: use @old as memcpy source instead of @chunk->map, and indentation
and description update
Signed-off-by: Huang Shijie <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Cc: [email protected]
|
|
This patch fixes the following issue:
INFO: task mount.nfs4:1120 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mount.nfs4 D 00000000fffc6a21 0 1120 1119 0x00000000
ffff880235643948 0000000000000046 ffffffff00000000 ffffffff00000000
ffff880235643fd8 ffff880235314760 00000000001d44c0 ffff880235643fd8
00000000001d44c0 00000000001d44c0 00000000001d44c0 00000000001d44c0
Call Trace:
[<ffffffff813bc747>] schedule_timeout+0x34/0xf1
[<ffffffff813bc530>] ? wait_for_common+0x3f/0x130
[<ffffffff8106b50b>] ? trace_hardirqs_on+0xd/0xf
[<ffffffff813bc5c3>] wait_for_common+0xd2/0x130
[<ffffffff8104159c>] ? default_wake_function+0x0/0xf
[<ffffffff813beaa0>] ? _raw_spin_unlock+0x26/0x2a
[<ffffffff813bc6bb>] wait_for_completion+0x18/0x1a
[<ffffffff81101a03>] sync_inodes_sb+0xca/0x1bc
[<ffffffff811056a6>] __sync_filesystem+0x47/0x7e
[<ffffffff81105798>] sync_filesystem+0x47/0x4b
[<ffffffff810e7ffd>] generic_shutdown_super+0x22/0xd2
[<ffffffff810e80f8>] kill_anon_super+0x11/0x4f
[<ffffffffa00d06d7>] nfs4_kill_super+0x3f/0x72 [nfs]
[<ffffffff810e7b68>] deactivate_locked_super+0x21/0x41
[<ffffffff810e7fd6>] deactivate_super+0x40/0x45
[<ffffffff810fc66c>] mntput_no_expire+0xb8/0xed
[<ffffffff810fc73b>] release_mounts+0x9a/0xb0
[<ffffffff810fc7bb>] put_mnt_ns+0x6a/0x7b
[<ffffffffa00d0fb2>] nfs_follow_remote_path+0x19a/0x296 [nfs]
[<ffffffffa00d11ca>] nfs4_try_mount+0x75/0xaf [nfs]
[<ffffffffa00d1790>] nfs4_get_sb+0x276/0x2ff [nfs]
[<ffffffff810e7dba>] vfs_kern_mount+0xb8/0x196
[<ffffffff810e7ef6>] do_kern_mount+0x48/0xe8
[<ffffffff810fdf68>] do_mount+0x771/0x7e8
[<ffffffff810fe062>] sys_mount+0x83/0xbd
[<ffffffff810089c2>] system_call_fastpath+0x16/0x1b
The reason of this hang was a race condition: when the flusher thread is
forking a bdi thread, we use 'kthread_run()', so we run it _before_ we make it
visible in 'bdi->wb.task'. The bdi thread runs, does all works, and goes sleep.
'bdi->wb.task' is still NULL. And this is a dangerous time window.
If at this time someone queues a work for this bdi, he does not see the bdi
thread and wakes up the forker thread instead! But the forker has already
forked this bdi thread, but just did not make it visible yet!
The result is that we lose the wake up event for this bdi thread and the NFS4
code waits forever.
To fix the problem, we should use 'ktrhead_create()' for creating bdi threads,
then make them visible in 'bdi->wb.task', and only after this wake them up.
This is exactly what this patch does.
Signed-off-by: Artem Bityutskiy <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev
* '2.6.36-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev:
xfs: do not discard page cache data on EAGAIN
xfs: don't do memory allocation under the CIL context lock
xfs: Reduce log force overhead for delayed logging
xfs: dummy transactions should not dirty VFS state
xfs: ensure f_ffree returned by statfs() is non-negative
xfs: handle negative wbc->nr_to_write during sync writeback
writeback: write_cache_pages doesn't terminate at nr_to_write <= 0
xfs: fix untrusted inode number lookup
xfs: ensure we mark all inodes in a freed cluster XFS_ISTALE
xfs: unlock items before allowing the CIL to commit
|
|
pa-risc and ia64 have stacks that grow upwards. Check that
they do not run into other mappings. By making VM_GROWSUP
0x0 on architectures that do not ever use it, we can avoid
some unpleasant #ifdefs in check_stack_guard_page().
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
I noticed XFS writeback in 2.6.36-rc1 was much slower than it should have
been. Enabling writeback tracing showed:
flush-253:16-8516 [007] 1342952.351608: wbc_writepage: bdi 253:16: towrt=1024 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
flush-253:16-8516 [007] 1342952.351654: wbc_writepage: bdi 253:16: towrt=1023 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
flush-253:16-8516 [000] 1342952.369520: wbc_writepage: bdi 253:16: towrt=0 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
flush-253:16-8516 [000] 1342952.369542: wbc_writepage: bdi 253:16: towrt=-1 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
flush-253:16-8516 [000] 1342952.369549: wbc_writepage: bdi 253:16: towrt=-2 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
Writeback is not terminating in background writeback if ->writepage is
returning with wbc->nr_to_write == 0, resulting in sub-optimal single page
writeback on XFS.
Fix the write_cache_pages loop to terminate correctly when this situation
occurs and so prevent this sub-optimal background writeback pattern. This
improves sustained sequential buffered write performance from around
250MB/s to 750MB/s for a 100GB file on an XFS filesystem on my 8p test VM.
Cc:<[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Wu Fengguang <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
|
|
In x86, access and dirty bits are set automatically by CPU when CPU accesses
memory. When we go into the code path of below flush_tlb_fix_spurious_fault(),
we already set dirty bit for pte and don't need flush tlb. This might mean
tlb entry in some CPUs hasn't dirty bit set, but this doesn't matter. When
the CPUs do page write, they will automatically check the bit and no software
involved.
On the other hand, flush tlb in below position is harmful. Test creates CPU
number of threads, each thread writes to a same but random address in same vma
range and we measure the total time. Under a 4 socket system, original time is
1.96s, while with the patch, the time is 0.8s. Under a 2 socket system, there is
20% time cut too. perf shows a lot of time are taking to send ipi/handle ipi for
tlb flush.
Signed-off-by: Shaohua Li <[email protected]>
LKML-Reference: <[email protected]>
Acked-by: Suresh Siddha <[email protected]>
Cc: Andrea Archangeli <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
|
|
This allows code outside of the mm core to safely manipulate page state
and not worry about the other accounting. Not using these routines means
that some code will lose track of the accounting and we get bugs. This
has happened once already.
Signed-off-by: Michael Rubin <[email protected]>
Signed-off-by: Sage Weil <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
slab: fix object alignment
slub: add missing __percpu markup in mm/slub_def.h
|
|
Like the mlock() change previously, this makes the stack guard check
code use vma->vm_prev to see what the mapping below the current stack
is, rather than have to look it up with find_vma().
Also, accept an abutting stack segment, since that happens naturally if
you split the stack with mlock or mprotect.
Tested-by: Ian Campbell <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If we've split the stack vma, only the lowest one has the guard page.
Now that we have a doubly linked list of vma's, checking this is trivial.
Tested-by: Ian Campbell <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
It's a really simple list, and several of the users want to go backwards
in it to find the previous vma. So rather than have to look up the
previous entry with 'find_vma_prev()' or something similar, just make it
doubly linked instead.
Tested-by: Ian Campbell <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
dump_tasks() needs to hold the RCU read lock around its access of the
target task's UID. To this end it should use task_uid() as it only needs
that one thing from the creds.
The fact that dump_tasks() holds tasklist_lock is insufficient to prevent the
target process replacing its credentials on another CPU.
Then, this patch change to call rcu_read_lock() explicitly.
===================================================
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
mm/oom_kill.c:410 invoked rcu_dereference_check() without protection!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 1
4 locks held by kworker/1:2/651:
#0: (events){+.+.+.}, at: [<ffffffff8106aae7>]
process_one_work+0x137/0x4a0
#1: (moom_work){+.+...}, at: [<ffffffff8106aae7>]
process_one_work+0x137/0x4a0
#2: (tasklist_lock){.+.+..}, at: [<ffffffff810fafd4>]
out_of_memory+0x164/0x3f0
#3: (&(&p->alloc_lock)->rlock){+.+...}, at: [<ffffffff810fa48e>]
find_lock_task_mm+0x2e/0x70
Signed-off-by: KOSAKI Motohiro <[email protected]>
Signed-off-by: David Howells <[email protected]>
Acked-by: Paul E. McKenney <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit 0aad4b3124 ("oom: fold __out_of_memory into out_of_memory")
introduced a tasklist_lock leak. Then it caused following obvious
danger warnings and panic.
================================================
[ BUG: lock held when returning to user space! ]
------------------------------------------------
rsyslogd/1422 is leaving the kernel with locks still held!
1 lock held by rsyslogd/1422:
#0: (tasklist_lock){.+.+.+}, at: [<ffffffff810faf64>] out_of_memory+0x164/0x3f0
BUG: scheduling while atomic: rsyslogd/1422/0x00000002
INFO: lockdep is turned off.
This patch fixes it.
Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit b940fd7035 ("oom: remove unnecessary code and cleanup") added an
unnecessary NULL pointer dereference. remove it.
Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When radix_tree_maxindex() is ~0UL, it can happen that scanning overflows
index and tree traversal code goes astray reading memory until it hits
unreadable memory. Check for overflow and exit in that case.
Signed-off-by: Jan Kara <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Nick Piggin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
list_add() corruption messages reported from shmem_fill_super()'s recently
introduced percpu_counter_init(): shmem_put_super() needs to remember to
percpu_counter_destroy(). And also check error from percpu_counter_init().
Reported-bisected-and-tested-by: Tetsuo Handa <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This commit makes the stack guard page somewhat less visible to user
space. It does this by:
- not showing the guard page in /proc/<pid>/maps
It looks like lvm-tools will actually read /proc/self/maps to figure
out where all its mappings are, and effectively do a specialized
"mlockall()" in user space. By not showing the guard page as part of
the mapping (by just adding PAGE_SIZE to the start for grows-up
pages), lvm-tools ends up not being aware of it.
- by also teaching the _real_ mlock() functionality not to try to lock
the guard page.
That would just expand the mapping down to create a new guard page,
so there really is no point in trying to lock it in place.
It would perhaps be nice to show the guard page specially in
/proc/<pid>/maps (or at least mark grow-down segments some way), but
let's not open ourselves up to more breakage by user space from programs
that depends on the exact deails of the 'maps' file.
Special thanks to Henrique de Moraes Holschuh for diving into lvm-tools
source code to see what was going on with the whole new warning.
Reported-and-tested-by: François Valenduc <[email protected]
Reported-by: Henrique de Moraes Holschuh <[email protected]>
Cc: [email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Remove leading /** from non-kernel-doc function comments to prevent
kernel-doc warnings.
Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We do in fact need to unmap the page table _before_ doing the whole
stack guard page logic, because if it is needed (mainly 32-bit x86 with
PAE and CONFIG_HIGHPTE, but other architectures may use it too) then it
will do a kmap_atomic/kunmap_atomic.
And those kmaps will create an atomic region that we cannot do
allocations in. However, the whole stack expand code will need to do
anon_vma_prepare() and vma_lock_anon_vma() and they cannot do that in an
atomic region.
Now, a better model might actually be to do the anon_vma_prepare() when
_creating_ a VM_GROWSDOWN segment, and not have to worry about any of
this at page fault time. But in the meantime, this is the
straightforward fix for the issue.
See https://bugzilla.kernel.org/show_bug.cgi?id=16588 for details.
Reported-by: Wylda <[email protected]>
Reported-by: Sedat Dilek <[email protected]>
Reported-by: Mike Pagano <[email protected]>
Reported-by: François Valenduc <[email protected]>
Tested-by: Ed Tomlinson <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Greg KH <[email protected]>
Cc: [email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Remove an extraneous no_printk() in mm/nommu.c that got missed when the
function got generalised from several things that used it in commit
12fdff3fc248 ("Add a dummy printk function for the maintenance of unused
printks").
Without this, the following error is observed:
mm/nommu.c:41: error: conflicting types for 'no_printk'
include/linux/kernel.h:314: error: previous definition of 'no_printk' was here
Reported-by: Michal Simek <[email protected]>
Signed-off-by: David Howells <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
.. which didn't show up in my tests because it's a no-op on x86-64 and
most other architectures. But we enter the function with the last-level
page table mapped, and should unmap it at exit.
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This is a rather minimally invasive patch to solve the problem of the
user stack growing into a memory mapped area below it. Whenever we fill
the first page of the stack segment, expand the segment down by one
page.
Now, admittedly some odd application might _want_ the stack to grow down
into the preceding memory mapping, and so we may at some point need to
make this a process tunable (some people might also want to have more
than a single page of guarding), but let's try the minimal approach
first.
Tested with trivial application that maps a single page just below the
stack, and then starts recursing. Without this, we will get a SIGSEGV
_after_ the stack has smashed the mapping. With this patch, we'll get a
nice SIGBUS just as the stack touches the page just above the mapping.
Requested-by: Keith Packard <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6
* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6:
hugetlb: add missing unlock in avoidcopy path in hugetlb_cow()
hwpoison: rename CONFIG
HWPOISON, hugetlb: support hwpoison injection for hugepage
HWPOISON, hugetlb: detect hwpoison in hugetlb code
HWPOISON, hugetlb: isolate corrupted hugepage
HWPOISON, hugetlb: maintain mce_bad_pages in handling hugepage error
HWPOISON, hugetlb: set/clear PG_hwpoison bits on hugepage
HWPOISON, hugetlb: enable error handling path for hugepage
hugetlb, rmap: add reverse mapping for hugepage
hugetlb: move definition of is_vm_hugetlb_page() to hugepage_inline.h
Fix up trivial conflicts in mm/memory-failure.c
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
* 'stable/xen-swiotlb-0.8.6' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
x86: Detect whether we should use Xen SWIOTLB.
pci-swiotlb-xen: Add glue code to setup dma_ops utilizing xen_swiotlb_* functions.
swiotlb-xen: SWIOTLB library for Xen PV guest with PCI passthrough.
xen/mmu: inhibit vmap aliases rather than trying to clear them out
vmap: add flag to allow lazy unmap to be disabled at runtime
xen: Add xen_create_contiguous_region
xen: Rename the balloon lock
xen: Allow unprivileged Xen domains to create iomap pages
xen: use _PAGE_IOMAP in ioremap to do machine mappings
Fix up trivial conflicts (adding both xen swiotlb and xen pci platform
driver setup close to each other) in drivers/xen/{Kconfig,Makefile} and
include/xen/xen-ops.h
|
|
Document global_dirty_limits() and bdi_dirty_limit().
Signed-off-by: Wu Fengguang <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(), so
that the latter can be avoided when under global dirty background
threshold (which is the normal state for most systems).
Signed-off-by: Wu Fengguang <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Reducing the number of times balance_dirty_pages calls global_page_state
reduces the cache references and so improves write performance on a
variety of workloads.
'perf stats' of simple fio write tests shows the reduction in cache
access. Where the test is fio 'write,mmap,600Mb,pre_read' on AMD AthlonX2
with 3Gb memory (dirty_threshold approx 600 Mb) running each test 10
times, dropping the fasted & slowest values then taking the average &
standard deviation
average (s.d.) in millions (10^6)
2.6.31-rc8 648.6 (14.6)
+patch 620.1 (16.5)
Achieving this reduction is by dropping clip_bdi_dirty_limit as it rereads
the counters to apply the dirty_threshold and moving this check up into
balance_dirty_pages where it has already read the counters.
Also by rearrange the for loop to only contain one copy of the limit tests
allows the pdflush test after the loop to use the local copies of the
counters rather than rereading them.
In the common case with no throttling it now calls global_page_state 5
fewer times and bdi_stat 2 fewer.
Fengguang:
This patch slightly changes behavior by replacing clip_bdi_dirty_limit()
with the explicit check (nr_reclaimable + nr_writeback >= dirty_thresh) to
avoid exceeding the dirty limit. Since the bdi dirty limit is mostly
accurate we don't need to do routinely clip. A simple dirty limit check
would be enough.
The check is necessary because, in principle we should throttle everything
calling balance_dirty_pages() when we're over the total limit, as said by
Peter.
We now set and clear dirty_exceeded not only based on bdi dirty limits,
but also on the global dirty limit. The global limit check is added in
place of clip_bdi_dirty_limit() for safety and not intended as a behavior
change. The bdi limits should be tight enough to keep all dirty pages
under the global limit at most time; occasional small exceeding should be
OK though. The change makes the logic more obvious: the global limit is
the ultimate goal and shall be always imposed.
We may now start background writeback work based on outdated conditions.
That's safe because the bdi flush thread will (and have to) double check
the states. It reduces overall overheads because the test based on old
states still have good chance to be right.
[[email protected]] fix uninitialized dirty_exceeded
Signed-off-by: Richard Kennedy <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
Cc: Jan Kara <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Jens Axboe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix a fatal kernel-doc error due to a #define coming between a function's
kernel-doc notation and the function signature. (kernel-doc cannot handle
this)
Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We have zone_to_nid(). this patch convert all existing users of
zone->zone_pgdat->node_id.
Signed-off-by: KOSAKI Motohiro <[email protected]>
Acked-by: Balbir Singh <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Nishimura Daisuke <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
mem_cgroup_soft_limit_reclaim() has zone, nid and zid argument. but nid
and zid can be calculated from zone. So remove it.
Signed-off-by: KOSAKI Motohiro <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Nishimura Daisuke <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently mem_cgroup_shrink_node_zone() call shrink_zone() directly. thus
it doesn't need to initialize sc.nodemask because shrink_zone() doesn't
use it at all.
Signed-off-by: KOSAKI Motohiro <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Nishimura Daisuke <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
sc.nr_reclaimed and sc.nr_scanned have already been initialized few lines
above "struct scan_control sc = {}" statement.
So, This patch remove this unnecessary code.
Signed-off-by: KOSAKI Motohiro <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Nishimura Daisuke <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently, mem_cgroup_shrink_node_zone() initialize sc.nr_to_reclaim as 0.
It mean shrink_zone() only scan 32 pages and immediately return even if
it doesn't reclaim any pages.
This patch fixes it.
Signed-off-by: KOSAKI Motohiro <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Nishimura Daisuke <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Now, memory cgroup increments css(cgroup subsys state)'s reference count
per a charged page. And the reference count is kept until the page is
uncharged. But this has 2 bad effect.
1. Because css_get/put calls atomic_inc()/dec, heavy call of them
on large smp will not scale well.
2. Because css's refcnt cannot be in a state as "ready-to-release",
cgroup's notify_on_release handler can't work with memcg.
3. css's refcnt is atomic_t, it means smaller than 32bit. Maybe too small.
This has been a problem since the 1st merge of memcg.
This is a trial to remove css's refcnt per a page. Even if we remove
refcnt, pre_destroy() does enough synchronization as
- check res->usage == 0.
- check no pages on LRU.
This patch removes css's refcnt per page. Even after this patch, at the
1st look, it seems css_get() is still called in try_charge().
But the logic is.
- If a memcg of mm->owner is cached one, consume_stock() will work.
At success, return immediately.
- If consume_stock returns false, css_get() is called and go to
slow path which may be blocked. At the end of slow path,
css_put() is called and restart from the start if necessary.
So, in the fast path, we don't call css_get() and can avoid access to
shared counter. This patch can make the most possible case fast.
Here is a result of multi-threaded page fault benchmark.
[Before]
25.32% multi-fault-all [kernel.kallsyms] [k] clear_page_c
9.30% multi-fault-all [kernel.kallsyms] [k] _raw_spin_lock_irqsave
8.02% multi-fault-all [kernel.kallsyms] [k] try_get_mem_cgroup_from_mm <=====(*)
7.83% multi-fault-all [kernel.kallsyms] [k] down_read_trylock
5.38% multi-fault-all [kernel.kallsyms] [k] __css_put
5.29% multi-fault-all [kernel.kallsyms] [k] __alloc_pages_nodemask
4.92% multi-fault-all [kernel.kallsyms] [k] _raw_spin_lock_irq
4.24% multi-fault-all [kernel.kallsyms] [k] up_read
3.53% multi-fault-all [kernel.kallsyms] [k] css_put
2.11% multi-fault-all [kernel.kallsyms] [k] handle_mm_fault
1.76% multi-fault-all [kernel.kallsyms] [k] __rmqueue
1.64% multi-fault-all [kernel.kallsyms] [k] __mem_cgroup_commit_charge
[After]
28.41% multi-fault-all [kernel.kallsyms] [k] clear_page_c
10.08% multi-fault-all [kernel.kallsyms] [k] _raw_spin_lock_irq
9.58% multi-fault-all [kernel.kallsyms] [k] down_read_trylock
9.38% multi-fault-all [kernel.kallsyms] [k] _raw_spin_lock_irqsave
5.86% multi-fault-all [kernel.kallsyms] [k] __alloc_pages_nodemask
5.65% multi-fault-all [kernel.kallsyms] [k] up_read
2.82% multi-fault-all [kernel.kallsyms] [k] handle_mm_fault
2.64% multi-fault-all [kernel.kallsyms] [k] mem_cgroup_add_lru_list
2.48% multi-fault-all [kernel.kallsyms] [k] __mem_cgroup_commit_charge
Then, 8.02% of try_get_mem_cgroup_from_mm() disappears because this patch
removes css_tryget() in it. (But yes, this is an extreme case.)
Signed-off-by: Daisuke Nishimura <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Cc: Balbir Singh <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When the OOM killer scans task, it check a task is under memcg or
not when it's called via memcg's context.
But, as Oleg pointed out, a thread group leader may have NULL ->mm
and task_in_mem_cgroup() may do wrong decision. We have to use
find_lock_task_mm() in memcg as generic OOM-Killer does.
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Cc: Balbir Singh <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
mem_cgroup_charge_common() is always called with @mem = NULL, so it's
meaningless. This patch removes it.
Signed-off-by: Daisuke Nishimura <[email protected]>
Cc: Balbir Singh <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
- try_get_mem_cgroup_from_mm() calls rcu_read_lock/unlock by itself, so we
don't have to call them in task_in_mem_cgroup().
- *mz is not used in __mem_cgroup_uncharge_common().
- we don't have to call lookup_page_cgroup() in mem_cgroup_end_migration()
after we've cleared PCG_MIGRATION of @oldpage.
- remove empty comment.
- remove redundant empty line in mem_cgroup_cache_charge().
Signed-off-by: Daisuke Nishimura <[email protected]>
Acked-by: Balbir Singh <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Now, for checking a memcg is under task-account-moving, we do css_tryget()
against mc.to and mc.from. But this is just complicating things. This
patch makes the check easier.
This patch adds a spinlock to move_charge_struct and guard modification of
mc.to and mc.from. By this, we don't have to think about complicated
races arount this not-critical path.
[[email protected]: don't crash on a null memcg being passed]
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Cc: Balbir Singh <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
mem_cgroup_try_charge() has a big loop in it and seems to be hard to read.
Most of routines are for slow path. This patch moves codes out from the
loop and make it clear what's done.
Summary:
- refactoring a function to detect a memcg is under acccount move or not.
- refactoring a function to wait for the end of moving task acct.
- refactoring a main loop('s slow path) as a function and make it clear
why we retry or quit by return code.
- add fatal_signal_pending() check for bypassing charge loops.
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Cc: Balbir Singh <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This patch fixes possible deadlock in hugepage lock_page()
by adding missing unlock_page().
libhugetlbfs test will hit this bug when the next patch in this
patchset ("hugetlb, HWPOISON: move PG_HWPoison bit check") is applied.
Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Jun'ichi Nomura <[email protected]>
Acked-by: Fengguang Wu <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
|
|
CONFIG_HUGETLBFS controls hugetlbfs interface code.
OTOH, CONFIG_HUGETLB_PAGE controls hugepage management code.
So we should use CONFIG_HUGETLB_PAGE here.
Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
|
|
This patch enables hwpoison injection through debug/hwpoison interfaces,
with which we can test memory error handling for free or reserved
hugepages (which cannot be tested by madvise() injector).
[AK: Export PageHuge too for the injection module]
Signed-off-by: Naoya Horiguchi <[email protected]>
Cc: Andrew Morton <[email protected]>
Acked-by: Fengguang Wu <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
|
|
This patch enables to block access to hwpoisoned hugepage and
also enables to block unmapping for it.
Dependency:
"HWPOISON, hugetlb: enable error handling path for hugepage"
Signed-off-by: Naoya Horiguchi <[email protected]>
Cc: Andrew Morton <[email protected]>
Acked-by: Fengguang Wu <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
|
|
If error hugepage is not in-use, we can fully recovery from error
by dequeuing it from freelist, so return RECOVERY.
Otherwise whether or not we can recovery depends on user processes,
so return DELAYED.
Dependency:
"HWPOISON, hugetlb: enable error handling path for hugepage"
Signed-off-by: Naoya Horiguchi <[email protected]>
Cc: Andrew Morton <[email protected]>
Acked-by: Fengguang Wu <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
|