Age | Commit message (Collapse) | Author | Files | Lines |
|
The global zero page is used to satisfy an anonymous read fault. If
THP(Transparent HugePage) is enabled then the global huge zero page is
used. The global huge zero page uses an atomic counter for reference
counting and is allocated/freed dynamically according to its counter
value.
CPU time spent on that counter will greatly increase if there are a lot
of processes doing anonymous read faults. This patch proposes a way to
reduce the access to the global counter so that the CPU load can be
reduced accordingly.
To do this, a new flag of the mm_struct is introduced:
MMF_USED_HUGE_ZERO_PAGE. With this flag, the process only need to touch
the global counter in two cases:
1 The first time it uses the global huge zero page;
2 The time when mm_user of its mm_struct reaches zero.
Note that right now, the huge zero page is eligible to be freed as soon
as its last use goes away. With this patch, the page will not be
eligible to be freed until the exit of the last process from which it
was ever used.
And with the use of mm_user, the kthread is not eligible to use huge
zero page either. Since no kthread is using huge zero page today, there
is no difference after applying this patch. But if that is not desired,
I can change it to when mm_count reaches zero.
Case used for test on Haswell EP:
usemem -n 72 --readonly -j 0x200000 100G
Which spawns 72 processes and each will mmap 100G anonymous space and
then do read only access to that space sequentially with a step of 2MB.
CPU cycles from perf report for base commit:
54.03% usemem [kernel.kallsyms] [k] get_huge_zero_page
CPU cycles from perf report for this commit:
0.11% usemem [kernel.kallsyms] [k] mm_get_huge_zero_page
Performance(throughput) of the workload for base commit: 1784430792
Performance(throughput) of the workload for this commit: 4726928591
164% increase.
Runtime of the workload for base commit: 707592 us
Runtime of the workload for this commit: 303970 us
50% drop.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Aaron Lu <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Jerome Marchand <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Ebru Akagunduz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
After "oom: keep mm of the killed task available" we can safely detect
an oom victim by checking task->signal->oom_mm so we do not need the
signal_struct counter anymore so let's get rid of it.
This alone wouldn't be sufficient for nommu archs because
exit_oom_victim doesn't hide the process from the oom killer anymore.
We can, however, mark the mm with a MMF flag in __mmput. We can reuse
MMF_OOM_REAPED and rename it to a more generic MMF_OOM_SKIP.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Lockdep complains that __mmdrop is not safe from the softirq context:
=================================
[ INFO: inconsistent lock state ]
4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949 Tainted: G W
---------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
(pgd_lock){+.?...}, at: pgd_free+0x19/0x6b
{SOFTIRQ-ON-W} state was registered at:
__lock_acquire+0xa06/0x196e
lock_acquire+0x139/0x1e1
_raw_spin_lock+0x32/0x41
__change_page_attr_set_clr+0x2a5/0xacd
change_page_attr_set_clr+0x16f/0x32c
set_memory_nx+0x37/0x3a
free_init_pages+0x9e/0xc7
alternative_instructions+0xa2/0xb3
check_bugs+0xe/0x2d
start_kernel+0x3ce/0x3ea
x86_64_start_reservations+0x2a/0x2c
x86_64_start_kernel+0x17a/0x18d
irq event stamp: 105916
hardirqs last enabled at (105916): free_hot_cold_page+0x37e/0x390
hardirqs last disabled at (105915): free_hot_cold_page+0x2c1/0x390
softirqs last enabled at (105878): _local_bh_enable+0x42/0x44
softirqs last disabled at (105879): irq_exit+0x6f/0xd1
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(pgd_lock);
<Interrupt>
lock(pgd_lock);
*** DEADLOCK ***
1 lock held by swapper/1/0:
#0: (rcu_callback){......}, at: rcu_process_callbacks+0x390/0x800
stack backtrace:
CPU: 1 PID: 0 Comm: swapper/1 Tainted: G W 4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
Call Trace:
<IRQ>
print_usage_bug.part.25+0x259/0x268
mark_lock+0x381/0x567
__lock_acquire+0x993/0x196e
lock_acquire+0x139/0x1e1
_raw_spin_lock+0x32/0x41
pgd_free+0x19/0x6b
__mmdrop+0x25/0xb9
__put_task_struct+0x103/0x11e
delayed_put_task_struct+0x157/0x15e
rcu_process_callbacks+0x660/0x800
__do_softirq+0x1ec/0x4d5
irq_exit+0x6f/0xd1
smp_apic_timer_interrupt+0x42/0x4d
apic_timer_interrupt+0x8e/0xa0
<EOI>
arch_cpu_idle+0xf/0x11
default_idle_call+0x32/0x34
cpu_startup_entry+0x20c/0x399
start_secondary+0xfe/0x101
More over commit a79e53d85683 ("x86/mm: Fix pgd_lock deadlock") was
explicit about pgd_lock not to be called from the irq context. This
means that __mmdrop called from free_signal_struct has to be postponed
to a user context. We already have a similar mechanism for mmput_async
so we can use it here as well. This is safe because mm_count is pinned
by mm_users.
This fixes bug introduced by "oom: keep mm of the killed task available"
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
oom_reap_task has to call exit_oom_victim in order to make sure that the
oom vicim will not block the oom killer for ever. This is, however,
opening new problems (e.g oom_killer_disable exclusion - see commit
74070542099c ("oom, suspend: fix oom_reaper vs. oom_killer_disable
race")). exit_oom_victim should be only called from the victim's
context ideally.
One way to achieve this would be to rely on per mm_struct flags. We
already have MMF_OOM_REAPED to hide a task from the oom killer since
"mm, oom: hide mm which is shared with kthread or global init". The
problem is that the exit path:
do_exit
exit_mm
tsk->mm = NULL;
mmput
__mmput
exit_oom_victim
doesn't guarantee that exit_oom_victim will get called in a bounded
amount of time. At least exit_aio depends on IO which might get blocked
due to lack of memory and who knows what else is lurking there.
This patch takes a different approach. We remember tsk->mm into the
signal_struct and bind it to the signal struct life time for all oom
victims. __oom_reap_task_mm as well as oom_scan_process_thread do not
have to rely on find_lock_task_mm anymore and they will have a reliable
reference to the mm struct. As a result all the oom specific
communication inside the OOM killer can be done via tsk->signal->oom_mm.
Increasing the signal_struct for something as unlikely as the oom killer
is far from ideal but this approach will make the code much more
reasonable and long term we even might want to move task->mm into the
signal_struct anyway. In the next step we might want to make the oom
killer exclusion and access to memory reserves completely independent
which would be also nice.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace updates from Eric Biederman:
"This set of changes is a number of smaller things that have been
overlooked in other development cycles focused on more fundamental
change. The devpts changes are small things that were a distraction
until we managed to kill off DEVPTS_MULTPLE_INSTANCES. There is an
trivial regression fix to autofs for the unprivileged mount changes
that went in last cycle. A pair of ioctls has been added by Andrey
Vagin making it is possible to discover the relationships between
namespaces when referring to them through file descriptors.
The big user visible change is starting to add simple resource limits
to catch programs that misbehave. With namespaces in general and user
namespaces in particular allowing users to use more kinds of
resources, it has become important to have something to limit errant
programs. Because the purpose of these limits is to catch errant
programs the code needs to be inexpensive to use as it always on, and
the default limits need to be high enough that well behaved programs
on well behaved systems don't encounter them.
To this end, after some review I have implemented per user per user
namespace limits, and use them to limit the number of namespaces. The
limits being per user mean that one user can not exhause the limits of
another user. The limits being per user namespace allow contexts where
the limit is 0 and security conscious folks can remove from their
threat anlysis the code used to manage namespaces (as they have
historically done as it root only). At the same time the limits being
per user namespace allow other parts of the system to use namespaces.
Namespaces are increasingly being used in application sand boxing
scenarios so an all or nothing disable for the entire system for the
security conscious folks makes increasing use of these sandboxes
impossible.
There is also added a limit on the maximum number of mounts present in
a single mount namespace. It is nontrivial to guess what a reasonable
system wide limit on the number of mount structure in the kernel would
be, especially as it various based on how a system is using
containers. A limit on the number of mounts in a mount namespace
however is much easier to understand and set. In most cases in
practice only about 1000 mounts are used. Given that some autofs
scenarious have the potential to be 30,000 to 50,000 mounts I have set
the default limit for the number of mounts at 100,000 which is well
above every known set of users but low enough that the mount hash
tables don't degrade unreaonsably.
These limits are a start. I expect this estabilishes a pattern that
other limits for resources that namespaces use will follow. There has
been interest in making inotify event limits per user per user
namespace as well as interest expressed in making details about what
is going on in the kernel more visible"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits)
autofs: Fix automounts by using current_real_cred()->uid
mnt: Add a per mount namespace limit on the number of mounts
netns: move {inc,dec}_net_namespaces into #ifdef
nsfs: Simplify __ns_get_path
tools/testing: add a test to check nsfs ioctl-s
nsfs: add ioctl to get a parent namespace
nsfs: add ioctl to get an owning user namespace for ns file descriptor
kernel: add a helper to get an owning user namespace for a namespace
devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts
devpts: Remove sync_filesystems
devpts: Make devpts_kill_sb safe if fsi is NULL
devpts: Simplify devpts_mount by using mount_nodev
devpts: Move the creation of /dev/pts/ptmx into fill_super
devpts: Move parse_mount_options into fill_super
userns: When the per user per user namespace limit is reached return ENOSPC
userns; Document per user per user namespace limits.
mntns: Add a limit on the number of mount namespaces.
netns: Add a limit on the number of net namespaces
cgroupns: Add a limit on the number of cgroup namespaces
ipcns: Add a limit on the number of ipc namespaces
...
|
|
CONFIG_VMAP_STACK=y
vmalloc() is a bit slow, and pounding vmalloc()/vfree() will eventually
force a global TLB flush.
To reduce pressure on them, if CONFIG_VMAP_STACK=y, cache two thread
stacks per CPU. This will let us quickly allocate a hopefully
cache-hot, TLB-hot stack under heavy forking workloads (shell script style).
On my silly pthread_create() benchmark, it saves about 2 µs per
pthread_create()+join() with CONFIG_VMAP_STACK=y.
Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/94811d8e3994b2e962f88866290017d498eb069c.1474003868.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
|
|
We currently keep every task's stack around until the task_struct
itself is freed. This means that we keep the stack allocation alive
for longer than necessary and that, under load, we free stacks in
big batches whenever RCU drops the last task reference. Neither of
these is good for reuse of cache-hot memory, and freeing in batches
prevents us from usefully caching small numbers of vmalloced stacks.
On architectures that have thread_info on the stack, we can't easily
change this, but on architectures that set THREAD_INFO_IN_TASK, we
can free it as soon as the task is dead.
Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/08ca06cde00ebed0046c5d26cbbf3fbb7ef5b812.1474003868.git.luto@kernel.org
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Merge fixes from Andrew Morton:
"14 fixes"
* emailed patches from Andrew Morton <[email protected]>:
rapidio/tsi721: fix incorrect detection of address translation condition
rapidio/documentation/mport_cdev: add missing parameter description
kernel/fork: fix CLONE_CHILD_CLEARTID regression in nscd
MAINTAINERS: Vladimir has moved
mm, mempolicy: task->mempolicy must be NULL before dropping final reference
printk/nmi: avoid direct printk()-s from __printk_nmi_flush()
treewide: remove references to the now unnecessary DEFINE_PCI_DEVICE_TABLE
drivers/scsi/wd719x.c: remove last declaration using DEFINE_PCI_DEVICE_TABLE
mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator
lib/test_hash.c: fix warning in preprocessor symbol evaluation
lib/test_hash.c: fix warning in two-dimensional array init
kconfig: tinyconfig: provide whole choice blocks to avoid warnings
kexec: fix double-free when failing to relocate the purgatory
mm, oom: prevent premature OOM killer invocation for high order request
|
|
Commit fec1d0115240 ("[PATCH] Disable CLONE_CHILD_CLEARTID for abnormal
exit") has caused a subtle regression in nscd which uses
CLONE_CHILD_CLEARTID to clear the nscd_certainly_running flag in the
shared databases, so that the clients are notified when nscd is
restarted. Now, when nscd uses a non-persistent database, clients that
have it mapped keep thinking the database is being updated by nscd, when
in fact nscd has created a new (anonymous) one (for non-persistent
databases it uses an unlinked file as backend).
The original proposal for the CLONE_CHILD_CLEARTID change claimed
(https://lkml.org/lkml/2006/10/25/233):
: The NPTL library uses the CLONE_CHILD_CLEARTID flag on clone() syscalls
: on behalf of pthread_create() library calls. This feature is used to
: request that the kernel clear the thread-id in user space (at an address
: provided in the syscall) when the thread disassociates itself from the
: address space, which is done in mm_release().
:
: Unfortunately, when a multi-threaded process incurs a core dump (such as
: from a SIGSEGV), the core-dumping thread sends SIGKILL signals to all of
: the other threads, which then proceed to clear their user-space tids
: before synchronizing in exit_mm() with the start of core dumping. This
: misrepresents the state of process's address space at the time of the
: SIGSEGV and makes it more difficult for someone to debug NPTL and glibc
: problems (misleading him/her to conclude that the threads had gone away
: before the fault).
:
: The fix below is to simply avoid the CLONE_CHILD_CLEARTID action if a
: core dump has been initiated.
The resulting patch from Roland (https://lkml.org/lkml/2006/10/26/269)
seems to have a larger scope than the original patch asked for. It
seems that limitting the scope of the check to core dumping should work
for SIGSEGV issue describe above.
[Changelog partly based on Andreas' description]
Fixes: fec1d0115240 ("[PATCH] Disable CLONE_CHILD_CLEARTID for abnormal exit")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Tested-by: William Preston <[email protected]>
Acked-by: Oleg Nesterov <[email protected]>
Cc: Roland McGrath <[email protected]>
Cc: Andreas Schwab <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Pull audit fixes from Paul Moore:
"Two small patches to fix some bugs with the audit-by-executable
functionality we introduced back in v4.3 (both patches are marked
for the stable folks)"
* 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit:
audit: fix exe_file access in audit_exe_compare
mm: introduce get_task_exe_file
|
|
For more convenient access if one has a pointer to the task.
As a minor nit take advantage of the fact that only task lock + rcu are
needed to safely grab ->exe_file. This saves mm refcount dance.
Use the helper in proc_exe_link.
Signed-off-by: Mateusz Guzik <[email protected]>
Acked-by: Konstantin Khlebnikov <[email protected]>
Acked-by: Richard Guy Briggs <[email protected]>
Cc: <[email protected]> # 4.3.x
Signed-off-by: Paul Moore <[email protected]>
|
|
If CONFIG_VMAP_STACK=y is selected, kernel stacks are allocated with
__vmalloc_node_range().
Grsecurity has had a similar feature (called GRKERNSEC_KSTACKOVERFLOW=y)
for a long time.
Signed-off-by: Andy Lutomirski <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/14c07d4fd173a5b117f51e8b939f9f4323e39899.1470907718.git.luto@kernel.org
[ Minor edits. ]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
cgroup_threadgroup_rwsem is acquired in read mode during process exit
and fork. It is also grabbed in write mode during
__cgroups_proc_write(). I've recently run into a scenario with lots
of memory pressure and OOM and I am beginning to see
systemd
__switch_to+0x1f8/0x350
__schedule+0x30c/0x990
schedule+0x48/0xc0
percpu_down_write+0x114/0x170
__cgroup_procs_write.isra.12+0xb8/0x3c0
cgroup_file_write+0x74/0x1a0
kernfs_fop_write+0x188/0x200
__vfs_write+0x6c/0xe0
vfs_write+0xc0/0x230
SyS_write+0x6c/0x110
system_call+0x38/0xb4
This thread is waiting on the reader of cgroup_threadgroup_rwsem to
exit. The reader itself is under memory pressure and has gone into
reclaim after fork. There are times the reader also ends up waiting on
oom_lock as well.
__switch_to+0x1f8/0x350
__schedule+0x30c/0x990
schedule+0x48/0xc0
jbd2_log_wait_commit+0xd4/0x180
ext4_evict_inode+0x88/0x5c0
evict+0xf8/0x2a0
dispose_list+0x50/0x80
prune_icache_sb+0x6c/0x90
super_cache_scan+0x190/0x210
shrink_slab.part.15+0x22c/0x4c0
shrink_zone+0x288/0x3c0
do_try_to_free_pages+0x1dc/0x590
try_to_free_pages+0xdc/0x260
__alloc_pages_nodemask+0x72c/0xc90
alloc_pages_current+0xb4/0x1a0
page_table_alloc+0xc0/0x170
__pte_alloc+0x58/0x1f0
copy_page_range+0x4ec/0x950
copy_process.isra.5+0x15a0/0x1870
_do_fork+0xa8/0x4b0
ppc_clone+0x8/0xc
In the meanwhile, all processes exiting/forking are blocked almost
stalling the system.
This patch moves the threadgroup_change_begin from before
cgroup_fork() to just before cgroup_canfork(). There is no nee to
worry about threadgroup changes till the task is actually added to the
threadgroup. This avoids having to call reclaim with
cgroup_threadgroup_rwsem held.
tj: Subject and description edits.
Signed-off-by: Balbir Singh <[email protected]>
Acked-by: Zefan Li <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected] # v4.2+
Signed-off-by: Tejun Heo <[email protected]>
|
|
The same kind of recursive sane default limit and policy
countrol that has been implemented for the user namespace
is desirable for the other namespaces, so generalize
the user namespace refernce count into a ucount.
Acked-by: Kees Cook <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
|
|
Add a structure that is per user and per user ns and use it to hold
the count of user namespaces. This makes prevents one user from
creating denying service to another user by creating the maximum
number of user namespaces.
Rename the sysctl export of the maximum count from
/proc/sys/userns/max_user_namespaces to /proc/sys/user/max_user_namespaces
to reflect that the count is now per user.
Signed-off-by: "Eric W. Biederman" <[email protected]>
|
|
Export the export the maximum number of user namespaces as
/proc/sys/userns/max_user_namespaces.
Acked-by: Kees Cook <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
|
|
We should account for stacks regardless of stack size, and we need to
account in sub-page units if THREAD_SIZE < PAGE_SIZE. Change the units
to kilobytes and Move it into account_kernel_stack().
Fixes: 12580e4b54ba8 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
Link: http://lkml.kernel.org/r/9b5314e3ee5eda61b0317ec1563768602c1ef438.1468523549.git.luto@kernel.org
Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Reviewed-by: Josh Poimboeuf <[email protected]>
Reviewed-by: Vladimir Davydov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a zone.
This only makes sense if each kernel stack exists entirely in one zone,
and allowing vmapped stacks could break this assumption.
Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack
allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on all
architectures. Keep it simple and use KiB.
Link: http://lkml.kernel.org/r/083c71e642c5fa5f1b6898902e1b2db7b48940d4.1468523549.git.luto@kernel.org
Signed-off-by: Andy Lutomirski <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Reviewed-by: Josh Poimboeuf <[email protected]>
Reviewed-by: Vladimir Davydov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently, to charge a non-slab allocation to kmemcg one has to use
alloc_kmem_pages helper with __GFP_ACCOUNT flag. A page allocated with
this helper should finally be freed using free_kmem_pages, otherwise it
won't be uncharged.
This API suits its current users fine, but it turns out to be impossible
to use along with page reference counting, i.e. when an allocation is
supposed to be freed with put_page, as it is the case with pipe or unix
socket buffers.
To overcome this limitation, this patch moves charging/uncharging to
generic page allocator paths, i.e. to __alloc_pages_nodemask and
free_pages_prepare, and zaps alloc/free_kmem_pages helpers. This way,
one can use any of the available page allocation functions to get the
allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT,
just like in case of kmalloc and friends. A charged page will be
automatically uncharged on free.
To make it possible, we need to mark pages charged to kmemcg somehow.
To avoid introducing a new page flag, we make use of page->_mapcount for
marking such pages. Since pages charged to kmemcg are not supposed to
be mapped to userspace, it should work just fine. There are other
(ab)users of page->_mapcount - buddy and balloon pages - but we don't
conflict with them.
In case kmemcg is compiled out or not used at runtime, this patch
introduces no overhead to generic page allocator paths. If kmemcg is
used, it will be plus one gfp flags check on alloc and plus one
page->_mapcount check on free, which shouldn't hurt performance, because
the data accessed are hot.
Link: http://lkml.kernel.org/r/a9736d856f895bcb465d9f257b54efe32eda6f99.1464079538.git.vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit b235beea9e99 ("Clarify naming of thread info/stack allocators")
breaks the build on some powerpc configs, where THREAD_SIZE < PAGE_SIZE:
kernel/fork.c:235:2: error: implicit declaration of function 'free_thread_stack'
kernel/fork.c:355:8: error: assignment from incompatible pointer type
stack = alloc_thread_stack_node(tsk, node);
^
Fix it by renaming free_stack() to free_thread_stack(), and updating the
return type of alloc_thread_stack_node().
Fixes: b235beea9e99 ("Clarify naming of thread info/stack allocators")
Signed-off-by: Michael Ellerman <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We've had the thread info allocated together with the thread stack for
most architectures for a long time (since the thread_info was split off
from the task struct), but that is about to change.
But the patches that move the thread info to be off-stack (and a part of
the task struct instead) made it clear how confused the allocator and
freeing functions are.
Because the common case was that we share an allocation with the thread
stack and the thread_info, the two pointers were identical. That
identity then meant that we would have things like
ti = alloc_thread_info_node(tsk, node);
...
tsk->stack = ti;
which certainly _worked_ (since stack and thread_info have the same
value), but is rather confusing: why are we assigning a thread_info to
the stack? And if we move the thread_info away, the "confusing" code
just gets to be entirely bogus.
So remove all this confusion, and make it clear that we are doing the
stack allocation by renaming and clarifying the function names to be
about the stack. The fact that the thread_info then shares the
allocation is an implementation detail, and not really about the
allocation itself.
This is a pure renaming and type fix: we pass in the same pointer, it's
just that we clarify what the pointer means.
The ia64 code that actually only has one single allocation (for all of
task_struct, thread_info and kernel thread stack) now looks a bit odd,
but since "tsk->stack" is actually not even used there, that oddity
doesn't matter. It would be a separate thing to clean that up, I
intentionally left the ia64 changes as a pure brute-force renaming and
type change.
Acked-by: Andy Lutomirski <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
mmput_async is currently used only from the oom_reaper which is defined
only for CONFIG_MMU. We can save work_struct in mm_struct for
!CONFIG_MMU.
[[email protected]: fix typo, per Minchan]
Link: http://lkml.kernel.org/r/[email protected]
Reported-by: Minchan Kim <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
dup_mmap needs to lock current's mm mmap_sem for write. If the waiting
task gets killed by the oom killer it would block oom_reaper from
asynchronous address space reclaim and reduce the chances of timely OOM
resolving. Wait for the lock in the killable mode and return with EINTR
if the task got killed while waiting.
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Linux preallocates the task structs of the idle tasks for all possible
CPUs. This currently means they all end up on node 0. This also
implies that the cache line of MWAIT, which is around the flags field in
the task struct, are all located in node 0.
We see a noticeable performance improvement on Knights Landing CPUs when
the cache lines used for MWAIT are located in the local nodes of the
CPUs using them. I would expect this to give a (likely slight)
improvement on other systems too.
The patch implements placing the idle task in the node of its CPUs, by
passing the right target node to copy_process()
[[email protected]: use NUMA_NO_NODE, not a bare -1]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Andi Kleen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When using this program (as root):
#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/io.h>
#include <sys/types.h>
#include <sys/wait.h>
#define ITER 1000
#define FORKERS 15
#define THREADS (6000/FORKERS) // 1850 is proc max
static void fork_100_wait()
{
unsigned a, to_wait = 0;
printf("\t%d forking %d\n", THREADS, getpid());
for (a = 0; a < THREADS; a++) {
switch (fork()) {
case 0:
usleep(1000);
exit(0);
break;
case -1:
break;
default:
to_wait++;
break;
}
}
printf("\t%d forked from %d, waiting for %d\n", THREADS, getpid(),
to_wait);
for (a = 0; a < to_wait; a++)
wait(NULL);
printf("\t%d waited from %d\n", THREADS, getpid());
}
static void run_forkers()
{
pid_t forkers[FORKERS];
unsigned a;
for (a = 0; a < FORKERS; a++) {
switch ((forkers[a] = fork())) {
case 0:
fork_100_wait();
exit(0);
break;
case -1:
err(1, "DIE fork of %d'th forker", a);
break;
default:
break;
}
}
for (a = 0; a < FORKERS; a++)
waitpid(forkers[a], NULL, 0);
}
int main()
{
unsigned a;
int ret;
ret = ioperm(10, 20, 0);
if (ret < 0)
err(1, "ioperm");
for (a = 0; a < ITER; a++)
run_forkers();
return 0;
}
kmemleak reports many occurences of this leak:
unreferenced object 0xffff8805917c8000 (size 8192):
comm "fork-leak", pid 2932, jiffies 4295354292 (age 1871.028s)
hex dump (first 32 bytes):
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................
backtrace:
[<ffffffff814cfbf5>] kmemdup+0x25/0x50
[<ffffffff8103ab43>] copy_thread_tls+0x6c3/0x9a0
[<ffffffff81150174>] copy_process+0x1a84/0x5790
[<ffffffff811dc375>] wake_up_new_task+0x2d5/0x6f0
[<ffffffff8115411d>] _do_fork+0x12d/0x820
...
Due to the leakage of the memory items which should have been freed in
arch/x86/kernel/process.c:exit_thread().
Make sure the memory is freed when fork fails later in copy_process.
This is done by calling exit_thread with the thread to kill.
Signed-off-by: Jiri Slaby <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Aurelien Jacquiot <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Chen Liqin <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: David Howells <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Guan Xuetao <[email protected]>
Cc: Haavard Skinnemoen <[email protected]>
Cc: Hans-Christian Egtvedt <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Jeff Dike <[email protected]>
Cc: Jesper Nilsson <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Koichi Yasutake <[email protected]>
Cc: Lennox Wu <[email protected]>
Cc: Ley Foon Tan <[email protected]>
Cc: Mark Salter <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Mikael Starvik <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Richard Kuo <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Russell King <[email protected]>
Cc: Steven Miao <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Tetsuo has properly noted that mmput slow path might get blocked waiting
for another party (e.g. exit_aio waits for an IO). If that happens the
oom_reaper would be put out of the way and will not be able to process
next oom victim. We should strive for making this context as reliable
and independent on other subsystems as much as possible.
Introduce mmput_async which will perform the slow path from an async
(WQ) context. This will delay the operation but that shouldn't be a
problem because the oom_reaper has reclaimed the victim's address space
for most cases as much as possible and the remaining context shouldn't
bind too much memory anymore. The only exception is when mmap_sem
trylock has failed which shouldn't happen too often.
The issue is only theoretical but not impossible.
Signed-off-by: Michal Hocko <[email protected]>
Reported-by: Tetsuo Handa <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This patch implements the SS_AUTODISARM flag that can be OR-ed with
SS_ONSTACK when forming ss_flags.
When this flag is set, sigaltstack will be disabled when entering
the signal handler; more precisely, after saving sas to uc_stack.
When leaving the signal handler, the sigaltstack is restored by
uc_stack.
When this flag is used, it is safe to switch from sighandler with
swapcontext(). Without this flag, the subsequent signal will corrupt
the state of the switched-away sighandler.
To detect the support of this functionality, one can do:
err = sigaltstack(SS_DISABLE | SS_AUTODISARM);
if (err && errno == EINVAL)
unsupported();
Signed-off-by: Stas Sergeev <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Aleksa Sarai <[email protected]>
Cc: Amanieu d'Antras <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Heinrich Schuchardt <[email protected]>
Cc: Jason Low <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Palmer Dabbelt <[email protected]>
Cc: Paul Moore <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
kcov provides code coverage collection for coverage-guided fuzzing
(randomized testing). Coverage-guided fuzzing is a testing technique
that uses coverage feedback to determine new interesting inputs to a
system. A notable user-space example is AFL
(http://lcamtuf.coredump.cx/afl/). However, this technique is not
widely used for kernel testing due to missing compiler and kernel
support.
kcov does not aim to collect as much coverage as possible. It aims to
collect more or less stable coverage that is function of syscall inputs.
To achieve this goal it does not collect coverage in soft/hard
interrupts and instrumentation of some inherently non-deterministic or
non-interesting parts of kernel is disbled (e.g. scheduler, locking).
Currently there is a single coverage collection mode (tracing), but the
API anticipates additional collection modes. Initially I also
implemented a second mode which exposes coverage in a fixed-size hash
table of counters (what Quentin used in his original patch). I've
dropped the second mode for simplicity.
This patch adds the necessary support on kernel side. The complimentary
compiler support was added in gcc revision 231296.
We've used this support to build syzkaller system call fuzzer, which has
found 90 kernel bugs in just 2 months:
https://github.com/google/syzkaller/wiki/Found-Bugs
We've also found 30+ bugs in our internal systems with syzkaller.
Another (yet unexplored) direction where kcov coverage would greatly
help is more traditional "blob mutation". For example, mounting a
random blob as a filesystem, or receiving a random blob over wire.
Why not gcov. Typical fuzzing loop looks as follows: (1) reset
coverage, (2) execute a bit of code, (3) collect coverage, repeat. A
typical coverage can be just a dozen of basic blocks (e.g. an invalid
input). In such context gcov becomes prohibitively expensive as
reset/collect coverage steps depend on total number of basic
blocks/edges in program (in case of kernel it is about 2M). Cost of
kcov depends only on number of executed basic blocks/edges. On top of
that, kernel requires per-thread coverage because there are always
background threads and unrelated processes that also produce coverage.
With inlined gcov instrumentation per-thread coverage is not possible.
kcov exposes kernel PCs and control flow to user-space which is
insecure. But debugfs should not be mapped as user accessible.
Based on a patch by Quentin Casasnovas.
[[email protected]: make task_struct.kcov_mode have type `enum kcov_mode']
[[email protected]: unbreak allmodconfig]
[[email protected]: follow x86 Makefile layout standards]
Signed-off-by: Dmitry Vyukov <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: syzkaller <[email protected]>
Cc: Vegard Nossum <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Tavis Ormandy <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Quentin Casasnovas <[email protected]>
Cc: Kostya Serebryany <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: David Drysdale <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup namespace support from Tejun Heo:
"These are changes to implement namespace support for cgroup which has
been pending for quite some time now. It is very straight-forward and
only affects what part of cgroup hierarchies are visible.
After unsharing, mounting a cgroup fs will be scoped to the cgroups
the task belonged to at the time of unsharing and the cgroup paths
exposed to userland would be adjusted accordingly"
* 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: fix and restructure error handling in copy_cgroup_ns()
cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns()
Add FS_USERNS_FLAG to cgroup fs
cgroup: Add documentation for cgroup namespaces
cgroup: mount cgroupns-root when inside non-init cgroupns
kernfs: define kernfs_node_dentry
cgroup: cgroup namespace setns support
cgroup: introduce cgroup namespaces
sched: new clone flag CLONE_NEWCGROUP for cgroup namespace
kernfs: Add API to generate relative kernfs path
|
|
Show how much memory is allocated to kernel stacks.
Signed-off-by: Vladimir Davydov <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Introduce the ability to create new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred as cgroupns-root).
The main purpose of cgroup namespace is to virtualize the contents
of /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
they will see a relative path from their cgroupns-root).
For a correctly setup container this enables container-tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking system level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.
Signed-off-by: Aditya Kali <[email protected]>
Signed-off-by: Serge Hallyn <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
When inspecting a vague code inside prctl(PR_SET_MM_MEM) call (which
testing the RLIMIT_DATA value to figure out if we're allowed to assign
new @start_brk, @brk, @start_data, @end_data from mm_struct) it's been
commited that RLIMIT_DATA in a form it's implemented now doesn't do
anything useful because most of user-space libraries use mmap() syscall
for dynamic memory allocations.
Linus suggested to convert RLIMIT_DATA rlimit into something suitable
for anonymous memory accounting. But in this patch we go further, and
the changes are bundled together as:
* keep vma counting if CONFIG_PROC_FS=n, will be used for limits
* replace mm->shared_vm with better defined mm->data_vm
* account anonymous executable areas as executable
* account file-backed growsdown/up areas as stack
* drop struct file* argument from vm_stat_account
* enforce RLIMIT_DATA for size of data areas
This way code looks cleaner: now code/stack/data classification depends
only on vm_flags state:
VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib in proc)
VM_GROWSUP | VM_GROWSDOWN -> stack (VmStk)
VM_WRITE & ~VM_SHARED & !stack -> data (VmData)
The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
"shared", but that might be strange beast like readonly-private or VM_IO
area.
- RLIMIT_AS limits whole address space "VmSize"
- RLIMIT_STACK limits stack "VmStk" (but each vma individually)
- RLIMIT_DATA now limits "VmData"
Signed-off-by: Konstantin Khlebnikov <[email protected]>
Signed-off-by: Cyrill Gorcunov <[email protected]>
Cc: Quentin Casasnovas <[email protected]>
Cc: Vegard Nossum <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Cc: Willy Tarreau <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg. For the list, see below:
- threadinfo
- task_struct
- task_delay_info
- pid
- cred
- mm_struct
- vm_area_struct and vm_region (nommu)
- anon_vma and anon_vma_chain
- signal_struct
- sighand_struct
- fs_struct
- files_struct
- fdtable and fdtable->full_fds_bits
- dentry and external_name
- inode for all filesystems. This is the most tedious part, because
most filesystems overwrite the alloc_inode method.
The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds. Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).
[[email protected]: coding-style fixes]
Signed-off-by: Vladimir Davydov <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- cgroup v2 interface is now official. It's no longer hidden behind a
devel flag and can be mounted using the new cgroup2 fs type.
Unfortunately, cpu v2 interface hasn't made it yet due to the
discussion around in-process hierarchical resource distribution and
only memory and io controllers can be used on the v2 interface at the
moment.
- The existing documentation which has always been a bit of mess is
relocated under Documentation/cgroup-v1/. Documentation/cgroup-v2.txt
is added as the authoritative documentation for the v2 interface.
- Some features are added through for-4.5-ancestor-test branch to
enable netfilter xt_cgroup match to use cgroup v2 paths. The actual
netfilter changes will be merged through the net tree which pulled in
the said branch.
- Various cleanups
* 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: rename cgroup documentations
cgroup: fix a typo.
cgroup: Remove resource_counter.txt in Documentation/cgroup-legacy/00-INDEX.
cgroup: demote subsystem init messages to KERN_DEBUG
cgroup: Fix uninitialized variable warning
cgroup: put controller Kconfig options in meaningful order
cgroup: clean up the kernel configuration menu nomenclature
cgroup_pids: fix a typo.
Subject: cgroup: Fix incomplete dd command in blkio documentation
cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends
cpuset: Replace all instances of time_t with time64_t
cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentation
cgroup: rename Documentation/cgroups/ to Documentation/cgroup-legacy/
cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type
|
|
new patches
Signed-off-by: Ingo Molnar <[email protected]>
|
|
In the following commit:
7675104990ed ("sched: Implement lockless wake-queues")
we gained lockless wake-queues.
The -RT kernel managed to lockup itself with those. There could be multiple
attempts for task X to enqueue it for a wakeup _even_ if task X is already
running.
The reason is that task X could be runnable but not yet on CPU. The the
task performing the wakeup did not leave the CPU it could performe
multiple wakeups.
With the proper timming task X could be running and enqueued for a
wakeup. If this happens while X is performing a fork() then its its
child will have a !NULL `wake_q` member copied.
This is not a problem as long as the child task does not participate in
lockless wakeups :)
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Fixes: 7675104990ed ("sched: Implement lockless wake-queues")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The cputime can only be updated by the current task itself, even in
vtime case. So we can safely use seqcount instead of seqlock as there
is no writer concurrency involved.
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Hiroshi Shimamoto <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Luiz Capitulino <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul E . McKenney <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
VTIME_SLEEPING state happens either when:
1) The task is sleeping and no tickless delta is to be added on the task
cputime stats.
2) The CPU isn't running vtime at all, so the same properties of 1) applies.
Lets rename the vtime symbol to reflect both states.
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Hiroshi Shimamoto <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Luiz Capitulino <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul E . McKenney <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Now that nobody use the "priv" arg passed to can_fork/cancel_fork/fork we can
kill CGROUP_CANFORK_COUNT/SUBSYS_TAG/etc and cgrp_ss_priv[] in copy_process().
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
If the new child migrates to another cgroup before cgroup_post_fork() calls
subsys->fork(), then both pids_can_attach() and pids_fork() will do the same
pids_uncharge(old_pids) + pids_charge(pids) sequence twice.
Change copy_process() to call threadgroup_change_begin/threadgroup_change_end
unconditionally. percpu_down_read() is cheap and this allows other cleanups,
see the next changes.
Also, this way we can unify cgroup_threadgroup_rwsem and dup_mmap_sem.
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Zefan Li <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
Merge patch-bomb from Andrew Morton:
- inotify tweaks
- some ocfs2 updates (many more are awaiting review)
- various misc bits
- kernel/watchdog.c updates
- Some of mm. I have a huge number of MM patches this time and quite a
lot of it is quite difficult and much will be held over to next time.
* emailed patches from Andrew Morton <[email protected]>: (162 commits)
selftests: vm: add tests for lock on fault
mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage
mm: introduce VM_LOCKONFAULT
mm: mlock: add new mlock system call
mm: mlock: refactor mlock, munlock, and munlockall code
kasan: always taint kernel on report
mm, slub, kasan: enable user tracking by default with KASAN=y
kasan: use IS_ALIGNED in memory_is_poisoned_8()
kasan: Fix a type conversion error
lib: test_kasan: add some testcases
kasan: update reference to kasan prototype repo
kasan: move KASAN_SANITIZE in arch/x86/boot/Makefile
kasan: various fixes in documentation
kasan: update log messages
kasan: accurately determine the type of the bad access
kasan: update reported bug types for kernel memory accesses
kasan: update reported bug types for not user nor kernel memory accesses
mm/kasan: prevent deadlock in kasan reporting
mm/kasan: don't use kasan shadow pointer in generic functions
mm/kasan: MODULE_VADDR is not available on all archs
...
|
|
The cost of faulting in all memory to be locked can be very high when
working with large mappings. If only portions of the mapping will be used
this can incur a high penalty for locking.
For the example of a large file, this is the usage pattern for a large
statical language model (probably applies to other statical or graphical
models as well). For the security example, any application transacting in
data that cannot be swapped out (credit card data, medical records, etc).
This patch introduces the ability to request that pages are not
pre-faulted, but are placed on the unevictable LRU when they are finally
faulted in. The VM_LOCKONFAULT flag will be used together with VM_LOCKED
and has no effect when set without VM_LOCKED. Setting the VM_LOCKONFAULT
flag for a VMA will cause pages faulted into that VMA to be added to the
unevictable LRU when they are faulted or if they are already present, but
will not cause any missing pages to be faulted in.
Exposing this new lock state means that we cannot overload the meaning of
the FOLL_POPULATE flag any longer. Prior to this patch it was used to
mean that the VMA for a fault was locked. This means we need the new
FOLL_MLOCK flag to communicate the locked state of a VMA. FOLL_POPULATE
will now only control if the VMA should be populated and in the case of
VM_LOCKONFAULT, it will not be set.
Signed-off-by: Eric B Munson <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
"The cgroup core saw several significant updates this cycle:
- percpu_rwsem for threadgroup locking is reinstated. This was
temporarily dropped due to down_write latency issues. Oleg's
rework of percpu_rwsem which is scheduled to be merged in this
merge window resolves the issue.
- On the v2 hierarchy, when controllers are enabled and disabled, all
operations are atomic and can fail and revert cleanly. This allows
->can_attach() failure which is necessary for cpu RT slices.
- Tasks now stay associated with the original cgroups after exit
until released. This allows tracking resources held by zombies
(e.g. pids) and makes it easy to find out where zombies came from
on the v2 hierarchy. The pids controller was broken before these
changes as zombies escaped the limits; unfortunately, updating this
behavior required too many invasive changes and I don't think it's
a good idea to backport them, so the pids controller on 4.3, the
first version which included the pids controller, will stay broken
at least until I'm sure about the cgroup core changes.
- Optimization of a couple common tests using static_key"
* 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (38 commits)
cgroup: fix race condition around termination check in css_task_iter_next()
blkcg: don't create "io.stat" on the root cgroup
cgroup: drop cgroup__DEVEL__legacy_files_on_dfl
cgroup: replace error handling in cgroup_init() with WARN_ON()s
cgroup: add cgroup_subsys->free() method and use it to fix pids controller
cgroup: keep zombies associated with their original cgroups
cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock
cgroup: don't hold css_set_rwsem across css task iteration
cgroup: reorganize css_task_iter functions
cgroup: factor out css_set_move_task()
cgroup: keep css_set and task lists in chronological order
cgroup: make cgroup_destroy_locked() test cgroup_is_populated()
cgroup: make css_sets pin the associated cgroups
cgroup: relocate cgroup_[try]get/put()
cgroup: move check_for_release() invocation
cgroup: replace cgroup_has_tasks() with cgroup_is_populated()
cgroup: make cgroup->nr_populated count the number of populated css_sets
cgroup: remove an unused parameter from cgroup_task_migrate()
cgroup: fix too early usage of static_branch_disable()
cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically
...
|
|
cgroup_exit() is called when a task exits and disassociates the
exiting task from its cgroups and half-attach it to the root cgroup.
This is unnecessary and undesirable.
No controller actually needs an exiting task to be disassociated with
non-root cgroups. Both cpu and perf_event controllers update the
association to the root cgroup from their exit callbacks just to keep
consistent with the cgroup core behavior.
Also, this disassociation makes it difficult to track resources held
by zombies or determine where the zombies came from. Currently, pids
controller is completely broken as it uncharges on exit and zombies
always escape the resource restriction. With cgroup association being
reset on exit, fixing it is pretty painful.
There's no reason to reset cgroup membership on exit. The zombie can
be removed from its css_set so that it doesn't show up on
"cgroup.procs" and thus can't be migrated or interfere with cgroup
removal. It can still pin and point to the css_set so that its cgroup
membership is maintained. This patch makes cgroup core keep zombies
associated with their cgroups at the time of exit.
* Previous patches decoupled populated_cnt tracking from css_set
lifetime, so a dying task can be simply unlinked from its css_set
while pinning and pointing to the css_set. This keeps css_set
association from task side alive while hiding it from "cgroup.procs"
and populated_cnt tracking. The css_set reference is dropped when
the task_struct is freed.
* ->exit() callback no longer needs the css arguments as the
associated css never changes once PF_EXITING is set. Removed.
* cpu and perf_events controllers no longer need ->exit() callbacks.
There's no reason to explicitly switch away on exit. The final
schedule out is enough. The callbacks are removed.
* On traditional hierarchies, nothing changes. "/proc/PID/cgroup"
still reports "/" for all zombies. On the default hierarchy,
"/proc/PID/cgroup" keeps reporting the cgroup that the task belonged
to at the time of exit. If the cgroup gets removed before the task
is reaped, " (deleted)" is appended.
v2: Build brekage due to missing dummy cgroup_free() when
!CONFIG_CGROUP fixed.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
|
|
In the next patch in this series, a new field 'checking_timer' will
be added to 'struct thread_group_cputimer'. Both this and the
existing 'running' integer field are just used as boolean values. To
save space in the structure, we can make both of these fields booleans.
This is a preparatory patch to convert the existing running integer
field to a boolean.
Suggested-by: George Spelvin <[email protected]>
Signed-off-by: Jason Low <[email protected]>
Reviewed: George Spelvin <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
Note: This commit was originally committed as d59cfc09c32a but got
reverted by 0c986253b939 due to the performance regression from
the percpu_rwsem write down/up operations added to cgroup task
migration path. percpu_rwsem changes which alleviate the
performance issue are pending for v4.4-rc1 merge window.
Re-apply.
The cgroup side of threadgroup locking uses signal_struct->group_rwsem
to synchronize against threadgroup changes. This per-process rwsem
adds small overhead to thread creation, exit and exec paths, forces
cgroup code paths to do lock-verify-unlock-retry dance in a couple
places and makes it impossible to atomically perform operations across
multiple processes.
This patch replaces signal_struct->group_rwsem with a global
percpu_rwsem cgroup_threadgroup_rwsem which is cheaper on the reader
side and contained in cgroups proper. This patch converts one-to-one.
This does make writer side heavier and lower the granularity; however,
cgroup process migration is a fairly cold path, we do want to optimize
thread operations over it and cgroup migration operations don't take
enough time for the lower granularity to matter.
Signed-off-by: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/g/[email protected]
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
|
|
percpu_rwsem"
This reverts commit d59cfc09c32a2ae31f1c3bc2983a0cd79afb3f14.
d59cfc09c32a ("sched, cgroup: replace signal_struct->group_rwsem with
a global percpu_rwsem") and b5ba75b5fc0e ("cgroup: simplify
threadgroup locking") changed how cgroup synchronizes against task
fork and exits so that it uses global percpu_rwsem instead of
per-process rwsem; unfortunately, the write [un]lock paths of
percpu_rwsem always involve synchronize_rcu_expedited() which turned
out to be too expensive.
Improvements for percpu_rwsem are scheduled to be merged in the coming
v4.4-rc1 merge window which alleviates this issue. For now, revert
the two commits to restore per-process rwsem. They will be re-applied
for the v4.4-rc1 merge window.
Signed-off-by: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/g/[email protected]
Reported-by: Christian Borntraeger <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: [email protected] # v4.2+
|
|
These two flags gets set in vma->vm_flags to tell the VM common code
if the userfaultfd is armed and in which mode (only tracking missing
faults, only tracking wrprotect faults or both). If neither flags is
set it means the userfaultfd is not armed on the vma.
Signed-off-by: Andrea Arcangeli <[email protected]>
Acked-by: Pavel Emelyanov <[email protected]>
Cc: Sanidhya Kashyap <[email protected]>
Cc: [email protected]
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Andres Lagar-Cavilla <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Peter Feiner <[email protected]>
Cc: "Dr. David Alan Gilbert" <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: "Huangpeng (Peter)" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This adds the vm_userfaultfd_ctx to the vm_area_struct.
Signed-off-by: Andrea Arcangeli <[email protected]>
Acked-by: Pavel Emelyanov <[email protected]>
Cc: Sanidhya Kashyap <[email protected]>
Cc: [email protected]
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Andres Lagar-Cavilla <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Peter Feiner <[email protected]>
Cc: "Dr. David Alan Gilbert" <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: "Huangpeng (Peter)" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|