Change __exit_signal() to check thread_group_leader() instead of
atomic_dec_and_test(&sig->count). This must be equivalent, because the
group leader must be released only after all other threads have exited
and passed __exit_signal().
Henceforth sig->count is not actually used, except in fs/proc for
get_nr_threads/etc.
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Roland McGrath <[email protected]>
Cc: Veaceslav Falico <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
de_thread() and __exit_signal() use signal_struct->count/notify_count for
synchronization. We can simplify the code and use ->notify_count only.
Instead of comparing these two counters, we can change de_thread() to set
->notify_count = nr_of_sub_threads, then change __exit_signal() to
dec-and-test this counter and notify group_exit_task.
Note that __exit_signal() checks "notify_count > 0" just for symmetry with
exit_notify(), we could just check it is != 0.
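The counting scheme can be modelled in userspace roughly as follows. This is
only a sketch: struct signal_model and the model_*() helpers are hypothetical
names, locking is not modelled, and a flag stands in for waking
group_exit_task.

```c
#include <stdbool.h>

/* Hypothetical userspace model of the counter; not kernel code. */
struct signal_model {
    int notify_count;            /* sub-threads the execing thread waits for */
    bool group_exit_task_woken;  /* stands in for waking group_exit_task */
};

/* de_thread(): record how many other sub-threads still have to exit. */
static void model_de_thread(struct signal_model *sig, int nr_of_sub_threads)
{
    sig->notify_count = nr_of_sub_threads;
    sig->group_exit_task_woken = false;
}

/* __exit_signal(): every exiting sub-thread does dec-and-test. */
static void model_exit_signal(struct signal_model *sig)
{
    if (sig->notify_count > 0 && --sig->notify_count == 0)
        sig->group_exit_task_woken = true;
}
```

With three sub-threads the waiter is notified only by the third exit, without
comparing two separate counters.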
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Roland McGrath <[email protected]>
Cc: Veaceslav Falico <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Change zap_other_threads() to return the number of other sub-threads found
on ->thread_group list.
Other changes are cosmetic:
- change the code to use while_each_thread() helper
- remove the obsolete comment about SIGKILL/SIGSTOP
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Roland McGrath <[email protected]>
Cc: Veaceslav Falico <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
signal_struct->count in its current form must die.
- it has no reasons to be atomic_t
- it looks like a reference counter, but it is not
- otoh, we really need to make task->signal refcountable, just look at
the extremely ugly task_rq_unlock_wait() called from __exit_signal().
- we should change the lifetime rules for task->signal, it should be
pinned to task_struct. We have a lot of code which can be simplified
after that.
- it is not needed! while the code is correct, any usage of this
counter is artificial, except fs/proc uses it correctly to show the
number of threads.
This series removes the usage of sig->count from exit paths.
This patch:
Now that Veaceslav changed copy_signal() to use zalloc(), exit_notify()
can just check notify_count < 0 to tell whether the execing sub-thread
needs the notification from us. No other checks are needed: notify_count
!= 0 must always mean ->group_exit_task != NULL is waiting for us.
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Roland McGrath <[email protected]>
Cc: Veaceslav Falico <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
- move the cprm.mm_flags checks up, before we take mmap_sem
- move down_write(mmap_sem) and ->core_state check from do_coredump()
to coredump_wait()
This simplifies the code and makes the locking symmetrical.
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: David Howells <[email protected]>
Cc: Neil Horman <[email protected]>
Cc: Roland McGrath <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Given that do_coredump() calls put_cred() on exit path, it is a bit ugly
to do put_cred() + "goto fail" twice, just add the new "fail_creds" label.
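The label layout can be sketched in userspace like this. It is an
illustration only: model_prepare_creds(), model_put_cred() and
model_do_coredump() are hypothetical stand-ins, and the fail_at parameter
exists just to exercise the error paths.

```c
#include <stdlib.h>

/* Number of credential objects currently held (for checking balance). */
static int creds_live;

/* Hypothetical stand-in for prepare_creds(). */
static void *model_prepare_creds(void)
{
    void *cred = malloc(1);

    if (cred)
        creds_live++;
    return cred;
}

/* Hypothetical stand-in for put_cred(). */
static void model_put_cred(void *cred)
{
    creds_live--;
    free(cred);
}

/* Returns 0 on success, -1 on failure; fail_at picks the failing step. */
static int model_do_coredump(int fail_at)
{
    int ret = -1;
    void *cred = model_prepare_creds();

    if (!cred)
        goto fail;
    if (fail_at == 1)
        goto fail_creds;   /* previously: put_cred() + goto fail */
    if (fail_at == 2)
        goto fail_creds;   /* previously: put_cred() + goto fail */
    ret = 0;
fail_creds:
    model_put_cred(cred);  /* single release point for every path */
fail:
    return ret;
}
```

Every path that obtained the credentials falls through fail_creds, so the
release is written once.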
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: David Howells <[email protected]>
Cc: Neil Horman <[email protected]>
Cc: Roland McGrath <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
- kill "int helper_argc", argv_split(argcp) accepts argcp == NULL.
- move "int dump_count" under the "if (ispipe)" branch, fail_dropcount
can check ispipe.
- move "char **helper_argv" as well, change the code to do argv_free()
right after call_usermodehelper_fns().
- If call_usermodehelper_fns() fails goto close_fail label instead
of closing the file by hand.
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: David Howells <[email protected]>
Cc: Neil Horman <[email protected]>
Cc: Roland McGrath <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
do_coredump() does a lot of file checks after it opens the file or calls
usermode helper. But all of these checks are only needed in !ispipe case.
Move this code into the "else" branch and kill the ugly repetitive ispipe
checks.
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: David Howells <[email protected]>
Cc: Neil Horman <[email protected]>
Cc: Roland McGrath <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
UMH_WAIT_EXEC should report the error if kernel_thread() fails, like
UMH_WAIT_PROC does.
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
__call_usermodehelper(UMH_NO_WAIT) has 2 problems:
- if kernel_thread() fails, call_usermodehelper_freeinfo()
is not called.
- for unknown reason UMH_NO_WAIT has UMH_WAIT_PROC logic,
we spawn yet another thread which waits until the user
mode application exits.
Change the UMH_NO_WAIT code to use ____call_usermodehelper() instead of
wait_for_helper(), and do call_usermodehelper_freeinfo() unconditionally.
We can rely on CLONE_VFORK: do_fork(CLONE_VFORK) sleeps until the child
exits or execs.
With or without this patch UMH_NO_WAIT does not report the error if
kernel_thread() fails, this is correct since the caller doesn't wait for
result.
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
1. wait_for_helper() calls allow_signal(SIGCHLD) to ensure the child
can't autoreap itself.
However, this means that a spurious SIGCHLD from user-space can
set TIF_SIGPENDING and:
- kernel_thread() or sys_wait4() can fail due to signal_pending()
- worse, wait4() can fail before ____call_usermodehelper() execs
or exits. In this case the caller may kfree(subprocess_info)
while the child still uses this memory.
Change the code to use SIG_DFL instead of magic "(void __user *)2"
set by allow_signal(). This means that SIGCHLD won't be delivered,
yet the child won't autoreap itself.
The problem is minor, only root can send a signal to this kthread.
2. If sys_wait4(&ret) fails it doesn't populate "ret", in this case
wait_for_helper() reports a random value from uninitialized var.
With this patch sys_wait4() should never fail, but still it makes
sense to initialize ret = -ECHILD so that the caller can notice
the problem.
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Neil Horman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
____call_usermodehelper() correctly calls flush_signal_handlers() to set
SIG_DFL, but sigemptyset(->blocked) and recalc_sigpending() are not
needed.
This kthread was forked by workqueue thread, all signals must be unblocked
and ignored, no pending signal is possible.
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Now that nobody ever changes subprocess_info->cred we can kill this member
and related code. ____call_usermodehelper() always runs in the context of
freshly forked kernel thread, it has the proper ->cred copied from its
parent kthread, keventd.
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Neil Horman <[email protected]>
Acked-by: David Howells <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
call_usermodehelper_keys() uses call_usermodehelper_setkeys() to change
subprocess_info->cred in advance. Now that we have info->init() we can
change this code to set tgcred->session_keyring in context of execing
kernel thread.
Note: since currently call_usermodehelper_keys() is never called with
UMH_NO_WAIT, call_usermodehelper_keys()->key_get() and umh_keys_cleanup()
are not really needed, we could rely on install_session_keyring_to_cred()
which does key_get() on success.
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Neil Horman <[email protected]>
Acked-by: David Howells <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
resolve limit
The first patch in this series introduced an init function to the
call_usermodehelper api so that processes could be customized by caller.
This patch takes advantage of that fact, by customizing the helper in
do_coredump to create the pipe and set its core limit to one (for our
recursion check). This lets us clean up the previous ugliness in the
usermodehelper internals and factor call_usermodehelper out entirely.
While I'm at it, we can also modify the helper setup to look for a core
limit value of 1 rather than zero for our recursion check.
Signed-off-by: Neil Horman <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
About 6 months ago, I made a set of changes to how the core-dump-to-a-pipe
feature in the kernel works. We had reports of several races, including
some reports of apps bypassing our recursion check so that a process that
was forked as part of a core_pattern setup could infinitely crash and
refork until the system crashed.
We fixed those by improving our recursion checks. The new check basically
refuses to fork a process if its core limit is zero, which works well.
Unfortunately, I've been getting grief from maintainers of user space
programs that are inserted as the forked process of core_pattern. They
contend that in order for their programs (such as abrt and apport) to
work, all the running processes in a system must have their core limits
set to a non-zero value, to which I say 'yes'. I did this by design, and
think that's the right way to do things.
But I've been asked to ease this burden on user space enough times that I
thought I would take a look at it. The first suggestion was to make the
recursion check fail on a non-zero 'special' number, like one. That way
the core collector process could set its core size ulimit to 1, and enable
the kernel's recursion detection. This isn't a bad idea on the surface,
but I don't like it since it's opt-in, in that if a program like abrt or
apport has a bug and fails to set such a core limit, we're left with a
recursively crashing system again.
So I've come up with this. What I've done is modify the
call_usermodehelper api such that an extra parameter is added, a function
pointer which will be called by the user helper task, after it forks, but
before it exec's the required process. This will give the caller the
opportunity to get a callback in the process's context, allowing it to do
whatever it needs to do to the process in the kernel prior to exec-ing the
user space code. In the case of do_coredump, this callback is used to set
the core ulimit of the helper process to 1. This eliminates the opt-in
problem that I had above, as it allows the ulimit for core sizes to be set
to the value of 1, which is what the recursion check looks for in
do_coredump.
This patch:
Create new function call_usermodehelper_fns() and allow it to assign both
an init and cleanup function, as well as arbitrary data.
The init function is called from the context of the forked process and
allows for customization of the helper process prior to calling exec. Its
return code gates the continuation of the process, or causes its exit.
Also add an arbitrary data pointer to the subprocess_info struct allowing
for data to be passed from the caller to the new process, and the
subsequent cleanup process.
Also, use this patch to clean up the cleanup function. It currently takes
an argp and envp pointer for freeing, which is ugly. Let's instead just
make the subprocess_info structure public, and pass that to the cleanup
and init routines.
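A userspace analogue of the init hook might look like the sketch below. It is
not the kernel API: helper_init_fn, set_core_limit_one(), refuse_helper() and
run_helper() are hypothetical names, fork() stands in for the helper thread,
and the child reports its core limit instead of exec-ing a real program.

```c
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical analogue of the init hook: runs in the child before exec. */
typedef int (*helper_init_fn)(void *data);

/* Example hook: set the soft core limit of the calling process to 1. */
static int set_core_limit_one(void *data)
{
    struct rlimit lim;

    (void)data;
    if (getrlimit(RLIMIT_CORE, &lim) != 0 || lim.rlim_max < 1)
        return -1;
    lim.rlim_cur = 1;
    return setrlimit(RLIMIT_CORE, &lim);
}

/* Example hook that always refuses, to show the return code gating. */
static int refuse_helper(void *data)
{
    (void)data;
    return -1;
}

/*
 * Fork a child, run the hook in the child's context, then continue.
 * Child exit status: 0 = core limit is 1, 1 = it is not, 2 = hook refused.
 */
static int run_helper(helper_init_fn init, void *data)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct rlimit lim;

        if (init && init(data) != 0)
            _exit(2);           /* non-zero return aborts the helper */
        /* A real helper would call execve() at this point. */
        if (getrlimit(RLIMIT_CORE, &lim) != 0)
            _exit(3);
        _exit(lim.rlim_cur == 1 ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because the hook runs in the child, the limit it sets never touches the
caller, and the helper cannot forget to opt in.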
Signed-off-by: Neil Horman <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Andrew Tridgell reports that aio_read(SIGEV_SIGNAL) can fail if the
notification from the helper thread races with setresuid(), see
http://samba.org/~tridge/junkcode/aio_uid.c
This happens because check_kill_permission() doesn't permit sending a
signal to a task with different cred->xids. But there is no
security reason to check ->cred's when the task sends a signal (private or
group-wide) to its sub-thread. Whatever we do, any thread can bypass all
security checks and send SIGKILL to all threads, or it can block a signal
SIG and do kill(gettid(), SIG) to deliver this signal to another
sub-thread. Not to mention that CLONE_THREAD implies CLONE_VM.
Change check_kill_permission() to avoid the credentials check when the
sender and the target are from the same thread group.
Also, move "cred = current_cred()" down to avoid calling get_current()
twice.
Note: David Howells pointed out we could relax this even more, the
CLONE_SIGHAND (without CLONE_THREAD) case probably does not need
these checks either.
Roland said:
: The glibc (libpthread) that does set*id across threads has
: been in use for a while (2.3.4?), probably in distro's using kernels as old
: or older than any active -stable streams. In the race in question, this
: kernel bug is breaking valid POSIX application expectations.
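The shape of the new check can be sketched as follows. This is a deliberately
simplified model: struct task_model and model_may_signal() are hypothetical,
and a single uid comparison stands in for the real credentials and capability
checks.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified view of a task. */
struct task_model {
    int tgid;  /* thread group id */
    int uid;   /* stands in for the whole set of cred ids */
};

/* Returns true if the sender may signal the target in this model. */
static bool model_may_signal(const struct task_model *sender,
                             const struct task_model *target)
{
    /* Same thread group: skip the credentials check entirely. */
    if (sender->tgid == target->tgid)
        return true;
    /* Otherwise fall back to comparing credentials. */
    return sender->uid == target->uid;
}
```

In this model a thread that has just changed its uid can still be signalled
by its siblings, which is the aio_read(SIGEV_SIGNAL) case from the report.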
Reported-by: Andrew Tridgell <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Roland McGrath <[email protected]>
Acked-by: David Howells <[email protected]>
Cc: Eric Paris <[email protected]>
Cc: Jakub Jelinek <[email protected]>
Cc: James Morris <[email protected]>
Cc: Roland McGrath <[email protected]>
Cc: Stephen Smalley <[email protected]>
Cc: <[email protected]> [all kernel versions]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Now that Mike Frysinger unified the FDPIC ptrace code, we can fix the
unsafe usage of child->mm in ptrace_request(PTRACE_GETFDPIC).
We have the reference to task_struct, and ptrace_check_attach() verified
the tracee is stopped. But nothing can protect from SIGKILL after that,
we must not assume child->mm != NULL.
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Mike Frysinger <[email protected]>
Acked-by: David Howells <[email protected]>
Cc: Paul Mundt <[email protected]>
Cc: Greg Ungerer <[email protected]>
Acked-by: Roland McGrath <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The Blackfin/FRV/SuperH guys all have the same exact FDPIC ptrace code in
their arch handlers (since they were probably copied & pasted). Since
these ptrace interfaces are an arch independent aspect of the FDPIC code,
unify them in the common ptrace code so new FDPIC ports don't need to copy
and paste this fundamental stuff yet again.
Signed-off-by: Mike Frysinger <[email protected]>
Acked-by: Roland McGrath <[email protected]>
Acked-by: David Howells <[email protected]>
Acked-by: Paul Mundt <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Some workloads that create a large number of small files tend to assign
too many pages to node 0 (multi-node systems). Part of the reason is that
the rotor (in cpuset_mem_spread_node()) used to assign nodes starts at
node 0 for newly created tasks.
This patch changes the rotor to be initialized to a random node number of
the cpuset.
[[email protected]: fix layout]
[[email protected]: Define stub numa_random() for !NUMA configuration]
Signed-off-by: Jack Steiner <[email protected]>
Signed-off-by: Lee Schermerhorn <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: Jack Steiner <[email protected]>
Cc: Robin Holt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We have observed several workloads running on multi-node systems where
memory is assigned unevenly across the nodes in the system. There are
numerous reasons for this but one is the round-robin rotor in
cpuset_mem_spread_node().
For example, a simple test that writes a multi-page file will allocate
pages on nodes 0 2 4 6 ... Odd nodes are skipped. (Sometimes it
allocates on odd nodes & skips even nodes).
An example is shown below. The program "lfile" writes a file consisting
of 10 pages. The program then mmaps the file & uses get_mempolicy(...,
MPOL_F_NODE) to determine the nodes where the file pages were allocated.
The output is shown below:
# ./lfile
allocated on nodes: 2 4 6 0 1 2 6 0 2
There is a single rotor that is used for allocating both file pages & slab
pages. Writing the file allocates both a data page & a slab page
(buffer_head). This advances the RR rotor 2 nodes for each page
allocated.
A quick test seems to confirm this is the cause of the uneven
allocation:
# echo 0 >/dev/cpuset/memory_spread_slab
# ./lfile
allocated on nodes: 6 7 8 9 0 1 2 3 4 5
This patch introduces a second rotor that is used for slab allocations.
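The effect can be reproduced with a small model. It is a sketch under
assumptions: NR_NODES, rotor_next() and model_write_file() are hypothetical,
every file page is assumed to need exactly one slab allocation, and the
cpuset's node mask is reduced to a plain modulo.

```c
#define NR_NODES 8  /* assumed node count for the model */

/* Round-robin: return the current node and advance the rotor. */
static int rotor_next(int *rotor)
{
    int node = *rotor;

    *rotor = (*rotor + 1) % NR_NODES;
    return node;
}

/*
 * Write npages file pages; each page also needs one slab allocation
 * (the buffer_head). out[i] receives the node of the i-th file page.
 */
static void model_write_file(int npages, int *file_rotor, int *slab_rotor,
                             int *out)
{
    for (int i = 0; i < npages; i++) {
        out[i] = rotor_next(file_rotor);
        (void)rotor_next(slab_rotor);
    }
}
```

Passing the same rotor for both arguments models the old behaviour and puts
the file pages on every other node. Passing two separate rotors spreads them
over consecutive nodes.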
Signed-off-by: Jack Steiner <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: Jack Steiner <[email protected]>
Cc: Robin Holt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Introduce struct mem_cgroup_thresholds. It helps to reduce the number of
checks of the thresholds type (memory or mem+swap).
[[email protected]: repair comment]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Phil Carmody <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Paul Menage <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since we are unable to handle an error returned by
cftype.unregister_event() properly, let's make the callback
void-returning.
mem_cgroup_unregister_event() has been rewritten to be a "never fail"
function. On mem_cgroup_usage_register_event() we save old buffer for
thresholds array and reuse it in mem_cgroup_usage_unregister_event() to
avoid allocation.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Phil Carmody <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: Li Zefan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
FILE_MAPPED per memcg of migrated file cache is not properly updated,
because our hook in page_add_file_rmap() can't know to which memcg
FILE_MAPPED should be counted.
Basically, this patch is for fixing the bug but includes some big changes
to fix up other messes.
Now, at migrating mapped file, events happen in following sequence.
1. allocate a new page.
2. get memcg of an old page.
3. charge against a new page before migration. But at this point,
no changes to new page's page_cgroup, no commit for the charge.
(IOW, PCG_USED bit is not set.)
4. page migration replaces radix-tree, old-page and new-page.
5. page migration remaps the new page if the old page was mapped.
6. Here, the new page is unlocked.
7. memcg commits the charge for newpage, Mark the new page's page_cgroup
as PCG_USED.
Because "commit" happens after page-remap, we cannot count FILE_MAPPED
at "5", because we must not trust page_cgroup->mem_cgroup
if the PCG_USED bit is unset.
(Note: memcg's LRU removal code does that but LRU-isolation logic is used
for helping it. When we overwrite page_cgroup->mem_cgroup, page_cgroup is
not on LRU or page_cgroup->mem_cgroup is NULL.)
We can lose file_mapped accounting information at 5 because FILE_MAPPED
is updated only when mapcount changes 0->1. So we should catch it.
BTW, historically, the above implementation comes from the
migration-failure case of anonymous pages. Because we charge both the old
page and the new page with mapcount=0, we can't catch
- the page is really freed before remap.
- migration fails but it's freed before remap
or .....corner cases.
New migration sequence with memcg is:
1. allocate a new page.
2. mark PageCgroupMigration to the old page.
3. charge against a new page onto the old page's memcg. (here, new page's pc
is marked as PageCgroupUsed.)
4. page migration replaces radix-tree, page table, etc...
5. At remapping, new page's page_cgroup is now marked as "USED"
We can catch 0->1 event and FILE_MAPPED will be properly updated.
And we can catch SWAPOUT event after unlock this and freeing this
page by unmap() can be caught.
7. Clear PageCgroupMigration of the old page.
So, FILE_MAPPED will be correctly updated.
So what is the MIGRATION flag for?
Without it, at migration failure, we may have to charge the old page again
because it may be fully unmapped. "charge" means that we have to dive into
memory reclaim or something complicated. So, it's better to avoid
charging it again. Before this patch, __commit_charge() was working for
both the old and new pages and fixed up everything. But this technique has
some race conditions around FILE_MAPPED, SWAPOUT, etc.
Now, the kernel uses the MIGRATION flag and doesn't uncharge the old page
until the end of migration.
I hope this change will make memcg's page migration much simpler. This
page migration has caused several problems. It is worth adding a flag for
simplification.
Reviewed-by: Daisuke Nishimura <[email protected]>
Tested-by: Daisuke Nishimura <[email protected]>
Reported-by: Daisuke Nishimura <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Only an out of memory error will cause ret to be set.
Signed-off-by: Phil Carmody <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The bottom 4 hunks are atomically changing memory to which there are no
aliases as it's freshly allocated, so there's no need to use atomic
operations.
The other hunks are just atomic_read and atomic_set, and do not involve
any read-modify-write. The use of atomic_{read,set} doesn't prevent a
read/write or write/write race, so if a race were possible (I'm not saying
one is), then it would still be there even with atomic_set.
See:
http://digitalvampire.org/blog/index.php/2007/05/13/atomic-cargo-cults/
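The point about read-modify-write can be shown with C11 atomics in userspace.
This is an analogue, not the kernel's atomic_t API: lost_update_demo() and
rmw_demo() are hypothetical, and the two "threads" are interleaved by hand so
the outcome is deterministic.

```c
#include <stdatomic.h>

/*
 * Two increments built from a separate atomic load and atomic store,
 * ordered the way two racing threads could run them.
 */
static int lost_update_demo(void)
{
    atomic_int v;
    int a, b;

    atomic_init(&v, 0);
    a = atomic_load(&v);      /* "thread A" reads 0 */
    b = atomic_load(&v);      /* "thread B" reads 0 */
    atomic_store(&v, a + 1);  /* A writes 1 */
    atomic_store(&v, b + 1);  /* B writes 1, so A's update is lost */
    return atomic_load(&v);
}

/* The same two increments as indivisible read-modify-write operations. */
static int rmw_demo(void)
{
    atomic_int v;

    atomic_init(&v, 0);
    atomic_fetch_add(&v, 1);
    atomic_fetch_add(&v, 1);
    return atomic_load(&v);
}
```

Each load and store in the first function is atomic on its own, yet one
increment still disappears. Only a real read-modify-write, or a lock around
the sequence, prevents that.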
Signed-off-by: Phil Carmody <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
It's pointless to try to kill current if select_bad_process() did not find
an eligible task to kill in mem_cgroup_out_of_memory() since it's
guaranteed that current is a member of the memcg that is oom and it is, by
definition, unkillable.
Signed-off-by: David Rientjes <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Some of the information is outdated, and I think the current document
doesn't work as "a guide for users". We need a summary of all of our
controls, at least.
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Randy Dunlap <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This patch adds support for moving charge of file pages, which include
normal file, tmpfs file and swaps of tmpfs file. It's enabled by setting
bit 1 of <target cgroup>/memory.move_charge_at_immigrate.
Unlike the case of anonymous pages, file pages(and swaps) in the range
mmapped by the task will be moved even if the task hasn't done page fault,
i.e. they might not be the task's "RSS", but other task's "RSS" that maps
the same file. And mapcount of the page is ignored(the page can be moved
even if page_mapcount(page) > 1). So, the conditions that a page/swap
must meet to be moved are that it must be in the range mmapped by the
target task and it must be charged to the old cgroup.
[[email protected]: coding-style fixes]
[[email protected]: fix warning]
Signed-off-by: Daisuke Nishimura <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Balbir Singh <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This patch cleans up move charge code by:
- define functions to handle pte for each type, and make
is_target_pte_for_mc() cleaner.
- instead of checking the MOVE_CHARGE_TYPE_ANON bit, define a function
that checks the bit.
Signed-off-by: Daisuke Nishimura <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Balbir Singh <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This adds a feature to disable the oom-killer for a memcg. If it is
disabled, of course, tasks under the memcg will stop.
But now, we have oom-notifier for memcg. And the world around memcg is
not under out-of-memory. memcg's out-of-memory just shows memcg hits
limit. Then, administrator or management daemon can recover the situation
by
- kill some process
- enlarge limit, add more swap.
- migrate some tasks
- remove file cache on tmpfs (difficult ?)
Unlike oom-killer, you can take enough information before killing tasks.
(by gcore, or, ps etc.)
[[email protected]: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Considering containers or other resource management software in userland,
event notification of OOM in memcg should be implemented. Now, memcg has
"threshold" notifier which uses eventfd, we can make use of it for oom
notification.
This patch adds oom notification eventfd callback for memcg. The usage is
very similar to threshold notifier, but control file is memory.oom_control
and no arguments other than eventfd are required.
% cgroup_event_notifier /cgroup/A/memory.oom_control dummy
(About cgroup_event_notifier, see Documentation/cgroup/)
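What such a notifier does can be sketched as follows, assuming the
registration format used by the threshold notifier: the string
"<event_fd> <control_fd>" written to cgroup.event_control.
register_oom_notifier() is a hypothetical helper, and the caller is assumed
to have opened cgroup.event_control for writing and memory.oom_control for
reading.

```c
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Create an eventfd and register it for OOM events.
 * event_control_fd: open descriptor for <cgroup>/cgroup.event_control
 * oom_control_fd:   open descriptor for <cgroup>/memory.oom_control
 * Returns the eventfd on success, -1 on failure.
 */
static int register_oom_notifier(int event_control_fd, int oom_control_fd)
{
    char line[64];
    int efd, n;

    efd = eventfd(0, 0);
    if (efd < 0)
        return -1;
    /* No extra arguments: just the two descriptors. */
    n = snprintf(line, sizeof(line), "%d %d", efd, oom_control_fd);
    if (n < 0 || (size_t)n >= sizeof(line) ||
        write(event_control_fd, line, (size_t)n) != n) {
        close(efd);
        return -1;
    }
    return efd;
}
```

A management daemon would then block in read() on the returned descriptor,
reading an 8-byte counter, and wake up when the memcg hits OOM.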
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Davide Libenzi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
memcg's oom waitqueue is a system-wide wait_queue (for handling
hierarchy.) So, it's better to add custom wake function and do filtering
in wake up path.
This patch adds a filtering feature for waking up oom-waiters. Hierarchy
is properly handled.
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Daisuke Nishimura <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Signed-off-by: Trevor Woerner <[email protected]>
Cc: Paul Menage <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
- Add additional location (Git) for the kernel master tree
- Add reference to Git Project
Signed-off-by: Abraham Arce <[email protected]>
Acked-by: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
I recently had to recover some files from an old broken machine that was
running BorderWare Document Gateway. It's basically a drop in web server
for sharing files. From the look of the init process and using strings on
a few files it seems to be based on FreeBSD 3.3.
The process turned out to be more difficult than I imagined, but to cut a
long story short BorderWare in their wisdom use a nonstandard magic number
in their UFS (ufstype=44bsd) file systems. Thus Linux refuses to mount
the file systems in order to recover the data. After a bit of hunting I
was able to make a quick fix to fs/ufs/super.c in order to detect the new
magic number.
I assume that this number is the same for all installations. It's quite
easy to find out from ufs_fs.h. The superblock sits 8k into the block
device and the magic number is 1372 bytes into the superblock struct.
# dd if=/dev/sda5 skip=$(( 8192 + 1372 )) bs=1 count=4 2> /dev/null | hd
00000000 97 26 24 0f |.&$.|
#
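The same lookup in C, as a sketch: read_ufs_magic() is a hypothetical helper
working on an in-memory image, and it assumes the on-disk value is
little-endian, in which case the four bytes shown by dd decode to 0x0f242697.

```c
#include <stddef.h>
#include <stdint.h>

#define SBLOCK_OFFSET 8192  /* superblock sits 8k into the block device */
#define MAGIC_OFFSET  1372  /* magic number offset inside the superblock */

/*
 * Read the 32-bit magic number from a device image, assuming a
 * little-endian on-disk layout. Returns 0 if the image is too short.
 */
static uint32_t read_ufs_magic(const unsigned char *dev, size_t len)
{
    size_t off = SBLOCK_OFFSET + MAGIC_OFFSET;

    if (len < off + 4)
        return 0;
    return (uint32_t)dev[off] |
           ((uint32_t)dev[off + 1] << 8) |
           ((uint32_t)dev[off + 2] << 16) |
           ((uint32_t)dev[off + 3] << 24);
}
```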
Signed-off-by: Thomas Stewart <[email protected]>
Cc: Evgeniy Dushistov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use memdup_user when user data is immediately copied into the
allocated region.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression from,to,size,flag;
position p;
identifier l1,l2;
@@
- to = \(kmalloc@p\|kzalloc@p\)(size,flag);
+ to = memdup_user(from,size);
if (
- to==NULL
+ IS_ERR(to)
|| ...) {
<+... when != goto l1;
- -ENOMEM
+ PTR_ERR(to)
...+>
}
- if (copy_from_user(to, from, size) != 0) {
- <+... when != goto l2;
- -EFAULT
- ...+>
- }
// </smpl>
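The resulting caller pattern can be tried out in userspace with mock helpers.
Everything prefixed model_ is a hypothetical stand-in written for this
sketch: the error-pointer helpers imitate ERR_PTR()/IS_ERR()/PTR_ERR(), and a
NULL source pointer simulates a faulting user address.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Userspace stand-ins for the kernel's error-pointer helpers. */
static void *model_err_ptr(long err) { return (void *)(intptr_t)err; }
static int model_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-4095;
}
static long model_ptr_err(const void *p) { return (long)(intptr_t)p; }

/* Stand-in for memdup_user(): allocate and copy in one step. */
static void *model_memdup_user(const void *from, size_t size)
{
    void *to = malloc(size);

    if (!to)
        return model_err_ptr(-ENOMEM);
    if (!from) {              /* simulated copy_from_user() failure */
        free(to);
        return model_err_ptr(-EFAULT);
    }
    memcpy(to, from, size);
    return to;
}

/* Caller after the conversion: one call and one error check. */
static long model_ioctl_copy(const void *user_buf, size_t size,
                             char *first_byte)
{
    char *buf = model_memdup_user(user_buf, size);

    if (model_is_err(buf))
        return model_ptr_err(buf);  /* -ENOMEM or -EFAULT */
    *first_byte = buf[0];
    free(buf);
    return 0;
}
```

The caller no longer has to free the buffer on the copy-failure path, which
is where open-coded versions tend to leak.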
Signed-off-by: Julia Lawall <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The current backlight code is stubbed out, so the new props changes added
some warnings:
drivers/video/bf54x-lq043fb.c: In function 'bfin_bf54x_probe':
drivers/video/bf54x-lq043fb.c:666: warning: label 'out9' defined but not used
drivers/video/bf54x-lq043fb.c:504: warning: unused variable 'props'
Fix em !
Signed-off-by: Mike Frysinger <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The current backlight code is stubbed out, so the new props changes added
some warnings about unused label/prop.
Signed-off-by: Mike Frysinger <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use memdup_user when user data is immediately copied into the
allocated region.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression from,to,size,flag;
position p;
identifier l1,l2;
@@
- to = \(kmalloc@p\|kzalloc@p\)(size,flag);
+ to = memdup_user(from,size);
if (
- to==NULL
+ IS_ERR(to)
|| ...) {
<+... when != goto l1;
- -ENOMEM
+ PTR_ERR(to)
...+>
}
- if (copy_from_user(to, from, size) != 0) {
- <+... when != goto l2;
- -EFAULT
- ...+>
- }
// </smpl>
Signed-off-by: Julia Lawall <[email protected]>
Cc: Joseph Chan <[email protected]>
Cc: Scott Fang <[email protected]>
Cc: Florian Tobias Schandinat <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add support for S3 Trio3D/1X (86C360) and S3 Trio3D/2X (86C362 and 86C368)
cards to s3fb driver. Tested with 86C362 AGP and 86C368 PCI&AGP.
[[email protected]: coding-style fixes]
Signed-off-by: Ondrej Zary <[email protected]>
Acked-by: Ondrej Zajicek <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This eliminates the following build warning:
drivers/gpio/it8761e_gpio.c: In function `it8761e_gpio_exit':
drivers/gpio/it8761e_gpio.c:220: warning: ignoring return value of `gpiochip_remove', declared with attribute warn_unused_result
Signed-off-by: Daniel Mack <[email protected]>
Cc: Denis Turischev <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The Intel Penwell chip has two 96-pin GPIO blocks, which are very similar to
the Intel Langwell chip GPIO block, except for the difference in pin count. This
patch extends the original Langwell GPIO driver to support Penwell's.
Signed-off-by: Alek Du <[email protected]>
Cc: David Brownell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Nobody uses that anymore, so remove and expect drivers to use the gpiolib
implementation.
Signed-off-by: Felipe Balbi <[email protected]>
Cc: Tony Lindgren <[email protected]>
Cc: David Brownell <[email protected]>
Cc: Mark Brown <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Stop using the omap-specific implementations for gpio debouncing now that
gpiolib provides its own support.
Signed-off-by: Felipe Balbi <[email protected]>
Cc: Tony Lindgren <[email protected]>
Cc: David Brownell <[email protected]>
Cc: Mark Brown <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
OMAP supports debouncing of gpio lines, so implement the method using
gpiolib.
Signed-off-by: Felipe Balbi <[email protected]>
Cc: Tony Lindgren <[email protected]>
Cc: David Brownell <[email protected]>
Cc: Mark Brown <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
A few architectures, like OMAP, allow you to set a debouncing time for the
gpio before generating the IRQ. Teach gpiolib about that.
Mark said:
: This would be generally useful for embedded systems, especially where
: the interrupt concerned is a wake source. It allows drivers to avoid
: spurious interrupts from noisy sources so if the hardware supports it
: the driver can avoid having to explicitly wait for the signal to become
: stable and software has to cope with fewer events. We've lived without
: it for quite some time, though.
David said:
: I looked at adding debounce support to the generic GPIO calls (and thus
: gpiolib) some time back, but decided against it. I forget why at this
: time (check list archives) but it wasn't because of lack of utility in
: certain contexts.
:
: One thing to watch out for is just how variable the hardware capabilities
: are. Atmel GPIOs have something like a fixed number of 32K clock cycles
: for debounce, twl4030 had something odd, OMAPs were more like the Atmel
: chips but with a different clock. In some cases debouncing had to be
: ganged, not per-GPIO. And so forth.
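The general idea, an optional per-controller hook that a generic call dispatches to, can be sketched as a small userspace model. This is not the gpiolib code: struct model_gpio_chip, model_gpio_set_debounce() and model_record_debounce() are hypothetical names, the debounce value is treated as an opaque number, and returning -ENOSYS for hardware without debounce support is a choice made for this sketch only.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified description of one GPIO controller. */
struct model_gpio_chip {
    unsigned base;           /* first GPIO number served by this chip */
    unsigned ngpio;          /* number of GPIO lines on this chip */
    /* Optional hook; NULL when the hardware cannot debounce. */
    int (*set_debounce)(struct model_gpio_chip *chip,
                        unsigned offset, unsigned debounce);
    unsigned last_debounce;  /* value recorded by the example hook */
};

/*
 * Generic entry point: validate the GPIO number, then hand the request
 * to the controller's hook if it provides one.
 */
static int model_gpio_set_debounce(struct model_gpio_chip *chip,
                                   unsigned gpio, unsigned debounce)
{
    if (!chip || gpio < chip->base || gpio - chip->base >= chip->ngpio)
        return -EINVAL;
    if (!chip->set_debounce)
        return -ENOSYS;  /* sketch-only choice for "not supported" */
    return chip->set_debounce(chip, gpio - chip->base, debounce);
}

/* Example hook: a real driver would program hardware; this one records. */
static int model_record_debounce(struct model_gpio_chip *chip,
                                 unsigned offset, unsigned debounce)
{
    (void)offset;  /* a ganged controller might ignore the line number */
    chip->last_debounce = debounce;
    return 0;
}
```

Keeping the hook optional is one way to accommodate the hardware variability David describes: each controller decides how, or whether, to honour the requested value.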
Signed-off-by: Felipe Balbi <[email protected]>
Cc: Tony Lindgren <[email protected]>
Cc: David Brownell <[email protected]>
Reviewed-by: Mark Brown <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The current message, 'not registered', is confusing as it implies it was
not registered with something, whereas printing 'failed to register'
implies it was the gpiochip_add() call that did not work correctly.
Signed-off-by: Ben Dooks <[email protected]>
Cc: David Brownell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix a bug I noticed while hacking on the max732x driver for interrupt
support. According to the datasheets, open-drain pins have to be
configured as output-high (which in that case is actually high impedance)
to be used as input.
Signed-off-by: Marc Zyngier <[email protected]>
Acked-by: Eric Miao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Set up both client_group_a and client_group_b if nr_port > 8 (not including
nr_port==8).
Signed-off-by: Axel Lin <[email protected]>
Cc: Eric Miao <[email protected]>
Cc: Ben Dooks <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|