| Age | Commit message (Collapse) | Author | Files | Lines |
|
__exit_signal() does flush_sigqueue(tsk->pending) outside of ->siglock.
This can race with another thread doing sigqueue_free(), we can free the
same SIGQUEUE_PREALLOC sigqueue twice or corrupt the pending->list.
Note that even sys_exit_group() can trigger this race, not only
sys_timer_delete().
Move the callsite of flush_sigqueue(tsk->pending) under ->siglock.
This patch doesn't touch flush_sigqueue(->shared_pending) below, it is
called when there are no other threads which can play with signals, and
sigqueue_free() can't be used outside of our thread group.
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Roland McGrath <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Initial splitoff of the low-level stuff; taken to fdtable.h
Signed-off-by: Al Viro <[email protected]>
|
|
Use change_pid() instead of detach_pid() + attach_pid() in
__set_special_pids().
This way task_session() is not NULL in between.
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: Roland McGrath <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add another trivial helper for the sake of grep. It also auto-documents the
fact that ->parent != real_parent implies ->ptrace.
No functional changes.
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Roland McGrath <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add a couple of small comments, it is not easy to see what this code does.
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Roland McGrath <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Trivial, use same_thread_group() in reparent_thread().
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Roland McGrath <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
exit.c has numerous "->exit_signal == -1" comparisons, this check is subtle
and deserves a helper. Imho makes the code more parseable for humans. At
least it's surely more greppable.
Also, a couple of whitespace cleanups. No functional changes.
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Roland McGrath <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
do_group_exit() checks SIGNAL_GROUP_EXIT to avoid taking sighand->siglock.
Since ed5d2cac114202fe2978a9cbcab8f5032796d538 exec() doesn't set this
flag, we should use signal_group_exit().
This is not needed for correctness, but can speedup the multithreaded exec
and makes the code more consistent.
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Roland McGrath <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Remove the mem_cgroup member from mm_struct and instead adds an owner.
This approach was suggested by Paul Menage. The advantage of this approach
is that, once the mm->owner is known, using the subsystem id, the cgroup
can be determined. It also allows several control groups that are
virtually grouped by mm_struct, to exist independent of the memory
controller i.e., without adding mem_cgroup's for each controller, to
mm_struct.
A new config option CONFIG_MM_OWNER is added and the memory resource
controller selects this config option.
This patch also adds cgroup callbacks to notify subsystems when mm->owner
changes. The mm_cgroup_changed callback is called with the task_lock() of
the new task held and is called just prior to changing the mm->owner.
I am indebted to Paul Menage for the several reviews of this patchset and
helping me make it lighter and simpler.
This patch was tested on a powerpc box, it was compiled with both the
MM_OWNER config turned on and off.
After the thread group leader exits, it's moved to init_css_state by
cgroup_exit(), thus all future charges from runnings threads would be
redirected to the init_css_set's subsystem.
Signed-off-by: Balbir Singh <[email protected]>
Cc: Pavel Emelianov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Sudhir Kumar <[email protected]>
Cc: YAMAMOTO Takashi <[email protected]>
Cc: Hirokazu Takahashi <[email protected]>
Cc: David Rientjes <[email protected]>,
Cc: Balbir Singh <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Pekka Enberg <[email protected]>
Reviewed-by: Paul Menage <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This is a change that was requested some time ago by Mel Gorman. Makes sense
to me, so here it is.
Note: I retain the name "mpol_free_shared_policy()" because it actually does
free the shared_policy, which is NOT a reference counted object. However, ...
The mempolicy object[s] referenced by the shared_policy are reference counted,
so mpol_put() is used to release the reference held by the shared_policy. The
mempolicy might not be freed at this time, because some task attached to the
shared object associated with the shared policy may be in the process of
allocating a page based on the mempolicy. In that case, the task performing
the allocation will hold a reference on the mempolicy, obtained via
mpol_shared_policy_lookup(). The mempolicy will be freed when all tasks
holding such a reference have called mpol_put() for the mempolicy.
Signed-off-by: Lee Schermerhorn <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
* let unshare_files() give caller the displaced files_struct
* don't bother with grabbing reference only to drop it in the
caller if it hadn't been shared in the first place
* in that form unshare_files() is trivially implemented via
unshare_fd(), so we eliminate the duplicate logics in fork.c
* reset_files_struct() is not just only called for current;
it will break the system if somebody ever calls it for anything
else (we can't modify ->files of somebody else). Lose the
task_struct * argument.
Signed-off-by: Al Viro <[email protected]>
|
|
* unshare_files() can fail; doing it after irreversible actions is wrong
and de_thread() is certainly irreversible.
* since we do it unconditionally anyway, we might as well do it in do_execve()
and save ourselves the PITA in binfmt handlers, etc.
* while we are at it, binfmt_som actually leaked files_struct on failure.
As a side benefit, unshare_files(), put_files_struct() and reset_files_struct()
become unexported.
Signed-off-by: Al Viro <[email protected]>
|
|
The only reason to have separated __...() for those was to keep them inlined
for local users in exit.c. Since Alexey removed the inline on those, there's
no reason whatsoever to keep them around; just collapse with normal variants.
Signed-off-by: Al Viro <[email protected]>
|
|
The prevent_tail_call() macro works around the problem of the compiler
clobbering argument words on the stack, which for asmlinkage functions
is the caller's (user's) struct pt_regs. The tail/sibling-call
optimization is not the only way that the compiler can decide to use
stack argument words as scratch space, which we have to prevent.
Other optimizations can do it too.
Until we have new compiler support to make "asmlinkage" binding on the
compiler's own use of the stack argument frame, we have work around all
the manifestations of this issue that crop up.
More cases seem to be prevented by also keeping the incoming argument
variables live at the end of the function. This makes their original
stack slots attractive places to leave those variables, so the compiler
tends not clobber them for something else. It's still no guarantee, but
it handles some observed cases that prevent_tail_call() did not.
Signed-off-by: Roland McGrath <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In commit ee7c82da830ea860b1f9274f1f0cdf99f206e7c2 ("wait_task_stopped:
simplify and fix races with SIGCONT/SIGKILL/untrace"), the magic (short)
cast when storing si_code was lost in wait_task_stopped. This leaks the
in-kernel CLD_* values that do not match what userland expects.
Signed-off-by: Roland McGrath <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
1. exit_notify() always calls kill_orphaned_pgrp(). This is wrong, we
should do this only when the whole process exits.
2. exit_notify() uses "current" as "ignored_task", obviously wrong.
Use ->group_leader instead.
Test case:
void hup(int sig)
{
printf("HUP received\n");
}
void *tfunc(void *arg)
{
sleep(2);
printf("sub-thread exited\n");
return NULL;
}
int main(int argc, char *argv[])
{
if (!fork()) {
signal(SIGHUP, hup);
kill(getpid(), SIGSTOP);
exit(0);
}
pthread_t thr;
pthread_create(&thr, NULL, tfunc, NULL);
sleep(1);
printf("main thread exited\n");
syscall(__NR_exit, 0);
return 0;
}
output:
main thread exited
HUP received
Hangup
With this patch the output is:
main thread exited
sub-thread exited
HUP received
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
p->exit_state != 0 doesn't mean this process is dead, it may have
sub-threads. Change the code to use "p->exit_state && thread_group_empty(p)"
instead.
Without this patch, ^Z doesn't deliver SIGTSTP to the foreground process
if the main thread has exited.
However, the new check is not perfect either. There is a window when
exit_notify() drops tasklist and before release_task(). Suppose that
the last (non-leader) thread exits. This means that entire group exits,
but thread_group_empty() is not true yet.
As Eric pointed out, is_global_init() is wrong as well, but I did not
dare to do other changes.
Just for the record, has_stopped_jobs() is absolutely wrong too. But we
can't fix it now, we should first fix SIGNAL_STOP_STOPPED issues.
Even with this patch ^Z doesn't play well with the dead main thread.
The task is stopped correctly but do_wait(WSTOPPED) won't see it. This
is another unrelated issue, will be (hopefully) fixed separately.
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Factor out the common code in reparent_thread() and exit_notify().
No functional changes.
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
* Use struct path in fs_struct.
Signed-off-by: Andreas Gruenbacher <[email protected]>
Signed-off-by: Jan Blunck <[email protected]>
Acked-by: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
[[email protected]: coding-style fixes]
Signed-off-by: Harvey Harrison <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Some time ago the xxx_vnr() calls (e.g. pid_vnr or find_task_by_vpid) were
_all_ converted to operate on the current pid namespace. After this each call
like xxx_nr_ns(foo, current->nsproxy->pid_ns) is nothing but a xxx_vnr(foo)
one.
Switch all the xxx_nr_ns() callers to use the xxx_vnr() calls where
appropriate.
Signed-off-by: Pavel Emelyanov <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Balbir Singh <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This modifies do_wait and eligible child to take a pair of enum pid_type
and struct pid *pid to precisely specify what set of processes are eligible
to be waited for, instead of the raw pid_t value from sys_wait4.
This fixes a bug in sys_waitid where you could not wait for children in
just process group 1.
This fixes a pid namespace crossing case in eligible_child. Allowing us to
wait for a processes in our current process group even if our current
process group == 0.
This allows the no child with this pid case to be optimized. This allows
us to optimize the pid membership test in eligible child to be optimized.
This even closes a theoretical pid wraparound race where in a threaded
parent if two threads are waiting for the same child and one thread picks
up the child and the pid numbers wrap around and generate another child
with that same pid before the other thread is scheduled (teribly insanely
unlikely) we could end up waiting on the second child with the same pid#
and not discover that the specific child we were waiting for has exited.
[[email protected]: coding-style fixes]
Signed-off-by: Eric W. Biederman <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The previous bugfix was not optimal, we shouldn't care about group stop
when we are the only thread or the group stop is in progress. In that case
nothing special is needed, just set PF_EXITING and return.
Also, take the related "TIF_SIGPENDING re-targeting" code from exit_notify().
So, from the performance POV the only difference is that we don't trust
!signal_pending() until we take ->siglock. But this in fact fixes another
___pure___ theoretical minor race. __group_complete_signal() finds the
task without PF_EXITING and chooses it as the target for signal_wake_up().
But nothing prevents this task from exiting in between without noticing the
pending signal and thus unpredictably delaying the actual delivery.
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Davide Libenzi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Roland McGrath <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
do_signal_stop() counts all sub-thread and sets ->group_stop_count
accordingly. Every thread should decrement ->group_stop_count and stop,
the last one should notify the parent.
However a sub-thread can exit before it notices the signal_pending(), or it
may be somewhere in do_exit() already. In that case the group stop never
finishes properly.
Note: this is a minimal fix, we can add some optimizations later. Say we
can return quickly if thread_group_empty(). Also, we can move some signal
related code from exit_notify() to exit_signals().
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Davide Libenzi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Roland McGrath <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Daemonized kernel threads run in the init's session. This doesn't match the
behaviour of kthread_create()'ed threads, and this is one of the 2 reasons
why we need a special hack in sys_setsid().
Now that set_special_pids() was changed to use struct pid, not pid_t, we can
use init_struct_pid and set 0,0 special pids.
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: "Eric W. Biederman" <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Change set_special_pids() to work with struct pid, not pid_t from global name
space. This again speedups and imho cleanups the code, also a preparation for
the next patch.
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: "Eric W. Biederman" <[email protected]>
Acked-by: Pavel Emelyanov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The first "p->exit_state != EXIT_ZOMBIE" check doesn't make too much sense.
The exit_state was EXIT_ZOMBIE when the function was called, and another
thread can change it to EXIT_DEAD right after the check.
The second condition is not possible, detached non-traced threads were already
filtered out by eligible_child(), we didn't drop tasklist since then.
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Roland McGrath <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Surprise, the other two wait_task_*() functions also abuse the
task_pid_nr_ns() function, and may cause read-after-free or report nr == 0
in wait_task_continued(). wait_task_zombie() doesn't have this problem,
but it is still better to cache pid_t rather than call task_pid_nr_ns()
three times on the saved pid_namespace.
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Roland McGrath <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Imho, the current usage of security_task_wait() is not logical.
Suppose we have the single child p, and security_task_wait(p) return
-EANY. In that case waitpid(-1) returns this error. Why? Isn't it
better to return ECHLD? We don't really have reapable children.
Now suppose that child was stolen by gdb. In that case we find this
child on ->ptrace_children and set flag = 1, but we don't check that the
child was denied. So, do_wait(..., WNOHANG) returns 0, this doesn't
match the behaviour above. Without WNOHANG do_wait() blocks only to
return the error later, when the child will be untraced. Inho, really
strange.
I think eligible_child() should return the error only if the child's pid
was requested explicitly, otherwise we should silently ignore the tasks
which were nacked by security_task_wait().
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Roland McGrath <[email protected]>
Cc: Chris Wright <[email protected]>
Cc: Eric Paris <[email protected]>
Cc: James Morris <[email protected]>
Cc: Stephen Smalley <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
eligible_child() == 2 means delay_group_leader(). With the previous patch
this only matters for EXIT_ZOMBIE task, we can move that special check to
the only place it is really needed.
Also, with this patch we don't skip security_task_wait() for the group
leaders in a non-empty thread group. I don't really understand the exact
semantics of security_task_wait(), but imho this change is a bugfix.
Also rearrange the code a bit to kill an ugly "check_continued" backdoor.
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Eric Paris <[email protected]>
Cc: James Morris <[email protected]>
Cc: Roland McGrath <[email protected]>
Cc: Stephen Smalley <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
wait_task_stopped() doesn't need the "delay_group_leader" parameter. If
the child is not traced it must be a group leader. With or without
subthreads ->group_stop_count == 0 when the whole task is stopped.
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Mika Penttila <[email protected]>
Acked-by: Roland McGrath <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Every branch if the main "if" statement does the same code at the end. Move
it down. Also, fix the indentation.
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Roland McGrath <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
wait_task_stopped() has multiple races with SIGCONT/SIGKILL. tasklist_lock
does not pin the child in TASK_TRACED/TASK_STOPPED stated, almost all info
reported (including exit_code) may be wrong.
In fact, the code under write_lock_irq(tasklist_lock) is not safe. The child
may be PTRACE_DETACH'ed at this time by another subthread, in that case it is
possible we are no longer its ->parent.
Change wait_task_stopped() to take ->siglock before inspecting the task. This
guarantees that the child can't resume and (for example) clear its
->exit_code, so we don't need to use xchg(&p->exit_code) and re-check. The
only exception is ptrace_stop() which changes ->state and ->exit_code without
->siglock held during abort. But this can only happen if both the tracer and
the tracee are dying (coredump is in progress), we don't care.
With this patch wait_task_stopped() doesn't move the child to the end of
the ->parent list on success. This optimization could be restored, but
in that case we have to take write_lock(tasklist) and do some nasty
checks.
Also change the do_wait() since we don't return EAGAIN any longer.
[[email protected]: fix up after Willy renamed everything]
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Roland McGrath <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Now that my_ptrace_child() is trivial we can use the "p->ptrace & PT_PTRACED"
inline and simplify the corresponding logic in do_wait: we can't find the
child in TASK_TRACED state without PT_PTRACED flag set, ptrace_untrace()
either sets TASK_STOPPED or wakes up the tracee.
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Roland McGrath <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since the patch
"Fix ptrace_attach()/ptrace_traceme()/de_thread() race"
commit f5b40e363ad6041a96e3da32281d8faa191597b9
we set PT_ATTACHED and change child->parent "atomically" wrt task_list lock.
This means we can remove the checks like "PT_ATTACHED && ->parent != ptracer"
which were needed to catch the "ptrace attach is in progress" case. We can
also remove the flag itself since nobody else uses it.
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Roland McGrath <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Minor cleanup. We can remove one "else if" branch.
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Roland McGrath <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
As Roland pointed out, we have the very old problem with exec. de_thread()
sets SIGNAL_GROUP_EXIT, kills other threads, changes ->group_leader and then
clears signal->flags. All signals (even fatal ones) sent in this window
(which is not too small) will be lost.
With this patch exec doesn't abuse SIGNAL_GROUP_EXIT. signal_group_exit(),
the new helper, should be used to detect exit_group() or exec() in progress.
It can have more users, but this patch does only strictly necessary changes.
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Davide Libenzi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Robin Holt <[email protected]>
Cc: Roland McGrath <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The file exit.c contains one useless extern declaration of sem_exit().
Moreover, it refers to nothing.
This trivial patch removes it.
Signed-off-by: Pierre Peiffer <[email protected]>
Signed-off-by: Adrian Bunk <[email protected]>
|
|
Also restructure the loop in do_wait()
Signed-off-by: Matthew Wilcox <[email protected]>
|
|
In wait_task_stopped() exit_code already contains the right value for the
si_status member of siginfo, and this is simply set in the non WNOWAIT
case.
If you call waitid() with a stopped or traced process, you'll get the signal
in siginfo.si_status as expected -- however if you call waitid(WNOWAIT) at the
same time, you'll get the signal << 8 | 0x7f
Pass it unchanged to wait_noreap_copyout(); we would only need to shift it
and add 0x7f if we were returning it in the user status field and that
isn't used for any function that permits WNOWAIT.
Signed-off-by: Scott James Remnant <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Roland McGrath <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
wait_task_stopped(WNOWAIT) does task_pid_nr_ns() without tasklist/rcu lock,
we can read an already freed memory. Use the cached pid_t value.
Signed-off-by: Oleg Nesterov <[email protected]>
Looks-good-to: Roland McGrath <[email protected]>
Acked-by: Pavel Emelyanov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The original meaning of the old test (p->state > TASK_STOPPED) was
"not dead", since it was before TASK_TRACED existed and before the
state/exit_state split. It was a wrong correction in commit
14bf01bb0599c89fc7f426d20353b76e12555308 to make this test for
TASK_TRACED instead. It should have been changed when TASK_TRACED
was introducted and again when exit_state was introduced.
Signed-off-by: Roland McGrath <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Kees Cook <[email protected]>
Acked-by: Scott James Remnant <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Save ~650 bytes here.
add/remove: 4/0 grow/shrink: 0/7 up/down: 430/-1088 (-658)
function old new delta
__copy_fs_struct - 202 +202
__put_fs_struct - 112 +112
__exit_fs - 58 +58
__exit_files - 58 +58
exit_files 58 2 -56
put_fs_struct 112 5 -107
exit_fs 161 2 -159
sys_unshare 774 590 -184
copy_process 4031 3840 -191
do_exit 1791 1597 -194
copy_fs_struct 202 5 -197
No difference in lmbench lat_proc tests on 2-way Opteron 246.
Smaaaal degradation on UP P4 (within errors).
[[email protected]: coding-style fixes]
Signed-off-by: Alexey Dobriyan <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The task_struct->pid member is going to be deprecated, so start
using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in
the kernel.
The first thing to start with is the pid, printed to dmesg - in
this case we may safely use task_pid_nr(). Besides, printks produce
more (much more) than a half of all the explicit pid usage.
[[email protected]: git-drm went and changed lots of stuff]
Signed-off-by: Pavel Emelyanov <[email protected]>
Cc: Dave Airlie <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The pgrp field is not used widely around the kernel so it is now marked as
deprecated with appropriate comment.
The initialization of INIT_SIGNALS is trimmed because
a) they are set to 0 automatically;
b) gcc cannot properly initialize two anonymous (the second one
is the one with the session) unions. In this particular case
to make it compile we'd have to add some field initialized
right before the .pgrp.
This is the same patch as the 1ec320afdc9552c92191d5f89fcd1ebe588334ca one
(from Cedric), but for the pgrp field.
Some progress report:
We have to deprecate the pid, tgid, session and pgrp fields on struct
task_struct and struct signal_struct. The session and pgrp are already
deprecated. The tgid value is close to being such - the worst known usage
in in fs/locks.c and audit code. The pid field deprecation is mainly
blocked by numerous printk-s around the kernel that print the tsk->pid to
log.
Signed-off-by: Pavel Emelyanov <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Sukadev Bhattiprolu <[email protected]>
Cc: Cedric Le Goater <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Herbert Poetzl <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This is the largest patch in the set. Make all (I hope) the places where
the pid is shown to or get from user operate on the virtual pids.
The idea is:
- all in-kernel data structures must store either struct pid itself
or the pid's global nr, obtained with pid_nr() call;
- when seeking the task from kernel code with the stored id one
should use find_task_by_pid() call that works with global pids;
- when showing pid's numerical value to the user the virtual one
should be used, but however when one shows task's pid outside this
task's namespace the global one is to be used;
- when getting the pid from userspace one need to consider this as
the virtual one and use appropriate task/pid-searching functions.
[[email protected]: build fix]
[[email protected]: nuther build fix]
[[email protected]: yet nuther build fix]
[[email protected]: remove unneeded casts]
Signed-off-by: Pavel Emelyanov <[email protected]>
Signed-off-by: Alexey Dobriyan <[email protected]>
Cc: Sukadev Bhattiprolu <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Terminate all processes in a namespace when the reaper of the namespace is
exiting. We do this by walking the pidmap of the namespace and sending
SIGKILL to all processes.
Signed-off-by: Sukadev Bhattiprolu <[email protected]>
Acked-by: Pavel Emelyanov <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Sukadev Bhattiprolu <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
proc trees
The first part is trivial - we just make the proc_flush_task() to operate on
arbitrary vfsmount with arbitrary ids and pass the pid and global proc_mnt to
it.
The other change is more tricky: I moved the proc_flush_task() call in
release_task() higher to address the following problem.
When flushing task from many proc trees we need to know the set of ids (not
just one pid) to find the dentries' names to flush. Thus we need to pass the
task's pid to proc_flush_task() as struct pid is the only object that can
provide all the pid numbers. But after __exit_signal() task has detached all
his pids and this information is lost.
This creates a tiny gap for proc_pid_lookup() to bring some dentries back to
tree and keep them in hash (since pids are still alive before __exit_signal())
till the next shrink, but since proc_flush_task() does not provide a 100%
guarantee that the dentries will be flushed, this is OK to do so.
Signed-off-by: Pavel Emelyanov <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Sukadev Bhattiprolu <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Make task release its namespaces after it has reparented all his children to
child_reaper, but before it notifies its parent about its death.
The reason to release namespaces after reparenting is that when task exits it
may send a signal to its parent (SIGCHLD), but if the parent has already
exited its namespaces there will be no way to decide what pid to dever to him
- parent can be from different namespace.
The reason to release namespace before notifying the parent it that when task
sends a SIGCHLD to parent it can call wait() on this taks and release it. But
releasing the mnt namespace implies dropping of all the mounts in the mnt
namespace and NFS expects the task to have valid sighand pointer.
Thanks to Oleg for pointing out some races that can apear and helping with
patches and fixes.
Signed-off-by: Pavel Emelyanov <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Sukadev Bhattiprolu <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
A pid namespace is a "view" of a particular set of tasks on the system. They
work in a similar way to filesystem namespaces. A file (or a process) can be
accessed in multiple namespaces, but it may have a different name in each. In
a filesystem, this name might be /etc/passwd in one namespace, but
/chroot/etc/passwd in another.
For processes, a process may have pid 1234 in one namespace, but be pid 1 in
another. This allows new pid namespaces to have basically arbitrary pids, and
not have to worry about what pids exist in other namespaces. This is
essential for checkpoint/restart where a restarted process's pid might collide
with an existing process on the system's pid.
In this particular implementation, pid namespaces have a parent-child
relationship, just like processes. A process in a pid namespace may see all
of the processes in the same namespace, as well as all of the processes in all
of the namespaces which are children of its namespace. Processes may not,
however, see others which are in their parent's namespace, but not in their
own. The same goes for sibling namespaces.
The know issue to be solved in the nearest future is signal handling in the
namespace boundary. That is, currently the namespace's init is treated like
an ordinary task that can be killed from within an namespace. Ideally, the
signal handling by the namespace's init should have two sides: when signaling
the init from its namespace, the init should look like a real init task, i.e.
receive only those signals, that is explicitly wants to; when signaling the
init from one of the parent namespaces, init should look like an ordinary
task, i.e. receive any signal, only taking the general permissions into
account.
The pid namespace was developed by Pavel Emlyanov and Sukadev Bhattiprolu and
we eventually came to almost the same implementation, which differed in some
details. This set is based on Pavel's patches, but it includes comments and
patches that from Sukadev.
Many thanks to Oleg, who reviewed the patches, pointed out many BUGs and made
valuable advises on how to make this set cleaner.
This patch:
We have to call exit_task_namespaces() only after the exiting task has
reparented all his children and is sure that no other threads will reparent
theirs for it. Why this is needed is explained in appropriate patch. This
one only reworks the forget_original_parent() so that after calling this a
task cannot be/become parent of any other task.
We check PF_EXITING instead of ->exit_state while choosing the new parent.
Note that tasklits_lock acts as a barrier, everyone who takes tasklist after
us (when forget_original_parent() drops it) must see PF_EXITING.
The other changes are just cleanups. They just move some code from
exit_notify to forget_original_parent(). It is a bit silly to declare
ptrace_dead in exit_notify(), take tasklist, pass ptrace_dead to
forget_original_parent(), unlock-lock-unlock tasklist, and then use
ptrace_dead.
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Pavel Emelyanov <[email protected]>
Cc: Sukadev Bhattiprolu <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|