| Age | Commit message (Collapse) | Author | Files | Lines |
|
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
In the err_file: fput(event_file) case, the event will not yet have
been attached to a context. However perf_release() does assume it has
been. Cure this.
Tested-by: Alexander Shishkin <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
In case of: err_file: fput(event_file), we'll end up calling
perf_release() which in turn will free the event.
Do not then free the event _again_.
Tested-by: Alexander Shishkin <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Consider the following scenario:
CPU0 CPU1
ctx = find_get_ctx();
perf_event_exit_task_context()
mutex_lock(&ctx->mutex);
perf_install_in_context(ctx, ...);
/* NO-OP */
mutex_unlock(&ctx->mutex);
...
perf_release()
WARN_ON_ONCE(event->state != STATE_EXIT);
Since the event doesn't pass through perf_remove_from_context()
because perf_install_in_context() NO-OPs because the ctx is dead, and
perf_event_exit_task_context() will not observe the event because its
not attached yet, the event->state will not be set.
Solve this by revalidating ctx->task after we acquire ctx->mutex and
failing the event creation as a whole.
Tested-by: Alexander Shishkin <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Conflicts:
drivers/net/phy/bcm7xxx.c
drivers/net/phy/marvell.c
drivers/net/vxlan.c
All three conflicts were cases of simple overlapping changes.
Signed-off-by: David S. Miller <[email protected]>
|
|
. avoid walking the stack when there is no room left in the buffer
. generalize get_perf_callchain() to be called from bpf helper
Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
No functional change, just less confusing to read.
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Vince Weaver <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
If CPU_UP_PREPARE is called it is not guaranteed, that a previously allocated
and assigned hash has been freed already, but perf_event_init_cpu()
unconditionally allocates and assignes a new hash if the swhash is referenced.
By overwriting the pointer the existing hash is not longer accessible.
Verify that there is no hash assigned on this cpu before allocating and
assigning a new one.
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Vince Weaver <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
If CPU_DOWN_PREPARE fails the perf hotplug notifier is called for
CPU_DOWN_FAILED and calls perf_event_init_cpu(), which checks whether the
swhash is referenced. If yes it allocates a new hash and stores the pointer in
the per cpu data structure.
But at this point the cpu is still online, so there must be a valid hash
already. By overwriting the pointer the existing hash is not longer
accessible.
Remove the CPU_DOWN_FAILED state, as there is nothing to (re)allocate.
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Vince Weaver <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
If CPU_UP_PREPARE fails the perf hotplug code calls perf_event_exit_cpu(),
which is a pointless exercise. The cpu is not online, so the smp function
calls return -ENXIO. So the result is a list walk to call noops.
Remove it.
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Vince Weaver <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
For protection keys, we need to understand whether protections
should be enforced in software or not. In general, we enforce
protections when working on our own task, but not when on others.
We call these "current" and "remote" operations.
This patch introduces a new get_user_pages() variant:
get_user_pages_remote()
Which is a replacement for when get_user_pages() is called on
non-current tsk/mm.
We also introduce a new gup flag: FOLL_REMOTE which can be used
for the "__" gup variants to get this new behavior.
The uprobes is_trap_at_addr() location holds mmap_sem and
calls get_user_pages(current->mm) on an instruction address. This
makes it a pretty unique gup caller. Being an instruction access
and also really originating from the kernel (vs. the app), I opted
to consider this a 'remote' access where protection keys will not
be enforced.
Without protection keys, this patch should not change any behavior.
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
"This is much bigger than typical fixes, but Peter found a category of
races that spurred more fixes and more debugging enhancements. Work
started before the merge window, but got finished only now.
Aside of that this contains the usual small fixes to perf and tools.
Nothing particular exciting"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
perf: Remove/simplify lockdep annotation
perf: Synchronously clean up child events
perf: Untangle 'owner' confusion
perf: Add flags argument to perf_remove_from_context()
perf: Clean up sync_child_event()
perf: Robustify event->owner usage and SMP ordering
perf: Fix STATE_EXIT usage
perf: Update locking order
perf: Remove __free_event()
perf/bpf: Convert perf_event_array to use struct file
perf: Fix NULL deref
perf/x86: De-obfuscate code
perf/x86: Fix uninitialized value usage
perf: Fix race in perf_event_exit_task_context()
perf: Fix orphan hole
perf stat: Do not clean event's private stats
perf hists: Fix HISTC_MEM_DCACHELINE width setting
perf annotate browser: Fix behaviour of Shift-Tab with nothing focussed
perf tests: Remove wrong semicolon in while loop in CQM test
perf: Synchronously free aux pages in case of allocation failure
...
|
|
Now that the perf_event_ctx_lock_nested() call has moved from
put_event() into perf_event_release_kernel() the first reason is no
longer valid as that can no longer happen.
The second reason seems to have been invalidated when Al Viro made fput()
unconditionally async in the following commit:
4a9d4b024a31 ("switch fput to task_work_add")
such that munmap()->fput()->release()->perf_release() would no longer happen.
Therefore, remove the annotation. This should increase the efficiency
of lockdep coverage of perf locking.
Suggested-by: Alexander Shishkin <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The orphan cleanup workqueue doesn't always catch orphans, for example,
if they never schedule after they are orphaned. IOW, the event leak is
still very real. It also wouldn't work for kernel counters.
Doing it synchonously is a little hairy due to lock inversion issues,
but is made to work.
Patch based on work by Alexander Shishkin.
Suggested-by: Alexander Shishkin <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
There are two concepts of owner wrt an event and they are conflated:
- event::owner / event::owner_list,
used by prctl(.option = PR_TASK_PERF_EVENTS_{EN,DIS}ABLE).
- the 'owner' of the event object, typically the file descriptor.
Currently these two concepts are conflated, which gives trouble with
scm_rights passing of file descriptors. Passing the event and then
closing the creating task would render the event 'orphan' and would
have it cleared out. Unlikely what is expectd.
This patch untangles these two concepts by using PERF_EVENT_STATE_EXIT
to denote the second type.
Reported-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
In preparation to adding more options, convert the boolean argument
into a flags word.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
sync_child_event() has outgrown its purpose, it does far too much.
Bring it back to its named purpose.
Rename __perf_event_exit_task() to perf_event_exit_event() to better
reflect what it does and move the event->state assignment under the
ctx->lock, like state changes ought to be.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Use smp_store_release() to clear event->owner and
lockless_dereference() to observe it. Further use READ_ONCE() for all
lockless reads.
This changes perf_remove_from_owner() to leave event->owner cleared.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
We should never attempt to enable a STATE_EXIT event.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Update the locking order to note that ctx::lock nests inside of
child_mutex, as per:
perf_ioctl(): ctx::mutex
-> perf_event_for_each(): event::child_mutex
-> _perf_event_enable(): ctx::lock
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
There is but a single caller, remove the function - we already have
_free_event(), the extra indirection is nonsensical..
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Robustify refcounting.
Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: Wang Nan <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Dan reported:
1229 if (ctx->task == TASK_TOMBSTONE ||
1230 !atomic_inc_not_zero(&ctx->refcount)) {
1231 raw_spin_unlock(&ctx->lock);
1232 ctx = NULL;
^^^^^^^^^^
ctx is NULL.
1233 }
1234
1235 WARN_ON_ONCE(ctx->task != task);
^^^^^^^^^^^^^^^^^
The patch adds a NULL dereference.
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Fixes: 63b6da39bb38 ("perf: Fix perf_event_exit_task() race")
Signed-off-by: Ingo Molnar <[email protected]>
|
|
There is a race between perf_event_exit_task_context() and
orphans_remove_work() which results in a use-after-free.
We mark ctx->task with TASK_TOMBSTONE to indicate a context is
'dead', under ctx->lock. After which point event_function_call()
on any event of that context will NOP
A concurrent orphans_remove_work() will only hold ctx->mutex for
the list iteration and not serialize against this. Therefore its
possible that orphans_remove_work()'s perf_remove_from_context()
call will fail, but we'll continue to free the event, with the
result of free'd memory still being on lists and everything.
Once perf_event_exit_task_context() gets around to acquiring
ctx->mutex it too will iterate the event list, encounter the
already free'd event and proceed to free it _again_. This fails
with the WARN in free_event().
Plug the race by having perf_event_exit_task_context() hold
ctx::mutex over the whole tear-down, thereby 'naturally'
serializing against all other sites, including the orphan work.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
We should set event->owner before we install the event,
otherwise there is a hole where the target task can fork() and
we'll not inherit the event because it thinks the event is
orphaned.
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).
Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.
Signed-off-by: Al Viro <[email protected]>
|
|
We are currently using asynchronous deallocation in the error path in
AUX mmap code, which is unnecessary and also presents a problem for users
that wish to probe for the biggest possible buffer size they can get:
they'll get -EINVAL on all subsequent attemts to allocate a smaller
buffer before the asynchronous deallocation callback frees up the pages
from the previous unsuccessful attempt.
Currently, gdb does that for allocating AUX buffers for Intel PT traces.
More specifically, overwrite mode of AUX pmus that don't support hardware
sg (some implementations of Intel PT, for instance) is limited to only
one contiguous high order allocation for its buffer and there is no way
of knowing its size without trying.
This patch changes error path freeing to be synchronous as there won't
be any contenders for the AUX pages at that point.
Reported-by: Markus Metzger <[email protected]>
Signed-off-by: Alexander Shishkin <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/1453216469-9509-1-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <[email protected]>
|
|
There is a race against perf_event_exit_task() vs
event_function_call(),find_get_context(),perf_install_in_context()
(iow, everyone).
Since there is no permanent marker on a context that its dead, it is
quite possible that we access (and even modify) a context after its
passed through perf_event_exit_task().
For instance, find_get_context() might find the context still
installed, but by the time we get to perf_install_in_context() it
might already have passed through perf_event_exit_task() and be
considered dead, we will however still add the event to it.
Solve this by marking a ctx dead by setting its ctx->task value to -1,
it must be !0 so we still know its a (former) task context.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Try to trigger warnings before races do damage.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
There is one common bug left in all the event_function_call() users,
between loading ctx->task and getting to the remote_function(),
ctx->task can already have been changed.
Therefore we need to double check and retry if ctx->task != current.
Insert another trampoline specific to event_function_call() that
checks for this and further validates state. This also allows getting
rid of the active/inactive functions.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The perf_remove_from_context() usage in __perf_event_exit_task() is
different from the other usages in that this site has already
detached and scheduled out the task context.
This will stand in the way of stronger assertions checking the (task)
context scheduling invariants.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
There is a very nasty problem wrt disabling the perf task scheduling
hooks.
Currently we {set,clear} ctx->is_active on every
__perf_event_task_sched_{in,out}, _however_ this means that if we
disable these calls we'll have task contexts with ->is_active set that
are not active and 'active' task contexts without ->is_active set.
This can result in event_function_call() looping on the ctx->is_active
condition basically indefinitely.
Resolve this by changing things such that contexts without events do
not set ->is_active like we used to. From this invariant it trivially
follows that if there are no (task) events, every task ctx is inactive
and disabling the context switch hooks is harmless.
This leaves two places that need attention (and already had
accumulated weird and wonderful hacks to work around, without
recognising this actual problem).
Namely:
- perf_install_in_context() will need to deal with installing events
in an inactive context, meaning it cannot rely on ctx-is_active for
its IPIs.
- perf_remove_from_context() will have to mark a context as inactive
when it removes the last event.
For specific detail, see the patch/comments.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
For no apparent reason and to great confusion the rules for
ctx->is_active and cpuctx->task_ctx are different. This means that its
not always possible to find all active (task) contexts.
Fix this such that if ctx->is_active gets set, we also set (or verify)
cpuctx->task_ctx.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
It doesn't make sense to take up-to _4_ references on
perf_sched_events() per event, avoid doing this.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Like perf_enable_on_exec(), perf_event_enable() event scheduling has problems
respecting the context hierarchy when trying to schedule events (for
example, it will try and add a pinned event without first removing
existing flexible events).
So simplify it by using the new ctx_resched() call which will DTRT.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
We have a function that does exactly what we want here, use it. This
reduces the amount of cpuctx->task_ctx muckery.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
There are two problems with the current perf_enable_on_exec() event
scheduling:
- the newly enabled events will be immediately scheduled
irrespective of their ctx event list order.
- there's a hole in the ctx->lock between scheduling the events
out and putting them back on.
Esp. the latter issue is a real problem because a hole in event
scheduling leaves the thing in an observable inconsistent state,
confusing things.
Fix both issues by first doing the enable iteration and at the end,
when there are newly enabled events, reschedule the ctx in one go.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The comment here is horribly out of date, remove it.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
There is a comment that states that perf_event_context_sched_in() will
also switch in the cgroup events, I cannot find it does so. Therefore
all the resulting logic goes out the window too.
Clean that up.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
There appears to be a problem in __perf_event_task_sched_in() wrt
cgroup event scheduling.
The normal event scheduling order is:
CPU pinned
Task pinned
CPU flexible
Task flexible
And since perf_cgroup_sched*() only schedules the cpu context, we must
call this _before_ adding the task events.
Note: double check what happens on the ctx switch optimization where
the task ctx isn't scheduled.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Make various bugs easier to see.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
By checking the effective credentials instead of the real UID / permitted
capabilities, ensure that the calling process actually intended to use its
credentials.
To ensure that all ptrace checks use the correct caller credentials (e.g.
in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
flag), use two new flags and require one of them to be set.
The problem was that when a privileged task had temporarily dropped its
privileges, e.g. by calling setreuid(0, user_uid), with the intent to
perform following syscalls with the credentials of a user, it still passed
ptrace access checks that the user would not be able to pass.
While an attacker should not be able to convince the privileged task to
perform a ptrace() syscall, this is a problem because the ptrace access
check is reused for things in procfs.
In particular, the following somewhat interesting procfs entries only rely
on ptrace access checks:
/proc/$pid/stat - uses the check for determining whether pointers
should be visible, useful for bypassing ASLR
/proc/$pid/maps - also useful for bypassing ASLR
/proc/$pid/cwd - useful for gaining access to restricted
directories that contain files with lax permissions, e.g. in
this scenario:
lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
drwx------ root root /root
drwxr-xr-x root root /root/foobar
-rw-r--r-- root root /root/foobar/secret
Therefore, on a system where a root-owned mode 6755 binary changes its
effective credentials as described and then dumps a user-specified file,
this could be used by an attacker to reveal the memory layout of root's
processes or reveal the contents of files he is not allowed to access
(through /proc/$pid/cwd).
[[email protected]: fix warning]
Signed-off-by: Jann Horn <[email protected]>
Acked-by: Kees Cook <[email protected]>
Cc: Casey Schaufler <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Morris <[email protected]>
Cc: "Serge E. Hallyn" <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Al Viro <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Willy Tarreau <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
As with rmap, with new refcounting we cannot rely on PageTransHuge() to
check if we need to charge size of huge page form the cgroup. We need
to get information from caller to know whether it was mapped with PMD or
PTE.
We do uncharge when last reference on the page gone. At that point if
we see PageTransHuge() it means we need to unchange whole huge page.
The tricky part is partial unmap -- when we try to unmap part of huge
page. We don't do a special handing of this situation, meaning we don't
uncharge the part of huge page unless last user is gone or
split_huge_page() is triggered. In case of cgroup memory pressure
happens the partial unmapped page will be split through shrinker. This
should be good enough.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Jerome Marchand <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Steve Capper <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We're going to allow mapping of individual 4k pages of THP compound
page. It means we cannot rely on PageTransHuge() check to decide if
map/unmap small page or THP.
The patch adds new argument to rmap functions to indicate whether we
want to operate on whole compound page or only the small page.
[[email protected]: fix mapcount mismatch in hugepage migration]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Tested-by: Aneesh Kumar K.V <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Jerome Marchand <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Steve Capper <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently looking at /proc/<pid>/status or statm, there is no way to
distinguish shmem pages from pages mapped to a regular file (shmem pages
are mapped to /dev/zero), even though their implication in actual memory
use is quite different.
The internal accounting currently counts shmem pages together with
regular files. As a preparation to extend the userspace interfaces,
this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for
shmem pages separately from MM_FILEPAGES. The next patch will expose it
to userspace - this patch doesn't change the exported values yet, by
adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was
used before. The only user-visible change after this patch is the OOM
killer message that separates the reported "shmem-rss" from "file-rss".
[[email protected]: forward-porting, tweak changelog]
Signed-off-by: Jerome Marchand <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
Acked-by: Konstantin Khlebnikov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This patch collapses the two 'hard' cases, which are
perf_event_{dis,en}able().
I cannot seem to convince myself the current code is correct.
So starting with perf_event_disable(); we don't strictly need to test
for event->state == ACTIVE, ctx->is_active is enough. If the event is
not scheduled while the ctx is, __perf_event_disable() still does the
right thing. Its a little less efficient to IPI in that case,
over-all simpler.
For perf_event_enable(); the same goes, but I think that's actually
broken in its current form. The current condition is: ctx->is_active
&& event->state == OFF, that means it doesn't do anything when
!ctx->active && event->state == OFF. This is wrong, it should still
mark the event INACTIVE in that case, otherwise we'll still not try
and schedule the event once the context becomes active again.
This patch implements the two function using the new
event_function_call() and does away with the tricky event->state
tests.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
new changes
Signed-off-by: Ingo Molnar <[email protected]>
|
|
There's a race on CPU unplug where we free the swevent hash array
while it can still have events on. This will result in a
use-after-free which is BAD.
Simply do not free the hash array on unplug. This leaves the thing
around and no use-after-free takes place.
When the last swevent dies, we do a for_each_possible_cpu() iteration
anyway to clean these up, at which time we'll free it, so no leakage
will occur.
Reported-by: Sasha Levin <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
I managed to tickle this warning:
[ 2338.884942] ------------[ cut here ]------------
[ 2338.890112] WARNING: CPU: 13 PID: 35162 at ../kernel/events/core.c:2702 task_ctx_sched_out+0x6b/0x80()
[ 2338.900504] Modules linked in:
[ 2338.903933] CPU: 13 PID: 35162 Comm: bash Not tainted 4.4.0-rc4-dirty #244
[ 2338.911610] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
[ 2338.923071] ffffffff81f1468e ffff8807c6457cb8 ffffffff815c680c 0000000000000000
[ 2338.931382] ffff8807c6457cf0 ffffffff810c8a56 ffffe8ffff8c1bd0 ffff8808132ed400
[ 2338.939678] 0000000000000286 ffff880813170380 ffff8808132ed400 ffff8807c6457d00
[ 2338.947987] Call Trace:
[ 2338.950726] [<ffffffff815c680c>] dump_stack+0x4e/0x82
[ 2338.956474] [<ffffffff810c8a56>] warn_slowpath_common+0x86/0xc0
[ 2338.963195] [<ffffffff810c8b4a>] warn_slowpath_null+0x1a/0x20
[ 2338.969720] [<ffffffff811a49cb>] task_ctx_sched_out+0x6b/0x80
[ 2338.976244] [<ffffffff811a62d2>] perf_event_exec+0xe2/0x180
[ 2338.982575] [<ffffffff8121fb6f>] setup_new_exec+0x6f/0x1b0
[ 2338.988810] [<ffffffff8126de83>] load_elf_binary+0x393/0x1660
[ 2338.995339] [<ffffffff811dc772>] ? get_user_pages+0x52/0x60
[ 2339.001669] [<ffffffff8121e297>] search_binary_handler+0x97/0x200
[ 2339.008581] [<ffffffff8121f8b3>] do_execveat_common.isra.33+0x543/0x6e0
[ 2339.016072] [<ffffffff8121fcea>] SyS_execve+0x3a/0x50
[ 2339.021819] [<ffffffff819fc165>] stub_execve+0x5/0x5
[ 2339.027469] [<ffffffff819fbeb2>] ? entry_SYSCALL_64_fastpath+0x12/0x71
[ 2339.034860] ---[ end trace ee1337c59a0ddeac ]---
Which is a WARN_ON_ONCE() indicating that cpuctx->task_ctx is not
what we expected it to be.
This is because context switches can swap the task_struct::perf_event_ctxp[]
pointer around. Therefore you have to either disable preemption when looking
at current, or hold ctx->lock.
Fix perf_event_enable_on_exec(), it loads current->perf_event_ctxp[]
before disabling interrupts, therefore a preemption in the right place
can swap contexts around and we're using the wrong one.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kostya Serebryany <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: syzkaller <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|