|
Rename variables named "obj" which represent object names so they're
consistently named "object_name".
Rename the "cls" and "method" parameters in rbd_req_sync_exec()
to be "class_name" and "method_name", and make similar changes
to the names of local variables in that function representing
the lengths of those names.
Signed-off-by: Alex Elder <[email protected]>
Reviewed-by: Josh Durgin <[email protected]>
|
|
An rbd image is not a single object, but a logical construct made up
of an aggregation of objects.
Rename some fields in struct rbd_dev, in hopes of reinforcing this.
obj --> image_name
obj_len --> image_name_len
obj_md_name --> header_name
Signed-off-by: Alex Elder <[email protected]>
Reviewed-by: Josh Durgin <[email protected]>
|
|
Most variables that represent a struct rbd_device are named
"rbd_dev", but in some cases "dev" is used instead. Change all the
"dev" references so they use "rbd_dev" consistently, to make it
clear from the name that we're working with an RBD device (as
opposed to, for example, a struct device). Similarly, change the
name of the "dev" field in struct rbd_notify_info to be "rbd_dev".
Signed-off-by: Alex Elder <[email protected]>
Reviewed-by: Josh Durgin <[email protected]>
|
|
There is no need to impose a small limit on the length of the snapshot
name recorded for an rbd image in a struct rbd_dev. Remove the
limitation by allocating space for the snapshot name dynamically.
Signed-off-by: Alex Elder <[email protected]>
Reviewed-by: Josh Durgin <[email protected]>
|
|
There is no need to impose a small limit on the length of the rbd image
name recorded in a struct rbd_dev. Remove the limitation by
allocating space for the image name dynamically.
Signed-off-by: Alex Elder <[email protected]>
Reviewed-by: Josh Durgin <[email protected]>
|
|
There is no need to impose a small limit on the length of the header
name recorded for an rbd image in a struct rbd_dev. Remove the
limitation by allocating space for the header name dynamically.
Signed-off-by: Alex Elder <[email protected]>
Reviewed-by: Josh Durgin <[email protected]>
|
|
There is no need to impose a small limit on the length of the object
prefix recorded for an rbd image in a struct rbd_image_header.
Remove the limitation by allocating space for the object prefix
dynamically.
Signed-off-by: Alex Elder <[email protected]>
Reviewed-by: Josh Durgin <[email protected]>
|
|
There is no need to impose a small limit on the length of the pool name
recorded for an rbd image in a struct rbd_device. Remove the
limitation by allocating space for the pool name dynamically.
Signed-off-by: Alex Elder <[email protected]>
Reviewed-by: Josh Durgin <[email protected]>
|
|
Add an entry under /sys/bus/rbd/devices/<N>/ named "pool_id" that
provides the id for the pool the rbd image is associated with. This
is in addition to the pool name already provided.
Rename the "poolid" field in struct rbd_device to be "pool_id".
Update the documentation to reflect the addition of this new entry.
Signed-off-by: Alex Elder <[email protected]>
Reviewed-by: Josh Durgin <[email protected]>
|
|
Each rbd image has a name that forms the basis of all data objects
backing the device. Old (format 1) images refer to this name as the
"block name," while new (format 2) images use the term "object
prefix" for this.
Change the field name in the in-core rbd image header structure to
reflect the more modern usage. We intentionally keep the name
"block_name" in the on-disk definition for format 1 image headers.
Signed-off-by: Alex Elder <[email protected]>
Reviewed-by: Yehuda Sadeh <[email protected]>
Reviewed-by: Josh Durgin <[email protected]>
|
|
Define a new function dup_token(), to be used during argument
parsing for making dynamically-allocated copies of tokens being
parsed.
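A minimal sketch of what such a helper could look like, assuming
whitespace-delimited tokens; the actual function in drivers/block/rbd.c may
differ in detail:
	static char *dup_token(const char **buf, size_t *lenp)
	{
		size_t len = strcspn(*buf, " \t");	/* length of the next token */
		char *dup = kmemdup(*buf, len + 1, GFP_KERNEL);

		if (!dup)
			return NULL;
		dup[len] = '\0';		/* NUL-terminate the copy */
		*buf += len;			/* advance past the consumed token */
		if (lenp)
			*lenp = len;
		return dup;
	}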
Signed-off-by: Alex Elder <[email protected]>
Reviewed-by: Yehuda Sadeh <[email protected]>
Reviewed-by: Josh Durgin <[email protected]>
|
|
This adds a new utility routine which will return a dynamically-
allocated buffer containing a string that has been decoded from the ceph
over-the-wire format. It also returns the length of the string
if the address of a size variable is supplied to receive it.
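A hedged sketch of how such a helper might look, assuming ceph's usual
length-prefixed string encoding (a 32-bit little-endian length followed by
the bytes); the name and error codes here are illustrative:
	static inline char *extract_encoded_string(void **p, void *end,
						   size_t *lenp, gfp_t gfp)
	{
		u32 len;
		char *s;

		if (*p + sizeof(u32) > end)
			return ERR_PTR(-ERANGE);	/* not even a length field left */
		len = ceph_decode_32(p);		/* consumes the length */
		if (*p + len > end)
			return ERR_PTR(-ERANGE);	/* string runs past the buffer */

		s = kmalloc(len + 1, gfp);
		if (!s)
			return ERR_PTR(-ENOMEM);
		memcpy(s, *p, len);
		s[len] = '\0';
		*p += len;
		if (lenp)
			*lenp = len;
		return s;
	}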
Signed-off-by: Alex Elder <[email protected]>
Reviewed-by: Sage Weil <[email protected]>
|
|
In rbd_req_sync_notify_ack(), a local variable was needlessly being
used to hold a null pointer. Just pass NULL instead.
Signed-off-by: Alex Elder <[email protected]>
Reviewed-by: Yehuda Sadeh <[email protected]>
Reviewed-by: Josh Durgin <[email protected]>
|
|
There is a BUG_ON() call that doesn't account for the single byte
structure version at the start of an encoded filepath in
ceph_encode_filepath(). Fix that.
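An illustrative sketch of the corrected bound check (the extra 1 accounts for
the one-byte encoding version); the exact expression in the source may differ:
	void ceph_encode_filepath(void **p, void *end, u64 ino, const char *path)
	{
		u32 len = path ? strlen(path) : 0;

		/* 1 byte struct version + ino + string length + string bytes */
		BUG_ON(*p + 1 + sizeof(ino) + sizeof(len) + len > end);
		ceph_encode_8(p, 1);		/* structure version */
		ceph_encode_64(p, ino);
		ceph_encode_string(p, end, path, len);
	}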
Signed-off-by: Alex Elder <[email protected]>
Reviewed-by: Yehuda Sadeh <[email protected]>
Reviewed-by: Josh Durgin <[email protected]>
|
|
d_parent is never NULL, and IS_ROOT() is the proper way to check for a
(non-self-referential) parent.
Reported-by: Al Viro <[email protected]>
Signed-off-by: Sage Weil <[email protected]>
|
|
Add an atomic variable 'stopping' as a flag in struct ceph_messenger,
set this flag to 1 in ceph_destroy_client(), and add a check in
ceph_data_ready() that simply returns when the flag is set.
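A rough sketch of the shape of the change, following the commit text above;
the surrounding messenger code is elided and names may not match the source
exactly:
	static void ceph_data_ready(struct sock *sk, int count_unused)
	{
		struct ceph_connection *con = sk->sk_user_data;

		if (atomic_read(&con->msgr->stopping))
			return;		/* client teardown in progress, drop the event */

		queue_con(con);		/* normal path: kick the connection worker */
	}
	/* and in ceph_destroy_client(): atomic_set(&client->msgr.stopping, 1); */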
Signed-off-by: Guanjun He <[email protected]>
Reviewed-by: Sage Weil <[email protected]>
|
|
In ancient times, the messenger could both initiate and accept connections.
An artifact of that was data structures to store/process an incoming
ceph_msg_connect request and send an outgoing ceph_msg_connect_reply.
Sadly, the negotiation code was referencing those structures and ignoring
important information (like the peer's connect_seq) from the correct ones.
Among other things, this fixes tight reconnect loops where the server sends
RETRY_SESSION and we (the client) retries with the same connect_seq as last
time. This bug is pretty easily triggered by injecting socket failures on the
MDS and running some fs workload like workunits/direct_io/test_sync_io.
Signed-off-by: Sage Weil <[email protected]>
|
|
These don't strictly need to be initialized based on how they are used, but
it is good practice to do so.
Reported-by: Alex Elder <[email protected]>
Signed-off-by: Sage Weil <[email protected]>
|
|
Initialize the type field for messages in a msgpool. The caller was doing
this for osd ops, but not for the reply messages.
Reported-by: Alex Elder <[email protected]>
Signed-off-by: Sage Weil <[email protected]>
|
|
Pull PWM subsystem from Thierry Reding:
"The new PWM subsystem aims at collecting all implementations of the
legacy PWM API and to eventually replace it completely.
The subsystem has been in development for over half a year now and
many drivers have already been converted. It has been in linux-next
for a couple of weeks and there have been no major issues so I think
it is ready for inclusion in your tree."
Arnd Bergmann <[email protected]>:
"Very much Ack on the new subsystem. It uses the interface
declarations as the previously separate pwm drivers, so nothing
changes for now in the drivers using it, although it enables us to
change those more easily in the future if we want to.
This work is also one of the missing pieces that are required to
eventually build ARM kernels for multiple platforms, which is
currently prohibited (amongst other things) by the fact that you cannot
have more than one driver exporting the pwm functions."
Tested-and-acked-by: Alexandre Courbot <[email protected]>
Acked-by: Mark Brown <[email protected]>
Acked-by: Philip, Avinash <[email protected]> # TI's AM33xx platforms
Acked-By: Alexandre Pereira da Silva <[email protected]> # LPC32XX
Acked-by: Arnd Bergmann <[email protected]>
Acked-by: Sachin Kamat <[email protected]>
Fix up trivial conflicts with other cleanups and DT updates.
* 'for-3.6' of git://gitorious.org/linux-pwm/linux-pwm: (36 commits)
pwm: pwm-tiehrpwm: PWM driver support for EHRPWM
pwm: pwm-tiecap: PWM driver support for ECAP APWM
pwm: fix used-uninitialized warning in pwm_get()
pwm: add lpc32xx PWM support
pwm_backlight: pass correct brightness to callback
pwm: Use pr_* functions in pwm-samsung.c file
pwm: Convert pwm-samsung to use devm_* APIs
pwm: Convert pwm-tegra to use devm_clk_get()
pwm: pwm-mxs: Return proper error if pwmchip_remove() fails
pwm: pwm-bfin: Return proper error if pwmchip_remove() fails
pwm: pxa: Propagate pwmchip_remove() error
pwm: Convert pwm-pxa to use devm_* APIs
pwm: Convert pwm-vt8500 to use devm_* APIs
pwm: Convert pwm-imx to use devm_* APIs
pwm: Conflict with legacy PWM API
pwm: pwm-mxs: add pinctrl support
pwm: pwm-mxs: use devm_* managed functions
pwm: pwm-mxs: use global reset function stmp_reset_block
pwm: pwm-mxs: encode soc name in compatible string
pwm: Take over maintainership of the PWM subsystem
...
|
|
Fixes an omission in the new v4l2_ioctls table: VIDIOC_TRY_EXT_CTRLS
must get the INFO_FL_CTRL flag, just like all the other control
related ioctls, otherwise the ioctl core won't know it also has
to check whether v4l2_fh->ctrl_handler is non-zero before it can
decide that this ioctl is not implemented.
Caught by v4l2-compliance while I was testing the mem2mem_testdev driver.
Signed-off-by: Hans Verkuil <[email protected]>
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
|
|
|
|
Dave Miller <[email protected]> provided a detailed description of
why the way IPoIB is using neighbours for its own ipoib_neigh struct
is buggy:
Any time an ipoib_neigh is changed, a sequence like the following is made:
	spin_lock_irqsave(&priv->lock, flags);
	/*
	 * It's safe to call ipoib_put_ah() inside
	 * priv->lock here, because we know that
	 * path->ah will always hold one more reference,
	 * so ipoib_put_ah() will never do more than
	 * decrement the ref count.
	 */
	if (neigh->ah)
		ipoib_put_ah(neigh->ah);
	list_del(&neigh->list);
	ipoib_neigh_free(dev, neigh);
	spin_unlock_irqrestore(&priv->lock, flags);
	ipoib_path_lookup(skb, n, dev);
This doesn't work, because you're leaving a stale pointer to the freed up
ipoib_neigh in the special neigh->ha pointer cookie. Yes, it even fails
with all the locking done to protect _changes_ to *ipoib_neigh(n), and
with the code in ipoib_neigh_free() that NULLs out the pointer.
The core issue is that read side calls to *to_ipoib_neigh(n) are not
being synchronized at all, they are performed without any locking. So
whether we hold the lock or not when making changes to *ipoib_neigh(n)
you still can have threads see references to freed up ipoib_neigh
objects.
	cpu 1				cpu 2
	n = *ipoib_neigh()
					*ipoib_neigh() = NULL
					kfree(n)
	n->foo == OOPS
[..]
Perhaps the ipoib code can have a private path database it manages
entirely itself, which holds all the necessary information and is
looked up by some generic key which is available easily at transmit
time and does not involve generic neighbour entries.
See <http://marc.info/?l=linux-rdma&m=132812793105624&w=2> and
<http://marc.info/?l=linux-rdma&w=2&r=1&s=allows+references+to+freed+memory&q=b>
for the full discussion.
This patch aims to solve the race conditions found in the IPoIB driver.
The patch removes the connection between the core networking neighbour
structure and the ipoib_neigh structure. In addition to avoiding the
race described above, it allows us to handle SKBs carrying IP packets
that don't have any associated neighbour.
We add an ipoib_neigh hash table with N buckets where the key is the
destination hardware address. The ipoib_neigh is fetched from the
hash table instead of from the stashed location in the neighbour
structure. The hash table uses both RCU and reference counting to
guarantee that no ipoib_neigh instance is ever deleted while in use.
Fetching the ipoib_neigh structure instance from the hash also makes
the special code in ipoib_start_xmit that handles remote and local
bonding failover redundant.
Aged ipoib_neigh instances are deleted by a garbage collection task
that runs every M seconds and deletes every ipoib_neigh instance that
was idle for at least 2*M seconds. The deletion is safe since the
ipoib_neigh instances are protected using RCU and reference count
mechanisms.
The number of buckets (N) and the frequency of running the GC thread (M)
are taken from the exported arp_tbl.
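A simplified sketch of the RCU-plus-refcount lookup pattern described above;
the bucket count, hash function and struct field names here are hypothetical,
not the driver's actual ones:
	static struct ipoib_neigh *neigh_hash_lookup(struct ipoib_dev_priv *priv,
						     const u8 *daddr)
	{
		u32 idx = jhash(daddr, INFINIBAND_ALEN, 0) & (NR_BUCKETS - 1);
		struct ipoib_neigh *neigh;

		rcu_read_lock();
		for (neigh = rcu_dereference(priv->buckets[idx]); neigh;
		     neigh = rcu_dereference(neigh->hnext)) {
			if (!memcmp(daddr, neigh->daddr, INFINIBAND_ALEN)) {
				/* skip entries whose last reference is already gone */
				if (!atomic_inc_not_zero(&neigh->refcnt))
					neigh = NULL;
				break;
			}
		}
		rcu_read_unlock();
		return neigh;		/* caller drops the reference when done */
	}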
Signed-off-by: Shlomo Pongratz <[email protected]>
Signed-off-by: Or Gerlitz <[email protected]>
Signed-off-by: Roland Dreier <[email protected]>
|
|
This patch adds support for the DMA_ATTR_SKIP_CPU_SYNC attribute in the
dma_(un)map_(single,page,sg) family of functions. It lets dma mapping clients
create a mapping for a buffer for a given device without performing
CPU cache synchronization. CPU cache synchronization can be skipped for
buffers which are known to already be in the 'device' domain (the CPU
caches have already been synchronized, or there are only coherent mappings
for the buffer). For advanced users only; please use it with care.
Signed-off-by: Marek Szyprowski <[email protected]>
Reviewed-by: Kyungmin Park <[email protected]>
|
|
This patch adds DMA_ATTR_SKIP_CPU_SYNC attribute to the DMA-mapping
subsystem.
By default the dma_map_{single,page,sg} family of functions transfers a given
buffer from the CPU domain to the device domain. Some advanced use cases might
require sharing a buffer between more than one device. This requires
having a mapping created separately for each device and is usually
performed by calling a dma_map_{single,page,sg} function more than once
for the given buffer, with the device pointer of each device taking part in
the buffer sharing. The first call transfers the buffer from the 'CPU' domain
to the 'device' domain, which synchronizes the CPU caches for the given region
(usually it means that the cache has been flushed or invalidated,
depending on the dma direction). However, subsequent calls to
dma_map_{single,page,sg}() for other devices will perform exactly the
same synchronization operation on the CPU cache. CPU cache synchronization
might be a time-consuming operation, especially if the buffers are
large, so it is highly recommended to avoid it if possible.
DMA_ATTR_SKIP_CPU_SYNC allows platform code to skip synchronization of
the CPU cache for the given buffer, assuming that it has already been
transferred to the 'device' domain. This attribute can also be used for
the dma_unmap_{single,page,sg} family of functions to force the buffer to
stay in the device domain after releasing a mapping for it. Use this
attribute with care!
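A hedged usage sketch using the struct dma_attrs interface of this era;
dev1, dev2 and buf are assumed to come from the caller:
	static void map_shared_buffer(struct device *dev1, struct device *dev2,
				      void *buf, size_t size)
	{
		DEFINE_DMA_ATTRS(attrs);
		dma_addr_t for_dev1, for_dev2;

		/* first mapping performs the CPU cache synchronization */
		for_dev1 = dma_map_single(dev1, buf, size, DMA_TO_DEVICE);

		/* the buffer is now in the 'device' domain, so skip the sync */
		dma_set_attr(DMA_ATTR_SKIP_CPU_SYNC, &attrs);
		for_dev2 = dma_map_single_attrs(dev2, buf, size, DMA_TO_DEVICE,
						&attrs);
	}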
Signed-off-by: Marek Szyprowski <[email protected]>
Reviewed-by: Kyungmin Park <[email protected]>
|
|
This patch adds support for the dma_get_sgtable() function, which is required
to let drivers share buffers allocated by the DMA-mapping subsystem.
The generic implementation based on virt_to_page() is not suitable for the
ARM dma-mapping subsystem.
Signed-off-by: Marek Szyprowski <[email protected]>
Reviewed-by: Kyungmin Park <[email protected]>
|
|
This patch adds the dma_get_sgtable() function, which is required to let
drivers share buffers allocated by the DMA-mapping subsystem. Right
now the driver gets a dma address of the allocated buffer and the kernel
virtual mapping for it. If it wants to share the buffer with another device
(= map it into that device's dma address space) it usually hacks around
kernel virtual addresses to get pointers to pages, or assumes that both
devices share the DMA address space. Both solutions are just hacks for
special cases and should be avoided in the final version of buffer sharing.
To solve this issue in a generic way, a new DMA-mapping call has been
introduced - dma_get_sgtable(). It allocates a scatter-list which
describes the allocated buffer and lets the driver(s) use it with
other device(s) by calling dma_map_sg() on it.
This patch provides a generic implementation based on virt_to_page()
call. Architectures which require more sophisticated translation might
provide their own get_sgtable() methods.
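A hedged usage sketch of the flow described above; the device and buffer
parameters are assumed to come from the exporting driver, and the caller
keeps the sg_table around for the eventual dma_unmap_sg():
	static int share_coherent_buffer(struct device *owner, struct device *importer,
					 void *cpu_addr, dma_addr_t dma_addr,
					 size_t size, struct sg_table *sgt)
	{
		int ret;

		/* describe the already-allocated buffer as a scatter-list */
		ret = dma_get_sgtable(owner, sgt, cpu_addr, dma_addr, size);
		if (ret)
			return ret;

		/* map the same pages into the importer's DMA address space */
		if (!dma_map_sg(importer, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL)) {
			sg_free_table(sgt);
			return -EIO;
		}
		return 0;	/* caller unmaps and frees 'sgt' when sharing ends */
	}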
Signed-off-by: Marek Szyprowski <[email protected]>
Reviewed-by: Kyungmin Park <[email protected]>
Reviewed-by: Daniel Vetter <[email protected]>
|
|
This patch adds support for the DMA_ATTR_NO_KERNEL_MAPPING attribute for
IOMMU allocations, which lets drivers save precious kernel virtual
address space for large buffers that are intended to be accessed only
from userspace.
This patch is heavily based on initial work kindly provided by Abhinav
Kochhar <[email protected]>.
Signed-off-by: Marek Szyprowski <[email protected]>
Reviewed-by: Kyungmin Park <[email protected]>
|
|
This patch adds the DMA_ATTR_NO_KERNEL_MAPPING attribute, which lets the
platform avoid creating a kernel virtual mapping for the allocated
buffer. On some architectures creating such a mapping is a non-trivial task
and consumes very limited resources (like kernel virtual address space
or dma consistent address space). Buffers allocated with this attribute
can only be passed to user space by calling dma_mmap_attrs().
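A hedged usage sketch; the driver-specific context (device, size, attrs
storage) is assumed:
	static void *alloc_for_userspace_only(struct device *dev, size_t size,
					      dma_addr_t *dma_handle,
					      struct dma_attrs *attrs)
	{
		init_dma_attrs(attrs);
		dma_set_attr(DMA_ATTR_NO_KERNEL_MAPPING, attrs);

		/* the returned cookie need not be a usable kernel virtual address */
		return dma_alloc_attrs(dev, size, dma_handle, GFP_KERNEL, attrs);
	}
	/* later, from the driver's mmap file operation:
	 *	dma_mmap_attrs(dev, vma, cookie, dma_handle, size, attrs);
	 */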
Signed-off-by: Marek Szyprowski <[email protected]>
Reviewed-by: Kyungmin Park <[email protected]>
Reviewed-by: Daniel Vetter <[email protected]>
|
|
Commit 9adc5374 ('common: dma-mapping: introduce mmap method') added a
generic method for implementing the mmap user call to the dma_map_ops structure.
This patch converts ARM and PowerPC architectures (the only providers of
dma_mmap_coherent/dma_mmap_writecombine calls) to use this generic
dma_map_ops based call and adds a generic cross architecture
definition for dma_mmap_attrs, dma_mmap_coherent, dma_mmap_writecombine
functions.
A generic mmap virt_to_page-based fallback implementation is provided for
architectures which don't provide their own implementation of the mmap method.
Signed-off-by: Marek Szyprowski <[email protected]>
Reviewed-by: Kyungmin Park <[email protected]>
|
|
This patch fixes an incorrect check in the error path. When the allocation of
the first page fails, a kernel oops occurs due to accessing the -1 element of
the pages array.
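The fix follows the classic cleanup-loop pattern; this is an illustrative
reconstruction rather than the literal diff:
	/* buggy: with i == 0 the pre-decrement yields -1 and pages[-1] is touched */
	while (--i)
		__free_pages(pages[i], 0);

	/* fixed: test first, then decrement, so the loop never underruns the array */
	while (i--)
		__free_pages(pages[i], 0);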
Reported-by: Sylwester Nawrocki <[email protected]>
Signed-off-by: Marek Szyprowski <[email protected]>
|
|
Add some sanity checks and forbid mmapping of buffers into vma areas larger
than the allocated dma buffer.
Signed-off-by: Marek Szyprowski <[email protected]>
|
|
This patch changes dma-mapping subsystem to use generic vmalloc areas
for all consistent dma allocations. This increases the total size limit
of the consistent allocations and removes platform hacks and a lot of
duplicated code.
Atomic allocations are served from a special pool preallocated at boot,
because vmalloc areas cannot be reliably created in atomic context.
Signed-off-by: Marek Szyprowski <[email protected]>
Reviewed-by: Kyungmin Park <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
|
|
'const void *' is a safer type for the caller function type. This patch
updates all references to the caller function type.
Signed-off-by: Marek Szyprowski <[email protected]>
Reviewed-by: Kyungmin Park <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
|
|
This patch adds a new constructor for an sg table. The table is constructed
from an array of struct pages. All contiguous chunks of the pages are merged
into single sg nodes. A user may provide an offset and a size of the buffer if
the buffer is not page-aligned.
The function is dedicated to DMABUF exporters, which often perform conversion
from a page array to a scatterlist. Moreover, the scatterlist should be
squashed in order to save memory and to speed up the process of DMA mapping
using dma_map_sg().
The code is based on the patch 'v4l: vb2-dma-contig: add support for
scatterlist in userptr mode' and hints from Laurent Pinchart.
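A hedged usage sketch from an exporter's point of view; the page array and
buffer geometry are assumed to be tracked by the caller:
	static int exporter_build_sgt(struct sg_table *sgt, struct page **pages,
				      unsigned int n_pages, unsigned long offset,
				      unsigned long size)
	{
		/* contiguous page runs are collapsed into single sg entries */
		return sg_alloc_table_from_pages(sgt, pages, n_pages,
						 offset, size, GFP_KERNEL);
	}
	/* the squashed table can then be handed to dma_map_sg() for the importer */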
Signed-off-by: Tomasz Stanislawski <[email protected]>
Signed-off-by: Kyungmin Park <[email protected]>
Acked-by: Daniel Vetter <[email protected]>
Acked-by: Laurent Pinchart <[email protected]>
Signed-off-by: Marek Szyprowski <[email protected]>
CC: Andrew Morton <[email protected]>
|
|
The label 'oops' is used inside a CONFIG_DEBUG_VM ifdef block but is defined
outside of it. This results in the following
build warning when built with CONFIG_DEBUG_VM disabled. Fix by moving the
label 'oops' definition inside a CONFIG_DEBUG_VM block.
mm/slab_common.c: In function ‘kmem_cache_create’:
mm/slab_common.c:101:1: warning: label ‘oops’ defined but not used
[-Wunused-label]
Signed-off-by: Shuah Khan <[email protected]>
Signed-off-by: Pekka Enberg <[email protected]>
|
|
... as being referenced from __init code only.
Signed-off-by: Jan Beulich <[email protected]>
Signed-off-by: Jean Delvare <[email protected]>
|
|
Like do_wp_page(), __replace_page() should do munlock_vma_page()
for the case when the old page still has other !VM_LOCKED
mappings. Unfortunately this needs mm/internal.h.
Also, move put_page() outside of ptl lock. This doesn't really
matter but looks a bit better.
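A rough sketch of the resulting ordering inside __replace_page(), per the
description above; the surrounding rmap/pte bookkeeping is elided:
	/* ... pte for the old page has been replaced under the ptl ... */
	pte_unmap_unlock(ptep, ptl);

	if (vma->vm_flags & VM_LOCKED)		/* mirrors do_wp_page() */
		munlock_vma_page(old_page);	/* needs mm/internal.h */
	put_page(old_page);			/* now dropped outside the ptl lock */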
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
1. vma_address() returns loff_t, this looks confusing and this
is unnecessary after the previous change. Make it return "ulong",
all callers truncate the result anyway.
2. Its name conflicts with mm/rmap.c:vma_address(), rename it to
offset_to_vaddr(), this matches vaddr_to_offset().
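A plausible sketch of the renamed helper (the inverse of vaddr_to_offset(),
introduced in the commit below); the in-tree version may differ slightly:
	static unsigned long offset_to_vaddr(struct vm_area_struct *vma, loff_t offset)
	{
		return vma->vm_start +
			(unsigned long)(offset - ((loff_t)vma->vm_pgoff << PAGE_SHIFT));
	}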
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
1. register_for_each_vma() checks that vma_address() == vaddr,
but this is not enough. We should also ensure that
vaddr >= vm_start, find_vma() guarantees "vaddr < vm_end" only.
2. After the previous changes, register_for_each_vma() is the
only reason why vma_address() has to return loff_t, all other
users know that we have the valid mapping at this offset and
thus the overflow is not possible.
Change the code to use vaddr_to_offset() instead, imho this looks
more clean/understandable and now we can change vma_address().
3. While at it, remove the unnecessary type-cast.
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Add the new helper, vaddr_to_offset(vma, vaddr) which returns
the offset in vma->vm_file this vaddr is mapped at.
Change build_probe_list() and find_active_uprobe() to use the
new helper, the next patch adds another user.
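A plausible sketch of the new helper, matching the description above; the
in-tree version may differ slightly:
	static loff_t vaddr_to_offset(struct vm_area_struct *vma, unsigned long vaddr)
	{
		return ((loff_t)vma->vm_pgoff << PAGE_SHIFT) + (vaddr - vma->vm_start);
	}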
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Currently build_probe_list() builds the list of all uprobes
attached to the given inode, and the caller should filter out
those which don't fall into the [start,end) range; this is
sub-optimal.
This patch turns find_least_offset_node() into
find_node_in_range() which returns the first node inside the
[min,max] range, and changes build_probe_list() to use this node
as a starting point for rb_prev() and rb_next() to find all
other nodes the caller needs. The resulting list is no longer
sorted but we do not care.
This can speed up both build_probe_list() and the callers, but
there is another reason to introduce find_node_in_range(). It
can be used to figure out whether the given vma has uprobes or
not, this will be needed soon.
While at it, shift INIT_LIST_HEAD(tmp_list) into
build_probe_list().
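A hedged sketch of such a range lookup over the uprobes rb-tree, which is
ordered by (inode, offset); details of the in-tree version may differ:
	static struct rb_node *find_node_in_range(struct inode *inode,
						  loff_t min, loff_t max)
	{
		struct rb_node *n = uprobes_tree.rb_node;

		while (n) {
			struct uprobe *u = rb_entry(n, struct uprobe, rb_node);

			if (inode < u->inode) {
				n = n->rb_left;
			} else if (inode > u->inode) {
				n = n->rb_right;
			} else {
				if (max < u->offset)
					n = n->rb_left;		/* range lies before this node */
				else if (min > u->offset)
					n = n->rb_right;	/* range lies after this node */
				else
					break;			/* node falls inside [min,max] */
			}
		}
		return n;	/* NULL if no uprobe falls into the range */
	}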
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Remove insert_vm_struct()->uprobe_mmap(). It is not needed, nobody
except arch/ia64/kernel/perfmon.c uses insert_vm_struct(vma)
with vma->vm_file != NULL.
And it is wrong. Again, get_user_pages() cannot succeed before
vma_link(vma) makes it visible to find_vma(). And even if this
worked, we must not insert the new bp before this mapping is
visible to vma_prio_tree_foreach() for uprobe_unregister().
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Remove copy_vma()->uprobe_mmap(new_vma), it is absolutely wrong.
This new_vma was just initialized to represent the new unmapped
area, [vm_start, vm_end) was returned by get_unmapped_area() in
the caller.
This means that uprobe_mmap()->get_user_pages() will fail for
sure, simply because find_vma() can never succeed. And I
verified that sys_mremap()->mremap_to() indeed always fails with
the wrong ENOMEM code if [addr, addr+old_len] is probed.
And why was this uprobe_mmap() added? I believe the intent was
wrong. Note that the caller is going to do move_page_tables(),
all registered uprobes are already faulted in, we only change
the virtual addresses.
NOTE: However, somehow we need to close the race with
uprobe_register() which relies on map_info->vaddr. This needs
another fix I'll try to do later. Probably we need uprobe_mmap()
in move_vma() but we can not do this right now, this can confuse
uprobes_state.counter (which I still hope we are going to kill).
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
vma->vm_pgoff is "unsigned long", it should be promoted to
loff_t before the multiplication to avoid the overflow.
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
uprobe_munmap() does get_user_pages() and it is also called from
the final mmput()->exit_mmap() path. This slows down
exit/mmput() for no reason, and I think it is simply
dangerous/wrong to try to fault-in a page into the dying mm. If
nothing else, this happens after the last sync_mm_rss(), afaics
handle_mm_fault() can change the task->rss_stat and make the
subsequent check_mm() unhappy.
Change uprobe_munmap() to check mm->mm_users != 0.
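Sketch of the guard described above (the rest of uprobe_munmap() is elided):
	void uprobe_munmap(struct vm_area_struct *vma, unsigned long start,
			   unsigned long end)
	{
		if (!atomic_read(&vma->vm_mm->mm_users))	/* called by mmput()? */
			return;			/* never fault pages into a dying mm */

		/* ... existing breakpoint-removal work for live mms ... */
	}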
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The bug was introduced by me in 449d0d7c ("uprobes: Simplify the
usage of uprobe->pending_list").
Yes, we do not care about uprobe->pending_list after return and
nobody can remove the current list entry, but put_uprobe(uprobe)
can actually free it and thus we need list_for_each_safe().
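The shape of the fix, sketched generically; the _safe iterator caches the
next entry before the body may free the current one:
	struct uprobe *uprobe, *next;

	list_for_each_entry_safe(uprobe, next, &tmp_list, pending_list) {
		/* ... install the breakpoint for this uprobe ... */
		put_uprobe(uprobe);	/* may drop the last ref and free 'uprobe' */
	}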
Reported-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The comment above write_opcode()->lock_page(old_page) tells
about the race with do_wp_page(). I don't really understand
exactly which race it means, but afaics this lock_page() was not
enough to close all races with do_wp_page().
Anyway, since:
77fc4af1b59d uprobes: Change register_for_each_vma() to take mm->mmap_sem for writing
this code is always called with ->mmap_sem held for writing,
so we can forget about do_wp_page().
However, we can't simply remove this lock_page(), and the only
(afaics) reason is __replace_page()->try_to_free_swap().
Nothing in write_opcode() needs it, move it into
__replace_page() and fix the comment.
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
write_opcode() does lock_page(new_page) for no reason. Nobody
can see this page until __replace_page() exposes it under ptl
lock, and we do nothing with this page after pte_unmap_unlock().
If nothing else, the similar code in do_wp_page() doesn't lock
the new page for page_add_new_anon_rmap/set_pte_at_notify.
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
page_address_in_vma(old_page) in __replace_page() is ugly and
wrong. The caller already knows the correct virtual address,
this page was found by get_user_pages(vaddr).
However, page_address_in_vma() can actually fail if
page->mapping was cleared by __delete_from_page_cache() after
get_user_pages() returns. But this means the race with page
reclaim, write_opcode() should not fail, it should retry and
read this page again. Probably the race with remove_mapping() is
not possible due to page_freeze_refs() logic, but afaics at
least shmem_writepage()->shmem_delete_from_page_cache() can
clear ->mapping.
We could change __replace_page() to return -EAGAIN in this case,
but it would be better to simply use the caller's vaddr and rely
on page_check_address().
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com>
Cc: Anton Arapov <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|