aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2012-07-30rbd: more symbol renamesAlex Elder1-26/+27
Rename variables named "obj" which represent object names so they're consistently named "object_name". Rename the "cls" and "method" parameters in rbd_req_sync_exec() to be "class_name" and "method_name", and make similar changes to the names of local variables in that function representing the lengths of those names. Signed-off-by: Alex Elder <[email protected]> Reviewed-by: Josh Durgin <[email protected]>
2012-07-30rbd: rename some fields in struct rbd_devAlex Elder1-27/+28
An rbd image is not a single object, but a logical construct made up of an aggregation of objects. Rename some fields in struct rbd_dev, in hopes of reinforcing this. obj --> image_name obj_len --> image_name_len obj_md_name --> header_name Signed-off-by: Alex Elder <[email protected]> Reviewed-by: Josh Durgin <[email protected]>
2012-07-30rbd: use rbd_dev consistentlyAlex Elder1-61/+64
Most variables that represent a struct rbd_device are named "rbd_dev", but in some cases "dev" is used instead. Change all the "dev" references so they use "rbd_dev" consistently, to make it clear from the name that we're working with an RBD device (as opposed to, for example, a struct device). Similarly, change the name of the "dev" field in struct rbd_notify_info to be "rbd_dev". Signed-off-by: Alex Elder <[email protected]> Reviewed-by: Josh Durgin <[email protected]>
2012-07-30rbd: dynamically allocate snapshot nameAlex Elder1-10/+16
There is no need to impose a small limit the length of the snapshot name recorded for an rbd image in a struct rbd_dev. Remove the limitation by allocating space for the snapshot name dynamically. Signed-off-by: Alex Elder <[email protected]> Reviewed-by: Josh Durgin <[email protected]>
2012-07-30rbd: dynamically allocate image nameAlex Elder2-16/+13
There is no need to impose a small limit the length of the rbd image name recorded in a struct rbd_dev. Remove the limitation by allocating space for the image name dynamically. Signed-off-by: Alex Elder <[email protected]> Reviewed-by: Josh Durgin <[email protected]>
2012-07-30rbd: dynamically allocate image header nameAlex Elder1-11/+20
There is no need to impose a small limit the length of the header name recorded for an rbd image in a struct rbd_dev. Remove the limitation by allocating space for the header name dynamically. Signed-off-by: Alex Elder <[email protected]> Reviewed-by: Josh Durgin <[email protected]>
2012-07-30rbd: dynamically allocate object prefixAlex Elder1-8/+26
There is no need to impose a small limit the length of the object prefix recorded for an rbd image in a struct rbd_image_header. Remove the limitation by allocating space for the object prefix dynamically. Signed-off-by: Alex Elder <[email protected]> Reviewed-by: Josh Durgin <[email protected]>
2012-07-30rbd: dynamically allocate pool nameAlex Elder1-8/+19
There is no need to impose a small limit the length of the pool name recorded for an rbd image in a struct rbd_device. Remove the limitation by allocating space for the pool name ynamically. Signed-off-by: Alex Elder <[email protected]> Reviewed-by: Josh Durgin <[email protected]>
2012-07-30rbd: create pool_id device attributeAlex Elder2-6/+22
Add an entry under /sys/bus/rbd/devices/<N>/ named "pool_id" that provides the id for the pool the rbd image is assocatied with. This is in addition to the pool name already provided. Rename the "poolid" field in struct rbd_device to be "pool_id". Update the documentation to reflect the addition of this new entry. Signed-off-by: Alex Elder <[email protected]> Reviewed-by: Josh Durgin <[email protected]>
2012-07-30rbd: rename rbd_dev->block_nameAlex Elder1-6/+6
Each rbd image has a name that forms the basis of all data objects backing the device. Old (format 1) images refer to this name as the "block name," while new (format 2) images use the term "object prefix" for this. Change the field name in the in-core rbd image header structure to reflect the more modern usage. We intentionally keep the the name "block_name" in the on-disk definition for format 1 image headers. Signed-off-by: Alex Elder <[email protected]> Reviewed-by: Yehuda Sadeh <[email protected]> Reviewed-by: Josh Durgin <[email protected]>
2012-07-30rbd: define dup_token()Alex Elder1-0/+36
Define a new function dup_token(), to be used during argument parsing for making dynamically-allocated copies of tokens being parsed. Signed-off-by: Alex Elder <[email protected]> Reviewed-by: Yehuda Sadeh <[email protected]> Reviewed-by: Josh Durgin <[email protected]>
2012-07-30libceph: define ceph_extract_encoded_string()Alex Elder1-0/+47
This adds a new utility routine which will return a dynamically- allocated buffer containing a string that has been decoded from ceph over-the-wire format. It also returns the length of the string if the address of a size variable is supplied to receive it. Signed-off-by: Alex Elder <[email protected]> Reviewed-by: Sage Weil <[email protected]>
2012-07-30rbd: drop a useless local variableAlex Elder1-2/+1
In rbd_req_sync_notify_ack(), a local variable was needlessly being used to hold a null pointer. Just pass NULL instead. Signed-off-by: Alex Elder <[email protected]> Reviewed-by: Yehuda Sadeh <[email protected]> Reviewed-by: Josh Durgin <[email protected]>
2012-07-30libceph: fix off-by-one bug in ceph_encode_filepath()Alex Elder1-1/+1
There is a BUG_ON() call that doesn't account for the single byte structure version at the start of an encoded filepath in ceph_encode_filepath(). Fix that. Signed-off-by: Alex Elder <[email protected]> Reviewed-by: Yehuda Sadeh <[email protected]> Reviewed-by: Josh Durgin <[email protected]>
2012-07-30ceph: clean up useless d_parent checksSage Weil2-15/+3
d_parent is never NULL, and IS_ROOT() is the proper way to check for a (non-self-referential) parent. Reported-by: Al Viro <[email protected]> Signed-off-by: Sage Weil <[email protected]>
2012-07-30libceph: prevent the race of incoming work during teardownGuanjun He3-0/+8
Add an atomic variable 'stopping' as flag in struct ceph_messenger, set this flag to 1 in function ceph_destroy_client(), and add the condition code in function ceph_data_ready() to test the flag value, if true(1), just return. Signed-off-by: Guanjun He <[email protected]> Reviewed-by: Sage Weil <[email protected]>
2012-07-30libceph: fix messenger retrySage Weil2-16/+8
In ancient times, the messenger could both initiate and accept connections. An artifact if that was data structures to store/process an incoming ceph_msg_connect request and send an outgoing ceph_msg_connect_reply. Sadly, the negotiation code was referencing those structures and ignoring important information (like the peer's connect_seq) from the correct ones. Among other things, this fixes tight reconnect loops where the server sends RETRY_SESSION and we (the client) retries with the same connect_seq as last time. This bug pretty easily triggered by injecting socket failures on the MDS and running some fs workload like workunits/direct_io/test_sync_io. Signed-off-by: Sage Weil <[email protected]>
2012-07-30libceph: initialize rb, list nodes in ceph_osd_requestSage Weil1-0/+3
These don't strictly need to be initialized based on how they are used, but it is good practice to do so. Reported-by: Alex Elder <[email protected]> Signed-off-by: Sage Weil <[email protected]>
2012-07-30libceph: initialize msgpool message typesSage Weil3-7/+10
Initialize the type field for messages in a msgpool. The caller was doing this for osd ops, but not for the reply messages. Reported-by: Alex Elder <[email protected]> Signed-off-by: Sage Weil <[email protected]>
2012-07-30Merge branch 'for-3.6' of git://gitorious.org/linux-pwm/linux-pwmLinus Torvalds44-1007/+3308
Pull PWM subsystem from Thierry Reding: "The new PWM subsystem aims at collecting all implementations of the legacy PWM API and to eventually replace it completely. The subsystem has been in development for over half a year now and many drivers have already been converted. It has been in linux-next for a couple of weeks and there have been no major issues so I think it is ready for inclusion in your tree." Arnd Bergmann <[email protected]>: "Very much Ack on the new subsystem. It uses the interface declarations as the previously separate pwm drivers, so nothing changes for now in the drivers using it, although it enables us to change those more easily in the future if we want to. This work is also one of the missing pieces that are required to eventually build ARM kernels for multiple platforms, which is currently prohibited (amongs other things) by the fact that you cannot have more than one driver exporting the pwm functions." Tested-and-acked-by: Alexandre Courbot <[email protected]> Acked-by: Mark Brown <[email protected]> Acked-by: Philip, Avinash <[email protected]> # TI's AM33xx platforms Acked-By: Alexandre Pereira da Silva <[email protected]> # LPC32XX Acked-by: Arnd Bergmann <[email protected]> Acked-by: Sachin Kamat <[email protected]> Fix up trivial conflicts with other cleanups and DT updates. * 'for-3.6' of git://gitorious.org/linux-pwm/linux-pwm: (36 commits) pwm: pwm-tiehrpwm: PWM driver support for EHRPWM pwm: pwm-tiecap: PWM driver support for ECAP APWM pwm: fix used-uninitialized warning in pwm_get() pwm: add lpc32xx PWM support pwm_backlight: pass correct brightness to callback pwm: Use pr_* functions in pwm-samsung.c file pwm: Convert pwm-samsung to use devm_* APIs pwm: Convert pwm-tegra to use devm_clk_get() pwm: pwm-mxs: Return proper error if pwmchip_remove() fails pwm: pwm-bfin: Return proper error if pwmchip_remove() fails pwm: pxa: Propagate pwmchip_remove() error pwm: Convert pwm-pxa to use devm_* APIs pwm: Convert pwm-vt8500 to use devm_* APIs pwm: Convert pwm-imx to use devm_* APIs pwm: Conflict with legacy PWM API pwm: pwm-mxs: add pinctrl support pwm: pwm-mxs: use devm_* managed functions pwm: pwm-mxs: use global reset function stmp_reset_block pwm: pwm-mxs: encode soc name in compatible string pwm: Take over maintainership of the PWM subsystem ...
2012-07-30[media] Fix VIDIOC_TRY_EXT_CTRLS regressionHans Verkuil1-1/+1
Fixes an omission in the new v4l2_ioctls table: VIDIOC_TRY_EXT_CTRLS must get the INFO_FL_CTRL flag, just like all the other control related ioctls, otherwise the ioctl core won't know it also has to check whether v4l2_fh->ctrl_handler is non-zero before it can decide that this ioctl is not implemented. Caught by v4l2-compliance while I was testing the mem2mem_testdev driver. Signed-off-by: Hans Verkuil <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
2012-07-30Merge branches 'cma', 'ipoib', 'ocrdma' and 'qib' into for-nextRoland Dreier6-242/+547
2012-07-30IPoIB: Use a private hash table for path lookup in xmit pathShlomo Pongratz4-236/+539
Dave Miller <[email protected]> provided a detailed description of why the way IPoIB is using neighbours for its own ipoib_neigh struct is buggy: Any time an ipoib_neigh is changed, a sequence like the following is made: spin_lock_irqsave(&priv->lock, flags); /* * It's safe to call ipoib_put_ah() inside * priv->lock here, because we know that * path->ah will always hold one more reference, * so ipoib_put_ah() will never do more than * decrement the ref count. */ if (neigh->ah) ipoib_put_ah(neigh->ah); list_del(&neigh->list); ipoib_neigh_free(dev, neigh); spin_unlock_irqrestore(&priv->lock, flags); ipoib_path_lookup(skb, n, dev); This doesn't work, because you're leaving a stale pointer to the freed up ipoib_neigh in the special neigh->ha pointer cookie. Yes, it even fails with all the locking done to protect _changes_ to *ipoib_neigh(n), and with the code in ipoib_neigh_free() that NULLs out the pointer. The core issue is that read side calls to *to_ipoib_neigh(n) are not being synchronized at all, they are performed without any locking. So whether we hold the lock or not when making changes to *ipoib_neigh(n) you still can have threads see references to freed up ipoib_neigh objects. cpu 1 cpu 2 n = *ipoib_neigh() *ipoib_neigh() = NULL kfree(n) n->foo == OOPS [..] Perhaps the ipoib code can have a private path database it manages entirely itself, which holds all the necessary information and is looked up by some generic key which is available easily at transmit time and does not involve generic neighbour entries. See <http://marc.info/?l=linux-rdma&m=132812793105624&w=2> and <http://marc.info/?l=linux-rdma&w=2&r=1&s=allows+references+to+freed+memory&q=b> for the full discussion. This patch aims to solve the race conditions found in the IPoIB driver. The patch removes the connection between the core networking neighbour structure and the ipoib_neigh structure. In addition to avoiding the race described above, it allows us to handle SKBs carrying IP packets that don't have any associated neighbour. We add an ipoib_neigh hash table with N buckets where the key is the destination hardware address. The ipoib_neigh is fetched from the hash table and instead of the stashed location in the neighbour structure. The hash table uses both RCU and reference counting to guarantee that no ipoib_neigh instance is ever deleted while in use. Fetching the ipoib_neigh structure instance from the hash also makes the special code in ipoib_start_xmit that handles remote and local bonding failover redundant. Aged ipoib_neigh instances are deleted by a garbage collection task that runs every M seconds and deletes every ipoib_neigh instance that was idle for at least 2*M seconds. The deletion is safe since the ipoib_neigh instances are protected using RCU and reference count mechanisms. The number of buckets (N) and frequency of running the GC thread (M), are taken from the exported arb_tbl. Signed-off-by: Shlomo Pongratz <[email protected]> Signed-off-by: Or Gerlitz <[email protected]> Signed-off-by: Roland Dreier <[email protected]>
2012-07-30ARM: dma-mapping: add support for DMA_ATTR_SKIP_CPU_SYNC attributeMarek Szyprowski1-9/+11
This patch adds support for DMA_ATTR_SKIP_CPU_SYNC attribute for dma_(un)map_(single,page,sg) functions family. It lets dma mapping clients to create a mapping for the buffer for the given device without performing a CPU cache synchronization. CPU cache synchronization can be skipped for the buffers which it is known that they are already in 'device' domain (CPU caches have been already synchronized or there are only coherent mappings for the buffer). For advanced users only, please use it with care. Signed-off-by: Marek Szyprowski <[email protected]> Reviewed-by: Kyungmin Park <[email protected]>
2012-07-30common: DMA-mapping: add DMA_ATTR_SKIP_CPU_SYNC attributeMarek Szyprowski2-0/+25
This patch adds DMA_ATTR_SKIP_CPU_SYNC attribute to the DMA-mapping subsystem. By default dma_map_{single,page,sg} functions family transfer a given buffer from CPU domain to device domain. Some advanced use cases might require sharing a buffer between more than one device. This requires having a mapping created separately for each device and is usually performed by calling dma_map_{single,page,sg} function more than once for the given buffer with device pointer to each device taking part in the buffer sharing. The first call transfers a buffer from 'CPU' domain to 'device' domain, what synchronizes CPU caches for the given region (usually it means that the cache has been flushed or invalidated depending on the dma direction). However, next calls to dma_map_{single,page,sg}() for other devices will perform exactly the same sychronization operation on the CPU cache. CPU cache sychronization might be a time consuming operation, especially if the buffers are large, so it is highly recommended to avoid it if possible. DMA_ATTR_SKIP_CPU_SYNC allows platform code to skip synchronization of the CPU cache for the given buffer assuming that it has been already transferred to 'device' domain. This attribute can be also used for dma_unmap_{single,page,sg} functions family to force buffer to stay in device domain after releasing a mapping for it. Use this attribute with care! Signed-off-by: Marek Szyprowski <[email protected]> Reviewed-by: Kyungmin Park <[email protected]>
2012-07-30ARM: dma-mapping: add support for dma_get_sgtable()Marek Szyprowski3-0/+35
This patch adds support for dma_get_sgtable() function which is required to let drivers to share the buffers allocated by DMA-mapping subsystem. Generic implementation based on virt_to_page() is not suitable for ARM dma-mapping subsystem. Signed-off-by: Marek Szyprowski <[email protected]> Reviewed-by: Kyungmin Park <[email protected]>
2012-07-30common: dma-mapping: introduce dma_get_sgtable() functionMarek Szyprowski3-0/+39
This patch adds dma_get_sgtable() function which is required to let drivers to share the buffers allocated by DMA-mapping subsystem. Right now the driver gets a dma address of the allocated buffer and the kernel virtual mapping for it. If it wants to share it with other device (= map into its dma address space) it usually hacks around kernel virtual addresses to get pointers to pages or assumes that both devices share the DMA address space. Both solutions are just hacks for the special cases, which should be avoided in the final version of buffer sharing. To solve this issue in a generic way, a new call to DMA mapping has been introduced - dma_get_sgtable(). It allocates a scatter-list which describes the allocated buffer and lets the driver(s) to use it with other device(s) by calling dma_map_sg() on it. This patch provides a generic implementation based on virt_to_page() call. Architectures which require more sophisticated translation might provide their own get_sgtable() methods. Signed-off-by: Marek Szyprowski <[email protected]> Reviewed-by: Kyungmin Park <[email protected]> Reviewed-by: Daniel Vetter <[email protected]>
2012-07-30ARM: dma-mapping: add support for DMA_ATTR_NO_KERNEL_MAPPING attributeMarek Szyprowski1-5/+13
This patch adds support for DMA_ATTR_NO_KERNEL_MAPPING attribute for IOMMU allocations, what let drivers to save precious kernel virtual address space for large buffers that are intended to be accessed only from userspace. This patch is heavily based on initial work kindly provided by Abhinav Kochhar <[email protected]>. Signed-off-by: Marek Szyprowski <[email protected]> Reviewed-by: Kyungmin Park <[email protected]>
2012-07-30common: DMA-mapping: add DMA_ATTR_NO_KERNEL_MAPPING attributeMarek Szyprowski2-0/+19
This patch adds DMA_ATTR_NO_KERNEL_MAPPING attribute which lets the platform to avoid creating a kernel virtual mapping for the allocated buffer. On some architectures creating such mapping is non-trivial task and consumes very limited resources (like kernel virtual address space or dma consistent address space). Buffers allocated with this attribute can be only passed to user space by calling dma_mmap_attrs(). Signed-off-by: Marek Szyprowski <[email protected]> Reviewed-by: Kyungmin Park <[email protected]> Reviewed-by: Daniel Vetter <[email protected]>
2012-07-30common: dma-mapping: add support for generic dma_mmap_* callsMarek Szyprowski9-40/+95
Commit 9adc5374 ('common: dma-mapping: introduce mmap method') added a generic method for implementing mmap user call to dma_map_ops structure. This patch converts ARM and PowerPC architectures (the only providers of dma_mmap_coherent/dma_mmap_writecombine calls) to use this generic dma_map_ops based call and adds a generic cross architecture definition for dma_mmap_attrs, dma_mmap_coherent, dma_mmap_writecombine functions. The generic mmap virt_to_page-based fallback implementation is provided for architectures which don't provide their own implementation for mmap method. Signed-off-by: Marek Szyprowski <[email protected]> Reviewed-by: Kyungmin Park <[email protected]>
2012-07-30ARM: dma-mapping: fix error path for memory allocation failureMarek Szyprowski1-1/+1
This patch fixes incorrect check in error path. When the allocation of first page fails, the kernel ops appears due to accessing -1 element of the pages array. Reported-by: Sylwester Nawrocki <[email protected]> Signed-off-by: Marek Szyprowski <[email protected]>
2012-07-30ARM: dma-mapping: add more sanity checks in arm_dma_mmap()Marek Szyprowski1-4/+10
Add some sanity checks and forbid mmaping of buffers into vma areas larger than allocated dma buffer. Signed-off-by: Marek Szyprowski <[email protected]>
2012-07-30ARM: dma-mapping: remove custom consistent dma regionMarek Szyprowski6-323/+181
This patch changes dma-mapping subsystem to use generic vmalloc areas for all consistent dma allocations. This increases the total size limit of the consistent allocations and removes platform hacks and a lot of duplicated code. Atomic allocations are served from special pool preallocated on boot, because vmalloc areas cannot be reliably created in atomic context. Signed-off-by: Marek Szyprowski <[email protected]> Reviewed-by: Kyungmin Park <[email protected]> Reviewed-by: Minchan Kim <[email protected]>
2012-07-30mm: vmalloc: use const void * for caller argumentMarek Szyprowski2-13/+13
'const void *' is a safer type for caller function type. This patch updates all references to caller function type. Signed-off-by: Marek Szyprowski <[email protected]> Reviewed-by: Kyungmin Park <[email protected]> Reviewed-by: Minchan Kim <[email protected]>
2012-07-30scatterlist: add sg_alloc_table_from_pages functionTomasz Stanislawski2-0/+68
This patch adds a new constructor for an sg table. The table is constructed from an array of struct pages. All contiguous chunks of the pages are merged into a single sg nodes. A user may provide an offset and a size of a buffer if the buffer is not page-aligned. The function is dedicated for DMABUF exporters which often perform conversion from an page array to a scatterlist. Moreover the scatterlist should be squashed in order to save memory and to speed-up the process of DMA mapping using dma_map_sg. The code is based on the patch 'v4l: vb2-dma-contig: add support for scatterlist in userptr mode' and hints from Laurent Pinchart. Signed-off-by: Tomasz Stanislawski <[email protected]> Signed-off-by: Kyungmin Park <[email protected]> Acked-by: Daniel Vetter <[email protected]> Acked-by: Laurent Pinchart <[email protected]> Signed-off-by: Marek Szyprowski <[email protected]> CC: Andrew Morton <[email protected]>
2012-07-30mm: Fix build warning in kmem_cache_create()Shuah Khan1-0/+2
The label oops is used in CONFIG_DEBUG_VM ifdef block and is defined outside ifdef CONFIG_DEBUG_VM block. This results in the following build warning when built with CONFIG_DEBUG_VM disabled. Fix to move label oops definition to inside a CONFIG_DEBUG_VM block. mm/slab_common.c: In function ‘kmem_cache_create’: mm/slab_common.c:101:1: warning: label ‘oops’ defined but not used [-Wunused-label] Signed-off-by: Shuah Khan <[email protected]> Signed-off-by: Pekka Enberg <[email protected]>
2012-07-30hwmon: struct x86_cpu_id arrays can be __initconstJan Beulich2-2/+2
... as being referenced from __init code only. Signed-off-by: Jan Beulich <[email protected]> Signed-off-by: Jean Delvare <[email protected]>
2012-07-30uprobes: __replace_page() needs munlock_vma_page()Oleg Nesterov1-2/+6
Like do_wp_page(), __replace_page() should do munlock_vma_page() for the case when the old page still has other !VM_LOCKED mappings. Unfortunately this needs mm/internal.h. Also, move put_page() outside of ptl lock. This doesn't really matter but looks a bit better. Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <[email protected]> Cc: Srikar Dronamraju <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2012-07-30uprobes: Rename vma_address() and make it return "unsigned long"Oleg Nesterov1-10/+5
1. vma_address() returns loff_t, this looks confusing and this is unnecessary after the previous change. Make it return "ulong", all callers truncate the result anyway. 2. Its name conflicts with mm/rmap.c:vma_address(), rename it to offset_to_vaddr(), this matches vaddr_to_offset(). Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <[email protected]> Cc: Srikar Dronamraju <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2012-07-30uprobes: Fix register_for_each_vma()->vma_address() checkOleg Nesterov1-4/+5
1. register_for_each_vma() checks that vma_address() == vaddr, but this is not enough. We should also ensure that vaddr >= vm_start, find_vma() guarantees "vaddr < vm_end" only. 2. After the prevous changes, register_for_each_vma() is the only reason why vma_address() has to return loff_t, all other users know that we have the valid mapping at this offset and thus the overflow is not possible. Change the code to use vaddr_to_offset() instead, imho this looks more clean/understandable and now we can change vma_address(). 3. While at it, remove the unnecessary type-cast. Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <[email protected]> Cc: Srikar Dronamraju <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2012-07-30uprobes: Introduce vaddr_to_offset(vma, vaddr)Oleg Nesterov1-6/+8
Add the new helper, vaddr_to_offset(vma, vaddr) which returns the offset in vma->vm_file this vaddr is mapped at. Change build_probe_list() and find_active_uprobe() to use the new helper, the next patch adds another user. Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <[email protected]> Cc: Srikar Dronamraju <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2012-07-30uprobes: Teach build_probe_list() to consider the rangeOleg Nesterov1-53/+50
Currently build_probe_list() builds the list of all uprobes attached to the given inode, and the caller should filter out those who don't fall into the [start,end) range, this is sub-optimal. This patch turns find_least_offset_node() into find_node_in_range() which returns the first node inside the [min,max] range, and changes build_probe_list() to use this node as a starting point for rb_prev() and rb_next() to find all other nodes the caller needs. The resulting list is no longer sorted but we do not care. This can speed up both build_probe_list() and the callers, but there is another reason to introduce find_node_in_range(). It can be used to figure out whether the given vma has uprobes or not, this will be needed soon. While at it, shift INIT_LIST_HEAD(tmp_list) into build_probe_list(). Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <[email protected]> Cc: Srikar Dronamraju <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2012-07-30uprobes: Remove insert_vm_struct()->uprobe_mmap()Oleg Nesterov1-3/+0
Remove insert_vm_struct()->uprobe_mmap(). It is not needed, nobody except arch/ia64/kernel/perfmon.c uses insert_vm_struct(vma) with vma->vm_file != NULL. And it is wrong. Again, get_user_pages() can not succeed before vma_link(vma) makes is visible to find_vma(). And even if this worked, we must not insert the new bp before this mapping is visible to vma_prio_tree_foreach() for uprobe_unregister(). Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <[email protected]> Cc: Srikar Dronamraju <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2012-07-30uprobes: Remove copy_vma()->uprobe_mmap()Oleg Nesterov1-3/+0
Remove copy_vma()->uprobe_mmap(new_vma), it is absolutely wrong. This new_vma was just initialized to represent the new unmapped area, [vm_start, vm_end) was returned by get_unmapped_area() in the caller. This means that uprobe_mmap()->get_user_pages() will fail for sure, simply because find_vma() can never succeed. And I verified that sys_mremap()->mremap_to() indeed always fails with the wrong ENOMEM code if [addr, addr+old_len] is probed. And why this uprobe_mmap() was added? I believe the intent was wrong. Note that the caller is going to do move_page_tables(), all registered uprobes are already faulted in, we only change the virtual addresses. NOTE: However, somehow we need to close the race with uprobe_register() which relies on map_info->vaddr. This needs another fix I'll try to do later. Probably we need uprobe_mmap() in move_vma() but we can not do this right now, this can confuse uprobes_state.counter (which I still hope we are going to kill). Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <[email protected]> Cc: Srikar Dronamraju <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2012-07-30uprobes: Fix overflow in vma_address()/find_active_uprobe()Oleg Nesterov1-2/+2
vma->vm_pgoff is "unsigned long", it should be promoted to loff_t before the multiplication to avoid the overflow. Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <[email protected]> Cc: Srikar Dronamraju <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2012-07-30uprobes: Suppress uprobe_munmap() from mmput()Oleg Nesterov1-0/+3
uprobe_munmap() does get_user_pages() and it is also called from the final mmput()->exit_mmap() path. This slows down exit/mmput() for no reason, and I think it is simply dangerous/wrong to try to fault-in a page into the dying mm. If nothing else, this happens after the last sync_mm_rss(), afaics handle_mm_fault() can change the task->rss_stat and make the subsequent check_mm() unhappy. Change uprobe_munmap() to check mm->mm_users != 0. Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <[email protected]> Cc: Srikar Dronamraju <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2012-07-30uprobes: Uprobe_mmap/munmap needs list_for_each_entry_safe()Oleg Nesterov1-4/+4
The bug was introduced by me in 449d0d7c ("uprobes: Simplify the usage of uprobe->pending_list"). Yes, we do not care about uprobe->pending_list after return and nobody can remove the current list entry, but put_uprobe(uprobe) can actually free it and thus we need list_for_each_safe(). Reported-by: Srikar Dronamraju <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2012-07-30uprobes: Clean up and document write_opcode()->lock_page(old_page)Oleg Nesterov1-13/+14
The comment above write_opcode()->lock_page(old_page) tells about the race with do_wp_page(). I don't really understand which exactly race it means, but afaics this lock_page() was not enough to close all races with do_wp_page(). Anyway, since: 77fc4af1b59d uprobes: Change register_for_each_vma() to take mm->mmap_sem for writing this code is always called with ->mmap_sem held for writing, so we can forget about do_wp_page(). However, we can't simply remove this lock_page(), and the only (afaics) reason is __replace_page()->try_to_free_swap(). Nothing in write_opcode() needs it, move it into __replace_page() and fix the comment. Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <[email protected]> Cc: Srikar Dronamraju <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2012-07-30uprobes: Kill write_opcode()->lock_page(new_page)Oleg Nesterov1-2/+0
write_opcode() does lock_page(new_page) for no reason. Nobody can see this page until __replace_page() exposes it under ptl lock, and we do nothing with this page after pte_unmap_unlock(). If nothing else, the similar code in do_wp_page() doesn't lock the new page for page_add_new_anon_rmap/set_pte_at_notify. Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <[email protected]> Cc: Srikar Dronamraju <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2012-07-30uprobes: __replace_page() should not use page_address_in_vma()Oleg Nesterov1-7/+4
page_address_in_vma(old_page) in __replace_page() is ugly and wrong. The caller already knows the correct virtual address, this page was found by get_user_pages(vaddr). However, page_address_in_vma() can actually fail if page->mapping was cleared by __delete_from_page_cache() after get_user_pages() returns. But this means the race with page reclaim, write_opcode() should not fail, it should retry and read this page again. Probably the race with remove_mapping() is not possible due to page_freeze_refs() logic, but afaics at least shmem_writepage()->shmem_delete_from_page_cache() can clear ->mapping. We could change __replace_page() to return -EAGAIN in this case, but it would be better to simply use the caller's vaddr and rely on page_check_address(). Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <[email protected]> Cc: Srikar Dronamraju <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>