path: root/drivers/infiniband/core/umem.c
2020-10-16  Merge branch 'dynamic_sg' into rdma.git for-next  (Jason Gunthorpe, 1 file, -82/+12)
From Maor Gottlieb:

====================
This series extends __sg_alloc_table_from_pages() to allow chaining new pages to an already initialized SG table. This lets drivers use the optimization of merging contiguous pages without having to preallocate all the pages and hold them in a very large temporary buffer before the SG table is initialized.

The last patch changes the InfiniBand core to use the new API. It removes duplicated functionality from the code and benefits from allocating the SG table dynamically from pages: on a system using 2MB huge pages, without this change the SG table contains 512x more entries than needed.
====================

* branch 'dynamic_sg':
  RDMA/umem: Move to allocate SG table from pages
  lib/scatterlist: Add support in dynamic allocation of SG table from pages
  tools/testing/scatterlist: Show errors in human readable form
  tools/testing/scatterlist: Rejuvenate bit-rotten test
2020-10-05  RDMA/umem: Move to allocate SG table from pages  (Maor Gottlieb, 1 file, -82/+12)
Remove the implementation of ib_umem_add_sg_table() and instead call __sg_alloc_table_from_pages(), which already has the logic to merge contiguous pages. Besides removing duplicated functionality, this reduces the memory consumption of the SG table significantly: prior to this patch the SG table was allocated up front without taking contiguous pages into account. On a system using 2MB huge pages, the SG table would contain 512x more entries than needed. E.g. for a 100GB memory registration:

            Number of entries    Size
  Before    26214400             600.0MB
  After     51200                1.2MB

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Maor Gottlieb <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
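As an illustration of the page-merging behaviour this commit relies on, here is a minimal sketch using the plain sg_alloc_table_from_pages() helper (umem.c itself uses the extended, chaining __sg_alloc_table_from_pages() variant from this series; its extra parameters are not shown). The function and variable names in the sketch are illustrative, not taken from umem.c.

	#include <linux/scatterlist.h>

	/*
	 * Sketch only: build an SG table from an array of pinned pages and let
	 * the scatterlist core merge physically contiguous pages into single
	 * entries.  With 2MB huge pages backing the pinned range, 512
	 * consecutive 4K pages collapse into one scatterlist entry instead of
	 * 512 separate ones.
	 */
	static int sketch_build_sgt(struct sg_table *sgt, struct page **pages,
				    unsigned int n_pages, unsigned long total_len)
	{
		/* offset 0 into the first page; GFP_KERNEL for the table itself */
		return sg_alloc_table_from_pages(sgt, pages, n_pages, 0,
						 total_len, GFP_KERNEL);
	}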
2020-09-11  RDMA/mlx4: Use ib_umem_num_dma_blocks()  (Jason Gunthorpe, 1 file, -12/+0)
For the calls linked to mlx4_ib_umem_calc_optimal_mtt_size(), use ib_umem_num_dma_blocks() inside the function; it is just some weird static default. All other places are just using it with PAGE_SIZE, so switch them to ib_umem_num_dma_blocks(). As this is the last call site, remove ib_umem_num_count().

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jason Gunthorpe <[email protected]>
2020-09-11  RDMA/umem: Split ib_umem_num_pages() into ib_umem_num_dma_blocks()  (Jason Gunthorpe, 1 file, -1/+6)
ib_umem_num_pages() should only be used by things working with the SGL in CPU pages directly. Drivers building DMA lists should use the new ib_umem_num_dma_blocks(), which returns the number of blocks rdma_umem_for_each_block() will return. Making this general for DMA drivers requires a different implementation: computing the DMA block count based on umem->address only works if the requested page size is < PAGE_SIZE and/or the IOVA == umem->address. Instead, the number of DMA pages should be computed in the IOVA address space, not from umem->address. Thus the IOVA has to be stored inside the umem so it can be used for these calculations. For now set it to umem->address by default and fix it up if ib_umem_find_best_pgsz() was called. This allows drivers to be converted to ib_umem_num_dma_blocks() safely.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jason Gunthorpe <[email protected]>
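A sketch of the block-count arithmetic described above, computed in IOVA space so that a non-block-aligned IOVA still accounts for the leading and trailing partial blocks. This mirrors the idea in the commit message and is presented as an illustration, not the exact body of ib_umem_num_dma_blocks().

	#include <linux/kernel.h>	/* ALIGN, ALIGN_DOWN */
	#include <linux/types.h>

	/*
	 * Illustration: how many aligned DMA blocks of size 'pgsz' are needed
	 * to cover the IOVA range [iova, iova + length)?  Rounding is done on
	 * the IOVA, not on the CPU virtual address.
	 */
	static inline size_t sketch_num_dma_blocks(u64 iova, u64 length,
						   unsigned long pgsz)
	{
		return (ALIGN(iova + length, pgsz) - ALIGN_DOWN(iova, pgsz)) / pgsz;
	}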
2020-09-09  RDMA/umem: Use simpler logic for ib_umem_find_best_pgsz()  (Jason Gunthorpe, 1 file, -3/+8)
The calculation in rdma_find_pg_bit() is fairly complicated, and the function is never called anywhere else. Inline a simpler version into ib_umem_find_best_pgsz() Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jason Gunthorpe <[email protected]>
2020-09-09  RDMA/umem: Prevent small pages from being returned by ib_umem_find_best_pgsz()  (Jason Gunthorpe, 1 file, -0/+6)
rdma_for_each_block() makes assumptions about how the SGL is constructed that don't work if the block size is below the page size used to build the SGL. The rules for umem SGL construction require that the SG's all be PAGE_SIZE aligned and we don't encode the actual byte offset of the VA range inside the SGL using offset and length. So rdma_for_each_block() has no idea where the actual starting/ending point is to compute the first/last block boundary if the starting address falls within an SGL. Fixing the SGL construction turns out to be really hard, and will be the subject of other patches. For now block smaller pages.

Fixes: 4a35339958f1 ("RDMA/umem: Add API to find best driver supported page size in an MR")
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Leon Romanovsky <[email protected]>
Reviewed-by: Shiraz Saleem <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
2020-09-09  RDMA/umem: Fix ib_umem_find_best_pgsz() for mappings that cross a page boundary  (Jason Gunthorpe, 1 file, -2/+7)
It is possible for a single SGL to span an aligned boundary, e.g. if the SGL is

  61440 -> 90112

then the length is 28672, which currently limits the block size to 32k. With a 32k page size the two covering blocks will be:

  32768 -> 65536 and 65536 -> 98304

However, the correct answer is a 128K block size, which will span the whole 28672 bytes in a single block. Instead of limiting based on length, figure out which high IOVA bits don't change between the start and end addresses. That is the highest useful page size.

Fixes: 4a35339958f1 ("RDMA/umem: Add API to find best driver supported page size in an MR")
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Leon Romanovsky <[email protected]>
Reviewed-by: Shiraz Saleem <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
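To see why 128K is the answer for the 61440 -> 90112 example, here is a small sketch of the "highest IOVA bit that changes" reasoning from the commit message. It illustrates the idea only; it is not the code that was added to ib_umem_find_best_pgsz().

	#include <linux/bitops.h>	/* fls64 */
	#include <linux/types.h>

	/*
	 * Illustration: the largest useful block size is one bit above the
	 * highest bit that differs between the first and last byte of the
	 * range.  For start = 61440 (0xF000) and an exclusive end of 90112,
	 * the last byte is 90111 (0x15FFF).  0xF000 ^ 0x15FFF = 0x1AFFF, whose
	 * most significant set bit is bit 16 (fls64() returns 17), so the
	 * largest single block covering both addresses is 1 << 17 = 128K.
	 */
	static unsigned long sketch_max_block_size(u64 start, u64 last)
	{
		u64 diff = start ^ last;

		if (!diff)
			return 0;	/* degenerate: any block size covers it */
		return 1UL << fls64(diff);
	}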
2020-07-31  RDMA/umem: Add a schedule point in ib_umem_get()  (Eric Dumazet, 1 file, -0/+1)
Mapping as little as 64GB can take more than 10 seconds, triggering issues on kernels with CONFIG_PREEMPT_NONE=y. ib_umem_get() already splits the work in 2MB units on x86_64, adding a cond_resched() in the long-lasting loop is enough to solve the issue. Note that sg_alloc_table() can still use more than 100 ms, which is also problematic. This might be addressed later in ib_umem_add_sg_table(), adding new blocks in sgl on demand. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
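The fix itself is a single line; below is a sketch of what the pinning loop looks like with the schedule point added. The chunking constant and surrounding names are based on the description above (one page of page pointers, i.e. 2MB of user memory per iteration on x86_64), not a verbatim copy of ib_umem_get().

	#include <linux/mm.h>
	#include <linux/sched.h>

	/*
	 * Sketch: pin 'npages' of user memory starting at 'cur_base' in chunks
	 * and yield the CPU between chunks, so a CONFIG_PREEMPT_NONE=y kernel
	 * does not stall for seconds on huge registrations.
	 */
	static long sketch_pin_range(unsigned long cur_base, unsigned long npages,
				     unsigned int gup_flags, struct page **page_list)
	{
		long pinned, total = 0;

		while (npages) {
			cond_resched();		/* the one-line fix: a schedule point */
			pinned = pin_user_pages_fast(cur_base,
						     min_t(unsigned long, npages,
							   PAGE_SIZE / sizeof(struct page *)),
						     gup_flags, page_list);
			if (pinned <= 0)
				return pinned;

			cur_base += pinned * PAGE_SIZE;
			npages -= pinned;
			total += pinned;
			/* ... the real code appends 'pinned' pages to the SG table ... */
		}
		return total;
	}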
2020-02-13  RDMA/core: Add weak ordering dma attr to dma mapping  (Michael Guralnik, 1 file, -4/+7)
Memory regions registered with IB_ACCESS_RELAXED_ORDERING will be DMA mapped with DMA_ATTR_WEAK_ORDERING. This allows reads and writes to the mapping to be weakly ordered, which can enhance performance on architectures that support it.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Michael Guralnik <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
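A sketch of how such an access-flag-to-DMA-attribute translation typically looks at the umem mapping step. The flag and attribute names come from the commit above; the helper name and surrounding variables are assumptions for illustration.

	#include <linux/dma-mapping.h>
	#include <rdma/ib_verbs.h>

	/*
	 * Sketch: pick DMA attributes for the umem mapping based on the MR
	 * access flags the caller passed in.  IB_ACCESS_RELAXED_ORDERING opts
	 * the mapping into weak ordering where the architecture supports it.
	 */
	static unsigned long sketch_umem_dma_attrs(int access)
	{
		unsigned long dma_attrs = 0;

		if (access & IB_ACCESS_RELAXED_ORDERING)
			dma_attrs |= DMA_ATTR_WEAK_ORDERING;
		return dma_attrs;
	}

	/* The attrs are then fed to the SG mapping call, roughly:
	 * nmap = ib_dma_map_sg_attrs(device, sgl, nents, DMA_BIDIRECTIONAL, dma_attrs);
	 */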
2020-01-31  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma  (Linus Torvalds, 1 file, -3/+6)
Pull rdma updates from Jason Gunthorpe:
 "A very quiet cycle with few notable changes. Mostly the usual list of one or two patches to drivers changing something that isn't quite rc worthy. The subsystem seems to be seeing a larger number of rework and cleanup style patches right now, I feel that several vendors are prepping their drivers for new silicon.

  Summary:

   - Driver updates and cleanup for qedr, bnxt_re, hns, siw, mlx5, mlx4, rxe, i40iw

   - Larger series doing cleanup and rework for hns and hfi1.

   - Some general reworking of the CM code to make it a little more understandable

   - Unify the different code paths connected to the uverbs FD scheme

   - New UAPI ioctls conversions for get context and get async fd

   - Trace points for CQ and CM portions of the RDMA stack

   - mlx5 driver support for virtio-net formatted rings as RDMA raw ethernet QPs

   - verbs support for setting the PCI-E relaxed ordering bit on DMA traffic connected to a MR

   - A couple of bug fixes that came too late to make rc7"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (108 commits)
  RDMA/core: Make the entire API tree static
  RDMA/efa: Mask access flags with the correct optional range
  RDMA/cma: Fix unbalanced cm_id reference count during address resolve
  RDMA/umem: Fix ib_umem_find_best_pgsz()
  IB/mlx4: Fix leak in id_map_find_del
  IB/opa_vnic: Spelling correction of 'erorr' to 'error'
  IB/hfi1: Fix logical condition in msix_request_irq
  RDMA/cm: Remove CM message structs
  RDMA/cm: Use IBA functions for complex structure members
  RDMA/cm: Use IBA functions for simple structure members
  RDMA/cm: Use IBA functions for swapping get/set acessors
  RDMA/cm: Use IBA functions for simple get/set acessors
  RDMA/cm: Add SET/GET implementations to hide IBA wire format
  RDMA/cm: Add accessors for CM_REQ transport_type
  IB/mlx5: Return the administrative GUID if exists
  RDMA/core: Ensure that rdma_user_mmap_entry_remove() is a fence
  IB/mlx4: Fix memory leak in add_gid error flow
  IB/mlx5: Expose RoCE accelerator counters
  RDMA/mlx5: Set relaxed ordering when requested
  RDMA/core: Add the core support field to METHOD_GET_CONTEXT
  ...
2020-01-31  mm, tree-wide: rename put_user_page*() to unpin_user_page*()  (John Hubbard, 1 file, -1/+1)
In order to provide a clearer, more symmetric API for pinning and unpinning DMA pages. This way, pin_user_pages*() calls match up with unpin_user_pages*() calls, and the API is a lot closer to being self-explanatory. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: John Hubbard <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Alex Williamson <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Björn Töpel <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Dan Williams <[email protected]> Cc: Hans Verkuil <[email protected]> Cc: Ira Weiny <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Leon Romanovsky <[email protected]> Cc: Mauro Carvalho Chehab <[email protected]> Cc: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-01-31  IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP  (John Hubbard, 1 file, -1/+1)
Convert infiniband to use the new pin_user_pages*() calls. Also, revert earlier changes to Infiniband ODP that had it using put_user_page(). ODP is "Case 3" in Documentation/core-api/pin_user_pages.rst, which is to say, normal get_user_pages() and put_page() is the API to use there. The new pin_user_pages*() calls replace corresponding get_user_pages*() calls, and set the FOLL_PIN flag. The FOLL_PIN flag requires that the caller must return the pages via put_user_page*() calls, but infiniband was already doing that as part of an earlier commit. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: John Hubbard <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Cc: Alex Williamson <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Björn Töpel <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Dan Williams <[email protected]> Cc: Hans Verkuil <[email protected]> Cc: Ira Weiny <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Leon Romanovsky <[email protected]> Cc: Mauro Carvalho Chehab <[email protected]> Cc: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
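A sketch of the pin/unpin pairing this conversion establishes for umem-style code: pages grabbed with pin_user_pages*() (which sets FOLL_PIN) are returned through the unpin_user_page*() family (named put_user_page*() before the rename commit above), never through plain put_page(). The surrounding function and variable names are illustrative.

	#include <linux/mm.h>

	/*
	 * Sketch: FOLL_PIN pairing for a long-lived DMA pin.
	 */
	static int sketch_pin_and_release(unsigned long start, unsigned long nr,
					  struct page **pages, bool dirty)
	{
		int pinned;

		pinned = pin_user_pages_fast(start, nr, FOLL_WRITE | FOLL_LONGTERM,
					     pages);
		if (pinned <= 0)
			return pinned;

		/* ... DMA map and use the pages ... */

		/* Release: optionally mark dirty, then drop the pin. */
		unpin_user_pages_dirty_lock(pages, pinned, dirty);
		return 0;
	}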
2020-01-31  IB/umem: use get_user_pages_fast() to pin DMA pages  (John Hubbard, 1 file, -11/+6)
And get rid of the mmap_sem calls, as part of that. Note that get_user_pages_fast() will, if necessary, fall back to __gup_longterm_unlocked(), which takes the mmap_sem as needed. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: John Hubbard <[email protected]> Reviewed-by: Leon Romanovsky <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Reviewed-by: Ira Weiny <[email protected]> Cc: Alex Williamson <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Björn Töpel <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Dan Williams <[email protected]> Cc: Hans Verkuil <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Mauro Carvalho Chehab <[email protected]> Cc: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-01-28  RDMA/umem: Fix ib_umem_find_best_pgsz()  (Artemy Kovalyov, 1 file, -3/+6)
Except for the last entry, the ending iova alignment sets the maximum possible page size as the low bits of the iova must be zero when starting the next chunk. Fixes: 4a35339958f1 ("RDMA/umem: Add API to find best driver supported page size in an MR") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Artemy Kovalyov <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Tested-by: Gal Pressman <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2020-01-16  IB: Allow calls to ib_umem_get from kernel ULPs  (Moni Shoua, 1 file, -18/+9)
So far the assumption was that ib_umem_get() and ib_umem_odp_get() are called from flows that start in UVERBS and therefore have a user context. This assumption restricts flows that are initiated by ULPs and need the service that ib_umem_get() provides. This patch changes ib_umem_get() and ib_umem_odp_get() to get the IB device directly, relying on the fact that both UVERBS and ULP flows set that field correctly.

Reviewed-by: Guy Levi <[email protected]>
Signed-off-by: Moni Shoua <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
2019-11-17  IB/umem: remove the dmasync argument to ib_umem_get  (Christoph Hellwig, 1 file, -2/+1)
The argument is always ignored, so remove it. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Acked-by: Michal Kalderon <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2019-11-14  dma-mapping: remove the DMA_ATTR_WRITE_BARRIER flag  (Christoph Hellwig, 1 file, -7/+2)
This flag is not implemented by any backend and only set by the ib_umem module in a single instance. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2019-09-24  mm/gup: add make_dirty arg to put_user_pages_dirty_lock()  (John Hubbard, 1 file, -4/+1)
From: John Hubbard <[email protected]>
Subject: mm/gup: add make_dirty arg to put_user_pages_dirty_lock()

Patch series "mm/gup: add make_dirty arg to put_user_pages_dirty_lock()", v3.

There are about 50+ patches in my tree [2], and I'll be sending out the remaining ones in a few more groups:

* The block/bio related changes (Jerome mostly wrote those, but I've had to move stuff around extensively, and add a little code)
* mm/ changes
* other subsystem patches
* an RFC that shows the current state of the tracking patch set. That can only be applied after all call sites are converted, but it's good to get an early look at it.

This is part of a tree-wide conversion, as described in fc1d8e7cca2d ("mm: introduce put_user_page*(), placeholder versions").

This patch (of 3):

Provide a more capable variation of put_user_pages_dirty_lock(), and delete put_user_pages_dirty(). This is based on the following:

1. Lots of call sites become simpler if a bool is passed into put_user_page*(), instead of making the call site choose which put_user_page*() variant to call.

2. Christoph Hellwig's observation that set_page_dirty_lock() is usually correct, and set_page_dirty() is usually a bug, or at least questionable, within a put_user_page*() calling chain.

This leads to the following API choices:

* put_user_pages_dirty_lock(page, npages, make_dirty)
* There is no put_user_pages_dirty(). You have to hand code that, in the rare case that it's required.

[[email protected]: remove unused variable in siw_free_plist()]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: John Hubbard <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
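A before/after sketch of what the make_dirty argument buys at a typical umem release site. It reflects the API at the time of this commit (put_user_pages_dirty_lock(); later renamed to unpin_user_pages_dirty_lock() per the 2020-01-31 entry above); the helper name is illustrative rather than the exact code in __ib_umem_release().

	#include <linux/mm.h>
	#include <rdma/ib_umem.h>

	/*
	 * Sketch: release one pinned page of a umem, marking it dirty only if
	 * the MR is writable and was potentially written.  Before this patch
	 * the caller had to choose between put_user_pages_dirty_lock() and
	 * put_user_page(); now the choice is a bool argument.
	 */
	static void sketch_release_page(struct ib_umem *umem, struct page *page,
					bool dirty)
	{
		put_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
	}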
2019-08-21  RDMA/odp: remove ib_ucontext from ib_umem  (Jason Gunthorpe, 1 file, -2/+2)
At this point the ucontext is only being stored to access the ib_device, so just store the ib_device directly instead. This is more natural and logical as the umem has nothing to do with the ucontext. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jason Gunthorpe <[email protected]>
2019-08-21  RDMA/odp: Provide ib_umem_odp_release() to undo the allocs  (Jason Gunthorpe, 1 file, -16/+4)
Now that there are allocator APIs that return the ib_umem_odp directly it should be freed through a umem_odp free'er as well. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2019-08-21  RDMA/odp: Split creating a umem_odp from ib_umem_get  (Jason Gunthorpe, 1 file, -25/+5)
This is the last creation API that is overloaded for both normal and ODP umems; there is very little code sharing, and a driver has to be specifically ready for a umem_odp to be created in order to use the ODP version.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
2019-08-20  RDMA/mlx5: Fix MR npages calculation for IB_ACCESS_HUGETLB  (Jason Gunthorpe, 1 file, -6/+1)
When ODP is enabled with IB_ACCESS_HUGETLB then the required pages should be calculated based on the extent of the MR, which is rounded to the nearest huge page alignment. Fixes: d2183c6f1958 ("RDMA/umem: Move page_shift from ib_umem to ib_odp_umem") Signed-off-by: Jason Gunthorpe <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Doug Ledford <[email protected]>
2019-06-20  RDMA: Check umem pointer validity prior to release  (Leon Romanovsky, 1 file, -0/+3)
Update ib_umem_release() to behave similarly to kfree() and allow submitting NULL pointer as safe input to this function. Fixes: a52c8e2469c3 ("RDMA: Clean destroy CQ in drivers do not return errors") Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
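A minimal sketch of the kfree()-style guard described above; the rest of the release path is elided.

	/* Sketch: make ib_umem_release() a no-op for a NULL umem, like kfree(NULL). */
	void ib_umem_release(struct ib_umem *umem)
	{
		if (!umem)
			return;

		/* ... unmap, unpin and free the umem as before ... */
	}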
2019-05-27  RDMA: Convert put_page() to put_user_page*()  (John Hubbard, 1 file, -3/+4)
For infiniband code that retains pages via get_user_pages*(), release those pages via the new put_user_page(), or put_user_pages*(), instead of put_page() This is a tiny part of the second step of fixing the problem described in [1]. The steps are: 1) Provide put_user_page*() routines, intended to be used for releasing pages that were pinned via get_user_pages*(). 2) Convert all of the call sites for get_user_pages*(), to invoke put_user_page*(), instead of put_page(). This involves dozens of call sites, and will take some time. 3) After (2) is complete, use get_user_pages*() and put_user_page*() to implement tracking of these pages. This tracking will be separate from the existing struct page refcounting. 4) Use the tracking and identification of these pages, to implement special handling (especially in writeback paths) when the pages are backed by a filesystem. Again, [1] provides details as to why that is desirable. [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()" Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Dennis Dalessandro <[email protected]> Reviewed-by: Ira Weiny <[email protected]> Reviewed-by: Jérôme Glisse <[email protected]> Acked-by: Jason Gunthorpe <[email protected]> Tested-by: Ira Weiny <[email protected]> Signed-off-by: John Hubbard <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2019-05-21  RDMA/umem: Move page_shift from ib_umem to ib_odp_umem  (Jason Gunthorpe, 1 file, -2/+1)
This value has always been set to PAGE_SHIFT in the core code; the only place that did anything different was the ODP path. Move the value into the ODP struct and still use it for ODP, but change all the non-ODP things to just use PAGE_SHIFT/PAGE_SIZE/PAGE_MASK directly.

Reviewed-by: Shiraz Saleem <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
2019-05-14  mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM  (Ira Weiny, 1 file, -2/+3)
Pach series "Add FOLL_LONGTERM to GUP fast and use it". HFI1, qib, and mthca, use get_user_pages_fast() due to its performance advantages. These pages can be held for a significant time. But get_user_pages_fast() does not protect against mapping FS DAX pages. Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which retains the performance while also adding the FS DAX checks. XDP has also shown interest in using this functionality.[1] In addition we change get_user_pages() to use the new FOLL_LONGTERM flag and remove the specialized get_user_pages_longterm call. [1] https://lkml.org/lkml/2019/3/19/939 "longterm" is a relative thing and at this point is probably a misnomer. This is really flagging a pin which is going to be given to hardware and can't move. I've thought of a couple of alternative names but I think we have to settle on if we are going to use FL_LAYOUT or something else to solve the "longterm" problem. Then I think we can change the flag to a better name. Secondly, it depends on how often you are registering memory. I have spoken with some RDMA users who consider MR in the performance path... For the overall application performance. I don't have the numbers as the tests for HFI1 were done a long time ago. But there was a significant advantage. Some of which is probably due to the fact that you don't have to hold mmap_sem. Finally, architecturally I think it would be good for everyone to use *_fast. There are patches submitted to the RDMA list which would allow the use of *_fast (they reworking the use of mmap_sem) and as soon as they are accepted I'll submit a patch to convert the RDMA core as well. Also to this point others are looking to use *_fast. As an aside, Jasons pointed out in my previous submission that *_fast and *_unlocked look very much the same. I agree and I think further cleanup will be coming. But I'm focused on getting the final solution for DAX at the moment. This patch (of 7): This patch starts a series which aims to support FOLL_LONGTERM in get_user_pages_fast(). Some callers who would like to do a longterm (user controlled pin) of pages with the fast variant of GUP for performance purposes. Rather than have a separate get_user_pages_longterm() call, introduce FOLL_LONGTERM and change the longterm callers to use it. This patch does not change any functionality. In the short term "longterm" or user controlled pins are unsafe for Filesystems and FS DAX in particular has been blocked. However, callers of get_user_pages_fast() were not "protected". FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it requires vmas to determine if DAX is in use. NOTE: In merging with the CMA changes we opt to change the get_user_pages() call in check_and_migrate_cma_pages() to a call of __get_user_pages_locked() on the newly migrated pages. This makes the code read better in that we are calling __get_user_pages_locked() on the pages before and after a potential migration. As a side affect some of the interfaces are cleaned up but this is not the primary purpose of the series. In review[1] it was asked: <quote> > This I don't get - if you do lock down long term mappings performance > of the actual get_user_pages call shouldn't matter to start with. > > What do I miss? A couple of points. First "longterm" is a relative thing and at this point is probably a misnomer. This is really flagging a pin which is going to be given to hardware and can't move. 
I've thought of a couple of alternative names but I think we have to settle on if we are going to use FL_LAYOUT or something else to solve the "longterm" problem. Then I think we can change the flag to a better name. Second, It depends on how often you are registering memory. I have spoken with some RDMA users who consider MR in the performance path... For the overall application performance. I don't have the numbers as the tests for HFI1 were done a long time ago. But there was a significant advantage. Some of which is probably due to the fact that you don't have to hold mmap_sem. Finally, architecturally I think it would be good for everyone to use *_fast. There are patches submitted to the RDMA list which would allow the use of *_fast (they reworking the use of mmap_sem) and as soon as they are accepted I'll submit a patch to convert the RDMA core as well. Also to this point others are looking to use *_fast. As an asside, Jasons pointed out in my previous submission that *_fast and *_unlocked look very much the same. I agree and I think further cleanup will be coming. But I'm focused on getting the final solution for DAX at the moment. </quote> [1] https://lore.kernel.org/lkml/[email protected]/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965 [[email protected]: v3] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ira Weiny <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Michal Hocko <[email protected]> Cc: John Hubbard <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Rich Felker <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: James Hogan <[email protected]> Cc: Dan Williams <[email protected]> Cc: Mike Marshall <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-05-06  RDMA/umem: Remove hugetlb flag  (Shiraz Saleem, 1 file, -25/+1)
The drivers i40iw and bnxt_re are no longer dependent on the hugetlb flag, so remove this flag from the ib_umem structure.

Reviewed-by: Michael J. Ruhl <[email protected]>
Signed-off-by: Shiraz Saleem <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
2019-05-06  RDMA/umem: Add API to find best driver supported page size in an MR  (Shiraz Saleem, 1 file, -0/+51)
This helper iterates through the SG list to find the best page size to use from a bitmap of HW supported page sizes. Drivers that support multiple page sizes, but not mixed sizes within an MR, can use this API.

Suggested-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Shiraz Saleem <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
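A sketch of how a driver typically consumes this helper when choosing the translation entry size for an MR. The SZ_* bitmap, the error handling, and the helper name are illustrative assumptions; the argument order (umem, bitmap of supported sizes, IOVA) follows the API introduced here, and a zero return (possible in later kernels) is treated as "no usable size".

	#include <linux/errno.h>
	#include <linux/sizes.h>
	#include <rdma/ib_umem.h>

	/*
	 * Sketch: a driver whose HW supports 4K, 2M and 1G translation entries
	 * asks the core for the best single page size that can describe this
	 * umem at the given IOVA.
	 */
	static int sketch_pick_mr_page_size(struct ib_umem *umem, u64 iova,
					    unsigned long *pgsz)
	{
		*pgsz = ib_umem_find_best_pgsz(umem, SZ_4K | SZ_2M | SZ_1G, iova);
		if (!*pgsz)
			return -EINVAL;	/* no supported page size fits this mapping */
		return 0;
	}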
2019-05-02  RDMA/umem: Handle page combining avoidance correctly in ib_umem_add_sg_table()  (Shiraz Saleem, 1 file, -6/+10)
The flag update_cur_sg tracks whether contiguous pages from a new set of page_list pages can be merged into the SGE passed into ib_umem_add_sg_table(). If this flag is true, but the total segment length exceeds the max_seg_size supported by HW, we avoid combining into this SGE and move to a new SGE (x), merging 'len' pages into it. However, if i < npages, the next iteration can incorrectly merge 'len' contiguous pages into x instead of into a new SGE, since update_cur_sg is still true.

Reset update_cur_sg to false always after the check to merge pages into the first SGE passed in to ib_umem_add_sg_table(). Also, prevent a new SGE's segment length from ever exceeding HW max_seg_sz.

There is a crash on hfi1 as a result of this, wherein max_seg_sz defaults to 64K. Due to the above bug, unfolding SGEs in __ib_umem_release points to a bad page pointer.

TEST comp-wfr.perfnative.STL-22166-WDT _ perftest native 2-Write_4097QP_4MB STARTING at 1555387093

 BUG: Bad page state in process ib_write_bw pfn:7ebca0
 page:ffffcd675faf2800 count:0 mapcount:1 mapping:0000000000000000 index:0x1
 flags: 0x17ffffc0000000()
 raw: 0017ffffc0000000 dead000000000100 dead000000000200 0000000000000000
 raw: 0000000000000001 0000000000000000 0000000000000000 0000000000000000
 page dumped because: nonzero mapcount
 CPU: 18 PID: 15853 Comm: ib_write_bw Tainted: G B 5.1.0-rc4 #1
 Hardware name: Intel Corporation S2600CWR/S2600CW, BIOS SE5C610.86B.01.01.0014.121820151719 12/18/2015
 Call Trace:
  dump_stack+0x5a/0x73
  bad_page+0xf5/0x10f
  free_pcppages_bulk+0x62c/0x680
  free_unref_page+0x54/0x70
  __ib_umem_release+0x148/0x1a0 [ib_uverbs]
  ib_umem_release+0x22/0x80 [ib_uverbs]
  rvt_dereg_mr+0x67/0xb0 [rdmavt]
  ib_dereg_mr_user+0x37/0x60 [ib_core]
  destroy_hw_idr_uobject+0x1c/0x50 [ib_uverbs]
  uverbs_destroy_uobject+0x2e/0x180 [ib_uverbs]
  uobj_destroy+0x4d/0x60 [ib_uverbs]
  __uobj_get_destroy+0x33/0x50 [ib_uverbs]
  __uobj_perform_destroy+0xa/0x30 [ib_uverbs]
  ib_uverbs_dereg_mr+0x66/0x90 [ib_uverbs]
  ib_uverbs_write+0x3e1/0x500 [ib_uverbs]
  vfs_write+0xad/0x1b0
  ksys_write+0x5a/0xd0
  do_syscall_64+0x5b/0x180
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: d10bcf947a3e ("RDMA/umem: Combine contiguous PAGE_SIZE regions in SGEs")
Tested-by: Mike Marciniszyn <[email protected]>
Reviewed-by: Michael J. Ruhl <[email protected]>
Signed-off-by: Shiraz Saleem <[email protected]>
Reviewed-by: Dennis Dalessandro <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
2019-04-08  RDMA/umem: Use correct value for SG entries in sg_copy_to_buffer()  (Shiraz Saleem, 1 file, -2/+2)
With page combining, the assumption that the number of SG entries in the umem SGL equals the number of system pages in the umem no longer holds. umem->sg_nents tracks the SG entries in the umem SGL. Use it in sg_pcopy_to_buffer() as opposed to ib_umem_num_pages(umem).

Fixes: d10bcf947a3e ("RDMA/umem: Combine contiguous PAGE_SIZE regions in SGEs")
Reported-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Shiraz Saleem <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
2019-04-08  RDMA/umem: Combine contiguous PAGE_SIZE regions in SGEs  (Shiraz Saleem, 1 file, -20/+81)
Combine contiguous regions of PAGE_SIZE pages into a single scatter list entry while building the scatter table for a umem. This minimizes the number of entries in the scatter list and reduces the DMA mapping overhead, particularly with the IOMMU. Set the default max_seg_size in the core for IB devices to 2G and do not combine if we exceed this limit. Also, purge npages in struct ib_umem as we now DMA map the umem SGL with sg_nents, and the npages computation is not needed. Drivers should now be using ib_umem_num_pages(), so fix the last stragglers. Move npages tracking to ib_umem_odp as ODP drivers still need it.

Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Michael J. Ruhl <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Acked-by: Adit Ranadive <[email protected]>
Signed-off-by: Shiraz Saleem <[email protected]>
Tested-by: Gal Pressman <[email protected]>
Tested-by: Selvin Xavier <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
2019-03-26  IB/core: Ensure an invalidate_range callback on ODP MR  (Ira Weiny, 1 file, -0/+5)
No device supports ODP MR without an invalidate_range callback. Warn on any device which attempts to support ODP without supplying this callback. Then we can remove the checks for the callback within the code. This stems from the discussion https://www.spinics.net/lists/linux-rdma/msg76460.html ...which concluded this code was no longer necessary.

Acked-by: John Hubbard <[email protected]>
Reviewed-by: Haggai Eran <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
2019-02-15  IB/uverbs: Add ib_ucontext to uverbs_attr_bundle sent from ioctl and cmd flows  (Shamir Rabinovitch, 1 file, -3/+7)
Add ib_ucontext to the uverbs_attr_bundle sent down the ioctl and cmd flows as soon as the flow has an ib_uobject. In addition, remove the rdma_get_ucontext helper function that is only used by ib_umem_get.

Signed-off-by: Shamir Rabinovitch <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
2019-02-07  drivers/IB,core: reduce scope of mmap_sem  (Davidlohr Bueso, 1 file, -39/+2)
ib_umem_get() uses gup_longterm() and relies on the lock to stabilize the vma_list, so we cannot really get rid of mmap_sem altogether, but now that the counter is atomic, we can get rid of some of the complexity that mmap_sem brings with only pinned_vm.

Reviewed-by: Ira Weiny <[email protected]>
Signed-off-by: Davidlohr Bueso <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
2019-02-07  mm: make mm->pinned_vm an atomic64 counter  (Davidlohr Bueso, 1 file, -6/+6)
Taking a sleeping lock to _only_ increment a variable is quite the overkill, and pretty much all users do this. Furthermore, some drivers (i.e. infiniband and scif) that need pinned semantics can go to quite some trouble to actually delay the (un)accounting of pinned pages via a workqueue when the lock cannot be acquired. By making the counter atomic we no longer need to hold the mmap_sem and can simplify some code around it for pinned_vm users. The counter is 64-bit so that we need not worry about overflows, e.g. from RDMA user input controlled from userspace.

Reviewed-by: Ira Weiny <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
Reviewed-by: Daniel Jordan <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Davidlohr Bueso <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
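A sketch of what lockless pinned-page accounting looks like for a pinned_vm user such as umem after this change. The limit check and names are illustrative, not a verbatim copy of ib_umem_get().

	#include <linux/atomic.h>
	#include <linux/capability.h>
	#include <linux/mm.h>
	#include <linux/sched/signal.h>	/* rlimit() */

	/*
	 * Sketch: account 'npages' of pinned memory against RLIMIT_MEMLOCK
	 * without taking mmap_sem, now that mm->pinned_vm is an atomic64_t.
	 */
	static int sketch_account_pinned(struct mm_struct *mm, unsigned long npages)
	{
		unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
		u64 new_pinned;

		new_pinned = atomic64_add_return(npages, &mm->pinned_vm);
		if (new_pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
			atomic64_sub(npages, &mm->pinned_vm);	/* undo on failure */
			return -ENOMEM;
		}
		return 0;
	}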
2019-01-10  IB/{core,hw}: Have ib_umem_get extract the ib_ucontext from ib_udata  (Jason Gunthorpe, 1 file, -2/+7)
ib_umem_get() can only be called in a method callback, which always has a udata parameter. This allows ib_umem_get() to derive the ucontext pointer directly from the udata without requiring the drivers to find it in some way or another. Signed-off-by: Jason Gunthorpe <[email protected]> Signed-off-by: Shamir Rabinovitch <[email protected]>
2018-09-27  RDMA/core: Acquire and release mmap_sem on page range  (Parav Pandit, 1 file, -2/+5)
Currently mmap_sem is read locked while pinning the memory. In a multi-threaded application of a process, holding mmap_sem lock creates contention with other threads who might be either registering memory, creating QPs or simply doing mmap() as such operations also require to hold the mmap_sem write lock. All such operation cannot make forward progress until one memory pin operation is completed. It becomes more worse if the memory is unpinned and/or memory registration is large (in GB range). Therefore, instead of holding mmap_sem for too long (for whole region pinning), acquire and release the lock for every few pages. For example on x86 with 4K page size, acquire and release mmap_sem for every 2Mbytes memory chunk. This allows other competing threads to make progress who might wish to hold mmap_sem for shorter duration. When memory registration latency is measured using [1] for memory sizes ranging from 4K to 48GB, <= 1% or 0.5% degradation is noticed. In many runs no difference is seen other than run-to-run variance. In other targeted tests of users with large memory, desired improvements are seen due to reduced contention of mmap_sem. [1] https://github.com/paravmellanox/rtool $ rdma_resource_lat -c 1 -s 48G -a -u L -i 500 -A It registers pinned memory from 4K to 48GB size with 500 iterations for each memory size. $ rdma_resource_lat -c 1 -s 12G -a -u L -i 500 -t 4 4 competing threads pin memory, each of 12GB size with 500 iterations. Signed-off-by: Parav Pandit <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
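A sketch of the chunked locking pattern described above: take mmap_sem only around each get_user_pages() chunk instead of across the whole registration. It uses the GUP and mmap_sem naming of that kernel era; the chunk sizing, helper name and error handling are illustrative assumptions.

	#include <linux/mm.h>
	#include <linux/sched/mm.h>

	/*
	 * Sketch: pin a large user range while holding mmap_sem only for one
	 * chunk at a time (512 pages, i.e. 2MB with 4K pages), so that other
	 * threads doing mmap()/registration can make progress in between.
	 */
	static long sketch_pin_chunked(struct mm_struct *mm, unsigned long start,
				       unsigned long npages, unsigned int gup_flags,
				       struct page **page_list)
	{
		unsigned long chunk = PAGE_SIZE / sizeof(struct page *);	/* 512 */
		long got, done = 0;

		while (npages) {
			down_read(&mm->mmap_sem);
			got = get_user_pages(start + done * PAGE_SIZE,
					     min(npages, chunk), gup_flags,
					     page_list, NULL);
			up_read(&mm->mmap_sem);
			if (got <= 0)
				return got ? got : -EFAULT;

			done += got;
			npages -= got;
			/* ... add 'got' pages to the SG table outside the lock ... */
		}
		return done;
	}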
2018-09-25  RDMA/umem: Fix potential addition overflow  (Doug Ledford, 1 file, -3/+5)
Given a large enough memory allocation, it is possible to wrap the pinned_vm counter. Check for addition overflow to prevent such eventualities. Fixes: 40ddacf2dda9 ("RDMA/umem: Don't hold mmap_sem for too long") Reported-by: Jason Gunthorpe <[email protected]> Signed-off-by: Doug Ledford <[email protected]> Reviewed-by: Leon Romanovsky <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
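A sketch of an overflow-safe accounting check of the kind this fix describes, using check_add_overflow() from linux/overflow.h. The exact placement within ib_umem_get() and the helper name are assumptions; pinned_vm is shown as a plain counter because this predates the atomic64 conversion, and the caller is assumed to hold whatever lock protects it.

	#include <linux/capability.h>
	#include <linux/errno.h>
	#include <linux/overflow.h>

	/*
	 * Sketch: bump the pinned-page counter by npages, failing both on
	 * arithmetic overflow and on exceeding the memlock limit (unless the
	 * task has CAP_IPC_LOCK).
	 */
	static int sketch_try_account(unsigned long *pinned_vm, unsigned long npages,
				      unsigned long lock_limit)
	{
		unsigned long new_pinned;

		if (check_add_overflow(*pinned_vm, npages, &new_pinned) ||
		    (new_pinned > lock_limit && !capable(CAP_IPC_LOCK)))
			return -ENOMEM;

		*pinned_vm = new_pinned;
		return 0;
	}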
2018-09-25  RDMA/umem: Minor optimizations  (Doug Ledford, 1 file, -8/+7)
Noticed while reviewing commit d4b4dd1b9706 ("RDMA/umem: Do not use current->tgid to track the mm_struct") patch. Why would we take a lock, adjust a protected variable, drop the lock, and *then* check the input into our protected variable adjustment? Then we have to take the lock again on our error unwind. Let's just check the input early and skip taking the locks needlessly if the input isn't valid. It was also noticed that we set mm = current->mm, we then never modify mm, but we still go back and reference current->mm a number of times needlessly. Be consistent in using the stored reference in mm. Signed-off-by: Doug Ledford <[email protected]> Reviewed-by: Leon Romanovsky <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-09-21  RDMA/umem: Get rid of struct ib_umem.odp_data  (Jason Gunthorpe, 1 file, -4/+4)
This no longer has any use, we can use container_of to get to the umem_odp, and a simple flag to indicate if this is an odp MR. Remove the few remaining references to it. Signed-off-by: Jason Gunthorpe <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2018-09-21  RDMA/umem: Make ib_umem_odp into a sub structure of ib_umem  (Jason Gunthorpe, 1 file, -14/+22)
These two structures are linked together, use the container_of pattern instead of a double allocation to make the code simpler and easier to follow. Signed-off-by: Jason Gunthorpe <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2018-09-21  RDMA/umem: Use ib_umem_odp in all function signatures connected to ODP  (Jason Gunthorpe, 1 file, -1/+1)
All of these functions already require the ODP version of the umem struct, make this very clear by having the signature require it. This paves the way to using the container_of() pattern to link umem_odp and umem together. Signed-off-by: Jason Gunthorpe <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2018-09-20  RDMA/umem: Do not use current->tgid to track the mm_struct  (Jason Gunthorpe, 1 file, -41/+36)
This is just wrong, the process that calls into the reg_mr is the process associated with the umem, and that does not have to be the same process that created the context. When this code was first written mmgrab() didn't exist, however these days we can just directly hold the mm_struct pointer in the umem and have no ambiguity when it comes to releasing the umem as to which mm it was associated with. Signed-off-by: Jason Gunthorpe <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2018-07-13  RDMA/umem: Refactor exit paths in ib_umem_get  (Leon Romanovsky, 1 file, -24/+20)
Simplify exit paths in ib_umem_get to use the standard goto unwind pattern. Signed-off-by: Leon Romanovsky <[email protected]> Reviewed-by: Michael J. Ruhl <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-07-13  RDMA/umem: Don't hold mmap_sem for too long  (Leon Romanovsky, 1 file, -10/+14)
DMA mapping is a time consuming operation and doesn't need to be performed while the mmap_sem semaphore is held. The semaphore only needs to be held for accounting and get_user_pages related activities.

Signed-off-by: Huy Nguyen <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
2018-06-26  RDMA/umem: Don't check for a negative return value of dma_map_sg_attrs()  (Leon Romanovsky, 1 file, -1/+1)
dma_map_sg_attrs() returns 0 on error and can't return a negative number (ensured by BUG_ON), so don't check. Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>
2018-05-28  Merge branch 'mr_fix' into git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma for-next  (Jason Gunthorpe, 1 file, -16/+2)

Update mlx4 to support user MR creation against read-only memory; previously it required the memory to be writable. Based on rdma for-rc due to dependencies.

* mr_fix: (2 commits)
  IB/mlx4: Mark user MR as writable if actual virtual memory is writable
  IB/core: Make testing MR flags for writability a static inline function
2018-05-28  IB/core: Make testing MR flags for writability a static inline function  (Jack Morgenstein, 1 file, -10/+1)
Make the MR writability flags check, which is performed in umem.c, a static inline function in ib_verbs.h. This allows the function to be used by low-level InfiniBand drivers.

Cc: <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Jack Morgenstein <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
2018-05-15  IB/umem: Use the correct mm during ib_umem_release  (Lidong Chen, 1 file, -6/+1)
User-space may invoke ibv_reg_mr and ibv_dereg_mr in different threads. If ibv_dereg_mr is called after the thread which invoked ibv_reg_mr has exited, get_pid_task will return NULL and ib_umem_release will not decrease mm->pinned_vm. Instead of using threads to locate the mm, use the overall tgid from the ib_ucontext struct. This matches the behavior of ODP and disassociate in handling the mm of the process that called ibv_reg_mr.

Cc: <[email protected]>
Fixes: 87773dd56d54 ("IB: ib_umem_release() should decrement mm->pinned_vm from ib_umem_get")
Signed-off-by: Lidong Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
2018-05-15  IB/core: Remove redundant return  (Yuval Shaia, 1 file, -2/+0)
"return" statement at the end of void function is redundant, removing it. Signed-off-by: Yuval Shaia <[email protected]> Reviewed-by: Zhu Yanjun <[email protected]> Reviewed-by: Qing Huang <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]>