aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2020-04-02mm: introduce page_ref_sub_return()John Hubbard1-0/+9
An upcoming patch requires subtracting a large chunk of refcounts from a page, and checking what the resulting refcount is. This is a little different than the usual "check for zero refcount" that many of the page ref functions already do. However, it is similar to a few other routines that (like this one) are generally useful for things such as 1-based refcounting. Add page_ref_sub_return(), that subtracts a chunk of refcounts atomically, and returns an atomic snapshot of the result. Signed-off-by: John Hubbard <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Ira Weiny <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Al Viro <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Vlastimil Babka <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02mm/gup: pass a flags arg to __gup_device_* functionsJohn Hubbard1-10/+18
A subsequent patch requires access to gup flags, so pass the flags argument through to the __gup_device_* functions. Also placate checkpatch.pl by shortening a nearby line. Signed-off-by: John Hubbard <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Jérôme Glisse <[email protected]> Reviewed-by: Ira Weiny <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Al Viro <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Vlastimil Babka <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02mm/gup: split get_user_pages_remote() into two routinesJohn Hubbard1-23/+33
Patch series "mm/gup: track FOLL_PIN pages", v6. This activates tracking of FOLL_PIN pages. This is in support of fixing the get_user_pages()+DMA problem described in [1]-[4]. FOLL_PIN support is now in the main linux tree. However, the patch to use FOLL_PIN to track pages was *not* submitted, because Leon saw an RDMA test suite failure that involved (I think) page refcount overflows when huge pages were used. This patch definitively solves that kind of overflow problem, by adding an exact pincount, for compound pages (of order > 1), in the 3rd struct page of a compound page. If available, that form of pincounting is used, instead of the GUP_PIN_COUNTING_BIAS approach. Thanks again to Jan Kara for that idea. Other interesting changes: * dump_page(): added one, or two new things to report for compound pages: head refcount (for all compound pages), and map_pincount (for compound pages of order > 1). * Documentation/core-api/pin_user_pages.rst: removed the "TODO" for the huge page refcount upper limit problems, and added notes about how it works now. Also added a note about the dump_page() enhancements. * Added some comments in gup.c and mm.h, to explain that there are two ways to count pinned pages: exact (for compound pages of order > 1) and fuzzy (GUP_PIN_COUNTING_BIAS: for all other pages). ============================================================ General notes about the tracking patch: This is a prerequisite to solving the problem of proper interactions between file-backed pages, and [R]DMA activities, as discussed in [1], [2], [3], [4] and in a remarkable number of email threads since about 2017. :) In contrast to earlier approaches, the page tracking can be incrementally applied to the kernel call sites that, until now, have been simply calling get_user_pages() ("gup"). In other words, opt-in by changing from this: get_user_pages() (sets FOLL_GET) put_page() to this: pin_user_pages() (sets FOLL_PIN) unpin_user_page() ============================================================ Future steps: * Convert more subsystems from get_user_pages() to pin_user_pages(). The first probably needs to be bio/biovecs, because any filesystem testing is too difficult without those in place. * Change VFS and filesystems to respond appropriately when encountering dma-pinned pages. * Work with Ira and others to connect this all up with file system leases. [1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/ [2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/ [3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/ [4] LWN kernel index: get_user_pages() https://lwn.net/Kernel/Index/#Memory_management-get_user_pages This patch (of 12): An upcoming patch requires reusing the implementation of get_user_pages_remote(). Split up get_user_pages_remote() into an outer routine that checks flags, and an implementation routine that will be reused. This makes subsequent changes much easier to understand. There should be no change in behavior due to this patch. Signed-off-by: John Hubbard <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Ira Weiny <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Al Viro <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Vlastimil Babka <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02mm/filemap.c: rewrite pagecache_get_page documentationMatthew Wilcox (Oracle)1-26/+23
- These were never called PCG flags; they've been called FGP flags since their introduction in 2014. - The FGP_FOR_MMAP flag was misleadingly documented as if it was an alternative to FGP_CREAT instead of an option to it. - Rename the 'offset' parameter to 'index'. - Capitalisation, formatting, rewording. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Pankaj Gupta <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02mm/filemap.c: unexport find_get_entryMatthew Wilcox (Oracle)1-1/+0
No in-tree users (proc, madvise, memcg, mincore) can be built as a module. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Pankaj Gupta <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02mm/page-writeback.c: use VM_BUG_ON_PAGE in clear_page_dirty_for_ioMatthew Wilcox (Oracle)1-1/+1
Dumping the page information in this circumstance helps for debugging. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Pankaj Gupta <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02include/linux/pagemap.h: rename arguments to find_subpageMatthew Wilcox (Oracle)1-5/+10
This isn't just a random struct page, it's known to be a head page, and calling it head makes the function better self-documenting. The pgoff_t is less confusing if it's named index instead of offset. Also add a couple of comments to explain why we're doing various things. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: Pankaj Gupta <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02mm/filemap.c: use vm_fault error code directlyMatthew Wilcox (Oracle)1-1/+1
Use VM_FAULT_OOM instead of indirecting through vmf_error(-ENOMEM). Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Pankaj Gupta <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02mm/filemap.c: remove unused argument from shrink_readahead_size_eio()Souptick Joarder1-4/+3
The first argument of shrink_readahead_size_eio() is not used. Hence remove it from the function definition and from all the callers. Signed-off-by: Souptick Joarder <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02mm/filemap.c: clear page error before actual readXianting Tian1-0/+8
Mount failure issue happens under the scenario: Application forked dozens of threads to mount the same number of cramfs images separately in docker, but several mounts failed with high probability. Mount failed due to the checking result of the page(read from the superblock of loop dev) is not uptodate after wait_on_page_locked(page) returned in function cramfs_read: wait_on_page_locked(page); if (!PageUptodate(page)) { ... } The reason of the checking result of the page not uptodate: systemd-udevd read the loopX dev before mount, because the status of loopX is Lo_unbound at this time, so loop_make_request directly trigger the calling of io_end handler end_buffer_async_read, which called SetPageError(page). So It caused the page can't be set to uptodate in function end_buffer_async_read: if(page_uptodate && !PageError(page)) { SetPageUptodate(page); } Then mount operation is performed, it used the same page which is just accessed by systemd-udevd above, Because this page is not uptodate, it will launch a actual read via submit_bh, then wait on this page by calling wait_on_page_locked(page). When the I/O of the page done, io_end handler end_buffer_async_read is called, because no one cleared the page error(during the whole read path of mount), which is caused by systemd-udevd reading, so this page is still in "PageError" status, which can't be set to uptodate in function end_buffer_async_read, then caused mount failure. But sometimes mount succeed even through systemd-udeved read loopX dev just before, The reason is systemd-udevd launched other loopX read just between step 3.1 and 3.2, the steps as below: 1, loopX dev default status is Lo_unbound; 2, systemd-udved read loopX dev (page is set to PageError); 3, mount operation 1) set loopX status to Lo_bound; ==>systemd-udevd read loopX dev<== 2) read loopX dev(page has no error) 3) mount succeed As the loopX dev status is set to Lo_bound after step 3.1, so the other loopX dev read by systemd-udevd will go through the whole I/O stack, part of the call trace as below: SYS_read vfs_read do_sync_read blkdev_aio_read generic_file_aio_read do_generic_file_read: ClearPageError(page); mapping->a_ops->readpage(filp, page); here, mapping->a_ops->readpage() is blkdev_readpage. In latest kernel, some function name changed, the call trace as below: blkdev_read_iter generic_file_read_iter generic_file_buffered_read: /* * A previous I/O error may have been due to temporary * failures, eg. mutipath errors. * Pg_error will be set again if readpage fails. */ ClearPageError(page); /* Start the actual read. The read will unlock the page*/ error=mapping->a_ops->readpage(flip, page); We can see ClearPageError(page) is called before the actual read, then the read in step 3.2 succeed. This patch is to add the calling of ClearPageError just before the actual read of read path of cramfs mount. Without the patch, the call trace as below when performing cramfs mount: do_mount cramfs_read cramfs_blkdev_read read_cache_page do_read_cache_page: filler(data, page); or mapping->a_ops->readpage(data, page); With the patch, the call trace as below when performing mount: do_mount cramfs_read cramfs_blkdev_read read_cache_page: do_read_cache_page: ClearPageError(page); <== new add filler(data, page); or mapping->a_ops->readpage(data, page); With the patch, mount operation trigger the calling of ClearPageError(page) before the actual read, the page has no error if no additional page error happen when I/O done. Signed-off-by: Xianting Tian <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Jan Kara <[email protected]> Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02mm/page-writeback.c: write_cache_pages(): deduplicate identical checksMauricio Faria de Oliveira1-4/+4
There used to be a 'retry' label in between the two (identical) checks when first introduced in commit f446daaea9d4 ("mm: implement writeback livelock avoidance using page tagging"), and later modified/updated in commit 6e6938b6d313 ("writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage"). The label has been removed in commit 64081362e8ff ("mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock"), and the (identical) checks are now present / performed immediately one after another. So, remove/deduplicate the latter check, moving tag_pages_for_writeback() into the former check before the 'tag' variable assignment, so it's clear that it's not used in this (similarly-named) function call but only later in pagevec_lookup_range_tag(). Signed-off-by: Mauricio Faria de Oliveira <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Ira Weiny <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Jan Kara <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02mm/filemap.c: don't bother dropping mmap_sem for zero size readaheadJan Kara1-1/+1
When handling a page fault, we drop mmap_sem to start async readahead so that we don't block on IO submission with mmap_sem held. However there's no point to drop mmap_sem in case readahead is disabled. Handle that case to avoid pointless dropping of mmap_sem and retrying the fault. This was actually reported to block mlockall(MCL_CURRENT) indefinitely. Fixes: 6b4c9f446981 ("filemap: drop the mmap_sem for all blocking operations") Reported-by: Minchan Kim <[email protected]> Reported-by: Robert Stupp <[email protected]> Signed-off-by: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02mm/Makefile: disable KCSAN for kmemleakQian Cai1-0/+1
Kmemleak could scan task stacks while plain writes happens to those stack variables which could results in data races. For example, in sys_rt_sigaction and do_sigaction(), it could have plain writes in a 32-byte size. Since the kmemleak does not care about the actual values of a non-pointer and all do_sigaction() call sites only copy to stack variables, just disable KCSAN for kmemleak to avoid annotating anything outside Kmemleak just because Kmemleak scans everything. Suggested-by: Marco Elver <[email protected]> Signed-off-by: Qian Cai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Acked-by: Marco Elver <[email protected]> Acked-by: Catalin Marinas <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02mm/kmemleak.c: use address-of operator on section symbolsNathan Chancellor1-1/+1
Clang warns: mm/kmemleak.c:1955:28: warning: array comparison always evaluates to a constant [-Wtautological-compare] if (__start_ro_after_init < _sdata || __end_ro_after_init > _edata) ^ mm/kmemleak.c:1955:60: warning: array comparison always evaluates to a constant [-Wtautological-compare] if (__start_ro_after_init < _sdata || __end_ro_after_init > _edata) These are not true arrays, they are linker defined symbols, which are just addresses. Using the address of operator silences the warning and does not change the resulting assembly with either clang/ld.lld or gcc/ld (tested with diff + objdump -Dr). Suggested-by: Nick Desaulniers <[email protected]> Signed-off-by: Nathan Chancellor <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Acked-by: Catalin Marinas <[email protected]> Link: https://github.com/ClangBuiltLinux/linux/issues/895 Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02revert "topology: add support for node_to_mem_node() to determine the ↵Vlastimil Babka2-18/+0
fallback node" This reverts commit ad2c8144418c6a81cefe65379fd47bbe8344cef2. The function node_to_mem_node() was introduced by that commit for use in SLUB on systems with memoryless nodes, but it turned out to be unreliable on some architectures/configurations and a simpler solution exists than fixing it up. Thus commit 0715e6c516f1 ("mm, slub: prevent kmalloc_node crashes and memory leaks") removed the only user of node_to_mem_node() and we can revert the commit that introduced the function. Signed-off-by: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Srikar Dronamraju <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Bharata B Rao <[email protected]> Cc: Christopher Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Nathan Lynch <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: PUVICHAKRAVARTHY RAMACHANDRAN <[email protected]> Cc: Sachin Sant <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02slub: relocate freelist pointer to middle of objectKees Cook1-0/+7
In a recent discussion[1] with Vitaly Nikolenko and Silvio Cesare, it became clear that moving the freelist pointer away from the edge of allocations would likely improve the overall defensive posture of the inline freelist pointer. My benchmarks show no meaningful change to performance (they seem to show it being faster), so this looks like a reasonable change to make. Instead of having the freelist pointer at the very beginning of an allocation (offset 0) or at the very end of an allocation (effectively offset -sizeof(void *) from the next allocation), move it away from the edges of the allocation and into the middle. This provides some protection against small-sized neighboring overflows (or underflows), for which the freelist pointer is commonly the target. (Large or well controlled overwrites are much more likely to attack live object contents, instead of attempting freelist corruption.) The vaunted kernel build benchmark, across 5 runs. Before: Mean: 250.05 Std Dev: 1.85 and after, which appears mysteriously faster: Mean: 247.13 Std Dev: 0.76 Attempts at running "sysbench --test=memory" show the change to be well in the noise (sysbench seems to be pretty unstable here -- it's not really measuring allocation). Hackbench is more allocation-heavy, and while the std dev is above the difference, it looks like may manifest as an improvement as well: 20 runs of "hackbench -g 20 -l 1000", before: Mean: 36.322 Std Dev: 0.577 and after: Mean: 36.056 Std Dev: 0.598 [1] https://twitter.com/vnik5287/status/1235113523098685440 Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Acked-by: Christoph Lameter <[email protected]> Cc: Vitaly Nikolenko <[email protected]> Cc: Silvio Cesare <[email protected]> Cc: Christoph Lameter <[email protected]>Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Link: http://lkml.kernel.org/r/202003051624.AAAC9AECC@keescook Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02slub: improve bit diffusion for freelist ptr obfuscationKees Cook1-1/+1
Under CONFIG_SLAB_FREELIST_HARDENED=y, the obfuscation was relatively weak in that the ptr and ptr address were usually so close that the first XOR would result in an almost entirely 0-byte value[1], leaving most of the "secret" number ultimately being stored after the third XOR. A single blind memory content exposure of the freelist was generally sufficient to learn the secret. Add a swab() call to mix bits a little more. This is a cheap way (1 cycle) to make attacks need more than a single exposure to learn the secret (or to know _where_ the exposure is in memory). kmalloc-32 freelist walk, before: ptr ptr_addr stored value secret ffff90c22e019020@ffff90c22e019000 is 86528eb656b3b5bd (86528eb656b3b59d) ffff90c22e019040@ffff90c22e019020 is 86528eb656b3b5fd (86528eb656b3b59d) ffff90c22e019060@ffff90c22e019040 is 86528eb656b3b5bd (86528eb656b3b59d) ffff90c22e019080@ffff90c22e019060 is 86528eb656b3b57d (86528eb656b3b59d) ffff90c22e0190a0@ffff90c22e019080 is 86528eb656b3b5bd (86528eb656b3b59d) ... after: ptr ptr_addr stored value secret ffff9eed6e019020@ffff9eed6e019000 is 793d1135d52cda42 (86528eb656b3b59d) ffff9eed6e019040@ffff9eed6e019020 is 593d1135d52cda22 (86528eb656b3b59d) ffff9eed6e019060@ffff9eed6e019040 is 393d1135d52cda02 (86528eb656b3b59d) ffff9eed6e019080@ffff9eed6e019060 is 193d1135d52cdae2 (86528eb656b3b59d) ffff9eed6e0190a0@ffff9eed6e019080 is f93d1135d52cdac2 (86528eb656b3b59d) [1] https://blog.infosectcbr.com.au/2020/03/weaknesses-in-linux-kernel-heap.html Fixes: 2482ddec670f ("mm: add SLUB free list pointer obfuscation") Reported-by: Silvio Cesare <[email protected]> Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: <[email protected]> Link: http://lkml.kernel.org/r/202003051623.AF4F8CB@keescook Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02mm/slub.c: replace kmem_cache->cpu_partial with wrapped APIschenqiwu1-7/+7
There are slub_cpu_partial() and slub_set_cpu_partial() APIs to wrap kmem_cache->cpu_partial. This patch will use the two APIs to replace kmem_cache->cpu_partial in slub code. Signed-off-by: chenqiwu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02mm/slub.c: replace cpu_slab->partial with wrapped APIschenqiwu1-2/+2
There are slub_percpu_partial() and slub_set_percpu_partial() APIs to wrap kmem_cache->cpu_partial. This patch will use the two to replace cpu_slab->partial in slub code. Signed-off-by: chenqiwu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02fs_parse: remove pr_notice() about each validationKees Cook1-2/+0
This notice fills my boot logs with scary-looking asterisks but doesn't really tell me anything. Let's just remove it; validation errors are already reported separately, so this is just a redundant list of filesystems. $ dmesg | grep VALIDATE [ 0.306256] *** VALIDATE tmpfs *** [ 0.307422] *** VALIDATE proc *** [ 0.308355] *** VALIDATE cgroup *** [ 0.308741] *** VALIDATE cgroup2 *** [ 0.813256] *** VALIDATE bpf *** [ 0.815272] *** VALIDATE ramfs *** [ 0.815665] *** VALIDATE hugetlbfs *** [ 0.876970] *** VALIDATE nfs *** [ 0.877383] *** VALIDATE nfs4 *** Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Seth Arnold <[email protected]> Cc: Alexander Viro <[email protected]> Link: http://lkml.kernel.org/r/202003061617.A8835CAAF@keescook Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02ocfs2: use memalloc_nofs_save instead of memalloc_noio_saveMatthew Wilcox (Oracle)1-14/+10
OCFS2 doesn't mind if memory reclaim makes I/Os happen; it just cares that it won't be reentered, so it can use memalloc_nofs_save() instead of memalloc_noio_save(). Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Joseph Qi <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Junxiao Bi <[email protected]> Cc: Changwei Ge <[email protected]> Cc: Gang He <[email protected]> Cc: Jun Piao <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02ocfs2: use scnprintf() for avoiding potential buffer overflowTakashi Iwai4-80/+80
Since snprintf() returns the would-be-output size instead of the actual output size, the succeeding calls may go beyond the given buffer limit. Fix it by replacing with scnprintf(). Signed-off-by: Takashi Iwai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Acked-by: Joseph Qi <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Joseph Qi <[email protected]> Cc: Changwei Ge <[email protected]> Cc: Gang He <[email protected]> Cc: Jun Piao <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02ocfs2: roll back the reference count modification of the parent directory if ↵wangjian1-4/+11
an error occurs Under some conditions, the directory cannot be deleted. The specific scenarios are as follows: (for example, /mnt/ocfs2 is the mount point) 1. Create the /mnt/ocfs2/p_dir directory. At this time, the i_nlink corresponding to the inode of the /mnt/ocfs2/p_dir directory is equal to 2. 2. During the process of creating the /mnt/ocfs2/p_dir/s_dir directory, if the call to the inc_nlink function in ocfs2_mknod succeeds, the functions such as ocfs2_init_acl, ocfs2_init_security_set, and ocfs2_dentry_attach_lock fail. At this time, the i_nlink corresponding to the inode of the /mnt/ocfs2/p_dir directory is equal to 3, but /mnt/ocfs2/p_dir/s_dir is not added to the /mnt/ocfs2/p_dir directory entry. 3. Delete the /mnt/ocfs2/p_dir directory (rm -rf /mnt/ocfs2/p_dir). At this time, it is found that the i_nlink corresponding to the inode corresponding to the /mnt/ocfs2/p_dir directory is equal to 3. Therefore, the /mnt/ocfs2/p_dir directory cannot be deleted. Signed-off-by: Jian wang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Jun Piao <[email protected]> Reviewed-by: Joseph Qi <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Junxiao Bi <[email protected]> Cc: Changwei Ge <[email protected]> Cc: Gang He <[email protected]> Cc: Jun Piao <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02ocfs2: ocfs2_fs.h: replace zero-length array with flexible-array memberGustavo A. R. Silva1-9/+9
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://urldefense.com/v3/__https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html__;!!GqivPVa7Brio!OKPotRhYhHbCG2kibo8Q6_6CuKaa28d_74h1svxyR6rbshrK2L_BdrQpNbvJWBWb40QCkg$ [2] https://urldefense.com/v3/__https://github.com/KSPP/linux/issues/21__;!!GqivPVa7Brio!OKPotRhYhHbCG2kibo8Q6_6CuKaa28d_74h1svxyR6rbshrK2L_BdrQpNbvJWBUhNn9M6g$ [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Acked-by: Joseph Qi <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Junxiao Bi <[email protected]> Cc: Changwei Ge <[email protected]> Cc: Gang He <[email protected]> Cc: Jun Piao <[email protected]> Link: http://lkml.kernel.org/r/20200309202155.GA8432@embeddedor Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02ocfs2: dlm: replace zero-length array with flexible-array memberGustavo A. R. Silva1-4/+4
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://urldefense.com/v3/__https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html__;!!GqivPVa7Brio!OVOYL_CouISa5L1Lw-20EEFQntw6cKMx-j8UdY4z78uYgzKBUFcfpn50GaurvbV5v7YiUA$ [2] https://urldefense.com/v3/__https://github.com/KSPP/linux/issues/21__;!!GqivPVa7Brio!OVOYL_CouISa5L1Lw-20EEFQntw6cKMx-j8UdY4z78uYgzKBUFcfpn50GaurvbXs8Eh8eg$ [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Acked-by: Joseph Qi <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Junxiao Bi <[email protected]> Cc: Changwei Ge <[email protected]> Cc: Gang He <[email protected]> Cc: Jun Piao <[email protected]> Link: http://lkml.kernel.org/r/20200309202016.GA8210@embeddedor Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02ocfs2: cluster: replace zero-length array with flexible-array memberGustavo A. R. Silva1-1/+1
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://urldefense.com/v3/__https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html__;!!GqivPVa7Brio!NzMr-YRl2zy-K3lwLVVatz7x0uD2z7-ykQag4GrGigxmfWU8TWzDy6xrkTiW3hYl00czlw$ [2] https://urldefense.com/v3/__https://github.com/KSPP/linux/issues/21__;!!GqivPVa7Brio!NzMr-YRl2zy-K3lwLVVatz7x0uD2z7-ykQag4GrGigxmfWU8TWzDy6xrkTiW3hYHG1nAnw$ [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Acked-by: Joseph Qi <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Junxiao Bi <[email protected]> Cc: Changwei Ge <[email protected]> Cc: Gang He <[email protected]> Cc: Jun Piao <[email protected]> Link: http://lkml.kernel.org/r/20200309201907.GA8005@embeddedor Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02ocfs2: replace zero-length array with flexible-array memberGustavo A. R. Silva1-1/+1
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Acked-by: Joseph Qi <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Junxiao Bi <[email protected]> Cc: Changwei Ge <[email protected]> Cc: Gang He <[email protected]> Cc: Jun Piao <[email protected]> Link: http://lkml.kernel.org/r/20200213160244.GA6088@embeddedor Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02ocfs2: add missing annotations for ocfs2_refcount_cache_lock() and ↵Jules Irenge1-0/+2
ocfs2_refcount_cache_unlock() Sparse reports warnings at ocfs2_refcount_cache_lock() and ocfs2_refcount_cache_unlock() warning: context imbalance in ocfs2_refcount_cache_lock() - wrong count at exit warning: context imbalance in ocfs2_refcount_cache_unlock() - unexpected unlock The root cause is the missing annotation at ocfs2_refcount_cache_lock() and at ocfs2_refcount_cache_unlock() Add the missing __acquires(&rf->rf_lock) annotation to ocfs2_refcount_cache_lock() Add the missing __releases(&rf->rf_lock) annotation to ocfs2_refcount_cache_unlock() Signed-off-by: Jules Irenge <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Acked-by: Joseph Qi <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Junxiao Bi <[email protected]> Cc: Changwei Ge <[email protected]> Cc: Gang He <[email protected]> Cc: Jun Piao <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02ocfs2: remove useless errAlex Shi2-4/+3
We don't need 'err' in these 2 places, better to remove them. Signed-off-by: Alex Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Joseph Qi <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Kate Stewart <[email protected]> Cc: ChenGang <[email protected]> Cc: Richard Fontana <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02ocfs2: correct annotation from "l_next_rec" to "l_next_free_rec"wangyan1-1/+1
Correct annotation from "l_next_rec" to "l_next_free_rec" Signed-off-by: Yan Wang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Jun Piao <[email protected]> Acked-by: Joseph Qi <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Junxiao Bi <[email protected]> Cc: Changwei Ge <[email protected]> Cc: Gang He <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02ocfs2: there is no need to log twice in several functionswangyan2-6/+0
There is no need to log twice in several functions. Signed-off-by: Yan Wang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Jun Piao <[email protected]> Acked-by: Joseph Qi <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Junxiao Bi <[email protected]> Cc: Changwei Ge <[email protected]> Cc: Gang He <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02ocfs2: remove dlm_lock_is_remoteAlex Shi1-2/+0
This macro has been unused since it was introduced. Signed-off-by: Alex Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Acked-by: Joseph Qi <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Junxiao Bi <[email protected]> Cc: Changwei Ge <[email protected]> Cc: Gang He <[email protected]> Cc: Jun Piao <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02ocfs2: use OCFS2_SEC_BITS in macroAlex Shi1-1/+1
This macro should be used. Signed-off-by: Alex Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Acked-by: Joseph Qi <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Junxiao Bi <[email protected]> Cc: Changwei Ge <[email protected]> Cc: Gang He <[email protected]> Cc: Jun Piao <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02ocfs2: remove unused macrosAlex Shi4-8/+0
O2HB_DEFAULT_BLOCK_BITS/DLM_THREAD_MAX_ASTS/DLM_MIGRATION_RETRY_MS and OCFS2_MAX_RESV_WINDOW_BITS/OCFS2_MIN_RESV_WINDOW_BITS have been unused since commit 66effd3c6812 ("ocfs2/dlm: Do not migrate resource to a node that is leaving the domain"). Signed-off-by: Alex Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: ChenGang <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Richard Fontana <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Junxiao Bi <[email protected]> Cc: Joseph Qi <[email protected]> Cc: Changwei Ge <[email protected]> Cc: Gang He <[email protected]> Cc: Jun Piao <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02ocfs2: remove FS_OCFS2_NMAlex Shi1-2/+0
This macro is unused since commit ab09203e302b ("sysctl fs: Remove dead binary sysctl support"). Signed-off-by: Alex Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Acked-by: Joseph Qi <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Junxiao Bi <[email protected]> Cc: Changwei Ge <[email protected]> Cc: Gang He <[email protected]> Cc: Jun Piao <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02scripts/spelling.txt: add more spellings to spelling.txtColin Ian King1-1/+19
Here are some of the more common spelling mistakes and typos that I've found while fixing up spelling mistakes in the kernel since November 2019 Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Joe Perches <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02scripts/spelling.txt: add syfs/sysfs patternJonathan Neuschäfer1-0/+1
There are a few cases in the tree where "sysfs" is misspelled as "syfs". Signed-off-by: Jonathan Neuschäfer <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Colin Ian King <[email protected]> Cc: Xiong <[email protected]> Cc: Stephen Boyd <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Chris Paterson <[email protected]> Cc: Luca Ceresoli <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02asm-generic: make more kernel-space headers mandatoryMasahiro Yamada25-555/+52
Change a header to mandatory-y if both of the following are met: [1] At least one architecture (except um) specifies it as generic-y in arch/*/include/asm/Kbuild [2] Every architecture (except um) either has its own implementation (arch/*/include/asm/*.h) or specifies it as generic-y in arch/*/include/asm/Kbuild This commit was generated by the following shell script. ----------------------------------->8----------------------------------- arches=$(cd arch; ls -1 | sed -e '/Kconfig/d' -e '/um/d') tmpfile=$(mktemp) grep "^mandatory-y +=" include/asm-generic/Kbuild > $tmpfile find arch -path 'arch/*/include/asm/Kbuild' | xargs sed -n 's/^generic-y += \(.*\)/\1/p' | sort -u | while read header do mandatory=yes for arch in $arches do if ! grep -q "generic-y += $header" arch/$arch/include/asm/Kbuild && ! [ -f arch/$arch/include/asm/$header ]; then mandatory=no break fi done if [ "$mandatory" = yes ]; then echo "mandatory-y += $header" >> $tmpfile for arch in $arches do sed -i "/generic-y += $header/d" arch/$arch/include/asm/Kbuild done fi done sed -i '/^mandatory-y +=/d' include/asm-generic/Kbuild LANG=C sort $tmpfile >> include/asm-generic/Kbuild ----------------------------------->8----------------------------------- One obvious benefit is the diff stat: 25 files changed, 52 insertions(+), 557 deletions(-) It is tedious to list generic-y for each arch that needs it. So, mandatory-y works like a fallback default (by just wrapping asm-generic one) when arch does not have a specific header implementation. See the following commits: def3f7cefe4e81c296090e1722a76551142c227c a1b39bae16a62ce4aae02d958224f19316d98b24 It is tedious to convert headers one by one, so I processed by a shell script. Signed-off-by: Masahiro Yamada <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Michal Simek <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Arnd Bergmann <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02kthread: mark timer used by delayed kthread works as IRQ safePetr Mladek1-1/+2
The timer used by delayed kthread works are IRQ safe because the used kthread_delayed_work_timer_fn() is IRQ safe. It is properly marked when initialized by KTHREAD_DELAYED_WORK_INIT(). But TIMER_IRQSAFE flag is missing when initialized by kthread_init_delayed_work(). The missing flag might trigger invalid warning from del_timer_sync() when kthread_mod_delayed_work() is called with interrupts disabled. This patch is result of a discussion about using the API, see https://lkml.kernel.org/r/[email protected] Reported-by: Grygorii Strashko <[email protected]> Signed-off-by: Petr Mladek <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Grygorii Strashko <[email protected]> Acked-by: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02tools/accounting/getdelays.c: fix netlink attribute lengthDavid Ahern1-1/+1
A recent change to the netlink code: 6e237d099fac ("netlink: Relax attr validation for fixed length types") logs a warning when programs send messages with invalid attributes (e.g., wrong length for a u32). Yafang reported this error message for tools/accounting/getdelays.c. send_cmd() is wrongly adding 1 to the attribute length. As noted in include/uapi/linux/netlink.h nla_len should be NLA_HDRLEN + payload length, so drop the +1. Fixes: 9e06d3f9f6b1 ("per task delay accounting taskstats interface: documentation fix") Reported-by: Yafang Shao <[email protected]> Signed-off-by: David Ahern <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Yafang Shao <[email protected]> Cc: Johannes Berg <[email protected]> Cc: Shailabh Nagar <[email protected]> Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-04-02lookup_open(): don't bother with fallbacks to lookup+createAl Viro1-25/+9
We fall back to lookup+create (instead of atomic_open) in several cases: 1) we don't have write access to filesystem and O_TRUNC is present in the flags. It's not something we want ->atomic_open() to see - it just might go ahead and truncate the file. However, we can pass it the flags sans O_TRUNC - eventually do_open() will call handle_truncate() anyway. 2) we have O_CREAT | O_EXCL and we can't write to parent. That's going to be an error, of course, but we want to know _which_ error should that be - might be EEXIST (if file exists), might be EACCES or EROFS. Simply stripping O_CREAT (and checking if we see ENOENT) would suffice, if not for O_EXCL. However, we used to have ->atomic_open() fully responsible for rejecting O_CREAT | O_EXCL on existing file and just stripping O_CREAT would've disarmed those checks. With nothing downstream to catch the problem - FMODE_OPENED used to be "don't bother with EEXIST checks, ->atomic_open() has done those". Now EEXIST checks downstream are skipped only if FMODE_CREATED is set - FMODE_OPENED alone is not enough. That has eliminated the need to fall back onto lookup+create path in this case. 3) O_WRONLY or O_RDWR when we have no write access to filesystem, with nothing else objectionable. Fallback is (and had always been) pointless. IOW, we don't really need that fallback; all we need in such cases is to trim O_TRUNC and O_CREAT properly. Signed-off-by: Al Viro <[email protected]>
2020-04-02atomic_open(): no need to pass struct open_flags anymoreAl Viro1-2/+1
argument had been unused since 1643b43fbd052 (lookup_open(): lift the "fallback to !O_CREAT" logics from atomic_open()) back in 2016 Signed-off-by: Al Viro <[email protected]>
2020-04-02open_last_lookups(): move complete_walk() into do_open()Al Viro1-10/+8
Signed-off-by: Al Viro <[email protected]>
2020-04-02open_last_lookups(): lift O_EXCL|O_CREAT handling into do_open()Al Viro1-5/+2
Currently path_openat() has "EEXIST on O_EXCL|O_CREAT" checks done on one of the ways out of open_last_lookups(). There are 4 cases: 1) the last component is . or ..; check is not done. 2) we had FMODE_OPENED or FMODE_CREATED set while in lookup_open(); check is not done. 3) symlink to be traversed is found; check is not done (nor should it be) 4) everything else: check done (before complete_walk(), even). In case (1) O_EXCL|O_CREAT ends up failing with -EISDIR - that's open("/tmp/.", O_CREAT|O_EXCL, 0600) Note that in the same conditions open("/tmp", O_CREAT|O_EXCL, 0600) would have yielded EEXIST. Either error is allowed, switching to -EEXIST in these cases would've been more consistent. Case (2) is more subtle; first of all, if we have FMODE_CREATED set, the object hadn't existed prior to the call. The check should not be done in such a case. The rest is problematic, though - we have FMODE_OPENED set (i.e. it went through ->atomic_open() and got successfully opened there) FMODE_CREATED is *NOT* set O_CREAT and O_EXCL are both set. Any such case is a bug - either we failed to set FMODE_CREATED when we had, in fact, created an object (no such instances in the tree) or we have opened a pre-existing file despite having had both O_CREAT and O_EXCL passed. One of those was, in fact caught (and fixed) while sorting out this mess (gfs2 on cold dcache). And in such situations we should fail with EEXIST. Note that for (1) and (4) FMODE_CREATED is not set - for (1) there's nothing in handle_dots() to set it, for (4) we'd explicitly checked that. And (1), (2) and (4) are exactly the cases when we leave the loop in the caller, with do_open() called immediately after that loop. IOW, we can move the check over there, and make it If we have O_CREAT|O_EXCL and after successful pathname resolution FMODE_CREATED is *not* set, we must have run into a preexisting file and should fail with EEXIST. Signed-off-by: Al Viro <[email protected]>
2020-04-02open_last_lookups(): don't abuse complete_walk() when all we want is unlazyAl Viro1-9/+5
Signed-off-by: Al Viro <[email protected]>
2020-04-02open_last_lookups(): consolidate fsnotify_create() callsAl Viro1-5/+2
Signed-off-by: Al Viro <[email protected]>
2020-04-02take post-lookup part of do_last() out of loopAl Viro1-12/+9
now we can have open_last_lookups() directly from the loop in path_openat() - the rest of do_last() never returns a symlink to follow, so we can bloody well leave the loop first. Rename the rest of that thing from do_last() to do_open() and make it return an int. Signed-off-by: Al Viro <[email protected]>
2020-04-02link_path_walk(): sample parent's i_uid and i_mode for the last componentAl Viro1-10/+7
Signed-off-by: Al Viro <[email protected]>
2020-04-02__nd_alloc_stack(): make it return boolAl Viro1-27/+18
... and adjust the caller (reserve_stack()). Rename to nd_alloc_stack(), while we are at it. Signed-off-by: Al Viro <[email protected]>
2020-04-02reserve_stack(): switch to __nd_alloc_stack()Al Viro1-11/+8
expand the call of nd_alloc_stack() into it (and don't recheck the depth on the second call) Signed-off-by: Al Viro <[email protected]>