aboutsummaryrefslogtreecommitdiff
path: root/fs/xfs
AgeCommit message (Collapse)AuthorFilesLines
2016-09-19xfs: refactor xfs_setfilesizeChristoph Hellwig2-10/+22
Rename the current function to __xfs_setfilesize and add a non-static wrapper that also takes care of creating the transaction. This new helper will be used by the new iomap-based DAX path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19xfs: take the ilock shared if possible in xfs_file_iomap_beginChristoph Hellwig1-4/+6
We always just read the extent first, and will later lock exlusively after first dropping the lock in case we actually allocate blocks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19xfs: fix locking for DAX writesChristoph Hellwig1-19/+1
So far DAX writes inherited the locking from direct I/O writes, but the direct I/O model of using shared locks for writes is actually wrong for DAX. For direct I/O we're out of any standards and don't have to provide the Posix required exclusion between writers, but for DAX which gets transparently enable on applications without any knowledge of it we can't simply drop the requirement. Even worse this only happens for aligned writes and thus doesn't show up for many typical use cases. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19iomap: add IOMAP_F_NEW flagChristoph Hellwig1-0/+1
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19xfs: rewrite and optimize the delalloc write pathChristoph Hellwig4-315/+181
Currently xfs_iomap_write_delay does up to lookups in the inode extent tree, which is rather costly especially with the new iomap based write path and small write sizes. But it turns out that the low-level xfs_bmap_search_extents gives us all the information we need in the regular delalloc buffered write path: - it will return us an extent covering the block we are looking up if it exists. In that case we can simply return that extent to the caller and are done - it will tell us if we are beyoned the last current allocated block with an eof return parameter. In that case we can create a delalloc reservation and use the also returned information about the last extent in the file as the hint to size our delalloc reservation. - it can tell us that we are writing into a hole, but that there is an extent beyoned this hole. In this case we can create a delalloc reservation that covers the requested size (possible capped to the next existing allocation). All that can be done in one single routine instead of bouncing up and down a few layers. This reduced the CPU overhead of the block mapping routines and also simplified the code a lot. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19xfs: make xfs_inode_set_eofblocks_tag cheaper for the common caseChristoph Hellwig2-0/+15
For long growing file writes we will usually already have the eofblocks tag set when adding more speculative preallocations. Add a flag in the inode to allow us to skip the the fairly expensive AG-wide spinlocks and multiple radix tree operations in that case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19xfs: factor our a helper to calculate the EOF alignmentChristoph Hellwig1-10/+21
And drop the pointless mp argument to xfs_iomap_eof_align_last_fsb, while we're at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19xfs: move xfs_bmbt_to_iomap upChristoph Hellwig1-26/+26
We'll need it earlier in the file soon, so the unchanged function to the top of xfs_iomap.c Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19xfs: set up per-AG free space reservationsDarrick J. Wong13-54/+556
One unfortunate quirk of the reference count and reverse mapping btrees -- they can expand in size when blocks are written to *other* allocation groups if, say, one large extent becomes a lot of tiny extents. Since we don't want to start throwing errors in the middle of CoWing, we need to reserve some blocks to handle future expansion. The transaction block reservation counters aren't sufficient here because we have to have a reserve of blocks in every AG, not just somewhere in the filesystem. Therefore, create two per-AG block reservation pools. One feeds the AGFL so that rmapbt expansion always succeeds, and the other feeds all other metadata so that refcountbt expansion never fails. Use the count of how many reserved blocks we need to have on hand to create a virtual reservation in the AG. Through selective clamping of the maximum length of allocation requests and of the length of the longest free extent, we can make it look like there's less free space in the AG unless the reservation owner is asking for blocks. In other words, play some accounting tricks in-core to make sure that we always have blocks available. On the plus side, there's nothing to clean up if we crash, which is contrast to the strategy that the rough draft used (actually removing extents from the freespace btrees). Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19xfs: defer should allow ->finish_item to request a new transactionDarrick J. Wong1-7/+72
When xfs_defer_finish calls ->finish_item, it's possible that (refcount) won't be able to finish all the work in a single transaction. When this happens, the ->finish_item handler should shorten the log done item's list count, update the work item to reflect where work should continue, and return -EAGAIN so that defer_finish knows to retain the pending item on the pending list, roll the transaction, and restart processing where we left off. Plumb in the code and document how this mechanism is supposed to work. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-09-19xfs: count the blocks in a btreeDarrick J. Wong2-0/+25
Provide a helper method to count the number of blocks in a short form btree. The refcount and rmap btrees need to know the number of blocks already in use to set up their per-AG block reservations during mount. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19xfs: create a standard btree size calculator codeDarrick J. Wong2-0/+26
Create a helper to generate AG btree height calculator functions. This will be used (much) later when we get to the refcount btree. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19xfs: remove xfs_btree_bigkeyDarrick J. Wong2-24/+12
Remove the xfs_btree_bigkey mess and simply make xfs_btree_key big enough to hold both keys in-core. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19xfs: convert RUI log formats to use variable length arraysDarrick J. Wong4-31/+28
Use variable length array declarations for RUI log items, and replace the open coded sizeof formulae with a single function. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-14xfs: normalize "infinite" retries in error configsEric Sandeen3-15/+45
As it stands today, the "fail immediately" vs. "retry forever" values for max_retries and retry_timeout_seconds in the xfs metadata error configurations are not consistent. A retry_timeout_seconds of 0 means "retry forever," but a max_retries of 0 means "fail immediately." retry_timeout_seconds < 0 is disallowed, while max_retries == -1 means "retry forever." Make this consistent across the error configs, such that a value of 0 means "fail immediately" (i.e. wait 0 seconds, or retry 0 times), and a value of -1 always means "retry forever." This makes retry_timeout a signed long to accommodate the -1, even though it stores jiffies. Given our limit of a 1 day maximum timeout, this should be sufficient even at much higher HZ values than we have available today. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-14xfs: fix signed integer overflowXie XiuQi1-2/+2
Use 1U for unsigned int to avoid a overflow warning from UBSAN. [ 31.910858] UBSAN: Undefined behaviour in fs/xfs/xfs_buf_item.c:889:25 [ 31.911252] signed integer overflow: [ 31.911478] -2147483648 - 1 cannot be represented in type 'int' [ 31.911846] CPU: 1 PID: 1011 Comm: tuned Tainted: G B ---- ------- 3.10.0-327.28.3.el7.x86_64 #1 [ 31.911857] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 01/07/2011 [ 31.911866] 1ffff1004069cd3b 0000000076bec3fd ffff8802034e69a0 ffffffff81ee3140 [ 31.911883] ffff8802034e69b8 ffffffff81ee31fd ffffffffa0ad79e0 ffff8802034e6b20 [ 31.911898] ffffffff81ee46e2 0000002d515470c0 0000000000000001 0000000041b58ab3 [ 31.911913] Call Trace: [ 31.911932] [<ffffffff81ee3140>] dump_stack+0x1e/0x20 [ 31.911947] [<ffffffff81ee31fd>] ubsan_epilogue+0x12/0x55 [ 31.911964] [<ffffffff81ee46e2>] handle_overflow+0x1ba/0x215 [ 31.912083] [<ffffffff81ee4798>] __ubsan_handle_sub_overflow+0x2a/0x31 [ 31.912204] [<ffffffffa08676fb>] xfs_buf_item_log+0x34b/0x3f0 [xfs] [ 31.912314] [<ffffffffa0880490>] xfs_trans_log_buf+0x120/0x260 [xfs] [ 31.912402] [<ffffffffa079a890>] xfs_btree_log_recs+0x80/0xc0 [xfs] [ 31.912490] [<ffffffffa07a29f8>] xfs_btree_delrec+0x11a8/0x2d50 [xfs] [ 31.913589] [<ffffffffa07a86f9>] xfs_btree_delete+0xc9/0x260 [xfs] [ 31.913762] [<ffffffffa075b5cf>] xfs_free_ag_extent+0x63f/0xe20 [xfs] [ 31.914339] [<ffffffffa075ec0f>] xfs_free_extent+0x2af/0x3e0 [xfs] [ 31.914641] [<ffffffffa0801b2b>] xfs_bmap_finish+0x32b/0x4b0 [xfs] [ 31.914841] [<ffffffffa083c2e7>] xfs_itruncate_extents+0x3b7/0x740 [xfs] [ 31.915216] [<ffffffffa08342fa>] xfs_setattr_size+0x60a/0x860 [xfs] [ 31.915471] [<ffffffffa08345ea>] xfs_vn_setattr+0x9a/0xe0 [xfs] [ 31.915590] [<ffffffff8149ad38>] notify_change+0x5c8/0x8a0 [ 31.915607] [<ffffffff81450f22>] do_truncate+0x122/0x1d0 [ 31.915640] [<ffffffff8147beee>] do_last+0x15de/0x2c80 [ 31.915707] [<ffffffff8147d777>] path_openat+0x1e7/0xcc0 [ 31.915802] [<ffffffff81480824>] do_filp_open+0xa4/0x160 [ 31.915848] [<ffffffff81453127>] do_sys_open+0x1b7/0x3f0 [ 31.915879] [<ffffffff81453392>] SyS_open+0x32/0x40 [ 31.915897] [<ffffffff81f08989>] system_call_fastpath+0x16/0x1b [ 240.086809] UBSAN: Undefined behaviour in fs/xfs/xfs_buf_item.c:866:34 [ 240.086820] signed integer overflow: [ 240.086830] -2147483648 - 1 cannot be represented in type 'int' [ 240.086846] CPU: 1 PID: 12969 Comm: rm Tainted: G B ---- ------- 3.10.0-327.28.3.el7.x86_64 #1 [ 240.086857] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 01/07/2011 [ 240.086868] 1ffff10040491def 00000000e2ea59c1 ffff88020248ef40 ffffffff81ee3140 [ 240.086885] ffff88020248ef58 ffffffff81ee31fd ffffffffa0ad79e0 ffff88020248f0c0 [ 240.086901] ffffffff81ee46e2 0000002d02488000 0000000000000001 0000000041b58ab3 [ 240.086915] Call Trace: [ 240.086938] [<ffffffff81ee3140>] dump_stack+0x1e/0x20 [ 240.086953] [<ffffffff81ee31fd>] ubsan_epilogue+0x12/0x55 [ 240.086971] [<ffffffff81ee46e2>] handle_overflow+0x1ba/0x215 ... Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-14Make __xfs_xattr_put_listen preperly report errors.Artem Savkov1-0/+1
Commit 2a6fba6 "xfs: only return -errno or success from attr ->put_listent" changes the returnvalue of __xfs_xattr_put_listen to 0 in case when there is insufficient space in the buffer assuming that setting context->count to -1 would be enough, but all of the ->put_listent callers only check seen_enough. This results in a failed assertion: XFS: Assertion failed: context->count >= 0, file: fs/xfs/xfs_xattr.c, line: 175 in insufficient buffer size case. This is only reproducible with at least 2 xattrs and only when the buffer gets depleted before the last one. Furthermore if buffersize is such that it is enough to hold the last xattr's name, but not enough to hold the sum of preceeding xattr names listxattr won't fail with ERANGE, but will suceed returning last xattr's name without the first character. The first character end's up overwriting data stored at (context->alist - 1). Signed-off-by: Artem Savkov <asavkov@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-14xfs: undo block reservation correctly in xfs_trans_reserve()Eryu Guan1-1/+1
"blocks" should be added back to fdblocks at undo time, not taken away, i.e. the minus sign should not be used. This is a regression introduced by commit 0d485ada404b ("xfs: use generic percpu counters for free block counter"). And it's found by code inspection, I didn't it in real world, so there's no reproducer. Signed-off-by: Eryu Guan <eguan@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-30xfs: track log done items directly in the deferred pending work itemDarrick J. Wong3-15/+6
Christoph reports slab corruption when a deferred refcount update aborts during _defer_finish(). The cause of this was broken log item state tracking in xfs_defer_pending -- upon an abort, _defer_trans_abort() will call abort_intent on all intent items, including the ones that have already had a done item attached. This is incorrect because each intent item has 2 refcount: the first is released when the intent item is committed to the log; and the second is released when the _done_ item is committed to the log, or by the intent creator if there is no done item. In other words, once we log the done item, responsibility for releasing the intent item's second refcount is transferred to the done item and /must not/ be performed by anything else. The dfp_committed flag should have been tracking whether or not we had a done item so that _defer_trans_abort could decide if it needs to abort the intent item, but due to a thinko this was not the case. Rip it out and track the done item directly so that we do the right thing w.r.t. intent item freeing. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reported-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-26xfs: prevent dropping ioend completions during buftarg waitBrian Foster1-1/+1
xfs_wait_buftarg() waits for all pending I/O, drains the ioend completion workqueue and walks the LRU until all buffers in the cache have been released. This is traditionally an unmount operation` but the mechanism is also reused during filesystem freeze. xfs_wait_buftarg() invokes drain_workqueue() as part of the quiesce, which is intended more for a shutdown sequence in that it indicates to the queue that new operations are not expected once the drain has begun. New work jobs after this point result in a WARN_ON_ONCE() and are otherwise dropped. With filesystem freeze, however, read operations are allowed and can proceed during or after the workqueue drain. If such a read occurs during the drain sequence, the workqueue infrastructure complains about the queued ioend completion work item and drops it on the floor. As a result, the buffer remains on the LRU and the freeze never completes. Despite the fact that the overall buffer cache cleanup is not necessary during freeze, fix up this operation such that it is safe to invoke during non-unmount quiesce operations. Replace the drain_workqueue() call with flush_workqueue(), which runs a similar serialization on pending workqueue jobs without causing new jobs to be dropped. This is safe for unmount as unmount independently locks out new operations by the time xfs_wait_buftarg() is invoked. cc: <stable@vger.kernel.org> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-26xfs: fix superblock inprogress checkDave Chinner1-1/+2
From inspection, the superblock sb_inprogress check is done in the verifier and triggered only for the primary superblock via a "bp->b_bn == XFS_SB_DADDR" check. Unfortunately, the primary superblock is an uncached buffer, and hence it is configured by xfs_buf_read_uncached() with: bp->b_bn = XFS_BUF_DADDR_NULL; /* always null for uncached buffers */ And so this check never triggers. Fix it. cc: <stable@vger.kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-26xfs: simple btree query range should look right if LE lookup failsDarrick J. Wong1-0/+7
If the initial LOOKUP_LE in the simple query range fails to find anything, we should attempt to increment the btree cursor to see if there actually /are/ records for what we're trying to find. Without this patch, a bnobt range query of (0, $agsize) returns no results because the leftmost record never has a startblock of zero. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-26xfs: fix some key handling problems in _btree_simple_query_rangeDarrick J. Wong1-1/+2
We only need the record's high key for the first record that we look at; for all records, we /definitely/ need the regular record key. Therefore, fix how the simple range query function gets its keys. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-26xfs: don't log the entire end of the AGFDarrick J. Wong2-2/+6
When we're logging the last non-spare field in the AGF, we don't need to log the spare fields, so plumb in a new AGF logging flag to help us avoid that. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-26xfs: disallow mounting of realtime + rmap filesystemsDarrick J. Wong1-1/+8
Since the kernel doesn't currently support the realtime rmapbt, don't allow such filesystems to be mounted. Support will appear in a future release. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-26xfs: don't perform lookups on zero-height btreesDarrick J. Wong1-0/+4
If the caller passes in a cursor to a zero-height btree (which is impossible), we never set block to anything but NULL, which causes the later dereference of it to crash. Instead, just return -EFSCORRUPTED. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17Merge branch 'iomap-fixes-4.8-rc3' into for-nextDave Chinner4-12/+58
2016-08-17xfs: remove OWN_AG rmap when allocating a block from the AGFLDarrick J. Wong1-0/+13
When we're really tight on space, xfs_alloc_ag_vextent_small() can allocate a block from the AGFL and give it to the caller. Since the caller is never the AGFL-fixing method, we must remove the OWN_AG reverse mapping because it will clash with whatever rmap the caller wants to set up. This bug was discovered by running generic/299 repeatedly. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17xfs: (re-)implement FIEMAP_FLAG_XATTRChristoph Hellwig3-1/+54
Use a special read-only iomap_ops implementation to support fiemap on the attr fork. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17xfs: simplify xfs_file_iomap_beginChristoph Hellwig2-11/+4
We'll never get nimap == 0 for a successful return from xfs_bmapi_read, so don't try to handle it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17xfs: store rmapbt block count in the AGFDarrick J. Wong4-3/+16
Track the number of blocks used for the rmapbt in the AGF. When we get to the AG reservation code we need this counter to quickly make our reservation during mount. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17xfs: don't invalidate whole file on DAX read/writeDave Chinner1-1/+12
When we do DAX IO, we try to invalidate the entire page cache held on the file. This is incorrect as it will trash the entire mapping tree that now tracks dirty state in exceptional entries in the radix tree slots. What we are trying to do is remove cached pages (e.g from reads into holes) that sit in the radix tree over the range we are about to write to. Hence we should just limit the invalidation to the range we are about to overwrite. Reported-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17xfs: fix bogus space reservation in xfs_iomap_write_allocateChristoph Hellwig1-3/+7
The space reservations was without an explaination in commit "Add error reporting calls in error paths that return EFSCORRUPTED" back in 2003. There is no reason to reserve disk blocks in the transaction when allocating blocks for delalloc space as we already reserved the space when creating the delalloc extent. With this fix we stop running out of the reserved pool in generic/229, which has happened for long time with small blocksize file systems, and has increased in severity with the new buffered write path. [ dchinner: we still need to pass the block reservation into xfs_bmapi_write() to ensure we don't deadlock during AG selection. See commit dbd5c8c ("xfs: pass total block res. as total xfs_bmapi_write() parameter") for more details on why this is necessary. ] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17xfs: don't assert fail on non-async buffers on ioacct decrementBrian Foster1-1/+0
The buffer I/O accounting mechanism tracks async buffers under I/O. As an optimization, the buffer I/O count is incremented only once on the first async I/O for a given hold cycle of a buffer and decremented once the buffer is released to the LRU (or freed). xfs_buf_ioacct_dec() has an ASSERT() check for an XBF_ASYNC buffer, but we have one or two corner cases where a buffer can be submitted for I/O multiple times via different methods in a single hold cycle. If an async I/O occurs first, the I/O count is incremented. If a sync I/O occurs before the hold count drops, XBF_ASYNC is cleared by the time the I/O count is decremented. Remove the async assert check from xfs_buf_ioacct_dec() as this is a perfectly valid scenario. For the purposes of I/O accounting, we really only care about the buffer async state at I/O submission time. Discovered-and-analyzed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-07fs: return EPERM on immutable inodeEryu Guan1-1/+1
In most cases, EPERM is returned on immutable inode, and there're only a few places returning EACCES. I noticed this when running LTP on overlayfs, setxattr03 failed due to unexpected EACCES on immutable inode. So converting all EACCES to EPERM on immutable inode. Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-06Merge tag 'xfs-rmap-for-linus-4.8-rc1' of ↵Linus Torvalds64-915/+6267
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull more xfs updates from Dave Chinner: "This is the second part of the XFS updates for this merge cycle, and contains the new reverse block mapping feature for XFS. Reverse mapping allows us to track the owner of a specific block on disk precisely. It is implemented as a set of btrees (one per allocation group) that track the owners of allocated extents. Effectively it is a "used space tree" that is updated when we allocate or free extents. i.e. it is coherent with the free space btrees we already maintain and never overlaps with them. This reverse mapping infrastructure is the building block of several upcoming features - reflink, copy-on-write data, dedupe, online metadata and data scrubbing, highly accurate bad sector/data loss reporting to users, and significantly improved reconstruction of damaged and corrupted filesystems. There's a lot of new stuff coming along in the next couple of cycles,a nd it all builds in the rmap infrastructure. As such, it's a huge chunk of new code with new on-disk format features and internal infrastructure. It warns at mount time as an experimental feature and that it may eat data (as we do with all new on-disk features until they stabilise). We have not released userspace suport for it yet - userspace support currently requires download from Darrick's xfsprogs repo and build from source, so the access to this feature is really developer/tester only at this point. Initial userspace support will be released at the same time kernel with this code in it is released. The new rmap enabled code regresses 3 xfstests - all are ENOSPC related corner cases, one of which Darrick posted a fix for a few hours ago. The other two are fixed by infrastructure that is part of the upcoming reflink patchset. This new ENOSPC infrastructure requires a on-disk format tweak required to keep mount times in check - we need to keep an on-disk count of allocated rmapbt blocks so we don't have to scan the entire btrees at mount time to count them. This is currently being tested and will be part of the fixes sent in the next week or two so users will not be exposed to this change" * tag 'xfs-rmap-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (52 commits) xfs: move (and rename) the deferred bmap-free tracepoints xfs: collapse single use static functions xfs: remove unnecessary parentheses from log redo item recovery functions xfs: remove the extents array from the rmap update done log item xfs: in btree_lshift, only allocate temporary cursor when needed xfs: remove unnecesary lshift/rshift key initialization xfs: remove the get*keys and update_keys btree ops pointers xfs: enable the rmap btree functionality xfs: don't update rmapbt when fixing agfl xfs: disable XFS_IOC_SWAPEXT when rmap btree is enabled xfs: add rmap btree block detection to log recovery xfs: add rmap btree geometry feature flag xfs: propagate bmap updates to rmapbt xfs: enable the xfs_defer mechanism to process rmaps to update xfs: log rmap intent items xfs: create rmap update intent log items xfs: add rmap btree insert and delete helpers xfs: convert unwritten status of reverse mappings xfs: remove an extent from the rmap btree xfs: add an extent to the rmap btree ...
2016-08-04Merge tag 'nfsd-4.8' of git://linux-nfs.org/~bfields/linuxLinus Torvalds3-5/+4
Pull nfsd updates from Bruce Fields: "Highlights: - Trond made a change to the server's tcp logic that allows a fast client to better take advantage of high bandwidth networks, but may increase the risk that a single client could starve other clients; a new sunrpc.svc_rpc_per_connection_limit parameter should help mitigate this in the (hopefully unlikely) event this becomes a problem in practice. - Tom Haynes added a minimal flex-layout pnfs server, which is of no use in production for now--don't build it unless you're doing client testing or further server development" * tag 'nfsd-4.8' of git://linux-nfs.org/~bfields/linux: (32 commits) nfsd: remove some dead code in nfsd_create_locked() nfsd: drop unnecessary MAY_EXEC check from create nfsd: clean up bad-type check in nfsd_create_locked nfsd: remove unnecessary positive-dentry check nfsd: reorganize nfsd_create nfsd: check d_can_lookup in fh_verify of directories nfsd: remove redundant zero-length check from create nfsd: Make creates return EEXIST instead of EACCES SUNRPC: Detect immediate closure of accepted sockets SUNRPC: accept() may return sockets that are still in SYN_RECV nfsd: allow nfsd to advertise multiple layout types nfsd: Close race between nfsd4_release_lockowner and nfsd4_lock nfsd/blocklayout: Make sure calculate signature/designator length aligned xfs: abstract block export operations from nfsd layouts SUNRPC: Remove unused callback xpo_adjust_wspace() SUNRPC: Change TCP socket space reservation SUNRPC: Add a server side per-connection limit SUNRPC: Micro optimisation for svc_data_ready SUNRPC: Call the default socket callbacks instead of open coding SUNRPC: lock the socket while detaching it ...
2016-08-03xfs: move (and rename) the deferred bmap-free tracepointsDarrick J. Wong2-2/+7
Rename the deferred bmap-free to extent_free and make them only trigger when we're really running deferred ops. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: collapse single use static functionsDarrick J. Wong2-128/+63
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: remove unnecessary parentheses from log redo item recovery functionsDarrick J. Wong2-12/+12
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: remove the extents array from the rmap update done log itemDarrick J. Wong7-89/+15
Nothing ever uses the extent array in the rmap update done redo item, so remove it before it is fixed in the on-disk log format. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: in btree_lshift, only allocate temporary cursor when neededDarrick J. Wong1-17/+17
We only need the temporary cursor in _btree_lshift if we're shifting in an overlapped btree. Therefore, factor that into a single block of code so we avoid unnecessary cursor duplication. Also fix use of the wrong cursor when checking for corruption in xfs_btree_rshift(). Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: remove unnecesary lshift/rshift key initializationDarrick J. Wong1-14/+0
In the lshift/rshift functions we don't use the key variable for anything now, so remove the variable and its initializer. The update_keys functions figure out the key for a block on their own. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: remove the get*keys and update_keys btree ops pointersDarrick J. Wong6-136/+61
These are internal btree functions; we don't need them to be dispatched via function pointers. Make them static again and just check the overlapped flag to figure out what we need to do. The strategy behind this patch was suggested by Christoph. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: enable the rmap btree functionalityDarrick J. Wong2-1/+6
Originally-From: Dave Chinner <dchinner@redhat.com> Add the feature flag to the supported matrix so that the kernel can mount and use rmap btree enabled filesystems Signed-off-by: Dave Chinner <dchinner@redhat.com> [darrick.wong@oracle.com: move the experimental tag] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: don't update rmapbt when fixing agflDarrick J. Wong2-4/+17
Allow a caller of xfs_alloc_fix_freelist to disable rmapbt updates when fixing the AG freelist. xfs_repair needs this during phase 5 to be able to adjust the freelist while it's reconstructing the rmap btree; the missing entries will be added back at the very end of phase 5 once the AGFL contents settle down. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: disable XFS_IOC_SWAPEXT when rmap btree is enabledDarrick J. Wong1-0/+4
Swapping extents between two inodes requires the owner to be updated in the rmap tree for all the extents that are swapped. This code does not yet exist, so switch off the XFS_IOC_SWAPEXT ioctl until support has been implemented. This will need to be done before the rmap btree code can have the experimental tag removed. This functionality will be provided in a (much) later patch, using some of the reflink deferred block remapping functionality to accomlish extent swapping with rmap updates. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: add rmap btree block detection to log recoveryDarrick J. Wong1-0/+4
Originally-From: Dave Chinner <dchinner@redhat.com> So such blocks can be correctly identified and have their operations structures attached to validate recovery has not resulted in a correct block. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: add rmap btree geometry feature flagDarrick J. Wong2-1/+4
Originally-From: Dave Chinner <dchinner@redhat.com> So xfs_info and other userspace utilities know the filesystem is using this feature. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: propagate bmap updates to rmapbtDarrick J. Wong8-16/+400
When we map, unmap, or convert an extent in a file's data or attr fork, schedule a respective update in the rmapbt. Previous versions of this patch required a 1:1 correspondence between bmap and rmap, but this is no longer true as we now have ability to make interval queries against the rmapbt. We use the deferred operations code to handle redo operations atomically and deadlock free. This plumbs in all five rmap actions (map, unmap, convert extent, alloc, free); we'll use the first three now for file data, and reflink will want the last two. We also add an error injection site to test log recovery. Finally, we need to fix the bmap shift extent code to adjust the rmaps correctly. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>