path: root/fs/xfs
2016-01-12  Merge branch 'xfs-misc-fixes-for-4.5-2' into for-next  (Dave Chinner; 18 files, -234/+159)
2016-01-12  xfs: handle dquot buffer readahead in log recovery correctly  (Dave Chinner; 5 files, -9/+41)

When we do dquot readahead in log recovery, we do not use a verifier as the underlying buffer may not have dquots in it, e.g. the allocation operation hasn't yet been replayed. Hence we do not want to fail recovery because we detect an operation to be replayed has not been run yet. This problem was addressed for inodes in commit d891400 ("xfs: inode buffers may not be valid during recovery readahead") but the problem was not recognised to exist for dquots and their buffers as the dquot readahead did not have a verifier.

The result of not using a verifier is that when the buffer is then next read to replay a dquot modification, the dquot buffer verifier will only be attached to the buffer if *readahead is not complete*. Hence we can read the buffer, replay the dquot changes and then add it to the delwri submission list without it having a verifier attached to it. This then generates warnings in xfs_buf_ioapply(), which catches and warns about this case.

Fix this and make it handle the same readahead verifier error cases as for inode buffers by adding a new readahead verifier that has a write operation as well as a read operation that marks the buffer as not done if any corruption is detected. Also make sure we don't run readahead if the dquot buffer has been marked as cancelled by recovery.

This will result in readahead either succeeding and the buffer having a valid write verifier, or readahead failing and the buffer state requiring the subsequent read to resubmit the IO with the new verifier. In either case, this will result in the buffer always ending up with a valid write verifier on it.

Note: we also need to fix the inode buffer readahead error handling to mark the buffer with EIO. Brian noticed that the code I copied from there was wrong during review, so fix it at the same time. Add comments linking the two functions that handle readahead verifier errors together so we don't forget this behavioural link in future.

cc: <[email protected]> # 3.12 - current
Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
2016-01-12  xfs: inode recovery readahead can race with inode buffer creation  (Dave Chinner; 2 files, -5/+14)

When we do inode readahead in log recovery, we can do the readahead before we've replayed the icreate transaction that stamps the buffer with inode cores. The inode readahead verifier catches this and marks the buffer as !done to indicate that it doesn't yet contain valid inodes.

In adding buffer error notification (i.e. setting b_error = -EIO at the same time as we clear the done flag) to such a readahead verifier failure, we can then get subsequent inode recovery failing with this error:

    XFS (dm-0): metadata I/O error: block 0xa00060 ("xlog_recover_do..(read#2)") error 5 numblks 32

This occurs when readahead completion races with icreate item replay such as:

    inode readahead
        find buffer
        lock buffer
        submit RA io
        ....
    icreate recovery
        xfs_trans_get_buffer
        find buffer
        lock buffer
        <blocks on RA completion>
        .....
        <ra completion>
            fails verifier
            clear XBF_DONE
            set bp->b_error = -EIO
            release and unlock buffer
        <icreate gains lock>
        icreate initialises buffer
        marks buffer as done
        adds buffer to delayed write queue
        releases buffer

At this point, we have an initialised inode buffer that is up to date but has an -EIO state registered against it. When we finally get to recovering an inode in that buffer:

    inode item recovery
        xfs_trans_read_buffer
        find buffer
        lock buffer
        sees XBF_DONE is set, returns buffer
        sees bp->b_error is set
        fail log recovery!

Essentially, we need xfs_trans_get_buf_map() to clear the error status of the buffer when doing a lookup. This function returns uninitialised buffers, so the buffer returned can not be in an error state and none of the code that uses this function expects b_error to be set on return. Indeed, there is an ASSERT(!bp->b_error); in the transaction case in xfs_trans_get_buf_map() that would have caught this if log recovery used transactions....

This patch firstly changes the inode readahead failure to set -EIO on the buffer, and secondly changes xfs_buf_get_map() to never return a buffer with an error state set so this first change doesn't cause unexpected log recovery failures.

cc: <[email protected]> # 3.12 - current
Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
2016-01-11  xfs: eliminate committed arg from xfs_bmap_finish  (Eric Sandeen; 10 files, -218/+68)

Calls to xfs_bmap_finish() and xfs_trans_ijoin(), and the associated comments were replicated several times across the attribute code, all dealing with what to do if the transaction was or wasn't committed.

And in that replicated code, an ASSERT() test of an uninitialized variable occurs in several locations:

    error = xfs_attr_thing(&args);
    if (!error) {
            error = xfs_bmap_finish(&args.trans, args.flist,
                                    &committed);
    }
    if (error) {
            ASSERT(committed);

If the first xfs_attr_thing() failed, we'd skip the xfs_bmap_finish, never set "committed", and then test it in the ASSERT.

Fix this up by moving the committed state internal to xfs_bmap_finish, and add a new inode argument. If an inode is passed in, it is passed through to __xfs_trans_roll() and joined to the transaction there if the transaction was committed.

xfs_qm_dqalloc() was a little unique in that it called bjoin rather than ijoin, but as Dave points out we can detect the committed state by checking whether (*tpp != tp).

Addresses-Coverity-Id: 102360
Addresses-Coverity-Id: 102361
Addresses-Coverity-Id: 102363
Addresses-Coverity-Id: 102364
Signed-off-by: Eric Sandeen <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
2016-01-08  xfs: bmapbt checking on debug kernels too expensive  (Dave Chinner; 1 file, -2/+8)

For large sparse or fragmented files, checking every single entry in the bmapbt on every operation is prohibitively expensive. Especially as such checks rarely discover problems during normal operations on high extent count files. Our regression tests don't tend to exercise files with hundreds of thousands to millions of extents, so mostly this isn't noticed.

However, trying to run things like xfs_mdrestore of large filesystem dumps on a debug kernel quickly becomes impossible as the CPU is completely burnt up repeatedly walking the sparse file bmapbt that is generated for every allocation that is made.

Hence, if the file has more than 10,000 extents, just don't bother with walking the tree to check it exhaustively. The btree code has checks that ensure that the newly inserted/removed/modified record is correctly ordered, so the entire tree walk in these cases has limited additional value.

Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
2016-01-08  xfs: add tracepoints to readpage calls  (Dave Chinner; 2 files, -0/+28)

This allows us to see page cache driven readahead in action as it passes through XFS. This helps to understand buffered read throughput problems such as readahead IO sizes being too small for the underlying device to reach max throughput.

Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
2016-01-06  fs: use block_device name vsprintf helper  (Dmitry Monakhov; 1 file, -6/+2)
Signed-off-by: Dmitry Monakhov <[email protected]> Signed-off-by: Al Viro <[email protected]>
2016-01-05  Merge branch 'xfs-dax-fixes-for-4.5' into for-next  (Dave Chinner; 2 files, -12/+24)
2016-01-05  Merge branch 'xfs-misc-fixes-for-4.5' into for-next  (Dave Chinner; 28 files, -83/+126)
2016-01-05  xfs: debug mode log record crc error injection  (Brian Foster; 3 files, -7/+77)

XFS now uses CRC verification over a limited section of the log to detect torn writes prior to a crash. This is difficult to test directly due to the timing and hardware requirements to cause a short write.

Add a mechanism to inject CRC errors into log records to facilitate testing torn write detection during log recovery. This mechanism is dangerous and can result in filesystem corruption. Thus, it is only available in DEBUG mode for testing/development purposes.

Set a non-zero value to the following sysfs entry to enable error injection:

    /sys/fs/xfs/<dev>/log/log_badcrc_factor

Once enabled, XFS intentionally writes an invalid CRC to a log record at some random point in the future based on the provided frequency. The filesystem immediately shuts down once the record has been written to the physical log to prevent metadata writeback (e.g., AIL insertion) once the log write completes. This helps reasonably simulate a torn write to the log as the affected record must be safe to discard. The next mount after the intentional shutdown requires log recovery and should detect and recover from the torn write.

Note again that this _will_ result in data loss or worse. For testing and development purposes only!

Signed-off-by: Brian Foster <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
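As a rough illustration (not part of the patch), a minimal userspace sketch that enables the injection knob described above; the device name "sda1" and the frequency value are placeholders you would substitute for your setup:

    /* enable_badcrc.c: write a frequency value into the DEBUG-only
     * log_badcrc_factor sysfs knob (hypothetical device name "sda1"). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            const char *path = "/sys/fs/xfs/sda1/log/log_badcrc_factor";
            const char *freq = "100\n";   /* non-zero value enables injection */
            int fd = open(path, O_WRONLY);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            if (write(fd, freq, strlen(freq)) < 0) {
                    perror("write");
                    close(fd);
                    return 1;
            }
            close(fd);
            printf("log CRC error injection enabled\n");
            return 0;
    }

After the injected bad CRC triggers the shutdown, the next mount exercises the torn-write detection path in log recovery described in the commit below.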
2016-01-05  xfs: detect and trim torn writes during log recovery  (Brian Foster; 1 file, -20/+289)

Certain types of storage, such as persistent memory, do not provide sector atomicity for writes. This means that if a crash occurs while XFS is writing log records, only part of those records might make it to the storage. This is problematic because log recovery uses the cycle value packed at the top of each log block to locate the head/tail of the log. This can lead to CRC verification failures during log recovery and an unmountable fs for a filesystem that is otherwise consistent.

Update log recovery to incorporate log record CRC verification as part of the head/tail discovery process. Once the head is located via the traditional algorithm, run a CRC-only pass over the records up to the head of the log. If CRC verification fails, assume that the records are torn as a matter of policy and trim the head block back to the start of the first bad record.

Signed-off-by: Brian Foster <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
2016-01-04  xfs: introduce per-inode DAX enablement  (Dave Chinner; 4 files, -12/+51)

Rather than just being able to turn DAX on and off via a mount option, some applications may only want to enable DAX for certain performance critical files in a filesystem.

This patch introduces a new inode flag to enable DAX in the v3 inode di_flags2 field. It adds support for setting and clearing flags in the di_flags2 field via the XFS_IOC_FSSETXATTR ioctl, and sets the S_DAX inode flag appropriately when it is seen.

When this flag is set on a directory, it acts as an "inherit flag". That is, inodes created in the directory will automatically inherit the on-disk inode DAX flag, enabling administrators to set up directory hierarchies that automatically use DAX. Setting this flag on an empty root directory will make the entire filesystem use DAX by default.

Signed-off-by: Dave Chinner <[email protected]>
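For illustration only, a minimal userspace sketch of how an application might flip the per-inode DAX flag through the FSSETXATTR interface described above and promoted in the ioctl commits below. It assumes the FS_XFLAG_DAX bit from the shared uapi headers; on older headers the flag name, value or availability may differ:

    /* set_dax.c: set the inode DAX xflag on a file or directory via
     * FSGETXATTR/FSSETXATTR. */
    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    #ifndef FS_XFLAG_DAX
    #define FS_XFLAG_DAX 0x00002000   /* assumed value; check your headers */
    #endif

    int main(int argc, char **argv)
    {
            struct fsxattr fsx;
            int fd;

            if (argc != 2) {
                    fprintf(stderr, "usage: %s <file-or-dir>\n", argv[0]);
                    return 1;
            }
            fd = open(argv[1], O_RDONLY);
            if (fd < 0 || ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
                    perror("open/FSGETXATTR");
                    return 1;
            }
            fsx.fsx_xflags |= FS_XFLAG_DAX;   /* on a directory, new inodes inherit it */
            if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0) {
                    perror("FSSETXATTR");
                    return 1;
            }
            close(fd);
            return 0;
    }

The same struct fsxattr interface also carries fsx_projid, which is what the FS_IOC_FS[SG]ETXATTR promotion below lets other filesystems such as ext4 reuse for project IDs.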
2016-01-04  xfs: use FS_XFLAG definitions directly  (Dave Chinner; 4 files, -74/+53)

Now that the ioctls have been hoisted up to the VFS level, use the VFS definitions directly and remove the XFS specific definitions completely. Userspace is going to have to handle the change of this interface separately, so removing the definitions from xfs_fs.h is not an issue here at all.

Signed-off-by: Dave Chinner <[email protected]>
2016-01-04  fs: XFS_IOC_FS[SG]SETXATTR to FS_IOC_FS[SG]ETXATTR promotion  (Dave Chinner; 1 file, -33/+18)
Hoist the ioctl definitions for the XFS_IOC_FS[SG]SETXATTR API from fs/xfs/libxfs/xfs_fs.h to include/uapi/linux/fs.h so that the ioctls can be used by all filesystems, not just XFS. This enables (initially) ext4 to use the ioctl to set project IDs on inodes. Based-on-patch-from: Li Xi <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2016-01-04  xfs: fix recursive splice read locking with DAX  (Dave Chinner; 1 file, -9/+16)

Doing a splice read (generic/249) generates a lockdep splat because we recursively lock the inode iolock in this path:

    SyS_sendfile64
      do_sendfile
        do_splice_direct
          splice_direct_to_actor
            do_splice_to
              xfs_file_splice_read          <<<<<< lock here
                default_file_splice_read
                  vfs_readv
                    do_readv_writev
                      do_iter_readv_writev
                        xfs_file_read_iter  <<<<<< then here

The issue here is that for DAX inodes we need to avoid the page cache path and hence simply push it into the normal read path. Unfortunately, we can't tell down at xfs_file_read_iter() whether we are being called from the splice path and hence we cannot avoid the locking at this layer. Hence we simply have to drop the inode locking at the higher splice layer for DAX.

Signed-off-by: Dave Chinner <[email protected]>
Tested-by: Ross Zwisler <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
2016-01-04  xfs: Don't use reserved blocks for data blocks with DAX  (Dave Chinner; 1 file, -3/+8)
Commit 1ca1915 ("xfs: Don't use unwritten extents for DAX") enabled the DAX allocation call to dip into the reserve pool in case it was converting unwritten extents rather than allocating blocks. This was a direct copy of the unwritten extent conversion code, but had an unintended side effect of allowing normal data block allocation to use the reserve pool. Hence normal block allocation could deplete the reserve pool and prevent unwritten extent conversion at ENOSPC, hence violating fallocate guarantees on preallocated space. Fix it by checking whether the incoming map from __xfs_get_blocks() spans an unwritten extent and only use the reserve pool if the allocation covers an unwritten extent. Signed-off-by: Dave Chinner <[email protected]> Tested-by: Ross Zwisler <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2016-01-04  XFS: Use a signed return type for suffix_kstrtoint()  (Markus Elfring; 1 file, -1/+1)
The return type "unsigned long" was used by the suffix_kstrtoint() function even though it will eventually return a negative error code. Improve this implementation detail by using the type "int" instead. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <[email protected]> Reviewed-by: Eric Sandeen <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2016-01-04  libxfs: refactor short btree block verification  (Darrick J. Wong; 4 files, -54/+67)
Create xfs_btree_sblock_verify() to verify short-format btree blocks (i.e. the per-AG btrees with 32-bit block pointers) instead of open-coding them. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2016-01-04  libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct  (Darrick J. Wong; 1 file, -1/+1)
Because struct xfs_agfl is 36 bytes long and has a 64-bit integer inside it, gcc will quietly round the structure size up to the nearest 64 bits -- in this case, 40 bytes. This results in the XFS_AGFL_SIZE macro returning incorrect results for v5 filesystems on 64-bit machines (118 items instead of 119). As a result, a 32-bit xfs_repair will see garbage in AGFL item 119 and complain. Therefore, tell gcc not to pad the structure so that the AGFL size calculation is correct. cc: <[email protected]> # 3.10 - 4.4 Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
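As a generic illustration of the padding problem described above (a hypothetical stand-in, not the real struct xfs_agfl layout), a structure carrying a 64-bit member plus enough 32-bit fields to total 36 bytes is silently rounded up to 40 bytes unless it is explicitly packed:

    /* pad_demo.c: show how a 64-bit member raises struct alignment to 8 bytes
     * and adds tail padding, and how __attribute__((packed)) removes it. */
    #include <stdint.h>
    #include <stdio.h>

    struct padded_hdr {            /* hypothetical header, not struct xfs_agfl */
            uint32_t magic;
            uint32_t seqno;
            uint32_t flfirst;
            uint32_t fllast;
            uint64_t lsn;          /* 64-bit member forces 8-byte alignment */
            uint32_t flcount;
            uint32_t crc;
            uint32_t spare;
    };                             /* 36 bytes of fields, sizeof() == 40 */

    struct packed_hdr {
            uint32_t magic;
            uint32_t seqno;
            uint32_t flfirst;
            uint32_t fllast;
            uint64_t lsn;
            uint32_t flcount;
            uint32_t crc;
            uint32_t spare;
    } __attribute__((packed));     /* exactly 36 bytes */

    int main(void)
    {
            printf("padded: %zu bytes\n", sizeof(struct padded_hdr));
            printf("packed: %zu bytes\n", sizeof(struct packed_hdr));
            return 0;
    }

With a 40-byte header, dividing the remaining space of a 512-byte sector by the 4-byte AGFL entry size gives (512 - 40) / 4 = 118 slots instead of (512 - 36) / 4 = 119, which is exactly the mismatch the patch fixes.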
2016-01-04  libxfs: use a convenience variable instead of open-coding the fork  (Darrick J. Wong; 1 file, -11/+12)
Use a convenience variable instead of open-coding the inode fork. This isn't really needed for now, but will become important when we add the copy-on-write fork later. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2016-01-04  xfs: fix log ticket type printing  (Darrick J. Wong; 1 file, -2/+4)
Update the log ticket reservation type printing code to reflect all the types of log tickets, to avoid incorrect debug output and avoid running off the end of the array. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2016-01-04  libxfs: make xfs_alloc_fix_freelist non-static  (Darrick J. Wong; 2 files, -1/+2)
Since xfs_repair wants to use xfs_alloc_fix_freelist, remove the static designation. xfsprogs already has this; this simply brings the kernel up to date. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2016-01-04  xfs: make xfs_buf_ioend_async() static  (Alexander Kuleshov; 1 file, -1/+1)

There are no callers of the xfs_buf_ioend_async() function outside of fs/xfs/xfs_buf.c, so let's make it static.

Signed-off-by: Alexander Kuleshov <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
2016-01-04  xfs: send warning of project quota to userspace via netlink  (Masatake YAMATO; 1 file, -5/+9)

Linux's quota subsystem has the ability to handle project quotas. This commit just utilizes that ability from the XFS side. dbus-monitor and quota_nld shipped as part of quota-tools can be used for testing. See the patch posting on the XFS list for details on testing.

Signed-off-by: Masatake YAMATO <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
2016-01-04  xfs: get mp from bma->ip in xfs_bmap code  (Eric Sandeen; 1 file, -2/+2)

In my earlier commit c29aad4 ("xfs: pass mp to XFS_WANT_CORRUPTED_GOTO") I added some local mp variables with code which indicates that mp might be NULL. Coverity doesn't like this now, because the updated per-fs XFS_STATS macros dereference mp.

I don't think this is actually a problem; from what I can tell, we cannot get to these functions with a null bma->tp, so my NULL check was probably pointless. Still, it's not super obvious. So switch this code to get mp from the inode on the xfs_bmalloca structure, with no conditional, because the functions are already using bmap->ip directly.

Addresses-Coverity-Id: 1339552
Addresses-Coverity-Id: 1339553
Signed-off-by: Eric Sandeen <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
2016-01-04  xfs: print name of verifier if it fails  (Eric Sandeen; 18 files, -2/+24)
This adds a name to each buf_ops structure, so that if a verifier fails we can print the type of verifier that failed it. Should be a slight debugging aid, I hope. Signed-off-by: Eric Sandeen <[email protected]> Reviewed-by: Brian Foster <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2016-01-04  libxfs: Optimize the loop for xfs_bitmap_empty  (Jia He; 1 file, -3/+3)

If there is any non-zero bit in the bitmap, we can jump out of the loop and finish the function as soon as possible.

Signed-off-by: Jia He <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
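A generic sketch of the early-exit pattern (not the xfs_bitmap_empty code itself): scan the bitmap one unsigned long word at a time and stop at the first non-zero word instead of examining the whole array:

    /* bitmap_empty_demo.c: return true only if every word of the bitmap is
     * zero, bailing out at the first non-zero word. */
    #include <stdbool.h>
    #include <stdio.h>

    static bool bitmap_is_empty(const unsigned long *map, unsigned int nwords)
    {
            for (unsigned int i = 0; i < nwords; i++) {
                    if (map[i])
                            return false;   /* early exit: a set bit was found */
            }
            return true;
    }

    int main(void)
    {
            unsigned long map[4] = { 0, 0, 1UL << 5, 0 };

            printf("empty? %s\n", bitmap_is_empty(map, 4) ? "yes" : "no");
            return 0;
    }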
2016-01-04  xfs: refactor log record start detection into a new helper  (Brian Foster; 1 file, -35/+83)
As part of the head/tail discovery process, log recovery locates the head block and then reverse seeks to find the start of the last active record in the log. This is non-trivial as the record itself could have wrapped around the end of the physical log. Log recovery torn write detection potentially needs to walk further behind the last record in the log, as multiple log I/Os can be in-flight at one time during a crash event. Therefore, refactor the reverse log record header search mechanism into a new helper that supports the ability to seek past an arbitrary number of log records (or until the tail is hit). Update the head/tail search mechanism to call the new helper, but otherwise there is no change in log recovery behavior. Signed-off-by: Brian Foster <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2016-01-04  xfs: support a crc verification only log record pass  (Brian Foster; 2 files, -5/+20)
Log recovery torn write detection uses CRC verification over a range of the active log to identify torn writes. Since the generic log recovery pass code implements a superset of the functionality required for CRC verification, it can be easily modified to support a CRC verification only pass. Create a new CRC pass type and update the log record processing helper to skip everything beyond CRC verification when in this mode. This pass will be invoked in subsequent patches to implement torn write detection. Signed-off-by: Brian Foster <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2016-01-04  xfs: return start block of first bad log record during recovery  (Brian Foster; 1 file, -4/+16)
Each log recovery pass walks from the tail block to the head block and processes records appropriately based on the associated log pass type. There are various failure conditions that can occur through this sequence, such as I/O errors, CRC errors, etc. Log torn write detection will perform CRC verification near the head of the log to detect torn writes and trim torn records from the log appropriately. As it is, xlog_do_recovery_pass() only returns an error code in the event of CRC failure, which isn't enough information to trim the head of the log. Update xlog_do_recovery_pass() to optionally return the start block of the associated record when an error occurs. This patch contains no functional changes. Signed-off-by: Brian Foster <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2016-01-04  xfs: refactor and open code log record crc check  (Brian Foster; 1 file, -46/+26)

Log record CRC verification currently occurs during active log recovery, immediately before a log record is unpacked. Therefore, the CRC calculation code is buried within the data unpack function. CRC verification pass support only needs to go so far as to check the CRC, but this is not easily done as the code is currently organized.

Since we now have a new log record processing helper, pull the record CRC verification code out from the unpack helper and open-code it at the top of the new process helper. This facilitates the ability to modify how records are processed based on the type of the current pass.

This patch contains no functional changes.

Signed-off-by: Brian Foster <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
2016-01-04  xfs: refactor log record unpack and data processing  (Brian Foster; 1 file, -12/+23)
xlog_do_recovery_pass() duplicates a couple function calls related to processing log records because the function must handle wrapping around the end of the log if the head is behind the tail. This is implemented as separate loops. CRC verification pass support will modify how records are processed in both of these loops. Rather than continue to duplicate code, factor the calls that process a log record into a new helper and call that helper from both loops. This patch contains no functional changes. Signed-off-by: Brian Foster <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2016-01-04  xfs: detect and handle invalid iclog size set by mkfs  (Brian Foster; 1 file, -1/+25)

XFS log records have separate fields for the record size and the iclog size used to write the record. mkfs.xfs zeroes the log and writes an unmount record to generate a clean log for the subsequent mount. The userspace record logging code has a bug where the iclog size (h_size) field of the log record is hardcoded to 32k, even if a log stripe unit is specified. The log record length is correctly extended to the stripe unit. Since the kernel log recovery code uses the h_size field to determine the log buffer size, this means that the kernel can attempt to read/process records larger than the buffer size and overrun the buffer.

This has historically not been a problem because the kernel doesn't actually run through log recovery in the clean unmount case. Instead, the kernel detects that a single unmount record exists between the head and tail and pushes the tail forward such that the log is viewed as clean (head == tail). Once CRC verification is enabled, however, all records at the head of the log are verified for CRC errors and thus we are susceptible to overrun problems if the iclog field is not correct.

While the core problem must be fixed in userspace, this is historical behavior that must be detected in the kernel to avoid severe side effects such as memory corruption and crashes. Update the log buffer size calculation code to detect this condition, warn the user and resize the log buffer based on the log stripe unit. Return a corruption error in cases where this does not look like a clean filesystem (i.e., the log record header indicates more than one operation).

Signed-off-by: Brian Foster <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
2015-12-30  switch ->get_link() to delayed_call, kill ->put_link()  (Al Viro; 1 file, -3/+3)
Signed-off-by: Al Viro <[email protected]>
2015-12-08  replace ->follow_link() with new method that could stay in RCU mode  (Al Viro; 1 file, -2/+6)

new method: ->get_link(); replacement of ->follow_link(). The differences are:
  * inode and dentry are passed separately
  * might be called both in RCU and non-RCU mode; the former is indicated by passing it a NULL dentry.
  * when called that way it isn't allowed to block and should return ERR_PTR(-ECHILD) if it needs to be called in non-RCU mode.

It's a flagday change - the old method is gone, all in-tree instances converted. Conversion isn't hard; that said, so far very few instances do not immediately bail out when called in RCU mode. That'll change in the next commits.

Signed-off-by: Al Viro <[email protected]>
2015-12-06  xfs: Change how listxattr generates synthetic attributes  (Andreas Gruenbacher; 3 files, -105/+59)
Instead of adding the synthesized POSIX ACL attribute names after listing all non-synthesized attributes, generate them immediately when listing the non-synthesized attributes. In addition, merge xfs_xattr_put_listent and xfs_xattr_put_listent_sizes to ensure that the list size is computed correctly; the split version was overestimating the list size for non-root users. Signed-off-by: Andreas Gruenbacher <[email protected]> Cc: Dave Chinner <[email protected]> Cc: [email protected] Signed-off-by: Al Viro <[email protected]>
2015-12-06  vfs: Distinguish between full xattr names and proper prefixes  (Andreas Gruenbacher; 1 file, -6/+0)
Add an additional "name" field to struct xattr_handler. When the name is set, the handler matches attributes with exactly that name. When the prefix is set instead, the handler matches attributes with the given prefix and with a non-empty suffix. This patch should avoid bugs like the one fixed in commit c361016a in the future. Signed-off-by: Andreas Gruenbacher <[email protected]> Reviewed-by: James Morris <[email protected]> Signed-off-by: Al Viro <[email protected]>
2015-12-06  posix acls: Remove duplicate xattr name definitions  (Andreas Gruenbacher; 1 file, -4/+4)
Remove POSIX_ACL_XATTR_{ACCESS,DEFAULT} and GFS2_POSIX_ACL_{ACCESS,DEFAULT} and replace them with the definitions in <include/uapi/linux/xattr.h>. Signed-off-by: Andreas Gruenbacher <[email protected]> Reviewed-by: James Morris <[email protected]> Signed-off-by: Al Viro <[email protected]>
2015-11-13  Merge branch 'for-linus-3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs  (Linus Torvalds; 1 file, -4/+6)

Pull vfs xattr cleanups from Al Viro.

* 'for-linus-3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  f2fs: xattr simplifications
  squashfs: xattr simplifications
  9p: xattr simplifications
  xattr handlers: Pass handler to operations instead of flags
  jffs2: Add missing capability check for listing trusted xattrs
  hfsplus: Remove unused xattr handler list operations
  ubifs: Remove unused security xattr handler
  vfs: Fix the posix_acl_xattr_list return value
  vfs: Check attribute names in posix acl xattr handlers
2015-11-13  xattr handlers: Pass handler to operations instead of flags  (Andreas Gruenbacher; 1 file, -4/+6)

The xattr_handler operations are currently all passed a file system specific flags value which the operations can use to disambiguate between different handlers; some file systems use that to distinguish the xattr namespace, for example. In some operations, it would be useful to also have access to the handler prefix. To allow that, pass a pointer to the handler to operations instead of the flags value alone.

Signed-off-by: Andreas Gruenbacher <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Al Viro <[email protected]>
2015-11-11  Merge tag 'xfs-for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs  (Linus Torvalds; 61 files, -411/+989)

Pull xfs updates from Dave Chinner:
 "There is nothing really major here - the only significant addition is the per-mount operation statistics infrastructure. Otherwise there's various ACL, xattr, DAX, AIO and logging fixes, and a smattering of small cleanups and fixes elsewhere.

  Summary:
   - per-mount operational statistics in sysfs
   - fixes for concurrent aio append write submission
   - various logging fixes
   - detection of zeroed logs and invalid log sequence numbers on v5 filesystems
   - memory allocation failure message improvements
   - a bunch of xattr/ACL fixes
   - fdatasync optimisation
   - miscellaneous other fixes and cleanups"

* tag 'xfs-for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (39 commits)
  xfs: give all workqueues rescuer threads
  xfs: fix log recovery op header validation assert
  xfs: Fix error path in xfs_get_acl
  xfs: optimise away log forces on timestamp updates for fdatasync
  xfs: don't leak uuid table on rmmod
  xfs: invalidate cached acl if set via ioctl
  xfs: Plug memory leak in xfs_attrmulti_attr_set
  xfs: Validate the length of on-disk ACLs
  xfs: invalidate cached acl if set directly via xattr
  xfs: xfs_filemap_pmd_fault treats read faults as write faults
  xfs: add ->pfn_mkwrite support for DAX
  xfs: DAX does not use IO completion callbacks
  xfs: Don't use unwritten extents for DAX
  xfs: introduce BMAPI_ZERO for allocating zeroed extents
  xfs: fix inode size update overflow in xfs_map_direct()
  xfs: clear PF_NOFREEZE for xfsaild kthread
  xfs: fix an error code in xfs_fs_fill_super()
  xfs: stats are no longer dependent on CONFIG_PROC_FS
  xfs: simplify /proc teardown & error handling
  xfs: per-filesystem stats counter implementation
  ...
2015-11-11  vfs: remove unused wrapper block_page_mkwrite()  (Ross Zwisler; 1 file, -1/+1)
The function currently called "__block_page_mkwrite()" used to be called "block_page_mkwrite()" until a wrapper for this function was added by: commit 24da4fab5a61 ("vfs: Create __block_page_mkwrite() helper passing error values back") This wrapper, the current "block_page_mkwrite()", is currently unused. __block_page_mkwrite() is used directly by ext4, nilfs2 and xfs. Remove the unused wrapper, rename __block_page_mkwrite() back to block_page_mkwrite() and update the comment above block_page_mkwrite(). Signed-off-by: Ross Zwisler <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Jan Kara <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Al Viro <[email protected]>
2015-11-10  Merge branch 'xfs-misc-fixes-for-4.4-3' into for-next  (Dave Chinner; 4 files, -5/+6)
2015-11-10  xfs: give all workqueues rescuer threads  (Chris Mason; 1 file, -3/+4)

We're consistently hitting deadlocks here with XFS on recent kernels. After some digging through the crash files, it looks like everyone in the system is waiting for XFS to reclaim memory. Something like this:

    PID: 2733434  TASK: ffff8808cd242800  CPU: 19  COMMAND: "java"
     #0 [ffff880019c53588] __schedule at ffffffff818c4df2
     #1 [ffff880019c535d8] schedule at ffffffff818c5517
     #2 [ffff880019c535f8] _xfs_log_force_lsn at ffffffff81316348
     #3 [ffff880019c53688] xfs_log_force_lsn at ffffffff813164fb
     #4 [ffff880019c536b8] xfs_iunpin_wait at ffffffff8130835e
     #5 [ffff880019c53728] xfs_reclaim_inode at ffffffff812fd453
     #6 [ffff880019c53778] xfs_reclaim_inodes_ag at ffffffff812fd8c7
     #7 [ffff880019c53928] xfs_reclaim_inodes_nr at ffffffff812fe433
     #8 [ffff880019c53958] xfs_fs_free_cached_objects at ffffffff8130d3b9
     #9 [ffff880019c53968] super_cache_scan at ffffffff811a6f73
    #10 [ffff880019c539c8] shrink_slab at ffffffff811460e6
    #11 [ffff880019c53aa8] shrink_zone at ffffffff8114a53f
    #12 [ffff880019c53b48] do_try_to_free_pages at ffffffff8114a8ba
    #13 [ffff880019c53be8] try_to_free_pages at ffffffff8114ad5a
    #14 [ffff880019c53c78] __alloc_pages_nodemask at ffffffff8113e1b8
    #15 [ffff880019c53d88] alloc_kmem_pages_node at ffffffff8113e671
    #16 [ffff880019c53dd8] copy_process at ffffffff8104f781
    #17 [ffff880019c53ec8] do_fork at ffffffff8105129c
    #18 [ffff880019c53f38] sys_clone at ffffffff810515b6
    #19 [ffff880019c53f48] stub_clone at ffffffff818c8e4d

xfs_log_force_lsn is waiting for logs to get cleaned, which is waiting for IO, which is waiting for workers to complete the IO which is waiting for worker threads that don't exist yet:

    PID: 2752451  TASK: ffff880bd6bdda00  CPU: 37  COMMAND: "kworker/37:1"
     #0 [ffff8808d20abbb0] __schedule at ffffffff818c4df2
     #1 [ffff8808d20abc00] schedule at ffffffff818c5517
     #2 [ffff8808d20abc20] schedule_timeout at ffffffff818c7c6c
     #3 [ffff8808d20abcc0] wait_for_completion_killable at ffffffff818c6495
     #4 [ffff8808d20abd30] kthread_create_on_node at ffffffff8106ec82
     #5 [ffff8808d20abdf0] create_worker at ffffffff8106752f
     #6 [ffff8808d20abe40] worker_thread at ffffffff810699be
     #7 [ffff8808d20abec0] kthread at ffffffff8106ef59
     #8 [ffff8808d20abf50] ret_from_fork at ffffffff818c8ac8

I think we should be using WQ_MEM_RECLAIM to make sure this thread pool makes progress when we're not able to allocate new workers.

[dchinner: make all workqueues WQ_MEM_RECLAIM]

Signed-off-by: Chris Mason <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
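For context, a kernel-style sketch of what the flag changes (made-up names; it only compiles as part of a kernel build): allocating a workqueue with WQ_MEM_RECLAIM gives it a pre-created rescuer thread, so queued work can still run even when new worker threads cannot be spawned under memory pressure:

    /* Sketch: create an I/O completion workqueue that can make forward
     * progress during memory reclaim.  "xfs-demo" is a made-up name. */
    #include <linux/workqueue.h>

    static struct workqueue_struct *demo_io_wq;

    static int demo_init_workqueue(void)
    {
            /* WQ_MEM_RECLAIM guarantees a rescuer thread exists up front,
             * so work items can run even if kthread creation would block. */
            demo_io_wq = alloc_workqueue("xfs-demo", WQ_MEM_RECLAIM, 0);
            if (!demo_io_wq)
                    return -ENOMEM;
            return 0;
    }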
2015-11-10  xfs: fix log recovery op header validation assert  (Brian Foster; 1 file, -1/+1)
Commit 89cebc84 ("xfs: validate transaction header length on log recovery") added additional validation of the on-disk op header length to protect from buffer overflow during log recovery. It accounts for the fact that the transaction header can be split across multiple op headers. It added an assert for when this occurs that verifies the length of the second part of a split transaction header is less than a full transaction header. In other words, it expects that the first op header of a split transaction header includes at least some portion of the transaction header. This expectation is not always valid as a zero-length op header can exist for the first op header of a split transaction header (see xlog_recover_add_to_trans() for details). This means that the second op header can have a valid, full length transaction header and thus the full header is copied in xlog_recover_add_to_cont_trans(). Fix the assert in xlog_recover_add_to_cont_trans() to handle this case correctly and require that the op header length is less than or equal to a full transaction header. Signed-off-by: Brian Foster <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-11-10  xfs: Fix error path in xfs_get_acl  (Andreas Gruenbacher; 2 files, -1/+1)
Error codes from xfs_attr_get other than -ENOATTR were not properly reported. Fix that. In addition, the declaration of struct xfs_inode in xfs_acl.h isn't needed. Signed-off-by: Andreas Gruenbacher <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-11-06  mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd  (Mel Gorman; 1 file, -1/+1)

__GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access to one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve".

Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimistic allocation with a fallback option can access atomic reserves.

This patch uses __GFP_ATOMIC to identify callers that are truly atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim.

This patch then converts a number of sites:

o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag.

o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress.

o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. (See the sketch after this message.)

o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM.

The first key hazard to watch out for is callers that removed __GFP_WAIT and were depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH.

The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd.

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Vitaly Wool <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
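A kernel-style sketch of the recommended pattern above (made-up helper name; it only compiles as part of a kernel build): test gfpflags_allow_blocking() rather than poking at __GFP_WAIT directly when deciding whether an allocation path may sleep:

    /* Sketch: pick an allocation strategy based on whether the caller's gfp
     * mask permits sleeping.  demo_get_buffer() is a made-up helper. */
    #include <linux/gfp.h>
    #include <linux/slab.h>

    static void *demo_get_buffer(size_t size, gfp_t gfp)
    {
            if (gfpflags_allow_blocking(gfp)) {
                    /* The caller may sleep, so the allocation can enter
                     * direct reclaim and wait for memory. */
                    return kmalloc(size, gfp);
            }

            /* The caller must not sleep: do not test __GFP_WAIT directly
             * any more, and do not lean on atomic reserves here. */
            return kmalloc(size, GFP_NOWAIT);
    }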
2015-11-03  Merge branch 'xfs-dax-updates' into for-next  (Dave Chinner; 11 files, -65/+225)
2015-11-03  Merge branch 'xfs-misc-fixes-for-4.4-2' into for-next  (Dave Chinner; 14 files, -16/+101)
2015-11-03  xfs: optimise away log forces on timestamp updates for fdatasync  (Dave Chinner; 5 files, -5/+29)

xfs: timestamp updates cause excessive fdatasync log traffic

Sage Weil reported that a ceph test workload was writing to the log on every fdatasync during an overwrite workload. Event tracing showed that the only metadata modification being made was the timestamp updates during the write(2) syscall, but fdatasync(2) is supposed to ignore them. The key observation was that the transactions in the log all looked like this:

    INODE: #regs: 4   ino: 0x8b  flags: 0x45   dsize: 32

And contained a flags field of 0x45 or 0x85, and had data and attribute forks following the inode core. This means that the timestamp updates were triggering dirty relogging of previously logged parts of the inode that hadn't yet been flushed back to disk.

There are two parts to this problem. The first is that XFS relogs dirty regions in subsequent transactions, so it carries around the fields that have been dirtied since the last time the inode was written back to disk, not since the last time the inode was forced into the log.

The second part is that on v5 filesystems, the inode change count update during inode dirtying also sets the XFS_ILOG_CORE flag, so on v5 filesystems this makes a timestamp update dirty the entire inode.

As a result when fdatasync is run, it looks at the dirty fields in the inode, and sees more than just the timestamp flag, even though the only metadata change since the last fdatasync was just the timestamps. Hence we force the log on every subsequent fdatasync even though it is not needed.

To fix this, add a new field to the inode log item that tracks changes since the last time fsync/fdatasync forced the log to flush the changes to the journal. This flag is updated when we dirty the inode, but we do it before updating the change count so it does not carry the "core dirty" flag from timestamp updates. The fields are zeroed when the inode is marked clean (due to writeback/freeing) or when an fsync/datasync forces the log. Hence if we only dirty the timestamps on the inode between fsync/fdatasync calls, the fdatasync will not trigger another log force.

Over 100 runs of the test program:

    Ext4 baseline:
        runtime: 1.63s +/- 0.24s
        avg lat: 1.59ms +/- 0.24ms
        iops: ~2000

    XFS, vanilla kernel:
        runtime: 2.45s +/- 0.18s
        avg lat: 2.39ms +/- 0.18ms
        log forces: ~400/s
        iops: ~1000

    XFS, patched kernel:
        runtime: 1.49s +/- 0.26s
        avg lat: 1.46ms +/- 0.25ms
        log forces: ~30/s
        iops: ~1500

Reported-by: Sage Weil <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
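For reference, a minimal userspace sketch of the kind of overwrite-plus-fdatasync workload described above (not the actual test program behind the numbers): the file is written once up front, then the same block is repeatedly overwritten in place so the only metadata change per iteration is the timestamp update that fdatasync(2) should be able to ignore. The file name and iteration count are arbitrary.

    /* fdatasync_overwrite.c: overwrite the same 4 KiB block and fdatasync it,
     * timing the loop. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    #define ITERS 1000
    #define BLKSZ 4096

    int main(void)
    {
            char buf[BLKSZ];
            struct timespec t0, t1;
            int fd = open("testfile", O_CREAT | O_WRONLY, 0644);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            memset(buf, 'x', sizeof(buf));
            if (pwrite(fd, buf, sizeof(buf), 0) != BLKSZ) {  /* allocate once */
                    perror("pwrite");
                    return 1;
            }
            fsync(fd);

            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int i = 0; i < ITERS; i++) {
                    if (pwrite(fd, buf, sizeof(buf), 0) != BLKSZ) {
                            perror("pwrite");
                            return 1;
                    }
                    /* With the fix, timestamp-only dirtying between calls
                     * should not force the log here. */
                    if (fdatasync(fd) < 0) {
                            perror("fdatasync");
                            return 1;
                    }
            }
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double secs = (t1.tv_sec - t0.tv_sec) +
                          (t1.tv_nsec - t0.tv_nsec) / 1e9;
            printf("%d overwrite+fdatasync iterations in %.2fs (%.2fms avg)\n",
                   ITERS, secs, secs * 1000.0 / ITERS);
            close(fd);
            return 0;
    }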