aboutsummaryrefslogtreecommitdiff
path: root/fs/xfs
AgeCommit message (Collapse)AuthorFilesLines
2015-11-03xfs: don't leak uuid table on rmmodDarrick J. Wong3-0/+12
Don't leak the UUID table when the module is unloaded. (Found with kmemleak.) Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-11-03xfs: invalidate cached acl if set via ioctlAndreas Gruenbacher3-15/+36
Setting or removing the "SGI_ACL_[FILE|DEFAULT]" attributes via the XFS_IOC_ATTRMULTI_BY_HANDLE ioctl completely bypasses the POSIX ACL infrastructure, like setting the "trusted.SGI_ACL_[FILE|DEFAULT]" xattrs did until commit 6caa1056. Similar to that commit, invalidate cached acls when setting/removing them via the ioctl as well. Signed-off-by: Andreas Gruenbacher <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-11-03xfs: Plug memory leak in xfs_attrmulti_attr_setAndreas Gruenbacher1-1/+4
When setting attributes via XFS_IOC_ATTRMULTI_BY_HANDLE, the user-space buffer is copied into a new kernel-space buffer via memdup_user; that buffer then isn't freed. Signed-off-by: Andreas Gruenbacher <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-11-03xfs: Validate the length of on-disk ACLsAndreas Gruenbacher2-7/+14
In xfs_acl_from_disk, instead of trusting that xfs_acl.acl_cnt is correct, make sure that the length of the attributes is correct as well. Also, turn the aclp parameter into a const pointer. Signed-off-by: Andreas Gruenbacher <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-11-03xfs: invalidate cached acl if set directly via xattrBrian Foster1-2/+19
ACLs are stored as extended attributes of the inode to which they apply. XFS converts the standard "system.posix_acl_[access|default]" attribute names used to control ACLs to "trusted.SGI_ACL_[FILE|DEFAULT]" as stored on-disk. These xattrs are directly exposed in on-disk format via getxattr/setxattr, without any ACL aware code in the path to perform validation, etc. This is partly historical and supports backup/restore applications such as xfsdump to back up and restore the binary blob that represents ACLs as-is. Andreas reports that the ACLs observed via the getfacl interface is not consistent when ACLs are set directly via the setxattr path. This occurs because the ACLs are cached in-core against the inode and the xattr path has no knowledge that the operation relates to ACLs. Update the xattr set codepath to trap writes of the special XFS ACL attributes and invalidate the associated cached ACL when this occurs. This ensures that the correct ACLs are used on a subsequent operation through the actual ACL interface. Note that this does not update or add support for setting the ACL xattrs directly beyond the restore use case that requires a correctly formatted binary blob and to restore a consistent i_mode at the same time. It is still possible for a root user to set an invalid or inconsistent (with i_mode) ACL blob on-disk and potentially cause corruption. [ With fixes from Andreas Gruenbacher. ] Reported-by: Andreas Gruenbacher <[email protected]> Signed-off-by: Brian Foster <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-11-03xfs: xfs_filemap_pmd_fault treats read faults as write faultsDave Chinner1-4/+16
The code initially committed didn't have the same checks for write faults as the dax_pmd_fault code and hence treats all faults as write faults. We can get read faults through this path because they is no pmd_mkwrite path for write faults similar to the normal page fault path. Hence we need to ensure that we only do c/mtime updates on write faults, and freeze protection is unnecessary for read faults. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Brian Foster <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-11-03xfs: add ->pfn_mkwrite support for DAXDave Chinner2-0/+36
->pfn_mkwrite support is needed so that when a page with allocated backing store takes a write fault we can check that the fault has not raced with a truncate and is pointing to a region beyond the current end of file. This also allows us to update the timestamp on the inode, too, which fixes a generic/080 failure. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Brian Foster <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-11-03xfs: DAX does not use IO completion callbacksDave Chinner3-43/+2
For DAX, we are now doing block zeroing during allocation. This means we no longer need a special DAX fault IO completion callback to do unwritten extent conversion. Because mmap never extends the file size (it SEGVs the process) we don't need a callback to update the file size, either. Hence we can remove the completion callbacks from the __dax_fault and __dax_mkwrite calls. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Brian Foster <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-11-03xfs: Don't use unwritten extents for DAXDave Chinner2-6/+30
DAX has a page fault serialisation problem with block allocation. Because it allows concurrent page faults and does not have a page lock to serialise faults to the same page, it can get two concurrent faults to the page that race. When two read faults race, this isn't a huge problem as the data underlying the page is not changing and so "detect and drop" works just fine. The issues are to do with write faults. When two write faults occur, we serialise block allocation in get_blocks() so only one faul will allocate the extent. It will, however, be marked as an unwritten extent, and that is where the problem lies - the DAX fault code cannot differentiate between a block that was just allocated and a block that was preallocated and needs zeroing. The result is that both write faults end up zeroing the block and attempting to convert it back to written. The problem is that the first fault can zero and convert before the second fault starts zeroing, resulting in the zeroing for the second fault overwriting the data that the first fault wrote with zeros. The second fault then attempts to convert the unwritten extent, which is then a no-op because it's already written. Data loss occurs as a result of this race. Because there is no sane locking construct in the page fault code that we can use for serialisation across the page faults, we need to ensure block allocation and zeroing occurs atomically in the filesystem. This means we can still take concurrent page faults and the only time they will serialise is in the filesystem mapping/allocation callback. The page fault code will always see written, initialised extents, so we will be able to remove the unwritten extent handling from the DAX code when all filesystems are converted. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Brian Foster <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-11-03xfs: introduce BMAPI_ZERO for allocating zeroed extentsDave Chinner6-8/+97
To enable DAX to do atomic allocation of zeroed extents, we need to drive the block zeroing deep into the allocator. Because xfs_bmapi_write() can return merged extents on allocation that were only partially allocated (i.e. requested range spans allocated and hole regions, allocation into the hole was contiguous), we cannot zero the extent returned from xfs_bmapi_write() as that can overwrite existing data with zeros. Hence we have to drive the extent zeroing into the allocation code, prior to where we merge the extents into the BMBT and return the resultant map. This means we need to propagate this need down to the xfs_alloc_vextent() and issue the block zeroing at this point. While this functionality is being introduced for DAX, there is no reason why it is specific to DAX - we can per-zero blocks during the allocation transaction on any type of device. It's just slow (and usually slower than unwritten allocation and conversion) on traditional block devices so doesn't tend to get used. We can, however, hook hardware zeroing optimisations via sb_issue_zeroout() to this operation, so it may be useful in future and hence the "allocate zeroed blocks" API needs to be implementation neutral. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Brian Foster <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-11-03xfs: fix inode size update overflow in xfs_map_direct()Dave Chinner3-9/+49
Both direct IO and DAX pass an offset and count into get_blocks that will overflow a s64 variable when an IO goes into the last supported block in a file (i.e. at offset 2^63 - 1FSB bytes). This can be seen from the tracing: xfs_get_blocks_alloc: [...] offset 0x7ffffffffffff000 count 4096 xfs_gbmap_direct: [...] offset 0x7ffffffffffff000 count 4096 xfs_gbmap_direct_none:[...] offset 0x7ffffffffffff000 count 4096 0x7ffffffffffff000 + 4096 = 0x8000000000000000, and hence that overflows the s64 offset and we fail to detect the need for a filesize update and an ioend is not allocated. This is *mostly* avoided for direct IO because such extending IOs occur with full block allocation, and so the "IS_UNWRITTEN()" check still evaluates as true and we get an ioend that way. However, doing single sector extending IOs to this last block will expose the fact that file size updates will not occur after the first allocating direct IO as the overflow will then be exposed. There is one further complexity: the DAX page fault path also exposes the same issue in block allocation. However, page faults cannot extend the file size, so in this case we want to allocate the block but do not want to allocate an ioend to enable file size update at IO completion. Hence we now need to distinguish between the direct IO patch allocation and dax fault path allocation to avoid leaking ioend structures. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Brian Foster <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-11-02xfs: clear PF_NOFREEZE for xfsaild kthreadJiri Kosina1-0/+1
Since xfsaild has been converted to kthread in 0030807c, it calls try_to_freeze() during every AIL push iteration. It however doesn't set itself as freezable, and therefore this try_to_freeze() will never do anything. Before (hopefully eventually) kthread freezing gets converted to fileystem freezing, we'd rather mark xfsaild freezable (as it can generate I/O during suspend). Signed-off-by: Jiri Kosina <[email protected]> Reviewed-by: Brian Foster <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-10-19Merge branch 'xfs-stats-fixes' into for-nextDave Chinner3-2/+4
2015-10-19xfs: fix an error code in xfs_fs_fill_super()Dan Carpenter1-1/+1
If alloc_percpu() fails, we accidentally return PTR_ERR(NULL), which means success, but we intended to return -ENOMEM. Fixes: 225e4635580c ('xfs: per-filesystem stats in sysfs') Signed-off-by: Dan Carpenter <[email protected]> Reviewed-by: Bill O'Donnell <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-10-19xfs: stats are no longer dependent on CONFIG_PROC_FSDave Chinner2-1/+3
So we need to fix the makefile to understand this, otherwise build errors with CONFIG_PROC_FS=n occur. Reported-and-tested-by: Jim Davis <[email protected]> Signed-off-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-10-12Merge branch 'xfs-misc-fixes-for-4.4-1' into for-nextDave Chinner9-13/+41
2015-10-12Merge branch 'xfs-io-fixes' into for-nextDave Chinner5-20/+59
2015-10-12Merge branch 'xfs-logging-fixes' into for-nextDave Chinner18-26/+210
2015-10-12xfs: simplify /proc teardown & error handlingEric Sandeen1-19/+7
remove_proc_subtree() was added in 3.9, and can be used to simplify our procfile creation error handling and cleanup, removing the nested gotos. It simply removes fs/xfs and everything created under it. Signed-off-by: Eric Sandeen <[email protected]> Reviewed-by: Brian Foster <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-10-12xfs: per-filesystem stats counter implementationBill O'Donnell21-118/+131
This patch modifies the stats counting macros and the callers to those macros to properly increment, decrement, and add-to the xfs stats counts. The counts for global and per-fs stats are correctly advanced, and cleared by writing a "1" to the corresponding clear file. global counts: /sys/fs/xfs/stats/stats per-fs counts: /sys/fs/xfs/sda*/stats/stats global clear: /sys/fs/xfs/stats/stats_clear per-fs clear: /sys/fs/xfs/sda*/stats/stats_clear [dchinner: cleaned up macro variables, removed CONFIG_FS_PROC around stats structures and macros. ] Signed-off-by: Bill O'Donnell <[email protected]> Reviewed-by: Eric Sandeen <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-10-12xfs: per-filesystem stats in sysfsBill O'Donnell3-3/+23
This patch implements per-filesystem stats objects in sysfs. It depends on the application of the previous patch series that develops the infrastructure to support both xfs global stats and xfs per-fs stats in sysfs. Stats objects are instantiated when an xfs filesystem is mounted and deleted on unmount. With this patch, the stats directory is created and populated with the familiar stats and stats_clear files. Example: /sys/fs/xfs/sda9/stats/stats /sys/fs/xfs/sda9/stats/stats_clear With this patch, the individual counts within the new per-fs stats file(s) remain at zero. Functions that use the the macros to increment, decrement, and add-to the per-fs stats counts will be covered in a separate new patch to follow this one. Note that the counts within the global stats file (/sys/fs/xfs/stats/stats) advance normally and can be cleared as it was prior to this patch. [dchinner: move setup/teardown to xfs_fs_{fill|put}_super() so it is down before/after any path that uses the per-mount stats. ] Signed-off-by: Bill O'Donnell <[email protected]> Reviewed-by: Eric Sandeen <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-10-12xfs: more info from kmem deadlocks and high-level error msgsEric Sandeen2-2/+9
In an effort to get more useful out of "possible memory allocation deadlock" messages, print the size of the requested allocation, and dump the stack if the xfs error level is tuned high. The stack dump is implemented in define_xfs_printk_level() for error levels >= LOGLEVEL_ERR, partly because it seems generically useful, and also because kmem.c has no knowledge of xfs error level tunables or other such bits, it's very kmem-specific. Signed-off-by: Eric Sandeen <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-10-12xfs: avoid null *src in memcpy call in xlog_writeEric Sandeen1-4/+13
The gcc undefined behavior sanitizer caught this; surely any sane memcpy implementation will no-op if size == 0, but behavior with a *src of NULL is technically undefined (declared nonnull), so avoid it here. We are actually in this situation frequently via xlog_commit_record(), because: struct xfs_log_iovec reg = { .i_addr = NULL, .i_len = 0, .i_type = XLOG_REG_TYPE_COMMIT, }; Reported-by: Eric Sandeen <[email protected]> Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Eric Sandeen <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-10-12xfs: pass total block res. as total xfs_bmapi_write() parameterBrian Foster2-8/+8
The total field from struct xfs_alloc_arg is a bit of an unknown commodity. It is documented as the total block requirement for the transaction and is used in this manner from most call sites by virtue of passing the total block reservation of the transaction associated with an allocation. Several xfs_bmapi_write() callers pass hardcoded values of 0 or 1 for the total block requirement, which is a historical oddity without any clear reasoning. The xfs_iomap_write_direct() caller, for example, passes 0 for the total block requirement. This has been determined to cause problems in the form of ABBA deadlocks of AGF buffers due to incorrect AG selection in the block allocator. Specifically, the xfs_alloc_space_available() function incorrectly selects an AG that doesn't actually have sufficient space for the allocation. This occurs because the args.total field is 0 and thus the remaining free space check on the AG doesn't actually consider the size of the allocation request. This locks the AGF buffer, the allocation attempt proceeds and ultimately fails (in xfs_alloc_fix_minleft()), and xfs_alloc_vexent() moves on to the next AG. In turn, this can lead to incorrect AG locking order (if the allocator wraps around, attempting to lock AG 0 after acquiring AG N) and thus deadlock if racing with another operation. This problem has been reproduced via generic/299 on smallish (1GB) ramdisk test devices. To avoid this problem, replace the undocumented hardcoded total parameters from the iomap and utility callers to pass the block reservation used for the associated transaction. This is consistent with other xfs_bmapi_write() callers throughout XFS. The assumption is that the total field allows the selection of an AG that can handle the entire operation rather than simply the allocation/range being requested (e.g., resulting btree splits, etc.). This addresses the aforementioned generic/299 hang by ensuring AG selection only occurs when the allocation can be satisfied by the AG. Reported-by: Ross Zwisler <[email protected]> Signed-off-by: Brian Foster <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-10-12xfs: avoid dependency on Linux XATTR_SIZE_MAXJan Tulak3-4/+12
Currently, we depends on Linux XATTR value for on disk definition. Which causes trouble on other platforms and maybe also if this value was to change. Fix it by creating a custom definition independent from those in Linux (although with the same values), so it is OK with the be16 fields used for holding these attributes. This patch reflects a change in xfsprogs. Signed-off-by: Jan Tulak <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-10-12xfs: prefix XATTR_LIST_MAX with XFS_Jan Tulak3-2/+12
Remove a hard dependency of Linux XATTR_LIST_MAX value by using a prefixed version. This patch reflects the same change in xfsprogs. Signed-off-by: Jan Tulak <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-10-12libxfs: fix two comment typosGeliang Tang1-2/+2
Just fix two typos in code comments. Signed-off-by: Geliang Tang <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-10-12xfs: add an xfs_zero_eof() tracepointBrian Foster2-0/+3
Add a tracepoint in xfs_zero_eof() to facilitate tracking and debugging EOF zeroing events. This has proven useful in the context of other direct I/O tracepoints to ensure EOF zeroing occurs within appropriate file ranges. Signed-off-by: Brian Foster <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-10-12xfs: always drain dio before extending aio write submissionBrian Foster1-6/+9
XFS supports and typically allows concurrent asynchronous direct I/O submission to a single file. One exception to the rule is that file extending dio writes that start beyond the current EOF (e.g., potentially create a hole at EOF) require exclusive I/O access to the file. This is because such writes must zero any pre-existing blocks beyond EOF that are exposed by virtue of now residing within EOF as a result of the write about to be submitted. Before EOF zeroing can occur, the current file i_size must be stabilized to avoid data corruption. In this scenario, XFS upgrades the iolock to exclude any further I/O submission, waits on in-flight I/O to complete to ensure i_size is up to date (i_size is updated on dio write completion) and restarts the various checks against the state of the file. The problem is that this protection sequence is triggered only when the iolock is currently held shared. While this is true for async dio in most cases, the caller may upgrade the lock in advance based on arbitrary circumstances with respect to EOF zeroing. For example, the iolock is always acquired exclusively if the start offset is not block aligned. This means that even though the iolock is already held exclusive for such I/Os, pending I/O is not drained and thus EOF zeroing can occur based on an unstable i_size. This problem has been reproduced as guest data corruption in virtual machines with file-backed qcow2 virtual disks hosted on an XFS filesystem. The virtual disks must be configured with aio=native mode and the must not be truncated out to the maximum file size (as some virt managers will do). Update xfs_file_aio_write_checks() to unconditionally drain in-flight dio before EOF zeroing can occur. Rather than trigger the wait based on iolock state, use a new flag and upgrade the iolock when necessary. Note that this results in a full restart of the inode checks even when the iolock was already held exclusive when technically it is only required to recheck i_size. This should be a rare enough occurrence that it is preferable to keep the code simple rather than create an alternate restart jump target. Signed-off-by: Brian Foster <[email protected]> Reviewed-by: Eric Sandeen <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-10-12xfs: validate metadata LSNs against log on v5 superblocksBrian Foster15-10/+180
Since the onset of v5 superblocks, the LSN of the last modification has been included in a variety of on-disk data structures. This LSN is used to provide log recovery ordering guarantees (e.g., to ensure an older log recovery item is not replayed over a newer target data structure). While this works correctly from the point a filesystem is formatted and mounted, userspace tools have some problematic behaviors that defeat this mechanism. For example, xfs_repair historically zeroes out the log unconditionally (regardless of whether corruption is detected). If this occurs, the LSN of the filesystem is reset and the log is now in a problematic state with respect to on-disk metadata structures that might have a larger LSN. Until either the log catches up to the highest previously used metadata LSN or each affected data structure is modified and written out without incident (which resets the metadata LSN), log recovery is susceptible to filesystem corruption. This problem is ultimately addressed and repaired in the associated userspace tools. The kernel is still responsible to detect the problem and notify the user that something is wrong. Check the superblock LSN at mount time and fail the mount if it is invalid. From that point on, trigger verifier failure on any metadata I/O where an invalid LSN is detected. This results in a filesystem shutdown and guarantees that we do not log metadata changes with invalid LSNs on disk. Since this is a known issue with a known recovery path, present a warning to instruct the user how to recover. Signed-off-by: Brian Foster <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-10-12xfs: Print name and pid when memory allocation loopsTetsuo Handa2-5/+8
This patch adds comm name and pid to warning messages printed by kmem_alloc(), kmem_zone_alloc() and xfs_buf_allocate_memory(). This will help telling which memory allocations (e.g. kernel worker threads, OOM victim tasks, neither) are stalling because these functions are passing __GFP_NOWARN which suppresses not only backtrace but comm name and pid. Signed-off-by: Tetsuo Handa <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-10-12xfs: log local to remote symlink conversions correctly on v5 supersBrian Foster2-4/+9
A local format symlink inode is converted to extent format when an extended attribute is set on an inode as part of the attribute fork creation. This means a block is allocated, the local symlink target name is copied to the block and the block is logged. Currently, xfs_bmap_local_to_extents() handles logging the remote block data based on the size of the data fork prior to the conversion. This is not correct on v5 superblock filesystems, which add an additional header to remote symlink blocks that is nonexistent in local format inodes. As a result, the full length of the remote symlink block content is not logged. This can lead to corruption should a crash occur and log recovery replay this transaction. Since a callout is already used to initialize the new remote symlink block, update the local-to-extents conversion mechanism to make the callout also responsible for logging the block. It is already required to set the log buffer type and format the block appropriately based on the superblock version. This ensures the remote symlink is always logged correctly. Note that xfs_bmap_local_to_extents() is only called for symlinks so there are no other callouts that require modification. Signed-off-by: Brian Foster <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-10-12xfs: add missing ilock around dio write last extent alignmentBrian Foster3-12/+36
The iomap codepath (via get_blocks()) acquires and release the inode lock in the case of a direct write that requires block allocation. This is because xfs_iomap_write_direct() allocates a transaction, which means the ilock must be dropped and reacquired after the transaction is allocated and reserved. xfs_iomap_write_direct() invokes xfs_iomap_eof_align_last_fsb() before the transaction is created and thus before the ilock is reacquired. This can lead to calls to xfs_iread_extents() and reads of the in-core extent list without any synchronization (via xfs_bmap_eof() and xfs_bmap_last_extent()). xfs_iread_extents() assert fails if the ilock is not held, but this is not currently seen in practice as the current callers had already invoked xfs_bmapi_read(). What has been seen in practice are reports of crashes down in the xfs_bmap_eof() codepath on direct writes due to seemingly bogus pointer references from xfs_iext_get_ext(). While an explicit reproducer is not currently available to confirm the cause of the problem, crash analysis and code inspection from David Jeffrey had identified the insufficient locking. xfs_iomap_eof_align_last_fsb() is called from other contexts with the inode lock already held, so we cannot acquire it therein. __xfs_get_blocks() acquires and drops the ilock with variable flags to cover the event that the extent list must be read in. The common case is that __xfs_get_blocks() acquires the shared ilock. To provide locking around the last extent alignment call without adding more lock cycles to the dio path, update xfs_iomap_write_direct() to expect the shared ilock held on entry and do the extent alignment under its protection. Demote the lock, if necessary, from __xfs_get_blocks() and push the xfs_qm_dqattach() call outside of the shared lock critical section. Also, add an assert to document that the extent list is always expected to be present in this path. Otherwise, we risk a call to xfs_iread_extents() while under the shared ilock. This is safe as all current callers have executed an xfs_bmapi_read() call under the current iolock context. Reported-by: David Jeffery <[email protected]> Signed-off-by: Brian Foster <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-10-12cancel the setfilesize transation when io error happenZhaohongjiang1-2/+11
When I ran xfstest/073 case, the remount process was blocked to wait transactions to be zero. I found there was a io error happened, and the setfilesize transaction was not released properly. We should add the changes to cancel the io error in this case. Reproduction steps: 1. dd if=/dev/zero of=xfs1.img bs=1M count=2048 2. mkfs.xfs xfs1.img 3. losetup -f ./xfs1.img /dev/loop0 4. mount -t xfs /dev/loop0 /home/test_dir/ 5. mkdir /home/test_dir/test 6. mkfs.xfs -dfile,name=image,size=2g 7. mount -t xfs -o loop image /home/test_dir/test 8. cp a file bigger than 2g to /home/test_dir/test 9. mount -t xfs -o remount,ro /home/test_dir/test [ dchinner: moved io error detection to xfs_setfilesize_ioend() after transaction context restoration. ] Signed-off-by: Zhao Hongjiang <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-10-12xfs: pass xfsstats structures to handlers and macrosBill O'Donnell6-37/+63
This patch is the next step toward per-fs xfs stats. The patch makes the show and clear routines able to handle any stats structure associated with a kobject. Instead of a single global xfsstats structure, add kobject and a pointer to a per-cpu struct xfsstats. Modify the macros that manipulate the stats accordingly: XFS_STATS_INC, XFS_STATS_DEC, and XFS_STATS_ADD now access xfsstats->xs_stats. The sysfs functions need to get from the kobject back to the xfsstats structure which contains it, and pass the pointer to the ->xs_stats percpu structure into the show & clear routines. Signed-off-by: Bill O'Donnell <[email protected]> Reviewed-by: Eric Sandeen <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-10-12xfs: consolidate sysfs opsBill O'Donnell1-119/+63
As a part of the series to move xfs global stats from procfs to sysfs, this patch consolidates the sysfs ops functions and removes redundancy. Signed-off-by: Bill O'Donnell <[email protected]> Reviewed-by: Eric Sandeen <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-10-12xfs: remove unused procfs codeBill O'Donnell1-74/+0
As a part of the work to move xfs global stats from procfs to sysfs, this patch removes the now unused procfs code that was xfs stat specific. Signed-off-by: Bill O'Donnell <[email protected]> Reviewed-by: Eric Sandeen <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-10-12xfs: create symlink proc/fs/xfs/stat to sys/fs/xfs/statsBill O'Donnell1-2/+3
As a part of the work to move xfs global stats from procfs to sysfs, this patch creates the symlink from proc/fs/xfs/stat to sys/fs/xfs/stats. Signed-off-by: Bill O'Donnell <[email protected]> Reviewed-by: Eric Sandeen <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-10-12xfs: create global stats and stats_clear in sysfsBill O'Donnell6-17/+178
Currently, xfs global stats are in procfs. This patch introduces (replicates) the global stats in sysfs. Additionally a stats_clear file is introduced in sysfs. Signed-off-by: Bill O'Donnell <[email protected]> Reviewed-by: Eric Sandeen <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-09-08xfs: huge page fault supportMatthew Wilcox2-1/+30
Use DAX to provide support for huge pages. Signed-off-by: Matthew Wilcox <[email protected]> Cc: Hillf Danton <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-08dax: move DAX-related functions to a new headerMatthew Wilcox1-0/+1
In order to handle the !CONFIG_TRANSPARENT_HUGEPAGES case, we need to return VM_FAULT_FALLBACK from the inlined dax_pmd_fault(), which is defined in linux/mm.h. Given that we don't want to include <linux/mm.h> in <linux/fs.h>, the easiest solution is to move the DAX-related functions to a new header, <linux/dax.h>. We could also have moved VM_FAULT_* definitions to a new header, or a different header that isn't quite such a boil-the-ocean header as <linux/mm.h>, but this felt like the best option. Signed-off-by: Matthew Wilcox <[email protected]> Cc: Hillf Danton <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-07Merge tag 'xfs-for-linus-4.3' of ↵Linus Torvalds55-478/+861
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull xfs updates from Dave Chinner: "There isn't a whole lot to this update - it's mostly bug fixes and they are spread pretty much all over XFS. There are some corruption fixes, some fixes for log recovery, some fixes that prevent unount from hanging, a lockdep annotation rework for inode locking to prevent false positives and the usual random bunch of cleanups and minor improvements. Deatils: - large rework of EFI/EFD lifecycle handling to fix log recovery corruption issues, crashes and unmount hangs - separate metadata UUID on disk to enable changing boot label UUID for v5 filesystems - fixes for gcc miscompilation on certain platforms and optimisation levels - remote attribute allocation and recovery corruption fixes - inode lockdep annotation rework to fix bugs with too many subclasses - directory inode locking changes to prevent lockdep false positives - a handful of minor corruption fixes - various other small cleanups and bug fixes" * tag 'xfs-for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (42 commits) xfs: fix error gotos in xfs_setattr_nonsize xfs: add mssing inode cache attempts counter increment xfs: return errors from partial I/O failures to files libxfs: bad magic number should set da block buffer error xfs: fix non-debug build warnings xfs: collapse allocsize and biosize mount option handling xfs: Fix file type directory corruption for btree directories xfs: lockdep annotations throw warnings on non-debug builds xfs: Fix uninitialized return value in xfs_alloc_fix_freelist() xfs: inode lockdep annotations broke non-lockdep build xfs: flush entire file on dio read/write to cached file xfs: Fix xfs_attr_leafblock definition libxfs: readahead of dir3 data blocks should use the read verifier xfs: stop holding ILOCK over filldir callbacks xfs: clean up inode lockdep annotations xfs: swap leaf buffer into path struct atomically during path shift xfs: relocate sparse inode mount warning xfs: dquots should be stamped with sb_meta_uuid xfs: log recovery needs to validate against sb_meta_uuid xfs: growfs not aware of sb_meta_uuid ...
2015-09-05Merge branch 'for-linus' of ↵Linus Torvalds1-4/+2
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "In this one: - d_move fixes (Eric Biederman) - UFS fixes (me; locking is mostly sane now, a bunch of bugs in error handling ought to be fixed) - switch of sb_writers to percpu rwsem (Oleg Nesterov) - superblock scalability (Josef Bacik and Dave Chinner) - swapon(2) race fix (Hugh Dickins)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (65 commits) vfs: Test for and handle paths that are unreachable from their mnt_root dcache: Reduce the scope of i_lock in d_splice_alias dcache: Handle escaped paths in prepend_path mm: fix potential data race in SyS_swapon inode: don't softlockup when evicting inodes inode: rename i_wb_list to i_io_list sync: serialise per-superblock sync operations inode: convert inode_sb_list_lock to per-sb inode: add hlist_fake to avoid the inode hash lock in evict writeback: plug writeback at a high level change sb_writers to use percpu_rw_semaphore shift percpu_counter_destroy() into destroy_super_work() percpu-rwsem: kill CONFIG_PERCPU_RWSEM percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire() percpu-rwsem: introduce percpu_down_read_trylock() document rwsem_release() in sb_wait_write() fix the broken lockdep logic in __sb_start_write() introduce __sb_writers_{acquired,release}() helpers ufs_inode_get{frag,block}(): get rid of 'phys' argument ufs_getfrag_block(): tidy up a bit ...
2015-09-04fs: create and use seq_show_option for escapingKees Cook1-2/+2
Many file systems that implement the show_options hook fail to correctly escape their output which could lead to unescaped characters (e.g. new lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This could lead to confusion, spoofed entries (resulting in things like systemd issuing false d-bus "mount" notifications), and who knows what else. This looks like it would only be the root user stepping on themselves, but it's possible weird things could happen in containers or in other situations with delegated mount privileges. Here's an example using overlay with setuid fusermount trusting the contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use of "sudo" is something more sneaky: $ BASE="ovl" $ MNT="$BASE/mnt" $ LOW="$BASE/lower" $ UP="$BASE/upper" $ WORK="$BASE/work/ 0 0 none /proc fuse.pwn user_id=1000" $ mkdir -p "$LOW" "$UP" "$WORK" $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt $ cat /proc/mounts none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0 none /proc fuse.pwn user_id=1000 0 0 $ fusermount -u /proc $ cat /proc/mounts cat: /proc/mounts: No such file or directory This fixes the problem by adding new seq_show_option and seq_show_option_n helpers, and updating the vulnerable show_option handlers to use them as needed. Some, like SELinux, need to be open coded due to unusual existing escape mechanisms. [[email protected]: add lost chunk, per Kees] [[email protected]: seq_show_option should be using const parameters] Signed-off-by: Kees Cook <[email protected]> Acked-by: Serge Hallyn <[email protected]> Acked-by: Jan Kara <[email protected]> Acked-by: Paul Moore <[email protected]> Cc: J. R. Okajima <[email protected]> Signed-off-by: Kees Cook <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-02Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-blockLinus Torvalds2-9/+6
Pull core block updates from Jens Axboe: "This first core part of the block IO changes contains: - Cleanup of the bio IO error signaling from Christoph. We used to rely on the uptodate bit and passing around of an error, now we store the error in the bio itself. - Improvement of the above from myself, by shrinking the bio size down again to fit in two cachelines on x86-64. - Revert of the max_hw_sectors cap removal from a revision again, from Jeff Moyer. This caused performance regressions in various tests. Reinstate the limit, bump it to a more reasonable size instead. - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me. Most devices have huge trim limits, which can cause nasty latencies when deleting files. Enable the admin to configure the size down. We will look into having a more sane default instead of UINT_MAX sectors. - Improvement of the SGP gaps logic from Keith Busch. - Enable the block core to handle arbitrarily sized bios, which enables a nice simplification of bio_add_page() (which is an IO hot path). From Kent. - Improvements to the partition io stats accounting, making it faster. From Ming Lei. - Also from Ming Lei, a basic fixup for overflow of the sysfs pending file in blk-mq, as well as a fix for a blk-mq timeout race condition. - Ming Lin has been carrying Kents above mentioned patches forward for a while, and testing them. Ming also did a few fixes around that. - Sasha Levin found and fixed a use-after-free problem introduced by the bio->bi_error changes from Christoph. - Small blk cgroup cleanup from Viresh Kumar" * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits) blk: Fix bio_io_vec index when checking bvec gaps block: Replace SG_GAPS with new queue limits mask block: bump BLK_DEF_MAX_SECTORS to 2560 Revert "block: remove artifical max_hw_sectors cap" blk-mq: fix race between timeout and freeing request blk-mq: fix buffer overflow when reading sysfs file of 'pending' Documentation: update notes in biovecs about arbitrarily sized bios block: remove bio_get_nr_vecs() fs: use helper bio_add_page() instead of open coding on bi_io_vec block: kill merge_bvec_fn() completely md/raid5: get rid of bio_fits_rdev() md/raid5: split bio for chunk_aligned_read block: remove split code in blkdev_issue_{discard,write_same} btrfs: remove bio splitting and merge_bvec_fn() calls bcache: remove driver private bio splitting code block: simplify bio_add_page() block: make generic_make_request handle arbitrarily sized bios blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL) block: don't access bio->bi_error after bio_put() block: shrink struct bio down to 2 cache lines again ...
2015-09-01Merge branch 'xfs-misc-fixes-for-4.3-4' into for-nextDave Chinner4-5/+9
2015-08-28xfs: fix error gotos in xfs_setattr_nonsizeEric Sandeen1-4/+4
As the code stands today, if xfs_trans_reserve() fails, we goto out_dqrele, which does not free the allocated transaction. Fix up the goto targets to undo everything properly. Addresses-Coverity-Id: 145571 Signed-off-by: Eric Sandeen <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-08-28xfs: add mssing inode cache attempts counter incrementLucas Stach1-0/+2
Increasing the inode cache attempt counter was apparently dropped while refactoring the cache code and so stayed at the initial 0 value. Add the increment back to make the runtime stats more useful. Signed-off-by: Lucas Stach <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-08-28xfs: return errors from partial I/O failures to filesDavid Jeffery1-1/+2
There is an issue with xfs's error reporting in some cases of I/O partially failing and partially succeeding. Calls like fsync() can report success even though not all I/O was successful in partial-failure cases such as one disk of a RAID0 array being offline. The issue can occur when there are more than one bio per xfs_ioend struct. Each call to xfs_end_bio() for a bio completing will write a value to ioend->io_error. If a successful bio completes after any failed bio, no error is reported do to it writing 0 over the error code set by any failed bio. The I/O error information is now lost and when the ioend is completed only success is reported back up the filesystem stack. xfs_end_bio() should only set ioend->io_error in the case of BIO_UPTODATE being clear. ioend->io_error is initialized to 0 at allocation so only needs to be updated by a failed bio. Also check that ioend->io_error is 0 so that the first error reported will be the error code returned. Cc: [email protected] Signed-off-by: David Jeffery <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-08-28libxfs: bad magic number should set da block buffer errorDarrick J. Wong1-0/+1
If xfs_da3_node_read_verify() doesn't recognize the magic number of a buffer it's just read, set the buffer error to -EFSCORRUPTED so that the error can be sent up to userspace. Without this patch we'll notice the bad magic eventually while trying to traverse or change the block, but we really ought to fail early in the verifier. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>