aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2015-03-02NFSv4: Pin the superblock while we're returning the delegationTrond Myklebust1-4/+16
This patch ensures that the superblock doesn't go ahead and disappear underneath us while the state manager thread is returning delegations. Signed-off-by: Trond Myklebust <[email protected]>
2015-03-02NFSv4: Ensure we honour NFS_DELEGATION_RETURNING in nfs_inode_set_delegation()Trond Myklebust1-1/+4
Ensure that nfs_inode_set_delegation() doesn't inadvertently detach a delegation that is already in the process of being returned. Signed-off-by: Trond Myklebust <[email protected]>
2015-03-02NFSv4: Ensure that we don't reap a delegation that is being returnedTrond Myklebust1-5/+7
Signed-off-by: Trond Myklebust <[email protected]>
2015-03-02NFS: Fix stateid used for NFS v4 closesAnna Schumaker1-2/+2
After 566fcec60 the client uses the "current stateid" from the nfs4_state structure to close a file. This could potentially contain a delegation stateid, which is disallowed by the protocol and causes servers to return NFS4ERR_BAD_STATEID. This patch restores the (correct) behavior of sending the open stateid to close a file. Reported-by: Olga Kornievskaia <[email protected]> Fixes: 566fcec60 (NFSv4: Fix an atomicity problem in CLOSE) Signed-off-by: Anna Schumaker <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2015-03-02Btrfs: incremental send, don't rename a directory too soonFilipe Manana1-15/+156
There's one more case where we can't issue a rename operation for a directory as soon as we process it. We used to delay directory renames only if they have some ancestor directory with a higher inode number that got renamed too, but there's another case where we need to delay the rename too - when a directory A is renamed to the old name of a directory B but that directory B has its rename delayed because it has now (in the send root) an ancestor with a higher inode number that was renamed. If we don't delay the directory rename in this case, the receiving end of the send stream will attempt to rename A to the old name of B before B got renamed to its new name, which results in a "directory not empty" error. So fix this by delaying directory renames for this case too. Steps to reproduce: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/a $ mkdir /mnt/b $ mkdir /mnt/c $ touch /mnt/a/file $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ mv /mnt/c /mnt/x $ mv /mnt/a /mnt/x/y $ mv /mnt/b /mnt/a $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send /mnt/snap1 -f /tmp/1.send $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.send $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt2 $ btrfs receive /mnt2 -f /tmp/1.send $ btrfs receive /mnt2 -f /tmp/2.send ERROR: rename b -> a failed. Directory not empty A test case for xfstests follows soon. Reported-by: Ames Cornish <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-03-02btrfs: fix lost return value due to variable shadowingDavid Sterba1-1/+0
A block-local variable stores error code but btrfs_get_blocks_direct may not return it in the end as there's a ret defined in the function scope. CC: <[email protected]> # 3.6+ Fixes: d187663ef24c ("Btrfs: lock extents as we map them in DIO") Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-03-02Btrfs: do not ignore errors from btrfs_lookup_xattr in do_setxattrFilipe Manana1-2/+6
The return value from btrfs_lookup_xattr() can be a pointer encoding an error, therefore deal with it. This fixes commit 5f5bc6b1e2d5 ("Btrfs: make xattr replace operations atomic"). Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-03-02Btrfs: fix off-by-one logic error in btrfs_realloc_nodeFilipe Manana1-4/+4
The end_slot variable actually matches the number of pointers in the node and not the last slot (which is 'nritems - 1'). Therefore in order to check that the current slot in the for loop doesn't match the last one, the correct logic is to check if 'i' is less than 'end_slot - 1' and not 'end_slot - 2'. Fix this and set end_slot to be 'nritems - 1', as it's less confusing since the variable name implies it's inclusive rather then exclusive. Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-03-02Btrfs: add missing inode update when punching holeFilipe Manana1-3/+28
When punching a file hole if we endup only zeroing parts of a page, because the start offset isn't a multiple of the sector size or the start offset and length fall within the same page, we were not updating the inode item. This prevented an fsync from doing anything, if no other file changes happened in the current transaction, because the fields in btrfs_inode used to check if the inode needs to be fsync'ed weren't updated. This issue is easy to reproduce and the following excerpt from the xfstest case I made shows how to trigger it: _scratch_mkfs >> $seqres.full 2>&1 _init_flakey _mount_flakey # Create our test file. $XFS_IO_PROG -f -c "pwrite -S 0x22 -b 16K 0 16K" \ $SCRATCH_MNT/foo | _filter_xfs_io # Fsync the file, this makes btrfs update some btrfs inode specific fields # that are used to track if the inode needs to be written/updated to the fsync # log or not. After this fsync, the new values for those fields indicate that # a subsequent fsync does not need to touch the fsync log. $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo # Force a commit of the current transaction. After this point, any operation # that modifies the data or metadata of our file, should update those fields in # the btrfs inode with values that make the next fsync operation write to the # fsync log. sync # Punch a hole in our file. This small range affects only 1 page. # This made the btrfs hole punching implementation write only some zeroes in # one page, but it did not update the btrfs inode fields used to determine if # the next fsync needs to write to the fsync log. $XFS_IO_PROG -c "fpunch 8000 4K" $SCRATCH_MNT/foo # Another variation of the previously mentioned case. $XFS_IO_PROG -c "fpunch 15000 100" $SCRATCH_MNT/foo # Now fsync the file. This was a no-operation because the previous hole punch # operation didn't update the inode's fields mentioned before, so they remained # with the values they had after the first fsync - that is, they indicate that # it is not needed to write to fsync log. $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo echo "File content before:" od -t x1 $SCRATCH_MNT/foo # Simulate a crash/power loss. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey # Enable writes and mount the fs. This makes the fsync log replay code run. _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # Because the last fsync didn't do anything, here the file content matched what # it was after the first fsync, before the holes were punched, and not what it # was after the holes were punched. echo "File content after:" od -t x1 $SCRATCH_MNT/foo This issue has been around since 2012, when the punch hole implementation was added, commit 2aaa66558172 ("Btrfs: add hole punching"). A test case for xfstests follows soon. Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: Liu Bo <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-03-02Btrfs: abort the transaction if we fail to update the free space cache inodeJosef Bacik1-0/+16
Our gluster boxes were hitting a problem where they'd run out of space when updating the block group cache and therefore wouldn't be able to update the free space inode. This is a problem because this is how we invalidate the cache and protect ourselves from errors further down the stack, so if this fails we have to abort the transaction so we make sure we don't end up with stale free space cache. Thanks, Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-03-02Btrfs: fix fsync race leading to ordered extent memory leaksFilipe Manana1-5/+2
We can have multiple fsync operations against the same file during the same transaction and they can collect the same ordered extents while they don't complete (still accessible from the inode's ordered tree). If this happens, those ordered extents will never get their reference counts decremented to 0, leading to memory leaks and inode leaks (an iput for an ordered extent's inode is scheduled only when the ordered extent's refcount drops to 0). The following sequence diagram explains this race: CPU 1 CPU 2 btrfs_sync_file() btrfs_sync_file() mutex_lock(inode->i_mutex) btrfs_log_inode() btrfs_get_logged_extents() --> collects ordered extent X --> increments ordered extent X's refcount btrfs_submit_logged_extents() mutex_unlock(inode->i_mutex) mutex_lock(inode->i_mutex) btrfs_sync_log() btrfs_wait_logged_extents() --> list_del_init(&ordered->log_list) btrfs_log_inode() btrfs_get_logged_extents() --> Adds ordered extent X to logged_list because at this point: list_empty(&ordered->log_list) && test_bit(BTRFS_ORDERED_LOGGED, &ordered->flags) == 0 --> Increments ordered extent X's refcount --> check if ordered extent's io is finished or not, start it if necessary and wait for it to finish --> sets bit BTRFS_ORDERED_LOGGED on ordered extent X's flags and adds it to trans->ordered btrfs_sync_log() finishes btrfs_submit_logged_extents() btrfs_log_inode() finishes mutex_unlock(inode->i_mutex) btrfs_sync_file() finishes btrfs_sync_log() btrfs_wait_logged_extents() --> Sees ordered extent X has the bit BTRFS_ORDERED_LOGGED set in its flags --> X's refcount is untouched btrfs_sync_log() finishes btrfs_sync_file() finishes btrfs_commit_transaction() --> called by transaction kthread for e.g. btrfs_wait_pending_ordered() --> waits for ordered extent X to complete --> decrements ordered extent X's refcount by 1 only, corresponding to the increment done by the fsync task ran by CPU 1 In the scenario of the above diagram, after the transaction commit, the ordered extent will remain with a refcount of 1 forever, leaking the ordered extent structure and preventing the i_count of its inode from ever decreasing to 0, since the delayed iput is scheduled only when the ordered extent's refcount drops to 0, preventing the inode from ever being evicted by the VFS. Fix this by using the flag BTRFS_ORDERED_LOGGED differently. Use it to mean that an ordered extent is already being processed by an fsync call, which will attach it to the current transaction, preventing it from being collected by subsequent fsync operations against the same inode. This race was introduced with the following change (added in 3.19 and backported to stable 3.18 and 3.17): Btrfs: make sure logged extents complete in the current transaction V3 commit 50d9aa99bd35c77200e0e3dd7a72274f8304701f I ran into this issue while running xfstests/generic/113 in a loop, which failed about 1 out of 10 runs with the following warning in dmesg: [ 2612.440038] WARNING: CPU: 4 PID: 22057 at fs/btrfs/disk-io.c:3558 free_fs_root+0x36/0x133 [btrfs]() [ 2612.442810] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop processor parport_pc parport psmouse therma l_sys i2c_piix4 serio_raw pcspkr evdev microcode button i2c_core ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic virtio_pci ata_piix virtio_ring libata virtio flo ppy e1000 scsi_mod [last unloaded: btrfs] [ 2612.452711] CPU: 4 PID: 22057 Comm: umount Tainted: G W 3.19.0-rc5-btrfs-next-4+ #1 [ 2612.454921] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [ 2612.457709] 0000000000000009 ffff8801342c3c78 ffffffff8142425e ffff88023ec8f2d8 [ 2612.459829] 0000000000000000 ffff8801342c3cb8 ffffffff81045308 ffff880046460000 [ 2612.461564] ffffffffa036da56 ffff88003d07b000 ffff880046460000 ffff880046460068 [ 2612.463163] Call Trace: [ 2612.463719] [<ffffffff8142425e>] dump_stack+0x4c/0x65 [ 2612.464789] [<ffffffff81045308>] warn_slowpath_common+0xa1/0xbb [ 2612.466026] [<ffffffffa036da56>] ? free_fs_root+0x36/0x133 [btrfs] [ 2612.467247] [<ffffffff810453c5>] warn_slowpath_null+0x1a/0x1c [ 2612.468416] [<ffffffffa036da56>] free_fs_root+0x36/0x133 [btrfs] [ 2612.469625] [<ffffffffa036f2a7>] btrfs_drop_and_free_fs_root+0x93/0x9b [btrfs] [ 2612.471251] [<ffffffffa036f353>] btrfs_free_fs_roots+0xa4/0xd6 [btrfs] [ 2612.472536] [<ffffffff8142612e>] ? wait_for_completion+0x24/0x26 [ 2612.473742] [<ffffffffa0370bbc>] close_ctree+0x1f3/0x33c [btrfs] [ 2612.475477] [<ffffffff81059d1d>] ? destroy_workqueue+0x148/0x1ba [ 2612.476695] [<ffffffffa034e3da>] btrfs_put_super+0x19/0x1b [btrfs] [ 2612.477911] [<ffffffff81153e53>] generic_shutdown_super+0x73/0xef [ 2612.479106] [<ffffffff811540e2>] kill_anon_super+0x13/0x1e [ 2612.480226] [<ffffffffa034e1e3>] btrfs_kill_super+0x17/0x23 [btrfs] [ 2612.481471] [<ffffffff81154307>] deactivate_locked_super+0x3b/0x50 [ 2612.482686] [<ffffffff811547a7>] deactivate_super+0x3f/0x43 [ 2612.483791] [<ffffffff8116b3ed>] cleanup_mnt+0x59/0x78 [ 2612.484842] [<ffffffff8116b44c>] __cleanup_mnt+0x12/0x14 [ 2612.485900] [<ffffffff8105d019>] task_work_run+0x8f/0xbc [ 2612.486960] [<ffffffff810028d8>] do_notify_resume+0x5a/0x6b [ 2612.488083] [<ffffffff81236e5b>] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 2612.489333] [<ffffffff8142a17f>] int_signal+0x12/0x17 [ 2612.490353] ---[ end trace 54a960a6bdcb8d93 ]--- [ 2612.557253] VFS: Busy inodes after unmount of sdb. Self-destruct in 5 seconds. Have a nice day... Kmemleak confirmed the ordered extent leak (and btrfs inode specific structures such as delayed nodes): $ cat /sys/kernel/debug/kmemleak unreferenced object 0xffff880154290db0 (size 576): comm "btrfsck", pid 21980, jiffies 4295542503 (age 1273.412s) hex dump (first 32 bytes): 01 40 00 00 01 00 00 00 b0 1d f1 4e 01 88 ff ff [email protected].... 00 00 00 00 00 00 00 00 c8 0d 29 54 01 88 ff ff ..........)T.... backtrace: [<ffffffff8141d74d>] kmemleak_update_trace+0x4c/0x6a [<ffffffff8122f2c0>] radix_tree_node_alloc+0x6d/0x83 [<ffffffff8122fb26>] __radix_tree_create+0x109/0x190 [<ffffffff8122fbdd>] radix_tree_insert+0x30/0xac [<ffffffffa03b9bde>] btrfs_get_or_create_delayed_node+0x130/0x187 [btrfs] [<ffffffffa03bb82d>] btrfs_delayed_delete_inode_ref+0x32/0xac [btrfs] [<ffffffffa0379dae>] __btrfs_unlink_inode+0xee/0x288 [btrfs] [<ffffffffa037c715>] btrfs_unlink_inode+0x1e/0x40 [btrfs] [<ffffffffa037c797>] btrfs_unlink+0x60/0x9b [btrfs] [<ffffffff8115d7f0>] vfs_unlink+0x9c/0xed [<ffffffff8115f5de>] do_unlinkat+0x12c/0x1fa [<ffffffff811601a7>] SyS_unlinkat+0x29/0x2b [<ffffffff81429e92>] system_call_fastpath+0x12/0x17 [<ffffffffffffffff>] 0xffffffffffffffff unreferenced object 0xffff88014ef11db0 (size 576): comm "rm", pid 22009, jiffies 4295542593 (age 1273.052s) hex dump (first 32 bytes): 02 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 c8 1d f1 4e 01 88 ff ff ...........N.... backtrace: [<ffffffff8141d74d>] kmemleak_update_trace+0x4c/0x6a [<ffffffff8122f2c0>] radix_tree_node_alloc+0x6d/0x83 [<ffffffff8122fb26>] __radix_tree_create+0x109/0x190 [<ffffffff8122fbdd>] radix_tree_insert+0x30/0xac [<ffffffffa03b9bde>] btrfs_get_or_create_delayed_node+0x130/0x187 [btrfs] [<ffffffffa03bb82d>] btrfs_delayed_delete_inode_ref+0x32/0xac [btrfs] [<ffffffffa0379dae>] __btrfs_unlink_inode+0xee/0x288 [btrfs] [<ffffffffa037c715>] btrfs_unlink_inode+0x1e/0x40 [btrfs] [<ffffffffa037c797>] btrfs_unlink+0x60/0x9b [btrfs] [<ffffffff8115d7f0>] vfs_unlink+0x9c/0xed [<ffffffff8115f5de>] do_unlinkat+0x12c/0x1fa [<ffffffff811601a7>] SyS_unlinkat+0x29/0x2b [<ffffffff81429e92>] system_call_fastpath+0x12/0x17 [<ffffffffffffffff>] 0xffffffffffffffff unreferenced object 0xffff8800336feda8 (size 584): comm "aio-stress", pid 22031, jiffies 4295543006 (age 1271.400s) hex dump (first 32 bytes): 00 40 3e 00 00 00 00 00 00 00 8f 42 00 00 00 00 .@>........B.... 00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00 ................ backtrace: [<ffffffff8114eb34>] create_object+0x172/0x29a [<ffffffff8141d790>] kmemleak_alloc+0x25/0x41 [<ffffffff81141ae6>] kmemleak_alloc_recursive.constprop.52+0x16/0x18 [<ffffffff81145288>] kmem_cache_alloc+0xf7/0x198 [<ffffffffa0389243>] __btrfs_add_ordered_extent+0x43/0x309 [btrfs] [<ffffffffa038968b>] btrfs_add_ordered_extent_dio+0x12/0x14 [btrfs] [<ffffffffa03810e2>] btrfs_get_blocks_direct+0x3ef/0x571 [btrfs] [<ffffffff81181349>] do_blockdev_direct_IO+0x62a/0xb47 [<ffffffff8118189a>] __blockdev_direct_IO+0x34/0x36 [<ffffffffa03776e5>] btrfs_direct_IO+0x16a/0x1e8 [btrfs] [<ffffffff81100373>] generic_file_direct_write+0xb8/0x12d [<ffffffffa038615c>] btrfs_file_write_iter+0x24b/0x42f [btrfs] [<ffffffff8118bb0d>] aio_run_iocb+0x2b7/0x32e [<ffffffff8118c99a>] do_io_submit+0x26e/0x2ff [<ffffffff8118ca3b>] SyS_io_submit+0x10/0x12 [<ffffffff81429e92>] system_call_fastpath+0x12/0x17 CC: <[email protected]> # 3.19, 3.18 and 3.17 Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2015-03-01NFSv4: Don't call put_rpccred() under the rcu_read_lock()Trond Myklebust1-1/+1
put_rpccred() can sleep. Fixes: 8f649c3762547 ("NFSv4: Fix the locking in nfs_inode_reclaim_delegation()") Cc: [email protected] # 2.6.35+ Signed-off-by: Trond Myklebust <[email protected]>
2015-03-01NFS: Don't require a filehandle to refresh the inode in nfs_prime_dcache()Trond Myklebust1-3/+13
If the server does not return a valid set of attributes that we can use to either create a file or refresh the inode, then there is no value in calling nfs_prime_dcache(). However if we're just refreshing the inode using the attributes that the server returned, then it shouldn't matter whether or not we have a filehandle, as long as we check the fsid+fileid combination. Signed-off-by: Trond Myklebust <[email protected]>
2015-03-01NFSv3: Use the readdir fileid as the mounted-on-fileidTrond Myklebust1-0/+5
When we call readdirplus, set the fileid normally returned by readdir as the mounted-on-fileid, since that is commonly the case if there is a mountpoint. To ensure that we get it right, we only set the flag if the readdir fileid differs from the one returned in the readdirplus attributes. This again means that we can avoid the issues described in commit 2ef47eb1aee17 ("NFS: Fix use of nfs_attr_use_mounted_on_fileid()"), which only fixed NFSv4. Signed-off-by: Trond Myklebust <[email protected]>
2015-03-01NFS: Don't invalidate a submounted dentry in nfs_prime_dcache()Trond Myklebust1-0/+6
If we're traversing a directory which contains a submounted filesystem, or one that has a referral, the NFS server that is processing the READDIR request will often return information for the underlying (mounted-on) directory. It may, or may not, also return filehandle information. If this happens, and the lookup in nfs_prime_dcache() returns the dentry for the submounted directory, the filehandle comparison will fail, and we call d_invalidate(). Post-commit 8ed936b5671bf ("vfs: Lazily remove mounts on unlinked files and directories."), this means the entire subtree is unmounted. The following minimal patch addresses this problem by punting on the invalidation if there is a submount. Kudos to Neil Brown <[email protected]> for having tracked down this issue (see link). Reported-by: Nix <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Cc: [email protected] # 3.18+ Signed-off-by: Trond Myklebust <[email protected]>
2015-03-01NFSv4: Set a barrier in the update_changeattr() helperTrond Myklebust2-0/+2
Ensure that we don't regress the changes that were made to the directory. Signed-off-by: Trond Myklebust <[email protected]> Tested-by: Chuck Lever <[email protected]>
2015-03-01NFS: Fix nfs_post_op_update_inode() to set an attribute barrierTrond Myklebust1-0/+1
nfs_post_op_update_inode() is called after a self-induced attribute update. Ensure that it also sets the barrier. Signed-off-by: Trond Myklebust <[email protected]> Tested-by: Chuck Lever <[email protected]>
2015-03-01NFS: Remove size hack in nfs_inode_attrs_need_update()Trond Myklebust1-8/+0
Prior to this patch, we used to always OK attribute updates that extended the file size on the assumption that we might be performing writeback. Now that we have attribute barriers to protect the writeback related updates, we should remove this hack, as it can cause truncate() operations to apparently be reverted if/when a readahead or getattr RPC call races with our on-the-wire SETATTR. Signed-off-by: Trond Myklebust <[email protected]> Tested-by: Chuck Lever <[email protected]>
2015-03-01NFSv4: Add attribute update barriers to delegreturn and pNFS layoutcommitTrond Myklebust1-0/+1
Ensure that other operations that race with delegreturn and layoutcommit cannot revert the attribute updates that were made on the server. Signed-off-by: Trond Myklebust <[email protected]> Tested-by: Chuck Lever <[email protected]>
2015-03-01NFS: Add attribute update barriers to NFS writebacksTrond Myklebust6-8/+56
Ensure that other operations that race with our write RPC calls cannot revert the file size updates that were made on the server. Signed-off-by: Trond Myklebust <[email protected]> Tested-by: Chuck Lever <[email protected]>
2015-03-01NFS: Set an attribute barrier on all updatesTrond Myklebust1-0/+4
Ensure that we update the attribute barrier even if there were no invalidations, provided that this value is newer than the old one. Signed-off-by: Trond Myklebust <[email protected]> Tested-by: Chuck Lever <[email protected]>
2015-03-01NFS: Add attribute update barriers to nfs_setattr_update_inode()Trond Myklebust4-10/+17
Ensure that other operations which raced with our setattr RPC call cannot revert the file attribute changes that were made on the server. To do so, we artificially bump the attribute generation counter on the inode so that all calls to nfs_fattr_init() that precede ours will be dropped. The motivation for the patch came from Chuck Lever's reports of readaheads racing with truncate operations and causing the file size to be reverted. Reported-by: Chuck Lever <[email protected]> Signed-off-by: Trond Myklebust <[email protected]> Tested-by: Chuck Lever <[email protected]>
2015-03-01NFS: Add a helper to set attribute barriersTrond Myklebust1-0/+16
Signed-off-by: Trond Myklebust <[email protected]> Tested-by: Chuck Lever <[email protected]>
2015-03-01NFS: Ensure that buffered writes wait for O_DIRECT writes to completeTrond Myklebust1-0/+4
The O_DIRECT code will grab the inode->i_mutex and flush out buffered writes, before scheduling a read or a write. However there is no equivalent in the buffered write code to wait for O_DIRECT to complete. Fixes a reported issue in xfstests generic/133, when first performing an O_DIRECT write followed by a buffered write. Signed-off-by: Trond Myklebust <[email protected]> Tested-by: Chuck Lever <[email protected]>
2015-02-28Merge tag 'xfs-for-linus-4.0-rc2' of ↵Linus Torvalds6-31/+41
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull xfs fixes from Dave Chinner: "These are fixes for regressions/bugs introduced in the 4.0 merge cycle and problems discovered during the merge window that need to be pushed back to stable kernels ASAP. This contains: - ensure quota type is reset in on-disk dquots - fix missing partial EOF block data flush on truncate extension - fix transaction leak in error handling for new pnfs block layout support - add missing target_ip check to RENAME_EXCHANGE" * tag 'xfs-for-linus-4.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: xfs: cancel failed transaction in xfs_fs_commit_blocks() xfs: Ensure we have target_ip for RENAME_EXCHANGE xfs: ensure truncate forces zeroed blocks to disk xfs: Fix quota type in quota structures when reusing quota file
2015-02-28nilfs2: fix potential memory overrun on inodeRyusuke Konishi1-3/+44
Each inode of nilfs2 stores a root node of a b-tree, and it turned out to have a memory overrun issue: Each b-tree node of nilfs2 stores a set of key-value pairs and the number of them (in "bn_nchildren" member of nilfs_btree_node struct), as well as a few other "bn_*" members. Since the value of "bn_nchildren" is used for operations on the key-values within the b-tree node, it can cause memory access overrun if a large number is incorrectly set to "bn_nchildren". For instance, nilfs_btree_node_lookup() function determines the range of binary search with it, and too large "bn_nchildren" leads nilfs_btree_node_get_key() in that function to overrun. As for intermediate b-tree nodes, this is prevented by a sanity check performed when each node is read from a drive, however, no sanity check has been done for root nodes stored in inodes. This patch fixes the issue by adding missing sanity check against b-tree root nodes so that it's called when on-memory inodes are read from ifile, inode metadata file. Signed-off-by: Ryusuke Konishi <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-02-27NFSv4: nfs4_open_recover_helper() must set share accessTrond Myklebust1-0/+3
The share access mode is now specified as an argument in the nfs4_opendata, and so nfs4_open_recover_helper() needs to call nfs4_map_atomic_open_share() in order to set it. Fixes: 6ae373394c42 ("NFSv4.1: Ask for no delegation on OPEN if using O_DIRECT") Signed-off-by: Trond Myklebust <[email protected]>
2015-02-26nfsd: fix clp->cl_revoked list deletion causing softlock in nfsdAndrew Elble1-1/+1
commit 2d4a532d385f ("nfsd: ensure that clp->cl_revoked list is protected by clp->cl_lock") removed the use of the reaplist to clean out clp->cl_revoked. It failed to change list_entry() to walk clp->cl_revoked.next instead of reaplist.next Fixes: 2d4a532d385f ("nfsd: ensure that clp->cl_revoked list is protected by clp->cl_lock") Cc: [email protected] Reported-by: Eric Meddaugh <[email protected]> Tested-by: Eric Meddaugh <[email protected]> Signed-off-by: Andrew Elble <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2015-02-26Merge branch 'for-linus' of ↵Linus Torvalds1-1/+8
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fix from Chris Mason: "I'm still testing more fixes, but I wanted to get out the fix for the btrfs raid5/6 memory corruption I mentioned in my merge window pull" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix allocation size calculations in alloc_btrfs_bio
2015-02-26fuse: set stolen page uptodateMiklos Szeredi1-2/+2
Regular pipe buffers' ->steal method (generic_pipe_buf_steal()) doesn't set PG_uptodate. Don't warn on this condition, just set the uptodate flag. Signed-off-by: Miklos Szeredi <[email protected]> Cc: [email protected]
2015-02-26fuse: notify: don't move pagesMiklos Szeredi1-0/+3
fuse_try_move_page() is not prepared for replacing pages that have already been read. Reported-by: Al Viro <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> Cc: [email protected]
2015-02-24eCryptfs: ensure copy to crypt_stat->cipher does not overrunColin Ian King3-4/+4
The patch 237fead61998: "[PATCH] ecryptfs: fs/Makefile and fs/Kconfig" from Oct 4, 2006, leads to the following static checker warning: fs/ecryptfs/crypto.c:846 ecryptfs_new_file_context() error: off-by-one overflow 'crypt_stat->cipher' size 32. rl = '0-32' There is a mismatch between the size of ecryptfs_crypt_stat.cipher and ecryptfs_mount_crypt_stat.global_default_cipher_name causing the copy of the cipher name to cause a off-by-one string copy error. This fix ensures the space reserved for this string is the same size including the trailing zero at the end throughout ecryptfs. This fix avoids increasing the size of ecryptfs_crypt_stat.cipher and also ecryptfs_parse_tag_70_packet_silly_stack.cipher_string and instead reduces the of ECRYPTFS_MAX_CIPHER_NAME_SIZE to 31 and includes the + 1 for the end of string terminator. NOTE: An overflow is not possible in practice since the value copied into global_default_cipher_name is validated by ecryptfs_code_for_cipher_string() at mount time. None of the allowed cipher strings are long enough to cause the potential buffer overflow fixed by this patch. Signed-off-by: Colin Ian King <[email protected]> Reported-by: Dan Carpenter <[email protected]> [tyhicks: Added the NOTE about the overflow not being triggerable] Signed-off-by: Tyler Hicks <[email protected]>
2015-02-24xfs: cancel failed transaction in xfs_fs_commit_blocks()Eric Sandeen1-1/+3
If xfs_trans_reserve fails we don't cancel the transaction, and we'll leak the allocated transaction pointer. Spotted by Coverity. Signed-off-by: Eric Sandeen <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-02-24xfs: Ensure we have target_ip for RENAME_EXCHANGEEric Sandeen1-0/+4
We shouldn't get here with RENAME_EXCHANGE set and no target_ip, but let's be defensive, because xfs_cross_rename() will dereference it. Spotted by Coverity. Signed-off-by: Eric Sandeen <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-02-23xfs: ensure truncate forces zeroed blocks to diskDave Chinner3-30/+29
A new fsync vs power fail test in xfstests indicated that XFS can have unreliable data consistency when doing extending truncates that require block zeroing. The blocks beyond EOF get zeroed in memory, but we never force those changes to disk before we run the transaction that extends the file size and exposes those blocks to userspace. This can result in the blocks not being correctly zeroed after a crash. Because in-memory behaviour is correct, tools like fsx don't pick up any coherency problems - it's not until the filesystem is shutdown or the system crashes after writing the truncate transaction to the journal but before the zeroed data in the page cache is flushed that the issue is exposed. Fix this by also flushing the dirty data in memory region between the old size and new size when we've found blocks that need zeroing in the truncate process. Reported-by: Liu Bo <[email protected]> cc: <[email protected]> Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Brian Foster <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-02-23xfs: Fix quota type in quota structures when reusing quota fileJan Kara1-0/+5
For filesystems without separate project quota inode field in the superblock we just reuse project quota file for group quotas (and vice versa) if project quota file is allocated and we need group quota file. When we reuse the file, quota structures on disk suddenly have wrong type stored in d_flags though. Nobody really cares about this (although structure type reported to userspace was wrong as well) except that after commit 14bf61ffe6ac (quota: Switch ->get_dqblk() and ->set_dqblk() to use bytes as space units) assertion in xfs_qm_scall_getquota() started to trigger on xfs/106 test (apparently I was testing without XFS_DEBUG so I didn't notice when submitting the above commit). Fix the problem by properly resetting ddq->d_flags when running quotacheck for a quota file. CC: [email protected] Reported-by: Al Viro <[email protected]> Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2015-02-22Merge tag 'ext4_for_linus' of ↵Linus Torvalds5-56/+108
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Ext4 bug fixes. We also reserved code points for encryption and read-only images (for which the implementation is mostly just the reserved code point for a read-only feature :-)" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix indirect punch hole corruption ext4: ignore journal checksum on remount; don't fail ext4: remove duplicate remount check for JOURNAL_CHECKSUM change ext4: fix mmap data corruption in nodelalloc mode when blocksize < pagesize ext4: support read-only images ext4: change to use setup_timer() instead of init_timer() ext4: reserve codepoints used by the ext4 encryption feature jbd2: complain about descriptor block checksum errors
2015-02-22Merge branch 'for-linus-2' of ↵Linus Torvalds52-676/+715
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more vfs updates from Al Viro: "Assorted stuff from this cycle. The big ones here are multilayer overlayfs from Miklos and beginning of sorting ->d_inode accesses out from David" * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (51 commits) autofs4 copy_dev_ioctl(): keep the value of ->size we'd used for allocation procfs: fix race between symlink removals and traversals debugfs: leave freeing a symlink body until inode eviction Documentation/filesystems/Locking: ->get_sb() is long gone trylock_super(): replacement for grab_super_passive() fanotify: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversions Cachefiles: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversions VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry) SELinux: Use d_is_positive() rather than testing dentry->d_inode Smack: Use d_is_positive() rather than testing dentry->d_inode TOMOYO: Use d_is_dir() rather than d_inode and S_ISDIR() Apparmor: Use d_is_positive/negative() rather than testing dentry->d_inode Apparmor: mediated_filesystem() should use dentry->d_sb not inode->i_sb VFS: Split DCACHE_FILE_TYPE into regular and special types VFS: Add a fallthrough flag for marking virtual dentries VFS: Add a whiteout dentry type VFS: Introduce inode-getting helpers for layered/unioned fs environments Infiniband: Fix potential NULL d_inode dereference posix_acl: fix reference leaks in posix_acl_create autofs4: Wrong format for printing dentry ...
2015-02-22autofs4 copy_dev_ioctl(): keep the value of ->size we'd used for allocationAl Viro1-2/+6
X-Coverup: just ask spender Cc: [email protected] Signed-off-by: Al Viro <[email protected]>
2015-02-22procfs: fix race between symlink removals and traversalsAl Viro3-12/+22
use_pde()/unuse_pde() in ->follow_link()/->put_link() resp. Cc: [email protected] Signed-off-by: Al Viro <[email protected]>
2015-02-22debugfs: leave freeing a symlink body until inode evictionAl Viro1-17/+17
As it is, we have debugfs_remove() racing with symlink traversals. Supply ->evict_inode() and do freeing there - inode will remain pinned until we are done with the symlink body. And rip the idiocy with checking if dentry is positive right after we'd verified debugfs_positive(), which is a stronger check... Cc: [email protected] Signed-off-by: Al Viro <[email protected]>
2015-02-22trylock_super(): replacement for grab_super_passive()Konstantin Khlebnikov3-26/+22
I've noticed significant locking contention in memory reclaimer around sb_lock inside grab_super_passive(). Grab_super_passive() is called from two places: in icache/dcache shrinkers (function super_cache_scan) and from writeback (function __writeback_inodes_wb). Both are required for progress in memory allocator. Grab_super_passive() acquires sb_lock to increment sb->s_count and check sb->s_instances. It seems sb->s_umount locked for read is enough here: super-block deactivation always runs under sb->s_umount locked for write. Protecting super-block itself isn't a problem: in super_cache_scan() sb is protected by shrinker_rwsem: it cannot be freed if its slab shrinkers are still active. Inside writeback super-block comes from inode from bdi writeback list under wb->list_lock. This patch removes locking sb_lock and checks s_instances under s_umount: generic_shutdown_super() unlinks it under sb->s_umount locked for write. New variant is called trylock_super() and since it only locks semaphore, callers must call up_read(&sb->s_umount) instead of drop_super(sb) when they're done. Signed-off-by: Konstantin Khlebnikov <[email protected]> Signed-off-by: Al Viro <[email protected]>
2015-02-22fanotify: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversionsDavid Howells1-1/+1
Fanotify probably doesn't want to watch autodirs so make it use d_can_lookup() rather than d_is_dir() when checking a dir watch and give an error on fake directories. Signed-off-by: David Howells <[email protected]> Signed-off-by: Al Viro <[email protected]>
2015-02-22Cachefiles: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversionsDavid Howells4-9/+9
Fix up the following scripted S_ISDIR/S_ISREG/S_ISLNK conversions (or lack thereof) in cachefiles: (1) Cachefiles mostly wants to use d_can_lookup() rather than d_is_dir() as it doesn't want to deal with automounts in its cache. (2) Coccinelle didn't find S_IS* expressions in ASSERT() statements in cachefiles. Signed-off-by: David Howells <[email protected]> Signed-off-by: Al Viro <[email protected]>
2015-02-22VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry)David Howells30-65/+65
Convert the following where appropriate: (1) S_ISLNK(dentry->d_inode) to d_is_symlink(dentry). (2) S_ISREG(dentry->d_inode) to d_is_reg(dentry). (3) S_ISDIR(dentry->d_inode) to d_is_dir(dentry). This is actually more complicated than it appears as some calls should be converted to d_can_lookup() instead. The difference is whether the directory in question is a real dir with a ->lookup op or whether it's a fake dir with a ->d_automount op. In some circumstances, we can subsume checks for dentry->d_inode not being NULL into this, provided we the code isn't in a filesystem that expects d_inode to be NULL if the dirent really *is* negative (ie. if we're going to use d_inode() rather than d_backing_inode() to get the inode pointer). Note that the dentry type field may be set to something other than DCACHE_MISS_TYPE when d_inode is NULL in the case of unionmount, where the VFS manages the fall-through from a negative dentry to a lower layer. In such a case, the dentry type of the negative union dentry is set to the same as the type of the lower dentry. However, if you know d_inode is not NULL at the call site, then you can use the d_is_xxx() functions even in a filesystem. There is one further complication: a 0,0 chardev dentry may be labelled DCACHE_WHITEOUT_TYPE rather than DCACHE_SPECIAL_TYPE. Strictly, this was intended for special directory entry types that don't have attached inodes. The following perl+coccinelle script was used: use strict; my @callers; open($fd, 'git grep -l \'S_IS[A-Z].*->d_inode\' |') || die "Can't grep for S_ISDIR and co. callers"; @callers = <$fd>; close($fd); unless (@callers) { print "No matches\n"; exit(0); } my @cocci = ( '@@', 'expression E;', '@@', '', '- S_ISLNK(E->d_inode->i_mode)', '+ d_is_symlink(E)', '', '@@', 'expression E;', '@@', '', '- S_ISDIR(E->d_inode->i_mode)', '+ d_is_dir(E)', '', '@@', 'expression E;', '@@', '', '- S_ISREG(E->d_inode->i_mode)', '+ d_is_reg(E)' ); my $coccifile = "tmp.sp.cocci"; open($fd, ">$coccifile") || die $coccifile; print($fd "$_\n") || die $coccifile foreach (@cocci); close($fd); foreach my $file (@callers) { chomp $file; print "Processing ", $file, "\n"; system("spatch", "--sp-file", $coccifile, $file, "--in-place", "--no-show-diff") == 0 || die "spatch failed"; } [AV: overlayfs parts skipped] Signed-off-by: David Howells <[email protected]> Signed-off-by: Al Viro <[email protected]>
2015-02-22VFS: Split DCACHE_FILE_TYPE into regular and special typesDavid Howells1-5/+13
Split DCACHE_FILE_TYPE into DCACHE_REGULAR_TYPE (dentries representing regular files) and DCACHE_SPECIAL_TYPE (representing blockdev, chardev, FIFO and socket files). d_is_reg() and d_is_special() are added to detect these subtypes and d_is_file() is left as the union of the two. This allows a number of places that use S_ISREG(dentry->d_inode->i_mode) to use d_is_reg(dentry) instead. Signed-off-by: David Howells <[email protected]> Signed-off-by: Al Viro <[email protected]>
2015-02-22VFS: Add a fallthrough flag for marking virtual dentriesDavid Howells1-1/+18
Add a DCACHE_FALLTHRU flag to indicate that, in a layered filesystem, this is a virtual dentry that covers another one in a lower layer that should be used instead. This may be recorded on medium if directory integration is stored there. The flag can be set with d_set_fallthru() and tested with d_is_fallthru(). Original-author: Valerie Aurora <[email protected]> Signed-off-by: David Howells <[email protected]> Signed-off-by: Al Viro <[email protected]>
2015-02-21Merge tag 'xfs-pnfs-for-linus-3.20-rc1' of ↵Linus Torvalds10-8/+393
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull xfs pnfs block layout support from Dave Chinner: "This contains the changes to XFS needed to support the PNFS block layout server that you pulled in through Bruce's NFS server tree merge. I originally thought that I'd need to merge changes into the NFS server side, but Bruce had already picked them up and so this is purely changes to the fs/xfs/ codebase. Summary: This update contains the implementation of the PNFS server export methods that enable use of XFS filesystems as a block layout target" * tag 'xfs-pnfs-for-linus-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: xfs: recall pNFS layouts on conflicting access xfs: implement pNFS export operations
2015-02-21Merge tag 'nfs-for-3.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds15-151/+140
Pull more NFS client updates from Trond Myklebust: "Highlights include: - Fix a use-after-free in decode_cb_sequence_args() - Fix a compile error when #undef CONFIG_PROC_FS - NFSv4.1 backchannel spinlocking issue - Cleanups in the NFS unstable write code requested by Linus - NFSv4.1 fix issues when the server denies our backchannel request - Cleanups in create_session and bind_conn_to_session" * tag 'nfs-for-3.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFSv4.1: Clean up bind_conn_to_session NFSv4.1: Always set up a forward channel when binding the session NFSv4.1: Don't set up a backchannel if the server didn't agree to do so NFSv4.1: Clean up create_session pnfs: Refactor the *_layout_mark_request_commit to use pnfs_layout_mark_request_commit NFSv4: Kill unused nfs_inode->delegation_state field NFS: struct nfs_commit_info.lock must always point to inode->i_lock nfs: Can call nfs_clear_page_commit() instead nfs: Provide and use helper functions for marking a page as unstable SUNRPC: Always manipulate rpc_rqst::rq_bc_pa_list under xprt->bc_pa_lock SUNRPC: Fix a compile error when #undef CONFIG_PROC_FS NFSv4.1: Convert open-coded array allocation calls to kmalloc_array() NFSv4.1: Fix a kfree() of uninitialised pointers in decode_cb_sequence_args
2015-02-21Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds1-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 fixes from Ingo Molnar: "This contains: - EFI fixes - a boot printout fix - ASLR/kASLR fixes - intel microcode driver fixes - other misc fixes Most of the linecount comes from an EFI revert" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm/ASLR: Avoid PAGE_SIZE redefinition for UML subarch x86/microcode/intel: Handle truncated microcode images more robustly x86/microcode/intel: Guard against stack overflow in the loader x86, mm/ASLR: Fix stack randomization on 64-bit systems x86/mm/init: Fix incorrect page size in init_memory_mapping() printks x86/mm/ASLR: Propagate base load address calculation Documentation/x86: Fix path in zero-page.txt x86/apic: Fix the devicetree build in certain configs Revert "efi/libstub: Call get_memory_map() to obtain map and desc sizes" x86/efi: Avoid triple faults during EFI mixed mode calls