aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2014-06-09Btrfs: send, remove dead code from __get_cur_name_and_parentFilipe Manana1-6/+0
Signed-off-by: Filipe David Borba Manana <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-06-09Btrfs: send, account for orphan directories when building path stringsFilipe Manana1-24/+9
If we have directories with a pending move/rename operation, we must take into account any orphan directories that got created before executing the pending move/rename. Those orphan directories are directories with an inode number higher then the current send progress and that don't exist in the parent snapshot, they are created before current progress reaches their inode number, with a generated name of the form oN-M-I and at the root of the filesystem tree, and later when progress matches their inode number, moved/renamed to their final location. Reproducer: $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ mkdir -p /mnt/a/b/c/d $ mkdir /mnt/a/b/e $ mv /mnt/a/b/c /mnt/a/b/e/CC $ mkdir /mnt/a/b/e/CC/d/f $ mkdir /mnt/a/g $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/base.send $ mkdir /mnt/a/g/h $ mv /mnt/a/b/e /mnt/a/g/h/EE $ mv /mnt/a/g/h/EE/CC/d /mnt/a/g/h/EE/DD $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send The second receive command failed with the following error: ERROR: rename a/b/e/CC/d -> o264-7-0/EE/DD failed. No such file or directory A test case for xfstests follows soon. Signed-off-by: Filipe David Borba Manana <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-06-09Btrfs: send, avoid unnecessary inode item lookup in the btreeFilipe Manana1-6/+7
Regardless of whether the caller is interested or not in knowing the inode's generation (dir_gen != NULL), get_first_ref always does a btree lookup to get the inode item. Avoid this useless lookup if dir_gen parameter is NULL (which is in some cases). Signed-off-by: Filipe David Borba Manana <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-06-09btrfs: add dev maxs limit for __btrfs_alloc_chunk in kernel spaceGui Hecheng1-0/+16
For RAID0,5,6,10, For system chunk, there shouldn't be too many stripes to make a btrfs_chunk that exceeds BTRFS_SYSTEM_CHUNK_ARRAY_SIZE For data/meta chunk, there shouldn't be too many stripes to make a btrfs_chunk that exceeds a leaf. Signed-off-by: Gui Hecheng <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-06-09btrfs: fix wrong max system array size check in kernel spaceGui Hecheng1-1/+2
For system chunk array, We copy a "disk_key" and an chunk item each time, so there should be enough space to hold both of them, not only the chunk item. Signed-off-by: Gui Hecheng <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-06-09btrfs: Add check to avoid cleanup roots already in fs_info->dead_roots.Qu Wenruo1-7/+30
Current btrfs_orphan_cleanup will also cleanup roots which is already in fs_info->dead_roots without protection. This will have conditional race with fs_info->cleaner_kthread. This patch will use refs in root->root_item to detect roots in dead_roots and avoid conflicts. Signed-off-by: Qu Wenruo <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-06-09Btrfs: reclaim the reserved metadata space at backgroundMiao Xie4-1/+114
Before applying this patch, the task had to reclaim the metadata space by itself if the metadata space was not enough. And When the task started the space reclamation, all the other tasks which wanted to reserve the metadata space were blocked. At some cases, they would be blocked for a long time, it made the performance fluctuate wildly. So we introduce the background metadata space reclamation, when the space is about to be exhausted, we insert a reclaim work into the workqueue, the worker of the workqueue helps us to reclaim the reserved space at the background. By this way, the tasks needn't reclaim the space by themselves at most cases, and even if the tasks have to reclaim the space or are blocked for the space reclamation, they will get enough space more quickly. Here is my test result(Tested by compilebench): Memory: 2GB CPU: 2Cores * 1CPU Partition: 40GB(SSD) Test command: # compilebench -D <mnt> -m Without this patch: intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s) compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s) read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s) delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s) With this patch: intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s) compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s) read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s) delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s) Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-06-09Btrfs: output warning instead of error when loading free space cache failedMiao Xie1-2/+2
If we fail to load a free space cache, we can rebuild it from the extent tree, so it is not a serious error, we should not output a error message that would make the users uncomfortable. This patch uses warning message instead of it. Signed-off-by: Miao Xie <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-06-09btrfs: Add ctime/mtime update for btrfs device add/remove.Qu Wenruo1-2/+24
Btrfs will send uevent to udev inform the device change, but ctime/mtime for the block device inode is not udpated, which cause libblkid used by btrfs-progs unable to detect device change and use old cache, causing 'btrfs dev scan; btrfs dev rmove; btrfs dev scan' give an error message. Reported-by: Tsutomu Itoh <[email protected]> Cc: Karel Zak <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-06-09btrfs: assert that send is not in progres before root deletionDavid Sterba2-13/+1
CC: Miao Xie <[email protected]> CC: Wang Shilong <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-06-09btrfs: protect snapshots from deleting during sendDavid Sterba3-2/+53
The patch "Btrfs: fix protection between send and root deletion" (18f687d538449373c37c) does not actually prevent to delete the snapshot and just takes care during background cleaning, but this seems rather user unfriendly, this patch implements the idea presented in http://www.spinics.net/lists/linux-btrfs/msg30813.html - add an internal root_item flag to denote a dead root - check if the send_in_progress is set and refuse to delete, otherwise set the flag and proceed - check the flag in send similar to the btrfs_root_readonly checks, for all involved roots The root lookup in send via btrfs_read_fs_root_no_name will check if the root is really dead or not. If it is, ENOENT, aborted send. If it's alive, it's protected by send_in_progress, send can continue. CC: Miao Xie <[email protected]> CC: Wang Shilong <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-06-09btrfs: remove redundant null check in btrfs_dentry_release()Daeseok Youn1-2/+1
It doesn't need to check NULL for kfree() Signed-off-by: Daeseok Youn <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-06-09Btrfs: implement inode_operations callback tmpfileFilipe Manana1-20/+98
This implements the tmpfile callback of struct inode_operations, introduced in the linux kernel 3.11, and implemented already by some filesystems. This callback is invoked by the VFS when the flag O_TMPFILE is passed to the open system call. Signed-off-by: Filipe David Borba Manana <[email protected]> Signed-off-by: Chris Mason <[email protected]> Reviewed-by: David Sterba <[email protected]>
2014-06-09btrfs: make FS_INFO ioctl available to anyoneDavid Sterba1-3/+0
This ioctl provides basic info about the filesystem that can be obtained in other ways (eg. sysfs), there's no reason to restrict it to CAP_SYSADMIN. Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-06-09btrfs: make DEV_INFO ioctl available to anyoneDavid Sterba1-3/+0
This ioctl provides basic info about the devices that can be obtained in other ways (eg. sysfs), there's no reason to restrict it to CAP_SYSADMIN. Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-06-09btrfs: export more from FS_INFO to sysfsDavid Sterba1-0/+40
Similar to the FS_INFO updates, export the basic filesystem info through sysfs: node size, sector size and clone alignment. Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-06-09btrfs: retrieve more info from FS_INFO ioctlDavid Sterba2-1/+9
Provide the basic information about filesystem through the ioctl: * b-tree node size (same as leaf size) * sector size * expected alignment of CLONE_RANGE and EXTENT_SAME ioctl arguments Backward compatibility: if the values are 0, kernel does not provide this information, the applications should ignore them. Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-06-09btrfs: balance filter: add limit of processed chunksDavid Sterba4-2/+27
This started as debugging helper, to watch the effects of converting between raid levels on multiple devices, but could be useful standalone. In my case the usage filter was not finegrained enough and led to converting too many chunks at once. Another example use is in connection with drange+devid or vrange filters that allow to work with a specific chunk or even with a chunk on a given device. The limit filter applies last, the value of 0 means no limiting. CC: Ilya Dryomov <[email protected]> CC: Hugo Mills <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-06-09Btrfs: fix leaf corruption caused by ENOSPC while hole punchingFilipe Manana1-1/+19
While running a stress test with multiple threads writing to the same btrfs file system, I ended up with a situation where a leaf was corrupted in that it had 2 file extent item keys that had the same exact key. I was able to detect this quickly thanks to the following patch which triggers an assertion as soon as a leaf is marked dirty if there are duplicated keys or out of order keys: Btrfs: check if items are ordered when a leaf is marked dirty (https://patchwork.kernel.org/patch/3955431/) Basically while running the test, I got the following in dmesg: [28877.415877] WARNING: CPU: 2 PID: 10706 at fs/btrfs/file.c:553 btrfs_drop_extent_cache+0x435/0x440 [btrfs]() (...) [28877.415917] Call Trace: [28877.415922] [<ffffffff816f1189>] dump_stack+0x4e/0x68 [28877.415926] [<ffffffff8104a32c>] warn_slowpath_common+0x8c/0xc0 [28877.415929] [<ffffffff8104a37a>] warn_slowpath_null+0x1a/0x20 [28877.415944] [<ffffffffa03775a5>] btrfs_drop_extent_cache+0x435/0x440 [btrfs] [28877.415949] [<ffffffff8118e7be>] ? kmem_cache_alloc+0xfe/0x1c0 [28877.415962] [<ffffffffa03777d9>] fill_holes+0x229/0x3e0 [btrfs] [28877.415972] [<ffffffffa0345865>] ? block_rsv_add_bytes+0x55/0x80 [btrfs] [28877.415984] [<ffffffffa03792cb>] btrfs_fallocate+0xb6b/0xc20 [btrfs] (...) [29854.132560] BTRFS critical (device sdc): corrupt leaf, bad key order: block=955232256,root=1, slot=24 [29854.132565] BTRFS info (device sdc): leaf 955232256 total ptrs 40 free space 778 (...) [29854.132637] item 23 key (3486 108 667648) itemoff 2694 itemsize 53 [29854.132638] extent data disk bytenr 14574411776 nr 286720 [29854.132639] extent data offset 0 nr 286720 ram 286720 [29854.132640] item 24 key (3486 108 954368) itemoff 2641 itemsize 53 [29854.132641] extent data disk bytenr 0 nr 0 [29854.132643] extent data offset 0 nr 0 ram 0 [29854.132644] item 25 key (3486 108 954368) itemoff 2588 itemsize 53 [29854.132645] extent data disk bytenr 8699670528 nr 77824 [29854.132646] extent data offset 0 nr 77824 ram 77824 [29854.132647] item 26 key (3486 108 1146880) itemoff 2535 itemsize 53 [29854.132648] extent data disk bytenr 8699670528 nr 77824 [29854.132649] extent data offset 0 nr 77824 ram 77824 (...) [29854.132707] kernel BUG at fs/btrfs/ctree.h:3901! (...) [29854.132771] Call Trace: [29854.132779] [<ffffffffa0342b5c>] setup_items_for_insert+0x2dc/0x400 [btrfs] [29854.132791] [<ffffffffa0378537>] __btrfs_drop_extents+0xba7/0xdd0 [btrfs] [29854.132794] [<ffffffff8109c0d6>] ? trace_hardirqs_on_caller+0x16/0x1d0 [29854.132797] [<ffffffff8109c29d>] ? trace_hardirqs_on+0xd/0x10 [29854.132800] [<ffffffff8118e7be>] ? kmem_cache_alloc+0xfe/0x1c0 [29854.132810] [<ffffffffa036783b>] insert_reserved_file_extent.constprop.66+0xab/0x310 [btrfs] [29854.132820] [<ffffffffa036a6c6>] __btrfs_prealloc_file_range+0x116/0x340 [btrfs] [29854.132830] [<ffffffffa0374d53>] btrfs_prealloc_file_range+0x23/0x30 [btrfs] (...) So this is caused by getting an -ENOSPC error while punching a file hole, more specifically, we get -ENOSPC error from __btrfs_drop_extents in the while loop of file.c:btrfs_punch_hole() when it's unable to modify the btree to delete one or more file extent items due to lack of enough free space. When this happens, in btrfs_punch_hole(), we attempt to reclaim free space by switching our transaction block reservation object to root->fs_info->trans_block_rsv, end our transaction and start a new transaction basically - and, we keep increasing our current offset (cur_offset) as long as it's smaller than the end of the target range (lockend) - this makes use leave the loop with cur_offset == drop_end which in turn makes us call fill_holes() for inserting a file extent item that represents a 0 bytes range hole (and this insertion succeeds, as in the meanwhile more space became available). This 0 bytes file hole extent item is a problem because any subsequent caller of __btrfs_drop_extents (regular file writes, or fallocate calls for e.g.), with a start file offset that is equal to the offset of the hole, will not remove this extent item due to the following conditional in the while loop of __btrfs_drop_extents: if (extent_end <= search_start) { path->slots[0]++; goto next_slot; } This later makes the call to setup_items_for_insert() (at the very end of __btrfs_drop_extents), insert a new file extent item with the same offset as the 0 bytes file hole extent item that follows it. Needless is to say that this causes chaos, either when reading the leaf from disk (btree_readpage_end_io_hook), where we perform leaf sanity checks or in subsequent operations that manipulate file extent items, as in the fallocate call as shown by the dmesg trace above. Without my other patch to perform the leaf sanity checks once a leaf is marked as dirty (if the integrity checker is enabled), it would have been much harder to debug this issue. This change might fix a few similar issues reported by users in the mailing list regarding assertion failures in btrfs_set_item_key_safe calls performed by __btrfs_drop_extents, such as the following report: http://comments.gmane.org/gmane.comp.file-systems.btrfs/32938 Asking fill_holes() to create a 0 bytes wide file hole item also produced the first warning in the trace above, as we passed a range to btrfs_drop_extent_cache that has an end smaller (by -1) than its start. On 3.14 kernels this issue manifests itself through leaf corruption, as we get duplicated file extent item keys in a leaf when calling setup_items_for_insert(), but on older kernels, setup_items_for_insert() isn't called by __btrfs_drop_extents(), instead we have callers of __btrfs_drop_extents(), namely the functions inode.c:insert_inline_extent() and inode.c:insert_reserved_file_extent(), calling btrfs_insert_empty_item() to insert the new file extent item, which would fail with error -EEXIST, instead of inserting a duplicated key - which is still a serious issue as it would make all similar file extent item replace operations keep failing if they target the same file range. Cc: [email protected] Signed-off-by: Filipe David Borba Manana <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-06-09Btrfs: do not increment on bio_index one by oneLiu Bo1-1/+1
'bio_index' is just a index, it's really not necessary to do increment one by one. Signed-off-by: Liu Bo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-06-09Btrfs: read inode size after acquiring the mutex when punching a holeFilipe Manana1-1/+2
In a previous change, commit 12870f1c9b2de7d475d22e73fd7db1b418599725, I accidentally moved the roundup of inode->i_size to outside of the critical section delimited by the inode mutex, which is not atomic and not correct since the size can be changed by other task before we acquire the mutex. Therefore fix it. Signed-off-by: Filipe David Borba Manana <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-06-09btrfs: Remove unnecessary check for NULLTobias Klauser1-2/+2
iput() already checks for the inode being NULL, thus it's unnecessary to check before calling. Signed-off-by: Tobias Klauser <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-06-09btrfs: fix inline compressed read err corruptionZach Brown1-10/+5
uncompress_inline() is dropping the error from btrfs_decompress() after testing it and zeroing the page that was supposed to hold decompressed data. This can silently turn compressed inline data in to zeros if decompression fails due to corrupt compressed data or memory allocation failure. I verified this by manually forcing the error from btrfs_decompress() for a silly named copy of od: if (!strcmp(current->comm, "failod")) ret = -ENOMEM; # od -x /mnt/btrfs/dir/80 | head -1 0000000 3031 3038 310a 2d30 6f70 6e69 0a74 3031 # echo 3 > /proc/sys/vm/drop_caches # cp $(which od) /tmp/failod # /tmp/failod -x /mnt/btrfs/dir/80 | head -1 0000000 0000 0000 0000 0000 0000 0000 0000 0000 The fix is to pass the error to its caller. Which still has a BUG_ON(). So we fix that too. There seems to be no reason for the zeroing of the page on the error from btrfs_decompress() but not from the allocation error a few lines above. So the page zeroing is removed. Signed-off-by: Zach Brown <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-06-09btrfs: return ptr error from compression workspaceZach Brown1-3/+3
The btrfs compression wrappers translated errors from workspace allocation to either -ENOMEM or -1. The compression type workspace allocators are already returning a ERR_PTR(-ENOMEM). Just return that and get rid of the magical -1. This helps a future patch return errors from the compression wrappers. Signed-off-by: Zach Brown <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-06-09btrfs: return errno instead of -1 from compressionZach Brown2-20/+20
The compression layer seems to have been built to return -1 and have callers make up errors that make sense. This isn't great because there are different errors that originate down in the compression layer. Let's return real negative errnos from the compression layer so that callers can pass on the error without having to guess what happened. ENOMEM for allocation failure, E2BIG when compression exceeds the uncompressed input, and EIO for everything else. This helps a future path return errors from btrfs_decompress(). Signed-off-by: Zach Brown <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-06-09btrfs: check_int: propagate out-of-memory error upwardsStefan Behrens1-1/+4
This issue was not causing any harm but IMO (and in the opinion of the static code checker) it is better to propagate this error status upwards. Signed-off-by: Stefan Behrens <[email protected]> Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-06-09Btrfs: fix hang on error (such as ENOSPC) when writing extent pagesFilipe Manana1-5/+11
When running low on available disk space and having several processes doing buffered file IO, I got the following trace in dmesg: [ 4202.720152] INFO: task kworker/u8:1:5450 blocked for more than 120 seconds. [ 4202.720401] Not tainted 3.13.0-fdm-btrfs-next-26+ #1 [ 4202.720596] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 4202.720874] kworker/u8:1 D 0000000000000001 0 5450 2 0x00000000 [ 4202.720904] Workqueue: btrfs-flush_delalloc normal_work_helper [btrfs] [ 4202.720908] ffff8801f62ddc38 0000000000000082 ffff880203ac2490 00000000001d3f40 [ 4202.720913] ffff8801f62ddfd8 00000000001d3f40 ffff8800c4f0c920 ffff880203ac2490 [ 4202.720918] 00000000001d4a40 ffff88020fe85a40 ffff88020fe85ab8 0000000000000001 [ 4202.720922] Call Trace: [ 4202.720931] [<ffffffff816a3cb9>] schedule+0x29/0x70 [ 4202.720950] [<ffffffffa01ec48d>] btrfs_start_ordered_extent+0x6d/0x110 [btrfs] [ 4202.720956] [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0 [ 4202.720972] [<ffffffffa01ec559>] btrfs_run_ordered_extent_work+0x29/0x40 [btrfs] [ 4202.720988] [<ffffffffa0201987>] normal_work_helper+0x137/0x2c0 [btrfs] [ 4202.720994] [<ffffffff810680e5>] process_one_work+0x1f5/0x530 (...) [ 4202.721027] 2 locks held by kworker/u8:1/5450: [ 4202.721028] #0: (%s-%s){++++..}, at: [<ffffffff81068083>] process_one_work+0x193/0x530 [ 4202.721037] #1: ((&work->normal_work)){+.+...}, at: [<ffffffff81068083>] process_one_work+0x193/0x530 [ 4202.721054] INFO: task btrfs:7891 blocked for more than 120 seconds. [ 4202.721258] Not tainted 3.13.0-fdm-btrfs-next-26+ #1 [ 4202.721444] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 4202.721699] btrfs D 0000000000000001 0 7891 7890 0x00000001 [ 4202.721704] ffff88018c2119e8 0000000000000086 ffff8800a33d2490 00000000001d3f40 [ 4202.721710] ffff88018c211fd8 00000000001d3f40 ffff8802144b0000 ffff8800a33d2490 [ 4202.721714] ffff8800d8576640 ffff88020fe85bc0 ffff88020fe85bc8 7fffffffffffffff [ 4202.721718] Call Trace: [ 4202.721723] [<ffffffff816a3cb9>] schedule+0x29/0x70 [ 4202.721727] [<ffffffff816a2ebc>] schedule_timeout+0x1dc/0x270 [ 4202.721732] [<ffffffff8109bd79>] ? mark_held_locks+0xb9/0x140 [ 4202.721736] [<ffffffff816a90c0>] ? _raw_spin_unlock_irq+0x30/0x40 [ 4202.721740] [<ffffffff8109bf0d>] ? trace_hardirqs_on_caller+0x10d/0x1d0 [ 4202.721744] [<ffffffff816a488f>] wait_for_completion+0xdf/0x120 [ 4202.721749] [<ffffffff8107fa90>] ? try_to_wake_up+0x310/0x310 [ 4202.721765] [<ffffffffa01ebee4>] btrfs_wait_ordered_extents+0x1f4/0x280 [btrfs] [ 4202.721781] [<ffffffffa020526e>] btrfs_mksubvol.isra.62+0x30e/0x5a0 [btrfs] [ 4202.721786] [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0 [ 4202.721799] [<ffffffffa02056a9>] btrfs_ioctl_snap_create_transid+0x1a9/0x1b0 [btrfs] [ 4202.721813] [<ffffffffa020583a>] btrfs_ioctl_snap_create_v2+0x10a/0x170 [btrfs] (...) It turns out that extent_io.c:__extent_writepage(), which ends up being called through filemap_fdatawrite_range() in btrfs_start_ordered_extent(), was getting -ENOSPC when calling the fill_delalloc callback. In this situation, it returned without the writepage_end_io_hook callback (inode.c:btrfs_writepage_end_io_hook) ever being called for the respective page, which prevents the ordered extent's bytes_left count from ever reaching 0, and therefore a finish_ordered_fn work is never queued into the endio_write_workers queue. This makes the task that called btrfs_start_ordered_extent() hang forever on the wait queue of the ordered extent. This is fairly easy to reproduce using a small filesystem and fsstress on a quad core vm: mkfs.btrfs -f -b `expr 2100 \* 1024 \* 1024` /dev/sdd mount /dev/sdd /mnt fsstress -p 6 -d /mnt -n 100000 -x \ "btrfs subvolume snapshot -r /mnt /mnt/mysnap" \ -f allocsp=0 \ -f bulkstat=0 \ -f bulkstat1=0 \ -f chown=0 \ -f creat=1 \ -f dread=0 \ -f dwrite=0 \ -f fallocate=1 \ -f fdatasync=0 \ -f fiemap=0 \ -f freesp=0 \ -f fsync=0 \ -f getattr=0 \ -f getdents=0 \ -f link=0 \ -f mkdir=0 \ -f mknod=0 \ -f punch=1 \ -f read=0 \ -f readlink=0 \ -f rename=0 \ -f resvsp=0 \ -f rmdir=0 \ -f setxattr=0 \ -f stat=0 \ -f symlink=0 \ -f sync=0 \ -f truncate=1 \ -f unlink=0 \ -f unresvsp=0 \ -f write=4 So just ensure that if an error happens while writing the extent page we call the writepage_end_io_hook callback. Also make it return the error code and ensure the caller (extent_write_cache_pages) processes all pages in the page vector even if an error happens only for some of them, so that ordered extents end up released. Signed-off-by: Filipe David Borba Manana <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2014-06-09Merge tag 'trace-3.16' of ↵Linus Torvalds30-729/+1326
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: "Lots of tweaks, small fixes, optimizations, and some helper functions to help out the rest of the kernel to ease their use of trace events. The big change for this release is the allowing of other tracers, such as the latency tracers, to be used in the trace instances and allow for function or function graph tracing to be in the top level simultaneously" * tag 'trace-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (44 commits) tracing: Fix memory leak on instance deletion tracing: Fix leak of ring buffer data when new instances creation fails tracing/kprobes: Avoid self tests if tracing is disabled on boot up tracing: Return error if ftrace_trace_arrays list is empty tracing: Only calculate stats of tracepoint benchmarks for 2^32 times tracing: Convert stddev into u64 in tracepoint benchmark tracing: Introduce saved_cmdlines_size file tracing: Add __get_dynamic_array_len() macro for trace events tracing: Remove unused variable in trace_benchmark tracing: Eliminate double free on failure of allocation on boot up ftrace/x86: Call text_ip_addr() instead of the duplicated code tracing: Print max callstack on stacktrace bug tracing: Move locking of trace_cmdline_lock into start/stop seq calls tracing: Try again for saved cmdline if failed due to locking tracing: Have saved_cmdlines use the seq_read infrastructure tracing: Add tracepoint benchmark tracepoint tracing: Print nasty banner when trace_printk() is in use tracing: Add funcgraph_tail option to print function name after closing braces tracing: Eliminate duplicate TRACE_GRAPH_PRINT_xx defines tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks ...
2014-06-09Merge branch 'for-3.16' of ↵Linus Torvalds22-1004/+1966
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "A lot of activities on cgroup side. Heavy restructuring including locking simplification took place to improve the code base and enable implementation of the unified hierarchy, which currently exists behind a __DEVEL__ mount option. The core support is mostly complete but individual controllers need further work. To explain the design and rationales of the the unified hierarchy Documentation/cgroups/unified-hierarchy.txt is added. Another notable change is css (cgroup_subsys_state - what each controller uses to identify and interact with a cgroup) iteration update. This is part of continuing updates on css object lifetime and visibility. cgroup started with reference count draining on removal way back and is now reaching a point where csses behave and are iterated like normal refcnted objects albeit with some complexities to allow distinguishing the state where they're being deleted. The css iteration update isn't taken advantage of yet but is planned to be used to simplify memcg significantly" * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits) cgroup: disallow disabled controllers on the default hierarchy cgroup: don't destroy the default root cgroup: disallow debug controller on the default hierarchy cgroup: clean up MAINTAINERS entries cgroup: implement css_tryget() device_cgroup: use css_has_online_children() instead of has_children() cgroup: convert cgroup_has_live_children() into css_has_online_children() cgroup: use CSS_ONLINE instead of CGRP_DEAD cgroup: iterate cgroup_subsys_states directly cgroup: introduce CSS_RELEASED and reduce css iteration fallback window cgroup: move cgroup->serial_nr into cgroup_subsys_state cgroup: link all cgroup_subsys_states in their sibling lists cgroup: move cgroup->sibling and ->children into cgroup_subsys_state cgroup: remove cgroup->parent device_cgroup: remove direct access to cgroup->children memcg: update memcg_has_children() to use css_next_child() memcg: remove tasks/children test from mem_cgroup_force_empty() cgroup: remove css_parent() cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css cgroup: use cgroup->self.refcnt for cgroup refcnting ...
2014-06-09Merge branch 'for-3.16' of ↵Linus Torvalds79-193/+324
git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata Pull libata updates from Tejun Heo: "Nothing too interesting - another ahci platform driver variant, additional controller support, minor fixes and cleanups" * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata: ahci: Add Device ID for HighPoint RocketRaid 642L ata: ep93xx: use dmaengine_prep_slave_sg api instead of internal callback ahci: add PCI ID for Marvell 88SE91A0 SATA Controller sata_fsl: remove check for CONFIG_MPC8315_DS ahci: add support for Hisilicon sata libahci_platform: add host_flags parameter in ahci_platform_init_host() ata: ahci: append new hflag AHCI_HFLAG_NO_FBS ata: use CONFIG_PM_SLEEP instead of CONFIG_PM where applicable in host drivers ata: ahci_mvebu: new driver for Marvell Armada 380 AHCI interfaces Documentation: dt-bindings: reformat and order list of ahci-platform compatibles libata-sff: remove dead code ata: SATL compliance for Inquiry Product Revision pata_octeon_cf: use devm_kzalloc() to allocate cf_port
2014-06-09Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds3-336/+154
Pull workqueue updates from Tejun Heo: "Lai simplified worker destruction path and internal workqueue locking and there are some other minor changes. Except for the removal of some long-deprecated interfaces which haven't had any in-kernel user for quite a while, there shouldn't be any difference to workqueue users" * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: kernel/workqueue.c: pr_warning/pr_warn & printk/pr_info workqueue: remove the confusing POOL_FREEZING workqueue: rename first_worker() to first_idle_worker() workqueue: remove unused work_clear_pending() workqueue: remove unused WORK_CPU_END workqueue: declare system_highpri_wq workqueue: use generic attach/detach routine for rescuers workqueue: separate pool-attaching code out from create_worker() workqueue: rename manager_mutex to attach_mutex workqueue: narrow the protection range of manager_mutex workqueue: convert worker_idr to worker_ida workqueue: separate iteration role from worker_idr workqueue: destroy worker directly in the idle timeout handler workqueue: async worker destruction workqueue: destroy_worker() should destroy idle workers only workqueue: use manager lock only to protect worker_idr workqueue: Remove deprecated system_nrt[_freezable]_wq workqueue: Remove deprecated flush[_delayed]_work_sync() kernel/workqueue.c: pr_warning/pr_warn & printk/pr_info workqueue: simplify wq_update_unbound_numa() by jumping to use_dfl_pwq if the target cpumask equals wq's
2014-06-09Merge branch 'for-3.16' of ↵Linus Torvalds3-3/+35
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu updates from Tejun Heo: "Nothing too exciting. percpu_ref is going through some interface changes and getting new features with more changes in the pipeline but given its young age and few users, it's very low impact" * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu-refcount: implement percpu_ref_tryget() percpu-refcount: rename percpu_ref_tryget() to percpu_ref_tryget_live() percpu: Replace __get_cpu_var with this_cpu_ptr
2014-06-09alienware-wmi: Update WMAX brightness method limit to 15Mario Limonciello1-2/+1
This more closely reflects what the hardware can actually support. Signed-off-by: Mario Limonciello <[email protected]> Signed-off-by: Matthew Garrett <[email protected]>
2014-06-09pvpanic: Set high notifier priorityTakashi Iwai1-0/+1
We've observed the missing pvpanic call at panic, and it turned out that this was blocked by the broken notifier of drm_fb_helper, where scheduling may be called during switching to the fb console. It's fairly difficult to fix the drm_fb problem and a quick fix isn't foreseen, a simpler solution for the missing pvpanic call would be just to call this earlier. In order to assure that, this patch sets a higher priority to pvpanic notifier_block. Once when the issue of drm_fb is resolved, we can remove this priority again. Signed-off-by: Takashi Iwai <[email protected]> Signed-off-by: Matthew Garrett <[email protected]>
2014-06-09platform/x86: samsung-laptop: Add support for Samsung's NP7[34]0U3E models.Scott Thrasher1-0/+38
These models have only 4 levels of keyboard backlight brightness and forget how to work the backlight after resuming from S3 sleep. I've added a quirk to set the appropriate number of backlight levels, and one to re-enable the keyboard backlight on resume. (Whitespace cleaned up by Matthew Garrett) Signed-off-by: Scott Thrasher <[email protected]> Signed-off-by: Matthew Garrett <[email protected]>
2014-06-09toshiba_acpi: Add alternative keymap support for Satellite M840Takashi Iwai1-1/+29
Toshiba Satellite M840 laptop has a complete different keymap although it's bound with the same ACPI ID "TOS1900". This patch provides an alternative keymap specific to this machine by identifying via DMI matching. The keymap table doesn't fill all entries that were used before since some keys aren't found on this machine at all. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=69761 Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=812209 Reported-and-tested-by: Federico Vecchiarelli <[email protected]> Signed-off-by: Takashi Iwai <[email protected]> Signed-off-by: Matthew Garrett <[email protected]>
2014-06-09platform-drivers-x86: intel_pmic_gpio: Fix off-by-one valid offset range checkAxel Lin1-2/+2
Only pin 0-7 support input, so the valid offset range should be 0 ~ 7. Signed-off-by: Axel Lin <[email protected]> Signed-off-by: Matthew Garrett <[email protected]>
2014-06-10Merge branch 'xfs-misc-fixes-3-for-3.16' into for-nextDave Chinner13-53/+60
2014-06-10Merge branch 'xfs-da-geom' into for-nextDave Chinner26-778/+819
2014-06-10xfs: fix xfs_da_args sparse warning in xfs_readdirDave Chinner1-1/+1
The kbuild test robot reported: >> fs/xfs/xfs_dir2_readdir.c:672:41: sparse: Using plain integer as NULL pointer Fix it. Reported-by: kbuild test robot <[email protected]> Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Brian Foster <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2014-06-09nfsd4: fix FREE_STATEID lockowner leakJ. Bruce Fields1-1/+1
27b11428b7de ("nfsd4: remove lockowner when removing lock stateid") introduced a memory leak. Cc: [email protected] Reported-by: Jeff Layton <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2014-06-09IB/mlx4: Fix gfp passing in create_qp_common()Jiri Kosina1-2/+2
There are two kzalloc() calls which were not converted to use value of gfp passed to create_qp_common() instead of using hardcoded GFP_KERNEL in 40f2287bd583 ("IB/mlx4: Implement IB_QP_CREATE_USE_GFP_NOIO"). Fix this by passing gfp value down properly. Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Jiri Kosina <[email protected]> Signed-off-by: Roland Dreier <[email protected]>
2014-06-09Merge branch 'topic/xilinx' into for-linusVinod Koul6-0/+1517
2014-06-09Merge branch 'topic/dw' into for-linusVinod Koul3-24/+40
2014-06-09blk-mq: add timer in blk_mq_start_requestMing Lei1-16/+1
This way will become consistent with non-mq case, also avoid to update rq->deadline twice for mq. The comment said: "We do this early, to ensure we are on the right CPU.", but no percpu stuff is used in blk_add_timer(), so it isn't necessary. Even when inserting from plug list, there is no such guarantee at all. Signed-off-by: Ming Lei <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2014-06-09blk-mq: always initialize request->start_timeJens Axboe1-3/+2
The blk-mq core only initializes this if io stats are enabled, since blk-mq only reads the field in that case. But drivers could potentially use it internally, so ensure that we always set it to the current time when the request is allocated. Reported-by: Ming Lei <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2014-06-09Merge remote-tracking branch 'scsi-queue/drivers-for-3.16' into for-linusJames Bottomley111-2518/+4508
2014-06-09NFSv4.1: Fix typo in dprintkTom Haynes1-1/+1
Signed-off-by: Tom Haynes <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2014-06-09NFSv4.1: Comment is now wrong and redundant to codeTom Haynes1-4/+1
The save of the write offset was removed some time ago, so that part of the comment is bogus. The remainder is pretty self-evident. So off with it! Signed-off-by: Tom Haynes <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2014-06-09vhost: move memory pointer to VQsMichael S. Tsirkin5-42/+33
commit 2ae76693b8bcabf370b981cd00c36cd41d33fabc vhost: replace rcu with mutex replaced rcu sync for memory accesses with VQ mutex locl/unlock. This is correct since all accesses are under VQ mutex, but incomplete: we still do useless rcu lock/unlock operations, someone might copy this code into some other context where this won't be right. This use of RCU is also non standard and hard to understand. Let's copy the pointer to each VQ structure, this way the access rules become straight-forward, and there's no need for RCU anymore. Reported-by: Eric Dumazet <[email protected]> Signed-off-by: Michael S. Tsirkin <[email protected]>