aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2016-08-26team: loadbalance: push lacpdus to exact deliveryJiri Pirko1-0/+15
When team is in bridge and LACP is utilized, LACPDU packets are pushed to userspace using raw socket and there they are processed. However, since 8626c56c8279b, LACPDU skbs are dropped by bridge rx_handler so they never reach packet handlers in rx path. Fix this by explicity treat LACPDUs to be pushed to exact delivery in team rx_handler. Reported-by: Ido Schimmel <[email protected]> Fixes: 8626c56c8279b ("bridge: fix potential use-after-free when hook returns QUEUE or STOLEN verdict") Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-08-26net: hns: dereference ppe_cb->ppe_common_cb if it is non-nullColin Ian King1-1/+2
ppe_cb->ppe_common_cb is being dereferenced before a null check is being made on it. If ppe_cb->ppe_common_cb is null then we end up with a null pointer dereference when assigning dsaf_dev. Fix this by moving the initialisation of dsaf_dev once we know ppe_cb->ppe_common_cb is OK to dereference. Signed-off-by: Colin Ian King <[email protected]> Acked-by: Yisen Zhuang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-08-26dlm: fix malfunction of dlm_tool caused by debugfs changesEric Ren1-14/+48
With the current kernel, `dlm_tool lockdebug` fails as below: "dlm_tool lockdebug ED0BD86DCE724393918A1AE8FDBF1EE3 can't open /sys/kernel/debug/dlm/ED0BD86DCE724393918A1AE8FDBF1EE3: Operation not permitted" This is because table_open() depends on file->f_op to tell which seq_file ops should be passed down. But, the original file ops in file->f_op is replaced by "debugfs_full_proxy_file_operations" with commit 49d200deaa68 ("debugfs: prevent access to removed files' private data"). Currently, I can think up 2 solutions: 1st, replace debugfs_create_file() with debugfs_create_file_unsafe(); 2nd, make different table_open#() accordingly. The 1st one is neat, but I don't thoroughly understand its risk. Maybe someone has a better one. Signed-off-by: Eric Ren <[email protected]> Signed-off-by: David Teigland <[email protected]>
2016-08-26clocksource/drivers/sun4i: Clear interrupts after stopping timer in probe ↵Chen-Yu Tsai1-1/+8
function The bootloader (U-boot) sometimes uses this timer for various delays. It uses it as a ongoing counter, and does comparisons on the current counter value. The timer counter is never stopped. In some cases when the user interacts with the bootloader, or lets it idle for some time before loading Linux, the timer may expire, and an interrupt will be pending. This results in an unexpected interrupt when the timer interrupt is enabled by the kernel, at which point the event_handler isn't set yet. This results in a NULL pointer dereference exception, panic, and no way to reboot. Clear any pending interrupts after we stop the timer in the probe function to avoid this. Cc: [email protected] Signed-off-by: Chen-Yu Tsai <[email protected]> Signed-off-by: Daniel Lezcano <[email protected]> Acked-by: Maxime Ripard <[email protected]>
2016-08-26drivers/clocksource/pistachio: Fix memory corruption in initMarcin Nowakowski1-4/+4
Driver init code incorrectly uses the block base address and as a result clears clocksource structure's fields instead of the hardware registers. Commit 09a998201649 ("timekeeping: Lift clocksource cacheline restriction") has changed the offsets within pistachio_clocksource structure and what has previously gone unnoticed now leads to a kernel panic during boot. Signed-off-by: Marcin Nowakowski <[email protected]> Signed-off-by: Daniel Lezcano <[email protected]>
2016-08-26clocksource/drivers/timer-atmel-pit: Enable mck clockAlexandre Belloni1-0/+6
mck is needed to get the PIT working. Explicitly prepare_enable it instead of assuming it is enabled. This solves an issue where the system is freezing when the ETM/ETB drivers are enabled. Reported-by: Olivier Schonken <[email protected]> Reviewed-by: Boris Brezillon <[email protected]> Acked-by: Nicolas Ferre <[email protected]> Signed-off-by: Alexandre Belloni <[email protected]> Signed-off-by: Daniel Lezcano <[email protected]>
2016-08-26xfs: prevent dropping ioend completions during buftarg waitBrian Foster1-1/+1
xfs_wait_buftarg() waits for all pending I/O, drains the ioend completion workqueue and walks the LRU until all buffers in the cache have been released. This is traditionally an unmount operation` but the mechanism is also reused during filesystem freeze. xfs_wait_buftarg() invokes drain_workqueue() as part of the quiesce, which is intended more for a shutdown sequence in that it indicates to the queue that new operations are not expected once the drain has begun. New work jobs after this point result in a WARN_ON_ONCE() and are otherwise dropped. With filesystem freeze, however, read operations are allowed and can proceed during or after the workqueue drain. If such a read occurs during the drain sequence, the workqueue infrastructure complains about the queued ioend completion work item and drops it on the floor. As a result, the buffer remains on the LRU and the freeze never completes. Despite the fact that the overall buffer cache cleanup is not necessary during freeze, fix up this operation such that it is safe to invoke during non-unmount quiesce operations. Replace the drain_workqueue() call with flush_workqueue(), which runs a similar serialization on pending workqueue jobs without causing new jobs to be dropped. This is safe for unmount as unmount independently locks out new operations by the time xfs_wait_buftarg() is invoked. cc: <[email protected]> Signed-off-by: Brian Foster <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2016-08-26xfs: fix superblock inprogress checkDave Chinner1-1/+2
From inspection, the superblock sb_inprogress check is done in the verifier and triggered only for the primary superblock via a "bp->b_bn == XFS_SB_DADDR" check. Unfortunately, the primary superblock is an uncached buffer, and hence it is configured by xfs_buf_read_uncached() with: bp->b_bn = XFS_BUF_DADDR_NULL; /* always null for uncached buffers */ And so this check never triggers. Fix it. cc: <[email protected]> Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Brian Foster <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2016-08-26xfs: simple btree query range should look right if LE lookup failsDarrick J. Wong1-0/+7
If the initial LOOKUP_LE in the simple query range fails to find anything, we should attempt to increment the btree cursor to see if there actually /are/ records for what we're trying to find. Without this patch, a bnobt range query of (0, $agsize) returns no results because the leftmost record never has a startblock of zero. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2016-08-26xfs: fix some key handling problems in _btree_simple_query_rangeDarrick J. Wong1-1/+2
We only need the record's high key for the first record that we look at; for all records, we /definitely/ need the regular record key. Therefore, fix how the simple range query function gets its keys. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2016-08-26xfs: don't log the entire end of the AGFDarrick J. Wong2-2/+6
When we're logging the last non-spare field in the AGF, we don't need to log the spare fields, so plumb in a new AGF logging flag to help us avoid that. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2016-08-26xfs: disallow mounting of realtime + rmap filesystemsDarrick J. Wong1-1/+8
Since the kernel doesn't currently support the realtime rmapbt, don't allow such filesystems to be mounted. Support will appear in a future release. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Carlos Maiolino <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2016-08-26xfs: don't perform lookups on zero-height btreesDarrick J. Wong1-0/+4
If the caller passes in a cursor to a zero-height btree (which is impossible), we never set block to anything but NULL, which causes the later dereference of it to crash. Instead, just return -EFSCORRUPTED. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Dave Chinner <[email protected]>
2016-08-258139cp: Fix one possible deadloop in cp_rx_pollGao Feng1-1/+1
When cp_rx_poll does not get enough packet, it will check the rx interrupt status again. If so, it will jumpt to rx_status_loop again. But the goto jump resets the rx variable as zero too. As a result, it causes one possible deadloop. Assume this case, rx_status_loop only gets the packet count which is less than budget, and (cpr16(IntrStatus) & cp_rx_intr_mask) condition is always true. It causes the deadloop happens and system is blocked. Signed-off-by: Gao Feng <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-08-25i40e: Change some init flow for the clientAnjali Singhai Jain2-12/+30
This change makes a common flow for Client instance open during init and reset path. The Client subtask can handle both the cases instead of making a separate notify_client_of_open call. Also it may fix a bug during reset where the service task was leaking some memory and causing issues. Change-Id: I7232a32fd52b82e863abb54266fa83122f80a0cd Signed-off-by: Anjali Singhai Jain <[email protected]> Tested-by: Andrew Bowers <[email protected]> Signed-off-by: Jeff Kirsher <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-08-25Revert "phy: IRQ cannot be shared"Xander Huff1-2/+4
This reverts: commit 33c133cc7598 ("phy: IRQ cannot be shared") On hardware with multiple PHY devices hooked up to the same IRQ line, allow them to share it. Sergei Shtylyov says: "I'm not sure now what was the reason I concluded that the IRQ sharing was impossible... most probably I thought that the kernel IRQ handling code exited the loop over the IRQ actions once IRQ_HANDLED was returned -- which is obviously not so in reality..." Signed-off-by: Xander Huff <[email protected]> Signed-off-by: Nathan Sullivan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-08-25net: dsa: bcm_sf2: Fix race condition while unmasking interruptsFlorian Fainelli1-1/+1
We kept shadow copies of which interrupt sources we have enabled and disabled, but due to an order bug in how intrl2_mask_clear was defined, we could run into the following scenario: CPU0 CPU1 intrl2_1_mask_clear(..) sets INTRL2_CPU_MASK_CLEAR bcm_sf2_switch_1_isr read INTRL2_CPU_STATUS and masks with stale irq1_mask value updates irq1_mask value Which would make us loop again and again trying to process and interrupt we are not clearing since our copy of whether it was enabled before still indicates it was not. Fix this by updating the shadow copy first, and then unasking at the HW level. Fixes: 246d7f773c13 ("net: dsa: add Broadcom SF2 switch driver") Signed-off-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-08-25qdisc: fix a module refcount leak in qdisc_create_dflt()Eric Dumazet1-4/+5
Should qdisc_alloc() fail, we must release the module refcount we got right before. Fixes: 6da7c8fcbcbd ("qdisc: allow setting default queuing discipline") Signed-off-by: Eric Dumazet <[email protected]> Acked-by: John Fastabend <[email protected]> Acked-by: John Fastabend <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-08-25tipc: fix the error handling in tipc_udp_enable()Wei Yongjun1-1/+4
Fix to return a negative error code in enable_mcast() error handling case, and release udp socket when necessary. Fixes: d0f91938bede ("tipc: add ip/udp media type") Signed-off-by: Wei Yongjun <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-08-25Merge tag 'usb-serial-4.8-rc4' of ↵Greg Kroah-Hartman3-3/+12
git://git.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial into usb-linus Johan writes: USB-serial fixes for v4.8-rc4 Here are a couple of fixes for non-atomic allocations in write paths, and some new option device ids. Signed-off-by: Johan Hovold <[email protected]>
2016-08-25mmc: fix use-after-free of struct requestAdrian Hunter2-3/+5
We call mmc_req_is_special() after having processed a request, but it could be freed after that. Check that ahead of time, and use the cached value. Reported-by: Hans de Goede <[email protected]> Tested-by: Hans de Goede <[email protected]> Fixes: c2df40dfb8c0 ("drivers: use req op accessor") Signed-off-by: Jens Axboe <[email protected]>
2016-08-26Merge tag 'drm-intel-fixes-2016-08-25' of ↵Dave Airlie7-22/+309
git://anongit.freedesktop.org/drm-intel into drm-fixes i915 fixes queue. * tag 'drm-intel-fixes-2016-08-25' of git://anongit.freedesktop.org/drm-intel: drm/i915: Fix botched merge that downgrades CSR versions. drm/i915/skl: Ensure pipes with changed wms get added to the state drm/i915/gen9: Only copy WM results for changed pipes to skl_hw drm/i915/skl: Add support for the SAGV, fix underrun hangs drm/i915/gen6+: Interpret mailbox error flags drm/i915: Reattach comment, complete type specification drm/i915: Unconditionally flush any chipset buffers before execbuf drm/i915/gen9: Drop invalid WARN() during data rate calculation drm/i915/gen9: Initialize intel_state->active_crtcs during WM sanitization (v2)
2016-08-26drm: Protect fb_defio in drivers with CONFIG_KMS_FBDEV_EMULATIONDaniel Vetter2-0/+8
For reasons that entirely elude me fb.h exposes all the structures, even when it is not enabled. Except for special stuff like fb_defio. Which means all the drivers which haven't yet switched over to the defio support in the helpers and still roll their own, will fail to compile when fbdev emulation is disabled. Protect just those bits, as a gnarly reminder that conversion to the core defio helpers would be good. Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Daniel Vetter <[email protected]> Link: http://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Dave Airlie <[email protected]>
2016-08-25Bluetooth: Fix hci_sock_recvmsg when MSG_TRUNC is not setLuiz Augusto von Dentz1-1/+1
Similar to bt_sock_recvmsg MSG_TRUNC shall be checked using the original flags not msg_flags. Signed-off-by: Luiz Augusto von Dentz <[email protected]> Signed-off-by: Marcel Holtmann <[email protected]>
2016-08-25Bluetooth: Fix bt_sock_recvmsg when MSG_TRUNC is not setLuiz Augusto von Dentz1-1/+1
Commit b5f34f9420b50c9b5876b9a2b68e96be6d629054 attempt to introduce proper handling for MSG_TRUNC but recv and variants should still work as read if no flag is passed, but because the code may set MSG_TRUNC to msg->msg_flags that shall not be used as it may cause it to be behave as if MSG_TRUNC is always, so instead of using it this changes the code to use the flags parameter which shall contain the original flags. Signed-off-by: Luiz Augusto von Dentz <[email protected]> Signed-off-by: Marcel Holtmann <[email protected]>
2016-08-25Merge tag 'asoc-fix-v4.8-rc4' of ↵Takashi Iwai564-3159/+7054
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus ASoC: Fixes for v4.8 A clutch of fixes for v4.8. These are mainly driver specific, the most notable ones being those for OMAP which fix a series of issues that broke boot on some platforms there when deferred probe kicked in. There's also one core fix for an issue when unbinding a card which for some reason had managed to not manifest until recently.
2016-08-25Merge branch 'misc-fixes' into k.o/for-4.8-rcDoug Ledford1-21/+1
2016-08-25i40iw: Send last streaming mode message for loopback connectionsTatyana Nikolova1-21/+1
Send a zero length last streaming mode message for loopback connections to synchronize between accepting QP and connecting QP. This avoids data transfer to start on the accepting QP before the connecting QP is in RTS. Also remove function i40iw_loopback_nop() as it is no longer used. Fixes: f27b4746f378 ("i40iw: add connection management code") Signed-off-by: Tatyana Nikolova <[email protected]> Signed-off-by: Shiraz Saleem <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2016-08-25Revert "floppy: refactor open() flags handling"Jens Axboe1-19/+15
This reverts commit 09954bad448791ef01202351d437abdd9497a804.
2016-08-25Revert "floppy: fix open(O_ACCMODE) for ioctl-only open"Jens Axboe1-9/+12
This reverts commit ff06db1efb2ad6db06eb5b99b88a0c15a9cc9b0e.
2016-08-25fs/block_dev: fix potential NULL ptr deref in freeze_bdev()Andrey Ryabinin1-1/+2
Calling freeze_bdev() twice on the same block device without mounted filesystem get_super() will return NULL, which will lead to NULL-ptr dereference later in drop_super(). Check get_super() result to fix that. Note, that this is a purely theoretical issue. We have only 3 freeze_bdev() callers. 2 of them are in filesystem code and used on a device with mounted fs. The third one in lock_fs() has protection in upper-layer code against freezing block device the second time without thawing it first. Signed-off-by: Andrey Ryabinin <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2016-08-25Merge tag 'fixes-for-v4.8-rc3' of ↵Greg Kroah-Hartman12-20/+57
git://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb into usb-linus Felipe writes: usb: fixes for v4.8-rc3 Few fixes on dwc3 again, the most important being a fix for pm_runtime to make it work with current intel platforms. Other than that, there's a signedness bug fix in fsl udc and some other minor fixes.
2016-08-25Btrfs: fix lockdep warning on deadlock against an inode's log mutexFilipe Manana3-4/+8
Commit 44f714dae50a ("Btrfs: improve performance on fsync against new inode after rename/unlink"), which landed in 4.8-rc2, introduced a possibility for a deadlock due to double locking of an inode's log mutex by the same task, which lockdep reports with: [23045.433975] ============================================= [23045.434748] [ INFO: possible recursive locking detected ] [23045.435426] 4.7.0-rc6-btrfs-next-34+ #1 Not tainted [23045.436044] --------------------------------------------- [23045.436044] xfs_io/3688 is trying to acquire lock: [23045.436044] (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs] [23045.436044] but task is already holding lock: [23045.436044] (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs] [23045.436044] other info that might help us debug this: [23045.436044] Possible unsafe locking scenario: [23045.436044] CPU0 [23045.436044] ---- [23045.436044] lock(&ei->log_mutex); [23045.436044] lock(&ei->log_mutex); [23045.436044] *** DEADLOCK *** [23045.436044] May be due to missing lock nesting notation [23045.436044] 3 locks held by xfs_io/3688: [23045.436044] #0: (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffffa035f2ae>] btrfs_sync_file+0x14e/0x425 [btrfs] [23045.436044] #1: (sb_internal#2){.+.+.+}, at: [<ffffffff8118446b>] __sb_start_write+0x5f/0xb0 [23045.436044] #2: (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs] [23045.436044] stack backtrace: [23045.436044] CPU: 4 PID: 3688 Comm: xfs_io Not tainted 4.7.0-rc6-btrfs-next-34+ #1 [23045.436044] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [23045.436044] 0000000000000000 ffff88022f5f7860 ffffffff8127074d ffffffff82a54b70 [23045.436044] ffffffff82a54b70 ffff88022f5f7920 ffffffff81092897 ffff880228015d68 [23045.436044] 0000000000000000 ffffffff82a54b70 ffffffff829c3f00 ffff880228015d68 [23045.436044] Call Trace: [23045.436044] [<ffffffff8127074d>] dump_stack+0x67/0x90 [23045.436044] [<ffffffff81092897>] __lock_acquire+0xcbb/0xe4e [23045.436044] [<ffffffff8109155f>] ? mark_lock+0x24/0x201 [23045.436044] [<ffffffff8109179a>] ? mark_held_locks+0x5e/0x74 [23045.436044] [<ffffffff81092de0>] lock_acquire+0x12f/0x1c3 [23045.436044] [<ffffffff81092de0>] ? lock_acquire+0x12f/0x1c3 [23045.436044] [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs] [23045.436044] [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs] [23045.436044] [<ffffffff814a51a4>] mutex_lock_nested+0x77/0x3a7 [23045.436044] [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs] [23045.436044] [<ffffffffa039705e>] ? btrfs_release_delayed_node+0xb/0xd [btrfs] [23045.436044] [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs] [23045.436044] [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs] [23045.436044] [<ffffffff810a0ed1>] ? vprintk_emit+0x453/0x465 [23045.436044] [<ffffffffa0385a61>] btrfs_log_inode+0x66e/0xc95 [btrfs] [23045.436044] [<ffffffffa03c084d>] log_new_dir_dentries+0x26c/0x359 [btrfs] [23045.436044] [<ffffffffa03865aa>] btrfs_log_inode_parent+0x4a6/0x628 [btrfs] [23045.436044] [<ffffffffa0387552>] btrfs_log_dentry_safe+0x5a/0x75 [btrfs] [23045.436044] [<ffffffffa035f464>] btrfs_sync_file+0x304/0x425 [btrfs] [23045.436044] [<ffffffff811acaf4>] vfs_fsync_range+0x8c/0x9e [23045.436044] [<ffffffff811acb22>] vfs_fsync+0x1c/0x1e [23045.436044] [<ffffffff811acc79>] do_fsync+0x31/0x4a [23045.436044] [<ffffffff811ace99>] SyS_fsync+0x10/0x14 [23045.436044] [<ffffffff814a88e5>] entry_SYSCALL_64_fastpath+0x18/0xa8 [23045.436044] [<ffffffff8108f039>] ? trace_hardirqs_off_caller+0x3f/0xaa An example reproducer for this is: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/dir $ touch /mnt/dir/foo $ sync $ mv /mnt/dir/foo /mnt/dir/bar $ touch /mnt/dir/foo $ xfs_io -c "fsync" /mnt/dir/bar This is because while logging the inode of file bar we end up logging its parent directory (since its inode has an unlink_trans field matching the current transaction id due to the rename operation), which in turn logs the inodes for all its new dentries, so that the new inode for the new file named foo gets logged which in turn triggered another logging attempt for the inode we are fsync'ing, since that inode had an old name that corresponds to the name of the new inode. So fix this by ensuring that when logging the inode for a new dentry that has a name matching an old name of some other inode, we don't log again the original inode that we are fsync'ing. Fixes: 44f714dae50a ("Btrfs: improve performance on fsync against new inode after rename/unlink") Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2016-08-25Btrfs: detect corruption when non-root leaf has zero itemLiu Bo1-1/+22
Right now we treat leaf which has zero item as a valid one because we could have an empty tree, that is, a root that is also a leaf without any item, however, in the same case but when the leaf is not a root, we can end up with hitting the BUG_ON(1) in btrfs_extend_item() called by setup_inline_extent_backref(). This makes us check the situation as a corruption if leaf is not its own root. Signed-off-by: Liu Bo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2016-08-25Btrfs: check btree node's nritemsLiu Bo1-0/+16
When btree node (level = 1) has nritems which equals to zero, we can end up with panic due to insert_ptr()'s BUG_ON(slot > nritems); where slot is 1 and nritems is 0, as copy_for_split() calls insert_ptr(.., path->slots[1] + 1, ...); A invalid value results in the whole mess, this adds the check for btree's node nritems so that we stop reading block when when something is wrong. Signed-off-by: Liu Bo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2016-08-25btrfs: don't create or leak aliased root while cleaning up orphansJeff Mahoney3-11/+22
commit 909c3a22da3 (Btrfs: fix loading of orphan roots leading to BUG_ON) avoids the BUG_ON but can add an aliased root to the dead_roots list or leak the root. Since we've already been loading roots into the radix tree, we should use it before looking the root up on disk. Cc: <[email protected]> # 4.5 Signed-off-by: Jeff Mahoney <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2016-08-25Btrfs: fix em leak in find_first_block_groupJosef Bacik1-0/+1
We need to call free_extent_map() on the em we look up. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: Omar Sandoval <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2016-08-25btrfs: do not background blkdev_put()Anand Jain1-8/+19
At the end of unmount/dev-delete, if the device exclusive open is not actually closed, then there might be a race with another program in the userland who is trying to open the device in exclusive mode and it may fail for eg: unmount /btrfs; fsck /dev/x btrfs dev del /dev/x /btrfs; fsck /dev/x so here background blkdev_put() is not a choice Signed-off-by: Anand Jain <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2016-08-25Btrfs: clarify do_chunk_alloc()'s return valueLiu Bo1-0/+9
Function start_transaction() can return ERR_PTR(1) when flush is BTRFS_RESERVE_FLUSH_LIMIT, so the call graph is start_transaction (return ERR_PTR(1)) -> btrfs_block_rsv_add (return 1) -> reserve_metadata_bytes (return 1) -> flush_space (return 1) -> do_chunk_alloc (return 1) With BTRFS_RESERVE_FLUSH_LIMIT, if flush_space is already on the flush_state of ALLOC_CHUNK and it successfully allocates a new chunk, then instead of trying to reserve space again, reserve_metadata_bytes returns 1 immediately. Eventually the callers who call start_transaction() usually just do the IS_ERR() check which ERR_PTR(1) can pass, then it'll get a panic when dereferencing a pointer which is ERR_PTR(1). The following patch fixes the above problem. "btrfs: flush_space: treat return value of do_chunk_alloc properly" https://patchwork.kernel.org/patch/7778651/ This add comments to clarify do_chunk_alloc()'s return value. Signed-off-by: Liu Bo <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2016-08-25btrfs: fix fsfreeze hang caused by delayed iputs dealWang Xiaoguang4-1/+25
When running fstests generic/068, sometimes we got below deadlock: xfs_io D ffff8800331dbb20 0 6697 6693 0x00000080 ffff8800331dbb20 ffff88007acfc140 ffff880034d895c0 ffff8800331dc000 ffff880032d243e8 fffffffeffffffff ffff880032d24400 0000000000000001 ffff8800331dbb38 ffffffff816a9045 ffff880034d895c0 ffff8800331dbba8 Call Trace: [<ffffffff816a9045>] schedule+0x35/0x80 [<ffffffff816abab2>] rwsem_down_read_failed+0xf2/0x140 [<ffffffff8118f5e1>] ? __filemap_fdatawrite_range+0xd1/0x100 [<ffffffff8134f978>] call_rwsem_down_read_failed+0x18/0x30 [<ffffffffa06631fc>] ? btrfs_alloc_block_rsv+0x2c/0xb0 [btrfs] [<ffffffff810d32b5>] percpu_down_read+0x35/0x50 [<ffffffff81217dfc>] __sb_start_write+0x2c/0x40 [<ffffffffa067f5d5>] start_transaction+0x2a5/0x4d0 [btrfs] [<ffffffffa067f857>] btrfs_join_transaction+0x17/0x20 [btrfs] [<ffffffffa068ba34>] btrfs_evict_inode+0x3c4/0x5d0 [btrfs] [<ffffffff81230a1a>] evict+0xba/0x1a0 [<ffffffff812316b6>] iput+0x196/0x200 [<ffffffffa06851d0>] btrfs_run_delayed_iputs+0x70/0xc0 [btrfs] [<ffffffffa067f1d8>] btrfs_commit_transaction+0x928/0xa80 [btrfs] [<ffffffffa0646df0>] btrfs_freeze+0x30/0x40 [btrfs] [<ffffffff81218040>] freeze_super+0xf0/0x190 [<ffffffff81229275>] do_vfs_ioctl+0x4a5/0x5c0 [<ffffffff81003176>] ? do_audit_syscall_entry+0x66/0x70 [<ffffffff810038cf>] ? syscall_trace_enter_phase1+0x11f/0x140 [<ffffffff81229409>] SyS_ioctl+0x79/0x90 [<ffffffff81003c12>] do_syscall_64+0x62/0x110 [<ffffffff816acbe1>] entry_SYSCALL64_slow_path+0x25/0x25 >From this warning, freeze_super() already holds SB_FREEZE_FS, but btrfs_freeze() will call btrfs_commit_transaction() again, if btrfs_commit_transaction() finds that it has delayed iputs to handle, it'll start_transaction(), which will try to get SB_FREEZE_FS lock again, then deadlock occurs. The root cause is that in btrfs, sync_filesystem(sb) does not make sure all metadata is updated. There still maybe some codes adding delayed iputs, see below sample race window: CPU1 | CPU2 |-> freeze_super() | |-> sync_filesystem(sb); | | |-> cleaner_kthread() | | |-> btrfs_delete_unused_bgs() | | |-> btrfs_remove_chunk() | | |-> btrfs_remove_block_group() | | |-> btrfs_add_delayed_iput() | | |-> sb->s_writers.frozen = SB_FREEZE_FS; | |-> sb_wait_write(sb, SB_FREEZE_FS); | | acquire SB_FREEZE_FS lock. | | | |-> btrfs_freeze() | |-> btrfs_commit_transaction() | |-> btrfs_run_delayed_iputs() | | will handle delayed iputs, | | that means start_transaction() | | will be called, which will try | | to get SB_FREEZE_FS lock. | To fix this issue, introduce a "int fs_frozen" to record internally whether fs has been frozen. If fs has been frozen, we can not handle delayed iputs. Signed-off-by: Wang Xiaoguang <[email protected]> Reviewed-by: David Sterba <[email protected]> [ add comment to btrfs_freeze ] Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2016-08-25btrfs: update btrfs_space_info's bytes_may_use timelyWang Xiaoguang7-63/+73
This patch can fix some false ENOSPC errors, below test script can reproduce one false ENOSPC error: #!/bin/bash dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128 dev=$(losetup --show -f fs.img) mkfs.btrfs -f -M $dev mkdir /tmp/mntpoint mount $dev /tmp/mntpoint cd /tmp/mntpoint xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile Above script will fail for ENOSPC reason, but indeed fs still has free space to satisfy this request. Please see call graph: btrfs_fallocate() |-> btrfs_alloc_data_chunk_ondemand() | bytes_may_use += 64M |-> btrfs_prealloc_file_range() |-> btrfs_reserve_extent() |-> btrfs_add_reserved_bytes() | alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not | change bytes_may_use, and bytes_reserved += 64M. Now | bytes_may_use + bytes_reserved == 128M, which is greater | than btrfs_space_info's total_bytes, false enospc occurs. | Note, the bytes_may_use decrease operation will be done in | end of btrfs_fallocate(), which is too late. Here is another simple case for buffered write: CPU 1 | CPU 2 | |-> cow_file_range() |-> __btrfs_buffered_write() |-> btrfs_reserve_extent() | | | | | | | | | ..... | |-> btrfs_check_data_free_space() | | | | |-> extent_clear_unlock_delalloc() | In CPU 1, btrfs_reserve_extent()->find_free_extent()-> btrfs_add_reserved_bytes() do not decrease bytes_may_use, the decrease operation will be delayed to be done in extent_clear_unlock_delalloc(). Assume in this case, btrfs_reserve_extent() reserved 128MB data, CPU2's btrfs_check_data_free_space() tries to reserve 100MB data space. If 100MB > data_sinfo->total_bytes - data_sinfo->bytes_used - data_sinfo->bytes_reserved - data_sinfo->bytes_pinned - data_sinfo->bytes_readonly - data_sinfo->bytes_may_use btrfs_check_data_free_space() will try to allcate new data chunk or call btrfs_start_delalloc_roots(), or commit current transaction in order to reserve some free space, obviously a lot of work. But indeed it's not necessary as long as decreasing bytes_may_use timely, we still have free space, decreasing 128M from bytes_may_use. To fix this issue, this patch chooses to update bytes_may_use for both data and metadata in btrfs_add_reserved_bytes(). For compress path, real extent length may not be equal to file content length, so introduce a ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and btrfs_add_reserved_bytes(), it's becasue bytes_may_use is increased by file content length. Then compress path can update bytes_may_use correctly. Also now we can discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC and RESERVE_FREE. As we know, usually EXTENT_DO_ACCOUNTING is used for error path. In run_delalloc_nocow(), for inode marked as NODATACOW or extent marked as PREALLOC, we also need to update bytes_may_use, but can not pass EXTENT_DO_ACCOUNTING, because it also clears metadata reservation, so here we introduce EXTENT_CLEAR_DATA_RESV flag to indicate btrfs_clear_bit_hook() to update btrfs_space_info's bytes_may_use. Meanwhile __btrfs_prealloc_file_range() will call btrfs_free_reserved_data_space() internally for both sucessful and failed path, btrfs_prealloc_file_range()'s callers does not need to call btrfs_free_reserved_data_space() any more. Signed-off-by: Wang Xiaoguang <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2016-08-25btrfs: divide btrfs_update_reserved_bytes() into two functionsWang Xiaoguang1-40/+57
This patch divides btrfs_update_reserved_bytes() into btrfs_add_reserved_bytes() and btrfs_free_reserved_bytes(), and next patch will extend btrfs_add_reserved_bytes()to fix some false ENOSPC error, please see later patch for detailed info. Signed-off-by: Wang Xiaoguang <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2016-08-25btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster()Wang Xiaoguang1-4/+6
In prealloc_file_extent_cluster(), btrfs_check_data_free_space() uses wrong file offset for reloc_inode, it uses cluster->start and cluster->end, which indeed are extent's bytenr. The correct value should be cluster->[start|end] minus block group's start bytenr. start bytenr cluster->start | | extent | extent | ...| extent | |----------------------------------------------------------------| | block group reloc_inode | Signed-off-by: Wang Xiaoguang <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2016-08-25btrfs: qgroup: Fix qgroup incorrectness caused by log replayQu Wenruo1-0/+16
When doing log replay at mount time(after power loss), qgroup will leak numbers of replayed data extents. The cause is almost the same of balance. So fix it by manually informing qgroup for owner changed extents. The bug can be detected by btrfs/119 test case. Cc: Mark Fasheh <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Reviewed-and-Tested-by: Goldwyn Rodrigues <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2016-08-25btrfs: relocation: Fix leaking qgroups numbers on data extentsQu Wenruo1-6/+103
This patch fixes a REGRESSION introduced in 4.2, caused by the big quota rework. When balancing data extents, qgroup will leak all its numbers for relocated data extents. The relocation is done in the following steps for data extents: 1) Create data reloc tree and inode 2) Copy all data extents to data reloc tree And commit transaction 3) Create tree reloc tree(special snapshot) for any related subvolumes 4) Replace file extent in tree reloc tree with new extents in data reloc tree And commit transaction 5) Merge tree reloc tree with original fs, by swapping tree blocks For 1)~4), since tree reloc tree and data reloc tree doesn't count to qgroup, everything is OK. But for 5), the swapping of tree blocks will only info qgroup to track metadata extents. If metadata extents contain file extents, qgroup number for file extents will get lost, leading to corrupted qgroup accounting. The fix is, before commit transaction of step 5), manually info qgroup to track all file extents in data reloc tree. Since at commit transaction time, the tree swapping is done, and qgroup will account these data extents correctly. Cc: Mark Fasheh <[email protected]> Reported-by: Mark Fasheh <[email protected]> Reported-by: Filipe Manana <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Tested-by: Goldwyn Rodrigues <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2016-08-25btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent()Qu Wenruo4-47/+71
Refactor btrfs_qgroup_insert_dirty_extent() function, to two functions: 1. btrfs_qgroup_insert_dirty_extent_nolock() Almost the same with original code. For delayed_ref usage, which has delayed refs locked. Change the return value type to int, since caller never needs the pointer, but only needs to know if they need to free the allocated memory. 2. btrfs_qgroup_insert_dirty_extent() The more encapsulated version. Will do the delayed_refs lock, memory allocation, quota enabled check and other things. The original design is to keep exported functions to minimal, but since more btrfs hacks exposed, like replacing path in balance, we need to record dirty extents manually, so we have to add such functions. Also, add comment for both functions, to info developers how to keep qgroup correct when doing hacks. Cc: Mark Fasheh <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Reviewed-and-Tested-by: Goldwyn Rodrigues <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2016-08-25btrfs: waiting on qgroup rescan should not always be interruptibleJeff Mahoney4-6/+13
We wait on qgroup rescan completion in three places: file system shutdown, the quota disable ioctl, and the rescan wait ioctl. If the user sends a signal while we're waiting, we continue happily along. This is expected behavior for the rescan wait ioctl. It's racy in the shutdown path but mostly works due to other unrelated synchronization points. In the quota disable path, it Oopses the kernel pretty much immediately. Cc: <[email protected]> # v4.4+ Signed-off-by: Jeff Mahoney <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2016-08-25btrfs: properly track when rescan worker is runningJeff Mahoney3-1/+10
The qgroup_flags field is overloaded such that it reflects the on-disk status of qgroups and the runtime state. The BTRFS_QGROUP_STATUS_FLAG_RESCAN flag is used to indicate that a rescan operation is in progress, but if the file system is unmounted while a rescan is running, the rescan operation is paused. If the file system is then mounted read-only, the flag will still be present but the rescan operation will not have been resumed. When we go to umount, btrfs_qgroup_wait_for_completion will see the flag and interpret it to mean that the rescan worker is still running and will wait for a completion that will never come. This patch uses a separate flag to indicate when the worker is running. The locking and state surrounding the qgroup rescan worker needs a lot of attention beyond this patch but this is enough to avoid a hung umount. Cc: <[email protected]> # v4.4+ Signed-off-by; Jeff Mahoney <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2016-08-25btrfs: flush_space: treat return value of do_chunk_alloc properlyAlex Lyakas1-1/+1
do_chunk_alloc returns 1 when it succeeds to allocate a new chunk. But flush_space will not convert this to 0, and will also return 1. As a result, reserve_metadata_bytes will think that flush_space failed, and may potentially return this value "1" to the caller (depends how reserve_metadata_bytes was called). The caller will also treat this as an error. For example, btrfs_block_rsv_refill does: int ret = -ENOSPC; ... ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush); if (!ret) { block_rsv_add_bytes(block_rsv, num_bytes, 0); return 0; } return ret; So it will return -ENOSPC. Signed-off-by: Alex Lyakas <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Liu Bo <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>
2016-08-25Btrfs: add ASSERT for block group's memory leakLiu Bo1-0/+5
This adds several ASSERT()' s to report memory leak of block group cache. Signed-off-by: Liu Bo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Chris Mason <[email protected]>