.i_compress_level was introduced by commit 3fde13f817e2 ("f2fs: compress:
support compress level"), but was never used.
This patch updates as below:
- load the high 8 bits of the on-disk .i_compress_flag into the in-memory
  .i_compress_level
- load the low 8 bits of the on-disk .i_compress_flag into the in-memory
  .i_compress_flag
- change the type of the in-memory .i_compress_flag from unsigned short to
  unsigned char
With the above changes, we can avoid an unneeded bit shift during
.init_compress_ctx(), and shrink the size of struct f2fs_inode_info.
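For illustration, a minimal userspace sketch of the high/low split described
above (not the actual f2fs code; the field widths follow this description):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint16_t ondisk_compress_flag = 0x0304;	/* example on-disk value */

	/* High 8 bits -> level, low 8 bits -> flag; both fit in one byte,
	 * so no shift is needed later when the values are consumed. */
	uint8_t i_compress_level = ondisk_compress_flag >> 8;
	uint8_t i_compress_flag  = ondisk_compress_flag & 0xff;

	printf("level=%u flag=%u\n",
	       (unsigned)i_compress_level, (unsigned)i_compress_flag);
	return 0;
}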
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
This patch tries to use bitfields in struct f2fs_io_info to improve
memory usage.
struct f2fs_io_info {
...
unsigned int need_lock:8; /* indicate we need to lock cp_rwsem */
unsigned int version:8; /* version of the node */
unsigned int submitted:1; /* indicate IO submission */
unsigned int in_list:1; /* indicate fio is in io_list */
unsigned int is_por:1; /* indicate IO is from recovery or not */
unsigned int retry:1; /* need to reallocate block address */
unsigned int encrypted:1; /* indicate file is encrypted */
unsigned int post_read:1; /* require post read */
...
};
After this patch, the size of struct f2fs_io_info is reduced from 136 to 120 bytes.
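For illustration, a minimal userspace sketch of the packing (not the actual
f2fs structure; the exact layout is ABI dependent). Note that the single-bit
members are unsigned: a signed 1-bit bitfield can only hold 0 and -1, which
is what clang's single-bit-bitfield-constant-conversion warning is about.

#include <stdio.h>

struct packed_flags {
	unsigned int need_lock:8;
	unsigned int version:8;
	unsigned int submitted:1;
	unsigned int in_list:1;
	unsigned int is_por:1;
	unsigned int retry:1;
	unsigned int encrypted:1;
	unsigned int post_read:1;
};

int main(void)
{
	/* All eight members share one unsigned int (22 bits used), instead
	 * of each occupying its own integer or bool field. */
	printf("sizeof(struct packed_flags) = %zu\n",
	       sizeof(struct packed_flags));
	return 0;
}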
[Nathan: fix a compile warning (single-bit-bitfield-constant-conversion)]
Signed-off-by: Nathan Chancellor <[email protected]>
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Factor the logic to log a path for reads and writes into a helper
shared between the read_iter and write_iter methods.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Just open code the logic in the only caller, where it is more
obvious.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Remove __refresh_next_blkoff by opencoding the SSR vs LFS segment check
in the only caller, and then add helpers for SSR block selection and
blkoff randomization instead.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Just fold this trivial wrapper into the only caller.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Simplify the check whether to allocate a new segment or reuse an open
one.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Add a helper to return the valid blocks on log and SSR segments, and
replace the last two uses of curseg_blkoff with it.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
For each loop, add a local curseg_info pointer instead of looking it up
for each of the three fields.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
There is a list_del corruption when removing the jffs2 module:
list_del corruption, ffffffffa0623e60->next is NULL
WARNING: CPU: 6 PID: 6332 at lib/list_debug.c:49 __list_del_entry_valid+0x98/0x130
Modules linked in: jffs2(-) ]
CPU: 6 PID: 6332 Comm: rmmod Tainted: G W 6.1.0-rc2+ #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014
RIP: 0010:__list_del_entry_valid+0x98/0x130
...
Call Trace:
<TASK>
jffs2_unregister_compressor+0x3e/0xe0 [jffs2]
jffs2_zlib_exit+0x11/0x30 [jffs2]
jffs2_compressors_exit+0x1e/0x30 [jffs2]
exit_jffs2_fs+0x16/0x44f [jffs2]
__do_sys_delete_module.constprop.0+0x244/0x370
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
If one of the compressors fails to initialize, module insertion still
succeeds since jffs2_compressors_init() always returns success, and then
something bad may happen when the module is removed.
For this scenario, make insmod fail.
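A generic sketch of the init/unwind pattern implied above (hypothetical
names, not the actual jffs2 code): the init function propagates the first
error and unregisters whatever was already registered.

#include <stdio.h>

/* Hypothetical per-compressor init/exit stand-ins. */
static int  init_a(void) { return 0; }
static int  init_b(void) { return -1; }	/* simulate a failure */
static int  init_c(void) { return 0; }
static void exit_a(void) { puts("unregister a"); }
static void exit_b(void) { puts("unregister b"); }

static int compressors_init(void)
{
	int err;

	err = init_a();
	if (err)
		goto out;
	err = init_b();
	if (err)
		goto err_a;
	err = init_c();
	if (err)
		goto err_b;
	return 0;

err_b:
	exit_b();
err_a:
	exit_a();
out:
	return err;
}

int main(void)
{
	/* With init_b() failing, the error is propagated so module insertion
	 * can fail instead of "succeeding" with a half-initialized list. */
	printf("compressors_init() = %d\n", compressors_init());
	return 0;
}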
Signed-off-by: Zhang Xiaoxu <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
The already-initialized compressors should be released if one of them
fails to initialize. This is a preparation patch for fixing that problem:
use functions instead of the macros in jffs2_compressors_init() to
simplify the code; no functional change intended.
Signed-off-by: Zhang Xiaoxu <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
There is a spelling mistake in a comment. Fix it.
Signed-off-by: Yu Zhe <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
Fix the function names in fs/ubifs/io.c kernel-doc comments
to remove some warnings found by clang (make W=1 LLVM=1).
fs/ubifs/io.c:497: warning: expecting prototype for
wbuf_timer_callback(). Prototype was for wbuf_timer_callback_nolock()
instead
fs/ubifs/io.c:513: warning: expecting prototype for new_wbuf_timer().
Prototype was for new_wbuf_timer_nolock() instead
fs/ubifs/io.c:538: warning: expecting prototype for cancel_wbuf_timer().
Prototype was for cancel_wbuf_timer_nolock() instead
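For reference, the name on the first line of a kernel-doc comment must match
the function it documents; a minimal illustrative example (hypothetical
helper, not the fs/ubifs/io.c code):

/**
 * example_helper_nolock - process @value; the caller must hold the lock.
 * @value: value to process.
 *
 * The name on the first line must match the function below; otherwise
 * scripts/kernel-doc reports "expecting prototype for ... Prototype was
 * for ... instead".
 */
static int example_helper_nolock(int value)
{
	return value + 1;
}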
Reported-by: Abaci Robot <[email protected]>
Signed-off-by: Yang Li <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
Remove warnings found by running scripts/kernel-doc,
which are caused by using 'make W=1'.
fs/ubifs/journal.c:1221: warning: Function parameter or member
'old_inode' not described in 'ubifs_jnl_rename'
fs/ubifs/journal.c:1221: warning: Function parameter or member 'old_nm'
not described in 'ubifs_jnl_rename'
fs/ubifs/journal.c:1221: warning: Function parameter or member
'new_inode' not described in 'ubifs_jnl_rename'
fs/ubifs/journal.c:1221: warning: Function parameter or member 'new_nm'
not described in 'ubifs_jnl_rename'
fs/ubifs/journal.c:1221: warning: Function parameter or member
'whiteout' not described in 'ubifs_jnl_rename'
fs/ubifs/journal.c:1221: warning: Excess function parameter 'old_dentry'
description in 'ubifs_jnl_rename'
fs/ubifs/journal.c:1221: warning: Excess function parameter 'new_dentry'
description in 'ubifs_jnl_rename'
Reported-by: Abaci Robot <[email protected]>
Signed-off-by: Yang Li <[email protected]>
Reviewed-by: Zhihao Cheng <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
There are two states for ubifs writing pages:
1. Dirty, Private
2. Not Dirty, Not Private
In the normal flow, ubifs_releasepage() is never reached, since that would
require a page that is private but not dirty. Reproducer [1] shows that this
state can occur (which may be related to [2]) with the following process:
PA: lock(page)
PA: ubifs_write_end
PA:   attach_page_private             // set Private
PA:   __set_page_dirty_nobuffers      // set Dirty
PA: unlock(page)
PA: write_cache_pages
PA:   lock(page)
PA:   clear_page_dirty_for_io(page)   // clear Dirty
PA:   ubifs_writepage
PB:     do_truncation
PB:       truncate_setsize
PB:         i_size_write(inode, newsize)  // newsize = 0
PA:     i_size = i_size_read(inode)       // i_size = 0
PA:     end_index = i_size >> PAGE_SHIFT
PA:     if (page->index > end_index)
PA:       goto out                        // jump
PA:   out:
PA:     unlock(page)                      // Private, Not Dirty
PC:   generic_fadvise
PC:     lock(page)
PC:     invalidate_inode_page
PC:       try_to_release_page
PC:         ubifs_releasepage
PC:           ubifs_assert(c, 0)          // bad assertion!
PC:     unlock(page)
PB: truncate_pagecache
Then we may get the following assertion failure:
UBIFS error (ubi0:0 pid 1683): ubifs_assert_failed [ubifs]:
UBIFS assert failed: 0, in fs/ubifs/file.c:1513
UBIFS warning (ubi0:0 pid 1683): ubifs_ro_mode [ubifs]:
switched to read-only mode, error -22
CPU: 2 PID: 1683 Comm: aa Not tainted 5.16.0-rc5-00184-g0bca5994cacc-dirty #308
Call Trace:
dump_stack+0x13/0x1b
ubifs_ro_mode+0x54/0x60 [ubifs]
ubifs_assert_failed+0x4b/0x80 [ubifs]
ubifs_releasepage+0x67/0x1d0 [ubifs]
try_to_release_page+0x57/0xe0
invalidate_inode_page+0xfb/0x130
__invalidate_mapping_pages+0xb9/0x280
invalidate_mapping_pagevec+0x12/0x20
generic_fadvise+0x303/0x3c0
ksys_fadvise64_64+0x4c/0xb0
[1] https://bugzilla.kernel.org/show_bug.cgi?id=215373
[2] https://linux-mtd.infradead.narkive.com/NQoBeT1u/patch-rfc-ubifs-fix-assert-failed-in-ubifs-set-page-dirty
Fixes: 1e51764a3c2ac0 ("UBIFS: add new flash file system")
Signed-off-by: Zhihao Cheng <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
There are two states for ubifs writing pages:
1. Dirty, Private
2. Not Dirty, Not Private
There is a third possibility, which may be related to [1], where a page is
left private but not dirty by the following process:
PA: lock(page)
PA: ubifs_write_end
PA:   attach_page_private             // set Private
PA:   __set_page_dirty_nobuffers      // set Dirty
PA: unlock(page)
PA: write_cache_pages
PA:   lock(page)
PA:   clear_page_dirty_for_io(page)   // clear Dirty
PA:   ubifs_writepage
PA:     write_inode
PA:     // fails, goto out; the following code is not executed:
PA:     //   do_writepage
PA:     //     set_page_writeback     // set Writeback
PA:     //     detach_page_private    // clear Private
PA:     //     end_page_writeback     // clear Writeback
PA:   out:
PA:     unlock(page)                  // Private, Not Dirty
PB: ksys_fadvise64_64
PB:   generic_fadvise
PB:     invalidate_inode_page         // page is neither Dirty nor Writeback
PB:       invalidate_complete_page    // page_has_private is true
PB:         try_to_release_page
PB:           ubifs_releasepage
PB:             ubifs_assert(c, 0)    !!!
Then we may get the following assertion failure:
UBIFS error (ubi0:0 pid 1492): ubifs_assert_failed [ubifs]:
UBIFS assert failed: 0, in fs/ubifs/file.c:1499
UBIFS warning (ubi0:0 pid 1492): ubifs_ro_mode [ubifs]:
switched to read-only mode, error -22
CPU: 2 PID: 1492 Comm: aa Not tainted 5.16.0-rc2-00012-g7bb767dee0ba-dirty
Call Trace:
dump_stack+0x13/0x1b
ubifs_ro_mode+0x54/0x60 [ubifs]
ubifs_assert_failed+0x4b/0x80 [ubifs]
ubifs_releasepage+0x7e/0x1e0 [ubifs]
try_to_release_page+0x57/0xe0
invalidate_inode_page+0xfb/0x130
invalidate_mapping_pagevec+0x12/0x20
generic_fadvise+0x303/0x3c0
vfs_fadvise+0x35/0x40
ksys_fadvise64_64+0x4c/0xb0
See [2] for a reproducer.
[1] https://linux-mtd.infradead.narkive.com/NQoBeT1u/patch-rfc-ubifs-fix-assert-failed-in-ubifs-set-page-dirty
[2] https://bugzilla.kernel.org/show_bug.cgi?id=215357
Fixes: 1e51764a3c2ac0 ("UBIFS: add new flash file system")
Signed-off-by: Zhihao Cheng <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
The following process will cause a memleak for the copied-up znode:
dirty_cow_znode
zn = copy_znode(c, znode);
err = insert_old_idx(c, zbr->lnum, zbr->offs);
if (unlikely(err))
return ERR_PTR(err); // No one refers to zn.
Fix it by adding the copied znode back to the tnc; it will then be freed
by ubifs_destroy_tnc_subtree() while closing the tnc.
Fetch a reproducer in [Link].
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216705
Fixes: 1e51764a3c2a ("UBIFS: add new flash file system")
Signed-off-by: Zhihao Cheng <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
Dirty znodes will be written on flash in committing process with
following states:
process A | znode state
------------------------------------------------------
do_commit | DIRTY_ZNODE
ubifs_tnc_start_commit | DIRTY_ZNODE
get_znodes_to_commit | DIRTY_ZNODE | COW_ZNODE
layout_commit | DIRTY_ZNODE | COW_ZNODE
fill_gap | 0
write master | 0 or OBSOLETE_ZNODE
process B | znode state
------------------------------------------------------
do_commit | DIRTY_ZNODE[1]
ubifs_tnc_start_commit | DIRTY_ZNODE
get_znodes_to_commit | DIRTY_ZNODE | COW_ZNODE
ubifs_tnc_end_commit | DIRTY_ZNODE | COW_ZNODE
write_index | 0
write master | 0 or OBSOLETE_ZNODE[2] or
| DIRTY_ZNODE[3]
[1] znode is dirtied without concurrent committing process
[2] znode is copied up (re-dirtied by other process) before cleaned
up in committing process
[3] znode is re-dirtied after cleaned up in committing process
Currently, the clean znode count is updated in free_obsolete_znodes(),
which is called only in the normal path. If do_commit fails, the clean
znode count won't be updated, which triggers the ubifs assertion [4] in
ubifs_tnc_close():
ubifs_assert_failed [ubifs]: UBIFS assert failed: freed == n
[4] Commit 380347e9ca7682 ("UBIFS: Add an assertion for clean_zn_cnt").
Fix it by recounting the clean znodes in tnc_destroy_cnext().
Fetch a reproducer in [Link].
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216704
Fixes: 1e51764a3c2a ("UBIFS: add new flash file system")
Signed-off-by: Zhihao Cheng <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
kmemleak reported a sequence of memory leaks, and show them as following:
unreferenced object 0xffff8881575f8400 (size 1024):
comm "mount", pid 19625, jiffies 4297119604 (age 20.383s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff8176cecd>] __kmalloc+0x4d/0x150
[<ffffffffa0406b2b>] ubifs_mount+0x307b/0x7170 [ubifs]
[<ffffffff819fa8fd>] legacy_get_tree+0xed/0x1d0
[<ffffffff81936f2d>] vfs_get_tree+0x7d/0x230
[<ffffffff819b2bd4>] path_mount+0xdd4/0x17b0
[<ffffffff819b37aa>] __x64_sys_mount+0x1fa/0x270
[<ffffffff83c14295>] do_syscall_64+0x35/0x80
[<ffffffff83e0006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
unreferenced object 0xffff8881798a6e00 (size 512):
comm "mount", pid 19677, jiffies 4297121912 (age 37.816s)
hex dump (first 32 bytes):
6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
backtrace:
[<ffffffff8176cecd>] __kmalloc+0x4d/0x150
[<ffffffffa0418342>] ubifs_wbuf_init+0x52/0x480 [ubifs]
[<ffffffffa0406ca5>] ubifs_mount+0x31f5/0x7170 [ubifs]
[<ffffffff819fa8fd>] legacy_get_tree+0xed/0x1d0
[<ffffffff81936f2d>] vfs_get_tree+0x7d/0x230
[<ffffffff819b2bd4>] path_mount+0xdd4/0x17b0
[<ffffffff819b37aa>] __x64_sys_mount+0x1fa/0x270
[<ffffffff83c14295>] do_syscall_64+0x35/0x80
[<ffffffff83e0006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
The problem is that when ubifs_wbuf_init() returns an error in the loop
in alloc_wbufs(), the wbuf->buf and wbuf->inodes that were successfully
allocated before are not freed.
Fix it by adding an error handling path to alloc_wbufs() which frees the
previously allocated memory when ubifs_wbuf_init() returns an error.
Fixes: 1e51764a3c2a ("UBIFS: add new flash file system")
Signed-off-by: Li Zetao <[email protected]>
Reviewed-by: Zhihao Cheng <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
UBIFS calculates available space by c->main_bytes - c->lst.total_used
(which means non-index lebs' free and dirty space is accounted into
total available), then index lebs and four lebs (one for gc_lnum, one
for deletions, two for journal heads) are deducted.
In the following situation, ubifs may get -ENOSPC from make_reservation():
LEB 84: DATAHD free 122880 used 1920 dirty 2176 dark 6144
LEB 110:DELETION free 126976 used 0 dirty 0 dark 6144 (empty)
LEB 201:gc_lnum free 126976 used 0 dirty 0 dark 6144
LEB 272:GCHD free 77824 used 47672 dirty 1480 dark 6144
LEB 356:BASEHD free 0 used 39776 dirty 87200 dark 6144
OTHERS: index lebs, zero-available non-index lebs
After doing the budget, UBIFS calculates the available bytes as 6888
(126976 * 5 [remaining main bytes] - 1920 [used] - 47672 [used] -
39776 [used] - 126976 * 1 [deletions] - 126976 * 1 [gc_lnum] -
126976 * 2 [journal heads] - 6144 * 5 [dark] = 6888). However, UBIFS
cannot use BASEHD's dirty space (87200), because UBIFS cannot find a
next BASEHD to reclaim the current BASEHD. (c->bi.min_idx_lebs equals
c->lst.idx_lebs, the empty leb won't be found by ubifs_find_free_space(),
and dirty index lebs won't be picked as gced lebs. All non-index lebs
have dirty space less than c->dead_wm, so non-index lebs won't be picked
as gced lebs either. So no new free lebs will be produced.) See more
details in Link.
To fix it, reserve one leb for each journal head while doing budget.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216562
Fixes: 1e51764a3c2ac0 ("UBIFS: add new flash file system")
Signed-off-by: Zhihao Cheng <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
If the target inode is a special file (e.g. a block/char device) with an
nlink count greater than 1, the inode with ui->data will be re-written on
disk. However, UBIFS loses the target inode's data_len while doing the
space budget. A bad space budget may let make_reservation() return
-ENOSPC, which could turn ubifs into read-only mode in the do_writepage()
process.
Fetch a reproducer in [Link].
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216494
Fixes: 1e51764a3c2ac0 ("UBIFS: add new flash file system")
Signed-off-by: Zhihao Cheng <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
Each dirty inode should reserve 'c->bi.inode_budget' bytes in space
budget calculation. Currently, space budget for dirty inode reports
more space than what UBIFS actually needs to write.
Fixes: 1e51764a3c2ac0 ("UBIFS: add new flash file system")
Signed-off-by: Zhihao Cheng <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
Just like other operations (eg. ubifs_create, do_rename), add comments
and debug information for ubifs_xrename().
Signed-off-by: Zhihao Cheng <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
There is no space budget for ubifs_xrename(). It may let
make_reservation() return -ENOSPC, which could turn
ubifs into read-only mode in the do_writepage() process.
Fix it by adding space budget for ubifs_xrename().
Fetch a reproducer in [Link].
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216569
Fixes: 9ec64962afb170 ("ubifs: Implement RENAME_EXCHANGE")
Signed-off-by: Zhihao Cheng <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
Fix the bad space budget when a symlink file is encrypted. A bad space
budget may let make_reservation() return -ENOSPC, which could turn ubifs
into read-only mode in the do_writepage() process.
Fetch a reproducer in [Link].
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216490
Fixes: ca7f85be8d6cf9 ("ubifs: Add support for encrypted symlinks")
Signed-off-by: Zhihao Cheng <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
When inserting the ubifs.ko module, kmemleak reported the leak below:
unreferenced object 0xffff88817fb1a780 (size 8):
comm "insmod", pid 25265, jiffies 4295239702 (age 100.130s)
hex dump (first 8 bytes):
75 62 69 66 73 00 ff ff ubifs...
backtrace:
[<ffffffff81b3fc4c>] slab_post_alloc_hook+0x9c/0x3c0
[<ffffffff81b44bf3>] __kmalloc_track_caller+0x183/0x410
[<ffffffff8198d3da>] kstrdup+0x3a/0x80
[<ffffffff8198d486>] kstrdup_const+0x66/0x80
[<ffffffff83989325>] kvasprintf_const+0x155/0x190
[<ffffffff83bf55bb>] kobject_set_name_vargs+0x5b/0x150
[<ffffffff83bf576b>] kobject_set_name+0xbb/0xf0
[<ffffffff8100204c>] do_one_initcall+0x14c/0x5a0
[<ffffffff8157e380>] do_init_module+0x1f0/0x660
[<ffffffff815857be>] load_module+0x6d7e/0x7590
[<ffffffff8158644f>] __do_sys_finit_module+0x19f/0x230
[<ffffffff815866b3>] __x64_sys_finit_module+0x73/0xb0
[<ffffffff88c98e85>] do_syscall_64+0x35/0x80
[<ffffffff88e00087>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
When kset_register() fails, we should call kset_put() to clean it up.
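A minimal sketch of that error path (illustrative names; the actual ubifs
sysfs init code may differ in detail):

#include <linux/fs.h>
#include <linux/init.h>
#include <linux/kobject.h>

static struct kset example_kset;	/* hypothetical kset */

static int __init example_sysfs_init(void)
{
	int err;

	err = kobject_set_name(&example_kset.kobj, "example");
	if (err)
		return err;
	example_kset.kobj.parent = fs_kobj;

	err = kset_register(&example_kset);
	if (err)
		/* kset_register() already took references and allocated the
		 * name; drop them with kset_put() instead of leaking. */
		kset_put(&example_kset);
	return err;
}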
Fixes: 2e3cbf425804 ("ubifs: Export filesystem error counters")
Signed-off-by: Liu Shixin <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
With CONFIG_UBIFS_FS_AUTHENTICATION not set, the compiler can assume that
ubifs_node_check_hash() is never true and drops the call to ubifs_bad_hash().
With CONFIG_CC_OPTIMIZE_FOR_SIZE enabled, this optimization does not happen
anymore. So when CONFIG_UBIFS_FS and CONFIG_CC_OPTIMIZE_FOR_SIZE are enabled
but CONFIG_UBIFS_FS_AUTHENTICATION is not set, the build fails as follows:
ERROR: modpost: "ubifs_bad_hash" [fs/ubifs/ubifs.ko] undefined!
Fix it by adding a no-op ubifs_bad_hash() for the
CONFIG_UBIFS_FS_AUTHENTICATION=n case.
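The usual shape of such a fix is a static inline no-op stub under the
disabled configuration (a sketch with hypothetical names and a hypothetical
signature; the real declaration lives in fs/ubifs/ubifs.h):

#include <linux/types.h>

#ifdef CONFIG_EXAMPLE_AUTHENTICATION
void example_bad_hash(const void *node, const u8 *hash, int lnum, int offs);
#else
static inline void example_bad_hash(const void *node, const u8 *hash,
				    int lnum, int offs)
{
	/* Never reached with authentication disabled, but a definition must
	 * exist so that non-optimized builds still link. */
}
#endif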
Fixes: 16a26b20d2af ("ubifs: authentication: Add hashes to index nodes")
Signed-off-by: Li Hua <[email protected]>
Reviewed-by: Sascha Hauer <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
Bug description and fix:
1. Write data to a file, say all 1s from offset 0 to 16.
2. Truncate the file to a smaller size, say 8 bytes.
3. Write new bytes (say 2s) from an offset past the original size of the
file, say at offset 20, for 4 bytes. This is supposed to create a "hole"
in the file, meaning that the bytes from offset 8 (where it was truncated
above) up to the new write at offset 20, should all be 0s (zeros).
4. Flush all caches using "echo 3 > /proc/sys/vm/drop_caches" (or unmount
and remount) the f/s.
5. Check the content of the file. It is wrong. The 1s that used to be
between bytes 9 and 16, before the truncation, have REAPPEARED (they should
be 0s).
We wrote a script and helper C program to reproduce the bug
(reproduce_jffs2_write_begin_issue.sh, write_file.c, and Makefile). We can
make them available to anyone.
The above example is shown when writing a small file within the same first
page. But the bug happens for larger files, as long as steps 1, 2, and 3
above all happen within the same page.
The problem was traced to the jffs2_write_begin code, where it goes into an
'if' statement intended to handle writes past the current EOF (i.e., writes
that may create a hole). The code computes a 'pageofs' that is the floor
of the write position (pos), aligned to the page size boundary. In other
words, 'pageofs' will never be larger than 'pos'. The code then sets the
internal jffs2_raw_inode->isize to the size of max(current inode size,
pageofs) but that is wrong: the new file size should be the 'pos', which is
larger than both the current inode size and pageofs.
Similarly, the code incorrectly sets the internal jffs2_raw_inode->dsize to
pageofs minus the current inode size; instead it should be the current pos
minus the current inode size. Finally,
inode->i_size was also set incorrectly.
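A minimal userspace sketch of the corrected size computation, using the
numbers from the example above (illustrative only, not the jffs2 code):

#include <stdio.h>

#define PAGE_SIZE 4096UL

int main(void)
{
	unsigned long i_size = 8;	/* file size after the truncation */
	unsigned long pos = 20;		/* start of the new write */
	unsigned long pageofs = pos & ~(PAGE_SIZE - 1);

	/* pageofs is the page-aligned floor of pos, so it is never larger
	 * than pos; here it is 0 and cannot describe the 8..20 hole. */
	unsigned long isize = pos > i_size ? pos : i_size;	/* 20 */
	unsigned long dsize = pos - i_size;			/* hole length: 12 */

	printf("pageofs=%lu isize=%lu dsize=%lu\n", pageofs, isize, dsize);
	return 0;
}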
The patch below fixes this bug. The bug was discovered using a new tool
for finding f/s bugs using model checking, called MCFS (Model Checking File
Systems).
Signed-off-by: Yifei Liu <[email protected]>
Signed-off-by: Erez Zadok <[email protected]>
Signed-off-by: Manish Adkar <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
When we receive a corrupted snap trace we don't know what exactly has
happened on the MDS side, and we shouldn't continue IO and metadata
access to the MDS, which may corrupt data or return incorrect contents.
This patch will just block all further IO/MDS requests
immediately and then evict the kclient itself.
The reason why we still need to evict the kclient right after
blocking all further IOs is that the MDS can then revoke the caps
faster.
Link: https://tracker.ceph.com/issues/57686
Signed-off-by: Xiubo Li <[email protected]>
Reviewed-by: Venky Shankar <[email protected]>
Signed-off-by: Ilya Dryomov <[email protected]>
|
|
These flags are only used by the ceph filesystem in fs/ceph, so just
move them to the place they should be.
Signed-off-by: Xiubo Li <[email protected]>
Reviewed-by: Venky Shankar <[email protected]>
Signed-off-by: Ilya Dryomov <[email protected]>
|
|
The use of kmap() is being deprecated in favor of kmap_local_page().
There are two main problems with kmap(): (1) It comes with an overhead as
the mapping space is restricted and protected by a global lock for
synchronization and (2) it also requires global TLB invalidation when the
kmap’s pool wraps and it might block when the mapping space is fully
utilized until a slot becomes available.
With kmap_local_page() the mappings are per thread, CPU local, can take
page faults, and can be called from any context (including interrupts).
It is faster than kmap() in kernels with HIGHMEM enabled. Furthermore,
the tasks can be preempted and, when they are scheduled to run again, the
kernel virtual addresses are restored and still valid.
Therefore, replace kmap() with kmap_local_page() in hostfs_kern.c, it
being the only file with kmap() call sites currently left in fs/hostfs.
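A minimal sketch of the conversion pattern (illustrative helper; the actual
hostfs_kern.c call sites differ):

#include <linux/highmem.h>
#include <linux/string.h>

/* Hypothetical helper: copy @len bytes out of @page starting at @offset. */
static void example_copy_from_page(struct page *page, void *dst,
				   size_t offset, size_t len)
{
	char *vaddr = kmap_local_page(page);	/* per-thread, CPU-local */

	memcpy(dst, vaddr + offset, len);
	kunmap_local(vaddr);			/* unmap what was mapped last */
}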
Cc: "Venkataramanan, Anirudh" <[email protected]>
Suggested-by: Ira Weiny <[email protected]>
Signed-off-by: Fabio M. De Francesco <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
Merge branch 'mm-hotfixes-stable' into mm-stable
|
|
A syzbot [1] corrupted filesystem exposes two flaws in the handling and
sanity checking of the xattr_ids count in the filesystem. Both of these
flaws cause computation overflow due to incorrect typing.
In the corrupted filesystem the xattr_ids value is 4294967071, which
stored in a signed variable becomes the negative number -225.
Flaw 1 (64-bit systems only):
The signed integer xattr_ids variable causes sign extension.
This causes variable overflow in the SQUASHFS_XATTR_*(A) macros. The
variable is first multiplied by sizeof(struct squashfs_xattr_id) where the
type of the sizeof operator is "unsigned long".
On a 64-bit system this is 64-bits in size, and causes the negative number
to be sign extended and widened to 64-bits and then become unsigned. This
produces the very large number 18446744073709548016 or 2^64 - 3600. This
number when rounded up by SQUASHFS_METADATA_SIZE - 1 (8191 bytes) and
divided by SQUASHFS_METADATA_SIZE overflows and produces a length of 0
(stored in len).
Flaw 2 (32-bit systems only):
On a 32-bit system the integer variable is not widened by the unsigned
long type of the sizeof operator (32-bits), and the signedness of the
variable has no effect due it always being treated as unsigned.
The above corrupted xattr_ids value of 4294967071, when multiplied,
overflows and produces the number 4294963696 or 2^32 - 3600. This number
when rounded up by SQUASHFS_METADATA_SIZE - 1 (8191 bytes) and divided by
SQUASHFS_METADATA_SIZE overflows again and produces a length of 0.
The effect of the 0 length computation:
In conjunction with the corrupted xattr_ids field, the filesystem also has
a corrupted xattr_table_start value, where it matches the end of
filesystem value of 850.
This causes the following sanity check code to fail because the
incorrectly computed len of 0 matches the incorrect size of the table
reported by the superblock (0 bytes).
len = SQUASHFS_XATTR_BLOCK_BYTES(*xattr_ids);
indexes = SQUASHFS_XATTR_BLOCKS(*xattr_ids);
/*
* The computed size of the index table (len bytes) should exactly
* match the table start and end points
*/
start = table_start + sizeof(*id_table);
end = msblk->bytes_used;
if (len != (end - start))
return ERR_PTR(-EINVAL);
Changing the xattr_ids variable to be "unsigned int" fixes the flaw on a
64-bit system. This relies on the fact that the computation is widened by
the unsigned long type of the sizeof operator.
Casting the variable to u64 in the above macro fixes this flaw on a 32-bit
system.
It also means 64-bit systems do not implicitly rely on the type of the
sizeof operator to widen the computation.
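The arithmetic above can be reproduced with a small userspace program (a
sketch: the 16-byte entry size and 8192-byte metadata block size are implied
by the numbers in this description, and the "64-bit" case assumes a 64-bit
unsigned long):

#include <stdio.h>
#include <stdint.h>

#define METADATA_SIZE	8192UL
#define ENTRY_SIZE	16UL	/* sizeof(struct squashfs_xattr_id) */

int main(void)
{
	int xattr_ids = (int)4294967071U;	/* stored signed: -225 */

	/* Flaw 1 (64-bit): sign extension to unsigned long, then the
	 * round-up overflows and the division yields 0 blocks. */
	unsigned long bytes64 = xattr_ids * ENTRY_SIZE;
	unsigned long blocks64 = (bytes64 + METADATA_SIZE - 1) / METADATA_SIZE;

	/* Flaw 2 (32-bit): the multiplication wraps modulo 2^32 and the
	 * round-up overflows again, also yielding 0 blocks. */
	uint32_t bytes32 = (uint32_t)xattr_ids * (uint32_t)ENTRY_SIZE;
	uint32_t blocks32 = (bytes32 + 8191U) / 8192U;

	/* Fixed: treat the count as unsigned and widen to 64 bits. */
	uint64_t bytes_ok = (uint64_t)(unsigned int)xattr_ids * ENTRY_SIZE;
	uint64_t blocks_ok = (bytes_ok + METADATA_SIZE - 1) / METADATA_SIZE;

	printf("64-bit flawed: bytes=%lu blocks=%lu\n", bytes64, blocks64);
	printf("32-bit flawed: bytes=%u blocks=%u\n",
	       (unsigned)bytes32, (unsigned)blocks32);
	printf("fixed:         bytes=%llu blocks=%llu\n",
	       (unsigned long long)bytes_ok, (unsigned long long)blocks_ok);
	return 0;
}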
[1] https://lore.kernel.org/lkml/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 506220d2ba21 ("squashfs: add more sanity checks in xattr id lookup")
Signed-off-by: Phillip Lougher <[email protected]>
Reported-by: <[email protected]>
Cc: Alexey Khoroshilov <[email protected]>
Cc: Fedor Pchelkin <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "Fixes for hugetlb mapcount at most 1 for shared PMDs".
This issue of mapcount in hugetlb pages referenced by shared PMDs was
discussed in [1]. The following two patches address user visible behavior
caused by this issue.
[1] https://lore.kernel.org/linux-mm/Y9BF+OCdWnCSilEu@monkey/
This patch (of 2):
A hugetlb page will have a mapcount of 1 if mapped by multiple processes
via a shared PMD. This is because only the first process increases the
map count, and subsequent processes just add the shared PMD page to their
page table.
page_mapcount is being used to decide if a hugetlb page is shared or
private in /proc/PID/smaps. Pages referenced via a shared PMD were
incorrectly being counted as private.
To fix, check for a shared PMD if mapcount is 1. If a shared PMD is found
count the hugetlb page as shared. A new helper to check for a shared PMD
is added.
[[email protected]: simplification, per David]
[[email protected]: hugetlb.h: include page_ref.h for page_count()]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 25ee01a2fca0 ("mm: hugetlb: proc: add hugetlb-related fields to /proc/PID/smaps")
Signed-off-by: Mike Kravetz <[email protected]>
Acked-by: Peter Xu <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: James Houghton <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Fix a spello in freevxfs Kconfig.
(reported by codespell)
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Randy Dunlap <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
While mounting a corrupted filesystem, a signed integer '*xattr_ids' can
become less than zero. This leads to the incorrect computation of 'len'
and 'indexes' values which can cause null-ptr-deref in copy_bio_to_actor()
or out-of-bounds accesses in the next sanity checks inside
squashfs_read_xattr_id_table().
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 506220d2ba21 ("squashfs: add more sanity checks in xattr id lookup")
Reported-by: <[email protected]>
Signed-off-by: Fedor Pchelkin <[email protected]>
Signed-off-by: Alexey Khoroshilov <[email protected]>
Cc: Phillip Lougher <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
In gfs2_make_fs_rw(), make sure to call gfs2_consist() to report an
inconsistency and mark the filesystem as withdrawn when
gfs2_find_jhead() fails.
At the end of gfs2_make_fs_rw(), when we discover that the filesystem
has been withdrawn, make sure we report an error. This also replaces
the gfs2_withdrawn() check after gfs2_find_jhead().
Reported-by: Tetsuo Handa <[email protected]>
Cc: [email protected]
Signed-off-by: Andreas Gruenbacher <[email protected]>
|
|
This reverts commit 970343cd4904 ("GFS2: free disk inode which is
deleted by remote node -V2").
The original intent behind commit 970343cd49 was to cull dentries when a
remote node requests to demote an iopen glock, which happens when the
remote node tries to delete the inode. This is now handled by
gfs2_try_evict(), which is called via iopen_go_callback() ->
gfs2_queue_try_to_evict().
Signed-off-by: Bob Peterson <[email protected]>
Signed-off-by: Andreas Gruenbacher <[email protected]>
|
|
Add a gfs2_evict_inodes() helper that evicts inodes cooperatively across
the cluster. This avoids running into timeouts during unmount
unnecessarily.
Signed-off-by: Andreas Gruenbacher <[email protected]>
|
|
In gfs2_kill_sb(), flush the delete work queue after setting the
SDF_DEACTIVATING flag. This ensures that no new inodes will be
instantiated anymore, and the inode cache will be empty after the
following kill_block_super() -> generic_shutdown_super() ->
evict_inodes() call.
With that, function gfs2_make_fs_ro() now calls gfs2_flush_delete_work()
after the workqueue has been destroyed. Skip that by checking for the
presence of the SDF_DEACTIVATING flag.
Signed-off-by: Andreas Gruenbacher <[email protected]>
|
|
Add a check to delete_work_func() so that it quits when it finds that
the filesystem is deactivating. This speeds up the delete workqueue
draining in gfs2_kill_sb().
In addition, make sure that iopen_go_callback() won't queue any new
delete work while the filesystem is deactivating.
Signed-off-by: Bob Peterson <[email protected]>
Signed-off-by: Andreas Gruenbacher <[email protected]>
|
|
Add a new SDF_DEACTIVATING super block flag that is set when the
filesystem has started to deactivate. This will be used in the next
patch to stop and drain the delete work during unmount.
Signed-off-by: Bob Peterson <[email protected]>
Signed-off-by: Andreas Gruenbacher <[email protected]>
|
|
Function gfs2_clear_rgrpd() is called during unmount to free all rgrps
and their sub-objects. If the rgrp glock is held (e.g. in SH) it calls
gfs2_glock_cb() to unlock, then calls flush_delayed_work() to make
sure any glock work is finished. However, there is a race with other
cluster nodes who may request the rgrp glock in another mode (say, EX).
Func gfs2_clear_rgrpd() calls glock_clear_object() which sets gl_object
to NULL but that's done without holding the gl_lockref spin_lock.
While the lock is not held, another node's demote request can cause the
state machine to run again, and since the gl_lockref is released in
do_xmote, the second process's call to do_xmote can call go_inval
(rgrp_go_inval) after the gl_object has been cleared, which results in a
NULL pointer dereference of the rgrp glock's gl_object.
Other go_inval glops functions don't require the gl_object to exist, as
evidenced by function inode_go_inval() which explicitly checks for if
(ip) before referencing gl_object. This patch does the same thing
for rgrp glocks. Both the go_inval and go_sync ops are patched to check
the existence of gl_object (rgd) before trying to dereference it.
Signed-off-by: Bob Peterson <[email protected]>
Signed-off-by: Andreas Gruenbacher <[email protected]>
|
|
Function delete_work_func() is used for two purposes:
* to immediately try to evict the glock's inode, and
* to verify after a little while that the inode has been deleted as
expected, and didn't just get skipped.
These two operations are not separated very well, so introduce two new
glock flags to improve that. Split gfs2_queue_delete_work() into
gfs2_queue_try_to_evict() and gfs2_queue_verify_evict().
Signed-off-by: Andreas Gruenbacher <[email protected]>
|
|
Move the global delete workqueue into struct gfs2_sbd so that we can
flush / drain it without interfering with other filesystems.
Signed-off-by: Andreas Gruenbacher <[email protected]>
|
|
Get rid of the GLF_PENDING_DELETE glock flag introduced by commit
a0e3cc65fa29 ("gfs2: Turn gl_delete into a delayed work"). The only use
of that flag is to prevent the iopen glock from being demoted (i.e.,
unlocked) while delete work is pending. It turns out that demoting the
iopen glock while delete work is pending is perfectly fine; we only need
to make sure that the glock isn't being freed while still in use. This
is ensured by the previous patch because delete_work_func() owns a
reference while the work is queued or running.
With these changes, gfs2_queue_delete_work() no longer takes the glock
spin lock, so we can use it in iopen_go_callback() instead of
open-coding it there.
Signed-off-by: Andreas Gruenbacher <[email protected]>
|
|
In __gfs2_glock_put(), remove the glock from the lru list *after*
dropping the glock lock. This prevents deadlocks against
gfs2_scan_glock_lru().
In gfs2_scan_glock_lru(), make sure that the glock's reference count is
zero before moving the glock to the dispose list. This skips glocks
that are marked dead as well as glocks that are still in use.
Additionally, switch to spin_trylock() as we already do in
gfs2_dispose_glock_lru(); this alone would also be enough to prevent
deadlocks against __gfs2_glock_put().
Signed-off-by: Andreas Gruenbacher <[email protected]>
|
|
Switch to list_for_each_entry_safe() and eliminate the "skipped" list in
gfs2_scan_glock_lru().
At the same time, scan the requested number of items, not one more than
that number.
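For reference, a minimal sketch of the pattern (generic, not the gfs2 code):
the _safe variant keeps a lookahead pointer so that entries can be moved off
the list while it is being walked.

#include <linux/list.h>

struct item {
	struct list_head list;
	int keep;
};

/* Move every entry with keep == 0 from @src to @dispose in a single pass. */
static void scan_items(struct list_head *src, struct list_head *dispose)
{
	struct item *it, *next;

	list_for_each_entry_safe(it, next, src, list) {
		if (!it->keep)
			list_move(&it->list, dispose);	/* safe: next is saved */
	}
}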
Signed-off-by: Andreas Gruenbacher <[email protected]>
|
|
Improve the comment describing the inode and iopen glock interactions
and the glock poking related to inode evict.
Signed-off-by: Andreas Gruenbacher <[email protected]>
|
|
This function just assigns a summary entry. This can be done in an
entirely typesafe way with an open-coded struct assignment that relies on
array indexing.
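For illustration, the general idea of a typesafe open-coded assignment via
array indexing (a generic userspace sketch, not the f2fs code):

#include <stdio.h>

struct summary {
	unsigned int nid;
	unsigned char version;
	unsigned short ofs_in_node;
};

int main(void)
{
	struct summary entries[4] = { 0 };

	/* A plain struct assignment through an array index: the compiler
	 * checks every field's type, unlike a memcpy()-style helper. */
	entries[2] = (struct summary){ .nid = 7, .version = 1, .ofs_in_node = 3 };

	printf("nid=%u version=%u ofs=%u\n", entries[2].nid,
	       (unsigned)entries[2].version, (unsigned)entries[2].ofs_in_node);
	return 0;
}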
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|