Age | Commit message (Collapse) | Author | Files | Lines |
|
This patch adds mount options to reserve some blocks via resgid=%u,resuid=%u.
It only activates with reserve_root=%u.
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Cgroup writeback requires explicit support from the filesystem.
f2fs's data and node writeback IOs go through __write_data_page,
which sets fio for submiting IOs. So, we add io_wbc for fio,
associate bios with blkcg by invoking wbc_init_bio() and
account IOs issuing by wbc_account_io().
In addtion, f2fs_fill_super() is updated to set SB_I_CGROUPWB.
Meta writeback IOs is left alone by this patch and will always be
attributed to the root cgroup.
The results show that f2fs can throttle writeback nicely for
data writing and file creating.
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Yufen Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
In commit 78997b569f56 ("f2fs: split discard policy"), we have get rid
of using pend_list_tag field in struct discard_cmd_control, but forgot
to remove it, now do it.
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
We take very long time to finish generic/476, this is because we will
check consistence of all discard entries in global rb tree while
traversing all different granularity pending lists, even when the list
is empty, in order to avoid that unneeded overhead, we have to skip
the check when coming up an empty list.
generic/476 time consumption:
cost
Before patch & w/o consistence check 57s
Before patch & w/ consistence check 1426s
After patch 78s
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Fixes the following sparse warnings:
fs/f2fs/segment.c:887:6: warning:
symbol '__check_sit_bitmap' was not declared. Should it be static?
fs/f2fs/segment.c:1327:6: warning:
symbol 'f2fs_wait_discard_bio' was not declared. Should it be static?
fs/f2fs/super.c:1661:5: warning:
symbol 'f2fs_get_projid' was not declared. Should it be static?
Signed-off-by: Wei Yongjun <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
This patch allows root to reserve some blocks via mount option.
"-o reserve_root=N" means N x 4KB-sized blocks for root only.
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
In some case, the node blocks has wrong blkaddr whose segment type is
NODE, e.g., recover inode has missing xattr flag and the blkaddr is in
the xattr range. Since fsck.f2fs does not check the recovery nodes, this
will cause __f2fs_replace_block change the curseg of node and do the
update_sit_entry(sbi, new_blkaddr, 1) with no next_blkoff refresh, as a
result, when recovery process write checkpoint and sync nodes, the
next_blkoff of curseg is used in the segment bit map, then it will
cause f2fs_bug_on. So let's check segment type in __f2fs_replace_block.
Signed-off-by: Yunlong Song <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
After checkpoint,
1. creat a new file A ,(with dirty inode && dirty inode page && xattr info)
2. backgroud wb write back file A inode page (without update from inode cache)
3. fsync file A, write back inode page of file A with inode cache info
4. sudden power off before new checkpoint
In this case, recovery process will try to recover a zero inode
page. Inline xattr flag of file A will be miss and xattr info
will be taken as blkaddr index.
Signed-off-by: Yunlei He <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Let's show precise # of blocks that user/root can use through bavail and bfree
respectively.
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Commit 6afc662e68b5 ("f2fs: support flexible inline xattr size")
declared f2fs_sb_has_flexible_inline_xattr in f2fs.h for latter being
used in get_inline_xattr_addrs, but in latter version, related code
has been changed, leave f2fs_sb_has_flexible_inline_xattr w/o any
users. Let's remove it for cleanup.
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
While doing direct IO, if we run out-of-space when we preallocate blocks,
we should not return ENOSPC error directly, instead, we should continue
to do following direct IO, which will keep directIO of f2fs acting like
other filesystems.
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
We have to enable quota only when remounting from read to write. Otherwise,
we'll get remount failure. (e.g., write to write case)
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
We can give another chance to write user data, which can resolve
generic/441.
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
This fixes generic/449 hang problem caused by no ENOSPC forever which should be
returned by setxattr under disk full scenario.
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
This fixes generic/342 which doesn't recover renamed file which was fsynced
before. It will be done via another fsync on newly created file.
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Let's avoid BUG_ON during fill_super, when on-disk was totall corrupted.
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
-Thread A Thread B
-write_checkpoint
-block_operations
-f2fs_unlock_all -f2fs_sync_file
-f2fs_write_inode
-f2fs_inode_synced
-f2fs_sync_inode_meta
-sync_node_pages
-set_page_drity
In this case, if sudden power off without next new checkpoint,
the last inode page update will lost. wb_writeback is same with
fsync.
Yunlei also reproduced the bug by:
@@ -366,7 +366,7 @@ int update_inode(struct inode *inode, struct page *node_page)
struct extent_tree *et = F2FS_I(inode)->extent_tree;
f2fs_inode_synced(inode);
-
+ msleep(10000);
f2fs_wait_on_page_writeback(node_page, NODE, true);
shell 1: shell2:
dd if=/dev/zero of=./test bs=1M count=10
sync
echo "hello" >> ./test
fsync test // sleep 10s
sync //return quickly
echo c > /proc/sysrq-trigger
Signed-off-by: Yunlei He <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
As Jia-Ju Bai reported:
"According to fs/f2fs/trace.c, the kernel module may sleep under a spinlock.
The function call path is:
f2fs_trace_pid (acquire the spinlock)
f2fs_radix_tree_insert
cond_resched --> may sleep
I do not find a good way to fix it, so I only report.
This possible bug is found by my static analysis tool (DSAC) and my code
review."
Obviously, it's problemetic to schedule in critical region of spinlock,
which will cause uninterruptable sleep if there is no waker.
This patch changes to use mutex lock intead of spinlock to avoid this
condition.
Reported-by: Jia-Ju Bai <[email protected]>
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
No need return value in restore summary process
Signed-off-by: Yunlei He <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Since the variable release is only nonzero when another unlikely
case occurs, use unlikely() on it seems logical.
Signed-off-by: Fan li <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
There is no caller cares about return value of truncate_data_blocks_range,
remove it.
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
f2fs_map_blocks():
if (blkaddr == NEW_ADDR || blkaddr == NULL_ADDR) {
if (create) {
...
} else {
...
if (flag == F2FS_GET_BLOCK_FIEMAP &&
blkaddr == NULL_ADDR) {
...
}
if (flag != F2FS_GET_BLOCK_FIEMAP ||
blkaddr != NEW_ADDR)
goto sync_out;
}
It means we can break the loop in cases of:
a) flag != F2FS_GET_BLOCK_FIEMAP or
b) flag == F2FS_GET_BLOCK_FIEMAP && blkaddr == NULL_ADDR
Condition b) is the same as previous one, so merge operations of them
for readability.
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
f2fs_chksum and f2fs_crc32 use the same 'crc32' crypto engine, also
their implementation are almost the same, except with different
shash description context.
Introduce __f2fs_crc32 to wrap the common codes, and reuse it in
f2fs_chksum and f2fs_crc32.
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
In fill_super, if we fail to call f2fs_build_stats(), it needs to detach
from global f2fs shrink list, otherwise once system starts to shrink slab
cache, we will encounter below panic:
BUG: unable to handle kernel paging request at 00007d35
Oops: 0002 [#1] PREEMPT SMP
EIP: __lock_acquire+0x70/0x12c0
Call Trace:
lock_acquire+0xae/0x220
mutex_trylock+0xc5/0xf0
f2fs_shrink_count+0x32/0xb0 [f2fs]
shrink_slab+0xf1/0x5b0
drop_slab_node+0x35/0x60
drop_slab+0xf/0x20
drop_caches_sysctl_handler+0x79/0xc0
proc_sys_call_handler+0xa4/0xc0
proc_sys_write+0x1f/0x30
__vfs_write+0x24/0x150
SyS_write+0x44/0x90
do_fast_syscall_32+0xa1/0x1ca
entry_SYSENTER_32+0x4c/0x7b
In addition, this patch relocates f2fs_join_shrinker in fill_super to
avoid unneeded error handling of it.
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Use f2fs_k{m,z}alloc as much as possible to increase fault injection
points.
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
This patch supports to inject fault into kvmalloc/kvzalloc.
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
This patch introduces f2fs_kzalloc based on f2fs_kmalloc in order to
support error injection for kzalloc().
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Avoid checking is_inode repeatedly, and make the logic
a little bit clearer.
Signed-off-by: Fan li <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
When blocks are allocated for direct write, select the type of
segment using the kiocb hint. But if an inode has FI_NO_ALLOC,
use the inode hint.
Signed-off-by: Hyunchul Lee <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Eric Biggers <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Eric Biggers <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Eric Biggers <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Eric Biggers <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Eric Biggers <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
atomic_t variables are currently used to implement reference
counters with the following properties:
- counter is initialized to 1 using atomic_set()
- a resource is freed upon counter reaching zero
- once counter reaches zero, its further
increments aren't allowed
- counter schema uses basic atomic operations
(set, inc, inc_not_zero, dec_and_test, etc.)
Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.
The variable posix_acl.a_refcount is used as pure reference counter.
Convert it to refcount_t and fix up the operations.
**Important note for maintainers:
Some functions from refcount_t API defined in lib/refcount.c
have different memory ordering guarantees than their atomic
counterparts.
The full comparison can be seen in
https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
in state to be merged to the documentation tree.
Normally the differences should not matter since refcount_t provides
enough guarantees to satisfy the refcounting use cases, but in
some rare cases it might matter.
Please double check that you don't have some undocumented
memory guarantees for this variable usage.
For the posix_acl.a_refcount it might make a difference
in following places:
- get_cached_acl(): increment in refcount_inc_not_zero() only
guarantees control dependency on success vs. fully ordered
atomic counterpart. However this operation is performed under
rcu_read_lock(), so this should be fine.
- posix_acl_release(): decrement in refcount_dec_and_test() only
provides RELEASE ordering and control dependency on success
vs. fully ordered atomic counterpart
Suggested-by: Kees Cook <[email protected]>
Reviewed-by: David Windsor <[email protected]>
Reviewed-by: Hans Liljestrand <[email protected]>
Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
f2fs: remove repeated f2fs_bug_on which has already existed
in function invalidate_blocks.
Signed-off-by: Zhikang Zhang <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Remove the variable page_idx which no one would miss.
Signed-off-by: Fan li <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
test/generic/208 reports a potential deadlock as below:
Chain exists of:
&mm->mmap_sem --> &fi->i_mmap_sem --> &fi->dio_rwsem[WRITE]
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fi->dio_rwsem[WRITE]);
lock(&fi->i_mmap_sem);
lock(&fi->dio_rwsem[WRITE]);
lock(&mm->mmap_sem);
This patch changes the lock dependency as below in fallocate() to
fix this issue:
- dio_rwsem
- i_mmap_sem
Fixes: bb06664a534b ("f2fs: avoid race in between GC and block exchange")
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Commit d260081ccf37 ("f2fs: change recovery policy of xattr node block")
removes the use of blkaddr, which is no longer used. So remove the
parameter.
Signed-off-by: Sheng Yong <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
If there is not enough space left, f2fs_preallocate_blocks may only
preallocte partial blocks. As a result, the write operation fails
but i_blocks is not 0. To avoid this, f2fs should write data in
non-preallocation way and write as many data as the size of i_blocks.
Signed-off-by: Sheng Yong <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
This patch introduces a sysfs interface readdir_ra to enable/disable
readaheading inode block in f2fs_readdir. When readdir_ra is enabled,
it improves the performance of "readdir + stat".
For 300,000 files:
time find /data/test > /dev/null
disable readdir_ra: 1m25.69s real 0m01.94s user 0m50.80s system
enable readdir_ra: 0m18.55s real 0m00.44s user 0m15.39s system
Signed-off-by: Sheng Yong <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
alloc_nid_failed and scan_nat_page can be called at the same time,
and we haven't protected add_free_nid and update_free_nid_bitmap
with the same nid_list_lock. That could lead to
Thread A Thread B
- __build_free_nids
- scan_nat_page
- add_free_nid
- alloc_nid_failed
- update_free_nid_bitmap
- update_free_nid_bitmap
scan_nat_page will clear the free bitmap since the nid is PREALLOC_NID,
but alloc_nid_failed needs to set the free bitmap. This results in
free nid with free bitmap cleared.
This patch update the bitmap under the same nid_list_lock in add_free_nid.
And use __GFP_NOFAIL to make sure to update status of free nid correctly.
Signed-off-by: Fan li <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
We forgot to remov memory footprint accounting of per-cpu type
variables, fix it.
Fixes: 35782b233f37 ("f2fs: remove percpu_count due to performance regression")
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
No need to read nat block if nat_block_bitmap is set.
Signed-off-by: Yunlei He <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
During mkfs, quota sysfiles have already occupied nid resource,
it needs to adjust remaining available nid count in kernel side.
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Pull MTD fixes from Richard Weinberger:
"This contains the following regression fixes:
- fix bitflip handling in brcmnand and gpmi nand drivers
- revert a bad device tree binding for spi-nor
- fix a copy&paste error in gpio-nand driver
- fix a too strict length check in mtd core"
* tag 'for-linus-20171218' of git://git.infradead.org/linux-mtd:
mtd: Fix mtd_check_oob_ops()
mtd: nand: gpio: Fix ALE gpio configuration
mtd: nand: brcmnand: Zero bitflip is not an error
mtd: nand: gpmi: Fix failure when a erased page has a bitflip at BBM
Revert "dt-bindings: mtd: add sst25wf040b and en25s64 to sip-nor list"
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc fixes from Helge Deller:
"There are two important fixes here:
- Add PCI quirks to disable built-in a serial AUX and a graphics
cards from specific GSP (management board) PCI cards. This fixes
boot via serial console on rp3410 and rp3440 machines.
- Revert the "Re-enable interrups early" patch which was added to
kernel v4.10. It can trigger stack overflows and thus silent data
corruption. With this patch reverted we can lower our thread stack
back to 16kb again.
The other patches are minor cleanups: avoid duplicate includes,
indenting fixes, correctly align variable in asm code"
* 'parisc-4.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Reduce thread stack to 16 kb
Revert "parisc: Re-enable interrupts early"
parisc: remove duplicate includes
parisc: Hide Diva-built-in serial aux and graphics card
parisc: Align os_hpmc_size on word boundary
parisc: Fix indenting in puts()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 syscall entry code changes for PTI from Ingo Molnar:
"The main changes here are Andy Lutomirski's changes to switch the
x86-64 entry code to use the 'per CPU entry trampoline stack'. This,
besides helping fix KASLR leaks (the pending Page Table Isolation
(PTI) work), also robustifies the x86 entry code"
* 'WIP.x86-pti.entry-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
x86/cpufeatures: Make CPU bugs sticky
x86/paravirt: Provide a way to check for hypervisors
x86/paravirt: Dont patch flush_tlb_single
x86/entry/64: Make cpu_entry_area.tss read-only
x86/entry: Clean up the SYSENTER_stack code
x86/entry/64: Remove the SYSENTER stack canary
x86/entry/64: Move the IST stacks into struct cpu_entry_area
x86/entry/64: Create a per-CPU SYSCALL entry trampoline
x86/entry/64: Return to userspace from the trampoline stack
x86/entry/64: Use a per-CPU trampoline stack for IDT entries
x86/espfix/64: Stop assuming that pt_regs is on the entry stack
x86/entry/64: Separate cpu_current_top_of_stack from TSS.sp0
x86/entry: Remap the TSS into the CPU entry area
x86/entry: Move SYSENTER_stack to the beginning of struct tss_struct
x86/dumpstack: Handle stack overflow on all stacks
x86/entry: Fix assumptions that the HW TSS is at the beginning of cpu_tss
x86/kasan/64: Teach KASAN about the cpu_entry_area
x86/mm/fixmap: Generalize the GDT fixmap mechanism, introduce struct cpu_entry_area
x86/entry/gdt: Put per-CPU GDT remaps in ascending order
x86/dumpstack: Add get_stack_info() support for the SYSENTER stack
...
|
|
The mtd_check_oob_ops() helper verifies if the operation defined by the
user is correct.
Fix the check that verifies if the entire requested area exists. This
check is too restrictive and will fail anytime the last data byte of the
very last page is included in an operation.
Fixes: 5cdd929da53d ("mtd: Add sanity checks in mtd_write/read_oob()")
Signed-off-by: Miquel Raynal <[email protected]>
Acked-by: Boris Brezillon <[email protected]>
Signed-off-by: Richard Weinberger <[email protected]>
|
|
|