Age | Commit message (Collapse) | Author | Files | Lines |
|
Even though the length of the critical section when adding / removing
orphaned inodes was significantly reduced by using orphan file, the
contention of lock protecting orphan file still appears high in profiles
for truncate / unlink intensive workloads with high number of threads.
This patch makes handling of orphan file completely lockless. Also to
reduce conflicts between CPUs different CPUs start searching for empty
slot in orphan file in different blocks.
Performance comparison of locked orphan file handling, lockless orphan
file handling, and completely disabled orphan inode handling
from 80 CPU Xeon Server with 526 GB of RAM, filesystem located on
SAS SSD disk, average of 5 runs:
stress-orphan (microbenchmark truncating files byte-by-byte from N
processes in parallel)
Threads Time Time Time
Orphan locked Orphan lockless No orphan
1 0.945600 0.939400 0.891200
2 1.331800 1.246600 1.174400
4 1.995000 1.780600 1.713200
8 6.424200 4.900000 4.106000
16 14.937600 8.516400 8.138000
32 33.038200 24.565600 24.002200
64 60.823600 39.844600 38.440200
128 122.941400 70.950400 69.315000
So we can see that with lockless orphan file handling, addition /
deletion of orphaned inodes got almost completely out of picture even
for a microbenchmark stressing it.
For reaim creat_clo workload on ramdisk there are also noticeable gains
(average of 5 runs):
Clients Vanilla (ops/s) Patched (ops/s)
creat_clo-1 14705.88 ( 0.00%) 14354.07 * -2.39%*
creat_clo-3 27108.43 ( 0.00%) 28301.89 ( 4.40%)
creat_clo-5 37406.48 ( 0.00%) 45180.73 * 20.78%*
creat_clo-7 41338.58 ( 0.00%) 54687.50 * 32.29%*
creat_clo-9 45226.13 ( 0.00%) 62937.07 * 39.16%*
creat_clo-11 44000.00 ( 0.00%) 65088.76 * 47.93%*
creat_clo-13 36516.85 ( 0.00%) 68661.97 * 88.03%*
creat_clo-15 30864.20 ( 0.00%) 69551.78 * 125.35%*
creat_clo-17 27478.45 ( 0.00%) 67729.08 * 146.48%*
creat_clo-19 25000.00 ( 0.00%) 61621.62 * 146.49%*
creat_clo-21 18772.35 ( 0.00%) 63829.79 * 240.02%*
creat_clo-23 16698.94 ( 0.00%) 61938.96 * 270.92%*
creat_clo-25 14973.05 ( 0.00%) 56947.61 * 280.33%*
creat_clo-27 16436.69 ( 0.00%) 65008.03 * 295.51%*
creat_clo-29 13949.01 ( 0.00%) 69047.62 * 395.00%*
creat_clo-31 14283.52 ( 0.00%) 67982.45 * 375.95%*
Reviewed-by: Theodore Ts'o <[email protected]>
Reviewed-by: Lukas Czerner <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
|
|
Ext4 orphan inode handling is a bottleneck for workloads which heavily
truncate / unlink small files since it contends on the global
s_orphan_mutex lock (and generally it's difficult to improve scalability
of the ondisk linked list of orphaned inodes).
This patch implements new way of handling orphan inodes. Instead of
linking orphaned inode into a linked list, we store it's inode number in
a new special file which we call "orphan file". Only if there's no more
space in the orphan file (too many inodes are currently orphaned) we
fall back to using old style linked list. Currently we protect
operations in the orphan file with a spinlock for simplicity but even in
this setting we can substantially reduce the length of the critical
section and thus speedup some workloads. In the next patch we improve
this by making orphan handling lockless.
Note that the change is backwards compatible when the filesystem is
clean - the existence of the orphan file is a compat feature, we set
another ro-compat feature indicating orphan file needs scanning for
orphaned inodes when mounting filesystem read-write. This ro-compat
feature gets cleared on unmount / remount read-only.
Some performance data from 80 CPU Xeon Server with 512 GB of RAM,
filesystem located on SSD, average of 5 runs:
stress-orphan (microbenchmark truncating files byte-by-byte from N
processes in parallel)
Threads Time Time
Vanilla Patched
1 1.057200 0.945600
2 1.680400 1.331800
4 2.547000 1.995000
8 7.049400 6.424200
16 14.827800 14.937600
32 40.948200 33.038200
64 87.787400 60.823600
128 206.504000 122.941400
So we can see significant wins all over the board.
Reviewed-by: Theodore Ts'o <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
|
|
Move functions for handling orphan inodes into a new file
fs/ext4/orphan.c to have them in one place and somewhat reduce size of
other files. No code changes.
Reviewed-by: Andreas Dilger <[email protected]>
Reviewed-by: Theodore Ts'o <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
|
|
Signed-off-by: Theodore Ts'o <[email protected]>
|
|
JBD2 layer support triggers which are called when journaling layer moves
buffer to a certain state. We can use the frozen trigger, which gets
called when buffer data is frozen and about to be written out to the
journal, to compute block checksums for some buffer types (similarly as
does ocfs2). This avoids unnecessary repeated recomputation of the
checksum (at the cost of larger window where memory corruption won't be
caught by checksumming) and is even necessary when there are
unsynchronized updaters of the checksummed data.
So add superblock and journal trigger type arguments to
ext4_journal_get_write_access() and ext4_journal_get_create_access() so
that frozen triggers can be set accordingly. Also add inode argument to
ext4_walk_page_buffers() and all the callbacks used with that function
for the same purpose. This patch is mostly only a change of prototype of
the above mentioned functions and a few small helpers. Real checksumming
will come later.
Reviewed-by: Theodore Ts'o <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
|
|
Add sparse annotations to suppress false positive context imbalance
warnings, and use NULL instead of 0 in EXT_MAX_{EXTENT,INDEX}.
Signed-off-by: Theodore Ts'o <[email protected]>
|
|
The location of the system.data extended attribute can change whenever
xattr_sem is not taken. So we need to recalculate the i_inline_off
field since it mgiht have changed between ext4_write_begin() and
ext4_write_end().
This means that caching i_inline_off is probably not helpful, so in
the long run we should probably get rid of it and shrink the in-memory
ext4 inode slightly, but let's fix the race the simple way for now.
Cc: [email protected]
Fixes: f19d5870cbf72 ("ext4: add normal write support for inline data")
Reported-by: [email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
|
|
If ext4 filesystem is corrupted so that quota files are linked from
directory hirerarchy, bad things can happen. E.g. quota files can get
corrupted or deleted. Make sure we are not grabbing quota file inodes
when we expect normal inodes.
Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Theodore Ts'o <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Commit 81414b4dd48 ("ext4: remove redundant sb checksum
recomputation") removed checksum recalculation after updating
superblock free space / inode counters in ext4_fill_super() based on
the fact that we will recalculate the checksum on superblock
writeout.
That is correct assumption but until the writeout happens (which can
take a long time) the checksum is incorrect in the buffer cache and if
programs such as tune2fs or resize2fs is called shortly after a file
system is mounted can fail. So return back the checksum recalculation
and add a comment explaining why.
Fixes: 81414b4dd48f ("ext4: remove redundant sb checksum recomputation")
Cc: [email protected]
Reported-by: Boyang Xue <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Theodore Ts'o <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
If the underlying storage device is using thin-provisioning, it's
possible for a zeroout operation to return ENOSPC.
Commit df22291ff0fd ("ext4: Retry block allocation if we have free blocks
left") added logic to retry block allocation since we might get free block
after we commit a transaction. But the ENOSPC from thin-provisioning
will confuse ext4, and lead to an infinite loop.
Since using zeroout instead of splitting the extent node is an
optimization, if it fails, we might as well fall back to splitting the
extent node.
Reported-by: yangerkun <[email protected]>
Signed-off-by: Theodore Ts'o <[email protected]>
|
|
Let's pass fc_dentry directly since those arguments (tag, parent_ino and
ino etc) can be deferenced from it.
Signed-off-by: Guoqing Jiang <[email protected]>
Reviewed-by: Harshad Shirwadkar <[email protected]>
Signed-off-by: Theodore Ts'o <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The background discard kwork tries to mark blocks used and issue
discard. This can make filesystem suffer from NOSPC error, xfstest
generic/371 can fail due to it. Fix it by flushing discard kwork
in ext4_should_retry_alloc. At the same time, give up discard at
the moment.
Signed-off-by: Wang Jianchao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
|
|
Right now, discard is issued and waited to be completed in jbd2
commit kthread context after the logs are committed. When large
amount of files are deleted and discard is flooding, jbd2 commit
kthread can be blocked for long time. Then all of the metadata
operations can be blocked to wait the log space.
One case is the page fault path with read mm->mmap_sem held, which
wants to update the file time but has to wait for the log space.
When other threads in the task wants to do mmap, then write mmap_sem
is blocked. Finally all of the following read mmap_sem requirements
are blocked, even the ps command which need to read the /proc/pid/
-cmdline. Our monitor service which needs to read /proc/pid/cmdline
used to be blocked for 5 mins.
This patch frees the blocks back to buddy after commit and then do
discard in a async kworker context in fstrim fashion, namely,
- mark blocks to be discarded as used if they have not been allocated
- do discard
- mark them free
After this, jbd2 commit kthread won't be blocked any more by discard
and we won't get NOSPC even if the discard is slow or throttled.
Link: https://marc.info/?l=linux-kernel&m=162143690731901&w=2
Suggested-by: Theodore Ts'o <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Wang Jianchao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
|
|
Pull io_uring mkdirat/symlinkat/linkat support from Jens Axboe:
"This adds io_uring support for mkdirat, symlinkat, and linkat"
* tag 'for-5.15/io_uring-vfs-2021-08-30' of git://git.kernel.dk/linux-block:
io_uring: add support for IORING_OP_LINKAT
io_uring: add support for IORING_OP_SYMLINKAT
io_uring: add support for IORING_OP_MKDIRAT
namei: update do_*() helpers to return ints
namei: make do_linkat() take struct filename
namei: add getname_uflags()
namei: make do_symlinkat() take struct filename
namei: make do_mknodat() take struct filename
namei: make do_mkdirat() take struct filename
namei: change filename_parentat() calling conventions
namei: ignore ERR/NULL names in putname()
|
|
Pull support for struct bio recycling from Jens Axboe:
"This adds bio recycling support for polled IO, allowing quick reuse of
a bio for high IOPS scenarios via a percpu bio_set list.
It's good for almost a 10% improvement in performance, bumping our
per-core IO limit from ~3.2M IOPS to ~3.5M IOPS"
* tag 'io_uring-bio-cache.5-2021-08-30' of git://git.kernel.dk/linux-block:
bio: improve kerneldoc documentation for bio_alloc_kiocb()
block: provide bio_clear_hipri() helper
block: use the percpu bio cache in __blkdev_direct_IO
io_uring: enable use of bio alloc cache
block: clear BIO_PERCPU_CACHE flag if polling isn't supported
bio: add allocation cache abstraction
fs: add kiocb alloc cache flag
bio: optimize initialization of a bio
|
|
Pull io_uring updates from Jens Axboe:
- cancellation cleanups (Hao, Pavel)
- io-wq accounting cleanup (Hao)
- io_uring submit locking fix (Hao)
- io_uring link handling fixes (Hao)
- fixed file improvements (wangyangbo, Pavel)
- allow updates of linked timeouts like regular timeouts (Pavel)
- IOPOLL fix (Pavel)
- remove batched file get optimization (Pavel)
- improve reference handling (Pavel)
- IRQ task_work batching (Pavel)
- allow pure fixed file, and add support for open/accept (Pavel)
- GFP_ATOMIC RT kernel fix
- multiple CQ ring waiter improvement
- funnel IRQ completions through task_work
- add support for limiting async workers explicitly
- add different clocksource support for timeouts
- io-wq wakeup race fix
- lots of cleanups and improvement (Pavel et al)
* tag 'for-5.15/io_uring-2021-08-30' of git://git.kernel.dk/linux-block: (87 commits)
io-wq: fix wakeup race when adding new work
io-wq: wqe and worker locks no longer need to be IRQ safe
io-wq: check max_worker limits if a worker transitions bound state
io_uring: allow updating linked timeouts
io_uring: keep ltimeouts in a list
io_uring: support CLOCK_BOOTTIME/REALTIME for timeouts
io-wq: provide a way to limit max number of workers
io_uring: add build check for buf_index overflows
io_uring: clarify io_req_task_cancel() locking
io_uring: add task-refs-get helper
io_uring: fix failed linkchain code logic
io_uring: remove redundant req_set_fail()
io_uring: don't free request to slab
io_uring: accept directly into fixed file table
io_uring: hand code io_accept() fd installing
io_uring: openat directly into fixed fd table
net: add accept helper not installing fd
io_uring: fix io_try_cancel_userdata race for iowq
io_uring: IRQ rw completion batching
io_uring: batch task work locking
...
|
|
Pull block updates from Jens Axboe:
"Nothing major in here - lots of good cleanups and tech debt handling,
which is also evident in the diffstats. In particular:
- Add disk sequence numbers (Matteo)
- Discard merge fix (Ming)
- Relax disk zoned reporting restrictions (Niklas)
- Bio error handling zoned leak fix (Pavel)
- Start of proper add_disk() error handling (Luis, Christoph)
- blk crypto fix (Eric)
- Non-standard GPT location support (Dmitry)
- IO priority improvements and cleanups (Damien)o
- blk-throtl improvements (Chunguang)
- diskstats_show() stack reduction (Abd-Alrhman)
- Loop scheduler selection (Bart)
- Switch block layer to use kmap_local_page() (Christoph)
- Remove obsolete disk_name helper (Christoph)
- block_device refcounting improvements (Christoph)
- Ensure gendisk always has a request queue reference (Christoph)
- Misc fixes/cleanups (Shaokun, Oliver, Guoqing)"
* tag 'for-5.15/block-2021-08-30' of git://git.kernel.dk/linux-block: (129 commits)
sg: pass the device name to blk_trace_setup
block, bfq: cleanup the repeated declaration
blk-crypto: fix check for too-large dun_bytes
blk-zoned: allow BLKREPORTZONE without CAP_SYS_ADMIN
blk-zoned: allow zone management send operations without CAP_SYS_ADMIN
block: mark blkdev_fsync static
block: refine the disk_live check in del_gendisk
mmc: sdhci-tegra: Enable MMC_CAP2_ALT_GPT_TEGRA
mmc: block: Support alternative_gpt_sector() operation
partitions/efi: Support non-standard GPT location
block: Add alternative_gpt_sector() operation
bio: fix page leak bio_add_hw_page failure
block: remove CONFIG_DEBUG_BLOCK_EXT_DEVT
block: remove a pointless call to MINOR() in device_add_disk
null_blk: add error handling support for add_disk()
virtio_blk: add error handling support for add_disk()
block: add error handling for device_add_disk / add_disk
block: return errors from disk_alloc_events
block: return errors from blk_integrity_add
block: call blk_register_queue earlier in device_add_disk
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"Updates for timekeeping, timers and related drivers:
Core code:
- Cure a couple of correctness issues in the posix CPU timer code to
prevent that the tick dependency for NOHZ full is kept alive for no
reason.
- Avoid expensive double reprogramming of the clockevent device in
hrtimer_start_range_ns().
- Avoid pointless SMP function calls when the clock was set to avoid
disturbing CPUs which do not have any affected timers queued.
- Make the clocksource watchdog test work correctly when CONFIG_HZ is
less than 100.
Drivers:
- Prefer the ARM architected timer over the Exynos timer which is way
more expensive to access.
- Add device tree bindings for new Ingenic SoCs
- The usual improvements and cleanups all over the place"
* tag 'timers-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
clocksource: Make clocksource watchdog test safe for slow-HZ systems
dt-bindings: timer: Add ABIs for new Ingenic SoCs
clocksource/drivers/fttmr010: Pass around less pointers
clocksource/drivers/mediatek: Optimize systimer irq clear flow on shutdown
clocksource/drivers/ingenic: Use bitfield macro helpers
clocksource/drivers/sh_cmt: Fix wrong setting if don't request IRQ for clock source channel
dt-bindings: timer: convert rockchip,rk-timer.txt to YAML
clocksource/drivers/exynos_mct: Mark MCT device as CLOCK_EVT_FEAT_PERCPU
clocksource/drivers/exynos_mct: Prioritise Arm arch timer on arm64
hrtimer: Unbreak hrtimer_force_reprogram()
hrtimer: Use raw_cpu_ptr() in clock_was_set()
hrtimer: Avoid more SMP function calls in clock_was_set()
hrtimer: Avoid unnecessary SMP function calls in clock_was_set()
hrtimer: Add bases argument to clock_was_set()
time/timekeeping: Avoid invoking clock_was_set() twice
timekeeping: Distangle resume and clock-was-set events
timerfd: Provide timerfd_resume()
hrtimer: Force clock_was_set() handling for the HIGHRES=n, NOHZ=y case
hrtimer: Ensure timerfd notification for HIGHRES=n
hrtimer: Consolidate reprogramming code
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- The biggest change in this cycle is scheduler support for asymmetric
scheduling affinity, to support the execution of legacy 32-bit tasks
on AArch32 systems that also have 64-bit-only CPUs.
Architectures can fill in this functionality by defining their own
task_cpu_possible_mask(p). When this is done, the scheduler will make
sure the task will only be scheduled on CPUs that support it.
(The actual arm64 specific changes are not part of this tree.)
For other architectures there will be no change in functionality.
- Add cgroup SCHED_IDLE support
- Increase node-distance flexibility & delay determining it until a CPU
is brought online. (This enables platforms where node distance isn't
final until the CPU is only.)
- Deadline scheduler enhancements & fixes
- Misc fixes & cleanups.
* tag 'sched-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
eventfd: Make signal recursion protection a task bit
sched/fair: Mark tg_is_idle() an inline in the !CONFIG_FAIR_GROUP_SCHED case
sched: Introduce dl_task_check_affinity() to check proposed affinity
sched: Allow task CPU affinity to be restricted on asymmetric systems
sched: Split the guts of sched_setaffinity() into a helper function
sched: Introduce task_struct::user_cpus_ptr to track requested affinity
sched: Reject CPU affinity changes based on task_cpu_possible_mask()
cpuset: Cleanup cpuset_cpus_allowed_fallback() use in select_fallback_rq()
cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus()
cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1
sched: Introduce task_cpu_possible_mask() to limit fallback rq selection
sched: Cgroup SCHED_IDLE support
sched/topology: Skip updating masks for non-online nodes
sched: Replace deprecated CPU-hotplug functions.
sched: Skip priority checks with SCHED_FLAG_KEEP_PARAMS
sched: Fix UCLAMP_FLAG_IDLE setting
sched/deadline: Fix missing clock update in migrate_task_rq_dl()
sched/fair: Avoid a second scan of target in select_idle_cpu
sched/fair: Use prev instead of new target as recent_used_cpu
sched: Don't report SCHED_FLAG_SUGOV in sched_getattr()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull file locking updates from Jeff Layton:
"This starts with a couple of fixes for potential deadlocks in the
fowner/fasync handling.
The next patch removes the old mandatory locking code from the kernel
altogether.
The last patch cleans up rw_verify_area a bit more after the mandatory
locking removal"
* tag 'locks-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
fs: clean up after mandatory file locking support removal
fs: remove mandatory file locking support
fcntl: fix potential deadlock for &fasync_struct.fa_lock
fcntl: fix potential deadlocks for &fown_struct.lock
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fs hole punching vs cache filling race fixes from Jan Kara:
"Fix races leading to possible data corruption or stale data exposure
in multiple filesystems when hole punching races with operations such
as readahead.
This is the series I was sending for the last merge window but with
your objection fixed - now filemap_fault() has been modified to take
invalidate_lock only when we need to create new page in the page cache
and / or bring it uptodate"
* tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
filesystems/locking: fix Malformed table warning
cifs: Fix race between hole punch and page fault
ceph: Fix race between hole punch and page fault
fuse: Convert to using invalidate_lock
f2fs: Convert to using invalidate_lock
zonefs: Convert to using invalidate_lock
xfs: Convert double locking of MMAPLOCK to use VFS helpers
xfs: Convert to use invalidate_lock
xfs: Refactor xfs_isilocked()
ext2: Convert to using invalidate_lock
ext4: Convert to use mapping->invalidate_lock
mm: Add functions to lock invalidate_lock for two mappings
mm: Protect operations adding pages to page cache with invalidate_lock
documentation: Sync file_operations members with reality
mm: Fix comments mentioning i_mutex
|
|
Instead of messing around with XDR padding in the RDMA layer, we should
just give the RPC layer an aligned buffer. Try to avoid creating extra
RPC calls by aligning to the smaller value of ALIGN(len, rsize) and
PAGE_SIZE.
Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF and isofs updates from Jan Kara:
"Several smaller fixes and cleanups in UDF and isofs"
* tag 'fs_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf_get_extendedattr() had no boundary checks.
isofs: joliet: Fix iocharset=utf8 mount option
udf: Fix iocharset=utf8 mount option
udf: Get rid of 0-length arrays in struct fileIdentDesc
udf: Get rid of 0-length arrays
udf: Remove unused declaration
udf: Check LVID earlier
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull FIEMAP cleanups from Jan Kara:
"FIEMAP cleanups from Christoph transitioning all remaining filesystems
supporting FIEMAP (ext2, hpfs) to iomap API and removing the old
helper"
* tag 'fiemap_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fs: remove generic_block_fiemap
hpfs: use iomap_fiemap to implement ->fiemap
ext2: use iomap_fiemap to implement ->fiemap
ext2: make ext2_iomap_ops available unconditionally
|
|
Let's only enable realtime discard if and only if device supports
discard functionality.
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
We must flush all the dirty data when enabling checkpoint back. Let's guarantee
that first by adding a retry logic on sync_inodes_sb(). In addition to that,
this patch adds to flush data in fsync when checkpoint is disabled, which can
mitigate the sync_inodes_sb() failures in advance.
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
We need to unmap pages from userspace process before removing pagecache
in punch_hole() like we did in f2fs_setattr().
Similar change:
commit 5e44f8c374dc ("ext4: hole-punch use truncate_pagecache_range")
Fixes: fbfa2cc58d53 ("f2fs: add file operations")
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
In below path, it will return ENOENT if filesystem is shutdown:
- f2fs_map_blocks
- f2fs_get_dnode_of_data
- f2fs_get_node_page
- __get_node_page
- read_node_page
- is_sbi_flag_set(sbi, SBI_IS_SHUTDOWN)
return -ENOENT
- force return value from ENOENT to 0
It should be fine for read case, since it indicates a hole condition,
and caller could use .m_next_pgofs to skip the hole and continue the
lookup.
However it may cause confusing for write case, since leaving a hole
there, and said nothing was wrong doesn't help.
There is at least one case from dax_iomap_actor() will complain that,
so fix this in prior to supporting dax in f2fs.
xfstest generic/388 reports below warning:
ubuntu godown: xfstests-induced forced shutdown of /mnt/scratch_f2fs:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 485833 at fs/dax.c:1127 dax_iomap_actor+0x339/0x370
Call Trace:
iomap_apply+0x1c4/0x7b0
? dax_iomap_rw+0x1c0/0x1c0
dax_iomap_rw+0xad/0x1c0
? dax_iomap_rw+0x1c0/0x1c0
f2fs_file_write_iter+0x5ab/0x970 [f2fs]
do_iter_readv_writev+0x273/0x2e0
do_iter_write+0xab/0x1f0
vfs_iter_write+0x21/0x40
iter_file_splice_write+0x287/0x540
do_splice+0x37c/0xa60
__x64_sys_splice+0x15f/0x3a0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
ubuntu godown: xfstests-induced forced shutdown of /mnt/scratch_f2fs:
------------[ cut here ]------------
RIP: 0010:dax_iomap_pte_fault.isra.0+0x72e/0x14a0
Call Trace:
dax_iomap_fault+0x44/0x70
f2fs_dax_huge_fault+0x155/0x400 [f2fs]
f2fs_dax_fault+0x18/0x30 [f2fs]
__do_fault+0x4e/0x120
do_fault+0x3cf/0x7a0
__handle_mm_fault+0xa8c/0xf20
? find_held_lock+0x39/0xd0
handle_mm_fault+0x1b6/0x480
do_user_addr_fault+0x320/0xcd0
? rcu_read_lock_sched_held+0x67/0xc0
exc_page_fault+0x77/0x3f0
? asm_exc_page_fault+0x8/0x30
asm_exc_page_fault+0x1e/0x30
Fixes: 83a3bfdb5a8a ("f2fs: indicate shutdown f2fs to allow unmount successfully")
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
There is a missing place we forgot to account .skipped_gc_rwsem, fix it.
Fixes: 6f8d4455060d ("f2fs: avoid fi->i_gc_rwsem[WRITE] lock in f2fs_gc")
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
This patch adjusts unlock order of .i_mmap_sem and .i_gc_rwsem for
cleanup.
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Don't create discard thread when device doesn't support realtime discard
or user specifies nodiscard mount option.
Signed-off-by: Fengnan Chang <[email protected]>
Signed-off-by: Yangtao Li <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
"fsnotify speedups when notification actually isn't used and support
for identifying processes which caused fanotify events through pidfd
instead of normal pid"
* tag 'fsnotify_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fsnotify: optimize the case of no marks of any type
fsnotify: count all objects with attached connectors
fsnotify: count s_fsnotify_inode_refs for attached connectors
fsnotify: replace igrab() with ihold() on attach connector
fanotify: add pidfd support to the fanotify API
fanotify: introduce a generic info record copying helper
fanotify: minor cosmetic adjustments to fid labels
kernel/pid.c: implement additional checks upon pidfd_create() parameters
kernel/pid.c: remove static qualifier from pidfd_create()
|
|
Capitalize comments and end with period for better reading.
Also function comments are now little more kernel-doc style. This way we
can easily convert them to kernel-doc style if we want. Note that these
are not yet complete with this style. Example function comments start
with /* and in kernel-doc style they start /**.
Use imperative mood in function descriptions.
Change words like ntfs -> NTFS, linux -> Linux.
Use "we" not "I" when commenting code.
Signed-off-by: Kari Argillander <[email protected]>
Signed-off-by: Konstantin Komarov <[email protected]>
|
|
When new work is added, io_wqe_enqueue() checks if we need to wake or
create a new worker. But that check is done outside the lock that
otherwise synchronizes us with a worker going to sleep, so we can end
up in the following situation:
CPU0 CPU1
lock
insert work
unlock
atomic_read(nr_running) != 0
lock
atomic_dec(nr_running)
no wakeup needed
Hold the wqe lock around the "need to wakeup" check. Then we can also get
rid of the temporary work_flags variable, as we know the work will remain
valid as long as we hold the lock.
Cc: [email protected]
Reported-by: Andres Freund <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
io_uring no longer queues async work off completion handlers that run in
hard or soft interrupt context, and that use case was the only reason that
io-wq had to use IRQ safe locks for wqe and worker locks.
Signed-off-by: Jens Axboe <[email protected]>
|
|
For the two places where new workers are created, we diligently check if
we are allowed to create a new worker. If we're currently at the limit
of how many workers of a given type we can have, then we don't create
any new ones.
If you have a mixed workload with various types of bound and unbounded
work, then it can happen that a worker finishes one type of work and
is then transitioned to the other type. For this case, we don't check
if we are actually allowed to do so. This can cause io-wq to temporarily
exceed the allowed number of workers for a given type.
When retrieving work, check that the types match. If they don't, check
if we are allowed to transition to the other type. If not, then don't
handle the new work.
Cc: [email protected]
Reported-by: Johannes Lundberg <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
We allow updating normal timeouts, add support for adjusting timings of
linked timeouts as well.
Reported-by: Victor Stewart <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
A preparation patch. Keep all queued linked timeout in a list, so they
may be found and updated.
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Certain use cases want to use CLOCK_BOOTTIME or CLOCK_REALTIME rather than
CLOCK_MONOTONIC, instead of the default CLOCK_MONOTONIC.
Add an IORING_TIMEOUT_BOOTTIME and IORING_TIMEOUT_REALTIME flag that
allows timeouts and linked timeouts to use the selected clock source.
Only one clock source may be selected, and we -EINVAL the request if more
than one is given. If neither BOOTIME nor REALTIME are selected, the
previous default of MONOTONIC is used.
Link: https://github.com/axboe/liburing/issues/369
Signed-off-by: Jens Axboe <[email protected]>
|
|
io-wq divides work into two categories:
1) Work that completes in a bounded time, like reading from a regular file
or a block device. This type of work is limited based on the size of
the SQ ring.
2) Work that may never complete, we call this unbounded work. The amount
of workers here is just limited by RLIMIT_NPROC.
For various uses cases, it's handy to have the kernel limit the maximum
amount of pending workers for both categories. Provide a way to do with
with a new IORING_REGISTER_IOWQ_MAX_WORKERS operation.
IORING_REGISTER_IOWQ_MAX_WORKERS takes an array of two integers and sets
the max worker count to what is being passed in for each category. The
old values are returned into that same array. If 0 is being passed in for
either category, it simply returns the current value.
The value is capped at RLIMIT_NPROC. This actually isn't that important
as it's more of a hint, if we're exceeding the value then our attempt
to fork a new worker will fail. This happens naturally already if more
than one node is in the system, as these values are per-node internally
for io-wq.
Reported-by: Johannes Lundberg <[email protected]>
Link: https://github.com/axboe/liburing/issues/420
Signed-off-by: Jens Axboe <[email protected]>
|
|
The recursion protection for eventfd_signal() is based on a per CPU
variable and relies on the !RT semantics of spin_lock_irqsave() for
protecting this per CPU variable. On RT kernels spin_lock_irqsave() neither
disables preemption nor interrupts which allows the spin lock held section
to be preempted. If the preempting task invokes eventfd_signal() as well,
then the recursion warning triggers.
Paolo suggested to protect the per CPU variable with a local lock, but
that's heavyweight and actually not necessary. The goal of this protection
is to prevent the task stack from overflowing, which can be achieved with a
per task recursion protection as well.
Replace the per CPU variable with a per task bit similar to other recursion
protection bits like task_struct::in_page_owner. This works on both !RT and
RT kernels and removes as a side effect the extra per CPU storage.
No functional change for !RT kernels.
Reported-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Daniel Bristot de Oliveira <[email protected]>
Acked-by: Jason Wang <[email protected]>
Cc: Al Viro <[email protected]>
Link: https://lore.kernel.org/r/87wnp9idso.ffs@tglx
|
|
After trunking is discovered in nfs4_discover_server_trunking(),
add the transport to the old client structure if the allowed limit
of transports has not been reached.
An example: there exists a multi-homed server and client mounts
one server address and some volume and then doest another mount to
a different address of the same server and perhaps a different
volume. Previously, the client checks that this is a session
trunkable servers (same server), and removes the newly created
client structure along with its transport. Now, the client
adds the connection from the 2nd mount into the xprt switch of
the existing client (it leads to having 2 available connections).
Signed-off-by: Olga Kornievskaia <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
If we are adding new transports via rpc_clnt_test_and_add_xprt()
then check if we've reached the limit. Currently only pnfs path
adds transports via that function but this is done in
preparation when the client would add new transports when
session trunking is detected. A warning is logged if the
limit is reached.
Signed-off-by: Olga Kornievskaia <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
This option will control up to how many xprts can the client
establish to the server with a distinct address (that means
nconnect connections are not counted towards this new limit).
This patch is setting up nfs structures to keeep track of the
max_connect limit (does not enforce it).
The default value is kept at 1 so that no current mounts that
don't want any additional connections would be effected. The
maximum value is set at 16.
Mounts to DS are not limited to default value of 1 but instead
set to the maximum default value of 16 (NFS_MAX_TRANSPORTS).
Signed-off-by: Olga Kornievskaia <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
As eb96d5c97b08 ("SUNRPC handle EKEYEXPIRED in call_refreshresult")
commit handle EKEYEXPIRED in call_refreshresult, so there is only handle
when "task->tk_status" is equal "-EJUKEBOX" in nfs3_async_handle_jukebox.
Signed-off-by: Ye Bin <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Dan reported __write_overflow warning in ndr_read_string.
CC [M] fs/ksmbd/ndr.o
In file included from ./include/linux/string.h:253,
from ./include/linux/bitmap.h:11,
from ./include/linux/cpumask.h:12,
from ./arch/x86/include/asm/cpumask.h:5,
from ./arch/x86/include/asm/msr.h:11,
from ./arch/x86/include/asm/processor.h:22,
from ./arch/x86/include/asm/cpufeature.h:5,
from ./arch/x86/include/asm/thread_info.h:53,
from ./include/linux/thread_info.h:60,
from ./arch/x86/include/asm/preempt.h:7,
from ./include/linux/preempt.h:78,
from ./include/linux/spinlock.h:55,
from ./include/linux/wait.h:9,
from ./include/linux/wait_bit.h:8,
from ./include/linux/fs.h:6,
from fs/ksmbd/ndr.c:7:
In function memcpy,
inlined from ndr_read_string at fs/ksmbd/ndr.c:86:2,
inlined from ndr_decode_dos_attr at fs/ksmbd/ndr.c:167:2:
./include/linux/fortify-string.h:219:4: error: call to __write_overflow
declared with attribute error: detected write beyond size of object
__write_overflow();
^~~~~~~~~~~~~~~~~~
This seems to be a false alarm because hex_attr size is always smaller
than n->length. This patch fix this warning by allocation hex_attr with
n->length.
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Namjae Jeon <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
req->buf_index is u16 and so we rely on registered buffers indexes
fitting into it. Add a build check, so when the upper limit for the
number of buffers is lifted we get a compliation fail but not lurking
problems.
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/787e8e1a17cea51ca6301426b1c4c4887b8bd676.1629920396.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
It's too easy to forget and misjudge about synchronisation in
io_req_task_cancel(), add a comment clarifying it.
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/71099083835f983a1fd73d5a3da6391924da8300.1629920396.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
There are three bugs in this code:
1) If indx_get_root() fails, then return -EINVAL instead of success.
2) On the "/* make root external */" -EOPNOTSUPP; error path it should
free "re" but it has a memory leak.
3) If indx_new() fails then it will lead to an error pointer dereference
when we call put_indx_node().
I've re-written the error handling to be more clear.
Fixes: 82cae269cfa9 ("fs/ntfs3: Add initialization of super block")
Signed-off-by: Dan Carpenter <[email protected]>
Reviewed-by: Kari Argillander <[email protected]>
Signed-off-by: Konstantin Komarov <[email protected]>
|
|
The "e" pointer is dereferenced before it has been checked for NULL.
Move the dereference after the NULL check to prevent an Oops.
Fixes: 82cae269cfa9 ("fs/ntfs3: Add initialization of super block")
Signed-off-by: Dan Carpenter <[email protected]>
Reviewed-by: Kari Argillander <[email protected]>
Signed-off-by: Konstantin Komarov <[email protected]>
|