Age | Commit message (Collapse) | Author | Files | Lines |
|
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild update for C11 language base from Masahiro Yamada:
"Kbuild -std=gnu11 updates for v5.18
Linus pointed out the benefits of C99 some years ago, especially
variable declarations in loops [1]. At that time, we were not ready
for the migration due to old compilers.
Recently, Jakob Koschel reported a bug in list_for_each_entry(), which
leaks the invalid pointer out of the loop [2]. In the discussion, we
agreed that the time had come. Now that GCC 5.1 is the minimum
compiler version, there is nothing to prevent us from going to
-std=gnu99, or even straight to -std=gnu11.
Discussions for a better list iterator implementation are ongoing, but
this patch set must land first"
[1] https://lore.kernel.org/all/CAHk-=wgr12JkKmRd21qh-se-_Gs69kbPgR9x4C+Es-yJV2GLkA@mail.gmail.com/
[2] https://lore.kernel.org/lkml/[email protected]/
* tag 'kbuild-gnu11-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
Kbuild: use -std=gnu11 for KBUILD_USERCFLAGS
Kbuild: move to -std=gnu11
Kbuild: use -Wdeclaration-after-statement
Kbuild: add -Wno-shift-negative-value where -Wextra is used
|
|
We have duplicated definitions for various SMB3 PDUs in
fs/ksmbd and fs/cifs. Some had already been moved to
fs/smbfs_common/smb2pdu.h
Move definitions for
- error response
- query info and various related protocol flags
- various lease handling flags and the create lease context
to smbfs_common/smb2pdu.h to reduce code duplication
Reviewed-by: Namjae Jeon <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
When part of the user buffer passed to generic_perform_write() or
iomap_file_buffered_write() cannot be faulted in for reading, the entire
write currently fails. The correct behavior would be to write all the
data that can be written, up to the point of failure.
Commit a6294593e8a1 ("iov_iter: Turn iov_iter_fault_in_readable into
fault_in_iov_iter_readable") gave us the information needed, so fix the
page prefaulting in generic_perform_write() and iomap_write_iter() to
only bail out when no pages could be faulted in.
We already factor in that pages that are faulted in may no longer be
resident by the time they are accessed. Paging out pages has the same
effect as not faulting in those pages in the first place, so the code
can already deal with that.
Signed-off-by: Andreas Gruenbacher <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
|
|
io_put_kbuf_comp() should only be called while holding
->completion_lock, however there is no such assumption in io_clean_op()
and thus it can corrupt ->io_buffer_comp. Take the lock there, and
workaround the only user of io_clean_op() calling it with locks. Not
the prettiest solution, but it's easier to refactor it for-next.
Fixes: cc3cec8367cba ("io_uring: speedup provided buffer handling")
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/743e2130b73ec6d48c4c5dd15db896c433431e6d.1648212967.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
io_req_complete_failed() doesn't require callers to hold ->uring_lock,
use IO_URING_F_UNLOCKED version of io_put_kbuf(). The only affected
place is the fail path of io_apoll_task_func(). Also add a lockdep
annotation to catch such bugs in the future.
Fixes: 3b2b78a8eb7cc ("io_uring: extend provided buf return to fails")
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/ccf602dbf8df3b6a8552a262d8ee0a13a086fbc7.1648212967.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
Move a misplaced comment about req->creds and add a line with
assumptions about req->link.
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/1e51d1e6b1f3708c2d4127b4e371f9daa4c5f859.1648209006.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
When polling sockets for accept, use EPOLLEXCLUSIVE. This is helpful
when multiple accept SQEs are submitted.
For O_NONBLOCK sockets multiple queued SQEs would previously have all
completed at once, but most with -EAGAIN as the result. Now only one
wakes up and completes.
For sockets without O_NONBLOCK there is no user facing change, but
internally the extra requests would previously be queued onto a worker
thread as they would wake up with no connection waiting, and be
punted. Now they do not wake up unnecessarily.
Co-developed-by: Jens Axboe <[email protected]>
Signed-off-by: Dylan Yudaken <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Pull ceph updates from Ilya Dryomov:
"The highlights are:
- several changes to how snap context and snap realms are tracked
(Xiubo Li). In particular, this should resolve a long-standing
issue of high kworker CPU usage and various stalls caused by
needless iteration over all inodes in the snap realm.
- async create fixes to address hangs in some edge cases (Jeff
Layton)
- support for getvxattr MDS op for querying server-side xattrs, such
as file/directory layouts and ephemeral pins (Milind Changire)
- average latency is now maintained for all metrics (Venky Shankar)
- some tweaks around handling inline data to make it fit better with
netfs helper library (David Howells)
Also a couple of memory leaks got plugged along with a few assorted
fixups. Last but not least, Xiubo has stepped up to serve as a CephFS
co-maintainer"
* tag 'ceph-for-5.18-rc1' of https://github.com/ceph/ceph-client: (27 commits)
ceph: fix memory leak in ceph_readdir when note_last_dentry returns error
ceph: uninitialized variable in debug output
ceph: use tracked average r/w/m latencies to display metrics in debugfs
ceph: include average/stdev r/w/m latency in mds metrics
ceph: track average r/w/m latency
ceph: use ktime_to_timespec64() rather than jiffies_to_timespec64()
ceph: assign the ci only when the inode isn't NULL
ceph: fix inode reference leakage in ceph_get_snapdir()
ceph: misc fix for code style and logs
ceph: allocate capsnap memory outside of ceph_queue_cap_snap()
ceph: do not release the global snaprealm until unmounting
ceph: remove incorrect and unused CEPH_INO_DOTDOT macro
MAINTAINERS: add Xiubo Li as cephfs co-maintainer
ceph: eliminate the recursion when rebuilding the snap context
ceph: do not update snapshot context when there is no new snapshot
ceph: zero the dir_entries memory when allocating it
ceph: move to a dedicated slabcache for ceph_cap_snap
ceph: add getvxattr op
libceph: drop else branches in prepare_read_data{,_cont}
ceph: fix comments mentioning i_mutex
...
|
|
Pull xfs updates from Darrick Wong:
"The biggest change this cycle is bringing XFS' inode attribute setting
code back towards alignment with what the VFS does. IOWs, setgid bit
handling should be a closer match with ext4 and btrfs behavior.
The rest of the branch is bug fixes around the filesystem -- patching
gaps in quota enforcement, removing bogus selinux audit messages, and
fixing log corruption and problems with log recovery. There will be a
second pull request later on in the merge window with more bug fixes.
Dave Chinner will be taking over as XFS maintainer for one release
cycle, starting from the day 5.18-rc1 drops until 5.19-rc1 is tagged
so that I can focus on starting a massive design review for the
(feature complete after five years) online repair feature.
Summary:
- Fix some incorrect mapping state being passed to iomap during COW
- Don't create bogus selinux audit messages when deciding to degrade
gracefully due to lack of privilege
- Fix setattr implementation to use VFS helpers so that we drop
setgid consistently with the other filesystems
- Fix link/unlink/rename to check quota limits
- Constify xfs_name_dotdot to prevent abuse of in-kernel symbols
- Fix log livelock between the AIL and inodegc threads during
recovery
- Fix a log stall when the AIL races with pushers
- Fix stalls in CIL flushes due to pinned inode cluster buffers
during recovery
- Fix log corruption due to incorrect usage of xfs_is_shutdown vs
xlog_is_shutdown because during an induced fs shutdown, AIL
writeback must continue until the log is shut down, even if the
filesystem has already shut down"
* tag 'xfs-5.18-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: xfs_is_shutdown vs xlog_is_shutdown cage fight
xfs: AIL should be log centric
xfs: log items should have a xlog pointer, not a mount
xfs: async CIL flushes need pending pushes to be made stable
xfs: xfs_ail_push_all_sync() stalls when racing with updates
xfs: check buffer pin state after locking in delwri_submit
xfs: log worker needs to start before intent/unlink recovery
xfs: constify xfs_name_dotdot
xfs: constify the name argument to various directory functions
xfs: reserve quota for target dir expansion when renaming files
xfs: reserve quota for dir expansion when linking/unlinking files
xfs: refactor user/group quota chown in xfs_setattr_nonsize
xfs: use setattr_copy to set vfs inode attributes
xfs: don't generate selinux audit messages for capability testing
xfs: add missing cmap->br_state = XFS_EXT_NORM update
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull DAX updates from Dan Williams:
"Andrew has been shepherding major dax features that touch the core -mm
through his tree, but I still collect the dax updates that are core-mm
independent.
- Fix a crash due to a missing rcu_barrier() in dax_fs_exit()
- Fix two miscellaneous doc issues"
* tag 'dax-for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: Fix missing kdoc for dax_device
dax: make sure inodes are flushed before destroy cache
fsdax: fix function description
|
|
While profiling task_work intensive workloads, I noticed that most of
the time in tctx_task_work() is spending stalled on loading 'req'. This
is one of the unfortunate side effects of using linked lists,
particularly when they end up being passe around.
Prefetch the next request, if there is one. There's a sufficient amount
of work in between that this makes it available for the next loop.
While fiddling with the cache layout, move the link outside of the
hot completion cacheline. It's rarely used in hot workloads, so better
to bring in kbuf which is used for networked loads with provided buffers.
This reduces tctx_task_work() overhead from ~3% to 1-1.5% in my testing.
Signed-off-by: Jens Axboe <[email protected]>
|
|
When direct writes fail with -ENOTBLK because we're writing into a
hole (gfs2_iomap_begin()) or because of a page invalidation failure
(iomap_dio_rw()), we're falling back to buffered writes. In that case,
when we lose the inode glock in gfs2_file_buffered_write(), we want to
re-acquire it instead of returning a short write.
Signed-off-by: Andreas Gruenbacher <[email protected]>
|
|
Function iomap_dio_rw() only returns -ENOTBLK for write requests and
gfs2_file_direct_read() no longer returns -ENOTBLK since commit
1d45bb7f9d2a5 ("gfs2: Use iomap for stuffed direct I/O reads"), so there
is no need to check for -ENOTBLK in gfs2_file_read_iter() anymore.
Signed-off-by: Andreas Gruenbacher <[email protected]>
|
|
Since commit 554c577cee95b, gfs2_file_buffered_write() can accidentally
return a truncated iov_iter, which might confuse callers. Fix that.
Fixes: 554c577cee95b ("gfs2: Prevent endless loops in gfs2_file_buffered_write")
Signed-off-by: Andreas Gruenbacher <[email protected]>
|
|
Do not set REQ_F_NOWAIT if the socket is non blocking. When enabled this
causes the accept to immediately post a CQE with EAGAIN, which means you
cannot perform an accept SQE on a NONBLOCK socket asynchronously.
By removing the flag if there is no pending accept then poll is armed as
usual and when a connection comes in the CQE is posted.
Signed-off-by: Dylan Yudaken <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Merge more updates from Andrew Morton:
"Various misc subsystems, before getting into the post-linux-next
material.
41 patches.
Subsystems affected by this patch series: procfs, misc, core-kernel,
lib, checkpatch, init, pipe, minix, fat, cgroups, kexec, kdump,
taskstats, panic, kcov, resource, and ubsan"
* emailed patches from Andrew Morton <[email protected]>: (41 commits)
Revert "ubsan, kcsan: Don't combine sanitizer with kcov on clang"
kernel/resource: fix kfree() of bootmem memory again
kcov: properly handle subsequent mmap calls
kcov: split ioctl handling into locked and unlocked parts
panic: move panic_print before kmsg dumpers
panic: add option to dump all CPUs backtraces in panic_print
docs: sysctl/kernel: add missing bit to panic_print
taskstats: remove unneeded dead assignment
kasan: no need to unset panic_on_warn in end_report()
ubsan: no need to unset panic_on_warn in ubsan_epilogue()
panic: unset panic_on_warn inside panic()
docs: kdump: add scp example to write out the dump file
docs: kdump: update description about sysfs file system support
arm64: mm: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef
x86/setup: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef
riscv: mm: init: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef
kexec: make crashk_res, crashk_low_res and crash_notes symbols always visible
cgroup: use irqsave in cgroup_rstat_flush_locked().
fat: use pointer to simple type in put_user()
minix: fix bug when opening a file with O_DIRECT
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux
Pull flexible-array transformations from Gustavo Silva:
"Treewide patch that replaces zero-length arrays with flexible-array
members.
This has been baking in linux-next for a whole development cycle"
* tag 'flexible-array-transformations-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
treewide: Replace zero-length arrays with flexible-array members
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull mount attributes PREEMPT_RT update from Christian Brauner:
"This contains Sebastian's fix to make changing mount
attributes/getting write access compatible with CONFIG_PREEMPT_RT.
The change only applies when users explicitly opt-in to real-time via
CONFIG_PREEMPT_RT otherwise things are exactly as before. We've waited
quite a long time with this to make sure folks could take a good look"
* tag 'fs.rt.v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
fs/namespace: Boost the mount_lock.lock owner instead of spinning on PREEMPT_RT.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull mount_setattr updates from Christian Brauner:
"This contains a few more patches to massage the mount_setattr()
codepaths and one minor fix to reuse a helper we added some time back.
The final two patches do similar cleanups in different ways. One patch
is mine and the other is Al's who was nice enough to give me a branch
for it.
Since his came in later and my branch had been sitting in -next for
quite some time we just put his on top instead of swap them"
* tag 'fs.v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
mount_setattr(): clean the control flow and calling conventions
fs: clean up mount_setattr control flow
fs: don't open-code mnt_hold_writers()
fs: simplify check in mount_setattr_commit()
fs: add mnt_allow_writers() and simplify mount_setattr_prepare()
|
|
A subvolume with an active swapfile must not be deleted otherwise it
would not be possible to deactivate it.
After the subvolume is deleted, we cannot swapoff the swapfile in this
deleted subvolume because the path is unreachable. The swapfile is
still active and holding references, the filesystem cannot be unmounted.
The test looks like this:
mkfs.btrfs -f $dev > /dev/null
mount $dev $mnt
btrfs sub create $mnt/subvol
touch $mnt/subvol/swapfile
chmod 600 $mnt/subvol/swapfile
chattr +C $mnt/subvol/swapfile
dd if=/dev/zero of=$mnt/subvol/swapfile bs=1K count=4096
mkswap $mnt/subvol/swapfile
swapon $mnt/subvol/swapfile
btrfs sub delete $mnt/subvol
swapoff $mnt/subvol/swapfile # failed: No such file or directory
swapoff --all
unmount $mnt # target is busy.
To prevent above issue, we simply check that whether the subvolume
contains any active swapfile, and stop the deleting process. This
behavior is like snapshot ioctl dealing with a swapfile.
CC: [email protected] # 5.4+
Reviewed-by: Robbie Ko <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Kaiwen Hu <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
This is a long time leftover from when I originally added the free space
inode, the point was to catch cases where we weren't honoring the NOCOW
flag. However there exists a race with relocation, if we allocate our
free space inode in a block group that is about to be relocated, we
could trigger the COW path before the relocation has the opportunity to
find the extents and delete the free space cache. In production where
we have auto-relocation enabled we're seeing this WARN_ON_ONCE() around
5k times in a 2 week period, so not super common but enough that it's at
the top of our metrics.
We're properly handling the error here, and with us phasing out v1 space
cache anyway just drop the WARN_ON_ONCE.
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
[BUG]
There is a report that autodefrag is defragging single sector, which
is completely waste of IO, and no help for defragging:
btrfs-cleaner-808 defrag_one_locked_range: root=256 ino=651122 start=0 len=4096
[CAUSE]
In defrag_collect_targets(), we check if the current range (A) can be merged
with next one (B).
If mergeable, we will add range A into target for defrag.
However there is a catch for autodefrag, when checking mergeability
against range B, we intentionally pass 0 as @newer_than, hoping to get a
higher chance to merge with the next extent.
But in the next iteration, range B will looked up by defrag_lookup_extent(),
with non-zero @newer_than.
And if range B is not really newer, it will rejected directly, causing
only range A being defragged, while we expect to defrag both range A and
B.
[FIX]
Since the root cause is the difference in check condition of
defrag_check_next_extent() and defrag_collect_targets(), we fix it by:
1. Pass @newer_than to defrag_check_next_extent()
2. Pass @extent_thresh to defrag_check_next_extent()
This makes the check between defrag_collect_targets() and
defrag_check_next_extent() more consistent.
While there is still some minor difference, the remaining checks are
focus on runtime flags like writeback/delalloc, which are mostly
transient and safe to be checked only in defrag_collect_targets().
Link: https://github.com/btrfs/linux/issues/423#issuecomment-1066981856
CC: [email protected] # 5.16+
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Since the initial introduction of (posix) fallocate back at the turn of
the century, it has been possible to use this syscall to change the
user-visible contents of files. This can happen by extending the file
size during a preallocation, or through any of the newer modes (punch,
zero range). Because the call can be used to change file contents, we
should treat it like we do any other modification to a file -- update
the mtime, and drop set[ug]id privileges/capabilities.
The VFS function file_modified() does all this for us if pass it a
locked inode, so let's make fallocate drop permissions correctly.
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
[BUG]
There is a report that a btrfs has a bad super block num devices.
This makes btrfs to reject the fs completely.
BTRFS error (device sdd3): super_num_devices 3 mismatch with num_devices 2 found here
BTRFS error (device sdd3): failed to read chunk tree: -22
BTRFS error (device sdd3): open_ctree failed
[CAUSE]
During btrfs device removal, chunk tree and super block num devs are
updated in two different transactions:
btrfs_rm_device()
|- btrfs_rm_dev_item(device)
| |- trans = btrfs_start_transaction()
| | Now we got transaction X
| |
| |- btrfs_del_item()
| | Now device item is removed from chunk tree
| |
| |- btrfs_commit_transaction()
| Transaction X got committed, super num devs untouched,
| but device item removed from chunk tree.
| (AKA, super num devs is already incorrect)
|
|- cur_devices->num_devices--;
|- cur_devices->total_devices--;
|- btrfs_set_super_num_devices()
All those operations are not in transaction X, thus it will
only be written back to disk in next transaction.
So after the transaction X in btrfs_rm_dev_item() committed, but before
transaction X+1 (which can be minutes away), a power loss happen, then
we got the super num mismatch.
[FIX]
Instead of starting and committing a transaction inside
btrfs_rm_dev_item(), start a transaction in side btrfs_rm_device() and
pass it to btrfs_rm_dev_item().
And only commit the transaction after everything is done.
Reported-by: Luca Béla Palkovics <[email protected]>
Link: https://lore.kernel.org/linux-btrfs/CA+8xDSpvdm_U0QLBAnrH=zqDq_cWCOH5TiV46CKmp3igr44okQ@mail.gmail.com/
CC: [email protected] # 4.14+
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
There is no reason to retry the operation if a session error had
occurred in such case result structure isn't filled out.
Fixes: dff58530c4ca ("NFSv4.1: fix handling of backchannel binding in BIND_CONN_TO_SESSION")
Signed-off-by: Olga Kornievskaia <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
|
|
To move the list iterator variable into the list_for_each_entry_*()
macro in the future it should be avoided to use the list iterator
variable after the loop body.
To *never* use the list iterator variable after the loop it was
concluded to use a separate iterator variable instead of a
found boolean [1].
This removes the need to use a found variable and simply checking if
the variable was set, can determine if the break/goto was hit.
Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/
Signed-off-by: Jakob Koschel <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
|
|
This was introduced with the message ring opcode, but isn't strictly
required for the request itself. The sender can encode what is needed
in user_data, which is passed to the receiver. It's unclear if having
a separate flag that essentially says "This CQE did not originate from
an SQE on this ring" provides any real utility to applications. While
we can always re-introduce a flag to provide this information, we cannot
take it away at a later point in time.
Remove the flag while we still can, before it's in a released kernel.
Signed-off-by: Jens Axboe <[email protected]>
|
|
The put_user(val,ptr) macro wants a pointer to a simple type, but in
fat_ioctl_filldir() the d_name field references an "array of chars". Be
more accurate and explicitly give the pointer to the first character of
the d_name[] array.
I noticed that issue while trying to optimize the parisc put_user()
macro and used an intermediate variable to store the pointer. In that
case I got this error:
In file included from include/linux/uaccess.h:11,
from include/linux/compat.h:17,
from fs/fat/dir.c:18:
fs/fat/dir.c: In function `fat_ioctl_filldir':
fs/fat/dir.c:725:33: error: invalid initializer
725 | if (put_user(0, d2->d_name) || \
| ^~
include/asm/uaccess.h:152:33: note: in definition of macro `__put_user'
152 | __typeof__(ptr) __ptr = ptr; \
| ^~~
fs/fat/dir.c:759:1: note: in expansion of macro `FAT_IOCTL_FILLDIR_FUNC'
759 | FAT_IOCTL_FILLDIR_FUNC(fat_ioctl_filldir, __fat_dirent)
Andreas Schwab <[email protected]> suggested to use
__typeof__(&*(ptr)) __ptr = ptr;
instead. This works, but nevertheless it's probably reasonable to fix
the original caller too.
Link: https://lkml.kernel.org/r/Ygo+A9MREmC1H3kr@p100
Signed-off-by: Helge Deller <[email protected]>
Acked-by: OGAWA Hirofumi <[email protected]>
Cc: David Laight <[email protected]>
Cc: Andreas Schwab <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Testcase:
1. create a minix file system and mount it
2. open a file on the file system with O_RDWR|O_CREAT|O_TRUNC|O_DIRECT
3. open fails with -EINVAL but leaves an empty file behind. All other
open() failures don't leave the failed open files behind.
It is hard to check the direct_IO op before creating the inode. Just as
ext4 and btrfs do, this patch will resolve the issue by allowing to
create the file with O_DIRECT but returning error when writing the file.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Qinghua Jin <[email protected]>
Reported-by: Colin Ian King <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
head, tail, ring_size are declared as unsigned int, so all local
variables that operate with these fields have to be unsigned to avoid
signed integer overflow.
Right now, it isn't an issue because the maximum pipe size is limited by
1U<<31.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrei Vagin <[email protected]>
Suggested-by: Dmitry Safonov <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Right now, kcalloc is used to allocate a pipe_buffer array. The size of
the pipe_buffer struct is 40 bytes. kcalloc allows allocating reliably
chunks with sizes less or equal to PAGE_ALLOC_COSTLY_ORDER (3). It
means that the maximum pipe size is 3.2MB in this case.
In CRIU, we use pipes to dump processes memory. CRIU freezes a target
process, injects a parasite code into it and then this code splices
memory into pipes. If a maximum pipe size is small, we need to do many
iterations or create many pipes.
kvcalloc attempt to allocate physically contiguous memory, but upon
failure, fall back to non-contiguous (vmalloc) allocation and so it
isn't limited by PAGE_ALLOC_COSTLY_ORDER.
The maximum pipe size for non-root users is limited by the
/proc/sys/fs/pipe-max-size sysctl that is 1MB by default, so only the
root user will be able to trigger vmalloc allocations.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrei Vagin <[email protected]>
Reviewed-by: Dmitry Safonov <[email protected]>
Cc: Alexander Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix a spelling problem to remove warnings found by running
scripts/kernel-doc, which is caused by using 'make W=1'.
fs/proc/vmcore.c:492: warning: Function parameter or member 'size' not described in 'vmcore_alloc_buf'
fs/proc/vmcore.c:492: warning: Excess function parameter 'sizez' description in 'vmcore_alloc_buf'
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Li <[email protected]>
Reported-by: Abaci Robot <[email protected]>
Acked-by: Baoquan He <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Lockdep noticed that there is chance for a deadlock if we have concurrent
mmap, concurrent read, and the addition/removal of a callback.
As nicely explained by Boqun:
"Lockdep warned about the above sequences because rw_semaphore is a
fair read-write lock, and the following can cause a deadlock:
TASK 1 TASK 2 TASK 3
====== ====== ======
down_write(mmap_lock);
down_read(vmcore_cb_rwsem)
down_write(vmcore_cb_rwsem); // blocked
down_read(vmcore_cb_rwsem); // cannot get the lock because of the fairness
down_read(mmap_lock); // blocked
IOW, a reader can block another read if there is a writer queued by
the second reader and the lock is fair"
To fix this, convert to srcu to make this deadlock impossible. We need
srcu as our callbacks can sleep. With this change, I cannot trigger any
lockdep warnings.
======================================================
WARNING: possible circular locking dependency detected
5.17.0-0.rc0.20220117git0c947b893d69.68.test.fc36.x86_64 #1 Not tainted
------------------------------------------------------
makedumpfile/542 is trying to acquire lock:
ffffffff832d2eb8 (vmcore_cb_rwsem){.+.+}-{3:3}, at: mmap_vmcore+0x340/0x580
but task is already holding lock:
ffff8880af226438 (&mm->mmap_lock#2){++++}-{3:3}, at: vm_mmap_pgoff+0x84/0x150
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&mm->mmap_lock#2){++++}-{3:3}:
lock_acquire+0xc3/0x1a0
__might_fault+0x4e/0x70
_copy_to_user+0x1f/0x90
__copy_oldmem_page+0x72/0xc0
read_from_oldmem+0x77/0x1e0
read_vmcore+0x2c2/0x310
proc_reg_read+0x47/0xa0
vfs_read+0x101/0x340
__x64_sys_pread64+0x5d/0xa0
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #0 (vmcore_cb_rwsem){.+.+}-{3:3}:
validate_chain+0x9f4/0x2670
__lock_acquire+0x8f7/0xbc0
lock_acquire+0xc3/0x1a0
down_read+0x4a/0x140
mmap_vmcore+0x340/0x580
proc_reg_mmap+0x3e/0x90
mmap_region+0x504/0x880
do_mmap+0x38a/0x520
vm_mmap_pgoff+0xc1/0x150
ksys_mmap_pgoff+0x178/0x200
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&mm->mmap_lock#2);
lock(vmcore_cb_rwsem);
lock(&mm->mmap_lock#2);
lock(vmcore_cb_rwsem);
*** DEADLOCK ***
1 lock held by makedumpfile/542:
#0: ffff8880af226438 (&mm->mmap_lock#2){++++}-{3:3}, at: vm_mmap_pgoff+0x84/0x150
stack backtrace:
CPU: 0 PID: 542 Comm: makedumpfile Not tainted 5.17.0-0.rc0.20220117git0c947b893d69.68.test.fc36.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
__lock_acquire+0x8f7/0xbc0
lock_acquire+0xc3/0x1a0
down_read+0x4a/0x140
mmap_vmcore+0x340/0x580
proc_reg_mmap+0x3e/0x90
mmap_region+0x504/0x880
do_mmap+0x38a/0x520
vm_mmap_pgoff+0xc1/0x150
ksys_mmap_pgoff+0x178/0x200
do_syscall_64+0x43/0x90
Link: https://lkml.kernel.org/r/[email protected]
Fixes: cc5f2704c934 ("proc/vmcore: convert oldmem_pfn_is_ram callback to more generic vmcore callbacks")
Signed-off-by: David Hildenbrand <[email protected]>
Reported-by: Baoquan He <[email protected]>
Acked-by: Baoquan He <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Dave Young <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
It's not a standard approach that use __get_free_page() to alloc path
buffer directly. We'd better use kmalloc and PATH_MAX.
PAGE_SIZE is different on different archs. An unlinked file
with very long canonical pathname will readlink differently
because "(deleted)" eats into a buffer. --adobriyan
[[email protected]: remove now-unneeded cast]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hao Lee <[email protected]>
Signed-off-by: Alexey Dobriyan <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: James Morris <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic updates from Arnd Bergmann:
"There are three sets of updates for 5.18 in the asm-generic tree:
- The set_fs()/get_fs() infrastructure gets removed for good.
This was already gone from all major architectures, but now we can
finally remove it everywhere, which loses some particularly tricky
and error-prone code. There is a small merge conflict against a
parisc cleanup, the solution is to use their new version.
- The nds32 architecture ends its tenure in the Linux kernel.
The hardware is still used and the code is in reasonable shape, but
the mainline port is not actively maintained any more, as all
remaining users are thought to run vendor kernels that would never
be updated to a future release.
- A series from Masahiro Yamada cleans up some of the uapi header
files to pass the compile-time checks"
* tag 'asm-generic-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: (27 commits)
nds32: Remove the architecture
uaccess: remove CONFIG_SET_FS
ia64: remove CONFIG_SET_FS support
sh: remove CONFIG_SET_FS support
sparc64: remove CONFIG_SET_FS support
lib/test_lockup: fix kernel pointer check for separate address spaces
uaccess: generalize access_ok()
uaccess: fix type mismatch warnings from access_ok()
arm64: simplify access_ok()
m68k: fix access_ok for coldfire
MIPS: use simpler access_ok()
MIPS: Handle address errors for accesses above CPU max virtual user address
uaccess: add generic __{get,put}_kernel_nofault
nios2: drop access_ok() check from __put_user()
x86: use more conventional access_ok() definition
x86: remove __range_not_ok()
sparc64: add __{get,put}_kernel_nofault()
nds32: fix access_ok() checks in get/put_user
uaccess: fix nios2 and microblaze get_user_8()
sparc64: fix building assembly files
...
|
|
If we need to continue doing this IO, then we don't want a potentially
selected buffer recycled. Add a flag for that.
Set this for recv/recvmsg if they do partial IO.
Signed-off-by: Jens Axboe <[email protected]>
|
|
We currently don't attempt to get the full asked for length even if
MSG_WAITALL is set, if we get a partial receive. If we do see a partial
receive, then just note how many bytes we did and return -EAGAIN to
get it retried.
The iov is advanced appropriately for the vector based case, and we
manually bump the buffer and remainder for the non-vector case.
Cc: [email protected]
Reported-by: Constantine Gavrilov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
We use extent_changeset->bytes_changed in qgroup_reserve_data() to record
how many bytes we set for EXTENT_QGROUP_RESERVED state. Currently the
bytes_changed is set as "unsigned int", and it will overflow if we try to
fallocate a range larger than 4GiB. The result is we reserve less bytes
and eventually break the qgroup limit.
Unlike regular buffered/direct write, which we use one changeset for
each ordered extent, which can never be larger than 256M. For
fallocate, we use one changeset for the whole range, thus it no longer
respects the 256M per extent limit, and caused the problem.
The following example test script reproduces the problem:
$ cat qgroup-overflow.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f $DEV
mount $DEV $MNT
# Set qgroup limit to 2GiB.
btrfs quota enable $MNT
btrfs qgroup limit 2G $MNT
# Try to fallocate a 3GiB file. This should fail.
echo
echo "Try to fallocate a 3GiB file..."
fallocate -l 3G $MNT/3G.file
# Try to fallocate a 5GiB file.
echo
echo "Try to fallocate a 5GiB file..."
fallocate -l 5G $MNT/5G.file
# See we break the qgroup limit.
echo
sync
btrfs qgroup show -r $MNT
umount $MNT
When running the test:
$ ./qgroup-overflow.sh
(...)
Try to fallocate a 3GiB file...
fallocate: fallocate failed: Disk quota exceeded
Try to fallocate a 5GiB file...
qgroupid rfer excl max_rfer
-------- ---- ---- --------
0/5 5.00GiB 5.00GiB 2.00GiB
Since we have no control of how bytes_changed is used, it's better to
set it to u64.
CC: [email protected] # 4.14+
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Ethan Lien <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
With commit dcf5652291f6 ("btrfs: zoned: allow DUP on meta-data block
groups") we started allowing DUP on metadata block groups, so the
ASSERT()s in btrfs_can_activate_zone() and btrfs_zoned_get_device() are
no longer valid and in fact even harmful.
Fixes: dcf5652291f6 ("btrfs: zoned: allow DUP on meta-data block groups")
CC: [email protected] # 5.17
Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
btrfs_can_activate_zone() can be called with the device_list_mutex already
held, which will lead to a deadlock:
insert_dev_extents() // Takes device_list_mutex
`-> insert_dev_extent()
`-> btrfs_insert_empty_item()
`-> btrfs_insert_empty_items()
`-> btrfs_search_slot()
`-> btrfs_cow_block()
`-> __btrfs_cow_block()
`-> btrfs_alloc_tree_block()
`-> btrfs_reserve_extent()
`-> find_free_extent()
`-> find_free_extent_update_loop()
`-> can_allocate_chunk()
`-> btrfs_can_activate_zone() // Takes device_list_mutex again
Instead of using the RCU on fs_devices->device_list we
can use fs_devices->alloc_list, protected by the chunk_mutex to traverse
the list of active devices.
We are in the chunk allocation thread. The newer chunk allocation
happens from the devices in the fs_device->alloc_list protected by the
chunk_mutex.
btrfs_create_chunk()
lockdep_assert_held(&info->chunk_mutex);
gather_device_info
list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
Also, a device that reappears after the mount won't join the alloc_list
yet and, it will be in the dev_list, which we don't want to consider in
the context of the chunk alloc.
[15.166572] WARNING: possible recursive locking detected
[15.167117] 5.17.0-rc6-dennis #79 Not tainted
[15.167487] --------------------------------------------
[15.167733] kworker/u8:3/146 is trying to acquire lock:
[15.167733] ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: find_free_extent+0x15a/0x14f0 [btrfs]
[15.167733]
[15.167733] but task is already holding lock:
[15.167733] ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_create_pending_block_groups+0x20a/0x560 [btrfs]
[15.167733]
[15.167733] other info that might help us debug this:
[15.167733] Possible unsafe locking scenario:
[15.167733]
[15.171834] CPU0
[15.171834] ----
[15.171834] lock(&fs_devs->device_list_mutex);
[15.171834] lock(&fs_devs->device_list_mutex);
[15.171834]
[15.171834] *** DEADLOCK ***
[15.171834]
[15.171834] May be due to missing lock nesting notation
[15.171834]
[15.171834] 5 locks held by kworker/u8:3/146:
[15.171834] #0: ffff888100050938 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x1c3/0x5a0
[15.171834] #1: ffffc9000067be80 ((work_completion)(&fs_info->async_data_reclaim_work)){+.+.}-{0:0}, at: process_one_work+0x1c3/0x5a0
[15.176244] #2: ffff88810521e620 (sb_internal){.+.+}-{0:0}, at: flush_space+0x335/0x600 [btrfs]
[15.176244] #3: ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_create_pending_block_groups+0x20a/0x560 [btrfs]
[15.176244] #4: ffff8881152e4b78 (btrfs-dev-00){++++}-{3:3}, at: __btrfs_tree_lock+0x27/0x130 [btrfs]
[15.179641]
[15.179641] stack backtrace:
[15.179641] CPU: 1 PID: 146 Comm: kworker/u8:3 Not tainted 5.17.0-rc6-dennis #79
[15.179641] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014
[15.179641] Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
[15.179641] Call Trace:
[15.179641] <TASK>
[15.179641] dump_stack_lvl+0x45/0x59
[15.179641] __lock_acquire.cold+0x217/0x2b2
[15.179641] lock_acquire+0xbf/0x2b0
[15.183838] ? find_free_extent+0x15a/0x14f0 [btrfs]
[15.183838] __mutex_lock+0x8e/0x970
[15.183838] ? find_free_extent+0x15a/0x14f0 [btrfs]
[15.183838] ? find_free_extent+0x15a/0x14f0 [btrfs]
[15.183838] ? lock_is_held_type+0xd7/0x130
[15.183838] ? find_free_extent+0x15a/0x14f0 [btrfs]
[15.183838] find_free_extent+0x15a/0x14f0 [btrfs]
[15.183838] ? _raw_spin_unlock+0x24/0x40
[15.183838] ? btrfs_get_alloc_profile+0x106/0x230 [btrfs]
[15.187601] btrfs_reserve_extent+0x131/0x260 [btrfs]
[15.187601] btrfs_alloc_tree_block+0xb5/0x3b0 [btrfs]
[15.187601] __btrfs_cow_block+0x138/0x600 [btrfs]
[15.187601] btrfs_cow_block+0x10f/0x230 [btrfs]
[15.187601] btrfs_search_slot+0x55f/0xbc0 [btrfs]
[15.187601] ? lock_is_held_type+0xd7/0x130
[15.187601] btrfs_insert_empty_items+0x2d/0x60 [btrfs]
[15.187601] btrfs_create_pending_block_groups+0x2b3/0x560 [btrfs]
[15.187601] __btrfs_end_transaction+0x36/0x2a0 [btrfs]
[15.192037] flush_space+0x374/0x600 [btrfs]
[15.192037] ? find_held_lock+0x2b/0x80
[15.192037] ? btrfs_async_reclaim_data_space+0x49/0x180 [btrfs]
[15.192037] ? lock_release+0x131/0x2b0
[15.192037] btrfs_async_reclaim_data_space+0x70/0x180 [btrfs]
[15.192037] process_one_work+0x24c/0x5a0
[15.192037] worker_thread+0x4a/0x3d0
Fixes: a85f05e59bc1 ("btrfs: zoned: avoid chunk allocation if active block group has enough space")
CC: [email protected] # 5.16+
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
The bug is here:
if (!tcon) {
resched = true;
list_del_init(&ses->rlist);
cifs_put_smb_ses(ses);
Because the list_for_each_entry() never exits early (without any
break/goto/return inside the loop), the iterator 'ses' after the
loop will always be an pointer to a invalid struct containing the
HEAD (&pserver->smb_ses_list). As a result, the uses of 'ses' above
will lead to a invalid memory access.
The original intention should have been to walk each entry 'ses' in
'&tmp_ses_list', delete '&ses->rlist' and put 'ses'. So fix it with
a list_for_each_entry_safe().
Cc: [email protected] # 5.17
Fixes: 3663c9045f51a ("cifs: check reconnects for channels of active tcons too")
Signed-off-by: Xiaomeng Tong <[email protected]>
Reviewed-by: Shyam Prasad N <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
There is no need to store the fids as le64 integers as they are opaque
to the client and only used for equality.
Signed-off-by: Paulo Alcantara (SUSE) <[email protected]>
Reviewed-by: Tom Talpey <[email protected]>
Acked-by: Namjae Jeon <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
The client used to partially convert the fids to le64, while storing
or sending them by using host endianness. This broke the client on
big-endian machines. Instead of converting them to le64, store them
as opaque integers and then avoid byteswapping when sending them over
wire.
Signed-off-by: Paulo Alcantara (SUSE) <[email protected]>
Reviewed-by: Namjae Jeon <[email protected]>
Reviewed-by: Tom Talpey <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
This will reduce the number of Open/Close we send on the wire and replace
a Open/GetInfo/Close compound with just a simple GetInfo request
IF we have a cached handle for the object.
Signed-off-by: Ronnie Sahlberg <[email protected]>
Reviewed-by: Paulo Alcantara (SUSE) <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
and not in the callers.
Signed-off-by: Ronnie Sahlberg <[email protected]>
Reviewed-by: Paulo Alcantara (SUSE) <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
Clean up the retry logic in the read and write functions somewhat.
Signed-off-by: Andreas Gruenbacher <[email protected]>
|
|
During lockless buffered reads, filemap_read() holds page cache page
references while trying to copy data to the user-space buffer. The
calling process isn't holding the inode glock, but the page references
it holds prevent those pages from being removed from the page cache, and
that prevents the underlying inode glock from being moved to another
node. Thus, we can end up in the same kinds of distributed deadlock
situations as with normal (non-lockless) buffered reads.
Fix that by disabling page faults during lockless reads as well.
Signed-off-by: Andreas Gruenbacher <[email protected]>
|
|
Fix the fault-in window size logic:
* Use a maximum window size of 1 MiB instead of BIO_MAX_VECS * PAGE_SIZE.
The previous window size was always one page because the pages variable
was accidentally being defined and then redefined in
should_fault_in_pages().
* The nr_dirtied heuristic for guessing when there might be memory
pressure often results in very small window sizes. Don't let
nr_dirtied drop below 8 pages (as btrfs does).
* Compute the window size in units of bytes, not pages.
* Account for page overlap (unaligned iterators).
Signed-off-by: Andreas Gruenbacher <[email protected]>
|
|
The mpage bio alloc cleanup accidentally removed clearing ~GFP_KERNEL
bits from the mask passed to bio_alloc. Fix this up in a slightly
less obsfucated way that mirrors what iomap does in its readpage code.
Fixes: 77c436de01c0 ("mpage: pass the operation to bio_alloc")
Reported-by: Guenter Roeck <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Tested-by: Ryusuke Konishi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
We only really need to recycle the buffer when going async for a file
type that has an indefinite reponse time (eg non-file/bdev). And for
files that to arm poll, the async worker will arm poll anyway and the
buffer will get recycled there.
In that latter case, we're not holding ctx->uring_lock. Ensure we take
the issue_flags into account and acquire it if we need to.
Fixes: b1c62645758e ("io_uring: recycle provided buffers if request goes async")
Reported-by: Stefan Roesch <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|