Age | Commit message (Collapse) | Author | Files | Lines |
|
Pull ceph updates from Ilya Dryomov:
"Assorted CephFS fixes and cleanups with nothing standing out"
* tag 'ceph-for-6.8-rc1' of https://github.com/ceph/ceph-client:
ceph: get rid of passing callbacks in __dentry_leases_walk()
ceph: d_obtain_{alias,root}(ERR_PTR(...)) will do the right thing
ceph: fix invalid pointer access if get_quota_realm return ERR_PTR
ceph: remove duplicated code in ceph_netfs_issue_read()
ceph: send oldest_client_tid when renewing caps
ceph: rename create_session_open_msg() to create_session_full_msg()
ceph: select FS_ENCRYPTION_ALGS if FS_ENCRYPTION
ceph: fix deadlock or deadcode of misusing dget()
ceph: try to allocate a smaller extent map for sparse read
libceph: remove MAX_EXTENTS check for sparse reads
ceph: reinitialize mds feature bit even when session in open
ceph: skip reconnecting if MDS is not ready
|
|
Pull xfs fix from Chandan Babu:
- Fix per-inode space accounting bug
* tag 'xfs-6.8-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix backwards logic in xfs_bmap_alloc_account
|
|
Pull more smb server updates from Steve French:
- Fix for incorrect oplock break on directories when leases disabled
- UAF fix for race between create and destroy of tcp connection
- Important session setup SPNEGO fix
- Update ksmbd feature status summary
* tag '6.8-rc-smb-server-fixes-part2' of git://git.samba.org/ksmbd:
ksmbd: only v2 leases handle the directory
ksmbd: fix UAF issue in ksmbd_tcp_new_connection()
ksmbd: validate mech token in session setup
ksmbd: update feature status in documentation
|
|
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull netfs updates from Christian Brauner:
"This extends the netfs helper library that network filesystems can use
to replace their own implementations. Both afs and 9p are ported. cifs
is ready as well but the patches are way bigger and will be routed
separately once this is merged. That will remove lots of code as well.
The overal goal is to get high-level I/O and knowledge of the page
cache and ouf of the filesystem drivers. This includes knowledge about
the existence of pages and folios
The pull request converts afs and 9p. This removes about 800 lines of
code from afs and 300 from 9p. For 9p it is now possible to do writes
in larger than a page chunks. Additionally, multipage folio support
can be turned on for 9p. Separate patches exist for cifs removing
another 2000+ lines. I've included detailed information in the
individual pulls I took.
Summary:
- Add NFS-style (and Ceph-style) locking around DIO vs buffered I/O
calls to prevent these from happening at the same time.
- Support for direct and unbuffered I/O.
- Support for write-through caching in the page cache.
- O_*SYNC and RWF_*SYNC writes use write-through rather than writing
to the page cache and then flushing afterwards.
- Support for write-streaming.
- Support for write grouping.
- Skip reads for which the server could only return zeros or EOF.
- The fscache module is now part of the netfs library and the
corresponding maintainer entry is updated.
- Some helpers from the fscache subsystem are renamed to mark them as
belonging to the netfs library.
- Follow-up fixes for the netfs library.
- Follow-up fixes for the 9p conversion"
* tag 'vfs-6.8.netfs' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (50 commits)
netfs: Fix wrong #ifdef hiding wait
cachefiles: Fix signed/unsigned mixup
netfs: Fix the loop that unmarks folios after writing to the cache
netfs: Fix interaction between write-streaming and cachefiles culling
netfs: Count DIO writes
netfs: Mark netfs_unbuffered_write_iter_locked() static
netfs: Fix proc/fs/fscache symlink to point to "netfs" not "../netfs"
netfs: Rearrange netfs_io_subrequest to put request pointer first
9p: Use length of data written to the server in preference to error
9p: Do a couple of cleanups
9p: Fix initialisation of netfs_inode for 9p
cachefiles: Fix __cachefiles_prepare_write()
9p: Use netfslib read/write_iter
afs: Use the netfs write helpers
netfs: Export the netfs_sreq tracepoint
netfs: Optimise away reads above the point at which there can be no data
netfs: Implement a write-through caching option
netfs: Provide a launder_folio implementation
netfs: Provide a writepages implementation
netfs, cachefiles: Pass upper bound length to allow expansion
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
- Fix a "BUG: kernel NULL pointer dereference" issue due to
inconsistent on-disk indices of compressed inodes against
per-sb `available_compr_algs` generated by Syzkaller
- Don't use certain unnecessary folio_*() helpers if the folio
type (page cache) is known
* tag 'erofs-for-6.8-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: Don't use certain unnecessary folio_*() functions
erofs: fix inconsistent per-file compression format
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull eventfs updates from Steven Rostedt:
- Remove "lookup" parameter of create_dir_dentry() and
create_file_dentry(). These functions were called by lookup and the
readdir logic, where readdir needed it to up the ref count of the
dentry but the lookup did not. A "lookup" parameter was passed in to
tell it what to do, but this complicated the code. It is better to
just always up the ref count and require the caller to decrement it,
even for lookup.
- Modify the .iterate_shared callback to not use the dcache_readdir()
logic and just handle what gets displayed by that one function. This
removes the need for eventfs to hijack the file->private_data from
the dcache_readdir() "cursor" pointer, and makes the code a bit more
sane
- Use the root and instance inodes for default ownership. Instead of
walking the dentry tree and updating each dentry gid, use the
getattr(), setattr() and permission() callbacks to set the ownership
and permissions using the root or instance as the default
- Some other optimizations with the eventfs iterate_shared logic
- Hard-code the inodes for eventfs to the same number for files, and
the same number for directories
- Have getdent() not create dentries/inodes in iterate_shared() as now
it has hard-coded inode numbers
- Use kcalloc() instead of kzalloc() on a list of elements
- Fix seq_buf warning and make static work properly.
* tag 'eventfs-v6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
seq_buf: Make DECLARE_SEQ_BUF() usable
eventfs: Use kcalloc() instead of kzalloc()
eventfs: Do not create dentries nor inodes in iterate_shared
eventfs: Have the inodes all for files and directories all be the same
eventfs: Shortcut eventfs_iterate() by skipping entries already read
eventfs: Read ei->entries before ei->children in eventfs_iterate()
eventfs: Do ctx->pos update for all iterations in eventfs_iterate()
eventfs: Have eventfs_iterate() stop immediately if ei->is_freed is set
tracefs/eventfs: Use root and instance inodes as default ownership
eventfs: Stop using dcache_readdir() for getdents()
eventfs: Remove "lookup" parameter from create_dir/file_dentry()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here are the set of driver core and kernfs changes for 6.8-rc1.
Nothing major in here this release cycle, just lots of small cleanups
and some tweaks on kernfs that in the very end, got reverted and will
come back in a safer way next release cycle.
Included in here are:
- more driver core 'const' cleanups and fixes
- fw_devlink=rpm is now the default behavior
- kernfs tiny changes to remove some string functions
- cpu handling in the driver core is updated to work better on many
systems that add topologies and cpus after booting
- other minor changes and cleanups
All of the cpu handling patches have been acked by the respective
maintainers and are coming in here in one series. Everything has been
in linux-next for a while with no reported issues"
* tag 'driver-core-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (51 commits)
Revert "kernfs: convert kernfs_idr_lock to an irq safe raw spinlock"
kernfs: convert kernfs_idr_lock to an irq safe raw spinlock
class: fix use-after-free in class_register()
PM: clk: make pm_clk_add_notifier() take a const pointer
EDAC: constantify the struct bus_type usage
kernfs: fix reference to renamed function
driver core: device.h: fix Excess kernel-doc description warning
driver core: class: fix Excess kernel-doc description warning
driver core: mark remaining local bus_type variables as const
driver core: container: make container_subsys const
driver core: bus: constantify subsys_register() calls
driver core: bus: make bus_sort_breadthfirst() take a const pointer
kernfs: d_obtain_alias(NULL) will do the right thing...
driver core: Better advertise dev_err_probe()
kernfs: Convert kernfs_path_from_node_locked() from strlcpy() to strscpy()
kernfs: Convert kernfs_name_locked() from strlcpy() to strscpy()
kernfs: Convert kernfs_walk_ns() from strlcpy() to strscpy()
initramfs: Expose retained initrd as sysfs file
fs/kernfs/dir: obey S_ISGID
kernel/cgroup: use kernfs_create_dir_ns()
...
|
|
Pull kvm updates from Paolo Bonzini:
"Generic:
- Use memdup_array_user() to harden against overflow.
- Unconditionally advertise KVM_CAP_DEVICE_CTRL for all
architectures.
- Clean up Kconfigs that all KVM architectures were selecting
- New functionality around "guest_memfd", a new userspace API that
creates an anonymous file and returns a file descriptor that refers
to it. guest_memfd files are bound to their owning virtual machine,
cannot be mapped, read, or written by userspace, and cannot be
resized. guest_memfd files do however support PUNCH_HOLE, which can
be used to switch a memory area between guest_memfd and regular
anonymous memory.
- New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify
per-page attributes for a given page of guest memory; right now the
only attribute is whether the guest expects to access memory via
guest_memfd or not, which in Confidential SVMs backed by SEV-SNP,
TDX or ARM64 pKVM is checked by firmware or hypervisor that
guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in
the case of pKVM).
x86:
- Support for "software-protected VMs" that can use the new
guest_memfd and page attributes infrastructure. This is mostly
useful for testing, since there is no pKVM-like infrastructure to
provide a meaningfully reduced TCB.
- Fix a relatively benign off-by-one error when splitting huge pages
during CLEAR_DIRTY_LOG.
- Fix a bug where KVM could incorrectly test-and-clear dirty bits in
non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with
a non-huge SPTE.
- Use more generic lockdep assertions in paths that don't actually
care about whether the caller is a reader or a writer.
- let Xen guests opt out of having PV clock reported as "based on a
stable TSC", because some of them don't expect the "TSC stable" bit
(added to the pvclock ABI by KVM, but never set by Xen) to be set.
- Revert a bogus, made-up nested SVM consistency check for
TLB_CONTROL.
- Advertise flush-by-ASID support for nSVM unconditionally, as KVM
always flushes on nested transitions, i.e. always satisfies flush
requests. This allows running bleeding edge versions of VMware
Workstation on top of KVM.
- Sanity check that the CPU supports flush-by-ASID when enabling SEV
support.
- On AMD machines with vNMI, always rely on hardware instead of
intercepting IRET in some cases to detect unmasking of NMIs
- Support for virtualizing Linear Address Masking (LAM)
- Fix a variety of vPMU bugs where KVM fail to stop/reset counters
and other state prior to refreshing the vPMU model.
- Fix a double-overflow PMU bug by tracking emulated counter events
using a dedicated field instead of snapshotting the "previous"
counter. If the hardware PMC count triggers overflow that is
recognized in the same VM-Exit that KVM manually bumps an event
count, KVM would pend PMIs for both the hardware-triggered overflow
and for KVM-triggered overflow.
- Turn off KVM_WERROR by default for all configs so that it's not
inadvertantly enabled by non-KVM developers, which can be
problematic for subsystems that require no regressions for W=1
builds.
- Advertise all of the host-supported CPUID bits that enumerate
IA32_SPEC_CTRL "features".
- Don't force a masterclock update when a vCPU synchronizes to the
current TSC generation, as updating the masterclock can cause
kvmclock's time to "jump" unexpectedly, e.g. when userspace
hotplugs a pre-created vCPU.
- Use RIP-relative address to read kvm_rebooting in the VM-Enter
fault paths, partly as a super minor optimization, but mostly to
make KVM play nice with position independent executable builds.
- Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
CONFIG_HYPERV as a minor optimization, and to self-document the
code.
- Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV
"emulation" at build time.
ARM64:
- LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base
granule sizes. Branch shared with the arm64 tree.
- Large Fine-Grained Trap rework, bringing some sanity to the
feature, although there is more to come. This comes with a prefix
branch shared with the arm64 tree.
- Some additional Nested Virtualization groundwork, mostly
introducing the NV2 VNCR support and retargetting the NV support to
that version of the architecture.
- A small set of vgic fixes and associated cleanups.
Loongarch:
- Optimization for memslot hugepage checking
- Cleanup and fix some HW/SW timer issues
- Add LSX/LASX (128bit/256bit SIMD) support
RISC-V:
- KVM_GET_REG_LIST improvement for vector registers
- Generate ISA extension reg_list using macros in get-reg-list
selftest
- Support for reporting steal time along with selftest
s390:
- Bugfixes
Selftests:
- Fix an annoying goof where the NX hugepage test prints out garbage
instead of the magic token needed to run the test.
- Fix build errors when a header is delete/moved due to a missing
flag in the Makefile.
- Detect if KVM bugged/killed a selftest's VM and print out a helpful
message instead of complaining that a random ioctl() failed.
- Annotate the guest printf/assert helpers with __printf(), and fix
the various bugs that were lurking due to lack of said annotation"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits)
x86/kvm: Do not try to disable kvmclock if it was not enabled
KVM: x86: add missing "depends on KVM"
KVM: fix direction of dependency on MMU notifiers
KVM: introduce CONFIG_KVM_COMMON
KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd
KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache
RISC-V: KVM: selftests: Add get-reg-list test for STA registers
RISC-V: KVM: selftests: Add steal_time test support
RISC-V: KVM: selftests: Add guest_sbi_probe_extension
RISC-V: KVM: selftests: Move sbi_ecall to processor.c
RISC-V: KVM: Implement SBI STA extension
RISC-V: KVM: Add support for SBI STA registers
RISC-V: KVM: Add support for SBI extension registers
RISC-V: KVM: Add SBI STA info to vcpu_arch
RISC-V: KVM: Add steal-update vcpu request
RISC-V: KVM: Add SBI STA extension skeleton
RISC-V: paravirt: Implement steal-time support
RISC-V: Add SBI STA extension definitions
RISC-V: paravirt: Add skeleton for pv-time support
RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull UBI and UBIFS updates from Richard Weinberger:
"UBI:
- Use in-tree fault injection framework and add new injection types
- Fix for a memory leak in the block driver
UBIFS:
- kernel-doc fixes
- Various minor fixes"
* tag 'ubifs-for-linus-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
ubi: block: fix memleak in ubiblock_create()
ubifs: fix kernel-doc warnings
mtd: Add several functions to the fail_function list
ubi: Reserve sufficient buffer length for the input mask
ubi: Add six fault injection type for testing
ubi: Split io_failures into write_failure and erase_failure
ubi: Use the fault injection framework to enhance the fault injection capability
ubifs: ubifs_symlink: Fix memleak of inode->i_link in error path
ubifs: Check @c->dirty_[n|p]n_cnt and @c->nroot state under @c->lp_mutex
ubifs: describe function parameters
ubifs: auth.c: fix kernel-doc function prototype warning
ubifs: use crypto_shash_tfm_digest() in ubifs_hmac_wkm()
|
|
Pull fscrypt fix from Eric Biggers:
"Fix a bug in my change to how f2fs frees its superblock info (which
was part of changing the timing of fscrypt keyring destruction)"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux:
f2fs: fix double free of f2fs_sb_info
|
|
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"This contains two fixes for the current merge window. The listmount
changes that you requested and a fix for a fsnotify performance
regression:
- The proposed listmount changes are currently under my authorship. I
wasn't sure whether you'd wanted to be author as the patch wasn't
signed off. If you do I'm happy if you just apply your own patch.
I've tested the patch with my sh4 cross-build setup. And confirmed
that a) the build failure with sh on current upstream is
reproducible and that b) the proposed patch fixes the build
failure. That should only leave the task of fixing put_user on sh.
- The fsnotify regression was caused by moving one of the hooks out
of the security hook in preparation for other fsnotify work. This
meant that CONFIG_SECURITY would have compiled out the fsnotify
hook before but didn't do so now.
That lead to up to 6% performance regression in some io_uring
workloads that compile all fsnotify and security checks out. Fix
this by making sure that the relevant hooks are covered by the
already existing CONFIG_FANOTIFY_ACCESS_PERMISSIONS where the
relevant hook belongs"
* tag 'vfs-6.8-rc1.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
fs: rework listmount() implementation
fsnotify: compile out fsnotify permission hooks if !FANOTIFY_ACCESS_PERMISSIONS
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc hotfixes from Andrew Morton:
"For once not mostly MM-related.
17 hotfixes. 10 address post-6.7 issues and the other 7 are cc:stable"
* tag 'mm-hotfixes-stable-2024-01-12-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
userfaultfd: avoid huge_zero_page in UFFDIO_MOVE
MAINTAINERS: add entry for shrinker
selftests: mm: hugepage-vmemmap fails on 64K page size systems
mm/memory_hotplug: fix memmap_on_memory sysfs value retrieval
mailmap: switch email for Tanzir Hasan
mailmap: add old address mappings for Randy
kernel/crash_core.c: make __crash_hotplug_lock static
efi: disable mirror feature during crashkernel
kexec: do syscore_shutdown() in kernel_kexec
mailmap: update entry for Manivannan Sadhasivam
fs/proc/task_mmu: move mmu notification mechanism inside mm lock
mm: zswap: switch maintainers to recently active developers and reviewers
scripts/decode_stacktrace.sh: optionally use LLVM utilities
kasan: avoid resetting aux_lock
lib/Kconfig.debug: disable CONFIG_DEBUG_INFO_BTF for Hexagon
MAINTAINERS: update LTP maintainers
kdump: defer the insertion of crashkernel resources
|
|
As noted in the "Deprecated Interfaces, Language Features, Attributes,
and Conventions" documentation [1], size calculations (especially
multiplication) should not be performed in memory allocator (or similar)
function arguments due to the risk of them overflowing. This could lead
to values wrapping around and a smaller allocation being made than the
caller was expecting. Using those allocations could lead to linear
overflows of heap memory and other misbehaviors.
So, use the purpose specific kcalloc() function instead of the argument
size * count in the kzalloc() function.
[1] https://www.kernel.org/doc/html/next/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments
Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
Cc: Masami Hiramatsu <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Mark Rutland <[email protected]>
Link: https://github.com/KSPP/linux/issues/162
Signed-off-by: Erick Archer <[email protected]>
Reviewed-by: Gustavo A. R. Silva <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
The original eventfs code added a wrapper around the dcache_readdir open
callback and created all the dentries and inodes at open, and increment
their ref count. A wrapper was added around the dcache_readdir release
function to decrement all the ref counts of those created inodes and
dentries. But this proved to be buggy[1] for when a kprobe was created
during a dir read, it would create a dentry between the open and the
release, and because the release would decrement all ref counts of all
files and directories, that would include the kprobe directory that was
not there to have its ref count incremented in open. This would cause the
ref count to go to negative and later crash the kernel.
To solve this, the dentries and inodes that were created and had their ref
count upped in open needed to be saved. That list needed to be passed from
the open to the release, so that the release would only decrement the ref
counts of the entries that were incremented in the open.
Unfortunately, the dcache_readdir logic was already using the
file->private_data, which is the only field that can be used to pass
information from the open to the release. What was done was the eventfs
created another descriptor that had a void pointer to save the
dcache_readdir pointer, and it wrapped all the callbacks, so that it could
save the list of entries that had their ref counts incremented in the
open, and pass it to the release. The wrapped callbacks would just put
back the dcache_readdir pointer and call the functions it used so it could
still use its data[2].
But Linus had an issue with the "hijacking" of the file->private_data
(unfortunately this discussion was on a security list, so no public link).
Which we finally agreed on doing everything within the iterate_shared
callback and leave the dcache_readdir out of it[3]. All the information
needed for the getents() could be created then.
But this ended up being buggy too[4]. The iterate_shared callback was not
the right place to create the dentries and inodes. Even Christian Brauner
had issues with that[5].
An attempt was to go back to creating the inodes and dentries at
the open, create an array to store the information in the
file->private_data, and pass that information to the other callbacks.[6]
The difference between that and the original method, is that it does not
use dcache_readdir. It also does not up the ref counts of the dentries and
pass them. Instead, it creates an array of a structure that saves the
dentry's name and inode number. That information is used in the
iterate_shared callback, and the array is freed in the dir release. The
dentries and inodes created in the open are not used for the iterate_share
or release callbacks. Just their names and inode numbers.
Linus did not like that either[7] and just wanted to remove the dentries
being created in iterate_shared and use the hard coded inode numbers.
[ All this while Linus enjoyed an unexpected vacation during the merge
window due to lack of power. ]
[1] https://lore.kernel.org/linux-trace-kernel/[email protected]/
[2] https://lore.kernel.org/linux-trace-kernel/[email protected]/
[3] https://lore.kernel.org/linux-trace-kernel/[email protected]/
[4] https://lore.kernel.org/all/[email protected]/
[5] https://lore.kernel.org/all/20240111-unzahl-gefegt-433acb8a841d@brauner/
[6] https://lore.kernel.org/all/[email protected]/
[7] https://lore.kernel.org/all/[email protected]/
Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
Cc: Masami Hiramatsu <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Ajay Kaher <[email protected]>
Fixes: 493ec81a8fb8 ("eventfs: Stop using dcache_readdir() for getdents()")
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-lkp/[email protected]
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
The dentries and inodes are created in the readdir for the sole purpose of
getting a consistent inode number. Linus stated that is unnecessary, and
that all inodes can have the same inode number. For a virtual file system
they are pretty meaningless.
Instead use a single unique inode number for all files and one for all
directories.
Link: https://lore.kernel.org/all/[email protected]/
Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
Cc: Masami Hiramatsu <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Ajay Kaher <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
|
|
Filesystems should use folio->index and folio->mapping, instead of
folio_index(folio), folio_mapping() and folio_file_mapping() since
they know that it's in the pagecache.
Change this automagically with:
perl -p -i -e 's/folio_mapping[(]([^)]*)[)]/\1->mapping/g' fs/erofs/*.c
perl -p -i -e 's/folio_file_mapping[(]([^)]*)[)]/\1->mapping/g' fs/erofs/*.c
perl -p -i -e 's/folio_index[(]([^)]*)[)]/\1->index/g' fs/erofs/*.c
Reported-by: Matthew Wilcox <[email protected]>
Signed-off-by: David Howells <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
Cc: Chao Yu <[email protected]>
Cc: Yue Hu <[email protected]>
Cc: Jeffle Xu <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Gao Xiang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
__dentry_leases_walk() gets a callback and calls it for
a bunch of denties; there are exactly two callers and
we already have a flag telling them apart - lwc->dir_lease.
Seeing that indirect calls are costly these days, let's
get rid of the callback and just call the right function
directly. Has a side benefit of saner signatures...
[ xiubli: a minor fix in the commit title ]
Signed-off-by: Al Viro <[email protected]>
Reviewed-by: Xiubo Li <[email protected]>
Signed-off-by: Ilya Dryomov <[email protected]>
|
|
Clean up the code.
Signed-off-by: Al Viro <[email protected]>
Reviewed-by: Xiubo Li <[email protected]>
Signed-off-by: Ilya Dryomov <[email protected]>
|
|
This issue is reported by smatch that get_quota_realm() might return
ERR_PTR but we did not handle it. It's not a immediate bug, while we
still should address it to avoid potential bugs if get_quota_realm()
is changed to return other ERR_PTR in future.
Set ceph_snap_realm's pointer in get_quota_realm()'s to address this
issue, the pointer would be set to NULL if get_quota_realm() failed
to get struct ceph_snap_realm, so no ERR_PTR would happen any more.
[ xiubli: minor code style clean up ]
Signed-off-by: Wenchao Hao <[email protected]>
Reviewed-by: Xiubo Li <[email protected]>
Signed-off-by: Ilya Dryomov <[email protected]>
|
|
When allocating an osd request the libceph.ko will add the
'read_from_replica' flag by default.
Signed-off-by: Xiubo Li <[email protected]>
Reviewed-by: Ilya Dryomov <[email protected]>
Signed-off-by: Ilya Dryomov <[email protected]>
|
|
Update the oldest_client_tid via the session renew caps msg to
make sure that the MDSs won't pile up the completed request list
in a very large size.
[ idryomov: drop inapplicable comment ]
Link: https://tracker.ceph.com/issues/63364
Signed-off-by: Xiubo Li <[email protected]>
Reviewed-by: Ilya Dryomov <[email protected]>
Signed-off-by: Ilya Dryomov <[email protected]>
|
|
Makes the create session msg helper to be more general and could
be used by other ops.
Signed-off-by: Xiubo Li <[email protected]>
Reviewed-by: Ilya Dryomov <[email protected]>
Signed-off-by: Ilya Dryomov <[email protected]>
|
|
The kconfig options for filesystems that support FS_ENCRYPTION are
supposed to select FS_ENCRYPTION_ALGS. This is needed to ensure that
required crypto algorithms get enabled as loadable modules or builtin as
is appropriate for the set of enabled filesystems. Do this for CEPH_FS
so that there aren't any missing algorithms if someone happens to have
CEPH_FS as their only enabled filesystem that supports encryption.
Cc: [email protected]
Fixes: f061feda6c54 ("ceph: add fscrypt ioctls and ceph.fscrypt.auth vxattr")
Signed-off-by: Eric Biggers <[email protected]>
Reviewed-by: Xiubo Li <[email protected]>
Signed-off-by: Ilya Dryomov <[email protected]>
|
|
The lock order is incorrect between denty and its parent, we should
always make sure that the parent get the lock first.
But since this deadcode is never used and the parent dir will always
be set from the callers, let's just remove it.
Link: https://lore.kernel.org/r/20231116081919.GZ1957730@ZenIV
Reported-by: Al Viro <[email protected]>
Signed-off-by: Xiubo Li <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
Signed-off-by: Ilya Dryomov <[email protected]>
|
|
In fscrypt case and for a smaller read length we can predict the
max count of the extent map. And for small read length use cases
this could save some memories.
[ idryomov: squash into a single patch to avoid build break, drop
redundant variable in ceph_alloc_sparse_ext_map() ]
Signed-off-by: Xiubo Li <[email protected]>
Reviewed-by: Ilya Dryomov <[email protected]>
Signed-off-by: Ilya Dryomov <[email protected]>
|
|
Following along the same lines as per the user-space fix. Right
now this isn't really an issue with the ceph kernel driver because
of the feature bit laginess, however, that can change over time
(when the new snaprealm info type is ported to the kernel driver)
and depending on the MDS version that's being upgraded can cause
message decoding issues - so, fix that early on.
Link: http://tracker.ceph.com/issues/63188
Signed-off-by: Venky Shankar <[email protected]>
Reviewed-by: Xiubo Li <[email protected]>
Signed-off-by: Ilya Dryomov <[email protected]>
|
|
When MDS closed the session the kclient will send to reconnect to
it immediately, but if the MDS just restarted and still not ready
yet, such as still in the up:replay state and the sessionmap journal
logs hasn't be replayed, the MDS will close the session.
And then the kclient could remove the session and later when the
mdsmap is in RECONNECT phrase it will skip reconnecting. But the MDS
will wait until timeout and then evict the kclient.
Just skip sending the reconnection request until the MDS is ready.
Link: https://tracker.ceph.com/issues/62489
Signed-off-by: Xiubo Li <[email protected]>
Reviewed-by: Ilya Dryomov <[email protected]>
Signed-off-by: Ilya Dryomov <[email protected]>
|
|
When smb2 leases is disable, ksmbd can send oplock break notification
and cause wait oplock break ack timeout. It may appear like hang when
accessing a directory. This patch make only v2 leases handle the
directory.
Cc: [email protected]
Signed-off-by: Namjae Jeon <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
The race is between the handling of a new TCP connection and
its disconnection. It leads to UAF on `struct tcp_transport` in
ksmbd_tcp_new_connection() function.
Cc: [email protected]
Reported-by: [email protected] # ZDI-CAN-22991
Signed-off-by: Namjae Jeon <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
If client send invalid mech token in session setup request, ksmbd
validate and make the error if it is invalid.
Cc: [email protected]
Reported-by: [email protected] # ZDI-CAN-22890
Signed-off-by: Namjae Jeon <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
EROFS can select compression algorithms on a per-file basis, and each
per-file compression algorithm needs to be marked in the on-disk
superblock for initialization.
However, syzkaller can generate inconsistent crafted images that use
an unsupported algorithmtype for specific inodes, e.g. use MicroLZMA
algorithmtype even it's not set in `sbi->available_compr_algs`. This
can lead to an unexpected "BUG: kernel NULL pointer dereference" if
the corresponding decompressor isn't built-in.
Fix this by checking against `sbi->available_compr_algs` for each
m_algorithmformat request. Incorrect !erofs_sb_has_compr_cfgs preset
bitmap is now fixed together since it was harmless previously.
Reported-by: <[email protected]>
Fixes: 8f89926290c4 ("erofs: get compression algorithms directly on mapping")
Fixes: 622ceaddb764 ("erofs: lzma compression support")
Reviewed-by: Yue Hu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Gao Xiang <[email protected]>
|
|
Linus pointed out that there's error handling and naming issues in the
that we should rewrite:
* Perform the access checks for the buffer before actually doing any
work instead of doing it during the iteration.
* Rename the arguments to listmount() and do_listmount() to clarify what
the arguments are used for.
* Get rid of the pointless ctr variable and overflow checking.
* Get rid of the pointless speculation check.
Link: https://lore.kernel.org/r/CAHk-=wjh6Cypo8WC-McXgSzCaou3UXccxB+7PVeSuGR8AjCphg@mail.gmail.com
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
|
kill_f2fs_super() is called even if f2fs_fill_super() fails.
f2fs_fill_super() frees the struct f2fs_sb_info, so it must set
sb->s_fs_info to NULL to prevent it from being freed again.
Fixes: 275dca4630c1 ("f2fs: move release of block devices to after kill_block_super()")
Reported-by: <[email protected]>
Closes: https://lore.kernel.org/lkml/[email protected]
Reviewed-by: Chao Yu <[email protected]>
Link: https://lore.kernel.org/linux-f2fs-devel/[email protected]
Signed-off-by: Eric Biggers <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
Pull exfat updates from Namjae Jeon:
- Replace the internal table lookup algorithm with the hweight library
and ffs of the bitops library.
- Handle the two types of stream entry, valid data size (has been
written) and data size separately. It improves compatibility with two
differently sized files created on Windows.
* tag 'exfat-for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
exfat: do not zero the extended part
exfat: change to get file size from DataLength
exfat: using ffs instead of internal logic
exfat: using hweight instead of internal logic
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull bcachefs locking fix from Al Viro:
"Fix broken locking in bch2_ioctl_subvolume_destroy()"
* tag 'pull-bcachefs-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
bch2_ioctl_subvolume_destroy(): fix locking
new helper: user_path_locked_at()
|
|
Move mmu notification mechanism inside mm lock to prevent race condition
in other components which depend on it. The notifier will invalidate
memory range. Depending upon the number of iterations, different memory
ranges would be invalidated.
The following warning would be removed by this patch:
WARNING: CPU: 0 PID: 5067 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:734 kvm_mmu_notifier_change_pte+0x860/0x960 arch/x86/kvm/../../../virt/kvm/kvm_main.c:734
There is no behavioural and performance change with this patch when
there is no component registered with the mmu notifier.
[[email protected]: narrow the scope of `range', per Sean]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 52526ca7fdb9 ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
Signed-off-by: Muhammad Usama Anjum <[email protected]>
Reported-by: [email protected]
Link: https://lore.kernel.org/all/[email protected]/
Reviewed-by: Sean Christopherson <[email protected]>
Cc: Andrei Vagin <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Liam R. Howlett <[email protected]>
Cc: Michał Mirosław <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs update from Jaegeuk Kim:
"In this series, we've some progress to support Zoned block device
regarding to the power-cut recovery flow and enabling
checkpoint=disable feature which is essential for Android OTA.
Other than that, some patches touched sysfs entries and tracepoints
which are minor, while several bug fixes on error handlers and
compression flows are good to improve the overall stability.
Enhancements:
- enable checkpoint=disable for zoned block device
- sysfs entries such as discard status, discard_io_aware, dir_level
- tracepoints such as f2fs_vm_page_mkwrite(), f2fs_rename(),
f2fs_new_inode()
- use shared inode lock during f2fs_fiemap() and f2fs_seek_block()
Bug fixes:
- address some power-cut recovery issues on zoned block device
- handle errors and logics on do_garbage_collect(),
f2fs_reserve_new_block(), f2fs_move_file_range(),
f2fs_recover_xattr_data()
- don't set FI_PREALLOCATED_ALL for partial write
- fix to update iostat correctly in f2fs_filemap_fault()
- fix to wait on block writeback for post_read case
- fix to tag gcing flag on page during block migration
- restrict max filesize for 16K f2fs
- fix to avoid dirent corruption
- explicitly null-terminate the xattr list
There are also several clean-up patches to remove dead codes and
better readability"
* tag 'f2fs-for-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (33 commits)
f2fs: show more discard status by sysfs
f2fs: Add error handling for negative returns from do_garbage_collect
f2fs: Constrain the modification range of dir_level in the sysfs
f2fs: Use wait_event_freezable_timeout() for freezable kthread
f2fs: fix to check return value of f2fs_recover_xattr_data
f2fs: don't set FI_PREALLOCATED_ALL for partial write
f2fs: fix to update iostat correctly in f2fs_filemap_fault()
f2fs: fix to check compress file in f2fs_move_file_range()
f2fs: fix to wait on block writeback for post_read case
f2fs: fix to tag gcing flag on page during block migration
f2fs: add tracepoint for f2fs_vm_page_mkwrite()
f2fs: introduce f2fs_invalidate_internal_cache() for cleanup
f2fs: update blkaddr in __set_data_blkaddr() for cleanup
f2fs: introduce get_dnode_addr() to clean up codes
f2fs: delete obsolete FI_DROP_CACHE
f2fs: delete obsolete FI_FIRST_BLOCK_WRITTEN
f2fs: Restrict max filesize for 16K f2fs
f2fs: let's finish or reset zones all the time
f2fs: check write pointers when checkpoint=disable
f2fs: fix write pointers on zoned device after roll forward
...
|
|
Pull smb server updates from Steve French:
- memory allocation fix
- three lease fixes, including important rename fix
- read only share fix
- thread freeze fix
- three cleanup fixes (two kernel doc related)
- locking fix in setting EAs
- packet header validation fix
* tag '6.8-rc-smb-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: Add missing set_freezable() for freezable kthread
ksmbd: free ppace array on error in parse_dacl
ksmbd: send lease break notification on FILE_RENAME_INFORMATION
ksmbd: don't allow O_TRUNC open on read-only share
ksmbd: vfs: fix all kernel-doc warnings
ksmbd: auth: fix most kernel-doc warnings
ksmbd: Remove usage of the deprecated ida_simple_xx() API
ksmbd: don't increment epoch if current state and request state are same
ksmbd: fix potential circular locking issue in smb2_set_ea()
ksmbd: set v2 lease version on lease upgrade
ksmbd: validate the zero field of packet header
|
|
Pull misc filesystem updates from Al Viro:
"Misc cleanups (the part that hadn't been picked by individual fs
trees)"
* tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
apparmorfs: don't duplicate kfree_link()
orangefs: saner arguments passing in readdir guts
ocfs2_find_match(): there's no such thing as NULL or negative ->d_parent
reiserfs_add_entry(): get rid of pointless namelen checks
__ocfs2_add_entry(), ocfs2_prepare_dir_for_insert(): namelen checks
ext4_add_entry(): ->d_name.len is never 0
befs: d_obtain_alias(ERR_PTR(...)) will do the right thing
affs: d_obtain_alias(ERR_PTR(...)) will do the right thing
/proc/sys: use d_splice_alias() calling conventions to simplify failure exits
hostfs: use d_splice_alias() calling conventions to simplify failure exits
udf_fiiter_add_entry(): check for zero ->d_name.len is bogus...
udf: d_obtain_alias(ERR_PTR(...)) will do the right thing...
udf: d_splice_alias() will do the right thing on ERR_PTR() inode
nfsd: kill stale comment about simple_fill_super() requirements
bfs_add_entry(): get rid of pointless ->d_name.len checks
nilfs2: d_obtain_alias(ERR_PTR(...)) will do the right thing...
zonefs: d_splice_alias() will do the right thing on ERR_PTR() inode
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull dcache updates from Al Viro:
"Change of locking rules for __dentry_kill(), regularized refcounting
rules in that area, assorted cleanups and removal of weird corner
cases (e.g. now ->d_iput() on child is always called before the parent
might hit __dentry_kill(), etc)"
* tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
dcache: remove unnecessary NULL check in dget_dlock()
kill DCACHE_MAY_FREE
__d_unalias() doesn't use inode argument
d_alloc_parallel(): in-lookup hash insertion doesn't need an RCU variant
get rid of DCACHE_GENOCIDE
d_genocide(): move the extern into fs/internal.h
simple_fill_super(): don't bother with d_genocide() on failure
nsfs: use d_make_root()
d_alloc_pseudo(): move setting ->d_op there from the (sole) caller
kill d_instantate_anon(), fold __d_instantiate_anon() into remaining caller
retain_dentry(): introduce a trimmed-down lockless variant
__dentry_kill(): new locking scheme
d_prune_aliases(): use a shrink list
switch select_collect{,2}() to use of to_shrink_list()
to_shrink_list(): call only if refcount is 0
fold dentry_kill() into dput()
don't try to cut corners in shrink_lock_dentry()
fold the call of retain_dentry() into fast_dput()
Call retain_dentry() with refcount 0
dentry_kill(): don't bother with retain_dentry() on slow path
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull rename updates from Al Viro:
"Fix directory locking scheme on rename
This was broken in 6.5; we really can't lock two unrelated directories
without holding ->s_vfs_rename_mutex first and in case of same-parent
rename of a subdirectory 6.5 ends up doing just that"
* tag 'pull-rename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
rename(): avoid a deadlock in the case of parents having no common ancestor
kill lock_two_inodes()
rename(): fix the locking of subdirectories
f2fs: Avoid reading renamed directory if parent does not change
ext4: don't access the source subdirectory content on same-directory rename
ext2: Avoid reading renamed directory if parent does not change
udf_rename(): only access the child content on cross-directory rename
ocfs2: Avoid touching renamed directory if parent does not change
reiserfs: Avoid touching renamed directory if parent does not change
|
|
Pull minixfs updates from Al Viro:
"minixfs kmap_local_page() switchover and related fixes - very similar
to sysv series"
* tag 'pull-minix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
minixfs: switch to kmap_local_page()
minixfs: Use dir_put_page() in minix_unlink() and minix_rename()
minixfs: change the signature of dir_get_page()
minixfs: use offset_in_page()
|
|
Pull documentation update from Jonathan Corbet:
"Another moderately busy cycle for documentation, including:
- The minimum Sphinx requirement has been raised to 2.4.4, following
a warning that was added in 6.2
- Some reworking of the Documentation/process front page to,
hopefully, make it more useful
- Various kernel-doc tweaks to, for example, make it deal properly
with __counted_by annotations
- We have also restored a warning for documentation of nonexistent
structure members that disappeared a while back. That had the
delightful consequence of adding some 600 warnings to the docs
build. A sustained effort by Randy, Vegard, and myself has
addressed almost all of those, bringing the documentation back into
sync with the code. The fixes are going through the appropriate
maintainer trees
- Various improvements to the HTML rendered docs, including automatic
links to Git revisions and a nice new pulldown to make translations
easy to access
- Speaking of translations, more of those for Spanish and Chinese
... plus the usual stream of documentation updates and typo fixes"
* tag 'docs-6.8' of git://git.lwn.net/linux: (57 commits)
MAINTAINERS: use tabs for indent of CONFIDENTIAL COMPUTING THREAT MODEL
A reworked process/index.rst
ring-buffer/Documentation: Add documentation on buffer_percent file
Translated the RISC-V architecture boot documentation.
Docs: remove mentions of fdformat from util-linux
Docs/zh_CN: Fix the meaning of DEBUG to pr_debug()
Documentation: move driver-api/dcdbas to userspace-api/
Documentation: move driver-api/isapnp to userspace-api/
Documentation/core-api : fix typo in workqueue
Documentation/trace: Fixed typos in the ftrace FLAGS section
kernel-doc: handle a void function without producing a warning
scripts/get_abi.pl: ignore some temp files
docs: kernel_abi.py: fix command injection
scripts/get_abi: fix source path leak
CREDITS, MAINTAINERS, docs/process/howto: Update man-pages' maintainer
docs: translations: add translations links when they exist
kernel-doc: Align quick help and the code
MAINTAINERS: add reviewer for Spanish translations
docs: ignore __counted_by attribute in structure definitions
scripts: kernel-doc: Clarify missing struct member description
..
|
|
Pull block updates from Jens Axboe:
"Pretty quiet round this time around. This contains:
- NVMe updates via Keith:
- nvme fabrics spec updates (Guixin, Max)
- nvme target udpates (Guixin, Evan)
- nvme attribute refactoring (Daniel)
- nvme-fc numa fix (Keith)
- MD updates via Song:
- Fix/Cleanup RCU usage from conf->disks[i].rdev (Yu Kuai)
- Fix raid5 hang issue (Junxiao Bi)
- Add Yu Kuai as Reviewer of the md subsystem
- Remove deprecated flavors (Song Liu)
- raid1 read error check support (Li Nan)
- Better handle events off-by-1 case (Alex Lyakas)
- Efficiency improvements for passthrough (Kundan)
- Support for mapping integrity data directly (Keith)
- Zoned write fix (Damien)
- rnbd fixes (Kees, Santosh, Supriti)
- Default to a sane discard size granularity (Christoph)
- Make the default max transfer size naming less confusing
(Christoph)
- Remove support for deprecated host aware zoned model (Christoph)
- Misc fixes (me, Li, Matthew, Min, Ming, Randy, liyouhong, Daniel,
Bart, Christoph)"
* tag 'for-6.8/block-2024-01-08' of git://git.kernel.dk/linux: (78 commits)
block: Treat sequential write preferred zone type as invalid
block: remove disk_clear_zoned
sd: remove the !ZBC && blk_queue_is_zoned case in sd_read_block_characteristics
drivers/block/xen-blkback/common.h: Fix spelling typo in comment
blk-cgroup: fix rcu lockdep warning in blkg_lookup()
blk-cgroup: don't use removal safe list iterators
block: floor the discard granularity to the physical block size
mtd_blkdevs: use the default discard granularity
bcache: use the default discard granularity
zram: use the default discard granularity
null_blk: use the default discard granularity
nbd: use the default discard granularity
ubd: use the default discard granularity
block: default the discard granularity to sector size
bcache: discard_granularity should not be smaller than a sector
block: remove two comments in bio_split_discard
block: rename and document BLK_DEF_MAX_SECTORS
loop: don't abuse BLK_DEF_MAX_SECTORS
aoe: don't abuse BLK_DEF_MAX_SECTORS
null_blk: don't cap max_hw_sectors to BLK_DEF_MAX_SECTORS
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"The most interesting thing is probably the networking structs
reorganization and a significant amount of changes is around
self-tests.
Core & protocols:
- Analyze and reorganize core networking structs (socks, netdev,
netns, mibs) to optimize cacheline consumption and set up build
time warnings to safeguard against future header changes
This improves TCP performances with many concurrent connections up
to 40%
- Add page-pool netlink-based introspection, exposing the memory
usage and recycling stats. This helps indentify bad PP users and
possible leaks
- Refine TCP/DCCP source port selection to no longer favor even
source port at connect() time when IP_LOCAL_PORT_RANGE is set. This
lowers the time taken by connect() for hosts having many active
connections to the same destination
- Refactor the TCP bind conflict code, shrinking related socket
structs
- Refactor TCP SYN-Cookie handling, as a preparation step to allow
arbitrary SYN-Cookie processing via eBPF
- Tune optmem_max for 0-copy usage, increasing the default value to
128KB and namespecifying it
- Allow coalescing for cloned skbs coming from page pools, improving
RX performances with some common configurations
- Reduce extension header parsing overhead at GRO time
- Add bridge MDB bulk deletion support, allowing user-space to
request the deletion of matching entries
- Reorder nftables struct members, to keep data accessed by the
datapath first
- Introduce TC block ports tracking and use. This allows supporting
multicast-like behavior at the TC layer
- Remove UAPI support for retired TC qdiscs (dsmark, CBQ and ATM) and
classifiers (RSVP and tcindex)
- More data-race annotations
- Extend the diag interface to dump TCP bound-only sockets
- Conditional notification of events for TC qdisc class and actions
- Support for WPAN dynamic associations with nearby devices, to form
a sub-network using a specific PAN ID
- Implement SMCv2.1 virtual ISM device support
- Add support for Batman-avd mulicast packet type
BPF:
- Tons of verifier improvements:
- BPF register bounds logic and range support along with a large
test suite
- log improvements
- complete precision tracking support for register spills
- track aligned STACK_ZERO cases as imprecise spilled registers.
This improves the verifier "instructions processed" metric from
single digit to 50-60% for some programs
- support for user's global BPF subprogram arguments with few
commonly requested annotations for a better developer
experience
- support tracking of BPF_JNE which helps cases when the compiler
transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the
like
- several fixes
- Add initial TX metadata implementation for AF_XDP with support in
mlx5 and stmmac drivers. Two types of offloads are supported right
now, that is, TX timestamp and TX checksum offload
- Fix kCFI bugs in BPF all forms of indirect calls from BPF into
kernel and from kernel into BPF work with CFI enabled. This allows
BPF to work with CONFIG_FINEIBT=y
- Change BPF verifier logic to validate global subprograms lazily
instead of unconditionally before the main program, so they can be
guarded using BPF CO-RE techniques
- Support uid/gid options when mounting bpffs
- Add a new kfunc which acquires the associated cgroup of a task
within a specific cgroup v1 hierarchy where the latter is
identified by its id
- Extend verifier to allow bpf_refcount_acquire() of a map value
field obtained via direct load which is a use-case needed in
sched_ext
- Add BPF link_info support for uprobe multi link along with bpftool
integration for the latter
- Support for VLAN tag in XDP hints
- Remove deprecated bpfilter kernel leftovers given the project is
developed in user-space (https://github.com/facebook/bpfilter)
Misc:
- Support for parellel TC self-tests execution
- Increase MPTCP self-tests coverage
- Updated the bridge documentation, including several so-far
undocumented features
- Convert all the net self-tests to run in unique netns, to avoid
random failures due to conflict and allow concurrent runs
- Add TCP-AO self-tests
- Add kunit tests for both cfg80211 and mac80211
- Autogenerate Netlink families documentation from YAML spec
- Add yml-gen support for fixed headers and recursive nests, the tool
can now generate user-space code for all genetlink families for
which we have specs
- A bunch of additional module descriptions fixes
- Catch incorrect freeing of pages belonging to a page pool
Driver API:
- Rust abstractions for network PHY drivers; do not cover yet the
full C API, but already allow implementing functional PHY drivers
in rust
- Introduce queue and NAPI support in the netdev Netlink interface,
allowing complete access to the device <> NAPIs <> queues
relationship
- Introduce notifications filtering for devlink to allow control
application scale to thousands of instances
- Improve PHY validation, requesting rate matching information for
each ethtool link mode supported by both the PHY and host
- Add support for ethtool symmetric-xor RSS hash
- ACPI based Wifi band RFI (WBRF) mitigation feature for the AMD
platform
- Expose pin fractional frequency offset value over new DPLL generic
netlink attribute
- Convert older drivers to platform remove callback returning void
- Add support for PHY package MMD read/write
New hardware / drivers:
- Ethernet:
- Octeon CN10K devices
- Broadcom 5760X P7
- Qualcomm SM8550 SoC
- Texas Instrument DP83TG720S PHY
- Bluetooth:
- IMC Networks Bluetooth radio
Removed:
- WiFi:
- libertas 16-bit PCMCIA support
- Atmel at76c50x drivers
- HostAP ISA/PCMCIA style 802.11b driver
- zd1201 802.11b USB dongles
- Orinoco ISA/PCMCIA 802.11b driver
- Aviator/Raytheon driver
- Planet WL3501 driver
- RNDIS USB 802.11b driver
Driver updates:
- Ethernet high-speed NICs:
- Intel (100G, ice, idpf):
- allow one by one port representors creation and removal
- add temperature and clock information reporting
- add get/set for ethtool's header split ringparam
- add again FW logging
- adds support switchdev hardware packet mirroring
- iavf: implement symmetric-xor RSS hash
- igc: add support for concurrent physical and free-running
timers
- i40e: increase the allowable descriptors
- nVidia/Mellanox:
- Preparation for Socket-Direct multi-dev netdev. That will
allow in future releases combining multiple PFs devices
attached to different NUMA nodes under the same netdev
- Broadcom (bnxt):
- TX completion handling improvements
- add basic ntuple filter support
- reduce MSIX vectors usage for MQPRIO offload
- add VXLAN support, USO offload and TX coalesce completion
for P7
- Marvell Octeon EP:
- xmit-more support
- add PF-VF mailbox support and use it for FW notifications
for VFs
- Wangxun (ngbe/txgbe):
- implement ethtool functions to operate pause param, ring
param, coalesce channel number and msglevel
- Netronome/Corigine (nfp):
- add flow-steering support
- support UDP segmentation offload
- Ethernet NICs embedded, slower, virtual:
- Xilinx AXI: remove duplicate DMA code adopting the dma engine
driver
- stmmac: add support for HW-accelerated VLAN stripping
- TI AM654x sw: add mqprio, frame preemption & coalescing
- gve: add support for non-4k page sizes.
- virtio-net: support dynamic coalescing moderation
- nVidia/Mellanox Ethernet datacenter switches:
- allow firmware upgrade without a reboot
- more flexible support for bridge flooding via the compressed
FID flooding mode
- Ethernet embedded switches:
- Microchip:
- fine-tune flow control and speed configurations in KSZ8xxx
- KSZ88X3: enable setting rmii reference
- Renesas:
- add jumbo frames support
- Marvell:
- 88E6xxx: add "eth-mac" and "rmon" stats support
- Ethernet PHYs:
- aquantia: add firmware load support
- at803x: refactor the driver to simplify adding support for more
chip variants
- NXP C45 TJA11xx: Add MACsec offload support
- Wifi:
- MediaTek (mt76):
- NVMEM EEPROM improvements
- mt7996 Extremely High Throughput (EHT) improvements
- mt7996 Wireless Ethernet Dispatcher (WED) support
- mt7996 36-bit DMA support
- Qualcomm (ath12k):
- support for a single MSI vector
- WCN7850: support AP mode
- Intel (iwlwifi):
- new debugfs file fw_dbg_clear
- allow concurrent P2P operation on DFS channels
- Bluetooth:
- QCA2066: support HFP offload
- ISO: more broadcast-related improvements
- NXP: better recovery in case receiver/transmitter get out of sync"
* tag 'net-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1714 commits)
lan78xx: remove redundant statement in lan78xx_get_eee
lan743x: remove redundant statement in lan743x_ethtool_get_eee
bnxt_en: Fix RCU locking for ntuple filters in bnxt_rx_flow_steer()
bnxt_en: Fix RCU locking for ntuple filters in bnxt_srxclsrldel()
bnxt_en: Remove unneeded variable in bnxt_hwrm_clear_vnic_filter()
tcp: Revert no longer abort SYN_SENT when receiving some ICMP
Revert "mlx5 updates 2023-12-20"
Revert "net: stmmac: Enable Per DMA Channel interrupt"
ipvlan: Remove usage of the deprecated ida_simple_xx() API
ipvlan: Fix a typo in a comment
net/sched: Remove ipt action tests
net: stmmac: Use interrupt mode INTM=1 for per channel irq
net: stmmac: Add support for TX/RX channel interrupt
net: stmmac: Make MSI interrupt routine generic
dt-bindings: net: snps,dwmac: per channel irq
net: phy: at803x: make read_status more generic
net: phy: at803x: add support for cdt cross short test for qca808x
net: phy: at803x: refactor qca808x cable test get status function
net: phy: at803x: generalize cdt fault length function
net: ethernet: cortina: Drop TSO support
...
|
|
This reverts commit dad3fb67ca1cbef87ce700e83a55835e5921ce8a.
The commit converted kernfs_idr_lock to an IRQ-safe raw_spinlock because it
could be acquired while holding an rq lock through bpf_cgroup_from_id().
However, kernfs_idr_lock is held while doing GPF_NOWAIT allocations which
involves acquiring an non-IRQ-safe and non-raw lock leading to the following
lockdep warning:
=============================
[ BUG: Invalid wait context ]
6.7.0-rc5-kzm9g-00251-g655022a45b1c #578 Not tainted
-----------------------------
swapper/0/0 is trying to lock:
dfbcd488 (&c->lock){....}-{3:3}, at: local_lock_acquire+0x0/0xa4
other info that might help us debug this:
context-{5:5}
2 locks held by swapper/0/0:
#0: dfbc9c60 (lock){+.+.}-{3:3}, at: local_lock_acquire+0x0/0xa4
#1: c0c012a8 (kernfs_idr_lock){....}-{2:2}, at: __kernfs_new_node.constprop.0+0x68/0x258
stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.7.0-rc5-kzm9g-00251-g655022a45b1c #578
Hardware name: Generic SH73A0 (Flattened Device Tree)
unwind_backtrace from show_stack+0x10/0x14
show_stack from dump_stack_lvl+0x68/0x90
dump_stack_lvl from __lock_acquire+0x3cc/0x168c
__lock_acquire from lock_acquire+0x274/0x30c
lock_acquire from local_lock_acquire+0x28/0xa4
local_lock_acquire from ___slab_alloc+0x234/0x8a8
___slab_alloc from __slab_alloc.constprop.0+0x30/0x44
__slab_alloc.constprop.0 from kmem_cache_alloc+0x7c/0x148
kmem_cache_alloc from radix_tree_node_alloc.constprop.0+0x44/0xdc
radix_tree_node_alloc.constprop.0 from idr_get_free+0x110/0x2b8
idr_get_free from idr_alloc_u32+0x9c/0x108
idr_alloc_u32 from idr_alloc_cyclic+0x50/0xb8
idr_alloc_cyclic from __kernfs_new_node.constprop.0+0x88/0x258
__kernfs_new_node.constprop.0 from kernfs_create_root+0xbc/0x154
kernfs_create_root from sysfs_init+0x18/0x5c
sysfs_init from mnt_init+0xc4/0x220
mnt_init from vfs_caches_init+0x6c/0x88
vfs_caches_init from start_kernel+0x474/0x528
start_kernel from 0x0
Let's rever the commit. It's undesirable to spread out raw spinlock usage
anyway and the problem can be solved by protecting the lookup path with RCU
instead.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Andrea Righi <[email protected]>
Reported-by: Geert Uytterhoeven <[email protected]>
Link: http://lkml.kernel.org/r/CAMuHMdV=AKt+mwY7svEq5gFPx41LoSQZ_USME5_MEdWQze13ww@mail.gmail.com
Link: https://lore.kernel.org/r/[email protected]
Tested-by: Andrea Righi <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
We're only allocating from the realtime device if the inode is marked
for realtime and we're /not/ allocating into the attr fork.
Fixes: 58643460546d ("xfs: also use xfs_bmap_btalloc_accounting for RT allocations")
Signed-off-by: "Darrick J. Wong" <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Chandan Babu R <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull sysctl updates from Luis Chamberlain:
"To help make the move of sysctls out of kernel/sysctl.c not incur a
size penalty sysctl has been changed to allow us to not require the
sentinel, the final empty element on the sysctl array. Joel Granados
has been doing all this work.
In the v6.6 kernel we got the major infrastructure changes required to
support this. For v6.7 we had all arch/ and drivers/ modified to
remove the sentinel. For v6.8-rc1 we get a few more updates for fs/
directory only.
The kernel/ directory is left but we'll save that for v6.9-rc1 as
those patches are still being reviewed. After that we then can expect
also the removal of the no longer needed check for procname == NULL.
Let us recap the purpose of this work:
- this helps reduce the overall build time size of the kernel and run
time memory consumed by the kernel by about ~64 bytes per array
- the extra 64-byte penalty is no longer inncurred now when we move
sysctls out from kernel/sysctl.c to their own files
Thomas Weißschuh also sent a few cleanups, for v6.9-rc1 we expect to
see further work by Thomas Weißschuh with the constificatin of the
struct ctl_table.
Due to Joel Granados's work, and to help bring in new blood, I have
suggested for him to become a maintainer and he's accepted. So for
v6.9-rc1 I look forward to seeing him sent you a pull request for
further sysctl changes. This also removes Iurii Zaikin as a maintainer
as he has moved on to other projects and has had no time to help at
all"
* tag 'sysctl-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
sysctl: remove struct ctl_path
sysctl: delete unused define SYSCTL_PERM_EMPTY_DIR
coda: Remove the now superfluous sentinel elements from ctl_table array
sysctl: Remove the now superfluous sentinel elements from ctl_table array
fs: Remove the now superfluous sentinel elements from ctl_table array
cachefiles: Remove the now superfluous sentinel element from ctl_table array
sysclt: Clarify the results of selftest run
sysctl: Add a selftest for handling empty dirs
sysctl: Fix out of bounds access for empty sysctl registers
MAINTAINERS: Add Joel Granados as co-maintainer for proc sysctl
MAINTAINERS: remove Iurii Zaikin from proc sysctl
|
|
Pull header cleanups from Kent Overstreet:
"The goal is to get sched.h down to a type only header, so the main
thing happening in this patchset is splitting out various _types.h
headers and dependency fixups, as well as moving some things out of
sched.h to better locations.
This is prep work for the memory allocation profiling patchset which
adds new sched.h interdepencencies"
* tag 'header_cleanup-2024-01-10' of https://evilpiepirate.org/git/bcachefs: (51 commits)
Kill sched.h dependency on rcupdate.h
kill unnecessary thread_info.h include
Kill unnecessary kernel.h include
preempt.h: Kill dependency on list.h
rseq: Split out rseq.h from sched.h
LoongArch: signal.c: add header file to fix build error
restart_block: Trim includes
lockdep: move held_lock to lockdep_types.h
sem: Split out sem_types.h
uidgid: Split out uidgid_types.h
seccomp: Split out seccomp_types.h
refcount: Split out refcount_types.h
uapi/linux/resource.h: fix include
x86/signal: kill dependency on time.h
syscall_user_dispatch.h: split out *_types.h
mm_types_task.h: Trim dependencies
Split out irqflags_types.h
ipc: Kill bogus dependency on spinlock.h
shm: Slim down dependencies
workqueue: Split out workqueue_types.h
...
|
|
Pull bcachefs updates from Kent Overstreet:
- btree write buffer rewrite: instead of adding keys to the btree write
buffer at transaction commit time, we now journal them with a
different journal entry type and copy them from the journal to the
write buffer just prior to journal write.
This reduces the number of atomic operations on shared cachelines in
the transaction commit path and is a signicant performance
improvement on some workloads: multithreaded 4k random writes went
from ~650k iops to ~850k iops.
- Bring back optimistic spinning for six locks: the new implementation
doesn't use osq locks; instead we add to the lock waitlist as normal,
and then spin on the lock_acquired bit in the waitlist entry, _not_
the lock itself.
- New ioctls:
- BCH_IOCTL_DEV_USAGE_V2, which allows for new data types
- BCH_IOCTL_OFFLINE_FSCK, which runs the kernel implementation of
fsck but without mounting: useful for transparently using the
kernel version of fsck from 'bcachefs fsck' when the kernel
version is a better match for the on disk filesystem.
- BCH_IOCTL_ONLINE_FSCK: online fsck. Not all passes are supported
yet, but the passes that are supported are fully featured - errors
may be corrected as normal.
The new ioctls use the new 'thread_with_file' abstraction for kicking
off a kthread that's tied to a file descriptor returned to userspace
via the ioctl.
- btree_paths within a btree_trans are now dynamically growable,
instead of being limited to 64. This is important for the
check_directory_structure phase of fsck, and also fixes some issues
we were having with btree path overflow in the reflink btree.
- Trigger refactoring; prep work for the upcoming disk space accounting
rewrite
- Numerous bugfixes :)
* tag 'bcachefs-2024-01-10' of https://evilpiepirate.org/git/bcachefs: (226 commits)
bcachefs: eytzinger0_find() search should be const
bcachefs: move "ptrs not changing" optimization to bch2_trigger_extent()
bcachefs: fix simulateously upgrading & downgrading
bcachefs: Restart recovery passes more reliably
bcachefs: bch2_dump_bset() doesn't choke on u64s == 0
bcachefs: improve checksum error messages
bcachefs: improve validate_bset_keys()
bcachefs: print sb magic when relevant
bcachefs: __bch2_sb_field_to_text()
bcachefs: %pg is banished
bcachefs: Improve would_deadlock trace event
bcachefs: fsck_err()s don't need to manually check c->sb.version anymore
bcachefs: Upgrades now specify errors to fix, like downgrades
bcachefs: no thread_with_file in userspace
bcachefs: Don't autofix errors we can't fix
bcachefs: add missing bch2_latency_acct() call
bcachefs: increase max_active on io_complete_wq
bcachefs: add time_stats for btree_node_read_done()
bcachefs: don't clear accessed bit in btree node fill
bcachefs: Add an option to control btree node prefetching
...
|