Age | Commit message (Collapse) | Author | Files | Lines |
|
- locking performance improvement for hidraw code (André Almeida)
|
|
- support for USI style pens (Tero Kristo, Mika Westerberg)
- quirk for devices that need inverted X/Y axes (Alistair Francis)
- small core code cleanups and deduplication (Benjamin Tissoires)
|
|
Merge in fixes directly in prep for the 5.17 merge window.
No conflicts.
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Replace kfree_skb() with kfree_skb_reason() in __udp4_lib_rcv.
New drop reason 'SKB_DROP_REASON_UDP_CSUM' is added for udp csum
error.
Signed-off-by: Menglong Dong <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Replace kfree_skb() with kfree_skb_reason() in tcp_v4_rcv(). Following
drop reasons are added:
SKB_DROP_REASON_NO_SOCKET
SKB_DROP_REASON_PKT_TOO_SMALL
SKB_DROP_REASON_TCP_CSUM
SKB_DROP_REASON_TCP_FILTER
After this patch, 'kfree_skb' event will print message like this:
$ TASK-PID CPU# ||||| TIMESTAMP FUNCTION
$ | | | ||||| | |
<idle>-0 [000] ..s1. 36.113438: kfree_skb: skbaddr=(____ptrval____) protocol=2048 location=(____ptrval____) reason: NO_SOCKET
The reason of skb drop is printed too.
Signed-off-by: Menglong Dong <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Introduce the interface kfree_skb_reason(), which is able to pass
the reason why the skb is dropped to 'kfree_skb' tracepoint.
Add the 'reason' field to 'trace_kfree_skb', therefor user can get
more detail information about abnormal skb with 'drop_monitor' or
eBPF.
All drop reasons are defined in the enum 'skb_drop_reason', and
they will be print as string in 'kfree_skb' tracepoint in format
of 'reason: XXX'.
( Maybe the reasons should be defined in a uapi header file, so that
user space can use them? )
Signed-off-by: Menglong Dong <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Netfilter conntrack maintains NAT flags per connection indicating
whether NAT was configured for the connection. Openvswitch maintains
NAT flags on the per packet flow key ct_state field, indicating
whether NAT was actually executed on the packet.
When a packet misses from tc to ovs the conntrack NAT flags are set.
However, NAT was not necessarily executed on the packet because the
connection's state might still be in NEW state. As such, openvswitch
wrongly assumes that NAT was executed and sets an incorrect flow key
NAT flags.
Fix this, by flagging to openvswitch which NAT was actually done in
act_ct via tc_skb_ext and tc_skb_cb to the openvswitch module, so
the packet flow key NAT flags will be correctly set.
Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct")
Signed-off-by: Paul Blakey <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next. This
includes one patch to update ovs and act_ct to use nf_ct_put() instead
of nf_conntrack_put().
1) Add netns_tracker to nfnetlink_log and masquerade, from Eric Dumazet.
2) Remove redundant rcu read-size lock in nf_tables packet path.
3) Replace BUG() by WARN_ON_ONCE() in nft_payload.
4) Consolidate rule verdict tracing.
5) Replace WARN_ON() by WARN_ON_ONCE() in nf_tables core.
6) Make counter support built-in in nf_tables.
7) Add new field to conntrack object to identify locally generated
traffic, from Florian Westphal.
8) Prevent NAT from shadowing well-known ports, from Florian Westphal.
9) Merge nf_flow_table_{ipv4,ipv6} into nf_flow_table_inet, also from
Florian.
10) Remove redundant pointer in nft_pipapo AVX2 support, from Colin Ian King.
11) Replace opencoded max() in conntrack, from Jiapeng Chong.
12) Update conntrack to use refcount_t API, from Florian Westphal.
13) Move ip_ct_attach indirection into the nf_ct_hook structure.
14) Constify several pointer object in the netfilter codebase,
from Florian Westphal.
15) Tree-wide replacement of nf_conntrack_put() by nf_ct_put(), also
from Florian.
16) Fix egress splat due to incorrect rcu notation, from Florian.
17) Move stateful fields of connlimit, last, quota, numgen and limit
out of the expression data area.
18) Build a blob to represent the ruleset in nf_tables, this is a
requirement of the new register tracking infrastructure.
19) Add NFT_REG32_NUM to define the maximum number of 32-bit registers.
20) Add register tracking infrastructure to skip redundant
store-to-register operations, this includes support for payload,
meta and bitwise expresssions.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next: (32 commits)
netfilter: nft_meta: cancel register tracking after meta update
netfilter: nft_payload: cancel register tracking after payload update
netfilter: nft_bitwise: track register operations
netfilter: nft_meta: track register operations
netfilter: nft_payload: track register operations
netfilter: nf_tables: add register tracking infrastructure
netfilter: nf_tables: add NFT_REG32_NUM
netfilter: nf_tables: add rule blob layout
netfilter: nft_limit: move stateful fields out of expression data
netfilter: nft_limit: rename stateful structure
netfilter: nft_numgen: move stateful fields out of expression data
netfilter: nft_quota: move stateful fields out of expression data
netfilter: nft_last: move stateful fields out of expression data
netfilter: nft_connlimit: move stateful fields out of expression data
netfilter: egress: avoid a lockdep splat
net: prefer nf_ct_put instead of nf_conntrack_put
netfilter: conntrack: avoid useless indirection during conntrack destruction
netfilter: make function op structures const
netfilter: core: move ip_ct_attach indirection to struct nf_ct_hook
netfilter: conntrack: convert to refcount_t api
...
====================
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
include/linux/netfilter_netdev.h:97 suspicious rcu_dereference_check() usage!
2 locks held by sd-resolve/1100:
0: ..(rcu_read_lock_bh){1:3}, at: ip_finish_output2
1: ..(rcu_read_lock_bh){1:3}, at: __dev_queue_xmit
__dev_queue_xmit+0 ..
The helper has two callers, one uses rcu_read_lock, the other
rcu_read_lock_bh(). Annotate the dereference to reflect this.
Fixes: 42df6e1d221dd ("netfilter: Introduce egress hook")
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
nf_ct_put() results in a usesless indirection:
nf_ct_put -> nf_conntrack_put -> nf_conntrack_destroy -> rcu readlock +
indirect call of ct_hooks->destroy().
There are two _put helpers:
nf_ct_put and nf_conntrack_put. The latter is what should be used in
code that MUST NOT cause a linker dependency on the conntrack module
(e.g. calls from core network stack).
Everyone else should call nf_ct_put() instead.
A followup patch will convert a few nf_conntrack_put() calls to
nf_ct_put(), in particular from modules that already have a conntrack
dependency such as act_ct or even nf_conntrack itself.
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
No functional changes, these structures should be const.
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
ip_ct_attach predates struct nf_ct_hook, we can place it there and
remove the exported symbol.
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
Convert nf_conn reference counting from atomic_t to refcount_t based api.
refcount_t api provides more runtime sanity checks and will warn on
certain constructs, e.g. refcount_inc() on a zero reference count, which
usually indicates use-after-free.
For this reason template allocation is changed to init the refcount to
1, the subsequenct add operations are removed.
Likewise, init_conntrack() is changed to set the initial refcount to 1
instead refcount_inc().
This is safe because the new entry is not (yet) visible to other cpus.
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
My patch to rework oabi fcntl64() introduced a harmless
sparse warning when file locking is disabled:
arch/arm/kernel/sys_oabi-compat.c:251:51: sparse: sparse: incorrect type in argument 3 (different address spaces) @@ expected struct flock64 [noderef] __user *user @@ got struct flock64 * @@
arch/arm/kernel/sys_oabi-compat.c:251:51: sparse: expected struct flock64 [noderef] __user *user
arch/arm/kernel/sys_oabi-compat.c:251:51: sparse: got struct flock64 *
arch/arm/kernel/sys_oabi-compat.c:265:55: sparse: sparse: incorrect type in argument 4 (different address spaces) @@ expected struct flock64 [noderef] __user *user @@ got struct flock64 * @@
arch/arm/kernel/sys_oabi-compat.c:265:55: sparse: expected struct flock64 [noderef] __user *user
arch/arm/kernel/sys_oabi-compat.c:265:55: sparse: got struct flock64 *
When file locking is enabled, everything works correctly and the
right data gets passed, but the stub declarations in linux/fs.h
did not get modified when the calling conventions changed in an
earlier patch.
Reported-by: kernel test robot <[email protected]>
Fixes: 7e2d8c29ecdd ("ARM: 9111/1: oabi-compat: rework fcntl64() emulation")
Fixes: a75d30c77207 ("fs/locks: pass kernel struct flock to fcntl_getlk/setlk")
Cc: Christoph Hellwig <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
|
|
Move the 'inline' keyword to the front of 'void'.
Remove a warning found by clang(make W=1 LLVM=1)
./include/linux/blk-mq.h:259:1: warning: ‘inline’ is not at beginning of
declaration
Reported-by: Abaci Robot <[email protected]>
Signed-off-by: Yang Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
If something went wrong during the TPM firmware upgrade, like power
failure or the firmware image file get corrupted, the TPM might end
up in Upgrade or Failure mode upon the next start. The state is
persistent between the TPM power cycle/restart.
According to TPM specification:
* If the TPM is in Upgrade mode, it will answer with TPM2_RC_UPGRADE
to all commands except TPM2_FieldUpgradeData(). It may also accept
other commands if it is able to complete them using the previously
installed firmware.
* If the TPM is in Failure mode, it will allow performing TPM
initialization but will not provide any crypto operations.
Will happily respond to Field Upgrade calls.
Change the behavior of the tpm2_auto_startup(), so it detects the active
running mode of the TPM by adding the following checks. If
tpm2_do_selftest() call returns TPM2_RC_UPGRADE, the TPM is in Upgrade
mode.
If the TPM is in Failure mode, it will successfully respond to both
tpm2_do_selftest() and tpm2_startup() calls. Although, will fail to
answer to tpm2_get_cc_attrs_tbl(). Use this fact to conclude that TPM is
in Failure mode.
If detected that the TPM is in the Upgrade or Failure mode, the function
sets TPM_CHIP_FLAG_FIRMWARE_UPGRADE_MODE flag.
The TPM_CHIP_FLAG_FIRMWARE_UPGRADE_MODE flag is used later during driver
initialization/deinitialization to disable functionality which makes no
sense or will fail in the current TPM state. Following functionality is
affected:
* Do not register TPM as a hwrng
* Do not register sysfs entries which provide information impossible to
obtain in limited mode
* Do not register resource managed character device
Signed-off-by: axelj <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
Signed-off-by: Jarkko Sakkinen <[email protected]>
|
|
NFSv4.1 supports an optional lock notification feature which notifies
the client when a lock comes available. (Normally NFSv4 clients just
poll for locks if necessary.) To make that work, we need to request a
blocking lock from the filesystem.
We turned that off for NFS in commit f657f8eef3ff ("nfs: don't atempt
blocking locks on nfs reexports") [sic] because it actually blocks the
nfsd thread while waiting for the lock.
Thanks to Vasily Averin for pointing out that NFS isn't the only
filesystem with that problem.
Any filesystem that leaves ->lock NULL will use posix_lock_file(), which
does the right thing. Simplest is just to assume that any filesystem
that defines its own ->lock is not safe to request a blocking lock from.
So, this patch mostly reverts commit f657f8eef3ff ("nfs: don't atempt
blocking locks on nfs reexports") [sic] and commit b840be2f00c0 ("lockd:
don't attempt blocking locks on nfs reexports"), and instead uses a
check of ->lock (Vasily's suggestion) to decide whether to support
blocking lock notifications on a given filesystem. Also add a little
documentation.
Perhaps someday we could add back an export flag later to allow
filesystems with "good" ->lock methods to support blocking lock
notifications.
Reported-by: Vasily Averin <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
[ cel: Description rewritten to address checkpatch nits ]
[ cel: Fixed warning when SUNRPC debugging is disabled ]
[ cel: Fixed NULL check ]
Signed-off-by: Chuck Lever <[email protected]>
Reviewed-by: Vasily Averin <[email protected]>
|
|
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Eric W. Biederman" <[email protected]>
|
|
All profile_handoff_task does is notify the task_free_notifier chain.
The helpers task_handoff_register and task_handoff_unregister are used
to add and delete entries from that chain and are never called.
So remove the dead code and make it much easier to read and reason
about __put_task_struct.
Suggested-by: Al Viro <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Eric W. Biederman" <[email protected]>
|
|
When I say remove I mean remove. All profile_task_exit and
profile_munmap do is call a blocking notifier chain. The helpers
profile_task_register and profile_task_unregister are not called
anywhere in the tree. Which means this is all dead code.
So remove the dead code and make it easier to read do_exit.
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Eric W. Biederman" <[email protected]>
|
|
This helper is misleading. It tests for an ongoing exec as well as
the process having received a fatal signal.
Sometimes it is appropriate to treat an on-going exec differently than
a process that is shutting down due to a fatal signal. In particular
taking the fast path out of exit_signals instead of retargeting
signals is not appropriate during exec, and not changing the the exit
code in do_group_exit during exec.
Removing the helper makes it more obvious what is going on as both
cases must be coded for explicitly.
While removing the helper fix the two cases where I have observed
using signal_group_exit resulted in the wrong result.
In exit_signals only test for SIGNAL_GROUP_EXIT so that signals are
retargetted during an exec.
In do_group_exit use 0 as the exit code during an exec as de_thread
does not set group_exit_code. As best as I can determine
group_exit_code has been is set to 0 most of the time during
de_thread. During a thread group stop group_exit_code is set to the
stop signal and when the thread group receives SIGCONT group_exit_code
is reset to 0.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Eric W. Biederman" <[email protected]>
|
|
The only remaining user of group_exit_task is exec. Rename the field
so that it is clear which part of the code uses it.
Update the comment above the definition of group_exec_task to document
how it is currently used.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Eric W. Biederman" <[email protected]>
|
|
After the previous cleanups "signal->core_state" is set whenever
SIGNAL_GROUP_COREDUMP is set and "signal->core_state" is tested
whenver the code wants to know if a coredump is in progress. The
remaining tests of SIGNAL_GROUP_COREDUMP also test to see if
SIGNAL_GROUP_EXIT is set. Similarly the only place that sets
SIGNAL_GROUP_COREDUMP also sets SIGNAL_GROUP_EXIT.
Which makes SIGNAL_GROUP_COREDUMP unecessary and redundant. So stop
setting SIGNAL_GROUP_COREDUMP, stop testing SIGNAL_GROUP_COREDUMP, and
remove it's definition.
With the setting of SIGNAL_GROUP_COREDUMP gone, coredump_finish no
longer needs to clear SIGNAL_GROUP_COREDUMP out of signal->flags
by setting SIGNAL_GROUP_EXIT.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Eric W. Biederman" <[email protected]>
|
|
When CONFIG_SYSFS is not enabled provide stubs for the helper functions
to not break their callers.
Fixes: 539b9c94ac83 ("power: supply: add helpers for charge_behaviour sysfs")
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Thomas Weißschuh <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Hans de Goede <[email protected]>
|
|
The point of using set_child_tid to hold the kthread pointer was that
it already did what is necessary. There are now restrictions on when
set_child_tid can be initialized and when set_child_tid can be used in
schedule_tail. Which indicates that continuing to use set_child_tid
to hold the kthread pointer is a bad idea.
Instead of continuing to use the set_child_tid field of task_struct
generalize the pf_io_worker field of task_struct and use it to hold
the kthread pointer.
Rename pf_io_worker (which is a void * pointer) to worker_private so
it can be used to store kthreads struct kthread pointer. Update the
kthread code to store the kthread pointer in the worker_private field.
Remove the places where set_child_tid had to be dealt with carefully
because kthreads also used it.
Link: https://lkml.kernel.org/r/CAHk-=wgtFAA9SbVYg0gR1tqPMC17-NYcs0GQkaYg1bGhh1uJQQ@mail.gmail.com
Link: https://lkml.kernel.org/r/[email protected]
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
|
|
Conflicts:
drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
Signed-off-by: Ingo Molnar <[email protected]>
|
|
We currently store large folios as 2^N consecutive entries. While this
consumes rather more memory than necessary, it also turns out to be buggy.
A writeback operation which starts within a tail page of a dirty folio will
not write back the folio as the xarray's dirty bit is only set on the
head index. With multi-index entries, the dirty bit will be found no
matter where in the folio the operation starts.
This does end up simplifying the page cache slightly, although not as
much as I had hoped.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
|
|
Add a new helper function to help iterate over multi-index entries.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
|
|
All of its callers now call folio_batch_remove_exceptionals().
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
|
|
find_lock_entries() already only returned the head page of folios, so
convert it to return a folio_batch instead of a pagevec. That cascades
through converting truncate_inode_pages_range() to
delete_from_page_cache_batch() and page_cache_delete_batch().
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
|
|
The callers have all been converted to work on folios, so convert
find_get_entries() to return a batch of folios instead of pages.
We also now return multiple large folios in a single call.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
|
|
Convert all callers of truncate_inode_page() to call
truncate_inode_folio() instead, and move the declaration to mm/internal.h.
Move the assertion that the caller is not passing in a tail page to
generic_error_remove_page(). We can't entirely remove the struct page
from the callers yet because the page pointer in the pvec might be a
shadow/dax/swap entry instead of actually a page.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
|
|
Convert both callers of unmap_mapping_page() to call unmap_mapping_folio()
instead. Also move zap_details from linux/mm.h to mm/memory.c
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: William Kucharski <[email protected]>
|
|
Pull drm fixes from Dave Airlie:
"There is only the amdgpu runtime pm regression fix in here:
amdgpu:
- suspend/resume fix
- fix runtime PM regression"
* tag 'drm-fixes-2022-01-07' of git://anongit.freedesktop.org/drm/drm:
drm/amdgpu: disable runpm if we are the primary adapter
fbdev: fbmem: add a helper to determine if an aperture is used by a fw fb
drm/amd/pm: keep the BACO feature enabled for suspend
|
|
Commit b42faeee718c ("spi: Add a PTP system timestamp
to the transfer structure") added an include of ptp_clock_kernel.h
to spi.h for struct ptp_system_timestamp but a forward declaration
is enough. Let's use that to limit the number of objects we have
to rebuild every time we touch networking headers.
Signed-off-by: Jakub Kicinski <[email protected]>
Tested-by: Vladimir Oltean <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>
|
|
This adds basic support for delivering 2 level event channels to a guest.
Initially, it only supports delivery via the IRQ routing table, triggered
by an eventfd. In order to do so, it has a kvm_xen_set_evtchn_fast()
function which will use the pre-mapped shared_info page if it already
exists and is still valid, while the slow path through the irqfd_inject
workqueue will remap the shared_info page if necessary.
It sets the bits in the shared_info page but not the vcpu_info; that is
deferred to __kvm_xen_has_interrupt() which raises the vector to the
appropriate vCPU.
Add a 'verbose' mode to xen_shinfo_test while adding test cases for this.
Signed-off-by: David Woodhouse <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
This can be used in two modes. There is an atomic mode where the cached
mapping is accessed while holding the rwlock, and a mode where the
physical address is used by a vCPU in guest mode.
For the latter case, an invalidation will wake the vCPU with the new
KVM_REQ_GPC_INVALIDATE, and the architecture will need to refresh any
caches it still needs to access before entering guest mode again.
Only one vCPU can be targeted by the wake requests; it's simple enough
to make it wake all vCPUs or even a mask but I don't see a use case for
that additional complexity right now.
Invalidation happens from the invalidate_range_start MMU notifier, which
needs to be able to sleep in order to wake the vCPU and wait for it.
This means that revalidation potentially needs to "wait" for the MMU
operation to complete and the invalidate_range_end notifier to be
invoked. Like the vCPU when it takes a page fault in that period, we
just spin — fixing that in a future patch by implementing an actual
*wait* may be another part of shaving this particularly hirsute yak.
As noted in the comments in the function itself, the only case where
the invalidate_range_start notifier is expected to be called *without*
being able to sleep is when the OOM reaper is killing the process. In
that case, we expect the vCPU threads already to have exited, and thus
there will be nothing to wake, and no reason to wait. So we clear the
KVM_REQUEST_WAIT bit and send the request anyway, then complain loudly
if there actually *was* anything to wake up.
Signed-off-by: David Woodhouse <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The various kvm_write_guest() and mark_page_dirty() functions must only
ever be called in the context of an active vCPU, because if dirty ring
tracking is enabled it may simply oops when kvm_get_running_vcpu()
returns NULL for the vcpu and then kvm_dirty_ring_get() dereferences it.
This oops was reported by "butt3rflyh4ck" <[email protected]> in
https://lore.kernel.org/kvm/CAFcO6XOmoS7EacN_n6v4Txk7xL7iqRa2gABg3F7E3Naf5uG94g@mail.gmail.com/
That actual bug will be fixed under separate cover but this warning
should help to prevent new ones from being added.
Signed-off-by: David Woodhouse <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for Linux 5.16
- Simplification of the 'vcpu first run' by integrating it into
KVM's 'pid change' flow
- Refactoring of the FP and SVE state tracking, also leading to
a simpler state and less shared data between EL1 and EL2 in
the nVHE case
- Tidy up the header file usage for the nvhe hyp object
- New HYP unsharing mechanism, finally allowing pages to be
unmapped from the Stage-1 EL2 page-tables
- Various pKVM cleanups around refcounting and sharing
- A couple of vgic fixes for bugs that would trigger once
the vcpu xarray rework is merged, but not sooner
- Add minimal support for ARMv8.7's PMU extension
- Rework kvm_pgtable initialisation ahead of the NV work
- New selftest for IRQ injection
- Teach selftests about the lack of default IPA space and
page sizes
- Expand sysreg selftest to deal with Pointer Authentication
- The usual bunch of cleanups and doc update
|
|
Add a stat counter of culling events whereby the cache backend culls a file
to make space (when asked by cachefilesd in this case) and display in
/proc/fs/fscache/stats.
Signed-off-by: David Howells <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
cc: [email protected]
Link: https://lore.kernel.org/r/163819654165.215744.3797804661644212436.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906961387.143852.9291157239960289090.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967168266.1823006.14436200166581605746.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021567619.640689.4339228906248763197.stgit@warthog.procyon.org.uk/ # v4
|
|
Add stat counters of no-space events that caused caching not to happen and
display in /proc/fs/fscache/stats.
Signed-off-by: David Howells <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
cc: [email protected]
Link: https://lore.kernel.org/r/163819653216.215744.17210522251617386509.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906958369.143852.7257100711818401748.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967166917.1823006.14842444049198947892.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021566184.640689.4417328329632709265.stgit@warthog.procyon.org.uk/ # v4
|
|
Store the volume coherency data in an xattr and check it when we rebind the
volume. If it doesn't match the cache volume is moved to the graveyard and
rebuilt anew.
Changes
=======
ver #4:
- Remove a couple of debugging prints.
Signed-off-by: David Howells <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
Link: https://lore.kernel.org/r/163967164397.1823006.2950539849831291830.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021563138.640689.15851092065380543119.stgit@warthog.procyon.org.uk/ # v4
|
|
Use an inode flag, S_KERNEL_FILE, to mark that a backing file is in use by
the kernel to prevent cachefiles or other kernel services from interfering
with that file.
Alter rmdir to reject attempts to remove a directory marked with this flag.
This is used by cachefiles to prevent cachefilesd from removing them.
Using S_SWAPFILE instead isn't really viable as that has other effects in
the I/O paths.
Changes
=======
ver #3:
- Check for the object pointer being NULL in the tracepoints rather than
the caller.
Signed-off-by: David Howells <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
cc: [email protected]
Link: https://lore.kernel.org/r/163819630256.215744.4815885535039369574.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906931596.143852.8642051223094013028.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967141000.1823006.12920680657559677789.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021541207.640689.564689725898537127.stgit@warthog.procyon.org.uk/ # v4
|
|
Provide a function to change the size of the storage attached to a cookie,
to match the size of the file being cached when it's changed by truncate or
fallocate:
void fscache_resize_cookie(struct fscache_cookie *cookie,
loff_t new_size);
This acts synchronously and is expected to run under the inode lock of the
caller.
Signed-off-by: David Howells <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
cc: [email protected]
Link: https://lore.kernel.org/r/163819621839.215744.7895597119803515402.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906922387.143852.16394459879816147793.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967128998.1823006.10740669081985775576.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021527861.640689.3466382085497236267.stgit@warthog.procyon.org.uk/ # v4
|
|
Provide a function to be called from a network filesystem's releasepage
method to indicate that a page has been released that might have been a
reflection of data upon the server - and now that data must be reloaded
from the server or the cache.
This is used to end an optimisation for empty files, in particular files
that have just been created locally, whereby we know there cannot yet be
any data that we would need to read from the server or the cache.
Signed-off-by: David Howells <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
cc: [email protected]
Link: https://lore.kernel.org/r/163819617128.215744.4725572296135656508.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906920354.143852.7511819614661372008.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967128061.1823006.611781655060034988.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021525963.640689.9264556596205140044.stgit@warthog.procyon.org.uk/ # v4
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2022-01-06
1) Expose FEC per lane block counters via ethtool
2) Trivial fixes/updates/cleanup to mlx5e netdev driver
3) Fix htmldoc build warning
4) Spread mlx5 SFs (sub-functions) to all available CPU cores: Commits 1..5
Shay Drory Says:
================
Before this patchset, mlx5 subfunction shared the same IRQs (MSI-X) with
their peers subfunctions, causing them to use same CPU cores.
In large scale, this is very undesirable, SFs use small number of cpu
cores and all of them will be packed on the same CPU cores, not
utilizing all CPU cores in the system.
In this patchset we want to achieve two things.
a) Spread IRQs used by SFs to all cpu cores
b) Pack less SFs in the same IRQ, will result in multiple IRQs per core.
In this patchset, we spread SFs over all online cpus available to mlx5
irqs in Round-Robin manner. e.g.: Whenever a SF is created, pick the next
CPU core with least number of SF IRQs bound to it, SFs will share IRQs on
the same core until a certain limit, when such limit is reached, we
request a new IRQ and add it to that CPU core IRQ pool, when out of IRQs,
pick any IRQ with least number of SF users.
This enhancement is done in order to achieve a better distribution of
the SFs over all the available CPUs, which reduces application latency,
as shown bellow.
Machine details:
Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz with 56 cores.
PCI Express 3 with BW of 126 Gb/s.
ConnectX-5 Ex; EDR IB (100Gb/s) and 100GbE; dual-port QSFP28; PCIe4.0
x16.
Base line test description:
Single SF on the system. One instance of netperf is running on-top the
SF.
Numbers: latency = 15.136 usec, CPU Util = 35%
Test description:
There are 250 SFs on the system. There are 3 instances of netperf
running, on-top three different SFs, in parallel.
Perf numbers:
# netperf SFs latency(usec) latency CPU utilization
affinity affinity (lower is better) increase %
1 cpu=0 cpu={0} ~23 (app 1-3) 35% 75%
2 cpu=0,2,4 cpu={0} app 1: 21.625 30% 68% (CPU 0)
app 2-3: 16.5 9% 15% (CPU 2,4)
3 cpu=0 cpu={0,2,4} app 1: ~16 7% 84% (CPU 0)
app 2-3: ~17.9 14% 22% (CPU 2,4)
4 cpu=0,2,4 cpu={0,2,4} 15.2 (app 1-3) 0% 33% (CPU 0,2,4)
- The first two entries (#1 and #2) show current state. e.g.: SFs are
using the same CPU. The last two entries (#3 and #4) shows the latency
reduction improvement of this patch. e.g.: SFs are on different CPUs.
- Whenever we use several CPUs, in case there is a different CPU
utilization, write the utilization of each CPU separately.
- Whenever the latency result of the netperf instances were different,
write the latency of each netperf instances separately.
Commands:
- for netperf CPU=0:
$ for i in {1..3}; do taskset -c 0 netperf -H 1${i}.1.1.1 -t TCP_RR -- \
-o RT_LATENCY -r8 & done
- for netperf CPU=0,2,4
$ for i in {1..3}; do taskset -c $(( ($i - 1) * 2 )) netperf -H \
1${i}.1.1.1 -t TCP_RR -- -o RT_LATENCY -r8 & done
================
====================
Signed-off-by: David S. Miller <[email protected]>
|
|
Cachefiles has a problem in that it needs to keep the backing file for a
cookie open whilst there are local modifications pending that need to be
written to it. However, we don't want to keep the file open indefinitely,
as that causes EMFILE/ENFILE/ENOMEM problems.
Reopening the cache file, however, is a problem if this is being done due
to writeback triggered by exit(). Some filesystems will oops if we try to
open a file in that context because they want to access current->fs or
other resources that have already been dismantled.
To get around this, I added the following:
(1) An inode flag, I_PINNING_FSCACHE_WB, to be set on a network filesystem
inode to indicate that we have a usage count on the cookie caching
that inode.
(2) A flag in struct writeback_control, unpinned_fscache_wb, that is set
when __writeback_single_inode() clears the last dirty page from
i_pages - at which point it clears I_PINNING_FSCACHE_WB and sets this
flag.
This has to be done here so that clearing I_PINNING_FSCACHE_WB can be
done atomically with the check of PAGECACHE_TAG_DIRTY that clears
I_DIRTY_PAGES.
(3) A function, fscache_set_page_dirty(), which if it is not set, sets
I_PINNING_FSCACHE_WB and calls fscache_use_cookie() to pin the cache
resources.
(4) A function, fscache_unpin_writeback(), to be called by ->write_inode()
to unuse the cookie.
(5) A function, fscache_clear_inode_writeback(), to be called when the
inode is evicted, before clear_inode() is called. This cleans up any
lingering I_PINNING_FSCACHE_WB.
The network filesystem can then use these tools to make sure that
fscache_write_to_cache() can write locally modified data to the cache as
well as to the server.
For the future, I'm working on write helpers for netfs lib that should
allow this facility to be removed by keeping track of the dirty regions
separately - but that's incomplete at the moment and is also going to be
affected by folios, one way or another, since it deals with pages
Signed-off-by: David Howells <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
cc: [email protected]
Link: https://lore.kernel.org/r/163819615157.215744.17623791756928043114.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906917856.143852.8224898306177154573.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967124567.1823006.14188359004568060298.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021524705.640689.17824932021727663017.stgit@warthog.procyon.org.uk/ # v4
|
|
Provide a higher-level function than fscache_write() to perform a write
from an inode's pagecache to the cache, whilst fending off concurrent
writes by means of the PG_fscache mark on a page:
void fscache_write_to_cache(struct fscache_cookie *cookie,
struct address_space *mapping,
loff_t start,
size_t len,
loff_t i_size,
netfs_io_terminated_t term_func,
void *term_func_priv,
bool caching);
If caching is false, this function does nothing except call (*term_func)()
if given. It assumes that, in such a case, PG_fscache will not have been
set on the pages.
Otherwise, if caching is true, this function requires the source pages to
have had PG_fscache set on them before calling. start and len define the
region of the file to be modified and i_size indicates the new file size.
The source pages are extracted from the mapping.
term_func and term_func_priv work as for fscache_write(). The PG_fscache
marks will be cleared at the end of the operation, before term_func is
called or the function otherwise returns.
There is an additonal helper function to clear the PG_fscache bits from a
range of pages:
void fscache_clear_page_bits(struct fscache_cookie *cookie,
struct address_space *mapping,
loff_t start, size_t len,
bool caching);
If caching is true, the pages to be managed are expected to be located on
mapping in the range defined by start and len. If caching is false, it
does nothing.
Signed-off-by: David Howells <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
cc: [email protected]
Link: https://lore.kernel.org/r/163819614155.215744.5528123235123721230.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906916346.143852.15632773570362489926.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967123599.1823006.12946816026724657428.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021522672.640689.4381958316198807813.stgit@warthog.procyon.org.uk/ # v4
|
|
Provide a pair of functions to perform raw I/O on the cache. The first
function allows an arbitrary asynchronous direct-IO read to be made against
a cache object, though the read should be aligned and sized appropriately
for the backing device:
int fscache_read(struct netfs_cache_resources *cres,
loff_t start_pos,
struct iov_iter *iter,
enum netfs_read_from_hole read_hole,
netfs_io_terminated_t term_func,
void *term_func_priv);
The cache resources must have been previously initialised by
fscache_begin_read_operation(). A read operation is sent to the backing
filesystem, starting at start_pos within the file. The size of the read is
specified by the iterator, as is the location of the output buffer.
If there is a hole in the data it can be ignored and left to the backing
filesystem to deal with (NETFS_READ_HOLE_IGNORE), a hole at the beginning
can be skipped over and the buffer padded with zeros
(NETFS_READ_HOLE_CLEAR) or -ENODATA can be given (NETFS_READ_HOLE_FAIL).
If term_func is not NULL, the operation may be performed asynchronously.
Upon completion, successful or otherwise, (*term_func)() will be called and
passed term_func_priv, along with an error or the amount of data
transferred. If the op is run asynchronously, fscache_read() will return
-EIOCBQUEUED.
The second function allows an arbitrary asynchronous direct-IO write to be
made against a cache object, though the write should be aligned and sized
appropriately for the backing device:
int fscache_write(struct netfs_cache_resources *cres,
loff_t start_pos,
struct iov_iter *iter,
netfs_io_terminated_t term_func,
void *term_func_priv);
This works in very similar way to fscache_read(), except that there's no
need to deal with holes (they're just overwritten).
The caller is responsible for preventing concurrent overlapping writes.
Signed-off-by: David Howells <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
cc: [email protected]
Link: https://lore.kernel.org/r/163819613224.215744.7877577215582621254.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906915386.143852.16936177636106480724.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967122632.1823006.7487049517698562172.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021521420.640689.12747258780542678309.stgit@warthog.procyon.org.uk/ # v4
|
|
Pass more information to the cache on how to deal with a hole if it
encounters one when trying to read from the cache. Three options are
provided:
(1) NETFS_READ_HOLE_IGNORE. Read the hole along with the data, assuming
it to be a punched-out extent by the backing filesystem.
(2) NETFS_READ_HOLE_CLEAR. If there's a hole, erase the requested region
of the cache and clear the read buffer.
(3) NETFS_READ_HOLE_FAIL. Fail the read if a hole is detected.
Signed-off-by: David Howells <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
cc: [email protected]
Link: https://lore.kernel.org/r/163819612321.215744.9738308885948264476.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906914460.143852.6284247083607910189.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967119923.1823006.15637375885194297582.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021519762.640689.16994364383313159319.stgit@warthog.procyon.org.uk/ # v4
|