|
This patch prepares extent_cache to be ready for additions.
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Pull cxl updates from Dan Williams:
"Compute Express Link (CXL) updates for 6.2.
While it may seem backwards, the CXL update this time around includes
some focus on CXL 1.x enabling where the work to date had been with
CXL 2.0 (VH topologies) in mind.
First generation CXL can mostly be supported via BIOS, similar to DDR;
however, it became clear there are use cases for OS native CXL error
handling and some CXL 3.0 endpoint features can be deployed on CXL 1.x
hosts (Restricted CXL Host (RCH) topologies). So, this update brings
RCH topologies into the Linux CXL device model.
In support of the ongoing CXL 2.0+ enabling, two new core kernel
facilities are added.
One is the ability for the kernel to flag collisions between userspace
access to PCI configuration registers and kernel accesses. This is
brought on by the PCIe Data-Object-Exchange (DOE) facility, a hardware
mailbox over config-cycles.
The other is a cpu_cache_invalidate_memregion() API that maps to
wbinvd_on_all_cpus() on x86. To prevent abuse it is disabled in guest
VMs and architectures that do not support it yet. The CXL paths that
need it, dynamic memory region creation and security commands (erase /
unlock), are disabled when it is not present.
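A sketch of how a driver consumes this pair of helpers (the
surrounding function is hypothetical; only the two helpers are the
new API):
#include <linux/memregion.h>

/* Hypothetical post-erase path: CPU caches may hold stale data for
 * the region, so write them back and invalidate them. */
static int example_secure_erase_finish(int res_desc)
{
        /* Absent in guest VMs and on unsupported architectures. */
        if (!cpu_cache_has_invalidate_memregion())
                return -EOPNOTSUPP;

        /* Maps to wbinvd_on_all_cpus() on x86. */
        return cpu_cache_invalidate_memregion(res_desc);
}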
As for CXL 2.0+, this cycle the subsystem gains support for Persistent
Memory Security commands, error handling in response to PCIe AER
notifications, and support for the "XOR" host bridge interleave
algorithm.
Summary:
- Add the cpu_cache_invalidate_memregion() API for cache flushing in
response to physical memory reconfiguration, or memory-side data
invalidation from operations like secure erase or memory-device
unlock.
- Add a facility for the kernel to warn about collisions between
kernel and userspace access to PCI configuration registers
- Add support for Restricted CXL Host (RCH) topologies (formerly CXL
1.1)
- Add handling and reporting of CXL errors reported via the PCIe AER
mechanism
- Add support for CXL Persistent Memory Security commands
- Add support for the "XOR" algorithm for CXL host bridge interleave
- Rework / simplify CXL to NVDIMM interactions
- Miscellaneous cleanups and fixes"
* tag 'cxl-for-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (71 commits)
cxl/region: Fix memdev reuse check
cxl/pci: Remove endian confusion
cxl/pci: Add some type-safety to the AER trace points
cxl/security: Drop security command ioctl uapi
cxl/mbox: Add variable output size validation for internal commands
cxl/mbox: Enable cxl_mbox_send_cmd() users to validate output size
cxl/security: Fix Get Security State output payload endian handling
cxl: update names for interleave ways conversion macros
cxl: update names for interleave granularity conversion macros
cxl/acpi: Warn about an invalid CHBCR in an existing CHBS entry
tools/testing/cxl: Require cache invalidation bypass
cxl/acpi: Fail decoder add if CXIMS for HBIG is missing
cxl/region: Fix spelling mistake "memergion" -> "memregion"
cxl/regs: Fix sparse warning
cxl/acpi: Set ACPI's CXL _OSC to indicate RCD mode support
tools/testing/cxl: Add an RCH topology
cxl/port: Add RCD endpoint port enumeration
cxl/mem: Move devm_cxl_add_endpoint() from cxl_core to cxl_mem
tools/testing/cxl: Add XOR Math support to cxl_test
cxl/acpi: Support CXL XOR Interleave Math (CXIMS)
...
|
|
into clk-next
- Tracepoints for clk_rate_request structures
* clk-mediatek:
clk: mediatek: fix dependency of MT7986 ADC clocks
clk: mediatek: Change PLL register API for MT8186
clk: mediatek: Add new clock driver to handle FHCTL hardware
dt-bindings: clock: mediatek: Add new bindings of MediaTek frequency hopping
clk: mediatek: Export PLL operations symbols
clk: mediatek: mt8186-topckgen: Add GPU clock mux notifier
clk: mediatek: mt8186-mfg: Propagate rate changes to parent
clk: mediatek: mt8195-topckgen: Drop flags for main/univpll fixed factors
clk: mediatek: mt8192: Drop flags for main/univpll fixed factors
clk: mediatek: mt6795-topckgen: Drop flags for main/sys/univpll fixed factors
clk: mediatek: mt8173: Drop flags for main/sys/univpll fixed factors
clk: mediatek: mt8183: Drop flags for sys/univpll fixed factors
clk: mediatek: mt8183: Compress top_divs array entries
clk: mediatek: mt8186-topckgen: Drop flags for main/univpll fixed factors
clk: mediatek: clk-mtk: Allow specifying flags on mtk_fixed_factor clocks
* clk-trace:
clk: Add trace events for rate requests
clk: Store clk_core for clk_rate_request
* clk-qcom: (69 commits)
clk: qcom: rpmh: add support for SM6350 rpmh IPA clock
clk: qcom: mmcc-msm8974: use parent_hws/_data instead of parent_names
clk: qcom: mmcc-msm8974: move clock parent tables down
clk: qcom: mmcc-msm8974: use ARRAY_SIZE instead of specifying num_parents
clk: qcom: gcc-msm8974: use parent_hws/_data instead of parent_names
clk: qcom: gcc-msm8974: move clock parent tables down
clk: qcom: gcc-msm8974: use ARRAY_SIZE instead of specifying num_parents
dt-bindings: clocks: qcom,mmcc: define clocks/clock-names for MSM8974
dt-bindings: clock: split qcom,gcc-msm8974,-msm8226 to the separate file
clk: qcom: gcc-ipq4019: switch to devm_clk_notifier_register
clk: qcom: rpmh: remove usage of platform name
clk: qcom: rpmh: rename VRM clock data
clk: qcom: rpmh: rename ARC clock data
clk: qcom: rpmh: support separate symbol name for the RPMH clocks
clk: qcom: rpmh: remove platform names from BCM clocks
clk: qcom: rpmh: drop all _ao names
clk: qcom: rpmh: reuse common duplicate clocks
clk: qcom: rpmh: group clock definitions together
clk: qcom: rpm: drop the platform from clock definitions
clk: qcom: rpm: drop the _clk suffix completely
...
* clk-microchip:
clk: microchip: enable the MPFS clk driver by default if SOC_MICROCHIP_POLARFIRE
clk: microchip: check for null return of devm_kzalloc()
|
|
"mm_khugepaged_collapse_file" for capturing is_shmem.
Currently, is_shmem is not being captured. Capturing is_shmem is useful
as it can indicate if tmpfs is being used as a backing store instead of
persistent storage. Add the tracepoint in collapse_file() named
"mm_khugepaged_collapse_file" for capturing is_shmem.
[[email protected]: swap is_shmem and addr to save space, per Steven Rostedt]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Gautam Menghani <[email protected]>
Reviewed-by: Steven Rostedt (Google) <[email protected]> [tracing]
Cc: David Hildenbrand <[email protected]>
Cc: Masami Hiramatsu (Google) <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Clean up: Simplify the tracepoint's only call site.
Also, I noticed that when svc_authenticate() returns SVC_COMPLETE,
it leaves rq_auth_stat set to an error value. That doesn't need to
be recorded in the trace log.
Signed-off-by: Chuck Lever <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
|
|
Steven Rostedt says:
> The include/trace/events/ directory should only hold files that
> are to create events, not headers that hold helper functions.
>
> Can you please move them out of include/trace/events/ as that
> directory is "special" in the creation of events.
Signed-off-by: Chuck Lever <[email protected]>
Acked-by: Leon Romanovsky <[email protected]>
Acked-by: Steven Rostedt (Google) <[email protected]>
Acked-by: Anna Schumaker <[email protected]>
|
|
For dependencies in the following patches
Signed-off-by: Jason Gunthorpe <[email protected]>
|
|
fast-commit of create, link, and unlink operations in encrypted
directories is completely broken because the unencrypted filenames are
being written to the fast-commit journal instead of the encrypted
filenames. These operations can't be replayed, as encryption keys
aren't present at journal replay time. It is also an information leak.
Until if/when we can get this working properly, make encrypted directory
operations ineligible for fast-commit.
Note that fast-commit operations on encrypted regular files continue to
be allowed, as they seem to work.
Fixes: aa75f4d3daae ("ext4: main fast-commit commit path")
Cc: <[email protected]> # v5.10+
Signed-off-by: Eric Biggers <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
|
|
The following print format error was found when using ftrace events:
<...>-1406 [000] .... 23599442.895823: jbd2_end_commit: dev 252,8 transaction -1866216965 sync 0 head -1866217368
<...>-1406 [000] .... 23599442.896299: jbd2_start_commit: dev 252,8 transaction -1866216964 sync 0
Use the correct print format for transaction, head and tid.
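As a sketch, the corrected event looks roughly like this, with the
tid fields declared as the unsigned tid_t and printed with %u (field
layout illustrative, following the jbd2 trace header):
TRACE_EVENT(jbd2_end_commit,
        TP_PROTO(journal_t *journal, transaction_t *commit_transaction),
        TP_ARGS(journal, commit_transaction),
        TP_STRUCT__entry(
                __field(dev_t, dev)
                __field(char, sync_commit)
                __field(tid_t, transaction)     /* was: int, printed %d */
                __field(tid_t, head)            /* was: int, printed %d */
        ),
        TP_fast_assign(
                __entry->dev         = journal->j_fs_dev->bd_dev;
                __entry->sync_commit = commit_transaction->t_synchronous_commit;
                __entry->transaction = commit_transaction->t_tid;
                __entry->head        = journal->j_tail_sequence;
        ),
        TP_printk("dev %d,%d transaction %u sync %d head %u",
                  MAJOR(__entry->dev), MINOR(__entry->dev),
                  __entry->transaction, __entry->sync_commit,
                  __entry->head)
);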
Fixes: 879c5e6b7cb4 ('jbd2: convert instrumentation from markers to tracepoints')
Signed-off-by: Bixuan Cui <[email protected]>
Reviewed-by: Jason Yan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
Cc: [email protected]
|
|
No conflicts.
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
It is currently fairly difficult to follow which clk_rate_request
structures are issued, and how they have been modified once done.
Indeed, there are multiple paths that can be taken, some functions are
recursive and will just forward the request to their parent, etc.
Adding a lot of debug prints is just not very convenient, so let's add
trace events for the clock requests, one before they are submitted and
one after they are returned.
That way we can simply toggle the tracing on without modifying the
kernel code and without affecting performance or the kernel logs too
much.
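A sketch of the resulting pattern on one request path (function body
simplified; the event names follow this patch):
static int clk_core_round_rate_nolock(struct clk_core *core,
                                      struct clk_rate_request *req)
{
        int ret;

        trace_clk_rate_request_start(req);      /* request as issued */
        ret = clk_core_determine_round_nolock(core, req);
        trace_clk_rate_request_done(req);       /* request as modified */

        return ret;
}
The events can then be enabled at runtime through the usual tracefs
event controls, with no kernel rebuild.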
Reviewed-by: Steven Rostedt (Google) <[email protected]>
Signed-off-by: Maxime Ripard <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Stephen Boyd <[email protected]>
|
|
If a cookie expires from the LRU and the LRU_DISCARD flag is set, but
the state machine has not run yet, it's possible another thread can call
fscache_use_cookie and begin to use it.
When the cookie_worker finally runs, it will see the LRU_DISCARD flag
set and transition the cookie->state to LRU_DISCARDING, which will then
withdraw the cookie. Once the cookie is withdrawn and the object is
removed, the oops below occurs because the object associated with the
cookie is now NULL.
Fix the oops by clearing the LRU_DISCARD bit if another thread uses the
cookie before the cookie_worker runs.
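A simplified sketch of the fix in fscache_use_cookie() (locking and
the tracing helper are illustrative):
spin_lock(&cookie->lock);
switch (cookie->state) {
case FSCACHE_COOKIE_STATE_ACTIVE:
case FSCACHE_COOKIE_STATE_INVALIDATING:
        /* The worker hasn't acted on the LRU expiry yet, so cancel
         * the discard rather than let the cookie be withdrawn out
         * from under this new user. */
        if (test_and_clear_bit(FSCACHE_COOKIE_DO_LRU_DISCARD,
                               &cookie->flags))
                fscache_see_cookie(cookie,
                                   fscache_cookie_see_lru_discard_clear);
        break;
default:
        break;
}
spin_unlock(&cookie->lock);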
BUG: kernel NULL pointer dereference, address: 0000000000000008
...
CPU: 31 PID: 44773 Comm: kworker/u130:1 Tainted: G E 6.0.0-5.dneg.x86_64 #1
Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
Workqueue: events_unbound netfs_rreq_write_to_cache_work [netfs]
RIP: 0010:cachefiles_prepare_write+0x28/0x90 [cachefiles]
...
Call Trace:
netfs_rreq_write_to_cache_work+0x11c/0x320 [netfs]
process_one_work+0x217/0x3e0
worker_thread+0x4a/0x3b0
kthread+0xd6/0x100
Fixes: 12bb21a29c19 ("fscache: Implement cookie user counting and resource pinning")
Reported-by: Daire Byrne <[email protected]>
Signed-off-by: Dave Wysochanski <[email protected]>
Signed-off-by: David Howells <[email protected]>
Tested-by: Daire Byrne <[email protected]>
Link: https://lore.kernel.org/r/[email protected]/ # v1
Link: https://lore.kernel.org/r/[email protected]/ # v2
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add a prepare_ondemand_read() callback dedicated to the on-demand read
scenario, so that callers from this scenario can be decoupled from
netfs_io_subrequest.
The original cachefiles_prepare_read() is now refactored to a generic
routine accepting a parameter list instead of netfs_io_subrequest.
There's no logic change, except that the debug ids of the subrequest
and request are removed from trace_cachefiles_prep_read().
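The rough shape of the new hook (treat the exact parameter list as
illustrative):
/* In struct netfs_cache_ops: no netfs_io_subrequest in sight --
 * on-demand callers pass the raw parameters instead. */
enum netfs_io_source
(*prepare_ondemand_read)(struct netfs_cache_resources *cres,
                         loff_t start, size_t *_len, loff_t i_size,
                         unsigned long *_flags, ino_t ino);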
Reviewed-by: Jeff Layton <[email protected]>
Signed-off-by: Jingbo Xu <[email protected]>
Acked-by: David Howells <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Gao Xiang <[email protected]>
|
|
The first argument to the CXL AER trace points is the source device.
Pass a 'const struct device *' rather than a 'const char *' for more
type precision / safety.
Cc: Jonathan Cameron <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Steven Rostedt <[email protected]>
Reviewed-by: Dave Jiang <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Link: https://lore.kernel.org/r/167030091477.4045167.15174636482098463885.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <[email protected]>
|
|
Record and report an error code for the events. This allows reporting
failed calls without ambiguity and so gives a more complete picture.
Acked-by: Conor Dooley <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Uwe Kleine-König <[email protected]>
Signed-off-by: Thierry Reding <[email protected]>
|
|
The extent_io_tree::private_data was meant to be a preparatory work for
the metadata inode rework but that never materialized. Now it's used
only for an inode, so it's better to change it to the appropriate type
and rename it.
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
Add tracepoint events for recording the CXL uncorrectable and correctable
errors. For uncorrectable errors, there is additional data of 512B from
the header log register (CXL spec rev3 8.2.4.16.7). The trace event
takes a dynamic array that dumps the entire Header Log data. If
multiple errors are set in the status register, then the
'first error' field (CXL spec rev3 8.2.4.16.6) is read from the Error
Capabilities and Control Register in order to determine the error.
This implementation does not include CXL IDE Error details.
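A sketch of how the header log lands in the event (sizes per the 512B
register above; field names and the argument tail are illustrative):
#define CXL_HEADERLOG_SIZE      SZ_512
#define CXL_HEADERLOG_SIZE_U32  (SZ_512 / sizeof(u32))

TRACE_EVENT(cxl_aer_uncorrectable_error,
        TP_PROTO(const struct device *dev, u32 status, u32 fe, u32 *hl),
        TP_ARGS(dev, status, fe, hl),
        TP_STRUCT__entry(
                __string(device, dev_name(dev))
                __field(u32, status)
                __field(u32, first_error)
                __dynamic_array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
        ),
        TP_fast_assign(
                __assign_str(device, dev_name(dev));
                __entry->status = status;
                __entry->first_error = fe;
                /* Dump the entire 512B Header Log. */
                memcpy(__get_dynamic_array(header_log), hl,
                       CXL_HEADERLOG_SIZE);
        ),
        TP_printk("%s: status: 0x%08x first_error: 0x%08x",
                  __get_str(device), __entry->status, __entry->first_error)
);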
Cc: Steven Rostedt <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Signed-off-by: Dave Jiang <[email protected]>
Reviewed-by: Steven Rostedt (Google) <[email protected]>
Link: https://lore.kernel.org/r/166974413388.1608150.5875712482260436188.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dan Williams <[email protected]>
|
|
We might want to know in detail why the jbd2 thread is issuing heavy I/O,
so split the ext4_journal_start trace event into ext4_journal_start_sb and
ext4_journal_start_inode, and show the ino and handle type when possible.
Signed-off-by: changfengnan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
|
|
Since commit ac33e91e2daca ("blk-iocost: implement vtime loss
compensation"), the original vtime_rate has been renamed to
vtime_base_rate, and the current vtime_rate is the original vtime_rate
with compensation applied. The rate shown in the tracepoints is a mix
of vtime_rate and vtime_base_rate:
1) In function ioc_adjust_base_vrate, the first trace_iocost_ioc_vrate_adj
shows vtime_rate, the second trace_iocost_ioc_vrate_adj shows
vtime_base_rate.
2) Function iocg_activate shows vtime_rate by calling
TRACE_IOCG_PATH(iocg_activate...
3) Function ioc_check_iocgs shows vtime_rate by calling
TRACE_IOCG_PATH(iocg_idle...
Trace vtime_base_rate instead of vtime_rate because:
1) Before commit ac33e91e2daca ("blk-iocost: implement vtime loss
compensation"), the traced rate was the rate without compensation, so
keep showing the rate without compensation.
2) vtime_base_rate is more stable, while vtime_rate heavily depends on the
excess budget in the current period, which may change abruptly in the
next period.
Signed-off-by: Kemeng Shi <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
For ACKs generated inside the I/O thread, transmit the ACK at the point of
generation. Where the ACK is generated outside of the I/O thread, it's
offloaded to the I/O thread to transmit it.
Signed-off-by: David Howells <[email protected]>
cc: Marc Dionne <[email protected]>
cc: [email protected]
|
|
Add a tracepoint to log when a cwnd reset occurs due to lack of
transmission on a call.
Add stat counters to count transmission underflows (ie. when we have tx
window space, but sendmsg doesn't manage to keep up), cwnd resets and
transmission failures.
Signed-off-by: David Howells <[email protected]>
cc: Marc Dionne <[email protected]>
cc: [email protected]
|
|
Move the functions from the call->processor and local->processor work items
into the domain of the I/O thread.
The call event processor, now called from the I/O thread, then takes over
the job of cranking the call state machine, processing incoming packets and
transmitting DATA, ACK and ABORT packets. In a future patch,
rxrpc_send_ACK() will transmit the ACK on the spot rather than queuing it
for later transmission.
The call event processor becomes purely received-skb driven. It only
transmits things in response to events. We use "pokes" to queue a dummy
skb to make it do things like start/resume transmitting data. Timer expiry
also results in pokes.
The connection event processor becomes similar, though crypto events, such
as dealing with CHALLENGE and RESPONSE packets, are offloaded to a work
item to avoid doing crypto in the I/O thread.
The local event processor is removed and VERSION response packets are
generated directly from the packet parser. Similarly, ABORTs generated in
response to protocol errors will be transmitted immediately rather than
being pushed onto a queue for later transmission.
Changes:
========
ver #2)
- Fix a couple of introduced lock context imbalances.
Signed-off-by: David Howells <[email protected]>
cc: Marc Dionne <[email protected]>
cc: [email protected]
|
|
A received skbuff needs a ref when it gets put on a call data queue or conn
packet queue, and rxrpc_input_packet() and co. jump through a lot of hoops
to avoid double-dropping the skbuff ref so that we can avoid getting a ref
when we queue the packet.
Change this so that the skbuff ref is unconditionally dropped by the caller
of rxrpc_input_packet(). An additional ref is then taken on the packet if
it is pushed onto a queue.
Signed-off-by: David Howells <[email protected]>
cc: Marc Dionne <[email protected]>
cc: [email protected]
|
|
Move DATA transmission into the call processor work item. In a future
patch, this will be called from the I/O thread rather than being its own
work item.
This will allow DATA transmission to be driven directly by incoming ACKs,
pokes and timers as those are processed.
The Tx queue is also split: the queue of packets prepared by sendmsg is now
placed in call->tx_sendmsg and the packet dispatcher decants the packets
into call->tx_buffer as space becomes available in the transmission
window. This allows sendmsg to run ahead of the available space to try and
prevent an underflow in transmission.
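A sketch of the decanting step (helper and member names illustrative,
not the actual rxrpc symbols):
struct rxrpc_txbuf *txb;

/* Move sendmsg-prepared packets into the live Tx buffer while the
 * transmission window has room, then transmit them. */
while ((txb = list_first_entry_or_null(&call->tx_sendmsg,
                                       struct rxrpc_txbuf, call_link))) {
        if (!rxrpc_tx_window_has_space(call))
                break;
        list_move_tail(&txb->call_link, &call->tx_buffer);
        rxrpc_transmit_one(call, txb);
}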
Signed-off-by: David Howells <[email protected]>
cc: Marc Dionne <[email protected]>
cc: [email protected]
|
|
Copy client call parameters into rxrpc_call earlier so that the call can be
used to convey them to the connection code - which can then be offloaded to
the I/O thread.
Signed-off-by: David Howells <[email protected]>
cc: Marc Dionne <[email protected]>
cc: [email protected]
|
|
Provide a means by which an event notification can be sent to a call such
that the I/O thread can process it rather than it being done in a separate
workqueue. This will allow a lot of locking to be removed.
Signed-off-by: David Howells <[email protected]>
cc: Marc Dionne <[email protected]>
cc: [email protected]
|
|
Currently, rxrpc gives the connection's work item a ref on the connection
when it queues it - and this is called from the timer expiration function.
The problem comes when queue_work() fails (ie. the work item is already
queued): the timer routine must put the ref - but this may cause the
cleanup code to run.
This has the unfortunate effect that the cleanup code may then be run in
softirq context - which means that any spinlocks it might need to touch
have to be guarded to disable softirqs (ie. they need a "_bh" suffix).
Fix this by:
(1) Don't give a ref to the work item.
(2) Simplify handling of service connections by adding a separate active
count so that the refcount isn't also used for this.
(3) Connection destruction for both client and service connections can
then be cleaned up by putting rxrpc_put_connection() out of line and
making a tidy progression through the destruction code (offloaded to a
workqueue if put from softirq or processor function context). The RCU
part of the cleanup then only deals with the freeing at the end.
(4) Make rxrpc_queue_conn() return immediately if it sees the active count
is -1 rather than queuing the connection.
(5) Make sure that the cleanup routine waits for the work item to
complete.
(6) Stash the rxrpc_net pointer in the conn struct so that the rcu free
routine can use it, even if the local endpoint has been freed.
Unfortunately, neither the timer nor the work item can simply get around
the problem by just using refcount_inc_not_zero() as the waits would still
have to be done, and there would still be the possibility of having to put
the ref in the expiration function.
Note the connection work item is mostly going to go away with the main
event work being transferred to the I/O thread, so the wait in (6) will
become obsolete.
Signed-off-by: David Howells <[email protected]>
cc: Marc Dionne <[email protected]>
cc: [email protected]
|
|
Currently, rxrpc gives the call timer a ref on the call when it starts it
and this is passed along to the workqueue by the timer expiration function.
The problem comes when queue_work() fails (ie. the work item is already
queued): the timer routine must put the ref - but this may cause the
cleanup code to run.
This has the unfortunate effect that the cleanup code may then be run in
softirq context - which means that any spinlocks it might need to touch
have to be guarded to disable softirqs (ie. they need a "_bh" suffix).
Fix this by:
(1) Don't give a ref to the timer.
(2) Make the expiration function not do anything if the refcount is 0.
Note that this is more of an optimisation.
(3) Make sure that the cleanup routine waits for the timer to complete.
However, this has the consequence that the timer cannot give a ref to the work
item. Therefore the following fixes are also necessary:
(4) Don't give a ref to the work item.
(5) Make the work item return asap if it sees the ref count is 0.
(6) Make sure that the cleanup routine waits for the work item to
complete.
Unfortunately, neither the timer nor the work item can simply get around
the problem by just using refcount_inc_not_zero() as the waits would still
have to be done, and there would still be the possibility of having to put
the ref in the expiration function.
Note the call work item is going to go away with the work being transferred
to the I/O thread, so the wait in (6) will become obsolete.
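A minimal sketch of points (2) and (4), assuming the usual refcount
helpers (the poke helper is illustrative):
static void rxrpc_call_timer_expired(struct timer_list *t)
{
        struct rxrpc_call *call = from_timer(call, t, timer);

        /* (2) The timer holds no ref: if the refcount has already hit
         * zero, cleanup is in progress and the call must not be
         * touched. */
        if (!refcount_read(&call->ref))
                return;

        /* (4) No ref is handed to the work item either. */
        rxrpc_poke_call(call);
}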
Signed-off-by: David Howells <[email protected]>
cc: Marc Dionne <[email protected]>
cc: [email protected]
|
|
In rxrpc tracing, use enums to generate lists of points of interest rather
than __builtin_return_address() for the sk_buff tracepoint.
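A sketch of the pattern (enum values and strings illustrative): the
"why" argument is an enum constant naming the call site, which the
event renders symbolically instead of logging a return address:
enum rxrpc_skb_trace {
        rxrpc_skb_get_conn_work,
        rxrpc_skb_put_purge,
        rxrpc_skb_seen,
};

TRACE_EVENT(rxrpc_skb,
        TP_PROTO(struct sk_buff *skb, int usage, int mod_count,
                 enum rxrpc_skb_trace why),
        TP_ARGS(skb, usage, mod_count, why),
        TP_STRUCT__entry(
                __field(struct sk_buff *, skb)
                __field(int, usage)
                __field(int, mod_count)
                __field(enum rxrpc_skb_trace, why)
        ),
        TP_fast_assign(
                __entry->skb = skb;
                __entry->usage = usage;
                __entry->mod_count = mod_count;
                __entry->why = why;
        ),
        TP_printk("s=%p Rr=%02d+%02d %s",
                  __entry->skb, __entry->usage, __entry->mod_count,
                  __print_symbolic(__entry->why,
                        { rxrpc_skb_get_conn_work, "GET conn-work" },
                        { rxrpc_skb_put_purge,     "PUT purge" },
                        { rxrpc_skb_seen,          "SEE" }))
);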
Signed-off-by: David Howells <[email protected]>
cc: Marc Dionne <[email protected]>
cc: [email protected]
|
|
Add a tracepoint for the rxrpc_bundle refcounting.
Signed-off-by: David Howells <[email protected]>
cc: Marc Dionne <[email protected]>
cc: [email protected]
|
|
In rxrpc tracing, use enums to generate lists of points of interest rather
than __builtin_return_address() for the rxrpc_call tracepoint.
Signed-off-by: David Howells <[email protected]>
cc: Marc Dionne <[email protected]>
cc: [email protected]
|
|
In rxrpc tracing, use enums to generate lists of points of interest rather
than __builtin_return_address() for the rxrpc_conn tracepoint.
Signed-off-by: David Howells <[email protected]>
cc: Marc Dionne <[email protected]>
cc: [email protected]
|
|
In rxrpc tracing, use enums to generate lists of points of interest rather
than __builtin_return_address() for the rxrpc_peer tracepoint.
Signed-off-by: David Howells <[email protected]>
cc: Marc Dionne <[email protected]>
cc: [email protected]
|
|
In rxrpc tracing, use enums to generate lists of points of interest rather
than __builtin_return_address() for the rxrpc_local tracepoint.
Signed-off-by: David Howells <[email protected]>
cc: Marc Dionne <[email protected]>
cc: [email protected]
|
|
Remove the kproto() and _proto() debugging macros in preference to using
tracepoints for this.
Signed-off-by: David Howells <[email protected]>
cc: Marc Dionne <[email protected]>
cc: [email protected]
|
|
Currently mm_struct maintains rss_stats which are updated on page fault
and the unmapping codepaths. For the page fault codepath the updates are
cached per thread, with a batch size of TASK_RSS_EVENTS_THRESH, which is 64.
The reason for caching is performance for multithreaded applications;
otherwise the rss_stats updates may become a hotspot for such applications.
However this optimization comes at the cost of an error margin in the rss
stats. The rss_stats for applications with a large number of threads can
be very skewed. At worst the error margin is (nr_threads * 64) and we
have a lot of applications with 100s of threads, so the error margin can
be very high. Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32.
Recently we started seeing unbounded errors in rss_stats for specific
applications which use TCP receive zerocopy (rx0cp). It seems like the
vm_insert_pages() codepath does not sync rss_stats at all.
This patch converts the rss_stats into percpu_counter, which changes the
error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2). The
conversion also enables us to get accurate stats for situations where
accuracy is more important than the cpu cost. This patch does not need
to make such a tradeoff - we can just use percpu_counter_add_local() for
the updates and percpu_counter_sum() (or percpu_counter_sync() +
percpu_counter_read) for the readers. At the moment the readers are the
procfs interface, the oom_killer and memory reclaim, which I think are
not performance critical and should be OK with a slow read. However I
think we can make that change in a separate patch.
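A sketch of the shape of the conversion (helper names illustrative;
the real accessors in mm handle more cases):
#include <linux/percpu_counter.h>

/* Writer side: cheap, CPU-local update on the fault path. */
static inline void mm_counter_add(struct mm_struct *mm, int member,
                                  long value)
{
        percpu_counter_add_local(&mm->rss_stat[member], value);
}

/* Reader side: slower but accurate sum over all CPUs; fine for
 * procfs, the oom_killer and memory reclaim. */
static inline unsigned long mm_counter_read_accurate(struct mm_struct *mm,
                                                     int member)
{
        return percpu_counter_sum_positive(&mm->rss_stat[member]);
}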
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
|
|
tools/lib/bpf/ringbuf.c
927cbb478adf ("libbpf: Handle size overflow for ringbuf mmap")
b486d19a0ab0 ("libbpf: checkpatch: Fixed code alignments in ringbuf.c")
https://lore.kernel.org/all/[email protected]/
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
As with PG_arch_2, this flag is only allowed on 64-bit architectures due
to the shortage of bits available. It will be used by the arm64 MTE code
in subsequent patches.
Signed-off-by: Peter Collingbourne <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Steven Price <[email protected]>
[[email protected]: added flag preserving in __split_huge_page_tail()]
Signed-off-by: Catalin Marinas <[email protected]>
Reviewed-by: Steven Price <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Commit 4beba9486abd ("mm: Add PG_arch_2 page flag") introduced a new
page flag for all 64-bit architectures. However, even if an architecture
is 64-bit, it may still have limited spare bits in the 'flags' member of
'struct page'. This may happen if an architecture enables SPARSEMEM
without SPARSEMEM_VMEMMAP as is the case with the newly added loongarch.
This architecture port needs 19 more bits for the sparsemem section
information and, while it is currently fine with PG_arch_2, adding any
more PG_arch_* flags will trigger build-time warnings.
Add a new CONFIG_ARCH_USES_PG_ARCH_X option which can be selected by
architectures that need more PG_arch_* flags beyond PG_arch_1. Select it
on arm64.
Signed-off-by: Catalin Marinas <[email protected]>
[[email protected]: fix build with CONFIG_ARM64_MTE disabled]
Signed-off-by: Peter Collingbourne <[email protected]>
Reported-by: kernel test robot <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Steven Price <[email protected]>
Reviewed-by: Steven Price <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
For the cases where 'reason' doesn't give any clue, it's still
nice to be able to track the kfree_skb caller location. %p doesn't
help much so let's use %pS which prints the symbol+offset.
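In the event's TP_printk() this is a one-character format change,
roughly (other fields and the drop-reason rendering abbreviated):
/* was: location=%p -- a bare (hashed) pointer */
TP_printk("skbaddr=%p protocol=%u location=%pS",
          __entry->skbaddr, __entry->protocol, __entry->location)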
Signed-off-by: Stanislav Fomichev <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
|
|
Refactor the mm_khugepaged_scan_file tracepoint to move filename
dereference to the tracepoint definition, to maintain consistency with
other tracepoints[1].
[1]: lore.kernel.org/lkml/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Fixes: d41fd2016ed07 ("mm/khugepaged: add tracepoint to hpage_collapse_scan_file()")
Signed-off-by: Gautam Menghani <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Reviewed-by: Zach O'Keefe <[email protected]>
Reviewed-by: Steven Rostedt (Google) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Masami Hiramatsu (Google) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Due to compiler optimizations like inlining, there are cases where
MMIO traces using _THIS_IP_ for caller information might not be
sufficient to provide accurate debug traces.
1) With optimizations (seen with GCC):
In this case, _THIS_IP_ works fine and prints the caller information,
since the accessor is inlined into the caller and we get debug traces
showing who made the MMIO access, for example:
rwmmio_read: qcom_smmu_tlb_sync+0xe0/0x1b0 width=32 addr=0xffff8000087447f4
rwmmio_post_read: qcom_smmu_tlb_sync+0xe0/0x1b0 width=32 val=0x0 addr=0xffff8000087447f4
2) Without optimizations (seen with Clang):
_THIS_IP_ is not sufficient in this case, as it prints only the MMIO
accessor itself, which is of little use since the accessor is not
inlined, as below for example:
rwmmio_read: readl+0x4/0x80 width=32 addr=0xffff8000087447f4
rwmmio_post_read: readl+0x48/0x80 width=32 val=0x4 addr=0xffff8000087447f4
So, in order to handle this second case as well irrespective of compiler
optimizations, add _RET_IP_ to the MMIO trace events to provide more
accurate debug information in all these scenarios.
Before:
rwmmio_read: readl+0x4/0x80 width=32 addr=0xffff8000087447f4
rwmmio_post_read: readl+0x48/0x80 width=32 val=0x4 addr=0xffff8000087447f4
After:
rwmmio_read: qcom_smmu_tlb_sync+0xe0/0x1b0 -> readl+0x4/0x80 width=32 addr=0xffff8000087447f4
rwmmio_post_read: qcom_smmu_tlb_sync+0xe0/0x1b0 -> readl+0x4/0x80 width=32 val=0x0 addr=0xffff8000087447f4
Fixes: 210031971cdd ("asm-generic/io: Add logging support for MMIO accessors")
Signed-off-by: Sai Prakash Ranjan <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
|
|
In DLM, when we send a dlm message it is easy to add the lock resource
name, but an additional lookup is required to trace the receive message
side. The idea here is to move the lookup work to the user by matching
the receive message with the right send message. Note that DLM can't
drop any message, which is guaranteed by a special session layer.
For doing the lookup, a 3-tuple is required as a unique identification,
which is dst nodeid, src nodeid and sequence number. This patch adds the
destination nodeid to the dlm message trace points. The source nodeid is
given by the h_nodeid field inside the header.
Signed-off-by: Alexander Aring <[email protected]>
Signed-off-by: David Teigland <[email protected]>
|
|
This patch renames seq to h_seq as it is named in the dlm header
structure.
Signed-off-by: Alexander Aring <[email protected]>
Signed-off-by: David Teigland <[email protected]>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Long Li says:
====================
Introduce Microsoft Azure Network Adapter (MANA) RDMA driver [netdev prep]
The first 11 patches modify the MANA Ethernet driver to support the
RDMA driver.
* 'mana-shared-6.2' of https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
net: mana: Define data structures for protection domain and memory registration
net: mana: Define data structures for allocating doorbell page from GDMA
net: mana: Define and process GDMA response code GDMA_STATUS_MORE_ENTRIES
net: mana: Define max values for SGL entries
net: mana: Move header files to a common location
net: mana: Record port number in netdev
net: mana: Export Work Queue functions for use by RDMA driver
net: mana: Set the DMA device max segment size
net: mana: Handle vport sharing between devices
net: mana: Record the physical address for doorbell page region
net: mana: Add support for auxiliary device
====================
Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Leon Romanovsky <[email protected]>
|
|
Tracepoints are not allowed to sleep; as such, the following splat is
generated due to a call to ib_query_pkey() in atomic context.
WARNING: CPU: 0 PID: 1888000 at kernel/trace/ring_buffer.c:2492 rb_commit+0xc1/0x220
CPU: 0 PID: 1888000 Comm: kworker/u9:0 Kdump: loaded Tainted: G OE --------- - - 4.18.0-305.3.1.el8.x86_64 #1
Hardware name: Red Hat KVM, BIOS 1.13.0-2.module_el8.3.0+555+a55c8938 04/01/2014
Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
RIP: 0010:rb_commit+0xc1/0x220
RSP: 0000:ffffa8ac80f9bca0 EFLAGS: 00010202
RAX: ffff8951c7c01300 RBX: ffff8951c7c14a00 RCX: 0000000000000246
RDX: ffff8951c707c000 RSI: ffff8951c707c57c RDI: ffff8951c7c14a00
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: ffff8951c7c01300 R11: 0000000000000001 R12: 0000000000000246
R13: 0000000000000000 R14: ffffffff964c70c0 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff8951fbc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f20e8f39010 CR3: 000000002ca10005 CR4: 0000000000170ef0
Call Trace:
ring_buffer_unlock_commit+0x1d/0xa0
trace_buffer_unlock_commit_regs+0x3b/0x1b0
trace_event_buffer_commit+0x67/0x1d0
trace_event_raw_event_ib_mad_recv_done_handler+0x11c/0x160 [ib_core]
ib_mad_recv_done+0x48b/0xc10 [ib_core]
? trace_event_raw_event_cq_poll+0x6f/0xb0 [ib_core]
__ib_process_cq+0x91/0x1c0 [ib_core]
ib_cq_poll_work+0x26/0x80 [ib_core]
process_one_work+0x1a7/0x360
? create_worker+0x1a0/0x1a0
worker_thread+0x30/0x390
? create_worker+0x1a0/0x1a0
kthread+0x116/0x130
? kthread_flush_work_fn+0x10/0x10
ret_from_fork+0x35/0x40
---[ end trace 78ba8509d3830a16 ]---
Fixes: 821bf1de45a1 ("IB/MAD: Add recv path trace point")
Signed-off-by: Leonid Ravich <[email protected]>
Link: https://lore.kernel.org/r/Y2t5feomyznrVj7V@leonid-Inspiron-3421
Signed-off-by: Leon Romanovsky <[email protected]>
|
|
This event is used in order to validate/debug the start address of a
freed VA, and the number of currently outstanding and maximum allowed
areas.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: Steven Rostedt (Google) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Oleksiy Avramchenko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
It is for debug purposes, to track the number of freed vmap areas,
including the range each free occurs on.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: Steven Rostedt (Google) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Oleksiy Avramchenko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "Add basic trace events for vmap/vmalloc (v2)", v2.
This small series adds some basic trace events for the vmap/vmalloc code.
Since we currently lack any, it is sometimes hard to start debugging the
vmap code when an issue is reported or occurs.
For example https://lore.kernel.org/linux-mm/Y0p8BZIiDXLQbde%2F@pc636/T/
The final patch adds two reviewers for vmalloc code.
This patch (of 7):
It is for debug purposes and for validation of passed parameters.
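A sketch of the kind of event this adds, mirroring what
alloc_vmap_area() knows at that point (argument list and format
illustrative):
TRACE_EVENT(alloc_vmap_area,
        TP_PROTO(unsigned long addr, unsigned long size,
                 unsigned long align, unsigned long vstart,
                 unsigned long vend, int failed),
        TP_ARGS(addr, size, align, vstart, vend, failed),
        TP_STRUCT__entry(
                __field(unsigned long, addr)
                __field(unsigned long, size)
                __field(unsigned long, align)
                __field(unsigned long, vstart)
                __field(unsigned long, vend)
                __field(int, failed)
        ),
        TP_fast_assign(
                __entry->addr   = addr;
                __entry->size   = size;
                __entry->align  = align;
                __entry->vstart = vstart;
                __entry->vend   = vend;
                __entry->failed = failed;
        ),
        TP_printk("addr=0x%lx size=%lu align=0x%lx vstart=0x%lx vend=0x%lx failed=%d",
                  __entry->addr, __entry->size, __entry->align,
                  __entry->vstart, __entry->vend, __entry->failed)
);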
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: Steven Rostedt (Google) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Oleksiy Avramchenko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|