| Age | Commit message (Collapse) | Author | Files | Lines |
|
In order to be able to make S2 TLB invalidations more performant on NV,
let's use a scheme derived from the FEAT_TTL extension.
If bits [56:55] in the leaf descriptor translating the address in the
corresponding shadow S2 are non-zero, they indicate a level which can
be used as an invalidation range. This allows further reduction of the
systematic over-invalidation that takes place otherwise.
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Oliver Upton <[email protected]>
|
|
Populate bits [56:55] of the leaf entry with the level provided
by the guest's S2 translation. This will allow us to better scope
the invalidation by remembering the mapping size.
Of course, this assume that the guest will issue an invalidation
with an address that falls into the same leaf. If the guest doesn't,
we'll over-invalidate.
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Oliver Upton <[email protected]>
|
|
Support guest-provided information information to size the range of
required invalidation. This helps with reducing over-invalidation,
provided that the guest actually provides accurate information.
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Oliver Upton <[email protected]>
|
|
TLBI IPAS2E1* are the last class of TLBI instructions we need
to handle. For each matching S2 MMU context, we invalidate a
range corresponding to the largest possible mapping for that
context.
At this stage, we don't handle TTL, which means we are likely
over-invalidating. Further patches will aim at making this
a bit better.
Co-developed-by: Jintack Lim <[email protected]>
Co-developed-by: Christoffer Dall <[email protected]>
Signed-off-by: Jintack Lim <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Oliver Upton <[email protected]>
|
|
TLBI ALLE1* is a pretty big hammer that invalides all S1/S2 TLBs.
This translates into the unmapping of all our shadow S2 PTs, itself
resulting in the corresponding TLB invalidations.
Co-developed-by: Jintack Lim <[email protected]>
Co-developed-by: Christoffer Dall <[email protected]>
Signed-off-by: Jintack Lim <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Oliver Upton <[email protected]>
|
|
Emulating TLBI VMALLS12E1* results in tearing down all the shadow
S2 PTs that match the current VMID, since our shadow S2s are just
some form of SW-managed TLBs. That teardown itself results in a
full TLB invalidation for both S1 and S2.
This can result in over-invalidation if two vcpus use the same VMID
to tag private S2 PTs, but this is still correct from an architecture
perspective.
Co-developed-by: Jintack Lim <[email protected]>
Co-developed-by: Christoffer Dall <[email protected]>
Signed-off-by: Jintack Lim <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Oliver Upton <[email protected]>
|
|
While dealing with TLB invalidation targeting the guest hypervisor's
own stage-1 was easy, doing the same thing for its own guests is
a bit more involved.
Since such an invalidation is scoped by VMID, it needs to apply to
all s2_mmu contexts that have been tagged by that VMID, irrespective
of the value of VTTBR_EL2.BADDR.
So for each s2_mmu context matching that VMID, we invalidate the
corresponding TLBs, each context having its own "physical" VMID.
Co-developed-by: Jintack Lim <[email protected]>
Co-developed-by: Christoffer Dall <[email protected]>
Signed-off-by: Jintack Lim <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Oliver Upton <[email protected]>
|
|
Due to the way FEAT_NV2 suppresses traps when accessing EL2
system registers, we can't track when the guest changes its
HCR_EL2.TGE setting. This means we always trap EL1 TLBIs,
even if they don't affect any L2 guest.
Given that invalidating the EL2 TLBs doesn't require any messing
with the shadow stage-2 page-tables, we can simply emulate the
instructions early and return directly to the guest.
This is conditioned on the instruction being an EL1 one and
the guest's HCR_EL2.{E2H,TGE} being {1,1} (indicating that
the instruction targets the EL2 S1 TLBs), or the instruction
being one of the EL2 ones (which are not ambiguous).
EL1 TLBIs issued with HCR_EL2.{E2H,TGE}={1,0} are not handled
here, and cause a full exit so that they can be handled in
the context of a VMID.
Co-developed-by: Jintack Lim <[email protected]>
Co-developed-by: Christoffer Dall <[email protected]>
Signed-off-by: Jintack Lim <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Oliver Upton <[email protected]>
|
|
Provide the primitives required to handle TLB invalidation for
Stage-1 EL2 TLBs, which by definition do not require messing
with the Stage-2 page tables.
Co-developed-by: Jintack Lim <[email protected]>
Co-developed-by: Christoffer Dall <[email protected]>
Signed-off-by: Jintack Lim <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Oliver Upton <[email protected]>
|
|
Unmap/flush shadow stage 2 page tables for the nested VMs as well as the
stage 2 page table for the guest hypervisor.
Note: A bunch of the code in mmu.c relating to MMU notifiers is
currently dealt with in an extremely abrupt way, for example by clearing
out an entire shadow stage-2 table. This will be handled in a more
efficient way using the reverse mapping feature in a later version of
the patch series.
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Jintack Lim <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Oliver Upton <[email protected]>
|
|
If we are faulting on a shadow stage 2 translation, we first walk the
guest hypervisor's stage 2 page table to see if it has a mapping. If
not, we inject a stage 2 page fault to the virtual EL2. Otherwise, we
create a mapping in the shadow stage 2 page table.
Note that we have to deal with two IPAs when we got a shadow stage 2
page fault. One is the address we faulted on, and is in the L2 guest
phys space. The other is from the guest stage-2 page table walk, and is
in the L1 guest phys space. To differentiate them, we rename variables
so that fault_ipa is used for the former and ipa is used for the latter.
When mapping a page in a shadow stage-2, special care must be taken not
to be more permissive than the guest is.
Co-developed-by: Christoffer Dall <[email protected]>
Co-developed-by: Jintack Lim <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Jintack Lim <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Oliver Upton <[email protected]>
|
|
Based on the pseudo-code in the ARM ARM, implement a stage 2 software
page table walker.
Co-developed-by: Jintack Lim <[email protected]>
Signed-off-by: Jintack Lim <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Oliver Upton <[email protected]>
|
|
Add Stage-2 mmu data structures for virtual EL2 and for nested guests.
We don't yet populate shadow Stage-2 page tables, but we now have a
framework for getting to a shadow Stage-2 pgd.
We allocate twice the number of vcpus as Stage-2 mmu structures because
that's sufficient for each vcpu running two translation regimes without
having to flush the Stage-2 page tables.
Co-developed-by: Christoffer Dall <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Oliver Upton <[email protected]>
|
|
In order to allow for requesting a memory region that can be used for
things like pstore on multiple machines where the memory layout is not the
same, add a new option to the kernel command line called "reserve_mem".
The format is: reserve_mem=nn:align:name
Where it will find nn amount of memory at the given alignment of align.
The name field is to allow another subsystem to retrieve where the memory
was found. For example:
reserve_mem=12M:4096:oops ramoops.mem_name=oops
Where ramoops.mem_name will tell ramoops that memory was reserved for it
via the reserve_mem option and it can find it by calling:
if (reserve_mem_find_by_name("oops", &start, &size)) {
// start holds the start address and size holds the size given
This is typically used for systems that do not wipe the RAM, and this
command line will try to reserve the same physical memory on soft reboots.
Note, it is not guaranteed to be the same location. For example, if KASLR
places the kernel at the location of where the RAM reservation was from a
previous boot, the new reservation will be at a different location. Any
subsystem using this feature must add a way to verify that the contents of
the physical memory is from a previous boot, as there may be cases where
the memory will not be located at the same location.
Not all systems may work either. There could be bit flips if the reboot
goes through the BIOS. Using kexec to reboot the machine is likely to
have better results in such cases.
Link: https://lore.kernel.org/all/[email protected]/
Suggested-by: Mike Rapoport <[email protected]>
Tested-by: Guilherme G. Piccoli <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport (IBM) <[email protected]>
|
|
PORT_ALPM_* registers are using MMIO_TRANS2 macro. This is not correct as
they are port register. Use _PORT_MMIO instead.
Fixes: 4ee30a448255 ("drm/i915/alpm: Add ALPM register definitions")
Signed-off-by: Jouni Högander <[email protected]>
Reviewed-by: Animesh Manna <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
This reverts commit f3c2031db7dfdf470a2d9bf3bd1efa6edfa72d8d.
We want to notice possible issues faced with PSR2 Region Early Transport as
early as possible -> let's revert patch disabling Region Early Transport by
default. Also eDP 1.5 Panel Replay requires Early Transport.
Signed-off-by: Jouni Högander <[email protected]>
Reviewed-by: Animesh Manna <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Add new debug bit to be used with i915_edp_psr_debug debugfs
interface. This can be used to disable Panel Replay.
v2: ensure that fastset is performed when the bit changes
Signed-off-by: Jouni Högander <[email protected]>
Reviewed-by: Animesh Manna <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Enabling/disabling Panel Replay on sink side has to be done before link
training. We can't disable it in sink side on PSR disable.
Fixes: 88ae6c65ecdb ("drm/i915/psr: Unify panel replay enable/disable sink")
Signed-off-by: Jouni Högander <[email protected]>
Reviewed-by: Animesh Manna <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Currently PSR2 SU Region Early Transport is enabled by default on Lunarlake
if panel supports it despite enable_psr module parameter value. This patch
makes it possible for user to limit used PSR mode and prevent SU Region
Early Transport by setting enable_psr as 2. With default (-1) PSR2 SU
Region Early Transport is allowed.
v2: fix/improve commit desciption
Signed-off-by: Jouni Högander <[email protected]>
Reviewed-by: Animesh Manna <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
If user is specifically limiting PSR mode to PSR1 or PSR2: disable Panel
Replay. With default value -1 all modes are allowed including Panel
Replay. Disabling PSR using value 0 disables Panel Replay as well.
Also own compute config helper is added for Panel Replay. This makes sense
because number of Panel Replay specific checks are increasing.
v2: Squash adding Panel Replay compute config helper
Signed-off-by: Jouni Högander <[email protected]>
Reviewed-by: Animesh Manna <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Port clock is link rate in 10 kbit/s units. Take this into account when
calculating AUX Less wake time.
Fixes: da6a9836ac09 ("drm/i915/psr: Calculate aux less wake time")
Signed-off-by: Jouni Högander <[email protected]>
Reviewed-by: Animesh Manna <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Wa 16021440873 is writing wrong register. Instead of PIPE_SRCSZ_ERLY_TPT
write CURPOS_ERLY_TPT.
v2: use right offset as well
Fixes: 29cdef8539c3 ("drm/i915/display: Implement Wa_16021440873")
Signed-off-by: Jouni Högander <[email protected]>
Reviewed-by: Mika Kahola <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Currently SU area width is set as MAX_INT. This is causing
problems. Instead set it as pipe src width.
Fixes: 86b26b6aeac7 ("drm/i915/psr: Carry su area in crtc_state")
Signed-off-by: Jouni Högander <[email protected]>
Reviewed-by: Mika Kahola <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Use the FIELD_GET() macro instead of open coding the masks and shifts.
This makes the code more compact and improves readability as it avoids
the need to wrap excessively long lines.
Signed-off-by: Aapo Vienamo <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Yehezkel Bernat <[email protected]>
Signed-off-by: Mika Westerberg <[email protected]>
|
|
Since commit 2282679fb20b ("mm: submit multipage write for SWP_FS_OPS
swap-space"), we can plug multiple pages then unplug them all together.
That means iov_iter_count(iter) could be way bigger than PAGE_SIZE, it
actually equals the size of iov_iter_npages(iter, INT_MAX).
Note this issue has nothing to do with large folios as we don't support
THP_SWPOUT to non-block devices.
Fixes: 2282679fb20b ("mm: submit multipage write for SWP_FS_OPS swap-space")
Reported-by: Christoph Hellwig <[email protected]>
Closes: https://lore.kernel.org/linux-mm/[email protected]/
Cc: NeilBrown <[email protected]>
Cc: Anna Schumaker <[email protected]>
Cc: Steve French <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Chuanhua Han <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Chris Li <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: Paulo Alcantara <[email protected]>
Cc: Ronnie Sahlberg <[email protected]>
Cc: Shyam Prasad N <[email protected]>
Cc: Tom Talpey <[email protected]>
Cc: Bharath SM <[email protected]>
Cc: <[email protected]>
Signed-off-by: Barry Song <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
Support for the Allwinner H618, H618 and H700 was added to the sun50i
cpufreq-nvmem driver recently [1] however at the time some operating
points supported by the H700 (1.008, 1.032 and 1.512 GHz) and in use by
vendor BSPs were found to be unstable during testing, so the H700 speed
bin and the 1.032 GHz OPP were not included in the mainline driver.
Retesting with kernel 6.10rc2 (which carries additional fixes for the
driver) now shows stable operation with these points.
Add the H700 speed bin to the driver.
Reviewed-by: Andre Przywara <[email protected]>
Signed-off-by: Ryan Walklin <[email protected]>
--
[1] https://lore.kernel.org/linux-sunxi/[email protected]
Signed-off-by: Viresh Kumar <[email protected]>
|
|
The SPD5118 specification says, in its documentation of the page bits
in the MR11 register:
"
This register only applies to non-volatile memory (1024) Bytes) access of
SPD5 Hub device.
For volatile memory access, this register must be programmed to '000'.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
"
Renesas/ITD SPD5118 hub controllers take this literally and disable access
to volatile memory if the page selected in MR11 is != 0. Since the BIOS or
ROMMON will access the non-volatile memory and likely select a page != 0,
this means that the driver will not instantiate since it can not identify
the chip. Even if the driver instantiates, access to volatile registers
is blocked after a nvram read operation which selects a page other than 0.
To solve the problem, add initialization code to select page 0 during
probe. Before doing that, use basic validation to ensure that this is
really a SPD5118 device and not some random EEPROM.
Cc: Sasha Kozachuk <[email protected]>
Cc: John Hamrick <[email protected]>
Cc: Chris Sarra <[email protected]>
Tested-by: Armin Wolf <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
|
|
Using regmap for paging significantly improves caching since the regmap
cache no longer needs to be cleared after changing the page, so let's
use it.
Suggested-by: Armin Wolf <[email protected]>
Tested-by: Armin Wolf <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
|
|
As defined by the MANA Hardware spec, the queue size for DMA is 4KB
minimal, and power of 2. And, the HWC queue size has to be exactly
4KB.
To support page sizes other than 4KB on ARM64, define the minimal
queue size as a macro separately from the PAGE_SIZE, which we always
assumed it to be 4KB before supporting ARM64.
Also, add MANA specific macros and update code related to size
alignment, DMA region calculations, etc.
Signed-off-by: Haiyang Zhang <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Kamal Heib says:
====================
net/mlx4_en: Use ethtool_puts/sprintf
This patchset updates the mlx4_en driver to use the ethtool_puts and
ethtool_sprintf helper functions.
Signed-off-by: Kamal Heib <[email protected]>
====================
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Use the ethtool_puts/ethtool_sprintf helper to print the stats strings
into the ethtool strings interface.
Signed-off-by: Kamal Heib <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Use the ethtool_puts helper to print the selftest strings into the
ethtool strings interface.
Signed-off-by: Kamal Heib <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Use the ethtool_puts helper to print the priv flags strings into the
ethtool strings interface.
Signed-off-by: Kamal Heib <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
commit be27b8965297 ("net: stmmac: replace priv->speed with
the portTransmitRate from the tc-cbs parameters") introduced
a problem. When deleting, it prompts "Invalid portTransmitRate
0 (idleSlope - sendSlope)" and exits. Add judgment on cbs.enable.
Only when offload is enabled, speed divider needs to be calculated.
Fixes: be27b8965297 ("net: stmmac: replace priv->speed with the portTransmitRate from the tc-cbs parameters")
Signed-off-by: Xiaolei Wang <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Get master/slave configuration for initial system start with the link in
down state. This ensures ethtool shows current configuration. Also
fixes link reconfiguration with ethtool while in down state, preventing
ethtool from displaying outdated configuration.
Even though dp83tg720_config_init() is executed periodically as long as
the link is in admin up state but no carrier is detected, this is not
sufficient for the link in admin down state where
dp83tg720_read_status() is not periodically executed. To cover this
case, we need an extra read role configuration in
dp83tg720_config_aneg().
Fixes: cb80ee2f9bee1 ("net: phy: Add support for the DP83TG720S Ethernet PHY")
Cc: [email protected]
Signed-off-by: Oleksij Rempel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
In case this PHY is bootstrapped for managed mode, we need to manually
wake it. Otherwise no link will be detected.
Cc: [email protected]
Fixes: cb80ee2f9bee1 ("net: phy: Add support for the DP83TG720S Ethernet PHY")
Signed-off-by: Oleksij Rempel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Milkv Duo does not have a write-protect pin, so disable write protect
to prevent SDcards misdetected as read-only.
Fixes: 89a7056ed4f7 ("riscv: dts: sophgo: add sdcard support for milkv duo")
Signed-off-by: Haylen Chu <[email protected]>
Link: https://lore.kernel.org/r/SEYPR01MB4221943C7B101DD2318DA0D3D7CE2@SEYPR01MB4221.apcprd01.prod.exchangelabs.com
Signed-off-by: Inochi Amaoto <[email protected]>
Signed-off-by: Chen Wang <[email protected]>
|
|
When CXL subsystem is auto-assembling a pmem region during cxl
endpoint port probing, always hit below calltrace.
BUG: kernel NULL pointer dereference, address: 0000000000000078
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
RIP: 0010:cxl_pmem_region_probe+0x22e/0x360 [cxl_pmem]
Call Trace:
<TASK>
? __die+0x24/0x70
? page_fault_oops+0x82/0x160
? do_user_addr_fault+0x65/0x6b0
? exc_page_fault+0x7d/0x170
? asm_exc_page_fault+0x26/0x30
? cxl_pmem_region_probe+0x22e/0x360 [cxl_pmem]
? cxl_pmem_region_probe+0x1ac/0x360 [cxl_pmem]
cxl_bus_probe+0x1b/0x60 [cxl_core]
really_probe+0x173/0x410
? __pfx___device_attach_driver+0x10/0x10
__driver_probe_device+0x80/0x170
driver_probe_device+0x1e/0x90
__device_attach_driver+0x90/0x120
bus_for_each_drv+0x84/0xe0
__device_attach+0xbc/0x1f0
bus_probe_device+0x90/0xa0
device_add+0x51c/0x710
devm_cxl_add_pmem_region+0x1b5/0x380 [cxl_core]
cxl_bus_probe+0x1b/0x60 [cxl_core]
The cxl_nvd of the memdev needs to be available during the pmem region
probe. Currently the cxl_nvd is registered after the endpoint port probe.
The endpoint probe, in the case of autoassembly of regions, can cause a
pmem region probe requiring the not yet available cxl_nvd. Adjust the
sequence so this dependency is met.
This requires adding a port parameter to cxl_find_nvdimm_bridge() that
can be used to query the ancestor root port. The endpoint port is not
yet available, but will share a common ancestor with its parent, so
start the query from there instead.
Fixes: f17b558d6663 ("cxl/pmem: Refactor nvdimm device registration, delete the workqueue")
Co-developed-by: Dan Williams <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
Signed-off-by: Li Ming <[email protected]>
Tested-by: Alison Schofield <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Reviewed-by: Alison Schofield <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Dave Jiang <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kselftest fixes from Shuah Khan:
- filesystems: warn_unused_result warnings
- seccomp: format-zero-length warnings
- fchmodat2: clang build warnings due to-static-libasan
- openat2: clang build warnings due to static-libasan, LOCAL_HDRS
* tag 'linux_kselftest-fixes-6.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests/fchmodat2: fix clang build failure due to -static-libasan
selftests/openat2: fix clang build failures: -static-libasan, LOCAL_HDRS
selftests: seccomp: fix format-zero-length warnings
selftests: filesystems: fix warn_unused_result build warnings
|
|
openvswitch.sh makes use of substitutions of the form ${ns:0:1}, to
obtain the first character of $ns. Empirically, this is works with bash
but not dash. When run with dash these evaluate to an empty string and
printing an error to stdout.
# dash -c 'ns=client; echo "${ns:0:1}"' 2>error
# cat error
dash: 1: Bad substitution
# bash -c 'ns=client; echo "${ns:0:1}"' 2>error
c
# cat error
This leads to tests that neither pass nor fail.
F.e.
TEST: arp_ping [START]
adding sandbox 'test_arp_ping'
Adding DP/Bridge IF: sbx:test_arp_ping dp:arpping {, , }
create namespaces
./openvswitch.sh: 282: eval: Bad substitution
TEST: ct_connect_v4 [START]
adding sandbox 'test_ct_connect_v4'
Adding DP/Bridge IF: sbx:test_ct_connect_v4 dp:ct4 {, , }
./openvswitch.sh: 322: eval: Bad substitution
create namespaces
Resolve this by making openvswitch.sh a bash script.
Fixes: 918423fda910 ("selftests: openvswitch: add an initial flow programming case")
Signed-off-by: Simon Horman <[email protected]>
Reviewed-by: Przemek Kitszel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
On 32bit systems, the "4 * max" multiply can overflow. Use kcalloc()
to do the allocation to prevent this.
Fixes: 44c494c8e30e ("ptp: track available ptp vclocks information")
Signed-off-by: Dan Carpenter <[email protected]>
Reviewed-by: Wojciech Drewek <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Reviewed-by: Heng Qi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
While adding a SPI device, the SPI core ensures that multiple logical CS
doesn't map to the same physical CS. For example, spi->chip_select[0] !=
spi->chip_select[1] and so forth. However, unlike the SPI master, the SPI
slave doesn't have the list of chip selects, this leads to probe failure
when the SPI controller is configured as slave. Update the
__spi_add_device() function to perform this check only if the SPI
controller is configured as master.
Fixes: 4d8ff6b0991d ("spi: Add multi-cs memories support in SPI core")
Signed-off-by: Amit Kumar Mahapatra <[email protected]>
Link: https://msgid.link/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>
|
|
Add basic selftests.
Signed-off-by: David Vernet <[email protected]>
Acked-by: Tejun Heo <[email protected]>
|
|
Add Documentation/scheduler/sched-ext.rst which gives a high-level overview
and pointers to the examples.
v6: - Add paragraph explaining debug dump.
v5: - Updated to reflect /sys/kernel interface change. Kconfig options
added.
v4: - README improved, reformatted in markdown and renamed to README.md.
v3: - Added tools/sched_ext/README.
- Dropped _example prefix from scheduler names.
v2: - Apply minor edits suggested by Bagas. Caveats section dropped as all
of them are addressed.
Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
Cc: Bagas Sanjaya <[email protected]>
|
|
Currently, a dsq is always a FIFO. A task which is dispatched earlier gets
consumed or executed earlier. While this is sufficient when dsq's are used
for simple staging areas for tasks which are ready to execute, it'd make
dsq's a lot more useful if they can implement custom ordering.
This patch adds a vtime-ordered priority queue to dsq's. When the BPF
scheduler dispatches a task with the new scx_bpf_dispatch_vtime() helper, it
can specify the vtime tha the task should be inserted at and the task is
inserted into the priority queue in the dsq which is ordered according to
time_before64() comparison of the vtime values.
A DSQ can either be a FIFO or priority queue and automatically switches
between the two depending on whether scx_bpf_dispatch() or
scx_bpf_dispatch_vtime() is used. Using the wrong variant while the DSQ
already has the other type queued is not allowed and triggers an ops error.
Built-in DSQs must always be FIFOs.
This makes it very easy for the BPF schedulers to implement proper vtime
based scheduling within each dsq very easy and efficient at a negligible
cost in terms of code complexity and overhead.
scx_simple and scx_example_flatcg are updated to default to weighted
vtime scheduling (the latter within each cgroup). FIFO scheduling can be
selected with -f option.
v4: - As allowing mixing priority queue and FIFO on the same DSQ sometimes
led to unexpected starvations, DSQs now error out if both modes are
used at the same time and the built-in DSQs are no longer allowed to
be priority queues.
- Explicit type struct scx_dsq_node added to contain fields needed to be
linked on DSQs. This will be used to implement stateful iterator.
- Tasks are now always linked on dsq->list whether the DSQ is in FIFO or
PRIQ mode. This confines PRIQ related complexities to the enqueue and
dequeue paths. Other paths only need to look at dsq->list. This will
also ease implementing BPF iterator.
- Print p->scx.dsq_flags in debug dump.
v3: - SCX_TASK_DSQ_ON_PRIQ flag is moved from p->scx.flags into its own
p->scx.dsq_flags. The flag is protected with the dsq lock unlike other
flags in p->scx.flags. This led to flag corruption in some cases.
- Add comments explaining the interaction between using consumption of
p->scx.slice to determine vtime progress and yielding.
v2: - p->scx.dsq_vtime was not initialized on load or across cgroup
migrations leading to some tasks being stalled for extended period of
time depending on how saturated the machine is. Fixed.
Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
|
|
The core-sched support is composed of the following parts:
- task_struct->scx.core_sched_at is added. This is a timestamp which can be
used to order tasks. Depending on whether the BPF scheduler implements
custom ordering, it tracks either global FIFO ordering of all tasks or
local-DSQ ordering within the dispatched tasks on a CPU.
- prio_less() is updated to call scx_prio_less() when comparing SCX tasks.
scx_prio_less() calls ops.core_sched_before() if available or uses the
core_sched_at timestamp. For global FIFO ordering, the BPF scheduler
doesn't need to do anything. Otherwise, it should implement
ops.core_sched_before() which reflects the ordering.
- When core-sched is enabled, balance_scx() balances all SMT siblings so
that they all have tasks dispatched if necessary before pick_task_scx() is
called. pick_task_scx() picks between the current task and the first
dispatched task on the local DSQ based on availability and the
core_sched_at timestamps. Note that FIFO ordering is expected among the
already dispatched tasks whether running or on the local DSQ, so this path
always compares core_sched_at instead of calling into
ops.core_sched_before().
qmap_core_sched_before() is added to scx_qmap. It scales the
distances from the heads of the queues to compare the tasks across different
priority queues and seems to behave as expected.
v3: Fixed build error when !CONFIG_SCHED_SMT reported by Andrea Righi.
v2: Sched core added the const qualifiers to prio_less task arguments.
Explicitly drop them for ops.core_sched_before() task arguments. BPF
enforces access control through the verifier, so the qualifier isn't
actually operative and only gets in the way when interacting with
various helpers.
Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Reviewed-by: Josh Don <[email protected]>
Cc: Andrea Righi <[email protected]>
|
|
PM operations freeze userspace. Some BPF schedulers have active userspace
component and may misbehave as expected across PM events. While the system
is frozen, nothing too interesting is happening in terms of scheduling and
we can get by just fine with the fallback FIFO behavior. Let's make things
easier by always bypassing the BPF scheduler while PM events are in
progress.
Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
|
|
Add ops.cpu_online/offline() which are invoked when CPUs come online and
offline respectively. As the enqueue path already automatically bypasses
tasks to the local dsq on a deactivated CPU, BPF schedulers are guaranteed
to see tasks only on CPUs which are between online() and offline().
If the BPF scheduler doesn't implement ops.cpu_online/offline(), the
scheduler is automatically exited with SCX_ECODE_RESTART |
SCX_ECODE_RSN_HOTPLUG. Userspace can implement CPU hotpplug support
trivially by simply reinitializing and reloading the scheduler.
scx_qmap is updated to print out online CPUs on hotplug events. Other
schedulers are updated to restart based on ecode.
v3: - The previous implementation added @reason to
sched_class.rq_on/offline() to distinguish between CPU hotplug events
and topology updates. This was buggy and fragile as the methods are
skipped if the current state equals the target state. Instead, add
scx_rq_[de]activate() which are directly called from
sched_cpu_de/activate(). This also allows ops.cpu_on/offline() to
sleep which can be useful.
- ops.dispatch() could be called on a CPU that the BPF scheduler was
told to be offline. The dispatch patch is updated to bypass in such
cases.
v2: - To accommodate lock ordering change between scx_cgroup_rwsem and
cpus_read_lock(), CPU hotplug operations are put into its own SCX_OPI
block and enabled eariler during scx_ope_enable() so that
cpus_read_lock() can be dropped before acquiring scx_cgroup_rwsem.
- Auto exit with ECODE added.
Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
|
|
Scheduler classes are strictly ordered and when a higher priority class has
tasks to run, the lower priority ones lose access to the CPU. Being able to
monitor and act on these events are necessary for use cases includling
strict core-scheduling and latency management.
This patch adds two operations ops.cpu_acquire() and .cpu_release(). The
former is invoked when a CPU becomes available to the BPF scheduler and the
opposite for the latter. This patch also implements
scx_bpf_reenqueue_local() which can be called from .cpu_release() to trigger
requeueing of all tasks in the local dsq of the CPU so that the tasks can be
reassigned to other available CPUs.
scx_pair is updated to use .cpu_acquire/release() along with
%SCX_KICK_WAIT to make the pair scheduling guarantee strict even when a CPU
is preempted by a higher priority scheduler class.
scx_qmap is updated to use .cpu_acquire/release() to empty the local
dsq of a preempted CPU. A similar approach can be adopted by BPF schedulers
that want to have a tight control over latency.
v4: Use the new SCX_KICK_IDLE to wake up a CPU after re-enqueueing.
v3: Drop the const qualifier from scx_cpu_release_args.task. BPF enforces
access control through the verifier, so the qualifier isn't actually
operative and only gets in the way when interacting with various
helpers.
v2: Add p->scx.kf_mask annotation to allow calling scx_bpf_reenqueue_local()
from ops.cpu_release() nested inside ops.init() and other sleepable
operations.
Signed-off-by: David Vernet <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
|
|
If set when calling scx_bpf_kick_cpu(), the invoking CPU will busy wait for
the kicked cpu to enter the scheduler. See the following for example usage:
https://github.com/sched-ext/scx/blob/main/scheds/c/scx_pair.bpf.c
v2: - Updated to fit the updated kick_cpus_irq_workfn() implementation.
- Include SCX_KICK_WAIT related information in debug dump.
Signed-off-by: David Vernet <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
|