| Age | Commit message (Collapse) | Author | Files | Lines |
|
Create a vnic devlink health reporter for PFs/VFs interfaces.
The reporter's diagnose callback displays the values of vNIC/vport
transport debug counters of PFs/VFs, as follows:
$ devlink health diagnose pci/0000:08:00.0 reporter vnic
vNIC env counters:
total_error_queues: 0 send_queue_priority_update_flow: 0
comp_eq_overrun: 0 async_eq_overrun: 0 cq_overrun: 0
invalid_command: 0 quota_exceeded_command: 0
nic_receive_steering_discard: 0
Moreover, add documentation on the reporter functionality and the
counters description.
While at it, expose the vNIC counters diagnose function to be used by
the downstream patch, which will reveal the counters for representor
interfaces.
Signed-off-by: Maher Sanalla <[email protected]>
Reviewed-by: Moshe Shemesh <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Adjacent changes:
net/mptcp/protocol.h
63740448a32e ("mptcp: fix accept vs worker race")
2a6a870e44dd ("mptcp: stops worker on unaccepted sockets at listener close")
ddb1a072f858 ("mptcp: move first subflow allocation at mpc access time")
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from netfilter and bpf.
There are a few fixes for new code bugs, including the Mellanox one
noted in the last networking pull. No known regressions outstanding.
Current release - regressions:
- sched: clear actions pointer in miss cookie init fail
- mptcp: fix accept vs worker race
- bpf: fix bpf_arch_text_poke() with new_addr == NULL on s390
- eth: bnxt_en: fix a possible NULL pointer dereference in unload
path
- eth: veth: take into account peer device for
NETDEV_XDP_ACT_NDO_XMIT xdp_features flag
Current release - new code bugs:
- eth: revert "net/mlx5: Enable management PF initialization"
Previous releases - regressions:
- netfilter: fix recent physdev match breakage
- bpf: fix incorrect verifier pruning due to missing register
precision taints
- eth: virtio_net: fix overflow inside xdp_linearize_page()
- eth: cxgb4: fix use after free bugs caused by circular dependency
problem
- eth: mlxsw: pci: fix possible crash during initialization
Previous releases - always broken:
- sched: sch_qfq: prevent slab-out-of-bounds in qfq_activate_agg
- netfilter: validate catch-all set elements
- bridge: don't notify FDB entries with "master dynamic"
- eth: bonding: fix memory leak when changing bond type to ethernet
- eth: i40e: fix accessing vsi->active_filters without holding lock
Misc:
- Mat is back as MPTCP co-maintainer"
* tag 'net-6.3-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (33 commits)
net: bridge: switchdev: don't notify FDB entries with "master dynamic"
Revert "net/mlx5: Enable management PF initialization"
MAINTAINERS: Resume MPTCP co-maintainer role
mailmap: add entries for Mat Martineau
e1000e: Disable TSO on i219-LM card to increase speed
bnxt_en: fix free-runnig PHC mode
net: dsa: microchip: ksz8795: Correctly handle huge frame configuration
bpf: Fix incorrect verifier pruning due to missing register precision taints
hamradio: drop ISA_DMA_API dependency
mlxsw: pci: Fix possible crash during initialization
mptcp: fix accept vs worker race
mptcp: stops worker on unaccepted sockets at listener close
net: rpl: fix rpl header size calculation
net: vmxnet3: Fix NULL pointer dereference in vmxnet3_rq_rx_complete()
bonding: Fix memory leak when changing bond type to Ethernet
veth: take into account peer device for NETDEV_XDP_ACT_NDO_XMIT xdp_features flag
mlxfw: fix null-ptr-deref in mlxfw_mfa2_tlv_next()
bnxt_en: Fix a possible NULL pointer dereference in unload path
bnxt_en: Do not initialize PTP on older P3/P4 chips
netfilter: nf_tables: tighten netlink attribute requirements for catch-all elements
...
|
|
Userspace can't easily discover how much of a sleep cycle was spent in a
hardware sleep state without using kernel tracing and vendor specific sysfs
or debugfs files.
To make this information more discoverable, introduce 3 new sysfs files:
1) The time spent in a hw sleep state for last cycle.
2) The time spent in a hw sleep state since the kernel booted
3) The maximum time that the hardware can report for a sleep cycle.
All of these files will be present only if the system supports s2idle.
Reviewed-by: Hans de Goede <[email protected]>
Signed-off-by: Mario Limonciello <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
* for-next/perf: (24 commits)
KVM: arm64: Ensure CPU PMU probes before pKVM host de-privilege
drivers/perf: hisi: add NULL check for name
drivers/perf: hisi: Remove redundant initialized of pmu->name
perf/arm-cmn: Fix port detection for CMN-700
arm64: pmuv3: dynamically map PERF_COUNT_HW_BRANCH_INSTRUCTIONS
perf/arm-cmn: Validate cycles events fully
Revert "ARM: mach-virt: Select PMUv3 driver by default"
drivers/perf: apple_m1: Add Apple M2 support
dt-bindings: arm-pmu: Add PMU compatible strings for Apple M2 cores
perf: arm_cspmu: Fix variable dereference warning
perf/amlogic: Fix config1/config2 parsing issue
drivers/perf: Use devm_platform_get_and_ioremap_resource()
kbuild, drivers/perf: remove MODULE_LICENSE in non-modules
perf: qcom: Use devm_platform_get_and_ioremap_resource()
perf: arm: Use devm_platform_get_and_ioremap_resource()
perf/arm-cmn: Move overlapping wp_combine field
ARM: mach-virt: Select PMUv3 driver by default
ARM: perf: Allow the use of the PMUv3 driver on 32bit ARM
ARM: Make CONFIG_CPU_V7 valid for 32bit ARMv8 implementations
perf: pmuv3: Change GENMASK to GENMASK_ULL
...
|
|
The latest version of your favorite fork of the PE/COFF spec includes a
new type of header flag that is intended to be used in the context of
EFI firmware to indicate to the image loader that the executable regions
of an image can be mapped with BTI/IBT enforcement enabled.
So let's import these definitions so we can use them in subsequent
patches.
Signed-off-by: Ard Biesheuvel <[email protected]>
|
|
The tracking of used_hiwater adds an atomic operation to the hot
path. This is acceptable only when debugging the kernel. To make
sure that the fields can never be used by mistake, do not even
include them in struct io_tlb_mem if CONFIG_DEBUG_FS is not set.
The build fails after doing that. To fix it, it is necessary to
remove all code specific to debugfs and instead provide a stub
implementation of swiotlb_create_debugfs_files(). As a bonus, this
change allows to remove one __maybe_unused attribute.
Signed-off-by: Petr Tesarik <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
device_property functions do not modify the device pointer passed to them.
The underlying of_device and fwnode_ functions actually already take
const * arguments. Mark the parameter constant to simplify conversion
from of_property to device_property functions, and to let the calling code
use const device pointers where possible.
Cc: Chris Packham <[email protected]>
Reviewed-by: Chris Packham <[email protected]>
Reviewed-by: Sakari Ailus <[email protected]>
Signed-off-by: Guenter Roeck <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
* for-next/misc:
arm64: kexec: include reboot.h
arm64: delete dead code in this_cpu_set_vectors()
arm64: kernel: Fix kernel warning when nokaslr is passed to commandline
arm64: kgdb: Set PSTATE.SS to 1 to re-enable single-step
arm64/sme: Fix some comments of ARM SME
arm64/signal: Alloc tpidr2 sigframe after checking system_supports_tpidr2()
arm64/signal: Use system_supports_tpidr2() to check TPIDR2
arm64: compat: Remove defines now in asm-generic
arm64: kexec: remove unnecessary (void*) conversions
arm64: armv8_deprecated: remove unnecessary (void*) conversions
firmware: arm_sdei: Fix sleep from invalid context BUG
|
|
Currently the PSP semaphore communication base address is discovered
by using an MSR that is not architecturally guaranteed for future
platforms. Also the mailbox that is utilized for communication with
the PSP may have other consumers in the kernel, so it's better to
make all communication go through a single driver.
Signed-off-by: Mario Limonciello <[email protected]>
Reviewed-by: Mark Hasemeyer <[email protected]>
Acked-by: Jarkko Nikula <[email protected]>
Tested-by: Mark Hasemeyer <[email protected]>
Acked-by: Wolfram Sang <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>
|
|
Add a crypto_tfm_get interface to allow tfm objects to be shared.
They can still be freed in the usual way.
This should only be done with tfm objects with no keys. You must
also not modify the tfm flags in any way once it becomes shared.
Signed-off-by: Herbert Xu <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>
|
|
Many of the older USB drivers in the Linux USB stack were written
based simply on a vendor's device specification. They use the
endpoint information in the spec and assume these endpoints will
always be present, with the properties listed, in any device matching
the given vendor and product IDs.
While that may have been true back then, with spoofing and fuzzing it
is not true any more. More and more we are finding that those old
drivers need to perform at least a minimum of checking before they try
to use any endpoint other than ep0.
To make this checking as simple as possible, we now add a couple of
utility routines to the USB core. usb_check_bulk_endpoints() and
usb_check_int_endpoints() take an interface pointer together with a
list of endpoint addresses (numbers and directions). They check that
the interface's current alternate setting includes endpoints with
those addresses and that each of these endpoints has the right type:
bulk or interrupt, respectively.
Although we already have usb_find_common_endpoints() and related
routines meant for a similar purpose, they are not well suited for
this kind of checking. Those routines find endpoints of various
kinds, but only one (either the first or the last) of each kind, and
they don't verify that the endpoints' addresses agree with what the
caller expects.
In theory the new routines could be more general: They could take a
particular altsetting as their argument instead of always using the
interface's current altsetting. In practice I think this won't matter
too much; multiple altsettings tend to be used for transferring media
(audio or visual) over isochronous endpoints, not bulk or interrupt.
Drivers for such devices will generally require more sophisticated
checking than these simplistic routines provide.
Signed-off-by: Alan Stern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
This reverts commit fe998a3c77b9f989a30a2a01fb00d3729a6d53a4.
Paul reports that it causes a regression with IB on CX4
and FW 12.18.1000. In addition I think that the concept
of "management PF" is not fully accepted and requires
a discussion.
Fixes: fe998a3c77b9 ("net/mlx5: Enable management PF initialization")
Reported-by: Paul Moore <[email protected]>
Link: https://lore.kernel.org/all/CAHC9VhQ7A4+msL38WpbOMYjAqLp0EtOjeLh4Dc6SQtD6OUvCQg@mail.gmail.com/
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"22 hotfixes.
19 are cc:stable and the remainder address issues which were
introduced during this merge cycle, or aren't considered suitable for
-stable backporting.
19 are for MM and the remainder are for other subsystems"
* tag 'mm-hotfixes-stable-2023-04-19-16-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits)
nilfs2: initialize unused bytes in segment summary blocks
mm: page_alloc: skip regions with hugetlbfs pages when allocating 1G pages
mm/mmap: regression fix for unmapped_area{_topdown}
maple_tree: fix mas_empty_area() search
maple_tree: make maple state reusable after mas_empty_area_rev()
mm: kmsan: handle alloc failures in kmsan_ioremap_page_range()
mm: kmsan: handle alloc failures in kmsan_vmap_pages_range_noflush()
tools/Makefile: do missed s/vm/mm/
mm: fix memory leak on mm_init error handling
mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock
kernel/sys.c: fix and improve control flow in __sys_setres[ug]id()
Revert "userfaultfd: don't fail on unrecognized features"
writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs
maple_tree: fix a potential memory leak, OOB access, or other unpredictable bug
tools/mm/page_owner_sort.c: fix TGID output when cull=tg is used
mailmap: update jtoppins' entry to reference correct email
mm/mempolicy: fix use-after-free of VMA iterator
mm/huge_memory.c: warn with pr_warn_ratelimited instead of VM_WARN_ON_ONCE_FOLIO
mm/mprotect: fix do_mprotect_pkey() return on error
mm/khugepaged: check again on anon uffd-wp during isolation
...
|
|
Currently call_bind_status places a hard limit of 3 to the number of
retries on EACCES error. This limit was done to prevent NLM unlock
requests from being hang forever when the server keeps returning garbage.
However this change causes problem for cases when NLM service takes
longer than 9 seconds to register with the port mapper after a restart.
This patch removes this hard coded limit and let the RPC handles
the retry based on the standard hard/soft task semantics.
Fixes: 0b760113a3a1 ("NLM: Don't hang forever on NLM unlock requests")
Reported-by: Helen Chao <[email protected]>
Tested-by: Helen Chao <[email protected]>
Signed-off-by: Dai Ngo <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
|
|
Locking range start and locking range length
attributes may be require to satisfy restrictions
exposed by OPAL2 geometry feature reporting.
Geometry reporting feature is described in TCG OPAL SSC,
section 3.1.1.4 (ALIGN, LogicalBlockSize, AlignmentGranularity
and LowestAlignedLBA).
4.3.5.2.1.1 RangeStart Behavior:
[ StartAlignment = (RangeStart modulo AlignmentGranularity) - LowestAlignedLBA ]
When processing a Set method or CreateRow method on the Locking
table for a non-Global Range row, if:
a) the AlignmentRequired (ALIGN above) column in the LockingInfo
table is TRUE;
b) RangeStart is non-zero; and
c) StartAlignment is non-zero, then the method SHALL fail and
return an error status code INVALID_PARAMETER.
4.3.5.2.1.2 RangeLength Behavior:
If RangeStart is zero, then
[ LengthAlignment = (RangeLength modulo AlignmentGranularity) - LowestAlignedLBA ]
If RangeStart is non-zero, then
[ LengthAlignment = (RangeLength modulo AlignmentGranularity) ]
When processing a Set method or CreateRow method on the Locking
table for a non-Global Range row, if:
a) the AlignmentRequired (ALIGN above) column in the LockingInfo
table is TRUE;
b) RangeLength is non-zero; and
c) LengthAlignment is non-zero, then the method SHALL fail and
return an error status code INVALID_PARAMETER
In userspace we stuck to logical block size reported by general
block device (via sysfs or ioctl), but we can not read
'AlignmentGranularity' or 'LowestAlignedLBA' anywhere else and
we need to get those values from sed-opal interface otherwise
we will not be able to report or avoid locking range setup
INVALID_PARAMETER errors above.
Signed-off-by: Ondrej Kozina <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Christian Brauner <[email protected]>
Tested-by: Milan Broz <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Raw NAND core changes:
* Convert to platform remove callback returning void
* Fix spelling mistake waifunc() -> waitfunc()
Raw NAND controller driver changes:
* imx: Remove unused is_imx51_nfc and imx53_nfc functions
* omap2: Drop obsolete dependency on COMPILE_TEST
* orion: Use devm_platform_ioremap_resource()
* qcom:
- Use of_property_present() for testing DT property presence
- Use devm_platform_get_and_ioremap_resource()
* stm32_fmc2: Depends on ARCH_STM32 instead of MACH_STM32MP157
* tmio: Remove reference to config MTD_NAND_TMIO in the parsers
Raw NAND manufacturer driver changes:
* hynix: Fix up bit 0 of sdr_timing_mode
SPI-NAND changes:
* Add support for ESMT F50x1G41LB
Signed-off-by: Miquel Raynal <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux into driver-core-next
Sudeep writes:
cacheinfo and arch_topology updates for v6.4
The cache information can be extracted from either a Device Tree(DT),
the PPTT ACPI table, or arch registers (clidr_el1 for arm64).
When the DT is used but no cache properties are advertised, the current
code doesn't correctly fallback to using arch information. The changes
fixes the same and also assuse the that L1 data/instruction caches
are private and L2/higher caches are shared when the cache information
is missing in DT/ACPI and is derived form clidr_el1/arch registers.
Currently the cacheinfo is built from the primary CPU prior to secondary
CPUs boot, if the DT/ACPI description contains cache information.
However, if not present, it still reverts to the old behavior, which
allocates the cacheinfo memory on each secondary CPUs which causes
RT kernels to triggers a "BUG: sleeping function called from invalid
context".
The changes here attempts to enable automatic detection for RT kernels
when no DT/ACPI cache information is available, by pre-allocating
cacheinfo memory on the primary CPU.
* tag 'cacheinfo-updates-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux:
cacheinfo: Add use_arch[|_cache]_info field/function
arch_topology: Remove early cacheinfo error message if -ENOENT
cacheinfo: Check cache properties are present in DT
cacheinfo: Check sib_leaf in cache_leaves_are_shared()
cacheinfo: Allow early level detection when DT/ACPI info is missing/broken
cacheinfo: Add arm64 early level initializer implementation
cacheinfo: Add arch specific early level initializer
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/djakov/icc into char-misc-next
Georgi writes:
interconnect changes for 6.4
This pull request contains the interconnect changes for the 6.4-rc1 merge
window, which this time are mostly cleanups.
Core changes:
interconnect: Skip call into provider if initial bw is zero
interconnect: Use of_property_present() for testing DT property presence
interconnect: drop racy registration API
interconnect: drop unused icc_link_destroy() interface
Driver changes:
interconnect: qcom: Drop obsolete dependency on COMPILE_TEST
interconnect: qcom: drop obsolete OSM_L3/EPSS defines
interconnect: qcom: osm-l3: drop unuserd header inclusion
interconnect: qcom: rpm: drop bogus pm domain attach
interconnect: qcom: rpm: make QoS INVALID default
interconnect: qcom: rpm: Add support for specifying channel num
interconnect: qcom: Sort kerneldoc entries
dt-bindings: interconnect: OSM L3: Add SM6375 CPUCP compatible
dt-bindings: interconnect: qcom,msm8998-bwmon: Resolve MSM8998 support
Signed-off-by: Georgi Djakov <[email protected]>
* tag 'icc-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/djakov/icc:
dt-bindings: interconnect: qcom,msm8998-bwmon: Resolve MSM8998 support
dt-bindings: interconnect: OSM L3: Add SM6375 CPUCP compatible
interconnect: qcom: Sort kerneldoc entries
interconnect: qcom: rpm: Add support for specifying channel num
interconnect: qcom: rpm: make QoS INVALID default
interconnect: qcom: rpm: drop bogus pm domain attach
interconnect: drop unused icc_link_destroy() interface
interconnect: drop racy registration API
interconnect: Use of_property_present() for testing DT property presence
interconnect: qcom: osm-l3: drop unuserd header inclusion
interconnect: qcom: drop obsolete OSM_L3/EPSS defines
interconnect: Skip call into provider if initial bw is zero
interconnect: qcom: Drop obsolete dependency on COMPILE_TEST
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mani/mhi into char-misc-next
Manivannan writes:
MHI Host
========
Core
----
- Removed the "mhi_poll()" API as there are no in-kernel users available at the
moment.
- Added range check for the CHDBOFF and ERDBOFF registers in case the device
reports bad values.
- Fixed the errno for the rest of the range checks to use -ERANGE.
- Modified the event ring handlers to ring the doorbell only if there are
any pending elements in the ring to process for the device.
- Removed the check for EE (Execution Environment) while processing the SYS_ERR
transition as it creates device recovery issues when SBL (Secondary
Bootloader) crashes early.
- Used mhi_tryset_pm_state() API to set the error state instead of open coding
if the firmware loading fails. This avoids the race with other pm_state
updates.
pci_generic
-----------
- Dropped the dedundant pci_{enable/disable}_pcie_error_reporting() calls from
driver probe's error path as the PCI core itself takes care of that now.
- Revered the commit 2d5253a096c6 ("bus: mhi: host: pci_generic: Add a secondary
AT port to Telit FN990") as it turned out to be erroneous. This happened due
to the patch adding secondary AT port for FN990 getting applied through NET
and MHI trees and this caused two commits for the same functionality but one
of them ended up wrong.
- Added support for Foxconn T99W510 modem based on SDX24 chipset from Qualcomm.
MHI Endpoint
============
- Demoted the channel not supported error log to debug as not all devices will
support all channels defined in MHI spec and this may spam users.
* tag 'mhi-for-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mani/mhi:
bus: mhi: host: Use mhi_tryset_pm_state() for setting fw error state
bus: mhi: host: Remove duplicate ee check for syserr
bus: mhi: host: Avoid ringing EV DB if there are no elements to process
bus: mhi: pci_generic: Add Foxconn T99W510
bus: mhi: host: Use ERANGE for BHIOFF/BHIEOFF range check
bus: mhi: host: Range check CHDBOFF and ERDBOFF
bus: mhi: host: pci_generic: Revert "Add a secondary AT port to Telit FN990"
bus: mhi: host: pci_generic: Drop redundant pci_enable_pcie_error_reporting()
bus: mhi: ep: Demote unsupported channel error log to debug
bus: mhi: host: Remove mhi_poll() API
|
|
Accesses to nf_trace and ipvs_property are already wrapped
by ifdefs where necessary. Don't allocate the bits for those
fields at all if possible.
Acked-by: Florian Westphal <[email protected]>
Acked-by: Simon Horman <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
nf_trace is a debug feature, AFAIU, and yet it sits oddly
high in the sk_buff bitfield. Move it down, pushing up
dst_pending_confirm and inner_protocol_type.
Next change will make nf_trace optional (under Kconfig)
and all optional fields should be placed after 2b fields
to avoid 2b fields straddling bytes.
dst_pending_confirm is L3, so it makes sense next to ignore_df.
inner_protocol_type goes up just to keep the balance.
Acked-by: Florian Westphal <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
alloc_cpu is currently between 4 byte fields, so it's almost
guaranteed to create a 2B hole. It has a knock on effect of
creating a 4B hole after @end (and @end and @tail being in
different cachelines).
None of this matters hugely, but for kernel configs which
don't enable all the features there may well be a 2B hole
after the bitfield. Move alloc_cpu there.
Reviewed-by: Florian Fainelli <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
SCTP is not universally deployed, allow hiding its bit
from the skb.
Reviewed-by: Florian Fainelli <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Datacenter kernel builds will very likely not include WIRELESS,
so let them shave 2 bits off the skb by hiding the wifi fields.
Reviewed-by: Florian Fainelli <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Acked-by: Johannes Berg <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Linux LEDs can be requested to perform hardware accelerated
blinking. Pass this to the PHY driver, if it implements the op.
Signed-off-by: Andrew Lunn <[email protected]>
Signed-off-by: Christian Marangi <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Linux LEDs can be software controlled via the brightness file in /sys.
LED drivers need to implement a brightness_set function which the core
will call. Implement an intermediary in phy_device, which will call
into the phy driver if it implements the necessary function.
Signed-off-by: Andrew Lunn <[email protected]>
Signed-off-by: Christian Marangi <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Define common binding parsing for all PHY drivers with LEDs using
phylib. Parse the DT as part of the phy_probe and add LEDs to the
linux LED class infrastructure. For the moment, provide a dummy
brightness function, which will later be replaced with a call into the
PHY driver. This allows testing since the LED core might otherwise
reject an LED whose brightness cannot be set.
Add a dependency on LED_CLASS. It either needs to be built in, or not
enabled, since a modular build can result in linker errors.
Signed-off-by: Andrew Lunn <[email protected]>
Signed-off-by: Christian Marangi <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Provide stubs for devm_led_classdev_register_ext() and
led_init_default_state_get() so that LED drivers embedded within other
drivers such as PHYs and Ethernet switches still build when LEDS_CLASS
or NEW_LEDS are disabled. This also helps with Kconfig dependencies,
which are somewhat hairy for phylib and mdio and only get worse when
adding a dependency on LED_CLASS.
Signed-off-by: Andrew Lunn <[email protected]>
Signed-off-by: Christian Marangi <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Currently, bonding only obtain the timestamp (ts) information of
the active slave, which is available only for modes 1, 5, and 6.
For other modes, bonding only has software rx timestamping support.
However, some users who use modes such as LACP also want tx timestamp
support. To address this issue, let's check the ts information of each
slave. If all slaves support tx timestamping, we can enable tx
timestamping support for the bond.
Add a note that the get_ts_info may be called with RCU, or rtnl or
reference on the device in ethtool.h>
Suggested-by: Miroslav Lichvar <[email protected]>
Signed-off-by: Hangbin Liu <[email protected]>
Acked-by: Jay Vosburgh <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Unbreak br_netfilter physdev match support, from Florian Westphal.
2) Use GFP_KERNEL_ACCOUNT for stateful/policy objects, from Chen Aotian.
3) Use IS_ENABLED() in nf_reset_trace(), from Florian Westphal.
4) Fix validation of catch-all set element.
5) Tighten requirements for catch-all set elements.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: tighten netlink attribute requirements for catch-all elements
netfilter: nf_tables: validate catch-all set elements
netfilter: nf_tables: fix ifdef to also consider nf_tables=m
netfilter: nf_tables: Modify nla_memdup's flag to GFP_KERNEL_ACCOUNT
netfilter: br_netfilter: fix recent physdev match breakage
====================
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
hwpoison_user_mappings() is updated to support ksm pages, and add
collect_procs_ksm() to collect processes when the error hit an ksm page.
The difference from collect_procs_anon() is that it also needs to traverse
the rmap-item list on the stable node of the ksm page. At the same time,
add_to_kill_ksm() is added to handle ksm pages. And
task_in_to_kill_list() is added to avoid duplicate addition of tsk to the
to_kill list. This is because when scanning the list, if the pages that
make up the ksm page all come from the same process, they may be added
repeatedly.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Longlong Xia <[email protected]>
Tested-by: Naoya Horiguchi <[email protected]>
Reviewed-by: Kefeng Wang <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Nanyong Sun <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Delay accounting does not track the delay of IRQ/SOFTIRQ. While
IRQ/SOFTIRQ could have obvious impact on some workloads productivity, such
as when workloads are running on system which is busy handling network
IRQ/SOFTIRQ.
Get the delay of IRQ/SOFTIRQ could help users to reduce such delay. Such
as setting interrupt affinity or task affinity, using kernel thread for
NAPI etc. This is inspired by "sched/psi: Add PSI_IRQ to track
IRQ/SOFTIRQ pressure"[1]. Also fix some code indent problems of older
code.
And update tools/accounting/getdelays.c:
/ # ./getdelays -p 156 -di
print delayacct stats ON
printing IO accounting
PID 156
CPU count real total virtual total delay total delay average
15 15836008 16218149 275700790 18.380ms
IO count delay total delay average
0 0 0.000ms
SWAP count delay total delay average
0 0 0.000ms
RECLAIM count delay total delay average
0 0 0.000ms
THRASHING count delay total delay average
0 0 0.000ms
COMPACT count delay total delay average
0 0 0.000ms
WPCOPY count delay total delay average
36 7586118 0.211ms
IRQ count delay total delay average
42 929161 0.022ms
[1] commit 52b1364ba0b1("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure")
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Yang <[email protected]>
Cc: Jiang Xuexin <[email protected]>
Cc: wangyong <[email protected]>
Cc: junhua huang <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
This has a slight benefit for x86 and has no effect on other targets.
The benefit to x86 is it change the codegen for setting a node to block
from `mov %r0, %r1; or $RB_BLACK, %r1` to `lea RB_BLACK(%r0), %r1` which
saves an instructions.
In all other cases it just replace ALU with ALU (or -> and) which
perform the same on all machines I am aware of.
Total instructions in rbtree.o:
Before - 802
After - 782
so it saves about 20 `mov` instructions.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Noah Goldstein <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The interface for fcntl expects the argument passed for the command
F_ADD_SEALS to be of type int. The current code wrongly treats it as a
long. In order to avoid access to undefined bits, we should explicitly
cast the argument to int.
This commit changes the signature of all the related and helper functions
so that they treat the argument as int instead of long.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Luca Vizzarro <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: Chuck Lever <[email protected]>
Cc: Kevin Brodsky <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Szabolcs Nagy <[email protected]>
Cc: "Theodore Ts'o" <[email protected]>
Cc: David Laight <[email protected]>
Cc: Mark Rutland <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Android 14 and later default to MGLRU [1] and field telemetry showed
occasional long tail latency (>100ms) in the reclaim path.
Tracing revealed priority inversion in the reclaim path. In
try_to_inc_max_seq(), when high priority tasks were blocked on
wait_event_killable(), the preemption of the low priority task to call
wake_up_all() caused those high priority tasks to wait longer than
necessary. In general, this problem is not different from others of its
kind, e.g., one caused by mutex_lock(). However, it is specific to MGLRU
because it introduced the new wait queue lruvec->mm_state.wait.
The purpose of this new wait queue is to avoid the thundering herd
problem. If many direct reclaimers rush into try_to_inc_max_seq(), only
one can succeed, i.e., the one to wake up the rest, and the rest who
failed might cause premature OOM kills if they do not wait. So far there
is no evidence supporting this scenario, based on how often the wait has
been hit. And this begs the question how useful the wait queue is in
practice.
Based on Minchan's recommendation, which is in line with his commit
6d4675e60135 ("mm: don't be stuck to rmap lock on reclaim path") and the
rest of the MGLRU code which also uses trylock when possible, remove the
wait queue.
[1] https://android-review.googlesource.com/q/I7ed7fbfd6ef9ce10053347528125dd98c39e50bf
Link: https://lkml.kernel.org/r/[email protected]
Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
Signed-off-by: Kalesh Singh <[email protected]>
Suggested-by: Minchan Kim <[email protected]>
Reported-by: Wei Wang <[email protected]>
Acked-by: Yu Zhao <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Jan Alexander Steffens (heftig) <[email protected]>
Cc: Oleksandr Natalenko <[email protected]>
Cc: Suleiman Souhlal <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
During reclaim, we keep track of pages reclaimed from other means than
LRU-based reclaim through scan_control->reclaim_state->reclaimed_slab,
which we stash a pointer to in current task_struct.
However, we keep track of more than just reclaimed slab pages through
this. We also use it for clean file pages dropped through pruned inodes,
and xfs buffer pages freed. Rename reclaimed_slab to reclaimed, and add a
helper function that wraps updating it through current, so that future
changes to this logic are contained within include/linux/swap.h.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yosry Ahmed <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: NeilBrown <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Non-void KMSAN hooks may return error codes that indicate that KMSAN
failed to reflect the changed memory state in the metadata (e.g. it could
not create the necessary memory mappings). In such cases the callers
should handle the errors to prevent the tool from using the inconsistent
metadata in the future.
We mark non-void hooks with __must_check so that error handling is not
skipped.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alexander Potapenko <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Dipanjan Das <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
copy-on-write of hugetlb user pages with uncorrectable errors will result
in a kernel crash. This is because the copy is performed in kernel mode
and in general we can not handle accessing memory with such errors while
in kernel mode. Commit a873dfe1032a ("mm, hwpoison: try to recover from
copy-on write faults") introduced the routine copy_user_highpage_mc() to
gracefully handle copying of user pages with uncorrectable errors.
However, the separate hugetlb copy-on-write code paths were not modified
as part of commit a873dfe1032a.
Modify hugetlb copy-on-write code paths to use copy_mc_user_highpage() so
that they can also gracefully handle uncorrectable errors in user pages.
This involves changing the hugetlb specific routine
copy_user_large_folio() from type void to int so that it can return an
error. Modify the hugetlb userfaultfd code in the same way so that it can
return -EHWPOISON if it encounters an uncorrectable error.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liu Shixin <[email protected]>
Acked-by: Mike Kravetz <[email protected]>
Reviewed-by: Naoya Horiguchi <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Tony Luck <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Now we use ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP config option to
indicate devdax and hugetlb vmemmap optimization support. Hence rename
that to a generic ARCH_WANT_OPTIMIZE_VMEMMAP
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
Cc: Joao Martins <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Tarun Sahu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
commit 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for
compound devmaps") added support for using optimized vmmemap for devdax
devices. But how vmemmap mappings are created are architecture specific.
For example, powerpc with hash translation doesn't have vmemmap mappings
in init_mm page table instead they are bolted table entries in the
hardware page table
vmemmap_populate_compound_pages() used by vmemmap optimization code is not
aware of these architecture-specific mapping. Hence allow architecture to
opt for this feature. I selected architectures supporting
HUGETLB_PAGE_OPTIMIZE_VMEMMAP option as also supporting this feature.
This patch fixes the below crash on ppc64.
BUG: Unable to handle kernel data access on write at 0xc00c000100400038
Faulting instruction address: 0xc000000001269d90
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 7 PID: 1 Comm: swapper/0 Not tainted 6.3.0-rc5-150500.34-default+ #2 5c90a668b6bbd142599890245c2fb5de19d7d28a
Hardware name: IBM,9009-42G POWER9 (raw) 0x4e0202 0xf000005 of:IBM,FW950.40 (VL950_099) hv:phyp pSeries
NIP: c000000001269d90 LR: c0000000004c57d4 CTR: 0000000000000000
REGS: c000000003632c30 TRAP: 0300 Not tainted (6.3.0-rc5-150500.34-default+)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24842228 XER: 00000000
CFAR: c0000000004c57d0 DAR: c00c000100400038 DSISR: 42000000 IRQMASK: 0
....
NIP [c000000001269d90] __init_single_page.isra.74+0x14/0x4c
LR [c0000000004c57d4] __init_zone_device_page+0x44/0xd0
Call Trace:
[c000000003632ed0] [c000000003632f60] 0xc000000003632f60 (unreliable)
[c000000003632f10] [c0000000004c5ca0] memmap_init_zone_device+0x170/0x250
[c000000003632fe0] [c0000000005575f8] memremap_pages+0x2c8/0x7f0
[c0000000036330c0] [c000000000557b5c] devm_memremap_pages+0x3c/0xa0
[c000000003633100] [c000000000d458a8] dev_dax_probe+0x108/0x3e0
[c0000000036331a0] [c000000000d41430] dax_bus_probe+0xb0/0x140
[c0000000036331d0] [c000000000cef27c] really_probe+0x19c/0x520
[c000000003633260] [c000000000cef6b4] __driver_probe_device+0xb4/0x230
[c0000000036332e0] [c000000000cef888] driver_probe_device+0x58/0x120
[c000000003633320] [c000000000cefa6c] __device_attach_driver+0x11c/0x1e0
[c0000000036333a0] [c000000000cebc58] bus_for_each_drv+0xa8/0x130
[c000000003633400] [c000000000ceefcc] __device_attach+0x15c/0x250
[c0000000036334a0] [c000000000ced458] bus_probe_device+0x108/0x110
[c0000000036334f0] [c000000000ce92dc] device_add+0x7fc/0xa10
[c0000000036335b0] [c000000000d447c8] devm_create_dev_dax+0x1d8/0x530
[c000000003633640] [c000000000d46b60] __dax_pmem_probe+0x200/0x270
[c0000000036337b0] [c000000000d46bf0] dax_pmem_probe+0x20/0x70
[c0000000036337d0] [c000000000d2279c] nvdimm_bus_probe+0xac/0x2b0
[c000000003633860] [c000000000cef27c] really_probe+0x19c/0x520
[c0000000036338f0] [c000000000cef6b4] __driver_probe_device+0xb4/0x230
[c000000003633970] [c000000000cef888] driver_probe_device+0x58/0x120
[c0000000036339b0] [c000000000cefd08] __driver_attach+0x1d8/0x240
[c000000003633a30] [c000000000cebb04] bus_for_each_dev+0xb4/0x130
[c000000003633a90] [c000000000cee564] driver_attach+0x34/0x50
[c000000003633ab0] [c000000000ced878] bus_add_driver+0x218/0x300
[c000000003633b40] [c000000000cf1144] driver_register+0xa4/0x1b0
[c000000003633bb0] [c000000000d21a0c] __nd_driver_register+0x5c/0x100
[c000000003633c10] [c00000000206a2e8] dax_pmem_init+0x34/0x48
[c000000003633c30] [c0000000000132d0] do_one_initcall+0x60/0x320
[c000000003633d00] [c0000000020051b0] kernel_init_freeable+0x360/0x400
[c000000003633de0] [c000000000013764] kernel_init+0x34/0x1d0
[c000000003633e50] [c00000000000de14] ret_from_kernel_thread+0x5c/0x64
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound devmaps")
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Reported-by: Tarun Sahu <[email protected]>
Reviewed-by: Joao Martins <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Convert mfill_atomic_pte_copy(), shmem_mfill_atomic_pte() and
mfill_atomic_pte() to take in a folio pointer.
Convert mfill_atomic() to use a folio. Convert page_kaddr to kaddr in
mfill_atomic().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: ZhangPeng <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nanyong Sun <[email protected]>
Cc: Sidhartha Kumar <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Replace copy_user_huge_page() with copy_user_large_folio().
copy_user_large_folio() does the same as copy_user_huge_page(), but takes
in folios instead of pages. Remove pages_per_huge_page from
copy_user_large_folio(), because we can get that from folio_nr_pages(dst).
Convert copy_user_gigantic_page() to take in folios.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: ZhangPeng <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nanyong Sun <[email protected]>
Cc: Sidhartha Kumar <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Convert hugetlb_mfill_atomic_pte() to take in a folio pointer instead of
a page pointer.
Convert mfill_atomic_hugetlb() to use a folio.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: ZhangPeng <[email protected]>
Reviewed-by: Sidhartha Kumar <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nanyong Sun <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Replace copy_huge_page_from_user() with copy_folio_from_user().
copy_folio_from_user() does the same as copy_huge_page_from_user(), but
takes in a folio instead of a page.
Convert page_kaddr to kaddr in copy_folio_from_user() to do indenting
cleanup.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: ZhangPeng <[email protected]>
Reviewed-by: Sidhartha Kumar <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Nanyong Sun <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
When !CONFIG_SHMEM smaps_shmem_walk_ops is defined but not used,
triggering a compiler warning. To avoid the warning remove the #ifdef
around the usage. This has no effect because shmem_mapping() is a stub
returning false when !CONFIG_SHMEM so the code will be compiled out,
however we now need to also provide a stub for shmem_swap_usage().
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 7b86ac3371b7 ("pagewalk: separate function pointers from iterator data")
Signed-off-by: Steven Price <[email protected]>
Reported-by: kernel test robot <[email protected]>
Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Reviewed-by: Jason Gunthorpe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Thomas Hellström <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Similar to how copy_mc_user_highpage is implemented for copy_user_highpage
on #MC supported architecture, introduce the #MC handled version of
copy_highpage.
This helper has immediate usage when khugepaged wants to copy file-backed
memory pages and tolerate #MC.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jiaqi Yan <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Cc: David Stevens <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Miaohe Lin <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Tong Tiangen <[email protected]>
Cc: Tony Luck <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
In workingset_refault(), we call
mem_cgroup_flush_stats_atomic_ratelimited() to read accurate stats within
an RCU read section and with sleeping disallowed. Move the call above the
RCU read section to make it non-atomic.
Flushing is an expensive operation that scales with the number of cpus and
the number of cgroups in the system, so avoid doing it atomically where
possible.
Since workingset_refault() is the only caller of
mem_cgroup_flush_stats_atomic_ratelimited(), just make it non-atomic, and
rename it to mem_cgroup_flush_stats_ratelimited().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yosry Ahmed <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vasily Averin <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Currently, all contexts that flush memcg stats do so with sleeping not
allowed. Some of these contexts are perfectly safe to sleep in, such as
reading cgroup files from userspace or the background periodic flusher.
Flushing is an expensive operation that scales with the number of cpus and
the number of cgroups in the system, so avoid doing it atomically where
possible.
Refactor the code to make mem_cgroup_flush_stats() non-atomic (aka
sleepable), and provide a separate atomic version. The atomic version is
used in reclaim, refault, writeback, and in mem_cgroup_usage(). All other
code paths are left to use the non-atomic version. This includes
callbacks for userspace reads and the periodic flusher.
Since refault is the only caller of mem_cgroup_flush_stats_ratelimited(),
change it to mem_cgroup_flush_stats_atomic_ratelimited(). Reclaim and
refault code paths are modified to do non-atomic flushing in separate
later patches -- so it will eventually be changed back to
mem_cgroup_flush_stats_ratelimited().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yosry Ahmed <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vasily Averin <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
mem_cgroup_flush_stats_delayed() suggests his is using a delayed_work, but
this is actually sometimes flushing directly from the callsite.
What it's doing is ratelimited calls. A better name would be
mem_cgroup_flush_stats_ratelimited().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yosry Ahmed <[email protected]>
Suggested-by: Johannes Weiner <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vasily Averin <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|