Age | Commit message (Collapse) | Author | Files | Lines |
|
The GPU scheduler has now a variable number of run-queues, which are set up at
drm_sched_init() time. This way, each driver announces how many run-queues it
requires (supports) per each GPU scheduler it creates. Note, that run-queues
correspond to scheduler "priorities", thus if the number of run-queues is set
to 1 at drm_sched_init(), then that scheduler supports a single run-queue,
i.e. single "priority". If a driver further sets a single entity per
run-queue, then this creates a 1-to-1 correspondence between a scheduler and
a scheduled entity.
Cc: Lucas Stach <[email protected]>
Cc: Russell King <[email protected]>
Cc: Qiang Yu <[email protected]>
Cc: Rob Clark <[email protected]>
Cc: Abhinav Kumar <[email protected]>
Cc: Dmitry Baryshkov <[email protected]>
Cc: Danilo Krummrich <[email protected]>
Cc: Matthew Brost <[email protected]>
Cc: Boris Brezillon <[email protected]>
Cc: Alex Deucher <[email protected]>
Cc: Christian König <[email protected]>
Cc: Emma Anholt <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Luben Tuikov <[email protected]>
Acked-by: Christian König <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The existing code checks for the correct state transition after sending
a command. However, it is possible for the message box to return -1,
which indicates an error, if an error has occurred in the firmware.
We can detect if the error has occurred, and return a different error.
In addition, there is no recovering from a CSPL error, so the retry
mechanism is not needed in this case, and we can return immediately.
Signed-off-by: Stefan Binding <[email protected]>
Acked-by: Mark Brown <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>
|
|
To ensure the chip has correctly reset during probe and system suspend,
we need to force a software reset, in case of systems where the
hardware reset is not available.
The software reset register was labelled as volatile but not readable,
however, it is readable, (just returns 0x0). Adding it to readable
registers means it will be correctly treated as volatile, and thus
will not be cached.
Signed-off-by: Stefan Binding <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>
|
|
Merge the immutable branch genpd_dt into next, to allow the DT bindings to
be tested together with new pmdomain changes that are targeted for v6.7.
Signed-off-by: Ulf Hansson <[email protected]>
|
|
Document GMXC (Graphics MXC) power domain index which will be used on
SC8380XP SoCs.
Signed-off-by: Sibi Sankar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[Ulf: Re-based to step up the index number]
Signed-off-by: Ulf Hansson <[email protected]>
|
|
Document the RPMh Power Domains on the SM8650 Platform.
Signed-off-by: Neil Armstrong <[email protected]>
Link: https://lore.kernel.org/r/20231025-topic-sm8650-upstream-rpmpd-v1-1-f25d313104c6@linaro.org
Signed-off-by: Ulf Hansson <[email protected]>
|
|
When remapping hardware is configured by system software in scalable mode
as Nested (PGTT=011b) and with PWSNP field Set in the PASID-table-entry,
it may Set Accessed bit and Dirty bit (and Extended Access bit if enabled)
in first-stage page-table entries even when second-stage mappings indicate
that corresponding first-stage page-table is Read-Only.
As the result, contents of pages designated by VMM as Read-Only can be
modified by IOMMU via PML5E (PML4E for 4-level tables) access as part of
address translation process due to DMAs issued by Guest.
This disallows read-only mappings in the domain that is supposed to be used
as nested parent. Reference from Sapphire Rapids Specification Update [1],
errata details, SPR17. Userspace should know this limitation by checking
the IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 flag reported in the IOMMU_GET_HW_INFO
ioctl.
[1] https://www.intel.com/content/www/us/en/content-details/772415/content-details.html
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Lu Baolu <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
|
|
This adds IOMMU_HWPT_DATA_VTD_S1 for stage-1 hw_pagetable of Intel
VT-d and the corressponding data structure for userspace specified parameter
for the domain allocation.
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
|
|
Wrap up the data type/pointer/len sanity and a copy_struct_from_user call
for iommu drivers to copy driver specific data via struct iommu_user_data.
And expect it to be used in the domain_alloc_user op for example.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Nicolin Chen <[email protected]>
Co-developed-by: Yi Liu <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
|
|
IOMMU_HWPT_ALLOC already supports iommu_domain allocation for usersapce.
But it can only allocate a hw_pagetable that associates to a given IOAS,
i.e. only a kernel-managed hw_pagetable of IOMMUFD_OBJ_HWPT_PAGING type.
IOMMU drivers can now support user-managed hw_pagetables, for two-stage
translation use cases that require user data input from the user space.
Add a new IOMMUFD_OBJ_HWPT_NESTED type with its abort/destroy(). Pair it
with a new iommufd_hwpt_nested structure and its to_hwpt_nested() helper.
Update the to_hwpt_paging() helper, so a NESTED-type hw_pagetable can be
handled in the callers, for example iommufd_hw_pagetable_enforce_rr().
Screen the inputs including the parent PAGING-type hw_pagetable that has
a need of a new nest_parent flag in the iommufd_hwpt_paging structure.
Extend the IOMMU_HWPT_ALLOC ioctl to accept an IOMMU driver specific data
input which is tagged by the enum iommu_hwpt_data_type. Also, update the
@pt_id to accept hwpt_id too besides an ioas_id. Then, use them to allocate
a hw_pagetable of IOMMUFD_OBJ_HWPT_NESTED type using the
iommufd_hw_pagetable_alloc_nested() allocator.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Nicolin Chen <[email protected]>
Co-developed-by: Yi Liu <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
|
|
domain_alloc_user op already accepts user flags for domain allocation, add
a parent domain pointer and a driver specific user data support as well.
The user data would be tagged with a type for iommu drivers to add their
own driver specific user data per hw_pagetable.
Add a struct iommu_user_data as a bundle of data_ptr/data_len/type from an
iommufd core uAPI structure. Make the user data opaque to the core, since
a userspace driver must match the kernel driver. In the future, if drivers
share some common parameter, there would be a generic parameter as well.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Lu Baolu <[email protected]>
Co-developed-by: Nicolin Chen <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
|
|
Introduce a new domain type for a user I/O page table, which is nested on
top of another user space address represented by a PAGING domain. This
new domain can be allocated by the domain_alloc_user op, and attached to
a device through the existing iommu_attach_device/group() interfaces.
The mappings of a nested domain are managed by user space software, so it
is not necessary to have map/unmap callbacks.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Lu Baolu <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
|
|
Merge updates related to system sleep handling, one power capping update
and one PM utility update for 6.7-rc1:
- Use __get_safe_page() rather than touching the list in hibernation
snapshot code (Brian Geffon).
- Fix symbol export for _SIMPLE_ variants of _PM_OPS() (Raag Jadav).
- Clean up sync_read handling in snapshot_write_next() (Brian Geffon).
- Fix kerneldoc comments for swsusp_check() and swsusp_close() to
better match code (Christoph Hellwig).
- Downgrade BIOS locked limits pr_warn() in the Intel RAPL power
capping driver to pr_debug() (Ville Syrjälä).
- Change the minimum python version for the intel_pstate_tracer utility
from 2.7 to 3.6 (Doug Smythies).
* pm-sleep:
PM: hibernate: fix the kerneldoc comment for swsusp_check() and swsusp_close()
PM: hibernate: Clean up sync_read handling in snapshot_write_next()
PM: sleep: Fix symbol export for _SIMPLE_ variants of _PM_OPS()
PM: hibernate: Use __get_safe_page() rather than touching the list
* powercap:
powercap: intel_rapl: Downgrade BIOS locked limits pr_warn() to pr_debug()
* pm-tools:
tools/power/x86/intel_pstate_tracer: python minimum version
|
|
Merge cpufreq updates for 6.7-rc1:
- Add support for several Qualcomm SoC versions and other similar
changes (Christian Marangi, Dmitry Baryshkov, Luca Weiss, Neil
Armstrong, Richard Acayan, Robert Marko, Rohit Agarwal, Stephan
Gerhold and Varadarajan Narayanan).
- Clean up the tegra cpufreq driver (Sumit Gupta).
- Use of_property_read_reg() to parse "reg" in pmac32 driver (Rob
Herring).
- Add support for TI's am62p5 Soc (Bryan Brattlof).
- Make ARM_BRCMSTB_AVS_CPUFREQ depends on !ARM_SCMI_CPUFREQ (Florian
Fainelli).
- Update Kconfig to mention i.MX7 as well (Alexander Stein).
- Revise global turbo disable check in intel_pstate (Srinivas
Pandruvada).
- Carry out initialization of sg_cpu in the schedutil cpufreq governor
in one loop (Liao Chang).
- Simplify the condition for storing 'down_threshold' in the
conservative cpufreq governor (Liao Chang).
- Use fine-grained mutex in the userspace cpufreq governor (Liao
Chang).
- Move is_managed indicator in the userspace cpufreq governor into a
per-policy structure (Liao Chang).
- Rebuild sched-domains when removing cpufreq driver (Pierre Gondois).
- Fix buffer overflow detection in trans_stats() (Christian Marangi).
* pm-cpufreq: (32 commits)
dt-bindings: cpufreq: qcom-hw: document SM8650 CPUFREQ Hardware
cpufreq: arm: Kconfig: Add i.MX7 to supported SoC for ARM_IMX_CPUFREQ_DT
cpufreq: qcom-nvmem: add support for IPQ8064
cpufreq: qcom-nvmem: also accept operating-points-v2-krait-cpu
cpufreq: qcom-nvmem: drop pvs_ver for format a fuses
dt-bindings: cpufreq: qcom-cpufreq-nvmem: Document krait-cpu
cpufreq: qcom-nvmem: add support for IPQ6018
dt-bindings: cpufreq: qcom-cpufreq-nvmem: document IPQ6018
cpufreq: qcom-nvmem: Add MSM8909
cpufreq: qcom-nvmem: Simplify driver data allocation
cpufreq: stats: Fix buffer overflow detection in trans_stats()
dt-bindings: cpufreq: cpufreq-qcom-hw: Add SDX75 compatible
cpufreq: ARM_BRCMSTB_AVS_CPUFREQ cannot be used with ARM_SCMI_CPUFREQ
cpufreq: ti-cpufreq: Add opp support for am62p5 SoCs
cpufreq: dt-platdev: add am62p5 to blocklist
cpufreq: tegra194: remove redundant AND with cpu_online_mask
cpufreq: tegra194: use refclk delta based loop instead of udelay
cpufreq: tegra194: save CPU data to avoid repeated SMP calls
cpufreq: Rebuild sched-domains when removing cpufreq driver
cpufreq: userspace: Move is_managed indicator into per-policy structure
...
|
|
Merge devfreq updates for 6.7-rc1:
- Switch to dev_pm_opp_find_freq_(ceil/floor)_indexed() APIs to support
specific devices like UFS which handle multiple clocks through OPP
(Operationg Performance Point) framework (Manivannan Sadhasivam).
- Add perf support to the Rockchip DFI (DDR Monitor Module) devfreq-
event driver:
* Generalize rockchip-dfi.c to support new RK3568/RK3588 using
different DDR type (Sascha Hauer).
* Convert devicetree bidning document format to yaml (Sascha Hauer).
* Add perf support for DFI (a unit suitable for measuring DDR
utilization) to rockchip-dfi.c to extend DFI usage (Sascha Hauer).
- Add locking to the OPP handling code in the Mediatek CCI devfreq
driver, because the voltage of shared OPP might be changed by
multiple drivers (Mark Tseng, Dan Carpenter).
- Use device_get_match_data() in the Samsung Exynos PPMU devfreq-event
driver (Rob Herring).
* pm-devfreq: (26 commits)
dt-bindings: devfreq: event: rockchip,dfi: Add rk3588 support
dt-bindings: devfreq: event: rockchip,dfi: Add rk3568 support
dt-bindings: devfreq: event: convert Rockchip DFI binding to yaml
PM / devfreq: rockchip-dfi: add support for RK3588
PM / devfreq: rockchip-dfi: account for multiple DDRMON_CTRL registers
PM / devfreq: rockchip-dfi: make register stride SoC specific
PM / devfreq: rockchip-dfi: Add perf support
PM / devfreq: rockchip-dfi: give variable a better name
PM / devfreq: rockchip-dfi: Prepare for multiple users
PM / devfreq: rockchip-dfi: Pass private data struct to internal functions
PM / devfreq: rockchip-dfi: Handle LPDDR4X
PM / devfreq: rockchip-dfi: Handle LPDDR2 correctly
PM / devfreq: rockchip-dfi: Add RK3568 support
PM / devfreq: rockchip-dfi: Clean up DDR type register defines
PM / devfreq: rk3399_dmc,dfi: generalize DDRTYPE defines
PM / devfreq: rockchip-dfi: introduce channel mask
PM / devfreq: rockchip-dfi: Use free running counter
PM / devfreq: mediatek: unlock on error in mtk_ccifreq_target()
PM / devfreq: exynos-ppmu: Use device_get_match_data()
PM / devfreq: rockchip-dfi: dfi store raw values in counter struct
...
|
|
Merge updates of the ACPI AC and ACPI PAD drivers and PNP updates for
6.7-rc1:
- Switch over the ACPI AC and ACPI PAD drivers to using the platform
driver interface which, is more logically consistent than binding a
driver directly to an ACPI device object, and clean them up (Michal
Wilczynski).
- Replace strncpy() in the PNP code with either memcpy() or strscpy()
as appropriate (Justin Stitt).
- Clean up coding style in pnp.h (GuoHua Cheng).
* acpi-ac:
ACPI: AC: Rename ACPI device from device to adev
ACPI: AC: Replace acpi_driver with platform_driver
ACPI: AC: Use string_choices API instead of ternary operator
ACPI: AC: Remove redundant checks
* acpi-pad:
ACPI: acpi_pad: Rename ACPI device from device to adev
ACPI: acpi_pad: Use dev groups for sysfs
ACPI: acpi_pad: Replace acpi_driver with platform_driver
* pnp:
PNP: replace deprecated strncpy() with memcpy()
PNP: ACPI: replace deprecated strncpy() with strscpy()
PNP: Clean up coding style in pnp.h
|
|
Merge ACPI bus type driver updates for 6.7-rc1:
- Add context argument to acpi_dev_install_notify_handler() (Rafael
Wysocki).
- Clarify ACPI bus concepts in the ACPI device enumeration
documentation (Rafael Wysocki).
* acpi-bus:
ACPI: bus: Add context argument to acpi_dev_install_notify_handler()
ACPI: docs: enumeration: Clarify ACPI bus concepts
|
|
Merge ACPI EC driver updates, ACPI sysfs interface updates, misc updates
related to ACPI and changes related to ACPI _UID handling for 6.7-rc1:
- Add EC GPE detection quirk for HP 250 G7 Notebook PC (Jonathan
Denose).
- Fix and clean up create_pnp_modalias() and create_of_modalias()
(Christophe JAILLET).
- Modify 2 pieces of code to use acpi_evaluate_dsm_typed() (Andy
Shevchenko).
- Define acpi_dev_uid_match() for matching _UID and use it in several
places (Raag Jadav).
- Use acpi_device_uid() for fetching _UID in 2 places (Raag Jadav).
* acpi-ec:
ACPI: EC: Add quirk for HP 250 G7 Notebook PC
* acpi-sysfs:
ACPI: sysfs: Clean up create_pnp_modalias() and create_of_modalias()
ACPI: sysfs: Fix create_pnp_modalias() and create_of_modalias()
* acpi-misc:
ACPI: x86: s2idle: Switch to use acpi_evaluate_dsm_typed()
ACPI: PCI: Switch to use acpi_evaluate_dsm_typed()
* acpi-uid:
perf: arm_cspmu: use acpi_dev_hid_uid_match() for matching _HID and _UID
ACPI: x86: use acpi_dev_uid_match() for matching _UID
ACPI: utils: use acpi_dev_uid_match() for matching _UID
pinctrl: intel: use acpi_dev_uid_match() for matching _UID
ACPI: utils: Introduce acpi_dev_uid_match() for matching _UID
perf: qcom: use acpi_device_uid() for fetching _UID
ACPI: sysfs: use acpi_device_uid() for fetching _UID
|
|
Merge ACPI backlight driver updates, ACPI APEI updates, ACPI PRM updates
and changes related to ACPI PCC for 6.7-rc1:
- Add acpi_backlight=vendor quirk for Toshiba Portégé R100 (Ondrej
Zary).
- Add "vendor" backlight quirks for 3 Lenovo x86 Android tablets (Hans
de Goede).
- Move Xiaomi Mi Pad 2 backlight quirk to its own section (Hans de
Goede).
- Annotate struct prm_module_info with __counted_by (Kees Cook).
- Fix AER info corruption in aer_recover_queue() when error status data
has multiple sections (Shiju Jose).
- Make APEI use ERST max execution time value for slow devices (Jeshua
Smith).
- Add support for platform notification handling to the PCC mailbox
driver and modify it to support shared interrupts for multiple
subspaces (Huisong Li).
- Define common macros to use when referring to various bitfields in the
PCC generic communications channel command and status fields and use
them in some drivers (Sudeep Holla).
* acpi-video:
ACPI: video: Add acpi_backlight=vendor quirk for Toshiba Portégé R100
ACPI: video: Add "vendor" quirks for 3 Lenovo x86 Android tablets
ACPI: video: Move Xiaomi Mi Pad 2 quirk to its own section
* acpi-prm:
ACPI: PRM: Annotate struct prm_module_info with __counted_by
* acpi-apei:
ACPI: APEI: Use ERST timeout for slow devices
ACPI: APEI: Fix AER info corruption when error status data has multiple sections
* acpi-pcc:
soc: kunpeng_hccs: Migrate to use generic PCC shmem related macros
hwmon: (xgene) Migrate to use generic PCC shmem related macros
i2c: xgene-slimpro: Migrate to use generic PCC shmem related macros
ACPI: PCC: Add PCC shared memory region command and status bitfields
mailbox: pcc: Support shared interrupt for multiple subspaces
mailbox: pcc: Add support for platform notification handling
|
|
Merge ACPI utilities updates, ACPI resource management updates, ACPI
device properties management updates and ACPI LPSS (Intel SoC) driver
update for 6.7-rc1:
- Rework acpi_handle_list handling so as to manage it dynamically,
including size computation (Rafael Wysocki).
- Clean up ACPI utilities code so as to make it follow the kernel
coding style (Jonathan Bergh).
- Consolidate IRQ trigger-type override DMI tables and drop .ident
values from dmi_system_id tables used for ACPI resources management
quirks (Hans de Goede).
- Add ACPI IRQ override for TongFang GMxXGxx (Werner Sembach).
- Allow _DSD buffer data only for byte accessors and document the _DSD
data buffer GUID (Andy Shevchenko).
- Drop BayTrail and Lynxpoint pinctrl device IDs from the ACPI LPSS
driver, because it does not need them (Raag Jadav).
* acpi-utils:
ACPI: utils: Remove redundant braces around individual statement
ACPI: utils: Fix up white space in a few places
ACPI: utils: Dynamically determine acpi_handle_list size
ACPI: thermal: Merge trip initialization functions
ACPI: thermal: Collapse trip devices update function wrappers
ACPI: thermal: Collapse trip devices update functions
ACPI: thermal: Add device list to struct acpi_thermal_trip
ACPI: thermal: Fix a small leak in acpi_thermal_add()
ACPI: thermal: Drop valid flag from struct acpi_thermal_trip
ACPI: thermal: Drop redundant trip point flags
ACPI: thermal: Untangle initialization and updates of active trips
ACPI: thermal: Untangle initialization and updates of the passive trip
ACPI: thermal: Simplify critical and hot trips representation
ACPI: thermal: Create and populate trip points table earlier
ACPI: thermal: Determine the number of trip points earlier
ACPI: thermal: Fold acpi_thermal_get_info() into its caller
ACPI: thermal: Simplify initialization of critical and hot trips
* acpi-resource:
ACPI: resource: Do IRQ override on TongFang GMxXGxx
ACPI: resource: Drop .ident values from dmi_system_id tables
ACPI: resource: Consolidate IRQ trigger-type override DMI tables
* acpi-property:
ACPI: property: Document the _DSD data buffer GUID
ACPI: property: Allow _DSD buffer data only for byte accessors
* acpi-soc:
ACPI: LPSS: drop BayTrail and Lynxpoint pinctrl HIDs
|
|
commit ef8dd01538ea ("genirq/msi: Make interrupt allocation less
convoluted"), reworked the code so that the x86 specific quirk for affinity
setting of non-maskable PCI/MSI interrupts is not longer activated if
necessary.
This could be solved by restoring the original logic in the core MSI code,
but after a deeper analysis it turned out that the quirk flag is not
required at all.
The quirk is only required when the PCI/MSI device cannot mask the MSI
interrupts, which in turn also prevents reservation mode from being enabled
for the affected interrupt.
This allows ot remove the NOMASK quirk bit completely as msi_set_affinity()
can instead check whether reservation mode is enabled for the interrupt,
which gives exactly the same answer.
Even in the momentary non-existing case that the reservation mode would be
not set for a maskable MSI interrupt this would not cause any harm as it
just would cause msi_set_affinity() to go needlessly through the
functionaly equivalent slow path, which works perfectly fine with maskable
interrupts as well.
Rework msi_set_affinity() to query the reservation mode and remove all
NOMASK quirk logic from the core code.
[ tglx: Massaged changelog ]
Fixes: ef8dd01538ea ("genirq/msi: Make interrupt allocation less convoluted")
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Koichiro Den <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next. Mostly
nf_tables updates with two patches for connlabel and br_netfilter.
1) Rename function name to perform on-demand GC for rbtree elements,
and replace async GC in rbtree by sync GC. Patches from Florian Westphal.
2) Use commit_mutex for NFT_MSG_GETRULE_RESET to ensure that two
concurrent threads invoking this command do not underrun stateful
objects. Patches from Phil Sutter.
3) Use single hook to deal with IP and ARP packets in br_netfilter.
Patch from Florian Westphal.
4) Use atomic_t in netns->connlabel use counter instead of using a
spinlock, also patch from Florian.
5) Cleanups for stateful objects infrastructure in nf_tables.
Patches from Phil Sutter.
6) Flush path uses opaque set element offered by the iterator, instead of
calling pipapo_deactivate() which looks up for it again.
7) Set backend .flush interface always succeeds, make it return void
instead.
8) Add struct nft_elem_priv placeholder structure and use it by replacing
void * to pass opaque set element representation from backend to frontend
which defeats compiler type checks.
9) Shrink memory consumption of set element transactions, by reducing
struct nft_trans_elem object size and reducing stack memory usage.
10) Use struct nft_elem_priv also for set backend .insert operation too.
11) Carry reset flag in nft_set_dump_ctx structure, instead of passing it
as a function argument, from Phil Sutter.
netfilter pull request 23-10-25
* tag 'nf-next-23-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_tables: Carry reset boolean in nft_set_dump_ctx
netfilter: nf_tables: set->ops->insert returns opaque set element in case of EEXIST
netfilter: nf_tables: shrink memory consumption of set elements
netfilter: nf_tables: expose opaque set element as struct nft_elem_priv
netfilter: nf_tables: set backend .flush always succeeds
netfilter: nft_set_pipapo: no need to call pipapo_deactivate() from flush
netfilter: nf_tables: Carry reset boolean in nft_obj_dump_ctx
netfilter: nf_tables: nft_obj_filter fits into cb->ctx
netfilter: nf_tables: Carry s_idx in nft_obj_dump_ctx
netfilter: nf_tables: A better name for nft_obj_filter
netfilter: nf_tables: Unconditionally allocate nft_obj_filter
netfilter: nf_tables: Drop pointless memset in nf_tables_dump_obj
netfilter: conntrack: switch connlabels to atomic_t
br_netfilter: use single forward hook for ip and arp
netfilter: nf_tables: Add locking for NFT_MSG_GETRULE_RESET requests
netfilter: nf_tables: Introduce nf_tables_getrule_single()
netfilter: nf_tables: Open-code audit log call in nf_tables_getrule()
netfilter: nft_set_rbtree: prefer sync gc to async worker
netfilter: nft_set_rbtree: rename gc deactivate+erase function
====================
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
|
|
Merge an ACPICA change for 6.7-rc1 which adds symbol definitions related
to CDAT (Dave Jiang).
* acpica:
ACPICA: Add defines for CDAT SSLBIS
|
|
Replace the old __attribute__((packed)) with the new __packed.
Only cleanup, no functional changes.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>
|
|
The header file contains lots of outdated comments and definitions.
Drop those as cleanup.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>
|
|
Replace the old __attribute__((packed)) with the new __packed.
Only cleanup, no functional changes.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>
|
|
Replace the old __attribute__((packed)) with the new __packed.
Only cleanup, no functional changes.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>
|
|
RTAX_FEATURE_ALLFRAG was added before the first git commit:
https://www.mail-archive.com/[email protected]/msg03399.html
The feature would send packets to the fragmentation path if a box
receives a PMTU value with less than 1280 byte. However, since commit
9d289715eb5c ("ipv6: stop sending PTB packets for MTU < 1280"), such
message would be simply discarded. The feature flag is neither supported
in iproute2 utility. In theory one can still manipulate it with direct
netlink message, but it is not ideal because it was based on obsoleted
guidance of RFC-2460 (replaced by RFC-8200).
The feature would always test false at the moment, so remove related
code or mark them as unused.
Signed-off-by: Yan Zhai <[email protected]>
Reviewed-by: Florian Westphal <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Link: https://lore.kernel.org/r/d78e44dcd9968a252143ffe78460446476a472a1.1698156966.git.yan@cloudflare.com
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
mbind(2) holds down_write of current task's mmap_lock throughout
(exclusive because it needs to set the new mempolicy on the vmas);
migrate_pages(2) holds down_read of pid's mmap_lock throughout.
They both hold mmap_lock across the internal migrate_pages(), under which
all new page allocations (huge or small) are made. I'm nervous about it;
and migrate_pages() certainly does not need mmap_lock itself. It's done
this way for mbind(2), because its page allocator is vma_alloc_folio() or
alloc_hugetlb_folio_vma(), both of which depend on vma and address.
Now that we have alloc_pages_mpol(), depending on (refcounted) memory
policy and interleave index, mbind(2) can be modified to use that or
alloc_hugetlb_folio_nodemask(), and then not need mmap_lock across the
internal migrate_pages() at all: add alloc_migration_target_by_mpol() to
replace mbind's new_page().
(After that change, alloc_hugetlb_folio_vma() is used by nothing but a
userfaultfd function: move it out of hugetlb.h and into the #ifdef.)
migrate_pages(2) has chosen its target node before migrating, so can
continue to use the standard alloc_migration_target(); but let it take and
drop mmap_lock just around migrate_to_node()'s queue_pages_range():
neither the node-to-node calculations nor the page migrations need it.
It seems unlikely, but it is conceivable that some userspace depends on
the kernel's mmap_lock exclusion here, instead of doing its own locking:
more likely in a testsuite than in real life. It is also possible, of
course, that some pages on the list will be munmapped by another thread
before they are migrated, or a newer memory policy applied to the range by
that time: but such races could happen before, as soon as mmap_lock was
dropped, so it does not appear to be a concern.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Sidhartha Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Tejun heo <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Shrink shmem's stack usage by eliminating the pseudo-vma from its folio
allocation. alloc_pages_mpol(gfp, order, pol, ilx, nid) becomes the
principal actor for passing mempolicy choice down to __alloc_pages(),
rather than vma_alloc_folio(gfp, order, vma, addr, hugepage).
vma_alloc_folio() and alloc_pages() remain, but as wrappers around
alloc_pages_mpol(). alloc_pages_bulk_*() untouched, except to provide the
additional args to policy_nodemask(), which subsumes policy_node().
Cleanup throughout, cutting out some unhelpful "helpers".
It would all be much simpler without MPOL_INTERLEAVE, but that adds a
dynamic to the constant mpol: complicated by v3.6 commit 09c231cb8bfd
("tmpfs: distribute interleave better across nodes"), which added ino bias
to the interleave, hidden from mm/mempolicy.c until this commit.
Hence "ilx" throughout, the "interleave index". Originally I thought it
could be done just with nid, but that's wrong: the nodemask may come from
the shared policy layer below a shmem vma, or it may come from the task
layer above a shmem vma; and without the final nodemask then nodeid cannot
be decided. And how ilx is applied depends also on page order.
The interleave index is almost always irrelevant unless MPOL_INTERLEAVE:
with one exception in alloc_pages_mpol(), where the NO_INTERLEAVE_INDEX
passed down from vma-less alloc_pages() is also used as hint not to use
THP-style hugepage allocation - to avoid the overhead of a hugepage arg
(though I don't understand why we never just added a GFP bit for THP - if
it actually needs a different allocation strategy from other pages of the
same order). vma_alloc_folio() still carries its hugepage arg here, but
it is not used, and should be removed when agreed.
get_vma_policy() no longer allows a NULL vma: over time I believe we've
eradicated all the places which used to need it e.g. swapoff and madvise
used to pass NULL vma to read_swap_cache_async(), but now know the vma.
[[email protected]: handle NULL mpol being passed to __read_swap_cache_async()]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Sidhartha Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Tejun heo <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Cc: Domenico Cerasuolo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
v3.8 commit b24f53a0bea3 ("mm: mempolicy: Add MPOL_MF_LAZY") introduced
MPOL_MF_LAZY, and included it in the MPOL_MF_VALID flags; but a720094ded8
("mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now")
immediately removed it from MPOL_MF_VALID flags, pending further review.
"This will need to be revisited", but it has not been reinstated.
The present state is confusing: there is dead code in mm/mempolicy.c to
handle MPOL_MF_LAZY cases which can never occur. Remove that: it can be
resurrected later if necessary. But keep the definition of MPOL_MF_LAZY,
which must remain in the UAPI, even though it always fails with EINVAL.
https://lore.kernel.org/linux-mm/[email protected]/
links to a previous request to remove MPOL_MF_LAZY.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Sidhartha Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Tejun heo <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Prefer the more explicit "pgoff_t" to "unsigned long" when dealing with a
shared mempolicy tree. Delete confusing comment about pseudo mm vmas.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Sidhartha Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Tejun heo <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Before getting down to work, do a little cleanup, mainly of inconsistent
variable naming. I gave up trying to rationalize mpol versus pol versus
policy, and node versus nid, but let's avoid p and nd. Remove a few
superfluous blank lines, but add one; and here prefer vma->vm_policy to
vma_policy(vma) - the latter being appropriate in other sources, which
have to allow for !CONFIG_NUMA. That intriguing line about KERNEL_DS?
should have gone in v2.6.15, when numa_policy_init() stopped using
set_mempolicy(2)'s system call handler.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Sidhartha Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Tejun heo <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "mempolicy: cleanups leading to NUMA mpol without vma", v2.
Mostly cleanups in mm/mempolicy.c, but finally removing the pseudo-vma
from shmem folio allocation, and removing the mmap_lock around folio
migration for mbind and migrate_pages syscalls.
This patch (of 12):
hugetlbfs_fallocate() goes through the motions of pasting a shared NUMA
mempolicy onto its pseudo-vma, but how could there ever be a shared NUMA
mempolicy for this file? hugetlb_vm_ops has never offered a set_policy
method, and hugetlbfs_parse_param() has never supported any mpol options
for a mount-wide default policy.
It's just an illusion: clean it away so as not to confuse others, giving
us more freedom to adjust shmem's set_policy/get_policy implementation.
But hugetlbfs_inode_info is still required, just to accommodate seals.
Yes, shared NUMA mempolicy support could be added to hugetlbfs, with a
set_policy method and/or mpol mount option (Andi's first posting did
include an admitted-unsatisfactory hugetlb_set_policy()); but it seems
that nobody has bothered to add that in the nineteen years since v2.6.7
made it possible, and there is at least one company that has invested
enough into hugetlbfs, that I guess they have learnt well enough how to
manage its NUMA, without needing shared mempolicy.
Remove linux/mempolicy.h from linux/hugetlb.h: include linux/pagemap.h in
its place, because hugetlb.h's recently added use of filemap_lock_folio()
requires that (although most .configs and .c's get it in some other way).
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Sidhartha Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Tejun heo <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "avoid divide-by-zero due to max_nr_accesses overflow".
The maximum nr_accesses of given DAMON context can be calculated by
dividing the aggregation interval by the sampling interval. Some logics
in DAMON uses the maximum nr_accesses as a divisor. Hence, the value
shouldn't be zero. Such case is avoided since DAMON avoids setting the
agregation interval as samller than the sampling interval. However, since
nr_accesses is unsigned int while the intervals are unsigned long, the
maximum nr_accesses could be zero while casting.
Avoid the divide-by-zero by implementing a function that handles the
corner case (first patch), and replaces the vulnerable direct max
nr_accesses calculations (remaining patches).
Note that the patches for the replacements are divided for broken commits,
to make backporting on required tres easier. Especially, the last patch
is for a patch that not yet merged into the mainline but in mm tree.
This patch (of 4):
The maximum nr_accesses of given DAMON context can be calculated by
dividing the aggregation interval by the sampling interval. Some logics
in DAMON uses the maximum nr_accesses as a divisor. Hence, the value
shouldn't be zero. Such case is avoided since DAMON avoids setting the
agregation interval as samller than the sampling interval. However, since
nr_accesses is unsigned int while the intervals are unsigned long, the
maximum nr_accesses could be zero while casting. Implement a function
that handles the corner case.
Note that this commit is not fixing the real issue since this is only
introducing the safe function that will replaces the problematic
divisions. The replacements will be made by followup commits, to make
backporting on stable series easier.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 198f0f4c58b9 ("mm/damon/vaddr,paddr: support pageout prioritization")
Signed-off-by: SeongJae Park <[email protected]>
Reported-by: Jakub Acs <[email protected]>
Cc: <[email protected]> [5.16+]
Signed-off-by: Andrew Morton <[email protected]>
|
|
Also remove count_memcg_page_event now that its last caller no longer uses
it and reword hpage_collapse_alloc_page() to hpage_collapse_alloc_folio().
This removes 1 call to compound_head() and helps convert khugepaged to
use folios throughout.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Add nr_split to trace_mm_migrate_pages for large folio (including THP)
split events.
[[email protected]: cleanup per Huang, Ying]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Zi Yan <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Since kmemleak_alloc_phys() rather than kmemleak_alloc() was called from
memblock_alloc_range_nid(), kmemleak_free_part_phys() should be used to
delete kmemleak object in free_bootmem_page(). In debug mode, there are
following warning:
kmemleak: Partially freeing unknown object at 0xffff97345aff7000 (size 4096)
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 028725e73375 ("bootmem: remove the vmemmap pages from kmemleak in free_bootmem_page")
Signed-off-by: Liu Shixin <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Patrick Wang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Since all calls use folio_xchg_last_cpupid(), remove
page_cpupid_xchg_last().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Make finish_mkwrite_fault static since it is not used outside of
memory.c.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Add folio_xchg_last_cpupid() wrapper, which is required to convert
page_cpupid_xchg_last() to folio vertion later in the series.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Since all calls use folio_xchg_access_time(), remove
xchg_page_access_time().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Add folio_xchg_access_time() wrapper, which is required to convert
xchg_page_access_time() to folio vertion later in the series.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Since all calls use folio_last_cpupid(), remove page_cpupid_last().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Add folio_last_cpupid() wrapper, which is required to convert
page_cpupid_last() to folio vertion later in the series.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Patch series "mm: convert page cpupid functions to folios", v3.
The cpupid(or access time) used by numa balancing is stored in flags or
_last_cpupid(if LAST_CPUPID_NOT_IN_PAGE_FLAGS) of page, this is to convert
page cpupid to folio cpupid, a new _last_cpupid is added into folio, which
make us to use folio->_last_cpupid directly, and the page cpupid functions
are converted to folio ones.
page_cpupid_last() -> folio_last_cpupid()
xchg_page_access_time() -> folio_xchg_access_time()
page_cpupid_xchg_last() -> folio_xchg_last_cpupid()
This patch (of 19):
If WANT_PAGE_VIRTUAL and LAST_CPUPID_NOT_IN_PAGE_FLAGS defined, the
'virtual' and '_last_cpupid' are in struct page, and since _last_cpupid is
used by numa balancing feature, it is better to move it before KMSAN
metadata from struct page, also add them into struct folio to make us to
access them from folio directly.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Reimplement get_obj_cgroup_from_current() using current_obj_cgroup().
get_obj_cgroup_from_current() and current_obj_cgroup() share 80% of the
code, so the new implementation is almost trivial.
get_obj_cgroup_from_current() is a convenient function used by the
bpf subsystem, so there is no reason to get rid of it completely.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin (Cruise) <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Naresh Kamboju <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Switch to a scope-based protection of the objcg pointer on slab/kmem
allocation paths. Instead of using the get_() semantics in the
pre-allocation hook and put the reference afterwards, let's rely on the
fact that objcg is pinned by the scope.
It's possible because:
1) if the objcg is received from the current task struct, the task is
keeping a reference to the objcg.
2) if the objcg is received from an active memcg (remote charging),
the memcg is pinned by the scope and has a reference to the
corresponding objcg.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin (Cruise) <[email protected]>
Tested-by: Naresh Kamboju <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Keep a reference to the original objcg object for the entire life of a
memcg structure.
This allows to simplify the synchronization on the kernel memory
allocation paths: pinning a (live) memcg will also pin the corresponding
objcg.
The memory overhead of this change is minimal because object cgroups
usually outlive their corresponding memory cgroups even without this
change, so it's only an additional pointer per memcg.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin (Cruise) <[email protected]>
Tested-by: Naresh Kamboju <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
To charge a freshly allocated kernel object to a memory cgroup, the kernel
needs to obtain an objcg pointer. Currently it does it indirectly by
obtaining the memcg pointer first and then calling to
__get_obj_cgroup_from_memcg().
Usually tasks spend their entire life belonging to the same object cgroup.
So it makes sense to save the objcg pointer on task_struct directly, so
it can be obtained faster. It requires some work on fork, exit and cgroup
migrate paths, but these paths are way colder.
To avoid any costly synchronization the following rules are applied:
1) A task sets it's objcg pointer itself.
2) If a task is being migrated to another cgroup, the least
significant bit of the objcg pointer is set atomically.
3) On the allocation path the objcg pointer is obtained locklessly
using the READ_ONCE() macro and the least significant bit is
checked. If it's set, the following procedure is used to update
it locklessly:
- task->objcg is zeroed using cmpxcg
- new objcg pointer is obtained
- task->objcg is updated using try_cmpxchg
- operation is repeated if try_cmpxcg fails
It guarantees that no updates will be lost if task migration
is racing against objcg pointer update. It also allows to keep
both read and write paths fully lockless.
Because the task is keeping a reference to the objcg, it can't go away
while the task is alive.
This commit doesn't change the way the remote memcg charging works.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin (Cruise) <[email protected]>
Tested-by: Naresh Kamboju <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|