| Age | Commit message (Collapse) | Author | Files | Lines |
|
verifier.c is huge. Let's try to move out parts that are logging-related
into log.c, as we previously did with bpf_log() and other related stuff.
This patch moves line info verbose output routines: it's pretty
self-contained and isolated code, so there is no problem with this.
Acked-by: Eduard Zingerman <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
mlx5-updates-2023-11-13
1) Cleanup patches, leftovers from previous cycle
2) Allow sync reset flow when BF MGT interface device is present
3) Trivial ptp refactorings and improvements
4) Add local loopback counter to vport rep stats
Signed-off-by: David S. Miller <[email protected]>
|
|
Change the API to select MAC default time stamping instead of the PHY.
Indeed the PHY is closer to the wire therefore theoretically it has less
delay than the MAC timestamping but the reality is different. Due to lower
time stamping clock frequency, latency in the MDIO bus and no PHC hardware
synchronization between different PHY, the PHY PTP is often less precise
than the MAC. The exception is for PHY designed specially for PTP case but
these devices are not very widespread. For not breaking the compatibility I
introduce a default_timestamp flag in phy_device that is set by the phy
driver to know we are using the old API behavior.
The phy_set_timestamp function is called at each call of phy_attach_direct.
In case of MAC driver using phylink this function is called when the
interface is turned up. Then if the interface goes down and up again the
last choice of timestamp will be overwritten by the default choice.
A solution could be to cache the timestamp status but it can bring other
issues. In case of SFP, if we change the module, it doesn't make sense to
blindly re-set the timestamp back to PHY, if the new module has a PHY with
mediocre timestamping capabilities.
Signed-off-by: Kory Maincent <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Replace hwtstamp_source which is only used by the kernel_hwtstamp_config
structure by the more widely use timestamp_layer structure. This is done
to prepare the support of selectable timestamping source.
Signed-off-by: Kory Maincent <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Make the dev_set_hwtstamp_phylib function accessible in prevision to use
it from ethtool to reset the tstamp current configuration.
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: Kory Maincent <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The vlan, macvlan and the bonding drivers call their "real" device driver
in order to report the time stamping capabilities. Provide a core
ethtool helper function to avoid copy/paste in the stack.
Signed-off-by: Richard Cochran <[email protected]>
Signed-off-by: Kory Maincent <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Reviewed-by: Jay Vosburgh <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The PHYs hwtstamp callback are still getting the timestamp config from
ifreq and using copy_from/to_user.
Get rid of these functions by using timestamp configuration in parameter.
This also allow to move on to kernel_hwtstamp_config and be similar to
net devices using the new ndo_hwstamp_get/set.
This adds the possibility to manipulate the timestamp configuration
from the kernel which was not possible with the copy_from/to_user.
Signed-off-by: Kory Maincent <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
wait_on_bit already does it.
Signed-off-by: Mateusz Guzik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>
|
|
Ask block layer to block writes to block devices mounted by filesystems.
Signed-off-by: Jan Kara <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Christian Brauner <[email protected]>
Reviewed-by: Jens Axboe <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
|
Writing to mounted devices is dangerous and can lead to filesystem
corruption as well as crashes. Furthermore syzbot comes with more and
more involved examples how to corrupt block device under a mounted
filesystem leading to kernel crashes and reports we can do nothing
about. Add tracking of writers to each block device and a kernel cmdline
argument which controls whether other writeable opens to block devices
open with BLK_OPEN_RESTRICT_WRITES flag are allowed. We will make
filesystems use this flag for used devices.
Note that this effectively only prevents modification of the particular
block device's page cache by other writers. The actual device content
can still be modified by other means - e.g. by issuing direct scsi
commands, by doing writes through devices lower in the storage stack
(e.g. in case loop devices, DM, or MD are involved) etc. But blocking
direct modifications of the block device page cache is enough to give
filesystems a chance to perform data validation when loading data from
the underlying storage and thus prevent kernel crashes.
Syzbot can use this cmdline argument option to avoid uninteresting
crashes. Also users whose userspace setup does not need writing to
mounted block devices can set this option for hardening.
Link: https://lore.kernel.org/all/[email protected]
Signed-off-by: Jan Kara <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Jens Axboe <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
|
blkdev_get_by_*() and blkdev_put() functions are now unused. Remove
them.
Acked-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Christian Brauner <[email protected]>
Reviewed-by: Jens Axboe <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
|
Before [1] freezing a filesystems through the block layer only worked
for the main block device as the owning superblock of additional block
devices could not be found. Any filesystem that made use of multiple
block devices would only be freezable via it's main block device.
For example, consider xfs over device mapper with /dev/dm-0 as main
block device and /dev/dm-1 as external log device. Two freeze requests
before [1]:
(1) dmsetup suspend /dev/dm-0 on the main block device
bdev_freeze(dm-0)
-> dm-0->bd_fsfreeze_count++
-> freeze_super(xfs-sb)
The owning superblock is found and the filesystem gets frozen.
Returns 0.
(2) dmsetup suspend /dev/dm-1 on the log device
bdev_freeze(dm-1)
-> dm-1->bd_fsfreeze_count++
The owning superblock isn't found and only the block device freeze
count is incremented. Returns 0.
Two freeze requests after [1]:
(1') dmsetup suspend /dev/dm-0 on the main block device
bdev_freeze(dm-0)
-> dm-0->bd_fsfreeze_count++
-> freeze_super(xfs-sb)
The owning superblock is found and the filesystem gets frozen.
Returns 0.
(2') dmsetup suspend /dev/dm-1 on the log device
bdev_freeze(dm-0)
-> dm-0->bd_fsfreeze_count++
-> freeze_super(xfs-sb)
The owning superblock is found and the filesystem gets frozen.
Returns -EBUSY.
When (2') is called we initiate a freeze from another block device of
the same superblock. So we increment the bd_fsfreeze_count for that
additional block device. But we now also find the owning superblock for
additional block devices and call freeze_super() again which reports
-EBUSY.
This can be reproduced through xfstests via:
mkfs.xfs -f -m crc=1,reflink=1,rmapbt=1, -i sparse=1 -lsize=1g,logdev=/dev/nvme1n1p4 /dev/nvme1n1p3
mkfs.xfs -f -m crc=1,reflink=1,rmapbt=1, -i sparse=1 -lsize=1g,logdev=/dev/nvme1n1p6 /dev/nvme1n1p5
FSTYP=xfs
export TEST_DEV=/dev/nvme1n1p3
export TEST_DIR=/mnt/test
export TEST_LOGDEV=/dev/nvme1n1p4
export SCRATCH_DEV=/dev/nvme1n1p5
export SCRATCH_MNT=/mnt/scratch
export SCRATCH_LOGDEV=/dev/nvme1n1p6
export USE_EXTERNAL=yes
sudo ./check generic/311
Current semantics allow two concurrent freezers: one initiated from
userspace via FREEZE_HOLDER_USERSPACE and one initiated from the kernel
via FREEZE_HOLDER_KERNEL. If there are multiple concurrent freeze
requests from either FREEZE_HOLDER_USERSPACE or FREEZE_HOLDER_KERNEL
-EBUSY is returned.
We need to preserve these semantics because as they are uapi via
FIFREEZE and FITHAW ioctl()s. IOW, freezes don't nest for FIFREEZE and
FITHAW. Other kernels consumers rely on non-nesting freezes as well.
With freezes initiated from the block layer freezes need to nest if the
same superblock is frozen via multiple devices. So we need to start
counting the number of freeze requests.
If FREEZE_MAY_NEST is passed alongside FREEZE_HOLDER_KERNEL or
FREEZE_HOLDER_USERSPACE we allow the caller to nest freeze calls.
To accommodate the old semantics we split the freeze counter into two
counting kernel initiated and userspace initiated freezes separately. We
can then also stop recording FREEZE_HOLDER_* in struct sb_writers.
We also simplify freezing by making all concurrent freezers share a
single active superblock reference count instead of having separate
references for kernel and userspace. I don't see why we would need two
active reference counts. Neither FREEZE_HOLDER_KERNEL nor
FREEZE_HOLDER_USERSPACE can put the active reference as long as they are
concurrent freezers anwyay. That was already true before we allowed
nesting freezes.
Survives various fstests runs with different options including the
reproducer, online scrub, and online repair, fsfreze, and so on. Also
survives blktests.
Link: https://lore.kernel.org/linux-block/87bkccnwxc.fsf@debian-BULLSEYE-live-builder-AMD64
Link: https://lore.kernel.org/r/[email protected]
Fixes: 288d8706abfc ("bdev: implement freeze and thaw holder operations") [1] # no backport needed
Tested-by: Chandan Babu R <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reported-by: Chandan Babu R <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
|
Add a comment to @fs_holder_ops that @holder must point to a superblock.
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
|
Remove bd_fsfreeze_sb as it's now unused and can be removed. Also move
bd_fsfreeze_count down to not have it weirdly placed in the middle of
the holder fields.
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Suggested-by: Jan Kara <[email protected]>
Suggested-by: Christoph Hellwig <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
|
This function is now unused so remove it. One less function that uses
the global superblock list.
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
|
The old method of implementing block device freeze and thaw operations
required us to rely on get_active_super() to walk the list of all
superblocks on the system to find any superblock that might use the
block device. This is wasteful and not very pleasant overall.
Now that we can finally go straight from block device to owning
superblock things become way simpler.
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
|
Add block device freeze and thaw holder operations. Follow-up patches
will implement block device freeze and thaw based on stuct
blk_holder_ops.
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
|
We have bdev_mark_dead() etc and we're going to move block device
freezing to holder ops in the next patch. Make the naming consistent:
* freeze_bdev() -> bdev_freeze()
* thaw_bdev() -> bdev_thaw()
Also document the return code.
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
|
When adding a mount to a namespace insert it into an rbtree rooted in the
mnt_namespace instead of a linear list.
The mnt.mnt_list is still used to set up the mount tree and for
propagation, but not after the mount has been added to a namespace. Hence
mnt_list can live in union with rb_node. Use MNT_ONRB mount flag to
validate that the mount is on the correct list.
This allows removing the cursor used for reading /proc/$PID/mountinfo. The
mnt_id_unique of the next mount can be used as an index into the seq file.
Tested by inserting 100k bind mounts, unsharing the mount namespace, and
unmounting. No performance regressions have been observed.
For the last mount in the 100k list the statmount() call was more than 100x
faster due to the mount ID lookup not having to do a linear search. This
patch makes the overhead of mount ID lookup non-observable in this range.
Signed-off-by: Miklos Szeredi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Ian Kent <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
|
Rename verifier internal flag BPF_F_TEST_SANITY_STRICT to more neutral
BPF_F_TEST_REG_INVARIANTS. This is a follow up to [0].
A few selftests and veristat need to be adjusted in the same patch as
well.
[0] https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
The macro SUMMARY_SIZE represents the size of the struct f2fs_summary,
instead of the size of the struct summary.
Signed-off-by: Yang Hubin <[email protected]>
Signed-off-by: Qian Haolai <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Adds kernel client API function tee_client_system_session() for a client
to request a system service entry in TEE context.
This feature is needed to prevent a system deadlock when several TEE
client applications invoke TEE, consuming all TEE thread contexts
available in the secure world. The deadlock can happen in the OP-TEE
driver for example if all these TEE threads issue an RPC call from TEE
to Linux OS to access an eMMC RPMB partition (TEE secure storage) which
device clock or regulator controller is accessed through an OP-TEE SCMI
services. In that case, Linux SCMI driver must reach OP-TEE SCMI service
without waiting until one of the consumed TEE threads is freed.
Reviewed-by: Sumit Garg <[email protected]>
Co-developed-by: Jens Wiklander <[email protected]>
Signed-off-by: Etienne Carriere <[email protected]>
Signed-off-by: Jens Wiklander <[email protected]>
|
|
With commit cf8e8658100d ("arch: Remove Itanium (IA-64) architecture"),
there is no need to keep the IA-64 definition of the KSYM_FUNC macro.
Clean up the IA-64 definition of the KSYM_FUNC macro.
Signed-off-by: Lukas Bulwahn <[email protected]>
Reviewed-by: Nathan Chancellor <[email protected]>
Signed-off-by: Masahiro Yamada <[email protected]>
|
|
Add a linkmode_fill() helper, which will allow us to convert phylink's
open coded bitmap_fill() operations.
Signed-off-by: Russell King (Oracle) <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Sometimes the users want to match the single value string property
against an array of predefined strings. Create a helper for them.
Signed-off-by: Andy Shevchenko <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Cameron <[email protected]>
|
|
Cross-merge networking fixes after downstream PR.
No conflicts.
Signed-off-by: Paolo Abeni <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from BPF and netfilter.
Current release - regressions:
- core: fix undefined behavior in netdev name allocation
- bpf: do not allocate percpu memory at init stage
- netfilter: nf_tables: split async and sync catchall in two
functions
- mptcp: fix possible NULL pointer dereference on close
Current release - new code bugs:
- eth: ice: dpll: fix initial lock status of dpll
Previous releases - regressions:
- bpf: fix precision backtracking instruction iteration
- af_unix: fix use-after-free in unix_stream_read_actor()
- tipc: fix kernel-infoleak due to uninitialized TLV value
- eth: bonding: stop the device in bond_setup_by_slave()
- eth: mlx5:
- fix double free of encap_header
- avoid referencing skb after free-ing in drop path
- eth: hns3: fix VF reset
- eth: mvneta: fix calls to page_pool_get_stats
Previous releases - always broken:
- core: set SOCK_RCU_FREE before inserting socket into hashtable
- bpf: fix control-flow graph checking in privileged mode
- eth: ppp: limit MRU to 64K
- eth: stmmac: avoid rx queue overrun
- eth: icssg-prueth: fix error cleanup on failing initialization
- eth: hns3: fix out-of-bounds access may occur when coalesce info is
read via debugfs
- eth: cortina: handle large frames
Misc:
- selftests: gso: support CONFIG_MAX_SKB_FRAGS up to 45"
* tag 'net-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (78 commits)
macvlan: Don't propagate promisc change to lower dev in passthru
net: sched: do not offload flows with a helper in act_ct
net/mlx5e: Check return value of snprintf writing to fw_version buffer for representors
net/mlx5e: Check return value of snprintf writing to fw_version buffer
net/mlx5e: Reduce the size of icosq_str
net/mlx5: Increase size of irq name buffer
net/mlx5e: Update doorbell for port timestamping CQ before the software counter
net/mlx5e: Track xmit submission to PTP WQ after populating metadata map
net/mlx5e: Avoid referencing skb after free-ing in drop path of mlx5e_sq_xmit_wqe
net/mlx5e: Don't modify the peer sent-to-vport rules for IPSec offload
net/mlx5e: Fix pedit endianness
net/mlx5e: fix double free of encap_header in update funcs
net/mlx5e: fix double free of encap_header
net/mlx5: Decouple PHC .adjtime and .adjphase implementations
net/mlx5: DR, Allow old devices to use multi destination FTE
net/mlx5: Free used cpus mask when an IRQ is released
Revert "net/mlx5: DR, Supporting inline WQE when possible"
bpf: Do not allocate percpu memory at init stage
net: Fix undefined behavior in netdev name allocation
dt-bindings: net: ethernet-controller: Fix formatting error
...
|
|
Pull virtio fixes from Michael Tsirkin:
"Bugfixes all over the place"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost-vdpa: fix use after free in vhost_vdpa_probe()
virtio_pci: Switch away from deprecated irq_set_affinity_hint
riscv, qemu_fw_cfg: Add support for RISC-V architecture
vdpa_sim_blk: allocate the buffer zeroed
virtio_pci: move structure to a header
|
|
The SPI_MASTER_HALF_DUPLEX is the legacy name of a definition
for a half duplex flag. Since all others had been replaced with
the respective SPI_CONTROLLER prefix get rid of the last one
as well. There is no functional change intended.
Signed-off-by: Andy Shevchenko <[email protected]>
Acked-by: Greg Kroah-Hartman <[email protected]>
Acked-by: Ulf Hansson <[email protected]> # For MMC
Acked-by: Dmitry Torokhov <[email protected]> # for input
Acked-by: Paolo Abeni <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>
|
|
Introduce the new subtype of "CORESIGHT_DEV_SUBTYPE_SOURCE_TPDM"
for TPDM components in driver.
Signed-off-by: Tao Zhang <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Fix a small typo in the kerneldoc comment of the INDIRECT_CALL_$NR
macro.
Signed-off-by: Tobias Klauser <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
|
|
Alexei Starovoitov says:
====================
pull-request: bpf 2023-11-15
We've added 7 non-merge commits during the last 6 day(s) which contain
a total of 9 files changed, 200 insertions(+), 49 deletions(-).
The main changes are:
1) Do not allocate bpf specific percpu memory unconditionally, from Yonghong.
2) Fix precision backtracking instruction iteration, from Andrii.
3) Fix control flow graph checking, from Andrii.
4) Fix xskxceiver selftest build, from Anders.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Do not allocate percpu memory at init stage
selftests/bpf: add more test cases for check_cfg()
bpf: fix control-flow graph checking in privileged mode
selftests/bpf: add edge case backtracking logic test
bpf: fix precision backtracking instruction iteration
bpf: handle ldimm64 properly in check_cfg()
selftests: bpf: xskxceiver: ksft_print_msg: fix format type error
====================
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Equivalent of kern_path_locked() taking dfd/userland name.
User introduced in the next commit.
Signed-off-by: Al Viro <[email protected]>
|
|
This driver uses delayed work to perform periodic battery state read out.
This delayed work is not stopped across suspend and resume cycle. The
read out can occur early in the resume cycle. In case of an I2C variant
of this hardware, that read out triggers I2C transfer. That I2C transfer
may happen while the I2C controller is still suspended, which produces a
WARNING in the kernel log.
Fix this by introducing trivial PM ops, which stop the delayed work before
the system enters suspend, and schedule the delayed work right after the
system resumes.
Signed-off-by: Marek Vasut <[email protected]>
Reviewed-by: Hans de Goede <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sebastian Reichel <[email protected]>
|
|
Add simple sanity checks that validate well-formed ranges (min <= max)
across u64, s64, u32, and s32 ranges. Also for cases when the value is
constant (either 64-bit or 32-bit), we validate that ranges and tnums
are in agreement.
These bounds checks are performed at the end of BPF_ALU/BPF_ALU64
operations, on conditional jumps, and for LDX instructions (where subreg
zero/sign extension is probably the most important to check). This
covers most of the interesting cases.
Also, we validate the sanity of the return register when manually
adjusting it for some special helpers.
By default, sanity violation will trigger a warning in verifier log and
resetting register bounds to "unbounded" ones. But to aid development
and debugging, BPF_F_TEST_SANITY_STRICT flag is added, which will
trigger hard failure of verification with -EFAULT on register bounds
violations. This allows selftests to catch such issues. veristat will
also gain a CLI option to enable this behavior.
Acked-by: Eduard Zingerman <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Shung-Hsi Yu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Generalize bounds adjustment logic of reg_set_min_max() to handle not
just register vs constant case, but in general any register vs any
register cases. For most of the operations it's trivial extension based
on range vs range comparison logic, we just need to properly pick
min/max of a range to compare against min/max of the other range.
For BPF_JSET we keep the original capabilities, just make sure JSET is
integrated in the common framework. This is manifested in the
internal-only BPF_JSET + BPF_X "opcode" to allow for simpler and more
uniform rev_opcode() handling. See the code for details. This allows to
reuse the same code exactly both for TRUE and FALSE branches without
explicitly handling both conditions with custom code.
Note also that now we don't need a special handling of BPF_JEQ/BPF_JNE
case none of the registers are constants. This is now just a normal
generic case handled by reg_set_min_max().
To make tnum handling cleaner, tnum_with_subreg() helper is added, as
that's a common operator when dealing with 32-bit subregister bounds.
This keeps the overall logic much less noisy when it comes to tnums.
Acked-by: Eduard Zingerman <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Shung-Hsi Yu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Some mlx5 devices do not support the default advertised maximum frequency
adjustment value for the PTP hardware clock that is set by the driver.
These devices need to be queried when initializing the clock functionality
in order to get the maximum supported frequency adjustment value. This
value can be greater than the minimum supported frequency adjustment across
mlx5 devices (50 million ppb).
Signed-off-by: Rahul Rameshbabu <[email protected]>
Reviewed-by: Tariq Toukan <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Kirill Shutemov reported significant percpu memory consumption increase after
booting in 288-cpu VM ([1]) due to commit 41a5db8d8161 ("bpf: Add support for
non-fix-size percpu mem allocation"). The percpu memory consumption is
increased from 111MB to 969MB. The number is from /proc/meminfo.
I tried to reproduce the issue with my local VM which at most supports upto
255 cpus. With 252 cpus, without the above commit, the percpu memory
consumption immediately after boot is 57MB while with the above commit the
percpu memory consumption is 231MB.
This is not good since so far percpu memory from bpf memory allocator is not
widely used yet. Let us change pre-allocation in init stage to on-demand
allocation when verifier detects there is a need of percpu memory for bpf
program. With this change, percpu memory consumption after boot can be reduced
signicantly.
[1] https://lore.kernel.org/lkml/[email protected]/
Fixes: 41a5db8d8161 ("bpf: Add support for non-fix-size percpu mem allocation")
Reported-and-tested-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Yonghong Song <[email protected]>
Acked-by: Hou Tao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Let's kickstart the v6.8 release cycle.
Signed-off-by: Maxime Ripard <[email protected]>
|
|
Avoid conflicts, base on fixes.
Signed-off-by: Peter Zijlstra <[email protected]>
|
|
Adds:
- DEFINE_GUARD_COND() / DEFINE_LOCK_GUARD_1_COND() to extend existing
guards with conditional lock primitives, eg. mutex_trylock(),
mutex_lock_interruptible().
nb. both primitives allow NULL 'locks', which cause the lock to
fail (obviously).
- extends scoped_guard() to not take the body when the the
conditional guard 'fails'. eg.
scoped_guard (mutex_intr, &task->signal_cred_guard_mutex) {
...
}
will only execute the body when the mutex is held.
- provides scoped_cond_guard(name, fail, args...); which extends
scoped_guard() to do fail when the lock-acquire fails.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/20231102110706.460851167%40infradead.org
|
|
Low priority tasks (e.g., SCHED_OTHER) can suffer starvation if tasks
with higher priority (e.g., SCHED_FIFO) monopolize CPU(s).
RT Throttling has been introduced a while ago as a (mostly debug)
countermeasure one can utilize to reserve some CPU time for low priority
tasks (usually background type of work, e.g. workqueues, timers, etc.).
It however has its own problems (see documentation) and the undesired
effect of unconditionally throttling FIFO tasks even when no lower
priority activity needs to run (there are mechanisms to fix this issue
as well, but, again, with their own problems).
Introduce deadline servers to service low priority tasks needs under
starvation conditions. Deadline servers are built extending SCHED_DEADLINE
implementation to allow 2-level scheduling (a sched_deadline entity
becomes a container for lower priority scheduling entities).
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/4968601859d920335cf85822eb573a5f179f04b8.1699095159.git.bristot@kernel.org
|
|
All classes use sched_entity::exec_start to track runtime and have
copies of the exact same code around to compute runtime.
Collapse all that.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Phil Auld <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Reviewed-by: Steven Rostedt (Google) <[email protected]>
Link: https://lkml.kernel.org/r/54d148a144f26d9559698c4dd82d8859038a7380.1699095159.git.bristot@kernel.org
|
|
Sort the task timeline by virtual deadline and keep the min_vruntime
in the augmented tree, so we can avoid doubling the worst case cost
and make full use of the cached leftmost node to enable O(1) fastpath
picking in next patch.
Signed-off-by: Abel Wu <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic")
NUMA Balancing allows updating PTEs to trap NUMA hinting faults if the
task had previously accessed VMA. However unconditional scan of VMAs are
allowed during initial phase of VMA creation until process's
mm numa_scan_seq reaches 2 even though current task had not accessed VMA.
Rationale:
- Without initial scan subsequent PTE update may never happen.
- Give fair opportunity to all the VMAs to be scanned and subsequently
understand the access pattern of all the VMAs.
But it has a corner case where, if a VMA is created after some time,
process's mm numa_scan_seq could be already greater than 2.
For e.g., values of mm numa_scan_seq when VMAs are created by running
mmtest autonuma benchmark briefly looks like:
start_seq=0 : 459
start_seq=2 : 138
start_seq=3 : 144
start_seq=4 : 8
start_seq=8 : 1
start_seq=9 : 1
This results in no unconditional PTE updates for those VMAs created after
some time.
Fix:
- Note down the initial value of mm numa_scan_seq in per VMA start_seq.
- Allow unconditional scan till start_seq + 2.
Result:
SUT: AMD EPYC Milan with 2 NUMA nodes 256 cpus.
base kernel: upstream 6.6-rc6 with Mels patches [1] applied.
kernbench
========== base patched %gain
Amean elsp-128 165.09 ( 0.00%) 164.78 * 0.19%*
Duration User 41404.28 41375.08
Duration System 9862.22 9768.48
Duration Elapsed 519.87 518.72
Ops NUMA PTE updates 1041416.00 831536.00
Ops NUMA hint faults 263296.00 220966.00
Ops NUMA pages migrated 258021.00 212769.00
Ops AutoNUMA cost 1328.67 1114.69
autonumabench
NUMA01_THREADLOCAL
==================
Amean elsp-NUMA01_THREADLOCAL 81.79 (0.00%) 67.74 * 17.18%*
Duration User 54832.73 47379.67
Duration System 75.00 185.75
Duration Elapsed 576.72 476.09
Ops NUMA PTE updates 394429.00 11121044.00
Ops NUMA hint faults 1001.00 8906404.00
Ops NUMA pages migrated 288.00 2998694.00
Ops AutoNUMA cost 7.77 44666.84
Signed-off-by: Raghavendra K T <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Link: https://lore.kernel.org/r/2ea7cbce80ac7c62e90cbfb9653a7972f902439f.1697816692.git.raghavendra.kt@amd.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening fixes from Kees Cook:
- stackleak: add declarations for global functions (Arnd Bergmann)
- gcc-plugins: randstruct: Only warn about true flexible arrays (Kees
Cook)
- gcc-plugins: latent_entropy: Fix description typo (Konstantin Runov)
* tag 'hardening-v6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
gcc-plugins: latent_entropy: Fix typo (args -> argc) in plugin description
gcc-plugins: randstruct: Only warn about true flexible arrays
stackleak: add declarations for global functions
|
|
Audit of the refcounting turned up that perf_pmu_migrate_context()
fails to migrate the ctx refcount.
Fixes: bd2756811766 ("perf: Rewrite core context handling")
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Cc: <[email protected]>
|
|
Widely used variable names can be used by macro users, potentially
leading to name collisions.
Suffix locals used inside the macros with an underscore, to reduce the
collision potential.
Suggested-by: Lucas De Marchi <[email protected]>
Signed-off-by: Michał Winiarski <[email protected]>
Reviewed-by: Lucas De Marchi <[email protected]>
Signed-off-by: Lucas De Marchi <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
|
|
Introduce several new KVM uAPIs to ultimately create a guest-first memory
subsystem within KVM, a.k.a. guest_memfd. Guest-first memory allows KVM
to provide features, enhancements, and optimizations that are kludgly
or outright impossible to implement in a generic memory subsystem.
The core KVM ioctl() for guest_memfd is KVM_CREATE_GUEST_MEMFD, which
similar to the generic memfd_create(), creates an anonymous file and
returns a file descriptor that refers to it. Again like "regular"
memfd files, guest_memfd files live in RAM, have volatile storage,
and are automatically released when the last reference is dropped.
The key differences between memfd files (and every other memory subystem)
is that guest_memfd files are bound to their owning virtual machine,
cannot be mapped, read, or written by userspace, and cannot be resized.
guest_memfd files do however support PUNCH_HOLE, which can be used to
convert a guest memory area between the shared and guest-private states.
A second KVM ioctl(), KVM_SET_MEMORY_ATTRIBUTES, allows userspace to
specify attributes for a given page of guest memory. In the long term,
it will likely be extended to allow userspace to specify per-gfn RWX
protections, including allowing memory to be writable in the guest
without it also being writable in host userspace.
The immediate and driving use case for guest_memfd are Confidential
(CoCo) VMs, specifically AMD's SEV-SNP, Intel's TDX, and KVM's own pKVM.
For such use cases, being able to map memory into KVM guests without
requiring said memory to be mapped into the host is a hard requirement.
While SEV+ and TDX prevent untrusted software from reading guest private
data by encrypting guest memory, pKVM provides confidentiality and
integrity *without* relying on memory encryption. In addition, with
SEV-SNP and especially TDX, accessing guest private memory can be fatal
to the host, i.e. KVM must be prevent host userspace from accessing
guest memory irrespective of hardware behavior.
Long term, guest_memfd may be useful for use cases beyond CoCo VMs,
for example hardening userspace against unintentional accesses to guest
memory. As mentioned earlier, KVM's ABI uses userspace VMA protections to
define the allow guest protection (with an exception granted to mapping
guest memory executable), and similarly KVM currently requires the guest
mapping size to be a strict subset of the host userspace mapping size.
Decoupling the mappings sizes would allow userspace to precisely map
only what is needed and with the required permissions, without impacting
guest performance.
A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to DMA from or into guest memory).
guest_memfd is the result of 3+ years of development and exploration;
taking on memory management responsibilities in KVM was not the first,
second, or even third choice for supporting CoCo VMs. But after many
failed attempts to avoid KVM-specific backing memory, and looking at
where things ended up, it is quite clear that of all approaches tried,
guest_memfd is the simplest, most robust, and most extensible, and the
right thing to do for KVM and the kernel at-large.
The "development cycle" for this version is going to be very short;
ideally, next week I will merge it as is in kvm/next, taking this through
the KVM tree for 6.8 immediately after the end of the merge window.
The series is still based on 6.6 (plus KVM changes for 6.7) so it
will require a small fixup for changes to get_file_rcu() introduced in
6.7 by commit 0ede61d8589c ("file: convert to SLAB_TYPESAFE_BY_RCU").
The fixup will be done as part of the merge commit, and most of the text
above will become the commit message for the merge.
Pending post-merge work includes:
- hugepage support
- looking into using the restrictedmem framework for guest memory
- introducing a testing mechanism to poison memory, possibly using
the same memory attributes introduced here
- SNP and TDX support
There are two non-KVM patches buried in the middle of this series:
fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure()
mm: Add AS_UNMOVABLE to mark mapping as completely unmovable
The first is small and mostly suggested-by Christian Brauner; the second
a bit less so but it was written by an mm person (Vlastimil Babka).
|
|
Let x86 track the number of address spaces on a per-VM basis so that KVM
can disallow SMM memslots for confidential VMs. Confidentials VMs are
fundamentally incompatible with emulating SMM, which as the name suggests
requires being able to read and write guest memory and register state.
Disallowing SMM will simplify support for guest private memory, as KVM
will not need to worry about tracking memory attributes for multiple
address spaces (SMM is the only "non-default" address space across all
architectures).
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Reviewed-by: Fuad Tabba <[email protected]>
Tested-by: Fuad Tabba <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|