aboutsummaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)AuthorFilesLines
2023-11-18bpf: move verbose_linfo() into kernel/bpf/log.cAndrii Nakryiko1-0/+4
verifier.c is huge. Let's try to move out parts that are logging-related into log.c, as we previously did with bpf_log() and other related stuff. This patch moves line info verbose output routines: it's pretty self-contained and isolated code, so there is no problem with this. Acked-by: Eduard Zingerman <[email protected]> Acked-by: Stanislav Fomichev <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-11-18Merge tag 'mlx5-updates-2023-11-13' of ↵David S. Miller1-1/+4
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux mlx5-updates-2023-11-13 1) Cleanup patches, leftovers from previous cycle 2) Allow sync reset flow when BF MGT interface device is present 3) Trivial ptp refactorings and improvements 4) Add local loopback counter to vport rep stats Signed-off-by: David S. Miller <[email protected]>
2023-11-18net: Change the API of PHY default timestamp to MACKory Maincent2-0/+9
Change the API to select MAC default time stamping instead of the PHY. Indeed the PHY is closer to the wire therefore theoretically it has less delay than the MAC timestamping but the reality is different. Due to lower time stamping clock frequency, latency in the MDIO bus and no PHC hardware synchronization between different PHY, the PHY PTP is often less precise than the MAC. The exception is for PHY designed specially for PTP case but these devices are not very widespread. For not breaking the compatibility I introduce a default_timestamp flag in phy_device that is set by the phy driver to know we are using the old API behavior. The phy_set_timestamp function is called at each call of phy_attach_direct. In case of MAC driver using phylink this function is called when the interface is turned up. Then if the interface goes down and up again the last choice of timestamp will be overwritten by the default choice. A solution could be to cache the timestamp status but it can bring other issues. In case of SFP, if we change the module, it doesn't make sense to blindly re-set the timestamp back to PHY, if the new module has a PHY with mediocre timestamping capabilities. Signed-off-by: Kory Maincent <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-11-18net: Replace hwtstamp_source by timestamping layerKory Maincent1-8/+3
Replace hwtstamp_source which is only used by the kernel_hwtstamp_config structure by the more widely use timestamp_layer structure. This is done to prepare the support of selectable timestamping source. Signed-off-by: Kory Maincent <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-11-18net: Make dev_set_hwtstamp_phylib accessibleKory Maincent1-0/+3
Make the dev_set_hwtstamp_phylib function accessible in prevision to use it from ethtool to reset the tstamp current configuration. Reviewed-by: Florian Fainelli <[email protected]> Signed-off-by: Kory Maincent <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-11-18net: ethtool: Refactor identical get_ts_info implementations.Richard Cochran1-0/+8
The vlan, macvlan and the bonding drivers call their "real" device driver in order to report the time stamping capabilities. Provide a core ethtool helper function to avoid copy/paste in the stack. Signed-off-by: Richard Cochran <[email protected]> Signed-off-by: Kory Maincent <[email protected]> Reviewed-by: Florian Fainelli <[email protected]> Reviewed-by: Jay Vosburgh <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-11-18net: Convert PHYs hwtstamp callback to use kernel_hwtstamp_configKory Maincent2-3/+7
The PHYs hwtstamp callback are still getting the timestamp config from ifreq and using copy_from/to_user. Get rid of these functions by using timestamp configuration in parameter. This also allow to move on to kernel_hwtstamp_config and be similar to net devices using the new ndo_hwstamp_get/set. This adds the possibility to manipulate the timestamp configuration from the kernel which was not possible with the copy_from/to_user. Signed-off-by: Kory Maincent <[email protected]> Reviewed-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-11-18vfs: remove a redundant might_sleep in wait_on_inodeMateusz Guzik1-1/+0
wait_on_bit already does it. Signed-off-by: Mateusz Guzik <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
2023-11-18fs: Block writes to mounted block devicesJan Kara1-1/+2
Ask block layer to block writes to block devices mounted by filesystems. Signed-off-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Christian Brauner <[email protected]> Reviewed-by: Jens Axboe <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-11-18block: Add config option to not allow writing to mounted devicesJan Kara2-0/+3
Writing to mounted devices is dangerous and can lead to filesystem corruption as well as crashes. Furthermore syzbot comes with more and more involved examples how to corrupt block device under a mounted filesystem leading to kernel crashes and reports we can do nothing about. Add tracking of writers to each block device and a kernel cmdline argument which controls whether other writeable opens to block devices open with BLK_OPEN_RESTRICT_WRITES flag are allowed. We will make filesystems use this flag for used devices. Note that this effectively only prevents modification of the particular block device's page cache by other writers. The actual device content can still be modified by other means - e.g. by issuing direct scsi commands, by doing writes through devices lower in the storage stack (e.g. in case loop devices, DM, or MD are involved) etc. But blocking direct modifications of the block device page cache is enough to give filesystems a chance to perform data validation when loading data from the underlying storage and thus prevent kernel crashes. Syzbot can use this cmdline argument option to avoid uninteresting crashes. Also users whose userspace setup does not need writing to mounted block devices can set this option for hardening. Link: https://lore.kernel.org/all/[email protected] Signed-off-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jens Axboe <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-11-18block: Remove blkdev_get_by_*() functionsJan Kara1-5/+0
blkdev_get_by_*() and blkdev_put() functions are now unused. Remove them. Acked-by: Christoph Hellwig <[email protected]> Signed-off-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Christian Brauner <[email protected]> Reviewed-by: Jens Axboe <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-11-18fs: handle freezing from multiple devicesChristian Brauner1-1/+17
Before [1] freezing a filesystems through the block layer only worked for the main block device as the owning superblock of additional block devices could not be found. Any filesystem that made use of multiple block devices would only be freezable via it's main block device. For example, consider xfs over device mapper with /dev/dm-0 as main block device and /dev/dm-1 as external log device. Two freeze requests before [1]: (1) dmsetup suspend /dev/dm-0 on the main block device bdev_freeze(dm-0) -> dm-0->bd_fsfreeze_count++ -> freeze_super(xfs-sb) The owning superblock is found and the filesystem gets frozen. Returns 0. (2) dmsetup suspend /dev/dm-1 on the log device bdev_freeze(dm-1) -> dm-1->bd_fsfreeze_count++ The owning superblock isn't found and only the block device freeze count is incremented. Returns 0. Two freeze requests after [1]: (1') dmsetup suspend /dev/dm-0 on the main block device bdev_freeze(dm-0) -> dm-0->bd_fsfreeze_count++ -> freeze_super(xfs-sb) The owning superblock is found and the filesystem gets frozen. Returns 0. (2') dmsetup suspend /dev/dm-1 on the log device bdev_freeze(dm-0) -> dm-0->bd_fsfreeze_count++ -> freeze_super(xfs-sb) The owning superblock is found and the filesystem gets frozen. Returns -EBUSY. When (2') is called we initiate a freeze from another block device of the same superblock. So we increment the bd_fsfreeze_count for that additional block device. But we now also find the owning superblock for additional block devices and call freeze_super() again which reports -EBUSY. This can be reproduced through xfstests via: mkfs.xfs -f -m crc=1,reflink=1,rmapbt=1, -i sparse=1 -lsize=1g,logdev=/dev/nvme1n1p4 /dev/nvme1n1p3 mkfs.xfs -f -m crc=1,reflink=1,rmapbt=1, -i sparse=1 -lsize=1g,logdev=/dev/nvme1n1p6 /dev/nvme1n1p5 FSTYP=xfs export TEST_DEV=/dev/nvme1n1p3 export TEST_DIR=/mnt/test export TEST_LOGDEV=/dev/nvme1n1p4 export SCRATCH_DEV=/dev/nvme1n1p5 export SCRATCH_MNT=/mnt/scratch export SCRATCH_LOGDEV=/dev/nvme1n1p6 export USE_EXTERNAL=yes sudo ./check generic/311 Current semantics allow two concurrent freezers: one initiated from userspace via FREEZE_HOLDER_USERSPACE and one initiated from the kernel via FREEZE_HOLDER_KERNEL. If there are multiple concurrent freeze requests from either FREEZE_HOLDER_USERSPACE or FREEZE_HOLDER_KERNEL -EBUSY is returned. We need to preserve these semantics because as they are uapi via FIFREEZE and FITHAW ioctl()s. IOW, freezes don't nest for FIFREEZE and FITHAW. Other kernels consumers rely on non-nesting freezes as well. With freezes initiated from the block layer freezes need to nest if the same superblock is frozen via multiple devices. So we need to start counting the number of freeze requests. If FREEZE_MAY_NEST is passed alongside FREEZE_HOLDER_KERNEL or FREEZE_HOLDER_USERSPACE we allow the caller to nest freeze calls. To accommodate the old semantics we split the freeze counter into two counting kernel initiated and userspace initiated freezes separately. We can then also stop recording FREEZE_HOLDER_* in struct sb_writers. We also simplify freezing by making all concurrent freezers share a single active superblock reference count instead of having separate references for kernel and userspace. I don't see why we would need two active reference counts. Neither FREEZE_HOLDER_KERNEL nor FREEZE_HOLDER_USERSPACE can put the active reference as long as they are concurrent freezers anwyay. That was already true before we allowed nesting freezes. Survives various fstests runs with different options including the reproducer, online scrub, and online repair, fsfreze, and so on. Also survives blktests. Link: https://lore.kernel.org/linux-block/87bkccnwxc.fsf@debian-BULLSEYE-live-builder-AMD64 Link: https://lore.kernel.org/r/[email protected] Fixes: 288d8706abfc ("bdev: implement freeze and thaw holder operations") [1] # no backport needed Tested-by: Chandan Babu R <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reported-by: Chandan Babu R <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-11-18blkdev: comment fs_holder_opsChristian Brauner1-0/+5
Add a comment to @fs_holder_ops that @holder must point to a superblock. Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-11-18super: remove bd_fsfreeze_sbChristian Brauner1-5/+2
Remove bd_fsfreeze_sb as it's now unused and can be removed. Also move bd_fsfreeze_count down to not have it weirdly placed in the middle of the holder fields. Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Jan Kara <[email protected]> Suggested-by: Jan Kara <[email protected]> Suggested-by: Christoph Hellwig <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-11-18fs: remove get_active_super()Christian Brauner1-1/+0
This function is now unused so remove it. One less function that uses the global superblock list. Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-11-18bdev: implement freeze and thaw holder operationsChristian Brauner1-1/+1
The old method of implementing block device freeze and thaw operations required us to rely on get_active_super() to walk the list of all superblocks on the system to find any superblock that might use the block device. This is wasteful and not very pleasant overall. Now that we can finally go straight from block device to owning superblock things become way simpler. Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-11-18bdev: add freeze and thaw holder operationsChristian Brauner1-0/+10
Add block device freeze and thaw holder operations. Follow-up patches will implement block device freeze and thaw based on stuct blk_holder_ops. Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-11-18bdev: rename freeze and thaw helpersChristian Brauner1-2/+2
We have bdev_mark_dead() etc and we're going to move block device freezing to holder ops in the next patch. Make the naming consistent: * freeze_bdev() -> bdev_freeze() * thaw_bdev() -> bdev_thaw() Also document the return code. Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-11-18mounts: keep list of mounts in an rbtreeMiklos Szeredi1-3/+2
When adding a mount to a namespace insert it into an rbtree rooted in the mnt_namespace instead of a linear list. The mnt.mnt_list is still used to set up the mount tree and for propagation, but not after the mount has been added to a namespace. Hence mnt_list can live in union with rb_node. Use MNT_ONRB mount flag to validate that the mount is on the correct list. This allows removing the cursor used for reading /proc/$PID/mountinfo. The mnt_id_unique of the next mount can be used as an index into the seq file. Tested by inserting 100k bind mounts, unsharing the mount namespace, and unmounting. No performance regressions have been observed. For the last mount in the 100k list the statmount() call was more than 100x faster due to the mount ID lookup not having to do a linear search. This patch makes the overhead of mount ID lookup non-observable in this range. Signed-off-by: Miklos Szeredi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Ian Kent <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-11-17bpf: rename BPF_F_TEST_SANITY_STRICT to BPF_F_TEST_REG_INVARIANTSAndrii Nakryiko1-1/+1
Rename verifier internal flag BPF_F_TEST_SANITY_STRICT to more neutral BPF_F_TEST_REG_INVARIANTS. This is a follow up to [0]. A few selftests and veristat need to be adjusted in the same patch as well. [0] https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/ Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-11-17f2fs: the name of a struct is wrong in a comment.Yang Hubin1-1/+1
The macro SUMMARY_SIZE represents the size of the struct f2fs_summary, instead of the size of the struct summary. Signed-off-by: Yang Hubin <[email protected]> Signed-off-by: Qian Haolai <[email protected]> Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2023-11-17tee: system sessionEtienne Carriere1-0/+16
Adds kernel client API function tee_client_system_session() for a client to request a system service entry in TEE context. This feature is needed to prevent a system deadlock when several TEE client applications invoke TEE, consuming all TEE thread contexts available in the secure world. The deadlock can happen in the OP-TEE driver for example if all these TEE threads issue an RPC call from TEE to Linux OS to access an eMMC RPMB partition (TEE secure storage) which device clock or regulator controller is accessed through an OP-TEE SCMI services. In that case, Linux SCMI driver must reach OP-TEE SCMI service without waiting until one of the consumed TEE threads is freed. Reviewed-by: Sumit Garg <[email protected]> Co-developed-by: Jens Wiklander <[email protected]> Signed-off-by: Etienne Carriere <[email protected]> Signed-off-by: Jens Wiklander <[email protected]>
2023-11-17linux/export: clean up the IA-64 KSYM_FUNC macroLukas Bulwahn1-3/+1
With commit cf8e8658100d ("arch: Remove Itanium (IA-64) architecture"), there is no need to keep the IA-64 definition of the KSYM_FUNC macro. Clean up the IA-64 definition of the KSYM_FUNC macro. Signed-off-by: Lukas Bulwahn <[email protected]> Reviewed-by: Nathan Chancellor <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]>
2023-11-16net: linkmode: add linkmode_fill() helperRussell King (Oracle)1-0/+5
Add a linkmode_fill() helper, which will allow us to convert phylink's open coded bitmap_fill() operations. Signed-off-by: Russell King (Oracle) <[email protected]> Reviewed-by: Andrew Lunn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-11-16device property: Add fwnode_property_match_property_string()Andy Shevchenko1-0/+12
Sometimes the users want to match the single value string property against an array of predefined strings. Create a helper for them. Signed-off-by: Andy Shevchenko <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jonathan Cameron <[email protected]>
2023-11-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni17-261/+82
Cross-merge networking fixes after downstream PR. No conflicts. Signed-off-by: Paolo Abeni <[email protected]>
2023-11-16Merge tag 'net-6.7-rc2' of ↵Linus Torvalds2-4/+8
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from BPF and netfilter. Current release - regressions: - core: fix undefined behavior in netdev name allocation - bpf: do not allocate percpu memory at init stage - netfilter: nf_tables: split async and sync catchall in two functions - mptcp: fix possible NULL pointer dereference on close Current release - new code bugs: - eth: ice: dpll: fix initial lock status of dpll Previous releases - regressions: - bpf: fix precision backtracking instruction iteration - af_unix: fix use-after-free in unix_stream_read_actor() - tipc: fix kernel-infoleak due to uninitialized TLV value - eth: bonding: stop the device in bond_setup_by_slave() - eth: mlx5: - fix double free of encap_header - avoid referencing skb after free-ing in drop path - eth: hns3: fix VF reset - eth: mvneta: fix calls to page_pool_get_stats Previous releases - always broken: - core: set SOCK_RCU_FREE before inserting socket into hashtable - bpf: fix control-flow graph checking in privileged mode - eth: ppp: limit MRU to 64K - eth: stmmac: avoid rx queue overrun - eth: icssg-prueth: fix error cleanup on failing initialization - eth: hns3: fix out-of-bounds access may occur when coalesce info is read via debugfs - eth: cortina: handle large frames Misc: - selftests: gso: support CONFIG_MAX_SKB_FRAGS up to 45" * tag 'net-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (78 commits) macvlan: Don't propagate promisc change to lower dev in passthru net: sched: do not offload flows with a helper in act_ct net/mlx5e: Check return value of snprintf writing to fw_version buffer for representors net/mlx5e: Check return value of snprintf writing to fw_version buffer net/mlx5e: Reduce the size of icosq_str net/mlx5: Increase size of irq name buffer net/mlx5e: Update doorbell for port timestamping CQ before the software counter net/mlx5e: Track xmit submission to PTP WQ after populating metadata map net/mlx5e: Avoid referencing skb after free-ing in drop path of mlx5e_sq_xmit_wqe net/mlx5e: Don't modify the peer sent-to-vport rules for IPSec offload net/mlx5e: Fix pedit endianness net/mlx5e: fix double free of encap_header in update funcs net/mlx5e: fix double free of encap_header net/mlx5: Decouple PHC .adjtime and .adjphase implementations net/mlx5: DR, Allow old devices to use multi destination FTE net/mlx5: Free used cpus mask when an IRQ is released Revert "net/mlx5: DR, Supporting inline WQE when possible" bpf: Do not allocate percpu memory at init stage net: Fix undefined behavior in netdev name allocation dt-bindings: net: ethernet-controller: Fix formatting error ...
2023-11-16Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds1-7/+0
Pull virtio fixes from Michael Tsirkin: "Bugfixes all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: vhost-vdpa: fix use after free in vhost_vdpa_probe() virtio_pci: Switch away from deprecated irq_set_affinity_hint riscv, qemu_fw_cfg: Add support for RISC-V architecture vdpa_sim_blk: allocate the buffer zeroed virtio_pci: move structure to a header
2023-11-16treewide, spi: Get rid of SPI_MASTER_HALF_DUPLEXAndy Shevchenko1-2/+0
The SPI_MASTER_HALF_DUPLEX is the legacy name of a definition for a half duplex flag. Since all others had been replaced with the respective SPI_CONTROLLER prefix get rid of the last one as well. There is no functional change intended. Signed-off-by: Andy Shevchenko <[email protected]> Acked-by: Greg Kroah-Hartman <[email protected]> Acked-by: Ulf Hansson <[email protected]> # For MMC Acked-by: Dmitry Torokhov <[email protected]> # for input Acked-by: Paolo Abeni <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Mark Brown <[email protected]>
2023-11-16coresight-tpdm: Introduce TPDM subtype to TPDM driverTao Zhang1-0/+1
Introduce the new subtype of "CORESIGHT_DEV_SUBTYPE_SOURCE_TPDM" for TPDM components in driver. Signed-off-by: Tao Zhang <[email protected]> Signed-off-by: Suzuki K Poulose <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-11-16indirect_call_wrapper: Fix typo in INDIRECT_CALL_$NR kerneldocTobias Klauser1-1/+1
Fix a small typo in the kerneldoc comment of the INDIRECT_CALL_$NR macro. Signed-off-by: Tobias Klauser <[email protected]> Reviewed-by: Simon Horman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
2023-11-15Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski1-3/+7
Alexei Starovoitov says: ==================== pull-request: bpf 2023-11-15 We've added 7 non-merge commits during the last 6 day(s) which contain a total of 9 files changed, 200 insertions(+), 49 deletions(-). The main changes are: 1) Do not allocate bpf specific percpu memory unconditionally, from Yonghong. 2) Fix precision backtracking instruction iteration, from Andrii. 3) Fix control flow graph checking, from Andrii. 4) Fix xskxceiver selftest build, from Anders. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Do not allocate percpu memory at init stage selftests/bpf: add more test cases for check_cfg() bpf: fix control-flow graph checking in privileged mode selftests/bpf: add edge case backtracking logic test bpf: fix precision backtracking instruction iteration bpf: handle ldimm64 properly in check_cfg() selftests: bpf: xskxceiver: ksft_print_msg: fix format type error ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-11-15new helper: user_path_locked_at()Al Viro1-0/+1
Equivalent of kern_path_locked() taking dfd/userland name. User introduced in the next commit. Signed-off-by: Al Viro <[email protected]>
2023-11-15power: supply: bq27xxx: Stop and start delayed work in suspend and resumeMarek Vasut1-0/+1
This driver uses delayed work to perform periodic battery state read out. This delayed work is not stopped across suspend and resume cycle. The read out can occur early in the resume cycle. In case of an I2C variant of this hardware, that read out triggers I2C transfer. That I2C transfer may happen while the I2C controller is still suspended, which produces a WARNING in the kernel log. Fix this by introducing trivial PM ops, which stop the delayed work before the system enters suspend, and schedule the delayed work right after the system resumes. Signed-off-by: Marek Vasut <[email protected]> Reviewed-by: Hans de Goede <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sebastian Reichel <[email protected]>
2023-11-15bpf: add register bounds sanity checks and sanitizationAndrii Nakryiko1-0/+1
Add simple sanity checks that validate well-formed ranges (min <= max) across u64, s64, u32, and s32 ranges. Also for cases when the value is constant (either 64-bit or 32-bit), we validate that ranges and tnums are in agreement. These bounds checks are performed at the end of BPF_ALU/BPF_ALU64 operations, on conditional jumps, and for LDX instructions (where subreg zero/sign extension is probably the most important to check). This covers most of the interesting cases. Also, we validate the sanity of the return register when manually adjusting it for some special helpers. By default, sanity violation will trigger a warning in verifier log and resetting register bounds to "unbounded" ones. But to aid development and debugging, BPF_F_TEST_SANITY_STRICT flag is added, which will trigger hard failure of verification with -EFAULT on register bounds violations. This allows selftests to catch such issues. veristat will also gain a CLI option to enable this behavior. Acked-by: Eduard Zingerman <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Shung-Hsi Yu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-11-15bpf: generalize reg_set_min_max() to handle non-const register comparisonsAndrii Nakryiko1-0/+4
Generalize bounds adjustment logic of reg_set_min_max() to handle not just register vs constant case, but in general any register vs any register cases. For most of the operations it's trivial extension based on range vs range comparison logic, we just need to properly pick min/max of a range to compare against min/max of the other range. For BPF_JSET we keep the original capabilities, just make sure JSET is integrated in the common framework. This is manifested in the internal-only BPF_JSET + BPF_X "opcode" to allow for simpler and more uniform rev_opcode() handling. See the code for details. This allows to reuse the same code exactly both for TRUE and FALSE branches without explicitly handling both conditions with custom code. Note also that now we don't need a special handling of BPF_JEQ/BPF_JNE case none of the registers are constants. This is now just a normal generic case handled by reg_set_min_max(). To make tnum handling cleaner, tnum_with_subreg() helper is added, as that's a common operator when dealing with 32-bit subregister bounds. This keeps the overall logic much less noisy when it comes to tnums. Acked-by: Eduard Zingerman <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Shung-Hsi Yu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-11-15net/mlx5: Query maximum frequency adjustment of the PTP hardware clockRahul Rameshbabu1-1/+4
Some mlx5 devices do not support the default advertised maximum frequency adjustment value for the PTP hardware clock that is set by the driver. These devices need to be queried when initializing the clock functionality in order to get the maximum supported frequency adjustment value. This value can be greater than the minimum supported frequency adjustment across mlx5 devices (50 million ppb). Signed-off-by: Rahul Rameshbabu <[email protected]> Reviewed-by: Tariq Toukan <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2023-11-15bpf: Do not allocate percpu memory at init stageYonghong Song1-1/+1
Kirill Shutemov reported significant percpu memory consumption increase after booting in 288-cpu VM ([1]) due to commit 41a5db8d8161 ("bpf: Add support for non-fix-size percpu mem allocation"). The percpu memory consumption is increased from 111MB to 969MB. The number is from /proc/meminfo. I tried to reproduce the issue with my local VM which at most supports upto 255 cpus. With 252 cpus, without the above commit, the percpu memory consumption immediately after boot is 57MB while with the above commit the percpu memory consumption is 231MB. This is not good since so far percpu memory from bpf memory allocator is not widely used yet. Let us change pre-allocation in init stage to on-demand allocation when verifier detects there is a need of percpu memory for bpf program. With this change, percpu memory consumption after boot can be reduced signicantly. [1] https://lore.kernel.org/lkml/[email protected]/ Fixes: 41a5db8d8161 ("bpf: Add support for non-fix-size percpu mem allocation") Reported-and-tested-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Yonghong Song <[email protected]> Acked-by: Hou Tao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-11-15Merge drm/drm-next into drm-misc-nextMaxime Ripard335-2486/+8846
Let's kickstart the v6.8 release cycle. Signed-off-by: Maxime Ripard <[email protected]>
2023-11-15Merge branch 'tip/perf/urgent'Peter Zijlstra329-2412/+8792
Avoid conflicts, base on fixes. Signed-off-by: Peter Zijlstra <[email protected]>
2023-11-15cleanup: Add conditional guard supportPeter Zijlstra4-8/+70
Adds: - DEFINE_GUARD_COND() / DEFINE_LOCK_GUARD_1_COND() to extend existing guards with conditional lock primitives, eg. mutex_trylock(), mutex_lock_interruptible(). nb. both primitives allow NULL 'locks', which cause the lock to fail (obviously). - extends scoped_guard() to not take the body when the the conditional guard 'fails'. eg. scoped_guard (mutex_intr, &task->signal_cred_guard_mutex) { ... } will only execute the body when the mutex is held. - provides scoped_cond_guard(name, fail, args...); which extends scoped_guard() to do fail when the lock-acquire fails. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/20231102110706.460851167%40infradead.org
2023-11-15sched/deadline: Introduce deadline serversPeter Zijlstra1-1/+21
Low priority tasks (e.g., SCHED_OTHER) can suffer starvation if tasks with higher priority (e.g., SCHED_FIFO) monopolize CPU(s). RT Throttling has been introduced a while ago as a (mostly debug) countermeasure one can utilize to reserve some CPU time for low priority tasks (usually background type of work, e.g. workqueues, timers, etc.). It however has its own problems (see documentation) and the undesired effect of unconditionally throttling FIFO tasks even when no lower priority activity needs to run (there are mechanisms to fix this issue as well, but, again, with their own problems). Introduce deadline servers to service low priority tasks needs under starvation conditions. Deadline servers are built extending SCHED_DEADLINE implementation to allow 2-level scheduling (a sched_deadline entity becomes a container for lower priority scheduling entities). Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Daniel Bristot de Oliveira <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/4968601859d920335cf85822eb573a5f179f04b8.1699095159.git.bristot@kernel.org
2023-11-15sched: Unify runtime accounting across classesPeter Zijlstra1-1/+1
All classes use sched_entity::exec_start to track runtime and have copies of the exact same code around to compute runtime. Collapse all that. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Daniel Bristot de Oliveira <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Phil Auld <[email protected]> Reviewed-by: Valentin Schneider <[email protected]> Reviewed-by: Steven Rostedt (Google) <[email protected]> Link: https://lkml.kernel.org/r/54d148a144f26d9559698c4dd82d8859038a7380.1699095159.git.bristot@kernel.org
2023-11-15sched/eevdf: Sort the rbtree by virtual deadlineAbel Wu1-1/+1
Sort the task timeline by virtual deadline and keep the min_vruntime in the augmented tree, so we can avoid doubling the worst case cost and make full use of the cached leftmost node to enable O(1) fastpath picking in next patch. Signed-off-by: Abel Wu <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2023-11-15sched/numa: Fix mm numa_scan_seq based unconditional scanRaghavendra K T1-0/+3
Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic") NUMA Balancing allows updating PTEs to trap NUMA hinting faults if the task had previously accessed VMA. However unconditional scan of VMAs are allowed during initial phase of VMA creation until process's mm numa_scan_seq reaches 2 even though current task had not accessed VMA. Rationale: - Without initial scan subsequent PTE update may never happen. - Give fair opportunity to all the VMAs to be scanned and subsequently understand the access pattern of all the VMAs. But it has a corner case where, if a VMA is created after some time, process's mm numa_scan_seq could be already greater than 2. For e.g., values of mm numa_scan_seq when VMAs are created by running mmtest autonuma benchmark briefly looks like: start_seq=0 : 459 start_seq=2 : 138 start_seq=3 : 144 start_seq=4 : 8 start_seq=8 : 1 start_seq=9 : 1 This results in no unconditional PTE updates for those VMAs created after some time. Fix: - Note down the initial value of mm numa_scan_seq in per VMA start_seq. - Allow unconditional scan till start_seq + 2. Result: SUT: AMD EPYC Milan with 2 NUMA nodes 256 cpus. base kernel: upstream 6.6-rc6 with Mels patches [1] applied. kernbench ========== base patched %gain Amean elsp-128 165.09 ( 0.00%) 164.78 * 0.19%* Duration User 41404.28 41375.08 Duration System 9862.22 9768.48 Duration Elapsed 519.87 518.72 Ops NUMA PTE updates 1041416.00 831536.00 Ops NUMA hint faults 263296.00 220966.00 Ops NUMA pages migrated 258021.00 212769.00 Ops AutoNUMA cost 1328.67 1114.69 autonumabench NUMA01_THREADLOCAL ================== Amean elsp-NUMA01_THREADLOCAL 81.79 (0.00%) 67.74 * 17.18%* Duration User 54832.73 47379.67 Duration System 75.00 185.75 Duration Elapsed 576.72 476.09 Ops NUMA PTE updates 394429.00 11121044.00 Ops NUMA hint faults 1001.00 8906404.00 Ops NUMA pages migrated 288.00 2998694.00 Ops AutoNUMA cost 7.77 44666.84 Signed-off-by: Raghavendra K T <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Mel Gorman <[email protected]> Link: https://lore.kernel.org/r/2ea7cbce80ac7c62e90cbfb9653a7972f902439f.1697816692.git.raghavendra.kt@amd.com
2023-11-14Merge tag 'hardening-v6.7-rc2' of ↵Linus Torvalds1-0/+6
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening fixes from Kees Cook: - stackleak: add declarations for global functions (Arnd Bergmann) - gcc-plugins: randstruct: Only warn about true flexible arrays (Kees Cook) - gcc-plugins: latent_entropy: Fix description typo (Konstantin Runov) * tag 'hardening-v6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: gcc-plugins: latent_entropy: Fix typo (args -> argc) in plugin description gcc-plugins: randstruct: Only warn about true flexible arrays stackleak: add declarations for global functions
2023-11-15perf/core: Fix cpuctx refcountingPeter Zijlstra1-5/+8
Audit of the refcounting turned up that perf_pmu_migrate_context() fails to migrate the ctx refcount. Fixes: bd2756811766 ("perf: Rewrite core context handling") Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Cc: <[email protected]>
2023-11-14iosys-map: Rename locals used inside macrosMichał Winiarski1-22/+22
Widely used variable names can be used by macro users, potentially leading to name collisions. Suffix locals used inside the macros with an underscore, to reduce the collision potential. Suggested-by: Lucas De Marchi <[email protected]> Signed-off-by: Michał Winiarski <[email protected]> Reviewed-by: Lucas De Marchi <[email protected]> Signed-off-by: Lucas De Marchi <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2023-11-14Merge branch 'kvm-guestmemfd' into HEADPaolo Bonzini4-26/+140
Introduce several new KVM uAPIs to ultimately create a guest-first memory subsystem within KVM, a.k.a. guest_memfd. Guest-first memory allows KVM to provide features, enhancements, and optimizations that are kludgly or outright impossible to implement in a generic memory subsystem. The core KVM ioctl() for guest_memfd is KVM_CREATE_GUEST_MEMFD, which similar to the generic memfd_create(), creates an anonymous file and returns a file descriptor that refers to it. Again like "regular" memfd files, guest_memfd files live in RAM, have volatile storage, and are automatically released when the last reference is dropped. The key differences between memfd files (and every other memory subystem) is that guest_memfd files are bound to their owning virtual machine, cannot be mapped, read, or written by userspace, and cannot be resized. guest_memfd files do however support PUNCH_HOLE, which can be used to convert a guest memory area between the shared and guest-private states. A second KVM ioctl(), KVM_SET_MEMORY_ATTRIBUTES, allows userspace to specify attributes for a given page of guest memory. In the long term, it will likely be extended to allow userspace to specify per-gfn RWX protections, including allowing memory to be writable in the guest without it also being writable in host userspace. The immediate and driving use case for guest_memfd are Confidential (CoCo) VMs, specifically AMD's SEV-SNP, Intel's TDX, and KVM's own pKVM. For such use cases, being able to map memory into KVM guests without requiring said memory to be mapped into the host is a hard requirement. While SEV+ and TDX prevent untrusted software from reading guest private data by encrypting guest memory, pKVM provides confidentiality and integrity *without* relying on memory encryption. In addition, with SEV-SNP and especially TDX, accessing guest private memory can be fatal to the host, i.e. KVM must be prevent host userspace from accessing guest memory irrespective of hardware behavior. Long term, guest_memfd may be useful for use cases beyond CoCo VMs, for example hardening userspace against unintentional accesses to guest memory. As mentioned earlier, KVM's ABI uses userspace VMA protections to define the allow guest protection (with an exception granted to mapping guest memory executable), and similarly KVM currently requires the guest mapping size to be a strict subset of the host userspace mapping size. Decoupling the mappings sizes would allow userspace to precisely map only what is needed and with the required permissions, without impacting guest performance. A guest-first memory subsystem also provides clearer line of sight to things like a dedicated memory pool (for slice-of-hardware VMs) and elimination of "struct page" (for offload setups where userspace _never_ needs to DMA from or into guest memory). guest_memfd is the result of 3+ years of development and exploration; taking on memory management responsibilities in KVM was not the first, second, or even third choice for supporting CoCo VMs. But after many failed attempts to avoid KVM-specific backing memory, and looking at where things ended up, it is quite clear that of all approaches tried, guest_memfd is the simplest, most robust, and most extensible, and the right thing to do for KVM and the kernel at-large. The "development cycle" for this version is going to be very short; ideally, next week I will merge it as is in kvm/next, taking this through the KVM tree for 6.8 immediately after the end of the merge window. The series is still based on 6.6 (plus KVM changes for 6.7) so it will require a small fixup for changes to get_file_rcu() introduced in 6.7 by commit 0ede61d8589c ("file: convert to SLAB_TYPESAFE_BY_RCU"). The fixup will be done as part of the merge commit, and most of the text above will become the commit message for the merge. Pending post-merge work includes: - hugepage support - looking into using the restrictedmem framework for guest memory - introducing a testing mechanism to poison memory, possibly using the same memory attributes introduced here - SNP and TDX support There are two non-KVM patches buried in the middle of this series: fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure() mm: Add AS_UNMOVABLE to mark mapping as completely unmovable The first is small and mostly suggested-by Christian Brauner; the second a bit less so but it was written by an mm person (Vlastimil Babka).
2023-11-14KVM: Allow arch code to track number of memslot address spaces per VMSean Christopherson1-6/+11
Let x86 track the number of address spaces on a per-VM basis so that KVM can disallow SMM memslots for confidential VMs. Confidentials VMs are fundamentally incompatible with emulating SMM, which as the name suggests requires being able to read and write guest memory and register state. Disallowing SMM will simplify support for guest private memory, as KVM will not need to worry about tracking memory attributes for multiple address spaces (SMM is the only "non-default" address space across all architectures). Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Paolo Bonzini <[email protected]> Reviewed-by: Fuad Tabba <[email protected]> Tested-by: Fuad Tabba <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>