Age | Commit message (Collapse) | Author | Files | Lines |
|
The author of commit b3b64ebd3822 ("mm/page_alloc: do bulk array
bounds check after checking populated elements") was possibly
confused by the mixture of return values throughout the function.
The API contract is clear that the function "Returns the number of pages
on the list or array." It does not list zero as a unique return value with
a special meaning. Therefore zero is a plausible return value only if
@nr_pages is zero or less.
Clean up the return logic to make it clear that the returned value is
always the total number of pages in the array/list, not the number of
pages that were allocated during this call.
The only change in behavior with this patch is the value returned if
prepare_alloc_pages() fails. To match the API contract, the number of
pages currently in the array/list is returned in this case.
The call site in __page_pool_alloc_pages_slow() also seems to be confused
on this matter. It should be attended to by someone who is familiar with
that code.
[[email protected]: Return nr_populated if 0 pages are requested]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Jesper Dangaard Brouer <[email protected]>
Cc: Desmond Cheong Zhi Xi <[email protected]>
Cc: Zhang Qiang <[email protected]>
Cc: Yanfei Xu <[email protected]>
Cc: Matteo Croce <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If the array passed in is already partially populated, we should return
"nr_populated" even failing at preparing arguments stage.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yanfei Xu <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Syzbot is reporting potential deadlocks due to pagesets.lock when
PAGE_OWNER is enabled. One example from Desmond Cheong Zhi Xi is as
follows
__alloc_pages_bulk()
local_lock_irqsave(&pagesets.lock, flags) <---- outer lock here
prep_new_page():
post_alloc_hook():
set_page_owner():
__set_page_owner():
save_stack():
stack_depot_save():
alloc_pages():
alloc_page_interleave():
__alloc_pages():
get_page_from_freelist():
rm_queue():
rm_queue_pcplist():
local_lock_irqsave(&pagesets.lock, flags);
*** DEADLOCK ***
Zhang, Qiang also reported
BUG: sleeping function called from invalid context at mm/page_alloc.c:5179
in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 1, name: swapper/0
.....
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:96
___might_sleep.cold+0x1f1/0x237 kernel/sched/core.c:9153
prepare_alloc_pages+0x3da/0x580 mm/page_alloc.c:5179
__alloc_pages+0x12f/0x500 mm/page_alloc.c:5375
alloc_page_interleave+0x1e/0x200 mm/mempolicy.c:2147
alloc_pages+0x238/0x2a0 mm/mempolicy.c:2270
stack_depot_save+0x39d/0x4e0 lib/stackdepot.c:303
save_stack+0x15e/0x1e0 mm/page_owner.c:120
__set_page_owner+0x50/0x290 mm/page_owner.c:181
prep_new_page mm/page_alloc.c:2445 [inline]
__alloc_pages_bulk+0x8b9/0x1870 mm/page_alloc.c:5313
alloc_pages_bulk_array_node include/linux/gfp.h:557 [inline]
vm_area_alloc_pages mm/vmalloc.c:2775 [inline]
__vmalloc_area_node mm/vmalloc.c:2845 [inline]
__vmalloc_node_range+0x39d/0x960 mm/vmalloc.c:2947
__vmalloc_node mm/vmalloc.c:2996 [inline]
vzalloc+0x67/0x80 mm/vmalloc.c:3066
There are a number of ways it could be fixed. The page owner code could
be audited to strip GFP flags that allow sleeping but it'll impair the
functionality of PAGE_OWNER if allocations fail. The bulk allocator could
add a special case to release/reacquire the lock for prep_new_page and
lookup PCP after the lock is reacquired at the cost of performance. The
pages requiring prep could be tracked using the least significant bit and
looping through the array although it is more complicated for the list
interface. The options are relatively complex and the second one still
incurs a performance penalty when PAGE_OWNER is active so this patch takes
the simple approach -- disable bulk allocation of PAGE_OWNER is active.
The caller will be forced to allocate one page at a time incurring a
performance penalty but PAGE_OWNER is already a performance penalty.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: dbbee9d5cd83 ("mm/page_alloc: convert per-cpu list protection to local_lock")
Signed-off-by: Mel Gorman <[email protected]>
Reported-by: Desmond Cheong Zhi Xi <[email protected]>
Reported-by: "Zhang, Qiang" <[email protected]>
Reported-by: [email protected]
Tested-by: [email protected]
Acked-by: Rafael Aquini <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This reverts commit f7173090033c70886d925995e9dfdfb76dbb2441.
Fix an unresolved symbol error when CONFIG_DEBUG_INFO_BTF=y:
LD vmlinux
BTFIDS vmlinux
FAILED unresolved symbol should_fail_alloc_page
make: *** [Makefile:1199: vmlinux] Error 255
make: *** Deleting file 'vmlinux'
Link: https://lkml.kernel.org/r/[email protected]
Fixes: f7173090033c ("mm/page_alloc: make should_fail_alloc_page() static")
Signed-off-by: Matteo Croce <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Tested-by: John Hubbard <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Dan Streetman <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The <linux/kasan.h> header relies on _RET_IP_ being defined, and had been
receiving that definition via inclusion of bug.h which includes kernel.h.
However, since f39650de687e ("kernel.h: split out panic and oops helpers")
that is no longer the case and get the following build error when building
CONFIG_KASAN_HW_TAGS on arm64:
In file included from arch/arm64/mm/kasan_init.c:10:
include/linux/kasan.h: In function 'kasan_slab_free':
include/linux/kasan.h:230:39: error: '_RET_IP_' undeclared (first use in this function)
230 | return __kasan_slab_free(s, object, _RET_IP_, init);
Fix it by including kernel.h from kasan.h.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: f39650de687e ("kernel.h: split out panic and oops helpers")
Signed-off-by: Marco Elver <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Andrey Konovalov <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Peter Collingbourne <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Issue: when SLUB debug is on, hwtag kasan_unpoison() would overwrite the
redzone of object with unaligned size.
An additional memzero_explicit() path is added to replacing init by hwtag
instruction for those unaligned size at SLUB debug mode.
The penalty is acceptable since they are only enabled in debug mode, not
production builds. A block of comment is added for explanation.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yee Lee <[email protected]>
Suggested-by: Andrey Konovalov <[email protected]>
Suggested-by: Marco Elver <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Reviewed-by: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Nicholas Tang <[email protected]>
Cc: Kuan-Ying Lee <[email protected]>
Cc: Chinwen Chang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Move the helper to check slub_debug_enabled, so that we can confine the
use of #ifdef outside slub.c as well.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Marco Elver <[email protected]>
Signed-off-by: Yee Lee <[email protected]>
Suggested-by: Matthew Wilcox <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Chinwen Chang <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Kuan-Ying Lee <[email protected]>
Cc: Nicholas Tang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If we encounter a directory that has been configured to pass on an
extent size hint to a new realtime file and the hint isn't an integer
multiple of the rt extent size, we should flag the hint for
administrative review because that is a misconfiguration (that other
parts of the kernel will fix automatically).
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
|
|
During a realtime grow operation, we run a single transaction for each
rt bitmap block added to the filesystem. This means that each step has
to be careful to increase sb_rblocks appropriately.
Fix the integer overflow error in this calculation that can happen when
the extent size is very large. Found by running growfs to add a rt
volume to a filesystem formatted with a 1g rt extent size.
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
|
|
Improve the checking at the start of a realtime grow operation so that
we avoid accidentally set a new extent size that is too large and avoid
adding an rt volume to a filesystem with rmap or reflink because we
don't support rt rmap or reflink yet.
While we're at it, separate the checks so that we're only testing one
aspect at a time.
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
|
|
Commit 603f000b15f2 changed xfs_ioctl_setattr_check_extsize to reject an
attempt to set an EXTSZINHERIT extent size hint on a directory with
RTINHERIT set if the hint isn't a multiple of the realtime extent size.
However, I have recently discovered that it is possible to change the
realtime extent size when adding a rt device to a filesystem, which
means that the existence of directories with misaligned inherited hints
is not an accident.
As a result, it's possible that someone could have set a valid hint and
added an rt volume with a different rt extent size, which invalidates
the ondisk hints. After such a sequence, FSGETXATTR will report a
misaligned hint, which FSSETXATTR will trip over, causing confusion if
the user was doing the usual GET/SET sequence to change some other
attribute. Change xfs_fill_fsxattr to omit the hint if it isn't aligned
properly.
Fixes: 603f000b15f2 ("xfs: validate extsz hints against rt extent size when rtinherit is set")
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
|
|
While auditing the realtime growfs code, I realized that the GROWFSRT
ioctl (and by extension xfs_growfs) has always allowed sysadmins to
change the realtime extent size when adding a realtime section to the
filesystem. Since we also have always allowed sysadmins to set
RTINHERIT and EXTSZINHERIT on directories even if there is no realtime
device, this invalidates the premise laid out in the comments added in
commit 603f000b15f2.
In other words, this is not a case of inadequate metadata validation.
This is a case of nearly forgotten (and apparently untested) but
supported functionality. Update the comments to reflect what we've
learned, and remove the log message about correcting the misalignment.
Fixes: 603f000b15f2 ("xfs: validate extsz hints against rt extent size when rtinherit is set")
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Carlos Maiolino <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
|
|
While running xfs/168, I noticed a second source of post-shrink
corruption errors causing shutdowns.
Let's say that directory B has a low inode number and is a child of
directory A, which has a high number. If B is empty but open, and
unlinked from A, B's dotdot link continues to point to A. If A is then
unlinked and the filesystem shrunk so that A is no longer a valid inode,
a subsequent AIL push of B will trip the inode verifiers because the
dotdot entry points outside of the filesystem.
To avoid this problem, reset B's dotdot entry to the root directory when
unlinking directories, since the root directory cannot be removed.
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
|
|
While running xfs/168, I noticed occasional write verifier shutdowns
involving inodes at the very end of the filesystem. Existing inode
btree validation code checks that all inode clusters are fully contained
within the filesystem.
However, due to inadequate checking in the fs shrink code, it's possible
that there could be a sparse inode cluster at the end of the filesystem
where the upper inodes of the cluster are marked as holes and the
corresponding blocks are free. In this case, the last blocks in the AG
are listed in the bnobt. This enables the shrink to proceed but results
in a filesystem that trips the inode verifiers. Fix this by disallowing
the shrink.
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
|
|
Now that we create those objects in iomap_writepage_map when needed,
there's no need to pre-create them in iomap_page_mkwrite_actor anymore.
Signed-off-by: Andreas Gruenbacher <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
|
|
In iomap_readpage_actor, don't create iop objects for inline inodes.
Otherwise, iomap_read_inline_data will set PageUptodate without setting
iop->uptodate, and iomap_page_release will eventually complain.
To prevent this kind of bug from occurring in the future, make sure the
page doesn't have private data attached in iomap_read_inline_data.
Signed-off-by: Andreas Gruenbacher <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
|
|
Create an iop in the writeback path if one doesn't exist. This allows us
to avoid creating the iop in some cases. We'll initially do that for pages
with inline data, but it can be extended to pages which are entirely within
an extent. It also allows for an iop to be removed from pages in the
future (eg page split).
Co-developed-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andreas Gruenbacher <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
|
|
The length variable is rather pointless given that it can be trivially
deduced from offset and size. Also the initial calculation can lead
to KASAN warnings.
Signed-off-by: Christoph Hellwig <[email protected]>
Reported-by: Leizhen (ThunderTown) <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
|
|
The length variable is rather pointless given that it can be trivially
deduced from offset and size. Also the initial calculation can lead
to KASAN warnings.
Signed-off-by: Christoph Hellwig <[email protected]>
Reported-by: Leizhen (ThunderTown) <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
|
|
We suppress KCOV for entry.o rather than entry-common.o. As entry.o is
built from entry.S, this is pointless, and permits instrumentation of
entry-common.o, which is built from entry-common.c.
Fix the Makefile to suppress KCOV for entry-common.o, as we had intended
to begin with. I've verified with objdump that this is working as
expected.
Fixes: bf6fa2c0dda7 ("arm64: entry: don't instrument entry code with KCOV")
Signed-off-by: Mark Rutland <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: James Morse <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Will Deacon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
|
|
We intend that all the early exception handling code is marked as
`noinstr`, but we forgot this for __el0_error_handler_common(), which is
called before we have completed entry from user mode. If it were
instrumented, we could run into problems with RCU, lockdep, etc.
Mark it as `noinstr` to prevent this.
The few other functions in entry-common.c which do not have `noinstr` are
called once we've completed entry, and are safe to instrument.
Fixes: bb8e93a287a5 ("arm64: entry: convert SError handlers to C")
Signed-off-by: Mark Rutland <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Joey Gouly <[email protected]>
Cc: James Morse <[email protected]>
Cc: Will Deacon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
|
|
Since commit:
bad1e1c663e0a72f ("arm64: mte: switch GCR_EL1 in kernel entry and exit")
we saved/restored the user GCR_EL1 value at exception boundaries, and
update_gcr_el1_excl() is no longer used for this. However it is used to
restore the kernel's GCR_EL1 value when returning from a suspend state.
Thus, the comment is misleading (and an ISB is necessary).
When restoring the kernel's GCR value, we need an ISB to ensure this is
used by subsequent instructions. We don't necessarily get an ISB by
other means (e.g. if the kernel is built without support for pointer
authentication). As __cpu_setup() initialised GCR_EL1.Exclude to 0xffff,
until a context synchronization event, allocation tag 0 may be used
rather than the desired set of tags.
This patch drops the misleading comment, adds the missing ISB, and for
clarity folds update_gcr_el1_excl() into its only user.
Fixes: bad1e1c663e0 ("arm64: mte: switch GCR_EL1 in kernel entry and exit")
Signed-off-by: Mark Rutland <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Cc: Will Deacon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
|
|
Al reminds us that the usercopy API must only return complete failure
if absolutely nothing could be copied. Currently, if userspace does
something silly like giving us an unaligned pointer to Device memory,
or a size which overruns MTE tag bounds, we may fail to honour that
requirement when faulting on a multi-byte access even though a smaller
access could have succeeded.
Add a mitigation to the fixup routines to fall back to a single-byte
copy if we faulted on a larger access before anything has been written
to the destination, to guarantee making *some* forward progress. We
needn't be too concerned about the overall performance since this should
only occur when callers are doing something a bit dodgy in the first
place. Particularly broken userspace might still be able to trick
generic_perform_write() into an infinite loop by targeting write() at
an mmap() of some read-only device register where the fault-in load
succeeds but any store synchronously aborts such that copy_to_user() is
genuinely unable to make progress, but, well, don't do that...
CC: [email protected]
Reported-by: Chen Huang <[email protected]>
Suggested-by: Al Viro <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Signed-off-by: Robin Murphy <[email protected]>
Link: https://lore.kernel.org/r/dc03d5c675731a1f24a62417dba5429ad744234e.1626098433.git.robin.murphy@arm.com
Signed-off-by: Will Deacon <[email protected]>
|
|
xen-blkfront has a weird protocol where close message from the remote
side can be delayed, and where hot removals are treated somewhat
differently from regular removals, all leading to potential NULL
pointer removals, and a del_gendisk from the block device release
method, which will deadlock. Fix this by just performing normal hot
removals even when the device is opened like all other Linux block
drivers.
Fixes: c76f48eb5c08 ("block: take bd_mutex around delete_partitions in del_gendisk")
Reported-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Tested-by: Vitaly Kuznetsov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Pull NVMe fixes from Christoph:
"nvme fixes for Linux 5.14
- fix various races in nvme-pci when shutting down just after probing
(Casey Chen)
- fix a net_device leak in nvme-tcp (Prabhakar Kushwaha)"
* tag 'nvme-5.14-2021-07-15' of git://git.infradead.org/nvme:
nvme-pci: do not call nvme_dev_remove_admin from nvme_remove
nvme-pci: fix multiple races in nvme_setup_io_queues
nvme-tcp: use __dev_get_by_name instead dev_get_by_name for OPT_HOST_IFACE
|
|
We must release the queue before freeing the tagset.
Fixes: 4af5f2e03013 ("nbd: use blk_mq_alloc_disk and blk_cleanup_disk")
Reported-and-tested-by: [email protected]
Signed-off-by: Wang Qing <[email protected]>
Signed-off-by: Guoqing Jiang <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
We must release the queue before freeing the tagset.
Fixes: 262d431f9000 ("pd: use blk_mq_alloc_disk and blk_cleanup_disk")
Signed-off-by: Guoqing Jiang <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
There's no need for fixed strings to be under 'patternProperties', so move
them under 'properties' instead.
Cc: Jean Delvare <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: Kishon Vijay Abraham I <[email protected]>
Cc: Vinod Koul <[email protected]>
Cc: Saravanan Sekar <[email protected]>
Cc: Mark Brown <[email protected]>
Cc: Jagan Teki <[email protected]>
Cc: Troy Kisky <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Rob Herring <[email protected]>
Acked-by: Mark Brown <[email protected]>
Acked-by: Guenter Roeck <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Another round of removing redundant minItems/maxItems from new schema in
the recent merge window.
If a property has an 'items' list, then a 'minItems' or 'maxItems' with the
same size as the list is redundant and can be dropped. Note that is DT
schema specific behavior and not standard json-schema behavior. The tooling
will fixup the final schema adding any unspecified minItems/maxItems.
This condition is partially checked with the meta-schema already, but
only if both 'minItems' and 'maxItems' are equal to the 'items' length.
An improved meta-schema is pending.
Cc: Stephen Boyd <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Miquel Raynal <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Vignesh Raghavendra <[email protected]>
Cc: Alessandro Zummo <[email protected]>
Cc: Alexandre Belloni <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Sureshkumar Relli <[email protected]>
Cc: Brian Norris <[email protected]>
Cc: Kamal Dasu <[email protected]>
Cc: Linus Walleij <[email protected]>
Cc: Sebastian Siewior <[email protected]>
Cc: Laurent Pinchart <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Rob Herring <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Acked-by: Alexandre Belloni <[email protected]>
Reviewed-by: Laurent Pinchart <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Two additional tests are added:
- SMM triggered from L2 does not currupt L1 host state.
- Save/restore during SMM triggered from L2 does not corrupt guest/host
state.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
If the VM was migrated while in SMM, no nested state was saved/restored,
and therefore svm_leave_smm has to load both save and control area
of the vmcb12. Save area is already loaded from HSAVE area,
so now load the control area as well from the vmcb12.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
VMCB split commit 4995a3685f1b ("KVM: SVM: Use a separate vmcb for the
nested L2 guest") broke return from SMM when we entered there from guest
(L2) mode. Gen2 WS2016/Hyper-V is known to do this on boot. The problem
manifests itself like this:
kvm_exit: reason EXIT_RSM rip 0x7ffbb280 info 0 0
kvm_emulate_insn: 0:7ffbb280: 0f aa
kvm_smm_transition: vcpu 0: leaving SMM, smbase 0x7ffb3000
kvm_nested_vmrun: rip: 0x000000007ffbb280 vmcb: 0x0000000008224000
nrip: 0xffffffffffbbe119 int_ctl: 0x01020000 event_inj: 0x00000000
npt: on
kvm_nested_intercepts: cr_read: 0000 cr_write: 0010 excp: 40060002
intercepts: fd44bfeb 0000217f 00000000
kvm_entry: vcpu 0, rip 0xffffffffffbbe119
kvm_exit: reason EXIT_NPF rip 0xffffffffffbbe119 info
200000006 1ab000
kvm_nested_vmexit: vcpu 0 reason npf rip 0xffffffffffbbe119 info1
0x0000000200000006 info2 0x00000000001ab000 intr_info 0x00000000
error_code 0x00000000
kvm_page_fault: address 1ab000 error_code 6
kvm_nested_vmexit_inject: reason EXIT_NPF info1 200000006 info2 1ab000
int_info 0 int_info_err 0
kvm_entry: vcpu 0, rip 0x7ffbb280
kvm_exit: reason EXIT_EXCP_GP rip 0x7ffbb280 info 0 0
kvm_emulate_insn: 0:7ffbb280: 0f aa
kvm_inj_exception: #GP (0x0)
Note: return to L2 succeeded but upon first exit to L1 its RIP points to
'RSM' instruction but we're not in SMM.
The problem appears to be that VMCB01 gets irreversibly destroyed during
SMM execution. Previously, we used to have 'hsave' VMCB where regular
(pre-SMM) L1's state was saved upon nested_svm_vmexit() but now we just
switch to VMCB01 from VMCB02.
Pre-split (working) flow looked like:
- SMM is triggered during L2's execution
- L2's state is pushed to SMRAM
- nested_svm_vmexit() restores L1's state from 'hsave'
- SMM -> RSM
- enter_svm_guest_mode() switches to L2 but keeps 'hsave' intact so we have
pre-SMM (and pre L2 VMRUN) L1's state there
- L2's state is restored from SMRAM
- upon first exit L1's state is restored from L1.
This was always broken with regards to svm_get_nested_state()/
svm_set_nested_state(): 'hsave' was never a part of what's being
save and restored so migration happening during SMM triggered from L2 would
never restore L1's state correctly.
Post-split flow (broken) looks like:
- SMM is triggered during L2's execution
- L2's state is pushed to SMRAM
- nested_svm_vmexit() switches to VMCB01 from VMCB02
- SMM -> RSM
- enter_svm_guest_mode() switches from VMCB01 to VMCB02 but pre-SMM VMCB01
is already lost.
- L2's state is restored from SMRAM
- upon first exit L1's state is restored from VMCB01 but it is corrupted
(reflects the state during 'RSM' execution).
VMX doesn't have this problem because unlike VMCB, VMCS keeps both guest
and host state so when we switch back to VMCS02 L1's state is intact there.
To resolve the issue we need to save L1's state somewhere. We could've
created a third VMCB for SMM but that would require us to modify saved
state format. L1's architectural HSAVE area (pointed by MSR_VM_HSAVE_PA)
seems appropriate: L0 is free to save any (or none) of L1's state there.
Currently, KVM does 'none'.
Note, for nested state migration to succeed, both source and destination
hypervisors must have the fix. We, however, don't need to create a new
flag indicating the fact that HSAVE area is now populated as migration
during SMM triggered from L2 was always broken.
Fixes: 4995a3685f1b ("KVM: SVM: Use a separate vmcb for the nested L2 guest")
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Separate the code setting non-VMLOAD-VMSAVE state from
svm_set_nested_state() into its own function. This is going to be
re-used from svm_enter_smm()/svm_leave_smm().
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
APM states that "The address written to the VM_HSAVE_PA MSR, which holds
the address of the page used to save the host state on a VMRUN, must point
to a hypervisor-owned page. If this check fails, the WRMSR will fail with
a #GP(0) exception. Note that a value of 0 is not considered valid for the
VM_HSAVE_PA MSR and a VMRUN that is attempted while the HSAVE_PA is 0 will
fail with a #GP(0) exception."
svm_set_msr() already checks that the supplied address is valid, so only
check for '0' is missing. Add it to nested_svm_vmrun().
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
APM states that #GP is raised upon write to MSR_VM_HSAVE_PA when
the supplied address is not page-aligned or is outside of "maximum
supported physical address for this implementation".
page_address_valid() check seems suitable. Also, forcefully page-align
the address when it's written from VMM.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <[email protected]>
Cc: [email protected]
Reviewed-by: Maxim Levitsky <[email protected]>
[Add comment about behavior for host-provided values. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Use IS_ERR() instead of checking for a NULL pointer when querying for
sev_pin_memory() failures. sev_pin_memory() always returns an error code
cast to a pointer, or a valid pointer; it never returns NULL.
Reported-by: Dan Carpenter <[email protected]>
Cc: Steve Rutherford <[email protected]>
Cc: Brijesh Singh <[email protected]>
Cc: Ashish Kalra <[email protected]>
Fixes: d3d1af85e2c7 ("KVM: SVM: Add KVM_SEND_UPDATE_DATA command")
Fixes: 15fb7de1a7f5 ("KVM: SVM: Add KVM_SEV_RECEIVE_UPDATE_DATA command")
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Return -EFAULT if copy_to_user() fails; if accessing user memory faults,
copy_to_user() returns the number of bytes remaining, not an error code.
Reported-by: Dan Carpenter <[email protected]>
Cc: Steve Rutherford <[email protected]>
Cc: Brijesh Singh <[email protected]>
Cc: Ashish Kalra <[email protected]>
Fixes: d3d1af85e2c7 ("KVM: SVM: Add KVM_SEND_UPDATE_DATA command")
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
In theory there are no side effects of not intercepting #SMI,
because then #SMI becomes transparent to the OS and the KVM.
Plus an observation on recent Zen2 CPUs reveals that these
CPUs ignore #SMI interception and never deliver #SMI VMexits.
This is also useful to test nested KVM to see that L1
handles #SMIs correctly in case when L1 doesn't intercept #SMI.
Finally the default remains the same, the SMI are intercepted
by default thus this patch doesn't have any effect unless
non default module param value is used.
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Kernel never sends real INIT even to CPUs, other than on boot.
Thus INIT interception is an error which should be caught
by a check for an unknown VMexit reason.
On top of that, the current INIT VM exit handler skips
the current instruction which is wrong.
That was added in commit 5ff3a351f687 ("KVM: x86: Move trivial
instruction-based exit handlers to common code").
Fixes: 5ff3a351f687 ("KVM: x86: Move trivial instruction-based exit handlers to common code")
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Commit 5ff3a351f687 ("KVM: x86: Move trivial instruction-based
exit handlers to common code"), unfortunately made a mistake of
treating nop_on_interception and nop_interception in the same way.
Former does truly nothing while the latter skips the instruction.
SMI VM exit handler should do nothing.
(SMI itself is handled by the host when we do STGI)
Fixes: 5ff3a351f687 ("KVM: x86: Move trivial instruction-based exit handlers to common code")
Signed-off-by: Maxim Levitsky <[email protected]>
Message-Id: <[email protected]>
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
vmx_msr_index was used to record the list of MSRs which can be lazily
restored when kvm returns to userspace. It is now reimplemented as
kvm_uret_msrs_list, a common x86 list which is only used inside x86.c.
So just remove the obsolete declaration in vmx.h.
Signed-off-by: Yu Zhang <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
When the host is using debug registers but the guest is not using them
nor is the guest in guest-debug state, the kvm code does not reset
the host debug registers before kvm_x86->run(). Rather, it relies on
the hardware vmentry instruction to automatically reset the dr7 registers
which ensures that the host breakpoints do not affect the guest.
This however violates the non-instrumentable nature around VM entry
and exit; for example, when a host breakpoint is set on vcpu->arch.cr2,
Another issue is consistency. When the guest debug registers are active,
the host breakpoints are reset before kvm_x86->run(). But when the
guest debug registers are inactive, the host breakpoints are delayed to
be disabled. The host tracing tools may see different results depending
on what the guest is doing.
To fix the problems, we clear %db7 unconditionally before kvm_x86->run()
if the host has set any breakpoints, no matter if the guest is using
them or not.
Signed-off-by: Lai Jiangshan <[email protected]>
Message-Id: <[email protected]>
Cc: [email protected]
[Only clear %db7 instead of reloading all debug registers. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Commit a75a895e6457 ("KVM: selftests: Unconditionally use memslot 0 for
vaddr allocations") removed the memslot parameters from vm_vaddr_alloc.
It addressed all callers except one under lib/aarch64/, due to a race
with commit e3db7579ef35 ("KVM: selftests: Add exception handling
support for aarch64")
Fix the vm_vaddr_alloc call in lib/aarch64/processor.c.
Reported-by: Zenghui Yu <[email protected]>
Signed-off-by: Ricardo Koller <[email protected]>
Message-Id: <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
In commit bc9e9e672df9 ("KVM: debugfs: Reuse binary stats descriptors")
loop for filling debugfs_stat_data was copy-pasted 2 times, but
in the second loop pointers are saved over pointers allocated
in the first loop. All this causes is a memory leak, fix it.
Fixes: bc9e9e672df9 ("KVM: debugfs: Reuse binary stats descriptors")
Signed-off-by: Pavel Skripkin <[email protected]>
Reviewed-by: Jing Zhang <[email protected]>
Message-Id: <[email protected]>
Reviewed-by: Jing Zhang <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Some of the lines aren't properly indented, causing yamllint to warn
about them:
.../nxp,sja1105.yaml:70:17: [warning] wrong indentation: expected 18 but found 16 (indentation)
Use the proper indentation to fix those warnings.
Signed-off-by: Thierry Reding <[email protected]>
Fixes: 070f5b701d559ae1 ("dt-bindings: net: dsa: sja1105: add SJA1110 bindings")
Tested-by: Geert Uytterhoeven <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Rob Herring <[email protected]>
|
|
"LinusTorvalds" is not pretty. Replace it with "Linus Torvalds".
Signed-off-by: Hu Haowen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>
|
|
Risc-V gained support recently.
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>
|
|
A couple of exotic quote characters came in with this license text; they
can confuse software that is not expecting non-ASCII text. Switch to
normal quotes here, with no changes to the actual license text.
Reported-by: Rahul T R <[email protected]>
Signed-off-by: Nishanth Menon <[email protected]>
CC: Greg Kroah-Hartman <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Acked-by: Thorsten Leemhuis <[email protected]>
Acked-by: Randy Dunlap <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jonathan Corbet <[email protected]>
|
|
The conversion to C introduced several bugs in TM handling that can
cause host crashes with TM bad thing interrupts. Mostly just simple
typos or missed logic in the conversion that got through due to my
not testing TM in the guest sufficiently.
- Early TM emulation for the softpatch interrupt should be done if fake
suspend mode is _not_ active.
- Early TM emulation wants to return immediately to the guest so as to
not doom transactions unnecessarily.
- And if exiting from the guest, the host MSR should include the TM[S]
bit if the guest was T/S, before it is treclaimed.
After this fix, all the TM selftests pass when running on a P9 processor
that implements TM with softpatch interrupt.
Fixes: 89d35b2391015 ("KVM: PPC: Book3S HV P9: Implement the rest of the P9 path in C")
Reported-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|