Age | Commit message (Collapse) | Author | Files | Lines |
|
To abstract the rmid counters into a helper that returns the number
of bytes counted, architecture specific per-rmid state is needed.
It needs to be possible to reset this hidden state, as the values
may outlive the life of an rmid, or the mount time of the filesystem.
mon_event_read() is called with first = true when an rmid is first
allocated in mkdir_mondata_subdir(). Add resctrl_arch_reset_rmid()
and call it from __mon_event_count()'s rr->first check.
Signed-off-by: James Morse <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Jamie Iles <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Xin Hao <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
Tested-by: Cristian Marussi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
A renamed __rmid_read() is intended as the function that an
architecture agnostic resctrl filesystem driver can use to
read a value in bytes from a counter. Currently the function returns
the MBM values in chunks directly from hardware. For bandwidth
counters the resctrl filesystem uses this to calculate the number of
bytes ever seen.
MPAM's scaling of counters can be changed at runtime, reducing the
resolution but increasing the range. When this is changed the prev_msr
values need to be converted by the architecture code.
Add an array for per-rmid private storage. The prev_msr and chunks
values will move here to allow resctrl_arch_rmid_read() to always
return the number of bytes read by this counter without assistance
from the filesystem. The values are moved in later patches when
the overflow and correction calls are moved into __rmid_read().
Signed-off-by: James Morse <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Jamie Iles <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Xin Hao <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
Tested-by: Cristian Marussi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
mbm_bw_count() is only called by the mbm_handle_overflow() worker once a
second. It reads the hardware register, calculates the bandwidth and
updates m->prev_bw_msr which is used to hold the previous hardware register
value.
Operating directly on hardware register values makes it difficult to make
this code architecture independent, so that it can be moved to /fs/,
making the mba_sc feature something resctrl supports with no additional
support from the architecture.
Prior to calling mbm_bw_count(), mbm_update() reads from the same hardware
register using __mon_event_count().
Change mbm_bw_count() to use the current chunks value most recently saved
by __mon_event_count(). This removes an extra call to __rmid_read().
Instead of using m->prev_msr to calculate the number of chunks seen,
use the rr->val that was updated by __mon_event_count(). This removes an
extra call to mbm_overflow_count() and get_corrected_mbm_count().
Calculating bandwidth like this means mbm_bw_count() no longer operates
on hardware register values directly.
Signed-off-by: James Morse <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Jamie Iles <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Xin Hao <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
Tested-by: Cristian Marussi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
update_mba_bw() calculates a new control value for the MBA resource
based on the user provided mbps_val and the current measured
bandwidth. Some control values need remapping by delay_bw_map().
It does this by calling wrmsrl() directly. This needs splitting
up to be done by an architecture specific helper, so that the
remainder can eventually be moved to /fs/.
Add resctrl_arch_update_one() to apply one configuration value
to the provided resource and domain. This avoids the staging
and cross-calling that is only needed with changes made by
user-space. delay_bw_map() moves to be part of the arch code,
to maintain the 'percentage control' view of MBA resources
in resctrl.
Signed-off-by: James Morse <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Jamie Iles <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Xin Hao <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
Tested-by: Cristian Marussi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The resctrl arch code provides a second configuration array mbps_val[]
for the MBA software controller.
Since resctrl switched over to allocating and freeing its own array
when needed, nothing uses the arch code version.
Remove it.
Signed-off-by: James Morse <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Jamie Iles <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Xin Hao <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
Tested-by: Cristian Marussi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Updates to resctrl's software controller follow the same path as
other configuration updates, but they don't modify the hardware state.
rdtgroup_schemata_write() uses parse_line() and the resource's
parse_ctrlval() function to stage the configuration.
resctrl_arch_update_domains() then updates the mbps_val[] array
instead, and resctrl_arch_update_domains() skips the rdt_ctrl_update()
call that would update hardware.
This complicates the interface between resctrl's filesystem parts
and architecture specific code. It should be possible for mba_sc
to be completely implemented by the filesystem parts of resctrl. This
would allow it to work on a second architecture with no additional code.
resctrl_arch_update_domains() using the mbps_val[] array prevents this.
Change parse_bw() to write the configuration value directly to the
mbps_val[] array in the domain structure. Change rdtgroup_schemata_write()
to skip the call to resctrl_arch_update_domains(), meaning all the
mba_sc specific code in resctrl_arch_update_domains() can be removed.
On the read-side, show_doms() and update_mba_bw() are changed to read
the mbps_val[] array from the domain structure. With this,
resctrl_arch_get_config() no longer needs to consider mba_sc resources.
Signed-off-by: James Morse <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Jamie Iles <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Xin Hao <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
Tested-by: Cristian Marussi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
To support resctrl's MBA software controller, the architecture must provide
a second configuration array to hold the mbps_val[] from user-space.
This complicates the interface between the architecture specific code and
the filesystem portions of resctrl that will move to /fs/, to allow
multiple architectures to support resctrl.
Make the filesystem parts of resctrl create an array for the mba_sc
values. The software controller can be changed to use this, allowing
the architecture code to only consider the values configured in hardware.
Signed-off-by: James Morse <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Jamie Iles <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Xin Hao <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
Tested-by: Cristian Marussi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
To determine whether the mba_MBps option to resctrl should be supported,
resctrl tests the boot CPUs' x86_vendor.
This isn't portable, and needs abstracting behind a helper so this check
can be part of the filesystem code that moves to /fs/.
Re-use the tests set_mba_sc() does to determine if the mba_sc is supported
on this system. An 'alloc_capable' test is added so that support for the
controls isn't implied by the 'delay_linear' property, which is always
true for MPAM. Because mbm_update() only update mba_sc if the mbm_local
counters are enabled, supports_mba_mbps() checks is_mbm_local_enabled().
(instead of using is_mbm_enabled(), which checks both).
Signed-off-by: James Morse <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Jamie Iles <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Xin Hao <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
Tested-by: Cristian Marussi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
set_mba_sc() enables the 'software controller' to regulate the bandwidth
based on the byte counters. This can be managed entirely in the parts
of resctrl that move to /fs/, without any extra support from the
architecture specific code. set_mba_sc() is called by rdt_enable_ctx()
during mount and unmount. It currently resets the arch code's ctrl_val[]
and mbps_val[] arrays.
The ctrl_val[] was already reset when the domain was created, and by
reset_all_ctrls() when the filesystem was last unmounted. Doing the work
in set_mba_sc() is not necessary as the values are already at their
defaults due to the creation of the domain, or were previously reset
during umount(), or are about to reset during umount().
Add a reset of the mbps_val[] in reset_all_ctrls(), allowing the code in
set_mba_sc() that reaches in to the architecture specific structures to
be removed.
Signed-off-by: James Morse <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Jamie Iles <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Xin Hao <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
Tested-by: Cristian Marussi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Because domains are exposed to user-space via resctrl, the filesystem
must update its state when CPU hotplug callbacks are triggered.
Some of this work is common to any architecture that would support
resctrl, but the work is tied up with the architecture code to
free the memory.
Move the monitor subdir removal and the cancelling of the mbm/limbo
works into a new resctrl_offline_domain() call. These bits are not
specific to the architecture. Grouping them in one function allows
that code to be moved to /fs/ and re-used by another architecture.
Signed-off-by: James Morse <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Jamie Iles <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Xin Hao <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
Tested-by: Cristian Marussi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
domain_add_cpu() and domain_remove_cpu() need to kfree() the child
arrays that were allocated by domain_setup_ctrlval().
As this memory is moved around, and new arrays are created, adjusting
the error handling cleanup code becomes noisier.
To simplify this, move all the kfree() calls into a domain_free() helper.
This depends on struct rdt_hw_domain being kzalloc()d, allowing it to
unconditionally kfree() all the child arrays.
Signed-off-by: James Morse <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Jamie Iles <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Xin Hao <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
Tested-by: Cristian Marussi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Because domains are exposed to user-space via resctrl, the filesystem
must update its state when CPU hotplug callbacks are triggered.
Some of this work is common to any architecture that would support
resctrl, but the work is tied up with the architecture code to
allocate the memory.
Move domain_setup_mon_state(), the monitor subdir creation call and the
mbm/limbo workers into a new resctrl_online_domain() call. These bits
are not specific to the architecture. Grouping them in one function
allows that code to be moved to /fs/ and re-used by another architecture.
Signed-off-by: James Morse <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Jamie Iles <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Xin Hao <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
Tested-by: Cristian Marussi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
mon_enabled and mon_capable are always set as a pair by
rdt_get_mon_l3_config().
There is no point having two values.
Merge them together.
Signed-off-by: James Morse <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Jamie Iles <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Xin Hao <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
Tested-by: Cristian Marussi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
rdt_resources_all[] used to have extra entries for L2CODE/L2DATA.
These were hidden from resctrl by the alloc_enabled value.
Now that the L2/L2CODE/L2DATA resources have been merged together,
alloc_enabled doesn't mean anything, it always has the same value as
alloc_capable which indicates allocation is supported by this resource.
Remove alloc_enabled and its helpers.
Signed-off-by: James Morse <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Jamie Iles <[email protected]>
Reviewed-by: Shaopeng Tan <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Xin Hao <[email protected]>
Tested-by: Shaopeng Tan <[email protected]>
Tested-by: Cristian Marussi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
VM_FAULT_NOPAGE is expected behaviour for -EBUSY failure path, when
augmenting a page, as this means that the reclaimer thread has been
triggered, and the intention is just to round-trip in ring-3, and
retry with a new page fault.
Fixes: 5a90d2c3f5ef ("x86/sgx: Support adding of pages to an initialized enclave")
Signed-off-by: Haitao Huang <[email protected]>
Signed-off-by: Jarkko Sakkinen <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Vijay Dhanraj <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
|
|
Unsanitized pages trigger WARN_ON() unconditionally, which can panic the
whole computer, if /proc/sys/kernel/panic_on_warn is set.
In sgx_init(), if misc_register() fails or misc_register() succeeds but
neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be
prematurely stopped. This may leave unsanitized pages, which will result a
false warning.
Refine __sgx_sanitize_pages() to return:
1. Zero when the sanitization process is complete or ksgxd has been
requested to stop.
2. The number of unsanitized pages otherwise.
Fixes: 51ab30eb2ad4 ("x86/sgx: Replace section->init_laundry_list with sgx_dirty_page_list")
Reported-by: Paul Menzel <[email protected]>
Signed-off-by: Jarkko Sakkinen <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/linux-sgx/[email protected]/T/#u
Link: https://lkml.kernel.org/r/[email protected]
|
|
Print both old and new versions of microcode after a reload is complete
because knowing the previous microcode version is sometimes important
from a debugging perspective.
[ bp: Massage commit message. ]
Signed-off-by: Ashok Raj <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Acked-by: Tony Luck <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Reviewed-by: Jarkko Sakkinen <[email protected]>
Reviewed-by: Christian Brauner (Microsoft) <[email protected]>
Signed-off-by: Al Viro <[email protected]>
|
|
prefetch register
The current pseudo_lock.c code overwrites the value of the
MSR_MISC_FEATURE_CONTROL to 0 even if the original value is not 0.
Therefore, modify it to save and restore the original values.
Fixes: 018961ae5579 ("x86/intel_rdt: Pseudo-lock region creation/removal core")
Fixes: 443810fe6160 ("x86/intel_rdt: Create debugfs files for pseudo-locking testing")
Fixes: 8a2fc0e1bc0c ("x86/intel_rdt: More precise L2 hit/miss measurements")
Signed-off-by: Kohei Tarumizu <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Reinette Chatre <[email protected]>
Link: https://lkml.kernel.org/r/eb660f3c2010b79a792c573c02d01e8e841206ad.1661358182.git.reinette.chatre@intel.com
|
|
When memory poison consumption machine checks fire, MCE notifier
handlers like nfit_handle_mce() record the impacted physical address
range which is reported by the hardware in the MCi_MISC MSR. The error
information includes data about blast radius, i.e. how many cachelines
did the hardware determine are impacted. A recent change
7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison granularity")
updated nfit_handle_mce() to stop hard coding the blast radius value of
1 cacheline, and instead rely on the blast radius reported in 'struct
mce' which can be up to 4K (64 cachelines).
It turns out that apei_mce_report_mem_error() had a similar problem in
that it hard coded a blast radius of 4K rather than reading the blast
radius from the error information. Fix apei_mce_report_mem_error() to
convey the proper poison granularity.
Signed-off-by: Jane Chu <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Reviewed-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/r/[email protected]
|
|
CPUID leaf 0x80000022 i.e. ExtPerfMonAndDbg advertises some new performance
monitoring features for AMD processors.
Bit 1 of EAX indicates support for Last Branch Record Extension Version 2
(LbrExtV2) features. If found to be set during PMU initialization, the EBX
bits of the same leaf can be used to determine the number of available LBR
entries.
For better utilization of feature words, LbrExtV2 is added as a scattered
feature bit.
[peterz: Rename to AMD_LBR_V2]
Signed-off-by: Sandipan Das <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/r/172d2b0df39306ed77221c45ee1aa62e8ae0548d.1660211399.git.sandipan.das@amd.com
|
|
181b6f40e9ea ("x86/microcode: Rip out the OLD_INTERFACE")
removed the old microcode loading interface but forgot to remove the
related ->request_microcode_user() functionality which it uses.
Rip it out now too.
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Older Intel CPUs that are not in the affected processor list for MMIO
Stale Data vulnerabilities currently report "Not affected" in sysfs,
which may not be correct. Vulnerability status for these older CPUs is
unknown.
Add known-not-affected CPUs to the whitelist. Report "unknown"
mitigation status for CPUs that are not in blacklist, whitelist and also
don't enumerate MSR ARCH_CAPABILITIES bits that reflect hardware
immunity to MMIO Stale Data vulnerabilities.
Mitigation is not deployed when the status is unknown.
[ bp: Massage, fixup. ]
Fixes: 8d50cdf8b834 ("x86/speculation/mmio: Add sysfs reporting for Processor MMIO Stale Data")
Suggested-by: Andrew Cooper <[email protected]>
Suggested-by: Tony Luck <[email protected]>
Signed-off-by: Pawan Gupta <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/a932c154772f2121794a5f2eded1a11013114711.1657846269.git.pawan.kumar.gupta@linux.intel.com
|
|
Modify the comments for sgx_encl_lookup_backing() and for
sgx_encl_alloc_backing() to indicate that they take a reference
which must be dropped with a call to sgx_encl_put_backing().
Make sgx_encl_lookup_backing() static for now, and change the
name of sgx_encl_get_backing() to __sgx_encl_get_backing() to
make it more clear that sgx_encl_get_backing() is an internal
function.
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/all/[email protected]/
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Ingo Molnar:
"Fix the 'IBPB mitigated RETBleed' mode of operation on AMD CPUs (not
turned on by default), which also need STIBP enabled (if available) to
be '100% safe' on even the shortest speculation windows"
* tag 'x86-urgent-2022-08-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/bugs: Enable STIBP for IBPB mitigated RETBleed
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 eIBRS fixes from Borislav Petkov:
"More from the CPU vulnerability nightmares front:
Intel eIBRS machines do not sufficiently mitigate against RET
mispredictions when doing a VM Exit therefore an additional RSB,
one-entry stuffing is needed"
* tag 'x86_bugs_pbrsb' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/speculation: Add LFENCE to RSB fill sequence
x86/speculation: Add RSB VM Exit protections
|
|
AMD's "Technical Guidance for Mitigating Branch Type Confusion,
Rev. 1.0 2022-07-12" whitepaper, under section 6.1.2 "IBPB On
Privileged Mode Entry / SMT Safety" says:
Similar to the Jmp2Ret mitigation, if the code on the sibling thread
cannot be trusted, software should set STIBP to 1 or disable SMT to
ensure SMT safety when using this mitigation.
So, like already being done for retbleed=unret, and now also for
retbleed=ibpb, force STIBP on machines that have it, and report its SMT
vulnerability status accordingly.
[ bp: Remove the "we" and remove "[AMD]" applicability parameter which
doesn't work here. ]
Fixes: 3ebc17006888 ("x86/bugs: Add retbleed=ibpb")
Signed-off-by: Kim Phillips <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Cc: [email protected] # 5.10, 5.15, 5.19
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Link: https://lore.kernel.org/r/[email protected]
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc updates from Andrew Morton:
"Updates to various subsystems which I help look after. lib, ocfs2,
fatfs, autofs, squashfs, procfs, etc. A relatively small amount of
material this time"
* tag 'mm-nonmm-stable-2022-08-06-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (72 commits)
scripts/gdb: ensure the absolute path is generated on initial source
MAINTAINERS: kunit: add David Gow as a maintainer of KUnit
mailmap: add linux.dev alias for Brendan Higgins
mailmap: update Kirill's email
profile: setup_profiling_timer() is moslty not implemented
ocfs2: fix a typo in a comment
ocfs2: use the bitmap API to simplify code
ocfs2: remove some useless functions
lib/mpi: fix typo 'the the' in comment
proc: add some (hopefully) insightful comments
bdi: remove enum wb_congested_state
kernel/hung_task: fix address space of proc_dohung_task_timeout_secs
lib/lzo/lzo1x_compress.c: replace ternary operator with min() and min_t()
squashfs: support reading fragments in readahead call
squashfs: implement readahead
squashfs: always build "file direct" version of page actor
Revert "squashfs: provide backing_dev_info in order to disable read-ahead"
fs/ocfs2: Fix spelling typo in comment
ia64: old_rr4 added under CONFIG_HUGETLB_PAGE
proc: fix test for "vsyscall=xonly" boot option
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
- build fix for old(er) binutils
- build fix for new GCC
- kexec boot environment fix
* tag 'x86-urgent-2022-08-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry: Build thunk_$(BITS) only if CONFIG_PREEMPTION=y
x86/numa: Use cpumask_available instead of hardcoded NULL check
x86/bus_lock: Don't assume the init value of DEBUGCTLMSR.BUS_LOCK_DETECT to be zero
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SGX updates from Dave Hansen:
"A set of x86/sgx changes focused on implementing the "SGX2" features,
plus a minor cleanup:
- SGX2 ISA support which makes enclave memory management much more
dynamic. For instance, enclaves can now change enclave page
permissions on the fly.
- Removal of an unused structure member"
* tag 'x86_sgx_for_v6.0-2022-08-03.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
x86/sgx: Drop 'page_index' from sgx_backing
selftests/sgx: Page removal stress test
selftests/sgx: Test reclaiming of untouched page
selftests/sgx: Test invalid access to removed enclave page
selftests/sgx: Test faulty enclave behavior
selftests/sgx: Test complete changing of page type flow
selftests/sgx: Introduce TCS initialization enclave operation
selftests/sgx: Introduce dynamic entry point
selftests/sgx: Test two different SGX2 EAUG flows
selftests/sgx: Add test for TCS page permission changes
selftests/sgx: Add test for EPCM permission changes
Documentation/x86: Introduce enclave runtime management section
x86/sgx: Free up EPC pages directly to support large page ranges
x86/sgx: Support complete page removal
x86/sgx: Support modifying SGX page type
x86/sgx: Tighten accessible memory range after enclave initialization
x86/sgx: Support adding of pages to an initialized enclave
x86/sgx: Support restricting of enclave page permissions
x86/sgx: Support VA page allocation without reclaiming
x86/sgx: Export sgx_encl_page_alloc()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull pci updates from Bjorn Helgaas:
"Enumeration:
- Consolidate duplicated 'next function' scanning and extend to allow
'isolated functions' on s390, similar to existing hypervisors
(Niklas Schnelle)
Resource management:
- Implement pci_iobar_pfn() for sparc, which allows us to remove the
sparc-specific pci_mmap_page_range() and pci_mmap_resource_range().
This removes the ability to map the entire PCI I/O space using
/proc/bus/pci, but we believe that's already been broken since
v2.6.28 (Arnd Bergmann)
- Move common PCI definitions to asm-generic/pci.h and rework others
to be be more specific and more encapsulated in arches that need
them (Stafford Horne)
Power management:
- Convert drivers to new *_PM_OPS macros to avoid need for '#ifdef
CONFIG_PM_SLEEP' or '__maybe_unused' (Bjorn Helgaas)
Virtualization:
- Add ACS quirk for Broadcom BCM5750x multifunction NICs that isolate
the functions but don't advertise an ACS capability (Pavan Chebbi)
Error handling:
- Clear PCI Status register during enumeration in case firmware left
errors logged (Kai-Heng Feng)
- When we have native control of AER, enable error reporting for all
devices that support AER. Previously only a few drivers enabled
this (Stefan Roese)
- Keep AER error reporting enabled for switches. Previously we
enabled this during enumeration but immediately disabled it (Stefan
Roese)
- Iterate over error counters instead of error strings to avoid
printing junk in AER sysfs counters (Mohamed Khalfella)
ASPM:
- Remove pcie_aspm_pm_state_change() so ASPM config changes, e.g.,
via sysfs, are not lost across power state changes (Kai-Heng Feng)
Endpoint framework:
- Don't stop an EPC when unbinding an EPF from it (Shunsuke Mie)
Endpoint embedded DMA controller driver:
- Simplify and clean up support for the DesignWare embedded DMA
(eDMA) controller (Frank Li, Serge Semin)
Broadcom STB PCIe controller driver:
- Avoid config space accesses when link is down because we can't
recover from the CPU aborts these cause (Jim Quinlan)
- Look for power regulators described under Root Ports in DT and
enable them before scanning the secondary bus (Jim Quinlan)
- Disable/enable regulators in suspend/resume (Jim Quinlan)
Freescale i.MX6 PCIe controller driver:
- Simplify and clean up clock and PHY management (Richard Zhu)
- Disable/enable regulators in suspend/resume (Richard Zhu)
- Set PCIE_DBI_RO_WR_EN before writing DBI registers (Richard Zhu)
- Allow speeds faster than Gen2 (Richard Zhu)
- Make link being down a non-fatal error so controller probe doesn't
fail if there are no Endpoints connected (Richard Zhu)
Loongson PCIe controller driver:
- Add ACPI and MCFG support for Loongson LS7A (Huacai Chen)
- Avoid config reads to non-existent LS2K/LS7A devices because a
hardware defect causes machine hangs (Huacai Chen)
- Work around LS7A integrated devices that report incorrect Interrupt
Pin values (Jianmin Lv)
Marvell Aardvark PCIe controller driver:
- Add support for AER and Slot capability on emulated bridge (Pali
Rohár)
MediaTek PCIe controller driver:
- Add Airoha EN7532 to DT binding (John Crispin)
- Allow building of driver for ARCH_AIROHA (Felix Fietkau)
MediaTek PCIe Gen3 controller driver:
- Print decoded LTSSM state when the link doesn't come up (Jianjun
Wang)
NVIDIA Tegra194 PCIe controller driver:
- Convert DT binding to json-schema (Vidya Sagar)
- Add DT bindings and driver support for Tegra234 Root Port and
Endpoint mode (Vidya Sagar)
- Fix some Root Port interrupt handling issues (Vidya Sagar)
- Set default Max Payload Size to 256 bytes (Vidya Sagar)
- Fix Data Link Feature capability programming (Vidya Sagar)
- Extend Endpoint mode support to devices beyond Controller-5 (Vidya
Sagar)
Qualcomm PCIe controller driver:
- Rework clock, reset, PHY power-on ordering to avoid hangs and
improve consistency (Robert Marko, Christian Marangi)
- Move pipe_clk handling to PHY drivers (Dmitry Baryshkov)
- Add IPQ60xx support (Selvam Sathappan Periakaruppan)
- Allow ASPM L1 and substates for 2.7.0 (Krishna chaitanya chundru)
- Add support for more than 32 MSI interrupts (Dmitry Baryshkov)
Renesas R-Car PCIe controller driver:
- Convert DT binding to json-schema (Herve Codina)
- Add Renesas RZ/N1D (R9A06G032) to rcar-gen2 DT binding and driver
(Herve Codina)
Samsung Exynos PCIe controller driver:
- Fix phy-exynos-pcie driver so it follows the 'phy_init() before
phy_power_on()' PHY programming model (Marek Szyprowski)
Synopsys DesignWare PCIe controller driver:
- Simplify and clean up the DWC core extensively (Serge Semin)
- Fix an issue with programming the ATU for regions that cross a 4GB
boundary (Serge Semin)
- Enable the CDM check if 'snps,enable-cdm-check' exists; previously
we skipped it if 'num-lanes' was absent (Serge Semin)
- Allocate a 32-bit DMA-able page to be MSI target instead of using a
driver data structure that may not be addressable with 32-bit
address (Will McVicker)
- Add DWC core support for more than 32 MSI interrupts (Dmitry
Baryshkov)
Xilinx Versal CPM PCIe controller driver:
- Add DT binding and driver support for Versal CPM5 Gen5 Root Port
(Bharat Kumar Gogada)"
* tag 'pci-v5.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (150 commits)
PCI: imx6: Support more than Gen2 speed link mode
PCI: imx6: Set PCIE_DBI_RO_WR_EN before writing DBI registers
PCI: imx6: Reformat suspend callback to keep symmetric with resume
PCI: imx6: Move the imx6_pcie_ltssm_disable() earlier
PCI: imx6: Disable clocks in reverse order of enable
PCI: imx6: Do not hide PHY driver callbacks and refine the error handling
PCI: imx6: Reduce resume time by only starting link if it was up before suspend
PCI: imx6: Mark the link down as non-fatal error
PCI: imx6: Move regulator enable out of imx6_pcie_deassert_core_reset()
PCI: imx6: Turn off regulator when system is in suspend mode
PCI: imx6: Call host init function directly in resume
PCI: imx6: Disable i.MX6QDL clock when disabling ref clocks
PCI: imx6: Propagate .host_init() errors to caller
PCI: imx6: Collect clock enables in imx6_pcie_clk_enable()
PCI: imx6: Factor out ref clock disable to match enable
PCI: imx6: Move imx6_pcie_clk_disable() earlier
PCI: imx6: Move imx6_pcie_enable_ref_clk() earlier
PCI: imx6: Move PHY management functions together
PCI: imx6: Move imx6_pcie_grp_offset(), imx6_pcie_configure_type() earlier
PCI: imx6: Convert to NOIRQ_SYSTEM_SLEEP_PM_OPS()
...
|
|
Pull kvm updates from Paolo Bonzini:
"Quite a large pull request due to a selftest API overhaul and some
patches that had come in too late for 5.19.
ARM:
- Unwinder implementations for both nVHE modes (classic and
protected), complete with an overflow stack
- Rework of the sysreg access from userspace, with a complete rewrite
of the vgic-v3 view to allign with the rest of the infrastructure
- Disagregation of the vcpu flags in separate sets to better track
their use model.
- A fix for the GICv2-on-v3 selftest
- A small set of cosmetic fixes
RISC-V:
- Track ISA extensions used by Guest using bitmap
- Added system instruction emulation framework
- Added CSR emulation framework
- Added gfp_custom flag in struct kvm_mmu_memory_cache
- Added G-stage ioremap() and iounmap() functions
- Added support for Svpbmt inside Guest
s390:
- add an interface to provide a hypervisor dump for secure guests
- improve selftests to use TAP interface
- enable interpretive execution of zPCI instructions (for PCI
passthrough)
- First part of deferred teardown
- CPU Topology
- PV attestation
- Minor fixes
x86:
- Permit guests to ignore single-bit ECC errors
- Intel IPI virtualization
- Allow getting/setting pending triple fault with
KVM_GET/SET_VCPU_EVENTS
- PEBS virtualization
- Simplify PMU emulation by just using PERF_TYPE_RAW events
- More accurate event reinjection on SVM (avoid retrying
instructions)
- Allow getting/setting the state of the speaker port data bit
- Refuse starting the kvm-intel module if VM-Entry/VM-Exit controls
are inconsistent
- "Notify" VM exit (detect microarchitectural hangs) for Intel
- Use try_cmpxchg64 instead of cmpxchg64
- Ignore benign host accesses to PMU MSRs when PMU is disabled
- Allow disabling KVM's "MONITOR/MWAIT are NOPs!" behavior
- Allow NX huge page mitigation to be disabled on a per-vm basis
- Port eager page splitting to shadow MMU as well
- Enable CMCI capability by default and handle injected UCNA errors
- Expose pid of vcpu threads in debugfs
- x2AVIC support for AMD
- cleanup PIO emulation
- Fixes for LLDT/LTR emulation
- Don't require refcounted "struct page" to create huge SPTEs
- Miscellaneous cleanups:
- MCE MSR emulation
- Use separate namespaces for guest PTEs and shadow PTEs bitmasks
- PIO emulation
- Reorganize rmap API, mostly around rmap destruction
- Do not workaround very old KVM bugs for L0 that runs with nesting enabled
- new selftests API for CPUID
Generic:
- Fix races in gfn->pfn cache refresh; do not pin pages tracked by
the cache
- new selftests API using struct kvm_vcpu instead of a (vm, id)
tuple"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (606 commits)
selftests: kvm: set rax before vmcall
selftests: KVM: Add exponent check for boolean stats
selftests: KVM: Provide descriptive assertions in kvm_binary_stats_test
selftests: KVM: Check stat name before other fields
KVM: x86/mmu: remove unused variable
RISC-V: KVM: Add support for Svpbmt inside Guest/VM
RISC-V: KVM: Use PAGE_KERNEL_IO in kvm_riscv_gstage_ioremap()
RISC-V: KVM: Add G-stage ioremap() and iounmap() functions
KVM: Add gfp_custom flag in struct kvm_mmu_memory_cache
RISC-V: KVM: Add extensible CSR emulation framework
RISC-V: KVM: Add extensible system instruction emulation framework
RISC-V: KVM: Factor-out instruction emulation into separate sources
RISC-V: KVM: move preempt_disable() call in kvm_arch_vcpu_ioctl_run
RISC-V: KVM: Make kvm_riscv_guest_timer_init a void function
RISC-V: KVM: Fix variable spelling mistake
RISC-V: KVM: Improve ISA extension by using a bitmap
KVM, x86/mmu: Fix the comment around kvm_tdp_mmu_zap_leafs()
KVM: SVM: Dump Virtual Machine Save Area (VMSA) to klog
KVM: x86/mmu: Treat NX as a valid SPTE bit for NPT
KVM: x86: Do not block APIC write for non ICR registers
...
|
|
ACRN Hypervisor reports timing information via CPUID leaf 0x40000010.
Get the TSC and CPU frequency via CPUID leaf 0x40000010 and set the
kernel values accordingly.
Signed-off-by: Fei Li <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Conghui <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
tl;dr: The Enhanced IBRS mitigation for Spectre v2 does not work as
documented for RET instructions after VM exits. Mitigate it with a new
one-entry RSB stuffing mechanism and a new LFENCE.
== Background ==
Indirect Branch Restricted Speculation (IBRS) was designed to help
mitigate Branch Target Injection and Speculative Store Bypass, i.e.
Spectre, attacks. IBRS prevents software run in less privileged modes
from affecting branch prediction in more privileged modes. IBRS requires
the MSR to be written on every privilege level change.
To overcome some of the performance issues of IBRS, Enhanced IBRS was
introduced. eIBRS is an "always on" IBRS, in other words, just turn
it on once instead of writing the MSR on every privilege level change.
When eIBRS is enabled, more privileged modes should be protected from
less privileged modes, including protecting VMMs from guests.
== Problem ==
Here's a simplification of how guests are run on Linux' KVM:
void run_kvm_guest(void)
{
// Prepare to run guest
VMRESUME();
// Clean up after guest runs
}
The execution flow for that would look something like this to the
processor:
1. Host-side: call run_kvm_guest()
2. Host-side: VMRESUME
3. Guest runs, does "CALL guest_function"
4. VM exit, host runs again
5. Host might make some "cleanup" function calls
6. Host-side: RET from run_kvm_guest()
Now, when back on the host, there are a couple of possible scenarios of
post-guest activity the host needs to do before executing host code:
* on pre-eIBRS hardware (legacy IBRS, or nothing at all), the RSB is not
touched and Linux has to do a 32-entry stuffing.
* on eIBRS hardware, VM exit with IBRS enabled, or restoring the host
IBRS=1 shortly after VM exit, has a documented side effect of flushing
the RSB except in this PBRSB situation where the software needs to stuff
the last RSB entry "by hand".
IOW, with eIBRS supported, host RET instructions should no longer be
influenced by guest behavior after the host retires a single CALL
instruction.
However, if the RET instructions are "unbalanced" with CALLs after a VM
exit as is the RET in #6, it might speculatively use the address for the
instruction after the CALL in #3 as an RSB prediction. This is a problem
since the (untrusted) guest controls this address.
Balanced CALL/RET instruction pairs such as in step #5 are not affected.
== Solution ==
The PBRSB issue affects a wide variety of Intel processors which
support eIBRS. But not all of them need mitigation. Today,
X86_FEATURE_RSB_VMEXIT triggers an RSB filling sequence that mitigates
PBRSB. Systems setting RSB_VMEXIT need no further mitigation - i.e.,
eIBRS systems which enable legacy IBRS explicitly.
However, such systems (X86_FEATURE_IBRS_ENHANCED) do not set RSB_VMEXIT
and most of them need a new mitigation.
Therefore, introduce a new feature flag X86_FEATURE_RSB_VMEXIT_LITE
which triggers a lighter-weight PBRSB mitigation versus RSB_VMEXIT.
The lighter-weight mitigation performs a CALL instruction which is
immediately followed by a speculative execution barrier (INT3). This
steers speculative execution to the barrier -- just like a retpoline
-- which ensures that speculation can never reach an unbalanced RET.
Then, ensure this CALL is retired before continuing execution with an
LFENCE.
In other words, the window of exposure is opened at VM exit where RET
behavior is troublesome. While the window is open, force RSB predictions
sampling for RET targets to a dead end at the INT3. Close the window
with the LFENCE.
There is a subset of eIBRS systems which are not vulnerable to PBRSB.
Add these systems to the cpu_vuln_whitelist[] as NO_EIBRS_PBRSB.
Future systems that aren't vulnerable will set ARCH_CAP_PBRSB_NO.
[ bp: Massage, incorporate review comments from Andy Cooper. ]
Signed-off-by: Daniel Sneddon <[email protected]>
Co-developed-by: Pawan Gupta <[email protected]>
Signed-off-by: Pawan Gupta <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull random number generator updates from Jason Donenfeld:
"Though there's been a decent amount of RNG-related development during
this last cycle, not all of it is coming through this tree, as this
cycle saw a shift toward tackling early boot time seeding issues,
which took place in other trees as well.
Here's a summary of the various patches:
- The CONFIG_ARCH_RANDOM .config option and the "nordrand" boot
option have been removed, as they overlapped with the more widely
supported and more sensible options, CONFIG_RANDOM_TRUST_CPU and
"random.trust_cpu". This change allowed simplifying a bit of arch
code.
- x86's RDRAND boot time test has been made a bit more robust, with
RDRAND disabled if it's clearly producing bogus results. This would
be a tip.git commit, technically, but I took it through random.git
to avoid a large merge conflict.
- The RNG has long since mixed in a timestamp very early in boot, on
the premise that a computer that does the same things, but does so
starting at different points in wall time, could be made to still
produce a different RNG state. Unfortunately, the clock isn't set
early in boot on all systems, so now we mix in that timestamp when
the time is actually set.
- User Mode Linux now uses the host OS's getrandom() syscall to
generate a bootloader RNG seed and later on treats getrandom() as
the platform's RDRAND-like faculty.
- The arch_get_random_{seed_,}_long() family of functions is now
arch_get_random_{seed_,}_longs(), which enables certain platforms,
such as s390, to exploit considerable performance advantages from
requesting multiple CPU random numbers at once, while at the same
time compiling down to the same code as before on platforms like
x86.
- A small cleanup changing a cmpxchg() into a try_cmpxchg(), from
Uros.
- A comment spelling fix"
More info about other random number changes that come in through various
architecture trees in the full commentary in the pull request:
https://lore.kernel.org/all/[email protected]/
* tag 'random-6.0-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random:
random: correct spelling of "overwrites"
random: handle archrandom with multiple longs
um: seed rng using host OS rng
random: use try_cmpxchg in _credit_init_bits
timekeeping: contribute wall clock to rng on time change
x86/rdrand: Remove "nordrand" flag in favor of "random.trust_cpu"
random: remove CONFIG_ARCH_RANDOM
|
|
be zero
It's possible that this kernel has been kexec'd from a kernel that
enabled bus lock detection, or (hypothetically) BIOS/firmware has set
DEBUGCTLMSR_BUS_LOCK_DETECT.
Disable bus lock detection explicitly if not wanted.
Fixes: ebb1064e7c2e ("x86/traps: Handle #DB for bus lock")
Signed-off-by: Chenyi Qiang <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpu updates from Borislav Petkov:
- Remove the vendor check when selecting MWAIT as the default idle
state
- Respect idle=nomwait when supplied on the kernel cmdline
- Two small cleanups
* tag 'x86_cpu_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Use MSR_IA32_MISC_ENABLE constants
x86: Fix comment for X86_FEATURE_ZEN
x86: Remove vendor checks from prefer_mwait_c1_over_halt
x86: Handle idle=nomwait cmdline properly for x86_idle
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 vmware cleanup from Borislav Petkov:
- A single statement simplification by using the BIT() macro
* tag 'x86_vmware_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vmware: Use BIT() macro for shifting
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS update from Borislav Petkov:
"A single RAS change:
- Probe whether hardware error injection (direct MSR writes) is
possible when injecting errors on AMD platforms. In some cases, the
platform could prohibit those"
* tag 'ras_core_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Check whether writes to MCA_STATUS are getting ignored
|
|
KVM/s390, KVM/x86 and common infrastructure changes for 5.20
x86:
* Permit guests to ignore single-bit ECC errors
* Fix races in gfn->pfn cache refresh; do not pin pages tracked by the cache
* Intel IPI virtualization
* Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS
* PEBS virtualization
* Simplify PMU emulation by just using PERF_TYPE_RAW events
* More accurate event reinjection on SVM (avoid retrying instructions)
* Allow getting/setting the state of the speaker port data bit
* Refuse starting the kvm-intel module if VM-Entry/VM-Exit controls are inconsistent
* "Notify" VM exit (detect microarchitectural hangs) for Intel
* Cleanups for MCE MSR emulation
s390:
* add an interface to provide a hypervisor dump for secure guests
* improve selftests to use TAP interface
* enable interpretive execution of zPCI instructions (for PCI passthrough)
* First part of deferred teardown
* CPU Topology
* PV attestation
* Minor fixes
Generic:
* new selftests API using struct kvm_vcpu instead of a (vm, id) tuple
x86:
* Use try_cmpxchg64 instead of cmpxchg64
* Bugfixes
* Ignore benign host accesses to PMU MSRs when PMU is disabled
* Allow disabling KVM's "MONITOR/MWAIT are NOPs!" behavior
* x86/MMU: Allow NX huge pages to be disabled on a per-vm basis
* Port eager page splitting to shadow MMU as well
* Enable CMCI capability by default and handle injected UCNA errors
* Expose pid of vcpu threads in debugfs
* x2AVIC support for AMD
* cleanup PIO emulation
* Fixes for LLDT/LTR emulation
* Don't require refcounted "struct page" to create huge SPTEs
x86 cleanups:
* Use separate namespaces for guest PTEs and shadow PTEs bitmasks
* PIO emulation
* Reorganize rmap API, mostly around rmap destruction
* Do not workaround very old KVM bugs for L0 that runs with nesting enabled
* new selftests API for CPUID
|
|
Some cloud hypervisors do not provide IBPB on very recent CPU processors,
including AMD processors affected by Retbleed.
Using IBPB before firmware calls on such systems would cause a GPF at boot
like the one below. Do not enable such calls when IBPB support is not
present.
EFI Variables Facility v0.08 2004-May-17
general protection fault, maybe for address 0x1: 0000 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 24 Comm: kworker/u2:1 Not tainted 5.19.0-rc8+ #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
Workqueue: efi_rts_wq efi_call_rts
RIP: 0010:efi_call_rts
Code: e8 37 33 58 ff 41 bf 48 00 00 00 49 89 c0 44 89 f9 48 83 c8 01 4c 89 c2 48 c1 ea 20 66 90 b9 49 00 00 00 b8 01 00 00 00 31 d2 <0f> 30 e8 7b 9f 5d ff e8 f6 f8 ff ff 4c 89 f1 4c 89 ea 4c 89 e6 48
RSP: 0018:ffffb373800d7e38 EFLAGS: 00010246
RAX: 0000000000000001 RBX: 0000000000000006 RCX: 0000000000000049
RDX: 0000000000000000 RSI: ffff94fbc19d8fe0 RDI: ffff94fbc1b2b300
RBP: ffffb373800d7e70 R08: 0000000000000000 R09: 0000000000000000
R10: 000000000000000b R11: 000000000000000b R12: ffffb3738001fd78
R13: ffff94fbc2fcfc00 R14: ffffb3738001fd80 R15: 0000000000000048
FS: 0000000000000000(0000) GS:ffff94fc3da00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff94fc30201000 CR3: 000000006f610000 CR4: 00000000000406f0
Call Trace:
<TASK>
? __wake_up
process_one_work
worker_thread
? rescuer_thread
kthread
? kthread_complete_and_exit
ret_from_fork
</TASK>
Modules linked in:
Fixes: 28a99e95f55c ("x86/amd: Use IBPB for firmware calls")
Reported-by: Dimitri John Ledkov <[email protected]>
Signed-off-by: Thadeu Lima de Souza Cascardo <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Cc: <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
x86/kernel/cpu/cyrix.c now needs to include <linux/isa-dma.h> since the
'isa_dma_bridge_buggy' variable was moved to it.
Fixes this build error:
../arch/x86/kernel/cpu/cyrix.c: In function ‘init_cyrix’:
../arch/x86/kernel/cpu/cyrix.c:277:17: error: ‘isa_dma_bridge_buggy’ undeclared (first use in this function)
277 | isa_dma_bridge_buggy = 2;
Fixes: abb4970ac335 ("PCI: Move isa_dma_bridge_buggy out of asm/dma.h")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
Acked-by: Stafford Horne <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: [email protected]
|
|
IBRS mitigation for spectre_v2 forces write to MSR_IA32_SPEC_CTRL at
every kernel entry/exit. On Enhanced IBRS parts setting
MSR_IA32_SPEC_CTRL[IBRS] only once at boot is sufficient. MSR writes at
every kernel entry/exit incur unnecessary performance loss.
When Enhanced IBRS feature is present, print a warning about this
unnecessary performance loss.
Signed-off-by: Pawan Gupta <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Thadeu Lima de Souza Cascardo <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/2a5eaf54583c2bfe0edc4fea64006656256cca17.1657814857.git.pawan.kumar.gupta@linux.intel.com
|
|
Instead of the magic numbers 1<<11 and 1<<12 use the constants
from msr-index.h. This makes it obvious where those bits
of MSR_IA32_MISC_ENABLE are consumed (and in fact that Linux
consumes them at all) to simple minds that grep for
MSR_IA32_MISC_ENABLE_.*_UNAVAIL.
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
On AMD IBRS does not prevent Retbleed; as such use IBPB before a
firmware call to flush the branch history state.
And because in order to do an EFI call, the kernel maps a whole lot of
the kernel page table into the EFI page table, do an IBPB just in case
in order to prevent the scenario of poisoning the BTB and causing an EFI
call using the unprotected RET there.
[ bp: Massage. ]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The decision of whether or not to trust RDRAND is controlled by the
"random.trust_cpu" boot time parameter or the CONFIG_RANDOM_TRUST_CPU
compile time default. The "nordrand" flag was added during the early
days of RDRAND, when there were worries that merely using its values
could compromise the RNG. However, these days, RDRAND values are not
used directly but always go through the RNG's hash function, making
"nordrand" no longer useful.
Rather, the correct switch is "random.trust_cpu", which not only handles
the relevant trust issue directly, but also is general to multiple CPU
types, not just x86.
However, x86 RDRAND does have a history of being occasionally
problematic. Prior, when the kernel would notice something strange, it'd
warn in dmesg and suggest enabling "nordrand". We can improve on that by
making the test a little bit better and then taking the step of
automatically disabling RDRAND if we detect it's problematic.
Also disable RDSEED if the RDRAND test fails.
Cc: [email protected]
Cc: Theodore Ts'o <[email protected]>
Suggested-by: H. Peter Anvin <[email protected]>
Suggested-by: Borislav Petkov <[email protected]>
Acked-by: Borislav Petkov <[email protected]>
Signed-off-by: Jason A. Donenfeld <[email protected]>
|
|
When RDRAND was introduced, there was much discussion on whether it
should be trusted and how the kernel should handle that. Initially, two
mechanisms cropped up, CONFIG_ARCH_RANDOM, a compile time switch, and
"nordrand", a boot-time switch.
Later the thinking evolved. With a properly designed RNG, using RDRAND
values alone won't harm anything, even if the outputs are malicious.
Rather, the issue is whether those values are being *trusted* to be good
or not. And so a new set of options were introduced as the real
ones that people use -- CONFIG_RANDOM_TRUST_CPU and "random.trust_cpu".
With these options, RDRAND is used, but it's not always credited. So in
the worst case, it does nothing, and in the best case, maybe it helps.
Along the way, CONFIG_ARCH_RANDOM's meaning got sort of pulled into the
center and became something certain platforms force-select.
The old options don't really help with much, and it's a bit odd to have
special handling for these instructions when the kernel can deal fine
with the existence or untrusted existence or broken existence or
non-existence of that CPU capability.
Simplify the situation by removing CONFIG_ARCH_RANDOM and using the
ordinary asm-generic fallback pattern instead, keeping the two options
that are actually used. For now it leaves "nordrand" for now, as the
removal of that will take a different route.
Acked-by: Michael Ellerman <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Acked-by: Borislav Petkov <[email protected]>
Acked-by: Heiko Carstens <[email protected]>
Acked-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Jason A. Donenfeld <[email protected]>
|
|
Patch series "cpumask: Fix invalid uniprocessor assumptions", v4.
On uniprocessor builds, it is currently assumed that any cpumask will
contain the single CPU: cpu0. This assumption is used to provide
optimised implementations.
The current assumption also appears to be wrong, by ignoring the fact that
users can provide empty cpumasks. This can result in bugs as explained in
[1] - for_each_cpu() will run one iteration of the loop even when passed
an empty cpumask.
This series introduces some basic tests, and updates the optimisations for
uniprocessor builds.
The x86 patch was written after the kernel test robot [2] ran into a
failed build. I have tried to list the files potentially affected by the
changes to cpumask.h, in an attempt to find any other cases that fail on
!SMP. I've gone through some of the files manually, and ran a few cross
builds, but nothing else popped up. I (build) checked about half of the
potientally affected files, but I do not have the resources to do them
all. I hope we can fix other issues if/when they pop up later.
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
This patch (of 5):
The maps to keep track of shared caches between CPUs on SMP systems are
declared in asm/smp.h, among them specifically cpu_llc_shared_map. These
maps are externally defined in cpu/smpboot.c. The latter is only compiled
on CONFIG_SMP=y, which means the declared extern symbols from asm/smp.h do
not have a corresponding definition on uniprocessor builds.
The inline cpu_llc_shared_mask() function from asm/smp.h refers to the map
declaration mentioned above. This function is referenced in cacheinfo.c
inside for_each_cpu() loop macros, to provide cpumask for the loop. On
uniprocessor builds, the symbol for the cpu_llc_shared_map does not exist.
However, the current implementation of for_each_cpu() also (wrongly)
ignores the provided mask.
By sheer luck, the compiler thus optimises out this unused reference to
cpu_llc_shared_map, and the linker therefore does not require the
cpu_llc_shared_mask to actually exist on uniprocessor builds. Only on SMP
bulids does smpboot.o exist to provide the required symbols.
To no longer rely on compiler optimisations for successful uniprocessor
builds, move the definitions of cpu_llc_shared_map and cpu_l2c_shared_map
from smpboot.c to cacheinfo.c.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/e8167ddb570f56744a3dc12c2149a660a324d969.1656777646.git.sander@svanheule.net
Signed-off-by: Sander Vanheule <[email protected]>
Cc: Andy Shevchenko <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Yury Norov <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Remove a superfluous ' in the mitigation string.
Fixes: e8ec1b6e08a2 ("x86/bugs: Enable STIBP for JMP2RET")
Signed-off-by: Kim Phillips <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
|
|
This symbol is not used outside of bugs.c, so mark it static.
Reported-by: Abaci Robot <[email protected]>
Signed-off-by: Jiapeng Chong <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|