|
amd_get_boost_ratio_numerator()
The special case in amd_pstate_highest_perf_set() is the value used
for calculating the boost numerator. Merge this into
amd_get_boost_ratio_numerator() and then use that to calculate the boost
ratio.
This allows dropping more special casing of the highest perf value.
Reviewed-by: Gautham R. Shenoy <[email protected]>
Signed-off-by: Mario Limonciello <[email protected]>
|
|
AMD systems that support preferred cores will use "166" as their
numerator for max frequency calculations instead of "255".
Add a function for detecting preferred cores by looking at the
highest perf value on all cores.
If preferred cores are enabled, return 166; if disabled, return the
value in the highest perf register. As the function will be called
multiple times, cache the values for the boost numerator and if
preferred cores will be enabled in global variables.
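The caching scheme looks roughly like the following sketch; the variable
and helper names here are illustrative assumptions, not necessarily the
driver's actual identifiers:
  /* Illustrative sketch of the cached numerator lookup; names are assumptions. */
  static u64 boost_numerator;
  static bool prefcore_checked;
  static bool prefcore_enabled;

  static int get_boost_numerator_sketch(unsigned int cpu, u64 *numerator)
  {
      if (!prefcore_checked) {
          prefcore_enabled = detect_prefcore_sketch();            /* hypothetical */
          boost_numerator  = prefcore_enabled ?
                             166 : read_highest_perf_sketch(cpu); /* hypothetical */
          prefcore_checked = true;
      }

      *numerator = boost_numerator;
      return 0;
  }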
Reviewed-by: Gautham R. Shenoy <[email protected]>
Signed-off-by: Mario Limonciello <[email protected]>
|
|
amd_pstate_get_highest_perf() is a helper used to get the highest perf
value on AMD systems. It's used in amd-pstate as part of preferred
core handling, but is applicable to acpi-cpufreq as well.
Move it out to cppc handling code as amd_get_highest_perf().
Reviewed-by: Perry Yuan <[email protected]>
Reviewed-by: Gautham R. Shenoy <[email protected]>
Signed-off-by: Mario Limonciello <[email protected]>
|
|
If the boost ratio isn't calculated properly for the system for any
reason, this can cause other problems that are non-obvious.
Raise all messages to warn instead.
Suggested-by: Perry Yuan <[email protected]>
Reviewed-by: Perry Yuan <[email protected]>
Reviewed-by: Gautham R. Shenoy <[email protected]>
Signed-off-by: Mario Limonciello <[email protected]>
|
|
perf_ratio is a u64 and SCHED_CAPACITY_SCALE is a large number.
Shifting it right by one will never produce a zero value.
Drop the check.
Suggested-by: Gautham R. Shenoy <[email protected]>
Reviewed-by: Gautham R. Shenoy <[email protected]>
Signed-off-by: Mario Limonciello <[email protected]>
|
|
The function name is ambiguous because it returns an intermediate value
for calculating maximum frequency rather than the CPPC 'Highest Perf'
register.
Rename the function to clarify its use and allow the function to return
errors. Adjust the consumer in acpi-cpufreq to catch errors.
Reviewed-by: Gautham R. Shenoy <[email protected]>
Signed-off-by: Mario Limonciello <[email protected]>
|
|
To prepare for letting amd_get_highest_perf() detect preferred cores,
it will require CPPC functions. Move amd_get_highest_perf() to
cppc.c to prepare for 'preferred core detection' rework.
No functional changes intended.
Reviewed-by: Perry Yuan <[email protected]>
Reviewed-by: Gautham R. Shenoy <[email protected]>
Signed-off-by: Mario Limonciello <[email protected]>
|
|
To update with the latest fixes.
|
|
Patch series "mm: Care about shadow stack guard gap when getting an
unmapped area", v2.
As covered in the commit log for c44357c2e76b ("x86/mm: care about shadow
stack guard gap during placement") our current mmap() implementation does
not take care to ensure that a new mapping isn't placed with existing
mappings inside its own guard gaps. This is particularly important for
shadow stacks since if two shadow stacks end up getting placed adjacent to
each other then they can overflow into each other which weakens the
protection offered by the feature.
On x86 there is a custom arch_get_unmapped_area() which was updated by the
above commit to cover this case by specifying a start_gap for allocations
with VM_SHADOW_STACK. Both arm64 and RISC-V have equivalent features and
use the generic implementation of arch_get_unmapped_area() so let's make
the equivalent change there so they also don't get shadow stack pages
placed without guard pages. The arm64 and RISC-V shadow stack
implementations are currently on the list:
https://lore.kernel.org/r/20240829-arm64-gcs-v12-0-42fec94743
https://lore.kernel.org/lkml/[email protected]/
Given the addition of the use of vm_flags in the generic implementation we
also simplify the set of possibilities that have to be dealt with in the
core code by making arch_get_unmapped_area() take vm_flags as standard.
This is a bit invasive since the prototype change touches quite a few
architectures, but since the parameter is ignored the change is
straightforward, and the simplification of the generic code seems worth it.
This patch (of 3):
When we introduced arch_get_unmapped_area_vmflags() in 961148704acd ("mm:
introduce arch_get_unmapped_area_vmflags()") we did so as part of properly
supporting guard pages for shadow stacks on x86_64, which uses a custom
arch_get_unmapped_area(). Equivalent features are also present on both
arm64 and RISC-V, both of which use the generic implementation of
arch_get_unmapped_area() and will require equivalent modification there.
Rather than continue to deal with having two versions of the functions
let's bite the bullet and have all implementations of
arch_get_unmapped_area() take vm_flags as a parameter.
The new parameter is currently ignored by all implementations other than
x86. The only caller that doesn't have vm_flags available is
mm_get_unmapped_area(); as with the x86 implementation and the wrapper used
on other architectures, it is modified to supply no flags.
No functional changes.
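The shape of the resulting prototype is roughly the following (a sketch,
not copied from the tree):
  /* Sketch: vm_flags is now passed through; non-x86 implementations ignore it.
   * vm_flags_t comes from <linux/mm_types.h>. */
  struct file;

  unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr,
                                       unsigned long len, unsigned long pgoff,
                                       unsigned long flags, vm_flags_t vm_flags);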
Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-0-a46b8b6dc0ed@kernel.org
Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-1-a46b8b6dc0ed@kernel.org
Signed-off-by: Mark Brown <[email protected]>
Acked-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Acked-by: Helge Deller <[email protected]> [parisc]
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: "Edgecombe, Rick P" <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Naveen N Rao <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Russell King <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: WANG Xuerui <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv fixes from Wei Liu:
- Add a documentation overview of Confidential Computing VM support
(Michael Kelley)
- Use lapic timer in a TDX VM without paravisor (Dexuan Cui)
- Set X86_FEATURE_TSC_KNOWN_FREQ when Hyper-V provides frequency
(Michael Kelley)
- Fix a kexec crash due to VP assist page corruption (Anirudh
Rayabharam)
- Python3 compatibility fix for lsvmbus (Anthony Nandaa)
- Misc fixes (Rachel Menge, Roman Kisel, zhang jiao, Hongbo Li)
* tag 'hyperv-fixes-signed-20240908' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
hv: vmbus: Constify struct kobj_type and struct attribute_group
tools: hv: rm .*.cmd when make clean
x86/hyperv: fix kexec crash due to VP assist page corruption
Drivers: hv: vmbus: Fix the misplaced function description
tools: hv: lsvmbus: change shebang to use python3
x86/hyperv: Set X86_FEATURE_TSC_KNOWN_FREQ when Hyper-V provides frequency
Documentation: hyperv: Add overview of Confidential Computing VM support
clocksource: hyper-v: Use lapic timer in a TDX VM without paravisor
Drivers: hv: Remove deprecated hv_fcopy declarations
|
|
There are several comments all over the place which use a wrong singular
form of jiffies.
Replace 'jiffie' by 'jiffy'. No functional change.
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Geert Uytterhoeven <[email protected]> # m68k
Link: https://lore.kernel.org/all/20240904-devel-anna-maria-b4-timers-flseep-v1-3-e98760256370@linutronix.de
|
|
For optimized performance, firmware typically distributes EPC sections
evenly across different NUMA nodes. However, there are scenarios where
a node may have both CPUs and memory but no EPC section configured. For
example, in an 8-socket system with a Sub-Numa-Cluster=2 setup, there
are a total of 16 nodes. Given that the maximum number of supported EPC
sections is 8, it is simply not feasible to assign one EPC section to
each node. This configuration is not incorrect - SGX will still operate
correctly; it is just not optimized from a NUMA standpoint.
For this reason, log a message when a node with both CPUs and memory
lacks an EPC section. This will provide users with a hint as to why they
might be experiencing less-than-ideal performance when running SGX
enclaves.
Suggested-by: Dave Hansen <[email protected]>
Signed-off-by: Aaron Lu <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
Acked-by: Kai Huang <[email protected]>
Link: https://lore.kernel.org/all/20240905080855.1699814-3-aaron.lu%40intel.com
|
|
When the current node doesn't have an EPC section configured by firmware
and all other EPC sections are used up, CPU can get stuck inside the
while loop that looks for an available EPC page from remote nodes
indefinitely, leading to a soft lockup. Note how nid_of_current will
never be equal to nid in that while loop because nid_of_current is not
set in sgx_numa_mask.
Also worth mentioning is that it's perfectly fine for the firmware not
to set up an EPC section on a node. While setting up an EPC section on
each node can enhance performance, it is not a requirement for
functionality.
Rework the loop to start and end on *a* node that has SGX memory. This
avoids deadlocking while waiting for the current, SGX-lacking node to show
up in the loop when it never will.
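A simplified sketch of the reworked walk (illustrative only; helper names
are assumptions): starting from a node that is known to be in
sgx_numa_mask guarantees the terminating comparison can be reached.
  /* Illustrative sketch, not the actual driver code. */
  #include <linux/nodemask.h>
  #include <linux/topology.h>

  struct sgx_epc_page;
  static nodemask_t sgx_numa_mask;      /* nodes that actually have EPC sections */

  static struct sgx_epc_page *alloc_from_epc_nodes_sketch(void)
  {
      int start, nid;

      /* Start from a node that has EPC, not from the current node. */
      start = nid = next_node_in(numa_node_id(), sgx_numa_mask);
      if (nid >= MAX_NUMNODES)
          return NULL;              /* no EPC configured anywhere */

      do {
          struct sgx_epc_page *page = try_alloc_from_node_sketch(nid); /* hypothetical */

          if (page)
              return page;
          nid = next_node_in(nid, sgx_numa_mask);
      } while (nid != start);       /* 'start' is in the mask, so this terminates */

      return NULL;
  }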
Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
Reported-by: "Molina Sabido, Gerardo" <[email protected]>
Signed-off-by: Aaron Lu <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Kai Huang <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Tested-by: Zhimin Luo <[email protected]>
Link: https://lore.kernel.org/all/20240905080855.1699814-2-aaron.lu%40intel.com
|
|
When the SRSO mitigation is disabled, either via mitigations=off or
spec_rstack_overflow=off, the warning about the lack of IBPB-enhancing
microcode is printed anyway.
This is unnecessary since the user has turned off the mitigation.
[ bp: Massage, drop SBPB rationale as it doesn't matter because when
mitigations are disabled x86_pred_cmd is not being used anyway. ]
Signed-off-by: David Kaplan <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Acked-by: Josh Poimboeuf <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The Moorefield and Lightning Mountain Atom processors are
missing the NO_SSB flag in the vulnerabilities whitelist.
This will cause unaffected parts to incorrectly be reported
as vulnerable. Add the missing flag.
These parts are currently out of service and were verified
internally with archived documentation that they need the
NO_SSB flag.
Closes: https://lore.kernel.org/lkml/CAEJ9NQdhh+4GxrtG1DuYgqYhvc0hi-sKZh-2niukJ-MyFLntAA@mail.gmail.com/
Reported-by: Shanavas.K.S <[email protected]>
Signed-off-by: Daniel Sneddon <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
commit 9636be85cc5b ("x86/hyperv: Fix hyperv_pcpu_input_arg handling when
CPUs go online/offline") introduces a new cpuhp state for hyperv
initialization.
cpuhp_setup_state() returns the state number if state is
CPUHP_AP_ONLINE_DYN or CPUHP_BP_PREPARE_DYN and 0 for all other states.
For the hyperv case, since a new cpuhp state was introduced it would
return 0. However, in hv_machine_shutdown(), the cpuhp_remove_state() call
is conditioned upon "hyperv_init_cpuhp > 0". This will never be true and
so hv_cpu_die() won't be called on all CPUs. This means the VP assist page
won't be reset. When the kexec kernel tries to setup the VP assist page
again, the hypervisor corrupts the memory region of the old VP assist page
causing a panic in case the kexec kernel is using that memory elsewhere.
This was originally fixed in commit dfe94d4086e4 ("x86/hyperv: Fix kexec
panic/hang issues").
Get rid of hyperv_init_cpuhp entirely since we are no longer using a
dynamic cpuhp state and use CPUHP_AP_HYPERV_ONLINE directly with
cpuhp_remove_state().
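Roughly, the change amounts to the following (a before/after sketch, not
the exact hunk):
  /* Before: never true, because cpuhp_setup_state() returns 0 for a
   * non-dynamic state, so the teardown never ran on shutdown. */
  if (hyperv_init_cpuhp > 0)
      cpuhp_remove_state(hyperv_init_cpuhp);

  /* After: remove the state by its constant directly. */
  cpuhp_remove_state(CPUHP_AP_HYPERV_ONLINE);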
Cc: [email protected]
Fixes: 9636be85cc5b ("x86/hyperv: Fix hyperv_pcpu_input_arg handling when CPUs go online/offline")
Signed-off-by: Anirudh Rayabharam (Microsoft) <[email protected]>
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Wei Liu <[email protected]>
Message-ID: <[email protected]>
|
|
In order to be able to compute the sizes of tasks consistently across all
CPUs in a hybrid system, it is necessary to provide CPU capacity scaling
information to the scheduler via arch_scale_cpu_capacity(). Moreover,
the value returned by arch_scale_freq_capacity() for the given CPU must
correspond to the arch_scale_cpu_capacity() return value for it, or
utilization computations will be inaccurate.
Add support for it through per-CPU variables holding the capacity and
maximum-to-base frequency ratio (times SCHED_CAPACITY_SCALE) that will
be returned by arch_scale_cpu_capacity() and used by scale_freq_tick()
to compute arch_freq_scale for the current CPU, respectively.
In order to avoid adding measurable overhead for non-hybrid x86 systems,
which are the vast majority in the field, whether or not the new hybrid
CPU capacity scaling will be in effect is controlled by a static key.
This static key is set by calling arch_enable_hybrid_capacity_scale()
which also allocates memory for the per-CPU data and initializes it.
Next, arch_set_cpu_capacity() is used to set the per-CPU variables
mentioned above for each CPU and arch_rebuild_sched_domains() needs
to be called for the scheduler to realize that capacity-aware
scheduling can be used going forward.
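A sketch of the mechanism described above; the structure and variable
names are illustrative assumptions rather than the exact ones used:
  /* Illustrative sketch; names are assumptions. */
  #include <linux/jump_label.h>
  #include <linux/percpu.h>
  #include <linux/sched.h>          /* SCHED_CAPACITY_SCALE */

  static DEFINE_STATIC_KEY_FALSE(hybrid_cap_scale_key);

  struct hybrid_cpu_scale {
      unsigned long capacity;       /* value returned by arch_scale_cpu_capacity() */
      unsigned long freq_ratio;     /* max-to-base ratio * SCHED_CAPACITY_SCALE */
  };

  static struct hybrid_cpu_scale __percpu *hybrid_cpu_scale;

  unsigned long arch_scale_cpu_capacity(int cpu)
  {
      if (static_branch_unlikely(&hybrid_cap_scale_key))
          return READ_ONCE(per_cpu_ptr(hybrid_cpu_scale, cpu)->capacity);

      return SCHED_CAPACITY_SCALE;
  }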
Signed-off-by: Rafael J. Wysocki <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Ricardo Neri <[email protected]>
Tested-by: Ricardo Neri <[email protected]> # scale invariance
Link: https://patch.msgid.link/[email protected]
[ rjw: Added parens to function kerneldoc comments ]
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
IFM references
There's an erratum that prevents the PAT from working correctly:
https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-dual-core-specification-update.pdf
# Document 316515 Version 010
The kernel currently disables PAT support on those CPUs, but it
does it with some magic numbers.
Replace the magic numbers with the new "IFM" macros.
Make the check refer to the last affected CPU (INTEL_CORE_YONAH)
rather than the first fixed one. This makes it easier to find the
documentation of the erratum since Intel documents where it is
broken and not where it is fixed.
I don't think the Pentium Pro (or Pentium II) is actually affected.
But the old check included them, so it can't hurt to keep doing the
same. I'm also not completely sure about the "Pentium M" CPUs
(models 0x9 and 0xd). But, again, they were included in the
old checks and were close Pentium III derivatives, so are likely
affected.
While we're at it, revise the comment to refer to the erratum name
and make sure it is a quote of the language from the actual errata
doc. That should make it easier to find in the future when the URL
inevitably changes.
Why bother with this in the first place? It actually gets rid of one
of the very few remaining direct references to c->x86{,_model}.
No change in functionality intended.
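The resulting check looks roughly like this (a sketch; the exact bounds
and the message follow the erratum documentation, not this snippet):
  /* Sketch: a VFM range check instead of open-coded family/model numbers.
   * x86_vfm encodes vendor, family and model in one comparable value. */
  if (boot_cpu_data.x86_vfm >= INTEL_PENTIUM_PRO &&
      boot_cpu_data.x86_vfm <= INTEL_CORE_YONAH)
      pat_disable("PAT WC disabled due to known CPU erratum.");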
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Len Brown <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Sparse expects an __iomem pointer, but after converting the EISA probe to
memremap() the pointer is a regular memory pointer. Access it directly
instead.
[ tglx: Converted it to fix the already applied version ]
Fixes: 80a4da05642c ("x86/EISA: Use memremap() to probe for the EISA BIOS signature")
Signed-off-by: Maciej W. Rozycki <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
When using resctrl on systems with Sub-NUMA Clustering enabled, monitoring
groups may be allocated RMID values which would overrun the
arch_mbm_{local,total} arrays.
This is due to inconsistencies in whether the SNC-adjusted num_rmid value or
the unadjusted value in resctrl_arch_system_num_rmid_idx() is used. The
num_rmid value for the L3 resource is currently:
resctrl_arch_system_num_rmid_idx() / snc_nodes_per_l3_cache
As a simple fix, make resctrl_arch_system_num_rmid_idx() return the
SNC-adjusted, L3 num_rmid value on x86.
Fixes: e13db55b5a0d ("x86/resctrl: Introduce snc_nodes_per_l3_cache")
Signed-off-by: Peter Newman <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The FRED RSP0 MSR points to the top of the kernel stack for user level
event delivery. As this is the task stack it needs to be updated when a
task is scheduled in.
The update is done at context switch. That means it's also done when
switching to kernel threads, which is pointless as those never go out to
user space. For KVM threads this means there are two writes to FRED_RSP0 as
KVM has to switch to the guest value before VMENTER.
Defer the update to the exit to user space path and cache the per CPU
FRED_RSP0 value, so redundant writes can be avoided.
Provide fred_sync_rsp0() for KVM to keep the cache in sync with the actual
MSR value after returning from guest to host mode.
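A minimal sketch of the cached update, assuming an illustrative per-CPU
variable name:
  /* Illustrative sketch; the per-CPU variable name is an assumption. */
  #include <linux/percpu.h>
  #include <asm/msr.h>

  static DEFINE_PER_CPU(unsigned long, cached_fred_rsp0);

  static inline void fred_update_rsp0_sketch(unsigned long rsp0)
  {
      /* Called on the exit-to-user path; write the MSR only on change. */
      if (__this_cpu_read(cached_fred_rsp0) != rsp0) {
          wrmsrl(MSR_IA32_FRED_RSP0, rsp0);
          __this_cpu_write(cached_fred_rsp0, rsp0);
      }
  }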
[ tglx: Massage change log ]
Suggested-by: Sean Christopherson <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Per the discussion about FRED MSR writes with WRMSRNS instruction [1],
use the alternatives mechanism to choose WRMSRNS when it's available,
otherwise fallback to WRMSR.
Remove the dependency on X86_FEATURE_WRMSRNS as WRMSRNS is no longer
dependent on FRED.
[1] https://lore.kernel.org/lkml/[email protected]/
Use DS prefix to pad WRMSR instead of a NOP. The prefix is ignored. At
least that's the current information from the hardware folks.
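The selection looks roughly like the following sketch, with the DS prefix
keeping both alternatives the same length:
  /* Illustrative sketch of picking WRMSRNS via alternatives. */
  #include <asm/alternative.h>
  #include <asm/cpufeatures.h>

  static __always_inline void wrmsrns_sketch(u32 msr, u64 val)
  {
      /*
       * WRMSR is 2 bytes, WRMSRNS (0f 01 c6) is 3 bytes.  Padding WRMSR
       * with a DS prefix avoids a trailing NOP; the prefix is ignored.
       */
      asm volatile(ALTERNATIVE("ds wrmsr",
                               ".byte 0x0f, 0x01, 0xc6",   /* WRMSRNS */
                               X86_FEATURE_WRMSRNS)
                   : : "c" (msr), "a" ((u32)val), "d" ((u32)(val >> 32))
                   : "memory");
  }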
Signed-off-by: Andrew Cooper <[email protected]>
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
SS is initialized to NULL during boot time and not explicitly set to
__KERNEL_DS.
With FRED enabled, if a kernel event is delivered before a CPU goes to
user level for the first time, its SS is NULL thus NULL is pushed into
the SS field of the FRED stack frame. But before ERETS is executed,
the CPU may context switch to another task and go to user level. Then
when the CPU comes back to kernel mode, SS is changed to __KERNEL_DS.
Later when ERETS is executed to return from the kernel event handler,
a #GP fault is generated because SS doesn't match the SS saved in the
FRED stack frame.
Initialize SS to __KERNEL_DS when enabling FRED to prevent that.
Note, IRET doesn't check if SS matches the SS saved in its stack frame,
thus IDT doesn't have this problem. For IDT it doesn't matter whether
SS is set to __KERNEL_DS or not, because it's set to NULL upon interrupt
or exception delivery and __KERNEL_DS upon SYSCALL. Thus it's pointless
to initialize SS for IDT.
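The initialization itself is a one-liner along these lines (a sketch of
its placement in the FRED enabling path; loadsegment()/__KERNEL_DS come
from <asm/segment.h>):
  /* Sketch: make SS match what ERETS will check before any kernel event fires. */
  loadsegment(ss, __KERNEL_DS);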
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Add new PCI IDs for Device 18h and Function 4 to enable the amd_atl driver
on those systems.
Signed-off-by: Richard Gong <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Yazen Ghannam <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
init_per_cpu_var() returns a pointer in the percpu address space while
rip_rel_ptr() expects a pointer in the generic address space.
When strict address space checks are enabled, GCC's named address space
checks fail:
asm.h:124:63: error: passing argument 1 of 'rip_rel_ptr' from
pointer to non-enclosed address space
Add an explicit cast to remove the address space of the returned pointer.
Fixes: 11e36b0f7c21 ("x86/boot/64: Load the final kernel GDT during early boot directly, remove startup_gdt[]")
Signed-off-by: Uros Bizjak <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
When SGX is not supported by the BIOS, the kernel log contains the error
'SGX disabled by BIOS', which can be confusing since there might not be an
SGX-related option in the BIOS settings.
For the kernel it's difficult to distinguish between the BIOS not
supporting SGX and the BIOS supporting SGX but having it disabled.
Therefore, update the error message to 'SGX disabled or unsupported by
BIOS' to make it easier for those reading kernel logs to understand what's
happening.
Reported-by: Bo Wu <[email protected]>
Co-developed-by: Zelong Xiang <[email protected]>
Signed-off-by: Zelong Xiang <[email protected]>
Signed-off-by: WangYuli <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Kai Huang <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Closes: https://github.com/linuxdeepin/developer-center/issues/10032
|
|
The current assembly around swap_pages() in relocate_kernel() takes
some time to follow because it is easy to lose track of the registers as
the assembly goes on. Add a couple of comments to clarify the
code around swap_pages() to improve readability.
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Link: https://lore.kernel.org/all/8b52b0b8513a34b2a02fb4abb05c6700c2821475.1724573384.git.kai.huang@intel.com
|
|
When relocate_kernel() gets called, %rdi holds 'indirection_page' and
%rsi holds 'page_list'. And %rdi always holds 'indirection_page' when
swap_pages() is called.
Therefore the comment on the first line of code in swap_pages()
movq %rdi, %rcx /* Put the page_list in %rcx */
.. isn't correct because it actually moves the 'indirection_page' into
%rcx. Fix it.
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Link: https://lore.kernel.org/all/adafdfb1421c88efce04420fc9a996c0e2ca1b34.1724573384.git.kai.huang@intel.com
|
|
Building the SGX code with W=1 generates below warning:
arch/x86/kernel/cpu/sgx/main.c:741: warning: Function parameter or
struct member 'low' not described in 'sgx_calc_section_metric'
arch/x86/kernel/cpu/sgx/main.c:741: warning: Function parameter or
struct member 'high' not described in 'sgx_calc_section_metric'
...
The function sgx_calc_section_metric() is a simple helper which is only
used in sgx/main.c. There's no need to use kernel-doc style comment for
it.
Downgrade to a normal comment.
Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
The area at the 0x0FFFD9 physical location in the PC memory space is
regular memory, traditionally ROM BIOS and more recently a copy of BIOS
code and data in RAM, write-protected.
Therefore use memremap() to get access to it rather than ioremap(),
avoiding issues in virtualization scenarios and complementing changes such
as commit f7750a795687 ("x86, mpparse, x86/acpi, x86/PCI, x86/dmi, SFI: Use
memremap() for RAM mappings") or commit 5997efb96756 ("x86/boot: Use
memremap() to map the MPF and MPC data").
Reported-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Maciej W. Rozycki <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Closes: https://lore.kernel.org/r/[email protected]
|
|
etc.)
Add defines for the architectural memory types that can be shoved into
various MSRs and registers, e.g. MTRRs, PAT, VMX capabilities MSRs, EPTPs,
etc. While most MSRs/registers support only a subset of all memory types,
the values themselves are architectural and identical across all users.
Leave the goofy MTRR_TYPE_* definitions as-is since they are in a uapi
header, but add compile-time assertions to connect the dots (and sanity
check that the msr-index.h values didn't get fat-fingered).
Keep the VMX_EPTP_MT_* defines so that it's slightly more obvious that the
EPTP holds a single memory type in 3 of its 64 bits; those bits just
happen to be 2:0, i.e. don't need to be shifted.
Opportunistically use X86_MEMTYPE_WB instead of an open coded '6' in
setup_vmcs_config().
No functional change intended.
Reviewed-by: Thomas Gleixner <[email protected]>
Acked-by: Kai Huang <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
Reviewed-by: Kai Huang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Refactor the list of constant variables into a macro.
This should make it easier to add more constants in the future.
Signed-off-by: Jann Horn <[email protected]>
Reviewed-by: Alexander Gordeev <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Acked-by: Will Deacon <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
|
|
There are two distinct CPU features related to the use of XSAVES and LBR:
whether LBR is itself supported and whether XSAVES supports LBR. The LBR
subsystem correctly checks both in intel_pmu_arch_lbr_init(), but the
XSTATE subsystem does not.
The LBR bit is only removed from xfeatures_mask_independent when LBR is not
supported by the CPU, but there is no validation of XSTATE support.
If XSAVES does not support LBR the write to IA32_XSS causes a #GP fault,
leaving the state of IA32_XSS unchanged, i.e. zero. The fault is handled
with a warning and the boot continues.
Consequently the next XRSTORS which tries to restore supervisor state fails
with #GP because the RFBM has zero for all supervisor features, which does
not match the XCOMP_BV field.
As XFEATURE_MASK_FPSTATE includes supervisor features, setting up the FPU
causes a #GP, which ends up in fpu_reset_from_exception_fixup(). That fails
due to the same problem resulting in recursive #GPs until the kernel runs
out of stack space and double faults.
Prevent this by storing the supported independent features in
fpu_kernel_cfg during XSTATE initialization and use that cached value for
retrieving the independent feature bits to be written into IA32_XSS.
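A sketch of the cached lookup described above; the fpu_kernel_cfg field
name used here is an assumption:
  /* Illustrative sketch; the 'independent_features' field name is assumed. */
  static inline u64 xfeatures_mask_independent_sketch(void)
  {
      if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR))
          return fpu_kernel_cfg.independent_features & ~XFEATURE_MASK_LBR;

      return fpu_kernel_cfg.independent_features;
  }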
[ tglx: Massaged change log ]
Fixes: f0dccc9da4c0 ("x86/fpu/xstate: Support dynamic supervisor feature for LBR")
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Mitchell Levy <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/[email protected]
|
|
On 64-bit init_mem_mapping() relies on the minimal page fault handler
provided by the early IDT mechanism. The real page fault handler is
installed right afterwards into the IDT.
This is problematic on CPUs which have X86_FEATURE_FRED set because the
real page fault handler retrieves the faulting address from the FRED
exception stack frame and not from CR2, but that obviously does not work
when FRED is not yet enabled in the CPU.
To prevent this enable FRED right after init_mem_mapping() without
interrupt stacks. Those are enabled later in trap_init() after the CPU
entry area is set up.
[ tglx: Encapsulate the FRED details ]
Fixes: 14619d912b65 ("x86/fred: FRED entry/exit and dispatch code")
Reported-by: Hou Wenlong <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
To enable FRED earlier, move the RSP initialization out of
cpu_init_fred_exceptions() into cpu_init_fred_rsps().
This is required as the FRED RSP initialization depends on the availability
of the CPU entry areas which are set up late in trap_init().
No functional change intended. Marked with Fixes as it's a dependency for
the real fix.
Fixes: 14619d912b65 ("x86/fred: FRED entry/exit and dispatch code")
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Depending on whether FRED is enabled, sysvec_install() installs a system
interrupt handler either into FRED's system vector dispatch table or
into the IDT.
However FRED can be disabled later in trap_init(), after sysvec_install()
has been invoked already; e.g., the HYPERVISOR_CALLBACK_VECTOR handler is
registered with sysvec_install() in kvm_guest_init(), which is called in
setup_arch() but way before trap_init().
IOW, there is a gap between when FRED becomes available and when it gets disabled.
As a result, when FRED is available but disabled, early sysvec_install()
invocations fail to install the IDT handler resulting in spurious
interrupts.
Fix it by parsing the cmdline param "fred=" in cpu_parse_early_param() to
ensure that FRED is disabled before the first sysvec_install() invocation.
Fixes: 3810da12710a ("x86/fred: Add a fred= cmdline param")
Reported-by: Hou Wenlong <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
x2apic_disable() clears x2apic_state and x2apic_mode unconditionally, even
when the state is X2APIC_ON_LOCKED, which prevents the kernel from disabling
it, thereby creating inconsistent state.
Due to the early state check for X2APIC_ON, the code path which warns about
a locked X2APIC cannot be reached.
Test for state < X2APIC_ON instead and move the clearing of the state and
mode variables to the place which actually disables X2APIC.
[ tglx: Massaged change log. Added Fixes tag. Moved clearing so it's at the
right place for back ports ]
Fixes: a57e456a7b28 ("x86/apic: Fix fallout from x2apic cleanup")
Signed-off-by: Yuntao Wang <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/[email protected]
|
|
For any change of the struct fd representation we need to
turn existing accesses to its fields into calls of wrappers.
Accesses to struct fd::flags are very few (3 in linux/file.h,
1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in
explicit initializers).
Those can be dealt with in the commit converting to
new layout; accesses to struct fd::file are too many for that.
This commit converts (almost) all of f.file to
fd_file(f). It's not entirely mechanical ('file' is used as
a member name more than just in struct fd) and it does not
even attempt to distinguish the uses in pointer context from
those in boolean context; the latter will be eventually turned
into a separate helper (fd_empty()).
NOTE: mass conversion to fd_empty(), tempting as it
might be, is a bad idea; better do that piecewise in commit
that convert from fdget...() to CLASS(...).
[conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c
caught by git; fs/stat.c one got caught by git grep]
[fs/xattr.c conflict]
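The conversion is mechanical; a typical caller changes roughly like this
(a sketch, not an exact hunk):
  /* Sketch of the conversion on a typical fdget() user. */
  struct fd f = fdget(fd);

  if (!fd_file(f))                        /* was: if (!f.file) */
      return -EBADF;

  ret = vfs_fsync(fd_file(f), datasync);  /* was: vfs_fsync(f.file, datasync) */
  fdput(f);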
Reviewed-by: Christian Brauner <[email protected]>
Signed-off-by: Al Viro <[email protected]>
|
|
Logical destination mode of the local APIC is used for systems with up to
8 CPUs. It has an advantage over physical destination mode as it allows
targeting multiple CPUs at once with IPIs.
That advantage was definitely worth it when systems with up to 8 CPUs
were state of the art for servers and workstations, but that's history.
Aside from that there are systems which fail to work with logical destination
mode as the ACPI/DMI quirks show and there are AMD Zen1 systems out there
which fail when interrupt remapping is enabled as reported by Rob and
Christian. The latter problem can be cured by firmware updates, but not all
OEMs distribute the required changes.
Physical destination mode is guaranteed to work because it is the only way
to get a CPU up and running via the INIT/INIT/STARTUP sequence.
As the number of CPUs keeps increasing, logical destination mode becomes a
less used code path so there is no real good reason to keep it around.
Therefore remove logical destination mode support for 64-bit and default to
physical destination mode.
Reported-by: Rob Newcater <[email protected]>
Reported-by: Christian Heusel <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Borislav Petkov (AMD) <[email protected]>
Tested-by: Rob Newcater <[email protected]>
Link: https://lore.kernel.org/all/877cd5u671.ffs@tglx
|
|
Stack unwinding produces large amounts of uninteresting coverage.
It's called from KASAN kmalloc/kfree hooks, fault injection, etc.
It's not particularly useful and is not a function of system call args.
Ignore that code.
Signed-off-by: Dmitry Vyukov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Alexander Potapenko <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Reviewed-by: Andrey Konovalov <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/all/eaf54b8634970b73552dcd38bf9be6ef55238c10.1718092070.git.dvyukov@google.com
|
|
MTRRs have an obsolete fixed variant for fine grained caching control
of the 640K-1MB region that uses separate MSRs. This fixed variant has
a separate capability bit in the MTRR capability MSR.
So far all x86 CPUs which support MTRR have this separate bit set, so it
went unnoticed that mtrr_save_state() does not check the capability bit
before accessing the fixed MTRR MSRs.
However, on a CPU that does not support the fixed MTRR capability this
results in a #GP. The #GP itself is harmless because the RDMSR fault is
handled gracefully, but results in a WARN_ON().
Add the missing capability check to prevent this.
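A sketch of the missing check inside mtrr_save_state(); the actual fix may
consult the cached MTRR state rather than re-reading the MSR, so treat the
details as illustrative:
  /* Illustrative sketch: bit 8 of MSR_MTRRcap advertises fixed-range MTRRs. */
  u64 cap;

  rdmsrl(MSR_MTRRcap, cap);
  if (!(cap & BIT(8)))      /* no fixed-range MTRRs -> nothing to save */
      return;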
Fixes: 2b1f6278d77c ("[PATCH] x86: Save the MTRRs of the BSP before booting an AP")
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/[email protected]
|
|
The kernel can change spinlock behavior when running as a guest. But this
guest-friendly behavior causes performance problems on bare metal.
The kernel uses a static key to switch between the two modes.
In theory, the static key is enabled by default (run in guest mode) and
should be disabled for bare metal (and in some guests that want native
behavior or paravirt spinlock).
A performance drop is reported when running encode/decode workload and
BenchSEE cache sub-workload.
Bisect points to commit ce0a1b608bfc ("x86/paravirt: Silence unused
native_pv_lock_init() function warning"). When CONFIG_PARAVIRT_SPINLOCKS is
disabled the virt_spin_lock_key is incorrectly set to true on bare
metal. The qspinlock degenerates to test-and-set spinlock, which decreases
the performance on bare metal.
Set the default value of virt_spin_lock_key to false. If booting in a VM,
enable this key. Later during VM initialization, if another, more
efficient spinlock is preferred (e.g. paravirt spinlock), or the user
wants the native qspinlock (via the nopvspin boot commandline),
virt_spin_lock_key is disabled accordingly.
This results in the following decision matrix:
  X86_FEATURE_HYPERVISOR     Y    Y    Y    N
  CONFIG_PARAVIRT_SPINLOCKS  Y    Y    N    Y/N
  PV spinlock                Y    N    N    Y/N
  virt_spin_lock_key         N    Y/N  Y    N
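A sketch of the default-off key and its VM-only enablement (illustrative,
not the exact hunk):
  /* Sketch: key defaults to false and is only enabled when running as a guest. */
  DEFINE_STATIC_KEY_FALSE(virt_spin_lock_key);

  void __init native_pv_lock_init(void)
  {
      if (cpu_feature_enabled(X86_FEATURE_HYPERVISOR))
          static_branch_enable(&virt_spin_lock_key);
  }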
Fixes: ce0a1b608bfc ("x86/paravirt: Silence unused native_pv_lock_init() function warning")
Reported-by: Prem Nath Dey <[email protected]>
Reported-by: Xiaoping Zhou <[email protected]>
Suggested-by: Dave Hansen <[email protected]>
Suggested-by: Qiuxu Zhuo <[email protected]>
Suggested-by: Nikolay Borisov <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Nikolay Borisov <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/[email protected]
|
|
On a platform using the "Multiprocessor Wakeup Structure"[1] to startup
secondary CPUs the control processor needs to memremap() the physical
address of the MP Wakeup Structure mailbox to the variable
acpi_mp_wake_mailbox, which holds the virtual address of mailbox.
To wake up the AP the control processor writes the APIC ID of AP, the
wakeup vector and the ACPI_MP_WAKE_COMMAND_WAKEUP command into the mailbox.
The current implementation doesn't consider the case where boot time
CPU bringup is restricted to 1 with the kernel parameter "maxcpus=1" and
other CPUs are brought online later from user space, as it sets
acpi_mp_wake_mailbox to read-only after init. So when the first AP is
brought online after init, the attempt to update the variable results in
a kernel panic.
The memremap() call that initializes the variable cannot be moved into
acpi_parse_mp_wake() because memremap() is not functional at that point in
the boot process. Also as the APs might never be brought up, keep the
memremap() call in acpi_wakeup_cpu() so that the operation only takes place
when needed.
Fixes: 24dd05da8c79 ("x86/apic: Mark acpi_mp_wake_* variables as __ro_after_init")
Signed-off-by: Zhiquan Li <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Add missing new lines and reorder variable definitions.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
80 character limit is history.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Add brackets around if/for constructs as required by coding style or remove
pointless line breaks to make it true single line statements which do not
require brackets.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Use proper comment styles and shrink comments to their scope where
applicable.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
It's only used by check_timer().
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Use the new apic_pr_verbose() helper.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|
|
Clean up the APIC printk()s which are inside of an apic verbosity guarded
region by using apic_dbg() for the KERN_DEBUG level prints.
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
|