aboutsummaryrefslogtreecommitdiff
path: root/arch/x86/kernel/cpu/mce
AgeCommit message (Collapse)AuthorFilesLines
2023-06-05locking/atomic: treewide: use raw_atomic*_<op>()Mark Rutland1-8/+8
Now that we have raw_atomic*_<op>() definitions, there's no need to use arch_atomic*_<op>() definitions outside of the low-level atomic definitions. Move treewide users of arch_atomic*_<op>() over to the equivalent raw_atomic*_<op>(). There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-05-16x86/MCE: Check a hw error's address to determine proper recovery actionYazen Ghannam1-1/+1
Make sure that machine check errors with a usable address are properly marked as poison. This is needed for errors that occur on memory which have MCG_STATUS[RIPV] clear - i.e., the interrupted process cannot be restarted reliably. One example is data poison consumption through the instruction fetch units on AMD Zen-based systems. The MF_MUST_KILL flag is passed to memory_failure() when MCG_STATUS[RIPV] is not set. So the associated process will still be killed. What this does, practically, is get rid of one more check to kill_current_task with the eventual goal to remove it completely. Also, make the handling identical to what is done on the notifier path (uc_decode_notifier() does that address usability check too). The scenario described above occurs when hardware can precisely identify the address of poisoned memory, but execution cannot reliably continue for the interrupted hardware thread. [ bp: Massage commit message. ] Signed-off-by: Yazen Ghannam <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Tony Luck <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-04-25Merge tag 'ras_core_for_v6.4_rc1' of ↵Linus Torvalds2-13/+13
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS updates from Borislav Petkov: - Just cleanups and fixes this time around: make threshold_ktype const, an objtool fix and use proper size for a bitmap * tag 'ras_core_for_v6.4_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/MCE/AMD: Use an u64 for bank_map x86/mce: Always inline old MCA stubs x86/MCE/AMD: Make kobj_type structure constant
2023-03-19x86/MCE/AMD: Use an u64 for bank_mapMuralidhara M K1-7/+7
Thee maximum number of MCA banks is 64 (MAX_NR_BANKS), see a0bc32b3cacf ("x86/mce: Increase maximum number of banks to 64"). However, the bank_map which contains a bitfield of which banks to initialize is of type unsigned int and that overflows when those bit numbers are >= 32, leading to UBSAN complaining correctly: UBSAN: shift-out-of-bounds in arch/x86/kernel/cpu/mce/amd.c:1365:38 shift exponent 32 is too large for 32-bit type 'int' Change the bank_map to a u64 and use the proper BIT_ULL() macro when modifying bits in there. [ bp: Rewrite commit message. ] Fixes: a0bc32b3cacf ("x86/mce: Increase maximum number of banks to 64") Signed-off-by: Muralidhara M K <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-03-12x86/mce: Make sure logged MCEs are processed after sysfs updateYazen Ghannam1-0/+1
A recent change introduced a flag to queue up errors found during boot-time polling. These errors will be processed during late init once the MCE subsystem is fully set up. A number of sysfs updates call mce_restart() which goes through a subset of the CPU init flow. This includes polling MCA banks and logging any errors found. Since the same function is used as boot-time polling, errors will be queued. However, the system is now past late init, so the errors will remain queued until another error is found and the workqueue is triggered. Call mce_schedule_work() at the end of mce_restart() so that queued errors are processed. Fixes: 3bff147b187d ("x86/mce: Defer processing of early errors") Signed-off-by: Yazen Ghannam <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Tony Luck <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected]
2023-03-08x86/mce: Always inline old MCA stubsBorislav Petkov (AMD)1-5/+5
The stubs for the ancient MCA support (CONFIG_X86_ANCIENT_MCE) are normally optimized away on 64-bit builds. However, an allmodconfig one causes the compiler to add sanitizer calls gunk into them and they exist as constprop calls. Which objtool then complains about: vmlinux.o: warning: objtool: do_machine_check+0xad8: call to \ pentium_machine_check.constprop.0() leaves .noinstr.text section due to them missing noinstr. One could tag them "noinstr" but what should really happen is, they should be forcefully inlined so that all that gunk gets optimized away and the warning doesn't even have a chance to fire. Do so. No functional changes. Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-03-06x86/MCE/AMD: Make kobj_type structure constantThomas Weißschuh1-1/+1
Since ee6d3dd4ed48 ("driver core: make kobj_type constant.") the driver core allows the usage of const struct kobj_type. Take advantage of this to constify the structure definition to prevent modification at runtime. Signed-off-by: Thomas Weißschuh <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-02-21Merge tag 'ras_core_for_v6.3_rc1' of ↵Linus Torvalds3-30/+58
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS updates from Borislav Petkov: - Add support for reporting more bits of the physical address on error, on newer AMD CPUs - Mask out bits which don't belong to the address of the error being reported * tag 'ras_core_for_v6.3_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Mask out non-address bits from machine check bank x86/mce: Add support for Extended Physical Address MCA changes x86/mce: Define a function to extract ErrorAddr from MCA_ADDR
2023-01-10x86/mce: Mask out non-address bits from machine check bankTony Luck1-5/+9
Systems that support various memory encryption schemes (MKTME, TDX, SEV) use high order physical address bits to indicate which key should be used for a specific memory location. When a memory error is reported, some systems may report those key bits in the IA32_MCi_ADDR machine check MSR. The Intel SDM has a footnote for the contents of the address register that says: "Useful bits in this field depend on the address methodology in use when the register state is saved." AMD Processor Programming Reference has a more explicit description of the MCA_ADDR register: "For physical addresses, the most significant bit is given by Core::X86::Cpuid::LongModeInfo[PhysAddrSize]." Add a new #define MCI_ADDR_PHYSADDR for the mask of valid physical address bits within the machine check bank address register. Use this mask for recoverable machine check handling and in the EDAC driver to ignore any key bits that may be present. [ Tony: Based on independent fixes proposed by Fan Du and Isaku Yamahata ] Reported-by: Isaku Yamahata <[email protected]> Reported-by: Fan Du <[email protected]> Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Yazen Ghannam <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-01-07x86/mce/dev-mcelog: use strscpy() to instead of strncpy()Xu Panda1-2/+1
The implementation of strscpy() is more robust and safer. That's now the recommended way to copy NUL terminated strings. Signed-off-by: Xu Panda <[email protected]> Signed-off-by: Yang Yang <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Tony Luck <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-12-28x86/mce: Add support for Extended Physical Address MCA changesSmita Koralahalli3-8/+33
Newer AMD CPUs support more physical address bits. That is, the MCA_ADDR registers on Scalable MCA systems contain the ErrorAddr in bits [56:0] instead of [55:0]. Hence, the existing LSB field from bits [61:56] in MCA_ADDR must be moved around to accommodate the larger ErrorAddr size. MCA_CONFIG[McaLsbInStatusSupported] indicates this change. If set, the LSB field will be found in MCA_STATUS rather than MCA_ADDR. Each logical CPU has unique MCA bank in hardware and is not shared with other logical CPUs. Additionally, on SMCA systems, each feature bit may be different for each bank within same logical CPU. Check for MCA_CONFIG[McaLsbInStatusSupported] for each MCA bank and for each CPU. Additionally, all MCA banks do not support maximum ErrorAddr bits in MCA_ADDR. Some banks might support fewer bits but the remaining bits are marked as reserved. [ Yazen: Rebased and fixed up formatting. bp: Massage comments. ] Signed-off-by: Smita Koralahalli <[email protected]> Signed-off-by: Yazen Ghannam <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-12-28x86/mce: Define a function to extract ErrorAddr from MCA_ADDRSmita Koralahalli3-18/+17
Move MCA_ADDR[ErrorAddr] extraction into a separate helper function. This will be further refactored to support extended ErrorAddr bits in MCA_ADDR in newer AMD CPUs. [ bp: Massage. ] Signed-off-by: Smita Koralahalli <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Yazen Ghannam <[email protected]> Link: https://lore.kernel.org/all/[email protected]/
2022-10-31x86/mce: Use severity table to handle uncorrected errors in kernelTony Luck1-3/+5
mce_severity_intel() has a special case to promote UC and AR errors in kernel context to PANIC severity. The "AR" case is already handled with separate entries in the severity table for all instruction fetch errors, and those data fetch errors that are not in a recoverable area of the kernel (i.e. have an extable fixup entry). Add an entry to the severity table for UC errors in kernel context that reports severity = PANIC. Delete the special case code from mce_severity_intel(). Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-10-27x86/MCE/AMD: Clear DFR errors found in THR handlerYazen Ghannam1-13/+20
AMD's MCA Thresholding feature counts errors of all severity levels, not just correctable errors. If a deferred error causes the threshold limit to be reached (it was the error that caused the overflow), then both a deferred error interrupt and a thresholding interrupt will be triggered. The order of the interrupts is not guaranteed. If the threshold interrupt handler is executed first, then it will clear MCA_STATUS for the error. It will not check or clear MCA_DESTAT which also holds a copy of the deferred error. When the deferred error interrupt handler runs it will not find an error in MCA_STATUS, but it will find the error in MCA_DESTAT. This will cause two errors to be logged. Check for deferred errors when handling a threshold interrupt. If a bank contains a deferred error, then clear the bank's MCA_DESTAT register. Define a new helper function to do the deferred error check and clearing of MCA_DESTAT. [ bp: Simplify, convert comment to passive voice. ] Fixes: 37d43acfd79f ("x86/mce/AMD: Redo error logging from APIC LVT interrupt handlers") Signed-off-by: Yazen Ghannam <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected]
2022-08-29x86/mce: Retrieve poison range from hardwareJane Chu1-1/+12
When memory poison consumption machine checks fire, MCE notifier handlers like nfit_handle_mce() record the impacted physical address range which is reported by the hardware in the MCi_MISC MSR. The error information includes data about blast radius, i.e. how many cachelines did the hardware determine are impacted. A recent change 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison granularity") updated nfit_handle_mce() to stop hard coding the blast radius value of 1 cacheline, and instead rely on the blast radius reported in 'struct mce' which can be up to 4K (64 cachelines). It turns out that apei_mce_report_mem_error() had a similar problem in that it hard coded a blast radius of 4K rather than reading the blast radius from the error information. Fix apei_mce_report_mem_error() to convey the proper poison granularity. Signed-off-by: Jane Chu <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/[email protected]
2022-06-28x86/mce: Check whether writes to MCA_STATUS are getting ignoredSmita Koralahalli2-1/+48
The platform can sometimes - depending on its settings - cause writes to MCA_STATUS MSRs to get ignored, regardless of HWCR[McStatusWrEn]'s value. For further info see PPR for AMD Family 19h, Model 01h, Revision B1 Processors, doc ID 55898 at https://bugzilla.kernel.org/show_bug.cgi?id=206537. Therefore, probe for ignored writes to MCA_STATUS to determine if hardware error injection is at all possible. [ bp: Heavily massage commit message and patch. ] Signed-off-by: Smita Koralahalli <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-05-27Merge tag 'libnvdimm-for-5.19' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm and DAX updates from Dan Williams: "New support for clearing memory errors when a file is in DAX mode, alongside with some other fixes and cleanups. Previously it was only possible to clear these errors using a truncate or hole-punch operation to trigger the filesystem to reallocate the block, now, any page aligned write can opportunistically clear errors as well. This change spans x86/mm, nvdimm, and fs/dax, and has received the appropriate sign-offs. Thanks to Jane for her work on this. Summary: - Add support for clearing memory error via pwrite(2) on DAX - Fix 'security overwrite' support in the presence of media errors - Miscellaneous cleanups and fixes for nfit_test (nvdimm unit tests)" * tag 'libnvdimm-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: pmem: implement pmem_recovery_write() pmem: refactor pmem_clear_poison() dax: add .recovery_write dax_operation dax: introduce DAX_RECOVERY_WRITE dax access mode mce: fix set_mce_nospec to always unmap the whole page x86/mce: relocate set{clear}_mce_nospec() functions acpi/nfit: rely on mce->misc to determine poison granularity testing: nvdimm: asm/mce.h is not needed in nfit.c testing: nvdimm: iomap: make __nfit_test_ioremap a macro nvdimm: Allow overwrite in the presence of disabled dimms tools/testing/nvdimm: remove unneeded flush_workqueue
2022-05-24Merge tag 'acpi-5.19-rc1' of ↵Linus Torvalds1-5/+3
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI updates from Rafael Wysocki: "These update the ACPICA kernel code to upstream revision 20220331, improve handling of PCI devices that are in D3cold during system initialization, add support for a few features, fix bugs and clean up code. Specifics: - Update ACPICA code in the kernel to upstream revision 20220331 including the following changes: - Add support for the Windows 11 _OSI string (Mario Limonciello) - Add the CFMWS subtable to the CEDT table (Lawrence Hileman). - iASL: NHLT: Treat Terminator as specific_config (Piotr Maziarz). - iASL: NHLT: Fix parsing undocumented bytes at the end of Endpoint Descriptor (Piotr Maziarz). - iASL: NHLT: Rename linux specific strucures to device_info (Piotr Maziarz). - Add new ACPI 6.4 semantics to Load() and LoadTable() (Bob Moore). - Clean up double word in comment (Tom Rix). - Update copyright notices to the year 2022 (Bob Moore). - Remove some tabs and // comments - automated cleanup (Bob Moore). - Replace zero-length array with flexible-array member (Gustavo A. R. Silva). - Interpreter: Add units to time variable names (Paul Menzel). - Add support for ARM Performance Monitoring Unit Table (Besar Wicaksono). - Inform users about ACPI spec violation related to sleep length (Paul Menzel). - iASL/MADT: Add OEM-defined subtable (Bob Moore). - Interpreter: Fix some typo mistakes (Selvarasu Ganesan). - Updates for revision E.d of IORT (Shameer Kolothum). - Use ACPI_FORMAT_UINT64 for 64-bit output (Bob Moore). - Improve debug messages in the ACPI device PM code (Rafael Wysocki). - Block ASUS B1400CEAE from suspend to idle by default (Mario Limonciello). - Improve handling of PCI devices that are in D3cold during system initialization (Rafael Wysocki). - Fix BERT error region memory mapping (Lorenzo Pieralisi). - Add support for NVIDIA 16550-compatible port subtype to the SPCR parsing code (Jeff Brasen). - Use static for BGRT_SHOW kobj_attribute defines (Tom Rix). - Fix missing prototype warning for acpi_agdi_init() (Ilkka Koskinen). - Fix missing ERST record ID in the APEI code (Liu Xinpeng). - Make APEI error injection to refuse to inject into the zero page (Tony Luck). - Correct description of INT3407 / INT3532 DPTF attributes in sysfs (Sumeet Pawnikar). - Add support for high frequency impedance notification to the DPTF driver (Sumeet Pawnikar). - Make mp_config_acpi_gsi() a void function (Li kunyu). - Unify Package () representation for properties in the ACPI device properties documentation (Andy Shevchenko). - Include UUID in _DSM evaluation warning (Michael Niewöhner)" * tag 'acpi-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (41 commits) Revert "ACPICA: executer/exsystem: Warn about sleeps greater than 10 ms" ACPI: utils: include UUID in _DSM evaluation warning ACPI: PM: Block ASUS B1400CEAE from suspend to idle by default x86: ACPI: Make mp_config_acpi_gsi() a void function ACPI: DPTF: Add support for high frequency impedance notification ACPI: AGDI: Fix missing prototype warning for acpi_agdi_init() ACPI: bus: Avoid non-ACPI device objects in walks over children ACPI: DPTF: Correct description of INT3407 / INT3532 attributes ACPI: BGRT: use static for BGRT_SHOW kobj_attribute defines ACPI, APEI, EINJ: Refuse to inject into the zero page ACPI: PM: Always print final debug message in acpi_device_set_power() ACPI: SPCR: Add support for NVIDIA 16550-compatible port subtype ACPI: docs: enumeration: Unify Package () for properties (part 2) ACPI: APEI: Fix missing ERST record id ACPICA: Update version to 20220331 ACPICA: exsystem.c: Use ACPI_FORMAT_UINT64 for 64-bit output ACPICA: IORT: Updates for revision E.d ACPICA: executer/exsystem: Fix some typo mistakes ACPICA: iASL/MADT: Add OEM-defined subtable ACPICA: executer/exsystem: Warn about sleeps greater than 10 ms ...
2022-05-16mce: fix set_mce_nospec to always unmap the whole pageJane Chu1-3/+3
The set_memory_uc() approach doesn't work well in all cases. As Dan pointed out when "The VMM unmapped the bad page from guest physical space and passed the machine check to the guest." "The guest gets virtual #MC on an access to that page. When the guest tries to do set_memory_uc() and instructs cpa_flush() to do clean caches that results in taking another fault / exception perhaps because the VMM unmapped the page from the guest." Since the driver has special knowledge to handle NP or UC, mark the poisoned page with NP and let driver handle it when it comes down to repair. Please refer to discussions here for more details. https://lore.kernel.org/all/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com/ Now since poisoned page is marked as not-present, in order to avoid writing to a not-present page and trigger kernel Oops, also fix pmem_do_write(). Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()") Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Dan Williams <[email protected]> Signed-off-by: Jane Chu <[email protected]> Acked-by: Tony Luck <[email protected]> Link: https://lore.kernel.org/r/165272615484.103830.2563950688772226611.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <[email protected]>
2022-04-25x86/mce: Add messages for panic errors in AMD's MCE gradingCarlos Bilbao1-1/+10
When a machine error is graded as PANIC by the AMD grading logic, the MCE handler calls mce_panic(). The notification chain does not come into effect so the AMD EDAC driver does not decode the errors. In these cases, the messages displayed to the user are more cryptic and miss information that might be relevant, like the context in which the error took place. Add messages to the grading logic for machine errors so that it is clear what error it was. [ bp: Massage commit message. ] Signed-off-by: Carlos Bilbao <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Yazen Ghannam <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-25x86/mce: Simplify AMD severity grading logicCarlos Bilbao1-65/+36
The MCE handler needs to understand the severity of the machine errors to act accordingly. Simplify the AMD grading logic following a logic that closely resembles the descriptions of the public PPR documents. This will help include more fine-grained grading of errors in the future. [ bp: Touchups. ] Signed-off-by: Carlos Bilbao <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Yazen Ghannam <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-13ACPI: APEI: Fix missing ERST record idLiu Xinpeng1-5/+3
Read a record is cleared by others, but the deleted record cache entry is still created by erst_get_record_id_next. When next enumerate the records, get the cached deleted record, then erst_read() return -ENOENT and try to get next record, loop back to first ID will return 0 in function __erst_record_id_cache_add_one and then set record_id as APEI_ERST_INVALID_RECORD_ID, finished this time read operation. It will result in read the records just in the cache hereafter. This patch cleared the deleted record cache, fix the issue that "./erst-inject -p" shows record counts not equal to "./erst-inject -n". A reproducer of the problem(retry many times): [root@localhost erst-inject]# ./erst-inject -c 0xaaaaa00011 [root@localhost erst-inject]# ./erst-inject -p rc: 273 rcd sig: CPER rcd id: 0xaaaaa00012 rc: 273 rcd sig: CPER rcd id: 0xaaaaa00013 rc: 273 rcd sig: CPER rcd id: 0xaaaaa00014 [root@localhost erst-inject]# ./erst-inject -i 0xaaaaa000006 [root@localhost erst-inject]# ./erst-inject -i 0xaaaaa000007 [root@localhost erst-inject]# ./erst-inject -i 0xaaaaa000008 [root@localhost erst-inject]# ./erst-inject -p rc: 273 rcd sig: CPER rcd id: 0xaaaaa00012 rc: 273 rcd sig: CPER rcd id: 0xaaaaa00013 rc: 273 rcd sig: CPER rcd id: 0xaaaaa00014 [root@localhost erst-inject]# ./erst-inject -n total error record count: 6 Signed-off-by: Liu Xinpeng <[email protected]> Reviewed-by: Tony Luck <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2022-04-05x86/MCE/AMD: Fix memory leak when threshold_create_bank() failsAmmar Faizi1-13/+19
In mce_threshold_create_device(), if threshold_create_bank() fails, the previously allocated threshold banks array @bp will be leaked because the call to mce_threshold_remove_device() will not free it. This happens because mce_threshold_remove_device() fetches the pointer through the threshold_banks per-CPU variable but bp is written there only after the bank creation is successful, and not before, when threshold_create_bank() fails. Add a helper which unwinds all the bank creation work previously done and pass into it the previously allocated threshold banks array for freeing. [ bp: Massage. ] Fixes: 6458de97fc15 ("x86/mce/amd: Straighten CPU hotplug path") Co-developed-by: Alviro Iskandar Setiawan <[email protected]> Signed-off-by: Alviro Iskandar Setiawan <[email protected]> Co-developed-by: Yazen Ghannam <[email protected]> Signed-off-by: Yazen Ghannam <[email protected]> Signed-off-by: Ammar Faizi <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Cc: <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-05x86/mce: Avoid unnecessary padding in struct mce_bankSmita Koralahalli1-1/+3
Convert struct mce_bank member "init" from bool to a bitfield to get rid of unnecessary padding. $ pahole -C mce_bank arch/x86/kernel/cpu/mce/core.o before: /* size: 16, cachelines: 1, members: 2 */ /* padding: 7 */ /* last cacheline: 16 bytes */ after: /* size: 16, cachelines: 1, members: 3 */ /* last cacheline: 16 bytes */ No functional changes. Signed-off-by: Smita Koralahalli <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-03-25Merge tag 'ras_core_for_v5.18_rc1' of ↵Linus Torvalds3-90/+139
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS updates from Borislav Petkov: - More noinstr fixes - Add an erratum workaround for Intel CPUs which, in certain circumstances, end up consuming an unrelated uncorrectable memory error when using fast string copy insns - Remove the MCE tolerance level control as it is not really needed or used anymore * tag 'ras_core_for_v5.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Remove the tolerance level control x86/mce: Work around an erratum on fast string copy instructions x86/mce: Use arch atomic and bit helpers
2022-03-22Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-3/+5
Merge updates from Andrew Morton: - A few misc subsystems: kthread, scripts, ntfs, ocfs2, block, and vfs - Most the MM patches which precede the patches in Willy's tree: kasan, pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap, sparsemem, vmalloc, pagealloc, memory-failure, mlock, hugetlb, userfaultfd, vmscan, compaction, mempolicy, oom-kill, migration, thp, cma, autonuma, psi, ksm, page-poison, madvise, memory-hotplug, rmap, zswap, uaccess, ioremap, highmem, cleanups, kfence, hmm, and damon. * emailed patches from Andrew Morton <[email protected]>: (227 commits) mm/damon/sysfs: remove repeat container_of() in damon_sysfs_kdamond_release() Docs/ABI/testing: add DAMON sysfs interface ABI document Docs/admin-guide/mm/damon/usage: document DAMON sysfs interface selftests/damon: add a test for DAMON sysfs interface mm/damon/sysfs: support DAMOS stats mm/damon/sysfs: support DAMOS watermarks mm/damon/sysfs: support schemes prioritization mm/damon/sysfs: support DAMOS quotas mm/damon/sysfs: support DAMON-based Operation Schemes mm/damon/sysfs: support the physical address space monitoring mm/damon/sysfs: link DAMON for virtual address spaces monitoring mm/damon: implement a minimal stub for sysfs-based DAMON interface mm/damon/core: add number of each enum type values mm/damon/core: allow non-exclusive DAMON start/stop Docs/damon: update outdated term 'regions update interval' Docs/vm/damon/design: update DAMON-Idle Page Tracking interference handling Docs/vm/damon: call low level monitoring primitives the operations mm/damon: remove unnecessary CONFIG_DAMON option mm/damon/paddr,vaddr: remove damon_{p,v}a_{target_valid,set_operations}() mm/damon/dbgfs-test: fix is_target_id() change ...
2022-03-22mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handlerluofei1-3/+5
When the hwpoison page meets the filter conditions, it should not be regarded as successful memory_failure() processing for mce handler, but should return a distinct value, otherwise mce handler regards the error page has been identified and isolated, which may lead to calling set_mce_nospec() to change page attribute, etc. Here memory_failure() return -EOPNOTSUPP to indicate that the error event is filtered, mce handler should not take any action for this situation and hwpoison injector should treat as correct. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: luofei <[email protected]> Acked-by: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tony Luck <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-02-23x86/mce: Remove the tolerance level controlBorislav Petkov3-47/+30
This is pretty much unused and not really useful. What is more, all relevant MCA hardware has recoverable machine checks support so there's no real need to tweak MCA tolerance levels in order to *maybe* extend machine lifetime. So rip it out. Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/YcDq8PxvKtTENl/[email protected]
2022-02-19x86/mce: Work around an erratum on fast string copy instructionsJue Wang2-1/+68
A rare kernel panic scenario can happen when the following conditions are met due to an erratum on fast string copy instructions: 1) An uncorrected error. 2) That error must be in first cache line of a page. 3) Kernel must execute page_copy from the page immediately before that page. The fast string copy instructions ("REP; MOVS*") could consume an uncorrectable memory error in the cache line _right after_ the desired region to copy and raise an MCE. Bit 0 of MSR_IA32_MISC_ENABLE can be cleared to disable fast string copy and will avoid such spurious machine checks. However, that is less preferable due to the permanent performance impact. Considering memory poison is rare, it's desirable to keep fast string copy enabled until an MCE is seen. Intel has confirmed the following: 1. The CPU erratum of fast string copy only applies to Skylake, Cascade Lake and Cooper Lake generations. Directly return from the MCE handler: 2. Will result in complete execution of the "REP; MOVS*" with no data loss or corruption. 3. Will not result in another MCE firing on the next poisoned cache line due to "REP; MOVS*". 4. Will resume execution from a correct point in code. 5. Will result in the same instruction that triggered the MCE firing a second MCE immediately for any other software recoverable data fetch errors. 6. Is not safe without disabling the fast string copy, as the next fast string copy of the same buffer on the same CPU would result in a PANIC MCE. This should mitigate the erratum completely with the only caveat that the fast string copy is disabled on the affected hyper thread thus performance degradation. This is still better than the OS crashing on MCEs raised on an irrelevant process due to "REP; MOVS*' accesses in a kernel context, e.g., copy_page. Tested: Injected errors on 1st cache line of 8 anonymous pages of process 'proc1' and observed MCE consumption from 'proc2' with no panic (directly returned). Without the fix, the host panicked within a few minutes on a random 'proc2' process due to kernel access from copy_page. [ bp: Fix comment style + touch ups, zap an unlikely(), improve the quirk function's readability. ] Signed-off-by: Jue Wang <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Tony Luck <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-02-13x86/mce: Use arch atomic and bit helpersBorislav Petkov3-42/+41
The arch helpers do not have explicit KASAN instrumentation. Use them in noinstr code. Inline a couple more functions with single call sites, while at it: mce_severity_amd_smca() has a single call-site which is noinstr so force the inlining and fix: vmlinux.o: warning: objtool: mce_severity_amd.constprop.0()+0xca: call to \ mce_severity_amd_smca() leaves .noinstr.text section Always inline mca_msr_reg(): text data bss dec hex filename 16065240 128031326 36405368 180501934 ac23dae vmlinux.before 16065240 128031294 36405368 180501902 ac23d8e vmlinux.after and mce_no_way_out() as the latter one is used only once, to fix: vmlinux.o: warning: objtool: mce_read_aux()+0x53: call to mca_msr_reg() leaves .noinstr.text section vmlinux.o: warning: objtool: do_machine_check()+0xc9: call to mce_no_way_out() leaves .noinstr.text section Signed-off-by: Borislav Petkov <[email protected]> Acked-by: Marco Elver <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-02-01x86/cpu: Read/save PPIN MSR during initializationTony Luck1-6/+1
Currently, the PPIN (Protected Processor Inventory Number) MSR is read by every CPU that processes a machine check, CMCI, or just polls machine check banks from a periodic timer. This is not a "fast" MSR, so this adds to overhead of processing errors. Add a new "ppin" field to the cpuinfo_x86 structure. Read and save the PPIN during initialization. Use this copy in mce_setup() instead of reading the MSR. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-02-01x86/cpu: Merge Intel and AMD ppin_init() functionsTony Luck1-42/+0
The code to decide whether a system supports the PPIN (Protected Processor Inventory Number) MSR was cloned from the Intel implementation. Apart from the X86_FEATURE bit and the MSR numbers it is identical. Merge the two functions into common x86 code, but use x86_match_cpu() instead of the switch (c->x86_model) that was used by the old Intel code. No functional change. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-02-01x86/CPU/AMD: Use default_groups in kobj_typeGreg Kroah-Hartman1-3/+4
There are currently 2 ways to create a set of sysfs files for a kobj_type, through the default_attrs field, and the default_groups field. Move the AMD mce sysfs code to use default_groups field which has been the preferred way since aa30f47cf666 ("kobject: Add support for default attribute groups to kobj_type") so that the obsolete default_attrs field can be removed soon. Signed-off-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Tested-by: Yazen Ghannam <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-01-25x86/cpu: Add Xeon Icelake-D to list of CPUs that support PPINTony Luck1-0/+1
Missed adding the Icelake-D CPU to the list. It uses the same MSRs to control and read the inventory number as all the other models. Fixes: dc6b025de95b ("x86/mce: Add Xeon Icelake to list of CPUs that support PPIN") Reported-by: Ailin Xu <[email protected]> Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Cc: <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-01-23x86/MCE/AMD: Allow thresholding interface updates after initYazen Ghannam1-1/+1
Changes to the AMD Thresholding sysfs code prevents sysfs writes from updating the underlying registers once CPU init is completed, i.e. "threshold_banks" is set. Allow the registers to be updated if the thresholding interface is already initialized or if in the init path. Use the "set_lvt_off" value to indicate if running in the init path, since this value is only set during init. Fixes: a037f3ca0ea0 ("x86/mce/amd: Make threshold bank setting hotplug robust") Signed-off-by: Yazen Ghannam <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Cc: <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-12-28x86/mce/inject: Avoid out-of-bounds write when setting flagsZhang Zixun1-1/+1
A contrived zero-length write, for example, by using write(2): ... ret = write(fd, str, 0); ... to the "flags" file causes: BUG: KASAN: stack-out-of-bounds in flags_write Write of size 1 at addr ffff888019be7ddf by task writefile/3787 CPU: 4 PID: 3787 Comm: writefile Not tainted 5.16.0-rc7+ #12 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 due to accessing buf one char before its start. Prevent such out-of-bounds access. [ bp: Productize into a proper patch. Link below is the next best thing because the original mail didn't get archived on lore. ] Fixes: 0451d14d0561 ("EDAC, mce_amd_inj: Modify flags attribute to use string arguments") Signed-off-by: Zhang Zixun <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/linux-edac/[email protected]/
2021-12-22x86/MCE/AMD, EDAC/mce_amd: Support non-uniform MCA bank type enumerationYazen Ghannam1-24/+35
AMD systems currently lay out MCA bank types such that the type of bank number "i" is either the same across all CPUs or is Reserved/Read-as-Zero. For example: Bank # | CPUx | CPUy 0 LS LS 1 RAZ UMC 2 CS CS 3 SMU RAZ Future AMD systems will lay out MCA bank types such that the type of bank number "i" may be different across CPUs. For example: Bank # | CPUx | CPUy 0 LS LS 1 RAZ UMC 2 CS NBIO 3 SMU RAZ Change the structures that cache MCA bank types to be per-CPU and update smca_get_bank_type() to handle this change. Move some SMCA-specific structures to amd.c from mce.h, since they no longer need to be global. Break out the "count" for bank types from struct smca_hwid, since this should provide a per-CPU count rather than a system-wide count. Apply the "const" qualifier to the struct smca_hwid_mcatypes array. The values in this array should not change at runtime. Signed-off-by: Yazen Ghannam <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-12-22x86/MCE/AMD, EDAC/mce_amd: Add new SMCA bank typesYazen Ghannam1-5/+16
Add HWID and McaType values for new SMCA bank types, and add their error descriptions to edac_mce_amd. The "PHY" bank types all have the same error descriptions, and the NBIF and SHUB bank types have the same error descriptions. So reuse the same arrays where appropriate. [ bp: Remove useless comments over hwid types. ] Signed-off-by: Yazen Ghannam <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-12-20x86/mce: Check regs before accessing itBorislav Petkov1-1/+4
Commit in Fixes accesses pt_regs before checking whether it is NULL or not. Make sure the NULL pointer check happens first. Fixes: 0a5b288e85bb ("x86/mce: Prevent severity computation from being instrumented") Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Tony Luck <[email protected]> Link: https://lore.kernel.org/r/20211217102029.GA29708@kili
2021-12-13x86/mce: Mark mce_start() noinstrBorislav Petkov1-6/+14
Fixes vmlinux.o: warning: objtool: do_machine_check()+0x4ae: call to __const_udelay() leaves .noinstr.text section Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-12-13x86/mce: Mark mce_timed_out() noinstrBorislav Petkov1-3/+13
Fixes vmlinux.o: warning: objtool: do_machine_check()+0x482: call to mce_timed_out() leaves .noinstr.text section Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-12-13x86/mce: Move the tainting outside of the noinstr regionBorislav Petkov1-15/+26
add_taint() is yet another external facility which the #MC handler calls. Move that tainting call into the instrumentation-allowed part of the handler. Fixes vmlinux.o: warning: objtool: do_machine_check()+0x617: call to add_taint() leaves .noinstr.text section While at it, allow instrumentation around the mce_log() call. Fixes vmlinux.o: warning: objtool: do_machine_check()+0x690: call to mce_log() leaves .noinstr.text section Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-12-13x86/mce: Mark mce_read_aux() noinstrBorislav Petkov1-1/+1
Fixes vmlinux.o: warning: objtool: do_machine_check()+0x681: call to mce_read_aux() leaves .noinstr.text section Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-12-13x86/mce: Mark mce_end() noinstrBorislav Petkov1-3/+11
It is called by the #MC handler which is noinstr. Fixes vmlinux.o: warning: objtool: do_machine_check()+0xbd6: call to memset() leaves .noinstr.text section Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-12-13x86/mce: Mark mce_panic() noinstrBorislav Petkov1-3/+12
And allow instrumentation inside it because it does calls to other facilities which will not be tagged noinstr. Fixes vmlinux.o: warning: objtool: do_machine_check()+0xc73: call to mce_panic() leaves .noinstr.text section Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-12-13x86/mce: Prevent severity computation from being instrumentedBorislav Petkov1-9/+21
Mark all the MCE severity computation logic noinstr and allow instrumentation when it "calls out". Fixes vmlinux.o: warning: objtool: do_machine_check()+0xc5d: call to mce_severity() leaves .noinstr.text section Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-12-13x86/mce: Allow instrumentation during task work queueingBorislav Petkov1-0/+11
Fixes vmlinux.o: warning: objtool: do_machine_check()+0xdb1: call to queue_task_work() leaves .noinstr.text section Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-12-13x86/mce: Remove noinstr annotation from mce_setup()Borislav Petkov1-6/+20
Instead, sandwitch around the call which is done in noinstr context and mark the caller - mce_gather_info() - as noinstr. Also, document what the whole instrumentation strategy with #MC is going to be in the future and where it all is supposed to be going to. Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-12-13x86/mce: Use mce_rdmsrl() in severity checking codeBorislav Petkov3-6/+6
MCA has its own special MSR accessors. Use them. No functional changes. Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-12-13x86/mce: Remove function-local cpus variablesBorislav Petkov1-6/+2
Use num_online_cpus() directly. No functional changes. Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]