aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2022-01-07x86/sgx: Fix NULL pointer dereference on non-SGX systemsDave Hansen1-18/+47
== Problem == Nathan Chancellor reported an oops when aceessing the 'sgx_total_bytes' sysfs file: https://lore.kernel.org/all/YbzhBrimHGGpddDM@archlinux-ax161/ The sysfs output code accesses the sgx_numa_nodes[] array unconditionally. However, this array is allocated during SGX initialization, which only occurs on systems where SGX is supported. If the sysfs file is accessed on systems without SGX support, sgx_numa_nodes[] is NULL and an oops occurs. == Solution == To fix this, hide the entire nodeX/x86/ attribute group on systems without SGX support using the ->is_visible attribute group callback. Unfortunately, SGX is initialized via a device_initcall() which occurs _after_ the ->is_visible() callback. Instead of moving SGX initialization earlier, call sysfs_update_group() during SGX initialization to update the group visiblility. This update requires moving the SGX sysfs code earlier in sgx/main.c. There are no code changes other than the addition of arch_update_sysfs_visibility() and a minor whitespace fixup to arch_node_attr_is_visible() which checkpatch caught. CC: Greg Kroah-Hartman <[email protected]> Cc: [email protected] Cc: [email protected] Fixes: 50468e431335 ("x86/sgx: Add an attribute for the amount of SGX memory in a NUMA node") Reported-by: Nathan Chancellor <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Reviewed-by: Jarkko Sakkinen <[email protected]> Tested-by: Nathan Chancellor <[email protected]> Tested-by: Jarkko Sakkinen <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-12-17selftests/sgx: Fix corrupted cpuid macro invocationJarkko Sakkinen1-3/+2
The SGX selftest fails to build on tip/x86/sgx: main.c: In function ‘get_total_epc_mem’: main.c:296:17: error: implicit declaration of function ‘__cpuid’ [-Werror=implicit-function-declaration] 296 | __cpuid(&eax, &ebx, &ecx, &edx); | ^~~~~~~ Include cpuid.h and use __cpuid_count() macro in order to fix the compilation issue. [ dhansen: tweak commit message ] Fixes: f0ff2447b861 ("selftests/sgx: Add a new kselftest: Unclobbered_vdso_oversubscribed") Signed-off-by: Jarkko Sakkinen <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Acked-by: Reinette Chatre <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Cc: Shuah Khan <[email protected]>
2021-12-09x86/sgx: Add an attribute for the amount of SGX memory in a NUMA nodeJarkko Sakkinen7-0/+39
== Problem == The amount of SGX memory on a system is determined by the BIOS and it varies wildly between systems. It can be as small as dozens of MB's and as large as many GB's on servers. Just like how applications need to know how much regular RAM is available, enclave builders need to know how much SGX memory an enclave can consume. == Solution == Introduce a new sysfs file: /sys/devices/system/node/nodeX/x86/sgx_total_bytes to enumerate the amount of SGX memory available in each NUMA node. This serves the same function for SGX as /proc/meminfo or /sys/devices/system/node/nodeX/meminfo does for normal RAM. 'sgx_total_bytes' is needed today to help drive the SGX selftests. SGX-specific swap code is exercised by creating overcommitted enclaves which are larger than the physical SGX memory on the system. They currently use a CPUID-based approach which can diverge from the actual amount of SGX memory available. 'sgx_total_bytes' ensures that the selftests can work efficiently and do not attempt stupid things like creating a 100,000 MB enclave on a system with 128 MB of SGX memory. == Implementation Details == Introduce CONFIG_HAVE_ARCH_NODE_DEV_GROUP opt-in flag to expose an arch specific attribute group, and add an attribute for the amount of SGX memory in bytes to each NUMA node: == ABI Design Discussion == As opposed to the per-node ABI, a single, global ABI was considered. However, this would prevent enclaves from being able to size themselves so that they fit on a single NUMA node. Essentially, a single value would rule out NUMA optimizations for enclaves. Create a new "x86/" directory inside each "nodeX/" sysfs directory. 'sgx_total_bytes' is expected to be the first of at least a few sgx-specific files to be placed in the new directory. Just scanning /proc/meminfo, these are the no-brainers that we have for RAM, but we need for SGX: MemTotal: xxxx kB // sgx_total_bytes (implemented here) MemFree: yyyy kB // sgx_free_bytes SwapTotal: zzzz kB // sgx_swapped_bytes So, at *least* three. I think we will eventually end up needing something more along the lines of a dozen. A new directory (as opposed to being in the nodeX/ "root") directory avoids cluttering the root with several "sgx_*" files. Place the new file in a new "nodeX/x86/" directory because SGX is highly x86-specific. It is very unlikely that any other architecture (or even non-Intel x86 vendor) will ever implement SGX. Using "sgx/" as opposed to "x86/" was also considered. But, there is a real chance this can get used for other arch-specific purposes. [ dhansen: rewrite changelog ] Signed-off-by: Jarkko Sakkinen <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Acked-by: Greg Kroah-Hartman <[email protected]> Acked-by: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-11-19Merge branch 'x86/urgent' into x86/sgx, to resolve conflictIngo Molnar2-33/+45
Conflicts: arch/x86/kernel/cpu/sgx/main.c Signed-off-by: Ingo Molnar <[email protected]>
2021-11-17x86/sgx: Fix minor documentation issuesReinette Chatre1-7/+7
The SGX documentation has a few repeated or one-off issues: * Remove capitalization from regular words in the middle of a sentence. * Remove punctuation found in the middle of a sentence. * Fix name of SGX daemon to consistently be ksgxd. * Fix typo of SGX instruction: ENIT -> EINIT [ dhansen: tweaked subject and changelog ] Signed-off-by: Reinette Chatre <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Jarkko Sakkinen <[email protected]> Link: https://lkml.kernel.org/r/ab99a87368eef69e3fb96f073368becff3eff874.1635529506.git.reinette.chatre@intel.com
2021-11-16x86/sgx: Fix free page accountingReinette Chatre1-6/+6
The SGX driver maintains a single global free page counter, sgx_nr_free_pages, that reflects the number of free pages available across all NUMA nodes. Correspondingly, a list of free pages is associated with each NUMA node and sgx_nr_free_pages is updated every time a page is added or removed from any of the free page lists. The main usage of sgx_nr_free_pages is by the reclaimer that runs when it (sgx_nr_free_pages) goes below a watermark to ensure that there are always some free pages available to, for example, support efficient page faults. With sgx_nr_free_pages accessed and modified from a few places it is essential to ensure that these accesses are done safely but this is not the case. sgx_nr_free_pages is read without any protection and updated with inconsistent protection by any one of the spin locks associated with the individual NUMA nodes. For example: CPU_A CPU_B ----- ----- spin_lock(&nodeA->lock); spin_lock(&nodeB->lock); ... ... sgx_nr_free_pages--; /* NOT SAFE */ sgx_nr_free_pages--; spin_unlock(&nodeA->lock); spin_unlock(&nodeB->lock); Since sgx_nr_free_pages may be protected by different spin locks while being modified from different CPUs, the following scenario is possible: CPU_A CPU_B ----- ----- {sgx_nr_free_pages = 100} spin_lock(&nodeA->lock); spin_lock(&nodeB->lock); sgx_nr_free_pages--; sgx_nr_free_pages--; /* LOAD sgx_nr_free_pages = 100 */ /* LOAD sgx_nr_free_pages = 100 */ /* sgx_nr_free_pages-- */ /* sgx_nr_free_pages-- */ /* STORE sgx_nr_free_pages = 99 */ /* STORE sgx_nr_free_pages = 99 */ spin_unlock(&nodeA->lock); spin_unlock(&nodeB->lock); In the above scenario, sgx_nr_free_pages is decremented from two CPUs but instead of sgx_nr_free_pages ending with a value that is two less than it started with, it was only decremented by one while the number of free pages were actually reduced by two. The consequence of sgx_nr_free_pages not being protected is that its value may not accurately reflect the actual number of free pages on the system, impacting the availability of free pages in support of many flows. The problematic scenario is when the reclaimer does not run because it believes there to be sufficient free pages while any attempt to allocate a page fails because there are no free pages available. In the SGX driver the reclaimer's watermark is only 32 pages so after encountering the above example scenario 32 times a user space hang is possible when there are no more free pages because of repeated page faults caused by no free pages made available. The following flow was encountered: asm_exc_page_fault ... sgx_vma_fault() sgx_encl_load_page() sgx_encl_eldu() // Encrypted page needs to be loaded from backing // storage into newly allocated SGX memory page sgx_alloc_epc_page() // Allocate a page of SGX memory __sgx_alloc_epc_page() // Fails, no free SGX memory ... if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) // Wake reclaimer wake_up(&ksgxd_waitq); return -EBUSY; // Return -EBUSY giving reclaimer time to run return -EBUSY; return -EBUSY; return VM_FAULT_NOPAGE; The reclaimer is triggered in above flow with the following code: static bool sgx_should_reclaim(unsigned long watermark) { return sgx_nr_free_pages < watermark && !list_empty(&sgx_active_page_list); } In the problematic scenario there were no free pages available yet the value of sgx_nr_free_pages was above the watermark. The allocation of SGX memory thus always failed because of a lack of free pages while no free pages were made available because the reclaimer is never started because of sgx_nr_free_pages' incorrect value. The consequence was that user space kept encountering VM_FAULT_NOPAGE that caused the same address to be accessed repeatedly with the same result. Change the global free page counter to an atomic type that ensures simultaneous updates are done safely. While doing so, move the updating of the variable outside of the spin lock critical section to which it does not belong. Cc: [email protected] Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()") Suggested-by: Dave Hansen <[email protected]> Signed-off-by: Reinette Chatre <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Tony Luck <[email protected]> Acked-by: Jarkko Sakkinen <[email protected]> Link: https://lkml.kernel.org/r/a95a40743bbd3f795b465f30922dde7f1ea9e0eb.1637004094.git.reinette.chatre@intel.com
2021-11-15selftests/sgx: Add test for multiple TCS entryReinette Chatre3-0/+39
Each thread executing in an enclave is associated with a Thread Control Structure (TCS). The SGX test enclave contains two hardcoded TCS, thus supporting two threads in the enclave. Add a test to ensure it is possible to enter enclave at both entrypoints. Signed-off-by: Reinette Chatre <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Jarkko Sakkinen <[email protected]> Acked-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/7be151a57b4c7959a2364753b995e0006efa3da1.1636997631.git.reinette.chatre@intel.com
2021-11-15selftests/sgx: Enable multiple thread supportReinette Chatre1-7/+14
Each thread executing in an enclave is associated with a Thread Control Structure (TCS). The test enclave contains two hardcoded TCS. Each TCS contains meta-data used by the hardware to save and restore thread specific information when entering/exiting the enclave. The two TCS structures within the test enclave share their SSA (State Save Area) resulting in the threads clobbering each other's data. Fix this by providing each TCS their own SSA area. Additionally, there is an 8K stack space and its address is computed from the enclave entry point which is correctly done for TCS #1 that starts on the first address inside the enclave but results in out of bounds memory when entering as TCS #2. Split 8K stack space into two separate pages with offset symbol between to ensure the current enclave entry calculation can continue to be used for both threads. While using the enclave with multiple threads requires these fixes the impact is not apparent because every test up to this point enters the enclave from the first TCS. More detail about the stack fix: ------------------------------- Before this change the test enclave (test_encl) looks as follows: .tcs (2 pages): (page 1) TCS #1 (page 2) TCS #2 .text (1 page) One page of code .data (5 pages) (page 1) encl_buffer (page 2) encl_buffer (page 3) SSA (page 4 and 5) STACK encl_stack: As shown above there is a symbol, encl_stack, that points to the end of the .data segment (pointing to the end of page 5 in .data) which is also the end of the enclave. The enclave entry code computes the stack address by adding encl_stack to the pointer to the TCS that entered the enclave. When entering at TCS #1 the stack is computed correctly but when entering at TCS #2 the stack pointer would point to one page beyond the end of the enclave and a #PF would result when TCS #2 attempts to enter the enclave. The fix involves moving the encl_stack symbol between the two stack pages. Doing so enables the stack address computation in the entry code to compute the correct stack address for each TCS. Signed-off-by: Reinette Chatre <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Jarkko Sakkinen <[email protected]> Acked-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/a49dc0d85401db788a0a3f0d795e848abf3b1f44.1636997631.git.reinette.chatre@intel.com
2021-11-15selftests/sgx: Add page permission and exception testReinette Chatre3-0/+169
The Enclave Page Cache Map (EPCM) is a secure structure used by the processor to track the contents of the enclave page cache. The EPCM contains permissions with which enclave pages can be accessed. SGX support allows EPCM and PTE page permissions to differ - as long as the PTE permissions do not exceed the EPCM permissions. Add a test that: (1) Creates an SGX enclave page with writable EPCM permission. (2) Changes the PTE permission on the page to read-only. This should be permitted because the permission does not exceed the EPCM permission. (3) Attempts a write to the page. This should generate a page fault (#PF) because of the read-only PTE even though the EPCM permissions allow the page to be written to. This introduces the first test of SGX exception handling. In this test the issue that caused the exception (PTE page permissions) can be fixed from outside the enclave and after doing so it is possible to re-enter enclave at original entrypoint with ERESUME. Signed-off-by: Reinette Chatre <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Jarkko Sakkinen <[email protected]> Acked-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/3bcc73a4b9fe8780bdb40571805e7ced59e01df7.1636997631.git.reinette.chatre@intel.com
2021-11-15selftests/sgx: Rename test properties in preparation for more enclave testsReinette Chatre3-26/+26
SGX selftests prepares a data structure outside of the enclave with the type of and data for the operation that needs to be run within the enclave. At this time only two complementary operations are supported by the enclave: copying a value from outside the enclave into a default buffer within the enclave and reading a value from the enclave's default buffer into a variable accessible outside the enclave. In preparation for more operations supported by the enclave the names of the current enclave operations are changed to more accurately reflect the operations and more easily distinguish it from future operations: * The enums ENCL_OP_PUT and ENCL_OP_GET are renamed to ENCL_OP_PUT_TO_BUFFER and ENCL_OP_GET_FROM_BUFFER respectively. * The structs encl_op_put and encl_op_get are renamed to encl_op_put_to_buf and encl_op_get_from_buf respectively. * The enclave functions do_encl_op_put and do_encl_op_get are renamed to do_encl_op_put_to_buf and do_encl_op_get_from_buf respectively. No functional changes. Suggested-by: Jarkko Sakkinen <[email protected]> Signed-off-by: Reinette Chatre <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Acked-by: Jarkko Sakkinen <[email protected]> Acked-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/023fda047c787cf330b88ed9337705edae6a0078.1636997631.git.reinette.chatre@intel.com
2021-11-15selftests/sgx: Provide per-op parameter structs for the test enclaveJarkko Sakkinen3-46/+69
To add more operations to the test enclave, the protocol needs to allow to have operations with varying parameters. Create a separate parameter struct for each existing operation, with the shared parameters in struct encl_op_header. [reinette: rebased to apply on top of oversubscription test series] Signed-off-by: Jarkko Sakkinen <[email protected]> Signed-off-by: Reinette Chatre <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Acked-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/f9a4a8c436b538003b8ebddaa66083992053cef1.1636997631.git.reinette.chatre@intel.com
2021-11-15selftests/sgx: Add a new kselftest: Unclobbered_vdso_oversubscribedJarkko Sakkinen1-0/+75
Add a variation of the unclobbered_vdso test. In the new test, create a heap for the test enclave, which has the same size as all available Enclave Page Cache (EPC) pages in the system. This will guarantee that all test_encl.elf pages *and* SGX Enclave Control Structure (SECS) have been swapped out by the page reclaimer during the load time. This test will trigger both the page reclaimer and the page fault handler. The page reclaimer triggered, while the heap is being created during the load time. The page fault handler is triggered for all the required pages, while the test case is executing. Signed-off-by: Jarkko Sakkinen <[email protected]> Signed-off-by: Reinette Chatre <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Acked-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/41f7c508eea79a3198b5014d7691903be08f9ff1.1636997631.git.reinette.chatre@intel.com
2021-11-15selftests/sgx: Move setup_test_encl() to each TEST_F()Jarkko Sakkinen1-4/+15
Create the test enclave inside each TEST_F(), instead of FIXTURE_SETUP(), so that the heap size can be defined per test. Signed-off-by: Jarkko Sakkinen <[email protected]> Signed-off-by: Reinette Chatre <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Acked-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/70ca264535d2ca0dc8dcaf2281e7d6965f8d4a24.1636997631.git.reinette.chatre@intel.com
2021-11-15selftests/sgx: Encpsulate the test enclave creationJarkko Sakkinen1-18/+26
Introduce setup_test_encl() so that the enclave creation can be moved to TEST_F()'s. This is required for a reclaimer test where the heap size needs to be set large enough to triger the page reclaimer. Signed-off-by: Jarkko Sakkinen <[email protected]> Signed-off-by: Reinette Chatre <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Acked-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/bee0ca867a95828a569c1ba2a8e443a44047dc71.1636997631.git.reinette.chatre@intel.com
2021-11-15selftests/sgx: Dump segments and /proc/self/maps only on failureJarkko Sakkinen1-11/+12
Logging is always a compromise between clarity and detail. The main use case for dumping VMA's is when FIXTURE_SETUP() fails, and is less important for enclaves that do initialize correctly. Therefore, print the segments and /proc/self/maps only in the error case. Finally, if a single test ever creates multiple enclaves, the amount of log lines would become enormous. Signed-off-by: Jarkko Sakkinen <[email protected]> Signed-off-by: Reinette Chatre <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Acked-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/23cef0ae1de3a8a74cbfbbe74eca48ca3f300fde.1636997631.git.reinette.chatre@intel.com
2021-11-15selftests/sgx: Create a heap for the test enclaveJarkko Sakkinen3-9/+26
Create a heap for the test enclave, which is allocated from /dev/null, and left unmeasured. This is beneficial by its own because it verifies that an enclave built from multiple choices, works properly. If LSM hooks are added for SGX some day, a multi source enclave has higher probability to trigger bugs on access control checks. The immediate need comes from the need to implement page reclaim tests. In order to trigger the page reclaimer, one can just set the size of the heap to high enough. Signed-off-by: Jarkko Sakkinen <[email protected]> Signed-off-by: Reinette Chatre <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Acked-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/e070c5f23578c29608051cab879b1d276963a27a.1636997631.git.reinette.chatre@intel.com
2021-11-15selftests/sgx: Make data measurement for an enclave segment optionalJarkko Sakkinen3-3/+10
For a heap makes sense to leave its contents "unmeasured" in the SGX enclave build process, meaning that they won't contribute to the cryptographic signature (a RSA-3072 signed SHA56 hash) of the enclave. Enclaves are signed blobs where the signature is calculated both from page data and also from "structural properties" of the pages. For instance a page offset of *every* page added to the enclave is hashed. For data, this is optional, not least because hashing a page has a significant contribution to the enclave load time. Thus, where there is no reason to hash, do not. The SGX ioctl interface supports this with SGX_PAGE_MEASURE flag. Only when the flag is *set*, data is measured. Add seg->measure boolean flag to struct encl_segment. Only when the flag is set, include the segment data to the signature (represented by SIGSTRUCT architectural structure). Signed-off-by: Jarkko Sakkinen <[email protected]> Signed-off-by: Reinette Chatre <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Acked-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/625b6fe28fed76275e9238ec4e15ec3c0d87de81.1636997631.git.reinette.chatre@intel.com
2021-11-15selftests/sgx: Assign source for each segmentJarkko Sakkinen3-6/+8
Define source per segment so that enclave pages can be added from different sources, e.g. anonymous VMA for zero pages. In other words, add 'src' field to struct encl_segment, and assign it to 'encl->src' for pages inherited from the enclave binary. Signed-off-by: Jarkko Sakkinen <[email protected]> Signed-off-by: Reinette Chatre <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Acked-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/7850709c3089fe20e4bcecb8295ba87c54cc2b4a.1636997631.git.reinette.chatre@intel.com
2021-11-15selftests/sgx: Fix a benign linker warningSean Christopherson1-1/+1
The enclave binary (test_encl.elf) is built with only three sections (tcs, text, and data) as controlled by its custom linker script. If gcc is built with "--enable-linker-build-id" (this appears to be a common configuration even if it is by default off) then gcc will pass "--build-id" to the linker that will prompt it (the linker) to write unique bits identifying the linked file to a ".note.gnu.build-id" section. The section ".note.gnu.build-id" does not exist in the test enclave resulting in the following warning emitted by the linker: /usr/bin/ld: warning: .note.gnu.build-id section discarded, --build-id ignored The test enclave does not use the build id within the binary so fix the warning by passing a build id of "none" to the linker that will disable the setting from any earlier "--build-id" options and thus disable the attempt to write the build id to a ".note.gnu.build-id" section that does not exist. Link: https://lore.kernel.org/linux-sgx/[email protected]/ Suggested-by: Cedric Xing <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Reinette Chatre <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Jarkko Sakkinen <[email protected]> Acked-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/ca0f8a81fc1e78af9bdbc6a88e0f9c37d82e53f2.1636997631.git.reinette.chatre@intel.com
2021-11-15x86/sgx: Add check for SGX pages to ghes_do_memory_failure()Tony Luck1-1/+1
SGX EPC pages do not have a "struct page" associated with them so the pfn_valid() sanity check fails and results in a warning message to the console. Add an additional check to skip the warning if the address of the error is in an SGX EPC page. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Jarkko Sakkinen <[email protected]> Tested-by: Reinette Chatre <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-11-15x86/sgx: Add hook to error injection address validationTony Luck2-1/+21
SGX reserved memory does not appear in the standard address maps. Add hook to call into the SGX code to check if an address is located in SGX memory. There are other challenges in injecting errors into SGX. Update the documentation with a sequence of operations to inject. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Jarkko Sakkinen <[email protected]> Tested-by: Reinette Chatre <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-11-15x86/sgx: Hook arch_memory_failure() into mainline codeTony Luck4-6/+38
Add a call inside memory_failure() to call the arch specific code to check if the address is an SGX EPC page and handle it. Note the SGX EPC pages do not have a "struct page" entry, so the hook goes in at the same point as the device mapping hook. Pull the call to acquire the mutex earlier so the SGX errors are also protected. Make set_mce_nospec() skip SGX pages when trying to adjust the 1:1 map. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Jarkko Sakkinen <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Tested-by: Reinette Chatre <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-11-15x86/sgx: Add SGX infrastructure to recover from poisonTony Luck1-0/+76
Provide a recovery function sgx_memory_failure(). If the poison was consumed synchronously then send a SIGBUS. Note that the virtual address of the access is not included with the SIGBUS as is the case for poison outside of SGX enclaves. This doesn't matter as addresses of code/data inside an enclave is of little to no use to code executing outside the (now dead) enclave. Poison found in a free page results in the page being moved from the free list to the per-node poison page list. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Jarkko Sakkinen <[email protected]> Tested-by: Reinette Chatre <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-11-15x86/sgx: Initial poison handling for dirty and free pagesTony Luck2-2/+28
A memory controller patrol scrubber can report poison in a page that isn't currently being used. Add "poison" field in the sgx_epc_page that can be set for an sgx_epc_page. Check for it: 1) When sanitizing dirty pages 2) When freeing epc pages Poison is a new field separated from flags to avoid having to make all updates to flags atomic, or integrate poison state changes into some other locking scheme to protect flags (Currently just sgx_reclaimer_lock which protects the SGX_EPC_PAGE_RECLAIMER_TRACKED bit in page->flags). In both cases place the poisoned page on a per-node list of poisoned epc pages to make sure it will not be reallocated. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Jarkko Sakkinen <[email protected]> Tested-by: Reinette Chatre <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-11-15x86/sgx: Add infrastructure to identify SGX EPC pagesTony Luck2-0/+10
X86 machine check architecture reports a physical address when there is a memory error. Handling that error requires a method to determine whether the physical address reported is in any of the areas reserved for EPC pages by BIOS. SGX EPC pages do not have Linux "struct page" associated with them. Keep track of the mapping from ranges of EPC pages to the sections that contain them using an xarray. N.B. adds CONFIG_XARRAY_MULTI to the SGX dependecies. So "select" that in arch/x86/Kconfig for X86/SGX. Create a function arch_is_platform_page() that simply reports whether an address is an EPC page for use elsewhere in the kernel. The ACPI error injection code needs this function and is typically built as a module, so export it. Note that arch_is_platform_page() will be slower than other similar "what type is this page" functions that can simply check bits in the "struct page". If there is some future performance critical user of this function it may need to be implemented in a more efficient way. Note also that the current implementation of xarray allocates a few hundred kilobytes for this usage on a system with 4GB of SGX EPC memory configured. This isn't ideal, but worth it for the code simplicity. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Jarkko Sakkinen <[email protected]> Tested-by: Reinette Chatre <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-11-15x86/sgx: Add new sgx_epc_page flag bit to mark free pagesTony Luck2-0/+5
SGX EPC pages go through the following life cycle: DIRTY ---> FREE ---> IN-USE --\ ^ | \-----------------/ Recovery action for poison for a DIRTY or FREE page is simple. Just make sure never to allocate the page. IN-USE pages need some extra handling. Add a new flag bit SGX_EPC_PAGE_IS_FREE that is set when a page is added to a free list and cleared when the page is allocated. Notes: 1) These transitions are made while holding the node->lock so that future code that checks the flags while holding the node->lock can be sure that if the SGX_EPC_PAGE_IS_FREE bit is set, then the page is on the free list. 2) Initially while the pages are on the dirty list the SGX_EPC_PAGE_IS_FREE bit is cleared. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Jarkko Sakkinen <[email protected]> Tested-by: Reinette Chatre <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-11-15x86/boot: Pull up cmdline preparation and early param parsingBorislav Petkov1-27/+39
Dan reports that Anjaneya Chagam can no longer use the efi=nosoftreserve kernel command line parameter to suppress "soft reservation" behavior. This is due to the fact that the following call-chain happens at boot: early_reserve_memory |-> efi_memblock_x86_reserve_range |-> efi_fake_memmap_early which does if (!efi_soft_reserve_enabled()) return; and that would have set EFI_MEM_NO_SOFT_RESERVE after having parsed "nosoftreserve". However, parse_early_param() gets called *after* it, leading to the boot cmdline not being taken into account. Therefore, carve out the command line preparation into a separate function which does the early param parsing too. So that it all goes together. And then call that function before early_reserve_memory() so that the params would have been parsed by then. Fixes: 8aa83e6395ce ("x86/setup: Call early_reserve_memory() earlier") Reported-by: Dan Williams <[email protected]> Reviewed-by: Dan Williams <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Tested-by: Anjaneya Chagam <[email protected]> Cc: <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-11-14Linux 5.16-rc1Linus Torvalds1-2/+2
2021-11-14kconfig: Add support for -Wimplicit-fallthroughGustavo A. R. Silva2-5/+6
Add Kconfig support for -Wimplicit-fallthrough for both GCC and Clang. The compiler option is under configuration CC_IMPLICIT_FALLTHROUGH, which is enabled by default. Special thanks to Nathan Chancellor who fixed the Clang bug[1][2]. This bugfix only appears in Clang 14.0.0, so older versions still contain the bug and -Wimplicit-fallthrough won't be enabled for them, for now. This concludes a long journey and now we are finally getting rid of the unintentional fallthrough bug-class in the kernel, entirely. :) Link: https://github.com/llvm/llvm-project/commit/9ed4a94d6451046a51ef393cd62f00710820a7e8 [1] Link: https://bugs.llvm.org/show_bug.cgi?id=51094 [2] Link: https://github.com/KSPP/linux/issues/115 Link: https://github.com/ClangBuiltLinux/linux/issues/236 Co-developed-by: Kees Cook <[email protected]> Signed-off-by: Kees Cook <[email protected]> Co-developed-by: Linus Torvalds <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> Signed-off-by: Gustavo A. R. Silva <[email protected]> Reviewed-by: Nathan Chancellor <[email protected]> Tested-by: Nathan Chancellor <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-14Merge tag 'xfs-5.16-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds4-7/+12
Pull xfs cleanups from Darrick Wong: "The most 'exciting' aspect of this branch is that the xfsprogs maintainer and I have worked through the last of the code discrepancies between kernel and userspace libxfs such that there are no code differences between the two except for #includes. IOWs, diff suffices to demonstrate that the userspace tools behave the same as the kernel, and kernel-only bits are clearly marked in the /kernel/ source code instead of just the userspace source. Summary: - Clean up open-coded swap() calls. - A little bit of #ifdef golf to complete the reunification of the kernel and userspace libxfs source code" * tag 'xfs-5.16-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: sync xfs_btree_split macros with userspace libxfs xfs: #ifdef out perag code for userspace xfs: use swap() to make dabtree code cleaner
2021-11-14Merge tag 'for-5.16/parisc-3' of ↵Linus Torvalds5-6/+14
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux Pull more parisc fixes from Helge Deller: "Fix a build error in stracktrace.c, fix resolving of addresses to function names in backtraces, fix single-stepping in assembly code and flush userspace pte's when using set_pte_at()" * tag 'for-5.16/parisc-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: parisc/entry: fix trace test in syscall exit path parisc: Flush kernel data mapping in set_pte_at() when installing pte for user page parisc: Fix implicit declaration of function '__kernel_text_address' parisc: Fix backtrace to always include init funtion names
2021-11-14Merge tag 'sh-for-5.16' of git://git.libc.org/linux-shLinus Torvalds21-180/+78
Pull arch/sh updates from Rich Felker. * tag 'sh-for-5.16' of git://git.libc.org/linux-sh: sh: pgtable-3level: Fix cast to pointer from integer of different size sh: fix READ/WRITE redefinition warnings sh: define __BIG_ENDIAN for math-emu sh: math-emu: drop unused functions sh: fix kconfig unmet dependency warning for FRAME_POINTER sh: Cleanup about SPARSE_IRQ sh: kdump: add some attribute to function maple: fix wrong return value of maple_bus_init(). sh: boot: avoid unneeded rebuilds under arch/sh/boot/compressed/ sh: boot: add intermediate vmlinux.bin* to targets instead of extra-y sh: boards: Fix the cacography in irq.c sh: check return code of request_irq sh: fix trivial misannotations
2021-11-14Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-armLinus Torvalds2-13/+13
Pull ARM fixes from Russell King: - Fix early_iounmap - Drop cc-option fallbacks for architecture selection * tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm: ARM: 9156/1: drop cc-option fallbacks for architecture selection ARM: 9155/1: fix early early_iounmap()
2021-11-14Merge tag 'devicetree-fixes-for-5.16-1' of ↵Linus Torvalds107-277/+277
git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux Pull devicetree fixes from Rob Herring: - Two fixes due to DT node name changes on Arm, Ltd. boards - Treewide rename of Ingenic CGU headers - Update ST email addresses - Remove Netlogic DT bindings - Dropping few more cases of redundant 'maxItems' in schemas - Convert toshiba,tc358767 bridge binding to schema * tag 'devicetree-fixes-for-5.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: dt-bindings: watchdog: sunxi: fix error in schema bindings: media: venus: Drop redundant maxItems for power-domain-names dt-bindings: Remove Netlogic bindings clk: versatile: clk-icst: Ensure clock names are unique of: Support using 'mask' in making device bus id dt-bindings: treewide: Update @st.com email address to @foss.st.com dt-bindings: media: Update maintainers for st,stm32-hwspinlock.yaml dt-bindings: media: Update maintainers for st,stm32-cec.yaml dt-bindings: mfd: timers: Update maintainers for st,stm32-timers dt-bindings: timer: Update maintainers for st,stm32-timer dt-bindings: i2c: imx: hardware do not restrict clock-frequency to only 100 and 400 kHz dt-bindings: display: bridge: Convert toshiba,tc358767.txt to yaml dt-bindings: Rename Ingenic CGU headers to ingenic,*.h
2021-11-14Merge tag 'timers-urgent-2021-11-14' of ↵Linus Torvalds3-2/+20
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Thomas Gleixner: "A single fix for POSIX CPU timers to address a problem where POSIX CPU timer delivery stops working for a new child task because copy_process() copies state information which is only valid for the parent task" * tag 'timers-urgent-2021-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: posix-cpu-timers: Clear task::posix_cputimers_work in copy_process()
2021-11-14Merge tag 'irq-urgent-2021-11-14' of ↵Linus Torvalds8-28/+60
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: "A set of fixes for the interrupt subsystem Core code: - A regression fix for the Open Firmware interrupt mapping code where a interrupt controller property in a node caused a map property in the same node to be ignored. Interrupt chip drivers: - Workaround a limitation in SiFive PLIC interrupt chip which silently ignores an EOI when the interrupt line is masked. - Provide the missing mask/unmask implementation for the CSKY MP interrupt controller. PCI/MSI: - Prevent a use after free when PCI/MSI interrupts are released by destroying the sysfs entries before freeing the memory which is accessed in the sysfs show() function. - Implement a mask quirk for the Nvidia ION AHCI chip which does not advertise masking capability despite implementing it. Even worse the chip comes out of reset with all MSI entries masked, which due to the missing masking capability never get unmasked. - Move the check which prevents accessing the MSI[X] masking for XEN back into the low level accessors. The recent consolidation missed that these accessors can be invoked from places which do not have that check which broke XEN. Move them back to he original place instead of sprinkling tons of these checks all over the code" * tag 'irq-urgent-2021-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: of/irq: Don't ignore interrupt-controller when interrupt-map failed irqchip/sifive-plic: Fixup EOI failed when masked irqchip/csky-mpintc: Fixup mask/unmask implementation PCI/MSI: Destroy sysfs before freeing entries PCI: Add MSI masking quirk for Nvidia ION AHCI PCI/MSI: Deal with devices lying about their MSI mask capability PCI/MSI: Move non-mask check back into low level accessors
2021-11-14Merge tag 'locking-urgent-2021-11-14' of ↵Linus Torvalds3-4/+14
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 static call update from Thomas Gleixner: "A single fix for static calls to make the trampoline patching more robust by placing explicit signature bytes after the call trampoline to prevent patching random other jumps like the CFI jump table entries" * tag 'locking-urgent-2021-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: static_call,x86: Robustify trampoline patching
2021-11-14Merge tag 'sched_urgent_for_v5.16_rc1' of ↵Linus Torvalds13-59/+96
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Avoid touching ~100 config files in order to be able to select the preemption model - clear cluster CPU masks too, on the CPU unplug path - prevent use-after-free in cfs - Prevent a race condition when updating CPU cache domains - Factor out common shared part of smp_prepare_cpus() into a common helper which can be called by both baremetal and Xen, in order to fix a booting of Xen PV guests * tag 'sched_urgent_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: preempt: Restore preemption model selection configs arch_topology: Fix missing clear cluster_cpumask in remove_cpu_topology() sched/fair: Prevent dead task groups from regaining cfs_rq's sched/core: Mitigate race cpus_share_cache()/update_top_cache_domain() x86/smp: Factor out parts of native_smp_prepare_cpus()
2021-11-14Merge tag 'perf_urgent_for_v5.16_rc1' of ↵Linus Torvalds3-6/+10
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Borislav Petkov: - Prevent unintentional page sharing by checking whether a page reference to a PMU samples page has been acquired properly before that - Make sure the LBR_SELECT MSR is saved/restored too - Reset the LBR_SELECT MSR when resetting the LBR PMU to clear any residual data left * tag 'perf_urgent_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Avoid put_page() when GUP fails perf/x86/vlbr: Add c->flags to vlbr event constraints perf/x86/lbr: Reset LBR_SELECT during vlbr reset
2021-11-14Merge tag 'x86_urgent_for_v5.16_rc1' of ↵Linus Torvalds8-4/+71
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Add the model number of a new, Raptor Lake CPU, to intel-family.h - Do not log spurious corrected MCEs on SKL too, due to an erratum - Clarify the path of paravirt ops patches upstream - Add an optimization to avoid writing out AMX components to sigframes when former are in init state * tag 'x86_urgent_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu: Add Raptor Lake to Intel family x86/mce: Add errata workaround for Skylake SKX37 MAINTAINERS: Add some information to PARAVIRT_OPS entry x86/fpu: Optimize out sigframe xfeatures when in init state
2021-11-14Merge tag 'perf-tools-for-v5.16-2021-11-13' of ↵Linus Torvalds131-1148/+2639
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux Pull more perf tools updates from Arnaldo Carvalho de Melo: "Hardware tracing: - ARM: * Print the size of the buffer size consistently in hexadecimal in ARM Coresight. * Add Coresight snapshot mode support. * Update --switch-events docs in 'perf record'. * Support hardware-based PID tracing. * Track task context switch for cpu-mode events. - Vendor events: * Add metric events JSON file for power10 platform perf test: - Get 'perf test' unit tests closer to kunit. - Topology tests improvements. - Remove bashisms from some tests. perf bench: - Fix memory leak of perf_cpu_map__new() in the futex benchmarks. libbpf: - Add some more weak libbpf functions o allow building with the libbpf versions, old ones, present in distros. libbeauty: - Translate [gs]setsockopt 'level' argument integer values to strings. tools headers UAPI: - Sync futex_waitv, arch prctl, sound, i195_drm and msr-index files with the kernel sources. Documentation: - Add documentation to 'struct symbol'. - Synchronize the definition of enum perf_hw_id with code in tools/perf/design.txt" * tag 'perf-tools-for-v5.16-2021-11-13' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (67 commits) perf tests: Remove bash constructs from stat_all_pmu.sh perf tests: Remove bash construct from record+zstd_comp_decomp.sh perf test: Remove bash construct from stat_bpf_counters.sh test perf bench futex: Fix memory leak of perf_cpu_map__new() tools arch x86: Sync the msr-index.h copy with the kernel sources tools headers UAPI: Sync drm/i915_drm.h with the kernel sources tools headers UAPI: Sync sound/asound.h with the kernel sources tools headers UAPI: Sync linux/prctl.h with the kernel sources tools headers UAPI: Sync arch prctl headers with the kernel sources perf tools: Add more weak libbpf functions perf bpf: Avoid memory leak from perf_env__insert_btf() perf symbols: Factor out annotation init/exit perf symbols: Bit pack to save a byte perf symbols: Add documentation to 'struct symbol' tools headers UAPI: Sync files changed by new futex_waitv syscall perf test bpf: Use ARRAY_CHECK() instead of ad-hoc equivalent, addressing array_size.cocci warning perf arm-spe: Support hardware-based PID tracing perf arm-spe: Save context ID in record perf arm-spe: Update --switch-events docs in 'perf record' perf arm-spe: Track task context switch for cpu-mode events ...
2021-11-14Merge tag 'irqchip-fixes-5.16-1' of ↵Thomas Gleixner3-8/+27
git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/urgent Pull irqchip fixes from Marc Zyngier: - Address an issue with the SiFive PLIC being unable to EOI a masked interrupt - Move the disable/enable methods in the CSky mpintc to mask/unmask - Fix a regression in the OF irq code where an interrupt-controller property in the same node as an interrupt-map property would get ignored Link: https://lore.kernel.org/all/[email protected]
2021-11-13Merge tag 'zstd-for-linus-v5.16' of git://github.com/terrelln/linuxLinus Torvalds76-12941/+27373
Pull zstd update from Nick Terrell: "Update to zstd-1.4.10. Add myself as the maintainer of zstd and update the zstd version in the kernel, which is now 4 years out of date, to a much more recent zstd release. This includes bug fixes, much more extensive fuzzing, and performance improvements. And generates the kernel zstd automatically from upstream zstd, so it is easier to keep the zstd verison up to date, and we don't fall so far out of date again. This includes 5 commits that update the zstd library version: - Adds a new kernel-style wrapper around zstd. This wrapper API is functionally equivalent to the subset of the current zstd API that is currently used. The wrapper API changes to be kernel style so that the symbols don't collide with zstd's symbols. The update to zstd-1.4.10 maintains the same API and preserves the semantics, so that none of the callers need to be updated. All callers are updated in the commit, because there are zero functional changes. - Adds an indirection for `lib/decompress_unzstd.c` so it doesn't depend on the layout of `lib/zstd/` to include every source file. This allows the next patch to be automatically generated. - Imports the zstd-1.4.10 source code. This commit is automatically generated from upstream zstd (https://github.com/facebook/zstd). - Adds me ([email protected]) as the maintainer of `lib/zstd`. - Fixes a newly added build warning for clang. The discussion around this patchset has been pretty long, so I've included a FAQ-style summary of the history of the patchset, and why we are taking this approach. Why do we need to update? ------------------------- The zstd version in the kernel is based off of zstd-1.3.1, which is was released August 20, 2017. Since then zstd has seen many bug fixes and performance improvements. And, importantly, upstream zstd is continuously fuzzed by OSS-Fuzz, and bug fixes aren't backported to older versions. So the only way to sanely get these fixes is to keep up to date with upstream zstd. There are no known security issues that affect the kernel, but we need to be able to update in case there are. And while there are no known security issues, there are relevant bug fixes. For example the problem with large kernel decompression has been fixed upstream for over 2 years [1] Additionally the performance improvements for kernel use cases are significant. Measured for x86_64 on my Intel i9-9900k @ 3.6 GHz: - BtrFS zstd compression at levels 1 and 3 is 5% faster - BtrFS zstd decompression+read is 15% faster - SquashFS zstd decompression+read is 15% faster - F2FS zstd compression+write at level 3 is 8% faster - F2FS zstd decompression+read is 20% faster - ZRAM decompression+read is 30% faster - Kernel zstd decompression is 35% faster - Initramfs zstd decompression+build is 5% faster On top of this, there are significant performance improvements coming down the line in the next zstd release, and the new automated update patch generation will allow us to pull them easily. How is the update patch generated? ---------------------------------- The first two patches are preparation for updating the zstd version. Then the 3rd patch in the series imports upstream zstd into the kernel. This patch is automatically generated from upstream. A script makes the necessary changes and imports it into the kernel. The changes are: - Replace all libc dependencies with kernel replacements and rewrite includes. - Remove unncessary portability macros like: #if defined(_MSC_VER). - Use the kernel xxhash instead of bundling it. This automation gets tested every commit by upstream's continuous integration. When we cut a new zstd release, we will submit a patch to the kernel to update the zstd version in the kernel. The automated process makes it easy to keep the kernel version of zstd up to date. The current zstd in the kernel shares the guts of the code, but has a lot of API and minor changes to work in the kernel. This is because at the time upstream zstd was not ready to be used in the kernel envrionment as-is. But, since then upstream zstd has evolved to support being used in the kernel as-is. Why are we updating in one big patch? ------------------------------------- The 3rd patch in the series is very large. This is because it is restructuring the code, so it both deletes the existing zstd, and re-adds the new structure. Future updates will be directly proportional to the changes in upstream zstd since the last import. They will admittidly be large, as zstd is an actively developed project, and has hundreds of commits between every release. However, there is no other great alternative. One option ruled out is to replay every upstream zstd commit. This is not feasible for several reasons: - There are over 3500 upstream commits since the zstd version in the kernel. - The automation to automatically generate the kernel update was only added recently, so older commits cannot easily be imported. - Not every upstream zstd commit builds. - Only zstd releases are "supported", and individual commits may have bugs that were fixed before a release. Another option to reduce the patch size would be to first reorganize to the new file structure, and then apply the patch. However, the current kernel zstd is formatted with clang-format to be more "kernel-like". But, the new method imports zstd as-is, without additional formatting, to allow for closer correlation with upstream, and easier debugging. So the patch wouldn't be any smaller. It also doesn't make sense to import upstream zstd commit by commit going forward. Upstream zstd doesn't support production use cases running of the development branch. We have a lot of post-commit fuzzing that catches many bugs, so indiviudal commits may be buggy, but fixed before a release. So going forward, I intend to import every (important) zstd release into the Kernel. So, while it isn't ideal, updating in one big patch is the only patch I see forward. Who is responsible for this code? --------------------------------- I am. This patchset adds me as the maintainer for zstd. Previously, there was no tree for zstd patches. Because of that, there were several patches that either got ignored, or took a long time to merge, since it wasn't clear which tree should pick them up. I'm officially stepping up as maintainer, and setting up my tree as the path through which zstd patches get merged. I'll make sure that patches to the kernel zstd get ported upstream, so they aren't erased when the next version update happens. How is this code tested? ------------------------ I tested every caller of zstd on x86_64 (BtrFS, ZRAM, SquashFS, F2FS, Kernel, InitRAMFS). I also tested Kernel & InitRAMFS on i386 and aarch64. I checked both performance and correctness. Also, thanks to many people in the community who have tested these patches locally. Lastly, this code will bake in linux-next before being merged into v5.16. Why update to zstd-1.4.10 when zstd-1.5.0 has been released? ------------------------------------------------------------ This patchset has been outstanding since 2020, and zstd-1.4.10 was the latest release when it was created. Since the update patch is automatically generated from upstream, I could generate it from zstd-1.5.0. However, there were some large stack usage regressions in zstd-1.5.0, and are only fixed in the latest development branch. And the latest development branch contains some new code that needs to bake in the fuzzer before I would feel comfortable releasing to the kernel. Once this patchset has been merged, and we've released zstd-1.5.1, we can update the kernel to zstd-1.5.1, and exercise the update process. You may notice that zstd-1.4.10 doesn't exist upstream. This release is an artifical release based off of zstd-1.4.9, with some fixes for the kernel backported from the development branch. I will tag the zstd-1.4.10 release after this patchset is merged, so the Linux Kernel is running a known version of zstd that can be debugged upstream. Why was a wrapper API added? ---------------------------- The first versions of this patchset migrated the kernel to the upstream zstd API. It first added a shim API that supported the new upstream API with the old code, then updated callers to use the new shim API, then transitioned to the new code and deleted the shim API. However, Cristoph Hellwig suggested that we transition to a kernel style API, and hide zstd's upstream API behind that. This is because zstd's upstream API is supports many other use cases, and does not follow the kernel style guide, while the kernel API is focused on the kernel's use cases, and follows the kernel style guide. Where is the previous discussion? --------------------------------- Links for the discussions of the previous versions of the patch set below. The largest changes in the design of the patchset are driven by the discussions in v11, v5, and v1. Sorry for the mix of links, I couldn't find most of the the threads on lkml.org" Link: https://lkml.org/lkml/2020/9/29/27 [1] Link: https://www.spinics.net/lists/linux-crypto/msg58189.html [v12] Link: https://lore.kernel.org/linux-btrfs/[email protected]/ [v11] Link: https://lore.kernel.org/lkml/[email protected]/ [v10] Link: https://lore.kernel.org/linux-btrfs/[email protected]/ [v9] Link: https://lore.kernel.org/linux-f2fs-devel/[email protected]/ [v8] Link: https://lkml.org/lkml/2020/12/3/1195 [v7] Link: https://lkml.org/lkml/2020/12/2/1245 [v6] Link: https://lore.kernel.org/linux-btrfs/[email protected]/ [v5] Link: https://www.spinics.net/lists/linux-btrfs/msg105783.html [v4] Link: https://lkml.org/lkml/2020/9/23/1074 [v3] Link: https://www.spinics.net/lists/linux-btrfs/msg105505.html [v2] Link: https://lore.kernel.org/linux-btrfs/[email protected]/ [v1] Signed-off-by: Nick Terrell <[email protected]> Tested By: Paul Jones <[email protected]> Tested-by: Oleksandr Natalenko <[email protected]> Tested-by: Sedat Dilek <[email protected]> # LLVM/Clang v13.0.0 on x86-64 Tested-by: Jean-Denis Girard <[email protected]> * tag 'zstd-for-linus-v5.16' of git://github.com/terrelln/linux: lib: zstd: Add cast to silence clang's -Wbitwise-instead-of-logical MAINTAINERS: Add maintainer entry for zstd lib: zstd: Upgrade to latest upstream zstd version 1.4.10 lib: zstd: Add decompress_sources.h for decompress_unzstd lib: zstd: Add kernel-specific API
2021-11-13Merge tag 'virtio-mem-for-5.16' of git://github.com/davidhildenbrand/linuxLinus Torvalds2-3/+7
Pull virtio-mem update from David Hildenbrand: "Support the VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE feature in virtio-mem, now that "accidential" access to logically unplugged memory inside added Linux memory blocks is no longer possible, because we: - Removed /dev/kmem in commit bbcd53c96071 ("drivers/char: remove /dev/kmem for good") - Disallowed access to virtio-mem device memory via /dev/mem in commit 2128f4e21aa ("virtio-mem: disallow mapping virtio-mem memory via /dev/mem") - Sanitized access to virtio-mem device memory via /proc/kcore in commit 0daa322b8ff9 ("fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages") - Sanitized access to virtio-mem device memory via /proc/vmcore in commit ce2814622e84 ("virtio-mem: kdump mode to sanitize /proc/vmcore access") The new VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE feature that will be required by some hypervisors implementing virtio-mem in the near future, so let's support it now that we safely can" * tag 'virtio-mem-for-5.16' of git://github.com/davidhildenbrand/linux: virtio-mem: support VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE
2021-11-13perf tests: Remove bash constructs from stat_all_pmu.shJames Clark1-2/+2
The tests were passing but without testing and were printing the following: $ ./perf test -v 90 90: perf all PMU test : --- start --- test child forked, pid 51650 Testing cpu/branch-instructions/ ./tests/shell/stat_all_pmu.sh: 10: [: Performance counter stats for 'true': 137,307 cpu/branch-instructions/ 0.001686672 seconds time elapsed 0.001376000 seconds user 0.000000000 seconds sys: unexpected operator Changing the regexes to a grep works in sh and prints this: $ ./perf test -v 90 90: perf all PMU test : --- start --- test child forked, pid 60186 [...] Testing tlb_flush.stlb_any test child finished with 0 ---- end ---- perf all PMU test: Ok Signed-off-by: James Clark <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Florian Fainelli <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: John Fastabend <[email protected]> Cc: KP Singh <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Song Liu <[email protected]> Cc: Sumanth Korikkar <[email protected]> Cc: Thomas Richter <[email protected]> Cc: Yonghong Song <[email protected]> Cc: [email protected] Cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2021-11-13perf tests: Remove bash construct from record+zstd_comp_decomp.shJames Clark1-1/+1
Commit 463538a383a2 ("perf tests: Fix test 68 zstd compression for s390") inadvertently removed the -g flag from all platforms rather than just s390, because the [[ ]] construct fails in sh. Changing to single brackets restores testing of call graphs and removes the following error from the output: $ ./perf test -v 85 85: Zstd perf.data compression/decompression : --- start --- test child forked, pid 50643 Collecting compressed record file: ./tests/shell/record+zstd_comp_decomp.sh: 15: [[: not found Fixes: 463538a383a2 ("perf tests: Fix test 68 zstd compression for s390") Signed-off-by: James Clark <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Florian Fainelli <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: John Fastabend <[email protected]> Cc: KP Singh <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Song Liu <[email protected]> Cc: Sumanth Korikkar <[email protected]> Cc: Thomas Richter <[email protected]> Cc: Yonghong Song <[email protected]> Cc: [email protected] Cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2021-11-13perf test: Remove bash construct from stat_bpf_counters.sh testJames Clark1-1/+1
Currently the test skips with an error because == only works in bash: $ ./perf test 91 -v Couldn't bump rlimit(MEMLOCK), failures may take place when creating BPF maps, etc 91: perf stat --bpf-counters test : --- start --- test child forked, pid 44586 ./tests/shell/stat_bpf_counters.sh: 26: [: -v: unexpected operator test child finished with -2 ---- end ---- perf stat --bpf-counters test: Skip Changing == to = does the same thing, but doesn't result in an error: ./perf test 91 -v Couldn't bump rlimit(MEMLOCK), failures may take place when creating BPF maps, etc 91: perf stat --bpf-counters test : --- start --- test child forked, pid 45833 Skipping: --bpf-counters not supported Error: unknown option `bpf-counters' [...] test child finished with -2 ---- end ---- perf stat --bpf-counters test: Skip Signed-off-by: James Clark <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Florian Fainelli <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: John Fastabend <[email protected]> Cc: KP Singh <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Song Liu <[email protected]> Cc: Sumanth Korikkar <[email protected]> Cc: Thomas Richter <[email protected]> Cc: Yonghong Song <[email protected]> Cc: [email protected] Cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2021-11-13perf bench futex: Fix memory leak of perf_cpu_map__new()Sohaib Mohamed4-0/+4
ASan reports memory leaks while running: $ sudo ./perf bench futex all The leaks are caused by perf_cpu_map__new not being freed. This patch adds the missing perf_cpu_map__put since it calls cpu_map_delete implicitly. Fixes: 9c3516d1b850ea93 ("libperf: Add perf_cpu_map__new()/perf_cpu_map__read() functions") Signed-off-by: Sohaib Mohamed <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: André Almeida <[email protected]> Cc: Darren Hart <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sohaib Mohamed <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lore.kernel.org/lkml/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2021-11-13tools arch x86: Sync the msr-index.h copy with the kernel sourcesArnaldo Carvalho de Melo1-0/+2
To pick up the changes in: dae1bd58389615d4 ("x86/msr-index: Add MSRs for XFD") Addressing these tools/perf build warnings: diff -u tools/arch/x86/include/asm/msr-index.h arch/x86/include/asm/msr-index.h Warning: Kernel ABI header at 'tools/arch/x86/include/asm/msr-index.h' differs from latest version at 'arch/x86/include/asm/msr-index.h' That makes the beautification scripts to pick some new entries: $ diff -u tools/arch/x86/include/asm/msr-index.h arch/x86/include/asm/msr-index.h --- tools/arch/x86/include/asm/msr-index.h 2021-07-15 16:17:01.819817827 -0300 +++ arch/x86/include/asm/msr-index.h 2021-11-06 15:49:33.738517311 -0300 @@ -625,6 +625,8 @@ #define MSR_IA32_BNDCFGS_RSVD 0x00000ffc +#define MSR_IA32_XFD 0x000001c4 +#define MSR_IA32_XFD_ERR 0x000001c5 #define MSR_IA32_XSS 0x00000da0 #define MSR_IA32_APICBASE 0x0000001b $ tools/perf/trace/beauty/tracepoints/x86_msr.sh > /tmp/before $ cp arch/x86/include/asm/msr-index.h tools/arch/x86/include/asm/msr-index.h $ tools/perf/trace/beauty/tracepoints/x86_msr.sh > /tmp/after $ diff -u /tmp/before /tmp/after --- /tmp/before 2021-11-13 11:10:39.964201505 -0300 +++ /tmp/after 2021-11-13 11:10:47.902410873 -0300 @@ -93,6 +93,8 @@ [0x000001b0] = "IA32_ENERGY_PERF_BIAS", [0x000001b1] = "IA32_PACKAGE_THERM_STATUS", [0x000001b2] = "IA32_PACKAGE_THERM_INTERRUPT", + [0x000001c4] = "IA32_XFD", + [0x000001c5] = "IA32_XFD_ERR", [0x000001c8] = "LBR_SELECT", [0x000001c9] = "LBR_TOS", [0x000001d9] = "IA32_DEBUGCTLMSR", $ And this gets rebuilt: CC /tmp/build/perf/trace/beauty/tracepoints/x86_msr.o INSTALL trace_plugins LD /tmp/build/perf/trace/beauty/tracepoints/perf-in.o LD /tmp/build/perf/trace/beauty/perf-in.o LD /tmp/build/perf/perf-in.o LINK /tmp/build/perf/perf Now one can trace systemwide asking to see backtraces to where those MSRs are being read/written with: # perf trace -e msr:*_msr/max-stack=32/ --filter="msr==IA32_XFD || msr==IA32_XFD_ERR" ^C# # If we use -v (verbose mode) we can see what it does behind the scenes: # perf trace -v -e msr:*_msr/max-stack=32/ --filter="msr==IA32_XFD || msr==IA32_XFD_ERR" <SNIP> New filter for msr:read_msr: (msr==0x1c4 || msr==0x1c5) && (common_pid != 4448951 && common_pid != 8781) New filter for msr:write_msr: (msr==0x1c4 || msr==0x1c5) && (common_pid != 4448951 && common_pid != 8781) <SNIP> ^C# Example with a frequent msr: # perf trace -v -e msr:*_msr/max-stack=32/ --filter="msr==IA32_SPEC_CTRL" --max-events 2 Using CPUID AuthenticAMD-25-21-0 0x48 New filter for msr:read_msr: (msr==0x48) && (common_pid != 3738351 && common_pid != 3564) 0x48 New filter for msr:write_msr: (msr==0x48) && (common_pid != 3738351 && common_pid != 3564) mmap size 528384B Looking at the vmlinux_path (8 entries long) symsrc__init: build id mismatch for vmlinux. Using /proc/kcore for kernel data Using /proc/kallsyms for symbols 0.000 pipewire/2479 msr:write_msr(msr: IA32_SPEC_CTRL, val: 6) do_trace_write_msr ([kernel.kallsyms]) do_trace_write_msr ([kernel.kallsyms]) __switch_to_xtra ([kernel.kallsyms]) __switch_to ([kernel.kallsyms]) __schedule ([kernel.kallsyms]) schedule ([kernel.kallsyms]) schedule_hrtimeout_range_clock ([kernel.kallsyms]) do_epoll_wait ([kernel.kallsyms]) __x64_sys_epoll_wait ([kernel.kallsyms]) do_syscall_64 ([kernel.kallsyms]) entry_SYSCALL_64_after_hwframe ([kernel.kallsyms]) epoll_wait (/usr/lib64/libc-2.33.so) [0x76c4] (/usr/lib64/spa-0.2/support/libspa-support.so) [0x4cf0] (/usr/lib64/spa-0.2/support/libspa-support.so) 0.027 :0/0 msr:write_msr(msr: IA32_SPEC_CTRL, val: 2) do_trace_write_msr ([kernel.kallsyms]) do_trace_write_msr ([kernel.kallsyms]) __switch_to_xtra ([kernel.kallsyms]) __switch_to ([kernel.kallsyms]) __schedule ([kernel.kallsyms]) schedule_idle ([kernel.kallsyms]) do_idle ([kernel.kallsyms]) cpu_startup_entry ([kernel.kallsyms]) start_kernel ([kernel.kallsyms]) secondary_startup_64_no_verify ([kernel.kallsyms]) # Cc: Borislav Petkov <[email protected]> Cc: Chang S. Bae <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Namhyung Kim <[email protected]> Link: https://lore.kernel.org/lkml/YY%[email protected]/ Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2021-11-13tools headers UAPI: Sync drm/i915_drm.h with the kernel sourcesArnaldo Carvalho de Melo1-1/+241
To pick up the changes in: e5e32171a2cf1e43 ("drm/i915/guc: Connect UAPI to GuC multi-lrc interface") 9409eb35942713d0 ("drm/i915: Expose logical engine instance to user") ea673f17ab763879 ("drm/i915/uapi: Add comment clarifying purpose of I915_TILING_* values") d3ac8d42168a9be7 ("drm/i915/pxp: interfaces for using protected objects") cbbd3764b2399ad8 ("drm/i915/pxp: Create the arbitrary session after boot") That don't add any new ioctl, so no changes in tooling. This silences this perf build warning: Warning: Kernel ABI header at 'tools/include/uapi/drm/i915_drm.h' differs from latest version at 'include/uapi/drm/i915_drm.h' diff -u tools/include/uapi/drm/i915_drm.h include/uapi/drm/i915_drm.h Cc: Daniele Ceraolo Spurio <[email protected]> Cc: Huang, Sean Z <[email protected]> Cc: John Harrison <[email protected]> Cc: Matthew Brost <[email protected]> Cc: Matt Roper <[email protected]> Cc: Rodrigo Vivi <[email protected]> Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>