aboutsummaryrefslogtreecommitdiff
path: root/arch
AgeCommit message (Collapse)AuthorFilesLines
2021-02-10crypto: arm64/crc-t10dif - move NEON yield to C codeArd Biesheuvel2-38/+35
Instead of yielding from the bowels of the asm routine if a reschedule is needed, divide up the input into 4 KB chunks in the C glue. This simplifies the code substantially, and avoids scheduling out the task with the asm routine on the call stack, which is undesirable from a CFI/instrumentation point of view. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-02-10crypto: arm64/aes-ce-mac - simplify NEON yieldArd Biesheuvel2-40/+33
Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-02-10crypto: arm64/aes-neonbs - remove NEON yield callsArd Biesheuvel1-6/+2
There is no need for elaborate yield handling in the bit-sliced NEON implementation of AES, given that skciphers are naturally bounded by the size of the chunks returned by the skcipher_walk API. So remove the yield calls from the asm code. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-02-10crypto: arm64/sha512-ce - simplify NEON yieldArd Biesheuvel2-48/+34
Instead of calling into kernel_neon_end() and kernel_neon_begin() (and potentially into schedule()) from the assembler code when running in task mode and a reschedule is pending, perform only the preempt count check in assembler, but simply return early in this case, and let the C code deal with the consequences. This reverts commit 6caf7adc5e458f77f550b6c6ca8effa152d61b4a. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-02-10crypto: arm64/sha3-ce - simplify NEON yieldArd Biesheuvel2-56/+39
Instead of calling into kernel_neon_end() and kernel_neon_begin() (and potentially into schedule()) from the assembler code when running in task mode and a reschedule is pending, perform only the preempt count check in assembler, but simply return early in this case, and let the C code deal with the consequences. This reverts commit 7edc86cb1c18b4c274672232117586ea2bef1d9a. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-02-10crypto: arm64/sha2-ce - simplify NEON yieldArd Biesheuvel2-35/+25
Instead of calling into kernel_neon_end() and kernel_neon_begin() (and potentially into schedule()) from the assembler code when running in task mode and a reschedule is pending, perform only the preempt count check in assembler, but simply return early in this case, and let the C code deal with the consequences. This reverts commit d82f37ab5e2426287013eba38b1212e8b71e5be3. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-02-10crypto: arm64/sha1-ce - simplify NEON yieldArd Biesheuvel2-40/+29
Instead of calling into kernel_neon_end() and kernel_neon_begin() (and potentially into schedule()) from the assembler code when running in task mode and a reschedule is pending, perform only the preempt count check in assembler, but simply return early in this case, and let the C code deal with the consequences. This reverts commit 7df8d164753e6e6f229b72767595072bc6a71f48. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-02-10crypto: powerpc/sha256 - remove unneeded semicolonYang Li1-1/+1
Eliminate the following coccicheck warning: ./arch/powerpc/crypto/sha256-spe-glue.c:132:2-3: Unneeded semicolon Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-02-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux for-next/cryptoHerbert Xu1-0/+16
Pull change from arm64 tree that's needed for crypto arm changes.
2021-02-03arm64: assembler: add cond_yield macroArd Biesheuvel1-0/+16
Add a macro cond_yield that branches to a specified label when called if the TIF_NEED_RESCHED flag is set and decreasing the preempt count would make the task preemptible again, resulting in a schedule to occur. This can be used by kernel mode SIMD code that keeps a lot of state in SIMD registers, which would make chunking the input in order to perform the cond_resched() check from C code disproportionately costly. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20210203113626.220151-2-ardb@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
2021-01-22crypto: aesni - release FPU during skcipher walk API callsArd Biesheuvel1-41/+32
Taking ownership of the FPU in kernel mode disables preemption, and this may result in excessive scheduling blackouts if the size of the data being processed on the FPU is unbounded. Given that taking and releasing the FPU is cheap these days on x86, we can limit the impact of this issue easily for skcipher implementations, by moving the FPU begin/end calls inside the skcipher walk processing loop. Considering that skcipher walks operate on at most one page at a time, doing so fully mitigates this issue. This also permits the skcipher walk logic to use non-atomic kmalloc() calls etc so we can change the 'atomic' bool argument in the calls to skcipher_walk_virt() to false as well. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-22crypto: aesni - replace CTR function pointer with static callArd Biesheuvel1-6/+7
Indirect calls are very expensive on x86, so use a static call to set the system-wide AES-NI/CTR asm helper. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-22crypto: arm64/sha - add missing module aliasesArd Biesheuvel4-0/+9
The accelerated, instruction based implementations of SHA1, SHA2 and SHA3 are autoloaded based on CPU capabilities, given that the code is modest in size, and widely used, which means that resolving the algo name, loading all compatible modules and picking the one with the highest priority is taken to be suboptimal. However, if these algorithms are requested before this CPU feature based matching and autoloading occurs, these modules are not even considered, and we end up with suboptimal performance. So add the missing module aliases for the various SHA implementations. Cc: <stable@vger.kernel.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86 - use local headers for x86 specific shared declarationsArd Biesheuvel12-8/+8
The Camellia, Serpent and Twofish related header files only contain declarations that are shared between different implementations of the respective algorithms residing under arch/x86/crypto, and none of their contents should be used elsewhere. So move the header files into the same location, and use local #includes instead. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86 - remove glue helper moduleArd Biesheuvel3-231/+0
All dependencies on the x86 glue helper module have been replaced by local instantiations of the new ECB/CBC preprocessor helper macros, so the glue helper module can be retired. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/twofish - drop dependency on glue helperArd Biesheuvel2-109/+44
Replace the glue helper dependency with implementations of ECB and CBC based on the new CPP macros, which avoid the need for indirect calls. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/cast6 - drop dependency on glue helperArd Biesheuvel1-44/+17
Replace the glue helper dependency with implementations of ECB and CBC based on the new CPP macros, which avoid the need for indirect calls. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/cast5 - drop dependency on glue helperArd Biesheuvel1-167/+17
Replace the glue helper dependency with implementations of ECB and CBC based on the new CPP macros, which avoid the need for indirect calls. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/serpent - drop dependency on glue helperArd Biesheuvel3-154/+61
Replace the glue helper dependency with implementations of ECB and CBC based on the new CPP macros, which avoid the need for indirect calls. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/camellia - drop dependency on glue helperArd Biesheuvel3-166/+67
Replace the glue helper dependency with implementations of ECB and CBC based on the new CPP macros, which avoid the need for indirect calls. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86 - add some helper macros for ECB and CBC modesArd Biesheuvel1-0/+76
The x86 glue helper module is starting to show its age: - It relies heavily on function pointers to invoke asm helper functions that operate on fixed input sizes that are relatively small. This means the performance is severely impacted by retpolines. - It goes to great lengths to amortize the cost of kernel_fpu_begin()/end() over as much work as possible, which is no longer necessary now that FPU save/restore is done lazily, and doing so may cause unbounded scheduling blackouts due to the fact that enabling the FPU in kernel mode disables preemption. - The CBC mode decryption helper makes backward strides through the input, in order to avoid a single block size memcpy() between chunks. Consuming the input in this manner is highly likely to defeat any hardware prefetchers, so it is better to go through the data linearly, and perform the extra memcpy() where needed (which is turned into direct loads and stores by the compiler anyway). Note that benchmarks won't show this effect, given that the memory they use is always cache hot. - It implements blockwise XOR in terms of le128 pointers, which imply an alignment that is not guaranteed by the API, violating the C standard. GCC does not seem to be smart enough to elide the indirect calls when the function pointers are passed as arguments to static inline helper routines modeled after the existing ones. So instead, let's create some CPP macros that encapsulate the core of the ECB and CBC processing, so we can wire them up for existing users of the glue helper module, i.e., Camellia, Serpent, Twofish and CAST6. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/blowfish - drop CTR mode implementationArd Biesheuvel1-107/+0
Blowfish in counter mode is never used in the kernel, so there is no point in keeping an accelerated implementation around. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/des - drop CTR mode implementationArd Biesheuvel1-104/+0
DES or Triple DES in counter mode is never used in the kernel, so there is no point in keeping an accelerated implementation around. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/glue-helper - drop CTR helper routinesArd Biesheuvel4-207/+0
The glue helper's CTR routines are no longer used, so drop them. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/twofish - drop CTR mode implementationArd Biesheuvel4-147/+0
Twofish in CTR mode is never used by the kernel directly, and is highly unlikely to be relied upon by dm-crypt or algif_skcipher. So let's drop the accelerated CTR mode implementation, and instead, rely on the CTR template and the bare cipher. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/cast6 - drop CTR mode implementationArd Biesheuvel2-76/+0
CAST6 in CTR mode is never used by the kernel directly, and is highly unlikely to be relied upon by dm-crypt or algif_skcipher. So let's drop the accelerated CTR mode implementation, and instead, rely on the CTR template and the bare cipher. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/cast5 - drop CTR mode implementationArd Biesheuvel1-103/+0
CAST5 in CTR mode is never used by the kernel directly, and is highly unlikely to be relied upon by dm-crypt or algif_skcipher. So let's drop the accelerated CTR mode implementation, and instead, rely on the CTR template and the bare cipher. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/serpent - drop CTR mode implementationArd Biesheuvel5-201/+0
Serpent in CTR mode is never used by the kernel directly, and is highly unlikely to be relied upon by dm-crypt or algif_skcipher. So let's drop the accelerated CTR mode implementation, and instead, rely on the CTR template and the bare cipher. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/camellia - drop CTR mode implementationArd Biesheuvel6-416/+0
Camellia in CTR mode is never used by the kernel directly, and is highly unlikely to be relied upon by dm-crypt or algif_skcipher. So let's drop the accelerated CTR mode implementation, and instead, rely on the CTR template and the bare cipher. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/glue-helper - drop XTS helper routinesArd Biesheuvel4-303/+0
The glue helper's XTS routines are no longer used, so drop them. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/twofish - switch to XTS templateArd Biesheuvel2-151/+0
Now that the XTS template can wrap accelerated ECB modes, it can be used to implement Twofish in XTS mode as well, which turns out to be at least as fast, and sometimes even faster Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/serpent- switch to XTS templateArd Biesheuvel5-304/+0
Now that the XTS template can wrap accelerated ECB modes, it can be used to implement Serpent in XTS mode as well, which turns out to be at least as fast, and sometimes even faster Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/cast6 - switch to XTS templateArd Biesheuvel2-154/+0
Now that the XTS template can wrap accelerated ECB modes, it can be used to implement CAST6 in XTS mode as well, which turns out to be at least as fast, and sometimes even faster Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/camellia - switch to XTS templateArd Biesheuvel5-576/+1
Now that the XTS template can wrap accelerated ECB modes, it can be used to implement Camellia in XTS mode as well, which turns out to be at least as fast, and sometimes even faster. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: aesni - replace function pointers with static branchesArd Biesheuvel1-44/+54
Replace the function pointers in the GCM implementation with static branches, which are based on code patching, which occurs only at module load time. This avoids the severe performance penalty caused by the use of retpolines. In order to retain the ability to switch between different versions of the implementation based on the input size on cores that support AVX and AVX2, use static branches instead of static calls. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: aesni - refactor scatterlist processingArd Biesheuvel1-83/+56
Currently, the gcm(aes-ni) driver open codes the scatterlist handling that is encapsulated by the skcipher walk API. So let's switch to that instead. Also, move the handling at the end of gcmaes_crypt_by_sg() that is dependent on whether we are encrypting or decrypting into the callers, which always do one or the other. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: aesni - clean up mapping of associated dataArd Biesheuvel1-4/+5
The gcm(aes-ni) driver is only built for x86_64, which does not make use of highmem. So testing for PageHighMem is pointless and can be omitted. While at it, replace GFP_ATOMIC with the appropriate runtime decided value based on the context. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: aesni - drop unused asm prototypesArd Biesheuvel1-67/+0
Drop some prototypes that are declared but never called. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: aesni - prevent misaligned buffers on the stackArd Biesheuvel1-12/+16
The GCM mode driver uses 16 byte aligned buffers on the stack to pass the IV to the asm helpers, but unfortunately, the x86 port does not guarantee that the stack pointer is 16 byte aligned upon entry in the first place. Since the compiler is not aware of this, it will not emit the additional stack realignment sequence that is needed, and so the alignment is not guaranteed to be more than 8 bytes. So instead, allocate some padding on the stack, and realign the IV pointer by hand. Cc: <stable@vger.kernel.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-08crypto: x86/aes-ni-xts - rewrite and drop indirections via glue helperArd Biesheuvel2-144/+356
The AES-NI driver implements XTS via the glue helper, which consumes a struct with sets of function pointers which are invoked on chunks of input data of the appropriate size, as annotated in the struct. Let's get rid of this indirection, so that we can perform direct calls to the assembler helpers. Instead, let's adopt the arm64 strategy, i.e., provide a helper which can consume inputs of any size, provided that the penultimate, full block is passed via the last call if ciphertext stealing needs to be applied. This also allows us to enable the XTS mode for i386. Tested-by: Eric Biggers <ebiggers@google.com> # x86_64 Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Reported-by: kernel test robot <lkp@intel.com> Reported-by: kernel test robot <lkp@intel.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-08crypto: x86/aes-ni-xts - use direct calls to and 4-way strideArd Biesheuvel2-56/+84
The XTS asm helper arrangement is a bit odd: the 8-way stride helper consists of back-to-back calls to the 4-way core transforms, which are called indirectly, based on a boolean that indicates whether we are performing encryption or decryption. Given how costly indirect calls are on x86, let's switch to direct calls, and given how the 8-way stride doesn't really add anything substantial, use a 4-way stride instead, and make the asm core routine deal with any multiple of 4 blocks. Since 512 byte sectors or 4 KB blocks are the typical quantities XTS operates on, increase the stride exported to the glue helper to 512 bytes as well. As a result, the number of indirect calls is reduced from 3 per 64 bytes of in/output to 1 per 512 bytes of in/output, which produces a 65% speedup when operating on 1 KB blocks (measured on a Intel(R) Core(TM) i7-8650U CPU) Fixes: 9697fa39efd3f ("x86/retpoline/crypto: Convert crypto assembler indirect jumps") Tested-by: Eric Biggers <ebiggers@google.com> # x86_64 Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03crypto: arm/blake2b - add NEON-accelerated BLAKE2bEric Biggers4-0/+464
Add a NEON-accelerated implementation of BLAKE2b. On Cortex-A7 (which these days is the most common ARM processor that doesn't have the ARMv8 Crypto Extensions), this is over twice as fast as SHA-256, and slightly faster than SHA-1. It is also almost three times as fast as the generic implementation of BLAKE2b: Algorithm Cycles per byte (on 4096-byte messages) =================== ======================================= blake2b-256-neon 14.0 sha1-neon 16.3 blake2s-256-arm 18.8 sha1-asm 20.8 blake2s-256-generic 26.0 sha256-neon 28.9 sha256-asm 32.0 blake2b-256-generic 38.9 This implementation isn't directly based on any other implementation, but it borrows some ideas from previous NEON code I've written as well as from chacha-neon-core.S. At least on Cortex-A7, it is faster than the other NEON implementations of BLAKE2b I'm aware of (the implementation in the BLAKE2 official repository using intrinsics, and Andrew Moon's implementation which can be found in SUPERCOP). It does only one block at a time, so it performs well on short messages too. NEON-accelerated BLAKE2b is useful because there is interest in using BLAKE2b-256 for dm-verity on low-end Android devices (specifically, devices that lack the ARMv8 Crypto Extensions) to replace SHA-1. On these devices, the performance cost of upgrading to SHA-256 may be unacceptable, whereas BLAKE2b-256 would actually improve performance. Although BLAKE2b is intended for 64-bit platforms (unlike BLAKE2s which is intended for 32-bit platforms), on 32-bit ARM processors with NEON, BLAKE2b is actually faster than BLAKE2s. This is because NEON supports 64-bit operations, and because BLAKE2s's block size is too small for NEON to be helpful for it. The best I've been able to do with BLAKE2s on Cortex-A7 is 18.8 cpb with an optimized scalar implementation. (I didn't try BLAKE2sp and BLAKE3, which in theory would be faster, but they're more complex as they require running multiple hashes at once. Note that BLAKE2b already uses all the NEON bandwidth on the Cortex-A7, so I expect that any speedup from BLAKE2sp or BLAKE3 would come only from the smaller number of rounds, not from the extra parallelism.) For now this BLAKE2b implementation is only wired up to the shash API, since there is no library API for BLAKE2b yet. However, I've tried to keep things consistent with BLAKE2s, e.g. by defining blake2b_compress_arch() which is analogous to blake2s_compress_arch() and could be exported for use by the library API later if needed. Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Tested-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03crypto: arm/blake2s - add ARM scalar optimized BLAKE2sEric Biggers4-0/+374
Add an ARM scalar optimized implementation of BLAKE2s. NEON isn't very useful for BLAKE2s because the BLAKE2s block size is too small for NEON to help. Each NEON instruction would depend on the previous one, resulting in poor performance. With scalar instructions, on the other hand, we can take advantage of ARM's "free" rotations (like I did in chacha-scalar-core.S) to get an implementation get runs much faster than the C implementation. Performance results on Cortex-A7 in cycles per byte using the shash API: 4096-byte messages: blake2s-256-arm: 18.8 blake2s-256-generic: 26.0 500-byte messages: blake2s-256-arm: 20.3 blake2s-256-generic: 27.9 100-byte messages: blake2s-256-arm: 29.7 blake2s-256-generic: 39.2 32-byte messages: blake2s-256-arm: 50.6 blake2s-256-generic: 66.2 Except on very short messages, this is still slower than the NEON implementation of BLAKE2b which I've written; that is 14.0, 16.4, 25.8, and 76.1 cpb on 4096, 500, 100, and 32-byte messages, respectively. However, optimized BLAKE2s is useful for cases where BLAKE2s is used instead of BLAKE2b, such as WireGuard. This new implementation is added in the form of a new module blake2s-arm.ko, which is analogous to blake2s-x86_64.ko in that it provides blake2s_compress_arch() for use by the library API as well as optionally register the algorithms with the shash API. Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Tested-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03crypto: blake2s - share the "shash" API boilerplate codeEric Biggers1-67/+7
Add helper functions for shash implementations of BLAKE2s to include/crypto/internal/blake2s.h, taking advantage of __blake2s_update() and __blake2s_final() that were added by the previous patch to share more code between the library and shash implementations. crypto_blake2s_setkey() and crypto_blake2s_init() are usable as shash_alg::setkey and shash_alg::init directly, while crypto_blake2s_update() and crypto_blake2s_final() take an extra 'blake2s_compress_t' function pointer parameter. This allows the implementation of the compression function to be overridden, which is the only part that optimized implementations really care about. The new functions are inline functions (similar to those in sha1_base.h, sha256_base.h, and sm3_base.h) because this avoids needing to add a new module blake2s_helpers.ko, they aren't *too* long, and this avoids indirect calls which are expensive these days. Note that they can't go in blake2s_generic.ko, as that would require selecting CRYPTO_BLAKE2S from CRYPTO_BLAKE2S_X86, which would cause a recursive dependency. Finally, use these new helper functions in the x86 implementation of BLAKE2s. (This part should be a separate patch, but unfortunately the x86 implementation used the exact same function names like "crypto_blake2s_update()", so it had to be updated at the same time.) Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03crypto: x86/blake2s - define shash_alg structs using macrosEric Biggers1-61/+23
The shash_alg structs for the four variants of BLAKE2s are identical except for the algorithm name, driver name, and digest size. So, avoid code duplication by using a macro to define these structs. Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03crypto: arm64/aes-ctr - improve tail handlingArd Biesheuvel2-74/+137
Counter mode is a stream cipher chaining mode that is typically used with inputs that are of arbitrarily length, and so a tail block which is smaller than a full AES block is rule rather than exception. The current ctr(aes) implementation for arm64 always makes a separate call into the assembler routine to process this tail block, which is suboptimal, given that it requires reloading of the AES round keys, and prevents us from handling this tail block using the 5-way stride that we use for better performance on deep pipelines. So let's update the assembler routine so it can handle any input size, and uses NEON permutation instructions and overlapping loads and stores to handle the tail block. This results in a ~16% speedup for 1420 byte blocks on cores with deep pipelines such as ThunderX2. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03crypto: arm64/aes-ce - really hide slower algos when faster ones are enabledArd Biesheuvel1-2/+2
Commit 69b6f2e817e5b ("crypto: arm64/aes-neon - limit exposed routines if faster driver is enabled") intended to hide modes from the plain NEON driver that are also implemented by the faster bit sliced NEON one if both are enabled. However, the defined() CPP function does not detect if the bit sliced NEON driver is enabled as a module. So instead, let's use IS_ENABLED() here. Fixes: 69b6f2e817e5b ("crypto: arm64/aes-neon - limit exposed routines if ...") Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03crypto: remove cipher routines from public crypto APIArd Biesheuvel2-0/+5
The cipher routines in the crypto API are mostly intended for templates implementing skcipher modes generically in software, and shouldn't be used outside of the crypto subsystem. So move the prototypes and all related definitions to a new header file under include/crypto/internal. Also, let's use the new module namespace feature to move the symbol exports into a new namespace CRYPTO_INTERNAL. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03crypto: aesni - implement support for cts(cbc(aes))Ard Biesheuvel2-1/+261
Follow the same approach as the arm64 driver for implementing a version of AES-NI in CBC mode that supports ciphertext stealing. This results in a ~2x speed increase for relatively short inputs (less than 256 bytes), which is relevant given that AES-CBC with ciphertext stealing is used for filename encryption in the fscrypt layer. For larger inputs, the speedup is still significant (~25% on decryption, ~6% on encryption) Tested-by: Eric Biggers <ebiggers@google.com> # x86_64 Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03crypto: arm/chacha-neon - add missing counter incrementArd Biesheuvel1-0/+1
Commit 86cd97ec4b943af3 ("crypto: arm/chacha-neon - optimize for non-block size multiples") refactored the chacha block handling in the glue code in a way that may result in the counter increment to be omitted when calling chacha_block_xor_neon() to process a full block. This violates the skcipher API, which requires that the output IV is suitable for handling more input as long as the preceding input has been presented in round multiples of the block size. Also, the same code is exposed via the chacha library interface whose callers may actually rely on this increment to occur even for final blocks that are smaller than the chacha block size. So increment the counter after calling chacha_block_xor_neon(). Fixes: 86cd97ec4b943af3 ("crypto: arm/chacha-neon - optimize for non-block size multiples") Reported-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>