diff options
author | Clement Courbet <[email protected]> | 2021-03-03 14:46:53 -0800 |
---|---|---|
committer | Peter Zijlstra <[email protected]> | 2021-03-10 09:51:49 +0100 |
commit | 1e17fb8edc5ad6587e9303ccdebce853bc8cf30c (patch) | |
tree | 65052ced18c491a4a3b229f5dc780b94efa3f063 /tools/perf/scripts/python/Perf-Trace-Util | |
parent | 4117cebf1a9fcbf35b9aabf0e37b6c5eea296798 (diff) |
sched: Optimize __calc_delta()
A significant portion of __calc_delta() time is spent in the loop
shifting a u64 by 32 bits. Use `fls` instead of iterating.
This is ~7x faster on benchmarks.
The generic `fls` implementation (`generic_fls`) is still ~4x faster
than the loop.
Architectures that have a better implementation will make use of it. For
example, on x86 we get an additional factor 2 in speed without dedicated
implementation.
On GCC, the asm versions of `fls` are about the same speed as the
builtin. On Clang, the versions that use fls are more than twice as
slow as the builtin. This is because the way the `fls` function is
written, clang puts the value in memory:
https://godbolt.org/z/EfMbYe. This bug is filed at
https://bugs.llvm.org/show_bug.cgi?idI406.
```
name cpu/op
BM_Calc<__calc_delta_loop> 9.57ms Â=B112%
BM_Calc<__calc_delta_generic_fls> 2.36ms Â=B113%
BM_Calc<__calc_delta_asm_fls> 2.45ms Â=B113%
BM_Calc<__calc_delta_asm_fls_nomem> 1.66ms Â=B112%
BM_Calc<__calc_delta_asm_fls64> 2.46ms Â=B113%
BM_Calc<__calc_delta_asm_fls64_nomem> 1.34ms Â=B115%
BM_Calc<__calc_delta_builtin> 1.32ms Â=B111%
```
Signed-off-by: Clement Courbet <[email protected]>
Signed-off-by: Josh Don <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Diffstat (limited to 'tools/perf/scripts/python/Perf-Trace-Util')
0 files changed, 0 insertions, 0 deletions