path: root/include
Age | Commit message | Author | Files | Lines

2022-06-28 | sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util() | Dietmar Eggemann | 1 | -1/+1

effective_cpu_util() already has an `int cpu' parameter which allows retrieving the CPU capacity scale factor (or maximum CPU capacity) inside this function via an arch_scale_cpu_capacity(cpu). A lot of code calling effective_cpu_util() (or the shim sched_cpu_util()) needs the maximum CPU capacity, i.e. it will call arch_scale_cpu_capacity() already. But not having to pass it into effective_cpu_util() will make the EAS wake-up code easier, especially when the maximum CPU capacity reduced by the thermal pressure is passed through the EAS wake-up functions.

Due to the asymmetric CPU capacity support of arm/arm64 architectures, arch_scale_cpu_capacity(int cpu) is a per-CPU variable read access via per_cpu(cpu_scale, cpu) on such a system. On all other architectures it is a compile-time constant (SCHED_CAPACITY_SCALE).

Signed-off-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Vincent Guittot <[email protected]>
Tested-by: Lukasz Luba <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
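As an illustration (not part of the commit), the caller-side pattern after this change might look like the following sketch; callers that still need the maximum capacity fetch it themselves:

    unsigned long util, max;

    max  = arch_scale_cpu_capacity(cpu);   /* per-CPU read on asymmetric
                                              arm/arm64, compile-time
                                              constant elsewhere */
    util = sched_cpu_util(cpu);            /* no 'max' parameter anymore */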
2022-06-28 | perf/core: Add a new read format to get a number of lost samples | Namhyung Kim | 2 | -1/+6

Sometimes we want to know an accurate number of samples even if some are lost. Currently PERF_RECORD_LOST is generated for a ring-buffer which might be shared with other events, so it's hard to know the per-event lost count. Add an event->lost_samples field and PERF_FORMAT_LOST to retrieve it from userspace.

Original-patch-by: Jiri Olsa <[email protected]>
Signed-off-by: Namhyung Kim <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
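For illustration, a sketch of how userspace might retrieve the new count for a single (non-group) event; the struct layout follows the usual read-format ordering and is an assumption, not copied from the patch:

    struct read_format {
            __u64 value;   /* counter value */
            __u64 id;      /* present if PERF_FORMAT_ID is set */
            __u64 lost;    /* present if PERF_FORMAT_LOST is set */
    } rf;

    attr.read_format = PERF_FORMAT_ID | PERF_FORMAT_LOST;
    /* ... perf_event_open(), run workload ... */
    read(event_fd, &rf, sizeof(rf));       /* rf.lost = dropped samples */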
2022-06-28 | sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg | Chen Yu | 1 | -0/+1

[Problem Statement]

select_idle_cpu() might spend too much time searching for an idle CPU when the system is overloaded. The following histogram is the time spent in select_idle_cpu(), when running 224 instances of netperf on a system with 112 CPUs per LLC domain:

@usecs:
[0]                533 |                                                    |
[1]               5495 |                                                    |
[2, 4)           12008 |                                                    |
[4, 8)          239252 |                                                    |
[8, 16)        4041924 |@@@@@@@@@@@@@@                                      |
[16, 32)      12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
[32, 64)      14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128)     13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
[128, 256)     8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[256, 512)     4507667 |@@@@@@@@@@@@@@@                                     |
[512, 1K)      2600472 |@@@@@@@@@                                           |
[1K, 2K)        927912 |@@@                                                 |
[2K, 4K)        218720 |                                                    |
[4K, 8K)         98161 |                                                    |
[8K, 16K)        37722 |                                                    |
[16K, 32K)        6715 |                                                    |
[32K, 64K)         477 |                                                    |
[64K, 128K)          7 |                                                    |

netperf latency usecs:
=======
case            load          Lat_99th    std%
TCP_RR          thread-224      257.39  (  0.21)

The time spent in select_idle_cpu() is visible to netperf and might have a negative impact.

[Symptom analysis]

The patch [1] from Mel Gorman has been applied to track the efficiency of select_idle_sibling. Copy the indicators here:

SIS Search Efficiency (se_eff%): A ratio expressed as a percentage of runqueues scanned versus idle CPUs found. A 100% efficiency indicates that the target, prev or recent CPU of a task was idle at wakeup. The lower the efficiency, the more runqueues were scanned before an idle CPU was found.

SIS Domain Search Efficiency (dom_eff%): Similar, except only for the slower SIS path.

SIS Fast Success Rate (fast_rate%): Percentage of SIS that used target, prev or recent CPUs.

SIS Success rate (success_rate%): Percentage of scans that found an idle CPU.

The test is based on Aubrey's schedtests tool, including netperf, hackbench, schbench and tbench.

Test on vanilla kernel:

schedstat_parse.py -f netperf_vanilla.log
case            load            se_eff%    dom_eff%   fast_rate%   success_rate%
TCP_RR          28 threads       99.978     18.535     99.995       100.000
TCP_RR          56 threads       99.397      5.671     99.964       100.000
TCP_RR          84 threads       21.721      6.818     73.632       100.000
TCP_RR          112 threads      12.500      5.533     59.000       100.000
TCP_RR          140 threads       8.524      4.535     49.020       100.000
TCP_RR          168 threads       6.438      3.945     40.309        99.999
TCP_RR          196 threads       5.397      3.718     32.320        99.982
TCP_RR          224 threads       4.874      3.661     25.775        99.767
UDP_RR          28 threads       99.988     17.704     99.997       100.000
UDP_RR          56 threads       99.528      5.977     99.970       100.000
UDP_RR          84 threads       24.219      6.992     76.479       100.000
UDP_RR          112 threads      13.907      5.706     62.538       100.000
UDP_RR          140 threads       9.408      4.699     52.519       100.000
UDP_RR          168 threads       7.095      4.077     44.352       100.000
UDP_RR          196 threads       5.757      3.775     35.764        99.991
UDP_RR          224 threads       5.124      3.704     28.748        99.860

schedstat_parse.py -f schbench_vanilla.log (each group has 28 tasks)
case            load            se_eff%    dom_eff%   fast_rate%   success_rate%
normal          1 mthread        99.152      6.400     99.941       100.000
normal          2 mthreads       97.844      4.003     99.908       100.000
normal          3 mthreads       96.395      2.118     99.917        99.998
normal          4 mthreads       55.288      1.451     98.615        99.804
normal          5 mthreads        7.004      1.870     45.597        61.036
normal          6 mthreads        3.354      1.346     20.777        34.230
normal          7 mthreads        2.183      1.028     11.257        21.055
normal          8 mthreads        1.653      0.825      7.849        15.549

schedstat_parse.py -f hackbench_vanilla.log (each group has 28 tasks)
case            load            se_eff%    dom_eff%   fast_rate%   success_rate%
process-pipe    1 group          99.991      7.692     99.999       100.000
process-pipe    2 groups         99.934      4.615     99.997       100.000
process-pipe    3 groups         99.597      3.198     99.987       100.000
process-pipe    4 groups         98.378      2.464     99.958       100.000
process-pipe    5 groups         27.474      3.653     89.811        99.800
process-pipe    6 groups         20.201      4.098     82.763        99.570
process-pipe    7 groups         16.423      4.156     77.398        99.316
process-pipe    8 groups         13.165      3.920     72.232        98.828
process-sockets 1 group          99.977      5.882     99.999       100.000
process-sockets 2 groups         99.927      5.505     99.996       100.000
process-sockets 3 groups         99.397      3.250     99.980       100.000
process-sockets 4 groups         79.680      4.258     98.864        99.998
process-sockets 5 groups          7.673      2.503     63.659        92.115
process-sockets 6 groups          4.642      1.584     58.946        88.048
process-sockets 7 groups          3.493      1.379     49.816        81.164
process-sockets 8 groups          3.015      1.407     40.845        75.500
threads-pipe    1 group          99.997      0.000    100.000       100.000
threads-pipe    2 groups         99.894      2.932     99.997       100.000
threads-pipe    3 groups         99.611      4.117     99.983       100.000
threads-pipe    4 groups         97.703      2.624     99.937       100.000
threads-pipe    5 groups         22.919      3.623     87.150        99.764
threads-pipe    6 groups         18.016      4.038     80.491        99.557
threads-pipe    7 groups         14.663      3.991     75.239        99.247
threads-pipe    8 groups         12.242      3.808     70.651        98.644
threads-sockets 1 group          99.990      6.667     99.999       100.000
threads-sockets 2 groups         99.940      5.114     99.997       100.000
threads-sockets 3 groups         99.469      4.115     99.977       100.000
threads-sockets 4 groups         87.528      4.038     99.400       100.000
threads-sockets 5 groups          6.942      2.398     59.244        88.337
threads-sockets 6 groups          4.359      1.954     49.448        87.860
threads-sockets 7 groups          2.845      1.345     41.198        77.102
threads-sockets 8 groups          2.871      1.404     38.512        74.312

schedstat_parse.py -f tbench_vanilla.log
case            load            se_eff%    dom_eff%   fast_rate%   success_rate%
loopback        28 threads       99.976     18.369     99.995       100.000
loopback        56 threads       99.222      7.799     99.934       100.000
loopback        84 threads       19.723      6.819     70.215       100.000
loopback        112 threads      11.283      5.371     55.371        99.999
loopback        140 threads       0.000      0.000      0.000         0.000
loopback        168 threads       0.000      0.000      0.000         0.000
loopback        196 threads       0.000      0.000      0.000         0.000
loopback        224 threads       0.000      0.000      0.000         0.000

According to the test above, if the system becomes busy, the SIS Search Efficiency (se_eff%) drops significantly. Although some benchmarks would finally find an idle CPU (success_rate% = 100%), it is doubtful whether it is worth it to search the whole LLC domain.

[Proposal]

It would be ideal to have a crystal ball to answer this question: How many CPUs must a wakeup path walk down before it can find an idle CPU? Many potential metrics could be used to predict the number. One candidate is the sum of util_avg in this LLC domain. The benefit of choosing util_avg is that it is a metric of accumulated historic activity, which seems to be smoother than instantaneous metrics (such as rq->nr_running). Besides, choosing the sum of util_avg would help predict the load of the LLC domain more precisely, because SIS_PROP uses one CPU's idle time to estimate the total LLC domain idle time.

In summary, the lower the util_avg is, the more select_idle_cpu() should scan for an idle CPU, and vice versa. When the sum of util_avg in this LLC domain hits 85% or above, the scan stops. The reason to choose 85% as the threshold is that this is the imbalance_pct (117) when an LLC sched group is overloaded.

Introduce the quadratic function:

    y = SCHED_CAPACITY_SCALE - p * x^2
    y' = y / SCHED_CAPACITY_SCALE

x is the ratio of sum_util compared to the CPU capacity:

    x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE)

y' is the ratio of CPUs to be scanned in the LLC domain, and the number of CPUs to scan is calculated by:

    nr_scan = llc_weight * y'

A quadratic function was chosen because:

[1] Compared to a linear function, it scans more aggressively when the sum_util is low.
[2] Compared to an exponential function, it is easier to calculate.
[3] It seems that there is no accurate mapping between the sum of util_avg and the number of CPUs to be scanned. Use a heuristic scan for now.

For a platform with 112 CPUs per LLC, the number of CPUs to scan is:

sum_util%     0    5   15   25   35   45   55   65   75   85   86 ...
scan_nr     112  111  108  102   93   81   65   47   25    1    0 ...

For a platform with 16 CPUs per LLC, the number of CPUs to scan is:

sum_util%     0    5   15   25   35   45   55   65   75   85   86 ...
scan_nr      16   15   15   14   13   11    9    6    3    0    0 ...

Furthermore, to minimize the overhead of calculating the metrics in select_idle_cpu(), borrow the statistics from periodic load balance. As mentioned by Abel, on a platform with 112 CPUs per LLC, the sum_util calculated by periodic load balance after 112 ms would decay to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay in reflecting the latest utilization. But it is a trade-off: checking the util_avg in newidle load balance would be more frequent, but it brings overhead, since multiple CPUs write/read the per-LLC shared variable and introduce cache contention. Tim also mentioned that it is allowed to be non-optimal in terms of scheduling for short-term variations, but if there is a long-term trend in the load behavior, the scheduler can adjust for that.

When SIS_UTIL is enabled, select_idle_cpu() uses the nr_scan calculated by SIS_UTIL instead of the one from SIS_PROP. As Peter and Mel suggested, SIS_UTIL should be enabled by default.

This patch is based on util_avg, which is very sensitive to CPU frequency invariance. There is an issue that, when the max frequency has been clamped, the util_avg decays insanely fast when the CPU is idle. Commit addca285120b ("cpufreq: intel_pstate: Handle no_turbo in frequency invariance") could be used to mitigate this symptom, by adjusting the arch_max_freq_ratio when turbo is disabled.
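As an illustration (not the kernel's exact code), the scan depth described above can be computed with integer math roughly as below; the constant 740 approximates 0.85^2 * SCHED_CAPACITY_SCALE, so y reaches zero at the 85% threshold:

    static int sis_util_nr_scan(unsigned long sum_util, int llc_weight)
    {
            /* x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE),
             * pre-scaled by SCHED_CAPACITY_SCALE */
            unsigned long x = sum_util / llc_weight;
            /* y = SCHED_CAPACITY_SCALE - p * x^2 */
            long y = SCHED_CAPACITY_SCALE - (long)(x * x) / 740;

            if (y < 0)
                    y = 0;

            /* nr_scan = llc_weight * y' */
            return llc_weight * y / SCHED_CAPACITY_SCALE;
    }

Plugging llc_weight = 112 into this sketch reproduces the first table above to within rounding.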
This frequency-invariance issue is still not thoroughly fixed, though, because the current code is unaware of the user-specified max CPU frequency.

[Test result]

netperf and tbench were launched with 25%, 50%, 75%, 100%, 125%, 150%, 175% and 200% of the CPU number respectively. Hackbench and schbench were launched with 1, 2, 4 and 8 groups. Each test lasts for 100 seconds and repeats 3 times. The following is the benchmark result comparison between baseline: vanilla v5.19-rc1 and compare: patched kernel. A positive compare% indicates better performance.

Each netperf test is a:
netperf -4 -H 127.0.0.1 -t TCP/UDP_RR -c -C -l 100

netperf.throughput
=======
case            load            baseline(std%)   compare%( std%)
TCP_RR          28 threads       1.00 (  0.34)    -0.16 (  0.40)
TCP_RR          56 threads       1.00 (  0.19)    -0.02 (  0.20)
TCP_RR          84 threads       1.00 (  0.39)    -0.47 (  0.40)
TCP_RR          112 threads      1.00 (  0.21)    -0.66 (  0.22)
TCP_RR          140 threads      1.00 (  0.19)    -0.69 (  0.19)
TCP_RR          168 threads      1.00 (  0.18)    -0.48 (  0.18)
TCP_RR          196 threads      1.00 (  0.16)  +194.70 ( 16.43)
TCP_RR          224 threads      1.00 (  0.16)  +197.30 (  7.85)
UDP_RR          28 threads       1.00 (  0.37)    +0.35 (  0.33)
UDP_RR          56 threads       1.00 ( 11.18)    -0.32 (  0.21)
UDP_RR          84 threads       1.00 (  1.46)    -0.98 (  0.32)
UDP_RR          112 threads      1.00 ( 28.85)    -2.48 ( 19.61)
UDP_RR          140 threads      1.00 (  0.70)    -0.71 ( 14.04)
UDP_RR          168 threads      1.00 ( 14.33)    -0.26 ( 11.16)
UDP_RR          196 threads      1.00 ( 12.92)  +186.92 ( 20.93)
UDP_RR          224 threads      1.00 ( 11.74)  +196.79 ( 18.62)

Take the 224 threads as an example, the SIS search metrics changes are illustrated below:

    vanilla                   patched
    4544492      +237.5%      15338634   sched_debug.cpu.sis_domain_search.avg
      38539    +39686.8%      15333634   sched_debug.cpu.sis_failed.avg
  128300000       -87.9%      15551326   sched_debug.cpu.sis_scanned.avg
    5842896      +162.7%      15347978   sched_debug.cpu.sis_search.avg

There are -87.9% fewer CPU scans after the patch, which indicates lower overhead. Besides, with this patch applied, there is -13% less rq lock contention in perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested.try_to_wake_up.default_wake_function.woken_wake_function. This might help explain the performance improvement: the patch allows the waking task to remain on the previous CPU, rather than grabbing other CPUs' lock.
Each hackbench test is a:
hackbench -g $job --process/threads --pipe/sockets -l 1000000 -s 100

hackbench.throughput
=========
case            load            baseline(std%)   compare%( std%)
process-pipe    1 group          1.00 (  1.29)    +0.57 (  0.47)
process-pipe    2 groups         1.00 (  0.27)    +0.77 (  0.81)
process-pipe    4 groups         1.00 (  0.26)    +1.17 (  0.02)
process-pipe    8 groups         1.00 (  0.15)    -4.79 (  0.02)
process-sockets 1 group          1.00 (  0.63)    -0.92 (  0.13)
process-sockets 2 groups         1.00 (  0.03)    -0.83 (  0.14)
process-sockets 4 groups         1.00 (  0.40)    +5.20 (  0.26)
process-sockets 8 groups         1.00 (  0.04)    +3.52 (  0.03)
threads-pipe    1 group          1.00 (  1.28)    +0.07 (  0.14)
threads-pipe    2 groups         1.00 (  0.22)    -0.49 (  0.74)
threads-pipe    4 groups         1.00 (  0.05)    +1.88 (  0.13)
threads-pipe    8 groups         1.00 (  0.09)    -4.90 (  0.06)
threads-sockets 1 group          1.00 (  0.25)    -0.70 (  0.53)
threads-sockets 2 groups         1.00 (  0.10)    -0.63 (  0.26)
threads-sockets 4 groups         1.00 (  0.19)   +11.92 (  0.24)
threads-sockets 8 groups         1.00 (  0.08)    +4.31 (  0.11)

Each tbench test is a:
tbench -t 100 $job 127.0.0.1

tbench.throughput
======
case            load            baseline(std%)   compare%( std%)
loopback        28 threads       1.00 (  0.06)    -0.14 (  0.09)
loopback        56 threads       1.00 (  0.03)    -0.04 (  0.17)
loopback        84 threads       1.00 (  0.05)    +0.36 (  0.13)
loopback        112 threads      1.00 (  0.03)    +0.51 (  0.03)
loopback        140 threads      1.00 (  0.02)    -1.67 (  0.19)
loopback        168 threads      1.00 (  0.38)    +1.27 (  0.27)
loopback        196 threads      1.00 (  0.11)    +1.34 (  0.17)
loopback        224 threads      1.00 (  0.11)    +1.67 (  0.22)

Each schbench test is a:
schbench -m $job -t 28 -r 100 -s 30000 -c 30000

schbench.latency_90%_us
========
case            load            baseline(std%)   compare%( std%)
normal          1 mthread        1.00 ( 31.22)    -7.36 ( 20.25)*
normal          2 mthreads       1.00 (  2.45)    -0.48 (  1.79)
normal          4 mthreads       1.00 (  1.69)    +0.45 (  0.64)
normal          8 mthreads       1.00 (  5.47)    +9.81 ( 14.28)

* Considering the standard deviation, this -7.36% regression might not be valid.

Also, an OLTP workload with a commercial RDBMS has been tested, and there is no significant change.

There were concerns that unbalanced tasks among CPUs would cause problems. For example, suppose the LLC domain is composed of 8 CPUs, and 7 tasks are bound to CPU0~CPU6, while CPU7 is idle:

          CPU0   CPU1   CPU2   CPU3   CPU4   CPU5   CPU6   CPU7
util_avg  1024   1024   1024   1024   1024   1024   1024      0

Since the util_avg ratio is 87.5% (= 7/8), which is higher than 85%, select_idle_cpu() will not scan, thus CPU7 is undetected during the scan. But according to Mel, it is unlikely that CPU7 will be idle all the time, because CPU7 could pull some tasks via CPU_NEWLY_IDLE.

lkp (kernel test robot) has reported a regression on stress-ng.sock on a very busy system. According to the sched_debug statistics, it might be caused by SIS_UTIL terminating the scan and choosing a previous CPU earlier, which might introduce more context switches, especially involuntary preemption, and that impacts a busy stress-ng. This regression has shown that not all benchmarks in every scenario benefit from the idle CPU scan limit, and it needs further investigation.

Besides, there is a slight regression in hackbench's 16 groups case when the LLC domain has 16 CPUs. Prateek mentioned that we should scan aggressively in an LLC domain with 16 CPUs, because the cost of searching for an idle CPU among 16 CPUs is negligible. The current patch aims to propose a generic solution and only considers util_avg.
Something like the below could be applied on top of the current patch to fulfill the requirement:

    if (llc_weight <= 16)
            nr_scan = nr_scan * 32 / llc_weight;

For an LLC domain with 16 CPUs, nr_scan will be expanded to 2 times larger. The smaller the number of CPUs an LLC domain has, the larger nr_scan will be expanded. This needs further investigation.

There is also ongoing work [2] from Abel to filter out the busy CPUs during wakeup, to further speed up the idle CPU scan. It could be a follow-up optimization on top of this change.

Suggested-by: Tim Chen <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Yicong Yang <[email protected]>
Tested-by: Mohini Narkhede <[email protected]>
Tested-by: K Prateek Nayak <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2022-06-27 | scsi: ufs: ufshcd: Constify pointed data | Krzysztof Kozlowski | 1 | -3/+3

For code safety, constify arrays and pointers to data which is not modified.

Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Bart Van Assche <[email protected]>
Signed-off-by: Krzysztof Kozlowski <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>

2022-06-27 | docs: rename Documentation/vm to Documentation/mm | Mike Rapoport | 5 | -7/+7

so it will be consistent with the code mm directory and with Documentation/admin-guide/mm, and won't be confused with virtual machines.

Signed-off-by: Mike Rapoport <[email protected]>
Suggested-by: Matthew Wilcox <[email protected]>
Tested-by: Ira Weiny <[email protected]>
Acked-by: Jonathan Corbet <[email protected]>
Acked-by: Wu XiangCheng <[email protected]>

2022-06-27 | Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost | Linus Torvalds | 1 | -0/+2

Pull virtio fixes from Michael Tsirkin: "Fixes all over the place, most notably we are disabling IRQ hardening (again!)"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  virtio_ring: make vring_create_virtqueue_split prettier
  vhost-vdpa: call vhost_vdpa_cleanup during the release
  virtio_mmio: Restore guest page size on resume
  virtio_mmio: Add missing PM calls to freeze/restore
  caif_virtio: fix race between virtio_device_ready() and ndo_open()
  virtio-net: fix race between ndo_open() and virtio_device_ready()
  virtio: disable notification hardening by default
  virtio: Remove unnecessary variable assignments
  virtio_ring: keep used_wrap_counter in vq->last_used_idx
  vduse: Tie vduse mgmtdev and its device
  vdpa/mlx5: Initialize CVQ vringh only once
  vdpa/mlx5: Update Control VQ callback information

2022-06-27 | Merge branch 'master' into mm-nonmm-stable | akpm | 25 | -505/+103

2022-06-27 | Merge branch 'master' into mm-stable | akpm | 25 | -505/+103

2022-06-27 | netfilter: nf_tables: avoid skb access on nf_stolen | Florian Westphal | 1 | -6/+10

When the verdict is NF_STOLEN, the skb might have been freed. When tracing is enabled, this can result in a use-after-free:

1. access to skb->nf_trace
2. access to skb->mark
3. computation of the trace id
4. dump of the packet payload

To avoid 1, keep a cached copy of skb->nf_trace in the trace state struct and refresh this copy whenever the verdict is != STOLEN. Avoid 2 by skipping the skb->mark access if the verdict is STOLEN. 3 is avoided by precomputing the trace id. Only dump the packet when the verdict is not "STOLEN".

Reported-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
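A sketch of the caching idea (the field placement and names here are assumptions, not the exact patch):

    /* refresh the cached copy while the skb is known to be alive */
    if (verdict != NF_STOLEN)
            info->nf_trace = skb->nf_trace;

    /* later, in the trace path: consult the cache, never the skb */
    if (!info->nf_trace)
            return;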
2022-06-27 | vfio: de-extern-ify function prototypes | Alex Williamson | 2 | -69/+66

The use of 'extern' in function prototypes has been discouraged in the kernel coding style for several years now; remove them from all vfio related files so contributors no longer need to decide between style and consistency.

Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Eric Farman <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/165471414407.203056.474032786990662279.stgit@omen
Signed-off-by: Alex Williamson <[email protected]>

2022-06-27 | driver core: fw_devlink: Allow firmware to mark devices as best effort | Saravana Kannan | 1 | -0/+4

When firmware sets the FWNODE_FLAG_BEST_EFFORT flag for a fwnode, fw_devlink will do a best-effort ordering for that device, where it'll only enforce the probe/suspend/resume ordering of that device with suppliers that have drivers. The driver of that device can then decide if it wants to defer probe or probe without the suppliers. This will be useful for avoiding the probe delays of the console device that were caused by commit 71066545b48e ("driver core: Set fw_devlink.strict=1 by default").

Fixes: 71066545b48e ("driver core: Set fw_devlink.strict=1 by default")
Reported-by: Sascha Hauer <[email protected]>
Reported-by: Peng Fan <[email protected]>
Tested-by: Peng Fan <[email protected]>
Signed-off-by: Saravana Kannan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>

2022-06-27 | kernfs: Replace global kernfs_open_file_mutex with hashed mutexes. | Imran Khan | 1 | -0/+57

In the current kernfs design a single mutex, kernfs_open_file_mutex, protects the list of kernfs_open_file instances corresponding to a sysfs attribute. So even if different tasks are opening or closing different sysfs files, they can contend on the osq_lock of this mutex. The contention is more apparent in large-scale systems with a few hundred CPUs, where most of the CPUs have running tasks that are opening, accessing or closing sysfs files at any point of time.

Using hashed mutexes in place of a single global mutex can significantly reduce contention around the global mutex and hence provide better scalability. Moreover, as these hashed mutexes are not part of kernfs_node objects, we will not see any significant change in memory utilization of kernfs-based file systems like sysfs, cgroupfs etc.

Modify the interface introduced in the previous patch to make use of hashed mutexes. Use the kernfs_node address as the hashing key.

Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Imran Khan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
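A sketch of the hashed-lock lookup described above (the lock-table size is hypothetical; the key is the kernfs_node address):

    #include <linux/hash.h>

    #define NR_KERNFS_LOCK_BITS     10
    #define NR_KERNFS_LOCKS         (1 << NR_KERNFS_LOCK_BITS)

    static struct mutex open_file_mutex[NR_KERNFS_LOCKS];

    static inline struct mutex *kernfs_open_file_mutex_ptr(struct kernfs_node *kn)
    {
            /* unrelated nodes rarely hash to the same mutex */
            return &open_file_mutex[hash_ptr(kn, NR_KERNFS_LOCK_BITS)];
    }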
2022-06-27 | kernfs: Change kernfs_notify_list to llist. | Imran Khan | 1 | -1/+1

At present kernfs_notify_list is implemented as a singly linked list of kernfs_node(s), where the last element points to itself and the value of ->attr.next tells whether a node is present on the list or not. Both addition to and deletion from the list happen under kernfs_notify_lock. Change kernfs_notify_list to an llist so that additions to the list can happen locklessly.

Suggested-by: Al Viro <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Imran Khan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>

2022-06-27 | kernfs: make ->attr.open RCU protected. | Imran Khan | 1 | -1/+1

After removal of kernfs_open_node->refcnt in the previous patch, kernfs_open_node_lock can be removed as well, by making ->attr.open RCU protected. kernfs_put_open_node() can delegate freeing of ->attr.open to RCU, and other readers of ->attr.open can do so under rcu_read_(un)lock.

Suggested-by: Al Viro <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Imran Khan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>

2022-06-27 | Revert "devcoredump: remove the useless gfp_t parameter in dev_coredumpv and dev_coredumpm" | Greg Kroah-Hartman | 2 | -7/+8

This reverts commit 77515ebaf01920e2db49e04672ef669a7c2907f2, as it causes build problems in linux-next. It needs to be reintroduced in a way that allows the API to evolve without requiring a "flag day" to catch all users.

Link: https://lore.kernel.org/r/[email protected]
Cc: Duoming Zhou <[email protected]>
Cc: Brian Norris <[email protected]>
Cc: Johannes Berg <[email protected]>
Reported-by: Stephen Rothwell <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

2022-06-27 | tty: Add N_CAN327 line discipline ID for ELM327 based CAN driver | Max Staudt | 1 | -1/+2

The actual driver will be added via the CAN tree.

Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/all/[email protected]
Signed-off-by: Max Staudt <[email protected]>
Acked-by: Marc Kleine-Budde <[email protected]>
Acked-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Marc Kleine-Budde <[email protected]>

2022-06-27 | Binder: add TF_UPDATE_TXN to replace outdated txn | Li Li | 1 | -0/+1

When the target process is busy, incoming oneway transactions are queued in the async_todo list. If the clients continue sending extra oneway transactions while the target process is frozen, this queue can become too large to accommodate new transactions. That's why the binder driver introduced ONEWAY_SPAM_DETECTION to detect this situation. It's helpful for debugging the async binder buffer exhaustion issue, but the issue itself isn't solved directly.

In real cases applications are designed to send oneway transactions repeatedly, delivering updated information to the target process. Typical examples are Wi-Fi signal strength and some real-time sensor data. Even if the apps might only care about the latest information, all outdated oneway transactions are still accumulated there until the frozen process is thawed later. For these kinds of situations, there's no existing method to skip the outdated transactions and deliver only the latest one.

This patch introduces a new transaction flag, TF_UPDATE_TXN. To use it, user apps can set this new flag along with TF_ONE_WAY. When such a oneway transaction is to be queued into the async_todo list of a frozen process, the binder driver will check if any previous pending transactions can be superseded by comparing their code, flags and target node. If such an outdated pending transaction is found, the latest transaction will supersede it. This effectively prevents the async binder buffer from running out and saves unnecessary binder read workloads.

Acked-by: Todd Kjos <[email protected]>
Signed-off-by: Li Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
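From userspace, usage could look like this sketch (binder UAPI structs; setup and error handling omitted):

    struct binder_transaction_data tr = { 0 };

    /* ... fill in target handle, code, buffers ... */

    /* only the newest pending update matters; let it supersede older
     * oneway transactions queued to a frozen process */
    tr.flags = TF_ONE_WAY | TF_UPDATE_TXN;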
2022-06-27 | mnt_idmapping: return false when comparing two invalid ids | Seth Forshee | 1 | -4/+6

INVALID_VFS{U,G}ID represent ids which have no mapping in the target mnt_userns. This can happen for a couple of different reasons: the source id might be valid but have no mapping in mnt_userns, or the source id might have been invalid (either due to a failed mapping or because it was set to invalid to indicate it is uninitialized). This means that two arbitrary vfs{u,g}ids which are both invalid could represent two different underlying ids, or they could represent a failed mapping and an uninitialized value.

In these situations the vfs{u,g}id equality functions evaluate these ids as equal, and care must be taken when comparing ids to avoid problems. It would be less error-prone to always evaluate two invalid ids as not equal to each other, and to check explicitly for vfs{u,g}id validity when that is needed.

Change all vfs{u,g}id equality functions to return false when both ids are invalid. Functions for checking whether an id is valid exist and are already being used by code which needs to check this.

Link: https://lore.kernel.org/linux-fsdevel/YrIMZirGoE0VIO45@do-x1extreme
Signed-off-by: Seth Forshee <[email protected]>
Reviewed-by: Christian Brauner (Microsoft) <[email protected]>
Signed-off-by: Christian Brauner (Microsoft) <[email protected]>
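The changed semantics as a sketch (helper names assumed from the mnt_idmapping API):

    static inline bool vfsuid_eq(vfsuid_t left, vfsuid_t right)
    {
            /* two invalid ids now compare unequal */
            return vfsuid_valid(left) &&
                   __vfsuid_val(left) == __vfsuid_val(right);
    }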
2022-06-27 | tty: Add N_CAN327 line discipline ID for ELM327 based CAN driver | Max Staudt | 1 | -1/+2

The actual driver will be added via the CAN tree.

Acked-by: Marc Kleine-Budde <[email protected]>
Signed-off-by: Max Staudt <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>

2022-06-27 | serial: Support for RS-485 multipoint addresses | Ilpo Järvinen | 2 | -2/+19

Add support for RS-485 multipoint addressing using the 9th bit [*]. The addressing mode is configured through ->rs485_config(). ADDRB in termios indicates that 9th-bit addressing mode is enabled. In this mode, the 9th bit is used to indicate an address (byte) within the communication line. ADDRB can only be enabled/disabled through ->rs485_config(), which is also responsible for setting the destination and receiver (filter) addresses.

Add traps to detect unwanted changes to the struct serial_rs485 layout using static_assert().

[*] Technically, RS485 is just an electronic spec and does not itself specify the 9th bit addressing mode, but the 9th bit seems at least a "semi-standard" way to do addressing with RS485.

Signed-off-by: Ilpo Järvinen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>

2022-06-27 | serial: take termios_rwsem for ->rs485_config() & pass termios as param | Ilpo Järvinen | 1 | -0/+1

To be able to alter ADDRB within ->rs485_config(), take termios_rwsem before calling ->rs485_config() and pass termios as a parameter.

Signed-off-by: Ilpo Järvinen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>

2022-06-27 | serial: 8250: create lsr_save_mask | Ilpo Järvinen | 1 | -0/+1

Allow drivers to alter the LSR save mask.

Reviewed-by: Andy Shevchenko <[email protected]>
Signed-off-by: Ilpo Järvinen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>

2022-06-27 | serial: 8250: make saved LSR larger | Ilpo Järvinen | 1 | -3/+3

DW flags the address received as BIT(8) in LSR. In order not to lose that on read, enlarge lsr_saved_flags to u16. Adjust lsr/status variables and related call chains to use u16. Technically, some of these type conversions would not be needed, but it doesn't hurt to be consistent.

Signed-off-by: Ilpo Järvinen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>

2022-06-27 | serial: Consolidate BOTH_EMPTY use | Ilpo Järvinen | 1 | -0/+9

Per-file BOTH_EMPTY defines are littering our source code here and there. Define it once in serial.h and create a helper for the check too.

Reviewed-by: Jiri Slaby <[email protected]>
Signed-off-by: Ilpo Järvinen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
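A sketch of the consolidated define and helper (exact names assumed):

    #define BOTH_EMPTY      (UART_LSR_TEMT | UART_LSR_THRE)

    static inline bool uart_lsr_tx_empty(u16 lsr)
    {
            /* transmitter and holding register both empty */
            return (lsr & BOTH_EMPTY) == BOTH_EMPTY;
    }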
2022-06-27 | serial: Convert SERIAL_XMIT_SIZE to UART_XMIT_SIZE | Ilpo Järvinen | 1 | -6/+0

Both UART_XMIT_SIZE and SERIAL_XMIT_SIZE are defined. Make them all UART_XMIT_SIZE.

Reviewed-by: Jiri Slaby <[email protected]>
Signed-off-by: Ilpo Järvinen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>

2022-06-27 | serial: Use bits for UART_LSR_BRK_ERROR_BITS/MSR_ANY_DELTA | Ilpo Järvinen | 1 | -2/+2

Instead of listing the bits for UART_LSR_BRK_ERROR_BITS and UART_MSR_ANY_DELTA in a comment, use them in the define itself.

Reviewed-by: Jiri Slaby <[email protected]>
Signed-off-by: Ilpo Järvinen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>

2022-06-27 | serial: Drop timeout from uart_port | Ilpo Järvinen | 1 | -2/+14

Since commit 31f6bd7fad3b ("serial: Store character timing information to uart_port"), per-frame timing information is available on uart_port. The UART port's timeout can be derived from frame_time by multiplying with fifosize.

Most calls to uart_poll_timeout() are not made under the port's lock. To be on the safe side, make sure frame_time is only accessed once. As fifosize is effectively a constant, it shouldn't cause any issues.

Signed-off-by: Ilpo Järvinen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>
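A sketch of the derivation (the helper name and the slop constant are assumptions; READ_ONCE() covers the lockless frame_time access mentioned above, with frame_time taken to be in nanoseconds):

    static inline unsigned long uart_fifo_timeout(struct uart_port *port)
    {
            u64 timeout = (u64)READ_ONCE(port->frame_time) * port->fifosize;

            /* some slop, and never less than one jiffy */
            timeout += 20 * NSEC_PER_MSEC;
            return max(nsecs_to_jiffies(timeout), 1UL);
    }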
2022-06-27 | tty: Add closing marker into comment in tty_ldisc.h | Ilpo Järvinen | 1 | -1/+1

The closing `` is missing. Add it.

Fixes: 6bb6fa6908eb ("tty: Implement lookahead to process XON/XOFF timely")
Reported-by: Stephen Rothwell <[email protected]>
Signed-off-by: Ilpo Järvinen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>

2022-06-27 | block: Make ioprio_best() static | Jan Kara | 1 | -5/+0

Nobody outside of block/ioprio.c uses it.

Reviewed-by: Damien Le Moal <[email protected]>
Tested-by: Damien Le Moal <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

2022-06-27 | block: Generalize get_current_ioprio() for any task | Jan Kara | 1 | -16/+10

get_current_ioprio() operates only on the current task. We will need the same functionality for other tasks as well. Generalize get_current_ioprio() for that, and also move the bulk of it out of the header file because it is large enough.

Reviewed-by: Damien Le Moal <[email protected]>
Tested-by: Damien Le Moal <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

2022-06-27 | block: Return effective IO priority from get_current_ioprio() | Jan Kara | 1 | -2/+9

get_current_ioprio() is used to initialize the IO priority of various requests. As such it should return the effective IO priority of the task (i.e., reflecting the fact that an unset IO priority should get set based on the task's CPU priority) so that the conversion is concentrated in one place.

Reviewed-by: Damien Le Moal <[email protected]>
Tested-by: Damien Le Moal <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
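A simplified sketch of the intended behavior, using the nice-to-ioprio helpers from include/linux/ioprio.h (not the exact patch):

    static inline int get_current_ioprio(void)
    {
            struct io_context *ioc = current->io_context;
            int prio = ioc ? ioc->ioprio
                           : IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);

            /* an unset priority follows the task's CPU nice value */
            if (IOPRIO_PRIO_CLASS(prio) == IOPRIO_CLASS_NONE)
                    prio = IOPRIO_PRIO_VALUE(task_nice_ioclass(current),
                                             task_nice_ioprio(current));
            return prio;
    }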
2022-06-27 | block: fix default IO priority handling again | Jan Kara | 1 | -1/+1

Commit e70344c05995 ("block: fix default IO priority handling") introduced an inconsistency in get_current_ioprio(): tasks without an IO context return IOPRIO_DEFAULT priority, while tasks with a freshly allocated IO context return 0 (IOPRIO_CLASS_NONE/0) IO priority. Tasks without an IO context used to be rare before 5a9d041ba2f6 ("block: move io_context creation into where it's needed"), but after this commit they became common, because now only the BFQ IO scheduler sets up the task's IO context. A similar inconsistency exists for get_task_ioprio(), so this inconsistency is now exposed to userspace, and userspace will see a different IO priority for tasks operating on devices with BFQ compared to devices without BFQ.

Furthermore, the changes done by commit e70344c05995 change the behavior when no IO priority is set for the BFQ IO scheduler, which is also documented in the ioprio_set(2) manpage:

"If no I/O scheduler has been set for a thread, then by default the I/O priority will follow the CPU nice value (setpriority(2)). In Linux kernels before version 2.6.24, once an I/O priority had been set using ioprio_set(), there was no way to reset the I/O scheduling behavior to the default. Since Linux 2.6.24, specifying ioprio as 0 can be used to reset to the default I/O scheduling behavior."

So make sure we default to IOPRIO_CLASS_NONE as used to be the case before commit e70344c05995. Also clean up alloc_io_context() to explicitly set this IO priority for the allocated IO context to avoid future surprises. Note that we tweak ioprio_best() to maintain ioprio_get(2) behavior and make this commit easily backportable.

CC: [email protected]
Fixes: e70344c05995 ("block: fix default IO priority handling")
Reviewed-by: Damien Le Moal <[email protected]>
Tested-by: Damien Le Moal <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

2022-06-27 | block: move blk_queue_get_max_sectors to blk.h | Christoph Hellwig | 1 | -13/+0

blk_queue_get_max_sectors() is private to the block layer, so move it out of blkdev.h.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

2022-06-27 | block: fold blk_max_size_offset into get_max_io_size | Christoph Hellwig | 1 | -19/+0

Now that blk_max_size_offset() has a single caller left, fold it into that caller and clean up the naming convention for the local variables there.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Pankaj Raghav <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

2022-06-27 | block: factor out a chunk_size_left helper | Christoph Hellwig | 1 | -6/+13

Factor out a helper from blk_max_size_offset() so that it can be reused independently.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Pankaj Raghav <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
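A sketch of such a helper (name from the subject line; assuming chunk_sectors is a power of two, as the block layer requires):

    static inline unsigned int chunk_size_left(sector_t offset,
                                               unsigned int chunk_sectors)
    {
            /* sectors remaining until the next chunk boundary */
            return chunk_sectors - (offset & (chunk_sectors - 1));
    }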
2022-06-27 | block: relax direct io memory alignment | Keith Busch | 1 | -0/+5

Use the address alignment requirements from the block_device for direct io instead of requiring addresses be aligned to the block size. User space can discover the alignment requirements from the dma_alignment queue attribute. User space can specify any hardware compatible DMA offset for each segment, but every segment length is still required to be a multiple of the block size.

Signed-off-by: Keith Busch <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

2022-06-27 | block: introduce bdev_iter_is_aligned helper | Keith Busch | 1 | -0/+7

Provide a convenient function for this repeatable coding pattern.

Signed-off-by: Keith Busch <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
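A sketch of the helper, composed from the iov_iter alignment check and the bdev_dma_alignment() helper introduced by the neighboring commits in this series (exact spelling of the names assumed):

    static inline bool bdev_iter_is_aligned(struct block_device *bdev,
                                            struct iov_iter *iter)
    {
            return iov_iter_is_aligned(iter, bdev_dma_alignment(bdev),
                                       bdev_logical_block_size(bdev) - 1);
    }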
2022-06-27 | iov: introduce iov_iter_aligned | Keith Busch | 1 | -0/+2

The existing iov_iter_alignment() function returns the logical OR of address and length. For cases where address and length need to be considered separately, introduce a helper function where a caller can specify the length and address masks that indicate whether the iov is unaligned.

Cc: Alexander Viro <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

2022-06-27 | block: introduce bdev_dma_alignment helper | Keith Busch | 1 | -0/+5

Preparing for upcoming dma_alignment users that have a block_device, but don't need the request_queue.

Signed-off-by: Keith Busch <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2022-06-27 | spi: opportunistically skip ctlr->cur_msg_completion | David Jander | 1 | -0/+8

There are only a few drivers that do not call spi_finalize_current_message() in the context of transfer_one_message(), and even for those cases the completion ctlr->cur_msg_completion is not always needed. The calls to complete() and wait_for_completion() each take a spin-lock, which is costly. This patch makes it possible to avoid those calls in the big majority of cases, by introducing two flags that, with the help of ordering via barriers, can safely avoid using the completion. In case of a race with the context calling spi_finalize_current_message(), the scheme errs on the safe side and takes the completion.

The impact of this patch is worth the effort: on an i.MX8MM SoC, the time the SPI bus is idle between two consecutive calls to spi_sync() is reduced from 19.6us to 16.8us, roughly 15%.

Signed-off-by: David Jander <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>

2022-06-27 | spi: Ensure the io_mutex is held until spi_finalize_current_message() | David Jander | 1 | -4/+2

This patch introduces a completion that is completed in spi_finalize_current_message() and waited for in __spi_pump_transfer_message(). This way all manipulation of ctlr->cur_msg is done with the io_mutex held and strictly ordered: __spi_pump_transfer_message() will not return until spi_finalize_current_message() is done using ctlr->cur_msg, and its calling context is only touching ctlr->cur_msg after returning. Due to this, we can safely drop the spin-locks around ctlr->cur_msg.

Signed-off-by: David Jander <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>

2022-06-27 | spi: Remove the now unused ctlr->idling flag | David Jander | 1 | -2/+0

The ctlr->idling flag is never checked now, so we don't need to set it either.

Signed-off-by: David Jander <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>

2022-06-27 | spi: Don't use the message queue if possible in spi_sync | David Jander | 1 | -1/+10

The interaction with the controller message queue and its corresponding auxiliary flags and variables requires the use of the queue_lock, which is costly. Since spi_sync() will transfer the complete message anyway, and not return until it is finished, there is no need to put the message into the queue if the queue is empty. This can save a lot of overhead.

As an example of how significant this is: when using the MCP2518FD SPI CAN controller on an i.MX8MM SoC, the time during which the interrupt line stays active (during 3 relatively short spi_sync messages) is reduced from 98us to 72us by this patch.

Signed-off-by: David Jander <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>

2022-06-27 | spi: Move ctlr->cur_msg_prepared to struct spi_message | David Jander | 1 | -3/+4

This enables the possibility to transfer a message that is not at the current tip of the async message queue. This is in preparation for the next patch(es), which enable spi_sync messages to skip the queue altogether.

Signed-off-by: David Jander <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>

2022-06-27 | ASoC: soc-component: Remove non_legacy_dai_naming flag | Charles Keepax | 1 | -1/+0

Now that all the users are moved over to the new legacy_dai_naming flag, remove the now unused old flag.

Signed-off-by: Charles Keepax <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>

2022-06-27 | tty/vt: consolemap: rename struct vc_data::vc_uni_pagedir* | Jiri Slaby | 1 | -2/+2

As a follow-up to commit 4173f018aae1 ("tty/vt: consolemap: rename and document struct uni_pagedir"), rename also the members of struct vc_data, i.e. pagedir -> pagedict. And while touching all the places, remove also the unnecessary vc_ prefix.

Suggested-by: Ilpo Järvinen <[email protected]>
Reviewed-by: Ilpo Järvinen <[email protected]>
Signed-off-by: Jiri Slaby <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>

2022-06-27 | ASoC: soc-component: Add legacy_dai_naming flag | Charles Keepax | 1 | -0/+1

Historically, the legacy DAI naming scheme was applied to platform drivers and the newer scheme to CODEC drivers. During componentisation the core lost the knowledge of whether a driver was a CODEC or platform; they were all now components. To continue to support the legacy naming on older platform drivers, a flag was added to the snd_soc_component_driver structure, non_legacy_dai_naming, to indicate use of the new scheme, and this was applied to all CODECs as part of the migration.

However, a slight issue is developing with respect to this flag being opt-in for the non-legacy scheme, which presumably we want to be the primary scheme used. Many codec drivers appear to forget to include this flag:

    grep -l -r "snd_soc_component_driver" sound/soc/codecs/*.c | \
        xargs grep -L "non_legacy_dai_naming" | wc
         48      48     556

It would seem more sensible to change the flag to legacy_dai_naming, making the new scheme opt-out. As a first step, this patch adds a new flag for this so that the users can be updated.

Signed-off-by: Charles Keepax <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>

2022-06-27 | drm/rect: Add DRM_RECT_INIT() macro | José Expósito | 1 | -0/+16

Add a helper macro to initialize a rectangle from x, y, width and height information.

Reviewed-by: Jani Nikula <[email protected]>
Reviewed-by: David Gow <[email protected]>
Acked-by: Thomas Zimmermann <[email protected]>
Signed-off-by: José Expósito <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
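A sketch of the macro; drm_rect stores corner coordinates (x1, y1, x2, y2), so width and height are converted to edges:

    #define DRM_RECT_INIT(x, y, w, h) ((struct drm_rect){ \
                    .x1 = (x), \
                    .y1 = (y), \
                    .x2 = (x) + (w), \
                    .y2 = (y) + (h) })

    /* usage: */
    struct drm_rect clip = DRM_RECT_INIT(0, 0, 1920, 1080);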
2022-06-27 | mm: ioremap: Add ioremap/iounmap_allowed() | Kefeng Wang | 1 | -0/+26

Add special hooks for an architecture to verify the addr, size or prot when calling ioremap() or iounmap(), which will make the generic ioremap more useful.

ioremap_allowed() returns a bool:
- true means continue to remap
- false means skip the remap and return directly

iounmap_allowed() returns a bool:
- true means continue to vunmap
- false means skip the vunmap and return directly

Meanwhile, only vunmap the address when it is in the vmalloc area, as the generic ioremap only returns vmalloc addresses.

Acked-by: Andrew Morton <[email protected]>
Signed-off-by: Kefeng Wang <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
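An architecture could hook the check roughly like this sketch; the define-override pattern mirrors other asm-generic hooks, and arch_range_is_forbidden() is a hypothetical policy function, not a real kernel API:

    /* in an arch's asm/io.h */
    #define ioremap_allowed ioremap_allowed
    static inline bool ioremap_allowed(phys_addr_t phys_addr, size_t size,
                                       unsigned long prot)
    {
            /* hypothetical check: refuse to remap forbidden ranges */
            return !arch_range_is_forbidden(phys_addr, size);
    }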
2022-06-27 | mm: ioremap: Use more sensible name in ioremap_prot() | Kefeng Wang | 1 | -1/+2

Use the more meaningful and sensible name phys_addr instead of addr in ioremap_prot().

Suggested-by: Andrew Morton <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
Signed-off-by: Kefeng Wang <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>