Age | Commit message (Collapse) | Author | Files | Lines |
|
In start_secondary, even if shared_cache was already set, system does a
redundant match for cpumask. This redundant check can be removed by
checking if shared_cache is already set.
While here, localize the sibling_mask variable to within the if
condition.
Signed-off-by: Srikar Dronamraju <[email protected]>
Reviewed-by: Gautham R. Shenoy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Current code assumes that cpumask of cpus sharing a l2-cache mask will
always be a superset of cpu_sibling_mask.
Lets stop that assumption. cpu_l2_cache_mask is a superset of
cpu_sibling_mask if and only if shared_caches is set.
Reviewed-by: Gautham R. Shenoy <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Move topology fixup based on the platform attributes into its own
function which is called just before set_sched_topology.
Signed-off-by: Srikar Dronamraju <[email protected]>
Reviewed-by: Gautham R. Shenoy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Just moving the powerpc_topology description above.
This will help in using functions in this file and avoid declarations.
No other functional changes
Signed-off-by: Srikar Dronamraju <[email protected]>
Reviewed-by: Gautham R. Shenoy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
A new sched_domain_topology_level was added just for Power9. However the
same can be achieved by merging powerpc_topology with power9_topology
and makes the code more simpler especially when adding a new sched
domain.
Signed-off-by: Srikar Dronamraju <[email protected]>
Reviewed-by: Gautham R. Shenoy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Fix a build warning in a non CONFIG_NEED_MULTIPLE_NODES
"error: _numa_cpu_lookup_table_ undeclared"
Signed-off-by: Srikar Dronamraju <[email protected]>
Reviewed-by: Gautham R. Shenoy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Currently Linux kernel with CONFIG_NUMA on a system with multiple
possible nodes, marks node 0 as online at boot. However in practice,
there are systems which have node 0 as memoryless and cpuless.
This can cause numa_balancing to be enabled on systems with only one node
with memory and CPUs. The existence of this dummy node which is cpuless and
memoryless node can confuse users/scripts looking at output of lscpu /
numactl.
By marking, node 0 as offline, lets stop assuming that node 0 is
always online. If node 0 has CPU or memory that are online, node 0 will
again be set as online.
v5.8
available: 2 nodes (0,2)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 32625 MB
node 2 free: 31490 MB
node distances:
node 0 2
0: 10 20
2: 20 10
proc and sys files
------------------
/sys/devices/system/node/online: 0,2
/proc/sys/kernel/numa_balancing: 1
/sys/devices/system/node/has_cpu: 2
/sys/devices/system/node/has_memory: 2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible: 0-31
v5.8 + patch
------------------
available: 1 nodes (2)
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 32625 MB
node 2 free: 31487 MB
node distances:
node 2
2: 10
proc and sys files
------------------
/sys/devices/system/node/online: 2
/proc/sys/kernel/numa_balancing: 0
/sys/devices/system/node/has_cpu: 2
/sys/devices/system/node/has_memory: 2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible: 0-31
Example of a node with online CPUs/memory on node 0.
(Same o/p with and without patch)
numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 0 size: 32482 MB
node 0 free: 22994 MB
node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node 0 1 2 3
0: 10 20 40 40
1: 20 10 40 40
2: 40 40 10 20
3: 40 40 20 10
Note: On Powerpc, cpu_to_node of possible but not present cpus would
previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
queried from vphn"). Without the 2 commits, Powerpc system might crash.
1. User space applications like Numactl, lscpu, that parse the sysfs tend to
believe there is an extra online node. This tends to confuse users and
applications. Other user space applications start believing that system was
not able to use all the resources (i.e missing resources) or the system was
not setup correctly.
2. Also existence of dummy node also leads to inconsistent information. The
number of online nodes is inconsistent with the information in the
device-tree and resource-dump
3. When the dummy node is present, single node non-Numa systems end up showing
up as NUMA systems and numa_balancing gets enabled. This will mean we take
the hit from the unnecessary numa hinting faults.
Signed-off-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Node id queried from the static device tree may not
be correct. For example: it may always show 0 on a shared processor.
Hence prefer the node id queried from vphn and fallback on the device tree
based node id if vphn query fails.
Signed-off-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
A Powerpc system with multiple possible nodes and with CONFIG_NUMA
enabled always used to have a node 0, even if node 0 does not any cpus
or memory attached to it. As per PAPR, node affinity of a cpu is only
available once its present / online. For all cpus that are possible but
not present, cpu_to_node() would point to node 0.
To ensure a cpuless, memoryless dummy node is not online, powerpc need
to make sure all possible but not present cpu_to_node are set to a
proper node.
Signed-off-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
As per draft LoPAPR (Revision 2.9_pre7), section B.5.3 "Run Time
Abstraction Services (RTAS) Node" available at:
https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf
... there are 2 device tree properties:
"ibm,max-associativity-domains"
which defines the maximum number of domains that the firmware i.e
PowerVM can support.
and:
"ibm,current-associativity-domains"
which defines the maximum number of domains that the current
platform can support.
The value of "ibm,max-associativity-domains" is always greater than or
equal to "ibm,current-associativity-domains" property. If the latter
property is not available, use "ibm,max-associativity-domain" as a
fallback. In this yet to be released LoPAPR, "ibm,current-associativity-domains"
is mentioned in page 833 / B.5.3 which is covered under under
"Appendix B. System Binding" section
Currently powerpc uses the "ibm,max-associativity-domains" property
while setting the possible number of nodes. This is currently set at
32. However the possible number of nodes for a platform may be
significantly less. Hence set the possible number of nodes based on
"ibm,current-associativity-domains" property.
Nathan Lynch had raised a valid concern that post LPM (Live Partition
Migration), a user could DLPAR add processors and memory after LPM
with "new" associativity properties:
https://lore.kernel.org/linuxppc-dev/[email protected]/t/#u
He also pointed out that "ibm,max-associativity-domains" has the same
contents on all currently available PowerVM systems, unlike
"ibm,current-associativity-domains" and hence may be better able to
handle the new NUMA associativity properties.
However with the recent commit dbce45628085 ("powerpc/numa: Limit
possible nodes to within num_possible_nodes"), all new NUMA
associativity properties are capped to initially set nr_node_ids.
Hence this commit should be safe with any new DLPAR add post LPM.
$ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
/proc/device-tree/rtas/ibm,current-associativity-domains
00000005 00000001 00000002 00000002 00000002 00000010
/proc/device-tree/rtas/ibm,max-associativity-domains
00000005 00000001 00000008 00000020 00000020 00000100
$ cat /sys/devices/system/node/possible ##Before patch
0-31
$ cat /sys/devices/system/node/possible ##After patch
0-1
Note the maximum nodes this platform can support is only 2 but the
possible nodes is set to 32.
This is important because lot of kernel and user space code allocate
structures for all possible nodes leading to a lot of memory that is
allocated but not used.
I ran a simple experiment to create and destroy 100 memory cgroups on
boot on a 8 node machine (Power8 Alpine).
Before patch:
free -k at boot
total used free shared buff/cache available
Mem: 523498176 4106816 518820608 22272 570752 516606720
Swap: 4194240 0 4194240
free -k after creating 100 memory cgroups
total used free shared buff/cache available
Mem: 523498176 4628416 518246464 22336 623296 516058688
Swap: 4194240 0 4194240
free -k after destroying 100 memory cgroups
total used free shared buff/cache available
Mem: 523498176 4697408 518173760 22400 627008 515987904
Swap: 4194240 0 4194240
After patch:
free -k at boot
total used free shared buff/cache available
Mem: 523498176 3969472 518933888 22272 594816 516731776
Swap: 4194240 0 4194240
free -k after creating 100 memory cgroups
total used free shared buff/cache available
Mem: 523498176 4181888 518676096 22208 640192 516496448
Swap: 4194240 0 4194240
free -k after destroying 100 memory cgroups
total used free shared buff/cache available
Mem: 523498176 4232320 518619904 22272 645952 516443264
Swap: 4194240 0 4194240
Observations:
Fixed kernel takes 137344 kb (4106816-3969472) less to boot.
Fixed kernel takes 309184 kb (4628416-4181888-137344) less to create 100 memcgs.
Signed-off-by: Srikar Dronamraju <[email protected]>
[mpe: Reformat change log a bit for readability]
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
On Power9, a pair of SMT4 cores can be presented by the firmware as a SMT8
core for backward compatibility reasons, with the fusion of two SMT4 cores.
Powerpc allows LPARs to be live migrated from Power8 to Power9. Existing
software developed/configured for Power8, expects to see a SMT8 core.
In order to maintain userspace backward compatibility (with Power8 chips in
case of Power9) in enterprise Linux systems, the topology_sibling_cpumask
has to be set to SMT8 core.
cpu_smt_mask() should generally point to the cpu mask of the SMT4 core.
Hence override the default cpu_smt_mask() to be powerpc specific
allowing for better scheduling behaviour on Power.
schbench
(latency measured in usecs, so lesser is better)
Without patch With patch
Latency percentiles (usec) Latency percentiles (usec)
50.0000th: 34 50.0000th: 38
75.0000th: 47 75.0000th: 52
90.0000th: 54 90.0000th: 60
95.0000th: 57 95.0000th: 64
*99.0000th: 62 *99.0000th: 72
99.5000th: 65 99.5000th: 75
99.9000th: 76 99.9000th: 3452
min=0, max=9205 min=0, max=9344
schbench (With Cede disabled)
Without patch With patch
Latency percentiles (usec) Latency percentiles (usec)
50.0000th: 20 50.0000th: 21
75.0000th: 28 75.0000th: 29
90.0000th: 33 90.0000th: 34
95.0000th: 35 95.0000th: 37
*99.0000th: 40 *99.0000th: 40
99.5000th: 48 99.5000th: 42
99.9000th: 94 99.9000th: 79
min=0, max=791 min=0, max=791
perf bench sched pipe
usec/ops : lesser is better
Without patch
N Min Max Median Avg Stddev
101 5.095113 5.595269 5.204842 5.2298776 0.10762713
5.10 - 5.15 : ################################################## 23% (24)
5.15 - 5.20 : ############################################# 21% (22)
5.20 - 5.25 : ################################################## 23% (24)
5.25 - 5.30 : ######################### 11% (12)
5.30 - 5.35 : ########## 4% (5)
5.35 - 5.40 : ######## 3% (4)
5.40 - 5.45 : ######## 3% (4)
5.45 - 5.50 : #### 1% (2)
5.50 - 5.55 : ## 0% (1)
5.55 - 5.60 : #### 1% (2)
With patch
N Min Max Median Avg Stddev
101 5.134675 8.524719 5.207658 5.2780985 0.34911969
5.1 - 5.5 : ################################################## 94% (95)
5.5 - 5.8 : ## 3% (4)
5.8 - 6.2 : 0% (1)
6.2 - 6.5 :
6.5 - 6.8 :
6.8 - 7.2 :
7.2 - 7.5 :
7.5 - 7.8 :
7.8 - 8.2 :
8.2 - 8.5 :
perf bench sched pipe (cede disabled)
usec/ops : lesser is better
Without patch
N Min Max Median Avg Stddev
101 7.884227 12.576538 7.956474 8.0170722 0.46159054
7.9 - 8.4 : ################################################## 99% (100)
8.4 - 8.8 :
8.8 - 9.3 :
9.3 - 9.8 :
9.8 - 10.2 :
10.2 - 10.7 :
10.7 - 11.2 :
11.2 - 11.6 :
11.6 - 12.1 :
12.1 - 12.6 :
With patch
N Min Max Median Avg Stddev
101 7.956021 8.217284 8.015615 8.0283866 0.049844967
7.96 - 7.98 : ###################### 12% (13)
7.98 - 8.01 : ################################################## 28% (29)
8.01 - 8.03 : #################################### 20% (21)
8.03 - 8.06 : ######################### 14% (15)
8.06 - 8.09 : ###################### 12% (13)
8.09 - 8.11 : ###### 3% (4)
8.11 - 8.14 : ### 1% (2)
8.14 - 8.17 : ### 1% (2)
8.17 - 8.19 :
8.19 - 8.22 : # 0% (1)
Observations: With the patch, the initial run/iteration takes a slight
longer time. This can be attributed to the fact that now we pick a CPU
from a idle core which could be sleep mode. Once we remove the cede,
state the numbers improve in favour of the patch.
ebizzy:
transactions per second (higher is better)
without patch
N Min Max Median Avg Stddev
100 1018433 1304470 1193208 1182315.7 60018.733
1018433 - 1047037 : ###### 3% (3)
1047037 - 1075640 : ######## 4% (4)
1075640 - 1104244 : ######## 4% (4)
1104244 - 1132848 : ############### 7% (7)
1132848 - 1161452 : #################################### 17% (17)
1161452 - 1190055 : ########################## 12% (12)
1190055 - 1218659 : ############################################# 21% (21)
1218659 - 1247263 : ################################################## 23% (23)
1247263 - 1275866 : ######## 4% (4)
1275866 - 1304470 : ######## 4% (4)
with patch
N Min Max Median Avg Stddev
100 967014 1292938 1208819 1185281.8 69815.851
967014 - 999606 : ## 1% (1)
999606 - 1032199 : ## 1% (1)
1032199 - 1064791 : ############ 6% (6)
1064791 - 1097384 : ########## 5% (5)
1097384 - 1129976 : ################## 9% (9)
1129976 - 1162568 : #################### 10% (10)
1162568 - 1195161 : ########################## 13% (13)
1195161 - 1227753 : ############################################ 22% (22)
1227753 - 1260346 : ################################################## 25% (25)
1260346 - 1292938 : ############## 7% (7)
Observations: Not much changes, ebizzy is not much impacted.
Signed-off-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
cpu_smt_mask tracks topology_sibling_cpumask. This would be good for
most architectures. One of the users of cpu_smt_mask(), would be to
identify idle-cores. On Power9, a pair of SMT4 cores can be presented
by the firmware as a SMT8 core for backward compatibility reasons.
powerpc allows LPARs to be live migrated from Power8 to Power9. Do
note Power8 had only SMT8 cores. Existing software which has been
developed/configured for Power8 would expect to see SMT8 core.
Maintaining the illusion of SMT8 core is a requirement to make that
work.
In order to maintain above userspace backward compatibility with
previous versions of processor, Power9 onwards there is option to the
firmware to advertise a pair of SMT4 cores as a fused cores aka SMT8
core. On Power9 this pair shares the L2 cache as well. However, from
the scheduler's point of view, a core should be determined by SMT4,
since its a completely independent unit of compute. Hence allow
powerpc architecture to override the default cpu_smt_mask() to point
to the SMT4 cores in a SMT8 mode.
This will ensure the scheduler is always given the right information.
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Make kernel with `C=2`:
drivers/macintosh/smu.c:1018:30: warning: symbol
'__smu_get_sdb_partition' was not declared. Should it be static?
Reported-by: Hulk Robot <[email protected]>
Signed-off-by: Wang Wensheng <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
A warning is reported by the kernel in case perf_stats_show() returns
an error code. The warning is of the form below:
papr_scm ibm,persistent-memory:ibm,pmemory@44100001:
Failed to query performance stats, Err:-10
dev_attr_show: perf_stats_show+0x0/0x1c0 [papr_scm] returned bad count
fill_read_buffer: dev_attr_show+0x0/0xb0 returned bad count
On investigation it looks like that the compiler is silently
truncating the return value of drc_pmem_query_stats() from 'long' to
'int', since the variable used to store the return code 'rc' is an
'int'. This truncated value is then returned back as a 'ssize_t' back
from perf_stats_show() to 'dev_attr_show()' which thinks of it as a
large unsigned number and triggers this warning..
To fix this we update the type of variable 'rc' from 'int' to
'ssize_t' that prevents the compiler from truncating the return value
of drc_pmem_query_stats() and returning correct signed value back from
perf_stats_show().
Fixes: 2d02bf835e57 ("powerpc/papr_scm: Fetch nvdimm performance stats from PHYP")
Signed-off-by: Vaibhav Jain <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Commit 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of
single-threaded mm_cpumask") added a mechanism to trim the mm_cpumask of
a process under certain conditions. One of the assumptions is that
mm_users would not be incremented via a reference outside the process
context with mmget_not_zero() then go on to kthread_use_mm() via that
reference.
That invariant was broken by io_uring code (see previous sparc64 fix),
but I'll point Fixes: to the original powerpc commit because we are
changing that assumption going forward, so this will make backports
match up.
Fix this by no longer relying on that assumption, but by having each CPU
check the mm is not being used, and clearing their own bit from the mask
only if it hasn't been switched-to by the time the IPI is processed.
This relies on commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
invalidate") and ARCH_WANT_IRQS_OFF_ACTIVATE_MM to disable irqs over mm
switch sequences.
Fixes: 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of single-threaded mm_cpumask")
Signed-off-by: Nicholas Piggin <[email protected]>
Reviewed-by: Michael Ellerman <[email protected]>
Depends-on: 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB invalidate")
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The de facto (and apparently uncommented) standard for using an mm had,
thanks to this code in sparc if nothing else, been that you must have a
reference on mm_users *and that reference must have been obtained with
mmget()*, i.e., from a thread with a reference to mm_users that had used
the mm.
The introduction of mmget_not_zero() in commit d2005e3f41d4
("userfaultfd: don't pin the user memory in userfaultfd_file_create()")
allowed mm_count holders to aoperate on user mappings asynchronously
from the actual threads using the mm, but they were not to load those
mappings into their TLB (i.e., walking vmas and page tables is okay,
kthread_use_mm() is not).
io_uring 2b188cc1bb857 ("Add io_uring IO interface") added code which
does a kthread_use_mm() from a mmget_not_zero() refcount.
The problem with this is code which previously assumed mm == current->mm
and mm->mm_users == 1 implies the mm will remain single-threaded at
least until this thread creates another mm_users reference, has now
broken.
arch/sparc/kernel/smp_64.c:
if (atomic_read(&mm->mm_users) == 1) {
cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
goto local_flush_and_out;
}
vs fs/io_uring.c
if (unlikely(!(ctx->flags & IORING_SETUP_SQPOLL) ||
!mmget_not_zero(ctx->sqo_mm)))
return -EFAULT;
kthread_use_mm(ctx->sqo_mm);
mmget_not_zero() could come in right after the mm_users == 1 test, then
kthread_use_mm() which sets its CPU in the mm_cpumask. That update could
be lost if cpumask_copy() occurs afterward.
I propose we fix this by allowing mmget_not_zero() to be a first-class
reference, and not have this obscure undocumented and unchecked
restriction.
The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
optimisation could be effectively restored by sending IPIs to mm_cpumask
members and having them remove themselves from mm_cpumask. This is more
tricky so I leave it as an exercise for someone with a sparc64 SMP.
powerpc has a (currently similarly broken) example.
Signed-off-by: Nicholas Piggin <[email protected]>
Acked-by: David S. Miller <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
powerpc uses IPIs in some situations to switch a kernel thread away
from a lazy tlb mm, which is subject to the TLB flushing race
described in the changelog introducing ARCH_WANT_IRQS_OFF_ACTIVATE_MM.
Signed-off-by: Nicholas Piggin <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Reading and modifying current->mm and current->active_mm and switching
mm should be done with irqs off, to prevent races seeing an intermediate
state.
This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
invalidate"). At exec-time when the new mm is activated, the old one
should usually be single-threaded and no longer used, unless something
else is holding an mm_users reference (which may be possible).
Absent other mm_users, there is also a race with preemption and lazy tlb
switching. Consider the kernel_execve case where the current thread is
using a lazy tlb active mm:
call_usermodehelper()
kernel_execve()
old_mm = current->mm;
active_mm = current->active_mm;
*** preempt *** --------------------> schedule()
prev->active_mm = NULL;
mmdrop(prev active_mm);
...
<-------------------- schedule()
current->mm = mm;
current->active_mm = mm;
if (!old_mm)
mmdrop(active_mm);
If we switch back to the kernel thread from a different mm, there is a
double free of the old active_mm, and a missing free of the new one.
Closing this race only requires interrupts to be disabled while ->mm
and ->active_mm are being switched, but the TLB problem requires also
holding interrupts off over activate_mm. Unfortunately not all archs
can do that yet, e.g., arm defers the switch if irqs are disabled and
expects finish_arch_post_lock_switch() to be called to complete the
flush; um takes a blocking lock in activate_mm().
So as a first step, disable interrupts across the mm/active_mm updates
to close the lazy tlb preempt race, and provide an arch option to
extend that to activate_mm which allows architectures doing IPI based
TLB shootdowns to close the second race.
This is a bit ugly, but in the interest of fixing the bug and backporting
before all architectures are converted this is a compromise.
Signed-off-by: Nicholas Piggin <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Use for_each_child_of_node() macro instead of open coding it.
Signed-off-by: Qinglang Miao <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
When a passthrough IO adapter is removed from a pseries machine using
hash MMU and the XIVE interrupt mode, the POWER hypervisor expects the
guest OS to clear all page table entries related to the adapter. If
some are still present, the RTAS call which isolates the PCI slot
returns error 9001 "valid outstanding translations" and the removal of
the IO adapter fails. This is because when the PHBs are scanned, Linux
maps automatically the INTx interrupts in the Linux interrupt number
space but these are never removed.
To solve this problem, we introduce a PPC platform specific
pcibios_remove_bus() routine which clears all interrupt mappings when
the bus is removed. This also clears the associated page table entries
of the ESB pages when using XIVE.
For this purpose, we record the logical interrupt numbers of the
mapped interrupt under the PHB structure and let pcibios_remove_bus()
do the clean up.
Since some PCI adapters, like GPUs, use the "interrupt-map" property
to describe interrupt mappings other than the legacy INTx interrupts,
we can not restrict the size of the mapping array to PCI_NUM_INTX. The
number of interrupt mappings is computed from the "interrupt-map"
property and the mapping array is allocated accordingly.
Signed-off-by: Cédric Le Goater <[email protected]>
Reviewed-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
This driver does not restore stop > 3 state, so it limits itself
to states which do not lose full state or TB.
The POWER10 SPRs are sufficiently different from P9 that it seems
easier to split out the P10 code. The POWER10 deep sleep code
(e.g., the BHRB restore) has been taken out, but it can be re-added
when stop > 3 support is added.
Signed-off-by: Nicholas Piggin <[email protected]>
Tested-by: Pratik Rajesh Sampat<[email protected]>
Tested-by: Vaidyanathan Srinivasan <[email protected]>
Reviewed-by: Pratik Rajesh Sampat<[email protected]>
Reviewed-by: Gautham R. Shenoy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
This ensures we don't do a partial mapping of memory. With nvdimm, when
creating namespaces with size not aligned to 16MB, the kernel ends up partially
mapping the pages. This can result in kernel adding multiple hash page table
entries for the same range. A new namespace will result in
create_section_mapping() with start and end overlapping an already existing
bolted hash page table entry.
commit: 6acd7d5ef264 ("libnvdimm/namespace: Enforce memremap_compat_align()")
made sure that we always create namespaces aligned to 16MB. But we can do
better by avoiding mapping pages that are not aligned. This helps to catch
access to these partially mapped pages early.
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
This addresses the following sparse warning:
arch/powerpc/platforms/ps3/spu.c:451:33: warning: symbol
'spu_management_ps3_ops' was not declared. Should it be static?
arch/powerpc/platforms/ps3/spu.c:592:28: warning: symbol
'spu_priv1_ps3_ops' was not declared. Should it be static?
Reported-by: Hulk Robot <[email protected]>
Signed-off-by: Jason Yan <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Before the commit identified below, pages tables allocation was
performed after the allocation of final shadow area for linear memory.
But that commit switched the order, leading to page tables being
already allocated at the time 8xx kasan_init_shadow_8M() is called.
Due to this, kasan_init_shadow_8M() doesn't map the needed
shadow entries because there are already page tables.
kasan_init_shadow_8M() installs huge PMD entries instead of page
tables. We could at that time free the page tables, but there is no
point in creating page tables that get freed before being used.
Only book3s/32 hash needs early allocation of page tables. For other
variants, we can keep the initial order and create remaining page
tables after the allocation of final shadow memory for linear mem.
Move back the allocation of shadow page tables for
CONFIG_KASAN_VMALLOC into kasan_init() after the loop which creates
final shadow memory for linear mem.
Fixes: 41ea93cf7ba4 ("powerpc/kasan: Fix shadow pages allocation failure")
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/8ae4554357da4882612644a74387ae05525b2aaa.1599800716.git.christophe.leroy@csgroup.eu
|
|
low_sleep_handler() has an hardcoded restore of segment registers
that doesn't take KUAP and KUEP into account.
Use head_32's load_segment_registers() routine instead.
Fixes: a68c31fc01ef ("powerpc/32s: Implement Kernel Userspace Access Protection")
Fixes: 31ed2b13c48d ("powerpc/32s: Implement Kernel Userspace Execution Prevention.")
Cc: [email protected]
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/21b05f7298c1b18f73e6e5b4cd5005aafa24b6da.1599820109.git.christophe.leroy@csgroup.eu
|
|
Add a stub for __giveup_fpu() when CONFIG_PPC_FPU is
not selected, as done for CONFIG_SPE and CONFIG_ALTIVEC.
This allows to remove some #ifdef CONFIG_PPC_FPU.
Also change one to IS_ENABLED().
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/69c8b7954ceeccc6b849e52e1fa41b3a0f10f6c1.1597643221.git.christophe.leroy@csgroup.eu
|
|
cpu_has_feature(CPU_FTR_SPE) returns false when CONFIG_SPE is
not set.
There is no need to enclose the test in an #ifdef CONFIG_SPE.
Remove it.
CPU_FTR_SPE only exists on 32 bits. Define it as 0 on 64 bits.
We have a couple of places like:
#ifdef CONFIG_SPE
if (cpu_has_feature(CPU_FTR_SPE)) {
do_something_that_requires_CONFIG_SPE
} else {
return -EINVAL;
}
#else
return -EINVAL;
#endif
Replace them by a cleaner version:
if (cpu_has_feature(CPU_FTR_SPE)) {
#ifdef CONFIG_SPE
do_something_that_requires_CONFIG_SPE
#endif
} else {
return -EINVAL;
}
When CONFIG_SPE is not set, this resolves to an unconditional
return of -EINVAL
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/698df8387555765b70ea42e4a7fa48141c309c1f.1597643221.git.christophe.leroy@csgroup.eu
|
|
cpu_has_feature(CPU_FTR_ALTIVEC) returns false when CONFIG_ALTIVEC is
not set.
There is no need to enclose the test in an #ifdef CONFIG_ALTIVEC.
Remove it.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/03ba6b52344ca7c336df2bc6e3d31d736c804ae2.1597643221.git.christophe.leroy@csgroup.eu
|
|
cpu_has_feature(CPU_FTR_VSX) returns false when CONFIG_VSX is
not set.
There is no need to enclose the test in an #ifdef CONFIG_VSX.
Remove it.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/0eb61cf0dc66d781d47deb2228498cd61d03a754.1597643221.git.christophe.leroy@csgroup.eu
|
|
That #endif is more than 100 lines after the matching #ifdef,
and there are several #ifdef/#else/#endif inbetween.
Tag it as /* CONFIG_PPC_BOOK3S_64 */ to help locate the
matching #ifdef.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/3612a8f8aaca16de3fc414a7e66293319d6e213c.1597643147.git.christophe.leroy@csgroup.eu
|
|
The #ifdef CONFIG_KALLSYMS encloses some printk which can
compile in all cases.
Replace by IS_ENABLED().
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/2d89732a9062b2cf2651728804e4b8f6c9b9358e.1597643164.git.christophe.leroy@csgroup.eu
|
|
by IS_ENABLED()
The #if defined(CONFIG_4xx) || defined(CONFIG_BOOKE) encloses some
printk which can be compiled in all cases.
Replace by IS_ENABLED().
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/a1b6ef3d657c8f249193442f56868fc358ea5b6c.1597643160.git.christophe.leroy@csgroup.eu
|
|
This #ifdef CONFIG_PPC_BOOK3S_64 calls preload_new_slb_context()
when radix is not enabled.
radix_enabled() is always defined, and the prototype for
preload_new_slb_context() is always present, so the #ifdef
is unneeded.
Replace it by IS_ENABLED().
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/d31506ca9bac9def68cf7424eded63fdc4fb6660.1597643167.git.christophe.leroy@csgroup.eu
|
|
isync() is always defined, no need for an #ifdef.
Replace it by IS_ENABLED(CONFIG_PPC_47x).
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/ac8da0e3baa91dda805e1e492fd65aecd90c1fb5.1597643156.git.christophe.leroy@csgroup.eu
|
|
We need r1 to be properly set before activating MMU, otherwise any new
exception taken while saving registers into the stack in exception
prologs will use the user stack, which is wrong and will even lockup
or crash when KUAP is selected.
Do that by switching the meaning of r11 and r1 until we have saved r1
to the stack: copy r1 into r11 and setup the new stack pointer in r1.
To avoid complicating and impacting all generic and specific prolog
code (and more), copy back r1 into r11 once r11 is save onto
the stack.
We could get rid of copying r1 back and forth at the cost of
rewriting everything to use r1 instead of r11 all the way when
CONFIG_VMAP_STACK is set, but the effort is probably not worth it.
Fixes: 028474876f47 ("powerpc/32: prepare for CONFIG_VMAP_STACK")
Cc: [email protected]
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/8f85e8752ac5af602db7237ef53d634f4f3d3892.1599486108.git.christophe.leroy@csgroup.eu
|
|
We need r1 to be properly set before activating MMU, so
reading task_struct->stack must be done with MMU off.
This means we need an additional register to play with MSR
bits while r11 now points to the stack. For that, move r10
back to CR (As is already done for hash MMU) and use r10.
We still don't have r1 correct yet when we activate MMU.
It is done in following patch.
Fixes: 028474876f47 ("powerpc/32: prepare for CONFIG_VMAP_STACK")
Cc: [email protected]
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/a027d447022a006c9c4958ac734128e577a3c5c1.1599486108.git.christophe.leroy@csgroup.eu
|
|
__put_user_asm() and __put_user_asm2() are not used anymore.
Remove them.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/d66c4a372738d2fbd81f433ca86e4295871ace6a.1599216721.git.christophe.leroy@csgroup.eu
|
|
__patch_instruction() is the only user of __put_user_asm() outside
of asm/uaccess.h
Switch to the new __put_user_asm_goto() to enable retirement of
__put_user_asm() in a later patch.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/b9745b122f4a9ae72cef445c61320022ab8b77b7.1599216721.git.christophe.leroy@csgroup.eu
|
|
__put_user_asm_goto() provides more flexibility to GCC and avoids using
a local variable to tell if the write succeeded or not.
GCC can then avoid implementing a cmp in the fast path.
See the difference for a small function like the PPC64 version of
save_general_regs() in arch/powerpc/kernel/signal_32.c:
Before the patch (unreachable nop removed):
0000000000000c10 <.save_general_regs>:
c10: 39 20 00 2c li r9,44
c14: 39 40 00 00 li r10,0
c18: 7d 29 03 a6 mtctr r9
c1c: 38 c0 00 00 li r6,0
c20: 48 00 00 14 b c34 <.save_general_regs+0x24>
c30: 42 40 00 40 bdz c70 <.save_general_regs+0x60>
c34: 28 2a 00 27 cmpldi r10,39
c38: 7c c8 33 78 mr r8,r6
c3c: 79 47 1f 24 rldicr r7,r10,3,60
c40: 39 20 00 01 li r9,1
c44: 41 82 00 0c beq c50 <.save_general_regs+0x40>
c48: 7d 23 38 2a ldx r9,r3,r7
c4c: 79 29 00 20 clrldi r9,r9,32
c50: 91 24 00 00 stw r9,0(r4)
c54: 2c 28 00 00 cmpdi r8,0
c58: 39 4a 00 01 addi r10,r10,1
c5c: 38 84 00 04 addi r4,r4,4
c60: 41 82 ff d0 beq c30 <.save_general_regs+0x20>
c64: 38 60 ff f2 li r3,-14
c68: 4e 80 00 20 blr
c70: 38 60 00 00 li r3,0
c74: 4e 80 00 20 blr
0000000000000000 <.fixup>:
cc: 39 00 ff f2 li r8,-14
d0: 48 00 00 00 b d0 <.fixup+0xd0>
d0: R_PPC64_REL24 .text+0xc54
After the patch:
0000000000001490 <.save_general_regs>:
1490: 39 20 00 2c li r9,44
1494: 39 40 00 00 li r10,0
1498: 7d 29 03 a6 mtctr r9
149c: 60 00 00 00 nop
14a0: 28 2a 00 27 cmpldi r10,39
14a4: 79 48 1f 24 rldicr r8,r10,3,60
14a8: 39 20 00 01 li r9,1
14ac: 41 82 00 0c beq 14b8 <.save_general_regs+0x28>
14b0: 7d 23 40 2a ldx r9,r3,r8
14b4: 79 29 00 20 clrldi r9,r9,32
14b8: 91 24 00 00 stw r9,0(r4)
14bc: 39 4a 00 01 addi r10,r10,1
14c0: 38 84 00 04 addi r4,r4,4
14c4: 42 00 ff dc bdnz 14a0 <.save_general_regs+0x10>
14c8: 38 60 00 00 li r3,0
14cc: 4e 80 00 20 blr
14d0: 38 60 ff f2 li r3,-14
14d4: 4e 80 00 20 blr
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/94ba5a5138f99522e1562dbcdb38d31aa790dc89.1599216721.git.christophe.leroy@csgroup.eu
|
|
Enable pre-update addressing mode in __put_user_asm_goto()
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/346f65d677adb11865f7762c25a1ca3c64404ba5.1599216023.git.christophe.leroy@csgroup.eu
|
|
The 8xx has 4 page sizes: 4k, 16k, 512k and 8M
4k and 16k can be selected at build time as standard page sizes,
and 512k and 8M are hugepages.
When 4k standard pages are selected, 16k pages are not available.
Allow 16k pages as hugepages when 4k pages are used.
To allow that, implement arch_make_huge_pte() which receives
the necessary arguments to allow setting the PTE in accordance
with the page size:
- 512 k pages must have _PAGE_HUGE and _PAGE_SPS. They are set
by pte_mkhuge(). arch_make_huge_pte() does nothing.
- 16 k pages must have only _PAGE_SPS. arch_make_huge_pte() clears
_PAGE_HUGE.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/a518abc29266a708dfbccc8fce9ae6694fe4c2c6.1598862623.git.christophe.leroy@csgroup.eu
|
|
On 8xx, the number of entries occupied by a PTE in the page tables
depends on the size of the page. At the time being, this calculation
is done in two places: in pte_update() and in set_huge_pte_at()
Refactor this calculation into a helper called
number_of_cells_per_pte(). For the time being, the val param is
unused. It will be used by following patch.
Instead of opencoding is_hugepd(), use hugepd_ok() with a forward
declaration.
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/f6ea2483c2c389567b007945948f704d18cfaeea.1598862623.git.christophe.leroy@csgroup.eu
|
|
The following random segfault is observed from time to time with
map_hugetlb selftest:
root@localhost:~# ./map_hugetlb 1 19
524288 kB hugepages
Mapping 1 Mbytes
Segmentation fault
[ 31.219972] map_hugetlb[365]: segfault (11) at 117 nip 77974f8c lr 779a6834 code 1 in ld-2.23.so[77966000+21000]
[ 31.220192] map_hugetlb[365]: code: 9421ffc0 480318d1 93410028 90010044 9361002c 93810030 93a10034 93c10038
[ 31.220307] map_hugetlb[365]: code: 93e1003c 93210024 8123007c 81430038 <80e90004> 814a0004 7f443a14 813a0004
[ 31.221911] BUG: Bad rss-counter state mm:(ptrval) type:MM_FILEPAGES val:33
[ 31.229362] BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:5
This fault is due to hugetlb_free_pgd_range() freeing page tables
that are also used by regular pages.
As explain in the comment at the beginning of
hugetlb_free_pgd_range(), the verification done in free_pgd_range()
on floor and ceiling is not done here, which means
hugetlb_free_pte_range() can free outside the expected range.
As the verification cannot be done in hugetlb_free_pgd_range(), it
must be done in hugetlb_free_pte_range().
Fixes: b250c8c08c79 ("powerpc/8xx: Manage 512k huge pages as standard pages.")
Cc: [email protected]
Signed-off-by: Christophe Leroy <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/f0cb2a5477cd87d1eaadb128042e20aeb2bc2859.1598860677.git.christophe.leroy@csgroup.eu
|
|
Enabling CONFIG_TAU_INT causes random crashes:
Unrecoverable exception 1700 at c0009414 (msr=1000)
Oops: Unrecoverable exception, sig: 6 [#1]
BE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2 PowerMac
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.7.0-pmac-00043-gd5f545e1a8593 #5
NIP: c0009414 LR: c0009414 CTR: c00116fc
REGS: c0799eb8 TRAP: 1700 Not tainted (5.7.0-pmac-00043-gd5f545e1a8593)
MSR: 00001000 <ME> CR: 22000228 XER: 00000100
GPR00: 00000000 c0799f70 c076e300 00800000 0291c0ac 00e00000 c076e300 00049032
GPR08: 00000001 c00116fc 00000000 dfbd3200 ffffffff 007f80a8 00000000 00000000
GPR16: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 c075ce04
GPR24: c075ce04 dfff8880 c07b0000 c075ce04 00080000 00000001 c079ef98 c079ef5c
NIP [c0009414] arch_cpu_idle+0x24/0x6c
LR [c0009414] arch_cpu_idle+0x24/0x6c
Call Trace:
[c0799f70] [00000001] 0x1 (unreliable)
[c0799f80] [c0060990] do_idle+0xd8/0x17c
[c0799fa0] [c0060ba4] cpu_startup_entry+0x20/0x28
[c0799fb0] [c072d220] start_kernel+0x434/0x44c
[c0799ff0] [00003860] 0x3860
Instruction dump:
XXXXXXXX XXXXXXXX XXXXXXXX 3d20c07b XXXXXXXX XXXXXXXX XXXXXXXX 7c0802a6
XXXXXXXX XXXXXXXX XXXXXXXX 4e800421 XXXXXXXX XXXXXXXX XXXXXXXX 7d2000a6
---[ end trace 3a0c9b5cb216db6b ]---
Resolve this problem by disabling each THRMn comparator when handling
the associated THRMn interrupt and by disabling the TAU entirely when
updating THRMn thresholds.
Fixes: 1da177e4c3f41 ("Linux-2.6.12-rc2")
Signed-off-by: Finn Thain <[email protected]>
Tested-by: Stan Johnson <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/5a0ba3dc5612c7aac596727331284a3676c08472.1599260540.git.fthain@telegraphics.com.au
|
|
According to Freescale's documentation, MPC74XX processors have an
erratum that prevents the TAU interrupt from working, so don't try to
use it when running on those processors.
Fixes: 1da177e4c3f41 ("Linux-2.6.12-rc2")
Signed-off-by: Finn Thain <[email protected]>
Tested-by: Stan Johnson <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/c281611544768e758bd58fe812cf702a5bd2d042.1599260540.git.fthain@telegraphics.com.au
|
|
The commentary at the call site seems to disagree with the code. The
conditional prevents calling set_thresholds() via the exception handler,
which appears to crash. Perhaps that's because it immediately triggers
another TAU exception. Anyway, calling set_thresholds() from TAUupdate()
is redundant because tau_timeout() does so.
Fixes: 1da177e4c3f41 ("Linux-2.6.12-rc2")
Signed-off-by: Finn Thain <[email protected]>
Tested-by: Stan Johnson <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/d7c7ee33232cf72a6a6bbb6ef05838b2e2b113c0.1599260540.git.fthain@telegraphics.com.au
|
|
Since commit 19dbdcb8039cf ("smp: Warn on function calls from softirq
context") the Thermal Assist Unit driver causes a warning like the
following when CONFIG_SMP is enabled.
------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at kernel/smp.c:428 smp_call_function_many_cond+0xf4/0x38c
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.7.0-pmac #3
NIP: c00b37a8 LR: c00b3abc CTR: c001218c
REGS: c0799c60 TRAP: 0700 Not tainted (5.7.0-pmac)
MSR: 00029032 <EE,ME,IR,DR,RI> CR: 42000224 XER: 00000000
GPR00: c00b3abc c0799d18 c076e300 c079ef5c c0011fec 00000000 00000000 00000000
GPR08: 00000100 00000100 00008000 ffffffff 42000224 00000000 c079d040 c079d044
GPR16: 00000001 00000000 00000004 c0799da0 c079f054 c07a0000 c07a0000 00000000
GPR24: c0011fec 00000000 c079ef5c c079ef5c 00000000 00000000 00000000 00000000
NIP [c00b37a8] smp_call_function_many_cond+0xf4/0x38c
LR [c00b3abc] on_each_cpu+0x38/0x68
Call Trace:
[c0799d18] [ffffffff] 0xffffffff (unreliable)
[c0799d68] [c00b3abc] on_each_cpu+0x38/0x68
[c0799d88] [c0096704] call_timer_fn.isra.26+0x20/0x7c
[c0799d98] [c0096b40] run_timer_softirq+0x1d4/0x3fc
[c0799df8] [c05b4368] __do_softirq+0x118/0x240
[c0799e58] [c0039c44] irq_exit+0xc4/0xcc
[c0799e68] [c000ade8] timer_interrupt+0x1b0/0x230
[c0799ea8] [c0013520] ret_from_except+0x0/0x14
--- interrupt: 901 at arch_cpu_idle+0x24/0x6c
LR = arch_cpu_idle+0x24/0x6c
[c0799f70] [00000001] 0x1 (unreliable)
[c0799f80] [c0060990] do_idle+0xd8/0x17c
[c0799fa0] [c0060ba8] cpu_startup_entry+0x24/0x28
[c0799fb0] [c072d220] start_kernel+0x434/0x44c
[c0799ff0] [00003860] 0x3860
Instruction dump:
8129f204 2f890000 40beff98 3d20c07a 8929eec4 2f890000 40beff88 0fe00000
81220000 552805de 550802ef 4182ff84 <0fe00000> 3860ffff 7f65db78 7f44d378
---[ end trace 34a886e47819c2eb ]---
Don't call on_each_cpu() from a timer callback, call it from a worker
thread instead.
Fixes: 1da177e4c3f41 ("Linux-2.6.12-rc2")
Tested-by: Stan Johnson <[email protected]>
Signed-off-by: Finn Thain <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/bb61650bea4f4c91fb8e24b9a6f130a1438651a7.1599260540.git.fthain@telegraphics.com.au
|
|
According to the MPC750 Users Manual, the SITV value in Thermal
Management Register 3 is 13 bits long. The present code calculates the
SITV value as 60 * 500 cycles. This would overflow to give 10 us on
a 500 MHz CPU rather than the intended 60 us. (But according to the
Microprocessor Datasheet, there is also a factor of 266 that has to be
applied to this value on certain parts i.e. speed sort above 266 MHz.)
Always use the maximum cycle count, as recommended by the Datasheet.
Fixes: 1da177e4c3f41 ("Linux-2.6.12-rc2")
Signed-off-by: Finn Thain <[email protected]>
Tested-by: Stan Johnson <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/896f542e5f0f1d6cf8218524c2b67d79f3d69b3c.1599260540.git.fthain@telegraphics.com.au
|
|
MAX_PHYSMEM #define is used along with sparsemem to determine the SECTION_SHIFT
value. Powerpc also uses the same value to limit the max memory enabled on the
system. With 4K PAGE_SIZE and hash translation mode, we want to limit the max
memory enabled to 64TB due to page table size restrictions. However, with
radix translation, we don't have these restrictions. Hence split the radix
and hash MA_PHYSMEM limit and use different limit for each of them.
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
With commit: 0034d395f89d ("powerpc/mm/hash64: Map all the kernel
regions in the same 0xc range"), we now split the 64TB address range
into 4 contexts each of 16TB. That implies we can do only 16TB linear
mapping.
On some systems, eg. Power9, memory attached to nodes > 0 will appear
above 16TB in the linear mapping. This resulted in kernel crash when
we boot such systems in hash translation mode with 4K PAGE_SIZE.
This patch updates the kernel mapping such that we now start supporting upto
61TB of memory with 4K. The kernel mapping now looks like below 4K PAGE_SIZE
and hash translation.
vmalloc start = 0xc0003d0000000000
IO start = 0xc0003e0000000000
vmemmap start = 0xc0003f0000000000
Our MAX_PHYSMEM_BITS for 4K is still 64TB even though we can only map 61TB.
We prevent bolt mapping anything outside 61TB range by checking against
H_VMALLOC_START.
Fixes: 0034d395f89d ("powerpc/mm/hash64: Map all the kernel regions in the same 0xc range")
Reported-by: Cameron Berkenpas <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|