Age | Commit message (Collapse) | Author | Files | Lines |
|
Commit:
6e998916dfe3 ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency")
fixed a problem whereby clock_nanosleep() followed by clock_gettime() could
allow a task to wake early. It addressed the problem by calling the scheduling
class's update_curr() when the cputimer starts.
Said change induced a considerable performance regression on the syscalls
times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID). There are some
debuggers and applications that monitor their own performance that
accidentally depend on the performance of these specific calls.
This patch mitigates the performance loss by prefetching data in the CPU
cache, as stalls due to cache misses appear to be where most time is spent
in our benchmarks.
Here are the performance gains of this patch over v4.7-rc7 on a Sandy Bridge
box with 32 logical cores and 2 NUMA nodes. The test is repeated with a
variable number of threads, from 2 to 4*num_cpus; the results are in
seconds and correspond to the average of 10 runs; the percentage gain is
computed with (before-after)/before, so a positive value is an improvement
(it's faster). The improvement varies between a few percent for 5-20
threads and more than 10% for 2 or >20 threads.
pound_clock_gettime:
threads 4.7-rc7 patched 4.7-rc7
[num] [secs] [secs (percent)]
2 3.48 3.06 ( 11.83%)
5 3.33 3.25 ( 2.40%)
8 3.37 3.26 ( 3.30%)
12 3.32 3.37 ( -1.60%)
21 4.01 3.90 ( 2.74%)
30 3.63 3.36 ( 7.41%)
48 3.71 3.11 ( 16.27%)
79 3.75 3.16 ( 15.74%)
110 3.81 3.25 ( 14.80%)
128 3.88 3.31 ( 14.76%)
pound_times:
threads 4.7-rc7 patched 4.7-rc7
[num] [secs] [secs (percent)]
2 3.65 3.25 ( 11.03%)
5 3.45 3.17 ( 7.92%)
8 3.52 3.22 ( 8.69%)
12 3.29 3.36 ( -2.04%)
21 4.07 3.92 ( 3.78%)
30 3.87 3.40 ( 12.17%)
48 3.79 3.16 ( 16.61%)
79 3.88 3.28 ( 15.42%)
110 3.90 3.38 ( 13.35%)
128 4.00 3.38 ( 15.45%)
pound_times and pound_clock_gettime are two benchmarks included in
the MMTests framework. They launch a given number of threads which
repeatedly call times() or clock_gettime(). The results above can be
reproduced by cloning MMTests from github.com and running the "poundtime"
workload:
$ git clone https://github.com/gormanm/mmtests.git
$ cd mmtests
$ cp configs/config-global-dhp__workload_poundtime config
$ ./run-mmtests.sh --run-monitor $(uname -r)
The above will run "poundtime" measuring the kernel currently running on
the machine; once a new kernel is installed and the machine rebooted,
running again
$ cd mmtests
$ ./run-mmtests.sh --run-monitor $(uname -r)
will produce results to compare with. A comparison table will be output
with:
$ cd mmtests/work/log
$ ../../compare-kernels.sh
the table will contain a lot of entries; grepping for "Amean" (as in
"arithmetic mean") will give the tables presented above. The source code
for the two benchmarks is reported at the end of this changelog for
clarity.
The cache misses addressed by this patch were found using a combination of
`perf top`, `perf record` and `perf annotate`. The incriminated lines were
found to be
struct sched_entity *curr = cfs_rq->curr;
and
delta_exec = now - curr->exec_start;
in the function update_curr() from kernel/sched/fair.c. This patch
prefetches the data from memory just before update_curr() is called in the
execution path of interest.
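A minimal sketch of the prefetch helper this suggests, reconstructed from
the description above (guards and exact call sites in the patch may differ):

static inline void prefetch_curr_exec_start(struct task_struct *p)
{
#ifdef CONFIG_FAIR_GROUP_SCHED
	struct sched_entity *curr = (&p->se)->cfs_rq->curr;
#else
	struct sched_entity *curr = (&task_rq(p)->cfs)->curr;
#endif
	prefetch(curr);
	prefetch(&curr->exec_start);
}

The idea is to call this just before the cputime accounting path reaches
update_curr(), so the two cache lines named above are already in flight
by the time they are needed.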
A comparison of the total number of cycles before and after the patch
follows; the data is obtained using `perf stat -r 10 -ddd <program>`
running over the same sequence of number of threads used above (a positive
gain is an improvement):
threads cycles before cycles after gain
2 19,699,563,964 +-1.19% 17,358,917,517 +-1.85% 11.88%
5 47,401,089,566 +-2.96% 45,103,730,829 +-0.97% 4.85%
8 80,923,501,004 +-3.01% 71,419,385,977 +-0.77% 11.74%
12 112,326,485,473 +-0.47% 110,371,524,403 +-0.47% 1.74%
21 193,455,574,299 +-0.72% 180,120,667,904 +-0.36% 6.89%
30 315,073,519,013 +-1.64% 271,222,225,950 +-1.29% 13.92%
48 321,969,515,332 +-1.48% 273,353,977,321 +-1.16% 15.10%
79 337,866,003,422 +-0.97% 289,462,481,538 +-1.05% 14.33%
110 338,712,691,920 +-0.78% 290,574,233,170 +-0.77% 14.21%
128 348,384,794,006 +-0.50% 292,691,648,206 +-0.66% 15.99%
A comparison of cache miss vs total cache loads ratios, before and after
the patch (again from the `perf stat -r 10 -ddd <program>` tables):
threads L1 misses/total*100 L1 misses/total*100 gain
before after
2 7.43 +-4.90% 7.36 +-4.70% 0.94%
5 13.09 +-4.74% 13.52 +-3.73% -3.28%
8 13.79 +-5.61% 12.90 +-3.27% 6.45%
12 11.57 +-2.44% 8.71 +-1.40% 24.72%
21 12.39 +-3.92% 9.97 +-1.84% 19.53%
30 13.91 +-2.53% 11.73 +-2.28% 15.67%
48 13.71 +-1.59% 12.32 +-1.97% 10.14%
79 14.44 +-0.66% 13.40 +-1.06% 7.20%
110 15.86 +-0.50% 14.46 +-0.59% 8.83%
128 16.51 +-0.32% 15.06 +-0.78% 8.78%
As a final note, the following shows the evolution of performance figures
in the "poundtime" benchmark and pinpoints commit 6e998916dfe3
("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") as a
major source of degradation, mostly unaddressed to this day (figures
expressed in seconds).
pound_clock_gettime:
threads    parent of       6e998916dfe3    4.7-rc7
           6e998916dfe3    itself
2 2.23 3.68 ( -64.56%) 3.48 (-55.48%)
5 2.83 3.78 ( -33.42%) 3.33 (-17.43%)
8 2.84 4.31 ( -52.12%) 3.37 (-18.76%)
12 3.09 3.61 ( -16.74%) 3.32 ( -7.17%)
21 3.14 4.63 ( -47.36%) 4.01 (-27.71%)
30 3.28 5.75 ( -75.37%) 3.63 (-10.80%)
48 3.02 6.05 (-100.56%) 3.71 (-22.99%)
79 2.88 6.30 (-118.90%) 3.75 (-30.26%)
110 2.95 6.46 (-119.00%) 3.81 (-29.24%)
128 3.05 6.42 (-110.08%) 3.88 (-27.04%)
pound_times:
threads    parent of       6e998916dfe3    4.7-rc7
           6e998916dfe3    itself
2 2.27 3.73 ( -64.71%) 3.65 (-61.14%)
5 2.78 3.77 ( -35.56%) 3.45 (-23.98%)
8 2.79 4.41 ( -57.71%) 3.52 (-26.05%)
12 3.02 3.56 ( -17.94%) 3.29 ( -9.08%)
21 3.10 4.61 ( -48.74%) 4.07 (-31.34%)
30 3.33 5.75 ( -72.53%) 3.87 (-16.01%)
48 2.96 6.06 (-105.04%) 3.79 (-28.10%)
79 2.88 6.24 (-116.83%) 3.88 (-34.81%)
110 2.98 6.37 (-114.08%) 3.90 (-31.12%)
128 3.10 6.35 (-104.61%) 4.00 (-28.87%)
The source code of the two benchmarks follows. To compile the two:
NR_THREADS=42
for FILE in pound_times pound_clock_gettime; do
gcc -O2 -DNUM_THREADS=$NR_THREADS $FILE.c -o $FILE -lrt -lpthread
done
==== BEGIN pound_times.c ====
#include <stdio.h>
#include <pthread.h>
#include <sys/times.h>
struct tms start;
void *pound (void *threadid)
{
struct tms end;
int oldutime = 0;
int utime;
int i;
for (i = 0; i < 5000000 / NUM_THREADS; i++) {
times(&end);
utime = ((int)end.tms_utime - (int)start.tms_utime);
if (oldutime > utime) {
printf("utime decreased, was %d, now %d!\n", oldutime, utime);
}
oldutime = utime;
}
pthread_exit(NULL);
}
int main()
{
pthread_t th[NUM_THREADS];
long i;
times(&start);
for (i = 0; i < NUM_THREADS; i++) {
pthread_create (&th[i], NULL, pound, (void *)i);
}
pthread_exit(NULL);
return 0;
}
==== END pound_times.c ====
==== BEGIN pound_clock_gettime.c ====
#include <stdio.h>
#include <time.h>
#include <pthread.h>
#include <sys/types.h>
void *pound (void *threadid)
{
struct timespec ts;
int rc, i;
unsigned long prev = 0, this = 0;
for (i = 0; i < 5000000 / NUM_THREADS; i++) {
rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
if (rc < 0)
perror("clock_gettime");
this = (unsigned long)ts.tv_sec * 1000000000UL + ts.tv_nsec;
if (0 && this < prev)
printf("%lu ns timewarp at iteration %d\n", prev - this, i);
prev = this;
}
pthread_exit(NULL);
}
int main()
{
pthread_t th[NUM_THREADS];
long rc, i;
pid_t pgid;
for (i = 0; i < NUM_THREADS; i++) {
rc = pthread_create(&th[i], NULL, pound, (void *)i);
if (rc < 0)
perror("pthread_create");
}
pthread_exit(NULL);
return 0;
}
==== END pound_clock_gettime.c ====
Suggested-by: Mike Galbraith <[email protected]>
Signed-off-by: Giovanni Gherdovich <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
We should update cfs_rq->throttled_clock_task, not
pcfs_rq->throttle_clock_task.
The effect of this bug was probably occasional erratic
group scheduling, particularly in cgroup-intense workloads.
Signed-off-by: Xunlei Pang <[email protected]>
[ Added changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Konstantin Khlebnikov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Fixes: 55e16d30bd99 ("sched/fair: Rework throttle_count sync")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Current code in cpudeadline.c has a bug in re-heapifying when adding a
new element at the end of the heap, because a deadline value of 0 is
temporarily set in the new elem, then cpudl_change_key() is called
with the actual elem deadline as param.
However, the function compares the new deadline to set with the one
previously in the elem, which is 0. So, if current absolute deadlines
grew so large as to have negative values as s64, the comparison in
cpudl_change_key() makes the wrong decision. Instead, as dl_time_before()
shows, the kernel should correctly handle absolute deadline
wrap-arounds.
This patch fixes the problem with a minimally invasive change that
forces cpudl_change_key() to heapify up in this case.
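The shape of the fix, illustrated on a generic user-space max-heap
(hypothetical names, not the kernel's): after appending at the tail,
sift up unconditionally instead of comparing against the placeholder key.

#include <stdio.h>

static unsigned long long heap[16];
static int heap_size;

/* wrap-safe "a before b", same trick as the kernel's dl_time_before() */
static int dl_before(unsigned long long a, unsigned long long b)
{
	return (long long)(a - b) < 0;
}

static void sift_up(int idx)
{
	while (idx > 0) {
		int parent = (idx - 1) / 2;
		unsigned long long tmp;

		/* max-heap on deadlines: latest deadline at the top */
		if (!dl_before(heap[parent], heap[idx]))
			break;
		tmp = heap[parent];
		heap[parent] = heap[idx];
		heap[idx] = tmp;
		idx = parent;
	}
}

static void heap_add(unsigned long long dl)
{
	/* no placeholder-0 plus compare: a placeholder of 0 looks
	 * "later" than any deadline that wrapped into the negative
	 * s64 range, so the comparison would skip the heapify up */
	heap[heap_size] = dl;
	sift_up(heap_size);
	heap_size++;
}

int main(void)
{
	heap_add(0xffffffffffffff00ULL);	/* wrapped: "before" 0x100 */
	heap_add(0x100ULL);
	printf("top: %llx\n", heap[0]);		/* prints 100 */
	return 0;
}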
Signed-off-by: Tommaso Cucinotta <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Luca Abeni <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Intel Kabylake PCH has the same DWC3 as Intel
Sunrisepoint. Add the new ID to the supported devices.
Cc: <[email protected]>
Signed-off-by: Heikki Krogerus <[email protected]>
Signed-off-by: Felipe Balbi <[email protected]>
|
|
If we stop early due to a short packet, we will
not be able to give back all TRBs.
Cc: <[email protected]>
Cc: Brian E Rogers <[email protected]>
Signed-off-by: Felipe Balbi <[email protected]>
|
|
DWC3 has one interesting peculiarity with chained
transfers. If we setup N chained transfers and we
get a short packet before processing all N TRBs,
DWC3 will (conditionally) issue an XferComplete or
XferInProgress event and retire all TRBs from the
one which got a short packet to the last without
clearing their HWO bits.
This means SW must clear HWO bit manually, which
this patch is doing.
Cc: <[email protected]>
Cc: Brian E Rogers <[email protected]>
Signed-off-by: Felipe Balbi <[email protected]>
|
|
When using SG lists, we would end up setting
request->actual to:
num_mapped_sgs * (request->length - count)
Let's fix that up by incrementing request->actual
only once.
Cc: <[email protected]>
Reported-by: Brian E Rogers <[email protected]>
Signed-off-by: Felipe Balbi <[email protected]>
|
|
Fix the direct assignment of u32 data to the u8 offset and length
attributes of the nft_exthdr structure.
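A hedged sketch of the resulting range check before the narrowing
assignment (attribute and field names as in the nft_exthdr expression;
the exact diff may differ):

	u32 offset, len;

	offset = ntohl(nla_get_be32(tb[NFTA_EXTHDR_OFFSET]));
	len    = ntohl(nla_get_be32(tb[NFTA_EXTHDR_LEN]));
	if (offset > U8_MAX || len > U8_MAX)
		return -ERANGE;
	priv->offset = offset;	/* the u8 fields are now known to fit */
	priv->len    = len;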
Signed-off-by: Laura Garcia Liebana <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
There's a perf stat bug easy to observe on a machine with only one cgroup:
$ perf stat -e cycles -I 1000 -C 0 -G /
# time counts unit events
1.000161699 <not counted> cycles /
2.000355591 <not counted> cycles /
3.000565154 <not counted> cycles /
4.000951350 <not counted> cycles /
We'd expect some output there.
The underlying problem is that there is an optimization in
perf_cgroup_sched_{in,out}() that skips the switch of cgroup events
if the old and new cgroups in a task switch are the same.
This optimization interacts with the current code in two ways
that cause a CPU context's cgroup (cpuctx->cgrp) to be NULL even if a
cgroup event matches the current task. These are:
1. On creation of the first cgroup event in a CPU: In current code,
cpuctx->cgrp is only set in perf_cgroup_sched_in, but due to the
aforesaid optimization, perf_cgroup_sched_in will not run until the next
cgroup switch in that CPU. This may happen late or never happen,
depending on the system's number of cgroups, CPU load, etc.
2. On deletion of the last cgroup event in a cpuctx: In list_del_event,
cpuctx->cgrp is set NULL. Any new cgroup event will not be sched in
because cpuctx->cgrp == NULL until a cgroup switch occurs and
perf_cgroup_sched_in is executed (updating cpuctx->cgrp).
This patch fixes both problems by setting cpuctx->cgrp in list_add_event,
mirroring what list_del_event does when removing a cgroup event from CPU
context, as introduced in:
commit 68cacd29167b ("perf_events: Fix stale ->cgrp pointer in update_cgrp_time_from_cpuctx()")
With this patch, cpuctx->cgrp is always set/cleared when installing/removing
the first/last cgroup event in/from the CPU context. With cpuctx->cgrp
correctly set, event_filter_match() works as intended when events are
sched in/out.
After the fix, the output is as expected:
$ perf stat -e cycles -I 1000 -a -G /
# time counts unit events
1.004699159 627342882 cycles /
2.007397156 615272690 cycles /
3.010019057 616726074 cycles /
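A hedged sketch of the fix's shape in list_add_event() (reconstructed
from the description above; the exact upstream diff may differ):

	if (is_cgroup_event(event)) {
		struct perf_cpu_context *cpuctx;

		ctx->nr_cgroups++;
		cpuctx = __get_cpu_context(ctx);
		/*
		 * Mirror list_del_event(): publish cpuctx->cgrp as soon
		 * as the first cgroup event is installed, provided it
		 * matches the task currently on this CPU.
		 */
		if (!cpuctx->cgrp &&
		    perf_cgroup_from_task(current, ctx) == event->cgrp)
			cpuctx->cgrp = event->cgrp;
	}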
Signed-off-by: David Carrillo-Cisneros <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vegard Nossum <[email protected]>
Cc: Vince Weaver <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
dereference crash
Vegard Nossum reported that perf fuzzing generates a NULL
pointer dereference crash:
> Digging a bit deeper into this, it seems the event itself is getting
> created by perf_event_open() and it gets added to the pmu_event_list
> through:
>
> perf_event_open()
> - perf_event_alloc()
> - account_event()
> - account_pmu_sb_event()
> - attach_sb_event()
>
> so at this point the event is being attached but its ->ctx is still
> NULL. It seems like ->ctx is set just a bit later in
> perf_event_open(), though.
>
> But before that, __schedule() comes along and creates a stack trace
> similar to the one above:
>
> __schedule()
> - __perf_event_task_sched_out()
> - perf_iterate_sb()
> - perf_iterate_sb_cpu()
> - event_filter_match()
> - perf_cgroup_match()
> - __get_cpu_context()
> - (dereference ctx which is NULL)
>
> So I guess the question is... should the event be attached (= put on
> the list) before ->ctx gets set? Or should the cgroup code check for a
> NULL ->ctx?
The latter seems like the simplest solution. Moving the list-add later
creates a bit of a mess.
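The resulting guard, sketched (per the Fixes tag below, the side-band
iteration in perf_iterate_sb_cpu() is the natural place; exact code may
differ):

	/* before touching the event in the side-band list: */
	if (!smp_load_acquire(&event->ctx))
		continue;	/* not fully formed yet, skip it */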
Reported-by: Vegard Nossum <[email protected]>
Tested-by: Vegard Nossum <[email protected]>
Tested-by: Vince Weaver <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Carrillo-Cisneros <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Fixes: f2fb6bef9251 ("perf/core: Optimize side-band event delivery")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
This patch eliminates a source of imprecise APIC timer interrupts,
an imprecision that may result in double interrupts or even late
interrupts.
The TSC deadline clockevent devices' configuration and registration
happens before the TSC frequency calibration is refined in
tsc_refine_calibration_work().
This results in the TSC clocksource and the TSC deadline clockevent
devices being configured with slightly different frequencies: the former
gets the refined one and the latter are configured with the inaccurate
frequency detected earlier by means of the "Fast TSC calibration using PIT".
Within the APIC code, introduce the notifier function
lapic_update_tsc_freq() which reconfigures all per-CPU TSC deadline
clockevent devices with the current tsc_khz.
Call it from the TSC code after TSC calibration refinement has happened.
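A sketch of the notifier's shape (per-CPU clockevent device and frequency
variable names follow arch/x86/kernel/apic/apic.c; the exact code may
differ):

static void __lapic_update_tsc_freq(void *info)
{
	if (!this_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER))
		return;

	clockevents_update_freq(this_cpu_ptr(&lapic_events),
				lapic_timer_frequency);
}

void lapic_update_tsc_freq(void)
{
	/*
	 * ->mult and ->shift of each device change, so run the update
	 * on every CPU rather than racing with a remote reprogram.
	 */
	on_each_cpu(__lapic_update_tsc_freq, NULL, 0);
}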
Signed-off-by: Nicolai Stange <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christopher S. Hall <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Hidehiro Kawai <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Viresh Kumar <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
[ Pushed #ifdef CONFIG_X86_LOCAL_APIC into header, improved changelog. ]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
clockevents frequency roundoff error
I noticed the following bug/misbehavior on certain Intel systems: with a
single task running on a NOHZ CPU on an Intel Haswell, I observed that I
did not only get the one expected local_timer APIC interrupt, but two per
second at minimum. (!)
Further tracing showed that the first one precedes the programmed deadline
by up to ~50us and hence, it did nothing except for reprogramming the TSC
deadline clockevent device to trigger shortly thereafter again.
The reason for this is imprecise calibration, the timeout we program into
the APIC results in 'too short' timer interrupts. The core (hr)timer code
notices this (because it has a precise ktime source and sees the short
interrupt) and fixes it up by programming an additional very short
interrupt period.
This is obviously suboptimal.
The reason for the imprecise calibration is twofold, and this patch
fixes the first reason:
In setup_APIC_timer(), the registered clockevent device's frequency
is calculated by first dividing tsc_khz by TSC_DIVISOR and multiplying
it with 1000 afterwards:
(tsc_khz / TSC_DIVISOR) * 1000
The multiplication with 1000 is done for converting from kHz to Hz and the
division by TSC_DIVISOR is carried out in order to make sure that the final
result fits into an u32.
However, with the order given in this calculation, the roundoff error
introduced by the division gets magnified by a factor of 1000 by the
following multiplication.
To fix it, reversing the order of the division and the multiplication a la:
(tsc_khz * 1000) / TSC_DIVISOR
... reduces the roundoff error already.
Furthermore, if TSC_DIVISOR divides 1000, associativity holds:
(tsc_khz * 1000) / TSC_DIVISOR = tsc_khz * (1000 / TSC_DIVISOR)
and thus, the roundoff error even vanishes and the whole operation can be
carried out within 32 bits.
The powers of two that divide 1000 are 2, 4 and 8. A value of 8 for
TSC_DIVISOR still allows for TSC frequencies up to
2^32 / 10^9ns * 8 = 34.4GHz which is way larger than anything to expect
in the next years.
Thus we also replace the current TSC_DIVISOR value of 32 by 8, and reverse
the order of the division and the multiplication in the calculation of
the registered clockevent device's frequency.
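The roundoff difference can be checked in user space (the tsc_khz value
below is an arbitrary example):

#include <stdio.h>

int main(void)
{
	unsigned int tsc_khz = 2893299;	/* example: 2.893299 GHz TSC */

	/* divide first, then scale: the truncation is magnified x1000 */
	unsigned int bad = (tsc_khz / 8) * 1000;	/* 361662000 */
	/* 8 divides 1000, so the factored form is exact */
	unsigned int good = tsc_khz * (1000 / 8);	/* 361662375 */

	printf("%u vs %u (exact), off by %u Hz\n", bad, good, good - bad);
	return 0;
}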
Signed-off-by: Nicolai Stange <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christopher S. Hall <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Hidehiro Kawai <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Viresh Kumar <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
[ Improved changelog. ]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Where a device driver has set a 64-bit DMA mask to indicate the absence
of addressing limitations, we still need to ensure that we don't
allocate IOVAs beyond the actual input size of the IOMMU. The reported
aperture is the most reliable way we have of inferring that input
address size, so use that to enforce a hard upper limit where available.
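The clamp's shape (the geometry fields are part of the common IOMMU API;
the exact expression in the patch may differ):

	if (domain->geometry.force_aperture)
		dma_limit = min(dma_limit, domain->geometry.aperture_end);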
Fixes: 0db2e5d18f76 ("iommu: Implement common IOMMU ops for DMA mapping")
Signed-off-by: Robin Murphy <[email protected]>
Signed-off-by: Joerg Roedel <[email protected]>
|
|
We cannot do those initializations from apply_feature_fixups() as
this function runs in a very restricted environment on 32-bit where
the kernel isn't running at its linked address and the PTRRELOC()
macro must be used for any global access.
Instead, split them into a separate setup_feature_keys() function
which is called in a more suitable spot on ppc32.
Fixes: 309b315b6ec6 ("powerpc: Call jump_label_init() in apply_feature_fixups()")
Reported-and-tested-by: Christian Kujau <[email protected]>
Signed-off-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
We don't identify the machine type anymore...
Signed-off-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
This makes it easier to debug crashes that happen very early before
the kernel takes over Open Firmware by allowing us to relate the OF
reported crashing addresses to offsets within the kernel.
Signed-off-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
Both set_memory_ro() and set_memory_rw() will modify the page
attributes of at least one page, even if the numpages parameter is
zero.
The author expected that calling these functions with numpages == zero
would never happen. However, with the new 444d13ff10fb ("modules: add
ro_after_init support") feature, this happens frequently.
Therefore do the right thing and make these two functions return
gracefully if nothing should be done.
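The graceful return, sketched on one of the two entry points (the real
change sits in the common helper behind both functions in
arch/s390/mm/pageattr.c):

int set_memory_ro(unsigned long addr, int numpages)
{
	if (!numpages)		/* nothing to do: touch no page at all */
		return 0;
	/* ... existing attribute-change logic ... */
}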
Fixes crashes on module load like this one:
Unable to handle kernel pointer dereference in virtual kernel address space
Failing address: 000003ff80008000 TEID: 000003ff80008407
Fault in home space mode while using kernel ASCE.
AS:0000000000d18007 R3:00000001e6aa4007 S:00000001e6a10800 P:00000001e34ee21d
Oops: 0004 ilc:3 [#1] SMP
Modules linked in: x_tables
CPU: 10 PID: 1 Comm: systemd Not tainted 4.7.0-11895-g3fa9045 #4
Hardware name: IBM 2964 N96 703 (LPAR)
task: 00000001e9118000 task.stack: 00000001e9120000
Krnl PSW : 0704e00180000000 00000000005677f8 (rb_erase+0xf0/0x4d0)
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
Krnl GPRS: 000003ff80008b20 000003ff80008b20 000003ff80008b70 0000000000b9d608
000003ff80008b20 0000000000000000 00000001e9123e88 000003ff80008950
00000001e485ab40 000003ff00000000 000003ff80008b00 00000001e4858480
0000000100000000 000003ff80008b68 00000000001d5998 00000001e9123c28
Krnl Code: 00000000005677e8: ec1801c3007c cgij %r1,0,8,567b6e
00000000005677ee: e32010100020 cg %r2,16(%r1)
#00000000005677f4: a78401c2 brc 8,567b78
>00000000005677f8: e35010080024 stg %r5,8(%r1)
00000000005677fe: ec5801af007c cgij %r5,0,8,567b5c
0000000000567804: e30050000024 stg %r0,0(%r5)
000000000056780a: ebacf0680004 lmg %r10,%r12,104(%r15)
0000000000567810: 07fe bcr 15,%r14
Call Trace:
([<000003ff80008900>] __this_module+0x0/0xffffffffffffd700 [x_tables])
([<0000000000264fd4>] do_init_module+0x12c/0x220)
([<00000000001da14a>] load_module+0x24e2/0x2b10)
([<00000000001da976>] SyS_finit_module+0xbe/0xd8)
([<0000000000803b26>] system_call+0xd6/0x264)
Last Breaking-Event-Address:
[<000000000056771a>] rb_erase+0x12/0x4d0
Kernel panic - not syncing: Fatal exception: panic_on_oops
Reported-by: Christian Borntraeger <[email protected]>
Reported-and-tested-by: Sebastian Ott <[email protected]>
Fixes: e8a97e42dc98 ("s390/pageattr: allow kernel page table splitting")
Signed-off-by: Heiko Carstens <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>
|
|
When a device is in a status where CIO has killed all I/O by itself the
interrupt for a clear request may not contain an irb to determine the
clear function. Instead it contains an error pointer -EIO.
This was ignored by the DASD int_handler leading to a hanging device
waiting for a clear interrupt.
Handle -EIO error pointer correctly for requests that are clear pending and
treat the clear as successful.
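One plausible shape of the int_handler change (request state names are
the DASD driver's; the actual diff may differ):

	if (IS_ERR(irb) && PTR_ERR(irb) == -EIO &&
	    cqr->status == DASD_CQR_CLEAR_PENDING) {
		/* the clear raced with CIO's own kill: count it as done */
		cqr->status = DASD_CQR_CLEARED;
		dasd_schedule_device_bh(device);
		return;
	}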
Signed-off-by: Stefan Haberland <[email protected]>
Reviewed-by: Sebastian Ott <[email protected]>
Cc: [email protected]
Signed-off-by: Martin Schwidefsky <[email protected]>
|
|
Commit 8d460f6156cd ("powerpc/process: Add the function
flush_tmregs_to_thread") added flush_tmregs_to_thread() and included
the assumption that it would only be called for a task which is not
current.
Although this is correct for ptrace, when generating a core dump, some
of the routines which call flush_tmregs_to_thread() are called. This
leads to a WARNing such as:
Not expecting ptrace on self: TM regs may be incorrect
------------[ cut here ]------------
WARNING: CPU: 123 PID: 7727 at arch/powerpc/kernel/process.c:1088 flush_tmregs_to_thread+0x78/0x80
CPU: 123 PID: 7727 Comm: libvirtd Not tainted 4.8.0-rc1-gcc6x-g61e8a0d #1
task: c000000fe631b600 task.stack: c000000fe63b0000
NIP: c00000000001a1a8 LR: c00000000001a1a4 CTR: c000000000717780
REGS: c000000fe63b3420 TRAP: 0700 Not tainted (4.8.0-rc1-gcc6x-g61e8a0d)
MSR: 900000010282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 28004222 XER: 20000000
...
NIP [c00000000001a1a8] flush_tmregs_to_thread+0x78/0x80
LR [c00000000001a1a4] flush_tmregs_to_thread+0x74/0x80
Call Trace:
flush_tmregs_to_thread+0x74/0x80 (unreliable)
vsr_get+0x64/0x1a0
elf_core_dump+0x604/0x1430
do_coredump+0x5fc/0x1200
get_signal+0x398/0x740
do_signal+0x54/0x2b0
do_notify_resume+0x98/0xb0
ret_from_except_lite+0x70/0x74
So fix flush_tmregs_to_thread() to detect the case where it is called on
current, and a transaction is active, and in that case flush the TM regs
to the thread_struct.
This patch also moves flush_tmregs_to_thread() into ptrace.c as it is
only called from that file.
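A sketch of the fixed function (reconstructed from the description; the
exact TM helpers used upstream may differ):

void flush_tmregs_to_thread(struct task_struct *tsk)
{
	/* a non-current task was already flushed at context switch */
	if (tsk != current)
		return;

	if (MSR_TM_SUSPENDED(mfmsr())) {
		tm_reclaim_current(TM_CAUSE_SIGNAL);
	} else {
		tm_enable();
		tm_save_sprs(&tsk->thread);
	}
}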
Fixes: 8d460f6156cd ("powerpc/process: Add the function flush_tmregs_to_thread")
Signed-off-by: Cyril Bur <[email protected]>
[mpe: Flesh out change log]
Signed-off-by: Michael Ellerman <[email protected]>
|
|
Commit 7aef4136566b0 ("powerpc32: rewrite csum_partial_copy_generic()
based on copy_tofrom_user()") introduced a bug when the destination
address is odd and the initial csum is not null.
In that (rare) case the initial csum value has to be rotated one byte,
just as the resulting value is.
This patch also fixes related comments.
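The "rotated one byte" operation, illustrated (a hypothetical C helper,
not the patch's assembler):

static inline unsigned int csum_rotate_byte(unsigned int sum)
{
	/* one's complement sums commute with rotation, so an odd
	 * destination offset is compensated by an 8-bit rotate */
	return (sum << 8) | (sum >> 24);
}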
Fixes: 7aef4136566b0 ("powerpc32: rewrite csum_partial_copy_generic() based on copy_tofrom_user()")
Signed-off-by: Christophe Leroy <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
Adding fdb entries pointing to the bridge device uses fdb_insert(),
which lacks various checks and does not respect added_by_user flag.
As a result, some inconsistent behavior can happen:
* Adding temporary entries succeeds but results in permanent entries.
* Same goes for "dynamic" and "use".
* Changing mac address of the bridge device causes deletion of
user-added entries.
* Replacing existing entries looks successful from userspace but actually
not, regardless of NLM_F_EXCL flag.
Use the same logic as other entries and fix them.
Fixes: 3741873b4f73 ("bridge: allow adding of fdb entries pointing to the bridge device")
Signed-off-by: Toshiaki Makita <[email protected]>
Acked-by: Roopa Prabhu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Switch the setting of psl_fir_cntl from debug to production
environment recommended value. It mostly affects the PSL behavior when
an error is raised in psl_fir1/2.
Tested with cxlflash.
Signed-off-by: Frederic Barrat <[email protected]>
Reviewed-by: Uma Krishnan <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
I've found a funny live-lock between raid10 barriers during resync and
memory controller hard limits. Inside mpage_readpages() the task holds on
to its plug bio which blocks the barrier in raid10. Its memory cgroup has
no free memory, thus the task goes into the reclaimer, but all reclaimable
pages are dirty and cannot be written because raid10 is rebuilding and
stuck on the barrier.
The common flush of such IO in schedule() never happens, because the caller
doesn't go to sleep.
The lock is 'live' because changing the memory limit or killing tasks
which hold the stuck bio unblocks everything.
That was what happened in 3.18.x but I see no difference in upstream
logic. Theoretically this might happen even without memory cgroups.
Signed-off-by: Konstantin Khlebnikov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Disable all interrupts on suspend; they will be re-enabled
on resume. Otherwise, the suspend/resume process will
occasionally block.
Signed-off-by: Wenyou Yang <[email protected]>
Acked-by: Nicolas Ferre <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Commit b5a099c67a1c36b "net: ethernet: davicom: fix devicetree irq
resource" causes an interrupt storm after the ethernet interface
is activated on S3C24XX platform (ARM non-dt), due to the interrupt
trigger type not being set properly.
It seems that, after the parsing of IRQ flags was added in commit
7085a7401ba54e92b "drivers: platform: parse IRQ flags from resources",
there is no path for non-dt platforms where the irq_set_type callback
could be invoked if we don't pass the trigger type flags to the
request_irq() call.
In the case of a board where the regression is seen, the interrupt trigger
type flags are passed through a platform device's resource and are not
currently handled properly without passing the trigger type flags to the
request_irq() call. In the case of OF, an of_irq_get() call within the
platform_get_irq() function seems to ensure the required irq_chip setup,
but there is no equivalent code for non OF/ACPI platforms.
This patch mostly restores irq trigger type setting code which has been
removed in commit ("net: ethernet: davicom: fix devicetree irq resource").
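A sketch of the restored logic (mask and handler names as in
drivers/net/ethernet/davicom/dm9000.c; the exact diff may differ):

	irqflags = irq_res->flags & IRQF_TRIGGER_MASK;

	ret = request_irq(dev->irq, dm9000_interrupt,
			  irqflags | IRQF_SHARED, dev->name, dev);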
Fixes: b5a099c67a1c36b913 ("net: ethernet: davicom: fix devicetree irq resource")
Signed-off-by: Sylwester Nawrocki <[email protected]>
Acked-by: Robert Jarzmik <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
During boot, sometimes the kernel will test to see if an instruction
causes an undefined instruction exception. Unfortunately, the exit
path for these exceptions did not restore the address limit, which
causes the rootfs mount code to fail. Fix the missing address limit
restoration.
Tested-by: Guenter Roeck <[email protected]>
Signed-off-by: Russell King <[email protected]>
|
|
The late_alloc() PTE allocation function used by create_mapping_late()
does not call pgtable_page_ctor() on PTE pages it allocates, leaving
the per-page spinlock uninitialized.
Since generic page table manipulation code may assume that translation
table pages that are not owned by init_mm are covered by fully
constructed struct pages, the following crash may occur with the new
UEFI memory attributes table code.
efi: memattr: Processing EFI Memory Attributes table:
efi: memattr: 0x0000ffa16000-0x0000ffa82fff [Runtime Code |RUN| | |XP| | | | | | | | ]
Unable to handle kernel NULL pointer dereference at virtual address 00000010
pgd = c0204000
[00000010] *pgd=00000000
Internal error: Oops: 5 [#1] SMP ARM
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.7.0-rc4-00063-g3882aa7b340b #361
Hardware name: Generic DT based system
task: ed858000 ti: ed842000 task.ti: ed842000
PC is at __lock_acquire+0xa0/0x19a8
...
[<c038c830>] (__lock_acquire) from [<c038e4f8>] (lock_acquire+0x6c/0x88)
[<c038e4f8>] (lock_acquire) from [<c0c06134>] (_raw_spin_lock+0x2c/0x3c)
[<c0c06134>] (_raw_spin_lock) from [<c0410384>] (apply_to_page_range+0xe8/0x238)
[<c0410384>] (apply_to_page_range) from [<c1205f34>] (efi_set_mapping_permissions+0x54/0x5c)
[<c1205f34>] (efi_set_mapping_permissions) from [<c1247474>] (efi_memattr_apply_permissions+0x2b8/0x378)
[<c1247474>] (efi_memattr_apply_permissions) from [<c1248258>] (arm_enable_runtime_services+0x1f0/0x22c)
[<c1248258>] (arm_enable_runtime_services) from [<c0301f0c>] (do_one_initcall+0x44/0x174)
[<c0301f0c>] (do_one_initcall) from [<c1200d10>] (kernel_init_freeable+0x90/0x1e8)
[<c1200d10>] (kernel_init_freeable) from [<c0bff690>] (kernel_init+0x8/0x114)
[<c0bff690>] (kernel_init) from [<c0307ed0>] (ret_from_fork+0x14/0x24)
The crash is due to the fact that the UEFI page tables are not owned by
init_mm, yet are not covered by fully constructed struct pages either.
Given that the UEFI subsystem is currently the only user of
create_mapping_late(), add an unconditional call to pgtable_page_ctor() to
late_alloc().
Fixes: 9fc68b717c24 ("ARM/efi: Apply strict permissions for UEFI Runtime Services regions")
Signed-off-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Russell King <[email protected]>
|
|
To limit the amount of mapped low memory, we determine a physical address
boundary based on the start of the vmalloc area using __pa().
Strictly speaking, the vmalloc area location is arbitrary and does not
necessarily correspond to a valid physical address. For example, if
PAGE_OFFSET = 0x80000000
PHYS_OFFSET = 0x90000000
vmalloc_min = 0xf0000000
then __pa(vmalloc_min) overflows and returns a wrapped 0 when phys_addr_t
is a 32-bit type. Then the code that follows determines that the entire
physical memory is above that boundary and no low memory gets mapped at
all:
|[...]
|Machine model: Freescale i.MX51 NA04 Board
|Ignoring RAM at 0x90000000-0xb0000000 (!CONFIG_HIGHMEM)
|Consider using a HIGHMEM enabled kernel.
To avoid this problem let's make vmalloc_limit a 64-bit value all the
time and determine that boundary explicitly without using __pa().
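The wrap can be reproduced stand-alone with the example constants above
(the macro mirrors the generic __pa() definition, virt - PAGE_OFFSET +
PHYS_OFFSET):

#include <stdio.h>
#include <stdint.h>

#define PAGE_OFFSET 0x80000000u
#define PHYS_OFFSET 0x90000000u
#define __pa32(x)   ((uint32_t)((x) - PAGE_OFFSET + PHYS_OFFSET))

int main(void)
{
	uint32_t vmalloc_min = 0xf0000000u;
	uint64_t limit64 = (uint64_t)vmalloc_min - PAGE_OFFSET + PHYS_OFFSET;

	/* prints 0x0 vs 0x100000000: the 32-bit result wrapped to zero */
	printf("32-bit: %#x, 64-bit: %#llx\n",
	       __pa32(vmalloc_min), (unsigned long long)limit64);
	return 0;
}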
Reported-by: Emil Renner Berthing <[email protected]>
Signed-off-by: Nicolas Pitre <[email protected]>
Tested-by: Emil Renner Berthing <[email protected]>
Signed-off-by: Russell King <[email protected]>
|
|
The message "803.ad" should be "802.3ad".
Signed-off-by: Zhu Yanjun <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Kmemleak reports following false positive memory leaks for each sk
buffers allocated by CPSW (__netdev_alloc_skb_ip_align()) in
cpsw_ndo_open() and cpsw_rx_handler():
unreferenced object 0xea915000 (size 2048):
comm "systemd-network", pid 713, jiffies 4294938323 (age 102.180s)
hex dump (first 32 bytes):
00 58 91 ea ff ff ff ff ff ff ff ff ff ff ff ff .X..............
ff ff ff ff ff ff fd 0f 00 00 00 00 00 00 00 00 ................
backtrace:
[<c0108680>] __kmalloc_track_caller+0x1a4/0x230
[<c0529eb4>] __alloc_skb+0x68/0x16c
[<c052c884>] __netdev_alloc_skb+0x40/0x104
[<bf1ad29c>] cpsw_ndo_open+0x374/0x670 [ti_cpsw]
[<c053c3d4>] __dev_open+0xb0/0x114
[<c053c690>] __dev_change_flags+0x9c/0x14c
[<c053c760>] dev_change_flags+0x20/0x50
[<c054bdcc>] do_setlink+0x2cc/0x78c
[<c054c358>] rtnl_setlink+0xcc/0x100
[<c054b34c>] rtnetlink_rcv_msg+0x184/0x224
[<c056467c>] netlink_rcv_skb+0xa8/0xc4
[<c054b1c0>] rtnetlink_rcv+0x2c/0x34
[<c0564018>] netlink_unicast+0x16c/0x1f8
[<c0564498>] netlink_sendmsg+0x334/0x348
[<c052015c>] sock_sendmsg+0x1c/0x2c
[<c05213e0>] SyS_sendto+0xc0/0xe8
unreferenced object 0xec861780 (size 192):
comm "softirq", pid 0, jiffies 4294938759 (age 109.540s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 b0 5a ed 00 00 00 00 00 00 00 00 ......Z.........
backtrace:
[<c0107830>] kmem_cache_alloc+0x190/0x208
[<c052c768>] __build_skb+0x30/0x98
[<c052c8fc>] __netdev_alloc_skb+0xb8/0x104
[<bf1abc54>] cpsw_rx_handler+0x68/0x1e4 [ti_cpsw]
[<bf11aa30>] __cpdma_chan_free+0xa8/0xc4 [davinci_cpdma]
[<bf11ab98>] __cpdma_chan_process+0x14c/0x16c [davinci_cpdma]
[<bf11abfc>] cpdma_chan_process+0x44/0x5c [davinci_cpdma]
[<bf1adc78>] cpsw_rx_poll+0x1c/0x9c [ti_cpsw]
[<c0539180>] net_rx_action+0x1f0/0x2ec
[<c003881c>] __do_softirq+0x134/0x258
[<c0038a00>] do_softirq+0x68/0x70
[<c0038adc>] __local_bh_enable_ip+0xd4/0xe8
[<c0640994>] _raw_spin_unlock_bh+0x30/0x34
[<c05f4e9c>] igmp6_group_added+0x4c/0x1bc
[<c05f6600>] ipv6_dev_mc_inc+0x398/0x434
[<c05dba74>] addrconf_dad_work+0x224/0x39c
This happens because CPSW allocates SK buffers and then passes
pointers to them to CPDMA, where they are stored in internal CPPI RAM
(SRAM), which belongs to the device MMIO space. Kmemleak does not scan IO
memory and so reports memory leaks.
Hence, mark the allocated sk buffers as false positives explicitly.
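The annotation's shape at the allocation sites named above (a sketch;
exact placement in the cpsw driver may differ):

	skb = __netdev_alloc_skb_ip_align(ndev, pkt_size);
	if (skb) {
		/* the only reference will live in CPPI SRAM, which
		 * kmemleak does not scan: tell it this is not a leak */
		kmemleak_not_leak(skb);
	}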
Cc: Catalin Marinas <[email protected]>
Signed-off-by: Grygorii Strashko <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Signed-off-by: Chunming Zhou <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
When executing the script included below, the netns delete operation
hangs with the following message (repeated at 10 second intervals):
kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1
This occurs because a reference to the lo interface in the "secure" netns
is still held by a dst entry in the xfrm bundle cache in the init netns.
Address this problem by garbage collecting the tunnel netns flow cache
when a cross-namespace vti interface receives a NETDEV_DOWN notification.
A more detailed description of the problem scenario (referencing commands
in the script below):
(1) ip link add vti_test type vti local 1.1.1.1 remote 1.1.1.2 key 1
The vti_test interface is created in the init namespace. vti_tunnel_init()
attaches a struct ip_tunnel to the vti interface's netdev_priv(dev),
setting the tunnel net to &init_net.
(2) ip link set vti_test netns secure
The vti_test interface is moved to the "secure" netns. Note that
the associated struct ip_tunnel still has tunnel->net set to &init_net.
(3) ip netns exec secure ping -c 4 -i 0.02 -I 192.168.100.1 192.168.200.1
The first packet sent using the vti device causes xfrm_lookup() to be
called as follows:
dst = xfrm_lookup(tunnel->net, skb_dst(skb), fl, NULL, 0);
Note that tunnel->net is the init namespace, while skb_dst(skb) references
the vti_test interface in the "secure" namespace. The returned dst
references an interface in the init namespace.
Also note that the first parameter to xfrm_lookup() determines which flow
cache is used to store the computed xfrm bundle, so after xfrm_lookup()
returns there will be a cached bundle in the init namespace flow cache
with a dst referencing a device in the "secure" namespace.
(4) ip netns del secure
Kernel begins to delete the "secure" namespace. At some point the
vti_test interface is deleted, at which point dst_ifdown() changes
the dst->dev in the cached xfrm bundle flow from vti_test to lo (still
in the "secure" namespace however).
Since nothing has happened to cause the init namespace's flow cache
to be garbage collected, this dst remains attached to the flow cache,
so the kernel loops waiting for the last reference to lo to go away.
<Begin script>
ip link add br1 type bridge
ip link set dev br1 up
ip addr add dev br1 1.1.1.1/8
ip netns add secure
ip link add vti_test type vti local 1.1.1.1 remote 1.1.1.2 key 1
ip link set vti_test netns secure
ip netns exec secure ip link set vti_test up
ip netns exec secure ip link s lo up
ip netns exec secure ip addr add dev lo 192.168.100.1/24
ip netns exec secure ip route add 192.168.200.0/24 dev vti_test
ip xfrm policy flush
ip xfrm state flush
ip xfrm policy add dir out tmpl src 1.1.1.1 dst 1.1.1.2 \
proto esp mode tunnel mark 1
ip xfrm policy add dir in tmpl src 1.1.1.2 dst 1.1.1.1 \
proto esp mode tunnel mark 1
ip xfrm state add src 1.1.1.1 dst 1.1.1.2 proto esp spi 1 \
mode tunnel enc des3_ede 0x112233445566778811223344556677881122334455667788
ip xfrm state add src 1.1.1.2 dst 1.1.1.1 proto esp spi 1 \
mode tunnel enc des3_ede 0x112233445566778811223344556677881122334455667788
ip netns exec secure ping -c 4 -i 0.02 -I 192.168.100.1 192.168.200.1
ip netns del secure
<End script>
Reported-by: Hangbin Liu <[email protected]>
Reported-by: Jan Tluka <[email protected]>
Signed-off-by: Lance Richardson <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Miscellaneous fixes
Here are a bunch of miscellaneous fixes to AF_RXRPC:
(*) Fix an uninitialised pointer.
(*) Fix error handling when we fail to connect a call.
(*) Fix a NULL pointer dereference.
(*) Fix two occasions where a packet is accessed again after being queued
for someone else to deal with.
(*) Fix a missing skb free.
====================
Signed-off-by: David S. Miller <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent
Pull perf/urgent fixes from Arnaldo Carvalho de Melo:
User visible fixes:
- Fix the lookup for a kernel module in 'perf probe', fixing for instance, the
erroneous return of "[raid10]" when looking for "[raid1]" (Konstantin Khlebnikov)
- Disable counters in a group before reading them in 'perf stat', to avoid skew (Mark Rutland)
- Fix adding probes to function aliases in systems using kaslr (Masami Hiramatsu)
- Trim libtraceevent trace_seq buffers, removing unnecessary memory usage that could
bring a system using tracepoint events with 'perf top' to a crawl, as the trace_seq
buffers start at a whopping 4 KB, which is very rarely used in perf's use cases,
so realloc it to the really used space as a last measure after using libtraceevent
functions to format the fields of tracepoint events (Arnaldo Carvalho de Melo)
- Fix 'perf probe' location when using DWARF on ppc64le (Ravi Bangoria)
- Allow specifying signedness casts to a 'perf probe' variable, to shorten
the number of steps to see signed values that otherwise would always appear
as hex values (Naohiro Aota)
Documentation fixes:
- Add 'bpf-output' field to 'perf script' usage message (Brendan Gregg)
Infrastructure fixes:
- Sync kernel header files: cpufeatures.h, {disabled,required}-features.h,
bpf.h and vmx.h, so that we get a clean build, without warnings about files
being different from the kernel counterparts.
A verification of the need or desirability of changes in tools/ based on what
was done in the kernel changesets was made and documented in the respective
file sync changesets (Arnaldo Carvalho de Melo)
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
This reverts commit 874f9c7da9a4acbc1b9e12ca722579fb50e4d142.
Geert Uytterhoeven reports:
"This change seems to have an (unintendent?) side-effect.
Before, pr_*() calls without a trailing newline characters would be
printed with a newline character appended, both on the console and in
the output of the dmesg command.
After this commit, no new line character is appended, and the output
of the next pr_*() call of the same type may be appended, like in:
- Truncating RAM at 0x0000000040000000-0x00000000c0000000 to -0x0000000070000000
- Ignoring RAM at 0x0000000200000000-0x0000000240000000 (!CONFIG_HIGHMEM)
+ Truncating RAM at 0x0000000040000000-0x00000000c0000000 to -0x0000000070000000Ignoring RAM at 0x0000000200000000-0x0000000240000000 (!CONFIG_HIGHMEM)"
Joe Perches says:
"No, that is not intentional.
The newline handling code inside vprintk_emit is a bit involved and
for now I suggest a revert until this has all the same behavior as
earlier"
Reported-by: Geert Uytterhoeven <[email protected]>
Requested-by: Joe Perches <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
driver
Since IRQCHIP_DECLARE now flags the GPC node as already populated, the
GPC power domain driver is never probed unless we clear the flag again.
Fixes: 15cc2ed6dcf9 ("of/irq: Mark initialised interrupt controllers as populated")
Suggested-by: Rob Herring <[email protected]>
Signed-off-by: Philipp Zabel <[email protected]>
Cc: Jon Hunter <[email protected]>
Signed-off-by: Rob Herring <[email protected]>
|
|
That way the init callback may clear the flag again, in case of drivers
split between early irq chip and a normal platform driver.
Fixes: 15cc2ed6dcf9 ("of/irq: Mark initialised interrupt controllers as populated")
Suggested-by: Rob Herring <[email protected]>
Signed-off-by: Philipp Zabel <[email protected]>
Acked-by: Jon Hunter <[email protected]>
Signed-off-by: Rob Herring <[email protected]>
|
|
@mynodes is set to NULL when __unflatten_device_tree() is called
to unflatten a device sub-tree in the PCI hot add scenario on the PowerPC
PowerNV platform. Marking @mynodes detached unconditionally causes a
kernel crash, as the backtrace below shows:
Unable to handle kernel paging request for data at address 0x00000000
Faulting instruction address: 0xc000000000b26f64
cpu 0x0: Vector: 300 (Data Access) at [c000003fcc7cf740]
pc: c000000000b26f64: __unflatten_device_tree+0xf4/0x190
lr: c000000000b26f40: __unflatten_device_tree+0xd0/0x190
sp: c000003fcc7cf9c0
msr: 900000000280b033
dar: 0
dsisr: 40000000
current = 0xc000003fcc281680
paca = 0xc00000000ff00000 softe: 0 irq_happened: 0x01
pid = 2724, comm = sh
Linux version 4.7.0-gavin-07754-g92a6836 (gwshan@gwshan) (gcc version \
4.9.3 (Buildroot 2016.02-rc2-00093-g5ea3bce) ) #539 SMP Mon Aug 1 \
12:40:29 AEST 2016
enter ? for help
[c000003fcc7cfa50] c000000000b27060 of_fdt_unflatten_tree+0x60/0x90
[c000003fcc7cfaa0] c0000000004c6288 pnv_php_set_slot_power_state+0x118/0x440
[c000003fcc7cfb80] c0000000004c6a10 pnv_php_enable+0xc0/0x170
[c000003fcc7cfbd0] c0000000004c4d80 power_write_file+0xa0/0x190
[c000003fcc7cfc50] c0000000004be93c pci_slot_attr_store+0x3c/0x60
[c000003fcc7cfc70] c0000000002d3fd4 sysfs_kf_write+0x94/0xc0
[c000003fcc7cfcb0] c0000000002d2c30 kernfs_fop_write+0x180/0x260
[c000003fcc7cfd00] c000000000230fe0 __vfs_write+0x40/0x190
[c000003fcc7cfd90] c000000000232278 vfs_write+0xc8/0x240
[c000003fcc7cfde0] c000000000233d90 SyS_write+0x60/0x110
[c000003fcc7cfe30] c000000000009524 system_call+0x38/0x108
This avoids the kernel crash by marking @mynodes detached only when
@mynodes references a valid device node in __unflatten_device_tree().
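The guard, sketched (of_node_set_flag() and OF_DETACHED are the real
interfaces; the surrounding unflatten code is elided):

	if (mynodes && *mynodes)
		of_node_set_flag(*mynodes, OF_DETACHED);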
Fixes: 1d1bde550ea3 ("of: fdt: mark unflattened tree as detached")
Reported-by: Meng Li <[email protected]>
Signed-off-by: Gavin Shan <[email protected]>
Signed-off-by: Rob Herring <[email protected]>
|
|
The of_node_put() function tests whether its argument is NULL
and then returns immediately.
Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
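The transformation, for illustration (np stands for any
struct device_node pointer):

	/* before */
	if (np)
		of_node_put(np);

	/* after: of_node_put() already tolerates a NULL argument */
	of_node_put(np);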
Signed-off-by: Markus Elfring <[email protected]>
Signed-off-by: Rob Herring <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Fix tick_stop tracepoint symbols for user export.
Luiz Capitulino noticed that the tick_stop tracepoint wasn't being
parsed properly by the tracing user space tools.
This was due to the TRACE_DEFINE_ENUM() being set to a define, when it
should have been set to the enum itself. The define was of the MASK
that used the BIT to shift. The BIT was the enum and by adding that,
everything gets converted nicely. The MASK is still kept just in case
it gets converted to an enum in the future"
* tag 'trace-v4.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix tick_stop tracepoint symbols for user export
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull gcc plugin improvements from Kees Cook:
"Several fixes/improvements for the gcc plugin infrastructure:
- fix a problem with gcc plugins interfering with cc-option tests.
- abort more gracefully when gcc plugin headers or compiler support
is missing.
- improve the gcc plugin rule generation to be more dynamic, pass
arguments, and build from subdirectories"
* tag 'gcc-plugin-infrastructure-v4.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
gcc-plugins: Add support for plugin subdirectories
gcc-plugins: Automate make rule generation
gcc-plugins: Add support for passing plugin arguments
gcc-plugins: abort builds cleanly when not supported
kbuild: no gcc-plugins during cc-option tests
|
|
git://git.infradead.org/users/dvhart/linux-platform-drivers-x86
Pull x86 platform driver update from Darren Hart:
"dell-wmi: ignore battery remove/insert event"
* tag 'platform-drivers-x86-v4.8-3' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86:
dell-wmi: Ignore WMI event 0xe00e
|
|
Pull drm fixes from Dave Airlie:
"This contains a bunch of amdgpu fixes, and some i915 regression fixes.
It also contains some fixes for an older regression with some EDID
changes and some 6bpc panels.
Then there are the lockdep, cirrus and rcar-du regression fixes from
this window"
* tag 'drm-fixes-for-4.8-rc2' of git://people.freedesktop.org/~airlied/linux:
drm/cirrus: Fix NULL pointer dereference when registering the fbdev
drm/edid: Set 8 bpc color depth for displays with "DFP 1.x compliant TMDS".
drm/i915/dp: Revert "drm/i915/dp: fall back to 18 bpp when sink capability is unknown"
drm/edid: Add 6 bpc quirk for display AEO model 0.
drm: Paper over locking inversion after registration rework
drm: rcar-du: Link HDMI encoder with bridge
drm/ttm: Wait for a BO to become idle before unbinding it from GTT
drm/i915/fbdev: Check for the framebuffer before use
drm/amdgpu: update golden setting of polaris10
drm/amdgpu: update golden setting of stoney
drm/amdgpu: update golden setting of polaris11
drm/amdgpu: update golden setting of carrizo
drm/amdgpu: update golden setting of iceland
drm/amd/amdgpu: change pptable output format from ASCII to binary
drm/amdgpu/ci: add mullins to default case for smc ucode
drm/amdgpu/gmc7: add missing mullins case
drm/i915: Never fully mask the EI up rps interrupt on SNB/IVB
drm/i915: Wait up to 3ms for the pcu to ack the cdclk change request on SKL
|
|
Commit b195d5e2bffd ("ipr: Wait to do async scan until scsi host is
initialized") fixed async scan for ipr, but broke sync scan for ipr.
This fixes sync scan back up.
Signed-off-by: Brian King <[email protected]>
Reported-and-tested-by: Michael Ellerman <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
To distinguish non-slab pages charged to kmemcg we mark them PageKmemcg,
which sets page->_mapcount to -512. Currently, we set/clear PageKmemcg
in __alloc_pages_nodemask()/free_pages_prepare() for any page allocated
with __GFP_ACCOUNT, including those that aren't actually charged to any
cgroup, i.e. allocated from the root cgroup context. To avoid overhead
in case cgroups are not used, we only do that if memcg_kmem_enabled() is
true. The latter is set iff there are kmem-enabled memory cgroups
(online or offline). The root cgroup is not considered kmem-enabled.
As a result, if a page is allocated with __GFP_ACCOUNT for the root
cgroup when there are kmem-enabled memory cgroups and is freed after all
kmem-enabled memory cgroups were removed, e.g.
# no memory cgroups has been created yet, create one
mkdir /sys/fs/cgroup/memory/test
# run something allocating pages with __GFP_ACCOUNT, e.g.
# a program using pipe
dmesg | tail
# remove the memory cgroup
rmdir /sys/fs/cgroup/memory/test
we'll get bad page state bug complaining about page->_mapcount != -1:
BUG: Bad page state in process swapper/0 pfn:1fd945c
page:ffffea007f651700 count:0 mapcount:-511 mapping: (null) index:0x0
flags: 0x1000000000000000()
To avoid that, let's mark with PageKmemcg only those pages that are
actually charged to and hence pin a non-root memory cgroup.
Fixes: 4949148ad433 ("mm: charge/uncharge kmemcg from generic page allocator paths")
Reported-and-tested-by: Eric Dumazet <[email protected]>
Signed-off-by: Vladimir Davydov <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch 19a469a58720 ("drivers/perf: arm-pmu: Handle per-interrupt
affinity mask") added support for partitioned PPI setups, but
inadvertently broke setups using SPIs without the "interrupt-affinity"
property (which is the case for UP platforms).
This patch restores the broken functionality by testing whether the
interrupt is percpu or not instead of relying on the using_spi flag
that really means "SPI *and* interrupt-affinity property".
Acked-by: Mark Rutland <[email protected]>
Reported-by: Geert Uytterhoeven <[email protected]>
Tested-by: Geert Uytterhoeven <[email protected]>
Fixes: 19a469a58720 ("drivers/perf: arm-pmu: Handle per-interrupt affinity mask")
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
|
|
arm_pmu_mutex is never held long and we don't want to sleep while the
lock is being held as it's executed in the context of hotplug notifiers.
So it can be converted to a simple spinlock instead.
Without this patch we get the following warning:
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620
in_atomic(): 1, irqs_disabled(): 128, pid: 0, name: swapper/2
no locks held by swapper/2/0.
irq event stamp: 381314
hardirqs last enabled at (381313): _raw_spin_unlock_irqrestore+0x7c/0x88
hardirqs last disabled at (381314): cpu_die+0x28/0x48
softirqs last enabled at (381294): _local_bh_enable+0x28/0x50
softirqs last disabled at (381293): irq_enter+0x58/0x78
CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.7.0 #12
Call trace:
dump_backtrace+0x0/0x220
show_stack+0x24/0x30
dump_stack+0xb4/0xf0
___might_sleep+0x1d8/0x1f0
__might_sleep+0x5c/0x98
mutex_lock_nested+0x54/0x400
arm_perf_starting_cpu+0x34/0xb0
cpuhp_invoke_callback+0x88/0x3d8
notify_cpu_starting+0x78/0x98
secondary_start_kernel+0x108/0x1a8
This patch converts the mutex to a spinlock to eliminate the above
warnings. This constrains pmu->reset to be a non-blocking call, which is
the case with all the ARM PMU backends.
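The conversion's shape (lock name from the changelog; the list walked in
the hotplug callback is elided):

static DEFINE_SPINLOCK(arm_pmu_lock);	/* was: static DEFINE_MUTEX(arm_pmu_mutex); */

static int arm_perf_starting_cpu(unsigned int cpu)
{
	spin_lock(&arm_pmu_lock);	/* was: mutex_lock(&arm_pmu_mutex); */
	/* ... walk registered PMUs, call the non-blocking pmu->reset ... */
	spin_unlock(&arm_pmu_lock);
	return 0;
}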
Cc: Stephen Boyd <[email protected]>
Fixes: 37b502f121ad ("arm/perf: Fix hotplug state machine conversion")
Acked-by: Mark Rutland <[email protected]>
Signed-off-by: Sudeep Holla <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
|
|
Under certain conditions, the data_ready handler will discard a packet.
These need to be freed.
Signed-off-by: David Howells <[email protected]>
|
|
Fix a use of a packet after it has been enqueued onto the packet processing
queue in the data_ready handler. Once on a call's Rx queue, we mustn't
touch it any more as it may be dequeued and freed by the call processor
running on a work queue.
Save the values we need before enqueuing.
Without this, we can get an oops like the following:
BUG: unable to handle kernel NULL pointer dereference at 000000000000009c
IP: [<ffffffffa01854e8>] rxrpc_fast_process_packet+0x724/0xa11 [af_rxrpc]
PGD 0
Oops: 0000 [#1] SMP
Modules linked in: kafs(E) af_rxrpc(E) [last unloaded: af_rxrpc]
CPU: 2 PID: 0 Comm: swapper/2 Tainted: G E 4.7.0-fsdevel+ #1336
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
task: ffff88040d6863c0 task.stack: ffff88040d68c000
RIP: 0010:[<ffffffffa01854e8>] [<ffffffffa01854e8>] rxrpc_fast_process_packet+0x724/0xa11 [af_rxrpc]
RSP: 0018:ffff88041fb03a78 EFLAGS: 00010246
RAX: ffffffffffffffff RBX: ffff8803ff195b00 RCX: 0000000000000001
RDX: ffffffffa01854d1 RSI: 0000000000000008 RDI: ffff8803ff195b00
RBP: ffff88041fb03ab0 R08: 0000000000000000 R09: 0000000000000001
R10: ffff88041fb038c8 R11: 0000000000000000 R12: ffff880406874800
R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88041fb00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000000009c CR3: 0000000001c14000 CR4: 00000000001406e0
Stack:
ffff8803ff195ea0 ffff880408348800 ffff880406874800 ffff8803ff195b00
ffff880408348800 ffff8803ff195ed8 0000000000000000 ffff88041fb03af0
ffffffffa0186072 0000000000000000 ffff8804054da000 0000000000000000
Call Trace:
<IRQ>
[<ffffffffa0186072>] rxrpc_data_ready+0x89d/0xbae [af_rxrpc]
[<ffffffff814c94d7>] __sock_queue_rcv_skb+0x24c/0x2b2
[<ffffffff8155c59a>] __udp_queue_rcv_skb+0x4b/0x1bd
[<ffffffff8155e048>] udp_queue_rcv_skb+0x281/0x4db
[<ffffffff8155ea8f>] __udp4_lib_rcv+0x7ed/0x963
[<ffffffff8155ef9a>] udp_rcv+0x15/0x17
[<ffffffff81531d86>] ip_local_deliver_finish+0x1c3/0x318
[<ffffffff81532544>] ip_local_deliver+0xbb/0xc4
[<ffffffff81531bc3>] ? inet_del_offload+0x40/0x40
[<ffffffff815322a9>] ip_rcv_finish+0x3ce/0x42c
[<ffffffff81532851>] ip_rcv+0x304/0x33d
[<ffffffff81531edb>] ? ip_local_deliver_finish+0x318/0x318
[<ffffffff814dff9d>] __netif_receive_skb_core+0x601/0x6e8
[<ffffffff814e072e>] __netif_receive_skb+0x13/0x54
[<ffffffff814e082a>] netif_receive_skb_internal+0xbb/0x17c
[<ffffffff814e1838>] napi_gro_receive+0xf9/0x1bd
[<ffffffff8144eb9f>] rtl8169_poll+0x32b/0x4a8
[<ffffffff814e1c7b>] net_rx_action+0xe8/0x357
[<ffffffff81051074>] __do_softirq+0x1aa/0x414
[<ffffffff810514ab>] irq_exit+0x3d/0xb0
[<ffffffff810184a2>] do_IRQ+0xe4/0xfc
[<ffffffff81612053>] common_interrupt+0x93/0x93
<EOI>
[<ffffffff814af837>] ? cpuidle_enter_state+0x1ad/0x2be
[<ffffffff814af832>] ? cpuidle_enter_state+0x1a8/0x2be
[<ffffffff814af96a>] cpuidle_enter+0x12/0x14
[<ffffffff8108956f>] call_cpuidle+0x39/0x3b
[<ffffffff81089855>] cpu_startup_entry+0x230/0x35d
[<ffffffff810312ea>] start_secondary+0xf4/0xf7
Signed-off-by: David Howells <[email protected]>
|
|
Once a packet has been posted to a connection in the data_ready handler, we
mustn't try reposting if we then find that the connection is dying as the
refcount has been given over to the dying connection and the packet might
no longer exist.
Losing the packet isn't a problem as the peer will retransmit.
Signed-off-by: David Howells <[email protected]>
|