Age | Commit message (Collapse) | Author | Files | Lines |
|
The tgid_map array records a mapping from pid to tgid, where the index
of an entry within the array is the pid & the value stored at that index
is the tgid.
The saved_tgids_next() function iterates over pointers into the tgid_map
array & dereferences the pointers which results in the tgid, but then it
passes that dereferenced value to trace_find_tgid() which treats it as a
pid & does a further lookup within the tgid_map array. It seems likely
that the intent here was to skip over entries in tgid_map for which the
recorded tgid is zero, but instead we end up skipping over entries for
which the thread group leader hasn't yet had its own tgid recorded in
tgid_map.
A minimal fix would be to remove the call to trace_find_tgid, turning:
if (trace_find_tgid(*ptr))
into:
if (*ptr)
..but it seems like this logic can be much simpler if we simply let
seq_read() iterate over the whole tgid_map array & filter out empty
entries by returning SEQ_SKIP from saved_tgids_show(). Here we take that
approach, removing the incorrect logic here entirely.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: d914ba37d714 ("tracing: Add support for recording tgid of tasks")
Cc: Ingo Molnar <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: <[email protected]>
Signed-off-by: Paul Burton <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
The wakeup_rt wakeup_dl, tracing_dl is only set to 0, 1.
So changing type of wakeup_rt wakeup_dl, tracing_dl as bool
makes relevant routine be more readable.
Link: https://lkml.kernel.org/r/20210629140548.GA1627@raspberrypi
Signed-off-by: Austin Kim <[email protected]>
[ Removed unneeded initialization of static bool tracing_dl ]
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
Pick up a fix for a warning that several people reported.
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Merge misc updates from Andrew Morton:
"191 patches.
Subsystems affected by this patch series: kthread, ia64, scripts,
ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab,
slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap,
mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization,
pagealloc, and memory-failure)"
* emailed patches from Andrew Morton <[email protected]>: (191 commits)
mm,hwpoison: make get_hwpoison_page() call get_any_page()
mm,hwpoison: send SIGBUS with error virutal address
mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes
mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
docs: remove description of DISCONTIGMEM
arch, mm: remove stale mentions of DISCONIGMEM
mm: remove CONFIG_DISCONTIGMEM
m68k: remove support for DISCONTIGMEM
arc: remove support for DISCONTIGMEM
arc: update comment about HIGHMEM implementation
alpha: remove DISCONTIGMEM and NUMA
mm/page_alloc: move free_the_page
mm/page_alloc: fix counting of managed_pages
mm/page_alloc: improve memmap_pages dbg msg
mm: drop SECTION_SHIFT in code comments
mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
mm/page_alloc: scale the number of pages that are batch freed
...
|
|
Trivial conflict in net/netfilter/nf_tables_api.c.
Duplicate fix in tools/testing/selftests/net/devlink_port_split.py
- take the net-next version.
skmsg, and L4 bpf - keep the bpf code but remove the flags
and err params.
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Dan Carpenter reported that:
The patch a955d7eac177: "trace: Add timerlat tracer" from Jun 22,
2021, leads to the following static checker warning:
kernel/trace/trace_osnoise.c:1400 timerlat_main()
warn: inconsistent indenting
here:
1389 while (!kthread_should_stop()) {
1390 now = ktime_to_ns(hrtimer_cb_get_time(&tlat->timer));
1391 diff = now - tlat->abs_period;
1392
1393 s.seqnum = tlat->count;
1394 s.timer_latency = diff;
1395 s.context = THREAD_CONTEXT;
1396
1397 trace_timerlat_sample(&s);
1398
1399 #ifdef CONFIG_STACKTRACE
1400 if (osnoise_data.print_stack)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This should be indented another tab?
1401 if (osnoise_data.print_stack <= time_to_us(diff))
1402 timerlat_dump_stack();
1403 #endif /* CONFIG_STACKTRACE */
1404
1405 tlat->tracing_thread = false;
1406 if (osnoise_data.stop_tracing_total)
1407 if (time_to_us(diff) >= osnoise_data.stop_tracing_total)
1408 osnoise_stop_tracing();
1409
1410 wait_next_period(tlat);
1411 }
And the static checker is right. Fix the indentation.
Link: https://lkml.kernel.org/r/3d5d8c9258fbdcfa9d3c7362941b3d13a2a28d9d.1624986368.git.bristot@redhat.com
Cc: Ingo Molnar <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: [email protected]
Fixes: a955d7eac177 ("trace: Add timerlat tracer")
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
Dab Carpenter reported that:
The patch bce29ac9ce0b: "trace: Add osnoise tracer" from Jun 22,
2021, leads to the following static checker warning:
kernel/trace/trace_osnoise.c:1103 run_osnoise()
warn: unsigned 'noise' is never less than zero.
In this part of the code:
1100 /*
1101 * This shouldn't happen.
1102 */
1103 if (noise < 0) {
^^^^^^^^^
1104 osnoise_taint("negative noise!");
1105 goto out;
1106 }
1107
And the static checker is right because 'noise' is u64.
Make noise s64 and keep the check. It is important to check if
the time read is behaving correctly - so we can trust the results.
I also re-arranged some variable declarations.
Link: https://lkml.kernel.org/r/acd7cd6e7d56b798a298c3bc8139a390b3c4ab52.1624986368.git.bristot@redhat.com
Cc: Ingo Molnar <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: [email protected]
Fixes: bce29ac9ce0b ("trace: Add osnoise tracer")
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These add hybrid processors support to the intel_pstate driver and
make it work with more processor models when HWP is disabled, make the
intel_idle driver use special C6 idle state paremeters when package
C-states are disabled, add cooling support to the tegra30 devfreq
driver, rework the TEO (timer events oriented) cpuidle governor,
extend the OPP (operating performance points) framework to use the
required-opps DT property in more cases, fix some issues and clean up
a number of assorted pieces of code.
Specifics:
- Make intel_pstate support hybrid processors using abstract
performance units in the HWP interface (Rafael Wysocki).
- Add Icelake servers and Cometlake support in no-HWP mode to
intel_pstate (Giovanni Gherdovich).
- Make cpufreq_online() error path be consistent with the CPU device
removal path in cpufreq (Rafael Wysocki).
- Clean up 3 cpufreq drivers and the statistics code (Hailong Liu,
Randy Dunlap, Shaokun Zhang).
- Make intel_idle use special idle state parameters for C6 when
package C-states are disabled (Chen Yu).
- Rework the TEO (timer events oriented) cpuidle governor to address
some theoretical shortcomings in it (Rafael Wysocki).
- Drop unneeded semicolon from the TEO governor (Wan Jiabing).
- Modify the runtime PM framework to accept unassigned suspend and
resume callback pointers (Ulf Hansson).
- Improve pm_runtime_get_sync() documentation (Krzysztof Kozlowski).
- Improve device performance states support in the generic power
domains (genpd) framework (Ulf Hansson).
- Fix some documentation issues in genpd (Yang Yingliang).
- Make the operating performance points (OPP) framework use the
required-opps DT property in use cases that are not related to
genpd (Hsin-Yi Wang).
- Make lazy_link_required_opp_table() use list_del_init instead of
list_del/INIT_LIST_HEAD (Yang Yingliang).
- Simplify wake IRQs handling in the core system-wide sleep support
code and clean up some coding style inconsistencies in it (Tian
Tao, Zhen Lei).
- Add cooling support to the tegra30 devfreq driver and improve its
DT bindings (Dmitry Osipenko).
- Fix some assorted issues in the devfreq core and drivers (Chanwoo
Choi, Dong Aisheng, YueHaibing)"
* tag 'pm-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (39 commits)
PM / devfreq: passive: Fix get_target_freq when not using required-opp
cpufreq: Make cpufreq_online() call driver->offline() on errors
opp: Allow required-opps to be used for non genpd use cases
cpuidle: teo: remove unneeded semicolon in teo_select()
dt-bindings: devfreq: tegra30-actmon: Add cooling-cells
dt-bindings: devfreq: tegra30-actmon: Convert to schema
PM / devfreq: userspace: Use DEVICE_ATTR_RW macro
PM: runtime: Clarify documentation when callbacks are unassigned
PM: runtime: Allow unassigned ->runtime_suspend|resume callbacks
PM: runtime: Improve path in rpm_idle() when no callback
PM: hibernate: remove leading spaces before tabs
PM: sleep: remove trailing spaces and tabs
PM: domains: Drop/restore performance state votes for devices at runtime PM
PM: domains: Return early if perf state is already set for the device
PM: domains: Split code in dev_pm_genpd_set_performance_state()
cpuidle: teo: Use kerneldoc documentation in admin-guide
cpuidle: teo: Rework most recent idle duration values treatment
cpuidle: teo: Change the main idle state selection logic
cpuidle: teo: Cosmetic modification of teo_select()
cpuidle: teo: Cosmetic modifications of teo_update()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"Time and clocksource/clockevent related updates:
Core changes:
- Infrastructure to support per CPU "broadcast" devices for per CPU
clockevent devices which stop in deep idle states. This allows us
to utilize the more efficient architected timer on certain ARM SoCs
for normal operation instead of permanentely using the slow to
access SoC specific clockevent device.
- Print the name of the broadcast/wakeup device in /proc/timer_list
- Make the clocksource watchdog more robust against delays between
reading the current active clocksource and the watchdog
clocksource. Such delays can be caused by NMIs, SMIs and vCPU
preemption.
Handle this by reading the watchdog clocksource twice, i.e. before
and after reading the current active clocksource. In case that the
two watchdog reads shows an excessive time delta, the read sequence
is repeated up to 3 times.
- Improve the debug output and add a test module for the watchdog
mechanism.
- Reimplementation of the venerable time64_to_tm() function with a
faster and significantly smaller version. Straight from the source,
i.e. the author of the related research paper contributed this!
Driver changes:
- No new drivers, not even new device tree bindings!
- Fixes, improvements and cleanups and all over the place"
* tag 'timers-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
time/kunit: Add missing MODULE_LICENSE()
time: Improve performance of time64_to_tm()
clockevents: Use list_move() instead of list_del()/list_add()
clocksource: Print deviation in nanoseconds when a clocksource becomes unstable
clocksource: Provide kernel module to test clocksource watchdog
clocksource: Reduce clocksource-skew threshold
clocksource: Limit number of CPUs checked for clock synchronization
clocksource: Check per-CPU clock synchronization when marked unstable
clocksource: Retry clock read if long delays detected
clockevents: Add missing parameter documentation
clocksource/drivers/timer-ti-dm: Drop unnecessary restore
clocksource/arm_arch_timer: Improve Allwinner A64 timer workaround
clocksource/drivers/arm_global_timer: Remove duplicated argument in arm_global_timer
clocksource/drivers/arm_global_timer: Make symbol 'gt_clk_rate_change_nb' static
arm: zynq: don't disable CONFIG_ARM_GLOBAL_TIMER due to CONFIG_CPU_FREQ anymore
clocksource/drivers/arm_global_timer: Implement rate compensation whenever source clock changes
clocksource/drivers/ingenic: Rename unreasonable array names
clocksource/drivers/timer-ti-dm: Save and restore timer TIOCP_CFG
clocksource/drivers/mediatek: Ack and disable interrupts on suspend
clocksource/drivers/samsung_pwm: Constify source IO memory
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"Updates for the interrupt subsystem:
Core changes:
- Cleanup and simplification of common code to invoke the low level
interrupt flow handlers when this invocation requires irqdomain
resolution. Add the necessary core infrastructure.
- Provide a proper interface for modular PMU drivers to set the
interrupt affinity.
- Add a request flag which allows to exclude interrupts from spurious
interrupt detection. Useful especially for IPI handlers which
always return IRQ_HANDLED which turns the spurious interrupt
detection into a pointless waste of CPU cycles.
Driver changes:
- Bulk convert interrupt chip drivers to the new irqdomain low level
flow handler invocation mechanism.
- Add device tree bindings for the Renesas R-Car M3-W+ SoC
- Enable modular build of the Qualcomm PDC driver
- The usual small fixes and improvements"
* tag 'irq-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
dt-bindings: interrupt-controller: arm,gic-v3: Describe GICv3 optional properties
irqchip: gic-pm: Remove redundant error log of clock bulk
irqchip/sun4i: Remove unnecessary oom message
irqchip/irq-imx-gpcv2: Remove unnecessary oom message
irqchip/imgpdc: Remove unnecessary oom message
irqchip/gic-v3-its: Remove unnecessary oom message
irqchip/gic-v2m: Remove unnecessary oom message
irqchip/exynos-combiner: Remove unnecessary oom message
irqchip: Bulk conversion to generic_handle_domain_irq()
genirq: Move non-irqdomain handle_domain_irq() handling into ARM's handle_IRQ()
genirq: Add generic_handle_domain_irq() helper
irqchip/nvic: Convert from handle_IRQ() to handle_domain_irq()
irqdesc: Fix __handle_domain_irq() comment
genirq: Use irq_resolve_mapping() to implement __handle_domain_irq() and co
irqdomain: Introduce irq_resolve_mapping()
irqdomain: Protect the linear revmap with RCU
irqdomain: Cache irq_data instead of a virq number in the revmap
irqdomain: Use struct_size() helper when allocating irqdomain
irqdomain: Make normal and nomap irqdomains exclusive
powerpc: Move the use of irq_domain_add_nomap() behind a config option
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug fix from Thomas Gleixner:
"A fix for the CPU hotplug and cpusets interaction:
cpusets delegate the hotplug work to a workqueue to prevent a lock
order inversion vs. the CPU hotplug lock. The work is not flushed
before the hotplug operation returns which creates user visible
inconsistent state. Prevent this by flushing the work after dropping
CPU hotplug lock and before releasing the outer mutex which serializes
the CPU hotplug related sysfs interface operations"
* tag 'smp-urgent-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Cure the cpusets trainwreck
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug cleanup from Thomas Gleixner:
"A simple cleanup for the CPU hotplug code to avoid per_cpu_ptr()
reevaluation"
* tag 'smp-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Simplify access to percpu cpuhp_state
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk updates from Petr Mladek:
- Add %pt[RT]s modifier to vsprintf(). It overrides ISO 8601 separator
by using ' ' (space). It produces "YYYY-mm-dd HH:MM:SS" instead of
"YYYY-mm-ddTHH:MM:SS".
- Correctly parse long row of numbers by sscanf() when using the field
width. Add extensive sscanf() selftest.
- Generalize re-entrant CPU lock that has already been used to
serialize dump_stack() output. It is part of the ongoing printk
rework. It will allow to remove the obsoleted printk_safe buffers and
introduce atomic consoles.
- Some code clean up and sparse warning fixes.
* tag 'printk-for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
printk: fix cpu lock ordering
lib/dump_stack: move cpu lock to printk.c
printk: Remove trailing semicolon in macros
random32: Fix implicit truncation warning in prandom_seed_state()
lib: test_scanf: Remove pointless use of type_min() with unsigned types
selftests: lib: Add wrapper script for test_scanf
lib: test_scanf: Add tests for sscanf number conversion
lib: vsprintf: Fix handling of number field widths in vsscanf
lib: vsprintf: scanf: Negative number must have field width > 1
usb: host: xhci-tegra: Switch to use %ptTs
nilfs2: Switch to use %ptTs
kdb: Switch to use %ptTs
lib/vsprintf: Allow to override ISO 8601 date and time separator
|
|
After removal of the DISCONTIGMEM memory model the FLAT_NODE_MEM_MAP
configuration option is equivalent to FLATMEM.
Drop CONFIG_FLAT_NODE_MEM_MAP and use CONFIG_FLATMEM instead.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Vineet Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
After removal of DISCINTIGMEM the NEED_MULTIPLE_NODES and NUMA
configuration options are equivalent.
Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead.
Done with
$ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \
$(git grep -wl CONFIG_NEED_MULTIPLE_NODES)
$ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \
$(git grep -wl NEED_MULTIPLE_NODES)
with manual tweaks afterwards.
[[email protected]: fix arm boot crash]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mike Rapoport <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Vineet Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This introduces a new sysctl vm.percpu_pagelist_high_fraction. It is
similar to the old vm.percpu_pagelist_fraction. The old sysctl increased
both pcp->batch and pcp->high with the higher pcp->high potentially
reducing zone->lock contention. However, the higher pcp->batch value also
potentially increased allocation latency while the PCP was refilled. This
sysctl only adjusts pcp->high so that zone->lock contention is potentially
reduced but allocation latency during a PCP refill remains the same.
# grep -E "high:|batch" /proc/zoneinfo | tail -2
high: 649
batch: 63
# sysctl vm.percpu_pagelist_high_fraction=8
# grep -E "high:|batch" /proc/zoneinfo | tail -2
high: 35071
batch: 63
# sysctl vm.percpu_pagelist_high_fraction=64
high: 4383
batch: 63
# sysctl vm.percpu_pagelist_high_fraction=0
high: 649
batch: 63
[[email protected]: fix documentation]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "Calculate pcp->high based on zone sizes and active CPUs", v2.
The per-cpu page allocator (PCP) is meant to reduce contention on the zone
lock but the sizing of batch and high is archaic and neither takes the
zone size into account or the number of CPUs local to a zone. With larger
zones and more CPUs per node, the contention is getting worse.
Furthermore, the fact that vm.percpu_pagelist_fraction adjusts both batch
and high values means that the sysctl can reduce zone lock contention but
also increase allocation latencies.
This series disassociates pcp->high from pcp->batch and then scales
pcp->high based on the size of the local zone with limited impact to
reclaim and accounting for active CPUs but leaves pcp->batch static. It
also adapts the number of pages that can be on the pcp list based on
recent freeing patterns.
The motivation is partially to adjust to larger memory sizes but is also
driven by the fact that large batches of page freeing via release_pages()
often shows zone contention as a major part of the problem. Another is a
bug report based on an older kernel where a multi-terabyte process can
takes several minutes to exit. A workaround was to use
vm.percpu_pagelist_fraction to increase the pcp->high value but testing
indicated that a production workload could not use the same values because
of an increase in allocation latencies. Unfortunately, I cannot reproduce
this test case myself as the multi-terabyte machines are in active use but
it should alleviate the problem.
The series aims to address both and partially acts as a pre-requisite.
pcp only works with order-0 which is useless for SLUB (when using high
orders) and THP (unconditionally). To store high-order pages on PCP, the
pcp->high values need to be increased first.
This patch (of 6):
The vm.percpu_pagelist_fraction is used to increase the batch and high
limits for the per-cpu page allocator (PCP). The intent behind the sysctl
is to reduce zone lock acquisition when allocating/freeing pages but it
has a problem. While it can decrease contention, it can also increase
latency on the allocation side due to unreasonably large batch sizes.
This leads to games where an administrator adjusts
percpu_pagelist_fraction on the fly to work around contention and
allocation latency problems.
This series aims to alleviate the problems with zone lock contention while
avoiding the allocation-side latency problems. For the purposes of
review, it's easier to remove this sysctl now and reintroduce a similar
sysctl later in the series that deals only with pcp->high.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Hillf Danton <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Use vma_lookup() to find the VMA at a specific address. As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Reviewed-by: Laurent Dufour <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Davidlohr Bueso <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "perf/binfmt/mm: remove in-tree usage of MAP_EXECUTABLE".
Stumbling over the history of MAP_EXECUTABLE, I noticed that we still have
some in-tree users that we can get rid of.
This patch (of 3):
Before commit e9714acf8c43 ("mm: kill vma flag VM_EXECUTABLE and
mm->num_exe_file_vmas"), VM_EXECUTABLE indicated MAP_EXECUTABLE.
MAP_EXECUTABLE is nowadays essentially ignored by the kernel and does not
relate to VM_MAYEXEC.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: f972eb63b100 ("perf: Pass protection and flags bits through mmap2 interface")
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Acked-by: "Eric W. Biederman" <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Greg Ungerer <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Kevin Brodsky <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Don Zickus <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The current code only associates with the existing blkcg when aio is used
to access the backing file. This patch covers all types of i/o to the
backing file and also associates the memcg so if the backing file is on
tmpfs, memory is charged appropriately.
This patch also exports cgroup_get_e_css and int_active_memcg so it can be
used by the loop module.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Dan Schatzberg <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Jens Axboe <[email protected]>
Cc: Chris Down <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Ming Lei <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
has_pinned 32bit can be packed in the MMF_HAS_PINNED bit as a noop
cleanup.
Any atomic_inc/dec to the mm cacheline shared by all threads in pin-fast
would reintroduce a loss of SMP scalability to pin-fast, so there's no
future potential usefulness to keep an atomic in the mm for this.
set_bit(MMF_HAS_PINNED) will be theoretically a bit slower than WRITE_ONCE
(atomic_set is equivalent to WRITE_ONCE), but the set_bit (just like
atomic_set after this commit) has to be still issued only once per "mm",
so the difference between the two will be lost in the noise.
will-it-scale "mmap2" shows no change in performance with enterprise
config as expected.
will-it-scale "pin_fast" retains the > 4000% SMP scalability performance
improvement against upstream as expected.
This is a noop as far as overall performance and SMP scalability are
concerned.
[[email protected]: pack has_pinned in MMF_HAS_PINNED]
Link: https://lkml.kernel.org/r/YJqWESqyxa8OZA+2@t490s
[[email protected]: coding style fixes]
[[email protected]: fix build for task_mmu.c, introduce mm_set_has_pinned_flag, fix comments]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Kirill Shutemov <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The watchdog thread has been replaced by cpu_stop_work, modify the
explanation related.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Wang Qing <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Stephen Kitt <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: "Guilherme G. Piccoli" <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Santosh Sivaraj <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
kthread_cancel_delayed_work_sync()
kthread_mod_delayed_work() might race with
kthread_cancel_delayed_work_sync() or another kthread_mod_delayed_work()
call. The function lets the other operation win when it sees
work->canceling counter set. And it returns @false.
But it should return @true as it is done by the related workqueue API, see
mod_delayed_work_on().
The reason is that the return value might be used for reference counting.
It has to distinguish the case when the number of queued works has changed
or stayed the same.
The change is safe. kthread_mod_delayed_work() return value is not
checked anywhere at the moment.
Link: https://lore.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Petr Mladek <[email protected]>
Reported-by: Oleg Nesterov <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: <[email protected]>
Cc: Martin Liu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
All internal use cases for tracepoint_probe_register() is set to not ever
be called with the same function and data. If it is, it is considered a
bug, as that means the accounting of handling tracepoints is corrupted.
If the function and data for a tracepoint is already registered when
tracepoint_probe_register() is called, it will call WARN_ON_ONCE() and
return with EEXISTS.
The BPF system call can end up calling tracepoint_probe_register() with
the same data, which now means that this can trigger the warning because
of a user space process. As WARN_ON_ONCE() should not be called because
user space called a system call with bad data, there needs to be a way to
register a tracepoint without triggering a warning.
Enter tracepoint_probe_register_may_exist(), which can be called, but will
not cause a WARN_ON() if the probe already exists. It will still error out
with EEXIST, which will then be sent to the user space that performed the
BPF system call.
This keeps the previous testing for issues with other users of the
tracepoint code, while letting BPF call it with duplicated data and not
warn about it.
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://syzkaller.appspot.com/bug?id=41f4318cf01762389f4d1c1c459da4f542fe5153
Cc: [email protected]
Fixes: c4f6699dfcb85 ("bpf: introduce BPF_RAW_TRACEPOINT")
Reported-by: syzbot <[email protected]>
Reported-by: Tetsuo Handa <[email protected]>
Tested-by: [email protected]
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
* pm-core:
PM: runtime: Clarify documentation when callbacks are unassigned
PM: runtime: Allow unassigned ->runtime_suspend|resume callbacks
PM: runtime: Improve path in rpm_idle() when no callback
PM: runtime: document common mistake with pm_runtime_get_sync()
* pm-sleep:
PM: hibernate: remove leading spaces before tabs
PM: sleep: remove trailing spaces and tabs
PM: hibernate: fix spelling mistakes
PM: wakeirq: Set IRQF_NO_AUTOEN when requesting the IRQ
|
|
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace rlimit handling update from Eric Biederman:
"This is the work mainly by Alexey Gladkov to limit rlimits to the
rlimits of the user that created a user namespace, and to allow users
to have stricter limits on the resources created within a user
namespace."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
cred: add missing return error code when set_cred_ucounts() failed
ucounts: Silence warning in dec_rlimit_ucounts
ucounts: Set ucount_max to the largest positive value the type can hold
kselftests: Add test to check for rlimit changes in different user namespaces
Reimplement RLIMIT_MEMLOCK on top of ucounts
Reimplement RLIMIT_SIGPENDING on top of ucounts
Reimplement RLIMIT_MSGQUEUE on top of ucounts
Reimplement RLIMIT_NPROC on top of ucounts
Use atomic_t for ucounts reference counting
Add a reference to ucounts for each cred
Increase size of ucounts to atomic_long_t
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull seccomp updates from Kees Cook:
- Add "atomic addfd + send reply" mode to SECCOMP_USER_NOTIF to better
handle EINTR races visible to seccomp monitors. (Rodrigo Campos,
Sargun Dhillon)
- Improve seccomp selftests for readability in CI systems. (Kees Cook)
* tag 'seccomp-v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
selftests/seccomp: Avoid using "sysctl" for report
selftests/seccomp: Flush benchmark output
selftests/seccomp: More closely track fds being assigned
selftests/seccomp: Add test for atomic addfd+send
seccomp: Support atomic "addfd + send reply"
|
|
Certain uses of "do once" functionality reside outside of fast path,
and so do not require jump label patching via static keys, making
existing DO_ONCE undesirable in such cases.
Replace uses of __section(".data.once") with DO_ONCE_LITE(_IF)?
This patch changes the return values of xfs_printk_once, printk_once,
and printk_deferred_once. Before, they returned whether the print was
performed, but now, they always return true. This is okay because the
return values of the following macros are entirely ignored throughout
the kernel:
- xfs_printk_once
- xfs_warn_once
- xfs_notice_once
- xfs_info_once
- printk_once
- pr_emerg_once
- pr_alert_once
- pr_crit_once
- pr_err_once
- pr_warn_once
- pr_notice_once
- pr_info_once
- pr_devel_once
- pr_debug_once
- printk_deferred_once
- orc_warn
Changes
v3:
- Expand commit message to explain why changing return values of
xfs_printk_once, printk_once, printk_deferred_once is benign
v2:
- Fix i386 build warnings
Signed-off-by: Tanner Love <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
Acked-by: Mahesh Bandewar <[email protected]>
Acked-by: Steven Rostedt (VMware) <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Daniel Borkmann says:
====================
pull-request: bpf-next 2021-06-28
The following pull-request contains BPF updates for your *net-next* tree.
We've added 37 non-merge commits during the last 12 day(s) which contain
a total of 56 files changed, 394 insertions(+), 380 deletions(-).
The main changes are:
1) XDP driver RCU cleanups, from Toke Høiland-Jørgensen and Paul E. McKenney.
2) Fix bpf_skb_change_proto() IPv4/v6 GSO handling, from Maciej Żenczykowski.
3) Fix false positive kmemleak report for BPF ringbuf alloc, from Rustam Kovhaev.
4) Fix x86 JIT's extable offset calculation for PROBE_LDX NULL, from Ravi Bangoria.
5) Enable libbpf fallback probing with tracing under RHEL7, from Jonathan Edwards.
6) Clean up x86 JIT to remove unused cnt tracking from EMIT macro, from Jiri Olsa.
7) Netlink cleanups for libbpf to please Coverity, from Kumar Kartikeya Dwivedi.
8) Allow to retrieve ancestor cgroup id in tracing programs, from Namhyung Kim.
9) Fix lirc BPF program query to use user-provided prog_cnt, from Sean Young.
10) Add initial libbpf doc including generated kdoc for its API, from Grant Seltzer.
11) Make xdp_rxq_info_unreg_mem_model() more robust, from Jakub Kicinski.
12) Fix up bpfilter startup log-level to info level, from Gary Lin.
====================
Signed-off-by: David S. Miller <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"There's a reasonable amount here and the juicy details are all below.
It's worth noting that the MTE/KASAN changes strayed outside of our
usual directories due to core mm changes and some associated changes
to some other architectures; Andrew asked for us to carry these [1]
rather that take them via the -mm tree.
Summary:
- Optimise SVE switching for CPUs with 128-bit implementations.
- Fix output format from SVE selftest.
- Add support for versions v1.2 and 1.3 of the SMC calling
convention.
- Allow Pointer Authentication to be configured independently for
kernel and userspace.
- PMU driver cleanups for managing IRQ affinity and exposing event
attributes via sysfs.
- KASAN optimisations for both hardware tagging (MTE) and out-of-line
software tagging implementations.
- Relax frame record alignment requirements to facilitate 8-byte
alignment with KASAN and Clang.
- Cleanup of page-table definitions and removal of unused memory
types.
- Reduction of ARCH_DMA_MINALIGN back to 64 bytes.
- Refactoring of our instruction decoding routines and addition of
some missing encodings.
- Move entry code moved into C and hardened against harmful compiler
instrumentation.
- Update booting requirements for the FEAT_HCX feature, added to v8.7
of the architecture.
- Fix resume from idle when pNMI is being used.
- Additional CPU sanity checks for MTE and preparatory changes for
systems where not all of the CPUs support 32-bit EL0.
- Update our kernel string routines to the latest Cortex Strings
implementation.
- Big cleanup of our cache maintenance routines, which were
confusingly named and inconsistent in their implementations.
- Tweak linker flags so that GDB can understand vmlinux when using
RELR relocations.
- Boot path cleanups to enable early initialisation of per-cpu
operations needed by KCSAN.
- Non-critical fixes and miscellaneous cleanup"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (150 commits)
arm64: tlb: fix the TTL value of tlb_get_level
arm64: Restrict undef hook for cpufeature registers
arm64/mm: Rename ARM64_SWAPPER_USES_SECTION_MAPS
arm64: insn: avoid circular include dependency
arm64: smp: Bump debugging information print down to KERN_DEBUG
drivers/perf: fix the missed ida_simple_remove() in ddr_perf_probe()
perf/arm-cmn: Fix invalid pointer when access dtc object sharing the same IRQ number
arm64: suspend: Use cpuidle context helpers in cpu_suspend()
PSCI: Use cpuidle context helpers in psci_cpu_suspend_enter()
arm64: Convert cpu_do_idle() to using cpuidle context helpers
arm64: Add cpuidle context save/restore helpers
arm64: head: fix code comments in set_cpu_boot_mode_flag
arm64: mm: drop unused __pa(__idmap_text_start)
arm64: mm: fix the count comments in compute_indices
arm64/mm: Fix ttbr0 values stored in struct thread_info for software-pan
arm64: mm: Pass original fault address to handle_mm_fault()
arm64/mm: Drop SECTION_[SHIFT|SIZE|MASK]
arm64/mm: Use CONT_PMD_SHIFT for ARM64_MEMSTART_SHIFT
arm64/mm: Drop SWAPPER_INIT_MAP_SIZE
arm64: Conditionally configure PTR_AUTH key of the kernel.
...
|
|
This option at minimum adds extra code to the scheduler - even if
it's default unused - and most users wouldn't want it.
Reported-by: Linus Torvalds <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Alban Crequy reported a race condition userspace faces when we want to
add some fds and make the syscall return them[1] using seccomp notify.
The problem is that currently two different ioctl() calls are needed by
the process handling the syscalls (agent) for another userspace process
(target): SECCOMP_IOCTL_NOTIF_ADDFD to allocate the fd and
SECCOMP_IOCTL_NOTIF_SEND to return that value. Therefore, it is possible
for the agent to do the first ioctl to add a file descriptor but the
target is interrupted (EINTR) before the agent does the second ioctl()
call.
This patch adds a flag to the ADDFD ioctl() so it adds the fd and
returns that value atomically to the target program, as suggested by
Kees Cook[2]. This is done by simply allowing
seccomp_do_user_notification() to add the fd and return it in this case.
Therefore, in this case the target wakes up from the wait in
seccomp_do_user_notification() either to interrupt the syscall or to add
the fd and return it.
This "allocate an fd and return" functionality is useful for syscalls
that return a file descriptor only, like connect(2). Other syscalls that
return a file descriptor but not as return value (or return more than
one fd), like socketpair(), pipe(), recvmsg with SCM_RIGHTs, will not
work with this flag.
This effectively combines SECCOMP_IOCTL_NOTIF_ADDFD and
SECCOMP_IOCTL_NOTIF_SEND into an atomic opteration. The notification's
return value, nor error can be set by the user. Upon successful invocation
of the SECCOMP_IOCTL_NOTIF_ADDFD ioctl with the SECCOMP_ADDFD_FLAG_SEND
flag, the notifying process's errno will be 0, and the return value will
be the file descriptor number that was installed.
[1]: https://lore.kernel.org/lkml/CADZs7q4sw71iNHmV8EOOXhUKJMORPzF7thraxZYddTZsxta-KQ@mail.gmail.com/
[2]: https://lore.kernel.org/lkml/202012011322.26DCBC64F2@keescook/
Signed-off-by: Rodrigo Campos <[email protected]>
Signed-off-by: Sargun Dhillon <[email protected]>
Acked-by: Tycho Andersen <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timers/nohz updates from Ingo Molnar:
- Micro-optimize tick_nohz_full_cpu()
- Optimize idle exit tick restarts to be less eager
- Optimize tick_nohz_dep_set_task() to only wake up a single CPU.
This reduces IPIs and interruptions on nohz_full CPUs.
- Optimize tick_nohz_dep_set_signal() in a similar fashion.
- Skip IPIs in tick_nohz_kick_task() when trying to kick a
non-running task.
- Micro-optimize tick_nohz_task_switch() IRQ flags handling to
reduce context switching costs.
- Misc cleanups and fixes
* tag 'timers-nohz-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
MAINTAINERS: Add myself as context tracking maintainer
tick/nohz: Call tick_nohz_task_switch() with interrupts disabled
tick/nohz: Kick only _queued_ task whose tick dependency is updated
tick/nohz: Change signal tick dependency to wake up CPUs of member tasks
tick/nohz: Only wake up a single target cpu when kicking a task
tick/nohz: Update nohz_full Kconfig help
tick/nohz: Update idle_exittime on actual idle exit
tick/nohz: Remove superflous check for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
tick/nohz: Conditionally restart tick on idle exit
tick/nohz: Evaluate the CPU expression after the static key
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler udpates from Ingo Molnar:
- Changes to core scheduling facilities:
- Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables
coordinated scheduling across SMT siblings. This is a much
requested feature for cloud computing platforms, to allow the
flexible utilization of SMT siblings, without exposing untrusted
domains to information leaks & side channels, plus to ensure more
deterministic computing performance on SMT systems used by
heterogenous workloads.
There are new prctls to set core scheduling groups, which allows
more flexible management of workloads that can share siblings.
- Fix task->state access anti-patterns that may result in missed
wakeups and rename it to ->__state in the process to catch new
abuses.
- Load-balancing changes:
- Tweak newidle_balance for fair-sched, to improve 'memcache'-like
workloads.
- "Age" (decay) average idle time, to better track & improve
workloads such as 'tbench'.
- Fix & improve energy-aware (EAS) balancing logic & metrics.
- Fix & improve the uclamp metrics.
- Fix task migration (taskset) corner case on !CONFIG_CPUSET.
- Fix RT and deadline utilization tracking across policy changes
- Introduce a "burstable" CFS controller via cgroups, which allows
bursty CPU-bound workloads to borrow a bit against their future
quota to improve overall latencies & batching. Can be tweaked via
/sys/fs/cgroup/cpu/<X>/cpu.cfs_burst_us.
- Rework assymetric topology/capacity detection & handling.
- Scheduler statistics & tooling:
- Disable delayacct by default, but add a sysctl to enable it at
runtime if tooling needs it. Use static keys and other
optimizations to make it more palatable.
- Use sched_clock() in delayacct, instead of ktime_get_ns().
- Misc cleanups and fixes.
* tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
sched/doc: Update the CPU capacity asymmetry bits
sched/topology: Rework CPU capacity asymmetry detection
sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flag
psi: Fix race between psi_trigger_create/destroy
sched/fair: Introduce the burstable CFS controller
sched/uclamp: Fix uclamp_tg_restrict()
sched/rt: Fix Deadline utilization tracking during policy change
sched/rt: Fix RT utilization tracking during policy change
sched: Change task_struct::state
sched,arch: Remove unused TASK_STATE offsets
sched,timer: Use __set_current_state()
sched: Add get_current_state()
sched,perf,kvm: Fix preemption condition
sched: Introduce task_is_running()
sched: Unbreak wakeups
sched/fair: Age the average idle time
sched/cpufreq: Consider reduced CPU capacity in energy calculation
sched/fair: Take thermal pressure into account while estimating energy
thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure
sched/fair: Return early from update_tg_cfs_load() if delta == 0
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events updates from Ingo Molnar:
- Platform PMU driver updates:
- x86 Intel uncore driver updates for Skylake (SNR) and Icelake (ICX) servers
- Fix RDPMC support
- Fix [extended-]PEBS-via-PT support
- Fix Sapphire Rapids event constraints
- Fix :ppp support on Sapphire Rapids
- Fix fixed counter sanity check on Alder Lake & X86_FEATURE_HYBRID_CPU
- Other heterogenous-PMU fixes
- Kprobes:
- Remove the unused and misguided kprobe::fault_handler callbacks.
- Warn about kprobes taking a page fault.
- Fix the 'nmissed' stat counter.
- Misc cleanups and fixes.
* tag 'perf-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix task context PMU for Hetero
perf/x86/intel: Fix instructions:ppp support in Sapphire Rapids
perf/x86/intel: Add more events requires FRONTEND MSR on Sapphire Rapids
perf/x86/intel: Fix fixed counter check warning for some Alder Lake
perf/x86/intel: Fix PEBS-via-PT reload base value for Extended PEBS
perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task
kprobes: Do not increment probe miss count in the fault handler
x86,kprobes: WARN if kprobes tries to handle a fault
kprobes: Remove kprobe::fault_handler
uprobes: Update uprobe_write_opcode() kernel-doc comment
perf/hw_breakpoint: Fix DocBook warnings in perf hw_breakpoint
perf/core: Fix DocBook warnings
perf/core: Make local function perf_pmu_snapshot_aux() static
perf/x86/intel/uncore: Enable I/O stacks to IIO PMON mapping on ICX
perf/x86/intel/uncore: Enable I/O stacks to IIO PMON mapping on SNR
perf/x86/intel/uncore: Generalize I/O stacks to PMON mapping procedure
perf/x86/intel/uncore: Drop unnecessary NULL checks after container_of()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
- Core locking & atomics:
- Convert all architectures to ARCH_ATOMIC: move every architecture
to ARCH_ATOMIC, then get rid of ARCH_ATOMIC and all the
transitory facilities and #ifdefs.
Much reduction in complexity from that series:
63 files changed, 756 insertions(+), 4094 deletions(-)
- Self-test enhancements
- Futexes:
- Add the new FUTEX_LOCK_PI2 ABI, which is a variant that doesn't
set FLAGS_CLOCKRT (.e. uses CLOCK_MONOTONIC).
[ The temptation to repurpose FUTEX_LOCK_PI's implicit setting of
FLAGS_CLOCKRT & invert the flag's meaning to avoid having to
introduce a new variant was resisted successfully. ]
- Enhance futex self-tests
- Lockdep:
- Fix dependency path printouts
- Optimize trace saving
- Broaden & fix wait-context checks
- Misc cleanups and fixes.
* tag 'locking-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
locking/lockdep: Correct the description error for check_redundant()
futex: Provide FUTEX_LOCK_PI2 to support clock selection
futex: Prepare futex_lock_pi() for runtime clock selection
lockdep/selftest: Remove wait-type RCU_CALLBACK tests
lockdep/selftests: Fix selftests vs PROVE_RAW_LOCK_NESTING
lockdep: Fix wait-type for empty stack
locking/selftests: Add a selftest for check_irq_usage()
lockding/lockdep: Avoid to find wrong lock dep path in check_irq_usage()
locking/lockdep: Remove the unnecessary trace saving
locking/lockdep: Fix the dep path printing for backwards BFS
selftests: futex: Add futex compare requeue test
selftests: futex: Add futex wait test
seqlock: Remove trailing semicolon in macros
locking/lockdep: Reduce LOCKDEP dependency list
locking/lockdep,doc: Improve readability of the block matrix
locking/atomics: atomic-instrumented: simplify ifdeffery
locking/atomic: delete !ARCH_ATOMIC remnants
locking/atomic: xtensa: move to ARCH_ATOMIC
locking/atomic: sparc: move to ARCH_ATOMIC
locking/atomic: sh: move to ARCH_ATOMIC
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fix and updates from Ingo Molnar:
"An ELF format fix for a section flags mismatch bug that breaks kernel
tooling such as kpatch-build.
The biggest change in this cycle is the new code to handle and rewrite
variable sized jump labels - which results in slightly tighter code
generation in hot paths, through the use of short(er) NOPs.
Also a number of cleanups and fixes, and a change to the generic
include/linux/compiler.h to handle a s390 GCC quirk"
* tag 'objtool-urgent-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Don't make .altinstructions writable
* tag 'objtool-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Improve reloc hash size guestimate
instrumentation.h: Avoid using inline asm operand modifiers
compiler.h: Avoid using inline asm operand modifiers
kbuild: Fix objtool dependency for 'OBJECT_FILES_NON_STANDARD_<obj> := n'
objtool: Reflow handle_jump_alt()
jump_label/x86: Remove unused JUMP_LABEL_NOP_SIZE
jump_label, x86: Allow short NOPs
objtool: Provide stats for jump_labels
objtool: Rewrite jump_label instructions
objtool: Decode jump_entry::key addend
jump_label, x86: Emit short JMP
jump_label: Free jump_entry::key bit1 for build use
jump_label, x86: Add variable length patching support
jump_label, x86: Introduce jump_entry_size()
jump_label, x86: Improve error when we fail expected text
jump_label, x86: Factor out the __jump_table generation
jump_label, x86: Strip ASM jump_label support
x86, objtool: Dont exclude arch/x86/realmode/
objtool: Rewrite hashtable sizing
|
|
kernel test robot reported:
>> kernel/trace/trace_osnoise.c:1584:2: error: void function
'osnoise_init_hotplug_support' should not return a
value [-Wreturn-type]
return 0;
When !CONFIG_HOTPLUG_CPU.
Fix it problem by removing the return value.
Link: https://lkml.kernel.org/r/c7fc67f1a117cc88bab2e508c898634872795341.1624872608.git.bristot@redhat.com
Fixes: c8895e271f79 ("trace/osnoise: Support hotplug operations")
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
kernel test robot reported:
>> kernel/trace/trace_osnoise.c:966:3: warning: comparison of distinct
pointer types ('typeof ((interval)) *' (aka 'long long *') and
'uint64_t *' (aka 'unsigned long long *'))
[-Wcompare-distinct-pointer-types]
do_div(interval, USEC_PER_MSEC);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/asm-generic/div64.h:228:28: note: expanded from macro 'do_div'
(void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~
As interval cannot be negative because sample_period >= sample_runtime,
making interval u64 on osnoise_main() is enough to fix this problem.
Link: https://lkml.kernel.org/r/4ae1e7780563598563de079a3ef6d4d10b5f5546.1624872608.git.bristot@redhat.com
Fixes: bce29ac9ce0b ("trace: Add osnoise tracer")
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
kernel test robot reported some osnoise functions with "no previous
prototype."
Fix these warnings by making local functions static, and by adding:
void osnoise_trace_irq_entry(int id);
void osnoise_trace_irq_exit(int id, const char *desc);
to include/linux/trace.h.
Link: https://lkml.kernel.org/r/e40d3cb4be8bde921f4b40fa6a095cf85ab807bd.1624872608.git.bristot@redhat.com
Fixes: bce29ac9ce0b ("trace: Add osnoise tracer")
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
ftracetest triggered:
INFO: rcu_tasks detected stalls on tasks:
00000000b92b832d: .. nvcsw: 1/1 holdout: 1 idle_cpu: -1/7
task:osnoise/7 state:R running task stack: 0 pid: 2133 ppid: 2 flags:0x00004000
Call Trace:
? asm_sysvec_apic_timer_interrupt+0x12/0x20
? asm_sysvec_apic_timer_interrupt+0x12/0x20
? trace_hardirqs_on+0x2b/0xe0
? asm_sysvec_apic_timer_interrupt+0x12/0x20
? trace_clock_local+0xc/0x20
? osnoise_main+0x10e/0x450
? trace_softirq_entry_callback+0x50/0x50
? kthread+0x153/0x170
? __kthread_bind_mask+0x60/0x60
? ret_from_fork+0x22/0x30
While running osnoise tracer with other tracers that rely on
synchronize_rcu_tasks(), where that just hung.
The reason is that osnoise_main() never schedules out if the interval
is less than 1, and this will cause synchronize_rcu_tasks() to never
return.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: bce29ac9ce0bb ("trace: Add osnoise tracer")
Acked-by: Paul E. McKenney <[email protected]>
Reviewed-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator updates from Mark Brown:
"The main core change this release is generic support for handling of
hardware errors from Matti Vaittinen, including some small updates to
the reboot and thermal code so we can share support for powering off
the system if things are going wrong enough.
Otherwise this release we've mainly seen the addition of new drivers,
including MT6359 which has pulled in some small changes from the MFD
tree for build dependencies.
- Support for controlling the trigger points for hardware error
detection, and shared handlers for this.
- Support for Maxim MAX8993, Mediatek MT6359 and MT6359P, Qualcomm
PM8226 and SA8115P-ADP, and Sylergy TCS4526"
* tag 'regulator-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (91 commits)
regulator: bd9576: Fix uninitializes variable may_have_irqs
regulator: max8893: Select REGMAP_I2C to fix build error
regulator: da9052: Ensure enough delay time for .set_voltage_time_sel
regulator: mt6358: Fix vdram2 .vsel_mask
regulator: hi6421v600: Fix setting wrong driver_data
MAINTAINERS: Add reviewer for regulator irq_helpers
regulator: bd9576: Fix the driver name in id table
regulator: bd9576: Support error reporting
regulator: bd9576 add FET ON-resistance for OCW
regulator: add property parsing and callbacks to set protection limits
regulator: IRQ based event/error notification helpers
regulator: move rdev_print helpers to internal.h
regulator: add warning flags
thermal: Use generic HW-protection shutdown API
reboot: Add hardware protection power-off
regulator: Add protection limit properties
regulator: hi6421v600: Fix setting idle mode
regulator: Add MAX8893 bindings
regulator: max8893: add regulator driver
regulator: hi6421: Use correct variable type for regmap api val argument
...
|
|
kmemleak scans struct page, but it does not scan the page content. If we
allocate some memory with kmalloc(), then allocate page with alloc_page(),
and if we put kmalloc pointer somewhere inside that page, kmemleak will
report kmalloc pointer as a false positive.
We can instruct kmemleak to scan the memory area by calling kmemleak_alloc()
and kmemleak_free(), but part of struct bpf_ringbuf is mmaped to user space,
and if struct bpf_ringbuf changes we would have to revisit and review size
argument in kmemleak_alloc(), because we do not want kmemleak to scan the
user space memory. Let's simplify things and use kmemleak_not_leak() here.
For posterity, also adding additional prior analysis from Andrii:
I think either kmemleak or syzbot are misreporting this. I've added a
bunch of printks around all allocations performed by BPF ringbuf. [...]
On repro side I get these two warnings:
[vmuser@archvm bpf]$ sudo ./repro
BUG: memory leak
unreferenced object 0xffff88810d538c00 (size 64):
comm "repro", pid 2140, jiffies 4294692933 (age 14.540s)
hex dump (first 32 bytes):
00 af 19 04 00 ea ff ff c0 ae 19 04 00 ea ff ff ................
80 ae 19 04 00 ea ff ff c0 29 2e 04 00 ea ff ff .........)......
backtrace:
[<0000000077bfbfbd>] __bpf_map_area_alloc+0x31/0xc0
[<00000000587fa522>] ringbuf_map_alloc.cold.4+0x48/0x218
[<0000000044d49e96>] __do_sys_bpf+0x359/0x1d90
[<00000000f601d565>] do_syscall_64+0x2d/0x40
[<0000000043d3112a>] entry_SYSCALL_64_after_hwframe+0x44/0xae
BUG: memory leak
unreferenced object 0xffff88810d538c80 (size 64):
comm "repro", pid 2143, jiffies 4294699025 (age 8.448s)
hex dump (first 32 bytes):
80 aa 19 04 00 ea ff ff 00 ab 19 04 00 ea ff ff ................
c0 ab 19 04 00 ea ff ff 80 44 28 04 00 ea ff ff .........D(.....
backtrace:
[<0000000077bfbfbd>] __bpf_map_area_alloc+0x31/0xc0
[<00000000587fa522>] ringbuf_map_alloc.cold.4+0x48/0x218
[<0000000044d49e96>] __do_sys_bpf+0x359/0x1d90
[<00000000f601d565>] do_syscall_64+0x2d/0x40
[<0000000043d3112a>] entry_SYSCALL_64_after_hwframe+0x44/0xae
Note that both reported leaks (ffff88810d538c80 and ffff88810d538c00)
correspond to pages array bpf_ringbuf is allocating and tracking properly
internally. Note also that syzbot repro doesn't close FD of created BPF
ringbufs, and even when ./repro itself exits with error, there are still
two forked processes hanging around in my system. So clearly ringbuf maps
are alive at that point. So reporting any memory leak looks weird at that
point, because that memory is being used by active referenced BPF ringbuf.
It's also a question why repro doesn't clean up its forks. But if I do a
`pkill repro`, I do see that all the allocated memory is /properly/ cleaned
up [and the] "leaks" are deallocated properly.
BTW, if I add close() right after bpf() syscall in syzbot repro, I see that
everything is immediately deallocated, like designed. And no memory leak
is reported. So I don't think the problem is anywhere in bpf_ringbuf code,
rather in the leak detection and/or repro itself.
Reported-by: [email protected]
Signed-off-by: Rustam Kovhaev <[email protected]>
[ Daniel: also included analysis from Andrii to the commit log ]
Signed-off-by: Daniel Borkmann <[email protected]>
Tested-by: [email protected]
Cc: Dmitry Vyukov <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/CAEf4BzYk+dqs+jwu6VKXP-RttcTEGFe+ySTGWT9CRNkagDiJVA@mail.gmail.com
Link: https://lore.kernel.org/lkml/YNTAqiE7CWJhOK2M@nuc10
Link: https://lore.kernel.org/lkml/[email protected]
Link: https://syzkaller.appspot.com/bug?extid=5d895828587f49e7fe9b
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Allow the helper to be called from tracing programs. This is needed to
handle cgroup hiererachies in the program.
Signed-off-by: Namhyung Kim <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
The _sum and _avg values are in general sync together with the PELT
divider. They are however not always completely in perfect sync,
resulting in situations where _sum gets to zero while _avg stays
positive. Such situations are undesirable.
This comes from the fact that PELT will increase period_contrib, also
increasing the PELT divider, without updating _sum and _avg values to
stay in perfect sync where (_sum == _avg * divider). However, such PELT
change will never lower _sum, making it impossible to end up in a
situation where _sum is zero and _avg is not.
Therefore, we need to ensure that when subtracting load outside PELT,
that when _sum is zero, _avg is also set to zero. This occurs when
(_sum < _avg * divider), and the subtracted (_avg * divider) is bigger
or equal to the current _sum, while the subtracted _avg is smaller than
the current _avg.
Reported-by: Sachin Sant <[email protected]>
Reported-by: Naresh Kamboju <[email protected]>
Signed-off-by: Odin Ugedal <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Tested-by: Sachin Sant <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Fix kernel-doc warning in irqdesc.c:
../kernel/irq/irqdesc.c:692: warning: Excess function parameter 'lookup' description in 'handle_domain_irq'
Fixes: e1c054918c6c ("genirq: Move non-irqdomain handle_domain_irq() handling into ARM's handle_IRQ()")
Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core
Pull irqchip updates from Marc Zyngier:
- Revamped the irqdomain internals to consistently cache irqdata
- Expose a new API to simplify IRQ handling involving an irqdomain by
not using the IRQ number
- Convert all the irqchip drivers to this new API
- Allow the Qualcomm PDC driver to be compiled as a module
- Fix HiSi MBIGEN compile warning when CONFIG_ACPI isn't selected
- Remove a bunch of spurious printks on error paths
- The obligatory couple of DT updates
|
|
[ mingo: MODULE_LICENSE() takes a string. ]
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
This reverts commits 4bad58ebc8bc4f20d89cff95417c9b4674769709 (and
399f8dd9a866e107639eabd3c1979cd526ca3a98, which tried to fix it).
I do not believe these are correct, and I'm about to release 5.13, so am
reverting them out of an abundance of caution.
The locking is odd, and appears broken.
On the allocation side (in __sigqueue_alloc()), the locking is somewhat
straightforward: it depends on sighand->siglock. Since one caller
doesn't hold that lock, it further then tests 'sigqueue_flags' to avoid
the case with no locks held.
On the freeing side (in sigqueue_cache_or_free()), there is no locking
at all, and the logic instead depends on 'current' being a single
thread, and not able to race with itself.
To make things more exciting, there's also the data race between freeing
a signal and allocating one, which is handled by using WRITE_ONCE() and
READ_ONCE(), and being mutually exclusive wrt the initial state (ie
freeing will only free if the old state was NULL, while allocating will
obviously only use the value if it was non-NULL, so only one or the
other will actually act on the value).
However, while the free->alloc paths do seem mutually exclusive thanks
to just the data value dependency, it's not clear what the memory
ordering constraints are on it. Could writes from the previous
allocation possibly be delayed and seen by the new allocation later,
causing logical inconsistencies?
So it's all very exciting and unusual.
And in particular, it seems that the freeing side is incorrect in
depending on "current" being single-threaded. Yes, 'current' is a
single thread, but in the presense of asynchronous events even a single
thread can have data races.
And such asynchronous events can and do happen, with interrupts causing
signals to be flushed and thus free'd (for example - sending a
SIGCONT/SIGSTOP can happen from interrupt context, and can flush
previously queued process control signals).
So regardless of all the other questions about the memory ordering and
locking for this new cached allocation, the sigqueue_cache_or_free()
assumptions seem to be fundamentally incorrect.
It may be that people will show me the errors of my ways, and tell me
why this is all safe after all. We can reinstate it if so. But my
current belief is that the WRITE_ONCE() that sets the cached entry needs
to be a smp_store_release(), and the READ_ONCE() that finds a cached
entry needs to be a smp_load_acquire() to handle memory ordering
correctly.
And the sequence in sigqueue_cache_or_free() would need to either use a
lock or at least be interrupt-safe some way (perhaps by using something
like the percpu 'cmpxchg': it doesn't need to be SMP-safe, but like the
percpu operations it needs to be interrupt-safe).
Fixes: 399f8dd9a866 ("signal: Prevent sigqueue caching after task got released")
Fixes: 4bad58ebc8bc ("signal: Allow tasks to cache one sigqueue struct")
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Christian Brauner <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|