Age | Commit message | Author | Files | Lines |
|
Give the basic phys_to_dma() and dma_to_phys() helpers a __-prefix and add
the memory encryption mask to the non-prefixed versions. Use the
__-prefixed versions directly instead of clearing the mask again in
various places.
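The resulting layering might look roughly like this (a minimal sketch based on the description above, not the verbatim mainline code; __sme_set() applies the memory encryption mask):

  /* Arch-level raw translation, no encryption mask applied: */
  static inline dma_addr_t __phys_to_dma(struct device *dev, phys_addr_t paddr)
  {
          return paddr;   /* arch-specific offset handling elided */
  }

  /* Non-prefixed wrapper that adds the memory encryption mask: */
  static inline dma_addr_t phys_to_dma(struct device *dev, phys_addr_t paddr)
  {
          return __sme_set(__phys_to_dma(dev, paddr));
  }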
Tested-by: Tom Lendacky <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Jon Mason <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Muli Ben-Yehuda <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
... to make these APIs more universally available.
Tested-by: Tom Lendacky <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
Reviewed-by: Tom Lendacky <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Jon Mason <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Muli Ben-Yehuda <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
With the following commit:
333522447063 ("jump_label: Explicitly disable jump labels in __init code")
... we explicitly disabled jump labels in __init code, so they could be
detected and not warned about in the following commit:
dc1dd184c2f0 ("jump_label: Warn on failed jump_label patching attempt")
In-kernel __exit code has the same issue. It's never used, so it's
freed along with the rest of initmem. But jump label entries in __exit
code aren't explicitly disabled, so we get the following warning when
enabling pr_debug() in __exit code:
can't patch jump_label at dmi_sysfs_exit+0x0/0x2d
WARNING: CPU: 0 PID: 22572 at kernel/jump_label.c:376 __jump_label_update+0x9d/0xb0
Fix the warning by disabling all jump labels in initmem (which includes
both __init and __exit code).
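A minimal sketch of the approach, assuming init_section_contains() covers all of initmem (and hence both __init and __exit code):

  void jump_label_invalidate_initmem(void)
  {
          struct jump_entry *iter;

          for (iter = __start___jump_table; iter < __stop___jump_table; iter++) {
                  /* Zero the code address so later patching skips the entry. */
                  if (init_section_contains((void *)(unsigned long)iter->code, 1))
                          iter->code = 0;
          }
  }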
Reported-and-tested-by: Li Wang <[email protected]>
Signed-off-by: Josh Poimboeuf <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Jason Baron <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Fixes: dc1dd184c2f0 ("jump_label: Warn on failed jump_label patching attempt")
Link: http://lkml.kernel.org/r/7121e6e595374f06616c505b6e690e275c0054d1.1521483452.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <[email protected]>
|
|
There are no users left (everyone got converted to wait_var_event()), remove it.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
wait_var_event() API
The old wait_on_atomic_t() is going to get removed, use the more
flexible wait_var_event() API instead.
No change in functionality.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: David Howells <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
As a replacement for the wait_on_atomic_t() API provide the
wait_var_event() API.
The wait_var_event() API is based on the very same hashed-waitqueue
idea, but doesn't care about the type (atomic_t) or the specific
condition (atomic_read() == 0). IOW, it's much more widely
applicable/flexible.
It shares all the benefits/disadvantages of a hashed-waitqueue
approach with the existing wait_on_atomic_t/wait_on_bit() APIs.
The API is modeled after the existing wait_event() API, but instead of
taking a wait_queue_head, it takes an address. This address is hashed
to obtain a wait_queue_head from the bit_wait_table.
Similar to the wait_event() API, it takes a condition expression as
second argument and will wait until this expression becomes true.
The following are (mostly) identical replacements:

  wait_on_atomic_t(&my_atomic, atomic_t_wait, TASK_UNINTERRUPTIBLE);
  wake_up_atomic_t(&my_atomic);

  wait_var_event(&my_atomic, !atomic_read(&my_atomic));
  wake_up_var(&my_atomic);
The only difference is that wake_up_var() is an unconditional wakeup
and doesn't check the previously hard-coded (atomic_read() == 0)
condition. This is of little consequence, since most callers are
already conditional on atomic_dec_and_test() and the ones that are
not are trivial to make so.
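For instance, the typical wake side stays conditional on the final put; a sketch ('obj' and its 'refs' member are illustrative):

  if (atomic_dec_and_test(&obj->refs))
          wake_up_var(&obj->refs);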
Tested-by: Dan Williams <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: David Howells <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The util_avg signal computed by PELT is too variable for some use-cases.
For example, a big task waking up after a long sleep period will have its
utilization almost completely decayed. This introduces some latency before
schedutil will be able to pick the best frequency to run a task.
The same issue can affect task placement. Indeed, since the task
utilization is already decayed at wakeup, when the task is enqueued on a
CPU, the CPU running a big task can be temporarily represented as almost
empty. This leads to a race condition where other tasks can potentially
be allocated on a CPU which has just started to run a big task that
slept for a relatively long period.
Moreover, the PELT utilization of a task can be updated every
millisecond, thus making it a continuously changing value for certain
longer-running tasks. This means that the instantaneous PELT utilization
of a RUNNING task is not really meaningful to properly support scheduler
decisions.
For all these reasons, a more stable signal can do a better job of
representing the expected/estimated utilization of a task/cfs_rq.
Such a signal can be easily created on top of PELT by still using it as
an estimator which produces values to be aggregated on meaningful
events.
This patch adds a simple implementation of util_est, a new signal built on
top of PELT's util_avg where:
  util_est(task) = max(task::util_avg, f(task::util_avg@dequeue))
This allows us to remember how big a task has been reported by PELT in
its previous activations via f(task::util_avg@dequeue), which is the new
_task_util_est(struct task_struct*) function added by this patch.
If a task should change its behavior and it runs longer in a new
activation, after a certain time its util_est will just track the
original PELT signal (i.e. task::util_avg).
The estimated utilization of a cfs_rq is defined only for root ones.
That's because the only sensible consumers of this signal are the
scheduler and schedutil when looking for the overall CPU utilization
due to FAIR tasks.
For this reason, the estimated utilization of a root cfs_rq is simply
defined as:
  util_est(cfs_rq) = max(cfs_rq::util_avg, cfs_rq::util_est::enqueued)
where:
  cfs_rq::util_est::enqueued = sum(_task_util_est(task))
                               for each RUNNABLE task on that root cfs_rq
It's worth noting that the estimated utilization is tracked only for
objects of interest, specifically:
- Tasks: to better support tasks placement decisions
- root cfs_rqs: to better support both tasks placement decisions as
well as frequencies selection
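A rough sketch of the task-side estimator, assuming f() keeps both the sample taken at dequeue time and a moving average of it (field and type names follow the commit text and may differ from the final code):

  static inline unsigned long _task_util_est(struct task_struct *p)
  {
          struct util_est ue = READ_ONCE(p->se.avg.util_est);

          /* max of the last-dequeue sample and its moving average */
          return max(ue.ewma, ue.enqueued);
  }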
Signed-off-by: Patrick Bellasi <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Viresh Kumar <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
"Two low-impact workqueue commits.
One fixes workqueue creation error path and the other removes the
unused cancel_work()"
* 'for-4.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: remove unused cancel_work()
workqueue: use put_device() instead of kfree()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu fixes from Tejun Heo:
"Late percpu pull request for v4.16-rc6.
- percpu allocator pool replenishing no longer triggers OOM or
warning messages.
Also, the alloc interface now understands __GFP_NORETRY and
__GFP_NOWARN. This is to allow avoiding OOMs from userland-triggered
actions like bpf map creation.
Also added cond_resched() in the alloc loop.
- percpu allocation can now be interrupted by kill signals to avoid
deadlocking the OOM killer.
- Added Dennis Zhou as a co-maintainer.
He has rewritten the area map allocator, understands most of the
code base and has been responsive to all bug reports"
* 'for-4.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu_ref: Update doc to dissuade users from depending on internal RCU grace periods
mm: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
percpu: include linux/sched.h for cond_resched()
percpu: add a schedule point in pcpu_balance_workfn()
percpu: allow select gfp to be passed to underlying allocators
percpu: add __GFP_NORETRY semantics to the percpu balancing path
percpu: match chunk allocator declarations with definitions
percpu: add Dennis Zhou as a percpu co-maintainer
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
grace periods
percpu_ref internally uses sched-RCU to implement the percpu -> atomic
mode switching and the documentation suggested that this could be
depended upon. This doesn't seem like a good idea.
* percpu_ref uses sched-RCU, which has different grace periods than
regular RCU. Users may combine percpu_ref with regular RCU usage and
incorrectly believe that regular RCU grace periods are performed by
percpu_ref. This can lead to, for example, use-after-free due to
premature freeing.
* percpu_ref has a grace period when switching from percpu to atomic
mode. It doesn't have one between the last put and release. This
distinction is subtle and can lead to surprising bugs.
* percpu_ref allows starting in and switching to atomic mode manually
for debugging and other purposes. This means that there may not be
any grace periods from kill to release.
This patch makes it clear that the grace periods are percpu_ref's
internal implementation detail and can't be depended upon by the
users.
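Concretely, a user who needs a regular RCU grace period before freeing must provide one themselves; a sketch ('struct obj', its 'rcu_head' field and obj_free_rcu() are illustrative):

  static void obj_release(struct percpu_ref *ref)
  {
          struct obj *o = container_of(ref, struct obj, ref);

          /* Don't rely on percpu_ref's internal sched-RCU grace period: */
          call_rcu(&o->rcu_head, obj_free_rcu);
  }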
Signed-off-by: Tejun Heo <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
Pull KVM fixes from Paolo Bonzini:
"PPC:
- fix bug leading to lost IPIs and smp_call_function_many() lockups
on POWER9
ARM:
- locking fix
- reset fix
- GICv2 multi-source SGI injection fix
- GICv2-on-v3 MMIO synchronization fix
- make the console less verbose.
x86:
- fix device passthrough on AMD SME"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: Fix device passthrough when SME is active
kvm: arm/arm64: vgic-v3: Tighten synchronization for guests using v2 on v3
KVM: arm/arm64: vgic: Don't populate multiple LRs with the same vintid
KVM: arm/arm64: Reduce verbosity of KVM init log
KVM: arm/arm64: Reset mapped IRQs on VM reset
KVM: arm/arm64: Avoid vcpu_load for other vcpu ioctls than KVM_RUN
KVM: arm/arm64: vgic: Add missing irq_lock to vgic_mmio_read_pending
KVM: PPC: Book3S HV: Fix trap number return from __kvmppc_vcore_entry
|
|
Mark noticed that the change to sibling_list changed some iteration
semantics; because previously we used group_list as list entry,
sibling events would always have an empty sibling_list.
But because we now use sibling_list for both list head and list entry,
siblings will report as having siblings.
Fix this with a custom for_each_sibling_event() iterator.
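The iterator can key off the fact that only a group leader's sibling_list is used as a list head; a sketch of the idea:

  #define for_each_sibling_event(sibling, event)                  \
          if ((event)->group_leader == (event))                   \
                  list_for_each_entry((sibling), &(event)->sibling_list, sibling_list)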
Fixes: 8343aae66167 ("perf/core: Remove perf_event::group_entry")
Reported-by: Mark Rutland <[email protected]>
Suggested-by: Mark Rutland <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
|
|
With reorder header off, received packets are untagged in skb_vlan_untag()
called from within __netif_receive_skb_core(), and later the tag will be
inserted back in vlan_do_receive().
This caused out-of-order vlan headers when we create a vlan device on top
of another vlan device, because vlan_do_receive() inserts a tag as the
outermost vlan tag. E.g. the outer tag is first removed in skb_vlan_untag()
and inserted back in vlan_do_receive(), then the inner tag is next removed
and inserted back as the outermost tag.
This patch fixes the behaviour by inserting the inner tag at the right
position.
Signed-off-by: Toshiaki Makita <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Tile was the only remaining architecture to implement alloc_remap(),
and since that is being removed, there is no point in keeping this
function.
Removing all callers simplifies the mem_map handling.
Reviewed-by: Pavel Tatashin <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
|
|
The Analog Devices Blackfin port was added in 2007 and was rather
active for a while, but all work on it has come to a standstill
over time, as Analog have changed their product line-up.
Aaron Wu confirmed that the architecture port is no longer relevant,
and multiple people suggested removing blackfin independently because
of some of its oddities like a non-working SMP port, and the amount of
duplication between the chip variants, which cause extra work when
doing cross-architecture changes.
Link: https://docs.blackfin.uclinux.org/
Acked-by: Aaron Wu <[email protected]>
Acked-by: Bryan Wu <[email protected]>
Cc: Steven Miao <[email protected]>
Cc: Mike Frysinger <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
- backport-friendly part of lock_parent() race fix
- a fix for an assumption in the heuristic used by path_connected() that
is not true on NFS
- livelock fixes for d_alloc_parallel()
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: Teach path_connected to handle nfs filesystems with multiple roots.
fs: dcache: Use READ_ONCE when accessing i_dir_seq
fs: dcache: Avoid livelock between d_alloc_parallel and __d_add
lock_parent() needs to recheck if dentry got __dentry_kill'ed under it
|
|
On nfsv2 and nfsv3 the nfs server can export subsets of the same
filesystem and report the same filesystem identifier, so that the nfs
client can know they are the same filesystem. The subsets can be from
disjoint directory trees. The nfsv2 and nfsv3 protocols provide no
way to find the common root of all directory trees exported from the
server with the same filesystem identifier.
The practical result is that, for nfs, s_root in struct super_block is
not necessarily the root of the filesystem. The nfs mount code sets
s_root to the root of the first subset of the nfs filesystem that the
kernel mounts.
This affects the dcache invalidation code in generic_shutdown_super(),
currently called shrink_dcache_for_umount(), and that code has for years
gone through an additional list of dentries that might be roots of
dentry trees that need to be freed, to accommodate nfs.
When I wrote path_connected() I did not realize nfs was so special, and
its heuristic for avoiding a call to is_subdir() can fail.
The practical case where this fails is when a directory is moved from
the subtree exposed by one nfs mount to the subtree exposed by another
nfs mount. This move can happen either locally or remotely. The remote
case requires that the moved directory be cached before the move, and
that after the move someone walks the path to where the moved directory
now exists, in so doing causing the already cached directory to be moved
in the dcache through the magic of d_splice_alias().
If someone whose working directory is inside the moved directory or a
subdirectory of it then starts calling '..' from the initial mount of
nfs (where s_root == mnt_root), path_connected() as a heuristic will
not bother with the is_subdir() check. As s_root really is not the root
of the nfs filesystem, this heuristic is wrong, and the path may
actually not be connected, so path_connected() can fail.
The is_subdir() function might be cheap enough that we can call it
unconditionally. Verifying that will take some benchmarking, and the
result may not be the same on all kernels this fix needs to be
backported to, so I am avoiding that for now.
Filesystems with snapshots such as nilfs and btrfs do something
similar. But as the directory trees of the snapshots are disjoint
from one another and from the main directory tree, rename won't move
things between them and this problem will not occur.
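One way to express the fix (a sketch, assuming an SB_I_MULTIROOT superblock flag that nfs sets for such multi-root exports):

  static bool path_connected(struct vfsmount *mnt, struct dentry *dentry)
  {
          struct super_block *sb = mnt->mnt_sb;

          /* Only bind mounts and multi-root filesystems can be disconnected. */
          if (!(sb->s_iflags & SB_I_MULTIROOT) && (mnt->mnt_root == sb->s_root))
                  return true;

          return is_subdir(dentry, mnt->mnt_root);
  }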
Cc: [email protected]
Reported-by: Al Viro <[email protected]>
Fixes: 397d425dc26d ("vfs: Test for and handle paths that are unreachable from their mnt_root")
Signed-off-by: "Eric W. Biederman" <[email protected]>
Signed-off-by: Al Viro <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master
kvm/arm fixes for 4.16, take 2
- Peace of mind locking fix in vgic_mmio_read_pending
- Allow hw-mapped interrupts to be reset when the VM resets
- Fix GICv2 multi-source SGI injection
- Fix MMIO synchronization for GICv2 on v3 emulation
- Remove excess verbosity on the console
|
|
|
|
The vgic code is trying to be clever when injecting GICv2 SGIs,
and will happily populate LRs with the same interrupt number if
they come from multiple vcpus (after all, they are distinct
interrupt sources).
Unfortunately, this is against the letter of the architecture,
and the GICv2 architecture spec says "Each valid interrupt stored
in the List registers must have a unique VirtualID for that
virtual CPU interface." GICv3 has similar (although slightly
ambiguous) restrictions.
This results in guests locking up when using GICv2-on-GICv3, for
example. The obvious fix is to stop trying so hard, and inject
a single vcpu per SGI per guest entry. After all, pending SGIs
with multiple source vcpus are pretty rare, and are mostly seen
in scenarios where the physical CPUs are severely overcommitted.
But as we now only inject a single instance of a multi-source SGI per
vcpu entry, we may delay those interrupts for longer than strictly
necessary, and run the risk of injecting lower priority interrupts
in the meantime.
In order to address this, we adopt a three-stage strategy:
- If we encounter a multi-source SGI in the AP list while computing
its depth, we force the list to be sorted
- When populating the LRs, we prevent the injection of any interrupt
of lower priority than that of the first multi-source SGI we've
injected.
- Finally, the injection of a multi-source SGI triggers the request
of a maintenance interrupt when there will be no pending interrupt
in the LRs (HCR_NPIE).
At the point where the last pending interrupt in the LRs switches
from Pending to Active, the maintenance interrupt will be delivered,
allowing us to add the remaining SGIs using the same process.
Cc: [email protected]
Fixes: 0919e84c0fc1 ("KVM: arm/arm64: vgic-new: Add IRQ sync/flush framework")
Acked-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are a small clump of USB fixes for 4.16-rc6.
Nothing major, just a number of fixes in lots of different drivers, as
well as a PHY driver fix that snuck into this tree. Full details are
in the shortlog.
All of these have been in linux-next with no reported issues"
* tag 'usb-4.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (22 commits)
usb: musb: Fix external abort in musb_remove on omap2430
phy: qcom-ufs: add MODULE_LICENSE tag
usb: typec: tcpm: fusb302: Do not log an error on -EPROBE_DEFER
USB: OHCI: Fix NULL dereference in HCDs using HCD_LOCAL_MEM
usbip: vudc: fix null pointer dereference on udc->lock
xhci: Fix front USB ports on ASUS PRIME B350M-A
usb: host: xhci-plat: revert "usb: host: xhci-plat: enable clk in resume timing"
usb: usbmon: Read text within supplied buffer size
usb: host: xhci-rcar: add support for r8a77965
USB: storage: Add JMicron bridge 152d:2567 to unusual_devs.h
usb: xhci: dbc: Fix lockdep warning
xhci: fix endpoint context tracer output
Revert "typec: tcpm: Only request matching pdos"
usb: musb: call pm_runtime_{get,put}_sync before reading vbus registers
usb: quirks: add control message delay for 1b1c:1b20
uas: fix comparison for error code
usb: gadget: udc: renesas_usb3: add binging for r8a77965
usb: renesas_usbhs: add binding for r8a77965
usb: dwc2: fix STM32F7 USB OTG HS compatible
dt-bindings: usb: fix the STM32F7 DWC2 OTG HS core binding
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial driver fixes from Greg KH:
"Here are some small tty core and serial driver fixes for 4.16-rc6.
They resolve some newly reported bugs, as well as some very old ones,
which is always nice to see. There is also a new device id added in
here for good measure.
All of these have been in linux-next for a while with no reported
issues"
* tag 'tty-4.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
serial: imx: fix bogus dev_err
serial: sh-sci: prevent lockup on full TTY buffers
serial: 8250_pci: Add Brainboxes UC-260 4 port serial device
earlycon: add reg-offset to physical address before mapping
serial: core: mark port as initialized in autoconfig
serial: 8250_pci: Don't fail on multiport card class
tty/serial: atmel: add new version check for usart
tty: make n_tty_read() always abort if hangup is in progress
|
|
Andrei Vagin reported a KASAN: slab-out-of-bounds error in
skb_update_prio()
Since SYNACK might be attached to a request socket, we need to
get back to the listener socket.
Since this listener is manipulated without locks, add const
qualifiers to sock_cgroup_prioidx() so that the const can also
be used in skb_update_prio()
Also add the const qualifier to sock_cgroup_classid() for consistency.
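The resulting pattern in skb_update_prio() looks roughly like this (a sketch; skb_to_full_sk() steps from a request socket back to its listener):

  const struct sock *sk = skb_to_full_sk(skb);

  if (sk && sk_fullsock(sk)) {
          unsigned int prioidx = sock_cgroup_prioidx(&sk->sk_cgrp_data);
          /* ... look up the priority for skb->dev in the netprio map ... */
  }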
Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: Andrei Vagin <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Back in 2013, runtime PM for GPUs with integrated HDA controller was
introduced with commits 0d69704ae348 ("gpu/vga_switcheroo: add driver
control power feature. (v3)") and 246efa4a072f ("snd/hda: add runtime
suspend/resume on optimus support (v4)").
Briefly, the idea was that the HDA controller is forced on and off in
unison with the GPU.
The original code is mostly still in place even though it was never a
100% perfect solution: E.g. on access to the HDA controller, the GPU
is powered up via vga_switcheroo_runtime_resume_hdmi_audio() but there
are no provisions to keep it resumed until access to the HDA controller
has ceased: The GPU autosuspends after 5 seconds, rendering the HDA
controller inaccessible.
Additionally, a kludge is required when hda_intel.c probes: It has to
check whether the GPU is powered down (check_hdmi_disabled()) and defer
probing if so.
However in the meantime (in v4.10) the driver core has gained a feature
called device links which promises to solve such issues in a clean way:
It allows us to declare a dependency from the HDA controller (consumer)
to the GPU (supplier). The PM core then automagically ensures that the
GPU is runtime resumed as long as the HDA controller's ->probe hook is
executed and whenever the HDA controller is accessed.
By default, the HDA controller has a dependency on its parent, a PCIe
Root Port. Adding a device link creates another dependency on its
sibling:
          PCIe Root Port
               ^  ^
               |  |
               |  |
             HDA ===> GPU
The device link is not only used for runtime PM, it also guarantees that
on system sleep, the HDA controller suspends before the GPU and resumes
after the GPU, and on system shutdown the HDA controller's ->shutdown
hook is executed before the one of the GPU. It is a complete solution.
Using this functionality is as simple as calling device_link_add(),
which results in a dmesg entry like this:
pci 0000:01:00.1: Linked as a consumer to 0000:01:00.0
The code for the GPU-governed audio power management can thus be removed
(except where it's still needed for legacy manual power control).
The device link is added in a PCI quirk rather than in hda_intel.c.
It is therefore legal for the GPU to runtime suspend to D3cold even if
the HDA controller is not bound to a driver or if CONFIG_SND_HDA_INTEL
is not enabled, for accesses to the HDA controller will cause the GPU to
wake up regardless, even if they occur outside of hda_intel.c (think
config space readout via sysfs).
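A sketch of such a quirk (simplified, assuming the GPU is function 0 of the same slot as the HDA controller):

  static void quirk_gpu_hda(struct pci_dev *hda)
  {
          struct pci_dev *gpu;

          gpu = pci_get_slot(hda->bus, PCI_DEVFN(PCI_SLOT(hda->devfn), 0));
          if (gpu) {
                  device_link_add(&hda->dev, &gpu->dev,
                                  DL_FLAG_STATELESS | DL_FLAG_PM_RUNTIME);
                  pci_dev_put(gpu);
          }
  }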
Contrary to the previous implementation, the HDA controller's power
state is now self-governed, rather than GPU-governed, whereas the GPU's
power state is no longer fully self-governed. (The HDA controller needs
to runtime suspend before the GPU can.)
It is thus crucial that runtime PM is always activated on the HDA
controller even if CONFIG_SND_HDA_POWER_SAVE_DEFAULT is set to 0 (which
is the default), lest the GPU stays awake. This is achieved by setting
the auto_runtime_pm flag on every codec and the AZX_DCAPS_PM_RUNTIME
flag on the HDA controller.
A side effect is that power consumption might be reduced if the GPU is
in use but the HDA controller is not, because the HDA controller is now
allowed to go to D3hot. Before, it was forced to stay in D0 as long as
the GPU was in use. (There is no reduction in power consumption on my
Nvidia GK107, but there might be on other chips.)
The code paths for legacy manual power control are adjusted such that
runtime PM is disabled during power off, thereby preventing the PM core
from resuming the HDA controller.
Note that the device link is not only added on vga_switcheroo capable
systems, but for *any* GPU with integrated HDA controller. The idea is
that the HDA controller streams audio via connectors located on the GPU,
so the GPU needs to be on for the HDA controller to do anything useful.
This commit implicitly fixes an unbalanced runtime PM ref upon unbind of
hda_intel.c: On ->probe, a runtime PM ref was previously released under
the condition "azx_has_pm_runtime(chip) || hda->use_vga_switcheroo", but
on ->remove a runtime PM ref was only acquired under the first of those
conditions. Thus, binding and unbinding the driver twice on a
vga_switcheroo capable system caused the runtime PM refcount to drop
below zero. The issue is resolved because the AZX_DCAPS_PM_RUNTIME flag
is now always set if use_vga_switcheroo is true.
For more information on device links please refer to:
https://www.kernel.org/doc/html/latest/driver-api/device_link.html
Documentation/driver-api/device_link.rst
Cc: Dave Airlie <[email protected]>
Cc: Ben Skeggs <[email protected]>
Cc: Alex Deucher <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Acked-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Takashi Iwai <[email protected]>
Reviewed-by: Peter Wu <[email protected]>
Tested-by: Kai Heng Feng <[email protected]> # AMD PowerXpress
Tested-by: Mike Lothian <[email protected]> # AMD PowerXpress
Tested-by: Denis Lisov <[email protected]> # Nvidia Optimus
Tested-by: Peter Wu <[email protected]> # Nvidia Optimus
Tested-by: Lukas Wunner <[email protected]> # MacBook Pro
Signed-off-by: Lukas Wunner <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/51bd38360ff502a8c42b1ebf4405ee1d3f27118d.1520068884.git.lukas@wunner.de
|
|
There are PCI devices which are power-manageable by a nonstandard means,
such as a custom ACPI method. One example are discrete GPUs in hybrid
graphics laptops, another are Thunderbolt controllers in Macs.
Such devices can't be put into D3cold with pci_set_power_state() because
pci_platform_power_transition() fails with -ENODEV. Instead they're put
into D3hot by pci_set_power_state() and subsequently into D3cold by
invoking the nonstandard means. However as a consequence the cached
current_state is incorrectly left at D3hot.
What we need to do is walk the hierarchy below such a PCI device on
powerdown and update the current_state to D3cold. On powerup the PCI
device itself and the hierarchy below it are in D0uninitialized, so we
need to walk the hierarchy again and wake all devices, causing them to
be put into D0active and then letting them autosuspend as they see fit.
To this end make pci_wakeup_bus() & pci_bus_set_current_state() public
so PCI drivers don't have to reinvent the wheel.
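Driver-side usage might then look like this for a bridge-like device with children (a sketch; power_off_by_platform()/power_on_by_platform() stand in for the nonstandard means and are hypothetical):

  /* powerdown: */
  pci_set_power_state(pdev, PCI_D3hot);
  power_off_by_platform(pdev);                            /* -> D3cold */
  pci_bus_set_current_state(pdev->subordinate, PCI_D3cold);

  /* powerup: */
  power_on_by_platform(pdev);                             /* -> D0uninitialized */
  pci_wakeup_bus(pdev->subordinate);                      /* wake the hierarchy below */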
Cc: Rafael J. Wysocki <[email protected]>
Acked-by: Bjorn Helgaas <[email protected]>
Signed-off-by: Lukas Wunner <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/2962443259e7faec577274b4ef8c54aad66f9a94.1520068884.git.lukas@wunner.de
|
|
Found this by accident.
There are no usages of bare cancel_work() in current kernel source.
Signed-off-by: Stephen Hemminger <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
This patch validates user-provided input to prevent integer overflow due
to integer manipulation in the mlx5_ib_create_srq() function.
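The general shape of such validation (a sketch with illustrative names, not the actual mlx5 field layout): bound the user-controlled count before it feeds any multiplication.

  if (!ucmd.wqe_count || ucmd.wqe_count > MLX5_MAX_SRQ_WQES)
          return -EINVAL;

  /* wqe_count is now bounded, so (with a sane cap) the multiply cannot wrap. */
  buf_size = ucmd.wqe_count * desc_size;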
Cc: syzkaller <[email protected]>
Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters")
Signed-off-by: Boris Pismenny <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>
|
|
Problem and motivation: Once a breakpoint perf event (PERF_TYPE_BREAKPOINT)
is created, there is no flexibility to change the breakpoint type
(bp_type), breakpoint address (bp_addr), or breakpoint length (bp_len). The
only option is to close the perf event and configure a new breakpoint
event. This inflexibility has a significant performance overhead. For
example, sampling-based, lightweight performance profilers (and also
concurrency bug detection tools), monitor different addresses for a short
duration using PERF_TYPE_BREAKPOINT and change the address (bp_addr) to
another address or change the kind of breakpoint (bp_type) from "write" to
a "read" or vice-versa or change the length (bp_len) of the address being
monitored. The cost of these modifications is prohibitive since it involves
unmapping the circular buffer associated with the perf event, closing the
perf event, opening another perf event and mmaping another circular buffer.
Solution: The new ioctl flag for perf events,
PERF_EVENT_IOC_MODIFY_ATTRIBUTES, introduced in this patch takes a pointer
to a struct perf_event_attr as an argument to update an old breakpoint
event with new address, type, and size. This facility allows retaining a
previous mmaped perf events ring buffer and avoids having to close and
reopen another perf event.
This patch supports only the PERF_TYPE_BREAKPOINT event type; future
implementations can extend this feature. The patch replicates some of
the functionality of modify_user_hw_breakpoint() in
kernel/events/hw_breakpoint.c. modify_user_hw_breakpoint() cannot be
called directly since perf_event_ctx_lock() is already held in
_perf_ioctl().
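From userspace, retargeting an existing breakpoint then reduces to a single ioctl; a sketch ('event_fd' is an already-opened PERF_TYPE_BREAKPOINT event and 'new_var' the new watch target):

  struct perf_event_attr attr = {
          .type    = PERF_TYPE_BREAKPOINT,
          .bp_type = HW_BREAKPOINT_W,             /* e.g. switch to write */
          .bp_addr = (__u64)(unsigned long)&new_var,
          .bp_len  = HW_BREAKPOINT_LEN_4,
  };

  if (ioctl(event_fd, PERF_EVENT_IOC_MODIFY_ATTRIBUTES, &attr) < 0)
          perror("PERF_EVENT_IOC_MODIFY_ATTRIBUTES");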
Evidence: Experiments show that the baseline (not being able to modify
an already created breakpoint) costs an order of magnitude (~10x) more
than the suggested optimization (having the ability to dynamically
modify a configured breakpoint via ioctl). When the breakpoints
typically do not trap, the speedup due to the suggested optimization is
~10x; even when the breakpoints always trap, the speedup is still ~4x.
Testing: tests posted at
https://github.com/linux-contrib/perf_event_modify_bp demonstrate the
performance significance of this patch. Tests also check the functional
correctness of the patch.
Signed-off-by: Milind Chabbi <[email protected]>
[ Using modify_user_hw_breakpoint_check function. ]
[ Reformatted PERF_EVENT_IOC_*, so the values are all in one column. ]
Signed-off-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Hari Bathini <[email protected]>
Cc: Jin Yao <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sukadev Bhattiprolu <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree, they are:
1) The fixed hashtable representation doesn't support the timeout flag,
so skip it, otherwise rules that add elements from the packet path
bogusly fail with EOPNOTSUPP.
2) Fix bogus error with 32-bits ebtables userspace and 64-bits kernel,
patch from Florian Westphal.
3) Sanitize proc names in several x_tables extensions, also from Florian.
4) Add sanitization to ebt_among wormhash logic, from Florian.
5) Missing release of hook array in flowtable.
====================
|
|
registered
Now when using 'ss' from iproute, the kernel tries to load all _diag
modules, which also causes the corresponding family and proto modules
to be loaded due to module dependencies.
E.g. after running 'ss', sctp, dccp and af_packet (if built as a
module) would be loaded.
For example:
$ lsmod|grep sctp
$ ss
$ lsmod|grep sctp
sctp_diag 16384 0
sctp 323584 5 sctp_diag
inet_diag 24576 4 raw_diag,tcp_diag,sctp_diag,udp_diag
libcrc32c 16384 3 nf_conntrack,nf_nat,sctp
As these family and proto modules are loaded unintentionally, they can
cause problems, like:
- Some debug tools use 'ss' to collect socket info, which loads all
those diag, family and protocol modules. It's noisy for identifying
issues.
- Users usually expect sctp init packets to be dropped silently when
the host has no notion of the sctp protocol, instead of an abort being
sent back.
- It wastes resources (especially with multiple netns), and the SCTP
module can't be unloaded once it's loaded.
...
In short, it's really inappropriate to have these family and proto
modules loaded unexpectedly when just doing debugging with inet_diag.
This patch introduces sock_load_diag_module(), which loads the _diag
module only when its corresponding family or proto has already been
registered.
Note that we can't just load the _diag module without the family or
proto loaded, as some symbols used in the _diag module come from the
family or proto module.
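The family half of the helper can be as simple as this sketch of the contract described above (the proto case is analogous, keyed on the registered protocol):

  void sock_load_diag_module(int family, int protocol)
  {
          if (!protocol) {
                  /* Only pull in the _diag module if the family is live: */
                  if (!sock_is_registered(family))
                          return;
                  request_module("net-pf-%d-proto-%d-type-%d", PF_NETLINK,
                                 NETLINK_SOCK_DIAG, family);
          }
  }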
v1->v2:
- move the inet proto check to inet_diag to avoid a compile error.
v2->v3:
- define sock_load_diag_module in sock.c and export one symbol
only.
- improve the changelog.
Reported-by: Sabrina Dubroca <[email protected]>
Acked-by: Marcelo Ricardo Leitner <[email protected]>
Acked-by: Phil Sutter <[email protected]>
Acked-by: Sabrina Dubroca <[email protected]>
Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
In 664fcf123a30e ("net: phy: Threaded interrupts allow some
simplification") the phy_interrupt system was changed to use a
traditional threaded interrupt scheme instead of a workqueue approach.
With this change, the phy status check moved into phy_change, which
did not report back to the caller whether or not the interrupt was
handled. This means that, in the case of a shared phy interrupt,
only the first phydev's interrupt registers are checked (since
phy_interrupt() would always return IRQ_HANDLED). This leads to
interrupt storms when it is a secondary device that's actually the
interrupt source.
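A sketch of the direction such a fix can take (did_interrupt() is an existing phy_driver callback; the actual patch may differ in detail):

  static irqreturn_t phy_interrupt(int irq, void *phy_dat)
  {
          struct phy_device *phydev = phy_dat;

          /* Report back when this PHY wasn't the source, so the core
           * keeps checking the other handlers sharing the line: */
          if (phydev->drv->did_interrupt &&
              !phydev->drv->did_interrupt(phydev))
                  return IRQ_NONE;

          /* ... handle the state change ... */
          return IRQ_HANDLED;
  }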
Signed-off-by: Brad Mouring <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
When an event group contains more events than can be scheduled on the
hardware, iterating the full event group for ctx_sched_out is a waste
of time.
Keep track of the events that got programmed on the hardware, such
that we can iterate this smaller list in order to schedule them out.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Alexey Budankov <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Carrillo-Cisneros <[email protected]>
Cc: Dmitri Prokhorov <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Valery Cherepennikov <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Now that all the grouping is done with RB trees, we no longer need
group_entry and can replace the whole thing with sibling_list.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Alexey Budankov <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Carrillo-Cisneros <[email protected]>
Cc: Dmitri Prokhorov <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Valery Cherepennikov <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Change event groups into RB trees sorted by CPU and then by a 64bit
index, so that the multiplexing hrtimer interrupt handler can skip to
the current CPU's list and ignore groups allocated for the other CPUs.
A new API for manipulating event groups in the trees is implemented, as
well as adoption of the API in the current implementation.
pinned_group_sched_in() and flexible_group_sched_in() APIs are
introduced to consolidate code enabling the whole group from pinned
and flexible groups appropriately.
Signed-off-by: Alexey Budankov <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Carrillo-Cisneros <[email protected]>
Cc: Dmitri Prokhorov <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Valery Cherepennikov <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
It's unused:
  $ git grep "\<TASK_ALL\>" | wc -l
  1
... and it is also dangerous, kill the bugger.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Presently, only ARM uses mm_struct to manage EFI page tables and EFI
runtime region mappings. As this is the preferred approach, let's make
this data structure common across architectures. Specifically, for x86,
using this data structure improves code maintainability and readability.
Tested-by: Bhupesh Sharma <[email protected]>
[ardb: don't #include the world to get a declaration of struct mm_struct]
Signed-off-by: Sai Praneeth Prakhya <[email protected]>
Signed-off-by: Ard Biesheuvel <[email protected]>
Reviewed-by: Matt Fleming <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Lee, Chun-Yi <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ravi Shankar <[email protected]>
Cc: Ricardo Neri <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
recent and hashlimit both create /proc files, but only check that the
name is 0-terminated.
This can trigger a WARN() from procfs when the name is "" or "/".
Add a helper for this and then use it for both.
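A sketch of such a helper, with the checks the description implies:

  static int xt_check_proc_name(const char *name, unsigned int size)
  {
          if (name[0] == '\0')
                  return -EINVAL;
          if (strnlen(name, size) == size)
                  return -ENAMETOOLONG;   /* not 0-terminated */
          if (strchr(name, '/') ||
              strcmp(name, ".") == 0 || strcmp(name, "..") == 0)
                  return -EINVAL;         /* procfs would WARN() */
          return 0;
  }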
Cc: Eric Dumazet <[email protected]>
Reported-by: Eric Dumazet <[email protected]>
Reported-by: <[email protected]>
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:
- Miscellaneous fixes, perhaps most notably removing obsolete
code whose only purpose in life was to gather information for
the now-removed RCU debugfs facility. Other notable changes
include removing NO_HZ_FULL_ALL in favor of the nohz_full kernel
boot parameter, minor optimizations for expedited grace periods,
some added tracing, creating an RCU-specific workqueue using Tejun's
new WQ_MEM_RECLAIM flag, and several cleanups to code and comments.
- SRCU cleanups and optimizations.
- Torture-test updates, perhaps most notably the adding of ARMv8
support, but also including numerous cleanups and usability fixes.
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Remove the MN10300 arch as the hardware is defunct.
Suggested-by: Arnd Bergmann <[email protected]>
Signed-off-by: David Howells <[email protected]>
cc: Masahiro Yamada <[email protected]>
cc: [email protected]
Signed-off-by: Arnd Bergmann <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI fixes from Bjorn Helgaas:
- fix sparc build issue when OF_IRQ not enabled (Guenter Roeck)
- fix enumeration of devices below switches on DesignWare-based
controllers (Koen Vandeputte)
* tag 'pci-v4.16-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI: dwc: Fix enumeration end when reaching root subordinate
PCI: Move of_irq_parse_and_map_pci() declaration under OF_IRQ
|
|
After commit fc72d1d54dd9 ("tuntap: XDP transmission"), we can
actually queue XDP pointers in the pointer ring, so we should
examine the pointer type before freeing the pointer.
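A sketch of the resulting free path, assuming the pointer-tagging helpers introduced by the XDP transmission patch:

  void tun_ptr_free(void *ptr)
  {
          if (!ptr)
                  return;
          if (tun_is_xdp_buff(ptr)) {
                  struct xdp_buff *xdp = tun_ptr_to_xdp(ptr);

                  put_page(virt_to_head_page(xdp->data));
          } else {
                  kfree_skb(ptr);
          }
  }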
Fixes: fc72d1d54dd9 ("tuntap: XDP transmission")
Reported-by: Michael S. Tsirkin <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
It was suggested that a migration hint might be useful for the
CPU-freq governors.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The primary observation is that nohz enter/exit is always from the
current CPU, therefore NOHZ_TICK_STOPPED does not in fact need to be
an atomic.
Secondary is that we appear to have 2 nearly identical hooks in the
nohz enter code, set_cpu_sd_state_idle() and
nohz_balance_enter_idle(). Fold the whole set_cpu_sd_state thing into
nohz_balance_{enter,exit}_idle.
Removes an atomic op from both enter and exit paths.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Instead of trying to duplicate scheduler state to track if an RT task
is running, directly use the scheduler runqueue state for it.
This vastly simplifies things and fixes a number of bugs related to
sugov and the scheduler getting out of sync wrt this state.
As a consequence, we now also update the remote cfs/dl state when
iterating the shared mask.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Bitrot...
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|