Age | Commit message (Collapse) | Author | Files | Lines |
|
Instead of defining the local struct vlantag use the standard definition
of vlan_hdr from <linux/if_vlan.h>.
The fields in the <linux/if_vlan.h> definition have different names:
- vlan -> h_vlan_TCI
- subtype -> h_vlan_encapsulated_proto
While there, use also the ETH_P_IP macro instead of an hard-coded 0x0800
value.
Signed-off-by: Luis Henriques <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
Instead of defining a local version of struct ethhdr use the standard
definition from <linux/if_ether.h>.
The fields in the <linux/if_ether.h> definition have different names:
- dest -> h_dest
- src -> h_source
- type -> h_proto
While there, use a few other standard functions/macros:
- eth_broadcast_addr (instead of a memset)
- ETH_ALEN
- ETH_P_8021Q
Signed-off-by: Luis Henriques <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
Now that other PTE fields have been moved out of the way, we can
expand the RPN field of the PTE on 64-bit Book 3S systems and align
it with the RPN field in the radix PTE format used by PowerISA v3.0
CPUs in radix mode. For 64k page size, this means we need to move
the _PAGE_COMBO and _PAGE_4K_PFN bits.
Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
This moves the _PAGE_SPECIAL and _PAGE_SOFT_DIRTY bits in the Linux
PTE on 64-bit Book 3S systems to bit positions which are designated
for software use in the radix PTE format used by PowerISA v3.0 CPUs
in radix mode.
Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
This moves the _PAGE_EXEC, _PAGE_RW and _PAGE_USER bits around in
the Linux PTE on 64-bit Book 3S systems to correspond with the bit
positions used in radix mode by PowerISA v3.0 CPUs. This also adds
a _PAGE_READ bit corresponding to the read permission bit in the
radix PTE. _PAGE_READ is currently unused but could possibly be used
in future to improve pte_protnone().
Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
This moves the _PAGE_HASHPTE, _PAGE_F_GIX and _PAGE_F_SECOND fields in
the Linux PTE on 64-bit Book 3S systems to the most significant byte.
Of the 5 bits, one is a software-use bit and the other four are
reserved bit positions in the PowerISA v3.0 radix PTE format.
Using these bits is OK because these bits are all to do with tracking
the HPTE(s) associated with the Linux PTE, and therefore won't be
needed in radix mode. This frees up bit positions in the lower two
bytes.
Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
This changes _PAGE_PTE for 64-bit Book 3S processors from 0x1 to
0x4000_0000_0000_0000, because that bit is used as the L (leaf)
bit by PowerISA v3.0 CPUs in radix mode. The "leaf" bit indicates
that the PTE points to a page directly rather than another radix
level, which is what the _PAGE_PTE bit means.
Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
This changes _PAGE_PRESENT for 64-bit Book 3S processors from 0x2 to
0x8000_0000_0000_0000, because that is where PowerISA v3.0 CPUs in
radix mode will expect to find it.
Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
This changes the Linux page tables to store physical addresses
rather than kernel virtual addresses in the upper levels of the
tree (pgd, pud and pmd) for 64-bit Book 3S machines.
This also changes the hugepd pointers used to implement hugepages
when the base page size is 4k to store physical addresses rather than
virtual addresses (again just for 64-bit Book3S machines).
This frees up some high order bits, and will be needed with
PowerISA v3.0 machines which read the page table tree in hardware
in radix mode.
Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
applying new changes
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Redirecting the wakeup of a VCPU from the H_IPI hypercall to
a core running in the host is usually a good idea, most workloads
seemed to benefit. However, in one heavily interrupt-driven SMT1
workload, some regression was observed. This patch adds a kvm_hv
module parameter called h_ipi_redirect to control this feature.
The default value for this tunable is 1 - that is enable the feature.
Signed-off-by: Suresh Warrier <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
|
|
This patch adds support to real-mode KVM to search for a core
running in the host partition and send it an IPI message with
VCPU to be woken. This avoids having to switch to the host
partition to complete an H_IPI hypercall when the VCPU which
is the target of the the H_IPI is not loaded (is not running
in the guest).
The patch also includes the support in the IPI handler running
in the host to do the wakeup by calling kvmppc_xics_ipi_action
for the PPC_MSG_RM_HOST_ACTION message.
When a guest is being destroyed, we need to ensure that there
are no pending IPIs waiting to wake up a VCPU before we free
the VCPUs of the guest. This is accomplished by:
- Forces a PPC_MSG_CALL_FUNCTION IPI to be completed by all CPUs
before freeing any VCPUs in kvm_arch_destroy_vm().
- Any PPC_MSG_RM_HOST_ACTION messages must be executed first
before any other PPC_MSG_CALL_FUNCTION messages.
Signed-off-by: Suresh Warrier <[email protected]>
Acked-by: Michael Ellerman <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
|
|
This patch adds the support for the kick VCPU operation for
kvmppc_host_rm_ops. The kvmppc_xics_ipi_action() function
provides the function to be invoked for a host side operation
when poked by the real mode KVM. This is initiated by KVM by
sending an IPI to any free host core.
KVM real mode must set the rm_action to XICS_RM_KICK_VCPU and
rm_data to point to the VCPU to be woken up before sending the IPI.
Note that we have allocated one kvmppc_host_rm_core structure
per core. The above values need to be set in the structure
corresponding to the core to which the IPI will be sent.
Signed-off-by: Suresh Warrier <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
|
|
The kvmppc_host_rm_ops structure keeps track of which cores are
are in the host by maintaining a bitmask of active/runnable
online CPUs that have not entered the guest. This patch adds
support to manage the bitmask when a CPU is offlined or onlined
in the host.
Signed-off-by: Suresh Warrier <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
|
|
Update the core host state in kvmppc_host_rm_ops whenever
the primary thread of the core enters the guest or returns
back.
Signed-off-by: Suresh Warrier <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
|
|
This patch defines the data structures to support the setting up
of host side operations while running in real mode in the guest,
and also the functions to allocate and free it.
The operations are for now limited to virtual XICS operations.
Currently, we have only defined one operation in the data
structure:
- Wake up a VCPU sleeping in the host when it
receives a virtual interrupt
The operations are assigned at the core level because PowerKVM
requires that the host run in SMT off mode. For each core,
we will need to manage its state atomically - where the state
is defined by:
1. Is the core running in the host?
2. Is there a Real Mode (RM) operation pending on the host?
Currently, core state is only managed at the whole-core level
even when the system is in split-core mode. This just limits
the number of free or "available" cores in the host to perform
any host-side operations.
The kvmppc_host_rm_core.rm_data allows any data to be passed by
KVM in real mode to the host core along with the operation to
be performed.
The kvmppc_host_rm_ops structure is allocated the very first time
a guest VM is started. Initial core state is also set - all online
cores are in the host. This structure is never deleted, not even
when there are no active guests. However, it needs to be freed
when the module is unloaded because the kvmppc_host_rm_ops_hv
can contain function pointers to kvm-hv.ko functions for the
different supported host operations.
Signed-off-by: Suresh Warrier <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
|
|
Function to cause an IPI by directly updating the MFFR register
in the XICS. The function is meant for real-mode callers since
they cannot use the smp_ops->cause_ipi function which uses an
ioremapped address.
Normal usage is for the the KVM real mode code to set the IPI message
using smp_muxed_ipi_message_pass and then invoke icp_native_cause_ipi_rm
to cause the actual IPI.
The function requires kvm_hstate.xics_phys to have been initialized
with the physical address of XICS.
Signed-off-by: Suresh Warrier <[email protected]>
Acked-by: Michael Ellerman <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
|
|
smp_muxed_ipi_message_pass() invokes smp_ops->cause_ipi, which
uses an ioremapped address to access registers on the XICS
interrupt controller to cause the IPI. Because of this real
mode callers cannot call smp_muxed_ipi_message_pass() for IPI
messaging.
This patch creates a separate function smp_muxed_ipi_set_message
just to set the IPI message without the cause_ipi routine.
After calling this function to set the IPI message, real
mode callers must cause the IPI by writing to the XICS registers
directly.
As part of this, we also change smp_muxed_ipi_message_pass
to call smp_muxed_ipi_set_message to set the message instead
of doing it directly inside the routine.
Signed-off-by: Suresh Warrier <[email protected]>
Acked-by: Michael Ellerman <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
|
|
This patch increases the number of demuxed messages for a
controller with a single ipi to 8 for 64-bit systems.
This is required because we want to use the IPI mechanism
to send messages from a CPU running in KVM real mode in a
guest to a CPU in the host to take some action. Currently,
we only support 4 messages and all 4 are already taken.
Define a fifth message PPC_MSG_RM_HOST_ACTION for this
purpose.
Signed-off-by: Suresh Warrier <[email protected]>
Acked-by: Michael Ellerman <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
|
|
Replace calls to get_random_int() followed by a cast to (unsigned long)
with calls to get_random_long(). Also address shifting bug which, in
case of x86 removed entropy mask for mmap_rnd_bits values > 31 bits.
Signed-off-by: Daniel Cashman <[email protected]>
Acked-by: Kees Cook <[email protected]>
Cc: "Theodore Ts'o" <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Nick Kralevich <[email protected]>
Cc: Jeff Vander Stoep <[email protected]>
Cc: Mark Salyzyn <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This frees up bits 57-63 in the Linux PTE on 64-bit Book 3S machines.
In the 4k page case, this is done just by reducing the size of the
RPN field to 39 bits, giving 51-bit real addresses. In the 64k page
case, we had 10 unused bits in the middle of the PTE, so this moves
the RPN field down 10 bits to make use of those unused bits. This
means the RPN field is now 3 bits larger at 37 bits, giving 53-bit
real addresses in the normal case, or 49-bit real addresses for the
special 4k PFN case.
We are doing this in order to be able to move some other PTE bits
into the positions where PowerISA V3.0 processors will expect to
find them in radix-tree mode. Ultimately we will be able to move
the RPN field to lower bit positions and make it larger.
Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
No code changes.
Signed-off-by: Paul Mackerras <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- eeh: Fix partial hotplug criterion from Gavin Shan
- mm: Clear the invalid slot information correctly from Aneesh Kumar K.V
* tag 'powerpc-4.5-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/mm/hash: Clear the invalid slot information correctly
powerpc/eeh: Fix partial hotplug criterion
|
|
This patch add the SO_CNX_ADVICE socket option (setsockopt only). The
purpose is to allow an application to give feedback to the kernel about
the quality of the network path for a connected socket. The value
argument indicates the type of quality report. For this initial patch
the only supported advice is a value of 1 which indicates "bad path,
please reroute"-- the action taken by the kernel is to call
dst_negative_advice which will attempt to choose a different ECMP route,
reset the TX hash for flow label and UDP source port in encapsulation,
etc.
This facility should be useful for connected UDP sockets where only the
application can provide any feedback about path quality. It could also
be useful for TCP applications that have additional knowledge about the
path outside of the normal TCP control loop.
Signed-off-by: Tom Herbert <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Pull in our current fixes from 4.5, in particular the "Fix Multi hit
ERAT" bug is causing folks some grief when testing next.
|
|
The problem:
On -rt, an emulated LAPIC timer instances has the following path:
1) hard interrupt
2) ksoftirqd is scheduled
3) ksoftirqd wakes up vcpu thread
4) vcpu thread is scheduled
This extra context switch introduces unnecessary latency in the
LAPIC path for a KVM guest.
The solution:
Allow waking up vcpu thread from hardirq context,
thus avoiding the need for ksoftirqd to be scheduled.
Normal waitqueues make use of spinlocks, which on -RT
are sleepable locks. Therefore, waking up a waitqueue
waiter involves locking a sleeping lock, which
is not allowed from hard interrupt context.
cyclictest command line:
This patch reduces the average latency in my tests from 14us to 11us.
Daniel writes:
Paolo asked for numbers from kvm-unit-tests/tscdeadline_latency
benchmark on mainline. The test was run 1000 times on
tip/sched/core 4.4.0-rc8-01134-g0905f04:
./x86-run x86/tscdeadline_latency.flat -cpu host
with idle=poll.
The test seems not to deliver really stable numbers though most of
them are smaller. Paolo write:
"Anything above ~10000 cycles means that the host went to C1 or
lower---the number means more or less nothing in that case.
The mean shows an improvement indeed."
Before:
min max mean std
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 5162.596000 2019270.084000 5824.491541 20681.645558
std 75.431231 622607.723969 89.575700 6492.272062
min 4466.000000 23928.000000 5537.926500 585.864966
25% 5163.000000 1613252.750000 5790.132275 16683.745433
50% 5175.000000 2281919.000000 5834.654000 23151.990026
75% 5190.000000 2382865.750000 5861.412950 24148.206168
max 5228.000000 4175158.000000 6254.827300 46481.048691
After
min max mean std
count 1000.000000 1000.00000 1000.000000 1000.000000
mean 5143.511000 2076886.10300 5813.312474 21207.357565
std 77.668322 610413.09583 86.541500 6331.915127
min 4427.000000 25103.00000 5529.756600 559.187707
25% 5148.000000 1691272.75000 5784.889825 17473.518244
50% 5160.000000 2308328.50000 5832.025000 23464.837068
75% 5172.000000 2393037.75000 5853.177675 24223.969976
max 5222.000000 3922458.00000 6186.720500 42520.379830
[Patch was originaly based on the swait implementation found in the -rt
tree. Daniel ported it to mainline's version and gathered the
benchmark numbers for tscdeadline_latency test.]
Signed-off-by: Daniel Wagner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Cc: Boqun Feng <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
I ran into this issue while debugging an early boot problem. The system
hit a BUG_ON() but report bug failed to print the line number and file
name. The reason being that the system was running in real mode and
report_bug() searches for addresses in the PAGE_OFFSET+ region.
Suggested-by: Paul Mackerras <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
__xchg_called_with_bad_pointer() can't tell us which code uses {cmp}xchg
with an unsupported size, and no error is reported until the link stage.
To make such problems easier to debug, use BUILD_BUG_ON_MSG() instead.
Signed-off-by: pan xinhui <[email protected]>
[mpe: Tweak change log wording & add relaxed/acquire]
Signed-off-by: Michael Ellerman <[email protected]>
fixup
|
|
Most current OpenPOWER platforms have an AST BMC, so add graphics
support via the AST DRM driver.
Signed-off-by: Jeremy Kerr <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
There are a few firmware-provided interfaces for OpenPOWER platforms:
the PRD infrastructure, IPMI support, and MTD access to the PNOR flash.
This change adds these to powernv_defconfig
Signed-off-by: Jeremy Kerr <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
This change adds a defconfig for the non-virtualised power platforms,
based on pseries_defconfig, but without pseries, and little-endian,
and no OF trampoline.
Signed-off-by: Jeremy Kerr <[email protected]>
Acked-by: Joel Stanley <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
Add a cputable entry for POWER9. More code is required to actually
boot and run on a POWER9 but this gets the base piece in which we can
start building on.
Copies over from POWER8 except for:
- Adds a new CPU_FTR_ARCH_300 bit to start hanging new architecture
features from (in subsequent patches).
- Advertises new user features bits PPC_FEATURE2_ARCH_3_00 &
HAS_IEEE128 when on POWER9.
- Drops CPU_FTR_SUBCORE.
- Drops PMU code and machine check.
Signed-off-by: Michael Neuling <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
Use defines for literals __init_tlb_power[78] rather than hand coding
them.
Signed-off-by: Michael Neuling <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
Subcores isn't really part of the 2.07 architecture but currently we
turn it on using the 2.07 feature bit. Subcores is really a POWER8
specific feature.
This adds a new CPU_FTR bit just for subcores and moves the subcore
init code over to use this.
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Michael Neuling <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
When initialising OPAL interfaces, there is a possibility that
opal_msglog_init() may fail to initialise the msglog/memory console.
Fix opal_msglog_sysfs_init() so it doesn't try to create sysfs entry for
the msglog if this occurs.
Suggested-by: Joel Stanley <[email protected]>
Fixes: 9b4fffa14906 ("powerpc/powernv: new function to access OPAL msglog")
Signed-off-by: Andrew Donnellan <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
We can get a hash pte fault with 4k base page size and find the pte
already inserted with 64K base page size. In that case we need to clear
the existing slot information from the old pte. Fix this correctly
With THP, we also clear the slot information with respect to all
the 64K hash pte mapping that 16MB page. They are all invalid
now. This make sure we don't find the slot valid when we fault with
4k base page size. Finding the slot valid should not result in any wrong
behavior because we do check again in hash page table for the validity.
But we can avoid that check completely.
Fixes: a43c0eb8364c022 ("powerpc/mm: Convert 4k hash insert to C")
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
During error recovery, the device could be removed as part of the
partial hotplug. The criterion used to come with partial hotplug
is: if the device driver provides error_detected(), slot_reset()
and resume() callbacks, it's immune from hotplug. Otherwise,
it's going to experience partial hotplug during EEH recovery. But
the criterion isn't correct enough: mlx4_core driver for Mellanox
adapters provides error_detected(), slot_reset() callbacks, but
resume() isn't there. Those Mellanox adapters won't be to involved
in the partial hotplug.
This fixes the criterion to a practical one: adpater with driver
that provides error_detected(), slot_reset() will be immune from
partial hotplug. resume() isn't mandatory.
Fixes: f2da4ccf ("powerpc/eeh: More relaxed hotplug criterion")
Cc: [email protected] #v4.4+
Signed-off-by: Gavin Shan <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix build error on 32-bit with checkpoint restart from Aneesh Kumar
- Fix dedotify for binutils >= 2.26 from Andreas Schwab
- Don't trace hcalls on offline CPUs from Denis Kirjanov
- eeh: Fix stale cached primary bus from Gavin Shan
- eeh: Fix stale PE primary bus from Gavin Shan
- mm: Fix Multi hit ERAT cause by recent THP update from Aneesh Kumar K.V
- ioda: Set "read" permission when "write" is set from Alexey Kardashevskiy
* tag 'powerpc-4.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/ioda: Set "read" permission when "write" is set
powerpc/mm: Fix Multi hit ERAT cause by recent THP update
powerpc/powernv: Fix stale PE primary bus
powerpc/eeh: Fix stale cached primary bus
powerpc/pseries: Don't trace hcalls on offline CPUs
powerpc: Fix dedotify for binutils >= 2.26
powerpc/book3s_32: Fix build error with checkpoint restart
|
|
This plumbs a protection key through calc_vm_flag_bits(). We
could have done this in calc_vm_prot_bits(), but I did not feel
super strongly which way to go. It was pretty arbitrary which
one to use.
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arve Hjønnevåg <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Chen Gang <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Geliang Tang <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Maxime Coquelin <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Riley Andrews <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
As discussed earlier, we attempt to enforce protection keys in
software.
However, the code checks all faults to ensure that they are not
violating protection key permissions. It was assumed that all
faults are either write faults where we check PKRU[key].WD (write
disable) or read faults where we check the AD (access disable)
bit.
But, there is a third category of faults for protection keys:
instruction faults. Instruction faults never run afoul of
protection keys because they do not affect instruction fetches.
So, plumb the PF_INSTR bit down in to the
arch_vma_access_permitted() function where we do the protection
key checks.
We also add a new FAULT_FLAG_INSTRUCTION. This is because
handle_mm_fault() is not passed the architecture-specific
error_code where we keep PF_INSTR, so we need to encode the
instruction fetch information in to the arch-generic fault
flags.
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
We try to enforce protection keys in software the same way that we
do in hardware. (See long example below).
But, we only want to do this when accessing our *own* process's
memory. If GDB set PKRU[6].AD=1 (disable access to PKEY 6), then
tried to PTRACE_POKE a target process which just happened to have
some mprotect_pkey(pkey=6) memory, we do *not* want to deny the
debugger access to that memory. PKRU is fundamentally a
thread-local structure and we do not want to enforce it on access
to _another_ thread's data.
This gets especially tricky when we have workqueues or other
delayed-work mechanisms that might run in a random process's context.
We can check that we only enforce pkeys when operating on our *own* mm,
but delayed work gets performed when a random user context is active.
We might end up with a situation where a delayed-work gup fails when
running randomly under its "own" task but succeeds when running under
another process. We want to avoid that.
To avoid that, we use the new GUP flag: FOLL_REMOTE and add a
fault flag: FAULT_FLAG_REMOTE. They indicate that we are
walking an mm which is not guranteed to be the same as
current->mm and should not be subject to protection key
enforcement.
Thanks to Jerome Glisse for pointing out this scenario.
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Cc: Alexey Kardashevskiy <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Boaz Harrosh <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Gibson <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Dominik Dingel <[email protected]>
Cc: Dominik Vogt <[email protected]>
Cc: Eric B Munson <[email protected]>
Cc: Geliang Tang <[email protected]>
Cc: Guan Xuetao <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jason Low <[email protected]>
Cc: Jerome Marchand <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Laurent Dufour <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mikulas Patocka <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Shachar Raindel <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Xie XiuQi <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Today, for normal faults and page table walks, we check the VMA
and/or PTE to ensure that it is compatible with the action. For
instance, if we get a write fault on a non-writeable VMA, we
SIGSEGV.
We try to do the same thing for protection keys. Basically, we
try to make sure that if a user does this:
mprotect(ptr, size, PROT_NONE);
*ptr = foo;
they see the same effects with protection keys when they do this:
mprotect(ptr, size, PROT_READ|PROT_WRITE);
set_pkey(ptr, size, 4);
wrpkru(0xffffff3f); // access disable pkey 4
*ptr = foo;
The state to do that checking is in the VMA, but we also
sometimes have to do it on the page tables only, like when doing
a get_user_pages_fast() where we have no VMA.
We add two functions and expose them to generic code:
arch_pte_access_permitted(pte_flags, write)
arch_vma_access_permitted(vma, write)
These are, of course, backed up in x86 arch code with checks
against the PTE or VMA's protection key.
But, there are also cases where we do not want to respect
protection keys. When we ptrace(), for instance, we do not want
to apply the tracer's PKRU permissions to the PTEs from the
process being traced.
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Cc: Alexey Kardashevskiy <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Boaz Harrosh <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Gibson <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: David Vrabel <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Dominik Dingel <[email protected]>
Cc: Dominik Vogt <[email protected]>
Cc: Guan Xuetao <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Jason Low <[email protected]>
Cc: Jerome Marchand <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Laurent Dufour <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mikulas Patocka <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Shachar Raindel <[email protected]>
Cc: Stephen Smalley <[email protected]>
Cc: Toshi Kani <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Implement cmpxchg{,64}_relaxed and atomic{,64}_cmpxchg_relaxed, based on
which _release variants can be built.
To avoid superfluous barriers in _acquire variants, we implement these
operations with assembly code rather use __atomic_op_acquire() to build
them automatically.
For the same reason, we keep the assembly implementation of fully
ordered cmpxchg operations.
However, we don't do the similar for _release, because that will require
putting barriers in the middle of ll/sc loops, which is probably a bad
idea.
Note cmpxchg{,64}_relaxed and atomic{,64}_cmpxchg_relaxed are not
compiler barriers.
Signed-off-by: Boqun Feng <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
Implement xchg{,64}_relaxed and atomic{,64}_xchg_relaxed, based on these
_relaxed variants, release/acquire variants and fully ordered versions
can be built.
Note that xchg{,64}_relaxed and atomic_{,64}_xchg_relaxed are not
compiler barriers.
Signed-off-by: Boqun Feng <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
On powerpc, acquire and release semantics can be achieved with
lightweight barriers("lwsync" and "ctrl+isync"), which can be used to
implement __atomic_op_{acquire,release}.
For release semantics, since we only need to ensure all memory accesses
that issue before must take effects before the -store- part of the
atomics, "lwsync" is what we only need. On the platform without
"lwsync", "sync" should be used. Therefore in __atomic_op_release() we
use PPC_RELEASE_BARRIER.
For acquire semantics, "lwsync" is what we only need for the similar
reason. However on the platform without "lwsync", we can use "isync"
rather than "sync" as an acquire barrier. Therefore in
__atomic_op_acquire() we use PPC_ACQUIRE_BARRIER, which is barrier() on
UP, "lwsync" if available and "isync" otherwise.
Implement atomic{,64}_{add,sub,inc,dec}_return_relaxed, and build other
variants with these helpers.
Signed-off-by: Boqun Feng <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
I spent some time trying to use kgdb and debugged my inability to
resume from kgdb_handle_breakpoint(). NIP is not incremented
and that leads to a loop in the debugger.
I've tested this lightly on a virtual instance with KDB enabled.
After the patch, I am able to get the "go" command to work as
expected.
Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
Quite often drivers set only "write" permission assuming that this
includes "read" permission as well and this works on plenty of
platforms. However IODA2 is strict about this and produces an EEH when
"read" permission is not set and reading happens.
This adds a workaround in the IODA code to always add the "read" bit
when the "write" bit is set.
Fixes: 10b35b2b7485 ("powerpc/powernv: Do not set "read" flag if direction==DMA_NONE")
Cc: [email protected] # 4.2+
Cc: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Tested-by: Douglas Miller <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
|
|
Commit 28856a9e52c7 missed the addition of the crypto/xts.h include file
for different architecture-specific AES implementations.
Signed-off-by: Stephan Mueller <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>
|
|
The patch centralizes the XTS key check logic into the service function
xts_check_key which is invoked from the different XTS implementations.
With this, the XTS implementations in ARM, ARM64, PPC and S390 have now
a sanity check for the XTS keys similar to the other arches.
In addition, this service function received a check to ensure that the
key != the tweak key which is mandated by FIPS 140-2 IG A.9. As the
check is not present in the standards defining XTS, it is only enforced
in FIPS mode of the kernel.
Signed-off-by: Stephan Mueller <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>
|