Age | Commit message (Collapse) | Author | Files | Lines |
|
In virtualization environment, PV extensions (drivers, interrupts,
timers, etc) are enabled in the majority of use cases which is the
best option.
However, in some cases (kexec not fully working, benchmarking)
we want to disable PV extensions. We have "xen_nopv" for that purpose
but only for XEN. For a consistent admin experience a common command
line parameter "nopv" set across all PV guest implementations is a
better choice.
There are guest types which just won't work without PV extensions,
like Xen PV, Xen PVH and jailhouse. add a "ignore_nopv" member to
struct hypervisor_x86 set to true for those guest types and call
the detect functions only if nopv is false or ignore_nopv is true.
Suggested-by: Juergen Gross <[email protected]>
Signed-off-by: Zhenzhong Duan <[email protected]>
Reviewed-by: Juergen Gross <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Jan Kiszka <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: Stefano Stabellini <[email protected]>
Signed-off-by: Juergen Gross <[email protected]>
|
|
.. as they are only called at early bootup stage. In fact, other
functions in x86_hyper_xen_hvm.init.* are all marked as __init.
Unexport xen_hvm_need_lapic as it's never used outside.
Signed-off-by: Zhenzhong Duan <[email protected]>
Reviewed-by: Juergen Gross <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: Stefano Stabellini <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Signed-off-by: Juergen Gross <[email protected]>
|
|
The Xen tmem (transcendent memory) driver can be removed, as the
related Xen hypervisor feature never made it past the "experimental"
state and will be removed in future Xen versions (>= 4.13).
The xen-selfballoon driver depends on tmem, so it can be removed, too.
Signed-off-by: Juergen Gross <[email protected]>
Acked-by: Boris Ostrovsky <[email protected]>
Signed-off-by: Juergen Gross <[email protected]>
|
|
initialized"
This reverts commit ca5d376e17072c1b60c3fee66f3be58ef018952d.
Commit 8990cac6e5ea ("x86/jump_label: Initialize static branching
early") adds jump_label_init() call in setup_arch() to make static
keys initialized early, so we could use the original simpler code
again.
Signed-off-by: Zhenzhong Duan <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Signed-off-by: Juergen Gross <[email protected]>
|
|
When binding an interdomain event channel to a vcpu via
IOCTL_EVTCHN_BIND_INTERDOMAIN not only the event channel needs to be
bound, but the affinity of the associated IRQi must be changed, too.
Otherwise the IRQ and the event channel won't be moved to another vcpu
in case the original vcpu they were bound to is going offline.
Cc: <[email protected]> # 4.13
Fixes: c48f64ab472389df ("xen-evtchn: Bind dyn evtchn:qemu-dm interrupt to next online VCPU")
Signed-off-by: Juergen Gross <[email protected]>
Reviewed-by: Boris Ostrovsky <[email protected]>
Signed-off-by: Juergen Gross <[email protected]>
|
|
When using a virt_boundary_mask, as done for NVMe devices attached to
megaraid_sas controllers, we require an unlimited max_segment_size as the
virt boundary merging code assumes that. But we also need to propagate
that to the DMA mapping layer to make dma-debug happy. The SCSI layer
takes care of that when using the per-host virt_boundary setting, but
given that megaraid_sas only wants to set the virt_boundary for actual
NVMe devices, we can't rely on that. The DMA layer maximum segment is
global to the HBA however, so we have to set it explicitly. This patch
assumes that megaraid_sas does not have a segment size limitation, which
seems true based on the SGL format, but will need to be verified.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
|
|
When using a virt_boundary_mask, as done for NVMe devices attached to
mpt3sas controllers, we require an unlimited max_segment_size as the virt
boundary merging code assumes that. But we also need to propagate that to
the DMA mapping layer to make dma-debug happy. The SCSI layer takes care
of that when using the per-host virt_boundary setting, but given that
mpt3sas only wants to set the virt_boundary for actual NVMe devices, we
can't rely on that. The DMA layer maximum segment is global to the HBA
however, so we have to set it explicitly. This patch assumes that mpt3sas
does not have a segment size limitation, which seems true based on the SGL
format, but will need to be verified.
Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Suganath Prabu <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
|
|
This ensures all proper DMA layer handling is taken care of by the SCSI
midlayer.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Acked-by: Bart Van Assche <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
|
|
This ensures all proper DMA layer handling is taken care of by the SCSI
midlayer.
Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Sagi Grimberg <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
|
|
This ensures all proper DMA layer handling is taken care of by the SCSI
midlayer.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
|
|
We need to also mirror the value to the device to ensure IOMMU merging
doesn't undo it, and the SCSI host level parameter will ensure that.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
|
|
We need to limit the device's max_sectors to what the DMA mapping
implementation can support. If not, we risk running out of swiotlb
buffers easily.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
|
|
This allows drivers setting it up easily instead of branching out to block
layer calls in slave_alloc, and ensures the upgraded max_segment_size
setting gets picked up by the DMA layer.
Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Kashyap Desai < [email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
|
|
We used to need rather convoluted ordering trickery to guarantee
that dput() of ex-mountpoints happens before the final mntput()
of the same. Since we don't need that anymore, there's no point
playing with fs_pin for that.
Signed-off-by: Al Viro <[email protected]>
|
|
Lift getting the original mount (dentry is actually not needed at all)
of the mountpoint into the callers - to do_move_mount() and pivot_root()
level. That simplifies the cleanup in those and allows to get saner
arguments for attach_mnt_recursive().
Signed-off-by: Al Viro <[email protected]>
|
|
This patch fixes below sparse warning related to __virtio
type in virtio pmem driver. This is reported by Intel test
bot on linux-next tree.
nd_virtio.c:56:28: warning: incorrect type in assignment
(different base types)
nd_virtio.c:56:28: expected unsigned int [unsigned] [usertype] type
nd_virtio.c:56:28: got restricted __virtio32
nd_virtio.c:93:59: warning: incorrect type in argument 2
(different base types)
nd_virtio.c:93:59: expected restricted __virtio32 [usertype] val
nd_virtio.c:93:59: got unsigned int [unsigned] [usertype] ret
Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Pankaj Gupta <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
|
|
Using dput_to_list() to shift the contributing reference from ->mnt_mountpoint
to ->mnt_mp->m_dentry. Dentries are dropped (with dput_to_list()) as soon
as struct mountpoint is destroyed; in cases where we are under namespace_sem
we use the global list, shrinking it in namespace_unlock(). In case of
detaching stuck MNT_LOCKed children at final mntput_no_expire() we use a local
list and shrink it ourselves. ->mnt_ex_mountpoint crap is gone.
Signed-off-by: Al Viro <[email protected]>
|
|
When scsi_init_sense_cache(host) is called concurrently from different
hosts, each code path may find that no cache has been created and
allocate a new one. The lack of locking can lead to potentially
overriding a cache allocated by a different host.
Fix the issue by moving 'mutex_lock(&scsi_sense_cache_mutex)' before
scsi_select_sense_cache().
Fixes: 0a6ac4ee7c21 ("scsi: respect unchecked_isa_dma for blk-mq")
Cc: Stable <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Cc: Ewan D. Milne <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
|
|
kbuild test robot gets the following compilation warning using gcc 7.4
cross compilation for c6x (GCC_VERSION=7.4.0 make.cross ARCH=c6x).
In file included from include/asm-generic/bug.h:18:0,
from arch/c6x/include/asm/bug.h:12,
from include/linux/bug.h:5,
from include/linux/thread_info.h:12,
from include/asm-generic/current.h:5,
from ./arch/c6x/include/generated/asm/current.h:1,
from include/linux/sched.h:12,
from include/linux/blkdev.h:5,
from drivers//scsi/sd_zbc.c:11:
drivers//scsi/sd_zbc.c: In function 'sd_zbc_read_zones':
>> include/linux/kernel.h:62:48: warning: 'zone_blocks' may be used
uninitialized in this function [-Wmaybe-uninitialized]
#define __round_mask(x, y) ((__typeof__(x))((y)-1))
^
drivers//scsi/sd_zbc.c:464:6: note: 'zone_blocks' was declared here
u32 zone_blocks;
^~~~~~~~~~~
This is a false-positive report. The variable zone_blocks is always
initialized in sd_zbc_check_zones() before use. It is not initialized
only and only if sd_zbc_check_zones() fails.
Avoid this warning by initializing the zone_blocks variable to 0.
Fixes: 5f832a395859 ("scsi: sd_zbc: Fix sd_zbc_check_zones() error checks")
Cc: Stable <[email protected]>
Signed-off-by: Damien Le Moal <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
|
|
Currently if lport is null then the null lport pointer is dereference when
printing out debug via the FC_LPORT_DB macro. Fix this by using the more
generic FC_LIBFC_DBG debug macro instead that does not use lport.
Addresses-Coverity: ("Dereference after null check")
Fixes: 7414705ea4ae ("libfc: Add runtime debugging with debug_logging module parameter")
Signed-off-by: Colin Ian King <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
|
|
RocksDB can hang indefinitely when using a DAX file. This is due to
a bug in the XArray conversion when handling a PMD fault and finding a
PTE entry. We use the wrong index in the hash and end up waiting on
the wrong waitqueue.
There's actually no need to wait; if we find a PTE entry while looking
for a PMD entry, we can return immediately as we know we should fall
back to a PTE fault (which may not conflict with the lock held).
We reuse the XA_RETRY_ENTRY to signal a conflicting entry was found.
This value can never be found in an XArray while holding its lock, so
it does not create an ambiguity.
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/CAPcyv4hwHpX-MkUEqxwdTj7wCCZCN4RV-L4jsnuwLGyL_UEG4A@mail.gmail.com
Fixes: b15cd800682f ("dax: Convert page fault handlers to XArray")
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Tested-by: Dan Williams <[email protected]>
Reported-by: Robert Barror <[email protected]>
Reported-by: Seema Pandit <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
|
|
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct foo {
int stuff;
struct boo entry[];
};
size = sizeof(struct foo) + count * sizeof(struct boo);
instance = kmalloc(size, GFP_KERNEL);
Instead of leaving these open-coded and prone to type mistakes, we can now
use the new struct_size() helper:
instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
Also, notice that variable size is unnecessary, hence it is removed.
This code was detected with the help of Coccinelle.
Link: http://lkml.kernel.org/r/20190604164226.GA13823@embeddedor
Signed-off-by: Gustavo A. R. Silva <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
locked_vm accounting is done roughly the same way in five places, so
unify them in a helper.
Include the helper's caller in the debug print to distinguish between
callsites.
Error codes stay the same, so user-visible behavior does too. The one
exception is that the -EPERM case in tce_account_locked_vm is removed
because Alexey has never seen it triggered.
[[email protected]: v3]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: fix mm/util.c]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Daniel Jordan <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
Tested-by: Alexey Kardashevskiy <[email protected]>
Acked-by: Alex Williamson <[email protected]>
Cc: Alan Tull <[email protected]>
Cc: Alex Williamson <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Moritz Fischer <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Steve Sistare <[email protected]>
Cc: Wu Hao <[email protected]>
Cc: Ira Weiny <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In order for things like get_user_pages() to work on ZONE_DEVICE memory,
we need a software PTE bit to identify device-backed PFNs. Hook this up
along with the relevant helpers to join in with ARCH_HAS_PTE_DEVMAP.
[[email protected]: build fixes]
Link: http://lkml.kernel.org/r/13026c4e64abc17133bbfa07d7731ec6691c0bcd.1559050949.git.robin.murphy@arm.com
Link: http://lkml.kernel.org/r/817d92886fc3b33bcbf6e105ee83a74babb3a5aa.1558547956.git.robin.murphy@arm.com
Signed-off-by: Robin Murphy <[email protected]>
Acked-by: Will Deacon <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oliver O'Halloran <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
ARCH_HAS_ZONE_DEVICE is somewhat meaningless in itself, and combined
with the long-out-of-date comment can lead to the impression than an
architecture may just enable it (since __add_pages() now "comprehends
device memory" for itself) and expect things to work.
In practice, however, ZONE_DEVICE users have little chance of
functioning correctly without __HAVE_ARCH_PTE_DEVMAP, so let's clean
that up the same way as ARCH_HAS_PTE_SPECIAL and make it the proper
dependency so the real situation is clearer.
Link: http://lkml.kernel.org/r/87554aa78478a02a63f2c4cf60a847279ae3eb3b.1558547956.git.robin.murphy@arm.com
Signed-off-by: Robin Murphy <[email protected]>
Acked-by: Dan Williams <[email protected]>
Reviewed-by: Ira Weiny <[email protected]>
Acked-by: Oliver O'Halloran <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Refactor is_device_{public,private}_page() with is_pci_p2pdma_page() to
make them all consistent in depending on their respective config options
even when CONFIG_DEV_PAGEMAP_OPS is enabled for other reasons. This
allows a little more compile-time optimisation as well as the conceptual
and cosmetic cleanup.
Link: http://lkml.kernel.org/r/187c2ab27dea70635d375a61b2f2076d26c032b0.1558547956.git.robin.murphy@arm.com
Signed-off-by: Robin Murphy <[email protected]>
Suggested-by: Jerome Glisse <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oliver O'Halloran <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Two architecture that use arch specific MMAP flags are powerpc and
sparc. We still have few flag values common across them and other
architectures. Consolidate this in mman-common.h.
Also update the comment to indicate where to find HugeTLB specific
reserved values
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This enables support for synchronous DAX fault on powerpc
The generic changes are added as part of b6fb293f2497 ("mm: Define
MAP_SYNC and VM_SYNC flags")
Without this, mmap returns EOPNOTSUPP for MAP_SYNC with
MAP_SHARED_VALIDATE
Instead of adding MAP_SYNC with same value to
arch/powerpc/include/uapi/asm/mman.h, I am moving the #define to
asm-generic/mman-common.h. Two architectures using mman-common.h
directly are sparc and powerpc. We should be able to consloidate more
#defines to mman-common.h. That can be done as a separate patch.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
It is now allowed to use persistent memory like a regular RAM, but
currently there is no way to remove this memory until machine is
rebooted.
This work expands the functionality to also allows hotremoving
previously hotplugged persistent memory, and recover the device for use
for other purposes.
To hotremove persistent memory, the management software must first
offline all memory blocks of dax region, and than unbind it from
device-dax/kmem driver. So, operations should look like this:
echo offline > /sys/devices/system/memory/memoryN/state
...
echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
Note: if unbind is done without offlining memory beforehand, it won't be
possible to do dax0.0 hotremove, and dax's memory is going to be part of
System RAM until reboot.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: James Morris <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Yaowei Bai <[email protected]>
Cc: Takashi Iwai <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Presently the remove_memory() interface is inherently broken. It tries
to remove memory but panics if some memory is not offline. The problem
is that it is impossible to ensure that all memory blocks are offline as
this function also takes lock_device_hotplug that is required to change
memory state via sysfs.
So, between calling this function and offlining all memory blocks there
is always a window when lock_device_hotplug is released, and therefore,
there is always a chance for a panic during this window.
Make this interface to return an error if memory removal fails. This
way it is safe to call this function without panicking machine, and also
makes it symmetric to add_memory() which already returns an error.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: James Morris <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Takashi Iwai <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Yaowei Bai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series ""Hotremove" persistent memory", v6.
Recently, adding a persistent memory to be used like a regular RAM was
added to Linux. This work extends this functionality to also allow hot
removing persistent memory.
We (Microsoft) have an important use case for this functionality.
The requirement is for physical machines with small amount of RAM (~8G)
to be able to reboot in a very short period of time (<1s). Yet, there
is a userland state that is expensive to recreate (~2G).
The solution is to boot machines with 2G preserved for persistent
memory.
Copy the state, and hotadd the persistent memory so machine still has
all 8G available for runtime. Before reboot, offline and hotremove
device-dax 2G, copy the memory that is needed to be preserved to pmem0
device, and reboot.
The series of operations look like this:
1. After boot restore /dev/pmem0 to ramdisk to be consumed by apps.
and free ramdisk.
2. Convert raw pmem0 to devdax
ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
3. Hotadd to System RAM
echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
echo online_movable > /sys/devices/system/memoryXXX/state
4. Before reboot hotremove device-dax memory from System RAM
echo offline > /sys/devices/system/memoryXXX/state
echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
5. Create raw pmem0 device
ndctl create-namespace --mode raw -e namespace0.0 -f
6. Copy the state that was stored by apps to ramdisk to pmem device
7. Do kexec reboot or reboot through firmware if firmware does not
zero memory in pmem0 region (These machines have only regular
volatile memory). So to have pmem0 device either memmap kernel
parameter is used, or devices nodes in dtb are specified.
This patch (of 3):
When add_memory() fails, the resource and the memory should be freed.
Link: http://lkml.kernel.org/r/[email protected]
Fixes: c221c0b0308f ("device-dax: "Hotplug" persistent memory for use like normal RAM")
Signed-off-by: Pavel Tatashin <[email protected]>
Reviewed-by: Dave Hansen <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: James Morris <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Takashi Iwai <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Yaowei Bai <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix a few spelling and grammar errors, and two places where fast/safe in
the documentation did not match the function.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Tom Levy <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Jiri Kosina <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Andreas Christoforou reported:
UBSAN: Undefined behaviour in ipc/mqueue.c:414:49 signed integer overflow:
9 * 2305843009213693951 cannot be represented in type 'long int'
...
Call Trace:
mqueue_evict_inode+0x8e7/0xa10 ipc/mqueue.c:414
evict+0x472/0x8c0 fs/inode.c:558
iput_final fs/inode.c:1547 [inline]
iput+0x51d/0x8c0 fs/inode.c:1573
mqueue_get_inode+0x8eb/0x1070 ipc/mqueue.c:320
mqueue_create_attr+0x198/0x440 ipc/mqueue.c:459
vfs_mkobj+0x39e/0x580 fs/namei.c:2892
prepare_open ipc/mqueue.c:731 [inline]
do_mq_open+0x6da/0x8e0 ipc/mqueue.c:771
Which could be triggered by:
struct mq_attr attr = {
.mq_flags = 0,
.mq_maxmsg = 9,
.mq_msgsize = 0x1fffffffffffffff,
.mq_curmsgs = 0,
};
if (mq_open("/testing", 0x40, 3, &attr) == (mqd_t) -1)
perror("mq_open");
mqueue_get_inode() was correctly rejecting the giant mq_msgsize, and
preparing to return -EINVAL. During the cleanup, it calls
mqueue_evict_inode() which performed resource usage tracking math for
updating "user", before checking if there was a valid "user" at all
(which would indicate that the calculations would be sane). Instead,
delay this check to after seeing a valid "user".
The overflow was real, but the results went unused, so while the flaw is
harmless, it's noisy for kernel fuzzers, so just fix it by moving the
calculation under the non-NULL "user" where it actually gets used.
Link: http://lkml.kernel.org/r/201906072207.ECB65450@keescook
Signed-off-by: Kees Cook <[email protected]>
Reported-by: Andreas Christoforou <[email protected]>
Acked-by: "Eric W. Biederman" <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Manfred Spraul <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
architectures
For architectures using __WARN_TAINT, the WARN_ON macro did not print
out the "cut here" string. The other WARN_XXX macros would print "cut
here" inside __warn_printk, which is not called for WARN_ON since it
doesn't have a message to print.
Link: http://lkml.kernel.org/r/[email protected]
Fixes: a7bed27af194 ("bug: fix "cut here" location for __WARN_TAINT architectures")
Signed-off-by: Drew Davenport <[email protected]>
Acked-by: Kees Cook <[email protected]>
Tested-by: Kees Cook <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add helper commands and functions for finding pointers to struct device
by enumerating linux device bus/class infrastructure. This can be used
to fetch subsystem and driver-specific structs:
(gdb) p *$container_of($lx_device_find_by_class_name("net", "eth0"), "struct net_device", "dev")
(gdb) p *$container_of($lx_device_find_by_bus_name("i2c", "0-004b"), "struct i2c_client", "dev")
(gdb) p *(struct imx_port*)$lx_device_find_by_class_name("tty", "ttymxc1")->parent->driver_data
Several generic "lx-device-list" functions are included to enumerate
devices by bus and class:
(gdb) lx-device-list-bus usb
(gdb) lx-device-list-class
(gdb) lx-device-list-tree &platform_bus
Similar information is available in /sys but pointer values are
deliberately hidden.
Link: http://lkml.kernel.org/r/c948628041311cbf1b9b4cff3dda7d2073cb3eaa.1561492937.git.leonard.crestez@nxp.com
Signed-off-by: Leonard Crestez <[email protected]>
Reviewed-by: Stephen Boyd <[email protected]>
Cc: Kieran Bingham <[email protected]>
Cc: Jan Kiszka <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This is like /sys/kernel/debug/pm/pm_genpd_summary except it's
accessible through a debugger.
This can be useful if the target crashes or hangs because power domains
were not properly enabled.
Link: http://lkml.kernel.org/r/f9ee627a0d4f94b894aa202fee8a98444049bed8.1561492937.git.leonard.crestez@nxp.com
Signed-off-by: Leonard Crestez <[email protected]>
Reviewed-by: Stephen Boyd <[email protected]>
Cc: Kieran Bingham <[email protected]>
Cc: Jan Kiszka <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The PPS assert/clear offset corrections are set by the PPS_SETPARAMS
ioctl in the pps_ktime structs, which also contain flags. The flags are
not initialized by applications (using the timepps.h header) and they
are not used by the kernel for anything except returning them back in
the PPS_GETPARAMS ioctl.
Set the flags to zero to make it clear they are unused and avoid leaking
uninitialized data of the PPS_SETPARAMS caller to other applications
that have a read access to the PPS device.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Miroslav Lichvar <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Acked-by: Rodolfo Giometti <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Dan Carpenter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
struct pid's count is an atomic_t field used as a refcount. Use
refcount_t for it which is basically atomic_t but does additional
checking to prevent use-after-free bugs.
For memory ordering, the only change is with the following:
- if ((atomic_read(&pid->count) == 1) ||
- atomic_dec_and_test(&pid->count)) {
+ if (refcount_dec_and_test(&pid->count)) {
kmem_cache_free(ns->pid_cachep, pid);
Here the change is from: Fully ordered --> RELEASE + ACQUIRE (as per
refcount-vs-atomic.rst) This ACQUIRE should take care of making sure the
free happens after the refcount_dec_and_test().
The above hunk also removes atomic_read() since it is not needed for the
code to work and it is unclear how beneficial it is. The removal lets
refcount_dec_and_test() check for cases where get_pid() happened before
the object was freed.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Reviewed-by: Andrea Parri <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Elena Reshetova <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: KJ Tsanaktsidis <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The dev_info.name[] array has space for RIO_MAX_DEVNAME_SZ + 1
characters. But the problem here is that we don't ensure that the user
put a NUL terminator on the end of the string. It could lead to an out
of bounds read.
Link: http://lkml.kernel.org/r/20190529110601.GB19119@mwanda
Fixes: e8de370188d0 ("rapidio: add mport char device driver")
Signed-off-by: Dan Carpenter <[email protected]>
Acked-by: Alexandre Bounine <[email protected]>
Cc: Ira Weiny <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Now that restore_saved_sigmask_unless() is always called with the same
argument right before poll_select_copy_remaining() we can move it into
poll_select_copy_remaining() and make it the only caller of restore() in
fs/select.c.
The patch also renames poll_select_copy_remaining(),
poll_select_finish() looks better after this change.
kern_select() doesn't use set_user_sigmask(), so in this case
poll_select_finish() does restore_saved_sigmask_unless() "for no
reason". But this won't hurt, and WARN_ON(!TIF_SIGPENDING) is still
valid.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: David Laight <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Deepa Dinamani <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Eric Wong <[email protected]>
Cc: Jason Baron <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
do_poll() returns -EINTR if interrupted and after that all its callers
have to translate it into -ERESTARTNOHAND. Change do_poll() to return
-ERESTARTNOHAND and update (simplify) the callers.
Note that this also unifies all users of restore_saved_sigmask_unless(),
see the next patch.
Linus:
: The *right* return value will actually be then chosen by
: poll_select_copy_remaining(), which will turn ERESTARTNOHAND to EINTR
: when it can't update the timeout.
:
: Except for the cases that use restart_block and do that instead and
: don't have the whole timeout restart issue as a result.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: David Laight <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Deepa Dinamani <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Eric Wong <[email protected]>
Cc: Jason Baron <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
task->saved_sigmask and ->restore_sigmask are only used in the ret-from-
syscall paths. This means that set_user_sigmask() can save ->blocked in
->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked
was modified.
This way the callers do not need 2 sigset_t's passed to set/restore and
restore_user_sigmask() renamed to restore_saved_sigmask_unless() turns
into the trivial helper which just calls restore_saved_sigmask().
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Deepa Dinamani <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Eric Wong <[email protected]>
Cc: Jason Baron <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: David Laight <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
struct sighand_struct::siglock field is the most used field by far, put
it first so that is can be accessed without IMM8 or IMM32 encoding on
x86_64.
Space savings (on trimmed down VM test config):
add/remove: 0/0 grow/shrink: 8/68 up/down: 49/-1147 (-1098)
Function old new delta
complete_signal 512 533 +21
do_signalfd4 335 346 +11
__cleanup_sighand 39 43 +4
unhandled_signal 49 52 +3
prepare_signal 692 695 +3
ignore_signals 37 40 +3
__tty_check_change.part 248 251 +3
ksys_unshare 780 781 +1
sighand_ctor 33 29 -4
ptrace_trap_notify 60 56 -4
sigqueue_free 98 91 -7
run_posix_cpu_timers 1389 1382 -7
proc_pid_status 2448 2441 -7
proc_pid_limits 344 337 -7
posix_cpu_timer_rearm 222 215 -7
posix_cpu_timer_get 249 242 -7
kill_pid_info_as_cred 243 236 -7
freeze_task 197 190 -7
flush_old_exec 1873 1866 -7
do_task_stat 3363 3356 -7
do_send_sig_info 98 91 -7
do_group_exit 147 140 -7
init_sighand 2088 2080 -8
do_notify_parent_cldstop 399 391 -8
signalfd_cleanup 50 41 -9
do_notify_parent 557 545 -12
__send_signal 1029 1017 -12
ptrace_stop 590 577 -13
get_signal 1576 1563 -13
__lock_task_sighand 112 99 -13
zap_pid_ns_processes 391 377 -14
update_rlimit_cpu 78 64 -14
tty_signal_session_leader 413 399 -14
tty_open_proc_set_tty 149 135 -14
tty_jobctrl_ioctl 936 922 -14
set_cpu_itimer 339 325 -14
ptrace_resume 226 212 -14
ptrace_notify 110 96 -14
proc_clear_tty 81 67 -14
posix_cpu_timer_del 229 215 -14
kernel_sigaction 156 142 -14
getrusage 977 963 -14
get_current_tty 98 84 -14
force_sigsegv 89 75 -14
force_sig_info 205 191 -14
flush_signals 83 69 -14
flush_itimer_signals 85 71 -14
do_timer_create 1120 1106 -14
do_sigpending 88 74 -14
do_signal_stop 537 523 -14
cgroup_init_fs_context 644 630 -14
call_usermodehelper_exec_async 402 388 -14
calculate_sigpending 58 44 -14
__x64_sys_timer_delete 248 234 -14
__set_current_blocked 80 66 -14
__ptrace_unlink 310 296 -14
__ptrace_detach.part 187 173 -14
send_sigqueue 362 347 -15
get_cpu_itimer 214 199 -15
signalfd_poll 175 159 -16
dequeue_signal 340 323 -17
do_getitimer 192 174 -18
release_task.part 1060 1040 -20
ptrace_peek_siginfo 408 387 -21
posix_cpu_timer_set 827 806 -21
exit_signals 437 416 -21
do_sigaction 541 520 -21
do_setitimer 485 464 -21
disassociate_ctty.part 545 517 -28
__x64_sys_rt_sigtimedwait 721 679 -42
__x64_sys_ptrace 1319 1277 -42
ptrace_request 1828 1782 -46
signalfd_read 507 459 -48
wait_consider_task 2027 1971 -56
do_coredump 3672 3616 -56
copy_process.part 6936 6871 -65
Link: http://lkml.kernel.org/r/20190503192800.GA18004@avx2
Signed-off-by: Alexey Dobriyan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Check whether PTRACE_GET_SYSCALL_INFO semantics implemented in the
kernel matches userspace expectations.
[[email protected]: coding-style fixes]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Dmitry V. Levin <[email protected]>
Acked-by: Shuah Khan <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Elvira Khabirova <[email protected]>
Cc: Eugene Syromyatnikov <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Greentime Hu <[email protected]>
Cc: Helge Deller <[email protected]> [parisc]
Cc: James E.J. Bottomley <[email protected]>
Cc: James Hogan <[email protected]>
Cc: kbuild test robot <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Kuo <[email protected]>
Cc: Vincent Chen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
PTRACE_GET_SYSCALL_INFO is a generic ptrace API that lets ptracer obtain
details of the syscall the tracee is blocked in.
There are two reasons for a special syscall-related ptrace request.
Firstly, with the current ptrace API there are cases when ptracer cannot
retrieve necessary information about syscalls. Some examples include:
* The notorious int-0x80-from-64-bit-task issue. See [1] for details.
In short, if a 64-bit task performs a syscall through int 0x80, its
tracer has no reliable means to find out that the syscall was, in
fact, a compat syscall, and misidentifies it.
* Syscall-enter-stop and syscall-exit-stop look the same for the
tracer. Common practice is to keep track of the sequence of
ptrace-stops in order not to mix the two syscall-stops up. But it is
not as simple as it looks; for example, strace had a (just recently
fixed) long-standing bug where attaching strace to a tracee that is
performing the execve system call led to the tracer identifying the
following syscall-exit-stop as syscall-enter-stop, which messed up
all the state tracking.
* Since the introduction of commit 84d77d3f06e7 ("ptrace: Don't allow
accessing an undumpable mm"), both PTRACE_PEEKDATA and
process_vm_readv become unavailable when the process dumpable flag is
cleared. On such architectures as ia64 this results in all syscall
arguments being unavailable for the tracer.
Secondly, ptracers also have to support a lot of arch-specific code for
obtaining information about the tracee. For some architectures, this
requires a ptrace(PTRACE_PEEKUSER, ...) invocation for every syscall
argument and return value.
ptrace(2) man page:
long ptrace(enum __ptrace_request request, pid_t pid,
void *addr, void *data);
...
PTRACE_GET_SYSCALL_INFO
Retrieve information about the syscall that caused the stop.
The information is placed into the buffer pointed by "data"
argument, which should be a pointer to a buffer of type
"struct ptrace_syscall_info".
The "addr" argument contains the size of the buffer pointed to
by "data" argument (i.e., sizeof(struct ptrace_syscall_info)).
The return value contains the number of bytes available
to be written by the kernel.
If the size of data to be written by the kernel exceeds the size
specified by "addr" argument, the output is truncated.
[[email protected]: selftests/seccomp/seccomp_bpf: update for PTRACE_GET_SYSCALL_INFO]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Elvira Khabirova <[email protected]>
Co-developed-by: Dmitry V. Levin <[email protected]>
Signed-off-by: Dmitry V. Levin <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Andy Lutomirski <[email protected]>
Cc: Eugene Syromyatnikov <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Greentime Hu <[email protected]>
Cc: Helge Deller <[email protected]> [parisc]
Cc: James E.J. Bottomley <[email protected]>
Cc: James Hogan <[email protected]>
Cc: kbuild test robot <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Kuo <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Vincent Chen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
syscall_get_error() is required to be implemented on this architecture in
addition to already implemented syscall_get_nr(), syscall_get_arguments(),
syscall_get_return_value(), and syscall_get_arch() functions in order to
extend the generic ptrace API with PTRACE_GET_SYSCALL_INFO request.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Dmitry V. Levin <[email protected]>
Acked-by: Michael Ellerman <[email protected]>
Cc: Elvira Khabirova <[email protected]>
Cc: Eugene Syromyatnikov <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Greentime Hu <[email protected]>
Cc: Helge Deller <[email protected]> [parisc]
Cc: James E.J. Bottomley <[email protected]>
Cc: James Hogan <[email protected]>
Cc: kbuild test robot <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Kuo <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Vincent Chen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
syscall_get_error() is required to be implemented on all architectures in
addition to already implemented syscall_get_nr(), syscall_get_arguments(),
syscall_get_return_value(), and syscall_get_arch() functions in order to
extend the generic ptrace API with PTRACE_GET_SYSCALL_INFO request.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Dmitry V. Levin <[email protected]>
Acked-by: Helge Deller <[email protected]> [parisc]
Cc: James E.J. Bottomley <[email protected]>
Cc: Elvira Khabirova <[email protected]>
Cc: Eugene Syromyatnikov <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Greentime Hu <[email protected]>
Cc: James Hogan <[email protected]>
Cc: kbuild test robot <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Kuo <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Vincent Chen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
syscall_get_error() is required to be implemented on all architectures
in addition to already implemented syscall_get_nr(),
syscall_get_arguments(), syscall_get_return_value(), and
syscall_get_arch() functions in order to extend the generic ptrace API
with PTRACE_GET_SYSCALL_INFO request.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Dmitry V. Levin <[email protected]>
Acked-by: Paul Burton <[email protected]>
Cc: Elvira Khabirova <[email protected]>
Cc: Eugene Syromyatnikov <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Greentime Hu <[email protected]>
Cc: Helge Deller <[email protected]> [parisc]
Cc: James E.J. Bottomley <[email protected]>
Cc: kbuild test robot <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Richard Kuo <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Vincent Chen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
syscall_get_* functions are required to be implemented on all
architectures in order to extend the generic ptrace API with
PTRACE_GET_SYSCALL_INFO request.
This adds remaining 2 syscall_get_* functions as documented in
asm-generic/syscall.h: syscall_get_error and syscall_get_return_value.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Dmitry V. Levin <[email protected]>
Cc: Richard Kuo <[email protected]>
Cc: Elvira Khabirova <[email protected]>
Cc: Eugene Syromyatnikov <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Greentime Hu <[email protected]>
Cc: Helge Deller <[email protected]> [parisc]
Cc: James E.J. Bottomley <[email protected]>
Cc: James Hogan <[email protected]>
Cc: kbuild test robot <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Vincent Chen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
PTRACE_GET_SYSCALL_INFO is a generic ptrace API that lets ptracer obtain
details of the syscall the tracee is blocked in.
There are two reasons for a special syscall-related ptrace request.
Firstly, with the current ptrace API there are cases when ptracer cannot
retrieve necessary information about syscalls. Some examples include:
* The notorious int-0x80-from-64-bit-task issue. See [1] for details.
In short, if a 64-bit task performs a syscall through int 0x80, its
tracer has no reliable means to find out that the syscall was, in
fact, a compat syscall, and misidentifies it.
* Syscall-enter-stop and syscall-exit-stop look the same for the
tracer. Common practice is to keep track of the sequence of
ptrace-stops in order not to mix the two syscall-stops up. But it is
not as simple as it looks; for example, strace had a (just recently
fixed) long-standing bug where attaching strace to a tracee that is
performing the execve system call led to the tracer identifying the
following syscall-exit-stop as syscall-enter-stop, which messed up
all the state tracking.
* Since the introduction of commit 84d77d3f06e7 ("ptrace: Don't allow
accessing an undumpable mm"), both PTRACE_PEEKDATA and
process_vm_readv become unavailable when the process dumpable flag is
cleared. On such architectures as ia64 this results in all syscall
arguments being unavailable for the tracer.
Secondly, ptracers also have to support a lot of arch-specific code for
obtaining information about the tracee. For some architectures, this
requires a ptrace(PTRACE_PEEKUSER, ...) invocation for every syscall
argument and return value.
PTRACE_GET_SYSCALL_INFO returns the following structure:
struct ptrace_syscall_info {
__u8 op; /* PTRACE_SYSCALL_INFO_* */
__u32 arch __attribute__((__aligned__(sizeof(__u32))));
__u64 instruction_pointer;
__u64 stack_pointer;
union {
struct {
__u64 nr;
__u64 args[6];
} entry;
struct {
__s64 rval;
__u8 is_error;
} exit;
struct {
__u64 nr;
__u64 args[6];
__u32 ret_data;
} seccomp;
};
};
The structure was chosen according to [2], except for the following
changes:
* seccomp substructure was added as a superset of entry substructure
* the type of nr field was changed from int to __u64 because syscall
numbers are, as a practical matter, 64 bits
* stack_pointer field was added along with instruction_pointer field
since it is readily available and can save the tracer from extra
PTRACE_GETREGS/PTRACE_GETREGSET calls
* arch is always initialized to aid with tracing system calls such as
execve()
* instruction_pointer and stack_pointer are always initialized so they
could be easily obtained for non-syscall stops
* a boolean is_error field was added along with rval field, this way
the tracer can more reliably distinguish a return value from an error
value
strace has been ported to PTRACE_GET_SYSCALL_INFO. Starting with
release 4.26, strace uses PTRACE_GET_SYSCALL_INFO API as the preferred
mechanism of obtaining syscall information.
[1] https://lore.kernel.org/lkml/CA+55aFzcSVmdDj9Lh_gdbz1OzHyEm6ZrGPBDAJnywm2LF_eVyg@mail.gmail.com/
[2] https://lore.kernel.org/lkml/CAObL_7GM0n80N7J_DFw_eQyfLyzq+sf4y2AvsCCV88Tb3AwEHA@mail.gmail.com/
This patch (of 7):
All syscall_get_*() and syscall_set_*() functions must be defined as
static inline as on all other architectures, otherwise asm/syscall.h
cannot be included in more than one compilation unit.
This bug has to be fixed in order to extend the generic
ptrace API with PTRACE_GET_SYSCALL_INFO request.
Link: http://lkml.kernel.org/r/[email protected]
Fixes: 1932fbe36e02 ("nds32: System calls handling")
Signed-off-by: Dmitry V. Levin <[email protected]>
Reported-by: kbuild test robot <[email protected]>
Acked-by: Greentime Hu <[email protected]>
Cc: Vincent Chen <[email protected]>
Cc: Elvira Khabirova <[email protected]>
Cc: Eugene Syromyatnikov <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Helge Deller <[email protected]> [parisc]
Cc: James E.J. Bottomley <[email protected]>
Cc: James Hogan <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Paul Burton <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Richard Kuo <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|