Age | Commit message (Collapse) | Author | Files | Lines |
|
The target size can now be returned by nanddev_get_targetsize(). Get
rid of the chip->chipsize field and use this helper instead.
Signed-off-by: Boris Brezillon <[email protected]>
Reviewed-by: Frieder Schrempf <[email protected]>
Signed-off-by: Miquel Raynal <[email protected]>
|
|
Now that we inherit from nand_device, we can use
nand_device->memorg.bits_per_cell instead of having our own field at
the nand_chip level.
Signed-off-by: Boris Brezillon <[email protected]>
Signed-off-by: Miquel Raynal <[email protected]>
Reviewed-by: Frieder Schrempf <[email protected]>
|
|
nanddev_mtd_max_bad_blocks() is implemented by the generic NAND layer
and is already doing what we need. Reuse this function instead of
having our own implementation.
While at it, get rid of the ->max_bb_per_die and ->blocks_per_die
fields which are now unused.
Signed-off-by: Boris Brezillon <[email protected]>
Signed-off-by: Miquel Raynal <[email protected]>
Reviewed-by: Frieder Schrempf <[email protected]>
|
|
Looking at the field names it's hard to tell what ->data_buf, ->pagebuf
and ->pagebuf_bitflips are for. Clarify that by moving those fields
in a sub-struct named pagecache.
Signed-off-by: Boris Brezillon <[email protected]>
Signed-off-by: Miquel Raynal <[email protected]>
Reviewed-by: Frieder Schrempf <[email protected]>
|
|
We plan to move cache related fields to a pagecache struct in nand_chip
but some drivers access ->pagebuf directly to invalidate the cache
before they start using ->data_buf.
Let's provide an helper that returns a pointer to ->data_buf after
invalidating the cache.
Signed-off-by: Boris Brezillon <[email protected]>
Signed-off-by: Miquel Raynal <[email protected]>
Reviewed-by: Frieder Schrempf <[email protected]>
|
|
The generic NAND layer provides abstraction of NAND devices no matter
the bus that is used to communicate with the chip. Basing the raw NAND
core on this generic layer should avoid duplication of common
operations, like iterating over all pages/blocks for MTD IO/erase
operations.
In order to re-use this layer, we must first inherit from nand_device
and then initialize the nand_device struct appropriately. This patch
is taking care of the former.
Signed-off-by: Boris Brezillon <[email protected]>
Signed-off-by: Miquel Raynal <[email protected]>
Reviewed-by: Frieder Schrempf <[email protected]>
|
|
Use the nand_to_mtd() helper to access chip->mtd as done everywhere
else.
Signed-off-by: Boris Brezillon <[email protected]>
Signed-off-by: Miquel Raynal <[email protected]>
Reviewed-by: Frieder Schrempf <[email protected]>
|
|
Will be used by the raw NAND framework.
Signed-off-by: Boris Brezillon <[email protected]>
Signed-off-by: Miquel Raynal <[email protected]>
Reviewed-by: Frieder Schrempf <[email protected]>
|
|
Some drivers in the raw NAND framework seems to need this helper, so
let's just add it instead of open-coding the logic.
Signed-off-by: Boris Brezillon <[email protected]>
Signed-off-by: Miquel Raynal <[email protected]>
Reviewed-by: Frieder Schrempf <[email protected]>
|
|
NAND datasheets usually give the maximum number of bad blocks per LUN
and this number can be used to help upper layers decide how much blocks
they should reserve for bad block handling.
Add a max_bad_eraseblocks_per_lun to the nand_memory_organization
struct and update the NAND_MEMORG() macro (and its users) accordingly.
We also provide a default mtd->_max_bad_blocks() implementation.
Signed-off-by: Boris Brezillon <[email protected]>
Signed-off-by: Miquel Raynal <[email protected]>
Reviewed-by: Frieder Schrempf <[email protected]>
|
|
This patch creates set_cs_timing SPI master optional method for
SPI masters to implement configuring CS timing if applicable.
This patch also creates spi_cs_timing accessory for SPI clients to
use for requesting SPI master controllers to configure device requested
CS setup time, hold time and inactive delay.
Signed-off-by: Sowjanya Komatineni <[email protected]>
Signed-off-by: Mark Brown <[email protected]>
|
|
Move all of the code doing struct spi_bitbang initialization, so that
it can be paired with devm_spi_register_master() in order to avoid
having to call spi_bitbang_stop() explicitly.
Signed-off-by: Andrey Smirnov <[email protected]>
Cc: Mark Brown <[email protected]>
Cc: Chris Healy <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Mark Brown <[email protected]>
|
|
The current definition and implementation of the SEV_GET_ID command
does not provide the length of the unique ID returned by the firmware.
As per the firmware specification, the firmware may return an ID
length that is not restricted to 64 bytes as assumed by the SEV_GET_ID
command.
Introduce the SEV_GET_ID2 command to overcome with the SEV_GET_ID
limitations. Deprecate the SEV_GET_ID in the favor of SEV_GET_ID2.
At the same time update SEV API web link.
Cc: Janakarajan Natarajan <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Gary Hook <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Nathaniel McCallum <[email protected]>
Signed-off-by: Brijesh Singh <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>
|
|
Native bit_spin_locks are not tracked by lockdep.
The bit_spin_locks used for rhashtable buckets are local
to the rhashtable implementation, so there is little opportunity
for the sort of misuse that lockdep might detect.
However locks are held while a hash function or compare
function is called, and if one of these took a lock,
a misbehaviour is possible.
As it is quite easy to add lockdep support this unlikely
possibility seems to be enough justification.
So create a lockdep class for bucket bit_spin_lock and attach
through a lockdep_map in each bucket_table.
Without the 'nested' annotation in rhashtable_rehash_one(), lockdep
correctly reports a possible problem as this lock is taken
while another bucket lock (in another table) is held. This
confirms that the added support works.
With the correct nested annotation in place, lockdep reports
no problems.
Signed-off-by: NeilBrown <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the
bucket pointer to lock the hash chain for that bucket.
The benefits of a bit spin_lock are:
- no need to allocate a separate array of locks.
- no need to have a configuration option to guide the
choice of the size of this array
- locking cost is often a single test-and-set in a cache line
that will have to be loaded anyway. When inserting at, or removing
from, the head of the chain, the unlock is free - writing the new
address in the bucket head implicitly clears the lock bit.
For __rhashtable_insert_fast() we ensure this always happens
when adding a new key.
- even when lockings costs 2 updates (lock and unlock), they are
in a cacheline that needs to be read anyway.
The cost of using a bit spin_lock is a little bit of code complexity,
which I think is quite manageable.
Bit spin_locks are sometimes inappropriate because they are not fair -
if multiple CPUs repeatedly contend of the same lock, one CPU can
easily be starved. This is not a credible situation with rhashtable.
Multiple CPUs may want to repeatedly add or remove objects, but they
will typically do so at different buckets, so they will attempt to
acquire different locks.
As we have more bit-locks than we previously had spinlocks (by at
least a factor of two) we can expect slightly less contention to
go with the slightly better cache behavior and reduced memory
consumption.
To enhance type checking, a new struct is introduced to represent the
pointer plus lock-bit
that is stored in the bucket-table. This is "struct rhash_lock_head"
and is empty. A pointer to this needs to be cast to either an
unsigned lock, or a "struct rhash_head *" to be useful.
Variables of this type are most often called "bkt".
Previously "pprev" would sometimes point to a bucket, and sometimes a
->next pointer in an rhash_head. As these are now different types,
pprev is NULL when it would have pointed to the bucket. In that case,
'blk' is used, together with correct locking protocol.
Signed-off-by: NeilBrown <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Rather than returning a pointer to a static nulls, rht_bucket_var()
now returns NULL if the bucket doesn't exist.
This will make the next patch, which stores a bitlock in the
bucket pointer, somewhat cleaner.
This change involves introducing __rht_bucket_nested() which is
like rht_bucket_nested(), but doesn't provide the static nulls,
and changing rht_bucket_nested() to call this and possible
provide a static nulls - as is still needed for the non-var case.
Signed-off-by: NeilBrown <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The current handling of MSR_IA32_ENERGY_PERF_BIAS in the kernel is
problematic, because it may cause changes made by user space to that
MSR (with the help of the x86_energy_perf_policy tool, for example)
to be lost every time a CPU goes offline and then back online as well
as during system-wide power management transitions into sleep states
and back into the working state.
The first problem is that if the current EPB value for a CPU going
online is 0 ('performance'), the kernel will change it to 6 ('normal')
regardless of whether or not this is the first bring-up of that CPU.
That also happens during system-wide resume from sleep states
(including, but not limited to, hibernation). However, the EPB may
have been adjusted by user space this way and the kernel should not
blindly override that setting.
The second problem is that if the platform firmware resets the EPB
values for any CPUs during system-wide resume from a sleep state,
the kernel will not restore their previous EPB values that may
have been set by user space before the preceding system-wide
suspend transition. Again, that behavior may at least be confusing
from the user space perspective.
In order to address these issues, rework the handling of
MSR_IA32_ENERGY_PERF_BIAS so that the EPB value is saved on CPU
offline and restored on CPU online as well as (for the boot CPU)
during the syscore stages of system-wide suspend and resume
transitions, respectively.
However, retain the policy by which the EPB is set to 6 ('normal')
on the first bring-up of each CPU if its initial value is 0, based
on the observation that 0 may mean 'not initialized' just as well as
'performance' in that case.
While at it, move the MSR_IA32_ENERGY_PERF_BIAS handling code into
a separate file and document it in Documentation/admin-guide.
Fixes: abe48b108247 (x86, intel, power: Initialize MSR_IA32_ENERGY_PERF_BIAS)
Fixes: b51ef52df71c (x86/cpu: Restore MSR_IA32_ENERGY_PERF_BIAS after resume)
Reported-by: Thomas Renninger <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Acked-by: Borislav Petkov <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
|
|
Remove the LM3532 backlight driver references from the ti-lmu
code as dedicated driver support is available.
Signed-off-by: Dan Murphy <[email protected]>
Acked-for-MFD-by: Lee Jones <[email protected]>
Signed-off-by: Jacek Anaszewski <[email protected]>
|
|
run simultaneously without deadlock
Commit 9c225f2655e3 ("vfs: atomic f_pos accesses as per POSIX") added
locking for file.f_pos access and in particular made concurrent read and
write not possible - now both those functions take f_pos lock for the
whole run, and so if e.g. a read is blocked waiting for data, write will
deadlock waiting for that read to complete.
This caused regression for stream-like files where previously read and
write could run simultaneously, but after that patch could not do so
anymore. See e.g. commit 581d21a2d02a ("xenbus: fix deadlock on writes
to /proc/xen/xenbus") which fixes such regression for particular case of
/proc/xen/xenbus.
The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
safety for read/write/lseek and added the locking to file descriptors of
all regular files. In 2014 that thread-safety problem was not new as it
was already discussed earlier in 2006.
However even though 2006'th version of Linus's patch was adding f_pos
locking "only for files that are marked seekable with FMODE_LSEEK (thus
avoiding the stream-like objects like pipes and sockets)", the 2014
version - the one that actually made it into the tree as 9c225f2655e3 -
is doing so irregardless of whether a file is seekable or not.
See
https://lore.kernel.org/lkml/[email protected]/
https://lwn.net/Articles/180387
https://lwn.net/Articles/180396
for historic context.
The reason that it did so is, probably, that there are many files that
are marked non-seekable, but e.g. their read implementation actually
depends on knowing current position to correctly handle the read. Some
examples:
kernel/power/user.c snapshot_read
fs/debugfs/file.c u32_array_read
fs/fuse/control.c fuse_conn_waiting_read + ...
drivers/hwmon/asus_atk0110.c atk_debugfs_ggrp_read
arch/s390/hypfs/inode.c hypfs_read_iter
...
Despite that, many nonseekable_open users implement read and write with
pure stream semantics - they don't depend on passed ppos at all. And for
those cases where read could wait for something inside, it creates a
situation similar to xenbus - the write could be never made to go until
read is done, and read is waiting for some, potentially external, event,
for potentially unbounded time -> deadlock.
Besides xenbus, there are 14 such places in the kernel that I've found
with semantic patch (see below):
drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()
In addition to the cases above another regression caused by f_pos
locking is that now FUSE filesystems that implement open with
FOPEN_NONSEEKABLE flag, can no longer implement bidirectional
stream-like files - for the same reason as above e.g. read can deadlock
write locking on file.f_pos in the kernel.
FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990f715 ("fuse:
implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
write routines not depending on current position at all, and with both
read and write being potentially blocking operations:
See
https://github.com/libfuse/osspd
https://lwn.net/Articles/308445
https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510
Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
"somewhat pipe-like files ..." with read handler not using offset.
However that test implements only read without write and cannot exercise
the deadlock scenario:
https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216
I've actually hit the read vs write deadlock for real while implementing
my FUSE filesystem where there is /head/watch file, for which open
creates separate bidirectional socket-like stream in between filesystem
and its user with both read and write being later performed
simultaneously. And there it is semantically not easy to split the
stream into two separate read-only and write-only channels:
https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169
Let's fix this regression. The plan is:
1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
doing so would break many in-kernel nonseekable_open users which
actually use ppos in read/write handlers.
2. Add stream_open() to kernel to open stream-like non-seekable file
descriptors. Read and write on such file descriptors would never use
nor change ppos. And with that property on stream-like files read and
write will be running without taking f_pos lock - i.e. read and write
could be running simultaneously.
3. With semantic patch search and convert to stream_open all in-kernel
nonseekable_open users for which read and write actually do not
depend on ppos and where there is no other methods in file_operations
which assume @offset access.
4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
steam_open if that bit is present in filesystem open reply.
It was tempting to change fs/fuse/ open handler to use stream_open
instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
and in particular GVFS which actually uses offset in its read and
write handlers
https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
so if we would do such a change it will break a real user.
5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
from v3.14+ (the kernel where 9c225f2655 first appeared).
This will allow to patch OSSPD and other FUSE filesystems that
provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
in their open handler and this way avoid the deadlock on all kernel
versions. This should work because fs/fuse/ ignores unknown open
flags returned from a filesystem and so passing FOPEN_STREAM to a
kernel that is not aware of this flag cannot hurt. In turn the kernel
that is not aware of FOPEN_STREAM will be < v3.14 where just
FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
write deadlock.
This patch adds stream_open, converts /proc/xen/xenbus to it and adds
semantic patch to automatically locate in-kernel places that are either
required to be converted due to read vs write deadlock, or that are just
safe to be converted because read and write do not use ppos and there
are no other funky methods in file_operations.
Regarding semantic patch I've verified each generated change manually -
that it is correct to convert - and each other nonseekable_open instance
left - that it is either not correct to convert there, or that it is not
converted due to current stream_open.cocci limitations.
The script also does not convert files that should be valid to convert,
but that currently have .llseek = noop_llseek or generic_file_llseek for
unknown reason despite file being opened with nonseekable_open (e.g.
drivers/input/mousedev.c)
Cc: Michael Kerrisk <[email protected]>
Cc: Yongzhi Pan <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: David Vrabel <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Miklos Szeredi <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Julia Lawall <[email protected]>
Cc: Nikolaus Rath <[email protected]>
Cc: Han-Wen Nienhuys <[email protected]>
Signed-off-by: Kirill Smelkov <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit
architectures. These types are required to support block device and/or
file sizes larger than 2 TiB, and have generally defaulted to on for
a long time. Enabling the option only increases the i386 tinyconfig
size by 145 bytes, and many data structures already always use
64-bit values for their in-core and on-disk data structures anyway,
so there should not be a large change in dynamic memory usage either.
Dropping this option removes a somewhat weird non-default config that
has cause various bugs or compiler warnings when actually used.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Due to an erratum in some Pericom PCIe-to-PCI bridges in reverse mode
(conventional PCI on primary side, PCIe on downstream side), the Retrain
Link bit needs to be cleared manually to allow the link training to
complete successfully.
If it is not cleared manually, the link training is continuously restarted
and no devices below the PCI-to-PCIe bridge can be accessed. That means
drivers for devices below the bridge will be loaded but won't work and may
even crash because the driver is only reading 0xffff.
See the Pericom Errata Sheet PI7C9X111SLB_errata_rev1.2_102711.pdf for
details. Devices known as affected so far are: PI7C9X110, PI7C9X111SL,
PI7C9X130.
Add a new flag, clear_retrain_link, in struct pci_dev. Quirks for affected
devices set this bit.
Note that pcie_retrain_link() lives in aspm.c because that's currently the
only place we use it, but this erratum is not specific to ASPM, and we may
retrain links for other reasons in the future.
Signed-off-by: Stefan Mätje <[email protected]>
[bhelgaas: apply regardless of CONFIG_PCIEASPM]
Signed-off-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
CC: [email protected]
|
|
Merge misc fixes from Andrew Morton:
"14 fixes"
* emailed patches from Andrew Morton <[email protected]>:
kernel/sysctl.c: fix out-of-bounds access when setting file-max
mm/util.c: fix strndup_user() comment
sh: fix multiple function definition build errors
MAINTAINERS: add maintainer and replacing reviewer ARM/NUVOTON NPCM
MAINTAINERS: fix bad pattern in ARM/NUVOTON NPCM
mm: writeback: use exact memcg dirty counts
psi: clarify the units used in pressure files
mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd()
hugetlbfs: fix memory leak for resv_map
mm: fix vm_fault_t cast in VM_FAULT_GET_HINDEX()
lib/lzo: fix bugs for very short or empty input
include/linux/bitrev.h: fix constant bitrev
kmemleak: powerpc: skip scanning holes in the .bss section
lib/string.c: implement a basic bcmp
|
|
Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
memory.stat reporting") memcg dirty and writeback counters are managed
as:
1) per-memcg per-cpu values in range of [-32..32]
2) per-memcg atomic counter
When a per-cpu counter cannot fit in [-32..32] it's flushed to the
atomic. Stat readers only check the atomic. Thus readers such as
balance_dirty_pages() may see a nontrivial error margin: 32 pages per
cpu.
Assuming 100 cpus:
4k x86 page_size: 13 MiB error per memcg
64k ppc page_size: 200 MiB error per memcg
Considering that dirty+writeback are used together for some decisions the
errors double.
This inaccuracy can lead to undeserved oom kills. One nasty case is
when all per-cpu counters hold positive values offsetting an atomic
negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32).
balance_dirty_pages() only consults the atomic and does not consider
throttling the next n_cpu*32 dirty pages. If the file_lru is in the
13..200 MiB range then there's absolutely no dirty throttling, which
burdens vmscan with only dirty+writeback pages thus resorting to oom
kill.
It could be argued that tiny containers are not supported, but it's more
subtle. It's the amount the space available for file lru that matters.
If a container has memory.max-200MiB of non reclaimable memory, then it
will also suffer such oom kills on a 100 cpu machine.
The following test reliably ooms without this patch. This patch avoids
oom kills.
$ cat test
mount -t cgroup2 none /dev/cgroup
cd /dev/cgroup
echo +io +memory > cgroup.subtree_control
mkdir test
cd test
echo 10M > memory.max
(echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
(echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)
$ cat memcg-writeback-stress.c
/*
* Dirty pages from all but one cpu.
* Clean pages from the non dirtying cpu.
* This is to stress per cpu counter imbalance.
* On a 100 cpu machine:
* - per memcg per cpu dirty count is 32 pages for each of 99 cpus
* - per memcg atomic is -99*32 pages
* - thus the complete dirty limit: sum of all counters 0
* - balance_dirty_pages() only sees atomic count -99*32 pages, which
* it max()s to 0.
* - So a workload can dirty -99*32 pages before balance_dirty_pages()
* cares.
*/
#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <sched.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysinfo.h>
#include <sys/types.h>
#include <unistd.h>
static char *buf;
static int bufSize;
static void set_affinity(int cpu)
{
cpu_set_t affinity;
CPU_ZERO(&affinity);
CPU_SET(cpu, &affinity);
if (sched_setaffinity(0, sizeof(affinity), &affinity))
err(1, "sched_setaffinity");
}
static void dirty_on(int output_fd, int cpu)
{
int i, wrote;
set_affinity(cpu);
for (i = 0; i < 32; i++) {
for (wrote = 0; wrote < bufSize; ) {
int ret = write(output_fd, buf+wrote, bufSize-wrote);
if (ret == -1)
err(1, "write");
wrote += ret;
}
}
}
int main(int argc, char **argv)
{
int cpu, flush_cpu = 1, output_fd;
const char *output;
if (argc != 2)
errx(1, "usage: output_file");
output = argv[1];
bufSize = getpagesize();
buf = malloc(getpagesize());
if (buf == NULL)
errx(1, "malloc failed");
output_fd = open(output, O_CREAT|O_RDWR);
if (output_fd == -1)
err(1, "open(%s)", output);
for (cpu = 0; cpu < get_nprocs(); cpu++) {
if (cpu != flush_cpu)
dirty_on(output_fd, cpu);
}
set_affinity(flush_cpu);
if (fsync(output_fd))
err(1, "fsync(%s)", output);
if (close(output_fd))
err(1, "close(%s)", output);
free(buf);
}
Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
collect exact per memcg counters. This avoids the aforementioned oom
kills.
This does not affect the overhead of memory.stat, which still reads the
single atomic counter.
Why not use percpu_counter? memcg already handles cpus going offline, so
no need for that overhead from percpu_counter. And the percpu_counter
spinlocks are more heavyweight than is required.
It probably also makes sense to use exact dirty and writeback counters
in memcg oom reports. But that is saved for later.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Greg Thelen <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: <[email protected]> [4.16+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Symmetrically to VM_FAULT_SET_HINDEX(), we need a force-cast in
VM_FAULT_GET_HINDEX() to tell sparse that this is intentional.
Sparse complains about the current code when building a kernel with
CONFIG_MEMORY_FAILURE:
arch/x86/mm/fault.c:1058:53: warning: restricted vm_fault_t degrades to integer
Link: http://lkml.kernel.org/r/[email protected]
Fixes: 3d3539018d2c ("mm: create the new vm_fault_t type")
Signed-off-by: Jann Horn <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Souptick Joarder <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
clang points out with hundreds of warnings that the bitrev macros have a
problem with constant input:
drivers/hwmon/sht15.c:187:11: error: variable '__x' is uninitialized when used within its own initialization
[-Werror,-Wuninitialized]
u8 crc = bitrev8(data->val_status & 0x0F);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/bitrev.h:102:21: note: expanded from macro 'bitrev8'
__constant_bitrev8(__x) : \
~~~~~~~~~~~~~~~~~~~^~~~
include/linux/bitrev.h:67:11: note: expanded from macro '__constant_bitrev8'
u8 __x = x; \
~~~ ^
Both the bitrev and the __constant_bitrev macros use an internal
variable named __x, which goes horribly wrong when passing one to the
other.
The obvious fix is to rename one of the variables, so this adds an extra
'_'.
It seems we got away with this because
- there are only a few drivers using bitrev macros
- usually there are no constant arguments to those
- when they are constant, they tend to be either 0 or (unsigned)-1
(drivers/isdn/i4l/isdnhdlc.o, drivers/iio/amplifiers/ad8366.c) and
give the correct result by pure chance.
In fact, the only driver that I could find that gets different results
with this is drivers/net/wan/slic_ds26522.c, which in turn is a driver
for fairly rare hardware (adding the maintainer to Cc for testing).
Link: http://lkml.kernel.org/r/[email protected]
Fixes: 556d2f055bf6 ("ARM: 8187/1: add CONFIG_HAVE_ARCH_BITREVERSE to support rbit instruction")
Signed-off-by: Arnd Bergmann <[email protected]>
Reviewed-by: Nick Desaulniers <[email protected]>
Cc: Zhao Qiang <[email protected]>
Cc: Yalin Wang <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
A recent optimization in Clang (r355672) lowers comparisons of the
return value of memcmp against zero to comparisons of the return value
of bcmp against zero. This helps some platforms that implement bcmp
more efficiently than memcmp. glibc simply aliases bcmp to memcmp, but
an optimized implementation is in the works.
This results in linkage failures for all targets with Clang due to the
undefined symbol. For now, just implement bcmp as a tailcail to memcmp
to unbreak the build. This routine can be further optimized in the
future.
Other ideas discussed:
* A weak alias was discussed, but breaks for architectures that define
their own implementations of memcmp since aliases to declarations are
not permitted (only definitions). Arch-specific memcmp
implementations typically declare memcmp in C headers, but implement
them in assembly.
* -ffreestanding also is used sporadically throughout the kernel.
* -fno-builtin-bcmp doesn't work when doing LTO.
Link: https://bugs.llvm.org/show_bug.cgi?id=41035
Link: https://code.woboq.org/userspace/glibc/string/memcmp.c.html#bcmp
Link: https://github.com/llvm/llvm-project/commit/8e16d73346f8091461319a7dfc4ddd18eedcff13
Link: https://github.com/ClangBuiltLinux/linux/issues/416
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Nick Desaulniers <[email protected]>
Reported-by: Nathan Chancellor <[email protected]>
Reported-by: Adhemerval Zanella <[email protected]>
Suggested-by: Arnd Bergmann <[email protected]>
Suggested-by: James Y Knight <[email protected]>
Suggested-by: Masahiro Yamada <[email protected]>
Suggested-by: Nathan Chancellor <[email protected]>
Suggested-by: Rasmus Villemoes <[email protected]>
Acked-by: Steven Rostedt (VMware) <[email protected]>
Reviewed-by: Nathan Chancellor <[email protected]>
Tested-by: Nathan Chancellor <[email protected]>
Reviewed-by: Masahiro Yamada <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Cc: David Laight <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull syscall-get-arguments cleanup and fixes from Steven Rostedt:
"Andy Lutomirski approached me to tell me that the
syscall_get_arguments() implementation in x86 was horrible and gcc
certainly gets it wrong.
He said that since the tracepoints only pass in 0 and 6 for i and n
repectively, it should be optimized for that case. Inspecting the
kernel, I discovered that all users pass in 0 for i and only one file
passing in something other than 6 for the number of arguments. That
code happens to be my own code used for the special syscall tracing.
That can easily be converted to just using 0 and 6 as well, and only
copying what is needed. Which is probably the faster path anyway for
that case.
Along the way, a couple of real fixes came from this as the
syscall_get_arguments() function was incorrect for csky and riscv.
x86 has been optimized to for the new interface that removes the
variable number of arguments, but the other architectures could still
use some loving and take more advantage of the simpler interface"
* tag 'trace-5.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
syscalls: Remove start and number from syscall_set_arguments() args
syscalls: Remove start and number from syscall_get_arguments() args
csky: Fix syscall_get_arguments() and syscall_set_arguments()
riscv: Fix syscall_get_arguments() and syscall_set_arguments()
tracing/syscalls: Pass in hardcoded 6 into syscall_get_arguments()
ptrace: Remove maxargs from task_current_syscall()
|
|
In preparation of dropping interconnect target module platform data in
favor of devicetree based data, we must pass swsup idle quirks to the
platform data functions.
For now, let's only tag the UART modules with the SWSUP_SIDLE_ACT quirk.
The other modules will get tagged with swsup quirks as we drop the
platform data and test the changes.
Signed-off-by: Tony Lindgren <[email protected]>
|
|
Minor comment merge conflict in mlx5.
Staging driver has a fixup due to the skb->xmit_more changes
in 'net-next', but was removed in 'net'.
Signed-off-by: David S. Miller <[email protected]>
|
|
Handle event of power state change in the PCIE slot. When the event
occurs, check if query power state and PCI power fields is supported. If
so, read these fields from MPEIN (management PCIE info) register and
issue a corresponding message.
Signed-off-by: Aya Levin <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
This merge commit includes some misc shared code updates from mlx5-next branch needed
for net-next.
1) From Maxim, Remove un-used macros and spinlock from mlx5 code.
2) From Aya, Expose Management PCIE info register layout and add rate limit
print macros.
3) From Tariq, Compilation warning fix in fs_core.c
4) From Vu, Huy and Saeed, Improve mlx5 initialization flow:
The goal is to provide a better logical separation of mlx5 core
device initialization flow and will help to seamlessly support
creating different mlx5 device types such as PF, VF and SF
mlx5 sub-function virtual devices.
Mlx5_core driver needs to separate HCA resources from pci resources.
Its initialize/load/unload will be broken into stages:
1. Initialize common data structures
2. Setup function which initializes pci resources (for PF/VF)
or some other specific resources for virtual device
3. Initialize software objects according to hardware capabilities
4. Load all mlx5_core components
It is also necessary to detach mlx5_core mdev name/message from pci
device mdev->pdev name/message for a clearer report/debug of
different mlx5 device types.
Signed-off-by: Saeed Mahameed <[email protected]>
|
|
Provide a nice little shortcut for mapping a single bvec.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
|
|
In a lot of places we want to know the DMA direction for a given
struct request. Add a little helper to make it a littler easier.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
|
|
This provides a nice little shortcut to get the integrity data for
drivers like NVMe that only support a single integrity segment.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
|
|
Return the currently active bvec segment, potentially spanning multiple
pages.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
|
|
Pull networking fixes from David Miller:
1) Several hash table refcount fixes in batman-adv, from Sven
Eckelmann.
2) Use after free in bpf_evict_inode(), from Daniel Borkmann.
3) Fix mdio bus registration in ixgbe, from Ivan Vecera.
4) Unbounded loop in __skb_try_recv_datagram(), from Paolo Abeni.
5) ila rhashtable corruption fix from Herbert Xu.
6) Don't allow upper-devices to be added to vrf devices, from Sabrina
Dubroca.
7) Add qmi_wwan device ID for Olicard 600, from Bjørn Mork.
8) Don't leave skb->next poisoned in __netif_receive_skb_list_ptype,
from Alexander Lobakin.
9) Missing IDR checks in mlx5 driver, from Aditya Pakki.
10) Fix false connection termination in ktls, from Jakub Kicinski.
11) Work around some ASPM issues with r8169 by disabling rx interrupt
coalescing on certain chips. From Heiner Kallweit.
12) Properly use per-cpu qstat values on NOLOCK qdiscs, from Paolo
Abeni.
13) Fully initialize sockaddr_in structures in SCTP, from Xin Long.
14) Various BPF flow dissector fixes from Stanislav Fomichev.
15) Divide by zero in act_sample, from Davide Caratti.
16) Fix bridging multicast regression introduced by rhashtable
conversion, from Nikolay Aleksandrov.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits)
ibmvnic: Fix completion structure initialization
ipv6: sit: reset ip header pointer in ipip6_rcv
net: bridge: always clear mcast matching struct on reports and leaves
libcxgb: fix incorrect ppmax calculation
vlan: conditional inclusion of FCoE hooks to match netdevice.h and bnx2x
sch_cake: Make sure we can write the IP header before changing DSCP bits
sch_cake: Use tc_skb_protocol() helper for getting packet protocol
tcp: Ensure DCTCP reacts to losses
net/sched: act_sample: fix divide by zero in the traffic path
net: thunderx: fix NULL pointer dereference in nicvf_open/nicvf_stop
net: hns: Fix sparse: some warnings in HNS drivers
net: hns: Fix WARNING when remove HNS driver with SMMU enabled
net: hns: fix ICMP6 neighbor solicitation messages discard problem
net: hns: Fix probabilistic memory overwrite when HNS driver initialized
net: hns: Use NAPI_POLL_WEIGHT for hns driver
net: hns: fix KASAN: use-after-free in hns_nic_net_xmit_hw()
flow_dissector: rst'ify documentation
ipv6: Fix dangling pointer when ipv6 fragment
net-gro: Fix GRO flush when receiving a GSO packet.
flow_dissector: document BPF flow dissector environment
...
|
|
git://anongit.freedesktop.org/drm/drm-misc into drm-next
drm-misc-next for 5.2:
UAPI Changes:
-syncobj: Add TIMELINE_WAIT|QUERY|TRANSFER|TIMELINE_SIGNAL ioctls (Chunming)
-Clarify that 1.0 can be represented by drm_color_lut (Daniel)
Cross-subsystem Changes:
-dt-bindings: Add binding for rk3066 hdmi (Johan)
-dt-bindings: Add binding for Feiyang FY07024DI26A30-D panel (Jagan)
-dt-bindings: Add Rocktech vendor prefix and jh057n00900 panel bindings (Guido)
-MAINTAINERS: Add lima and ASPEED entries (Joel & Qiang)
Core Changes:
-memory: use dma_alloc_coherent when mem encryption is active (Christian)
-dma_buf: add support for a dma_fence chain (Christian)
-shmem_gem: fix off-by-one bug in new shmem gem helpers (Dan)
Driver Changes:
-rockchip: Add support for rk3066 hdmi (Johan)
-ASPEED: Add driver supporting ASPEED BMC display controller to drm (Joel)
-lima: Add driver supporting Arm Mali4xx gpus to drm (Qiang)
-vc4/v3d: Various cleanups and improved error handling (Eric)
-panel: Add support for Feiyang FY07024DI26A30-D MIPI-DSI panel (Jagan)
-panel: Add support for Rocktech jh057n00900 MIPI-DSI panel (Guido)
Cc: Johan Jonker <[email protected]>
Cc: Christian König <[email protected]>
Cc: Chunming Zhou <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: Eric Anholt <[email protected]>
Cc: Qiang Yu <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Jagan Teki <[email protected]>
Cc: Guido Günther <[email protected]>
Cc: Joel Stanley <[email protected]>
[airlied: fixed XA limit build breakage, Rodrigo also submitted the same patch, but
I squashed it in the merge.]
Signed-off-by: Dave Airlie <[email protected]>
From: Sean Paul <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/20190404201016.GA139524@art_vandelay
|
|
What happens there is that we are replacing file->path.mnt of
a file we'd just opened with a clone and we need the write
count contribution to be transferred from original mount to
new one. That's it. We do *NOT* want any kind of freeze
protection for the duration of switchover.
IOW, we should just use __mnt_{want,drop}_write() for that
switchover; no need to bother with mnt_{want,drop}_write()
there.
Tested-by: Amir Goldstein <[email protected]>
Reported-by: [email protected]
Signed-off-by: Al Viro <[email protected]>
|
|
syzbot reports:
BUG: using __this_cpu_read() in preemptible code:
caller is dev_recursion_level include/linux/netdevice.h:3052 [inline]
__this_cpu_preempt_check+0x246/0x270 lib/smp_processor_id.c:47
dev_recursion_level include/linux/netdevice.h:3052 [inline]
ip6_skb_dst_mtu include/net/ip6_route.h:245 [inline]
I erronously downgraded a this_cpu_read to __this_cpu_read when
moving dev_recursion_level() around.
Reported-by: [email protected]
Fixes: 97cdcf37b57e ("net: place xmit recursion in softnet data")
Signed-off-by: Florian Westphal <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
For devices from the SigmaDelta family we need to keep CS low when doing a
conversion, since the device will use the MISO line as a interrupt to
indicate that the conversion is complete.
This is why the driver locks the SPI bus and when the SPI bus is locked
keeps as long as a conversion is going on. The current implementation gets
one small detail wrong though. CS is only de-asserted after the SPI bus is
unlocked. This means it is possible for a different SPI device on the same
bus to send a message which would be wrongfully be addressed to the
SigmaDelta device as well. Make sure that the last SPI transfer that is
done while holding the SPI bus lock de-asserts the CS signal.
Signed-off-by: Lars-Peter Clausen <[email protected]>
Signed-off-by: Alexandru Ardelean <[email protected]>
Signed-off-by: Jonathan Cameron <[email protected]>
|
|
Replace diff_{m1,m2} with div_{m1,m2} since they are dividers and not a
differential settings.
Signed-off-by: Lars-Peter Clausen <[email protected]>
Signed-off-by: Alexandru Ardelean <[email protected]>
Signed-off-by: Jonathan Cameron <[email protected]>
|
|
If we put headers alphabetically sorted in the IIO driver,
the compilation will abort because of unknown type to handle.
Simple add a forward declaration of opaque struct iio_dev.
Suggested-by: Jonathan Cameron <[email protected]>
Signed-off-by: Andy Shevchenko <[email protected]>
Signed-off-by: Jonathan Cameron <[email protected]>
|
|
Some variants in the ADIS16400 family support burst mode. This mechanism is
implemented in the `adis16400_buffer.c` file.
Some variants in ADIS16480 are also adding burst mode, which is
functionally similar to ADIS16400, but with different parameters. To get
there, a `adis_burst` struct is added to parametrize certain bits of the
SPI communication to setup: the register that triggers burst-mode, and the
extra-data-length that needs be accounted for when building the bust-length
buffer.
The trigger handler cannot be made generic, since it's very specific to
each ADIS164XX family.
A future enhancement of this `adis_burst` mode will be the possibility to
enable/disable burst-mode. For the ADIS16400 family it's hard-coded to on
by default. But for ADIS16480 there will be a need to disable this.
When that will be implemented, both ADIS16400 & ADIS16480 will have the
burst-mode enable-able/disable-able.
Signed-off-by: Alexandru Ardelean <[email protected]>
Signed-off-by: Jonathan Cameron <[email protected]>
|
|
This patch allows to read a mount-matrix device tree
property and report to user-space or in-kernel iio
clients.
Signed-off-by: H. Nikolaus Schaller <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Reviewed-by: Linus Walleij <[email protected]>
Signed-off-by: Jonathan Cameron <[email protected]>
|
|
Currently mount matrix is allowed in Device Tree, though there is
no technical issue to extend it to support ACPI.
Convert the function to use device_property_read_string_array() and
thus allow to read mount matrix from ACPI if available.
Example of use in _DSD method:
Name (_DSD, Package ()
{
ToUUID("daffd814-6eba-4d8c-8a91-bc9bbf4aa301"),
Package ()
{
Package () { "mount-matrix", Package() {
"1", "0", "0",
"0", "0.866", "0.5",
"0", "-0.5", "0.866",
} },
}
})
At the same time drop the "of" prefix from its name and
convert current users.
No functional change intended.
Signed-off-by: Andy Shevchenko <[email protected]>
Reviewed-by: Linus Walleij <[email protected]>
Signed-off-by: Jonathan Cameron <[email protected]>
|
|
It turns out that one specific hardware has two direction
registers: one to set a GPIO line as input and another one
to set a GPIO line as output. So in theory a line can be
configured as input and output at the same time.
Make the MMIO GPIO helper deal with this: store both
registers in the state container, use both in the generic
code if present. Synchronize the input register to the
output register when we register a GPIO chip, with the
output settings taking precedence.
Keep the helper variable to detect inverted direction
semantics (only direction in register) but augment the
code to be more straight-forward for the generic case
when setting the registers.
Fix some flunky with unreadable direction registers at
the same time as we're touching this code.
Cc: David Woods <[email protected]>
Cc: Shravan Kumar Ramani <[email protected]>
Signed-off-by: Linus Walleij <[email protected]>
|
|
System memory may have caches to help improve access speed to frequently
requested address ranges. While the system provided cache is transparent
to the software accessing these memory ranges, applications can optimize
their own access based on cache attributes.
Provide a new API for the kernel to register these memory-side caches
under the memory node that provides it.
The new sysfs representation is modeled from the existing cpu cacheinfo
attributes, as seen from /sys/devices/system/cpu/<cpu>/cache/. Unlike CPU
cacheinfo though, the node cache level is reported from the view of the
memory. A higher level number is nearer to the CPU, while lower levels
are closer to the last level memory.
The exported attributes are the cache size, the line size, associativity
indexing, and write back policy, and add the attributes for the system
memory caches to sysfs stable documentation.
Signed-off-by: Keith Busch <[email protected]>
Reviewed-by: Rafael J. Wysocki <[email protected]>
Reviewed-by: Brice Goglin <[email protected]>
Tested-by: Brice Goglin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
Heterogeneous memory systems provide memory nodes with different latency
and bandwidth performance attributes. Provide a new kernel interface
for subsystems to register the attributes under the memory target
node's initiator access class. If the system provides this information,
applications may query these attributes when deciding which node to
request memory.
The following example shows the new sysfs hierarchy for a node exporting
performance attributes:
# tree -P "read*|write*"/sys/devices/system/node/nodeY/accessZ/initiators/
/sys/devices/system/node/nodeY/accessZ/initiators/
|-- read_bandwidth
|-- read_latency
|-- write_bandwidth
`-- write_latency
The bandwidth is exported as MB/s and latency is reported in
nanoseconds. The values are taken from the platform as reported by the
manufacturer.
Memory accesses from an initiator node that is not one of the memory's
access "Z" initiator nodes linked in the same directory may observe
different performance than reported here. When a subsystem makes use
of this interface, initiators of a different access number may not have
the same performance relative to initiators in other access numbers, or
omitted from the any access class' initiators.
Descriptions for memory access initiator performance access attributes
are added to sysfs stable documentation.
Acked-by: Jonathan Cameron <[email protected]>
Tested-by: Jonathan Cameron <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
Reviewed-by: Rafael J. Wysocki <[email protected]>
Tested-by: Brice Goglin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
Systems may be constructed with various specialized nodes. Some nodes
may provide memory, some provide compute devices that access and use
that memory, and others may provide both. Nodes that provide memory are
referred to as memory targets, and nodes that can initiate memory access
are referred to as memory initiators.
Memory targets will often have varying access characteristics from
different initiators, and platforms may have ways to express those
relationships. In preparation for these systems, provide interfaces for
the kernel to export the memory relationship among different nodes memory
targets and their initiators with symlinks to each other.
If a system provides access locality for each initiator-target pair, nodes
may be grouped into ranked access classes relative to other nodes. The
new interface allows a subsystem to register relationships of varying
classes if available and desired to be exported.
A memory initiator may have multiple memory targets in the same access
class. The target memory's initiators in a given class indicate the
nodes access characteristics share the same performance relative to other
linked initiator nodes. Each target within an initiator's access class,
though, do not necessarily perform the same as each other.
A memory target node may have multiple memory initiators. All linked
initiators in a target's class have the same access characteristics to
that target.
The following example show the nodes' new sysfs hierarchy for a memory
target node 'Y' with access class 0 from initiator node 'X':
# symlinks -v /sys/devices/system/node/nodeX/access0/
relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY
# symlinks -v /sys/devices/system/node/nodeY/access0/
relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX
The new attributes are added to the sysfs stable documentation.
Reviewed-by: Jonathan Cameron <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
Reviewed-by: Rafael J. Wysocki <[email protected]>
Tested-by: Brice Goglin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
The Heterogeneous Memory Attribute Table (HMAT) header has different
field lengths than the existing parsing uses. Add the HMAT type to the
parsing rules so it may be generically parsed.
Reviewed-by: Rafael J. Wysocki <[email protected]>
Acked-by: Jonathan Cameron <[email protected]>
Tested-by: Jonathan Cameron <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
Tested-by: Brice Goglin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|