aboutsummaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)AuthorFilesLines
2019-04-08mtd: rawnand: Get rid of chip->chipsizeBoris Brezillon1-2/+0
The target size can now be returned by nanddev_get_targetsize(). Get rid of the chip->chipsize field and use this helper instead. Signed-off-by: Boris Brezillon <[email protected]> Reviewed-by: Frieder Schrempf <[email protected]> Signed-off-by: Miquel Raynal <[email protected]>
2019-04-08mtd: rawnand: Get rid of chip->bits_per_cellBoris Brezillon1-4/+2
Now that we inherit from nand_device, we can use nand_device->memorg.bits_per_cell instead of having our own field at the nand_chip level. Signed-off-by: Boris Brezillon <[email protected]> Signed-off-by: Miquel Raynal <[email protected]> Reviewed-by: Frieder Schrempf <[email protected]>
2019-04-08mtd: rawnand: Use nanddev_mtd_max_bad_blocks()Boris Brezillon1-5/+0
nanddev_mtd_max_bad_blocks() is implemented by the generic NAND layer and is already doing what we need. Reuse this function instead of having our own implementation. While at it, get rid of the ->max_bb_per_die and ->blocks_per_die fields which are now unused. Signed-off-by: Boris Brezillon <[email protected]> Signed-off-by: Miquel Raynal <[email protected]> Reviewed-by: Frieder Schrempf <[email protected]>
2019-04-08mtd: rawnand: Move all page cache related fields to a sub-structBoris Brezillon1-7/+11
Looking at the field names it's hard to tell what ->data_buf, ->pagebuf and ->pagebuf_bitflips are for. Clarify that by moving those fields in a sub-struct named pagecache. Signed-off-by: Boris Brezillon <[email protected]> Signed-off-by: Miquel Raynal <[email protected]> Reviewed-by: Frieder Schrempf <[email protected]>
2019-04-08mtd: rawnand: Provide a helper to get chip->data_bufBoris Brezillon1-0/+21
We plan to move cache related fields to a pagecache struct in nand_chip but some drivers access ->pagebuf directly to invalidate the cache before they start using ->data_buf. Let's provide an helper that returns a pointer to ->data_buf after invalidating the cache. Signed-off-by: Boris Brezillon <[email protected]> Signed-off-by: Miquel Raynal <[email protected]> Reviewed-by: Frieder Schrempf <[email protected]>
2019-04-08mtd: rawnand: Prepare things to reuse the generic NAND layerBoris Brezillon1-4/+6
The generic NAND layer provides abstraction of NAND devices no matter the bus that is used to communicate with the chip. Basing the raw NAND core on this generic layer should avoid duplication of common operations, like iterating over all pages/blocks for MTD IO/erase operations. In order to re-use this layer, we must first inherit from nand_device and then initialize the nand_device struct appropriately. This patch is taking care of the former. Signed-off-by: Boris Brezillon <[email protected]> Signed-off-by: Miquel Raynal <[email protected]> Reviewed-by: Frieder Schrempf <[email protected]>
2019-04-08mtd: rawnand: Use nand_to_mtd() in nand_{set,get}_flash_node()Boris Brezillon1-11/+11
Use the nand_to_mtd() helper to access chip->mtd as done everywhere else. Signed-off-by: Boris Brezillon <[email protected]> Signed-off-by: Miquel Raynal <[email protected]> Reviewed-by: Frieder Schrempf <[email protected]>
2019-04-08mtd: nand: Add a helper to retrieve the number of pages per targetBoris Brezillon1-0/+14
Will be used by the raw NAND framework. Signed-off-by: Boris Brezillon <[email protected]> Signed-off-by: Miquel Raynal <[email protected]> Reviewed-by: Frieder Schrempf <[email protected]>
2019-04-08mtd: nand: Add a helper returning the number of eraseblocks per targetBoris Brezillon1-0/+12
Some drivers in the raw NAND framework seems to need this helper, so let's just add it instead of open-coding the logic. Signed-off-by: Boris Brezillon <[email protected]> Signed-off-by: Miquel Raynal <[email protected]> Reviewed-by: Frieder Schrempf <[email protected]>
2019-04-08mtd: nand: Add max_bad_eraseblocks_per_lun info to memorgBoris Brezillon1-1/+5
NAND datasheets usually give the maximum number of bad blocks per LUN and this number can be used to help upper layers decide how much blocks they should reserve for bad block handling. Add a max_bad_eraseblocks_per_lun to the nand_memory_organization struct and update the NAND_MEMORG() macro (and its users) accordingly. We also provide a default mtd->_max_bad_blocks() implementation. Signed-off-by: Boris Brezillon <[email protected]> Signed-off-by: Miquel Raynal <[email protected]> Reviewed-by: Frieder Schrempf <[email protected]>
2019-04-08spi: add a method for configuring CS timingSowjanya Komatineni1-0/+15
This patch creates set_cs_timing SPI master optional method for SPI masters to implement configuring CS timing if applicable. This patch also creates spi_cs_timing accessory for SPI clients to use for requesting SPI master controllers to configure device requested CS setup time, hold time and inactive delay. Signed-off-by: Sowjanya Komatineni <[email protected]> Signed-off-by: Mark Brown <[email protected]>
2019-04-08spi: bitbang: Introduce spi_bitbang_init()Andrey Smirnov1-0/+1
Move all of the code doing struct spi_bitbang initialization, so that it can be paired with devm_spi_register_master() in order to avoid having to call spi_bitbang_stop() explicitly. Signed-off-by: Andrey Smirnov <[email protected]> Cc: Mark Brown <[email protected]> Cc: Chris Healy <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Mark Brown <[email protected]>
2019-04-08crypto: ccp - introduce SEV_GET_ID2 commandSingh, Brijesh1-2/+1
The current definition and implementation of the SEV_GET_ID command does not provide the length of the unique ID returned by the firmware. As per the firmware specification, the firmware may return an ID length that is not restricted to 64 bytes as assumed by the SEV_GET_ID command. Introduce the SEV_GET_ID2 command to overcome with the SEV_GET_ID limitations. Deprecate the SEV_GET_ID in the favor of SEV_GET_ID2. At the same time update SEV API web link. Cc: Janakarajan Natarajan <[email protected]> Cc: Herbert Xu <[email protected]> Cc: Gary Hook <[email protected]> Cc: Tom Lendacky <[email protected]> Cc: Nathaniel McCallum <[email protected]> Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Herbert Xu <[email protected]>
2019-04-07rhashtable: add lockdep tracking to bucket bit-spin-locks.NeilBrown1-17/+34
Native bit_spin_locks are not tracked by lockdep. The bit_spin_locks used for rhashtable buckets are local to the rhashtable implementation, so there is little opportunity for the sort of misuse that lockdep might detect. However locks are held while a hash function or compare function is called, and if one of these took a lock, a misbehaviour is possible. As it is quite easy to add lockdep support this unlikely possibility seems to be enough justification. So create a lockdep class for bucket bit_spin_lock and attach through a lockdep_map in each bucket_table. Without the 'nested' annotation in rhashtable_rehash_one(), lockdep correctly reports a possible problem as this lock is taken while another bucket lock (in another table) is held. This confirms that the added support works. With the correct nested annotation in place, lockdep reports no problems. Signed-off-by: NeilBrown <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-04-07rhashtable: use bit_spin_locks to protect hash bucket.NeilBrown2-98/+165
This patch changes rhashtables to use a bit_spin_lock on BIT(1) of the bucket pointer to lock the hash chain for that bucket. The benefits of a bit spin_lock are: - no need to allocate a separate array of locks. - no need to have a configuration option to guide the choice of the size of this array - locking cost is often a single test-and-set in a cache line that will have to be loaded anyway. When inserting at, or removing from, the head of the chain, the unlock is free - writing the new address in the bucket head implicitly clears the lock bit. For __rhashtable_insert_fast() we ensure this always happens when adding a new key. - even when lockings costs 2 updates (lock and unlock), they are in a cacheline that needs to be read anyway. The cost of using a bit spin_lock is a little bit of code complexity, which I think is quite manageable. Bit spin_locks are sometimes inappropriate because they are not fair - if multiple CPUs repeatedly contend of the same lock, one CPU can easily be starved. This is not a credible situation with rhashtable. Multiple CPUs may want to repeatedly add or remove objects, but they will typically do so at different buckets, so they will attempt to acquire different locks. As we have more bit-locks than we previously had spinlocks (by at least a factor of two) we can expect slightly less contention to go with the slightly better cache behavior and reduced memory consumption. To enhance type checking, a new struct is introduced to represent the pointer plus lock-bit that is stored in the bucket-table. This is "struct rhash_lock_head" and is empty. A pointer to this needs to be cast to either an unsigned lock, or a "struct rhash_head *" to be useful. Variables of this type are most often called "bkt". Previously "pprev" would sometimes point to a bucket, and sometimes a ->next pointer in an rhash_head. As these are now different types, pprev is NULL when it would have pointed to the bucket. In that case, 'blk' is used, together with correct locking protocol. Signed-off-by: NeilBrown <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-04-07rhashtable: allow rht_bucket_var to return NULL.NeilBrown1-2/+9
Rather than returning a pointer to a static nulls, rht_bucket_var() now returns NULL if the bucket doesn't exist. This will make the next patch, which stores a bitlock in the bucket pointer, somewhat cleaner. This change involves introducing __rht_bucket_nested() which is like rht_bucket_nested(), but doesn't provide the static nulls, and changing rht_bucket_nested() to call this and possible provide a static nulls - as is still needed for the non-var case. Signed-off-by: NeilBrown <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-04-07PM / arch: x86: Rework the MSR_IA32_ENERGY_PERF_BIAS handlingRafael J. Wysocki1-0/+1
The current handling of MSR_IA32_ENERGY_PERF_BIAS in the kernel is problematic, because it may cause changes made by user space to that MSR (with the help of the x86_energy_perf_policy tool, for example) to be lost every time a CPU goes offline and then back online as well as during system-wide power management transitions into sleep states and back into the working state. The first problem is that if the current EPB value for a CPU going online is 0 ('performance'), the kernel will change it to 6 ('normal') regardless of whether or not this is the first bring-up of that CPU. That also happens during system-wide resume from sleep states (including, but not limited to, hibernation). However, the EPB may have been adjusted by user space this way and the kernel should not blindly override that setting. The second problem is that if the platform firmware resets the EPB values for any CPUs during system-wide resume from a sleep state, the kernel will not restore their previous EPB values that may have been set by user space before the preceding system-wide suspend transition. Again, that behavior may at least be confusing from the user space perspective. In order to address these issues, rework the handling of MSR_IA32_ENERGY_PERF_BIAS so that the EPB value is saved on CPU offline and restored on CPU online as well as (for the boot CPU) during the syscore stages of system-wide suspend and resume transitions, respectively. However, retain the policy by which the EPB is set to 6 ('normal') on the first bring-up of each CPU if its initial value is 0, based on the observation that 0 may mean 'not initialized' just as well as 'performance' in that case. While at it, move the MSR_IA32_ENERGY_PERF_BIAS handling code into a separate file and document it in Documentation/admin-guide. Fixes: abe48b108247 (x86, intel, power: Initialize MSR_IA32_ENERGY_PERF_BIAS) Fixes: b51ef52df71c (x86/cpu: Restore MSR_IA32_ENERGY_PERF_BIAS after resume) Reported-by: Thomas Renninger <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Acked-by: Borislav Petkov <[email protected]> Acked-by: Thomas Gleixner <[email protected]>
2019-04-07mfd: ti-lmu: Remove LM3532 backlight driver referencesDan Murphy2-45/+0
Remove the LM3532 backlight driver references from the ti-lmu code as dedicated driver support is available. Signed-off-by: Dan Murphy <[email protected]> Acked-for-MFD-by: Lee Jones <[email protected]> Signed-off-by: Jacek Anaszewski <[email protected]>
2019-04-06fs: stream_open - opener for stream-like files so that read and write can ↵Kirill Smelkov1-0/+4
run simultaneously without deadlock Commit 9c225f2655e3 ("vfs: atomic f_pos accesses as per POSIX") added locking for file.f_pos access and in particular made concurrent read and write not possible - now both those functions take f_pos lock for the whole run, and so if e.g. a read is blocked waiting for data, write will deadlock waiting for that read to complete. This caused regression for stream-like files where previously read and write could run simultaneously, but after that patch could not do so anymore. See e.g. commit 581d21a2d02a ("xenbus: fix deadlock on writes to /proc/xen/xenbus") which fixes such regression for particular case of /proc/xen/xenbus. The patch that added f_pos lock in 2014 did so to guarantee POSIX thread safety for read/write/lseek and added the locking to file descriptors of all regular files. In 2014 that thread-safety problem was not new as it was already discussed earlier in 2006. However even though 2006'th version of Linus's patch was adding f_pos locking "only for files that are marked seekable with FMODE_LSEEK (thus avoiding the stream-like objects like pipes and sockets)", the 2014 version - the one that actually made it into the tree as 9c225f2655e3 - is doing so irregardless of whether a file is seekable or not. See https://lore.kernel.org/lkml/[email protected]/ https://lwn.net/Articles/180387 https://lwn.net/Articles/180396 for historic context. The reason that it did so is, probably, that there are many files that are marked non-seekable, but e.g. their read implementation actually depends on knowing current position to correctly handle the read. Some examples: kernel/power/user.c snapshot_read fs/debugfs/file.c u32_array_read fs/fuse/control.c fuse_conn_waiting_read + ... drivers/hwmon/asus_atk0110.c atk_debugfs_ggrp_read arch/s390/hypfs/inode.c hypfs_read_iter ... Despite that, many nonseekable_open users implement read and write with pure stream semantics - they don't depend on passed ppos at all. And for those cases where read could wait for something inside, it creates a situation similar to xenbus - the write could be never made to go until read is done, and read is waiting for some, potentially external, event, for potentially unbounded time -> deadlock. Besides xenbus, there are 14 such places in the kernel that I've found with semantic patch (see below): drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write() drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write() drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write() drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write() net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write() drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write() drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write() drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write() net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write() drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write() drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write() drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write() drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write() drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write() In addition to the cases above another regression caused by f_pos locking is that now FUSE filesystems that implement open with FOPEN_NONSEEKABLE flag, can no longer implement bidirectional stream-like files - for the same reason as above e.g. read can deadlock write locking on file.f_pos in the kernel. FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990f715 ("fuse: implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and write routines not depending on current position at all, and with both read and write being potentially blocking operations: See https://github.com/libfuse/osspd https://lwn.net/Articles/308445 https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406 https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477 https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510 Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as "somewhat pipe-like files ..." with read handler not using offset. However that test implements only read without write and cannot exercise the deadlock scenario: https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131 https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163 https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216 I've actually hit the read vs write deadlock for real while implementing my FUSE filesystem where there is /head/watch file, for which open creates separate bidirectional socket-like stream in between filesystem and its user with both read and write being later performed simultaneously. And there it is semantically not easy to split the stream into two separate read-only and write-only channels: https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169 Let's fix this regression. The plan is: 1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS - doing so would break many in-kernel nonseekable_open users which actually use ppos in read/write handlers. 2. Add stream_open() to kernel to open stream-like non-seekable file descriptors. Read and write on such file descriptors would never use nor change ppos. And with that property on stream-like files read and write will be running without taking f_pos lock - i.e. read and write could be running simultaneously. 3. With semantic patch search and convert to stream_open all in-kernel nonseekable_open users for which read and write actually do not depend on ppos and where there is no other methods in file_operations which assume @offset access. 4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via steam_open if that bit is present in filesystem open reply. It was tempting to change fs/fuse/ open handler to use stream_open instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE, and in particular GVFS which actually uses offset in its read and write handlers https://codesearch.debian.net/search?q=-%3Enonseekable+%3D https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481 so if we would do such a change it will break a real user. 5. Add stream_open and FOPEN_STREAM handling to stable kernels starting from v3.14+ (the kernel where 9c225f2655 first appeared). This will allow to patch OSSPD and other FUSE filesystems that provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE in their open handler and this way avoid the deadlock on all kernel versions. This should work because fs/fuse/ ignores unknown open flags returned from a filesystem and so passing FOPEN_STREAM to a kernel that is not aware of this flag cannot hurt. In turn the kernel that is not aware of FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE is sufficient to implement streams without read vs write deadlock. This patch adds stream_open, converts /proc/xen/xenbus to it and adds semantic patch to automatically locate in-kernel places that are either required to be converted due to read vs write deadlock, or that are just safe to be converted because read and write do not use ppos and there are no other funky methods in file_operations. Regarding semantic patch I've verified each generated change manually - that it is correct to convert - and each other nonseekable_open instance left - that it is either not correct to convert there, or that it is not converted due to current stream_open.cocci limitations. The script also does not convert files that should be valid to convert, but that currently have .llseek = noop_llseek or generic_file_llseek for unknown reason despite file being opened with nonseekable_open (e.g. drivers/input/mousedev.c) Cc: Michael Kerrisk <[email protected]> Cc: Yongzhi Pan <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: David Vrabel <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Julia Lawall <[email protected]> Cc: Nikolaus Rath <[email protected]> Cc: Han-Wen Nienhuys <[email protected]> Signed-off-by: Kirill Smelkov <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-04-06block: remove CONFIG_LBDAFChristoph Hellwig3-21/+6
Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit architectures. These types are required to support block device and/or file sizes larger than 2 TiB, and have generally defaulted to on for a long time. Enabling the option only increases the i386 tinyconfig size by 145 bytes, and many data structures already always use 64-bit values for their in-core and on-disk data structures anyway, so there should not be a large change in dynamic memory usage either. Dropping this option removes a somewhat weird non-default config that has cause various bugs or compiler warnings when actually used. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-04-06PCI: Work around Pericom PCIe-to-PCI bridge Retrain Link erratumStefan Mätje1-0/+2
Due to an erratum in some Pericom PCIe-to-PCI bridges in reverse mode (conventional PCI on primary side, PCIe on downstream side), the Retrain Link bit needs to be cleared manually to allow the link training to complete successfully. If it is not cleared manually, the link training is continuously restarted and no devices below the PCI-to-PCIe bridge can be accessed. That means drivers for devices below the bridge will be loaded but won't work and may even crash because the driver is only reading 0xffff. See the Pericom Errata Sheet PI7C9X111SLB_errata_rev1.2_102711.pdf for details. Devices known as affected so far are: PI7C9X110, PI7C9X111SL, PI7C9X130. Add a new flag, clear_retrain_link, in struct pci_dev. Quirks for affected devices set this bit. Note that pcie_retrain_link() lives in aspm.c because that's currently the only place we use it, but this erratum is not specific to ASPM, and we may retrain links for other reasons in the future. Signed-off-by: Stefan Mätje <[email protected]> [bhelgaas: apply regardless of CONFIG_PCIEASPM] Signed-off-by: Bjorn Helgaas <[email protected]> Reviewed-by: Andy Shevchenko <[email protected]> CC: [email protected]
2019-04-05Merge branch 'akpm' (patches from Andrew)Linus Torvalds4-25/+31
Merge misc fixes from Andrew Morton: "14 fixes" * emailed patches from Andrew Morton <[email protected]>: kernel/sysctl.c: fix out-of-bounds access when setting file-max mm/util.c: fix strndup_user() comment sh: fix multiple function definition build errors MAINTAINERS: add maintainer and replacing reviewer ARM/NUVOTON NPCM MAINTAINERS: fix bad pattern in ARM/NUVOTON NPCM mm: writeback: use exact memcg dirty counts psi: clarify the units used in pressure files mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd() hugetlbfs: fix memory leak for resv_map mm: fix vm_fault_t cast in VM_FAULT_GET_HINDEX() lib/lzo: fix bugs for very short or empty input include/linux/bitrev.h: fix constant bitrev kmemleak: powerpc: skip scanning holes in the .bss section lib/string.c: implement a basic bcmp
2019-04-05mm: writeback: use exact memcg dirty countsGreg Thelen1-1/+4
Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting") memcg dirty and writeback counters are managed as: 1) per-memcg per-cpu values in range of [-32..32] 2) per-memcg atomic counter When a per-cpu counter cannot fit in [-32..32] it's flushed to the atomic. Stat readers only check the atomic. Thus readers such as balance_dirty_pages() may see a nontrivial error margin: 32 pages per cpu. Assuming 100 cpus: 4k x86 page_size: 13 MiB error per memcg 64k ppc page_size: 200 MiB error per memcg Considering that dirty+writeback are used together for some decisions the errors double. This inaccuracy can lead to undeserved oom kills. One nasty case is when all per-cpu counters hold positive values offsetting an atomic negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32). balance_dirty_pages() only consults the atomic and does not consider throttling the next n_cpu*32 dirty pages. If the file_lru is in the 13..200 MiB range then there's absolutely no dirty throttling, which burdens vmscan with only dirty+writeback pages thus resorting to oom kill. It could be argued that tiny containers are not supported, but it's more subtle. It's the amount the space available for file lru that matters. If a container has memory.max-200MiB of non reclaimable memory, then it will also suffer such oom kills on a 100 cpu machine. The following test reliably ooms without this patch. This patch avoids oom kills. $ cat test mount -t cgroup2 none /dev/cgroup cd /dev/cgroup echo +io +memory > cgroup.subtree_control mkdir test cd test echo 10M > memory.max (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo) (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100) $ cat memcg-writeback-stress.c /* * Dirty pages from all but one cpu. * Clean pages from the non dirtying cpu. * This is to stress per cpu counter imbalance. * On a 100 cpu machine: * - per memcg per cpu dirty count is 32 pages for each of 99 cpus * - per memcg atomic is -99*32 pages * - thus the complete dirty limit: sum of all counters 0 * - balance_dirty_pages() only sees atomic count -99*32 pages, which * it max()s to 0. * - So a workload can dirty -99*32 pages before balance_dirty_pages() * cares. */ #define _GNU_SOURCE #include <err.h> #include <fcntl.h> #include <sched.h> #include <stdlib.h> #include <stdio.h> #include <sys/stat.h> #include <sys/sysinfo.h> #include <sys/types.h> #include <unistd.h> static char *buf; static int bufSize; static void set_affinity(int cpu) { cpu_set_t affinity; CPU_ZERO(&affinity); CPU_SET(cpu, &affinity); if (sched_setaffinity(0, sizeof(affinity), &affinity)) err(1, "sched_setaffinity"); } static void dirty_on(int output_fd, int cpu) { int i, wrote; set_affinity(cpu); for (i = 0; i < 32; i++) { for (wrote = 0; wrote < bufSize; ) { int ret = write(output_fd, buf+wrote, bufSize-wrote); if (ret == -1) err(1, "write"); wrote += ret; } } } int main(int argc, char **argv) { int cpu, flush_cpu = 1, output_fd; const char *output; if (argc != 2) errx(1, "usage: output_file"); output = argv[1]; bufSize = getpagesize(); buf = malloc(getpagesize()); if (buf == NULL) errx(1, "malloc failed"); output_fd = open(output, O_CREAT|O_RDWR); if (output_fd == -1) err(1, "open(%s)", output); for (cpu = 0; cpu < get_nprocs(); cpu++) { if (cpu != flush_cpu) dirty_on(output_fd, cpu); } set_affinity(flush_cpu); if (fsync(output_fd)) err(1, "fsync(%s)", output); if (close(output_fd)) err(1, "close(%s)", output); free(buf); } Make balance_dirty_pages() and wb_over_bg_thresh() work harder to collect exact per memcg counters. This avoids the aforementioned oom kills. This does not affect the overhead of memory.stat, which still reads the single atomic counter. Why not use percpu_counter? memcg already handles cpus going offline, so no need for that overhead from percpu_counter. And the percpu_counter spinlocks are more heavyweight than is required. It probably also makes sense to use exact dirty and writeback counters in memcg oom reports. But that is saved for later. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Greg Thelen <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Tejun Heo <[email protected]> Cc: <[email protected]> [4.16+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-04-05mm: fix vm_fault_t cast in VM_FAULT_GET_HINDEX()Jann Horn1-1/+1
Symmetrically to VM_FAULT_SET_HINDEX(), we need a force-cast in VM_FAULT_GET_HINDEX() to tell sparse that this is intentional. Sparse complains about the current code when building a kernel with CONFIG_MEMORY_FAILURE: arch/x86/mm/fault.c:1058:53: warning: restricted vm_fault_t degrades to integer Link: http://lkml.kernel.org/r/[email protected] Fixes: 3d3539018d2c ("mm: create the new vm_fault_t type") Signed-off-by: Jann Horn <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Souptick Joarder <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-04-05include/linux/bitrev.h: fix constant bitrevArnd Bergmann1-23/+23
clang points out with hundreds of warnings that the bitrev macros have a problem with constant input: drivers/hwmon/sht15.c:187:11: error: variable '__x' is uninitialized when used within its own initialization [-Werror,-Wuninitialized] u8 crc = bitrev8(data->val_status & 0x0F); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ include/linux/bitrev.h:102:21: note: expanded from macro 'bitrev8' __constant_bitrev8(__x) : \ ~~~~~~~~~~~~~~~~~~~^~~~ include/linux/bitrev.h:67:11: note: expanded from macro '__constant_bitrev8' u8 __x = x; \ ~~~ ^ Both the bitrev and the __constant_bitrev macros use an internal variable named __x, which goes horribly wrong when passing one to the other. The obvious fix is to rename one of the variables, so this adds an extra '_'. It seems we got away with this because - there are only a few drivers using bitrev macros - usually there are no constant arguments to those - when they are constant, they tend to be either 0 or (unsigned)-1 (drivers/isdn/i4l/isdnhdlc.o, drivers/iio/amplifiers/ad8366.c) and give the correct result by pure chance. In fact, the only driver that I could find that gets different results with this is drivers/net/wan/slic_ds26522.c, which in turn is a driver for fairly rare hardware (adding the maintainer to Cc for testing). Link: http://lkml.kernel.org/r/[email protected] Fixes: 556d2f055bf6 ("ARM: 8187/1: add CONFIG_HAVE_ARCH_BITREVERSE to support rbit instruction") Signed-off-by: Arnd Bergmann <[email protected]> Reviewed-by: Nick Desaulniers <[email protected]> Cc: Zhao Qiang <[email protected]> Cc: Yalin Wang <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-04-05lib/string.c: implement a basic bcmpNick Desaulniers1-0/+3
A recent optimization in Clang (r355672) lowers comparisons of the return value of memcmp against zero to comparisons of the return value of bcmp against zero. This helps some platforms that implement bcmp more efficiently than memcmp. glibc simply aliases bcmp to memcmp, but an optimized implementation is in the works. This results in linkage failures for all targets with Clang due to the undefined symbol. For now, just implement bcmp as a tailcail to memcmp to unbreak the build. This routine can be further optimized in the future. Other ideas discussed: * A weak alias was discussed, but breaks for architectures that define their own implementations of memcmp since aliases to declarations are not permitted (only definitions). Arch-specific memcmp implementations typically declare memcmp in C headers, but implement them in assembly. * -ffreestanding also is used sporadically throughout the kernel. * -fno-builtin-bcmp doesn't work when doing LTO. Link: https://bugs.llvm.org/show_bug.cgi?id=41035 Link: https://code.woboq.org/userspace/glibc/string/memcmp.c.html#bcmp Link: https://github.com/llvm/llvm-project/commit/8e16d73346f8091461319a7dfc4ddd18eedcff13 Link: https://github.com/ClangBuiltLinux/linux/issues/416 Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Nick Desaulniers <[email protected]> Reported-by: Nathan Chancellor <[email protected]> Reported-by: Adhemerval Zanella <[email protected]> Suggested-by: Arnd Bergmann <[email protected]> Suggested-by: James Y Knight <[email protected]> Suggested-by: Masahiro Yamada <[email protected]> Suggested-by: Nathan Chancellor <[email protected]> Suggested-by: Rasmus Villemoes <[email protected]> Acked-by: Steven Rostedt (VMware) <[email protected]> Reviewed-by: Nathan Chancellor <[email protected]> Tested-by: Nathan Chancellor <[email protected]> Reviewed-by: Masahiro Yamada <[email protected]> Reviewed-by: Andy Shevchenko <[email protected]> Cc: David Laight <[email protected]> Cc: Rasmus Villemoes <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Dan Williams <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-04-05Merge tag 'trace-5.1-rc3' of ↵Linus Torvalds1-3/+8
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull syscall-get-arguments cleanup and fixes from Steven Rostedt: "Andy Lutomirski approached me to tell me that the syscall_get_arguments() implementation in x86 was horrible and gcc certainly gets it wrong. He said that since the tracepoints only pass in 0 and 6 for i and n repectively, it should be optimized for that case. Inspecting the kernel, I discovered that all users pass in 0 for i and only one file passing in something other than 6 for the number of arguments. That code happens to be my own code used for the special syscall tracing. That can easily be converted to just using 0 and 6 as well, and only copying what is needed. Which is probably the faster path anyway for that case. Along the way, a couple of real fixes came from this as the syscall_get_arguments() function was incorrect for csky and riscv. x86 has been optimized to for the new interface that removes the variable number of arguments, but the other architectures could still use some loving and take more advantage of the simpler interface" * tag 'trace-5.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: syscalls: Remove start and number from syscall_set_arguments() args syscalls: Remove start and number from syscall_get_arguments() args csky: Fix syscall_get_arguments() and syscall_set_arguments() riscv: Fix syscall_get_arguments() and syscall_set_arguments() tracing/syscalls: Pass in hardcoded 6 into syscall_get_arguments() ptrace: Remove maxargs from task_current_syscall()
2019-04-05bus: ti-sysc: Handle swsup idle mode quirksTony Lindgren1-0/+3
In preparation of dropping interconnect target module platform data in favor of devicetree based data, we must pass swsup idle quirks to the platform data functions. For now, let's only tag the UART modules with the SWSUP_SIDLE_ACT quirk. The other modules will get tagged with swsup quirks as we drop the platform data and test the changes. Signed-off-by: Tony Lindgren <[email protected]>
2019-04-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller11-32/+39
Minor comment merge conflict in mlx5. Staging driver has a fixup due to the skb->xmit_more changes in 'net-next', but was removed in 'net'. Signed-off-by: David S. Miller <[email protected]>
2019-04-05net/mlx5: Handle event of power detection in the PCIE slotAya Levin1-0/+1
Handle event of power state change in the PCIE slot. When the event occurs, check if query power state and PCI power fields is supported. If so, read these fields from MPEIN (management PCIE info) register and issue a corresponding message. Signed-off-by: Aya Levin <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2019-04-05Merge branch 'mlx5-next' of ↵Saeed Mahameed4-32/+63
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux This merge commit includes some misc shared code updates from mlx5-next branch needed for net-next. 1) From Maxim, Remove un-used macros and spinlock from mlx5 code. 2) From Aya, Expose Management PCIE info register layout and add rate limit print macros. 3) From Tariq, Compilation warning fix in fs_core.c 4) From Vu, Huy and Saeed, Improve mlx5 initialization flow: The goal is to provide a better logical separation of mlx5 core device initialization flow and will help to seamlessly support creating different mlx5 device types such as PF, VF and SF mlx5 sub-function virtual devices. Mlx5_core driver needs to separate HCA resources from pci resources. Its initialize/load/unload will be broken into stages: 1. Initialize common data structures 2. Setup function which initializes pci resources (for PF/VF) or some other specific resources for virtual device 3. Initialize software objects according to hardware capabilities 4. Load all mlx5_core components It is also necessary to detach mlx5_core mdev name/message from pci device mdev->pdev name/message for a clearer report/debug of different mlx5 device types. Signed-off-by: Saeed Mahameed <[email protected]>
2019-04-05block: add dma_map_bvec helperChristoph Hellwig1-0/+4
Provide a nice little shortcut for mapping a single bvec. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]>
2019-04-05block: add a rq_dma_dir helperChristoph Hellwig1-0/+3
In a lot of places we want to know the DMA direction for a given struct request. Add a little helper to make it a littler easier. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]>
2019-04-05block: add a rq_integrity_vec helperChristoph Hellwig1-0/+16
This provides a nice little shortcut to get the integrity data for drivers like NVMe that only support a single integrity segment. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]>
2019-04-05block: add a req_bvec helperChristoph Hellwig1-0/+11
Return the currently active bvec segment, potentially spanning multiple pages. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]>
2019-04-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2-1/+3
Pull networking fixes from David Miller: 1) Several hash table refcount fixes in batman-adv, from Sven Eckelmann. 2) Use after free in bpf_evict_inode(), from Daniel Borkmann. 3) Fix mdio bus registration in ixgbe, from Ivan Vecera. 4) Unbounded loop in __skb_try_recv_datagram(), from Paolo Abeni. 5) ila rhashtable corruption fix from Herbert Xu. 6) Don't allow upper-devices to be added to vrf devices, from Sabrina Dubroca. 7) Add qmi_wwan device ID for Olicard 600, from Bjørn Mork. 8) Don't leave skb->next poisoned in __netif_receive_skb_list_ptype, from Alexander Lobakin. 9) Missing IDR checks in mlx5 driver, from Aditya Pakki. 10) Fix false connection termination in ktls, from Jakub Kicinski. 11) Work around some ASPM issues with r8169 by disabling rx interrupt coalescing on certain chips. From Heiner Kallweit. 12) Properly use per-cpu qstat values on NOLOCK qdiscs, from Paolo Abeni. 13) Fully initialize sockaddr_in structures in SCTP, from Xin Long. 14) Various BPF flow dissector fixes from Stanislav Fomichev. 15) Divide by zero in act_sample, from Davide Caratti. 16) Fix bridging multicast regression introduced by rhashtable conversion, from Nikolay Aleksandrov. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits) ibmvnic: Fix completion structure initialization ipv6: sit: reset ip header pointer in ipip6_rcv net: bridge: always clear mcast matching struct on reports and leaves libcxgb: fix incorrect ppmax calculation vlan: conditional inclusion of FCoE hooks to match netdevice.h and bnx2x sch_cake: Make sure we can write the IP header before changing DSCP bits sch_cake: Use tc_skb_protocol() helper for getting packet protocol tcp: Ensure DCTCP reacts to losses net/sched: act_sample: fix divide by zero in the traffic path net: thunderx: fix NULL pointer dereference in nicvf_open/nicvf_stop net: hns: Fix sparse: some warnings in HNS drivers net: hns: Fix WARNING when remove HNS driver with SMMU enabled net: hns: fix ICMP6 neighbor solicitation messages discard problem net: hns: Fix probabilistic memory overwrite when HNS driver initialized net: hns: Use NAPI_POLL_WEIGHT for hns driver net: hns: fix KASAN: use-after-free in hns_nic_net_xmit_hw() flow_dissector: rst'ify documentation ipv6: Fix dangling pointer when ipv6 fragment net-gro: Fix GRO flush when receiving a GSO packet. flow_dissector: document BPF flow dissector environment ...
2019-04-05Merge tag 'drm-misc-next-2019-04-04' of ↵Dave Airlie1-0/+81
git://anongit.freedesktop.org/drm/drm-misc into drm-next drm-misc-next for 5.2: UAPI Changes: -syncobj: Add TIMELINE_WAIT|QUERY|TRANSFER|TIMELINE_SIGNAL ioctls (Chunming) -Clarify that 1.0 can be represented by drm_color_lut (Daniel) Cross-subsystem Changes: -dt-bindings: Add binding for rk3066 hdmi (Johan) -dt-bindings: Add binding for Feiyang FY07024DI26A30-D panel (Jagan) -dt-bindings: Add Rocktech vendor prefix and jh057n00900 panel bindings (Guido) -MAINTAINERS: Add lima and ASPEED entries (Joel & Qiang) Core Changes: -memory: use dma_alloc_coherent when mem encryption is active (Christian) -dma_buf: add support for a dma_fence chain (Christian) -shmem_gem: fix off-by-one bug in new shmem gem helpers (Dan) Driver Changes: -rockchip: Add support for rk3066 hdmi (Johan) -ASPEED: Add driver supporting ASPEED BMC display controller to drm (Joel) -lima: Add driver supporting Arm Mali4xx gpus to drm (Qiang) -vc4/v3d: Various cleanups and improved error handling (Eric) -panel: Add support for Feiyang FY07024DI26A30-D MIPI-DSI panel (Jagan) -panel: Add support for Rocktech jh057n00900 MIPI-DSI panel (Guido) Cc: Johan Jonker <[email protected]> Cc: Christian König <[email protected]> Cc: Chunming Zhou <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: Eric Anholt <[email protected]> Cc: Qiang Yu <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Jagan Teki <[email protected]> Cc: Guido Günther <[email protected]> Cc: Joel Stanley <[email protected]> [airlied: fixed XA limit build breakage, Rodrigo also submitted the same patch, but I squashed it in the merge.] Signed-off-by: Dave Airlie <[email protected]> From: Sean Paul <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/20190404201016.GA139524@art_vandelay
2019-04-04acct_on(): don't mess with freeze protectionAl Viro1-0/+2
What happens there is that we are replacing file->path.mnt of a file we'd just opened with a clone and we need the write count contribution to be transferred from original mount to new one. That's it. We do *NOT* want any kind of freeze protection for the duration of switchover. IOW, we should just use __mnt_{want,drop}_write() for that switchover; no need to bother with mnt_{want,drop}_write() there. Tested-by: Amir Goldstein <[email protected]> Reported-by: [email protected] Signed-off-by: Al Viro <[email protected]>
2019-04-04net: use correct this_cpu primitive in dev_recursion_levelFlorian Westphal1-1/+1
syzbot reports: BUG: using __this_cpu_read() in preemptible code: caller is dev_recursion_level include/linux/netdevice.h:3052 [inline] __this_cpu_preempt_check+0x246/0x270 lib/smp_processor_id.c:47 dev_recursion_level include/linux/netdevice.h:3052 [inline] ip6_skb_dst_mtu include/net/ip6_route.h:245 [inline] I erronously downgraded a this_cpu_read to __this_cpu_read when moving dev_recursion_level() around. Reported-by: [email protected] Fixes: 97cdcf37b57e ("net: place xmit recursion in softnet data") Signed-off-by: Florian Westphal <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-04-04iio: ad_sigma_delta: Properly handle SPI bus locking vs CS assertionLars-Peter Clausen1-0/+1
For devices from the SigmaDelta family we need to keep CS low when doing a conversion, since the device will use the MISO line as a interrupt to indicate that the conversion is complete. This is why the driver locks the SPI bus and when the SPI bus is locked keeps as long as a conversion is going on. The current implementation gets one small detail wrong though. CS is only de-asserted after the SPI bus is unlocked. This means it is possible for a different SPI device on the same bus to send a message which would be wrongfully be addressed to the SigmaDelta device as well. Make sure that the last SPI transfer that is done while holding the SPI bus lock de-asserts the CS signal. Signed-off-by: Lars-Peter Clausen <[email protected]> Signed-off-by: Alexandru Ardelean <[email protected]> Signed-off-by: Jonathan Cameron <[email protected]>
2019-04-04iio: frequency: ad9523: Fix typo in ad9523_platform_dataLars-Peter Clausen1-4/+4
Replace diff_{m1,m2} with div_{m1,m2} since they are dividers and not a differential settings. Signed-off-by: Lars-Peter Clausen <[email protected]> Signed-off-by: Alexandru Ardelean <[email protected]> Signed-off-by: Jonathan Cameron <[email protected]>
2019-04-04iio: Make possible to include driver.h firstAndy Shevchenko1-0/+1
If we put headers alphabetically sorted in the IIO driver, the compilation will abort because of unknown type to handle. Simple add a forward declaration of opaque struct iio_dev. Suggested-by: Jonathan Cameron <[email protected]> Signed-off-by: Andy Shevchenko <[email protected]> Signed-off-by: Jonathan Cameron <[email protected]>
2019-04-04iio: imu: adis: generalize burst mode supportAlexandru Ardelean1-0/+14
Some variants in the ADIS16400 family support burst mode. This mechanism is implemented in the `adis16400_buffer.c` file. Some variants in ADIS16480 are also adding burst mode, which is functionally similar to ADIS16400, but with different parameters. To get there, a `adis_burst` struct is added to parametrize certain bits of the SPI communication to setup: the register that triggers burst-mode, and the extra-data-length that needs be accounted for when building the bust-length buffer. The trigger handler cannot be made generic, since it's very specific to each ADIS164XX family. A future enhancement of this `adis_burst` mode will be the possibility to enable/disable burst-mode. For the ADIS16400 family it's hard-coded to on by default. But for ADIS16480 there will be a need to disable this. When that will be implemented, both ADIS16400 & ADIS16480 will have the burst-mode enable-able/disable-able. Signed-off-by: Alexandru Ardelean <[email protected]> Signed-off-by: Jonathan Cameron <[email protected]>
2019-04-04iio: gyro: itg3200: add mount matrix supportH. Nikolaus Schaller1-0/+1
This patch allows to read a mount-matrix device tree property and report to user-space or in-kernel iio clients. Signed-off-by: H. Nikolaus Schaller <[email protected]> Reviewed-by: Andy Shevchenko <[email protected]> Reviewed-by: Linus Walleij <[email protected]> Signed-off-by: Jonathan Cameron <[email protected]>
2019-04-04iio: Allow to read mount matrix from ACPIAndy Shevchenko1-2/+2
Currently mount matrix is allowed in Device Tree, though there is no technical issue to extend it to support ACPI. Convert the function to use device_property_read_string_array() and thus allow to read mount matrix from ACPI if available. Example of use in _DSD method: Name (_DSD, Package () { ToUUID("daffd814-6eba-4d8c-8a91-bc9bbf4aa301"), Package () { Package () { "mount-matrix", Package() { "1", "0", "0", "0", "0.866", "0.5", "0", "-0.5", "0.866", } }, } }) At the same time drop the "of" prefix from its name and convert current users. No functional change intended. Signed-off-by: Andy Shevchenko <[email protected]> Reviewed-by: Linus Walleij <[email protected]> Signed-off-by: Jonathan Cameron <[email protected]>
2019-04-05gpio: mmio: Support two direction registersLinus Walleij1-4/+11
It turns out that one specific hardware has two direction registers: one to set a GPIO line as input and another one to set a GPIO line as output. So in theory a line can be configured as input and output at the same time. Make the MMIO GPIO helper deal with this: store both registers in the state container, use both in the generic code if present. Synchronize the input register to the output register when we register a GPIO chip, with the output settings taking precedence. Keep the helper variable to detect inverted direction semantics (only direction in register) but augment the code to be more straight-forward for the generic case when setting the registers. Fix some flunky with unreadable direction registers at the same time as we're touching this code. Cc: David Woods <[email protected]> Cc: Shravan Kumar Ramani <[email protected]> Signed-off-by: Linus Walleij <[email protected]>
2019-04-04node: Add memory-side caching attributesKeith Busch1-0/+39
System memory may have caches to help improve access speed to frequently requested address ranges. While the system provided cache is transparent to the software accessing these memory ranges, applications can optimize their own access based on cache attributes. Provide a new API for the kernel to register these memory-side caches under the memory node that provides it. The new sysfs representation is modeled from the existing cpu cacheinfo attributes, as seen from /sys/devices/system/cpu/<cpu>/cache/. Unlike CPU cacheinfo though, the node cache level is reported from the view of the memory. A higher level number is nearer to the CPU, while lower levels are closer to the last level memory. The exported attributes are the cache size, the line size, associativity indexing, and write back policy, and add the attributes for the system memory caches to sysfs stable documentation. Signed-off-by: Keith Busch <[email protected]> Reviewed-by: Rafael J. Wysocki <[email protected]> Reviewed-by: Brice Goglin <[email protected]> Tested-by: Brice Goglin <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2019-04-04node: Add heterogenous memory access attributesKeith Busch1-0/+26
Heterogeneous memory systems provide memory nodes with different latency and bandwidth performance attributes. Provide a new kernel interface for subsystems to register the attributes under the memory target node's initiator access class. If the system provides this information, applications may query these attributes when deciding which node to request memory. The following example shows the new sysfs hierarchy for a node exporting performance attributes: # tree -P "read*|write*"/sys/devices/system/node/nodeY/accessZ/initiators/ /sys/devices/system/node/nodeY/accessZ/initiators/ |-- read_bandwidth |-- read_latency |-- write_bandwidth `-- write_latency The bandwidth is exported as MB/s and latency is reported in nanoseconds. The values are taken from the platform as reported by the manufacturer. Memory accesses from an initiator node that is not one of the memory's access "Z" initiator nodes linked in the same directory may observe different performance than reported here. When a subsystem makes use of this interface, initiators of a different access number may not have the same performance relative to initiators in other access numbers, or omitted from the any access class' initiators. Descriptions for memory access initiator performance access attributes are added to sysfs stable documentation. Acked-by: Jonathan Cameron <[email protected]> Tested-by: Jonathan Cameron <[email protected]> Signed-off-by: Keith Busch <[email protected]> Reviewed-by: Rafael J. Wysocki <[email protected]> Tested-by: Brice Goglin <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2019-04-04node: Link memory nodes to their compute nodesKeith Busch1-0/+6
Systems may be constructed with various specialized nodes. Some nodes may provide memory, some provide compute devices that access and use that memory, and others may provide both. Nodes that provide memory are referred to as memory targets, and nodes that can initiate memory access are referred to as memory initiators. Memory targets will often have varying access characteristics from different initiators, and platforms may have ways to express those relationships. In preparation for these systems, provide interfaces for the kernel to export the memory relationship among different nodes memory targets and their initiators with symlinks to each other. If a system provides access locality for each initiator-target pair, nodes may be grouped into ranked access classes relative to other nodes. The new interface allows a subsystem to register relationships of varying classes if available and desired to be exported. A memory initiator may have multiple memory targets in the same access class. The target memory's initiators in a given class indicate the nodes access characteristics share the same performance relative to other linked initiator nodes. Each target within an initiator's access class, though, do not necessarily perform the same as each other. A memory target node may have multiple memory initiators. All linked initiators in a target's class have the same access characteristics to that target. The following example show the nodes' new sysfs hierarchy for a memory target node 'Y' with access class 0 from initiator node 'X': # symlinks -v /sys/devices/system/node/nodeX/access0/ relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY # symlinks -v /sys/devices/system/node/nodeY/access0/ relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX The new attributes are added to the sysfs stable documentation. Reviewed-by: Jonathan Cameron <[email protected]> Signed-off-by: Keith Busch <[email protected]> Reviewed-by: Rafael J. Wysocki <[email protected]> Tested-by: Brice Goglin <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2019-04-04acpi: Add HMAT to generic parsing tablesKeith Busch1-0/+1
The Heterogeneous Memory Attribute Table (HMAT) header has different field lengths than the existing parsing uses. Add the HMAT type to the parsing rules so it may be generically parsed. Reviewed-by: Rafael J. Wysocki <[email protected]> Acked-by: Jonathan Cameron <[email protected]> Tested-by: Jonathan Cameron <[email protected]> Signed-off-by: Keith Busch <[email protected]> Tested-by: Brice Goglin <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>