blaster4385/linux-IllusionX - Linux kernel with personal config changes for arch linux

Age	Commit message (Collapse)	Author	Files	Lines
2020-12-08	btrfs: add helper for string match ignoring leading/trailing whitespace	Anand Jain	1	-0/+19
	Add a generic helper to match the string in a given buffer, and ignore the leading and trailing whitespace. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Anand Jain <[email protected]> Reviewed-by: David Sterba <[email protected]> [ rename variables, add comments ] Signed-off-by: David Sterba <[email protected]>
2020-12-08	btrfs: do not start and wait for delalloc on snapshot roots on transaction ↵	Filipe Manana	1	-43/+6
	commit We do not need anymore to start writeback for delalloc of roots that are being snapshotted and wait for it to complete. This was done in commit 609e804d771f59 ("Btrfs: fix file corruption after snapshotting due to mix of buffered/DIO writes") to fix a type of file corruption where files in a snapshot end up having their i_size updated in a non-ordered way, leaving implicit file holes, when buffered IO writes that increase a file's size are followed by direct IO writes that also increase the file's size. This is not needed anymore because we now have a more generic mechanism to prevent a non-ordered i_size update since commit 9ddc959e802bf7 ("btrfs: use the file extent tree infrastructure"), which addresses this scenario involving snapshots as well. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-08	btrfs: switch extent buffer tree lock to rw_semaphore	Josef Bacik	5	-351/+70
	Historically we've implemented our own locking because we wanted to be able to selectively spin or sleep based on what we were doing in the tree. For instance, if all of our nodes were in cache then there's rarely a reason to need to sleep waiting for node locks, as they'll likely become available soon. At the time this code was written the rw_semaphore didn't do adaptive spinning, and thus was orders of magnitude slower than our home grown locking. However now the opposite is the case. There are a few problems with how we implement blocking locks, namely that we use a normal waitqueue and simply wake everybody up in reverse sleep order. This leads to some suboptimal performance behavior, and a lot of context switches in highly contended cases. The rw_semaphores actually do this properly, and also have adaptive spinning that works relatively well. The locking code is also a bit of a bear to understand, and we lose the benefit of lockdep for the most part because the blocking states of the lock are simply ad-hoc and not mapped into lockdep. So rework the locking code to drop all of this custom locking stuff, and simply use a rw_semaphore for everything. This makes the locking much simpler for everything, as we can now drop a lot of cruft and blocking transitions. The performance numbers vary depending on the workload, because generally speaking there doesn't tend to be a lot of contention on the btree. However, on my test system which is an 80 core single socket system with 256GiB of RAM and a 2TiB NVMe drive I get the following results (with all debug options off): dbench 200 baseline Throughput 216.056 MB/sec 200 clients 200 procs max_latency=1471.197 ms dbench 200 with patch Throughput 737.188 MB/sec 200 clients 200 procs max_latency=714.346 ms Previously we also used fs_mark to test this sort of contention, and those results are far less impressive, mostly because there's not enough tasks to really stress the locking fs_mark -d /d[0-15] -S 0 -L 20 -n 100000 -s 0 -t 16 baseline Average Files/sec: 160166.7 p50 Files/sec: 165832 p90 Files/sec: 123886 p99 Files/sec: 123495 real 3m26.527s user 2m19.223s sys 48m21.856s patched Average Files/sec: 164135.7 p50 Files/sec: 171095 p90 Files/sec: 122889 p99 Files/sec: 113819 real 3m29.660s user 2m19.990s sys 44m12.259s Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-08	btrfs: open code insert_orphan_item	Nikolay Borisov	1	-13/+3
	Just open code it in its sole caller and remove a level of indirection. Reviewed-by: Anand Jain <[email protected]> Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-08	btrfs: introduce mount option rescue=all	Josef Bacik	2	-0/+12
	Now that we have the building blocks for some better recovery options with corrupted file systems, add a rescue=all option to enable all of the relevant rescue options. This will allow distros to simply default to rescue=all for the "oh dear lord the world's on fire" recovery without needing to know all the different options that we have and may add in the future. Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-08	btrfs: introduce mount option rescue=ignoredatacsums	Josef Bacik	4	-10/+25
	There are cases where you can end up with bad data csums because of misbehaving applications. This happens when an application modifies a buffer in-flight when doing an O_DIRECT write. In order to recover the file we need a way to turn off data checksums so you can copy the file off, and then you can delete the file and restore it properly later. Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-08	btrfs: introduce mount option rescue=ignorebadroots	Josef Bacik	10	-28/+130
	In the face of extent root corruption, or any other core fs wide root corruption we will fail to mount the file system. This makes recovery kind of a pain, because you need to fall back to userspace tools to scrape off data. Instead provide a mechanism to gracefully handle bad roots, so we can at least mount read-only and possibly recover data from the file system. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-08	btrfs: show rescue=usebackuproot in /proc/mounts	Josef Bacik	2	-6/+2
	The standalone option usebackuproot was intended as one-time use and it was not necessary to keep it in the option list. Now that we're going to have more rescue options, it's desirable to keep them intact as it could be confusing why the option disappears. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> [ remove the btrfs_clear_opt part from open_ctree ] Signed-off-by: David Sterba <[email protected]>
2020-12-08	btrfs: add a helper to print out rescue= options	Josef Bacik	1	-1/+8
	We're going to have a lot of rescue options, add a helper to collapse the /proc/mounts output to rescue=option1:option2:option3 format. Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-08	btrfs: sysfs: export supported rescue= mount options	Josef Bacik	1	-0/+22
	We're going to be adding a variety of different rescue options, we should advertise which ones we support to make user spaces life easier in the future. Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-08	btrfs: push the NODATASUM check into btrfs_lookup_bio_sums	Josef Bacik	3	-12/+17
	When we move to being able to handle NULL csum_roots it'll be cleaner to just check in btrfs_lookup_bio_sums instead of at all of the caller locations, so push the NODATASUM check into it as well so it's unified. Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-08	btrfs: unify the ro checking for mount options	Josef Bacik	1	-7/+16
	We're going to be adding more options that require RDONLY, so add a helper to do the check and error out if we don't have RDONLY set. Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-08	btrfs: do not start readahead for csum tree when scrubbing non-data block groups	Filipe Manana	1	-8/+12
	When scrubbing a stripe of a block group we always start readahead for the checksums btree and wait for it to complete, however when the blockgroup is not a data block group (or a mixed block group) it is a waste of time to do it, since there are no checksums for metadata extents in that btree. So skip that when the block group does not have the data flag set, saving some time doing memory allocations, queueing a job in the readahead work queue, waiting for it to complete and potentially avoiding some IO as well (when csum tree extents are not in memory already). Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-08	btrfs: assert we are holding the reada_lock when releasing a readahead zone	Filipe Manana	1	-0/+2
	When we drop the last reference of a zone, we end up releasing it through the callback reada_zone_release(), which deletes the zone from a device's reada_zones radix tree. This tree is protected by the global readahead lock at fs_info->reada_lock. Currently all places that are sure that they are dropping the last reference on a zone, are calling kref_put() in a critical section delimited by this lock, while all other places that are sure they are not dropping the last reference, do not bother calling kref_put() while holding that lock. When working on the previous fix for hangs and use-after-frees in the readahead code, my initial attempts were different and I actually ended up having reada_zone_release() called when not holding the lock, which resulted in weird and unexpected problems. So just add an assertion there to detect such problem more quickly and make the dependency more obvious. Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-08	btrfs: set EXTENT_NORESERVE bits side btrfs_dirty_pages()	Goldwyn Rodrigues	3	-18/+12
	Set the extent bits EXTENT_NORESERVE inside btrfs_dirty_pages() as opposed to calling set_extent_bits again later. Fold check for written length within the function. Note: EXTENT_NORESERVE is set before unlocking extents. Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: Goldwyn Rodrigues <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-08	btrfs: use round_down while calculating start position in btrfs_dirty_pages()	Goldwyn Rodrigues	1	-1/+1
	round_down looks prettier than the bit mask operations. Reviewed-by: Nikolay Borisov <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Goldwyn Rodrigues <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-08	btrfs: use iosize while reading compressed pages	Goldwyn Rodrigues	1	-7/+3
	While using compression, a submitted bio is mapped with a compressed bio which performs the read from disk, decompresses and returns uncompressed data to original bio. The original bio must reflect the uncompressed size (iosize) of the I/O to be performed, or else the page just gets the decompressed I/O length of data (disk_io_size). The compressed bio checks the extent map and gets the correct length while performing the I/O from disk. This came up in subpage work when only compressed length of the original bio was filled in the page. This worked correctly for pagesize == sectorsize because both compressed and uncompressed data are at pagesize boundaries, and would end up filling the requested page. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Goldwyn Rodrigues <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-08	btrfs: calculate num_pages, reserve_bytes once in btrfs_buffered_write	Goldwyn Rodrigues	1	-22/+12
	write_bytes can change in btrfs_check_nocow_lock(). Calculate variables such as num_pages and reserve_bytes once we are sure of the value of write_bytes so there is no need to re-calculate. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Goldwyn Rodrigues <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-08	btrfs: calculate more accurate remaining time to sleep in transaction_kthread	Nikolay Borisov	1	-1/+3
	If transaction_kthread is woken up before btrfs_fs_info::commit_interval seconds have elapsed it will sleep for a fixed period of 5 seconds. This is not a problem per-se but is not accurate. Instead the code should sleep for an interval which guarantees on next wakeup commit_interval would have passed. Since time tracking is not precise subtract 1 second from delta to ensure the delay we end up waiting will be longer than than the wake up period. Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-08	btrfs: record delta directly in transaction_kthread	Nikolay Borisov	1	-3/+3
	Rename 'now' to 'delta' and store there the delta between transaction start time and current time. This is in preparation for optimising the sleep logic in the next patch. No functional changes. Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-08	btrfs: remove redundant time check in transaction kthread loop	Nikolay Borisov	1	-2/+1
	The value obtained from ktime_get_seconds() is guaranteed to be monotonically increasing since it's taken from CLOCK_MONOTONIC. As transaction_kthread obtains a reference to the currently running transaction under holding btrfs_fs_info::trans_lock it's guaranteed to: a) see an initialized 'cur', whose start_time is guaranteed to be smaller than 'now' or b) not obtain a 'cur' and simply go to sleep. Given this remove the unnecessary check, if it sees now < cur->start_time this would imply there are far greater problems on the machine. Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-07	btrfs: use helpers to convert from seconds to jiffies in transaction_kthread	Nikolay Borisov	1	-2/+2
	The kernel provides easy to understand helpers to convert from human understandable units to the kernel-friendly 'jiffies'. So let's use those to make the code easier to understand. No functional changes. Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-07	btrfs: sysfs: export filesystem generation	Anand Jain	1	-0/+10
	Matching with the information that's available from the ioctl FS_INFO, add generation to the per-filesystem directory /sys/fs/btrfs/UUID/generation, which could be used by scripts. Signed-off-by: Anand Jain <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-06	Linux 5.10-rc7	Linus Torvalds	1	-1/+1

2020-12-06	Merge tag 'char-misc-5.10-rc7' of ↵	Linus Torvalds	7	-909/+33
	git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver fixes from Greg KH: "Here are some small driver fixes, and one "large" revert, for 5.10-rc7. They include: - revert mei patch from 5.10-rc1 that was using a reserved userspace value. It will be resubmitted once the proper id has been assigned by the virtio people. - habanalabs fixes found by the fall-through audit from Gustavo - speakup driver fixes for reported issues - fpga config build fix for reported issue. All of these except the revert have been in linux-next with no reported issues. The revert is "clean" and just removes a previously-added driver, so no real issue there" * tag 'char-misc-5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: Revert "mei: virtio: virtualization frontend driver" fpga: Specify HAS_IOMEM dependency for FPGA_DFL habanalabs: put devices before driver removal habanalabs: free host huge va_range if not used speakup: Reject setting the speakup line discipline outside of speakup
2020-12-06	Merge tag 'tty-5.10-rc7' of ↵	Linus Torvalds	3	-14/+41
	git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty fixes from Greg KH: "Here are two tty core fixes for 5.10-rc7. They resolve some reported locking issues in the tty core. While they have not been in a released linux-next yet, they have passed all of the 0-day bot testing as well as the submitter's testing" * tag 'tty-5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: tty: Fix ->session locking tty: Fix ->pgrp locking in tiocspgrp()
2020-12-06	Merge tag 'usb-5.10-rc7' of ↵	Linus Torvalds	12	-50/+53
	git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB fixes from Greg KH: "Here are some small USB fixes for 5.10-rc7 that resolve a number of reported issues, and add some new device ids. Nothing major here, but these solve some problems that people were having with the 5.10-rc tree: - reverts for USB storage dma settings that broke working devices - thunderbolt use-after-free fix - cdns3 driver fixes - gadget driver userspace copy fix - new device ids All of these except for the reverts have been in linux-next with no reported issues. The reverts are "clean" and were tested by Hans, as well as passing the 0-day tests" * tag 'usb-5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: usb: gadget: f_fs: Use local copy of descriptors for userspace copy usb: ohci-omap: Fix descriptor conversion Revert "usb-storage: fix sdev->host->dma_dev" Revert "uas: fix sdev->host->dma_dev" Revert "uas: bump hw_max_sectors to 2048 blocks for SS or faster drives" USB: serial: kl5kusb105: fix memleak on open USB: serial: ch341: sort device-id entries USB: serial: ch341: add new Product ID for CH341A USB: serial: option: fix Quectel BG96 matching usb: cdns3: core: fix goto label for error path usb: cdns3: gadget: clear trb->length as zero after preparing every trb usb: cdns3: Fix hardware based role switch USB: serial: option: add support for Thales Cinterion EXS82 USB: serial: option: add Fibocom NL668 variants thunderbolt: Fix use-after-free in remove_unplugged_switch()
2020-12-06	Merge tag 'x86-urgent-2020-12-06' of ↵	Linus Torvalds	9	-15/+58
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A set of fixes for x86: - Make the AMD L3 QoS code and data priorization enable/disable mechanism work correctly. The control bit was only set/cleared on one of the CPUs in a L3 domain, but it has to be modified on all CPUs in the domain. The initial documentation was not clear about this, but the updated one from Oct 2020 spells it out. - Fix an off by one in the UV platform detection code which causes the UV hubs to be identified wrongly. The chip revisions start at 1 not at 0. - Fix a long standing bug in the evaluation of prefixes in the uprobes code which fails to handle repeated prefixes properly. The aggregate size of the prefixes can be larger than the bytes array but the code blindly iterated over the aggregate size beyond the array boundary. Add a macro to handle this case properly and use it at the affected places" * tag 'x86-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sev-es: Use new for_each_insn_prefix() macro to loop over prefixes bytes x86/insn-eval: Use new for_each_insn_prefix() macro to loop over prefixes bytes x86/uprobes: Do not use prefixes.nbytes when looping over prefixes.bytes x86/platform/uv: Fix UV4 hub revision adjustment x86/resctrl: Fix AMD L3 QOS CDP enable/disable
2020-12-06	Merge tag 'perf-urgent-2020-12-06' of ↵	Linus Torvalds	1	-2/+2
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: "Two fixes for performance monitoring on X86: - Add recursion protection to another callchain invoked from x86_pmu_stop() which can recurse back into x86_pmu_stop(). The first attempt to fix this missed this extra code path. - Use the already filtered status variable to check for PEBS counter overflow bits and not the unfiltered full status read from IA32_PERF_GLOBAL_STATUS which can have unrelated bits check which would be evaluated incorrectly" * tag 'perf-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Check PEBS status correctly perf/x86/intel: Fix a warning on x86_pmu_stop() with large PEBS
2020-12-06	Merge tag 'irq-urgent-2020-12-06' of ↵	Linus Torvalds	5	-12/+25
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: "A set of updates for the interrupt subsystem: - Make multiqueue devices which use the managed interrupt affinity infrastructure work on PowerPC/Pseries. PowerPC does not use the generic infrastructure for setting up PCI/MSI interrupts and the multiqueue changes failed to update the legacy PCI/MSI infrastructure. Make this work by passing the affinity setup information down to the mapping and allocation functions. - Move Jason Cooper from MAINTAINERS to CREDITS as his mail is bouncing and he's not reachable. We hope all is well with him and say thanks for his work over the years" * tag 'irq-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: powerpc/pseries: Pass MSI affinity to irq_create_mapping() genirq/irqdomain: Add an irq_create_mapping_affinity() function MAINTAINERS: Move Jason Cooper to CREDITS
2020-12-06	Merge tag 'locking-urgent-2020-12-06' of ↵	Linus Torvalds	1	-14/+14
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull intel_idle build fix from Thomas Gleixner: "A tiny build fix for a recent change in the intel_idle driver which missed a CONFIG dependency and broke the build for certain configurations" * tag 'locking-urgent-2020-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: intel_idle: Build fix
2020-12-06	Merge tag 'kbuild-fixes-v5.10-2' of ↵	Linus Torvalds	17	-24/+64
	git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild fixes from Masahiro Yamada: - Move -Wcast-align to W=3, which tends to be false-positive and there is no tree-wide solution. - Pass -fmacro-prefix-map to KBUILD_CPPFLAGS because it is a preprocessor option and makes sense for .S files as well. - Disable -gdwarf-2 for Clang's integrated assembler to avoid warnings. - Disable --orphan-handling=warn for LLD 10.0.1 to avoid warnings. - Fix undesirable line breaks in .mod files. tag 'kbuild-fixes-v5.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: kbuild: avoid split lines in .mod files kbuild: Disable CONFIG_LD_ORPHAN_WARN for ld.lld 10.0.1 kbuild: Hoist '--orphan-handling' into Kconfig Kbuild: do not emit debug info for assembly with LLVM_IAS=1 kbuild: use -fmacro-prefix-map for .S sources Makefile.extrawarn: move -Wcast-align to W=3
2020-12-06	Merge branch 'akpm' (patches from Andrew)	Linus Torvalds	15	-121/+75
	Merge misc fixes from Andrew Morton: "12 patches. Subsystems affected by this patch series: mm (memcg, zsmalloc, swap, mailmap, selftests, pagecache, hugetlb, pagemap), lib, and coredump" * emailed patches from Andrew Morton <[email protected]>: mm/mmap.c: fix mmap return value when vma is merged after call_mmap() hugetlb_cgroup: fix offline of hugetlb cgroup with reservations mm/filemap: add static for function __add_to_page_cache_locked userfaultfd: selftests: fix SIGSEGV if huge mmap fails tools/testing/selftests/vm: fix build error mailmap: add two more addresses of Uwe Kleine-König mm/swapfile: do not sleep with a spin lock held mm/zsmalloc.c: drop ZSMALLOC_PGTABLE_MAPPING mm: list_lru: set shrinker map bit when child nr_items is not zero mm: memcg/slab: fix obj_cgroup_charge() return value handling coredump: fix core_pattern parse error zlib: export S390 symbols for zlib modules
2020-12-06	mm/mmap.c: fix mmap return value when vma is merged after call_mmap()	Liu Zixian	1	-14/+12
	On success, mmap should return the begin address of newly mapped area, but patch "mm: mmap: merge vma after call_mmap() if possible" set vm_start of newly merged vma to return value addr. Users of mmap will get wrong address if vma is merged after call_mmap(). We fix this by moving the assignment to addr before merging vma. We have a driver which changes vm_flags, and this bug is found by our testcases. Fixes: d70cec898324 ("mm: mmap: merge vma after call_mmap() if possible") Signed-off-by: Liu Zixian <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Hongxiang Lou <[email protected]> Cc: Hu Shiyuan <[email protected]> Cc: Matthew Wilcox <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-12-06	hugetlb_cgroup: fix offline of hugetlb cgroup with reservations	Mike Kravetz	1	-5/+3
	Adrian Moreno was ruuning a kubernetes 1.19 + containerd/docker workload using hugetlbfs. In this environment the issue is reproduced by: - Start a simple pod that uses the recently added HugePages medium feature (pod yaml attached) - Start a DPDK app. It doesn't need to run successfully (as in transfer packets) nor interact with real hardware. It seems just initializing the EAL layer (which handles hugepage reservation and locking) is enough to trigger the issue - Delete the Pod (or let it "Complete"). This would result in a kworker thread going into a tight loop (top output): 1425 root 20 0 0 0 0 R 99.7 0.0 5:22.45 kworker/28:7+cgroup_destroy 'perf top -g' reports: - 63.28% 0.01% [kernel] [k] worker_thread - 49.97% worker_thread - 52.64% process_one_work - 62.08% css_killed_work_fn - hugetlb_cgroup_css_offline 41.52% _raw_spin_lock - 2.82% _cond_resched rcu_all_qs 2.66% PageHuge - 0.57% schedule - 0.57% __schedule We are spinning in the do-while loop in hugetlb_cgroup_css_offline. Worse yet, we are holding the master cgroup lock (cgroup_mutex) while infinitely spinning. Little else can be done on the system as the cgroup_mutex can not be acquired. Do note that the issue can be reproduced by simply offlining a hugetlb cgroup containing pages with reservation counts. The loop in hugetlb_cgroup_css_offline is moving page counts from the cgroup being offlined to the parent cgroup. This is done for each hstate, and is repeated until hugetlb_cgroup_have_usage returns false. The routine moving counts (hugetlb_cgroup_move_parent) is only moving 'usage' counts. The routine hugetlb_cgroup_have_usage is checking for both 'usage' and 'reservation' counts. Discussion about what to do with reservation counts when reparenting was discussed here: https://lore.kernel.org/linux-kselftest/CAHS8izMFAYTgxym-Hzb_JmkTK1N_S9tGN71uS6MFV+R7swYu5A@mail.gmail.com/ The decision was made to leave a zombie cgroup for with reservation counts. Unfortunately, the code checking reservation counts was incorrectly added to hugetlb_cgroup_have_usage. To fix the issue, simply remove the check for reservation counts. While fixing this issue, a related bug in hugetlb_cgroup_css_offline was noticed. The hstate index is not reinitialized each time through the do-while loop. Fix this as well. Fixes: 1adc4d419aa2 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations") Reported-by: Adrian Moreno <[email protected]> Signed-off-by: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Tested-by: Adrian Moreno <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Cc: Mina Almasry <[email protected]> Cc: David Rientjes <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Sandipan Das <[email protected]> Cc: Shuah Khan <[email protected]> Cc: <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-12-06	mm/filemap: add static for function __add_to_page_cache_locked	Alex Shi	1	-1/+1
	mm/filemap.c:830:14: warning: no previous prototype for `__add_to_page_cache_locked' [-Wmissing-prototypes] Signed-off-by: Alex Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Souptick Joarder <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-12-06	userfaultfd: selftests: fix SIGSEGV if huge mmap fails	Axel Rasmussen	1	-9/+16
	The error handling in hugetlb_allocate_area() was incorrect for the hugetlb_shared test case. Previously the behavior was: - mmap a hugetlb area - If this fails, set the pointer to NULL, and carry on - mmap an alias of the same hugetlb fd - If this fails, munmap the original area If the original mmap failed, it's likely the second one did too. If both failed, we'd blindly try to munmap a NULL pointer, causing a SIGSEGV. Instead, "goto fail" so we return before trying to mmap the alias. This issue can be hit "in real life" by forgetting to set /proc/sys/vm/nr_hugepages (leaving it at 0), and then trying to run the hugetlb_shared test. Another small improvement is, when the original mmap fails, don't just print "it failed": perror(), so we can see why. :) Signed-off-by: Axel Rasmussen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Peter Xu <[email protected]> Cc: Joe Perches <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: David Alan Gilbert <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-12-06	tools/testing/selftests/vm: fix build error	Xingxing Su	1	-0/+4
	Only x86 and PowerPC implement the pkey-xxx.h, and an error was reported when compiling protection_keys.c. Add a Arch judgment to compile "protection_keys" in the Makefile. If other arch implement this, add the arch name to the Makefile. eg: ifneq (,$(findstring $(ARCH),powerpc mips ... )) Following build errors: pkey-helpers.h:93:2: error: #error Architecture not supported #error Architecture not supported pkey-helpers.h:96:20: error: `PKEY_DISABLE_ACCESS' undeclared #define PKEY_MASK (PKEY_DISABLE_ACCESS \| PKEY_DISABLE_WRITE) ^ protection_keys.c:218:45: error: `PKEY_DISABLE_WRITE' undeclared pkey_assert(flags & (PKEY_DISABLE_ACCESS \| PKEY_DISABLE_WRITE)); ^ Signed-off-by: Xingxing Su <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Sandipan Das <[email protected]> Cc: John Hubbard <[email protected]> Cc: Dave Hansen <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Brian Geffon <[email protected]> Cc: Mina Almasry <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-12-06	mailmap: add two more addresses of Uwe Kleine-König	Uwe Kleine-König	1	-0/+2
	This fixes attribution for the commits (among others) - d4097456cd1d ("video/framebuffer: move the probe func into .devinit.text in Blackfin LCD driver") - 0312e024d6cd ("mfd: mc13xxx: Add support for mc34708") Signed-off-by: Uwe Kleine-König <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-12-06	mm/swapfile: do not sleep with a spin lock held	Qian Cai	1	-1/+3
	We can't call kvfree() with a spin lock held, so defer it. Fixes a might_sleep() runtime warning. Fixes: 873d7bcfd066 ("mm/swapfile.c: use kvzalloc for swap_info_struct allocation") Signed-off-by: Qian Cai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-12-06	mm/zsmalloc.c: drop ZSMALLOC_PGTABLE_MAPPING	Minchan Kim	4	-69/+0
	While I was doing zram testing, I found sometimes decompression failed since the compression buffer was corrupted. With investigation, I found below commit calls cond_resched unconditionally so it could make a problem in atomic context if the task is reschedule. BUG: sleeping function called from invalid context at mm/vmalloc.c:108 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 946, name: memhog 3 locks held by memhog/946: #0: ffff9d01d4b193e8 (&mm->mmap_lock#2){++++}-{4:4}, at: __mm_populate+0x103/0x160 #1: ffffffffa3d53de0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0xa98/0x1160 #2: ffff9d01d56b8110 (&zspage->lock){.+.+}-{3:3}, at: zs_map_object+0x8e/0x1f0 CPU: 0 PID: 946 Comm: memhog Not tainted 5.9.3-00011-gc5bfc0287345-dirty #316 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014 Call Trace: unmap_kernel_range_noflush+0x2eb/0x350 unmap_kernel_range+0x14/0x30 zs_unmap_object+0xd5/0xe0 zram_bvec_rw.isra.0+0x38c/0x8e0 zram_rw_page+0x90/0x101 bdev_write_page+0x92/0xe0 __swap_writepage+0x94/0x4a0 pageout+0xe3/0x3a0 shrink_page_list+0xb94/0xd60 shrink_inactive_list+0x158/0x460 We can fix this by removing the ZSMALLOC_PGTABLE_MAPPING feature (which contains the offending calling code) from zsmalloc. Even though this option showed some amount improvement(e.g., 30%) in some arm32 platforms, it has been headache to maintain since it have abused APIs[1](e.g., unmap_kernel_range in atomic context). Since we are approaching to deprecate 32bit machines and already made the config option available for only builtin build since v5.8, lastly it has been not default option in zsmalloc, it's time to drop the option for better maintenance. [1] http://lore.kernel.org/linux-mm/[email protected] Fixes: e47110e90584 ("mm/vunmap: add cond_resched() in vunmap_pmd_range") Signed-off-by: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Cc: Tony Lindgren <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Harish Sriram <[email protected]> Cc: Uladzislau Rezki <[email protected]> Cc: <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-12-06	mm: list_lru: set shrinker map bit when child nr_items is not zero	Yang Shi	1	-5/+5
	When investigating a slab cache bloat problem, significant amount of negative dentry cache was seen, but confusingly they neither got shrunk by reclaimer (the host has very tight memory) nor be shrunk by dropping cache. The vmcore shows there are over 14M negative dentry objects on lru, but tracing result shows they were even not scanned at all. Further investigation shows the memcg's vfs shrinker_map bit is not set. So the reclaimer or dropping cache just skip calling vfs shrinker. So we have to reboot the hosts to get the memory back. I didn't manage to come up with a reproducer in test environment, and the problem can't be reproduced after rebooting. But it seems there is race between shrinker map bit clear and reparenting by code inspection. The hypothesis is elaborated as below. The memcg hierarchy on our production environment looks like: root / \ system user The main workloads are running under user slice's children, and it creates and removes memcg frequently. So reparenting happens very often under user slice, but no task is under user slice directly. So with the frequent reparenting and tight memory pressure, the below hypothetical race condition may happen: CPU A CPU B reparent dst->nr_items == 0 shrinker: total_objects == 0 add src->nr_items to dst set_bit return SHRINK_EMPTY clear_bit child memcg offline replace child's kmemcg_id with parent's (in memcg_offline_kmem()) list_lru_del() between shrinker runs see parent's kmemcg_id dec dst->nr_items reparent again dst->nr_items may go negative due to concurrent list_lru_del() The second run of shrinker: read nr_items without any synchronization, so it may see intermediate negative nr_items then total_objects may return 0 coincidently keep the bit cleared dst->nr_items != 0 skip set_bit add scr->nr_item to dst After this point dst->nr_item may never go zero, so reparenting will not set shrinker_map bit anymore. And since there is no task under user slice directly, so no new object will be added to its lru to set the shrinker map bit either. That bit is kept cleared forever. How does list_lru_del() race with reparenting? It is because reparenting replaces children's kmemcg_id to parent's without protecting from nlru->lock, so list_lru_del() may see parent's kmemcg_id but actually deleting items from child's lru, but dec'ing parent's nr_items, so the parent's nr_items may go negative as commit 2788cf0c401c ("memcg: reparent list_lrus and free kmemcg_id on css offline") says. Since it is impossible that dst->nr_items goes negative and src->nr_items goes zero at the same time, so it seems we could set the shrinker map bit iff src->nr_items != 0. We could synchronize list_lru_count_one() and reparenting with nlru->lock, but it seems checking src->nr_items in reparenting is the simplest and avoids lock contention. Fixes: fae91d6d8be5 ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance") Suggested-by: Roman Gushchin <[email protected]> Signed-off-by: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Acked-by: Kirill Tkhai <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: <[email protected]> [4.19] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-12-06	mm: memcg/slab: fix obj_cgroup_charge() return value handling	Roman Gushchin	1	-16/+24
	Commit 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations") introduced a regression into the handling of the obj_cgroup_charge() return value. If a non-zero value is returned (indicating of exceeding one of memory.max limits), the allocation should fail, instead of falling back to non-accounted mode. To make the code more readable, move memcg_slab_pre_alloc_hook() and memcg_slab_post_alloc_hook() calling conditions into bodies of these hooks. Fixes: 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations") Signed-off-by: Roman Gushchin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Shakeel Butt <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-12-06	coredump: fix core_pattern parse error	Menglong Dong	1	-1/+2
	'format_corename()' will splite 'core_pattern' on spaces when it is in pipe mode, and take helper_argv[0] as the path to usermode executable. It works fine in most cases. However, if there is a space between '\|' and '/file/path', such as '\| /usr/lib/systemd/systemd-coredump %P %u %g', then helper_argv[0] will be parsed as '', and users will get a 'Core dump to \| disabled'. It is not friendly to users, as the pattern above was valid previously. Fix this by ignoring the spaces between '\|' and '/file/path'. Fixes: 315c69261dd3 ("coredump: split pipe command whitespace before expanding template") Signed-off-by: Menglong Dong <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Paul Wise <[email protected]> Cc: Jakub Wilk <[email protected]> [https://bugs.debian.org/924398] Cc: Neil Horman <[email protected]> Cc: <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-12-06	zlib: export S390 symbols for zlib modules	Randy Dunlap	1	-0/+3
	Fix build errors when ZLIB_INFLATE=m and ZLIB_DEFLATE=m and ZLIB_DFLTCC=y by exporting the 2 needed symbols in dfltcc_inflate.c. Fixes these build errors: ERROR: modpost: "dfltcc_inflate" [lib/zlib_inflate/zlib_inflate.ko] undefined! ERROR: modpost: "dfltcc_can_inflate" [lib/zlib_inflate/zlib_inflate.ko] undefined! Fixes: 126196100063 ("lib/zlib: add s390 hardware support for kernel zlib_inflate") Reported-by: kernel test robot <[email protected]> Signed-off-by: Randy Dunlap <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Acked-by: Ilya Leoshkevich <[email protected]> Cc: Mikhail Zaslonko <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Christian Borntraeger <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-12-06	kbuild: avoid split lines in .mod files	Masahiro Yamada	1	-8/+4
	"xargs echo" is not a safe way to remove line breaks because the input may exceed the command line limit and xargs may break it up into multiple invocations of echo. This should never happen because scripts/gen_autoksyms.sh expects all undefined symbols are placed in the second line of .mod files. One possible way is to replace "xargs echo" with "sed ':x;N;$!bx;s/\n/ /g'" or something, but I rewrote the code by using awk because it is more readable. This issue was reported by Sami Tolvanen; in his Clang LTO patch set, $(multi-used-m) is no longer an ELF object, but a thin archive that contains LLVM bitcode files. llvm-nm prints out symbols for each archive member separately, which results a lot of dupications, in some places, beyond the system-defined limit. This problem must be fixed irrespective of LTO, and we must ensure zero possibility of having this issue. Link: https://lkml.org/lkml/2020/12/1/1658 Reported-by: Sami Tolvanen <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]> Reviewed-by: Sami Tolvanen <[email protected]>
2020-12-06	Revert "mei: virtio: virtualization frontend driver"	Michael S. Tsirkin	3	-887/+0
	This reverts commit d162219c655c8cf8003128a13840d6c1e183fb80. The device uses a VIRTIO device ID out of a not-for-production range. Releasing Linux using an ID out of this range will make it conflict with development setups. An official request to reserve an ID for an MEI device is yet to be submitted to the virtio TC, thus there's no chance it will be reserved and fixed in time before the next release. Once requested it usually takes 2-3 weeks to land in the spec, which means the device can be supported with the official ID in the next Linux version if contributors act quickly. Signed-off-by: Michael S. Tsirkin <[email protected]> Cc: Tomas Winkler <[email protected]> Cc: Alexander Usyskin <[email protected]> Cc: Wang Yu <[email protected]> Cc: Liu Shuo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2020-12-06	x86/sev-es: Use new for_each_insn_prefix() macro to loop over prefixes bytes	Masami Hiramatsu	1	-3/+2
	Since insn.prefixes.nbytes can be bigger than the size of insn.prefixes.bytes[] when a prefix is repeated, the proper check must be: insn.prefixes.bytes[i] != 0 and i < 4 instead of using insn.prefixes.nbytes. Use the new for_each_insn_prefix() macro which does it correctly. Debugged by Kees Cook <[email protected]>. [ bp: Massage commit message. ] Fixes: 25189d08e516 ("x86/sev-es: Add support for handling IOIO exceptions") Reported-by: [email protected] Signed-off-by: Masami Hiramatsu <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/160697106089.3146288.2052422845039649176.stgit@devnote2
2020-12-06	x86/insn-eval: Use new for_each_insn_prefix() macro to loop over prefixes bytes	Masami Hiramatsu	1	-5/+5
	Since insn.prefixes.nbytes can be bigger than the size of insn.prefixes.bytes[] when a prefix is repeated, the proper check must be insn.prefixes.bytes[i] != 0 and i < 4 instead of using insn.prefixes.nbytes. Use the new for_each_insn_prefix() macro which does it correctly. Debugged by Kees Cook <[email protected]>. [ bp: Massage commit message. ] Fixes: 32d0b95300db ("x86/insn-eval: Add utility functions to get segment selector") Reported-by: [email protected] Signed-off-by: Masami Hiramatsu <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/160697104969.3146288.16329307586428270032.stgit@devnote2
2020-12-06	x86/uprobes: Do not use prefixes.nbytes when looping over prefixes.bytes	Masami Hiramatsu	3	-4/+36
	Since insn.prefixes.nbytes can be bigger than the size of insn.prefixes.bytes[] when a prefix is repeated, the proper check must be insn.prefixes.bytes[i] != 0 and i < 4 instead of using insn.prefixes.nbytes. Introduce a for_each_insn_prefix() macro for this purpose. Debugged by Kees Cook <[email protected]>. [ bp: Massage commit message, sync with the respective header in tools/ and drop "we". ] Fixes: 2b1444983508 ("uprobes, mm, x86: Add the ability to install and remove uprobes breakpoints") Reported-by: [email protected] Signed-off-by: Masami Hiramatsu <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Srikar Dronamraju <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/160697103739.3146288.7437620795200799020.stgit@devnote2