blaster4385/linux-IllusionX - Linux kernel with personal config changes for arch linux

Age	Commit message (Collapse)	Author	Files	Lines
2019-10-24	xprtrdma: Refine trace_xprtrdma_fixup	Chuck Lever	2	-50/+15
	Slightly reduce overhead and display more useful information. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2019-10-24	xprtrdma: Report the computed connect delay	Chuck Lever	3	-33/+60
	For debugging, the op_connect trace point should report the computed connect delay. We can then ensure that the delay is computed at the proper times, for example. As a further clean-up, remove a few low-value "heartbeat" trace points in the connect path. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2019-10-24	xprtrdma: Wake tasks after connect worker fails	Chuck Lever	1	-9/+6
	Pending tasks are currently never awoken when the connect worker fails. The reason is that XPRT_CONNECTED is always clear after a failure return of rpcrdma_ep_connect, thus the xprt_test_and_clear_connected() check in xprt_rdma_connect_worker() always fails. - xprt_rdma_close always clears XPRT_CONNECTED. - rpcrdma_ep_connect always clears XPRT_CONNECTED. After reviewing the TCP connect worker, it appears that there's no need for extra test_and_set paranoia in xprt_rdma_connect_worker. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2019-10-24	xprtrdma: Pull up sometimes	Chuck Lever	5	-7/+85
	On some platforms, DMA mapping part of a page is more costly than copying bytes. Restore the pull-up code and use that when we think it's going to be faster. The heuristic for now is to pull-up when the size of the RPC message body fits in the buffer underlying the head iovec. Indeed, not involving the I/O MMU can help the RPC/RDMA transport scale better for tiny I/Os across more RDMA devices. This is because interaction with the I/O MMU is eliminated, as is handling a Send completion, for each of these small I/Os. Without the explicit unmapping, the NIC no longer needs to do a costly internal TLB shoot down for buffers that are just a handful of bytes. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2019-10-24	xprtrdma: Refactor rpcrdma_prepare_msg_sges()	Chuck Lever	1	-117/+146
	Refactor: Replace spaghetti with code that makes it plain what needs to be done for each rtype. This makes it easier to add features and optimizations. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2019-10-24	xprtrdma: Move the rpcrdma_sendctx::sc_wr field	Chuck Lever	5	-12/+11
	Clean up: This field is not needed in the Send completion handler, so it can be moved to struct rpcrdma_req to reduce the size of struct rpcrdma_sendctx, and to reduce the amount of memory that is sloshed between the sending process and the Send completion process. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2019-10-24	xprtrdma: Remove rpcrdma_sendctx::sc_device	Chuck Lever	2	-3/+2
	Micro-optimization: Save eight bytes in a frequently allocated structure. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2019-10-24	xprtrdma: Remove rpcrdma_sendctx::sc_xprt	Chuck Lever	2	-11/+10
	Micro-optimization: Save eight bytes in a frequently allocated structure. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2019-10-24	xprtrdma: Ensure ri_id is stable during MR recycling	Chuck Lever	2	-24/+6
	ia->ri_id is replaced during a reconnect. The connect_worker runs with the transport send lock held to prevent ri_id from being dereferenced by the send_request path during this process. Currently, however, there is no guarantee that ia->ri_id is stable in the MR recycling worker, which operates in the background and is not serialized with the connect_worker in any way. But now that Local_Inv completions are being done in process context, we can handle the recycling operation there instead of deferring the recycling work to another process. Because the disconnect path drains all work before allowing tear down to proceed, it is guaranteed that Local Invalidations complete only while the ri_id pointer is stable. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2019-10-24	xprtrdma: Manage MRs in context of a single connection	Chuck Lever	4	-57/+40
	MRs are now allocated on demand so we can safely throw them away on disconnect. This way an idle transport can disconnect and it won't pin hardware MR resources. Two additional changes: - Now that all MRs are destroyed on disconnect, there's no need to check during header marshaling if a req has MRs to recycle. Each req is sent only once per connection, and now rl_registered is guaranteed to be empty when rpcrdma_marshal_req is invoked. - Because MRs are now destroyed in a WQ_MEM_RECLAIM context, they also must be allocated in a WQ_MEM_RECLAIM context. This reduces the likelihood that device driver memory allocation will trigger memory reclaim during NFS writeback. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2019-10-24	xprtrdma: Fix MR list handling	Chuck Lever	1	-20/+21
	Close some holes introduced by commit 6dc6ec9e04c4 ("xprtrdma: Cache free MRs in each rpcrdma_req") that could result in list corruption. In addition, the result that is tabulated in @count is no longer used, so @count is removed. Fixes: 6dc6ec9e04c4 ("xprtrdma: Cache free MRs in each rpcrdma_req") Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2019-10-24	xprtrdma: Close window between waking RPC senders and posting Receives	Chuck Lever	3	-4/+9
	A recent clean up attempted to separate Receive handling and RPC Reply processing, in the name of clean layering. Unfortunately, we can't do this because the Receive Queue has to be refilled _after_ the most recent credit update from the responder is parsed from the transport header, but _before_ we wake up the next RPC sender. That is right in the middle of rpcrdma_reply_handler(). Usually this isn't a problem because current responder implementations don't vary their credit grant. The one exception is when a connection is established: the grant goes from one to a much larger number on the first Receive. The requester MUST post enough Receives right then so that any outstanding requests can be sent without risking RNR and connection loss. Fixes: 6ceea36890a0 ("xprtrdma: Refactor Receive accounting") Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2019-10-24	xprtrdma: Initialize rb_credits in one place	Chuck Lever	4	-16/+38
	Clean up/code de-duplication. Nit: RPC_CWNDSHIFT is incorrect as the initial value for xprt->cwnd. This mistake does not appear to have operational consequences, since the cwnd value is replaced with a valid value upon the first Receive completion. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2019-10-24	xprtrdma: Connection becomes unstable after a reconnect	Chuck Lever	2	-0/+25
	This is because xprt_request_get_cong() is allowing more than one RPC Call to be transmitted before the first Receive on the new connection. The first Receive fills the Receive Queue based on the server's credit grant. Before that Receive, there is only a single Receive WR posted because the client doesn't know the server's credit grant. Solution is to clear rq_cong on all outstanding rpc_rqsts when the the cwnd is reset. This is because an RPC/RDMA credit is good for one connection instance only. Fixes: 75891f502f5f ("SUNRPC: Support for congestion control ... ") Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2019-10-24	xprtrdma: Add unique trace points for posting Local Invalidate WRs	Chuck Lever	2	-2/+27
	When adding frwr_unmap_async way back when, I re-used the existing trace_xprtrdma_post_send() trace point to record the return code of ib_post_send. Unfortunately there are some cases where re-using that trace point causes a crash. Instead, construct a trace point specific to posting Local Invalidate WRs that will always be safe to use in that context, and will act as a trace log eye-catcher for Local Invalidation. Fixes: 847568942f93 ("xprtrdma: Remove fr_state") Fixes: d8099feda483 ("xprtrdma: Reduce context switching due ... ") Signed-off-by: Chuck Lever <[email protected]> Tested-by: Bill Baker <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2019-10-24	SUNRPC: Add trace points to observe transport congestion control	Chuck Lever	2	-9/+106
	To help debug problems with RPC/RDMA credit management, replace dprintk() call sites in the transport send lock paths with trace events. Similar trace points are defined for the non-congestion paths. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2019-10-24	SUNRPC: Eliminate log noise in call_reserveresult	Chuck Lever	1	-12/+2
	Sep 11 16:35:20 manet kernel: call_reserveresult: unrecognized error -512, exiting Diagnostic error messages such as this likely have no value for NFS client administrators. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2019-10-20	Linux 5.4-rc4	Linus Torvalds	1	-1/+1

2019-10-20	Merge tag 'kbuild-fixes-v5.4-2' of ↵	Linus Torvalds	3	-6/+9
	git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull more Kbuild fixes from Masahiro Yamada: - fix a bashism of setlocalversion - do not use the too new --sort option of tar * tag 'kbuild-fixes-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: kheaders: substituting --sort in archive creation scripts: setlocalversion: fix a bashism kbuild: update comment about KBUILD_ALLDIRS
2019-10-20	Merge branch 'x86-urgent-for-linus' of ↵	Linus Torvalds	6	-38/+84
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A small set of x86 fixes: - Prevent a NULL pointer dereference in the X2APIC code in case of a CPU hotplug failure. - Prevent boot failures on HP superdome machines by invalidating the level2 kernel pagetable entries outside of the kernel area as invalid so BIOS reserved space won't be touched unintentionally. Also ensure that memory holes are rounded up to the next PMD boundary correctly. - Enable X2APIC support on Hyper-V to prevent boot failures. - Set the paravirt name when running on Hyper-V for consistency - Move a function under the appropriate ifdef guard to prevent build warnings" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/boot/acpi: Move get_cmdline_acpi_rsdp() under #ifdef guard x86/hyperv: Set pv_info.name to "Hyper-V" x86/apic/x2apic: Fix a NULL pointer deref when handling a dying cpu x86/hyperv: Make vapic support x2apic mode x86/boot/64: Round memory hole size up to next PMD page x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area
2019-10-20	Merge branch 'irq-urgent-for-linus' of ↵	Linus Torvalds	5	-17/+43
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: "A small set of irq chip driver fixes and updates: - Update the SIFIVE PLIC interrupt driver to use the fasteoi handler to address the shortcomings of the existing flow handling which was prone to lose interrupts - Use the proper limit for GIC interrupt line numbers - Add retrigger support for the recently merged Anapurna Labs Fabric interrupt controller to make it complete - Enable the ATMEL AIC5 interrupt controller driver on the new SAM9X60 SoC" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/sifive-plic: Switch to fasteoi flow irqchip/gic-v3: Fix GIC_LINE_NR accessor irqchip/atmel-aic5: Add support for sam9x60 irqchip irqchip/al-fic: Add support for irq retrigger
2019-10-20	Merge branch 'timers-urgent-for-linus' of ↵	Linus Torvalds	1	-4/+4
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull hrtimer fixlet from Thomas Gleixner: "A single commit annotating the lockcless access to timer->base with READ_ONCE() and adding the WRITE_ONCE() counterparts for completeness" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hrtimer: Annotate lockless access to timer->base
2019-10-20	Merge branch 'core-urgent-for-linus' of ↵	Linus Torvalds	1	-4/+6
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull stop-machine fix from Thomas Gleixner: "A single fix, amending stop machine with WRITE/READ_ONCE() to address the fallout of KCSAN" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: stop_machine: Avoid potential race behaviour
2019-10-19	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	Linus Torvalds	179	-892/+1486
	Pull networking fixes from David Miller: "I was battling a cold after some recent trips, so quite a bit piled up meanwhile, sorry about that. Highlights: 1) Fix fd leak in various bpf selftests, from Brian Vazquez. 2) Fix crash in xsk when device doesn't support some methods, from Magnus Karlsson. 3) Fix various leaks and use-after-free in rxrpc, from David Howells. 4) Fix several SKB leaks due to confusion of who owns an SKB and who should release it in the llc code. From Eric Biggers. 5) Kill a bunc of KCSAN warnings in TCP, from Eric Dumazet. 6) Jumbo packets don't work after resume on r8169, as the BIOS resets the chip into non-jumbo mode during suspend. From Heiner Kallweit. 7) Corrupt L2 header during MPLS push, from Davide Caratti. 8) Prevent possible infinite loop in tc_ctl_action, from Eric Dumazet. 9) Get register bits right in bcmgenet driver, based upon chip version. From Florian Fainelli. 10) Fix mutex problems in microchip DSA driver, from Marek Vasut. 11) Cure race between route lookup and invalidation in ipv4, from Wei Wang. 12) Fix performance regression due to false sharing in 'net' structure, from Eric Dumazet" * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (145 commits) net: reorder 'struct net' fields to avoid false sharing net: dsa: fix switch tree list net: ethernet: dwmac-sun8i: show message only when switching to promisc net: aquantia: add an error handling in aq_nic_set_multicast_list net: netem: correct the parent's backlog when corrupted packet was dropped net: netem: fix error path for corrupted GSO frames macb: propagate errors when getting optional clocks xen/netback: fix error path of xenvif_connect_data() net: hns3: fix mis-counting IRQ vector numbers issue net: usb: lan78xx: Connect PHY before registering MAC vsock/virtio: discard packets if credit is not respected vsock/virtio: send a credit update when buffer size is changed mlxsw: spectrum_trap: Push Ethernet header before reporting trap net: ensure correct skb->tstamp in various fragmenters net: bcmgenet: reset 40nm EPHY on energy detect net: bcmgenet: soft reset 40nm EPHYs before MAC init net: phy: bcm7xxx: define soft_reset for 40nm EPHY net: bcmgenet: don't set phydev->link from MAC net: Update address for MediaTek ethernet driver in MAINTAINERS ipv4: fix race condition between route lookup and invalidation ...
2019-10-19	net: reorder 'struct net' fields to avoid false sharing	Eric Dumazet	1	-8/+17
	Intel test robot reported a ~7% regression on TCP_CRR tests that they bisected to the cited commit. Indeed, every time a new TCP socket is created or deleted, the atomic counter net->count is touched (via get_net(net) and put_net(net) calls) So cpus might have to reload a contended cache line in net_hash_mix(net) calls. We need to reorder 'struct net' fields to move @hash_mix in a read mostly cache line. We move in the first cache line fields that can be dirtied often. We probably will have to address in a followup patch the __randomize_layout that was added in linux-4.13, since this might break our placement choices. Fixes: 355b98553789 ("netns: provide pure entropy for net_hash_mix()") Signed-off-by: Eric Dumazet <[email protected]> Reported-by: kernel test robot <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-10-19	net: dsa: fix switch tree list	Vivien Didelot	1	-1/+1
	If there are multiple switch trees on the device, only the last one will be listed, because the arguments of list_add_tail are swapped. Fixes: 83c0afaec7b7 ("net: dsa: Add new binding implementation") Signed-off-by: Vivien Didelot <[email protected]> Reviewed-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-10-19	net: ethernet: dwmac-sun8i: show message only when switching to promisc	Mans Rullgard	1	-1/+2
	Printing the info message every time more than the max number of mac addresses are requested generates unnecessary log spam. Showing it only when the hw is not already in promiscous mode is equally informative without being annoying. Signed-off-by: Mans Rullgard <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-10-19	net: aquantia: add an error handling in aq_nic_set_multicast_list	Chenwandun	1	-0/+2
	add an error handling in aq_nic_set_multicast_list, it may not work when hw_multicast_list_set error; and at the same time it will remove gcc Wunused-but-set-variable warning. Signed-off-by: Chenwandun <[email protected]> Reviewed-by: Igor Russkikh <[email protected]> Reviewed-by: Andrew Lunn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-10-19	Merge branch 'netem-fix-further-issues-with-packet-corruption'	David S. Miller	1	-3/+8
	Jakub Kicinski says: ==================== net: netem: fix further issues with packet corruption This set is fixing two more issues with the netem packet corruption. First patch (which was previously posted) avoids NULL pointer dereference if the first frame gets freed due to allocation or checksum failure. v2 improves the clarity of the code a little as requested by Cong. Second patch ensures we don't return SUCCESS if the frame was in fact dropped. Thanks to this commit message for patch 1 no longer needs the "this will still break with a single-frame failure" disclaimer. ==================== Signed-off-by: David S. Miller <[email protected]>
2019-10-19	net: netem: correct the parent's backlog when corrupted packet was dropped	Jakub Kicinski	1	-0/+2
	If packet corruption failed we jump to finish_segs and return NET_XMIT_SUCCESS. Seeing success will make the parent qdisc increment its backlog, that's incorrect - we need to return NET_XMIT_DROP. Fixes: 6071bd1aa13e ("netem: Segment GSO packets on enqueue") Signed-off-by: Jakub Kicinski <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-10-19	net: netem: fix error path for corrupted GSO frames	Jakub Kicinski	1	-3/+6
	To corrupt a GSO frame we first perform segmentation. We then proceed using the first segment instead of the full GSO skb and requeue the rest of the segments as separate packets. If there are any issues with processing the first segment we still want to process the rest, therefore we jump to the finish_segs label. Commit 177b8007463c ("net: netem: fix backlog accounting for corrupted GSO frames") started using the pointer to the first segment in the "rest of segments processing", but as mentioned above the first segment may had already been freed at this point. Backlog corrections for parent qdiscs have to be adjusted. Fixes: 177b8007463c ("net: netem: fix backlog accounting for corrupted GSO frames") Reported-by: kbuild test robot <[email protected]> Reported-by: Dan Carpenter <[email protected]> Reported-by: Ben Hutchings <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-10-19	macb: propagate errors when getting optional clocks	Michael Tretter	1	-6/+6
	The tx_clk, rx_clk, and tsu_clk are optional. Currently the macb driver marks clock as not available if it receives an error when trying to get a clock. This is wrong, because a clock controller might return -EPROBE_DEFER if a clock is not available, but will eventually become available. In these cases, the driver would probe successfully but will never be able to adjust the clocks, because the clocks were not available during probe, but became available later. For example, the clock controller for the ZynqMP is implemented in the PMU firmware and the clocks are only available after the firmware driver has been probed. Use devm_clk_get_optional() in instead of devm_clk_get() to get the optional clock and propagate all errors to the calling function. Signed-off-by: Michael Tretter <[email protected]> Acked-by: Nicolas Ferre <[email protected]> Tested-by: Nicolas Ferre <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-10-19	xen/netback: fix error path of xenvif_connect_data()	Juergen Gross	1	-1/+0
	xenvif_connect_data() calls module_put() in case of error. This is wrong as there is no related module_get(). Remove the superfluous module_put(). Fixes: 279f438e36c0a7 ("xen-netback: Don't destroy the netdev until the vif is shut down") Cc: <[email protected]> # 3.12 Signed-off-by: Juergen Gross <[email protected]> Reviewed-by: Paul Durrant <[email protected]> Reviewed-by: Wei Liu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-10-19	net: hns3: fix mis-counting IRQ vector numbers issue	Yonglong Liu	6	-6/+58
	Currently, the num_msi_left means the vector numbers of NIC, but if the PF supported RoCE, it contains the vector numbers of NIC and RoCE(Not expected). This may cause interrupts lost in some case, because of the NIC module used the vector resources which belongs to RoCE. This patch adds a new variable num_nic_msi to store the vector numbers of NIC, and adjust the default TQP numbers and rss_size according to the value of num_nic_msi. Fixes: 46a3df9f9718 ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support") Signed-off-by: Yonglong Liu <[email protected]> Signed-off-by: Huazhong Tan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-10-19	Merge branch 'akpm' (patches from Andrew)	Linus Torvalds	27	-139/+165
	Merge misc fixes from Andrew Morton: "Rather a lot of fixes, almost all affecting mm/" * emailed patches from Andrew Morton <[email protected]>: (26 commits) scripts/gdb: fix debugging modules on s390 kernel/events/uprobes.c: only do FOLL_SPLIT_PMD for uprobe register mm/thp: allow dropping THP from page cache mm/vmscan.c: support removing arbitrary sized pages from mapping mm/thp: fix node page state in split_huge_page_to_list() proc/meminfo: fix output alignment mm/init-mm.c: include <linux/mman.h> for vm_committed_as_batch mm/filemap.c: include <linux/ramfs.h> for generic_file_vm_ops definition mm: include <linux/huge_mm.h> for is_vma_temporary_stack zram: fix race between backing_dev_show and backing_dev_store mm/memcontrol: update lruvec counters in mem_cgroup_move_account ocfs2: fix panic due to ocfs2_wq is null hugetlbfs: don't access uninitialized memmaps in pfn_range_valid_gigantic() mm: memblock: do not enforce current limit for memblock_phys* family mm: memcg: get number of pages on the LRU list in memcgroup base on lru_zone_size mm/gup: fix a misnamed "write" argument, and a related bug mm/gup_benchmark: add a missing "w" to getopt string ocfs2: fix error handling in ocfs2_setattr() mm: memcg/slab: fix panic in __free_slab() caused by premature memcg pointer release mm/memunmap: don't access uninitialized memmap in memunmap_pages() ...
2019-10-19	scripts/gdb: fix debugging modules on s390	Ilya Leoshkevich	1	-1/+7
	Currently lx-symbols assumes that module text is always located at module->core_layout->base, but s390 uses the following layout: +------+ <- module->core_layout->base \| GOT \| +------+ <- module->core_layout->base + module->arch->plt_offset \| PLT \| +------+ <- module->core_layout->base + module->arch->plt_offset + \| TEXT \| module->arch->plt_size +------+ Therefore, when trying to debug modules on s390, all the symbol addresses are skewed by plt_offset + plt_size. Fix by adding plt_offset + plt_size to module_addr in load_module_symbols(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ilya Leoshkevich <[email protected]> Reviewed-by: Jan Kiszka <[email protected]> Cc: Kieran Bingham <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Vasily Gorbik <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-19	kernel/events/uprobes.c: only do FOLL_SPLIT_PMD for uprobe register	Song Liu	1	-2/+11
	Attaching uprobe to text section in THP splits the PMD mapped page table into PTE mapped entries. On uprobe detach, we would like to regroup PMD mapped page table entry to regain performance benefit of THP. However, the regroup is broken For perf_event based trace_uprobe. This is because perf_event based trace_uprobe calls uprobe_unregister twice on close: first in TRACE_REG_PERF_CLOSE, then in TRACE_REG_PERF_UNREGISTER. The second call will split the PMD mapped page table entry, which is not the desired behavior. Fix this by only use FOLL_SPLIT_PMD for uprobe register case. Add a WARN() to confirm uprobe unregister never work on huge pages, and abort the operation when this WARN() triggers. Link: http://lkml.kernel.org/r/[email protected] Fixes: 5a52c9df62b4 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT") Signed-off-by: Song Liu <[email protected]> Reviewed-by: Srikar Dronamraju <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: William Kucharski <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-19	mm/thp: allow dropping THP from page cache	Kirill A. Shutemov	1	-0/+12
	Once a THP is added to the page cache, it cannot be dropped via /proc/sys/vm/drop_caches. Fix this issue with proper handling in invalidate_mapping_pages(). Link: http://lkml.kernel.org/r/[email protected] Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS") Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Song Liu <[email protected]> Tested-by: Song Liu <[email protected]> Acked-by: Yang Shi <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Srikar Dronamraju <[email protected]> Cc: William Kucharski <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-19	mm/vmscan.c: support removing arbitrary sized pages from mapping	William Kucharski	1	-4/+1
	__remove_mapping() assumes that pages can only be either base pages or HPAGE_PMD_SIZE. Ask the page what size it is. Link: http://lkml.kernel.org/r/[email protected] Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS") Signed-off-by: William Kucharski <[email protected]> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Song Liu <[email protected]> Acked-by: Yang Shi <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Srikar Dronamraju <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-19	mm/thp: fix node page state in split_huge_page_to_list()	Kirill A. Shutemov	1	-2/+7
	Make sure split_huge_page_to_list() handles the state of shmem THP and file THP properly. Link: http://lkml.kernel.org/r/[email protected] Fixes: 60fbf0ab5da1 ("mm,thp: stats for file backed THP") Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Song Liu <[email protected]> Tested-by: Song Liu <[email protected]> Acked-by: Yang Shi <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Srikar Dronamraju <[email protected]> Cc: William Kucharski <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-19	proc/meminfo: fix output alignment	Kirill A. Shutemov	1	-2/+2
	Patch series "Fixes for THP in page cache", v2. This patch (of 5): Add extra space for FileHugePages and FilePmdMapped, so the output is aligned with other rows. Link: http://lkml.kernel.org/r/[email protected] Fixes: 60fbf0ab5da1 ("mm,thp: stats for file backed THP") Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Song Liu <[email protected]> Tested-by: Song Liu <[email protected]> Acked-by: Yang Shi <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Srikar Dronamraju <[email protected]> Cc: William Kucharski <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-19	mm/init-mm.c: include <linux/mman.h> for vm_committed_as_batch	Ben Dooks (Codethink)	1	-0/+1
	mm_init.c needs to include <linux/mman.h> for the definition of vm_committed_as_batch. Fixes the following sparse warning: mm/mm_init.c:141:5: warning: symbol 'vm_committed_as_batch' was not declared. Should it be static? Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ben Dooks <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-19	mm/filemap.c: include <linux/ramfs.h> for generic_file_vm_ops definition	Ben Dooks	1	-0/+1
	The generic_file_vm_ops is defined in <linux/ramfs.h> so include it to fix the following warning: mm/filemap.c:2717:35: warning: symbol 'generic_file_vm_ops' was not declared. Should it be static? Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ben Dooks <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-19	mm: include <linux/huge_mm.h> for is_vma_temporary_stack	Ben Dooks	1	-0/+1
	Include <linux/huge_mm.h> for the definition of is_vma_temporary_stack to fix the following sparse warning: mm/rmap.c:1673:6: warning: symbol 'is_vma_temporary_stack' was not declared. Should it be static? Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ben Dooks <[email protected]> Reviewed-by: Qian Cai <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-19	zram: fix race between backing_dev_show and backing_dev_store	Chenwandun	1	-2/+3
	CPU0: CPU1: backing_dev_show backing_dev_store ...... ...... file = zram->backing_dev; down_read(&zram->init_lock); down_read(&zram->init_init_lock) file_path(file, ...); zram->backing_dev = backing_dev; up_read(&zram->init_lock); up_read(&zram->init_lock); gets the value of zram->backing_dev too early in backing_dev_show, which resultin the value being NULL at the beginning, and not NULL later. backtrace: d_path+0xcc/0x174 file_path+0x10/0x18 backing_dev_show+0x40/0xb4 dev_attr_show+0x20/0x54 sysfs_kf_seq_show+0x9c/0x10c kernfs_seq_show+0x28/0x30 seq_read+0x184/0x488 kernfs_fop_read+0x5c/0x1a4 __vfs_read+0x44/0x128 vfs_read+0xa0/0x138 SyS_read+0x54/0xb4 Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Chenwandun <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Jens Axboe <[email protected]> Cc: <[email protected]> [4.14+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-19	mm/memcontrol: update lruvec counters in mem_cgroup_move_account	Konstantin Khlebnikov	1	-6/+12
	Mapped, dirty and writeback pages are also counted in per-lruvec stats. These counters needs update when page is moved between cgroups. Currently is nobody consuming the lruvec versions of these counters and that there is no user-visible effect. Link: http://lkml.kernel.org/r/157112699975.7360.1062614888388489788.stgit@buzz Fixes: 00f3ca2c2d66 ("mm: memcontrol: per-lruvec stats infrastructure") Signed-off-by: Konstantin Khlebnikov <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-19	ocfs2: fix panic due to ocfs2_wq is null	Yi Li	2	-2/+4
	mount.ocfs2 failed when reading ocfs2 filesystem superblock encounters an error. ocfs2_initialize_super() returns before allocating ocfs2_wq. ocfs2_dismount_volume() triggers the following panic. Oct 15 16:09:27 cnwarekv-205120 kernel: On-disk corruption discovered.Please run fsck.ocfs2 once the filesystem is unmounted. Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_read_locked_inode:537 ERROR: status = -30 Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_init_global_system_inodes:458 ERROR: status = -30 Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_init_global_system_inodes:491 ERROR: status = -30 Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_initialize_super:2313 ERROR: status = -30 Oct 15 16:09:27 cnwarekv-205120 kernel: (mount.ocfs2,22804,44): ocfs2_fill_super:1033 ERROR: status = -30 ------------[ cut here ]------------ Oops: 0002 [#1] SMP NOPTI CPU: 1 PID: 11753 Comm: mount.ocfs2 Tainted: G E 4.14.148-200.ckv.x86_64 #1 Hardware name: Sugon H320-G30/35N16-US, BIOS 0SSDX017 12/21/2018 task: ffff967af0520000 task.stack: ffffa5f05484000 RIP: 0010:mutex_lock+0x19/0x20 Call Trace: flush_workqueue+0x81/0x460 ocfs2_shutdown_local_alloc+0x47/0x440 [ocfs2] ocfs2_dismount_volume+0x84/0x400 [ocfs2] ocfs2_fill_super+0xa4/0x1270 [ocfs2] ? ocfs2_initialize_super.isa.211+0xf20/0xf20 [ocfs2] mount_bdev+0x17f/0x1c0 mount_fs+0x3a/0x160 Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yi Li <[email protected]> Reviewed-by: Joseph Qi <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Junxiao Bi <[email protected]> Cc: Changwei Ge <[email protected]> Cc: Gang He <[email protected]> Cc: Jun Piao <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-19	hugetlbfs: don't access uninitialized memmaps in pfn_range_valid_gigantic()	David Hildenbrand	1	-3/+2
	Uninitialized memmaps contain garbage and in the worst case trigger kernel BUGs, especially with CONFIG_PAGE_POISONING. They should not get touched. Let's make sure that we only consider online memory (managed by the buddy) that has initialized memmaps. ZONE_DEVICE is not applicable. page_zone() will call page_to_nid(), which will trigger VM_BUG_ON_PGFLAGS(PagePoisoned(page), page) with CONFIG_PAGE_POISONING and CONFIG_DEBUG_VM_PGFLAGS when called on uninitialized memmaps. This can be the case when an offline memory block (e.g., never onlined) is spanned by a zone. Note: As explained by Michal in [1], alloc_contig_range() will verify the range. So it boils down to the wrong access in this function. [1] http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319] Signed-off-by: David Hildenbrand <[email protected]> Reported-by: Michal Hocko <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: <[email protected]> [4.13+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-19	mm: memblock: do not enforce current limit for memblock_phys* family	Mike Rapoport	1	-3/+3
	Until commit 92d12f9544b7 ("memblock: refactor internal allocation functions") the maximal address for memblock allocations was forced to memblock.current_limit only for the allocation functions returning virtual address. The changes introduced by that commit moved the limit enforcement into the allocation core and as a result the allocation functions returning physical address also started to limit allocations to memblock.current_limit. This caused breakage of etnaviv GPU driver: etnaviv etnaviv: bound 130000.gpu (ops gpu_ops) etnaviv etnaviv: bound 134000.gpu (ops gpu_ops) etnaviv etnaviv: bound 2204000.gpu (ops gpu_ops) etnaviv-gpu 130000.gpu: model: GC2000, revision: 5108 etnaviv-gpu 130000.gpu: command buffer outside valid memory window etnaviv-gpu 134000.gpu: model: GC320, revision: 5007 etnaviv-gpu 134000.gpu: command buffer outside valid memory window etnaviv-gpu 2204000.gpu: model: GC355, revision: 1215 etnaviv-gpu 2204000.gpu: Ignoring GPU with VG and FE2.0 Restore the behaviour of memblock_phys* family so that these functions will not enforce memblock.current_limit. Link: http://lkml.kernel.org/r/[email protected] Fixes: 92d12f9544b7 ("memblock: refactor internal allocation functions") Signed-off-by: Mike Rapoport <[email protected]> Reported-by: Adam Ford <[email protected]> Tested-by: Adam Ford <[email protected]> [imx6q-logicpd] Cc: Catalin Marinas <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Fabio Estevam <[email protected]> Cc: Lucas Stach <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-10-19	mm: memcg: get number of pages on the LRU list in memcgroup base on ↵	Honglei Wang	1	-4/+5
	lru_zone_size Commit 1a61ab8038e72 ("mm: memcontrol: replace zone summing with lruvec_page_state()") has made lruvec_page_state to use per-cpu counters instead of calculating it directly from lru_zone_size with an idea that this would be more effective. Tim has reported that this is not really the case for their database benchmark which is showing an opposite results where lruvec_page_state is taking up a huge chunk of CPU cycles (about 25% of the system time which is roughly 7% of total cpu cycles) on 5.3 kernels. The workload is running on a larger machine (96cpus), it has many cgroups (500) and it is heavily direct reclaim bound. Tim Chen said: : The problem can also be reproduced by running simple multi-threaded : pmbench benchmark with a fast Optane SSD swap (see profile below). : : : 6.15% 3.08% pmbench [kernel.vmlinux] [k] lruvec_lru_size : \| : \|--3.07%--lruvec_lru_size : \| \| : \| \|--2.11%--cpumask_next : \| \| \| : \| \| --1.66%--find_next_bit : \| \| : \| --0.57%--call_function_interrupt : \| \| : \| --0.55%--smp_call_function_interrupt : \| : \|--1.59%--0x441f0fc3d009 : \| _ops_rdtsc_init_base_freq : \| access_histogram : \| page_fault : \| __do_page_fault : \| handle_mm_fault : \| __handle_mm_fault : \| \| : \| --1.54%--do_swap_page : \| swapin_readahead : \| swap_cluster_readahead : \| \| : \| --1.53%--read_swap_cache_async : \| __read_swap_cache_async : \| alloc_pages_vma : \| __alloc_pages_nodemask : \| __alloc_pages_slowpath : \| try_to_free_pages : \| do_try_to_free_pages : \| shrink_node : \| shrink_node_memcg : \| \| : \| \|--0.77%--lruvec_lru_size : \| \| : \| --0.76%--inactive_list_is_low : \| \| : \| --0.76%--lruvec_lru_size : \| : --1.50%--measure_read : page_fault : __do_page_fault : handle_mm_fault : __handle_mm_fault : do_swap_page : swapin_readahead : swap_cluster_readahead : \| : --1.48%--read_swap_cache_async : __read_swap_cache_async : alloc_pages_vma : __alloc_pages_nodemask : __alloc_pages_slowpath : try_to_free_pages : do_try_to_free_pages : shrink_node : shrink_node_memcg : \| : \|--0.75%--inactive_list_is_low : \| \| : \| --0.75%--lruvec_lru_size : \| : --0.73%--lruvec_lru_size The likely culprit is the cache traffic the lruvec_page_state_local generates. Dave Hansen says: : I was thinking purely of the cache footprint. If it's reading : pn->lruvec_stat_local->count[idx] is three separate cachelines, so 192 : bytes of cache *96 CPUs = 18k of data, mostly read-only. 1 cgroup would : be 18k of data for the whole system and the caching would be pretty : efficient and all 18k would probably survive a tight page fault loop in : the L1. 500 cgroups would be ~90k of data per CPU thread which doesn't : fit in the L1 and probably wouldn't survive a tight page fault loop if : both logical threads were banging on different cgroups. : : It's just a theory, but it's why I noted the number of cgroups when I : initially saw this show up in profiles Fix the regression by partially reverting the said commit and calculate the lru size explicitly. Link: http://lkml.kernel.org/r/[email protected] Fixes: 1a61ab8038e72 ("mm: memcontrol: replace zone summing with lruvec_page_state()") Signed-off-by: Honglei Wang <[email protected]> Reported-by: Tim Chen <[email protected]> Acked-by: Tim Chen <[email protected]> Tested-by: Tim Chen <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Dave Hansen <[email protected]> Cc: <[email protected]> [5.2+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>