path: root/include
2012-07-31  nfs: enable swap on NFS  (Mel Gorman; 2 files, -2/+5)
Implement the new swapfile a_ops for NFS and hook up ->direct_IO. This will set the NFS socket to SOCK_MEMALLOC and run socket reconnect under PF_MEMALLOC as well as reset SOCK_MEMALLOC before engaging the protocol ->connect() method. PF_MEMALLOC should allow the allocation of struct socket and related objects and the early (re)setting of SOCK_MEMALLOC should allow us to receive the packets required for the TCP connection buildup. [[email protected]: Restore PF_MEMALLOC task flags in all cases] [[email protected]: Fix handling of multiple swap files] [[email protected]: Original patch] Signed-off-by: Mel Gorman <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David S. Miller <[email protected]> Cc: Eric B Munson <[email protected]> Cc: Eric Paris <[email protected]> Cc: James Morris <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Christie <[email protected]> Cc: Neil Brown <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Xiaotian Feng <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
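As a rough sketch of the pattern described above (not the literal NFS code; sk_set_memalloc() comes from an earlier patch in this series, and xs_swap_connect() is a hypothetical name), the transport setup looks like:

    static int xs_swap_connect(struct socket *sock, struct sockaddr *addr,
                               int addrlen, int flags)
    {
            unsigned long pflags = current->flags;
            int err;

            /* SOCK_MEMALLOC is (re)set before any traffic so the packets
             * needed for the TCP connection buildup can be received. */
            sk_set_memalloc(sock->sk);

            /* struct socket and friends may be allocated from reserves */
            current->flags |= PF_MEMALLOC;
            err = sock->ops->connect(sock, addr, addrlen, flags);

            /* restore PF_MEMALLOC exactly as it was, in all cases */
            current->flags = (current->flags & ~PF_MEMALLOC) |
                             (pflags & PF_MEMALLOC);
            return err;
    }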
2012-07-31  mm: add support for direct_IO to highmem pages  (Mel Gorman; 1 file, -0/+7)
The patch "mm: add support for a filesystem to activate swap files and use direct_IO for writing swap pages" added support for using direct_IO to write swap pages but it is insufficient for highmem pages. To support highmem pages, this patch kmaps() the page before calling the direct_IO() handler. As direct_IO deals with virtual addresses an additional helper is necessary for get_kernel_pages() to lookup the struct page for a kmap virtual address. Signed-off-by: Mel Gorman <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David S. Miller <[email protected]> Cc: Eric B Munson <[email protected]> Cc: Eric Paris <[email protected]> Cc: James Morris <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Christie <[email protected]> Cc: Neil Brown <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Xiaotian Feng <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31  mm: swap: implement generic handler for swap_activate  (Mel Gorman; 2 files, -2/+9)
The version of swap_activate introduced is sufficient for swap-over-NFS but would not provide enough information to implement a generic handler. This patch shuffles things slightly to ensure the same information is available for aops->swap_activate() as is available to the core. No functionality change. Signed-off-by: Mel Gorman <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David S. Miller <[email protected]> Cc: Eric B Munson <[email protected]> Cc: Eric Paris <[email protected]> Cc: James Morris <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Christie <[email protected]> Cc: Neil Brown <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Xiaotian Feng <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31  mm: add support for a filesystem to activate swap files and use direct_IO for writing swap pages  (Mel Gorman; 2 files, -0/+6)
Currently swapfiles are managed entirely by the core VM by using ->bmap to allocate space and write to the blocks directly. This effectively ensures that the underlying blocks are allocated and avoids the need for the swap subsystem to locate what physical blocks store offsets within a file. If the swap subsystem is to use the filesystem information to locate the blocks, it is critical that information such as block groups, block bitmaps and the block descriptor table that map the swap file be resident in memory. This patch adds address_space_operations that the VM can call when activating or deactivating swap backed by a file.

    int swap_activate(struct file *);
    int swap_deactivate(struct file *);

The ->swap_activate() method is used to communicate to the file that the VM relies on it, and the address_space should take adequate measures such as reserving space in the underlying device, reserving memory for mempools and pinning information such as the block descriptor table in memory. The ->swap_deactivate() method is called on sys_swapoff() if ->swap_activate() returned success. After a successful swapfile ->swap_activate, the swapfile is marked SWP_FILE and swapper_space.a_ops will proxy to sis->swap_file->f_mapping->a_ops, using ->direct_IO to write swapcache pages and ->readpage to read. It is perfectly possible that direct_IO be used to read the swap pages but it is an unnecessary complication. Similarly, it is possible that ->writepage be used instead of direct_IO to write the pages but filesystem developers have stated that calling writepage from the VM is undesirable for a variety of reasons and using direct_IO opens up the possibility of writing back batches of swap pages in the future. [[email protected]: Original patch] Signed-off-by: Mel Gorman <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David S. Miller <[email protected]> Cc: Eric B Munson <[email protected]> Cc: Eric Paris <[email protected]> Cc: James Morris <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Christie <[email protected]> Cc: Neil Brown <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Xiaotian Feng <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
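A minimal sketch of how a filesystem might wire the new methods up (hypothetical myfs_* names; note that the "generic handler" patch above this entry reworks swap_activate to also receive the swap_info_struct and a span):

    static int myfs_swap_activate(struct file *file)
    {
            /* Pin metadata (block bitmaps, descriptor tables), reserve
             * mempool/device space so swap-out never blocks on them. */
            return 0;
    }

    static int myfs_swap_deactivate(struct file *file)
    {
            /* Release whatever swap_activate() reserved. */
            return 0;
    }

    static const struct address_space_operations myfs_aops = {
            .readpage        = myfs_readpage,      /* reads swap pages       */
            .direct_IO       = myfs_direct_IO,     /* writes swapcache pages */
            .swap_activate   = myfs_swap_activate,
            .swap_deactivate = myfs_swap_deactivate,
    };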
2012-07-31  mm: add get_kernel_page[s] for pinning of kernel addresses for I/O  (Mel Gorman; 3 files, -0/+8)
This patch adds two new APIs get_kernel_pages() and get_kernel_page() that may be used to pin a vector of kernel addresses for IO. The initial user is expected to be NFS for allowing pages to be written to swap using aops->direct_IO(). Strictly speaking, swap-over-NFS only needs to pin one page for IO but it makes sense to express the API in terms of a vector and add a helper for pinning single pages. Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David S. Miller <[email protected]> Cc: Eric B Munson <[email protected]> Cc: Eric Paris <[email protected]> Cc: James Morris <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Christie <[email protected]> Cc: Neil Brown <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Xiaotian Feng <[email protected]> Cc: Mark Salter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
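Assuming the signatures as merged (a kvec-based vector API plus a single-page wrapper), a usage sketch for pinning one kernel buffer might be:

    /* Pin the page backing a single kernel-virtual buffer for I/O.
     * Returns the page with an elevated refcount, or NULL on failure;
     * the caller drops the reference with put_page(). */
    static struct page *pin_kernel_buffer(void *buf)
    {
            struct page *page;

            if (get_kernel_page((unsigned long)buf, 0 /* read */, &page) != 1)
                    return NULL;
            return page;
    }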
2012-07-31  mm: methods for teaching filesystems about PG_swapcache pages  (Mel Gorman; 3 files, -0/+31)
In order to teach filesystems to handle swap cache pages, three new page functions are introduced: pgoff_t page_file_index(struct page *); loff_t page_file_offset(struct page *); struct address_space *page_file_mapping(struct page *); page_file_index() - gives the offset of this page in the file in PAGE_CACHE_SIZE blocks. Like page->index is for mapped pages, this function also gives the correct index for PG_swapcache pages. page_file_offset() - uses page_file_index(), so that it will give the expected result, even for PG_swapcache pages. page_file_mapping() - gives the mapping backing the actual page; that is for swap cache pages it will give swap_file->f_mapping. Signed-off-by: Peter Zijlstra <[email protected]> Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David S. Miller <[email protected]> Cc: Eric B Munson <[email protected]> Cc: Eric Paris <[email protected]> Cc: James Morris <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Christie <[email protected]> Cc: Neil Brown <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Xiaotian Feng <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
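The helpers are thin wrappers; roughly (close to the merged code, with the swapcache details hidden behind the double-underscore variants):

    static inline struct address_space *page_file_mapping(struct page *page)
    {
            if (unlikely(PageSwapCache(page)))
                    return __page_file_mapping(page); /* swap_file->f_mapping */
            return page->mapping;
    }

    static inline loff_t page_file_offset(struct page *page)
    {
            /* correct even for PG_swapcache pages */
            return ((loff_t)page_file_index(page)) << PAGE_CACHE_SHIFT;
    }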
2012-07-31  netvm: prevent a stream-specific deadlock  (Mel Gorman; 1 file, -3/+5)
This patch series is based on top of "Swap-over-NBD without deadlocking v15" as it depends on the same reservation of PF_MEMALLOC reserves logic. When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. In diskless systems this is not an option, so if swap is required then swapping over the network is considered. The two likely scenarios are blade servers used as part of a cluster, where the form factor or maintenance costs do not allow the use of disks, and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap but this is not always an option. There is no guarantee that the network attached storage (NAS) device is running Linux or supports NBD. However, it is likely that it supports NFS, so there are users that want support for swapping over NFS despite any performance concern. Some distributions currently carry patches that support swapping over NFS but it would be preferable to support it in the mainline kernel. Patch 1 avoids a stream-specific deadlock that potentially affects TCP. Patch 2 is a small modification to SELinux to avoid using PFMEMALLOC reserves. Patch 3 adds three helpers for filesystems to handle swap cache pages. For example, page_file_mapping() returns page->mapping for file-backed pages and the address_space of the underlying swap file for swap cache pages. Patch 4 adds two address_space_operations to allow a filesystem to pin all metadata relevant to a swapfile in memory. Upon successful activation, the swapfile is marked SWP_FILE and the address space operation ->direct_IO is used for writing and ->readpage for reading in swap pages. Patch 5 notes that patch 3 is bolting filesystem-specific-swapfile-support onto the side and that the default handlers have different information to what is available to the filesystem. This patch refactors the code so that there are generic handlers for each of the new address_space operations. Patch 6 adds an API to allow a vector of kernel addresses to be translated to struct pages and pinned for IO. Patch 7 adds support for using highmem pages for swap by kmapping the pages before calling the direct_IO handler. Patch 8 updates NFS to use the helpers from patch 3 where necessary. Patch 9 avoids setting PG_private on PG_swapcache pages within NFS. Patch 10 implements the new swapfile-related address_space operations for NFS and teaches the direct IO handler how to manage kernel addresses. Patch 11 prevents page allocator recursions in NFS by using GFP_NOIO where appropriate. Patch 12 fixes a NULL pointer dereference that occurs when using swap-over-NFS. With the patches applied, it is possible to mount a swapfile that is on an NFS filesystem. Swap performance is not great, with a swap stress test taking roughly twice as long to complete as it would if the swap device were backed by NBD. This patch: netvm: prevent a stream-specific deadlock It could happen that all !SOCK_MEMALLOC sockets have buffered so much data that we're over the global rmem limit. This will prevent SOCK_MEMALLOC buffers from receiving data, which will prevent userspace from running, which is needed to reduce the buffered data. Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit. Once this change is applied, it is important that sockets that set SOCK_MEMALLOC do not clear the flag until the socket is being torn down. If this happens, a warning is generated and the tokens are reclaimed to avoid accounting errors until the bug is fixed. [[email protected]: Warning about clearing SOCK_MEMALLOC] Signed-off-by: Peter Zijlstra <[email protected]> Signed-off-by: Mel Gorman <[email protected]> Acked-by: David S. Miller <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Neil Brown <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Mike Christie <[email protected]> Cc: Eric B Munson <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31  mm: account for the number of times direct reclaimers get throttled  (Mel Gorman; 1 file, -0/+1)
Under significant pressure when writing back to network-backed storage, direct reclaimers may get throttled. This is expected to be a short-lived event and the processes get woken up again but processes do get stalled. This patch counts how many times such stalling occurs. It's up to the administrator whether to reduce these stalls by increasing min_free_kbytes. Signed-off-by: Mel Gorman <[email protected]> Cc: David Miller <[email protected]> Cc: Neil Brown <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Mike Christie <[email protected]> Cc: Eric B Munson <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31  mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage  (Mel Gorman; 1 file, -0/+1)
If swap is backed by network storage such as NBD, there is a risk that a large number of reclaimers can hang the system by consuming all PF_MEMALLOC reserves. To avoid these hangs, the administrator must tune min_free_kbytes in advance, which is a bit fragile. This patch throttles direct reclaimers if half the PF_MEMALLOC reserves are in use. If the system is routinely getting throttled, the system administrator can increase min_free_kbytes so degradation is smoother, but the system will keep running. Signed-off-by: Mel Gorman <[email protected]> Cc: David Miller <[email protected]> Cc: Neil Brown <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Mike Christie <[email protected]> Cc: Eric B Munson <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
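The throttle test can be sketched as follows (simplified from the merged patch; reclaimers failing this check sleep on a per-node waitqueue, assumed here to be pgdat->pfmemalloc_wait, until kswapd makes progress):

    /* Return true if enough of the PF_MEMALLOC reserve is free; the
     * reserve is approximated by the sum of the min watermarks of the
     * zones usable for kernel allocations (up to ZONE_NORMAL). */
    static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
    {
            unsigned long pfmemalloc_reserve = 0;
            unsigned long free_pages = 0;
            int i;

            for (i = 0; i <= ZONE_NORMAL; i++) {
                    struct zone *zone = &pgdat->node_zones[i];

                    pfmemalloc_reserve += min_wmark_pages(zone);
                    free_pages += zone_page_state(zone, NR_FREE_PAGES);
            }

            /* throttle once half the reserve is consumed */
            return free_pages > pfmemalloc_reserve / 2;
    }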
2012-07-31  netvm: set PF_MEMALLOC as appropriate during SKB processing  (Mel Gorman; 1 file, -0/+5)
In order to make sure pfmemalloc packets receive all memory needed to proceed, ensure processing of pfmemalloc SKBs happens under PF_MEMALLOC. This is limited to a subset of protocols that are expected to be used for writing to swap. Taps are not allowed to use PF_MEMALLOC as these are expected to communicate with userspace processes which could be paged out. [[email protected]: Ideas taken from various patches] [[email protected]: Lock imbalance fix] Signed-off-by: Mel Gorman <[email protected]> Acked-by: David S. Miller <[email protected]> Cc: Neil Brown <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Mike Christie <[email protected]> Cc: Eric B Munson <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31  netvm: propagate page->pfmemalloc from skb_alloc_page to skb  (Mel Gorman; 1 file, -0/+55)
The skb->pfmemalloc flag is set to true iff the PFMEMALLOC reserves were used during the slab allocation of data in __alloc_skb. If page splitting is used, it is possible that pages will be allocated from the PFMEMALLOC reserve without propagating this information to the skb. This patch propagates page->pfmemalloc from pages allocated for fragments to the skb. It works by reintroducing and expanding the skb_alloc_page() API to take an skb. If the page was allocated from pfmemalloc reserves, the flag is automatically copied. If the driver allocates the page before the skb, it should call skb_propagate_pfmemalloc() after the skb is allocated to ensure the flag is copied properly. Failure to do so is not critical. The resulting driver may perform slower if it is used for swap-over-NBD or swap-over-NFS but it should not result in failure. [[email protected]: API rename and consistency] Signed-off-by: Mel Gorman <[email protected]> Acked-by: David S. Miller <[email protected]> Cc: Neil Brown <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Mike Christie <[email protected]> Cc: Eric B Munson <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
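A driver-side usage sketch (rx_build_skb() is a hypothetical helper; skb_propagate_pfmemalloc() is the one this patch adds):

    /* RX refill path where the page is allocated before the skb exists:
     * propagate page->pfmemalloc once the skb is available. */
    static struct sk_buff *rx_build_skb(struct net_device *dev,
                                        struct page *page, unsigned int len)
    {
            struct sk_buff *skb = netdev_alloc_skb(dev, len);

            if (skb)
                    skb_propagate_pfmemalloc(page, skb);
            return skb;
    }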
2012-07-31  netvm: propagate page->pfmemalloc to skb  (Mel Gorman; 1 file, -0/+11)
The skb->pfmemalloc flag is set to true iff the PFMEMALLOC reserves were used during the slab allocation of data in __alloc_skb. If the packet is fragmented, it is possible that pages will be allocated from the PFMEMALLOC reserve without propagating this information to the skb. This patch propagates page->pfmemalloc from pages allocated for fragments to the skb. Signed-off-by: Mel Gorman <[email protected]> Acked-by: David S. Miller <[email protected]> Cc: Neil Brown <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Mike Christie <[email protected]> Cc: Eric B Munson <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31  netvm: allow skb allocation to use PFMEMALLOC reserves  (Mel Gorman; 3 files, -2/+30)
Change the skb allocation API to indicate RX usage and use this to fall back to the PFMEMALLOC reserve when needed. SKBs allocated from the reserve are tagged in skb->pfmemalloc. If an SKB is allocated from the reserve and the socket is later found to be unrelated to page reclaim, the packet is dropped so that the memory remains available for page reclaim. Network protocols are expected to recover from this packet loss. [[email protected]: Ideas taken from various patches] [[email protected]: Use static branches, coding style corrections] [[email protected]: Avoid unnecessary cast, fix !CONFIG_NET build] Signed-off-by: Mel Gorman <[email protected]> Acked-by: David S. Miller <[email protected]> Cc: Neil Brown <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Mike Christie <[email protected]> Cc: Eric B Munson <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31  netvm: allow the use of __GFP_MEMALLOC by specific sockets  (Mel Gorman; 1 file, -1/+4)
Allow specific sockets to be tagged SOCK_MEMALLOC and use __GFP_MEMALLOC for their allocations. These sockets will be able to go below watermarks and allocate from the emergency reserve. Such sockets are to be used to service the VM (i.e. to swap over). They must be handled kernel side; exposing such a socket to user-space is a bug. There is a risk that the reserves will be depleted, so for now the administrator is responsible for increasing min_free_kbytes as necessary to prevent deadlock for their workloads. [[email protected]: Original patches] Signed-off-by: Mel Gorman <[email protected]> Acked-by: David S. Miller <[email protected]> Cc: Neil Brown <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Mike Christie <[email protected]> Cc: Eric B Munson <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
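The tagging helpers are small; a sketch close to the merged code (the merged version additionally flips a static key consulted on the RX fast path):

    void sk_set_memalloc(struct sock *sk)
    {
            sock_set_flag(sk, SOCK_MEMALLOC);
            sk->sk_allocation |= __GFP_MEMALLOC; /* may dip below watermarks */
    }

    void sk_clear_memalloc(struct sock *sk)
    {
            sock_reset_flag(sk, SOCK_MEMALLOC);
            sk->sk_allocation &= ~__GFP_MEMALLOC;
    }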
2012-07-31  net: introduce sk_gfp_atomic() to allow addition of GFP flags depending on the individual socket  (Mel Gorman; 1 file, -0/+5)
Introduce sk_gfp_atomic(); this function allows injecting sock-specific flags into each sock-related allocation. It is only used on allocation paths that may be required for writing pages back to network storage. [[email protected]: Use sk_gfp_atomic only when necessary] Signed-off-by: Peter Zijlstra <[email protected]> Signed-off-by: Mel Gorman <[email protected]> Acked-by: David S. Miller <[email protected]> Cc: Neil Brown <[email protected]> Cc: Mike Christie <[email protected]> Cc: Eric B Munson <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
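The helper itself is tiny; roughly as merged:

    static inline gfp_t sk_gfp_atomic(struct sock *sk, gfp_t gfp_mask)
    {
            /* add the socket's memalloc bit, if any, on top of GFP_ATOMIC */
            return GFP_ATOMIC | (sk->sk_allocation & __GFP_MEMALLOC);
    }

A SOCK_MEMALLOC socket thus gets __GFP_MEMALLOC on these paths automatically, while ordinary sockets see plain GFP_ATOMIC.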
2012-07-31  mm: allow PF_MEMALLOC from softirq context  (Mel Gorman; 1 file, -0/+7)
This is needed to allow network softirq packet processing to make use of PF_MEMALLOC. Currently softirq context cannot use PF_MEMALLOC due to it not being associated with a task, and therefore not having task flags to fiddle with - thus the gfp to alloc flag mapping ignores the task flags when in interrupt (hard or soft) context. Allowing softirqs to make use of PF_MEMALLOC therefore requires some trickery. This patch borrows the task flags from whatever process happens to be preempted by the softirq. It then modifies the gfp to alloc flags mapping to not exclude task flags in softirq context, and modifies the softirq code to save, clear and restore the PF_MEMALLOC flag. The save and clear ensure the preempted task's PF_MEMALLOC flag doesn't leak into the softirq. The restore ensures a softirq's PF_MEMALLOC flag cannot leak back into the preempted process. This should be safe for the following reasons. Softirqs can run on multiple CPUs, but the same task should not be executing the same softirq code. Neither should the softirq handler be preempted by any other softirq handler, so the flags should not leak to an unrelated softirq. Softirqs re-enable hardware interrupts in __do_softirq() and so can be preempted by hardware interrupts, so PF_MEMALLOC is inherited by the hard IRQ. However, this is similar to a process in reclaim being preempted by a hardirq. While PF_MEMALLOC is set, gfp_to_alloc_flags() distinguishes between hard and soft irqs and avoids giving a hardirq the ALLOC_NO_WATERMARKS flag. If the softirq is deferred to ksoftirqd then its flags may be used instead of a normal task's, but as the softirq cannot be preempted, the PF_MEMALLOC flag does not leak to other code by accident. [[email protected]: Document why PF_MEMALLOC is safe] Signed-off-by: Peter Zijlstra <[email protected]> Signed-off-by: Mel Gorman <[email protected]> Cc: David Miller <[email protected]> Cc: Neil Brown <[email protected]> Cc: Mike Christie <[email protected]> Cc: Eric B Munson <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
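The save/clear/restore dance can be sketched like this (sketch only; tsk_restore_flags() is the helper this series adds, and the handler loop and accounting of the real function are elided):

    asmlinkage void __do_softirq(void)
    {
            unsigned long old_flags = current->flags;

            /* don't let the preempted task's PF_MEMALLOC leak in */
            current->flags &= ~PF_MEMALLOC;

            /* ... run the pending softirq handlers, which may set
             *     PF_MEMALLOC around pfmemalloc skb processing ... */

            /* don't let a softirq's PF_MEMALLOC leak back out */
            tsk_restore_flags(current, old_flags, PF_MEMALLOC);
    }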
2012-07-31  mm: introduce __GFP_MEMALLOC to allow access to emergency reserves  (Mel Gorman; 3 files, -3/+10)
__GFP_MEMALLOC will allow the allocation to disregard the watermarks, much like PF_MEMALLOC. It allows one to pass along the memalloc state in object-related allocation flags as opposed to task-related flags, such as sk->sk_allocation. This removes the need for ALLOC_PFMEMALLOC as callers using __GFP_MEMALLOC can get the ALLOC_NO_WATERMARKS flag, which is now enough to identify allocations related to page reclaim. Signed-off-by: Peter Zijlstra <[email protected]> Signed-off-by: Mel Gorman <[email protected]> Cc: David Miller <[email protected]> Cc: Neil Brown <[email protected]> Cc: Mike Christie <[email protected]> Cc: Eric B Munson <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
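Conceptually, the change to the flag mapping looks like this (a simplified sketch of gfp_to_alloc_flags(), not the full function):

    static int gfp_to_alloc_flags_sketch(gfp_t gfp_mask)
    {
            int alloc_flags = 0;

            if (gfp_mask & __GFP_MEMALLOC)
                    /* object-level grant of the no-watermark privilege */
                    alloc_flags |= ALLOC_NO_WATERMARKS;
            else if (!in_interrupt() && (current->flags & PF_MEMALLOC))
                    /* task-level grant, as before */
                    alloc_flags |= ALLOC_NO_WATERMARKS;

            return alloc_flags;
    }

A caller such as a SOCK_MEMALLOC socket can then request reserve access per allocation, e.g. alloc_skb(size, GFP_ATOMIC | __GFP_MEMALLOC), without touching task flags.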
2012-07-31  mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages  (Mel Gorman; 2 files, -0/+38)
When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are blade servers used as part of a cluster, where the form factor or maintenance costs do not allow the use of disks, and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks. Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required. Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allow network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propagate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled. Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings was 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes; with them applied, it runs to completion. With SLAB the story is different, as an unpatched kernel runs to completion. However, the patched kernel completed the test 45% faster.

    MICRO                                      3.5.0-rc2   3.5.0-rc2
                                                 vanilla     swapnbd
    Unrecognised test vmscan-anon-mmap-write

    MMTests Statistics: duration
    Sys Time Running Test (seconds)               197.80      173.07
    User+Sys Time Running Test (seconds)          206.96      182.03
    Total Elapsed Time (seconds)                 3240.70     1762.09

This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to be able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from the PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [[email protected]: Original implementation] [[email protected]: Correct order of page flag clearing] Signed-off-by: Mel Gorman <[email protected]> Cc: David Miller <[email protected]> Cc: Neil Brown <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Mike Christie <[email protected]> Cc: Eric B Munson <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31  memory-hotplug: fix kswapd looping forever problem  (Minchan Kim; 1 file, -0/+8)
When hotplug offlining happens on zone A, it starts to mark freed pages as MIGRATE_ISOLATE type in buddy to prevent further allocation. (MIGRATE_ISOLATE is an ironic type because the pages are apparently in buddy but we can't allocate them.) When a memory shortage happens during hotplug offlining, the current task starts to reclaim, then wakes up kswapd. Kswapd checks the watermark, then goes to sleep because the current zone_watermark_ok_safe doesn't consider the MIGRATE_ISOLATE freed page count. The current task continues to reclaim in the direct reclaim path without kswapd's help. The problem is that zone->all_unreclaimable is set only by kswapd, so the current task would loop forever like below.

    __alloc_pages_slowpath
    restart:
        wake_all_kswapd
    rebalance:
        __alloc_pages_direct_reclaim
            do_try_to_free_pages
                if global_reclaim && !all_unreclaimable
                    return 1; /* it means we did did_some_progress */
        skip __alloc_pages_may_oom
        should_alloc_retry
            goto rebalance;

If we apply KOSAKI's patch[1], which doesn't depend on kswapd for setting zone->all_unreclaimable, we can solve this problem by killing some task in the direct reclaim path. But it still doesn't wake up kswapd. That could still be a problem if another subsystem needs a GFP_ATOMIC request. So kswapd should consider MIGRATE_ISOLATE when it calculates free pages BEFORE going to sleep. This patch counts the number of MIGRATE_ISOLATE page blocks and zone_watermark_ok_safe will consider it if the system has such blocks (fortunately, it's very rare so there is no overhead problem, and kswapd is never a hotpath). Copy/modify from Mel's quote " Ideal solution would be "allocating" the pageblock. It would keep the free space accounting as it is but historically, memory hotplug didn't allocate pages because it would be difficult to detect if a pageblock was isolated or if part of some balloon. Allocating just full pageblocks would work around this, However, it would play very badly with CMA. " [1] http://lkml.org/lkml/2012/6/14/74 [[email protected]: simplify nr_zone_isolate_freepages(), rework zone_watermark_ok_safe() comment, simplify set_pageblock_isolate() and restore_pageblock_isolate()] [[email protected]: fix CONFIG_MEMORY_ISOLATION=n build] Signed-off-by: Minchan Kim <[email protected]> Suggested-by: KOSAKI Motohiro <[email protected]> Tested-by: Aaditya Kumar <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
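The resulting check can be sketched as follows (simplified; the merged code also corrects for per-cpu counter drift, and nr_pageblock_isolate is the zone counter this patch introduces):

    bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
                                int classzone_idx, int alloc_flags)
    {
            long free_pages = zone_page_state(z, NR_FREE_PAGES);

            /* MIGRATE_ISOLATE pages sit in buddy but cannot be allocated,
             * so don't let them fool the watermark check. */
            if (unlikely(z->nr_pageblock_isolate))
                    free_pages -= z->nr_pageblock_isolate * pageblock_nr_pages;

            return __zone_watermark_ok(z, order, mark, classzone_idx,
                                       alloc_flags, free_pages);
    }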
2012-07-31  mm: factor out memory isolate functions  (Minchan Kim; 1 file, -4/+9)
mm/page_alloc.c has some memory isolation functions but they are used only when we enable CONFIG_{CMA|MEMORY_HOTPLUG|MEMORY_FAILURE}. So let's make them configurable via a new CONFIG_MEMORY_ISOLATION option, which reduces binary size and lets us check simply for CONFIG_MEMORY_ISOLATION instead of defined CONFIG_{CMA|MEMORY_HOTPLUG|MEMORY_FAILURE}. Signed-off-by: Minchan Kim <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Marek Szyprowski <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31  mm, memcg: move all oom handling to memcontrol.c  (David Rientjes; 2 files, -2/+3)
By globally defining check_panic_on_oom(), the memcg oom handler can be moved entirely to mm/memcontrol.c. This removes the ugly #ifdef in the oom killer and cleans up the code. Signed-off-by: David Rientjes <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31  mm, memcg: introduce own oom handler to iterate only over its own threads  (David Rientjes; 2 files, -7/+18)
The global oom killer is serialized by the per-zonelist try_set_zonelist_oom() which is used in the page allocator. Concurrent oom kills are thus a rare event and only occur in systems using mempolicies and with a large number of nodes. Memory controller oom kills, however, can frequently be concurrent since there is no serialization once the oom killer is called for oom conditions in several different memcgs in parallel. This creates a massive contention on tasklist_lock since the oom killer requires the readside for the tasklist iteration. If several memcgs are calling the oom killer, this lock can be held for a substantial amount of time, especially if threads continue to enter it as other threads are exiting. Since the exit path grabs the writeside of the lock with irqs disabled in a few different places, this can cause a soft lockup on cpus as a result of tasklist_lock starvation. The kernel lacks unfair writelocks, and successful calls to the oom killer usually result in at least one thread entering the exit path, so an alternative solution is needed. This patch introduces a separate oom handler for memcgs so that they do not require tasklist_lock for as much time. Instead, it iterates only over the threads attached to the oom memcg and grabs a reference to the selected thread before calling oom_kill_process() to ensure it doesn't prematurely exit. This still requires tasklist_lock for the tasklist dump, iterating children of the selected process, and killing all other threads on the system sharing the same memory as the selected victim. So while this isn't a complete solution to tasklist_lock starvation, it significantly reduces the amount of time that it is held. Acked-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: David Rientjes <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Reviewed-by: Sha Zhengju <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31  mm, oom: move declaration for mem_cgroup_out_of_memory to oom.h  (David Rientjes; 2 files, -2/+2)
mem_cgroup_out_of_memory() is defined in mm/oom_kill.c, so declare it in linux/oom.h rather than linux/memcontrol.h. Acked-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: KOSAKI Motohiro <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: David Rientjes <[email protected]> Cc: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31  mm/hotplug: free zone->pageset when a zone becomes empty  (Jiang Liu; 1 file, -0/+1)
When a zone becomes empty after memory offlining, free zone->pageset. Otherwise it will cause a memory leak when adding memory to the empty zone again, because build_all_zonelists() will allocate zone->pageset for an empty zone. Signed-off-by: Jiang Liu <[email protected]> Signed-off-by: Wei Wang <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Tony Luck <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: David Rientjes <[email protected]> Cc: Keping Chen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31  mm/hotplug: correctly setup fallback zonelists when creating new pgdat  (Jiang Liu; 1 file, -1/+1)
When hotadd_new_pgdat() is called to create a new pgdat for a new node, a fallback zonelist should be created for the new node. There's code that tries to achieve that in hotadd_new_pgdat() as below:

    /*
     * The node we allocated has no zone fallback lists. For avoiding
     * to access not-initialized zonelist, build here.
     */
    mutex_lock(&zonelists_mutex);
    build_all_zonelists(pgdat, NULL);
    mutex_unlock(&zonelists_mutex);

But it doesn't work as expected. When hotadd_new_pgdat() is called, the new node is still in offline state because node_set_online(nid) hasn't been called yet. And build_all_zonelists() only builds zonelists for online nodes:

    for_each_online_node(nid) {
            pg_data_t *pgdat = NODE_DATA(nid);

            build_zonelists(pgdat);
            build_zonelist_cache(pgdat);
    }

Though we hope to create a zonelist for the new pgdat, it doesn't happen. So add a new parameter "pgdat" to build_all_zonelists() so that it builds zonelists for the new pgdat too. Signed-off-by: Jiang Liu <[email protected]> Signed-off-by: Xishi Qiu <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Tony Luck <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: David Rientjes <[email protected]> Cc: Keping Chen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31  vmscan: remove obsolete shrink_control comment  (Minchan Kim; 1 file, -1/+0)
09f363c7 ("vmscan: fix shrinker callback bug in fs/super.c") fixed a shrinker callback which was returning -1 when nr_to_scan is zero, which caused excessive slab scanning. But 635697c6 ("vmscan: fix initial shrinker size handling") fixed the problem, again so we can freely return -1 although nr_to_scan is zero. So let's revert 09f363c7 because the comment added in 09f363c7 made an unnecessary rule. Signed-off-by: Minchan Kim <[email protected]> Cc: Al Viro <[email protected]> Cc: Mikulas Patocka <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31  mm: CONFIG_HAVE_MEMBLOCK_NODE -> CONFIG_HAVE_MEMBLOCK_NODE_MAP  (Rabin Vincent; 1 file, -1/+1)
0ee332c14518699 ("memblock: Kill early_node_map[]") wanted to replace CONFIG_ARCH_POPULATES_NODE_MAP with CONFIG_HAVE_MEMBLOCK_NODE_MAP but ended up replacing one occurrence with a reference to the non-existent symbol CONFIG_HAVE_MEMBLOCK_NODE. The resulting omission of code would probably have been causing problems for 32-bit machines with memory hotplug. Signed-off-by: Rabin Vincent <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31  mm: have order > 0 compaction start off where it left  (Rik van Riel; 1 file, -0/+4)
Order > 0 compaction stops when enough free pages of the correct page order have been coalesced. When doing subsequent higher order allocations, it is possible for compaction to be invoked many times. However, the compaction code always starts out looking for things to compact at the start of the zone, and for free pages to compact things to at the end of the zone. This can cause quadratic behaviour, with isolate_freepages starting at the end of the zone each time, even though previous invocations of the compaction code already filled up all free memory on that end of the zone. This can cause isolate_freepages to take enormous amounts of CPU with certain workloads on larger memory systems. The obvious solution is to have isolate_freepages remember where it left off last time, and continue at that point the next time it gets invoked for an order > 0 compaction. This could cause compaction to fail if cc->free_pfn and cc->migrate_pfn are close together initially; in that case we restart from the end of the zone and try once more. Forced full (order == -1) compactions are left alone. [[email protected]: checkpatch fixes] [[email protected]: s/laste/last/, use 80 cols] Signed-off-by: Rik van Riel <[email protected]> Reported-by: Jim Schutt <[email protected]> Tested-by: Jim Schutt <[email protected]> Cc: Minchan Kim <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Acked-by: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
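The idea in compact_zone() can be sketched roughly like this (simplified; compact_cached_free_pfn is the zone field this patch adds, and forced order == -1 compactions skip the cache):

    static void compact_set_free_pfn(struct zone *zone,
                                     struct compact_control *cc)
    {
            unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;

            if (cc->order > 0)
                    /* resume the free scanner where the last run left off */
                    cc->free_pfn = zone->compact_cached_free_pfn;
            else
                    cc->free_pfn = end_pfn;      /* forced full compaction */

            cc->free_pfn &= ~(pageblock_nr_pages - 1);
    }

If the two scanners then meet almost immediately, the scan is restarted once from the end of the zone, as the commit message notes.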
2012-07-31  mm: remove unused LRU_ALL_EVICTABLE  (Wanpeng Li; 1 file, -1/+0)
Signed-off-by: Wanpeng Li <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31  memcg: rename config variables  (Andrew Morton; 7 files, -23/+23)
Sanity: CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM [[email protected]: fix missed bits] Cc: Glauber Costa <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: David Rientjes <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31  mm/compaction: cleanup on compaction_deferred  (Gavin Shan; 1 file, -2/+2)
When CONFIG_COMPACTION is enabled, compaction_deferred() tries to recalculate the deferred limit again, which isn't necessary. When CONFIG_COMPACTION is disabled, compaction_deferred() should return "true" or "false" since it has "bool" for its return value. Signed-off-by: Gavin Shan <[email protected]> Acked-by: Minchan Kim <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
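After the cleanup the two variants look roughly like this (sketch of the merged code):

    /* CONFIG_COMPACTION=y: compute the defer limit once */
    static inline bool compaction_deferred(struct zone *zone, int order)
    {
            unsigned long defer_limit = 1UL << zone->compact_defer_shift;

            if (order < zone->compact_order_failed)
                    return false;

            /* avoid possible overflow */
            if (++zone->compact_considered > defer_limit)
                    zone->compact_considered = defer_limit;

            return zone->compact_considered < defer_limit;
    }

    /* CONFIG_COMPACTION=n: the stub now returns a proper bool */
    static inline bool compaction_deferred(struct zone *zone, int order)
    {
            return true;
    }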
2012-07-31  hugetlb/cgroup: migrate hugetlb cgroup info from oldpage to new page during migration  (Aneesh Kumar K.V; 1 file, -0/+8)
With HugeTLB pages, the hugetlb cgroup is uncharged in the compound page destructor. Since we are holding a hugepage reference, we can be sure that the old page won't get uncharged till the last put_page(). Signed-off-by: Aneesh Kumar K.V <[email protected]> Cc: David Rientjes <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31  hugetlb/cgroup: add hugetlb cgroup control files  (Aneesh Kumar K.V; 2 files, -0/+11)
Add the control files for hugetlb controller [[email protected]: s/CONFIG_CGROUP_HUGETLB_RES_CTLR/CONFIG_MEMCG_HUGETLB/g] [[email protected]: s/CONFIG_MEMCG_HUGETLB/CONFIG_CGROUP_HUGETLB/] Signed-off-by: Aneesh Kumar K.V <[email protected]> Cc: David Rientjes <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Hillf Danton <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31  hugetlb/cgroup: add charge/uncharge routines for hugetlb cgroup  (Aneesh Kumar K.V; 1 file, -0/+38)
Add the charge and uncharge routines for hugetlb cgroup. We do cgroup charging in page alloc and uncharge in compound page destructor. Assigning page's hugetlb cgroup is protected by hugetlb_lock. [[email protected]: add huge_page_order check to avoid incorrect uncharge] Signed-off-by: Aneesh Kumar K.V <[email protected]> Cc: David Rientjes <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Aneesh Kumar K.V <[email protected]> Signed-off-by: Wanpeng Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
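A usage sketch for the allocation path (signatures as merged; alloc_huge_page_stub() is a hypothetical stand-in for the real buddy allocation, and error handling is simplified):

    static struct page *charge_and_alloc_huge_page(struct hstate *h)
    {
            struct hugetlb_cgroup *h_cg;
            struct page *page;
            int idx = hstate_index(h);
            unsigned long nr_pages = pages_per_huge_page(h);

            if (hugetlb_cgroup_charge_cgroup(idx, nr_pages, &h_cg))
                    return NULL;                 /* over the cgroup limit */

            page = alloc_huge_page_stub(h);      /* hypothetical allocator */
            if (!page) {
                    hugetlb_cgroup_uncharge_cgroup(idx, nr_pages, h_cg);
                    return NULL;
            }

            /* assign the page's hugetlb cgroup; protected by hugetlb_lock
             * in the real code */
            hugetlb_cgroup_commit_charge(idx, nr_pages, h_cg, page);
            return page;
    }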
2012-07-31  hugetlb/cgroup: add the cgroup pointer to page lru  (Aneesh Kumar K.V; 1 file, -0/+37)
Add the hugetlb cgroup pointer to the 3rd page's lru.next. This limits the usage of the hugetlb cgroup to hugepages with 3 or more normal pages. I guess that is an acceptable limitation. Signed-off-by: Aneesh Kumar K.V <[email protected]> Cc: David Rientjes <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Hillf Danton <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
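The accessor is close to the following sketch (HUGETLB_CGROUP_MIN_ORDER encodes the 3-page minimum):

    static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
    {
            VM_BUG_ON(!PageHuge(page));

            /* too small to carry the pointer in page[2] */
            if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
                    return NULL;
            return (struct hugetlb_cgroup *)page[2].lru.next;
    }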
2012-07-31  mm/hugetlb: add new HugeTLB cgroup  (Aneesh Kumar K.V; 2 files, -0/+43)
Implement a new controller that allows us to control HugeTLB allocations. The extension allows limiting the HugeTLB usage per control group and enforces the controller limit during page fault. Since HugeTLB doesn't support page reclaim, enforcing the limit at page fault time implies that the application will get a SIGBUS signal if it tries to access HugeTLB pages beyond its limit. This requires the application to know beforehand how many HugeTLB pages it would require for its use. The charge/uncharge calls will be added to HugeTLB code in a later patch. Support for cgroup removal will be added in later patches. [[email protected]: s/CONFIG_CGROUP_HUGETLB_RES_CTLR/CONFIG_MEMCG_HUGETLB/g] [[email protected]: s/CONFIG_MEMCG_HUGETLB/CONFIG_CGROUP_HUGETLB/g] Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Aneesh Kumar K.V <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hillf Danton <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31  hugetlb: make some static variables global  (Aneesh Kumar K.V; 1 file, -0/+5)
We will use them later in hugetlb_cgroup.c Signed-off-by: Aneesh Kumar K.V <[email protected]> Cc: David Rientjes <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Hillf Danton <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31  hugetlb: add a list for tracking in-use HugeTLB pages  (Aneesh Kumar K.V; 1 file, -0/+1)
hugepage_activelist will be used to track currently used HugeTLB pages. We need to find the in-use HugeTLB pages to support HugeTLB cgroup removal. On cgroup removal we update the page's HugeTLB cgroup to point to parent cgroup. Signed-off-by: Aneesh Kumar K.V <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hillf Danton <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31  hugetlb: simplify migrate_huge_page()  (Aneesh Kumar K.V; 1 file, -2/+2)
Since we migrate only one hugepage, don't use a linked list for passing the page around. Directly pass the page that needs to be migrated as an argument. This also removes the usage of page->lru in the migrate path. Signed-off-by: Aneesh Kumar K.V <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hillf Danton <[email protected]> Reviewed-by: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31  hugetlb: use mmu_gather instead of a temporary linked list for accumulating pages  (Aneesh Kumar K.V; 1 file, -5/+17)
Use an mmu_gather instead of a temporary linked list for accumulating pages when we unmap a hugepage range. Signed-off-by: Aneesh Kumar K.V <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Michal Hocko <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31  hugetlb: add an inline helper for finding hstate index  (Aneesh Kumar K.V; 1 file, -0/+6)
Add an inline helper and use it in the code. Signed-off-by: Aneesh Kumar K.V <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Hillf Danton <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
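The helper is a one-liner over the global hstates[] array (essentially as merged):

    static inline int hstate_index(struct hstate *h)
    {
            /* index of this hstate within the global hstates[] array */
            return h - hstates;
    }

The hugetlb cgroup code added later in the series uses this index to address its per-hstate resource counters.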
2012-07-31  mm: prepare for removal of obsolete /proc/sys/vm/nr_pdflush_threads  (Wanpeng Li; 2 files, -5/+3)
Since per-BDI flusher threads were introduced in 2.6, the pdflush mechanism is not used any more. But the old interface exported through /proc/sys/vm/nr_pdflush_threads still exists and is obviously useless. For backward compatibility, print a warning and return 2 to notify users that the interface is removed. Signed-off-by: Wanpeng Li <[email protected]> Cc: Wu Fengguang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31  mm: account the total_vm in the vm_stat_account()  (Huang Shijie; 1 file, -0/+1)
vm_stat_account() accounts the shared_vm, stack_vm and reserved_vm now. But we can also account for total_vm in vm_stat_account(), which makes the code tidier. Even for mprotect_fixup(), we get the right result in the end. Signed-off-by: Huang Shijie <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-07-31  Merge tag 'for-v3.6' of git://git.infradead.org/battery-2.6  (Linus Torvalds; 5 files, -2/+82)
Pull battery updates from Anton Vorontsov: "The tag contains just a few battery-related changes for v3.6. It's all pretty straightforward, except one thing. One of our patches added thermal support for the power supply class, but the thermal/ subsystem changed under our feet. We (well, Stephen, that is) caught the issue and it was decided[1] that I'd just delay the battery pull request, and then fix it up by merging upstream back into the battery tree at a specific commit. That's not all though: another[2] small fixup for the thermal subsystem was needed to get rid of a warning in the power supply subsystem (the warning was not drivers/power's "fault"; the thermal registration function just needed a proper const annotation, which is also done by a small commit on top of the merge). So, to sum this up:

 - The 'master' branch of the battery tree was in the -next tree for weeks, was never rebased, altered etc. It should be all OK;
 - Although, the for-v3.6 tag contains the 'master' branch + merge + the warning fix.

 [1] http://lkml.org/lkml/2012/6/19/23
 [2] http://lkml.org/lkml/2012/6/18/28"

* tag 'for-v3.6' of git://git.infradead.org/battery-2.6: (23 commits)
  thermal: Constify 'type' argument for the registration routine
  olpc-battery: update CHARGE_FULL_DESIGN property for BYD LiFe batteries
  olpc-battery: Add VOLTAGE_MAX_DESIGN property
  charger-manager: Fix build break related to EXTCON
  lp8727_charger: Move header file into platform_data directory
  power_supply: Add min/max alert properties for CAPACITY, TEMP, TEMP_AMBIENT
  bq27x00_battery: Add support for BQ27425 chip
  charger-manager: Set current limit of regulator for over current protection
  charger-manager: Use EXTCON Subsystem to detect charger cables for charging
  test_power: Add VOLTAGE_NOW and BATTERY_TEMP properties
  test_power: Add support for USB AC source
  gpio-charger: Use cansleep version of gpio_set_value
  bq27x00_battery: Add support for power average and health properties
  sbs-battery: Don't trigger false supply_changed event
  twl4030_charger: Allow charger to control the regulator that feeds it
  twl4030_charger: Add backup-battery charging
  twl4030_charger: Fix some typos
  max17042_battery: Support CHARGE_COUNTER power supply attribute
  smb347-charger: Add constant charge and current properties
  power_supply: Add constant charge_current and charge_voltage properties
  ...
2012-07-31  ipv4: Properly purge netdev references on uncached routes.  (David S. Miller; 1 file, -0/+3)
When a device is unregistered, we have to purge all of the references to it that may exist in the entire system. If a route is uncached, we currently have no way of accomplishing this. So create a global list that is scanned when a network device goes down. This mirrors the logic in net/core/dst.c's dst_ifdown(). Signed-off-by: David S. Miller <[email protected]>
2012-07-31  ipv4: Cache routes in nexthop exception entries.  (David S. Miller; 1 file, -0/+1)
Signed-off-by: David S. Miller <[email protected]>
2012-07-31  Merge branch 'nfsd-next' of git://linux-nfs.org/~bfields/linux  (Linus Torvalds; 4 files, -13/+34)
Pull nfsd changes from J. Bruce Fields: "This has been an unusually quiet cycle--mostly bugfixes and cleanup. The one large piece is Stanislav's work to containerize the server's grace period--but that in itself is just one more step in a not-yet-complete project to allow fully containerized nfs service. There are a number of outstanding delegation, container, v4 state, and gss patches that aren't quite ready yet; 3.7 may be wilder."

* 'nfsd-next' of git://linux-nfs.org/~bfields/linux: (35 commits)
  NFSd: make boot_time variable per network namespace
  NFSd: make grace end flag per network namespace
  Lockd: move grace period management from lockd() to per-net functions
  LockD: pass actual network namespace to grace period management functions
  LockD: manage grace list per network namespace
  SUNRPC: service request network namespace helper introduced
  NFSd: make nfsd4_manager allocated per network namespace context.
  LockD: make lockd manager allocated per network namespace
  LockD: manage grace period per network namespace
  Lockd: add more debug to host shutdown functions
  Lockd: host complaining function introduced
  LockD: manage used host count per networks namespace
  LockD: manage garbage collection timeout per networks namespace
  LockD: make garbage collector network namespace aware.
  LockD: mark host per network namespace on garbage collect
  nfsd4: fix missing fault_inject.h include
  locks: move lease-specific code out of locks_delete_lock
  locks: prevent side-effects of locks_release_private before file_lock is initialized
  NFSd: set nfsd_serv to NULL after service destruction
  NFSd: introduce nfsd_destroy() helper
  ...
2012-07-31  ipv4: percpu nh_rth_output cache  (Eric Dumazet; 1 file, -1/+2)
The input path mostly runs under RCU and doesn't touch the dst refcnt. But the output path on forwarding or UDP workloads hits the dst refcount badly, and we have a lot of false sharing, for example in ipv4_mtu() when reading rt->rt_pmtu. Using a percpu cache for nh_rth_output gives a nice performance increase at a small cost. udpflood test on my 24 cpu machine (dummy0 output device); each of 24 processes sends 1,000,000 udp frames:

    before : 5.24 s
    after  : 2.06 s

    For reference, time on linux-3.5 : 6.60 s

Signed-off-by: Eric Dumazet <[email protected]> Tested-by: Alexander Duyck <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2012-07-31  ipv4: Restore old dst_free() behavior.  (Eric Dumazet; 3 files, -10/+11)
commit 404e0a8b6a55 (net: ipv4: fix RCU races on dst refcounts) tried to solve a race but added a problem at device/fib dismantle time: we really want to call dst_free() as soon as possible, even if sockets still have dst in their cache. dst_release() calls in free_fib_info_rcu() are not welcome. The root of the problem was that now we also cache output routes (in nh_rth_output), so we must use call_rcu() instead of call_rcu_bh() in rt_free(), because output route lookups are done in process context. Based on feedback and an initial patch from David Miller (adding another call_rcu_bh() call in fib, but it appears it was not the right fix), I left the inet_sk_rx_dst_set() helper and added __rcu attributes to nh_rth_output and nh_rth_input to better document what is going on in this code. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2012-07-31  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client  (Linus Torvalds; 8 files, -60/+113)
Pull Ceph changes from Sage Weil: "Lots of stuff this time around:

 - lots of cleanup and refactoring in the libceph messenger code, and many hard to hit races and bugs closed as a result.
 - lots of cleanup and refactoring in the rbd code from Alex Elder, mostly in preparation for the layering functionality that will be coming in 3.7.
 - some misc rbd cleanups from Josh Durgin that are finally going upstream
 - support for CRUSH tunables (used by newer clusters to improve the data placement)
 - some cleanup in our use of d_parent that Al brought up a while back
 - a random collection of fixes across the tree

 There is another patch coming that fixes up our ->atomic_open() behavior, but I'm going to hammer on it a bit more before sending it."

Fix up conflicts due to commits that were already committed earlier in drivers/block/rbd.c, net/ceph/{messenger.c, osd_client.c}

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (132 commits)
  rbd: create rbd_refresh_helper()
  rbd: return obj version in __rbd_refresh_header()
  rbd: fixes in rbd_header_from_disk()
  rbd: always pass ops array to rbd_req_sync_op()
  rbd: pass null version pointer in add_snap()
  rbd: make rbd_create_rw_ops() return a pointer
  rbd: have __rbd_add_snap_dev() return a pointer
  libceph: recheck con state after allocating incoming message
  libceph: change ceph_con_in_msg_alloc convention to be less weird
  libceph: avoid dropping con mutex before fault
  libceph: verify state after retaking con lock after dispatch
  libceph: revoke mon_client messages on session restart
  libceph: fix handling of immediate socket connect failure
  ceph: update MAINTAINERS file
  libceph: be less chatty about stray replies
  libceph: clear all flags on con_close
  libceph: clean up con flags
  libceph: replace connection state bits with states
  libceph: drop unnecessary CLOSED check in socket state change callback
  libceph: close socket directly from ceph_con_close()
  ...