path: root/mm
Age | Commit message | Author | Files | Lines
2014-04-03 | mm: thrash detection-based file cache sizing | Johannes Weiner | 6 files | -23/+321
The VM maintains cached filesystem pages on two types of lists. One list holds the pages recently faulted into the cache, the other list holds pages that have been referenced repeatedly on that first list. The idea is to prefer reclaiming young pages over those that have been shown to benefit from caching in the past. We call the recently used list the "inactive list" and the frequently used list the "active list". Earlier attempts at balancing these lists ultimately were not significantly better than a FIFO policy and still thrashed the cache based on eviction speed, rather than actual demand for cache. This patch solves one half of the problem by decoupling the ability to detect working set changes from the inactive list size. By maintaining a history of recently evicted file pages it can detect frequently used pages with an arbitrarily small inactive list size, and subsequently apply pressure on the active list based on actual demand for cache, not just overall eviction speed. Every zone maintains a counter that tracks inactive list aging speed. When a page is evicted, a snapshot of this counter is stored in the now-empty page cache radix tree slot. On refault, the minimum access distance of the page can be assessed, to evaluate whether the page should be part of the active list or not. This fixes the VM's blindness towards working set changes in excess of the inactive list. And it's the foundation to further improve the protection ability and reduce the minimum inactive list size below the current 50%. Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Reviewed-by: Bob Liu <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jan Kara <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Luigi Semenzato <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Metin Doslu <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Ozgun Erdogan <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Ryan Mallon <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
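A minimal sketch of the shadow-entry idea described in the changelog above (the encoding, the helper names pack_shadow()/unpack_shadow()/refault_should_activate() and the activation rule shown here are illustrative, not the kernel's exact implementation):

    /* Pack the zone's inactive-aging counter into a pointer-sized slot
     * value; bit 0 is a tag distinguishing shadows from real pages. */
    #define SHADOW_ENTRY_FLAG 0x1UL

    static inline void *pack_shadow(unsigned long eviction)
    {
            return (void *)((eviction << 1) | SHADOW_ENTRY_FLAG);
    }

    static inline unsigned long unpack_shadow(void *shadow)
    {
            return (unsigned long)shadow >> 1;
    }

    /* On refault, a small eviction-to-refault distance means the page
     * would have stayed resident with a slightly larger inactive list,
     * so it deserves to go straight to the active list. */
    static inline int refault_should_activate(unsigned long refault_counter,
                                              void *shadow,
                                              unsigned long inactive_pages)
    {
            unsigned long distance = refault_counter - unpack_shadow(shadow);

            return distance <= inactive_pages;
    }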
2014-04-03 | mm + fs: store shadow entries in page cache | Johannes Weiner | 3 files | -9/+80
Reclaim will be leaving shadow entries in the page cache radix tree upon evicting the real page. As those pages are found from the LRU, an iput() can lead to the inode being freed concurrently. At this point, reclaim must no longer install shadow pages because the inode freeing code needs to ensure the page tree is really empty. Add an address_space flag, AS_EXITING, that the inode freeing code sets under the tree lock before doing the final truncate. Reclaim will check for this flag before installing shadow pages. Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Bob Liu <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jan Kara <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Luigi Semenzato <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Metin Doslu <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Ozgun Erdogan <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Ryan Mallon <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
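A hedged sketch of the ordering the changelog describes; AS_EXITING, mapping_set_exiting() and mapping_exiting() come from the patch, while the surrounding functions are illustrative:

    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <linux/pagemap.h>

    /* Inode-freeing side: flag the mapping under the tree lock, then do
     * the final truncate, which can now rely on the tree staying empty. */
    static void inode_teardown_sketch(struct address_space *mapping)
    {
            spin_lock_irq(&mapping->tree_lock);
            mapping_set_exiting(mapping);           /* sets AS_EXITING */
            spin_unlock_irq(&mapping->tree_lock);

            truncate_inode_pages(mapping, 0);
    }

    /* Reclaim side: never leave shadow entries behind on a dying inode. */
    static bool may_install_shadow_sketch(struct address_space *mapping)
    {
            return !mapping_exiting(mapping);
    }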
2014-04-03 | mm + fs: prepare for non-page entries in page cache radix trees | Johannes Weiner | 6 files | -123/+325
shmem mappings already contain exceptional entries where swap slot information is remembered. To be able to store eviction information for regular page cache, prepare every site dealing with the radix trees directly to handle entries other than pages. The common lookup functions will filter out non-page entries and return NULL for page cache holes, just as before. But provide a raw version of the API which returns non-page entries as well, and switch shmem over to use it. Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Bob Liu <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jan Kara <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Luigi Semenzato <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Metin Doslu <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Ozgun Erdogan <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Ryan Mallon <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
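A hedged sketch of the filtering rule: the ordinary lookup keeps returning NULL for holes and for non-page (exceptional) entries, while the raw lookups the patch adds also return shmem swap cookies and, later, shadow entries. Locking and page refcounting are omitted, and the wrapper name is illustrative:

    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <linux/radix-tree.h>

    static struct page *lookup_page_only_sketch(struct address_space *mapping,
                                                pgoff_t index)
    {
            void *entry = radix_tree_lookup(&mapping->page_tree, index);

            if (!entry || radix_tree_exceptional_entry(entry))
                    return NULL;    /* hole or non-page entry */
            return entry;           /* a real struct page */
    }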
2014-04-03 | mm: filemap: move radix tree hole searching here | Johannes Weiner | 2 files | -2/+78
The radix tree hole searching code is only used for page cache, for example the readahead code trying to get a picture of the area surrounding a fault. It sufficed to rely on the radix tree definition of holes, which is "empty tree slot". But this is about to change, as shadow page descriptors will be stored in the page cache after the actual pages get evicted from memory. Move the functions over to mm/filemap.c and make them native page cache operations, where they can later be adapted to handle the new definition of "page cache hole". Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Bob Liu <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jan Kara <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Luigi Semenzato <[email protected]> Cc: Metin Doslu <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Ozgun Erdogan <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Ryan Mallon <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | mm: shmem: save one radix tree lookup when truncating swapped pages | Johannes Weiner | 1 file | -13/+12
Page cache radix tree slots are usually stabilized by the page lock, but shmem's swap cookies have no such thing. Because the overall truncation loop is lockless, the swap entry is currently confirmed by a tree lookup and then deleted by another tree lookup under the same tree lock region. Use radix_tree_delete_item() instead, which does the verification and deletion with only one lookup. This also allows removing the delete-only special case from shmem_radix_tree_replace(). Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Bob Liu <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jan Kara <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Luigi Semenzato <[email protected]> Cc: Metin Doslu <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Ozgun Erdogan <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Ryan Mallon <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
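A hedged sketch of the single-lookup pattern this enables; the wrapper name and its use are illustrative, only radix_tree_delete_item() is from the series:

    #include <linux/radix-tree.h>

    /* Delete the slot only if it still holds the expected swap cookie.
     * radix_tree_delete_item() returns the entry it deleted, so comparing
     * against 'expected' confirms the deletion actually happened. */
    static bool delete_if_matches_sketch(struct radix_tree_root *root,
                                         unsigned long index, void *expected)
    {
            return radix_tree_delete_item(root, index, expected) == expected;
    }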
2014-04-03 | mm: vmscan: shrink_slab: rename max_pass -> freeable | Vladimir Davydov | 1 file | -13/+13
The name `max_pass' is misleading, because this variable actually keeps the estimated number of freeable objects, not the maximal number of objects we can scan in this pass, which can be twice that. Rename it to reflect its actual meaning. Signed-off-by: Vladimir Davydov <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | mm, hugetlb: improve page-fault scalability | Davidlohr Bueso | 1 file | -13/+71
The kernel can currently only handle a single hugetlb page fault at a time. This is due to a single mutex that serializes the entire path. This lock protects from spurious OOM errors under conditions of low availability of free hugepages. This problem is specific to hugepages, because it is normal to want to use every single hugepage in the system - with normal pages we simply assume there will always be a few spare pages which can be used temporarily until the race is resolved. Address this problem by using a table of mutexes, allowing a better chance of parallelization, where each hugepage is individually serialized. The hash key is selected depending on the mapping type. For shared ones it consists of the address space and file offset being faulted, while for private ones the mm and virtual address are used. The size of the table is selected based on a compromise between collisions and the memory footprint of a series of database workloads. Large database workloads that make heavy use of hugepages can be particularly exposed to this issue, causing start-up times to be painfully slow. This patch reduces the startup time of a 10 Gb Oracle DB (with ~5000 faults) from 37.5 secs to 25.7 secs. Larger workloads will naturally benefit even more. NOTE: The only downside to this patch, detected by Joonsoo Kim, is that a small race is possible in private mappings: A child process (with its own mm, after cow) can instantiate a page that is already being handled by the parent in a cow fault. When low on pages, this can trigger spurious OOMs. I have not been able to think of an efficient way of handling this... but do we really care about such a tiny window? We already maintain another theoretical race with normal pages. If not, one possible way is to maintain the single hash for private mappings -- any workloads that *really* suffer from this scaling problem should already use shared mappings. [[email protected]: remove stray + characters, go BUG if hugetlb_init() kmalloc fails] Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: David Gibson <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
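A hedged sketch of the hashed serialization described above; the table size (256), the function name and the exact key layout are assumptions for illustration, not the mainline code:

    #include <linux/fs.h>
    #include <linux/jhash.h>
    #include <linux/mm.h>
    #include <linux/mutex.h>

    #define NUM_FAULT_MUTEXES 256   /* assumed power-of-two table size */
    static struct mutex fault_mutex_table[NUM_FAULT_MUTEXES];

    static u32 fault_mutex_hash_sketch(struct address_space *mapping, pgoff_t idx,
                                       struct mm_struct *mm, unsigned long address,
                                       bool shared)
    {
            unsigned long key[2];

            if (shared) {
                    /* Shared mappings: key on <address_space, file offset>. */
                    key[0] = (unsigned long)mapping;
                    key[1] = idx;
            } else {
                    /* Private mappings: key on <mm, virtual address>. */
                    key[0] = (unsigned long)mm;
                    key[1] = address >> PAGE_SHIFT; /* simplified; the real key would use the huge page shift */
            }
            return jhash2((u32 *)key, sizeof(key) / sizeof(u32), 0) &
                   (NUM_FAULT_MUTEXES - 1);
    }

A fault then takes mutex_lock(&fault_mutex_table[hash]) instead of one global mutex, so faults on different hugepages can proceed in parallel while faults on the same page are still serialized.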
2014-04-03 | mm, hugetlb: use vma_resv_map() map types | Joonsoo Kim | 1 file | -50/+45
Until now, we have obtained a resv_map in two different ways, depending on the mapping type. This makes the code dirty and unreadable. Unify it. [[email protected]: code cleanups] Signed-off-by: Joonsoo Kim <[email protected]> Signed-off-by: Davidlohr Bueso <[email protected]> Reviewed-by: Aneesh Kumar K.V <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: David Gibson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | mm, hugetlb: remove resv_map_put | Joonsoo Kim | 1 file | -12/+3
This is a preparation patch to unify the use of vma_resv_map() regardless of the map type. This patch prepares it by removing resv_map_put(), which only works for HPAGE_RESV_OWNER's resv_map, not for all resv_maps. [[email protected]: update changelog] Signed-off-by: Joonsoo Kim <[email protected]> Signed-off-by: Davidlohr Bueso <[email protected]> Reviewed-by: Aneesh Kumar K.V <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: David Gibson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | mm, hugetlb: fix race in region tracking | Davidlohr Bueso | 1 file | -20/+38
There is a race condition if we map the same file in different processes. Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex. When we do mmap, we don't grab the hugetlb_instantiation_mutex, only mmap_sem (exclusively). This doesn't prevent other tasks from modifying the region structure, so it can be modified by two processes concurrently. To solve this, introduce a spinlock to resv_map and make the region manipulation functions grab it before they do the actual work. [[email protected]: updated changelog] Signed-off-by: Davidlohr Bueso <[email protected]> Signed-off-by: Joonsoo Kim <[email protected]> Suggested-by: Joonsoo Kim <[email protected]> Acked-by: David Gibson <[email protected]> Cc: David Gibson <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
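A hedged sketch of the structure after the change; the field names follow the changelog, the struct and function names are illustrative:

    #include <linux/kref.h>
    #include <linux/list.h>
    #include <linux/spinlock.h>

    struct resv_map_sketch {
            struct kref refs;
            spinlock_t lock;                /* protects 'regions' */
            struct list_head regions;
    };

    static void region_manipulate_sketch(struct resv_map_sketch *resv)
    {
            spin_lock(&resv->lock);
            /* ... walk and modify resv->regions ... */
            spin_unlock(&resv->lock);
    }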
2014-04-03 | mm, hugetlb: improve, cleanup resv_map parameters | Joonsoo Kim | 1 file | -13/+17
To change the protection method for region tracking to a fine-grained one, we pass the resv_map, instead of the list_head, to the region manipulation functions. This doesn't introduce any functional change; it just prepares for the next step. [[email protected]: update changelog] Signed-off-by: Joonsoo Kim <[email protected]> Signed-off-by: Davidlohr Bueso <[email protected]> Reviewed-by: Aneesh Kumar K.V <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: David Gibson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | mm, hugetlb: unify region structure handling | Joonsoo Kim | 1 file | -16/+21
Currently, to track reserved and allocated regions, we use two different ways depending on the mapping: for MAP_SHARED we use the address_mapping's private_list, while for MAP_PRIVATE we use a resv_map. Now we are preparing to change the coarse-grained lock which protects the region structure to a fine-grained one, and this difference hinders that. So, before changing it, unify region structure handling by consistently using a resv_map regardless of the kind of mapping. Signed-off-by: Joonsoo Kim <[email protected]> Signed-off-by: Davidlohr Bueso <[email protected]> Reviewed-by: Aneesh Kumar K.V <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: David Gibson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | mm: optimize put_mems_allowed() usage | Mel Gorman | 6 files | -25/+23
Since put_mems_allowed() is strictly optional (it's a seqcount retry), we don't need to evaluate the function if the allocation was in fact successful, saving an smp_rmb(), some loads and comparisons on some relatively fast paths. Since the naming get/put_mems_allowed() does suggest a mandatory pairing, rename the interface, as suggested by Mel, to resemble the seqcount interface. This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(), where it is important to note that the return value of the latter call is inverted from its previous incarnation. Signed-off-by: Peter Zijlstra <[email protected]> Signed-off-by: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
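A hedged sketch of the seqcount-style retry pattern the new names make explicit; the wrapper function is illustrative, read_mems_allowed_begin()/read_mems_allowed_retry() are the renamed interface:

    #include <linux/cpuset.h>
    #include <linux/gfp.h>

    static struct page *alloc_with_mems_retry_sketch(gfp_t gfp, unsigned int order)
    {
            struct page *page;
            unsigned int cpuset_mems_cookie;

    retry:
            cpuset_mems_cookie = read_mems_allowed_begin();
            page = alloc_pages(gfp, order);
            /* Only bother with the retry check if the allocation failed;
             * the retry function now returns true when a retry is needed. */
            if (!page && read_mems_allowed_retry(cpuset_mems_cookie))
                    goto retry;
            return page;
    }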
2014-04-03 | mm, compaction: ignore pageblock skip when manually invoking compaction | David Rientjes | 1 file | -0/+1
The cached pageblock hint should be ignored when triggering compaction through /proc/sys/vm/compact_memory so that all eligible memory is isolated. Manually invoking compaction is known to be expensive and is mainly done for debugging, so there's no need to skip pageblocks based on heuristics. Signed-off-by: David Rientjes <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | mm: vmscan: remove shrink_control arg from do_try_to_free_pages() | Vladimir Davydov | 1 file | -20/+12
There is no need to pass a shrink_control struct from try_to_free_pages() and friends down to do_try_to_free_pages() and then to shrink_zones(), because it is only used in shrink_zones() and the only field initialized at the top level is gfp_mask, which is always equal to scan_control.gfp_mask. So let's move the shrink_control initialization into shrink_zones(). Signed-off-by: Vladimir Davydov <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Glauber Costa <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | mm: vmscan: move call to shrink_slab() to shrink_zones() | Vladimir Davydov | 1 file | -31/+25
This reduces the indentation level of do_try_to_free_pages() and removes an extra loop over all eligible zones that counted the number of on-LRU pages. Signed-off-by: Vladimir Davydov <[email protected]> Reviewed-by: Glauber Costa <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Dave Chinner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | mm: vmscan: respect NUMA policy mask when shrinking slab on direct reclaim | Vladimir Davydov | 1 file | -2/+2
When direct reclaim is executed by a process bound to a set of NUMA nodes, we should scan only those nodes when possible, but currently we will scan kmem from all online nodes even if the kmem shrinker is NUMA aware. In other words, binding a process to a particular NUMA node won't prevent it from shrinking inode/dentry caches from other nodes, which is not good. Fix this. Signed-off-by: Vladimir Davydov <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Glauber Costa <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | kmemleak: change some global variables to int | Li Zefan | 1 file | -40/+40
They don't have to be atomic_t, because they are simple boolean toggles. Signed-off-by: Li Zefan <[email protected]> Acked-by: Catalin Marinas <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | kmemleak: remove redundant code | Li Zefan | 1 file | -6/+1
Remove kmemleak_padding() and kmemleak_release(). Signed-off-by: Li Zefan <[email protected]> Acked-by: Catalin Marinas <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | kmemleak: allow freeing internal objects after kmemleak was disabled | Li Zefan | 1 file | -14/+32
Currently, if kmemleak is disabled the kmemleak objects can never be freed, no matter whether it was disabled by the user or due to fatal errors. Those objects can be a big waste of memory:
OBJS    ACTIVE  USE  OBJ SIZE  SLABS  OBJ/SLAB  CACHE SIZE  NAME
1200264 1197433  99%     0.30K  46164        26     369312K  kmemleak_object
With this patch, after kmemleak has been disabled you can reclaim that memory with:
# echo clear > /sys/kernel/debug/kmemleak
Also inform users about this with a printk. Signed-off-by: Li Zefan <[email protected]> Acked-by: Catalin Marinas <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | kmemleak: free internal objects only if there're no leaks to be reported | Li Zefan | 1 file | -4/+9
Currently if you stop kmemleak thread before disabling kmemleak, kmemleak objects will be freed and so you won't be able to check previously reported leaks. With this patch, kmemleak objects won't be freed if there're leaks that can be reported. Signed-off-by: Li Zefan <[email protected]> Acked-by: Catalin Marinas <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | bdi: avoid oops on device removal | Jan Kara | 1 file | -4/+9
After commit 839a8e8660b6 ("writeback: replace custom worker pool implementation with unbound workqueue"), when a device is removed while we are writing to it, we crash in bdi_writeback_workfn() -> set_worker_desc() because bdi->dev is NULL. This can happen because even though bdi_unregister() cancels all pending flushing work, nothing really prevents new ones from being queued from balance_dirty_pages() or other places. Fix the problem by clearing the BDI_registered bit in bdi_unregister() and checking it before scheduling any flushing work. Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977 Reviewed-by: Tejun Heo <[email protected]> Signed-off-by: Jan Kara <[email protected]> Cc: Derek Basehore <[email protected]> Cc: Jens Axboe <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | backing_dev: fix hung task on sync | Derek Basehore | 1 file | -1/+4
bdi_wakeup_thread_delayed() used the mod_delayed_work() function to schedule work to write back dirty inodes. The problem with this is that it can delay work that is scheduled for immediate execution, such as the work from sync_inodes_sb(). This can happen since mod_delayed_work() can now steal work from a workqueue. This is a regression caused by commit 839a8e8660b6 ("writeback: replace custom worker pool implementation with unbound workqueue"). The reason this causes a problem is that laptop mode changes the delay, dirty_writeback_centisecs, to 60000 (10 minutes) by default. In the case that bdi_wakeup_thread_delayed() races with sync_inodes_sb(), sync will be stopped for 10 minutes and trigger a hung task. Even if dirty_writeback_centisecs is not long enough to cause a hung task, we still don't want to delay sync for that long. Fix the problem by using queue_delayed_work() when we want to schedule writeback sometime in the future. This function doesn't change the timer if it is already armed. For the same reason, also change bdi_writeback_workfn() to immediately queue the work again in the case that the work_list is not empty. The same problem can happen if the sync work is run on the rescue worker. [[email protected]: update changelog, add comment, use bdi_wakeup_thread_delayed()] Signed-off-by: Derek Basehore <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Alexander Viro <[email protected]> Reviewed-by: Tejun Heo <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "Darrick J. Wong" <[email protected]> Cc: Derek Basehore <[email protected]> Cc: Kees Cook <[email protected]> Cc: Benson Leung <[email protected]> Cc: Sonny Rao <[email protected]> Cc: Luigi Semenzato <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Dave Chinner <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
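A hedged sketch of the behavioural difference the fix relies on; the function and parameter names here are illustrative:

    #include <linux/workqueue.h>

    static void schedule_writeback_sketch(struct workqueue_struct *wq,
                                          struct delayed_work *dwork,
                                          unsigned long timeout)
    {
            /*
             * mod_delayed_work(wq, dwork, timeout) would re-arm the timer
             * even if dwork is already queued for immediate execution
             * (e.g. by sync), pushing it out by 'timeout' jiffies.
             *
             * queue_delayed_work() is a no-op when the work is already
             * pending, so previously scheduled immediate work keeps its
             * place in the queue.
             */
            queue_delayed_work(wq, dwork, timeout);
    }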
2014-04-03 | Merge branch 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup | Linus Torvalds | 3 files | -86/+43
Pull cgroup updates from Tejun Heo: "A lot of updates for cgroup: - The biggest one is cgroup's conversion to kernfs. cgroup took after the long abandoned vfs-entangled sysfs implementation and made it even more convoluted over time. cgroup's internal objects were fused with vfs objects which also brought in vfs locking and object lifetime rules. Naturally, there are places where vfs rules don't fit and nasty hacks, such as credential switching or lock dance interleaving inode mutex and cgroup_mutex with object serial number comparison thrown in to decide whether the operation is actually necessary, needed to be employed. After conversion to kernfs, internal object lifetime and locking rules are mostly isolated from vfs interactions allowing shedding of several nasty hacks and overall simplification. This will also allow implementation of operations which may affect multiple cgroups which weren't possible before as it would have required nesting i_mutexes. - Various simplifications including dropping of module support, easier cgroup name/path handling, simplified cgroup file type handling and task_cg_lists optimization. - Preparatory changes for the planned unified hierarchy, which is still a patchset away from being actually operational. The dummy hierarchy is updated to serve as the default unified hierarchy. Controllers which aren't claimed by other hierarchies are associated with it, which BTW was what the dummy hierarchy was for anyway. - Various fixes from Li and others. This pull request includes some patches to add missing slab.h to various subsystems. This was triggered by the xattr.h include removal from cgroup.h. cgroup.h indirectly got included by a lot of files which brought in xattr.h which brought in slab.h. There are several merge commits - one to pull in kernfs updates necessary for converting cgroup (already in upstream through driver-core), others for interfering changes in the fixes branch" * 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits) cgroup: remove useless argument from cgroup_exit() cgroup: fix spurious lockdep warning in cgroup_exit() cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c cgroup: break kernfs active_ref protection in cgroup directory operations cgroup: fix cgroup_taskset walking order cgroup: implement CFTYPE_ONLY_ON_DFL cgroup: make cgrp_dfl_root mountable cgroup: drop const from @buffer of cftype->write_string() cgroup: rename cgroup_dummy_root and related names cgroup: move ->subsys_mask from cgroupfs_root to cgroup cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}() cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root cgroup: reorganize cgroup bootstrapping cgroup: relocate setting of CGRP_DEAD cpuset: use rcu_read_lock() to protect task_cs() cgroup_freezer: document freezer_fork() subtleties cgroup: update cgroup_transfer_tasks() to either succeed or fail cgroup: drop task_lock() protection around task->cgroups cgroup: update how a newly forked task gets associated with css_set ...
2014-04-02 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial | Linus Torvalds | 1 file | -1/+1
Pull trivial tree updates from Jiri Kosina: "Usual rocket science -- mostly documentation and comment updates" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: sparse: fix comment doc: fix double words isdn: capi: fix "CAPI_VERSION" comment doc: DocBook: Fix typos in xml and template file Bluetooth: add module name for btwilink driver core: unexport static function create_syslog_header mmc: core: typo fix in printk specifier ARM: spear: clean up editing mistake net-sysfs: fix comment typo 'CONFIG_SYFS' doc: Insert MODULE_ in module-signing macros Documentation: update URL to hfsplus Technote 1150 gpio: update path to documentation ixgbe: Fix format string in ixgbe_fcoe. Kconfig: Remove useless "default N" lines user_namespace.c: Remove duplicated word in comment CREDITS: fix formatting treewide: Fix typo in Documentation/DocBook mm: Fix warning on make htmldocs caused by slab.c ata: ata-samsung_cf: cleanup in header file idr: remove unused prototype of idr_free()
2014-04-02 | Merge branch 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 file | -4/+16
Pull x86 vdso changes from Peter Anvin: "This is the revamp of the 32-bit vdso and the associated cleanups. This adds timekeeping support to the 32-bit vdso that we already have in the 64-bit vdso. Although 32-bit x86 is legacy, it is likely to remain in the embedded space for a very long time to come. This removes the traditional COMPAT_VDSO support; the configuration variable is reused for simply removing the 32-bit vdso, which will produce correct results but obviously suffer a performance penalty. Only one beta version of glibc was affected, but that version was unfortunately included in one OpenSUSE release. This is not the end of the vdso cleanups. Stefani and Andy have agreed to continue work for the next kernel cycle; in fact Andy has already produced another set of cleanups that came too late for this cycle. An incidental, but arguably important, change is that this ensures that unused space in the VVAR page is properly zeroed. It wasn't before, and would contain whatever garbage was left in memory by BIOS or the bootloader. Since the VVAR page is accessible to user space this had the potential of information leaks" * 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) x86, vdso: Fix the symbol versions on the 32-bit vDSO x86, vdso, build: Don't rebuild 32-bit vdsos on every make x86, vdso: Actually discard the .discard sections x86, vdso: Fix size of get_unmapped_area() x86, vdso: Finish removing VDSO32_PRELINK x86, vdso: Move more vdso definitions into vdso.h x86: Load the 32-bit vdso in place, just like the 64-bit vdsos x86, vdso32: handle 32 bit vDSO larger one page x86, vdso32: Disable stack protector, adjust optimizations x86, vdso: Zero-pad the VVAR page x86, vdso: Add 32 bit VDSO time support for 64 bit kernel x86, vdso: Add 32 bit VDSO time support for 32 bit kernel x86, vdso: Patch alternatives in the 32-bit VDSO x86, vdso: Introduce VVAR marco for vdso32 x86, vdso: Cleanup __vdso_gettimeofday() x86, vdso: Replace VVAR(vsyscall_gtod_data) by gtod macro x86, vdso: __vdso_clock_gettime() cleanup x86, vdso: Revamp vclock_gettime.c mm: Add new func _install_special_mapping() to mmap.c x86, vdso: Make vsyscall_gtod_data handling x86 generic ...
2014-04-02 | sparse: fix comment | Li Zhong | 1 file | -1/+1
retmain -> remain Signed-off-by: Li Zhong <[email protected]> Signed-off-by: Jiri Kosina <[email protected]>
2014-04-01 | kill generic_file_buffered_write() | Al Viro | 1 file | -22/+1
Signed-off-by: Al Viro <[email protected]>
2014-04-01 | export generic_perform_write(), start getting rid of generic_file_buffer_write() | Al Viro | 1 file | -16/+21
Signed-off-by: Al Viro <[email protected]>
2014-04-01 | generic_file_direct_write(): get rid of ppos argument | Al Viro | 1 file | -3/+3
always equal to &iocb->ki_pos. Signed-off-by: Al Viro <[email protected]>
2014-04-01 | kill the 5th argument of generic_file_buffered_write() | Al Viro | 1 file | -5/+4
same story - it's &iocb->ki_pos in all cases Signed-off-by: Al Viro <[email protected]>
2014-04-01 | kill the 4th argument of __generic_file_aio_write() | Al Viro | 1 file | -7/+6
It's always equal to &iocb->ki_pos, where iocb is the value of the 1st argument. Signed-off-by: Al Viro <[email protected]>
2014-04-01 | take iov_iter stuff to mm/iov_iter.c | Al Viro | 3 files | -222/+225
Signed-off-by: Al Viro <[email protected]>
2014-04-01 | process_vm_access: tidy up a bit | Al Viro | 1 file | -40/+19
saner variable names, update linuxdoc comments, etc. Signed-off-by: Al Viro <[email protected]>
2014-04-01 | process_vm_access: don't bother with returning the amounts of bytes copied | Al Viro | 1 file | -31/+16
we can calculate that in the caller just fine, TYVM Signed-off-by: Al Viro <[email protected]>
2014-04-01 | process_vm_rw_pages(): pass accurate amount of bytes | Al Viro | 1 file | -8/+14
... makes passing the amount of pages unnecessary Signed-off-by: Al Viro <[email protected]>
2014-04-01 | process_vm_access: take get_user_pages/put_pages one level up | Al Viro | 1 file | -58/+39
... and trim the fuck out of process_vm_rw_pages() argument list. Signed-off-by: Al Viro <[email protected]>
2014-04-01 | process_vm_access: switch to copy_page_to_iter/iov_iter_copy_from_user | Al Viro | 1 file | -68/+23
... rather than open-coding those. As a side benefit, we get much saner loop calling those; we can just feed entire pages, instead of the "copy would span the iovec boundary, let's do it in two loop iterations" mess. Signed-off-by: Al Viro <[email protected]>
2014-04-01 | process_vm_access: switch to iov_iter | Al Viro | 1 file | -34/+28
instead of keeping its pieces in separate variables and passing pointers to all of them... Signed-off-by: Al Viro <[email protected]>
2014-04-01 | untangling process_vm_..., part 4 | Al Viro | 1 file | -16/+13
instead of passing vector size (by value) and index (by reference), pass the number of elements remaining. That's all we care about in these functions by that point. Signed-off-by: Al Viro <[email protected]>
2014-04-01 | untangling process_vm_..., part 3 | Al Viro | 1 file | -4/+3
lift iov one more level out - from process_vm_rw_single_vec to process_vm_rw_core(). Same story as with the previous commit. Signed-off-by: Al Viro <[email protected]>
2014-04-01 | untangling process_vm_..., part 2 | Al Viro | 1 file | -3/+5
move iov to caller's stack frame; the value we assign to it on the next call of process_vm_rw_pages() is equal to the value it had when the last time we were leaving process_vm_rw_pages(). drop lvec argument of process_vm_rw_pages() - it's not used anymore. Signed-off-by: Al Viro <[email protected]>
2014-04-01 | untangling process_vm_..., part 1 | Al Viro | 1 file | -5/+9
we want to massage it to use of iov_iter. This one is an equivalent transformation - just introduce a local variable mirroring lvec + *lvec_current. Signed-off-by: Al Viro <[email protected]>
2014-04-01 | introduce copy_page_to_iter, kill loop over iovec in generic_file_aio_read() | Al Viro | 2 files | -143/+136
generic_file_aio_read() was looping over the target iovec, with loop over (source) pages nested inside that. Just set an iov_iter up and pass *that* to do_generic_file_aio_read(). With copy_page_to_iter() doing all work of mapping and copying a page to iovec and advancing iov_iter. Switch shmem_file_aio_read() to the same and kill file_read_actor(), while we are at it. Signed-off-by: Al Viro <[email protected]>
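A hedged sketch of the new shape of the copy loop: each page is fed whole to copy_page_to_iter(), which maps, copies and advances the iterator, so no inner loop over iovec segments is needed. The wrapper function is illustrative:

    #include <linux/errno.h>
    #include <linux/kernel.h>
    #include <linux/mm.h>
    #include <linux/uio.h>

    static ssize_t copy_pages_to_iter_sketch(struct page **pages,
                                             unsigned long nr_pages,
                                             size_t offset, struct iov_iter *iter)
    {
            ssize_t copied = 0;
            unsigned long n;

            for (n = 0; n < nr_pages && iov_iter_count(iter); n++) {
                    size_t bytes = min_t(size_t, PAGE_SIZE - offset,
                                         iov_iter_count(iter));
                    size_t done = copy_page_to_iter(pages[n], offset, bytes, iter);

                    copied += done;
                    if (done < bytes)               /* short copy: bad user buffer */
                            return copied ? copied : -EFAULT;
                    offset = 0;                     /* only the first page is partial */
            }
            return copied;
    }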
2014-04-01 | do_shmem_file_read(): call file_read_actor() directly | Al Viro | 1 file | -3/+3
Signed-off-by: Al Viro <[email protected]>
2014-04-01 | callers of iov_copy_from_user_atomic() don't need pagecache_disable() | Al Viro | 1 file | -3/+0
... it does that itself (via kmap_atomic()) Signed-off-by: Al Viro <[email protected]>
2014-04-01 | switch ->is_partially_uptodate() to saner arguments | Al Viro | 1 file | -1/+1
Signed-off-by: Al Viro <[email protected]>
2014-04-01 | mm/slab.c: cleanup outdated comments and unify variables naming | Jianyu Zhan | 1 file | -34/+32
As time goes by, the code changes a lot, and this has left old-days comments scattered around which, instead of facilitating understanding, create more confusion. So this patch cleans them up. Also, this patch unifies some variable naming. Acked-by: Christoph Lameter <[email protected]> Signed-off-by: Jianyu Zhan <[email protected]> Signed-off-by: Pekka Enberg <[email protected]>
2014-03-31 | Merge branch 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu | Linus Torvalds | 1 file | -96/+112
Pull percpu changes from Tejun Heo: "The percpu allocation is now popular enough for the extremely naive range allocator to cause scalability issues. The existing allocator linearly scanned the allocation map on both alloc and free without making use of hints or anything. Al reimplemented the range allocator so that it can use binary search instead of linear scan during free, and the alloc path uses simple hinting to avoid scanning in common cases. Combined, the new allocator resolves the scalability issue percpu allocator was showing during container benchmark workload" * 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: renew the max_contig if we merge the head and previous block percpu: allocation size should be even percpu: speed alloc_pcpu_area() up percpu: store offsets instead of lengths in ->map[] perpcu: fold pcpu_split_block() into the only caller
2014-03-31 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux | Linus Torvalds | 2 files | -0/+13
Pull s390 updates from Martin Schwidefsky: "There are two memory management related changes, the CMMA support for KVM to avoid swap-in of freed pages and the split page table lock for the PMD level. These two come with common code changes in mm/. A fix for the long-standing theoretical TLB flush problem, this one comes with a common code change in kernel/sched/. Another set of changes is Heiko's uaccess work, included is the initial set of patches with more to come. And fixes and cleanups as usual" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (36 commits) s390/con3270: optionally disable auto update s390/mm: remove unecessary parameter from pgste_ipte_notify s390/mm: remove unnecessary parameter from gmap_do_ipte_notify s390/mm: fixing comment so that parameter name match s390/smp: limit number of cpus in possible cpu mask hypfs: Add clarification for "weight_min" attribute s390: update defconfigs s390/ptrace: add support for PTRACE_SINGLEBLOCK s390/perf: make print_debug_cf() static s390/topology: Remove call to update_cpu_masks() s390/compat: remove compat exec domain s390: select CONFIG_TTY for use of tty in unconditional keyboard driver s390/appldata_os: fix cpu array size calculation s390/checksum: remove memset() within csum_partial_copy_from_user() s390/uaccess: remove copy_from_user_real() s390/sclp_early: Return correct HSA block count also for zero s390: add some drivers/subsystems to the MAINTAINERS file s390: improve debug feature usage s390/airq: add support for irq ranges s390/mm: enable split page table lock for PMD level ...