2014-12-10  mm, compaction: pass classzone_idx and alloc_flags to watermark checking  (Vlastimil Babka; 5 files, -29/+42)
Compaction relies on zone watermark checks for decisions such as whether it's worth starting compaction in compaction_suitable() or whether compaction should stop in compact_finished(). The watermark checks take classzone_idx and alloc_flags parameters, which are related to the memory allocation request. But from the context of compaction they are currently passed as 0, including in direct compaction, which is invoked to satisfy the allocation request and could therefore know the proper values. The lack of proper values can lead to a mismatch between decisions taken during compaction and decisions related to the allocation request. A missing classzone_idx value means that lowmem_reserve is not taken into account. This has manifested (during recent changes to deferred compaction) when the DMA zone was used as fallback for the preferred Normal zone: compaction_suitable() without the proper classzone_idx would think that the watermarks are already satisfied, but the watermark check in get_page_from_freelist() would fail. Because of this problem, deferring compaction has extra complexity that can be removed in the following patch. The issue (not confirmed in practice) with missing alloc_flags is opposite in nature. For allocations that include ALLOC_HIGH, ALLOC_HARDER or ALLOC_CMA in alloc_flags (the last includes all MOVABLE allocations on CMA-enabled systems), the watermark checking in compaction with 0 passed will be stricter than in get_page_from_freelist(). In these cases compaction might run for longer than is really needed. Another issue in compaction_suitable() is that the check "does the zone need compaction at all?" comes only after the check "does the zone have enough free pages to succeed compaction?". The latter considers extra pages for migration and can therefore in some situations fail and return COMPACT_SKIPPED, although the high-order allocation would succeed and we should return COMPACT_PARTIAL. This patch fixes these problems by adding alloc_flags and classzone_idx to struct compact_control and the related functions involved in direct compaction and watermark checking. Where possible, all other callers of compaction_suitable() pass proper values where those are known; this is currently limited to classzone_idx, which is sometimes known in kswapd context. However, the direct reclaim callers should_continue_reclaim() and compaction_ready() do not currently know the proper values, so the coordination between reclaim and compaction may still not be as accurate as it could be. This can be fixed later, if it's shown to be an issue. Additionally, the checks in compaction_suitable() are reordered to address the second issue described above. The effect of this patch should be slightly better high-order allocation success rates and/or less compaction overhead, depending on the type of allocations and presence of CMA. It allows simplifying the deferred compaction code in a followup patch. When testing with stress-highalloc, there was some slight improvement (which might be just due to variance) in success rates of non-THP-like allocations. Signed-off-by: Vlastimil Babka <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Christoph Lameter <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
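For illustration, a minimal sketch of the reordered checks with the new parameters (a hypothetical simplification, not the exact mm/compaction.c code):

    static unsigned long suitable_sketch(struct zone *zone, int order,
                                         int alloc_flags, int classzone_idx)
    {
            unsigned long watermark = low_wmark_pages(zone);

            /* First: would the allocation already succeed as-is? */
            if (zone_watermark_ok(zone, order, watermark,
                                  classzone_idx, alloc_flags))
                    return COMPACT_PARTIAL;

            /* Then: is there enough free memory for compaction itself? */
            watermark += 2UL << order;      /* extra pages for migration */
            if (!zone_watermark_ok(zone, 0, watermark,
                                   classzone_idx, alloc_flags))
                    return COMPACT_SKIPPED;

            return COMPACT_CONTINUE;
    }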
2014-12-10  mm: vmscan: count only dirty pages as congested  (Jamie Liu; 1 file, -1/+2)
shrink_page_list() counts all pages with a mapping, including clean pages, toward nr_congested if they're on a write-congested BDI. shrink_inactive_list() then sets ZONE_CONGESTED if nr_dirty == nr_congested. Fix this apples-to-oranges comparison by only counting pages for nr_congested if they count for nr_dirty. Signed-off-by: Jamie Liu <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
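The shape of the fix in shrink_page_list() (a hedged sketch; the real mm/vmscan.c code handles more cases around these lines):

    if (PageDirty(page)) {
            nr_dirty++;
            /*
             * Count the page as congested only if it also counts as
             * dirty, so nr_congested stays comparable to nr_dirty.
             */
            if (mapping && bdi_write_congested(mapping->backing_dev_info))
                    nr_congested++;
    }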
2014-12-10  mm: verify compound order when freeing a page  (Yu Zhao; 1 file, -0/+3)
This allows us to catch the bug fixed in the previous patch (mm: free compound page with correct order). Here we also verify whether a page is a tail page or not -- tail pages are supposed to be freed along with their head page, not by themselves. Signed-off-by: Yu Zhao <[email protected]> Reviewed-by: "Kirill A. Shutemov" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Mel Gorman <[email protected]> Cc: David Rientjes <[email protected]> Cc: Bob Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
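The added checks look roughly like this (a hedged sketch of the free-path debug checks; see free_pages_prepare() in mm/page_alloc.c for the real code):

    /* Tail pages must be freed together with their head page. */
    VM_BUG_ON_PAGE(PageTail(page), page);
    /* A compound head must be freed with its stored compound order. */
    VM_BUG_ON_PAGE(PageHead(page) && compound_order(page) != order, page);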
2014-12-10  cma: make default CMA area size zero for x86  (Akinobu Mita; 1 file, -1/+7)
This makes the CMA memory area size zero for x86 in the default configuration (the other architectures are unchanged). If the default CMA size is zero, DMA_CMA is disabled; it can still be enabled by passing cma= to the kernel. This reduces the impact on x86: there is no mainline driver that requires CMA on x86, and Peter Hurley reported a performance regression, as the old default was trying to drive _all_ dma mapping allocations through a _very_ small window. Signed-off-by: Akinobu Mita <[email protected]> Reported-by: Peter Hurley <[email protected]> Cc: Peter Hurley <[email protected]> Cc: Chuck Ebbert <[email protected]> Cc: Jean Delvare <[email protected]> Acked-by: Marek Szyprowski <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: David Woodhouse <[email protected]> Cc: Don Dutile <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  mm, memory_hotplug/failure: drain single zone pcplists  (Vlastimil Babka; 2 files, -4/+4)
Memory hotplug and failure mechanisms have several places where pcplists are drained so that pages are returned to the buddy allocator and can be e.g. prepared for offlining. Since this is always done in the context of a single zone, we can now reduce the pcplists drain to that single zone. The change should make memory offlining due to hotremove or failure faster, without disturbing unrelated pcplists anymore. Signed-off-by: Vlastimil Babka <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Zhang Yanfei <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Marek Szyprowski <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  mm, cma: drain single zone pcplists  (Vlastimil Babka; 1 file, -1/+1)
CMA allocation drains pcplists so that pages can merge back into the buddy allocator. Since it operates on a single zone, we can now reduce the pcplists drain to that single zone. The change should make CMA allocations faster, without disturbing unrelated pcplists anymore. Signed-off-by: Vlastimil Babka <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Zhang Yanfei <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Marek Szyprowski <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  mm, page_isolation: drain single zone pcplists  (Vlastimil Babka; 1 file, -1/+1)
When setting MIGRATETYPE_ISOLATE on a pageblock, pcplists are drained to have a better chance that all pages will be successfully isolated and not left in the per-cpu caches. Since isolation is always concerned with a single zone, we can now reduce the pcplists drain to that single zone. The change should make memory isolation faster, without disturbing unrelated pcplists anymore. Signed-off-by: Vlastimil Babka <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Zhang Yanfei <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Marek Szyprowski <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  mm: introduce single zone pcplists drain  (Vlastimil Babka; 5 files, -32/+63)
The functions for draining per-cpu pages back to the buddy allocator currently always operate on all zones. There are however several cases where the drain is only needed in the context of a single zone, and spilling other pcplists is a waste of time, both due to the extra spilling and the later refilling. This patch introduces a new zone pointer parameter to drain_all_pages() and changes the dummy parameter of drain_local_pages() to also be a zone pointer. When NULL is passed, the functions operate on all zones as usual; passing a specific zone pointer reduces the work to that single zone. All callers are updated to pass the NULL pointer in this patch; conversion to a single zone (where appropriate) is done in further patches. Signed-off-by: Vlastimil Babka <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Zhang Yanfei <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Marek Szyprowski <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
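The resulting interface and usage, roughly (the signatures follow the patch description above; the call sites are illustrative):

    void drain_all_pages(struct zone *zone);    /* all CPUs */
    void drain_local_pages(struct zone *zone);  /* this CPU only */

    drain_all_pages(NULL);  /* old behaviour: drain every zone */
    drain_all_pages(zone);  /* new: drain only this zone's pcplists */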
2014-12-10  mm/vmscan.c: replace printk with pr_err  (Pintu Kumar; 1 file, -2/+1)
This patch replaces printk(KERN_ERR ...) with pr_err() in shrink_slab(). It also saves one line that was only needed for formatting. Signed-off-by: Pintu Kumar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  mm/vmalloc.c: replace printk with pr_warn  (Pintu Kumar; 1 file, -2/+1)
This patch replaces printk(KERN_WARNING ...) with pr_warn() in mm/vmalloc.c. It also saves one line that was only needed for formatting. Signed-off-by: Pintu Kumar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
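The conversion in both patches follows the usual pattern (illustrative message, not the exact one in the source):

    /* before */
    printk(KERN_WARNING
           "vmap allocation for size %lu failed\n", size);

    /* after */
    pr_warn("vmap allocation for size %lu failed\n", size);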
2014-12-10  mm/page_alloc.c: convert boot printks without log level to pr_info  (Anton Blanchard; 1 file, -11/+11)
Signed-off-by: Anton Blanchard <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  mm: memcontrol: remove synchronous stock draining code  (Johannes Weiner; 1 file, -40/+6)
With charge reparenting gone, the last user of synchronous stock draining is gone as well. Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  mm: memcontrol: continue cache reclaim from offlined groups  (Johannes Weiner; 1 file, -217/+1)
On cgroup deletion, outstanding page cache charges are moved to the parent group so that they're not lost and can be reclaimed during pressure on/inside said parent. But this reparenting is fairly tricky and its synchronous nature has led to several lock-ups in the past. Since c2931b70a32c ("cgroup: iterate cgroup_subsys_states directly") css iterators now also include offlined css, so memcg iterators can be changed to include offlined children during reclaim of a group, and leftover cache can just stay put. There is a slight change of behavior in that charges of deleted groups no longer show up as local charges in the parent; they are still included in the parent's hierarchical statistics. Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  mm: memcontrol: remove obsolete kmemcg pinning tricks  (Johannes Weiner; 3 files, -94/+7)
As charges now pin the css explicitly, there is no more need for kmemcg to acquire a proxy reference for outstanding pages during offlining, or to maintain state to identify such "dead" groups. This was the last user of the uncharge functions' return values, so remove them as well. Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  mm: memcontrol: take a css reference for each charged page  (Johannes Weiner; 3 files, -13/+81)
Charges currently pin the css indirectly by playing tricks during css_offline(): user pages stall the offlining process until all of them have been reparented, whereas kmemcg acquires a keep-alive reference if outstanding kernel pages are detected at that point. In preparation for removing all this complexity, make the pinning explicit and acquire a css reference for every charged page. Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
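Conceptually, the explicit pinning looks like this (a hedged sketch; the real charge/uncharge paths in mm/memcontrol.c batch the reference counting):

    /* charge path: one css reference per charged page */
    css_get(&memcg->css);

    /* uncharge path: reference dropped when the page is uncharged */
    css_put(&memcg->css);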
2014-12-10  mm: memcontrol: convert reclaim iterator to simple css refcounting  (Johannes Weiner; 1 file, -174/+84)
The memcg reclaim iterators use a complicated weak reference scheme to prevent pinning cgroups indefinitely in the absence of memory pressure. However, during the ongoing cgroup core rework, css lifetime has been decoupled such that a pinned css no longer interferes with removal of the user-visible cgroup, and all this complexity is now unnecessary. [[email protected]: ensure that the cached reference is always released] Signed-off-by: Johannes Weiner <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: David Rientjes <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  kernel: res_counter: remove the unused API  (Johannes Weiner; 6 files, -647/+8)
All memory accounting and limiting has been switched over to the lockless page counters. Bye, res_counter! [[email protected]: update Documentation/cgroups/memory.txt] [[email protected]: ditch the last remnants of res_counter] Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Tejun Heo <[email protected]> Cc: David Rientjes <[email protected]> Cc: Paul Bolle <[email protected]> Signed-off-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  mm: hugetlb_cgroup: convert to lockless page counters  (Johannes Weiner; 4 files, -48/+61)
Abandon the spinlock-protected byte counters in favor of the unlocked page counters in the hugetlb controller as well. Signed-off-by: Johannes Weiner <[email protected]> Reviewed-by: Vladimir Davydov <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Tejun Heo <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  mm: memcontrol: lockless page counters  (Johannes Weiner; 9 files, -404/+615)
Memory is internally accounted in bytes, using spinlock-protected 64-bit counters, even though the smallest accounting delta is a page. The counter interface is also convoluted and does too many things. Introduce a new lockless word-sized page counter API, then change all memory accounting over to it. The translation from and to bytes then only happens when interfacing with userspace. The removed locking overhead is noticeable when scaling beyond the per-cpu charge caches - on a 4-socket machine with 144 threads, the following test shows the performance differences of 288 memcgs concurrently running a page fault benchmark:

    vanilla:
      18631648.500498      task-clock (msec)   # 140.643 CPUs utilized   ( +- 0.33% )
             1,380,638     context-switches    # 0.074 K/sec             ( +- 0.75% )
                24,390     cpu-migrations      # 0.001 K/sec             ( +- 8.44% )
         1,843,305,768     page-faults         # 0.099 M/sec             ( +- 0.00% )
    50,134,994,088,218     cycles              # 2.691 GHz               ( +- 0.33% )
       <not supported>     stalled-cycles-frontend
       <not supported>     stalled-cycles-backend
     8,049,712,224,651     instructions        # 0.16 insns per cycle    ( +- 0.04% )
     1,586,970,584,979     branches            # 85.176 M/sec            ( +- 0.05% )
         1,724,989,949     branch-misses       # 0.11% of all branches   ( +- 0.48% )
         132.474343877     seconds time elapsed                          ( +- 0.21% )

    lockless:
      12195979.037525      task-clock (msec)   # 133.480 CPUs utilized   ( +- 0.18% )
               832,850     context-switches    # 0.068 K/sec             ( +- 0.54% )
                15,624     cpu-migrations      # 0.001 K/sec             ( +- 10.17% )
         1,843,304,774     page-faults         # 0.151 M/sec             ( +- 0.00% )
    32,811,216,801,141     cycles              # 2.690 GHz               ( +- 0.18% )
       <not supported>     stalled-cycles-frontend
       <not supported>     stalled-cycles-backend
     9,999,265,091,727     instructions        # 0.30 insns per cycle    ( +- 0.10% )
     2,076,759,325,203     branches            # 170.282 M/sec           ( +- 0.12% )
         1,656,917,214     branch-misses       # 0.08% of all branches   ( +- 0.55% )
          91.369330729     seconds time elapsed                          ( +- 0.45% )

On top of improved scalability, this also gets rid of the icky long long types in the very heart of memcg, which is great for 32 bit and also makes the code a lot more readable. Notable differences between the old and new API:

    - res_counter_charge() and res_counter_charge_nofail() become page_counter_try_charge() and page_counter_charge() resp. to match the more common kernel naming scheme of try_do()/do()
    - res_counter_uncharge_until() is only ever used to cancel a local counter and never to uncharge bigger segments of a hierarchy, so it's replaced by the simpler page_counter_cancel()
    - res_counter_set_limit() is replaced by page_counter_limit(), which expects its callers to serialize against themselves
    - res_counter_memparse_write_strategy() is replaced by page_counter_memparse(), which rounds down to the nearest page size - rather than up. This is more reasonable for explicitly requested hard upper limits.
    - to keep charging light-weight, page_counter_try_charge() charges speculatively, only to roll back if the result exceeds the limit. Because of this, a failing bigger charge can temporarily lock out smaller charges that would otherwise succeed. The error is bounded to the difference between the smallest and the biggest possible charge size, so for memcg, this means that a failing THP charge can send base page charges into reclaim up to 2MB (4MB) before the limit would have been reached. This should be acceptable.
[[email protected]: add includes for WARN_ON_ONCE and memparse] [[email protected]: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE] Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Cc: Tejun Heo <[email protected]> Cc: David Rientjes <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
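A hedged sketch of the speculative-charge idea behind page_counter_try_charge() (simplified to a single level; the real code in mm/page_counter.c also walks the hierarchy and tracks watermarks and the failing counter):

    struct page_counter {
            atomic_long_t count;
            unsigned long limit;
            struct page_counter *parent;
    };

    static bool try_charge_sketch(struct page_counter *c, unsigned long nr_pages)
    {
            long new = atomic_long_add_return(nr_pages, &c->count);

            if (new > c->limit) {
                    /* speculative charge exceeded the limit: roll back */
                    atomic_long_sub(nr_pages, &c->count);
                    return false;
            }
            return true;
    }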
2014-12-10  slab: replace smp_read_barrier_depends() with lockless_dereference()  (Pranith Kumar; 1 file, -3/+3)
Recently lockless_dereference() was added, which can be used in place of hard-coding smp_read_barrier_depends(). This patch makes that change. Signed-off-by: Pranith Kumar <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
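The conversion pattern (illustrative; the patch applies it to dependency-ordered pointer reads in the slab code):

    /* before: open-coded dependency barrier after the load */
    p = q;
    smp_read_barrier_depends();

    /* after: lockless_dereference() includes the barrier */
    p = lockless_dereference(q);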
2014-12-10  slab: improve checking for invalid gfp_flags  (Andrew Morton; 2 files, -2/+8)
The code goes BUG, but doesn't tell us which bits were unexpectedly set. Print that out. Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
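The improved check has roughly this shape (a sketch close to the patch; GFP_SLAB_BUG_MASK holds the gfp bits that must never reach the slab allocator):

    if (unlikely(flags & GFP_SLAB_BUG_MASK)) {
            pr_emerg("gfp: %u\n", flags & GFP_SLAB_BUG_MASK);
            BUG();
    }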
2014-12-10  mm: slub: fix format mismatches in slab_err() callers  (Andrey Ryabinin; 1 file, -3/+3)
Adding __printf(3, 4) to slab_err() exposed the following:

    mm/slub.c: In function `check_slab':
    mm/slub.c:852:4: warning: format `%u' expects argument of type `unsigned int', but argument 4 has type `const char *' [-Wformat=]
        s->name, page->objects, maxobj);
        ^
    mm/slub.c:852:4: warning: too many arguments for format [-Wformat-extra-args]
    mm/slub.c:857:4: warning: format `%u' expects argument of type `unsigned int', but argument 4 has type `const char *' [-Wformat=]
        s->name, page->inuse, page->objects);
        ^
    mm/slub.c:857:4: warning: too many arguments for format [-Wformat-extra-args]
    mm/slub.c: In function `on_freelist':
    mm/slub.c:905:4: warning: format `%d' expects argument of type `int', but argument 5 has type `long unsigned int' [-Wformat=]
        "should be %d", page->objects, max_objects);

Fix the first two warnings by removing the redundant s->name. Fix the last by changing the type of max_objects from unsigned long to int. Signed-off-by: Andrey Ryabinin <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  mm/slab: reverse iteration on find_mergeable()  (Joonsoo Kim; 1 file, -1/+1)
Unlike SLUB, in SLAB an object sometimes doesn't start at the beginning of the slab. This caused a misalignment problem once slab merging was supported by commit 12220dea07f1 ("mm/slab: support slab merge"), so an alignment mismatch check was introduced ("mm/slab: fix unalignment problem on Malta with EVA due to slab merge") to prevent merging in that case. This leaves an undesirable result: merging can happen with infrequently used kmem_caches. For example, new kmem_caches of size 256 bytes get merged into pool_workqueue rather than kmalloc-256, because the kmem_caches for kmalloc are at the tail of the list. To prevent this situation, this patch reverses the iteration order in find_mergeable() so that frequently used kmem_caches, such as the kmalloc kmem_caches, are found first. Signed-off-by: Joonsoo Kim <[email protected]> Acked-by: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
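The change itself is a one-line switch of the list walk (sketch; surrounding find_mergeable() code in mm/slab_common.c elided):

    /* before: start from the head, where recently created caches sit */
    list_for_each_entry(s, &slab_caches, list) { ... }

    /* after: start from the tail, finding the kmalloc caches first */
    list_for_each_entry_reverse(s, &slab_caches, list) { ... }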
2014-12-10  slab: print slabinfo header in seq show  (Vladimir Davydov; 3 files, -16/+8)
Currently we print the slabinfo header in the seq start method, which makes it unusable for showing leaks, so we have leaks_show, which does practically the same as s_show except it doesn't show the header. However, we can print the header in the seq show method - we only need to check if the current element is the first on the list. This will allow us to use the same set of seq iterators for both leaks and slabinfo reporting, which is nice. Signed-off-by: Vladimir Davydov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
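The seq show method only needs a first-element check (a sketch consistent with the description; helper names may differ from mm/slab_common.c):

    static int slab_show(struct seq_file *m, void *p)
    {
            struct kmem_cache *s = list_entry(p, struct kmem_cache, list);

            if (p == slab_caches.next)      /* first element: print header */
                    print_slabinfo_header(m);
            cache_show(s, m);
            return 0;
    }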
2014-12-10  mm: slab/slub: coding style: whitespaces and tabs mixture  (LQYMGT; 2 files, -10/+10)
Some code in mm/slab.c and mm/slub.c uses whitespace instead of tabs for indentation. Clean it up. Signed-off-by: LQYMGT <[email protected]> Acked-by: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  fs/char_dev.c: remove pointless assignment from __register_chrdev_region()  (Jan Kara; 1 file, -1/+0)
At one place we assign the major number we found to ret. That assignment is then never used, and it actually doesn't make any sense given how the code is currently structured (the assignment comes from pre-git times). Just remove it. Coverity id: 1226852. Signed-off-by: Jan Kara <[email protected]> Cc: Alexander Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  ocfs2: remove unneeded NULL check  (Dan Carpenter; 1 file, -1/+1)
In commit 1faf289454b9 ("ocfs2_dlm: disallow a domain join if node maps mismatch") we introduced an earlier NULL check, so this one is not needed. Static checkers also complain because we dereference the pointer first and only then check it for NULL. Signed-off-by: Dan Carpenter <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  ocfs2: remove bogus NULL check in ocfs2_move_extents()  (Dan Carpenter; 1 file, -3/+0)
"inode" isn't NULL here, and we also dereference it on the previous line, so static checkers get annoyed. Signed-off-by: Dan Carpenter <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  ocfs2: do not set filesystem readonly if link down  (jiangyiwen; 2 files, -2/+2)
Do not set the filesystem read-only if the storage link is down. In this case, the metadata is not corrupted and only -EIO is returned. If the metadata is indeed corrupted, ocfs2_error() has already been called in ocfs2_validate_inode_block(). Signed-off-by: Yiwen Jiang <[email protected]> Cc: Joel Becker <[email protected]> Cc: Mark Fasheh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  ocfs2: do not set OCFS2_LOCK_UPCONVERT_FINISHING if nonblocking lock can not be granted at once  (Xue jiufei; 2 files, -6/+37)
ocfs2_readpages() uses a nonblocking flag to avoid page lock inversion. This can trigger a cluster hang, because the flag OCFS2_LOCK_UPCONVERT_FINISHING is not cleared if the nonblocking lock cannot be granted at once. The flag would prevent the dc thread from downconverting, so other nodes could never acquire this lockres. Therefore, we should not set OCFS2_LOCK_UPCONVERT_FINISHING when receiving the AST if the nonblocking lock request has already returned. Signed-off-by: joyce.xue <[email protected]> Reviewed-by: Junxiao Bi <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  ocfs2: fix error handling when creating debugfs root in ocfs2_init()  (Jan Kara; 1 file, -1/+2)
Error handling is broken if creation of the debugfs root fails in ocfs2_init(): although the error code is set, we fail to exit ocfs2_init() with an error, and initialization thus ends with success. Later, when mounting a filesystem, ocfs2 debugfs entries end up being created in the root of the debugfs filesystem, which is confusing. Fix the error handling to bail out. Coverity id: 1227009. Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Joseph Qi <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  ocfs2: remove filesize checks for sync I/O journal commit  (Goldwyn Rodrigues; 1 file, -3/+1)
Filesize is not a good indication that the file needs to be synced. An example where this breaks:

    1. Open the file with O_SYNC|O_RDWR
    2. Read a small portion of the file (say 64 bytes)
    3. Lseek to the start of the file
    4. Write 64 bytes

If the node crashes, the write is not on disk because it was not committed in the journal, and the other node which reads the file after recovery reads stale data (even if the write on the other node was successful). Signed-off-by: Goldwyn Rodrigues <[email protected]> Reviewed-by: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  ocfs2: o2net: fix connect expired  (Junxiao Bi; 1 file, -1/+1)
Setting nn_persistent_error to -ENOTCONN stops reconnection, since the "stop" condition in o2net_start_connect() becomes true:

    stop = (nn->nn_sc ||
            (nn->nn_persistent_error &&
             (nn->nn_persistent_error != -ENOTCONN || timeout == 0)));

This means the connection will never be established if the first connection request is lost. Set nn_persistent_error to 0 when the connect expires to fix this. With this change, dlm will not be woken up when the connect expires. This is OK: dlm depends on the network and can do nothing in this case even if woken up, so let it wait for the network to recover and the connection to be built again. Signed-off-by: Junxiao Bi <[email protected]> Reviewed-by: Srinivas Eeda <[email protected]> Cc: Joel Becker <[email protected]> Cc: Mark Fasheh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  ocfs2: o2dlm: fix a race between purge and master query  (Srinivas Eeda; 1 file, -0/+12)
Node A sends a master query request to node B, which is the master. At this time the lockres happens to be on the purgelist. dlm_master_request_handler takes the dlm spinlock, finds the resource, and releases the dlm spinlock. Right at this moment, dlm_thread on node B could purge the lockres. dlm_master_request_handler can then acquire the lockres spinlock and reply to node A that node B is the master, even though the lockres on node B has been purged. The above scenario makes node A falsely think node B is the master, which is inconsistent. Further, if another node C tries to master the same resource, every node will respond that it is not the master. Node C then masters the resource and sends assert master to all nodes, which makes node A crash with the following message:

    dlm_assert_master_handler:1831 ERROR: DIE! Mastery assert from 9, but current owner is 10!

Signed-off-by: Srinivas Eeda <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Reviewed-by: Wengang Wang <[email protected]> Tested-by: Joseph Qi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  ocfs2: report error from o2hb_do_disk_heartbeat() to user  (Jan Kara; 1 file, -2/+2)
Report the return value of o2hb_do_disk_heartbeat() as part of the ML_HEARTBEAT message so that we know whether a heartbeat actually happened or not. This also puts the assigned-but-otherwise-unused 'ret' variable to use. Coverity id: 1227053. Signed-off-by: Jan Kara <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  ocfs2: remove bogus test from ocfs2_read_locked_inode()  (Jan Kara; 1 file, -2/+1)
'args' is always set for ocfs2_read_locked_inode(), and brelse() checks whether bh is NULL. So the test (args && bh) is unnecessary (plus the args part is really confusing anyway). Remove it. Coverity id: 1128856. Signed-off-by: Jan Kara <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  ocfs2: Fix xattr check in ocfs2_get_xattr_nolock()  (Jan Kara; 1 file, -1/+1)
ocfs2_get_xattr_nolock() checks whether the inode has any extended attributes (OCFS2_HAS_XATTR_FL). If not, it just sets 'ret' to -ENODATA but continues checking inline and external attributes anyway (which is pointless, although it does no harm). Just return immediately when we know there are no extended attributes in the inode. Coverity id: 1226906. Signed-off-by: Jan Kara <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  ocfs2: fix an off-by-one BUG_ON() statement  (Dan Carpenter; 1 file, -1/+1)
The ->si_slots[] array is allocated in ocfs2_init_slot_info(); it has ->max_slots elements, so this test should be >= instead of >. Static checker work; compile tested only. Signed-off-by: Dan Carpenter <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
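The fix in essence (illustrative; the variable names are taken from the message above, not from the actual diff):

    /* before: allows slot_num == si->max_slots, one past the end */
    BUG_ON(slot_num > si->max_slots);

    /* after */
    BUG_ON(slot_num >= si->max_slots);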
2014-12-10  ocfs2/dlm: let sender retry if dlm_dispatch_assert_master failed with -ENOMEM  (Joseph Qi; 1 file, -5/+13)
Do not BUG() if GFP_ATOMIC allocation fails in dlm_dispatch_assert_master. Instead, return -ENOMEM to the sender and then retry. Signed-off-by: Joseph Qi <[email protected]> Reviewed-by: Alex Chen <[email protected]> Cc: Joel Becker <[email protected]> Cc: Mark Fasheh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  sh: off by one BUG_ON() in setup_bootmem_node()  (Dan Carpenter; 1 file, -1/+1)
This off-by-one bug is harmless, but it upsets the static checkers, and the code is obvious so it doesn't hurt to fix it. The Smatch warning is:

    arch/sh/mm/numa.c:47 setup_bootmem_node() error: buffer overflow 'node_data' 1024 <= 1024

Signed-off-by: Dan Carpenter <[email protected]> Cc: Paul Mundt <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  scripts/kernel-doc: don't eat struct members with __aligned  (Johannes Berg; 1 file, -1/+1)
The change from \d+ to .+ inside __aligned() means that the following structure:

    struct test {
            u8 a __aligned(2);
            u8 b __aligned(2);
    };

essentially gets modified to

    struct test {
            u8 a;
    };

for the purposes of kernel-doc, thus dropping a struct member, which in turn causes warnings and invalid kernel-doc generation. Fix this by replacing the catch-all (".") with anything that's not a semicolon ("[^;]"). Fixes: 9dc30918b23f ("scripts/kernel-doc: handle struct member __aligned without numbers") Signed-off-by: Johannes Berg <[email protected]> Cc: Nishanth Menon <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Michal Marek <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  dma-debug: prevent early callers from crashing  (Florian Fainelli; 1 file, -2/+10)
dma_debug_init() is called by architecture specific code at different levels, but typically as an fs_initcall due to the debugfs initialization. Some platforms may have early callers of the DMA-API, running prior to the fs_initcall() level, which is not much of an issue unless CONFIG_DMA_API_DEBUG is set. When the DMA-API debugging facilities are turned on, a caller will go through:

    debug_dma_map_{single,page}
      -> dma_mapping_error (usually an inline function)
        -> debug_dma_mapping_error
          -> get_hash_bucket

Calling get_hash_bucket() returns a valid hash value since we hash on the high bits of the dma_addr cookie, but we will grab an uninitialized spinlock, which typically won't crash but will produce a warning; the real crash will however happen during the bucket list traversal, because the list has not been initialized yet. An obvious solution is of course to move some of the offenders to run after the fs_initcall level, but since this might not always be an option, we add a flag "dma_debug_initialized", which is set to false by default and set to true once dma_debug_init() has had a chance to run. The dma_debug_disabled() helper function previously introduced just needs to also check dma_debug_initialized to allow the caller to proceed or not. Signed-off-by: Florian Fainelli <[email protected]> Cc: Dan Williams <[email protected]> Cc: Jiri Kosina <[email protected]> Cc: Horia Geanta <[email protected]> Cc: Brian Norris <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
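With both patches applied, the helper ends up roughly like this (a sketch consistent with the two descriptions; see lib/dma-debug.c):

    static bool dma_debug_initialized;  /* set to true by dma_debug_init() */

    static inline bool dma_debug_disabled(void)
    {
            /* treat debugging as disabled until dma_debug_init() has run */
            return global_disable || !dma_debug_initialized;
    }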
2014-12-10  dma-debug: introduce dma_debug_disabled  (Florian Fainelli; 1 file, -16/+21)
Add a helper function which returns whether the DMA debugging API is disabled, right now we only check for global_disable, but in order to accommodate early callers of the DMA-API, we will check for more initialization flags in the next patch. Signed-off-by: Florian Fainelli <[email protected]> Cc: Dan Williams <[email protected]> Cc: Jiri Kosina <[email protected]> Cc: Horia Geanta <[email protected]> Cc: Brian Norris <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  fs/cifs/smb2file.c: replace count*size kzalloc by kcalloc  (Fabian Frederick; 1 file, -2/+2)
kcalloc checks the count*size multiplication for overflow. Signed-off-by: Fabian Frederick <[email protected]> Cc: Steve French <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  fs/cifs/file.c: replace count*size kzalloc by kcalloc  (Fabian Frederick; 1 file, -2/+2)
kcalloc checks the count*size multiplication for overflow. Signed-off-by: Fabian Frederick <[email protected]> Cc: Steve French <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
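The pattern of the conversion in both cifs patches (illustrative names, not the exact call sites):

    /* before: open-coded multiplication can overflow */
    pages = kzalloc(num_pages * sizeof(struct page *), GFP_KERNEL);

    /* after: kcalloc checks the multiplication and zeroes the memory */
    pages = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);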
2014-12-10  fs/cifs: remove obsolete __constant  (Fabian Frederick; 7 files, -47/+47)
Replace all __constant_foo with foo(), except in smb2status.h (1700 lines to update). Signed-off-by: Fabian Frederick <[email protected]> Cc: Steve French <[email protected]> Cc: Jeff Layton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-12-10  mm/CMA: fix boot regression due to physical address of high_memory  (Joonsoo Kim; 1 file, -1/+13)
high_memory isn't direct-mapped memory, so retrieving its physical address isn't appropriate. But it would be useful to check the physical address of the highmem boundary, so it's justifiable to get the physical address from it. On x86, there is a validation check when CONFIG_DEBUG_VIRTUAL is enabled, and it triggers the following boot failure reported by Ingo:

    ...
    BUG: Int 6: CR2 00f06f53
    ...
    Call Trace:
     dump_stack+0x41/0x52
     early_idt_handler+0x6b/0x6b
     cma_declare_contiguous+0x33/0x212
     dma_contiguous_reserve_area+0x31/0x4e
     dma_contiguous_reserve+0x11d/0x125
     setup_arch+0x7b5/0xb63
     start_kernel+0xb8/0x3e6
     i386_start_kernel+0x79/0x7d

To fix the boot regression, this patch implements a workaround that avoids the validation check on x86 when retrieving the physical address of high_memory. __pa_nodebug(), used by this patch, is implemented only on x86, so there is no choice but to use a dirty #ifdef. [[email protected]: tweak comment] Signed-off-by: Joonsoo Kim <[email protected]> Reported-by: Ingo Molnar <[email protected]> Tested-by: Ingo Molnar <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Russell King <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
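The workaround in cma_declare_contiguous() looks roughly like this (a sketch of the approach described above):

    /*
     * high_memory isn't direct mapped memory, so retrieving its physical
     * address isn't appropriate.  But it is useful to check the physical
     * address of the highmem boundary, and on x86 the DEBUG_VIRTUAL
     * validation check must be bypassed to do that.
     */
    #ifdef CONFIG_X86
        highmem_start = __pa_nodebug(high_memory);
    #else
        highmem_start = __pa(high_memory);
    #endif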
2014-12-10  Merge branch 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds; 3 files, -5/+67)
Pull more 2038 timer work from Thomas Gleixner: "Two more patches for the ongoing 2038 work:

   - New accessors to clock MONOTONIC and REALTIME seconds

  This is a separate branch as Arnd has follow-up work depending on this"

* 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Provide y2038 safe accessor to the seconds portion of CLOCK_REALTIME
  timekeeping: Provide fast accessor to the seconds part of CLOCK_MONOTONIC
2014-12-10  Merge branch 'x86-mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds; 35 files, -47/+1591)
Pull x86 MPX support from Thomas Gleixner: "This enables support for x86 MPX. MPX is a new debug feature for bound checking in user space. It requires kernel support to handle the bound tables and decode the bound violating instruction in the trap handler"

* 'x86-mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  asm-generic: Remove asm-generic arch_bprm_mm_init()
  mm: Make arch_unmap()/bprm_mm_init() available to all architectures
  x86: Cleanly separate use of asm-generic/mm_hooks.h
  x86 mpx: Change return type of get_reg_offset()
  fs: Do not include mpx.h in exec.c
  x86, mpx: Add documentation on Intel MPX
  x86, mpx: Cleanup unused bound tables
  x86, mpx: On-demand kernel allocation of bounds tables
  x86, mpx: Decode MPX instruction to get bound violation information
  x86, mpx: Add MPX-specific mmap interface
  x86, mpx: Introduce VM_MPX to indicate that a VMA is MPX specific
  x86, mpx: Add MPX to disabled features
  ia64: Sync struct siginfo with general version
  mips: Sync struct siginfo with general version
  mpx: Extend siginfo structure to include bound violation information
  x86, mpx: Rename cfg_reg_u and status_reg
  x86: mpx: Give bndX registers actual names
  x86: Remove arbitrary instruction size limit in instruction decoder
2014-12-10  Merge branch 'irq-irqdomain-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds; 53 files, -351/+1924)
Pull irq domain updates from Thomas Gleixner: "The real interesting irq updates:

   - Support for hierarchical irq domains:

     For complex interrupt routing scenarios where more than one interrupt related chip is involved we had no proper representation in the generic interrupt infrastructure so far. That made people implement rather ugly constructs in their nested irq chip implementations. The main offenders are x86 and arm/gic. To disentangle that mess we now have hierarchical irqdomains, which separate the various interrupt chips and connect them via the hierarchical domains. That keeps the domain specific details internal to the particular hierarchy level and removes the criss/cross referencing of chip internals. The resulting hierarchy for a complex x86 system will look like this:

        vector          mapped: 74
          msi-0         mapped: 2
          dmar-ir-1     mapped: 69
            ioapic-1    mapped: 4
            ioapic-0    mapped: 20
            pci-msi-2   mapped: 45
          dmar-ir-0     mapped: 3
            ioapic-2    mapped: 1
            pci-msi-1   mapped: 2
          htirq         mapped: 0

     Neither ioapic nor pci-msi know about the dmar interrupt remapping between themselves and the vector domain. If interrupt remapping is disabled, ioapic and pci-msi become direct children of the vector domain. In hindsight we should have done that years ago, but in hindsight we always know better :)

   - Support for generic MSI interrupt domain handling

     We have more and more non PCI related MSI interrupts, so providing a generic infrastructure for this is better than having all affected architectures implementing their own private hacks.

   - Support for PCI-MSI interrupt domain handling, based on the generic MSI support.

     This part carries the pci/msi branch from Bjorn Helgaas pci tree to avoid a massive conflict. The PCI/MSI parts are acked by Bjorn. I have two more branches on top of this: the full conversion of x86 to hierarchical domains and a partial conversion of arm/gic"

* 'irq-irqdomain-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
  genirq: Move irq_chip_write_msi_msg() helper to core
  PCI/MSI: Allow an msi_controller to be associated to an irq domain
  PCI/MSI: Provide mechanism to alloc/free MSI/MSIX interrupt from irqdomain
  PCI/MSI: Enhance core to support hierarchy irqdomain
  PCI/MSI: Move cached entry functions to irq core
  genirq: Provide default callbacks for msi_domain_ops
  genirq: Introduce msi_domain_alloc/free_irqs()
  asm-generic: Add msi.h
  genirq: Add generic msi irq domain support
  genirq: Introduce callback irq_chip.irq_write_msi_msg
  genirq: Work around __irq_set_handler vs stacked domains ordering issues
  irqdomain: Introduce helper function irq_domain_add_hierarchy()
  irqdomain: Implement a method to automatically call parent domains alloc/free
  genirq: Introduce helper irq_domain_set_info() to reduce duplicated code
  genirq: Split out flow handler typedefs into seperate header file
  genirq: Add IRQ_SET_MASK_OK_DONE to support stacked irqchip
  genirq: Introduce irq_chip.irq_compose_msi_msg() to support stacked irqchip
  genirq: Add more helper functions to support stacked irq_chip
  genirq: Introduce helper functions to support stacked irq_chip
  irqdomain: Do irq_find_mapping and set_type for hierarchy irqdomain in case OF
  ...