Age | Commit message | Author | Files | Lines
2016-05-20 | mm, oom: protect !costly allocations some more | Michal Hocko | 1 file | -10/+78
should_reclaim_retry will give up retries for higher order allocations if none of the eligible zones has any requested or higher order pages available even if we pass the watermark check for order-0. This is done because there is no guarantee that the reclaimable and currently free pages will form the required order.

This can, however, lead to situations where the high-order request (e.g. order-2 required for the stack allocation during fork) will trigger OOM too early - e.g. after the first reclaim/compaction round. Such a system would have to be highly fragmented and there is no guarantee further reclaim/compaction attempts would help, but at least make sure that the compaction was active before we go OOM and keep retrying even if should_reclaim_retry tells us to oom if
- the last compaction round backed off, or
- we haven't completed at least MAX_COMPACT_RETRIES active compaction rounds.

The first rule ensures that the very last attempt for compaction was not ignored while the second guarantees that the compaction has done some work. Multiple retries might be needed to prevent occasional piggybacking of other contexts to steal the compacted pages before the current context manages to retry to allocate them. compaction_failed() is taken as a final word from the compaction that the retry doesn't make much sense. We have to be careful though because the first compaction round is MIGRATE_ASYNC which is rather weak as it ignores pages under writeback and gives up too easily in other situations. We therefore have to make sure that MIGRATE_SYNC_LIGHT mode has been used before we give up. With this logic in place we do not have to increase the migration mode unconditionally and rather do it only if the compaction failed for the weaker mode. A nice side effect is that the stronger migration mode is used only when really needed so this has a potential of smaller latencies in some cases.

Please note that the compaction doesn't tell us much about how successful it was when returning compaction_made_progress so we just have to blindly trust that another retry is worthwhile and cap the number to something reasonable to guarantee convergence. If the given number of successful retries is not sufficient for reasonable workloads we should focus on the collected compaction tracepoint data and try to address the issue in the compaction code. If this is not feasible we can increase the retries limit.

[[email protected]: fix warning]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
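The retry rule described above boils down to something like the following (an illustrative sketch only, not the exact kernel code; the helper name, signature and retry bookkeeping are assumptions, while compaction_failed()/compaction_withdrawn() are the helpers introduced by the feedback-abstraction patch below):

	/* Sketch: keep retrying a costly order as long as compaction has not
	 * genuinely failed in at least MIGRATE_SYNC_LIGHT mode. */
	static bool should_compact_retry(unsigned int order,
					 enum compact_result last_result,
					 enum migrate_mode *mode,
					 int *compaction_retries)
	{
		if (!order)
			return false;

		/* A final word from compaction, but only trust it for sync-light. */
		if (compaction_failed(last_result)) {
			if (*mode == MIGRATE_ASYNC) {
				*mode = MIGRATE_SYNC_LIGHT;
				return true;
			}
			return false;
		}

		/* Compaction backed off (deferred/skipped/contended): retry. */
		if (compaction_withdrawn(last_result))
			return true;

		/* Otherwise cap the blind retries to guarantee convergence. */
		return ++*compaction_retries <= MAX_COMPACT_RETRIES;
	}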
2016-05-20 | mm: throttle on IO only when there are too many dirty and writeback pages | Michal Hocko | 2 files | -21/+40
wait_iff_congested has been used to throttle the allocator before it retried another round of direct reclaim, to allow the writeback to make some progress and prevent reclaim from looping over dirty/writeback pages without making any progress.

We used to do congestion_wait before commit 0e093d99763e ("writeback: do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone") but that led to undesirable stalls and sleeping for the full timeout even when the BDI wasn't congested. Hence wait_iff_congested was used instead.

But it seems that even wait_iff_congested doesn't work as expected. We might have a small file LRU list with all pages dirty/writeback and yet the bdi is not congested so this is just a cond_resched in the end and can end up triggering premature OOM.

This patch replaces the unconditional wait_iff_congested by congestion_wait which is executed only if we _know_ that the last round of direct reclaim didn't make any progress and dirty+writeback pages are more than a half of the reclaimable pages on the zone which might be usable for our target allocation. This shouldn't reintroduce stalls fixed by 0e093d99763e because congestion_wait is called only when we are getting hopeless, when sleeping is a better choice than OOM with many pages under IO.

We have to preserve the logic introduced by commit 373ccbe59270 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress") in __alloc_pages_slowpath now that wait_iff_congested is not used anymore. As the only remaining user of wait_iff_congested is shrink_inactive_list we can remove the WQ specific short sleep from wait_iff_congested because the sleep needs to be done only once in the allocation retry cycle.

[[email protected]: high_zoneidx->ac_classzone_idx to evaluate memory reserves properly]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
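The throttling condition amounts to roughly the following (a simplified sketch under the assumption that did_some_progress comes from the last direct reclaim round; the exact counters and accounting in the real code may differ):

	/* Sleep only when reclaim made no progress and most of the
	 * reclaimable memory of the target zone is dirty or under writeback. */
	unsigned long reclaimable = zone_reclaimable_pages(zone);
	unsigned long dirty = zone_page_state(zone, NR_FILE_DIRTY);
	unsigned long writeback = zone_page_state(zone, NR_WRITEBACK);

	if (!did_some_progress && 2 * (dirty + writeback) > reclaimable)
		congestion_wait(BLK_RW_ASYNC, HZ/10);
	else
		cond_resched();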
2016-05-20 | mm, oom: rework oom detection | Michal Hocko | 3 files | -29/+97
__alloc_pages_slowpath has traditionally relied on direct reclaim and did_some_progress as an indicator that it makes sense to retry allocation rather than declaring OOM. shrink_zones had to rely on zone_reclaimable if shrink_zone didn't make any progress to prevent a premature OOM killer invocation - the LRU might be full of dirty or writeback pages and direct reclaim cannot clean those up.

zone_reclaimable allows rescanning the reclaimable lists several times and restarting if a page is freed. This is really subtle behavior and it might lead to a livelock when a single freed page keeps the allocator looping but the current task will not be able to allocate that single page. The OOM killer would be more appropriate than looping without any progress for an unbounded amount of time.

This patch changes the OOM detection logic and pulls it out from shrink_zone, which is too low a level to be appropriate for any high level decisions such as OOM, which is a per-zonelist property. It is __alloc_pages_slowpath which knows how many attempts have been done and what the progress was so far, so it is more appropriate to implement this logic there.

The new heuristic is implemented in the should_reclaim_retry helper called from __alloc_pages_slowpath. It tries to be more deterministic and easier to follow. It builds on the assumption that retrying makes sense only if the currently reclaimable memory + free pages would allow the current allocation request to succeed (as per __zone_watermark_ok) for at least one zone in the usable zonelist.

This alone wouldn't be sufficient, though, because the writeback might get stuck and reclaimable pages might be pinned for a really long time or even depend on the current allocation context. Therefore there is a backoff mechanism implemented which reduces the reclaim target after each reclaim round without any progress. This means that we should eventually converge to only NR_FREE_PAGES as the target and fail the wmark check and proceed to OOM. The backoff is simple and linear with 1/16 of the reclaimable pages for each round without any progress. We are optimistic and reset the counter for successful reclaim rounds.

Costly high order pages mostly preserve their semantics: those without __GFP_REPEAT fail right away while those which have the flag set will back off after the amount of reclaimable pages reaches the equivalent of the requested order. The only difference is that if there was no progress during the reclaim we rely on the zone watermark check. This is a more logical thing to do than the previous 1<<order attempts which were a result of zone_reclaimable faking the progress.

[[email protected]: check classzone_idx for shrink_zone]
[[email protected]: separate the heuristic into should_reclaim_retry]
[[email protected]: use zone_page_state_snapshot for NR_FREE_PAGES]
[[email protected]: shrink_zones doesn't need to return anything]
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
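The heuristic with its linear backoff can be sketched as follows (illustrative only; no_progress_loops and the exact watermark arguments are assumptions based on the description above):

	/* Retry while free + (progressively discounted) reclaimable pages
	 * would still pass the watermark for at least one usable zone. */
	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
					ac->high_zoneidx, ac->nodemask) {
		unsigned long available = zone_reclaimable_pages(zone);

		/* Back off by 1/16 of the reclaimable pages per round
		 * without progress; converge towards NR_FREE_PAGES only. */
		available -= DIV_ROUND_UP(no_progress_loops * available, 16);
		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);

		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
					ac_classzone_idx(ac), alloc_flags,
					available))
			return true;
	}
	return false;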
2016-05-20 | mm, compaction: abstract compaction feedback to helpers | Michal Hocko | 1 file | -0/+79
Compaction can provide a wild variation of feedback to the caller. Many of the possible states are implementation specific and the caller of the compaction (especially the page allocator) shouldn't be bound to specifics of the current implementation. This patch abstracts the feedback into three basic types:
- compaction_made_progress - compaction was active and made some progress.
- compaction_failed - compaction failed and further attempts to invoke it would most probably fail and therefore it is not worth retrying
- compaction_withdrawn - compaction wasn't invoked for implementation specific reasons. In the current implementation it means that the compaction was deferred, contended or the page scanners met too early without any progress. Retrying is still worthwhile.

[[email protected]: do not change thp back off behavior]
[[email protected]: fix typo in comment, per Hillf]
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
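In terms of the compact_result values introduced further down this series, the three helpers are essentially simple predicates, roughly (a sketch; the precise set of values each helper accepts may differ from the real implementation):

	static inline bool compaction_made_progress(enum compact_result result)
	{
		/* A suitable high-order page was (or likely was) formed. */
		return result == COMPACT_PARTIAL;
	}

	static inline bool compaction_failed(enum compact_result result)
	{
		/* Full zone walk without success: retrying is unlikely to help. */
		return result == COMPACT_COMPLETE;
	}

	static inline bool compaction_withdrawn(enum compact_result result)
	{
		/* Compaction didn't really run: deferred, skipped, contended,
		 * or only a partial zone walk. Retrying is still worthwhile. */
		return result == COMPACT_DEFERRED || result == COMPACT_SKIPPED ||
		       result == COMPACT_CONTENDED ||
		       result == COMPACT_PARTIAL_SKIPPED;
	}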
2016-05-20 | mm, compaction: simplify __alloc_pages_direct_compact feedback interface | Michal Hocko | 1 file | -36/+31
__alloc_pages_direct_compact communicates potential back off by two variables:
- deferred_compaction tells that the compaction returned COMPACT_DEFERRED
- contended_compaction is set when there is contention on the zone->lock resp. zone->lru_lock locks

__alloc_pages_slowpath then backs off for THP allocation requests to prevent long stalls. This is rather messy and it would be much cleaner to return a single compact result value and hide all the nasty details in __alloc_pages_direct_compact. This patch shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2016-05-20 | mm, compaction: update compaction_result ordering | Michal Hocko | 1 file | -10/+16
compaction_result will be used as the primary feedback channel for compaction users. At the same time try_to_compact_pages (and potentially others) assume a certain ordering where a more specific feedback takes precedence. This gets a bit awkward when we have conflicting feedback from different zones. E.g. one returning COMPACT_COMPLETE, meaning the full zone has been scanned without any outcome, while another returns COMPACT_PARTIAL, aka made some progress. The caller should get COMPACT_PARTIAL because that means that the compaction still can make some progress. The same applies for COMPACT_PARTIAL vs COMPACT_PARTIAL_SKIPPED. Reorder PARTIAL to be the largest one so that the larger the value is, the more progress has been made.

Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
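After the reordering described here and in the neighbouring patches, the enum roughly looks like this (an abbreviated sketch; the member set and comments are simplified, and only the relative ordering matters):

	enum compact_result {
		COMPACT_SKIPPED,	/* not suitable, e.g. watermarks not met */
		COMPACT_DEFERRED,	/* deferred due to recent failures */
		COMPACT_CONTINUE,	/* keep compacting the zone */
		COMPACT_COMPLETE,	/* full zone walk, nothing compacted */
		COMPACT_PARTIAL_SKIPPED,/* partial zone walk, nothing compacted */
		COMPACT_CONTENDED,	/* aborted due to lock contention */
		COMPACT_PARTIAL,	/* made progress, a page may be available */
	};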
2016-05-20 | mm, compaction: distinguish between full and partial COMPACT_COMPLETE | Michal Hocko | 4 files | -4/+22
COMPACT_COMPLETE now means that the compaction and free scanners met. This is not very useful information if somebody just wants to use this feedback and make any decisions based on that. The current caller might be a poor guy who just happened to scan a tiny portion of the zone and that could be the reason no suitable pages were compacted. Make sure we distinguish the full and partial zone walks. Consumers should treat COMPACT_PARTIAL_SKIPPED as a potential success and be optimistic in retrying. The existing users of COMPACT_COMPLETE are conservatively changed to use COMPACT_PARTIAL_SKIPPED as well, but some of them should probably be reconsidered and defer the compaction only for COMPACT_COMPLETE with the new semantics. This patch shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2016-05-20 | mm, compaction: distinguish COMPACT_DEFERRED from COMPACT_SKIPPED | Michal Hocko | 3 files | -6/+11
try_to_compact_pages() can currently return COMPACT_SKIPPED even when the compaction is deferred for some zone, just because zone DMA is skipped in 99% of cases due to watermark checks. This makes COMPACT_DEFERRED basically unusable for the page allocator as a feedback mechanism. Make sure we distinguish those two states properly and switch their ordering in the enum. This means that COMPACT_SKIPPED will be returned only when all eligible zones are skipped. As a result COMPACT_DEFERRED handling for THP in __alloc_pages_slowpath will be more precise and we would bail out rather than reclaim.

Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2016-05-20 | mm, compaction: cover all compaction mode in compact_zone | Michal Hocko | 1 file | -8/+5
The compiler is complaining after "mm, compaction: change COMPACT_ constants into enum" mm/compaction.c: In function `compact_zone': mm/compaction.c:1350:2: warning: enumeration value `COMPACT_DEFERRED' not handled in switch [-Wswitch] switch (ret) { ^ mm/compaction.c:1350:2: warning: enumeration value `COMPACT_COMPLETE' not handled in switch [-Wswitch] mm/compaction.c:1350:2: warning: enumeration value `COMPACT_NO_SUITABLE_PAGE' not handled in switch [-Wswitch] mm/compaction.c:1350:2: warning: enumeration value `COMPACT_NOT_SUITABLE_ZONE' not handled in switch [-Wswitch] mm/compaction.c:1350:2: warning: enumeration value `COMPACT_CONTENDED' not handled in switch [-Wswitch] compaction_suitable is allowed to return only COMPACT_PARTIAL, COMPACT_SKIPPED and COMPACT_CONTINUE so other cases are simply impossible. Put a VM_BUG_ON to catch an impossible return value. Signed-off-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Hillf Danton <[email protected]> Cc: David Rientjes <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-05-20 | mm, compaction: change COMPACT_ constants into enum | Michal Hocko | 3 files | -32/+42
The compaction code is doing weird dances between COMPACT_FOO -> int -> unsigned long, but there doesn't seem to be any reason for that. All functions which return/use one of those constants are not expecting any other value, so it really makes sense to define an enum for them and make it clear that no other values are expected. This is a pure cleanup and shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2016-05-20 | vmscan: consider classzone_idx in compaction_ready | Michal Hocko | 1 file | -4/+4
Motivation: As pointed out by Linus [2][3], relying on zone_reclaimable as a way to communicate the reclaim progress is rather dubious. I tend to agree; not only is it really obscure, it is not hard to imagine cases where a single page freed in the loop keeps all the reclaimers looping without getting any progress because their gfp_mask wouldn't allow them to get that page anyway (e.g. single GFP_ATOMIC alloc and free loop). This is rather rare so it doesn't happen in practice, but the current logic which we have is rather obscure and hard to follow and also non-deterministic.

This is an attempt to make the OOM detection more deterministic and easier to follow because each reclaimer basically tracks its own progress, which is implemented at the page allocator layer rather than spread out between the allocator and the reclaim. More on the implementation is described in the first patch.

I have tested several different scenarios but it should be clear that testing the OOM killer is quite hard to be representative. There is usually a tiny gap between almost OOM and full blown OOM which is often time sensitive. Anyway, I have tested the following 2 scenarios and I would appreciate if there are more to test.

Testing environment: a virtual machine with 2G of RAM and 2 CPUs without any swap to make the OOM more deterministic.

1) 2 writers (each doing dd with 4M blocks to an xfs partition with 1G file size, removes the files and starts over again) running in parallel for 10s to build up a lot of dirty pages, then 100 parallel mem_eaters (anon private populated mmap which waits until it gets a signal) with 80M each. This causes an OOM flood of course and I have compared both patched and unpatched kernels. The test is considered finished after there are no OOM conditions detected. This should tell us whether there are any excessive kills or some of them premature (e.g. due to dirty pages). I have performed two runs this time, each after a fresh boot.

* base kernel
$ grep "Out of memory:" base-oom-run1.log | wc -l
78
$ grep "Out of memory:" base-oom-run2.log | wc -l
78
$ grep "Kill process" base-oom-run1.log | tail -n1
[ 91.391203] Out of memory: Kill process 3061 (mem_eater) score 39 or sacrifice child
$ grep "Kill process" base-oom-run2.log | tail -n1
[ 82.141919] Out of memory: Kill process 3086 (mem_eater) score 39 or sacrifice child
$ grep "DMA32 free:" base-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5376.00 max: 6776.00 avg: 5530.75 std: 166.50 nr: 61
$ grep "DMA32 free:" base-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5416.00 max: 5608.00 avg: 5514.15 std: 42.94 nr: 52
$ grep "DMA32.*all_unreclaimable? no" base-oom-run1.log | wc -l
1
$ grep "DMA32.*all_unreclaimable? no" base-oom-run2.log | wc -l
3

* patched kernel
$ grep "Out of memory:" patched-oom-run1.log | wc -l
78
$ grep "Out of memory:" patched-oom-run2.log | wc -l
77
$ grep "Kill process" patched-oom-run1.log | tail -n1
[ 497.317732] Out of memory: Kill process 3108 (mem_eater) score 39 or sacrifice child
$ grep "Kill process" patched-oom-run2.log | tail -n1
[ 316.169920] Out of memory: Kill process 3093 (mem_eater) score 39 or sacrifice child
$ grep "DMA32 free:" patched-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5420.00 max: 5808.00 avg: 5513.90 std: 60.45 nr: 78
$ grep "DMA32 free:" patched-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5380.00 max: 6384.00 avg: 5520.94 std: 136.84 nr: 77
$ grep "DMA32.*all_unreclaimable? no" patched-oom-run1.log | wc -l
2
$ grep "DMA32.*all_unreclaimable? no" patched-oom-run2.log | wc -l
3

The patched kernel ran noticeably longer while invoking the OOM killer the same number of times. This means that the original implementation is much more aggressive and triggers the OOM killer sooner. The free pages stats show that neither kernel went OOM too early most of the time, though. I guess the difference is in the backoff when retries without any progress do sleep for a while if there is memory under writeback or dirty, which is highly likely considering the parallel IO. Both kernels have seen races where a zone wasn't marked unreclaimable and we still hit the OOM killer. This is most likely a race where a task managed to exit between the last allocation attempt and the oom killer invocation.

2) 2 writers again with 10s of run and then 10 mem_eaters to consume as much memory as possible without triggering the OOM killer. This required a lot of tuning but I've considered 3 consecutive runs in three different boots without OOM as a success.

* base kernel
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(16*1024)}' /proc/meminfo)

* patched kernel
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(12*1024)}' /proc/meminfo)

That means 40M more memory was usable without triggering the OOM killer. The base kernel sometimes managed to handle the same amount as the patched one but it wasn't consistent and failed in at least one of the 3 runs. This seems like a minor improvement.

I was also testing __GFP_REPEAT costly requests (hugetlb) with fragmented memory and under memory pressure. The results are in patch 11 where the logic is implemented. In short I can see a huge improvement there.

I am certainly interested in other usecases as well as any feedback, especially those which require higher order requests.

This patch (of 14):

While playing with the oom detection rework [1] I have noticed that my heavy order-9 (hugetlb) load close to OOM ended up in an endless loop where the reclaim hasn't made any progress but did_some_progress didn't reflect that, and compaction_suitable was backing off because no zone is above low wmark + 1 << order. It turned out that this is in fact a long standing bug in compaction_ready which ignores the requested_highidx and does the watermark check for classzone_idx 0. This succeeds for zone DMA most of the time as the zone is mostly unused because of lowmem protection. As a result costly high order allocations always report successful progress even when there was none.

This wasn't a problem so far because these allocations usually fail quite early or retry only a few times with __GFP_REPEAT, but this will change after a later patch in this series, so make sure not to lie about the progress and propagate requested_highidx down to compaction_ready and use it for both the watermark check and compaction_suitable to fix this issue.

[1] http://lkml.kernel.org/r/[email protected]
[2] https://lkml.org/lkml/2015/10/12/808
[3] https://lkml.org/lkml/2015/10/13/597

Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
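The fix essentially makes the readiness check honour the requested classzone index, along these lines (a sketch; the actual function in mm/vmscan.c differs in detail):

	/* Is enough memory free, above the high watermark plus some headroom,
	 * for compaction of this order to have a chance in this zone? */
	static inline bool compaction_ready(struct zone *zone, int order,
					    int classzone_idx)
	{
		unsigned long watermark = high_wmark_pages(zone) + (2UL << order);

		if (!zone_watermark_ok_safe(zone, 0, watermark, classzone_idx))
			return false;

		return compaction_suitable(zone, order, 0, classzone_idx) !=
								COMPACT_SKIPPED;
	}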
2016-05-20 | mm: vmscan: reduce size of inactive file list | Rik van Riel | 3 files | -131/+42
The inactive file list should still be large enough to contain readahead windows and freshly written file data, but it no longer is the only source for detecting multiple accesses to file pages. The workingset refault measurement code causes recently evicted file pages that get accessed again after a shorter interval to be promoted directly to the active list. With that mechanism in place, we can afford to (on a larger system) dedicate more memory to the active file list, so we can actually cache more of the frequently used file pages in memory, and not have them pushed out by streaming writes, once-used streaming file reads, etc. This can help things like database workloads, where only half the page cache can currently be used to cache the database working set. This patch automatically increases that fraction on larger systems, using the same ratio that has already been used for anonymous memory. [[email protected]: cgroup-awareness] Signed-off-by: Rik van Riel <[email protected]> Signed-off-by: Johannes Weiner <[email protected]> Reported-by: Andres Freund <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-05-20 | mm: filemap: only do access activations on reads | Johannes Weiner | 1 file | -1/+1
Andres observed that his database workload is struggling with the transaction journal creating pressure on frequently read pages. Access patterns like transaction journals frequently write the same pages over and over, but in the majority of cases those pages are never read back. There are no caching benefits to be had for those pages, so activating them and having them put pressure on pages that do benefit from caching is a bad choice. Leave page activations to read accesses and don't promote pages based on writes alone. It could be said that partially written pages do contain cache-worthy data, because even if *userspace* does not access the unwritten part, the kernel still has to read it from the filesystem for correctness. However, a counter argument is that these pages enjoy at least *some* protection over other inactive file pages through the writeback cache, in the sense that dirty pages are written back with a delay and cache reclaim leaves them alone until they have been written back to disk. Should that turn out to be insufficient and we see increased read IO from partial writes under memory pressure, we can always go back and update grab_cache_page_write_begin() to take (pos, len) so that it can tell partial writes from pages that don't need partial reads. But for now, keep it simple. Signed-off-by: Johannes Weiner <[email protected]> Reported-by: Andres Freund <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-05-20 | mm: workingset: only do workingset activations on reads | Rik van Riel | 1 file | -1/+5
This is a follow-up to http://www.spinics.net/lists/linux-mm/msg101739.html where Andres reported his database workingset being pushed out by the minimum size enforcement of the inactive file list - currently 50% of cache - as well as repeatedly written file pages that are never actually read. Two changes fell out of the discussions. The first change observes that pages that are only ever written don't benefit from caching beyond what the writeback cache does for partial page writes, and so we shouldn't promote them to the active file list where they compete with pages whose cached data is actually accessed repeatedly. This change comes in two patches - one for in-cache write accesses and one for refaults triggered by writes, neither of which should promote a cache page. Second, with the refault detection we don't need to set 50% of the cache aside for used-once cache anymore since we can detect frequently used pages even when they are evicted between accesses. We can allow the active list to be bigger and thus protect a bigger workingset that isn't challenged by streamers. Depending on the access patterns, this can increase major faults during workingset transitions for better performance during stable phases. This patch (of 3): When rewriting a page, the data in that page is replaced with new data. This means that evicting something else from the active file list, in order to cache data that will be replaced by something else, is likely to be a waste of memory. It is better to save the active list for frequently read pages, because reads actually use the data that is in the page. This patch ignores partial writes, because it is unclear whether the complexity of identifying those is worth any potential performance gain obtained from better caching pages that see repeated partial writes at large enough intervals to not get caught by the use-twice promotion code used for the inactive file list. Signed-off-by: Rik van Riel <[email protected]> Signed-off-by: Johannes Weiner <[email protected]> Reported-by: Andres Freund <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-05-20 | Merge branch 'sparc32-cosmetic-changes' | David S. Miller | 11 files | -34/+33
Sam Ravnborg says:

====================
sparc32: kgdb_32 and STRICT_MM_TYPECHECKS updates

A few cosmetic patches for sparc32 follow.

I noticed some inconsistency in kgdb_32 that triggered a few patches. The inconsistency in kgdb_32 turned out to have no functional impact, but I fixed it anyway so kgdb_32 and kgdb_64 become just a tiny bit more alike.

The STRICT_MM_TYPECHECKS patch was triggered by a discussion on linux-arch where Arnd considered removing this cruft. The resulting binary of srmmu.c (I checked an assembler file) was _smaller_ when I defined STRICT_MM_TYPECHECKS. But due to lack of testing I left it undefined. With the build errors fixed it should be trivial to try out whether defining STRICT_MM_TYPECHECKS breaks anything. Any takers?

As I don't have any working sparc32 gear at the moment this has only been build tested - so please consider whether it is worth taking the risk to apply the patches.
====================

Signed-off-by: David S. Miller <[email protected]>
2016-05-20 | sparc32: drop superfluous cast in calls to __nocache_pa() | Sam Ravnborg | 2 files | -3/+3
Signed-off-by: Sam Ravnborg <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-05-20 | sparc32: fix build with STRICT_MM_TYPECHECKS | Sam Ravnborg | 5 files | -11/+14
Based on a recent thread on linux-arch (some weeks ago) I decided to check how much work was required to build sparc32 with STRICT_MM_TYPECHECKS enabled. The resulting binary (I checked srmmu.o) was, to my surprise, smaller with STRICT_MM_TYPECHECKS defined than without. As I have no working gear to test sparc32 bits at the moment, I did not enable STRICT_MM_TYPECHECKS - but was tempted to do so.

Signed-off-by: Sam Ravnborg <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
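For context, STRICT_MM_TYPECHECKS is the classic asm/page.h pattern of wrapping page table entries in single-member structs so that mixing up the different page table types becomes a compile error instead of silently working. A generic sketch of the pattern (not the sparc32-specific definitions):

	#ifdef STRICT_MM_TYPECHECKS
	/* Distinct type: passing a pte_t where a pgd_t is expected won't build. */
	typedef struct { unsigned long pte; } pte_t;
	#define pte_val(x)	((x).pte)
	#define __pte(x)	((pte_t) { (x) })
	#else
	/* Loose variant: just an integer, no type safety, historically assumed
	 * to generate better code. */
	typedef unsigned long pte_t;
	#define pte_val(x)	(x)
	#define __pte(x)	(x)
	#endif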
2016-05-20 | sparc32: use proper prototype for trapbase | Sam Ravnborg | 3 files | -5/+3
This killed an extern ... in a .c file. No functional change. Signed-off-by: Sam Ravnborg <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-05-20 | sparc32: drop local prototype in kgdb_32 | Sam Ravnborg | 1 file | -2/+2
Signed-off-by: Sam Ravnborg <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-05-20 | sparc32: drop hardcoding trap_level in kgdb_trap | Sam Ravnborg | 4 files | -14/+12
Fix this so we pass the trap_level from the actual trap code like we do in sparc64. Add use of ENTRY()/ENDPROC() in the assembler function too. This fixes a bug where the hardcoded value for trap_level was the sparc64 value. As the generic code does not use the trap_level argument (for sparc32), this patch does not have any functional impact.

Signed-off-by: Sam Ravnborg <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2016-05-20 | net: suppress warnings on dev_alloc_skb | Neil Horman | 1 file | -2/+2
Noticed an allocation failure in a network driver the other day on a 32 bit system:

DMA-API: debugging out of memory - disabling
bnx2fc: adapter_lookup: hba NULL
lldpad: page allocation failure. order:0, mode:0x4120
Pid: 4556, comm: lldpad Not tainted 2.6.32-639.el6.i686.debug #1
Call Trace:
 [<c08a4086>] ? printk+0x19/0x23
 [<c05166a4>] ? __alloc_pages_nodemask+0x664/0x830
 [<c0649d02>] ? free_object+0x82/0xa0
 [<fb4e2c9b>] ? ixgbe_alloc_rx_buffers+0x10b/0x1d0 [ixgbe]
 [<fb4e2fff>] ? ixgbe_configure_rx_ring+0x29f/0x420 [ixgbe]
 [<fb4e228c>] ? ixgbe_configure_tx_ring+0x15c/0x220 [ixgbe]
 [<fb4e3709>] ? ixgbe_configure+0x589/0xc00 [ixgbe]
 [<fb4e7be7>] ? ixgbe_open+0xa7/0x5c0 [ixgbe]
 [<fb503ce6>] ? ixgbe_init_interrupt_scheme+0x5b6/0x970 [ixgbe]
 [<fb4e8e54>] ? ixgbe_setup_tc+0x1a4/0x260 [ixgbe]
 [<fb505a9f>] ? ixgbe_dcbnl_set_state+0x7f/0x90 [ixgbe]
 [<c088d80d>] ? dcb_doit+0x10ed/0x16d0
 ...

Thought that perhaps the big splat in the logs wasn't really necessary, as all call sites for dev_alloc_skb:
a) check the return code for the function, and
b) either print their own error message or have a recovery path that makes the warning moot.

Fix it by modifying dev_alloc_pages to pass __GFP_NOWARN as a gfp flag to suppress the warning.

Applies to the net tree.

Signed-off-by: Neil Horman <[email protected]>
CC: "David S. Miller" <[email protected]>
CC: Eric Dumazet <[email protected]>
CC: Alexander Duyck <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
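The change boils down to adding __GFP_NOWARN in the dev_alloc_* wrappers, roughly like this (a sketch of the idea rather than the exact diff; the surrounding gfp flags are the ones commonly used by these helpers):

	static inline struct page *__dev_alloc_pages(gfp_t gfp_mask,
						     unsigned int order)
	{
		/* Callers check for NULL and recover (or log) themselves, so
		 * suppress the generic page allocation failure splat. */
		gfp_mask |= __GFP_NOWARN | __GFP_COLD | __GFP_COMP | __GFP_MEMALLOC;

		return alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
	}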
2016-05-20 | uapi glibc compat: fix compilation when !__USE_MISC in glibc | Nicolas Dichtel | 1 file | -1/+1
These structures are defined only if __USE_MISC is set in glibc net/if.h headers, ie when _BSD_SOURCE or _SVID_SOURCE are defined. CC: Jan Engelhardt <[email protected]> CC: Josh Boyer <[email protected]> CC: Stephen Hemminger <[email protected]> CC: Waldemar Brodkorb <[email protected]> CC: Gabriel Laskar <[email protected]> CC: Mikko Rapeli <[email protected]> Fixes: 4a91cb61bb99 ("uapi glibc compat: fix compile errors when glibc net/if.h included before linux/if.h") Signed-off-by: Nicolas Dichtel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-05-20 | udp: prevent skbs lingering in tunnel socket queues | Hannes Frederic Sowa | 4 files | -11/+7
In case we find a socket with encapsulation enabled we should call the encap_recv function even if just a udp header without payload is available. The callbacks are responsible for correctly verifying and dropping the packets. Also, in case the header validation fails for geneve and vxlan we shouldn't put the skb back into the socket queue, no one will pick them up there. Instead we can simply discard them in the respective encap_recv functions. Signed-off-by: Hannes Frederic Sowa <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-05-20 | Merge branch 'bpf-verifier-fixes' | David S. Miller | 1 file | -9/+21
Alexei Starovoitov says: ==================== bpf: verifier fixes Further testing of 'direct packet access' uncovered several usability issues. Fix them. ==================== Signed-off-by: David S. Miller <[email protected]>
2016-05-20 | bpf: teach verifier to recognize imm += ptr pattern | Alexei Starovoitov | 1 file | -1/+17
Humans don't write C code like: u8 *ptr = skb->data; int imm = 4; imm += ptr; but from llvm backend point of view 'imm' and 'ptr' are registers and imm += ptr may be preferred vs ptr += imm depending which register value will be used further in the code, while verifier can only recognize ptr += imm. That caused small unrelated changes in the C code of the bpf program to trigger rejection by the verifier. Therefore teach the verifier to recognize both ptr += imm and imm += ptr. For example: when R6=pkt(id=0,off=0,r=62) R7=imm22 after r7 += r6 instruction will be R6=pkt(id=0,off=0,r=62) R7=pkt(id=0,off=22,r=62) Fixes: 969bf05eb3ce ("bpf: direct packet access") Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-05-20 | bpf: support decreasing order in direct packet access | Alexei Starovoitov | 1 file | -8/+4
When packet headers are accessed in 'decreasing' order (like the TCP port may be fetched before the program reads IP src) the llvm may generate the following code:
[...] // R7=pkt(id=0,off=22,r=70)
r2 = *(u32 *)(r7 +0) // good access
[...]
r7 += 40 // R7=pkt(id=0,off=62,r=70)
r8 = *(u32 *)(r7 +0) // good access
[...]
r1 = *(u32 *)(r7 -20) // this one will fail though it's within a safe range
                      // it's doing *(u32*)(skb->data + 42)
Fix the verifier to recognize such a code pattern.

It also turned out that the 'off > range' condition is not a verifier bug. It's a buggy program that may do something like:
if (ptr + 50 > data_end)
  return 0;
ptr += 60;
*(u32*)ptr;
In such a case emit the "invalid access to packet, off=0 size=4, R1(id=0,off=60,r=50)" error message, so all information is available for the program author to fix the program.

Fixes: 969bf05eb3ce ("bpf: direct packet access")
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2016-05-20 | net: usb: ch9200: use kmemdup | Muhammad Falak R Wani | 1 file | -2/+1
Use kmemdup when some other buffer is immediately copied into allocated region. It replaces call to allocation followed by memcpy, by a single call to kmemdup. Signed-off-by: Muhammad Falak R Wani <[email protected]> Signed-off-by: David S. Miller <[email protected]>
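The conversion here (and in the ps3_gelic and liquidio patches below) is the usual one-liner, illustrated with placeholder names:

	/* Before: allocate, check, then copy. */
	buf = kmalloc(len, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;
	memcpy(buf, src, len);

	/* After: one call allocates and copies. */
	buf = kmemdup(src, len, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;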
2016-05-20 | ps3_gelic: use kmemdup | Muhammad Falak R Wani | 1 file | -2/+2
Use kmemdup when some other buffer is immediately copied into allocated region. It replaces call to allocation followed by memcpy, by a single call to kmemdup. Signed-off-by: Muhammad Falak R Wani <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-05-20 | net:liquidio: use kmemdup | Muhammad Falak R Wani | 1 file | -3/+1
Use kmemdup when some other buffer is immediately copied into allocated region. It replaces call to allocation followed by memcpy, by a single call to kmemdup. Signed-off-by: Muhammad Falak R Wani <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-05-20 | bpf: Use mount_nodev not mount_ns to mount the bpf filesystem | Eric W. Biederman | 1 file | -1/+1
While reviewing the filesystems that set FS_USERNS_MOUNT I spotted the bpf filesystem. Looking at the code I saw a broken usage of mount_ns with current->nsproxy->mnt_ns. As the code does not acquire a reference to the mount namespace it can not possibly be correct to store the mount namespace on the superblock as it does.

Replace mount_ns with mount_nodev so that each mount of the bpf filesystem returns a distinct instance, and the code is not buggy.

In discussion with Hannes Frederic Sowa it was reported that the use of mount_ns was an attempt to have one bpf instance per mount namespace, in an attempt to keep resources that pin resources from hiding. That intent simply does not work; the vfs is not built to allow that kind of behavior. Which means that the bpf filesystem really is buggy both semantically and in its implementation, as it does not nor can it implement the original intent.

This change is userspace visible, but my experience with similar filesystems leads me to believe nothing will break with a model of each mount of the bpf filesystem being distinct from all others.

Fixes: b2197755b263 ("bpf: add support for persistent maps/progs")
Cc: Hannes Frederic Sowa <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
Acked-by: Hannes Frederic Sowa <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
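The resulting mount callback is essentially a one-liner, roughly (a sketch; the fill_super helper name is assumed to match the existing bpf fs code):

	static struct dentry *bpf_mount(struct file_system_type *type, int flags,
					const char *dev_name, void *data)
	{
		/* Every mount gets its own superblock; no mount-namespace keying. */
		return mount_nodev(type, flags, data, bpf_fill_super);
	}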
2016-05-20 | Merge tag 'wireless-drivers-next-for-davem-2016-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next | David S. Miller | 95 files | -4766/+6751
Kalle Valo says:

====================
wireless-drivers patches for 4.7

Major changes:

iwlwifi
 * remove IWLWIFI_DEBUG_EXPERIMENTAL_UCODE kconfig option
 * work for RX multiqueue continues
 * dynamic queue allocation work continues
 * add Luca as maintainer
 * a bunch of fixes and improvements all over

brcmfmac
 * add 4356 sdio support

ath6kl
 * add ability to set debug uart baud rate with a module parameter

wil6210
 * add debugfs file to configure firmware led functionality
====================

Signed-off-by: David S. Miller <[email protected]>
2016-05-20 | net: cdc_ncm: update datagram size after changing mtu | Rafal Redzimski | 1 file | -2/+4
Current implementation updates the mtu size and notify cdc_ncm device using USB_CDC_SET_MAX_DATAGRAM_SIZE request about datagram size change instead of changing rx_urb_size. Whenever mtu is being changed, datagram size should also be updated. Also updating maxmtu formula so it takes max_datagram_size with use of cdc_ncm_max_dgram_size() and not ctx. Signed-off-by: Robert Dobrowolski <[email protected]> Signed-off-by: Rafal Redzimski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-05-20 | tuntap: correctly wake up process during uninit | Jason Wang | 1 file | -3/+3
We used to check dev->reg_state against NETREG_REGISTERED after each time we are woke up. But after commit 9e641bdcfa4e ("net-tun: restructure tun_do_read for better sleep/wakeup efficiency"), it uses skb_recv_datagram() which does not check dev->reg_state. This will result if we delete a tun/tap device after a process is blocked in the reading. The device will wait for the reference count which was held by that process for ever. Fixes this by using RCV_SHUTDOWN which will be checked during sk_recv_datagram() before trying to wake up the process during uninit. Fixes: 9e641bdcfa4e ("net-tun: restructure tun_do_read for better sleep/wakeup efficiency") Cc: Eric Dumazet <[email protected]> Cc: Xi Wang <[email protected]> Cc: Michael S. Tsirkin <[email protected]> Signed-off-by: Jason Wang <[email protected]> Acked-by: Eric Dumazet <[email protected]> Acked-by: Michael S. Tsirkin <[email protected]> Signed-off-by: David S. Miller <[email protected]>
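Conceptually the fix marks the receive side of the per-queue socket as shut down before waking readers, so skb_recv_datagram() returns instead of going back to sleep. A simplified sketch (the exact place in tun.c and the wakeup used there are assumptions):

	/* During device uninit/detach (sketch): */
	struct sock *sk = tfile->socket.sk;

	sk->sk_shutdown |= RCV_SHUTDOWN;	/* honoured by skb_recv_datagram() */
	sk->sk_data_ready(sk);			/* wake any blocked reader */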
2016-05-20 | Merge branch 'GREoIPV6-followups' | David S. Miller | 9 files | -4/+16
Alexander Duyck says: ==================== Follow-ups for GUEoIPv6 patches This patch series is meant to be applied after: [PATCH v7 net-next 00/16] ipv6: Enable GUEoIPv6 and more fixes for v6 tunneling The first patch addresses an issue we already resolved in the GREv4 and is now present in GREv6 with the introduction of FOU/GUE for IPv6 based GRE tunnels. The second patch goes through and enables IPv6 tunnel offloads for the Intel NICs that already support the IPv4 based IP-in-IP tunnel offloads. I have only done a bit of touch testing but have seen ~20 Gb/s over an i40e interface using a v4-in-v6 tunnel, and I have verified IPv6 GRE is still passing traffic at around the same rate. I plan to do further testing but with these patches present it should enable a wider audience to be able to test the new features introduced in Tom's patchset with hardware offloads. ==================== Signed-off-by: David S. Miller <[email protected]>
2016-05-20 | intel: Add support for IPv6 IP-in-IP offload | Alexander Duyck | 8 files | -0/+8
This patch adds support for offloading IPXIP6 type packets that represent either IPv4 or IPv6 encapsulated inside of an IPv6 outer IP header. In addition with this change we should also be able to support FOU encapsulated traffic with outer IPv6 headers. Signed-off-by: Alexander Duyck <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-05-20 | ip6_gre: Do not allow segmentation offloads GRE_CSUM is enabled with FOU/GUE | Alexander Duyck | 1 file | -4/+8
This patch addresses the same issue we had for IPv4 where enabling GRE with an inner checksum cannot be supported with FOU/GUE due to the fact that they will jump past the GRE header as it is treated like a tunnel header.

Signed-off-by: Alexander Duyck <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2016-05-21 | ACPI / battery: Correctly serialise with the pending async probe | Chris Wilson | 1 file | -1/+1
async_synchronize_cookie() only serialises all tasks up to the specified cookie, and importantly does not wait for the task corresponding with the cookie. [This is so that it can be trivially used from inside the async_func_t in order to serialise with all preceding tasks.] In order to serialise with acpi_battery_init_async() we need to compensate and pass in the next cookie instead. The impact today is zero since performing an async_schedule() from inside a module init function will trigger an async_synchronize_full() prior to the module loader's completion. However, if the probe was moved to its own unregistered async_domain, then the async_synchronize_cookie would be replaced with an async_synchronize_full_domain. Signed-off-by: Chris Wilson <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
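The fix is the off-by-one at the synchronisation point, e.g. (sketch; the cookie variable name is illustrative):

	/* async_synchronize_cookie(c) waits only for tasks *before* cookie c,
	 * so wait for async_cookie + 1 to include acpi_battery_init_async()
	 * itself. */
	async_synchronize_cookie(async_cookie + 1);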
2016-05-20 | Merge branch 'rds-conn-spamming' | David S. Miller | 1 file | -4/+9
Sowmini Varadhan says: ==================== RDS: TCP: connection spamming fixes We have been testing the RDS-TCP code with a connection spammer that sends incoming SYNs to the RDS listen port well after an rds-tcp connection has been established, and found a few race-windows that are fixed by this patch series. Patch 1 avoids a null pointer deref when an incoming SYN shows up when a netns is being dismantled, or when the rds-tcp module is being unloaded. Patch 2 addresses the case when a SYN is received after the connection arbitration algorithm has converged: the incoming SYN should not needlessly quiesce the transmit path, and it should not result in needless TCP connection resets due to re-execution of the connection arbitration logic. ==================== Signed-off-by: David S. Miller <[email protected]>
2016-05-20 | RDS: TCP: Avoid rds connection churn from rogue SYNs | Sowmini Varadhan | 1 file | -4/+6
When a rogue SYN is received after the connection arbitration algorithm has converged, the incoming SYN should not needlessly quiesce the transmit path, and it should not result in needless TCP connection resets due to re-execution of the connection arbitration logic. Signed-off-by: Sowmini Varadhan <[email protected]> Acked-by: Santosh Shilimkar <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-05-20 | RDS: TCP: rds_tcp_accept_worker() must exit gracefully when terminating rds-tcp | Sowmini Varadhan | 1 file | -0/+3
There are two instances where we want to terminate RDS-TCP: when exiting the netns or during module unload. In either case, the termination sequence is to stop the listen socket, mark the rtn->rds_tcp_listen_sock as null, and flush any accept workqs. Thus any workqs that get flushed at this point will encounter a null rds_tcp_listen_sock, and must exit gracefully to allow the RDS-TCP termination to complete successfully. Signed-off-by: Sowmini Varadhan <[email protected]> Acked-by: Santosh Shilimkar <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-05-20 | Merge tag 'gfs2-4.7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 | Linus Torvalds | 9 files | -32/+60
Pull GFS2 updates from Bob Peterson:
 "We've got nine patches this time:

  - Abhi Das has two patches that fix a GFS2 splice issue (and an adjustment).

  - Ben Marzinski has a patch which allows the proper unmount of a GFS2 file system after hitting a withdraw error.

  - I have a patch to fix a problem where GFS2 would dereference an error value, plus three cosmetic / refactoring patches.

  - Daniel DeFreez has a patch to fix two glock reference count problems, where GFS2 was not properly "uninitializing" its glock holder on error paths.

  - Denys Vlasenko has a patch to change a function to not be inlined, thus reducing the memory footprint of the GFS2 module"

* tag 'gfs2-4.7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  GFS2: Refactor gfs2_remove_from_journal
  GFS2: Remove allocation parms from gfs2_rbm_find
  gfs2: use inode_lock/unlock instead of accessing i_mutex directly
  GFS2: Add calls to gfs2_holder_uninit in two error handlers
  GFS2: Don't dereference inode in gfs2_inode_lookup until it's valid
  GFS2: fs/gfs2/glock.c: Deinline do_error, save 1856 bytes
  gfs2: Use gfs2 wrapper to sync inode before calling generic_file_splice_read()
  GFS2: Get rid of dead code in inode_go_demote_ok
  GFS2: ignore unlock failures after withdraw
2016-05-20 | net: sock: move ->sk_shutdown out of bitfields | Andrey Ryabinin | 1 file | -1/+8
->sk_shutdown bits share one bitfield with some other bits in sock struct, such as ->sk_no_check_[r,t]x, ->sk_userlocks ... sock_setsockopt() may write to these bits, while holding the socket lock. In case of AF_UNIX sockets, we change ->sk_shutdown bits while holding only unix_state_lock(). So concurrent setsockopt() and shutdown() may lead to corrupting these bits. Fix this by moving ->sk_shutdown bits out of bitfield into a separate byte. This will not change the 'struct sock' size since ->sk_shutdown moved into previously unused 16-bit hole. Signed-off-by: Andrey Ryabinin <[email protected]> Suggested-by: Hannes Frederic Sowa <[email protected]> Acked-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
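In struct sock terms the change looks roughly like this (abbreviated sketch; surrounding members omitted):

	struct sock {
		/* ... */

		/* Previously sk_shutdown was packed into a bitfield shared with
		 * sk_no_check_tx/rx, sk_userlocks, sk_protocol and sk_type, so an
		 * unlocked read-modify-write could clobber the neighbouring bits.
		 * Give it its own byte, placed in a previously unused 16-bit hole
		 * so sizeof(struct sock) does not grow. */
		u8			sk_shutdown;

		/* ... */
	};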
2016-05-20 | Merge branch 'GREoIPV6' | David S. Miller | 36 files | -310/+737
Tom Herbert says: ==================== ipv6: Enable GUEoIPv6 and more fixes for v6 tunneling This patch set: - Fixes GRE6 to process translate flags correctly from configuration - Adds support for GSO and GRO for ip6ip6 and ip4ip6 - Add support for FOU and GUE in IPv6 - Support GRE, ip6ip6 and ip4ip6 over FOU/GUE - Fixes ip6_input to deal with UDP encapsulations - Some other minor fixes v2: - Removed a check of GSO types in MPLS - Define GSO type SKB_GSO_IPXIP6 and SKB_GSO_IPXIP4 (based on input from Alexander) - Don't define GSO types specifically for IP6IP6 and IP4IP6, above fix makes that unnecessary - Don't bother clearing encapsulation flag in UDP tunnel segment (another item suggested by Alexander). v3: - Address some minor comments from Alexander v4: - Rebase on changes to fix IP TX tunnels - Fix MTU issues in ip4ip6, ip6ip6 - Add test data for above v5: - Address feedback from Shmulik Ladkani regarding extension header code that does not return next header but in instead relies on returning value via nhoff. Solution here is to fix EH processing to return nexthdr value. - Refactored IPv4 encaps so that we won't need to create a ip6_tunnel_core.c when adding encap support IPv6. v6: - Fix build issues with regard to new GSO constants - FIx MTU calculation issues ip6_tunnel.c pointed out byt ALex - Add encap_hlen into headroom for GREv6 to work with FOU/GUE v7: - Added skb_set_inner_ipproto to ip4ip6 and ip6ip6 - Clarified max_headroom in ip6_tnl_xmit - Set features for IPv6 tunnels - Other cleanup suggested by Alexander - Above fixes throughput performance issues in ip4ip6 and ip6ip6, updated test results to reflect that Tested: Various cases of IP tunnels with netperf TCP_STREAM and TCP_RR. - IPv4/GRE/GUE/IPv6 with RCO 1 TCP_STREAM 6616 Mbps 200 TCP_RR 1244043 tps 141/243/446 90/95/99% latencies 86.61% CPU utilization - IPv6/GRE/GUE/IPv6 with RCO 1 TCP_STREAM 6940 Mbps 200 TCP_RR 1270903 tps 138/236/440 90/95/99% latencies 87.51% CPU utilization - IP6IP6 1 TCP_STREAM 5307 Mbps 200 TCP_RR 498981 tps 388/498/631 90/95/99% latencies 19.75% CPU utilization (1 CPU saturated) - IP6IP6/GUE with RCO 1 TCP_STREAM 5575 Mbps 200 TCP_RR 1233818 tps 143/244/451 90/95/99% latencies 87.57 CPU utilization - IP4IP6 1 TCP_STREAM 5235 Mbps 200 TCP_RR 763774 tps 250/318/466 90/95/99% latencies 35.25% CPU utilization (1 CPU saturated) - IP4IP6/GUE with RCO 1 TCP_STREAM 5337 Mbps 200 TCP_RR 1196385 tps 148/251/460 90/95/99% latencies 87.56 CPU utilization - GRE with keyid 200 TCP_RR 744173 tps 258/332/461 90/95/99% latencies 34.59% CPU utilization (1 CPU saturated) ==================== Signed-off-by: David S. Miller <[email protected]>
2016-05-20 | ipv6: Don't reset inner headers in ip6_tnl_xmit | Tom Herbert | 1 file | -5/+0
Since iptunnel_handle_offloads() is called in all paths we can probably drop the block in ip6_tnl_xmit that was checking for skb->encapsulation and resetting the inner headers. Signed-off-by: Tom Herbert <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-05-20 | ip4ip6: Support for GSO/GRO | Tom Herbert | 4 files | -6/+49
Signed-off-by: Tom Herbert <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-05-20 | ip6ip6: Support for GSO/GRO | Tom Herbert | 2 files | -3/+26
Signed-off-by: Tom Herbert <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-05-20 | ipv6: Set features for IPv6 tunnels | Tom Herbert | 1 file | -0/+9
Need to set dev features, use same values that are used in GREv6. Signed-off-by: Tom Herbert <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-05-20 | ip6_tunnel: Add support for fou/gue encapsulation | Tom Herbert | 1 file | -0/+72
Add netlink and setup for encapsulation Signed-off-by: Tom Herbert <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-05-20 | ip6_gre: Add support for fou/gue encapsulation | Tom Herbert | 1 file | -4/+75
Add netlink and setup for encapsulation Signed-off-by: Tom Herbert <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-05-20 | fou: Add encap ops for IPv6 tunnels | Tom Herbert | 3 files | -1/+142
This patch add a new fou6 module that provides encapsulation operations for IPv6. Signed-off-by: Tom Herbert <[email protected]> Signed-off-by: David S. Miller <[email protected]>