path: root/include/linux
2012-10-10  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (Linus Torvalds; 150 files, -6034/+53)
Pull networking updates from David Miller: 1) UAPI changes for networking from David Howells 2) A netlink dump is an operation we can sleep within, and therefore we need to make sure the dump provider module doesn't disappear on us meanwhile. Fix from Gao Feng. 3) Now that tunnels support GRO, we have to be more careful in skb_gro_reset_offset() otherwise we OOPS, from Eric Dumazet. 4) We can end up processing packets for VLANs we aren't actually configured to be on, fix from Florian Zumbiehl. 5) Fix routing cache removal regression in redirects and IPVS. The core issue on the IPVS side is that it wants to rewrite who the nexthop is and we have to explicitly accomodate that case. From Julian Anastasov. 6) Error code return fixes all over the networking drivers from Peter Senna Tschudin. 7) Fix routing cache removal regressions in IPSEC, from Steffen Klassert. 8) Fix deadlock in RDS during pings, from Jeff Liu. 9) Neighbour packet queue can trigger skb_under_panic() because we do not reset the network header of the SKB in the right spot. From Ramesh Nagappa. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (61 commits) RDS: fix rds-ping spinlock recursion netdev/phy: Prototype of_mdio_find_bus() farsync: fix support for over 30 cards be2net: Remove code that stops further access to BE NIC based on UE bits pch_gbe: Fix build error by selecting all the possible dependencies. e1000e: add device IDs for i218 ixgbe/ixgbevf: Limit maximum jumbo frame size to 9.5K to avoid Tx hangs ixgbevf: Set the netdev number of Tx queues UAPI: (Scripted) Disintegrate include/linux/tc_ematch UAPI: (Scripted) Disintegrate include/linux/tc_act UAPI: (Scripted) Disintegrate include/linux/netfilter_ipv6 UAPI: (Scripted) Disintegrate include/linux/netfilter_ipv4 UAPI: (Scripted) Disintegrate include/linux/netfilter_bridge UAPI: (Scripted) Disintegrate include/linux/netfilter_arp UAPI: (Scripted) Disintegrate include/linux/netfilter/ipset UAPI: (Scripted) Disintegrate include/linux/netfilter UAPI: (Scripted) Disintegrate include/linux/isdn UAPI: (Scripted) Disintegrate include/linux/caif net: fix typo in freescale/ucc_geth.c vxlan: fix more sparse warnings ...
2012-10-10  Merge branch 'next' of git://git.infradead.org/users/vkoul/slave-dma  (Linus Torvalds; 3 files, -0/+55)
Pull slave-dmaengine updates from Vinod Koul: "This time we have Andy updates on dw_dmac which is attempting to make this IP block available as PCI and platform device though not fully complete this time. We also have TI EDMA moving the dma driver to use dmaengine APIs, also have a new driver for mmp-tdma, along with bunch of small updates. Now for your excitement the merge is little unusual here, while merging the auto merge on linux-next picks wrong choice for pl330 (drivers/dma/pl330.c) and this causes build failure. The correct resolution is in linux-next. (DMA: PL330: Fix build error) I didn't back merge your tree this time as you are better than me so no point in doing that for me :)" Fixed the pl330 conflict as in linux-next, along with trivial header file conflicts due to changed includes. * 'next' of git://git.infradead.org/users/vkoul/slave-dma: (29 commits) dma: tegra: fix interrupt name issue with apb dma. dw_dmac: fix a regression in dwc_prep_dma_memcpy dw_dmac: introduce software emulation of LLP transfers dw_dmac: autoconfigure data_width or get it via platform data dw_dmac: autoconfigure block_size or use platform data dw_dmac: get number of channels from hardware if possible dw_dmac: fill optional encoded parameters in register structure dw_dmac: mark dwc_dump_chan_regs as inline DMA: PL330: return ENOMEM instead of 0 from pl330_alloc_chan_resources DMA: PL330: Remove redundant runtime_suspend/resume functions DMA: PL330: Remove controller clock enable/disable dmaengine: use kmem_cache_zalloc instead of kmem_cache_alloc/memset DMA: PL330: Set the capability of pdm0 and pdm1 as DMA_PRIVATE ARM: EXYNOS: Set the capability of pdm0 and pdm1 as DMA_PRIVATE dma: tegra: use list_move_tail instead of list_del/list_add_tail mxs/dma: Enlarge the CCW descriptor area to 4 pages dw_dmac: utilize slave_id to pass request line dmaengine: mmp_tdma: add dt support dmaengine: mmp-pdma support spi: davici - make davinci select edma ...
2012-10-10  Merge tag 'mmc-merge-for-3.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc  (Linus Torvalds; 7 files, -17/+48)
Pull MMC updates from Chris Ball: "Core: - Add DT properties for card detection (broken-cd, cd-gpios, non-removable) - Don't poll non-removable devices - Fixup/rework eMMC sleep mode/"power off notify" feature - Support eMMC background operations (BKOPS). To set the one-time programmable fuse that enables bkops on an eMMC that doesn't already have it set, you can use the "mmc bkops enable" command in: git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc-utils.git Drivers: - atmel-mci, dw_mmc, pxa-mci, dove, s3c, spear: Add device tree support - bfin_sdh: Add support for the controller in bf60x - dw_mmc: Support Samsung Exynos SoCs - eSDHC: Add ADMA support - sdhci: Support testing a cd-gpio (from slot-gpio) instead of presence bit - sdhci-pltfm: Support broken-cd DT property - tegra: Convert to only supporting DT (mach-tegra has gone DT-only)" * tag 'mmc-merge-for-3.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc: (67 commits) mmc: core: Fixup broken suspend and eMMC4.5 power off notify mmc: sdhci-spear: Add clk_{un}prepare() support mmc: sdhci-spear: add device tree bindings mmc: sdhci-s3c: Add clk_(enable/disable) in runtime suspend/resume mmc: core: Replace MMC_CAP2_BROKEN_VOLTAGE with test for fixed regulator mmc: sdhci-pxav3: Use sdhci_get_of_property for parsing DT quirks mmc: dt: Support "broken-cd" property in sdhci-pltfm mmc: sdhci-s3c: fix the wrong number of max bus clocks mmc: sh-mmcif: avoid oops on spurious interrupts mmc: sh-mmcif: properly handle MMC_WRITE_MULTIPLE_BLOCK completion IRQ mmc: sdhci-s3c: Fix crash on module insertion for second time mmc: sdhci-s3c: Enable only required bus clock mmc: Revert "mmc: dw_mmc: Add check for IDMAC configuration" mmc: mxcmmc: fix bug that may block a data transfer forever mmc: omap_hsmmc: Pass on the suspend failure to the PM core mmc: atmel-mci: AP700x PDC is not connected to MCI mmc: atmel-mci: DMA can be used with other controllers mmc: mmci: use clk_prepare_enable and clk_disable_unprepare mmc: sdhci-s3c: Add device tree support mmc: dw_mmc: add support for exynos specific implementation of dw-mshc ...
2012-10-09  Merge tag 'disintegrate-isdn-20121009' of git://git.infradead.org/users/dhowells/linux-headers  (David S. Miller; 2 files, -116/+0)
UAPI Disintegration 2012-10-09 Signed-off-by: David S. Miller <[email protected]>
2012-10-09  Merge tag 'disintegrate-net-20121009' of git://git.infradead.org/users/dhowells/linux-headers  (David S. Miller; 144 files, -5879/+21)
UAPI Disintegration 2012-10-09 Signed-off-by: David S. Miller <[email protected]>
2012-10-09  Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux  (David S. Miller; 61 files, -1174/+1894)
Pulled mainline in order to get the UAPI infrastructure already merged before I pull in David Howells's UAPI trees for networking. Signed-off-by: David S. Miller <[email protected]>
2012-10-09  Merge tag 'disintegrate-mtd-20121009' of git://git.infradead.org/users/dhowells/linux-headers  (David Woodhouse; 456 files, -4791/+14250)
UAPI Disintegration 2012-10-09 Conflicts: MAINTAINERS arch/arm/configs/bcmring_defconfig arch/arm/mach-imx/clk-imx51-imx53.c drivers/mtd/nand/Kconfig drivers/mtd/nand/bcm_umi_nand.c drivers/mtd/nand/nand_bcm_umi.h drivers/mtd/nand/orion_nand.c
2012-10-09  UAPI: (Scripted) Disintegrate include/linux/tc_ematch  (David Howells; 5 files, -153/+0)
Signed-off-by: David Howells <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Acked-by: Michael Kerrisk <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Acked-by: Dave Jones <[email protected]>
2012-10-09  UAPI: (Scripted) Disintegrate include/linux/tc_act  (David Howells; 8 files, -225/+0)
Signed-off-by: David Howells <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Acked-by: Michael Kerrisk <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Acked-by: Dave Jones <[email protected]>
2012-10-09  UAPI: (Scripted) Disintegrate include/linux/netfilter_ipv6  (David Howells; 13 files, -519/+2)
Signed-off-by: David Howells <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Acked-by: Michael Kerrisk <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Acked-by: Dave Jones <[email protected]>
2012-10-09  UAPI: (Scripted) Disintegrate include/linux/netfilter_ipv4  (David Howells; 11 files, -463/+2)
Signed-off-by: David Howells <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Acked-by: Michael Kerrisk <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Acked-by: Dave Jones <[email protected]>
2012-10-09  UAPI: (Scripted) Disintegrate include/linux/netfilter_bridge  (David Howells; 19 files, -783/+2)
Signed-off-by: David Howells <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Acked-by: Michael Kerrisk <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Acked-by: Dave Jones <[email protected]>
2012-10-09  UAPI: (Scripted) Disintegrate include/linux/netfilter_arp  (David Howells; 3 files, -227/+1)
Signed-off-by: David Howells <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Acked-by: Michael Kerrisk <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Acked-by: Dave Jones <[email protected]>
2012-10-09  UAPI: (Scripted) Disintegrate include/linux/netfilter/ipset  (David Howells; 5 files, -272/+6)
Signed-off-by: David Howells <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Acked-by: Michael Kerrisk <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Acked-by: Dave Jones <[email protected]>
2012-10-09  UAPI: (Scripted) Disintegrate include/linux/netfilter  (David Howells; 77 files, -3007/+8)
Signed-off-by: David Howells <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Acked-by: Michael Kerrisk <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Acked-by: Dave Jones <[email protected]>
2012-10-09  UAPI: (Scripted) Disintegrate include/linux/isdn  (David Howells; 2 files, -116/+0)
Signed-off-by: David Howells <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Acked-by: Michael Kerrisk <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Acked-by: Dave Jones <[email protected]>
2012-10-09  UAPI: (Scripted) Disintegrate include/linux/caif  (David Howells; 3 files, -230/+0)
Signed-off-by: David Howells <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Acked-by: Michael Kerrisk <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Acked-by: Dave Jones <[email protected]>
2012-10-09  Merge branch 'akpm' (Andrew's patch-bomb)  (Linus Torvalds; 28 files, -404/+682)
Merge patches from Andrew Morton: "A few misc things and very nearly all of the MM tree. A tremendous amount of stuff (again), including a significant rbtree library rework." * emailed patches from Andrew Morton <[email protected]>: (160 commits) sparc64: Support transparent huge pages. mm: thp: Use more portable PMD clearing sequenece in zap_huge_pmd(). mm: Add and use update_mmu_cache_pmd() in transparent huge page code. sparc64: Document PGD and PMD layout. sparc64: Eliminate PTE table memory wastage. sparc64: Halve the size of PTE tables sparc64: Only support 4MB huge pages and 8KB base pages. memory-hotplug: suppress "Trying to free nonexistent resource <XXXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY>" warning mm: memcg: clean up mm_match_cgroup() signature mm: document PageHuge somewhat mm: use %pK for /proc/vmallocinfo mm, thp: fix mlock statistics mm, thp: fix mapped pages avoiding unevictable list on mlock memory-hotplug: update memory block's state and notify userspace memory-hotplug: preparation to notify memory block's state at memory hot remove mm: avoid section mismatch warning for memblock_type_name make GFP_NOTRACK definition unconditional cma: decrease cc.nr_migratepages after reclaiming pagelist CMA: migrate mlocked pages kpageflags: fix wrong KPF_THP on non-huge compound pages ...
2012-10-09  mm: memcg: clean up mm_match_cgroup() signature  (Johannes Weiner; 1 file, -7/+7)
It really should return a boolean for match/no match. And since it takes a memcg, not a cgroup, fix that parameter name as well. [[email protected]: mm_match_cgroup() is not a macro] Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
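For illustration, the reworked helper is presumably a bool-returning inline along these lines (a simplified sketch: the real check also has to respect the memcg hierarchy, and the exact body is not shown in this log):

    static inline bool mm_match_cgroup(struct mm_struct *mm,
                                       struct mem_cgroup *memcg)
    {
            struct mem_cgroup *task_memcg;
            bool match = false;

            rcu_read_lock();
            task_memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
            if (task_memcg)
                    match = (task_memcg == memcg);  /* simplified match */
            rcu_read_unlock();
            return match;
    }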
2012-10-09  mm, thp: fix mapped pages avoiding unevictable list on mlock  (David Rientjes; 1 file, -1/+1)
When a transparent hugepage is mapped and it is included in an mlock() range, follow_page() incorrectly avoids setting the page's mlock bit and moving it to the unevictable lru. This is evident if you try to mlock(), munlock(), and then mlock() a range again. Currently: #define MAP_SIZE (4 << 30) /* 4GB */ void *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0); mlock(ptr, MAP_SIZE); $ grep -E "Unevictable|Inactive\(anon" /proc/meminfo Inactive(anon): 6304 kB Unevictable: 4213924 kB munlock(ptr, MAP_SIZE); Inactive(anon): 4186252 kB Unevictable: 19652 kB mlock(ptr, MAP_SIZE); Inactive(anon): 4198556 kB Unevictable: 21684 kB Notice that less than 2MB was added to the unevictable list; this is because these pages in the range are not transparent hugepages since the 4GB range was allocated with mmap() and has no specific alignment. If posix_memalign() were used instead, unevictable would not have grown at all on the second mlock(). The fix is to call mlock_vma_page() so that the mlock bit is set and the page is added to the unevictable list. With this patch: mlock(ptr, MAP_SIZE); Inactive(anon): 4056 kB Unevictable: 4213940 kB munlock(ptr, MAP_SIZE); Inactive(anon): 4198268 kB Unevictable: 19636 kB mlock(ptr, MAP_SIZE); Inactive(anon): 4008 kB Unevictable: 4213940 kB Signed-off-by: David Rientjes <[email protected]> Acked-by: Hugh Dickins <[email protected]> Reviewed-by: Andrea Arcangeli <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michel Lespinasse <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09  memory-hotplug: update memory block's state and notify userspace  (Wen Congyang; 1 file, -0/+2)
remove_memory() will be called when hot removing a memory device. But even though the memory is offlined that way, there is currently no way to notice it. So the patch updates the memory block's state and sends a notification to userspace. Additionally, the memory device may contain more than one memory block; if a block has already been offlined, __offline_pages() will fail, so we should try to offline one memory block at a time. Thus remove_memory() also checks each memory block's state, and there is no need to check a memory block's state before calling remove_memory(). Signed-off-by: Wen Congyang <[email protected]> Signed-off-by: Yasuaki Ishimatsu <[email protected]> Cc: David Rientjes <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Len Brown <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Minchan Kim <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09  memory-hotplug: preparation to notify memory block's state at memory hot remove  (Wen Congyang; 1 file, -0/+1)
remove_memory() is called in two cases: 1. echo offline >/sys/devices/system/memory/memoryXX/state 2. hot removing a memory device In the 1st case, the memory block's state is changed and a notification that the state changed is sent to userland after calling remove_memory(), so the user can notice that the memory block changed. But in the 2nd case, the memory block's state is not changed and no notification is sent to userspace either, even though remove_memory() is called, so the user cannot notice the change. To add the notification at memory hot remove, this patch only does the preparation: the 1st case uses offline_pages() for offlining memory; the 2nd case uses remove_memory() for offlining memory, changing the memory block's state and notifying userspace. The notification itself is not yet implemented in remove_memory(). Signed-off-by: Wen Congyang <[email protected]> Signed-off-by: Yasuaki Ishimatsu <[email protected]> Cc: David Rientjes <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Len Brown <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Minchan Kim <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09  make GFP_NOTRACK definition unconditional  (Glauber Costa; 1 file, -4/+0)
There was a general sentiment in a recent discussion (See https://lkml.org/lkml/2012/9/18/258) that the __GFP flags should be defined unconditionally. Currently, the only offender is GFP_NOTRACK, which is conditional to KMEMCHECK. Signed-off-by: Glauber Costa <[email protected]> Acked-by: Christoph Lameter <[email protected]> Cc: Mel Gorman <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
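In other words, the definition goes from being wrapped in a KMEMCHECK conditional to being defined unconditionally, roughly like this (a sketch of the shape of the change, not the verbatim gfp.h hunk):

    /* Before: the flag only exists when kmemcheck is configured in. */
    #ifdef CONFIG_KMEMCHECK
    #define __GFP_NOTRACK   ((__force gfp_t)___GFP_NOTRACK)
    #else
    #define __GFP_NOTRACK   ((__force gfp_t)0)
    #endif

    /* After: always defined; it is simply ignored when kmemcheck is off. */
    #define __GFP_NOTRACK   ((__force gfp_t)___GFP_NOTRACK)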
2012-10-09  CMA: migrate mlocked pages  (Minchan Kim; 1 file, -0/+2)
Presently CMA cannot migrate mlocked pages so it ends up failing to allocate contiguous memory space. This patch makes mlocked pages be migrated out. Of course, it can affect realtime processes but in CMA usecase, contiguous memory allocation failing is far worse than access latency to an mlocked page being variable while CMA is running. If someone wants to make the system realtime, he shouldn't enable CMA because stalls can still happen at random times. [[email protected]: tweak comment text, per Mel] Signed-off-by: Minchan Kim <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Bartlomiej Zolnierkiewicz <[email protected]> Cc: Marek Szyprowski <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09  mm: remove unevictable_pgs_mlockfreed  (Hugh Dickins; 1 file, -1/+0)
Simply remove UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed line from /proc/vmstat: Johannes and Mel point out that it was very unlikely to have been used by any tool, and of course we can restore it easily enough if that turns out to be wrong. Signed-off-by: Hugh Dickins <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Ying Han <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09  memory-hotplug: fix zone stat mismatch  (Minchan Kim; 1 file, -0/+4)
During memory hotplug, I found NR_ISOLATED_[ANON|FILE] increasing, causing the kernel to hang. When the system doesn't have enough free pages, it enters reclaim but never reclaims any pages because too_many_isolated() is true, and it loops forever. The cause is that when we do a memory hot-add after a memory remove, __zone_pcp_update() clears a zone's ZONE_STAT_ITEMS in setup_pageset() although the vm_stat_diff of all CPUs still holds values. In addition, when we offline all pages of the zone, we reset them in zone_pcp_reset() without draining, so we lose some zone stat items. Reviewed-by: Wen Congyang <[email protected]> Signed-off-by: Minchan Kim <[email protected]> Cc: Kamezawa Hiroyuki <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09  mm: move all mmu notifier invocations to be done outside the PT lock  (Sagi Grimberg; 1 file, -47/+0)
In order to allow sleeping during mmu notifier calls, we need to avoid invoking them under the page table spinlock. This patch solves the problem by calling invalidate_page notification after releasing the lock (but before freeing the page itself), or by wrapping the page invalidation with calls to invalidate_range_begin and invalidate_range_end. To prevent accidental changes to the invalidate_range_end arguments after the call to invalidate_range_begin, the patch introduces a convention of saving the arguments in consistently named locals: unsigned long mmun_start; /* For mmu_notifiers */ unsigned long mmun_end; /* For mmu_notifiers */ ... mmun_start = ... mmun_end = ... mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end); ... mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end); The patch changes code to use this convention for all calls to mmu_notifier_invalidate_range_start/end, except those where the calls are close enough so that anyone who glances at the code can see the values aren't changing. This patchset is a preliminary step towards on-demand paging design to be added to the RDMA stack. Why do we want on-demand paging for Infiniband? Applications register memory with an RDMA adapter using system calls, and subsequently post IO operations that refer to the corresponding virtual addresses directly to HW. Until now, this was achieved by pinning the memory during the registration calls. The goal of on demand paging is to avoid pinning the pages of registered memory regions (MRs). This will allow users the same flexibility they get when swapping any other part of their processes address spaces. Instead of requiring the entire MR to fit in physical memory, we can allow the MR to be larger, and only fit the current working set in physical memory. Why should anyone care? What problems are users currently experiencing? This can make programming with RDMA much simpler. Today, developers that are working with more data than their RAM can hold need either to deregister and reregister memory regions throughout their process's life, or keep a single memory region and copy the data to it. On demand paging will allow these developers to register a single MR at the beginning of their process's life, and let the operating system manage which pages needs to be fetched at a given time. In the future, we might be able to provide a single memory access key for each process that would provide the entire process's address as one large memory region, and the developers wouldn't need to register memory regions at all. Is there any prospect that any other subsystems will utilise these infrastructural changes? If so, which and how, etc? As for other subsystems, I understand that XPMEM wanted to sleep in MMU notifiers, as Christoph Lameter wrote at http://lkml.indiana.edu/hypermail/linux/kernel/0802.1/0460.html and perhaps Andrea knows about other use cases. Scheduling in mmu notifications is required since we need to sync the hardware with the secondary page tables change. A TLB flush of an IO device is inherently slower than a CPU TLB flush, so our design works by sending the invalidation request to the device, and waiting for an interrupt before exiting the mmu notifier handler. Avi said: kvm may be a buyer. kvm::mmu_lock, which serializes guest page faults, also protects long operations such as destroying large ranges. It would be good to convert it into a spinlock, but as it is used inside mmu notifiers, this cannot be done. 
(there are alternatives, such as keeping the spinlock and using a generation counter to do the teardown in O(1), which is what the "may" is doing up there). [[email protected] speed tweak in hugetlb_cow(), cleanups] Signed-off-by: Andrea Arcangeli <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Signed-off-by: Haggai Eran <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Xiao Guangrong <[email protected]> Cc: Or Gerlitz <[email protected]> Cc: Haggai Eran <[email protected]> Cc: Shachar Raindel <[email protected]> Cc: Liran Liss <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Avi Kivity <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09  mm, numa: reclaim from all nodes within reclaim distance  (David Rientjes; 1 file, -0/+1)
RECLAIM_DISTANCE represents the distance between nodes at which it is deemed too costly to allocate from; it's preferred to try to reclaim from a local zone before falling back to allocating on a remote node with such a distance. To do this, zone_reclaim_mode is set if the distance between any two nodes on the system is greather than this distance. This, however, ends up causing the page allocator to reclaim from every zone regardless of its affinity. What we really want is to reclaim only from zones that are closer than RECLAIM_DISTANCE. This patch adds a nodemask to each node that represents the set of nodes that are within this distance. During the zone iteration, if the bit for a zone's node is set for the local node, then reclaim is attempted; otherwise, the zone is skipped. [[email protected]: fix CONFIG_NUMA=n build] Signed-off-by: David Rientjes <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Minchan Kim <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
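The single line added under include/linux is presumably the per-node nodemask itself; a hedged sketch of the idea follows (the field placement and the iteration check are assumptions for illustration, not the exact patch):

    /* In struct pglist_data: the set of nodes close enough to reclaim from. */
    nodemask_t reclaim_nodes;

    /* In the zonelist iteration: skip reclaim for zones whose node is not
     * in the local node's reclaim_nodes mask. */
    if (zone_reclaim_mode &&
        !node_isset(zone_to_nid(zone),
                    NODE_DATA(numa_node_id())->reclaim_nodes))
            continue;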
2012-10-09  mm: remove free_page_mlock  (Hugh Dickins; 1 file, -1/+1)
We should not be seeing non-0 unevictable_pgs_mlockfreed any longer. So remove free_page_mlock() from the page freeing paths: __PG_MLOCKED is already in PAGE_FLAGS_CHECK_AT_FREE, so free_pages_check() will now be checking it, reporting "BUG: Bad page state" if it's ever found set. Comment UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed always 0. Signed-off-by: Hugh Dickins <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Ying Han <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09  mm: remove vma arg from page_evictable  (Hugh Dickins; 1 file, -1/+1)
page_evictable(page, vma) is an irritant: almost all its callers pass NULL for vma. Remove the vma arg and use mlocked_vma_newpage(vma, page) explicitly in the couple of places it's needed. But in those places we don't even need page_evictable() itself! They're dealing with a freshly allocated anonymous page, which has no "mapping" and cannot be mlocked yet. Signed-off-by: Hugh Dickins <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Ying Han <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09  mm: fix-up zone present pages  (Jianguo Wu; 1 file, -0/+4)
I think zone->present_pages indicates pages that buddy system can management, it should be: zone->present_pages = spanned pages - absent pages - bootmem pages, but is now: zone->present_pages = spanned pages - absent pages - memmap pages. spanned pages: total size, including holes. absent pages: holes. bootmem pages: pages used in system boot, managed by bootmem allocator. memmap pages: pages used by page structs. This may cause zone->present_pages less than it should be. For example, numa node 1 has ZONE_NORMAL and ZONE_MOVABLE, it's memmap and other bootmem will be allocated from ZONE_MOVABLE, so ZONE_NORMAL's present_pages should be spanned pages - absent pages, but now it also minus memmap pages(free_area_init_core), which are actually allocated from ZONE_MOVABLE. When offlining all memory of a zone, this will cause zone->present_pages less than 0, because present_pages is unsigned long type, it is actually a very large integer, it indirectly caused zone->watermark[WMARK_MIN] becomes a large integer(setup_per_zone_wmarks()), than cause totalreserve_pages become a large integer(calculate_totalreserve_pages()), and finally cause memory allocating failure when fork process(__vm_enough_memory()). [root@localhost ~]# dmesg -bash: fork: Cannot allocate memory I think the bug described in http://marc.info/?l=linux-mm&m=134502182714186&w=2 is also caused by wrong zone present pages. This patch intends to fix-up zone->present_pages when memory are freed to buddy system on x86_64 and IA64 platforms. Signed-off-by: Jianguo Wu <[email protected]> Signed-off-by: Jiang Liu <[email protected]> Reported-by: Petr Tesarik <[email protected]> Tested-by: Petr Tesarik <[email protected]> Cc: "Luck, Tony" <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09  mm/page_alloc: refactor out __alloc_contig_migrate_alloc()  (Minchan Kim; 1 file, -1/+2)
__alloc_contig_migrate_alloc() can be used by memory-hotplug so refactor it out (move + rename as a common name) into page_isolation.c. [[email protected]: checkpatch fixes] Signed-off-by: Minchan Kim <[email protected]> Cc: Kamezawa Hiroyuki <[email protected]> Reviewed-by: Yasuaki Ishimatsu <[email protected]> Acked-by: Michal Nazarewicz <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Wen Congyang <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09  mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity  (Mel Gorman; 2 files, -1/+17)
Compaction caches if a pageblock was scanned and no pages were isolated so that the pageblocks can be skipped in the future to reduce scanning. This information is not cleared by the page allocator based on activity due to the impact it would have to the page allocator fast paths. Hence there is a requirement that something clear the cache or pageblocks will be skipped forever. Currently the cache is cleared if there were a number of recent allocation failures and it has not been cleared within the last 5 seconds. Time-based decisions like this are terrible as they have no relationship to VM activity and is basically a big hammer. Unfortunately, accurate heuristics would add cost to some hot paths so this patch implements a rough heuristic. There are two cases where the cache is cleared. 1. If a !kswapd process completes a compaction cycle (migrate and free scanner meet), the zone is marked compact_blockskip_flush. When kswapd goes to sleep, it will clear the cache. This is expected to be the common case where the cache is cleared. It does not really matter if kswapd happens to be asleep or going to sleep when the flag is set as it will be woken on the next allocation request. 2. If there have been multiple failures recently and compaction just finished being deferred then a process will clear the cache and start a full scan. This situation happens if there are multiple high-order allocation requests under heavy memory pressure. The clearing of the PG_migrate_skip bits and other scans is inherently racy but the race is harmless. For allocations that can fail such as THP, they will simply fail. For requests that cannot fail, they will retry the allocation. Tests indicated that scanning rates were roughly similar to when the time-based heuristic was used and the allocation success rates were similar. Signed-off-by: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Richard Davies <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Avi Kivity <[email protected]> Cc: Rafael Aquini <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09  mm: compaction: Restart compaction from near where it left off  (Mel Gorman; 1 file, -0/+4)
This is almost entirely based on Rik's previous patches and discussions with him about how this might be implemented. Order > 0 compaction stops when enough free pages of the correct page order have been coalesced. When doing subsequent higher order allocations, it is possible for compaction to be invoked many times. However, the compaction code always starts out looking for things to compact at the start of the zone, and for free pages to compact things to at the end of the zone. This can cause quadratic behaviour, with isolate_freepages starting at the end of the zone each time, even though previous invocations of the compaction code already filled up all free memory on that end of the zone. This can cause isolate_freepages to take enormous amounts of CPU with certain workloads on larger memory systems. This patch caches where the migration and free scanner should start from on subsequent compaction invocations using the pageblock-skip information. When compaction starts it begins from the cached restart points and will update the cached restart points until a page is isolated or a pageblock is skipped that would have been scanned by synchronous compaction. Signed-off-by: Mel Gorman <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Richard Davies <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Avi Kivity <[email protected]> Acked-by: Rafael Aquini <[email protected]> Cc: Fengguang Wu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09  mm: compaction: cache if a pageblock was scanned and no pages were isolated  (Mel Gorman; 2 files, -2/+20)
When compaction was implemented it was known that scanning could potentially be excessive. The ideal was that a counter be maintained for each pageblock but maintaining this information would incur a severe penalty due to a shared writable cache line. It has reached the point where the scanning costs are a serious problem, particularly on long-lived systems where a large process starts and allocates a large number of THPs at the same time. Instead of using a shared counter, this patch adds another bit to the pageblock flags called PG_migrate_skip. If a pageblock is scanned by either migrate or free scanner and 0 pages were isolated, the pageblock is marked to be skipped in the future. When scanning, this bit is checked before any scanning takes place and the block skipped if set. The main difficulty with a patch like this is "when to ignore the cached information?" If it's ignored too often, the scanning rates will still be excessive. If the information is too stale then allocations will fail that might have otherwise succeeded. In this patch o CMA always ignores the information o If the migrate and free scanner meet then the cached information will be discarded if it's at least 5 seconds since the last time the cache was discarded o If there are a large number of allocation failures, discard the cache. The time-based heuristic is very clumsy but there are few choices for a better event. Depending solely on multiple allocation failures still allows excessive scanning when THP allocations are failing in quick succession due to memory pressure. Waiting until memory pressure is relieved would cause compaction to continually fail instead of using reclaim/compaction to try allocate the page. The time-based mechanism is clumsy but a better option is not obvious. Signed-off-by: Mel Gorman <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Richard Davies <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Avi Kivity <[email protected]> Acked-by: Rafael Aquini <[email protected]> Cc: Fengguang Wu <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Bartlomiej Zolnierkiewicz <[email protected]> Cc: Kyungmin Park <[email protected]> Cc: Mark Brown <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09revert "mm: have order > 0 compaction start off where it left"Mel Gorman1-4/+0
This reverts commit 7db8889ab05b ("mm: have order > 0 compaction start off where it left") and commit de74f1cc ("mm: have order > 0 compaction start near a pageblock with free pages"). These patches were a good idea and tests confirmed that they massively reduced the amount of scanning but the implementation is complex and tricky to understand. A later patch will cache what pageblocks should be skipped and reimplements the concept of compact_cached_free_pfn on top for both migration and free scanners. Signed-off-by: Mel Gorman <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Richard Davies <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Avi Kivity <[email protected]> Acked-by: Rafael Aquini <[email protected]> Acked-by: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09  mm/memblock: cleanup early_node_map[] related comments  (Wanpeng Li; 1 file, -2/+1)
Commit 0ee332c14518 ("memblock: Kill early_node_map[]") removed early_node_map[]. Clean up the comments to comply with that change. Signed-off-by: Wanpeng Li <[email protected]> Cc: Michal Hocko <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Gavin Shan <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09  readahead: fault retry breaks mmap file read random detection  (Shaohua Li; 1 file, -0/+1)
.fault now can retry. The retry can break state machine of .fault. In filemap_fault, if page is miss, ra->mmap_miss is increased. In the second try, since the page is in page cache now, ra->mmap_miss is decreased. And these are done in one fault, so we can't detect random mmap file access. Add a new flag to indicate .fault is tried once. In the second try, skip ra->mmap_miss decreasing. The filemap_fault state machine is ok with it. I only tested x86, didn't test other archs, but looks the change for other archs is obvious, but who knows :) Signed-off-by: Shaohua Li <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Wu Fengguang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09  atomic: implement generic atomic_dec_if_positive()  (Shaohua Li; 1 file, -0/+25)
The x86 implementation of atomic_dec_if_positive is quite generic, so make it available to all architectures. This is needed for "swap: add a simple detector for inappropriate swapin readahead". [[email protected]: do the "#define foo foo" trick in the conventional manner] Signed-off-by: Shaohua Li <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Michal Simek <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
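The generic version is essentially the x86 cmpxchg loop made available to every architecture; roughly:

    #ifndef atomic_dec_if_positive
    /* Decrement v only if the result stays non-negative; returns the new
     * value, or a negative result if the counter was left untouched. */
    static inline int atomic_dec_if_positive(atomic_t *v)
    {
            int c, old, dec;

            c = atomic_read(v);
            for (;;) {
                    dec = c - 1;
                    if (unlikely(dec < 0))
                            break;
                    old = atomic_cmpxchg(v, c, dec);
                    if (likely(old == c))
                            break;
                    c = old;
            }
            return dec;
    }
    #endif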
2012-10-09  memory-hotplug: fix pages missed by race rather than failing  (Minchan Kim; 1 file, -0/+4)
If race between allocation and isolation in memory-hotplug offline happens, some pages could be in MIGRATE_MOVABLE of free_list although the pageblock's migratetype of the page is MIGRATE_ISOLATE. The race could be detected by get_freepage_migratetype in __test_page_isolated_in_pageblock. If it is detected, now EBUSY gets bubbled all the way up and the hotplug operations fails. But better idea is instead of returning and failing memory-hotremove, move the free page to the correct list at the time it is detected. It could enhance memory-hotremove operation success ratio although the race is really rare. Suggested by Mel Gorman. [[email protected]: small cleanup] Signed-off-by: Minchan Kim <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Reviewed-by: Yasuaki Ishimatsu <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Wen Congyang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09  mm: remain migratetype in freed page  (Minchan Kim; 1 file, -2/+2)
The page allocator caches the pageblock information in page->private while the page is on the PCP freelists, but this is overwritten with the order of the page when it is freed to the buddy allocator. This patch stores the migratetype of the page in the page->index field instead, so that it is available the whole time the page remains on a free_list. The patch adds a new call site in __free_pages_ok(), which adds a little overhead, but that path is only taken for high-order allocations, so the cost should be negligible. Signed-off-by: Minchan Kim <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Reviewed-by: Yasuaki Ishimatsu <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Wen Congyang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09  mm: page_alloc: use get_freepage_migratetype() instead of page_private()  (Minchan Kim; 1 file, -0/+12)
The page allocator uses set_page_private and page_private for handling migratetype when it frees page. Let's replace them with [set|get] _freepage_migratetype to make it more clear. Signed-off-by: Minchan Kim <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Reviewed-by: Yasuaki Ishimatsu <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: Wen Congyang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
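Taken together with the previous patch (which keeps the migratetype in page->index while the page sits on a free list), the helpers are presumably thin wrappers of this shape (a sketch, not the verbatim hunk):

    /* Record/read the migratetype of a page that sits on a buddy free list. */
    static inline void set_freepage_migratetype(struct page *page,
                                                int migratetype)
    {
            page->index = migratetype;
    }

    static inline int get_freepage_migratetype(struct page *page)
    {
            return page->index;
    }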
2012-10-09  cma: count free CMA pages  (Bartlomiej Zolnierkiewicz; 2 files, -0/+9)
Add NR_FREE_CMA_PAGES counter to be later used for checking watermark in __zone_watermark_ok(). For simplicity and to avoid #ifdef hell make this counter always available (not only when CONFIG_CMA=y). [[email protected]: use conventional migratetype naming] Signed-off-by: Bartlomiej Zolnierkiewicz <[email protected]> Signed-off-by: Kyungmin Park <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09  mm: cma: discard clean pages during contiguous allocation instead of migration  (Minchan Kim; 1 file, -10/+11)
Drop clean cache pages instead of migration during alloc_contig_range() to minimise allocation latency by reducing the amount of migration that is necessary. It's useful for CMA because latency of migration is more important than evicting the background process's working set. In addition, as pages are reclaimed then fewer free pages for migration targets are required so it avoids memory reclaiming to get free pages, which is a contributory factor to increased latency. I measured elapsed time of __alloc_contig_migrate_range() which migrates 10M in 40M movable zone in QEMU machine. Before - 146ms, After - 7ms [[email protected]: fix nommu build] Signed-off-by: Mel Gorman <[email protected]> Signed-off-by: Minchan Kim <[email protected]> Reviewed-by: Mel Gorman <[email protected]> Cc: Marek Szyprowski <[email protected]> Acked-by: Michal Nazarewicz <[email protected]> Cc: Rik van Riel <[email protected]> Tested-by: Kyungmin Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09  mm: avoid taking rmap locks in move_ptes()  (Michel Lespinasse; 1 file, -2/+4)
During mremap(), the destination VMA is generally placed after the original vma in rmap traversal order: in move_vma(), we always have new_pgoff >= vma->vm_pgoff, and as a result new_vma->vm_pgoff >= vma->vm_pgoff unless vma_merge() merged the new vma with an adjacent one. When the destination VMA is placed after the original in rmap traversal order, we can avoid taking the rmap locks in move_ptes(). Essentially, this reintroduces the optimization that had been disabled in "mm anon rmap: remove anon_vma_moveto_tail". The difference is that we don't try to impose the rmap traversal order; instead we just rely on things being in the desired order in the common case and fall back to taking locks in the uncommon case. Also we skip the i_mmap_mutex in addition to the anon_vma lock: in both cases, the vmas are traversed in increasing vm_pgoff order with ties resolved in tree insertion order. Signed-off-by: Michel Lespinasse <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Daniel Santos <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09  mm: add CONFIG_DEBUG_VM_RB build option  (Michel Lespinasse; 2 files, -0/+6)
Add a CONFIG_DEBUG_VM_RB build option for the previously existing DEBUG_MM_RB code. Now that Andi Kleen modified it to avoid using recursive algorithms, we can expose it a bit more. Also extend this code to validate_mm() after stack expansion, and to check that the vma's start and last pgoffs have not changed since the nodes were inserted on the anon vma interval tree (as it is important that the nodes be reindexed after each such update). Signed-off-by: Michel Lespinasse <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Daniel Santos <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09  mm anon rmap: replace same_anon_vma linked list with an interval tree.  (Michel Lespinasse; 2 files, -5/+20)
When a large VMA (anon or private file mapping) is first touched, which will populate its anon_vma field, and then split into many regions through the use of mprotect(), the original anon_vma ends up linking all of the vmas on a linked list. This can cause rmap to become inefficient, as we have to walk potentially thousands of irrelevent vmas before finding the one a given anon page might fall into. By replacing the same_anon_vma linked list with an interval tree (where each avc's interval is determined by its vma's start and last pgoffs), we can make rmap efficient for this use case again. While the change is large, all of its pieces are fairly simple. Most places that were walking the same_anon_vma list were looking for a known pgoff, so they can just use the anon_vma_interval_tree_foreach() interval tree iterator instead. The exception here is ksm, where the page's index is not known. It would probably be possible to rework ksm so that the index would be known, but for now I have decided to keep things simple and just walk the entirety of the interval tree there. When updating vma's that already have an anon_vma assigned, we must take care to re-index the corresponding avc's on their interval tree. This is done through the use of anon_vma_interval_tree_pre_update_vma() and anon_vma_interval_tree_post_update_vma(), which remove the avc's from their interval tree before the update and re-insert them after the update. The anon_vma stays locked during the update, so there is no chance that rmap would miss the vmas that are being updated. Signed-off-by: Michel Lespinasse <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Daniel Santos <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
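With the interval tree in place, a typical rmap walk for a page whose pgoff is known becomes roughly the following (the iterator name is from the changelog; the surrounding details are illustrative):

    struct anon_vma_chain *avc;
    pgoff_t pgoff = page->index;    /* pgoff of the anon page being walked */

    anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
            struct vm_area_struct *vma = avc->vma;
            /* Only vmas whose [start, last] pgoff interval covers this
             * page are visited, instead of every vma on a long list. */
    }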
2012-10-09  mm anon rmap: remove anon_vma_moveto_tail  (Michel Lespinasse; 1 file, -1/+0)
mremap() had a clever optimization where move_ptes() did not take the anon_vma lock to avoid a race with anon rmap users such as page migration. Instead, the avc's were ordered in such a way that the origin vma was always visited by rmap before the destination. This ordering and the use of page table locks rmap usage safe. However, we want to replace the use of linked lists in anon rmap with an interval tree, and this will make it harder to impose such ordering as the interval tree will always be sorted by the avc->vma->vm_pgoff value. For now, let's replace the anon_vma_moveto_tail() ordering function with proper anon_vma locking in move_ptes(). Once we have the anon interval tree in place, we will re-introduce an optimization to avoid taking these locks in the most common cases. Signed-off-by: Michel Lespinasse <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Daniel Santos <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09  mm: interval tree updates  (Michel Lespinasse; 3 files, -222/+194)
Update the generic interval tree code that was introduced in "mm: replace vma prio_tree with an interval tree". Changes: - fixed 'endpoing' typo noticed by Andrew Morton - replaced include/linux/interval_tree_tmpl.h, which was used as a template (including it automatically defined the interval tree functions) with include/linux/interval_tree_generic.h, which only defines a preprocessor macro INTERVAL_TREE_DEFINE(), which itself defines the interval tree functions when invoked. Now that is a very long macro which is unfortunate, but it does make the usage sites (lib/interval_tree.c and mm/interval_tree.c) a bit nicer than previously. - make use of RB_DECLARE_CALLBACKS() in the INTERVAL_TREE_DEFINE() macro, instead of duplicating that code in the interval tree template. - replaced vma_interval_tree_add(), which was actually handling the nonlinear and interval tree cases, with vma_interval_tree_insert_after() which handles only the interval tree case and has an API that is more consistent with the other interval tree handling functions. The nonlinear case is now handled explicitly in kernel/fork.c dup_mmap(). Signed-off-by: Michel Lespinasse <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Daniel Santos <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-10-09  rbtree: move augmented rbtree functionality to rbtree_augmented.h  (Michel Lespinasse; 3 files, -50/+229)
Provide rb_insert_augmented() and rb_erase_augmented() through a new rbtree_augmented.h include file. rb_erase_augmented() is defined there as an __always_inline function, in order to allow inlining of augmented rbtree callbacks into it. Since this generates a relatively large function, each augmented rbtree user should make sure to have a single call site. Signed-off-by: Michel Lespinasse <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: David Woodhouse <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
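A user of the new header supplies the three augmentation callbacks (or generates them with the RB_DECLARE_CALLBACKS() macro mentioned in the interval tree patch) and calls the augmented insert/erase from a single call site each, so the __always_inline rb_erase_augmented() is expanded only once. A rough usage sketch (callback bodies omitted; names hypothetical):

    #include <linux/rbtree_augmented.h>

    static const struct rb_augment_callbacks my_augment = {
            .propagate = my_propagate,  /* recompute augmented data up the tree */
            .copy      = my_copy,       /* copy augmented data on node replacement */
            .rotate    = my_rotate,     /* fix up augmented data after a rotation */
    };

    rb_insert_augmented(&node->rb, &root, &my_augment);
    rb_erase_augmented(&node->rb, &root, &my_augment);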