Age  Commit message  Author  Files  Lines
2014-01-21  x86, numa, acpi, memory-hotplug: make movable_node have higher priority  (Tang Chen, 1 file, -2/+26)
If users specify the original movablecore=nn@ss boot option, the kernel will arrange [ss, ss+nn) as ZONE_MOVABLE. The kernelcore=nn@ss boot option is similar except that it specifies ZONE_NORMAL ranges. Now, if users specify "movable_node" on the kernel command line, the kernel will arrange hotpluggable memory in SRAT as ZONE_MOVABLE, and all the other movablecore=nn@ss and kernelcore=nn@ss options are then ignored. Those who don't want this can simply specify nothing; the kernel will act as before. Signed-off-by: Tang Chen <[email protected]> Signed-off-by: Zhang Yanfei <[email protected]> Reviewed-by: Wanpeng Li <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "Rafael J . Wysocki" <[email protected]> Cc: Chen Tang <[email protected]> Cc: Gong Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Larry Woodman <[email protected]> Cc: Len Brown <[email protected]> Cc: Liu Jiang <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Taku Izumi <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Thomas Renninger <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vasilis Liaskovitis <[email protected]> Cc: Wen Congyang <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
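A minimal sketch of the precedence this gives movable_node; the variable names mirror the kernel's, but the surrounding function is illustrative, not the actual mm/page_alloc.c code:

    /* Illustrative only: "movable_node" wins over kernelcore=/movablecore=. */
    static bool movable_node_enabled;          /* set by "movable_node" */
    static unsigned long required_kernelcore;  /* from kernelcore=nn@ss */
    static unsigned long required_movablecore; /* from movablecore=nn@ss */

    static void find_zone_movable_pfns(void)
    {
            if (movable_node_enabled) {
                    /* ZONE_MOVABLE comes from SRAT hotplug ranges, */
                    /* so both options are ignored. */
                    required_kernelcore = 0;
                    required_movablecore = 0;
                    return;
            }
            /* otherwise honor kernelcore=/movablecore= as before */
    }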
2014-01-21  memblock, mem_hotplug: make memblock skip hotpluggable regions if needed  (Tang Chen, 3 files, -0/+37)
The Linux kernel cannot migrate pages used by the kernel itself. As a result, hotpluggable memory used by the kernel cannot be hot-removed. To solve this problem, the basic idea is to prevent memblock from allocating hotpluggable memory for the kernel early in boot, and to arrange all hotpluggable memory in the ACPI SRAT (System Resource Affinity Table) as ZONE_MOVABLE when initializing zones. The previous patches marked hotpluggable memory regions with the MEMBLOCK_HOTPLUG flag in memblock.memory. In this patch, we make memblock skip these hotpluggable memory regions in the default top-down allocation function if the movable_node boot option is specified. [[email protected]: coding-style fixes] Signed-off-by: Tang Chen <[email protected]> Signed-off-by: Zhang Yanfei <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "Rafael J . Wysocki" <[email protected]> Cc: Chen Tang <[email protected]> Cc: Gong Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Larry Woodman <[email protected]> Cc: Len Brown <[email protected]> Cc: Liu Jiang <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Taku Izumi <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Thomas Renninger <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vasilis Liaskovitis <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Wen Congyang <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
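A self-contained sketch of the allocator-side check this describes, with simplified stand-in types (MEMBLOCK_HOTPLUG and movable_node_enabled mirror the kernel names; the candidate walk itself is elided):

    #include <stdbool.h>
    #include <stdint.h>

    #define MEMBLOCK_HOTPLUG 0x1UL

    struct region {
            uint64_t base, size;
            unsigned long flags;
    };

    static bool movable_node_enabled; /* "movable_node" boot option */

    /* A top-down candidate walk skips hotpluggable regions so early
     * kernel allocations never land in memory we may want to remove. */
    static bool skip_for_kernel_alloc(const struct region *r)
    {
            return movable_node_enabled && (r->flags & MEMBLOCK_HOTPLUG);
    }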
2014-01-21acpi, numa, mem_hotplug: mark all nodes the kernel resides un-hotpluggableTang Chen1-0/+44
At very early time, the kernel have to use some memory such as loading the kernel image. We cannot prevent this anyway. So any node the kernel resides in should be un-hotpluggable. Signed-off-by: Tang Chen <[email protected]> Reviewed-by: Zhang Yanfei <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "Rafael J . Wysocki" <[email protected]> Cc: Chen Tang <[email protected]> Cc: Gong Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Larry Woodman <[email protected]> Cc: Len Brown <[email protected]> Cc: Liu Jiang <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Taku Izumi <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Thomas Renninger <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vasilis Liaskovitis <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Wen Congyang <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  acpi, numa, mem_hotplug: mark hotpluggable memory in memblock  (Tang Chen, 2 files, -0/+7)
When parsing the SRAT, we know which memory areas are hotpluggable. So we invoke memblock_mark_hotplug(), introduced by the previous patch, to mark hotpluggable memory in memblock. [[email protected]: coding-style fixes] Signed-off-by: Tang Chen <[email protected]> Reviewed-by: Zhang Yanfei <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "Rafael J . Wysocki" <[email protected]> Cc: Chen Tang <[email protected]> Cc: Gong Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Larry Woodman <[email protected]> Cc: Len Brown <[email protected]> Cc: Liu Jiang <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Taku Izumi <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Thomas Renninger <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vasilis Liaskovitis <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Wen Congyang <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  memblock: make memblock_set_node() support different memblock_type  (Tang Chen, 12 files, -19/+28)
[[email protected]: fix powerpc build] Signed-off-by: Tang Chen <[email protected]> Reviewed-by: Zhang Yanfei <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "Rafael J . Wysocki" <[email protected]> Cc: Chen Tang <[email protected]> Cc: Gong Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Larry Woodman <[email protected]> Cc: Len Brown <[email protected]> Cc: Liu Jiang <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Taku Izumi <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Thomas Renninger <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vasilis Liaskovitis <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Wen Congyang <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  memblock, mem_hotplug: introduce MEMBLOCK_HOTPLUG flag to mark hotpluggable regions  (Tang Chen, 2 files, -0/+70)
In find_hotpluggable_memory, once we find a memory region that is hotpluggable, we want to mark it in memblock.memory so that later we can keep the memblock allocator from allocating hotpluggable memory for the kernel. To achieve this goal, we introduce the MEMBLOCK_HOTPLUG flag to indicate hotpluggable memory regions in memblock, and a function memblock_mark_hotplug() to mark hotpluggable memory when we find it. [[email protected]: coding-style fixes] Signed-off-by: Tang Chen <[email protected]> Reviewed-by: Zhang Yanfei <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "Rafael J . Wysocki" <[email protected]> Cc: Chen Tang <[email protected]> Cc: Gong Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Larry Woodman <[email protected]> Cc: Len Brown <[email protected]> Cc: Liu Jiang <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Taku Izumi <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Thomas Renninger <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vasilis Liaskovitis <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Wen Congyang <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
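A hedged sketch of what the marking step might look like (kernel-style; memblock_isolate_range() and memblock_merge_regions() are existing memblock internals, but the exact body here is illustrative, not a quote of the patch):

    int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size)
    {
            int start_rgn, end_rgn, i, ret;

            /* Split regions so the flag applies to whole regions only. */
            ret = memblock_isolate_range(&memblock.memory, base, size,
                                         &start_rgn, &end_rgn);
            if (ret)
                    return ret;

            for (i = start_rgn; i < end_rgn; i++)
                    memblock.memory.regions[i].flags |= MEMBLOCK_HOTPLUG;

            memblock_merge_regions(&memblock.memory);
            return 0;
    }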
2014-01-21  memblock, numa: introduce flags field into memblock  (Tang Chen, 2 files, -15/+39)
There is no flag in memblock to describe what type the memory is. Sometimes we may use memblock to reserve some memory for special usage, and we want to know what kind of memory it is. In a hotplug environment, we want to reserve hotpluggable memory so the kernel won't be able to use it; and when the system is up, we have to free this hotpluggable memory to the buddy allocator. So we need to mark out such special memory in memblock first. In this patch, we introduce a new "flags" member into memblock_region:

    struct memblock_region {
            phys_addr_t base;
            phys_addr_t size;
            unsigned long flags;    /* This is new. */
    #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
            int nid;
    #endif
    };

This patch does the following things:
1) Add a "flags" member to memblock_region.
2) Modify the prototypes of the following APIs: memblock_add_region(), memblock_insert_region().
3) Add memblock_reserve_region() to support reserving memory with flags, keeping memblock_reserve()'s prototype unmodified.
4) Modify other APIs to support flags, but keep their prototypes unmodified.

The idea is from Wen Congyang <[email protected]> and Liu Jiang <[email protected]>. Suggested-by: Wen Congyang <[email protected]> Suggested-by: Liu Jiang <[email protected]> Signed-off-by: Tang Chen <[email protected]> Reviewed-by: Zhang Yanfei <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: "Rafael J . Wysocki" <[email protected]> Cc: Chen Tang <[email protected]> Cc: Gong Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Larry Woodman <[email protected]> Cc: Len Brown <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Nazarewicz <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Taku Izumi <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Thomas Renninger <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vasilis Liaskovitis <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Yasuaki Ishimatsu <[email protected]> Cc: Yinghai Lu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  mm/memblock: debug: correct displaying of upper memory boundary  (Grygorii Strashko, 1 file, -2/+2)
Current memblock APIs don't work on 32-bit PAE or LPAE arches where the physical memory start address is beyond 4GB. The problem was discussed here [3], where Tejun and Yinghai (thanks) proposed a way forward with memblock interfaces. Based on the proposal, this series adds the necessary memblock interfaces and converts the core kernel code to use them. Architectures already converted to NO_BOOTMEM use these new interfaces directly; for the others, which still use bootmem, the new interfaces simply fall back to the existing bootmem APIs, so there is no functional change in behavior. In the long run, once all architectures have moved to NO_BOOTMEM, we can get rid of the bootmem layer completely. This is one step towards removing the core code's dependency on bootmem, and it also gives architectures a path to move away from bootmem. Testing was done on ARM with 32-bit LPAE machines with normal as well as sparse (faked) memory models. This patch (of 23): When debugging is enabled (cmdline has "memblock=debug"), memblock displays the upper memory boundary of each allocated/freed memory range wrongly. For example:

    memblock_reserve: [0x0000009e7e8000-0x0000009e7ed000] _memblock_early_alloc_try_nid_nopanic+0xfc/0x12c

Here 0x0000009e7ed000 is displayed instead of 0x0000009e7ecfff. Hence, correct this by changing the formula used to calculate the upper memory boundary to (u64)base + size - 1 instead of (u64)base + size everywhere in the debug messages. Signed-off-by: Grygorii Strashko <[email protected]> Signed-off-by: Santosh Shilimkar <[email protected]> Cc: Yinghai Lu <[email protected]> Acked-by: Tejun Heo <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Russell King <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Tony Lindgren <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
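The fix reduces to one character of arithmetic; a small runnable illustration using the addresses from the example above:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            uint64_t base = 0x9e7e8000ULL, size = 0x5000ULL;

            /* the end of a range is its last owned byte, not the next base */
            printf("wrong:   [%#018llx-%#018llx]\n",
                   (unsigned long long)base,
                   (unsigned long long)(base + size));
            printf("correct: [%#018llx-%#018llx]\n",
                   (unsigned long long)base,
                   (unsigned long long)(base + size - 1));
            return 0;
    }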
2014-01-21  mm/mlock: prepare params outside critical region  (Davidlohr Bueso, 1 file, -7/+11)
All mlock related syscalls prepare lock limits, lengths and start parameters with the mmap_sem held. Move this logic outside of the critical region. For the case of mlock, continue incrementing the amount already locked by mm->locked_vm with the rwsem taken. Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Rik van Riel <[email protected]> Reviewed-by: Michel Lespinasse <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
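A hedged before/after sketch of the reordering (kernel-style, abridged): the length and page-count arithmetic needs no lock; only the mm->locked_vm read and update do.

    /* before: everything under the semaphore */
    down_write(&mm->mmap_sem);
    len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
    locked = len >> PAGE_SHIFT;
    locked += mm->locked_vm;
    /* ... */
    up_write(&mm->mmap_sem);

    /* after: prepare parameters first, then enter the critical region */
    len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
    locked = len >> PAGE_SHIFT;
    down_write(&mm->mmap_sem);
    locked += mm->locked_vm;
    /* ... */
    up_write(&mm->mmap_sem);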
2014-01-21  mm/mmap.c: add mlock_future_check() helper  (Davidlohr Bueso, 1 file, -22/+23)
Both do_brk and do_mmap_pgoff verify that we are actually capable of locking future pages if the corresponding VM_LOCKED flags are used. Encapsulate this logic into a single mlock_future_check() helper function. Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Rik van Riel <[email protected]> Reviewed-by: Michel Lespinasse <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
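A sketch of what such a helper plausibly looks like (kernel-style; close to, but not guaranteed to be, the exact code):

    static int mlock_future_check(struct mm_struct *mm, unsigned long flags,
                                  unsigned long len)
    {
            unsigned long locked, lock_limit;

            /* mlock limits only apply to VM_LOCKED mappings */
            if (flags & VM_LOCKED) {
                    locked = len >> PAGE_SHIFT;
                    locked += mm->locked_vm;
                    lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
                    if (locked > lock_limit && !capable(CAP_IPC_LOCK))
                            return -EAGAIN;
            }
            return 0;
    }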
2014-01-21  mm: add overcommit_kbytes sysctl variable  (Jerome Marchand, 8 files, -8/+70)
Some applications that run on HPC clusters are designed around the availability of RAM, and the overcommit ratio is fine-tuned to get the maximum usage of memory without swapping. With growing memory sizes, the 1%-of-all-RAM grain provided by overcommit_ratio has become too coarse for these workloads (on a 2TB machine it represents no less than 20GB). This patch adds the new overcommit_kbytes sysctl variable that allows a much finer grain. [[email protected]: coding-style fixes] [[email protected]: fix nommu build] Signed-off-by: Jerome Marchand <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Alan Cox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
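A runnable sketch of the commit-limit computation the new sysctl feeds, simplified from the kernel's vm_commit_limit() (hugepage accounting omitted): a nonzero overcommit_kbytes takes precedence over overcommit_ratio.

    #include <stdio.h>

    #define PAGE_SHIFT 12

    static unsigned long commit_limit_pages(unsigned long totalram_pages,
                                            unsigned long overcommit_kbytes,
                                            int overcommit_ratio,
                                            unsigned long total_swap_pages)
    {
            unsigned long allowed;

            if (overcommit_kbytes)
                    allowed = overcommit_kbytes >> (PAGE_SHIFT - 10); /* KiB -> pages */
            else
                    allowed = totalram_pages * overcommit_ratio / 100;

            return allowed + total_swap_pages;
    }

    int main(void)
    {
            unsigned long ram = 2UL << (40 - PAGE_SHIFT); /* 2 TiB of 4 KiB pages */

            /* the ratio grain on 2 TiB is ~20 GiB; the kbytes grain is 1 KiB */
            printf("ratio=50     -> %lu pages\n", commit_limit_pages(ram, 0, 50, 0));
            printf("kbytes=1 GiB -> %lu pages\n", commit_limit_pages(ram, 1048576, 50, 0));
            return 0;
    }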
2014-01-21  mm, show_mem: remove SHOW_MEM_FILTER_PAGE_COUNT  (Mel Gorman, 9 files, -190/+65)
Commit 4b59e6c47309 ("mm, show_mem: suppress page counts in non-blockable contexts") introduced SHOW_MEM_FILTER_PAGE_COUNT to suppress PFN walks on large memory machines. Commit c78e93630d15 ("mm: do not walk all of system memory during show_mem") avoided a PFN walk in the generic show_mem helper which removes the requirement for SHOW_MEM_FILTER_PAGE_COUNT in that case. This patch removes PFN walkers from the arch-specific implementations that report on a per-node or per-zone granularity. ARM and unicore32 still do a PFN walk as they report memory usage on each bank which is a much finer granularity where the debugging information may still be of use. As the remaining arches doing PFN walks have relatively small amounts of memory, this patch simply removes SHOW_MEM_FILTER_PAGE_COUNT. [[email protected]: fix parisc] Signed-off-by: Mel Gorman <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Tony Luck <[email protected]> Cc: Russell King <[email protected]> Cc: James Bottomley <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  mm/vmalloc: interchange the implementation of vmalloc_to_{pfn,page}  (Jianyu Zhan, 1 file, -10/+10)
Currently we implement vmalloc_to_pfn() as a wrapper around vmalloc_to_page(), which is implemented as follows: 1. walk the page tables to generate the corresponding pfn, 2. convert the pfn to struct page, 3. return it. And vmalloc_to_pfn() re-wraps vmalloc_to_page() to get the pfn back. This seems too circuitous, so this patch reverses the relationship: implement vmalloc_to_page() as a wrapper around vmalloc_to_pfn(). This makes vmalloc_to_pfn() and vmalloc_to_page() slightly more efficient. No functional change. Signed-off-by: Jianyu Zhan <[email protected]> Cc: Vladimir Murzin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
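A hedged sketch of the reversed wrappers (the page-table walk is elided as a comment; only the wrapping direction is the point):

    /* after the patch: the walk yields a pfn directly ... */
    unsigned long vmalloc_to_pfn(const void *vmalloc_addr)
    {
            /* ... walk pgd/pud/pmd/pte and return the mapped pfn ... */
    }

    /* ... and the struct page variant becomes the thin wrapper */
    struct page *vmalloc_to_page(const void *vmalloc_addr)
    {
            return pfn_to_page(vmalloc_to_pfn(vmalloc_addr));
    }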
2014-01-21  mm, mempolicy: remove unneeded functions for UMA configs  (David Rientjes, 1 file, -32/+0)
Mempolicies only exist for CONFIG_NUMA configurations. Therefore, a certain class of functions are unneeded in configurations where CONFIG_NUMA is disabled, such as functions that duplicate existing mempolicies, look up existing policies, set certain mempolicy traits, or test mempolicies for certain attributes. Remove the unneeded functions so that any future callers get a compile-time error and protect their code with CONFIG_NUMA as required. Signed-off-by: David Rientjes <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  mm/hugetlb.c: call MMU notifiers when copying a hugetlb page range  (Andreas Sandberg, 1 file, -5/+16)
When copy_hugetlb_page_range() is called to copy a range of hugetlb mappings, the secondary MMUs are not notified if there is a protection downgrade, which breaks COW semantics in KVM. This patch adds the necessary MMU notifier calls. Signed-off-by: Andreas Sandberg <[email protected]> Acked-by: Steve Capper <[email protected]> Acked-by: Marc Zyngier <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  mm, memory-failure: fix typo in me_pagecache_dirty()  (Zhi Yong Wu, 1 file, -1/+1)
[[email protected]: s/cache/pagecache/] Signed-off-by: Zhi Yong Wu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  mm: create a separate slab for page->ptl allocation  (Kirill A. Shutemov, 3 files, -3/+24)
If DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC are enabled, spinlock_t on x86_64 is 72 bytes. For page->ptl they will be allocated from the kmalloc-96 slab, so we lose 24 bytes on each. An average system can easily allocate a few tens of thousands of page->ptl, so the overhead is significant. Let's create a separate slab for page->ptl allocation to solve this. To make sure that it really works this time, some numbers from my test machine (just booted, no load):

Before:
    # grep '^\(kmalloc-96\|page->ptl\)' /proc/slabinfo
    kmalloc-96 31987 32190 128 30 1 : tunables 120 60 8 : slabdata 1073 1073 92

After:
    # grep '^\(kmalloc-96\|page->ptl\)' /proc/slabinfo
    page->ptl 27516 28143 72 53 1 : tunables 120 60 8 : slabdata 531 531 9
    kmalloc-96 3853 5280 128 30 1 : tunables 120 60 8 : slabdata 176 176 0

Note that the patch is useful not only for the debug case, but also for PREEMPT_RT, where spinlock_t is always bloated. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
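A hedged sketch of the dedicated cache, close in spirit to the kernel's ptlock_cache_init()/ptlock_alloc() (exact flags and error handling may differ):

    static struct kmem_cache *page_ptl_cachep;

    void __init ptlock_cache_init(void)
    {
            /* 72-byte objects pack 53 per slab page here, instead of
             * burning 128-byte kmalloc-96 chunks */
            page_ptl_cachep = kmem_cache_create("page->ptl", sizeof(spinlock_t),
                                                0, SLAB_PANIC, NULL);
    }

    bool ptlock_alloc(struct page *page)
    {
            spinlock_t *ptl = kmem_cache_alloc(page_ptl_cachep, GFP_KERNEL);

            if (!ptl)
                    return false;
            page->ptl = ptl;
            return true;
    }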
2014-01-21  mm: get rid of unnecessary pageblock scanning in setup_zone_migrate_reserve  (Yasuaki Ishimatsu, 2 files, -0/+19)
Yasuaki Ishimatsu reported that memory hot-add spent more than 5 _hours_ on a 9TB memory machine, since onlining memory sections is too slow, and we found that setup_zone_migrate_reserve spent >90% of the time. The problem is that setup_zone_migrate_reserve scans all pageblocks unconditionally, but it is only necessary if the number of reserved blocks was reduced (i.e. memory hot remove). Moreover, the maximum MIGRATE_RESERVE per zone is currently 2, which means the number of reserved pageblocks is almost always unchanged. This patch adds zone->nr_migrate_reserve_block to maintain the number of MIGRATE_RESERVE pageblocks, and it reduces the overhead of setup_zone_migrate_reserve dramatically. The following table shows the time of onlining a memory section:

    Amount of memory      | 128GB | 192GB | 256GB |
    -----------------------------------------------
    linux-3.12            |  23.9 |  31.4 |  44.5 |
    This patch            |   8.3 |   8.3 |   8.6 |
    Mel's proposal patch  |  10.9 |  19.2 |  31.3 |
    -----------------------------------------------
    (millisecond)

    128GB: 4 nodes and each node has 32GB of memory
    192GB: 6 nodes and each node has 32GB of memory
    256GB: 8 nodes and each node has 32GB of memory

(*1) Mel proposed his idea in the following thread: https://lkml.org/lkml/2013/10/30/272 [[email protected]: tweak comment] Signed-off-by: KOSAKI Motohiro <[email protected]> Signed-off-by: Yasuaki Ishimatsu <[email protected]> Reported-by: Yasuaki Ishimatsu <[email protected]> Tested-by: Yasuaki Ishimatsu <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
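A hedged sketch of the shortcut inside setup_zone_migrate_reserve() (kernel-style, simplified):

    old_reserve = zone->nr_migrate_reserve_block;

    /* Memory hot-add almost never changes the target, so skip the
     * expensive pageblock scan when the reserve count is unchanged. */
    if (reserve == old_reserve)
            return;
    zone->nr_migrate_reserve_block = reserve;

    /* ... only now walk the zone's pageblocks ... */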
2014-01-21  /proc/meminfo: provide estimated available memory  (Rik van Riel, 2 files, -0/+46)
Many load balancing and workload placing programs check /proc/meminfo to estimate how much free memory is available. They generally do this by adding up "free" and "cached", which was fine ten years ago, but is pretty much guaranteed to be wrong today. It is wrong because Cached includes memory that is not freeable as page cache, for example shared memory segments, tmpfs, and ramfs, and it does not include reclaimable slab memory, which can take up a large fraction of system memory on mostly idle systems with lots of files. Currently, the amount of memory that is available for a new workload, without pushing the system into swap, can be estimated from MemFree, Active(file), Inactive(file), and SReclaimable, as well as the "low" watermarks from /proc/zoneinfo. However, this may change in the future, and user space really should not be expected to know kernel internals to come up with an estimate for the amount of free memory. It is more convenient to provide such an estimate in /proc/meminfo. If things change in the future, we only have to change it in one place. Signed-off-by: Rik van Riel <[email protected]> Reported-by: Erik Mouw <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
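A runnable sketch of this kind of estimate, simplified from what fs/proc/meminfo.c computes (the low watermark stands in for the kernel's total reserved pages; values in kB):

    #include <stdio.h>

    static long min_l(long a, long b) { return a < b ? a : b; }

    static long mem_available(long memfree, long wmark_low,
                              long active_file, long inactive_file,
                              long sreclaimable)
    {
            long pagecache = active_file + inactive_file;
            long available = memfree - wmark_low;

            /* not all page cache can be dropped; keep a cushion */
            available += pagecache - min_l(pagecache / 2, wmark_low);

            /* part of reclaimable slab is freeable the same way */
            available += sreclaimable - min_l(sreclaimable / 2, wmark_low);

            return available < 0 ? 0 : available;
    }

    int main(void)
    {
            printf("MemAvailable ~ %ld kB\n",
                   mem_available(1024000, 65536, 2048000, 1024000, 262144));
            return 0;
    }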
2014-01-21  mm: thp: turn compound_head() into BUG_ON(!PageTail) in get_huge_page_tail()  (Oleg Nesterov, 1 file, -4/+3)
get_huge_page_tail()->compound_head() looks confusing. Every caller must check PageTail(page), otherwise atomic_inc(&page->_mapcount) is simply wrong if this page is compound-trans-head. Signed-off-by: Oleg Nesterov <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Dave Jones <[email protected]> Cc: Darren Hart <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Mel Gorman <[email protected]> Acked-by: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  mm: thp: __get_page_tail_foll() can use get_huge_page_tail()  (Oleg Nesterov, 1 file, -4/+1)
Cleanup. Change __get_page_tail_foll() to use get_huge_page_tail() to avoid the code duplication. Signed-off-by: Oleg Nesterov <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Dave Jones <[email protected]> Cc: Darren Hart <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Mel Gorman <[email protected]> Acked-by: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  mm/hugetlb.c: defer PageHeadHuge() symbol export  (Andrea Arcangeli, 1 file, -1/+0)
There is no actual need for it, so keep it internal. Signed-off-by: Andrea Arcangeli <[email protected]> Cc: Khalid Aziz <[email protected]> Cc: Pravin Shelar <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Ben Hutchings <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  mm/swap.c: reorganize put_compound_page()  (Andrew Morton, 1 file, -129/+125)
Tweak it to save a tab stop and make the code layout slightly less nutty. Signed-off-by: Andrea Arcangeli <[email protected]> Cc: Khalid Aziz <[email protected]> Cc: Pravin Shelar <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Ben Hutchings <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  mm/hugetlb.c: simplify PageHeadHuge() and PageHuge()  (Andrew Morton, 1 file, -10/+2)
Signed-off-by: Andrea Arcangeli <[email protected]> Cc: Khalid Aziz <[email protected]> Cc: Pravin Shelar <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Ben Hutchings <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  mm: hugetlbfs: use __compound_tail_refcounted in __get_page_tail too  (Andrea Arcangeli, 1 file, -2/+1)
Also remove hugetlb.h which isn't needed anymore as PageHeadHuge is handled in mm.h. Signed-off-by: Andrea Arcangeli <[email protected]> Cc: Khalid Aziz <[email protected]> Cc: Pravin Shelar <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Ben Hutchings <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  mm: tail page refcounting optimization for slab and hugetlbfs  (Andrea Arcangeli, 4 files, -14/+60)
This skips the _mapcount mangling for slab and hugetlbfs pages. The main trouble in doing this is to guarantee that PageSlab and PageHeadHuge remain constant for all get_page/put_page runs on the tail of slab or hugetlbfs compound pages. Otherwise, if they're set during get_page but not set during put_page, the _mapcount of the tail page would underflow. PageHeadHuge will remain true until the compound page is released and enters the buddy allocator, so there is no risk of it changing even if the tail page holds the last reference left on the page. PG_slab instead is cleared before the slab frees the head page with put_page, so if the tail pin is released after the slab freed the page, we would have a problem. But in the slab case the tail pin cannot be the last reference left on the page. This is because the slab code is free to reuse the compound page after a kfree/kmem_cache_free without having to check whether any tail pin is left. In turn, all tail pins must always be released while the head is still pinned by the slab code, and so we know PG_slab will still be set too. Signed-off-by: Andrea Arcangeli <[email protected]> Reviewed-by: Khalid Aziz <[email protected]> Cc: Pravin Shelar <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Ben Hutchings <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  mm: thp: optimize compound_trans_huge  (Andrea Arcangeli, 1 file, -0/+23)
Currently we don't clobber page_tail->first_page during split_huge_page, so compound_trans_head can be set to compound_head without adverse effects, and this mostly optimizes away a smp_rmb. It looks worthwhile to keep around the implementation that doesn't rely on page_tail->first_page not being clobbered, because it would be necessary if we decide to enforce page->private to be zero at all times whenever PG_private is not set, also for anonymous pages. For anonymous pages enforcing such an invariant doesn't matter, as anonymous pages don't use page->private, so we can get away with this micro-optimization. Signed-off-by: Andrea Arcangeli <[email protected]> Cc: Khalid Aziz <[email protected]> Cc: Pravin Shelar <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Ben Hutchings <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  mm: hugetlbfs: move the put/get_page slab and hugetlbfs optimization in a faster path  (Andrea Arcangeli, 1 file, -62/+78)
We don't actually need a reference on the head page in the slab and hugetlbfs paths, as long as we add a smp_rmb(), which should be faster than get_page_unless_zero. [[email protected]: fix typo in comment] Signed-off-by: Andrea Arcangeli <[email protected]> Cc: Khalid Aziz <[email protected]> Cc: Pravin Shelar <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Ben Hutchings <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  mm: hugetlb: use get_page_foll() in follow_hugetlb_page()  (Andrea Arcangeli, 1 file, -1/+1)
get_page_foll() is more optimal and is always safe to use under the PT lock. More so for hugetlbfs as there's no risk of race conditions with split_huge_page regardless of the PT lock. Signed-off-by: Andrea Arcangeli <[email protected]> Tested-by: Khalid Aziz <[email protected]> Cc: Pravin Shelar <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Ben Hutchings <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  mm: hugetlbfs: Add some VM_BUG_ON()s to catch non-hugetlbfs pages  (Dave Hansen, 1 file, -0/+1)
Dave Jiang reported that he was seeing oopses when running NUMA systems and default_hugepagesz=1G. I traced the issue down to migrate_page_copy() trying to use the same code for hugetlb pages and transparent hugepages. It should not have been trying to pass thp pages in there. So, add some VM_BUG_ON()s for the next hapless VM developer that tries the same thing. Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Tested-by: Dave Jiang <[email protected]> Acked-by: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  mm: Make {,set}page_address() static inline if WANT_PAGE_VIRTUAL  (Geert Uytterhoeven, 1 file, -5/+8)
{,set}page_address() are macros if WANT_PAGE_VIRTUAL. If !WANT_PAGE_VIRTUAL, they're plain C functions. If someone calls them with a void *, this pointer is auto-converted to struct page * if !WANT_PAGE_VIRTUAL, but causes a build failure on architectures using WANT_PAGE_VIRTUAL (arc, m68k and sparc64):

    drivers/md/bcache/bset.c: In function `__btree_sort':
    drivers/md/bcache/bset.c:1190: warning: dereferencing `void *' pointer
    drivers/md/bcache/bset.c:1190: error: request for member `virtual' in something not a structure or union

Convert them to static inline functions to fix this. There are already plenty of users of struct page members inside <linux/mm.h>, so there's no reason to keep them as macros. Signed-off-by: Geert Uytterhoeven <[email protected]> Acked-by: Michael S. Tsirkin <[email protected]> Tested-by: Guenter Roeck <[email protected]> Tested-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
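A hedged sketch of the WANT_PAGE_VIRTUAL side of the conversion: the macro bodies become typed static inlines, so a void * argument is diagnosed (or converted) the same way on every architecture.

    #if defined(WANT_PAGE_VIRTUAL)
    static inline void *page_address(const struct page *page)
    {
            return page->virtual;
    }

    static inline void set_page_address(struct page *page, void *address)
    {
            page->virtual = address;
    }
    #endif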
2014-01-21  fs/ramfs: don't use module_init for non-modular core code  (Paul Gortmaker, 1 file, -1/+1)
The ramfs is always built in. It will never be modular, so using module_init as an alias for __initcall is rather misleading. Fix this up now, so that we can relocate module_init from init.h into module.h in the future. If we don't do this, we'd have to add module.h to obviously non-modular code, and that would be a worse thing. Note that direct use of __initcall is discouraged, vs. one of the priority categorized subgroups. As __initcall gets mapped onto device_initcall, our use of fs_initcall (which makes sense for fs code) will thus change this registration from level 6-device to level 5-fs (i.e. slightly earlier). However no observable impact of that small difference has been observed during testing, or is expected. Also note that this change uncovers a missing semicolon bug in the registration of the initcall. Signed-off-by: Paul Gortmaker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
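The substitution in a nutshell, including the semicolon the old registration happened to get away without (sketch):

    /* before */
    module_init(init_ramfs_fs)

    /* after: same init function, fs-level initcall, semicolon restored */
    fs_initcall(init_ramfs_fs);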
2014-01-21  fs/super.c: fix WARN on alloc_super() fail path  (Vladimir Davydov, 1 file, -1/+2)
On the fail path, alloc_super() calls destroy_super(), which issues a warning if the sb's s_mounts list is not empty, in particular if it has not been initialized. That said, s_mounts must be initialized in alloc_super() before any possible failure, but currently it is initialized close to the end of the function, leading to a useless warning dumped to the log if either percpu_counter_init() or list_lru_init() fails. Let's fix this. Signed-off-by: Vladimir Davydov <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  fs/read_write.c:compat_readv(): remove bogus area verify  (Corey Minyard, 1 file, -4/+0)
The compat_do_readv_writev() function was doing a verify_area on the incoming iov, but the nr_segs value was not checked. If someone passes in -1 for nr_segs, for instance, the function should return EINVAL; instead it returns EFAULT, because the verify_area fails while checking an array of size MAX_UINT. The check is bogus anyway, because the next check, compat_rw_copy_check_uvector(), does all the necessary checking. The non-compat do_readv_writev() function doesn't do this check, so I think it's safe to just remove the code. Signed-off-by: Corey Minyard <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  fs/compat_ioctl.c: fix an underflow issue (harmless)  (Dan Carpenter, 1 file, -1/+2)
We cap "nmsgs" at I2C_RDRW_IOCTL_MAX_MSGS (42) but the current code allows negative values. It's harmless, but it makes my static checker upset, so I've made nmsgs unsigned. Signed-off-by: Dan Carpenter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  posix_acl: uninlining  (Andrew Morton, 2 files, -77/+85)
Uninline vast tracts of nested inline functions in include/linux/posix_acl.h. This reduces the text+data+bss size of x86_64 allyesconfig vmlinux by 8026 bytes. The patch also regularises the positioning of the EXPORT_SYMBOLs in posix_acl.c. Cc: Alexander Viro <[email protected]> Cc: J. Bruce Fields <[email protected]> Cc: Trond Myklebust <[email protected]> Tested-by: Benny Halevy <[email protected]> Cc: Benny Halevy <[email protected]> Cc: Andreas Gruenbacher <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  arch/sh/kernel/kgdb.c: add missing #include <linux/sched.h>  (Wanlong Gao, 1 file, -0/+1)
Fixes the following build errors:

    arch/sh/kernel/kgdb.c: In function 'sleeping_thread_to_gdb_regs':
    arch/sh/kernel/kgdb.c:225:32: error: implicit declaration of function 'task_stack_page' [-Werror=implicit-function-declaration]
    arch/sh/kernel/kgdb.c:242:23: error: dereferencing pointer to incomplete type
    arch/sh/kernel/kgdb.c:243:22: error: dereferencing pointer to incomplete type
    arch/sh/kernel/kgdb.c: In function 'singlestep_trap_handler':
    arch/sh/kernel/kgdb.c:310:27: error: 'SIGTRAP' undeclared (first use in this function)
    arch/sh/kernel/kgdb.c:310:27: note: each undeclared identifier is reported only once for each function it appears in

This was introduced by commit 16559ae48c76 ("kgdb: remove #include <linux/serial_8250.h> from kgdb.h"). [[email protected]: reworded and reformatted] Signed-off-by: Wanlong Gao <[email protected]> Signed-off-by: Geert Uytterhoeven <[email protected]> Reported-by: Fengguang Wu <[email protected]> Acked-by: Greg Kroah-Hartman <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  ocfs2: fix NULL pointer dereference when dismount and ocfs2rec simultaneously  (Yiwen Jiang, 1 file, -2/+6)
A 2-node cluster, say Node A and Node B, mounts the same ocfs2 volume, and file 1 is created. Then:

    Node A: open 1, get open lock
    Node B: rm 1, and then add 1 to orphan_dir
    Node A: storage link down, o2hb_write_timeout -> o2quo_disk_timeout -> emergency_restart

At that moment, Node B dismounts and runs ocfs2rec simultaneously:

    1) ocfs2_dismount_volume -> ocfs2_recovery_exit -> wait_event(osb->recovery_event) -> flush_workqueue(ocfs2_wq)
    2) ocfs2rec -> queue_work(&journal->j_recovery_work) -> ocfs2_recover_orphans -> ocfs2_commit_truncate -> queue_delayed_work(&osb->osb_truncate_log_wq)

In ocfs2_recovery_exit, the workqueue is flushed and then the system inodes are released. When ocfs2rec then calls ocfs2_flush_truncate_log, it tries to get sys_root_inode, and a NULL pointer dereference occurs. Signed-off-by: Yiwen Jiang <[email protected]> Signed-off-by: joyce <[email protected]> Signed-off-by: Joseph Qi <[email protected]> Cc: Joel Becker <[email protected]> Cc: Mark Fasheh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  ocfs2: punch hole should return EINVAL if the length argument in ioctl is negative  (Tariq Saeed, 1 file, -1/+2)
An unreserve-space ioctl (OCFS2_IOC_UNRESVSP/64) should reject a negative length. Orabug: 14789508 Signed-off-by: Tariq Saeed <[email protected]> Signed-off-by: Srinivas Eeda <[email protected]> Cc: Joel Becker <[email protected]> Cc: Mark Fasheh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  ocfs2: fix sparse non static symbol warning  (Wei Yongjun, 1 file, -1/+1)
Fixes the following sparse warning: fs/ocfs2/stack_user.c:930:32: warning: symbol 'ocfs2_ls_ops' was not declared. Should it be static? Signed-off-by: Wei Yongjun <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  ocfs2: adjust minlen with discard_granularity in the FITRIM ioctl  (Jie Liu, 1 file, -0/+2)
Adjust minlen with discard_granularity for the FITRIM ioctl(2) if the given minimum size in bytes is less than it, because the discard granularity tells us the minimum size of extent that can be discarded by the storage device. This is inspired by ext4 commit 5c2ed62fd447 ("ext4: Adjust minlen with discard_granularity in the FITRIM ioctl") from Lukas Czerner. Signed-off-by: Jie Liu <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  ocfs2: return EINVAL if the given range to discard is less than block size  (Jie Liu, 1 file, -7/+3)
For the FITRIM ioctl(2), we should not stay silent if the given range length is less than a block size, as no data blocks would be discarded; we should return EINVAL instead. This issue can be verified via xfstests/generic/288, which tests FITRIM argument handling. Signed-off-by: Jie Liu <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  ocfs2: return EOPNOTSUPP if the device does not support discard  (Jie Liu, 1 file, -0/+5)
For the FITRIM ioctl(2), we should return EOPNOTSUPP to inform the user when the storage device does not support discard; returning success would mislead the user even though no free blocks were trimmed at all. Signed-off-by: Jie Liu <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
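The three FITRIM fixes above compose naturally at the top of a trim handler; a hedged, simplified sketch (kernel-style; the real ocfs2 code differs in detail):

    struct request_queue *q = bdev_get_queue(sb->s_bdev);

    if (!blk_queue_discard(q))
            return -EOPNOTSUPP;     /* device cannot discard at all */

    if (range->len < sb->s_blocksize)
            return -EINVAL;         /* no block could possibly be trimmed */

    if (range->minlen < q->limits.discard_granularity)
            range->minlen = q->limits.discard_granularity; /* round up */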
2014-01-21  ocfs2: remove redundant ocfs2_alloc_dinode_update_counts() and ocfs2_block_group_set_bits()  (Younger Liu, 3 files, -87/+14)
ocfs2_alloc_dinode_update_counts() and ocfs2_block_group_set_bits() are already provided in suballoc.c, so the same functions in move_extents.c are not needed any more. Declare the functions in suballoc.h and remove the redundant copies from move_extents.c. Signed-off-by: Younger Liu <[email protected]> Cc: Younger Liu <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  ocfs2: use the new DLM operation callbacks while requesting new lockspace  (Goldwyn Rodrigues, 1 file, -24/+102)
Attempt to use the new DLM operations. If they are not supported, fall back to the traditional ocfs2_controld. To exchange ocfs2 versioning, we use the LVB of the version dlm lock. It first attempts to take the lock in EX mode (non-blocking). If successful (which means it is the first mount), it writes the version number and downconverts to a PR lock. If unsuccessful, it reads the version from the lock. If this becomes the standard (with o2cb as well), it could simplify the userspace tools' check for whether the filesystem is mounted on other nodes. Dan: Since ocfs2_protocol_version is two u8 values, the additional checks with LONG* don't make sense. Signed-off-by: Goldwyn Rodrigues <[email protected]> Signed-off-by: Dan Carpenter <[email protected]> Reviewed-by: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Dan Carpenter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  ocfs2: framework for version LVB  (Goldwyn Rodrigues, 1 file, -0/+101)
Use the native DLM locks for version control negotiation. Most of the framework is taken from gfs2/lock_dlm.c Signed-off-by: Goldwyn Rodrigues <[email protected]> Reviewed-by: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  ocfs2: pass ocfs2_cluster_connection to ocfs2_this_node  (Goldwyn Rodrigues, 5 files, -8/+24)
This is done to differentiate between using and not using controld, and to use the connection information accordingly. We need to be backward compatible, so we use a new enum, ocfs2_connection_type, to identify when controld is used and when it is not. Signed-off-by: Goldwyn Rodrigues <[email protected]> Reviewed-by: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  ocfs2: shift allocation ocfs2_live_connection to user_connect()  (Goldwyn Rodrigues, 1 file, -19/+18)
We do this because the DLM recovery callbacks will require the ocfs2_live_connection structure to record the node information when dlm_new_lockspace() is updated (in the last patch of the series). Before calling dlm_new_lockspace(), we need the structure ready for the .recover_done() callback, which sets oc_this_node. This is the reason we allocate ocfs2_live_connection beforehand in user_connect(). [AKPM] Initializing rc is not required because it is assigned in the error cases; the compiler would eliminate the initialization anyway. Signed-off-by: Goldwyn Rodrigues <[email protected]> Reviewed-by: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-01-21  ocfs2: add DLM recovery callbacks  (Goldwyn Rodrigues, 1 file, -0/+38)
These are the callbacks called by the fs/dlm code in case the membership changes. If there is a failure while/during calling any of these, the DLM creates a new membership and relays it to the rest of the nodes.

    - recover_prep() is called when DLM understands a node is down.
    - recover_slot() is called once all nodes have acknowledged recover_prep and recovery can begin.
    - recover_done() is called once the recovery is complete. It returns the new membership.

Signed-off-by: Goldwyn Rodrigues <[email protected]> Reviewed-by: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
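A hedged sketch of wiring these up through fs/dlm's operations table (the struct and callback signatures follow <linux/dlm.h>; the handler names and bodies here are illustrative):

    static void user_recover_prep(void *recover_data)
    {
            /* a node went down: quiesce activity that depends on it */
    }

    static void user_recover_slot(void *recover_data, struct dlm_slot *slot)
    {
            /* every node acked recover_prep: recover this slot's state */
    }

    static void user_recover_done(void *recover_data, struct dlm_slot *slots,
                                  int num_slots, int our_slot,
                                  uint32_t generation)
    {
            /* recovery finished: record the new membership/generation */
    }

    static const struct dlm_lockspace_ops ocfs2_ls_ops = {
            .recover_prep = user_recover_prep,
            .recover_slot = user_recover_slot,
            .recover_done = user_recover_done,
    };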
2014-01-21  ocfs2: add clustername to cluster connection  (Goldwyn Rodrigues, 5 files, -7/+24)
This is an effort to remove ocfs2_controld.pcmk and get ocfs2 DLM handling up to date with respect to DLM (>=4.0.1) and corosync (2.3.x). AFAIK, cman is also being phased out for a unified corosync cluster stack. fs/dlm performs all the functions with respect to fencing and node management and provides the APIs for ocfs2 to do so. For all future references, DLM stands for the fs/dlm code. The advantages are:

    + No need to run an additional userspace daemon (ocfs2_controld)
    + No controld device handling and controld protocol
    + Shifting responsibilities of node management to the DLM layer

For backward compatibility, we are keeping the controld handling code; once enough time has passed, we can remove a significant portion of it. This was tested by using the kernel with these changes against older, unmodified tools: the kernel used ocfs2_controld as expected and displayed the appropriate warning message. This feature requires modifications in the userspace ocfs2-tools; the changes can be found at https://github.com/goldwynr/ocfs2-tools, branch nocontrold. Currently, not many checks are present in the userspace code, but that will change soon. This patch (of 6): Add clustername to cluster connection. Signed-off-by: Goldwyn Rodrigues <[email protected]> Reviewed-by: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>