Age | Commit message | Author | Files | Lines
2014-04-03 | mm: optimize put_mems_allowed() usage | Mel Gorman | 8 files | -39/+38
Since put_mems_allowed() is strictly optional (it is a seqcount retry), we don't need to evaluate the function if the allocation was in fact successful, saving an smp_rmb(), some loads, and comparisons on some relatively fast paths. Since the naming get/put_mems_allowed() does suggest a mandatory pairing, rename the interface, as suggested by Mel, to resemble the seqcount interface. This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(), where it is important to note that the return value of the latter call is inverted from its previous incarnation. Signed-off-by: Peter Zijlstra <[email protected]> Signed-off-by: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
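As an illustration of the renamed interface's intended usage, here is a minimal userspace sketch of the pattern (the counter, try_alloc() and all names are stand-ins, not the kernel implementation); the point is that the retry check only runs when the allocation failed:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

static _Atomic unsigned int mems_allowed_seq;   /* bumped twice by the writer around an update */

static unsigned int read_mems_allowed_begin(void)
{
        unsigned int seq;

        do {    /* wait out a writer: an odd count means an update is in progress */
                seq = atomic_load_explicit(&mems_allowed_seq, memory_order_acquire);
        } while (seq & 1);
        return seq;
}

static bool read_mems_allowed_retry(unsigned int seq)
{
        /* true means the mask changed under us and the caller must retry */
        return atomic_load_explicit(&mems_allowed_seq, memory_order_acquire) != seq;
}

static void *try_alloc(void)            /* stand-in for the real page allocator */
{
        return malloc(64);
}

void *alloc_page_respecting_mems_allowed(void)
{
        unsigned int seq;
        void *page;

        do {
                seq = read_mems_allowed_begin();
                page = try_alloc();
        } while (!page && read_mems_allowed_retry(seq));   /* retry check only on failure */
        return page;
}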
2014-04-03 | mm, compaction: ignore pageblock skip when manually invoking compaction | David Rientjes | 1 file | -0/+1
The cached pageblock hint should be ignored when triggering compaction through /proc/sys/vm/compact_memory so that all eligible memory is isolated. Manually invoking compaction is known to be expensive and is mainly used for debugging, so there's no need to skip pageblocks based on heuristics. Signed-off-by: David Rientjes <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | mm: vmscan: remove shrink_control arg from do_try_to_free_pages() | Vladimir Davydov | 1 file | -20/+12
There is no need to pass a shrink_control struct from try_to_free_pages() and friends down to do_try_to_free_pages() and then to shrink_zones(), because it is only used in shrink_zones() and the only field initialized at the top level is gfp_mask, which is always equal to scan_control.gfp_mask. So let's move the shrink_control initialization to shrink_zones(). Signed-off-by: Vladimir Davydov <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Glauber Costa <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | mm: vmscan: move call to shrink_slab() to shrink_zones() | Vladimir Davydov | 1 file | -31/+25
This reduces the indentation level of do_try_to_free_pages() and removes the extra loop over all eligible zones that counted the number of on-LRU pages. Signed-off-by: Vladimir Davydov <[email protected]> Reviewed-by: Glauber Costa <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Dave Chinner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | mm: vmscan: respect NUMA policy mask when shrinking slab on direct reclaim | Vladimir Davydov | 1 file | -2/+2
When direct reclaim is executed by a process bound to a set of NUMA nodes, we should scan only those nodes when possible, but currently we will scan kmem from all online nodes even if the kmem shrinker is NUMA aware. That said, binding a process to a particular NUMA node won't prevent it from shrinking inode/dentry caches from other nodes, which is not good. Fix this. Signed-off-by: Vladimir Davydov <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Glauber Costa <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | kernel/watchdog.c: touch_nmi_watchdog should only touch local cpu not every one | Ben Zhang | 1 file | -8/+8
I ran into a scenario where one cpu was stuck and should have panic'd because of the NMI watchdog, but it didn't. The reason was that another cpu was spewing stack dumps onto the console. Upon investigation, I noticed that when writing to the console and also when dumping the stack, the watchdog is touched. This causes all the cpus to reset their NMI watchdog flags and the 'stuck' cpu just spins forever. This change causes the semantics of touch_nmi_watchdog to be changed slightly. Previously, I accidentally changed the semantics and we noticed there was a codepath in which touch_nmi_watchdog could be called from a preemptible area. That caused a BUG() to happen when CONFIG_DEBUG_PREEMPT was enabled. I believe it was the acpi code. My attempt here re-introduces the change to have the touch_nmi_watchdog() code only touch the local cpu instead of all of the cpus. But instead of using __get_cpu_var(), I use the __raw_get_cpu_var() version. This avoids the preemption problem. However my reasoning wasn't because I was trying to be lazy. Instead I rationalized it as: if preemption is enabled then interrupts should be enabled too, and the NMI watchdog will have no reason to trigger. So it won't matter if the wrong cpu is touched because the percpu interrupt counters the NMI watchdog uses should still be incrementing. Don said: : I'm ok with this patch, though it does alter the behaviour of how : touch_nmi_watchdog works. For the most part I don't think most callers : need to touch all of the watchdogs (on each cpu). Perhaps a corner case : will pop up (the scheduler?? to mimic touch_all_softlockup_watchdogs() ). : : But this does address an issue where if a system is locked up and one cpu : is spewing out useful debug messages (or error messages), the hard lockup : will fail to go off. We have seen this on RHEL also. Signed-off-by: Don Zickus <[email protected]> Signed-off-by: Ben Zhang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
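A rough userspace analogue of the behavioural change (hypothetical array and names; the real code uses a per-cpu variable accessed through __raw_get_cpu_var()): only the calling CPU's timestamp is refreshed, so a genuinely stuck remote CPU can still trip its hard lockup detector.

#define _GNU_SOURCE
#include <sched.h>
#include <time.h>

#define NR_CPUS 64
static volatile time_t watchdog_touch_ts[NR_CPUS];

/* Old behaviour: every CPU's flag was reset, which let one CPU spewing
 * console output keep a stuck CPU from ever being flagged. */
static void touch_nmi_watchdog_all_cpus(void)
{
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                watchdog_touch_ts[cpu] = time(NULL);
}

/* New behaviour: touch only the slot of whatever CPU we happen to run on;
 * being preempted and touching the "wrong" CPU is harmless here, since that
 * CPU's interrupt counters are still advancing. */
static void touch_nmi_watchdog_local(void)
{
        watchdog_touch_ts[sched_getcpu()] = time(NULL);
}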
2014-04-03 | fs/direct-io.c: remove some left over checks | Dan Carpenter | 3 files | -3/+3
We know that "ret > 0" is true here. These tests were left over from commit 02afc27faec9 ('direct-io: Handle O_(D)SYNC AIO') and aren't needed any more. Signed-off-by: Dan Carpenter <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | fs/direct-io.c: remove redundant comparison | Gu Zheng | 1 file | -1/+0
The return value of bio_get_nr_vecs() cannot be bigger than BIO_MAX_PAGES, so we can remove the redundant comparison between nr_pages and BIO_MAX_PAGES. Signed-off-by: Gu Zheng <[email protected]> Cc: Al Viro <[email protected]> Reviewed-by: Jeff Moyer <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | ocfs2: pass "new" parameter to ocfs2_init_xattr_bucket | Wengang Wang | 1 file | -8/+15
This patch fixes the following crash:

kernel BUG at fs/ocfs2/uptodate.c:530!
Modules linked in: ocfs2(F) ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge xen_pciback xen_netback xen_blkback xen_gntalloc xen_gntdev xen_evtchn xenfs xen_privcmd sunrpc 8021q garp stp llc bonding be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi iTCO_wdt iTCO_vendor_support dcdbas coretemp freq_table mperf microcode pcspkr serio_raw bnx2 lpc_ich mfd_core i5k_amb i5000_edac edac_core e1000e sg shpchp ext4(F) jbd2(F) mbcache(F) dm_round_robin(F) sr_mod(F) cdrom(F) usb_storage(F) sd_mod(F) crc_t10dif(F) pata_acpi(F) ata_generic(F) ata_piix(F) mptsas(F) mptscsih(F) mptbase(F) scsi_transport_sas(F) radeon(F) ttm(F) drm_kms_helper(F) drm(F) hwmon(F) i2c_algo_bit(F) i2c_core(F) dm_multipath(F) dm_mirror(F) dm_region_hash(F) dm_log(F) dm_mod(F)
CPU 5
Pid: 21303, comm: xattr-test Tainted: GF W 3.8.13-30.el6uek.x86_64 #2 Dell Inc. PowerEdge 1950/0M788G
RIP: ocfs2_set_new_buffer_uptodate+0x51/0x60 [ocfs2]
Process xattr-test (pid: 21303, threadinfo ffff880017aca000, task ffff880016a2c480)
Call Trace:
  ocfs2_init_xattr_bucket+0x8a/0x120 [ocfs2]
  ocfs2_cp_xattr_bucket+0xbb/0x1b0 [ocfs2]
  ocfs2_extend_xattr_bucket+0x20a/0x2f0 [ocfs2]
  ocfs2_add_new_xattr_bucket+0x23e/0x4b0 [ocfs2]
  ocfs2_xattr_set_entry_index_block+0x13c/0x3d0 [ocfs2]
  ocfs2_xattr_block_set+0xf9/0x220 [ocfs2]
  __ocfs2_xattr_set_handle+0x118/0x710 [ocfs2]
  ocfs2_xattr_set+0x691/0x880 [ocfs2]
  ocfs2_xattr_user_set+0x46/0x50 [ocfs2]
  generic_setxattr+0x96/0xa0
  __vfs_setxattr_noperm+0x7b/0x170
  vfs_setxattr+0xbc/0xc0
  setxattr+0xde/0x230
  sys_fsetxattr+0xc6/0xf0
  system_call_fastpath+0x16/0x1b
Code: 41 80 0c 24 01 48 89 df e8 7d f0 ff ff 4c 89 e6 48 89 df e8 a2 fe ff ff 48 89 df e8 3a f0 ff ff 48 8b 1c 24 4c 8b 64 24 08 c9 c3 <0f> 0b eb fe 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 66 66
RIP ocfs2_set_new_buffer_uptodate+0x51/0x60 [ocfs2]

It hit the BUG_ON() in ocfs2_set_new_buffer_uptodate():

void ocfs2_set_new_buffer_uptodate(struct ocfs2_caching_info *ci,
                                   struct buffer_head *bh)
{
        /* This should definitely *not* exist in our cache */
        if (ocfs2_buffer_cached(ci, bh))
                printk(KERN_ERR "bh->b_blocknr: %lu @ %p\n", bh->b_blocknr, bh);
        BUG_ON(ocfs2_buffer_cached(ci, bh));

        set_buffer_uptodate(bh);

        ocfs2_metadata_cache_io_lock(ci);
        ocfs2_set_buffer_uptodate(ci, bh);
        ocfs2_metadata_cache_io_unlock(ci);
}

The problem here is: we cached a block, but the buffer_head got reused. When we come to pick up this block again, a new buffer_head is created with the UPTODATE flag cleared. ocfs2_buffer_uptodate() then returns false since UPTODATE is not set on the buffer_head, so we set this block in the cache as a NEW block, and it then fails the assertion that the block is not already in the cache. The fix is to add a new parameter to ocfs2_init_xattr_bucket() indicating whether the bucket is newly allocated, and let ocfs2_init_xattr_bucket() assert that the block is not cached accordingly.

Signed-off-by: Wengang Wang <[email protected]>
Cc: Joel Becker <[email protected]>
Reviewed-by: Mark Fasheh <[email protected]>
Cc: Joe Jin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | ocfs2: avoid system inode ref confusion by adding mutex lock | jiangyiwen | 3 files | -0/+7
The following case may lead to confusion over a system inode's refcount. Thread A and thread B both enter ocfs2_get_system_file_inode() -> get_local_system_inode() -> _ocfs2_get_system_file_inode() because *arr == NULL. Thread A gets a first ref through _ocfs2_get_system_file_inode(), gets a second ref through igrab(), and sets *arr = inode; at the same moment thread B also takes two refs, which leads to one extra inode ref. So add a mutex lock to prevent multiple threads from setting up two inode refs at the same time. Signed-off-by: jiangyiwen <[email protected]> Reviewed-by: Joseph Qi <[email protected]> Cc: Joel Becker <[email protected]> Cc: Mark Fasheh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
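A plain-pthreads sketch of the idea behind the fix (hypothetical names, not the ocfs2 code): the check of the cached slot and the store into it happen under one mutex, so two threads can no longer both see NULL and each install an extra reference.

#include <pthread.h>
#include <stddef.h>

struct inode_ref { int refcount; };

static struct inode_ref *cached_system_inode;          /* plays the role of the *arr slot */
static pthread_mutex_t system_inode_lock = PTHREAD_MUTEX_INITIALIZER;

static struct inode_ref *load_system_inode(void)       /* slow path, returns with one ref held */
{
        static struct inode_ref si;
        si.refcount = 1;                                /* the cache's reference */
        return &si;
}

static void grab_ref(struct inode_ref *inode)          /* plays the role of igrab() */
{
        inode->refcount++;
}

struct inode_ref *get_system_file_inode(void)
{
        struct inode_ref *inode;

        pthread_mutex_lock(&system_inode_lock);
        if (!cached_system_inode)
                cached_system_inode = load_system_inode();  /* cache keeps exactly one ref */
        inode = cached_system_inode;
        grab_ref(inode);                                    /* the caller's reference */
        pthread_mutex_unlock(&system_inode_lock);

        return inode;
}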
2014-04-03 | ocfs2: iput inode alloc when failed locally | jiangyiwen | 2 files | -2/+4
In the ocfs2_info_handle_freeinode() and ocfs2_test_inode_bit() functions, after calling ocfs2_get_system_file_inode() to get an inode ref, if the call to ocfs2_info_scan_inode_alloc() or ocfs2_inode_lock() fails, we should iput the inode alloc to avoid leaking the inode. Signed-off-by: jiangyiwen <[email protected]> Reviewed-by: Joseph Qi <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | ocfs2/o2net: o2net_listen_data_ready should do nothing if socket state is not TCP_LISTEN | Tariq Saeed | 1 file | -5/+17
Orabug: 17330860 When accepting an incoming connection, o2net_accept_one clones a child data socket from the parent listening socket. It then proceeds to set up the child with callback o2net_data_ready() and sk_user_data set to NULL. If data arrives in this window, o2net_listen_data_ready will be called with some non-deterministic value in sk_user_data (not inherited). We panic when we page fault on sk_user_data -- in the parent it is sock_def_readable(). The fix is to recognize that this is a data socket being set up by looking at the socket state, and do nothing. Signed-off-by: Tariq Saeed <[email protected]> Signed-off-by: Srinivas Eeda <[email protected]> Reviewed-by: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | ocfs2: rollback alloc_dinode counts when ocfs2_block_group_set_bits() failed | Younger Liu | 3 files | -2/+32
After updating alloc_dinode counts in ocfs2_alloc_dinode_update_counts(), if ocfs2_alloc_dinode_update_bitmap() failed, there is a rare case that some space may be lost. So, roll back alloc_dinode counts when ocfs2_block_group_set_bits() failed. [[email protected]: coding-style fixes] Signed-off-by: Younger Liu <[email protected]> Reviewed-by: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | ocfs2: flock: drop cross-node lock when failed locally | Wengang Wang | 1 file | -0/+2
ocfs2_do_flock() calls ocfs2_file_lock() to get the cross-node lock and then calls flock_lock_file_wait() to compete with local processes. In case flock_lock_file_wait() fails, say with -ENOMEM, the cleanup work is not done. This patch adds the cleanup: drop the cross-node lock which was just granted. [[email protected]: coding-style fixes] Signed-off-by: Wengang Wang <[email protected]> Cc: Joel Becker <[email protected]> Reviewed-by: Mark Fasheh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | ocfs2: call ocfs2_update_inode_fsync_trans when updating any inode | Darrick J. Wong | 8 files | -1/+20
Ensure that ocfs2_update_inode_fsync_trans() is called any time we touch an inode in a given transaction. This is a follow-on to the previous patch to reduce lock contention and deadlocking during an fsync operation. Signed-off-by: Darrick J. Wong <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Wengang <[email protected]> Cc: Greg Marsden <[email protected]> Cc: Srinivas Eeda <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | ocfs2: fix panic on kfree(xattr->name) | Tetsuo Handa | 1 file | -2/+0
Commit 9548906b2bb7 ('xattr: Constify ->name member of "struct xattr"') missed that ocfs2 is calling kfree(xattr->name). As a result, a kernel panic occurs upon calling kfree(xattr->name) because xattr->name refers to static constant names. This patch removes kfree(xattr->name) from ocfs2_mknod() and ocfs2_symlink(). Signed-off-by: Tetsuo Handa <[email protected]> Reported-by: Tariq Saeed <[email protected]> Tested-by: Tariq Saeed <[email protected]> Reviewed-by: Srinivas Eeda <[email protected]> Cc: Joel Becker <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: <[email protected]> [3.12+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | ocfs2: do not put bh when buffer_uptodate failed | alex chen | 1 file | -2/+0
Do not put the bh when buffer_uptodate fails in ocfs2_write_block and ocfs2_write_super_or_backup, because the bh will already be put in b_end_io. Otherwise it will hit the warning "VFS: brelse: Trying to free free buffer". Signed-off-by: Alex Chen <[email protected]> Reviewed-by: Joseph Qi <[email protected]> Reviewed-by: Srinivas Eeda <[email protected]> Cc: Mark Fasheh <[email protected]> Acked-by: Joel Becker <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | ocfs2: __ocfs2_mknod_locked should return error when ocfs2_create_new_inode_locks() failed | Xue jiufei | 1 file | -3/+0
When ocfs2_create_new_inode_locks() returns an error, the inode open lock may not have been obtained for this inode. So other nodes can remove this file and free the dinode while the inode still remains in memory on this node, which is not correct and may trigger a BUG. So __ocfs2_mknod_locked should return an error when ocfs2_create_new_inode_locks() fails.

Node_1                                                   Node_2
create fileA, call ocfs2_mknod()
  -> ocfs2_get_init_inode(), allocate inodeA
  -> ocfs2_claim_new_inode(), claim dinode(dinodeA)
  -> call ocfs2_create_new_inode_locks(),
     create open lock failed, return error
  -> __ocfs2_mknod_locked return success
                                                         unlink fileA
                                                         try open lock succeed, and free dinodeA
create another file, call ocfs2_mknod()
  -> ocfs2_get_init_inode(), allocate inodeB
  -> ocfs2_claim_new_inode(), as Node_2 had freed
     dinodeA, so claim dinodeA and update generation
     for dinodeA
call __ocfs2_drop_dl_inodes()->ocfs2_delete_inode()
to free inodeA, and finally trigger
BUG_ON(inode->i_generation != le32_to_cpu(fe->i_generation))
in function ocfs2_inode_lock_update().

Signed-off-by: joyce.xue <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | ocfs2: allow for more than one data extent when creating xattr | Tariq Saeed | 1 file | -1/+8
Orabug: 18108070 ocfs2_xattr_extend_allocation() hits a panic when creating an xattr during the data extent alloc phase. The problem occurs if, due to local alloc fragmentation, clusters are spread over multiple extents. In this case ocfs2_add_clusters_in_btree() finds no space to store more than one extent record and therefore fails, returning RESTART_META. This situation is anticipated for the xattr update case but not for the xattr create case. This fix simply ports that code to the create case. Signed-off-by: Tariq Saeed <[email protected]> Cc: Joel Becker <[email protected]> Cc: Mark Fasheh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | ocfs2: fix deadlock risk when kmalloc failed in dlm_query_region_handler | Zhonghua Guo | 1 file | -12/+9
In dlm_query_region_handler(), once kmalloc fails, the code unlocks dlm_domain_lock without having locked it first, and a deadlock then happens. Signed-off-by: Zhonghua Guo <[email protected]> Signed-off-by: Joseph Qi <[email protected]> Reviewed-by: Srinivas Eeda <[email protected]> Tested-by: Joseph Qi <[email protected]> Cc: Joel Becker <[email protected]> Cc: Mark Fasheh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | ocfs2: llseek requires ocfs2 inode lock for the file in SEEK_END | Jensen | 1 file | -1/+10
llseek requires the ocfs2 inode lock for updating the file size in SEEK_END, because the file size may have been updated on another node. This bug can be reproduced in the following scenario: first, dd a test fileA with a size of 10k.

on NodeA:
1) open the test fileA, lseek to the end of the file, and print the position.
2) close the test fileA.

on NodeB:
1) open the test fileA, append 5k of data to fileA.
2) lseek to the end of the file, and print the position.
3) close the file.

First we run test program1 on NodeA: the result is 10k. Then we run test program2 on NodeB: the result is 15k. At last, we run test program1 on NodeA again: the result is still 10k. After applying this patch, the result of the third step is 15k.

test result: 1000000 lseek calls

index   lseek with inode lock (unit:us)   lseek without inode lock (unit:us)
1       1168162                           555383
2       1168011                           549504
3       1170538                           549396
4       1170375                           551685
5       1170444                           556719
6       1174364                           555307
7       1163294                           551552
8       1170080                           549350
9       1162464                           553700
10      1165441                           552594
avg     1168317                           552519

avg with lock - avg without lock = 615798
(avg with lock - avg without lock)/1000000 = 0.615798 us

Signed-off-by: Jensen <[email protected]>
Cc: Jie Liu <[email protected]>
Acked-by: Joel Becker <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Sunil Mushran <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | ocfs2: fix type conversion risk when get cluster attributes | Joseph Qi | 1 file | -3/+3
In o2nm_cluster, cl_idle_timeout_ms, cl_keepalive_delay_ms, as well as cl_reconnect_delay_ms, are defined as unsigned int. So we should also use unsigned int in the helper functions. Signed-off-by: Joseph Qi <[email protected]> Cc: Joel Becker <[email protected]> Cc: Mark Fasheh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | ocfs2: revert iput deferring code in ocfs2_drop_dentry_lock | Goldwyn Rodrigues | 4 files | -122/+9
The following patches are reverted in this patch because they caused a performance regression in remote unlink() calls.

ea455f8ab683 - ocfs2: Push out dropping of dentry lock to ocfs2_wq
f7b1aa69be13 - ocfs2: Fix deadlock on umount
5fd131893793 - ocfs2: Don't oops in ocfs2_kill_sb on a failed mount

Previous patches in this series removed the possible deadlocks from the downconvert thread, so the above patches shouldn't be needed anymore. The regression is caused because these patches delay the iput() in case of dentry unlocks. This also delays the unlocking of the open lockres. The open lock resource is required to test if the inode can be wiped from disk or not. When the deleting node does not get the open lock, it marks the inode as orphan (even though it is not in use by another node/process) and causes a journal checkpoint. This delays operations following the inode eviction. It also moves the inode to the orphan directory, which further causes more I/O and a lot of unnecessary orphans.

The following script can be used to generate the load causing issues:

declare -a create
declare -a remove
declare -a iterations=(1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384)
unique="`mktemp -u XXXXX`"
script="/tmp/idontknow-${unique}.sh"
cat <<EOF > "${script}"
for n in {1..8}; do
  mkdir -p test/dir\${n}
  eval touch test/dir\${n}/foo{1.."\$1"}
done
EOF
chmod 700 "${script}"

function fcreate ()
{
  exec 2>&1 /usr/bin/time --format=%E "${script}" "$1"
}

function fremove ()
{
  exec 2>&1 /usr/bin/time --format=%E ssh node2 "cd `pwd`; rm -Rf test*"
}

function fcp ()
{
  exec 2>&1 /usr/bin/time --format=%E ssh node3 "cd `pwd`; cp -R test test.new"
}

echo -------------------------------------------------
echo "| # files | create #s | copy #s | remove #s |"
echo -------------------------------------------------
for ((x=0; x < ${#iterations[*]} ; x++))
do
  create[$x]="`fcreate ${iterations[$x]}`"
  copy[$x]="`fcp ${iterations[$x]}`"
  remove[$x]="`fremove`"
  printf "| %8d | %9s | %9s | %9s |\n" ${iterations[$x]} ${create[$x]} ${copy[$x]} ${remove[$x]}
done
rm "${script}"
echo "------------------------"

Signed-off-by: Srinivas Eeda <[email protected]>
Signed-off-by: Goldwyn Rodrigues <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | ocfs2: avoid blocking in ocfs2_mark_lockres_freeing() in downconvert thread | Jan Kara | 3 files | -7/+47
If we are dropping the last inode reference from the downconvert thread, we will end up calling ocfs2_mark_lockres_freeing(), which can block if the lock we are freeing is queued, thus creating an A-A deadlock. Luckily, since we are the downconvert thread, we can immediately dequeue the lock and thus avoid waiting in this case. Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Mark Fasheh <[email protected]> Reviewed-by: Srinivas Eeda <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | ocfs2: implement delayed dropping of last dquot reference | Jan Kara | 4 files | -0/+50
We cannot drop the last dquot reference from the downconvert thread as that creates the following deadlock:

NODE 1                                          NODE 2
holds dentry lock for 'foo'
                                                holds inode lock for GLOBAL_BITMAP_SYSTEM_INODE
                                                dquot_initialize(bar)
                                                  ocfs2_dquot_acquire()
                                                    ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
                                                    ...
downconvert thread (triggered from another node
or a different process from NODE 2)
  ocfs2_dentry_post_unlock()
    ...
    iput(foo)
      ocfs2_evict_inode(foo)
        ocfs2_clear_inode(foo)
          dquot_drop(inode)
            ...
            ocfs2_dquot_release()
              ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
              - blocks
                                                    finds we need more space in quota file
                                                    ...
                                                    ocfs2_extend_no_holes()
                                                      ocfs2_inode_lock(GLOBAL_BITMAP_SYSTEM_INODE)
                                                      - deadlocks waiting for downconvert thread

We solve the problem by postponing dropping of the last dquot reference to a workqueue if it happens from the downconvert thread.

Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Mark Fasheh <[email protected]>
Reviewed-by: Srinivas Eeda <[email protected]>
Cc: Joel Becker <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | quota: provide function to grab quota structure reference | Jan Kara | 2 files | -2/+10
Provide dqgrab() function to get quota structure reference when we are sure it already has at least one active reference. Make use of this function inside quota code. Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Mark Fasheh <[email protected]> Reviewed-by: Srinivas Eeda <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
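A hedged sketch of what such a grab helper amounts to (illustrative type and names, not the actual quota code): with an active reference already held by the caller, taking another reference is a bare increment with no revalidation or locking.

#include <stdatomic.h>
#include <assert.h>

struct dquot_like {
        atomic_int count;
};

/* Grab an extra reference to a structure the caller is known to already
 * hold a reference to; nothing can be freeing it under us, so a plain
 * increment is enough. */
static inline void dqgrab_like(struct dquot_like *dq)
{
        int old = atomic_fetch_add(&dq->count, 1);

        assert(old > 0);        /* the caller promised an active reference */
}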
2014-04-03 | ocfs2: move dquot_initialize() in ocfs2_delete_inode() somewhat later | Jan Kara | 1 file | -7/+9
Move the dquot_initialize() call in ocfs2_delete_inode() after the moment we verify the inode is actually a sane one to delete. We certainly don't want to initialize quota for system inodes etc. This also avoids calling into quota code from the downconvert thread. Add more details into the comment on why bailing out from ocfs2_delete_inode() when we are in the downconvert thread is OK. Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Mark Fasheh <[email protected]> Reviewed-by: Srinivas Eeda <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | ocfs2: remove OCFS2_INODE_SKIP_DELETE flag | Jan Kara | 3 files | -17/+3
The flag was never set, so delete it. Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Mark Fasheh <[email protected]> Reviewed-by: Srinivas Eeda <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | ocfs2: add dlm_recover_callback_support in sysfs | Goldwyn Rodrigues | 1 file | -0/+14
This is a part of the nocontrold feature which was incorporated sometime back. It is required for backward compatibility of the tools, specifically the scenario where tools with the recovery callback are used with a kernel not using the recovery callbacks (older kernel + newer tools). The tools look for this file to understand if the kernel supports DLM recovery callbacks. For kernels which support recovery callbacks but miss this patch, ocfs2 will continue to use the older API and will still be able to mount the filesystem. [[email protected]: simplify] [[email protected]: VERIFY_OCTAL_PERMISSIONS fix up] Signed-off-by: Goldwyn Rodrigues <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | ocfs2: dlm: fix recovery hung | Junxiao Bi | 1 file | -2/+13
There is a race window in dlm_do_recovery() between dlm_remaster_locks() and dlm_reset_recovery() when the recovery master has nearly finished the recovery process for a dead node. After the master sends the FINALIZE_RECO message in dlm_remaster_locks(), another node may become the recovery master for another dead node and then send the BEGIN_RECO message to all the nodes, including the old master. In the handler of this message on the old master, dlm_begin_reco_handler(), dlm->reco.dead_node and dlm->reco.new_master will be set to the second dead node and the new master; then in dlm_reset_recovery() these two variables will be reset to their default values. This leaves the new recovery master unable to finish the recovery process, so it hangs, and at last the whole cluster hangs waiting for recovery.

old recovery master:                        new recovery master:
dlm_remaster_locks()
                                            become recovery master for
                                            another dead node.
                                            dlm_send_begin_reco_message()
dlm_begin_reco_handler()
{
 if (dlm->reco.state & DLM_RECO_STATE_FINALIZE) {
  return -EAGAIN;
 }
 dlm_set_reco_master(dlm, br->node_idx);
 dlm_set_reco_dead_node(dlm, br->dead_node);
}
dlm_reset_recovery()
{
 dlm_set_reco_dead_node(dlm, O2NM_INVALID_NODE_NUM);
 dlm_set_reco_master(dlm, O2NM_INVALID_NODE_NUM);
}
                                            will hang in dlm_remaster_locks() for
                                            request dlm locks info

Before sending the FINALIZE_RECO message, the recovery master should set DLM_RECO_STATE_FINALIZE for itself and clear it after the recovery is done; this closes the race window, as BEGIN_RECO messages will not be handled before the DLM_RECO_STATE_FINALIZE flag is cleared. A similar race may happen between the new recovery master and a normal node which is in dlm_finalize_reco_handler(); fix that as well.

Signed-off-by: Junxiao Bi <[email protected]>
Reviewed-by: Srinivas Eeda <[email protected]>
Reviewed-by: Wengang Wang <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | ocfs2: dlm: fix lock migration crash | Junxiao Bi | 1 file | -6/+8
This issue was introduced by commit 800deef3f6f8 ("ocfs2: use list_for_each_entry where benefical") in 2007, where it replaced list_for_each with list_for_each_entry. The variable "lock" will point to invalid data if the "tmpq" list is empty, and a panic will be triggered due to this. Sunil advised reverting it back, but the old version was also not right. At the end of the outer for loop, that list_for_each_entry will also set "lock" to invalid data; then in the next loop, if the "tmpq" list is empty, "lock" will be stale invalid data and cause the panic. So revert back to list_for_each and reset "lock" to NULL to fix this issue.

Another concern is that this seemingly cannot happen because the "tmpq" list should not be empty. Let me describe how it can: imagine there's a lockres with an EX lock from node 2 in the granted list and an NR lock from node x with convert_type EX in the converting list.

old lock resource owner(node 1):                 migration target(node 2):

dlm_empty_lockres() {
  dlm_pick_migration_target() {
    pick node 2 as target as its lock is
    the first one in granted list.
  }
  dlm_migrate_lockres() {
    dlm_mark_lockres_migrating() {
      res->state |= DLM_LOCK_RES_BLOCK_DIRTY;
      wait_event(dlm->ast_wq, !dlm_lockres_is_dirty(dlm, res));
      // after the above code, we can not dirty lockres any more,
      // so dlm_thread shuffle list will not run
                                                 downconvert lock from EX to NR
                                                 upconvert lock from NR to EX
<<< migration may schedule out here, then
<<< node 2 send down convert request to convert type from EX to
<<< NR, then send up convert request to convert type from NR to
<<< EX, at this time, lockres granted list is empty, and two locks
<<< in the converting list, node x up convert lock followed by
<<< node 2 up convert lock.
      // will set lockres RES_MIGRATING flag, the following
      // lock/unlock can not run
      dlm_lockres_release_ast(dlm, res);
    }
    dlm_send_one_lockres()
                                                 dlm_process_recovery_data()
                                                   for (i=0; i<mres->num_locks; i++)
                                                     if (ml->node == dlm->node_num)
                                                       for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
                                                         list_for_each_entry(lock, tmpq, list)
                                                           if (lock) break;
                                                         <<< lock is invalid as grant list is empty.
                                                       }
                                                       if (lock->ml.node != ml->node)
                                                         BUG() >>> crash here
  }

I see the above locks status from a vmcore of our internal bug.

Signed-off-by: Junxiao Bi <[email protected]>
Reviewed-by: Wengang Wang <[email protected]>
Cc: Sunil Mushran <[email protected]>
Reviewed-by: Srinivas Eeda <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
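A self-contained userspace demonstration of the underlying pitfall (simplified macros, not the dlm code): with a list_for_each_entry-style cursor, an empty list leaves the cursor pointing at garbage computed from the list head, never at NULL.

#include <stdio.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

#define LIST_HEAD_INIT(name) { &(name), &(name) }
#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))
#define list_for_each_entry(pos, head, member)                          \
        for (pos = container_of((head)->next, typeof(*pos), member);    \
             &pos->member != (head);                                    \
             pos = container_of(pos->member.next, typeof(*pos), member))

struct lock { int id; struct list_head list; };

int main(void)
{
        struct list_head queue = LIST_HEAD_INIT(queue);
        struct lock *lock = NULL;

        list_for_each_entry(lock, &queue, list)
                break;          /* never runs: queue is empty */

        /* 'lock' is not NULL here; it is an address computed from the list
         * head itself and must not be dereferenced, which is exactly what
         * the fixed code guards against by resetting the cursor to NULL. */
        printf("lock = %p (garbage, not a real entry)\n", (void *)lock);
        return 0;
}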
2014-04-03 | ocfs2: improve fsync efficiency and fix deadlock between aio_write and sync_file | Darrick J. Wong | 9 files | -21/+74
Currently, ocfs2_sync_file grabs i_mutex and forces the current journal transaction to complete. This isn't terribly efficient, since sync_file really only needs to wait for the last transaction involving that inode to complete, and this doesn't require i_mutex. Therefore, implement the necessary bits to track the newest tid associated with an inode, and teach sync_file to wait for that instead of waiting for everything in the journal to commit. Furthermore, only issue the flush request to the drive if jbd2 hasn't already done so. This also eliminates the deadlock between ocfs2_file_aio_write() and ocfs2_sync_file(). aio_write takes i_mutex then calls ocfs2_aiodio_wait() to wait for unaligned dio writes to finish. However, if that dio completion involves calling fsync, then we can get into trouble when some ocfs2_sync_file tries to take i_mutex. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | ocfs2: remove unused variable uuid_net_key in ocfs2_initialize_super | joyce.xue | 1 file | -3/+0
Variable uuid_net_key in ocfs2_initialize_super() is not used. Clean it up. Signed-off-by: joyce.xue <[email protected]> Signed-off-by: Joseph Qi <[email protected]> Acked-by: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | ocfs2: change ip_unaligned_aio to of type mutex from atomic_t | Wengang Wang | 5 files | -30/+7
There is a problem that waitqueue_active() may check stale data and thus miss a wakeup of threads waiting on ip_unaligned_aio. The only valid values of ip_unaligned_aio are 0 and 1, so we can change it to be of type mutex, and thus the above problem is avoided. Another benefit is that a mutex, which works as a FIFO, is fairer than wake_up_all(). Signed-off-by: Wengang Wang <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
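A minimal pthreads analogue of the change (illustrative only, not the ocfs2 code): the 0/1 flag plus a hand-rolled wait/wake pair, where the waker could consult a stale "no waiters" answer, is replaced by a plain lock whose unlock cannot lose a wakeup.

#include <pthread.h>

/* Before: an atomic 0/1 flag plus a wait queue, with the releasing side
 * checking waitqueue_active() and possibly missing a thread just about to
 * sleep.  After: since only 0 and 1 were ever legal, a mutex expresses the
 * same exclusion directly. */
static pthread_mutex_t ip_unaligned_aio = PTHREAD_MUTEX_INITIALIZER;

static void start_unaligned_aio(void)
{
        pthread_mutex_lock(&ip_unaligned_aio);
}

static void end_unaligned_aio(void)
{
        pthread_mutex_unlock(&ip_unaligned_aio);
}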
2014-04-03 | ocfs2: fix null pointer dereference when access dlm_state before launching dlm thread | Zongxun Wang | 1 file | -3/+3
When mounting an ocfs2 volume, we first generate the file /sys/kernel/debug/o2dlm/<uuid>/dlm_state, and then launch the dlm thread. So the following situation will cause a null pointer dereference: dlm_debug_init -> access file dlm_state, which will call dlm_state_print -> dlm_launch_thread. Moving dlm_debug_init after dlm_launch_thread and dlm_launch_recovery_thread fixes this issue. Signed-off-by: Zongxun Wang <[email protected]> Signed-off-by: Joseph Qi <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | arch/sh/drivers/pci/pcie-sh7786.h: remove duplicate SH4A_PCIEPHYCTLR | Geert Uytterhoeven | 1 file | -3/+0
Signed-off-by: Geert Uytterhoeven <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | sh: sh7757: switch RSPI clock to dev ID match | Geert Uytterhoeven | 1 file | -1/+1
Switch the RSPI MSTP clock on SH7757 from a con ID match to a dev ID match, so we can start looking it up using clk_get() with a NULL ID. Signed-off-by: Geert Uytterhoeven <[email protected]> Tested-by: Yoshihiro Shimoda <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | arch/sh/boards/board-sh7757lcr.c: fixup SDHI register size | Kuninori Morimoto | 1 file | -1/+1
sh7757lcr SDHI register size is 0x100 Signed-off-by: Kuninori Morimoto <[email protected]> Cc: Simon Horman <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | sh: don't pass saved userspace state to exception handlers | Bobby Bingham | 2 files | -28/+11
The compiler is permitted to generate code which overwrites the parameters to a function. If those parameters include the only saved copy we have of userspace's registers, we're in trouble. Signed-off-by: Bobby Bingham <[email protected]> Cc: Paul Mundt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | sh: remove unused do_fpu_error | Bobby Bingham | 1 file | -18/+0
This does not appear to have been used since commit 74d99a5e2622 ("sh: SH-2A FPU support") in 2007. Signed-off-by: Bobby Bingham <[email protected]> Cc: Paul Mundt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | sh: push extra copy of r0-r2 for syscall parameters | Bobby Bingham | 4 files | -26/+20
When invoking syscall handlers on sh32, the saved userspace registers are at the top of the stack. This seems to have been intentional, as it is an easy way to pass r0, r1, ... to the handler as parameters 5, 6, ... It causes problems, however, because the compiler is allowed to generate code for a function which clobbers that function's own parameters. For example, gcc generates the following code for clone:

<SyS_clone>:
    mov.l   8c020714 <SyS_clone+0xc>,r1   ! 8c020540 <do_fork>
    mov.l   r7,@r15
    mov     r6,r7
    jmp     @r1
    mov     #0,r6
    nop
    .word 0x0540
    .word 0x8c02

The `mov.l r7,@r15` clobbers the saved value of r0 passed from userspace. For most system calls, this might not be a problem, because we'll be overwriting r0 with the return value anyway. But in the case of clone, copy_thread will need the original value of r0 if the CLONE_SETTLS flag was specified.

The first patch in this series fixes this issue for system calls by pushing to the stack an extra copy of r0-r2 before invoking the handler. We discard this copy before restoring the userspace registers, so it is not a problem if they are clobbered.

Exception handlers also receive the userspace register values in a similar manner, and may hit the same problem. The second patch removes the do_fpu_error handler, which looks susceptible to this problem and which, as far as I can tell, has not been used in some time. The third patch addresses other exception handlers.

This patch (of 3): The userspace registers are stored at the top of the stack when the syscall handler is invoked, which allows r0-r2 to act as parameters 5-7. Parameters passed on the stack may be clobbered by the syscall handler. The solution is to push an extra copy of the registers which might be used as syscall parameters to the stack, so that the authoritative set of saved register values does not get clobbered. A few system call handlers are also updated to get the userspace registers using current_pt_regs() instead of from the stack.

Signed-off-by: Bobby Bingham <[email protected]>
Cc: Paul Mundt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | score: remove unused CPU_SCORE7 Kconfig parameter | Michael Opdenacker | 1 file | -6/+0
This removes the CPU_SCORE7 Kconfig parameter, which is no longer used anywhere in the source code and Makefiles. Signed-off-by: Michael Opdenacker <[email protected]> Cc: Chen Liqin <[email protected]> Cc: Lennox Wu <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | genksyms: fix typeof() handling | Jan Beulich | 7 files | -384/+498
Recent increased use of typeof() throughout the tree resulted in a number of symbols (25 in a typical distro config of ours) not getting a proper CRC calculated for them anymore, due to the parser in genksyms not coping with several of these uses (interestingly in the majority of [if not all] cases the problem is due to the use of typeof() in code preceding a certain export, not in the declaration/definition of the exported function/object itself; I wasn't able to find a way to address this more general parser shortcoming). The use of parameter_declaration is a little more relaxed than would be ideal (permitting not just a bare type specification, but also one with identifier), but since the same code is being passed through an actual compiler, there's no apparent risk of allowing through any broken code. Otoh using parameter_declaration instead of the ad hoc "decl_specifier_seq '*'" / "decl_specifier_seq" pair allows all types to be handled rather than just plain ones and pointers to plain ones. Signed-off-by: Jan Beulich <[email protected]> Cc: Michal Marek <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | fanotify: move unrelated handling from copy_event_to_user() | Jan Kara | 1 file | -21/+19
Move the code that moves an event structure to access_list from copy_event_to_user() to fanotify_read(), where it is more logical (so that we can immediately see in the main loop that we either move the event to a different list or free it). Also move special error handling for permission events from copy_event_to_user() to the main loop to have it in one place with error handling for normal events. This makes copy_event_to_user() really only copy the event to user without any side effects. Signed-off-by: Jan Kara <[email protected]> Cc: Eric Paris <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | fanotify: reorganize loop in fanotify_read() | Jan Kara | 1 file | -22/+24
Swap the error / "read ok" branches in the main loop of fanotify_read(). We will grow the "read ok" part in the next patch and this makes the indentation easier. Also it is more common to have error conditions inside an 'if' instead of the fast path. Signed-off-by: Jan Kara <[email protected]> Cc: Eric Paris <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | fanotify: convert access_mutex to spinlock | Jan Kara | 2 files | -8/+8
access_mutex is used only to guard operations on access_list. There's no need for sleeping within this lock so just make a spinlock out of it. Signed-off-by: Jan Kara <[email protected]> Cc: Eric Paris <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | fanotify: use fanotify event structure for permission response processing | Jan Kara | 3 files | -104/+116
Currently, fanotify creates a new structure to track the fact that a permission event has been reported to userspace and someone is waiting for a response to it. As event structures are now completely in the hands of each notification framework, we can use the event structure for this tracking instead of allocating a new structure. Since this makes the event structures for normal events and permission events even more different and the structures have different lifetime rules, we split them into two separate structures (where the permission event structure contains the structure for a normal event). This makes normal events 8 bytes smaller and the code a tad bit cleaner. [[email protected]: fix build] Signed-off-by: Jan Kara <[email protected]> Cc: Eric Paris <[email protected]> Cc: Al Viro <[email protected]> Cc: Wu Fengguang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
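A hedged sketch of the layout idea (struct and field names are illustrative, not the exact fanotify definitions): the permission event embeds the ordinary event as its first member, so common code can handle both kinds uniformly while only permission events carry the response state.

#include <stdint.h>

struct fa_event {                       /* common part, present in every event */
        uint64_t mask;
        int32_t fd;
        int32_t pid;
};

struct fa_perm_event {                  /* allocated only for permission events */
        struct fa_event fae;            /* must stay first so pointers convert */
        uint32_t response;              /* e.g. allow/deny, filled in by userspace */
};

/* Given a pointer to the embedded common part, recover the permission event.
 * Only valid when the event is known to be a permission event. */
static inline struct fa_perm_event *FA_PERM(struct fa_event *fae)
{
        return (struct fa_perm_event *)fae;
}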
2014-04-03 | fanotify: remove useless bypass_perm check | Jan Kara | 1 file | -8/+0
The prepare_for_access_response() function checks whether group->fanotify_data.bypass_perm is set. However this test can never be true because prepare_for_access_response() is called only from fanotify_read() which means fanotify group is alive with an active fd while bypass_perm is set from fanotify_release() when all file descriptors pointing to the group are closed and the group is going away. Signed-off-by: Jan Kara <[email protected]> Cc: Eric Paris <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | fs/freevxfs/vxfs_lookup.c: update function comment | Fabian Frederick | 1 file | -1/+1
nameidata was replaced by flags in commit 00cd8dd3bf95 ("stop passing nameidata to ->lookup()"). Signed-off-by: Fabian Frederick <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-04-03 | fs/cifs/cifsfs.c: add __init to cifs_init_inodecache() | Fabian Frederick | 1 file | -1/+1
cifs_init_inodecache is only called by __init init_cifs. Signed-off-by: Fabian Frederick <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>