Age | Commit message | Author | Files | Lines
2016-05-25 | Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 2 | -0/+9
Pull objtool build fix from Ingo Molnar: "An objtool fix for older libelf versions"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool: Allow building with older libelf
2016-05-26 | ceph: fix wake_up_session_cb() | Yan, Zheng | 1 | -1/+1
We should reset i_requested_max_size before waking the waiters. (A zero i_requested_max_size makes the waiter re-request the max size.)
Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: don't use truncate_pagecache() to invalidate read cache | Yan, Zheng | 2 | -5/+7
truncate_pagecache() drops dirty pages, so it is dangerous to use it to invalidate the read cache. Besides, we shouldn't start invalidating the read cache while there are buffer writers, because buffer writers may add dirty pages later.
Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: SetPageError() for writeback pages if writepages fails | Yan, Zheng | 1 | -1/+3
Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: handle interrupted ceph_writepage() | Yan, Zheng | 1 | -4/+18
writepage() can be interrupted when it is called by the direct memory reclaimer (that is, when the direct memory reclaimer is killed). To avoid losing data, we redirty the page.
Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: make ceph_update_writeable_page() uninterruptible | Yan, Zheng | 1 | -1/+1
ceph_update_writeable_page() is used by ceph_write_begin(). It breaks the atomicity of the write operation if it is interruptible.
Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | libceph: make ceph_osdc_wait_request() uninterruptible | Yan, Zheng | 1 | -1/+1
ceph_osdc_wait_request() is used when cephfs issues sync IO. In most cases, the sync IO should be uninterruptible. The fix is to use a killable wait function in ceph_osdc_wait_request().
Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: handle -EAGAIN returned by ceph_update_writeable_page() | Yan, Zheng | 1 | -13/+15
When ceph_update_writeable_page() returns -EAGAIN, the caller should lock the page and call ceph_update_writeable_page() again.
Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: make fault/page_mkwrite return VM_FAULT_OOM for -ENOMEM | Yan, Zheng | 1 | -20/+17
Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: block non-fatal signals for fault/page_mkwrite | Yan, Zheng | 1 | -27/+39
fault and page_mkwrite are supposed to be uninterruptible, but they call ceph functions that are interruptible. So they should block signals before calling those functions.
Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: make logical calculation functions return bool | Zhang Zhuoyu | 6 | -9/+9
This patch makes several logical calculation functions return bool to improve readability, since these functions only use 0/1 as their return value. No functional change.
Signed-off-by: Zhang Zhuoyu <[email protected]>
2016-05-26 | ceph: tolerate bad i_size for symlink inode | Yan, Zheng | 1 | -7/+15
An MDS bug can cause a symlink's size to be truncated to zero.
Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: improve fragtree change detection | Yan, Zheng | 2 | -4/+21
Check whether the number of splits in i_fragtree equals the number of splits in the MDS reply.
Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: keep leaf frag when updating fragtree | Yan, Zheng | 1 | -5/+23
Nodes in i_fragtree are sorted according to ceph_frag_compare(). This means a frag node in i_fragtree always follows its direct parent node. To check whether a leaf node is valid, we just need to check whether it is a child of the previous split node.
Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: fix dir_auth check in ceph_fill_dirfrag() | Yan, Zheng | 1 | -0/+3
-1 is CDIR_AUTH_PARENT; it means the dir's auth MDS is the same as the inode's auth MDS.
Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: don't assume frag tree splits in mds reply are sorted | Yan, Zheng | 1 | -0/+13
The algorithm that updates i_fragtree relies on the frag tree splits in the MDS reply being in the same order as i_fragtree. This is not true, because the current MDS encodes frag tree splits in ascending order of (unsigned)frag_t, whereas nodes in i_fragtree are sorted according to ceph_frag_compare(). The fix is to sort the frag tree splits first, then update i_fragtree.
Signed-off-by: Yan, Zheng <[email protected]>
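The ordering mismatch described in this commit can be illustrated with a minimal userspace sketch. It assumes that a frag packs a 24-bit value in its low bits and a bit count in its high 8 bits, and that the tree ordering compares the value first and the bit count second. The names frag_compare(), struct frag_split, and sort_splits() are hypothetical stand-ins for illustration, not the kernel's code.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed encoding for this sketch: low 24 bits = value, high 8 bits = bit count. */
static uint32_t frag_value(uint32_t f) { return f & 0xffffffu; }
static uint32_t frag_bits(uint32_t f)  { return f >> 24; }

/* Hypothetical tree ordering: compare by value first, then by bit count. */
static int frag_compare(uint32_t a, uint32_t b)
{
    if (frag_value(a) != frag_value(b))
        return frag_value(a) < frag_value(b) ? -1 : 1;
    if (frag_bits(a) != frag_bits(b))
        return frag_bits(a) < frag_bits(b) ? -1 : 1;
    return 0;
}

/* Hypothetical record for one split taken from a reply. */
struct frag_split {
    uint32_t frag;
    uint32_t by;
};

/* qsort() adapter that orders split records by frag_compare(). */
static int split_cmp(const void *l, const void *r)
{
    const struct frag_split *a = l;
    const struct frag_split *b = r;

    return frag_compare(a->frag, b->frag);
}

/* Put the splits into tree order before walking the tree alongside them. */
static void sort_splits(struct frag_split *splits, size_t n)
{
    qsort(splits, n, sizeof(*splits), split_cmp);
}
```

Under these assumptions, 0x01800000 sorts before 0x02000000 as a plain unsigned integer but after it in tree order, which is why a sort pass is needed before the two sequences can be walked together.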
2016-05-26 | ceph: fix inode reference leak | Yan, Zheng | 1 | -1/+1
Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: using hash value to compose dentry offset | Yan, Zheng | 6 | -47/+136
If the MDS sorts dentries in a dirfrag in hash order, we use the hash value to compose the dentry offset. The dentry offset is:
  (0xff << 52) | ((24 bits hash) << 28) | (the nth entry has hash collision)
This offset is stable across directory fragmentation. This also means there is no need to reset the readdir offset if the directory gets fragmented in the middle of readdir.
Signed-off-by: Yan, Zheng <[email protected]>
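The offset layout given in this commit message can be written out as a small userspace sketch. The macro and function names below (OFFSET_BITS, HASH_ORDER_MARK, make_hash_fpos(), fpos_hash(), fpos_nth()) are hypothetical and chosen for illustration; only the bit positions come from the message.

```c
#include <stdint.h>

#define OFFSET_BITS     28                              /* low bits: nth entry with the same hash */
#define OFFSET_MASK     ((1u << OFFSET_BITS) - 1)
#define HASH_ORDER_MARK (0xffull << (OFFSET_BITS + 24)) /* 0xff << 52 */

/* Compose (0xff << 52) | (24-bit hash << 28) | nth. */
static uint64_t make_hash_fpos(uint32_t hash24, uint32_t nth)
{
    return HASH_ORDER_MARK |
           ((uint64_t)(hash24 & 0xffffffu) << OFFSET_BITS) |
           (nth & OFFSET_MASK);
}

/* Recover the 24-bit hash from a composed offset. */
static uint32_t fpos_hash(uint64_t pos)
{
    return (uint32_t)((pos >> OFFSET_BITS) & 0xffffffu);
}

/* Recover the collision index from a composed offset. */
static uint32_t fpos_nth(uint64_t pos)
{
    return (uint32_t)(pos & OFFSET_MASK);
}
```

Because the composed value contains only the hash and the collision index, and no fragment identifier, it does not change when the directory is split or merged, which matches the stability claim in the message.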
2016-05-26 | ceph: don't forbid marking directory complete after forward seek | Yan, Zheng | 1 | -5/+0
A forward seek within the same frag does not update fi->last_name, so it will not affect the contents of later readdir replies. There is therefore no need to forbid marking the directory complete.
Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: record 'offset' for each entry of readdir result | Yan, Zheng | 5 | -29/+59
This is preparation for using hash value as dentry 'offset'.
Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: define 'end/complete' in readdir reply as bit flags | Yan, Zheng | 4 | -3/+20
Set a flag in the readdir request to indicate that the client interprets 'end/complete' as bit flags, so that the MDS can send additional flags in the readdir reply.
Signed-off-by: Yan, Zheng <[email protected]>
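As a rough illustration of reading 'end/complete' as bit flags, the sketch below decodes a 16-bit flags word into two booleans. The flag names and bit positions are assumptions made for this example (the commit message does not state them); the choice of bit 0 and bit 8 assumes the two legacy one-byte fields are reinterpreted as one little-endian 16-bit word.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values; the real wire values are not given in the message. */
#define REPLY_FRAG_END      (1u << 0)
#define REPLY_FRAG_COMPLETE (1u << 8)

/* Hypothetical decoded view of the reply status. */
struct readdir_status {
    bool end;
    bool complete;
};

/* Test each bit independently so unknown extra bits are simply ignored. */
static struct readdir_status decode_readdir_flags(uint16_t flags)
{
    struct readdir_status st;

    st.end = (flags & REPLY_FRAG_END) != 0;
    st.complete = (flags & REPLY_FRAG_COMPLETE) != 0;
    return st;
}
```

Testing individual bits rather than comparing whole bytes leaves the remaining bits free for additional flags in later replies.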
2016-05-26 | ceph: define struct for dir entry in readdir reply | Yan, Zheng | 4 | -52/+50
This avoids defining multiple arrays for entries in readdir reply.
Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: simplify 'offset in frag' | Yan, Zheng | 2 | -13/+4
Don't distinguish the leftmost frag from other frags. Always use 2 as the first entry's offset.
Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: remove unnecessary checks in __dcache_readdir | Yan, Zheng | 1 | -2/+0
We never add snapdir or the hidden .ceph dir to the readdir cache.
Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: search cache postion for dcache readdir | Yan, Zheng | 1 | -46/+83
Use binary search to find the cache index that corresponds to the readdir position.
Signed-off-by: Yan, Zheng <[email protected]>
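The lookup described in this commit can be sketched as a lower-bound binary search. This is a simplified userspace illustration over a plain sorted array of offsets; the function name find_cache_index() and the array representation are assumptions, not the kernel's data structures.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of the first cached entry whose offset is >= pos,
 * or n if every cached offset is smaller. offsets[] must be sorted
 * in ascending order.
 */
static size_t find_cache_index(const uint64_t *offsets, size_t n, uint64_t pos)
{
    size_t lo = 0;
    size_t hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (offsets[mid] < pos)
            lo = mid + 1;   /* everything up to mid is too small */
        else
            hi = mid;       /* mid may be the answer, keep it in range */
    }
    return lo;
}
```

Compared with scanning from the start of the cache, this takes a logarithmic number of comparisons to resume a readdir at an arbitrary position.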
2016-05-26 | ceph: use CEPH_MDS_OP_RMXATTR request to remove xattr | Yan, Zheng | 1 | -6/+11
setxattr with a NULL value and the XATTR_REPLACE flag should be equivalent to removexattr. But the current MDS does not support deleting vxattrs through an MDS_OP_SETXATTR request. The workaround is to send an MDS_OP_RMXATTR request if setxattr actually removes an xattr.
Signed-off-by: Yan, Zheng <[email protected]>
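The rule in this commit message can be reduced to a small decision function. The enum, the function name choose_xattr_op(), and the local constant are hypothetical; the constant mirrors the value Linux uses for XATTR_REPLACE (0x2) so that the sketch stays self-contained.

```c
#include <stddef.h>

/* Local copy of the XATTR_REPLACE value, defined here to keep the sketch standalone. */
#define SKETCH_XATTR_REPLACE 0x2

/* Hypothetical request kinds for this illustration. */
enum xattr_op {
    OP_SETXATTR,
    OP_RMXATTR
};

/*
 * A set request with no value and the replace flag is really a removal,
 * so pick the remove operation in that one case.
 */
static enum xattr_op choose_xattr_op(const void *value, int flags)
{
    if (value == NULL && (flags & SKETCH_XATTR_REPLACE))
        return OP_RMXATTR;
    return OP_SETXATTR;
}
```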
2016-05-26 | ceph: report mount root in session metadata | Yan, Zheng | 3 | -15/+23
Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: don't show symlink target in debugfs/mdsc | Yan, Zheng | 1 | -1/+1
The symlink target is useless for debugging and can be very long. It's annoying to show it in debugfs/mdsc.
Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: don't call truncate_pagecache in ceph_writepages_start | Yan, Zheng | 3 | -9/+38
truncate_pagecache() may decrease the inode's reference count. This can cause a deadlock if the inode's last reference is dropped and iput_final() wants to evict the inode (evict() calls inode_wait_for_writeback(), which waits for ceph_writepages_start() to return). The fix is to use a work thread to truncate dirty pages. Also add a 'forced umount' check to ceph_update_writeable_page(), which prevents new pages from getting dirty.
Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: renew caps for read/write if mds session got killed. | Yan, Zheng | 4 | -11/+93
When an MDS session gets killed, read/write operations may hang. The client waits for Frw caps, but the MDS does not know what caps the client wants. To recover from this, the client sends an open request to the MDS. The request tells the MDS what caps the client wants.
Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: CEPH_FEATURE_MDSENC support | Yan, Zheng | 2 | -12/+36
Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: multiple filesystem support | Yan, Zheng | 2 | -0/+10
To access a non-default filesystem, we just need to subscribe to mdsmap.<MDS_NAMESPACE_ID> and add a new mount option for the mds namespace id.
Signed-off-by: Yan, Zheng <[email protected]>
[[email protected]: switch to a new libceph API]
Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: support for subscribing to "mdsmap.<id>" maps | Ilya Dryomov | 4 | -5/+17
Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: replace ceph_monc_request_next_osdmap() | Ilya Dryomov | 5 | -16/+9
... with a wrapper around maybe_request_map() - no need for two osdmap-specific functions.
Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: take osdc->lock in osdmap_show() and dump flags in hex | Ilya Dryomov | 1 | -5/+5
There are now about a dozen CEPH_OSDMAP_* flags. This is a debugging interface, so just dump in hex instead of spelling each flag out.
Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: pool deletion detection | Ilya Dryomov | 2 | -6/+248
This adds the "map check" infrastructure for sending osdmap version checks on CALC_TARGET_POOL_DNE and completing in-flight requests with -ENOENT if the target pool doesn't exist or has just been deleted.
Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: async MON client generic requests | Ilya Dryomov | 3 | -111/+228
For map check, we are going to need to send CEPH_MSG_MON_GET_VERSION messages asynchronously and get a callback on completion. Refactor MON client to allow firing off generic requests asynchronously and add an async variant of ceph_monc_get_version(). ceph_monc_do_statfs() is switched over and remains sync.
Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: support for checking on status of watch | Ilya Dryomov | 2 | -1/+55
Implement ceph_osdc_watch_check() to be able to check on status of watch. Note that the time it takes for a watch/notify event to get delivered through the notify_wq is taken into account.
Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: support for sending notifies | Ilya Dryomov | 4 | -11/+249
Implement ceph_osdc_notify() for sending notifies. Due to the fact that the current messenger can't do read-in into pagelists (it can only do write-out from them), I had to go with a page vector for a NOTIFY_COMPLETE payload, for now.
Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph, rbd: ceph_osd_linger_request, watch/notify v2 | Ilya Dryomov | 7 | -431/+1067
This adds support and switches rbd to a new, more reliable version of watch/notify protocol. As with the OSD client update, this is mostly about getting the right structures linked into the right places so that reconnects are properly sent when needed. watch/notify v2 also requires sending regular pings to the OSDs - send_linger_ping().
A major change from the old watch/notify implementation is the introduction of ceph_osd_linger_request - linger requests no longer piggyback on ceph_osd_request. ceph_osd_event has been merged into ceph_osd_linger_request. All the details are now hidden within libceph, the interface consists of a simple pair of watch/unwatch functions and ceph_osdc_notify_ack(). ceph_osdc_watch() does return ceph_osd_linger_request, but only to keep the lifetime management simple. ceph_osdc_notify_ack() accepts an optional data payload, which is relayed back to the notifier.
Portions of this patch are loosely based on work by Douglas Fuller <[email protected]> and Mike Christie <[email protected]>.
Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | rbd: rbd_dev_header_unwatch_sync() variant | Ilya Dryomov | 1 | -4/+9
Introduce __rbd_dev_header_unwatch_sync(), which doesn't flush notify callbacks. This is for the new rados_watcherrcb_t, which would be called from a notify callback.
Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: wait_request_timeout() | Ilya Dryomov | 1 | -13/+21
The unwatch timeout is currently implemented in rbd. With watch/unwatch code moving into libceph, we are going to need a ceph_osdc_wait_request() variant with a timeout.
Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: request_init() and request_release_checks() | Ilya Dryomov | 1 | -17/+27
These are going to be used by request_reinit() code.
Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: a major OSD client update | Ilya Dryomov | 5 | -630/+602
This is a major sync up, up to ~Jewel. The highlights are:
- per-session request trees (vs a global per-client tree)
- per-session locking (vs a global per-client rwlock)
- homeless OSD session
- no ad-hoc global per-client lists
- support for pool quotas
- foundation for watch/notify v2 support
- foundation for map check (pool deletion detection) support
The switchover is incomplete: lingering requests can be set up and torn down but aren't ever reestablished. This functionality is restored with the introduction of the new lingering infrastructure (ceph_osd_linger_request, linger_work, etc) in a later commit.
Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: protect osdc->osd_lru list with a spinlock | Ilya Dryomov | 2 | -11/+19
OSD client is getting moved from the big per-client lock to a set of per-session locks. The big rwlock would only be held for read most of the time, so a global osdc->osd_lru needs additional protection.
Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: allocate ceph_osd with GFP_NOFAIL | Ilya Dryomov | 1 | -4/+1
create_osd() is called way too deep in the stack to be able to error out in a sane way; a failing create_osd() just messes everything up. The current req_notarget list solution is broken - the list is never traversed as it's not entirely clear when to do it, I guess. If we were to start traversing it at regular intervals and retrying each request, we wouldn't be far off from what __GFP_NOFAIL is doing, so allocate OSD sessions with __GFP_NOFAIL, at least until we come up with a better fix.
Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: osd_init() and osd_cleanup() | Ilya Dryomov | 1 | -9/+37
These are going to be used by homeless OSD sessions code.
Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: handle_one_map() | Ilya Dryomov | 4 | -56/+141
Separate osdmap handling from decoding and iterating over a bag of maps in a fresh MOSDMap message. This sets up the scene for the updated OSD client. Of particular importance here is the addition of pi->was_full, which can be used to answer "did this pool go full -> not-full in this map?". This is the key bit for supporting pool quotas. We won't be able to downgrade map_sem for much longer, so drop downgrade_write().
Signed-off-by: Ilya Dryomov <[email protected]>
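The "full -> not-full" question that pi->was_full is meant to answer can be sketched as follows. This is a simplified userspace model: struct pool_info, apply_map(), and pool_cleared_full() are hypothetical names, and the real code tracks more state than a single boolean.

```c
#include <stdbool.h>

/* Hypothetical per-pool state for this illustration. */
struct pool_info {
    bool full;      /* state according to the most recent map */
    bool was_full;  /* state according to the map before that */
};

/* Remember the previous state before taking the state from a new map. */
static void apply_map(struct pool_info *pi, bool full_in_new_map)
{
    pi->was_full = pi->full;
    pi->full = full_in_new_map;
}

/* True only for the map in which the pool went from full to not-full. */
static bool pool_cleared_full(const struct pool_info *pi)
{
    return pi->was_full && !pi->full;
}
```

Keeping the previous state next to the current one lets a caller detect the transition itself, for example to resend requests that were held back while the pool was full.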
2016-05-25 | Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs | Linus Torvalds | 1 | -1/+1
Pull vfs iov_iter regression fix from Al Viro: "Fix for braino in 'fold checks into iterate_and_advance()'"
* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  do "fold checks into iterate_and_advance()" right
2016-05-25 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs | Linus Torvalds | 1 | -1/+2
Pull vfs xattr regression fixes from Al Viro.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  make xattr_resolve_handlers() safe to use with NULL ->s_xattr
  xattr: Fail with -EINVAL for NULL attribute names