Age | Commit message | Author | Files | Lines
2016-05-26 | ceph: fix dir_auth check in ceph_fill_dirfrag() | Yan, Zheng | 1 | -0/+3
-1 is CDIR_AUTH_PARENT; it means the dir's auth MDS is the same as the inode's auth MDS. Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: don't assume frag tree splits in mds reply are sorted | Yan, Zheng | 1 | -0/+13
The algorithm that updates i_fragtree relies on the frag tree splits in the MDS reply being in the same order as i_fragtree. This is not true, because the current MDS encodes frag tree splits in ascending order of (unsigned)frag_t, while nodes in i_fragtree are sorted according to ceph_frag_compare(). The fix is to sort the frag tree splits first, then update i_fragtree. Signed-off-by: Yan, Zheng <[email protected]>
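The ordering mismatch described above can be sketched in userspace C. This is only an illustration: it assumes a frag_t keeps its bit count in the high 8 bits and its value in the low 24 bits, and that ceph_frag_compare() orders by value first and bit count second. The helper names (frag_bits, frag_value, frag_compare, sort_frag_splits) are hypothetical and are not taken from the patch.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed frag_t layout: high 8 bits = number of bits, low 24 bits = value. */
uint32_t frag_bits(uint32_t f)  { return f >> 24; }
uint32_t frag_value(uint32_t f) { return f & 0xffffffu; }

/* Assumed tree order: compare by value first, then by number of bits. */
int frag_compare(uint32_t a, uint32_t b)
{
    if (frag_value(a) != frag_value(b))
        return frag_value(a) < frag_value(b) ? -1 : 1;
    if (frag_bits(a) != frag_bits(b))
        return frag_bits(a) < frag_bits(b) ? -1 : 1;
    return 0;
}

/* Adapter so frag_compare() can be used with qsort(). */
int frag_compare_qsort(const void *pa, const void *pb)
{
    return frag_compare(*(const uint32_t *)pa, *(const uint32_t *)pb);
}

/* Put the splits from a reply into tree order before merging them. */
void sort_frag_splits(uint32_t *frags, size_t n)
{
    qsort(frags, n, sizeof(*frags), frag_compare_qsort);
}
```

Under these assumptions, a reply that is ascending as plain unsigned integers (0x01000000, 0x01800000, 0x02000000) is not in tree order, because the third entry has a smaller value than the second.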
2016-05-26 | ceph: fix inode reference leak | Yan, Zheng | 1 | -1/+1
Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: using hash value to compose dentry offset | Yan, Zheng | 6 | -47/+136
If MDS sorts dentries in dirfrag in hash order, we use the hash value to compose the dentry offset. The dentry offset is: (0xff << 52) | ((24 bits hash) << 28) | (n, for the nth entry with a colliding hash). This offset is stable across directory fragmentation. This also means there is no need to reset the readdir offset if the directory gets fragmented in the middle of readdir. Signed-off-by: Yan, Zheng <[email protected]>
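The offset layout in the message above can be written out as a small, self-contained sketch. The macro and function names here are hypothetical; only the bit layout (0xff at bit 52, a 24-bit hash at bit 28, a collision index in the low 28 bits) comes from the commit message.

```c
#include <assert.h>
#include <stdint.h>

#define OFFSET_HASH_TAG   (0xffULL << 52)  /* marker bits from the commit message */
#define OFFSET_HASH_SHIFT 28
#define OFFSET_HASH_MASK  0xffffffULL      /* 24-bit hash */
#define OFFSET_INDEX_MASK 0xfffffffULL     /* low 28 bits: nth colliding entry */

/* Compose an offset from a 24-bit hash and a collision index. */
uint64_t make_hash_offset(uint32_t hash, uint32_t nth)
{
    assert(nth <= OFFSET_INDEX_MASK);
    return OFFSET_HASH_TAG |
           (((uint64_t)hash & OFFSET_HASH_MASK) << OFFSET_HASH_SHIFT) |
           nth;
}

/* True if the offset carries the hash-order marker. */
int is_hash_offset(uint64_t off)
{
    return (off & OFFSET_HASH_TAG) == OFFSET_HASH_TAG;
}

/* Extract the 24-bit hash. */
uint32_t offset_hash(uint64_t off)
{
    return (uint32_t)((off >> OFFSET_HASH_SHIFT) & OFFSET_HASH_MASK);
}

/* Extract the collision index. */
uint32_t offset_index(uint64_t off)
{
    return (uint32_t)(off & OFFSET_INDEX_MASK);
}
```

Because the offset depends only on the hash and the collision index, not on which fragment holds the entry, it stays the same when the directory is split or merged.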
2016-05-26 | ceph: don't forbid marking directory complete after forward seek | Yan, Zheng | 1 | -5/+0
A forward seek within the same frag does not update fi->last_name, so it does not affect the contents of later readdir replies. There is therefore no need to forbid marking the directory complete. Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: record 'offset' for each entry of readdir result | Yan, Zheng | 5 | -29/+59
This is preparation for using the hash value as the dentry 'offset'. Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: define 'end/complete' in readdir reply as bit flags | Yan, Zheng | 4 | -3/+20
Set a flag in the readdir request which indicates that the client interprets 'end/complete' as bit flags, so that the MDS can return additional flags in the readdir reply. Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: define struct for dir entry in readdir reply | Yan, Zheng | 4 | -52/+50
This avoids defining multiple arrays for entries in the readdir reply. Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: simplify 'offset in frag' | Yan, Zheng | 2 | -13/+4
Don't distinguish the leftmost frag from other frags. Always use 2 as the first entry's offset. Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: remove unnecessary checks in __dcache_readdir | Yan, Zheng | 1 | -2/+0
We never add the snapdir or the hidden .ceph dir to the readdir cache. Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: search cache position for dcache readdir | Yan, Zheng | 1 | -46/+83
Use binary search to find the cache index that corresponds to the readdir position. Signed-off-by: Yan, Zheng <[email protected]>
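As a rough illustration of the idea, the following sketch runs a lower-bound binary search over an array of entry offsets sorted in ascending order. It is a simplified userspace model, not the kernel code: the real readdir cache stores dentries rather than a flat array of offsets, and find_cache_index is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of the first cached entry whose offset is >= pos,
 * or n if every cached offset is smaller than pos.
 * The offsets array must be sorted in ascending order.
 */
size_t find_cache_index(const uint64_t *offsets, size_t n, uint64_t pos)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;  /* avoids overflow of lo + hi */

        if (offsets[mid] < pos)
            lo = mid + 1;   /* answer lies strictly to the right of mid */
        else
            hi = mid;       /* mid itself may be the answer */
    }
    return lo;
}
```

A seek to a position that falls between two cached entries then resumes at the next entry, in logarithmic rather than linear time.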
2016-05-26 | ceph: use CEPH_MDS_OP_RMXATTR request to remove xattr | Yan, Zheng | 1 | -6/+11
Setxattr with a NULL value and the XATTR_REPLACE flag should be equivalent to removexattr. But the current MDS does not support deleting vxattrs through an MDS_OP_SETXATTR request. The workaround is to send an MDS_OP_RMXATTR request if setxattr actually removes the xattr. Signed-off-by: Yan, Zheng <[email protected]>
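The decision described above can be sketched as a small function. The enum and flag names are hypothetical stand-ins for the MDS operations and XATTR_REPLACE, and the actual patch may handle more cases (for example a NULL value without the replace flag) differently.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the two MDS request types. */
enum xattr_op { OP_SETXATTR, OP_RMXATTR };

/* Stand-in for XATTR_REPLACE (0x2 in the Linux UAPI headers). */
#define FLAG_REPLACE 0x2

/*
 * Pick the request to send: a NULL value combined with the replace flag
 * means the caller is really removing the attribute.
 */
enum xattr_op choose_xattr_op(const void *value, int flags)
{
    if (value == NULL && (flags & FLAG_REPLACE))
        return OP_RMXATTR;
    return OP_SETXATTR;
}
```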
2016-05-26 | ceph: report mount root in session metadata | Yan, Zheng | 3 | -15/+23
Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: don't show symlink target in debugfs/mdsc | Yan, Zheng | 1 | -1/+1
symlink target is useless for debug and can be very long. It's annoying to show it in debugfs/mdsc. Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: don't call truncate_pagecache in ceph_writepages_start | Yan, Zheng | 3 | -9/+38
truncate_pagecache() may decrease the inode's reference count. This can cause a deadlock if the inode's last reference is dropped and iput_final() wants to evict the inode (evict() calls inode_wait_for_writeback(), which waits for ceph_writepages_start() to return). The fix is to use a work thread to truncate dirty pages. Also add a 'forced umount' check to ceph_update_writeable_page(), which prevents new pages from getting dirty. Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: renew caps for read/write if mds session got killed. | Yan, Zheng | 4 | -11/+93
When the MDS session gets killed, read/write operations may hang. The client waits for Frw caps, but the MDS does not know what caps the client wants. To recover from this, the client sends an open request to the MDS. The request tells the MDS what caps the client wants. Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: CEPH_FEATURE_MDSENC support | Yan, Zheng | 2 | -12/+36
Signed-off-by: Yan, Zheng <[email protected]>
2016-05-26 | ceph: multiple filesystem support | Yan, Zheng | 2 | -0/+10
To access non-default filesystem, we just need to subscribe to mdsmap.<MDS_NAMESPACE_ID> and add a new mount option for mds namespace id. Signed-off-by: Yan, Zheng <[email protected]> [[email protected]: switch to a new libceph API] Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: support for subscribing to "mdsmap.<id>" maps | Ilya Dryomov | 4 | -5/+17
Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: replace ceph_monc_request_next_osdmap() | Ilya Dryomov | 5 | -16/+9
... with a wrapper around maybe_request_map() - no need for two osdmap-specific functions. Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: take osdc->lock in osdmap_show() and dump flags in hex | Ilya Dryomov | 1 | -5/+5
There are now about a dozen CEPH_OSDMAP_* flags. This is a debugging interface, so just dump in hex instead of spelling each flag out. Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: pool deletion detection | Ilya Dryomov | 2 | -6/+248
This adds the "map check" infrastructure for sending osdmap version checks on CALC_TARGET_POOL_DNE and completing in-flight requests with -ENOENT if the target pool doesn't exist or has just been deleted. Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: async MON client generic requests | Ilya Dryomov | 3 | -111/+228
For map check, we are going to need to send CEPH_MSG_MON_GET_VERSION messages asynchronously and get a callback on completion. Refactor MON client to allow firing off generic requests asynchronously and add an async variant of ceph_monc_get_version(). ceph_monc_do_statfs() is switched over and remains sync. Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: support for checking on status of watch | Ilya Dryomov | 2 | -1/+55
Implement ceph_osdc_watch_check() to be able to check on status of watch. Note that the time it takes for a watch/notify event to get delivered through the notify_wq is taken into account. Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: support for sending notifies | Ilya Dryomov | 4 | -11/+249
Implement ceph_osdc_notify() for sending notifies. Due to the fact that the current messenger can't do read-in into pagelists (it can only do write-out from them), I had to go with a page vector for a NOTIFY_COMPLETE payload, for now. Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph, rbd: ceph_osd_linger_request, watch/notify v2 | Ilya Dryomov | 7 | -431/+1067
This adds support and switches rbd to a new, more reliable version of watch/notify protocol. As with the OSD client update, this is mostly about getting the right structures linked into the right places so that reconnects are properly sent when needed. watch/notify v2 also requires sending regular pings to the OSDs - send_linger_ping(). A major change from the old watch/notify implementation is the introduction of ceph_osd_linger_request - linger requests no longer piggy back on ceph_osd_request. ceph_osd_event has been merged into ceph_osd_linger_request. All the details are now hidden within libceph, the interface consists of a simple pair of watch/unwatch functions and ceph_osdc_notify_ack(). ceph_osdc_watch() does return ceph_osd_linger_request, but only to keep the lifetime management simple. ceph_osdc_notify_ack() accepts an optional data payload, which is relayed back to the notifier. Portions of this patch are loosely based on work by Douglas Fuller <[email protected]> and Mike Christie <[email protected]>. Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | rbd: rbd_dev_header_unwatch_sync() variant | Ilya Dryomov | 1 | -4/+9
Introduce __rbd_dev_header_unwatch_sync(), which doesn't flush notify callbacks. This is for the new rados_watcherrcb_t, which would be called from a notify callback. Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: wait_request_timeout() | Ilya Dryomov | 1 | -13/+21
The unwatch timeout is currently implemented in rbd. With watch/unwatch code moving into libceph, we are going to need a ceph_osdc_wait_request() variant with a timeout. Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: request_init() and request_release_checks() | Ilya Dryomov | 1 | -17/+27
These are going to be used by request_reinit() code. Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: a major OSD client update | Ilya Dryomov | 5 | -630/+602
This is a major sync up, up to ~Jewel. The highlights are:
- per-session request trees (vs a global per-client tree)
- per-session locking (vs a global per-client rwlock)
- homeless OSD session
- no ad-hoc global per-client lists
- support for pool quotas
- foundation for watch/notify v2 support
- foundation for map check (pool deletion detection) support
The switchover is incomplete: lingering requests can be set up and torn down but aren't ever reestablished. This functionality is restored with the introduction of the new lingering infrastructure (ceph_osd_linger_request, linger_work, etc) in a later commit. Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: protect osdc->osd_lru list with a spinlock | Ilya Dryomov | 2 | -11/+19
OSD client is getting moved from the big per-client lock to a set of per-session locks. The big rwlock would only be held for read most of the time, so a global osdc->osd_lru needs additional protection. Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: allocate ceph_osd with GFP_NOFAIL | Ilya Dryomov | 1 | -4/+1
create_osd() is called way too deep in the stack to be able to error out in a sane way; a failing create_osd() just messes everything up. The current req_notarget list solution is broken - the list is never traversed as it's not entirely clear when to do it, I guess. If we were to start traversing it at regular intervals and retrying each request, we wouldn't be far off from what __GFP_NOFAIL is doing, so allocate OSD sessions with __GFP_NOFAIL, at least until we come up with a better fix. Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: osd_init() and osd_cleanup() | Ilya Dryomov | 1 | -9/+37
These are going to be used by homeless OSD sessions code. Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: handle_one_map() | Ilya Dryomov | 4 | -56/+141
Separate osdmap handling from decoding and iterating over a bag of maps in a fresh MOSDMap message. This sets up the scene for the updated OSD client. Of particular importance here is the addition of pi->was_full, which can be used to answer "did this pool go full -> not-full in this map?". This is the key bit for supporting pool quotas. We won't be able to downgrade map_sem for much longer, so drop downgrade_write(). Signed-off-by: Ilya Dryomov <[email protected]>
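The role of was_full can be illustrated with a toy model. The struct below is hypothetical and far smaller than the kernel's pool info structure; it only shows how remembering the previous full state lets the client answer "did this pool go full -> not-full in this map?".

```c
#include <stdbool.h>

/* Hypothetical, reduced stand-in for per-pool state. */
struct pool_info {
    bool full;      /* full flag according to the current map */
    bool was_full;  /* full flag according to the previous map */
};

/* Apply the full flag from a new map, remembering the old one. */
void pool_apply_map(struct pool_info *pi, bool full_in_new_map)
{
    pi->was_full = pi->full;
    pi->full = full_in_new_map;
}

/* True if the pool went from full to not-full in the latest map. */
bool pool_cleared_full(const struct pool_info *pi)
{
    return pi->was_full && !pi->full;
}
```

In this simplified model, a transition from full to not-full is the point at which writes that were held back because of a pool quota could be resent.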
2016-05-25 | Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs | Linus Torvalds | 1 | -1/+1
Pull vfs iov_iter regression fix from Al Viro: "Fix for braino in 'fold checks into iterate_and_advance()'"
* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  do "fold checks into iterate_and_advance()" right
2016-05-25 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs | Linus Torvalds | 1 | -1/+2
Pull vfs xattr regression fixes from Al Viro.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  make xattr_resolve_handlers() safe to use with NULL ->s_xattr
  xattr: Fail with -EINVAL for NULL attribute names
2016-05-25 | Merge tag 'acpi-4.7-rc1-more' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm | Linus Torvalds | 1 | -1/+1
Pull ACPI fix from Rafael Wysocki: "Additional ACPI update for v4.7-rc1. Just one fix for incorrect async_synchronize_cookie() usage in the ACPI battery driver (Chris Wilson)"
* tag 'acpi-4.7-rc1-more' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI / battery: Correctly serialise with the pending async probe
2016-05-26 | libceph: allocate dummy osdmap in ceph_osdc_init() | Ilya Dryomov | 3 | -16/+30
This leads to a simpler osdmap handling code, particularly when dealing with pi->was_full, which is introduced in a later commit. Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: schedule tick from ceph_osdc_init() | Ilya Dryomov | 1 | -28/+9
Both homeless OSD sessions and watch/notify v2, introduced in later commits, require periodic ticks which don't depend on ->num_requests. Schedule the initial tick from ceph_osdc_init() and reschedule from handle_timeout() unconditionally. Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: move schedule_delayed_work() in ceph_osdc_init() | Ilya Dryomov | 1 | -3/+3
ceph_osdc_stop() isn't called if ceph_osdc_init() fails, so we end up with handle_osds_timeout() running on invalid memory if any one of the allocations fails. Call schedule_delayed_work() after everything is setup, just before returning. Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: redo callbacks and factor out MOSDOpReply decoding | Ilya Dryomov | 4 | -157/+215
If you specify ACK | ONDISK and set ->r_unsafe_callback, both ->r_callback and ->r_unsafe_callback(true) are called on ack. This is very confusing. Redo this so that only one of them is called:
  ->r_unsafe_callback(true), on ack
  ->r_unsafe_callback(false), on commit
or
  ->r_callback, on ack|commit
Decode everything in decode_MOSDOpReply() to reduce clutter. Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: drop msg argument from ceph_osdc_callback_t | Ilya Dryomov | 5 | -16/+12
finish_read(), its only user, uses it to get to hdr.data_len, which is what ->r_result is set to on success. This gains us the ability to safely call callbacks from contexts other than reply, e.g. map check. Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: switch to calc_target(), part 2 | Ilya Dryomov | 7 | -255/+247
The crux of this is getting rid of ceph_osdc_build_request(), so that MOSDOp can be encoded not before but after calc_target() calculates the actual target. Encoding now happens within ceph_osdc_start_request(). Also nuked is the accompanying bunch of pointers into the encoded buffer that was used to update fields on each send - instead, the entire front is re-encoded. If we want to support target->name_len != base->name_len in the future, there is no other way, because oid is surrounded by other fields in the encoded buffer. Encoding OSD ops and adding data items to the request message were mixed together in osd_req_encode_op(). While we want to re-encode OSD ops, we don't want to add duplicate data items to the message when resending, so all calls to ceph_osdc_msg_data_add() are factored out into a new setup_request_data(). Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: switch to calc_target(), part 1 | Ilya Dryomov | 3 | -107/+29
Replace __calc_request_pg() and most of __map_request() with calc_target() and start using req->r_t. ceph_osdc_build_request() however still encodes base_oid, because it's called before calc_target() is and target_oid is empty at that point in time; a printf in osdc_show() also shows base_oid. This is fixed in "libceph: switch to calc_target(), part 2". Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: introduce ceph_osd_request_target, calc_target() | Ilya Dryomov | 7 | -4/+340
Introduce ceph_osd_request_target, containing all mapping-related fields of ceph_osd_request and calc_target() for calculating mappings and populating it. Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: pi->min_size, pi->last_force_request_resend | Ilya Dryomov | 3 | -8/+59
Add and decode pi->min_size and pi->last_force_request_resend. These are going to be used by calc_target(). Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: make pgid_cmp() global | Ilya Dryomov | 2 | -11/+14
calc_target() code is going to need to know how to compare PGs. Take lhs and rhs pgid by const * while at it. Signed-off-by: Ilya Dryomov <[email protected]>
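A comparison function that takes both sides by const pointer might look like the following sketch. The struct layout (a pool id plus a seed) and the ordering (pool first, then seed) are assumptions made for illustration, and example_pgid_cmp is a hypothetical name rather than the kernel function.

```c
#include <stdint.h>

/* Hypothetical placement group id: pool number plus seed. */
struct example_pgid {
    uint64_t pool;
    uint32_t seed;
};

/*
 * Three-way comparison taking lhs and rhs by const pointer.
 * Returns -1, 0 or 1, ordering by pool first and then by seed.
 */
int example_pgid_cmp(const struct example_pgid *lhs,
                     const struct example_pgid *rhs)
{
    if (lhs->pool != rhs->pool)
        return lhs->pool < rhs->pool ? -1 : 1;
    if (lhs->seed != rhs->seed)
        return lhs->seed < rhs->seed ? -1 : 1;
    return 0;
}
```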
2016-05-26 | libceph: rename ceph_calc_pg_primary() | Ilya Dryomov | 3 | -7/+8
Rename ceph_calc_pg_primary() to ceph_pg_to_acting_primary() to emphasise that it returns acting primary. Signed-off-by: Ilya Dryomov <[email protected]>
2016-05-26 | libceph: ceph_osds, ceph_pg_to_up_acting_osds() | Ilya Dryomov | 3 | -146/+215
Knowing just the acting set isn't enough; we need to be able to record the up set as well to detect interval changes. This means returning (up[], up_len, up_primary, acting[], acting_len, acting_primary) and passing it around. Introduce and switch to ceph_osds to help with that. Rename ceph_calc_pg_acting() to ceph_pg_to_up_acting_osds() and return both up and acting sets from it. Signed-off-by: Ilya Dryomov <[email protected]>
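Bundling an OSD array, its length and its primary into one structure makes the six-value tuple above easier to pass around and compare. The sketch below is a simplified model with hypothetical names (osd_set, MAX_OSDS, is_new_interval); a real interval check would likely consider more inputs than these two sets.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_OSDS 16  /* hypothetical upper bound on set size */

/* One OSD set (up or acting): members, length and primary. */
struct osd_set {
    int osds[MAX_OSDS];
    int size;
    int primary;
};

/* Two sets are equal if size, primary and the used members all match. */
bool osd_set_equal(const struct osd_set *a, const struct osd_set *b)
{
    return a->size == b->size &&
           a->primary == b->primary &&
           memcmp(a->osds, b->osds,
                  (size_t)a->size * sizeof(a->osds[0])) == 0;
}

/* Simplified check: a change in either the up or the acting set counts. */
bool is_new_interval(const struct osd_set *old_up,
                     const struct osd_set *old_acting,
                     const struct osd_set *new_up,
                     const struct osd_set *new_acting)
{
    return !osd_set_equal(old_up, new_up) ||
           !osd_set_equal(old_acting, new_acting);
}
```

Tracking only the acting set would miss the case where the up set changes while the acting set stays the same, which is why both are recorded.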
2016-05-26 | libceph: rename ceph_oloc_oid_to_pg() | Ilya Dryomov | 4 | -23/+23
Rename ceph_oloc_oid_to_pg() to ceph_object_locator_to_pg(). Emphasise that what is returned is the raw PG, and return -ENOENT instead of -EIO if the pool doesn't exist. Signed-off-by: Ilya Dryomov <[email protected]>