Age | Commit message | Author | Files | Lines
2018-04-02 | ceph: optimize memory usage | Chengguang Xu | 5 | -117/+169
In the current code, regular files and directories use the same struct ceph_file_info to store fs-specific data, so the struct has to include some fields that are only used for directories (e.g., readdir-related info). With plenty of regular files open, this wastes memory. This patch introduces a dedicated ceph_dir_file_info cache for readdir-related things, so that regular files no longer carry those unused fields. Signed-off-by: Chengguang Xu <[email protected]> Reviewed-by: "Yan, Zheng" <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
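The split described above can be pictured with a small user-space sketch. The struct names and fields below are hypothetical, simplified stand-ins (not the kernel definitions); the only idea taken from the commit message is that the directory-only state lives in a separate, larger structure that wraps the common one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical common per-open-file state, used for every file type. */
struct file_info {
	short fmode;
	short flags;
};

/* Hypothetical directory-only wrapper: readdir state lives here, so
 * regular files never pay for it. */
struct dir_file_info {
	struct file_info file_info;   /* embedded common part */
	unsigned int     frag;
	long long        next_offset;
	char            *last_name;
	char            *dir_info;
	int              dir_info_len;
};

/* Recover the wrapper from the embedded common part. Only valid when
 * fi really is the file_info member of a dir_file_info. */
static struct dir_file_info *to_dir_file_info(struct file_info *fi)
{
	assert(fi != NULL);
	return (struct dir_file_info *)
		((char *)fi - offsetof(struct dir_file_info, file_info));
}

/* Bytes a regular file no longer carries after the split. */
static size_t bytes_saved_per_regular_file(void)
{
	return sizeof(struct dir_file_info) - sizeof(struct file_info);
}
```

With two allocation caches, an open regular file would be allocated as a plain `file_info`, and only an open directory as a `dir_file_info`.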
2018-04-02 | ceph: optimize mds session register | Chengguang Xu | 1 | -17/+19
Do memory allocation first, to avoid unnecessary initialization of a newly allocated session in the error case. Signed-off-by: Chengguang Xu <[email protected]> Reviewed-by: "Yan, Zheng" <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
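A minimal user-space sketch of the "allocate first, initialize last" ordering follows. The `registry` and `session` types and the `register_session()` helper are hypothetical illustrations, not the kernel code; the point is only that every step that can fail runs before any initialization work is spent on the new object.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct session {
	int  id;
	int  state;
	char name[32];
};

struct registry {
	struct session **slots;
	int              max;
};

static struct session *register_session(struct registry *r, int id)
{
	struct session *s;

	assert(r != NULL);
	if (id < 0)
		return NULL;
	if (id < r->max && r->slots[id])
		return r->slots[id];          /* already registered */

	/* 1. Do every allocation first; nothing is initialized yet. */
	s = malloc(sizeof(*s));
	if (!s)
		return NULL;

	if (id >= r->max) {
		int newmax = id + 1;
		struct session **n = realloc(r->slots, newmax * sizeof(*n));

		if (!n) {
			free(s);              /* cheap bail-out, no init wasted */
			return NULL;
		}
		memset(n + r->max, 0, (newmax - r->max) * sizeof(*n));
		r->slots = n;
		r->max = newmax;
	}

	/* 2. Initialize only once nothing can fail any more. */
	memset(s, 0, sizeof(*s));
	s->id = id;
	s->state = 1;
	r->slots[id] = s;
	return s;
}
```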
2018-04-02 | libceph, ceph: add __init attribution to init funcitons | Chengguang Xu | 5 | -14/+9
Add the __init attribute to functions that are called only once during initialization/registration, and delete unnecessary symbol exports. Signed-off-by: Chengguang Xu <[email protected]> Reviewed-by: Ilya Dryomov <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | ceph: filter out used flags when printing unused open flags | Chengguang Xu | 1 | -0/+2
Filter out used access mode flags when printing unused open flags. Signed-off-by: Chengguang Xu <[email protected]> Reviewed-by: "Yan, Zheng" <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | ceph: don't wait on writeback when there is no more dirty pages | Yan, Zheng | 1 | -1/+5
In sync mode, writepages() needs to write all dirty pages. But it can only write dirty pages associated with the oldest snapc. To write dirty pages associated with the next snapc, it needs to wait until the current writes complete. If there are no more dirty pages, writepages() should not wait on writeback. Otherwise, dirty page writeback becomes very slow. Signed-off-by: "Yan, Zheng" <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | ceph: invalidate pages that beyond EOF in ceph_writepages_start() | Yan, Zheng | 2 | -29/+18
Dirty pages can be associated with different capsnaps, and different capsnaps may have different EOF values. So invalidating dirty pages according to the largest EOF value is wrong: dirty pages beyond EOF, but associated with other capsnaps, do not get invalidated. Signed-off-by: "Yan, Zheng" <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | ceph: mark the cap cache as unreclaimable | Chengguang Xu | 1 | -2/+1
Releasing caps is affected by many factors (e.g., avail_count/reserve_count/min_count), and min_count can be set to a large value via a client mount option. Hence it is better to mark the cap cache as unreclaimable, to avoid non-trivial discrepancies between the memory shown as reclaimable and what is actually reclaimed. Signed-off-by: Chengguang Xu <[email protected]> Reviewed-by: "Yan, Zheng" <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | ceph: change variable name to follow common rule | Chengguang Xu | 3 | -30/+30
Variable name ci is mostly used for ceph_inode_info. Variable name fi is mostly used for ceph_file_info. Variable name cf is mostly used for ceph_cap_flush. Change variable names to follow the above common rules, to avoid confusion. Signed-off-by: Chengguang Xu <[email protected]> Reviewed-by: "Yan, Zheng" <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | ceph: optimizing cap reservation | Chengguang Xu | 1 | -29/+59
When caps_avail_count is low, most newly trimmed caps will probably go into ->caps_list and caps_avail_count will increase. Hence, after trimming, recheck caps_avail_count to effectively reuse newly trimmed caps. Also, when releasing unnecessary caps, follow the same rule as ceph_put_cap. Signed-off-by: Chengguang Xu <[email protected]> Reviewed-by: "Yan, Zheng" <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | ceph: release unreserved caps if having enough available caps | Chengguang Xu | 1 | -1/+15
When unreserving caps, check whether there are too many available caps in the ->caps_list; if so, release the unreserved caps. Signed-off-by: Chengguang Xu <[email protected]> Reviewed-by: "Yan, Zheng" <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
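The unreserve rule can be sketched with counters alone. This is a user-space illustration under stated assumptions: the `cap_pool` struct is hypothetical, and the threshold used to decide that there are "too many" available caps (`avail >= reserve + min`) is an assumption chosen for the example, not something the commit message specifies. The accounting invariant total = use + reserve + avail is the one quoted in the BUG_ON elsewhere in this log.

```c
#include <assert.h>

/* Hypothetical counter-only model of the cap pool. */
struct cap_pool {
	int total;     /* all caps that exist */
	int use;       /* caps currently in use */
	int reserve;   /* caps promised to pending operations */
	int avail;     /* idle caps sitting on the free list */
	int min;       /* configured minimum to keep around */
};

static void check_invariant(const struct cap_pool *p)
{
	assert(p->total == p->use + p->reserve + p->avail);
}

/* Give back nr reserved caps. If plenty of idle caps already exist,
 * free them outright; otherwise park them on the free list.
 * The threshold below is an assumption made for this sketch. */
static void unreserve_caps(struct cap_pool *p, int nr)
{
	assert(nr >= 0 && nr <= p->reserve);
	p->reserve -= nr;

	if (p->avail >= p->reserve + p->min)
		p->total -= nr;      /* released back to the allocator */
	else
		p->avail += nr;      /* kept for reuse */

	check_invariant(p);
}
```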
2018-04-02 | ceph: optimizing cap allocation | Chengguang Xu | 1 | -0/+16
When caps_min_count is set to a large value or there are many unreserved caps, unused caps may stay in the ->caps_list indefinitely, even when a new cap cannot be obtained from kmem_cache_alloc, because caps_avail_count has no upper limit. Hence reuse caps in the ->caps_list if available; this may be better than imposing a maximum on caps_avail_count and releasing unused caps when the limit is reached. Signed-off-by: Chengguang Xu <[email protected]> Reviewed-by: "Yan, Zheng" <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | ceph: adding protection for showing cap reservation info | Chengguang Xu | 1 | -0/+4
Add spinlock protection while reading the cap reservation related fields, so that the numbers match the BUG_ON condition below in the code. BUG_ON(mdsc->caps_total_count != mdsc->caps_use_count + mdsc->caps_reserve_count + mdsc->caps_avail_count); Signed-off-by: Chengguang Xu <[email protected]> Reviewed-by: "Yan, Zheng" <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | libceph: adding missing message types to ceph_msg_type_name() | Chengguang Xu | 1 | -0/+5
Some message types are missing from ceph_msg_type_name(), so add them to make the output easier to understand. Signed-off-by: Chengguang Xu <[email protected]> Reviewed-by: Ilya Dryomov <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | rbd: get the latest osdmap when using an existing client | Ilya Dryomov | 1 | -36/+33
Currently we request the latest osdmap only if ceph_pg_poolid_by_name() fails with -ENOENT. This is effective with newly created pools, but we also want to avoid attempting to map from pools that were recently deleted and report "pool does not exist" instead. (Such an attempt eventually fails in the OSD client after map check code kicks in, but the error message is confusing.) Request the latest osdmap unconditionally after bumping a ref on an existing client in rbd_client_find(). Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | rbd: move rbd_get_client() below rbd_put_client() | Ilya Dryomov | 1 | -20/+20
... to avoid a forward declaration in the next commit. Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | rbd: remove redundant declaration of rbd_spec_put() | Ilya Dryomov | 1 | -1/+0
Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | ceph: use seq_show_option for string type options | Chengguang Xu | 1 | -5/+2
Use seq_show_option instead of seq_printf for string-type options. Signed-off-by: Chengguang Xu <[email protected]> Reviewed-by: Ilya Dryomov <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | libceph: fix misjudgement of maximum monitor number | Chengguang Xu | 1 | -1/+1
num_mon should allow up to CEPH_MAX_MON in ceph_monmap_decode(). Signed-off-by: Chengguang Xu <[email protected]> Reviewed-by: Ilya Dryomov <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
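The one-line fix above is a boundary condition. The sketch below illustrates it under assumptions: `MAX_MON` is a hypothetical stand-in for CEPH_MAX_MON (its value here is made up), and the "old" check is a guess at an off-by-one of the usual shape, since the log does not show the diff itself.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_MON 31   /* hypothetical stand-in for CEPH_MAX_MON */

/* Assumed shape of the too-strict check: rejects n == MAX_MON. */
static bool num_mon_ok_strict(unsigned int n)
{
	return !(n >= MAX_MON);
}

/* Check that allows up to and including MAX_MON monitors. */
static bool num_mon_ok(unsigned int n)
{
	return n <= MAX_MON;
}

/* The two checks differ only at the boundary value. */
static void monmap_limit_selfcheck(void)
{
	assert(num_mon_ok(MAX_MON));
	assert(!num_mon_ok_strict(MAX_MON));
	assert(!num_mon_ok(MAX_MON + 1));
}
```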
2018-04-02 | libceph, ceph: change permission for readonly debugfs entries | Chengguang Xu | 2 | -9/+9
Remove write permission from debugfs entries that are read-only. Signed-off-by: Chengguang Xu <[email protected]> Reviewed-by: Ilya Dryomov <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | ceph: keep consistent semantic in fscache related option combination | Chengguang Xu | 1 | -0/+4
When multiple fscache-related options are specified, the result does not always follow the order of the options. This fix makes the result strictly follow option order. Signed-off-by: Chengguang Xu <[email protected]> Reviewed-by: "Yan, Zheng" <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
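One way to get "last option wins" semantics is for every fscache-related option to reset all fscache-related state, not just its own field. The sketch below is a user-space illustration: the `mount_opts` struct, the `apply_option()` helper and the exact token spellings (`fsc`, `nofsc`, `fsc=<id>`) are assumptions made for the example, not the kernel parser.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct mount_opts {
	bool  fscache;        /* caching enabled? */
	char *fscache_uniq;   /* optional cache identifier, or NULL */
};

static char *dup_str(const char *s)
{
	char *d = malloc(strlen(s) + 1);

	if (d)
		strcpy(d, s);
	return d;
}

/* Apply one option token. Each fscache-related token overwrites both
 * fields, so the final state depends only on the last such token. */
static int apply_option(struct mount_opts *o, const char *tok)
{
	assert(o != NULL && tok != NULL);

	if (strcmp(tok, "fsc") == 0 || strcmp(tok, "nofsc") == 0) {
		o->fscache = (tok[0] == 'f');
		free(o->fscache_uniq);       /* drop any earlier identifier */
		o->fscache_uniq = NULL;
	} else if (strncmp(tok, "fsc=", 4) == 0) {
		char *u = dup_str(tok + 4);

		if (!u)
			return -1;
		free(o->fscache_uniq);
		o->fscache_uniq = u;
		o->fscache = true;
	} else {
		return -1;                   /* not an fscache option */
	}
	return 0;
}
```

In this model, `fsc=a` followed by `nofsc` leaves caching off with no identifier, while `nofsc` followed by `fsc=a` leaves caching on with identifier `a`.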
2018-04-02 | ceph: add newline to end of debug message format | Chengguang Xu | 7 | -20/+19
Some dout formats do not end with a newline. Fix this for the files in the fs/ceph and net/ceph directories, and change printk to dout for printing debug info in super.c. Signed-off-by: Chengguang Xu <[email protected]> Reviewed-by: "Yan, Zheng" <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | rbd: allow "fancy" striping | Ilya Dryomov | 1 | -27/+2
Signed-off-by: Ilya Dryomov <[email protected]> Acked-by: Jason Dillaman <[email protected]>
2018-04-02 | rbd: introduce OWN_BVECS data type | Ilya Dryomov | 1 | -7/+149
If the layout is "fancy", we need to be able to rearrange the provided bio_vecs in stripe unit chunks to make it possible for the messenger to read/write directly from/to the provided data buffer, without employing a temporary data buffer for assembling the result. Higher level bio_vec arrays are generally immutable, so this requires copying into a private array. Only the bio_vecs themselves are shuffled around, not the actual data. OWN_BVECS doesn't own any pages. Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | rbd: remove rbd_parent_request_{create,destroy}() | Ilya Dryomov | 1 | -68/+6
rbd_parent_request_create() takes a ref on obj_req for child_img_req. There is no point in doing that because child_img_req is created on behalf of obj_req -- obj_req is the initiator and can't be completed before child_img_req. Open-code the rest of rbd_parent_request_create() and remove it along with rbd_parent_request_destroy(). Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | rbd: get rid of img_req->{offset,length} | Ilya Dryomov | 1 | -18/+8
These are set, but no longer used. Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | rbd: remove rbd_img_request_fill() and helpers | Ilya Dryomov | 1 | -98/+0
Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | rbd: switch to common striping framework | Ilya Dryomov | 1 | -23/+168
Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | rbd: create+truncate for whole-object layered discards | Ilya Dryomov | 1 | -1/+6
A whole-object layered discard is implemented as a truncate rather than a delete: a dummy object is needed to prevent the CoW machinery from kicking in. However, a truncate on a non-existent object is a no-op. If the object doesn't exist in HEAD, a discard request is effectively ignored, which violates our "discard zeroes data" promise and breaks REQ_OP_WRITE_ZEROES implementation. A non-exclusive create on an existing object is also a no-op, so the fix is to do a compound create+truncate instead. Signed-off-by: Ilya Dryomov <[email protected]>
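The reasoning above can be checked with a toy model of a single object. Everything here is a hypothetical simulation (the `object` struct and the `op_*` helpers are not OSD client APIs); it encodes only the two no-op rules stated in the commit message and shows that the compound operation ends in the same state whether or not the object existed beforehand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one object in the child image. */
struct object {
	bool   exists;
	size_t size;
};

/* Truncate: a no-op on a non-existent object. */
static void op_truncate(struct object *o, size_t size)
{
	if (!o->exists)
		return;
	o->size = size;
}

/* Non-exclusive create: a no-op on an existing object. */
static void op_create(struct object *o)
{
	if (o->exists)
		return;
	o->exists = true;
	o->size = 0;
}

/* Compound create+truncate: afterwards an empty object always exists,
 * so reads are not served from the parent image. */
static void discard_whole_object(struct object *o)
{
	op_create(o);
	op_truncate(o, 0);
	assert(o->exists && o->size == 0);
}
```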
2018-04-02 | rbd: move to obj_req->img_extents | Ilya Dryomov | 1 | -52/+98
In preparation for rbd "fancy" striping, replace obj_req->img_offset with obj_req->img_extents. A single starting offset isn't sufficient because we want only one OSD request per object and will merge adjacent object extents in ceph_file_to_extents(). The final object extent may map into multiple different byte ranges in the image. Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | rbd: incorporate ceph_object_extent | Ilya Dryomov | 1 | -37/+34
obj_req->object_no -> obj_req->ex.oe_objno
obj_req->offset -> obj_req->ex.oe_off
obj_req->length -> obj_req->ex.oe_len
... and use ex for linking object requests to image requests.
Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | libceph, ceph: move ceph_calc_file_object_mapping() to striper.c | Ilya Dryomov | 7 | -44/+43
ceph_calc_file_object_mapping() has nothing to do with osdmaps. Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | libceph: striping framework implementation | Ilya Dryomov | 3 | -0/+292
Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | rbd: store data_type in img_req instead of obj_req | Ilya Dryomov | 1 | -26/+8
All object requests are associated with an image request now -- avoid duplicating the same info in each object request. Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | rbd: remove obj_req->flags field | Ilya Dryomov | 1 | -35/+0
There are no standalone (!IMG_DATA) object requests anymore. Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | rbd: remove old request completion code | Ilya Dryomov | 1 | -172/+3
Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | rbd: new request completion code | Ilya Dryomov | 1 | -13/+55
Do away with partial request completions and all the associated complexity. Individual object requests no longer need to be completed in order -- when the last one becomes ready, we complete the entire higher level request all at once. This also wraps up the conversion to a state machine model and eliminates the recursion described in commit 6d69bb536bac ("rbd: prevent kernel stack blow up on rbd map"). Signed-off-by: Ilya Dryomov <[email protected]>
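The "complete everything when the last one is ready" model reduces to a pending counter. The sketch below is a single-threaded user-space illustration; the `img_request` struct and `obj_request_done()` are hypothetical names, the "keep the first error" policy is an assumption, and a real implementation would need locking or atomics around the counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical image-level request tracking its object requests. */
struct img_request {
	int  pending;     /* object requests not yet finished */
	int  result;      /* 0, or the first error seen */
	bool completed;   /* set exactly once, by the last finisher */
};

/* Called once per object request, in any order. */
static void obj_request_done(struct img_request *img, int result)
{
	assert(img->pending > 0 && !img->completed);

	if (result < 0 && img->result == 0)
		img->result = result;     /* remember the first failure */

	if (--img->pending == 0)
		img->completed = true;    /* whole request completes at once */
}
```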
2018-04-02 | rbd: update rbd_img_request_submit() signature | Ilya Dryomov | 1 | -10/+3
It should be void now. Also, object requests are unlinked only in image request destructor, which can't run before rbd_img_request_put(), so no need for _safe. Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | rbd: add img_req->op_type field | Ilya Dryomov | 1 | -63/+12
Store op_type in its own field instead of packing it into flags. Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | rbd: simplify rbd_osd_req_create() | Ilya Dryomov | 1 | -45/+14
No need to pass rbd_dev and op_type to rbd_osd_req_create(): there are no standalone (!IMG_DATA) object requests anymore and osd_req->r_flags can be set in rbd_osd_req_format_{read,write}(). Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | rbd: remove old request handling code | Ilya Dryomov | 1 | -730/+4
Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | rbd: new request handling code | Ilya Dryomov | 1 | -77/+601
The notable changes are:
- instead of explicitly stat'ing the object to see if it exists before issuing the write, send the write optimistically along with the stat in a single OSD request
- zero copyup optimization
- all object requests are associated with an image request and have a valid ->img_request pointer; there are no standalone (!IMG_DATA) object requests anymore
- code is structured as a state machine (vs a bunch of callbacks with implicit state)
Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | libceph: handle zero-length data items | Ilya Dryomov | 1 | -2/+12
rbd needs this for null copyups -- if copyup data is all zeroes, we want to save some I/O and network bandwidth. See rbd_obj_issue_copyup() in the next commit. Signed-off-by: Ilya Dryomov <[email protected]> Reviewed-by: Alex Elder <[email protected]>
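The null-copyup idea amounts to detecting an all-zero buffer and sending a zero-length payload instead. A minimal sketch, assuming a flat byte buffer (the kernel code works on bio_vec arrays instead); `is_all_zero()` and `copyup_payload_len()` are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if every byte in buf[0..len) is zero (also true for len == 0). */
static bool is_all_zero(const unsigned char *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (buf[i] != 0)
			return false;
	return true;
}

/* Length of the data item to attach to the copyup: nothing at all when
 * the parent data is entirely zeroes, the full buffer otherwise. */
static size_t copyup_payload_len(const unsigned char *buf, size_t len)
{
	assert(buf != NULL || len == 0);
	return is_all_zero(buf, len) ? 0 : len;
}
```

This is why the messenger has to tolerate zero-length data items: the request is still sent, just without a payload.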
2018-04-02 | rbd: move from raw pages to bvec data descriptors | Ilya Dryomov | 1 | -78/+77
In preparation for rbd "fancy" striping which requires bio_vec arrays, wire up BVECS data type and kill off PAGES data type. There is nothing wrong with using page vectors for copyup requests -- it's just less iterator boilerplate code to write for the new striping framework. Signed-off-by: Ilya Dryomov <[email protected]> Reviewed-by: Alex Elder <[email protected]>
2018-04-02 | libceph: introduce BVECS data type | Ilya Dryomov | 4 | -0/+164
In preparation for rbd "fancy" striping, introduce ceph_bvec_iter for working with bio_vec array data buffers. The wrappers are trivial, but make it look similar to ceph_bio_iter. Signed-off-by: Ilya Dryomov <[email protected]>
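An iterator over an array of buffer segments can be sketched in user space as a (segment index, bytes consumed in that segment, bytes remaining) triple. The `seg` and `seg_iter` types below are hypothetical analogues written for illustration, not the kernel's bio_vec, bvec_iter or ceph_bvec_iter definitions. Advancing the iterator moves only the position; the segments and the data they point at are never copied.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one buffer segment. */
struct seg {
	const char *base;
	size_t      len;
};

/* Position within an array of segments. */
struct seg_iter {
	const struct seg *segs;
	size_t idx;         /* current segment */
	size_t done;        /* bytes already consumed in segs[idx] */
	size_t remaining;   /* bytes left in the whole range */
};

/* Move the position forward by n bytes, crossing segment boundaries. */
static void seg_iter_advance(struct seg_iter *it, size_t n)
{
	assert(n <= it->remaining);
	it->remaining -= n;

	n += it->done;
	while (n && n >= it->segs[it->idx].len) {
		n -= it->segs[it->idx].len;
		it->idx++;
	}
	it->done = n;
}

/* Address of the next unconsumed byte; only valid while remaining > 0. */
static const char *seg_iter_ptr(const struct seg_iter *it)
{
	assert(it->remaining > 0);
	return it->segs[it->idx].base + it->done;
}
```

Because the iterator is a small value type, each consumer can hold its own copy pointing at a different starting position in the same segment array.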
2018-04-02 | rbd: get rid of img_req->copyup_pages | Ilya Dryomov | 1 | -34/+9
The initiating object request is the proper owner -- save a bit of space. Signed-off-by: Ilya Dryomov <[email protected]> Reviewed-by: Alex Elder <[email protected]>
2018-04-02 | rbd: don't (ab)use obj_req->pages for stat requests | Ilya Dryomov | 1 | -10/+5
obj_req->pages is for provided data buffers. stat requests are internal and should be NODATA. Signed-off-by: Ilya Dryomov <[email protected]> Reviewed-by: Alex Elder <[email protected]>
2018-04-02 | rbd: remove bio cloning helpers | Ilya Dryomov | 1 | -141/+0
Signed-off-by: Ilya Dryomov <[email protected]> Reviewed-by: Alex Elder <[email protected]>
2018-04-02 | libceph, rbd: new bio handling code (aka don't clone bios) | Ilya Dryomov | 5 | -112/+139
The reason we clone bios is to be able to give each object request (and consequently each ceph_osd_data/ceph_msg_data item) its own pointer to a (list of) bio(s). The messenger then initializes its cursor with cloned bio's ->bi_iter, so it knows where to start reading from/writing to. That's all the cloned bios are used for: to determine each object request's starting position in the provided data buffer. Introduce ceph_bio_iter to do exactly that -- store position within bio list (i.e. pointer to bio) + position within that bio (i.e. bvec_iter). Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | rbd: start enums at 1 instead of 0 | Ilya Dryomov | 1 | -2/+4
Signed-off-by: Ilya Dryomov <[email protected]>
2018-04-02 | libceph, ceph: change ceph_calc_file_object_mapping() signature | Ilya Dryomov | 5 | -32/+18
- make it void
- xlen (object extent length) out parameter should be u32 because only a single stripe unit is mapped at a time
Signed-off-by: Ilya Dryomov <[email protected]> Reviewed-by: Alex Elder <[email protected]>
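To see why a 32-bit xlen is enough, it helps to write out a striped file-to-object mapping. The sketch below is a user-space approximation of the usual RAID-0 style layout arithmetic (stripe unit, stripe count, object size); the `layout` struct and `calc_file_object_mapping()` are hypothetical names for illustration, not the libceph function. Each call maps at most the rest of one stripe unit, so xlen can never exceed stripe_unit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout description. */
struct layout {
	uint32_t stripe_unit;    /* bytes per stripe unit */
	uint32_t stripe_count;   /* objects striped across per object set */
	uint32_t object_size;    /* bytes per object, multiple of stripe_unit */
};

/* Map file range [off, off + len) to the object holding its first byte.
 * Only the part inside the current stripe unit is mapped. */
static void calc_file_object_mapping(const struct layout *l,
				     uint64_t off, uint64_t len,
				     uint64_t *objno, uint64_t *objoff,
				     uint32_t *xlen)
{
	assert(l->stripe_unit && l->stripe_count);
	assert(l->object_size && l->object_size % l->stripe_unit == 0);

	uint32_t su_per_object = l->object_size / l->stripe_unit;
	uint64_t blockno   = off / l->stripe_unit;        /* stripe unit index in file */
	uint32_t blockoff  = off % l->stripe_unit;        /* offset inside that unit */
	uint64_t stripeno  = blockno / l->stripe_count;   /* row across the object set */
	uint32_t stripepos = blockno % l->stripe_count;   /* column, i.e. which object */
	uint64_t objsetno  = stripeno / su_per_object;    /* which object set */
	uint32_t left      = l->stripe_unit - blockoff;   /* bytes left in this unit */

	*objno  = objsetno * l->stripe_count + stripepos;
	*objoff = (stripeno % su_per_object) * l->stripe_unit + blockoff;
	*xlen   = len < left ? (uint32_t)len : left;
}
```

A caller that needs to map a longer range would loop, advancing off by xlen and shrinking len by the same amount on each iteration.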