Page History
...
- Thread-1
- Takes CR mode tree-lock. This still allows other threads to get CW or CR lock on the htree directory, but prohibits PW, PR or EX locks on the htree directory.
- Takes PR mode child-lock on block-3 and block-6 so these two blocks cannot be modified.
- Thread-2
- Takes CW mode tree-lock. This is compatible with the CR tree-lock taken by thread-1.
- Takes PW mode child-lock on block-9 and modifies the block (i.e: insert a new entry). This lock gives this thread exclusive access to block-9.
For a N-level htree, a child-lock can protect level\[PUB:N\] and level\[PUB:N-1\] blocks of htree directory. A tree-lock can protect all other level blocks. Using the example presented in the figure above, block-0, block-1 and block-2 are protected by tree-lock. Wiki Markup
...
- An initial implementation of PDO only used a child-lock to protect level\[PUB:N\] blocks, or DE-block, and using tree-lock to protect all other DX-blocks. Prototyping this design revealed a high occurrence of DE-block splitting:
- unmigrated-wiki-markup
- As described in solution architecture, each DE-block (level\[PUB:N\] block) can contain about 80 entries. If level\[PUB:N-1\] DX-blocks are protected by tree-lock, then the entire tree must be locked after inserting about 80 entries or 1.25% of all insertions (need to split the DE-block and insert new DE-block to its parent DX-block). This may be a performance issue, particularly as hundreds or thousands service threads may try to inserting/remove entries simultaneously.
- In theory a child-lock can protect level-M blocks (0<= M && M <= N). However, in practice it is only used to protect level\[PUB:N\] and level\[PUB:N-1\] blocks. The reasoning behind this is twofold:
- unmigrated-wiki-markup
- Restructuing the ldiskfs htree implementation is undesirable. The existing htree implementation has specific logic path for level\[PUB:N\] DE-block and level\[PUB:N-1\] DX-block. This makes it simple to add a htree-lock. Requiring the use of an htree-lock at any level will likely require a significant rewrite current code and make it hard to maintain.unmigrated-wiki-markup
- As described in the solution architecture: using child-lock is used to protect level\[PUB:N\] DE-block and level\[PUB:N-1\] DX-block, and a tree-lock to protect all other blocks, the chance of locking the whole tree is about 1/(512 * 80) or 0.00244%. This is judged to be sufficiently small to be negligible.
Hree-lock interfaces and compatibility matrix
...
Ldiskfs lock definition and Locking order
...
Child-locks protect level\[PUB:N-1\] DX-block and level\[PUB:N\] DE-blocks. Other changes are protected by htree-lock. 2-depths child lock are defined as:
DX-lock: a child-lock to protect level\[PUB:N-1\] DX-block.unmigrated-wiki-markupWiki Markup - DE-lock: a child-lock to protect level\[PUB:N\] DE-block.
The use-cases include:
- All no-change operations obtain both DX-lock and DE-lock with PR mode.
- All changing operations obtain both DX-lock and DE-lock with PW mode.
...
The solution to this problem is add another depth for child-lock:
DX-spinlockWiki Markup
rw-spinlock for level\[PUB:N-1\] DX-block (again). It protects the same DX-block, but a different lock depth is used to protect different operations:- DX-lock is a blocking lock and protects the entire split process of DE-blocks. As a result only one thread can split DE-blocks under the same DX-block.
- DX-lock is not necessary for lookup, delete or insert, required only if:
- Hash collision for lookup/delete
- Split DE-block
- DX-spinlock is non-blocking lock that is used to perform a quick scan of DX-block, or changing of DX-block (i.e. insert a new DE-block entry to the DX-block.) This lock is always required but it is non-blocking and only protecting light operation. There is no thread context switch even under lock contention.
...
- call
dx_probe()
with hold shared locking mode (CR or CW).unmigrated-wiki-markup - Level\[PUB:0, 1, …, N-2\] DX-blocks are not protected by tree-lock and do not require any additional protection. This figure only shows logic of protecting level\[PUB:N-1\] DX-block and level\[PUB:N\] DE-block.
- locking order to avoid deadlock
- As previously described, the correct locking order is: DX-lock, DE-lock, then DX-spinlock.
DX-spinlock must be held to scan the level\[PUB:N-1\] DX-block and discover the target DE-block. The DE-lock is obtained. This process violates the locking order, so the DX-spinlock must be released before locking DE-lock. However, no guarantee is available that the target DE-block will not be changed after we release DX-spinlock and take DE-lock. The solution is to call DX-spin-unlock-listen:Wiki Markup - DX-spin-unlock-listen will be notified if the DE-block is changed (i.e. it is split by another thread.)
- If notification arrives, attempt to re-validate the htree-path again ensuring the correct DE-block is locked.
- This function will return reference of DE-block with hold DE-lock, and caller can continue to search/delete the target entry in the target DE-block, or insert a new entry to the target DE-block.unmigrated-wiki-markup
- If there is hash collision for “lookup”, then the DX-lock must be obtained as well. This will allow searching the next adjacent DE-block. In this case, the parent (level\[PUB:N-1\] DX-block) remains unchanged.
Add name entry and split DE-block
...
- The root DX-block always has additional bytes for describing htree. After new DX-block obtains all the entries from the root block without the additional extra bytes, the new DX-block has free bytes to receive a new entry so we can go ahead splitting DE-block.
...
For example, spliting DE-block\[PUB:2\] in Graph-6 (a): after the htree grows, the htree will become (b). A new DX-block\[PUB:X\] will always have free bytes for new DE-block split from DE-block\[PUB:2\]. In this case a split on DE-block\[PUB:2\] can be performed and the new DE-block added to DX-block\[PUB:X\]. Finally, insert name entry to target DE-block.
While growing from While growing from N-level to (N+1)-level, the new DX-block is still inserted as child of the root DX-block. For example, in Graph-6 (d): after htree growing, block-Y is the new block. block-Y now has free bytes that can receive an entry. However, if DE-block\[PUB:2\] is split, and its parent DX-block\[PUB:101\] is split, block-Y remains full and can not insert any new block entry.Wiki Markup
This situation can be resolve by simply aborting the insert operation after growing the tree and restarting from dx_probe. In the second shot, DX-block\[PUB:101\] is full but a split can be performed because the new DX-block\[PUB:Y\] has free entries. In this case, the process is splitting DX-block, splitting DE-block and inserting an entry.
Wiki Markup
In practice, to simplify the logic, {{dx_probe()
}} is always restarted after growing of htree.
Cache block 0 for in htree-lock (optional)
...