Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migrated to Confluence 4.0

...

  • Thread-1
    • Takes CR mode tree-lock. This still allows other threads to get CW or CR lock on the htree directory, but prohibits PW, PR or EX locks on the htree directory.
    • Takes PR mode child-lock on block-3 and block-6 so these two blocks cannot be modified.
  • Thread-2
    • Takes CW mode tree-lock. This is compatible with the CR tree-lock taken by thread-1.
    • Takes PW mode child-lock on block-9 and modifies the block (i.e: insert a new entry). This lock gives this thread exclusive access to block-9.

Wiki MarkupFor a N-level htree, a child-lock can protect level\[PUB:N\] and level\[PUB:N-1\] blocks of htree directory. A tree-lock can protect all other level blocks. Using the example presented in the figure above, block-0, block-1 and block-2 are protected by tree-lock.

...

  • An initial implementation of PDO only used a child-lock to protect level\[PUB:N\] blocks, or DE-block, and using tree-lock to protect all other DX-blocks. Prototyping this design revealed a high occurrence of DE-block splitting:
      unmigrated-wiki-markup
    • As described in solution architecture, each DE-block (level\[PUB:N\] block) can contain about 80 entries. If level\[PUB:N-1\] DX-blocks are protected by tree-lock, then the entire tree must be locked after inserting about 80 entries or 1.25% of all insertions (need to split the DE-block and insert new DE-block to its parent DX-block). This may be a performance issue, particularly as hundreds or thousands service threads may try to inserting/remove entries simultaneously.
    unmigrated-wiki-markup
  • In theory a child-lock can protect level-M blocks (0<= M && M <= N). However, in practice it is only used to protect level\[PUB:N\] and level\[PUB:N-1\] blocks. The reasoning behind this is twofold:
      unmigrated-wiki-markup
    • Restructuing the ldiskfs htree implementation is undesirable. The existing htree implementation has specific logic path for level\[PUB:N\] DE-block and level\[PUB:N-1\] DX-block. This makes it simple to add a htree-lock. Requiring the use of an htree-lock at any level will likely require a significant rewrite current code and make it hard to maintain.unmigrated-wiki-markup
    • As described in the solution architecture: using child-lock is used to protect level\[PUB:N\] DE-block and level\[PUB:N-1\] DX-block, and a tree-lock to protect all other blocks, the chance of locking the whole tree is about 1/(512 * 80) or 0.00244%. This is judged to be sufficiently small to be negligible.

Hree-lock interfaces and compatibility matrix

...

Ldiskfs lock definition and Locking order

...

Child-locks protect level\[PUB:N-1\] DX-block and level\[PUB:N\] DE-blocks. Other changes are protected by htree-lock. 2-depths child lock are defined as:

  • Wiki MarkupDX-lock: a child-lock to protect level\[PUB:N-1\] DX-block.unmigrated-wiki-markup
  • DE-lock: a child-lock to protect level\[PUB:N\] DE-block.

The use-cases include:

  • All no-change operations obtain both DX-lock and DE-lock with PR mode.
  • All changing operations obtain both DX-lock and DE-lock with PW mode.

...

The solution to this problem is add another depth for child-lock:

  • Wiki MarkupDX-spinlock
    rw-spinlock for level\[PUB:N-1\] DX-block (again). It protects the same DX-block, but a different lock depth is used to protect different operations:
    • DX-lock is a blocking lock and protects the entire split process of DE-blocks. As a result only one thread can split DE-blocks under the same DX-block.
    • DX-lock is not necessary for lookup, delete or insert, required only if:
      • Hash collision for lookup/delete
      • Split DE-block
    • DX-spinlock is non-blocking lock that is used to perform a quick scan of DX-block, or changing of DX-block (i.e. insert a new DE-block entry to the DX-block.) This lock is always required but it is non-blocking and only protecting light operation. There is no thread context switch even under lock contention.

...

  • call dx_probe() with hold shared locking mode (CR or CW).unmigrated-wiki-markup
  • Level\[PUB:0, 1, …, N-2\] DX-blocks are not protected by tree-lock and do not require any additional protection. This figure only shows logic of protecting level\[PUB:N-1\] DX-block and level\[PUB:N\] DE-block.
  • locking order to avoid deadlock
    • As previously described, the correct locking order is: DX-lock, DE-lock, then DX-spinlock.
    • Wiki MarkupDX-spinlock must be held to scan the level\[PUB:N-1\] DX-block and discover the target DE-block. The DE-lock is obtained. This process violates the locking order, so the DX-spinlock must be released before locking DE-lock. However, no guarantee is available that the target DE-block will not be changed after we release DX-spinlock and take DE-lock. The solution is to call DX-spin-unlock-listen:
      • DX-spin-unlock-listen will be notified if the DE-block is changed (i.e. it is split by another thread.)
      • If notification arrives, attempt to re-validate the htree-path again ensuring the correct DE-block is locked.
  • This function will return reference of DE-block with hold DE-lock, and caller can continue to search/delete the target entry in the target DE-block, or insert a new entry to the target DE-block.unmigrated-wiki-markup
  • If there is hash collision for “lookup”, then the DX-lock must be obtained as well. This will allow searching the next adjacent DE-block. In this case, the parent (level\[PUB:N-1\] DX-block) remains unchanged.

Add name entry and split DE-block

...

    • The root DX-block always has additional bytes for describing htree. After new DX-block obtains all the entries from the root block without the additional extra bytes, the new DX-block has free bytes to receive a new entry so we can go ahead splitting DE-block.

...

For example, spliting DE-block\[PUB:2\] in Graph-6 (a): after the htree grows, the htree will become (b). A new DX-block\[PUB:X\] will always have free bytes for new DE-block split from DE-block\[PUB:2\]. In this case a split on DE-block\[PUB:2\] can be performed and the new DE-block added to DX-block\[PUB:X\]. Finally, insert name entry to target DE-block.

    • Wiki MarkupWhile growing from While growing from N-level to (N+1)-level, the new DX-block is still inserted as child of the root DX-block. For example, in Graph-6 (d): after htree growing, block-Y is the new block. block-Y now has free bytes that can receive an entry. However, if DE-block\[PUB:2\] is split, and its parent DX-block\[PUB:101\] is split, block-Y remains full and can not insert any new block entry.

Wiki MarkupThis situation can be resolve by simply aborting the insert operation after growing the tree and restarting from dx_probe. In the second shot, DX-block\[PUB:101\] is full but a split can be performed because the new DX-block\[PUB:Y\] has free entries. In this case, the process is splitting DX-block, splitting DE-block and inserting an entry.
In practice, to simplify the logic, {{dx_probe()}} is always restarted after growing of htree.

Cache block 0 for in htree-lock (optional)

...