Skip to content

HDDS-15059. Shift streaming write sortDatanodes logic to OM#10633

Open
chihsuan wants to merge 10 commits into
apache:masterfrom
chihsuan:HDDS-15059
Open

HDDS-15059. Shift streaming write sortDatanodes logic to OM#10633
chihsuan wants to merge 10 commits into
apache:masterfrom
chihsuan:HDDS-15059

Conversation

@chihsuan

@chihsuan chihsuan commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

SCM sorts the write pipeline (nearest datanode first) on every allocateBlock, on its block-allocation hot path. OM already caches the cluster topology (HDDS-9343) and sorts reads locally, so this PR moves the write sort to OM.

  • OMKeyRequest.allocateBlock sends an empty clientMachine to SCM (SCM skips sorting) and sorts each pipeline locally via a new KeyManager.sortDatanodesForWrite; the result is cached per pipeline.
  • Pipeline nodes returned over RPC aren't linked to OM's topology, so they would all look equidistant and sort randomly. OM resolves each node (and the client) to its canonical node in the cached cluster map before sorting, then maps the order back to the original nodes, matching what SCM's nodeManager.getNode did.
  • The cluster map is read once per sort and used for both client resolution and sorting, so a topology refresh mid-sort can't leave them on different snapshots and shuffle the order.
  • The client is matched by both IP and hostname (UserInfo.remoteAddress is always an IP), so a client co-located on a datanode is recognized even with hdds.datanode.use.datanode.hostname enabled.
  • When the client is empty or unresolved, the pipeline order is kept (no shuffle), since the first node is the write primary.
  • SCMBlockProtocolServer is unchanged for rolling-upgrade safety: an old OM still gets SCM-side sorting, a new OM's empty address is a no-op for SCM. No Protobuf/RPC change.

Note: SCM's ALLOCATE_BLOCK audit now logs client="" for OM-originated writes; the per-client audit stays at OM.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15059

How was this patch tested?

  • Unit (TestOMAllocateBlockRequest): SCM receives an empty clientMachine; the sorted order is applied to every block; a shared pipeline is sorted once.
  • Integration (TestOMSortDatanodes): nearest datanode is first for writes, including for RPC-deserialized (protobuf round-tripped) pipeline nodes; order is preserved for an empty or unresolved client; the client is matched by both IP and hostname.
  • Fork build-branch CI: https://github.com/chihsuan/ozone/actions/runs/28455187435

Generated-by: Claude Code (Claude Opus 4.8)

@chihsuan chihsuan marked this pull request as ready for review June 30, 2026 17:08

@peterxcli peterxcli left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ivandika3 Is there any risk that a block ends up allocated on a suboptimal datanodes/pipeline because the OM cache topology is somehow stale?⁠

@ivandika3

ivandika3 commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

@peterxcli thanks for checking.

Is there any risk that a block ends up allocated on a suboptimal datanodes/pipeline because the OM cache topology is somehow stale?⁠

Should have effect on performance but not correctness, since Streaming Write Pipeline should be able to pick an arbitrary topology. The the data path (streaming WriteChunk data) can be sent to any primary (first node) is separated from the metadata path (PutBlock) which will be sent to the DN leader. The impact of suboptimal Streaming write pipeline topology should be worse write latency (e.g. if the topology picks the furthest node as the primary node). But this possible performance penalty also apply to read path (i.e. where the further node is read first) so I think it should be acceptable. Please let me know if I miss something.

@chihsuan Thanks for the patch, I'll review this soon.

@ivandika3 ivandika3 requested a review from szetszwo July 2, 2026 01:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants