HDDS-14921. Improve space accounting in SCM with In-Flight container allocation tracking. #10000
ashishkumar50 wants to merge 9 commits into apache:master
Conversation
rakeshadr left a comment
Thanks @ashishkumar50 for providing the patch. Added a few comments, please take care.
if (!alreadyOnDn && getContainerManager() instanceof ContainerManagerImpl) {
  ((ContainerManagerImpl) getContainerManager())
      .getPendingContainerTracker()
      .removePendingAllocation(dd, id);
Say, DN is healthy, all containers confirmed, no new allocations → that DN's bucket never rolls even though heartbeats come every 30 seconds, right?
t=0 Container C1 allocated → pending recorded in tracker
t=60-120 FCR arrives from DN
→ cid = C1
→ alreadyInDn = expectedContainersInDatanode.remove(C1) = FALSE
→ !alreadyInDn = TRUE → removePendingAllocation called → rollIfNeeded fires ✓
→ C1 added to NM DN-set
How about rolling on every processHeartbeat, every 30 seconds, regardless of container state changes?
Added a roll on every node report, which comes per minute from the DN.
}

// Cleanup empty buckets to prevent memory leak
if (bucket.isEmpty()) {
This could potentially hit a concurrency issue. Say two threads enter this block:
Thread-1 (removePendingAllocation): bucket.isEmpty() returns true.
Thread-2 (recordPendingAllocationForDatanode): computeIfAbsent(uuid) returns the same bucket reference (the key still exists) and calls bucket.add(containerID), so the bucket is now non-empty.
Thread-1: datanodeBuckets.remove(uuid, bucket) then removes the non-empty bucket, and the containerID ends up in a detached bucket object, right?
I think we need to add synchronization to avoid the detached bucket object.
Added sync at bucket level.
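As a rough sketch of the bucket-level synchronization discussed in this thread (names such as `datanodeBuckets` and `recordPendingAllocationForDatanode` are taken from the review comments, not the actual patch): the add path can re-validate the bucket reference under the bucket lock, so a concurrently detached bucket is never written to.

```java
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration of the detached-bucket fix; not the patch itself.
public class PendingBucketSketch {
  private final Map<UUID, Set<Long>> datanodeBuckets = new ConcurrentHashMap<>();

  public void recordPendingAllocationForDatanode(UUID dn, long containerID) {
    while (true) {
      Set<Long> bucket =
          datanodeBuckets.computeIfAbsent(dn, k -> ConcurrentHashMap.newKeySet());
      synchronized (bucket) {
        // Re-check the map still holds this bucket: a concurrent remover may
        // have detached it between computeIfAbsent and taking the lock.
        if (datanodeBuckets.get(dn) == bucket) {
          bucket.add(containerID);
          return;
        }
        // Bucket was detached; retry with a fresh bucket.
      }
    }
  }

  public void removePendingAllocation(UUID dn, long containerID) {
    Set<Long> bucket = datanodeBuckets.get(dn);
    if (bucket == null) {
      return;
    }
    synchronized (bucket) {
      bucket.remove(containerID);
      // Safe cleanup: adders re-validate the bucket reference under the same
      // lock before inserting, so nothing can land in a removed bucket.
      if (bucket.isEmpty()) {
        datanodeBuckets.remove(dn, bucket);
      }
    }
  }

  public boolean hasPending(UUID dn, long containerID) {
    Set<Long> bucket = datanodeBuckets.get(dn);
    return bucket != null && bucket.contains(containerID);
  }
}
```

The retry loop in the add path is what prevents Thread-2's add from landing in Thread-1's detached bucket.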
}

@Test
public void testRemoveFromBothWindows() {
Do we have a test scenario covering rollover?
The two-window rolling behavior (a container in previousWindow ages out after 2× the interval). Say, add C1 to currentWindow, then C1 moves to previousWindow, then wait for the rollover.
Added a test for this: testTwoWindowRollAgesOutContainerAfterTwoIntervals.
sumitagrawl left a comment
@ashishkumar50 Thanks for working on this, I have a few review comments.
 * @param pipeline The pipeline where container is allocated
 * @param containerID The container being allocated
 */
public void recordPendingAllocation(Pipeline pipeline, ContainerID containerID) {
This needs to be part of SCMNodeManager, more specifically SCMNodeStat. Reasons:
- We need to handle events like the stale node / dead node handlers for cleanup.
- We may need to report this to the CLI as part of the available space on the DN.
- It will be used by the pipeline allocation policy, where the container manager does not come into play.
It is datanode space; we are just trying to identify already-allocated space. It also needs to be part of the committed space at SCM when reporting to the CLI, or in other breakdowns.
Moved to the node package.
processContainerReplica(dd, container, replicaProto, publisher, detailsForLogging);

// Remove from pending tracker when container is added to DN
if (!alreadyOnDn && getContainerManager() instanceof ContainerManagerImpl) {
Please check whether the node report is also sent with the ICR; the reason is that the node information should be updated at the same time as the ICR.
// (1*5GB) + (2*5GB) = 15GB → actually 3 containers
long totalCapacity = 0L;
long effectiveAllocatableSpace = 0L;
for (StorageReportProto report : storageReports) {
Instead of calculating all the available space and then subtracting, we can do it progressively, like:
required = pending + newAllocation
for each report:
    required = required - volumeUsage (in round-off value)
    if (required <= 0)
        return true
But we also need to reserve: we can add first and then check, and if there is not enough space, remove the containerId.
Or, the other way: when handling the DN storage report, a consolidated total can also be kept in memory to avoid looping on every call.
Updated the logic to break when enough space is available on any volume.
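A minimal sketch of that early-break check, assuming a simplified stand-in for `StorageReportProto` (the field names and the per-volume availability formula here are illustrative, not the real proto):

```java
import java.util.List;

// Hypothetical sketch of breaking out as soon as one volume has enough space.
public class SpaceCheckSketch {
  /** Simplified stand-in for a DN volume report. */
  public record StorageReport(long capacity, long used, long committed) {
    long available() {
      return capacity - used - committed;
    }
  }

  /**
   * Returns true as soon as any single volume can fit the pending
   * allocations plus one new container, instead of first summing the
   * space of all volumes on every allocation.
   */
  public static boolean hasSpaceForNewContainer(List<StorageReport> reports,
      long pendingAllocations, long maxContainerSize) {
    long required = pendingAllocations + maxContainerSize;
    for (StorageReport report : reports) {
      if (report.available() >= required) {
        return true;  // enough space on this volume; stop scanning
      }
    }
    return false;
  }
}
```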
  return true;
} catch (Exception e) {
  LOG.warn("Error checking space for pipeline {}", pipeline.getId(), e);
  return true;
If we are not sure whether we can create a container here, should we still choose this pipeline? Instead of making it generic, we can specify what to do for each exception we might see.
Moved the code; there is no exception here now.
// Remove from pending tracker when container is added to DN
// This container was just confirmed for the first time on this DN
// No need to remove on subsequent reports (it's already been removed)
if (container != null && getContainerManager() instanceof ContainerManagerImpl) {
Why not just add this to the ContainerManager interface? We can avoid these conversions. Is this because Recon uses the same code path and we don't want it to do this? For Recon we can just make it a no-op.
Moved to the node package, so these conversions are no longer required.
 * Roll the windows: previous = current, current = empty.
 * Called when current time exceeds lastRollTime + rollIntervalMs.
 */
synchronized void rollIfNeeded() {
Pending allocations can persist beyond 2× the roll interval after long idle periods because rollIfNeeded() only rolls once. A single roll doesn't clear entries older than two windows, which can incorrectly block new allocations.
The roll is done when a node report arrives, which is every minute, so old entries will get cleared.
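The multi-roll behavior the reviewer is asking about can be sketched as a standalone two-window tracker (a hypothetical simplification: in the patch the roll is triggered from node reports, and the window holds ContainerIDs per datanode rather than bare longs):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the two-window aging scheme under discussion.
public class TwoWindowSketch {
  private Set<Long> currentWindow = new HashSet<>();
  private Set<Long> previousWindow = new HashSet<>();
  private long lastRollTimeMs;
  private final long rollIntervalMs;

  public TwoWindowSketch(long rollIntervalMs, long nowMs) {
    this.rollIntervalMs = rollIntervalMs;
    this.lastRollTimeMs = nowMs;
  }

  /**
   * Roll once per elapsed interval, so entries older than two windows
   * are dropped even after a long idle period (a single roll would not
   * be enough in that case).
   */
  public synchronized void rollIfNeeded(long nowMs) {
    while (nowMs - lastRollTimeMs >= rollIntervalMs) {
      previousWindow = currentWindow;
      currentWindow = new HashSet<>();
      lastRollTimeMs += rollIntervalMs;
    }
  }

  public synchronized void add(long containerID, long nowMs) {
    rollIfNeeded(nowMs);
    currentWindow.add(containerID);
  }

  public synchronized boolean isPending(long containerID, long nowMs) {
    rollIfNeeded(nowMs);
    return currentWindow.contains(containerID)
        || previousWindow.contains(containerID);
  }
}
```

An entry added at t=0 with a 1-interval roll period survives one roll (moving to previousWindow) and is gone after the second, which is the 2× aging the test above verifies.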
long effectiveRemaining = effectiveAllocatableSpace - pendingAllocations;

// Check if there's enough space for a new container
if (effectiveRemaining < maxContainerSize) {
This makes the allocation a little aggressive, right? Even if we have just 5GB we allocate it. Should we leave some buffer when allocating a container?
No need for an extra buffer here, as we already give a buffer on the DN by considering the soft and hard limits. So in case of some overflow, the DN will accept it until the hard limit.
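The space gate in the quoted diff reduces to a single comparison; a hedged sketch (variable names follow the diff; the absence of an extra buffer reflects the author's reply about DN soft/hard limits):

```java
// Hypothetical condensed form of the space gate; not the patch itself.
public class SpaceGateSketch {
  /**
   * A DN qualifies for a new container when its allocatable space,
   * minus what is already pledged to in-flight allocations, still
   * fits one full container. No extra buffer is subtracted here:
   * the DN itself absorbs overflow up to its hard usage limit.
   */
  public static boolean canAllocate(long effectiveAllocatableSpace,
      long pendingAllocations, long maxContainerSize) {
    long effectiveRemaining = effectiveAllocatableSpace - pendingAllocations;
    return effectiveRemaining >= maxContainerSize;
  }
}
```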
What changes were proposed in this pull request?
Maintain space accounting during container allocation in SCM. A more detailed description is in the Jira.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-14921
How was this patch tested?
UT and IT.