
Concurrent Immix #1355

Merged
wks merged 67 commits into mmtk:master from tianleq:concurrent-immix
Sep 18, 2025

Conversation

@tianleq
Collaborator

@tianleq tianleq commented Jul 28, 2025

We add a concurrent Immix plan. It can do concurrent marking (non-moving), and falls back to stop-the-world Immix collection with opportunistic defragmentation.

We add a snap-at-the-beginning barrier to support concurrent Immix.

Plans now control when to clear the side unlog bits. This allows the same policy (specifically ImmixSpace) to be used by different plans, each clearing the unlog bits at a different time.
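The snap-at-the-beginning (SATB) barrier mentioned above can be sketched roughly as follows. This is a minimal illustration with made-up names, not the actual mmtk-core `Barrier` implementation: before a field is overwritten during concurrent marking, the old value is remembered so the snapshot stays complete.

```rust
use std::collections::VecDeque;

// Stand-in for a real object reference; illustrative only.
type ObjectReference = usize;

struct SatbBarrier {
    marking_active: bool,
    // Old values captured during concurrent marking; the collector
    // later traces them as if they were part of the initial snapshot.
    satb_buffer: VecDeque<ObjectReference>,
}

impl SatbBarrier {
    // Called *before* a field write. `old` is the value about to be
    // overwritten; remembering it preserves the snapshot taken at the
    // beginning of marking (a deletion barrier).
    fn pre_write(&mut self, old: Option<ObjectReference>) {
        if self.marking_active {
            if let Some(o) = old {
                self.satb_buffer.push_back(o);
            }
        }
    }
}
```

Outside of concurrent marking the barrier is a no-op, which is why clearing the unlog bits at the right time matters: the unlog bits gate whether the slow path runs at all.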

Comment thread src/policy/immix/mod.rs Outdated
Comment thread src/plan/barriers.rs Outdated
Comment thread src/lib.rs Outdated
Comment thread src/policy/space.rs Outdated
Comment thread src/policy/largeobjectspace.rs Outdated
Comment thread src/plan/global.rs Outdated
Comment thread src/scheduler/gc_work.rs Outdated
Comment thread src/policy/largeobjectspace.rs Outdated
Comment thread src/global_state.rs Outdated
Comment thread src/plan/concurrent/immix/global.rs Outdated
Comment thread src/plan/plan_constraints.rs Outdated
Comment thread src/plan/tracing.rs Outdated
Comment thread src/util/alloc/immix_allocator.rs
Comment thread src/vm/collection.rs Outdated
Comment thread src/lib.rs Outdated
Comment thread src/scheduler/work_bucket.rs Outdated
@wks
Collaborator

wks commented Jul 29, 2025

What will happen if a mutator wants to fork() while concurrent marking is in progress? There are subtle interactions between preparing for forking (MMTK::prepare_to_fork), the concurrently running marking task (including the dirty mark bits, etc.), and the WorkerGoal which is currently only designed for triggering GC and triggering "prepare-to-fork".

We may temporarily tell the VM binding that MMTk doesn't currently support forking when using concurrent GC. We may fix it later. One simple solution is postponing the forking until the current GC finishes.

The status quo is that only CRuby and Android need forking, but CRuby will not support concurrent GC in the short term.

@k-sareen
Collaborator

What will happen if a mutator wants to fork() while concurrent marking is in progress?

You can't let this happen. Either the binding needs to ensure that the mutator waits while concurrent marking is active, or you don't let a concurrent GC happen before forking (ART's method of dealing with this).

@wks
Collaborator

wks commented Jul 29, 2025

You can't let this happen. Either the binding needs to ensure that the mutator waits while concurrent marking is active, or you don't let a concurrent GC happen before forking (ART's method of dealing with this).

Agreed. Fortunately, the current API doc for prepare_to_fork says:

    /// This function sends an asynchronous message to GC threads and returns immediately, but it
    /// is only safe for the VM to call `fork()` after the underlying **native threads** of the GC
    /// threads have exited.  After calling this function, the VM should wait for their underlying
    /// native threads to exit in VM-specific manner before calling `fork()`.

So a well-behaving VM binding shall wait for all the GC worker threads (which are created by the binding via VMCollection::spawn_gc_thread anyway) to exit before calling fork(). That's VM-specific, but not hard. Extending this API to support concurrent GC should not require the VM binding to rewrite this part.
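The "wait for the native GC threads to exit" step described above can be sketched in a few lines. This is an assumption-laden illustration, not binding code: it supposes the binding kept the `JoinHandle`s of the worker threads it spawned (real bindings create them via `VMCollection::spawn_gc_thread`, and the handle bookkeeping is VM-specific).

```rust
use std::thread::JoinHandle;

// After calling MMTK::prepare_to_fork (which returns immediately),
// the binding must wait for the underlying native GC threads to exit
// before it is safe to call fork().
fn wait_for_gc_threads(handles: Vec<JoinHandle<()>>) {
    for h in handles {
        h.join().expect("GC worker thread panicked");
    }
}
```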

@k-sareen
Collaborator

Mentioning it before I forget: we need to change the name for the GC counter to say it's pauses or have a separate counter that counts pauses.

Comment thread src/scheduler/gc_work.rs Outdated
Comment thread src/scheduler/gc_work.rs Outdated
Comment thread src/plan/concurrent/mod.rs
Comment thread src/plan/global.rs Outdated
Comment thread src/plan/barriers.rs Outdated
Comment thread src/plan/barriers.rs Outdated
Comment thread src/plan/concurrent/concurrent_marking_work.rs Outdated
Comment thread src/plan/tracing.rs Outdated
Comment thread src/util/address.rs Outdated
Collaborator

@wks wks left a comment

The current "WorkerGoal" mechanism should be able to handle the case where mutators trigger another GC between InitialMark and FinalMark. We can remove WorkPacketStage::Initial and WorkPacketStage::ConcurrentSentinel, and use GCWorkScheduler::on_last_parked to transition the GC state from InitialMark to the concurrent marking to FinalMark and finally finish the GC. See inline comments for more details.

Comment thread src/scheduler/work_bucket.rs Outdated
Comment thread src/scheduler/scheduler.rs Outdated
Comment thread src/scheduler/work_bucket.rs Outdated
Comment thread src/plan/barriers.rs
Comment thread src/scheduler/scheduler.rs Outdated
Comment thread src/scheduler/work_bucket.rs Outdated
@wks
Collaborator

wks commented Jul 30, 2025

It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects (actually the mark bits of all words in a line are set). This is OK. But what if we need VO bits? The default strategy for maintaining VO bits is copying from the mark bits. Since all words of the line are marked, the VO bits will also be all ones if we blindly copy VO bits from the mark bits. But because post_alloc already sets the VO bits, we just need to keep their VO bits as is.

To properly support VO bits, we need to do two things:

  1. Reset the BumpPointer of ImmixAllocator so that after the InitialMark, all mutators start allocation from empty lines (i.e. make sure a line never contains both objects allocated before InitialMark and objects allocated after InitialMark).
  2. In Block::sweep,
    • if a line only contains objects allocated after InitialMark, we keep the VO bits as is;
    • if it only contains live objects before InitialMark, we copy the VO bits over from mark bits;
    • otherwise the line must be empty. We clear its VO bits.

There are multiple ways to know if a line only contains new objects or old objects.

  • If a line is marked, but the mark bits of a line are all zero, it must be a line that only contains objects allocated after InitialMark. This doesn't need extra metadata. (Update: This is not true. If a mutator adds an edge from a live old object to the new object, the new object will still be reachable.)
  • If a line is marked, and the mark bits are a superset of the VO bits, it must be a line that only contains objects allocated after InitialMark. If a block only contains objects before the InitialMark, the mark bits must be a subset of the VO bits. Only all-ones is a superset of all possible VO-bit patterns. If the mark bits and VO bits are identical (every word is an individual object, and they are either all live or all new), it doesn't matter if we copy the mark bits or retain the VO bits. Currently each Line is 256 bytes, corresponding to 32 bits of mark bits or VO bits. It should be easy and efficient to do 32-bit bit operation.
  • We introduce another one-bit-per-line metadata to record if it contains objects allocated after InitialMark.
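The superset test from the second bullet is cheap precisely because of the sizes involved: each 256-byte line corresponds to one `u32` of mark bits and one `u32` of VO bits. A minimal sketch, assuming that metadata layout (the function name is made up for illustration):

```rust
/// Returns true if the mark bits are a superset of the VO bits, i.e.
/// every word that carries a VO bit is also marked. Per the discussion
/// above, a marked line with this property (in particular, all-ones
/// mark bits from bulk-setting) only contains objects allocated after
/// InitialMark; a line holding only pre-InitialMark objects has mark
/// bits that are a subset of its VO bits.
fn mark_is_superset_of_vo(mark_bits: u32, vo_bits: u32) -> bool {
    mark_bits & vo_bits == vo_bits
}
```

When the two bit patterns are identical, the superset test also holds, and as noted above it then does not matter whether we copy the mark bits or retain the VO bits.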

@tianleq
Collaborator Author

tianleq commented Jul 30, 2025

"It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects." This is not true. Both the lines and every word within those lines are marked. So if one blindly copies mark bits to VO bits, the VO bit will be 1 even if that address is not an object.

@qinsoon
Member

qinsoon commented Jul 30, 2025

It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects. This is OK. But what if we need VO bits?

I think we can just mark lines, and also mark each individual object. The bulk set only works for side mark bits anyway. Let's not get things entangled and complicated.

@tianleq
Collaborator Author

tianleq commented Jul 30, 2025

It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects. This is OK. But what if we need VO bits?

I think we can just mark lines, and also mark each individual object. The bulk set only works for side mark bits anyway. Let's not get things entangled and complicated.

The problem is that we do not want to do the check in the fast path. If we want to mark each individual object, then in the allocation fast path we need to check whether concurrent marking is active and then set the mark bit, whereas the current bulk-setting approach only sets the mark bits in the slow path.
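The fast-path/slow-path trade-off being argued here can be sketched as follows. This is illustrative pseudologic, not the real `ImmixAllocator`: with side mark bits, the fast path stays a plain bump allocation, and the "is concurrent marking active?" check plus the bulk mark-bit setting are paid only when a fresh line is acquired in the slow path.

```rust
const LINE_BYTES: usize = 256;

struct Allocator {
    cursor: usize,
    limit: usize,
    marking_active: bool,
    // Lines whose mark bits were bulk-set (illustrative stand-in for
    // real side metadata).
    bulk_marked_lines: Vec<usize>,
}

impl Allocator {
    // Fast path: no concurrent-marking check at all.
    fn alloc(&mut self, size: usize) -> usize {
        if self.cursor + size <= self.limit {
            let addr = self.cursor;
            self.cursor += size;
            addr
        } else {
            self.alloc_slow(size)
        }
    }

    // Slow path: acquire the next line. Only here do we check for
    // concurrent marking and bulk-set the line's mark bits, so objects
    // allocated during marking are kept alive without fast-path work.
    fn alloc_slow(&mut self, size: usize) -> usize {
        let line = self.limit; // pretend the next line is free
        self.cursor = line;
        self.limit = line + LINE_BYTES;
        if self.marking_active {
            self.bulk_marked_lines.push(line / LINE_BYTES);
        }
        self.alloc(size)
    }
}
```

Marking each individual object instead would move that `marking_active` check into `alloc`, which is exactly the cost being avoided.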

@wks
Collaborator

wks commented Jul 30, 2025

"It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects." This is not true. Both lines and every word within that line are marked. So if one blindly copy mark bits to vo bits, then vo bit will be 1 even if that address is not an object

I think we can just mark lines, and also mark each individual object. The bulk set only works for side mark bits anyway. Let's not get things entangled and complicated.

Actually I think for correctness, we must not set mark bits of individual objects when allocating. Suppose there are objects A and B when GC is triggered, and root -> ... -> A -> B. During concurrent marking, a mutator allocated C, and changed the object graph to root -> ... -> A -> C -> B. Then a GC worker visits A for the first time. If C already has the mark bit, GC will not enqueue C, and will not enqueue B, either. It will consider B as dead.

(Update: I forgot the SATB barrier. It will remember B when we remove the edge A -> B.)

@tianleq
Collaborator Author

tianleq commented Jul 30, 2025

"It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects." This is not true. Both lines and every word within that line are marked. So if one blindly copy mark bits to vo bits, then vo bit will be 1 even if that address is not an object

I think we can just mark lines, and also mark each individual object. The bulk set only works for side mark bits anyway. Let's not get things entangled and complicated.

Actually I think for correctness, we must not set mark bits of individual objects when allocating. Suppose there are objects A and B when GC is triggered, and root -> ... -> A -> B. During concurrent marking, a mutator allocated C, and changed the object graph to root -> ... -> A -> C -> B. Then a GC worker visits A for the first time. If C already has the mark bit, GC will not enqueue C, and will not enqueue B, either. It will consider B as dead.

In your case, B will be captured by the SATB barrier, so it will not be considered dead. There is no need to scan those newly allocated objects, because any children of a newly allocated object must have been alive in the snapshot and are thus guaranteed to be traced.

@qinsoon
Member

qinsoon commented Jul 30, 2025

It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects. This is OK. But what if we need VO bits?

I think we can just mark lines, and also mark each individual object. The bulk set only works for side mark bits anyway. Let's not get things entangled and complicated.

The problem is that we do not want to do the check in the fast path. If we want to mark each individual object, then in the allocation fast path we need to check whether concurrent marking is active and then set the mark bit, whereas the current bulk-setting approach only sets the mark bits in the slow path.

The mark bit could be in the header, and we would have to set it per object if it is in the header. We can differentiate between the header mark bit and the side mark bit, and deal with each differently.

But bulk-setting mark bits is still a bit hacky -- and this is why we would have issues with VO bits. VO bits are copied from mark bits, assuming the mark bits are only set for individual objects.

Comment thread src/plan/barriers.rs
Comment thread src/plan/generational/global.rs Outdated
Comment thread src/plan/concurrent/immix/global.rs Outdated
Comment thread src/util/alloc/immix_allocator.rs Outdated
}

fn trace_object(&mut self, object: ObjectReference) -> ObjectReference {
let new_object = self
Member

If we add a check to skip young blocks here, the more expensive mark table zeroing in immix allocator can be removed.

Collaborator

I guess you mean mark table bulk-setting (to 1s). If we can identify if the object is in a block that was completely free when concurrent marking started, we can be sure that the object must be newly allocated. But ConcurrentImmix can allocate into partially used blocks (i.e. into holes of lines). Checking blocks will not work here. And the trace_object here may be dispatched to other spaces, too, such as LOS.

}

fn object_probable_write_slow(&mut self, obj: ObjectReference) {
crate::plan::tracing::SlotIterator::<VM>::iterate_fields(obj, self.tls.0, |s| {
Member

Need to enqueue obj, not its fields. If I remember correctly, at the time this code is called, all fields are uninitialized.

Member

Probably I'm wrong, but I guess object_probable_write is not necessary at all. This obj should already be in the root set of the InitialMark pause, and will be marked eventually.

Collaborator

@wks wks Sep 12, 2025

I think you are right. The semantics of the MMTk-side API Barrier::object_probable_write says

    /// A pre-barrier indicating that some fields of the object will probably be modified soon.
    /// Specifically, the caller should ensure that:
    ///     * The barrier must called before any field modification.
    ///     * Some fields (unknown at the time of calling this barrier) might be modified soon, without a write barrier.
    ///     * There are no safepoints between the barrier call and the field writes.

If the fields are assigned during concurrent marking, the (new) values will either come from the snapshot at the beginning, or be a new object allocated during concurrent marking. In either case, they will be kept alive. (Update: But the old children of the fields that are overwritten by the assignments will not be kept alive.)

To put it another way, the SATB barrier is a deletion barrier. As long as no objects are disconnected from other objects, there is no need to apply the barrier. In the case of OpenJDK, it is assigning objects to fields that are not yet initialized, so no objects are disconnected.

I'll remove object_probable_write for SATBBarrier.
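The semantics quoted from the doc comment can be sketched as follows. Names and signature are illustrative, not the mmtk-core implementation: under a deletion (SATB) barrier, `object_probable_write` must conservatively remember all of the object's current children, since any of them may be overwritten soon without a per-field write barrier.

```rust
// `current_children` stands in for the object's field values at the
// time of the call; a real implementation would iterate the object's
// reference fields via the VM's scanning interface.
fn object_probable_write(
    current_children: &[usize],
    marking_active: bool,
    satb_buffer: &mut Vec<usize>,
) {
    if marking_active {
        // Conservatively treat every old child as potentially deleted.
        satb_buffer.extend_from_slice(current_children);
    }
}
```

For OpenJDK's specific use (writing into uninitialized fields), the children slice is empty, which is why the call could in principle be elided on the binding side.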

Collaborator

Wait. On second thought, I think if MMTk implements the semantics described in the doc comment of Barrier::object_probable_write, it should still remember the old field values. If another VM calls object_probable_write on an object that already has children, it will still need to remember the old children because they may well be overwritten. So Barrier::object_probable_write still has to be implemented as it currently is.

The OpenJDK binding can do VM-specific optimization by eliding the invocation of mmtk_object_probable_write in MMTkSATBBarrierSetRuntime::object_probable_write, making use of the knowledge of the SATBBarrier. Currently only the OpenJDK binding uses Barrier::object_probable_write. So it will not be disruptive to other bindings if we change the semantics of Barrier::object_probable_write to make it more specific to OpenJDK's use case and do more optimizations. But we need to change the semantics first.

Collaborator

@k-sareen k-sareen Sep 13, 2025

Please don't remove it or change semantics arbitrarily. I (kind of) depend on it in ART as well. ART has a barrier WriteBarrier::ForEveryFieldWrite whose semantics are essentially object_probable_write. Currently since I only have a generational post-write object remembering barrier implemented, I just call the normal post-write function, but this will be fixed in the future when I add concurrent GC support.

@wks wks added the PR-extended-testing Run extended tests for the pull request label Sep 15, 2025
@wks
Collaborator

wks commented Sep 15, 2025

I enabled extended testing. I didn't set the binding test repo and branch to the PR for the OpenJDK binding (mmtk/mmtk-openjdk#311) because we will test and merge that separately.

@qinsoon
Member

qinsoon commented Sep 15, 2025

I enabled extended testing. I didn't set the binding test repo and branch to the PR for the OpenJDK binding (mmtk/mmtk-openjdk#311) because we will test and merge that separately.

Why? We should test with the OpenJDK PR.

@wks
Collaborator

wks commented Sep 15, 2025

I enabled extended testing. I didn't set the binding test repo and branch to the PR for the OpenJDK binding (mmtk/mmtk-openjdk#311) because we will test and merge that separately.

Why? We should test with the OpenJDK PR.

I am confused. We've always been saying "we'll fix the binding later" or something.

@k-sareen
Collaborator

k-sareen commented Sep 15, 2025

But how will it work without the correct barrier etc? You need to tell it to use the binding PR so that the barrier works.

EDIT: I think the tests won't even run ConcurrentImmix actually because the CI doesn't have it added yet. So it'll (theoretically) pass but ConcurrentImmix will not have been tested.

@qinsoon
Member

qinsoon commented Sep 15, 2025

I enabled extended testing. I didn't set the binding test repo and branch to the PR for the OpenJDK binding (mmtk/mmtk-openjdk#311) because we will test and merge that separately.

Why? We should test with the OpenJDK PR.

I am confused. We've always been saying "we'll fix the binding later" or something.

It seems that we can only use the "we'll fix the binding later" approach when a change is non-breaking and a binding may choose whether to opt in. In most cases, we don't do that.

When introducing a new plan, we always test with at least one binding to make sure it works (at least for that binding). Otherwise, there is no way for us to tell whether the plan works.

@wks
Collaborator

wks commented Sep 17, 2025

binding-refs
OPENJDK_BINDING_REPO=tianleq/mmtk-openjdk
OPENJDK_BINDING_REF=concurrent-immix

Member

@qinsoon qinsoon left a comment

LGTM

@wks
Collaborator

wks commented Sep 18, 2025

Just in case this PR has any side effects on existing plans other than ConcurrentImmix, I'll run some benchmarks to test that.

@wks
Collaborator

wks commented Sep 18, 2025

lusearch from DaCapo Chopin MR2, mole.moma, 2.4x and 3.0x min heap w.r.t. G1, 20 invocations, 5 iterations, comparing master and this PR.

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|mole-2025-09-18-Thu-025228&build^hfac^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&hfac^mmtk_gc&build;build1|40&Histogram%20(with%20CI)^build^mmtk_gc&

GenCopy, GenImmix and StickyImmix become slightly faster in terms of STW time and total time. No obvious difference for Immix. SemiSpace becomes slightly slower in terms of STW time and total time.

But something is strange when I test locally. It seems that GenCopy is using PlanProcessEdges for nursery GCs, too. (It should use GenNurseryProcessEdges.)

@wks
Collaborator

wks commented Sep 18, 2025

But something is strange when I test locally. It seems that GenCopy is using PlanProcessEdges for nursery GCs, too. (It should use GenNurseryProcessEdges.)

False alarm. I was observing the behavior using the capture.py script with -e 50, i.e. capturing every 50th GC. But under the given workload, the GCs alternate between nursery GC and full-heap GC. Because 50 is an even number, if the starting point is a nursery GC, every subsequent observation will also be a nursery GC, and the same is true for full-heap GCs. Coincidentally, I only observed full-heap GCs for this PR and thought it could not trigger nursery GCs. The harness_end output shows that about 50% of all GCs are nursery GCs. This is an illusion caused by a stroboscopic effect. Perhaps I should document it in the README file of the eBPF tracing tools.
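The stroboscopic effect is easy to reproduce numerically. A small illustration, under the simplifying assumption that GC kinds strictly alternate: sampling every 50th GC (an even period) always lands on the same kind, while an odd period reveals both.

```rust
// Model GC i as nursery when i is even, full-heap when i is odd
// (assumed strict alternation, as in the workload described above),
// then keep only every `every`-th observation.
fn sampled_kinds(total_gcs: usize, every: usize) -> Vec<&'static str> {
    (0..total_gcs)
        .step_by(every)
        .map(|i| if i % 2 == 0 { "nursery" } else { "full-heap" })
        .collect()
}
```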

@wks
Collaborator

wks commented Sep 18, 2025

I cannot reproduce the performance difference locally on my computer. This PR does not change GenCopy or CopySpace. So I can only explain this speed-up for GenXxxxx plans and the slight slow-down of SemiSpace by the nondeterminism caused by profile-guided optimization.

I'll merge this PR.

@wks wks added this pull request to the merge queue Sep 18, 2025
Merged via the queue into mmtk:master with commit a4dd70c Sep 18, 2025
37 of 39 checks passed
mmtkgc-bot added a commit to mmtk/mmtk-openjdk that referenced this pull request Sep 18, 2025
We added support for the new ConcurrentImmix plan introduced to mmtk-core in mmtk/mmtk-core#1355.

We implemented the SATB barrier fast paths in the OpenJDK binding, and refactored the barriers to support both pre- and post-barriers, as well as a (weak) reference loading barrier. The OpenJDK binding is now aware of concurrent marking, too.

---------

Co-authored-by: Yi Lin <qinsoon@gmail.com>
Co-authored-by: Kunshan Wang <wks1986@gmail.com>
Co-authored-by: mmtkgc-bot <mmtkgc.bot@gmail.com>