
@DonIsaac DonIsaac commented Jul 23, 2025

What This PR Does

Do not merge this PR yet. It still has rough edges that need to be smoothed out.
Please do leave any and all feedback.

Proposes the concept of heap regions via trait Subspace rather than via Vec<Option<T>>. It also implements IsoSubspace, a region storing well-sized, bindable, homogeneous data of a single type. To show what this would look like, I refactored heap.array to use IsoSubspace.

This design is heavily inspired by JavaScriptCore's Subspaces.

Key Differences

The idea of heap regions already exists in Nova. This proposal seeks to solidify them by moving to an opaque type whose backing store cannot be directly accessed. It tries to do so within the existing heap architecture. I tried to limit the effect that switching to a subspace will have on the rest of the engine.

Goals/Motivation

Custom heap-allocated data types

Runtimes want to store custom native data types on the heap. The heap currently requires regions to be direct properties of Heap. Since heap polymorphism is currently implemented by lots of copied code, this cannot be addressed yet. Trait-based implementations will unblock this.

The proposed solution uses WithSubspace<T>, which tells the heap which subspace to store Ts in. This can then be used to replace CreateHeapData.

impl Heap {
    /// Allocate a value within the heap.
    pub(crate) fn alloc<'a, T: SubspaceResident>(&mut self, value: T::Bound<'a>) -> T::Key<'a>
    where
        T::Key<'a>: WithSubspace<T>,
    {
        T::Key::subspace_for_mut(self).alloc(value)
    }
}

NOTE: this is currently a rough edge. Rust cannot correctly infer T when calling heap.alloc(data). For now I'm implementing CreateHeapData as sugar.
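To make the dispatch idea concrete, here is a minimal, simplified sketch rather than the code in this PR. It drops the lifetimes and the Bound/Key associated types, and it hangs WithSubspace on the value type itself, which sidesteps the inference problem instead of solving it. The names Index, ArrayData, StringData and the get method are hypothetical stand-ins.

```rust
/// A heap region that hands out keys for the values stored in it.
pub trait Subspace<T> {
    type Key: Copy;
    fn alloc(&mut self, value: T) -> Self::Key;
    fn get(&self, key: Self::Key) -> Option<&T>;
}

/// Hypothetical key type: a plain index into one region.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Index(pub u32);

/// Stand-in for the PR's IsoSubspace: one region, one value type.
pub struct IsoSubspace<T> {
    slots: Vec<T>,
}

impl<T> Default for IsoSubspace<T> {
    fn default() -> Self {
        IsoSubspace { slots: Vec::new() }
    }
}

impl<T> Subspace<T> for IsoSubspace<T> {
    type Key = Index;
    fn alloc(&mut self, value: T) -> Index {
        let key = Index(self.slots.len() as u32);
        self.slots.push(value);
        key
    }
    fn get(&self, key: Index) -> Option<&T> {
        self.slots.get(key.0 as usize)
    }
}

/// Tells the heap which region a value type lives in.
pub trait WithSubspace: Sized {
    type Space: Subspace<Self>;
    fn subspace_for(heap: &Heap) -> &Self::Space;
    fn subspace_for_mut(heap: &mut Heap) -> &mut Self::Space;
}

// Hypothetical heap data types, only for illustration.
pub struct ArrayData { pub len: u32 }
pub struct StringData(pub String);

#[derive(Default)]
pub struct Heap {
    arrays: IsoSubspace<ArrayData>,
    strings: IsoSubspace<StringData>,
}

impl WithSubspace for ArrayData {
    type Space = IsoSubspace<ArrayData>;
    fn subspace_for(heap: &Heap) -> &Self::Space { &heap.arrays }
    fn subspace_for_mut(heap: &mut Heap) -> &mut Self::Space { &mut heap.arrays }
}

impl WithSubspace for StringData {
    type Space = IsoSubspace<StringData>;
    fn subspace_for(heap: &Heap) -> &Self::Space { &heap.strings }
    fn subspace_for_mut(heap: &mut Heap) -> &mut Self::Space { &mut heap.strings }
}

impl Heap {
    /// One generic entry point; the trait picks the region.
    pub fn alloc<T: WithSubspace>(&mut self, value: T) -> <T::Space as Subspace<T>>::Key {
        T::subspace_for_mut(self).alloc(value)
    }
    pub fn get<T: WithSubspace>(&self, key: <T::Space as Subspace<T>>::Key) -> Option<&T> {
        T::subspace_for(self).get(key)
    }
}
```

In this simplified form `heap.alloc(ArrayData { len: 3 })` infers T from the argument. The PR's version takes `T::Bound<'a>`, an associated-type projection, which is why T cannot be inferred there.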

More Efficient Backing Stores

We would like to experiment with different forms of backing stores that are not Vecs. A major downside of Vecs is that worst-case insertions require allocating an entirely new buffer and copying its data over.

No progress can be made here until the API surface of heap regions is controlled. I've implemented the minimal set of APIs to make Subspace and IsoSubspace work for heap.array, but no more. These should be expanded as necessary, but we must maintain control over what's in them (e.g. do not implement Deref<Target = Vec<T>> for IsoSubspace<T>).
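As an illustration of what a controlled surface could look like, the sketch below keeps the backing store private and exposes only a handful of operations. It is an assumption-laden example, not the PR's IsoSubspace: Region, SlotKey and fill are hypothetical names, and only reserve_intrinsic echoes a method that appears in this PR.

```rust
/// Hypothetical key into a region.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SlotKey(pub u32);

/// A region whose storage is private, so callers cannot rely on it being a Vec.
pub struct Region<T> {
    slots: Vec<Option<T>>,
}

impl<T> Region<T> {
    pub fn new() -> Self {
        Region { slots: Vec::new() }
    }

    /// Store a value and return its key.
    pub fn alloc(&mut self, value: T) -> SlotKey {
        let key = SlotKey(self.slots.len() as u32);
        self.slots.push(Some(value));
        key
    }

    /// Reserve an empty slot to be filled in later (replaces a raw `push(None)`).
    pub fn reserve_intrinsic(&mut self) -> SlotKey {
        let key = SlotKey(self.slots.len() as u32);
        self.slots.push(None);
        key
    }

    /// Hypothetical helper: fill a reserved slot once. Returns false if the
    /// key is out of range or the slot is already occupied.
    pub fn fill(&mut self, key: SlotKey, value: T) -> bool {
        match self.slots.get_mut(key.0 as usize) {
            Some(slot) if slot.is_none() => {
                *slot = Some(value);
                true
            }
            _ => false,
        }
    }

    pub fn get(&self, key: SlotKey) -> Option<&T> {
        self.slots.get(key.0 as usize).and_then(|slot| slot.as_ref())
    }

    pub fn len(&self) -> usize {
        self.slots.len()
    }
}
```

Because there is no Deref to the inner Vec, the storage could later be swapped for something else without touching call sites that only use these methods.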

.builtin_functions
.extend((0..intrinsic_function_count()).map(|_| None));
agent.heap.arrays.push(None);
let array_prototype = agent.heap.arrays.reserve_intrinsic();
Contributor Author


I don't like this; it feels ad hoc.

{
/// # Do not use this
/// This is only for Value discriminant creation.
const _DEF: Self;
Contributor Author


I don't think this should live here, since it's only used on newtypes that have Value variants. It is, however, extremely convenient.

///
/// Do not expose this outside of the `subspace` module.
#[derive(Clone, Copy)]
pub(super) struct Name(
Contributor Author


Possibly over-complicated; maybe not worth 8 bytes?

@aapoalas
Member

Thank you for the proposal / PR / trait.

Something akin to this is absolutely the correct direction to go in; holding direct Vecs won't be the be-all and end-all. That being said, I don't exactly see how this matches up with the "Custom heap-allocated data types" aim: the internal storage of "built-in" types should be entirely orthogonal to embedder-defined custom data types, and one can expect that (assuming embedder types are not prepared at compile time) storing custom data types will be less efficient than built-in data types, as custom types will need metadata and virtual methods while built-in data types can be known statically and need no such things. From that point of view, I'm not really comfortable adding two pointers to every single heap vector / subspace. The name is entirely unnecessary on those, and the alloc count is presumably there only because the alloc function cannot write to the heap's alloc_counter field directly. (Side note: the alloc counter should eventually become one or more atomic counters, possibly one per heap vector to enable "heartbeat" counting. But! They should likely be allocated far from the vector memory itself, so as to not cause cache contention.)

I'm also not terribly worried about the cost of reallocating heap vectors on growth: yes, growing a vector will be slow (and a Struct-of-Arrays vector growth is even slower), but it shouldn't be a very common operation. If needs must, we can change heap vectors to be chunked vectors, but that will obviously cost us heavily on the lookup performance front.
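For reference, the trade-off with chunked vectors can be sketched as follows. This is an illustrative toy, not a proposal for Nova's heap: ChunkedVec and the tiny CHUNK size are hypothetical. Growth never moves existing elements, but every lookup pays a division, a remainder, and an extra pointer hop.

```rust
/// Deliberately tiny chunk size so the behaviour is easy to observe.
const CHUNK: usize = 4;

/// A vector made of fixed-capacity chunks. Pushing never moves existing
/// elements; only the small outer list of chunk headers may reallocate.
pub struct ChunkedVec<T> {
    chunks: Vec<Vec<T>>,
    len: usize,
}

impl<T> ChunkedVec<T> {
    pub fn new() -> Self {
        ChunkedVec { chunks: Vec::new(), len: 0 }
    }

    /// Append a value and return its index.
    pub fn push(&mut self, value: T) -> usize {
        // Start a new chunk when the last one is full (or none exists yet).
        if self.len % CHUNK == 0 {
            self.chunks.push(Vec::with_capacity(CHUNK));
        }
        self.chunks
            .last_mut()
            .expect("a chunk exists after the check above")
            .push(value);
        self.len += 1;
        self.len - 1
    }

    /// Lookup costs two indexing steps instead of one.
    pub fn get(&self, index: usize) -> Option<&T> {
        self.chunks.get(index / CHUNK)?.get(index % CHUNK)
    }

    pub fn chunk_count(&self) -> usize {
        self.chunks.len()
    }
}
```

A real implementation would likely use a power-of-two chunk size so the division and remainder become a shift and a mask, but the second memory access remains.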

Rather than that, my first instinct is actually to use virtual memory if that becomes necessary (and is possible). So, e.g., if the ObjectHeapData ends up being a SoA vector of two fields (shapes and property storage indexes), then the SoA would point to the start of the virtual memory containing up to 1 GiB of objects; the first half of that would be virtual memory for the shapes, the latter half would be for the indexes, but only the first pages of both halves would actually be backed by real memory. As more and more objects are allocated, instead of reallocating the memory somewhere else, reserved but unbacked virtual memory pages would get committed and backed by real memory. Thus, slowly the 1 GiB gets turned from virtual allocations into real allocations. If the program keeps allocating more objects past the 1 GiB line, then we'll need to actually reallocate, but that seems like a rather rare case.
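The bookkeeping behind that idea can be modelled without touching the OS. The sketch below only simulates page commits with counters; a real version would reserve address space up front (for example `mmap` with `PROT_NONE` on Unix, or `VirtualAlloc` with `MEM_RESERVE` then `MEM_COMMIT` on Windows). The 4 KiB page size, the element sizes passed to `new`, and all type names are assumptions for illustration.

```rust
const PAGE: usize = 4096; // assumed page size
const RESERVED: usize = 1 << 30; // 1 GiB of reserved address space
const HALF: usize = RESERVED / 2; // one half per SoA column

/// One column of the SoA, living at a fixed offset in the reservation.
pub struct ReservedColumn {
    base: usize,
    elem_size: usize, // assumed to be non-zero
    committed_pages: usize,
}

impl ReservedColumn {
    /// Make sure enough pages are "committed" to hold `count` elements.
    fn ensure(&mut self, count: usize) {
        let needed = (count * self.elem_size + PAGE - 1) / PAGE;
        if needed > self.committed_pages {
            // A real implementation would ask the OS to back these pages.
            self.committed_pages = needed;
        }
    }

    fn offset(&self, index: usize) -> usize {
        self.base + index * self.elem_size
    }
}

pub struct ObjectSoa {
    shapes: ReservedColumn,
    indexes: ReservedColumn,
    len: usize,
}

impl ObjectSoa {
    pub fn new(shape_size: usize, index_size: usize) -> Self {
        ObjectSoa {
            shapes: ReservedColumn { base: 0, elem_size: shape_size, committed_pages: 0 },
            indexes: ReservedColumn { base: HALF, elem_size: index_size, committed_pages: 0 },
            len: 0,
        }
    }

    /// How many objects fit before a true reallocation would be needed.
    pub fn capacity(&self) -> usize {
        (HALF / self.shapes.elem_size).min(HALF / self.indexes.elem_size)
    }

    /// Allocate one object slot; existing data never moves.
    pub fn push(&mut self) -> Option<usize> {
        if self.len == self.capacity() {
            return None; // reservation exhausted
        }
        self.len += 1;
        self.shapes.ensure(self.len);
        self.indexes.ensure(self.len);
        Some(self.len - 1)
    }

    pub fn committed_bytes(&self) -> usize {
        (self.shapes.committed_pages + self.indexes.committed_pages) * PAGE
    }

    /// Byte offsets of an object's two fields within the reservation.
    pub fn offsets(&self, index: usize) -> (usize, usize) {
        (self.shapes.offset(index), self.indexes.offset(index))
    }
}
```

With 4-byte shapes and 4-byte indexes, the first 1024 objects need one committed page per column; the 1025th commits a second page in each half, while every earlier offset stays valid.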

In general, I am not opposed to putting the heap vectors behind some kind of wrappers, but it's maybe not going to be quite this simple of an API. One of my goals for the next two months is to design and implement a SoA vector type that could be used for the heap vectors: this would mean that different heap data would be laid out differently, depending on how many fields they get split up into. Many heap vectors will also have sparse arrays (hash map or btree) on the side; these cannot be put into the same Vec or SoAVec but conceptually should be in the same "subspace". As such, built-in data types will likely have a bit of a varied API forever and a day: they're not all the same after all.

Final note: I've been toying with an idea of splitting the Agent's or Heap's memory itself into a SoAVec-like structure: first have all the heap vector pointers in a static array, then have all the vector lengths in a static array, then have all the vector capacities in a static array. The benefit here would be that e.g. capacity wouldn't be needed unless you're growing the vector, and it'd stay out of the way. In a far-flung future where we're really sure we're not (usually) doing anything wrong with our indexes, the length of the vector would also become unnecessary as we're sure that our given index into the vector is within allocated memory; no need for a bounds check. This'd mean that the entire set of heap vector pointers could be held in just a few cache lines.
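The cache-line arithmetic behind that last sentence can be checked with a small sketch. Everything here is an assumption for illustration: 32 heap vectors, 64-byte cache lines, 32-bit lengths and capacities, and the HeapVectorTable name. It only compares sizes and does not manage any memory.

```rust
use std::mem::size_of;

const CACHE_LINE: usize = 64; // assumed cache-line size in bytes
const KINDS: usize = 32; // hypothetical number of heap vectors

/// Pointers, lengths, and capacities split into three parallel arrays.
#[allow(dead_code)]
pub struct HeapVectorTable {
    ptrs: [*mut u8; KINDS],
    lens: [u32; KINDS],
    caps: [u32; KINDS],
}

/// Number of cache lines needed to cover `bytes` contiguous bytes.
pub fn cache_lines(bytes: usize) -> usize {
    (bytes + CACHE_LINE - 1) / CACHE_LINE
}

/// Lines touched when only the pointers are needed (split layout).
pub fn pointer_lines_split() -> usize {
    cache_lines(size_of::<[*mut u8; KINDS]>())
}

/// Lines spanned by the same pointers when each sits inside a full
/// (pointer, capacity, length) Vec header.
pub fn pointer_lines_interleaved() -> usize {
    cache_lines(size_of::<[Vec<u8>; KINDS]>())
}
```

On a typical 64-bit target the split layout keeps all 32 pointers in 256 bytes, while the interleaved headers span three times as much memory.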

If something like that is eventually done, then the idea of a Subspace struct becomes entirely unworkable for the built-in data types. (It might also be possible and even beneficial to do something similar with custom embedder data types and their Subspaces, but there we'd need to be way more careful and the API would need to be uniform across all embedder data types. There, a trait of some sort would likely make sense.)
