
Conversation

@normen662 (Contributor)

No description provided.

@normen662 changed the title from "initial version from poc" to "vector index maintainer" on Nov 13, 2025
@normen662 force-pushed the vector-index-maintainence branch from 71cc760 to 7f1d54e on November 13, 2025 19:01
@normen662 added the "enhancement" label on Nov 13, 2025
@normen662 force-pushed the vector-index-maintainence branch 6 times, most recently from bdcc506 to aa1d6d4, on November 19, 2025 09:18
@normen662 force-pushed the vector-index-maintainence branch from aa1d6d4 to 15e7df6 on November 19, 2025 10:57
Merge remote-tracking branch 'upstream/main' into vector-index-maintainence
     .thenApply(valueBytes -> {
         if (valueBytes == null) {
+            throw new IllegalStateException("cannot fetch node");
-            return null;
Collaborator

It's a bit weird that this is only going to affect the CompactStorageAdapter, especially as it means that we sort of have to audit where we call fetchNodeInternal for possible null checks--but only some of the time.

The other way to do this would be to add a new method like fetchNodeIfExists, and then have this call that one (reintroducing the null check). We could either make the two storage adapters symmetric, or only expose the method on the CompactStorageAdapter. (I'm personally fine with making them symmetrical even if we only ever call this for the purposes of checking for existence on the CompactStorageAdapter).
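Something along these lines is the shape I have in mind; a rough sketch only, where fetchNodeIfExists, fetchNodeInternal, and nodeFromRaw are illustrative names rather than the actual API:

// Sketch only: method and type names are illustrative, not the actual code in this PR.
CompletableFuture<Node> fetchNodeIfExists(final Transaction transaction, final byte[] key) {
    // FDB get() resolves to null when the key is absent
    return transaction.get(key)
            .thenApply(valueBytes -> valueBytes == null ? null : nodeFromRaw(valueBytes));
}

CompletableFuture<Node> fetchNodeInternal(final Transaction transaction, final byte[] key) {
    return fetchNodeIfExists(transaction, key)
            .thenApply(node -> {
                if (node == null) {
                    // strict variant: absence is a logic error here
                    throw new IllegalStateException("cannot fetch node");
                }
                return node;
            });
}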

Contributor Author

I did do the audit for null checks. It only affects the CompactStorageAdapter since a non-existing node and a node without any neighbors cannot be differentiated in the inline storage layout (at least not currently).

})
.thenCompose(accessInfoAndNodeExistence -> {
    if (accessInfoAndNodeExistence.isNodeExists()) {
        return AsyncUtil.DONE;
Collaborator

If I'm reading this correctly, I believe this means that if the same primary key is used with multiple vectors, whichever is inserted first will be the one and only one stored in the index for that primary key. There's some niceness to this, as it means that the index can be idempotent (inserting the same vector + primary key multiple times results in exactly one index insert), but it also means that if there's a bug and we re-use primary keys, we'll get this weird result. I think we'd discussed at one point trying to detect whether the newVector matches the existing vector. (I guess that information actually isn't in the index if, somehow, we used a compact encoding on level 0, which we currently ban but could theoretically allow at some point.)

In any case, it would probably be a good idea to add tests of what happens if the same primary key is used for multiple inserts.

Contributor Author

Added tests to ensure the index behaves idempotently. I suppose we can change the behavior later if needed.
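For reference, a sketch of what such a test can look like; insertVector, scanIndexEntries, and vectorIndex are illustrative helpers, not the actual test API:

// Illustrative test sketch; helper names do not correspond to the actual code.
@Test
void insertingSamePrimaryKeyTwiceIsIdempotent() {
    final Tuple primaryKey = Tuple.from(42L);
    final float[] vector = {0.1f, 0.2f, 0.3f};

    insertVector(vectorIndex, primaryKey, vector);
    insertVector(vectorIndex, primaryKey, vector); // second insert should be a no-op

    // exactly one index entry for this primary key
    Assertions.assertThat(scanIndexEntries(vectorIndex, primaryKey)).hasSize(1);
}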

        final int statsThreshold, final boolean useRaBitQ, final int raBitQNumExBits,
        final int maxNumConcurrentNodeFetches, final int maxNumConcurrentNeighborhoodFetches) {
    Preconditions.checkArgument(m <= mMax);
    Preconditions.checkArgument(mMax <= mMax0);
Collaborator

Hm. Maybe it's worth asserting that mMax <= mMax0. If I'm understanding those parameters correctly, we want a larger mMax0 to increase the connectivity of the bottom layer. In theory, mMax0 could be less than mMax; it would just be maladapted, in that level zero would be unhelpfully less connected than the upper levels. Maybe warning the user to also adjust mMax0 makes sense. I could also see the argument that we just want to set this.mMax0 = Math.max(mMax, mMax0). Maybe that makes more sense in the Builder than in the object constructor.

Contributor Author

I would say that mMax0 should be prevented from being smaller than mMax. Given that mMax0 = 2 * mMax is the usual default, if mMax0 somehow ends up smaller than mMax I would consider that an error.
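Enforcing that in the Builder could look roughly like this; a sketch only, where Config and the builder shape are illustrative, not the actual code:

// Sketch only: validate the fan-out parameters up front instead of silently clamping them.
public Config build() {
    Preconditions.checkArgument(m <= mMax,
            "m (%s) must not exceed mMax (%s)", m, mMax);
    Preconditions.checkArgument(mMax <= mMax0,
            "mMax0 (%s) must not be smaller than mMax (%s); default is mMax0 = 2 * mMax",
            mMax0, mMax);
    return new Config(m, mMax, mMax0);
}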

VectorRecord.Builder recordBuilder = VectorRecord.newBuilder();
recordBuilder.mergeFrom(loadedRecord.getRecord());
Assertions.assertThat(loadedRecord.getRecord())
        .isEqualTo(savedRecords.get((int)l).getRecord());
Collaborator

What is this intended to test? That having a vector index doesn't disrupt the read/write path in some way that causes errors unrelated to querying?

Contributor Author (@normen662, Nov 19, 2025)

Yeah, in a way. It makes sure that the vectors are properly indexed and read back when multiple vector indexes are present. It also makes sure (although this currently cannot happen at all) that the vector we see in the record on load is definitely the vector stored with the record, and not the potentially encoded vector (RaBitQ) from the index entries. I think this test will make more sense later when we have covering and non-covering scans; we don't want the RaBitQ vectors to leak out here. I agree, though, that the use case for this test is somewhat limited right now. I took it from the POC and modified it a bit without scrutinizing it too much.

VectorRecord.newBuilder()
        .mergeFrom(Objects.requireNonNull(rec).getRecord())
        .build();
allCounters[record.getGroupId()]++;
Collaborator

Should this assert that the group id is in {0, 1}?

Contributor Author

I think asserts should check the non-test code, not the test case itself. I would consider the case you pointed out a bug in the test itself, and I am fine with it failing with an out-of-bounds exception.
