-
Notifications
You must be signed in to change notification settings - Fork 117
vector index maintainer #3738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
vector index maintainer #3738
Conversation
71cc760 to
7f1d54e
Compare
bdcc506 to
aa1d6d4
Compare
aa1d6d4 to
15e7df6
Compare
Merge remote-tracking branch 'upstream/main' into vector-index-maintainence
| .thenApply(valueBytes -> { | ||
| if (valueBytes == null) { | ||
| throw new IllegalStateException("cannot fetch node"); | ||
| return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit weird that this is only going to affect the CompactStorageAdapter, especially as it means that we sort of have to audit where we call fetchNodeInternal for possible null checks--but only some of the time.
The other way to do this would be to add a new method like fetchNodeIfExists, and then have this call that one (reintroducing the null check). We could either make the two storage adapters symmetric, or only expose the method on the CompactStorageAdapter. (I'm personally fine with making them symmetrical even if we only ever call this for the purposes of checking for existence on the CompactStorageAdapter).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did do the audit for null checks. It only affects the CompactStorageAdapter since a non-existing node and a node without any neighbors cannot be differentiated in the inline storage layout (at least not currently).
fdb-extensions/src/main/java/com/apple/foundationdb/async/hnsw/HNSW.java
Show resolved
Hide resolved
| }) | ||
| .thenCompose(accessInfoAndNodeExistence -> { | ||
| if (accessInfoAndNodeExistence.isNodeExists()) { | ||
| return AsyncUtil.DONE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm reading this correctly, I believe that this means that if the same primary key is used with multiple vectors, whichever is inserted first will be the one and only one stored in the index for that primary key. There's some niceness to this, as it means that the index can be idempotent (as inserting the same vector + primary key multiple times will result in exactly one index insert), but it also means that if there's a bug and we re-use primary keys, we'll get this weird result. I think we'd discussed at one point trying to detect if the newVector matches the existing vector. (I guess that that information actually isn't in the index if, somehow, we used a compact encoding on level 0. Which we currently ban, but you know, we could theoretically do at one point.)
In any case, it would probably be a good idea to add tests of what happens if the same primary key is used for multiple inserts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added testing to ensure the index behaves in an idempotent way. I suppose we can change the behavior if needed.
| final int statsThreshold, final boolean useRaBitQ, final int raBitQNumExBits, | ||
| final int maxNumConcurrentNodeFetches, final int maxNumConcurrentNeighborhoodFetches) { | ||
| Preconditions.checkArgument(m <= mMax); | ||
| Preconditions.checkArgument(mMax <= mMax0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. Maybe it's worth asserting that mMax <= mMax0. If I'm understanding those parameters correctly, we want a larger mMax0 to increase the connectivity of the bottom layer. In theory, that could be less than mMax, it would just be maladapted in that level zero would now be unhelpfully less connected than upper levels. Maybe warning the user to also adjust mMax0 makes sense. I could also see the argument that we just want to set this.mMax0 = Math.max(mMax, mMax0). Maybe that makes more sense in the Builder than the object constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say that mMax0 should be prevented from being smaller than mMax. Given that under normal circumstances it is usually mMax0 = 2*mMax per default I would say that if by any chance mMax0 is smaller than mMax I would consider that an error.
fdb-extensions/src/main/java/com/apple/foundationdb/linear/RealVector.java
Outdated
Show resolved
Hide resolved
...st/java/com/apple/foundationdb/record/provider/foundationdb/indexes/VectorIndexTestBase.java
Outdated
Show resolved
Hide resolved
| VectorRecord.Builder recordBuilder = | ||
| VectorRecord.newBuilder(); | ||
| recordBuilder.mergeFrom(loadedRecord.getRecord()); | ||
| Assertions.assertThat(loadedRecord.getRecord()).isEqualTo(savedRecords.get((int)l).getRecord()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this intended to test? That having a vector index doesn't disrupt the read/write path in some way that causes errors unrelated to querying?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, in a way. It makes sure that the vectors are properly indexed and read back when multiple vector indexes are present. It also makes sure (although this currently cannot happen at all), that the vector we see in the record on load definitely is the vector that is stored with the record and not the potentially encoded vector (RaBitQ) from the index entries. I think this test will make more sense later when we have covering and non-covering scans. We don't want the RaBitQ vectors to leak out here. I agree, though, that the use case for this test is somewhat limited right now. I think I took it from the POC and modified it a bit without scrutinizing it too much.
| VectorRecord.newBuilder() | ||
| .mergeFrom(Objects.requireNonNull(rec).getRecord()) | ||
| .build(); | ||
| allCounters[record.getGroupId()] ++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this assert that the group id is in {0, 1}?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think asserts should assert the non-test code and not the test case itself. I would consider the case you pointed out to be a bug in the test case itself and I am fine for it to fail with an out-of-bound exception.
...st/java/com/apple/foundationdb/record/provider/foundationdb/indexes/VectorIndexTestBase.java
Outdated
Show resolved
Hide resolved
...st/java/com/apple/foundationdb/record/provider/foundationdb/indexes/VectorIndexTestBase.java
Outdated
Show resolved
Hide resolved
Merge remote-tracking branch 'upstream/main' into vector-index-maintainence
No description provided.