-
Notifications
You must be signed in to change notification settings - Fork 1
Implement SORTBY on FT.SEARCH #1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Jonathan-Improving
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
allenss-amazon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's important to handle the case where both SORTBY and LIMIT are specified. In this scenario, rather than a full sort of the neighbors, a heap should be used to retain only the required number of entries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The structure of the main data path of this code is wrong.
- It's on the wrong thread and thus looking at potentially stale data. The data for the sort needs to come out of fetched data, not the indexes.
- In cluster mode it's only operating on the local data, not the post-merged global result.
It needs to be placed in the call chain of "SearchCommand::SendReply" after the current contents of the candidate keys has been fetched and is available. This will probably require a logic change in the "NOCONTENT" option (since it's suppressing fetching of per-key fields) as well as ensuring that the content fetch code includes the sortby field.
Perhaps some 1:1 time with me will help here.
src/query/search.cc
Outdated
| return results; | ||
| } | ||
|
|
||
| absl::StatusOr<std::deque<indexes::Neighbor>> ApplySorting( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this can fail. The only possible failure is if the sortby field isn't found, and that's probably a crash since you've already validated it in the command itself long before you get here....
@allenss-amazon I generally understand the idea but yes, I think a 1:1 would be a good idea. I'm looking at the code now and I think I need a bit of guidance on the implementation. How would you like the meeting set up? |
0f89d05 to
8e0de57
Compare
allenss-amazon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good. Two things. 1/ I'd like to see a heap-oriented sort for cases where LIMIT is in effect and 2/ need some data oriented actual search cases.
src/commands/ft_search.cc
Outdated
| double num_a, num_b; | ||
| if (!absl::SimpleAtod(val_a, &num_a)) return false; | ||
| if (!absl::SimpleAtod(val_b, &num_b)) return true; | ||
| result = num_a < num_b; | ||
| } else { | ||
| result = val_a < val_b; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to have these kinds of data comparison centralized so that all of the various ways to compare data get the same answers. I'm not familiar with out SimpleAtoD deals with infinity and NaN, etc. Please take a look at the files in the src/expr/* directory.
src/query/search.h
Outdated
| int k{0}; | ||
| std::optional<unsigned> ef; | ||
| LimitParameter limit; | ||
| SortByParameter sortby; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this belongs in "struct SearchCommand", not here. This parameter is private to the search command, not something that's shared across general queries. (Sorry the command naming here is legacy and therefore confusing).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried doing this and I'm not sure how this would fit into SearchCommand since it needs to be located by the parser and I would have to change the implementation of ConstructSortByParser to take a SearchCommand &, which would propagate to all other parsers in ft.search.
It would also be inconsistent with other parsers, like ft.aggregate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Legacy issue. Originally, there wasn't a difference between SearchParameters and SearchCommand. When this object hierarchy was introduced it looks like the parsing logic for ft.search didn't get the update. Please update it, it shouldn't require much more than changing the parameters definitions on a few functions in the dedicate ft.search parsing logic.
| .sortby_parameters_str = "SORTBY attribute_identifier_2", | ||
| .sortby_field = "attribute_identifier_2", | ||
| .sortby_order = query::SortOrder::kAscending, | ||
| .sortby_enabled = true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A parser test for a field NOT in the schema might be useful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a test where the attribute is not found and found that I actually wasn't testing for this case so that's in here too now.
| "embedding", | ||
| "$embedding", | ||
| "SORTBY", | ||
| "vector", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs some data oriented tests. In theory, the compatibility framework that's under integration/compatibility would be the best place to do this. It used to work for ft.search commands, but I see that those tests have been commented out, which leads me to think that it's not working for those commands. But fixing it should be relatively easy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the latest version should work with ft.search commands. I fixed some bugs and I've been using it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be the place to test numeric conversion errors.
021dec9 to
d758c86
Compare
7349606 to
bff42e1
Compare
allenss-amazon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seem to be several changes in the PR that are either just formatting or perhaps some temporary changes for debugging purposes. Let's get those cleaned up. Otherwise, the basic core functionality looks correct to me.
Also, I'd definitely want to see some test cases in the compatibility test suite
| data_set = do_answer(answers[i], data_set) | ||
| data_set = do_answer(self.server.get_new_client(), answers[i], data_set) | ||
|
|
||
| print(f"Correct answers: {correct_answers} out of {len(answers)}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are changes here intentional? If so, can you explain?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the part of compatibility tests that is commented out. I am looking into writing those numeric tests that you asked for in this file, but I have been having some trouble with the build environment for a few days.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reverted this change in the latest commit
src/commands/ft_search.cc
Outdated
| // Support non-vector queries | ||
| if (IsNonVectorQuery()) { | ||
| query::ProcessNonVectorNeighborsForReply( | ||
| ctx, index_schema->GetAttributeDataType(), neighbors, *this); | ||
| ApplySorting(neighbors, *this); | ||
| SerializeNonVectorNeighbors(ctx, search_result, *this); | ||
| return; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why restrict this to non-vector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't catch this when resolving a merge conflict. It was supposed to be a single added line to the if statement above. I'll go over the entire change again to make sure nothing else slipped through.
src/commands/ft_search_parser.cc
Outdated
| << "Error parsing vector similarity parameters: "; | ||
| } | ||
|
|
||
| if (auto searchCommand = dynamic_cast<SearchCommand *>(¶meters); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anytime I see a dynamic_cast, I wonder if this isn't a band-aid on some incorrect object structure. Here, it looks like the preparse and postparse functions should be virtual, that would eliminate the need for the dynamic cast.
src/commands/ft_search_parser.h
Outdated
| struct SortByParameter { | ||
| std::string field; | ||
| SortOrder order{SortOrder::kAscending}; | ||
| bool enabled{false}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than an explicit enabled flag, perhaps this should be std::optional ?
| .k = 0, | ||
| .ef = 0, | ||
| .score_as = "", | ||
| .vector_query = false, | ||
| .sortby_parameters_str = "SORTBY attribute_identifier_1 DESC", | ||
| .sortby_field = "attribute_identifier_1", | ||
| .sortby_order = SortOrder::kDescending, | ||
| .sortby_enabled = true, | ||
| }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's tweak some of the keywords with one or more lower case letters to ensure that we're getting the case-invariant matching correct.
| "embedding", | ||
| "$embedding", | ||
| "SORTBY", | ||
| "vector", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be the place to test numeric conversion errors.
edb271a to
cd56346
Compare
3a35723 to
645bb26
Compare
2694c9b to
c2c2491
Compare
Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
8defc22 to
62f13e1
Compare
Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
Signed-off-by: Alexandru Filip <alexandru.filip@improving.com>
No description provided.