SOLR-18130: rename parameter "zkHost" to "solrCloud" in "solrj-streaming"#4320
SOLR-18130: rename parameter "zkHost" to "solrCloud" in "solrj-streaming"#4320vyatkinv wants to merge 19 commits intoapache:mainfrom
Conversation
27f4ae5 to
c877f2a
Compare
|
Shouldn't |
@dsmiley previously suggested naming this parameter |
Agreed... I believe the intent is to specify how you connect to your running Solr? In the first PR I believe it was |
dsmiley
left a comment
There was a problem hiding this comment.
Great work here! Clearly this was a bit more involved than a rename ;-)
"connectionString" is somewhat long and it's ambiguous as to what we're even connecting to. But it's not bad. How about "solrCloudString"? If you don't like that, I'll capitulate and be satisfied with "connectionString".
Incorporating auth is out-of-scope but I'll say now that I'm suspicious that it makes sense to embed secret credentials into this. A conversation for another time.
I imagine there should be some ref-guide updates on this in addition to major-changes-in-solr-10.adoc.
I can see the potential of a follow-on issue to update CLIUtils.getZkHost (and thus CLI commands that call it) to instead resolve a SolrCloud connection string that is not necessarily ZK.
Answers to the PR questions:
- It's a test issue -- good catch. Glad you made resolving this mandatory. I wouldn't call this "mandatory SolrCloud parameter" since the param can be blank when it's used inside Solr, defaulting to the server's connection.
- Normalize; do not round-trip with "zkHost".
- IMO SolrParams should be used for "parameters" like this, not Map. Perhaps too much scope creep here but I leave that to you.
- Let's not add support for that to ScoreNodes now, albeit a single comment where we initialize it would be nice. Like "for now limited to the current cluster but we could expand support if needed".
I suggest that you change StreamFactory to not mention "ZK" whatsoever, instead using the name we choose (e.g. getDefaultCloudString). For backwards-compatibility, however, the older method names should exist as deprecated.
| @@ -0,0 +1,7 @@ | |||
| title: Parameter 'zkHost' renamed to 'solrCloud' in solrj-streaming. Parameter 'zkHost' is still supported. | |||
There was a problem hiding this comment.
Useful / fine but omits a bigger feature/point in this PR: this parameter now supports HTTP based SolrCloud connection strings (ZK access no longer required).
| collectionName)); | ||
| } | ||
| ModifiableSolrParams params = | ||
| getModifiableSolrParamsWithExclusions( |
There was a problem hiding this comment.
ModifiableSolrParams is a specific impl of SolrParams. I don't think this method name should contain the word "Modifiable"; it's a detail.
|
|
||
| public ShortestPathStream( | ||
| String zkHost, | ||
| String solrCloud, |
There was a problem hiding this comment.
Can we use that nice record you made previously for the connection string instead of String? To be used not just right here but really everywhere in this PR that needs to refer to the connection. It would be so nice to have a real type vs String which is used for anything and everything, defeating the typed nature of Java.
If not, fine.
There was a problem hiding this comment.
String can be replaced with record in private methods, but I'm not sure about public constructors. Even if they are not used within the project, they can still be used by consumers. We can declare them deprecated and introduce new ones with Record
There was a problem hiding this comment.
What ought to be deprecated vs okay to change whenever is often a gray area taking some experience with the project. Even though I haven't personally used the code being changed in this way, I suspect that the only constructor that matters on these things is the StreamExpression expression, StreamFactory factory one. Notice that the other is only used in tests.
| // we don't | ||
| // need to count it twice |
| // don't | ||
| // need to count it twice |
| return shards; | ||
| } | ||
|
|
||
| public static String getSolrCloud( |
There was a problem hiding this comment.
So I know I suggested calling the param "solrCloud" because in context to where a user types it (a streaming expression string), it seems a reasonable choice. But at a code level like here and all the var names, I find it... not so good. getCloudConnectionString would be better IMO and maybe used the record type you made.
There was a problem hiding this comment.
BTW I love this refactoring overall; it cuts down on repeated code.
There was a problem hiding this comment.
done
renamed to buildSolrConnection
| return solrCloud; | ||
| } | ||
|
|
||
| public static ModifiableSolrParams getModifiableSolrParamsWithExclusions( |
There was a problem hiding this comment.
I suggest the name buildSolrParams or buildSolrParamsExcept.
"Modifiable" is overly specific to the SolrParams type.
"get" is overused in java... suggests we retrieve something when in fact we are creating something here.
Also IMO the varargs should be a Set that the caller can easily pass via Set.of
There was a problem hiding this comment.
done
renamed to
buildSolrParamsExcept
| return mParams; | ||
| } | ||
|
|
||
| public static Map<String, String> getMapWithExclusions( |
There was a problem hiding this comment.
if the above method uses MapSolrParams instead of ModifiableSolrParams, then we don't even need this version here since the (few?) callers using it can call MapSolrParams.getMap().
There was a problem hiding this comment.
ModifiableSolrParams.getMap() returns Map<String, String[]> but we needs Map<String, String>
That's why I asked if it was possible to convert all streams to ModifiableSolrParams :)
but I'll think about it some more, maybe I can use generics or a common interface to make a single universal method
There was a problem hiding this comment.
I specifically referred to MapSolrParams.getMap() which returns Map<String, String>
It's also okay to leave both... maybe marking one as deprecated to signify an older approach.
There was a problem hiding this comment.
done
I reworked streams that were using getMapWithExclusions to use SolrParams, or ModifiableSolrParams where a mutable implementation is required.
|
|
||
| private void init( | ||
| String collectionName, TupleStream tupleSource, String zkHost, int updateBatchSize) { | ||
| String collectionName, TupleStream tupleSource, String solrCloud, int updateBatchSize) { |
|
Hi all... So I'm following up after @dsmiley pinged me (thanks!)... What I am excited about this is the opportunity to make it simpler for non Solr experts to access solr. I don't think that you should need to know if it's solrCloud or standalone or zk that you are using, it's just a "connection string". Yes, every connection string has rules of formatting... I would like a name like solrConnection or solrConnString but not something that is as specific as "solrCloudConnection" that ties us to a specific mode of Solr. I know that "Streaming Expressions" is a solr cloud specific feature. However, from a user perspective, they may not know that, or care. Someday I hpoe to see Streaming Expressions pushed more towards our Data Scientist / ML user base, and they don't care about Cloud versus non Cloud mode, and also don't care about zkHost... So, having a property that speaks to their concern "how do I connect" is perfect. Could we have a property like |
|
+1 to "solrConnection"
huh; you just agreed to "solrConnection"? |
|
I was just wondering if in the context of a stream expression if saying "connection" was enough. I like "solrConnection" too. We might look at other connections and see what they are named... |
|
Then lets move forward @vyatkinv with "solrConnection" throughout. If at the last minute Eric gets us to change our minds again, we shouldn't need to subsequently change the vast majority of this PR since "solrConnection" is a perfectly fine parameter & field & local-var name in all places we have such. |
…g> by SolrParams in streams, which used it
|
I’ve addressed most of the review comments. Summary of changes in the latest commits:
As follow-up work (either separate issue or PR), I suggest: PS: Documentation and new tests are not added yet. |
| .get(); | ||
|
|
||
| public static final Option SOLR_CONNECTION_OPTION = | ||
| Option.builder("q") |
There was a problem hiding this comment.
"q"? Interesting... I don't know that I have a opinion one way or the other... Could SOLR_CONNECTION_OPTION just be SOLR_URL_OPTION? I.e do we need another property or can we just use SOLR_URL_OPTION instead? And extend what it takes.
There was a problem hiding this comment.
I didn’t add an implementation for parse this option. It might be better to remove it from here entirely and implement it in a separate PR.
Also, q is just a shorthand for “quorum”.
There was a problem hiding this comment.
I initially planned to implement this option in this PR, but later realized it is not trivial, so I dropped the idea for now and simply forgot to remove it.
There was a problem hiding this comment.
Ahh... That makes sense, yeah, let's remove it from this PR
| @@ -70,7 +71,7 @@ public class ShortestPathStream extends TupleStream implements Expressible { | |||
| private String toField; | |||
| private int joinBatchSize; | |||
| private int maxDepth; | |||
| private String solrCloud; | |||
| private CloudSolrClient.CloudSolrClientConnection solrConnection; | |||
There was a problem hiding this comment.
Maybe just a tiny bit odd that this property is defined in a specic CloudSolrClient. Probably jsut me!
There was a problem hiding this comment.
Maybe you mean you don't like that CloudSolrClientConnection is an inner-class? (technically a record actually) It was created weeks ago to represent a connection to SolrCloud for the benefit of CloudSolrClient... so.... it's here. If we think we might use it in other contexts, it could move.
I'm already disliking the word "Client" in that name.
epugh
left a comment
There was a problem hiding this comment.
This is really looking great. I think you can mark it "not draft" mode ;-)
| @@ -170,8 +171,8 @@ public StreamExpression toExpression(StreamFactory factory) throws IOException { | |||
| } | |||
| } | |||
|
|
|||
| // solrCloud | |||
| expression.addParameter(new StreamExpressionNamedParameter("solrCloud", solrCloud)); | |||
There was a problem hiding this comment.
So we don't expose solrCloud as a property in our streaming expressions? Checking https://solr.apache.org/guide/solr/latest/query-guide/stream-source-reference.html#search-parameters it looks just fine!
| @@ -56,9 +57,8 @@ | |||
| */ | |||
| public class RandomStream extends TupleStream implements Expressible { | |||
|
|
|||
| private String solrCloud; | |||
| // TODO: replace all Map<String, String> in stream handlers by ModifiableSolrParams ? | |||
| chroot != null && connectionString.endsWith(chroot) | ||
| ? connectionString.substring(0, connectionString.length() - chroot.length()) | ||
| : connectionString; | ||
| if (solrConnection.isZk()) { |
There was a problem hiding this comment.
I don't know that I love the short cut isZK versus isZooKeeper (or however we case it? my just be my style.
There was a problem hiding this comment.
I guess it's fine.. it reads pretty easy.
There was a problem hiding this comment.
Have we already published and released something with CloudSolrClientConnection? If not, we'll change it.
There was a problem hiding this comment.
10.1 hasn't shipped, so it can be changed easily now
There was a problem hiding this comment.
done. renamed to isZookeeper
I think that is a great idea! Please do... |
From my perspective Docs and new tests NEED to be part of this PR. I think adding the quorum stuff can be part of this PR or a follow depending on your approach... |
dsmiley
left a comment
There was a problem hiding this comment.
Wow this has become an even bigger PR. Great work here!
| @@ -86,7 +86,8 @@ public boolean isClosed() { | |||
| @Override | |||
| protected Map<String, Table> getTableMap() { | |||
| String zk = this.properties.getProperty("zk"); | |||
There was a problem hiding this comment.
it's reasonably in scope to switch this out too... still supporting "zk" for back-compat.
| } | ||
|
|
||
| TupleStream cloudSolrStream = new CloudSolrStream(streamZkHost, collection, params); | ||
| var solrConnection = CloudSolrClient.CloudSolrClientConnection.parse(streamZkHost); |
There was a problem hiding this comment.
it's reasonably in-scope to update cross-collection join queries to take a "solrConnection" string from the join params, keeping zkHost for back-compat fallback.
| @@ -295,7 +297,7 @@ private Metric getMetric(Pair<String, String> metricPair) { | |||
| } | |||
|
|
|||
| private TupleStream handleSelect( | |||
| String zk, | |||
| CloudSolrClient.CloudSolrClientConnection solrClientConnection, | |||
There was a problem hiding this comment.
minor: rename to solrConnection
| // we don't | ||
| // need to count it twice | ||
| // Validate there are no unknown parameters - solrConnection/zkHost and alias are | ||
| // namedParameter, |
| @@ -381,14 +392,14 @@ protected static class FeaturesSelectionCall implements Callable<NamedList<?>> { | |||
| private final String baseUrl; | |||
| private final String outcome; | |||
| private final String field; | |||
| private final Map<String, String> paramsMap; | |||
There was a problem hiding this comment.
In many places like here you declare a param or field with the type ModifiableSolrParams but that's too specific -- SolrParams should be used.
| @@ -411,7 +422,7 @@ public NamedList<?> call() throws Exception { | |||
| params.add(DISTRIB, "false"); | |||
| params.add("fq", "{!igain}"); | |||
|
|
|||
| for (Map.Entry<String, String> entry : paramsMap.entrySet()) { | |||
| for (Map.Entry<String, String[]> entry : paramsMap) { | |||
| params.add(entry.getKey(), entry.getValue()); | |||
| } | |||
There was a problem hiding this comment.
the switch to SolrParams instead of a Map means we can simply call params.add(paramsMap) in here and some other places in this big PR
| for (Entry<String, String[]> param : props) { | ||
| for (String paramValue : param.getValue()) { | ||
| expression.addParameter(new StreamExpressionNamedParameter(param.getKey(), paramValue)); | ||
| } | ||
| } |
There was a problem hiding this comment.
I've seen this enough times that think there ought to be a utility method, maybe on a base class, that does this. up to you.
There was a problem hiding this comment.
done. added method addParameters to StreamExpression
| ModifiableSolrParams params = new ModifiableSolrParams(); | ||
| ModifiableSolrParams queryRequestParams = new ModifiableSolrParams(); |
There was a problem hiding this comment.
I recommend SolrQuery here, and it has some convenience methods for some of these well known params too.
| @@ -76,36 +76,35 @@ public void setDefaultZKHost(String zkHost) { | |||
| } | |||
| } | |||
|
|
|||
| public synchronized CloudSolrClient getCloudSolrClient(String connectionString) { | |||
| public synchronized CloudSolrClient getCloudSolrClient( | |||
There was a problem hiding this comment.
I think we'll need a deprecated overload here for the previous signature
|
I noticed another issue while writing tests: To address this, I split the cache into separate stores for HTTP and Cloud clients. |
|
I’ve addressed several minor review comments and updated the streaming documentation. In addition, I added validation to ensure that the legacy Next steps (planned shortly):
One open question is the SQL module. Supporting HTTP(S) connections there may be less straightforward, since a connection string like: I’d appreciate any suggestions on the best way to approach this. |
https://issues.apache.org/jira/browse/SOLR-18130
Description
This PR updates the solrj-streaming module by replacing all usages of zkHost with solrCloud to enable support for HTTP-based quorum configurations.
Solution
Parameters, fields and variables in solrj-streaming, named as
zkHostrenamed tosolrCloudFor backward compatibility, specifying zkHost is still supported.
A shared method has been introduced in an abstract class to resolve the effective solrCloud value using a priority-based approach (e.g., explicit parameter → legacy zkHost → default Zookeeper host).
Tests
To Be Done
Checklist
Please review the following and check all that apply:
mainbranch../gradlew check.Planned follow-ups:
Open Questions / Discussion Points
I would appreciate feedback on the following topics (also noted as TODOs in the code):
1. Mandatory solrCloud parameter
I made solrCloud required in all stream handlers (an error is thrown if it cannot be resolved from parameters or the default Zookeeper host).
This caused two tests to fail:
StreamExpressionTest#testCloudSolrStreamStreamExpressionTest#testStatsStreamThese tests did not provide any valid host.
I fixed them by adding
.withDefaultZkHost(cluster.getZkServer().getZkAddress())Question:
Is this a test issue or a flaw in the implementation?
Can such a situation occur in real-world usage?
2. toExpression behavior and backward compatibility
Currently, toExpression reconstructs the expression and always includes solrCloud, even if the user originally provided the legacy zkHost parameter.
As a result, a user who runs an expression with zkHost and then calls explain() will see solrCloud in the output.
Question:
3. Potential issue with gather parameter
In:
org/apache/solr/client/solrj/io/graph/GatherNodesStream.java:392there is the following line:
expression.addParameter(new StreamExpressionNamedParameter("gather", solrCloud));Question:
UPDATE: I looked into the context more deeply and yes, it looks like this is a defect that affects explain(). I fixed it as part of my pull request.
4. Inconsistent parameter handling
Some stream handlers use Map<String, String> for parameters, while others use ModifiableSolrParams.
Question:
Would it make sense to standardize on a single approach?
5. ScoreNodesStream limitation
ScoreNodesStreamcan only obtain zkHost / solrCloud via:factory.getDefaultZkHost();Question: