Conversation
fenech
left a comment
In general I like the idea, although I'm not sure what you mean by your last paragraph (and the second one sounds like you're trying to sell me something 😜 ).
shell/common/common_benchmarks.sh (outdated):

    source_file "$ALOJA_REPO_PATH/shell/common/common_java.sh"
    set_java_requires

    if [ ! -z "$use_hadoop" ]; then
Why ! -z rather than -n, or even just nothing at all: if [[ $use_hadoop ]]? But any of these only checks whether the variable is set and non-empty; what if someone wrote use_hadoop=false?
Indeed, changing the conditionals.
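To illustrate the point above (a hypothetical snippet, not taken from the repository): all three spellings behave identically, and all of them treat use_hadoop=false as enabled.

```shell
#!/usr/bin/env bash
# Each test only asks "is the variable set and non-empty?", so a user
# who writes use_hadoop=false still passes every one of them.
use_hadoop=false

[ ! -z "$use_hadoop" ] && echo '! -z: treated as enabled'
[ -n "$use_hadoop" ]   && echo '-n: treated as enabled'
[[ $use_hadoop ]]      && echo 'bare [[ ]]: treated as enabled'
```

All three lines are printed, which is exactly the trap being discussed.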
Perhaps the code for initialising each separate thing can stay inside the engine-specific file, and the common file can just check whether it needs initialising, calling the init function if so.
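The suggested snippet itself did not survive in this extract; a hypothetical sketch of the pattern (function names are illustrative, not ALOJA's real helpers) might look like:

```shell
#!/usr/bin/env bash
# Engine file (e.g. common_hadoop.sh) owns the "how":
init_hadoop() {
  echo "hadoop initialised"
}

# Common file (common_benchmarks.sh) owns the "whether":
maybe_init() {
  local engine="$1"
  local flag_var="use_${engine}"
  # Indirect expansion reads use_hadoop, use_spark, ... by name.
  if [[ ${!flag_var} = 1 ]]; then
    "init_${engine}"
  fi
}

use_hadoop=1
maybe_init hadoop   # prints "hadoop initialised"
maybe_init spark    # use_spark is unset, so nothing happens
```

The common file never needs to know the engine-specific details; it only dispatches to init_$engine when the flag is set.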
npoggi
left a comment
Nice to see you are using the reviews... have been following them btw...
Ok, so this is something that needs to be updated at some point, but from the current code it doesn't look like PaaS (or SaaS) deployments are taken into account. I'm also not sure whether some calls would run more than once (e.g., set_java_requires when using both Derby and Hadoop).
Basically, your proposal is exactly what I'm doing here. A common function in common_benchmarks initialises the engines; said initialisation consists of a few steps: sourcing common_$engine, preparing configuration, preparing variables and starting services. All these steps are engine-specific functions located in the common_$engine file. In essence, the common_benchmarks function "benchmark_suite_start_config" does exactly what you say: it checks whether an engine is used and initialises it.

Nico, the workflow is mostly the same as it was before; even in PaaS and SaaS the prepare functions, the download of software and the starting of services are still called. It is in those functions that we test whether we are in PaaS and, if so, do nothing. As an example, we always call "start_hadoop" in common_hadoop.sh, and inside it we check for PaaS.
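The body of start_hadoop was not captured in this extract; a hypothetical reconstruction of the idea (the variable name defaultProvider is an assumption, not confirmed by the thread) would be:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: start_hadoop is always called, but on PaaS it
# returns early because the platform manages the services itself.
start_hadoop() {
  if [ "$defaultProvider" = "PaaS" ]; then
    return 0   # nothing to start on PaaS deployments
  fi
  echo "starting Hadoop services"
}

defaultProvider="PaaS"
start_hadoop   # no output: services are platform-managed
```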
Another example: even in PaaS we always try to download the software by calling the "requires" function, and those functions always begin with a similar guard.
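Again, the actual body was elided from this extract; a hypothetical sketch of such a "requires" guard (function and variable names are illustrative) could be:

```shell
#!/usr/bin/env bash
# Hypothetical: on PaaS the binaries are provided by the platform,
# so the requires function registers nothing to download.
set_hadoop_requires() {
  if [ "$defaultProvider" = "PaaS" ]; then
    return 0   # platform ships the binaries; nothing to download
  fi
  BENCH_REQUIRED_PACKAGES="$BENCH_REQUIRED_PACKAGES hadoop"
}

defaultProvider="PaaS"
BENCH_REQUIRED_PACKAGES=""
set_hadoop_requires
echo "packages to download: '$BENCH_REQUIRED_PACKAGES'"   # empty on PaaS
```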
Both in the old and the new version we always go through source common_$engine.sh, set_$engine_requires, prepare_$engine_config and start_$engine_services. It is in those functions that we check whether the code needs to run or not. As the benchmarks stand right now, yes, sourcing of common_$engine can happen twice: Hive and Derby necessarily need Hadoop and Java, so if a benchmark uses those engines and is not yet ported to the new system, Hive needs to import Hadoop to maintain legacy functionality. Once all benchmarks are ported, sourcing, preparing and starting services should only occur once.
shell/common/common_benchmarks.sh (outdated):

    set_java_requires

    - if [ ! -z "$use_hadoop" ]; then
    + if [ "$use_hadoop" ]; then
That doesn't do anything to fix the case where $use_hadoop is set but has a value like 0 or false. It's just a more succinct way of writing what you already had. I was thinking that maybe a simple 1 or 0 would be sufficient, then we explicitly test against those values: [[ $use_hadoop = 1 ]]
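A hypothetical before/after of the two checks:

```shell
#!/usr/bin/env bash
use_hadoop=false   # a user trying to disable Hadoop

# Set-and-non-empty check: wrongly treats "false" as enabled.
if [[ $use_hadoop ]]; then
  echo "set-check: would initialise Hadoop"
fi

# Explicit value check, as suggested: "false" is not 1, so it is skipped.
if [[ $use_hadoop = 1 ]]; then
  echo "value-check: would initialise Hadoop"
fi
```

Only the first message is printed, so the explicit value check is the one that behaves as a user would expect.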
shell/common/common_benchmarks.sh:

    @@ -1,3 +1,5 @@
    source_file "$ALOJA_REPO_PATH/shell/common/common_java.sh"
    set_java_requires
Aren't these just the lines that you removed in the first commit? If so, just modify that commit and repush the branch - we don't need to follow the story of "remove, then bugfix (add them back)". You can just rewrite the history and pretend that it never happened.
@alejandromontero I guess the first part of your comment was responding to me. In which case, no, it's not quite the same, as I was proposing a single function (hiding the implementation details).
Force-pushed from 85d9274 to a7ca454.
shell/common/common_benchmarks.sh (outdated):

    source_file "$ALOJA_REPO_PATH/shell/common/common_java.sh"
    init_java

    if [ "$use_hadoop" ]; then
As I said before, this just tests whether use_hadoop has been set to anything other than an empty string. Would rather see [[ $use_hadoop = 1 ]], for example. Same goes for the other variables.
Force-pushed from 53fe272 to 0d29ff0.
fenech
left a comment
One minor change left to do. The only other thing I'd say is to try and cut down the commit message to focus on the "what" rather than the "how". We can see the implementation details from the code, so the description of the commit could even just be as simple as "Move initialisation logic into central location", unless there are specific issues that really need to be mentioned.
| [ ! "$HIVE_ENGINE" ] && HIVE_ENGINE="tez" # Available options are: tez (default, hadoop 2 only), mr (MapReduce) | ||
| [ ! "$HIVE_ML_FRAMEWORK" ] && HIVE_ML_FRAMEWORK="mahout" # Available options are: spark, spark-2, mahout(default) | ||
| [ ! "$BB_SERVER_DERBY" ] && BB_SERVER_DERBY="true" # Available options are: true(Server/Client deployment), false(embeded only) | ||
| [ ! "$BB_SERVER_DERBY" ] && BB_SERVER_DERBY=1 # Available options are: true(Server/Client deployment), false(embeded only) |
There was a problem hiding this comment.
Comment should be updated to match the new behaviour. Do/did we ever check for false before? To be honest, this one might make more sense with descriptive strings, e.g. embedded instead of false (or change the variable name to something a bit more descriptive, like $BB_DERBY_USE_EMBEDDED).
...or just don't touch it at all in this commit!
Many ALOJA variables need a refactor; for instance, ENGINE={hive,spark-sql} needs to be renamed. I propose doing a refactor of benchmark_defaults and changing all related variables in a single branch. We also need to discuss standardising the names and the way the variables are populated under different conditions; for instance, when using a control variable, should use_spark be set to 1 or to "true", etc.
I'm not proposing that we do sweeping refactors through the whole code base to change these things, I'm just saying that if you change the code here, you should update the comment to match.
For now I'm happy with using 1 and 0 for the new variables (and making sure the tests check the value, rather than just whether the variable is set or not).
The thing about ENGINE seems completely unrelated (and I don't understand what the problem with the current names is).
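Putting the two points together, a hypothetical rewrite of the Derby default with an updated comment and an explicit value test (illustrative only, not the committed code):

```shell
#!/usr/bin/env bash
# Hypothetical: default BB_SERVER_DERBY to 1.
# Available options are: 1 (Server/Client deployment), 0 (embedded only)
[ ! "$BB_SERVER_DERBY" ] && BB_SERVER_DERBY=1

if [[ $BB_SERVER_DERBY = 1 ]]; then
  echo "Derby: Server/Client deployment"
else
  echo "Derby: embedded only"
fi
```

The comment now matches the values the code actually tests, which is the whole point of the review remark.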
fenech
left a comment
Is there any reason not to merge this into master now?
| [ ! "$HIVE_ENGINE" ] && HIVE_ENGINE="tez" # Available options are: tez (default, hadoop 2 only), mr (MapReduce) | ||
| [ ! "$HIVE_ML_FRAMEWORK" ] && HIVE_ML_FRAMEWORK="mahout" # Available options are: spark, spark-2, mahout(default) | ||
| [ ! "$BB_SERVER_DERBY" ] && BB_SERVER_DERBY="true" # Available options are: true(Server/Client deployment), false(embeded only) | ||
| [ ! "$BB_SERVER_DERBY" ] && BB_SERVER_DERBY=1 # Available options are: true(Server/Client deployment), false(embeded only) |
There was a problem hiding this comment.
I'm not proposing that we do sweeping refactors through the whole code base to change these things, I'm just saying that if you change the code here, you should update the comment to match.
For now I'm happy with using 1 and 0 for the new variables (and making sure the tests check the value, rather than just whether the variable is set or not).
The thing about ENGINE seems completely unrelated (and I don't understand what the problem with the current names is).
A new function in common_benchmarks permits configuring any benchmark
in a centralized way without duplicating code. It uses specific variables
"use_$engine" ($engine being hadoop, hive or spark) to set the requirements
to download, prepare and start the necessary services.

The objective of this modification is to reduce benchmark definition
complexity, code duplication and code that runs unnecessarily more than
once, and to improve readability.

Some sourcing is still maintained to provide support for legacy benchmarks
until they are all ported to the new system.

Modified benchmark configuration conditionals to avoid incorrect uses.

BugFix: Hive- and Derby-only benchmarks such as hive-test and derby-test
that are not yet ported to the new configuration system need to source
Hadoop and Java in order to work. Eventually, once all benchmarks are
ported, these sources need to go away.

Added an init method in each engine that handles the initialization and
configuration.

Also, initialization of the monitoring tools is moved to
benchmark_suite_start_config as part of the benchmark configuration; this
avoids an issue where some monitoring tools were not correctly started.

Modified the way we check whether an engine is needed, using use_$engine=1.
This guards against initializing an engine when a garbage string is
assigned to the variable.
Solve package download and configuration sync
Force-pushed from 80d1909 to 03b60dc.
I detected a very nasty bug: before starting any service, the package itself and the configuration need to be downloaded and synced. In the previous approach, the sourcing of the script and the initialisation were done before downloading and syncing. I have subtly changed the flow of the initialisation; it is now virtually the same as it was before this new approach, and it maintains the simplicity of the benchmark definition as well. The problem: all benchmarks need to be changed to this new approach after merging.
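A hypothetical sketch of the corrected ordering (stub functions stand in for the real ALOJA helpers; only the names set_$engine_requires, prepare_$engine_config and start_$engine_services come from this thread, the rest are assumptions):

```shell
#!/usr/bin/env bash
# Stubs that record the order in which the helpers run.
CALL_ORDER=""
trace() { CALL_ORDER="$CALL_ORDER $1"; }

set_hadoop_requires()   { trace requires; }  # register packages to fetch
sync_packages()         { trace sync; }      # download and sync package + config
prepare_hadoop_config() { trace prepare; }
start_hadoop_services() { trace start; }

# Fixed flow: download and sync *before* any service is started.
init_hadoop() {
  set_hadoop_requires
  sync_packages
  prepare_hadoop_config
  start_hadoop_services
}

init_hadoop
echo "${CALL_ORDER# }"   # requires sync prepare start
```

The bug described above amounts to running the start step before the sync step; pinning the order inside a single init function makes that impossible to get wrong per benchmark.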
DO NOT MERGE, ONLY PROVIDE A REVIEW
A new function in common_benchmarks permits configuring any benchmark
in a centralized way without duplicating code. It uses specific variables
"use_$engine" ($engine being hadoop, hive or spark) to set the requirements
to download, prepare and start the necessary services.

The objective of this modification is to reduce benchmark definition
complexity, code duplication and code that runs unnecessarily more than
once, and to improve readability.

For now, only BigBench supports the new configuration system.

Some sourcing is still maintained to provide support for legacy benchmarks
until they are all ported to the new system.