This repository was archived by the owner on Jan 4, 2026. It is now read-only.

Benchmark configuration refactor.#22

Open
alejandromontero wants to merge 1 commit into master from refactor/change-bench-config-behavior

Conversation

@alejandromontero
Contributor

DO NOT MERGE, ONLY PROVIDE A REVIEW

A new function in common_benchmarks allows any benchmark to be configured
in a centralized way without duplicating code. It uses per-engine variables
"use_$engine" ({hadoop,hive,spark}) to set the requirements to download,
prepare and start the necessary services.

The objective of this modification is to reduce benchmark definition
complexity, code duplication and code that unnecessarily runs more than
once, and to improve readability.

For now only BigBench supports the new configuration system.

Some sourcing is still maintained to provide support for legacy benchmarks
until they are all ported to the new system.

Contributor

@fenech fenech left a comment


In general I like the idea, although I'm not sure what you mean by your last paragraph (and the second one sounds like you're trying to sell me something 😜 ).

source_file "$ALOJA_REPO_PATH/shell/common/common_java.sh"
set_java_requires

if [ ! -z "$use_hadoop" ]; then
Contributor


Why ! -z rather than -n, or even just nothing at all if [[ $use_hadoop ]]? But this only checks whether the variable is set and non-empty, what if someone wrote use_hadoop=false?
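A minimal sketch of the difference being pointed out here; the function names are illustrative, not ALOJA code:

```shell
#!/usr/bin/env bash
# flag_is_set mirrors `[ ! -z "$use_hadoop" ]`: true for ANY non-empty
# value, including counter-intuitive ones like "false" or "0".
flag_is_set() {
  [ -n "$1" ]
}

# flag_is_enabled tests an explicit value instead, as suggested later
# in this thread.
flag_is_enabled() {
  [ "$1" = 1 ]
}

use_hadoop=false
flag_is_set "$use_hadoop" && echo "set-test: would initialise Hadoop anyway"
flag_is_enabled "$use_hadoop" || echo "value-test: correctly skips Hadoop"
```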

Contributor Author


Indeed, changing the conditionals.

@fenech
Contributor

fenech commented Sep 14, 2017

Perhaps the code for initialising each separate thing can stay inside the engine-specific file, and the common file can just check whether it needs initialising, calling the init function if so.

Something like this in common_benchmarks.sh:

# import these (if they haven't already been imported at an earlier stage, 
# but I guess they're just function definitions anyway so doesn't matter too much)
. common_hadoop.sh
. common_spark.sh
# etc.

benchmark_init () {
    if [[ $use_hadoop = 1 ]]; then
        init_hadoop # defined in common_hadoop.sh
    fi

    if [[ $use_spark = 1 ]]; then
        init_spark # defined in common_spark.sh
    fi

    # etc.
}

Contributor

@npoggi npoggi left a comment


Nice to see you are using the reviews... I have been following them, btw...
Ok, so this is something that needs to be updated at some point, but from the current code it doesn't look like PaaS (or SaaS) deployments are taken into account. Also not sure if some calls would be made more than once (i.e., set_java_requires when using Derby and Hadoop).

@alejandromontero
Contributor Author

alejandromontero commented Sep 15, 2017

Basically your proposal is exactly what I'm doing here. A common function in common_benchmarks initialises the engines; that initialisation is a few steps: sourcing common_$engine, preparing configuration, preparing variables and starting services. All these steps are engine-specific functions located in the common_$engine file.

In essence, common_benchmarks function "benchmark_suite_start_config" does exactly what you say, check whether to use an engine and initialise it.

Nico, the workflow is mostly the same as it was before; even in PaaS and SaaS the prepare functions, the download of software and the starting of services are called. It is in those functions that we test whether we are in PaaS and, if so, do nothing. As an example, we always call "start_hadoop" in common_hadoop.sh and it has the following code:

restart_hadoop(){ if [ "$clusterType" != "PaaS" ]; then

Another example: we always try to download the software by calling the "requires" function, even in PaaS; those functions always do something like:

set_hadoop_requires() { if [ "$clusterType" != "PaaS" ]; then

Both in the old and the new version we always go through source common_$engine.sh - set_$engine_requires - prepare_$engine_config - start_$engine_services. It is in those functions that we check whether the code needs to run or not.

As the benchmarks stand right now, yes, sourcing of common_$engine could happen twice: Hive and Derby necessarily need Hadoop and Java, so if a benchmark uses those engines and is not yet ported to this new system, Hive needs to import Hadoop to maintain legacy functionality. Once all benchmarks are ported, sourcing, preparing and starting services should only occur once.
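The guard pattern described above can be sketched as follows; clusterType and the function names follow the snippets quoted in this thread, but the bodies are illustrative placeholders:

```shell
#!/usr/bin/env bash
# Every engine function is always called; each one decides internally
# whether there is any work to do for the current deployment type.

set_hadoop_requires() {
  # On PaaS the provider supplies Hadoop, so there is nothing to download
  if [ "$clusterType" != "PaaS" ]; then
    echo "downloading Hadoop"
  fi
}

start_hadoop() {
  # Only actually starts services outside PaaS
  if [ "$clusterType" != "PaaS" ]; then
    echo "starting Hadoop services"
  fi
}

clusterType="PaaS"
set_hadoop_requires   # prints nothing on PaaS
start_hadoop          # prints nothing on PaaS
```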

set_java_requires

if [ ! -z "$use_hadoop" ]; then
if [ "$use_hadoop" ]; then
Contributor


That doesn't do anything to fix the case where $use_hadoop is set but has a value like 0 or false. It's just a more succinct way of writing what you already had. I was thinking that maybe a simple 1 or 0 would be sufficient, then we explicitly test against those values: [[ $use_hadoop = 1 ]]

@@ -1,3 +1,5 @@
source_file "$ALOJA_REPO_PATH/shell/common/common_java.sh"
set_java_requires
Contributor


Aren't these just the lines that you removed in the first commit? If so, just modify that commit and repush the branch - we don't need to follow the story of "remove, then bugfix (add them back)". You can just rewrite the history and pretend that it never happened.

@fenech
Contributor

fenech commented Sep 15, 2017

@alejandromontero I guess the first part of your comment was responding to me. In which case, no, it's not quite the same, as I was proposing a single function (hide the implementation details). common_benchmarks.sh doesn't need to know how to set up the environment, just that it needs setting up.

alejandromontero force-pushed the refactor/change-bench-config-behavior branch from 85d9274 to a7ca454 on September 18, 2017 08:41
source_file "$ALOJA_REPO_PATH/shell/common/common_java.sh"
init_java

if [ "$use_hadoop" ]; then
Contributor


As I said before, this just tests whether use_hadoop has been set to anything other than an empty string. Would rather see [[ $use_hadoop = 1 ]], for example. Same goes for the other variables.

alejandromontero force-pushed the refactor/change-bench-config-behavior branch from 53fe272 to 0d29ff0 on September 18, 2017 09:22
Contributor

@fenech fenech left a comment


One minor change left to do. The only other thing I'd say is to try and cut down the commit message to focus on the "what" rather than the "how". We can see the implementation details from the code, so the description of the commit could even just be as simple as "Move initialisation logic into central location", unless there are specific issues that really need to be mentioned.

[ ! "$HIVE_ENGINE" ] && HIVE_ENGINE="tez" # Available options are: tez (default, hadoop 2 only), mr (MapReduce)
[ ! "$HIVE_ML_FRAMEWORK" ] && HIVE_ML_FRAMEWORK="mahout" # Available options are: spark, spark-2, mahout(default)
[ ! "$BB_SERVER_DERBY" ] && BB_SERVER_DERBY="true" # Available options are: true(Server/Client deployment), false(embeded only)
[ ! "$BB_SERVER_DERBY" ] && BB_SERVER_DERBY=1 # Available options are: true(Server/Client deployment), false(embeded only)
Contributor


Comment should be updated to match the new behaviour. Do/did we ever check for false before? To be honest, this one might make more sense with descriptive strings, e.g. embedded instead of false (or change the variable name to something a bit more descriptive, like $BB_DERBY_USE_EMBEDDED).

....or just don't touch it at all in this commit!
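For reference, the `[ ! "$VAR" ] && VAR=...` idiom in the quoted lines is equivalent to bash's default-value parameter expansion; a small sketch using the variable name from the diff, not part of the actual change:

```shell
#!/usr/bin/env bash
# Form used in the diff: assign a default only when the variable is
# unset or empty.
unset BB_SERVER_DERBY
[ ! "$BB_SERVER_DERBY" ] && BB_SERVER_DERBY=1
echo "$BB_SERVER_DERBY"          # → 1

# Parameter-expansion equivalent of the same default assignment.
unset BB_SERVER_DERBY
: "${BB_SERVER_DERBY:=1}"
echo "$BB_SERVER_DERBY"          # → 1

# Either way, an existing value is left untouched.
BB_SERVER_DERBY=true
: "${BB_SERVER_DERBY:=1}"
echo "$BB_SERVER_DERBY"          # → true
```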

Contributor Author


Many ALOJA variables need a refactor; for instance ENGINE={hive,spark-sql} needs to be renamed. I propose doing a refactor of benchmark_defaults and changing all related variables in a single branch. We also need to discuss standardising the names and the way the variables are populated in different conditions; for instance, when using a control variable, should use_spark be set to 1 or to "true", etc.

Contributor


I'm not proposing that we do sweeping refactors through the whole code base to change these things, I'm just saying that if you change the code here, you should update the comment to match.

For now I'm happy with using 1 and 0 for the new variables (and making sure the tests check the value, rather than just whether the variable is set or not).

The thing about ENGINE seems completely unrelated (and I don't understand what the problem with the current names is).

Contributor

@fenech fenech left a comment


Is there any reason not to merge this into master now?


A new function in common_benchmarks allows any benchmark to be configured
in a centralized way without duplicating code. It uses per-engine variables
"use_$engine" ({hadoop,hive,spark}) to set the requirements to download,
prepare and start the necessary services.

The objective of this modification is to reduce benchmark definition
complexity, code duplication and code that unnecessarily runs more than
once, and to improve readability.

Some sourcing is still maintained to provide support for legacy benchmarks
until they are all ported to the new system.

Modified benchmark configuration conditionals to avoid incorrect uses.

BugFix: Hive and Derby only benchmarks such as hive-test and derby-test
that are not yet ported to the new configuration system need to source
Hadoop and Java in order to work. Eventually, once all benchmarks are
ported, these sources need to go away.

Added an init method in each engine that handles the initialization and
configuration.

Also, initialization of the monitoring tools is moved to
benchmark_suite_start_config as part of the benchmark configuration; this
avoids an issue where some monitoring tools were not correctly started.

Modified the way we check whether an engine is needed, using use_$engine=1.
This adds safety: the engine is not initialized if some garbage string is
assigned to the variable.

Solve package download and configuration sync
alejandromontero force-pushed the refactor/change-bench-config-behavior branch from 80d1909 to 03b60dc on September 18, 2017 15:15
@alejandromontero
Contributor Author

I detected a very nasty bug: before starting any service, the package itself and the configuration need to be downloaded and synced. In the previous approach, the sourcing of the script and the initialisation were done before downloading and syncing.

I subtly changed the flow of the initialisation; now it is virtually the same as it was before this new approach, and it maintains the simplicity of the benchmark definition as well. The problem: all benchmarks need to be changed to this new approach after merging.

3 participants