This repository was archived by the owner on Jan 4, 2026. It is now read-only.

Benchmark configuration refactor.#22

Open
alejandromontero wants to merge 1 commit into master from refactor/change-bench-config-behavior

Conversation

@alejandromontero
Contributor

DO NOT MERGE, ONLY PROVIDE A REVIEW

A new function in common_benchmarks allows any benchmark to be configured
in a centralized way without duplicating code. It uses per-engine variables
"use_$engine" ({hadoop,hive,spark}) to set the requirements to download,
prepare and start the necessary services.

The objective of this modification is to reduce benchmark definition
complexity, code duplication and code that unnecessarily runs more than
once, and to improve readability.

For now only BigBench supports the new configuration system.

Some sourcing is still maintained to provide support for legacy benchmarks
until they are all ported to the new system.

Contributor

@fenech fenech left a comment


In general I like the idea, although I'm not sure what you mean by your last paragraph (and the second one sounds like you're trying to sell me something 😜 ).

source_file "$ALOJA_REPO_PATH/shell/common/common_java.sh"
set_java_requires

if [ ! -z "$use_hadoop" ]; then
Contributor


Why ! -z rather than -n, or even just nothing at all if [[ $use_hadoop ]]? But this only checks whether the variable is set and non-empty, what if someone wrote use_hadoop=false?
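A minimal sketch of the difference being pointed out here; the function names are illustrative, not ALOJA code:

```shell
#!/usr/bin/env bash
# flag_is_set mirrors `[ ! -z "$use_hadoop" ]`: true for ANY non-empty
# value, including counter-intuitive ones like "false" or "0".
flag_is_set() {
  [ -n "$1" ]
}

# flag_is_enabled tests an explicit value instead, as suggested later
# in this thread.
flag_is_enabled() {
  [ "$1" = 1 ]
}

use_hadoop=false
flag_is_set "$use_hadoop" && echo "set-test: would initialise Hadoop anyway"
flag_is_enabled "$use_hadoop" || echo "value-test: correctly skips Hadoop"
```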

Contributor Author


Indeed, changing the conditionals.

@fenech
Contributor

fenech commented Sep 14, 2017

Perhaps the code for initialising each separate thing can stay inside the engine-specific file, and the common file can just check whether it needs initialising, calling the init function if so.

Something like this in common_benchmarks.sh:

# import these (if they haven't already been imported at an earlier stage, 
# but I guess they're just function definitions anyway so doesn't matter too much)
. common_hadoop.sh
. common_spark.sh
# etc.

benchmark_init () {
    if [[ $use_hadoop = 1 ]]; then
        init_hadoop # defined in common_hadoop.sh
    fi

    if [[ $use_spark = 1 ]]; then
        init_spark # defined in common_spark.sh
    fi

    # etc.
}

Contributor

@npoggi npoggi left a comment


Nice to see you are using the reviews... I have been following them, btw...
Ok, so this is something that needs to be updated at some point, but from the current code it doesn't look like PaaS (or SaaS) deployments are taken into account. Also not sure if some calls would be made more than once (i.e., set_java_requires when using Derby and Hadoop).

@alejandromontero
Contributor Author

alejandromontero commented Sep 15, 2017

Basically your proposal is exactly what I'm doing here. A common function in common_benchmarks initialises the engines; that initialisation is a few steps: sourcing common_$engine, preparing configuration, preparing variables and starting services. All these steps are engine-specific functions located in the common_$engine file.

In essence, common_benchmarks function "benchmark_suite_start_config" does exactly what you say, check whether to use an engine and initialise it.

Nico, the workflow is mostly the same as it was before; even in PaaS and SaaS the prepare functions, the download of software and the starting of services are called. It is in those functions that we test whether we are in PaaS and, if so, do nothing. As an example, we always call "start_hadoop" in common_hadoop.sh and it has the following code:

restart_hadoop(){ if [ "$clusterType" != "PaaS" ]; then

Another example: we always try to download the software by calling the "requires" function, even in PaaS; those functions always do something like:

set_hadoop_requires() { if [ "$clusterType" != "PaaS" ]; then

Both in the old and the new version we always go through source common_$engine.sh - set_$engine_requires - prepare_$engine_config - start_$engine_services. It is in those functions that we check whether the code needs to run or not.

As the benchmarks stand right now, yes, sourcing of common_$engine could happen twice: Hive and Derby necessarily need Hadoop and Java, so if a benchmark uses those engines and is not yet ported to this new system, Hive needs to import Hadoop to maintain legacy functionality. Once all benchmarks are ported, sourcing, preparing and starting services should only occur once.
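The guard pattern described above can be sketched as follows; clusterType and the function names follow the snippets quoted in this thread, but the bodies are illustrative placeholders:

```shell
#!/usr/bin/env bash
# Every engine function is always called; each one decides internally
# whether there is any work to do for the current deployment type.

set_hadoop_requires() {
  # On PaaS the provider supplies Hadoop, so there is nothing to download
  if [ "$clusterType" != "PaaS" ]; then
    echo "downloading Hadoop"
  fi
}

start_hadoop() {
  # Only actually starts services outside PaaS
  if [ "$clusterType" != "PaaS" ]; then
    echo "starting Hadoop services"
  fi
}

clusterType="PaaS"
set_hadoop_requires   # prints nothing on PaaS
start_hadoop          # prints nothing on PaaS
```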

set_java_requires

if [ ! -z "$use_hadoop" ]; then
if [ "$use_hadoop" ]; then
Contributor


That doesn't do anything to fix the case where $use_hadoop is set but has a value like 0 or false. It's just a more succinct way of writing what you already had. I was thinking that maybe a simple 1 or 0 would be sufficient, then we explicitly test against those values: [[ $use_hadoop = 1 ]]

@@ -1,3 +1,5 @@
source_file "$ALOJA_REPO_PATH/shell/common/common_java.sh"
set_java_requires
Contributor


Aren't these just the lines that you removed in the first commit? If so, just modify that commit and repush the branch - we don't need to follow the story of "remove, then bugfix (add them back)". You can just rewrite the history and pretend that it never happened.

@fenech
Contributor

fenech commented Sep 15, 2017

@alejandromontero I guess the first part of your comment was responding to me. In which case, no, it's not quite the same, as I was proposing a single function (hide the implementation details). common_benchmarks.sh doesn't need to know how to set up the environment, just that it needs setting up.

alejandromontero force-pushed the refactor/change-bench-config-behavior branch from 85d9274 to a7ca454 on September 18, 2017 08:41
source_file "$ALOJA_REPO_PATH/shell/common/common_java.sh"
init_java

if [ "$use_hadoop" ]; then
Contributor


As I said before, this just tests whether use_hadoop has been set to anything other than an empty string. Would rather see [[ $use_hadoop = 1 ]], for example. Same goes for the other variables.

alejandromontero force-pushed the refactor/change-bench-config-behavior branch from 53fe272 to 0d29ff0 on September 18, 2017 09:22
Contributor

@fenech fenech left a comment


One minor change left to do. The only other thing I'd say is to try and cut down the commit message to focus on the "what" rather than the "how". We can see the implementation details from the code, so the description of the commit could even just be as simple as "Move initialisation logic into central location", unless there are specific issues that really need to be mentioned.

[ ! "$HIVE_ENGINE" ] && HIVE_ENGINE="tez" # Available options are: tez (default, hadoop 2 only), mr (MapReduce)
[ ! "$HIVE_ML_FRAMEWORK" ] && HIVE_ML_FRAMEWORK="mahout" # Available options are: spark, spark-2, mahout(default)
[ ! "$BB_SERVER_DERBY" ] && BB_SERVER_DERBY="true" # Available options are: true(Server/Client deployment), false(embeded only)
[ ! "$BB_SERVER_DERBY" ] && BB_SERVER_DERBY=1 # Available options are: true(Server/Client deployment), false(embeded only)
Contributor


Comment should be updated to match the new behaviour. Do/did we ever check for false before? To be honest, this one might make more sense with descriptive strings, e.g. embedded instead of false (or change the variable name to something a bit more descriptive, like $BB_DERBY_USE_EMBEDDED).

....or just don't touch it at all in this commit!
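For reference, the `[ ! "$VAR" ] && VAR=...` idiom in the quoted lines is equivalent to bash's default-value parameter expansion; a small sketch using the variable name from the diff, not part of the actual change:

```shell
#!/usr/bin/env bash
# Form used in the diff: assign a default only when the variable is
# unset or empty.
unset BB_SERVER_DERBY
[ ! "$BB_SERVER_DERBY" ] && BB_SERVER_DERBY=1
echo "$BB_SERVER_DERBY"          # → 1

# Parameter-expansion equivalent of the same default assignment.
unset BB_SERVER_DERBY
: "${BB_SERVER_DERBY:=1}"
echo "$BB_SERVER_DERBY"          # → 1

# Either way, an existing value is left untouched.
BB_SERVER_DERBY=true
: "${BB_SERVER_DERBY:=1}"
echo "$BB_SERVER_DERBY"          # → true
```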

Contributor Author


Many ALOJA variables need a refactor; for instance ENGINE={hive,spark-sql} needs to be renamed. I propose doing a refactor of benchmark_defaults and changing all related variables in a single branch. We also need to discuss standardising the names and the way the variables are populated in different conditions; for instance, when using a control variable, should use_spark be set to 1 or to "true", etc.

Contributor


I'm not proposing that we do sweeping refactors through the whole code base to change these things, I'm just saying that if you change the code here, you should update the comment to match.

For now I'm happy with using 1 and 0 for the new variables (and making sure the tests check the value, rather than just whether the variable is set or not).

The thing about ENGINE seems completely unrelated (and I don't understand what the problem with the current names is).

Contributor

@fenech fenech left a comment


Is there any reason not to merge this into master now?


A new function in common_benchmarks allows any benchmark to be configured
in a centralized way without duplicating code. It uses per-engine variables
"use_$engine" ({hadoop,hive,spark}) to set the requirements to download,
prepare and start the necessary services.

The objective of this modification is to reduce benchmark definition
complexity, code duplication and code that unnecessarily runs more than
once, and to improve readability.

Some sourcing is still maintained to provide support for legacy benchmarks
until they are all ported to the new system.

Modified benchmark configuration conditionals to avoid incorrect uses.

BugFix: Hive and Derby only benchmarks such as hive-test and derby-test
that are not yet ported to the new configuration system need to source
Hadoop and Java in order to work. Eventually, once all benchmarks are
ported, these sources need to go away.

Added an init method in each engine that handles the initialization and
configuration.

Also, initialization of the monitoring tools is moved to
benchmark_suite_start_config as part of the benchmark configuration; this
avoids an issue where some monitoring tools were not correctly started.

Modified the way we check whether an engine is needed, using use_$engine=1.
This adds safety: the engine is not initialized if some garbage string is
assigned to the variable.

Solve package download and configuration sync
alejandromontero force-pushed the refactor/change-bench-config-behavior branch from 80d1909 to 03b60dc on September 18, 2017 15:15
@alejandromontero
Contributor Author

I detected a very nasty bug: before starting any service, the package itself and the configuration need to be downloaded and synced. In the previous approach, the sourcing of the script and the initialisation were done before downloading and syncing.

I subtly changed the flow of the initialisation; now it is virtually the same as it was before this new approach, and it maintains the simplicity of the benchmark definition as well. The problem: all benchmarks need to be changed to this new approach after merging.

3 participants