Data generation using pure Python, HAWQ (with PL/Python), or MapReduce (streaming via Python).
Instructions are included for each of the three approaches. The MapReduce version is in an early stage and is not currently recommended.
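Since the MapReduce path uses Hadoop Streaming with Python, a minimal sketch of what a streaming mapper looks like is shown below; the record layout and the idea of one profile id per input line are illustrative assumptions, not the repo's actual scripts.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming mapper sketch: reads one profile id per stdin line
# and emits tab-separated key/value records on stdout. The record layout is a
# hypothetical placeholder, not this repo's actual schema.
import sys
import random


def emit_records(profile_id, n=10):
    """Generate n fake transaction rows for a single profile id."""
    for _ in range(n):
        amount = round(random.uniform(1.0, 500.0), 2)
        # key<TAB>value is the contract Hadoop Streaming expects
        print("%s\t%.2f" % (profile_id, amount))


if __name__ == "__main__":
    for line in sys.stdin:
        profile_id = line.strip()
        if profile_id:
            emit_records(profile_id)
```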
TODO:
location of locations_partitions.csv is hardcoded (fixed?)
come up with a realistic template so numbers aren't out of whack
script to calculate expected outputs based on profiles
for transactions, give the option to provide either a folder of profiles to iterate through or a single JSON file (with automatic detection of which was passed; see the sketch after this list)
user input to generate config files
test output against profiles
add shell scripts to install python packages
add shell scripts to fix hard-coded values for HAWQ and MR
clean up HAWQ and MR code
add more/better data
improve performance of MapReduce
Spark streaming?
create_pickles doesn't run if the number of years doesn't match the profile inputs
work on making datasets repeatable via a random seed (see the seeding sketch after this list)
script to replace the hashbang with the output of `which python` (see the sketch after this list)
script to replace hard links
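For the "folder of profiles or a single JSON" item, a minimal sketch of the automatic check might look like the following; the `load_profiles` name and the `*.json` filtering are assumptions for illustration, not existing code in this repo.

```python
import glob
import json
import os


def load_profiles(path):
    """Accept either a directory of profile JSON files or a single JSON file."""
    if os.path.isdir(path):
        # Directory: iterate through every profile JSON it contains.
        files = sorted(glob.glob(os.path.join(path, "*.json")))
    else:
        # Single file: wrap it in a list so the caller's loop is the same.
        files = [path]
    profiles = []
    for fname in files:
        with open(fname) as fh:
            profiles.append(json.load(fh))
    return profiles
```

For the repeatability item, the usual approach is to thread one seed through every random source the generators use; a minimal sketch assuming the generators rely on Python's `random` module:

```python
import random


def make_rng(seed):
    """Return an isolated Random instance so runs with the same seed
    produce the same dataset regardless of other callers of random."""
    return random.Random(seed)


rng = make_rng(42)
print([rng.randint(0, 100) for _ in range(5)])  # identical output on every run
```

For the hashbang item, a small Python sketch that rewrites the first line of a script with the interpreter found on PATH (equivalent to `which python`); the file name in the usage comment is hypothetical:

```python
import shutil
import sys


def fix_shebang(path):
    """Replace the shebang line of `path` with the python found on PATH."""
    python = shutil.which("python") or sys.executable
    with open(path) as fh:
        lines = fh.readlines()
    if lines and lines[0].startswith("#!"):
        lines[0] = "#!%s\n" % python
        with open(path, "w") as fh:
            fh.writelines(lines)


# usage: fix_shebang("generate_transactions.py")  # hypothetical file name
```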