@dianecloud
Data generation using pure Python, HAWQ (with PL/Python), or MapReduce (streaming via Python)

Instructions are included for each of the three. The MapReduce version is at an early stage and is not currently recommended.

TODO:

  • location of locations_partitions.csv is hardcoded (fixed?)

  • come up with a realistic template so numbers aren't out of whack

  • script to calculate expected outputs based on profiles

  • for transactions, give the option to provide either a folder of all profiles to iterate through or just one json (automatic checking)

  • user input to generate config files

  • test output against profiles

  • add shell scripts to install python packages

  • add shell scripts to fix hard coding for HAWQ and MR

  • clean up HAWQ and MR code

  • add more/better data

  • improve performance of MapReduce

  • Spark streaming?

  • create_pickles doesn't run if the number of years doesn't match the profile inputs

  • work on making datasets repeatable via random seed

  • script to replace hashbang with `which python`

  • script to replace hard links
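
For the transactions item in the TODO list, the automatic check between "folder of profiles" and "single json" could look roughly like the sketch below. `resolve_profiles` is a hypothetical helper name, not something in this PR. It assumes profiles are files ending in `.json` and that a folder is scanned without recursing into subfolders.

```python
from pathlib import Path


def resolve_profiles(path):
    """Return a sorted list of profile JSON files for a file or folder path.

    Hypothetical helper: the name and the '.json' suffix rule are assumptions.
    """
    p = Path(path)
    if p.is_dir():
        # Folder given: collect every JSON profile directly inside it.
        return sorted(f for f in p.iterdir() if f.is_file() and f.suffix == ".json")
    if p.is_file() and p.suffix == ".json":
        # Single profile given: wrap it so callers always get a list.
        return [p]
    raise ValueError(f"{path!r} is neither a JSON file nor a folder of profiles")
```

Callers can then loop over the returned list in both cases, so the transaction generator would not need a separate flag for folder versus single-file input.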

@dianecloud
Author

That's cool
