@concretevitamin

  • Reads directly from the .txt file instead of first saving to .Rdata
    and then reading it back. Prototyped for Regression.
  • Even if the .Rdata step is desired, fread() performs much better
    than read.csv() for the initial read.

I've found this to be much more efficient for benchmarking (tested on an EC2 instance). If this approach looks good, I can make the corresponding changes for all queries.

@rytaft
Collaborator

rytaft commented Jul 28, 2015

Sorry this slipped through the cracks and I am only looking at this now. Thanks for submitting your code!

Regarding the changes to vanilla_R_benchmark.R, I tested on the 5000x5000 dataset, and load() on a binary file appears to be faster than fread() on a text file (6.5 seconds vs. 11.8 seconds). Under what conditions did you find fread() to be faster?
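
For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two read paths could be timed side by side. It is not the code from this PR: the 500x500 random matrix, the temporary file paths, and the variable names are all stand-ins chosen for illustration, and the outcome will likely depend on data size, disk speed, and the fact that save() compresses its output by default.

```r
library(data.table)

# Hypothetical stand-in for the benchmark data (much smaller than 5000x5000).
set.seed(1)
n <- 500
m <- as.data.frame(matrix(rnorm(n * n), nrow = n))

# Illustrative temporary files, not the paths used by the benchmark scripts.
txt_path <- tempfile(fileext = ".txt")
rdata_path <- tempfile(fileext = ".Rdata")

write.csv(m, txt_path, row.names = FALSE)  # text version
save(m, file = rdata_path)                 # binary version (compressed by default)

# Time fread() on the text file.
t_fread <- system.time(from_txt <- fread(txt_path))[["elapsed"]]

# Time load() on the binary file, into a fresh environment so that
# the original `m` is not overwritten.
env <- new.env()
t_load <- system.time(load(rdata_path, envir = env))[["elapsed"]]

cat("fread:", t_fread, "s; load:", t_load, "s\n")
```

Running each read several times and comparing the medians would give a steadier picture than a single measurement.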

Regarding the changes to generate_Rdata.R, fread() is certainly faster than read.csv(), but it seems to leave the data in a format that doesn't work with the code in vanilla_R_benchmark.R. I haven't done much debugging, but if you have any ideas I'd definitely appreciate them!
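
One possible cause, offered as an assumption rather than a diagnosis of this repository: fread() returns a data.table by default, and column indexing such as `x[, 1]` behaves differently on a data.table than on a data.frame. The sketch below uses a small made-up CSV (the file and column names are hypothetical) to show two ways of getting back to the structures that read.csv()-based code usually expects.

```r
library(data.table)

# Tiny made-up CSV standing in for one of the benchmark inputs.
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = c(1, 2, 3), y = c(2, 4, 6)),
          csv_path, row.names = FALSE)

dt <- fread(csv_path)                      # data.table (the default)
df <- fread(csv_path, data.table = FALSE)  # plain data.frame, as read.csv() gives

print(class(dt))
print(class(df))

# On the data.frame, [, 1] yields a plain vector, as base R code expects.
col_from_df <- df[, 1]

# Converting an existing data.table is another option.
df2 <- as.data.frame(dt)
mat <- as.matrix(dt)  # useful if downstream code needs a numeric matrix
```

If the downstream code only needs a data.frame, passing `data.table = FALSE` keeps the speed of fread() without changing the object type it hands on.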
