Skip to content

Conversation

@robintw
Copy link

@robintw robintw commented Apr 2, 2016

This is a simple implementation of a feature to skip URLs that have already been processed (Issue #70). It is relatively naive, but should be useful.

It adds a new command-line option (-k or --skipexisting) which, if enabled, means that quickscrape checks to see if the output folder it is going to use for a URL already exists, and if so then skips that URL. It will also skip the rate-limiting at that point (as we don't need to rate-limit if we haven't actually downloaded any URLs), and reinstate the rate-limiting next time it actually downloads a URL.

This is my first PR written in javascript, so I may have done some completely stupid things! Feedback would be greatly appreciated.

@coveralls
Copy link

Coverage Status

Coverage remained the same at 56.0% when pulling fea3075 on robintw:skip-if-exists into 19cefd9 on ContentMine:master.

@petermr
Copy link
Member

petermr commented Apr 2, 2016

Thanks - a good idea.

@robintw
Copy link
Author

robintw commented Apr 15, 2016

Is there any progress on merging this in? If you'd like me to add any tests or anything then let me know.

@tarrow
Copy link
Contributor

tarrow commented Apr 22, 2016

We're waiting on @blahah to have a look before it gets merged.

I'm also pretty new to javascript so take everything I say with a pinch of salt but I wanted to have a look to see if I could encourage things along: It looks good to me. I tested and it does what it says on the tin. I can follow the code and can't see anything odd.

Obviously in an ideal world everything would be tested; but as you'll see in the tests folder there isn't really much testing going on so I don't see it as a reason not to merge. If you do want to write a test for it then we certainly wouldn't mind ;)

👍

@tarrow
Copy link
Contributor

tarrow commented Aug 26, 2016

This is obviously super old; but we are looking for functionality like this at the moment. The issue is that the directory may already exist from getpapers but we may not yet have a quickscrape results.json.

I think we might want to resurrect this soon with an additional check for the results.json file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants