Added simple implementation of code to skip URLs already processed #71

robintw · 2016-04-02T20:24:37Z

This is a simple implementation of a feature to skip URLs that have already been processed (Issue #70). It is relatively naive, but should be useful.

It adds a new command-line option (-k or --skipexisting). When enabled, quickscrape checks whether the output folder it would use for a URL already exists and, if so, skips that URL. It also skips the rate-limiting for that URL (there is no need to rate-limit when nothing was downloaded) and reinstates it the next time it actually downloads a URL.
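For anyone who wants the idea without reading the diff, here is a rough sketch of the logic described above. It is not the code in this PR: `outputDirFor`, `shouldSkip`, `processUrls`, the `opts` fields and the folder-naming scheme are all hypothetical names made up for illustration, and the caller is assumed to pass in its own `scrape` function.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical folder-naming scheme: drop the protocol and replace anything
// that is not safe in a file name. quickscrape's real scheme may differ.
function outputDirFor(baseDir, url) {
  const name = url.replace(/^https?:\/\//, '').replace(/[^A-Za-z0-9._-]/g, '_');
  return path.join(baseDir, name);
}

// Skip a URL only when the option is on AND its output folder already exists.
function shouldSkip(baseDir, url, skipExisting) {
  return Boolean(skipExisting) && fs.existsSync(outputDirFor(baseDir, url));
}

// Process URLs in order. Skipped URLs cost no delay; the delay is applied
// again before the next URL that is actually downloaded.
async function processUrls(urls, opts, scrape) {
  const skipped = [];
  let downloadedBefore = false;
  for (const url of urls) {
    if (shouldSkip(opts.outputDir, url, opts.skipExisting)) {
      skipped.push(url);
      continue; // nothing downloaded, so no rate-limit wait
    }
    if (downloadedBefore) {
      // Rate-limit between real downloads only.
      await new Promise((resolve) => setTimeout(resolve, opts.delayMs));
    }
    await scrape(url, outputDirFor(opts.outputDir, url));
    downloadedBefore = true;
  }
  return skipped;
}
```

The check is deliberately naive: it only asks whether the folder exists, not whether the earlier run finished successfully.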

This is my first PR written in JavaScript, so I may have done some completely stupid things! Feedback would be greatly appreciated.

coveralls · 2016-04-02T20:26:50Z

Coverage remained the same at 56.0% when pulling fea3075 on robintw:skip-if-exists into 19cefd9 on ContentMine:master.

petermr · 2016-04-02T21:33:36Z

Thanks - a good idea.

robintw · 2016-04-15T19:42:01Z

Is there any progress on merging this in? If you'd like me to add any tests or anything then let me know.

tarrow · 2016-04-22T12:22:01Z

We're waiting on @blahah to have a look before it gets merged.

I'm also pretty new to JavaScript, so take everything I say with a pinch of salt, but I wanted to have a look to see if I could encourage things along. It looks good to me: I tested it and it does what it says on the tin. I can follow the code and can't see anything odd.

Obviously, in an ideal world everything would be tested, but as you'll see in the tests folder there isn't really much testing going on, so I don't see it as a reason not to merge. If you do want to write a test for it then we certainly wouldn't mind ;)

👍

tarrow · 2016-08-26T10:09:28Z

This is obviously super old, but we are looking for functionality like this at the moment. The issue is that the directory may already exist from getpapers even though we do not yet have a quickscrape results.json in it.

I think we might want to resurrect this soon with an additional check for the results.json file.
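Something like the following might be enough for the extra check. This is only a sketch, assuming quickscrape writes a file named `results.json` directly inside each output folder; `alreadyScraped` is a hypothetical helper name, not something that exists in the codebase.

```javascript
const fs = require('fs');
const path = require('path');

// Treat a URL as processed only if its folder contains a results.json.
// A folder created by getpapers (without results.json) is not skipped.
function alreadyScraped(dir) {
  return fs.existsSync(path.join(dir, 'results.json'));
}
```

This would still skip a folder whose results.json is empty or truncated from an interrupted run, so a stricter version might also check that the file parses as JSON.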

Added simple implementation of code to skip already existing folders

fea3075
