Many fixes, refactorings, and improvements#1
Conversation
* Rewrote huge parts of internal_document to be more DRY and produce less garbage * Integrated the htmlentities gem for generalized HTML entity decoding * Fixed HTML entity decoding so that it happens when content is extracted, rather than doing it on the source document, which can break parsing * Stubbed out the network calls in the test suite, resulting in dramatically faster tests * General garbage, speed, and style tweaks * Removed trailing whitespace from many files * Make the ImageExtractor logger customizable, or pass false for no logger * In the same vein, use default options and pass them along down to the various pieces of the parser
|
And I somehow missed that this fork isn't the upstream master. Sorry! |
|
Can I somehow amend this pull request to notify @peterc too? |
|
Hmmm... if you go into your fork and do another pull request, can you set Peter's upstream as the merge target? I think you can only have one pull request per branch, so you may need to create a new branch off your current topic branch to create a new pull request. If that doesn't work, you may have to rebase your changes into a patch, fork his repo, apply the patches directly there and do a pull request targeting his repo directly. If there's a more clever way to do this, it's beyond my current level of Git-Fu. Sorry my branch is just hanging around, I'm still waiting on @peterc do merge my pull request into upstream. Seemed like he was still active on it as of a month or so ago, maybe he just got swamped. |
|
I'm still here. It's just this was an incredibly large commit with some major changes, especially in the testing area. I don't want to just click the button, merge, and end up with everyone having a borked library down the path. But likewise, I don't have the time to analyze and do any refactoring of my own right now. If @dparis is happy with this pull request, I'd advise for now bringing it into this repo, so that I can then bring it all upstream in one hit later on. Alternatively, feel free to add it as a pull request on my repo, but I can't guarantee when I'll get to it. |
|
Heh, I hadn't looked at the actual changes in @cheald's commit set yet. That is pretty daunting. 😄 As I mentioned in an issue comment on the upstream project page previously, the app I was using this gem in has migrated away from pismo in favor of an in-house solution, so I haven't been hacking on this code since then. I did give this pull request a very brief once-over and it looked pretty sane, but I can't say much beyond that. I probably won't have time to look this over thoroughly myself any time soon, so you'd probably be better off just applying this directly to upstream with a new pull request. Best of luck, guys. Cheers! |
|
Yeah, this is a huge commit, fully admitted. I just started hacking on it, and four hours later found myself with a gigantic diff. Like I said, I know it's not a very nice commit, but I figured I'd throw it out there in case you all were interested in it. The biggest changes were lede/ledes, title/titles, etc, and the HTML entity/unicode processing. The image extractor got more or less a full rewrite so that it's more idiomatic Ruby, and it was whittled down to two public methods (whose names changed), so it's definitely full of backwards-incompatible changes. I'm hacking on this for in-house purposes, where I have freedom to break interfaces, but I obviously recognize that it's not necessarily desirable in the general case. There will be exactly zero hurt feelings if you don't want to upstream these changes. :) |
|
It's also worth mentioning that a huge number of the touched lines were due to my editor stripping trailing whitespace; sorry, that does clutter the commit a bit for review purposes. |
The metadata_expected.yaml includes datetimes with a +01:00 time zone. Which means the test will fail for anybody running the test in a different time zone. Explicitly setting the time zone to UTC in the helper solves this problem.
Such as periods after an abbreviated month or day of week and commas before the year.
Support punctuation in dates.
Support extracting tags from documents.
New Pismo::Document#tags method.
Merged your stuff in with the latest from Peterc
First off, if this is too big of a monster commit, no worries; I'm writing this for myself, but wanted to offer the pull back to head if you're interested.
I'm working to refactor a lot of the "ugly" pieces, to DRY up the code, reduce garbage generation, and rewrite to use idiomatic Ruby where possible. Longer term-goals are improvement of the extraction algorithms (particularly re: keywords) and making the whole shebang faster. HTML entity decoding and substitution is fully centralized and comprehensive. I've refactored the
foos(all = false)andfoomethods to justfooandfoos;foosalways memoizes the full set of matches, and the singular form is now a convenience method. This is potentially a backwards-incompatible change, but is a) much faster (since everything is cached now), and b) is a lot less surprising, IMO.As a bonus, all tests now pass (head has 1 failing test), and the image test network calls are stubbed, meaning that the test suite in general runs much faster.