Skip to content

Comments

Many fixes, refactorings, and improvements#1

Open
cheald wants to merge 14 commits intodparis:masterfrom
cheald:master
Open

Many fixes, refactorings, and improvements#1
cheald wants to merge 14 commits intodparis:masterfrom
cheald:master

Conversation

@cheald
Copy link

@cheald cheald commented Apr 7, 2012

First off, if this is too big of a monster commit, no worries; I'm writing this for myself, but wanted to offer the pull back to head if you're interested.

I'm working to refactor a lot of the "ugly" pieces, to DRY up the code, reduce garbage generation, and rewrite to use idiomatic Ruby where possible. Longer term-goals are improvement of the extraction algorithms (particularly re: keywords) and making the whole shebang faster. HTML entity decoding and substitution is fully centralized and comprehensive. I've refactored the foos(all = false) and foo methods to just foo and foos; foos always memoizes the full set of matches, and the singular form is now a convenience method. This is potentially a backwards-incompatible change, but is a) much faster (since everything is cached now), and b) is a lot less surprising, IMO.

As a bonus, all tests now pass (head has 1 failing test), and the image test network calls are stubbed, meaning that the test suite in general runs much faster.

cheald added 2 commits March 17, 2012 13:13
* Rewrote huge parts of internal_document to be more DRY and produce less garbage
* Integrated the htmlentities gem for generalized HTML entity decoding
* Fixed HTML entity decoding so that it happens when content is extracted, rather than doing it on the source document, which can break parsing
* Stubbed out the network calls in the test suite, resulting in dramatically faster tests
* General garbage, speed, and style tweaks
* Removed trailing whitespace from many files
* Make the ImageExtractor logger customizable, or pass false for no logger
* In the same vein, use default options and pass them along down to the various pieces of the parser
@cheald
Copy link
Author

cheald commented Apr 7, 2012

And I somehow missed that this fork isn't the upstream master. Sorry!

@cheald cheald closed this Apr 7, 2012
@cheald cheald reopened this Apr 7, 2012
@cheald
Copy link
Author

cheald commented Apr 7, 2012

Can I somehow amend this pull request to notify @peterc too?

@dparis
Copy link
Owner

dparis commented Apr 7, 2012

Hmmm... if you go into your fork and do another pull request, can you set Peter's upstream as the merge target? I think you can only have one pull request per branch, so you may need to create a new branch off your current topic branch to create a new pull request.

If that doesn't work, you may have to rebase your changes into a patch, fork his repo, apply the patches directly there and do a pull request targeting his repo directly.

If there's a more clever way to do this, it's beyond my current level of Git-Fu. Sorry my branch is just hanging around, I'm still waiting on @peterc do merge my pull request into upstream. Seemed like he was still active on it as of a month or so ago, maybe he just got swamped.

@peterc
Copy link

peterc commented Apr 7, 2012

I'm still here. It's just this was an incredibly large commit with some major changes, especially in the testing area. I don't want to just click the button, merge, and end up with everyone having a borked library down the path. But likewise, I don't have the time to analyze and do any refactoring of my own right now.

If @dparis is happy with this pull request, I'd advise for now bringing it into this repo, so that I can then bring it all upstream in one hit later on. Alternatively, feel free to add it as a pull request on my repo, but I can't guarantee when I'll get to it.

@dparis
Copy link
Owner

dparis commented Apr 7, 2012

Heh, I hadn't looked at the actual changes in @cheald's commit set yet. That is pretty daunting. 😄

As I mentioned in an issue comment on the upstream project page previously, the app I was using this gem in has migrated away from pismo in favor of an in-house solution, so I haven't been hacking on this code since then. I did give this pull request a very brief once-over and it looked pretty sane, but I can't say much beyond that. I probably won't have time to look this over thoroughly myself any time soon, so you'd probably be better off just applying this directly to upstream with a new pull request.

Best of luck, guys. Cheers!

@cheald
Copy link
Author

cheald commented Apr 8, 2012

Yeah, this is a huge commit, fully admitted. I just started hacking on it, and four hours later found myself with a gigantic diff. Like I said, I know it's not a very nice commit, but I figured I'd throw it out there in case you all were interested in it.

The biggest changes were lede/ledes, title/titles, etc, and the HTML entity/unicode processing. The image extractor got more or less a full rewrite so that it's more idiomatic Ruby, and it was whittled down to two public methods (whose names changed), so it's definitely full of backwards-incompatible changes. I'm hacking on this for in-house purposes, where I have freedom to break interfaces, but I obviously recognize that it's not necessarily desirable in the general case. There will be exactly zero hurt feelings if you don't want to upstream these changes. :)

@cheald
Copy link
Author

cheald commented Apr 8, 2012

It's also worth mentioning that a huge number of the touched lines were due to my editor stripping trailing whitespace; sorry, that does clutter the commit a bit for review purposes.

Adam Crownoble and others added 12 commits August 14, 2012 15:48
The metadata_expected.yaml includes datetimes with a +01:00 time zone. Which
means the test will fail for anybody running the test in a different time zone.
Explicitly setting the time zone to UTC in the helper solves this problem.
Such as periods after an abbreviated month or day of week and commas before the year.
Support extracting tags from documents.
Merged your stuff in with the latest from Peterc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants