Many fixes, refactorings, and improvements by cheald · Pull Request #1 · dparis/pismo

cheald · 2012-04-07T21:27:59Z

First off, if this is too big of a monster commit, no worries; I'm writing this for myself, but wanted to offer the pull back to head if you're interested.

I'm working to refactor a lot of the "ugly" pieces, to DRY up the code, reduce garbage generation, and rewrite to use idiomatic Ruby where possible. Longer term-goals are improvement of the extraction algorithms (particularly re: keywords) and making the whole shebang faster. HTML entity decoding and substitution is fully centralized and comprehensive. I've refactored the foos(all = false) and foo methods to just foo and foos; foos always memoizes the full set of matches, and the singular form is now a convenience method. This is potentially a backwards-incompatible change, but is a) much faster (since everything is cached now), and b) is a lot less surprising, IMO.

As a bonus, all tests now pass (head has 1 failing test), and the image test network calls are stubbed, meaning that the test suite in general runs much faster.

* Rewrote huge parts of internal_document to be more DRY and produce less garbage * Integrated the htmlentities gem for generalized HTML entity decoding * Fixed HTML entity decoding so that it happens when content is extracted, rather than doing it on the source document, which can break parsing * Stubbed out the network calls in the test suite, resulting in dramatically faster tests * General garbage, speed, and style tweaks * Removed trailing whitespace from many files * Make the ImageExtractor logger customizable, or pass false for no logger * In the same vein, use default options and pass them along down to the various pieces of the parser

cheald · 2012-04-07T21:28:44Z

And I somehow missed that this fork isn't the upstream master. Sorry!

cheald · 2012-04-07T21:30:04Z

Can I somehow amend this pull request to notify @peterc too?

dparis · 2012-04-07T21:49:24Z

Hmmm... if you go into your fork and do another pull request, can you set Peter's upstream as the merge target? I think you can only have one pull request per branch, so you may need to create a new branch off your current topic branch to create a new pull request.

If that doesn't work, you may have to rebase your changes into a patch, fork his repo, apply the patches directly there and do a pull request targeting his repo directly.

If there's a more clever way to do this, it's beyond my current level of Git-Fu. Sorry my branch is just hanging around, I'm still waiting on @peterc do merge my pull request into upstream. Seemed like he was still active on it as of a month or so ago, maybe he just got swamped.

peterc · 2012-04-07T22:32:32Z

I'm still here. It's just this was an incredibly large commit with some major changes, especially in the testing area. I don't want to just click the button, merge, and end up with everyone having a borked library down the path. But likewise, I don't have the time to analyze and do any refactoring of my own right now.

If @dparis is happy with this pull request, I'd advise for now bringing it into this repo, so that I can then bring it all upstream in one hit later on. Alternatively, feel free to add it as a pull request on my repo, but I can't guarantee when I'll get to it.

dparis · 2012-04-07T23:35:00Z

Heh, I hadn't looked at the actual changes in @cheald's commit set yet. That is pretty daunting. 😄

As I mentioned in an issue comment on the upstream project page previously, the app I was using this gem in has migrated away from pismo in favor of an in-house solution, so I haven't been hacking on this code since then. I did give this pull request a very brief once-over and it looked pretty sane, but I can't say much beyond that. I probably won't have time to look this over thoroughly myself any time soon, so you'd probably be better off just applying this directly to upstream with a new pull request.

Best of luck, guys. Cheers!

cheald · 2012-04-08T00:38:01Z

Yeah, this is a huge commit, fully admitted. I just started hacking on it, and four hours later found myself with a gigantic diff. Like I said, I know it's not a very nice commit, but I figured I'd throw it out there in case you all were interested in it.

The biggest changes were lede/ledes, title/titles, etc, and the HTML entity/unicode processing. The image extractor got more or less a full rewrite so that it's more idiomatic Ruby, and it was whittled down to two public methods (whose names changed), so it's definitely full of backwards-incompatible changes. I'm hacking on this for in-house purposes, where I have freedom to break interfaces, but I obviously recognize that it's not necessarily desirable in the general case. There will be exactly zero hurt feelings if you don't want to upstream these changes. :)

cheald · 2012-04-08T00:39:42Z

It's also worth mentioning that a huge number of the touched lines were due to my editor stripping trailing whitespace; sorry, that does clutter the commit a bit for review purposes.

The metadata_expected.yaml includes datetimes with a +01:00 time zone. Which means the test will fail for anybody running the test in a different time zone. Explicitly setting the time zone to UTC in the helper solves this problem.

Such as periods after an abbreviated month or day of week and commas before the year.

Support punctuation in dates.

Support extracting tags from documents.

New Pismo::Document#tags method.

Merged your stuff in with the latest from Peterc

cheald added 2 commits March 17, 2012 13:13

Account for images that don't have a src. Remove trailing whitespace.

e7dd0da

cheald closed this Apr 7, 2012

cheald reopened this Apr 7, 2012

Adam Crownoble and others added 12 commits August 14, 2012 15:48

Set explicit timezone during testing.

b08ce56

The metadata_expected.yaml includes datetimes with a +01:00 time zone. Which means the test will fail for anybody running the test in a different time zone. Explicitly setting the time zone to UTC in the helper solves this problem.

Support punctuation in dates.

30d64a3

Such as periods after an abbreviated month or day of week and commas before the year.

Merge pull request peterc#18 from adamcrown/datetimes_with_punctuation

e4a028e

Support punctuation in dates.

Resolving

380534a

Fixes made to test fixtures - my bad\!

b62a537

New Pismo::Document#tags method.

962c029

Support extracting tags from documents.

Merge pull request peterc#19 from adamcrown/tags_attribute

77fe57e

New Pismo::Document#tags method.

add rake to gemspec

da2c4c5

merge peterc and fix tests

88527c2

Merge pull request #1 from bborn/master

24cad1c

Merged your stuff in with the latest from Peterc

Fix unicode deconversion in Ruby 1.9

32a5e7e

Merge branch 'master' of github.com:cheald/pismo

deed946

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

Many fixes, refactorings, and improvements#1

Many fixes, refactorings, and improvements#1
cheald wants to merge 14 commits intodparis:masterfrom
cheald:master

cheald commented Apr 7, 2012

Uh oh!

cheald commented Apr 7, 2012

Uh oh!

cheald commented Apr 7, 2012

Uh oh!

dparis commented Apr 7, 2012

Uh oh!

peterc commented Apr 7, 2012

Uh oh!

dparis commented Apr 7, 2012

Uh oh!

cheald commented Apr 8, 2012

Uh oh!

cheald commented Apr 8, 2012

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Comments

Conversation

cheald commented Apr 7, 2012

Uh oh!

cheald commented Apr 7, 2012

Uh oh!

cheald commented Apr 7, 2012

Uh oh!

dparis commented Apr 7, 2012

Uh oh!

peterc commented Apr 7, 2012

Uh oh!

dparis commented Apr 7, 2012

Uh oh!

cheald commented Apr 8, 2012

Uh oh!

cheald commented Apr 8, 2012

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants