From f1fa73f00e814fd13c160425f4eb5bef9e435892 Mon Sep 17 00:00:00 2001
From: Toby Thain
Date: Sun, 28 Apr 2013 13:26:55 -0400
Subject: [PATCH 1/3] Fix typos, wrap lines.

---
 README.md | 47 +++++++++++++++++++++++++++++++++--------------
 1 file changed, 33 insertions(+), 14 deletions(-)

diff --git a/README.md b/README.md
index 47af399be..b7c6c9e78 100644
--- a/README.md
+++ b/README.md
@@ -1,41 +1,51 @@
 #Goose - Article Extractor
-##Intro
+##Intro
-Goose was originally an article extractor written in Java that has most recently (aug2011) converted to a scala project. It's mission is to take any news article or article type web page and not only extract what is the main body of the article but also all meta data and most probable image candidate.
+Goose was originally an article extractor written in Java that has been
+converted to a scala project. Its mission is to take a news article
+or article-type web page and extract the main body of the article, all
+metadata, and most probable image candidate.
-The extraction goal is to try and get the purest extraction from the beginning of the article for servicing flipboard/pulse type applications that need to show the first snippet of a web article along with an image.
+The extraction goal is the purest extraction from the beginning of the
+article for servicing flipboard/pulse type applications that need to
+show the first snippet of a web article along with an image.
 Goose will try to extract the following information:
  - Main text of an article
  - Main image of article
- - Any Youtube/Vimeo movies embedded in article
+ - Any YouTube/Vimeo movies embedded in article
  - Meta Description
  - Meta tags
  - Publish Date
-The wiki has the full details on how to use Goose [https://github.com/jiminoc/goose/wiki](https://github.com/jiminoc/goose/wiki)
+The wiki has the full details on how to use Goose
+[https://github.com/jiminoc/goose/wiki](https://github.com/jiminoc/goose/wiki)
 Goose was open sourced by Gravity.com in 2011
 Lead Programmer: Jim Plush (Gravity.com)
-Contributers: Robbie Coleman (Gravity.com)
+Contributors: Robbie Coleman (Gravity.com)
-Try it out online!
-http://jimplush.com/blog/goose
+[Try it out online!](http://jimplush.com/blog/goose)
 ##Licensing
-If you find Goose useful or have issues please drop me a line, I'd love to hear how you're using it or what features should be improved
-Goose is licensed by Gravity.com under the Apache 2.0 license, see the LICENSE file for more details
+If you find Goose useful or have issues, please drop me a line, I'd love
+to hear how you're using it or what features should be improved.
+
+Goose is licensed by Gravity.com under the Apache 2.0 license, see the
+LICENSE file for more details.
+
 ##Take it for a spin
+
 To use goose from the command line:
     cd into the goose directory
@@ -47,12 +57,21 @@ To use goose from the command line:
 Here are some of the reasons for the port to Scala:
- - Gravity has moved more towards Scala development internally so maintenance started to become an issue
+ - Gravity has moved more towards Scala development internally so
+   maintenance started to become an issue
  - There wasn't enough contribution to warrant keeping it in Java
- - The packages were all namespaced under a person's name and not the company's name
+ - The packages were all namespaced under a person's name and not the
+   company's name
  - Scala is more fun
 ##Issues
-It was a pretty fast Java to Scala port so lots of the nicities of the Scala language aren't in the codebase yet, but those will come over the coming months as we re-write alot of the internal methods to be more Scalesque.
-We made sure it was still nice and operable from Java as well so if you're using goose from java you still should be able to use it with a few changes to the method signatures.
\ No newline at end of file
+
+The Java to Scala port was done quickly, so many niceties of the
+Scala language aren't in the codebase yet, but those will come over the
+coming months as we re-write alot of the internal methods to be more
+Scala-esque.
+
+We made sure it was still nice and operable from Java as well, so you
+should still be able to use goose from java with a few changes to the
+method signatures.

From b1aad5708ad783104199e1da7e32c179e7f322c1 Mon Sep 17 00:00:00 2001
From: Toby Thain
Date: Sun, 28 Apr 2013 13:47:05 -0400
Subject: [PATCH 2/3] Minor typos.

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index b7c6c9e78..08a92d4c6 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@
 ##Intro
 Goose was originally an article extractor written in Java that has been
-converted to a scala project. Its mission is to take a news article
+converted to a Scala project. Its mission is to take a news article
 or article-type web page and extract the main body of the article, all
 metadata, and most probable image candidate.
@@ -53,7 +53,7 @@ To use goose from the command line:
     MAVEN_OPTS="-Xms256m -Xmx2000m"; mvn exec:java -Dexec.mainClass=com.gravity.goose.TalkToMeGoose -Dexec.args="http://techcrunch.com/2011/05/13/native-apps-or-web-apps-particle-code-wants-you-to-do-both/" -e -q > ~/Desktop/gooseresult.txt
-##Regarding the port from JAVA to Scala
+##Regarding the port from Java to Scala
 Here are some of the reasons for the port to Scala:

From c889f7b1dc008e6f775763ece1d4c658c0eefb33 Mon Sep 17 00:00:00 2001
From: Toby Thain
Date: Sun, 28 Apr 2013 14:52:29 -0400
Subject: [PATCH 3/3] Add basic SBT build configuration.

---
 build.sbt | 26 ++++++++++++++
 .../com/gravity/goose/TalkToMeGoose.scala | 36 +++++++++++--------
 2 files changed, 47 insertions(+), 15 deletions(-)
 create mode 100644 build.sbt

diff --git a/build.sbt b/build.sbt
new file mode 100644
index 000000000..0e1bdf902
--- /dev/null
+++ b/build.sbt
@@ -0,0 +1,26 @@
+name := "Goose"
+
+version := "2.1.22"
+
+organization := "GravityLabs"
+
+organizationHomepage := Some(url("http://gravity.com/"))
+
+homepage := Some(url("https://github.com/GravityLabs/goose"))
+
+description := "Extracts text, metadata, and key image from web articles."
+
+licenses += "Apache2" -> url("http://www.apache.org/licenses/")
+
+// scalacOptions ++= Seq("-unchecked", "-deprecation")
+
+libraryDependencies ++= Seq(
+  "junit" % "junit" % "4.8.1" % "test",
+  "org.slf4j" % "slf4j-api" % "1.6.1" % "compile",
+  "org.slf4j" % "slf4j-log4j12" % "1.6.1" % "test",
+  "org.slf4j" % "slf4j-simple" % "1.6.1",
+  "org.jsoup" % "jsoup" % "1.5.2",
+  "commons-io" % "commons-io" % "2.0.1",
+  "org.apache.httpcomponents" % "httpclient" % "4.1.2",
+  "commons-lang" % "commons-lang" % "2.6"
+)
diff --git a/src/main/scala/com/gravity/goose/TalkToMeGoose.scala b/src/main/scala/com/gravity/goose/TalkToMeGoose.scala
index fba111b88..e4351c99c 100644
--- a/src/main/scala/com/gravity/goose/TalkToMeGoose.scala
+++ b/src/main/scala/com/gravity/goose/TalkToMeGoose.scala
@@ -7,21 +7,27 @@ package com.gravity.goose
  */
 object TalkToMeGoose {
   /**
-  * you can use this method if you want to run goose from the command line to extract html from a bashscript
-  * or to just test it's functionality
-  * you can run it like so
-  * cd into the goose root
-  * mvn compile
-  * MAVEN_OPTS="-Xms256m -Xmx2000m"; mvn exec:java -Dexec.mainClass=com.gravity.goose.TalkToMeGoose -Dexec.args="http://techcrunch.com/2011/05/13/native-apps-or-web-apps-particle-code-wants-you-to-do-both/" -e -q > ~/Desktop/gooseresult.txt
-  *
-  * Some top gun love:
-  * Officer: [in the midst of the MIG battle] Both Catapults are broken, sir.
-  * Stinger: How long will it take?
-  * Officer: It'll take ten minutes.
-  * Stinger: Bullshit ten minutes! This thing will be over in two minutes! Get on it!
-  *
-  * @param args
-  */
+   * You can use this method to run goose from the command line
+   * to extract html from a bash script, or to just test its functionality:
+   *
+   * cd into the goose root
+   * mvn compile
+   * MAVEN_OPTS="-Xms256m -Xmx2000m"; mvn exec:java -Dexec.mainClass=com.gravity.goose.TalkToMeGoose -Dexec.args="http://techcrunch.com/2011/05/13/native-apps-or-web-apps-particle-code-wants-you-to-do-both/" -e -q > ~/Desktop/gooseresult.txt
+   *
+   * or if using sbt:
+   *
+   * cd into the goose root
+   * sbt
+   * > run http://www.thestar.com/news/insight/2013/04/26/spotting_tiny_gnatcatcher_can_put_a_spring_in_your_step.html
+   *
+   * Some top gun love:
+   * Officer: [in the midst of the MIG battle] Both Catapults are broken, sir.
+   * Stinger: How long will it take?
+   * Officer: It'll take ten minutes.
+   * Stinger: Bullshit ten minutes! This thing will be over in two minutes! Get on it!
+   *
+   * @param args
+   */
   def main(args: Array[String]) {
     try {
       val url: String = args(0)