Go Crawler

A web crawler package for traversing a website and concurrently scraping data from each webpage's Document as the crawl progresses.

package main

import (
    "log"

    "github.com/bwhite000/crawler"
)

var onPageLoadedChan = make(chan *crawler.PageData)

func onPageLoaded() {
    // Listen for each incoming webpage's data from the crawler.
    for data := range onPageLoadedChan {
        doc := data.Document

        // Use the included scraping helper methods.
        scraper := &crawler.Scraper{Document: doc}

        // Process the webpage's Document to scrape useful data.
        log.Println("Title:", scraper.GetAttr("meta[property='og:title']", "content"))
    }
}

func init() {
    go onPageLoaded()
}

func main() {
    log.Println("Beginning crawl...")

    // Initialize the crawler.
    crawlerObj := &crawler.Crawler{
        StartURL:             "https://example.com/photos/switzerland",
        OnPageLoadedListener: onPageLoadedChan,
    }

    // Begin the crawl.
    crawlerObj.Begin()
}

Installation

In a terminal, run:

go get -u github.com/bwhite000/crawler

Scraper Methods

Create a Scraper by providing it with a goquery Document pointer at instantiation. The following methods can then be called against that Document.

// Scraper is a tool to help with scraping data.
type Scraper struct {
    Document *goquery.Document
}

Exists(selector string) bool

Checks if the selector matches an Element in the Document.

if scraper.Exists("[itemtype='http://schema.org/Product']") {
    // ...
}

Float(selector string) float64

Gets the text content from the matched Element, then parses a float from the string.

percentage := scraper.Float("#percentage-box")

GetAttr(selector string, attrName string) string

Gets the attribute value from the matched Element.

title := scraper.GetAttr("meta[property='og:title']", "content")

Html(selector string) string

Gets the inner HTML of the matched Element.

divHTML := scraper.Html("div.elm-with-text")

Int(selector string) int

Gets the text content from the matched Element, then parses an integer from the string.

year := scraper.Int("div.year-container")

Text(selector string) string

Gets the text from the matched Element.

bank := scraper.Text("#bank-title-elm")

ToFloat(input string) float64

Parses a float value from the provided string. Non-numeric characters on either side of the expected float are allowed.

stockPrice := scraper.ToFloat(scraper.GetAttr("meta[property='og:stock']", "content"))
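
Because surrounding non-numeric characters are allowed, a hypothetical input such as "USD $12.99" should parse to 12.99:

price := scraper.ToFloat("USD $12.99") // 12.99, assuming the documented parsing behavior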

Crawler Methods

// Crawler crawls a website until the specified number of webpages have been crawled.
type Crawler struct {
    MaxFetches           int
    RequestSpaceMs       int
    IgnoreQueryParams    bool
    StartURL             string
    OnPageLoadedListener chan<- *PageData
}

Properties

MaxFetches int

The maximum number of pages to crawl before completing and exiting the crawl. Useful for testing with a low value, or for setting an upper limit to prevent very deep crawls.

RequestSpaceMs int

The amount of time, in milliseconds, to wait between page fetches. This can be used to avoid rate limiting or throttling by the website being crawled.

IgnoreQueryParams bool

Specifies whether the crawler should treat URLs that differ only in their query strings as distinct pages. The default value is true, meaning query string parameters are ignored. Some websites link to the same webpage multiple times with query string parameters that do not affect the page's contents, which would otherwise result in rescraping the same page repeatedly.

StartURL string

The URL at which to begin the crawl.

OnPageLoadedListener chan<- *PageData

The channel that PageData structs for fetched pages are pushed to. Listen on this channel in a loop to process crawled webpages, as in the configuration sketch below.
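
A minimal configuration sketch combining these properties (the values shown are illustrative, not library defaults):

crawlerObj := &crawler.Crawler{
    StartURL:             "https://example.com/photos/switzerland",
    MaxFetches:           100,  // stop after 100 pages
    RequestSpaceMs:       500,  // wait 500 ms between fetches
    IgnoreQueryParams:    true, // treat URLs differing only by query string as the same page
    OnPageLoadedListener: onPageLoadedChan,
}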

Methods

Begin()

Starts the crawl process using the specified values.

WasIndexed(url string) bool

Reports whether the provided URL has already been indexed by this crawl.
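
A brief usage sketch, checking a URL before doing any extra work on it:

if !crawlerObj.WasIndexed("https://example.com/about") {
    log.Println("This page has not been crawled yet.")
}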

PageData Structure

// PageData contains the details about the page that has been crawled.
type PageData struct {
    URL      string
    Document *goquery.Document
}

URL string

The URL of the webpage that this struct represents.

Document *goquery.Document

The goquery Document representing the webpage's parsed DOM tree.
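
Both fields are available in the page-loaded listener; for example, logging the URL alongside the scrape:

data := <-onPageLoadedChan
log.Println("Crawled:", data.URL)
scraper := &crawler.Scraper{Document: data.Document}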

Dependencies

github.com/PuerkitoBio/goquery (used to parse and query each fetched webpage's DOM)
