A web scraper that operates on mass, grouped requests.
- Crunch gives you the flexibility to create your own custom retry and request handlers, while providing a default solution out of the box.
- Crunch pools provide functionality to control collections externally, e.g. to end all requests at once.
- A custom DOM traversal module is used by default; however, other modules can be used in its place (see the goquery sketch below).
```go
func onHTML(x req.Result) bool {
    doc := traverse.HTMLNodeToDoc(x.Document())
    // Do whatever
    fmt.Println(doc)
    return true
}

func main() {
    duration, _ := time.ParseDuration("2s")
    // No proxy at all
    c := crunch.NoProxy(
        []string{"..."},
        duration,
        onHTML,
    )
    crunch.Do(c, "Run", nil)
}
```
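Because the traversal layer is pluggable, the handler can hand the document to another library instead. As a minimal sketch, assuming x.Document() returns an *html.Node (which the traverse.HTMLNodeToDoc call above suggests), the same handler written against goquery could look like this (crunch's own import paths are omitted, as in the examples):

```go
import (
    "fmt"

    "github.com/PuerkitoBio/goquery"
)

// onHTML wraps the parsed document with goquery instead of the
// bundled traverse module.
func onHTML(x req.Result) bool {
    // goquery can build a document from an existing *html.Node.
    doc := goquery.NewDocumentFromNode(x.Document())
    doc.Find("a").Each(func(_ int, s *goquery.Selection) {
        if href, ok := s.Attr("href"); ok {
            fmt.Println(href)
        }
    })
    return true
}
```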
The handler is the same onHTML as in the first example.
```go
func main() {
    duration, _ := time.ParseDuration("2s")
    // With proxy/not
    c := crunch.ProxySetup(
        []string{"..."}, []string{"..."}, // Proxies / URLs
        nil, 5, // Headers / number of retries
        duration,
        onHTML,
    )
    crunch.Do(c, "Run", nil)
}
```
Collections can also be grouped into a pool, which allows them to be controlled externally:

```go
func main() {
    c := crunch.ProxySetup(...)

    pool := req.Pool{}
    pool.New("pool", req.PoolSettings{
        AllCollectionsCompleted: func(p req.PoolLook) {
            // ...
        },
    })
    pool.Add("new", c)
    pool.RunSession("new")
}
```
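As a slightly fuller sketch, here is the same session with placeholder proxy and URL values and a callback that only logs; the fields of req.PoolLook are not documented here, so nothing is read from p:

```go
func main() {
    duration, _ := time.ParseDuration("2s")
    c := crunch.ProxySetup(
        []string{"http://127.0.0.1:8080"}, // placeholder proxy
        []string{"https://example.com"},   // placeholder URL
        nil, 5,
        duration,
        onHTML,
    )

    pool := req.Pool{}
    pool.New("pool", req.PoolSettings{
        // Fires once every collection in the pool has finished.
        AllCollectionsCompleted: func(p req.PoolLook) {
            fmt.Println("all collections completed")
        },
    })
    pool.Add("new", c)
    pool.RunSession("new")
}
```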
This is an ongoing project. There will be bugs, but overall crunch carries out its default functionality. Some features are still missing, and further testing is required before it should be relied on.
crunch:
- Create a pool manager
- SOCKS compatibility
- Cookie implementation
- Queue system (with chan? see the sketch after these lists)
- Replace http.Client instances with http.Transport
- Batch requests
- Merge the request module into crunch?
- And more...
traverse:
- Create a proper parser
- Optimise the search functions & create a default search function
- Rework HTMLDocument and its respective functions
- And more...
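The queue item above is still an open question. Purely as illustration (none of these names exist in crunch yet), a chan-based queue drained by a few workers might look like this:

```go
package main

import (
    "fmt"
    "sync"
)

func main() {
    urls := make(chan string)
    var wg sync.WaitGroup

    // A few workers draining the queue concurrently.
    for i := 0; i < 3; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            for u := range urls {
                fmt.Printf("worker %d would fetch %s\n", id, u)
            }
        }(i)
    }

    for _, u := range []string{"https://example.com/a", "https://example.com/b"} {
        urls <- u
    }
    close(urls)
    wg.Wait()
}
```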
Benchmarks and tests still need to be provided.
crunch v0.1.0