A concurrent web worker written in Go (Golang) designed to crawl websites efficiently while respecting basic crawling policies. The worker stops automatically after crawling a specified number of links (default: 64).
- Concurrent crawling: Uses Go's goroutines for parallel processing of URLs (sketched after this list).
- Kill switch: Automatically stops after crawling `nlinks` links (configurable; default 64).
- Duplicate URL prevention: Tracks visited URLs to avoid reprocessing.
- HTML parsing: Extracts links using the `goquery` library.
- Simple CLI: Easy to use with minimal configuration.
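The crawl loop behind these features is not shown here, but a minimal sketch of one way to combine goroutines, a visited-URL set, and a kill switch looks like this (the `fetchLinks` helper and all names are illustrative assumptions, not the project's actual code):

```go
package main

import (
	"fmt"
	"sync"
)

// fetchLinks is a placeholder for the real page fetch + goquery link
// extraction step; it is assumed here for illustration only.
func fetchLinks(url string) []string {
	return nil
}

// crawl visits pages concurrently, skips URLs it has already seen, and
// stops once maxLinks pages have been processed (the kill switch).
func crawl(start string, maxLinks int) {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		visited = make(map[string]bool)
		count   int
	)

	var visit func(url string)
	visit = func(url string) {
		defer wg.Done()

		mu.Lock()
		if visited[url] || count >= maxLinks {
			mu.Unlock()
			return
		}
		visited[url] = true
		count++
		mu.Unlock()

		for _, next := range fetchLinks(url) {
			wg.Add(1)
			go visit(next) // one goroutine per discovered link
		}
	}

	wg.Add(1)
	go visit(start)
	wg.Wait()
	fmt.Printf("crawled %d pages\n", count)
}

func main() {
	crawl("https://prorobot.ai/hashtags", 64)
}
```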
Below are step-by-step instructions to test the web worker application.
- Go: Ensure Go is installed on your system.
- Dependencies: Install the required Go packages:

```
go get github.com/PuerkitoBio/goquery
go get github.com/mattn/go-sqlite3
```

Start the server by running the following command in the terminal:

```
go run main.go
```

The server will start on http://localhost:8080.
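For orientation, the endpoints exercised below could be wired up in `main.go` roughly as follows. This is a sketch using Go 1.22's `net/http` routing, and the handler bodies are placeholders, not the project's actual implementation:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

func main() {
	mux := http.NewServeMux()

	// Method-and-wildcard route patterns require Go 1.22+.
	mux.HandleFunc("POST /crawl", func(w http.ResponseWriter, r *http.Request) {
		// Decode the target URL and start a crawl job in the background.
		var req struct {
			URL string `json:"url"`
		}
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		// A real handler would create and launch the job here.
		json.NewEncoder(w).Encode(map[string]string{"job_id": "..."})
	})

	mux.HandleFunc("GET /jobs", func(w http.ResponseWriter, r *http.Request) {
		// Return all known jobs (active and completed).
	})
	mux.HandleFunc("GET /jobs/{job_id}/status", func(w http.ResponseWriter, r *http.Request) {
		_ = r.PathValue("job_id") // look up the job's status here
	})
	mux.HandleFunc("GET /jobs/{job_id}/results", func(w http.ResponseWriter, r *http.Request) {
		_ = r.PathValue("job_id") // look up the job's results here
	})

	log.Println("listening on http://localhost:8080")
	log.Fatal(http.ListenAndServe(":8080", mux))
}
```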
Send a POST request to start a new crawling job:
```
curl -X POST http://localhost:8080/crawl \
  -H "Content-Type: application/json" \
  -d '{"url":"https://prorobot.ai/hashtags"}'
```

Response:

```
{"job_id": "1623751234567890000"}
```

Save the `job_id` for further testing.
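The `job_id` in the example looks like a Unix timestamp in nanoseconds. Assuming that scheme (the guide does not confirm it), such an id could be generated like this:

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// newJobID returns the current Unix time in nanoseconds as a string,
// matching the shape of the example id "1623751234567890000".
// This id scheme is an assumption for illustration.
func newJobID() string {
	return strconv.FormatInt(time.Now().UnixNano(), 10)
}

func main() {
	fmt.Println(newJobID())
}
```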
Use the `job_id` to check the status of a crawling job:

```
curl http://localhost:8080/jobs/{job_id}/status
```

Replace `{job_id}` with the actual job ID. Example:

```
curl http://localhost:8080/jobs/1623751234567890000/status
```

Response:

```
{"job_id": "1623751234567890000", "status": "running", "processed": 15, "total": 64}
```
Retrieve a list of all jobs (both active and completed):

```
curl http://localhost:8080/jobs
```

Response:

```
[
  {"job_id": "1623751234567890000", "status": "running", "processed": 15, "total": 64},
  {"job_id": "1623751234567890001", "status": "completed", "processed": 64, "total": 64}
]
```
Retrieve the results of a completed job:

```
curl http://localhost:8080/jobs/{job_id}/results
```

Replace `{job_id}` with the actual job ID. Example:

```
curl http://localhost:8080/jobs/1623751234567890000/results
```

Response:

```
[
  {"url": "https://prorobot.ai/hashtags", "title": "Example Page", "content": "Lorem ipsum..."},
  ...
]
```
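Since the crawler uses `goquery` for HTML parsing, the result fields above could be produced by a scrape step along these lines (the `PageResult` type and `scrape` function are illustrative assumptions):

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

// PageResult mirrors the fields in the results response above
// (an assumed shape, for illustration).
type PageResult struct {
	URL     string `json:"url"`
	Title   string `json:"title"`
	Content string `json:"content"`
}

// scrape fetches a page and uses goquery to pull out the title,
// body text, and the href attributes of all anchor tags.
func scrape(url string) (PageResult, []string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return PageResult{}, nil, err
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		return PageResult{}, nil, err
	}

	result := PageResult{
		URL:     url,
		Title:   doc.Find("title").First().Text(),
		Content: doc.Find("body").Text(),
	}

	var links []string
	doc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
		if href, ok := s.Attr("href"); ok {
			links = append(links, href)
		}
	})
	return result, links, nil
}

func main() {
	result, links, err := scrape("https://prorobot.ai/hashtags")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\nfound %d links\n", result, len(links))
}
```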
What happens at each step:

- Starting a Job:
  - A new job is created, and a `job_id` is returned.
  - The job begins crawling the provided URL.
- Checking Job Status:
  - If the job is running, the status will be `"running"` with the number of processed links.
  - If the job is completed, the status will be `"completed"`.
- Listing All Jobs:
  - Returns a list of all jobs with their `job_id`, `status`, `processed`, and `total` links.
- Retrieving Job Results:
  - If the job is completed, returns the crawled data (URL, title, and content).
  - If the job is still running, returns a `"processing"` status.
- Concurrency: Multiple jobs can run simultaneously. Each job is tracked independently (see the sketch below).
- Error Handling: If a job ID is invalid or not found, the API returns a `404 Not Found` error.
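One common way to get both properties, independent tracking of concurrent jobs and a `404 Not Found` for unknown ids, is a mutex-protected map keyed by job id. A minimal sketch under that assumption:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"sync"
)

// Job holds the per-job state reported by the status endpoint.
// The field set mirrors the JSON shown earlier; the names are assumptions.
type Job struct {
	JobID     string `json:"job_id"`
	Status    string `json:"status"`
	Processed int    `json:"processed"`
	Total     int    `json:"total"`
}

// jobStore tracks every job independently behind a mutex so that
// concurrently running crawls can update their own entries safely.
type jobStore struct {
	mu   sync.RWMutex
	jobs map[string]*Job
}

func newJobStore() *jobStore {
	return &jobStore{jobs: make(map[string]*Job)}
}

func (s *jobStore) get(id string) (*Job, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	job, ok := s.jobs[id]
	return job, ok
}

// statusHandler returns the job's status, or 404 Not Found when the id
// is unknown, matching the error handling described above.
func (s *jobStore) statusHandler(w http.ResponseWriter, r *http.Request) {
	job, ok := s.get(r.PathValue("job_id")) // PathValue requires Go 1.22+
	if !ok {
		http.Error(w, "job not found", http.StatusNotFound)
		return
	}
	json.NewEncoder(w).Encode(job)
}

func main() {
	store := newJobStore()
	mux := http.NewServeMux()
	mux.HandleFunc("GET /jobs/{job_id}/status", store.statusHandler)
	log.Fatal(http.ListenAndServe(":8080", mux))
}
```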
To run a quick end-to-end test:

- Start a new job:

  ```
  curl -X POST http://localhost:8080/crawl -H "Content-Type: application/json" -d '{"url":"https://prorobot.ai/hashtags"}'
  ```

- Check the job status:

  ```
  curl http://localhost:8080/jobs/1623751234567890000/status
  ```

- List all jobs:

  ```
  curl http://localhost:8080/jobs
  ```

- Retrieve results after the job completes:

  ```
  curl http://localhost:8080/jobs/1623751234567890000/results
  ```
This testing guide lets you verify all functionality of the web worker application.