A lightweight Ruby on Rails API that scrapes web pages based on provided CSS selectors and meta tags, optimized with caching.
- Fetch HTML content from a given URL
- Extract elements using CSS selectors
- Extract meta tags by name
- Cache HTML downloads to minimize redundant requests
- Graceful error handling
- Ruby 3.4+
- Rails 8+
git clone https://github.com/Shongi/dotidot_scraper.git
cd dotidot_scraper
bundle install
bin/rails server
POST /data
Request Body:
{
"url": "https://www.alza.cz/aeg-7000-prosteam-lfr73964cc-d7635493.htm",
"fields": {
"price": ".price-box__primary-price",
"rating_count": ".ratingCount",
"rating_value": ".ratingValue",
"meta": [ "keywords", "twitter:image" ]
}
}
Response:
{
"price": "15 990,-",
"rating_count": "25 hodnocení",
"rating_value": "4,9",
"meta": {
"keywords": "AEG,7000,ProSteam®,LFR73964CC,Automatické pračky...",
"twitter:image": "https://image.alza.cz/products/...jpg"
}
}
Tests are written using RSpec.
bin/rspec
Caching is tested using Rails.cache, and external HTTP calls are mocked.
You can test the scraper endpoint manually using Curl or Postman by sending a POST request with a JSON body. Make sure the server is running first.
Curl:
curl -X POST http://localhost:3000/data \
-H "Content-Type: application/json" \
-d '{
"url": "https://www.alza.cz/aeg-7000-prosteam-lfr73964cc-d7635493.htm",
"fields": {
"price": ".price-box__primary-price",
"rating_count": ".ratingCount",
"rating_value": ".ratingValue",
"meta": ["keywords", "twitter:image"]
}
}'
Postman:
- set the method to POST
- set the URL to http://localhost:3000/data
- set the body as raw JSON with the example JSON from the Request Body section
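The same request can also be sent from Ruby. The sketch below uses only the standard library and assumes the server is listening on http://localhost:3000 as in the Curl example. The helper names `build_scrape_request` and `scrape` are hypothetical and are not part of the project.

```ruby
require "json"
require "net/http"
require "uri"

# Builds (but does not send) the POST request for the /data endpoint.
# `endpoint` is the full endpoint URL, e.g. "http://localhost:3000/data".
def build_scrape_request(endpoint, page_url, fields)
  request = Net::HTTP::Post.new(URI(endpoint))
  request["Content-Type"] = "application/json"
  request.body = JSON.generate(url: page_url, fields: fields)
  request
end

# Sends the request over plain HTTP and returns the parsed JSON response.
# Assumes the server answers with a JSON body; no error handling is shown.
def scrape(endpoint, page_url, fields)
  uri = URI(endpoint)
  request = build_scrape_request(endpoint, page_url, fields)
  response = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
  JSON.parse(response.body)
end
```

With the server running, `scrape("http://localhost:3000/data", page_url, fields)` should return a hash shaped like the example response, where `fields` is the same hash of selectors and meta names used in the request body.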
- Caching is done with Rails' built-in cache store.
- Meta tags are matched using the name attribute only.
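To illustrate what matching on the name attribute only implies, here is a simplified, stdlib-only approximation. It is not the project's implementation: it uses regular expressions instead of an HTML parser, and the method name `meta_by_name` and the choice to return `nil` for missing tags are assumptions made for this sketch.

```ruby
# Simplified approximation: real HTML should be handled by an HTML parser.
META_TAG = /<meta\b[^>]*>/i
ATTRIBUTE = /([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/

# Returns a hash of requested meta names to their content values.
# Only the name attribute is inspected, so tags that identify themselves
# through another attribute (for example property) are never matched.
def meta_by_name(html, names)
  found = {}
  html.scan(META_TAG) do |tag|
    # Collect the tag's quoted attributes into a hash with lowercase keys.
    attrs = tag.scan(ATTRIBUTE).to_h { |key, dq, sq| [key.downcase, dq || sq] }
    name = attrs["name"]
    found[name] ||= attrs["content"] if name && names.include?(name)
  end
  # Every requested name appears in the result; unmatched names map to nil.
  names.to_h { |name| [name, found[name]] }
end
```

A tag such as `<meta property="og:image" ...>` is skipped by this sketch even when `og:image` is requested, because it carries no name attribute.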
- ScraperController: Single endpoint controller
- ScrapePage: Command object encapsulating the scraping logic
- Tests in spec/
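The combination of a command object and a cache store can be sketched as follows. This is a minimal illustration, not the project's code: `MemoryCache` is a hypothetical stand-in for Rails.cache that mimics its fetch-with-block behaviour, `ScrapePageSketch` is a made-up name, and the downloader is injected so the example runs without network access. It covers only the cached download, not the extraction step.

```ruby
# Hypothetical stand-in for Rails.cache, exposing a fetch-with-block method.
class MemoryCache
  def initialize
    @store = {}
  end

  # Returns the cached value for key, or stores and returns the block's result.
  def fetch(key)
    return @store[key] if @store.key?(key)

    @store[key] = yield
  end
end

# Command object sketch: one public #call, collaborators passed in.
class ScrapePageSketch
  def initialize(url:, cache:, downloader:)
    @url = url
    @cache = cache
    @downloader = downloader
  end

  # Returns the page HTML, invoking the downloader only on a cache miss.
  def call
    @cache.fetch("scrape:#{@url}") { @downloader.call(@url) }
  end
end

# Usage: the second call for the same URL is served from the cache.
downloads = 0
downloader = ->(url) { downloads += 1; "<html>#{url}</html>" }
cache = MemoryCache.new
2.times { ScrapePageSketch.new(url: "https://example.com", cache: cache, downloader: downloader).call }
```

Inside a Rails application the stand-in would presumably be replaced by `Rails.cache.fetch(key, expires_in: ...) { ... }`, which follows the same return-or-compute pattern.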