Dotidot Scraper API

A lightweight Ruby on Rails API that scrapes a web page using caller-supplied CSS selectors and meta tag names, and caches downloaded HTML to avoid redundant requests.

Features

Fetch HTML content from a given URL
Extract elements using CSS selectors
Extract meta tags by name
Cache HTML downloads to minimize redundant requests
Graceful error handling

Requirements

Ruby 3.4+
Rails 8+

Installation

git clone https://github.com/Shongi/dotidot_scraper.git
cd dotidot_scraper
bundle install

Running the Server

bin/rails server

API Usage

Endpoint

POST /data

Request Body

{
  "url": "https://www.alza.cz/aeg-7000-prosteam-lfr73964cc-d7635493.htm",
  "fields": {
    "price": ".price-box__primary-price",
    "rating_count": ".ratingCount",
    "rating_value": ".ratingValue",
    "meta": [ "keywords", "twitter:image" ]
  }
}

Sample Response

{
  "price": "15 990,-",
  "rating_count": "25 hodnocení",
  "rating_value": "4,9",
  "meta": {
    "keywords": "AEG,7000,ProSteam®,LFR73964CC,Automatické pračky...",
    "twitter:image": "https://image.alza.cz/products/...jpg"
  }
}
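
For callers working from Ruby, the request shown above can be assembled with the standard library alone. This is only a sketch: the helper name `build_scrape_request` is hypothetical and not part of the project, and the `http://localhost:3000` base URL assumes a local development server.

```ruby
require "json"
require "net/http"
require "uri"

# Hypothetical helper (not part of the project): builds the POST /data
# request described above without sending it.
def build_scrape_request(base_url, page_url, fields)
  uri = URI.join(base_url, "/data")
  request = Net::HTTP::Post.new(uri)
  request["Content-Type"] = "application/json"
  request.body = JSON.generate(url: page_url, fields: fields)
  request
end

request = build_scrape_request(
  "http://localhost:3000", # assumed local development server
  "https://www.alza.cz/aeg-7000-prosteam-lfr73964cc-d7635493.htm",
  { "price" => ".price-box__primary-price", "meta" => ["keywords", "twitter:image"] }
)

# To send it against a running server:
#   response = Net::HTTP.start(request.uri.hostname, request.uri.port) do |http|
#     http.request(request)
#   end
#   data = JSON.parse(response.body)
```

Building the request separately from sending it keeps the payload easy to inspect or log before any network call is made.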

Testing

Tests are written using RSpec.

bin/rspec

Caching is tested using Rails.cache, and external HTTP calls are mocked.

Manual Testing

With the server running, you can test the scraper endpoint manually with curl or Postman by sending a POST request with a JSON body.

Curl:

curl -X POST http://localhost:3000/data \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.alza.cz/aeg-7000-prosteam-lfr73964cc-d7635493.htm",
    "fields": {
      "price": ".price-box__primary-price",
      "rating_count": ".ratingCount",
      "rating_value": ".ratingValue",
      "meta": ["keywords", "twitter:image"]
    }
  }'

Postman:

Set the method to POST.
Set the URL to http://localhost:3000/data.
Set the body to raw JSON, using the example from the Request Body section.

Notes

Caching is done with Rails' built-in cache store.
Meta tags are matched using the name attribute only.
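
The caching note can be illustrated with a small stand-in. The project itself relies on the Rails cache store; the `HtmlCache` class below is a hypothetical, standard-library-only imitation of the fetch-or-compute pattern (the same contract as calling `Rails.cache.fetch` with a block), keyed by URL. The class name, the 60-second TTL, and the fake downloader are all assumptions made for illustration.

```ruby
# Hypothetical stand-in for the Rails cache store, for illustration only.
class HtmlCache
  Entry = Struct.new(:value, :expires_at)

  def initialize(ttl_seconds:)
    @ttl_seconds = ttl_seconds
    @entries = {}
  end

  # Returns the cached value for +url+. When the entry is missing or
  # expired, runs the block, stores its result, and returns it.
  def fetch(url)
    entry = @entries[url]
    return entry.value if entry && entry.expires_at > now

    value = yield
    @entries[url] = Entry.new(value, now + @ttl_seconds)
    value
  end

  private

  # Monotonic clock, so expiry is unaffected by wall-clock changes.
  def now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

# Fake downloader that counts how often it is invoked (no network access).
downloads = 0
fake_download = lambda do
  downloads += 1
  "<html><head><meta name=\"keywords\" content=\"demo\"></head></html>"
end

cache = HtmlCache.new(ttl_seconds: 60) # TTL value is an assumption
first = cache.fetch("https://example.com/") { fake_download.call }
second = cache.fetch("https://example.com/") { fake_download.call }
```

Here the second `fetch` for the same URL returns the stored HTML without invoking the downloader again, which is the behaviour the caching feature is meant to provide.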

Code Structure

ScraperController: Single endpoint controller
ScrapePage: Command object encapsulating the scraping logic
Tests in spec/
