From c05224bc1e0b66b067e39d3a617e9557809e0099 Mon Sep 17 00:00:00 2001 From: "Marcos G. Zimmermann" Date: Sat, 18 Apr 2026 15:39:54 -0300 Subject: [PATCH 1/2] docs: add comprehensive documentation Adds a 10-file documentation set under docs/ covering: - Getting started, processes, adapters, CLI - Rack middleware, SEO extensions, events, Rails integration - Full API reference --- docs/README.md | 67 +++++++++++++++++ docs/adapters.md | 143 ++++++++++++++++++++++++++++++++++++ docs/api.md | 154 +++++++++++++++++++++++++++++++++++++++ docs/cli.md | 93 ++++++++++++++++++++++++ docs/events.md | 79 ++++++++++++++++++++ docs/extensions.md | 141 ++++++++++++++++++++++++++++++++++++ docs/getting-started.md | 138 +++++++++++++++++++++++++++++++++++ docs/middleware.md | 85 ++++++++++++++++++++++ docs/processes.md | 156 ++++++++++++++++++++++++++++++++++++++++ docs/rails.md | 128 +++++++++++++++++++++++++++++++++ 10 files changed, 1184 insertions(+) create mode 100644 docs/README.md create mode 100644 docs/adapters.md create mode 100644 docs/api.md create mode 100644 docs/cli.md create mode 100644 docs/events.md create mode 100644 docs/extensions.md create mode 100644 docs/getting-started.md create mode 100644 docs/middleware.md create mode 100644 docs/processes.md create mode 100644 docs/rails.md diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..17c1cad --- /dev/null +++ b/docs/README.md @@ -0,0 +1,67 @@ +# site_maps + +Concurrent, adapter-based sitemap.xml generation for Ruby applications. + +`site_maps` is a framework-agnostic sitemap builder with built-in Rails support. It produces valid sitemap XML (with full SEO extensions — image, video, news, hreflang, mobile, PageMap), splits large sitemaps into indexed chunks automatically, generates them concurrently across a thread pool, and ships them to the filesystem, S3, or a custom backend through a pluggable adapter layer. 
+ +## Contents + +- [Getting started](getting-started.md) — install, first sitemap, Rails +- [Processes](processes.md) — static and dynamic process DSL +- [Adapters](adapters.md) — filesystem, S3, no-op, custom +- [CLI](cli.md) — `site_maps generate` +- [Rack middleware](middleware.md) — serve generated sitemaps from the app +- [SEO extensions](extensions.md) — image, video, news, hreflang, mobile, PageMap +- [Events](events.md) — instrumentation hooks +- [Rails integration](rails.md) — URL helpers, Railtie, precompile +- [API reference](api.md) — full public API + +## Install + +```ruby +# Gemfile +gem 'site_maps' +``` + +## One-minute tour + +```ruby +# config/sitemap.rb +SiteMaps.use(:file_system) do + configure do |config| + config.url = 'https://example.com/sitemap.xml' + config.directory = Rails.public_path.to_s + end + + process do |s| + s.add('/', priority: 1.0, changefreq: 'daily') + s.add('/about', lastmod: Time.now) + + Post.find_each do |post| + s.add("/posts/#{post.slug}", lastmod: post.updated_at) + end + end +end +``` + +```bash +bundle exec site_maps generate --config-file config/sitemap.rb +``` + +Generated: `public/sitemap.xml` (plus a sitemap index and numbered files if the URL set exceeds 50,000 links). + +## Why site_maps + +- **Concurrency.** Processes run in a `Concurrent::FixedThreadPool`; threads share a thread-safe repo that handles file splitting. +- **Pluggable storage.** Write the same sitemap to disk in development and S3 in production by swapping one line. +- **SEO extensions.** Full support for sitemap URL extensions: images, videos, news, hreflang alternates, mobile, PageMap. +- **Dynamic processes.** Parameterized templates like `posts/%{year}-%{month}/sitemap.xml` let you rebuild a single shard without regenerating the whole site. + +## Version + +- Ruby: `>= 3.2.0` +- Depends on: `builder ~> 3.0`, `concurrent-ruby >= 1.1`, `rack >= 2.0`, `zeitwerk`, `thor` + +## License + +MIT. 
diff --git a/docs/adapters.md b/docs/adapters.md new file mode 100644 index 0000000..5ae93c5 --- /dev/null +++ b/docs/adapters.md @@ -0,0 +1,143 @@ +# Adapters + +An **adapter** is the storage backend for generated sitemap files. Three adapters ship with the gem; a clean interface makes it easy to write your own. + +## Built-in adapters + +| Adapter | When to use | +|---------|-------------| +| `:file_system` | Write to disk. Ideal for local dev, or for serving via the bundled Rack middleware. | +| `:aws_sdk` | Upload to S3. Suited to production deployments behind CloudFront or similar. | +| `:noop` | Discard writes. Ideal for tests that care about "what URLs got added" but not "what ended up on disk". | + +Select one by passing its name (or an adapter class) to `SiteMaps.use`. + +## `:file_system` + +```ruby +SiteMaps.use(:file_system) do + configure do |config| + config.url = 'https://example.com/sitemap.xml' + config.directory = Rails.public_path.to_s # default: "public/sitemaps" + end + process { |s| ... } +end +``` + +**Config attributes:** + +| Key | Purpose | +|-----|---------| +| `url` | Public URL — drives filename layout and is written into sitemap `<loc>` entries. | +| `directory` | Filesystem root under which files land. | + +If `config.url` ends in `.gz`, the adapter writes gzipped files. The middleware transparently decompresses on serve. + +## `:aws_sdk` + +```ruby +SiteMaps.use(:aws_sdk) do + configure do |config| + config.url = 'https://my-bucket.s3.amazonaws.com/sitemap.xml' + config.directory = '/tmp/sitemaps' # local scratch space + config.bucket = 'my-bucket' + config.region = ENV.fetch('AWS_REGION', 'us-east-1') + config.access_key_id = ENV['AWS_ACCESS_KEY_ID'] + config.secret_access_key = ENV['AWS_SECRET_ACCESS_KEY'] + config.acl = 'public-read' # default + config.cache_control = 'private, max-age=0, no-cache' + end + process { |s| ... 
} +end +``` + +**Config attributes:** + +| Key | Default | +|-----|---------| +| `bucket` | `ENV['AWS_BUCKET']` | +| `region` | `ENV.fetch('AWS_REGION', 'us-east-1')` | +| `access_key_id` | `ENV['AWS_ACCESS_KEY_ID']` | +| `secret_access_key` | `ENV['AWS_SECRET_ACCESS_KEY']` | +| `acl` | `"public-read"` | +| `cache_control` | `"private, max-age=0, no-cache"` | +| `directory` | Local scratch dir for staging before upload | + +The adapter writes locally first (to `directory`), then uploads to S3 with the configured ACL and Cache-Control headers. You'll need `aws-sdk-s3` in your Gemfile: + +```ruby +gem 'aws-sdk-s3' +``` + +## `:noop` + +```ruby +SiteMaps.use(:noop) do + configure { |c| c.url = 'https://example.com/sitemap.xml' } + process { |s| ... } +end +``` + +Writes are discarded. Use it in tests when you want to assert on the URLs being added (via events, for example) without hitting disk. + +## Writing a custom adapter + +Subclass `SiteMaps::Adapters::Adapter` and implement `write`, `read`, `delete`: + +```ruby +class GoogleCloudStorageAdapter < SiteMaps::Adapters::Adapter + class Config < SiteMaps::Configuration + attribute :bucket + attribute :project_id + end + + def write(url, raw_data, **_kwargs) + storage = Google::Cloud::Storage.new(project_id: config.project_id) + bucket = storage.bucket(config.bucket) + bucket.create_file(StringIO.new(raw_data), path_from(url)) + end + + def read(url) + file = storage.bucket(config.bucket).file(path_from(url)) + [file.download.string, { content_type: 'application/xml' }] + end + + def delete(url) + storage.bucket(config.bucket).file(path_from(url))&.delete + end + + private + + def path_from(url) + URI(url).path[1..] 
+ end + + def storage + @storage ||= Google::Cloud::Storage.new(project_id: config.project_id) + end +end +``` + +Register and use it: + +```ruby +SiteMaps.use(GoogleCloudStorageAdapter) do + configure do |config| + config.url = 'https://cdn.example.com/sitemap.xml' + config.bucket = 'my-bucket' + config.project_id = 'my-project' + end + process { |s| ... } +end +``` + +## Adapter interface + +| Method | Purpose | +|--------|---------| +| `#write(url, raw_data, **kwargs)` | Persist `raw_data` at the location implied by `url`. | +| `#read(url)` | Return `[raw_data, { content_type: '…' }]` for the given URL. | +| `#delete(url)` | Remove the file at the URL. | +| `.config_class` | (optional) Return a `Configuration` subclass to expose adapter-specific settings. | + +The adapter base class handles everything else: URL filters, the process registry, and thread-safe URL tracking. diff --git a/docs/api.md b/docs/api.md new file mode 100644 index 0000000..c930e9d --- /dev/null +++ b/docs/api.md @@ -0,0 +1,154 @@ +# API Reference + +## `SiteMaps` (top-level module) + +| Method | Description | +|--------|-------------| +| `SiteMaps.use(adapter, **opts, &block)` | Register an adapter (`:file_system`, `:aws_sdk`, `:noop`, or a class) and yield its configuration block. | +| `SiteMaps.define(&block)` | Register a context-aware definition. Called by `.generate` with the `context:` hash splatted as kwargs. | +| `SiteMaps.configure { |config| ... }` | Mutate global defaults. | +| `SiteMaps.config` | Return global `Configuration`. | +| `SiteMaps.generate(config_file:, context: {}, **runner_opts) → Runner` | Load `config_file` and return a `Runner` ready to `.enqueue` and `.run`. | +| `SiteMaps.current_adapter` | Last-registered adapter (thread-local during `.generate`). | +| `SiteMaps.logger` | Configurable logger (default `Logger.new($stdout)`). 
| + +### Constants + +```ruby +SiteMaps::MAX_LENGTH # { links: 50_000, images: 1_000, news: 1_000 } +SiteMaps::MAX_FILESIZE # 50_000_000 bytes +``` + +### Errors + +- `SiteMaps::Error` — base error +- `SiteMaps::AdapterNotFound` — unknown adapter symbol +- `SiteMaps::AdapterNotSetError` — generate called without an adapter +- `SiteMaps::FileNotFoundError` — missing file at adapter read +- `SiteMaps::FullSitemapError` — internal signal that a URL set is full (triggers split) +- `SiteMaps::ConfigurationError` — invalid config + +--- + +## `SiteMaps::Configuration` + +Base configuration. Adapter configs subclass this. + +| Attribute | Default | Purpose | +|-----------|---------|---------| +| `url` | — (required) | Public URL of the main sitemap index. | +| `directory` | `"/tmp/sitemaps"` | Local storage directory. | +| `max_links` | `50_000` | URLs per file before split. | +| `emit_priority` | `true` | Emit the `<priority>` element. | +| `emit_changefreq` | `true` | Emit the `<changefreq>` element. | +| `xsl_stylesheet_url` | `nil` | Stylesheet for URL sets. | +| `xsl_index_stylesheet_url` | `nil` | Stylesheet for the sitemap index. | +| `ping_search_engines` | `false` | Auto-ping after generation. | +| `ping_engines` | `{ bing: '...' }` | URL templates per engine; `%{url}` is replaced with the URL-encoded sitemap URL at ping time. | + +--- + +## `SiteMaps::Adapters::Adapter` (base class) + +Abstract base. Subclass to build custom adapters. + +| Method | Description | +|--------|-------------| +| `.config_class` | Override to return a `Configuration` subclass with adapter-specific attributes. | +| `#write(url, raw_data, **kwargs)` | Abstract. Persist `raw_data` at the storage location implied by `url`. | +| `#read(url) → [raw_data, { content_type: '…' }]` | Abstract. | +| `#delete(url)` | Abstract. | +| `#configure { |c| ... }` | Yield the adapter's configuration. | +| `#process(name = :default, location = nil, **kwargs, &block)` | Register a process. | +| `#external_sitemap(url, lastmod:)` | Add an external sitemap to the index. 
| +| `#extend_processes_with(mod)` | Mix `mod` into all process blocks. | +| `#url_filter { |url, options| ... }` | Register a URL filter. | +| `#apply_url_filters(url, options)` | Run all filters; returns modified options or `nil` if excluded. | +| `#reset!` | Clear index and repo. Called before `Runner#run`. | + +--- + +## `SiteMaps::Runner` + +Executes enqueued processes concurrently. + +```ruby +Runner.new(adapter = SiteMaps.current_adapter, max_threads: 4, ping: nil) +``` + +| Method | Description | +|--------|-------------| +| `#enqueue(process_name, **kwargs)` | Queue one process with kwargs. | +| `#enqueue_remaining` / `#enqueue_all` | Queue every process not yet enqueued. | +| `#run` | Execute queued processes, finalize index, optionally ping. | + +--- + +## `SiteMaps::SitemapBuilder` + +Yielded as `s` inside every `process` block. + +| Method | Description | +|--------|-------------| +| `#add(path, **options)` | Add one URL to the current URL set. Automatically splits when full. | +| `#finalize!` | Finalize the current URL set. Called automatically when the process block returns. | + +`options` supports every extension documented in [extensions.md](extensions.md): `lastmod`, `priority`, `changefreq`, `images`, `videos`, `news`, `alternates`, `mobile`, `pagemap`. + +In Rails apps, `s.route` is an object exposing all URL helpers. + +--- + +## `SiteMaps::Middleware` + +Rack middleware for serving generated sitemaps. See [middleware.md](middleware.md). + +```ruby +use SiteMaps::Middleware, + adapter: ..., + public_prefix: nil, + storage_prefix: nil, + x_robots_tag: 'noindex, follow', + cache_control: 'public, max-age=3600' +``` + +--- + +## `SiteMaps::Notification` + +| Method | Description | +|--------|-------------| +| `.subscribe(event_or_class, &block)` | Subscribe to one event (string) or every event named on a class. | +| `.unsubscribe(subscriber)` | Remove a subscription. | +| `.instrument(event, payload) { ... 
}` | Emit an event, wrapping the block in a timer. | + +See [events.md](events.md) for the event catalog. + +--- + +## `SiteMaps::RobotsTxt` + +| Method | Description | +|--------|-------------| +| `.sitemap_directive(url) → String` | Return `"Sitemap: <url>"`. | +| `.render(sitemap_url:, extra_directives: []) → String` | Build a full robots.txt body. | + +--- + +## `SiteMaps::Ping` + +| Method | Description | +|--------|-------------| +| `.ping(url, engines: { bing: '...' }) → Hash` | Fire a GET to each engine's template (substituting `%{url}`). Returns a hash of `{engine => { status:, url: }}`. | + +--- + +## CLI entry point + +`exec/site_maps` — the executable shipped with the gem. + +```bash +bundle exec site_maps generate [processes] [options] +``` + +See [cli.md](cli.md). diff --git a/docs/cli.md b/docs/cli.md new file mode 100644 index 0000000..471af6c --- /dev/null +++ b/docs/cli.md @@ -0,0 +1,93 @@ +# CLI + +The gem installs a `site_maps` executable backed by Thor. + +```bash +bundle exec site_maps generate [PROCESS_NAMES...] [options] +``` + +If no process names are given, every process in the config file is enqueued. + +## Options + +| Flag | Default | Purpose | +|------|---------|---------| +| `--config-file`, `-r` | — | Path to the config file defining processes. **Required.** | +| `--max-threads`, `-c` | `4` | Thread pool size for concurrent process execution. | +| `--context` | `{}` | Hash-style kwargs passed to `SiteMaps.define` blocks: `--context=tenant:acme locale:en`. | +| `--enqueue-remaining` | `false` | Also enqueue every process not named on the command line. | +| `--ping` | `false` | Override config to ping search engines after generation. | +| `--debug` | `false` | Set logger to DEBUG level. | +| `--logfile` | — | Write logs to a file instead of stdout. 
| + +## Examples + +Generate everything: + +```bash +bundle exec site_maps generate --config-file config/sitemap.rb +``` + +Regenerate a single shard of a dynamic process: + +```bash +bundle exec site_maps generate monthly_posts \ + --config-file config/sitemap.rb \ + --context=year:2024 month:3 +``` + +Generate `posts` and `products`, then let the config decide what else to include: + +```bash +bundle exec site_maps generate posts products \ + --config-file config/sitemap.rb \ + --enqueue-remaining +``` + +Tune concurrency: + +```bash +bundle exec site_maps generate --config-file config/sitemap.rb --max-threads 10 +``` + +Ping Bing and any custom engines (config-driven — see below): + +```bash +bundle exec site_maps generate --config-file config/sitemap.rb --ping +``` + +## Search-engine pinging + +Pinging is off by default. Enable globally in config or flip it on per run via `--ping`. + +```ruby +SiteMaps.use(:file_system) do + configure do |config| + config.url = 'https://example.com/sitemap.xml' + config.ping_search_engines = true + config.ping_engines = { + bing: 'https://www.bing.com/ping?sitemap=%{url}', + custom: 'https://search.example.com/ping?url=%{url}' + } + end +end +``` + +`%{url}` in the template is replaced with a URL-encoded `config.url` at ping time. + +## Rails / bundler + +The CLI auto-requires `config/environment` if it detects a `config/application.rb`, so Rails URL helpers (via the Railtie) are available inside your config file. + +If you don't want that — say, a Ruby-only script in a Rails repo — pass a config file outside the Rails root or invoke the library directly via `SiteMaps.generate(...)`. + +## Logging + +- `--debug` sets the logger to `Logger::DEBUG`. +- `--logfile PATH` writes to a file; otherwise stdout. +- A built-in event listener prints one line per finalized URL set with link counts and runtime. + +## Exit codes + +- `0` — success. +- Non-zero — any process raised. 
Errors are captured per-future and re-raised after all futures complete, so you see the real backtrace rather than a generic runner failure. diff --git a/docs/events.md b/docs/events.md new file mode 100644 index 0000000..22aff0e --- /dev/null +++ b/docs/events.md @@ -0,0 +1,79 @@ +# Events + +`site_maps` ships a lightweight pub/sub system under `SiteMaps::Notification`. Use it for logging, metrics, or reacting to particular generation phases. + +## Subscribing + +### Block subscribers + +```ruby +SiteMaps::Notification.subscribe('sitemaps.finalize_urlset') do |event| + Rails.logger.info( + "[sitemap] wrote #{event[:links_count]} urls to #{event[:url]} in #{event[:runtime]}s" + ) +end +``` + +### Class subscribers + +A class with one method per event name (dots become underscores): + +```ruby +class SitemapMetrics + def self.sitemaps_process_execution(event) + StatsD.timing('sitemaps.process', event[:runtime], tags: ["process:#{event[:process].name}"]) + end + + def self.sitemaps_finalize_urlset(event) + StatsD.increment('sitemaps.urlset.written', tags: ["url:#{event[:url]}"]) + end + + def self.sitemaps_ping(event) + event[:results].each do |engine, result| + StatsD.increment('sitemaps.ping', tags: ["engine:#{engine}", "status:#{result[:status]}"]) + end + end +end + +SiteMaps::Notification.subscribe(SitemapMetrics) +``` + +### The built-in listener + +For colored terminal output during CLI runs: + +```ruby +SiteMaps::Notification.subscribe(SiteMaps::Runner::EventListener) +``` + +This is subscribed automatically by the CLI. 
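
The naming rule for class subscribers (dots in the event name become underscores in the method name) can be sketched in plain Ruby. This is a minimal illustration only: `TinyBus` and `UrlsetCounter` are hypothetical names, and the code is not the gem's `SiteMaps::Notification` implementation.

```ruby
# Hypothetical mini-dispatcher; NOT the gem's SiteMaps::Notification code.
class TinyBus
  def initialize
    @subscribers = []
  end

  # Register any object (often a class) whose method names mirror event names.
  def subscribe(subscriber)
    @subscribers << subscriber
    subscriber
  end

  # "sitemaps.finalize_urlset" -> :sitemaps_finalize_urlset
  def self.handler_name(event_name)
    event_name.tr('.', '_').to_sym
  end

  # Deliver the payload to every subscriber that defines a matching handler.
  # Returns the number of handlers that were invoked.
  def publish(event_name, payload = {})
    handler = self.class.handler_name(event_name)
    @subscribers.count do |subscriber|
      next false unless subscriber.respond_to?(handler)

      subscriber.public_send(handler, payload)
      true
    end
  end
end

# Hypothetical subscriber that totals the links written across URL sets.
class UrlsetCounter
  @links = 0

  class << self
    attr_reader :links

    def sitemaps_finalize_urlset(event)
      @links += event[:links_count]
    end
  end
end

bus = TinyBus.new
bus.subscribe(UrlsetCounter)
bus.publish('sitemaps.finalize_urlset', links_count: 120, url: 'https://example.com/sitemap1.xml')
bus.publish('sitemaps.ping', results: {}) # no matching handler, so it is skipped
puts UrlsetCounter.links
```

Skipping events that a subscriber does not define lets one class listen to a subset of the catalog without stub methods.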
+ +## Events + +| Event | Payload keys | +|-------|-------------| +| `sitemaps.enqueue_process` | `process`, `kwargs` | +| `sitemaps.before_process_execution` | `process`, `kwargs` | +| `sitemaps.process_execution` | `process`, `kwargs`, `runtime` | +| `sitemaps.finalize_urlset` | `url`, `links_count`, `news_count`, `last_modified`, `runtime`, `process` | +| `sitemaps.ping` | `results` | + +`process` is a `SiteMaps::Process` struct (`name`, `location_template`, `kwargs_template`, `block`). + +## Event ordering + +For each process the sequence is: + +1. `sitemaps.enqueue_process` +2. `sitemaps.before_process_execution` +3. One or more `sitemaps.finalize_urlset` (one per split file) +4. `sitemaps.process_execution` + +After all processes complete, one final `sitemaps.finalize_urlset` fires for the sitemap index itself. If pinging is enabled, `sitemaps.ping` fires last. + +## Use cases + +- **Logging.** Tail-friendly output of what just ran, how many URLs, runtime. +- **Metrics.** StatsD / OpenTelemetry counters for throughput and ping outcomes. +- **Alerting.** Subscribe to `sitemaps.ping`, alert on non-200 results. +- **Cache busting.** After `sitemaps.finalize_urlset`, purge the CDN entry for the written URL. diff --git a/docs/extensions.md b/docs/extensions.md new file mode 100644 index 0000000..0aa1177 --- /dev/null +++ b/docs/extensions.md @@ -0,0 +1,141 @@ +# SEO Extensions + +`s.add` accepts options for every sitemap extension recognized by Google and Bing. Pass any of the following alongside `lastmod`, `priority`, and `changefreq`. + +## Image + +Up to 1,000 images per URL. + +```ruby +s.add('/gallery/summer', images: [ + { + loc: 'https://cdn.example.com/summer/beach.jpg', + title: 'Beach sunset', + caption: 'A photo from the summer trip', + geo_location: 'Cape Cod, MA', + license: 'https://creativecommons.org/licenses/by/4.0/' + } +]) +``` + +## Video + +Up to 1,000 video entries per sitemap file. 
+ +```ruby +s.add('/videos/how-to', videos: [ + { + thumbnail_loc: 'https://cdn.example.com/thumbs/how-to.jpg', + title: 'How to use site_maps', + description: 'A quick walkthrough', + content_loc: 'https://cdn.example.com/videos/how-to.mp4', + player_loc: 'https://example.com/embed/how-to', + duration: 600, + publication_date: Time.now, + rating: 4.8, + view_count: 12_345, + family_friendly: true, + requires_subscription: false, + live: false, + tags: %w[tutorial guide], + category: 'Technology', + uploader: 'example-team', + uploader_info: 'https://example.com/about', + gallery_loc: 'https://example.com/videos', + gallery_title: 'Example video gallery', + price: nil, + allow_embed: true, + autoplay: 'ap=1' + } +]) +``` + +## News + +Up to 1,000 news entries per sitemap file (use a dedicated process for news URLs). + +```ruby +s.add('/news/breaking', news: { + publication_name: 'Example Times', + publication_language: 'en', + publication_date: Time.now, + title: 'Breaking news headline', + keywords: 'breaking, politics', + genres: 'PressRelease', + access: 'Subscription', + stock_tickers: 'NASDAQ:EXMP' +}) +``` + +## Alternate language / hreflang + +```ruby +s.add('/', alternates: [ + { href: 'https://example.com/en', lang: 'en' }, + { href: 'https://example.com/es', lang: 'es' }, + { href: 'https://example.com/fr', lang: 'fr', nofollow: true } +]) +``` + +The `nofollow: true` variant emits `rel="nofollow alternate"` on the link. Use it to declare locale variants without signalling Google to crawl them as equivalents. + +## Mobile + +Declare a URL as mobile-friendly: + +```ruby +s.add('/mobile-page', mobile: true) +``` + +## PageMap + +Structured data for Google Custom Search. 
+ +```ruby +s.add('/products/widget', pagemap: { + dataobjects: [ + { + type: 'product', + id: 'sku-123', + attributes: [ + { name: 'name', value: 'Widget' }, + { name: 'price', value: '19.99' }, + { name: 'color', value: 'blue' } + ] + } + ] +}) +``` + +## Combined example + +Everything can coexist on a single URL: + +```ruby +s.add('/products/widget', + lastmod: Time.now, + priority: 0.9, + changefreq: 'weekly', + images: [{ loc: 'https://cdn.example.com/widget.jpg', title: 'Widget' }], + alternates: [{ href: 'https://example.com/es/products/widget', lang: 'es' }], + mobile: true, + pagemap: { dataobjects: [{ type: 'product', id: 'sku-123', attributes: [] }] } +) +``` + +## Disabling `priority` / `changefreq` + +Both fields are optional per the sitemap spec, and many search engines ignore them. Disable globally if you want smaller files: + +```ruby +configure do |config| + config.emit_priority = false + config.emit_changefreq = false +end +``` + +## Output size + +- Per URL set: 50,000 links **or** 1,000 news items **or** 50 MB uncompressed — whichever comes first. When one of these is hit, the current file is finalized and a new one starts. +- File naming is automatic (`posts/sitemap.xml` → `posts/sitemap1.xml`, `posts/sitemap2.xml`, …). +- Use the `.gz` extension in `config.url` to emit gzipped files — most search engines fetch either form. 
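
The rollover rule and the file naming described under "Output size" can be sketched as follows. This is an illustration under stated assumptions: `UrlSetStats` and `split_location` are hypothetical names, the limits are the ones listed above, and the `.xml.gz` naming is assumed to follow the same pattern as `.xml`.

```ruby
# Hypothetical stand-in for the rollover check; limits taken from the list above.
LIMITS = { links: 50_000, news: 1_000, bytes: 50_000_000 }.freeze

UrlSetStats = Struct.new(:links, :news, :bytes, keyword_init: true) do
  # True as soon as any single limit has been reached.
  def full?(limits = LIMITS)
    links >= limits[:links] || news >= limits[:news] || bytes >= limits[:bytes]
  end
end

# Name of the n-th split file: "posts/sitemap.xml" -> "posts/sitemap2.xml" for n = 2.
# Treating ".xml.gz" as one extension is an assumption, not documented behavior.
def split_location(path, n)
  ext = path.end_with?('.xml.gz') ? '.xml.gz' : File.extname(path)
  "#{path.delete_suffix(ext)}#{n}#{ext}"
end

puts UrlSetStats.new(links: 49_999, news: 0, bytes: 1_000).full?
puts UrlSetStats.new(links: 10, news: 1_000, bytes: 1_000).full?
puts split_location('posts/sitemap.xml', 2)
```

Checking the three limits with a single `||` chain mirrors the "whichever comes first" wording: the first limit reached ends the current file.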
diff --git a/docs/getting-started.md b/docs/getting-started.md new file mode 100644 index 0000000..d91cb7c --- /dev/null +++ b/docs/getting-started.md @@ -0,0 +1,138 @@ +# Getting Started + +## Install + +```ruby +# Gemfile +gem 'site_maps' +``` + +```bash +bundle install +``` + +## Your first sitemap + +Create `config/sitemap.rb`: + +```ruby +SiteMaps.use(:file_system) do + configure do |config| + config.url = 'https://example.com/sitemap.xml' + config.directory = File.expand_path('../public', __dir__) + end + + process do |s| + s.add('/', priority: 1.0, changefreq: 'daily') + s.add('/about', priority: 0.8, lastmod: Time.now) + s.add('/contact', priority: 0.5) + end +end +``` + +Generate: + +```bash +bundle exec site_maps generate --config-file config/sitemap.rb +``` + +Output: `public/sitemap.xml`. + +## Dynamic URLs + +Call `s.add` for every URL you want indexed. Database records work naturally: + +```ruby +process :posts do |s| + Post.published.find_each do |post| + s.add("/posts/#{post.slug}", lastmod: post.updated_at, priority: 0.7) + end +end +``` + +When the URL count of a single process exceeds `max_links` (default 50,000), the file is split into `sitemap1.xml`, `sitemap2.xml`, … and a sitemap index is written at `config.url`. 
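
The split described above can be pictured with plain Ruby's `each_slice`. This is only a sketch of the idea, not how the gem does it internally: `plan_files` is a hypothetical helper, and a tiny `max_links` is used so the result is easy to read.

```ruby
# Plain-Ruby illustration; `plan_files` is a hypothetical helper, not part of site_maps.
def plan_files(paths, max_links: 50_000, base: 'sitemap')
  chunks = paths.each_slice(max_links).to_a
  # A single chunk needs no index: everything fits in one file.
  return { "#{base}.xml" => chunks.first || [] } if chunks.size <= 1

  # Otherwise number the files from 1; "#{base}.xml" is left for the index.
  chunks.each_with_index.map { |chunk, i| ["#{base}#{i + 1}.xml", chunk] }.to_h
end

paths = (1..7).map { |i| "/posts/#{i}" }
plan = plan_files(paths, max_links: 3)
plan.each { |file, urls| puts "#{file}: #{urls.size} URLs" }
```

With seven URLs and a limit of three, the plan holds three numbered files, and the last one carries the single leftover URL.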
+ +## Named processes + +Named processes get their own file and run in parallel: + +```ruby +SiteMaps.use(:file_system) do + configure { |c| c.url = 'https://example.com/sitemap.xml'; c.directory = 'public' } + + process :static do |s| + s.add('/') + s.add('/about') + end + + process :posts, 'posts/sitemap.xml' do |s| + Post.find_each { |p| s.add("/posts/#{p.slug}") } + end + + process :products, 'products/sitemap.xml' do |s| + Product.find_each { |p| s.add("/products/#{p.id}") } + end +end +``` + +Run all: + +```bash +bundle exec site_maps generate --config-file config/sitemap.rb --max-threads 4 +``` + +Run one: + +```bash +bundle exec site_maps generate posts --config-file config/sitemap.rb +``` + +See [processes.md](processes.md) for the full process DSL including parameterized templates. + +## Using it in Rails + +Add `site_maps` to your Gemfile and generate from a Rake task, a scheduled job, or your deploy pipeline. The Railtie injects URL helpers: + +```ruby +# config/sitemap.rb +SiteMaps.use(:file_system) do + configure do |config| + config.url = 'https://example.com/sitemap.xml' + config.directory = Rails.public_path.to_s + end + + process do |s| + s.add(s.route.root_path, priority: 1.0) + s.add(s.route.about_path) + Post.find_each { |post| s.add(s.route.post_path(post), lastmod: post.updated_at) } + end +end +``` + +See [rails.md](rails.md) for the full Rails integration, including asset precompile hooks and the Rack middleware for serving generated sitemaps. + +## Uploading to S3 + +Swap the adapter line: + +```ruby +SiteMaps.use(:aws_sdk) do + configure do |config| + config.url = 'https://my-bucket.s3.amazonaws.com/sitemap.xml' + config.bucket = 'my-bucket' + config.region = ENV['AWS_REGION'] + # access_key_id / secret_access_key default to ENV vars + end + + process { |s| ... } +end +``` + +See [adapters.md](adapters.md) for adapter specifics and how to build your own. 
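
One detail in the snippet above is worth a second look: `ENV['AWS_REGION']` evaluates to `nil` when the variable is unset, whereas `ENV.fetch` with a second argument supplies a fallback. Whether the adapter falls back to its own default when `nil` is assigned is not stated here, so the sketch below only shows the Ruby behavior. `region_from` and `DEFAULT_REGION` are hypothetical names.

```ruby
DEFAULT_REGION = 'us-east-1'

# `env` is anything hash-like: the real ENV, or a plain Hash in tests.
def region_from(env = ENV)
  env.fetch('AWS_REGION', DEFAULT_REGION)
end

puts region_from({})                              # variable unset: fallback is used
puts region_from({ 'AWS_REGION' => 'eu-west-1' }) # variable set: its value wins
puts({}['AWS_REGION'].inspect)                    # a plain lookup yields nil instead
```

Passing the environment in as an argument keeps the lookup easy to test without mutating the real `ENV`.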
+ +## Next steps + +- [Processes](processes.md) — split your sitemap into static and dynamic shards +- [SEO extensions](extensions.md) — image, video, news, hreflang +- [CLI](cli.md) — automation-friendly generate command +- [Rack middleware](middleware.md) — serve the generated files with correct headers diff --git a/docs/middleware.md b/docs/middleware.md new file mode 100644 index 0000000..df1e1c4 --- /dev/null +++ b/docs/middleware.md @@ -0,0 +1,85 @@ +# Rack Middleware + +`SiteMaps::Middleware` serves generated sitemap files directly from the app. Useful when you've generated to `public/sitemaps/` (filesystem adapter) and want proper `Content-Type`, gzip handling, and XSL stylesheet routing without editing your web-server config. + +## Basic usage + +```ruby +# config/application.rb (Rails) +config.middleware.use SiteMaps::Middleware, adapter: -> { SiteMaps.current_adapter } +``` + +Or inline in `config.ru`: + +```ruby +require 'site_maps' + +use SiteMaps::Middleware, adapter: SiteMaps.current_adapter +run MyApp +``` + +## Options + +```ruby +use SiteMaps::Middleware, + adapter: SiteMaps.current_adapter, + public_prefix: nil, + storage_prefix: nil, + x_robots_tag: 'noindex, follow', + cache_control: 'public, max-age=3600' +``` + +| Option | Purpose | +|--------|---------| +| `adapter` | Adapter instance (or a callable returning one — useful if the adapter is reconfigured at boot). | +| `public_prefix` | Strip from request path before lookup — e.g. `/sitemap` if your app mounts them under a sub-path. | +| `storage_prefix` | Prepend to the lookup key — e.g. `tenants/acme` for multi-tenant layouts. | +| `x_robots_tag` | `X-Robots-Tag` header added to served files. | +| `cache_control` | `Cache-Control` header. | + +## Behavior + +The middleware intercepts requests for `*.xml` and `*.xml.gz` files: + +- Matches → serve from the adapter with `Content-Type: application/xml`, plus `X-Robots-Tag` and `Cache-Control`. 
+- Gzipped sources → auto-decompress on serve so XSL stylesheets render in the browser. Clients asking for `.xml.gz` still get the compressed bytes. +- Doesn't match → `env` passes through to `@app.call`. + +## XSL stylesheets + +The middleware also serves the built-in XSL stylesheets — pretty sitemap rendering for human visitors — at their referenced paths. Configure their URLs via: + +```ruby +configure do |config| + config.xsl_stylesheet_url = '/_sitemap-stylesheet.xsl' + config.xsl_index_stylesheet_url = '/_sitemap-index-stylesheet.xsl' +end +``` + +## Multi-tenant routing + +For per-tenant sitemaps stored under subpaths: + +```ruby +use SiteMaps::Middleware, + adapter: per_request_adapter, + storage_prefix: ->(request) { "tenants/#{request.host.split('.').first}" } +``` + +If the adapter itself already scopes paths by tenant, no prefix is needed — just point it at the right one for each request. + +## robots.txt integration + +Emit a `Sitemap:` directive for the generated file: + +```ruby +# config.ru or a controller +SiteMaps::RobotsTxt.sitemap_directive('https://example.com/sitemap.xml') +# => "Sitemap: https://example.com/sitemap.xml" + +SiteMaps::RobotsTxt.render( + sitemap_url: 'https://example.com/sitemap.xml', + extra_directives: ['Disallow: /admin'] +) +# => "Sitemap: https://example.com/sitemap.xml\nDisallow: /admin" +``` diff --git a/docs/processes.md b/docs/processes.md new file mode 100644 index 0000000..3a8d5a1 --- /dev/null +++ b/docs/processes.md @@ -0,0 +1,156 @@ +# Processes + +A **process** is a unit of work that produces part of a sitemap. Each process runs on its own thread, writes its own URL set, and becomes an entry in the sitemap index. + +## Static processes + +A static process has no parameters. It runs once and writes one (possibly split) sitemap file. 
+
+```ruby
+SiteMaps.use(:file_system) do
+  configure { |c| c.url = 'https://example.com/sitemap.xml'; c.directory = 'public' }
+
+  process do |s|
+    s.add('/', priority: 1.0)
+    s.add('/about')
+  end
+
+  process :posts, 'posts/sitemap.xml' do |s|
+    Post.find_each { |post| s.add("/posts/#{post.slug}", lastmod: post.updated_at) }
+  end
+end
+```
+
+- Without an explicit name, the process is named `:default`.
+- Without an explicit location, a default filename is assigned.
+- The block receives a `SitemapBuilder` (`s`), on which `add` is called per URL.
+
+## Dynamic processes
+
+A dynamic process has placeholders in its location template and corresponding kwargs. Each unique combination of kwargs produces a separate sitemap file.
+
+```ruby
+process :monthly_posts, 'posts/%{year}-%{month}/sitemap.xml', year: 2024, month: 1 do |s, year:, month:, **|
+  Post.where('extract(year from published_at) = ? AND extract(month from published_at) = ?', year, month)
+      .find_each { |p| s.add("/posts/#{p.slug}", lastmod: p.updated_at) }
+end
+```
+
+The kwargs passed to `process` are **defaults**; the real values come from `Runner#enqueue`:
+
+```ruby
+runner = SiteMaps.generate(config_file: 'config/sitemap.rb')
+runner.enqueue(:monthly_posts, year: 2024, month: 1)
+runner.enqueue(:monthly_posts, year: 2024, month: 2)
+runner.enqueue(:monthly_posts, year: 2024, month: 3)
+runner.run
+```
+
+Or from the CLI:
+
+```bash
+bundle exec site_maps generate monthly_posts \
+  --config-file config/sitemap.rb \
+  --context=year:2024 month:1
+```
+
+## Execution model
+
+When you call `runner.run`:
+
+1. Each enqueued process is wrapped in a `Concurrent::Future`.
+2. The pool (default 4 threads, configurable via `--max-threads`) runs them in parallel.
+3. Each process builds a `URLSet`. When the set fills up (50,000 links, 1,000 news items, or 50 MB uncompressed), it's finalized and written, and a new URLSet starts — automatically.
+4. After every process finishes, the sitemap index is aggregated and written to `config.url`.
+
+## Splitting rules
+
+A URL set is finalized and rolled over when **any** of these apply:
+
+- Links reach `config.max_links` (default 50,000 — the sitemap spec limit).
+- News entries reach 1,000.
+- Uncompressed XML reaches 50 MB.
+
+Split files are named by `IncrementalLocation`: `posts/sitemap.xml` becomes `posts/sitemap1.xml`, `posts/sitemap2.xml`, etc.
+
+## Index generation
+
+A sitemap index is produced when:
+
+- More than one process exists,
+- A single process was split across multiple files, or
+- External sitemaps were added.
+
+Otherwise a single `urlset` is written directly at `config.url` (the "inline" optimization).
+
+## Adding external sitemaps
+
+Reference third-party or pre-existing sitemaps in the index:
+
+```ruby
+SiteMaps.use(:file_system) do
+  configure { |c| c.url = 'https://example.com/sitemap.xml'; c.directory = 'public' }
+
+  external_sitemap('https://cdn.example.com/legacy-sitemap.xml', lastmod: Time.parse('2024-01-15'))
+
+  process { |s| s.add('/') }
+end
+```
+
+## Shared helpers across processes
+
+Use `extend_processes_with` to add methods that every process block can call:
+
+```ruby
+module Helpers
+  def post_path(post) = "/posts/#{post.slug}"
+  def published_posts = Post.where.not(published_at: nil)
+end
+
+SiteMaps.use(:file_system) do
+  configure { |c| c.url = 'https://example.com/sitemap.xml'; c.directory = 'public' }
+  extend_processes_with(Helpers)
+
+  process :posts do |s|
+    published_posts.find_each { |p| s.add(post_path(p), lastmod: p.updated_at) }
+  end
+end
+```
+
+## URL filters
+
+Filters run per URL inside every process — use them for global exclusions or default attributes:
+
+```ruby
+SiteMaps.use(:file_system) do
+  configure { |c| c.url = 'https://example.com/sitemap.xml'; c.directory = 'public' }
+
+  # Exclude any /admin path
+  url_filter { |url, _options| false if url.include?('/admin') }
+
+  # Boost blog priority
+  url_filter do |url, options|
+    if url.include?('/blog/')
+      options.merge(priority: 0.9, changefreq: 'daily')
+    else
+      options
+    end
+  end
+
+  process { |s| ... }
+end
+```
+
+A filter returning `false` (or `nil`) excludes the URL entirely. Returning a hash replaces the options.
+
+## Re-running a single shard
+
+Only regenerate what changed — the rest is preserved from the existing sitemap index:
+
+```ruby
+runner = SiteMaps.generate(config_file: 'config/sitemap.rb')
+runner.enqueue(:monthly_posts, year: 2024, month: 3)  # only March
+runner.run                                            # Jan and Feb kept as-is
+```
+
+This is the main advantage of parameterized dynamic processes: you can rebuild one month's shard on a cron and leave the rest untouched.
diff --git a/docs/rails.md b/docs/rails.md
new file mode 100644
index 0000000..ed93bf3
--- /dev/null
+++ b/docs/rails.md
@@ -0,0 +1,128 @@
+# Rails Integration
+
+The Railtie loads automatically when Rails is present. It wires two things:
+
+1. **URL helpers** — `s.route.` inside process blocks.
+2. **No other magic** — no initializer, no autoloaded directories, no patched generators.
+
+## URL helpers in processes
+
+```ruby
+# config/sitemap.rb
+SiteMaps.use(:file_system) do
+  configure do |config|
+    config.url = 'https://example.com/sitemap.xml'
+    config.directory = Rails.public_path.to_s
+  end
+
+  process do |s|
+    s.add(s.route.root_path, priority: 1.0)
+    s.add(s.route.about_path)
+    Post.find_each { |p| s.add(s.route.post_path(p), lastmod: p.updated_at) }
+  end
+end
+```
+
+`s.route` is a singleton wrapping `Rails.application.routes.url_helpers`.
+
+## Generating from Rails
+
+### One-off
+
+```bash
+bundle exec site_maps generate --config-file config/sitemap.rb
+```
+
+The CLI auto-requires `config/environment.rb` if it finds a `config/application.rb`, so ActiveRecord, URL helpers, and everything else loads as normal.
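A standalone script that calls `SiteMaps.generate` directly does not go through the CLI, so it would need to load the Rails environment itself. The following is a minimal sketch of that kind of detection, not the gem's actual implementation; `rails_environment_file` is a hypothetical helper name, and the only assumption is the conventional Rails layout (`config/application.rb` next to `config/environment.rb`).

```ruby
# Hypothetical helper (not part of site_maps): returns the path of the Rails
# environment file when the directory looks like a Rails app, otherwise nil.
def rails_environment_file(root = Dir.pwd)
  application = File.join(root, 'config', 'application.rb')
  environment = File.join(root, 'config', 'environment.rb')

  # Both files must exist before attempting to boot Rails.
  return nil unless File.exist?(application) && File.exist?(environment)

  environment
end

# Boot Rails only when an app was detected in the current directory.
env_file = rails_environment_file
require env_file if env_file
```

Returning `nil` instead of raising keeps the same script usable in non-Rails projects, where there is nothing to boot.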
+
+### From a Rake task
+
+```ruby
+# lib/tasks/sitemap.rake
+namespace :sitemap do
+  desc 'Generate sitemaps'
+  task generate: :environment do
+    runner = SiteMaps.generate(config_file: Rails.root.join('config/sitemap.rb').to_s)
+    runner.enqueue_all.run
+  end
+end
+```
+
+Run on deploy or via cron:
+
+```bash
+bundle exec rake sitemap:generate
+```
+
+### From a scheduled job
+
+```ruby
+class SitemapJob < ApplicationJob
+  def perform
+    runner = SiteMaps.generate(config_file: Rails.root.join('config/sitemap.rb').to_s)
+    runner.enqueue_all.run
+  end
+end
+
+SitemapJob.set(cron: '0 3 * * *').perform_later
+```
+
+## Serving generated sitemaps
+
+Add the Rack middleware to serve files generated by the `:file_system` adapter:
+
+```ruby
+# config/application.rb
+config.middleware.use SiteMaps::Middleware, adapter: -> { SiteMaps.current_adapter }
+```
+
+See [middleware.md](middleware.md) for options.
+
+## Asset precompile integration
+
+If you want sitemaps regenerated on every deploy, hook into `assets:precompile`:
+
+```ruby
+# lib/tasks/sitemap.rake
+Rake::Task['assets:precompile'].enhance(['sitemap:generate'])
+```
+
+## robots.txt
+
+```erb
+<%# public/robots.txt.erb or app/views/robots.text.erb %>
+User-agent: *
+Disallow: /admin
+
+<%= SiteMaps::RobotsTxt.sitemap_directive('https://example.com/sitemap.xml') %>
+```
+
+## Multi-tenant
+
+`SiteMaps.define` gives you a generation function parameterized by runtime context:
+
+```ruby
+# config/sitemap.rb
+SiteMaps.define do |tenant:|
+  use(:file_system) do
+    configure do |config|
+      config.url = "https://#{tenant.domain}/sitemap.xml"
+      config.directory = tenant.public_path
+    end
+
+    process { |s| tenant.pages.each { |page| s.add(page.path, lastmod: page.updated_at) } }
+  end
+end
+```
+
+```ruby
+Tenant.find_each do |tenant|
+  SiteMaps.generate(config_file: 'config/sitemap.rb', context: { tenant: tenant }).enqueue_all.run
+end
+```
+
+The context hash is splatted into the `define` block as keyword args.
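The splatting itself is plain Ruby and can be sketched without the gem. In the example below, `definition` is a hypothetical stand-in for the block given to `SiteMaps.define`, and the `locale:` keyword is an illustrative addition that does not appear in the docs; nothing here reflects the gem's internals.

```ruby
# Hypothetical stand-in for a definition block; not site_maps internals.
definition = lambda do |tenant:, locale: 'en'|
  "https://#{tenant}/#{locale}/sitemap.xml"
end

# Runtime context, shaped like the hash passed as `context:`.
context = { tenant: 'acme.example.com' }

# The double splat turns the hash's symbol keys into keyword arguments.
url = definition.call(**context)
puts url # prints https://acme.example.com/en/sitemap.xml
```

With a lambda like this one, a key in the hash that has no matching keyword parameter raises `ArgumentError`, so it pays to keep the context keys and the block signature in sync.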
+
+## Dependencies
+
+- Rails is **not** listed in the gemspec. The Railtie is loaded only if Rails is already present. If you're using `site_maps` in a non-Rails Ruby project, the Rails-specific pieces are inert.

From 26e2619f51c34c46ef792bd26b3949410bf41824 Mon Sep 17 00:00:00 2001
From: "Marcos G. Zimmermann"
Date: Sun, 19 Apr 2026 09:06:43 -0300
Subject: [PATCH 2/2] docs: link to gems.marcosz.com.br/site_maps documentation

---
 README.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/README.md b/README.md
index ae95e49..fb2652c 100644
--- a/README.md
+++ b/README.md
@@ -4,8 +4,13 @@ A concurrent, incremental sitemap generator for Ruby. Framework-agnostic with bu
 
 Generates SEO-optimized XML sitemaps with support for sitemap indexes, XSL stylesheets, gzip compression, image/video/news extensions, search engine pinging, and Rack middleware for serving sitemaps with proper HTTP headers.
 
+## Documentation
+
+Full guides, adapter reference, CLI docs, and recipes are published at **[gems.marcosz.com.br/site_maps](https://gems.marcosz.com.br/site_maps/)** — part of the [marcosgz Ruby gem catalogue](https://gems.marcosz.com.br).
+
 ## Table of Contents
 
+- [Documentation](#documentation)
 - [Installation](#installation)
 - [Quick Start](#quick-start)
 - [Configuration](#configuration)