diff --git a/README.md b/README.md index dfe7c0b..1020f7c 100644 --- a/README.md +++ b/README.md @@ -18,6 +18,10 @@ This gem is a Ruby simple and extremely flexible client for ElasticSearch and Op - [esse-redis_storage](https://github.com/marcosgz/esse-redis_storage) - Add-on for the esse gem to be used with Redis as a storage backend - [esse-async_indexing](https://github.com/marcosgz/esse-async_indexing) - Esse extension to allow async indexing using Faktory or Sidekiq. +## Documentation + +Full guides, recipes, and API reference are published at **[gems.marcosz.com.br/esse](https://gems.marcosz.com.br/esse/)** — part of the [marcosgz Ruby gem catalogue](https://gems.marcosz.com.br). + ## Components The main idea of the gem is to be compatible with any type of datasource. It means that you can use it with ActiveRecord, Sequel, HTTP APIs, or any other data source. The gem is divided into three main components: diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..942aa8f --- /dev/null +++ b/docs/README.md @@ -0,0 +1,121 @@ +# Esse Documentation + +Esse is a pure Ruby, framework-agnostic ElasticSearch/OpenSearch client gem that provides an **ETL (Extract, Transform, Load) architecture** for managing search indices. + +It is built on top of the official [`elasticsearch-ruby`](https://github.com/elastic/elasticsearch-ruby) and [`opensearch-ruby`](https://github.com/opensearch-project/opensearch-ruby) clients. 
+ +## Documentation Index + +### Core Concepts + +| Guide | Description | +|-------|-------------| +| [Getting Started](getting-started.md) | Installation, configuration and your first index | +| [Configuration](configuration.md) | Global config and cluster management | +| [Index](index.md) | Defining indices, settings, mappings, lifecycle | +| [Repository](repository.md) | Data loading through collections and documents | +| [Document](document.md) | Document classes and variants | +| [Collection](collection.md) | Iterating over data sources | +| [Search](search.md) | Query DSL, response wrapping, scrolls | +| [Import](import.md) | Bulk import pipeline, retries, batching | +| [Transport](transport.md) | Low-level ES/OS client wrapper | +| [Events](events.md) | Pub/sub and instrumentation | +| [Plugins](plugins.md) | Plugin system and how to write custom plugins | +| [CLI](cli.md) | `esse` command-line reference | +| [Errors](errors.md) | Exception hierarchy | + +### Ecosystem + +| Extension | Purpose | +|-----------|---------| +| [esse-active_record](../../esse-active_record/docs/README.md) | ActiveRecord integration | +| [esse-sequel](../../esse-sequel/docs/README.md) | Sequel ORM integration | +| [esse-rails](../../esse-rails/docs/README.md) | Rails instrumentation | +| [esse-async_indexing](../../esse-async_indexing/docs/README.md) | Background indexing (Sidekiq/Faktory) | +| [esse-hooks](../../esse-hooks/docs/README.md) | Hook/callback state management | +| [esse-jbuilder](../../esse-jbuilder/docs/README.md) | Jbuilder-based search templates | +| [esse-kaminari](../../esse-kaminari/docs/README.md) | Kaminari pagination | +| [esse-pagy](../../esse-pagy/docs/README.md) | Pagy pagination | + +See [extensions.md](extensions.md) for the complete list. 
+ +## Architecture Overview + +Esse is structured around three core components that form an ETL pipeline: + +``` +Collection (yields batches of raw objects) + ↓ +Repository (serializes batches → Documents) + ↓ +Import pipeline (builds bulk requests with retries) + ↓ +Transport (sends bulk to ES/OS) +``` + +### Core Components + +- **Index** (`Esse::Index`) — defines settings, mappings, aliases, and orchestrates operations. +- **Repository** (`Esse::Repository`) — declares a `collection` and a `document` serializer; one index may have many repositories. +- **Document** (`Esse::Document`) — a single indexable document with `id`, `type`, `routing`, `source`, `meta`. + +### Supporting Components + +- **Cluster** — ES/OS client connections, index prefix, readonly mode. +- **Transport** — wraps the official client with consistent error handling. +- **Search** — query DSL builder with response wrapping and scroll support. +- **Events** — pub/sub instrumentation for every ES/OS operation. +- **CLI** — Thor-based CLI (`esse install`, `esse generate`, `esse index *`). +- **Plugins** — extension system loaded via `plugin :name`. 
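The pipeline above can be sketched in plain Ruby. This is a toy simulation of the flow's shape only; the record data, `serialize` lambda, and hash layout are illustrative, not Esse's internal API:

```ruby
# Toy model of the ETL pipeline: Collection -> Repository -> bulk payloads.
# All names and structures here are illustrative, not Esse internals.
records = [{ id: 1, name: 'John' }, { id: 2, name: 'Jane' }, { id: 3, name: 'Bob' }]

# Collection: yields batches of raw objects
collection = Enumerator.new do |yielder|
  records.each_slice(2) { |batch| yielder << batch }
end

# Repository: serializes a raw record into a document hash
serialize = ->(record) { { _id: record[:id], name: record[:name] } }

# Import pipeline: one bulk body per batch
bulk_requests = collection.map do |batch|
  batch.map { |record| { index: serialize.call(record) } }
end

bulk_requests.size # => 2 (a batch of two documents, then a batch of one)
```

Batching at the collection level is what lets the import pipeline send one `_bulk` request per batch instead of one request per record.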
+ +## Quick Example + +```ruby +# config/esse.rb +Esse.configure do |config| + config.cluster(:default) do |cluster| + cluster.client = Elasticsearch::Client.new(url: ENV['ELASTICSEARCH_URL']) + end +end + +# app/indices/users_index.rb +class UsersIndex < Esse::Index + settings do + { index: { number_of_shards: 2, number_of_replicas: 1 } } + end + + mappings do + { properties: { + name: { type: 'text' }, + email: { type: 'keyword' } + } } + end + + repository :user do + collection do |**ctx, &block| + User.find_in_batches(batch_size: 1000) { |batch| block.call(batch, ctx) } + end + + document do |user, **| + { _id: user.id, name: user.name, email: user.email } + end + end +end + +# Use: +UsersIndex.create_index(alias: true) +UsersIndex.import +UsersIndex.search(q: 'john').results +``` + +## Version + +Current version: **0.4.0** + +Requires Ruby **>= 2.7** and one of: +- `elasticsearch` (any version from 1.x to 8.x) +- `opensearch` (any version from 1.x to 2.x) + +## License + +MIT — see [LICENSE.txt](../LICENSE.txt). diff --git a/docs/cli.md b/docs/cli.md new file mode 100644 index 0000000..4b886ef --- /dev/null +++ b/docs/cli.md @@ -0,0 +1,211 @@ +# CLI + +Esse ships with a Thor-based CLI called `esse`. It handles scaffolding, index lifecycle, and bulk operations. + +## Running + +```bash +bundle exec esse [options] +``` + +Or as a standalone executable: + +```bash +gem install esse +esse +``` + +## Global options + +| Flag | Description | +|------|-------------| +| `--require FILE`, `-r FILE` | Require a file before executing (useful for a custom config) | +| `--silent`, `-s` | Suppress event output | +| `--version`, `-v` | Print version | +| `--help` | Print help | + +## Configuration paths + +The CLI auto-loads the first existing file from: + +1. `Essefile` +2. `config/esse.rb` +3. 
`config/initializers/esse.rb` + +In a Rails app, add `require 'esse/rails'` (via the [esse-rails](../../esse-rails/docs/README.md) gem) and the Rails environment is loaded automatically. + +## Commands + +### `esse install` + +Generate a configuration file template: + +```bash +bundle exec esse install +bundle exec esse install --path config/my_esse.rb +``` + +Options: + +| Flag | Default | Description | +|------|---------|-------------| +| `--path` / `-p` | `config/esse.rb` | Target path | + +### `esse generate index <IndexClass>` + +Scaffold a new index class: + +```bash +bundle exec esse generate index UsersIndex +``` + +Creates `app/indices/users_index.rb` (or wherever `Esse.config.indices_directory` points) with a template. + +--- + +### `esse index reset <IndexClass>` + +Zero-downtime index reset — create a new concrete index, import data, swap the alias, and remove the old one. + +```bash +bundle exec esse index reset UsersIndex --suffix 20240401 --optimize +``` + +| Option | Default | Description | +|--------|---------|-------------| +| `--suffix` | timestamp | Concrete index suffix | +| `--import` | `true` | Import from collection | +| `--reindex` | `false` | Use ES `_reindex` API instead of importing | +| `--optimize` | `true` | Temporarily drop replicas/refresh for faster bulk | +| `--settings` | — | JSON/hash to override index settings | +| `--preload_lazy_attributes` | — | Preload lazy attributes via search before import | +| `--eager_load_lazy_attributes` | — | Resolve lazy attributes during bulk | +| `--update_lazy_attributes` | — | Partial-update lazy attributes after bulk | + +### `esse index create <IndexClass>` + +Create an index without importing: + +```bash +bundle exec esse index create UsersIndex --alias +``` + +| Option | Default | Description | +|--------|---------|-------------| +| `--suffix` | — | Concrete suffix | +| `--alias` | `false` | Also create the alias pointing at the index | +| `--settings` | — | Override settings | + +### `esse index delete <IndexClass>` + +```bash +bundle exec esse index delete UsersIndex --suffix 20240401 +``` + +### `esse index import <IndexClass>` + +```bash +bundle exec esse index import UsersIndex \ + --repo user \ + --context active:true,region:us \ + --suffix 20240401 +``` + +| Option | Description | +|--------|-------------| +| `--repo` | Specific repository (omit to import all) | +| `--suffix` | Target index suffix | +| `--context` | Context hash for collection filtering | +| `--preload_lazy_attributes` | Preload via search | +| `--eager_load_lazy_attributes` | Resolve during bulk | +| `--update_lazy_attributes` | Refresh as partial updates | + +### `esse index open <IndexClass>` / `esse index close <IndexClass>` + +Open or close the index: + +```bash +bundle exec esse index close UsersIndex +bundle exec esse index open UsersIndex +``` + +### `esse index update_aliases <IndexClass>` + +Point the alias at one or more concrete indices: + +```bash +bundle exec esse index update_aliases UsersIndex --suffix 20240401 +bundle exec esse index update_aliases UsersIndex --suffix v1,v2 +``` + +### `esse index update_settings <IndexClass>` + +```bash +bundle exec esse index update_settings UsersIndex --settings number_of_replicas:2 +``` + +### `esse index update_mapping <IndexClass>` + +```bash +bundle exec esse index update_mapping UsersIndex +bundle exec esse index update_mapping UsersIndex --suffix 20240401 +``` + +### `esse index update_lazy_attributes <IndexClass> [attr ...]` + +Refresh specific lazy attributes without a full reindex: + +```bash +bundle exec esse index update_lazy_attributes UsersIndex comment_count follower_count \ + --repo user \ + --context active:true +``` + +| Option | Description | +|--------|-------------| +| `--repo` | Repository name (required when multiple) | +| `--suffix` | Target suffix | +| `--context` | Collection context | +| `--bulk_options` | Extra ES bulk options | + +## Extension commands + +Gems like [esse-async_indexing](../../esse-async_indexing/docs/README.md) add more subcommands: + +```bash +bundle exec esse index async_import UsersIndex --service sidekiq 
+bundle exec esse index async_update_lazy_attributes UsersIndex comment_count --service sidekiq +``` + +Consult each extension's documentation for details. + +## Exit codes + +- `0` — success +- non-zero — any error + +Use `--silent` in CI to reduce noise. + +## Example workflows + +### First-time bootstrap + +```bash +bundle exec esse install +# edit config/esse.rb +bundle exec esse generate index UsersIndex +# edit app/indices/users_index.rb +bundle exec esse index reset UsersIndex +``` + +### Daily reindex via cron + +```bash +bundle exec esse index reset UsersIndex --suffix $(date +%Y%m%d) --optimize +``` + +### Partial refresh + +```bash +bundle exec esse index update_lazy_attributes UsersIndex comment_count +``` diff --git a/docs/collection.md b/docs/collection.md new file mode 100644 index 0000000..c87ac4e --- /dev/null +++ b/docs/collection.md @@ -0,0 +1,133 @@ +# Collection + +A **collection** describes how to enumerate the source records of a repository, yielding **batches** rather than individual records. Batching is crucial for bulk indexing performance. + +## Block form + +The simplest form is a block declared inside a repository: + +```ruby +repository :user do + collection do |**context, &block| + User.where(context).find_in_batches(batch_size: 1_000) do |batch| + block.call(batch, context) + end + end +end +``` + +The block receives keyword `context` (passed through from import calls) and a `block` that you call with `(batch_array, context)`. + +## Class form + +Inherit from `Esse::Collection` for more structure: + +```ruby +class MyCollection < Esse::Collection + def each + raw_records.each_slice(@params[:batch_size] || 1_000) do |batch| + yield(batch, @params) + end + end + + # Optional: yield only IDs in batches. Used by async indexing and lazy attribute refresh. + def each_batch_ids + raw_records.each_slice(@params[:batch_size] || 1_000) do |batch| + yield(batch.map(&:id)) + end + end + + private + + def raw_records + # ... 
+ end +end + +repository :user do + collection MyCollection +end +``` + +When the collection is instantiated, it receives the context hash as `@params`: + +```ruby +MyCollection.new(batch_size: 500, active: true) +``` + +## Contract + +| Method | Required | Description | +|--------|----------|-------------| +| `each { \|batch, context\| }` | ✅ | Yield each batch of raw records | +| `each_batch_ids { \|ids\| }` | Recommended | Yield batches of IDs for efficient async / partial update operations | +| `count` / `size` | Optional | Total record count | + +`Esse::Collection` includes `Enumerable`, so you get `map`, `select`, `first`, etc. for free once `each` is defined. + +## Why `each_batch_ids` matters + +Extensions like [esse-async_indexing](../../esse-async_indexing/docs/README.md) rely on this method to enqueue ID-only jobs that don't hold raw record payloads in memory. If your repository only defines `each`, async indexing won't be able to kick off import jobs from the CLI. + +If you use [esse-active_record](../../esse-active_record/docs/README.md) or [esse-sequel](../../esse-sequel/docs/README.md), the plugin provides both methods for you: + +```ruby +collection ::User # ActiveRecord — both each and each_batch_ids available +``` + +## Context passing + +Whatever keyword arguments are passed as `context:` during the import flow are forwarded to the collection: + +```ruby +UsersIndex.import(context: { active: true, region: 'us' }) +``` + +…arrives in the collection as: + +```ruby +collection do |**context, &block| + # context => { active: true, region: 'us' } + User.where(active: context[:active], region: context[:region]) + .find_in_batches { |b| block.call(b, context) } +end +``` + +The `context` is then forwarded to the `document` block unchanged. 
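The forwarding can be simulated without the gem at all. In this plain-Ruby sketch, `collection_block` and `document_block` are stand-ins for the repository's two blocks, not Esse API:

```ruby
# Plain-Ruby simulation of context flowing import -> collection -> document.
# collection_block / document_block are illustrative stand-ins, not Esse API.
collection_block = lambda do |**context, &block|
  data = [{ id: 1 }, { id: 2 }]
  data.each_slice(2) { |batch| block.call(batch, context) }
end

document_block = ->(record, **context) { { _id: record[:id], region: context[:region] } }

docs = []
collection_block.call(active: true, region: 'us') do |batch, context|
  # The same context hash the caller passed in arrives here untouched
  batch.each { |record| docs << document_block.call(record, **context) }
end

docs # => [{ _id: 1, region: 'us' }, { _id: 2, region: 'us' }]
```

The context hash makes one unbroken round trip: whatever the import call supplies, both the collection and the per-record serializer can read.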
+ +## Custom batching metadata + +You can yield additional metadata alongside the batch for the `document` block to consume: + +```ruby +collection do |**ctx, &block| + Order.find_in_batches do |orders| + # Bulk-fetch related data once per batch + customers = Customer.where(id: orders.map(&:customer_id)).index_by(&:id) + block.call(orders, ctx.merge(customers: customers)) + end +end + +document do |order, customers: {}, **| + customer = customers[order.customer_id] + { _id: order.id, customer_name: customer&.name } +end +``` + +This "batch context" pattern is the recommended way to avoid N+1 lookups inside serialization. + +## ORM integrations + +The ORM extensions turn common patterns into DSL: + +- [esse-active_record](../../esse-active_record/docs/README.md) adds `collection Model` with `scope` / `batch_context` / `connect_with`. +- [esse-sequel](../../esse-sequel/docs/README.md) provides an identical DSL for Sequel. + +```ruby +collection ::User, batch_size: 500 do + scope :active, -> { where(active: true) } + batch_context :orders do |users, **| + Order.where(user_id: users.map(&:id)).group_by(&:user_id) + end +end +``` diff --git a/docs/configuration.md b/docs/configuration.md new file mode 100644 index 0000000..ee41906 --- /dev/null +++ b/docs/configuration.md @@ -0,0 +1,164 @@ +# Configuration + +Esse is configured via `Esse.configure`. The configuration holds global settings (like `indices_directory`) and one or more **clusters**. 
+ +## Basic structure + +```ruby +Esse.configure do |config| + config.indices_directory = 'app/indices' + config.bulk_wait_interval = 0.1 + + config.cluster(:default) do |cluster| + cluster.client = Elasticsearch::Client.new(url: 'http://localhost:9200') + end +end +``` + +## Global options + +| Option | Default | Description | +|--------|---------|-------------| +| `indices_directory` | `'app/indices'` | Where index files live (used by `Esse.eager_load_indices!`) | +| `bulk_wait_interval` | `0.1` | Seconds to wait between bulk pages to avoid back-pressure | + +## Clusters + +A **cluster** is a named connection to an Elasticsearch/OpenSearch deployment. Every index is attached to a cluster (defaulting to `:default`). + +### Defining a cluster + +```ruby +config.cluster(:default) do |cluster| + cluster.client = Elasticsearch::Client.new(url: 'http://localhost:9200') + cluster.index_prefix = 'myapp' + cluster.readonly = false + cluster.wait_for_status = 'yellow' + + cluster.settings = { + analysis: { analyzer: { default: { type: 'standard' } } } + } + + cluster.mappings = { + properties: { + created_at: { type: 'date' } + } + } +end +``` + +### Cluster attributes + +| Attribute | Type | Description | +|-----------|------|-------------| +| `client` | ES/OS client instance | Required. An instance of `Elasticsearch::Client` or `OpenSearch::Client`. | +| `index_prefix` | `String` | Prefix applied to all index names (`"myapp"` → `myapp_users`) | +| `settings` | `Hash` | Global settings merged into every index | +| `mappings` | `Hash` | Global mappings merged into every index | +| `wait_for_status` | `'green' \| 'yellow' \| 'red'` | Wait until cluster reaches this status before operations | +| `readonly` | `Boolean` | When `true`, write operations raise `Esse::Transport::ReadonlyClusterError` | + +You can also assign attributes in bulk via `cluster.assign(hash)`. 
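The bulk-assign mechanics boil down to mapping each hash key to the matching writer. The class below is a hypothetical stand-in (the real attribute set lives in `Esse::Cluster`); it only illustrates the pattern:

```ruby
# Hypothetical sketch of a bulk-assign helper: each hash key is routed to
# the matching attr writer. Not Esse's actual implementation.
class ClusterSettings
  attr_accessor :index_prefix, :readonly, :wait_for_status

  def assign(attrs)
    # assign(index_prefix: 'x') behaves like self.index_prefix = 'x'
    attrs.each { |key, value| public_send("#{key}=", value) }
    self
  end
end

cluster = ClusterSettings.new.assign(index_prefix: 'myapp', readonly: true)
cluster.index_prefix # => "myapp"
```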
+ +### Multiple clusters + +```ruby +Esse.configure do |config| + config.cluster(:default) do |c| + c.client = Elasticsearch::Client.new(hosts: %w[es-primary:9200]) + end + + config.cluster(:analytics) do |c| + c.client = Elasticsearch::Client.new(hosts: %w[es-analytics:9200]) + c.index_prefix = 'analytics' + end +end + +class UsersIndex < Esse::Index + self.cluster_id = :default # default +end + +class EventsIndex < Esse::Index + self.cluster_id = :analytics +end +``` + +Set the default cluster for all index subclasses: + +```ruby +Esse::Index.cluster_id = :v1 +``` + +### Accessing a cluster + +```ruby +Esse.cluster # => cluster :default +Esse.cluster(:analytics) # => named cluster +Esse.config.cluster_ids # => [:default, :analytics] + +UsersIndex.cluster # => cluster attached to the index +UsersIndex.cluster_id # => :default +``` + +### Engine detection + +Esse auto-detects the running ES/OS distribution and version via `ClusterEngine`: + +```ruby +engine = UsersIndex.cluster.engine +engine.elasticsearch? # => true / false +engine.opensearch? # => true / false +engine.engine_version # => "8.12.0" +engine.mapping_single_type? # => true for ES >= 6 +engine.mapping_default_type # => :_doc | :doc | nil +``` + +Esse uses this information to stay compatible across ES/OS versions without you having to branch. + +## Readonly mode + +Readonly mode is a hard safety switch. Any write operation (create, delete, update, bulk, reset) on a readonly cluster raises `Esse::Transport::ReadonlyClusterError`. Reads work normally. + +```ruby +config.cluster(:default) do |c| + c.client = Elasticsearch::Client.new + c.readonly = Rails.env.production? && ENV['READ_ONLY_MODE'] == 'true' +end +``` + +This is useful during migrations or to protect replicas. + +## Waiting for status + +```ruby +config.cluster(:default) do |c| + c.wait_for_status = 'yellow' +end + +Esse.cluster.wait_for_status! # called before risky ops like reset +``` + +Valid values: `'green'`, `'yellow'`, `'red'`. 
Set to `nil` to disable. + +## Loading from YAML + +`Esse::Config#load` accepts a file path, Pathname, or Hash: + +```ruby +Esse.config.load('config/esse.yml') +``` + +The YAML is parsed and applied to `Esse.config`. Keys at the root become cluster definitions, and top-level keys map to config attributes. + +## Thread safety + +Mutable state is guarded by `Esse.synchronize`. You can set `Esse.instance_variable_set(:@single_threaded, true)` to skip locking in single-threaded tests. + +## Logging + +```ruby +Esse.logger = Logger.new($stdout) +Esse.logger = nil # silent (File::NULL) +``` + +See [Events](events.md) for richer observability. diff --git a/docs/document.md b/docs/document.md new file mode 100644 index 0000000..ebad101 --- /dev/null +++ b/docs/document.md @@ -0,0 +1,166 @@ +# Document + +`Esse::Document` is the in-memory representation of a single indexable record. Repositories produce documents; the import pipeline sends them to Elasticsearch. + +In most cases you don't interact with `Esse::Document` directly — returning a Hash from your repository `document` block is enough. But understanding the contract is useful for custom document classes, partial updates, and lazy attributes. + +## The contract + +Every document has these (optionally overridable) methods: + +| Method | Description | +|--------|-------------| +| `id` | String/Integer/nil. Documents with `nil` id are ignored on index/delete. | +| `type` | String or nil. Legacy ES type (modern clusters use `_doc` by default). | +| `routing` | String or nil. Used for custom routing. | +| `source` | Hash. The actual document body. | +| `meta` | Hash. Extra metadata merged into bulk operation headers. 
| + +Instance methods: + +| Method | Returns | +|--------|---------| +| `object` | The raw record the document was built from | +| `options` | Frozen hash of construction options | +| `to_h` | `{ _id, _type, _routing, ...meta, ...source }` | +| `to_bulk(operation:, data:)` | Bulk-formatted hash | +| `doc_header` | `{ _id, _type, routing }` | +| `ignore_on_index?` | `true` when `id.nil?` | +| `ignore_on_delete?` | `true` when `id.nil?` | + +## Writing a custom Document class + +```ruby +class UserDocument < Esse::Document + def id + object.id + end + + def routing + object.tenant_id + end + + def source + { + name: object.name, + email: object.email, + roles: object.role_names + } + end +end + +# Use it from a repository +repository :user do + collection { |**c, &b| User.find_in_batches { |u| b.call(u, c) } } + document UserDocument +end +``` + +## Variants + +### HashDocument + +Auto-extracts header fields from a hash. Most repository `document` blocks implicitly use `HashDocument`: + +```ruby +doc = Esse::HashDocument.new( + _id: 'abc', + _routing: 'shard-1', + name: 'John', + email: 'john@example.com' +) + +doc.id # => 'abc' +doc.routing # => 'shard-1' +doc.source # => { name: 'John', email: 'john@example.com' } +``` + +ID is looked up in this order: `_id`, `id`, `:_id`, `:id`. Reserved keys (`_id`, `_type`, `_routing`, `routing`) are stripped from `source`. + +### NullDocument + +A no-op document, used to skip indexing from inside a `document` block: + +```ruby +document do |record, **| + next Esse::NullDocument.new if record.deleted? + { _id: record.id, name: record.name } +end +``` + +`NullDocument#id` is `nil`, so it's automatically ignored by bulk operations. + +### DocumentForPartialUpdate + +Used when updating a subset of fields on an already-indexed document. 
Usually created via: + +```ruby +doc = original_doc.document_for_partial_update(comment_count: 42) + +doc.source # => { comment_count: 42 } +# id, type, routing delegate to original_doc +``` + +### LazyDocumentHeader + +A lightweight placeholder that carries only the information needed to target a document (id, type, routing) and arbitrary extra options. It's the object you receive in `lazy_document_attribute` blocks: + +```ruby +lazy_document_attribute :name do |headers| + # headers is Array[Esse::LazyDocumentHeader] + ids = headers.map(&:id) + data = User.where(id: ids).pluck(:id, :name).to_h + headers.each_with_object({}) { |h, acc| acc[h] = data[h.id] } +end +``` + +Coerce any input into a header: + +```ruby +Esse::LazyDocumentHeader.coerce(1) # from ID +Esse::LazyDocumentHeader.coerce('id-123') # from string +Esse::LazyDocumentHeader.coerce(_id: 1, _routing: 'x') +Esse::LazyDocumentHeader.coerce_each([1, 2, 3]) # batch +``` + +## Mutations + +You can mutate a document before indexing: + +```ruby +doc.mutate(:display_name) { "#{object.first_name} #{object.last_name}" } +doc.mutations # => { display_name: "John Doe" } +doc.mutated_source # => source.merge(display_name: "John Doe") +``` + +Mutations are applied when building bulk payloads. + +## Equality + +```ruby +doc1 = UserDocument.new(user) +doc2 = UserDocument.new(user) +doc1.eql?(doc2) # => true if all headers and source match + +# When comparing to a LazyDocumentHeader: +doc1.eql?(header, match_lazy_doc_header: true) +``` + +## Bulk format + +```ruby +doc.to_bulk(operation: :index) +# => { _id: 1, _type: '_doc', data: { name: 'John' } } + +doc.to_bulk(operation: :update, data: { doc: { name: 'Jane' } }) +# => { _id: 1, _type: '_doc', data: { doc: { name: 'Jane' } } } +``` + +## When do you write a custom class? + +Prefer returning a Hash from the repository `document` block for simple cases. Reach for a custom `Esse::Document` subclass when: + +- You want shared logic across many indices. 
+- You need access to construction options (`doc.options[:foo]`). +- You want mutations / partial updates to be derived cleanly. +- You're building a reusable serializer layer (e.g., across Active Record models). diff --git a/docs/errors.md b/docs/errors.md new file mode 100644 index 0000000..82952e1 --- /dev/null +++ b/docs/errors.md @@ -0,0 +1,101 @@ +# Errors + +Esse wraps Elasticsearch/OpenSearch client exceptions into a consistent hierarchy so your code does not need to branch on the ES vs OS SDK. + +## Hierarchy + +``` +Esse::Error +├── Esse::Transport::ServerError # base for any HTTP server error +│ ├── BadRequestError # 400 +│ ├── UnauthorizedError # 401 +│ ├── ForbiddenError # 403 +│ ├── NotFoundError # 404 +│ ├── RequestTimeoutError # 408 +│ ├── ConflictError # 409 +│ ├── RequestEntityTooLargeError # 413 +│ ├── UnprocessableEntityError # 422 +│ ├── InternalServerError # 500 +│ ├── BadGatewayError # 502 +│ ├── ServiceUnavailableError # 503 +│ └── GatewayTimeoutError # 504 +│ +├── Esse::Transport::ReadonlyClusterError # raised on writes when cluster.readonly == true +└── Esse::Transport::BulkResponseError # bulk partial failure +``` + +## Catching errors + +Most common patterns: + +```ruby +begin + UsersIndex.import +rescue Esse::Transport::ReadonlyClusterError + # cluster is in readonly mode — skip +rescue Esse::Transport::BulkResponseError => e + e.response # full response hash + e.items # failed items only +rescue Esse::Transport::ServerError => e + # other HTTP-level failure +rescue Esse::Error => e + # any Esse error +end +``` + +## `NotFoundError` nuance + +`NotFoundError` is raised on document/index `get`, `delete`, etc. Some lifecycle operations silently ignore it (for example deleting an already-missing index), but `UsersIndex.get(id: 1)` will raise when the document is missing. 
+ +Opt out of raising with client-level `ignore`: + +```ruby +UsersIndex.get(id: 1, ignore: [404]) +# returns nil instead of raising +``` + +## `BulkResponseError` + +A bulk request can succeed at the HTTP level while containing per-item failures. Esse inspects the response and raises `BulkResponseError` when any item has an error: + +```ruby +begin + UsersIndex.import +rescue Esse::Transport::BulkResponseError => e + e.items.each do |item| + op, data = item.first + puts "#{op} failed for id=#{data[:_id]}: #{data.dig(:error, :reason)}" + end +end +``` + +## `ReadonlyClusterError` + +Raised preemptively when: + +- `cluster.readonly = true`, **and** +- you call a write operation (`bulk`, `index`, `update`, `delete`, `create_index`, `delete_index`, `reset_index`, etc.). + +Readonly clusters still serve reads (`search`, `count`, `get`, `mget`). + +## Writing your own rescue + +```ruby +def safe_index(record) + UsersIndex.index(id: record.id, body: record.as_indexed) +rescue Esse::Transport::ReadonlyClusterError + Rails.logger.warn("Skipping index — cluster is readonly") +rescue Esse::Transport::RequestEntityTooLargeError + # split and retry — or let Esse's bulk retry handle it +end +``` + +## Coercing client exceptions + +If you write a custom transport plugin or call the ES client directly, wrap calls in: + +```ruby +transport.coerce_exception { client.some_call } +``` + +This converts ES/OS SDK exceptions into the matching `Esse::Transport::*` subclass. diff --git a/docs/events.md b/docs/events.md new file mode 100644 index 0000000..f865a06 --- /dev/null +++ b/docs/events.md @@ -0,0 +1,162 @@ +# Events + +Esse ships with a built-in pub/sub instrumentation layer. Every operation against Elasticsearch/OpenSearch emits a named event with a payload. Subscribe to events to instrument, log, or react. 
+ +## Subscribing + +Two forms are supported: + +### Block subscription + +```ruby +Esse::Events.subscribe('elasticsearch.bulk') do |event| + puts "Bulk request: #{event.payload[:runtime]}s" +end +``` + +### Listener subscription + +An object with `on_event_name` methods (dots replaced with underscores) gets auto-subscribed to the matching events: + +```ruby +class LoggerListener + def on_elasticsearch_bulk(event) + puts "bulk: #{event.payload[:body_size]}b" + end + + def on_elasticsearch_search(event) + puts "search: #{event.payload[:runtime]}s" + end +end + +listener = LoggerListener.new +Esse::Events.subscribe(listener) +Esse::Events.subscribed?(listener) # => true +Esse::Events.unsubscribe(listener) +``` + +## Available events + +All events live under the `elasticsearch.*` namespace. + +### Documents + +| Event | When | +|-------|------| +| `elasticsearch.index` | Single-document index | +| `elasticsearch.update` | Single-document update | +| `elasticsearch.delete` | Single-document delete | +| `elasticsearch.get` | Single-document fetch | +| `elasticsearch.mget` | Multi-get | +| `elasticsearch.exist` | Exists check | +| `elasticsearch.count` | Count query | +| `elasticsearch.bulk` | `_bulk` API call | + +### Indices + +| Event | When | +|-------|------| +| `elasticsearch.create_index` | `indices.create` | +| `elasticsearch.delete_index` | `indices.delete` | +| `elasticsearch.index_exist` | `indices.exists` | +| `elasticsearch.close` | `indices.close` | +| `elasticsearch.open` | `indices.open` | +| `elasticsearch.refresh` | `indices.refresh` | +| `elasticsearch.update_settings` | `indices.put_settings` | +| `elasticsearch.update_mapping` | `indices.put_mapping` | +| `elasticsearch.update_aliases` | `indices.update_aliases` | + +### Search + +| Event | When | +|-------|------| +| `elasticsearch.search` | `_search` API | +| `elasticsearch.execute_search_query` | Internal search execution | +| `elasticsearch.reindex` | `_reindex` API | +| 
`elasticsearch.update_by_query` | `_update_by_query` | +| `elasticsearch.delete_by_query` | `_delete_by_query` | + +### Tasks + +| Event | When | +|-------|------| +| `elasticsearch.tasks` | List tasks | +| `elasticsearch.task` | Single-task query | +| `elasticsearch.cancel_task` | Cancel task | + +## Event payload + +Every event is an object with a `payload` hash. Common keys: + +| Key | Description | +|-----|-------------| +| `:request` | Request parameters sent to ES/OS | +| `:response` | Response hash | +| `:runtime` | Duration in seconds | +| `:error` | Exception if the call failed | +| `:__started_at__` | Internal start time (Time instance) | + +Operation-specific payloads may include `:body_size`, `:document_count`, `:index`, `:type`, etc. + +## Publishing + +Esse itself publishes events via `Esse::Events.instrument`: + +```ruby +Esse::Events.instrument('elasticsearch.bulk') do |payload| + payload[:body_size] = body.bytesize + response = client.bulk(body: body) + payload[:response] = response + response +end +``` + +`instrument` records runtime automatically and ensures the event fires even if the block raises. 
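The "fires even if the block raises" guarantee is the classic ensure-based instrumentation pattern. Here is a self-contained sketch of that pattern (illustrative only, not Esse's actual implementation):

```ruby
# Ensure-based instrumentation: time the block, record errors, and always
# notify subscribers, even when the block raises. Illustrative sketch only.
events = []
notify = ->(payload) { events << payload }

instrument = lambda do |payload = {}, &block|
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  begin
    block.call(payload)
  rescue => e
    payload[:error] = e
    raise # the caller still sees the failure
  ensure
    payload[:runtime] = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    notify.call(payload) # subscribers hear about the call either way
  end
end

instrument.call { |payload| payload[:response] = { ok: true } }

begin
  instrument.call { raise 'boom' }
rescue RuntimeError
  # the error propagated, but the event below was still published
end

events.size # => 2 (one success event, one failure event)
```

Because the publish happens in `ensure`, a subscriber can log failed bulk requests alongside successful ones without any extra error handling.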
+ +You can publish custom events too: + +```ruby +Esse::Events.publish('myapp.reindex_batch', batch_id: 42, count: 1_000) +``` + +## Patterns + +### Log every ES call + +```ruby +Esse::Events.event_names.grep(/^elasticsearch/).each do |event_name| + Esse::Events.subscribe(event_name) do |event| + Rails.logger.debug "#{event_name}: #{event.payload[:runtime]}s" + end +end +``` + +### Track search latency in Rails + +The [esse-rails](../../esse-rails/docs/README.md) gem already subscribes to every `elasticsearch.*` event and surfaces the accumulated runtime in your controller logs: + +``` +Completed 200 OK in 125.3ms (Views: 45.2ms | Search: 78.1ms) +``` + +### Fail-fast on bulk errors + +```ruby +Esse::Events.subscribe('elasticsearch.bulk') do |event| + next unless event.payload[:response] + + errors = event.payload[:response].dig(:items) || [] + failed = errors.select { |i| i.values.first[:error] } + Sentry.capture_message("Bulk had failed items", extra: { failed: failed }) if failed.any? +end +``` + +## Subscription lifecycle + +Subscriptions are stored in-memory for the current process. In a forking server (Puma, Sidekiq), re-subscribe in each worker fork to avoid missing events. + +```ruby +Esse::Events.subscribers # all subscribers +Esse::Events.unsubscribe(subscriber) +Esse::Events.subscribed?(subscriber) +``` diff --git a/docs/extensions.md b/docs/extensions.md new file mode 100644 index 0000000..92d7223 --- /dev/null +++ b/docs/extensions.md @@ -0,0 +1,132 @@ +# Extensions + +Esse is intentionally minimal. Framework and ORM integration, pagination, async indexing, and template engines are delivered as separate gems. Each extension is a plugin (see [Plugins](plugins.md)) and has its own `docs/` directory. + +## ORM integrations + +### [esse-active_record](../../esse-active_record/docs/README.md) +`gem 'esse-active_record'` — ActiveRecord support. Adds `collection Model` DSL, `scope`, `batch_context`, automatic after-commit callbacks, and hook-based disabling. 
+ +```ruby +class UsersIndex < Esse::Index + plugin :active_record + + repository :user do + collection ::User, batch_size: 500 do + scope :active, -> { where(active: true) } + end + document { |u, **| { _id: u.id, name: u.name } } + end +end + +class User < ApplicationRecord + include Esse::ActiveRecord::Model + index_callback 'users_index:user' +end +``` + +### [esse-sequel](../../esse-sequel/docs/README.md) +`gem 'esse-sequel'` — Sequel ORM support with an identical DSL to `esse-active_record`. + +--- + +## Framework integration + +### [esse-rails](../../esse-rails/docs/README.md) +`gem 'esse-rails'` — Rails-specific integration. Subscribes to all `elasticsearch.*` events and surfaces aggregate search latency in controller logs (and Lograge). + +``` +Completed 200 OK in 125.3ms (Views: 45.2ms | Search: 78.1ms) +``` + +Also auto-loads the Rails environment when `esse` CLI runs in a Rails project. + +--- + +## Async indexing + +### [esse-async_indexing](../../esse-async_indexing/docs/README.md) +`gem 'esse-async_indexing'` — Offload indexing to Sidekiq or Faktory. Adds `async_indexing_job` DSL, CLI commands (`esse index async_import`), and ActiveRecord callbacks that enqueue jobs instead of indexing synchronously. + +```ruby +class City < ApplicationRecord + include Esse::AsyncIndexing::ActiveRecord::Model + async_index_callback('geos_index:city', service_name: :sidekiq) { id } +end +``` + +--- + +## Hook management + +### [esse-hooks](../../esse-hooks/docs/README.md) +`gem 'esse-hooks'` — The callback/state layer used by `esse-active_record` and `esse-sequel` to enable/disable indexing globally, per-repository, or per-model. Not used directly by end users most of the time; included here for completeness. + +```ruby +Esse::ActiveRecord::Hooks.without_indexing { 10.times { User.create! } } +``` + +--- + +## Search query templates + +### [esse-jbuilder](../../esse-jbuilder/docs/README.md) +`gem 'esse-jbuilder'` — Build search bodies with Jbuilder templates. 
+ +```ruby +UsersIndex.search do |json| + json.query do + json.match { json.set! 'name', params[:q] } + end +end +``` + +Or from a `.json.jbuilder` file: + +```ruby +body = Esse::Jbuilder::ViewTemplate.call('users/search', q: params[:q]) +UsersIndex.search(body: body) +``` + +--- + +## Pagination + +### [esse-kaminari](../../esse-kaminari/docs/README.md) +`gem 'esse-kaminari'` — Kaminari integration. Adds `.page(n).per(x)` chainable on search queries. + +```ruby +@search = UsersIndex.search(params[:q]).page(params[:page]).per(10) +# View +<%= paginate @search.paginated_results %> +``` + +### [esse-pagy](../../esse-pagy/docs/README.md) +`gem 'esse-pagy'` — Pagy integration with controller helpers. + +```ruby +@pagy, @response = pagy_esse(UsersIndex.pagy_search(params[:q]), items: 10) +``` + +--- + +## Other extensions + +These extensions are not part of this workspace but are mentioned in the main README: + +- **esse-will_paginate** — WillPaginate pagination support. +- **esse-rspec** — RSpec helpers and matchers. +- **esse-redis_storage** — Redis-backed storage for long-running state. + +Visit the [main project](https://github.com/marcosgz/esse) for the complete list. + +--- + +## Writing your own + +Any extension is just a plugin module (see [Plugins](plugins.md)). Package it as a gem if you want to share it. The contract is: + +- Define a module under `Esse::Plugins::YourName`. +- Optionally add `apply(index, **opts, &block)`, `configure(index, ...)`. +- Optionally add `IndexClassMethods` and `RepositoryClassMethods` submodules. +- Publish with a dependency on `esse` >= 0.3.0 (or your required minimum). diff --git a/docs/getting-started.md b/docs/getting-started.md new file mode 100644 index 0000000..518fae3 --- /dev/null +++ b/docs/getting-started.md @@ -0,0 +1,153 @@ +# Getting Started + +This guide walks you from installation through creating and searching your first index. 
+ +## Installation + +Add Esse and one of the official ES/OS clients to your Gemfile: + +```ruby +# Choose ONE of: +gem 'elasticsearch' # Elasticsearch 1.x → 8.x +gem 'opensearch-ruby' # OpenSearch 1.x → 2.x + +gem 'esse' +``` + +Then install: + +```bash +bundle install +``` + +Or install directly: + +```bash +gem install esse elasticsearch +``` + +## Generate the config file + +Esse ships with a CLI to scaffold files. To generate a config file, run: + +```bash +bundle exec esse install +``` + +This creates `config/esse.rb`. Esse automatically loads any of these paths: + +- `Essefile` +- `config/esse.rb` +- `config/initializers/esse.rb` + +## Configure a cluster + +Edit the generated config to register your cluster(s): + +```ruby +# config/esse.rb +require_relative '../environment' unless defined?(Rails) + +Esse.configure do |config| + config.indices_directory = 'app/indices' + + config.cluster(:default) do |cluster| + cluster.client = Elasticsearch::Client.new( + url: ENV.fetch('ELASTICSEARCH_URL', 'http://localhost:9200') + ) + end +end +``` + +Multiple clusters are supported — see [Configuration](configuration.md). + +## Generate an index class + +```bash +bundle exec esse generate index UsersIndex +``` + +This creates `app/indices/users_index.rb` with placeholders for settings, mappings, and a repository. 
+
+Fill it in:
+
+```ruby
+# app/indices/users_index.rb
+class UsersIndex < Esse::Index
+  settings do
+    {
+      index: {
+        number_of_shards: 2,
+        number_of_replicas: 1
+      }
+    }
+  end
+
+  mappings do
+    {
+      properties: {
+        name: { type: 'text' },
+        email: { type: 'keyword' },
+        created_at: { type: 'date' }
+      }
+    }
+  end
+
+  repository :user do
+    collection do |**context, &block|
+      User.where(context).find_in_batches(batch_size: 1_000) do |batch|
+        block.call(batch, context)
+      end
+    end
+
+    document do |user, **|
+      {
+        _id: user.id,
+        name: user.name,
+        email: user.email,
+        created_at: user.created_at
+      }
+    end
+  end
+end
+```
+
+## Create and populate the index
+
+Use the `reset` command to create the index, import data, and set up the alias:
+
+```bash
+bundle exec esse index reset UsersIndex
+```
+
+This performs a zero-downtime reset:
+1. Creates a suffixed concrete index (e.g. `users_20240401`) with your settings/mappings.
+2. Imports data from the repository.
+3. Points the `users` alias at the new index.
+4. Removes the old concrete index.
+
+## Query the index
+
+```ruby
+response = UsersIndex.search(
+  body: {
+    query: {
+      match: { name: 'john' }
+    }
+  }
+)
+
+response.total   # total hits
+response.results # array of hit hashes
+response.each { |hit| puts hit['_source']['name'] }
+```
+
+See [Search](search.md) for the full DSL.
+
+## Next steps
+
+- Learn the [Index](index.md) DSL — settings, mappings, aliases, plugins.
+- Understand [Repositories](repository.md) — collections, serialization, lazy attributes.
+- Explore the [CLI](cli.md) — reset, import, create, update_aliases.
+- Hook into [Events](events.md) for observability.
+- Use an [ORM extension](extensions.md) like `esse-active_record` or `esse-sequel`.
diff --git a/docs/import.md b/docs/import.md new file mode 100644 index 0000000..1cff5e8 --- /dev/null +++ b/docs/import.md @@ -0,0 +1,153 @@ +# Import + +`import` is the process of walking a repository's collection, serializing records to documents, and sending them to Elasticsearch/OpenSearch in batches using the `_bulk` API. + +## Basic usage + +```ruby +UsersIndex.import # all repositories +UsersIndex.import(:user) # specific repository +UsersIndex.import(:user, :admin) # multiple repositories +``` + +Or from the CLI: + +```bash +bundle exec esse index import UsersIndex +bundle exec esse index import UsersIndex --repo user +``` + +## Options + +```ruby +UsersIndex.import( + :user, + suffix: '20240401', # concrete index suffix + context: { active: true }, # passed to collection + document + eager_load_lazy_attributes: [:roles], # include these before bulk + update_lazy_attributes: [:comment_count], # refresh after bulk + preload_lazy_attributes: true # fetch via search before import +) +``` + +| Option | Description | +|--------|-------------| +| `suffix` | Target concrete index (e.g., for zero-downtime reset). Defaults to the alias. | +| `context` | Hash forwarded to the collection and then to the document block. | +| `eager_load_lazy_attributes` | Resolve these lazy attributes during import and include them in the bulk payload. Use `true` for all. | +| `update_lazy_attributes` | After the initial bulk, send partial updates for these attributes. | +| `preload_lazy_attributes` | If `true`, use the search API to pre-read current lazy attribute values and skip rewrites. | + +## Flow overview + +``` +collection.each { |batch, ctx| + repository.serialize_batch(batch, ctx) + → Array[Esse::Document] + import.bulk(documents) + → _bulk API call with retry and size handling +} +``` + +## Bulk API and retries + +Bulk requests are built by `Esse::Import::Bulk`. It splits operations by type (`index`, `create`, `update`, `delete`) and: + +1. 
Writes one bulk payload per batch. +2. Detects `413 Request Entity Too Large` → splits and retries in smaller chunks. +3. Detects timeouts → exponential backoff: `(retry**4) + 15 + rand(10) * (retry + 1)` seconds. +4. Gives up after 4 retries (configurable via `max_retries:`). +5. On the last retry, optionally switches to one-doc-per-request mode (`last_retry_in_small_chunks:`). + +## Optimizing during import + +Esse can temporarily disable replicas and refresh for huge imports, then restore them: + +```bash +bundle exec esse index reset UsersIndex --optimize +``` + +Equivalent to: + +```ruby +UsersIndex.reset_index(optimize: true, import: true) +``` + +When `optimize: true` is set, Esse sets `number_of_replicas: 0` and `refresh_interval: -1` before the bulk walk, then restores the original values after. + +## Zero-downtime reset + +The common full-reindex workflow is: + +```bash +bundle exec esse index reset UsersIndex --suffix 20240401 +``` + +Internally this is: + +1. `create_index(suffix: '20240401')` — create `users_20240401`. +2. `import(suffix: '20240401')` — fill it up. +3. `update_aliases(suffix: '20240401')` — point `users` alias at new index. +4. Delete the previous concrete index. + +See [Index](index.md#reset-zero-downtime) for details. + +## Lazy attributes during import + +If your repository declares `lazy_document_attribute`, you can mix and match how to load them: + +```ruby +UsersIndex.import( + eager_load_lazy_attributes: [:roles], # in the initial bulk + update_lazy_attributes: [:comment_count] # as partial updates after bulk +) +``` + +`eager_load_lazy_attributes: true` resolves all declared lazy attributes during the initial bulk. Use an array to pick specific ones. 
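The difference between the two modes can be pictured with plain hashes (illustrative only — the real payloads are built by Esse's bulk serializer, and `comment_count` here is just a stand-in attribute):

```ruby
# Illustrative only: how a lazy attribute ends up in the two payload shapes.
doc  = { _id: 1, name: 'John' }  # primary document from the `document` block
lazy = { comment_count: 3 }      # value resolved by the lazy attribute block

# eager_load_lazy_attributes: merged into the initial bulk _source
eager_source = doc.merge(lazy)
# => { _id: 1, name: 'John', comment_count: 3 }

# update_lazy_attributes: sent afterwards as a partial update body
partial_update = { doc: lazy }
# => { doc: { comment_count: 3 } }
```

Eager loading costs one pass; the update mode costs a second round of `_update` requests but lets the primary bulk finish sooner.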
+ +## Bulk settings + +| Setting | Default | Description | +|---------|---------|-------------| +| `bulk_wait_interval` | `0.1` | Seconds to wait between bulk pages (set via `Esse.config.bulk_wait_interval`) | +| `batch_size` | Collection-dependent | Passed through from your collection implementation | + +## Import errors + +Bulk operations may succeed overall while having per-item errors. Esse raises `Esse::Transport::BulkResponseError` when the response contains errors: + +```ruby +begin + UsersIndex.import +rescue Esse::Transport::BulkResponseError => e + e.response # full raw response hash + e.items # items that failed, each with their ES error +end +``` + +See [Errors](errors.md) for the full exception hierarchy. + +## Events + +Every bulk request emits `elasticsearch.bulk`. Subscribe to audit or instrument imports: + +```ruby +Esse::Events.subscribe('elasticsearch.bulk') do |event| + Rails.logger.info "Bulk: #{event.payload[:body_size]}b in #{event.payload[:runtime]}s" +end +``` + +See [Events](events.md). + +## CLI reference + +```bash +bundle exec esse index import UsersIndex \ + --suffix 20240401 \ + --repo user \ + --context active:true \ + --eager_load_lazy_attributes roles \ + --update_lazy_attributes comment_count +``` + +See [CLI](cli.md) for the full command reference. diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 0000000..caa4eb4 --- /dev/null +++ b/docs/index.md @@ -0,0 +1,301 @@ +# Index + +`Esse::Index` is the main building block. An index class defines: + +- **Settings** — shards, replicas, refresh interval, analyzers. +- **Mappings** — field types and analysis rules. +- **Repositories** — how to load and transform data. +- **Plugins** — optional behavior extensions. + +Every index class inherits from `Esse::Index`: + +```ruby +class UsersIndex < Esse::Index + # ... 
+end +``` + +## Naming + +By convention the class name is underscored and the `Index` suffix is stripped: + +```ruby +UsersIndex.index_name # => "myapp_users" (with cluster prefix) +UsersIndex.uname # => "users_index" +``` + +You can override the generated name explicitly: + +```ruby +class UsersIndex < Esse::Index + self.index_name = 'users' + self.index_prefix = 'app1' # overrides cluster prefix +end + +UsersIndex.index_name # => "app1_users" +UsersIndex.index_name(suffix: 'v2') # => "app1_users_v2" +``` + +### Index suffixes + +Suffixes are used for zero-downtime reindexing. The common pattern: + +```bash +bundle exec esse index reset UsersIndex --suffix 20240401 +``` + +This creates `users_20240401`, imports into it, and swaps the alias `users` → `users_20240401`. + +## Settings + +```ruby +class UsersIndex < Esse::Index + settings do + { + index: { + number_of_shards: 2, + number_of_replicas: 1, + refresh_interval: '1s' + }, + analysis: { + analyzer: { + my_analyzer: { type: 'standard' } + } + } + } + end +end +``` + +Or inline: + +```ruby +settings(number_of_shards: 2, refresh_interval: '1s') +``` + +Simplified keys (`number_of_shards`, `number_of_replicas`, `refresh_interval`, `mapping`) are auto-nested under `index.*`. + +The cluster's global `settings` are deep-merged into each index's settings. + +## Mappings + +```ruby +class UsersIndex < Esse::Index + mappings do + { + properties: { + name: { type: 'text' }, + email: { type: 'keyword' }, + age: { type: 'integer' }, + roles: { type: 'keyword' }, + created: { type: 'date' } + } + } + end +end +``` + +The cluster's global `mappings` are merged into each index. + +## Repositories + +A **repository** defines *how* to load data into the index. One index can host many repositories. 
+ +```ruby +class UsersIndex < Esse::Index + repository :user do + collection do |**context, &block| + User.where(context).find_in_batches { |b| block.call(b, context) } + end + + document do |user, **| + { _id: user.id, name: user.name } + end + end + + repository :admin do + collection { |**ctx, &b| User.admins.find_in_batches { |b2| b.call(b2, ctx) } } + document { |u, **| { _id: u.id, name: u.name, role: 'admin' } } + end +end +``` + +See [Repository](repository.md) for the full DSL. + +Access: + +```ruby +UsersIndex.repo # default (when only one defined) +UsersIndex.repo(:admin) # specific +UsersIndex.repo?(:admin) # => true / false +UsersIndex.repo_hash # => { 'user' => ..., 'admin' => ... } +``` + +## Plugins + +```ruby +class UsersIndex < Esse::Index + plugin :active_record + plugin MyCustomPlugin, option: 'value' +end +``` + +See [Plugins](plugins.md). + +## Lifecycle methods + +These are the most common class-level operations. All accept `suffix:` and other ES options. + +### Create / delete + +```ruby +UsersIndex.create_index(alias: true) # create and point alias at it +UsersIndex.delete_index +UsersIndex.index_exist? # => Boolean +``` + +`create_index` options: + +| Option | Type | Description | +|--------|------|-------------| +| `suffix` | `String` | Concrete index suffix (for zero-downtime) | +| `alias` | `Boolean` | Also create the alias `index_name` → suffix | +| `settings` | `Hash` | Override settings | +| `body` | `Hash` | Pass the full body manually | +| `wait_for_active_shards`, `timeout`, `master_timeout`, `headers` | pass-through | Native ES options | + +### Reset (zero-downtime) + +```ruby +UsersIndex.reset_index( + suffix: Time.now.strftime('%Y%m%d'), + optimize: true, # temporarily drops replicas/refresh during import + import: true, # import from collection + reindex: false # use _reindex API instead of collection +) +``` + +The reset flow: + +1. Create new index with suffix. +2. 
(If `optimize: true`) reduce replicas to 0, disable refresh. +3. Import data (or reindex from previous index). +4. Restore settings. +5. Swap the alias. +6. Delete old concrete index. + +### Alias management + +```ruby +UsersIndex.update_aliases(suffix: '20240401') +UsersIndex.aliases # => alias info +UsersIndex.indices_pointing_to_alias # => concrete index names +``` + +### Open / close / refresh + +```ruby +UsersIndex.close +UsersIndex.open +UsersIndex.refresh +``` + +### Update settings / mapping + +```ruby +UsersIndex.update_settings(settings: { number_of_replicas: 0 }) +UsersIndex.update_mapping +``` + +## Document-level operations + +```ruby +UsersIndex.get(id: 1) +UsersIndex.mget(ids: [1, 2, 3]) +UsersIndex.exist?(id: 1) +UsersIndex.count(body: { query: { match_all: {} } }) + +UsersIndex.index(id: 1, body: { name: 'John' }) +UsersIndex.update(id: 1, body: { doc: { name: 'Jane' } }) +UsersIndex.delete(id: 1) + +UsersIndex.bulk( + index: [doc1, doc2], + create: [doc3], + update: [doc4], + delete: [doc5] +) +``` + +## Import + +```ruby +UsersIndex.import # import all repositories +UsersIndex.import(:user) # specific repository +UsersIndex.import(context: { active: true }) # pass context to the collection +UsersIndex.import( + suffix: '20240401', + eager_load_lazy_attributes: [:roles], + update_lazy_attributes: [:comment_count] +) +``` + +See [Import](import.md) for details. + +## Search + +```ruby +UsersIndex.search(q: 'john') +UsersIndex.search(body: { query: { match: { name: 'john' } } }) +``` + +See [Search](search.md). + +## Request customization + +```ruby +class UsersIndex < Esse::Index + request_params :index, :update, pipeline: 'my_pipeline' do |doc| + { routing: doc.routing_key } + end +end +``` + +Valid operations are `:index`, `:create`, `:update`, `:delete`. + +## Inheritance + +Index classes support inheritance — settings, mappings, plugins, and repositories are inherited. 
Subclass and override what you need: + +```ruby +class BaseIndex < Esse::Index + settings do + { index: { number_of_replicas: 1 } } + end +end + +class UsersIndex < BaseIndex + # inherits settings, can override mappings, repositories, etc. +end +``` + +## Module breakdown + +For reference, `Esse::Index` composes these modules (under `lib/esse/index/`): + +| Module | Purpose | +|--------|---------| +| `Base` | Cluster binding and naming | +| `Inheritance` | Class inheritance rules | +| `Plugins` | Plugin loading | +| `Attributes` | Directory paths, template dirs | +| `Type` | Repository definition | +| `Settings` | Settings DSL | +| `Mappings` | Mappings DSL | +| `Indices` | Create/delete/reset operations | +| `Documents` | Single-doc ops (get, mget, index, update, bulk…) | +| `Aliases` | Alias ops | +| `Search` | Search entrypoint | +| `ObjectDocumentMapper` | Bulk serialization | +| `RequestConfigurable` | `request_params` DSL | +| `Descendants` | Index tracking | diff --git a/docs/plugins.md b/docs/plugins.md new file mode 100644 index 0000000..9cd8855 --- /dev/null +++ b/docs/plugins.md @@ -0,0 +1,141 @@ +# Plugins + +Plugins are the primary extension mechanism in Esse. They can: + +- Add class methods to indices and repositories. +- Hook into the index load process. +- Wrap or override existing behavior. +- Add custom DSL methods. + +Official extensions ([esse-active_record](../../esse-active_record/docs/README.md), [esse-async_indexing](../../esse-async_indexing/docs/README.md), [esse-jbuilder](../../esse-jbuilder/docs/README.md), etc.) are all plugins. + +## Using plugins + +```ruby +class UsersIndex < Esse::Index + plugin :active_record + plugin :async_indexing + plugin MyCustomPlugin, option: 'value' +end + +UsersIndex.plugins # => [Esse::Plugins::ActiveRecord, Esse::Plugins::AsyncIndexing, MyCustomPlugin] +``` + +You can pass a symbol (which is `require`d from `esse/plugins/`) or a module directly. Options and blocks are forwarded to `apply` / `configure`. 
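The symbol form implies a camelized module lookup. A rough sketch of that name resolution (an assumption for illustration — the exact lookup in `esse` may differ):

```ruby
# Rough sketch (assumption): :active_record -> "ActiveRecord" -> Esse::Plugins::ActiveRecord
def plugin_module_name(name)
  name.to_s.split('_').map(&:capitalize).join
end

plugin_module_name(:active_record)  # => "ActiveRecord"
plugin_module_name(:async_indexing) # => "AsyncIndexing"
```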
+ +## Writing a plugin + +A plugin is a module under `Esse::Plugins::` with any of the following hooks: + +```ruby +module Esse + module Plugins + module MyPlugin + # Called once when the plugin is added to the index + def self.apply(index, **options, &block) + # install class-level defaults, validate options, etc. + end + + # Called after apply, for DSL-style configuration + def self.configure(index, **options, &block) + index.some_setting = options[:default] + end + + # Mixed into index classes (available on MyIndex.*) + module IndexClassMethods + def custom_index_method + # ... + end + end + + # Mixed into each repository class (available on MyIndex::Repo.*) + module RepositoryClassMethods + def custom_repo_method + # ... + end + end + end + end +end +``` + +Load the plugin (either autoload via `plugin :my_plugin` which requires `esse/plugins/my_plugin`, or require it manually), then: + +```ruby +class UsersIndex < Esse::Index + plugin :my_plugin, default: :foo +end + +UsersIndex.custom_index_method +UsersIndex.repo(:user).custom_repo_method +``` + +## Plugin execution order + +1. `plugin :name` calls `apply(index, **opts, &block)` on the module. +2. `configure(index, **opts, &block)` runs. +3. `IndexClassMethods` are extended into the index class. +4. `RepositoryClassMethods` are extended into each repository as they're declared. + +Plugins declared earlier apply first; later plugins can override earlier behavior. 
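That override order mirrors plain Ruby `extend` semantics — the module extended last sits earliest in the method lookup chain, so it wins:

```ruby
# Plain Ruby, no Esse involved: the later-extended module overrides the earlier one.
module FirstPlugin
  def banner; 'first'; end
end

module SecondPlugin
  def banner; 'second'; end
end

class MyIndex
  extend FirstPlugin
  extend SecondPlugin # extended later, looked up first
end

MyIndex.banner # => "second"
```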
+
+## Inheriting plugins
+
+Plugins are inherited by subclasses:
+
+```ruby
+class AppIndex < Esse::Index
+  plugin :active_record
+end
+
+class UsersIndex < AppIndex
+  # active_record is already applied
+end
+```
+
+## Example: a simple plugin
+
+```ruby
+module Esse::Plugins::DefaultLogger
+  def self.apply(index, **)
+    Esse.logger.info "Loaded index #{index.name}"
+  end
+
+  module IndexClassMethods
+    def log_info
+      Esse.logger.info "#{name}: #{cluster_id}"
+    end
+  end
+end
+
+class UsersIndex < Esse::Index
+  plugin Esse::Plugins::DefaultLogger
+end
+
+UsersIndex.log_info
+```
+
+## Example: an ORM-style plugin
+
+This is a simplified version of what [esse-active_record](../../esse-active_record/docs/README.md) does:
+
+```ruby
+module Esse::Plugins::MyORM
+  module RepositoryClassMethods
+    def collection(klass, **opts, &block)
+      coll_class = Class.new(Esse::Collection) do
+        # Capture the caller's block explicitly — `yield` is not available
+        # inside a `define_method` body.
+        define_method(:each) do |&each_block|
+          klass.find_in_batches(batch_size: opts[:batch_size] || 1_000) do |batch|
+            each_block.call(batch, @params)
+          end
+        end
+      end
+      super(coll_class, **opts, &block)
+    end
+  end
+end
+```
+
+## Using existing extensions
+
+See [extensions.md](extensions.md) for the curated list. For each, a single `plugin :<name>` declaration enables it — the rest is ORM-specific DSL documented in its own `docs/`.
diff --git a/docs/repository.md b/docs/repository.md
new file mode 100644
index 0000000..1e5c329
--- /dev/null
+++ b/docs/repository.md
@@ -0,0 +1,245 @@
+# Repository
+
+A **repository** is the bridge between a data source and an index. It declares:
+
+- A `collection` — how to iterate over source records in batches.
+- A `document` — how to serialize a record into an indexable document.
+- (Optional) `lazy_document_attribute` — attributes that are loaded in bulk after primary data.
+
+Repositories are defined inside an index block:
+
+```ruby
+class UsersIndex < Esse::Index
+  repository :user do
+    collection { |**ctx, &b| ... }
+    document { |record, **ctx| ...
} + end +end +``` + +An index can have multiple repositories — useful when the same index holds heterogeneous document types (e.g., `users` and `admins`, or `posts` and `pages`). + +## Collection + +The collection must yield batches of raw records. Its signature is: + +```ruby +collection do |**context, &block| + # yield batches + block.call(batch_array, context) +end +``` + +### Block form + +```ruby +repository :user do + collection do |**conditions, &block| + User.where(conditions).find_in_batches(batch_size: 1_000) do |batch| + block.call(batch, conditions) + end + end +end +``` + +`context` is passed to you from callers (for example `UsersIndex.import(context: { active: true })`) and you pass it on to the `document` block. + +### Class form + +```ruby +class UserCollection < Esse::Collection + def each + User.find_in_batches(batch_size: @params[:batch_size] || 1_000) do |batch| + yield(batch, @params) + end + end + + # Optional — enables more efficient async indexing / lazy attribute refreshes + def each_batch_ids + User.select(:id).find_in_batches do |batch| + yield(batch.map(&:id)) + end + end +end + +repository :user do + collection UserCollection +end +``` + +Collection classes inherit from `Esse::Collection`, which is `Enumerable`. + +Implementing `each_batch_ids` is strongly recommended: it lets extensions like [esse-async_indexing](../../esse-async_indexing/docs/README.md) enqueue ID-only jobs efficiently. + +See [Collection](collection.md) for details. + +## Document + +Serialize one raw record into an indexable hash or `Esse::Document`: + +```ruby +repository :user do + document do |user, **context| + { + _id: user.id, + _routing: user.tenant_id, + name: user.name, + email: user.email, + admin: context[:is_admin] + } + end +end +``` + +Three forms are supported: + +```ruby +# 1. Block (most common) +document do |record, **ctx| + { _id: record.id, name: record.name } +end + +# 2. 
Document class +class UserDocument < Esse::Document + def id; object.id; end + def source; { name: object.name, email: object.email }; end +end + +document UserDocument + +# 3. Callable with to_h / as_json +document MySerializer +``` + +The hash may include the following reserved keys: + +| Key | Purpose | +|-----|---------| +| `_id` / `id` | Document ID (required for most ops) | +| `_type` | Document type (legacy ES <6, normally unused) | +| `_routing` / `routing` | Routing key for custom sharding | + +All other keys become the `_source` of the indexed document. + +See [Document](document.md) for all document variants. + +## Lazy document attributes + +Some attributes are expensive to compute per-record. `lazy_document_attribute` lets you resolve them in bulk **after** the primary documents are built: + +```ruby +repository :user do + collection { |**c, &b| User.find_in_batches { |b2| b.call(b2, c) } } + document { |u, **| { _id: u.id, name: u.name } } + + lazy_document_attribute :comment_count do |doc_headers| + counts = Comment.where(user_id: doc_headers.map(&:id)).group(:user_id).count + doc_headers.each_with_object({}) do |header, hash| + hash[header] = counts.fetch(header.id, 0) + end + end +end +``` + +The block receives an array of `Esse::LazyDocumentHeader` and must return a hash keyed by those headers. 
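The bulk pattern — one grouped query for the whole batch, then a per-header lookup with a default — works on plain objects too. Here a `Struct` stands in for `Esse::LazyDocumentHeader` and a literal hash stands in for the grouped SQL result:

```ruby
# Struct stands in for Esse::LazyDocumentHeader; `counts` for Comment.group(:user_id).count.
Header  = Struct.new(:id)
headers = [Header.new(1), Header.new(2), Header.new(3)]
counts  = { 1 => 5, 3 => 2 } # one query for the whole batch

result = headers.each_with_object({}) do |header, hash|
  hash[header] = counts.fetch(header.id, 0) # default 0 for users with no comments
end

result.values # => [5, 0, 2]
```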
+ +### Class-based lazy attributes + +```ruby +class UserRoles < Esse::DocumentLazyAttribute + def call(doc_headers) + # fetch bulk roles keyed by header + end +end + +lazy_document_attribute :roles, UserRoles +``` + +### Using lazy attributes + +```ruby +# Include them during bulk import +UsersIndex.import(eager_load_lazy_attributes: [:roles]) + +# Refresh them after import (partial updates) +UsersIndex.import(update_lazy_attributes: [:comment_count]) + +# Or update them independently later +UsersIndex.repo(:user).update_documents_attribute( + :comment_count, [1, 2, 3], refresh: true +) +``` + +## Batch iteration helpers + +A repository exposes helpers to iterate the collection without importing: + +```ruby +UsersIndex.repo(:user).each_batch(active: true) do |batch, ctx| + # raw record batch +end + +UsersIndex.repo(:user).each_serialized_batch( + eager_load_lazy_attributes: [:roles] +) do |docs| + # Array[Esse::Document] — already serialized and enriched +end + +UsersIndex.repo(:user).documents(active: true) # Enumerator +``` + +## Partial updates + +```ruby +UsersIndex.repo(:user).documents_for_lazy_attribute(:name, ids) +# => Array[Esse::DocumentForPartialUpdate] + +UsersIndex.repo(:user).retrieve_lazy_attribute_values(:name, ids) +# => Hash[LazyDocumentHeader => value] +``` + +## Accessing the index + +Inside a repository class: + +```ruby +class UsersIndex::User + def self.index_owner + UsersIndex # the parent index + end +end +``` + +The repository constant is the pascal-cased name: `UsersIndex::User`, `UsersIndex::Admin`, etc. 
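The batch helpers above compose the same two pieces the importer uses — batched iteration plus per-record serialization. A plain-Ruby sketch of that composition (no Esse involved; the lambdas are stand-ins for your `collection` and `document` blocks):

```ruby
# A collection yields batches; a serializer maps each record to a document hash.
records    = [{ id: 1, name: 'a' }, { id: 2, name: 'b' }, { id: 3, name: 'c' }]
collection = ->(batch_size, &block) { records.each_slice(batch_size, &block) }
serializer = ->(record) { { _id: record[:id], name: record[:name] } }

serialized_batches = []
collection.call(2) { |batch| serialized_batches << batch.map(&serializer) }

serialized_batches.size # => 2 (a batch of 2 records, then a batch of 1)
```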
+ +## Putting it together + +```ruby +class PostsIndex < Esse::Index + mappings do + { properties: { title: { type: 'text' }, comments_count: { type: 'integer' } } } + end + + repository :post do + collection do |**c, &b| + Post.includes(:author).find_in_batches(batch_size: 500) { |batch| b.call(batch, c) } + end + + document do |post, **| + { + _id: post.id, + title: post.title, + author: post.author.name + } + end + + lazy_document_attribute :comments_count do |headers| + counts = Comment.where(post_id: headers.map(&:id)).group(:post_id).count + headers.each_with_object({}) { |h, acc| acc[h] = counts.fetch(h.id, 0) } + end + end +end + +# Import including comment counts +PostsIndex.import(eager_load_lazy_attributes: [:comments_count]) +``` diff --git a/docs/search.md b/docs/search.md new file mode 100644 index 0000000..a428062 --- /dev/null +++ b/docs/search.md @@ -0,0 +1,198 @@ +# Search + +Esse provides a thin DSL around Elasticsearch/OpenSearch search APIs, plus a wrapper for responses. You can pass raw query DSL hashes or combine indices and suffixes. + +## Running a search + +### From an index class + +```ruby +query = UsersIndex.search( + body: { + query: { match: { name: 'john' } } + }, + size: 20, + from: 0 +) + +query.response # execute and get Esse::Search::Response +query.response.hits # array of hit hashes +query.response.total # total match count +``` + +### Shorthand + +```ruby +UsersIndex.search(q: 'john') # query string +UsersIndex.search('name:john AND age:30') # Lucene query string +``` + +### Across multiple indices + +```ruby +query = Esse.cluster.search(UsersIndex, EventsIndex, body: { query: { match_all: {} } }) +``` + +## Esse::Search::Query + +`UsersIndex.search(...)` returns an `Esse::Search::Query`. It's lazy — no HTTP request is made until you call `.response` or iterate. 
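A minimal sketch of that laziness (illustrative only — not Esse's actual class): the executor runs once, on first access, and `reset!` clears the memo so the next access re-executes:

```ruby
# Memoized lazy execution: building the query is free; #response triggers the call once.
class LazySketchQuery
  def initialize(&executor)
    @executor = executor
  end

  def response
    @response ||= @executor.call # first call "hits the cluster"; later calls reuse it
  end

  def reset!
    @response = nil
  end
end

calls = 0
query = LazySketchQuery.new { calls += 1; { 'hits' => [] } }
query          # building the query costs nothing
query.response # first access executes
query.response # cached
calls # => 1
```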
+ +Chainable helpers: + +```ruby +query.limit(50) # set size +query.offset(100) # set from +query.limit_value # => 50 +query.offset_value # => 100 +query.definition # full query hash sent to ES +query.reset! # clear cached response +``` + +Execute: + +```ruby +query.response # Esse::Search::Response +query.results # alias for response.hits +``` + +### Pagination + +Esse ships without a built-in pagination wrapper. Use: + +- [esse-kaminari](../../esse-kaminari/docs/README.md) for Kaminari-style `.page(n).per(x)`. +- [esse-pagy](../../esse-pagy/docs/README.md) for Pagy-style controller helpers. + +Or use `.limit(size)` / `.offset(from)` directly. + +## Esse::Search::Response + +A thin wrapper around the raw ES/OS response: + +```ruby +response = query.response + +response.raw_response # raw Hash (the JSON body) +response.query_definition # what was sent +response.hits # Array of hit hashes (each has _id, _source, etc.) +response.total # Integer total matches +response.shards # shard info +response.aggregations # aggregations hash (if any) +response.suggestions # suggestions hash (if any) + +response.size # hits.length +response.empty? +response.each { |hit| ... } # Enumerable +``` + +## Scrolling + +For iterating through very large result sets, use `scroll_hits`: + +```ruby +UsersIndex + .search(body: { query: { match_all: {} } }) + .scroll_hits(batch_size: 1_000, scroll: '1m') do |batch| + batch.each { |hit| process(hit['_source']) } + end +``` + +The scroll context is automatically cleared when the iteration finishes. + +## search_after pagination + +For live-updated deep pagination (preferred over `from` offsets beyond 10k): + +```ruby +UsersIndex + .search( + body: { + query: { match_all: {} }, + sort: [{ id: 'asc' }] + } + ) + .search_after_hits(batch_size: 1_000) do |batch| + batch.each { |hit| ... } + end +``` + +`search_after` requires a sort in the body. 
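When you drive `.limit` / `.offset` from page numbers yourself, the usual page arithmetic applies (assuming 1-based pages, since ES `from` is a zero-based offset):

```ruby
# page -> `from` offset for a given page size
def from_for(page, per_page)
  (page - 1) * per_page
end

from_for(1, 20) # => 0
from_for(3, 20) # => 40
```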
+ +## Suffix targeting + +Direct a search at a specific concrete index (not the alias): + +```ruby +UsersIndex.search(suffix: '20240401', body: { query: { match_all: {} } }) +``` + +## Example: a search service + +```ruby +class UserSearch + def initialize(query: nil, limit: 20, page: 1) + @query = query + @limit = limit + @page = page + end + + def call + UsersIndex.search( + body: body, + size: @limit, + from: (@page - 1) * @limit + ) + end + + private + + def body + { + query: @query ? { multi_match: { query: @query, fields: %w[name email] } } + : { match_all: {} }, + sort: [{ created_at: 'desc' }] + } + end +end + +search = UserSearch.new(query: 'john', page: 2).call +search.response.total +search.response.each { |hit| puts hit.dig('_source', 'name') } +``` + +## Integration with Jbuilder + +For complex query bodies, [esse-jbuilder](../../esse-jbuilder/docs/README.md) lets you build the body from a Jbuilder template: + +```ruby +UsersIndex.search do |json| + json.query do + json.bool do + json.must do + json.child! { json.match { json.set! 'name', params[:q] } } + end + end + end +end +``` + +## Counting + +```ruby +UsersIndex.count(body: { query: { match: { active: true } } }) +# => Integer +``` + +## Hit format + +Each hit is the raw ES response hash: + +```ruby +{ + '_index' => 'myapp_users_20240401', + '_id' => '42', + '_score' => 1.2, + '_source' => { 'name' => 'John', 'email' => 'john@example.com' } +} +``` + +Use `.dig('_source', 'name')` to access source fields, or wrap the response in your own result object. diff --git a/docs/transport.md b/docs/transport.md new file mode 100644 index 0000000..9538dc7 --- /dev/null +++ b/docs/transport.md @@ -0,0 +1,129 @@ +# Transport + +The **transport layer** wraps the official `elasticsearch-ruby` or `opensearch-ruby` client with consistent error handling, event instrumentation, and a uniform API surface across both. + +You rarely interact with `Esse::Transport` directly — index classes delegate to it. 
But understanding what's happening is useful for debugging and extending Esse. + +## Access + +```ruby +transport = UsersIndex.cluster.api +# or: +transport = Esse::Transport.new(UsersIndex.cluster) + +transport.client # underlying Elasticsearch::Client or OpenSearch::Client +transport.cluster # the Esse::Cluster +``` + +## Available modules + +The transport composes the following modules (under `lib/esse/transport/`): + +| Module | ES/OS APIs covered | +|--------|--------------------| +| `Documents` | `index`, `create`, `update`, `delete`, `get`, `mget`, `bulk`, `count`, `exist`, `reindex`, `update_by_query`, `delete_by_query` | +| `Indices` | `create_index`, `delete_index`, `exists`, `open`, `close`, `refresh`, `update_settings`, `update_mapping`, `get_settings`, `get_mapping`, `index_exist` | +| `Aliases` | `aliases`, `update_aliases`, `indices_pointing_to_alias` | +| `Search` | `search`, `scroll`, `clear_scroll`, `count` | +| `Cluster` | `health`, `info`, `stats`, `tasks`, `cancel_task` | + +Each method: + +1. Publishes an `elasticsearch.*` event via `Esse::Events.instrument`. +2. Calls through to the client. +3. Wraps ES/OS exceptions into `Esse::Transport::*` exceptions with a consistent hierarchy. +4. Returns the raw response hash. + +## Error handling + +`Esse::Transport::ServerError` is the base server error. Specific subclasses map HTTP status codes: + +| Exception | Status | +|-----------|--------| +| `BadRequestError` | 400 | +| `UnauthorizedError` | 401 | +| `ForbiddenError` | 403 | +| `NotFoundError` | 404 | +| `RequestTimeoutError` | 408 | +| `ConflictError` | 409 | +| `RequestEntityTooLargeError` | 413 | +| `UnprocessableEntityError` | 422 | +| `InternalServerError` | 500 | +| `BadGatewayError` | 502 | +| `ServiceUnavailableError` | 503 | +| `GatewayTimeoutError` | 504 | + +Other special exceptions: + +- `Esse::Transport::ReadonlyClusterError` — raised when you attempt a write on a `cluster.readonly = true` cluster. 
+- `Esse::Transport::BulkResponseError` — raised when the bulk response reports per-document errors. Access `error.response` and `error.items`. + +See [Errors](errors.md) for the full hierarchy. + +## Readonly clusters + +Readonly checks happen before any write: + +```ruby +cluster.throw_error_when_readonly! +# => Esse::Transport::ReadonlyClusterError if readonly +``` + +Reads still work: + +```ruby +UsersIndex.search(...) # OK +UsersIndex.count(...) # OK +UsersIndex.import # raises ReadonlyClusterError +``` + +## Calling directly + +```ruby +transport = Esse.cluster.api + +transport.create_index(index: 'foo', body: { settings: {}, mappings: {} }) +transport.delete_index(index: 'foo', ignore: [404]) + +transport.bulk(body: [ + { index: { _index: 'foo', _id: 1 } }, { name: 'John' } +]) +``` + +Signatures match the underlying clients — consult the ES/OS Ruby client docs for exact parameters. + +## Instrumentation + +Every call emits an `elasticsearch.*` event: + +```ruby +Esse::Events.subscribe('elasticsearch.bulk') do |event| + puts "bulk runtime=#{event.payload[:runtime]}" +end +``` + +Payload shape is operation-specific but typically includes `:request`, `:response`, `:runtime`, and `:error` when raised. + +See [Events](events.md) for the full event list. + +## Coercing exceptions + +If you're writing a custom transport plugin or extending behavior, use: + +```ruby +transport.coerce_exception do + client.some_native_call +end +``` + +This wraps any ES/OS client exception into the matching `Esse::Transport::*` class, keeping callers insulated from ES/OS SDK differences. + +## ES vs OS differences + +Esse transparently handles a few quirks via `ClusterEngine`: + +- ES 6.x+ mapping single-type rules (`_doc` vs `doc`). +- OpenSearch fork version reporting. +- Per-version parameter availability (e.g., `include_type_name`). + +This lets you write one index definition that works across ES 1.x → 8.x and OS 1.x → 2.x without branching.
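The status-code table earlier on this page maps onto a simple exception lookup. The sketch below is illustrative, not Esse's actual source: `SketchTransport`, `STATUS_TO_ERROR`, `error_for`, and `coerce` are hypothetical names, and only a subset of the subclasses from the table is shown. It demonstrates the same idea as `transport.coerce_exception`: callers rescue one consistent hierarchy regardless of the underlying client.

```ruby
# Minimal sketch of a status-code -> exception lookup (illustrative only;
# class names follow the table above, the lookup code is invented here).
module SketchTransport
  class ServerError < StandardError; end
  class BadRequestError < ServerError; end
  class NotFoundError < ServerError; end
  class ConflictError < ServerError; end
  class ServiceUnavailableError < ServerError; end

  STATUS_TO_ERROR = {
    400 => BadRequestError,
    404 => NotFoundError,
    409 => ConflictError,
    503 => ServiceUnavailableError
  }.freeze

  # Pick the specific subclass for a status, falling back to ServerError.
  def self.error_for(status)
    STATUS_TO_ERROR.fetch(status, ServerError)
  end

  # Re-raise a low-level client failure as the mapped exception.
  def self.coerce(status, message)
    raise error_for(status), message
  end
end

SketchTransport.error_for(404) # => SketchTransport::NotFoundError
SketchTransport.error_for(500) # => SketchTransport::ServerError
```

Because every subclass inherits from the base server error, `rescue SketchTransport::ServerError` catches all of them, which mirrors how rescuing `Esse::Transport::ServerError` behaves.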