A technical walkthrough of Endee AI Search: how it indexes products, understands natural language queries, and delivers semantically relevant results — explained from scratch.
- What Are We Building?
- The Complete Architecture at a Glance
- Tech Stack — What Each Tool Does
- Flow 1 — The Indexing Pipeline
- Why Two Separate Indexes and Not One?
- Flow 2 — The Webhook Pipeline (Keeping Index Fresh)
- Flow 3 — The Query Pipeline (Search)
- How Everything Binds Together
- Optimization Techniques Used
- Key Design Decisions Explained
- Glossary for Beginners
Traditional e-commerce search is keyword-based. You type "red shoes" and the search engine looks for products that literally contain the words "red" and "shoes". If a product is called "crimson sneakers", it won't show up — even though it's exactly what you meant.
Endee AI Search solves this. It is a Shopify app that replaces the default search with an AI-powered system that understands what you mean, not just what you typed.
| Traditional Search | Endee AI Search |
|---|---|
| Matches exact keywords | Understands meaning and intent |
| "red shoes" misses "crimson sneakers" | Finds "crimson sneakers" for "red shoes" |
| No understanding of "for my wife" | Extracts gender filter automatically |
| No image understanding | Finds products visually similar to the query |
| Static, rule-based | AI-powered, learns from the product catalog |
The system is built on three core ideas:
- Dense retrieval — understand meaning using AI embeddings (CLIP)
- Sparse retrieval — match keywords efficiently using BM25
- NLP understanding — parse user intent using spaCy
┌─────────────────────────────────────────────────────────────────┐
│ MERCHANT STORE │
│ Products: create / update / delete │
└────────────────────────┬────────────────────────────────────────┘
│ Shopify Webhooks
▼
┌─────────────────────────────────────────────────────────────────┐
│ GOOGLE PUB/SUB │
│ Message broker — buffers webhook events │
└────────────────────────┬────────────────────────────────────────┘
│ HTTP Push
▼
┌───────────────────────────────────────────────────────────────┐
│ REMIX APP (Node.js) │
│ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Webhook Route │ │ Search Route │ │
│ │ /webhooks/... │ │ /api/search │ │
│ └────────┬─────────┘ └────────┬─────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────┐ ┌──────────────────────────┐ │
│ │ pg-boss │ │ spaCy NLP Service │ │
│ │ (Job Queue) │ │ (Python, port 8100) │ │
│ └────────┬─────────┘ └────────┬─────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Worker Process │ │ CLIP Model │ │
│ │ (processJob) │ │ (ONNX/local) │ │
│ └────────┬─────────┘ └────────┬─────────┘ │
└───────────┼──────────────────────┼────────────────────────────┘
│ │
▼ ▼
┌───────────────────────────────────────────────────────────────┐
│ POSTGRESQL (Neon) │
│ - Merchant sessions │
│ - BM25 index data (per shop) │
│ - Analytics & metrics │
│ - pg-boss job queue tables │
└───────────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────┐
│ ENDEE VECTOR DB │
│ - {shop}_text index (dense + sparse vectors) │
│ - {shop}_image index (image vectors) │
└───────────────────────────────────────────────────────────────┘
The backbone of the application. Remix is a full-stack web framework built on top of React. In this app, it serves two purposes:
- Admin UI — the dashboard merchants see after installing the app
- API endpoints — the `/api/search` route that the storefront calls, and the `/webhooks/...` routes that receive events from Shopify
Think of it as the central nervous system that connects everything.
A relational database. In this system it stores:
- Merchant session data (auth tokens)
- BM25 corpus data per shop (the pre-computed term statistics)
- Analytics (daily search metrics, product counts)
- pg-boss job queue tables (the webhook job queue lives here)
A TypeScript-friendly database client that talks to PostgreSQL. Instead of writing raw SQL, we write TypeScript.
A job queue library that uses PostgreSQL as its storage backend. When a product is created/updated/deleted in a Shopify store, we don't process it immediately (that would be slow and fragile). Instead, we create a "job" in pg-boss and a background worker picks it up and processes it asynchronously.
Why not just process immediately?
- The webhook must respond in under 5 seconds or Shopify considers it failed
- Embedding a product with CLIP can take several seconds
- A merchant might bulk-update 500 products — we need to queue them, not crash
A machine learning model developed by OpenAI. The magic of CLIP is that it can encode both text and images into the same vector space. This means "red shoes" as text and an actual photo of red shoes produce vectors that are mathematically close to each other.
We use Xenova/clip-vit-base-patch16 — a pre-trained CLIP model running locally via ONNX (no API calls to OpenAI). It produces 512-dimensional vectors.
A Python NLP (Natural Language Processing) library. It runs as a separate microservice on port 8100. When a search query comes in, we first send it to spaCy to understand:
- What product category is being searched (head noun)
- Gender signals ("for my wife" → filter by women)
- Price constraints ("under $100")
- Recipient hints ("for my boyfriend")
- The best query text for BM25 (expanded) and for CLIP (cleaned)
BM25 (Best Match 25) is a classical information retrieval algorithm — the same one that powers search engines like Elasticsearch and Solr. It scores products based on keyword relevance using term frequency, document frequency, and document length normalization.
Unlike CLIP which understands meaning, BM25 is very good at exact keyword matching. Together they complement each other.
A vector database purpose-built for this system. It stores vectors (arrays of numbers representing products) and can find the most similar ones given a query vector. Each shop gets two indices:
- `{shop}_text` — stores text vectors (dense from CLIP + sparse from BM25)
- `{shop}_image` — stores image vectors (one per product image)
A message queue service. Shopify sends webhooks to Pub/Sub, which then forwards them to our app. It acts as a reliable buffer — if our app is temporarily down, Pub/Sub holds the messages for up to 7 days and retries.
The shopify.app.toml file is the configuration contract between the app and Shopify's platform. It tells Shopify:
- Where to send webhook events
- What OAuth scopes (permissions) the app needs
- Where to redirect merchants after OAuth
- The app proxy URL for storefront search
This runs once when a merchant installs the app or manually triggers a re-index.
The goal: take all products from a merchant's Shopify store and represent them as vectors so they can be searched later.
A merchant installs the app or clicks "Re-index" in the dashboard. This fires off the indexing process for their shop.
Using Shopify's Admin GraphQL API, we fetch all products from the merchant's store in batches. Each product contains: title, description (HTML), handle, vendor, product type, tags, variants (with prices), and image URLs.
Shopify GraphQL API
→ products (title, description, tags, vendor, images, variants, ...)
→ paginated in batches of 250
For each product, we build a single text "passage" by combining all textual fields:
"id: 12345 title: Classic White Sneakers description: Lightweight canvas sneakers
handle: classic-white-sneakers vendor: NikeStore productType: Footwear tags: shoes,white,casual"
This passage represents the product as a single string that both BM25 and CLIP will process.
Before we can use BM25 to score products, we need to build a corpus — a statistical model of the entire product catalog.
BM25 needs to know:
- Total documents (N) — how many products the shop has
- Document frequency (DF) — how many products contain each term
- Average document length — to normalize for product description length
Why do we need this?
Imagine the word "men" appears in 900 out of 1000 products — it's everywhere. BM25 will give it a low weight because it doesn't help distinguish products.
But the word "cashmere" appears in only 5 products — it's rare and meaningful. BM25 gives it a high weight.
This is captured by IDF (Inverse Document Frequency):
IDF = log( 1 + (N - DF + 0.5) / (DF + 0.5) )
We compute this for every term in the catalog and store the result in PostgreSQL (in the bm25Data table) so we don't have to recompute it on every search query.
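To make the formula concrete, here is a minimal sketch; the catalog counts (1000 products, document frequencies of 900 and 5) are the hypothetical ones from the example above.

```typescript
// BM25 inverse document frequency, matching the formula above:
// IDF = log(1 + (N - DF + 0.5) / (DF + 0.5))
function idf(totalDocs: number, docFreq: number): number {
  return Math.log(1 + (totalDocs - docFreq + 0.5) / (docFreq + 0.5));
}

// Hypothetical 1000-product catalog:
const idfCommon = idf(1000, 900); // "men": appears nearly everywhere, low weight
const idfRare = idf(1000, 5);     // "cashmere": appears in 5 products, high weight
```

Here `idfRare` comes out roughly fifty times larger than `idfCommon`, which is exactly the weighting effect described above.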
BM25 Parameters used:
- `k1 = 2.0` — controls term frequency saturation. A higher k1 means repeated words still contribute more score; tuned above the standard 1.2 for recall.
- `b = 0.85` — controls document length normalization. Penalizes long documents that contain a term simply because they have more words.
- `IDF floor = 0.25` — even very common terms get a small minimum weight, so no term is completely ignored.
Each product's passage is encoded through the CLIP text encoder to produce a 512-dimensional dense vector.
"Classic White Sneakers, Footwear, casual, white, shoes"
→ CLIP Text Encoder
→ [0.021, -0.134, 0.089, ..., 0.045] (512 numbers)
This vector captures the semantic meaning of the product. Products with similar meanings will have similar vectors (small cosine distance between them).
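"Small cosine distance" can be made precise in a few lines; this is the standard formula, not code from the app.

```typescript
// Cosine similarity between two equal-length vectors:
// 1 = same direction (identical meaning), 0 = unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```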
For each product image URL, we download the image and encode it through the CLIP vision encoder:
[product image] → CLIP Vision Encoder → [0.033, -0.092, 0.101, ..., 0.077] (512 numbers)
The critical property: text vectors and image vectors live in the same space. So when a user searches "red sneakers" (text), the text vector will be close to the image vector of a red sneaker photo.
Images are processed in batches of 8 to manage memory.
Each product gets stored in Endee as two separate records:
Text index ({shop}_text):
{
id: "12345",
vector: [dense CLIP text vector — 512 dims],
sparseIndices: [23, 441, 1092, ...], ← BM25 term indices
sparseValues: [0.82, 0.61, 0.33, ...], ← BM25 weights
meta: { title, vendor, tags, variants, images, ... }
}
Image index ({shop}_image):
{
id: "12345__img_0",
vector: [dense CLIP image vector — 512 dims],
meta: { productId: "12345", imageUrl: "https://...", ...productData }
}
After this step, the merchant's store is fully indexed and ready for search.
This is an important design decision. A beginner might ask: since CLIP produces 512-dimensional vectors for both text and images, why not just put everything in one index?
The core problem: one product has multiple images, but only one text representation.
A product like "Classic White Sneakers" has:
- 1 text passage → 1 text vector
- 5 images → 5 image vectors
If you put everything in one index, each record needs a single id. But what id do you give the 5 image vectors? You'd use 12345__img_0, 12345__img_1, etc. — which means the same product now has 6 entries in one index.
This creates serious problems:
Problem 1 — Result duplication in search
When you query the combined index, the same product could appear 6 times in the top results (once for the text match, 5 times for each image match). You'd have to deduplicate manually and figure out which result "represents" the product.
With two separate indexes, text and image retrieval are cleanly separated — each gives you one ranked list. RRF then merges them at the product level (not the record level).
Problem 2 — Incompatible scoring signals
The text index stores both dense and sparse vectors (CLIP + BM25). BM25 produces sparse vectors based on keywords — keywords that only make sense for text passages, not images. An image vector has no BM25 representation. Mixing image records (no sparse vector) with text records (has sparse vector) in one index would corrupt the sparse retrieval entirely.
Problem 3 — Different granularity
Text search operates at product granularity — one product, one score. Image search operates at image granularity — one product can match through any of its images, and we want the best image match per product.
Separating the indexes lets us keep this distinction clean. In the image index, we find the best-ranked image per product using the productId stored in each record's metadata.
If everything were in one index, this aggregation logic would be much harder and messier.
Problem 4 — Index pollution
Text search using BM25 (sparse retrieval) should only search over text documents. If image records are in the same index, the BM25 query vector would try to match against image records that have zero sparse representation — they'd contribute 0 to sparse scores, effectively polluting the result set with irrelevant low-scoring entries.
Summary:
| Aspect | One Combined Index | Two Separate Indexes |
|---|---|---|
| Result deduplication | Manual, complex | Clean — separate ranked lists |
| BM25 sparse search | Polluted by image records | Only searches text records |
| Image matching | Messy multi-record per product | Clean, best image per product via RRF |
| Granularity | Mixed | Each index has consistent granularity |
| Scoring logic | Complex workarounds needed | Simple RRF merge at the end |
Two indexes mean more storage, but the retrieval quality and code simplicity more than justify the overhead.
This runs continuously in the background, keeping the index in sync with the store.
When a merchant creates, updates, or deletes a product after the initial index is built, we need to reflect that change in Endee. This is done asynchronously through a webhook pipeline.
Merchant creates a product in Shopify
↓
Shopify fires a "products/create" webhook
↓
Google Pub/Sub receives it (configured in shopify.app.toml)
↓
Pub/Sub pushes an HTTP POST to /webhooks/app/products/create
↓
Remix route parses the Pub/Sub message format:
- topic: "products/create"
- shopDomain: "merchant-store.myshopify.com"
- eventId: "abc-123"
- productData: { id, title, images, ... } (base64 decoded)
↓
addProductWebhookJob() is called
↓
pg-boss inserts a job into PostgreSQL:
{
queue: "product-webhooks",
data: { type, shopDomain, productId, productData, eventId },
singletonKey: "product.create-abc-123" ← deduplication
}
↓
Route returns HTTP 200 to Pub/Sub immediately
↓
Background worker (startProductWorker) picks up the job
↓
processJob() runs:
- Build text passage
- CLIP embed text
- CLIP embed images (for create/update)
- BM25 encode
- Endee upsert (or deleteVector for delete)
↓
Job marked complete in pg-boss
The webhook route must respond with 200 in under 5 seconds or Pub/Sub/Shopify marks it as failed and retries. But embedding a product takes 2-10 seconds depending on image count.
The queue decouples the two:
- Webhook route: receive → queue job → return 200 (takes <100ms)
- Worker: pick up job → embed → store (takes however long it needs)
When Shopify updates a product, it fires both a products/create and products/update event in quick succession. Without deduplication, we'd process the same product twice.
Two layers of dedup:
- pg-boss `singletonKey` — the same `eventId` can't be queued twice
- In-memory `recentOperations` map — if an update comes within 10 seconds of a create for the same product, it's skipped
If embedding fails (network error, CLIP timeout), pg-boss automatically retries the job:
- Up to 5 retries
- Exponential backoff (2s → 4s → 8s → 16s → 32s)
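In pg-boss, a retry policy like this is expressed through job send options; the sketch below mirrors the behavior described above (exact values are the app's choice, not pg-boss defaults):

```typescript
// pg-boss job options for the webhook queue (sketch).
const jobOptions = {
  retryLimit: 5,      // up to 5 retries before the job is failed
  retryDelay: 2,      // first retry after 2 seconds
  retryBackoff: true, // double the delay each attempt: 2s, 4s, 8s, 16s, 32s
};
```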
This runs on every search request from a shopper in a merchant's storefront.
Shopper types: "gift for my wife under $100"
↓
Storefront sends: GET /api/search?q=gift+for+my+wife+under+100&shop=store.myshopify.com
↓
Remix loader handles the request
↓
searchProducts(shop, query) is called
The raw query is sent to the Python spaCy microservice (running on port 8100):
POST http://127.0.0.1:8100/parse
{ "query": "gift for my wife under $100" }
spaCy returns a rich analysis:
{
"head_noun": "gift",
"intent": "gift",
"recipient": "wife",
"filters": {
"gender": "women",
"price": { "max": 100 }
},
"expanded_query": "gift wife women accessories",
"dense_query": "gift for wife"
}
What spaCy extracts:
| What | How | Example |
|---|---|---|
| Head noun | Syntactic parse tree root | "gift for my wife" → "gift" |
| Gender | Pronouns + relationship words | "for my wife" → "women" |
| Recipient | Preposition "for" + possessive | "for my wife" → "wife" |
| Price | Named entity recognition + regex | "under $100" → { max: 100 } |
| Expanded query | Adds modifiers, lemmas | Better BM25 recall |
| Dense query | Price terms stripped | Better CLIP semantics |
From spaCy's output, we build structured filters to pass to Endee:
filters = [{ gender: { "$eq": "women" } }]
These are applied to both sparse and dense search to pre-filter candidates.
The cleaned query (without price noise) is encoded through CLIP to get the query vector:
"gift for wife" → CLIP Text Encoder → [query vector — 512 dims]
This same vector is used for both text index dense search AND image index search (because CLIP puts text and images in the same space).
The expanded query is encoded through BM25 to get a sparse vector:
"gift wife women accessories" → BM25 Encoder (loaded from PostgreSQL) → {
sparseIndices: [23, 441, 1092, 55],
sparseValues: [0.82, 0.61, 0.33, 0.91]
}
The BM25 encoder uses the pre-built corpus statistics (term frequencies, document frequencies, average length) that were computed during indexing.
All three searches fire simultaneously (in parallel) using Promise.all:
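A sketch of the fan-out, with the three Endee queries stubbed out (function names and result shapes are illustrative, not the app's actual API):

```typescript
type Hit = { id: string; score: number };
type SparseVector = { indices: number[]; values: number[] };

// Stand-ins for the three Endee queries (illustrative only).
async function searchTextSparse(_q: SparseVector, _topK: number): Promise<Hit[]> {
  return [{ id: "12345", score: 0.82 }]; // BM25 keyword match
}
async function searchTextDense(_q: number[], _topK: number): Promise<Hit[]> {
  return [{ id: "12345", score: 0.91 }]; // CLIP semantic match
}
async function searchImages(_q: number[], _topK: number): Promise<Hit[]> {
  return [{ id: "67890", score: 0.77 }]; // CLIP cross-modal match
}

// All three retrievals fire concurrently; total latency is roughly
// that of the slowest single search, not the sum of all three.
async function retrieveAll(denseVec: number[], sparse: SparseVector) {
  const [sparseHits, denseHits, imageHits] = await Promise.all([
    searchTextSparse(sparse, 500),
    searchTextDense(denseVec, 500),
    searchImages(denseVec, 500),
  ]);
  return { sparseHits, denseHits, imageHits };
}
```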
Each retrieval returns up to 500 candidates with their product metadata.
Why three separate retrievals?
| Retrieval | Strength | Example |
|---|---|---|
| BM25 sparse | Exact keywords, product types | "Nike Air Max" → exact brand match |
| CLIP text dense | Semantic meaning | "something cozy" → finds "fleece jacket" |
| CLIP image | Visual similarity | "red dress" → matches red-colored products visually |
Each alone is incomplete. Together they cover all the ways a shopper might search.
Each retrieval method returns its own ranked list. We need to combine them into one final ranking. This is done with RRF (Reciprocal Rank Fusion).
How RRF works:
For each product, calculate its score from each retrieval method based on its rank:
RRF score = 1 / (60 + rank)
- Rank 1 → 1/61 = 0.0164
- Rank 10 → 1/70 = 0.0143
- Rank 100 → 1/160 = 0.0063
- Not in results → 0
Why 60? The constant 60 dampens the difference between high and low ranks. A product ranked 1st doesn't get astronomically more score than one ranked 2nd.
Combined score:
baseScore = sparseRRF + denseRRF + imageRRF
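The rank-to-score mapping above is simple to implement; here is a minimal sketch (list contents and names are illustrative):

```typescript
// Reciprocal Rank Fusion with the standard k = 60 constant.
const RRF_K = 60;

function rrfScore(rank: number): number {
  return 1 / (RRF_K + rank); // rank is 1-based
}

// Each inner array is one retrieval method's product ids, best first.
function fuseRankings(rankedLists: string[][]): Map<string, number> {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + rrfScore(i + 1));
    });
  }
  return scores;
}
```

A product ranked 1st by all three methods scores 3/61 ≈ 0.049; one that appears in only a single list still gets a score, just a lower one.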
spaCy identified the head noun ("gift"). We apply a 1.5x boost to products whose type, title, or tags contain that noun:
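A sketch of the boost (field names and the candidate shape are illustrative):

```typescript
const HEAD_NOUN_BOOST = 1.5;

type Candidate = { title: string; productType: string; tags: string[]; score: number };

// Multiply the fused score when the head noun appears in type, title, or tags.
function applyHeadNounBoost(candidates: Candidate[], headNoun: string): Candidate[] {
  const noun = headNoun.toLowerCase();
  return candidates.map((c) => {
    const haystack = `${c.title} ${c.productType} ${c.tags.join(" ")}`.toLowerCase();
    return haystack.includes(noun) ? { ...c, score: c.score * HEAD_NOUN_BOOST } : c;
  });
}
```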
This ensures "gift boxes" rank higher than unrelated products that happened to score well on keyword matching.
If spaCy extracted a price constraint ("under $100"), we filter out candidates that don't match:
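A sketch of the post-retrieval filter (shapes are illustrative; in the real system prices come from variant metadata):

```typescript
type PriceFilter = { min?: number; max?: number };
type Scored = { id: string; price: number; score: number };

// Drop candidates whose price falls outside the parsed constraint.
function applyPriceFilter(candidates: Scored[], filter?: PriceFilter): Scored[] {
  if (!filter) return candidates;
  return candidates.filter(
    (c) =>
      (filter.min === undefined || c.price >= filter.min) &&
      (filter.max === undefined || c.price <= filter.max),
  );
}
```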
This is done post-retrieval (after scoring) rather than pre-retrieval because price data is stored in product metadata, not as a separate indexed field.
The top 30 results are returned to the storefront with their scores included for debugging.
Here is the complete picture — all three flows unified:
┌─────────────────────────────────────────────────────────────────┐
│ ONE-TIME: INITIAL INDEXING │
│ │
│ Merchant installs → Fetch all products → Build BM25 corpus │
│ → CLIP embed text → CLIP embed images → Store in Endee │
│ → Save BM25 stats to PostgreSQL │
└─────────────────────────────────────────────────────────────────┘
↓ Index is ready
┌─────────────────────────────────────────────────────────────────┐
│ CONTINUOUS: WEBHOOK PIPELINE │
│ │
│ Product change → Pub/Sub → Webhook route → pg-boss queue │
│ → Background worker → Re-embed changed product → Update Endee │
└─────────────────────────────────────────────────────────────────┘
↓ Index stays fresh
┌─────────────────────────────────────────────────────────────────┐
│ PER QUERY: SEARCH PIPELINE │
│ │
│ Query → spaCy (NLP) → CLIP (encode) → BM25 (encode) │
│ → 3 parallel searches in Endee │
│ → RRF scoring → Head noun boost → Price filter → Top 30 │
└─────────────────────────────────────────────────────────────────┘
All three index searches (sparse, dense, image) happen simultaneously:
- Without this: 3 sequential searches × ~50ms each = ~150ms
- With this: all 3 run in parallel = ~50ms total
CLIP models are large (hundreds of MB). We load them once and reuse:
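A sketch of the caching pattern; the loader here is a stub standing in for the real ONNX model load.

```typescript
type ClipEncoder = { encodeText(text: string): number[] };

let encoderPromise: Promise<ClipEncoder> | null = null;

// Stand-in for the expensive model load (hundreds of MB, seconds of work).
async function loadClipEncoder(): Promise<ClipEncoder> {
  return { encodeText: () => new Array(512).fill(0) };
}

// First call starts the load; every later caller awaits the same promise,
// so the model is loaded exactly once per process.
function getClipEncoder(): Promise<ClipEncoder> {
  encoderPromise ??= loadClipEncoder();
  return encoderPromise;
}
```

Caching the promise (rather than the model) also means concurrent first requests share one load instead of racing to start several.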
- Without this: the model loads on every request = seconds of latency
- With this: the model loads once on startup and is reused on every request
The BM25 corpus (term frequencies, document frequencies) is expensive to build — it requires scanning all product text. We compute it once during indexing and store the result in PostgreSQL.
On every search query, we just load the pre-computed statistics:
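One plausible shape for that lookup, with a per-shop in-process cache layered on top; `fetchFromDb` stands in for the Prisma query against the bm25Data table, and the stats fields shown are assumptions.

```typescript
type Bm25Stats = {
  totalDocs: number;               // N
  avgDocLength: number;            // for length normalization
  docFreq: Record<string, number>; // DF per term
};

const statsCache = new Map<string, Bm25Stats>();

// Load a shop's corpus statistics, hitting PostgreSQL only on first use.
async function getBm25Stats(
  shop: string,
  fetchFromDb: (shop: string) => Promise<Bm25Stats>,
): Promise<Bm25Stats> {
  let stats = statsCache.get(shop);
  if (!stats) {
    stats = await fetchFromDb(shop);
    statsCache.set(shop, stats);
  }
  return stats;
}
```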
CLIP image embedding is memory-intensive. Processing images in groups of 8 keeps memory usage bounded:
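A sketch of the batching loop; the `embed` function stands in for the CLIP vision encoder.

```typescript
const IMAGE_BATCH_SIZE = 8;

// Embed image URLs in fixed-size batches so at most 8 images
// are downloaded and encoded at any one time.
async function embedImagesInBatches(
  urls: string[],
  embed: (url: string) => Promise<number[]>,
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (let i = 0; i < urls.length; i += IMAGE_BATCH_SIZE) {
    const batch = urls.slice(i, i + IMAGE_BATCH_SIZE);
    vectors.push(...(await Promise.all(batch.map(embed)))); // one batch in flight
  }
  return vectors;
}
```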
The background worker processes 5 webhook jobs in parallel per poll cycle:
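A generic sketch of one poll cycle; pg-boss does the actual job fetching and locking, so the shape below only illustrates the fan-out.

```typescript
const JOBS_PER_CYCLE = 5;

// Process up to 5 jobs concurrently per poll cycle. allSettled ensures
// one failing job doesn't abort its siblings.
async function pollCycle<T>(
  fetchJobs: (limit: number) => Promise<T[]>,
  processJob: (job: T) => Promise<void>,
): Promise<number> {
  const jobs = await fetchJobs(JOBS_PER_CYCLE);
  await Promise.allSettled(jobs.map(processJob));
  return jobs.length;
}
```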
This provides throughput without overwhelming the database or CLIP model.
Unknown words (Out-Of-Vocabulary) are mapped to a special __OOV__ bucket with a document frequency of 90% of total documents:
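A sketch of the fallback; the 0.9 factor mirrors the 90% figure above.

```typescript
const OOV_DF_RATIO = 0.9;

// Look up a term's document frequency, falling back to the __OOV__ bucket
// (DF = 90% of N) for terms never seen in the catalog.
function termDocFreq(
  term: string,
  docFreq: Map<string, number>,
  totalDocs: number,
): number {
  return docFreq.get(term) ?? Math.floor(totalDocs * OOV_DF_RATIO);
}
```

Because a high DF produces a low (but non-zero) IDF, unknown terms end up with a small weight instead of crashing or dominating the score.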
This means unknown query terms get a small but non-zero weight — they don't contribute much but they don't break the scoring either.
Even very common terms get a minimum IDF of 0.25:
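In code, the floor is just a clamp, sketched here:

```typescript
const IDF_FLOOR = 0.25;

// Clamp IDF so even catalog-wide terms keep a small positive weight.
function flooredIdf(rawIdf: number): number {
  return Math.max(IDF_FLOOR, rawIdf);
}
```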
Without this floor, extremely common terms (like "product") would get near-zero IDF and be completely ignored. The floor ensures they still contribute weakly.
Short search queries (1-2 words) are not penalized by the length normalization factor:
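A sketch of the clamp, with tokenization shown as a simple whitespace split (an assumption for illustration):

```typescript
const MIN_QUERY_LEN = 3;

// Clamp the effective query length used in BM25's length normalization
// so one- and two-word queries aren't penalized.
function effectiveQueryLength(query: string): number {
  const tokens = query.trim().split(/\s+/).filter(Boolean);
  return Math.max(MIN_QUERY_LEN, tokens.length);
}
```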
Without clamping, a 1-word query like "shoes" would be penalized because BM25's normalization expects the query to be at least as long as the average document. Clamping to minimum 3 ensures short queries work well.
A product can have many images. In image results, we keep only the best-ranked image per product to avoid the same product dominating results:
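A sketch of the aggregation, keyed on the productId stored in each image record's metadata (record shape is illustrative):

```typescript
type ImageHit = { recordId: string; productId: string; rank: number };

// Collapse image hits to one entry per product, keeping the best rank.
function bestImagePerProduct(hits: ImageHit[]): ImageHit[] {
  const best = new Map<string, ImageHit>();
  for (const hit of hits) {
    const current = best.get(hit.productId);
    if (!current || hit.rank < current.rank) best.set(hit.productId, hit);
  }
  return [...best.values()].sort((a, b) => a.rank - b.rank);
}
```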
Shopify sometimes sends both a products/create and products/update event for the same product in rapid succession. We suppress the redundant update:
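A sketch of the in-memory `recentOperations` window; the 10-second value matches the text above.

```typescript
const DEDUP_WINDOW_MS = 10_000;
const recentOperations = new Map<string, number>(); // productId -> create timestamp

function recordCreate(productId: string, now = Date.now()): void {
  recentOperations.set(productId, now);
}

// An update arriving within 10s of a create for the same product is redundant.
function shouldSkipUpdate(productId: string, now = Date.now()): boolean {
  const createdAt = recentOperations.get(productId);
  return createdAt !== undefined && now - createdAt < DEDUP_WINDOW_MS;
}
```

Note this map is per-process state; the durable deduplication still comes from the pg-boss singletonKey.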
CLIP understands both text and images in a unified embedding space. This means:
- A text query ("red floral dress") is close in vector space to an image of a red floral dress
- This enables cross-modal search — text queries find products by their visual appearance
Text-only models can't do this.
CLIP is great at semantic understanding but can miss exact matches. If someone searches "Nike Air Max 270", CLIP might return other Nike shoes or similar styles. BM25 will score the exact product very high because it matches the exact model number.
Combining both gives you the best of both worlds: exact keyword precision from BM25, semantic generalization from CLIP.
spaCy's en_core_web_md model runs in Python. Our app is Node.js/TypeScript. Running spaCy as a separate HTTP microservice (on port 8100) lets both coexist:
- Node.js handles the web server and API
- Python handles NLP
- They communicate via HTTP
Redis is an in-memory data store, and BullMQ uses it as its job queue backend. The problem: BullMQ's workers poll Redis every few milliseconds for new jobs, so Redis commands are consumed continuously even when there are no webhook events.
pg-boss uses PostgreSQL — which the app already uses for sessions and BM25 data. Job polling happens every 2 seconds, consuming negligible database resources. No additional service to manage or pay for.
Shopify requires webhook endpoints to respond in under 5 seconds. If our app goes down during maintenance, Pub/Sub holds the messages for up to 7 days. Combined with pg-boss's retry logic, this creates a robust, fault-tolerant pipeline where no product events are lost.
This file is the single source of truth for the app's relationship with Shopify. It defines webhook subscriptions, OAuth scopes, and the app proxy URL. Running shopify app deploy pushes this configuration to Shopify's servers — without it, none of the webhook routing or OAuth would work.
| Term | Simple Explanation |
|---|---|
| Vector / Embedding | A list of numbers that represents the meaning of text or an image. Similar things have similar numbers. |
| Dense vector | A full 512-number list from CLIP. Every number has a value. |
| Sparse vector | A list where most values are zero. Only terms present in the query have non-zero weights. Used in BM25. |
| BM25 | A scoring formula that says "how relevant is this product to this query based on keywords?" |
| IDF | How rare a word is. Rare words matter more in search. |
| TF (Term Frequency) | How many times a word appears in a document. |
| k1 parameter | Controls how much extra TF helps. Beyond a point, repeating a word doesn't help much more. |
| b parameter | Controls how much document length matters. Longer documents get slightly penalized. |
| Cosine similarity | A way to measure how similar two vectors are. 1 = identical, 0 = unrelated. |
| RRF | A formula to merge multiple ranked lists into one. Each result gets points based on its rank in each list. |
| spaCy | A Python library that understands grammar and meaning in sentences. |
| Head noun | The main noun in a phrase. In "casual red shoes for men", the head noun is "shoes". |
| NER | Named Entity Recognition — identifying things like names, prices, locations in text. |
| ONNX | A format for AI models. Lets you run models trained in Python within JavaScript. |
| Webhook | An HTTP notification Shopify sends to your app when something happens in a store. |
| Pub/Sub | A publish-subscribe system. Shopify publishes events; your app subscribes to receive them. |
| Job Queue | A list of tasks waiting to be processed. Workers pick tasks off the queue and process them. |
| pg-boss | A job queue that stores jobs in PostgreSQL tables. |
| Upsert | Insert if the record doesn't exist, update if it does. |
| Singleton key | A unique identifier that prevents the same job from being queued twice. |
| SIGTERM | A signal sent to a process asking it to shut down gracefully. |
| ORM | Object-Relational Mapper — maps database tables to TypeScript objects (Prisma does this). |
| Remix resource route | A Remix route file that only handles API requests (no UI). Named with dots in the filename. |