From 3d3e6450b4195dbf507073002393229bf17302c7 Mon Sep 17 00:00:00 2001
From: Adhi2624
Date: Tue, 31 Mar 2026 18:04:27 +0000
Subject: [PATCH 1/3] init submission

---
 .gitignore                          |    7 +-
 README.md                           |    3 +
 README_submission.md                |  150 +
 devrev_search_with_v2_ranking.ipynb | 1223 +++++++
 requirements.txt                    |  168 +-
 test_queries_results_v2.json        | 5154 +++++++++++++++++++++++++++
 6 files changed, 6696 insertions(+), 9 deletions(-)
 create mode 100644 README_submission.md
 create mode 100644 devrev_search_with_v2_ranking.ipynb
 create mode 100644 test_queries_results_v2.json

diff --git a/.gitignore b/.gitignore
index 9d3993f..0e39424 100644
--- a/.gitignore
+++ b/.gitignore
@@ -12,17 +12,22 @@ dist/
 build/
 
 # Large binary files
-embeddings.npy
+
 faiss_index/
 data/
+
 # Jupyter
 .ipynb_checkpoints/
+*.ipynb:Zone.Identifier
 
 # Environment & secrets
 .env
 *.env
 
+# Editor
+.vscode/
+
 # OS
 .DS_Store
 Thumbs.db
diff --git a/README.md b/README.md
index 337a0bf..3e39768 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,6 @@
+## File to evaluate: devrev_search_with_v2_ranking.ipynb
+
+
 # DevRev Search — Semantic Search over DevRev Knowledge Base
 Semantic search system for the [DevRev Search](https://huggingface.co/datasets/devrev/search) dataset. Embeds ~65K knowledge base articles using either OpenAI `text-embedding-3-small` or Ollama `qwen3-embedding:0.6b`, indexes them with FAISS, and retrieves relevant documents for test queries.
diff --git a/README_submission.md b/README_submission.md
new file mode 100644
index 0000000..8e371d8
--- /dev/null
+++ b/README_submission.md
@@ -0,0 +1,150 @@
+# Submission README: `devrev_search_with_v2_ranking.ipynb`
+
+## Overview
+
+This submission is based on `devrev_search_with_v2_ranking.ipynb`. 
It implements a hybrid retrieval pipeline for the `devrev/search` benchmark using: + +- Sparse retrieval with BM25 +- Dense retrieval with `BAAI/bge-base-en-v1.5` +- Reciprocal Rank Fusion (RRF) to combine sparse and dense candidates +- Cross-encoder reranking with `BAAI/bge-reranker-v2-m3` + +The goal is to improve retrieval quality over the baseline dense-only notebook by combining lexical matching, semantic retrieval, and a final learned reranking stage. + +## What The Notebook Does + +The notebook runs the full submission pipeline end to end: + +1. Loads the three dataset splits from Hugging Face: + - `annotated_queries` + - `test_queries` + - `knowledge_base` +2. Builds an in-memory corpus from the knowledge base. +3. Tokenizes the corpus and creates a BM25 index. +4. Encodes the corpus with `sentence-transformers` and caches dense embeddings to `bge_embeddings.npy`. +5. Builds or reloads a FAISS dense index from `bge_faiss.index`. +6. Retrieves candidates from: + - BM25 + - dense BGE embeddings +7. Fuses both ranked lists using RRF. +8. Reranks the fused candidates with `BAAI/bge-reranker-v2-m3`. +9. Writes submission outputs to: + - `test_queries_results_v2.json` + - `test_queries_results_v2.parquet` +10. Saves reusable retrieval state to `corpus_state_v2.pkl`. + +## Key Output Files + +- `test_queries_results_v2.json`: submission-ready retrieval output +- `test_queries_results_v2.parquet`: tabular version of the same output +- `bge_embeddings.npy`: cached dense document embeddings +- `bge_faiss.index`: cached FAISS dense index +- `corpus_state_v2.pkl`: cached corpus ID/title/text mappings + +## How This Differs From `devrev_search.ipynb` + +`devrev_search.ipynb` is the earlier baseline pipeline. The submission notebook differs in several important ways: + +### 1. 
Retrieval strategy + +Baseline: +- Dense retrieval only +- One embedding backend at a time (`OpenAI` or `Ollama`) +- Direct nearest-neighbor search over a FAISS index + +Submission notebook: +- Hybrid retrieval +- BM25 for lexical matching +- BGE dense embeddings for semantic retrieval +- RRF to merge sparse and dense candidate lists +- Cross-encoder reranking for the final top results + +### 2. Models used + +Baseline: +- `text-embedding-3-small` through OpenAI, or +- `qwen3-embedding:0.6b` through Ollama + +Submission notebook: +- `BAAI/bge-base-en-v1.5` for dense retrieval +- `BAAI/bge-reranker-v2-m3` for reranking + +### 3. Indexing setup + +Baseline: +- Builds normalized embeddings +- Uses a flat FAISS inner-product index +- Stores artifacts under `faiss_index/` plus `embeddings.npy` + +Submission notebook: +- Builds cached BGE embeddings in `bge_embeddings.npy` +- Uses a FAISS IVF index in `bge_faiss.index` +- Saves corpus metadata separately in `corpus_state_v2.pkl` + +### 4. Ranking quality + +Baseline: +- Final ranking is the FAISS dense retrieval order + +Submission notebook: +- Final ranking is produced after: + - BM25 retrieval + - dense retrieval + - RRF fusion + - cross-encoder reranking + +This gives the submission notebook a stronger ranking stack, especially for queries where exact keyword overlap and semantic similarity both matter. + +### 5. Output filenames + +Baseline: +- `test_queries_results.json` +- `test_queries_results.parquet` + +Submission notebook: +- `test_queries_results_v2.json` +- `test_queries_results_v2.parquet` + +This keeps the submission artifacts separate from the baseline outputs. + +## Running The Submission Notebook + +Create and activate the environment, then install the pinned dependencies: + +```bash +python -m venv env +source env/bin/activate +pip install -r requirements.txt +``` + +Launch Jupyter: + +```bash +jupyter lab +``` + +Open `devrev_search_with_v2_ranking.ipynb` and run the cells in order. 
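
## How The RRF Fusion Step Works

The snippet below is a standalone sketch of the Reciprocal Rank Fusion step that merges the BM25 and dense candidate lists. The helper mirrors the `reciprocal_rank_fusion` function defined in the notebook; the candidate lists, scores, and document IDs are toy values made up for illustration.

```python
# Minimal sketch of Reciprocal Rank Fusion (RRF). Each input list holds
# (doc_id, score) pairs; only the rank position matters, so BM25 scores and
# cosine similarities never need to share a common scale.

def reciprocal_rank_fusion(ranked_lists, k=60, top_k=100):
    rrf_scores = {}
    for ranked in ranked_lists:
        for rank, (doc_id, _) in enumerate(ranked):
            # Standard RRF: each list contributes 1 / (k + rank) per document,
            # with ranks counted from 1 (hence rank + 1).
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(rrf_scores.items(), key=lambda item: item[1], reverse=True)[:top_k]

# Toy example: doc-b leads both lists, doc-a appears in both,
# doc-c and doc-d each appear in only one list.
bm25_hits = [("doc-b", 12.3), ("doc-a", 9.1), ("doc-c", 4.4)]
dense_hits = [("doc-b", 0.91), ("doc-d", 0.88), ("doc-a", 0.75)]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits], top_k=3)
print([doc_id for doc_id, _ in fused])  # ['doc-b', 'doc-a', 'doc-d']
```

Because the fusion is rank-based, documents that appear near the top of both lists outrank documents that score highly in only one of them, and no score normalization is needed. `k=60` is the commonly used RRF smoothing constant, matching the notebook's default.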
+ +## Submission Format + +The generated JSON follows the expected benchmark structure: + +```json +{ + "query_id": "example-query-id", + "query": "example query text", + "retrievals": [ + { + "id": "document-id", + "text": "document text", + "title": "document title" + } + ] +} +``` + +## Notes + +- The notebook is designed to reuse cached embeddings and the FAISS index on repeated runs. +- The first full run is heavier because it computes document embeddings and builds the retrieval artifacts. +- The main improvement over the earlier notebook is the ranking pipeline, not just the embedding model choice. diff --git a/devrev_search_with_v2_ranking.ipynb b/devrev_search_with_v2_ranking.ipynb new file mode 100644 index 0000000..e84460c --- /dev/null +++ b/devrev_search_with_v2_ranking.ipynb @@ -0,0 +1,1223 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# DevRev Search Dataset\n", + "\n", + "Loading and exploring the `devrev/search` dataset from Hugging Face.\n", + "\n", + "This copy replaces the original FAISS-only retrieval step with the v2 search stack: BM25 + dense retrieval + RRF fusion + cross-encoder reranking.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/adhi/projects/devrev-search-bench/env/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", + " from .autonotebook import tqdm as notebook_tqdm\n" + ] + } + ], + "source": [ + "from datasets import load_dataset\n", + "import pandas as pd" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Load Annotated Queries\n", + "Queries paired with annotated (golden) article chunks for training/validation." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "DatasetDict({\n", + " train: Dataset({\n", + " features: ['query_id', 'query', 'retrievals'],\n", + " num_rows: 291\n", + " })\n", + "})\n" + ] + } + ], + "source": [ + "# Load annotated queries\n", + "annotated_queries = load_dataset(\"devrev/search\", \"annotated_queries\")\n", + "print(annotated_queries)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
query_idqueryretrievals
00ae94217-c6a0-4895-83a2-841a95f01637create DevRev ticket from Microsoft Teams[{'id': 'ART-4216_KNOWLEDGE_NODE-26', 'text': ...
1d0b209b3-6cea-46d8-bfac-bd0e286ea21bworkflow builder auto close ticket after 48 ho...[{'id': 'ART-2012_KNOWLEDGE_NODE-24', 'text': ...
240c1aa6f-cd21-46ab-8f6f-76fdc267b584automated reminder to customer ticket will be ...[{'id': 'ART-3068_KNOWLEDGE_NODE-24', 'text': ...
3e47d883f-b712-4f98-bd06-14ade143e3c2connect Bitbucket account to DevRev account[{'id': 'ART-2030_KNOWLEDGE_NODE-27', 'text': ...
42e6f9413-15ac-4974-a380-7aa22fc98a61use of workflows in DevRev[{'id': 'ART-1961_KNOWLEDGE_NODE-28', 'text': ...
\n", + "
" + ], + "text/plain": [ + " query_id \\\n", + "0 0ae94217-c6a0-4895-83a2-841a95f01637 \n", + "1 d0b209b3-6cea-46d8-bfac-bd0e286ea21b \n", + "2 40c1aa6f-cd21-46ab-8f6f-76fdc267b584 \n", + "3 e47d883f-b712-4f98-bd06-14ade143e3c2 \n", + "4 2e6f9413-15ac-4974-a380-7aa22fc98a61 \n", + "\n", + " query \\\n", + "0 create DevRev ticket from Microsoft Teams \n", + "1 workflow builder auto close ticket after 48 ho... \n", + "2 automated reminder to customer ticket will be ... \n", + "3 connect Bitbucket account to DevRev account \n", + "4 use of workflows in DevRev \n", + "\n", + " retrievals \n", + "0 [{'id': 'ART-4216_KNOWLEDGE_NODE-26', 'text': ... \n", + "1 [{'id': 'ART-2012_KNOWLEDGE_NODE-24', 'text': ... \n", + "2 [{'id': 'ART-3068_KNOWLEDGE_NODE-24', 'text': ... \n", + "3 [{'id': 'ART-2030_KNOWLEDGE_NODE-27', 'text': ... \n", + "4 [{'id': 'ART-1961_KNOWLEDGE_NODE-28', 'text': ... " + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Convert to DataFrame and display\n", + "annotated_df = annotated_queries[\"train\"].to_pandas()\n", + "annotated_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'query_id': '0ae94217-c6a0-4895-83a2-841a95f01637',\n", + " 'query': 'create DevRev ticket from Microsoft Teams',\n", + " 'retrievals': [{'id': 'ART-4216_KNOWLEDGE_NODE-26',\n", + " 'text': 'DevRev Object | Sync to DevRev |\\\\n| --- | --- | --- |\\\\n| Plan | Parts | \\\\xe2\\\\x9c\\\\x85 |\\\\n| User | Identity/DevUser | \\\\xe2\\\\x9c\\\\x85 |\\\\n| Channel | Chat | \\\\xe2\\\\x9c\\\\x85 |\\\\n| Attachments in Message/Thread/Task | Artifacts on Article | \\\\xe2\\\\x9c\\\\x85 |\\\\n| Message | Comment | \\\\xe2\\\\x9c\\\\x85 |\\\\n| Thread | Comment | \\\\xe2\\\\x9c\\\\x85 |\\\\n| Task | Issue/Ticket | \\\\xe2\\\\x9c\\\\x85 |\\\\n\\\\nImporting from Microsoft Teams\\\\n------------------------------\\\\n\\\\nFollow the steps 
below to import from Microsoft Teams:\\\\n\\\\n1. Go to',\n", + " 'title': 'Microsoft Teams AirSync | AirSync | Snap-ins | DevRev'},\n", + " {'id': 'ART-4216_KNOWLEDGE_NODE-29',\n", + " 'text': 'with many\\\\nattachments. DevRev honors the Microsoft Graph API rate limits and back-off and resumes automatically.\\\\n\\\\nPost import options\\\\n-------------------\\\\n\\\\nAfter a successful import, you have the following options available for the imported account:\\\\n\\\\n* **Sync to DevRev** \\\\n This option allows you to synchronize any modifications made in Microsoft Teams with the corresponding items previously imported into DevRev. It also creates new items in DevRev for any new data in Microsoft Teams',\n", + " 'title': 'Microsoft Teams AirSync | AirSync | Snap-ins | DevRev'}]}" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Sample a single annotated query example\n", + "annotated_queries[\"train\"][0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Load Test Queries\n", + "Held-out queries used for evaluation." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "DatasetDict({\n", + " test: Dataset({\n", + " features: ['query_id', 'query'],\n", + " num_rows: 92\n", + " })\n", + "})\n" + ] + } + ], + "source": [ + "# Load test queries\n", + "test_queries = load_dataset(\"devrev/search\", \"test_queries\")\n", + "print(test_queries)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
query_idquery
0a97f93d2-410a-431f-ae9a-1e23ed35d74cend customer organization name not appearing i...
17dd7e2b4-9349-4535-8007-1d706e0fabffAndroid SDK session generated with Unknown user
24bc92187-cdaa-4c20-b189-abd1672e5a71email reply received on wrong ticket
34d9878e8-f746-4df5-8bf6-f9444989b385manage access and privileges in DevRev
4483151ec-aff4-4569-b3df-651f578b61d8SSO setup SAML IDP metadata connection string ...
\n", + "
" + ], + "text/plain": [ + " query_id \\\n", + "0 a97f93d2-410a-431f-ae9a-1e23ed35d74c \n", + "1 7dd7e2b4-9349-4535-8007-1d706e0fabff \n", + "2 4bc92187-cdaa-4c20-b189-abd1672e5a71 \n", + "3 4d9878e8-f746-4df5-8bf6-f9444989b385 \n", + "4 483151ec-aff4-4569-b3df-651f578b61d8 \n", + "\n", + " query \n", + "0 end customer organization name not appearing i... \n", + "1 Android SDK session generated with Unknown user \n", + "2 email reply received on wrong ticket \n", + "3 manage access and privileges in DevRev \n", + "4 SSO setup SAML IDP metadata connection string ... " + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Convert to DataFrame and display\n", + "test_df = test_queries[\"test\"].to_pandas()\n", + "test_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'query_id': 'a97f93d2-410a-431f-ae9a-1e23ed35d74c',\n", + " 'query': 'end customer organization name not appearing in ticket or conversation'}" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Sample a single test query example\n", + "test_queries[\"test\"][0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Load Knowledge Base\n", + "Article chunks from DevRev's customer-facing support documentation." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Warning: You are sending unauthenticated requests to the HF Hub. 
Please set a HF_TOKEN to enable higher rate limits and faster downloads.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "DatasetDict({\n", + " corpus: Dataset({\n", + " features: ['id', 'text', 'title'],\n", + " num_rows: 65224\n", + " })\n", + "})\n" + ] + } + ], + "source": [ + "# Load knowledge base\n", + "knowledge_base = load_dataset(\"devrev/search\", \"knowledge_base\")\n", + "print(knowledge_base)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idtexttitle
0ART-17711_KNOWLEDGE_NODE-0b'We ran into a case where an AirSync was star...Sync fails when original sync owners loses per...
1ART-17711_KNOWLEDGE_NODE-1access.\\n\\nOnce Person A was re-added with the...Sync fails when original sync owners loses per...
2ART-17650_KNOWLEDGE_NODE-0b\"American cybersecurity leader unifies securi...American cybersecurity leader unifies security...
3ART-17650_KNOWLEDGE_NODE-1DevRev\\n======================================...American cybersecurity leader unifies security...
4ART-17650_KNOWLEDGE_NODE-2solutions help organisations build and deploy ...American cybersecurity leader unifies security...
\n", + "
" + ], + "text/plain": [ + " id \\\n", + "0 ART-17711_KNOWLEDGE_NODE-0 \n", + "1 ART-17711_KNOWLEDGE_NODE-1 \n", + "2 ART-17650_KNOWLEDGE_NODE-0 \n", + "3 ART-17650_KNOWLEDGE_NODE-1 \n", + "4 ART-17650_KNOWLEDGE_NODE-2 \n", + "\n", + " text \\\n", + "0 b'We ran into a case where an AirSync was star... \n", + "1 access.\\n\\nOnce Person A was re-added with the... \n", + "2 b\"American cybersecurity leader unifies securi... \n", + "3 DevRev\\n======================================... \n", + "4 solutions help organisations build and deploy ... \n", + "\n", + " title \n", + "0 Sync fails when original sync owners loses per... \n", + "1 Sync fails when original sync owners loses per... \n", + "2 American cybersecurity leader unifies security... \n", + "3 American cybersecurity leader unifies security... \n", + "4 American cybersecurity leader unifies security... " + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Convert to DataFrame and display\n", + "knowledge_df = knowledge_base[\"corpus\"].to_pandas()\n", + "knowledge_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'id': 'ART-17711_KNOWLEDGE_NODE-0',\n", + " 'text': \"b'We ran into a case where an AirSync was started by one person (Person A) and later failed. Another user (Person B) tried to click Retry, but it didn\\\\xe2\\\\x80\\\\x99t work. The logs showed 401 and 403 errors in communication between the snap-in and the snap-in manager.\\\\n\\\\nIt turned out that AirSync assigns the sync owner to whoever started it. 
Since Person A had been removed from the org or lost permissions, the retry failed \\\\xe2\\\\x80\\\\x94 the system still expected the original owner to have valid\",\n", + " 'title': 'Sync fails when original sync owners loses permissions'}" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Sample a single knowledge base chunk\n", + "knowledge_base[\"corpus\"][0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Dataset Summary" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "============================================================\n", + "DevRev Search Dataset Summary\n", + "============================================================\n", + "\n", + "Annotated Queries:\n", + "DatasetDict({\n", + " train: Dataset({\n", + " features: ['query_id', 'query', 'retrievals'],\n", + " num_rows: 291\n", + " })\n", + "})\n", + "\n", + "Test Queries:\n", + "DatasetDict({\n", + " test: Dataset({\n", + " features: ['query_id', 'query'],\n", + " num_rows: 92\n", + " })\n", + "})\n", + "\n", + "Knowledge Base:\n", + "DatasetDict({\n", + " corpus: Dataset({\n", + " features: ['id', 'text', 'title'],\n", + " num_rows: 65224\n", + " })\n", + "})\n", + "\n", + "============================================================\n" + ] + } + ], + "source": [ + "print(\"=\" * 60)\n", + "print(\"DevRev Search Dataset Summary\")\n", + "print(\"=\" * 60)\n", + "print(f\"\\nAnnotated Queries:\")\n", + "print(annotated_queries)\n", + "print(f\"\\nTest Queries:\")\n", + "print(test_queries)\n", + "print(f\"\\nKnowledge Base:\")\n", + "print(knowledge_base)\n", + "print(\"\\n\" + \"=\" * 60)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "## 5. 
Build the v2 Retrieval Pipeline\n",
+    "\n",
+    "This version swaps the original FAISS-only search for the v2 retrieval pipeline used in `devrev_search_v2.ipynb`: BM25 + dense retrieval + RRF fusion + cross-encoder reranking.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Embedding model: BAAI/bge-base-en-v1.5\n",
+      "Reranker model: BAAI/bge-reranker-v2-m3\n"
+     ]
+    }
+   ],
+   "source": [
+    "import os\n",
+    "import re\n",
+    "import json\n",
+    "import math\n",
+    "import pickle\n",
+    "\n",
+    "import faiss\n",
+    "import numpy as np\n",
+    "from rank_bm25 import BM25Okapi\n",
+    "from sentence_transformers import CrossEncoder, SentenceTransformer\n",
+    "from tqdm.auto import tqdm\n",
+    "from typing import Callable\n",
+    "\n",
+    "EMBED_MODEL_NAME = \"BAAI/bge-base-en-v1.5\"\n",
+    "RERANKER_MODEL_NAME = \"BAAI/bge-reranker-v2-m3\"\n",
+    "BGE_QUERY_PREFIX = \"Represent this sentence for searching relevant passages: \"\n",
+    "\n",
+    "EMBED_PATH = \"bge_embeddings.npy\"\n",
+    "INDEX_PATH = \"bge_faiss.index\"\n",
+    "STATE_PATH = \"corpus_state_v2.pkl\"\n",
+    "OUTPUT_JSON = \"test_queries_results_v2.json\"\n",
+    "OUTPUT_PARQUET = \"test_queries_results_v2.parquet\"\n",
+    "\n",
+    "print(f\"Embedding model: {EMBED_MODEL_NAME}\")\n",
+    "print(f\"Reranker model: {RERANKER_MODEL_NAME}\")\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Building corpus index: 100%|██████████| 65224/65224 [00:01<00:00, 46582.82it/s]"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Corpus size: 65,224 documents\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": 
"stream", + "text": [ + "\n" + ] + } + ], + "source": [ + "# Build corpus structures for retrieval\n", + "corpus = knowledge_base[\"corpus\"]\n", + "\n", + "kb_id_to_text = {}\n", + "kb_id_to_title = {}\n", + "corpus_ids = []\n", + "corpus_texts = []\n", + "\n", + "for item in tqdm(corpus, desc=\"Building corpus index\"):\n", + " doc_id = item[\"id\"]\n", + " title = (item.get(\"title\", \"\") or \"\").strip()\n", + " text = (item.get(\"text\", \"\") or \"\").strip()\n", + " combined = f\"{title}. {text}\".strip(\". \")\n", + "\n", + " kb_id_to_text[doc_id] = text\n", + " kb_id_to_title[doc_id] = title\n", + " corpus_ids.append(doc_id)\n", + " corpus_texts.append(combined)\n", + "\n", + "print(f\"Corpus size: {len(corpus_ids):,} documents\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. Stage 1: BM25 Lexical Retrieval\n" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Tokenizing corpus for BM25...\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Tokenizing: 100%|██████████| 65224/65224 [00:02<00:00, 26545.20it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Building BM25 index...\n", + "BM25 index ready\n" + ] + } + ], + "source": [ + "STOP_WORDS = {\n", + " \"the\", \"a\", \"an\", \"is\", \"in\", \"of\", \"to\", \"and\", \"or\", \"for\", \"on\", \"at\", \"by\",\n", + " \"with\", \"from\", \"be\", \"are\", \"was\", \"were\", \"this\", \"that\", \"it\", \"as\", \"not\",\n", + " \"can\", \"do\", \"does\", \"have\", \"has\", \"had\", \"will\", \"would\", \"how\", \"what\",\n", + " \"when\", \"where\", \"which\", \"who\", \"why\", \"if\", \"then\", \"so\", \"but\"\n", + "}\n", + "\n", + "\n", + "def tokenize(text: str) -> list[str]:\n", + " tokens = re.sub(r\"[^a-z0-9\\s]\", \" \", text.lower()).split()\n", + " return [token for token in tokens if token not in 
STOP_WORDS and len(token) > 1]\n", + "\n", + "\n", + "print(\"Tokenizing corpus for BM25...\")\n", + "tokenized_corpus = [tokenize(text) for text in tqdm(corpus_texts, desc=\"Tokenizing\")]\n", + "\n", + "print(\"Building BM25 index...\")\n", + "bm25 = BM25Okapi(tokenized_corpus, k1=1.6, b=0.75)\n", + "print(\"BM25 index ready\")\n", + "\n", + "\n", + "def bm25_retrieve(query: str, top_k: int = 100) -> list[tuple[str, float]]:\n", + " tokens = tokenize(query)\n", + " scores = bm25.get_scores(tokens)\n", + " top_indices = np.argsort(scores)[::-1][:top_k]\n", + " return [(corpus_ids[i], float(scores[i])) for i in top_indices]\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7. Stage 2: Dense Retrieval with FAISS\n" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Loading weights: 100%|██████████| 199/199 [00:00<00:00, 1338.46it/s]\n", + "\u001b[1mBertModel LOAD REPORT\u001b[0m from: BAAI/bge-base-en-v1.5\n", + "Key | Status | | \n", + "------------------------+------------+--+-\n", + "embeddings.position_ids | UNEXPECTED | | \n", + "\n", + "Notes:\n", + "- UNEXPECTED:\tcan be ignored when loading from different task/architecture; not ok if you expect identical arch.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loaded embedding model: BAAI/bge-base-en-v1.5\n", + "Encoding 65,224 documents...\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Batches: 100%|██████████| 255/255 [22:08<00:00, 5.21s/it]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Saved embeddings to bge_embeddings.npy\n", + "Embedding matrix shape: (65224, 768)\n", + "Building FAISS IVF index...\n", + "Saved FAISS index to bge_faiss.index\n", + "Index contains 65,224 vectors\n" + ] + } + ], + "source": [ + "embed_model = SentenceTransformer(EMBED_MODEL_NAME)\n", + 
"print(f\"Loaded embedding model: {EMBED_MODEL_NAME}\")\n", + "\n", + "if os.path.exists(EMBED_PATH):\n", + " print(\"Loading cached dense embeddings...\")\n", + " doc_embeddings = np.load(EMBED_PATH)\n", + "else:\n", + " print(f\"Encoding {len(corpus_texts):,} documents...\")\n", + " doc_embeddings = embed_model.encode(\n", + " corpus_texts,\n", + " batch_size=256,\n", + " show_progress_bar=True,\n", + " normalize_embeddings=True,\n", + " convert_to_numpy=True,\n", + " )\n", + " np.save(EMBED_PATH, doc_embeddings)\n", + " print(f\"Saved embeddings to {EMBED_PATH}\")\n", + "\n", + "embedding_dim = int(doc_embeddings.shape[1])\n", + "print(f\"Embedding matrix shape: {doc_embeddings.shape}\")\n", + "\n", + "if os.path.exists(INDEX_PATH):\n", + " print(\"Loading cached FAISS index...\")\n", + " faiss_index = faiss.read_index(INDEX_PATH)\n", + "else:\n", + " print(\"Building FAISS IVF index...\")\n", + " nlist = 256\n", + " quantizer = faiss.IndexFlatIP(embedding_dim)\n", + " faiss_index = faiss.IndexIVFFlat(quantizer, embedding_dim, nlist, faiss.METRIC_INNER_PRODUCT)\n", + " faiss_index.train(doc_embeddings.astype(np.float32))\n", + " faiss_index.add(doc_embeddings.astype(np.float32))\n", + " faiss.write_index(faiss_index, INDEX_PATH)\n", + " print(f\"Saved FAISS index to {INDEX_PATH}\")\n", + "\n", + "faiss_index.nprobe = 32\n", + "print(f\"Index contains {faiss_index.ntotal:,} vectors\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "def dense_retrieve(query: str, top_k: int = 100) -> list[tuple[str, float]]:\n", + " query_embedding = embed_model.encode(\n", + " BGE_QUERY_PREFIX + query,\n", + " normalize_embeddings=True,\n", + " convert_to_numpy=True,\n", + " ).reshape(1, -1).astype(np.float32)\n", + "\n", + " scores, indices = faiss_index.search(query_embedding, top_k)\n", + " return [\n", + " (corpus_ids[idx], float(score))\n", + " for idx, score in zip(indices[0], scores[0])\n", + " if idx >= 
0\n", + " ]\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 8. Hybrid Fusion and Reranking\n" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Loading weights: 100%|██████████| 393/393 [00:00<00:00, 1502.39it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loaded reranker: BAAI/bge-reranker-v2-m3\n" + ] + } + ], + "source": [ + "def reciprocal_rank_fusion(\n", + " ranked_lists: list[list[tuple[str, float]]],\n", + " k: int = 60,\n", + " top_k: int = 100,\n", + ") -> list[tuple[str, float]]:\n", + " rrf_scores: dict[str, float] = {}\n", + " for ranked in ranked_lists:\n", + " for rank, (doc_id, _) in enumerate(ranked):\n", + " rrf_scores[doc_id] = rrf_scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)\n", + " return sorted(rrf_scores.items(), key=lambda item: item[1], reverse=True)[:top_k]\n", + "\n", + "\n", + "def hybrid_retrieve(query: str, bm25_k: int = 100, dense_k: int = 100, top_k: int = 100):\n", + " bm25_results = bm25_retrieve(query, top_k=bm25_k)\n", + " dense_results = dense_retrieve(query, top_k=dense_k)\n", + " return reciprocal_rank_fusion([bm25_results, dense_results], top_k=top_k)\n", + "\n", + "\n", + "reranker = CrossEncoder(RERANKER_MODEL_NAME, max_length=512)\n", + "print(f\"Loaded reranker: {RERANKER_MODEL_NAME}\")\n", + "\n", + "\n", + "def rerank(query: str, candidates: list[tuple[str, float]], top_k: int = 10) -> list[tuple[str, float]]:\n", + " if not candidates:\n", + " return []\n", + "\n", + " pairs = []\n", + " valid_ids = []\n", + " for doc_id, _ in candidates:\n", + " text = kb_id_to_text.get(doc_id, \"\")\n", + " if text:\n", + " pairs.append([query, text[:512]])\n", + " valid_ids.append(doc_id)\n", + "\n", + " if not pairs:\n", + " return candidates[:top_k]\n", + "\n", + " scores = reranker.predict(pairs, show_progress_bar=False)\n", + " ranked = 
sorted(zip(valid_ids, scores.tolist()), key=lambda item: item[1], reverse=True)\n", + " return ranked[:top_k]\n", + "\n", + "\n", + "def full_pipeline(\n", + " query: str,\n", + " bm25_k: int = 100,\n", + " dense_k: int = 100,\n", + " rerank_k: int = 50,\n", + " final_k: int = 10,\n", + ") -> list[tuple[str, float]]:\n", + " candidates = hybrid_retrieve(query, bm25_k=bm25_k, dense_k=dense_k, top_k=rerank_k)\n", + " return rerank(query, candidates, top_k=final_k)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 9. Search the Knowledge Base\n" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "def search(query: str, top_k: int = 5):\n", + " \"\"\"Search the knowledge base with the v2 hybrid retrieval pipeline.\"\"\"\n", + " results = full_pipeline(query, final_k=top_k)\n", + "\n", + " formatted = []\n", + " for rank, (doc_id, score) in enumerate(results, start=1):\n", + " formatted.append({\n", + " \"rank\": rank,\n", + " \"score\": float(score),\n", + " \"id\": doc_id,\n", + " \"title\": kb_id_to_title.get(doc_id, \"\"),\n", + " \"text\": kb_id_to_text.get(doc_id, \"\"),\n", + " })\n", + " return formatted\n" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Query: How do I set up AirSync?\n", + "============================================================\n", + "[Rank 1] Score: 0.9968\n", + "Doc ID: ART-2045_KNOWLEDGE_NODE-29\n", + "Title: AirSync | Snap-ins | DevRev\n", + "Text: Setting up a new AirSync\\n\\n![]()\\n\\nFor best results, AirSyncs should be done using an administrator account on the external source. 
This ensures all necessary permissions are available to complete the import successfully.\\n\\nWhether you want to perform only a one-time import or set up an ongoing s...\n", + "----------------------------------------\n", + "[Rank 2] Score: 0.9892\n", + "Doc ID: ART-2044_KNOWLEDGE_NODE-34\n", + "Title: ServiceNow AirSync | AirSync | Snap-ins | DevRev\n", + "Text: a regular basis. By default, the sync occurs once an hour.\\n\\nTo configure periodic sync, follow these steps:\\n\\n1. Go to **Settings** > **Integrations** > **AirSyncs**.\\n2. Locate the previously imported project.\\n3. Select the **\\xe2\\x8b\\xae** > **Set Periodic Sync** option.\\n\\nThe **Enable automa...\n", + "----------------------------------------\n", + "[Rank 3] Score: 0.9786\n", + "Doc ID: ART-4215_KNOWLEDGE_NODE-27\n", + "Title: Notion AirSync | AirSync | Snap-ins | DevRev\n", + "Text: Notion\\n---------------------\\n\\nFollow these steps to install the Notion AirSync snap-in:\\n\\n1. In the **Snap-in Config Modal**, search for **Notion** under **All snap-ins**.\\n2. Click **Add** and **Install snap-in**.\\n3. Go to **Settings** > **Integrations** > **AirSyncs** in the left-hand navigat...\n", + "----------------------------------------\n", + "[Rank 4] Score: 0.9699\n", + "Doc ID: ART-2048_KNOWLEDGE_NODE-28\n", + "Title: ClickUp AirSync | AirSync | Snap-ins | DevRev\n", + "Text: install the snap-in.\\n2. In the snap-in config modal, click **Install** then go to **Integrations** > **AirSyncs** in your settings left nav.\\n3. Click the **AirSync** import button and select the **ClickUp** tile in the **Start import** window.\\n4. Create a new connection to your ClickUp account, o...\n", + "----------------------------------------\n", + "[Rank 5] Score: 0.9688\n", + "Doc ID: ART-5011_KNOWLEDGE_NODE-27\n", + "Title: Zoho Projects AirSync | AirSync | Snap-ins | DevRev\n", + "Text: the top-right corner. This triggers the snap-in configuration and installation modal.\\n3. 
Click on **Install Snap-in** to add it to your organization.\\n4. Go to the **AirSyncs** section from the navigation bar and select **AirSync** (or **Start AirSync** if it\\'s your first).\\n5. A pop-up opens to s...\n", + "----------------------------------------\n" + ] + } + ], + "source": [ + "# Test search with a sample query\n", + "query = \"How do I set up AirSync?\"\n", + "results = search(query, top_k=5)\n", + "\n", + "print(f\"Query: {query}\")\n", + "print(\"=\" * 60)\n", + "for result in results:\n", + " print(f\"[Rank {result['rank']}] Score: {result['score']:.4f}\")\n", + " print(f\"Doc ID: {result['id']}\")\n", + " print(f\"Title: {result['title']}\")\n", + " print(f\"Text: {result['text'][:300]}...\")\n", + " print(\"-\" * 40)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 10. Generate Test Query Predictions\n" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Processing test queries: 100%|██████████| 92/92 [11:39<00:00, 7.61s/it]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Processed 92 test queries\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\n" + ] + } + ], + "source": [ + "TOP_K = 10\n", + "\n", + "test_results = []\n", + "\n", + "for item in tqdm(test_queries[\"test\"], desc=\"Processing test queries\"):\n", + " query_id = item[\"query_id\"]\n", + " query = item[\"query\"]\n", + "\n", + " retrieved = full_pipeline(query, bm25_k=100, dense_k=100, rerank_k=50, final_k=TOP_K)\n", + " retrievals = [\n", + " {\n", + " \"id\": doc_id,\n", + " \"text\": kb_id_to_text.get(doc_id, \"\"),\n", + " \"title\": kb_id_to_title.get(doc_id, \"\"),\n", + " }\n", + " for doc_id, _ in retrieved\n", + " ]\n", + "\n", + " test_results.append({\n", + " \"query_id\": query_id,\n", + " \"query\": query,\n", + " \"retrievals\": retrievals,\n", + " })\n", + "\n", + 
"print(f\"Processed {len(test_results)} test queries\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Sample result:\n", + "{\n", + " \"query_id\": \"a97f93d2-410a-431f-ae9a-1e23ed35d74c\",\n", + " \"query\": \"end customer organization name not appearing in ticket or conversation\",\n", + " \"retrievals\": [\n", + " {\n", + " \"id\": \"ART-1953_KNOWLEDGE_NODE-29\",\n", + " \"text\": \"[support@yourdomain.com](mailto:support@yourdomain.com)\\\\n* **Subject**: \\\"You are missing messages from \\\"\\\\n\\\\nReply to the customer on a ticket\\\\n---------------------------------\\\\n\\\\n* **Trigger**: When a reply is made to a customer on a ticket.\\\\n* **Action**: The system sends out a notification to the customer with the reply message.\\\\n* **Sender**: {Company\\\\\\\\_Name} [support@yourdomain.com](mailto:support@yourdomain.com)\\\\n* **Subject**: \\\"[{Company\\\\\\\\_Name}] Update on TKT-XXX\\\"\\\\n\\\\nTicket\",\n", + " \"title\": \"Customer email notifications | Computer by DevRev | DevRev\"\n", + " },\n", + " {\n", + " \"id\": \"ART-1981_KNOWLEDGE_NODE-30\",\n", + " \"text\": \"conversation of which you are not the owner, let the owner know to respond. It's beneficial to retain the same point of contact for the duration of the conversation unless the owner refers some another user.\\\\n* If the conversation has a customer org that's unidentified or is new, add yourself (the customer experience engineer) as the owner of the ticket. 
Try to find the appropriate owner for the customer org and update the customer record accordingly.\\\\n* Change the stage of the conversation to\",\n", + " \"title\": \"Support best practices | Computer for Support Teams | DevRev\"\n", + " },\n", + " {\n", + " \"id\": \"ART-197\n" + ] + } + ], + "source": [ + "print(\"Sample result:\")\n", + "print(json.dumps(test_results[0], indent=2, default=str)[:1500])\n" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Saved JSON results to test_queries_results_v2.json\n", + "Saved parquet results to test_queries_results_v2.parquet\n", + "Queries: 92\n", + "Retrievals per query: 10\n" + ] + } + ], + "source": [ + "with open(OUTPUT_JSON, \"w\", encoding=\"utf-8\") as file:\n", + " json.dump(test_results, file, indent=2)\n", + "\n", + "results_df = pd.DataFrame(test_results)\n", + "results_df.to_parquet(OUTPUT_PARQUET, index=False)\n", + "\n", + "print(f\"Saved JSON results to {OUTPUT_JSON}\")\n", + "print(f\"Saved parquet results to {OUTPUT_PARQUET}\")\n", + "print(f\"Queries: {len(test_results)}\")\n", + "print(f\"Retrievals per query: {TOP_K}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 11. 
Save and Reload Retrieval State\n" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Saved retrieval state to corpus_state_v2.pkl\n", + "Dense index cache: bge_faiss.index\n", + "Dense embedding cache: bge_embeddings.npy\n" + ] + } + ], + "source": [ + "state = {\n", + " \"corpus_ids\": corpus_ids,\n", + " \"kb_id_to_text\": kb_id_to_text,\n", + " \"kb_id_to_title\": kb_id_to_title,\n", + "}\n", + "\n", + "with open(STATE_PATH, \"wb\") as file:\n", + " pickle.dump(state, file)\n", + "\n", + "print(f\"Saved retrieval state to {STATE_PATH}\")\n", + "print(f\"Dense index cache: {INDEX_PATH}\")\n", + "print(f\"Dense embedding cache: {EMBED_PATH}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 12. Load Saved State (Optional)\n", + "Use this to reload the cached retrieval components without rebuilding the corpus structures.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [], + "source": [ + "# with open(STATE_PATH, \"rb\") as file:\n", + "# state = pickle.load(file)\n", + "#\n", + "# corpus_ids = state[\"corpus_ids\"]\n", + "# kb_id_to_text = state[\"kb_id_to_text\"]\n", + "# kb_id_to_title = state[\"kb_id_to_title\"]\n", + "#\n", + "# doc_embeddings = np.load(EMBED_PATH)\n", + "# faiss_index = faiss.read_index(INDEX_PATH)\n", + "# faiss_index.nprobe = 32\n", + "#\n", + "# embed_model = SentenceTransformer(EMBED_MODEL_NAME)\n", + "# reranker = CrossEncoder(RERANKER_MODEL_NAME, max_length=512)\n", + "# print(\"Reloaded retrieval state\")\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python (devrev-search-bench)", + "language": "python", + "name": "devrev-search-bench" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": 
"python", + "pygments_lexer": "ipython3", + "version": "3.12.2" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/requirements.txt b/requirements.txt index b196bf1..a04ec61 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,8 +1,160 @@ -datasets -pandas -openai -ollama -faiss-cpu -numpy -tqdm -pyarrow +aiohappyeyeballs==2.6.1 +aiohttp==3.13.3 +aiosignal==1.4.0 +annotated-doc==0.0.4 +annotated-types==0.7.0 +anyio==4.12.1 +argon2-cffi==25.1.0 +argon2-cffi-bindings==25.1.0 +arrow==1.4.0 +asttokens==3.0.1 +async-lru==2.2.0 +attrs==25.4.0 +babel==2.18.0 +beautifulsoup4==4.14.3 +bleach==6.3.0 +certifi==2026.2.25 +cffi==2.0.0 +charset-normalizer==3.4.5 +click==8.3.1 +comm==0.2.3 +cuda-bindings==13.2.0 +cuda-pathfinder==1.5.0 +cuda-toolkit==13.0.2 +datasets==4.7.0 +debugpy==1.8.20 +decorator==5.2.1 +defusedxml==0.7.1 +dill==0.4.0 +distro==1.9.0 +executing==2.2.1 +faiss-cpu==1.13.2 +fastjsonschema==2.21.2 +filelock==3.25.2 +fqdn==1.5.1 +frozenlist==1.8.0 +fsspec==2026.2.0 +h11==0.16.0 +hf-xet==1.4.2 +httpcore==1.0.9 +httpx==0.28.1 +huggingface_hub==1.7.1 +idna==3.11 +ipykernel==7.2.0 +ipython==9.11.0 +ipython_pygments_lexers==1.1.1 +isoduration==20.11.0 +jedi==0.19.2 +Jinja2==3.1.6 +jiter==0.13.0 +joblib==1.5.3 +json5==0.13.0 +jsonpointer==3.0.0 +jsonschema==4.26.0 +jsonschema-specifications==2025.9.1 +jupyter-events==0.12.0 +jupyter-lsp==2.3.0 +jupyter_client==8.8.0 +jupyter_core==5.9.1 +jupyter_server==2.17.0 +jupyter_server_terminals==0.5.4 +jupyterlab==4.5.6 +jupyterlab_pygments==0.3.0 +jupyterlab_server==2.28.0 +lark==1.3.1 +markdown-it-py==4.0.0 +MarkupSafe==3.0.3 +matplotlib-inline==0.2.1 +mdurl==0.1.2 +mistune==3.2.0 +mpmath==1.3.0 +multidict==6.7.1 +multiprocess==0.70.18 +nbclient==0.10.4 +nbconvert==7.17.0 +nbformat==5.10.4 +nest-asyncio==1.6.0 +networkx==3.6.1 +notebook_shim==0.2.4 +numpy==2.4.3 +nvidia-cublas==13.1.0.3 +nvidia-cuda-cupti==13.0.85 +nvidia-cuda-nvrtc==13.0.88 +nvidia-cuda-runtime==13.0.96 +nvidia-cudnn-cu13==9.19.0.56 
+nvidia-cufft==12.0.0.61 +nvidia-cufile==1.15.1.6 +nvidia-curand==10.4.0.35 +nvidia-cusolver==12.0.4.66 +nvidia-cusparse==12.6.3.3 +nvidia-cusparselt-cu13==0.8.0 +nvidia-nccl-cu13==2.28.9 +nvidia-nvjitlink==13.0.88 +nvidia-nvshmem-cu13==3.4.5 +nvidia-nvtx==13.0.85 +ollama==0.6.1 +openai==2.28.0 +packaging==26.0 +pandas==3.0.1 +pandocfilters==1.5.1 +parso==0.8.6 +pexpect==4.9.0 +platformdirs==4.9.4 +prometheus_client==0.24.1 +prompt_toolkit==3.0.52 +propcache==0.4.1 +psutil==7.2.2 +ptyprocess==0.7.0 +pure_eval==0.2.3 +pyarrow==23.0.1 +pycparser==3.0 +pydantic==2.12.5 +pydantic_core==2.41.5 +Pygments==2.19.2 +python-dateutil==2.9.0.post0 +python-json-logger==4.0.0 +PyYAML==6.0.3 +pyzmq==27.1.0 +rank-bm25==0.2.2 +referencing==0.37.0 +regex==2026.3.32 +requests==2.32.5 +rfc3339-validator==0.1.4 +rfc3986-validator==0.1.1 +rfc3987-syntax==1.1.0 +rich==14.3.3 +rpds-py==0.30.0 +safetensors==0.7.0 +scikit-learn==1.8.0 +scipy==1.17.1 +Send2Trash==2.1.0 +sentence-transformers==5.3.0 +setuptools==81.0.0 +shellingham==1.5.4 +six==1.17.0 +sniffio==1.3.1 +soupsieve==2.8.3 +stack-data==0.6.3 +sympy==1.14.0 +terminado==0.18.1 +threadpoolctl==3.6.0 +tinycss2==1.4.0 +tokenizers==0.22.2 +torch==2.11.0 +tornado==6.5.5 +tqdm==4.67.3 +traitlets==5.14.3 +transformers==5.4.0 +triton==3.6.0 +typer==0.24.1 +typing-inspection==0.4.2 +typing_extensions==4.15.0 +tzdata==2025.3 +uri-template==1.3.0 +urllib3==2.6.3 +wcwidth==0.6.0 +webcolors==25.10.0 +webencodings==0.5.1 +websocket-client==1.9.0 +xxhash==3.6.0 +yarl==1.23.0 diff --git a/test_queries_results_v2.json b/test_queries_results_v2.json new file mode 100644 index 0000000..98ac389 --- /dev/null +++ b/test_queries_results_v2.json @@ -0,0 +1,5154 @@ +[ + { + "query_id": "a97f93d2-410a-431f-ae9a-1e23ed35d74c", + "query": "end customer organization name not appearing in ticket or conversation", + "retrievals": [ + { + "id": "ART-1953_KNOWLEDGE_NODE-29", + "text": "[support@yourdomain.com](mailto:support@yourdomain.com)\\n* **Subject**: \"You 
are missing messages from \"\\n\\nReply to the customer on a ticket\\n---------------------------------\\n\\n* **Trigger**: When a reply is made to a customer on a ticket.\\n* **Action**: The system sends out a notification to the customer with the reply message.\\n* **Sender**: {Company\\\\_Name} [support@yourdomain.com](mailto:support@yourdomain.com)\\n* **Subject**: \"[{Company\\\\_Name}] Update on TKT-XXX\"\\n\\nTicket", + "title": "Customer email notifications | Computer by DevRev | DevRev" + }, + { + "id": "ART-1981_KNOWLEDGE_NODE-30", + "text": "conversation of which you are not the owner, let the owner know to respond. It's beneficial to retain the same point of contact for the duration of the conversation unless the owner refers some another user.\\n* If the conversation has a customer org that's unidentified or is new, add yourself (the customer experience engineer) as the owner of the ticket. Try to find the appropriate owner for the customer org and update the customer record accordingly.\\n* Change the stage of the conversation to", + "title": "Support best practices | Computer for Support Teams | DevRev" + }, + { + "id": "ART-1978_KNOWLEDGE_NODE-44", + "text": "URL and not on the support portal of some other customer.\\n* Customer admin isn't able to see all the tickets of the organization.\\n\\n + This could happen if the customer isn't logged in on the correct URL. If the customer is logged in on the correct URL, then check if there are any tickets that are reported by the other customers in that organization or not. 
Check if the customer is added as a customer admin or not by logging in to your DevRev application.\\n* You are not able to add customer", + "title": "Customer portal | Computer for Support Teams | DevRev" + }, + { + "id": "ART-1986_KNOWLEDGE_NODE-45", + "text": "selected on the ticket is not assigned any SLA.\\n\\n\\xc2\\xa0\\xc2\\xa0 \\xc2\\xa0 - Action: Check your SLA assignment rules or add the customer as an exception to any of your SLAs.\\n\\n![]()\\n\\nThe **SLA Name** is never empty if your organization has a default SLA.\\n\\n1. Verify policy conditions:\\n\\n\\xc2\\xa0\\xc2\\xa0 - If the **SLA Name** is populated but you still see no SLA metrics running on the ticket, the ticket does not satisfy the conditions of any policy within the SLA.\\n\\n\\xc2\\xa0\\xc2\\xa0", + "title": "Service-level agreement | Computer for Support Teams | DevRev" + }, + { + "id": "ART-1981_KNOWLEDGE_NODE-34", + "text": "ticket.\\n* Make sure all tickets have the customer org field populated.\\n* Cancel any internal ticket without a customer org that has been created by a developer. Ask them to create an issue instead.\\n* If a customer raises a feature request that aligns with the product strategy, but needs significant development effort and will not be delivered in the near future, move it to the *accepted* stage, rather than keeping the ticket open. Inform the customer accordingly.\\n* If a customer reports a", + "title": "Support best practices | Computer for Support Teams | DevRev" + }, + { + "id": "ART-2002_KNOWLEDGE_NODE-25", + "text": "about the customer initiating the conversation. 
Similarly, tickets on DevRev can capture who the ticket was reported by (or reported for).\\n\\nConcepts\\n--------\\n\\nCustomer identity in DevRev includes the following important constructs:\\n\\n* **External User/contact**: Your end user or customer or users associated with organization Accounts or Workspaces.\\n* **Account/workspace**: Any logical grouping that an external user is part of. It could represent a customer account for your B2B product", + "title": "Contacts | Computer for Growth Teams | DevRev" + }, + { + "id": "ART-1953_KNOWLEDGE_NODE-32", + "text": "\\xe2\\x80\\x9cUpdate on your Conversation with {Company\\\\_Name}\"\\n\\n![]()\\n\\nThis email is only sent to organizations that have installed [Convergence snap-in](https://docs.devrev.ai/automations/converge).\\n\\nCSAT survey for conversation/ticket\\n-----------------------------------\\n\\n* **Trigger**: A CSAT survey is sent for a conversation or ticket.\\n* **Action**: The system sends out a notification with the ticket/conversation number and CSAT form.\\n* **Sender**: DevRev", + "title": "Customer email notifications | Computer by DevRev | DevRev" + }, + { + "id": "ART-1986_KNOWLEDGE_NODE-44", + "text": "**Custom**: Filters all tickets that will breach by the selected date.\\n\\n![]()\\n\\nTroubleshooting: No SLA running on the ticket\\n---------------------------------------------\\n\\n### Issue\\n\\nYou have created and published an SLA, but no SLA is running on the ticket.\\n\\n### Solution\\n\\n1. Check the **SLA Name** attribute:\\n\\n\\xc2\\xa0\\xc2\\xa0 - Verify that the **SLA Name** attribute on the ticket is not empty.\\n\\n\\xc2\\xa0\\xc2\\xa0 - If the **SLA Name** is empty, it means the customer account", + "title": "Service-level agreement | Computer for Support Teams | DevRev" + }, + { + "id": "ART-1981_KNOWLEDGE_NODE-28", + "text": "new conversation. Respond within 1 hour to new messages on existing conversations. 
Change the stage of conversation to *awaiting customer response* as soon as you have responded.\\n* In **Updates**, filter by **Type** > **Mentioned**. Respond to those updates first.\\n* Create a ticket if you aren't able to resolve the conversation in 20 minutes. As soon as the ticket is opened, move it to the *escalate* stage. The owner of the ticket is the owner of the customer org where the conversation", + "title": "Support best practices | Computer for Support Teams | DevRev" + }, + { + "id": "ART-1974_KNOWLEDGE_NODE-33", + "text": "have been addressed.\\n\\n A conversation set to *resolved* still shows in the end-user's widget. If they respond again, it reopens the conversation and set the status to *needs response*.\\n* *Archived*\\n\\n The final stage for conversation.\\n\\n[PreviousTicket-Team Performance](/docs/dashboards/ticket-team-performance)[NextConversation to ticket conversion](/docs/product/conversation-ticket)\\n\\n#### On this page\\n\\n* [Stages](#stages)\\n\\n[Enterprise grade security to protect customer", + "title": "Conversations | Computer for Support Teams | DevRev" + } + ] + }, + { + "query_id": "7dd7e2b4-9349-4535-8007-1d706e0fabff", + "query": "Android SDK session generated with Unknown user", + "retrievals": [ + { + "id": "ART-2895_KNOWLEDGE_NODE-13", + "text": "\\n[/code] \\n \\n## Identify users without session token\\n\\nYou can also pass the identifiers in the `plugSDK.init` option without generating the session token.\\n\\n#####\\n\\nThis frontend user identification, by its nature, is not secure as the data is transmitted through the client side. 
It is recommended to use the session token method to securely identify users.\\n\\n#####\\n\\nThis method is currently in beta and comes with the following limitations:\\n\\n * Unverified users cannot be", + "title": "Identify your users with PLuG \u2014 DevRev | Docs" + }, + { + "id": "ART-2898_KNOWLEDGE_NODE-1", + "text": "identification](/public/sdks/mobile/android#unverified-identification)\\n * [Update user information](/public/sdks/mobile/android#update-user-information)\\n * [PLuG support chat](/public/sdks/mobile/android#plug-support-chat)\\n * [Analytics](/public/sdks/mobile/android#analytics)\\n * [Session analytics](/public/sdks/mobile/android#session-analytics)\\n * [Opt in or out](/public/sdks/mobile/android#opt-in-or-out)\\n * [Session recording](/public/sdks/mobile/android#session-recording)\\n *", + "title": "Android integration \u2014 DevRev | Docs" + }, + { + "id": "ART-4256_KNOWLEDGE_NODE-13", + "text": "configured and the user to be identified, whether they are unverified or anonymous.\\n\\nThe DevRev SDK allows you to send custom analytic events by using a properties map. 
You can track these events using the following function:\\n\\n[code]\\n\\n 1| DevRev.trackEvent(name: string, properties?: Map) \\n ---|---\\n[/code] \\n \\n## Session analytics\\n\\nThe DevRev SDK offers session analytics features to help you understand how users interact with your app.\\n\\n### Opting-in or", + "title": "DevRev SDK for React Native \u2014 DevRev | Docs" + }, + { + "id": "ART-2895_KNOWLEDGE_NODE-0", + "text": "b'[](/public/sdks/web/user-identity)\\n\\nPublic\\n\\nOn this page\\n\\n * [Generate an application access token](/public/sdks/web/user-identity#generate-an-application-access-token)\\n * [Generate a session token](/public/sdks/web/user-identity#generate-a-session-token)\\n * [Pass custom attributes](/public/sdks/web/user-identity#pass-custom-attributes)\\n * [Pass the session token](/public/sdks/web/user-identity#pass-the-session-token)\\n * [Identify users without session", + "title": "Identify your users with PLuG \u2014 DevRev | Docs" + }, + { + "id": "ART-2895_KNOWLEDGE_NODE-12", + "text": "their context.\\n\\n[code]\\n\\n 1| // You can generate this session token using the above API in your backend \\n ---|--- \\n 2| const sessionToken = \\'\\' \\n 3| \\n 4| \\n[/code] \\n \\n## Identify users without session token\\n\\nYou can also pass the identifiers in the `plugSDK.init` option without generating the session token.\\n\\n#####\\n\\nThis frontend user identification, by its nature, is not secure as the data is transmitted through the client side. 
It is recommended to use the session token method to securely identify users.\\n\\n#####\\n\\nThis method is currently in beta and comes with the following limitations:\\n\\n * Unverified users cannot be",
+        "title": "Identify your users with PLuG \u2014 DevRev | Docs"
+      },
+      {
+        "id": "ART-2898_KNOWLEDGE_NODE-1",
+        "text": "identification](/public/sdks/mobile/android#unverified-identification)\\n * [Update user information](/public/sdks/mobile/android#update-user-information)\\n * [PLuG support chat](/public/sdks/mobile/android#plug-support-chat)\\n * [Analytics](/public/sdks/mobile/android#analytics)\\n * [Session analytics](/public/sdks/mobile/android#session-analytics)\\n * [Opt in or out](/public/sdks/mobile/android#opt-in-or-out)\\n * [Session recording](/public/sdks/mobile/android#session-recording)\\n *",
+        "title": "Android integration \u2014 DevRev | Docs"
+      },
+      {
+        "id": "ART-4256_KNOWLEDGE_NODE-13",
+        "text": "configured and the user to be identified, whether they are unverified or anonymous.\\n\\nThe DevRev SDK allows you to send custom analytic events by using a properties map. 
You can track these events using the following function:\\n\\n[code]\\n\\n 1| DevRev.trackEvent(name: string, properties?: Map) \\n ---|---\\n[/code] \\n \\n## Session analytics\\n\\nThe DevRev SDK offers session analytics features to help you understand how users interact with your app.\\n\\n### Opting-in or", + "title": "DevRev SDK for React Native \u2014 DevRev | Docs" + }, + { + "id": "ART-2895_KNOWLEDGE_NODE-0", + "text": "b'[](/public/sdks/web/user-identity)\\n\\nPublic\\n\\nOn this page\\n\\n * [Generate an application access token](/public/sdks/web/user-identity#generate-an-application-access-token)\\n * [Generate a session token](/public/sdks/web/user-identity#generate-a-session-token)\\n * [Pass custom attributes](/public/sdks/web/user-identity#pass-custom-attributes)\\n * [Pass the session token](/public/sdks/web/user-identity#pass-the-session-token)\\n * [Identify users without session", + "title": "Identify your users with PLuG \u2014 DevRev | Docs" + }, + { + "id": "ART-2895_KNOWLEDGE_NODE-12", + "text": "their context.\\n\\n[code]\\n\\n 1| // You can generate this session token using the above API in your backend \\n ---|--- \\n 2| const sessionToken = \\'\\' \\n 3| \\n 4|