diff --git a/authors.yaml b/authors.yaml
index e1a18ed0fa..a27ad0d02a 100644
--- a/authors.yaml
+++ b/authors.yaml
@@ -493,6 +493,10 @@ heejingithub:
website: "https://www.linkedin.com/in/heejc/"
avatar: "https://avatars.githubusercontent.com/u/169293861"
+ayman-openai:
+ name: "Ayman Farhat"
+ website: "https://www.linkedin.com/in/ayman-farhat-7baa9a11/"
+ avatar: "https://avatars.githubusercontent.com/u/229349247"
himadri518:
name: "Himadri Acharya"
diff --git a/examples/vector_databases/duckdb/.gitignore b/examples/vector_databases/duckdb/.gitignore
new file mode 100644
index 0000000000..f6b9cd9231
--- /dev/null
+++ b/examples/vector_databases/duckdb/.gitignore
@@ -0,0 +1 @@
+arxiv_data.db
\ No newline at end of file
diff --git a/examples/vector_databases/duckdb/README.md b/examples/vector_databases/duckdb/README.md
new file mode 100644
index 0000000000..fbffef3061
--- /dev/null
+++ b/examples/vector_databases/duckdb/README.md
@@ -0,0 +1,8 @@
+# DuckDB
+
+[DuckDB](https://duckdb.org/) is an in-process SQL OLAP database management system designed for analytics.
+DuckDB provides a lightweight, efficient way to query and store embeddings directly alongside your structured data, making it easy to combine vector operations with rich SQL analytics.
+
+For technical details, refer to the [DuckDB documentation](https://duckdb.org/docs/).
+
+The [`duckdb`](https://github.com/duckdb/duckdb) GitHub repository contains the source code, examples, and community resources for experimenting with DuckDB.
diff --git a/examples/vector_databases/duckdb/duckdb-sql-with-openai-embeddings.ipynb b/examples/vector_databases/duckdb/duckdb-sql-with-openai-embeddings.ipynb
new file mode 100644
index 0000000000..e963466767
--- /dev/null
+++ b/examples/vector_databases/duckdb/duckdb-sql-with-openai-embeddings.ipynb
@@ -0,0 +1,587 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "0434d61f",
+ "metadata": {},
+ "source": [
+ "# Semantic Search with DuckDB and OpenAI Embeddings\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "785ee6d1",
+ "metadata": {},
+ "source": [
+ "DuckDB is an increasingly popular analytical database, known for its speed, simplicity, and ability to handle large-scale data analysis directly from your laptop or server. Its lightweight design and SQL compatibility make it a great choice for modern data science workflows.\n",
+ "\n",
+ "In this Cookbook, we will demonstrate integrating DuckDB with OpenAI APIs for performing semantic search on the Arxiv dataset, including loading data, generating embeddings, and running similarity queries using SQL.\n",
+ "\n",
+ "This notebook demonstrates how to:\n",
+ "\n",
+ "- Load the [arXiv](https://www.kaggle.com/datasets/spsayakpaul/arxiv-paper-abstracts) paper abstracts dataset into DuckDB\n",
+ "- Generate and store OpenAI embeddings into DuckDB\n",
+ "- Embed a search query with the OpenAI embeddings endpoint\n",
+ "- Perform semantic search in DuckDB using the embedded query"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "aadf2202",
+ "metadata": {},
+ "source": [
+ "## Install dependencies"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ad752660",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!pip install numpy kagglehub duckdb pandas openai"
+ ]
+ },
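+ {
+ "cell_type": "markdown",
+ "id": "b2f4c7a1",
+ "metadata": {},
+ "source": [
+ "The OpenAI client used later in this notebook reads your API key from the `OPENAI_API_KEY` environment variable. Make sure it is set before running the embedding cells, for example:\n",
+ "\n",
+ "```python\n",
+ "import os\n",
+ "\n",
+ "# Set the key for this session if it isn't already configured in your environment.\n",
+ "# Replace the placeholder with your own key.\n",
+ "os.environ[\"OPENAI_API_KEY\"] = \"sk-...\"\n",
+ "```"
+ ]
+ },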
+ {
+ "cell_type": "markdown",
+ "id": "3058bd5a",
+ "metadata": {},
+ "source": [
+ "## Extract the dataset and load into DuckDB\n",
+ "In this example, we'll be using the arXiv paper abstracts from kaggle as an example. Its a simple CSV with titles and summaries. Let's extract it."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6ae41715",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import kagglehub\n",
+ "import pandas as pd\n",
+ "\n",
+ "path = kagglehub.dataset_download(\"spsayakpaul/arxiv-paper-abstracts\")\n",
+ "\n",
+ "path = path+\"/arxiv_data.csv\"\n",
+ "print(path)\n",
+ "\n",
+ "# Load the dataset into DuckDB\n",
+ "import duckdb\n",
+ "\n",
+ "# Create a connection to the database\n",
+ "conn = duckdb.connect('arxiv_data.db')\n",
+ "\n",
+ "# Load the dataset into DuckDB, limiting to 400 rows for testing\n",
+ "duckdb.sql(f\"\"\"\n",
+ " CREATE OR REPLACE TABLE papers AS \n",
+ " SELECT * FROM read_csv('{path}', header=true, parallel=false)\n",
+ " LIMIT 400\n",
+ "\"\"\")\n",
+ "\n",
+ "# Inspect the first 5 rows of the dataset\n",
+ "result = duckdb.sql(\"SELECT * FROM papers LIMIT 5\").df()\n",
+ "\n",
+ "result.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "32a2a22e",
+ "metadata": {},
+ "source": [
+ "### Add an embeddings column to the schema"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "id": "5da323b3",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "┌───────┬────────────┬─────────────┬─────────┬────────────┬─────────┐\n",
+ "│ cid │ name │ type │ notnull │ dflt_value │ pk │\n",
+ "│ int32 │ varchar │ varchar │ boolean │ varchar │ boolean │\n",
+ "├───────┼────────────┼─────────────┼─────────┼────────────┼─────────┤\n",
+ "│ 0 │ titles │ VARCHAR │ false │ NULL │ false │\n",
+ "│ 1 │ summaries │ VARCHAR │ false │ NULL │ false │\n",
+ "│ 2 │ terms │ VARCHAR │ false │ NULL │ false │\n",
+ "│ 3 │ embeddings │ FLOAT[1024] │ false │ NULL │ false │\n",
+ "└───────┴────────────┴─────────────┴─────────┴────────────┴─────────┘"
+ ]
+ },
+ "execution_count": 24,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "duckdb.sql(\"ALTER TABLE papers ADD COLUMN IF NOT EXISTS embeddings FLOAT[1024]\")\n",
+ "\n",
+ "# Verify the new column has been added by inspecting the schema\n",
+ "duckdb.sql(\"PRAGMA table_info(papers)\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d366692b",
+ "metadata": {},
+ "source": [
+ "## Generate embeddings for the dataset"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1f361c07",
+ "metadata": {},
+ "source": [
+ "There are multiple options for creating embeddings in DuckDB. We could either\n",
+ "\n",
+ "1. Loop through batches of inputs in Python, call the embedding model and store each batch in the database.\n",
+ "\n",
+ "2. Create a custom DuckDB function (UDF) to call the model and write the embeddings in a single SQL statement.\n",
+ "\n",
+ "In this notebook, I'll go with option 2, in order to have an \"SQL first\" experience, defining a re-usable SQL embedding function that I could use in different use cases."
+ ]
+ },
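+ {
+ "cell_type": "markdown",
+ "id": "c7d2e8b3",
+ "metadata": {},
+ "source": [
+ "For comparison, a minimal sketch of option 1 might look like the following. It is illustrative only and not used in the rest of the notebook; the batch size is arbitrary, it keys updates on DuckDB's `rowid` pseudocolumn, and it assumes the `papers` table created above plus an explicit cast of the bound list to `FLOAT[1024]`.\n",
+ "\n",
+ "```python\n",
+ "import duckdb\n",
+ "import openai\n",
+ "\n",
+ "client = openai.OpenAI()\n",
+ "\n",
+ "# Fetch the rows that still need embeddings, keyed by DuckDB's rowid pseudocolumn.\n",
+ "rows = duckdb.sql(\n",
+ "    \"SELECT rowid, COALESCE(titles, '') || ' ' || COALESCE(summaries, '') AS text \"\n",
+ "    \"FROM papers WHERE embeddings IS NULL\"\n",
+ ").fetchall()\n",
+ "\n",
+ "batch_size = 100  # illustrative; the embeddings endpoint accepts a list of inputs\n",
+ "for start in range(0, len(rows), batch_size):\n",
+ "    batch = rows[start:start + batch_size]\n",
+ "    response = client.embeddings.create(\n",
+ "        model=\"text-embedding-3-small\",\n",
+ "        input=[text for _, text in batch],\n",
+ "        encoding_format=\"float\",\n",
+ "        dimensions=1024,\n",
+ "    )\n",
+ "    # Results come back in input order; write each vector back to its row.\n",
+ "    for (rowid, _), item in zip(batch, response.data):\n",
+ "        duckdb.execute(\n",
+ "            \"UPDATE papers SET embeddings = ?::FLOAT[1024] WHERE rowid = ?\",\n",
+ "            [item.embedding, rowid],\n",
+ "        )\n",
+ "```"
+ ]
+ },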
+ {
+ "cell_type": "markdown",
+ "id": "5763a058",
+ "metadata": {},
+ "source": [
+ "### Defining an OpenAI embeddings UDF for DuckDB\n",
+ "\n",
+ "The function below specifies the encoding format as \"float\" and sets the embedding dimensions to 1024 which is compatible with the embeddings field size on DuckDB."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bf1dcf16",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "from duckdb.typing import VARCHAR\n",
+ "import openai\n",
+ "client = openai.OpenAI()\n",
+ "\n",
+ "# Define the UDF for embedding a text input using the OpenAI API.\n",
+ "def embed_openai(text: str) -> np.ndarray:\n",
+ " \"\"\"\n",
+ " DuckDB UDF for embedding a text input using the OpenAI API.\n",
+ " \"\"\"\n",
+ " model = \"text-embedding-3-small\"\n",
+ " response = client.embeddings.create(\n",
+ " model=model,\n",
+ " input=text,\n",
+ " encoding_format=\"float\",\n",
+ " dimensions=1024\n",
+ " )\n",
+ "\n",
+ " return response.data[0].embedding\n",
+ "\n",
+ "# Register the UDF with DuckDB.\n",
+ "duckdb.create_function(\"embed_openai\", embed_openai, [VARCHAR], \"FLOAT[1024]\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "222d1038",
+ "metadata": {},
+ "source": [
+ "*Note on performance:* The above function will run a call to OpenAI's embeddings API for every single row. Depending on your dataset size, this might be slow. For larger datasets, consider [upgrading this function](https://lukaszrogalski.substack.com/p/python-udfs-in-duckdb) to work with aggregated data and pass in multiple sentences (batches) to the OpenAI embeddings call."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f0b2d7c9",
+ "metadata": {},
+ "source": [
+ "Now that we’ve registered the function with DuckDB, we can use in like any native function as part of our SQL query:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0619fbf5",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "duckdb.sql(\"SELECT embed_openai('Which papers are related to quantum computing?') AS query_embedding;\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9f452192",
+ "metadata": {},
+ "source": [
+ "### Generating Embeddings\n",
+ "\n",
+ "With the embedding function in place, we can now use it to generate and store embeddings in our table via SQL. The query below runs on every row in the table, calling the OpenAI embedding UDF we defined earlier. On a dataset of about 400 rows, it typically completes in around 2 minutes."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "id": "2d1d68ba",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "duckdb.query(\"\"\"\n",
+ "UPDATE papers\n",
+ "SET embeddings = embed_openai(\n",
+ " COALESCE(titles, '') || ' ' || COALESCE(summaries, '')\n",
+ ")\n",
+ "WHERE embeddings IS NULL\n",
+ "\"\"\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4cc350f1",
+ "metadata": {},
+ "source": [
+ "Inspecting the first 5 rows of the dataset we can see that the embeddings have been created for every row."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5642bcf9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "result = duckdb.sql(\"SELECT * FROM papers LIMIT 5\").df()\n",
+ "result.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "975954b4",
+ "metadata": {},
+ "source": [
+ "## Running a Similarity Search with SQL"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "35ee611a",
+ "metadata": {},
+ "source": [
+ "Now that we have embeddings for each paper, we can use them to perform a semantic similarity search.\n",
+ "\n",
+ "To achieve this, we can use a native DuckDB array distance function such as `array_cosine_similarity`, which computes the cosine similarity between two vectors.\n",
+ "\n",
+ "The query below demonstrates how to generate an embedding for a search term using our `embed_openai` function, and then apply array_cosine_similarity to compare the query embedding with each of the paper embeddings."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "id": "357d8fad",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "
\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " titles | \n",
+ " summaries | \n",
+ " score | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " Medical Matting: A New Perspective on Medical ... | \n",
+ " In medical image segmentation, it is difficult... | \n",
+ " 0.579598 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " Self-Supervision with Superpixels: Training Fe... | \n",
+ " Few-shot semantic segmentation (FSS) has great... | \n",
+ " 0.570959 | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " A Spatial Guided Self-supervised Clustering Ne... | \n",
+ " The segmentation of medical images is a fundam... | \n",
+ " 0.562010 | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " Superpixel-Guided Label Softening for Medical ... | \n",
+ " Segmentation of objects of interest is one of ... | \n",
+ " 0.561727 | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " Efficient and Generic Interactive Segmentation... | \n",
+ " Semantic segmentation of medical images is an ... | \n",
+ " 0.560177 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " titles \\\n",
+ "0 Medical Matting: A New Perspective on Medical ... \n",
+ "1 Self-Supervision with Superpixels: Training Fe... \n",
+ "2 A Spatial Guided Self-supervised Clustering Ne... \n",
+ "3 Superpixel-Guided Label Softening for Medical ... \n",
+ "4 Efficient and Generic Interactive Segmentation... \n",
+ "\n",
+ " summaries score \n",
+ "0 In medical image segmentation, it is difficult... 0.579598 \n",
+ "1 Few-shot semantic segmentation (FSS) has great... 0.570959 \n",
+ "2 The segmentation of medical images is a fundam... 0.562010 \n",
+ "3 Segmentation of objects of interest is one of ... 0.561727 \n",
+ "4 Semantic segmentation of medical images is an ... 0.560177 "
+ ]
+ },
+ "execution_count": 29,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "def search_papers(query_text: str, k: int = 5):\n",
+ " return duckdb.execute(\"\"\"\n",
+ " WITH q AS (\n",
+ " SELECT embed_openai(?) AS qe\n",
+ " )\n",
+ " SELECT\n",
+ " titles,\n",
+ " summaries,\n",
+ " array_cosine_similarity(embeddings, q.qe) AS score\n",
+ " FROM papers, q\n",
+ " WHERE embeddings IS NOT NULL\n",
+ " ORDER BY score DESC\n",
+ " LIMIT ?\n",
+ " \"\"\", [query_text, k]).fetchdf()\n",
+ "\n",
+ "# Test the function\n",
+ "search_papers(\"What are the research papers on image segmentation for the medical field?\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0db28fc6",
+ "metadata": {},
+ "source": [
+ "### Optimizing queries with an index"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8fb6dbd9",
+ "metadata": {},
+ "source": [
+ "While the search query above works well on a dataset of 400 rows, it will become much slower as the data grows into hundreds of thousands of rows. Without an index, DuckDB must compare the query embedding against all document embeddings to find the most similar results.\n",
+ "\n",
+ "To speed up vector search, we can use ANN (Approximate Nearest Neighbor) with [HNSW (Hierarchical Navigable Small World)](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world), available through DuckDB’s [vector similarity search extension](https://duckdb.org/2024/05/03/vector-similarity-search-vss.html).\n",
+ "\n",
+ "Let’s give it a try."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "id": "887d3660",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Install the extension\n",
+ "duckdb.sql(\"INSTALL vss;\")\n",
+ "duckdb.sql(\"LOAD vss;\")\n",
+ "duckdb.sql(\"SET GLOBAL hnsw_enable_experimental_persistence = true;\")\n",
+ "\n",
+ "# Create an index on the embeddings column\n",
+ "duckdb.sql(\"CREATE INDEX IF NOT EXISTS idx_embeddings ON papers USING HNSW (embeddings);\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "753b3740",
+ "metadata": {},
+ "source": [
+ "Now we can verify that the index has been created and run a quick test"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 33,
+ "id": "9d016af2",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "┌───────────────┬──────────────┬─────────────┬────────────┬────────────────┬───────────┬────────────┬───────────┬─────────┬───────────────────────┬───────────┬────────────┬──────────────┬────────────────────────────────────────────────────────────────┐\n",
+ "│ database_name │ database_oid │ schema_name │ schema_oid │ index_name │ index_oid │ table_name │ table_oid │ comment │ tags │ is_unique │ is_primary │ expressions │ sql │\n",
+ "│ varchar │ int64 │ varchar │ int64 │ varchar │ int64 │ varchar │ int64 │ varchar │ map(varchar, varchar) │ boolean │ boolean │ varchar │ varchar │\n",
+ "├───────────────┼──────────────┼─────────────┼────────────┼────────────────┼───────────┼────────────┼───────────┼─────────┼───────────────────────┼───────────┼────────────┼──────────────┼────────────────────────────────────────────────────────────────┤\n",
+ "│ memory │ 570 │ main │ 572 │ idx_embeddings │ 2070 │ papers │ 2065 │ NULL │ {} │ false │ false │ [embeddings] │ CREATE INDEX idx_embeddings ON papers USING HNSW (embeddings); │\n",
+ "└───────────────┴──────────────┴─────────────┴────────────┴────────────────┴───────────┴────────────┴───────────┴─────────┴───────────────────────┴───────────┴────────────┴──────────────┴────────────────────────────────────────────────────────────────┘"
+ ]
+ },
+ "execution_count": 33,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Verify the index has been created\n",
+ "duckdb.sql(\"SELECT * FROM duckdb_indexes();\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 34,
+ "id": "682ce99c",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " titles | \n",
+ " summaries | \n",
+ " score | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " Medical Matting: A New Perspective on Medical ... | \n",
+ " In medical image segmentation, it is difficult... | \n",
+ " 0.579481 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " Self-Supervision with Superpixels: Training Fe... | \n",
+ " Few-shot semantic segmentation (FSS) has great... | \n",
+ " 0.570870 | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " A Spatial Guided Self-supervised Clustering Ne... | \n",
+ " The segmentation of medical images is a fundam... | \n",
+ " 0.562006 | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " Superpixel-Guided Label Softening for Medical ... | \n",
+ " Segmentation of objects of interest is one of ... | \n",
+ " 0.561522 | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " Efficient and Generic Interactive Segmentation... | \n",
+ " Semantic segmentation of medical images is an ... | \n",
+ " 0.560087 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " titles \\\n",
+ "0 Medical Matting: A New Perspective on Medical ... \n",
+ "1 Self-Supervision with Superpixels: Training Fe... \n",
+ "2 A Spatial Guided Self-supervised Clustering Ne... \n",
+ "3 Superpixel-Guided Label Softening for Medical ... \n",
+ "4 Efficient and Generic Interactive Segmentation... \n",
+ "\n",
+ " summaries score \n",
+ "0 In medical image segmentation, it is difficult... 0.579481 \n",
+ "1 Few-shot semantic segmentation (FSS) has great... 0.570870 \n",
+ "2 The segmentation of medical images is a fundam... 0.562006 \n",
+ "3 Segmentation of objects of interest is one of ... 0.561522 \n",
+ "4 Semantic segmentation of medical images is an ... 0.560087 "
+ ]
+ },
+ "execution_count": 34,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Test the function\n",
+ "search_papers(\"What are the research papers on image segmentation for the medical field?\")"
+ ]
+ },
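+ {
+ "cell_type": "markdown",
+ "id": "e9a4c1d7",
+ "metadata": {},
+ "source": [
+ "One caveat worth noting: an HNSW index only accelerates queries that follow the pattern the vss extension recognizes, namely an `ORDER BY` on the distance function that matches the index metric, followed by a `LIMIT`. The index above was created with the extension's default metric, while `search_papers` orders by `array_cosine_similarity` descending, so that query may still fall back to a full scan. The sketch below is illustrative only: it builds a second, cosine-metric index and phrases the query in the index-friendly form, where `array_cosine_distance(a, b)` is 1 minus the cosine similarity and `search_papers_ann` is a name introduced here.\n",
+ "\n",
+ "```python\n",
+ "# Illustrative sketch: create a cosine-metric HNSW index and phrase the query as\n",
+ "# ORDER BY <matching distance function> ... LIMIT so the planner can use it.\n",
+ "duckdb.sql(\"\"\"\n",
+ "    CREATE INDEX IF NOT EXISTS idx_embeddings_cosine\n",
+ "    ON papers USING HNSW (embeddings) WITH (metric = 'cosine');\n",
+ "\"\"\")\n",
+ "\n",
+ "def search_papers_ann(query_text: str, k: int = 5):\n",
+ "    # Embed the query once in Python and pass the vector in as a typed constant.\n",
+ "    query_embedding = embed_openai(query_text)\n",
+ "    return duckdb.execute(\"\"\"\n",
+ "        SELECT\n",
+ "            titles,\n",
+ "            summaries,\n",
+ "            array_cosine_distance(embeddings, ?::FLOAT[1024]) AS distance\n",
+ "        FROM papers\n",
+ "        ORDER BY distance\n",
+ "        LIMIT ?\n",
+ "    \"\"\", [query_embedding, k]).fetchdf()\n",
+ "\n",
+ "search_papers_ann(\"What are the research papers on image segmentation for the medical field?\")\n",
+ "```"
+ ]
+ },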
+ {
+ "cell_type": "markdown",
+ "id": "977a9595",
+ "metadata": {},
+ "source": [
+ "## Conclusion\n",
+ "\n",
+ "In this cookbook, we explored how to integrate OpenAI’s embedding calls as a reusable UDF in DuckDB. This approach is especially powerful when storing and querying embeddings directly alongside your data. By combining embeddings with DuckDB’s familiar SQL interface, you unlock new possibilities for advanced data analysis and retrieval—all within a simple, efficient workflow."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.13.7"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/registry.yaml b/registry.yaml
index 8cff92f5e1..e1d601d013 100644
--- a/registry.yaml
+++ b/registry.yaml
@@ -2541,6 +2541,16 @@
tags:
- images
+- title: Semantic Search with DuckDB and OpenAI Embeddings
+ path: examples/vector_databases/duckdb/duckdb-sql-with-openai-embeddings.ipynb
+ date: 2025-09-06
+ authors:
+ - ayman-openai
+ tags:
+ - embeddings
+ - duckdb
+ - sql
+ - semantic-search
- title: Use Codex CLI to automatically fix CI failures
path: examples/codex/Autofix-github-actions.ipynb