13 changes: 13 additions & 0 deletions README.md
@@ -97,10 +97,23 @@ Each paper includes:
- Open access PDF link (when available)
- Paper type (article, review, etc.)

## Semantic Search

Find related papers using AI embeddings. Works locally, no API keys needed.

```typescript
yield* engine.embedAll()

const similar = yield* engine.similar("your research question")
```

The model downloads once (~23MB) on first use.

## Docs

- [Getting Started](./docs/getting-started.md)
- [Search Options](./docs/search-options.md)
- [Semantic Search](./docs/semantic-search.md)
- [Working with Data](./docs/working-with-data.md)
- [Review Workflow](./docs/review-workflow.md)

157 changes: 157 additions & 0 deletions bun.lock

Large diffs are not rendered by default.

77 changes: 77 additions & 0 deletions docs/semantic-search.md
@@ -0,0 +1,77 @@
# Semantic Search

Find papers by meaning, not just keywords. Each abstract is turned into an embedding vector, and papers are ranked by how close their vectors are to your query.

## How It Works

1. Each paper's abstract gets converted to a 384-dimensional vector
2. Your search query also becomes a vector
3. Papers closest to your query (by cosine similarity) are returned

All processing happens locally using the `all-MiniLM-L6-v2` model.
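
The three steps above can be sketched in a few lines. This is a minimal illustration rather than the library's implementation: `rank`, `cosine`, and the `Embedded` type are hypothetical names, and toy 3-dimensional vectors stand in for the 384-dimensional model output.

```typescript
// Hypothetical shape: a paper id paired with its embedding vector.
type Embedded = { id: string; embedding: Float32Array }

// Cosine similarity: dot(a, b) / (|a| * |b|); returns 0 if either vector is all zeros.
function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i]! * b[i]!
    normA += a[i]! * a[i]!
    normB += b[i]! * b[i]!
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB)
  return denom === 0 ? 0 : dot / denom
}

// Score every paper against the query vector and sort best-first.
function rank(query: Float32Array, papers: Embedded[]): Array<{ id: string; similarity: number }> {
  return papers
    .map(({ id, embedding }) => ({ id, similarity: cosine(query, embedding) }))
    .sort((x, y) => y.similarity - x.similarity)
}

const ranked = rank(new Float32Array([1, 0, 0]), [
  { id: "orthogonal", embedding: new Float32Array([0, 1, 0]) },
  { id: "same-direction", embedding: new Float32Array([2, 0, 0]) },
  { id: "partial", embedding: new Float32Array([1, 1, 0]) },
])
console.log(ranked.map((r) => r.id)) // [ "same-direction", "partial", "orthogonal" ]
```

Vector length does not affect the score: `[2, 0, 0]` scores 1 against `[1, 0, 0]` because only direction matters.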

## Generate Embeddings

Before searching, papers need embeddings:

```typescript
const program = Effect.gen(function* () {
  // Embed all papers that don't have embeddings yet
  const result = yield* engine.embedAll()
  console.log(`Embedded ${result.embedded} papers`)
})
```

The first run downloads the model (~23MB) to `~/.get-papers/models/`. Set the `GET_PAPERS_MODELS` environment variable to use a different directory.

You can also embed a single paper:

```typescript
yield* engine.embed("paper-id")
```

## Search by Query

Find papers similar to a natural language question:

```typescript
const program = Effect.gen(function* () {
  const results = yield* engine.similar("machine learning for drug discovery", {
    limit: 10,
    minScore: 0.5,
  })

  for (const { paper, similarity } of results) {
    console.log(`${paper.title} (${similarity.toFixed(2)})`)
  }
})
```

## Find Related Papers

Given one paper, find others like it:

```typescript
const program = Effect.gen(function* () {
  const related = yield* engine.similarTo("paper-id", { limit: 5 })

  for (const { paper, similarity } of related) {
    console.log(`${paper.title} (${similarity.toFixed(2)})`)
  }
})
```
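
One plausible way to picture a paper-to-paper lookup is to use the source paper's stored embedding as the query and leave the paper itself out of the results. The sketch below assumes exactly that; `relatedTo`, `dot`, and the `Stored` type are hypothetical names, and how `engine.similarTo` works internally is not shown in this document. Because the model output is normalized to unit length, a plain dot product equals the cosine similarity here.

```typescript
// Hypothetical shape: a paper id paired with a unit-length embedding.
type Stored = { id: string; embedding: Float32Array }

// For unit-length vectors the dot product equals the cosine similarity.
function dot(a: Float32Array, b: Float32Array): number {
  let sum = 0
  for (let i = 0; i < a.length; i++) sum += a[i]! * b[i]!
  return sum
}

// Rank all other papers against the source paper's own embedding.
function relatedTo(sourceId: string, papers: Stored[], limit = 5) {
  const source = papers.find((p) => p.id === sourceId)
  if (!source) throw new Error(`Paper not found: ${sourceId}`)
  return papers
    .filter((p) => p.id !== sourceId) // a paper is trivially identical to itself
    .map((p) => ({ id: p.id, similarity: dot(source.embedding, p.embedding) }))
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, limit)
}

const library: Stored[] = [
  { id: "p1", embedding: new Float32Array([1, 0]) },
  { id: "p2", embedding: new Float32Array([0.6, 0.8]) },
  { id: "p3", embedding: new Float32Array([0, 1]) },
]
const neighbours = relatedTo("p1", library)
console.log(neighbours.map((n) => n.id)) // [ "p2", "p3" ]
```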

## Options

| Option | Default | Description |
|--------|---------|-------------|
| `limit` | 10 | Maximum results to return |
| `minScore` | 0.3 | Minimum similarity (0-1) |
| `status` | all | Filter by paper status |
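
A small sketch of how `limit` and `minScore` interact, following the filter, sort, slice order used by `findSimilar` in `src/embeddings/index.ts`. The `applyOptions` helper and the sample scores are made up for illustration; `status` is left out because its allowed values are not listed here.

```typescript
type Scored = { id: string; similarity: number }
type Options = { limit?: number; minScore?: number }

// Drop results below minScore, sort best-first, then keep at most `limit` results.
function applyOptions(scored: Scored[], options: Options = {}): Scored[] {
  const { limit = 10, minScore = 0.3 } = options // defaults from the table above
  return scored
    .filter((r) => r.similarity >= minScore)
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, limit)
}

// Illustrative scores only.
const scored: Scored[] = [
  { id: "a", similarity: 0.82 },
  { id: "b", similarity: 0.25 },
  { id: "c", similarity: 0.47 },
  { id: "d", similarity: 0.61 },
]

const withDefaults = applyOptions(scored) // ids: a, d, c ("b" is below 0.3)
const strict = applyOptions(scored, { limit: 1, minScore: 0.5 }) // ids: a
console.log(withDefaults.map((r) => r.id), strict.map((r) => r.id))
```

The threshold is applied before the limit, so a high `minScore` can return fewer than `limit` results.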

## Tips

- Higher `minScore` = more relevant but fewer results
- Embeddings persist in the database, so you only generate once
- Works best with papers that have abstracts
- Model is ~23MB and caches locally after first download
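
To make the persistence tip concrete, here is a rough sketch of the byte-level round trip a database blob column would need. It mirrors the idea behind `embeddingToBuffer` and `bufferToEmbedding` in `src/embeddings/utils.ts`, but `toBytes` and `fromBytes` are hypothetical names and use `Uint8Array` instead of `Buffer` so the snippet runs without Node-specific types.

```typescript
// Serialize: view the float data as raw bytes (4 bytes per value) and copy them.
function toBytes(embedding: Float32Array): Uint8Array {
  return new Uint8Array(embedding.buffer, embedding.byteOffset, embedding.byteLength).slice()
}

// Deserialize: copy into a fresh ArrayBuffer so the Float32Array view starts
// at offset 0 and is therefore 4-byte aligned.
function fromBytes(bytes: Uint8Array): Float32Array {
  const copy = new ArrayBuffer(bytes.length)
  new Uint8Array(copy).set(bytes)
  return new Float32Array(copy)
}

const original = new Float32Array([0.25, -0.5, 1])
const stored = toBytes(original)
const restored = fromBytes(stored)
console.log(stored.length, Array.from(restored)) // 12 [ 0.25, -0.5, 1 ]
```

At 4 bytes per dimension, a 384-dimensional embedding takes 1,536 bytes per paper.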
21 changes: 21 additions & 0 deletions llms.txt
@@ -106,6 +106,25 @@ yield* engine.reject("paper-id", "reason")
const pending = yield* engine.listPendingReviews()
```

### Semantic Search (embeddings)

Generate embeddings and find similar papers using a local AI model (all-MiniLM-L6-v2).

```typescript
yield* engine.embed("paper-id")

const result = yield* engine.embedAll(100)

const similar = yield* engine.similar("machine learning for climate", {
  limit: 10,
  minScore: 0.5,
})

const related = yield* engine.similarTo("paper-id", { limit: 5 })
```

The first call downloads the model (~23MB) to `~/.get-papers/models/`.

## Paper Object

```typescript
@@ -116,6 +135,7 @@ const pending = yield* engine.listPendingReviews()
  title: string
  authors: string[]
  abstract: string
  abstractEmbedding: Float32Array | null
  publishedDate: string | null
  publishedYear: number | null
  url: string
@@ -152,6 +172,7 @@ const pending = yield* engine.listPendingReviews()
- `StorageError` - Database failed
- `ReviewError` - Review operation failed
- `PaperNotFoundError` - ID not found
- `EmbeddingError` - Embedding generation failed

## Database

5 changes: 4 additions & 1 deletion package.json
@@ -23,6 +23,7 @@
"example": "bun run example.ts"
},
"dependencies": {
"@xenova/transformers": "^2.17.2",
"effect": "^3.12.0"
},
"devDependencies": {
@@ -41,7 +42,9 @@
"effect",
"typescript",
"bun",
- "sqlite"
+ "sqlite",
+ "embeddings",
+ "semantic-search"
],
"license": "MIT",
"repository": {
51 changes: 51 additions & 0 deletions src/embeddings/index.ts
@@ -0,0 +1,51 @@
import { Effect, Data } from "effect"
import { cosineSimilarity } from "./utils"

export { cosineSimilarity, EMBEDDING_DIM, embeddingToBuffer, bufferToEmbedding } from "./utils"

export class EmbeddingError extends Data.TaggedError("EmbeddingError")<{
  readonly operation: string
  readonly cause: unknown
}> {}

// Import ./local lazily so the model dependency loads only when embeddings are requested.
export const getEmbedding = (text: string): Effect.Effect<Float32Array, EmbeddingError> =>
  Effect.tryPromise({
    try: async () => {
      const { getEmbeddingRaw } = await import("./local")
      return getEmbeddingRaw(text)
    },
    catch: (e) => new EmbeddingError({ operation: "getEmbedding", cause: e }),
  })

export const getEmbeddings = (texts: string[]): Effect.Effect<Float32Array[], EmbeddingError> =>
  Effect.tryPromise({
    try: async () => {
      const { getEmbeddingsRaw } = await import("./local")
      return getEmbeddingsRaw(texts)
    },
    catch: (e) => new EmbeddingError({ operation: "getEmbeddings", cause: e }),
  })

export interface SimilarityResult {
  id: string
  similarity: number
}

export const findSimilar = (
  queryEmbedding: Float32Array,
  embeddings: Array<{ id: string; embedding: Float32Array }>,
  options: { limit?: number; minScore?: number } = {}
): SimilarityResult[] => {
  const { limit = 10, minScore = 0.3 } = options

  const scored = embeddings
    .map(({ id, embedding }) => ({
      id,
      similarity: cosineSimilarity(queryEmbedding, embedding),
    }))
    .filter((r) => r.similarity >= minScore)
    .sort((a, b) => b.similarity - a.similarity)

  return scored.slice(0, limit)
}
64 changes: 64 additions & 0 deletions src/embeddings/local.ts
@@ -0,0 +1,64 @@
import { join } from "path"
import { homedir } from "os"

export { cosineSimilarity, EMBEDDING_DIM } from "./utils"

const MODEL_NAME = "Xenova/all-MiniLM-L6-v2"

const MODELS_DIR =
  process.env["GET_PAPERS_MODELS"] || join(homedir(), ".get-papers", "models")

let embedder: unknown = null
let loadingPromise: Promise<unknown> | null = null
let transformersError: string | null = null

async function loadEmbedder(): Promise<unknown> {
  if (embedder) return embedder
  if (transformersError) throw new Error(transformersError)
  if (loadingPromise) return loadingPromise

  // Assign the promise before the first await so concurrent callers share one load.
  loadingPromise = (async () => {
    const { pipeline, env } = await import("@xenova/transformers")
    env.localModelPath = MODELS_DIR
    env.cacheDir = MODELS_DIR
    env.allowRemoteModels = true

    return pipeline("feature-extraction", MODEL_NAME, { quantized: true })
  })()

  try {
    embedder = await loadingPromise
    return embedder
  } catch (err) {
    const msg = err instanceof Error ? err.message : String(err)
    transformersError = `Embeddings unavailable: ${msg}. Run 'bun install' with network access to fix.`
    throw new Error(transformersError)
  } finally {
    loadingPromise = null
  }
}

export async function getEmbeddingRaw(text: string): Promise<Float32Array> {
  const model = await loadEmbedder()

  const truncated = text.slice(0, 2000)

  const output = await (model as (text: string, options: { pooling: string; normalize: boolean }) => Promise<{ data: ArrayLike<number> }>)(truncated, {
    pooling: "mean",
    normalize: true,
  })

  return new Float32Array(output.data)
}

export async function getEmbeddingsRaw(texts: string[]): Promise<Float32Array[]> {
  const results: Float32Array[] = []

  for (const text of texts) {
    const embedding = await getEmbeddingRaw(text)
    results.push(embedding)
  }

  return results
}
31 changes: 31 additions & 0 deletions src/embeddings/utils.ts
@@ -0,0 +1,31 @@
export const EMBEDDING_DIM = 384

export function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`)
  }

  let dot = 0
  let normA = 0
  let normB = 0

  for (let i = 0; i < a.length; i++) {
    dot += a[i]! * b[i]!
    normA += a[i]! * a[i]!
    normB += b[i]! * b[i]!
  }

  const denom = Math.sqrt(normA) * Math.sqrt(normB)
  return denom === 0 ? 0 : dot / denom
}

export function embeddingToBuffer(embedding: Float32Array | number[]): Buffer {
  const arr = embedding instanceof Float32Array ? embedding : new Float32Array(embedding)
  return Buffer.from(arr.buffer, arr.byteOffset, arr.byteLength)
}

export function bufferToEmbedding(buffer: Buffer | Uint8Array): Float32Array {
  const copy = new ArrayBuffer(buffer.length)
  new Uint8Array(copy).set(buffer)
  return new Float32Array(copy)
}