Feature/build embeddings by kuraisle · Pull Request #180 · Health-Informatics-UoN/lettuce

kuraisle · 2026-01-07T12:18:29Z


✨ Feature

PR Description

Previously, if you didn't have a parquet file of embeddings and wanted to have some in OMOP with PGVector, there was no way to produce them. This code can read concepts from Athena vocabulary CSVs or from postgres, create embeddings from a string representation of each concept, and then either load them into the db or write them to a parquet file.

I've meant to do this for a while; what finally prompted it was someone wanting to run lettuce and getting stuck because they couldn't make embeddings.

Related Issues or other material

Related #179
Closes #179

Screenshots, example outputs/behaviour etc.

✅ Added/updated tests?

This PR contains relevant tests / Or doesn't need to per the below explanation

CodeByKarthik

Hi James, thanks for your commit. The PR doesn't have any issues, and I have left a few comments. Let me know your thoughts on them.

CodeByKarthik · 2026-01-07T16:20:23Z

build-embeddings/embedding_utils/fetch_concept_batches.py

+                )
+                self._logger.info(f"Creating a table for {self._embedding_model.get_sentence_embedding_dimension()} dimensional vectors")
+                table_manage_cursor.execute(
+                        sql.SQL("""


It looks like this query is built with f-string interpolation. Can we switch to a parameterised query or a t-string instead of building the SQL by string concatenation in an f-string?

If so, could you please change it in the other places as well?

kuraisle · 2026-01-09

This is actually psycopg3's API for composing queries that include identifiers: https://www.psycopg.org/psycopg3/docs/api/sql.html

If you look carefully, it isn't formatting a plain string: the query is a sql.SQL object, and the f-string on the line above is only the log message.

CodeByKarthik · 2026-01-07T16:25:04Z

build-embeddings/embedding_utils/fetch_concept_batches.py

+                        )
+                conn.commit()
+
+    def embed_batch(self, concept_batch: list[Concept]) -> list[tuple[int, str, Tensor]]:


Would it make sense to wrap this result in a dataclass rather than returning a tuple, to improve type safety and readability?

Example:

from dataclasses import dataclass

@dataclass(slots=True)
class EmbeddedConcept:
    concept_id: int
    concept_name: str
    embedding: list[float]

CodeByKarthik · 2026-01-07T16:25:50Z

build-embeddings/embedding_utils/fetch_concept_batches.py

+        "invalid_reason": pl.String(),
+        }
+
+class PostgresConceptEmbedder():


Could you please add comments to the class and methods?

CodeByKarthik · 2026-01-07T16:42:42Z

build-embeddings/embedding_utils/fetch_concept_batches.py

+        }
+
+class PostgresConceptEmbedder():
+    def __init__(


I think PostgresConceptEmbedder currently has too many responsibilities, including:

Database connection management
Schema/extension management (check_extension, reset_embedding_table)
Data fetching
Concept transformation
Embedding generation

From a separation-of-concerns perspective, this would be a good candidate for refactoring into smaller, focused classes or modules, which would be easier to test and easier to extend with additional methods in the future.

Please ignore this if the PR needs to land urgently, since you mentioned that people are waiting on it. If that's the case, we can focus on the functionality instead.
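One possible shape for that split, as a rough sketch only: every name below (`ConceptSource`, `Embedder`, `EmbeddingSink`, `run_pipeline`, and the in-memory stand-ins) is hypothetical, and `Concept` here is a simplified stand-in for the PR's own type. The idea is that fetching, embedding, and writing each sit behind a small interface, so each can be tested without a database or a model.

```python
from dataclasses import dataclass
from typing import Iterator, Protocol, Sequence


@dataclass(frozen=True)
class Concept:
    """Simplified stand-in for the PR's Concept type."""
    concept_id: int
    concept_name: str


class ConceptSource(Protocol):
    def batches(self, batch_size: int) -> Iterator[list[Concept]]: ...


class Embedder(Protocol):
    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...


class EmbeddingSink(Protocol):
    def write(self, concepts: Sequence[Concept], vectors: Sequence[list[float]]) -> None: ...


def run_pipeline(source: ConceptSource, embedder: Embedder,
                 sink: EmbeddingSink, batch_size: int = 2) -> int:
    """Fetch, embed, and store concepts batch by batch; return how many were processed."""
    total = 0
    for batch in source.batches(batch_size):
        vectors = embedder.embed([c.concept_name for c in batch])
        sink.write(batch, vectors)
        total += len(batch)
    return total


# In-memory stand-ins, useful for unit tests.
class ListSource:
    def __init__(self, concepts: Sequence[Concept]) -> None:
        self._concepts = list(concepts)

    def batches(self, batch_size: int) -> Iterator[list[Concept]]:
        for start in range(0, len(self._concepts), batch_size):
            yield self._concepts[start:start + batch_size]


class LengthEmbedder:
    """Fake embedder: a one-dimensional vector holding the name's length."""
    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        return [[float(len(text))] for text in texts]


class MemorySink:
    def __init__(self) -> None:
        self.rows: list[tuple[int, list[float]]] = []

    def write(self, concepts: Sequence[Concept], vectors: Sequence[list[float]]) -> None:
        self.rows.extend((c.concept_id, v) for c, v in zip(concepts, vectors))
```

Under this layout, a Postgres-backed source, a sentence-transformers embedder, and a parquet or PGVector sink would each implement one of the three interfaces, and schema or extension management could live in a separate helper.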

James Mitchell-White and others added 12 commits December 15, 2025 21:16

embeddings code in workspace

cead718

move to embedding_utils to stop collisions

1b288d1

move utils

7803360

start tests

04fa7bd

slightly broken

d8077cc

typer cli

16afb03

build package

cbdd252

slow mapping elements

0f847b6

no map just go

affc937

log to stdout

eedd3b1

add logging

a5fe701

Update README

b3f6f82

CodeByKarthik requested changes Jan 7, 2026

kuraisle wants to merge 12 commits into main from feature/build-embeddings


kuraisle Jan 9, 2026
