Feature/build embeddings#180

Open
kuraisle wants to merge 12 commits into main from feature/build-embeddings

@kuraisle kuraisle commented Jan 7, 2026

✨ Feature

PR Description

Previously, if you didn't have a parquet file of embeddings and wanted embeddings in an OMOP database with pgvector, there was no way to create them. This code can read concepts from Athena vocabulary CSVs or from Postgres, create embeddings from a string representation of each concept, and then either load them into the database or write them to a parquet file.

I've meant to do this for a while. Getting it done was prompted by someone who wanted to run lettuce and got stuck because they couldn't make embeddings.

Related Issues or other material

Related #179
Closes #179

Screenshots, example outputs/behaviour etc.

✅ Added/updated tests?

  • This PR contains relevant tests / Or doesn't need to per the below explanation

Collaborator

@CodeByKarthik CodeByKarthik left a comment

Hi James, thanks for your commit. The PR doesn't have any issues, and I have left a few comments. Let me know your thoughts on them.

)
self._logger.info(f"Creating a table for {self._embedding_model.get_sentence_embedding_dimension()} dimensional vectors")
table_manage_cursor.execute(
sql.SQL("""
Collaborator

It looks like this query is built with f-string interpolation. Can we switch to a parameterised query or a t-string (template string) instead of interpolating values into the SQL text?

If so, could you please change it in the other places as well?

Member Author

This is actually psycopg 3's API for composing queries that include identifiers: https://www.psycopg.org/psycopg3/docs/api/sql.html

If you look carefully, it isn't formatting a plain string but a sql.SQL object.

)
conn.commit()

def embed_batch(self, concept_batch: list[Concept]) -> list[tuple[int, str, Tensor]]:
Collaborator

Would it make sense to wrap this result in a dataclass rather than returning a tuple, to improve type safety and readability?

Example:

from dataclasses import dataclass

@dataclass(slots=True)
class EmbeddedConcept:
    concept_id: int
    concept_name: str
    embedding: list[float]

"invalid_reason": pl.String(),
}

class PostgresConceptEmbedder():
Collaborator

Could you please add comments to the class and methods?

}

class PostgresConceptEmbedder():
def __init__(
Collaborator

I think PostgresConceptEmbedder currently has too many responsibilities, including:

  • Database connection management
  • Schema/extension management (check_extension, reset_embedding_table)
  • Data fetching
  • Concept transformation
  • Embedding generation

From a separation-of-concerns perspective, this would be a good candidate for refactoring into smaller, focused classes or modules. They would be easier to test and easier to extend with additional methods in the future.

Please ignore this if the PR needs to be merged urgently, since you mentioned that people are blocked without it. If that's the case, we can focus on the functionality instead.
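
One way the split suggested above might look, as a rough sketch only. Every name here (`ConceptSource`, `Embedder`, `EmbeddingSink`, `run_pipeline`, and the in-memory stand-ins) is hypothetical, and the `Concept` class is a simplified placeholder for the PR's own type, whose fields are not shown in this thread.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass(frozen=True)
class Concept:
    """Simplified placeholder for the PR's Concept type."""
    concept_id: int
    concept_name: str


class ConceptSource(Protocol):
    """Fetches concepts, e.g. from Postgres or Athena CSVs."""
    def fetch(self) -> Iterable[Concept]: ...


class Embedder(Protocol):
    """Turns strings into vectors."""
    def embed(self, texts: list[str]) -> list[list[float]]: ...


class EmbeddingSink(Protocol):
    """Stores results, e.g. in a pgvector table or a parquet file."""
    def write(self, rows: list[tuple[int, str, list[float]]]) -> None: ...


def run_pipeline(source: ConceptSource, embedder: Embedder, sink: EmbeddingSink) -> int:
    """Wire the three parts together and return the number of rows written."""
    concepts = list(source.fetch())
    vectors = embedder.embed([c.concept_name for c in concepts])
    rows = [(c.concept_id, c.concept_name, v) for c, v in zip(concepts, vectors)]
    sink.write(rows)
    return len(rows)


# In-memory stand-ins, so the pipeline can be tested without a database or model.
class ListSource:
    def __init__(self, concepts: list[Concept]) -> None:
        self._concepts = concepts

    def fetch(self) -> Iterable[Concept]:
        return iter(self._concepts)


class LengthEmbedder:
    def embed(self, texts: list[str]) -> list[list[float]]:
        # Fake one-dimensional "embedding": the length of the text.
        return [[float(len(t))] for t in texts]


class ListSink:
    def __init__(self) -> None:
        self.rows: list[tuple[int, str, list[float]]] = []

    def write(self, rows: list[tuple[int, str, list[float]]]) -> None:
        self.rows.extend(rows)


sink = ListSink()
count = run_pipeline(ListSource([Concept(1, "Aspirin")]), LengthEmbedder(), sink)
```

With this shape, connection and schema management would live inside the Postgres-backed source and sink, and the embedding step could be tested with a fake like `LengthEmbedder`.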

Development

Successfully merging this pull request may close these issues.

Pipeline for embeddings creation

2 participants