Conversation
CodeByKarthik
left a comment
There was a problem hiding this comment.
Hi James, thanks for your commit. The PR doesn't have any issues and I have left few comments. Let me know your thoughts on this.
| ) | ||
| self._logger.info(f"Creating a table for {self._embedding_model.get_sentence_embedding_dimension()} dimensional vectors") | ||
| table_manage_cursor.execute( | ||
| sql.SQL(""" |
There was a problem hiding this comment.
Looks like this query is built with f-string interpolation. Can we switch to a parameterised query or a T-string instead of f-string's string concatenation?
If yes, could you please change in other places as well?
There was a problem hiding this comment.
This is actually psycopg3's API for composing queries including identifiers: https://www.psycopg.org/psycopg3/docs/api/sql.html
If you look carefully it's not formatting the string, but a sql.SQL object
| ) | ||
| conn.commit() | ||
|
|
||
| def embed_batch(self, concept_batch: list[Concept]) -> list[tuple[int, str, Tensor]]: |
There was a problem hiding this comment.
Would it make sense to wrap this result in a dataclass rather than returning a tuple, to improve type safety and readability?
Example:
from dataclasses import dataclass
@dataclass(slots=True)
class EmbeddedConcept:
concept_id: int
concept_name: str
embedding: list[float]
| "invalid_reason": pl.String(), | ||
| } | ||
|
|
||
| class PostgresConceptEmbedder(): |
There was a problem hiding this comment.
Could you please add comments to the class and methods?
| } | ||
|
|
||
| class PostgresConceptEmbedder(): | ||
| def __init__( |
There was a problem hiding this comment.
"I think PostgresConceptEmbedder currently has too many responsibilities, including:
Database connection management
Schema/extension management (check_extension, reset_embedding_table)
Data fetching
Concept transformation
Embedding generation
From a separation-of-concerns perspective, this could be a good one for refactoring into smaller, focused classes or modules (will be easier to test and scalable in the future for adding additional methods).
Please ignore this if it needs to be done urgently, as you mentioned that people need it to clone. If that's the case, we can focus on the functionality instead.
PR Description
Previously, if you don't have a parquet file for embeddings and want to have some in OMOP with PGVector, there wasn't anything to be done about it. This code provides the ability to read from Athena vocabulary CSVs or postgres, create embeddings from some string representation of concepts, then either load them into the db, or write to a parquet file.
I've meant to do this for a while, getting it done was prompted by someone wanting to run lettuce and getting stuck because they couldn't make embeddings
Related Issues or other material
Related #179
Closes #179
Screenshots, example outputs/behaviour etc.
✅ Added/updated tests?