@fileames
Member

This PR makes the necessary changes to make sure our integrations pass the standard tests offered in langchain-tests.

Changes include:

  • Previously, inserting documents with duplicate IDs could raise a unique constraint error and fail the entire batch. We now use batcherrors=True (https://python-oracledb.readthedocs.io/en/latest/user_guide/batch_statement.html#handling-data-errors) so per-row errors don’t invalidate other inserts. Only successfully inserted IDs are returned.

  • Optional upsert behavior: Standard tests expect rows with duplicate IDs to be updated rather than erroring. To preserve backward compatibility, we introduced a constructor parameter mutate_on_duplicate:
    False (default): preserve previous behavior (no updates on duplicate IDs).
    True: update existing rows (texts, metadata, etc.) when duplicate IDs are provided.

  • New methods: Added get_by_ids and aget_by_ids.

  • ID handling and hashing

    • In our current implementation, when IDs aren’t provided to add_texts, we generate them via uuid.uuid4() and store a hashed version in a RAW column. Users need the generated IDs to call delete or get_by_ids, so add_texts is expected to return them.
    • However, we returned the hashed versions, which do not work when passed to delete or get_by_ids, because those methods hash their input again before searching the documents:
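from langchain_core.documents import Document

# `store` is an already-initialized vector store instance.
original_documents = [
    Document(page_content="foo1", metadata={"id": "1"}),
    Document(page_content="bar2", metadata={"id": "2"}),
]
ids = store.add_documents(original_documents)
store.delete(ids)  # the returned ids were already hashed, and delete hashes them again

assert len(store.similarity_search("foo", k=10)) == 0  # FAILS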
  • This behaviour is fixed to return the unhashed versions.

  • Documents returned by the similarity_search functions did not have the id field, because the original unhashed IDs were not saved to the DB. To keep the table structure unchanged for users with existing tables, the original IDs are now stored in the metadata under the key "__orcl_internal_doc_id", which is then used to populate the id field of the returned Documents.
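The batcherrors change from the first item could be sketched roughly as follows. This is only an illustration, assuming a python-oracledb cursor: the function name, the table_name argument, and the id/text column names are hypothetical and not taken from this PR, and the real code also hashes IDs before binding them. Only executemany(..., batcherrors=True) and getbatcherrors() are actual python-oracledb calls.

```python
from typing import Any, List, Sequence


def insert_skipping_failed_rows(
    cursor: Any,
    table_name: str,
    ids: Sequence[str],
    texts: Sequence[str],
) -> List[str]:
    """Insert (id, text) rows and return the IDs whose insert succeeded.

    `table_name` and the column names are hypothetical; `table_name` is
    assumed to come from trusted configuration, not from user input.
    """
    sql = f"INSERT INTO {table_name} (id, text) VALUES (:1, :2)"
    rows = list(zip(ids, texts))

    # With batcherrors=True a failing row (for example a unique constraint
    # violation) is recorded instead of aborting the whole batch.
    cursor.executemany(sql, rows, batcherrors=True)

    # Each batch error carries the offset of the row that failed.
    failed_offsets = {error.offset for error in cursor.getbatcherrors()}

    return [doc_id for i, doc_id in enumerate(ids) if i not in failed_offsets]
```

Returning only the surviving IDs lets the caller tell which rows actually landed in the table, which is what the description above means by "only successfully inserted IDs are returned".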

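The ID round trip and the mutate_on_duplicate flag described above can be modelled with a small in-memory toy. This is a sketch under stated assumptions, not the integration's code: ToyStore and hash_id are hypothetical, the SHA-256 based hash is only a stand-in for whatever hashing the store really uses, and documents are plain dicts rather than Document objects. The metadata key is the one named in the description.

```python
import hashlib
import uuid
from typing import Dict, List, Optional, Sequence

INTERNAL_ID_KEY = "__orcl_internal_doc_id"  # metadata key named in the PR description


def hash_id(doc_id: str) -> str:
    """Hypothetical stand-in for the store's ID hashing."""
    return hashlib.sha256(doc_id.encode("utf-8")).hexdigest()[:16].upper()


class ToyStore:
    """In-memory model: rows are keyed by the hashed ID, like the RAW column."""

    def __init__(self, mutate_on_duplicate: bool = False) -> None:
        self.mutate_on_duplicate = mutate_on_duplicate
        self.rows: Dict[str, dict] = {}

    def add_texts(
        self, texts: Sequence[str], ids: Optional[Sequence[str]] = None
    ) -> List[str]:
        ids = ids or [str(uuid.uuid4()) for _ in texts]
        written = []
        for text, doc_id in zip(texts, ids):
            key = hash_id(doc_id)
            if key in self.rows and not self.mutate_on_duplicate:
                continue  # duplicate ID: keep the old row, report nothing
            # Keep the original ID in metadata so it can be returned later.
            self.rows[key] = {"text": text, "metadata": {INTERNAL_ID_KEY: doc_id}}
            written.append(doc_id)
        return written  # unhashed IDs, usable with delete and get_by_ids

    def delete(self, ids: Sequence[str]) -> None:
        for doc_id in ids:
            self.rows.pop(hash_id(doc_id), None)

    def get_by_ids(self, ids: Sequence[str]) -> List[dict]:
        found = []
        for doc_id in ids:
            row = self.rows.get(hash_id(doc_id))
            if row is not None:
                found.append(
                    {"id": row["metadata"][INTERNAL_ID_KEY], "page_content": row["text"]}
                )
        return found


store = ToyStore(mutate_on_duplicate=True)
ids = store.add_texts(["foo1", "bar2"])  # generated uuid4 strings, unhashed
store.delete(ids)  # hashed exactly once inside delete, so the rows match
```

If add_texts returned hash_id(doc_id) instead, delete would look up hash_id(hash_id(doc_id)) and find nothing, which is the failure shown in the snippet in the description.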
@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Sep 19, 2025
@fileames
Member Author

Hi @cjbj, if you have any comments, I'd be happy to address them.

@sudarshan12s sudarshan12s left a comment

We can note the flag, the continue_on_error flag, and updating the table schema for easier SQL as topics for later discussion.

@YouNeedCryDear
Member

@fileames Could you resolve the conflicts? We now have an updated CI that enforces linting and formatting.
