Adding Doc.content_hash and integrating for unique DocDetails.doc_id
#1029
base: main
Conversation
Pull Request Overview
This PR adds content hashing capabilities to enable unique document identification for files that share metadata but have different content, such as main papers and their supplemental information. The implementation adds a content_hash field to the Doc class and integrates it into the DocDetails.doc_id generation to create composite keys based on both DOI and content hash.
- Adds `content_hash` field to `Doc` class with auto-population from file contents
- Updates `DocDetails.doc_id` to be a composite key of DOI and content hash via a new `compute_unique_doc_id` function
- Enhances test coverage with scenarios for duplicate detection across files with shared metadata but different content
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/test_paperqa.py | Expands duplicate detection tests to validate content hash differentiation |
| src/paperqa/utils.py | Adds `compute_unique_doc_id` function for generating composite document IDs |
| src/paperqa/types.py | Adds `content_hash` field to `Doc` and updates `DocDetails` validation/merging logic |
| src/paperqa/llms.py | Updates `Doc` instantiation to include content hash field |
| src/paperqa/docs.py | Integrates content hash computation during document addition |
| src/paperqa/clients/__init__.py | Updates metadata client to handle content hash field during doc upgrades |
Comments suppressed due to low confidence (1)
tests/test_paperqa.py:1068
- The test expects specific hardcoded dockey values, but these appear to be content hash-based values that could change if the document content or hashing logic changes. Consider using a more flexible assertion that validates the dockey format or length rather than exact values.
```python
assert doc_details.dockey in {"8ce7ddba9c9dcae6", "a353fa2478475c9c"}
```
```python
if doi:
    value_to_encode: str = doi.lower() + (content_hash or "")
else:
    value_to_encode = content_hash or str(uuid4())
```
Copilot AI (Jul 24, 2025)
Using str(uuid4()) as a fallback when both DOI and content_hash are None could lead to non-deterministic document IDs for the same document across different runs. Consider if this is the intended behavior or if a more deterministic fallback would be appropriate.
Suggested change:
```diff
-value_to_encode = content_hash or str(uuid4())
+value_to_encode = content_hash or "default"
```
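The trade-off behind this suggestion: a constant fallback is deterministic across runs, but every document lacking both a DOI and a content hash would then collide on a single ID. A quick illustration, using a hypothetical truncated-MD5 `encode_id` stand-in:

```python
import hashlib
from uuid import uuid4


def encode_id(value: str) -> str:
    # Hypothetical stand-in for the real encoding step
    return hashlib.md5(value.encode()).hexdigest()[:16]


# Constant fallback: deterministic, but all DOI-less, hash-less docs share one ID
assert encode_id("default") == encode_id("default")
# uuid4 fallback: unique per call, but non-deterministic across runs
assert encode_id(str(uuid4())) != encode_id(str(uuid4()))
```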
We recompute `doc_id` if there are changes in `DocDetails.__add__`, so I believe we're safe.
```python
content_hash = md5sum(path)
dockey_is_content_hash = False
if dockey is None:
    # md5 sum of file contents (not path!)
```
I moved this comment to `Doc.content_hash`'s description, FYI.
```python
# note we have some extra fields which may have come from reading the doc text,
# but aren't in the doc object, we add them here too.
extra_fields = {
```
The usage of "extra" here is an overloaded term:
- Extra in a Pydantic context (and we have `DocDetails.other`, to add to the confusion)
- Extra in a query context

Paired with vagueness such as "some extra fields", I decided to rename and reword this whole section.
```python
), "Expected citation to be inferred"
assert shorter_flag_day.content_hash
assert flag_day.content_hash != shorter_flag_day.content_hash
assert flag_day.doc_id != shorter_flag_day.doc_id
```
This whole PR basically enables this singular assertion to pass
```diff
-elif "doc_id" not in data or not data["doc_id"]:  # keep user defined doc_ids
-    data["doc_id"] = encode_id(uuid4())
+if not data.get("doc_id"):  # keep user defined doc_ids
+    data["doc_id"] = compute_unique_doc_id(doi, data.get("content_hash"))
```
We talked about this in person, but let's move this into an optional feature.
I piloted this today some more, by the way. Moving `compute_unique_doc_id` into something like `Settings.details_id_factory` requires us to pass a throwaway `Settings` reference to all `DocDetails` constructions, which is both awkward and error prone.
Perhaps a less awkward route is having a boolean environment variable `PQA_USE_OLD_ID_FACTORY` that gets used inside `compute_unique_doc_id`.
I need to read this in more detail and think about it
This PR:
- Enables `DocDetails.doc_id` deduplication across files sharing the beginning (e.g. a main text and a supplemental information whose metadata is inferred to be identical)
- Adds a `content_hash` field on `Doc` (which we already collect during `Docs.aadd`)
- Updates `DocDetails.doc_id` to be a composite key of DOI and content hash
- Adds a `test_duplicate` that does not work on current `main`
- Handles `DocDetails` metadata upgrades and `file_location`

Closes #1005