Adding Doc.content_hash and integrating for unique DocDetails.doc_id
#1029
base: main
Conversation
Pull Request Overview
This PR adds content hashing capabilities to enable unique document identification for files that share metadata but have different content, such as main papers and their supplemental information. The implementation adds a content_hash field to the Doc class and integrates it into the DocDetails.doc_id generation to create composite keys based on both DOI and content hash.
- Adds `content_hash` field to `Doc` class with auto-population from file contents
- Updates `DocDetails.doc_id` to be a composite key of DOI and content hash via a new `compute_unique_doc_id` function
- Enhances test coverage with scenarios for duplicate detection across files with shared metadata but different content
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/test_paperqa.py | Expands duplicate detection tests to validate content hash differentiation |
| src/paperqa/utils.py | Adds `compute_unique_doc_id` function for generating composite document IDs |
| src/paperqa/types.py | Adds `content_hash` field to `Doc` and updates `DocDetails` validation/merging logic |
| src/paperqa/llms.py | Updates `Doc` instantiation to include content hash field |
| src/paperqa/docs.py | Integrates content hash computation during document addition |
| src/paperqa/clients/__init__.py | Updates metadata client to handle content hash field during doc upgrades |
Comments suppressed due to low confidence (1)
tests/test_paperqa.py:1068
- The test expects specific hardcoded dockey values, but these appear to be content hash-based values that could change if the document content or hashing logic changes. Consider using a more flexible assertion that validates the dockey format or length rather than exact values.
```python
assert doc_details.dockey in {"8ce7ddba9c9dcae6", "a353fa2478475c9c"}
```
```python
if doi:
    value_to_encode: str = doi.lower() + (content_hash or "")
else:
    value_to_encode = content_hash or str(uuid4())
```
Copilot AI (Jul 24, 2025)
Using str(uuid4()) as a fallback when both DOI and content_hash are None could lead to non-deterministic document IDs for the same document across different runs. Consider if this is the intended behavior or if a more deterministic fallback would be appropriate.
Suggested change:
```diff
-value_to_encode = content_hash or str(uuid4())
+value_to_encode = content_hash or "default"
```
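The trade-off behind this suggestion: a constant fallback is deterministic across runs, but every document lacking both a DOI and a content hash would then collide on a single ID. A quick illustration, using a hypothetical truncated-MD5 `encode_id` stand-in:

```python
import hashlib
from uuid import uuid4


def encode_id(value: str) -> str:
    # Hypothetical stand-in for the real encoding step
    return hashlib.md5(value.encode()).hexdigest()[:16]


# Constant fallback: deterministic, but all DOI-less, hash-less docs share one ID
assert encode_id("default") == encode_id("default")
# uuid4 fallback: unique per call, but non-deterministic across runs
assert encode_id(str(uuid4())) != encode_id(str(uuid4()))
```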
We recompute `doc_id` if there are changes in `DocDetails.__add__`, so I believe we're safe.
```python
content_hash = md5sum(path)
dockey_is_content_hash = False
if dockey is None:
    # md5 sum of file contents (not path!)
```
I moved this comment to `Doc.content_hash`'s description, FYI.
```python
# note we have some extra fields which may have come from reading the doc text,
# but aren't in the doc object, we add them here too.
extra_fields = {
```
The usage of "extra" here is an overloaded term:
- Extra in a Pydantic context (and we have `DocDetails.other`, to add to the confusion)
- Extra in a query context

Paired with vagueness such as "some extra fields", I decided to rename and reword this whole section.
```python
), "Expected citation to be inferred"
assert shorter_flag_day.content_hash
assert flag_day.content_hash != shorter_flag_day.content_hash
assert flag_day.doc_id != shorter_flag_day.doc_id
```
This whole PR basically enables this singular assertion to pass
```diff
-elif "doc_id" not in data or not data["doc_id"]:  # keep user defined doc_ids
-    data["doc_id"] = encode_id(uuid4())
+if not data.get("doc_id"):  # keep user defined doc_ids
+    data["doc_id"] = compute_unique_doc_id(doi, data.get("content_hash"))
```
We talked about this in person, but let's move this into an optional feature.
I piloted this today some more, by the way. Moving `compute_unique_doc_id` into something like `Settings.details_id_factory` requires us to pass a throwaway `Settings` reference to all `DocDetails` constructions, which is both awkward and error prone.
Perhaps a less awkward route is having a boolean environment variable `PQA_USE_OLD_ID_FACTORY` that gets used inside `compute_unique_doc_id`.
I need to read this in more detail and think about it
This PR:
- Enables `DocDetails.doc_id` deduplication across files sharing the beginning (e.g. a main text and a supplemental information whose metadata is inferred to be identical)
- Adds a `content_hash` field on `Doc` (which we already collect during `Docs.aadd`)
- Updates `DocDetails.doc_id` to be a composite key of DOI and content hash
- Adds a `test_duplicate` that does not work on current `main`
- Handles `DocDetails` metadata upgrades and `file_location`

Closes #1005