
Fix case-sensitive search in PDF term matching #3

Open
rootdevss wants to merge 1 commit into jafrank88:main from rootdevss:fix-case-sensitive-search


Summary

This PR addresses a bug in the PDF search functionality where term matching was case-sensitive, potentially causing valid matches to be missed.

Problem

The current implementation in check_pdf.py uses a case-sensitive substring test:

if term in text:

This causes the search to fail when the case differs between the search term and the PDF content. For a tool designed to identify legal citations, this is particularly problematic because:

  • Legal citations may appear in various case formats (e.g., F.3d vs f.3d)
  • Case names can be written differently (e.g., Singh-Kaur vs singh-kaur vs SINGH-KAUR)
  • Federal Reporter citations may vary in capitalization
  • The tool could miss legitimate citations simply due to case differences
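The miss described above can be reproduced with a plain string. This is a minimal sketch: the sample text and the helper name `contains_case_sensitive` are hypothetical and are not taken from check_pdf.py.

```python
# Hypothetical sample text standing in for text extracted from a PDF.
text = "As held in SINGH-KAUR v. Example, 123 f.3d 456, the court ..."


def contains_case_sensitive(term: str, text: str) -> bool:
    """Same kind of test as `if term in text:`; letter case must match exactly."""
    return term in text


terms = ["Singh-Kaur", "F.3d"]

# Both terms are present in the text, but only in a different letter case,
# so the case-sensitive test reports them as missing.
missed = [t for t in terms if not contains_case_sensitive(t, text)]
print(missed)
```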

Solution

Implemented case-insensitive string matching by converting both the search term and extracted text to lowercase before comparison:

if term.lower() in text.lower():

The context extraction logic now uses the same case-insensitive approach to find the position of a match:

pos = text.lower().find(term.lower())
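One caveat with searching a lowercased copy: for a few Unicode characters (for example "İ"), `str.lower()` returns a longer string, so a position found in `text.lower()` may not line up with the original `text`. The sketch below shows an alternative, not what this PR implements: a case-insensitive regular expression search that reports positions in the original text. The function name `find_with_context` and the `width` parameter are assumptions made for illustration.

```python
import re
from typing import Optional, Tuple


def find_with_context(term: str, text: str, width: int = 40) -> Optional[Tuple[int, str]]:
    """Return (position, context) for the first case-insensitive match, or None.

    Hypothetical helper; it is not part of check_pdf.py.
    """
    # re.escape treats the term literally, so the "." in "F.3d" matches only a dot.
    match = re.search(re.escape(term), text, flags=re.IGNORECASE)
    if match is None:
        return None
    # Offsets come from the original text, so the slice keeps its original casing.
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    return match.start(), text[start:end]


result = find_with_context("F.3d", "cited in 123 f.3d 456 today", width=4)
print(result)
```

For plain ASCII text the two approaches find the same position; the regex variant only differs when lowercasing changes the length of the text.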

Testing Recommendation

To verify this fix works correctly, test with PDFs containing:

  • Mixed case legal citations
  • All uppercase case names
  • All lowercase reporter citations
  • Variations in spacing and punctuation combined with different cases (this change normalizes case only, so these may still be missed)
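The cases listed above can be checked without a PDF by running the comparison against plain strings. This is a sketch under assumptions: `term_found` is a hypothetical stand-in that mirrors the check from the Solution section, and the sample strings are made up.

```python
def term_found(term: str, text: str) -> bool:
    """Hypothetical stand-in for the lowercased substring check in the Solution."""
    return term.lower() in text.lower()


# (search term, extracted text, expected result)
cases = [
    ("Singh-Kaur", "as held in SINGH-KAUR, the court ...", True),  # all uppercase
    ("F.3d", "123 f.3d 456", True),                                # all lowercase
    ("f.3D", "123 F.3d 456", True),                                # mixed case
    ("F.3d", "123 F. 3d 456", False),      # spacing differs; lowercasing does not help
    ("Singh-Kaur", "Singh Kaur", False),   # punctuation differs; lowercasing does not help
]

# Each entry records whether the check behaved as expected for that case.
results = [term_found(term, text) == expected for term, text, expected in cases]
print(all(results))
```

The last two cases document the limit of this change: it normalizes case only, so spacing and punctuation variants still need separate handling.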

Impact

This change lets matches be found regardless of letter case, which improves the reliability and accuracy of the hallucination detection tool. Differences in spacing or punctuation are not affected.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature
  • Breaking change
  • Documentation update

The previous implementation used case-sensitive string matching,
which could miss valid matches when the case differed between
the search term and the PDF content. This is particularly
problematic for legal citations which may appear in various
case formats (e.g., 'F.3d' vs 'f.3d', 'Singh-Kaur' vs 'singh-kaur').

This commit makes the search case-insensitive by converting
both the search term and the extracted text to lowercase before
comparison, ensuring all valid matches are found regardless of case.