Migrate CitationMixin cache from CSV to SQLite #24
Conversation
Codecov Report

❌ Patch coverage is

Additional details and impacted files

    @@            Coverage Diff             @@
    ##           master      #24      +/-   ##
    ==========================================
    + Coverage   97.04%   97.85%   +0.80%
    ==========================================
      Files           8        8
      Lines        1254     1306      +52
    ==========================================
    + Hits         1217     1278      +61
    + Misses         37       28       -9

☔ View full report in Codecov by Sentry.
Pull request overview
This PR migrates CitationMixin’s on-disk citation cache from a CSV file to a SQLite database, aiming to improve lookup/update performance and better handle concurrent access, with a one-time automatic migration from legacy CSV caches.
Changes:
- Replaced CSV cache read/write logic in CitationMixin.get_citation()/CitationMixin.update_cache() with SQLite queries and inserts.
- Added _init_db() to initialize the SQLite schema and migrate an existing CSV cache on first use.
- Updated the citation mixin test to seed/verify cache state using SQLite, and updated the changelog entry.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| bib_lookup/citation_mixin.py | Implements SQLite-backed cache, adds CSV→SQLite migration, updates cache read/write paths. |
| test/test_citation_mixin.py | Updates the existing test to write/read the cache via SQLite instead of CSV. |
| CHANGELOG.rst | Documents the cache backend change and migration behavior. |
bib_lookup/citation_mixin.py
Outdated
    # Use placeholders for the IN clause
    placeholders = ",".join("?" * len(doi))

    if format is not None and format != self._bl.format:
        citation = ""  # no cache for format other than bibtex
        existing_dois = set()
        cached_citations = []
    else:
        citation = "\n".join(df_cc[df_cc["doi"].isin(doi)]["citation"].tolist())
        doi = [item for item in doi if item not in df_cc["doi"].tolist()]
        if print_result:
            query = f"SELECT citation FROM citations WHERE doi IN ({placeholders})"
            cursor.execute(query, doi)
The IN (...) query uses one bound parameter per DOI. SQLite has a fairly low default maximum for host parameters (commonly 999), so get_citation() can start failing for large DOI lists (regression vs the previous pandas-based filtering). Consider chunking doi into batches and querying/inserting per batch, or using a temporary table to join against.
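One way to stay under the host-parameter limit is the chunking the comment suggests. A minimal sketch, assuming a `citations(doi, citation)` table and a configurable chunk size safely below SQLite's default cap:

```python
import sqlite3


def fetch_citations(conn: sqlite3.Connection, dois: list, chunk_size: int = 900) -> list:
    """Fetch (doi, citation) rows in chunks to stay under SQLite's host-parameter limit."""
    rows = []
    for start in range(0, len(dois), chunk_size):
        chunk = dois[start : start + chunk_size]
        # One "?" per DOI in this chunk only, never more than chunk_size parameters
        placeholders = ",".join("?" * len(chunk))
        cursor = conn.execute(
            f"SELECT doi, citation FROM citations WHERE doi IN ({placeholders})",
            chunk,
        )
        rows.extend(cursor.fetchall())
    return rows
```

The same pattern applies to the insert path (`executemany` is already unaffected, since it binds per row rather than per list).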
test/test_citation_mixin.py
Outdated
    # Back up existing cache if any
    if bib_lookup.CitationMixin.citation_cache.exists():
        df_system_cache = pd.read_csv(bib_lookup.CitationMixin.citation_cache)
        df_system_cache = _read_cache_db(bib_lookup.CitationMixin.citation_cache)
    else:
        df_system_cache = None
    df_cache_before_update.to_csv(bib_lookup.CitationMixin.citation_cache, index=False)

    # Write initial cache state for test
    if bib_lookup.CitationMixin.citation_cache.exists():
        bib_lookup.CitationMixin.citation_cache.unlink()

    _write_cache_db(df_cache_before_update, bib_lookup.CitationMixin.citation_cache)
This test reads/writes CitationMixin.citation_cache under ~/.cache/bib-lookup, which can delete or overwrite a developer’s real cache when running tests locally. Prefer using tmp_path (or similar) and monkeypatching CitationMixin.citation_cache_db/citation_cache_csv (or CACHE_DIR) to an isolated temp directory for the duration of the test.
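The isolation the comment asks for can be sketched with a stand-in class and `unittest.mock.patch.object`; in the real test suite the same effect comes from pytest's `monkeypatch.setattr` plus `tmp_path`, and the actual attribute name on CitationMixin may differ:

```python
import tempfile
from pathlib import Path
from unittest import mock


class FakeMixin:
    # Stand-in for bib_lookup.CitationMixin; attribute name is an assumption.
    citation_cache = Path.home() / ".cache" / "bib-lookup" / "bib-lookup-cache.db"


def test_cache_is_isolated():
    with tempfile.TemporaryDirectory() as tmp:
        fake_cache = Path(tmp) / "bib-lookup-cache.db"
        with mock.patch.object(FakeMixin, "citation_cache", fake_cache):
            # Code under test only ever sees the throwaway temp path
            assert FakeMixin.citation_cache == fake_cache
    # Outside the patch, the developer's real cache path is untouched
    assert FakeMixin.citation_cache != fake_cache
```

The key property is that the patch is reverted even if the test fails, so a local run can never delete or overwrite the real cache under ~/.cache/bib-lookup.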
test/test_citation_mixin.py
Outdated
    # Write initial cache state for test
    if bib_lookup.CitationMixin.citation_cache.exists():
        bib_lookup.CitationMixin.citation_cache.unlink()

    _write_cache_db(df_cache_before_update, bib_lookup.CitationMixin.citation_cache)

    obj = SomeClass()
    obj.update_cache()
The PR introduces automatic CSV→SQLite migration in CitationMixin._init_db(), but the updated test only exercises the SQLite cache path. Add a focused test that creates a legacy bib-lookup-cache.csv, calls get_citation()/update_cache(), and asserts the DB is populated correctly and the CSV is removed (or retained on failure).
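The migration path such a test would cover is roughly: seed a legacy CSV, trigger the migration, then assert the rows landed in SQLite and the CSV was removed. A self-contained sketch of that logic — the helper below is a stand-in for illustration, not the library's actual _init_db():

```python
import csv
import sqlite3
from pathlib import Path


def migrate_csv_to_sqlite(csv_path: Path, db_path: Path) -> None:
    """Stand-in for the one-time CSV->SQLite migration performed on first use."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citations (doi TEXT PRIMARY KEY, citation TEXT)"
    )
    with csv_path.open(newline="") as f:
        rows = [(r["doi"], r["citation"]) for r in csv.DictReader(f)]
    conn.executemany("INSERT OR REPLACE INTO citations VALUES (?, ?)", rows)
    conn.commit()
    conn.close()
    # Remove the legacy cache only after the migration succeeded
    csv_path.unlink()
```

A focused test would call this against a temporary directory and assert both the populated DB and the deleted CSV; the failure branch (CSV retained) deserves its own case.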
bib_lookup/citation_mixin.py
Outdated
        # Delete CSV after successful migration
        self.citation_cache_csv.unlink()
        print(f"Migrated citation cache from CSV to SQLite: {self.citation_cache_db}")
    except Exception as e:
        print(f"Failed to migrate CSV cache: {e}")
_init_db() prints migration success/failure unconditionally. This will write to stdout for any library user calling get_citation()/update_cache(), which is a behavior change and makes it hard to use the library in apps that capture stdout. Prefer using warnings.warn(...) or the project’s verbosity/logging mechanism, and only emit a message when the user opted into it.
    # Fetch from cache
    conn = sqlite3.connect(self.citation_cache_db)
    cursor = conn.cursor()

    # Use placeholders for the IN clause
    placeholders = ",".join("?" * len(doi))
placeholders = ",".join("?" * len(doi)) will produce an empty IN () clause when self.doi is an empty sequence, causing a SQL syntax error. Consider early-returning when doi is empty (e.g., return ""/None depending on print_result) before building the SQL query.
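A guard along the lines the comment suggests, sketched against an assumed helper (the real fix would early-return from get_citation() itself):

```python
def build_in_query(dois):
    """Return (sql, params) for the cache lookup, or None when there is nothing to query."""
    if not dois:
        # "IN ()" is a SQL syntax error in SQLite, so bail out before building the query
        return None
    placeholders = ",".join("?" * len(dois))
    return f"SELECT citation FROM citations WHERE doi IN ({placeholders})", list(dois)
```

The caller then maps the None case to the appropriate empty result ("" or None depending on print_result).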
- Changed backend storage from CSV to SQLite for improved performance and concurrency support.
- Implemented automatic backward-compatible migration: existing CSV caches will be converted to SQLite on first use.
- Updated 'get_citation' and 'update_cache' to use SQL queries for efficient data retrieval.
- Updated tests to verify SQLite cache functionality and migration logic.
Force-pushed 9b43477 to 716cc4a
- Replace print with warnings.warn for migration status
- Handle empty DOI sequence to avoid invalid SQL syntax
- Chunk DOI list in SQL queries to avoid SQLite parameter limits
- Update tests to check for warnings instead of stdout output

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Updated get_citation and update_cache to use SQL queries for efficient data retrieval.