Migrate CitationMixin cache from CSV to SQLite #24
Conversation
Codecov Report

❌ Patch coverage is

Additional details and impacted files

    @@            Coverage Diff             @@
    ##           master      #24      +/-   ##
    ==========================================
    + Coverage   97.04%   97.85%   +0.80%
    ==========================================
      Files           8        8
      Lines        1254     1306      +52
    ==========================================
    + Hits         1217     1278      +61
    + Misses         37       28       -9

☔ View full report in Codecov by Sentry.
Pull request overview
This PR migrates CitationMixin’s on-disk citation cache from a CSV file to a SQLite database, aiming to improve lookup/update performance and better handle concurrent access, with a one-time automatic migration from legacy CSV caches.
Changes:
- Replaced CSV cache read/write logic in CitationMixin.get_citation()/CitationMixin.update_cache() with SQLite queries and inserts.
- Added _init_db() to initialize the SQLite schema and migrate an existing CSV cache on first use.
- Updated the citation mixin test to seed/verify cache state using SQLite, and updated the changelog entry.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| bib_lookup/citation_mixin.py | Implements SQLite-backed cache, adds CSV→SQLite migration, updates cache read/write paths. |
| test/test_citation_mixin.py | Updates the existing test to write/read the cache via SQLite instead of CSV. |
| CHANGELOG.rst | Documents the cache backend change and migration behavior. |
bib_lookup/citation_mixin.py
Outdated
    # Use placeholders for the IN clause
    placeholders = ",".join("?" * len(doi))

    if format is not None and format != self._bl.format:
        citation = ""  # no cache for format other than bibtex
        existing_dois = set()
        cached_citations = []
    else:
        citation = "\n".join(df_cc[df_cc["doi"].isin(doi)]["citation"].tolist())
        doi = [item for item in doi if item not in df_cc["doi"].tolist()]
        if print_result:
            query = f"SELECT citation FROM citations WHERE doi IN ({placeholders})"
            cursor.execute(query, doi)
The IN (...) query uses one bound parameter per DOI. SQLite has a fairly low default maximum for host parameters (commonly 999), so get_citation() can start failing for large DOI lists (regression vs the previous pandas-based filtering). Consider chunking doi into batches and querying/inserting per batch, or using a temporary table to join against.
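One way to stay under the host-parameter limit is the chunking the comment suggests. A minimal sketch, assuming a `citations(doi, citation)` table and a configurable chunk size safely below SQLite's default cap:

```python
import sqlite3


def fetch_citations(conn: sqlite3.Connection, dois: list, chunk_size: int = 900) -> list:
    """Fetch (doi, citation) rows in chunks to stay under SQLite's host-parameter limit."""
    rows = []
    for start in range(0, len(dois), chunk_size):
        chunk = dois[start : start + chunk_size]
        # One "?" per DOI in this chunk only, never more than chunk_size parameters
        placeholders = ",".join("?" * len(chunk))
        cursor = conn.execute(
            f"SELECT doi, citation FROM citations WHERE doi IN ({placeholders})",
            chunk,
        )
        rows.extend(cursor.fetchall())
    return rows
```

The same pattern applies to the insert path (`executemany` is already unaffected, since it binds per row rather than per list).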
test/test_citation_mixin.py
Outdated
    # Back up existing cache if any
    if bib_lookup.CitationMixin.citation_cache.exists():
        df_system_cache = pd.read_csv(bib_lookup.CitationMixin.citation_cache)
        df_system_cache = _read_cache_db(bib_lookup.CitationMixin.citation_cache)
    else:
        df_system_cache = None
    df_cache_before_update.to_csv(bib_lookup.CitationMixin.citation_cache, index=False)

    # Write initial cache state for test
    if bib_lookup.CitationMixin.citation_cache.exists():
        bib_lookup.CitationMixin.citation_cache.unlink()

    _write_cache_db(df_cache_before_update, bib_lookup.CitationMixin.citation_cache)
This test reads/writes CitationMixin.citation_cache under ~/.cache/bib-lookup, which can delete or overwrite a developer’s real cache when running tests locally. Prefer using tmp_path (or similar) and monkeypatching CitationMixin.citation_cache_db/citation_cache_csv (or CACHE_DIR) to an isolated temp directory for the duration of the test.
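The isolation the comment asks for can be sketched with a stand-in class and `unittest.mock.patch.object`; in the real test suite the same effect comes from pytest's `monkeypatch.setattr` plus `tmp_path`, and the actual attribute name on CitationMixin may differ:

```python
import tempfile
from pathlib import Path
from unittest import mock


class FakeMixin:
    # Stand-in for bib_lookup.CitationMixin; attribute name is an assumption.
    citation_cache = Path.home() / ".cache" / "bib-lookup" / "bib-lookup-cache.db"


def test_cache_is_isolated():
    with tempfile.TemporaryDirectory() as tmp:
        fake_cache = Path(tmp) / "bib-lookup-cache.db"
        with mock.patch.object(FakeMixin, "citation_cache", fake_cache):
            # Code under test only ever sees the throwaway temp path
            assert FakeMixin.citation_cache == fake_cache
    # Outside the patch, the developer's real cache path is untouched
    assert FakeMixin.citation_cache != fake_cache
```

The key property is that the patch is reverted even if the test fails, so a local run can never delete or overwrite the real cache under ~/.cache/bib-lookup.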
test/test_citation_mixin.py
Outdated
    # Write initial cache state for test
    if bib_lookup.CitationMixin.citation_cache.exists():
        bib_lookup.CitationMixin.citation_cache.unlink()

    _write_cache_db(df_cache_before_update, bib_lookup.CitationMixin.citation_cache)

    obj = SomeClass()
    obj.update_cache()
The PR introduces automatic CSV→SQLite migration in CitationMixin._init_db(), but the updated test only exercises the SQLite cache path. Add a focused test that creates a legacy bib-lookup-cache.csv, calls get_citation()/update_cache(), and asserts the DB is populated correctly and the CSV is removed (or retained on failure).
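The migration path such a test would cover is roughly: seed a legacy CSV, trigger the migration, then assert the rows landed in SQLite and the CSV was removed. A self-contained sketch of that logic — the helper below is a stand-in for illustration, not the library's actual _init_db():

```python
import csv
import sqlite3
from pathlib import Path


def migrate_csv_to_sqlite(csv_path: Path, db_path: Path) -> None:
    """Stand-in for the one-time CSV->SQLite migration performed on first use."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citations (doi TEXT PRIMARY KEY, citation TEXT)"
    )
    with csv_path.open(newline="") as f:
        rows = [(r["doi"], r["citation"]) for r in csv.DictReader(f)]
    conn.executemany("INSERT OR REPLACE INTO citations VALUES (?, ?)", rows)
    conn.commit()
    conn.close()
    # Remove the legacy cache only after the migration succeeded
    csv_path.unlink()
```

A focused test would call this against a temporary directory and assert both the populated DB and the deleted CSV; the failure branch (CSV retained) deserves its own case.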
bib_lookup/citation_mixin.py
Outdated
        # Delete CSV after successful migration
        self.citation_cache_csv.unlink()
        print(f"Migrated citation cache from CSV to SQLite: {self.citation_cache_db}")
    except Exception as e:
        print(f"Failed to migrate CSV cache: {e}")
_init_db() prints migration success/failure unconditionally. This will write to stdout for any library user calling get_citation()/update_cache(), which is a behavior change and makes it hard to use the library in apps that capture stdout. Prefer using warnings.warn(...) or the project’s verbosity/logging mechanism, and only emit a message when the user opted into it.
    # Fetch from cache
    conn = sqlite3.connect(self.citation_cache_db)
    cursor = conn.cursor()

    # Use placeholders for the IN clause
    placeholders = ",".join("?" * len(doi))
placeholders = ",".join("?" * len(doi)) will produce an empty IN () clause when self.doi is an empty sequence, causing a SQL syntax error. Consider early-returning when doi is empty (e.g., return ""/None depending on print_result) before building the SQL query.
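A guard along the lines the comment suggests, sketched against an assumed helper (the real fix would early-return from get_citation() itself):

```python
def build_in_query(dois):
    """Return (sql, params) for the cache lookup, or None when there is nothing to query."""
    if not dois:
        # "IN ()" is a SQL syntax error in SQLite, so bail out before building the query
        return None
    placeholders = ",".join("?" * len(dois))
    return f"SELECT citation FROM citations WHERE doi IN ({placeholders})", list(dois)
```

The caller then maps the None case to the appropriate empty result ("" or None depending on print_result).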
- Changed backend storage from CSV to SQLite for improved performance and concurrency support.
- Implemented automatic backward-compatible migration: existing CSV caches will be converted to SQLite on first use.
- Updated 'get_citation' and 'update_cache' to use SQL queries for efficient data retrieval.
- Updated tests to verify SQLite cache functionality and migration logic.
Force-pushed 9b43477 to 716cc4a
- Replace print with warnings.warn for migration status
- Handle empty DOI sequence to avoid invalid SQL syntax
- Chunk DOI list in SQL queries to avoid SQLite parameter limits
- Update tests to check for warnings instead of stdout output

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Updated get_citation and update_cache to use SQL queries for efficient data retrieval.